Rate limits in Azure OpenAI are defined by TPM (tokens per minute) and RPM (requests per minute). The TPM quota sets the maximum number of tokens a model deployment can process per minute. RPM is no longer set separately: it is derived from TPM at a fixed ratio of 6 RPM per 1,000 TPM of quota, so every 1,000 TPM you are assigned allows at most 6 requests per minute.
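As a quick illustration of the 6 RPM per 1,000 TPM ratio, the arithmetic can be sketched in Python. The function name `rpm_from_tpm` is made up for this example, and rounding down for TPM values that are not a multiple of 1,000 is an assumption rather than documented behavior.

```python
RPM_PER_1K_TPM = 6  # ratio stated above


def rpm_from_tpm(tpm: int) -> int:
    """Return the RPM limit implied by a TPM quota.

    Rounds down for quotas that are not a multiple of 1,000 (assumption).
    """
    return tpm * RPM_PER_1K_TPM // 1000


print(rpm_from_tpm(30_000))   # 180
print(rpm_from_tpm(120_000))  # 720
```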
If your requests exceed the allowed RPM, the service responds with an HTTP 429 (Too Many Requests) throttling error. To handle this, retry failed requests with exponential backoff, which spaces out repeated attempts and keeps your request rate within the limit.
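A retry loop with exponential backoff might look roughly like the sketch below. It is not tied to any particular SDK: `RateLimitError`, `call_with_backoff`, and the `send` callable are hypothetical names for this example, and the default delays are arbitrary. Map `RateLimitError` to whatever exception your client raises on a 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        # Seconds the server suggested waiting, if it gave a hint (assumption).
        self.retry_after = retry_after


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() and retry on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out clients that were throttled together.
            delay = random.uniform(0, delay)
            # Wait at least as long as the server asked, when it gave a hint.
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)
            sleep(delay)
```

To use it, wrap the function that sends the request, for example `call_with_backoff(lambda: make_request(prompt))`, where `make_request` is your own code. The `sleep` parameter exists only so the waiting can be replaced in tests.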