By Johnny Chadda

Resilient Azure OpenAI using Azure API Management

Azure OpenAI lets you create multiple deployments of the same models in different regions, which is useful for failover and load balancing and makes it easier to scale. To load balance OpenAI calls across multiple OpenAI instances in Azure, we can use Azure API Management (APIM). APIM lets us expose a single API that routes to multiple OpenAI instances, instead of managing that logic in client code.

In this blog post, we will explore how to set up APIM with an OpenAI-compatible API and connect it to multiple OpenAI deployments. We will also cover how to set up failover and load balancing to improve resiliency in case of rate limits or other failures.

Configure OpenAI Deployments

Setting up OpenAI in Azure is out of scope for this post, but it is easy to get started. It begins with requesting access to the service using the Azure Portal.

For this post, we will be using two OpenAI deployments. For each deployment, note the Endpoint URL and API key. The following fictitious demo data will be used throughout this post:

- se deployment: endpoint https://opper-apim-demo-openai-se.openai.azure.com, API key your-openai-se-api-key-here
- fr deployment: endpoint https://opper-apim-demo-openai-fr.openai.azure.com, API key your-openai-fr-api-key-here

Deploying Models

Once you have set up OpenAI in Azure, you can deploy your models by creating a new deployment. To make the APIM configuration as simple as possible, it is recommended that you use the same name for each model in every region. For instance, if you are deploying GPT-4 in two regions, give them the same name, e.g., gpt-4.

Set Up Azure API Management Service

With the OpenAI deployments in place, we can continue by setting up APIM.

Create an APIM Instance

First, navigate to the API Management service in the Azure portal and click Create. Select an appropriate subscription and resource group. Choose a region and, for the Pricing tier, choose Developer if you do not intend to use this in production. Otherwise, start with the Basic plan.

Finish the configuration by clicking Review + create, then Create, and wait for the service to deploy.

Navigate to the Overview tab of the deployed APIM service and note the Gateway URL. In our example, we will use:

https://opper-apim-demo.azure-api.net

Create Frontend API

Once your APIM instance is up and running, the first thing we want to do is define how our API should behave for the clients. The easiest way is to import the official Azure OpenAI OpenAPI specification into APIM.

Download the 2024-02-01 API specification and open it in your favorite editor. Update the url and default values in the servers section to look like this:

"servers": [
  {
    "url": "https://opper-apim-demo-openai-se.openai.azure.com/openai",
    "variables": {
      "endpoint": {
        "default": "opper-apim-demo-openai-se.openai.azure.com"
      }
    }
  }
]
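
If you prefer scripting the edit rather than doing it by hand, here is a minimal Python sketch that patches the servers section, assuming you saved the downloaded spec as inference.json:

import json

# Load the downloaded 2024-02-01 spec (assumed to be saved as inference.json).
with open("inference.json") as f:
    spec = json.load(f)

# Point the servers section at the SE deployment, as shown above.
spec["servers"] = [
    {
        "url": "https://opper-apim-demo-openai-se.openai.azure.com/openai",
        "variables": {
            "endpoint": {"default": "opper-apim-demo-openai-se.openai.azure.com"}
        },
    }
]

with open("inference.json", "w") as f:
    json.dump(spec, f, indent=2)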

To apply it to the APIM service, navigate to the APIs tab, find Create from definition, and choose OpenAPI. Upload the file and, where it says API URL suffix, enter openai, and finally click Create.

At this point, we have an APIM service with an OpenAI-compatible API, except for the authentication. You can test it out by clicking the APIs tab, then choosing Azure OpenAI Service API. Click the "Creates a completion for the chat message" endpoint and then the Test tab. Enter the template parameters as follows:

- deployment-id: gpt-4
- api-version: 2024-02-01

Click "Add header" and add a header carrying the API key of the deployment APIM forwards to:

- Name: api-key
- Value: your-openai-se-api-key-here

In the Request body, enter a sample query, for instance:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What comes after Thursday?"}
  ],
  "max_tokens": 20
}

This should confirm that you are able to connect from APIM to your OpenAI deployment and perform a basic request.
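
You can also sanity-check that the deployment itself responds from code. Here is a minimal Python sketch using the requests library, aimed directly at the SE deployment (the endpoint and key are the fictitious demo values; substitute your own):

import requests

# Fictitious demo endpoint and placeholder key from earlier in this post.
ENDPOINT = "https://opper-apim-demo-openai-se.openai.azure.com"
API_KEY = "your-openai-se-api-key-here"

response = requests.post(
    f"{ENDPOINT}/openai/deployments/gpt-4/chat/completions",
    params={"api-version": "2024-02-01"},
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What comes after Thursday?"},
        ],
        "max_tokens": 20,
    },
)
print(response.status_code)
print(response.json())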

Configure Backends

To configure load balancing and failover, we need to define the backends that will be used.

Start by configuring the API keys for the OpenAI deployments by navigating to Named values and clicking Add. Configure the following parameters (these names are referenced later from the backend headers), for example:

- Name: openai-se-api-key
- Value: your-openai-se-api-key-here
- Type: Secret

Click Save and enter another one for the other deployment:

- Name: openai-fr-api-key
- Value: your-openai-fr-api-key-here
- Type: Secret

Now click the Backends tab in the APIM portal, and then click Add. Configure the following parameters:

- Name: oai-se
- Type: Custom URL
- Runtime URL: https://opper-apim-demo-openai-se.openai.azure.com/openai

Expand the Advanced section and click the "Headers" tab. Enter:

- Name: api-key
- Value: {{openai-se-api-key}}

Finally, click Create, then create another backend named oai-fr for the fr deployment, using the other runtime URL and the {{openai-fr-api-key}} named value.

Configure Authentication

To call the APIM API in an OpenAI-compatible way, we need to define a new API key that APIM will use to authenticate the clients. Start by clicking the "Named values" tab and then click Add. Enter these values:

- Name: APIM-API-KEY
- Value: your-apim-api-key-here
- Type: Secret
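
Any sufficiently random string works as the key value; for example, you could generate one with Python's secrets module:

import secrets

# Generate a URL-safe random API key for APIM clients.
print(secrets.token_urlsafe(32))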

Configure Policy for Routing

Finally, we can create a policy that authenticates the clients, load balances between deployments, and handles failovers.

Start by clicking the APIs tab in APIM, and then selecting the "Azure OpenAI Service API". In the "Design" tab, click on the </> icon under "Inbound processing". You should be presented with an XML editor where you can enter the following, and then click "Save":

<policies>
  <inbound>
    <base />
    <set-variable name="apim_api_key" value="{{APIM-API-KEY}}" />
    <choose>
      <!-- Check if the api-key header is present and valid -->
      <when condition="@(context.Request.Headers.GetValueOrDefault("api-key") == context.Variables.GetValueOrDefault<string>("apim_api_key"))">
        <!-- Load balancing logic -->
        <choose>
          <when condition="@(new Random().Next(100) < 75)">
            <set-backend-service backend-id="oai-se" />
          </when>
          <otherwise>
            <set-backend-service backend-id="oai-fr" />
          </otherwise>
        </choose>
      </when>
      <otherwise>
        <!-- Return 401 Unauthorized if api-key header is missing or invalid -->
        <return-response>
          <set-status code="401" reason="Unauthorized" />
          <set-body><![CDATA[{"message": "Unauthorized. Invalid API key."}]]></set-body>
        </return-response>
      </otherwise>
    </choose>
  </inbound>
  <backend>
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" count="5" interval="1" delta="1" max-interval="8" first-fast-retry="false">
      <choose>
        <when condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500))">
          <choose>
            <when condition="@(context.Request.Headers.GetValueOrDefault("x-backend-retry") != "oai-se")">
              <set-backend-service backend-id="oai-se" />
              <set-header name="x-backend-retry" exists-action="override">
                <value>oai-se</value>
              </set-header>
            </when>
            <otherwise>
              <set-backend-service backend-id="oai-fr" />
              <set-header name="x-backend-retry" exists-action="override">
                <value>oai-fr</value>
              </set-header>
            </otherwise>
          </choose>
        </when>
      </choose>
      <forward-request buffer-request-body="true" />
    </retry>
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

Let's break this down a bit.

The inbound block manages incoming connections to the API. First, we check that the client passed the correct API key, and respond with an HTTP 401 error if it is missing or incorrect.

The <when condition="@(new Random().Next(100) < 75)"> block within inbound is a simple way of load balancing between the backends. In this example, we send 75% of the traffic to the oai-se backend and the rest to oai-fr. This can be adjusted depending on your specific token limits in each region.
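
For intuition, here is the same split expressed as a short Python sketch (the backend names match the policy; the loop just demonstrates the distribution):

import random

def pick_backend() -> str:
    # Equivalent of the policy's new Random().Next(100) < 75 check:
    # about 75% of requests go to oai-se, the rest to oai-fr.
    return "oai-se" if random.randrange(100) < 75 else "oai-fr"

counts = {"oai-se": 0, "oai-fr": 0}
for _ in range(10_000):
    counts[pick_backend()] += 1
print(counts)  # roughly {'oai-se': 7500, 'oai-fr': 2500}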

The retry block <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" count="5" interval="1" delta="1" max-interval="8" first-fast-retry="false"> comes from the official Azure APIM with OpenAI documentation and handles retries with exponential backoff.

When a request to one of the deployments fails with a status code of 429 or any 5xx server error, the policy triggers a retry. The x-backend-retry header tracks which backend has been attempted: on the first retry the header is not yet set, so the request goes to oai-se and the header is set to oai-se; once the header records oai-se, subsequent retries are directed to oai-fr. This ensures that retries eventually land on a different backend, providing a simple failover mechanism. Retries occur up to five times with exponential backoff, so transient failures in one region are absorbed without surfacing to the client.
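
For a concrete picture of the failover flow, here is a rough Python sketch of the equivalent loop. Note that send is a hypothetical callable (send(backend) returns a status code and body), and the backoff only approximates APIM's interval/delta/max-interval schedule:

import time

def call_with_failover(send, initial_backend="oai-se", max_retries=5):
    # Approximates the policy above: retry on 429/5xx, track the backend we
    # deliberately retried (the x-backend-retry header's role), and back off
    # exponentially with the delay capped at 8 seconds.
    tried = None
    status, body = send(initial_backend)
    delay = 1
    for _ in range(max_retries):
        if status != 429 and status < 500:
            return status, body
        time.sleep(delay)
        delay = min(delay * 2, 8)
        # Mirror the policy's choose block: retry oai-se until the header
        # records it, then fall over to oai-fr.
        backend = "oai-se" if tried != "oai-se" else "oai-fr"
        tried = backend
        status, body = send(backend)
    return status, body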

The next step is to navigate to the "Settings" tab, uncheck the "Subscription required" checkbox (authentication is handled by the api-key check in our policy instead), and click "Save".

With this configuration, the APIM API behaves just like the normal Azure OpenAI API. The only difference is pointing the base URL at the APIM gateway instead of your OpenAI resource.

You can test this by running a curl command:

curl -X POST "https://opper-apim-demo.azure-api.net/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01" \
-H "Content-Type: application/json" \
-H "api-key: your-apim-api-key-here" \
-d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What comes after Thursday?"}
  ],
  "max_tokens": 50
}' | jq
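
Client code needs no changes beyond the base URL and key. As a sketch, assuming the openai Python SDK (v1.x), you can point its Azure client at the APIM gateway:

from openai import AzureOpenAI

# Point the Azure client at the APIM gateway instead of the OpenAI resource;
# the api_key is the APIM-API-KEY named value configured earlier.
client = AzureOpenAI(
    azure_endpoint="https://opper-apim-demo.azure-api.net",
    api_key="your-apim-api-key-here",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4",  # the deployment name shared across regions
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What comes after Thursday?"},
    ],
    max_tokens=50,
)
print(response.choices[0].message.content)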

Conclusion

Having the APIM gateway with OpenAI-compatible authentication in front of your model deployments removes the need to manage failover from your client code. It lets you call the API in an Azure OpenAI-compatible way and have the server side deal with the failover and retry logic. The policy can grow more sophisticated as your requirements increase, without any changes to client code.

Thanks for reading! Please join our Discord for any questions.

Further Reading and References