Handling Transient Errors in Durable Functions

If anything is certain about non-trivial code running in production, it's errors. Sooner or later they will happen, and they don't even have to result from bugs. A resource might be down, there can be a network problem, a hardware failure, or any number of other short-lived issues. This is why resiliency is a key aspect of every application. In this post, I want to focus on handling transient errors in a very specific context - Durable Functions.

A Word of Caution - Don’t Use Polly

The fact that Durable Functions provide their own mechanism for automatic retry on failure should be a strong hint that this is how it should be done. But we are creatures of habit, and we often bring habits from other types of applications to serverless ones. One such habit is using Polly for retries. In the case of Durable Functions (or Azure Functions in general), this can have serious consequences. The time spent at awaits is counted as execution time, and that includes the delays between retries - you pay for the waiting and eat into the function timeout. Yes, the rules are different in orchestrator functions, but you can't perform I/O operations there, and those are exactly the operations you will most likely want to retry. So the built-in automatic retry is the way to go, although some scenarios can be tricky to achieve with it.
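To make the pitfall concrete, below is a sketch of the anti-pattern (assuming Polly v7, and reusing the injected IHttpClientFactory and the hypothetical endpoint from the examples in this post). Every second spent waiting inside WaitAndRetryAsync is time the activity is running.

// Anti-pattern: Polly retries inside an activity function. The delays between
// attempts happen while the function is executing, so they count as execution
// time - you pay for the waiting and move closer to the function timeout.
[FunctionName(nameof(MessageOfTheDayActivity))]
public async Task<string> MessageOfTheDayActivity([ActivityTrigger] IDurableActivityContext activityContext)
{
    HttpClient httpClient = _httpClientFactory.CreateClient();

    return await Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(5))
        .ExecuteAsync(() => httpClient.GetStringAsync("https://host/message-of-the-day"));
}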

Built-in Automatic Retry

To discuss the built-in automatic retry in Durable Functions, I'm going to use the activity below as an example.

[FunctionName(nameof(MessageOfTheDayActivity))]
public async Task<string> MessageOfTheDayActivity([ActivityTrigger] IDurableActivityContext activityContext)
{
    HttpClient httpClient = _httpClientFactory.CreateClient();

    return await httpClient.GetStringAsync("https://host/message-of-the-day");
}

Yes, I know that performing an HTTP request inside an activity is probably not the best choice - in simple cases, it is better to use the Durable Functions HTTP features. That said, HttpClient allows me to easily show various scenarios.

In the above code, I'm using GetStringAsync, which throws an exception whenever an unsuccessful response is received. If this activity is called via CallActivityWithRetryAsync, Durable Functions will retry it according to a policy defined with RetryOptions. At a minimum, you must provide the first retry interval and the maximum number of attempts.

[FunctionName(nameof(MessageOfTheDayOrchestration))]
public async Task<string> MessageOfTheDayOrchestration([OrchestrationTrigger] IDurableOrchestrationContext orchestrationContext)
{
    string messageOfTheDay = await orchestrationContext.CallActivityWithRetryAsync<string>(
        nameof(MessageOfTheDayActivity),
        new RetryOptions(TimeSpan.FromSeconds(5), 3),
        null);

    return messageOfTheDay;
}
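Those two constructor arguments are just the minimum. RetryOptions exposes a few more knobs worth knowing about - a quick sketch below (the values are arbitrary):

RetryOptions retryOptions = new RetryOptions(TimeSpan.FromSeconds(5), 5)
{
    // Multiply the wait by this factor after every failed attempt (exponential backoff).
    BackoffCoefficient = 2,
    // Never wait longer than this between attempts.
    MaxRetryInterval = TimeSpan.FromMinutes(1),
    // Stop retrying entirely once this much time has passed.
    RetryTimeout = TimeSpan.FromMinutes(5)
};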

Coming back to the example - it will retry every request that has resulted in a response with an unsuccessful status code. Is that truly the correct behavior? I would say no. There are unsuccessful status codes that shouldn't be retried. A good example is 403 Forbidden - the service is telling us that we are not allowed to make this request, and retrying won't change that. Can the policy be adjusted to not retry in this case? Yes, but first we need to modify the activity so it throws an exception that propagates the information we need.

[FunctionName(nameof(MessageOfTheDayActivity))]
public async Task<string> MessageOfTheDayActivity([ActivityTrigger] IDurableActivityContext activityContext)
{
    HttpClient httpClient = _httpClientFactory.CreateClient();

    using HttpResponseMessage response = await httpClient.GetAsync("https://host/message-of-the-day");

    if (!response.IsSuccessStatusCode)
    {
        throw new MessageOfTheDayApiException(response.StatusCode);
    }

    return await response.Content.ReadAsStringAsync();
}
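I haven't shown MessageOfTheDayApiException - its exact shape is up to you, but a minimal version could look like the sketch below. Keep in mind that the exception travels from the activity back to the orchestrator, so it has to survive serialization for the orchestrator to see the original type.

// A minimal sketch of the custom exception used in this post (the shape is an
// assumption; only the StatusCode property is required by the snippets below).
[Serializable]
public class MessageOfTheDayApiException : Exception
{
    public MessageOfTheDayApiException(HttpStatusCode statusCode)
        : base($"Message of the day API responded with {statusCode}.")
    {
        StatusCode = statusCode;
    }

    public HttpStatusCode StatusCode { get; }
}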

RetryOptions has a Handle property, which takes a callback that determines whether an activity should be retried. The important thing to remember here is that the activity exception arrives wrapped in a FunctionFailedException - to get the original one, you need to look at the inner exception.

[FunctionName(nameof(MessageOfTheDayOrchestration))]
public async Task<string> MessageOfTheDayOrchestration([OrchestrationTrigger] IDurableOrchestrationContext orchestrationContext)
{
    RetryOptions retryOptions = new RetryOptions(TimeSpan.FromSeconds(5), 3)
    {
        Handle = (Exception ex) =>
        {
            // The original activity exception is wrapped - look at InnerException.
            MessageOfTheDayApiException messageOfTheDayApiException = ex.InnerException as MessageOfTheDayApiException;

            if (messageOfTheDayApiException != null && messageOfTheDayApiException.StatusCode == HttpStatusCode.Forbidden)
            {
                // 403 Forbidden will not succeed on retry - give up immediately.
                return false;
            }

            // Everything else is considered transient and worth retrying.
            return true;
        }
    };

    string messageOfTheDay = await orchestrationContext.CallActivityWithRetryAsync<string>(
        nameof(MessageOfTheDayActivity),
        retryOptions,
        null);

    return messageOfTheDay;
}

Avoiding unnecessary retries is just one aspect of a mature retry policy. There are more sophisticated scenarios which can't be handled by the built-in automatic retry.

Handling More Sophisticated Cases

Another interesting HTTP response status code is 429 Too Many Requests. Receiving it means that you are being rate limited. You should wait and retry, but the waiting is a little trickier in this case. The information on how long to wait usually comes in the form of the Retry-After header, so the waiting period can be different every time and is known only after the request has been made. How can we deal with that? First, we need to make sure that our exception propagates not only the status code but also the extracted Retry-After value (if one has been provided).

[FunctionName(nameof(MessageOfTheDayActivity))]
public async Task<string> MessageOfTheDayActivity([ActivityTrigger] IDurableActivityContext activityContext)
{
    HttpClient httpClient = _httpClientFactory.CreateClient();

    using HttpResponseMessage response = await httpClient.GetAsync("https://host/message-of-the-day");

    if (!response.IsSuccessStatusCode)
    {
        throw new MessageOfTheDayApiException(response.StatusCode, GetRetryAfter(response));
    }

    return await response.Content.ReadAsStringAsync();
}
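GetRetryAfter isn't shown above - one possible implementation is sketched below. The typed Retry-After header can carry either a delay in seconds (Delta) or an absolute timestamp (Date), and both cases should be handled. The exception also needs to grow a nullable RetryAfter property to carry the value.

private static TimeSpan? GetRetryAfter(HttpResponseMessage response)
{
    RetryConditionHeaderValue retryAfter = response.Headers.RetryAfter;

    if (retryAfter == null)
    {
        return null;
    }

    // Retry-After: 120 (a delay in seconds)
    if (retryAfter.Delta.HasValue)
    {
        return retryAfter.Delta.Value;
    }

    // Retry-After: Wed, 21 Oct 2015 07:28:00 GMT (an absolute date)
    if (retryAfter.Date.HasValue)
    {
        TimeSpan delay = retryAfter.Date.Value - DateTimeOffset.UtcNow;

        return delay > TimeSpan.Zero ? delay : TimeSpan.Zero;
    }

    return null;
}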

Now, having the necessary information, we can try to do something useful with it. We can't feed it into RetryOptions as a retry interval (the intervals there are fixed up front), so the built-in capabilities will not help us. What we can do is exclude this scenario from the automatic retry and use two Durable Functions concepts to handle it ourselves: durable timers and the orchestrator's capability to restart itself. Once we "catch" a 429, we can create a timer to wait the needed period and then use IDurableOrchestrationContext.ContinueAsNew to restart the orchestration.

[FunctionName(nameof(MessageOfTheDayOrchestration))]
public async Task<string> MessageOfTheDayOrchestration([OrchestrationTrigger] IDurableOrchestrationContext orchestrationContext)
{
    string messageOfTheDay = String.Empty;

    RetryOptions retryOptions = new RetryOptions(TimeSpan.FromSeconds(5), 3)
    {
        Handle = (Exception ex) =>
        {
            MessageOfTheDayApiException messageOfTheDayApiException = ex.InnerException as MessageOfTheDayApiException;

            // 403 will never succeed on retry; 429 is handled manually below.
            if (messageOfTheDayApiException != null && (
                messageOfTheDayApiException.StatusCode == HttpStatusCode.Forbidden ||
                messageOfTheDayApiException.StatusCode == HttpStatusCode.TooManyRequests
                ))
            {
                return false;
            }

            return true;
        }
    };

    try
    {
        messageOfTheDay = await orchestrationContext.CallActivityWithRetryAsync<string>(
            nameof(MessageOfTheDayActivity),
            retryOptions,
            null);
    }
    catch (Exception ex)
    {
        MessageOfTheDayApiException messageOfTheDayApiException = ex.InnerException as MessageOfTheDayApiException;

        if (messageOfTheDayApiException != null &&
            messageOfTheDayApiException.StatusCode == HttpStatusCode.TooManyRequests &&
            messageOfTheDayApiException.RetryAfter.HasValue)
        {
            // Wait out the period requested by the service with a durable timer.
            DateTime retryAt = orchestrationContext.CurrentUtcDateTime.Add(messageOfTheDayApiException.RetryAfter.Value);
            await orchestrationContext.CreateTimer(retryAt, CancellationToken.None);

            // Restart the orchestration from the beginning. The value returned
            // below by this execution is discarded.
            orchestrationContext.ContinueAsNew(null);
        }
        else
        {
            // Anything else is a genuine failure - don't swallow it.
            throw;
        }
    }

    return messageOfTheDay;
}

This allows 429 Too Many Requests to be handled properly. It is also the general recipe for building more complex failure handling - by leveraging Durable Functions-specific mechanisms such as timers and ContinueAsNew.
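One caveat: as written, the orchestration will keep restarting for as long as the service keeps responding with 429. If that worries you, a simple safeguard (sketched below, with names of my own choosing) is to thread an attempt counter through the ContinueAsNew input and give up after a limit.

// At the top of the orchestrator - read the restart counter (null on first run).
int attempt = orchestrationContext.GetInput<int?>() ?? 0;

if (attempt >= 5)
{
    throw new InvalidOperationException("Still rate limited after 5 restarts, giving up.");
}

// ...and in the 429 handling branch, instead of ContinueAsNew(null):
orchestrationContext.ContinueAsNew(attempt + 1);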

Be Thoughtful

Proper error handling requires consideration. You need to think about how you want to react to different situations so you don't end up with pointless retries, endless retries, or retry storms. You should also use the right tools for the job, especially in Durable Functions, which have a very specific programming model - the wrong patterns will cost you money or cause timeouts.