As developers, we want reassurance that our code functions as expected. We also want to be given that reassurance as fast as possible. This is why we are writing automated tests. We also desire for those tests to be easy to run on our machine or in the worst-case scenario on the build agent as part of the continuous integration pipeline. This is something that can be challenging to achieve when it comes to Azure Functions.

Depending on the language we are using for our Azure Functions, the challenges can be different. Let's take .NET (I guess I haven't surprised anyone with that choice) as an example. In the case of .NET, we can write any unit tests we want (I will deliberately avoid trying to define what that unit is). But the moment we try to move to integration tests, things get tricky. If our Azure Function is using the in-process model, we have the option of crafting a system under test based on the WebJobs host, which will be good enough for some scenarios. If our Azure Function is using the isolated worker model, there are only two options: accept that our tests will integrate only to a certain level and implement test doubles, or wait for the Azure Functions team to implement a test worker. This is all far from perfect.

To work around at least some of the above limitations, my teams and I have adopted a different approach to Azure Functions integration testing - we've started using Testcontainers. Testcontainers is a framework for defining throwaway, lightweight container instances through code, to be used in a test context.

We initially adopted this approach for .NET Azure Functions, but I know that teams creating Azure Functions in different languages also started using it. This is possible because the approach is agnostic to the language in which the functions are written (and the tests can be written using any language/framework supported by Testcontainers).

In this post, I want to share with you the core parts of this approach. It starts with creating a Dockerfile for your Azure Functions.

Creating a Dockerfile for an Azure Functions Container Image

You may already have a Dockerfile for your Azure Functions (for example if you decided to host them in Azure Container Apps or Kubernetes). From my experience, that's usually not the case. That means you need to create a Dockerfile. There are two options for doing that. You can use Azure Functions Core Tools and call func init with the --docker-only option, or you can create the Dockerfile manually. The Dockerfile is different for every language, so until you gain experience I suggest using the command. Once you are familiar with the structure, you will very likely end up with a modified template that you will be reusing with small adjustments. The one below is my example for .NET Azure Functions using the isolated worker model.

FROM mcr.microsoft.com/dotnet/sdk:7.0 AS installer-env
ARG RESOURCE_REAPER_SESSION_ID="00000000-0000-0000-0000-000000000000"
LABEL "org.testcontainers.resource-reaper-session"=$RESOURCE_REAPER_SESSION_ID

WORKDIR /src
COPY function-app/ ./function-app/

RUN dotnet publish function-app \
    --output /home/site/wwwroot

FROM mcr.microsoft.com/azure-functions/dotnet-isolated:4-dotnet-isolated7.0

ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

COPY --from=installer-env ["/home/site/wwwroot", "/home/site/wwwroot"]

What's probably puzzling you right now is that label based on the provided argument. This is something very specific to Testcontainers. The above Dockerfile describes a multi-stage build, so it will generate intermediate layers. Testcontainers has a concept of a Resource Reaper, whose job is to remove Docker resources once they are no longer needed. This label is needed for the Resource Reaper to be able to track those intermediate layers.

Once we have the Dockerfile we can create the test context.

Creating a Container Instance in Test Context

The way you create the test context depends on the testing framework you are going to use and the isolation strategy you want for that context. My framework of choice is xUnit. When it comes to the isolation strategy, it depends 😉. That said, the one I'm using most often is test class. For xUnit that translates to a class fixture. You can probably guess that there are also requirements when it comes to the context lifetime management. After all, we will be spinning up containers, and that takes time. That's why the class fixture must implement IAsyncLifetime to provide support for asynchronous operations.

public class AzureFunctionsTestcontainersFixture : IAsyncLifetime
{
    ...

    public AzureFunctionsTestcontainersFixture()
    { 
        ...
    }

    public async Task InitializeAsync()
    {
        ...
    }

    public async Task DisposeAsync()
    {
        ...
    }
}

There are a couple of things that we need to do here. The first is creating an image based on our Dockerfile. For this purpose, we can use ImageFromDockerfileBuilder. The minimum we need to provide is the location of the Dockerfile (directory and file name). Testcontainers provides us with some handy helpers for getting the solution, project, or Git directory. We also want to set that RESOURCE_REAPER_SESSION_ID argument.

public class AzureFunctionsTestcontainersFixture : IAsyncLifetime
{
    private readonly IFutureDockerImage _azureFunctionsDockerImage;

    public AzureFunctionsTestcontainersFixture()
    {
        _azureFunctionsDockerImage = new ImageFromDockerfileBuilder()
            .WithDockerfileDirectory(CommonDirectoryPath.GetSolutionDirectory(), String.Empty)
            .WithDockerfile("AzureFunctions-Testcontainers.Dockerfile")
            .WithBuildArgument(
                 "RESOURCE_REAPER_SESSION_ID",
                 ResourceReaper.DefaultSessionId.ToString("D"))
            .Build();
    }

    public async Task InitializeAsync()
    {
        await _azureFunctionsDockerImage.CreateAsync();

        ...
    }

    ...
}

With the image in place, we can create a container instance. This will require a reference to the image, port binding, and a wait strategy. Port binding is something that Testcontainers can almost completely handle for us. We just need to tell it which container port to bind, and the host port can be assigned randomly. The wait strategy is quite important. This is how the framework knows that the container instance is available. We have a lot of options here: port availability, a specific message in the log, command completion, file existence, a successful request, or HEALTHCHECK. What works great for Azure Functions is a successful request to its default page.

public class AzureFunctionsTestcontainersFixture : IAsyncLifetime
{
    private readonly IFutureDockerImage _azureFunctionsDockerImage;

    public IContainer AzureFunctionsContainerInstance { get; private set; }

    ...

    public async Task InitializeAsync()
    {
        await _azureFunctionsDockerImage.CreateAsync();

        AzureFunctionsContainerInstance = new ContainerBuilder()
            .WithImage(_azureFunctionsDockerImage)
            .WithPortBinding(80, true)
            .WithWaitStrategy(
                Wait.ForUnixContainer()
                .UntilHttpRequestIsSucceeded(r => r.ForPort(80)))
            .Build();
        await AzureFunctionsContainerInstance.StartAsync();
    }

    ...
}

The last missing part is the cleanup. We should properly dispose of the container instance and the image.

public class AzureFunctionsTestcontainersFixture : IAsyncLifetime
{
    private readonly IFutureDockerImage _azureFunctionsDockerImage;

    public IContainer AzureFunctionsContainerInstance { get; private set; }

    ...

    public async Task DisposeAsync()
    {
        await AzureFunctionsContainerInstance.DisposeAsync();

        await _azureFunctionsDockerImage.DisposeAsync();
    }
}

Now we are ready to write some tests.

Implementing Integration Tests

At this point, we can start testing our function. We need a test class using our class fixture.

public class AzureFunctionsTests : IClassFixture<AzureFunctionsTestcontainersFixture>
{
    private readonly AzureFunctionsTestcontainersFixture _azureFunctionsTestcontainersFixture;

    public AzureFunctionsTests(AzureFunctionsTestcontainersFixture azureFunctionsTestcontainersFixture)
    {
        _azureFunctionsTestcontainersFixture = azureFunctionsTestcontainersFixture;
    }

    ...
}

Now for the test itself, let's assume that the function has an HTTP trigger. To build the URL of our function we can use the Hostname provided by the container instance and acquire the host port by calling .GetMappedPublicPort. This means that the test only needs to create an instance of HttpClient, make a request, and assert the desired aspects of the response. The simplest test I could think of was to check for a status code indicating success.

public class AzureFunctionsTests : IClassFixture<AzureFunctionsTestcontainersFixture>
{
    private readonly AzureFunctionsTestcontainersFixture _azureFunctionsTestcontainersFixture;

    ...

    [Fact]
    public async Task Function_Request_ReturnsResponseWithSuccessStatusCode()
    {
        HttpClient httpClient = new HttpClient();
        var requestUri = new UriBuilder(
            Uri.UriSchemeHttp,
            _azureFunctionsTestcontainersFixture.AzureFunctionsContainerInstance.Hostname,
            _azureFunctionsTestcontainersFixture.AzureFunctionsContainerInstance.GetMappedPublicPort(80),
            "api/function"
        ).Uri;

        HttpResponseMessage response = await httpClient.GetAsync(requestUri);

        Assert.True(response.IsSuccessStatusCode);
    }
}

And voilà. This will run on your machine (assuming you have Docker) and in any CI/CD environment whose build agents have Docker pre-installed (for example Azure DevOps or GitHub).

Adding Dependencies

What I've shown you so far covers the scope of the function itself. This is already beneficial because it allows for verifying if dependencies are registered properly or if the middleware pipeline behaves as expected. But Azure Functions rarely exist in a vacuum. There are almost always dependencies, and Testcontainers can help us with those dependencies as well. There is a wide set of preconfigured implementations that we can add to our test context. A good example can be storage. In the majority of cases, storage is required to run the function itself. For local development, Azure Functions use the Azurite emulator, and we can do the same with Testcontainers, as Azurite is available as a ready-to-use module. To add it to the context, you just need to reference the proper NuGet package and add a couple of lines of code.

public class AzureFunctionsTestcontainersFixture : IAsyncLifetime
{
    ...

    public AzuriteContainer AzuriteContainerInstance { get; private set; }

    ...

    public async Task InitializeAsync()
    {
        AzuriteContainerInstance = new AzuriteBuilder().Build();
        await AzuriteContainerInstance.StartAsync();

        ...
    }

    public async Task DisposeAsync()
    {
        ...

        await AzuriteContainerInstance.DisposeAsync();
    }
}

We also need to point Azure Functions to use this Azurite container by setting the AzureWebJobsStorage parameter.

public class AzureFunctionsTestcontainersFixture : IAsyncLifetime
{
    ...

    public async Task InitializeAsync()
    {
        ...

        AzureFunctionsContainerInstance = new ContainerBuilder()
            ...
            .WithEnvironment("AzureWebJobsStorage", AzuriteContainerInstance.GetConnectionString())
            ...
            .Build();

        ...
    }

    ...
}

That's it. Having Azurite in place also enables testing functions that use triggers and bindings based on Azure Storage. There are also ready-to-use modules for Redis, Azure Cosmos DB, Azure SQL Edge, MS SQL, Kafka, and RabbitMQ, so there is quite good out-of-the-box coverage for potential Azure Functions dependencies. Some other dependencies can be covered by creating containers yourself (for example with an unofficial Azure Event Grid simulator). That said, some dependencies can only be satisfied by the real thing (at least for now).
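For illustration, below is a sketch of how such a self-built dependency could be added to the fixture's InitializeAsync. The image name and port are hypothetical placeholders, not a reference to any specific simulator.

IContainer eventGridSimulatorInstance = new ContainerBuilder()
    .WithImage("example/azure-event-grid-simulator:latest") // hypothetical image name
    .WithPortBinding(60101, true)                           // hypothetical simulator port
    .WithWaitStrategy(Wait.ForUnixContainer().UntilPortIsAvailable(60101))
    .Build();

await eventGridSimulatorInstance.StartAsync();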

A Powerful Tool in Your Toolbox

Is Testcontainers a solution for every integration testing problem? No. Should Testcontainers be your default choice when thinking about integration tests? Also no. But it is a very powerful tool, and you should be familiar with it so you can use it when appropriate.

The first time I wrote about change feeds consumption from .NET (ASP.NET Core to be more precise) was back in 2018, in the context of RethinkDB. It has always been a very powerful concept. Having access to an ordered flow of information about changes to items is a low-entry enabler for various event-driven, stream processing, or data movement scenarios. As a result, over the years, this capability (with various name variations around the words change, stream, and feed) has found its way into many databases and sometimes even other storage services. The list includes (but is not limited to) MongoDB, RavenDB, Cosmos DB, DynamoDB, and Azure Blob Storage (in preview).

As I was cleaning up and updating a demo application that shows how to consume and expose various change feeds from ASP.NET Core, I decided to write down some notes to refresh the content from my previous posts.

IAsyncEnumerable as Universal Change Feed Abstraction

When I started working with change feeds over 5 years ago, I initially didn't put them behind any abstraction. I like to think that I was smart and avoided premature generalization. The abstraction came after a couple of months, when I could clearly see that I was implementing the same concepts through similar components in different projects where teams were using RethinkDB, MongoDB, or Cosmos DB. The abstraction that I started advocating back then usually looked like this.

public interface IChangeFeed<T>
{
    T CurrentChange { get; }

    Task<bool> MoveNextAsync(CancellationToken cancelToken = default(CancellationToken));
}

In retrospect, I'm happy with this abstraction, because around two or more years later, when those teams and projects started to adopt C# 8 and .NET Core 3 (or later versions), refactoring all those implementations was a lot easier. C# 8 brought async streams, a natural programming model for asynchronous streaming data sources. An asynchronous streaming data source is exactly what a change feed is, and modeling change feeds through IAsyncEnumerable results in nice and clean consumption patterns. This is why I currently advocate for using IAsyncEnumerable as a universal change feed abstraction. The trick to properly using that abstraction is defining the right change representation to be returned. That should depend on the change feed capabilities and the actual needs in a given context. Not all change feeds are the same. Some of them can provide information on all operations performed on an item, and some only on a subset. Some can provide the old and new value, and some only the old. Your representation of change should consider all that. In the samples ahead I'm avoiding this problem by reducing the change representation to the changed version of the item.
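To make the consumption side concrete, here is a minimal sketch of what iterating such a feed can look like. The changeFeed variable stands for any component exposing the FetchFeed methods shown in the snippets below, and ProcessChangeAsync is an assumed placeholder for whatever reaction to a change is needed.

await foreach (T item in changeFeed.FetchFeed(cancellationToken))
{
    // React to the change - update a projection, publish an event, etc.
    await ProcessChangeAsync(item, cancellationToken);
}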

Azure Cosmos DB Change Feed

Azure Cosmos DB change feed is the second one (after the RethinkDB one) I've been writing about in the past. It's also the one whose consumption model has seen the most evolution over time.

The first consumption model was quite complicated. It required going through partition key ranges, building document change feed queries for them, and then obtaining enumerators. This whole process required managing its state, which resulted in non-trivial code. It's good that it has been deprecated as part of Azure Cosmos DB .NET SDK V2, and it's going out of support in August 2024.

Azure Cosmos DB .NET SDK V3 has brought the second consumption model, based on the change feed processor. The whole inner workings of consuming the change feed have been enclosed within a single class, which reduced the amount of code required. But the change feed processor has its oddities. It requires an additional container - a lease container that deals with the previously described state management. This is beneficial in complex scenarios as it allows for coordinated processing by multiple workers, but becomes an unnecessary complication for simple scenarios. It also provides only a push-based programming model. The consumer must provide a delegate to receive changes. Once again this is great for certain scenarios, but leads to awkward implementation when you want to abstract the change feed as a stream.

The story doesn't end there - version 3.20.0 of the Azure Cosmos DB .NET SDK has introduced the third consumption model, based on the change feed iterator. It provides a pull-based alternative to the change feed processor for scenarios where it's more appropriate. With the change feed iterator, the control over the pace of consuming the changes is given back to the consumer. State management is also optional, but it's the consumer's responsibility to persist continuation tokens if necessary. Additionally, the change feed iterator brings the option of obtaining a change feed for a specific partition key.

The below snippet shows a very simple consumer implementation of the change feed iterator model - no state management, just starting the consumption from a certain point in time and waiting one second before polling for new changes.

public async IAsyncEnumerable<T> FetchFeed(
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    FeedIterator<T> changeFeedIterator = _container.GetChangeFeedIterator<T>(
        ChangeFeedStartFrom.Time(DateTime.UtcNow),
        ChangeFeedMode.LatestVersion
    );

    while (changeFeedIterator.HasMoreResults && !cancellationToken.IsCancellationRequested)
    {
        FeedResponse<T> changeFeedResponse = await changeFeedIterator
            .ReadNextAsync(cancellationToken);

        if (changeFeedResponse.StatusCode == HttpStatusCode.NotModified)
        {
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
        else
        {
            foreach (T item in changeFeedResponse)
            {
                yield return item;
            }
        }
    }
}
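If state management is needed, the continuation token available on every response can be persisted and used to resume. Below is a minimal sketch, assuming a durable store abstraction of your own (here called _continuationTokenStore).

string continuationToken = await _continuationTokenStore.LoadAsync();

FeedIterator<T> changeFeedIterator = _container.GetChangeFeedIterator<T>(
    continuationToken is null
        ? ChangeFeedStartFrom.Time(DateTime.UtcNow)
        : ChangeFeedStartFrom.ContinuationToken(continuationToken),
    ChangeFeedMode.LatestVersion
);

FeedResponse<T> changeFeedResponse = await changeFeedIterator.ReadNextAsync();

// The token to persist for the next run comes with every response.
await _continuationTokenStore.SaveAsync(changeFeedResponse.ContinuationToken);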

MongoDB Change Feed

MongoDB is probably the most popular NoSQL choice among the teams I've been working with that don't use cloud PaaS databases for their needs. Among its many features, it has a quite powerful change feed (a.k.a. Change Streams) capability.

The incoming change information can cover a wide spectrum of operations which can come from a single collection, database, or entire deployment. If the operation relates to a document, the change feed can provide the current version, the previous version, and the delta. There is also support for resume tokens which can be used to manage state if needed.

One unintuitive thing when it comes to the MongoDB change feed is that it's only available when you are running a replica set or a sharded cluster. This doesn't mean that you have to run a cluster. You can run a single instance as a replica set (even in a container); you just need the right configuration (you will find a workflow that handles such a deployment to Azure Container Instances in the demo repository).

The consumption of MongoDB change feed is available through the Watch and WatchAsync methods available on IMongoCollection, IMongoDatabase, and IMongoClient instances. The below snippet watches a single collection and configures the change feed to return the current version of the document. You can also provide a pipeline definition when calling Watch or WatchAsync to filter the change feed (for example to monitor only specific operation types).

public async IAsyncEnumerable<T> FetchFeed(
    [EnumeratorCancellation]CancellationToken cancellationToken = default)
{
    IAsyncCursor<ChangeStreamDocument<T>> changefeed = await _collection.WatchAsync(
        new ChangeStreamOptions { FullDocument = ChangeStreamFullDocumentOption.UpdateLookup },
        cancellationToken: cancellationToken
    );

    while (!cancellationToken.IsCancellationRequested)
    {
        while (await changefeed.MoveNextAsync(cancellationToken))
        {
            IEnumerator<ChangeStreamDocument<T>> changefeedCurrentEnumerator = changefeed
                .Current.GetEnumerator();

            while (changefeedCurrentEnumerator.MoveNext())
            {
                if (changefeedCurrentEnumerator.Current.OperationType
                    == ChangeStreamOperationType.Insert)
                {
                    yield return changefeedCurrentEnumerator.Current.FullDocument;
                }

                ...
            }
        }

        await Task.Delay(_moveNextDelay, cancellationToken);
    }
}
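As mentioned, the filtering can also be pushed down to the server by providing a pipeline definition when calling WatchAsync. Here is a minimal sketch (reusing _collection and cancellationToken from the snippet above) limiting the stream to insert operations.

PipelineDefinition<ChangeStreamDocument<T>, ChangeStreamDocument<T>> pipeline =
    new EmptyPipelineDefinition<ChangeStreamDocument<T>>()
        .Match(change => change.OperationType == ChangeStreamOperationType.Insert);

IAsyncCursor<ChangeStreamDocument<T>> changefeed = await _collection.WatchAsync(
    pipeline,
    new ChangeStreamOptions { FullDocument = ChangeStreamFullDocumentOption.UpdateLookup },
    cancellationToken
);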

Azure Blob Storage Change Feed

Azure Blob Storage is the odd one on this list because it's an object storage, not a database. Its change feed provides information about changes to blobs and blob metadata in an entire storage account. Under the hood, the change feed is implemented as a special container (yes, it's visible; yes, you can take a look) which is created once you enable the feature. As it is a container, you should consider configuring the retention period, as it will affect your costs.

There is one more important aspect of Azure Blob Storage change feed when considering its usage - latency. It's pretty slow. It can take minutes for changes to appear.

From the consumption perspective, it follows the enumerator approach. You can obtain the enumerator by calling BlobChangeFeedClient.GetChangesAsync. The enumerator is not infinite: it returns the changes currently available, and once you process them, you have to poll for new ones. This makes managing continuation tokens necessary, even if only as local state. What is unique is that you can request changes within a specified time window.

The change feed supports six events in the latest schema version. In addition to expected ones like created or deleted, there are some interesting ones like tier changed. The information never contains the item, which shouldn't be surprising as in the context of object storage this would be quite risky.

The below snippet streams the change feed by locally managing the continuation token, and for changes that represent blob creation, it downloads the current version of the item.

public async IAsyncEnumerable<T> FetchFeed(
    [EnumeratorCancellation]CancellationToken cancellationToken = default)
{
    string? continuationToken = null;

    TokenCredential azureCredential = new DefaultAzureCredential();

    BlobServiceClient blobServiceClient = new BlobServiceClient(_serviceUri, azureCredential);
    BlobChangeFeedClient changeFeedClient = blobServiceClient.GetChangeFeedClient();

    while (!cancellationToken.IsCancellationRequested)
    {
        IAsyncEnumerator<Page<BlobChangeFeedEvent>> changeFeedEnumerator = changeFeedClient
            .GetChangesAsync(continuationToken)
            .AsPages()
            .GetAsyncEnumerator();

        while (await changeFeedEnumerator.MoveNextAsync())
        {
            foreach (BlobChangeFeedEvent changeFeedEvent in changeFeedEnumerator.Current.Values)
            {
                if ((changeFeedEvent.EventType == BlobChangeFeedEventType.BlobCreated)
                    && changeFeedEvent.Subject.StartsWith($"/blobServices/default/containers/{_container}"))
                {
                    BlobClient createdBlobClient = new BlobClient(
                        changeFeedEvent.EventData.Uri,
                        azureCredential);

                    if (await createdBlobClient.ExistsAsync())
                    {
                        MemoryStream blobContentStream =
                            new MemoryStream((int)changeFeedEvent.EventData.ContentLength);
                        await createdBlobClient.DownloadToAsync(blobContentStream);
                        blobContentStream.Seek(0, SeekOrigin.Begin);

                        yield return JsonSerializer.Deserialize<T>(blobContentStream);
                    }
                }
            }

            continuationToken = changeFeedEnumerator.Current.ContinuationToken;
        }

        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}
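The previously mentioned time window deserves a quick illustration as well. Below is a minimal sketch (using the same changeFeedClient as in the snippet above) that only processes the last 24 hours of changes - keep in mind that the change feed is organized into hourly segments, so the boundaries are approximate.

await foreach (BlobChangeFeedEvent changeFeedEvent in changeFeedClient.GetChangesAsync(
    DateTimeOffset.UtcNow.AddHours(-24),
    DateTimeOffset.UtcNow))
{
    // Process historical events the same way as in the snippet above.
}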

There Is More

The above samples are in no way exhaustive. They don't show all the features of the given change feeds, and they don't show all the change feeds out there. But they are a good start - this is why I've been evolving them for the past five years.

Containers have become one of the main, if not the main, ways to modularize, isolate, encapsulate, and package applications in the cloud. The sidecar pattern takes this even further by allowing the separation of functionalities like monitoring, logging, or configuration from the business logic. This is why I recommend that teams who are adopting containers adopt sidecars as well. One of my preferred suggestions is Dapr, which can bring early value by providing abstractions for message broker integration, encryption, observability, secret management, state management, or configuration management.

To my surprise, many conversations that start around adopting sidecars quickly drift to "we should set up a Kubernetes cluster". It's almost like there are only two options out there - you either run a standalone container or you need Kubernetes for anything more complicated. This is not the case. There are multiple ways to run containers, and you should choose the one that is most suitable for your current context. Many of those options will give you more sophisticated features like sidecars or init containers while your business logic is still in a single container. Sidecars give an additional benefit here: they enable later evolution to more complex container hosting options without requiring code changes.

In the case of Azure, such a service that enables adopting sidecars at an early stage is Azure Container Instances.

Quick Reminder - Azure Container Instances Can Host Container Groups

Azure Container Instances provides a managed approach for running containers in a serverless manner, without orchestration. What I've learned is that a common misconception is that Azure Container Instances can host only a single container. That is not exactly true - Azure Container Instances can host a container group (if you're using Linux containers 😉).

A container group is a collection of containers scheduled on the same host and sharing lifecycle, resources, or local network. The container group has a single public IP address, but the publicly exposed ports can forward to ports exposed on different containers. At the same time, all the containers within the group can reach each other via localhost. This is what enables the sidecar pattern.

How to create a container group? There are three options:

  • With ARM/Bicep
  • With Azure CLI by using YAML file
  • With Azure CLI by using Docker compose file

I'll go with Bicep here. The Microsoft.ContainerInstance namespace contains only a single type: containerGroups. This means that from the ARM/Bicep perspective there is no difference between deploying a standalone container and deploying a container group - there is a containers list available as part of the resource properties where you specify the containers.

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  name: CONTAINER_GROUP
  location: LOCATION
  ...
  properties: {
    sku: 'Standard'
    osType: 'Linux'
    ...
    containers: [
      ...
    ]
    ...
  }
}

How about a specific example? I've mentioned that Dapr is one of my preferred sidecars, so I'm going to use it here.

Running Dapr in Self-Hosted Mode Within a Container Group

Dapr has several hosting options. It can be self-hosted with Docker, Podman, or without containers. It can be hosted in Kubernetes with first-class integration. It's also available as a serverless offering - part of Azure Container Apps. The option that interests us in the context of Azure Container Instances is self-hosted with Docker, but from that list, you can already see how Dapr enables easy evolution from Azure Container Instances to Azure Container Apps, Azure Kubernetes Service, or non-Azure Kubernetes clusters.

But before we are ready to deploy the container group, we need some infrastructure around it. We should start with a resource group, a container registry, and a managed identity.

az group create -l $LOCATION -g $RESOURCE_GROUP
az acr create -n $CONTAINER_REGISTRY -g $RESOURCE_GROUP --sku Basic
az identity create -n $MANAGED_IDENTITY -g $RESOURCE_GROUP

We will be using the managed identity for role-based access control where possible, so we should reference it as the identity of the container group in our Bicep template.

resource managedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' existing = {
  name: MANAGED_IDENTITY
}

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${managedIdentity.id}': {}
    }
  }
  properties: {
    ...
  }
}

The Dapr sidecar requires a components directory. It's a folder that will contain YAML files with component definitions. To provide that folder to the Dapr sidecar container, we have to mount it as a volume. Azure Container Instances supports mounting an Azure file share as a volume, so we have to create one.

az storage account create -n $STORAGE_ACCOUNT -g $RESOURCE_GROUP --sku Standard_LRS
az storage share create -n daprcomponents --account-name $STORAGE_ACCOUNT

The created Azure file share needs to be added to the list of volumes that can be mounted by containers in the group. Sadly, the integration between Azure Container Instances and Azure file shares doesn't support role-based access control, so an access key has to be used.

...

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-09-01' existing = {
  name: STORAGE_ACCOUNT
}

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  properties: {
    ...
    volumes: [
      {
        name: 'daprcomponentsvolume'
        azureFile: {
          shareName: 'daprcomponents'
          storageAccountKey: storageAccount.listKeys().keys[0].value
          storageAccountName: storageAccount.name
          readOnly: true
        }
      }
    ]
    ...
  }
}

We also need to assign the AcrPull role to the managed identity so it can access the container registry.

az role assignment create --assignee $MANAGED_IDENTITY_OBJECT_ID \
    --role AcrPull \
    --scope "/subscriptions/$SUBSCRIPTION_ID/resourcegroups/$RESOURCE_GROUP/providers/Microsoft.ContainerRegistry/registries/$CONTAINER_REGISTRY"

I'm skipping the creation of the image for the application with the business logic, pushing it to the container registry, adding its definition to the containers list, and exposing needed ports from the container group - I want to focus on the Dapr sidecar.

In this example, I will be grabbing the daprd image from Docker Hub.

The startup command for the sidecar is ./daprd. We need to provide a --resources-path parameter, which needs to point to the path where the daprcomponentsvolume will be mounted. I'm also providing the --app-id parameter. This parameter is mostly used for service invocation (that won't be the case here, and I'm not providing --app-port), but Dapr also uses it in other scenarios (for example, as a partition key for some state stores).

Two ports need to be exposed from this container (not publicly): 3500 is the default HTTP endpoint port and 50001 is the default gRPC endpoint port. There is an option to change both ports through configuration if they are already taken by some other container.

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  properties: {
    ...
    containers: [
      ...
      {
        name: 'dapr-sidecar'
        properties: {
          image: 'daprio/daprd:1.10.9'
          command: [ './daprd', '--app-id', 'APPLICATION_ID', '--resources-path', './components']
          volumeMounts: [
            {
              name: 'daprcomponentsvolume'
              mountPath: './components'
              readOnly: true
            }
          ]
          ports: [
            { 
              port: 3500
              protocol: 'TCP'
            }
            { 
              port: 50001
              protocol: 'TCP'
            }
          ]
          ...
        }
      }
    ]
    ...
  }
}

I've omitted the resources definition for brevity.

Now the Bicep template can be deployed.

az deployment group create -g $RESOURCE_GROUP -f container-group-with-dapr-sidecar.bicep

The below diagram visualizes the final state after the deployment.

Diagram of Azure Container Instances hosting a container group including application container and Dapr sidecar integrated with Azure Container Registry and having Azure file share mounted as volume with Dapr components definitions.

Configuring a Dapr Component

We have a running Dapr sidecar, but we have yet to make it truly useful. To be able to use the APIs provided by Dapr, we have to provide the component definitions mentioned earlier, which supply the implementations for those APIs. As we already have a storage account as part of our infrastructure, a state store component seems like a good choice. Dapr supports quite an extensive list of stores, out of which two are based on Azure Storage: Azure Blob Storage and Azure Table Storage. Let's use the Azure Table Storage one.

First, I'm going to create a table. This is not a required step - the component can do it for us - but let's assume we want to seed some data manually before the deployment.

Second, the more important operation is granting the needed permissions to the storage account. Dapr has very good support for authenticating to Azure, which includes managed identities and role-based access control, so I'm just going to assign the Storage Table Data Contributor role to our managed identity for the scope of the storage account.

az storage table create -n $TABLE_NAME --account-name $STORAGE_ACCOUNT
az role assignment create --assignee $MANAGED_IDENTITY_OBJECT_ID \
    --role "Storage Table Data Contributor" \
    --scope "/subscriptions/$SUBSCRIPTION_ID/resourcegroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"

The last thing we need is the component definition. The component type we want is state.azure.tablestorage. The name is what we will be using when making calls with a Dapr client. As we are going to use the managed identity for authentication, we should provide accountName, tableName, and azureClientId as metadata. I'm additionally setting skipCreateTable because I created the table earlier and the component would fail on an attempt to create it again.

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: state.table.<TABLE_NAME>
spec:
  type: state.azure.tablestorage
  version: v1
  metadata:
  - name: accountName
    value: <STORAGE_ACCOUNT>
  - name: tableName
    value: <TABLE_NAME>
  - name: azureClientId
    value: <Client ID of MANAGED_IDENTITY>
  - name: skipCreateTable
    value: true

The file with the definition needs to be uploaded to the file share which is mounted as the components directory. The container group needs to be restarted for the component to be loaded. We can quickly verify that it has been loaded by taking a look at the logs.

time="2023-08-31T21:25:22.5325911Z"
level=info
msg="component loaded. name: state.table.<TABLE_NAME>, type: state.azure.tablestorage/v1"
app_id=APPLICATION_ID
instance=SandboxHost-638291138933285823
scope=dapr.runtime
type=log
ver=1.10.9

Now you can start managing your state with a Dapr client for your language of choice, or with the HTTP API if a client doesn't exist for your language.
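For .NET, that could look more or less like the sketch below (using the Dapr.Client NuGet package; the store name must match the component name from the definition above, and the key and value are just illustrative).

using Dapr.Client;

// Build a client talking to the local Dapr sidecar and round-trip a state entry.
DaprClient daprClient = new DaprClientBuilder().Build();

await daprClient.SaveStateAsync("state.table.<TABLE_NAME>", "order-1", new { Status = "Created" });

var state = await daprClient.GetStateAsync<object>("state.table.<TABLE_NAME>", "order-1");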

The Power of Abstraction, Decoupling, and Flexibility

As you can see, the needed increase in complexity (when compared to a standalone container hosted in Azure Container Instances) is not that significant. At the same time, the gain is. Dapr allows us to abstract all the capabilities it provides in the form of building blocks. It also decouples the capabilities provided by building blocks from the components providing implementation. We can change Azure Table Storage to Azure Cosmos DB if it better suits our solution, or to AWS DynamoDB if we need to deploy the same application to AWS. We also now have the flexibility of evolving our solution when the time comes to use a more sophisticated container offering - we just need to take Dapr with us.

This series on implementing DevOps practices for Azure infrastructure is nearing its conclusion. The last part remaining is completing the operations side of the loop.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

This brings focus to the last two practices on our list:

Continuous Operations & Continuous Monitoring

The Continuous Operations and Continuous Monitoring practices are closely tied together. They jointly serve the goal of ensuring the overall reliability, resiliency, and security of solutions. The majority of capabilities supporting that goal are within the scope of the Continuous Operations practice and cover aspects like compliance enforcement, cost management, proactive maintenance, security posture management, and intelligence-driven responses to operational and security events. That said, most of those capabilities can't be achieved without capabilities coming from the Continuous Monitoring practice. There can be no cost management without cost tracking. There is no way to have proactive maintenance and intelligence-driven responses without gathering observability signals, configuring alerts, and building dashboards.

Organizations usually have the capabilities covered by Continuous Operations and Continuous Monitoring already established, but often they are not aligned with DevOps cultural philosophies. This means that implementing those practices is often about addressing gaps around automation, collaboration, continuous feedback, and continuous improvement.

But before we start addressing those gaps, it's worth making sure that the capabilities have been established on the right foundations, as Azure provides a wide range of services to support us here:

  • Azure Policy for compliance enforcement.
  • Azure Monitor with its insights, visualization, analytics, and response stack for gathering observability signals, configuring alerts, and building dashboards.
  • Microsoft Defender for Cloud for workload protection and security posture management.
  • Azure Sentinel for security information and event management (SIEM) as well as security orchestration, automation, and response (SOAR).
  • Azure Automation and Azure Logic Apps for automating event-based intelligence-driven responses and orchestrating proactive maintenance.
  • Microsoft Cost Management and Billing for cost tracking and management.

With the right foundations in place, we can focus on aspects that make the difference between "being DevOps" and "not being DevOps". The most crucial one is ensuring that everyone has access to information on how the part they are responsible for is behaving in production.

All Teams Need Observability Signals

As you may remember from the post on Continuous Delivery and Continuous Deployment, at certain sizes solutions often start moving from centralized ownership to being owned by multiple independent application teams and an environment team. This dynamic needs to be reflected in the monitoring architecture as well. A single, centralized monitoring service, although needed by the environment team, may not be sufficient. This is why mature monitoring implementations utilize resource-context observability signals and granular insights, visualization, and analytics workspaces, from which the signals are later centrally aggregated. This approach enables every application team to have direct access to its signals, configure alerts, and build dashboards, while the environment team still has visibility into the whole picture.

This approach also enables democratization when it comes to the tools themselves. The native observability stack in Azure is provided by Azure Monitor, but it no longer means that application teams are fully limited to Application Insights. If they prefer, they can use Prometheus and Grafana for metrics (which is great when they need to be more cloud-agnostic or are looking at adopting OpenTelemetry).

Diagram representing a democratized monitoring architecture with dedicated workspaces for every team and teams using different tools

Of course, such a democratized monitoring architecture cannot be left without governance. There need to be rules around observability signal granularity, retention, and archiving data to cool-tier storage. Otherwise, we can be very unpleasantly surprised by the cost of our monitoring implementation.

Automated responses should also export proper context information to the respective tools, because part of an automated response should be creating a proper item in the collaboration tool to ensure continuous feedback. What item should that be? That depends on the event category.

Operational Events Should Create Issues

From the infrastructure perspective, there are usually two main types of operational events that are potentially interesting:

  • Resources events like creation, deletion, or modification
  • Alerts defined in Azure Monitor

The usage of resource events often covers adding special tags, granting permissions to special groups, or reacting to failed delete/create/update operations.

Alerts are usually raised when the measured state of the system deviates from what is considered a baseline. To name just a few examples, this can mean networking issues, an erroneously stopped VM, or a resource reaching its capacity.

The remediation for every resource event or alert can be different. In some cases, the remediation can be fully automated (restarting a VM, truncating tables in a database, or increasing RUs for Azure Cosmos DB). In some cases, all that is needed is just a notification to deal with the problem at the earliest convenience (failure to delete a resource). There are also those cases that require waking up an engineer immediately (networking issues).

In the first post of this series, I wrote that the cornerstone of implementing DevOps practices for Azure infrastructure is infrastructure as code and the Git ecosystem used for collaboration. This means that regardless of whether the remediation is fully automated or an engineer needs to be engaged, part of the process should be issue creation (if the remediation has already been performed, that issue can be closed and exist just for tracking purposes). In the stack I've chosen for this series, the Git ecosystem is GitHub. Integrating GitHub issue creation into the response workflow is not a huge challenge, because there is a ready-to-use GitHub connector for Azure Logic Apps. So, if we consider alerts, this means that we can build an automated response flow by using Azure Monitor Alerts, an Azure Monitor Action Group, and Azure Logic Apps.

Diagram representing an automated response flow for an alert raised in Azure Monitor which uses Azure Logic App to create issues in GitHub and perform remediation action

An almost identical flow can be built for the resource events if we use Azure Event Grid in place of Azure Monitor (as Azure Event Grid supports resource groups and subscriptions as sources).

This is the approach that should be applied to ensure collaboration and continuous feedback when it comes to operational events. How about security events?

Security Events Should Create Vulnerabilities

Security events have a specific lifecycle that falls under the responsibility of the organization's Security Operations Center (SOC). It's the SOC that uses available observability signals, CVE alerting platforms like OpenCVE, and other tools to detect, investigate, and remediate threats. In the case of Azure, Azure Sentinel is the one-stop shop to build and automate this responsibility.

That said, the SOC usually deals with the immediate remediation of a threat. For example, a SOC operator or automation may determine that, because a new CVE has been disclosed, a specific resource needs to be isolated to mitigate the threat. The only action performed will be the isolation - the responsibility for mitigating the CVE lies with the application or environment team. In such cases, the SOC operator or automation should report the specific vulnerability with context and findings in the collaboration tool. When using GitHub as the Git ecosystem for collaboration, a great way to report such vulnerabilities may be through security advisories.

Security advisories facilitate the process of reporting, discussing, and fixing vulnerabilities. Creating security advisories requires the admin or security manager role within the repository, so the integration must be designed properly to avoid excessive permissions within the organization. My approach is to create a GitHub App. GitHub Apps use OAuth 2.0 and can act on behalf of a user, which in this case will be the SOC operator or automation. To make the creation of security advisories available directly from Azure Sentinel, I expose a webhook from the GitHub App which can be called by a Playbook.

Diagram representing a security advisory creation flow through Azure Sentinel Playbook and GitHub App

Providing automated tools which don't require context switching from the SOC perspective removes roadblocks, which is crucial for the adoption of collaboration and continuous feedback between otherwise disconnected teams. This is the true spirit of DevOps.

Infrastructure Drift Detection

There is one capability in the context of Continuous Monitoring and Continuous Operations that is very specific to infrastructure - detecting drift.

As I have shown throughout the series, if we want to implement DevOps practices for Azure infrastructure, the infrastructure should be changed only through modifying and deploying its code. The repository should be the single source of truth. But sometimes, when there is pressure, stress, or time constraints (for example when solving a critical issue), engineers do take shortcuts and modify the infrastructure directly. It's not that big of an issue if such an engineer later reflects the changes in the infrastructure code. But humans are humans, and they sometimes forget. This can cause the environment to drift from its source of truth and creates potential risks, ranging from the applied change being reverted by the next deployment to outright deployment failures. This is why detecting drift is important.

Infrastructure drift detection is a complex problem. Depending on the chosen stack, there are different tools you can use to make it as sophisticated as you need. Here, as an example, I'm going to show a mechanism that can be set up quickly based on the stack I've already used throughout this series. It's far from perfect, but it's a good start. It's using the what-if command, which I've already been using for creating previews of changes as part of the Continuous Integration implementation.

az deployment group what-if \
    --resource-group rg-devops-practices-sample-application-prod \
    --template-file applications/sample-application/application.bicep \
    --mode Complete \
    --no-pretty-print

You may notice two differences between the usage of what-if for previews and drift detection.

The first difference is the Complete deployment mode. The difference between the Incremental (the default) and Complete deployment modes is that in the case of the latter, resources that exist in the resource group but aren't specified in the template will be deleted instead of ignored.

The second difference is the output format. For the previews, I wanted something human-readable, but here I prefer something which will be easy to process programmatically. Providing the --no-pretty-print switch changes the output format to JSON. Below you can see a snippet of it.

{
  "changes": [
    {
      "after": null,
      "before": {
        "name": "kvsampleapplication",
        ...
      },
      "changeType": "Delete",
      ...
    },
    {
      "after": {
        "name": "id-sampleapplication-gd3f7mnjwpuyu",
        ...
      },
      "before": {
        "name": "id-sampleapplication-gd3f7mnjwpuyu",
        ...
      },
      "changeType": "NoChange",
      ...
    },
    ...
  ],
  "error": null,
  "status": "Succeeded"
}

Our attention should focus on the changeType property. It provides information on what will happen to the resource after the deployment. The possible values are: Create, Delete, Ignore, NoChange, Modify, and Deploy. Create, Delete, and NoChange are self-explanatory. The Ignore value should not be present in the case of Complete deployment mode unless limits (number of nested templates or expanding time) have been reached - in such a case, it will mean that the resource hasn't been evaluated. Modify and Deploy are tricky. They mean that the properties of the resource will be changed after the deployment. Unfortunately, the Resource Manager is not perfect here, and those two can give false positive predictions. This is why this technique is far from perfect - the only drift that can be reliably detected is missing resources or resources that shouldn't exist. But, as I said, it's a good start, as we can quickly create a GitHub Actions workflow that will be performing the detection. Let's start by checking out the deployed tag and connecting to Azure.

...

env:
  TAG: sample-application-v1.0.0

jobs:
  drift-detection:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
    - name: Checkout
      uses: actions/checkout@v3
      with:
        ref: ${{ env.TAG }}
    - name: Azure Login
      uses: azure/login@v1
      with:
        ...
  ...

The next step is to run a script that will call what-if and process the results to create an array of detected changes.

...

env:
  ...
  RESOURCE_GROUP: 'rg-devops-practices-sample-application-prod'

jobs:
  drift-detection:
    ...
    steps:
    ...
    - name: Detect infrastructure drift
      shell: pwsh
      run: |
        $issues = @()

        $drift = az deployment group what-if `
          --resource-group $env:RESOURCE_GROUP `
          --template-file applications/sample-application/application.bicep `
          --mode Complete `
          --no-pretty-print | ConvertFrom-Json

        foreach ($change in $drift.Changes)
        {
          switch ($change.changeType)
          {
            'Create'
            {
              $issues += @{
                ResourceName = $change.after.name
                Description = 'Defined resource doesn''t exist'
              }
            }
            'Delete'
            {
              $issues += @{
                ResourceName = $change.before.name
                Description = 'Undefined resource exists'
              }
            }
          }
        }

        'DRIFT_ISSUES<<EOF' >> $env:GITHUB_ENV
        $issues | ConvertTo-Json -AsArray >> $env:GITHUB_ENV
        'EOF' >> $env:GITHUB_ENV
  ...

Having all the changes gathered, we can use the proven github-script action to create an issue for every detected change.

...

jobs:
  drift-detection:
    ...
    permissions:
      ...
      issues: write
    steps:
    ...
    - name: Report detected infrastructure drift
      uses: actions/github-script@v6
      with:
        script: |
          const issues = JSON.parse(process.env.DRIFT_ISSUES);
          for (const issue of issues) {
            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: '[DRIFT DETECTED] ' + issue.Description + ' (' + issue.ResourceName + ')'
            });
          }
  ...

We can have this workflow running regularly. It will create nice issues like the one in the screenshot and will give us a start in drift detection.

Drift Detection (Sample Application) workflow - created issue for undefined resource

The Journey Never Ends

With Continuous Operations and Continuous Monitoring practices, we have closed the loop.

DevOps Pipeline With Tools for Create, Verify, Package, Release, Operate, and Monitor Stages

But the nature of a loop is that an end is also a beginning. The implementation of DevOps is never "done". It's a direct consequence of its core cultural philosophies: continuous feedback and continuous improvement. Regardless of how your initial implementation looks, you should constantly evaluate it in the context of the ecosystem around it and evolve. This will mean modifying the implementation of already established practices, but also implementing new complementary ones (like Continuous Learning or Continuous Documentation).

The goal of this series was to draw the overall picture and provide examples that will bring that picture to life. The accompanying repository contains working workflows that can kickstart your journey.

So far, as part of my series on implementing DevOps practices for Azure infrastructure, I've walked through Continuous Integration, Continuous Delivery, and Continuous Deployment. In many conversations I've had around implementing DevOps, I've heard the opinion that once you have CI/CD (or CI/CD/CD), you have DevOps. That's not true. DevOps is about a continuous loop of feedback, automation, collaboration, and improvement. As you can see in the picture below, those three practices give only about half of that loop and cover mostly the development side.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

This is why there are more practices on the list:

To complete the loop and speak about complete DevOps implementation, it's time to start implementing practices that provide feedback from the deployed environment to the teams and automate operations concerns. In this post, I'm going to discuss Continuous Testing.

The goal of Continuous Testing is to ensure quality at different stages of the development life cycle. It's a practice that applies to both sides of the loop. We have already encountered it as part of the Continuous Integration practice. It's sometimes present as part of Continuous Delivery (for example running specific tests when versions of modules referenced from the environment repository are being updated) and it should be present as part of Continuous Deployment and later in the form of after-deployment tests. The after-deployment tests are what I want to focus on.

Discussing tests often revolves around discussing two aspects: tools to be used for implementation and types of tests to be implemented. Those are two main ingredients used to create a test strategy (of course a mature test strategy covers much more, but it's a discussion for a different occasion). Let's first consider the tools.

There are no real special requirements when choosing tools for infrastructure tests. As long as the stack allows calling APIs, it should be sufficient. Very often, the application teams use the same tools for testing the infrastructure tied to the application as they use for testing the application itself. The environment teams, on the other hand, are looking for specific tools which fit the ecosystem they are familiar with. A popular choice when it comes to Azure is Pester, a test framework for PowerShell. I'm going to use it for the examples here.

What types of tests should you consider implementing? There are two which I consider a must-have - smoke tests and negative tests.

Smoke Tests

Smoke tests should be the first tests to verify the deployment. Their goal is to quickly provide feedback on crucial functions of the system without delving into finer details. Their implementation should be fast and simple. A typical smoke test is a verification that a host is responsive, which in Pester is just a couple of lines:

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Application Host' {
    It 'Serves pages over HTTPS' {
      $request = [System.Net.WebRequest]::Create("https://$HostName/")
      $request.AllowAutoRedirect = $false
      $request.GetResponse().StatusCode |
        Should -Be 200 -Because "It's responsive"
    }
}

Notice that we are not trying to determine if the hosted application is healthy beyond getting a successful status code - we are testing infrastructure and the application hasn't been deployed yet.

Running smoke tests should be the first job in our post-deployment workflow. GitHub-hosted runners come with Pester, which means that running the tests is just two lines of PowerShell.

jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/smoke-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI
  ...

So, running the tests is not a challenge. But running the tests is not a goal in itself. The goal is to react properly when tests fail. What should we do when smoke tests fail? There are two options to choose from: roll back or roll forward. For smoke tests, we should almost always aim for rolling back. After all, a crucial function of our system is not working, and reverting to the previous stable version is usually the quickest way to fix that. Of course, rolling back may not always be possible, and then you are left with rolling forward as the only option. Still, you should aim for this to be an edge case.
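How the rollback itself happens depends on how the environment is deployed, but the trigger can live in the same workflow. Below is a minimal sketch of a roll-back job that runs only when the smoke tests job fails - the job name, the echo placeholder, and the PREVIOUS_RELEASE_TAG variable are hypothetical and stand in for whatever mechanism you use to restore the last stable version.

jobs:
  smoke-tests:
    ...
  roll-back:
    runs-on: ubuntu-latest
    # Run only when a job this one depends on (the smoke tests) has failed.
    needs: smoke-tests
    if: failure()
    steps:
    # Hypothetical placeholder - replace with your actual rollback mechanism,
    # for example redeploying the previous, known-good release.
    - name: Redeploy Previous Version
      run: echo "Rolling back to ${{ env.PREVIOUS_RELEASE_TAG }}"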

The situation with negative tests is a little bit different.

Negative Tests

While smoke tests provide feedback on whether crucial functions of the system are working as expected, negative tests provide feedback on how the system will behave in invalid scenarios. A good example is unencrypted requests over HTTP. They are not secure, and we want to disable them at the host level by configuring a redirect. A negative test verifying that can look like the one below.

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Application Host' {
  It 'Does not serve pages over HTTP' {
    # Don't follow the redirect - we want to inspect the raw HTTP response.
    $request = [System.Net.WebRequest]::Create("http://$HostName/")
    $request.AllowAutoRedirect = $false
    $request.GetResponse().StatusCode |
      Should -Be 301 -Because "Redirect is forced"
  }
}

As negative tests should be independent of smoke tests and are still considered important, the typical approach is to run them in parallel with the smoke tests.

jobs:
  smoke-tests:
    ...
  negative-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/negative-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI
  ...

There is often a discussion about how to decide whether something is a negative test or a smoke test. The distinction should be based on the impact. Taking our two examples:

  • The host not being responsive is catastrophic - we can't provide any service for our users.
  • The host responding to HTTP requests is something we can live with for a moment. There is secondary protection in the application code, and in our industry context it's only a recommendation, not a requirement.

Of course, context matters, and what is a negative test in one situation might be a critical smoke test in another. The key point is that a failing negative test doesn't have to mean rolling back. The system can still provide a valuable service for users. This is why the strategy in the case of negative tests is often to roll forward: fix the issue as soon as possible and perform another deployment.

Other After-Deployment Tests

Smoke and negative tests are crucial, but they only scratch the surface to provide initial feedback as soon as possible. They should be followed by different types of tests which go into the previously ignored finer details. Depending on your needs, those can be functional, integration, or other types of tests.
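In the post-deployment workflow, a natural place for those tests is after the fast feedback from the smoke and negative tests is green. The sketch below follows the same pattern as the earlier jobs; the functional-tests job and the functional-tests.ps1 script path are hypothetical placeholders for whatever detailed tests you decide to implement.

jobs:
  smoke-tests:
    ...
  negative-tests:
    ...
  functional-tests:
    runs-on: ubuntu-latest
    # Run the slower, more detailed tests only after the fast checks pass.
    needs: [smoke-tests, negative-tests]
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/functional-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI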

You also shouldn't limit running tests to the moment of deployment. Infrastructure is much more fragile than application code, so you should continuously run at least the key tests to ensure that everything is working as expected. You should also adopt health checks (yes, they are usually considered part of monitoring, but sophisticated health checks are often complex tests) to notice when something becomes unavailable or starts to misbehave. You can also go a step further and continuously test how your system will behave when something goes down by adopting chaos engineering.
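Before getting to chaos engineering, it's worth noting that continuously running the key tests doesn't require anything new - the same Pester invocation can live in a separate workflow triggered on a schedule. The sketch below is one possible shape of such a workflow; the hourly cron expression, the reuse of the smoke tests as the "key tests", and the host name value are all assumptions to adjust to your context.

name: continuous-infrastructure-tests

on:
  schedule:
    # Hypothetical schedule - run the key tests every hour.
    - cron: '0 * * * *'
  workflow_dispatch:

env:
  # Placeholder value - point it at your application host.
  APPLICATION_HOST_NAME: sample-application.example.com

jobs:
  key-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/smoke-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI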

Chaos Engineering

Chaos engineering is testing through experimenting. You can think of it as a form of exploratory testing. It's about discovering and building confidence in system resilience by exploring its reactions to infrastructure failures.

The level of 'chaotic-ness' can vary a lot. Chaos Monkey, probably the most famous chaos engineering tool, randomly terminates virtual machines and containers, but there are more structured approaches. A methodical approach to chaos engineering starts by defining a steady state - a measurable set of system characteristics that indicates normal behavior. That steady state is the basis of a hypothesis that the system will remain in the same state after the experiment. To prove or disprove that hypothesis, an experiment is designed. The design of the experiment should include the faults and their targets. Once the design is complete, the experiment is executed by injecting the faults into the environment and capturing the output state. The output state is then verified against the steady state. If the hypothesis has been disproven, the output state should be used for learning and improving the system. If the hypothesis has been proven, it's time to design a new experiment.

Chaos Engineering Process (Loop: Steady State, Hypothesis, Design Experiment, Inject Faults, Verify & Learn, Improve)

Despite chaos engineering being around for several years, its tooling ecosystem wasn't growing as rapidly as one could wish for. That changed in 2020, when AWS announced AWS Fault Injection Simulator. About a year later, Microsoft followed by announcing a public preview of Azure Chaos Studio. Adopting chaos engineering through a managed service has become an option.

What does Azure Chaos Studio offer? Currently (it's still in preview) it provides ~30 faults and actions which can be applied to ~10 targets. What is interesting is that Azure Chaos Studio has two types of faults: service-direct and agent-based. Service-direct faults run directly against resources, while agent-based faults enable in-guest failures on virtual machines (for example, high CPU).

How to adopt Azure Chaos Studio? The service provides capabilities to create experiments through ARM or Bicep. There is also a REST API which can be used to create, manage, and run experiments. Those capabilities can be used to implement an architecture similar to the following, with continuous experiment execution (1, 2, 3, and 4).

Chaos Engineering Architecture Based On GitHub
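In such an architecture, the part that drives the experiments can be just another scheduled GitHub Actions workflow: authenticate to Azure, start the experiment, and verify the steady state with the same Pester tooling. The sketch below is a heavily simplified illustration of that idea - the experiment resource ID, the api-version, the steady-state-tests.ps1 script, and the host name are placeholders, and a real workflow would also wait for the experiment to finish before (or while) verifying the steady state; check the current Azure Chaos Studio documentation for the exact REST call.

name: chaos-experiments

on:
  schedule:
    # Hypothetical schedule - run the experiment once a day.
    - cron: '0 3 * * *'
  workflow_dispatch:

env:
  # Placeholder values - point them at your experiment and application.
  CHAOS_EXPERIMENT_RESOURCE_ID: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Chaos/experiments/sample-experiment
  APPLICATION_HOST_NAME: sample-application.example.com

jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    - name: Start Chaos Experiment
      # The api-version here is an assumption - use the one supported at the time.
      run: |
        az rest --method post \
          --url "https://management.azure.com${{ env.CHAOS_EXPERIMENT_RESOURCE_ID }}/start?api-version=2023-11-01"
    - name: Verify Steady State
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/steady-state-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI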

Not There Yet

With Continuous Testing we have moved a little bit toward the right side of the loop, as part of this practice starts to provide us with feedback from the living system that we can use in our cycle of improvement. Still, there is a significant portion missing.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

There are practices that I haven't touched yet which focus on the missing part - Continuous Operations and Continuous Monitoring. It's quite likely that they are already present in your organization, just not providing feedback to the loop. That's the journey further ahead, and I intend to go there in the next post.

You can find samples for some of the aspects I'm discussing here on GitHub.
