DevOps Practices for Azure Infrastructure - Continuous Testing

So far, as part of my series on implementing DevOps practices for Azure infrastructure, I've walked through Continuous Integration, Continuous Delivery, and Continuous Deployment. In many conversations I've had around implementing DevOps, I've heard the opinion that once you have CI/CD (or CI/CD/CD) you have DevOps. That's not true. DevOps is about a continuous loop of feedback, automation, collaboration, and improvement. As you can see in the picture below, those three practices give only about half of that loop and cover mostly the development side.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

This is why there are more practices on the list.

To complete the loop and talk about a complete DevOps implementation, it's time to start implementing practices that provide feedback from the deployed environment back to the teams and automate operational concerns. In this post, I'm going to discuss Continuous Testing.

The goal of Continuous Testing is to ensure quality at different stages of the development life cycle. It's a practice that applies to both sides of the loop. We have already encountered it as part of the Continuous Integration practice. It's sometimes present as part of Continuous Delivery (for example, running specific tests when versions of modules referenced from the environment repository are being updated), and it should be present as part of Continuous Deployment and later, in the form of after-deployment tests. The after-deployment tests are what I want to focus on here.

Discussing tests often revolves around two aspects: the tools to be used for implementation and the types of tests to be implemented. Those are the two main ingredients of a test strategy (of course a mature test strategy covers much more, but that's a discussion for a different occasion). Let's first consider the tools.

There are no special requirements when choosing tools for infrastructure tests. As long as the stack allows calling APIs, it should be sufficient. Very often application teams use the same tools for testing the infrastructure tied to an application as they use for testing the application itself. Environment teams, on the other hand, look for tools that fit the ecosystem they are familiar with. A popular choice when it comes to Azure is Pester, a test framework for PowerShell. I'm going to use it for the examples here.
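If you want to run the examples from this post locally, a recent Pester can be installed from the PowerShell Gallery. A minimal sketch (the minimum version reflects what the examples assume):

# Install (or update to) Pester 5.x from the PowerShell Gallery for the current user
# (on Windows PowerShell you may also need -SkipPublisherCheck because of the inbox Pester 3.x)
Install-Module -Name Pester -MinimumVersion 5.0 -Scope CurrentUser -Force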

What types of tests should you consider implementing? There are two which I consider a must-have: smoke tests and negative tests.

Smoke Tests

Smoke tests should be the first tests to verify the deployment. Their goal is to quickly provide feedback on crucial functions of the system without delving into finer details. Their implementation should be fast and simple. A typical smoke test verifies that a host is responsive, which in Pester is just a couple of lines:

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Application Host' {
  It 'Serves pages over HTTPS' {
    # Disable redirects so we verify the response of this host, not of a redirect target
    $request = [System.Net.WebRequest]::Create("https://$HostName/")
    $request.AllowAutoRedirect = $false
    $request.GetResponse().StatusCode |
      Should -Be 200 -Because "it's responsive"
  }
}

Notice that we are not trying to determine if the hosted application is healthy beyond getting a successful status code - we are testing the infrastructure, and the application hasn't been deployed yet.

Running smoke tests should be the first job in our post-deployment workflow. GitHub-hosted runners come with Pester, which means that running the tests is just a couple of lines of PowerShell.

jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/smoke-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI
  ...

So, running the tests is not a challenge. But running the tests is not a goal in itself - the goal is to react properly when tests fail. What should we do when smoke tests fail? There are two options to choose from: roll back or roll forward. For smoke tests, we should almost always aim for rolling back. After all, a crucial function of our system is not working, and reverting to the previous stable version is usually the quickest way to fix that. Of course, a roll back may not always be possible, and then you are left with roll forward as the only option. Still, you should aim for that to be an edge case.
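How a roll back gets triggered depends on your setup, but with GitHub Actions the wiring can be as simple as a job that runs only when the smoke tests job has failed. A minimal sketch (the rollback step itself is hypothetical - it could be redeploying the last known good template or module versions):

jobs:
  smoke-tests:
    ...
  roll-back:
    runs-on: ubuntu-latest
    # Run only when a preceding job (here the smoke tests) has failed
    needs: smoke-tests
    if: ${{ failure() }}
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Redeploy Previous Version
      shell: pwsh
      run: |
        # Hypothetical rollback step - e.g. redeploying the last known good
        # versions of the templates/modules for this environment
        ./scripts/redeploy-previous-version.ps1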

The situation with negative tests is a little bit different.

Negative Tests

While smoke tests provide feedback on whether crucial functions of the system are working as expected, negative tests provide feedback on how the system behaves in invalid scenarios. A good example is unencrypted requests over HTTP. They are not secure, and we want to disable them at the host level by configuring a redirect. A negative test verifying that can look like the one below.

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Application Host' {
  It 'Does not serve pages over HTTP' {
    # Disable redirects so we can assert on the redirect response itself
    $request = [System.Net.WebRequest]::Create("http://$HostName/")
    $request.AllowAutoRedirect = $false
    $request.GetResponse().StatusCode |
      Should -Be 301 -Because "redirect is forced"
  }
}

As negative tests should be independent of smoke tests and are still considered important, the typical approach is to run them in parallel with the smoke tests.

jobs:
  smoke-tests:
    ...
  negative-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/negative-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI
  ...

There is often a discussion about how to decide whether something is a negative test or a smoke test. The distinction should be based on impact. Taking our two examples:

  • The host not being responsive is catastrophic: we can't provide any service for our users.
  • The host responding to HTTP requests is something we can live with for a moment: there is secondary protection in the application code, and in our industry context it's only a recommendation, not a requirement.

Of course, context matters, and what is a negative test in one situation might be a critical smoke test in another. The key point is that failing negative tests don't have to mean rolling back - the system can still provide a valuable service for users. This is why the strategy for negative tests is often to roll forward: fix the issue as soon as possible and perform another deployment.
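Rolling forward still needs a trigger for the fix. One way to make a failing negative test actionable without blocking the deployment is to raise an issue automatically from the workflow. A minimal sketch (assumes the default GITHUB_TOKEN is allowed to create issues; the job name and wording are illustrative):

jobs:
  negative-tests:
    ...
  report-negative-tests:
    runs-on: ubuntu-latest
    # When negative tests fail the strategy is to roll forward,
    # so raise an issue instead of rolling back
    needs: negative-tests
    if: ${{ failure() }}
    steps:
    - name: Create Issue
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      run: |
        gh issue create --repo '${{ github.repository }}' \
          --title 'Negative tests failed after deployment ${{ github.sha }}' \
          --body 'Negative tests failed in run ${{ github.run_id }}. Fix forward and redeploy.'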

Other After-Deployment Tests

Smoke and negative tests are crucial, but they only scratch the surface in order to provide initial feedback as soon as possible. They should be followed by other types of tests that go into the previously ignored finer details. Depending on your needs, you should implement functional, integration, or other types of tests.

You also shouldn't limit running tests to the moment of deployment. Infrastructure is much more fragile than application code, so you should continuously run at least the key tests to ensure that everything is still working as expected. You should also adopt health checks (yes, they are usually considered part of monitoring, but sophisticated health checks are often complex tests) to notice when something becomes unavailable or starts to misbehave.
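With GitHub Actions, continuously re-running the key tests can be a separate scheduled workflow that reuses the same Pester containers. A minimal sketch (the workflow name, schedule, and host name value are illustrative):

name: recurring-tests
on:
  schedule:
    # Re-run the key tests every 6 hours
    - cron: '0 */6 * * *'
  workflow_dispatch:
env:
  # Illustrative value - point it at the environment you want to verify
  APPLICATION_HOST_NAME: sample-application.example.com
jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/smoke-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI

You can also go a step further and continuously test how your system will behave when something goes down by adopting chaos engineering.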

Chaos Engineering

Chaos engineering is testing through experimenting. You can think of it as a form of exploratory testing. It's about discovering and building confidence in system resilience by exploring its reactions to infrastructure failures.

The level of 'chaotic-ness' can vary a lot. Chaos Monkey, probably the most famous chaos engineering tool, randomly terminates virtual machines and containers, but there are more structured approaches. A methodical approach to chaos engineering starts by defining a steady state: a measurable set of system characteristics that indicates normal behavior. That steady state is the basis of a hypothesis that the system will remain in the same state after the experiment. To prove or disprove that hypothesis, an experiment is designed. The design of the experiment should include the faults and their targets. Once the design is complete, the experiment is executed by injecting the faults into the environment and capturing the output state. The output state is then verified against the steady state. If the hypothesis has been disproven, the output state should be used for learning and improving the system. If the hypothesis has been proven, it's time to design a new experiment.

Chaos Engineering Process (Loop: Steady State, Hypothesis, Design Experiment, Inject Faults, Verify & Learn, Improve)
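To make the steady state concrete: it can often be captured as an automated check, just like the other tests in this post, and executed before and after fault injection to prove or disprove the hypothesis. A minimal sketch in Pester (the response-time threshold is purely illustrative):

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Steady State' {
  It 'Serves pages over HTTPS within 500 ms' {
    # A measurable characteristic of normal behavior - here, response time
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    $response = [System.Net.WebRequest]::Create("https://$HostName/").GetResponse()
    $stopwatch.Stop()

    [int]$response.StatusCode | Should -Be 200
    $stopwatch.ElapsedMilliseconds | Should -BeLessThan 500 -Because "that's what we consider normal"
  }
}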

Despite chaos engineering being around for several years, its tooling ecosystem wasn't growing as rapidly as one could wish. That changed in 2020, when AWS announced AWS Fault Injection Simulator. About a year later, Microsoft followed by announcing a public preview of Azure Chaos Studio. Adopting chaos engineering through a managed service has become an option.

What does Azure Chaos Studio offer? Currently (still in preview) it provides ~30 faults and actions which can be applied to ~10 target types. What is interesting is that Azure Chaos Studio has two types of faults: service-direct and agent-based. Service-direct faults run directly against resources, while agent-based faults enable in-guest failures on virtual machines (for example, high CPU).

How do you adopt Azure Chaos Studio? The service provides capabilities to create experiments through ARM templates or Bicep. There is also a REST API which can be used to create, manage, and run experiments. Those capabilities can be used to implement an architecture similar to the following, with continuous experiment execution (1, 2, 3, and 4).

Chaos Engineering Architecture Based On GitHub
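For example, triggering an experiment run from a workflow can be done by calling the management API for an existing experiment. A rough sketch using the Azure CLI (assumes the workflow is already logged in via azure/login with an AZURE_CREDENTIALS secret; the experiment resource ID, action path, and api-version are assumptions - check the current Azure Chaos Studio REST reference):

jobs:
  run-chaos-experiment:
    runs-on: ubuntu-latest
    env:
      # Full ARM resource ID of an existing Microsoft.Chaos/experiments resource (illustrative)
      EXPERIMENT_ID: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Chaos/experiments/<experiment-name>
    steps:
    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    - name: Start Chaos Experiment
      run: |
        # Start the experiment through the ARM management endpoint
        az rest --method post \
          --uri "https://management.azure.com${EXPERIMENT_ID}/start?api-version=2023-11-01"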

Not There Yet

With Continuous Testing we have moved a little bit toward the right side of the loop, as part of this practice starts to provide us with feedback from the living system that we can use in our cycle of improvement. Still, there is a significant portion missing.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

There are practices that I haven't touched on yet which focus on the missing part: Continuous Operations and Continuous Monitoring. It's quite likely that they are already present in your organization, just not providing feedback to the loop. That is the journey further down the road, and I intend to go there in the next post.

You can find samples for some of the aspects I'm discussing here on GitHub.