Containers have become one of the main, if not the main, ways to modularize, isolate, encapsulate, and package applications in the cloud. The sidecar pattern takes this even further by allowing the separation of functionalities like monitoring, logging, or configuration from the business logic. This is why I recommend that teams adopting containers adopt sidecars as well. One of my preferred suggestions is Dapr, which can bring early value by providing abstractions for message broker integration, encryption, observability, secret management, state management, or configuration management.

To my surprise, many conversations that start around adopting sidecars quickly deviate to "we should set up a Kubernetes cluster". It's almost like there are only two options out there - you either run a standalone container or you need Kubernetes for anything more complicated. This is not the case. There are multiple ways to run containers and you should choose the one that is most suitable for your current context. Many of those options will give you more sophisticated features like sidecars or init containers while your business logic is still in a single container. Sidecars provide an additional benefit here: they enable later evolution to more complex container hosting options without requiring code changes.

In the case of Azure, such a service that enables adopting sidecars at an early stage is Azure Container Instances.

Quick Reminder - Azure Container Instances Can Host Container Groups

Azure Container Instances provides a managed approach for running containers in a serverless manner, without orchestration. What I've learned is that a common misconception is that Azure Container Instances can host only a single container. That is not exactly true - Azure Container Instances can host a container group (if you're using Linux containers 😉).

A container group is a collection of containers scheduled on the same host and sharing lifecycle, resources, or local network. The container group has a single public IP address, but the publicly exposed ports can forward to ports exposed on different containers. At the same time, all the containers within the group can reach each other via localhost. This is what enables the sidecar pattern.

How to create a container group? There are three options:

  • With ARM/Bicep
  • With the Azure CLI and a YAML file (see the example after this list)
  • With the Azure CLI and a Docker Compose file
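
For reference, the YAML option boils down to a single Azure CLI call (a minimal sketch; container-group.yaml is a hypothetical file name):

az container create -g $RESOURCE_GROUP --file container-group.yaml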

I'll go with Bicep here. The Microsoft.ContainerInstance namespace contains only a single type: containerGroups. This means that from an ARM/Bicep perspective there is no difference between deploying a standalone container and a container group - there is a containers list available as part of the resource properties where you specify the containers.

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  name: CONTAINER_GROUP
  location: LOCATION
  ...
  properties: {
    sku: 'Standard'
    osType: 'Linux'
    ...
    containers: [
      ...
    ]
    ...
  }
}

How about a specific example? I've mentioned that Dapr is one of my preferred sidecars, so I'm going to use it here.

Running Dapr in Self-Hosted Mode Within a Container Group

Dapr has several hosting options. It can be self-hosted with Docker, Podman, or without containers. It can be hosted in Kubernetes with first-class integration. It's also available as a serverless offering - part of Azure Container Apps. The option that interests us in the context of Azure Container Instances is self-hosted with Docker, but from that list you can already see how Dapr enables easy evolution from Azure Container Instances to Azure Container Apps, Azure Kubernetes Service, or non-Azure Kubernetes clusters.

But before we are ready to deploy the container group, we need some infrastructure around it. We should start with a resource group, a container registry, and a managed identity.

az group create -l $LOCATION -g $RESOURCE_GROUP
az acr create -n $CONTAINER_REGISTRY -g $RESOURCE_GROUP --sku Basic
az identity create -n $MANAGED_IDENTITY -g $RESOURCE_GROUP

We will be using the managed identity for role-based access control where possible, so we should reference it as the identity of the container group in our Bicep template.

resource managedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' existing = {
  name: MANAGED_IDENTITY
}

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${managedIdentity.id}': {}
    }
  }
  properties: {
    ...
  }
}

The Dapr sidecar requires a components directory. It's a folder that will contain YAML files with component definitions. To provide that folder to the Dapr sidecar container, we have to mount it as a volume. Azure Container Instances supports mounting an Azure file share as a volume, so we have to create one.

az storage account create -n $STORAGE_ACCOUNT -g $RESOURCE_GROUP --sku Standard_LRS
az storage share create -n daprcomponents --account-name $STORAGE_ACCOUNT

The created Azure file share needs to be added to the list of volumes that can be mounted by containers in the group. Sadly, the integration between Azure Container Instances and Azure Files doesn't support role-based access control, so an access key has to be used.

...

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-09-01' existing = {
  name: STORAGE_ACCOUNT
}

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  properties: {
    ...
    volumes: [
      {
        name: 'daprcomponentsvolume'
        azureFile: {
          shareName: 'daprcomponents'
          storageAccountKey: storageAccount.listKeys().keys[0].value
          storageAccountName: storageAccount.name
          readOnly: true
        }
      }
    ]
    ...
  }
}

We also need to assign the AcrPull role to the managed identity so it can access the container registry.

az role assignment create --assignee $MANAGED_IDENTITY_OBJECT_ID \
    --role AcrPull \
    --scope "/subscriptions/$SUBSCRIPTION_ID/resourcegroups/$RESOURCE_GROUP/providers/Microsoft.ContainerRegistry/registries/$CONTAINER_REGISTRY"

I'm skipping the creation of the image for the application with the business logic, pushing it to the container registry, adding its definition to the containers list, and exposing needed ports from the container group - I want to focus on the Dapr sidecar.
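
For completeness, below is a minimal sketch of what that skipped part could look like - the image reference, port, and resource requests are illustrative, and the imageRegistryCredentials entry is what lets the container group pull images from the registry using the managed identity:

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  properties: {
    ...
    imageRegistryCredentials: [
      {
        server: '${CONTAINER_REGISTRY}.azurecr.io'
        identity: managedIdentity.id
      }
    ]
    ipAddress: {
      type: 'Public'
      ports: [
        {
          port: 443
          protocol: 'TCP'
        }
      ]
    }
    containers: [
      {
        name: 'sample-application'
        properties: {
          image: '${CONTAINER_REGISTRY}.azurecr.io/sample-application:1.0.0'
          ports: [
            {
              port: 443
              protocol: 'TCP'
            }
          ]
          resources: {
            requests: {
              cpu: 1
              memoryInGB: 1
            }
          }
        }
      }
      ...
    ]
    ...
  }
}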

In this example, I will be grabbing the daprd image from Docker Hub.

The startup command for the sidecar is ./daprd. We need to provide the --resources-path parameter, which must point to the path where the daprcomponentsvolume will be mounted. I'm also providing the --app-id parameter. This parameter is mostly used for service invocation (which won't be the case here, and I'm not providing --app-port), but Dapr also uses it in other scenarios (for example, as a partition key for some state stores).

Two ports need to be exposed from this container (not publicly): 3500 is the default HTTP endpoint port and 50001 is the default gRPC endpoint port. There is an option to change both ports through configuration if they are needed by some other container.

resource containerGroup 'Microsoft.ContainerInstance/containerGroups@2023-05-01' = {
  ...
  properties: {
    ...
    containers: [
      ...
      {
        name: 'dapr-sidecar'
        properties: {
          image: 'daprio/daprd:1.10.9'
          command: [ './daprd', '--app-id', 'APPLICATION_ID', '--resources-path', './components']
          volumeMounts: [
            {
              name: 'daprcomponentsvolume'
              mountPath: './components'
              readOnly: true
            }
          ]
          ports: [
            { 
              port: 3500
              protocol: 'TCP'
            }
            { 
              port: 50001
              protocol: 'TCP'
            }
          ]
          ...
        }
      }
    ]
    ...
  }
}

I've omitted the resources definition for brevity.

Now the Bicep template can be deployed.

az deployment group create -g $RESOURCE_GROUP -f container-group-with-dapr-sidecar.bicep

The below diagram visualizes the final state after the deployment.

Diagram of Azure Container Instances hosting a container group including application container and Dapr sidecar integrated with Azure Container Registry and having Azure file share mounted as volume with Dapr components definitions.

Configuring a Dapr Component

We have a running Dapr sidecar, but we have yet to make it truly useful. To be able to use the APIs provided by Dapr, we have to provide the component definitions mentioned earlier, which supply the implementation for those APIs. As we already have a storage account as part of our infrastructure, a state store component seems like a good choice. Dapr supports quite an extensive list of stores, out of which two are based on Azure Storage: Azure Blob Storage and Azure Table Storage. Let's use the Azure Table Storage one.

First, I'm going to create a table. This is not a required step - the component can do it for us - but let's assume we want to seed some data manually before the deployment.

Second, the more important operation is granting the needed permissions to the storage account. Dapr has very good support for authenticating to Azure, which includes managed identities and role-based access control, so I'm just going to assign the Storage Table Data Contributor role to our managed identity for the scope of the storage account.

az storage table create -n $TABLE_NAME --account-name $STORAGE_ACCOUNT
az role assignment create --assignee $MANAGED_IDENTITY_OBJECT_ID \
    --role "Storage Table Data Contributor" \
    --scope "/subscriptions/$SUBSCRIPTION_ID/resourcegroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"

The last thing we need is the component definition. The component type we want is state.azure.tablestorage. The name is what we will be using when making calls with a Dapr client. As we are going to use a managed identity for authentication, we should provide accountName, tableName, and azureClientId as metadata. I'm additionally setting skipCreateTable because I created the table earlier and the component would fail on an attempt to create it once again.

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: state.table.<TABLE_NAME>
spec:
  type: state.azure.tablestorage
  version: v1
  metadata:
  - name: accountName
    value: <STORAGE_ACCOUNT>
  - name: tableName
    value: <TABLE_NAME>
  - name: azureClientId
    value: <Client ID of MANAGED_IDENTITY>
  - name: skipCreateTable
    value: "true"

The file with the definition needs to be uploaded to the file share that is mounted as the components directory. The container group then needs to be restarted for the component to be loaded. We can quickly verify that it has been loaded by taking a look at the logs.
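
Assuming the definition is saved as statestore.yaml (an illustrative file name), the upload, the restart, and checking the sidecar logs can all be done with the Azure CLI:

az storage file upload -s daprcomponents --source statestore.yaml --account-name $STORAGE_ACCOUNT
az container restart -n $CONTAINER_GROUP -g $RESOURCE_GROUP
az container logs -n $CONTAINER_GROUP -g $RESOURCE_GROUP --container-name dapr-sidecar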

time="2023-08-31T21:25:22.5325911Z"
level=info
msg="component loaded. name: state.table.<TABLE_NAME>, type: state.azure.tablestorage/v1"
app_id=APPLICATION_ID
instance=SandboxHost-638291138933285823
scope=dapr.runtime
type=log
ver=1.10.9

Now you can start managing your state with a Dapr client for your language of choice, or with the HTTP API if a client doesn't exist.
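
For example, the state can be reached through the Dapr HTTP API directly from the application container (the key and value below are illustrative):

curl -X POST http://localhost:3500/v1.0/state/state.table.<TABLE_NAME> \
  -H 'Content-Type: application/json' \
  -d '[ { "key": "order-1", "value": { "status": "created" } } ]'

curl http://localhost:3500/v1.0/state/state.table.<TABLE_NAME>/order-1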

The Power of Abstraction, Decoupling, and Flexibility

As you can see, the needed increase in complexity (when compared to a standalone container hosted in Azure Container Instances) is not that significant. At the same time, the gain is. Dapr allows us to abstract all the capabilities it provides in the form of building blocks. It also decouples the capabilities provided by building blocks from the components providing implementation. We can change Azure Table Storage to Azure Cosmos DB if it better suits our solution, or to AWS DynamoDB if we need to deploy the same application to AWS. We also now have the flexibility of evolving our solution when the time comes to use a more sophisticated container offering - we just need to take Dapr with us.

This series on implementing DevOps practices for Azure infrastructure is nearing its conclusion. The last part remaining is completing the operations side of the loop.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

This brings focus to the last two practices on our list:

Continuous Operations & Continuous Monitoring

The Continuous Operations and Continuous Monitoring practices are closely tied together. They jointly serve the goal of ensuring the overall reliability, resiliency, and security of solutions. The majority of capabilities supporting that goal are within the scope of the Continuous Operations practice and cover aspects like compliance enforcement, cost management, proactive maintenance, security posture management, and intelligence-driven responses to operational and security events. That said, most of those capabilities can't be achieved without capabilities coming from the Continuous Monitoring practice. There can be no cost management without cost tracking. There is no way to have proactive maintenance and intelligence-driven responses without gathering observability signals, configuring alerts, and building dashboards.

Organizations usually have the capabilities covered by Continuous Operations and Continuous Monitoring already established, but often they are not aligned with DevOps cultural philosophies. This means that implementing those practices is often about addressing gaps around automation, collaboration, continuous feedback, and continuous improvement.

But before we start addressing those gaps, it's worth making sure that the capabilities have been established on the right foundations, as Azure provides a wide range of services to support us here:

  • Azure Policy for compliance enforcement.
  • Azure Monitor with its insights, visualization, analytics, and response stack for gathering observability signals, configuring alerts, and building dashboards.
  • Microsoft Defender for Cloud for workload protection and security posture management.
  • Azure Sentinel for security information and event management (SIEM) as well as security orchestration, automation, and response (SOAR).
  • Azure Automation and Azure Logic Apps for automating event-based intelligence-driven responses and orchestrating proactive maintenance.
  • Microsoft Cost Management and Billing for cost tracking and management.

With the right foundations in place, we can focus on aspects that make the difference between "being DevOps" and "not being DevOps". The most crucial one is ensuring that everyone has access to information on how the part they are responsible for is behaving in production.

All Teams Need Observability Signals

As you may remember from the post on Continuous Delivery and Continuous Deployment, at a certain size solutions often start moving from centralized ownership to being owned by multiple independent application teams and an environment team. This dynamic needs to be reflected in the monitoring architecture as well. A single, centralized monitoring service, although needed by the environment team, may not be sufficient. This is why mature monitoring implementations utilize resource-context observability signals and granular insights, visualization, and analytics workspaces from which the signals are later centrally aggregated. This approach enables every application team to have direct access to its signals, configure alerts, and build dashboards, while the environment team still has visibility into the whole picture.
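
In Azure Monitor terms, this usually means a dedicated Log Analytics workspace per application team, with the diagnostic settings of that team's resources pointing at it. A minimal sketch with the Azure CLI (the names, retention, and categories are illustrative):

az monitor log-analytics workspace create -n $APPLICATION_WORKSPACE -g $APPLICATION_RESOURCE_GROUP --retention-time 30
az monitor diagnostic-settings create -n application-diagnostics \
    --resource $APPLICATION_RESOURCE_ID \
    --workspace $APPLICATION_WORKSPACE_ID \
    --metrics '[{ "category": "AllMetrics", "enabled": true }]'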

This approach also enables democratization when it comes to the tools themselves. The native observability stack in Azure is provided by Azure Monitor, but that no longer means application teams are limited to Application Insights. If they prefer, they can use Prometheus and Grafana for metrics (which is great when they need to be more cloud-agnostic or are looking at adopting OpenTelemetry).

Diagram representing a democratized monitoring architecture with dedicated workspaces for every team and teams using different tools

Of course, such a democratized monitoring architecture cannot be left without governance. There need to be rules around observability signal granularity, retention, and archiving data to cool-tier storage. Otherwise, we can be very unpleasantly surprised by the cost of our monitoring implementation.

Automated responses should also export proper context information to the respective tools, because part of an automated response should be creating a proper item in the collaboration tool to ensure continuous feedback. What item should that be? That depends on the event category.

Operational Events Should Create Issues

From the infrastructure perspective, there are usually two main types of operational events that are potentially interesting:

  • Resources events like creation, deletion, or modification
  • Alerts defined in Azure Monitor

The usage of resource events often covers adding special tags, granting permissions to special groups, or reacting to failed delete/create/update operations.

Alerts are usually raised when the measured state of the system deviates from what is considered a baseline. To name just a few examples, this can mean networking issues, an erroneously stopped VM, or a resource reaching its capacity.

The remediation for every resource event or alert can be different. In some cases, the remediation can be fully automated (restarting a VM, truncating tables in a database, or increasing RUs for Azure Cosmos DB). In some cases, all that is needed is just a notification to deal with the problem at the earliest convenience (failure to delete a resource). There are also those cases that require waking up an engineer immediately (networking issues).

In the first post of this series, I wrote that the cornerstone of implementing DevOps practices for Azure infrastructure is infrastructure as code and the Git ecosystem used for collaboration. This means that regardless of whether the remediation is fully automated or an engineer needs to be engaged, part of the process should be issue creation (if the remediation has already been performed, that issue can be closed and exist just for tracking purposes). In the stack I've chosen for this series, the Git ecosystem is GitHub. Integrating GitHub issue creation into the response workflow is not a huge challenge, because there is a ready-to-use GitHub connector for Azure Logic Apps. So, if we consider alerts, this means that we can build an automated response flow by using Azure Monitor Alerts, an Azure Monitor Action Group, and Azure Logic Apps.
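
A minimal sketch of wiring up the alert part of such a flow with the Azure CLI (the Logic App trigger URL, the alert condition, and the names are illustrative; the Logic App itself would contain the GitHub connector step):

az monitor action-group create -n ag-operational-events -g $RESOURCE_GROUP \
    --action webhook create-github-issue $LOGIC_APP_TRIGGER_URL

az monitor metrics alert create -n alert-high-cpu -g $RESOURCE_GROUP \
    --scopes $VM_RESOURCE_ID \
    --condition "avg Percentage CPU > 90" \
    --action ag-operational-events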

Diagram representing an automated response flow for an alert raised in Azure Monitor which uses Azure Logic App to create issues in GitHub and perform remediation action

An almost identical flow can be built for resource events if we use Azure Event Grid in place of Azure Monitor (Azure Event Grid supports resource groups and subscriptions as sources).

This is the approach that should be applied to ensure collaboration and continuous feedback when it comes to operational events. How about security events?

Security Events Should Create Vulnerabilities

Security events have a specific lifecycle that falls under the responsibility of the organization's Security Operations Center (SOC). It's the SOC that uses available observability signals, CVE alerting platforms like OpenCVE, and other tools to detect, investigate, and remediate threats. In the case of Azure, Azure Sentinel is the one-stop shop to build and automate this responsibility.

That said, the SOC usually deals with the immediate remediation of a threat. For example, a SOC operator or automation may determine that, because a new CVE has been disclosed, a specific resource needs to be isolated to mitigate the threat. The only action performed will be the isolation - the responsibility for mitigating the CVE stays with the application or environment team. In such cases, the SOC operator or automation should report the specific vulnerability, with context and findings, in the collaboration tool. When using GitHub as the Git ecosystem for collaboration, a great way to report such vulnerabilities may be through security advisories.

Security advisories facilitate the process of reporting, discussing, and fixing vulnerabilities. Creating security advisories requires the admin or security manager role within the repository, so the integration must be designed properly to avoid excessive permissions within the organization. My approach is to create a GitHub App. GitHub Apps use OAuth 2.0 and can act on behalf of a user, which in this case will be the SOC operator or automation. To make the creation of security advisories available directly from Azure Sentinel, I expose a webhook from the GitHub App which can be called by a Playbook.
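
A minimal sketch of the call such a GitHub App could make once its webhook is invoked, assuming it has already exchanged its credentials for an installation token (the repository, summary, severity, and package name are illustrative):

curl -X POST "https://api.github.com/repos/$OWNER/$REPO/security-advisories" \
    -H "Authorization: Bearer $INSTALLATION_TOKEN" \
    -H "Accept: application/vnd.github+json" \
    -d '{
          "summary": "Disclosed CVE affects the sample application host",
          "description": "Reported by SOC. The affected resource has been isolated, mitigation is pending.",
          "severity": "high",
          "vulnerabilities": [ { "package": { "ecosystem": "other", "name": "sample-application" } } ]
        }'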

Diagram representing a security advisory creation flow through Azure Sentinel Playbook and GitHub App

Providing automated tools which don't require context switching from the SOC perspective removes roadblocks, which is crucial for the adoption of collaboration and continuous feedback between otherwise disconnected teams. This is the true spirit of DevOps.

Infrastructure Drift Detection

There is one capability in the context of Continuous Monitoring and Continuous Operations, which is very specific to infrastructure - detecting drift.

As I have shown throughout the series, if we want to implement DevOps practices for Azure infrastructure, the infrastructure should be changed only through modifying and deploying its code. The repository should be the single source of truth. But sometimes, when there is pressure, stress, or time constraints (for example when solving a critical issue), engineers do take shortcuts and modify the infrastructure directly. It's not that big of an issue if such an engineer later reflects the changes in the infrastructure code. But humans are humans and they sometimes forget. This can cause the environment to drift from its source of truth, creating potential risks ranging from the applied change being reverted by the next deployment to deployment failures. This is why detecting drift is important.

Infrastructure drift detection is a complex problem. Depending on the chosen stack, there are different tools you can use to make it as sophisticated as you need. Here, as an example, I'm going to show a mechanism that can be set up quickly based on the stack I've already used throughout this series. It's far from perfect, but it's a good start. It uses the what-if command, which I've already used for creating previews of changes as part of the Continuous Integration implementation.

az deployment group what-if \
    --resource-group rg-devops-practices-sample-application-prod \
    --template-file applications/sample-application/application.bicep \
    --mode Complete \
    --no-pretty-print

You may notice two differences between the usage of what-if for previews and drift detection.

The first difference is the Complete deployment mode. The difference between the Incremental (default) and Complete deployment modes is that in the latter, resources that exist in the resource group but aren't specified in the template will be deleted instead of ignored.

The second difference is the output format. For the previews, I wanted something human-readable, but here I prefer something which will be easy to process programmatically. Providing the --no-pretty-print switch changes the output format to JSON. Below you can see a snippet of it.

{
  "changes": [
    {
      "after": null,
      "before": {
        "name": "kvsampleapplication",
        ...
      },
      "changeType": "Delete",
      ...
    },
    {
      "after": {
        "name": "id-sampleapplication-gd3f7mnjwpuyu",
        ...
      },
      "before": {
        "name": "id-sampleapplication-gd3f7mnjwpuyu",
        ...
      },
      "changeType": "NoChange",
      ...
    },
    ...
  ],
  "error": null,
  "status": "Succeeded"
}

Our attention should focus on the changeType property. It provides information on what will happen to the resource after the deployment. The possible values are: Create, Delete, Ignore, NoChange, Modify, and Deploy. Create, Delete, and NoChange are self-explanatory. The Ignore value should not be present in the case of Complete deployment mode unless limits (number of nested templates or expanding time) have been reached - in such a case, it means that the resource hasn't been evaluated. Modify and Deploy are tricky. They mean that the properties of the resource will be changed after the deployment. Unfortunately, the Resource Manager is not perfect here and those two can give false positive predictions. This is why this technique is far from perfect - the only drift that can be reliably detected is missing resources or resources which shouldn't exist. But, as I said, it's a good start, as we can quickly create a GitHub Actions workflow that will perform the detection. Let's start by checking out the deployed tag and connecting to Azure.

...

env:
  TAG: sample-application-v1.0.0

jobs:
  drift-detection:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
    - name: Checkout
      uses: actions/checkout@v3
      with:
        ref: ${{ env.TAG }}
    - name: Azure Login
      uses: azure/login@v1
      with:
        ...
  ...

The next step is to run a script that will call what-if and process the results to create an array of detected changes.

...

env:
  ...
  RESOURCE_GROUP: 'rg-devops-practices-sample-application-prod'

jobs:
  drift-detection:
    ...
    steps:
    ...
    - name: Detect infrastructure drift
      shell: pwsh
      run: |
        $issues = @()

        $drift = az deployment group what-if `
          --resource-group $env:RESOURCE_GROUP `
          --template-file applications/sample-application/application.bicep `
          --mode Complete `
          --no-pretty-print | ConvertFrom-Json

        foreach ($change in $drift.Changes)
        {
          switch ($change.changeType)
          {
            'Create'
            {
              $issues += @{
                ResourceName = $change.after.name
                Description = 'Defined resource doesn''t exist'
              }
            }
            'Delete'
            {
              $issues += @{
                ResourceName = $change.before.name
                Description = 'Undefined resource exists'
              }
            }
          }
        }

        'DRIFT_ISSUES<<EOF' >> $env:GITHUB_ENV
        $issues | ConvertTo-Json -AsArray >> $env:GITHUB_ENV
        'EOF' >> $env:GITHUB_ENV
  ...

Having gathered all the changes, we can use the proven actions/github-script action to create an issue for every detected change.

...

jobs:
  drift-detection:
    ...
    permissions:
      ...
      issues: write
    steps:
    ...
    - name: Report detected infrastructure drift
      uses: actions/github-script@v6
      with:
        script: |
          const issues = JSON.parse(process.env.DRIFT_ISSUES);
          for (const issue of issues) {
            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: '[DRIFT DETECTED] ' + issue.Description + ' (' + issue.ResourceName + ')'
            });
          }
  ...

We can have this workflow running regularly. It will create issues like the one in the screenshot and give us a starting point for drift detection.

Drift Detection (Sample Application) workflow - created issue for undefined resource
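
To have the detection running regularly, a schedule trigger can be added to the workflow (the cron expression below is just an example):

on:
  schedule:
  - cron: '0 6 * * *'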

The Journey Never Ends

With Continuous Operations and Continuous Monitoring practices, we have closed the loop.

DevOps Pipeline With Tools for Create, Verify, Package, Release, Operate, and Monitor Stages

But the nature of a loop is that an end is also the beginning. The implementation of DevOps is never "done". That's a direct consequence of its core cultural philosophies: continuous feedback and continuous improvement. Regardless of how your initial implementation looks, you should constantly evaluate it in the context of the ecosystem around it and evolve. This will mean modifying the implementation of already established practices, but also implementing new, complementary ones (like Continuous Learning or Continuous Documentation).

The goal of this series was to draw the overall picture and provide examples that will bring that picture to life. The accompanying repository contains working workflows that can kickstart your journey.

So far, as part of my series on implementing DevOps practices for Azure infrastructure, I've walked through Continuous Integration, Continuous Delivery, and Continuous Deployment. In many conversations I've had around implementing DevOps, I've heard the opinion that once you have CI/CD (or CI/CD/CD) you have DevOps. That's not true. DevOps is about a continuous loop of feedback, automation, collaboration, and improvement. As you can see in the picture below, those three practices give only about half of that loop and cover mostly the development side.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

This is why there are more practices on the list:

  • Continuous Testing
  • Continuous Operations
  • Continuous Monitoring

To complete the loop and speak about a complete DevOps implementation, it's time to start implementing practices that provide feedback from the deployed environment to the teams and automate operational concerns. In this post, I'm going to discuss Continuous Testing.

The goal of Continuous Testing is to ensure quality at different stages of the development life cycle. It's a practice that applies to both sides of the loop. We have already encountered it as part of the Continuous Integration practice. It's sometimes present as part of Continuous Delivery (for example, running specific tests when versions of modules referenced from the environment repository are updated), and it should be present as part of Continuous Deployment and later, in the form of after-deployment tests. The after-deployment tests are what I want to focus on.

Discussing tests often revolves around discussing two aspects: tools to be used for implementation and types of tests to be implemented. Those are two main ingredients used to create a test strategy (of course a mature test strategy covers much more, but it's a discussion for a different occasion). Let's first consider the tools.

There are no real special requirements when choosing tools for infrastructure tests. As long as the stack allows calling APIs, it should be sufficient. Very often, application teams use the same tools for testing the infrastructure tied to the application as they use for testing the application itself. The environment teams, on the other hand, are looking for specific tools which fit the ecosystem they are familiar with. A popular choice when it comes to Azure is Pester, a test framework for PowerShell. I'm going to use it for the examples here.

What types of tests should you consider implementing? There are two which I consider a must-have - smoke tests and negative tests.

Smoke Tests

Smoke tests should be the first tests to verify the deployment. Their goal is to quickly provide feedback on crucial functions of the system without delving into finer details. Their implementation should be fast and simple. A typical smoke test is a verification that a host is responsive, which in Pester is just a couple of lines:

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Application Host' {
    It 'Serves pages over HTTPS' {
      $request = [System.Net.WebRequest]::Create("https://$HostName/")
      $request.AllowAutoRedirect = $false
      $request.GetResponse().StatusCode |
        Should -Be 200 -Because "It's responsive"
    }
}

Notice that we are not trying to determine if the hosted application is healthy beyond getting a successful status code - we are testing infrastructure and the application hasn't been deployed yet.

Running smoke tests should be the first job in our post-deployment workflow. GitHub-hosted runners come with Pester, which means that running the tests is just a couple of lines of PowerShell.

jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/smoke-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI
  ...

So, running the tests is not a challenge. But running the tests is not the goal in itself. The goal is to properly react when tests fail. What should we do when smoke tests fail? There are two options we can choose from: roll back or roll forward. For smoke tests, we should almost always aim for rolling back. After all, a crucial function of our system is not working, and reverting to the previous stable version is usually the quickest way to fix this. Of course, rolling back may not always be possible, and then you are left with rolling forward as the only option. Still, you should aim for this to be an edge case.

The situation with negative tests is a little bit different.

Negative Tests

While smoke tests provide feedback on whether crucial functions of the system are working as expected, negative tests are there to provide feedback on how the system will behave in invalid scenarios. A good example is unencrypted requests over HTTP. They are not secure, and we want to disable them at the host level by configuring a redirect. A negative test to verify that can look like the one below.

param(
  [Parameter(Mandatory)]
  [ValidateNotNullOrEmpty()]
  [string] $HostName
)

Describe 'Application Host' {
    It 'Does not serve pages over HTTP' {
      $request = [System.Net.WebRequest]::Create("http://$HostName/")
      $request.AllowAutoRedirect = $false
      $request.GetResponse().StatusCode | 
        Should -Be 301 -Because "Redirect is forced"
    }
}

As negative tests should be independent of smoke tests and still considered important, the typical approach is to run them in parallel with smoke tests.

jobs:
  smoke-tests:
    ...
  negative-tests:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Run Pester Tests
      shell: pwsh
      run: |
        $container = New-PesterContainer `
          -Path 'applications/sample-application/tests/negative-tests.ps1' `
          -Data @{ HostName = '${{ env.APPLICATION_HOST_NAME }}' }
        Invoke-Pester -Container $container -CI
  ...

There is often a discussion about how to decide if something is a negative test or a smoke test. The distinction should be based on the impact. Taking our two examples:

  • Host not being responsive is catastrophic, we can't provide any service for our users.
  • Host responding to HTTP requests is something we can live with for a moment. There is secondary protection in the application code and in our industry context, it's only a recommendation, not a requirement.

Of course, context matters, and what is a negative test in one situation might be a critical smoke test in another. The key aspect is that failing negative tests don't have to mean rolling back. The system can still provide a valuable service for users. This is why the strategy in the case of negative tests is often to roll forward, fix the issue as soon as possible, and perform another deployment.

Other After-Deployment Tests

Smoke and negative tests are crucial, but they are only scratching the surface to provide initial feedback as soon as possible. They should be followed by different types of tests which go into previously ignored finer details. Depending on the needs you should implement functional, integration, or other types of tests.

You also shouldn't limit running tests only to the deployment moment. Infrastructure is much more fragile than application code, so you should continuously run at least the key tests to ensure that everything is working as expected. You should also adopt health checks (yes they are usually considered part of monitoring, but sophisticated health checks are often complex tests) to notice when something becomes unavailable or starts to misbehave. You can also go a step further and continuously test how your system will behave when something goes down by adopting chaos engineering.

Chaos Engineering

Chaos engineering is testing through experimenting. You can think of it as a form of exploratory testing. It's about discovering and building confidence in system resilience by exploring its reactions to infrastructure failures.

The level of 'chaotic-ness' can be very different. Chaos Monkey, probably the most famous chaos engineering tool, randomly terminates virtual machines and containers, but there are more structured approaches. The methodical approach to chaos engineering starts by defining a steady state, a measurable set of system characteristics that indicates normal behavior. That steady state is the basis of a hypothesis that the system will continue in the same state after the experiment. To prove or disprove that hypothesis, an experiment is designed. The design of the experiment should include faults and their targets. Once the design is complete, the experiment is executed by injecting the faults into the environment and capturing the output state. The output state is verified against the steady state. If the hypothesis has been disproven, the output state should be used for learning and improvement of the system. If the hypothesis has been proven, it's time to design a new experiment.

Chaos Engineering Process (Loop: Steady State, Hypothesis, Design Experiment, Inject Faults, Verify & Learn, Improve)

Despite the discipline being around for several years, the tooling ecosystem for chaos engineering wasn't growing as rapidly as one could wish. That changed in 2020, when AWS announced AWS Fault Injection Simulator. About a year later, Microsoft followed by announcing a public preview of Azure Chaos Studio. Adopting chaos engineering through a managed service has become an option.

What does Azure Chaos Studio offer? Currently (still in preview) it provides ~30 faults and actions which can be applied to ~10 targets. What is interesting is that Azure Chaos Studio has two types of faults: service-direct and agent-based. Service-direct faults run directly against resources, while agent-based faults enable in-guest failures on virtual machines (for example, high CPU).

How to adopt Azure Chaos Studio? The service provides capabilities to create experiments through ARM or Bicep. There is also a REST API which can be used to create, manage, and run experiments. Those capabilities can be used to implement an architecture similar to the following, with continuous experiment execution (1, 2, 3, and 4).

Chaos Engineering Architecture Based On GitHub

Not There Yet

With Continuous Testing we have moved a little bit toward the right side of the loop, as part of this practice starts to provide us with feedback from the living system that we can use in our cycle of improvement. Still, there is a significant portion missing.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

There are practices that I haven't touched yet which focus on the missing part - Continuous Operations and Continuous Monitoring. It's quite likely that they are already present in your organization, just not providing feedback to the loop. That is the journey ahead, and I intend to go there in the next post.

You can find samples for some of the aspects I'm discussing here on GitHub.

In my previous post, I started the journey of implementing DevOps practices for infrastructure. I proposed an implementation of the Continuous Integration practice, which covers the Create and Verify stages of the DevOps pipeline.

DevOps Pipeline With Tools for Create and Verify Stages

But Continuous Integration is just the first of several practices which should be implemented for a complete pipeline:

  • Continuous Integration
  • Continuous Delivery
  • Continuous Deployment
  • Continuous Testing
  • Continuous Operations
  • Continuous Monitoring

In this post, I want to focus on Continuous Delivery and Continuous Deployment practices which are there to pick up where Continuous Integration has finished and continue through the Package and Release stages of the pipeline.

Continuous Delivery vs. Continuous Deployment

Quite often, when I'm discussing the Software Development Life Cycle with teams, there is confusion around Continuous Delivery and Continuous Deployment. Teams will often say that they are doing CI/CD, and when I ask about the CD part, the terms Continuous Delivery and Continuous Deployment are used interchangeably. A lot of marketing "What is DevOps" articles also don't help by confusing the terms. So what is the difference?

In short, Continuous Delivery is about making artifacts ready for deployment and Continuous Deployment is about actually deploying them. That seems to be quite a clear separation, so why the confusion? Because in the real world, they often blend. In an ideal scenario, when the Continuous Integration workflow is finished, the deployment workflow can kick off automatically and get the changes to production. In such a scenario, Continuous Delivery may not be there, and if it is, it will be considered an implicit part of Continuous Deployment. This is where the terms are often misused - the separation is not clear. Continuous Delivery exists in explicit form only when there is some kind of handover or separate step between "packaging" and "deployment".

Why am I discussing this here? Because when it comes to infrastructure, especially for solutions of considerable size, there often is a need for separate Continuous Delivery and Continuous Deployment. Where does this need come from? From different responsibilities. For large solutions, there is infrastructure responsible for the overall environment and infrastructure tied to specific applications. That means multiple teams own different parts of the infrastructure and work on it independently. But from a governance and security perspective, there is often a desire to treat the entire infrastructure as one. Properly implemented Continuous Delivery and Continuous Deployment can solve this conflict, but before I move on to discussing the practices, I need to extend the context by discussing the repository structure for such solutions.

Structuring Repositories

How repositories of your solutions are structured has an impact on the implementation of your DevOps practices. Very often projects start small with a monorepo structure.

Monorepo

There is nothing wrong with a monorepo structure. It can be the only structure you will ever need. The main benefit of a monorepo is that it's simple. All your code lives in one place, you can iterate fast, it's easy to govern, and you can implement just a single set of DevOps practices. But there is a point at which those advantages become limitations. This point comes when the solution grows to consist of multiple applications owned by different teams. Sooner or later those teams start to ask for some level of independence. They want slightly different governance rules (which better suit their culture) and they don't want to be blocked by work being done by other teams. Sometimes just the size of the monorepo becomes a productivity issue. This is where the decoupling of the monorepo starts. Usually, the first step is that new applications are created in their own repositories. Later, the existing ones are moved out of the monorepo. The outcome is multiple, independent repositories.

Repositories per Application Without Dedicated Environment Repository

But this structure has some problems of its own. There is no longer a single source of truth that represents the entire solution. Establishing governance rules which are required for the entire solution is harder. There are multiple sources of deployments, which means access to the environment from multiple places, which means increased security risk. There is a need for balance between those aspects and the teams' needs in the areas of flexibility and productivity. A good option for such a balance is having application repositories and a dedicated environment repository.

Repositories per Application and Dedicated Environment Repository

The environment repository is the single source of truth. It's also the place to apply the required governance rules and the only source of deployments. It can also hold the shared infrastructure. The application repositories are owned by the application teams and contain the source code for the application as well as the infrastructure code tied to it. This is the structure we will focus on, because this is the structure that requires Continuous Delivery and Continuous Deployment. The application teams should implement Continuous Delivery for packaging the infrastructure to be used by the environment repository, while the team responsible for the environment repository should implement Continuous Deployment. Let's start with Continuous Delivery.

Continuous Delivery for Applications Infrastructure

The first idea for a Continuous Delivery implementation can be simply copying the infrastructure code to the environment repository. The allure of this approach is that the environment repository will contain the complete code of the infrastructure, removing any kind of context switching. The problem is that now the same artifacts live in two places, and the sad truth is that when the same artifacts live in two places, sooner or later something is going to get messed up. So, instead of copying the infrastructure code, a better approach is to establish links from the environment module to the application modules.

Options for linking from the environment module to application modules strongly depend on the chosen infrastructure as code tooling. Some tools support a wide variety of sources for the links, starting with linking directly to Git repositories (so Continuous Delivery can be as simple as creating a tag and updating a reference in the environment module). In rare cases, when the tooling has no support for linking, you can always use git submodules.
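
The git submodule fallback boils down to two commands in the environment repository (the URL and path are illustrative):

git submodule add https://github.com/contoso/sample-application.git modules/sample-application
git submodule update --remote modules/sample-application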

In the case of Bicep, there is one interesting option - using Azure Container Registry. This option can be attractive for two reasons. One is the possibility to create a private, isolated registry. The other is treating infrastructure the same way we would treat containers (so if you are using containers, both are treated similarly).

Repositories per Application and Dedicated Environment Repository With Links for Modules

Publishing Bicep files is available through the az bicep publish command. We can create a workflow around this command. A good trigger for this workflow may be the creation of a tag. We can even extract the version from the tag name, which we will later use to publish the module.

name: Continuous Delivery (Sample Application)
on:
  push:
    tags:
    - "sample-application-v[0-9]+.[0-9]+.[0-9]+"

...

jobs:
  publish-infrastructure-to-registry:
    runs-on: ubuntu-latest
    steps:
    - name: Extract application version from tag
      run: |
        echo "APPLICATION_VERSION=${GITHUB_REF/refs\/tags\/sample-application-v/}" >> $GITHUB_ENV
    ...

Now all that needs to be done is checking out the repository, connecting to Azure, and pushing the module.

...

env:
  INFRASTRUCTURE_REGISTRY: 'crinfrastructuremodules'

jobs:
  publish-infrastructure-to-registry:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
    ...
    - name: Checkout
      uses: actions/checkout@v3
    - name: Azure Login
      uses: azure/login@v1
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - name: Publish application Bicep to infrastructure registry
      run: |
        bicep publish  \
          applications/sample-application/application.bicep  \
          --target br:${INFRASTRUCTURE_REGISTRY}.azurecr.io/infrastructure/applications/sample-application:${APPLICATION_VERSION}

The linking itself is done in the environment infrastructure Bicep file. The module syntax allows the module path to be either a local file or a file in a registry. This is the part that application teams will be contributing to the environment repository - the module definition.

...

resource sampleApplicationResourceGroupReference 'Microsoft.Resources/resourceGroups@2022-09-01' = {
  name: 'rg-devops-practices-sample-application-prod'
  location: environmentLocation
}

module sampleApplicationResourceGroupModule 'br:crinfrastructuremodules.azurecr.io/infrastructure/applications/sample-application:1.0.0' = {
  name: 'rg-devops-practices-sample-application-rg'
  scope: resourceGroup(sampleApplicationResourceGroupReference.name)
}

...

Now the Continuous Deployment practice for the environment can be implemented.

Continuous Deployment for Environment Infrastructure

There are two deployment strategies that you may have heard of in the context of deploying infrastructure: push-based deployment and pull-based deployment.

The push-based deployment is what one could call a classic approach to deployment. You implement a workflow that pushes the changes to the environment. That workflow is usually triggered as a result of changes to the code.

The Push-based Deployment Strategy

The pull-based deployment strategy is the newer approach. It introduces an operator in place of the workflow. The operator monitors the repository and the environment and reconciles any differences to maintain the infrastructure as described in the environment repository. That means it will not only react to changes made to the code but also to changes applied directly to the environment, protecting it from drifting (at least in theory).

The Pull-based Deployment Strategy

The pull-based deployment strategy has found the most adoption in the Kubernetes space with two ready-to-use operators (Flux and Argo CD). When it comes to general Azure infrastructure, the push-based strategy is still the way to go, although there is a way to have a pull-based deployment for Azure resources that are tied to applications hosted in Kubernetes. Azure Service Operator for Kubernetes provides Custom Resource Definitions for deploying Azure resources, enabling a unified experience for application teams.
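
For illustration, this is roughly what an Azure resource defined through Azure Service Operator looks like to an application team - a ResourceGroup custom resource applied to the cluster like any other Kubernetes manifest (names and namespace are illustrative):

apiVersion: resources.azure.com/v1api20200601
kind: ResourceGroup
metadata:
  name: rg-devops-practices-sample-application-prod
  namespace: default
spec:
  location: westeurope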

In the scope of this post, I'm going to stick with a typical push-based deployment, which means checking out the repository, connecting to Azure, and deploying the infrastructure based on the Bicep file.

name: Continuous Deployment (Environment)
...

jobs:
  deploy-environment:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Azure Login
      uses: azure/login@v1
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - name: Deploy Environment
      uses: azure/arm-deploy@v1
      with:
        scope: 'subscription'
        region: 'westeurope'
        template: 'environment/environment.bicep'

The Journey Continues

The same as with the previous post, this one also just scratches the surface. There are many possible variations, depending on your needs. It can also be further automated - for example, the Continuous Delivery implementation could automatically create a pull request to the environment repository. Part of the DevOps culture is continuous improvement, and that also means improving the practice implementations themselves.

Our journey is also not over yet - there are a couple more practices I would like to explore in the next posts so that the loop is complete.

DevOps Pipeline With Tools for Create, Verify, Package, and Release Stages

If you want to play with the workflows, they are sitting on GitHub.

The generally adopted definition of the DevOps methodology says that it's a combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver solutions. That's very broad. So broad that initial adoption in many organizations has focused on applying it only to application code. This has led to the naming of several additional methodologies to either further blend DevOps with additional areas or focus on previously neglected aspects: DevSecOps, FinOps, DataOps, GitOps, or MLOps. But, regardless of the flavor, the core remains the same. This is why, although I'm writing about DevOps in the context of infrastructure, I have avoided using GitOps in the title.

If you look for a definition of GitOps, you may find statements like "GitOps is a process of automating IT infrastructure using infrastructure as code and software development best practices" or "GitOps is a subset of DevOps". In reality, the term has been strongly tied to a specific ecosystem and specific tools. I don't want to fight with those associations. Instead, I want to focus on the essence - applying DevOps practices to infrastructure.

DevOps Practices

DevOps practices are a way to bring DevOps cultural philosophies (collaboration, automation, continuous feedback, continuous improvement, etc.) to life. They are used to implement all the stages of the DevOps pipeline:

DevOps Pipeline

You may find slightly different lists of those practices; this is the one I prefer:

  • Continuous Integration
  • Continuous Delivery
  • Continuous Deployment
  • Continuous Testing
  • Continuous Operations
  • Continuous Monitoring

In this post, I want to focus on Continuous Integration.

Continuous Integration Practice

On some occasions, I've heard the opinion that Continuous Integration is about merging the changes. The truth is that a correctly implemented Continuous Integration practice covers the Create and Verify stages of the DevOps pipeline.

It shouldn't be a surprise that the cornerstone of the Create stage in the case of infrastructure is infrastructure as code, the tooling used for development, and the Git ecosystem used for collaboration. This is something that is already widely adopted with many options to choose from:

  • Terraform, Bicep, or Pulumi just to name some popular infrastructure as code options.
  • GitHub, Azure DevOps, or GitLab as potential Git ecosystems.
  • VS Code, Neovim, or JetBrains Fleet as possible development environments.

The above list is in no way exhaustive. I also don't aim at discussing the superiority of one tool over another. That said, discussing the Verify stage, which is the more challenging part of Continuous Integration practice, will be better done with specific examples. This is why I must choose a stack and I'm going to choose Bicep (as I'm writing this in the context of Azure) and GitHub (because it has some nice features which will make my life easier).

So, once we have our infrastructure as code created, what should we consider as verification from the perspective of Continuous Integration? In the beginning, I quoted a statement saying that GitOps is about using software development best practices in the process of automating infrastructure. What would be the first thing one would do with application code to verify it? Most likely build it.

Building and Linting for Infrastructure Code

Building or compiling application code is the first step of the Verify stage. In the software development context, it's sometimes thought of as a way to generate the binaries (and it is), but it also verifies that the code is syntactically correct. In the context of IaC, it means checking for the correct use of language keywords and that resources are defined according to the requirements for their type. This is something that IaC tooling should always support out of the box. Bicep provides this capability through the az bicep build command, which we can simply run as a step in a workflow.

...

jobs:
  build-and-lint:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Build and lint Bicep
      run: |
        az bicep build --file applications/sample-application/application.bicep
  ...

The az bicep build command also performs a second activity, which is closely tied to building/compiling - it runs the linter over the template. The goal of linting is to help enforce best practices and coding standards based on defined rules. Best practices and coding standards sometimes need to be tailored to a specific team and organization, which is why Bicep allows for the configuration of rule severity through the bicepconfig.json file. Possible options are Error, Warning, Info, and Off. By default, the majority of rules are set to either Warning or Off. The typical adjustment which I almost always do is bumping No unused parameters to Error and enabling Use recent API versions (as it is Off by default).

{
    "analyzers": {
        "core": {
            "enabled": true,
            "rules": {
                ...
                "no-unused-params": {
                    "level": "error"
                },
                ...
                "use-recent-api-versions": {
                     "level": "warning"
                },
            }
        }
    }
}

The bicepconfig.json file should be committed to the repository, which ensures that the local development environment picks up the same configuration. This includes VS Code (if the Bicep extension is installed), enabling immediate feedback for engineers (in the spirit of DevOps cultural philosophies). Of course, engineers can ignore that feedback or use tooling which doesn't provide it, but then the Build and lint Bicep step of the integration workflow will catch the issues and give them that feedback.

Continuous Integration (Sample Application) workflow - output of Build and lint Bicep step

If everything is correct, the workflow should move to the next phase, which doesn't mean we should be done with looking at the code itself. Following the software development best practices, the next phase should be static analysis.

Static Analysis for Infrastructure Code

Application code is usually scanned with tools like SonarQube, Veracode, Snyk, or GitHub's own CodeQL to detect potential vulnerabilities or bad patterns. The same should be done for infrastructure code and there are ready-to-use tools for that like KICS or Checkov. They are both designed to detect security vulnerabilities, compliance issues, and misconfigurations in our IaC. They both come with a huge set of configurable rules and the capability to create your own.

I prefer KICS, especially the way it can be integrated with GitHub. Checkmarx, the company behind KICS, provides a ready-to-use action. The support for Bicep is "indirect" - KICS supports ARM, so the analysis has to be done after the build step. There is also a small preparation step needed, as the output directory has to be created. Still, adding KICS-based static analysis to the workflow is only about 10 lines.

...

jobs:
  build-lint-and-static-analysis:
    runs-on: ubuntu-latest
    steps:
    ...
    - name: Create static analysis results folder
      run: |
        mkdir -p static-analysis-results
    - name: Perform KICS static analysis
      id: kics
      uses: checkmarx/kics-github-action@v1.7.0 # version pinned here is an assumption
      with:
        path: 'applications/sample-application/'
        fail_on: 'high,medium'
        output_path: 'static-analysis-results'
        output_formats: 'json,sarif'
  ...

The above analysis step will fail if any issues with severity high or medium are detected. Similarly to the build step, the feedback will be provided through workflow output.

Continuous Integration (Sample Application) workflow - output of Perform KICS static analysis step

But the KICS integration is even more powerful than that. As you may have noticed, I've configured the output formats of the analysis to be JSON and SARIF. SARIF is a standardized format for sharing static analysis results and it can be used to integrate with the code scanning feature of GitHub Advanced Security. Once again we can use an existing action (this time provided by GitHub) to upload the SARIF file. The only tricky part is to put a proper condition on the upload step, so the results are also uploaded when the analysis step fails due to the severity of detected issues.

...

jobs:
  build-lint-and-static-analysis:
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      security-events: write
    steps:
    ...
    - name: Upload KICS static analysis results
      if: always() && (steps.kics.outcome == 'success' || steps.kics.outcome == 'failure')
      uses: github/codeql-action/upload-sarif@v2
      with:
        sarif_file: 'static-analysis-results/results.sarif'
  ...

Thanks to this, the issues will be available in the Code Scanning section of the repository Security tab. This provides alerts for those issues, the ability to triage them, and an audit trail of the actions taken.

Continuous Integration (Sample Application) workflow - KICS static analysis results in the Code Scanning section of the repository Security tab

Now we can say that we have looked at the code enough as part of the integration workflow. In the case of software development, we would probably now run some unit tests. In the case of infrastructure, the equivalent at this stage is testing whether the template will deploy successfully.

Preflight Validation for Infrastructure Code

We have verified that the template builds properly and we have removed all important vulnerabilities and misconfigurations. Sadly, this doesn't guarantee that the template will deploy. There may be policies or conditions in the environment which are not reflected in any of the checks so far. To make sure that the template will deploy, we need to perform a preflight validation against the environment. This capability is provided differently by different ecosystems; in the case of Bicep and ARM it comes as the Validate deployment mode. This means that we can add another job to our workflow which will establish a connection to Azure and test the deployment.

...

jobs:
  ...
  preflight-validation:
    needs: build-lint-and-static-analysis
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Azure Login
      uses: azure/login@v1
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - name: Perform preflight validation
      uses: azure/arm-deploy@v1
      with:
        scope: 'resourcegroup'
        resourceGroupName: 'rg-devops-practices-sample-application-sandbox'
        template: 'applications/sample-application/application.bicep'
        deploymentMode: 'Validate'
        failOnStdErr: false
  ...

This will catch issues like duplicated storage account names (or simple cases where the name is too long) without actually deploying anything.

Continuous Integration (Sample Application) workflow - output of Perform preflight validation step
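
If you want the same feedback before the workflow even runs, the preflight validation can also be executed locally with the Azure CLI (assuming you are logged in and the target resource group already exists):

az deployment group validate \
  --resource-group rg-devops-practices-sample-application-sandbox \
  --template-file applications/sample-application/application.bicep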

What's next? Well, there is one common software development practice that we haven't touched yet - pull requests and code reviews. This subject has recently caused some heated discussions. There are opinions that if you have an explicit code review step in your process, it's not true Continuous Integration. There are also opinions that code reviews are perfectly fine. My opinion is that it's a matter of team culture. If your team has an asynchronous culture, then doing code reviews through pull requests may be the correct way. If your team is co-located or collaborates closely online, using pair or mob programming instead of code reviews may be best. We can also detach the discussion around pull requests from the discussion around code reviews - I know teams that rely on pair programming in place of code reviews but still use pull requests (automatically closed) for tracking purposes. And when we are talking about pull requests in the context of infrastructure code, there is one challenge - it's hard to understand the actual change just by looking at the code diff (especially after some time). This is why generating a preview of changes as part of the integration workflow can be extremely beneficial.

Preview of Infrastructure Changes

Infrastructure as code tooling usually provides a method to generate a preview - Terraform has plan, Pulumi has preview, and Bicep/ARM has what-if. From the perspective of the integration workflow, we are not interested in running those commands locally but as part of the workflow. And this time we are not looking for results in the workflow output - we want to add them as additional context to the pull request. To be able to do that, we first must capture the results. A good method is writing them to an environment variable.

...

jobs:
  ...
  preview:
    needs: preflight-validation
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Azure Login
      uses: azure/login@v1
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - name: Prepare preview
      run: |
        echo 'DEPLOYMENT_WHAT_IF<<EOF' >> $GITHUB_ENV
        az deployment group what-if \
          --resource-group rg-devops-practices-sample-application-sandbox \
          --template-file applications/sample-application/application.bicep \
          --result-format ResourceIdOnly >> $GITHUB_ENV
        echo 'EOF' >> $GITHUB_ENV
  ...

Once we have the results, we can add them to the pull request. My preferred approach is to create a comment. GitHub provides the github-script action, which gives us a pre-authenticated GitHub API client. The issue number and all other necessary information are available through the context object (if we are using the right trigger).
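
The "right trigger" here is a pull request event, as that is what makes the issue number available in the context object. A minimal sketch of the workflow header could look like the one below (the branch and paths filters are assumptions - adjust them to your repository):

name: Continuous Integration (Sample Application)

on:
  pull_request:
    branches:
      - main
    # paths filter is an assumption - scope it to the template you want to verify
    paths:
      - 'applications/sample-application/**'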

...

jobs:
  ...
  preview:
    needs: preflight-validation
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
    ...
    - name:  Create preview comment
      uses: actions/github-script@v6
      with:
        script: |
          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: process.env.DEPLOYMENT_WHAT_IF
          })
  ...

As a result of this job, we will get a nice comment describing in a more human-readable form the changes which deploying the template would cause at this very moment.

Continuous Integration (Sample Application) workflow - preview comment on the pull request

We may want even more - changes are not the only valuable context we may be interested in. The second important piece of information is how deploying the changes will impact costs.

Cost Estimation of Infrastructure Changes

ThoughtWorks has been recommending run cost as an architecture fitness function since 2019, and there is more than one infrastructure cost estimation tool available to us. The two worth mentioning are Infracost (for Terraform) and Azure Cost Estimator (for Bicep/ARM and recently also Terraform). As I'm using Bicep in this article, I'm going to focus on Azure Cost Estimator.

Azure Cost Estimator is still a young tool, yet it's already quite powerful. At the moment of writing this, it supports ~86 resource types. What is very important, it's capable of generating usage-based consumption estimates for some resources if you provide usage patterns through metadata in the template. The only tricky part can be integrating it into the workflow. The project repository provides a reusable workflow, but this may not be a desired (or even allowed) method in many organizations. This is why I'll walk you through the integration step by step.

The first step is getting the binaries and installing them. If you are using self-hosted runners, this can be part of the runner setup. You can also download and install the binaries from a central location as part of the workflow itself - below I'm doing exactly that, using the official project releases.

...

jobs:
  ...
  cost-estimation:
    needs: preflight-validation
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Download Azure Cost Estimator
      id: download-ace
      uses: robinraju/release-downloader@v1.8 # version pinned here is an assumption
      with:
        repository: "TheCloudTheory/arm-estimator"
        tag: "1.2"
        fileName: "ace-linux-x64.zip"
    - name: Install Azure Cost Estimator
      run: |
        unzip ace-linux-x64.zip
        chmod +x ./azure-cost-estimator
  ...

With the binaries in place, we can use the same pattern as in the case of preview to run the tool, grab the results into an environment variable, and create a comment.

...

jobs:
  ...
  cost-estimation:
    needs: preflight-validation
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
    ...
    - name: Azure Login
      uses: azure/login@v1
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - name: Prepare cost estimation
      run: |
        echo 'COST_ESTIMATION<<EOF' >> $GITHUB_ENV
        azure-cost-estimator applications/sample-application/application.bicep \
          ${{ secrets.AZURE_SUBSCRIPTION_ID }} \
          rg-devops-practices-sample-application-sandbox \
          --stdout --disableDetailedMetrics >> $GITHUB_ENV
        echo 'EOF' >> $GITHUB_ENV
    - name:  Create pull request comment
      uses: actions/github-script@v6
      with:
        script: |
          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: process.env.COST_ESTIMATION
          })
  ...

This will give us a lot of additional, valuable context in the pull request.

Continuous Integration (Sample Application) workflow - cost estimation comment on the pull request

The Beginning of the Implementation Journey

This post describes just the beginning of the DevOps practices implementation journey.

DevOps Pipeline With Tools for Create and Verify Stages

It gives a hint about the entire technology ecosystem, but the remaining practices have many interesting aspects to dive into. I do intend to continue walking through them with proposed implementations, just to make your journey easier. I've also created a repository that contains samples of the different parts of the implementation (available in different branches, with results in different closed pull requests). You can review it to find all the relevant information. You can also create a fork and have your own playground.
