Chaos Experiments as Day-2 Operations

Chaos engineering is a hot topic in platform engineering as organizations work to build more robust applications. It involves deliberately injecting faults into a system to observe how it responds and to build confidence in its resilience. Netflix's Chaos Monkey pioneered the practice of deliberately inflicting chaos on a production system, but the discipline has grown extensively since that initial project.

The Chaos Mesh project is a Cloud Native Computing Foundation incubating project that aims to simplify the process of injecting chaos into an environment. Chaos Mesh supports both Kubernetes and virtual machine environments, but it really shines in its integration with Kubernetes.

In this article, I will take you through the journey of building a Chaos Experiment and running it against an environment. Chaos experiments can also be easily exposed as Day 2 Operations in Cloudify, allowing your developers to leverage them from the web interface or as part of a CI/CD pipeline.


The blueprints used in this article are available on the Cloudify Community GitHub. You can follow along with this example by uploading the main blueprint to your Cloudify manager:

$ cfy blueprint upload -b Chaos-Demo-Environment
Publishing blueprint archive
Blueprint `Chaos-Demo-Environment` upload started.
2022-12-07 13:51:28.408  CFY <None> Starting 'upload_blueprint' workflow execution
2022-12-07 13:51:28.779  LOG <None> INFO: Blueprint archive uploaded. Extracting...
2022-12-07 13:51:28.868  LOG <None> INFO: Blueprint archive extracted. Parsing...
2022-12-07 13:51:29.820  LOG <None> INFO: Blueprint parsed. Updating DB with blueprint plan.
2022-12-07 13:51:29.969  CFY <None> 'upload_blueprint' workflow execution succeeded
Blueprint uploaded. The blueprint's id is Chaos-Demo-Environment

You should also upload the “AKS-Azure-TFM” blueprint from the Cloudify Marketplace as this example will deploy an AKS cluster to support the example workloads.

Your Cloudify Manager should also meet the following prerequisites:

  • Terraform plugin v0.19.10 or greater is installed
  • Kubernetes plugin v2.13.16 or greater is installed
  • Utilities plugin v1.25.12 or greater is installed
  • Azure credentials exist as secrets within the Manager

Environment Overview

In this article, I will deploy a simple web application that can be accessed as an externally exposed Kubernetes load balancer service. The web application, which uses the Flask Python framework, returns a JSON response when queried and exposes Prometheus metrics on the /metrics endpoint.

Deploying an application is only one piece of the puzzle. Operators also need to collect and analyze metrics about their services. I will also deploy the Prometheus Operator, which automatically installs Prometheus and Grafana. The operator exposes Custom Resource Definitions (CRDs) to configure monitoring for an application that exposes Prometheus metrics. Since the sample Python application exposes these metrics, I can automatically scrape them by specifying the right CRD as part of the application blueprint.
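To give a sense of what that looks like, a ServiceMonitor resource for an application like this might resemble the following sketch. The names, label selector, and port name are illustrative assumptions, not taken from the actual blueprint:

```yaml
# Hypothetical ServiceMonitor for the demo application.
# Metadata names, labels, and the port name are assumptions for
# illustration; the real blueprint may differ.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: demo-app        # must match the labels on the app's Service
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http           # name of the Service port to scrape
      path: /metrics       # endpoint exposed by the Flask app
      interval: 30s
```

With a resource like this in place, the Prometheus Operator configures Prometheus to discover and scrape the application automatically, with no manual scrape-config changes.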

Finally, Chaos Mesh is deployed in the cluster for running Chaos Experiments. I will use Chaos Mesh to simulate failures with the web application. These experiments can be transformed into Day 2 Operations in Cloudify that can be easily executed through the user interface or via an API call.

This approach involves treating the overall environment as a product that can be easily consumed, deployed, and managed. This includes the application components; supporting infrastructure, such as monitoring; and operational workflows that enable product teams to rapidly deploy and self-manage their application and all of its ancillary pieces. This reflects the shift-left mentality of the emerging platform engineering discipline, which empowers upstream developers to create and manage their environments using the interfaces that make sense for their workflows. In this example, our environment can be provisioned and maintained using the Cloudify UI, CLI, API, or any of our existing integrations for platforms such as GitHub.

Initial Deployment

Deploying the full environment requires two steps:

  1. Deploy an AKS cluster using the AKS-Azure-TFM blueprint from the Cloudify Marketplace.
  2. Deploy the application environment blueprint, which includes all of the containerized components shown in the topology diagram.

Creating the AKS cluster is very straightforward using the blueprint provided in the Cloudify Marketplace:

Deploying the end-to-end environment is also easy. Cloudify’s service composition feature allows this blueprint to be built from other, smaller blueprints, which promotes reusability: each individual component can also be deployed on its own. The child blueprints are automatically uploaded as part of the main blueprint deployment.

In this case, the following child blueprints exist and map to the components described earlier in the Environment Overview:

  • Prometheus-Operator
  • Chaos-Mesh
  • Python-Prometheus
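In Cloudify's service composition model, a parent blueprint typically references each child blueprint through a ServiceComponent node. A minimal, hypothetical sketch for the Chaos-Mesh child is shown below; the node name and deployment id are assumptions for illustration:

```yaml
# Hypothetical ServiceComponent node referencing the Chaos-Mesh child
# blueprint. The node template name and deployment id are illustrative
# assumptions; the real parent blueprint may differ.
node_templates:
  chaos_mesh:
    type: cloudify.nodes.ServiceComponent
    properties:
      resource_config:
        blueprint:
          id: Chaos-Mesh           # child blueprint to use
          external_resource: false # upload it as part of this deployment
        deployment:
          id: chaos-mesh           # deployment created for the child
```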

To deploy the end-to-end environment, I will use the “Deploy On” feature in Cloudify to deploy this application environment onto the Kubernetes cluster. This enables the child blueprints to take advantage of contextual information about the Kubernetes cluster, such as the Kubernetes API endpoint and authentication information:

Once the deployment is complete, I can verify that all of the deployments are running:

# Obtain credentials for Kubectl
$ az aks get-credentials --resource-group chaosdemo-rg --name chaosdemo-aks
Merged "chaosdemo-aks" as current context in /home/acritelli/.kube/config

# Verify deployments in all namespaces
$ kubectl get deployment -A
NAMESPACE     NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
chaos-mesh    chaos-controller-manager   3/3     3            3           46m
chaos-mesh    chaos-dashboard            1/1     1            1           46m
default       demo-app                   3/3     3            3           46m
kube-system   coredns                    2/2     2            2           84m
kube-system   coredns-autoscaler         1/1     1            1           84m
kube-system   konnectivity-agent         2/2     2            2           84m
kube-system   metrics-server             2/2     2            2           84m
monitoring    blackbox-exporter          1/1     1            1           46m
monitoring    grafana                    1/1     1            1           46m
monitoring    kube-state-metrics         1/1     1            1           46m
monitoring    prometheus-adapter         2/2     2            2           46m
monitoring    prometheus-operator        1/1     1            1           46m

Finally, I can test the web application by finding the service endpoint and making a web request to the public IP address:

# Get demo-app service
$ kubectl get service
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
demo-app     LoadBalancer   80:30308/TCP   44m
kubernetes   ClusterIP       <none>          443/TCP        82m

# Verify that the web endpoint is accessible
$ curl

Grafana Dashboard Setup

I now have an application and a full supporting environment including Chaos Mesh, Prometheus, and Grafana. The web application exposes Prometheus style metrics, and the application blueprint included a service monitor so that Prometheus can automatically scrape metrics from the application. Next, I will set up a simple Grafana dashboard to visualize the Prometheus metrics exposed by the web application.

The Grafana installation is not exposed as an external service. To use the dashboard, I need to port-forward the Grafana service so that it can be accessed from my local machine:

$ kubectl port-forward -n monitoring svc/grafana 3000:3000
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000

Once the service has been port-forwarded, I can log in at http://localhost:3000 using the default username and password of “admin”. I then create a very basic graph to visualize the rate of incoming HTTP requests for the sample application:
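The panel graphs the per-second rate of incoming HTTP requests. A query along these lines works, assuming the metric name matches what the application actually exports; `flask_http_request_total` is a common counter name from Flask Prometheus exporters, used here as an assumption:

```promql
# Per-second rate of incoming HTTP requests over a 1-minute window.
# The metric name is an assumption based on common Flask exporters.
sum(rate(flask_http_request_total[1m]))
```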

This graph will make it easy to see the fault injected by Chaos Mesh, as the request rate will drop during the experiment time window.

Note: If following along on your own, you may need to send a few more requests to the web service endpoint before all of the relevant metrics appear in Grafana. 

Access Chaos Mesh

I now have a Python application that exposes metrics and a Grafana dashboard to visualize the incoming request rate. Next, I will simulate an application failure by performing a Chaos Mesh experiment and observing the impact on application availability. I’ll create this experiment by using the Chaos Mesh dashboard. The dashboard makes it easy to create experiments using an exploratory interface. In the next section, I will demonstrate how this experiment can be turned into a custom Day 2 Operation in Cloudify.

A service account token is required to access the Chaos Mesh dashboard. The blueprint for Chaos Mesh automatically creates this token and exposes it as a deployment capability. I can obtain this token using the Cloudify CLI or UI:

# Find the deployment ID for the environment
$ cfy deployment list --search-name Chaos-Demo-Environment --json | jq '.[] | {id, display_name}'
  "id": "c3bdc27a-dd78-463f-a186-1bd2697ddc8c",
  "display_name": "Chaos-Demo-Environment-chaos"

# Look up the capabilities to obtain the secret name
$ cfy deployment capabilities c3bdc27a-dd78-463f-a186-1bd2697ddc8c
Retrieving capabilities for deployment c3bdc27a-dd78-463f-a186-1bd2697ddc8c...
 - "dashboard_rbac_token_secret":
     Description: Secret for RBAC token for Chaos Dashboard
     Value: account-cluster-manager-demo-token-s6nl7

# Obtain the secret value
$ kubectl describe secret account-cluster-manager-demo-token-s6nl7
Name:         account-cluster-manager-demo-token-s6nl7
Namespace:    default
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: account-cluster-manager-demo
Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1765 bytes
namespace:  7 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IkdWenYyQkhGX1dTVklnQXR5SFlKY2ZXUjlqZU5RTk5pdDNJZ0tURnZkMm8ifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImFjY291bnQtY2x1c3Rlci1tYW5hZ2VyLWRlbW8tdG9rZW4tczZubDciLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiYWNjb3VudC1jbHVzdGVyLW1hbmFnZXItZGVtbyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6Ijg2ZGJkNjAwLTM5NWItNDAzYy04OTg3LTAzMTI1Y2NkMmIwOSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpkZWZhdWx0OmFjY291bnQtY2x1c3Rlci1tYW5hZ2VyLWRlbW8ifQ.aEwSunuotfqaVyhGopDJGiBy_KVVbFj317yhR82838G-iyA0vMf0IBp-lX-T9xjnHlcCH750kqM3ssNdwPefqJbBN8VqI2C7T7QOocBn2ih3tTFVDV7lepuE6nd-mzDIfO8K_ts4BcufAfMa_8X0w9VKwbr1nbM3w_Cb8_XzxMUMYPeCYSRx4nKaogD8_--paSbrVaWLURExIJfHF7-zwUGOrYjA-SwSnwFMv9t1LYcrMVKWHUeNzq-HyVB1meDpvIM4UFlfHC-bHgrk9Y7EaplKVNfcHaAw11H0RxeLAa02GKyJerafrJ03lXoSi6GKfhXzeJXnZu7aLJcsk96kKoUDUzXMe7YykGS9Uc1YCMX7NpRbiKfTq2YzeYr3bQfSPlO4iQXkCJfMkpLUhFRxomS7z355gc1ok1tt2M_RUnNCt0Vqc1QdqrsgnNV4alCI_barIbXbxG_KOa8ZyXkkSxmY28ZNFSGZ6K2emdNWIvjoNXtAtxXwEItU7XdoHPwffZg9k3MtSQ9NGEpIlyZFLUt2dDhPZhYINwyuHvpljt3MiGG-FCWTXh2D483l42Qx4NLqV-R6sSCcbmxTgPY1AJmB3Ei75-gHvho9lHH8m_WtjnexO5oD1NojgB049s-a-DG3wHOLCY4yQe5Ffk7pOjRpAuXAlmh4qCCz8Pdas0M

Like the Grafana dashboard, the Chaos Mesh dashboard is not publicly exposed. I need to create a local port-forward to access the dashboard:

$ kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
Forwarding from 127.0.0.1:2333 -> 2333
Forwarding from [::1]:2333 -> 2333

I can now log into the Dashboard by navigating to http://localhost:2333 and providing the service account token:

Chaos Mesh is built around the idea of experiments. An experiment represents a defined fault injected into your environment. I will inject a very simple failure scenario into this environment. Incoming HTTP requests to the sample application will simply be dropped for two minutes. The Chaos Mesh dashboard provides an excellent interface for easily defining a new experiment:

I can confirm the experiment is working by attempting to access the same web application endpoint while the experiment is running. As expected, these attempts fail:

$ curl
curl: (52) Empty reply from server

Defining experiments from the Chaos Mesh UI provides a simple, exploratory approach for designing a Chaos Experiment. However, the real power of an experiment is the ability to trigger it via automation. In the next section, I will explain how experiments can be exposed as Day 2 Operations in Cloudify.

Experiments as Day 2 Operations

Using the Chaos Mesh dashboard is a great way to get started when you first begin designing experiments. You will likely discover common Chaos Experiments that make sense for your organization to run on a schedule or on-demand, triggered by an external automation such as a CI/CD pipeline. The Chaos Mesh project provides CRDs for defining Chaos experiments, and these can be sent to the Kubernetes API.
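For reference, an HTTPChaos manifest matching the experiment described earlier, aborting inbound HTTP requests to the demo application for two minutes, might look roughly like this. The resource name, labels, and port are assumptions rather than values from the blueprint:

```yaml
# Hypothetical HTTPChaos experiment: abort inbound HTTP requests to
# pods labeled app=demo-app on port 80 for two minutes. The name,
# labels, and port are illustrative assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: demo-app-http-abort
  namespace: default
spec:
  mode: all                 # target every matching pod
  selector:
    labelSelectors:
      app: demo-app
  target: Request           # act on inbound requests
  port: 80
  abort: true               # drop the request entirely
  duration: 2m
```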

Cloudify can trigger a Chaos Experiment as a Day 2 Operation. This empowers teams to create reusable Chaos Experiments that can be triggered through a self-service interface, such as the Cloudify UI, CLI, or API.

The Python application blueprint includes a Kubernetes manifest for an HTTPChaos experiment. This is the same experiment that I ran manually from the Chaos Mesh dashboard. The blueprint itself defines a custom interface operation that applies this manifest and a custom workflow to run the operation. This exposes the Chaos Experiment as a workflow that can be run from the Cloudify UI or API:


node_templates:
  python_app:
    type: cloudify.kubernetes.resources.FileDefinedResource
    properties:
      client_config: { get_input: kubernetes_client_config }
      file:
        resource_path: manifests/deployment.yaml
    interfaces:
      chaos_actions:
        run_http_chaos:
          implementation: scripts/
          executor: central_deployment_agent
          inputs:
            host: { get_input: kubernetes_api_endpoint }
            token: { get_input: kubernetes_token }
            manifest: manifests/chaos.yaml

workflows:
  RunHTTPChaos:
    mapping: cloudify_custom_workflow.cloudify_custom_workflow.tasks.customwf
    parameters:
      nodes_to_runon:
        default:
          - python_app
      operations_to_execute:
        default:
          - chaos_actions.run_http_chaos

To test out this workflow, I will generate a constant stream of requests to the web service endpoint using a simple Bash loop:

$ while true; do curl; done;

While this is running in the background, I will trigger the custom workflow. The workflow can be triggered via the Cloudify UI or the API. The UI exposes the workflow as a “Cloudify custom workflow” on the deployment page for the Python application. You can find the deployment by navigating to Blueprints > Python-Prometheus and selecting the deployment. You can then execute the workflow from the Execute workflow > Cloudify custom workflow menu:

The workflow can also be triggered via the CLI or API (or any other API-driven integration, such as the Cloudify GitHub Action or ServiceNow integration). This provides an easy interface for running Chaos Experiments using the interfaces that your organization is already comfortable with, such as within a CI/CD pipeline as part of a deployment process. The CLI makes it easy to run the workflow:

# Obtain the deployment ID
$ cfy deployment list --search-name python-app --json | jq '.[] | {id, display_name}'
  "id": "chaos-python-app-0",
  "display_name": "chaos-python-app-0"

# Execute the RunHTTPChaos workflow against the deployment
$ cfy execution start -d chaos-python-app-0 RunHTTPChaos
Executing workflow `RunHTTPChaos` on deployment `chaos-python-app-0` [timeout=900 seconds]

The workflow will run and create a new Chaos Experiment in the cluster. I can use Grafana to observe the sudden decrease in requests during the two-minute experiment window. Notice that the request traffic returns to normal once the experiment is done running:

Wrapping Up

Chaos engineering is an excellent approach for understanding system behavior in failure scenarios and building confidence in your organization’s ability to handle and recover from unexpected events. The Chaos Mesh project makes it easy to design, run, and manage experiments. Cloudify further simplifies the process of deploying the end-to-end environment, including an application, monitoring, and Chaos Mesh itself. Chaos Experiments can then be exposed as custom Day 2 Operations so that your developers can perform chaos engineering tests in a self-service method, from either the UI or API.
