What is Azure Chaos Studio?

In 2021 when Netflix decided to move their systems to the cloud. They moved to AWS. In the new environment they realized workloads could be terminated and replaced at any point of time. This constrain lead to new way of testing the reliability of the solutions by randomly rebooting their workloads. Netflix created a tool called Chaos Monkey to automate the process and standardize system stability testing. Chaos Monkey helped Netflix finding wakeness in their systems and helped them build more reliable solutions.

💡

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos engineering became a new practice on how to test large scale, distributed systems. Because these systems were so complex new approach was needed.

Chaos Engineering

The main reason for creating new testing methodology was to have a way to assess reliability of services, especially in case of a failure. It is done by injecting faults to the system causing it to fail. Once fail gets injected key point is to observe, monitor and (if needed) respond to the issue. The outcome is valuable knowledge helping to design architecture/processes to survive the failure.

Key points

Improve service/application/solution resiliency and build processes helping react to failures
Chaos principals have to be applied continuously
Experiments should be created and organized by a dedicated Chaos Engineering Team
Follow best practices for Chaos Testing

Goals of Chaos Engineering

Gain expertise with monitoring tools
Recognize outage patterns
Learn how to assess the impact
Quickly determine root cause
Practice log analysis

Method

Define a hypothesis
Measure baseline behavior
Inject fault
Monitor the solution's reaction to the injected fault
Document observations
Identify key findings and plan improvements

Microsoft Azure Chaos Studio

At the end of 2021 Microsoft introduced Azure service called Chaos Studio. It was developed to help measure, understand and improve application and service resilience for real world incidents. It allows to simulate region failure, high CPU/Memory usage, networking issues. Running experiments can help validate solutions architecture to improve overall end-user experience in case of unexpected events.

⚠️

Chaos Studio is currently (Feb 2022) in Public Preview. It is not recommended to use it for production environments just now.

Chaos experiments

In Chaos Studio to design failures as experiments. Chaos experiment is an Azure resource contains a description of injecting failures as well as targets.

Experiment is defined in two sections:

Selectors – group of targets receiving faults
Logic – Description of the fault. The fault be for example intermittent loss of network connection or unexpected stress on CPU.

Base on the Chaos Experiment scenarios you can test how (or if) your solution will survive. By injecting simple faults you can constantly improve your infrastructure by challenging it to the new fault scenarios.

Faults and actions

Actions triggered as a part of experiment can be executed in two ways.

Continuous – action (fault) will run for the predefined amount of time. This can be helpful when you want to test app availability (scaling) under heavy load.
Run once (discrete) for – fault will be executed once. A perfect idea when you want to cause unexpected reboot and test auto healing functionality.

Chaos experiment scenarios can have multiple stages. Faults disrupting operations for a resource (or group of resources) can be injected in a controller manner. Time delays help to set up scenario by injecting “waits” without the faults injections. That time can be required to recover solution from a previous fault before injecting another one. A perfect way to test the multiple scenarios at once.

Actions (faults) can cause disruptions on many levels, starting from killing a process up to causing a fault on Azure service level. Some faults require a Chaos Studio agent to be installed on a target VM. Agent-based faults are required when the injected fault is executed on the system level (stress tests, memory pressure tests, process killing).

VM	Network	AKS	Other
CPU pressure	DNS failure	AKS Chaos Mesh network faults	ARM virtual machine shutdown
Physical memory pressure	Network latency	AKS Chaos Mesh pod faults	ARM virtual machine scale set instance shutdown
Virtual memory pressure	Network disconnect	AKS Chaos Mesh stress faults	Cosmos DB failover
Disk I/O pressure (Windows)	Network disconnect with firewall rule	AKS Chaos Mesh IO faults	Azure Cache for Redis reboot
Disk I/O pressure (Linux)	Network security group (set rules)	AKS Chaos Mesh time faults
Arbitrary Stress-ng stress		AKS Chaos Mesh kernel faults
Stop Windows service		AKS Chaos Mesh HTTP faults
Time change		AKS Chaos Mesh DNS faults
Kill process

Availability of the failure scenarios is limited, you can see the up-to-date list in the faults library

Azure Chaos studio helps you improve resilience of your systems. It can also help you understand better the systems' behavior during and after the unexpected fault.

Azure Chaos Studio documentation - tutorials, API reference

Learn about Azure Chaos Studio, a solution for building resilience of your services using chaos engineering and fault injection on Azure.

Microsoft Docsjohnkemnetz

Official Azure Chaos Studio documentation

Sources

https://www.gremlin.com/chaos-monkey/
https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
https://docs.microsoft.com/en-us/azure/chaos-studio/

What is Azure Chaos Studio?

Chaos Engineering

Key points

Goals of Chaos Engineering

Method

Microsoft Azure Chaos Studio

Chaos experiments

Faults and actions

Sources

Kamil Konderak

Landing zone accelerator for Azure Arc-enabled servers

Subscribe to newsletter

Features

Chaos Engineering

Key points

Goals of Chaos Engineering

Method

Microsoft Azure Chaos Studio

Chaos experiments

Faults and actions

Sources

Kamil Konderak

Landing zone accelerator for Azure Arc-enabled servers

You might also like

Why might you want to create a Terraform module for Azure Resource Group?

How to reduce the cost of your Virtual Machines using Azure Automate?

What is new in ArcBox 2.0?

Landing zone accelerator for Azure Arc-enabled servers

Subscribe to newsletter

See posts by Popular tags