What is Azure Chaos Studio?

In 2010, Netflix decided to move its systems to the cloud and chose AWS. In the new environment, workloads could be terminated and replaced at any time. This constraint led to a new way of testing the reliability of their solutions: randomly rebooting workloads. Netflix created a tool called Chaos Monkey to automate the process and standardize system stability testing. Chaos Monkey helped Netflix find weaknesses in its systems and build more reliable solutions.

💡
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Chaos engineering became a new practice for testing large-scale distributed systems. These systems were so complex that a new approach was needed.

Chaos Engineering

The main reason for creating a new testing methodology was to have a way to assess the reliability of services, especially in case of a failure. This is done by injecting faults into the system to make it fail. Once a fault is injected, the key point is to observe, monitor, and (if needed) respond to the issue. The outcome is valuable knowledge that helps design architecture and processes that survive the failure.

Key points

  • Improve service/application/solution resiliency and build processes that help react to failures
  • Chaos principles have to be applied continuously
  • Experiments should be created and organized by a dedicated Chaos Engineering Team
  • Follow best practices for Chaos Testing

Goals of Chaos Engineering

  • Gain expertise with monitoring tools
  • Recognize outage patterns
  • Learn how to assess the impact
  • Quickly determine root cause
  • Practice log analysis

Method

  1. Define a hypothesis
  2. Measure baseline behavior
  3. Inject fault
  4. Monitor the solution's reaction to the injected fault
  5. Document observations
  6. Identify key findings and plan improvements
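
The six steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `ToyService`, `measure`, and `run_experiment` are hypothetical names, the "system" is a simulated latency source, and the fault is an artificial 200 ms delay. A real experiment would measure a real service and use a real fault injector.

```python
import random
import statistics
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: reports request latency in ms."""
    fault_active: bool = False
    rng: random.Random = field(default_factory=lambda: random.Random(42))

    def request_latency_ms(self) -> float:
        base = self.rng.uniform(40, 60)
        # The injected fault adds artificial latency to every request.
        return base + (200 if self.fault_active else 0)


def measure(service: ToyService, samples: int = 50) -> float:
    """Return the median latency over a number of sample requests."""
    return statistics.median(service.request_latency_ms() for _ in range(samples))


def run_experiment(service: ToyService, max_latency_ms: float) -> dict:
    # 1. Hypothesis: median latency stays at or below max_latency_ms during the fault.
    # 2. Measure baseline behavior.
    baseline = measure(service)
    # 3. Inject the fault.
    service.fault_active = True
    try:
        # 4. Monitor the solution's reaction to the injected fault.
        during_fault = measure(service)
    finally:
        # Always remove the fault, even if monitoring raises an error.
        service.fault_active = False
    recovered = measure(service)
    # 5. Document observations.
    # 6. The report is the input for identifying findings and planning improvements.
    return {
        "baseline_ms": baseline,
        "during_fault_ms": during_fault,
        "recovered_ms": recovered,
        "hypothesis_held": during_fault <= max_latency_ms,
    }


report = run_experiment(ToyService(), max_latency_ms=100)
print(report)
```

In this toy run the hypothesis does not hold, because the simulated fault pushes latency well above the 100 ms limit. That kind of result is the finding that step 6 turns into an improvement plan.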

Microsoft Azure Chaos Studio

At the end of 2021, Microsoft introduced an Azure service called Chaos Studio. It was developed to help measure, understand, and improve application and service resilience to real-world incidents. It can simulate a region failure, high CPU or memory usage, and networking issues. Running experiments can help validate a solution's architecture and improve the overall end-user experience in case of unexpected events.

⚠️
Chaos Studio is currently (Feb 2022) in Public Preview. Using it in production environments is not recommended yet.

Chaos experiments

In Chaos Studio, failures are designed as experiments. A chaos experiment is an Azure resource that describes the faults to inject and the targets that receive them.

An experiment is defined in two sections:

  • Selectors – groups of targets that receive faults
  • Logic – a description of the fault. The fault can be, for example, an intermittent loss of network connection or unexpected stress on the CPU.
Source: Microsoft docs - Chaos Studio
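
The two-section shape can be modeled in a few lines of Python. This is a simplified sketch, not the actual Azure experiment schema: the class names, field names, target names, and the `load_percent` parameter are all hypothetical and only mirror the selectors/logic split described above.

```python
from dataclasses import dataclass, field


@dataclass
class Selector:
    """A named group of targets that receive faults (hypothetical model)."""
    selector_id: str
    targets: list[str]


@dataclass
class Fault:
    """One fault and the selector it applies to (hypothetical model)."""
    name: str
    selector_id: str
    parameters: dict[str, str] = field(default_factory=dict)


@dataclass
class Experiment:
    selectors: list[Selector]
    logic: list[Fault]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the sections agree."""
        known = {s.selector_id for s in self.selectors}
        problems = [
            f"selector '{s.selector_id}' has no targets"
            for s in self.selectors
            if not s.targets
        ]
        problems += [
            f"fault '{f.name}' references unknown selector '{f.selector_id}'"
            for f in self.logic
            if f.selector_id not in known
        ]
        return problems

    def targets_for(self, fault: Fault) -> list[str]:
        """Resolve the targets a fault would be injected into."""
        for selector in self.selectors:
            if selector.selector_id == fault.selector_id:
                return selector.targets
        raise KeyError(fault.selector_id)


# Illustrative names only; none of these are real Azure resource identifiers.
experiment = Experiment(
    selectors=[Selector("web-vms", ["vm-web-01", "vm-web-02"])],
    logic=[Fault("cpu-pressure", "web-vms", {"load_percent": "95"})],
)
print(experiment.validate())
print(experiment.targets_for(experiment.logic[0]))
```

Keeping targets in selectors and referring to them by ID from the logic section means the same group of resources can be reused by several faults without repeating the target list.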

Based on the chaos experiment scenarios, you can test how (or if) your solution will survive. By injecting simple faults, you can continuously improve your infrastructure by challenging it with new fault scenarios.

Faults and actions

Actions triggered as part of an experiment can be executed in two ways:

  • Continuous – the action (fault) runs for a predefined amount of time. This can be helpful when you want to test app availability (scaling) under heavy load.
  • Run once (discrete) – the fault is executed once. This is a good fit when you want to cause an unexpected reboot and test auto-healing functionality.

Chaos experiment scenarios can have multiple stages. Faults disrupting operations for a resource (or group of resources) can be injected in a controlled manner. Time delays help set up a scenario by injecting “waits” with no fault injection. That time may be needed for the solution to recover from a previous fault before the next one is injected. This makes it possible to test multiple scenarios in a single experiment.
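
To make the interplay of continuous faults, discrete faults, and delays concrete, here is a minimal scheduling sketch. It assumes stages run strictly one after another and treats a discrete fault as instantaneous. The `Step` class, the `build_timeline` function, and the scenario itself are hypothetical and do not use any Chaos Studio API.

```python
from dataclasses import dataclass
from datetime import timedelta

VALID_KINDS = {"continuous", "discrete", "delay"}


@dataclass
class Step:
    kind: str                             # "continuous", "discrete", or "delay"
    name: str
    duration: timedelta = timedelta(0)    # discrete faults fire once, so zero length


def build_timeline(steps):
    """Return (start, end, step) tuples for steps executed one after another."""
    timeline = []
    clock = timedelta(0)
    for step in steps:
        if step.kind not in VALID_KINDS:
            raise ValueError(f"unknown step kind: {step.kind}")
        if step.kind == "discrete" and step.duration:
            raise ValueError("a discrete fault runs once and has no duration")
        end = clock + step.duration
        timeline.append((clock, end, step))
        clock = end
    return timeline


# Hypothetical scenario: stress the CPU, wait for recovery, then force a shutdown.
scenario = [
    Step("continuous", "CPU pressure", timedelta(minutes=10)),
    Step("delay", "wait for recovery", timedelta(minutes=5)),
    Step("discrete", "VM shutdown"),
    Step("delay", "wait for auto-healing", timedelta(minutes=5)),
]

timeline = build_timeline(scenario)
for start, end, step in timeline:
    print(f"{start} -> {end}  {step.kind:<10}  {step.name}")
```

The delay steps are what keep the two faults from overlapping: the shutdown is only scheduled once the recovery window after the CPU pressure has passed.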

Actions (faults) can cause disruptions on many levels, from killing a process up to causing a fault at the Azure service level. Some faults require a Chaos Studio agent to be installed on the target VM. Agent-based faults are required when the injected fault is executed at the system level (stress tests, memory pressure tests, process killing).

| VM | Network | AKS | Other |
|---|---|---|---|
| CPU pressure | DNS failure | AKS Chaos Mesh network faults | ARM virtual machine shutdown |
| Physical memory pressure | Network latency | AKS Chaos Mesh pod faults | ARM virtual machine scale set instance shutdown |
| Virtual memory pressure | Network disconnect | AKS Chaos Mesh stress faults | Cosmos DB failover |
| Disk I/O pressure (Windows) | Network disconnect with firewall rule | AKS Chaos Mesh IO faults | Azure Cache for Redis reboot |
| Disk I/O pressure (Linux) | Network security group (set rules) | AKS Chaos Mesh time faults | |
| Arbitrary Stress-ng stress | | AKS Chaos Mesh kernel faults | |
| Stop Windows service | | AKS Chaos Mesh HTTP faults | |
| Time change | | AKS Chaos Mesh DNS faults | |
| Kill process | | | |

The set of available failure scenarios is limited; you can see the up-to-date list in the fault library.

Azure Chaos Studio helps you improve the resilience of your systems. It can also help you better understand the systems' behavior during and after an unexpected fault.

Sources

  1. https://www.gremlin.com/chaos-monkey/
  2. https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
  3. https://docs.microsoft.com/en-us/azure/chaos-studio/