Software development teams often struggle to release software that isn’t tainted with a litany of performance issues and bugs. The truth is, it’s not easy to iron out every little problem in an application before its launch date – and conventional Testing in Production (TiP) can only do so much to minimize potentially costly defects.
Moreover, as companies continue to deliver more solutions via containers, the very act of testing has also become necessarily more complex. Enter chaos engineering.
Chaos Engineering debunks the prevailing wisdom that developers can identify all production failures; the idea is to deliberately create small ‘problems’ in software and fix them before they escalate.
The Origins of Chaos Engineering at Netflix
While overseeing Netflix's migration to the cloud in 2011, Greg Orzell had the idea to address the lack of adequate resilience testing by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered inevitable, driving developers to treat built-in resilience as an obligation rather than an option.
First dubbed "Chaos Monkey", this collection of tools allowed administrators to trigger random failures at random intervals. This unique testing approach made Netflix's distributed, cloud-based system much more resilient to system-wide faults.
The idea of intentionally causing problems in a system may sound counterintuitive at first, but it's a powerful tool that prepares you and your team for actual system outages. In many ways, chaos engineering is analogous to a fire drill: you run failure drills from time to time to discover the real scenarios in which system recovery falls short.
From this early version, the discipline of Chaos Engineering was officially born.
The Goal of Chaos Engineering
All system tests are designed to identify bugs. Chaos Engineering, however, can be thought of more as a "stress test," wherein systems are intentionally strained to determine how both the software and those entrusted with its upkeep will respond in the event of a failure.
Obviously, how hard a failure hits the bottom line depends greatly on the type of organization you're running and how directly that failure affects its operations.
According to a 2017 survey by ITIC, a single hour of downtime can cost an organization over $100,000. For at least a third of enterprises, a single hour of downtime can cost anywhere from $1 million to over $5 million. For these companies, there is obvious, inherent value in preventing such failures.
This is why organizations are more than happy to break their systems on purpose, uncover weaknesses, and patch them before they have a chance to break the customer experience.
The Four Steps of Chaos Engineering
Chaos engineering isn’t so much about ‘chaos’ as it is about ‘precision.’ You have to use precise engineering processes to test your systems for failure. In general, there are four steps involved in chaos engineering.
1. Form a hypothesis
Start by asking yourself, “What could possibly go wrong?”
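To make that question actionable, it helps to phrase the answer as a testable statement about the system's normal, steady-state behavior. Below is a minimal Python sketch of such a check, under assumed conditions: the health-check URL and the 200 ms latency threshold are hypothetical placeholders, not values from any particular system.

```python
# Sketch: a steady-state hypothesis expressed as a runnable check.
# The URL and thresholds are hypothetical placeholders.
import statistics
import time
import urllib.request


def measure_latencies(url: str, samples: int = 20) -> list[float]:
    """Return response times (in seconds) for a series of GET requests."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=5) as response:
            response.read()
        latencies.append(time.perf_counter() - start)
    return latencies


def steady_state_holds(url: str) -> bool:
    """Hypothesis: every request succeeds and median latency stays under 200 ms."""
    try:
        latencies = measure_latencies(url)
    except OSError:
        return False  # any failed request violates the hypothesis
    return statistics.median(latencies) < 0.2


if __name__ == "__main__":
    print(steady_state_holds("https://example.com/"))
```

Running the same check before and during an experiment is what turns "what could go wrong?" into something you can actually measure.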
2. Plan your experiment
Design experiments with variables that reflect real-world events, such as server failure, network malfunction, dependency failure, and so on. But try to do so in a way that won’t create problems for users.
Note: Most chaos experiments are performed in a production environment, while real users are interacting with the system. This is why you should plan the smallest experiments that won't disrupt the overall user experience.
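One lightweight way to plan an experiment is to write it down as plain data before touching anything: what you expect to happen, which fault you will inject, how far it may spread, and when to abort. The sketch below is purely illustrative; the field names and values are hypothetical and not tied to any specific chaos engineering tool.

```python
# A hypothetical experiment plan, captured as plain data before anything is run.
experiment_plan = {
    "hypothesis": "Checkout still completes if the recommendations service is slow",
    "fault": "inject 300 ms of latency into calls to the recommendations service",
    "scope": "5% of production traffic in a single availability zone",
    "duration_minutes": 15,
    "abort_conditions": [
        "checkout error rate exceeds 1%",
        "on-call engineer halts the run",
    ],
}

for key, value in experiment_plan.items():
    print(f"{key}: {value}")
```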
3. Minimize the blast radius
Don’t affect critical systems just yet. Rather, start with smaller experiments that teach you more about the program and its response to problems.
A few experiments that you can tinker around with include the following; a small fault-injection sketch follows the list:
- Simulating the failure of a data center
- Forcing system clocks to become out of sync
- Emulating I/O errors
- Simulating a high CPU load
- Creating latency between services
- Randomly causing functions to throw exceptions
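As a concrete illustration, the sketch below injects two of the faults listed above, random exceptions and added latency, around an ordinary function call. The decorator and the `fetch_recommendations` stand-in are hypothetical; real chaos tooling typically injects faults at the network or infrastructure layer rather than in application code.

```python
# Sketch: injecting random exceptions and latency into a downstream call.
# The decorator and fetch_recommendations function are illustrative only.
import functools
import random
import time


def inject_faults(failure_rate: float = 0.1, max_delay_s: float = 0.5):
    """Wrap a function so that a fraction of calls fail or are delayed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise RuntimeError("chaos: injected failure")
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.2)
def fetch_recommendations(user_id: int) -> list[str]:
    """Stand-in for a call to a real downstream service."""
    return [f"item-{user_id}-{i}" for i in range(3)]


if __name__ == "__main__":
    for uid in range(5):
        try:
            print(fetch_recommendations(uid))
        except RuntimeError as err:
            print(f"user {uid}: {err}")
```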
4. Measure the outcome of the experiment
Carefully compare the results gathered from the system in its steady state to those gathered after introducing a disturbance. If you find any differences, you can strengthen the system so that it holds up under similar stress in the future.
If your engineering team finds weaknesses in the system, consider the chaos experiment a success. If not, the team can expand the scope of its hypotheses.
Once weaknesses are found, the team can address and fix those issues before they can cause actual problems.
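A simple way to make that comparison concrete is to summarize the same metric before and during the experiment and flag any degradation beyond an agreed tolerance. The latency samples and the 1.5x tolerance in the sketch below are hypothetical placeholders, not real measurements.

```python
# Sketch: comparing steady-state measurements to measurements taken during
# an experiment. The sample values and tolerance are hypothetical.
import statistics

baseline_latencies_ms = [100, 105, 110, 108, 102]    # gathered before the experiment
experiment_latencies_ms = [150, 340, 160, 900, 170]  # gathered during the experiment


def summarize(samples: list[float]) -> dict[str, float]:
    ordered = sorted(samples)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
    }


baseline = summarize(baseline_latencies_ms)
during = summarize(experiment_latencies_ms)

# Flag a weakness if median latency degraded beyond an agreed tolerance.
tolerance = 1.5
weakness_found = during["median_ms"] > tolerance * baseline["median_ms"]

print(f"baseline: {baseline}")
print(f"during experiment: {during}")
print(f"weakness found: {weakness_found}")
```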
The Future of Chaos Engineering
Looking toward the future, a number of platforms are planning to offer chaos engineering as a set of tools to businesses. Among them is Gremlin, a chaos engineering startup that provides services for both Google and Netflix.
From Idea to Platform
In the years since Netflix released Chaos Monkey, an entire suite of tools has come to market in the swiftly growing Chaos Engineering space.
Gremlin is one of the larger companies in the space. Since 2017, the company has raised more than twenty-five million dollars to develop its platform, which offers solutions for customers running on either AWS or Microsoft Azure. Similar offerings include Litmus, Pumba, and the Chaos Toolkit (an open source initiative with many tools), and AWS itself has announced the forthcoming release of its own toolset, the AWS Fault Injection Simulator.
Taken together, it’s clear that in a very short time Chaos Engineering has crossed the chasm from concept to discipline.