Chaos days
A chaos day is a day when we deliberately break parts of a running system to test and practise our incident response. The main aim is to identify gaps in knowledge, tooling, or anything else that hinders our ability to react.
Why do we do them?
It's a fact of life that software systems break — especially complex, distributed systems like the ones we often work on with our clients. Things will go wrong no matter how hard we strive to make them reliable.
We can spend time and money trying to minimise breakage, but returns diminish after a certain point. A system will never reach 100% reliability. If we pretend that we've created an unbreakable system, something will go wrong and take us by surprise.
If we accept that a certain amount of breakage is inevitable, we should strive to be as prepared as possible to notice it. We can then understand what's wrong and fix it in a reasonable amount of time. Among other things, we do this by following good practices around observability. This means:
- Creating alerts to warn us when things aren't right (a minimal example follows this list).
- Creating dashboards to observe the behaviour of the system.
- Using good tools that will allow us to debug issues and observe the state of the system accurately.
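As a minimal illustration of the first point, the sketch below shows a synthetic check that fires an alert when a service is unhealthy. The endpoint and webhook URLs are hypothetical placeholders; in practice most teams would rely on a dedicated alerting stack (such as Prometheus Alertmanager) rather than a hand-rolled script, but the principle is the same: something other than a human watching a dashboard should notice the breakage.

```python
# A minimal sketch of an alerting check. HEALTH_URL and WEBHOOK_URL are
# hypothetical placeholders, not part of any real system described here.
import requests

HEALTH_URL = "https://orders-service.dev.example.com/healthz"   # hypothetical
WEBHOOK_URL = "https://chat.example.com/hooks/team-alerts"      # hypothetical


def check_and_alert() -> None:
    """Probe the service and post an alert to the team channel if it looks unhealthy."""
    try:
        healthy = requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False

    if not healthy:
        # Tell the team before the users do.
        requests.post(
            WEBHOOK_URL,
            json={"text": f"ALERT: {HEALTH_URL} is unhealthy"},
            timeout=5,
        )


if __name__ == "__main__":
    check_and_alert()
```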
We also try to share knowledge so that, when an incident happens, we have the right skills and experience to put things right.
But how do we know we're actually prepared? Are we sure our alerts will tell us when there is a problem? Are we sure we'll then be able to fix it? We don't have to assume we're ready. We can test it! This is where chaos days come in.
Regularly practising chaos days has the following benefits:
- A chance to get comfortable with the incident management process in a practice situation rather than a live incident.
- Identification of gaps in alerting coverage, logging, metrics, and general observability of the system.
- A clearer picture of how prepared the team is to work together to fix incidents.
- A chance to debrief on how we handled a simulated incident and gather actionable insights on what to improve.
How to run a chaos day
Start with a planning session with the team. These are some valuable things to find out before the chaos day:
- Which systems/environments are in scope and can be broken?
- Which other squads rely on your systems and may be impacted by testing?
- Are there parts of the system that cannot be touched and must not break?
- Are there any existing gaps in support coverage for your systems that you already have plans to fix?
- Is there anything you need to put in place before the day to make the test realistic? (This will often include steps to make dev more production-like. If the testing is happening in dev, for example, you might need to make sure the same alerts fire as in production.)
- Do you understand what will happen on the day, and is the squad prepared? (Do they understand the incident process? How will they decide who manages each incident? etc.)
Things to plan and note for each scenario
It's worth making a plan for each scenario you will carry out.
Before the test
- Describe the steps you are going to take to cause one or more problems during the test.
- Describe the steps you will take to clean up and make sure everything is back to normal after the test (both the break and the cleanup can often be scripted; see the sketch after this list).
- What do you expect will be the solution to the problem?
- What lessons do you expect the squad will learn from fixing the problem?
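For example, the break and cleanup steps for a simple scenario can be captured in a small script like the sketch below. It assumes the system runs on Kubernetes and uses a hypothetical deployment name; adapt the commands to whatever your system actually runs on.

```python
# A sketch of a scripted scenario: break the service, then restore it.
# NAMESPACE and DEPLOYMENT are hypothetical and assume a Kubernetes-based dev
# environment; substitute whatever applies to your own system.
import subprocess

NAMESPACE = "dev"               # assumption: the chaos day is running in dev
DEPLOYMENT = "orders-service"   # hypothetical service chosen for this scenario


def kubectl(*args: str) -> None:
    """Run a kubectl command in the target namespace, failing loudly on errors."""
    subprocess.run(["kubectl", "-n", NAMESPACE, *args], check=True)


def break_it() -> None:
    # Simulate an outage by scaling the service down to zero replicas.
    kubectl("scale", f"deployment/{DEPLOYMENT}", "--replicas=0")


def clean_up() -> None:
    # Restore the service and wait until the rollout reports healthy again.
    kubectl("scale", f"deployment/{DEPLOYMENT}", "--replicas=2")
    kubectl("rollout", "status", f"deployment/{DEPLOYMENT}")
```

Scripting both halves keeps the scenario repeatable and makes it easy to get the environment back to a known-good state before the next scenario starts.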
During the test
Are the squad doing anything interesting while solving the problem, and are they solving it differently from how you expected?
After the test
What actions will the squad take to address issues found in the test? These could include fixing gaps in observability, adding technical mitigations, addressing knowledge gaps within the team, and so on.
Top tips
There are no hard and fast rules, but here are some tips that should help to get the most out of chaos days:
- Decide if you're going to break dev or prod. You should start with dev until you've run a few chaos days and are confident in the team's ability to find and fix issues.
- Write up scenarios beforehand. This should include the problem, the cleanup, the expected solution and the expected things that the team will learn.
- Consider surprise chaos days. This requires buy-in from the target team but allows for more realistic scenarios where devs are less prepared.
- Don't just break backend services. By considering ways to break frontend systems, you will discover whether the team monitors the pieces of UI they are responsible for.
- Set follow-up exercises to test the team's ability to observe what happened. For example, how many users did it affect? When did the problem start? Role-play these questions coming from other parts of the business if it helps (e.g. manager X wants to know how many sales we lost during the outage); a sketch of answering this kind of question from metrics follows these tips.
- Have a good knowledge of the system so you understand what might break. It can be helpful to include a team member, such as an automation engineer, in creating scenarios. The added benefit is that the team won't be able to rely on the automation engineer to help them debug.
- Warn other teams about the likely scope of breakage. Remember that even breaking dev could impact other teams. If a problem will need interaction with another team to solve, warn them first and check they have time to help.
- Make sure the team knows how to triage incidents and write reports.
- Ask the team to treat incidents as if they were real. They should broadcast them to the company but make sure to prefix everything with [CHAOS DAY] to avoid causing panic outside the team.
- Don't underestimate how long it will take. Do the highest-value scenarios first. Don't be afraid to break lots of things and see what happens.
- Be evil! Don't be tempted to help the team. Leave them to flounder if needs be; they'll learn more about their knowledge gaps that way.
- Let the team find out about breakages from alerts rather than by telling them. They shouldn't be watching dashboards either. If alerts don't happen, role-play a scenario (e.g. 'customer services are getting a lot of calls about X').
- Listen in when they are problem-solving and make notes. It's interesting when they do things you don't expect.
- Run a mini retro after each scenario. Use this to document things the team has learnt. Make a note of actions (e.g. improve alerting for X, add a dashboard for Y, learn more about failure modes of Z) and get them on the team's backlog.
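As mentioned in the follow-up exercises tip, questions like "how many requests failed?" and "when did the problem start?" can usually be answered from metrics. The sketch below is one way to do that, assuming a Prometheus server is available and the team records a request counter with a status label; the server URL and metric name are illustrative, not prescriptive.

```python
# A sketch of answering follow-up questions from metrics. PROMETHEUS_URL and
# the http_requests_total metric are illustrative assumptions about the stack.
import requests

PROMETHEUS_URL = "https://prometheus.dev.example.com"   # hypothetical


def error_rate_during_outage(start: str, end: str) -> list:
    """Return the 5xx error rate over the outage window.

    start/end are RFC 3339 timestamps, e.g. "2024-01-01T10:00:00Z".
    """
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={
            "query": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
            "start": start,
            "end": end,
            "step": "60s",
        },
        timeout=10,
    )
    response.raise_for_status()
    # Each series in the result holds [timestamp, value] pairs; the first
    # timestamp with a non-zero value gives a rough answer to "when did the
    # problem start?".
    return response.json()["data"]["result"]
```

Pointing the same window at a sales or orders metric, if one exists, answers the "how many sales did we lose?" style of question from the role-play.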