The purpose of this document is to serve as a guide to incident postmortems. This should help establish the postmortem process and achieve the best fit for the organisation or team.
There already exist some excellent online resources that go more in-depth on the subject; links to these can be found below.
What is an incident postmortem?
An incident postmortem is a process for learning from past incidents and identifying improvements that can reduce the risk of repeat occurrences.
Postmortems go by several different aliases:
- Learning Review
- After-Action Review
- Incident Review
- Incident Report
- Post-Incident Review
- Root Cause Analysis (or RCA)
Source: PagerDuty -- What is an Incident Postmortem?
Setting the tone
The naming of the process may highlight subtle variations between the organisational cultures that use them. Some sources suggest that "postmortem" may carry negative connotations (e.g. blame), whereas "learning review" helps shift the focus from "what or who is to blame?" to "what can we learn?"
To encourage adoption, a positive, blameless approach should be sought. No one should feel discouraged from engaging for fear of having a finger pointed in their direction. Therefore, the naming may be an important first step in achieving the right tone.
Ideally, the team involved in the process should own it with no one person left solely responsible. That's not to say there couldn't be an appointed guardian or champion to encourage adoption, while ensuring quality and consistency. Perhaps the role is light-touch or even short-lived until the process is firmly bedded in.
For the team to believe they own the process, there must be a clear pay-off. No one should ever be left asking, "Why am I doing this?"
Who will be reading your postmortems? Will they be exclusive to a technical team or are there other business stakeholders?
A managed service will have a customer, who may (or may not) already have a process of their own.
- Is there a requirement to write for a broader audience?
- Is there a balance to be struck between technical and business language?
- Is there an expectation (contractually or otherwise) to align with their established process?
- If so, how closely must it align?
- Is there an opportunity to improve upon their process?
Consistency will help with adoption of the process. Consider such things as:
- a template with:
- a consistent title, e.g. Postmortem: Product Name (dd-mm-yyyy)
- clear sections with annotations for guidance
- a permanent, accessible location for easy referencing of past postmortems
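As a sketch of what such a template might look like (the section names below are illustrative, drawn from the example structure later in this document, and not prescribed):

```markdown
# Postmortem: Product Name (dd-mm-yyyy)

## Summary
<!-- Affected service(s); degradation or outage; data loss; human intervention;
     sustained or intermittent; any reoccurrence and related service tickets -->

## Business impact

## Root cause(s)

## Timeline

## Supporting evidence

## Actions

## Lessons learned
```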
There's potentially a lot of data to be gathered in preparation for an incident postmortem. However, this should never distract from resolving an incident.
As PagerDuty say,
During incident response, the team is 100% focused on restoring service. They can not, and should not, be wasting time and mental energy on thinking about how to do something more optimally, nor performing a deep dive on figuring out the root cause of an outage.
However, some effort may be undertaken by an elected individual to capture details during an incident (perhaps in a service ticket), provided it's non-impacting. This individual could be, for example, a duty manager. The sort of details to capture may be one or more of the following:
- timeline of events
- pasted logs
- pasted screenshots
- chat excerpts
Having these artefacts ahead of the postmortem could open up more avenues to explore, which could expose a key detail that helps identify the root cause.
Reports should be consistent in structure, and a templated report can help. Beyond consistency, a template means one less barrier and one less distraction.
There are many examples available online, with some common overlaps. The following is just one example of some of the sections and their purposes.
It's useful to display some key details of the incident at the very head of the report. This may come in the form of a table.
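For example, a header table might look like the following (the fields and values shown are illustrative, not prescribed):

```markdown
| Date       | Duration | Affected service(s) | Impact                    | Status   |
|------------|----------|---------------------|---------------------------|----------|
| 01-01-2024 | 1 hour   | Price comparison    | Intermittent loss of site | Resolved |
```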
Brief summary of the incident including:
- Affected service(s)
- Service degradation (e.g. increased response times) or outage (e.g. serving a maintenance page)?
- Was there any data loss?
- Was there any human intervention during the incident? If so, what?
- Was the incident sustained or intermittent?
- Did the incident reoccur, and if so, are there any related service tickets?
Summarise the business impact here.
e.g. Intermittent loss of site for a 1-hour duration meant 80% of users were unable to compare prices on items.
A brief explanation of the root cause(s) should be detailed here. If none is found, a brief explanation of why this wasn't possible may be more useful than simply leaving the section blank.
A timeline of events throughout the incident. This should highlight key events, including alerts, actions to mitigate/fix, communication, etc. This can also include events leading up to the incident, including any root cause(s) established as part of investigative work.
This is best displayed in a table, with date and time on the left and event on the right.
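For example (the times and events here are illustrative):

```markdown
| Date/time        | Event                                                 |
|------------------|-------------------------------------------------------|
| 01-01-2024 09:00 | Response-time alerts fired for the affected service   |
| 01-01-2024 09:05 | On-call engineer acknowledged and began investigating |
| 01-01-2024 10:00 | Short-term fix applied; site returned to service      |
```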
This should be a list of resources that may be collated as supporting evidence. First establish a list of available resources, then check each one off once it has been secured.
- App server log files
- Database server log files
- Snapshot of app server data volume [for offline inspection]
- Relevant monitoring dashboards
This should list the short-term and long-term actions taken to mitigate and fix both intermediate and root causes. Include ticket numbers where possible to track supporting work.
This is best displayed in a table and, as a minimum, should include an agreed action and a ticket reference to track as appropriate. Additional columns may include a description and a priority, but you may wish to keep these details within the referenced tickets.
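For example (the ticket references here are illustrative):

```markdown
| Action                                      | Ticket   |
|---------------------------------------------|----------|
| Create a run book for this failure mode     | TICKET-1 |
| Add monitoring to catch the cascading issue | TICKET-2 |
```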
An easy-to-reference, bulleted list of lessons learned. Avoid storybook prose in favour of key points so the list can be quickly digested.
What went well
It can be hard to think about the positives following an incident, but there will likely always be at least one. Recognise it, record it and keep doing it.
- we all remained calm and focussed throughout
- we quickly put a short-term fix in place to return the site to service
- the CTO bought us pizza
What went wrong
What was unfortunate about the incident that could be avoided in future?
- this has happened twice before but we didn't have a run book to follow (an action could be taken to create a run book for next time)
- the primary issue cascaded and caused another issue before we could mitigate for the first, extending the time to fix
Where we got lucky
What happened by chance that worked in our favour? This is about recognising gaps that won't always be filled by luck, so they can be addressed where possible.
- a visiting engineer used to support the product and held key knowledge that helped us resolve the incident more quickly
- the incident occurred outside of peak time, so the impact was reduced
Prefix each item with Fig. n for quick reference within the document, accompanied by a short description.
e.g. Fig. 1 -- screenshot of response codes before, during and after the incident.
Evidence may comprise:
- screenshots of monitoring graphs (and relative links)
- snippets of log errors
- screenshots of user-facing issues caused by the incident