This section covers operating the software produced in the build activities.
This is a set of defaults for teams to use, but is not mandatory if teams have good reason to do something different (see What this is — and is not).
- Software running in production
- Reliable services
In addition to the recommended defaults described here, use external validation tools such as the AWS or Azure Well-Architected Frameworks, which cover similar ground from a slightly different angle.
Start by understanding what is needed
Some questions to ask before starting delivery:
- Are you replacing an existing system?
- How will system functionality be redesigned to be simpler and leaner, rather than simply reimplementing every feature?
- What data migration will be needed?
- Will there be dual running or a hard switchover? How can you roll back?
- How will users be migrated or onboarded?
- Will there be a service outage?
- How will you know how users have been affected by the change? For example, through monitoring, analytics or user feedback mechanisms.
- What are the non-functional requirements?
- Treat these as unspoken user needs the system must meet.
- Capture these as Service Level Objectives (SLOs).
- What existing or agreed technologies are in place and available to use, or are required to be used?
- Stated technology requirements are not necessarily fixed, but you must justify and negotiate any alternative proposal. We are there to recommend; the customer ultimately makes the decisions.
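One way to capture non-functional requirements as SLOs is as small, reviewable records. The sketch below is illustrative only: the `Slo` class, its field names and the example target are assumptions, not something this guidance prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical record for one Service Level Objective."""
    name: str
    target: float      # fraction of good events required, e.g. 0.999
    window_days: int   # rolling window the target applies to

    def error_budget(self, total_events: int) -> int:
        """Number of bad events tolerated for this volume of traffic."""
        return round(total_events * (1 - self.target))

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True when the observed success ratio meets the target."""
        if total_events == 0:
            return True  # no traffic, so nothing has breached the objective
        return good_events / total_events >= self.target


# Example objective; the name and numbers are placeholders.
availability = Slo(name="checkout availability", target=0.999, window_days=30)
```

Expressing the objective as a ratio of good events to total events makes the error budget explicit, which in turn gives a concrete basis for dashboards and alerts.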
Design and build for operability
Keep live operations at the front of your mind when designing systems. Choose technologies and architectures which:
- are simple and minimise cleverness and complexity.
- support good observability.
- support required performance, scalability and reliability.
- are highly automatable, to allow for easy maintainability.
- Prefer serverless managed services where practical.
- Using these means that much of the operational overhead is transferred to the cloud platform provider, letting you focus on delivering business value.
- Split systems into the right number and size of components:
- Not too big, or too small.
- So that each has a clear purpose and clean interfaces.
- Such that each may be developed, deployed and operated independently.
- In agreement with the structure of the organisation (i.e. team structures), though note that the organisation structure may need to change to achieve this (Conway's Law and its inverse).
- Focus on getting live with a simple steel thread / walking skeleton as soon as possible.
- Identify which metrics matter most to users — your Service Level Objectives.
- Build simple, accessible dashboards to surface those metrics (see Monitoring & Alerting).
- Augment with more detailed dashboards to drill down as required.
- Configure automated alerts — again, based on the agreed SLOs.
- Ensure logs are readily accessible and queryable (see Structured Logging).
- Consider implementing distributed tracing (see Tracing).
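To make logs queryable, a common approach is to emit each entry as one JSON object per line. The following is a minimal sketch using only the Python standard library; the `JsonFormatter` class, the `context` attribute and the example field names are assumptions chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged in
        # so that they can be filtered on in a log query tool.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger("orders")
logger.info("order placed", extra={"context": {"order_id": "A123", "latency_ms": 42}})
```

Keeping identifiers such as `order_id` as separate fields, rather than interpolating them into the message, is what allows a query like "all errors for order A123".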
Essential reading: The recommended default is to follow the guidance in Secure Engineering, which discusses many aspects of designing and building with security in mind.
- Secure your software supply chain.
- Ensure authentication and authorization are robust.
- Bake guards for OWASP Top 10 vulnerabilities into your delivery process, including product and technical design, code review and automated checks.
- Ensure the infrastructure is secure, using automated tools to verify.
- Put in place appropriate scanning and testing.
- Cover people factors.
Performance and scalability
- Design with performance and scalability in mind.
- Consider performance and scalability during technical elaboration, implementation and validation of each backlog item.
- Perform regular load tests to verify performance and scalability (see Load Testing).
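A regular load test can start very small. The sketch below, using only the standard library, calls an operation concurrently and reports median and 95th percentile latency. `fake_request` is a stand-in for a real call to the system under test, and the request and concurrency numbers are arbitrary assumptions; a dedicated load testing tool is likely to be a better fit for anything beyond a smoke check.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def timed_call(operation: Callable[[], None]) -> float:
    """Return the wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def run_load_test(operation: Callable[[], None], requests: int, concurrency: int) -> dict:
    """Call `operation` `requests` times across `concurrency` threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda _: timed_call(operation), range(requests)))
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(durations, n=20)[-1]
    return {
        "requests": len(durations),
        "median_s": statistics.median(durations),
        "p95_s": p95,
    }


def fake_request() -> None:
    """Placeholder for a real request to the system under test."""
    time.sleep(0.01)


result = run_load_test(fake_request, requests=40, concurrency=8)
```

Comparing `p95_s` against the agreed SLO on each run turns the load test into a pass or fail check rather than a report that someone has to interpret.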
Reliability
- Design with reliability in mind.
- Consider reliability during technical elaboration, implementation and validation of each backlog item.
- Perform regular tests to verify the reliability of both automated system responses and incident management processes.
Maintainability
- Design with maintainability in mind.
- Invest in automation to reduce manual maintenance toil.
- For example, deployment, certificate updates, credential refreshes, log management, backups, scaling up and down.
- Automate the detection and, where possible, the application of upgrades and security updates.
- Ensure support staff have the access they need to deal with unexpected occurrences easily and safely.
- For example, always-on read access to production along with audited "break-glass" time-limited write access when required.
- Ensure non-production environments are as close as possible to production.
- Ensure documentation is up to date.
- System architecture and design, at the right level.
- READMEs and other code-level documentation.
- Runbooks for common problem-solving and resolution routines.
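The audited "break-glass" access mentioned above can be illustrated with a small in-memory model. This is a sketch only: in practice the cloud platform's identity tooling would enforce this, and the `BreakGlassGrants` class, the one-hour limit and the incident reference are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BreakGlassGrants:
    """Hypothetical model of audited, time-limited write access."""
    max_duration: timedelta = timedelta(hours=1)  # assumed limit
    _expiry_by_user: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def grant(self, user: str, reason: str, now: datetime) -> datetime:
        """Record why access was requested and when it will lapse."""
        expires_at = now + self.max_duration
        self._expiry_by_user[user] = expires_at
        self.audit_log.append((now, user, "granted", reason))
        return expires_at

    def can_write(self, user: str, now: datetime) -> bool:
        """Write access exists only while an unexpired grant is held."""
        expires_at = self._expiry_by_user.get(user)
        return expires_at is not None and now < expires_at


grants = BreakGlassGrants()
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
grants.grant("alice", "INC-42: restart stuck worker", now=start)
```

The two properties that matter are that access expires without anyone having to remember to revoke it, and that every grant leaves a record of who asked and why.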
Preparing for live support
As well as preparation and rehearsals for all of the non-functional aspects outlined above, the live support process must be agreed and documented:
- Document the Service Level Agreement (SLA) to accompany the agreed Service Level Objectives (SLOs).
- Decide the parameters that determine when an incident is occurring and how severe it is. Severity is often described as P1, P2 or P3, with P1 the most severe.
- Common determining factors are whether the system is fully/partially/not available to users, response times or error rates.
- For each incident severity level, agree the communication mechanism and frequency.
- Agree which team(s) will have operational responsibility.
- If responsibility is to be shared, what will the hand-offs be and how will information be shared?
- How will user-reported issues be raised, triaged, prioritised and managed through to resolution, including communication?
- What system(s) will be used to manage and communicate incident reports?
- Agree a support rota with identified people to fulfil the Support Lead and Support Engineer roles. Even if specific individuals are capable of performing either of these roles, at any one time there should be two separate people covering the two roles. In the heat of the moment, this separation can become important — somewhat like the driver/navigator distinction in pair programming.
- The Support Lead is responsible for:
- Coordinating activities during an incident.
- Communicating in line with the SLA.
- Deciding the course of action to diagnose and restore service using input from the Support Engineer.
- Calling in help if needed.
- The Support Engineer is responsible for:
- Diagnosing and fixing the incident (subject to agreement from the Support Lead).
- Keeping records of hypotheses and actions taken.
- Keeping the Support Lead informed.
- Think about how support will be handled outside office hours.
- Will people be on call? If so, you have a few things to consider. Ask around as other teams have solved these problems before.
- You will need to agree on how people doing support will be compensated.
- You might need to put in place new commercial arrangements with your client.
- You might need to buy and configure new tools, such as Pingdom and PagerDuty.
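The severity parameters discussed earlier in this list (availability, response times, error rates) can be written down as code so that everyone classifies incidents the same way. A minimal sketch follows; the `classify` function and every threshold in it are placeholder assumptions, to be replaced with values agreed for your service.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    P1 = 1  # most severe
    P2 = 2
    P3 = 3


def classify(available: bool, error_rate: float, p95_latency_ms: float) -> Optional[Severity]:
    """Return a Severity, or None when no incident threshold is crossed.

    All thresholds below are illustrative placeholders.
    """
    if not available or error_rate >= 0.25:
        return Severity.P1
    if error_rate >= 0.05 or p95_latency_ms >= 2000:
        return Severity.P2
    if error_rate >= 0.01 or p95_latency_ms >= 1000:
        return Severity.P3
    return None
```

For example, `classify(available=True, error_rate=0.06, p95_latency_ms=300)` yields `Severity.P2` under these placeholder thresholds, which would then select the communication mechanism and frequency agreed for P2 incidents.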
If an incident is detected:
- Focus on restoring service quickly and safely.
- Make sure the Support Lead and Support Engineer are both aware and engaged.
- Keep the communication flowing, ideally on Slack so you have a written record.
- Quickly agree what the issue is and the severity of the incident. Do not spend too long trying to get this right; it can always be revised later as more is learned.
- Form hypotheses for what might be causing the issue.
- Select one hypothesis and do safe, minimally invasive tests to verify whether it is correct.
- When you find a cause, propose a fix to restore service. Get agreement between the Support Lead and Support Engineer, and customer stakeholders as appropriate, before proceeding.
- Make sure you keep adequate notes as you work for later analysis. Writing things down clearly is also a good way to clarify your understanding and keep people informed.
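Notes are easier to analyse later if each hypothesis, test and action is timestamped as it happens. The sketch below shows one possible shape for such a record; the `IncidentLog` class, the entry kinds and the example incident are hypothetical, and a shared chat channel or incident tool may serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class IncidentLog:
    """Hypothetical timestamped record of hypotheses and actions."""
    title: str
    entries: list = field(default_factory=list)

    def note(self, kind: str, text: str, at: Optional[datetime] = None) -> None:
        """Append an entry, defaulting to the current UTC time."""
        at = at or datetime.now(timezone.utc)
        self.entries.append((at, kind, text))

    def timeline(self) -> str:
        """Render entries in chronological order for the review."""
        lines = [f"{at:%H:%M} [{kind}] {text}" for at, kind, text in sorted(self.entries)]
        return "\n".join(lines)


log = IncidentLog("Checkout errors")
t0 = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
log.note("hypothesis", "Database connection pool exhausted", at=t0)
log.note("test", "Checked pool metrics: saturated", at=t0 + timedelta(minutes=5))
print(log.timeline())
```

A timeline in this form can be pasted directly into the Post Incident Review.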
Learn from incidents:
- Perform a Post Incident Review (see Incident Postmortems).
- Feed any improvement work into your delivery plan to prevent future incidents.
- The best way to ensure incidents are handled effectively is to rehearse regularly using approaches like chaos engineering and Game Days.
- This helps improve the way the system is engineered, the automation and observability, and in particular helps you hone processes and practices so they are second nature when a real incident happens.
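For a Game Day rehearsal, one simple technique is to wrap a dependency call so that a controlled fraction of calls fail. The sketch below is an assumption about how that might look, not a reference to any particular chaos engineering tool; `with_faults` and `InjectedFault` are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a rehearsal."""


def with_faults(operation, failure_rate: float, rng: random.Random):
    """Wrap `operation` so that a fraction of calls fail deliberately."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return operation(*args, **kwargs)
    return wrapper


# A seeded generator keeps a rehearsal repeatable.
flaky_lookup = with_faults(lambda key: key.upper(), failure_rate=0.2, rng=random.Random(7))
```

Running the system against `flaky_lookup` in a non-production environment gives the team a safe way to check that alerts fire, dashboards show the problem and the support roles work as agreed.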