Delivery Quality questions
Introduction
These questions are used for delivery quality reviews.
Scoring
The team scores each area to highlight where they want to focus their efforts:
- Exemplar, though may not be perfect
- Good
- Some improvement needed / work to do
- Moderate improvement needed / work to do
- Significant improvement needed / work to do
Scores are relative to the Infinity Works shared view of what good looks like, as captured in this question set. These questions have evolved over time, with many contributors from across Infinity Works, and will continue to be refined. See contributing for ways to get involved.
This is a self-assessment
- The review is carried out by the team, not "done to" the team.
- The emphasis is on discussion and action, not on the scores.
- Scores help the team focus attention on where to concentrate efforts to improve.
- Scores can help teams communicate and escalate issues outside the team.
- Scores cannot be compared between teams, but they can help spot common issues that would benefit from a coordinated effort across teams.
You may also need to consider the GDS Service Manual. This Delivery Quality review cannot replace that process, but it can help teams prepare for GDS Service Assessments.
1. Goals
- Are the business goals clear and visible to all?
- Are objectives backed by clear business metrics / KPIs?
- How is user feedback incorporated into decision making?
- Are users part of the development process?
- Are non-functional requirements understood and written down?
2. Plan
Read Epic Roadmap for background.
- Is there a plan for the next 3–6 months?
- Is the plan at the right level? e.g. a small number of epics per sprint.
- Are milestones clearly shown?
- Is the plan up to date and complete?
- Does the team have regular backlog refinement sessions?
- Does everyone in the team understand the plan?
- Does everyone believe the plan?
- Is it a forecast of what is most likely to happen?
- Does the plan deliver business value iteratively, starting with a walking skeleton?
- Is the plan "operations first"?
- Does the plan include the right amount of maintainability and operability?
3. Delivery
- Is there a defined delivery process that everyone sticks to?
- Is there a clear definition of ready, a definition of done, and exit criteria for each column?
- Are analysis and elaboration done prior to implementation?
- Are stand-ups effective?
- Is functionality delivered as thin vertical slices?
- Are items typically delivered in two days or less, from backlog to done?
- Is each item individually demo-able?
- Does the team do trunk-based development, with short-lived feature branches off a single long-lived main branch?
- Is all code reviewed, and do all tests pass before merging?
- Can code changes be traced to business requirements? e.g. using an issue reference in code commits (see the sketch at the end of this section).
- Is continuous improvement embedded in the ways of working?
- Does the team have regular retrospectives?
- Do team members look forward to retrospectives?
- Do retrospectives drive valuable improvements?
- Are bugs and tech debt dealt with, and not left to accumulate?
- Do stakeholders have good visibility of progress?
- Does the team have sprint reviews and produce sprint reports?
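For illustration, here is a minimal sketch of a commit-msg hook that rejects commits without an issue reference, one lightweight way to get the traceability described above. The PROJ-123 key format is an assumption; adapt the pattern to whatever your tracker uses.

```python
#!/usr/bin/env python3
"""Minimal commit-msg hook sketch: reject commits whose message has no issue reference."""
import re
import sys

# Assumed issue key format, e.g. PROJ-123; adjust to match your tracker.
ISSUE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def main() -> int:
    # Git passes the path of the commit message file as the first argument.
    with open(sys.argv[1], encoding="utf-8") as f:
        message = f.read()
    if ISSUE_PATTERN.search(message):
        return 0
    print("Commit rejected: include an issue reference (e.g. PROJ-123) in the message.")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```

Saved as `.git/hooks/commit-msg` and made executable, Git runs it for every commit; the same pattern check can also run in the pipeline to cover merges.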
4. Team
These questions are about the entire team. If you're working in an augmented team, also include the client folks in your thinking.
- Is the team a fun place to be?
- Do people trust each other?
- Do they go out of their way to help each other?
- Are team members learning new things?
- Is the pace of work sustainable?
- Do people give honest feedback when something isn't right?
5. Risks and decisions
- Are risks and issues tracked?
- Are they actively worked on to ensure prompt resolution?
- Are they communicated and escalated effectively?
- Are technical and product decisions recorded?
- Is there a clear sign-off for each decision?
6. Skills and knowledge
- Does the team have the skills and knowledge it needs?
- Are skills and knowledge well spread between team members?
- Do team members work across all deliverables and are knowledge silos avoided?
- What approaches do the team use to develop skills and share knowledge?
- Does the team have good documentation?
- Does the documentation cover architecture and operability?
- Are the docs kept up to date?
- Is onboarding of new team members quick and easy?
7. User-centred design
- Does the team have access to appropriate design and research skills? (Product Design, Service Design, User Research?)
- Are designers and researchers integrated in the team?
- Are they attending the same team events, using shared tooling, aligned to the same roadmaps, collaborating with other disciplines, and planning and tracking their work using similar processes?
- How well does the whole team understand who the users are and what they need?
- Can the team get access to real or proxy end-users on a regular basis (every 6 weeks)?
- How regularly is the team engaging with users?
- Is user research being conducted and stored in line with GDPR?
- Is the wider team involved in design and research activities at least monthly?
- Do designers and researchers influence product decisions?
- Does the team have enough time to explore and iterate solutions?
- Are designers empowered to solve problems, not just design to requirements?
- Does the team measure the outcomes and impact of their work?
8. Healthy code base
- Is the code clean, easy to read, well-structured, and safe to work with?
- Does the code have good unit test coverage?
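As a rough illustration of what good unit test coverage looks like in practice (small, fast, behaviour-focused tests), here is a sketch; `calculate_discount` is a hypothetical function used only for the example.

```python
import unittest


def calculate_discount(order_total: float, loyalty_years: int) -> float:
    """Hypothetical business rule: 5% discount per loyalty year, capped at 20%."""
    return order_total * min(loyalty_years * 0.05, 0.20)


class CalculateDiscountTest(unittest.TestCase):
    def test_no_loyalty_means_no_discount(self):
        self.assertEqual(calculate_discount(100.0, 0), 0.0)

    def test_discount_grows_with_loyalty(self):
        self.assertAlmostEqual(calculate_discount(100.0, 2), 10.0)

    def test_discount_is_capped(self):
        self.assertAlmostEqual(calculate_discount(100.0, 10), 20.0)


if __name__ == "__main__":
    unittest.main()
```

Tests like these document the behaviour, run in milliseconds, and make refactoring safe.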
9. Functional testing
- Is testing everyone's responsibility?
- Is there good automated test coverage?
- Are integration tests quick and effective? (e.g. DB or API interactions)
- Is there explicit contract testing? (see the sketch at the end of this section)
- Are whole-system tests reliable and maintainable?
- Is test data valuable, and is it created and managed reliably?
- Are tests required to pass before code is merged?
- Can all tests be run locally?
- Does the team do effective exploratory testing?
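One lightweight way to make contract testing explicit is for the consumer to pin the response shape it depends on as a schema and fail fast when it drifts. The sketch below assumes a hypothetical /orders endpoint and uses the jsonschema library; consumer-driven tools such as Pact go further by verifying the contract against the real provider.

```python
from jsonschema import validate  # pip install jsonschema

# The shape this consumer relies on (hypothetical endpoint and fields).
ORDER_CONTRACT = {
    "type": "object",
    "required": ["id", "status", "total"],
    "properties": {
        "id": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "paid", "shipped"]},
        "total": {"type": "number"},
    },
}


def test_order_response_matches_contract():
    # In a real test this payload would come from the provider or a recorded stub.
    response_body = {"id": "abc-123", "status": "paid", "total": 42.5}
    validate(instance=response_body, schema=ORDER_CONTRACT)  # raises ValidationError on drift
```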
10. Tech and architecture
- Do the tech and architecture work well for the team?
- Is the developer experience good?
- Do these make testing easy?
- Do they make it easy to operate services reliably?
- Is the tech strategy clear?
- Is the architecture modern?
- If changes are needed, is there a roadmap?
11. Deployment
Read CI/CD for background:
- Does CI/CD work well?
- Is a release candidate built on merge?
- Is this release candidate progressed through environments rather than built specifically for each environment?
- Does each environment have a clear purpose?
- Is every environment in the deployment pipeline identical?
- Are deployments fully automated — can automation create an environment from scratch?
- Is it possible to deploy or roll back to any recent version?
- Is it easy to release a change all the way to production?
- (Ideal) Can changes be released on demand, multiple times per day?
- (Ideal) Do deployments follow a blue-green / canary pattern, with automated roll-back if unhealthy?
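As a sketch of that ideal, the outline below cuts traffic over to a new (green) stack, watches a health endpoint for a short soak period, and rolls back automatically if it is unhealthy. `switch_traffic` and the health URL are hypothetical stand-ins for your load balancer or platform API.

```python
import time
import urllib.request


def is_healthy(url: str) -> bool:
    """Return True if the candidate responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def switch_traffic(target: str) -> None:
    """Hypothetical: point the router or load balancer at the named stack."""
    print(f"traffic -> {target}")


def blue_green_release(candidate_health_url: str, checks: int = 5) -> bool:
    switch_traffic("green")  # cut over to the new stack
    for _ in range(checks):
        if not is_healthy(candidate_health_url):
            switch_traffic("blue")  # automated roll-back to the previous stack
            return False
        time.sleep(10)  # keep watching during the soak period
    return True
```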
12. Security and compliance
Have the team read Secure Engineering and acted on it?
- System design
- Is the minimum necessary data handled?
- Is the system designed to minimise attack surface?
- Does the team consider OWASP Top 10 during design and implementation?
- Software supply chain
- Is source code access secure, with reliable audit for code changes?
- Are static analysis tools used to scan code?
- Are build-time dependencies scanned for vulnerabilities?
- Are run-time dependencies scanned for vulnerabilities? e.g. base containers or VM images.
- Is the running system scanned for vulnerabilities? e.g. using OWASP ZAP.
- Do CI/CD-triggered deployments use a role with minimal permissions?
- Are secrets managed securely? (see the sketch at the end of this section)
- Infrastructure
- Do IAM roles have minimal privilege?
- Are best practices enforced? e.g. using AWS Config.
- Is there audit logging and traceability?
- Is there active scanning of hosts, software or running systems? e.g. using Amazon Inspector.
- Is all ingress protected appropriately? e.g. using a Web Application Firewall (WAF).
- Is all data secured in transit and at rest?
- People factors
- Is there a process to manage access controls?
- Do people have the minimum permissions needed to do their job?
- Do the team rehearse security incident responses?
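As one illustration of managing secrets securely, the sketch below reads a secret at runtime from AWS Secrets Manager instead of baking it into code or configuration files. The secret name is an assumption, and the calling role should be limited to `secretsmanager:GetSecretValue` on that secret only.

```python
import boto3  # pip install boto3


def get_db_password() -> str:
    # The secret name is an assumption; credentials come from the runtime role,
    # which should have minimal permissions (GetSecretValue on this secret only).
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId="prod/api/db-password")
    return response["SecretString"]
```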
13. Observability
Read Structured Logging:
- Are the logs useful and accessible?
- Is the log retention reasonable?
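A minimal sketch of structured logging using only the Python standard library: each log line is a single JSON object, so fields can be queried rather than grepped. The field names are illustrative rather than a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Call sites can attach structured context via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("payments").info(
    "payment accepted", extra={"context": {"order_id": "abc-123", "amount_pence": 4250}}
)
```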
Read Monitoring & Alerting:
- Does the team trust the monitoring — is it reliable?
- Is the monitoring isolated from the systems it monitors?
- Can the team access monitoring during incidents?
- Could the monitoring system cause incidents?
- Do the right people have access to the monitoring?
- Are system dependencies monitored?
- Is the monitoring data storage resilient, with an appropriate retention policy?
- Are alerts in place?
- Are alerts relevant, with few false alarms or gaps?
- Can incident impact/priority be easily determined?
- Do the right people get alerted?
- (Ideal) Is alerting based on business KPIs (for example, shopping basket abandonment rates)?
Read Tracing:
- Has distributed tracing been put in place, if relevant?
14. Performance and scalability
- Are performance and capacity requirements understood?
- Are there any constraints from systems the solution depends on?
- How does the team verify requirements are met?
- How regularly does the team carry out performance testing?
- Is testing efficient? Is testing automated? (see the sketch at the end of this section)
- Is the team aware of any bottlenecks?
- In which components do these bottlenecks occur?
- On which resources, such as CPU, memory or network?
- Have stakeholders signed off on performance and capacity?
- Does the system scale automatically?
- Does it scale without impacting availability?
- Does it do so quickly enough to meet spiky demand?
- Are the right metrics used to drive scale up/down?
- Are safe minimum and maximum scaling limits known and configured?
- Can the system scale further without major re-engineering?
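The sketch below shows one way to automate a basic performance check: fire a batch of concurrent requests and assert a latency percentile against the agreed requirement. The URL and the 500 ms p95 budget are assumptions; dedicated tools such as k6, Gatling or Locust are better suited to sustained load testing.

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://staging.example.com/health"  # assumed endpoint
P95_BUDGET_SECONDS = 0.5                           # assumed requirement


def timed_request(_: int) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(TARGET_URL, timeout=10):
        pass
    return time.perf_counter() - start


def run_smoke_test(requests: int = 200, concurrency: int = 20) -> None:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(requests)))
    p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile
    print(f"p95 latency: {p95:.3f}s")
    assert p95 <= P95_BUDGET_SECONDS, "p95 latency exceeds the agreed budget"


if __name__ == "__main__":
    run_smoke_test()
```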
15. Availability
- Are the availability requirements defined and understood?
- Are there constraints from systems the solution depends on?
- Is the system reliable?
- Is it resilient, and does it self-heal?
- Is it fault-tolerant? (see the sketch at the end of this section)
- Where is data stored in the system, and where is it mastered?
- Is possible data loss understood and acceptable?
- Is the system's reliability tested?
- Are these tests automated?
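As a small example of a fault-tolerance building block, the sketch below retries a flaky dependency call with exponential backoff and jitter, so a brief downstream blip does not become a user-facing failure. `call_downstream` is hypothetical.

```python
import random
import time


def call_downstream() -> str:
    """Hypothetical call to a dependency that may fail transiently."""
    raise ConnectionError("simulated transient failure")


def with_retries(attempts: int = 4, base_delay: float = 0.2) -> str:
    for attempt in range(attempts):
        try:
            return call_downstream()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up; let the caller or a circuit breaker handle it
            # Exponential backoff with jitter avoids synchronised retry storms.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")
```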
16. Cost
- Are platform costs understood?
- Is cost tracking in place? Is cost attributable? e.g. by tagging resources (see the sketch at the end of this section).
- Is the system cost-efficient?
- Does the chosen tech give value for money?
- Are cost-control measures (such as environment shutdown or optimal sizing) in place?
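As a sketch of making cost attributable, the example below groups AWS spend by a cost-allocation tag using the Cost Explorer API. The `service` tag key and the date range are assumptions; it presumes resources are tagged consistently and the tag is activated for cost allocation.

```python
import boto3  # pip install boto3


def cost_by_service_tag(start: str, end: str) -> None:
    ce = boto3.client("ce")
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # ISO dates, e.g. "2024-01-01"
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "service"}],  # assumed tag key
    )
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            print(group["Keys"][0], amount)  # Keys[0] identifies the tag value
```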
17. Operability and operations
- What is the team's role in live support?
- What about out of hours?
- How is the team informed of incidents from alerts or user reports?
- Are issues detected / prevented before users report them?
- Are SLOs used to determine incident severity?
- Are response and communication agreements clear for each severity?
- Is incident resolution slick?
- What is the Mean Time to Recovery (MTTR)? (see the sketch at the end of this section)
- Does the team have defined on-call roles and procedures?
- Is diagnosis and resolution easy?
- Is communication and collaboration smooth?
- Are run books in place and working well?
- Does the team learn from incidents using blameless postmortems?
- Does the tech being used make operations easy?
- Are vendor support agreements in place?
- Are support processes well-rehearsed and tested?
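As a small illustration of tracking MTTR, the sketch below averages detection-to-resolution times from a hypothetical incident log; in practice the data would come from your incident tooling's export or API.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) pairs.
INCIDENTS = [
    (datetime(2024, 3, 1, 9, 15), datetime(2024, 3, 1, 9, 47)),
    (datetime(2024, 3, 9, 22, 3), datetime(2024, 3, 10, 0, 10)),
    (datetime(2024, 3, 20, 14, 0), datetime(2024, 3, 20, 14, 25)),
]


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


print(f"MTTR: {mean_time_to_recovery(INCIDENTS)}")
```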
18. Go-live readiness
Are you aiming for an initial release or a significant change that will affect users? If not, you can skip this section.
- Do you have established acceptance-into-service criteria?
- Have these criteria been met?
- How will business processes have to change?
- Are people ready for the changes?
- How will the team know how users have been affected by the change? Will there be monitoring, alerting, analytics, or user feedback mechanisms in place? See Monitoring & Alerting.
- Are external go-live dependencies and prerequisites understood and being tracked?
- Is any data migration needed?
- Will there be dual-running or a hard switchover?
- How can the go-live deployment be rolled back?
- How will users be migrated or onboarded?
- How does the migration affect dependent systems?
- Will there be a service outage?
- Is there a warranty period?