
Monitoring and Alerting

Effective monitoring provides the foundation for delivering reliable services.

The Site Reliability Engineering (SRE) book defines these terms:

Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.

Alert
A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager.

As DigitalOcean says,

Metrics represent the raw measurements of resource usage or behaviour that can be observed and collected throughout your systems ... monitoring is the process of collecting, aggregating, and analysing those values to improve awareness of your components' characteristics and behaviour.

Context

Monitoring and alerting based on metrics is one important facet of the broader topic of Observability, discussed by The New Stack:

[Observability is] a measure of how well internal states of a system can be inferred from knowledge of its external outputs. ... There are many practices that contribute towards observability, ... externalizing key application events through logs, metrics and events.

This article discusses monitoring and alerting based on metrics, which is primarily aimed at knowing whether a service is healthy. Monitoring can go some way toward understanding why a service is unhealthy, but logs and events (such as distributed tracing) fill in much more of the detail when diagnosing the cause of issues, and are also important to understand.

What is monitoring for?

Monitoring serves three purposes:

  1. Helps people such as support engineers and product managers understand how services are operating at a point in time and how metric values have changed over time.
  2. Drives automated routines which maintain system health such as auto-scaling or failover.
  3. Drives automated alerts which notify people when service reliability is impaired or at risk.

What should you monitor?

For user-facing systems, the simple answer is "whatever matters most to your users", and more generally, "whatever most impacts your service's reliability".

Four Golden Signals

Specifically, the most important metrics to monitor for user-facing systems according to the SRE book are the four 'golden signals'.

Latency
The time it takes to service a request. e.g. response time in milliseconds for an HTTP API.

Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. e.g. requests per second for an HTTP API.

Errors
The rate of requests that fail, either explicitly (e.g. HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). e.g. 5xx errors per second for an HTTP API.

Saturation
How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g. in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.

Availability

The four Golden Signals represent things which can affect the service's availability.

Google's meaningful availability paper has this to say about availability:

A good availability metric should be meaningful, proportional, and actionable. By "meaningful" we mean that it should capture what users experience. By "proportional" we mean that a change in the metric should be proportional to the change in user-perceived availability. By "actionable" we mean that the metric should give system owners insight into why availability for a period was low.

Rather than blunt "uptime" measures, the paper proposes an interpretation of availability based on whether the service responds correctly within a defined latency threshold for each individual request. That is, it emphasises the importance of properly measuring partial failure and using metrics which measure real user experience.
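
As a toy illustration of this per-request view, the Python sketch below counts a request as "available" only if it both succeeded and completed within a one-second latency threshold; the threshold and the sample data are invented for the example.

    # Each record is (succeeded, latency_seconds); the 1-second threshold is assumed.
    requests = [(True, 0.12), (True, 1.40), (False, 0.09), (True, 0.30)]
    THRESHOLD_SECONDS = 1.0

    good = sum(1 for ok, latency in requests if ok and latency < THRESHOLD_SECONDS)
    availability = good / len(requests)
    print(f"Meaningful availability: {availability:.0%}")  # 50% in this toy sample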

For example, the Transaction count by experience graph below shows user counts by their perceived experience as measured by response time.

Perceived performance

It is clear that there have been two service-impacting incidents, one at 8.15 and another at 8.33, and we can see that the user count dropped significantly after the second failure. This information helps engineers to prioritise work in order to resolve the root cause of the incidents.

Dashboards

Monitoring dashboards are a primary way for people like support engineers and product managers to visualise and understand service health and behaviour.

Example dashboard

Dashboards should be easy to understand at a glance. Some tips to achieve this are:

  • Limit the dashboard to a small set of individual graphs or charts, no more than 10.
  • Restrict the number of individual metrics being shown on each chart, typically no more than 5.
  • Make it obvious when things go wrong. Use fixed axes, rather than allowing them to rescale dynamically. Don't be afraid of choosing scales which make your dashboard look boring when everything is fine: "boring" graphs just mean it's immediately obvious when things go wrong. Additionally, consider plotting fixed lines to indicate limits set out in service level agreements. For example, adding a horizontal red line to an average response times graph to indicate the agreed maximum response time makes it easy to spot when this limit is breached.
  • Prefer graphs over single-value displays. A CPU gauge might look cool, but it only shows one measurement at a time, so large spikes in CPU utilisation are easy to miss; because graphs show changes over time, unusual spikes are much easier to notice.
  • Show related metrics together, and carefully consider placement of graphs on a dashboard. Graphs stacked on top of each other with the same x-axis unit make it easier to notice correlations between different things.
  • Adopt conventions around use of colour and units (e.g. standardise on milliseconds for times and rates per second).
  • Display metrics over an appropriate rolling time period (e.g. the last hour). It's important to get the right balance between providing enough context to understand trends in the data and providing a high enough resolution to accurately show metrics.
  • Configure an appropriate refresh rate to ensure changes in metric values are seen promptly.
  • Make a conscious trade-off between indicating "long tail" performance vs representing the majority experience. For example, the 95th percentile latency is usually a more useful indicator than either the mean or maximum latency, which under- and over-emphasise the worst case, respectively (see the sketch after this list).
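
To illustrate the last point, the short sketch below compares the mean, 95th percentile and maximum of an invented latency sample; the numbers are illustrative only.

    # 100 requests: most take 100 ms, a few take 300 ms, one extreme outlier takes 5 s.
    latencies_ms = sorted([100] * 95 + [300] * 4 + [5000])

    mean_ms = sum(latencies_ms) / len(latencies_ms)  # 157 ms: inflated by the single outlier
    p95_ms = latencies_ms[94]                        # 100 ms: 95% of requests were this fast or faster
    max_ms = latencies_ms[-1]                        # 5000 ms: the single worst case

    print(mean_ms, p95_ms, max_ms)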

It is often a good idea to draw a clear distinction between:

  • High level dashboards used to determine whether the service is healthy. These typically focus on business-meaningful metrics and symptoms of service health.
  • More detailed / lower level dashboards which show underlying causes. These help diagnose why a service is unhealthy once that has been determined.

Designing good dashboards is hard. Take the time to do it well. When faced with a production failure, well-designed dashboards can help resolve the incident much faster.

Alerts

Alerts are triggered when rules applied to monitoring metrics evaluate to true. For example, if our business requirement is "99% (averaged over 1 minute) of Get HTTP requests should complete in less than 100 ms (measured across all the back-end servers)" then the corresponding alerting rule in Prometheus format could be api_http_request_latencies_second{quantile="0.99"} > 0.1.
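
As a rough sketch of what evaluating such a rule involves, the Python snippet below runs the expression against Prometheus's HTTP query API; the server address is hypothetical, and in practice you would normally define this as an alerting rule in Prometheus itself and route notifications through Alertmanager.

    import requests

    PROMETHEUS_URL = "http://prometheus:9090"  # hypothetical address
    EXPRESSION = 'api_http_request_latencies_second{quantile="0.99"} > 0.1'

    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": EXPRESSION})
    result = response.json()["data"]["result"]

    # The comparison returns only the series currently breaching the threshold,
    # so a non-empty result means the rule would fire.
    if result:
        print(f"ALERT: 99th percentile latency over 100 ms on {len(result)} series")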

We recommend using Service Level Objectives and Error Budgets as the basis for alert rules.

An SLI is a service level indicator: a carefully-defined quantitative measure of some aspect of the level of service that is provided.

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a [time period]
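
As a short worked example with assumed numbers, a 99.9% availability SLO measured over a 30-day window gives an error budget of about 43 minutes:

    slo_target = 0.999
    window_minutes = 30 * 24 * 60                    # 43,200 minutes in the window
    error_budget_minutes = window_minutes * (1 - slo_target)

    minutes_of_failure_so_far = 20                   # e.g. two short incidents this month
    budget_remaining = error_budget_minutes - minutes_of_failure_so_far

    print(f"Error budget: {error_budget_minutes:.1f} min")  # 43.2 minutes
    print(f"Remaining:    {budget_remaining:.1f} min")      # 23.2 minutes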

The Site Reliability Workbook lists these attributes to consider when evaluating an alerting strategy (a short worked example of precision and recall follows these definitions):

Precision
The proportion of events detected that were significant. Precision is 100% if every alert corresponds to a significant event.

Recall
The proportion of significant events detected. Recall is 100% if every significant event results in an alert.

Detection time
How long it takes to send notifications in various conditions. Long detection times can negatively impact the error budget.

Reset time
How long alerts fire for after an issue is resolved. Long reset times can lead to confusion or to issues being ignored.
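
As a quick worked example of the first two attributes, with invented numbers:

    alerts_fired = 10          # alerts sent to humans over the period
    true_alerts = 8            # alerts which corresponded to significant events
    significant_events = 12    # significant events which actually occurred

    precision = true_alerts / alerts_fired       # 0.80: 2 in 10 alerts were noise
    recall = true_alerts / significant_events    # 0.67: 4 real issues went unalerted

    print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")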

Collecting metrics

The SRE book discusses two types of monitoring:

White-box monitoring
Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.

Black-box monitoring
Testing externally-visible behavior as a user would see it.

Examples of white-box monitoring:

  • Using Prometheus to scrape metrics from a component which uses a compatible client library to collect metrics and expose them on an HTTP endpoint.
  • Using the default metrics which Amazon CloudWatch Metrics exposes for AWS-provided services.
  • Publishing internal metrics from a component as CloudWatch custom metrics, either by pushing them to CloudWatch using the SDK (a minimal sketch follows this list) or by using metric filters to extract metrics from application logs in CloudWatch Logs.
  • Using Java Management Extensions (JMX) to collect internal metrics from an instrumented Java component.
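
As a minimal sketch of the custom-metric approach mentioned above, the snippet below pushes a single data point to CloudWatch using boto3 (the AWS SDK for Python); the namespace, metric name and dimension are hypothetical.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_data(
        Namespace="MyService",                    # hypothetical namespace
        MetricData=[
            {
                "MetricName": "OrdersProcessed",  # hypothetical business metric
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
                "Unit": "Count",
                "Value": 42,
            }
        ],
    )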

Examples of black-box monitoring:

  • Passive monitoring by a load balancer of the instances it is directing traffic to, in order to determine whether they are healthy. Here the load balancer observes the responses from real user requests as they happen and records factors such as response times and HTTP status codes.
  • Active monitoring of a service, where the monitoring system generates regular 'synthetic' requests to the service and observes the response to determine health (a minimal sketch follows).
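
The following is a minimal sketch of such a synthetic probe, assuming a hypothetical health-check URL and a one-second response-time target; in practice it would run on a schedule and feed its results into the monitoring system.

    import time
    import requests

    def probe(url, timeout_seconds=2.0):
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=timeout_seconds)
            latency = time.monotonic() - start
            healthy = response.status_code == 200 and latency < 1.0
        except requests.RequestException:
            latency, healthy = None, False
        return healthy, latency

    print(probe("https://example.com/healthcheck"))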

Serverless and ephemeral

Special consideration is needed for highly ephemeral and dynamic deployment models such as when using AWS Lambda. Pull-based tools such as Prometheus need to be able to interact directly with each individual instance, which is not possible with some deployment models. This can limit the options available, and typically means that the intrinsic monitoring capabilities of the platform provider are the best choice for this kind of workload.

Logs and metrics

Some tools make a clear distinction between application logs and metrics and treat the two entirely separately. For example, the ELK stack deals only with logs, while Prometheus and Grafana deal only with metrics.

AWS CloudWatch blurs the boundary slightly by allowing custom metrics to be derived from application logs using metric filters. Importantly though, the retention period for the metrics is independent of the logs they are derived from, which allows metrics to be kept for longer than the more storage-hungry (and therefore expensive) logs. Analysis over longer timescales (e.g. using CloudWatch Insight) can be useful to spot trends and cycles, for example to predict future growth. Analysis of metrics is limited to the metrics calculated / captured at the time, whereas in some cases logs can contain more information than the derived metrics, allowing for more novel and exploratory analysis. As such, there is a trade-off to be had between the two.

Tools like Splunk take a different approach and use logs as the underlying storage from which metrics are calculated on demand. This maximises the ability to perform exploratory analysis, but requires large and fast storage to retain full fidelity log data over long periods, which quickly becomes expensive.

Tracing

Tracing in distributed systems is a technique which arguably sits somewhere between logs and metrics.

Commercial model

The pricing model for monitoring tools can have an undesirable influence on how the tool is used, especially if there is a cost per monitored instance. This can lead to some services, or some non-production environments, not being monitored at all. It is far better to choose tools where this is not the case, so that every service which should be monitored can be, monitoring is present in all non-production environments, and monitoring becomes a first-class consideration for the delivery team.

Reliable monitoring

Independent vs intrinsic

A comprehensive monitoring setup will likely include several tools and techniques.

Black-box monitoring using tools like Pingdom provides a fairly crude measure of health, but its big benefit is that it is fully independent of the system being monitored and sits outside your infrastructure, which has two important implications. Firstly, it provides a reliable real-world measure of whether your service is available, including the complexities of factors such as DNS resolution and internet routing. Secondly, it is not subject to platform-level issues which could affect both the system being monitored and the monitoring tool, potentially leaving the monitoring unable to tell you that your system is down because the monitoring itself is also unavailable.

In contrast, more detailed metrics will need to come from monitoring tools which have a more intimate relationship with the system being monitored, and will likely be deployed into the same platform. It is best to treat these tools as an intrinsic part of your system, with configuration stored in source control and changes deployed and validated through the pipeline of environments as with any other system change. This brings the same rigour to bear on tracking and verifying changes to the monitoring system as is applied to the production system.

Security

The detailed metrics surfaced by monitoring tools provide valuable information to those supporting the service. But this information can also be useful to potential attackers: it can help them understand the effect a denial of service attack is having and refine their techniques to attack the system more effectively. For this reason, it is important to use effective access control measures to ensure only authorised people can access the monitoring tools. This also applies to any interfaces which services present to expose metrics, such as /metrics endpoints for integration with Prometheus: these should not be exposed publicly without authentication.

Retention and granularity

Give careful consideration to the metric retention period. A configured limit is usually desirable to avoid hitting hard limits of the monitoring system (such as disk space) or incurring spiralling costs for auto-scaling systems. Too short a period will prevent you from performing long-term analysis of trends, whereas too long a period may incur unnecessary cost. Ensure your monitoring can scale to meet production requirements.

Performance and resilience

Ensure your monitoring system is isolated from the production system so that issues in production don't affect monitoring and issues in monitoring don't affect production. In Kubernetes, for example, you might be tempted to run Prometheus on the same nodes as everything else, but that could lead to losing monitoring due to "noisy neighbour" effects, or during production failures.

Practices

Building

It takes longer to configure effective monitoring and alerting than many teams expect. It is worth including monitoring in the delivery plan as part of the delivery work, allowing time to understand the needs of different users, design systems in ways which make monitoring easier, and perform various types of testing to refine the monitoring.

Testing

The only way to measure the effectiveness of the monitoring configuration is to test it. Load testing the service to breaking point is a great way to verify whether the monitoring allows you to tell which part has broken and how. Similarly, deliberately breaking the service in various ways can verify whether the monitoring correctly and clearly indicates these failures. The best way to tell whether the monitoring allows support engineers to quickly diagnose and fix problems is to role-play incident scenarios using practices like Game Days.

Continuous improvement

When incidents do occur, practices such as blameless postmortems and Post Incident Reviews are good ways to feed learning back into the team, and in particular to improve monitoring to make future incidents more avoidable and easier to detect, diagnose and fix.

Further reading

Google Site Reliability Engineering book
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

Google Site Reliability Workbook
https://landing.google.com/sre/workbook/toc/

DigitalOcean An Introduction to Metrics, Monitoring, and Alerting
https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting

Datadog Monitoring 101
https://www.datadoghq.com/blog/monitoring-101-collecting-data/