Prometheus
Over recent years, Prometheus has become established across our teams as a firm favourite for capturing metrics, usually paired with Grafana for visualisation. It is our default choice for monitoring on-premise systems and for supplementing cloud platform metrics systems such as Azure Monitor and Application Insights or AWS CloudWatch Metrics. Its highly versatile data model has led to a rich ecosystem of exporters that allow Prometheus to collect metrics from almost any source, including infrastructure, databases and message queues, as well as application and business metrics. Prometheus also supports alerting via the Alertmanager component, using a built-in query syntax and rules engine.
Prometheus is a great fit for long-running processes deployed as containers or on VMs, but the main caveat for recommending it is that it is not a perfect fit for highly serverless systems. Firstly, Prometheus and Grafana must run as long-running processes, typically as containers, necessitating more traditional infrastructure that can be out of step with an otherwise serverless system. Secondly, Prometheus's 'pull-based' model does not lend itself to monitoring highly ephemeral components such as serverless functions, so an intermediary like the push gateway is needed to collect and expose metrics on their behalf. However, the push gateway comes with caveats of its own which should be considered carefully: it can become a single point of failure, and it retains pushed metrics until they are explicitly deleted. We typically find that the monitoring services provided by the cloud platform, such as Azure Application Insights or AWS CloudWatch Metrics, are a more natural choice for serverless systems. Integration with these services can be either via application logs, by deriving custom metrics from log events, or by pushing custom metrics to the metrics system from applications via the relevant SDKs.
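As an illustration of the push model, a short-lived job can push its final metric values to the push gateway using the official Python client. The sketch below assumes a push gateway reachable at a placeholder address, and `do_work()` stands in for the job's real logic:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def do_work():
    # Placeholder for the ephemeral job's real work.
    return 42


registry = CollectorRegistry()
duration = Gauge('batch_job_duration_seconds',
                 'Duration of the last batch job run', registry=registry)
records = Gauge('batch_job_records_processed',
                'Records processed by the last batch job run', registry=registry)

# Record the job's duration and output, then push the final values to the
# push gateway, where Prometheus will scrape them on its next pass.
with duration.time():
    records.set(do_work())

push_to_gateway('pushgateway.example.internal:9091', job='nightly_import',
                registry=registry)
```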
How Prometheus works
Prometheus periodically collects metrics from each configured component by 'scraping' a /metrics HTTP endpoint, exposed either by the component directly or by a standalone exporter (see Exporting metrics below). The exporter is responsible for performing initial aggregation and calculation of metrics, depending on the metric type. Scrapes are performed once per minute by default, and Prometheus stores a history of scraped metrics, subject to the configured retention period. Grafana can display graphs and dashboards of Prometheus metrics, 'querying' Prometheus using PromQL to perform filtering, aggregations and calculations.
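To make the query side concrete, the same PromQL expression a Grafana panel would use can also be sent directly to Prometheus's HTTP query API. This is a minimal sketch in Python, assuming a Prometheus server at a placeholder address and the conventional `http_requests_total` metric exposed by the scraped applications:

```python
import requests

PROMETHEUS_URL = 'http://prometheus.example.internal:9090'  # placeholder address

# Per-instance request rate over the last five minutes.
query = 'sum by (instance) (rate(http_requests_total[5m]))'

response = requests.get(f'{PROMETHEUS_URL}/api/v1/query', params={'query': query})
response.raise_for_status()

for series in response.json()['data']['result']:
    timestamp, value = series['value']
    print(series['metric'].get('instance', '<unknown>'), value)
```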
For components with multiple load-balanced instances, Prometheus must scrape each instance individually rather than going via the load balancer; otherwise successive scrapes would mix metrics from different instances. This usually requires some form of service discovery, such as DNS-based discovery using SRV records.
Exporting metrics
It is easy to provide the required /metrics endpoint from your application by including one of the client libraries provided by Prometheus or the community. As well as allowing you to register custom metrics, these libraries typically hook into standard frameworks and provide a basic set of metrics by default, such as request rates and durations for web applications. Standalone exporters inspect and query the systems they are exporting metrics for (such as a MySQL database or RabbitMQ instance) and expose the resulting metrics on a /metrics endpoint of their own. For cloud deployments, metrics from the underlying platform monitoring system can be fed into Prometheus, e.g. using the CloudWatch exporter for AWS. This allows all metrics to be stored and queried in one place, but the metric granularity is restricted to that available from the cloud platform.
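As a sketch of what instrumenting an application with one of these client libraries looks like in Python (the metric names, port and `process_order()` function are illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Custom application metrics, registered alongside the library's defaults.
ORDERS = Counter('orders_processed_total', 'Total orders processed', ['status'])
LATENCY = Histogram('order_processing_seconds', 'Time spent processing an order')


def process_order():
    # Placeholder for real business logic.
    time.sleep(random.uniform(0.01, 0.1))
    ORDERS.labels(status='success').inc()


if __name__ == '__main__':
    start_http_server(8000)  # exposes the /metrics endpoint on port 8000
    while True:
        with LATENCY.time():
            process_order()
```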
Typical deployment
The starter configuration is to run one Prometheus container and one Grafana container in your favoured container orchestrator (e.g. ECS, Kubernetes). It is usual to run these within each environment (such as dev, test and production) to maintain separation between them. Run the node exporter on any VMs you want to monitor and configure Prometheus to discover them using service discovery. Configure service discovery for your application orchestration system too; for example, AWS ECS supports DNS-based service discovery through its built-in integration with Route 53.
Resilience
It is common to store Prometheus data on a dedicated block storage volume with regular snapshots, to avoid losing all history if the instance fails. Prometheus can also be configured with a basic form of high availability by running two instances that share the same configuration and each perform the same metrics collection, thereby building up similar metrics histories. Alertmanager can also be run in a high-availability mode, with instances communicating with each other to avoid duplicate alerts.