AWS CloudWatch Logs is a managed log aggregation service and our default solution for log storage and analysis in AWS. It is integrated with many AWS services, including ECS and Lambda, and often provides the easiest route to collecting logs and making them queryable. As a fully managed service, CloudWatch Logs typically requires no maintenance once an appropriate retention period is set. Setting this retention period matters because the default is to retain logs indefinitely, so log volume (and with it monthly cost) increases over time and can become significant.
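One way to guard against indefinite retention is to periodically apply a default to any log group that has none. The following is a minimal sketch rather than a definitive implementation: the function name `apply_default_retention` and the 30-day default are assumptions made for illustration, and `logs_client` is expected to behave like a boto3 CloudWatch Logs client (`boto3.client("logs")`). It relies on `describe_log_groups` omitting the `retentionInDays` field for log groups that never expire.

```python
def apply_default_retention(logs_client, retention_days=30):
    """Set a retention policy on every log group that does not have one.

    logs_client is assumed to be a boto3 CloudWatch Logs client.
    Returns the names of the log groups that were updated.
    """
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page.get("logGroups", []):
            # Log groups that retain logs indefinitely have no
            # "retentionInDays" key in the describe_log_groups response.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_days,
                )
                updated.append(group["logGroupName"])
    return updated
```

With boto3 installed and credentials configured, this could be called as `apply_default_retention(boto3.client("logs"))`. CloudWatch Logs only accepts certain retention values (30 days is one of them), so check the API documentation before choosing a different period. Log groups that already have a retention policy are left untouched.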
What is it?
ELK and EFK
The ELK stack (and its EFK variant) provides a mature self-hosted solution for log collection, aggregation and inspection. ELK/EFK is a good choice for on-premises systems, but for cloud-hosted systems we usually find that native log aggregation solutions such as AWS CloudWatch Logs or Azure Monitor Logs are a better choice. These give adequate functionality for most use cases without the operational overhead of running a self-hosted log aggregation system, and their pricing usually compares favourably with self-hosted options.
Grafana has become the de facto standard open source graphing and visualisation tool for systems monitoring and is used by many of our teams. It is a highly versatile tool, and while it is commonly used with Prometheus, it can also visualise metrics from cloud platform monitoring systems such as Azure Monitor and AWS CloudWatch. Grafana also has a rich plugin system for panel components to help you represent your metrics the way you want.
Monitoring and Alerting
Effective monitoring provides the foundation for delivering reliable services.
Over recent years, Prometheus has become established across our teams as a firm favourite for capturing metrics, usually used with Grafana for visualisation. It is our default choice for monitoring on-premises systems and for supplementing cloud platform metrics systems such as Azure Monitor and Application Insights or AWS CloudWatch Metrics. It has a highly versatile model, which has led to a rich ecosystem of exporters that let Prometheus collect metrics from almost any source, including infrastructure, databases, message queues, and application or business metrics. Prometheus also supports alerting via the Alertmanager component, using a built-in query syntax and rules engine.