What is it?
Distributed tracing (also known as distributed request tracing) is a method of instrumenting an application to gather metrics and metadata in order to profile and monitor it, particularly an application built on a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
The "span" is the primary building block of a distributed trace, representing an individual unit of work done in a distributed system. Each component of the system contributes a span: a named, timed operation representing one piece of the workflow.
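A minimal sketch of the idea in plain Python may help. The `Span` class and its fields here are illustrative only, not part of any real tracing library; real implementations add sampling, export, and context propagation on top of this shape.

```python
import time
import uuid

class Span:
    """Illustrative span: a named, timed unit of work within a trace."""

    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.span_id = uuid.uuid4().hex[:16]
        # All spans in one trace share a trace_id; parent_id links a span to its caller.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.parent_id = parent_id
        self.start = None
        self.end = None

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.end = time.monotonic()

    @property
    def duration(self):
        return self.end - self.start

# One request, two spans: the handler, and a downstream call it makes.
with Span("handle-request") as root:
    with Span("query-db", trace_id=root.trace_id, parent_id=root.span_id) as child:
        time.sleep(0.01)  # stand-in for real work
```

Because the child carries the root's `trace_id` and records the root's `span_id` as its parent, a tracing backend can later reassemble the two timed operations into a single tree for the request.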
Why should I do it?
Distributed tracing is especially valuable in service-oriented architectures, where multiple systems may be involved in processing a single request or event. In these circumstances it can be difficult to identify which component is responsible for service degradation. Tracing shows which services were involved in the end-to-end processing of a request, along with timing information that helps pinpoint the source of issues.
How can I do it?
To get the most out of distributed tracing, it is usually necessary to add instrumentation to your application so that additional contextual metadata is attached to each trace.
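Part of that instrumentation is propagating trace context between services, typically in request headers. The W3C Trace Context standard defines a `traceparent` header of the form `version-traceid-parentid-flags` for exactly this; the helper function names below are illustrative, not from any particular library.

```python
def inject(trace_id, span_id, headers):
    """Write W3C Trace Context into outgoing request headers."""
    # Format: version (00) - 32-hex trace id - 16-hex parent span id - flags (01 = sampled)
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers

def extract(headers):
    """Read trace context from incoming request headers."""
    version, trace_id, span_id, flags = headers["traceparent"].split("-")
    return trace_id, span_id

# The calling service injects its current trace and span ids...
headers = inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", {})
# ...and the receiving service extracts them to continue the same trace.
trace_id, parent_id = extract(headers)
```

In practice a tracing SDK does this injection and extraction automatically in its HTTP client and server middleware, so application code rarely handles the header directly.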
Some implementations to consider: OpenTelemetry (a vendor-neutral standard with SDKs for most major languages), Jaeger, Zipkin, and managed services such as AWS X-Ray.
For a walk-through of implementing distributed tracing in AWS, see this guide to using AWS X-Ray with Serverless Node.js.
A span is a subset of a trace and represents a unit of work within a distributed system. A trace may consist of multiple spans, each recording metadata as the request traverses multiple services within the system.
A flame graph is a visualisation of the timeline of a trace, and the execution time associated with each span. This can make it easier to identify performance issues with a system and at which points latency is being introduced.
A service map shows an aggregation of traces and the interaction between services. Typically, this information will be enhanced with metrics such as requests per second, error rates, and latency metrics for each service.
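The per-service numbers on a service map come from aggregating many spans. A minimal sketch of that aggregation in plain Python (the span dictionary shape here is illustrative, not any particular tool's format):

```python
from collections import defaultdict

def service_metrics(spans):
    """Aggregate finished spans into per-service request counts, error rates, and latency."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        s = totals[span["service"]]
        s["requests"] += 1
        s["errors"] += 1 if span.get("error") else 0
        s["total_ms"] += span["duration_ms"]
    return {
        service: {
            "requests": s["requests"],
            "error_rate": s["errors"] / s["requests"],
            "avg_latency_ms": s["total_ms"] / s["requests"],
        }
        for service, s in totals.items()
    }

spans = [
    {"service": "checkout", "duration_ms": 120.0, "error": False},
    {"service": "checkout", "duration_ms": 80.0,  "error": True},
    {"service": "payments", "duration_ms": 45.0,  "error": False},
]
metrics = service_metrics(spans)
```

Real backends compute these rolling aggregates continuously and usually report percentile latencies (p50, p95, p99) rather than a simple average, since averages hide tail latency.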