The management of modern software environments hinges on the three so-called “pillars of observability”: logs, metrics and traces. Each of these data sources provides crucial visibility into applications and the infrastructure hosting them.
For many IT operations and site reliability engineering (SRE) teams, two of these pillars — logs and metrics — are familiar enough. For years, teams have analyzed logs and metrics to establish baselines of normal application behavior and detect anomalies that could signal a problem.
But the third pillar of observability — traces — may be less familiar. In conventional applications that ran as monoliths, tracing existed, but it was less important for understanding what was happening. There were fewer moving parts through which requests could flow as the application processed them, so traces were less essential for diagnosing performance problems.
In modern, cloud-native applications built on microservices, however, traces are absolutely critical for achieving full observability. Traces are the only way to gain end-to-end visibility into service interactions and to identify the root cause of performance problems within complicated distributed microservice architectures that run on multi-layered stacks consisting of servers, application code, containers, orchestrators and more.
That’s why understanding why and how to implement distributed tracing as part of your observability strategy is critical for modern IT and SRE teams, especially those tasked with managing environments based on Kubernetes or other cloud-native platforms. To provide guidance, this blog post explains what distributed tracing is, why it’s so important and how best to add distributed traces to your observability toolset, along with best practices for doing so.
What is distributed tracing?
Distributed tracing refers to the process of following a request as it moves between multiple services within a microservices architecture.
When you perform a distributed trace, you identify the service where a request originates (typically a user-facing application frontend) and then record the request’s state as it travels from that initial service to others (and possibly back again).
As an example of distributed tracing, imagine a collection of microservices in a standard modern application. The user interface is rendered by a small group of microservices, user data is recorded in a database (that runs as a different service) and some number of small backend services handle data processing.
In this environment, a distributed trace of the user’s request would start by recording information about the request’s status on the first frontend service — which data the user inputs and how long it takes the service to forward that data to other services. The next touchpoint in the trace would involve the backend services, which accept the input and perform any necessary data processing. Then, the backend services transfer the processed data to the database service, which stores it.
By monitoring the request’s status and performance characteristics on all of these services, SREs and IT teams can pinpoint the source of performance issues. Rather than merely recording the time it takes for the request as a whole to complete, they can track the responsiveness of each individual service in order to determine, for example, that the database service is suffering from high latency, or that one service that is used to render part of the home page is failing 10% of the time.
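To make this concrete, here is a toy sketch of the spans that might make up one trace of the request described above. The service names, span IDs and timings are invented for the example; the point is that every hop shares a trace ID and records its own duration, so the slow service stands out.

```python
# Illustrative only: a simplified view of one trace's spans.
trace = [
    {"span_id": "a1", "parent": None, "service": "web-frontend",
     "operation": "POST /profile", "duration_ms": 930},
    {"span_id": "b2", "parent": "a1", "service": "data-processor",
     "operation": "normalize-input", "duration_ms": 870},
    {"span_id": "c3", "parent": "b2", "service": "user-db",
     "operation": "INSERT user_profile", "duration_ms": 710},
]

def self_time(span):
    # A span's duration minus the time spent in its direct children:
    # this is where the request actually spends its time.
    child_time = sum(s["duration_ms"] for s in trace if s["parent"] == span["span_id"])
    return span["duration_ms"] - child_time

for span in trace:
    print(f"{span['service']}: {self_time(span)} ms")
# In this invented trace, the database service accounts for most of the latency.
```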
Traditional tracing for distributed services is challenging
The fundamental goal behind tracing — understanding transactions — is always the same. However, there are different approaches to implementing tracing.
Traditionally, tracing tools have performed probabilistic sampling, which captures only a small (and arbitrary) portion of all transactions. Probabilistic sampling may provide some insight into what is happening as an application processes requests. But because sampling captures only some transactions, it doesn’t provide full visibility. If certain types of transactions are not well represented among those that are captured, a sampling-based approach to tracing will not reveal potential issues with those transactions.
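Conceptually, head-based probabilistic sampling boils down to a coin flip made when the trace starts. A minimal sketch (the 1% rate is arbitrary):

```python
import random

SAMPLE_RATE = 0.01  # keep roughly 1 in 100 traces; the rate is arbitrary

def should_sample() -> bool:
    # The decision is made up front, with no knowledge of whether this
    # particular request will turn out to be slow or to fail.
    return random.random() < SAMPLE_RATE
```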
To illustrate the limitations of a probabilistic sampling approach, let’s go back to the example of the three-tiered application described above. Here, a tracing strategy based on sampling would at most allow IT and SRE teams to understand the general trends associated with the most common types of user requests. It might also reveal major changes in performance, such as a complete service failure that causes all of the sampled transactions to result in errors.
But this approach would likely not yield much insight into more nuanced performance trends, and it can’t scale to measure the thousands of distributed services in a transient, containerized environment. In our earlier example, a slight degradation in performance, such as an increase in average latency from 1 second to 1.2 seconds for users hosted in a particular shard of the backend database, may go undetected because a traditional APM tool may not capture enough transactions to reveal the change. Likewise, errors triggered by certain types of user input may go unnoticed because they would not appear frequently enough in the sampled data to register as a meaningful trend. The ephemeral nature of distributed systems can also generate unrelated alerts that further complicate troubleshooting: if an EC2 node fails and is replaced, but the failure affects only one user request, is that worth alerting about? As a result, the team would not identify these issues until they grew into major disruptions.
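A rough back-of-the-envelope simulation illustrates the problem; the traffic volume, shard share and sampling rate below are invented for the example:

```python
import random

REQUESTS = 100_000       # requests in some window, say one hour
SLOW_SHARD_SHARE = 0.05  # 5% of users hit the degraded database shard
SAMPLE_RATE = 0.01       # 1% head-based probabilistic sampling

sampled_slow = sum(
    1
    for _ in range(REQUESTS)
    if random.random() < SLOW_SHARD_SHARE and random.random() < SAMPLE_RATE
)
# Out of 100,000 requests, only ~50 traces from the affected shard are kept,
# and a 1.2 s vs. 1.0 s latency difference in that small slice barely moves
# the overall averages, so the regression is easy to miss.
print(f"Traces captured from the degraded shard: {sampled_slow}")
```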
That’s why Splunk APM takes a different approach. Splunk APM captures all transactions with a NoSample™ full-fidelity ingest of all traces alongside your logs and metrics. By tracing every transaction, correlating transaction data with other events from the software environment and using AI to interpret them in real time, Splunk is able to identify anomalies, errors and outliers among transactions that would otherwise go undetected.
With a NoSample™ approach, these nuanced performance problems become easy to detect. When you trace and analyze all transactions, you can catch cases of user experience degradation that a sampled view would miss. You can also make decisions with confidence, knowing that you’ve got visibility into every user’s experience with your application.
Of course, the example above, which involves only a small number of microservices, is an overly simplified one. In a real-world microservices environment, distributed traces often require tracing requests across dozens of different services. That makes it even more important to trace all transactions and to avoid sampling. From there, teams can use AI-backed systems to interpret complex patterns within trace data that would be difficult to recognize manually — especially in complex, distributed environments where relevant performance trends become obvious only when comparing data across multiple services.
Tracing vs. distributed tracing
In a basic sense, tracing is not a new concept. Teams responsible for developing and managing monolithic applications have long used traces to understand how applications process requests and to help trace performance problems to specific parts of the application source code.
In distributed, microservices-based environments, however, tracing requires more than just monitoring requests within a single body of code. Because each service in a microservice architecture operates and scales independently from others, you can’t simply trace a request within a single codebase. You must collect additional data, such as the specific service instance or version that handles the request and where it is hosted within your distributed environment, in order to understand how requests flow within your complex web of microservices.
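In practice, this usually means propagating a trace context with each request, for example via the W3C Trace Context traceparent header, and tagging each span with metadata about the service instance that handled it. A minimal sketch, with illustrative values and metadata fields:

```python
import secrets

# W3C Trace Context traceparent: version - trace ID (16 bytes) - parent span ID (8 bytes) - flags
trace_id = secrets.token_hex(16)
span_id = secrets.token_hex(8)
headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}

# Each hop also records which instance of which service did the work, so the
# trace can be tied back to a specific version, container or node.
span_metadata = {
    "service.name": "data-processor",
    "service.version": "1.4.2",          # illustrative values
    "k8s.pod.name": "data-processor-7d9f",
}
# The downstream HTTP call carries `headers`, so the next service can attach
# its own span to the same trace ID.
```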
Why distributed tracing?
Distributed tracing is a must-have source of observability for any modern microservices-based environment for several reasons.
Monitoring complex runtime environments
As noted above, distributed environments are a complex web of services that operate independently yet interact constantly in order to implement application functionality. Distributed tracing is the only way to associate performance problems with specific services within this type of environment.
Looking only at requests as a whole, or measuring their performance from the perspective of the application frontend, provides little actionable visibility into what is happening inside the application and where the performance bottlenecks lie.
What’s more, how requests move within these systems is rarely obvious from the surface: monitoring just the application frontend tells you nothing about the state of the orchestrator that helps manage the frontend, for example, or about the scale-out storage system that plays a role in processing requests that originate through the frontend.
Transaction-based troubleshooting
In many cases, NoSample™ distributed tracing is the fastest way to understand the root cause of performance problems that impact certain types of transactions or users — like user requests for a particular type of information or requests initiated by users running a certain browser.
Faced with performance problems like these, teams can trace the request to identify exactly which service is causing the issue. They might discover, for example, that a container that handles requests for one group of customers (like those in the “gold” tier of users) is running properly, while an issue is affecting a separate container that serves a different group of customers.
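This kind of analysis works only if spans carry the relevant business context. Below is a minimal sketch using OpenTelemetry-style span attributes; the attribute names, the user object and its fields are illustrative, and the SDK is assumed to be configured elsewhere (as in the setup example later in this post):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_request(user):
    # Tag the span with business context so traces can later be filtered
    # or grouped by it (attribute names here are illustrative).
    with tracer.start_as_current_span("handle-request") as span:
        span.set_attribute("customer.tier", user.tier)       # e.g. "gold"
        span.set_attribute("http.user_agent", user.browser)
        ...  # handle the request
```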
With a NoSample™ approach, even nuanced performance issues become readily identifiable — especially when you can use AI-based alerting and automatic outlier reporting to correlate transaction data with other information from your environment to help pinpoint the root cause.
Likewise, NoSample™ tracing can help pinpoint where the root cause of a problem lies within a complex, cloud-native application stack. Is a slowdown in application response time caused by an issue with the application code itself, or with the container that’s hosting a particular microservice? Or, maybe it’s an issue in the Kubernetes Scheduler, or with the Kubelet running on the node where the relevant container is running.
Without the ability to trace and analyze every transaction in a complex environment like this, it would be very challenging to quickly evaluate all of these possible points of failure in order to identify the correct one. While it may be possible to pinpoint performance problems like these using other approaches, such as log or metrics analysis, doing so is likely to be more difficult and time-consuming. Logs typically don’t expose transaction-specific data by default; instead, they record information about the status of the system as a whole. Similarly, metrics typically only reveal the existence of an anomaly that requires further investigation; they don’t pinpoint its root cause.
Team collaboration
Tracing facilitates collaboration between teams because it helps monitor transactions as they pass through the entire system, from one team’s domain to another. In other words, traces provide visibility that frontend developers, backend developers, IT engineers, SREs and business leaders alike can use to understand and collaborate around performance issues.
An IT or SRE team that notices a performance problem with one application component, for example, can use a distributed tracing system to pinpoint which service is causing the issue, then collaborate with the appropriate development team to address it. Or, technical teams can use distributed traces to collect data about ongoing performance issues and share it with business leaders so the latter know what to expect until the issue is resolved.
Operationalizing NoSample™ distributed tracing with Splunk APM
Given the complexity of monitoring requests that involve so many different types of services, distributed tracing that allows you to trace every transaction can be challenging to implement — and it is, if you take a manual approach that requires custom instrumentation of traces for each microservice in your application, or if you have to deploy agents for every service instance you need to monitor (a task that becomes especially complicated when you deploy services on constantly changing platforms like Kubernetes or a serverless model).
A better approach is to leverage a tool like Splunk APM, which simplifies data collection using open, standards-based frameworks. Splunk APM provides out-of-the-box support for all of the major open instrumentation frameworks, including OpenTelemetry, Jaeger and Zipkin. That means that no matter which framework you prefer, which languages your application is written in or how you deploy your services, you can use Splunk APM to perform distributed tracing with no special application refactoring or agent deployment necessary.
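By way of illustration, here is roughly what minimal manual instrumentation looks like with the OpenTelemetry SDK for Python. The service name is a placeholder, and the console exporter shown here would typically be replaced by an exporter that ships spans to your collector or APM backend:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Describe this service so every span it emits carries the same metadata.
resource = Resource.create({"service.name": "checkout-service"})  # placeholder name

provider = TracerProvider(resource=resource)
# ConsoleSpanExporter prints spans locally; in a real deployment you would
# swap in an exporter that sends spans to your collector or APM backend.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)  # hypothetical attribute for illustration
    # ... handle the request ...
```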
Part of a comprehensive, integrated Splunk Observability Cloud, Splunk APM also simplifies the process of putting your distributed tracing data to use. Not only does Splunk APM seamlessly correlate traces with log data, metrics and the other information you need to contextualize and understand each trace, but it also provides rich visualization features to help interpret tracing data. Splunk APM makes it easy to visualize service dependencies within your environment, monitor high-level service health and drill down into the status of specific services when you need to pinpoint the source of problems associated with latency, throughput, errors and other common issues. These are only some of the reasons why GigaOm has recognized Splunk as the only Outperformer in their Cloud Observability report for 2021.
And because manual analytics doesn’t work at the massive scale teams face when they trace every transaction, Splunk also provides machine learning capabilities to help detect anomalies automatically, so you can focus on responding to problems rather than finding them. Splunk’s APM technology automatically retains exemplars of traces with unusual characteristics for later debugging, and also has a sophisticated engine to intelligently detect and flag outliers.