Logs, Metrics, Traces: From Theory to Use at Coinbase

Sa Wang, Software Engineer | July 7, 2025

Modern systems are complex webs of microservices, APIs, databases, and cloud infrastructure. When something goes wrong, teams need more than isolated error logs. They need a clear picture of what happened, where it happened, and why. This is where observability comes in. Observability brings together logs, metrics, and traces to help teams monitor, debug, and improve their systems with confidence. In this article, we’ll explore each of the three pillars of observability, examine how they complement one another, and look at how companies like Coinbase have used graph-based approaches to make sense of billions of events and accelerate incident response.

Introduction to Observability in Modern Systems

In modern software environments, traditional monitoring alone no longer provides the clarity teams need. Monitoring answers questions like “Is the system running?” or “How much memory is being used?” but stops short of explaining why a system behaves the way it does. Observability goes a step further: it is the ability to understand a system’s internal workings by examining the signals it produces.

This distinction matters because modern systems are rarely monolithic. They consist of dozens, or sometimes thousands, of loosely coupled services. A single user request can traverse microservices, queues, databases, and third-party APIs, leaving behind a trail of signals. These signals can be spread across logs, time-series metrics, and distributed traces. Individually, each piece offers part of the story. Together, they reveal how the system actually operates under load, how failures propagate, and where bottlenecks emerge.

Observability helps teams diagnose issues that are too complex for traditional dashboards or static alerts. For example, if a checkout workflow intermittently fails, engineers need to trace the full path of each request, see performance patterns over time, and correlate error logs across services. Observability provides the framework and tools to connect these dots in real time.

By embracing observability, teams move beyond reacting to isolated symptoms. Instead, they gain the context to answer deeper questions: How are services interacting? Which dependencies are the most fragile? What changes preceded this issue? This shift enables faster incident response, safer deployments, and continuous performance improvements.

Throughout this article, we’ll explore the three main types of telemetry data that underpin observability: logs, metrics, and traces. We’ll also look at how combining them helps teams maintain reliable systems at scale.

What are Logs?

Logs are structured or unstructured records of events generated by applications, infrastructure, and security systems. They are among the oldest and most familiar forms of telemetry. Each log entry typically captures a timestamp, a severity level (such as info, warning, or error), and a message describing what happened.

Logs are created continuously as software runs. An HTTP server might log incoming requests and responses, an authentication service might log login attempts, and a database might log queries or errors. Because logs record discrete events in sequence, they provide a narrative of system activity over time.

Unlike metrics, which aggregate data into numerical summaries, logs preserve detail and context. A log message can include stack traces, user identifiers, query parameters, or any other information the developer chooses to record. This makes logs an essential tool for diagnosing unexpected behavior, tracking security events, and auditing transactions.
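
As a minimal sketch of what structured logging can look like in practice, the Python example below emits one JSON entry per event with a timestamp, severity level, and message, plus whatever extra context the caller attaches. The field names and the "checkout" logger are illustrative choices, not a required schema.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach any extra context the caller supplied (e.g. order_id).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment declined", extra={"context": {"order_id": "o-123", "retryable": True}})
```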

Benefits of Logs

  • Detailed Context: Logs can capture rich information about what was happening at the moment of an event, often including variables, error messages, and contextual metadata.

  • Flexible Schema: Logs can record nearly any kind of data. While structured logging formats like JSON are increasingly common, logs can still be free-form text when needed.

  • Historical Record: Logs serve as an authoritative timeline of activities, helping teams reconstruct incidents long after they occur.

  • Searchable Evidence: Modern log management systems let teams search, filter, and correlate logs across multiple services, accelerating investigations.

When combined with metrics and traces, logs fill in the details that metrics alone can’t provide and help pinpoint exactly where and why something went wrong.

What are Metrics?

Metrics are numeric measurements collected over time to capture the behavior and performance of a system. Unlike logs, which record individual events, metrics aggregate data into time series that show trends, thresholds, and patterns.

Every part of a modern application can emit metrics. An API might record request rates, error counts, and response times. A database could expose metrics for query latency or connection pool usage. Infrastructure metrics often track CPU utilization, memory consumption, and disk I/O. These measurements are usually collected at regular intervals and stored in specialized systems optimized for time-series data.

Metrics are typically labeled with dimensions—metadata that describes what is being measured. For example, an HTTP request duration metric might include labels for the service name, endpoint path, response status, and region. This labeling allows teams to slice and filter metrics to pinpoint issues in specific parts of the system.

Because metrics are lightweight and aggregated, they are ideal for alerting. They enable teams to set thresholds and automatically trigger notifications when performance degrades or resource usage exceeds safe limits.
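
As a rough sketch of how such a labeled measurement might be recorded, here is an example using the Prometheus Python client. The metric name, label values, and port are illustrative assumptions, not something the article prescribes.

```python
from prometheus_client import Histogram, start_http_server

# Request duration, labeled with the dimensions described above.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["service", "endpoint", "status", "region"],
)

def handle_checkout():
    # Observe one request's duration under its label set.
    with REQUEST_DURATION.labels(
        service="checkout", endpoint="/orders", status="200", region="us-east-1"
    ).time():
        ...  # handle the request here

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping on port 8000
    handle_checkout()
```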

Benefits of Metrics

  • Real-Time Monitoring: Metrics are well suited for tracking performance and health as they change, making them essential for dashboards and alerting systems.

  • Efficient Storage: Aggregated numeric data requires far less storage than logs and can be retained over long periods to analyze trends.

  • Clear Visualization: Time series metrics can be graphed to reveal patterns, such as gradual latency increases or sudden traffic spikes.

  • Objective Thresholds: Metrics provide quantitative baselines that help define what “normal” looks like, enabling teams to detect anomalies quickly.

By combining metrics with logs and traces, teams get both high-level trends and detailed evidence of what occurred, supporting faster detection and diagnosis of issues.

What are Traces?

Traces capture the full journey of a single request or transaction as it moves through a distributed system. Each trace records how a request passes between services, databases, queues, and other components, showing the sequence of operations and how long each step takes.

A trace is composed of spans—individual units of work, like an HTTP call or a database query. Each span contains metadata such as start and end timestamps, operation names, error flags, and tags describing what happened. Together, the spans form a tree that visualizes the request’s path from start to finish.

For example, a trace might begin when a user clicks “Submit Order” in a web application. The trace would include the initial API call to the order service, the subsequent call to the payment service, the database writes, and the final confirmation. If any part of this workflow is slow or fails, tracing shows exactly where the problem occurred.

Because traces follow requests across system boundaries, they are invaluable for understanding complex dependencies, measuring latency end to end, and uncovering hidden sources of failure.
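
For illustration, the “Submit Order” flow above might be instrumented along these lines with the OpenTelemetry Python SDK. The span names, attributes, and console exporter are assumptions made so the sketch is self-contained; they are not details from the article.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal setup: print spans to the console instead of a real backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

def submit_order(order_id: str):
    # Root span for the whole request; nested spans form the trace tree.
    with tracer.start_as_current_span("submit_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_payment"):
            ...  # call the payment service
        with tracer.start_as_current_span("write_order_record"):
            ...  # write to the database

submit_order("o-123")
```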

Benefits of Traces

  • End-to-End Visibility: Traces show how services interact and where time is spent, making it easier to isolate performance bottlenecks.

  • Causal Relationships: Unlike logs or metrics alone, traces reveal the exact sequence of events that led to an issue.

  • Detailed Timing: Each span records precise durations, enabling teams to quantify latency and optimize critical paths.

  • Contextual Debugging: Traces often include metadata such as user IDs, request parameters, and error codes, helping teams correlate issues to real transactions.

When combined with logs and metrics, traces give teams the clarity to move beyond symptoms and pinpoint root causes in distributed systems.

Comparative Analysis: Logs vs. Metrics vs. Traces

Logs, metrics, and traces each provide a unique perspective on system behavior. Used individually, they help answer different questions. Used together, they create a more complete understanding of what is happening and why.

Logs excel at capturing granular details about discrete events. They preserve context—like stack traces, request payloads, and custom error messages—that other telemetry types often omit. When an incident occurs, logs are usually the first place teams look to reconstruct what happened.

Metrics are best for tracking numeric trends over time. They provide a high-level view of system health and performance, enabling teams to set alerts, monitor thresholds, and visualize long-term patterns. Metrics are lightweight to collect and store, making them ideal for real-time dashboards and historical analysis.

Traces connect the dots between services and reveal how requests flow through the system. They uncover performance bottlenecks, illustrate dependencies, and help teams understand the full lifecycle of a transaction across distributed components.

While logs and metrics often describe what happened, traces show how and where it happened in context. This distinction becomes critical in distributed environments, where requests span multiple services and infrastructure layers.

Choosing the right mix of telemetry depends on the problem you’re trying to solve. For example:

  • If you need to understand why latency increased for a user request, traces provide the clearest explanation.

  • If you want to monitor the rate of failed logins over time, metrics offer an efficient way to track and alert.

  • If you’re diagnosing an unexpected crash, logs will likely contain the error details you need.

The table below summarizes how logs, metrics, and traces differ in purpose, strengths, and limitations:

| Aspect | Logs | Metrics | Traces |
| --- | --- | --- | --- |
| Purpose | Record discrete events and messages | Track numeric trends over time | Capture end-to-end request flows |
| Data Format | Textual (structured or unstructured) | Time series of numeric values | Tree of spans with metadata |
| Granularity | High detail, single-event level | Aggregated, summary-level | Request-level with per-operation detail |
| Best For | Debugging errors, auditing events | Monitoring health, triggering alerts | Diagnosing performance, understanding dependencies |
| Storage Needs | High (due to volume and detail) | Low to moderate | Moderate to high, depending on sampling |
| Example Questions | What error occurred? Who triggered it? | How many requests per second? What is CPU usage? | Where did latency occur in this workflow? |
| Strength | Rich context and detail | Efficient monitoring and trend analysis | Clear view of request paths and timing |
| Limitation | Hard to aggregate, noisy | Lacks detail about individual events | More complex to store and analyze |

Integrating the Three Pillars for Enhanced Observability

While logs, metrics, and traces each bring valuable insights on their own, their real strength comes from how they work together. In practice, integration starts with consistent instrumentation. When services emit logs, metrics, and traces in a coordinated way—using shared identifiers such as trace IDs and request IDs—it becomes possible to connect a metric spike to the exact logs and traces that explain it. For example, when latency rises, teams can immediately pull up related traces to see which operations were slow, and then review logs to find detailed error messages or configuration changes that contributed.
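
One common way to propagate such a shared identifier is to stamp the active trace ID onto every log record. The sketch below assumes an OpenTelemetry tracer is already configured (as in the earlier tracing example); the log format itself is illustrative.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        record.span_id = f"{ctx.span_id:016x}" if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)
# Any log emitted inside an active span can now be joined to that trace.
```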

Correlation is essential. A single event may generate a log entry, increment a metric counter, and create a span in a trace. By linking these signals together, observability platforms can provide a unified timeline of events and surface the root cause more quickly. Modern observability tools often do this automatically, letting engineers pivot seamlessly between dashboards, log searches, and trace visualizations.

Integrated observability also helps teams proactively improve reliability. Metrics show long-term performance trends, traces uncover hidden latency across services, and logs provide the necessary detail for thorough postmortems. By combining them, teams can validate fixes, measure the impact of optimizations, and ensure that improvements are sustained over time.

In high-scale environments, this approach becomes even more critical. When millions of events happen every minute, observability systems must help filter noise and highlight what matters. Correlated telemetry makes it possible to detect emerging issues before they become major incidents, and to respond with confidence when they do.

Challenges and Best Practices

As systems become more complex, the challenge of observability shifts from merely collecting data to making it actionable. One of the most frequent issues is data fragmentation. Logs, metrics, and traces often live in separate tools, making it difficult to correlate signals quickly. Investing in platforms that can ingest and link all three types of telemetry helps reduce blind spots and speeds up investigations.

Another challenge is lack of standardization. If teams don’t agree on consistent naming, tagging, and structure, even well-instrumented systems produce noisy, inconsistent data that is hard to search or visualize. Establishing conventions for labels (like environment, service name, and request IDs) makes it easier to filter and join information across sources.
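
One way to enforce such conventions is to define the shared attributes in a single place and apply them to everything a service emits. The sketch below uses OpenTelemetry resource attributes purely as an illustration; the specific keys and values are examples, not a required standard.

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# One place to define the labels every signal from this service shares.
# Keys follow OpenTelemetry semantic conventions; values are examples.
SERVICE_RESOURCE = Resource.create({
    "service.name": "checkout",
    "service.version": "1.42.0",
    "deployment.environment": "production",
})

# The same resource is attached to traces (and, analogously, to metrics and logs).
provider = TracerProvider(resource=SERVICE_RESOURCE)
```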

Retention and storage costs are also important to consider. High-resolution traces and detailed logs can quickly grow to terabytes of data. Defining sensible retention policies and sampling strategies ensures that teams keep what they need without overwhelming budgets or storage systems.
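
As one illustration of a sampling strategy, the sketch below applies probabilistic head sampling with the OpenTelemetry Python SDK. The 10% ratio is an arbitrary example; the article does not prescribe a particular tool or rate.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces; child spans follow the parent's decision,
# so a sampled request stays complete end to end.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
```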

Finally, observability is most effective when it is proactive, not reactive. Setting clear objectives, like tracking SLO compliance, monitoring key user journeys, and alerting on early signs of degradation, helps teams catch issues before they impact customers.

Graph Modeling for Deeper Insights

While logs, metrics, and traces are the foundation of observability, some teams choose to go further by modeling their systems as graphs. A graph represents components and their relationships as nodes and edges, making it easier to visualize how requests, dependencies, and failures are connected.

In distributed environments, a single request often touches many services. Understanding which paths were involved and how issues propagate can be challenging when data is scattered across tools. Graph modeling helps bring this information together into a single view that shows how everything fits.

With a graph, teams can answer questions like:

  • Which services are impacted by a failing dependency?

  • What are all the downstream systems involved in a critical workflow?

  • How did this request flow across infrastructure layers?

Graph query engines like PuppyGraph make this approach more accessible. Instead of requiring a separate graph database, PuppyGraph can connect to existing relational data and treat it as a graph, allowing teams to use familiar query languages such as Cypher or Gremlin. This enables faster troubleshooting and deeper insights without duplicating or moving large amounts of data.
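
For example, the first question above might be answered with a traversal like the following, written with the Gremlin Python client. The endpoint address and the assumed schema (service vertices linked by CALLS edges) are illustrative assumptions, not details from the article or PuppyGraph's documentation.

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Connect to a Gremlin-compatible endpoint (the address is an assumption).
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# "Which services are impacted by a failing dependency?"
# Walk incoming CALLS edges from the failing service to its direct callers.
impacted = (
    g.V().has("service", "name", "payments")
     .in_("CALLS")
     .values("name")
     .dedup()
     .toList()
)
print(impacted)
conn.close()
```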

For organizations operating at scale, graph modeling can complement traditional observability by uncovering hidden connections and improving incident analysis.

Case Study: Coinbase’s Service Graph for Automated RCA

Coinbase runs one of the world’s largest cryptocurrency platforms, supported by thousands of microservices and data pipelines. This scale has made it increasingly difficult to understand how services interact and why issues occur. In environments like this, even a low-severity alert in a minor service can trigger cascading issues that affect seemingly unrelated products, often requiring significant time to understand what happened.

To tackle this problem, Coinbase built an automated service graph combining trace data, metadata tagging, and graph analytics. The team began by collecting massive volumes of distributed tracing data using Datadog agents. To ensure complete and unsampled visibility, they adopted a dual-shipping model, streaming traces simultaneously to Datadog for monitoring and to Databricks via a custom ingestion API. This approach preserved a comprehensive dataset of all service interactions. Coinbase then transformed the raw spans into a graph representation, where each row showed which service called which other service, along with frequency and metadata.
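
A rough sketch of that span-to-edge transformation is shown below in PySpark, since the data lands in Databricks. The table names and span schema (span_id, parent_span_id, service) are assumptions for illustration, not Coinbase's actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("service-graph-edges").getOrCreate()

# Assumed span schema: span_id, parent_span_id, service, plus timing fields.
spans = spark.table("observability.raw_spans")

# Join each span to its parent span to recover caller -> callee service pairs,
# then aggregate into one edge row per (caller, callee) with a call count.
parents = spans.select(
    F.col("span_id").alias("parent_span_id"),
    F.col("service").alias("caller_service"),
)
edges = (
    spans.join(parents, on="parent_span_id")
         .where(F.col("caller_service") != F.col("service"))
         .groupBy("caller_service", F.col("service").alias("callee_service"))
         .agg(F.count("*").alias("call_count"))
)
edges.write.mode("overwrite").saveAsTable("observability.service_graph_edges")
```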

Figure: A diagram showing the observability tech stack

PuppyGraph ingested this data as a live graph model without ETL, allowing teams to query relationships across the system with openCypher or Gremlin. The graph is a key component in the workflow of the RCA Bot, an automated system that streamlines incident response.

Figure: Coinbase's incident bot automation flow

When an incident is created, RCA Bot automatically:

  1. Finds starting clues by checking who was paged and which monitors fired.

  2. Identifies related services owned by the responder or linked to the alert.

  3. Expands the service graph using PuppyGraph to see connected dependencies (see the sketch after this list).

Figure: Observability root cause analysis graph powered by PuppyGraph

  4. Collects recent data about deployments, config changes, errors, and performance.

  5. Recommends likely causes and shares results in Slack.
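
To make step 3 concrete, here is a hypothetical multi-hop expansion against the service graph using the Gremlin Python client. The starting service, edge label, and three-hop limit are illustrative assumptions, not Coinbase's actual query.

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Expand up to three hops of downstream dependencies from the alerting service.
dependencies = (
    g.V().has("service", "name", "orders-api")
     .repeat(__.out("CALLS")).times(3).emit()
     .dedup()
     .values("name")
     .toList()
)
print(dependencies)
conn.close()
```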

Beyond RCA, Coinbase uses the same tracing data and service graph to map which backend projects support critical user journeys, measure end-to-end reliability, and extend data lineage mapping across teams. This approach has significantly reduced investigation time and improved visibility into their distributed systems. 

Watch the talk by Eric Weiss for more details.

Conclusion

Observability is no longer optional in modern software environments. As systems become more distributed and complex, teams need ways to understand not just whether something is working, but how and why it fails. Logs, metrics, and traces each offer valuable perspectives. When combined, they help teams detect issues early, diagnose root causes, and measure the health of their services with confidence.

For organizations operating at larger scales, techniques like graph modeling can provide an even richer understanding of dependencies and relationships between systems. Whether you are just starting to improve observability or exploring advanced approaches like service graphs, investing in these practices helps ensure your teams can respond quickly to problems and build more reliable software over time. Feel free to try the forever-free PuppyGraph Developer Edition or book a demo with our team.

Sa Wang
Software Engineer

Sa Wang is a Software Engineer with exceptional math abilities and strong coding skills. He earned his Bachelor's degree in Computer Science from Fudan University and has been studying Mathematical Logic in the Philosophy Department at Fudan University, expecting to receive his Master's degree in Philosophy in June this year. He and his team won a gold medal in the Jilin regional competition of the China Collegiate Programming Contest and received a first-class award in the Shanghai regional competition of the National Student Math Competition.