Distributed Tracing in Explore
Explore provides comprehensive support for investigating distributed traces from sources like Tempo, Jaeger, Zipkin, and other tracing backends. The trace view helps you understand request flows across microservices, identify performance bottlenecks, and debug distributed system issues.
What is Distributed Tracing?
Distributed tracing tracks requests as they flow through multiple services in a distributed system. Key concepts:
- Trace - The entire journey of a request through your system
- Spans - Individual operations within the trace (service calls, database queries, etc.)
- Span relationships - Parent-child relationships showing how operations are nested
- Metadata - Tags, logs, and timing information for each span
Traces help answer questions like “Why was this request slow?” and “Which service caused this error?”
Trace Visualization
When you query a trace, Explore displays it with several views.
Timeline View
The default timeline view shows:
- Horizontal bars - Each span as a bar, with length representing duration
- Vertical alignment - Time-aligned across all services
- Nesting - Parent-child relationships shown through indentation
- Color coding - Different services in different colors
- Error indicators - Red highlighting for spans with errors
Span Details
Clicking a span reveals:
- Duration and timing - Start time, duration, percentage of total trace time
- Service information - Service name, operation name
- Tags - Key-value metadata (HTTP status, SQL query, error messages, etc.)
- Process information - Host, pod, container details
- Logs - Log events associated with this span
- References - Links to related spans and traces
Node Graph
The node graph visualization shows:
- Services as nodes - Each service in the trace as a circle
- Calls as edges - Arrows showing request flow between services
- Metrics overlay - Request counts, error rates, latencies
- Interactive navigation - Click nodes to filter or drill down
The node graph is particularly useful for understanding service dependencies and identifying which service-to-service calls are slow or failing.
Flamegraph
For traces with many spans, the flamegraph view provides:
- Hierarchical layout - Spans stacked by call hierarchy
- Width = duration - Wider sections represent longer operations
- Quick identification - Easily spot the slowest operations
- Color coding - Different services in different colors
Querying Traces
Tempo Queries
Tempo is Grafana’s open-source tracing backend optimized for large-scale deployments.
Query by Trace ID
The most direct way to view a trace:
- Select Tempo as the data source
- Choose the TraceQL query type
- Enter the trace ID
- Click Run query
Trace IDs typically come from:
- Logs containing trace_id fields
- Error reports from applications
- Data links from metrics (exemplars)
- Other traces (linked traces)
TraceQL Queries
TraceQL is Tempo’s query language for searching traces by attributes, such as service name.
Search Results
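For example, a TraceQL search by service name might look like the following sketch, where the service name and duration threshold are illustrative:

```traceql
{ resource.service.name = "frontend" && duration > 200ms }
```

Unlike a trace ID lookup, this returns every matching trace in the selected time range.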
TraceQL queries return multiple matching traces:
- Listed in reverse chronological order
- Shows trace ID, start time, duration, and number of spans
- Color-coded by status (success/error)
- Click any trace to view its full timeline
Service Graph Queries
Tempo can generate service graphs showing request flow patterns:
- Select Service Graph query type
- Optionally filter by service or time range
- View the generated node graph showing:
  - Request rates between services
  - Error rates
  - Latency percentiles
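If Tempo's metrics-generator is enabled, these graphs are derived from generated metrics; a PromQL sketch over one such metric (assuming Tempo's default `traces_service_graph_request_total` metric with `client` and `server` labels) might be:

```promql
sum(rate(traces_service_graph_request_total[5m])) by (client, server)
```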
Jaeger Queries
Jaeger provides similar capabilities with its own UI and query syntax:
- Select Jaeger as the data source
- Choose Search query type
- Configure search parameters:
  - Service name
  - Operation name
  - Tags (key=value pairs)
  - Min/max duration
- Review matching traces
Trace Navigation
Expanding and Collapsing Spans
- Click a span's chevron to expand or collapse its child spans
- Use the collapse and expand controls above the timeline to adjust all spans at once
Searching Within a Trace
- Click the Find button (or press Ctrl/Cmd + F)
- Enter search terms to find:
  - Service names
  - Operation names
  - Tag values
  - Log messages
- Navigate matches with next/previous buttons
- Matching spans highlight in the view
Critical Path Analysis
The critical path identifies which spans contributed most to overall trace latency:
- Click Show critical path
- Spans on the critical path highlight
- These are the operations that, if made faster, would reduce total trace time
- Focus optimization efforts on critical path spans
The critical path may not always be the longest single span—it’s the sequence of dependent spans that determines total duration.
Data Links and Correlations
Traces connect to other observability data through data links.
From Traces to Logs
- Click a span in the trace view
- Look for Logs for this span in the span details
- Click to open logs in a split pane, filtered to:
  - The span’s time range
  - The service that created the span
  - The trace_id (if logged)
From Traces to Metrics
Data links can connect spans to related metrics:
- Click a span
- Find configured data links (e.g., “View service metrics”)
- Click to query metrics for that service during the trace period
Between Traces
Some systems link related traces:
- Asynchronous operations (queues, background jobs)
- Retry attempts
- Cascading operations
Trace Analysis Patterns
Finding Slow Requests
Search for traces that exceed a latency threshold, then open the slowest ones and look for the spans that dominate the timeline.
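With Tempo, one way to surface slow requests is a TraceQL duration filter; the service name and threshold below are illustrative:

```traceql
{ resource.service.name = "checkout" && duration > 1s }
```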
Debugging Errors
Check error details
Click the error span and review:
- Error tags (error.message, error.type)
- Span logs with stack traces
- HTTP status codes
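To find error traces to debug in the first place, a TraceQL status filter works in Tempo (the service name is illustrative):

```traceql
{ resource.service.name = "checkout" && status = error }
```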
Understanding Dependencies
Use the node graph view to see how services call one another and where errors or latency concentrate along those calls.
Split View Workflows
Split view is powerful for trace investigation.
Trace + Logs
- Open a trace in the left pane
- Click Split to open a right pane
- Change right pane to your log data source (Loki, Elasticsearch)
- Query logs for the same service and time range
- Cross-reference trace spans with detailed logs
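Step 4 above might look like the following LogQL sketch in Loki, assuming the service emits logfmt logs that include the trace ID; the label name is hypothetical and `<trace-id>` is a placeholder:

```logql
{service="checkout"} | logfmt | trace_id="<trace-id>"
```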
Trace + Metrics
- View a trace showing a slow database query
- Split to open metrics pane
- Query database metrics during the same time window
- Correlate trace latency with overall database performance
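The metrics query in step 3 might look like this PromQL sketch, where the metric name is hypothetical and assumes the database exports a Prometheus latency histogram:

```promql
histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le))
```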
Compare Traces
- Load a slow trace in the left pane
- Split view
- Load a normal/fast trace in the right pane
- Compare spans side-by-side to identify differences
Advanced Features
Span Filters
Filter visible spans within a trace:
- Click the Filter button
- Filter by:
  - Service name
  - Duration threshold
  - Tag values
  - Error status
- Hidden spans collapse, making it easier to focus on relevant operations
Export Trace Data
Export trace data for external analysis:
- Click the … menu in the trace view
- Choose Export
- Select format:
  - JSON - Full trace data with all spans and tags
  - OTLP - OpenTelemetry Protocol format
- Save or share the exported data
Trace Comparison
Some tracing backends support built-in trace comparison:
- Select multiple traces in search results
- Click Compare
- View side-by-side timelines
- Spot differences in:
  - Execution paths
  - Span durations
  - Error patterns
Performance Considerations
Query Efficiency
Optimize trace queries:
- Add time range filters - Narrow the search window
- Use indexed tags - Query tags that are indexed in your backend
- Combine filters - Multiple specific filters are better than one broad filter
- Limit results - Set a reasonable limit on returned traces
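Combining specific filters as suggested above might look like the following TraceQL sketch, where attribute names follow OpenTelemetry semantic conventions and the values are illustrative:

```traceql
{ resource.service.name = "checkout" && span.http.status_code = 500 && duration > 500ms }
```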
Large Traces
Traces with thousands of spans can be slow to render:
- Use span filters to hide irrelevant spans
- Collapse services you’re not investigating
- Consider if you need all the detail or just high-level flow
- Switch to flamegraph view for better performance with many spans
Best Practices
Include context in span tags
Tag spans with request-scoped context (HTTP status codes, query text, error details) so traces can be filtered and correlated later.
Use consistent naming
Standardize span and service names:
- Use semantic operation names (“GET /users/:id”, not “handler”)
- Use consistent service names across environments
- Follow OpenTelemetry semantic conventions
- Document naming patterns for your team
Connect traces to logs and metrics
Maximize observability by linking signals:
- Include trace_id in log entries
- Enable exemplars in Prometheus
- Configure data links in Grafana
- Use consistent label names across signals
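For example, a logfmt-style log line carrying the trace ID might look like this (field names and values are purely illustrative):

```
level=error trace_id=4bf92f3577b34da6 msg="payment authorization failed"
```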
Sample intelligently
Most systems can’t trace every request:
- Sample high-traffic endpoints at lower rates
- Always trace errors and slow requests (tail sampling)
- Sample more in non-production environments
- Configure sampling per service based on traffic
Troubleshooting
Trace not found
- Verify the trace ID is correct (trace IDs are typically long hex strings)
- Check the time range includes when the trace was created
- Confirm the trace was actually sent to the backend
- Check data retention policies (old traces may be deleted)
Missing spans
- Verify all services are instrumented
- Check that context propagation is working (trace IDs passed between services)
- Look for errors in instrumentation libraries
- Review sampling configuration (some spans may be intentionally dropped)
Incorrect timing
- Verify clock synchronization across services (use NTP)
- Check for time zone issues in span timestamps
- Look for instrumentation errors (start/end times swapped)
No data links
- Verify data links are configured in the data source settings
- Check that trace fields match data link expectations (trace_id format)
- Confirm you have access to linked data sources
Next Steps
Querying Metrics
Learn how to query and visualize metrics data
Querying Logs
Explore log querying and analysis techniques