Add Observability to LLM and Agent Applications

A practical guide to instrumenting LLM and agent applications with traces, metrics, logs, GenAI semantic attributes, sampling, and privacy-aware redaction so teams can debug model calls, tool use, retries, and cost.

By MkDev11 · added 2026-06-04

## TL;DR

LLM and agent observability starts with one trace per user request. Inside that
trace, record the retrieval steps, model calls, tool calls, retries, fallbacks,
and final response path. Add metrics for latency, errors, token usage, cost, and
tool outcomes. Use structured logs for decisions that need human debugging.

The hard part is not adding more telemetry. The hard part is deciding what to
keep, what to redact, and which signals actually help maintainers debug a
production incident.

## Prerequisites & Requirements

- [ ] **Request boundaries**: The app can identify one user request, task, job, or conversation turn.
- [ ] **Telemetry backend**: Traces, metrics, and logs can be exported to a collector or observability service.
- [ ] **Privacy policy**: Prompt, completion, retrieval, and tool data retention is defined before export.
- [ ] **Failure fixtures**: Tests or staging traffic cover model errors, tool errors, retries, and timeouts.

## Core Concepts Explained

### Traces show the agent path

Traces explain what happened during one request. For an LLM app, the trace should
connect the incoming request, retrieval, prompt assembly, model call, tool calls,
retry decisions, and final response.

### Metrics show system health

Metrics answer operational questions across many requests: latency, error rate,
model-call volume, token usage, cost, queue depth, retry count, fallback rate,
and tool success rate.
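
Cost is usually derived from token counts rather than reported directly. The
sketch below shows one way to turn per-call usage into a cost estimate that can
then be emitted as a metric. The price table, the `example-model` name, and the
`Usage` type are assumptions for illustration; real prices come from your
provider's price sheet.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens.
PRICES_PER_MILLION = {
    "example-model": {"input": 3.00, "output": 15.00},
}


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


def estimate_cost_usd(model: str, usage: Usage) -> float:
    """Estimate the cost of one model call from its token counts."""
    price = PRICES_PER_MILLION[model]  # raises KeyError for unpriced models
    total = (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000
```

Failing loudly on an unpriced model is a deliberate choice here: a silent zero
would hide exactly the cost spike the metric is meant to reveal.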

### Logs explain individual decisions

Logs should be structured, redacted, and tied to trace ids. Use them for events
that need narrative detail: chosen tool, rejected tool output, fallback reason,
validation failure, policy decision, or user-visible error.
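
A minimal sketch of such a log event, using only the standard library. The
`decision_event` helper and the logger name are hypothetical. In an
OpenTelemetry-instrumented app, the two ids would typically come from
`trace.get_current_span().get_span_context()`; here they are passed in so the
formatting is easy to see.

```python
import json
import logging

logger = logging.getLogger("llm.app.decisions")


def decision_event(event: str, trace_id: int, span_id: int, **fields) -> str:
    """Serialize one decision as a JSON log line that can be joined to a trace."""
    record = {
        "event": event,
        # OpenTelemetry renders trace ids as 32 hex chars and span ids as 16.
        "trace_id": format(trace_id, "032x"),
        "span_id": format(span_id, "016x"),
        **fields,  # compact, already-redacted details only
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

For example, `decision_event("fallback_model_used", trace_id, span_id,
reason="timeout")` produces a single line that a log backend can index by
`trace_id`.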

### GenAI attributes make traces comparable

OpenTelemetry's GenAI semantic conventions define common attributes for
generative AI systems. Using shared names for provider, model, operation, usage,
and request metadata makes traces easier to query across providers and tools.

By convention, a GenAI span is named `{gen_ai.operation.name} {gen_ai.request.model}`
(for example, `chat claude-3`), so traces stay readable even when several
providers and models run side by side.

### GenAI semantic-convention attributes

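These are the OpenTelemetry GenAI span attributes most relevant to LLM and agent
instrumentation. At the time of writing, the conventions mark
`gen_ai.operation.name` and `gen_ai.provider.name` as required on model spans,
`gen_ai.request.model` as conditionally required when it is known, and
`error.type` as conditionally required when the operation fails. Token usage is
recommended whenever the provider reports it.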

| Attribute | Type | Example value | What it captures |
| --- | --- | --- | --- |
| `gen_ai.operation.name` | string | `chat`, `embeddings`, `execute_tool`, `invoke_agent` | The kind of GenAI operation the span represents |
| `gen_ai.provider.name` | string | `anthropic`, `openai`, `aws.bedrock`, `gcp.vertex_ai` | The provider or platform serving the request |
| `gen_ai.request.model` | string | `gpt-4` | The model requested by the caller |
| `gen_ai.response.model` | string | `gpt-4-0613` | The model that actually produced the response |
| `gen_ai.request.temperature` | double | `0.7` | Sampling temperature requested |
| `gen_ai.request.max_tokens` | int | `100` | Maximum tokens requested for the completion |
| `gen_ai.request.top_p` | double | `1.0` | Nucleus-sampling parameter requested |
| `gen_ai.usage.input_tokens` | int | `100` | Tokens consumed by the prompt/input |
| `gen_ai.usage.output_tokens` | int | `180` | Tokens produced in the completion |
| `gen_ai.response.id` | string | `chatcmpl-123` | Provider-assigned response identifier |
| `gen_ai.response.finish_reasons` | string[] | `["stop"]`, `["length"]` | Why generation stopped |
| `error.type` | string | provider error code or exception name | Set when the operation fails |

For aggregate health, the conventions also define client metrics:
`gen_ai.client.operation.duration` (histogram, unit `s`) for operation latency
and `gen_ai.client.token.usage` (histogram, unit `{token}`) for token counts.
Both carry the operation name, provider name, and request model as attributes,
and the token histogram adds `gen_ai.token.type` (`input` or `output`), so
latency and token usage can be sliced by model the same way traces are.
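
The sketch below shows how one model response might be turned into data points
for `gen_ai.client.token.usage`: one point per token type, each carrying the
shared attributes. The `token_usage_points` helper is hypothetical, and the
`gen_ai.token.type` values reflect the conventions at the time of writing. Each
returned pair would then be passed to a histogram instrument, for example with
`record(value, attributes=attrs)` in the OpenTelemetry metrics API.

```python
def token_usage_points(
    operation: str,
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> list:
    """Build (value, attributes) pairs for the token-usage histogram."""
    base = {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
    }
    return [
        # One data point per token type so input and output can be charted apart.
        (input_tokens, {**base, "gen_ai.token.type": "input"}),
        (output_tokens, {**base, "gen_ai.token.type": "output"}),
    ]
```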

### Instrumenting a model span

The example below creates a model-call span and sets GenAI attributes by their
semantic-convention names. Use your provider SDK in place of `call_model`; the
attribute names stay the same across providers, which is the point of the shared
conventions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.app")


def call_model(request_model: str, messages: list) -> dict:
    # Placeholder: replace with your provider SDK call.
    raise NotImplementedError("wire in your provider SDK here")


def traced_chat(request_model: str, messages: list) -> dict:
    # Span name follows the GenAI convention: "{operation} {model}".
    # Model calls are outbound requests, so the span kind is CLIENT.
    with tracer.start_as_current_span(
        f"chat {request_model}", kind=trace.SpanKind.CLIENT
    ) as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", request_model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)
        span.set_attribute("gen_ai.request.temperature", 0.2)

        try:
            response = call_model(request_model, messages)
        except Exception as exc:
            # Conditionally required when the operation fails.
            span.set_attribute("error.type", type(exc).__name__)
            # Re-raise: by default start_as_current_span records the exception
            # and sets the span status to ERROR on exit, so doing it here as
            # well would record the exception twice.
            raise

        usage = response["usage"]
        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.response.id", response["id"])
        span.set_attribute("gen_ai.response.finish_reasons", [response["stop_reason"]])
        span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])
        return response
```

Note that prompt and completion text are deliberately not placed on the span;
only ids, the model, finish reason, and token counts are recorded. Capture raw
message content only through an approved debug path, per your redaction policy.

## Step-by-Step Implementation Guide

1. **Pick the root span.** Create one root span for each user request,
   conversation turn, background job, or agent task. Put the request id,
   environment, route, tenant or workspace hash, and app version on that span.

2. **Trace the model boundary.** Add child spans around each model call. Capture
   model provider, model name, operation name, prompt version, latency, status,
   retry count, fallback use, and token usage when available.

3. **Trace retrieval and context assembly.** Add spans for vector search,
   database lookups, document fetches, reranking, and prompt assembly. Store
   counts, ids, and scores rather than raw documents unless your privacy policy
   explicitly allows content capture.

4. **Trace tool calls separately.** Each tool call should have its own span with
   tool name, target system, status, latency, retry count, and error class. Link
   the tool span to the model decision that requested it.

5. **Emit operational metrics.** Track request latency, model latency, tool
   latency, error counts, token usage, cost estimates, timeout counts, retry
   counts, fallback counts, and queue depth for async agents.

6. **Use structured logs for decisions.** Log compact events such as
   `tool_selected`, `tool_rejected`, `schema_validation_failed`,
   `retrieval_empty`, `fallback_model_used`, and `human_review_required`. Include
   trace ids so logs and traces can be joined.

7. **Redact before export.** Remove or hash emails, access tokens, file names,
   account ids, raw documents, prompt text, completions, and tool outputs unless
   the team has approved retention for that field.

8. **Sample deliberately.** Keep full traces for errors, timeouts, high-latency
   requests, new releases, and expensive model calls. Sample routine successful
   traffic if volume or privacy risk is high.

9. **Build an incident view.** A useful dashboard answers: which model failed,
   which tool failed, where latency grew, whether retries helped, whether cost
   spiked, and whether a release changed behavior.
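
Step 7 (redact before export) can be sketched as a small filter applied to
attributes before they reach a span or log record. Everything here is an
assumption for illustration: the key names in `DROP_KEYS` and `HASH_KEYS`, the
`HASH_KEY` secret, and the email pattern. A keyed hash is used instead of a
plain digest because short ids are easy to recover from an unkeyed hash by
enumeration.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical policy: which attribute keys are dropped and which are hashed.
DROP_KEYS = {"prompt", "completion", "tool_output", "raw_document"}
HASH_KEYS = {"account_id", "file_name", "user_id"}
HASH_KEY = b"replace-with-a-managed-secret"  # assumption: loaded from a secret store


def _keyed_hash(value: str) -> str:
    digest = hmac.new(HASH_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes that is safe to export under the policy above."""
    redacted = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # never export raw content by default
        if key in HASH_KEYS:
            redacted[key] = _keyed_hash(str(value))
        elif isinstance(value, str):
            redacted[key] = EMAIL_RE.sub("[email]", value)
        else:
            redacted[key] = value
    return redacted
```

Hashing rather than dropping identifiers keeps them joinable across spans and
logs without exposing the original value.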

## Reusing Existing Agent Telemetry

If your agents run on top of Claude Code, you do not have to instrument the host
from scratch. Claude Code can export OpenTelemetry data directly: set
`CLAUDE_CODE_ENABLE_TELEMETRY=1`, choose exporters with `OTEL_METRICS_EXPORTER`
and `OTEL_LOGS_EXPORTER` (both support `otlp`, `console`, or `none`; metrics also
support `prometheus`), and point `OTEL_EXPORTER_OTLP_ENDPOINT` at your collector.
It emits the metrics `claude_code.token.usage`, `claude_code.cost.usage`, and
`claude_code.session.count`, tool-execution telemetry
(`claude_code.tool.execution`), and the events `claude_code.api_request` and
`claude_code.api_error`, which arrive through the logs exporter and are the
signals to watch for model-call health. Send those into the same collector as
your own GenAI spans so host usage and your application traces share one
backend.
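
If Claude Code is launched from a Python wrapper, the settings above can be
assembled into a process environment. This is a minimal sketch: the
`claude_code_telemetry_env` helper is hypothetical, and the endpoint in the
usage note is an assumption (the conventional local OTLP gRPC port), not a
value from this guide.

```python
import os


def claude_code_telemetry_env(otlp_endpoint: str, base_env=None) -> dict:
    """Return an environment mapping with Claude Code OTLP export enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
            "OTEL_METRICS_EXPORTER": "otlp",
            "OTEL_LOGS_EXPORTER": "otlp",
            "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,
        }
    )
    return env
```

The result of `claude_code_telemetry_env("http://localhost:4317")` can be passed
as the `env` argument of `subprocess.run` when starting the agent process.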

## Observability Checklist

- [ ] **Trace coverage**: Requests include model, retrieval, tool, retry, fallback, and response spans.
- [ ] **Metrics coverage**: Latency, errors, token usage, cost, retry, and tool outcome metrics exist.
- [ ] **Log joins**: Structured logs carry trace ids or request ids.
- [ ] **GenAI attributes**: Provider, model, operation, status, and usage fields follow shared semantic names where possible.
- [ ] **Redaction boundary**: Sensitive prompt, completion, retrieval, and tool data is removed or explicitly retained.
- [ ] **Sampling policy**: Errors and high-risk paths keep enough detail while routine traffic is sampled.
- [ ] **Incident dashboard**: Maintainers can diagnose model, retrieval, tool, latency, and cost failures quickly.

## What to Alert On

Alert on symptoms that a maintainer can act on:

- Model error rate or timeout rate above normal.
- Tool failure rate, validation failures, or repeated retries.
- Retrieval returning empty or low-confidence context for important routes.
- Token usage or cost estimates rising sharply after a release.
- Queue depth, job age, or agent task duration crossing a service target.
- Fallback model usage increasing unexpectedly.

Avoid alerting on every individual model refusal, low-confidence answer, or
sampled trace gap unless it maps to a clear action.
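
"Above normal" needs a concrete rule before it can page anyone. The sketch
below shows one simple form: compare the error rate in a window against a known
baseline, and ignore windows with too little traffic. The `should_alert` helper
and its default thresholds are illustrative assumptions; most teams would
express the same rule in their alerting backend rather than in application
code.

```python
def should_alert(
    errors: int,
    total: int,
    baseline_rate: float,
    factor: float = 2.0,
    min_requests: int = 50,
) -> bool:
    """Return True when the window's error rate exceeds factor x baseline."""
    if total < min_requests:
        # Too few requests: a single failure would dominate the rate.
        return False
    return errors / total > baseline_rate * factor
```

With a 2% baseline, 10 errors in 100 requests triggers the rule, while 3 errors
in 100 requests or 5 errors in 10 requests does not.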

## Troubleshooting

- **Traces are too noisy**: keep the request, model, retrieval, and tool spans,
  then drop internal helper spans that do not explain behavior.
- **Telemetry contains too much user data**: export ids, counts, hashes, scores,
  and prompt versions by default; capture raw content only in approved debug
  paths.
- **Costs are hard to explain**: record model name, token usage, retry count,
  fallback model, and request route on model spans.
- **Tool failures are invisible**: put every external action in its own span and
  log the validation or error class.
- **Sampling hides incidents**: always keep error, timeout, high-cost, and
  high-latency traces, then sample ordinary successful requests.
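
The last rule can be sketched as a keep-or-drop decision made after a request
finishes, which is the tail-sampling style usually applied in a collector
rather than in the app. The `keep_trace` helper, the thresholds, and the 5%
baseline are assumptions for illustration.

```python
def keep_trace(
    trace_id: int,
    *,
    is_error: bool,
    timed_out: bool,
    latency_s: float,
    cost_usd: float,
    latency_threshold_s: float = 10.0,
    cost_threshold_usd: float = 0.50,
    baseline_ratio: float = 0.05,
) -> bool:
    """Decide whether a finished trace is kept in full."""
    # Always keep traces that are likely to matter during an incident.
    if is_error or timed_out:
        return True
    if latency_s >= latency_threshold_s or cost_usd >= cost_threshold_usd:
        return True
    # Deterministic sampling of routine traffic: the same trace id always
    # gets the same decision, so every service keeps or drops it together.
    return (trace_id % 10_000) < baseline_ratio * 10_000
```

Because the routine-traffic decision depends only on the trace id, raising
`baseline_ratio` during an incident keeps every trace that was already being
kept and adds more.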

## Duplicate Check

This guide is vendor-neutral and focuses on the observability architecture for
LLM and agent applications. Existing entries cover specific observability,
evaluation, and tracing tools; this guide is distinct because it explains the
signals, spans, metrics, logs, redaction, and sampling strategy that can be used
with those tools.

## References

- OpenTelemetry traces - https://opentelemetry.io/docs/concepts/signals/traces/
- OpenTelemetry metrics - https://opentelemetry.io/docs/concepts/signals/metrics/
- OpenTelemetry logs - https://opentelemetry.io/docs/concepts/signals/logs/
- OpenTelemetry GenAI semantic conventions - https://opentelemetry.io/docs/specs/semconv/gen-ai/
- OpenTelemetry sampling - https://opentelemetry.io/docs/concepts/sampling/
- OpenTelemetry JavaScript instrumentation - https://opentelemetry.io/docs/languages/js/instrumentation/
- OpenTelemetry Python instrumentation - https://opentelemetry.io/docs/languages/python/instrumentation/
- OpenTelemetry documentation home - https://opentelemetry.io/docs/
- Claude Code monitoring and OpenTelemetry export - https://code.claude.com/docs/en/monitoring-usage


Citation facts

Source-backed facts for citing this resource, derived directly from the registry — also available as plain text for AI assistants.

Canonical URL: https://heyclau.de/entry/guides/llm-agent-application-observability
Source URLs: https://opentelemetry.io/docs/specs/semconv/gen-ai/, https://github.com/JSONbored/awesome-claude/blob/main/content/guides/llm-agent-application-observability.mdx
Safety notes: Observability data is production evidence, not proof that an LLM answer or agent action is correct. Do not let tracing wrappers change request ordering, retry behavior, timeout handling, or user-visible agent decisions. Keep alerting focused on actionable symptoms such as latency, error rate, failed tool calls, and budget anomalies.
Privacy notes: Prompts, completions, retrieved documents, tool arguments, tool outputs, embeddings metadata, user ids, and file names may appear in telemetry. Redact or hash sensitive fields before export, and store raw prompt/response content only when the team has an explicit retention policy. Use sampling and field-level controls so debug detail can increase during incidents without retaining every user conversation forever.
Author: MkDev11
Submitted by: MkDev11
Claim status: unclaimed
Last verified: 2026-06-04


Safety notes

Observability data is production evidence, not proof that an LLM answer or agent action is correct.
Do not let tracing wrappers change request ordering, retry behavior, timeout handling, or user-visible agent decisions.
Keep alerting focused on actionable symptoms such as latency, error rate, failed tool calls, and budget anomalies.

Privacy notes

Prompts, completions, retrieved documents, tool arguments, tool outputs, embeddings metadata, user ids, and file names may appear in telemetry.
Redact or hash sensitive fields before export, and store raw prompt/response content only when the team has an explicit retention policy.
Use sampling and field-level controls so debug detail can increase during incidents without retaining every user conversation forever.

Prerequisites

An LLM or agent application with identifiable request, model-call, retrieval, and tool-execution boundaries.
An observability backend or collector that can receive OpenTelemetry traces, metrics, and logs.
A policy for which prompt, completion, retrieval, and tool data may be retained.
Test traffic that exercises normal responses, model failures, retries, and tool errors.

Full copyable content

## TL;DR

LLM and agent observability starts with one trace per user request. Inside that
trace, record the retrieval steps, model calls, tool calls, retries, fallbacks,
and final response path. Add metrics for latency, errors, token usage, cost, and
tool outcomes. Use structured logs for decisions that need human debugging.

The hard part is not adding more telemetry. The hard part is deciding what to
keep, what to redact, and which signals actually help maintainers debug a
production incident.

## Prerequisites & Requirements

- [ ] {"task": "Request boundaries", "description": "The app can identify one user request, task, job, or conversation turn"}
- [ ] {"task": "Telemetry backend", "description": "Traces, metrics, and logs can be exported to a collector or observability service"}
- [ ] {"task": "Privacy policy", "description": "Prompt, completion, retrieval, and tool data retention is defined before export"}
- [ ] {"task": "Failure fixtures", "description": "Tests or staging traffic cover model errors, tool errors, retries, and timeouts"}

## Core Concepts Explained

### Traces show the agent path

Traces explain what happened during one request. For an LLM app, the trace should
connect the incoming request, retrieval, prompt assembly, model call, tool calls,
retry decisions, and final response.

### Metrics show system health

Metrics answer operational questions across many requests: latency, error rate,
model-call volume, token usage, cost, queue depth, retry count, fallback rate,
and tool success rate.

### Logs explain individual decisions

Logs should be structured, redacted, and tied to trace ids. Use them for events
that need narrative detail: chosen tool, rejected tool output, fallback reason,
validation failure, policy decision, or user-visible error.

### GenAI attributes make traces comparable

OpenTelemetry's GenAI semantic conventions define common attributes for
generative AI systems. Using shared names for provider, model, operation, usage,
and request metadata makes traces easier to query across providers and tools.

By convention, a GenAI span is named `{gen_ai.operation.name} {gen_ai.request.model}`
(for example, `chat claude-3`), so traces stay readable even when several
providers and models run side by side.

### GenAI semantic-convention attributes

These are the OpenTelemetry GenAI span attributes most relevant to LLM and agent
instrumentation. Provider and model attributes are recommended on every model
span; token usage and `error.type` should be set when available.

| Attribute | Type | Example value | What it captures |
| --- | --- | --- | --- |
| `gen_ai.operation.name` | string | `chat`, `embeddings`, `execute_tool`, `invoke_agent` | The kind of GenAI operation the span represents |
| `gen_ai.provider.name` | string | `anthropic`, `openai`, `aws.bedrock`, `gcp.vertex_ai` | The provider or platform serving the request |
| `gen_ai.request.model` | string | `claude-3` | The model requested by the caller |
| `gen_ai.response.model` | string | `claude-3-0613` | The model that actually produced the response |
| `gen_ai.request.temperature` | double | `0.7` | Sampling temperature requested |
| `gen_ai.request.max_tokens` | int | `100` | Maximum tokens requested for the completion |
| `gen_ai.request.top_p` | double | `1.0` | Nucleus-sampling parameter requested |
| `gen_ai.usage.input_tokens` | int | `100` | Tokens consumed by the prompt/input |
| `gen_ai.usage.output_tokens` | int | `180` | Tokens produced in the completion |
| `gen_ai.response.id` | string | `chatcmpl-123` | Provider-assigned response identifier |
| `gen_ai.response.finish_reasons` | string[] | `["stop"]`, `["length"]` | Why generation stopped |
| `error.type` | string | provider error code or exception name | Set when the operation fails |

For aggregate health, the conventions also define client metrics:
`gen_ai.client.operation.duration` (histogram, unit `s`) for operation latency
and `gen_ai.client.token.usage` (histogram, unit `{token}`) for input/output
token counts. Both carry the operation name, provider name, and request model as
attributes, so latency and token cost can be sliced by model the same way traces
are.

### Instrumenting a model span

The example below creates a model-call span and sets GenAI attributes by their
semantic-convention names. Use your provider SDK in place of `call_model`; the
attribute names stay the same across providers, which is the point of the shared
conventions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.app")

def traced_chat(request_model: str, messages: list) -> dict:
    # Span name follows the GenAI convention: "{operation} {model}".
    with tracer.start_as_current_span(f"chat {request_model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", request_model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)
        span.set_attribute("gen_ai.request.temperature", 0.2)

        try:
            response = call_model(request_model, messages)
        except Exception as exc:
            # Conditionally required when the operation fails.
            span.set_attribute("error.type", type(exc).__name__)
            span.record_exception(exc)
            span.set_status(trace.StatusCode.ERROR)
            raise

        usage = response["usage"]
        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.response.id", response["id"])
        span.set_attribute("gen_ai.response.finish_reasons", [response["stop_reason"]])
        span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])
        return response
```

Note that prompt and completion text are deliberately not placed on the span;
only ids, the model, finish reason, and token counts are recorded. Capture raw
message content only through an approved debug path, per your redaction policy.

## Step-by-Step Implementation Guide

1. **Pick the root span.** Create one root span for each user request,
   conversation turn, background job, or agent task. Put the request id,
   environment, route, tenant or workspace hash, and app version on that span.

2. **Trace the model boundary.** Add child spans around each model call. Capture
   model provider, model name, operation name, prompt version, latency, status,
   retry count, fallback use, and token usage when available.

3. **Trace retrieval and context assembly.** Add spans for vector search,
   database lookups, document fetches, reranking, and prompt assembly. Store
   counts, ids, and scores rather than raw documents unless your privacy policy
   explicitly allows content capture.

4. **Trace tool calls separately.** Each tool call should have its own span with
   tool name, target system, status, latency, retry count, and error class. Link
   the tool span to the model decision that requested it.

5. **Emit operational metrics.** Track request latency, model latency, tool
   latency, error counts, token usage, cost estimates, timeout counts, retry
   counts, fallback counts, and queue depth for async agents.

6. **Use structured logs for decisions.** Log compact events such as
   `tool_selected`, `tool_rejected`, `schema_validation_failed`,
   `retrieval_empty`, `fallback_model_used`, and `human_review_required`. Include
   trace ids so logs and traces can be joined.

7. **Redact before export.** Remove or hash emails, access tokens, file names,
   account ids, raw documents, prompt text, completions, and tool outputs unless
   the team has approved retention for that field.

8. **Sample deliberately.** Keep full traces for errors, timeouts, high-latency
   requests, new releases, and expensive model calls. Sample routine successful
   traffic if volume or privacy risk is high.

9. **Build an incident view.** A useful dashboard answers: which model failed,
   which tool failed, where latency grew, whether retries helped, whether cost
   spiked, and whether a release changed behavior.

## Reusing Existing Agent Telemetry

If your agents run on top of Claude Code, you do not have to instrument the host
from scratch. Claude Code can export OpenTelemetry data directly: set
`CLAUDE_CODE_ENABLE_TELEMETRY=1`, choose exporters with `OTEL_METRICS_EXPORTER`
and `OTEL_LOGS_EXPORTER` (both support `otlp`, `console`, or `none`; metrics also
support `prometheus`), and point `OTEL_EXPORTER_OTLP_ENDPOINT` at your collector.
It emits metrics such as `claude_code.token.usage`, `claude_code.cost.usage`,
`claude_code.session.count`, and `claude_code.tool.execution`, plus
`claude_code.api_error` and `claude_code.api_request` for model-call health. Send
those into the same collector as your own GenAI spans so host usage and your
application traces share one backend.

## Observability Checklist

- [ ] {"task": "Trace coverage", "description": "Requests include model, retrieval, tool, retry, fallback, and response spans"}
- [ ] {"task": "Metrics coverage", "description": "Latency, errors, token usage, cost, retry, and tool outcome metrics exist"}
- [ ] {"task": "Log joins", "description": "Structured logs carry trace ids or request ids"}
- [ ] {"task": "GenAI attributes", "description": "Provider, model, operation, status, and usage fields follow shared semantic names where possible"}
- [ ] {"task": "Redaction boundary", "description": "Sensitive prompt, completion, retrieval, and tool data is removed or explicitly retained"}
- [ ] {"task": "Sampling policy", "description": "Errors and high-risk paths keep enough detail while routine traffic is sampled"}
- [ ] {"task": "Incident dashboard", "description": "Maintainers can diagnose model, retrieval, tool, latency, and cost failures quickly"}

## What to Alert On

Alert on symptoms that a maintainer can act on:

- Model error rate or timeout rate above normal.
- Tool failure rate, validation failures, or repeated retries.
- Retrieval returning empty or low-confidence context for important routes.
- Token usage or cost estimates rising sharply after a release.
- Queue depth, job age, or agent task duration crossing a service target.
- Fallback model usage increasing unexpectedly.

Avoid alerting on every individual model refusal, low-confidence answer, or
sampled trace gap unless it maps to a clear action.

## Troubleshooting

- **Traces are too noisy**: keep the request, model, retrieval, and tool spans,
  then drop internal helper spans that do not explain behavior.
- **Telemetry contains too much user data**: export ids, counts, hashes, scores,
  and prompt versions by default; capture raw content only in approved debug
  paths.
- **Costs are hard to explain**: record model name, token usage, retry count,
  fallback model, and request route on model spans.
- **Tool failures are invisible**: put every external action in its own span and
  log the validation or error class.
- **Sampling hides incidents**: always keep error, timeout, high-cost, and
  high-latency traces, then sample ordinary successful requests.
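
The last item can be sketched as a keep-or-drop decision made once a trace has
finished. The `TraceSummary` fields, the `keep_trace` name, and the default
limits below are assumptions for illustration; in practice this kind of
tail-based decision is often made in a collector rather than in the application.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceSummary:
    had_error: bool
    timed_out: bool
    latency_s: float
    cost_usd: float


def keep_trace(
    t: TraceSummary,
    sample_rate: float = 0.05,
    latency_limit_s: float = 10.0,
    cost_limit_usd: float = 0.50,
    rng: Optional[random.Random] = None,
) -> bool:
    """Always keep error, timeout, slow, or expensive traces; sample the rest."""
    if t.had_error or t.timed_out:
        return True
    if t.latency_s > latency_limit_s or t.cost_usd > cost_limit_usd:
        return True
    rng = rng or random.Random()
    return rng.random() < sample_rate  # ordinary successful request


failed = TraceSummary(had_error=True, timed_out=False, latency_s=0.4, cost_usd=0.01)
routine = TraceSummary(had_error=False, timed_out=False, latency_s=0.4, cost_usd=0.01)

kept_failed = keep_trace(failed, sample_rate=0.0)    # kept regardless of the rate
kept_routine = keep_trace(routine, sample_rate=0.0)  # dropped at a zero sample rate
```

Because the rule needs the final status, latency, and cost, it cannot be applied
when the root span starts; head sampling alone would drop some of the traces this
rule is meant to keep.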

## Duplicate Check

This guide is vendor-neutral and focuses on the observability architecture for
LLM and agent applications. Existing entries cover specific observability,
evaluation, and tracing tools; this guide is distinct because it explains the
signals, spans, metrics, logs, redaction, and sampling strategy that can be used
with those tools.

## References

- OpenTelemetry traces - https://opentelemetry.io/docs/concepts/signals/traces/
- OpenTelemetry metrics - https://opentelemetry.io/docs/concepts/signals/metrics/
- OpenTelemetry logs - https://opentelemetry.io/docs/concepts/signals/logs/
- OpenTelemetry GenAI semantic conventions - https://opentelemetry.io/docs/specs/semconv/gen-ai/
- OpenTelemetry sampling - https://opentelemetry.io/docs/concepts/sampling/
- OpenTelemetry JavaScript instrumentation - https://opentelemetry.io/docs/languages/js/instrumentation/
- OpenTelemetry Python instrumentation - https://opentelemetry.io/docs/languages/python/instrumentation/
- OpenTelemetry documentation home - https://opentelemetry.io/docs/
- Claude Code monitoring and OpenTelemetry export - https://code.claude.com/docs/en/monitoring-usage

#observability #llm #agents #opentelemetry #tracing #metrics #logs
