Agent Observability SRE Agent
A community-authored, reusable agent prompt for Claude Code analytics and agent-platform on-call work. It draws on official analytics documentation to cover usage signals, session failure triage, MCP latency patterns, and SRE runbooks for agent hosting teams.
Open the source and read safety notes before installing.
Safety notes
- Incident commands must not exfiltrate customer prompts into public tickets.
- Scaling replicas without reviewing tool side effects can amplify destructive MCP calls.
- Disabling tracing to reduce noise may hide regressions; prefer sampling over turning tracing off entirely.
- Rollback plans should include MCP allowlist and permission settings, not only code.
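The tracing note above prefers sampling to switching tracing off. A minimal sketch of deterministic per-session sampling follows, assuming each session carries a string identifier. The function name `should_trace` and the 10% default rate are illustrative assumptions, not part of any Claude Code or OpenTelemetry API.

```python
import hashlib


def should_trace(session_id: str, sample_rate: float = 0.10) -> bool:
    """Decide deterministically whether to keep traces for a session.

    Hashing the (hypothetical) session ID gives every span in one session
    the same decision, so sampled sessions stay complete end to end.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Keep the top 53 bits so the bucket is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < sample_rate
```

Lowering `sample_rate` during a noisy incident reduces volume while still leaving a slice of sessions fully traced.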
Privacy notes
- Analytics and logs may contain prompts, diffs, and credentials if logging is misconfigured.
- Redact incident timelines before exporting them externally.
- Shared dashboards should aggregate metrics without raw user content fields.
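The redaction step above can be sketched as a small filter applied to each timeline line before export. The patterns below are illustrative assumptions only; a real deployment would need a reviewed and much broader set, and `redact` is a hypothetical helper name.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive secret scanner).
# The Bearer pattern runs first so the token after "Bearer" is always masked.
_PATTERNS = [
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
    (re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\s*[:=]\s*\S+"),
     r"\1=[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
]


def redact(line: str) -> str:
    """Mask emails, bearer tokens, and key=value style secrets in one log line."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    print(redact("user alice@example.com retried with api_key=sk-12345"))
```

Pattern-based redaction only catches what the patterns anticipate, so it complements, rather than replaces, keeping raw prompt content out of exported fields in the first place.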
Prerequisites
- Access to Claude Code analytics or org usage exports for affected teams.
- Logs from agent hosts, MCP gateways, and background workers when self-hosting SDK workloads.
- Defined SLOs for session completion time and error budgets for agent tasks.
- Architecture diagram showing model calls, tool execution, and persistence layers.
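The SLO and error budget prerequisite above comes down to simple arithmetic. A minimal sketch, assuming the SLO is expressed as a target success ratio over a counting window; the function name and the example numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_tasks: int, failed_tasks: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the required success ratio, e.g. 0.99 for "99% of agent
    tasks complete successfully" (an assumed SLO shape for this sketch).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    if failed_tasks < 0:
        raise ValueError("failed_tasks must not be negative")
    allowed_failures = (1.0 - slo_target) * total_tasks
    return 1.0 - failed_tasks / allowed_failures


if __name__ == "__main__":
    # 99% target over 10,000 tasks allows about 100 failures; 40 are spent.
    print(round(error_budget_remaining(0.99, 10_000, 40), 3))
```

A negative result means the window has already exceeded its budget, which is usually the point at which risky changes are paused.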
Schema details
- Install type: copy
- Troubleshooting: No
- Scope
- Source repo
Full copyable content
## Content
Agent Observability SRE Agent is a community-authored reusable prompt for on-call
teams operating Claude Code at scale. It applies official Claude Code analytics
documentation; it is not an official Anthropic SRE service.
## Scope Note
This prompt operationalizes documented analytics and usage signals from code.claude.com.
Self-hosted Agent SDK observability may require additional OpenTelemetry instrumentation.
## Agent Prompt
You are an agent observability SRE for Claude Code deployments. Triage production
degradation using official analytics documentation and structured runbooks.
Workflow:
1. **Symptom framing.** Capture user impact: stuck sessions, failed tools, elevated cost, SLA misses.
2. **Analytics review.** Interpret usage, model mix, and session patterns from documented analytics surfaces.
3. **Dependency check.** Identify failing MCP connectors, OAuth expiry, or quota limits.
4. **Context pressure.** Look for compaction loops or oversized tool results in long sessions.
5. **Mitigation.** Propose safe mitigations: disable flaky tools, reduce concurrency, adjust timeouts.
6. **Communication.** Draft status updates without leaking proprietary prompt content.
7. **Post-incident.** List dashboards, alerts, and permission hardening follow-ups.
Output contract:
- Incident summary with blast radius hypothesis.
- Analytics evidence and top contributing signals.
- Immediate mitigations ranked by risk.
- Post-incident action items with owners.
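
For teams that post-process the agent's answer, the output contract above could be mirrored in a small data structure. This is a sketch under stated assumptions: the class names, field names, and the 1 to 5 risk scale are hypothetical, and ordering mitigations lowest risk first is one possible reading of "ranked by risk".

```python
from dataclasses import dataclass, field


@dataclass
class Mitigation:
    action: str
    risk: int  # hypothetical scale: 1 = lowest risk, 5 = highest


@dataclass
class IncidentReport:
    summary: str
    blast_radius_hypothesis: str
    evidence: list[str] = field(default_factory=list)
    mitigations: list[Mitigation] = field(default_factory=list)
    action_items: dict[str, str] = field(default_factory=dict)  # item -> owner

    def ranked_mitigations(self) -> list[Mitigation]:
        """Return mitigations ordered from lowest to highest risk."""
        return sorted(self.mitigations, key=lambda m: m.risk)

    def unowned_action_items(self) -> list[str]:
        """List action items that still lack an owner."""
        return [item for item, owner in self.action_items.items() if not owner.strip()]
```

Checking `unowned_action_items()` before closing an incident is a cheap way to enforce the "with owners" part of the contract.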
## Features
- Maps official analytics docs to on-call triage workflows.
- Correlates MCP and session failures with documented usage signals.
- Separates Anthropic-side issues from connector or host misconfiguration.
- Produces runbook-ready mitigation steps.
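
One way to picture the correlation feature above is to group failed tool-call events by connector and rank them. The event shape used here (`connector` and `status` keys) is a hypothetical assumption for illustration; it is not a documented Claude Code analytics schema.

```python
from collections import Counter


def failures_by_connector(events: list[dict]) -> list[tuple[str, int]]:
    """Count error events per connector, most frequent first.

    Each event is assumed to be a dict with hypothetical keys
    "connector" (str) and "status" ("ok" or "error").
    """
    counts = Counter(
        event["connector"] for event in events if event.get("status") == "error"
    )
    return counts.most_common()
```

If one connector dominates the ranking, the incident is more likely a connector or host problem than a platform-wide one, which matches the separation the features list describes.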
## Use Cases
- On-call when agent tasks stall at MCP tool calls org-wide.
- Investigate analytics spikes after enabling a new connector.
- Design dashboards before enterprise Claude Code rollout.
- Support post-mortems after cost or reliability incidents.
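
For the spike investigation use case above, a simple baseline comparison is often enough to decide whether a change in usage deserves attention. The sketch below uses a z-score against recent daily counts; the function name, the threshold of 3, and the sample numbers are assumptions, and real usage data may need a seasonality-aware method instead.

```python
from statistics import mean, stdev


def is_spike(baseline: list[float], observed: float, threshold: float = 3.0) -> bool:
    """Flag `observed` if it sits more than `threshold` standard deviations
    above the mean of the baseline window (e.g. daily session counts
    from before the connector was enabled)."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two data points")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any increase counts as a spike.
        return observed > mu
    return (observed - mu) / sigma > threshold


if __name__ == "__main__":
    before_connector = [100, 110, 90, 105, 95]
    print(is_spike(before_connector, 180))
```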
## Source Notes
Verified against Claude Code analytics documentation on **2026-06-16**:
- Official docs describe analytics surfaces for understanding Claude Code adoption,
usage patterns, and operational signals at team or org scope.
- Analytics guidance helps leaders detect anomalies such as sudden model mix shifts or
usage spikes that may indicate misconfigured automation.
- Analytics complements costs and security documentation for holistic platform operations.
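
The "sudden model mix shift" mentioned above can be quantified once per-model request counts are available for two periods. A minimal sketch using total variation distance follows; the metric choice, the function name, and the placeholder model names are assumptions, not something the analytics documentation prescribes.

```python
def model_mix_shift(before: dict[str, int], after: dict[str, int]) -> float:
    """Total variation distance between two per-model request distributions.

    Returns 0.0 when the mix is unchanged and 1.0 when the two periods
    share no models at all.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before <= 0 or total_after <= 0:
        raise ValueError("both periods need at least one request")
    models = set(before) | set(after)
    return 0.5 * sum(
        abs(before.get(m, 0) / total_before - after.get(m, 0) / total_after)
        for m in models
    )


if __name__ == "__main__":
    # Placeholder model names; counts are made up for illustration.
    last_week = {"model-a": 80, "model-b": 20}
    this_week = {"model-a": 50, "model-b": 50}
    print(round(model_mix_shift(last_week, this_week), 2))
```

A team could alert when the distance crosses a chosen threshold, then check whether a misconfigured automation changed its default model.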
## Duplicate Check
Checked content/agents for observability coverage.
live-incident-debugging-triage-agent covers general incidents with OpenTelemetry references.
No agents entry applies official Claude Code analytics documentation to SRE on-call
workflows with MCP and session failure triage.
## Editorial Disclosure
Submitted as an independent community agent entry by kiannidev, based on public Claude
Code analytics documentation and the public anthropics/claude-code repository.
No paid placement, referral, or affiliate relationship.
## Sources
- Claude Code analytics - https://code.claude.com/docs/en/analytics
- Claude Code costs - https://code.claude.com/docs/en/costs
- Claude Code repository - https://github.com/anthropics/claude-code
How it compares
Agent Observability SRE Agent side by side with 3 alternatives on trust, install, platform support, and disclosed safety notes — all from reviewed registry metadata.
| Field | Agent Observability SRE Agent | Claude Code Analytics Adoption Capability Pack | Claude Code Zero Data Retention Review Capability Pack | Claude Code Champion Kit Rollout Capability Pack |
|---|---|---|---|---|
| Description | Community reusable agent prompt for Claude Code analytics and agent platform on-call using official analytics documentation: usage signals, session failure triage, MCP latency patterns, and SRE runbooks for agent hosting teams. | Expert Claude Code analytics adoption capability pack for enabling team and enterprise dashboards, GitHub contribution metrics, adoption tracking, ROI reporting, and OpenTelemetry complements with source-backed rollout steps. | Expert Claude Code zero data retention review capability pack for auditing ZDR scope on Claude for Enterprise, disabled features, model availability, analytics limits, and third-party integration gaps before rollout. | Expert Claude Code champion kit rollout capability pack for selecting advocates, localizing kit assets, running enablement sessions, and measuring adoption with source-backed rollout checklists aligned to official champion kit docs. |
| Trust | | | | |
| Install risk | Review first | Review first | Review first | Review first |
| Notes | Safety ✓ Privacy ✓ | Safety ✓ Privacy ✓ | Safety ✓ Privacy ✓ | Safety ✓ Privacy ✓ |
| Category | agents | skills | skills | skills |
| Source | source-backed | source-backed | source-backed | source-backed |
| Author | kiannidev | kiannidev | kiannidev | kiannidev |
| Added | 2026-06-16 | 2026-06-13 | 2026-06-13 | 2026-06-15 |
| Platforms | Claude Code | Claude Code, Codex, Windsurf, Gemini, Cursor, CLI | Claude Code, Codex, Windsurf, Gemini, Cursor, CLI | Claude Code, Codex, Windsurf, Gemini, Cursor, CLI |
| Source repo | — | — | — | — |
| Safety notes | ✓ Incident commands must not exfiltrate customer prompts into public tickets. Scaling replicas without reviewing tool side effects can amplify destructive MCP calls. Disabling tracing to reduce noise may hide regressions; prefer sampling over turning tracing off entirely. Rollback plans should include MCP allowlist and permission settings, not only code. | ✓ This skill recommends analytics enablement steps; it must not toggle admin settings or install GitHub apps without explicit owner approval. Contribution metrics are conservative underestimates and should not be treated as exact productivity scores for individuals. Leaderboards and CSV exports can create unintended performance pressure; align rollout with HR and management policy first. Zero Data Retention organizations cannot use GitHub contribution metrics; usage metrics only. Console spend figures are estimates; use billing pages for actual costs. | ✓ This skill summarizes official ZDR scope; it must not claim ZDR covers chat on claude.ai, Cowork, third-party MCP servers, or Bedrock, Vertex, or Foundry routes. ZDR is enabled per organization; new organizations require separate enablement by the Anthropic account team. Disabled features such as Claude Code on the Web, Desktop cloud sessions, and `/feedback` are blocked at the backend regardless of client UI. Claude Fable 5 is unavailable under ZDR; the `best` alias resolves to Opus for ZDR organizations instead. Policy-violation sessions may still be retained for up to two years even when ZDR is enabled. | ✓ This skill plans champion programs; it must not change admin settings or install integrations without owner approval. Champion programs can create informal performance pressure; align incentives and messaging with HR policy. Do not task champions with bypassing security, MCP, or managed policy controls. Pilot exercises should use non-production repos unless explicitly approved. |
| Privacy notes | ✓ Analytics and logs may contain prompts, diffs, and credentials if misconfigured. Recommend redaction before exporting incident timelines externally. Shared dashboards should aggregate metrics without raw user content fields. | ✓ Analytics dashboards expose account emails, usage patterns, leaderboard rankings, and per-user spend or line counts depending on plan. GitHub contribution metrics analyze merged PR diffs and Claude Code session activity within attribution windows; confirm code and identity visibility with security review. CSV exports include all users, not just the top ten shown in the dashboard UI. OpenTelemetry exports can replicate usage events into customer observability systems and need retention and access-control review. | ✓ ZDR review discussions often involve account emails, organization names, contract terms, and security architecture details that should stay in internal channels. Claude Code Analytics under ZDR does not store prompts or model responses but still collects productivity metadata such as account emails and usage statistics. Administrative data such as seat assignments and audit logs follow standard retention policies and remain in scope for compliance review. Third-party MCP servers, local logs, hooks, and customer-managed observability stacks are outside Anthropic ZDR and need separate review. | ✓ Champion rosters and adoption metrics can expose team structure and individual usage patterns. Office hours notes may contain unreleased product feedback and internal project details. Analytics dashboards differ by plan; avoid sharing per-user stats broadly without admin approval. Localized kit assets may include internal wiki links that should not leave the organization. |
| Prerequisites | | | | |
| Install | — | — | — | — |
| Config | — | — | — | — |
| Citations | | | | |
| Claim | Unclaimed | Unclaimed | Unclaimed | Unclaimed |