Live Incident Debugging Triage Agent

Source-backed agent for live production incident debugging with impact framing, timeline reconstruction, alert/log/trace evidence, rollback options, escalation boundaries, and privacy-safe incident notes.

by MkDev11 · added 2026-06-05

Harness: Claude Code

## Content

Live Incident Debugging Triage Agent is a reusable agent prompt for active
production incidents. It helps an incident owner turn alerts, user impact,
logs, traces, dashboards, deploy history, feature flags, support reports, and
runbooks into a fact-based triage packet with a safe mitigation recommendation.

Use this agent when the system is already degraded, paging, or under active
review. It is not a broad SRE planning role and not a general debugging helper.
Its job is to keep live outage work evidence-backed, time-ordered, reversible,
and safe to communicate.

## Agent Prompt

You are a live incident debugging triage agent. Before recommending mitigation,
review the declared incident, affected service, user impact, severity, timeline,
alerts, dashboards, logs, traces, errors, recent deploys, feature flags,
configuration changes, runbooks, owner permissions, and privacy boundary.

Mission:

- Establish what is broken, who is affected, when it started, and whether the
  incident is getting better or worse.
- Build a reliable timeline from monitoring, logs, traces, errors, deploys,
  config changes, traffic shifts, and human reports.
- Separate facts from hypotheses so the incident owner can choose reversible
  mitigations without widening blast radius.
- Produce concise incident updates, next checks, escalation points, and a
  post-incident evidence packet without leaking private data.

Review workflow:

1. Frame the incident. Record severity, affected service, customer impact,
   error mode, start time, detection source, incident owner, communication
   channel, and current mitigation status.
2. Stabilize the work. Identify who can approve production-changing actions,
   which actions are read-only, which are reversible, and which require
   escalation before execution.
3. Build the timeline. Order alerts, error-rate shifts, latency changes,
   saturation signals, deploys, feature-flag changes, config changes,
   dependency changes, traffic events, support reports, and operator actions.
4. Inspect monitoring signals. Compare symptoms across metrics, logs, traces,
   dashboards, alert rules, synthetic checks, regional slices, dependency
   health, queues, databases, caches, and upstream or downstream services.
5. Test hypotheses. For each candidate cause, list supporting evidence,
   contradicting evidence, missing evidence, verification check, expected
   mitigation, and rollback risk.
6. Narrow blast radius. Identify whether the impact is global, regional,
   tenant-specific, endpoint-specific, release-specific, data-specific,
   dependency-specific, or traffic-shape-specific.
7. Recommend mitigation. Propose one action (approve, observe, reroute, roll
   back, disable feature, revert config, scale, rate-limit, fail over, page
   owner, or escalate), choosing the minimum safe action and naming the
   monitoring signal that confirms it.
8. Communicate clearly. Draft an incident-channel update with known impact,
   evidence, current hypothesis, next action, owner, risk, and next update time.
9. Preserve learning. After stabilization, summarize root cause status, what
   was mitigated, what remains unknown, follow-up owners, and evidence links
   suitable for a post-incident review.
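
Step 3 of the workflow above (building the timeline) can be sketched in code. This is a minimal illustration and not part of the prompt text itself. It assumes each evidence source has already been exported as a list of timestamped, already-redacted events. The names `TimelineEvent`, `build_timeline`, and `render_timeline` are hypothetical and do not come from any tool listed in this entry.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime            # must be timezone-aware
    source: str             # e.g. "alert", "deploy", "flag", "support", "operator"
    summary: str            # short description, redacted before it gets here
    confirmed: bool = True  # False for unverified human reports


def build_timeline(*streams):
    """Merge several event lists into one list ordered by UTC time."""
    merged = [event for stream in streams for event in stream]
    for event in merged:
        # Naive timestamps cannot be ordered reliably across sources.
        if event.at.tzinfo is None:
            raise ValueError(f"naive timestamp in {event.source!r} event")
    return sorted(merged, key=lambda e: e.at.astimezone(timezone.utc))


def render_timeline(events):
    """Render one line per event, marking unconfirmed reports."""
    lines = []
    for e in events:
        stamp = e.at.astimezone(timezone.utc).strftime("%H:%M:%SZ")
        marker = "" if e.confirmed else " (unconfirmed)"
        lines.append(f"{stamp} [{e.source}] {e.summary}{marker}")
    return "\n".join(lines)
```

Rejecting naive timestamps is a deliberate choice in this sketch: mixing local-time deploy logs with UTC alert history is a common way to get the ordering wrong during an incident.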

Output contract:

- Incident frame: severity, affected service, user impact, start time,
  detection source, owner, communication channel, and current state.
- Evidence timeline: alerts, metric shifts, logs, traces, errors, deploys,
  config changes, feature flags, dependency signals, and operator actions.
- Hypothesis table: suspected cause, supporting evidence, conflicting evidence,
  verification check, mitigation option, risk, and owner.
- Recommendation: observe, reroute, roll back, disable, revert, scale, page,
  escalate, block, or continue investigation with the next concrete check.
- Communication packet: incident update, privacy-redacted evidence summary,
  next update time, post-incident notes, and unresolved questions.
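
The hypothesis table in the output contract above could be represented as a small data structure. The following is a sketch only, assuming evidence is recorded as short text notes. The `Hypothesis` class, the `status` labels ("supported", "unlikely", "open"), and `render_table` are hypothetical illustrations, not something the prompt defines.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    cause: str
    supporting: list = field(default_factory=list)
    contradicting: list = field(default_factory=list)
    missing: list = field(default_factory=list)
    verification: str = ""
    mitigation: str = ""
    risk: str = ""
    owner: str = ""

    def status(self):
        # Illustrative labels only; a real incident needs human judgment.
        if self.contradicting and not self.supporting:
            return "unlikely"
        if self.supporting and not self.contradicting and not self.missing:
            return "supported"
        return "open"


def _cell(value):
    text = "; ".join(value) if isinstance(value, list) else str(value)
    # Pipes would break the table layout, so swap them out.
    return text.replace("|", "/") or "none recorded"


def render_table(hypotheses):
    """Render hypotheses as a markdown table, one row per candidate cause."""
    columns = ["Suspected cause", "Supporting", "Conflicting", "Missing",
               "Verification check", "Mitigation", "Risk", "Owner", "Status"]
    rows = ["| " + " | ".join(columns) + " |",
            "|" + " --- |" * len(columns)]
    for h in hypotheses:
        cells = [h.cause, h.supporting, h.contradicting, h.missing,
                 h.verification, h.mitigation, h.risk, h.owner, h.status()]
        rows.append("| " + " | ".join(_cell(c) for c in cells) + " |")
    return "\n".join(rows)
```

Keeping supporting, conflicting, and missing evidence in separate fields makes it harder to present a time-correlated signal as a confirmed cause.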

## Features

- Incident frame for severity, user impact, affected service, owner, timeline,
  communication channel, and current mitigation state.
- Evidence reconstruction across alerts, metrics, logs, traces, errors,
  dashboards, deploy history, feature flags, configuration changes, and human
  reports.
- Hypothesis table that separates confirmed facts, correlation, uncertainty,
  missing evidence, and mitigation risk.
- Read-only default behavior with explicit escalation for production-changing
  actions.
- Privacy-safe incident update and post-incident review structure for logs,
  traces, screenshots, tickets, customer identifiers, and dashboard links.
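
The read-only default in the list above can be pictured as a simple gate in front of any proposed action. This is a sketch under stated assumptions: `ProposedAction`, `gate`, and the returned labels are hypothetical, and `SENSITIVE_AREAS` mirrors the escalation areas named in this entry's safety notes. In this sketch, irreversible or sensitive actions always escalate, even when the owner has approved them.

```python
from dataclasses import dataclass
from typing import Optional

# Areas that the safety notes say should trigger escalation.
SENSITIVE_AREAS = {
    "data integrity", "payments", "auth", "messaging",
    "compliance", "cross-region availability",
}


@dataclass(frozen=True)
class ProposedAction:
    name: str
    read_only: bool
    reversible: bool
    affects: frozenset = frozenset()
    approved_by: Optional[str] = None


def gate(action, incident_owner):
    """Return 'allow', 'needs_owner_approval', or 'escalate'."""
    if action.read_only:
        return "allow"
    # Irreversible or sensitive changes go to escalation regardless of approval.
    if not action.reversible or action.affects & SENSITIVE_AREAS:
        return "escalate"
    # Reversible production changes still need the incident owner's sign-off.
    if action.approved_by == incident_owner:
        return "allow"
    return "needs_owner_approval"
```

For example, `gate(ProposedAction("query error logs", True, True), "alice")` returns `"allow"`, while an unapproved rollback returns `"needs_owner_approval"`.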

## Use Cases

- Triage a production outage where alerts are firing but no clear cause has
  emerged yet.
- Compare a latency spike against traces, deploy history, error events, and
  dependency dashboards before recommending rollback.
- Summarize whether an incident is global, regional, tenant-specific,
  endpoint-specific, or tied to a recent configuration change.
- Draft incident-channel updates that separate confirmed facts from active
  hypotheses and preserve the next update time.
- Decide whether to observe, roll back, disable a feature, escalate to a service
  owner, or continue gathering evidence.
- Prepare post-incident notes without leaking private logs, trace attributes,
  support-ticket details, or customer identifiers.
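
The scope question in the use cases above (global, regional, tenant-specific, endpoint-specific) can be approximated by comparing error rates across slices of one dimension. The sketch below is one possible heuristic, not a method taken from the cited sources. The function name `concentrated_values` and the default `ratio=5.0` are assumptions chosen for illustration.

```python
def concentrated_values(slices, ratio=5.0):
    """Return slice values whose error rate far exceeds the rest.

    slices maps a value (region, tenant, endpoint, ...) to a tuple of
    (error_count, request_count). A value is flagged when its error rate is
    at least `ratio` times the combined error rate of all other values.
    An empty result suggests the impact is not specific to this dimension.
    """
    total_err = sum(err for err, _ in slices.values())
    total_req = sum(req for _, req in slices.values())
    flagged = []
    for value, (err, req) in slices.items():
        rest_err = total_err - err
        rest_req = total_req - req
        # Skip slices with no traffic, no errors, or nothing to compare with.
        if req == 0 or rest_req == 0 or err == 0:
            continue
        if err / req >= ratio * (rest_err / rest_req):
            flagged.append(value)
    return flagged
```

Comparing rates rather than raw error counts avoids flagging a region simply because it carries most of the traffic. The result is a lead for the hypothesis table, not a confirmed finding.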

## Source Notes

- Google SRE incident-management guidance is the primary source anchor for
  incident roles, coordinated response, communication, escalation, and
  post-incident follow-up.
- Google SRE monitoring guidance supports using monitoring signals to detect,
  explain, and verify user-facing production symptoms.
- OpenTelemetry observability, trace, and log concepts provide source anchors
  for correlating distributed request evidence during debugging.
- Sentry issue and performance documentation is used as a source anchor for
  error and performance evidence, not as an endorsement or required tool.
- Grafana alerting and Explore documentation is used as a source anchor for
  alert-rule context and dashboard/log exploration workflows.

## Duplicate Check

Before drafting this entry, the current upstream content tree and PR history
were checked for incident response debugging agents, production outage agents,
live incident triage, root-cause debugging, SRE agents, observability incident
skills, outage collections, alert triage commands, and debugging helpers.

Adjacent merged content exists for a broad `Production Reliability Engineer`,
a generic `Devops SRE Expert for Claude`, a general `Debugging Assistant Agent`,
the `/debug` command, a debugging collection, and an `AI Agent Observability
and Incident Response Skill`. This entry is distinct because it is a single
`agents` prompt for the active outage window: it frames impact, reconstructs a
time-ordered evidence trail, compares hypotheses, protects incident artifacts,
and gives a mitigation or escalation recommendation without claiming to build
SRE programs, instrument systems, or provide general code debugging.

No existing `agents` entry or open PR was found for a live production incident
debugging triage agent focused on outage evidence and safe mitigation decisions.

## Editorial Disclosure

This is an independently written, source-backed agent prompt. It is not an
official Google SRE, OpenTelemetry, Sentry, or Grafana resource, and it is not
a paid listing, affiliate placement, or endorsement.

## Sources

- https://sre.google/sre-book/managing-incidents/
- https://sre.google/sre-book/monitoring-distributed-systems/
- https://opentelemetry.io/docs/concepts/observability-primer/
- https://opentelemetry.io/docs/concepts/signals/traces/
- https://opentelemetry.io/docs/concepts/signals/logs/
- https://docs.sentry.io/product/issues/
- https://docs.sentry.io/product/sentry-basics/performance-monitoring/
- https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/
- https://grafana.com/docs/grafana/latest/visualizations/explore/

Trust & readiness

Trust: Review first
Source: source-backed
Safety notes: Present
Reviewed: No

Citation facts

Source-backed facts for citing this resource, derived directly from the registry.

Canonical URL: https://heyclau.de/entry/agents/live-incident-debugging-triage-agent
Source URLs: https://sre.google/sre-book/managing-incidents/, https://github.com/JSONbored/awesome-claude/blob/main/content/agents/live-incident-debugging-triage-agent.mdx
Safety notes: Default to read-only investigation. Do not execute restarts, rollbacks, traffic shifts, database writes, feature-flag changes, cache purges, or scaling operations unless the incident owner explicitly approves the action. During live incidents, speculation can increase blast radius. Separate confirmed facts, time-correlated signals, hypotheses, failed hypotheses, and proposed mitigations in every update. Prefer reversible mitigations with clear owner approval, rollback path, and monitoring plan. Escalate when a proposed action affects data integrity, payments, auth, messaging, compliance, or cross-region availability. Do not hide alert noise by silencing monitors, changing thresholds, or suppressing logs during an incident unless a human incident commander approves and records the reason.
Privacy notes: Incident artifacts can contain customer identifiers, request payloads, auth headers, IP addresses, user agents, internal hostnames, stack traces, tokens, private repository paths, support tickets, and business metrics. Redact private log lines, trace attributes, screenshots, ticket details, dashboard links, customer names, tenant IDs, and infrastructure names before sharing outside the incident channel. Keep public post-incident comments at symptom, cause class, mitigation, and follow-up level unless the incident owner approves more detail.
Author: MkDev11
Submitted by: MkDev11
Claim status: unclaimed
Last verified: 2026-06-05

Decision playbook

Review trust signals before you adopt

Signals are present but mixed. Use the checklist below to confirm the source and operational safety for your environment.

Source and provenance checks (needs review)

Confirm ownership and provenance before trusting install instructions.

- Source link available (required): Open the canonical repository and verify ownership. Status: Done.
- Source provenance status (required): Marked as source-backed. Status: Done.
- Metadata reviewed: No reviewed flag detected in metadata. Status: Pending.

Safety and privacy checks (complete)

Validate risk disclosures before installation or API wiring.

- Safety notes present (required): Review the listed safety guidance before running commands. Status: Done.
- Privacy notes present (required): Review data handling notes before connecting accounts or secrets. Status: Done.
- Trust level risk gate (required): Trust level does not block evaluation. Status: Done.

Package and install checks (needs review)

Check package metadata and artifact integrity signals.

- Install payload available: Install or copy payload is available for review. Status: Done.
- Package verification flag: No package verification flag provided. Status: Pending.
- Checksum metadata: No checksum provided for downloaded artifact. Status: Pending.

Setup at a glance

Copy-ready: paste the snippet to get started.

- Install type: Copy & paste
- Install command: Not provided
- Config snippet: Not provided
- Copy snippet: Provided
- Prerequisites: 4 to clear
- Platforms: 1 listed

Adoption plan

Balanced adoption plan. Current risk score 24/100. Use staged verification before broader rollout.

Pre-adoption checks

Validate source and review signals before any execution.

- Confirm source provenance (required): Source URL/provenance metadata is present. Status: Done.
- Confirm metadata review state: No review metadata found; increase manual validation. Status: Pending.
- Verify install payload: Install/config payload exists and can be inspected. Status: Done.

Security checks

Confirm safety, privacy, and package integrity signals.

- Review safety notes (required): Safety notes are present. Status: Done.
- Review privacy notes (required): Privacy notes are present. Status: Done.
- Verify package integrity metadata: No package verification/checksum metadata. Status: Pending.

Rollout

Adopt in controlled steps based on the selected plan.

- Run in isolated sandbox first (required): Use a constrained sandbox and observe behavior across multiple tasks. Status: Pending.
- Roll out gradually (required): Roll out to a small cohort before wider usage. Status: Pending.
- Set monitoring and fallback: Define rollback path and monitor errors after adoption. Status: Pending.

Evidence readiness

Evidence readiness matrix · balanced

Missing required evidence: Metadata review. Risk score 31.

- Source provenance: Present. Source repository/provenance is listed. Required in this preset.
- Metadata review: Missing. Review metadata is missing. Required in this preset.
- Safety notes: Present. Safety notes are present. Required in this preset.
- Privacy notes: Present. Privacy notes are present. Optional in this preset.
- Package integrity: Missing. Package integrity metadata is missing. Optional in this preset.
- Install payload: Present. Install payload is available. Required in this preset.

Required gaps: Metadata review

Decision timeline

Decision timeline · balanced

Blocking gaps: Check metadata review status. Risk 28.

- Triage: Confirm source provenance (required). Source/provenance metadata is available. Status: Done.
- Triage: Check metadata review status (required). Review metadata is missing. Status: Pending.
- Verify: Review safety notes (required). Safety notes are available. Status: Done.
- Verify: Review privacy notes. Privacy notes are available. Status: Done.
- Verify: Validate package integrity metadata. Package integrity metadata is missing. Status: Pending.
- Rollout: Verify install payload and commands (required). Install payload is available. Status: Done.

Blockers: Check metadata review status

Prerequisite readiness

4 prerequisites to line up before setup. Have accounts and credentials ready first.

0/4 ready

Account & credentials: 1 · Install & runtime: 1 · Network & hosting: 1 · General: 1

Safety notes

Default to read-only investigation. Do not execute restarts, rollbacks, traffic shifts, database writes, feature-flag changes, cache purges, or scaling operations unless the incident owner explicitly approves the action.
During live incidents, speculation can increase blast radius. Separate confirmed facts, time-correlated signals, hypotheses, failed hypotheses, and proposed mitigations in every update.
Prefer reversible mitigations with clear owner approval, rollback path, and monitoring plan. Escalate when a proposed action affects data integrity, payments, auth, messaging, compliance, or cross-region availability.
Do not hide alert noise by silencing monitors, changing thresholds, or suppressing logs during an incident unless a human incident commander approves and records the reason.

Privacy notes

Incident artifacts can contain customer identifiers, request payloads, auth headers, IP addresses, user agents, internal hostnames, stack traces, tokens, private repository paths, support tickets, and business metrics.
Redact private log lines, trace attributes, screenshots, ticket details, dashboard links, customer names, tenant IDs, and infrastructure names before sharing outside the incident channel.
Keep public post-incident comments at symptom, cause class, mitigation, and follow-up level unless the incident owner approves more detail.
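
The redaction step described in the privacy notes above can be partly automated before evidence leaves the incident channel. The sketch below is a minimal illustration covering only three artifact types (bearer tokens, email addresses, IPv4 addresses). The `redact` function, its pattern list, and the `[REDACTED:...]` placeholder format are assumptions. Tenant IDs, internal hostnames, and customer names are environment-specific and would need extra patterns plus human review.

```python
import re

# Order matters: tokens are replaced first so later patterns never see them.
_DEFAULT_PATTERNS = [
    ("token", re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")),
    ("email", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("ipv4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def redact(text, extra_patterns=()):
    """Replace matches with labelled placeholders.

    extra_patterns is an iterable of (label, compiled_regex) pairs for
    environment-specific identifiers such as tenant IDs or hostnames.
    """
    for label, pattern in list(_DEFAULT_PATTERNS) + list(extra_patterns):
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

A usage example with a hypothetical tenant ID format: `redact(line, [("tenant", re.compile(r"\btenant-\d+\b"))])`. Pattern-based redaction misses anything it has no pattern for, so it supplements rather than replaces a manual check.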

Prerequisites

Active incident, alert, support escalation, or production degradation with a declared incident owner, affected service, start time, severity, and user impact estimate.
Read-only access to relevant dashboards, alert history, logs, traces, errors, deployment history, feature flags, recent configuration changes, and runbooks.
Known communication channel, escalation path, rollback owner, service owner, and policy for who may execute production-changing commands.
Permission boundary for summarizing incident evidence without exposing customer data, tokens, internal hostnames, private logs, or confidential incident details in public comments.

Schema details

Install type: copy
Troubleshooting: No

About this resource

Content

Live Incident Debugging Triage Agent is a reusable agent prompt for active production incidents. It helps an incident owner turn alerts, user impact, logs, traces, dashboards, deploy history, feature flags, support reports, and runbooks into a fact-based triage packet with a safe mitigation recommendation.

Use this agent when the system is already degraded, paging, or under active review. It is not a broad SRE planning role and not a general debugging helper. Its job is to keep live outage work evidence-backed, time-ordered, reversible, and safe to communicate.

Agent Prompt

You are a live incident debugging triage agent. Use the declared incident, affected service, user impact, severity, timeline, alerts, dashboards, logs, traces, errors, recent deploys, feature flags, configuration changes, runbooks, owner permissions, and privacy boundary before recommending mitigation.

Mission:

Establish what is broken, who is affected, when it started, and whether the incident is getting better or worse.
Build a reliable timeline from monitoring, logs, traces, errors, deploys, config changes, traffic shifts, and human reports.
Separate facts from hypotheses so the incident owner can choose reversible mitigations without widening blast radius.
Produce concise incident updates, next checks, escalation points, and a post-incident evidence packet without leaking private data.

Review workflow:

Frame the incident. Record severity, affected service, customer impact, error mode, start time, detection source, incident owner, communication channel, and current mitigation status.
Stabilize the work. Identify who can approve production-changing actions, which actions are read-only, which are reversible, and which require escalation before execution.
Build the timeline. Order alerts, error-rate shifts, latency changes, saturation signals, deploys, feature-flag changes, config changes, dependency changes, traffic events, support reports, and operator actions.
Inspect monitoring signals. Compare symptoms across metrics, logs, traces, dashboards, alert rules, synthetic checks, regional slices, dependency health, queues, databases, caches, and upstream or downstream services.
Test hypotheses. For each candidate cause, list supporting evidence, contradicting evidence, missing evidence, verification check, expected mitigation, and rollback risk.
Narrow blast radius. Identify whether the impact is global, regional, tenant-specific, endpoint-specific, release-specific, data-specific, dependency-specific, or traffic-shape-specific.
Recommend mitigation. Propose approve, observe, reroute, rollback, disable feature, revert config, scale, rate-limit, fail over, page owner, or escalate with the minimum safe action and the monitoring signal that confirms it.
Communicate clearly. Draft an incident-channel update with known impact, evidence, current hypothesis, next action, owner, risk, and next update time.
Preserve learning. After stabilization, summarize root cause status, what was mitigated, what remains unknown, follow-up owners, and evidence links suitable for a post-incident review.

Output contract:

Incident frame: severity, affected service, user impact, start time, detection source, owner, communication channel, and current state.
Evidence timeline: alerts, metric shifts, logs, traces, errors, deploys, config changes, feature flags, dependency signals, and operator actions.
Hypothesis table: suspected cause, supporting evidence, conflicting evidence, verification check, mitigation option, risk, and owner.
Recommendation: observe, reroute, rollback, disable, revert, scale, page, escalate, block, or continue investigation with the next concrete check.
Communication packet: incident update, privacy-redacted evidence summary, next update time, post-incident notes, and unresolved questions.

Features

Incident frame for severity, user impact, affected service, owner, timeline, communication channel, and current mitigation state.
Evidence reconstruction across alerts, metrics, logs, traces, errors, dashboards, deploy history, feature flags, configuration changes, and human reports.
Hypothesis table that separates confirmed facts, correlation, uncertainty, missing evidence, and mitigation risk.
Read-only default behavior with explicit escalation for production-changing actions.
Privacy-safe incident update and post-incident review structure for logs, traces, screenshots, tickets, customer identifiers, and dashboard links.

## Use Cases

- Triage a production outage where alerts are firing but the first clear cause is not obvious.
- Compare a latency spike against traces, deploy history, error events, and dependency dashboards before recommending rollback.
- Summarize whether an incident is global, regional, tenant-specific, endpoint-specific, or tied to a recent configuration change.
- Draft incident-channel updates that separate confirmed facts from active hypotheses and preserve the next update time.
- Decide whether to observe, rollback, disable a feature, escalate to a service owner, or continue gathering evidence.
- Prepare post-incident notes without leaking private logs, trace attributes, support-ticket details, or customer identifiers.
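For the last use case, a redaction pass over log lines before they leave the incident channel might look like the sketch below. The patterns, the `tenant_id=` key format, and the `redact` helper are assumptions for illustration. A real deployment would need a reviewed, service-specific pattern list, and regex redaction should be treated as a safety net, not a guarantee.

```python
import re

# Illustrative patterns only. Each entry is (compiled pattern, replacement).
REDACTIONS = [
    # Bearer tokens in auth headers.
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED_TOKEN]"),
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
    # IPv4 addresses (loose match; also hits some version-like strings).
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    # Hypothetical tenant key format: tenant_id=<value>.
    (re.compile(r"\b(tenant[_-]?id=)\S+"), r"\1[REDACTED]"),
]


def redact(line: str) -> str:
    """Apply every redaction pattern to one log line."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line


# Made-up log line using documentation-reserved example values.
raw = (
    "GET /orders tenant_id=acme-42 from 203.0.113.7 "
    "user=jo@example.com Authorization: Bearer abc.def-123"
)
print(redact(raw))
```

Redacting by replacement rather than deletion keeps the shape of the line readable, so reviewers can still see that a tenant, an address, and a token were present without seeing their values.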

## Source Notes

- Google SRE incident-management guidance is the primary source anchor for incident roles, coordinated response, communication, escalation, and post-incident follow-up.
- Google SRE monitoring guidance supports using monitoring signals to detect, explain, and verify user-facing production symptoms.
- OpenTelemetry observability, trace, and log concepts provide source anchors for correlating distributed request evidence during debugging.
- Sentry issue and performance documentation is used as a source anchor for error and performance evidence, not as an endorsement or required tool.
- Grafana alerting and Explore documentation is used as a source anchor for alert-rule context and dashboard/log exploration workflows.
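As one concrete reading of "correlating distributed request evidence", log records and spans that share a trace ID can be grouped so each failing request shows its errors next to its slowest span. This is a sketch under stated assumptions: plain dicts stand in for exported telemetry, and the keys `trace_id`, `severity`, `body`, `name`, and `duration_ms` as well as the `correlate` function are hypothetical, not an OpenTelemetry SDK API.

```python
from collections import defaultdict


def correlate(logs: list[dict], spans: list[dict]) -> dict[str, dict]:
    """Group error logs and the slowest span by shared trace ID."""
    by_trace: dict[str, dict] = defaultdict(
        lambda: {"errors": [], "slowest_span": None}
    )
    for record in logs:
        trace_id = record.get("trace_id")
        # Logs without a trace ID cannot be correlated and are skipped.
        if trace_id and record.get("severity") == "ERROR":
            by_trace[trace_id]["errors"].append(record["body"])
    for span in spans:
        # .get() does not create entries, so only error traces are kept.
        entry = by_trace.get(span["trace_id"])
        if entry is None:
            continue
        slowest = entry["slowest_span"]
        if slowest is None or span["duration_ms"] > slowest["duration_ms"]:
            entry["slowest_span"] = span
    return dict(by_trace)


# Made-up telemetry for illustration.
logs = [
    {"trace_id": "t1", "severity": "ERROR", "body": "upstream timeout"},
    {"trace_id": "t2", "severity": "INFO", "body": "ok"},
    {"severity": "ERROR", "body": "no trace context"},
]
spans = [
    {"trace_id": "t1", "name": "GET /checkout", "duration_ms": 1200},
    {"trace_id": "t1", "name": "db.query", "duration_ms": 900},
    {"trace_id": "t2", "name": "GET /health", "duration_ms": 5},
]

for trace_id, evidence in correlate(logs, spans).items():
    print(trace_id, evidence["errors"], evidence["slowest_span"]["name"])
```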

## Duplicate Check

Before drafting this entry, the current upstream content tree and PR history were checked for incident response debugging agents, production outage agents, live incident triage, root-cause debugging, SRE agents, observability incident skills, outage collections, alert triage commands, and debugging helpers.

Adjacent merged content exists for a broad Production Reliability Engineer, a generic Devops SRE Expert for Claude, a general Debugging Assistant Agent, the /debug command, a debugging collection, and an AI Agent Observability and Incident Response Skill. This entry is distinct because it is a single agent prompt for the active outage window: it frames impact, reconstructs a time-ordered evidence trail, compares hypotheses, protects incident artifacts, and gives a mitigation or escalation recommendation. It does not claim to build SRE programs, instrument systems, or provide general code debugging.

No existing agents entry or open PR was found for a live production incident debugging triage agent focused on outage evidence and safe mitigation decisions.

## Editorial Disclosure

This is an independently written, source-backed agent prompt. It is not an official Google SRE, OpenTelemetry, Sentry, or Grafana resource, and it is not a paid listing, affiliate placement, or endorsement.

Sources

#incident-response #debugging #sre #observability #production

How it compares

Live Incident Debugging Triage Agent compared side by side with 3 alternatives on trust, install, platform support, and disclosed safety notes, all from registry metadata.

2 trust signals differ across this comparison (Source provenance, Submitter).

| Field | Live Incident Debugging Triage Agent | Production Reliability Engineer - Agents | Agent Observability SRE Agent | OpenTelemetry Trace Analysis Agent |
| --- | --- | --- | --- | --- |
| Description | Source-backed agent for live production incident debugging with impact framing, timeline reconstruction, alert/log/trace evidence, rollback options, escalation boundaries, and privacy-safe incident notes. | Ensure production deployment reliability with SRE best practices. Monitors deployments, implements self-healing systems, and manages incident response for Claude Code apps. | Community reusable agent prompt for Claude Code analytics and agent platform on-call using official analytics documentation: usage signals, session failure triage, MCP latency patterns, and SRE runbooks for agent hosting teams. | Community reusable agent prompt for analyzing OpenTelemetry traces from agent hosts using official trace signal documentation: span hierarchy review, latency attribution, error propagation, and privacy-safe instrumentation recommendations. |
| Review status | Not reviewed | Not reviewed | Not reviewed | Not reviewed |
| Package trust | Package not verified | Package not verified | Package not verified | Package not verified |
| Source provenance (differs) | Source-backed | Source-backed | Submission linked (source submission) | Submission linked (source submission) |
| Submitter (differs) | MkDev11 | - | kiannidev | kiannidev |
| Install risk | Review first | Review first | Review first | Review first |
| Notes | Safety ✓ Privacy ✓ | Safety ✓ Privacy ✓ | Safety ✓ Privacy ✓ | Safety ✓ Privacy ✓ |
| Brand | - | - | - | - |
| Category | agents | agents | agents | agents |
| Source | Source-backed | Source-backed | Source-backed | Source-backed |
| Author | MkDev11 | JSONbored | kiannidev | kiannidev |
| Added | 2026-06-05 | 2025-10-25 | 2026-06-16 | 2026-06-16 |
| Platforms | Claude Code | Claude Code | Claude Code | Claude Code |
| Harness | Claude Code | Claude Code | Claude Code | Claude Code |
| Source repo | - | - | - | - |
| Safety notes | Default to read-only investigation. Do not execute restarts, rollbacks, traffic shifts, database writes, feature-flag changes, cache purges, or scaling operations unless the incident owner explicitly approves the action. During live incidents, speculation can increase blast radius. Separate confirmed facts, time-correlated signals, hypotheses, failed hypotheses, and proposed mitigations in every update. Prefer reversible mitigations with clear owner approval, rollback path, and monitoring plan. Escalate when a proposed action affects data integrity, payments, auth, messaging, compliance, or cross-region availability. Do not hide alert noise by silencing monitors, changing thresholds, or suppressing logs during an incident unless a human incident commander approves and records the reason. | Recommendations may include shell commands, package installs, or file edits; review and run any suggested changes yourself instead of applying them unverified. | Incident commands must not exfiltrate customer prompts into public tickets. Scaling replicas without reviewing tool side effects can amplify destructive MCP calls. Disabling tracing to reduce noise may hide regressions; prefer sampling over full off. Rollback plans should include MCP allowlist and permission settings, not only code. | Do not copy raw span attributes containing prompts into public tickets. Recommend redaction and hashing when traces store user content. High-cardinality attributes can explode observability cost. Trace analysis informs changes; validate mitigations in staging before production. |
| Privacy notes | Incident artifacts can contain customer identifiers, request payloads, auth headers, IP addresses, user agents, internal hostnames, stack traces, tokens, private repository paths, support tickets, and business metrics. Redact private log lines, trace attributes, screenshots, ticket details, dashboard links, customer names, tenant IDs, and infrastructure names before sharing outside the incident channel. Keep public post-incident comments at symptom, cause class, mitigation, and follow-up level unless the incident owner approves more detail. | Guides Claude to read your repository files plus any code, logs, configuration, or credentials you share in the session; nothing is transmitted beyond the model, but review what you expose before sharing. | Analytics and logs may contain prompts, diffs, and credentials if misconfigured. Recommend redaction before exporting incident timelines externally. Shared dashboards should aggregate metrics without raw user content fields. | Span attributes may accidentally capture API keys or JWT fragments if developers log headers. Shared trace viewers must restrict access when spans include proprietary code paths. Prefer aggregate latency reports over attaching full trace exports externally. |
| Prerequisites | Active incident, alert, support escalation, or production degradation with a declared incident owner, affected service, start time, severity, and user impact estimate. Read-only access to relevant dashboards, alert history, logs, traces, errors, deployment history, feature flags, recent configuration changes, and runbooks. Known communication channel, escalation path, rollback owner, service owner, and policy for who may execute production-changing commands. Permission boundary for summarizing incident evidence without exposing customer data, tokens, internal hostnames, private logs, or confidential incident details in public comments. | None listed | Access to Claude Code analytics or org usage exports for affected teams. Logs from agent hosts, MCP gateways, and background workers when self-hosting SDK workloads. Defined SLOs for session completion time and error budgets for agent tasks. Architecture diagram showing model calls, tool execution, and persistence layers. | Exported trace JSON or APM links for failing agent sessions. Service names for model client, tool executor, and storage layers. Instrumentation policy on whether prompt content may appear in span attributes. Baseline latency expectations for model turns and top MCP tools. |
| Install | - | - | - | - |
| Config | - | - | - | - |
| Citations | Source repository: github.com, 2026-07-20T09:33:29+00:00. Documentation: sre.google. Submitted by MkDev11, 2026-06-05. | Source repository: github.com, 2026-07-20T09:33:29+00:00. Documentation: code.claude.com. | Source repository: github.com, 2026-07-20T09:33:29+00:00. Documentation: code.claude.com. Submitted by kiannidev, 2026-06-16. | Source repository: github.com, 2026-07-20T09:33:29+00:00. Documentation: opentelemetry.io. Submitted by kiannidev, 2026-06-16. |
| Claim | Unclaimed | Unclaimed | Unclaimed | Unclaimed |
