LLM observability · tools · 14 picks

Best LLM observability tools

Observability and tracing platforms for LLM and agent applications — traces, metrics, prompts, and evaluation.

Curated by @heyclaude-editors · Updated 2026-06-19

Compared at a glance

The top 5 picks side by side on trust, install, platform support, and disclosed notes — full rationale for each below.

Field	LangSmith: Observability, evaluation, tracing, and testing platform for LLM applications and agent workflows.	Evidently: Open-source ML and LLM observability framework for evaluating, testing, and monitoring data quality, drift, model behavior, and AI application outputs.	Arize Phoenix: Open-source observability and evaluation tooling for LLM applications, traces, datasets, and experiments.	AgentOps: Open-source observability platform and SDK for tracing, debugging, replaying, and cost-monitoring AI agent and LLM application runs.	DeepEval: Open-source Python framework for unit-testing LLM applications, agents, RAG pipelines, metrics, regression suites, and traces.
Trust
Install risk	Review first	Review first	Review first	Review first	Review first
Notes	Safety · Privacy ✓	Safety ✓ Privacy ✓	Safety · Privacy ·	Safety ✓ Privacy ✓	Safety ✓ Privacy ✓
Category	tools	tools	tools	tools	tools
Source	source-backed	source-backed	source-backed	source-backed	source-backed
Author	LangChain	Evidently AI	Arize AI	AgentOps	Confident AI
Added	2026-04-27	2026-06-03	2026-04-27	2026-06-03	2026-06-03
Platforms	CLI	CLI	CLI	CLI	CLI
Source repo	—	—	—	—	—
Safety notes	— missing	✓Evidently metrics and tests are decision support, not proof that a model, dataset, prompt, or LLM application is correct, fair, safe, or production-ready. Drift, data quality, and LLM judge results can be noisy or context-dependent, so thresholds should be calibrated on representative data before blocking releases or triggering alerts. Reports, test suites, and dashboards can influence deployment and incident workflows, so review generated conditions before wiring them into CI, monitoring, or agent-managed remediation. Synthetic data generation, prompt optimization, LLM-as-judge evaluations, and provider-backed metrics can call configured model services and should be scoped for cost and data handling. Self-hosted dashboards, local reports, and exported artifacts need normal access controls because they can become a shared source of operational decisions.	— missing	✓AgentOps instruments LLM calls, tools, operations, and agent workflows, so enable it intentionally in environments where captured traces are allowed. Cost and latency dashboards are useful for operations, but alerting and budget decisions still need human-reviewed thresholds. Self-hosted deployments require normal backend hardening for database access, secrets, authentication, and retained trace data.	✓DeepEval metrics should be treated as regression and review signals, not proof that an LLM application is safe, correct, or production-ready. LLM-as-a-judge metrics can call configured model providers, consume quota, hit rate limits, and produce judge-model errors that need separate handling. Evaluation thresholds should be calibrated on real examples before they block deployments or trigger automated rollback, ranking, billing, or moderation decisions. Tracing instrumentation can wrap live application code, agents, retrievers, tools, and model calls; keep eval and production environments clearly separated.
Privacy notes	✓LangSmith receives traces of your LLM and agent runs — prompts, outputs, tool calls, and metadata — sent to LangSmith's cloud (or your self-hosted instance); review what trace data leaves your environment and keep secrets out of logged inputs.	✓Evidently can process dataset columns, feature values, predictions, labels, model metadata, prompts, retrieved context, responses, traces, evaluation scores, and custom metric outputs. HTML, JSON, and Python dictionary reports can contain samples, column names, feature distributions, prompt text, generated answers, labels, or other sensitive operational data. Evidently Platform and Cloud workflows add hosted storage, dashboards, dataset management, tracing, user management, and alerting that should be reviewed against team data-retention and access-control policies. LLM-based evaluations may send prompts, responses, references, or scoring context to configured model providers unless a local evaluation path is used. Local report files and dashboard exports should be kept out of public repositories and shared workspaces unless reviewed for sensitive data.	— missing	✓Traces can include prompts, completions, tool inputs, tool outputs, errors, costs, tokens, tags, and application metadata. The docs say AgentOps automatically collects basic host environment details such as OS, Python version, anonymized hostname, and SDK version. Hosted dashboard use sends telemetry to AgentOps infrastructure; self-hosted use still requires retention, access-control, and log-review policies.	✓Test cases, traces, spans, prompts, actual outputs, expected outputs, retrieval context, tool arguments, metadata, and evaluation results may contain sensitive user or business data. LLM-based metrics can send evaluation payloads to the configured model provider unless a reviewed local model path is used. DeepEval documentation says evaluations run locally by default, while Confident AI login and cloud reporting are optional paths for centralized results. The official data privacy docs say DeepEval collects basic PostHog telemetry by default, including event names, metric names, notebook usage, an anonymous UUID, and public IP, with `DEEPEVAL_TELEMETRY_OPT_OUT=1` available for opt-out.
Prerequisites	— none listed	Python environment for running the Evidently library, reports, test suites, or local UI. Dataset, model outputs, LLM application traces, prompts, responses, labels, or other production-aligned examples to evaluate. Reference or baseline data when using drift, regression, or data quality checks. Reviewed metric selection, pass and fail thresholds, alert ownership, and release policy before using results in CI or production monitoring.	— none listed	Python or TypeScript/JavaScript application using a supported LLM provider or agent framework. AgentOps project/API key for hosted dashboard use, or a reviewed self-hosted deployment plan. A telemetry policy for which prompts, responses, tool calls, metadata, and host details may be captured.	Python environment for installing and running the `deepeval` package in the project being tested. Representative LLM test cases, expected outputs, retrieval context, traces, datasets, or golden examples for the behavior being evaluated. Model provider credentials for LLM-as-a-judge metrics such as G-Eval, Answer Relevancy, or other configured metrics. CI policy for which evaluation thresholds are advisory, which are blocking, and who reviews failures before release decisions.
Install	—	—	—	—	—
Config	—	—	—	—	—
Citations	Source repository: github.com, 2026-06-18T20:49:55+00:00 · Documentation: docs.langchain.com	Source repository: github.com, 2026-06-18T20:49:55+00:00 · Documentation: docs.evidentlyai.com · Submitted by oktofeesh1, 2026-06-03	Source repository: github.com, 2026-06-18T20:49:55+00:00 · Documentation: arize.com	Source repository: github.com, 2026-06-18T20:49:55+00:00 · Documentation: docs.agentops.ai · Submitted by oktofeesh1, 2026-06-03	Source repository: github.com, 2026-06-18T20:49:55+00:00 · Documentation: deepeval.com · Submitted by oktofeesh1, 2026-06-03
Claim	Unclaimed	Unclaimed	Unclaimed	Unclaimed	Unclaimed
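
The DeepEval privacy note in the table names `DEEPEVAL_TELEMETRY_OPT_OUT=1` as the telemetry opt-out. A minimal sketch of setting it from Python follows. The helper name `opt_out_of_deepeval_telemetry` is hypothetical and not part of DeepEval, and setting the variable before `deepeval` is imported is an assumption about when the library reads it, so confirm the timing against the official data privacy docs.

```python
import os


def opt_out_of_deepeval_telemetry(env=None):
    """Set the opt-out variable named in the DeepEval privacy note.

    `env` defaults to the real process environment; passing a plain dict
    keeps the helper easy to test. Hypothetical helper, not a DeepEval API.
    """
    target = os.environ if env is None else env
    target["DEEPEVAL_TELEMETRY_OPT_OUT"] = "1"
    return target


if __name__ == "__main__":
    # Assumption: call this before any `import deepeval` statement runs.
    opt_out_of_deepeval_telemetry()
    print(os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"])
```

Setting the same variable in the shell or CI configuration would likely work as well and avoids depending on import order.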

01
tools
LangSmith
Observability and evaluation platform for LLM apps.
Review first · Source-backed · Added 2mo ago
Safety · Privacy ✓
Why it made the cut
LangSmith is included because it has privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
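
The LangSmith privacy note advises keeping secrets out of logged inputs. One way to approach that, sketched under assumptions, is to scrub a payload before it reaches any tracing client. The `redact` helper and the `SENSITIVE_KEYS` set are hypothetical and are not part of LangSmith or any other SDK listed here.

```python
# Hypothetical list of key names to mask; extend it for your own payloads.
SENSITIVE_KEYS = {"api_key", "authorization", "password", "token", "secret"}


def redact(value, mask="[REDACTED]"):
    """Return a copy of `value` with sensitive dict keys masked, recursively."""
    if isinstance(value, dict):
        return {
            key: mask if str(key).lower() in SENSITIVE_KEYS else redact(item, mask)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, mask) for item in value]
    # Scalars (strings, numbers, None) are returned unchanged.
    return value


if __name__ == "__main__":
    payload = {
        "prompt": "hi",
        "headers": {"Authorization": "Bearer abc"},
        "tools": [{"token": "t"}],
    }
    print(redact(payload))
```

Matching on key names will not catch a secret pasted into prompt text, so this is a first filter and not a guarantee.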
02
tools
Evidently
Evaluate, test, and monitor ML models, LLM apps, data quality, and drift.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
Evidently is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
03
tools
Arize Phoenix
Open-source LLM observability and evaluation tooling.
Review first · Source-backed · Added 2mo ago
Safety · Privacy ·
Why it made the cut
Arize Phoenix is included because it is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
04
tools
AgentOps
Open-source observability for AI agent traces, replays, and costs.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
AgentOps is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
05
tools
DeepEval
Python LLM evaluation tests, metrics, regression checks, and tracing.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
DeepEval is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
06
tools
MLflow
Trace, evaluate, monitor, and manage agents, LLM apps, prompts, and models.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
MLflow is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
07
tools
TruLens
Evaluate and trace agents, RAG systems, LLM apps, metrics, and regressions.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
TruLens is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
08
tools
Langfuse
Open-source LLM tracing and evaluation platform.
Review first · Source-backed · Added 2mo ago
Safety · Privacy ✓
Why it made the cut
Langfuse is included because it has privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
09
tools
Helicone
LLM observability and cost tracking platform.
Review first · Source-backed · Added 2mo ago
Safety · Privacy ✓
Why it made the cut
Helicone is included because it has privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
10
tools
Weave
Tracking and evaluation toolkit for LLM applications.
Review first · Source-backed · Added 2mo ago
Safety · Privacy ·
Why it made the cut
Weave is included because it is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
11
tools
Agno
Build and run agent platforms with agents, teams, workflows, memory, MCP, and AgentOS.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
Agno is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
12
tools
Dagster
Orchestrate data assets, pipelines, jobs, schedules, sensors, and observability.
Review first · Source-backed · Added 15d ago
Safety ✓ Privacy ✓
Why it made the cut
Dagster is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
13
tools
Hugging Face Evaluate
Load metrics, comparisons, and measurements for reproducible model and dataset evaluation.
Review first · Source-backed · Added 15d ago
Safety ✓ Privacy ✓
Why it made the cut
Hugging Face Evaluate is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
14
tools
Label Studio
Open-source data labeling and human-in-the-loop AI evaluation.
Review first · Source-backed · Added 16d ago
Safety ✓ Privacy ✓
Why it made the cut
Label Studio is included because it has safety and privacy notes and is source-backed.
Reach for instead
If this will touch credentials, local files, or production systems, inspect the upstream source first.
