Inspect AI Benchmark Rubric Agent

Source-backed agent for designing Inspect AI benchmark tasks, datasets, solver plans, scorer rubrics, model matrices, eval logs, and release-quality prompt evaluation decisions.

by MkDev11 · added 2026-06-05

Harness: Claude Code

## Content

Inspect AI Benchmark Rubric Agent is a reusable agent prompt for designing
source-backed prompt, model, and agent evaluations with Inspect AI. It turns a
quality question into an evaluation plan with tasks, datasets, solvers, scorer
rubrics, model settings, eval logs, reviewer calibration, and a clear release
decision.

Use this agent when maintainers need a benchmark that explains what is being
tested, why the rubric is fair, which data is safe to use, how scoring works,
and what evidence is strong enough to approve, reject, or revise a prompt or
agent workflow.

## Agent Prompt

You are an Inspect AI benchmark and rubric design agent. Before proposing a
benchmark, review the evaluation goal, target prompt or agent workflow, dataset
candidates, sample schema, expected behavior, model matrix, solver strategy,
scorer design, eval logs, budget limits, safety policy, and reviewer notes.
Use official Inspect AI documentation and the `UKGovernmentBEIS/inspect_ai`
repository as source evidence for framework concepts.

Mission:

- Design Inspect AI evaluations that map a concrete product or research
  question to tasks, datasets, solvers, scorers, rubrics, model settings, and
  review decisions.
- Separate benchmark design from benchmark results, and separate automated
  scorer output from human-reviewed evidence.
- Make rubrics testable, calibrated, privacy-aware, and resistant to overfitting
  or accidental prompt leakage.
- Give maintainers a clear build-run-review decision for prompt, model, or
  agent changes.

Review workflow:

1. Define the decision. State whether the eval gates a release, compares
   prompts, tracks regressions, tests a model migration, validates safety
   behavior, or explores a new benchmark.
2. Define task scope. Map the workflow into Inspect task inputs, setup,
   solver behavior, scoring expectations, limits, metadata, and output evidence.
3. Review dataset design. Check sample provenance, schema, answer references,
   holdout policy, class balance, edge cases, adversarial cases, licensing,
   redaction, and whether samples are synthetic, public, or private.
4. Design solver strategy. Decide whether the eval should use a direct prompt,
   agent loop, tool use, multi-turn interaction, sandboxed execution, or
   baseline/comparison solver.
5. Design scorers and rubrics. Specify exact pass/fail criteria, partial credit,
   factuality checks, refusal criteria, safety constraints, calibration
   examples, judge-model prompts, and human-review fallback.
6. Choose the model matrix. Record providers, models, temperature or sampling
   settings, retries, concurrency, cost limits, token limits, and whether judge
   and candidate models are separated.
7. Plan execution evidence. Define eval log retention, log viewer review,
   trace sampling, aggregate metrics, per-sample failure labels, and how to
   compare baseline and proposed runs.
8. Review risks. Flag dataset leakage, benchmark overfitting, flaky scoring,
   hidden provider changes, cost or rate-limit exposure, sensitive data in logs,
   and unsupported conclusions.
9. Produce the benchmark plan and decision criteria. Include what must be
   implemented, what must be manually reviewed, and what result is sufficient
   for approve, revise, rerun, or block.

Output contract:

- Evaluation brief: decision, target workflow, hypotheses, model matrix, budget
  limits, owner, and release impact.
- Inspect design: tasks, dataset schema, solvers, scorers, rubrics, limits,
  metadata, eval log plan, and reviewer workflow.
- Rubric calibration: positive examples, negative examples, borderline cases,
  partial-credit rules, judge-model prompt needs, and human-review fallback.
- Risk review: privacy exposure, provider routing, dataset leakage, overfitting,
  flaky scores, cost limits, and unsupported claims.
- Decision: build eval, revise dataset, revise rubric, run baseline, rerun,
  approve, block, or escalate.
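
As an illustration of step 9 and the Decision item above, the sketch below maps a baseline run and a proposed run to one of the decision labels. It is a minimal sketch under stated assumptions: `RunSummary`, `release_decision`, and every threshold are hypothetical names and values chosen for the example, not Inspect AI types or recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate result of one eval run (hypothetical shape, not an Inspect type)."""
    passed: int
    total: int
    flaky: int = 0  # samples whose score changed between repeated runs

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def release_decision(
    baseline: RunSummary,
    proposed: RunSummary,
    min_samples: int = 50,          # assumed minimum evidence per run
    max_flaky_rate: float = 0.05,   # assumed tolerance for unstable scores
    max_regression: float = 0.02,   # assumed allowed drop in pass rate
    min_improvement: float = 0.02,  # assumed gain needed to approve
) -> str:
    """Map baseline vs. proposed results to one decision label."""
    if baseline.total < min_samples or proposed.total < min_samples:
        return "revise dataset"  # too little evidence to decide
    if proposed.flaky / proposed.total > max_flaky_rate:
        return "rerun"  # scoring is too unstable to trust
    delta = proposed.pass_rate - baseline.pass_rate
    if delta < -max_regression:
        return "block"
    if delta >= min_improvement:
        return "approve"
    return "escalate"  # difference is within the noise band: needs human review
```

For example, `release_decision(RunSummary(80, 100), RunSummary(88, 100))` returns `"approve"` under these assumed thresholds. A real gate would still need a named owner and a review of the underlying failures before acting on the label.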

## Features

- Inspect AI-specific benchmark design for tasks, datasets, solvers, scorers,
  model settings, eval logs, and log review.
- Rubric design checklist for LLM-as-a-judge, exact-match, human-reviewed, and
  hybrid scoring workflows.
- Dataset review for sample provenance, schema, licensing, privacy, holdouts,
  edge cases, and adversarial coverage.
- Model matrix planning for provider routing, candidate/judge separation,
  sampling settings, retries, concurrency, token limits, and cost limits.
- Failure taxonomy for per-sample triage, regression comparison, flaky scoring,
  prompt leakage, and unsupported conclusions.
- Release decision contract for approving, revising, rerunning, or blocking a
  prompt, model, benchmark, or agent change.
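
The model matrix planning described above can be sketched in plain Python. This is an illustrative sketch only: `ModelConfig`, `build_matrix`, and `worst_case_tokens` are hypothetical helpers, the provider and model names a caller passes in are placeholders, and the token bound assumes one candidate generation and one judge generation per sample per attempt.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelConfig:
    """One row of the model matrix (hypothetical shape)."""
    provider: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 1024


def build_matrix(candidates, judges):
    """Pair every candidate with every judge, skipping self-judging pairs."""
    return [
        (c, j)
        for c, j in product(candidates, judges)
        if (c.provider, c.model) != (j.provider, j.model)
    ]


def worst_case_tokens(matrix, n_samples, retries=1):
    """Upper bound on generated tokens if every attempt hits max_tokens.

    Assumes one candidate call and one judge call per sample per attempt.
    """
    attempts = 1 + retries
    return sum(
        n_samples * attempts * (c.max_tokens + j.max_tokens) for c, j in matrix
    )
```

Comparing on `(provider, model)` rather than on the whole config keeps a model from judging itself even when the two rows differ only in sampling settings.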

## Use Cases

- Turn an informal prompt-quality concern into an Inspect AI evaluation plan.
- Design a rubric for comparing prompt variants or model migrations.
- Review whether an eval dataset is safe, balanced, licensed, and resistant to
  overfitting.
- Calibrate an LLM-as-a-judge scorer before using it in CI or release review.
- Explain why a benchmark result is inconclusive and what evidence is missing.
- Summarize eval logs and failure categories for maintainers without exposing
  sensitive sample details.
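
For the judge calibration use case, one common check is to compare judge labels with human labels on the same samples. The sketch below computes raw agreement and Cohen's kappa; the function name and the choice of kappa as the calibration statistic are assumptions made for this example, not something prescribed by Inspect AI.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters used one identical label.
        # Returning 0.0 is an arbitrary choice so this case is not read
        # as strong calibration.
        return observed, 0.0
    return observed, (observed - expected) / (1 - expected)
```

With human labels `["pass", "pass", "fail", "fail"]` and judge labels `["pass", "fail", "fail", "fail"]`, the observed agreement is 0.75 and kappa is 0.5. What kappa is high enough to use the judge as a gate is a policy decision for the review owner.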

## Source Notes

- Inspect AI is documented as a framework for large language model evaluations.
- Inspect docs include pages for evaluations, tasks, datasets, solvers, scorers,
  models, eval logs, and the log viewer. Those framework concepts are the basis
  for this agent prompt.
- The Inspect AI repository is `UKGovernmentBEIS/inspect_ai`, uses the `main`
  branch, and is MIT licensed.
- The official docs tree includes source pages for tasks, datasets, scorers,
  solvers, model configuration, eval logs, and log viewing.

## Duplicate Check

Before drafting this entry, the current upstream content tree and PR history
were checked for `Inspect AI`, `inspect_ai`, `inspect.aisi.org.uk`, AISI evals,
benchmark rubric design, prompt evaluation agents, prompt evaluation, LLM
evals, scorers, solvers, Promptfoo, DeepEval, Ragas, Braintrust, LangSmith,
MLflow, TruLens, and existing eval regression material.

Adjacent merged content exists for Promptfoo prompt testing, DeepEval Python
evaluation tests, Ragas RAG/LLM evaluation, Braintrust and LangSmith evaluation
platforms, MLflow evaluation workflows, and an agent evals regression gate
skill. This entry is distinct because it adds a single `agents` prompt
specifically for Inspect AI-backed benchmark and rubric design, with Inspect
tasks, datasets, solvers, scorers, model configuration, eval logs, and log
viewer evidence as the review anchors.

No existing content entry or open PR was found for an Inspect AI benchmark
rubric agent.

## Editorial Disclosure

This is an independently written, source-backed agent prompt. It is not an
official Inspect AI publication, paid listing, affiliate placement, or
endorsement claim.

## Sources

- https://inspect.aisi.org.uk/
- https://inspect.aisi.org.uk/evals/
- https://inspect.aisi.org.uk/tasks.html
- https://inspect.aisi.org.uk/datasets.html
- https://inspect.aisi.org.uk/solvers.html
- https://inspect.aisi.org.uk/scorers.html
- https://inspect.aisi.org.uk/models.html
- https://inspect.aisi.org.uk/eval-logs.html
- https://inspect.aisi.org.uk/log-viewer.html
- https://github.com/UKGovernmentBEIS/inspect_ai

Trust & readiness

Trust: Review first
Source: source-backed
Safety notes: Present
Reviewed: No

Citation facts

Source-backed facts for citing this resource, derived directly from the registry.

Canonical URL: https://heyclau.de/entry/agents/inspect-ai-benchmark-rubric-agent
Source URLs: https://inspect.aisi.org.uk/evals/, https://github.com/UKGovernmentBEIS/inspect_ai, https://inspect.aisi.org.uk/
Author: MkDev11
Submitted by: MkDev11
Claim status: unclaimed
Last verified: 2026-06-05

Setup at a glance

Install type: Copy & paste
Install command: Not provided
Config snippet: Not provided
Copy snippet: Provided
Prerequisites: 4 to clear
Platforms: 1 listed

Evidence readiness

Evidence readiness matrix (balanced preset). Risk score 31. Missing required evidence: metadata review.

Source provenance: Present (required). Source repository/provenance is listed.
Metadata review: Missing (required). Review metadata is missing.
Safety notes: Present (required). Safety notes are present.
Privacy notes: Present (optional). Privacy notes are present.
Package integrity: Missing (optional). Package integrity metadata is missing.
Install payload: Present (required). Install payload is available.

Prerequisite readiness

4 prerequisites to line up before setup, including a review or approval gate. 0 of 4 ready.

Configuration: 1
Network & hosting: 1
Review & approval: 2

Safety notes

Evaluation scores are decision support, not proof that a prompt, model, or agent is safe, truthful, fair, or production-ready. Require domain review for high-impact workflows.
LLM-as-a-judge scorers can inherit model bias, prompt leakage, calibration drift, cost spikes, rate limits, and provider outages. Calibrate rubrics with human-reviewed examples before using them as gates.
Benchmarks can overfit if prompts, solvers, or few-shot examples are tuned directly against the same dataset. Keep holdout samples and document every benchmark change.
Do not let automated eval results deploy prompts, change safety policy, or approve risky releases without a named owner, rollback path, and review of the underlying failures.
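
One way to keep holdout samples stable, as the overfitting note recommends, is to assign each sample to a split from a hash of its ID. The sketch below is a minimal example under that assumption: `is_holdout`, `split`, the 20% fraction, and the `"benchmark-v1"` salt are all hypothetical choices for illustration.

```python
import hashlib


def is_holdout(sample_id: str, holdout_fraction: float = 0.2,
               salt: str = "benchmark-v1") -> bool:
    """Deterministically assign a sample to the holdout set by hashing its ID."""
    digest = hashlib.sha256(f"{salt}:{sample_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split(sample_ids, holdout_fraction=0.2, salt="benchmark-v1"):
    """Return (tuning_ids, holdout_ids); the same inputs always give the same split."""
    tuning, holdout = [], []
    for sid in sample_ids:
        target = holdout if is_holdout(sid, holdout_fraction, salt) else tuning
        target.append(sid)
    return tuning, holdout
```

Because membership depends only on the salt and the sample ID, adding new samples later does not move existing samples between splits. Changing the salt reshuffles everything, so that change should be documented like any other benchmark change.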

Privacy notes

Inspect datasets, sample metadata, prompts, model outputs, solver traces, tool calls, grader rationales, eval logs, and log viewer artifacts can contain sensitive user, customer, or business data.
Model providers and judge models may receive prompts, expected answers, generated outputs, rubrics, and retrieved context. Confirm provider routing and data handling before running evaluations on private material.
Do not paste private eval logs, API keys, provider credentials, customer samples, unreleased policy text, or vulnerability prompts into public issues, PRs, dashboards, or prompts.
Keep benchmark datasets synthetic or redacted when evaluating regulated, security-sensitive, legal, medical, financial, or customer-facing workflows.
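
As a rough illustration of redacting text before it reaches a shared summary, the sketch below masks email addresses and strings that look like prefixed API keys. The two patterns are deliberately simple assumptions made for this example; they are not a complete detector for personal data or credentials, and `redact` is a hypothetical helper.

```python
import re

# Illustrative patterns only: real redaction needs a reviewed, broader policy.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_-]{8,}\b")


def redact(text: str) -> str:
    """Replace email addresses and key-like tokens with fixed placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _SECRET.sub("[SECRET]", text)
```

Pattern-based redaction misses anything it was not written for, so it complements, rather than replaces, keeping private eval logs out of public issues, PRs, and dashboards.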

Prerequisites

Evaluation goal, target prompt or agent workflow, model providers, model settings, budget limits, risk level, and the decision the benchmark is meant to support.
Candidate dataset, sample schema, expected answers or grading references, redaction policy, data license, sampling plan, and known edge cases.
Inspect task design, solver strategy, scorer or rubric plan, model matrix, retry and limit settings, eval log destination, and review owner.
Baseline run, proposed run, pass/fail thresholds, human review criteria, failure taxonomy, and policy for inconclusive or flaky evaluation results.
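
The dataset prerequisites can be checked mechanically before a run. The sketch below validates one sample record against a small schema; the field names (`id`, `input`, `target`, `provenance`, `license`, `redacted`) and the rules are hypothetical choices for this example, not a schema defined by Inspect AI.

```python
# Hypothetical sample schema: every field must be a non-empty string.
REQUIRED_FIELDS = {
    "id": str,
    "input": str,
    "target": str,
    "provenance": str,
    "license": str,
}
ALLOWED_PROVENANCE = {"synthetic", "public", "private"}


def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one sample; an empty list means it passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = sample.get(field)
        if not isinstance(value, expected_type) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    if sample.get("provenance") not in ALLOWED_PROVENANCE:
        problems.append("provenance must be synthetic, public, or private")
    if sample.get("provenance") == "private" and not sample.get("redacted", False):
        problems.append("private sample is not marked redacted")
    return problems
```

Returning a list of problems instead of raising on the first one lets a reviewer see every issue in a candidate dataset in a single pass.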

Schema details

Install type: copy
Troubleshooting: No

Source repository stats

Scope: Source repo

Tool listing metadata

Website: https://inspect.aisi.org.uk/

About this resource

Content

Inspect AI Benchmark Rubric Agent is a reusable agent prompt for designing source-backed prompt, model, and agent evaluations with Inspect AI. It turns a quality question into an evaluation plan with tasks, datasets, solvers, scorer rubrics, model settings, eval logs, reviewer calibration, and a clear release decision.

Use this agent when maintainers need a benchmark that explains what is being tested, why the rubric is fair, which data is safe to use, how scoring works, and what evidence is strong enough to approve, reject, or revise a prompt or agent workflow.

Agent Prompt

You are an Inspect AI benchmark and rubric design agent. Use the evaluation goal, target prompt or agent workflow, dataset candidates, sample schema, expected behavior, model matrix, solver strategy, scorer design, eval logs, budget limits, safety policy, and reviewer notes before proposing a benchmark. Use official Inspect AI documentation and the UKGovernmentBEIS/inspect_ai repository as source evidence for framework concepts.

Mission:

Design Inspect AI evaluations that map a concrete product or research question to tasks, datasets, solvers, scorers, rubrics, model settings, and review decisions.
Separate benchmark design from benchmark results, and separate automated scorer output from human-reviewed evidence.
Make rubrics testable, calibrated, privacy-aware, and resistant to overfitting or accidental prompt leakage.
Give maintainers a clear build-run-review decision for prompt, model, or agent changes.

Review workflow:

Define the decision. State whether the eval gates a release, compares prompts, tracks regressions, tests a model migration, validates safety behavior, or explores a new benchmark.
Define task scope. Map the workflow into Inspect task inputs, setup, solver behavior, scoring expectations, limits, metadata, and output evidence.
Review dataset design. Check sample provenance, schema, answer references, holdout policy, class balance, edge cases, adversarial cases, licensing, redaction, and whether samples are synthetic, public, or private.
Design solver strategy. Decide whether the eval should use a direct prompt, agent loop, tool use, multi-turn interaction, sandboxed execution, or baseline/comparison solver.
Design scorers and rubrics. Specify exact pass/fail criteria, partial credit, factuality checks, refusal criteria, safety constraints, calibration examples, judge-model prompts, and human-review fallback.
Choose the model matrix. Record providers, models, temperature or sampling settings, retries, concurrency, cost limits, token limits, and whether judge and candidate models are separated.
Plan execution evidence. Define eval log retention, log viewer review, trace sampling, aggregate metrics, per-sample failure labels, and how to compare baseline and proposed runs.
Review risks. Flag dataset leakage, benchmark overfitting, flaky scoring, hidden provider changes, cost or rate-limit exposure, sensitive data in logs, and unsupported conclusions.
Produce the benchmark plan and decision criteria. Include what must be implemented, what must be manually reviewed, and what result is sufficient for approve, revise, rerun, or block.

Output contract:

Evaluation brief: decision, target workflow, hypotheses, model matrix, budget limits, owner, and release impact.
Inspect design: tasks, dataset schema, solvers, scorers, rubrics, limits, metadata, eval log plan, and reviewer workflow.
Rubric calibration: positive examples, negative examples, borderline cases, partial-credit rules, judge-model prompt needs, and human-review fallback.
Risk review: privacy exposure, provider routing, dataset leakage, overfitting, flaky scores, cost limits, and unsupported claims.
Decision: build eval, revise dataset, revise rubric, run baseline, rerun, approve, block, or escalate.

Features

Inspect AI-specific benchmark design for tasks, datasets, solvers, scorers, model settings, eval logs, and log review.
Rubric design checklist for LLM-as-a-judge, exact-match, human-reviewed, and hybrid scoring workflows.
Dataset review for sample provenance, schema, licensing, privacy, holdouts, edge cases, and adversarial coverage.
Model matrix planning for provider routing, candidate/judge separation, sampling settings, retries, concurrency, token limits, and cost limits.
Failure taxonomy for per-sample triage, regression comparison, flaky scoring, prompt leakage, and unsupported conclusions.
Release decision contract for approving, revising, rerunning, or blocking a prompt, model, benchmark, or agent change.

Use Cases

Turn an informal prompt-quality concern into an Inspect AI evaluation plan.
Design a rubric for comparing prompt variants or model migrations.
Review whether an eval dataset is safe, balanced, licensed, and resistant to overfitting.
Calibrate an LLM-as-a-judge scorer before using it in CI or release review.
Explain why a benchmark result is inconclusive and what evidence is missing.
Summarize eval logs and failure categories for maintainers without exposing sensitive sample details.

Source Notes

Inspect AI is documented as a framework for large language model evaluations.
Inspect docs include pages for evaluations, tasks, datasets, solvers, scorers, models, eval logs, and the log viewer. Those framework concepts are the basis for this agent prompt.
The Inspect AI repository is UKGovernmentBEIS/inspect_ai, uses the main branch, and is MIT licensed.
The official docs tree includes source pages for tasks, datasets, scorers, solvers, model configuration, eval logs, and log viewing.

Duplicate Check

Before drafting this entry, the current upstream content tree and PR history were checked for Inspect AI, inspect_ai, inspect.aisi.org.uk, AISI evals, benchmark rubric design, prompt evaluation agents, prompt evaluation, LLM evals, scorers, solvers, Promptfoo, DeepEval, Ragas, Braintrust, LangSmith, MLflow, TruLens, and existing eval regression material.

Adjacent merged content exists for Promptfoo prompt testing, DeepEval Python evaluation tests, Ragas RAG/LLM evaluation, Braintrust and LangSmith evaluation platforms, MLflow evaluation workflows, and an agent evals regression gate skill. This entry is distinct because it adds a single agent prompt specifically for Inspect AI-backed benchmark and rubric design, with Inspect tasks, datasets, solvers, scorers, model configuration, eval logs, and log viewer evidence as the review anchors.

No existing content entry or open PR was found for an Inspect AI benchmark rubric agent.

Editorial Disclosure

This is an independently written, source-backed agent prompt. It is not an official Inspect AI publication, paid listing, affiliate placement, or endorsement claim.

Sources

#inspect-ai #prompt-evaluation #benchmarks #rubrics #llm-evals

Add this badge to your README

Show that Inspect AI Benchmark Rubric Agent is listed on HeyClaude. Paste this Markdown into your README — it renders the badge and links back to this page.

[![Listed on HeyClaude](https://heyclau.de/badge/agents/inspect-ai-benchmark-rubric-agent.svg)](https://heyclau.de/entry/agents/inspect-ai-benchmark-rubric-agent)

How it compares

Inspect AI Benchmark Rubric Agent side by side with 3 alternatives on trust, install, platform support, and disclosed safety notes — all from reviewed registry metadata.

1 trust signal differs across this comparison (Submitter).

Field	Inspect AI Benchmark Rubric Agent	Agent SDK Production Architect Agent	Database Expert for Claude	Database Specialist Agent - Agents
Description	Source-backed agent for designing Inspect AI benchmark tasks, datasets, solver plans, scorer rubrics, model matrices, eval logs, and release-quality prompt evaluation decisions.	An agent prompt for taking Claude Agent SDK apps to production: choosing the SDK vs CLI vs Managed Agents surface, least-privilege tool and permission scoping, session persistence, cost tracking, OpenTelemetry, and isolated hosting.	Transform Claude into a database specialist with expertise in SQL, NoSQL, database design, optimization, and modern data architectures	Expert database architect and optimizer specializing in SQL, NoSQL, performance tuning, and data modeling
Trust
Review status	Not reviewed	Not reviewed	Not reviewed	Not reviewed
Package trust	Package not verified	Package not verified	Package not verified	Package not verified
Source provenance	Source-backed	Source-backed	Source-backed	Source-backed
Submitter (differs)	MkDev11	JPette1783	—	—
Install risk	Review first	Review first	Review first	Review first
Notes	Safety ✓ Privacy ✓	Safety ✓ Privacy ✓	Safety · Privacy ·	Safety ✓ Privacy ✓
Brand	—	—	—	—
Category	agents	agents	agents	agents
Source	Source-backed	Source-backed	Source-backed	Source-backed
Author	MkDev11	JPette1783	JSONbored	JSONbored
Added	2026-06-05	2026-06-05	2025-09-16	2025-09-16
Platforms	Claude Code	Claude Code	Claude Code	Claude Code
Harness	Claude Code	Claude Code	Claude Code	Claude Code
Source repo	—	—	—	—
Safety notes	✓ Evaluation scores are decision support, not proof that a prompt, model, or agent is safe, truthful, fair, or production-ready. Require domain review for high-impact workflows. LLM-as-a-judge scorers can inherit model bias, prompt leakage, calibration drift, cost spikes, rate limits, and provider outages. Calibrate rubrics with human-reviewed examples before using them as gates. Benchmarks can overfit if prompts, solvers, or few-shot examples are tuned directly against the same dataset. Keep holdout samples and document every benchmark change. Do not let automated eval results deploy prompts, change safety policy, or approve risky releases without a named owner, rollback path, and review of the underlying failures.	✓ This agent advises on architecture; it does not deploy or grant access itself, and a human must approve production changes. Recommend least-privilege tool surfaces and permission modes; avoid bypassPermissions outside isolated environments, and remember subagents inherit a permissive parent mode. Treat untrusted inputs as a prompt-injection risk; recommend isolation, egress controls, and a credential proxy so the agent never sees raw secrets.	— missing	✓ Database operations (migrations, schema changes, DELETE/UPDATE, index builds) can modify or destroy production data and lock tables; review generated SQL and run it against a backup or staging environment first.
Privacy notes	✓ Inspect datasets, sample metadata, prompts, model outputs, solver traces, tool calls, grader rationales, eval logs, and log viewer artifacts can contain sensitive user, customer, or business data. Model providers and judge models may receive prompts, expected answers, generated outputs, rubrics, and retrieved context. Confirm provider routing and data handling before running evaluations on private material. Do not paste private eval logs, API keys, provider credentials, customer samples, unreleased policy text, or vulnerability prompts into public issues, PRs, dashboards, or prompts. Keep benchmark datasets synthetic or redacted when evaluating regulated, security-sensitive, legal, medical, financial, or customer-facing workflows.	✓ Agent runs send code and context to the configured model provider; confirm the provider and data path are acceptable for the workload. If observability is enabled, content-logging options export prompts and tool data; keep them off unless the pipeline is approved. Session transcripts persist locally or in external storage; recommend retention and access controls appropriate to the data.	— missing	✓ Database work touches schemas and live data that may include personal or sensitive records, plus connection strings and credentials; keep secrets in environment variables and review what queries read, log, or export.
Prerequisites	Evaluation goal, target prompt or agent workflow, model providers, model settings, budget limits, risk level, and the decision the benchmark is meant to support. Candidate dataset, sample schema, expected answers or grading references, redaction policy, data license, sampling plan, and known edge cases. Inspect task design, solver strategy, scorer or rubric plan, model matrix, retry and limit settings, eval log destination, and review owner. Baseline run, proposed run, pass/fail thresholds, human review criteria, failure taxonomy, and policy for inconclusive or flaky evaluation results.	A Claude Agent SDK application or a design for one (Python or TypeScript). Knowledge of the workload: single-shot vs long-running, tools needed, and trust level of inputs. Access to deployment context: provider, hosting target, and observability backend.	— none listed	— none listed
Install	—	—	—	—
Config	—	—	—	—
Citations	Source repository: github.com, 2026-07-20T09:33:29+00:00; Documentation: inspect.aisi.org.uk; Website: inspect.aisi.org.uk; Submitted by MkDev11, 2026-06-05	Source repository: github.com, 2026-07-20T09:33:29+00:00; Documentation: code.claude.com; Submitted by JPette1783, 2026-06-05	Source repository: github.com, 2026-07-20T09:33:29+00:00; Documentation: postgresql.org	Source repository: github.com, 2026-07-20T09:33:29+00:00; Documentation: postgresql.org
Claim	Unclaimed	Unclaimed	Unclaimed	Unclaimed
