Open Source Evals Prompt Testing

A source-backed collection for building repeatable LLM eval and prompt testing workflows with open-source tools: prompt regression tests, RAG and agent metrics, human review datasets, traces, prompt optimization, and release gates.

by MkDev11 · added 2026-06-04

Harness: Claude Code

Bundle: 11 items


Review safety and privacy notes before installing or copying commands.


## What this collection sets up

This collection turns prompt and agent behavior into a repeatable engineering
workflow. It covers fast prompt regression checks, Python-first evaluation
tests, RAG and agent metrics, trace-backed debugging, human review datasets, and
release gates that define what happens when evals fail.

## Layers

### 1. Prompt and regression tests

- **promptfoo** gives teams prompt matrices, regression tests, and red-team
  checks.
- **deepeval** provides Python-first LLM unit tests and metrics.
- **agent-evals-regression-gate** helps define merge/release gates around eval
  suites.
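
The idea of a prompt matrix with regression checks can be sketched in plain Python. This is a minimal illustration, not Promptfoo's or DeepEval's API: `Case`, `fake_model`, `run_matrix`, and the example prompts are all hypothetical names, and the model call is replaced by a deterministic stub so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One regression case: template variables plus substrings the output must contain."""
    name: str
    variables: dict
    must_contain: tuple


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for a real model call (assumption for this sketch)."""
    return f"ECHO: {prompt.lower()}"


def run_matrix(prompts: dict, cases: list, model: Callable[[str], str]) -> list:
    """Run every prompt template against every case and return one result row per pair."""
    rows = []
    for prompt_id, template in prompts.items():
        for case in cases:
            output = model(template.format(**case.variables))
            missing = [s for s in case.must_contain if s not in output]
            rows.append({
                "prompt": prompt_id,
                "case": case.name,
                "passed": not missing,
                "missing": missing,
            })
    return rows


# Hypothetical prompt versions and cases, for illustration only.
PROMPTS = {
    "v1": "Summarize the ticket: {ticket}",
    "v2": "Summarize this support ticket in one sentence: {ticket}",
}
CASES = [
    Case("refund", {"ticket": "Customer requests a REFUND"}, ("refund",)),
    Case("login", {"ticket": "User cannot log in"}, ("log in",)),
]

results = run_matrix(PROMPTS, CASES, fake_model)
failures = [r for r in results if not r["passed"]]
```

In a real setup the stub would be swapped for the provider call, and a non-empty `failures` list would fail the check that guards a merge.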

### 2. RAG, agent, and optimization loops

- **ragas** focuses on RAG and LLM application evaluation.
- **dspy** helps teams program and optimize language-model pipelines with
  metrics and optimizers.
- **giskard** adds broader testing, scanning, and monitoring coverage.
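
To make "RAG metrics" concrete, here is a toy sketch of ID-based retrieval precision and recall averaged over a small case set. It is deliberately simpler than the metric definitions in Ragas or other tools and does not use their APIs; the function names and the document IDs are assumptions made up for this example.

```python
from statistics import mean


def retrieval_scores(retrieved_ids, relevant_ids):
    """Precision and recall of retrieved document IDs against a labeled relevant set."""
    retrieved = list(dict.fromkeys(retrieved_ids))  # de-duplicate, keep order
    relevant = set(relevant_ids)
    hits = [doc_id for doc_id in retrieved if doc_id in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}


def mean_scores(cases):
    """Average per-case scores; returns zeros for an empty case list."""
    if not cases:
        return {"precision": 0.0, "recall": 0.0}
    per_case = [retrieval_scores(c["retrieved"], c["relevant"]) for c in cases]
    return {
        "precision": mean(s["precision"] for s in per_case),
        "recall": mean(s["recall"] for s in per_case),
    }


# Hypothetical labeled retrieval cases.
EVAL_CASES = [
    {"retrieved": ["d1", "d4", "d2"], "relevant": ["d1", "d2", "d3"]},
    {"retrieved": ["d7", "d8"], "relevant": ["d7"]},
]

summary = mean_scores(EVAL_CASES)
```

Scores like these are regression signals over a fixed dataset; they say nothing about cases the dataset does not contain.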

### 3. Observability and review data

- **langfuse**, **trulens**, **mlflow**, and **agenta** capture traces, prompt
  versions, metrics, and experiment evidence.
- **label-studio** supports human review, annotation, benchmark curation, and
  preference data.
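
One way to connect traces to human review is to redact each trace and sample it deterministically before export. The sketch below assumes a hypothetical `TraceRecord` shape and an email-only redaction rule; it does not reflect the data model of Langfuse, TruLens, MLflow, Agenta, or Label Studio, and real redaction rules would need to cover far more than email addresses.

```python
import hashlib
import re
from dataclasses import asdict, dataclass

# Illustrative pattern only: matches common email shapes, nothing else.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class TraceRecord:
    """Hypothetical trace shape: one model interaction plus its eval score."""
    trace_id: str
    prompt_version: str
    user_input: str
    output: str
    score: float


def redact(text: str) -> str:
    """Replace email addresses with a placeholder before text leaves the product system."""
    return EMAIL.sub("[EMAIL]", text)


def to_review_row(trace: TraceRecord) -> dict:
    """Build an export row for human review with free-text fields redacted."""
    row = asdict(trace)
    row["user_input"] = redact(row["user_input"])
    row["output"] = redact(row["output"])
    return row


def selected_for_review(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: hash the trace ID into [0, 1) and compare with the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rate
```

Hash-based sampling keeps the selection stable across reruns, so the same trace is either always or never in the review set for a given rate.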

## Suggested order

Start with Promptfoo for fast prompt regression coverage. Add DeepEval or Ragas
for application-specific metrics, then wire traces into Langfuse, TruLens,
MLflow, or Agenta. Use Label Studio only after the team has written reviewer
instructions, sampling rules, and export boundaries.

## Source and references

- Promptfoo documentation: https://www.promptfoo.dev/docs
- DeepEval documentation: https://deepeval.com/docs/getting-started
- Ragas documentation: https://docs.ragas.io
- Langfuse documentation: https://langfuse.com/docs

## Duplicate check

Checked existing collections, upstream collection history, open collection PRs,
and repository content for `open-source-evals-prompt-testing`, open-source
evals, prompt testing collection, LLM regression testing, Promptfoo, DeepEval,
Ragas, and eval workflow. Existing collections cover code quality, production
readiness, API development, and data engineering, but none curates an
open-source LLM eval and prompt-testing lifecycle across prompt tests, metrics,
traces, human review, and release gates.

## Disclosure

Editorial collection. No paid placement or affiliate link is used.

Trust & readiness

Trust: Review first
Source: source-backed
Safety notes: Present
Reviewed: Yes


Review first — review before installing

Open the source and read safety notes before installing.

Citation facts

Source-backed facts for citing this resource, derived directly from the registry — also available as plain text for AI assistants.

Canonical URL: https://heyclau.de/entry/collections/open-source-evals-prompt-testing
Source URLs: https://www.promptfoo.dev/docs, https://github.com/JSONbored/awesome-claude/blob/main/content/collections/open-source-evals-prompt-testing.mdx
Safety notes: Eval scores are development and regression signals, not proof that an AI system is safe, fair, or production-ready. Red-team, prompt-injection, and adversarial prompt tests should run against isolated environments and reviewed credentials. Optimizer workflows can issue many model calls or overfit to narrow datasets; set budgets, holdout sets, and rollback rules.
Privacy notes: Eval datasets, traces, prompts, retrieved context, labels, and model outputs can contain user, customer, or proprietary data. LLM-graded metrics may send eval payloads to the configured model provider unless a reviewed local model path is used. Human review tools can retain annotations, reviewer decisions, and benchmark examples outside the original product system.
Author: MkDev11
Submitted by: MkDev11
Claim status: unclaimed
Last verified: 2026-06-04

Decision playbook

Review trust signals before you adopt

Signals are present but mixed. Use the checklist below to confirm the source and operational safety for your environment.


Source and provenance checks (complete)

Confirm ownership and provenance before trusting install instructions.

- Source link available (required): Open the canonical repository and verify ownership. Done.
- Source provenance status (required): Marked as source-backed. Done.
- Metadata reviewed: Registry metadata indicates a reviewed listing. Done.

Safety and privacy checks (complete)

Validate risk disclosures before installation or API wiring.

- Safety notes present (required): Review the listed safety guidance before running commands. Done.
- Privacy notes present (required): Review data handling notes before connecting accounts or secrets. Done.
- Trust level risk gate (required): Trust level does not block evaluation. Done.

Package and install checks (needs review)

Check package metadata and artifact integrity signals.

- Install payload available: Install or copy payload is available for review. Done.
- Package verification flag: No package verification flag provided. Pending.
- Checksum metadata: No checksum provided for downloaded artifact. Pending.

Compare-driven decision checks (needs review)

Use compare context to validate trade-offs before adoption.

- Compare tray has multiple entries: Add at least one more entry to compare trust differences. Pending.
- Baseline comparison available: No baseline peer selected yet. Pending.
- Diverging trust signals identified: No major trust-signal divergence found. Pending.

Setup at a glance

Install type: Copy & paste (copy-ready: paste the snippet to get started)
Estimated setup: 90 minutes
Install command: Not provided
Config snippet: Not provided
Copy snippet: Provided
Prerequisites: 3 to clear
Platforms: 1 listed

Adoption plan

Balanced adoption plan

Current risk score 16/100. Use staged verification before broader rollout.

Pre-adoption checks

Validate source and review signals before any execution.

- Confirm source provenance (required): Source URL/provenance metadata is present. Done.
- Confirm metadata review state: Listing has review metadata. Done.
- Verify install payload: Install/config payload exists and can be inspected. Done.

Security checks

Confirm safety, privacy, and package integrity signals.

- Review safety notes (required): Safety notes are present. Done.
- Review privacy notes (required): Privacy notes are present. Done.
- Verify package integrity metadata: No package verification/checksum metadata. Pending.

Rollout

Adopt in controlled steps based on the selected plan.

- Run in isolated sandbox first (required): Use a constrained sandbox and observe behavior across multiple tasks. Pending.
- Roll out gradually (required): Roll out to a small cohort before wider usage. Pending.
- Set monitoring and fallback: Define rollback path and monitor errors after adoption. Pending.

Evidence readiness

Evidence readiness matrix · balanced

Required evidence gates are covered (5/6 signals complete).

Risk: 15

- Source provenance (required in this preset): Present. Source repository/provenance is listed.
- Metadata review (required in this preset): Present. Review metadata is present.
- Safety notes (required in this preset): Present. Safety notes are present.
- Privacy notes (optional in this preset): Present. Privacy notes are present.
- Package integrity (optional in this preset): Missing. Package integrity metadata is missing.
- Install payload (required in this preset): Present. Install payload is available.

Required evidence gates are covered for this preset.

Decision timeline

Decision timeline · balanced

5/6 steps complete with no blocking gaps for this preset.

Risk: 14

- Triage: Confirm source provenance (required). Source/provenance metadata is available. Done.
- Triage: Check metadata review status (required). Review metadata is available. Done.
- Verify: Review safety notes (required). Safety notes are available. Done.
- Verify: Review privacy notes. Privacy notes are available. Done.
- Verify: Validate package integrity metadata. Package integrity metadata is missing. Pending.
- Rollout: Verify install payload and commands (required). Install payload is available. Done.

No required blockers for this timeline preset.

Prerequisite readiness

3 prerequisites to line up before setup. Includes a review or approval gate.

0/3 ready

Review & approval: 1 · General: 2 · Estimated setup: 90 minutes

Safety & privacy surface

3 safety and 3 privacy notes across 4 risk areas. Review closely: credentials & tokens, third-party handling.

4 areas

Safety (General): Eval scores are development and regression signals, not proof that an AI system is safe, fair, or production-ready.
Safety (Credentials & tokens): Red-team, prompt-injection, and adversarial prompt tests should run against isolated environments and reviewed credentials.
Safety (General): Optimizer workflows can issue many model calls or overfit to narrow datasets; set budgets, holdout sets, and rollback rules.
Privacy (General): Eval datasets, traces, prompts, retrieved context, labels, and model outputs can contain user, customer, or proprietary data.
Privacy (Third-party handling): LLM-graded metrics may send eval payloads to the configured model provider unless a reviewed local model path is used.
Privacy (Data retention): Human review tools can retain annotations, reviewer decisions, and benchmark examples outside the original product system.
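
The optimizer note above (budgets, holdout sets, rollback rules) can be sketched as three small guards. This is an illustrative sketch under stated assumptions, not DSPy's or any other optimizer's API: `CallBudget`, `split_holdout`, and `accept_candidate` are hypothetical helpers, and the default fraction and gain are placeholders.

```python
import random


class CallBudget:
    """Hard cap on model calls for one optimization run."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def spend(self, n: int = 1) -> None:
        """Record n calls, refusing to go past the cap."""
        if self.used + n > self.max_calls:
            raise RuntimeError("model-call budget exhausted")
        self.used += n


def split_holdout(examples, holdout_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and set aside a holdout the optimizer never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[cut:], shuffled[:cut]  # (train, holdout)


def accept_candidate(baseline_holdout, candidate_holdout, min_gain=0.0):
    """Keep the optimized prompt only if it beats the baseline on the holdout; else roll back."""
    return candidate_holdout >= baseline_holdout + min_gain
```

Judging the candidate on the holdout rather than the training examples is what guards against overfitting to a narrow dataset.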


Prerequisites

Representative prompts, traces, retrieval cases, expected answers, and failure examples for the behavior being evaluated.
A policy for which eval scores block releases, which trigger human review, and which are advisory only.
Redaction rules for prompts, documents, tool calls, traces, and human labels before they enter eval datasets.
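
The release policy prerequisite (which scores block, which trigger human review, which are advisory) can be written down as data plus one decision function. The sketch below is a hypothetical shape, not the format used by agent-evals-regression-gate; the metric names and thresholds are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"
    ADVISORY = "advisory"
    REVIEW = "human_review"
    BLOCK = "block"


# Ordered from least to most severe.
SEVERITY = [Action.PASS, Action.ADVISORY, Action.REVIEW, Action.BLOCK]


@dataclass(frozen=True)
class Rule:
    """A minimum score for one metric and the action taken when it is not met."""
    metric: str
    minimum: float
    on_fail: Action


def decide(scores: dict, rules: list) -> Action:
    """Return the most severe action among failing rules; a missing metric counts as a failure."""
    worst = Action.PASS
    for rule in rules:
        value = scores.get(rule.metric)
        failed = value is None or value < rule.minimum
        if failed and SEVERITY.index(rule.on_fail) > SEVERITY.index(worst):
            worst = rule.on_fail
    return worst


# Hypothetical policy: names and thresholds are placeholders.
RULES = [
    Rule("regression_pass_rate", 0.95, Action.BLOCK),
    Rule("faithfulness", 0.80, Action.REVIEW),
    Rule("tone", 0.70, Action.ADVISORY),
]

decision = decide(
    {"regression_pass_rate": 0.97, "faithfulness": 0.74, "tone": 0.90}, RULES
)
```

Treating a missing metric as a failure is a design choice in this sketch: it stops a skipped eval from silently passing the gate.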

Schema details

Install type: copy
Troubleshooting: No

Collection metadata

Items: 11 entries
Estimated setup: 90 minutes
Difficulty: advanced

Included entries

tools/promptfoo tools/deepeval tools/ragas tools/dspy tools/giskard tools/langfuse tools/trulens tools/mlflow tools/agenta tools/label-studio skills/agent-evals-regression-gate

Installation order

1. promptfoo
2. deepeval
3. ragas
4. dspy
5. langfuse
6. trulens
7. mlflow
8. agenta
9. giskard
10. label-studio
11. agent-evals-regression-gate


#evals #prompt-testing #llmops #open-source #regression


Add this badge to your README

Show that Open Source Evals Prompt Testing is listed on HeyClaude. Paste this Markdown into your README — it renders the badge and links back to this page.

[![Listed on HeyClaude](https://heyclau.de/badge/collections/open-source-evals-prompt-testing.svg)](https://heyclau.de/entry/collections/open-source-evals-prompt-testing)

How it compares

Open Source Evals Prompt Testing side by side with 3 alternatives on trust, install, platform support, and disclosed safety notes — all from reviewed registry metadata.

3 trust signals differ across this comparison (Package trust, Source provenance, Submitter).

Field	Open Source Evals Prompt Testing	Promptfoo	Agent Evals Regression Gate Skill	Ragas
Description	A source-backed collection for building repeatable LLM eval and prompt testing workflows with open-source tools: prompt regression tests, RAG and agent metrics, human review datasets, traces, prompt optimization, and release gates.	Open-source prompt testing and red-teaming framework for LLM outputs, regressions, evaluations, and security checks.	Build repeatable eval suites that catch quality regressions in AI agent behavior before merge or release.	Open-source evaluation framework for testing RAG systems, prompts, agents, workflows, and other LLM application behavior.
Trust
Review status	ReviewedMaintainer reviewed	ReviewedMaintainer reviewed	ReviewedMaintainer reviewed	ReviewedMaintainer reviewed
Package trust (differs)	Package not verified	Package not verified	Package verified (2026-04-10)	Package not verified
Source provenance (differs)	Source-backed	Source-backed	No submission link	Source-backed
Submitter (differs)	MkDev11	—	—	oktofeesh1
Install risk	Review first	Review first	Low risk	Review first
Notes	Safety ✓ Privacy ✓	Safety · Privacy ✓	Safety ✓ Privacy ✓	Safety ✓ Privacy ✓
Brand	—	Promptfoo	—	Ragas
Category	collections	tools	skills	tools
Source	Source-backed	Source-backed	First-party	Source-backed
Author	MkDev11	Promptfoo	JSONbored	Vibrant Labs
Added	2026-06-04	2026-04-27	2026-04-10	2026-06-03
Platforms	Claude Code	CLI	Claude Code, Codex, Windsurf, Gemini, Cursor, CLI	CLI
Harness	Claude Code	CLI	Claude Code, Codex, Windsurf, Gemini, Cursor, CLI	CLI
Source repo	—	—	—	—
Safety notes	✓Eval scores are development and regression signals, not proof that an AI system is safe, fair, or production-ready. Red-team, prompt-injection, and adversarial prompt tests should run against isolated environments and reviewed credentials. Optimizer workflows can issue many model calls or overfit to narrow datasets; set budgets, holdout sets, and rollback rules.	— missing	✓This skill produces automated release recommendations (merge, patch, or rollback) from eval scores; treat them as decision support and require human review before gating production releases or running suggested commands.	✓Ragas scores should be treated as decision support, not a substitute for domain review of critical outputs. LLM-based metrics can call configured model providers, so evaluation runs should be scoped and budgeted before use on large datasets. Generated test data and evaluator prompts should be reviewed before they influence release, ranking, or regression decisions.
Privacy notes	✓Eval datasets, traces, prompts, retrieved context, labels, and model outputs can contain user, customer, or proprietary data. LLM-graded metrics may send eval payloads to the configured model provider unless a reviewed local model path is used. Human review tools can retain annotations, reviewer decisions, and benchmark examples outside the original product system.	✓Promptfoo sends your prompts and test inputs to the model providers you configure to run evals and red-team probes; review which providers are used and keep secrets out of test cases.	✓Inputs can include source files, prompts, logs, account metadata, repository details, and operational context that may be sent to the configured AI model. Redact secrets, customer data, private URLs, credentials, and proprietary implementation details before sharing prompts, reports, or generated artifacts.	✓Evaluation examples may include prompts, retrieved context, generated responses, references, and metadata from the application under test. LLM-based metrics can send evaluation payloads to the configured model provider unless a local model path is used. The upstream README says Ragas collects minimal, anonymized usage analytics; review or disable analytics where policy requires it.
Prerequisites	Representative prompts, traces, retrieval cases, expected answers, and failure examples for the behavior being evaluated. A policy for which eval scores block releases, which trigger human review, and which are advisory only. Redaction rules for prompts, documents, tool calls, traces, and human labels before they enter eval datasets.	— none listed	Existing prompts, tools, or agent workflows to evaluate A representative set of real user tasks or transcripts CI or local runner where eval suites can be executed repeatedly	Python environment for installing and running Ragas. Test data, application outputs, or production-aligned examples for the RAG, prompt, workflow, or agent behavior being evaluated. Model provider credentials when using LLM-based metrics or generated test data.
Install	—	—	`curl -L https://heyclau.de/downloads/skills/agent-evals-regression-gate.zip -o agent-evals-regression-gate.zip && unzip -o agent-evals-regression-gate.zip -d ./agent-evals-regression-gate`	—
Config	—	—	—	—
Citations	Source repository: github.com (2026-07-18T19:14:44+00:00); Documentation: promptfoo.dev; Submitted by MkDev11 (2026-06-04)	Source repository: github.com (2026-07-18T19:14:44+00:00); Documentation: promptfoo.dev	Source repository: github.com (2026-07-18T19:14:44+00:00); Documentation: openai.com; Package (SHA-256 pinned): /downloads/skills/agent-evals-regression-gate.zip	Source repository: github.com (2026-07-18T19:14:44+00:00); Documentation: docs.ragas.io; Submitted by oktofeesh1 (2026-06-03)
Claim	Unclaimed	Unclaimed	Unclaimed	Unclaimed


Related resources

Promptfoo

Agent Evals Regression Gate Skill

Ragas

Agenta

Related guides

OpenAI Agents Trace to Eval Regression Guide

Evaluate AI Coding Tools with Repeatable Benchmarks

Securing Agentic Coding Workflows In Open Source Repos
