Run-your-own workshop

Turn your definition of good into evals you can ship on.

A guided builder for teams shipping chat or voice agents. Take the behaviors you decomposed and build the eval suite and quality loop that prove each change made the product better.

It picks up where the personality harness workshop leaves off. Bring that packet, and the builder seeds itself from your counted failures and trust moments. Start fresh, and it walks you through naming the behaviors from scratch.

Start the builder | Preview the output

Definition of good

"Be a warm, supportive coach."

An eval

On 120 real disclosure turns, reflect before advising in at least 90 percent, judged by a rubric that agrees with two humans, tracked on its own line.
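
To make the "at least 90 percent on 120 turns" bar concrete, here is a minimal sketch of how a pass rate and its uncertainty could be computed. It assumes each turn has already been graded pass or fail; the function names, the 0.90 default, and the choice of a Wilson score interval are illustrative assumptions, not the builder's own method.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95 percent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def meets_bar(passes: int, total: int, bar: float = 0.90) -> bool:
    """Hypothetical acceptance check: observed pass rate at or above the bar."""
    return passes / total >= bar


if __name__ == "__main__":
    # Example: 108 of 120 disclosure turns reflected before advising.
    low, high = wilson_interval(108, 120)
    print(f"pass rate {108 / 120:.2f}, interval {low:.3f} to {high:.3f}")
    print("meets bar:", meets_bar(108, 120))
```

Reporting the interval next to the rate shows how much a 120-case result could move on a different sample, which is why the acceptance criteria pair a sample size with an interval.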

Definition of good

"It should feel helpful."

An eval

Score whether the system followed the method on the five turns where a wrong answer loses the user, not whether the session felt nice.

Definition of good

"The new prompt seems better."

An eval

The candidate beats the baseline on the paired suite with the interval clear of zero, and no trust-moment metric dropped.
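
One way to read "interval clear of zero" is a paired comparison: score the baseline and the candidate on the same cases, take the per-case difference, and bootstrap an interval on the mean difference. The sketch below assumes numeric per-case scores and a dictionary of trust-moment metrics where higher is better; every name in it is hypothetical, and the percentile bootstrap is one reasonable choice among several.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean difference, lower, upper) using a percentile bootstrap."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    # Pair by position: both systems were scored on the same case.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


def candidate_wins(baseline, candidate, trust_baseline, trust_candidate):
    """Hypothetical gate: interval above zero and no trust-moment metric dropped."""
    _, lower, _ = paired_bootstrap_ci(baseline, candidate)
    no_trust_drop = all(trust_candidate[k] >= trust_baseline[k] for k in trust_baseline)
    return lower > 0 and no_trust_drop
```

Pairing matters because the same hard cases drag both systems down; differencing per case removes that shared noise before the interval is computed.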

Definition of good

"Add an LLM judge."

An eval

Write the rubric, then prove the judge agrees with human labels before it can gate a release.
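
Proving that a judge agrees with human labels usually means measuring agreement beyond chance on the same cases. A minimal sketch using Cohen's kappa follows; the 0.7 threshold and the function names are assumptions for illustration, and your own bar should come from how well two humans agree with each other.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def judge_may_gate(judge_labels, human_labels, min_kappa=0.7):
    """Hypothetical rule: the judge gates releases only above the kappa threshold."""
    return cohens_kappa(judge_labels, human_labels) >= min_kappa
```

Raw percent agreement can look high when nearly every case passes, so a chance-corrected measure is a safer basis for letting a judge block a release.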

What you leave with

An eval plan your engineers can build from.

Every answer maps to a piece of an eval implementation plan. You export one Markdown file and hand it to Claude, ChatGPT, Cursor, or an internal assistant to write the plan for your stack.

Behaviors under test, each tied to a counted failure
A dataset of eval cases with the trust moments pulled out
The cheapest grader that measures each behavior
A validation plan for every model judge
Acceptance criteria with a sample size and an interval
A test-first improvement loop, the eval before the fix
An online or A/B plan to prove the win on real traffic
A CI gate that blocks a regression on every change
A cadence that turns every production failure into a case
The eval-stack profile and the docs your LLM should read
An engineering backlog seeded from the work
An open decisions log you can see, not bury

Export preview

eval-plan.md
├── LLM instructions
├── Eval-stack profile
├── Documentation targets
├── Behaviors under test
├── Counted failure patterns
├── One eval, end to end
├── Eval scope
├── Eval dataset
├── Grader per behavior
├── Grader validation
├── Acceptance criteria
├── Eval-driven loop
├── Online and A/B plan
├── CI gate and closing the loop
├── Engineering handoff
├── Open decisions
└── Backlog seed

Start here

Enter your email to unlock the workspace

Your answers stay in this browser unless you export them. The email unlocks the builder and the blank guide, and lets us send you occasional new tools.