AI Evals

Understand your AI's performance

Score real AI interactions against your product standards and track quality across every request, prompt, and model change.

HUMAN ALIGNMENT

Build evals aligned with your standards

Turn reviewer standards into repeatable quality checks for prompts, models, and production traffic.

Calibrate from real cases.

Score sampled outputs with your team, then test your eval against those same cases before it goes live.

Test cases

Refund request #4821 · case_a91

Password reset #4819 · case_b02

Billing dispute #4815 · case_c44

Shipping delay #4812 · case_d18

Support Response Quality

eval_3c82 · 24 sampled cases · Refund request #4821

Scoring

Model output

I can process your refund for order #4821. It will appear in 3 to 5 business days on your original payment method.

Rubric

Faithfulness   4.5/4.5
Groundedness   4.0/4.0
Helpfulness    4.5/4.5
Tone           3.5/3.5
Safety         5.0/5.0

Human alignment

96% agreement

Automated eval scores match your team on 23 of 24 sampled cases.

3 reviewers scored
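
As a rough illustration of how an agreement figure like this could be computed, here is a minimal sketch. The `ScoredCase` shape, the `agreementRate` helper, and the 0.5-point tolerance are all assumptions made for this example, not the product's actual API or matching rule, and the data is synthetic.

```typescript
// Hypothetical shape for one calibrated case; field names are assumptions.
interface ScoredCase {
  caseId: string;
  humanScore: number; // reviewer consensus score
  automatedScore: number; // score produced by the automated eval
}

// Assumed rule: two scores "agree" when they differ by at most `tolerance`.
function agreementRate(cases: ScoredCase[], tolerance: number = 0.5): number {
  if (cases.length === 0) return 0;
  let matches = 0;
  for (const c of cases) {
    if (Math.abs(c.humanScore - c.automatedScore) <= tolerance) matches++;
  }
  return matches / cases.length;
}

// Illustrative data only: 24 cases, exactly one of which disagrees.
const calibrationSet: ScoredCase[] = [];
for (let i = 0; i < 24; i++) {
  calibrationSet.push({
    caseId: `case_${i}`,
    humanScore: 4.5,
    automatedScore: i === 0 ? 2.5 : 4.5,
  });
}

const agreement = agreementRate(calibrationSet);
console.log(`${Math.round(agreement * 100)}% agreement`); // 23 of 24 rounds to 96%
```

A tolerance-based match is only one possible definition of agreement; an exact pass/fail match per case would be an equally reasonable choice.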

Build evals on a visual canvas.

Drag judges, checks, and rules into place. Tune prompts with your team, then run the same eval across models.

Eval canvas

Support Response Quality
Eval canvas with model output, LLM judges, and rubric aggregator nodes
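
To make the rubric aggregator node more concrete, the sketch below combines per-criterion judge scores into one result. It is a guess at the general idea, not the canvas's real logic: `aggregateRubric`, the unweighted mean, the 4.0 pass threshold, and the 3.0 per-criterion floor are all hypothetical. The input scores are the rubric values from the refund example.

```typescript
// Hypothetical types for illustration; not the product's actual schema.
type Criterion = "Faithfulness" | "Groundedness" | "Helpfulness" | "Tone" | "Safety";

interface RubricResult {
  score: number; // unweighted mean of all criterion scores
  passed: boolean;
  failing: Criterion[]; // criteria scoring below the per-criterion floor
}

function aggregateRubric(
  scores: Record<Criterion, number>,
  passMean: number = 4.0, // assumed overall threshold
  criterionFloor: number = 3.0 // assumed minimum for any single criterion
): RubricResult {
  const names = Object.keys(scores) as Criterion[];
  let total = 0;
  const failing: Criterion[] = [];
  for (const name of names) {
    total += scores[name];
    if (scores[name] < criterionFloor) failing.push(name);
  }
  const mean = total / names.length;
  return { score: mean, passed: mean >= passMean && failing.length === 0, failing };
}

const rubricResult = aggregateRubric({
  Faithfulness: 4.5,
  Groundedness: 4.0,
  Helpfulness: 4.5,
  Tone: 3.5,
  Safety: 5.0,
});
console.log(rubricResult.score.toFixed(1), rubricResult.passed); // 4.3 true
```

A real aggregator might weight criteria differently or treat a Safety failure as an automatic fail; the floor check here is one simple way to keep a single weak criterion from hiding behind a high average.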

Catch regressions before users do.

Sample live requests on your schedule. Track quality trends and failing cases as they appear.
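
One common way to sample a fixed share of live traffic is to hash each request ID into a bucket, so the same request always gets the same decision. The sketch below assumes that approach; `shouldSample` and the request ID format are hypothetical, and nothing here describes how the product itself selects requests.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units (exact FNV-1a for ASCII input).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Deterministically decide whether a request falls in the sampled share.
// `rate` is a fraction between 0 and 1, e.g. 0.1 for a 10% sample.
function shouldSample(requestId: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  const bucket = fnv1a(requestId) % 10000; // bucket in 0..9999
  return bucket < rate * 10000;
}

// Hypothetical request ID; the decision is stable across calls and processes.
console.log(shouldSample("req_8f2a", 0.1) === shouldSample("req_8f2a", 0.1)); // true
```

Hash-based sampling keeps decisions reproducible, which helps when re-running an eval on the same traffic. FNV-1a is not a strong hash, so the sampled share may drift from the target on highly regular IDs.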

Live eval results

eval_3c82 · 10% sample rate

Pass rate 82%
Last 24h
Timestamp         User             Score  Status  Failing
Feb 15, 10:33 PM  Randy Workman    4.6    Pass
Feb 15, 10:31 PM  Madelyn Carder   3.1    Fail    Groundedness
Feb 15, 10:29 PM  Maren Culhane    4.4    Pass
Feb 15, 10:27 PM  Abram Ekstrom    2.8    Fail    Faithfulness
Feb 15, 10:25 PM  Madelyn Bergson  4.5    Pass
Feb 15, 10:23 PM  Marcus Lubin     3.0    Fail    Groundedness
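
To show how rows like these could be rolled up into a trend, here is a small sketch that tallies the pass rate and the failing criteria. The `ResultRow` shape and `summarizeResults` helper are hypothetical. The data is the six rows listed above, so the pass rate it computes covers only those rows, not the full 24-hour window behind the 82% figure.

```typescript
// Hypothetical row shape mirroring the columns in the results list.
interface ResultRow {
  timestamp: string;
  user: string;
  score: number;
  status: "Pass" | "Fail";
  failing?: string; // criterion that failed, if any
}

const visibleRows: ResultRow[] = [
  { timestamp: "Feb 15, 10:33 PM", user: "Randy Workman", score: 4.6, status: "Pass" },
  { timestamp: "Feb 15, 10:31 PM", user: "Madelyn Carder", score: 3.1, status: "Fail", failing: "Groundedness" },
  { timestamp: "Feb 15, 10:29 PM", user: "Maren Culhane", score: 4.4, status: "Pass" },
  { timestamp: "Feb 15, 10:27 PM", user: "Abram Ekstrom", score: 2.8, status: "Fail", failing: "Faithfulness" },
  { timestamp: "Feb 15, 10:25 PM", user: "Madelyn Bergson", score: 4.5, status: "Pass" },
  { timestamp: "Feb 15, 10:23 PM", user: "Marcus Lubin", score: 3.0, status: "Fail", failing: "Groundedness" },
];

// Count passes and group failures by the criterion that failed.
function summarizeResults(rows: ResultRow[]) {
  const failuresByCriterion: Record<string, number> = {};
  let passed = 0;
  for (const row of rows) {
    if (row.status === "Pass") {
      passed++;
      continue;
    }
    const key = row.failing ?? "Unspecified";
    failuresByCriterion[key] = (failuresByCriterion[key] ?? 0) + 1;
  }
  const passRate = rows.length === 0 ? 0 : passed / rows.length;
  return { passRate, failuresByCriterion };
}

console.log(summarizeResults(visibleRows));
// { passRate: 0.5, failuresByCriterion: { Groundedness: 2, Faithfulness: 1 } }
```

Grouping failures by criterion is what surfaces a pattern such as repeated Groundedness failures, rather than a list of unrelated bad scores.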

Ask why it's failing.

Send eval results to your coding agent through MCP. It finds the failing cases and points to the prompt or code change.

Cursor

MCP

Why is the support eval failing? How is it failing and how can we fix it?

I'll run the sansa MCP to pull the latest sampled eval results for this project.

Ran get_eval_results in sansa-mcp

6 of the last 30 sampled cases failed on Groundedness. The model is inventing refund timelines instead of citing order data. I'll update the system prompt to remind the model to use the policy lookup tool first.

support-agent.ts
const SYSTEM_PROMPT = `
Always cite the order number from the user message.
Do not invent refund timelines. Use the policy lookup tool first.
`;

Your all-in-one AI backend.