
Prompt Lab

A/B Testing for LLM Prompts
Prompt A (Naive)
Prompt B (Improved)
Compare how two different prompts extract invoice data from the same email.
Prompt A is intentionally naive. Prompt B has specific extraction rules. Edit either prompt and run both to see the accuracy difference.
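The A/B comparison above can be sketched as a small harness: run both prompts on the same email, score the extracted fields against known-correct values, and print the accuracy gap. This is a minimal sketch, not EVA's implementation — `call_llm` is a hypothetical stand-in for a real model API call, stubbed here so the script runs offline, and the prompts and expected values are illustrative.

```python
# Minimal A/B prompt harness sketch. `call_llm` is a hypothetical stub for
# a real model call (GPT-4, Claude, etc.) so this runs offline.
EMAIL = """Subject: Re: Invoice INV-2041
Hi, we paid $1,200.00 of the $3,400.00 due on INV-2041 yesterday."""

EXPECTED = {"invoice_number": "INV-2041", "amount_due": 3400.00,
            "amount_paid": 1200.00, "status": "partial"}

PROMPT_A = "Extract the invoice data from this email:\n{email}"
PROMPT_B = ("Extract invoice data as JSON with keys invoice_number, "
            "amount_due, amount_paid, status (paid/partial/unpaid). "
            "Amounts are numbers without currency symbols.\n{email}")

def call_llm(prompt: str) -> dict:
    # Stub: a real implementation would call the model API and parse its
    # JSON output. Here we simulate the naive prompt missing the
    # partial-payment edge case.
    if "status (paid/partial/unpaid)" in prompt:
        return dict(EXPECTED)
    return {"invoice_number": "INV-2041", "amount_due": 3400.00,
            "amount_paid": 1200.00, "status": "paid"}

def accuracy(result: dict) -> float:
    # Fraction of expected fields the extraction got exactly right.
    hits = sum(result.get(k) == v for k, v in EXPECTED.items())
    return hits / len(EXPECTED)

for name, prompt in [("A (naive)", PROMPT_A), ("B (improved)", PROMPT_B)]:
    score = accuracy(call_llm(prompt.format(email=EMAIL)))
    print(f"Prompt {name}: {score:.0%} fields correct")
```

With the stub, the naive prompt scores 75% (it mislabels the partial payment as paid) while the improved prompt scores 100% — the same kind of gap the lab surfaces with real model calls.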


Case Study
Industry

FinTech / Accounts Receivable Automation

Product

EVA (Email Verification Agent) processes incoming emails to automatically extract invoice data — numbers, amounts, dates, parties, and payment status. Accuracy is critical: wrong extractions mean wrong payments.
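The extraction target described above — numbers, amounts, dates, parties, payment status — can be pinned down as a typed record. This is a hypothetical schema sketch (the field names and `InvoiceExtraction` type are assumptions, not EVA's actual data model), but a fixed schema like this is what makes extractions comparable across prompts and models.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class InvoiceExtraction:
    """Hypothetical sketch of the fields EVA extracts from an email."""
    invoice_number: str
    amount_due: float
    due_date: date
    vendor: str           # the invoicing party
    customer: str         # the paying party
    payment_status: str   # e.g. "paid", "partial", "unpaid"

# Illustrative record for a partial-payment email.
record = InvoiceExtraction("INV-2041", 3400.00, date(2024, 3, 15),
                           "Acme Corp", "Globex Inc", "partial")
print(record.payment_status)
```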

The Challenge

When EVA's extraction accuracy dropped on edge cases (partial payments, PO vs invoice confusion, ambiguous dates), there was no systematic way to identify, reproduce, and fix the issues across different prompts and models.

Pain Points
Prompt changes were tested ad-hoc — “try it and see”
No way to A/B test two prompt versions on the same email
Accuracy regressions went unnoticed until customers reported them
Model migrations (GPT-4 → Claude) required manually retesting everything
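The regression and migration pain points above come down to the same missing piece: a labeled "golden set" of emails checked automatically on every prompt or model change. The sketch below is one hypothetical shape for that check — the `extract` function is a stub standing in for the real prompt + model pipeline, and the data and threshold are illustrative.

```python
# Hypothetical golden-set regression check: run the current prompt/model
# pipeline over labeled emails and fail if accuracy drops below a floor.

GOLDEN = [
    ("Paid INV-1 in full, $500 due.",
     {"invoice_number": "INV-1", "status": "paid"}),
    ("Sent $200 of the $500 on INV-2.",
     {"invoice_number": "INV-2", "status": "partial"}),
]

def extract(email: str) -> dict:
    # Stub: crude keyword extraction standing in for the real LLM call,
    # so the check runs offline.
    number = next(w.rstrip(".,") for w in email.split()
                  if w.startswith("INV-"))
    status = "partial" if " of the " in email else "paid"
    return {"invoice_number": number, "status": status}

def golden_accuracy() -> float:
    # Fraction of golden emails whose extraction matches exactly.
    correct = sum(extract(email) == expected for email, expected in GOLDEN)
    return correct / len(GOLDEN)

THRESHOLD = 0.95  # illustrative floor; tune to the product's tolerance
score = golden_accuracy()
print(f"golden-set accuracy: {score:.0%}")
assert score >= THRESHOLD, "accuracy regression: block this change"
```

Wiring a check like this into CI turns "try it and see" into a gate: a prompt edit or model swap only ships if the golden set still clears the threshold.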