
Prompt Lab

A/B Testing for LLM Prompts
Prompt A (Naive)
Prompt B (Improved)
Compare how two different prompts extract invoice data from the same email.
Prompt A is intentionally naive. Prompt B has specific extraction rules. Edit either prompt and run both to see the accuracy difference.
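The A/B comparison above can be sketched as a small harness: run both prompts on the same email, score the extracted fields against known-correct values, and print the accuracy gap. This is a minimal sketch, not EVA's implementation — `call_llm` is a hypothetical stand-in for a real model API call, stubbed here so the script runs offline, and the prompts and expected values are illustrative.

```python
# Minimal A/B prompt harness sketch. `call_llm` is a hypothetical stub for
# a real model call (GPT-4, Claude, etc.) so this runs offline.
EMAIL = """Subject: Re: Invoice INV-2041
Hi, we paid $1,200.00 of the $3,400.00 due on INV-2041 yesterday."""

EXPECTED = {"invoice_number": "INV-2041", "amount_due": 3400.00,
            "amount_paid": 1200.00, "status": "partial"}

PROMPT_A = "Extract the invoice data from this email:\n{email}"
PROMPT_B = ("Extract invoice data as JSON with keys invoice_number, "
            "amount_due, amount_paid, status (paid/partial/unpaid). "
            "Amounts are numbers without currency symbols.\n{email}")

def call_llm(prompt: str) -> dict:
    # Stub: a real implementation would call the model API and parse its
    # JSON output. Here we simulate the naive prompt missing the
    # partial-payment edge case.
    if "status (paid/partial/unpaid)" in prompt:
        return dict(EXPECTED)
    return {"invoice_number": "INV-2041", "amount_due": 3400.00,
            "amount_paid": 1200.00, "status": "paid"}

def accuracy(result: dict) -> float:
    # Fraction of expected fields the extraction got exactly right.
    hits = sum(result.get(k) == v for k, v in EXPECTED.items())
    return hits / len(EXPECTED)

for name, prompt in [("A (naive)", PROMPT_A), ("B (improved)", PROMPT_B)]:
    score = accuracy(call_llm(prompt.format(email=EMAIL)))
    print(f"Prompt {name}: {score:.0%} fields correct")
```

With the stub, the naive prompt scores 75% (it mislabels the partial payment as paid) while the improved prompt scores 100% — the same kind of gap the lab surfaces with real model calls.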


Case Study
Industry

FinTech / Accounts Receivable Automation

Product

EVA (Email Verification Agent) processes incoming emails to automatically extract invoice data — numbers, amounts, dates, parties, and payment status. Accuracy is critical: wrong extractions mean wrong payments.
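The extraction target described above — numbers, amounts, dates, parties, payment status — can be pinned down as a typed record. This is a hypothetical schema sketch (the field names and `InvoiceExtraction` type are assumptions, not EVA's actual data model), but a fixed schema like this is what makes extractions comparable across prompts and models.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class InvoiceExtraction:
    """Hypothetical sketch of the fields EVA extracts from an email."""
    invoice_number: str
    amount_due: float
    due_date: date
    vendor: str           # the invoicing party
    customer: str         # the paying party
    payment_status: str   # e.g. "paid", "partial", "unpaid"

# Illustrative record for a partial-payment email.
record = InvoiceExtraction("INV-2041", 3400.00, date(2024, 3, 15),
                           "Acme Corp", "Globex Inc", "partial")
print(record.payment_status)
```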

The Challenge

When EVA's extraction accuracy dropped on edge cases (partial payments, PO vs invoice confusion, ambiguous dates), there was no systematic way to identify, reproduce, and fix the issues across different prompts and models.

Pain Points
Prompt changes were tested ad-hoc — “try it and see”
No way to A/B test two prompt versions on the same email
Accuracy regressions went unnoticed until customers reported them
Model migrations (GPT-4 → Claude) required manually retesting everything
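The regression and migration pain points above come down to the same missing piece: a labeled "golden set" of emails checked automatically on every prompt or model change. The sketch below is one hypothetical shape for that check — the `extract` function is a stub standing in for the real prompt + model pipeline, and the data and threshold are illustrative.

```python
# Hypothetical golden-set regression check: run the current prompt/model
# pipeline over labeled emails and fail if accuracy drops below a floor.

GOLDEN = [
    ("Paid INV-1 in full, $500 due.",
     {"invoice_number": "INV-1", "status": "paid"}),
    ("Sent $200 of the $500 on INV-2.",
     {"invoice_number": "INV-2", "status": "partial"}),
]

def extract(email: str) -> dict:
    # Stub: crude keyword extraction standing in for the real LLM call,
    # so the check runs offline.
    number = next(w.rstrip(".,") for w in email.split()
                  if w.startswith("INV-"))
    status = "partial" if " of the " in email else "paid"
    return {"invoice_number": number, "status": status}

def golden_accuracy() -> float:
    # Fraction of golden emails whose extraction matches exactly.
    correct = sum(extract(email) == expected for email, expected in GOLDEN)
    return correct / len(GOLDEN)

THRESHOLD = 0.95  # illustrative floor; tune to the product's tolerance
score = golden_accuracy()
print(f"golden-set accuracy: {score:.0%}")
assert score >= THRESHOLD, "accuracy regression: block this change"
```

Wiring a check like this into CI turns "try it and see" into a gate: a prompt edit or model swap only ships if the golden set still clears the threshold.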