Is your AI prompt production-ready?
A prompt that works in five hand-tests can still fail catastrophically on the sixth real input. This is the 12-point assessment that separates a demo prompt from one you can put in front of a customer — scored, so you get a number and a verdict, not just a feeling.
Why this matters in 2026: as models move to metered API billing, a prompt that fails 1-in-20 doesn't just annoy a user — it bills you for a retry and erodes trust at scale. Reliability is now a cost lever, not just a quality one.
How to score it
Run one real, important prompt through all 12 checks below. Score each one:
- 0 — not addressed at all
- 1 — partially handled, or only by luck
- 2 — deliberately engineered and verified
Add them up for a score out of 24. Your verdict is at the bottom.
The 12 points
Input containment
User-supplied text is wrapped in clear delimiters (XML tags, triple quotes) and never concatenated raw into your instructions.
This is the boundary between your logic and someone else's data. Without it, a user (or a scraped web page) can smuggle in "ignore previous instructions" and hijack the prompt. It's the SQL-injection of the AI era.
Instructions live in the system role
Rules, persona, output schema, and policy sit in the system prompt; only the variable task goes in the user turn.
Separation makes instructions far harder to override and is what makes prompt caching possible — which is where the 2026 cost savings come from (next point).
Cache-aware ordering
Your heaviest, unchanging content (reference docs, policies) comes first; the variable bits come last.
Prompt caching only fires on a byte-identical prefix — and it can cut input cost by roughly 90% and shave latency on repeated calls. One stray character at the front breaks the cache and you pay full price again.
Reasoning style matches the model
Standard models are asked to show their work step by step; reasoning / extended-thinking models are given the goal and success criteria, not prescribed steps.
Force step-by-step scaffolding onto a reasoning model and you fight its own process; skip it on a standard model and you lose accuracy on anything multi-step.
One job per prompt
Multi-step work (analyze → compare → rewrite) is a chain of focused prompts, not one mega-prompt doing everything.
A single prompt juggling five jobs is impossible to debug and overloads the model's attention. Atomic steps fail loudly and in one obvious place.
An explicit output contract
You specify the exact shape — keys and types, e.g. {"items":["string"],"count":int} — with a separate field for the model's thinking so it never pollutes the parsed result.
"Give me a list" produces a different shape every run and breaks the code that consumes it. A contract makes output something you can parse with confidence.
Concrete negative constraints
Prohibitions are specific — "Do not output markdown fences. Do not add a preamble." — not vague pleas like "don't be wordy."
Models handle explicit "do not X" far better than abstract style requests. Vague negatives get ignored; concrete ones stick.
Programmatic validation first
Before any quality judgment, automated checks confirm the output parses and has every required field; a structural failure scores zero and triggers a retry.
Catch broken output in code, not in front of a customer. Quality only matters once the shape is valid.
An adversarial test set
You keep a "golden set" of 20–100 cases — roughly 70% realistic, 30% nasty (empty input, gibberish, injection attempts, very long context).
Five happy-path tests will never surface the failure that shows up once in twenty runs. The adversarial 30% is where production breaks.
Layered success metrics
You have pass/fail thresholds on three layers: structure (does it parse), accuracy (are the facts right), and style (tone/brevity — judged by a model where it's subjective).
"Looks good" is not a metric. You can't keep the winning prompt variant if you can't measure which one actually won.
Versioned, with regression tests
Prompts are versioned like code, and every change re-runs the whole golden set before it's promoted.
Fixing one edge case silently breaks another — the regression trap. And when the underlying model updates, an untested prompt can degrade overnight without a single code change on your side.
Model fit & a human gate
Temperature matches the task (low for extraction/JSON, higher only for creative work), instructions are tuned to the model's dialect, and high-stakes outputs route to a human below a confidence threshold.
Full automation on a critical path is how a quiet 5% error rate becomes a public catastrophe. The best systems tee up the decision; a human signs off where it counts.
Your score
Don't score prompts by hand. Make them test themselves.
Scoring once is step one. The Autoresearch Playbook turns these 12 points into an automated loop: an AI agent writes prompt variants, scores each against these exact criteria, keeps the winner and reverts the rest — overnight, and for $0 on a local model for non-training work.