How to A/B test marketing copy with local AI models — free with Ollama.
You can A/B test headlines, cold emails, and ad copy on your own machine, for $0, using a local model and the autoresearch loop. No traffic, no split-test platform, no API bill. Here's the step-by-step workflow, the scoring rubrics that matter, and when to go local vs. API.
- Ollama runs open-weight models like Llama 3 on your laptop at zero cost. The autoresearch loop generates copy variants, scores each one against a rubric, keeps only what improves, and repeats.
- This compresses weeks of A/B testing ideation into a few hours of automated iteration — before you spend a single visitor on a live test.
- You need three things: Ollama installed (setup guide), a program.md instruction file, and an AI coding agent like Claude Code.
- After the loop narrows the field, feed the top 2–3 candidates into a real A/B test. Better inputs = better test outcomes.
- The Autoresearch Playbook includes pre-built scoring rubrics for 12 marketing use cases, so you don't build from scratch.
Why test copy with AI instead of waiting for traffic?
Traditional A/B testing requires three things most marketers don't have enough of: traffic for statistical significance, time to wait for results, and a platform subscription to run the splits. VWO Pro starts at $281/month. Optimizely uses custom enterprise pricing. And even with those tools, you still have to brainstorm variants manually before the test begins.
AI copy testing compresses the ideation and filtering phase. Instead of brainstorming five headline options and hoping one wins, you let the autoresearch loop generate 20+ variants, score each one against measurable criteria, and eliminate the weak options computationally — before any traffic is spent.
The result is not a replacement for traffic-based validation. It's a better input to your live tests. When you start a real A/B test with the top 3 candidates from a 20-variant AI screening, instead of 3 guesses from a brainstorming session, every option in the test has already cleared your rubric, which should help the test converge faster on a stronger winner.
And when you run the loop on a local model via Ollama, the entire process costs $0. No API bill, no rate limits, no data leaving your machine.
What you need: three components
AI copy testing requires three components. All three can be had for free.
| Component | What it does | Cost |
|---|---|---|
| Ollama | Runs open-weight models (Llama 3, Mistral, Gemma) locally on your laptop. No API, no cloud, no data egress. | $0 |
| program.md | A plain-English instruction file that tells the AI what to optimize, how to score it, and when to stop. This is the scoring rubric. | $0 (write your own or use a Playbook template) |
| AI coding agent | Claude Code, Cursor, or Windsurf — reads your program.md and runs the propose-measure-keep/revert loop automatically. | $0 (Claude Code free tier) or subscription |
The instruction file — program.md — is the most important piece. It defines what "better" means for your specific asset. Without it, the AI generates random variation. With it, the AI generates directional improvement against criteria you control.
For Ollama installation, see How to run an LLM locally on Mac. For program.md structure, see how the autoresearch loop works.
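For reference, here is a minimal sketch of how a script could request a copy variant from a local Ollama server. It assumes Ollama is running on its default port (11434) and that the `llama3` model has already been pulled. `build_payload` and `generate` are illustrative names, not part of Ollama or of the autoresearch tooling, and an AI coding agent may talk to the model differently.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes an unmodified install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "llama3") -> dict:
    """Build the JSON body for one non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"].strip()


# Example (requires a running Ollama server):
# print(generate("Rewrite this headline in under 12 words: ..."))
```

Because the request goes to `localhost`, the prompt and the generated copy stay on your machine.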
The five-step workflow
Here's the step-by-step workflow for running AI copy tests on a local model. Total setup time for your first run: about 20 minutes. Subsequent runs: under 5 minutes to start, then hands-off while the loop iterates.
| Step | Action | Time |
|---|---|---|
| 1. Define baseline | Paste your current headline, email subject, or ad copy into a text file. This is what the loop will try to beat. | 2 min |
| 2. Load program.md | Write a scoring rubric or load a pre-built template. Define criteria: keyword presence, character count, readability, tone markers. | 5–10 min (first time), 1 min (template) |
| 3. Start the loop | Point your AI agent at the program.md. It reads your baseline, generates a variant, scores it, keeps or reverts, and repeats. | 1 min to start |
| 4. Let it run | The loop runs 10–20 iterations automatically. Each iteration builds on the best previous result. Hands-off. | 15–45 min (local model) |
| 5. Pick candidates | Review the final output: the best variant after N rounds. Put the top 2–3 into a real A/B test with live traffic. | 5 min |
Step 4 takes longer on local models than on API models — Ollama on an M-series Mac generates roughly 30–50 tokens per second, versus near-instant responses from Claude's API. For a 10-variant headline loop, expect 15–30 minutes locally versus 3–5 minutes on API. The tradeoff: local is free and private; API is fast and more capable.
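Steps 3 and 4 can be sketched as a short keep-or-revert loop. This is an illustration under stated assumptions, not the agent's actual implementation: `run_loop`, `propose`, and `score` are hypothetical names, where `propose` would typically wrap a call to the local model and `score` would apply your rubric.

```python
from typing import Callable, List, Tuple


def run_loop(
    baseline: str,
    propose: Callable[[str], str],
    score: Callable[[str], int],
    max_iterations: int = 10,
) -> Tuple[str, int, List[Tuple[str, int, bool]]]:
    """Propose a variant, score it, and keep it only if it beats the current best."""
    best, best_score = baseline, score(baseline)
    history = []
    for _ in range(max_iterations):
        candidate = propose(best)            # generate a variant of the current best
        candidate_score = score(candidate)   # measure it against the rubric
        kept = candidate_score > best_score
        if kept:                             # keep: the variant becomes the new best
            best, best_score = candidate, candidate_score
        # A variant that is not kept is simply discarded (the "revert").
        history.append((candidate, candidate_score, kept))
    return best, best_score, history
```

The returned `history` gives you every candidate with its score, which is useful in step 5 when you pick the top 2–3 for a live test rather than only the final winner.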
Scoring rubrics by use case
The rubric is what separates autoresearch from random variation. Each criterion should be binary (pass/fail) or scored on a simple scale. Here are example rubrics for the three most common marketing copy types.
Cold email subject line
| Criterion | Pass condition | Points |
|---|---|---|
| Length | 30–50 characters (fits mobile preview) | 1 |
| Personalization | Contains at least one personalization token (company name, first name, role) | 1 |
| Power word | Includes a high-open-rate trigger word (verified, quick, mistake, update) | 1 |
| No spam triggers | Avoids words flagged by spam filters (free, guaranteed, act now, limited time) | 1 |
| Curiosity or benefit | Opens a curiosity gap or states a clear benefit — not both | 1 |
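
The four rule-based rows of this rubric can be checked in code; the fifth (curiosity or benefit) needs a model or a human judge and is left out. The sketch below is one possible reading of the table: the `{{company}}`-style merge-tag syntax is an assumption, the word lists are just the examples from the table, and length is measured on the template rather than on the rendered subject line.

```python
import re

POWER_WORDS = {"verified", "quick", "mistake", "update"}
SPAM_PHRASES = ("free", "guaranteed", "act now", "limited time")
# Assumed merge-tag syntax; adjust to whatever your email tool uses.
TOKEN_PATTERN = re.compile(r"\{\{\s*(company|first_name|role)\s*\}\}")


def check_subject_line(subject: str) -> dict:
    """Return pass/fail for each rule-based criterion of the subject-line rubric."""
    lowered = subject.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    return {
        "length": 30 <= len(subject) <= 50,
        "personalization": bool(TOKEN_PATTERN.search(subject)),
        "power_word": bool(words & POWER_WORDS),
        "no_spam_triggers": not any(
            re.search(rf"\b{re.escape(phrase)}\b", lowered) for phrase in SPAM_PHRASES
        ),
    }


checks = check_subject_line("Quick update for {{company}} on onboarding")
points = sum(checks.values())  # one point per passed criterion
```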
Landing page headline
| Criterion | Pass condition | Points |
|---|---|---|
| Word count | Under 12 words | 1 |
| Primary keyword | Contains the target keyword or a close semantic variant | 1 |
| Specific benefit | States a measurable or tangible outcome (not vague aspiration) | 1 |
| Readability | Grade 8 or below on Flesch-Kincaid | 1 |
| No jargon | Zero industry jargon, acronyms, or buzzwords without inline definition | 1 |
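
As a rough illustration, three of these rows (word count, primary keyword, readability) can be scored without a model. The sketch uses the standard Flesch-Kincaid grade formula with a crude vowel-group syllable counter, so treat the grade as approximate. The keyword check is an exact substring match and will miss close semantic variants. All function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade; text with no end punctuation counts as one sentence."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59


def check_headline(headline: str, keyword: str) -> dict:
    """Return pass/fail for the rule-based criteria of the headline rubric."""
    return {
        "word_count": len(headline.split()) < 12,
        "primary_keyword": keyword.lower() in headline.lower(),
        "readability": flesch_kincaid_grade(headline) <= 8.0,
    }
```

The remaining two rows (specific benefit, no jargon) are judgment calls, which is where a model-based or human check fits better than a rule.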
Paid ad copy
| Criterion | Pass condition | Points |
|---|---|---|
| Character limit | Within the ad format limit (Google: 30/30/90 characters for headline 1, headline 2, and description; Meta: 40/125 for headline and primary text) | 1 |
| Keyword in headline | Target keyword appears in headline 1 or 2 | 1 |
| Clear CTA | Description ends with a specific action verb (get, start, try, see) | 1 |
| Social proof or number | Includes a stat, count, or credibility signal (used by X, rated Y, Z results) | 1 |
| Differentiation | Makes a claim competitors cannot make or do not make in current SERP ads | 1 |
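
A sketch of the first four rows for a Google-style ad, using the 30/30/90 limits from the table. Two interpretations here are assumptions: "ends with a specific action verb" is read as "the final sentence of the description starts with one of the listed verbs", and "social proof or number" is approximated by the presence of any digit. Differentiation needs knowledge of competitor ads and is not checked. `check_google_ad` and the dictionary keys are hypothetical names.

```python
import re

GOOGLE_LIMITS = {"headline_1": 30, "headline_2": 30, "description": 90}
CTA_VERBS = {"get", "start", "try", "see"}


def check_google_ad(ad: dict, keyword: str) -> dict:
    """Return pass/fail for the rule-based criteria of the paid-ad rubric."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", ad["description"].strip()) if s]
    # First word of the final sentence, used as a proxy for the call to action.
    first_word = re.findall(r"[a-z]+", sentences[-1].lower())[:1] if sentences else []
    return {
        "character_limit": all(len(ad[field]) <= limit for field, limit in GOOGLE_LIMITS.items()),
        "keyword_in_headline": any(
            keyword.lower() in ad[field].lower() for field in ("headline_1", "headline_2")
        ),
        "clear_cta": bool(first_word) and first_word[0] in CTA_VERBS,
        "number_or_proof": any(ch.isdigit() for ch in " ".join(ad.values())),
    }


ad = {
    "headline_1": "Email Verification Tool",
    "headline_2": "Used by 2,000 Teams",
    "description": "Clean your list before you hit send. Try it free today.",
}
result = check_google_ad(ad, "email verification")
```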
These rubrics are starting points. The Autoresearch Playbook templates include rubrics tuned for 12 specific use cases, with recommended iteration counts and cost-stop parameters for each. See systematic vs. ad-hoc prompting for why structured scoring outperforms freeform "make it better" instructions.
Cost comparison: local vs. hybrid vs. full API
The cost difference between approaches is large enough to change your workflow. Here's what each approach costs for a typical headline optimization session.
| Approach | Per 10 variants | Per 50 variants | Monthly (3x/week) |
|---|---|---|---|
| Ollama (Llama 3, local) | $0 | $0 | $0 |
| Claude hybrid (local eval + API polish) | ~$0.08 | ~$0.40 | ~$5 |
| Claude Sonnet 4.6 (full API) | ~$0.25 | ~$1.25 | ~$15 |
| VWO Pro (A/B platform) | n/a | n/a | $281/month flat, no variant generation included |
The key insight: AI copy testing and A/B testing platforms solve different problems. The AI generates and scores variants computationally. The platform validates the winner with real traffic. Use both — the AI narrows the field, the platform confirms the winner. For detailed API pricing, see Claude API cost breakdown.
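The monthly column can be reproduced with simple arithmetic. The figures in the table are consistent with 50-variant sessions run three times a week over a four-week month (12 sessions). That reading is an assumption inferred from the numbers, not something the table states.

```python
# Per-session cost for 50 variants, taken from the table above (USD).
PER_50_VARIANTS = {
    "ollama_local": 0.00,
    "claude_hybrid": 0.40,
    "claude_full_api": 1.25,
}


def monthly_cost(per_session: float, sessions_per_week: int = 3, weeks: int = 4) -> float:
    """Estimated monthly spend, assuming a four-week month."""
    return round(per_session * sessions_per_week * weeks, 2)


monthly = {name: monthly_cost(cost) for name, cost in PER_50_VARIANTS.items()}
# ollama_local: 0.0, claude_hybrid: 4.8, claude_full_api: 15.0
```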
When to use local models vs. API models
Local models and API models have different strengths. The decision depends on what you're testing and how complex the scoring rubric is.
| Factor | Use local (Ollama) | Use API (Claude) |
|---|---|---|
| Copy length | Short-form: headlines, subject lines, CTAs (<100 words) | Long-form: email bodies, landing page sections, blog intros (>100 words) |
| Rubric type | Rule-based: character count, keyword presence, readability score | Judgment-based: tone consistency, persuasion quality, brand voice |
| Budget | $0 — unlimited iterations | $0.08–$0.25 per 10 variants |
| Speed | 15–30 min for 10 variants (M-series Mac) | 3–5 min for 10 variants |
| Privacy | Nothing leaves your machine | Data sent to API endpoint |
The hybrid approach — local models for first-pass filtering, API for final polish — gives you the best of both. Run 50 variants locally at $0, pick the top 5, then run those 5 through Claude for nuanced evaluation. Total cost: ~$0.10 instead of $1.25.
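A minimal sketch of the hybrid hand-off: rank the locally scored variants, keep the top few, and estimate what sending only those to the API would cost. `shortlist` and `api_cost` are hypothetical helpers, and the per-variant rate is derived from the ~$0.25 per 10 variants quoted for the full API. At that rate five variants come to about $0.13, the same ballpark as the ~$0.10 above.

```python
from typing import List, Tuple


def shortlist(scored_variants: List[Tuple[str, int]], k: int = 5) -> List[Tuple[str, int]]:
    """Return the k highest-scoring (variant, score) pairs, best first."""
    return sorted(scored_variants, key=lambda pair: pair[1], reverse=True)[:k]


def api_cost(n_variants: int, cost_per_10: float = 0.25) -> float:
    """Estimated API spend in USD, scaled from the per-10-variants rate."""
    return round(n_variants * cost_per_10 / 10, 3)


hybrid_cost = api_cost(5)     # only the shortlist goes to the API
full_api_cost = api_cost(50)  # every variant goes to the API
```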
Frequently asked questions
Can I A/B test marketing copy with a local AI model?
Yes. Ollama runs open-weight models like Llama 3 on your laptop at $0. The autoresearch loop generates copy variants, scores each one against a rubric you define (keyword presence, character count, readability, power words), keeps only what beats the baseline, and repeats. No traffic, no A/B platform, no API bill required.
How much does AI copy testing cost compared to an A/B testing platform?
$0 with Ollama running locally, or $0.08–$0.25 per 10-variant loop on Claude API. Compare to VWO Pro at $281/month or Optimizely at custom enterprise pricing. AI copy testing handles variant generation and scoring; A/B platforms handle traffic-based validation. They complement each other — use the AI to narrow the field, then validate with real visitors.
What scoring rubric should I use for AI copy testing?
The rubric depends on the asset type. Cold email subject lines: character count (30–50), power words, personalization tokens, no spam triggers. Landing page headlines: word count (under 12), primary keyword present, specific benefit, readability score. Ad copy: character limits for the format, keyword in headline, clear CTA, social proof. The Autoresearch Playbook includes pre-built rubrics for 12 marketing use cases so you don't build from scratch.
How many iterations does the autoresearch loop need?
Most marketing copy converges after 10–20 iterations. Headlines typically peak at 8–12 rounds, and cold email subject lines at 15–20. The loop auto-stops when variants stop beating the baseline or when you hit your cost or iteration limit. Each Autoresearch Playbook template includes a recommended iteration count for its use case.
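One way to express that stopping rule in code is shown below. The `patience` parameter (how many rounds without improvement count as "stopped beating the baseline") is an assumption for illustration, as are the function name and defaults. The Playbook templates may define the rule differently.

```python
from typing import Sequence


def should_stop(
    best_scores: Sequence[int],
    spent: float = 0.0,
    max_iterations: int = 20,
    patience: int = 3,
    cost_limit: float = float("inf"),
) -> bool:
    """best_scores[i] is the best rubric score seen after iteration i."""
    # Hard limits: iteration budget or cost budget exhausted.
    if len(best_scores) >= max_iterations or spent >= cost_limit:
        return True
    # Plateau: no improvement over the last `patience` iterations.
    if len(best_scores) > patience:
        return best_scores[-1] <= best_scores[-1 - patience]
    return False
```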
Do I need website traffic to run AI A/B tests?
No. AI copy testing is computational, not traffic-based. The AI scores variants against defined criteria (keyword presence, readability, character count, tone markers) rather than measuring user behavior. You should validate the top 2–3 candidates with real traffic afterward, but the loop itself runs without any visitors.