How to A/B Test Marketing Copy with Local AI (Ollama + Autoresearch Loop)

Why test copy with AI instead of waiting for traffic?

Traditional A/B testing requires three things most marketers don't have enough of: traffic for statistical significance, time to wait for results, and a platform subscription to run the splits. VWO Pro starts at $281/month. Optimizely uses custom enterprise pricing. And even with those tools, you still have to brainstorm variants manually before the test begins.

AI copy testing compresses the ideation and filtering phase. Instead of brainstorming five headline options and hoping one wins, you let the autoresearch loop generate 20+ variants, score each one against measurable criteria, and eliminate the weak options computationally — before any traffic is spent.

The result is not a replacement for traffic-based validation. It's a better input to your live tests. When you start a real A/B test with the top 3 candidates from a 20-variant AI screening, instead of 3 guesses from a brainstorming session, every variant entering the test has already passed your rubric, so the test is more likely to converge quickly on a strong winner.

And when you run the loop on a local model via Ollama, the entire process costs $0. No API bill, no rate limits, no data leaving your machine.

What you need: three components

AI copy testing requires three components. All three are free.

The three components for local AI copy testing
Component	What it does	Cost
Ollama	Runs open-weight models (Llama 3, Mistral, Gemma) locally on your laptop. No API, no cloud, no data egress.	$0
program.md	A plain-English instruction file that tells the AI what to optimize, how to score it, and when to stop. This is the scoring rubric.	$0 (write your own or use a Playbook template)
AI coding agent	Claude Code, Cursor, or Windsurf. Reads your program.md and runs the propose-measure-keep/revert loop automatically.	$0 where the tool offers a free tier, otherwise a subscription

The instruction file — program.md — is the most important piece. It defines what "better" means for your specific asset. Without it, the AI generates random variation. With it, the AI generates directional improvement against criteria you control.

For Ollama installation, see How to run an LLM locally on Mac. For program.md structure, see how the autoresearch loop works.

The five-step workflow

Here's the step-by-step workflow for running AI copy tests on a local model. Total setup time for your first run: about 20 minutes. Subsequent runs: under 5 minutes to start, then hands-off while the loop iterates.

Five steps from baseline copy to optimized candidates
Step	Action	Time
1. Define baseline	Paste your current headline, email subject, or ad copy into a text file. This is what the loop will try to beat.	2 min
2. Load program.md	Write a scoring rubric or load a pre-built template. Define criteria: keyword presence, character count, readability, tone markers.	5–10 min (first time), 1 min (template)
3. Start the loop	Point your AI agent at the program.md. It reads your baseline, generates a variant, scores it, keeps or reverts, and repeats.	1 min to start
4. Let it run	The loop runs 10–20 iterations automatically. Each iteration builds on the best previous result. Hands-off.	15–45 min (local model)
5. Pick candidates	Review the final output: the best variant after N rounds. Put the top 2–3 into a real A/B test with live traffic.	5 min

Step 4 takes longer on local models than on API models — Ollama on an M-series Mac generates roughly 30–50 tokens per second, versus near-instant responses from Claude's API. For a 10-variant headline loop, expect 15–30 minutes locally versus 3–5 minutes on API. The tradeoff: local is free and private; API is fast and more capable.
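The keep-or-revert logic in steps 3 and 4 can be sketched in a few lines of Python. This is a minimal standalone stand-in for what an AI coding agent does when it reads your program.md, not the agent's actual implementation. It assumes Ollama is serving on its default local port and that a model named llama3 has been pulled; the prompt wording and the function names (ollama_propose, run_loop) are illustrative.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def ollama_propose(current_best: str, model: str = "llama3") -> str:
    """Ask a local Ollama model for one rewritten variant of the current best copy."""
    prompt = (
        "Rewrite this marketing headline to be clearer and more specific. "
        "Reply with the headline only.\n\n" + current_best
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()


def run_loop(
    baseline: str,
    score: Callable[[str], int],
    propose: Callable[[str], str],
    iterations: int = 10,
) -> tuple[str, int]:
    """Propose, measure, keep or revert: only a strictly better variant replaces the best."""
    best, best_score = baseline, score(baseline)
    for _ in range(iterations):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep the improvement
            best, best_score = candidate, candidate_score
        # otherwise revert: the next round starts again from `best`
    return best, best_score
```

To run it against a local model, pass ollama_propose as the propose argument and any rubric function that returns a number as score. Keeping the proposer and scorer as plain arguments means you can swap in a different model or rubric without touching the loop.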

Scoring rubrics by use case

The rubric is what separates autoresearch from random variation. Each criterion should be binary (pass/fail) or scored on a simple scale. Here are example rubrics for the three most common marketing copy types.

Cold email subject line

Subject line scoring rubric (5 criteria, 1 point each)
Criterion	Pass condition	Points
Length	30–50 characters (fits mobile preview)	1
Personalization	Contains at least one personalization token (company name, first name, role)	1
Power word	Includes a high-open-rate trigger word (verified, quick, mistake, update)	1
No spam triggers	Avoids words flagged by spam filters (free, guaranteed, act now, limited time)	1
Curiosity or benefit	Opens a curiosity gap or states a clear benefit — not both	1
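As a rough illustration, the four rule-based criteria in this rubric can be checked with plain string operations. The sketch below assumes personalization tokens are written as {{first_name}}, {{company}}, or {{role}} (your email tool may use a different syntax), and it uses only the example word lists from the table. The fifth criterion is a judgment call, so the caller passes it in rather than the code guessing.

```python
import re

POWER_WORDS = {"verified", "quick", "mistake", "update"}  # examples from the rubric
SPAM_TRIGGERS = ("free", "guaranteed", "act now", "limited time")  # examples from the rubric
# Assumed token syntax; adjust to match your email platform's merge fields.
TOKEN_PATTERN = re.compile(r"\{\{\s*(company|first_name|role)\s*\}\}")


def score_subject_line(subject: str, curiosity_or_benefit: bool) -> int:
    """Return 0-5 points. Length is measured on the template, before tokens are filled in."""
    lowered = subject.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    checks = [
        30 <= len(subject) <= 50,                     # length
        bool(TOKEN_PATTERN.search(subject)),          # personalization
        bool(words & POWER_WORDS),                    # power word
        not any(re.search(rf"\b{re.escape(t)}\b", lowered) for t in SPAM_TRIGGERS),
        curiosity_or_benefit,                         # judged by a human or a model
    ]
    return sum(checks)
```

For example, score_subject_line("{{first_name}}, a quick update on {{company}}", True) passes all five checks.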

Landing page headline

Headline scoring rubric (5 criteria, 1 point each)
Criterion	Pass condition	Points
Word count	Under 12 words	1
Primary keyword	Contains the target keyword or a close semantic variant	1
Specific benefit	States a measurable or tangible outcome (not vague aspiration)	1
Readability	Grade 8 or below on Flesch-Kincaid	1
No jargon	Zero industry jargon, acronyms, or buzzwords without inline definition	1
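Three of these criteria (word count, keyword, readability) are mechanical enough to script. The sketch below uses the standard Flesch-Kincaid grade formula with a crude vowel-group syllable counter, so treat the grade as an estimate; a dedicated readability library would be more accurate. It only matches the keyword literally, not semantic variants, and the function names are illustrative. Specific benefit and jargon are left to a human or model judge.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, ignoring one trailing silent 'e'."""
    word = word.lower()
    groups = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


def score_headline_rules(headline: str, keyword: str) -> dict[str, bool]:
    """Check the three rule-based criteria from the headline rubric."""
    return {
        "word_count": len(headline.split()) < 12,
        "primary_keyword": keyword.lower() in headline.lower(),
        "readability": flesch_kincaid_grade(headline) <= 8,
    }
```

Returning a dictionary of named checks rather than a single number makes it easier to see which criterion a variant failed.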

Paid ad copy

Ad copy scoring rubric (5 criteria, 1 point each)
Criterion	Pass condition	Points
Character limit	Within the ad format limit (Google: 30/30/90, Meta: 40/125)	1
Keyword in headline	Target keyword appears in headline 1 or 2	1
Clear CTA	Description ends with a specific action verb (get, start, try, see)	1
Social proof or number	Includes a stat, count, or credibility signal (used by X, rated Y, Z results)	1
Differentiation	Makes a claim competitors cannot make or do not make in current SERP ads	1
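A minimal sketch of the rule-based ad checks, assuming the Google 30/30/90 format from the table with two headlines and one description. It reads "ends with a specific action verb" as "the final sentence of the description starts with one of the example verbs", which is one possible interpretation. Differentiation needs competitor research, so it is not scored here. The field names are hypothetical.

```python
import re

GOOGLE_LIMITS = {"headline_1": 30, "headline_2": 30, "description": 90}  # from the rubric
CTA_VERBS = ("get", "start", "try", "see")  # example verbs from the rubric


def score_google_ad(ad: dict[str, str], keyword: str) -> dict[str, bool]:
    """Check character limits, keyword placement, CTA, and presence of a number."""
    # Treat the final sentence of the description as the call to action.
    sentences = [s for s in re.split(r"[.!?]+\s*", ad["description"]) if s]
    last_words = sentences[-1].lower().split() if sentences else []
    first_word = last_words[0] if last_words else ""
    headlines = (ad["headline_1"].lower(), ad["headline_2"].lower())
    return {
        "character_limit": all(len(ad[f]) <= limit for f, limit in GOOGLE_LIMITS.items()),
        "keyword_in_headline": any(keyword.lower() in h for h in headlines),
        "clear_cta": first_word in CTA_VERBS,
        # Any digit counts as a stat; this is a crude proxy for social proof.
        "social_proof_or_number": bool(re.search(r"\d", " ".join(ad.values()))),
    }
```

A Meta version would swap in the 40/125 limits and the matching field names.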

These rubrics are starting points. The Autoresearch Playbook templates include rubrics tuned for 12 specific use cases, with recommended iteration counts and cost-stop parameters for each. See systematic vs. ad-hoc prompting for why structured scoring outperforms freeform "make it better" instructions.

Cost comparison: local vs. hybrid vs. full API

The cost difference between approaches is large enough to change your workflow. Here's what each approach costs for a typical headline optimization session.

Cost per optimization session by approach
Approach	Per 10 variants	Per 50 variants	Monthly (3x/week)
Ollama (Llama 3, local)	$0	$0	$0
Claude hybrid (local eval + API polish)	~$0.08	~$0.40	~$5
Claude Sonnet 4.6 (full API)	~$0.25	~$1.25	~$15
VWO Pro (A/B platform)	n/a	n/a	$281 flat (no variant generation included)

The key insight: AI copy testing and A/B testing platforms solve different problems. The AI generates and scores variants computationally. The platform validates the winner with real traffic. Use both — the AI narrows the field, the platform confirms the winner. For detailed API pricing, see Claude API cost breakdown.
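The monthly column can be reproduced with simple arithmetic. The figures line up if you assume each session generates 50 variants and a month has four weeks; both are assumptions read back from the table, not stated in it.

```python
SESSIONS_PER_WEEK = 3  # from the table's "3x/week" column
WEEKS_PER_MONTH = 4    # assumption: a four-week month
# Per-session cost, assuming 50 variants per session (the table's middle column).
COST_PER_SESSION = {"ollama_local": 0.00, "claude_hybrid": 0.40, "claude_full_api": 1.25}


def monthly_cost(cost_per_session: float) -> float:
    """Monthly spend in dollars for a fixed number of sessions per week."""
    return round(cost_per_session * SESSIONS_PER_WEEK * WEEKS_PER_MONTH, 2)


for approach, cost in COST_PER_SESSION.items():
    print(f"{approach}: ${monthly_cost(cost):.2f}/month")
```

Under these assumptions the hybrid approach comes to $4.80 and the full API to $15.00 per month, in line with the rounded values in the table.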

When to use local models vs. API models

Local models and API models have different strengths. The decision depends on what you're testing and how complex the scoring rubric is.

Local vs. API decision framework
Factor	Use local (Ollama)	Use API (Claude)
Copy length	Short-form: headlines, subject lines, CTAs (<100 words)	Long-form: email bodies, landing page sections, blog intros (>100 words)
Rubric type	Rule-based: character count, keyword presence, readability score	Judgment-based: tone consistency, persuasion quality, brand voice
Budget	$0 — unlimited iterations	$0.08–$0.25 per 10 variants
Speed	15–30 min for 10 variants (M-series Mac)	3–5 min for 10 variants
Privacy	Nothing leaves your machine	Data sent to API endpoint

The hybrid approach — local models for first-pass filtering, API for final polish — gives you the best of both. Run 50 variants locally at $0, pick the top 5, then run those 5 through Claude for nuanced evaluation. Total cost: ~$0.10 instead of $1.25.
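The two-stage selection can be sketched as a filter followed by a rerank. Here local_score stands for a rule-based rubric run at no cost and api_score for a paid judgment call to an API model; both are hypothetical callables you would supply.

```python
import heapq
from typing import Callable


def hybrid_select(
    variants: list[str],
    local_score: Callable[[str], float],
    api_score: Callable[[str], float],
    shortlist_size: int = 5,
) -> str:
    """Filter with the free local scorer, then let the paid judge rank only the shortlist."""
    shortlist = heapq.nlargest(shortlist_size, variants, key=local_score)
    return max(shortlist, key=api_score)
```

With 50 variants and a shortlist of 5, the paid scorer is called only 5 times, which is where the saving comes from.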

Frequently asked questions

Can I A/B test marketing copy with a local AI model?

Yes. Ollama runs open-weight models like Llama 3 on your laptop at $0. The autoresearch loop generates copy variants, scores each one against a rubric you define (keyword presence, character count, readability, power words), keeps only what beats the baseline, and repeats. No traffic, no A/B platform, no API bill required.

How much does AI copy testing cost compared to an A/B testing platform?

$0 with Ollama running locally, or $0.08–$0.25 per 10-variant loop on Claude API. Compare to VWO Pro at $281/month or Optimizely at custom enterprise pricing. AI copy testing handles variant generation and scoring; A/B platforms handle traffic-based validation. They complement each other — use the AI to narrow the field, then validate with real visitors.

What scoring rubric should I use for AI copy testing?

The rubric depends on the asset type. Cold email subject lines: character count (30–50), power words, personalization tokens, no spam triggers. Landing page headlines: word count (under 12), primary keyword present, specific benefit, readability score. Ad copy: character limits for the format, keyword in headline, clear CTA, social proof. The Autoresearch Playbook includes pre-built rubrics for 12 marketing use cases so you don't build from scratch.

How many iterations does the autoresearch loop need?

Most marketing copy converges after 10–20 iterations. Headlines typically peak at 8–12 rounds, and cold email subject lines at 15–20. The loop auto-stops when variants stop beating the baseline or when you hit your cost or iteration limit. Each Autoresearch Playbook template includes a recommended iteration count for its use case.
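One way to express the auto-stop rule in code is shown below. It reads "variants stop beating the baseline" as "no improvement over the best earlier score for a set number of consecutive rounds". That patience window and the default limits are assumptions for illustration, not values from the Playbook.

```python
def should_stop(
    scores: list[int],
    spent: float,
    max_iterations: int = 20,
    patience: int = 3,
    cost_limit: float = 1.00,
) -> bool:
    """scores[0] is the baseline; scores[1:] are the variants in the order they were tried."""
    rounds = len(scores) - 1
    if rounds >= max_iterations or spent >= cost_limit:
        return True  # hit the iteration or cost limit
    if rounds < patience:
        return False  # not enough rounds yet to judge a plateau
    best_before = max(scores[:-patience])
    # Stop when none of the last `patience` variants beat the earlier best.
    return max(scores[-patience:]) <= best_before
```

On a local model spent stays at 0, so only the plateau check and the iteration cap apply.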

Do I need website traffic to run AI A/B tests?

No. AI copy testing is computational, not traffic-based. The AI scores variants against defined criteria (keyword presence, readability, character count, tone markers) rather than measuring user behavior. You should validate the top 2–3 candidates with real traffic afterward, but the loop itself runs without any visitors.
