In 2026, AI researcher Andrej Karpathy published a small open-source project he called "autoresearch." The core idea was deceptively simple: instead of a human manually reviewing each experiment, you give an AI agent a clear metric, a set of variants to try, and a decision rule: keep the change if the metric improves, revert if it doesn't. Run that cycle repeatedly and the system converges on better outputs, automatically.

Karpathy designed it for ML model development. The Autoresearch Playbook takes the same loop structure and applies it to the things lean teams actually need to optimize: cold email reply rates, landing page conversion, pricing pages, and market research synthesis.

This guide explains the loop from first principles — what it is, why it works, and what you need to run your first cycle.

The Three-Step Loop

Every autoresearch loop runs the same three steps, regardless of what you're optimizing:

  1. Pick one metric. Not "improve performance" — a single, unambiguous number. Reply rate for cold email. Conversion rate for a landing page. Revenue per visitor for a pricing experiment. One metric per loop run.
  2. Generate and apply a variant. The AI agent reads your current version, generates an improvement (a new subject line, a rewritten headline, a restructured offer), and applies it to your asset.
  3. Measure and decide. After a defined measurement window, compare variant performance against your baseline. If the metric improved by at least your threshold, keep the change and promote it to the new baseline. If it didn't, revert to the previous version. Either way, the loop is ready to run again.

That's the whole loop. What makes it powerful isn't any single step; it's the compounding. Each kept change becomes the new baseline. Each reverted change costs only the measurement window it ran in. Over dozens of cycles, small improvements add up.
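The keep-or-revert step can be sketched in a few lines of Python. This is a minimal illustration, not the playbook's implementation: the names LoopState, decide, and min_relative_lift are hypothetical, and the reply rates are made-up numbers.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Current baseline plus a log of every keep/revert decision."""
    baseline_text: str
    baseline_metric: float
    log: list = field(default_factory=list)


def decide(state: LoopState, variant_text: str, variant_metric: float,
           min_relative_lift: float = 0.20) -> bool:
    """Keep the variant if it beats the baseline by the threshold, else revert.

    min_relative_lift is an assumed default (20% relative improvement).
    """
    if state.baseline_metric > 0:
        lift = (variant_metric - state.baseline_metric) / state.baseline_metric
    else:
        # No baseline signal yet: treat any positive metric as an improvement.
        lift = float("inf") if variant_metric > 0 else 0.0
    keep = lift >= min_relative_lift
    state.log.append({"variant": variant_text, "metric": variant_metric,
                      "lift": lift, "decision": "keep" if keep else "revert"})
    if keep:
        # Promote the variant: it becomes the baseline for the next cycle.
        state.baseline_text = variant_text
        state.baseline_metric = variant_metric
    return keep


# Two illustrative cycles with made-up reply rates.
state = LoopState(baseline_text="Quick question about your onboarding",
                  baseline_metric=0.04)
decide(state, "Cut onboarding time in half?", 0.05)  # +25% relative: kept
decide(state, "Re: onboarding", 0.055)               # +10% vs new baseline: reverted
print(state.baseline_text, state.baseline_metric)
```

Note that the second variant is compared against the promoted baseline (0.05), not the original one. That is what makes the gains compound.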

What Makes It Different from A/B Testing

A/B testing platforms like VWO or Optimizely run both variants simultaneously and declare a winner based on statistical significance. That approach is rigorous but expensive: it requires enough traffic to reach significance, a developer to instrument the test, and a monthly subscription to keep the platform running.

The autoresearch loop is sequential, not parallel. You run one variant at a time, measure it against a rolling baseline, and keep or revert. This approach can be run on smaller sample sizes, doesn't require platform instrumentation, and has little ongoing cost, especially if you route the loop through a local AI model.

The trade-off is statistical power: sequential testing is less rigorous than a proper randomized controlled trial. Because the variant and the baseline run in different time windows, a change in audience or timing can look like a change caused by the variant. For most lean teams optimizing cold email sequences or landing page headlines, that trade-off is acceptable. You don't need p<0.05 to make a judgment call on whether your new subject line is outperforming the old one.
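To get a rough feel for the size of that trade-off, you can estimate how much an observed rate wobbles from sampling noise alone. The sketch below uses the standard binomial approximation for the standard error of a proportion; the 6% reply rate and the window sizes are hypothetical.

```python
import math


def standard_error(rate: float, sends: int) -> float:
    """Standard error of an observed proportion (binomial approximation)."""
    return math.sqrt(rate * (1.0 - rate) / sends)


# Hypothetical numbers: a 6% reply rate measured over windows of different sizes.
baseline_rate = 0.06
for sends in (50, 500, 5000):
    se = standard_error(baseline_rate, sends)
    print(f"{sends:>5} sends: {baseline_rate:.0%} +/- {se:.1%} (one standard error)")
```

With these made-up numbers, one standard error at 50 sends is about 3.4 percentage points, more than half the baseline itself. That is why a keep/revert decision on a small window is a judgment call rather than a significance test.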

The program.md Format

The Autoresearch Playbook delivers the loop as a set of plain-text template files — each one a fill-in-the-blank document called program.md. The format has three required sections:

  1. Context block. What are you optimizing? Who is the audience? What is the current version (or your starting point if this is the first run)? This is where you describe your business in plain language.
  2. Metric definition. What number are you measuring, how are you measuring it, and what counts as a meaningful improvement? For cold email, this might be: "Reply rate measured over 50 sends; keep the variant if its reply rate is at least 20% higher, in relative terms, than the current baseline."
  3. Variant instructions. What kinds of changes should the agent try? For landing page headlines, this might be: "Generate a headline that leads with the outcome, not the method. Maximum 12 words. Do not change the subheadline or CTA."
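As an illustration of the three sections, the sketch below assembles a program.md-style document from plain strings. The heading names, the Program class, and the context and metric text are assumptions made for this example; the playbook's actual templates may be laid out differently. Only the variant instructions are quoted from the example above.

```python
from dataclasses import dataclass


@dataclass
class Program:
    """The three required sections of a program.md file (layout is assumed)."""
    context: str
    metric: str
    variant_instructions: str

    def render(self) -> str:
        # Hypothetical heading names; adjust them to match your template.
        return (
            "# Context\n" + self.context.strip() + "\n\n"
            "# Metric\n" + self.metric.strip() + "\n\n"
            "# Variant instructions\n" + self.variant_instructions.strip() + "\n"
        )


program = Program(
    # Placeholder context and metric, invented for illustration.
    context="Landing page for a two-person B2B tool. Current headline: <paste here>.",
    metric="Conversion rate over the measurement window; keep the variant if it "
           "is at least 20% higher, in relative terms, than the current baseline.",
    variant_instructions="Generate a headline that leads with the outcome, not the "
                         "method. Maximum 12 words. Do not change the subheadline or CTA.",
)
text = program.render()
print(text)
```

Writing the result to disk is one more line, for example `pathlib.Path("program.md").write_text(text)`.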

You fill in the three sections with your specifics, drop the file into your AI agent's working directory (Claude Code, Cursor, or any MCP-capable agent), and run it. The agent reads the program, generates a variant, applies it, and outputs a decision log explaining what it changed and why.

The playbook includes 12 program.md templates covering cold email, landing pages, pricing, and market research synthesis. Each template ships with a completed example so you can see what a real output looks like before you start.

Running the Loop Locally

One of the most useful properties of the autoresearch approach is that it can run entirely on your own machine. If you have Ollama installed, you can point your agent at a local model (Llama 3, Mistral, Phi-4) and run experiments at near-zero marginal cost. For most optimization tasks, such as subject line testing, headline variants, and offer framing, local models often produce competitive results.
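A minimal sketch of sending a program to a local model, assuming Ollama is running on its default port and using its non-streaming /api/generate endpoint. The helper names build_request and generate_variant are hypothetical, and how the playbook's templates actually invoke a model is not specified here.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes an unmodified install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(program_text: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    payload = {"model": model, "prompt": program_text, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_variant(program_text: str, model: str = "llama3") -> str:
    """Send the program to the local model and return its text response."""
    request = build_request(program_text, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

With the model pulled and the Ollama server running, a call might look like `generate_variant(open("program.md").read(), model="mistral")`.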

For tasks that benefit from stronger reasoning (synthesizing customer research, modeling pricing elasticity), a paid API like Claude or GPT-4o gives better outputs. The Autoresearch Playbook includes a cost appendix that walks through expected token usage per task type and how to set hard caps so you never get a surprise bill.
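The cost appendix is not reproduced here, but one simple way to enforce a hard cap is to count tokens yourself and refuse any call that would exceed the budget. The sketch below is an assumption about how such a cap could work, not the playbook's method. TokenBudget and BudgetExceeded are hypothetical names, and the token counts would come from whatever usage figures your provider reports.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push usage past the hard cap."""


class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage, refusing any charge that would exceed the cap."""
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"cap of {self.max_tokens} tokens reached ({self.used} already used)"
            )
        self.used += tokens


budget = TokenBudget(max_tokens=1000)
budget.charge(600)          # first cycle fits within the cap
try:
    budget.charge(500)      # would total 1100, so it is refused
except BudgetExceeded as err:
    print("stopping loop:", err)
```

Checking the budget before each cycle, rather than after, means the loop stops before the expensive call is made.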

What You Need to Get Started

To run your first autoresearch loop, you need four things:

  • One asset to optimize (a cold email sequence, a landing page, a pricing page)
  • One measurable metric (reply rate, conversion rate, revenue per visitor)
  • An AI agent (Claude Code, Cursor, or Ollama for local)
  • A program.md template filled in with your specifics

You don't need a developer, a CRO platform, a statistics background, or a large traffic volume. The loop is designed to work for a solo founder running 50 cold emails a week or a two-person team with 200 monthly landing page visitors.

The Compounding Effect

The real value of the autoresearch loop isn't any single experiment — it's the habit it creates. When testing is this cheap and this fast, you run more experiments. When you run more experiments, you find more improvements. When each improvement becomes the new baseline, the gains compound.

A landing page that converts at 3% today and improves by a relative 15% every three months is converting at about 5.2% after a year (3% × 1.15⁴). That's not a dramatic single win; it's the quiet compounding of a system that keeps running.
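The arithmetic is easy to check: four quarterly gains of 15% multiply rather than add. A short sketch, with compounded_rate as a hypothetical helper name:

```python
def compounded_rate(start: float, relative_lift: float, periods: int) -> float:
    """Rate after `periods` successive relative improvements."""
    return start * (1 + relative_lift) ** periods


rate = compounded_rate(start=0.03, relative_lift=0.15, periods=4)
print(f"{rate:.2%}")  # prints 5.25%
```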

The free 12-point assessment will show you which of your current assets has the most headroom for improvement — cold email, landing pages, or pricing — and give you a starting point for your first loop.