Run an LLM locally on your Mac — $0 per experiment, no rate limits.
Ollama (and Apple's MLX) make it possible to run Llama 3 and other open models on your Mac in under 10 minutes. No API key, no credit card, no metered billing, no rate limits. Since Anthropic put programmatic Claude usage on a meter on June 15, 2026, the case for keeping a local model in the loop has become stronger. Here's the complete setup, what it costs (nothing for inference), an honest comparison of where local beats the API, and a blunt note on what local hardware can't do: train a model.
- Ollama is free, open-source, and installs in two commands on Mac. It runs Llama 3, Mistral, Phi, and dozens of other models on your hardware.
- Apple Silicon (M1, M2, M3, M4) runs local models fast — significantly faster than an Intel Mac — because the integrated GPU (via Metal) and unified memory handle the matrix math efficiently.
- Local inference = $0 per experiment, no internet required, no usage limits. The only cost is your machine's electricity and a one-time model download (a few gigabytes).
- This is for inference, not training. Running an already-trained model locally is free; training a large model from scratch still needs datacenter GPUs (H100-class) — a laptop can't replace that.
- Honest tradeoff: local models are weaker than Claude Sonnet or GPT-4 for nuanced creative copy. They excel at evaluation, scoring, classification, and structured output tasks.
What is Ollama and why does it make local LLMs practical?
Running an LLM on your own computer used to require setting up complex Python environments, manually downloading multi-gigabyte model files, and wrestling with CUDA drivers. Ollama eliminated most of that. It's a free, open-source tool that bundles everything — model management, a local inference server, and a simple CLI — into a clean Mac app. You install it, pull a model, and start talking to it in seconds. (On Apple Silicon, Apple's MLX / mlx-lm is a second route — a Python package that runs and fine-tunes open models natively on the Mac GPU; Ollama is the simpler starting point for most marketers.)
The reason this matters for optimization loops is the cost structure. Every call to the Claude API or GPT-4 API costs money: from under a cent to a few cents per request, which compounds into real bills when you're running hundreds of experiments. Every call to a local Ollama model costs nothing beyond the electricity your machine uses. No credit card, no billing dashboard, no monthly statement. For an autoresearch loop that needs to score and evaluate dozens of variants, the economics of local inference are hard to beat.
Ollama also exposes a local API that mimics the OpenAI interface at http://localhost:11434/v1/chat/completions, so most tools built to work with the OpenAI API can be pointed at Ollama by changing the endpoint and the model name. This makes it a drop-in local option for many workflows that currently use a cloud API for everything.
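To make the endpoint swap concrete, here is a minimal sketch using only Python's standard library. The helper name `build_chat_request`, the `HOSTED_BASE` URL, and the hosted model name are illustrative assumptions; only the local URL and the request shape come from the text above. The function builds the request without sending it, so it runs even when Ollama is not installed.

```python
import json
import urllib.request

# Local base URL from the article. HOSTED_BASE is a made-up placeholder
# used only to show that nothing else in the request has to change.
LOCAL_BASE = "http://localhost:11434/v1"
HOSTED_BASE = "https://api.example.com/v1"  # hypothetical hosted endpoint


def build_chat_request(base_url, model, prompt, api_key=None):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:  # a local Ollama server does not require a key
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


prompt = "Score this headline for clarity out of 10: Buy Now"
local = build_chat_request(LOCAL_BASE, "llama3.1", prompt)
hosted = build_chat_request(HOSTED_BASE, "some-hosted-model", prompt,
                            api_key="placeholder-key")
print(local.full_url)
print(hosted.full_url)
```

With Ollama running, passing `local` to `urllib.request.urlopen` would send the request to your own machine. The two requests differ only in base URL, model name, and the presence of a key.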
How to install Ollama and run Llama 3 on Mac
Three commands. Install Ollama, pull a model, run it — under 10 minutes on a modern Mac. The install is genuinely simpler than most developer tools. You need a Mac with at least 8 GB of RAM (16 GB recommended), macOS 14 Sonoma or later, and about 5 GB of free disk space for the Llama 3.1 8B model. (Steps verified against ollama.com, June 16, 2026.)
Step 1 — Install Ollama
Open Terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
Alternatively, download the Mac app directly from ollama.com; it installs as a menu bar app with a built-in CLI. Either path ends with Ollama running in the background and an ollama command available in your terminal.
Step 2 — Pull Llama 3.1 8B
This downloads the model file — about 4.9 GB, one time only:
ollama pull llama3.1
For a larger, more capable model — only practical on a high-memory Mac:
ollama pull llama3.1:70b-instruct-q4_0
Note: the 70B model is roughly 43 GB on disk and needs about 40 GB of RAM to run — so it's realistic only on a 64 GB machine. Don't expect a 16 GB Mac to run it; start with the 8B version and step up only if you have the memory headroom.
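A back-of-the-envelope check can tell you whether a model is worth downloading before you spend the bandwidth. The sketch below estimates the size of the quantized weights from the parameter count. The 4.5 bits per weight figure (a common approximation for 4-bit quantization with per-block scales) and the 3 GB headroom are assumptions, not numbers from this guide, and the function names are hypothetical.

```python
def estimate_weight_gb(params_billions, bits_per_weight):
    """Approximate size of the quantized weights in decimal gigabytes."""
    # params * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


def fits_in_memory(params_billions, bits_per_weight, ram_gb, headroom_gb):
    """Rough check: weights plus headroom for the OS, apps, and context cache."""
    needed = estimate_weight_gb(params_billions, bits_per_weight) + headroom_gb
    return needed <= ram_gb


BITS = 4.5       # assumption: effective bits per weight for 4-bit quantization
HEADROOM = 3.0   # assumption: memory left for everything else, in GB

for size, ram in [(8, 8), (8, 16), (70, 16), (70, 64)]:
    weights = estimate_weight_gb(size, BITS)
    verdict = "fits" if fits_in_memory(size, BITS, ram, HEADROOM) else "does not fit"
    print(f"{size}B on a {ram} GB Mac: ~{weights:.1f} GB of weights, {verdict}")
```

The estimate covers weights only, so treat a narrow pass as a warning sign: real memory use grows with context length and with whatever else is open on the machine.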
Step 3 — Run it
Chat interactively:
ollama run llama3.1
Or call it programmatically (the server starts automatically when Ollama is running):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1","messages":[{"role":"user","content":"Score this headline for clarity out of 10: Buy Now"}]}'
That curl call — pointing at your own machine, costing $0 — is the foundation of a locally-run evaluation loop.
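The same call can be wrapped in a small Python function so a loop can score many headlines. This is a minimal sketch, assuming the server returns the standard OpenAI-style response shape (`choices[0].message.content`). The names `post_json`, `parse_score`, and `score_headline` are illustrative, and the `send` parameter exists so the logic can be exercised without a running server.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def post_json(url, payload):
    """Send a JSON POST and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def parse_score(text):
    """Return the first number from 0 to 10 found in the reply, or None."""
    match = re.search(r"\b(10|\d)(?:\.\d+)?\b", text)
    return float(match.group(0)) if match else None


def score_headline(headline, model="llama3.1", send=post_json):
    """Ask the local model for a 0-10 clarity score for one headline."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Score this headline for clarity out of 10. "
                       f"Reply with only the number: {headline}",
        }],
        "temperature": 0,  # keep scoring as repeatable as possible
    }
    reply = send(OLLAMA_URL, payload)
    return parse_score(reply["choices"][0]["message"]["content"])
```

With Ollama running, `score_headline("Buy Now")` sends one request to your machine. The parser takes the first number it sees, which is why the prompt asks for a bare number; a model that adds commentary before the score can still confuse it.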
What to expect on first run: Apple Silicon vs Intel
On Apple Silicon, fast; on Intel, usable but slow. On Apple Silicon (M1, M2, M3, M4), Llama 3.1 8B typically generates text at roughly 40–80 tokens per second on a recent chip — fast enough to feel nearly instantaneous for typical marketing prompts. The integrated GPU (via Metal) and unified-memory architecture handle the matrix math efficiently, so inference doesn't require a separate discrete GPU. Throughput varies with chip generation and quantization, so treat these as ballpark figures.
On Intel Macs, the same model runs at 5–15 tokens per second — noticeably slower. Still usable for occasional requests, but uncomfortably slow for a loop that needs to score 20 variants. If you're on an Intel Mac and run into speed issues, the smaller Phi-3 Mini model (about 2 GB) runs faster and is still capable for scoring and classification tasks.
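To see what those throughput ranges mean for a 20-variant scoring loop, the arithmetic can be sketched as follows. The 100 output tokens per variant is an assumption chosen for illustration, and the estimate covers generation time only, not prompt processing or the initial model load.

```python
def loop_seconds(variants, output_tokens_each, tokens_per_second):
    """Generation time only; ignores prompt processing and model load time."""
    return variants * output_tokens_each / tokens_per_second


VARIANTS = 20             # loop size mentioned in the text
OUTPUT_TOKENS_EACH = 100  # assumption: a short score plus a brief rationale

SPEEDS = [
    ("Apple Silicon, low end", 40),
    ("Apple Silicon, high end", 80),
    ("Intel, low end", 5),
    ("Intel, high end", 15),
]

for label, tokens_per_second in SPEEDS:
    seconds = loop_seconds(VARIANTS, OUTPUT_TOKENS_EACH, tokens_per_second)
    print(f"{label}: about {seconds:.0f} s per loop")
```

Under these assumptions a pass takes well under a minute on Apple Silicon and several minutes at the low end of the Intel range, which is the gap the text describes.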
RAM matters as much as the chip. The 8B model uses approximately 6–8 GB of RAM during inference. If your Mac has 8 GB total, Ollama will compete with your other open applications. Close memory-heavy apps (Chrome tabs, Figma, Slack) when running intensive loops. With 16 GB or more, the model loads into memory and stays there — subsequent requests after the first are very fast.
Local vs API: the real cost comparison for optimization loops
Local inference is $0 per call; an API call costs from under a cent to a few cents, and that compounds. The math is simple but the implications are large at loop scale. This matters more since June 15, 2026, when Anthropic moved programmatic Claude usage onto a metered monthly credit: an unattended Opus optimization loop can run on the order of $15 in a single session, while the same scoring work on a local model costs nothing. (June 15 metering change confirmed via InfoWorld; see also our honest map of what's actually free after June 15.)
| Setup | Per experiment | 100 experiments | 1,000 experiments |
|---|---|---|---|
| Ollama / Local (any Mac) | $0.00 | $0.00 | $0.00 |
| Claude Haiku 4.5 via API | ~$0.007 | ~$0.70 | ~$7.00 |
| Claude Sonnet 4.6 via API | ~$0.020 | ~$2.00 | ~$20.00 |
| Claude Opus 4.8 via API | ~$0.033 | ~$3.30 | ~$33.00 |
Estimates assume 2,500 input tokens + 800 output tokens per experiment, June 2026 API rates. Actual cost depends on your token volume. For technical architecture and cost modeling details, see the Autoresearch Playbook technical guide.
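The per-experiment figures can be reproduced with a few lines of arithmetic. The token counts come from the note above; the per-million-token rates below are assumptions, picked because they reproduce the table's per-experiment column, and should be checked against current pricing before you rely on them.

```python
# Assumed rates in USD per million tokens: (input, output).
# These are not stated in the article; verify them against current pricing.
ASSUMED_RATES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.8": (5.00, 25.00),
}

INPUT_TOKENS = 2_500   # per experiment, from the article
OUTPUT_TOKENS = 800    # per experiment, from the article


def cost_per_experiment(input_rate, output_rate,
                        input_tokens=INPUT_TOKENS, output_tokens=OUTPUT_TOKENS):
    """Cost in USD for one experiment at the given per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


for name, (input_rate, output_rate) in ASSUMED_RATES.items():
    per = cost_per_experiment(input_rate, output_rate)
    print(f"{name}: ${per:.4f} each, ${per * 100:.2f} per 100, "
          f"${per * 1000:.2f} per 1,000")
```

Under these assumed rates the unrounded totals come out slightly below the 100 and 1,000 columns, which appear to multiply the already rounded per-experiment figure.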
For a team running continuous loops, the savings are significant. But the right answer isn't always "local." Local models have real limits — output quality, context window size, and the fact that your laptop is doing real work during inference (fan noise, warmth, battery draw on unplugged laptops). The API costs money but delivers better quality and runs without touching your machine's resources.
The practical pattern that works well: use Ollama locally for evaluation and scoring tasks — checking a headline's clarity, scoring a subject line for specificity, classifying a page element's function. Use the API for generation tasks where quality matters — writing the actual variants you'll test. Llama 3 is excellent at judgment calls; it's weaker at producing nuanced, on-brand copy from scratch.
Does running locally work for model training?
No — local hardware is for inference, not training. Everything in this guide is about running an already-trained model: feeding it a prompt and reading the output. That is what costs $0 on your Mac. Training a large language model from scratch — or doing a full fine-tune of one — is a fundamentally different, far heavier workload that requires datacenter GPU compute (H100-class machines, often many of them, for days or weeks). A MacBook or local box cannot replace that, and no amount of Ollama setup changes it.
This distinction matters because of where the idea comes from. Andrej Karpathy's autoresearch project is a keep-the-winner loop — propose a change, run it, keep it if the metric improves, discard it if it doesn't, repeat — and in his original it edits a training script and trains for five minutes per step on an NVIDIA GPU. That version genuinely needs the GPU. The Autoresearch Playbook borrows the same pattern but points it at marketing work (prompt optimization, SEO auditing, debugging, copy A/B), where each step is a model call, not a gradient update. No model weights are touched, so the marketing loop runs entirely on local inference at $0 per call — and that is the version a MacBook can run.
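The keep-the-winner pattern described above fits in a few lines. This is a generic sketch of the pattern, not code from Karpathy's project or from the Playbook; `propose` and `score` are placeholders, and the toy versions below exist only so the example runs offline. In a marketing loop, each would be a model call.

```python
import random


def keep_the_winner(baseline, propose, score, steps):
    """Propose a change, score it, and keep it only if the metric improves."""
    best, best_score = baseline, score(baseline)
    for _ in range(steps):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep the winner, discard the rest
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins (hypothetical): a real loop would call a model in both places.
def toy_score(headline):
    """Made-up metric that prefers headlines close to 40 characters."""
    return -abs(len(headline) - 40)


_rng = random.Random(0)  # fixed seed so repeated runs behave the same way


def toy_propose(headline):
    """Append a random phrase to the current best headline."""
    extras = ["today", "in minutes", "for free", "on your Mac"]
    return f"{headline} {_rng.choice(extras)}"


start = "Run an LLM locally"
best, best_score = keep_the_winner(start, toy_propose, toy_score, steps=10)
print(best, best_score)
```

Because a candidate replaces the current best only when its score is strictly higher, the result can never score worse than the baseline, which is what makes the loop safe to leave running.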
A narrow middle ground exists: lightweight, parameter-efficient fine-tuning (LoRA/QLoRA on a small model) is feasible on a 32–64 GB Apple Silicon Mac via tools like mlx-lm. That is real, but it is small-scale adaptation, not training a frontier model — and it is not required for the optimization loops this guide is about. If a workflow genuinely needs to train or heavily fine-tune a large model, rent cloud GPUs; don't expect a laptop to do it.
Local vs API: honest tradeoff table
| Dimension | Local (Ollama) | API (Claude / GPT) |
|---|---|---|
| Cost per call | $0 | $0.007–$0.033+ |
| Speed (Apple Silicon) | Fast (40–80 tok/s) | Fast (varies by load) |
| Output quality (generation) | Good for 8B; weaker for complex brand voice | Strong — Sonnet/Opus produce nuanced copy |
| Output quality (evaluation/scoring) | Strong — Llama 3 scores reliably | Strong — slight edge on subtle judgment |
| Privacy | Complete — data never leaves your machine | Sent to Anthropic/OpenAI servers |
| Internet required | No (after model download) | Yes |
| Context window | 128K (Llama 3.1) | 200K (Claude) / 128K (GPT-4) |
| Setup overhead | One-time 10-minute install + download | API key only |
| Machine resource use | High during inference (GPU/CPU/RAM) | None (cloud-side) |
The cleanest use of local models in an autoresearch loop: run your evaluation and scoring layer locally — it's the most call-intensive part of the loop and the part where output quality differences matter least. Keep your generation layer on the API when you need brand-accurate, high-quality copy variants. This hybrid approach gives you near-zero marginal cost on the volume operations while preserving quality on the creative output.
The Autoresearch Playbook covers exactly this architecture — including how to configure your loop to route evaluation calls to a local Ollama endpoint and generation calls to the API, with a hard cost ceiling so the API layer never runs unchecked.
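As a rough illustration of that routing idea (not the Playbook's implementation), the sketch below sends evaluation calls to a local function and generation calls to a paid one, and refuses to make a paid call that would pass a spending ceiling. The class name, the method names, and the flat per-call cost estimate are all assumptions made for the example.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next API call would pass the spending ceiling."""


class HybridRouter:
    """Route scoring to a local model and generation to a metered API."""

    def __init__(self, local_call, api_call, ceiling_usd, est_cost_per_api_call):
        self.local_call = local_call      # e.g. a function that calls Ollama
        self.api_call = api_call          # e.g. a function that calls a cloud API
        self.ceiling_usd = ceiling_usd
        self.est_cost = est_cost_per_api_call  # assumption: flat estimate per call
        self.spent_usd = 0.0

    def evaluate(self, prompt):
        """Scoring and classification run locally at no API cost."""
        return self.local_call(prompt)

    def generate(self, prompt):
        """Generation uses the paid API, but only while under the ceiling."""
        if self.spent_usd + self.est_cost > self.ceiling_usd:
            raise BudgetExceeded(
                f"ceiling ${self.ceiling_usd:.2f} would be passed "
                f"(spent ${self.spent_usd:.2f})"
            )
        self.spent_usd += self.est_cost
        return self.api_call(prompt)


# Stand-in callables so the sketch runs without any server or API key.
router = HybridRouter(
    local_call=lambda p: "7",
    api_call=lambda p: f"variant for: {p}",
    ceiling_usd=0.05,
    est_cost_per_api_call=0.02,
)
print(router.evaluate("Score this headline: Buy Now"))
print(router.generate("Write a clearer headline"))
```

Checking the budget before the call, not after, is the design choice that matters here: an unattended loop stops at the ceiling without first making the call that passes it. A production version would track actual token usage instead of a flat estimate.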
Frequently asked questions
Which local model is best for marketing optimization?
For evaluation and scoring tasks: Llama 3.1 8B or Mistral 7B — both are fast on Apple Silicon and reliably accurate for structured scoring tasks. For generation tasks where you need nuanced copy: Llama 3.1 70B (if your machine can handle it) or stick with a cloud API. Phi-3 Mini is a good lightweight option for Intel Macs or when speed is critical and output quality requirements are lower.
Does Ollama work on Windows or Linux?
Yes. Ollama has native Windows and Linux installers available at ollama.com. On Linux with an NVIDIA GPU, performance is comparable to Apple Silicon. On Windows without a discrete GPU, inference is CPU-only and significantly slower. The install commands differ slightly by platform, but the model library and API are identical.
How many models can I run at once with Ollama?
Ollama can serve multiple models from disk, but inference runs one at a time on most consumer hardware. You can switch between models freely — Ollama loads the requested model into memory on demand and unloads it after a period of inactivity. On machines with 32 GB or more unified memory (M3 Pro/Max, M4 Pro/Max), you can keep multiple smaller models loaded simultaneously.
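If the unload-after-inactivity behaviour gets in the way of a long loop, Ollama's native API accepts a `keep_alive` field on requests that controls how long a model stays loaded. The sketch below only builds such a request; it assumes the `/api/generate` endpoint and the `keep_alive` field behave as described in Ollama's documentation for your installed version, and the helper name `build_preload_request` is hypothetical.

```python
import json
import urllib.request


def build_preload_request(model, keep_alive="30m", host="http://localhost:11434"):
    """Build (but do not send) a request asking Ollama to keep a model loaded.

    A request with a model name and no prompt is expected to load the model
    and return; keep_alive sets how long it then stays in memory.
    """
    body = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preload_request("llama3.1", keep_alive="30m")
print(request.full_url)
print(request.data.decode("utf-8"))
```

With Ollama running, passing `request` to `urllib.request.urlopen` would send it. Keeping several models resident only makes sense when their combined size fits in memory with room to spare.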