Can you run an LLM locally for free?

Yes, for inference. Ollama is free and open-source, and runs open models like Llama 3.1 and Mistral on your own Mac at $0 per call — no API key, no rate limits, works offline. The only costs are a one-time model download (a few gigabytes) and your machine's electricity. This applies to inference loops (prompt optimization, scoring, classification, copy A/B), not to training a model from scratch, which still requires GPU compute you cannot replace with a laptop.

What hardware do you need to run an LLM locally?

For an 8B model like Llama 3.1, a Mac with at least 8 GB of RAM (16 GB recommended) and about 5 GB of free disk space. Apple Silicon (M1–M4) runs inference far faster than an Intel Mac because the GPU and unified memory handle the matrix math efficiently. A 70B model needs roughly 40 GB of RAM and 43 GB of disk, so it is only practical on a 64 GB machine.

Does running an LLM locally work for model training?

No. Local hardware is for inference: running an already-trained model. Training a large model from scratch, or fully fine-tuning one, requires datacenter GPU compute (H100-class machines), and a MacBook or local box cannot replace that. The $0 local story in this guide is about inference loops (prompt optimization, SEO auditing, scoring, classification, and copy A/B), not training.

Is a local model good enough for marketing optimization loops?

For the evaluation half of the loop, yes. An 8B model like Llama 3.1 or Mistral 7B scores headlines, classifies page elements, and judges structured output reliably and at $0 per call. For generating nuanced, on-brand copy from scratch, a frontier API model (Claude Sonnet or GPT-4) is still stronger. The cost-effective pattern is local for high-volume scoring, API for the creative generation.

What is the cheapest way to run an AI optimization loop?

A local model via Ollama or Apple MLX. An unattended Opus loop on the API bills per token and can run roughly $15 per session. For non-training inference work, both the agent and the evaluator can run locally at $0 per token — no signup, no eligibility, no expiry.

How to Run an LLM Locally on Mac (Ollama + Llama 3, Free)

What is Ollama and why does it make local LLMs practical?

Running an LLM on your own computer used to require setting up complex Python environments, manually downloading multi-gigabyte model files, and wrestling with CUDA drivers. Ollama eliminated most of that. It's a free, open-source tool that bundles model management, a local inference server, and a simple CLI into a clean Mac app. You install it, pull a model, and start talking to it within minutes. (On Apple Silicon, Apple's MLX / mlx-lm is a second route: a Python package that runs and fine-tunes open models natively on the Mac GPU. Ollama is the simpler starting point for most marketers.)

The reason this matters for optimization loops is the cost structure. Every call to the Claude API or GPT-4 API costs money: anywhere from under a cent to a few cents per request, which compounds into real bills when you're running hundreds of experiments. Every call to a local Ollama model costs exactly nothing. No credit card, no billing dashboard, no monthly statement. For an autoresearch loop that needs to score and evaluate dozens of variants, the economics of local inference are hard to beat.

Ollama also exposes a local API that mimics the OpenAI interface (http://localhost:11434/v1/chat/completions), so most tools built for the OpenAI API can be pointed at Ollama by changing the base URL and the model name. Many clients also insist on an API key; Ollama ignores it, so any placeholder string works. This makes it a drop-in local option for many workflows that currently use a cloud API for everything.
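To make the endpoint swap concrete, here is a minimal sketch that builds the same chat request for either backend using only Python's standard library. The helper name build_chat_request, the placeholder key, and the cloud base URL are assumptions for illustration; only the local URL comes from this guide.

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-style endpoint
CLOUD_BASE_URL = "https://api.openai.com/v1"  # assumption: example cloud endpoint


def build_chat_request(base_url, model, prompt, api_key="ollama"):
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper. The default api_key is a placeholder: Ollama
    ignores it, while a cloud backend would need a real key.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same function, different base URL: that is the whole switch.
local_request = build_chat_request(LOCAL_BASE_URL, "llama3.1", "Say hello.")
print(local_request.full_url)
```

Sending the request is one more call, urllib.request.urlopen(local_request), and it only succeeds while Ollama is running with the model already pulled.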

How to install Ollama and run Llama 3 on Mac

Three commands. Install Ollama, pull a model, run it — under 10 minutes on a modern Mac. The install is genuinely simpler than most developer tools. You need a Mac with at least 8 GB of RAM (16 GB recommended), macOS 14 Sonoma or later, and about 5 GB of free disk space for the Llama 3.1 8B model. (Steps verified against ollama.com, June 16, 2026.)

Step 1 — Install Ollama

Open Terminal and run:

curl -fsSL https://ollama.com/install.sh | sh

Alternatively, download the Mac app directly from ollama.com. It installs as a menu bar app with a built-in CLI. Either path ends with Ollama running in the background and an ollama command available in your terminal.

Step 2 — Pull Llama 3.1 8B

This downloads the model file — about 4.9 GB, one time only:

ollama pull llama3.1

For a larger, more capable model — only practical on a high-memory Mac:

ollama pull llama3.1:70b-instruct-q4_0

Note: the 70B model is roughly 43 GB on disk and needs about 40 GB of RAM to run — so it's realistic only on a 64 GB machine. Don't expect a 16 GB Mac to run it; start with the 8B version and step up only if you have the memory headroom.
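Where do those sizes come from? A back-of-the-envelope sketch: the weight file is roughly parameter count times bits per weight, divided by eight. The 4.9 bits-per-weight figure below is an assumption picked for a typical 4-bit quantization, not a number published by Ollama, and the function name is made up for this example.

```python
def estimate_weights_gb(parameters, bits_per_weight=4.9):
    """Approximate size of quantized weights in decimal gigabytes.

    bits_per_weight=4.9 is an assumed effective figure for a 4-bit
    quantization; real files vary by quantization scheme.
    """
    return parameters * bits_per_weight / 8 / 1e9


for name, parameters in [("8B", 8e9), ("70B", 70e9)]:
    print(f"{name}: ~{estimate_weights_gb(parameters):.1f} GB of weights")
```

Under that assumption the 8B model comes out near 4.9 GB and the 70B model near 43 GB, in line with the download sizes quoted in this guide. Memory in use at run time also includes the context cache, so treat the result as a rough floor, not a precise requirement.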

Step 3 — Run it

Chat interactively:

ollama run llama3.1

Or call it programmatically (the server listens on localhost:11434 whenever Ollama is running; if it is not, start it with ollama serve):

curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1","messages":[{"role":"user","content":"Score this headline for clarity out of 10: Buy Now"}]}'

That curl call — pointing at your own machine, costing $0 — is the foundation of a locally-run evaluation loop.
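The same call can be wrapped as a small scoring function. This is a sketch, assuming Ollama is running with llama3.1 pulled; parse_score and score_headline are hypothetical names, and the prompt wording is just one option. A small model will not always reply with a bare number, so the parser takes the first number between 0 and 10 it finds and returns None otherwise.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def parse_score(reply):
    """Return the first number from 0 to 10 found in the reply, else None."""
    for match in re.finditer(r"\d+(?:\.\d+)?", reply):
        value = float(match.group())
        if 0 <= value <= 10:
            return value
    return None


def score_headline(headline, model="llama3.1"):
    """Ask the local model for a clarity score. Needs Ollama running."""
    prompt = (
        "Score this headline for clarity out of 10. "
        f"Reply with only the number: {headline}"
    )
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)["choices"][0]["message"]["content"]
    return parse_score(reply)
```

Calling score_headline("Buy Now") returns a number or None. Because the parser trusts the first in-range number, a reply like "Out of 10, I'd give it a 4" would be misread, which is why the prompt asks for the number alone.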

What to expect on first run: Apple Silicon vs Intel

On Apple Silicon, fast; on Intel, usable but slow. On Apple Silicon (M1, M2, M3, M4), Llama 3.1 8B typically generates text at roughly 40–80 tokens per second on a recent chip — fast enough to feel nearly instantaneous for typical marketing prompts. The integrated GPU (via Metal) and unified-memory architecture handle the matrix math efficiently, so inference doesn't require a separate discrete GPU. Throughput varies with chip generation and quantization, so treat these as ballpark figures.

On Intel Macs, the same model runs at 5–15 tokens per second — noticeably slower. Still usable for occasional requests, but uncomfortably slow for a loop that needs to score 20 variants. If you're on an Intel Mac and run into speed issues, the smaller Phi-3 Mini model (about 2 GB) runs faster and is still capable for scoring and classification tasks.
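To see what those throughput ranges mean for a loop, here is a rough timing sketch. The 150 tokens per reply is an assumption, and the estimate ignores prompt processing and model load time, so real runs will take somewhat longer.

```python
def loop_seconds(variants, tokens_per_reply, tokens_per_second):
    """Seconds spent generating replies, ignoring prompt processing."""
    return variants * tokens_per_reply / tokens_per_second


VARIANTS = 20
TOKENS_PER_REPLY = 150  # assumption: a short score plus a brief rationale

for label, speed in [
    ("Apple Silicon, 80 tok/s", 80),
    ("Apple Silicon, 40 tok/s", 40),
    ("Intel, 15 tok/s", 15),
    ("Intel, 5 tok/s", 5),
]:
    seconds = loop_seconds(VARIANTS, TOKENS_PER_REPLY, speed)
    print(f"{label}: {seconds:.1f} s for {VARIANTS} variants")
```

With these assumptions a 20-variant pass takes about 40 to 75 seconds on Apple Silicon and about 3 to 10 minutes on an Intel Mac.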

RAM matters as much as the chip. The 8B model uses approximately 6–8 GB of RAM during inference. If your Mac has 8 GB total, Ollama will compete with your other open applications. Close memory-heavy apps (Chrome tabs, Figma, Slack) when running intensive loops. With 16 GB or more, the model loads into memory and stays there — subsequent requests after the first are very fast.

Local vs API: the real cost comparison for optimization loops

Local inference is $0 per call; an API call costs from under a cent to a few cents, and that compounds. The math is simple but the implications are large at loop scale. An unattended Opus optimization loop on the API bills per token and can run on the order of $15 in a single session, while the same scoring work on a local model costs nothing. (See also our honest map of what's actually free for AI loops.)

Cost comparison: 100-experiment optimization loop
Setup	Per experiment	100 experiments	1,000 experiments
Ollama / Local (any Mac)	$0.00	$0.00	$0.00
Claude Haiku 4.5 via API	~$0.007	~$0.70	~$7.00
Claude Sonnet 4.6 via API	~$0.020	~$2.00	~$20.00
Claude Opus 4.8 via API	~$0.033	~$3.30	~$33.00

Estimates assume 2,500 input tokens + 800 output tokens per experiment, June 2026 API rates. Actual cost depends on your token volume. For technical architecture and cost modeling details, see the Autoresearch Playbook technical guide.
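The arithmetic behind the table can be sketched in a few lines. The per-million-token prices below are assumptions chosen because they reproduce the table's figures; they are not quoted from a price list, so check current pricing before relying on them.

```python
# Assumed prices in USD per million tokens: (input, output).
# Illustrative values only; verify against the provider's pricing page.
ASSUMED_RATES = {
    "Ollama / Local": (0.00, 0.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.8": (5.00, 25.00),
}

INPUT_TOKENS = 2_500  # per experiment, from the table footnote
OUTPUT_TOKENS = 800   # per experiment, from the table footnote


def cost_per_experiment(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD of one experiment at the given per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


for setup, (input_rate, output_rate) in ASSUMED_RATES.items():
    one = cost_per_experiment(INPUT_TOKENS, OUTPUT_TOKENS, input_rate, output_rate)
    print(f"{setup}: ${one:.4f} each, ${one * 1000:.2f} per 1,000")
```

The unrounded results sit slightly below the table in places because the table rounds the per-experiment figure before multiplying. Swap in your own token counts to see how quickly the API rows scale while the local row stays at zero.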

For a team running continuous loops, the savings are significant. But the right answer isn't always "local." Local models have real limits — output quality, context window size, and the fact that your laptop is doing real work during inference (fan noise, warmth, battery draw on unplugged laptops). The API costs money but delivers better quality and runs without touching your machine's resources.

The practical pattern that works well: use Ollama locally for evaluation and scoring tasks, such as checking a headline's clarity, scoring a subject line for specificity, or classifying a page element's function. Use the API for generation tasks where quality matters, meaning the actual variants you'll test. Llama 3.1 8B is reliable on structured judgment calls; it's weaker at producing nuanced, on-brand copy from scratch.

Does running locally work for model training?

No — local hardware is for inference, not training. Everything in this guide is about running an already-trained model: feeding it a prompt and reading the output. That is what costs $0 on your Mac. Training a large language model from scratch — or doing a full fine-tune of one — is a fundamentally different, far heavier workload that requires datacenter GPU compute (H100-class machines, often many of them, for days or weeks). A MacBook or local box cannot replace that, and no amount of Ollama setup changes it.

This distinction matters because of where the idea comes from. Andrej Karpathy's autoresearch project (88,000+ GitHub stars, March 2026) is a keep-the-winner loop — propose a change, run it, keep it if the metric improves, discard it if it doesn't, repeat — and in his original it edits a training script and trains for five minutes per step on an NVIDIA GPU. That version genuinely needs the GPU. The Autoresearch Playbook borrows the same pattern but points it at marketing work (prompt optimization, SEO auditing, debugging, copy A/B), where each step is a model call, not a gradient update. No model weights are touched, so the marketing loop runs entirely on local inference at $0 per call — and that is the version a MacBook can run.
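The keep-the-winner pattern described above fits in a few lines. This is a generic sketch of the idea, not Karpathy's code or the Playbook's: the function names are hypothetical, and the toy propose and score functions stand in for what would be model calls in a real marketing loop.

```python
def keep_the_winner(baseline, propose, score, steps):
    """Propose a change, keep it only if the metric improves, repeat."""
    best, best_score = baseline, score(baseline)
    for _ in range(steps):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep the winner
            best, best_score = candidate, candidate_score
        # otherwise the candidate is discarded and the loop continues
    return best, best_score


# Toy stand-ins: a real loop would call a model for both steps.
candidates = iter(["Buy Now", "Get 20% off today", "Sale"])


def toy_propose(current):
    return next(candidates)


def toy_score(headline):
    return len(headline.split())  # toy metric: word count


best, best_score = keep_the_winner("Shop", toy_propose, toy_score, steps=3)
print(best, best_score)
```

Each pass through the loop is one proposal and one evaluation. In the marketing version both are inference calls, which is why the whole loop can run on a laptop.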

A narrow middle ground exists: lightweight, parameter-efficient fine-tuning (LoRA/QLoRA on a small model) is feasible on a 32–64 GB Apple Silicon Mac via tools like mlx-lm. That is real, but it is small-scale adaptation, not training a frontier model — and it is not required for the optimization loops this guide is about. If a workflow genuinely needs to train or heavily fine-tune a large model, rent cloud GPUs; don't expect a laptop to do it.

Local vs API: honest tradeoff table

When to use local vs API in your loop
Dimension	Local (Ollama)	API (Claude / GPT)
Cost per call	$0	$0.007–$0.033+
Speed (Apple Silicon)	Fast (40–80 tok/s)	Fast (varies by load)
Output quality (generation)	Good for 8B; weaker for complex brand voice	Strong — Sonnet/Opus produce nuanced copy
Output quality (evaluation/scoring)	Strong — Llama 3 scores reliably	Strong — slight edge on subtle judgment
Privacy	Complete — data never leaves your machine	Sent to Anthropic/OpenAI servers
Internet required	No (after model download)	Yes
Context window	128K supported (Llama 3.1); Ollama loads a smaller default unless you raise num_ctx	200K (Claude) / 128K (GPT-4)
Setup overhead	One-time 10-minute install + download	API key only
Machine resource use	High during inference (GPU/CPU and RAM)	None (cloud-side)

The cleanest use of local models in an autoresearch loop: run your evaluation and scoring layer locally — it's the most call-intensive part of the loop and the part where output quality differences matter least. Keep your generation layer on the API when you need brand-accurate, high-quality copy variants. This hybrid approach gives you near-zero marginal cost on the volume operations while preserving quality on the creative output.

The Autoresearch Playbook covers exactly this architecture — including how to configure your loop to route evaluation calls to a local Ollama endpoint and generation calls to the API, with a hard cost ceiling so the API layer never runs unchecked.
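As a rough illustration of that split (not the Playbook's implementation), the sketch below routes evaluation to a local callable and generation to an API callable, and refuses API calls once a spending ceiling would be crossed. The class names are hypothetical, and the flat per-call cost is a simplifying assumption; a real loop would meter actual token usage.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next API call would cross the cost ceiling."""


class LoopRouter:
    """Send evaluation to a local model and generation to a paid API."""

    def __init__(self, call_local, call_api, cost_per_api_call, ceiling_usd):
        self.call_local = call_local
        self.call_api = call_api
        self.cost_per_api_call = cost_per_api_call  # assumed flat estimate
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def evaluate(self, prompt):
        # Local calls are free, so they are never budget-checked.
        return self.call_local(prompt)

    def generate(self, prompt):
        # Check the ceiling before spending, not after.
        if self.spent_usd + self.cost_per_api_call > self.ceiling_usd:
            raise BudgetExceeded(
                f"ceiling of ${self.ceiling_usd:.2f} would be exceeded"
            )
        self.spent_usd += self.cost_per_api_call
        return self.call_api(prompt)


router = LoopRouter(
    call_local=lambda prompt: "7",           # stand-in for a local Ollama call
    call_api=lambda prompt: "New headline",  # stand-in for a cloud API call
    cost_per_api_call=0.02,
    ceiling_usd=0.05,
)
print(router.evaluate("Score this headline: Buy Now"))
print(router.generate("Write a clearer headline"))
```

Checking the ceiling before each call means an unattended loop stops with an error instead of quietly running up a bill.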

Frequently asked questions

Which local model is best for marketing optimization?

For evaluation and scoring tasks: Llama 3.1 8B or Mistral 7B — both are fast on Apple Silicon and reliably accurate for structured scoring tasks. For generation tasks where you need nuanced copy: Llama 3.1 70B (if your machine can handle it) or stick with a cloud API. Phi-3 Mini is a good lightweight option for Intel Macs or when speed is critical and output quality requirements are lower.

Does Ollama work on Windows or Linux?

Yes. Ollama has native Windows and Linux installers available at ollama.com. On Linux with an NVIDIA GPU, performance is comparable to Apple Silicon or better, depending on the card. On Windows without a discrete GPU, inference is CPU-only and significantly slower. The install commands differ slightly by platform, but the model library and API are identical.

How many models can I run at once with Ollama?

Ollama can serve multiple models from disk, but inference runs one at a time on most consumer hardware. You can switch between models freely: Ollama loads the requested model into memory on demand and unloads it after a period of inactivity (five minutes by default). On machines with 32 GB or more unified memory (M3 Pro/Max, M4 Pro/Max), you can keep multiple smaller models loaded simultaneously.
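If the unload delay gets in the way of a loop, Ollama's native API accepts a keep_alive value on each request. The sketch below only builds a preload request; the helper name and the 30-minute value are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_preload_request(model, keep_alive="30m"):
    """Build (but do not send) a request that loads a model and keeps it warm.

    Hypothetical helper. A request with no prompt asks Ollama to load the
    model; keep_alive sets how long it stays in memory afterwards.
    """
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


preload = build_preload_request("llama3.1")
print(json.loads(preload.data))
```

Sending it with urllib.request.urlopen(preload) before a scoring run means the first real request does not pay the model load time.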

Run an LLM locally on your Mac — $0 per experiment, no rate limits.