2026-05-19
Local AI on Mac: A May 2026 Benchmark

Two weeks benchmarking AI runtimes and models across an Intel i9 and an M1 Max, to decide what Remember This v0.11 should ship by default. Below: the picks we landed on, the findings worth knowing if you run any local AI on a Mac, and the bench script so you can reproduce on your hardware.

What v0.11 ships

The point of the bench work is that you shouldn't have to read it to get a good install. v0.11 looks at your Mac on first launch — how much RAM, Apple Silicon or Intel — and picks the right models for each task. You can override anything in Settings, but the defaults aim to be the right answer out of the box.

Remember This runs three jobs against your archive, each picked separately because the trade-offs are different:

  • Photo Understanding — captions every photo and video so you can search by what's in them, not just when they were taken.
  • Audio Transcription — voice memos and video soundtracks become searchable text.
  • Dreamer Sessions — periodically, Claude Code reads through the day or week's captions, transcripts, and time logs and writes a synthesis: a rollup, an open question, a long-horizon nudge. Heaviest job; benefits most from a fast LLM.

Photo Understanding — defaults per Mac

RAM     | Apple Silicon             | Intel
≤ 8 GB  | moondream:1.8b via ollama | moondream:1.8b via ollama
16 GB   | gemma-4-e2b via SwiftLM   | moondream:1.8b via ollama
32 GB   | Qwen3-VL-4B via SwiftLM   | qwen3-vl:2b via ollama
48 GB+  | Qwen3-VL-8B via SwiftLM   | qwen3-vl:2b via ollama

Apple Silicon with 16 GB+ runs SwiftLM, a new Apple-MLX-based runtime bundled in v0.11. It's what runs the MLX-native VLM families (Qwen3-VL, Gemma-4-Vision, FastVLM) that don't have an ollama equivalent at our quality bar. The 48 GB+ row unlocks Qwen3-VL-8B, the side-by-side quality winner in our caption eval; the 32 GB row uses Qwen3-VL-4B, the best speed-and-quality tradeoff at that tier. Both are scored in detail in the VLM matrix further down. Intel Macs fall back to ollama (MLX is Apple-Silicon-only). For LLMs the runtime verdict goes the other way (see finding #3).
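
The per-Mac pick above is essentially a table lookup. A minimal Python sketch of that lookup, not the app's actual code: the names Pick, PHOTO_TIERS and pick_photo_model are made up for illustration, and sending in-between RAM sizes (for example 24 GB) to the lower tier is an assumption, since the table doesn't say.

```python
from typing import NamedTuple


class Pick(NamedTuple):
    model: str
    runtime: str


# (minimum RAM in GB, Apple Silicon pick, Intel pick), largest tier first.
# Model and runtime names are copied from the Photo Understanding table.
PHOTO_TIERS = [
    (48, Pick("Qwen3-VL-8B", "SwiftLM"), Pick("qwen3-vl:2b", "ollama")),
    (32, Pick("Qwen3-VL-4B", "SwiftLM"), Pick("qwen3-vl:2b", "ollama")),
    (16, Pick("gemma-4-e2b", "SwiftLM"), Pick("moondream:1.8b", "ollama")),
    (0, Pick("moondream:1.8b", "ollama"), Pick("moondream:1.8b", "ollama")),
]


def pick_photo_model(ram_gb: int, apple_silicon: bool) -> Pick:
    """Return the default captioning model for a machine.

    Assumption: RAM sizes between two tiers fall back to the lower tier.
    """
    for min_ram, silicon_pick, intel_pick in PHOTO_TIERS:
        if ram_gb >= min_ram:
            return silicon_pick if apple_silicon else intel_pick
    raise ValueError("ram_gb must be non-negative")


print(pick_photo_model(64, apple_silicon=True))
print(pick_photo_model(32, apple_silicon=False))
```

Keeping the tiers in one ordered list means a new RAM row is a one-line change.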

Audio Transcription — defaults per Mac

RAM           | Whisper model
≤ 8 GB        | medium
16 GB and up  | large-v3-turbo

Plus a backend pick per machine: Metal (Apple's GPU framework) on Apple Silicon, where it's 8–22× faster than CPU; CPU on Intel, where Metal on the older Radeon Pro discrete GPU is both slower than CPU and crashes on long audio — see finding #1.

Dreamer Sessions — defaults per Mac

Hosted Claude (claude.ai) is the default on every Mac, including the 64 GB+ tier — best long-context coherence, no first-pull download, single privacy stance. Flip the toggle in Settings to use a local LLM instead and v0.11 picks one sized for your tier:

RAM     | Apple Silicon (ollama) | Intel (ollama)
≤ 8 GB  | not recommended        | not recommended
16 GB   | gemma4:e2b             | gemma4:e2b
32 GB   | gpt-oss:20b            | gemma4:e2b
48 GB   | gemma4:26b             | gemma4:e2b
64 GB+  | qwen3.6:35b            | gemma4:e2b

qwen3.6:35b on 64 GB Apple Silicon runs at ~24 tok/s with a 32 K-token conversation behind it — the highest-quality model that's still fast enough to chat with at that point. gemma4:e2b also holds up at long conversation lengths but is a much smaller model. Intel users get gemma4:e2b: the only model we tested that stays usable on an Intel CPU at 32 K tokens of history. Users who prefer SwiftLM-local (Apple Silicon only) can flip the runtime in Settings; finding #3 below shows why ollama is the default. The SwiftLM model picks for the same tiers are Qwen3-8B at 32 GB and Qwen3.6-35B-A3B at 48 GB+.

Battery awareness

Each background job has its own rule for when to run — voice transcription runs anytime, photo captioning pauses on battery, Dreamer is manual. The bench work turned up two ways the GPU silently slows down on Apple Silicon: when Low Power Mode is on (5–6× slower), and when the battery drops below ~5 % (another 2–3× even with High Power Mode set). An overnight catch-up on a draining battery uses ~3× the energy for the same captions and runs ~10× slower in the final hour. The default avoids both. Each rule is overridable per job in Settings → Background, which also shows the current power state so it's obvious why any job is paused.
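
As a rough illustration of the default rules just described, here is a Python sketch. PowerState, the job names and both functions are hypothetical, not the app's real API. The only facts taken from the text are the three per-job defaults and the two throttling conditions.

```python
from dataclasses import dataclass


@dataclass
class PowerState:
    on_battery: bool
    battery_percent: int
    low_power_mode: bool


def gpu_is_throttled(state: PowerState) -> bool:
    """True when either GPU slowdown described above is likely in effect."""
    # Assumption: the ~5 % threshold only matters while running on battery.
    nearly_empty = state.on_battery and state.battery_percent < 5
    return state.low_power_mode or nearly_empty


def should_run(job: str, state: PowerState) -> bool:
    """Default background rule per job (job names are illustrative)."""
    if job == "transcription":
        return True                  # runs anytime
    if job == "captioning":
        return not state.on_battery  # pauses on battery
    if job == "dreamer":
        return False                 # manual only, never auto-started
    raise ValueError(f"unknown job: {job}")


evening = PowerState(on_battery=True, battery_percent=3, low_power_mode=False)
print(gpu_is_throttled(evening), should_run("captioning", evening))
```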


Five findings worth pulling out

The picks above are mostly a packaging job — pick a model per RAM tier per task, ship it. The more interesting part of two weeks of benching is below: observations about local AI on Mac in May 2026 that are useful whether or not you ever touch Remember This.

1. Transcription runs ~32× faster on Apple Silicon than Intel

Same audio clip (60 s of LibriVox Sherlock Holmes, public domain), same large-v3-turbo model, same bench script. A 10-minute meeting transcribes in ~18 seconds on M1 Max with Metal, and in ~9 minutes on an Intel i9-8950HK with the CPU path. On Intel, the GPU path (Metal on the Radeon Pro 560X) is actually 2× slower than the CPU path and crashes reliably on any audio longer than ~3 minutes. So Intel picks CPU automatically; there's no Metal/CPU toggle in Settings on Intel, because the one right answer is already in place.

Hardware                                   | Best backend           | @ 60 s clip       | @ 10 min clip
Intel i9-8950HK / 32 GB / Radeon Pro 560X  | CPU, 4 threads         | 0.80 s / s audio  | 0.96 s / s audio
M1 Max / 64 GB                             | Metal (any n_threads)  | 0.02 s / s audio  | 0.03 s / s audio
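
The "s / s audio" figures are processing time per second of audio, so wall-clock time is clip length times that factor. A short Python sketch of the arithmetic behind the 10-minute example; the function name and the rtf parameter (short for real-time factor) are ours, and the two factors are the 10-minute values from the table.

```python
def transcription_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time = audio length x seconds of work per second of audio."""
    return audio_seconds * rtf


TEN_MINUTES = 10 * 60

intel_cpu = transcription_seconds(TEN_MINUTES, 0.96)  # Intel i9, CPU path
m1_metal = transcription_seconds(TEN_MINUTES, 0.03)   # M1 Max, Metal

print(f"Intel CPU:    {intel_cpu / 60:.1f} min")
print(f"M1 Max Metal: {m1_metal:.0f} s")
print(f"Speedup:      {intel_cpu / m1_metal:.0f}x")
```

That works out to roughly 9.6 minutes against 18 seconds, which is where the ~32× in the heading comes from.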

The "Intel Metal crashes" bug specifically: sona aborts with GGML_ASSERT(buf_src) failed in ggml-metal-device.m:1561, deterministic across retries. The same audio on Apple Silicon Metal runs cleanly. Reproducer + cross-platform comparison: gotchas/polaris-whisper-crash-long-audio.md.

2. On SwiftLM, LLMs slow down a lot when conversations get long. On ollama, they barely slow down.

We started out assuming all LLMs get slower the longer the conversation gets — the bigger the chat history, the more work per word. The first sweep on M1 Max was all SwiftLM, and looked like the textbook curve: Llama-3.2-3B dropped from 54 to 24 tok/s as the prompt grew from 4 K to 16 K tokens. Every other SwiftLM row in the matrix follows the same shape.

Then we ran the same models on ollama. They stay nearly the same speed across the whole range:

M1 Max, model                    | 4 K → 16 K decay | 4 K → 64 K decay
SwiftLM, Llama-3.2-3B            | 2.25×            | -
SwiftLM, gemma-4-e2b (LLM-mode)  | 4.0×             | 29×
SwiftLM, gemma-4-26b-a4b         | 2.8×             | 13×
ollama, gemma4:e2b               | 1.06×            | 1.51×
ollama, gemma4:26b               | 1.08×            | 1.53×
ollama, qwen3.6:35b              | 1.06×            | 1.29×

Same hardware, same Apple GPU underneath, but at 64 K tokens of context SwiftLM has slowed down 13–29× while ollama has slowed down 1.3–1.5×. So the slowdown comes from how the runtime is built, not from the hardware. The mechanism, roughly: SwiftLM sends many tiny instructions to the GPU per word it produces, and the cost of that overhead grows with the conversation length. ollama (via llama.cpp) batches more aggressively, so the same overhead barely shows up.
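
The decay figures are just ratios of decode rates. A small Python sketch that recomputes them from the tok/s values in the full context-sweep table further down; the SWEEP dict and decay function are illustrative names, not part of the bench harness.

```python
# tok/s at 4 K, 16 K and 64 K of context, copied from the context-sweep table.
SWEEP = {
    "SwiftLM, gemma-4-e2b": {4: 32.0, 16: 8.0, 64: 1.1},
    "SwiftLM, gemma-4-26b-a4b": {4: 15.2, 16: 5.4, 64: 1.2},
    "ollama, gemma4:e2b": {4: 74.6, 16: 70.1, 64: 49.5},
    "ollama, gemma4:26b": {4: 49.2, 16: 45.5, 64: 32.1},
    "ollama, qwen3.6:35b": {4: 27.2, 16: 25.6, 64: 21.1},
}


def decay(rates: dict, start_k: int, end_k: int) -> float:
    """How many times slower decoding is at end_k than at start_k."""
    return rates[start_k] / rates[end_k]


for name, rates in SWEEP.items():
    short = decay(rates, 4, 16)
    long = decay(rates, 4, 64)
    print(f"{name}: 4K->16K {short:.2f}x, 4K->64K {long:.2f}x")
```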

What this means in practice: SwiftLM is fine for short chats (~8 K tokens or less, roughly a few thousand words of history); past 16 K its speed drops sharply. For a Dreamer session — which builds up context as it reads photos, voice memos, and time logs — ollama is the right pick on M1 Max.

3. On M1 Max, ollama beats SwiftLM on LLMs — and by more when conversations get long

We started out saying "ollama on Intel, SwiftLM on Apple Silicon, because SwiftLM is 10–15× faster than ollama on the big efficient (MoE) models we care about." That was based on a single short-conversation test of Qwen3-Coder-30B. When we later ran a full head-to-head across five model classes and a range of conversation lengths (4 K to 64 K tokens), the verdict flipped:

Model on M1 Max (tok/s)     | SwiftLM @ 4K | ollama @ 4K | SwiftLM @ 64K | ollama @ 64K | Winner at 64K
gemma-4-e2b                 | 32           | 74.6        | 1.1           | 49.5         | ollama 45×
gemma-4-26b-a4b             | 15.2         | 49.2        | 1.2           | 32.1         | ollama 27×
gpt-oss-20b                 | 25.5         | 58.6        | 1.7           | 36.7         | ollama 22×
Qwen3.6-35B-A3B (thinking)  | 42           | 27.2        | 6.0           | 21.1         | ollama 3.5×
Llama-3.2-3B (small dense)  | 55           | 93          | -             | -            | ollama 1.7× at 4K

ollama wins on every model we tested once the conversation reaches 16 K tokens, and on four of the five it already leads at 4 K. The one exception is Qwen3.6-35B-A3B at short prompts (4 K, 8 K), where SwiftLM is still 1.3–1.5× faster. From 16 K on, ollama wins everywhere and the gap widens: for the two Gemma models and gpt-oss-20b, 2–3× at 4 K becomes 22–45× at 64 K, while on Qwen3.6-35B-A3B ollama ends up 3.5× ahead. The biggest gap is gemma-4-e2b at 64 K, where ollama holds 49.5 tok/s while SwiftLM drops to 1.1. The reason is in finding #2 above: it's how the two runtimes are built, not the models or the hardware.
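
To make the Qwen3.6-35B-A3B crossover concrete, here is a Python sketch that finds the first measured context size at which ollama out-decodes SwiftLM. The tok/s values are the two Qwen3.6 rows from the context-sweep table further down; crossover_k is an illustrative helper, not part of the bench harness.

```python
CONTEXT_K = [4, 8, 16, 32, 48, 64]

# tok/s per context size for Qwen3.6-35B-A3B on the M1 Max.
SWIFTLM = [42, 33, 23, 13, 8.3, 6.0]
OLLAMA = [27.2, 26.2, 25.6, 23.7, 22.3, 21.1]


def crossover_k(ctx, a, b):
    """Smallest measured context size (K tokens) at which b out-decodes a."""
    for k, rate_a, rate_b in zip(ctx, a, b):
        if rate_b > rate_a:
            return k
    return None


print(crossover_k(CONTEXT_K, SWIFTLM, OLLAMA))
```

This only looks at the measured sizes, so the real crossover sits somewhere between 8 K and 16 K.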

What this means for the v0.11 picks: the Dreamer-tier Apple Silicon defaults are now ollama, not SwiftLM (Hosted Claude remains the top-level default). SwiftLM stays as the VLM runtime — the MLX-only VLMs we ship have no ollama equivalent at our quality bar. SwiftLM-local Dreamer is still selectable in Settings for users who specifically want a single-stack setup.

One thing we didn't finish: the SwiftLM side of the Qwen3-Coder-30B-A3B test (the battery died mid-run). The ollama side ran cleanly (54.5 tok/s at 4 K, 24.1 at 64 K). That's the one model where SwiftLM might still be faster, since it's the one we used for the original 10–15× claim. Re-bench tracked in the REBENCH-QUEUE.

4. macOS quietly slows the GPU when battery drops below ~5 %

Same model, same M1 Max, same High Power Mode setting in both runs. Morning, full battery: Llama-3.2-3B at 4 K ran at 54 tok/s on SwiftLM. Evening, 2 % battery: 18.8 tok/s on the same hardware. 2.9× slower from battery alone. It gets worse at longer conversations — at 32 K, the speed dropped from 8 to 0.4 tok/s. This is on top of the better-known slowdown when macOS Low Power Mode kicks in (another 5–6×). All the "clean" Apple Silicon numbers in this post assume the laptop is plugged in or above ~5 % battery. If you bench on battery, watch for it.

5. Photo captioning runs ~15–24× faster on Apple Silicon than Intel

Same model (Qwen3-VL-2B weights): 53 s per photo on Intel via ollama, 3.6 s per photo on M1 Max via SwiftLM. For the typical photo (less text in the image to read), M1 Max is closer to 24× faster. Almost all of that comes from Apple Silicon's shared memory + MLX framework + the GPU running the model directly. The gap is big enough that "do you have Apple Silicon" is the most important question for anyone deciding whether local AI is worth running on their photo library.
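
To see what that gap means for a whole library, a back-of-the-envelope Python sketch. The 10,000-photo library is a made-up example size, not a figure from the bench. The per-photo times are the two quoted above.

```python
def library_hours(photo_count: int, seconds_per_photo: float) -> float:
    """Hours of continuous captioning for a library of photo_count photos."""
    return photo_count * seconds_per_photo / 3600


PHOTOS = 10_000  # hypothetical library size

intel = library_hours(PHOTOS, 53)   # ollama on the Intel i9
m1 = library_hours(PHOTOS, 3.6)     # SwiftLM on the M1 Max

print(f"Intel: {intel:.0f} h, M1 Max: {m1:.0f} h, ratio {intel / m1:.1f}x")
```

Under those assumptions the Intel run is about six days of continuous captioning, against roughly ten hours on the M1 Max.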

The full matrices

All the numbers behind the picks at the top of the post. Detail-heavy — the findings above are the takeaways; these tables are for "OK but what about model X."

All VLMs benched (sorted by speed)

Caption-quality scores: OCR / Setting / Multilingual / Anti-Hallucination, each 1–5 (5 = consistently strong, 1 = unusable). Scored side-by-side on a 25-photo eval set (5 OCR-heavy, 5 GPS-tagged, 8 group/family, 7 single/scene). Full rubric + per-photo examples at captioning-bench/QUALITY-SCORES.md.

Hardware         | Model                                        | s/photo | Caption quality (OCR / Setting / Multilingual / Anti-Hallucination)
M1 Max / 64 GB   | FastVLM-0.5B-bf16                            | 1.10    | 2/3/1/3: short and mostly safe; no multilingual
M1 Max / 64 GB   | mlx-community/Qwen2.5-VL-3B-Instruct-4bit    | 1.51    | 2/2/2/1: hallucinates confidently
M1 Max / 64 GB   | mlx-community/gemma-4-e2b-it-4bit            | 1.59    | 2/2/3/3: hedges with "this appears to be"; partial Swedish OCR
M1 Max / 64 GB   | mlx-community/gemma-4-26b-a4b-it-4bit        | 2.56    | 3/3/4/4: cleanest prose, lowest hallucination; drops people counts
M1 Max / 64 GB   | mlx-community/Qwen2.5-VL-7B-Instruct-4bit    | 2.72    | 3/2/2/2: invents florid scenarios
M1 Max / 64 GB   | mlx-community/Qwen3-VL-4B-Instruct-4bit      | 3.11    | 3/4/3/3: strong narrative + scene-grasp; English-only output
M1 Max / 64 GB   | mlx-community/gemma-4-e4b-it-4bit            | 3.26    | 1/1/2/1: refuses to caption 17/25, lifestyle-template loops
M1 Max / 64 GB   | mlx-community/Qwen3-VL-2B-Instruct-4bit      | 3.62    | 1/2/2/1: unusable, token-repetition loops on 3/25
M1 Max / 64 GB   | lmstudio-community/Qwen3-VL-8B-MLX-4bit      | 4.05    | 5/4/5/4: quality winner. Multilingual OCR (Swedish, Finnish) + writes captions in source language
Intel i9 / 32 GB | moondream:1.8b                               | 25.9    | fuzzy gist (no OCR, miscounts people)
Intel i9 / 32 GB | qwen2.5vl:3b                                 | 52.3    | solid OCR + people + dates
Intel i9 / 32 GB | qwen3-vl:2b                                  | 53.2    | solid OCR + people + dates
Intel i9 / 32 GB | gemma4:e2b                                   | 63.8    | OCR ok; weak setting context
Intel i9 / 32 GB | minicpm-v:8b                                 | 71.9    | broken grammar, no detail
Intel i9 / 32 GB | qwen3-vl:4b                                  | 73.5    | best quality (OCR + translations)
Intel i9 / 32 GB | llava:7b                                     | 74.8    | generic, hallucinates details
Intel i9 / 32 GB | gemma4:e4b                                   | 108.1   | solid OCR, marginal quality gain over e2b
Intel i9 / 32 GB | gemma3:4b                                    | 266.6   | hallucinates entire scenes

LLM context sweep (tok/s by ctx size)

Two rows still marked with a † were benched while macOS Low Power Mode was on: their numbers are 2–6× lower than the same hardware would produce on a clean run, but their ordering relative to each other still tells you which model is faster. The other previously tainted rows have been re-run cleanly and the † removed. The ollama M1 Max rows below are from the 2026-05-19 head-to-head sweep that powers finding #3. Re-bench items are tracked in the REBENCH-QUEUE.

Hardware          | Model                                 | @ 4 K | @ 8 K | @ 16 K | @ 32 K | @ 48 K | @ 64 K
Intel i9 / 32 GB  | ollama, gemma4:e2b                    | 11.8  | 10.4  | 8.5    | 6.0    | -      | -
Intel i9 / 32 GB  | ollama, llama3.2:1b                   | 12.1  | 6.9   | 4.2    | -      | -      | -
Intel i9 / 32 GB  | ollama, gemma3:4b                     | 7.4   | 6.9   | 5.9    | -      | -      | -
M1 Max / 64 GB    | SwiftLM, Llama-3.2-3B                 | 54    | 44    | 24     | 8      | -      | -
M1 Max / 64 GB    | ollama, llama3.2:3b (runtime A/B)     | 93    | 82    | 70     | 51     | -      | -
M1 Max / 64 GB    | SwiftLM, Phi-4-mini                   | 44    | 38    | 21     | 1.9    | -      | -
M1 Max / 64 GB    | SwiftLM --thinking, Qwen3-4B          | 57    | 47    | 29     | err    | -      | -
M1 Max / 64 GB    | SwiftLM --thinking, Qwen3-8B          | 37    | 30    | 22     | err    | -      | -
M1 Max / 64 GB    | SwiftLM --thinking, Qwen3.6-35B-A3B   | 42    | 33    | 23     | 13     | 8.3    | 6.0
M1 Max / 64 GB    | ollama, qwen3.6:35b                   | 27.2  | 26.2  | 25.6   | 23.7   | 22.3   | 21.1
M1 Max / 64 GB    | SwiftLM, gemma-4-e2b (LLM-mode)       | 32    | 17    | 8      | 2.8    | 1.6    | 1.1
M1 Max / 64 GB    | ollama, gemma4:e2b                    | 74.6  | 73.2  | 70.1   | 61.8   | 56.9   | 49.5
M1 Max / 64 GB †  | SwiftLM, Qwen3-Coder-30B-A3B          | 14.6  | 10.7  | 4.9    | 1.8    | 1.0    | 0.9
M1 Max / 64 GB    | ollama, qwen3-coder:30b               | 54.5  | 47.5  | 38.5   | 27.2   | 19.7   | 24.1
M1 Max / 64 GB    | SwiftLM, gemma-4-26b-a4b              | 15.2  | 10.5  | 5.4    | 2.1    | 1.6    | 1.2
M1 Max / 64 GB    | ollama, gemma4:26b                    | 49.2  | 48.5  | 45.5   | 41.0   | 35.8   | 32.1
M1 Max / 64 GB    | SwiftLM, gpt-oss-20b (MXFP4-Q4)       | 25.5  | 19.1  | 10.7   | 4.8    | 2.6    | 1.7
M1 Max / 64 GB    | ollama, gpt-oss:20b                   | 58.6  | 55.2  | 53.2   | 47.3   | 41.1   | 36.7
M1 Max / 64 GB †  | SwiftLM --thinking, Qwen3-Next-80B-A3B | 7.9   | 5.6   | 2.7    | 1.1    | -      | -

Coherence note: against a long (80 K) Dreamer-style prompt, Qwen3.6-35B-A3B was the only model that stayed on-task all the way through. Gemma-4-26B-A4B on the same input fell into repetition collapse (6,000 chars of "own own own ... way way way"). Decode rate doesn't capture this: gemma-4-26b shows usable tok/s in the table at every context size, but the output stops being useful past a certain prompt length on this specific input. That's why the 64 GB+ Dreamer default is qwen3.6:35b, not the faster gemma4:26b a row above it.
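
Since tok/s can't see this failure, a bench needs a separate check on the output text. The Python sketch below is one crude heuristic, offered as an illustration rather than what the bench actually used. It flags long runs of a single repeated word. The function names and the threshold of 20 are arbitrary choices.

```python
def longest_repeat_run(text: str) -> int:
    """Length of the longest run of one word repeated back to back."""
    longest = current = 0
    previous = None
    for word in text.split():
        current = current + 1 if word == previous else 1
        longest = max(longest, current)
        previous = word
    return longest


def looks_collapsed(text: str, max_run: int = 20) -> bool:
    """Flag output whose longest same-word run exceeds max_run.

    max_run=20 is an illustrative threshold, not a tuned value.
    """
    return longest_repeat_run(text) > max_run


print(looks_collapsed("own " * 50 + "way " * 50))
print(looks_collapsed("A quiet week: two hikes and a birthday dinner."))
```

It only catches single-word loops like the "own own own" case; repeated phrases or templates would need an n-gram version of the same idea.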

Reproducing this on your hardware

We benched exactly one Intel i9-8950HK / 32 GB and one M1 Max / 64 GB. If you have something else interesting — a higher-core Intel, 16 / 96 / 128 GB Apple Silicon, an M2/M3/M4 of any tier — please run the bench scripts and contribute results.

Full bench harness, raw numbers, gotcha writeups, and PII-clean aggregate matrices live at github.com/rememberthis-ai/local-ai-benchmarks. The transcription bench specifically (with the canonical LibriVox clip via git-LFS) is at transcription-bench/. Aim for the model list we used plus extras you care about, then open a PR with your results.

Your results may vary — especially with power mode. All Apple Silicon numbers above are with the Mac plugged in and Low Power Mode off. macOS Low Power Mode throttles the M1 Max GPU 5–6×, and it auto-enables on battery below ~20 %. Check with pmset -g | grep lowpowermode before benching; expect 2–6× slower numbers if it's on. The gotchas/ dir in the bench repo covers a half-dozen other silent ways local-AI numbers can be wrong — ollama 4 K prompt truncation, sona's --gpu-device -2 not actually disabling Metal, Gemma-4 repetition collapse at long context, etc.
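
If you script your runs, the pmset check can be automated. A minimal Python sketch, assuming `pmset -g` prints the setting as a "lowpowermode" key followed by 0 or 1 (the same line the grep above matches); both function names are ours.

```python
import subprocess
from typing import Optional


def parse_low_power_mode(pmset_output: str) -> Optional[bool]:
    """Return True/False for the lowpowermode setting, or None if absent.

    Assumes lines of the form "<key> <value>", e.g. " lowpowermode  1".
    """
    for line in pmset_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "lowpowermode":
            return parts[1] == "1"
    return None


def low_power_mode_enabled() -> Optional[bool]:
    """Run `pmset -g` (macOS only) and report Low Power Mode."""
    result = subprocess.run(
        ["pmset", "-g"], capture_output=True, text=True, check=True
    )
    return parse_low_power_mode(result.stdout)
```

Calling low_power_mode_enabled() before each run, and refusing to record numbers when it returns True, would have saved us the † rows. A None result means the key was not found, which is worth treating as "unknown" rather than "off".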

v0.10.x continues to ship ollama + MiniCPM-V 4.5 as the captioner; the picks above land in v0.11.

