We spent two weeks benchmarking AI runtimes and models across an Intel i9 and an M1 Max to decide what Remember This v0.11 should ship by default. Below: the picks we landed on, the findings worth knowing if you run any local AI on a Mac, and the bench script so you can reproduce on your hardware.
What v0.11 ships
The point of the bench work is that you shouldn't have to read it to get a good install. v0.11 looks at your Mac on first launch — how much RAM, Apple Silicon or Intel — and picks the right models for each task. You can override anything in Settings, but the defaults aim to be the right answer out of the box.
Remember This runs three jobs against your archive, each picked separately because the trade-offs are different:
- Photo Understanding — captions every photo and video so you can search by what's in them, not just when they were taken.
- Audio Transcription — voice memos and video soundtracks become searchable text.
- Dreamer Sessions — periodically, Claude Code reads through the day or week's captions, transcripts, and time logs and writes a synthesis: a rollup, an open question, a long-horizon nudge. Heaviest job; benefits most from a fast LLM.
Photo Understanding — defaults per Mac
| RAM | Apple Silicon | Intel |
|---|---|---|
| ≤ 8 GB | moondream:1.8b via ollama | moondream:1.8b via ollama |
| 16 GB | gemma-4-e2b via SwiftLM | moondream:1.8b via ollama |
| 32 GB | Qwen3-VL-4B via SwiftLM | qwen3-vl:2b via ollama |
| 48 GB+ | Qwen3-VL-8B via SwiftLM ★ | qwen3-vl:2b via ollama |
Apple Silicon with 16 GB+ runs SwiftLM, a new Apple-MLX-based runtime bundled in v0.11. It's what runs the MLX-native VLM families (Qwen3-VL, Gemma-4-Vision, FastVLM) that don't have an ollama equivalent at our quality bar. The 48 GB+ row unlocks Qwen3-VL-8B, the side-by-side quality winner in our caption eval; the 32 GB row uses Qwen3-VL-4B, the best speed-and-quality tradeoff at that tier. Both are scored in detail in the VLM matrix further down. Intel Macs fall back to ollama (MLX is Apple-Silicon-only). For LLMs the runtime verdict goes the other way (see finding #3).
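The table above amounts to a small lookup. Here is a minimal sketch of it in Python, not the shipped code: the function name `pick_captioner` is hypothetical, and rounding in-between RAM sizes (say 24 GB) down to the nearest listed tier is an assumption the table doesn't spell out.

```python
from typing import Tuple

# (minimum RAM in GB, model, runtime), highest tier first, copied from the table.
APPLE_SILICON_TIERS = [
    (48, "Qwen3-VL-8B", "SwiftLM"),
    (32, "Qwen3-VL-4B", "SwiftLM"),
    (16, "gemma-4-e2b", "SwiftLM"),
    (0, "moondream:1.8b", "ollama"),
]
INTEL_TIERS = [
    (32, "qwen3-vl:2b", "ollama"),
    (0, "moondream:1.8b", "ollama"),
]


def pick_captioner(ram_gb: int, apple_silicon: bool) -> Tuple[str, str]:
    """Return (model, runtime) for the first tier whose RAM floor is met.

    Assumption: RAM sizes between tiers fall back to the tier below.
    """
    tiers = APPLE_SILICON_TIERS if apple_silicon else INTEL_TIERS
    for min_ram, model, runtime in tiers:
        if ram_gb >= min_ram:
            return model, runtime
    raise ValueError("ram_gb must be non-negative")
```

For example, `pick_captioner(64, True)` lands on the Qwen3-VL-8B row, while any Intel Mac with less than 32 GB gets moondream:1.8b.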
Audio Transcription — defaults per Mac
| RAM | Whisper model |
|---|---|
| ≤ 8 GB | medium |
| 16 GB and up | large-v3-turbo ★ |
Plus a backend pick per machine: Metal (Apple's GPU framework) on Apple Silicon, where it's 8–22× faster than CPU; CPU on Intel, where Metal on the older Radeon Pro discrete GPU is both slower than CPU and crashes on long audio — see finding #1.
Dreamer Sessions — defaults per Mac
Hosted Claude (claude.ai) is the default on every Mac, including the 64 GB+ tier — best long-context coherence, no first-pull download, single privacy stance. Flip the toggle in Settings to use a local LLM instead and v0.11 picks one sized for your tier:
| RAM | Apple Silicon (ollama) | Intel (ollama) |
|---|---|---|
| ≤ 8 GB | not recommended | not recommended |
| 16 GB | gemma4:e2b | gemma4:e2b |
| 32 GB | gpt-oss:20b | gemma4:e2b |
| 48 GB | gemma4:26b | gemma4:e2b |
| 64 GB+ | qwen3.6:35b ★ | gemma4:e2b |
qwen3.6:35b on 64 GB Apple Silicon runs at ~24 tok/s with a 32 K-token
conversation behind it — the highest-quality model that's still fast enough to chat
with at that point. gemma4:e2b also holds up at long conversation lengths but
is a much smaller model. Intel users get gemma4:e2b: the only model we tested
that stays usable on an Intel CPU at 32 K tokens of history. Users who prefer
SwiftLM-local (Apple Silicon only) can flip the runtime in Settings; finding #3 below shows
why ollama is the default. The SwiftLM model picks for the same tiers are
Qwen3-8B at 32 GB and Qwen3.6-35B-A3B at 48 GB+.
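As a rough illustration of how the hosted default and the local toggle fit together, here is a sketch in Python. The names `pick_dreamer` and `use_local` are hypothetical, and falling back to the tier below for in-between RAM sizes is an assumption. The model tags come from the table above.

```python
from typing import Optional

HOSTED_DEFAULT = "hosted Claude"

# (minimum RAM in GB, local ollama model), highest tier first, from the table.
APPLE_SILICON_LOCAL = [
    (64, "qwen3.6:35b"),
    (48, "gemma4:26b"),
    (32, "gpt-oss:20b"),
    (16, "gemma4:e2b"),
]
INTEL_LOCAL = [(16, "gemma4:e2b")]


def pick_dreamer(ram_gb: int, apple_silicon: bool, use_local: bool = False) -> Optional[str]:
    """Return the Dreamer backend, or None when local is not recommended."""
    if not use_local:
        return HOSTED_DEFAULT  # default on every Mac, whatever the RAM
    tiers = APPLE_SILICON_LOCAL if apple_silicon else INTEL_LOCAL
    for min_ram, model in tiers:
        if ram_gb >= min_ram:
            return model
    return None  # 8 GB and below: no local pick
```

Returning `None` for the 8 GB tier leaves it to the caller to explain why the toggle has no local model to offer.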
Battery awareness
Each background job has its own rule for when to run — voice transcription runs anytime, photo captioning pauses on battery, Dreamer is manual. The bench work turned up two ways the GPU silently slows down on Apple Silicon: when Low Power Mode is on (5–6× slower), and when the battery drops below ~5 % (another 2–3× even with High Power Mode set). An overnight catch-up on a draining battery uses ~3× the energy for the same captions and runs ~10× slower in the final hour. The default avoids both. Each rule is overridable per job in Settings → Background, which also shows the current power state so it's obvious why any job is paused.
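The per-job rules described above could be sketched like this. It is an illustration only: the job names, the `PowerState` fields, and both function names are hypothetical, and the real scheduler may weigh more signals than these three.

```python
from dataclasses import dataclass


@dataclass
class PowerState:
    on_battery: bool
    low_power_mode: bool
    battery_pct: int  # 0-100


def gpu_likely_throttled(power: PowerState) -> bool:
    """True in either of the two silent slow states: Low Power Mode, or battery under ~5 %."""
    return power.low_power_mode or (power.on_battery and power.battery_pct < 5)


def should_run(job: str, power: PowerState, user_requested: bool = False) -> bool:
    """Default rule per background job, before any override in Settings."""
    if job == "transcription":
        return True  # runs anytime
    if job == "captioning":
        return not power.on_battery  # pauses on battery
    if job == "dreamer":
        return user_requested  # manual only
    raise ValueError(f"unknown job: {job}")
```

A Settings pane could call `gpu_likely_throttled` to explain a slow or paused job alongside the current power state.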
Five findings worth pulling out
The picks above are mostly a packaging job — pick a model per RAM tier per task, ship it. The more interesting part of two weeks of benching is below: observations about local AI on Mac in May 2026 that are useful whether or not you ever touch Remember This.
1. Transcription runs ~32× faster on Apple Silicon than Intel
Same audio clip (60 s of LibriVox Sherlock Holmes, public domain), same large-v3-turbo model, same bench script. A 10-minute meeting transcribes in ~18 seconds on M1 Max with Metal, and in ~9.6 minutes on an Intel i9-8950HK with the CPU path. On Intel, the GPU path (Metal on the Radeon Pro 560X) is 2× slower than the CPU path and crashes reliably on any audio longer than ~3 minutes. So Intel picks CPU automatically; there's no Metal/CPU toggle in Settings on Intel, because the one right answer is already in place.
| Hardware | Best backend | 60 s clip (s per s of audio) | 10 min clip (s per s of audio) |
|---|---|---|---|
| Intel i9-8950HK / 32 GB / Radeon Pro 560X | CPU, 4 threads | 0.80 | 0.96 |
| M1 Max / 64 GB | Metal (any n_threads) | 0.02 | 0.03 |
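The table's figures are seconds of processing per second of audio, so wall-clock time is just that factor times the clip length. A small worked check in Python, using the 10-minute column (the function name is illustrative):

```python
def transcribe_seconds(seconds_per_audio_second: float, audio_seconds: float) -> float:
    """Estimated wall-clock time: processing cost per second of audio times clip length."""
    return seconds_per_audio_second * audio_seconds


TEN_MINUTES = 600  # seconds of audio

intel_cpu = transcribe_seconds(0.96, TEN_MINUTES)  # about 576 s, roughly 9.6 minutes
m1_metal = transcribe_seconds(0.03, TEN_MINUTES)   # about 18 s
speedup = intel_cpu / m1_metal                     # about 32x

print(f"Intel CPU: {intel_cpu:.0f} s, M1 Max Metal: {m1_metal:.0f} s, speedup: {speedup:.0f}x")
```

The ~32× in the heading is this 10-minute ratio; the 60 s column works out to 0.80 / 0.02 = 40×.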
The "Intel Metal crashes" bug specifically: sona aborts with
GGML_ASSERT(buf_src) failed in ggml-metal-device.m:1561,
deterministic across retries. The same audio on Apple Silicon Metal runs cleanly. Reproducer +
cross-platform comparison:
gotchas/polaris-whisper-crash-long-audio.md.
2. On SwiftLM, LLMs slow down a lot when conversations get long. On ollama, they barely slow down.
We started out assuming all LLMs get slower the longer the conversation gets — the bigger the chat history, the more work per word. The first sweep on M1 Max was all SwiftLM, and looked like the textbook curve: Llama-3.2-3B dropped from 54 to 24 tok/s as the prompt grew from 4 K to 16 K tokens. Every other SwiftLM row in the matrix follows the same shape.
Then we ran the same models on ollama. They stay nearly the same speed across the whole range:
| M1 Max, model | 4 K → 16 K decay | 4 K → 64 K decay |
|---|---|---|
| SwiftLM, Llama-3.2-3B | 2.25× | — |
| SwiftLM, gemma-4-e2b (LLM-mode) | 4.0× | 29× |
| SwiftLM, gemma-4-26b-a4b | 2.8× | 13× |
| ollama, gemma4:e2b | 1.06× | 1.51× |
| ollama, gemma4:26b | 1.08× | 1.53× |
| ollama, qwen3.6:35b | 1.06× | 1.29× |
Same hardware, same Apple GPU underneath, but at 64 K tokens of context the SwiftLM rows have slowed down 13–29× while the ollama rows have slowed down 1.3–1.5×. So the slowdown comes from how the runtime is built, not from the hardware. The mechanism, roughly: SwiftLM sends many tiny instructions to the GPU per word it produces, and the cost of that overhead grows with the conversation length. ollama (via llama.cpp) batches more aggressively, so the same overhead barely shows up.
What this means in practice: SwiftLM is fine for short chats (~8 K tokens or less, roughly a few thousand words of history); past 16 K its speed drops sharply. For a Dreamer session — which builds up context as it reads photos, voice memos, and time logs — ollama is the right pick on M1 Max.
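The decay factors in the table are the 4 K decode rate divided by the rate at the longer context. A short Python sketch, using the two gemma-4-e2b rows from the context sweep in the full matrices (the `decay` helper is illustrative, not part of the bench harness):

```python
CONTEXTS_K = [4, 8, 16, 32, 48, 64]

# tok/s at each context size, copied from the LLM context sweep table.
SWEEP = {
    "SwiftLM, gemma-4-e2b": [32, 17, 8, 2.8, 1.6, 1.1],
    "ollama, gemma4:e2b": [74.6, 73.2, 70.1, 61.8, 56.9, 49.5],
}


def decay(tok_per_s, from_k: int, to_k: int) -> float:
    """How many times slower decoding is at to_k than at from_k."""
    base = tok_per_s[CONTEXTS_K.index(from_k)]
    end = tok_per_s[CONTEXTS_K.index(to_k)]
    return base / end


for name, rates in SWEEP.items():
    print(f"{name}: 4K->16K {decay(rates, 4, 16):.2f}x, 4K->64K {decay(rates, 4, 64):.2f}x")
```

Running the same helper over your own sweep is a quick way to see which curve a runtime follows on your hardware.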
3. On M1 Max, ollama beats SwiftLM on LLMs — and by more when conversations get long
We started out saying "ollama on Intel, SwiftLM on Apple Silicon, because SwiftLM is 10–15× faster than ollama on the big efficient (MoE) models we care about." That was based on a single short-conversation test of Qwen3-Coder-30B. When we later ran a full head-to-head across five model classes and a range of conversation lengths (4 K to 64 K tokens), the verdict flipped:
| Model on M1 Max | SwiftLM @ 4K | ollama @ 4K | SwiftLM @ 64K | ollama @ 64K | Winner at 64K |
|---|---|---|---|---|---|
| gemma-4-e2b | 32 | 74.6 | 1.1 | 49.5 | ollama 45× |
| gemma-4-26b-a4b | 15.2 | 49.2 | 1.2 | 32.1 | ollama 27× |
| gpt-oss-20b | 25.5 | 58.6 | 1.7 | 36.7 | ollama 22× |
| Qwen3.6-35B-A3B (thinking) | 42 | 27.2 | 6.0 | 21.1 | ollama 3.5× |
| Llama-3.2-3B (small dense) | 55 | 93 | — | — | ollama 1.7× at 4K |
ollama wins on every model in the table once the conversation reaches 16 K tokens. At 4 K it already leads by 1.7–3.2× on four of the five models. The one exception is Qwen3.6-35B-A3B, where SwiftLM is still 1.3–1.5× faster at very short prompts (4 K, 8 K). From 16 K on, ollama wins everywhere, and the gap widens: at 64 K it is 22–45× on the Gemma and gpt-oss rows and 3.5× on Qwen3.6-35B-A3B. The biggest gap is gemma-4-e2b at 64 K, where ollama holds 49.5 tok/s while SwiftLM drops to 1.1. The reason is in finding #2 above: it's how the two runtimes are built, not the models or the hardware.
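To make the crossover concrete, here is a small Python sketch over the Qwen3.6-35B-A3B rows, with the numbers copied from the context sweep further down. The helper names are illustrative, not part of the bench harness.

```python
CONTEXTS_K = [4, 8, 16, 32, 48, 64]

# tok/s for Qwen3.6-35B-A3B on M1 Max, from the LLM context sweep table.
SWIFTLM = [42, 33, 23, 13, 8.3, 6.0]
OLLAMA = [27.2, 26.2, 25.6, 23.7, 22.3, 21.1]


def ollama_advantage(swiftlm, ollama):
    """Ratio of ollama to SwiftLM decode rate at each context; above 1 means ollama is faster."""
    return [o / s for s, o in zip(swiftlm, ollama)]


def crossover_k(contexts, swiftlm, ollama):
    """Smallest benched context (in K tokens) at which ollama is faster, or None."""
    for ctx, s, o in zip(contexts, swiftlm, ollama):
        if o > s:
            return ctx
    return None


ratios = ollama_advantage(SWIFTLM, OLLAMA)
for ctx, ratio in zip(CONTEXTS_K, ratios):
    print(f"{ctx:>2} K: ollama/SwiftLM = {ratio:.2f}")
print("crossover at", crossover_k(CONTEXTS_K, SWIFTLM, OLLAMA), "K")
```

On these rows the ratio is below 1 at 4 K and 8 K and above 1 from 16 K on, which is the flip described above.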
What this means for the v0.11 picks: the Dreamer-tier Apple Silicon defaults are now ollama, not SwiftLM (Hosted Claude remains the top-level default). SwiftLM stays as the VLM runtime — the MLX-only VLMs we ship have no ollama equivalent at our quality bar. SwiftLM-local Dreamer is still selectable in Settings for users who specifically want a single-stack setup.
One thing we didn't finish: the SwiftLM side of the Qwen3-Coder-30B-A3B
test (the battery died mid-run). The ollama side ran cleanly (54.5 tok/s at 4 K, 24.1 at
64 K). That's the one model where SwiftLM might still be faster, since it's the one we
used for the original 10–15× claim. Re-bench tracked in the
REBENCH-QUEUE.
4. macOS quietly slows the GPU when battery drops below ~5 %
Same model, same M1 Max, same High Power Mode setting in both runs. Morning, full battery: Llama-3.2-3B at 4 K ran at 54 tok/s on SwiftLM. Evening, 2 % battery: 18.8 tok/s on the same hardware. 2.9× slower from battery alone. It gets worse at longer conversations — at 32 K, the speed dropped from 8 to 0.4 tok/s. This is on top of the better-known slowdown when macOS Low Power Mode kicks in (another 5–6×). All the "clean" Apple Silicon numbers in this post assume the laptop is plugged in or above ~5 % battery. If you bench on battery, watch for it.
5. Photo captioning runs ~15–24× faster on Apple Silicon than Intel
Same model (Qwen3-VL-2B weights): 53 s per photo on Intel via ollama, 3.6 s
per photo on M1 Max via SwiftLM. For the typical photo (less text in the image to read),
M1 Max is closer to 24× faster. Almost all of that comes from Apple Silicon's shared
memory + MLX framework + the GPU running the model directly. The gap is big enough that
"do you have Apple Silicon" is the most important question for anyone deciding whether
local AI is worth running on their photo library.
The full matrices
All the numbers behind the picks at the top of the post. Detail-heavy — the findings above are the takeaways; these tables are for "OK but what about model X."
All VLMs benched (sorted by speed)
Caption-quality scores: OCR / Setting / Multilingual / Anti-Hallucination, each 1–5 (5 = consistently strong, 1 = unusable). Scored side-by-side on a 25-photo eval set (5 OCR-heavy, 5 GPS-tagged, 8 group/family, 7 single/scene). Full rubric + per-photo examples at captioning-bench/QUALITY-SCORES.md.
| Hardware | Model | s/photo | Caption quality |
|---|---|---|---|
| M1 Max / 64 GB | FastVLM-0.5B-bf16 | 1.10 | 2/3/1/3 — short and mostly safe; no multilingual |
| M1 Max / 64 GB | mlx-community/Qwen2.5-VL-3B-Instruct-4bit | 1.51 | 2/2/2/1 — hallucinates confidently |
| M1 Max / 64 GB | mlx-community/gemma-4-e2b-it-4bit | 1.59 | 2/2/3/3 — hedges with "this appears to be"; partial Swedish OCR |
| M1 Max / 64 GB | mlx-community/gemma-4-26b-a4b-it-4bit | 2.56 | 3/3/4/4 — cleanest prose, lowest hallucination; drops people counts |
| M1 Max / 64 GB | mlx-community/Qwen2.5-VL-7B-Instruct-4bit | 2.72 | 3/2/2/2 — invents florid scenarios |
| M1 Max / 64 GB | mlx-community/Qwen3-VL-4B-Instruct-4bit | 3.11 | 3/4/3/3 — strong narrative + scene-grasp; English-only output |
| M1 Max / 64 GB | mlx-community/gemma-4-e4b-it-4bit | 3.26 | 1/1/2/1 — refuses to caption 17/25, lifestyle-template loops |
| M1 Max / 64 GB | mlx-community/Qwen3-VL-2B-Instruct-4bit | 3.62 | 1/2/2/1 — unusable: token-repetition loops on 3/25 |
| M1 Max / 64 GB | lmstudio-community/Qwen3-VL-8B-MLX-4bit ★ | 4.05 | 5/4/5/4 — quality winner. Multilingual OCR (Swedish, Finnish) + writes captions in source language |
| Intel i9 / 32 GB | moondream:1.8b | 25.9 | fuzzy gist (no OCR, miscounts people) |
| Intel i9 / 32 GB | qwen2.5vl:3b | 52.3 | solid OCR + people + dates |
| Intel i9 / 32 GB | qwen3-vl:2b | 53.2 | solid OCR + people + dates |
| Intel i9 / 32 GB | gemma4:e2b | 63.8 | OCR ok; weak setting context |
| Intel i9 / 32 GB | minicpm-v:8b | 71.9 | broken grammar, no detail |
| Intel i9 / 32 GB | qwen3-vl:4b | 73.5 | best quality (OCR + translations) |
| Intel i9 / 32 GB | llava:7b | 74.8 | generic, hallucinates details |
| Intel i9 / 32 GB | gemma4:e4b | 108.1 | solid OCR, marginal quality gain over e2b |
| Intel i9 / 32 GB | gemma3:4b | 266.6 | hallucinates entire scenes |
LLM context sweep (tok/s by ctx size)
Two rows still marked with a † were benched while macOS Low Power Mode was on — their numbers are 2–6× lower than the same hardware would produce on a clean run, but their ordering relative to each other still tells you which model is faster. The other previously-tainted rows have been re-run cleanly and the † removed. The 5 ollama M1 Max rows below are from the 2026-05-19 head-to-head sweep that powers finding #3. Re-bench items are tracked in the REBENCH-QUEUE.
| Hardware | Model | @ 4 K | @ 8 K | @ 16 K | @ 32 K | @ 48 K | @ 64 K |
|---|---|---|---|---|---|---|---|
| Intel i9 / 32 GB | ollama, gemma4:e2b | 11.8 | 10.4 | 8.5 | 6.0 | — | — |
| Intel i9 / 32 GB | ollama, llama3.2:1b | 12.1 | 6.9 | 4.2 | — | — | — |
| Intel i9 / 32 GB | ollama, gemma3:4b | 7.4 | 6.9 | 5.9 | — | — | — |
| M1 Max / 64 GB | SwiftLM, Llama-3.2-3B | 54 | 44 | 24 | 8 | — | — |
| M1 Max / 64 GB | ollama, llama3.2:3b — runtime A/B | 93 | 82 | 70 | 51 | — | — |
| M1 Max / 64 GB | SwiftLM, Phi-4-mini | 44 | 38 | 21 | 1.9 | — | — |
| M1 Max / 64 GB | SwiftLM --thinking, Qwen3-4B | 57 | 47 | 29 | err | — | — |
| M1 Max / 64 GB | SwiftLM --thinking, Qwen3-8B | 37 | 30 | 22 | err | — | — |
| M1 Max / 64 GB | SwiftLM --thinking, Qwen3.6-35B-A3B | 42 | 33 | 23 | 13 | 8.3 | 6.0 |
| M1 Max / 64 GB | ollama, qwen3.6:35b | 27.2 | 26.2 | 25.6 | 23.7 | 22.3 | 21.1 |
| M1 Max / 64 GB | SwiftLM, gemma-4-e2b (LLM-mode) | 32 | 17 | 8 | 2.8 | 1.6 | 1.1 |
| M1 Max / 64 GB | ollama, gemma4:e2b | 74.6 | 73.2 | 70.1 | 61.8 | 56.9 | 49.5 |
| M1 Max / 64 GB † | SwiftLM, Qwen3-Coder-30B-A3B | 14.6 | 10.7 | 4.9 | 1.8 | 1.0 | 0.9 |
| M1 Max / 64 GB | ollama, qwen3-coder:30b | 54.5 | 47.5 | 38.5 | 27.2 | 19.7 | 24.1 |
| M1 Max / 64 GB | SwiftLM, gemma-4-26b-a4b | 15.2 | 10.5 | 5.4 | 2.1 | 1.6 | 1.2 |
| M1 Max / 64 GB | ollama, gemma4:26b | 49.2 | 48.5 | 45.5 | 41.0 | 35.8 | 32.1 |
| M1 Max / 64 GB | SwiftLM, gpt-oss-20b (MXFP4-Q4) | 25.5 | 19.1 | 10.7 | 4.8 | 2.6 | 1.7 |
| M1 Max / 64 GB | ollama, gpt-oss:20b | 58.6 | 55.2 | 53.2 | 47.3 | 41.1 | 36.7 |
| M1 Max / 64 GB † | SwiftLM --thinking, Qwen3-Next-80B-A3B | 7.9 | 5.6 | 2.7 | 1.1 | — | — |
Coherence note: against a long (80 K) Dreamer-style prompt,
Qwen3.6-35B-A3B was the only model that stayed on-task all the way through.
Gemma-4-26B-A4B on the same input fell into repetition collapse (6,000 chars
of "own own own ... way way way"). Decode rate doesn't capture this:
gemma-4-26b shows usable tok/s in the table at every context size, but the
output stops being useful past a certain prompt length on this specific input. That's why
the 64 GB+ Dreamer default is qwen3.6:35b, not the faster
gemma4:26b a row above it.
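Since tok/s can't see this failure, a bench run needs a separate check on the output text. The sketch below is one crude heuristic, not how we scored coherence: it flags output containing a long run of the same token repeated back to back. The threshold of 8 is an arbitrary assumption.

```python
def longest_repeat_run(text: str) -> int:
    """Length of the longest run of one whitespace-separated token repeated back to back."""
    longest = 0
    current = 0
    previous = None
    for token in text.split():
        current = current + 1 if token == previous else 1
        longest = max(longest, current)
        previous = token
    return longest


def looks_collapsed(text: str, max_run: int = 8) -> bool:
    """Heuristic flag for 'own own own ...' style repetition collapse."""
    return longest_repeat_run(text) >= max_run
```

It only catches single-token loops like the one quoted above; phrase-level loops would need an n-gram variant.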
Reproducing this on your hardware
We benched exactly one Intel i9-8950HK / 32 GB and one M1 Max / 64 GB. If you have something else interesting — a higher-core Intel, 16 / 96 / 128 GB Apple Silicon, an M2/M3/M4 of any tier — please run the bench scripts and contribute results.
Full bench harness, raw numbers, gotcha writeups, and PII-clean aggregate matrices live at github.com/rememberthis-ai/local-ai-benchmarks. The transcription bench specifically (with the canonical LibriVox clip via git-LFS) is at transcription-bench/. Aim for the model list we used plus extras you care about, then open a PR with your results.
Your results may vary, especially with power mode. All Apple Silicon numbers above, except the two rows marked †, are with the Mac plugged in and Low Power Mode off. macOS Low Power Mode throttles the M1 Max GPU 5–6×, and it auto-enables on battery below ~20 %. Check with pmset -g | grep lowpowermode before benching; expect 2–6× slower numbers if it's on. The gotchas/ dir in the bench repo covers a half-dozen other silent ways local-AI numbers can be wrong: ollama 4 K prompt truncation, sona's --gpu-device -2 not actually disabling Metal, Gemma-4 repetition collapse at long context, etc.
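If you script your bench runs, the power check can be automated. The following Python sketch assumes that `pmset -g` prints a `lowpowermode` line followed by 0 or 1, and that `pmset -g batt` prints the charge as a percentage such as `85%`. Both formats are assumptions about the output, and every function name here is hypothetical rather than part of the bench repo.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_low_power_mode(pmset_g_output: str) -> Optional[bool]:
    """True/False from a 'lowpowermode N' line, or None if the line is absent."""
    match = re.search(r"^\s*lowpowermode\s+(\d+)", pmset_g_output, re.MULTILINE)
    return None if match is None else match.group(1) != "0"


def parse_battery_percent(pmset_batt_output: str) -> Optional[int]:
    """First 'NN%' figure in the battery report, or None (e.g. a desktop Mac)."""
    match = re.search(r"(\d+)%", pmset_batt_output)
    return None if match is None else int(match.group(1))


def safe_to_bench(low_power: Optional[bool], battery_pct: Optional[int]) -> bool:
    """Conservative gate: refuse if Low Power Mode is on or the battery is at 5 % or less."""
    if low_power:
        return False
    return battery_pct is None or battery_pct > 5


def read_power_state() -> Tuple[Optional[bool], Optional[int]]:
    """macOS only: shell out to pmset and parse both reports."""
    settings = subprocess.run(["pmset", "-g"], capture_output=True, text=True, check=True).stdout
    battery = subprocess.run(["pmset", "-g", "batt"], capture_output=True, text=True, check=True).stdout
    return parse_low_power_mode(settings), parse_battery_percent(battery)
```

Calling `safe_to_bench(*read_power_state())` at the top of a bench script and aborting on `False` keeps throttled numbers out of your results.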
v0.10.x continues to ship ollama + MiniCPM-V 4.5 as the captioner; the picks above land in v0.11.