v0.11.0: Better Local AI on Your Mac

2026-05-22

The follow-up to the May bench: per-tier model picks land as defaults, Apple Silicon Macs caption photos through SwiftLM at about 10× the previous speed, and Power Mode keeps heavy backfill from eating your battery on the road. Plus voice-memo queue improvements, Abort / Re-transcribe buttons, and a VLM circuit breaker.

Where this release comes from

Two weeks ago we published Local AI on Mac: A May 2026 Benchmark — an exhaustive measurement of inference runtimes and models across an Intel i9 and an M1 Max, kicked off by one production captioning crash. v0.11 is the release that ships the picks from that bench: which Whisper model your machine should run, which vision model captions your photos, and on which runtime. None of these are theoretical — every choice traces to a quality-score sheet or a token-throughput curve.

The headline result was that Apple Silicon Macs with 16 GB+ of RAM unlock a vision model they couldn't run efficiently before: Qwen3-VL via SwiftLM (MLX), about ten times faster than the ollama path on the same machine, with quality that handles multi-language receipts and signage correctly. Intel Macs hit a CPU plateau the benchmarks couldn't get past, so the defaults there stay on ollama's moondream — not the best model in absolute terms, but the right floor.

SwiftLM captioning on Apple Silicon

For Apple Silicon Macs with 16 GB+ of RAM, photo captioning now runs through SwiftLM (mlx-swift-lm) rather than ollama. The bench delta is the kind of gap that justifies a backend swap: about ten times faster captioning on the same machine, with quality that matches or beats the previous default. The May benchmark has the side-by-side numbers and the quality-score grid.

More interestingly, the larger MLX picks at Tier 32 GB+ unlock things ollama's defaults couldn't read at all. Qwen3-VL-8B-MLX-4bit on Tier 48 / 64 was the only model in the eval that handled Swedish and Finnish receipts correctly and wrote captions in the source language. The previous gemma-4-26b option produced strong English prose but consistently dropped people counts and misread event signs. The tradeoff is speed (2.56 s → 4.05 s per photo, roughly 1.6× slower), and we accepted it.

Newly indexed photos use the new model automatically. Existing captions stay as they are until you choose to recaption a specific photo. Intel Macs keep ollama with moondream — MLX has no x86_64 path yet.

Smarter defaults out of the box

First launch detects your Mac's RAM and CPU architecture, then writes a model-and-backend recipe matched to the bench matrix. For Whisper, Tier 8 GB gets the medium model and Tier 16 GB+ jumps straight to large-v3-turbo. For captioning on Apple Silicon, Tier 32+ moves to Qwen3-VL-4B and Tier 48 / 64 to Qwen3-VL-8B-MLX-4bit. LLM picks at Tier 32 / 48 / 64 are bumped to gpt-oss:20b, gemma4:26b, and qwen3.6:35b respectively, since ollama held its decode throughput nearly flat up to 64K context on M1 Max.
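
The tier-to-recipe mapping described above can be sketched roughly as follows. This is an illustration, not the shipped detection code: the names `Recipe`, `tier_for`, and `pick_recipe` are hypothetical, the rule that a machine falls into the largest tier it fills is an assumption, and picks the post does not state (the Qwen3-VL variant at Tier 16, captioning on Apple Silicon under 16 GB, LLMs below Tier 32) are left generic or as `None`. The post also does not say whether the LLM picks differ on Intel; the sketch applies them on both architectures.

```python
from dataclasses import dataclass
from typing import Optional

TIERS_GB = (8, 16, 32, 48, 64)  # tier names used in the post


@dataclass(frozen=True)
class Recipe:
    tier_gb: int
    whisper_model: str
    vision_backend: Optional[str]
    vision_model: Optional[str]
    llm_model: Optional[str]  # None where the post names no pick


def tier_for(ram_gb: int) -> int:
    """Assumption: a machine lands in the largest tier its RAM fills."""
    eligible = [t for t in TIERS_GB if t <= ram_gb]
    return eligible[-1] if eligible else TIERS_GB[0]


def pick_recipe(arch: str, ram_gb: int) -> Recipe:
    tier = tier_for(ram_gb)
    whisper = "medium" if tier < 16 else "large-v3-turbo"

    if arch == "x86_64":
        # Intel keeps ollama with moondream (MLX has no x86_64 path).
        backend, vision = "ollama", "moondream"
    elif tier >= 48:
        backend, vision = "swiftlm", "Qwen3-VL-8B-MLX-4bit"
    elif tier >= 32:
        backend, vision = "swiftlm", "Qwen3-VL-4B"
    elif tier >= 16:
        backend, vision = "swiftlm", "Qwen3-VL"  # exact variant not stated
    else:
        backend, vision = None, None  # not stated for Apple Silicon under 16 GB

    llm = {32: "gpt-oss:20b", 48: "gemma4:26b", 64: "qwen3.6:35b"}.get(tier)
    return Recipe(tier, whisper, backend, vision, llm)
```

For example, `pick_recipe("arm64", 64)` yields the Qwen3-VL-8B-MLX-4bit captioner with large-v3-turbo, while `pick_recipe("x86_64", 32)` stays on ollama's moondream.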

The picks land in Settings → Models, alongside what your tier could use if you flipped to a different backend. Switching is one click. Hosted Anthropic Claude remains the default Claude Code backend on every tier — local-LLM Dreamer is opt-in.

Power Mode

Remember This now treats your machine's power state as a constraint. A new Power Mode panel in Settings gates each background workload independently: voice transcription, photo captioning, triage, and the local-LLM Dreamer (if you've switched Claude Code to local).

The defaults match what hurts most. Photo captioning — the long backfill that processes thousands of photos at a time — defaults to AC only and pauses the moment you unplug. Voice transcription stays Always on; recording a memo and waiting for a transcript shouldn't depend on whether you're plugged in. Triage stays Always. The local-LLM Dreamer is Manual by default — you run it when you want it.

Edit the policy any time. The change takes effect within about thirty seconds, no restart.
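
As a rough sketch of how a per-workload gate like this could be evaluated, assuming three policy values (Always, AC only, Manual) and the defaults listed above. The workload keys, the `Policy` enum, and `may_run` are illustrative names, not the app's actual settings schema.

```python
from enum import Enum


class Policy(Enum):
    ALWAYS = "always"
    AC_ONLY = "ac_only"
    MANUAL = "manual"


# Defaults as described in the post; the keys are hypothetical identifiers.
DEFAULT_POLICIES = {
    "voice_transcription": Policy.ALWAYS,
    "photo_captioning": Policy.AC_ONLY,
    "triage": Policy.ALWAYS,
    "local_llm_dreamer": Policy.MANUAL,
}


def may_run(workload, on_ac_power, user_triggered=False, policies=DEFAULT_POLICIES):
    """Return True if the workload is allowed to run in the current power state."""
    policy = policies[workload]
    if policy is Policy.ALWAYS:
        return True
    if policy is Policy.AC_ONLY:
        return on_ac_power
    # MANUAL: only runs when the user explicitly starts it.
    return user_triggered
```

With these defaults, `may_run("photo_captioning", on_ac_power=False)` is false the moment the machine is unplugged, while voice transcription is unaffected.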

A live view of the GPU queue

The Processing tab used to be a list of counters. v0.11 turns it into a live feed. Under "Now processing" you see the GPU scheduler's actual queue, item by item, file paths and all — plus the in-flight Whisper transcription text as it streams out of the model. Long-running jobs are no longer opaque.

Pending items in the Timeline and Transcribe tabs now show a "Queued · #N" position pill so you can answer "is mine next?" at a glance. The number uses the same order the scheduler dispatches in — #1 is what's coming out next.

Fresh recordings jump to the front

Before v0.11, a fresh voice memo recorded right now could end up queued behind hours of older Apple Voice Memos. The scheduler tiebroke FIFO inside a priority bucket, and the scanner had a "skip if queue already has N items" guard that meant new arrivals could fall off the enqueue path entirely until the queue drained. v0.11 tiebreaks by capture time instead, and drops the cap-skip. New memos surface within seconds.
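
The ordering change can be illustrated with a small sketch. It assumes that a lower priority number dispatches earlier and that, within a bucket, the most recently captured item goes first (our reading of "fresh recordings jump to the front"). `Pending`, `dispatch_order`, and `queue_positions` are hypothetical names, and the same ordering is reused to derive the "Queued · #N" position.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Pending:
    path: str
    priority: int  # assumption: lower number = dispatched earlier
    captured_at: datetime


def dispatch_order(items):
    # Two stable sorts: newest capture first, then by priority bucket.
    # Stability keeps the newest-first order inside each bucket.
    newest_first = sorted(items, key=lambda i: i.captured_at, reverse=True)
    return sorted(newest_first, key=lambda i: i.priority)


def queue_positions(items):
    # "#1 is what's coming out next": positions follow the dispatch order.
    return {item.path: n for n, item in enumerate(dispatch_order(items), start=1)}
```

Under FIFO tiebreaking a memo recorded today would sit behind every older memo in its bucket; with the capture-time tiebreak it moves to the front of that bucket.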

Apple Voice Memo custom titles also surface now — when you rename a memo to something like "Holmbacken 5b 55" in Voice Memos, that title shows as the lead line on the row instead of just the timestamp. Existing memos backfill via a one-shot migration on first launch.

Stop, restart, retry

Two voice-memo controls that should have always been there. A stop button next to the live progress pill aborts an in-flight transcription — the Whisper worker is preempted, the asset is marked Cancelled in frontmatter so it doesn't immediately re-queue, and the next item in the queue starts. Useful when you accidentally queued a 3-hour file or a long memo is going off the rails.

The voice memo row context menu's "Transcribe now" / "Re-transcribe" / "Retry transcription" entries now actually work; they used to be a silent no-op for pending memos. Each entry preempts the running transcription, resets stale state, and fires the scheduler's enqueue-immediate path so the asset genuinely jumps the queue. The label is contextual, based on the asset's current state.

VLM captioning circuit breaker

The production crash that kicked off the May bench wasn't just slow — it was cascading. Ollama's llama-runner was crashing deterministically on an Intel/Polaris-Metal box, the worker had no backoff, and 35,029 photos got marked visual_status='failed' in 60 minutes. Every retry blasted a backend that was guaranteed to crash on the next request.

v0.11 adds a circuit breaker on the dispatch loop. After 5 consecutive failures matching a fatal-backend pattern (llama runner process has terminated, HTTP 500/503 from ollama, connection refused), the breaker trips and the VLM scanner skips enqueueing until a human clears it via POST /vlm/resume. Per-asset failures (locked photo, iCloud-offline image) don't count toward the trip; those are classified as Skipped upstream. A threshold of 5 catches deterministic cascades fast while tolerating real transient HTTP 500s.
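
A minimal sketch of a breaker with this shape, under stated assumptions: the pattern strings below are illustrative stand-ins for whatever the daemon actually matches, a success resets the consecutive count, and non-fatal failures are simply ignored. The `resume` method stands in for what POST /vlm/resume would trigger.

```python
FATAL_PATTERNS = (
    "llama runner process has terminated",
    "http 500",
    "http 503",
    "connection refused",
)
TRIP_THRESHOLD = 5


class CircuitBreaker:
    def __init__(self, threshold=TRIP_THRESHOLD):
        self.threshold = threshold
        self.consecutive_fatal = 0
        self.tripped = False

    def record_success(self):
        # Assumption: any success breaks the run of consecutive failures.
        self.consecutive_fatal = 0

    def record_failure(self, message):
        text = message.lower()
        if not any(pattern in text for pattern in FATAL_PATTERNS):
            return  # per-asset failure: does not count toward the trip
        self.consecutive_fatal += 1
        if self.consecutive_fatal >= self.threshold:
            self.tripped = True

    def allow_enqueue(self):
        return not self.tripped

    def resume(self):
        # Manual reset by a human operator.
        self.tripped = False
        self.consecutive_fatal = 0
```

The scanner would check `allow_enqueue()` before each batch, so a backend that crashes deterministically stops receiving work after five requests instead of tens of thousands.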

MCP gets a bug-report flow

Two new MCP tools let Claude Code troubleshoot and file in one session: preview_bug_report shows what the report will contain, and submit_bug_report sends it through the same HMAC-signed path that Settings → Bug report uses. The Dreamer-context guard is enforced in code: when Claude Code is running as part of a Dreamer pass, submit_bug_report refuses unless user_approved=true, so the daemon log tail and sanitised config in the payload can't leave the device without your explicit yes.
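
The guard amounts to a check that runs before anything is sent. The sketch below is illustrative only: the `in_dreamer_pass` flag, the `ApprovalRequired` exception, and the injectable `send` callable are hypothetical, and the HMAC signing step is left out.

```python
class ApprovalRequired(Exception):
    """Raised when a Dreamer-context submit lacks explicit user approval."""


def submit_bug_report(payload, *, in_dreamer_pass, user_approved=False,
                      send=lambda p: "sent"):
    # Refuse before anything leaves the device.
    if in_dreamer_pass and not user_approved:
        raise ApprovalRequired(
            "submit_bug_report needs user_approved=true during a Dreamer pass"
        )
    # `send` is a placeholder for the signed upload path.
    return send(payload)
```

Putting the check inside the tool, rather than in a prompt, means an unattended pass cannot talk its way past it.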

Stability and smaller things

Long voice memos no longer hit a fixed-wall-clock timeout. The previous whisper guard was a flat 60 minutes; a one-hour memo on an Intel Mac (CPU-only, ~1× real-time) often tripped it despite making steady progress. The new guard resets on every progress callback and only fires after ten minutes of no progress.
No more "Sona busy" / 503 errors during overlapping transcriptions. Concurrent transcribe calls are now serialized so the single-context Sona server never sees overlapping requests.
No more false "Daemon offline" indicator. The status poll timeout was tripping on Intel under heavy GPU load; bumped from 2 to 10 seconds.
Fresh installs no longer brick at "0 items indexed". The daemon auto-scaffolds ~/Remember This/, creates your identity, attaches iCloud as a source, and starts indexing on first launch. The old "data folder picker" step is gone.
The Transcribe tab no longer shows "No transcriptions yet" on cold launch. The list now refreshes once the SQLite pool is ready instead of latching to the empty initial fetch.
Cross-machine voice memo re-imports no longer create duplicates. If you've ever synced your Registry to a second Mac, you used to end up with parallel duplicate transcripts bucketed under the import date. The importer now resolves capture time via a four-step fallback and checks the index for an existing hash before writing. A migration cleans up the parallel copies prior installs left behind.
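
The progress-based whisper guard from the first item above can be sketched as a small watchdog. This assumes a ten-minute stall window and an injectable clock for testing; `StallWatchdog` is a hypothetical name, not the daemon's actual type.

```python
import time


class StallWatchdog:
    def __init__(self, stall_seconds=600.0, clock=time.monotonic):
        self.stall_seconds = stall_seconds
        self.clock = clock
        self.last_progress = clock()

    def on_progress(self):
        # Called from every progress callback; restarts the stall window.
        self.last_progress = self.clock()

    def stalled(self):
        # True only after a full window with no progress at all.
        return self.clock() - self.last_progress >= self.stall_seconds
```

Unlike a flat 60-minute limit, a one-hour memo that reports progress every few minutes never trips this guard, while a genuinely hung worker is caught after ten minutes.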

What's next

The bench work that drove this release isn't done. The most expensive captioning model in the eval, Qwen3-VL-8B-MLX-4bit at Tier 48 / 64, is great at multi-language receipts but about 1.6× slower than the previous option (2.56 s → 4.05 s per photo). There's probably a smaller MLX variant that keeps the quality without the cost; the next round of benchmarks will look for one. Bug reports from Settings → Bug report (or via Claude Code through MCP) still go straight into that loop.

Remember This

Free. Local. Private. macOS 14+.

Not sure whether your Mac is Apple Silicon or Intel? Open Apple menu → About This Mac. "Chip: Apple M..." means Apple Silicon; "Processor: Intel..." means Intel.
