Intro
A common thread today is engineering that bends constraints: making infrastructure that survives the real world, and making software that refuses to be limited by hardware. Short reads below, then a deeper look at a striking systems hack that runs a 397B‑parameter model on a MacBook.
In Brief
Manyana: Rethinking version control with CRDTs
Bram Cohen’s small demo, Manyana, explores using CRDTs (Conflict‑Free Replicated Data Types) as the core data model for a next‑generation version control system. The experiment is tiny—about 470 lines of Python—but deliberate: instead of the three‑way merges we know from Git, the CRDT approach stores a single "weave" of history where every line ever added remains in the structure with metadata about additions and removals. As Cohen puts it:
"A CRDT merge always succeeds by definition."
The payoff is predictable merges and the possibility of non‑destructive rebases and repeatable conflict resolution. The trade-offs are practical: CRDTs change how you reason about history, and auto‑merging that never fails can still produce logically incorrect results that require human judgment. If you care about smoothing many‑branch workflows or preserving every edit as first‑class metadata, this post is a useful provocation and a concrete starting point for tooling experiments.
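To make the weave idea concrete, here is a minimal toy sketch—not Manyana's actual code, and with illustrative names and a simplified ordering scheme: every line keeps the sets of edits that added and removed it, a line is "live" only if added and not removed, and merging is just a union of metadata, so it cannot fail.

```python
class Weave:
    """Toy CRDT weave: entries are never deleted; removal only records
    metadata. Merging unions metadata, so it always succeeds (though the
    merged result may still be logically wrong and need human review)."""

    def __init__(self):
        self.order = []    # stable sequence of entry ids
        self.text = {}     # entry id -> line text
        self.added = {}    # entry id -> set of edit ids that added it
        self.removed = {}  # entry id -> set of edit ids that removed it

    def insert(self, pos, entry_id, text, edit_id):
        if entry_id not in self.text:
            self.order.insert(pos, entry_id)
            self.text[entry_id] = text
            self.added[entry_id] = set()
            self.removed[entry_id] = set()
        self.added[entry_id].add(edit_id)

    def remove(self, entry_id, edit_id):
        self.removed.setdefault(entry_id, set()).add(edit_id)

    def live(self):
        # Visible document: lines that were added and never removed.
        return [self.text[i] for i in self.order
                if self.added.get(i) and not self.removed.get(i)]

    def merge(self, other):
        # Take every entry from both sides; union add/remove metadata.
        for i in other.order:
            if i not in self.text:
                self.order.append(i)  # toy ordering; real weaves interleave
                self.text[i] = other.text[i]
                self.added[i] = set()
                self.removed[i] = set()
            self.added[i] |= other.added[i]
            self.removed[i] |= other.removed.get(i, set())
```

With this structure, one branch deleting a line while another appends a line merges cleanly: the deletion survives as removal metadata rather than vanishing from history.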
Project Nomad: Knowledge that never goes offline
Project Nomad is a grassroots toolkit for maintaining essential documentation locally—think Raspberry Pi + SD image with Wikipedia, manuals, maps and medical guides available when networks fail or are censored. It leans on ZIM/Kiwix archives and aims to be practical for travelers, aid workers, and people in restricted regions.
The project is a reminder that cloud convenience is brittle. Early adopters flagged rough edges—US‑centric content links, Docker assumptions for deployment, and a broken hardcoded Wikipedia dump—but the core idea is compelling: resilience through local copies of reference material. For organizations that need guaranteed access to instructions and safety information, Project Nomad is a low‑tech, high‑utility approach worth adapting.
GrapheneOS: Privacy without identity
GrapheneOS has reiterated a simple promise: its OS and services "will remain usable by anyone around the world without requiring personal information, identification or an account." The project is choosing openness over market reach—if regional rules force identity checks, GrapheneOS would rather abstain than add gating.
That stance matters for privacy‑first users who want a usable, account‑free mobile environment. The practical downsides are familiar: hardware partnerships, regional legal quirks, and app compatibility can impose constraints. Still, GrapheneOS’s stance keeps a clear option on the table for people who prioritize anonymity and minimal vendor lock‑in.
Deep Dive
Flash‑MoE: Running a 397B parameter model on a laptop
The headline is vivid: a 397‑billion parameter Mixture‑of‑Experts model running on a 48GB M3 Max MacBook Pro. The Flash‑MoE repo documents the engineering behind that claim: instead of trying to fit 209GB of weights into RAM, the system streams only the active experts from SSD on a per‑token basis, uses a hand‑written C/Metal inference engine, and depends on macOS’s page cache for expert caching rather than building a custom cache layer.
Two core techniques make this possible:
- Selective expert streaming: for MoE layers you only need the K active experts per token (here K≈4), so most expert weights stay on SSD and are touched only when needed. That reduces working set dramatically.
- Tight, platform‑specific kernels: small, optimized Metal/C kernels including an FMA‑friendly dequantization loop squeeze more throughput out of the CPU/accelerator pipeline.
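The streaming idea can be sketched in a few lines of Python—shapes, file layout, and names here are illustrative, not Flash‑MoE's actual format. Expert weights live in one file on "SSD"; `np.memmap` maps the file without reading it, and only the top‑K experts the router picks for a token are actually faulted in, with the OS page cache keeping hot experts resident:

```python
import numpy as np
import os
import tempfile

# Hypothetical layout: N_EXPERTS weight matrices concatenated in one file.
N_EXPERTS, D_IN, D_OUT, K = 8, 16, 16, 2

path = os.path.join(tempfile.mkdtemp(), "experts.bin")
rng = np.random.default_rng(0)
all_w = rng.standard_normal((N_EXPERTS, D_IN, D_OUT)).astype(np.float32)
all_w.tofile(path)

# memmap maps the file lazily; the OS page cache faults in only the pages
# we touch -- mirroring Flash-MoE's reliance on macOS's page cache.
experts = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(N_EXPERTS, D_IN, D_OUT))

def moe_layer(x, router_logits):
    """Apply only the K highest-scoring experts to one token."""
    top_k = np.argsort(router_logits)[-K:]
    gates = np.exp(router_logits[top_k])
    gates /= gates.sum()
    out = np.zeros(D_OUT, dtype=np.float32)
    for gate, e in zip(gates, top_k):
        # experts[e] is the only slice of the file read from disk here.
        out += gate * (x @ experts[e])
    return out

x = rng.standard_normal(D_IN).astype(np.float32)
logits = rng.standard_normal(N_EXPERTS)
y = moe_layer(x, logits)
```

The working set per token is K expert matrices rather than all of them—the same K-of-N reduction that lets 209GB of weights sit on SSD while only a few GB are hot at any moment.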
Result: in a 4‑bit quantized mode the authors report around 4.4 tokens/sec for production‑quality tool calling. That’s slow compared to server GPUs but astonishing for a single laptop.
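As a rough illustration of what 4‑bit mode means—this is generic block quantization, not Flash‑MoE's actual kernel or weight layout—each group of weights keeps one scale and one offset, and dequantization is a single multiply‑add per value, which is exactly the FMA‑friendly shape a tight C/Metal loop wants:

```python
import numpy as np

BLOCK = 32  # weights per quantization group (size is illustrative)

def quantize_4bit(w):
    """Map each block of 32 floats to 4-bit codes plus a scale and offset."""
    w = w.reshape(-1, BLOCK)
    lo = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - lo) / 15.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on constant blocks
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale.astype(np.float32), lo.astype(np.float32)

def dequantize_4bit(q, scale, lo):
    # One fused multiply-add per weight: w_hat = q * scale + lo.
    return (q.astype(np.float32) * scale + lo).reshape(-1)

w = np.random.default_rng(1).standard_normal(4 * BLOCK).astype(np.float32)
q, s, lo = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, lo)
```

With 16 levels per block, reconstruction error is bounded by half a quantization step per weight—enough headroom for tool calling at 4 bits, while 2 bits (4 levels) loses the precision that structured outputs like JSON depend on.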
Why this matters
This is a systems story more than a model‑innovation story. It shows engineers can trade latency for locality and push very large models off the datacenter and onto consumer hardware with clever I/O and quantization tricks. For developers, that unlocks a new class of experiments—local tool calling, private inference, and reproducible edge demos—without renting a cluster.
Caveats and practical limits
- The work uses a premium MacBook Pro (~$3k+), so "on a laptop" headlines gloss over real cost.
- Aggressive quantization (2‑bit) and reducing experts per token hurt quality; the repo authors even note 2‑bit "breaks JSON/tool calling" for some tasks.
- Throughput (4.4 tokens/sec) means interactive latency is high; this is a weekend engineering triumph more than a drop‑in replacement for server inference.
- Relying on macOS page cache simplifies implementation but transfers cache policy control to the OS, which may not suit all workloads.
Still, this proof‑of‑concept pushes a useful boundary: model portability is as much a storage and systems problem as a model‑architecture problem. Expect community forks and alternative quantization/format work that chase better accuracy/latency tradeoffs on other hardware.
Notable quote from the project:
"Pure C/Metal inference engine that runs Qwen3.5‑397B‑A17B ... on a MacBook Pro"
Practical takeaway
If you build ML tools, Flash‑MoE is an invitation to rethink where inference happens. For researchers it’s a fast way to iterate privately. For product teams, it suggests hybrid deployment patterns—part local, part server—can be feasible with careful engineering. And for the ML tooling ecosystem, it pressures quantization and model formats to become more robust and portable.
Closing thought
This batch of posts shares a theme: smart constraints beat naively bigger resources. Whether you’re rethinking the model for version control, building a tiny offline library for emergencies, insisting an OS not ask for your passport, or streaming experts off an SSD, the most interesting work right now is about choice—choosing where complexity lives, and what we’ll accept to keep systems usable, private and resilient.