u/use-one_of-these

Reality check up front: this is slower than your GPU today. What's interesting is why it works at all, and where the ceiling is.

Explainer

The memory wall is the usual bottleneck for LLM inference: shuttling weights from VRAM to compute units. With BitNet's ternary weights ({-1, 0, +1}), there's a different option — instead of moving weights to a processor, do the math inside the DRAM chip.

The mechanism (this is what the explainer walks through visually):

If you send a DDR4 chip an ACT–PRE–ACT sequence with timing that violates JEDEC's tRAS/tRP rules in specific ways, the sense amps don't have time to fully resolve any single row. Multiple rows open at once and the analog charges mix on the bitlines. The sense amp then resolves to the majority value across the opened rows — every bitline in the subarray becomes one MAJ gate, computed in parallel across the row.
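To make the "every bitline is a MAJ gate" idea concrete, here is a minimal software sketch, not the FPGA command sequence. It models three opened rows as Python integers, with each bit position standing in for one bitline. The 16-bit row width and the names `maj3`, `row_a` and `row_b` are illustrative assumptions; a real subarray resolves thousands of bitlines at once.

```python
def maj3(a: int, b: int, c: int) -> int:
    """Bitwise 3-input majority: each bit position acts like one bitline."""
    return (a & b) | (a & c) | (b & c)


ROW_BITS = 16                      # toy row width (assumption, real rows are far wider)
ALL_ONES = (1 << ROW_BITS) - 1     # a row with every cell charged

row_a = 0b1100_1010_1111_0000
row_b = 0b1010_0110_0101_1100

# A third row of all zeros turns every bitline into an AND gate.
and_result = maj3(row_a, row_b, 0)
# A third row of all ones turns every bitline into an OR gate.
or_result = maj3(row_a, row_b, ALL_ONES)

assert and_result == row_a & row_b
assert or_result == row_a | row_b
```

The point of the sketch is only the logic identity: one majority operation over whole rows is a bulk bitwise op with no per-bit loop.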

Pin one input of a 3-input MAJ to zero and it becomes AND: MAJ(a, b, 0) = a AND b. From AND you build ternary × int8 multiplies (BitNet activations are int8, so the multiply decomposes into masked ANDs across 8 bitplanes plus a popcount). From those, you build a full transformer linear layer.
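A rough sketch of that decomposition in plain Python, for anyone who wants to check the arithmetic. It assumes one particular encoding that the post does not spell out: the ternary weights are split into a "+1" mask and a "-1" mask, and each int8 activation is sliced into 8 two's-complement bitplanes. All function names here are hypothetical, and the `&` stands in for the in-DRAM AND.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one int, element i at bit i."""
    word = 0
    for i, bit in enumerate(bits):
        word |= (bit & 1) << i
    return word


def popcount(word: int) -> int:
    """Count the set bits (done on the host in the current setup)."""
    return bin(word).count("1")


def ternary_dot_int8(weights, activations):
    """Dot product of ternary weights and int8 activations via AND + popcount."""
    assert len(weights) == len(activations)
    pos_mask = pack_bits([1 if w == 1 else 0 for w in weights])
    neg_mask = pack_bits([1 if w == -1 else 0 for w in weights])
    total = 0
    for plane in range(8):
        # Bit `plane` of every activation's two's-complement byte.
        plane_bits = pack_bits([((a & 0xFF) >> plane) & 1 for a in activations])
        # The MSB plane carries weight -128, which handles the sign.
        coeff = -128 if plane == 7 else (1 << plane)
        total += coeff * (popcount(pos_mask & plane_bits)
                          - popcount(neg_mask & plane_bits))
    return total


weights = [1, -1, 0, 1, -1, 0, 1, -1]
activations = [5, -3, 127, -128, 64, -1, 0, 17]
reference = sum(w * a for w, a in zip(weights, activations))
assert ternary_dot_int8(weights, activations) == reference
```

Under this encoding, each output element costs 16 AND-plus-popcount steps (8 bitplanes × 2 masks) regardless of vector length, which is why ternary weights fit the hardware so well.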

This isn't a hack — it's a line of published research (SiMRA, FracDRAM, POPCNT3) showing that the out-of-spec behavior of commodity DRAM is exploitable for compute. We built an end-to-end path from HuggingFace BitNet b1.58-2B-4T → PyTorch → DDR4-as-multiplier → next token.

What's in the explainer (10 scenes):

Scenes 1–3: DRAM basics — cells, rows, sense amps, the destructive read cycle
Scenes 4–6: how timing violations produce RowCopy, multi-row activation, and MAJ; why replication and "neutral rows" (FracDRAM's V_DD/2 trick) are needed to survive sense-amp threshold scatter
Scenes 7–8: ternary × int8 from MAJ-based ANDs, bitplane decomposition (the MSB factor is -128, which gives two's-complement sign for free), inference loop
Scene 9: the honest bottleneck — every MAJ has to ship a full DRAM row (~8 KiB) back to the host for popcount, which serialises through the DDR bus and dominates wall time
Scene 10: what'd have to change at the chip level (the POPCNT3 paper proposed doing popcount inside DRAM and reports 27×–348× vs A100 on bulk bitwise accumulation — with 256 parallel banks; we currently use ~4)
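To get a feel for the Scene 9 bottleneck, here is a back-of-envelope sketch of the readback traffic. Only the ~8 KiB row size comes from the post. The 16 ops per weight row (8 bitplanes × 2 sign masks) is an assumed encoding, and the bus rate is the theoretical peak of a 64-bit DDR4-2400 channel, which an FPGA testbench is unlikely to reach; treat the result as an optimistic lower bound on readback time, not a measurement.

```python
ROW_BYTES = 8 * 1024  # ~8 KiB per DRAM row, as stated in the post


def readback_bytes(num_maj_ops: int, row_bytes: int = ROW_BYTES) -> int:
    """Bytes shipped to the host if every MAJ result row is read back."""
    return num_maj_ops * row_bytes


def readback_seconds(num_maj_ops: int, bus_bytes_per_s: float,
                     row_bytes: int = ROW_BYTES) -> float:
    """Time spent on the bus alone, ignoring command and compute overhead."""
    return readback_bytes(num_maj_ops, row_bytes) / bus_bytes_per_s


# ASSUMPTION: 8 bitplanes x 2 sign masks = 16 ANDs per row of weights.
ops_per_weight_row = 8 * 2
# ASSUMPTION: theoretical peak, 2400e6 transfers/s * 8 bytes per transfer.
peak_bus_bytes_per_s = 2400e6 * 8

print(readback_bytes(ops_per_weight_row), "bytes per weight row")
print(readback_seconds(ops_per_weight_row, peak_bus_bytes_per_s), "s at peak bus rate")
```

Doing the popcount inside the chip would shrink each readback from a full row to a handful of counter bits, which is the change Scene 10 is about.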

The code pane on each scene references specific lines in our project repo and the relevant paper section.

Honest caveats:

Needs per-chip calibration sweeps because the timing tricks are out-of-spec: timings that work on one DIMM don't carry over to another
Samsung chips don't support multi-row activation at all (SiMRA Limitation 1)
The POPCNT3 speedups are for the accumulation kernel, not end-to-end inference
Requires an FPGA-driven DDR4 testbench today, not consumer hardware

Happy to answer questions about the calibration nightmare, the row-decoder hypothesis for why max K = 32, or why ternary weights map onto this particular kind of compute so cleanly.


How to run BitNet b1.58 inside DRAM by intentionally breaking DDR4 timing rules — interactive explainer