
Bit-exact SW-emulated FP64 on Metal, 5-11x faster than CPU HW-accelerated FP64
I was learning about RandomX and wanted to play with the algorithm on a Mac, only to discover that Apple's Metal API has no hardware FP64 support. It turns out this has been a frustration for folks in ML, science, and gaming for a while.
I went down a rabbit hole. The naive software emulation ran at ~10% of the throughput of the CPU's hardware FPU on the same machine, so I ended up obsessively squeezing every bit of juice out of the GPU to close the gap. Some of the biggest wins:
- Reducing Warp Divergence: Rewrote shifts, CLZ, big/small sorting, and rounding-mode dispatch to be completely branch-free using select (cmov), preventing lanes from serializing. Dodged MSL shift-by-64 undefined behavior with safe fallbacks rather than branching.
- Compiler Wrangling: Split the hot/cold paths. Forced the IEEE-754 NaN/Inf/Subnormal handlers into __attribute__((noinline)) functions so the hot path stays small enough for the compiler to inline aggressively into tight loops.
- Math/Algorithm Swaps: Replaced 14-iteration Newton-on-reciprocal with Berkeley-style multiplication-only refinement for fdiv/fsqrt. Mathematically proved that subnormal output is impossible for fsqrt (even the smallest positive subnormal, 2^-1074, has a square root of 2^-537, which is comfortably normal), allowing me to bypass the underflow rounding checks entirely.
- Pack/Unpack Churn: Added an _unp_ API. Instead of packing/unpacking the IEEE-754 bit pattern on every operation, tight loops (like Kahan/Welford reductions) can keep the 53-bit mantissa and exponent unpacked in registers, saving ~15-20 IR ops per call.
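To give a feel for the branch-free style in the first bullet, here is a minimal host-side C++ sketch of a "shift right and keep a sticky bit" helper that stays well-defined for any shift count, including 64 and above. It is not the code from the repo: the names `select_u64` and `shift_right_jam64` are made up for illustration, and the mask-based `select_u64` is only a stand-in for a GPU select instruction.

```cpp
#include <cstdint>

// Hypothetical stand-in for a GPU select: returns if_true when cond is set,
// otherwise if_false. Built from a mask, so there is no branch to diverge on.
static inline uint64_t select_u64(uint64_t if_false, uint64_t if_true, bool cond) {
    uint64_t mask = 0 - static_cast<uint64_t>(cond);  // all ones when cond is true
    return (if_true & mask) | (if_false & ~mask);
}

// Shift x right by n and OR any shifted-out bits into the LSB (a "sticky" bit),
// which is what significand alignment needs before rounding.
// Every shift count used below is in [0, 63], so n >= 64 never hits the
// undefined full-width shift.
static inline uint64_t shift_right_jam64(uint64_t x, uint32_t n) {
    uint32_t s = n & 63u;                      // always a legal shift count
    uint64_t shifted = x >> s;
    // Bits that fell off the bottom. The left shift by (64 - s) is split in two
    // so that s == 0 produces 0 instead of an illegal shift by 64.
    uint64_t lost = (x << (63u - s)) << 1;
    uint64_t in_range = shifted | static_cast<uint64_t>(lost != 0);
    // For n >= 64 everything is shifted out: the result is just the sticky bit.
    uint64_t saturated = static_cast<uint64_t>(x != 0);
    return select_u64(in_range, saturated, n >= 64u);
}
```

Both candidate results are always computed and the select picks one, so every lane executes the same instructions regardless of its shift count. That trades a few extra ALU ops for never serializing lanes.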
To make sure the benchmarks weren't just compiler constant-folding illusions, I chained 1024 ops per thread with a data-dependent mantissa twiddle that anchored the seed near 1.0 (keeping it in the hot path without drifting to Inf/NaN).
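A rough host-side C++ sketch of that kind of dependent chain is below. It makes several assumptions: the exact twiddle used in the repo may differ, `anchor_near_one` and `chained_mul` are hypothetical names, and a native `double` multiply stands in for the emulated op. The idea shown is to keep the fraction bits of each result but force the exponent back to that of 1.0, so every input depends on the previous output and stays in [1, 2).

```cpp
#include <cstdint>
#include <cstring>

// Bit-pattern helpers (memcpy is the portable way to type-pun in C++).
static inline uint64_t to_bits(double d) { uint64_t u; std::memcpy(&u, &d, sizeof u); return u; }
static inline double from_bits(uint64_t u) { double d; std::memcpy(&d, &u, sizeof d); return d; }

// Hypothetical twiddle: keep the 52 fraction bits of the previous result and
// overwrite sign and exponent with those of 1.0, giving a value in [1, 2).
static inline uint64_t anchor_near_one(uint64_t bits) {
    const uint64_t kFractionMask = 0x000FFFFFFFFFFFFFull;
    const uint64_t kOneExponent  = 0x3FF0000000000000ull;
    return (bits & kFractionMask) | kOneExponent;
}

// Chain `iters` dependent multiplies. Each input is derived from the previous
// output, so the loop cannot be folded into a constant without running it,
// and the re-anchoring keeps the value away from Inf/NaN and subnormals.
uint64_t chained_mul(uint64_t seed_bits, uint64_t operand_bits, int iters) {
    uint64_t acc = anchor_near_one(seed_bits);
    for (int i = 0; i < iters; ++i) {
        double r = from_bits(acc) * from_bits(operand_bits);  // stand-in for the emulated op
        acc = anchor_near_one(to_bits(r));
    }
    return acc;
}
```

Calling `chained_mul(to_bits(1.0), to_bits(1.5), 1024)` runs 1024 dependent ops; in a real benchmark the seed would come from runtime data and the final value would be written back to a buffer so the work cannot be eliminated as dead code.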
The result is 5–11× faster than a 14-thread CPU hardware-fp64 baseline on arithmetic, and 10–35× faster on conversions and comparisons (measured on an M4 Pro, 20 GPU cores).
It ships as a completely standalone, drop-in MSL header (softfloat64.metal) for C++/Swift/Objective-C, alongside a no_std pure-Rust reference implementation for lockstep cross-platform determinism.
Repo: https://github.com/guyfischman/metal-softfloat
I hope you find this useful! Let me know if you have any questions about the implementation or the optimizations.