u/Illustrious_Pace9232

I was learning about RandomX and wanted to play with the algorithm on a Mac, and discovered that Metal has no FP64 support: MSL has no double type, and Apple GPUs have no FP64 hardware. I further discovered this has been a frustration for folks in ML, science, and gaming for a while.

I went down a rabbit hole. The naive software emulation ran at ~10% of the throughput of the CPU's hardware FPU on the same machine. I ended up obsessively squeezing every bit of juice out of the GPU to fix it. Some of the biggest wins:

Reducing Warp Divergence: Rewrote shifts, CLZ, big/small sorting, and rounding-mode dispatch to be completely branch-free using select (cmov), preventing lanes from serializing. Dodged MSL shift-by-64 undefined behavior with safe fallbacks rather than branching.
Compiler Wrangling: Split the hot/cold paths. Forced the IEEE-754 NaN/Inf/Subnormal handlers into __attribute__((noinline)) functions so the hot path stays small and the compiler can inline it aggressively into tight loops.
Math/Algorithm Swaps: Replaced 14-iter Newton-on-reciprocal with Berkeley-style multiplication-only refinement for fdiv/fsqrt. Mathematically proved that subnormal output is impossible for fsqrt (the smallest positive double is 2^-1074, and its square root, 2^-537, is still far above the smallest normal, 2^-1022), allowing me to entirely bypass the underflow rounding checks.
Pack/Unpack Churn: Added an _unp_ API. Instead of packing/unpacking the IEEE-754 bit pattern on every operation, tight loops (like Kahan/Welford reductions) can keep the 53-bit mantissa and exponent unpacked in registers, saving ~15-20 IR ops per call.
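
For readers who want a feel for the branch-free shift trick in the list above, here is a minimal sketch in plain C++ (not MSL, and not code taken from the repo). It shows a "shift right and jam" helper, the kind of sticky-bit shift that softfloat rounding typically needs, written so that a shift amount of 64 or more never reaches the shift operator. The names `select_u64` and `shift_right_jam64` are hypothetical; `select_u64` stands in for MSL's `select`, and whether a given compiler emits truly branch-free code for it is an assumption, not a guarantee.

```cpp
#include <cstdint>

// Hypothetical stand-in for MSL's select(): returns b when cond is true, else a.
// Written with a mask so no branch is needed to pick the result.
static inline uint64_t select_u64(uint64_t a, uint64_t b, bool cond) {
    const uint64_t mask = 0 - static_cast<uint64_t>(cond); // all ones when cond is true
    return (a & ~mask) | (b & mask);
}

// Shift x right by dist, ORing any bits shifted out into bit 0 (the sticky bit).
// Valid for every dist, including dist >= 64, without ever shifting by 64 or more.
static inline uint64_t shift_right_jam64(uint64_t x, uint32_t dist) {
    const uint32_t d = dist & 63u;                         // always a legal shift amount
    const uint64_t lost_mask = (uint64_t{1} << d) - 1;     // the low d bits that fall off
    const uint64_t in_range =
        (x >> d) | static_cast<uint64_t>((x & lost_mask) != 0);
    const uint64_t saturated = static_cast<uint64_t>(x != 0); // everything was shifted out
    // Both candidates are computed unconditionally; select picks one.
    return select_u64(in_range, saturated, dist >= 64u);
}
```

Both results are always computed and one is discarded, which costs a few extra ALU ops but keeps every lane on the same instruction stream. On a GPU that trade is usually worthwhile when lanes would otherwise disagree about which side of the branch to take.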

To make sure the benchmarks weren't just compiler constant-folding illusions, I chained 1024 ops per thread with a data-dependent mantissa twiddle that anchored the seed near 1.0 (keeping it in the hot path without drifting to Inf/NaN).
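
A rough sketch of what such a dependency chain can look like, in plain C++ with native `double` standing in for the emulated op. This is one interpretation of the "mantissa twiddle", not the repo's actual benchmark: it assumes the twiddle keeps the result's mantissa bits and overwrites sign and exponent so the value lands in [1.0, 2.0). The names `anchor_near_one` and `chained_ops` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Bit-cast helpers (memcpy avoids strict-aliasing problems).
static inline uint64_t to_bits(double d) {
    uint64_t b;
    std::memcpy(&b, &d, sizeof b);
    return b;
}
static inline double from_bits(uint64_t b) {
    double d;
    std::memcpy(&d, &b, sizeof d);
    return d;
}

// Keep the 52 stored mantissa bits of v (so the output depends on the data),
// but force sign = 0 and biased exponent = 0x3FF, giving a value in [1.0, 2.0).
static inline double anchor_near_one(double v) {
    const uint64_t mantissa = to_bits(v) & 0x000FFFFFFFFFFFFFull;
    return from_bits(0x3FF0000000000000ull | mantissa);
}

// Chain n_ops operations so each one depends on the previous result.
// The multiply is a placeholder for whichever op is being benchmarked.
static double chained_ops(double seed, double operand, int n_ops) {
    double x = anchor_near_one(seed);
    for (int i = 0; i < n_ops; ++i) {
        x = x * operand;         // op under test
        x = anchor_near_one(x);  // re-anchor so the chain never drifts to Inf/NaN
    }
    return x;
}
```

Because every iteration consumes the previous result, the compiler cannot fold the chain into a constant as long as the seed or operand comes from runtime data (a buffer read, in the GPU case). Re-anchoring the exponent keeps every input on the normal-number path, so the special-case handlers never run.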

I ended up 5–11× faster than a 14-thread CPU hardware-fp64 baseline on arithmetic, and 10–35× faster on conversions and comparisons (measured on an M4 Pro, 20 GPU cores).

It ships as a completely standalone, drop-in MSL header (softfloat64.metal) for C++/Swift/Objective-C, alongside a no_std pure-Rust reference implementation for lockstep cross-platform determinism.

Repo: https://github.com/guyfischman/metal-softfloat

I hope you find this useful! Let me know if you have any questions about the implementation or the optimizations.

Bit-exact SW-emulated FP64 on Metal, 5-11x faster than CPU HW-accelerated FP64