
Matching engine performance challenge.
We recently published "The World's Fastest Matching Engine Algorithm" on arXiv, and we're cordially inviting the HFT community to try to prove it wrong.
Paper: https://arxiv.org/abs/2606.01183
The claim, briefly: a single CPU core sustains ~32 million orders/second per symbol at sub-microsecond tail latency under sustained multi-million-message micro-bursts — 5–11× faster than the best open-source matching engines on the same hardware. On a single 96-core instance (~$1,630/month), it reaches ~640 million messages/second across 10,000 symbols.
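For a sense of scale, here is a back-of-the-envelope sketch that uses only the approximate figures quoted above. The variable names are ours, chosen for illustration. It derives the per-order time budget on one core and the average rates implied by the 96-core aggregate. Note that the single-core figure is quoted in orders and the aggregate in messages, so the two are not directly comparable.

```python
# Figures quoted in the post (all approximate).
ORDERS_PER_SEC_ONE_CORE = 32_000_000   # single core, single symbol
AGG_MSGS_PER_SEC = 640_000_000         # whole 96-core instance
CORES = 96
SYMBOLS = 10_000

# Average time available per order on one core, in nanoseconds.
ns_per_order = 1e9 / ORDERS_PER_SEC_ONE_CORE

# Averages implied by the aggregate figure.
msgs_per_core = AGG_MSGS_PER_SEC / CORES
msgs_per_symbol = AGG_MSGS_PER_SEC / SYMBOLS

print(f"time budget per order on one core: {ns_per_order:.2f} ns")
print(f"aggregate rate per core: {msgs_per_core / 1e6:.2f} M msgs/s")
print(f"average rate per symbol: {msgs_per_symbol:,.0f} msgs/s")
```

These are plain averages. They say nothing about burst behaviour or tail latency, which is what the harness is meant to measure.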
In US equities, where marketable flow routes to whoever holds the NBBO, matching throughput isn't a vanity metric — it's the exchange's market share. Which is exactly why a claim like this deserves to be tested rather than taken on faith.
So we've opened the test, even though our engine itself stays proprietary. What's public is the harness: the deterministic workload generator, the methodology, the byte-level reference outputs, and the adapters for the open-source engines we benchmark against. With it, you can run your own engine against the same workload on your own hardware and see how it stacks up against the figures above.
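To make "deterministic workload" and "byte-level reference" concrete, here is a minimal sketch of the general idea. It is not the harness's code: the record layout, price range, and function names are hypothetical, invented for illustration. A seeded generator emits the same order stream on every run (on the same Python version), and a digest over the raw bytes gives a single value to compare.

```python
import hashlib
import random
import struct

# Hypothetical wire format for this sketch only (not the harness's format):
# order id (u64), side (u8: 0=buy, 1=sell), price in ticks (u32), quantity (u32).
ORDER = struct.Struct("<QBII")

def generate_orders(seed: int, count: int):
    """Yield the same pseudo-random order records for the same seed."""
    rng = random.Random(seed)  # private RNG, independent of global state
    for order_id in range(1, count + 1):
        side = rng.randrange(2)
        price = 10_000 + rng.randrange(-50, 51)  # ticks around a made-up mid
        qty = rng.randrange(1, 1_001)
        yield ORDER.pack(order_id, side, price, qty)

def stream_digest(seed: int, count: int) -> str:
    """SHA-256 over the raw bytes of the generated stream."""
    h = hashlib.sha256()
    for record in generate_orders(seed, count):
        h.update(record)
    return h.hexdigest()

if __name__ == "__main__":
    a = stream_digest(seed=42, count=100_000)
    b = stream_digest(seed=42, count=100_000)
    print("reproducible:", a == b)
```

The same comparison can be applied to what an engine emits: hash its output bytes for a given seed and check the digest against a reference. See the repo for how the harness itself does this.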
The adapters cover several widely cited open-source engines, along with engines whose projects claim high performance numbers (> 10 M/s, in some cases > 100 M/s), so you can see how each performs under this workload, set against the figures their projects publish. The full comparison is in the repo.
If your engine matches or beats our figures, we'd love to hear it. If you think the methodology is unfair, we want to hear that too.
No hand-waving: an open workload, an open methodology, and baselines anyone can rerun — so you can judge the comparison for yourself and find out exactly where your own engine lands.
Harness: https://github.com/flash1-dev/matching-engine-benchmark
Run it, push on it, and tell us what you find — we'll be in the comments, glad to compare notes.