
Re-ran the wasm-in-JVM and JS-in-JVM benchmarks after maintainers asked to be included — wasmtime4j and chicory-redline numbers inside, same JMH harne
A couple of weeks back I posted two benchmark write-ups: wasm-in-JVM (six backends, JPEG decode) and JS-in-JVM (Sieve of Eratosthenes). The most useful thing that happened next: Andrea Peruffo (Chicory core contributor) reached out on LinkedIn, and u/Otherwise_Sherbert21 (wasmtime4j author) reached out on Reddit — both pointing out the obvious gap. So I added wasmtime4j and chicory-redline as backends to both harnesses, kept the workloads and JMH config identical, and re-ran the lot. Sharing the updated tables here because the two new rows actually move the discussion forward.
Same host both runs: Apple M2 Max, Oracle GraalVM 25 (25+37-LTS-jvmci-b01), JMH 1.37, 1 fork, single-threaded, AverageTime mode in µs/op. Workloads byte-identical to the original posts.
wasm — proxy.wasm JPEG decode (Rust jpeg-decoder, 320×240 → 230,400 bytes RGB8; SHA-256 of decoded output identical across all eight backends).
| # | Backend | Score (µs/op) | 99.9% CI | vs fastest |
|---|---|---|---|---|
| 1 | nativeFfm | 1,016.205 | ±33.699 | 1.00× |
| 2 | graalwasm | 1,324.282 | ±352.856 | 1.30× |
| 3 | wasmtime4j | 1,419.934 | ±387.601 | 1.40× |
| 4 | chicoryRedline | 1,782.507 | ±33.860 | 1.75× |
| 5 | chicoryAotPlugin | 9,594.229 | ±116.206 | 9.44× |
| 6 | chicoryAot | 9,600.974 | ±296.031 | 9.45× |
| 7 | graalwasmInterp | 72,996.191 | ±2,938.575 | 71.84× |
| 8 | chicory | 252,427.438 | ±8,303.613 | 248.45× |
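For anyone reproducing the output cross-check, here is a minimal sketch of comparing SHA-256 digests of decoded frames across backends. The class and method names are hypothetical and the buffers are zero-filled placeholders rather than real decoder output; this is not the harness's own code, just one way the check could look using only the JDK (17+ for `HexFormat`).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class OutputDigestCheck {

    /** Hex-encoded SHA-256 of one backend's decoded output. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** True when at least one output is present and all outputs hash identically. */
    static boolean allIdentical(Map<String, byte[]> outputsByBackend) {
        Set<String> digests = new HashSet<>();
        for (byte[] output : outputsByBackend.values()) {
            digests.add(sha256Hex(output));
        }
        return digests.size() == 1;
    }

    public static void main(String[] args) {
        // Placeholder frames: 320 * 240 pixels * 3 bytes (RGB8) = 230,400 bytes each.
        byte[] first = new byte[320 * 240 * 3];
        byte[] second = first.clone();

        Map<String, byte[]> outputs = new LinkedHashMap<>();
        outputs.put("nativeFfm", first);
        outputs.put("wasmtime4j", second);
        System.out.println("identical: " + allIdentical(outputs));

        second[0] = 1; // simulate a single-byte decode difference
        System.out.println("identical after change: " + allIdentical(outputs));
    }
}
```

Hashing instead of comparing arrays pairwise keeps the check cheap to log: one short hex string per backend is enough to spot a divergent decoder.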
What each new row actually is:
| backend | engine | codegen | bridge | tier observed |
|---|---|---|---|---|
| wasmtime4j | Wasmtime 44.0.1 | Cranelift JIT | wasmtime4j JNI | JNI (Panama impl unpublished in 0.x) |
| chicoryRedline | Chicory Machine SPI | Cranelift AOT at build time | jffi → native code | redline.isNative() == true |
JS — sieve(1_000_000) = 78,498 (all 7 backends return the same answer).
| # | Backend | Score (µs/op) | 99.9% CI | vs fastest |
|---|---|---|---|---|
| 1 | graaljs | 2,799.644 | ±16.209 | 1.00× |
| 2 | graaljsInterp | 116,971.726 | ±286.025 | 41.78× |
| 3 | rquickjsFfm | 164,382.032 | ±1,129.132 | 58.72× |
| 4 | wasmtime4j | 607,403.997 | ±20,726.650 | 216.96× |
| 5 | chicoryRedline | 2,012,876.770 | ±89,055.397 | 718.96× |
| 6 | quickjs4j | 14,341,999.558 | ±54,122.651 | 5,123.51× |
| 7 | rquickjsChicory | 18,492,288.679 | ±64,727.834 | 6,605.30× |
Both new JS rows execute rquickjs.wasm (the same QuickJS-via-Rust binding used by rquickjsFfm, just delivered as wasm instead of as a cdylib).
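As a sanity check on the expected answer, here is a plain Sieve of Eratosthenes in Java. This is a hypothetical reference implementation, not the JavaScript the backends actually run; it only confirms independently that the number of primes up to 1,000,000 is 78,498.

```java
final class SieveReference {

    /** Counts primes in [2, limit] with a plain Sieve of Eratosthenes. */
    static int countPrimes(int limit) {
        if (limit < 2) {
            return 0;
        }
        boolean[] composite = new boolean[limit + 1];
        int count = 0;
        for (int i = 2; i <= limit; i++) {
            if (composite[i]) {
                continue;
            }
            count++;
            // Start at i*i: smaller multiples were already marked by smaller primes.
            // long arithmetic avoids int overflow when i*i exceeds Integer.MAX_VALUE.
            for (long j = (long) i * i; j <= limit; j += i) {
                composite[(int) j] = true;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000)); // prints 78498
    }
}
```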
Things the new rows surface that the original posts couldn't
Same engine, two bridges: FFM vs JNI. nativeFfm (1,016 µs) and wasmtime4j (1,420 µs) both run the same proxy.wasm through Wasmtime + Cranelift. The 40% gap is per-call bridge overhead: JEP 454 FFM vs wasmtime4j's JNI path. wasmtime4j 44.0.1 publishes a wasmtime4j-panama artifact, but the PanamaWasmRuntime class isn't shipped in 0.x yet, so on JDK 25 you still go through JNI. Closing that gap is upstream work, not engine work, and once the Panama impl lands I'd expect wasmtime4j and nativeFfm to converge.
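For readers who haven't used JEP 454 yet, this is roughly what an FFM downcall looks like. It is a minimal sketch that calls the C library's `strlen` rather than Wasmtime, so it illustrates the bridge primitive only and is not the nativeFfm backend's code; the class and method names are hypothetical. It assumes JDK 22 or later (where the FFM API is final) on a 64-bit platform where `size_t` maps to a Java `long`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class FfmDowncallSketch {

    // Resolve the native symbol and build the downcall handle once, not per call.
    private static final MethodHandle STRLEN;

    static {
        Linker linker = Linker.nativeLinker();
        MemorySegment symbol = linker.defaultLookup().find("strlen").orElseThrow();
        // size_t strlen(const char *): modelled as (ADDRESS) -> JAVA_LONG on 64-bit targets.
        STRLEN = linker.downcallHandle(
                symbol, FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
    }

    /** Calls the C library's strlen through an FFM downcall handle. */
    static long nativeStrlen(String s) {
        try (Arena arena = Arena.ofConfined()) {
            // Copies the string into off-heap memory as NUL-terminated UTF-8.
            MemorySegment cString = arena.allocateFrom(s);
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeStrlen("proxy.wasm"));
    }
}
```

Run it with `--enable-native-access=ALL-UNNAMED` to avoid the restricted-method warning on recent JDKs. The handle lives in a `static final` field so the JIT can treat it as a constant, which is a large part of why FFM calls can be cheaper than a JNI transition.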
AOT → native vs AOT → JVM bytecode, about 5.4× apart. chicoryRedline (1,783 µs) and chicoryAotPlugin (9,594 µs) both compile proxy.wasm at build time. The only material difference is the codegen target: native machine code through redline vs JVM bytecode through Chicory's compiler plugin. The native path wins by about 5.4× (9,594 / 1,783) despite paying for the jffi bridge on every call. JVM-bytecode AOT is not a substitute for true native compilation on this workload.
Bridge cost depends on workload, not just on the bridge. Same two backends across the two harnesses:
| | JPEG decode | Sieve |
|---|---|---|
| wasmtime4j (JNI) | 1,420 µs | 607,404 µs |
| chicoryRedline (jffi + Chicory Instance scaffold) | 1,783 µs | 2,012,877 µs |
| ratio | 1.26× | 3.31× |
On JPEG decode each benchmark op is one heavy guest call, so bridge cost is paid once and amortised across ~1 ms of native work. On Sieve each op runs millions of QuickJS-interpreter instructions, but the JVM↔guest scaffolding also has more to do per op, and Chicory's Instance export call is enough heavier than Wasmtime's call ABI that the 1.26× gap on JPEG decode widens to 3.31× here. Same engines, same bridge primitives, different workload-to-bridge-cost ratio.
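The ratio row can be recomputed from the unrounded scores in the two results tables. A small sketch (class and method names are hypothetical) that anyone can rerun with their own numbers:

```java
import java.util.Locale;

final class BridgeRatio {

    /** How many times slower the first score is than the second (both in µs/op). */
    static double ratio(double slowerMicros, double fasterMicros) {
        return slowerMicros / fasterMicros;
    }

    public static void main(String[] args) {
        // Scores copied from the wasm and JS tables above (µs/op).
        double jpegRedline = 1_782.507;
        double jpegWasmtime4j = 1_419.934;
        double sieveRedline = 2_012_876.770;
        double sieveWasmtime4j = 607_403.997;

        System.out.println(String.format(Locale.ROOT,
                "JPEG decode: %.2fx", ratio(jpegRedline, jpegWasmtime4j)));
        System.out.println(String.format(Locale.ROOT,
                "Sieve:       %.2fx", ratio(sieveRedline, sieveWasmtime4j)));
    }
}
```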
wasm sandbox tax (JS-side). wasmtime4j (607 ms) runs the same upstream rquickjs binding as rquickjsFfm (164 ms), just compiled to wasm and executed by Wasmtime + Cranelift instead of linked as a cdylib. The 3.7× gap covers wasm linear-memory bounds checks, indirect calls, and the JNI hop to enter the QuickJS interpreter. Useful number to keep in mind if you're considering "ship as wasm for portability" for a JS engine.
Floor on both workloads is unchanged. Chicory's tree-walk interpreter on JPEG decode (248×), Chicory bytecode-AOT on a JS interpreter wrapped in wasm (6,600×). These set the lower bound for "no codegen / two interpreters deep on the JVM" — useful context for the new rows but no rank-order change.
Caveats — same as before, repeated for completeness
- Single host, single fork, wide CIs on graalwasm (±353), wasmtime4j wasm row (±388), and chicoryRedline Sieve row (±89,055; that 4.4% CI is the largest absolute error in either table). Rank order is stable; the absolute spread on those rows would tighten with more forks.
- JDK = Oracle GraalVM 25. Stock OpenJDK 25 reproduces every row except graalwasm / graaljs, which depend on Graal-as-JIT (JVMCI / libgraal). Running them on Temurin / Corretto silently falls back to the Truffle interpreter row, which is the calibration trap from the original posts.
- Bridge mix is uneven across rows (FFM / JNI / jffi / direct JVM-bytecode call). A perfectly controlled "engine A vs engine B" comparison would hold the bridge constant, which is currently impossible because not every published artifact ships a Panama impl.
- One workload per harness. JPEG decode is compute-heavy with substantial memory traffic; Sieve is tight inner loops over arrays. Workloads with more allocation, more dynamic dispatch, more guest↔host round trips will reorder both tables, especially the bridge-overhead rows.
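To put the CI caveat in relative terms, here is a small sketch that turns a JMH score and its ± half-width into a percentage and checks whether two rows' intervals overlap. The class and method names are hypothetical; the inputs are the table values above.

```java
import java.util.Locale;

final class RelativeCi {

    /** CI half-width as a percentage of the score. */
    static double relativePercent(double score, double halfWidth) {
        return 100.0 * halfWidth / score;
    }

    /** True when [a - halfA, a + halfA] and [b - halfB, b + halfB] intersect. */
    static boolean overlaps(double a, double halfA, double b, double halfB) {
        return Math.abs(a - b) <= halfA + halfB;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "graalwasm:            %.1f%%",
                relativePercent(1_324.282, 352.856)));
        System.out.println(String.format(Locale.ROOT, "wasmtime4j (wasm):    %.1f%%",
                relativePercent(1_419.934, 387.601)));
        System.out.println(String.format(Locale.ROOT, "chicoryRedline Sieve: %.1f%%",
                relativePercent(2_012_876.770, 89_055.397)));
        System.out.println("graalwasm vs wasmtime4j intervals overlap: "
                + overlaps(1_324.282, 352.856, 1_419.934, 387.601));
    }
}
```

The graalwasm and wasmtime4j intervals on JPEG decode overlap, which is exactly the case more forks would help resolve.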
What's next on the Hexana side
The next release will ship tooling that lets you actually see inside one of these integrations from inside a JetBrains IDE — the wasm↔host boundaries, the codegen tier each module is running at, the bridge each call is going through. Whether seeing inside is enough to move numbers like the ones in these tables is the question the follow-up post will try to answer. Numbers, not promises.
Repros
- wasm: https://github.com/minamoto79/webasm-java-integration-benchmark
- JS: https://github.com/minamoto79/js-engine-benchmark
Both repos: mvn package builds the wasm artifacts, the Rust cdylib, and (for the wasm harness) the redline-compiled native code; then java --enable-native-access=ALL-UNNAMED -cp … runs the JMH suite. mvn exec:java will fail — JMH's forked runner can't see the project classpath that way; both READMEs spell out the workaround.
PRs welcome for backends I've missed — wasmer-java, wazero-on-JVM via JNI, additional WASI-heavy workloads. And if you're seeing materially different ratios on a different workload or JDK, I'd love to see the numbers — would help calibrate where these generalise.
Thanks again to u/Otherwise_Sherbert21 (wasmtime4j) and Andrea Peruffo (Chicory) for asking to be included. The two new rows changed how I read the original tables.