Hey r/SillyTavernAI — quantized two of TheDrummer's bigger RP finetunes to NVFP4 (4-bit) for those running RP locally on DGX Spark or other Blackwell hardware (5090, B100, GB10). Both fit on a single 128 GB UMA workstation via vLLM.

─────────────────────────────────────────────────────────

Model #1

• Model Name: Behemoth-X-123B-v2.2-NVFP4
• Model URL: https://huggingface.co/Kaleto/Behemoth-X-123B-v2.2-NVFP4
• Model Author: TheDrummer (base model: Behemoth-X-123B-v2.2, a Mistral-Large-2411 finetune; NVFP4 quant by me)
• What's Different / Better:

First publicly available NVFP4 of a 123B Mistral-Large derivative (afaict)
66 GB on disk vs ~228 GB BF16; runs on a single Spark
NVFP4 quality lands roughly in the Q5-Q6 GGUF range at Q4 size, with hardware-accelerated 4-bit GEMM on Blackwell (faster than GGUF on this hardware specifically)
Calibration came out clean (1683 quantizers, no NaN, no zeros)
3-node distributed quant pipeline (open-source, see end) was needed because half of Behemoth in BF16 is ~115 GB and a 2-Spark UMA setup hit Linux OOM during calibration
• Backend: vLLM 0.20.2 with the Avarok-stack env vars and serve flags:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
--attention-backend flashinfer --quantization compressed-tensors --kv-cache-dtype fp8 --max-model-len 32768 --gpu-memory-utilization 0.90
• Settings (from Drummer's "chaos edition" testing):
Chat template: Metharme with Mistral system tokens [SYSTEM_PROMPT]<|system|>{{system}}[/SYSTEM_PROMPT]<|user|>...
Temperature: 0.95 – 1.05
min-p: 0.025
smoothing_factor: 0.2
DRY: off (Drummer's notes don't call for it)
On a single Spark: ~3.2 tok/s decode (short context)
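
For anyone who wants to script the launch, here is a minimal Python sketch that assembles the env vars and flags listed above into one command. It assumes the flags are passed to `vllm serve` and that the model is pulled by its Hugging Face repo name; the helper names (`build_launch`, `as_shell_line`) are made up for illustration and are not part of vLLM.

```python
import os
import shlex

# Repo name taken from the model URL in the post.
MODEL = "Kaleto/Behemoth-X-123B-v2.2-NVFP4"

# Avarok-stack env vars, copied verbatim from the post.
MARLIN_ENV = {
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",
    "VLLM_TEST_FORCE_FP8_MARLIN": "1",
    "VLLM_MARLIN_USE_ATOMIC_ADD": "1",
}

# Serve flags, copied verbatim from the post.
SERVE_FLAGS = [
    "--attention-backend", "flashinfer",
    "--quantization", "compressed-tensors",
    "--kv-cache-dtype", "fp8",
    "--max-model-len", "32768",
    "--gpu-memory-utilization", "0.90",
]


def build_launch(model=MODEL, use_marlin=True):
    """Return (env, argv) suitable for subprocess.run(argv, env=env)."""
    env = dict(os.environ)  # copy, so the current process env is not modified
    if use_marlin:
        env.update(MARLIN_ENV)
    # Assumption: the flags go to the `vllm serve` entry point.
    argv = ["vllm", "serve", model, *SERVE_FLAGS]
    return env, argv


def as_shell_line(model=MODEL, use_marlin=True):
    """Render the same launch as a single copy-pasteable shell line."""
    prefix = [f"{k}={v}" for k, v in MARLIN_ENV.items()] if use_marlin else []
    command = shlex.join(["vllm", "serve", model, *SERVE_FLAGS])
    return " ".join(prefix + [command])


if __name__ == "__main__":
    print(as_shell_line())
    # To actually start the server (needs vLLM installed):
    #   import subprocess
    #   env, argv = build_launch()
    #   subprocess.run(argv, env=env, check=True)
```

Calling `as_shell_line(use_marlin=False)` gives the same command without the three env vars.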

─────────────────────────────────────────────────────────

Model #2

• Model Name: Anubis-Pro-105B-NVFP4
• Model URL: https://huggingface.co/Kaleto/Anubis-Pro-105B-NVFP4
• Model Author: TheDrummer (base model: Anubis-Pro-105B-v1, a Llama-3.3-70B upscale to 105B; NVFP4 quant by me)
• What's Different / Better:

First publicly available NVFP4 of a 100B+ RP/storytelling Llama-3.3 finetune (afaict)
58 GB on disk vs ~196 GB BF16
+22 % decode speedup over stock vLLM when serving with the Avarok-stack MARLIN+FlashInfer env vars (measured, not extrapolated — 5-run median, std-dev <1 %)
Calibration clean (840 quantizers, no NaN, no zeros)
Same pipeline + same fix-list as Behemoth above
• Backend: vLLM 0.20.2 with the same Avarok-stack env vars as Behemoth above. Drop the env vars to fall back to stock vLLM (CUTLASS GEMM); the model serves either way, MARLIN is just faster.
• Settings (community "Setting A" from the model card):
Chat template: Llama 3
Temperature: 0.75
min-p: 0.01
smoothing_factor: 0.2, smoothing_curve: 2
DRY: multiplier 4, allowed_length 1, base 3, temp_last
On a single Spark: ~3.8 tok/s decode (short context), ~520 s cold load
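
The disk sizes quoted for both models can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes nominal parameter counts read off the model names (123e9 and 105e9). It also assumes NVFP4 costs about 4.5 bits per weight: 4-bit values plus one 8-bit scale per block of 16. Layers kept in higher precision (embeddings, output head) are ignored, so real checkpoints should come out somewhat larger than this estimate. Results are in binary gigabytes (GiB), which is what the BF16 figures in the post appear to line up with.

```python
GIB = 2**30  # bytes per binary gigabyte


def bf16_gib(params):
    """BF16 stores 2 bytes per weight."""
    return params * 2 / GIB


def nvfp4_gib(params, block=16):
    """Rough NVFP4 size: 4-bit values plus one 8-bit scale per block.

    Assumption: every weight is quantized, which is not true for a real
    checkpoint, so treat this as a lower bound.
    """
    bits_per_weight = 4 + 8 / block  # 4.5 bits when block == 16
    return params * bits_per_weight / 8 / GIB


# Nominal parameter counts taken from the model names (an assumption).
MODELS = {"Behemoth-X-123B": 123e9, "Anubis-Pro-105B": 105e9}

if __name__ == "__main__":
    for name, params in MODELS.items():
        print(f"{name}: BF16 ~{bf16_gib(params):.0f} GiB, "
              f"NVFP4 lower bound ~{nvfp4_gib(params):.0f} GiB")
```

The lower bounds come out a few GiB under the 66 GB and 58 GB on-disk figures, which is the direction you would expect if some layers stay unquantized.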

─────────────────────────────────────────────────────────

Notes for the audience:

NVFP4 vs GGUF: NVFP4 typically lands in the Q5-Q6 quality range at Q4 size. It's specifically the vLLM-on-Blackwell path. If you're on llama.cpp or Apple Silicon, bartowski / mradermacher already have GGUFs of both — use those instead.
Honest disclaimer on calibration: I used modelopt's stock NVFP4_DEFAULT_CFG with 256 cnn_dailymail samples. NOT the agentic-mix-tuned -GB10 recipe from saricles. RP-quality comparison vs i1/imatrix Q6_K from anyone who runs the A/B test would be very welcome.
License: Anubis-Pro = Llama 3.3 Community License. Behemoth = Mistral Research License (research/non-commercial).
Pipeline source (open, Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4 (the same toolchain that produced both quants). It has resume-from-checkpoint, an N-shard mode, and a smoke test that validates a 7B in ~1 min before committing to a 100B run.

Big thanks to TheDrummer for the finetunes, Avarok-Cybersecurity for the MARLIN-NVFP4 port that makes the speedup real on Spark, and saricles for setting the bar on Spark-tuned recipes. Feedback / quality reports welcome 🙏

u/KaletoAI

Two NVFP4 quants of TheDrummer's bigger RP finetunes (Behemoth-X-123B + Anubis-Pro-105B) for DGX Spark / Blackwell