
Two NVFP4 quants of TheDrummer's bigger RP finetunes (Behemoth-X-123B + Anubis-Pro-105B) for DGX Spark / Blackwell
Hey r/SillyTavernAI — quantized two of TheDrummer's bigger RP finetunes to NVFP4 (4-bit) for those running RP locally on DGX Spark or other Blackwell hardware (5090, B100, GB10). Both fit on a single 128 GB UMA workstation via vLLM.
─────────────────────────────────────────────────────────
Model #1
• Model Name: Behemoth-X-123B-v2.2-NVFP4
• Model URL: https://huggingface.co/Kaleto/Behemoth-X-123B-v2.2-NVFP4
• Model Author: TheDrummer (base model: Behemoth-X-123B-v2.2, a Mistral-Large-2411 finetune; NVFP4 quant by me)
• What's Different / Better:
- First publicly available NVFP4 of a 123B Mistral-Large derivative (afaict)
- 66 GB on disk vs ~228 GB BF16; runs on a single Spark
- NVFP4 quality lands roughly in the Q5-Q6 GGUF range at Q4 size, with hardware-accelerated 4-bit GEMM on Blackwell (faster than GGUF on this hardware specifically)
- Calibration came out clean (1683 quantizers, no NaN, no zeros)
- A 3-node distributed quant pipeline (open source, see end) was needed because half of Behemoth in BF16 is ~115 GB and a 2-Spark UMA setup hit Linux OOM during calibration
• Backend: vLLM 0.20.2 with the Avarok-stack env vars and serve flags:
  VLLM_NVFP4_GEMM_BACKEND=marlin VLLM_TEST_FORCE_FP8_MARLIN=1 VLLM_MARLIN_USE_ATOMIC_ADD=1
  --attention-backend flashinfer --quantization compressed-tensors --kv-cache-dtype fp8 --max-model-len 32768 --gpu-memory-utilization 0.90
• Settings (from Drummer's "chaos edition" testing):
- Chat template: Metharme with Mistral system tokens [SYSTEM_PROMPT]<|system|>{{system}}[/SYSTEM_PROMPT]<|user|>...
- Temperature: 0.95 – 1.05
- min-p: 0.025
- smoothing_factor: 0.2
- DRY: off (Drummer's notes don't call for it)
- On a single Spark: ~3.2 tok/s decode (short context)
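For anyone scripting the launch, here is a minimal sketch that assembles the env vars and flags from the Backend line above into one command. It assumes the standard `vllm serve` entry point and that the Hugging Face repo id from the model URL is passed as the model argument; the helper names are made up for illustration, and nothing is actually started.

```python
import os
import shlex

# Env vars and flags copied from the Backend line of the post.
AVAROK_ENV = {
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",
    "VLLM_TEST_FORCE_FP8_MARLIN": "1",
    "VLLM_MARLIN_USE_ATOMIC_ADD": "1",
}
SERVE_FLAGS = [
    "--attention-backend", "flashinfer",
    "--quantization", "compressed-tensors",
    "--kv-cache-dtype", "fp8",
    "--max-model-len", "32768",
    "--gpu-memory-utilization", "0.90",
]


def build_launch(model_id):
    """Return (env, argv) for a `vllm serve` launch; does not start anything."""
    env = dict(os.environ)
    env.update(AVAROK_ENV)
    # Assumption: `vllm serve <model>` is the entry point in your install.
    argv = ["vllm", "serve", model_id, *SERVE_FLAGS]
    return env, argv


def as_shell_line(model_id):
    """Render the same launch as a single copy-pasteable shell line."""
    _, argv = build_launch(model_id)
    prefix = [f"{key}={value}" for key, value in AVAROK_ENV.items()]
    return " ".join(prefix + [shlex.quote(arg) for arg in argv])


if __name__ == "__main__":
    print(as_shell_line("Kaleto/Behemoth-X-123B-v2.2-NVFP4"))
```

To launch from Python instead of a shell, pass the pair to `subprocess.run(argv, env=env)`.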
─────────────────────────────────────────────────────────
Model #2
• Model Name: Anubis-Pro-105B-NVFP4
• Model URL: https://huggingface.co/Kaleto/Anubis-Pro-105B-NVFP4
• Model Author: TheDrummer (base model: Anubis-Pro-105B-v1, a Llama-3.3-70B upscale to 105B; NVFP4 quant by me)
• What's Different / Better:
- First publicly available NVFP4 of a 100B+ RP/storytelling Llama-3.3 finetune (afaict)
- 58 GB on disk vs ~196 GB BF16
- +22% decode speedup over stock vLLM when serving with the Avarok-stack MARLIN+FlashInfer env vars (measured, not extrapolated: 5-run median, std-dev <1%)
- Calibration clean (840 quantizers, no NaN, no zeros)
- Same pipeline + same fix-list as Behemoth above
• Backend: vLLM 0.20.2 with the same Avarok-stack env vars as Behemoth above. Drop the env vars to fall back to stock vLLM (CUTLASS GEMM); the model serves either way, MARLIN is just faster.
• Settings (community "Setting A" from the model card):
- Chat template: Llama 3
- Temperature: 0.75
- min-p: 0.01
- smoothing_factor: 0.2, smoothing_curve: 2
- DRY: multiplier 4, allowed_length 1, base 3, temp_last
- On a single Spark: ~3.8 tok/s decode (short context), ~520 s cold load
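If you hit the vLLM server directly instead of going through a frontend, the temperature and min-p values from Setting A can go straight into the request body. This is a rough sketch, assuming vLLM's OpenAI-compatible chat endpoint on its default port and that it accepts `min_p` as an extra sampling field. The endpoint URL, the `max_tokens` value and the helper name are placeholders. smoothing_factor and DRY are left out, since whether the vLLM endpoint accepts them is not confirmed here; set those in your frontend if it supports them.

```python
import json

# Hypothetical local endpoint; adjust host/port to your own server.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model_id, user_text, temperature=0.75, min_p=0.01,
                       max_tokens=512):
    """Build a chat-completions body with the Setting A temperature and min-p."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
        "min_p": min_p,            # extra sampling field, assumed accepted by vLLM
        "max_tokens": max_tokens,  # placeholder value, not from the model card
    }


if __name__ == "__main__":
    body = build_chat_request("Kaleto/Anubis-Pro-105B-NVFP4", "Describe the tavern.")
    print(f"POST {ENDPOINT}")
    print(json.dumps(body, indent=2))
```

For Behemoth, pass a temperature inside the 0.95 to 1.05 range and `min_p=0.025` instead of the defaults.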
─────────────────────────────────────────────────────────
Notes for the audience:
- NVFP4 vs GGUF: NVFP4 typically lands in the Q5-Q6 quality range at Q4 size. It's specifically the vLLM-on-Blackwell path. If you're on llama.cpp or Apple Silicon, bartowski / mradermacher already have GGUFs of both — use those instead.
- Honest disclaimer on calibration: I used modelopt's stock NVFP4_DEFAULT_CFG with 256 cnn_dailymail samples. NOT the agentic-mix-tuned -GB10 recipe from saricles. RP-quality comparison vs i1/imatrix Q6_K from anyone who runs the A/B test would be very welcome.
- License: Anubis-Pro = Llama 3.3 Community License. Behemoth = Mistral Research License (research/non-commercial).
- Pipeline source (open, Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4
  Same toolchain that produced both. Resume-from-checkpoint, N-shard mode, smoke test that validates a 7B in ~1 min before committing to a 100B run.
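The size figures quoted for both models can be sanity-checked with a bit of arithmetic. The sketch below assumes nominal parameter counts taken from the model names (123e9 and 105e9), reads the post's "GB" as GiB, and models NVFP4 as 4-bit values with one 8-bit scale per block of 16 weights (4.5 bits per weight). It ignores per-tensor scales and any layers left unquantized, so the NVFP4 number is a floor rather than a prediction of the on-disk size.

```python
GIB = 1024 ** 3


def bf16_gib(params):
    """BF16 stores 2 bytes per weight."""
    return params * 2 / GIB


def nvfp4_gib(params, block_size=16, scale_bits=8):
    """4-bit weights plus one scale per block; ignores unquantized layers."""
    bits_per_weight = 4 + scale_bits / block_size  # 4.5 with the defaults
    return params * bits_per_weight / 8 / GIB


# Nominal parameter counts from the model names (assumption, not measured).
for name, params in [("Behemoth-X-123B", 123e9), ("Anubis-Pro-105B", 105e9)]:
    print(f"{name}: BF16 ~{bf16_gib(params):.0f} GiB, "
          f"NVFP4 floor ~{nvfp4_gib(params):.0f} GiB")
```

Both floors come out a few GiB under the reported 66 GB and 58 GB, which is what you would expect if some tensors stay in higher precision.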
Big thanks to TheDrummer for the finetunes, Avarok-Cybersecurity for the MARLIN-NVFP4 port that makes the speedup real on Spark, and saricles for setting the bar on Spark-tuned recipes. Feedback / quality reports welcome 🙏