DGX Spark + vLLM: 2x NVFP4 models for an internal AI platform. Does this architecture make sense?
I'm building an internal AI platform for my company (around 10-20 active users).
Current stack:
- DGX Spark (128GB Unified Memory)
- LiteLLM as the gateway/router
- vLLM (2 separate instances)
- Open WebUI
- VS Code (Copilot Chat + MCP)
Planned models:
Chat / Review:
- nvidia/Gemma-4-26B-A4B-NVFP4
Coding / Agent:
- nvidia/Qwen3.6-35B-A3B-NVFP4
Current idea:
- Both vLLM instances are configured for long context (128K-262K tokens).
- LiteLLM routes requests based on workload and prompt size.
- Most coding/chat requests are limited to 32K context.
- Long-context requests are allowed only when necessary.
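To make the memory side of this concrete, here is a rough back-of-the-envelope sketch of KV cache size per sequence. It assumes plain full attention in every layer with a 2-byte KV cache, which may well overestimate models that use sliding-window or hybrid attention. The layer, head, and dimension values are placeholders I made up for illustration, not numbers from either model card.

```python
def kv_cache_gib(tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in GiB for `tokens` cached tokens.

    Assumes full attention in every layer: one K and one V vector
    per layer, per KV head, per token.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return tokens * bytes_per_token / 2**30


# Hypothetical architecture numbers, NOT taken from either model card.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

for ctx in (32_768, 131_072, 262_144):
    size = kv_cache_gib(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ctx:>7} tokens: {size:.1f} GiB per sequence")
```

With these placeholder numbers, one 262K-token request needs as much cache as eight 32K-token requests, which is the trade-off I'm trying to reason about.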
Typical routing:
- Chat -> Gemma
- Code generation -> Qwen
- Code review -> Gemma
- Large repo/document analysis -> Long-context model
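The routing decision above can be sketched in plain Python. This is only the logic I have in mind, not LiteLLM configuration or its API. The alias names, the task labels, and the exact token threshold are hypothetical placeholders.

```python
FAST_CONTEXT_LIMIT = 32 * 1024  # assumed cutoff for the "normal" path, in tokens

# Hypothetical model aliases; the real gateway names may differ.
ROUTES = {
    "chat": "gemma-chat",
    "code_review": "gemma-chat",
    "code_generation": "qwen-coder",
}
LONG_CONTEXT_ALIAS = "long-context"


def pick_model(task, prompt_tokens):
    """Return the model alias for a request, based on task type and prompt size."""
    if task not in ROUTES:
        raise ValueError(f"unknown task: {task!r}")
    # Oversized prompts go to the long-context path regardless of task.
    if prompt_tokens > FAST_CONTEXT_LIMIT:
        return LONG_CONTEXT_ALIAS
    return ROUTES[task]


print(pick_model("chat", 1_200))
print(pick_model("code_generation", 200_000))
```

Checking prompt size before the task type keeps the 32K path free of large prompts, so a single big repo analysis doesn't stall short chat or coding requests.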
The goal isn't benchmarking. I care more about:
- Low latency
- Good concurrency
- Stable production behavior
- Efficient GPU memory usage
Questions:
- Would you run both models with max_model_len set to 128K-262K, or split them into separate "fast" (32K) and "long-context" vLLM instances?
- Any recommended vLLM tuning for DGX Spark (gpu-memory-utilization, max-num-seqs, max-num-batched-tokens, chunked prefill, speculative decoding, etc.)?
- Has anyone benchmarked these NVFP4 models under concurrent real-world workloads (agent + MCP + coding), not just single-user tokens/sec?
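For context on what I mean by a concurrent measurement, this is a minimal sketch of the kind of load test I'd run. `fake_request` is a stand-in that only sleeps. In a real test it would be replaced by an async HTTP call to the gateway, and the prompts would be recorded agent and coding traffic rather than dummy strings.

```python
import asyncio
import math
import statistics
import time


async def fake_request(prompt):
    """Placeholder for a real call to the gateway; it only simulates latency."""
    await asyncio.sleep(0.01)
    return "ok"


async def run_load(send, prompts, concurrency):
    """Send all prompts with at most `concurrency` in flight; return latencies in seconds."""
    sem = asyncio.Semaphore(concurrency)
    latencies = []

    async def one(prompt):
        async with sem:
            start = time.perf_counter()
            await send(prompt)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one(p) for p in prompts))
    return latencies


def summarize(latencies):
    """Median and nearest-rank 95th percentile of the measured latencies."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"n": len(ordered), "p50": statistics.median(ordered), "p95": ordered[p95_index]}


if __name__ == "__main__":
    prompts = [f"request {i}" for i in range(40)]
    print(summarize(asyncio.run(run_load(fake_request, prompts, concurrency=10))))
```

I'd sweep `concurrency` from 1 up to around 20 to match my user count and watch how p95 latency moves, rather than looking at single-stream throughput.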
I'd love to hear any production experience or lessons learned.