u/illNin0

I'm building an internal AI platform for my company (around 10-20 active users).

Current stack:

- DGX Spark (128 GB unified memory)
- LiteLLM as the gateway/router
- vLLM (2 separate instances)
- Open WebUI
- VS Code (Copilot Chat + MCP)

Planned models:

Chat / Review
- nvidia/Gemma-4-26B-A4B-NVFP4

Coding / Agent
- nvidia/Qwen3.6-35B-A3B-NVFP4

Current idea:

- Both vLLM instances support long context (128K-262K).
- LiteLLM routes requests based on workload and prompt size.
- Most coding/chat requests are limited to 32K context.
- Long-context requests are allowed only when necessary.

Typical routing:

- Chat -> Gemma
- Code generation -> Qwen
- Code review -> Gemma
- Large repo/document analysis -> Long-context model
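To make the routing idea concrete, here is a minimal sketch of the decision logic in plain Python. It is not LiteLLM's API: the model aliases, the `pick_model` helper, and the 4-characters-per-token estimate are all assumptions for illustration, and a real setup would count tokens with the model's tokenizer.

```python
# Hypothetical aliases; the real names would be whatever is registered in the gateway.
CHAT_MODEL = "gemma-chat"
CODE_MODEL = "qwen-code"
LONG_MODEL = "long-context"

FAST_CONTEXT_LIMIT = 32_000  # tokens; the 32K cap for normal chat/coding requests


def estimate_tokens(text: str) -> int:
    """Crude size estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return len(text) // 4 + 1


def pick_model(task: str, prompt: str) -> str:
    """Route by prompt size first, then by task type."""
    # Anything over the fast limit goes to the long-context model, whatever the task.
    if estimate_tokens(prompt) > FAST_CONTEXT_LIMIT:
        return LONG_MODEL
    routes = {
        "chat": CHAT_MODEL,
        "code_generation": CODE_MODEL,
        "code_review": CHAT_MODEL,
        "analysis": LONG_MODEL,
    }
    # Unknown task types fall back to the chat model.
    return routes.get(task, CHAT_MODEL)
```

Checking size before task type means an oversized chat or coding prompt never lands on an instance that would reject it for exceeding its context limit.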

The goal isn't benchmarking. I care more about:
- Low latency
- Good concurrency
- Stable production behavior
- Efficient GPU memory usage
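For the memory side, this is the back-of-the-envelope KV-cache arithmetic I'm working from. The layer count, KV-head count, and head size below are placeholders, not the real configs of either model, and the formula assumes every layer uses full attention with a 16-bit cache. Models with sliding-window or hybrid attention layers would need less.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   bytes_per_elem: int, num_tokens: int) -> int:
    """KV cache size for one sequence: keys + values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * num_tokens


GIB = 1024 ** 3

# Placeholder architecture values (assumptions, not taken from any model card).
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 48, 4, 128, 2

for context in (32_768, 131_072, 262_144):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES, context)
    print(f"{context:>7} tokens -> {size / GIB:.1f} GiB per full-length request")
```

With these placeholder numbers a single full 262K request needs eight times the cache of a 32K one, which is why I'm unsure whether one long-context instance can also serve many concurrent short requests well.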

Questions:

- Would you run both models with 128K-262K max_model_len, or create separate "fast" (32K) and "long-context" vLLM instances?
- Any recommended vLLM tuning for DGX Spark (gpu-memory-utilization, max-num-seqs, max-num-batched-tokens, chunked prefill, speculative decoding, etc.)?
- Has anyone benchmarked these NVFP4 models under concurrent real-world workloads (agent + MCP + coding), not just single-user tokens/sec?
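To show what I mean by separate "fast" and "long-context" instances, here is a rough sketch that builds the two `vllm serve` command lines. Every numeric value (ports, memory fractions, sequence counts) is a placeholder I made up, not a tested recommendation, and which model gets which profile is just one possible assignment.

```python
import shlex


def vllm_serve_args(model: str, port: int, max_model_len: int,
                    gpu_mem_fraction: float, max_num_seqs: int) -> list[str]:
    """Build a `vllm serve` command line for one instance."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_mem_fraction),
        "--max-num-seqs", str(max_num_seqs),
        "--enable-chunked-prefill",
    ]


# Placeholder profiles: a short-context, high-concurrency instance and a
# long-context, low-concurrency instance sharing the same unified memory.
fast = vllm_serve_args("nvidia/Qwen3.6-35B-A3B-NVFP4", 8001, 32_768, 0.40, 16)
long_ctx = vllm_serve_args("nvidia/Gemma-4-26B-A4B-NVFP4", 8002, 131_072, 0.40, 4)

# Both instances share one memory pool, so the fractions must leave headroom.
assert 0.40 + 0.40 < 1.0

for args in (fast, long_ctx):
    print(shlex.join(args))
```

The part I can't judge is the trade-off: a lower max-model-len with a higher max-num-seqs for the fast instance versus the reverse for the long-context one.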

I'd love to hear any production experience or lessons learned.

DGX Spark + vLLM: 2x NVFP4 models for an internal AI platform. Does this architecture make sense?

Am I Crazy?