▲ 5 r/Vllm+1 crossposts

DGX Spark + vLLM: 2x NVFP4 models for an internal AI platform. Does this architecture make sense?

I'm building an internal AI platform for my company (around 10-20 active users).

Current stack:

- DGX Spark (128GB Unified Memory)
- LiteLLM as the gateway/router
- vLLM (2 separate instances)
- Open WebUI
- VS Code (Copilot Chat + MCP)

Planned models:

Chat / Review
- nvidia/Gemma-4-26B-A4B-NVFP4

Coding / Agent
- nvidia/Qwen3.6-35B-A3B-NVFP4

Current idea:

- Both vLLM instances support long context (128K-262K).
- LiteLLM routes requests based on workload and prompt size.
- Most coding/chat requests are limited to 32K context.
- Long-context requests are allowed only when necessary.
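
A minimal sketch of the prompt-size check described above, meant to run before a request is handed to the gateway. It is not LiteLLM's API: `estimate_tokens`, `pick_tier`, and the 4-characters-per-token heuristic are assumptions for illustration, and a real setup would count tokens with the model's own tokenizer.

```python
STANDARD_LIMIT = 32 * 1024  # default budget for most chat/coding requests
CHARS_PER_TOKEN = 4         # assumption: crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate (ceiling of len/4); swap in a tokenizer for accuracy."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def pick_tier(prompt: str, max_output_tokens: int = 1024, allow_long: bool = False) -> str:
    """Return 'standard' or 'long'; raise if long context is needed but not allowed."""
    needed = estimate_tokens(prompt) + max_output_tokens
    if needed <= STANDARD_LIMIT:
        return "standard"
    if allow_long:
        return "long"
    raise ValueError(f"request needs ~{needed} tokens; long context is not enabled")
```

Making long context opt-in (`allow_long`) keeps the common path on the 32K budget and forces callers to ask explicitly for the expensive one.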

Typical routing:

- Chat -> Gemma
- Code generation -> Qwen
- Code review -> Gemma
- Large repo/document analysis -> Long-context model
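
The routing list above could be written down as a plain lookup table, roughly like this. The model names come from the plan; the task labels, the `route` helper, and the `long-context` alias are hypothetical names for this sketch, since the plan does not say which model serves long-context requests.

```python
GEMMA = "nvidia/Gemma-4-26B-A4B-NVFP4"
QWEN = "nvidia/Qwen3.6-35B-A3B-NVFP4"

# Task labels are hypothetical; they mirror the routing list in the post.
ROUTES = {
    "chat": GEMMA,
    "code_generation": QWEN,
    "code_review": GEMMA,
}

# Placeholder alias: the plan does not name the long-context model.
LONG_CONTEXT_ALIAS = "long-context"


def route(task: str, needs_long_context: bool = False) -> str:
    """Pick a model (or alias) for a task; long-context requests override the table."""
    if needs_long_context:
        return LONG_CONTEXT_ALIAS
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

For example, `route("code_review")` returns the Gemma model, while `route("chat", needs_long_context=True)` returns the alias regardless of task.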

The goal isn't benchmarking. I care more about:
- Low latency
- Good concurrency
- Stable production behavior
- Efficient GPU memory usage

Questions:

  1. Would you run both models with 128K-262K max_model_len, or create separate "fast (32K)" and "long-context" vLLM instances?
  2. Any recommended vLLM tuning for DGX Spark (gpu-memory-utilization, max-num-seqs, max-num-batched-tokens, chunked prefill, speculative decoding, etc.)?
  3. Has anyone benchmarked these NVFP4 models under concurrent real-world workloads (agent + MCP + coding), not just single-user tokens/sec?
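
For question 1, a back-of-the-envelope KV-cache estimate shows why context length matters for memory. This sketch uses the textbook formula for full-attention layers; the layer count, KV-head count, and head dimension below are hypothetical placeholders, not the real configuration of either model, and layers that use sliding-window or linear attention would need less.

```python
def kv_cache_bytes(tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: keys + values across all full-attention layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical architecture values for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

for ctx in (32 * 1024, 128 * 1024, 256 * 1024):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache per sequence")
```

With these placeholder values, one full 32K sequence takes about 3 GiB and one 256K sequence about 24 GiB, so a single long request can use the cache space of eight standard ones.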

I'd love to hear any production experience or lessons learned.

u/illNin0 — 1 day ago
▲ 2 r/LocalLLM+1 crossposts

Am I Crazy?

I wonder if I'm wasting time and money (GPU) building a self-hosted AI stack:

- Local LLMs
- AI gateway
- Model routing
- Cost monitoring

It's fun. I learned a lot. But it makes exactly $0.

  • Anyone else building AI infrastructure purely because they enjoy it?
  • Are companies actually hiring AI Infrastructure Engineers yet?
u/illNin0 — 15 days ago