u/AmanNonZero

Image 1 —
Image 2 —
▲ 706 r/ollama

Hey all. Just set up a workstation with two NVIDIA RTX PRO 6000 Blackwells (96GB VRAM each) for our design studio. Want to use Ollama as our main local inference layer.

What we want to do with it:

  1. Internal copilot for a ~60 person team. research, writing, brief analysis, code assist
  2. Backend for agentic tools we're building (API access is a big reason we picked Ollama)
  3. Run the biggest, best models our hardware can handle

Specific questions:

  • How well does Ollama handle dual GPU setups out of the box? Any config needed for tensor parallelism across both cards?
  • What models would you recommend at this VRAM level? Thinking Llama 3.1 70B unquantized, maybe even 405B at Q4?
  • Anyone serving Ollama to a team via Open WebUI or similar? How's the experience at 10-15 concurrent users?
  • Any gotchas with large model loading times or memory management I should know about?

First time running Ollama beyond hobby experiments, so any production-ish tips are appreciated. Will report back with what works.

------

UPDATE FOR OTHERS & THANKS FOR THE HELP . THIS SUB WASN'T AS SNARKY AND IN FACT A LOT MORE HELPFUL THAN THE OTHER ONE.

For context: we're a design agency rendering 3D animations, VR/AR walkthroughs, and architectural visualizations. Not generating AI images or running Stable Diffusion farms. The dual RTX Pro 6000s (96 GB VRAM each) are a dedicated render node that processes overnight animation batches and path-traced scenes while our design team stays productive on their own workstations. Cloud rendering costs add up absurdly fast at our project volume. Owning the hardware pays for itself in months. OctaneRender and Redshift scale linearly across both GPUs, which turns 12+ hour VR renders into something we can actually deliver on client deadlines.

Key Technical Advice & Actionables

Infrastructure Stack (Overwhelming Consensus)

Switch from Ollama to vLLM or llama.cpp

  • 169 upvotes on "Tip #1 don't use Ollama"
  • 109 upvotes on criticism of using Ollama with $25k hardware
  • vLLM is the top recommendation for multi-user concurrency (your 10-15 concurrent users scenario)
  • llama.cpp is acceptable for single-user or simpler setups, but vLLM wins for parallelization

Use Linux instead of Windows

  • 266 upvotes on "Tip #2 use Linux"
  • Ubuntu LTS 24.04 most recommended for NVIDIA driver support
  • Debian headless for maximum resource efficiency
  • Debate exists: some claim Windows CUDA drivers are 2-3% faster for pure VRAM inference, but Linux wins for stability and virtual memory handling

Model Recommendations

Stop using Llama 3.1 70B (described as "ancient" and "severely outdated")

  • Minimax M2.7 (230B MoE, 10B active) with NVFP4 quantization — perfect fit for your dual 96GB setup
  • Qwen 3.5/3.6 series (27B, 35B MoE, 122B) — excellent dense models, great for agentic tasks
  • Gemma 4 — recommended if you need "western" models (some companies ban Chinese models)
  • Mistral Medium 3.5 (119B MoE) or new Mistral 128B dense — good for massive context windows

Critical Configuration Settings

Use Tensor Parallelism (tp=2)

  • Splits model across both GPUs for unified inference
  • Doubles speed and allows models up to ~180-190GB total
  • Essential command: --tp 2 in vLLM or llama.cpp

Use NVFP4 Quantization

  • Hardware-accelerated 4-bit format specifically for Blackwell architecture
  • Minimax M2.7 NVFP4 fits in 130.6GB (down from 230GB)
  • Multiple users emphasized this is purpose-built for your cards

Optimize for Concurrency

  • Use litellm as a model router in front of vLLM for rate limiting and monitoring
  • Set --gpu-memory-utilization 0.9 or higher to maximize KV cache
  • SGLang recommended over vLLM if team works on same projects (prefix caching with RadixAttention)
  • For 60-person team: expect 5-8 simultaneous users per card on 70B Q4 before throughput drops

System Architecture

Cooling & Power Management

  • GPU spacing: minimum 2 slots apart for adequate airflow
  • Consider power limiting cards to reduce heat and increase stability
  • Script fixed clock times (10MHz below stock) to prevent PCIe bus spikes
  • Heat management is critical for sustained inference loads

RAM Requirements

  • Minimum 256GB system RAM
  • Recommendation: 2× VRAM = 384-512GB system RAM for optimal performance
  • Essential for virtual memory handling during large context operations

Frontend & User Access

  • Open WebUI is acceptable for team deployment (contrary to one dismissive comment)
  • Alternative: Set up litellm for monitoring, rate limiting, API key generation
  • Some debate about OpenWebUI in 2026, but no clear superior alternative mentioned for your use case

Specific Guides & Resources Mentioned

  1. vLLM Blackwell guide: https://github.com/lastloop-ai/vllm-blackwell-guide (120+ t/s on Qwen 27B, 200+ t/s on 35B MoE)
  2. Ollama agent configs: https://github.com/caliber-ai-org/ai-setup (888 stars, production patterns for team deployment)
  3. llama-swap tool for dynamic model switching without container restarts

Hiring & Operational Advice

Top upvoted wisdom (113+ votes on original thread you referenced): "Storage, model management, permissions, and user access become more important than the GPUs after week one. Hire someone experienced with this stack."

u/AmanNonZero — 23 days ago
▲ 273 r/LocalLLM

Hi folks, I run a 60-person design agency (brand, UI/UX, motion, CGI) and we just invested in a high-end dual-GPU workstation. Two NVIDIA RTX PRO 6000 Blackwells.

Now I want to squeeze every bit of value out of this thing. Here's what we're looking to do:

Use cases:

  1. Design workflows | AI-assisted ideation, image gen, upscaling, style transfer
  2. Local inference | running open-weight LLMs for internal research, copywriting, code assist, client brief analysis
  3. Fine-tuning | potentially training LoRAs or small domain-specific models on our design/brand data
  4. Video & motion | AI-assisted animation, interpolation, video gen experiments

What I'd love advice on:

  • What models should I be running locally with this VRAM? (96GB × 2)
  • Best serving stack? (vLLM, Ollama, text-generation-webui, something else?)
  • Anyone running Stable Diffusion / ComfyUI / Flux on similar hardware. What's your workflow?
  • Any tips on multi-GPU setup for inference vs. keeping one GPU free for rendering?

Open to any "I wish I'd known this on day one" advice. Thanks!

u/AmanNonZero — 23 days ago