u/Candy_Lucy

▲ 12 r/LLMDevs

Self-hosted LLM on GCP (1×H100 + 1×L4) for legal RAG in European languages — looking for advice

Hey,

I'm planning to migrate a production RAG system from Azure OpenAI (currently using 4o + 4.1 for different agents) to a self-hosted setup on GCP. Looking for advice from people who've done similar migrations.

Setup I'm considering:
- 1× H100 80GB for the main LLM
- 1× L4 for embeddings + reranker
- Possibly 2× H100 if a meaningfully better model justifies it

Workload:
- RAG with multiple agents (currently split between GPT-4o and GPT-4.1 depending on task complexity)
- ~2,500 documents/day, batched in ~500–600 packages of 5–6 docs each, 20–30 A4 pages per doc
- Processing window: 8h/day (within 8 AM–5 PM), so ~310 docs/h on average
- European languages, legal domain, **zero English content**
- Speed matters — needs to fit the 8h window comfortably
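Back-of-envelope for the workload above, as a rough sketch only. The tokens-per-page figure and the number of passes per document are placeholder assumptions, not measurements; the real values depend on the tokenizer, the language, and how many agents read each document.

```python
def required_throughput(docs_per_day, window_hours, pages_per_doc,
                        tokens_per_page, passes_per_doc=1):
    """Return (docs/hour, pages/hour, input tokens/second) needed to finish in the window."""
    docs_per_hour = docs_per_day / window_hours
    pages_per_hour = docs_per_hour * pages_per_doc
    # Every pass (e.g. one per agent) re-reads the document's tokens.
    tokens_per_second = pages_per_hour * tokens_per_page * passes_per_doc / 3600
    return docs_per_hour, pages_per_hour, tokens_per_second


if __name__ == "__main__":
    # 2,500 docs/day and an 8h window come from the post; 25 pages is the midpoint
    # of 20-30; 600 tokens/page is an ASSUMPTION for dense legal text.
    docs_h, pages_h, tok_s = required_throughput(2500, 8, 25, 600)
    print(f"{docs_h:.1f} docs/h, {pages_h:.0f} pages/h, {tok_s:.0f} input tokens/s")
```

With these placeholder values that is 312.5 docs/h and roughly 1,300 input tokens/s of sustained prefill for a single pass. Multiply by `passes_per_doc` if several agents each read the full document.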

Quality bar:
I've gotten the current setup to ~90% satisfaction/accuracy through prompt engineering. Looking for a self-hostable model that matches or slightly beats this. Anything significantly better that fits on a single H100 would be a huge win.

Cost context:
Current Azure spend is ~$62k USD. Self-host math works even at modest savings, but the bigger drivers are data residency and predictable per-doc cost as we scale questionnaires.

Models I'm currently looking at:
- Qwen3-32B (Apache 2.0, strong multilingual, fits 1×H100 at FP8 with KV headroom)
- Possibly Qwen3.5 / Qwen3.6 variants if anyone has experience with them on legal text
- Mistral-Small-3.2-24B as a backup option
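To sanity-check the "fits 1×H100 at FP8 with KV headroom" claim, here is a rough memory budget sketch. The architecture numbers (64 layers, 8 KV heads, head dim 128, ~32.8B parameters) are what I understand the published Qwen3-32B config to be and should be verified against the model's `config.json`. The 0.9 memory fraction and the 16-bit KV cache are assumptions, and the estimate ignores activations and runtime overhead, so treat the result as an upper bound.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache size for one token: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_token_capacity(gpu_bytes, usable_fraction, weight_bytes, per_token_bytes):
    """Tokens of KV cache that fit after the weights, ignoring activation overhead."""
    budget = int(gpu_bytes * usable_fraction) - weight_bytes
    return max(budget, 0) // per_token_bytes


if __name__ == "__main__":
    # ASSUMED Qwen3-32B shape; check config.json before relying on it.
    per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8,
                                   head_dim=128, bytes_per_value=2)
    # ~32.8B parameters at 1 byte each for FP8 weights (approximation).
    weights = 32_800_000_000
    capacity = kv_token_capacity(80_000_000_000, 0.9, weights, per_token)
    print(f"{per_token / 1024:.0f} KiB per token, ~{capacity:,} tokens of KV cache")
```

Under these assumptions the cache costs 256 KiB per token, which leaves room for on the order of 150k tokens shared across all concurrent requests. An FP8 KV cache would roughly double that, at some possible quality cost.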

Questions:
1. Is anyone running Qwen3-32B (or newer Qwen variants) in production on legal/regulatory text in non-English European languages? How does it compare to GPT-4.1 on instruction following and structured JSON output?
2. Is there anything in the 30B–70B range that would meaningfully beat Qwen3-32B on European legal text and still fit on 1×H100 at FP8?
3. Is it worth jumping to 2×H100 for something like Mistral Medium 3.5 or GLM-4.5-Air, or is that diminishing returns for extractive RAG?
4. vLLM vs SGLang for this workload? There are lots of shared system prompts across agents, so prefix caching is interesting.
5. Any gotchas with H100 capacity in EU GCP regions (Frankfurt/Belgium)?
u/Candy_Lucy — 21 days ago
