u/DamageSea2135

Follow-up to my earlier posts on omnivoice-triton (non-autoregressive, 3.4× speedup) and qwen3-tts-triton (autoregressive, with kernel-fusion drift mitigation). The libraries themselves are unchanged; this update is about the deployment surface.

ComfyUI is increasingly used as a node-graph runtime for AV pipelines (image → video → lipsync). I kept getting asked how to slot Triton-fused TTS into those graphs without standing up a separate gRPC service, so I shipped both libraries as official Comfy Registry nodes.

What ships

ComfyUI-Qwen3-TTS-Triton v0.2.0

  • Qwen3TTSCustomVoice, Qwen3TTSVoiceClone
  • 7 inference modes covering Triton kernel fusion + TurboQuant KV cache combinations

ComfyUI-Omnivoice-Triton v0.1.0

  • OmnivoiceTTSAuto, OmnivoiceTTSVoiceClone, OmnivoiceTTSVoiceDesign
  • 6 inference modes (Base, Triton, Triton+Sage, Faster, Hybrid, Hybrid+Sage)
  • Streamlit A/B dashboard still bundled in the lib

Why it’s a meaningful packaging step

  • Inference modes are exposed as ComfyUI parameters → no code changes needed for ablation in production-shaped graphs
  • Per-task nodes (Auto / Voice Clone / Voice Design) keep the ComfyUI graph readable instead of funneling everything through one 30-input monolith node
  • Workflow JSONs included; reproducible across machines
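
For anyone who hasn't written a ComfyUI node: a minimal sketch of how an inference mode can surface as a graph parameter. This is not the code from either package. `ExampleTTSNode` and `fake_synthesize` are hypothetical names, the sample rate is an assumed value, and the synthesis call is a stub so the snippet runs without a GPU. Only the six mode names come from the list above. The `INPUT_TYPES` / `RETURN_TYPES` / `FUNCTION` / `NODE_CLASS_MAPPINGS` layout follows the usual ComfyUI custom-node convention.

```python
from typing import Any

# Mode names copied from the post; everything else here is hypothetical.
OMNIVOICE_MODES = ["Base", "Triton", "Triton+Sage", "Faster", "Hybrid", "Hybrid+Sage"]


def fake_synthesize(text: str, mode: str) -> dict[str, Any]:
    """Stand-in for the real TTS call so the sketch runs without a GPU."""
    sample_rate = 24_000  # assumed value, for illustration only
    # One second of silence per word; a real node would return an audio tensor.
    samples = [0.0] * (sample_rate * max(1, len(text.split())))
    return {"waveform": samples, "sample_rate": sample_rate, "mode": mode}


class ExampleTTSNode:
    """Hypothetical node: the inference mode is a dropdown, not a code path."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "text": ("STRING", {"multiline": True, "default": "Hello"}),
                # A list in the type slot is rendered as a combo box in the UI.
                "mode": (OMNIVOICE_MODES, {"default": "Triton"}),
            }
        }

    RETURN_TYPES = ("AUDIO",)
    FUNCTION = "generate"
    CATEGORY = "audio/tts"

    def generate(self, text: str, mode: str):
        if mode not in OMNIVOICE_MODES:
            raise ValueError(f"unknown mode: {mode}")
        # ComfyUI expects a tuple matching RETURN_TYPES.
        return (fake_synthesize(text, mode),)


# ComfyUI discovers nodes through this module-level mapping.
NODE_CLASS_MAPPINGS = {"ExampleTTSNode": ExampleTTSNode}

if __name__ == "__main__":
    node = ExampleTTSNode()
    for mode in OMNIVOICE_MODES:
        (audio,) = node.generate("ablation test", mode)
        print(mode, len(audio["waveform"]), audio["sample_rate"])
```

Because the mode is just a widget value, an ablation run amounts to changing one dropdown (or one field in the workflow JSON) and re-queuing the graph.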

Numbers preserved from the lib release

  • Omnivoice: 572 ms → 168 ms (~3.4×), Speaker Similarity 0.99 (RTX 5090)
  • Qwen3-TTS: identical kernels to the standalone PyPI release
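
As a quick sanity check on the Omnivoice figures, here is the arithmetic behind the ~3.4× claim. The helper names are mine and purely illustrative; the two latencies are the ones quoted above.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized path is."""
    return baseline_ms / optimized_ms


def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Share of the baseline latency that was removed, in percent."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


if __name__ == "__main__":
    base, fused = 572.0, 168.0  # ms, from the Omnivoice line above
    print(f"speedup: {speedup(base, fused):.2f}x")                   # speedup: 3.40x
    print(f"reduction: {latency_reduction_pct(base, fused):.1f}%")   # reduction: 70.6%
```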

What I’d still love feedback on

  • Real-world latency numbers from A100/H100/Ada under graph-based serving (vs. direct Python loop)
  • If you're integrating these into a streaming serving stack (Triton Inference Server, vLLM-style schedulers), I'd value engineering input on chunked-output behavior
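
If you want to send latency numbers, a harness along these lines would make them comparable. It is a sketch under assumptions, not part of either package: `measure_latency_ms`, `direct_call`, and `graph_call` are hypothetical names, and the two callables are CPU stubs standing in for a direct Python call and a graph-executed call. On a GPU you would also need to synchronize (for example with `torch.cuda.synchronize()`) inside the timed callable so the timer sees finished work.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency_ms(
    fn: Callable[[], object], warmup: int = 3, runs: int = 20
) -> dict[str, float]:
    """Time fn() after a warmup phase and report min / median / p95 in ms."""
    for _ in range(warmup):
        fn()  # warm caches, JIT compilation, kernel autotuning, etc.

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "min_ms": samples[0],
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def direct_call() -> int:
    """Stub for calling the TTS library in a plain Python loop."""
    return sum(range(50_000))


def graph_call() -> int:
    """Stub for the same work plus simulated per-node graph overhead."""
    time.sleep(0.001)
    return direct_call()


if __name__ == "__main__":
    for name, fn in [("direct", direct_call), ("graph", graph_call)]:
        stats = measure_latency_ms(fn)
        print(name, {k: round(v, 3) for k, v in stats.items()})
```

Median and p95 are more useful to me than a single mean, since graph execution tends to add occasional outliers rather than a constant offset.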

Links

(Disclosure: author of all four repos.)

reddit.com
u/DamageSea2135 — 26 days ago