u/techlatest_net

Supertone's Supertonic is just a 66M param, on-device text-to-speech engine that runs via ONNX for cross-platform inference.
r/MachineLearningAndAI

Lightning-Fast, On-Device, Multilingual TTS — running natively via ONNX.

Highlights

  • ⚡ Blazingly Fast — Low-latency, real-time synthesis across desktop, browser, mobile, and edge — fast enough to turn an entire webpage into audio in under a second
  • 🌍 31-Language Multilingual — Synthesize directly from text across 31 languages, or pass lang="na" to let Supertonic process the text language-agnostically when you don't know the input language — no separate language adapters needed
  • 🪶 99M-Parameter Open-Weight Model — A compact, fully open-weight checkpoint — a fraction of the size of 0.7B–2B class open TTS systems — for smaller downloads, faster cold starts, and lower memory footprint
  • 📱 Edge-Device Ready — Runs locally on desktop, mobile, browsers, and resource-constrained hardware like Raspberry Pi or e-readers, with zero network dependency, complete privacy, and no GPU required
  • 🔊 44.1kHz High-Quality Audio — Outputs studio-grade 44.1kHz 16-bit WAV directly, ready for production playback without any external upsampler
  • 🎭 Expression Tags — 10 inline tags (e.g. <laugh>, <breath>, <sigh>) bring natural human nuance into generated speech without prompt engineering or reference audio
  • 🛠️ Multi-Runtime SDKs — Ready-to-use examples through ONNX Runtime across Python, Node.js, Browser (WebGPU), Java, C++, C#, Go, Swift, iOS, Rust, and Flutter
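
For anyone wondering what "44.1kHz 16-bit WAV" means in practice, here is a rough sketch using only Python's standard library. It does not call Supertonic at all: the sine tone is a stand-in for model output, `float_to_wav_bytes` is a hypothetical helper name, and mono output is an assumption on my part.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # Hz, the rate quoted in the highlights
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM


def float_to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file (hypothetical helper)."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip so the value fits in a signed 16-bit integer
        pcm += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)  # assumption: mono output
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()


# Stand-in for synthesized speech: 0.1 s of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]
wav_bytes = float_to_wav_bytes(tone)
```

Writing `wav_bytes` to a `.wav` file should give something any player can open without resampling.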

🌍 Supported Languages (31)

Arabic (ar), Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Latvian (lv), Lithuanian (lt), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Turkish (tr), Ukrainian (uk), Vietnamese (vi)
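
If you are wiring this into an app, a small guard around the language code might help. The sketch below only uses the 31 codes listed above and the "na" value from the highlights; `resolve_lang` is a hypothetical helper, not part of any Supertonic SDK, and falling back to "na" for unsupported (rather than just unknown) languages is my assumption.

```python
# The 31 language codes listed in the post.
SUPPORTED_LANGS = frozenset(
    "ar bg hr cs da nl en et fi fr de el hi hu id it ja ko lv lt "
    "pl pt ro ru sk sl es sv tr uk vi".split()
)


def resolve_lang(code):
    """Return a supported language code, or "na" if the code is missing or not listed.

    Hypothetical helper: region suffixes such as "pt-BR" or "en_US" are reduced
    to their base code before the lookup.
    """
    if not code:
        return "na"
    base = code.strip().lower().replace("_", "-").split("-")[0]
    return base if base in SUPPORTED_LANGS else "na"
```

The returned value would then be passed as the `lang` argument mentioned in the highlights.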

The best part is it's 100% open source and comes under the MIT license.

Link: https://github.com/supertone-inc/supertonic

u/techlatest_net — 3 days ago

NVIDIA just launched Nemotron 3 Nano Omni, an open multimodal model that combines vision, audio, and language into one system for faster and more accurate AI agents. It delivers up to 9x higher throughput while reducing cost and latency compared to separate models. Built on a hybrid MoE architecture with a 256K context, it excels in tasks like document intelligence, UI navigation, and audio-video reasoning. The model is open, customizable, and deployable across local, cloud, and enterprise environments. Available now via platforms like Hugging Face and OpenRouter.

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
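
To get a feel for how the three checkpoints differ in size, here is a back-of-the-envelope estimate of weight memory. It assumes the "30B" in the model name means roughly 30 billion total parameters and uses nominal bit widths only; real checkpoints typically store scaling factors and keep some layers at higher precision, so actual downloads will likely be larger than these figures.

```python
# Assumption: "30B" in the model name is read as ~30 billion total parameters.
TOTAL_PARAMS = 30e9

# Nominal bits per weight for each published variant (ignores scales and mixed precision).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(total_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


estimates = {name: weight_gb(TOTAL_PARAMS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```

This covers weights only; the KV cache for a long context adds to it.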

NVIDIA Blog: https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

Benchmark

Compared to other open omni models at the same interactivity, Nemotron 3 Nano Omni delivers 7.4x higher system efficiency for multi-document use cases and 9.2x higher system efficiency for video use cases.

Efficiency highlights

Model architecture and key innovations

u/techlatest_net — 23 days ago