u/ai-lover

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B
▲ 24 r/OpenSourceeAI+1 crossposts


Most LLM inference optimization forces a choice: fast drafting with a weak auxiliary model, or accurate generation with full standard autoregressive (AR) decoding. NVIDIA researchers just built a third option into the weights themselves.

They released Nemotron-Labs-Diffusion, a 3B/8B/14B model family trained on a joint AR-diffusion objective that supports three decoding modes from one checkpoint: standard AR, parallel diffusion decoding, and self-speculation, where the same model drafts and verifies without any auxiliary head.

Here's what's actually interesting:

→ Self-speculation achieves 5.99× tokens per forward over Qwen3-8B with comparable accuracy on a 10-task benchmark

→ Average acceptance length: 6.82 (with LoRA) vs. 2.75 for Eagle3 and 4.24 for Qwen3-9B-MTP — same draft length of 31

→ AR and diffusion objectives peak at the same loss coefficient (α=0.3) and improve together — they don't compete for model capacity

→ Speed-of-light analysis shows a theoretical ceiling of 7.60× TPF at block length 32; current confidence-based sampling realizes only ~3×, leaving headroom for better samplers
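
To make the acceptance-length metric concrete, here is a minimal sketch of greedy draft verification. The function name and the toy token IDs are hypothetical, and the exact acceptance rule used by Nemotron-Labs-Diffusion may differ; check the paper before relying on it.

```python
def accepted_length(draft, verified):
    """Count draft tokens accepted under greedy verification.

    draft:    tokens proposed in one drafting pass
    verified: tokens the verifier would emit at the same positions
    A draft token counts only if it and every earlier draft token match.
    """
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


# Toy example: the first three draft tokens match, the fourth does not.
draft = [11, 42, 7, 99, 5]
verified = [11, 42, 7, 13, 5]
print(accepted_length(draft, verified))  # 3
```

Averaging this count over many drafting passes gives a figure comparable in spirit to the acceptance lengths quoted above. Note that the fifth token matches but is still rejected, because everything after the first mismatch was drafted on a wrong prefix.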

Full analysis: https://www.marktechpost.com/2026/05/20/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b/

Paper: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL

Model weights: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion

Technical details: https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive

https://i.redd.it/veehv38rv92h1.gif

u/ai-lover — 2 days ago

Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Most translation models are audio pipelines with a TTS layer bolted on at the end. That's not simultaneous interpretation, and Alibaba's Qwen team just built a clear technical case for the difference.

They released Qwen3.5-LiveTranslate-Flash: a real-time multimodal translation model that processes audio and video frames simultaneously, clones the original speaker's voice in the output, and covers 60 input languages at 2.8 seconds of latency.

No turn-detection. No generic synthesis voice replacing the speaker.

Here's what's actually interesting:

→ Vision-enhanced comprehension reads lip movements, gestures, and on-screen text alongside audio — robust in noisy or degraded audio environments

→ Semantic unit prediction: processing speech in "reading units" lets the model commit to output segments mid-sentence, enabling continuous streaming without waiting for full utterances

→ Real-time voice cloning replicates the original speaker's voice profile from a single spoken sentence

→ Dynamic keyword configuration lets you inject domain-specific glossaries at runtime — brand names, medical terms, legal vocabulary

→ FLEURS and CoVoST2 benchmarks: outperforms major commercial alternatives across multilingual speech translation tasks

Full analysis: https://www.marktechpost.com/2026/05/20/alibaba-qwen-team-introduces-qwen3-5-livetranslate-flash-real-time-multimodal-interpretation-across-60-languages-at-2-8-second-latency/

Technical details: https://qwen.ai/blog?id=qwen3.5-livetranslate

https://preview.redd.it/rx8ahgg8592h1.png?width=1856&format=png&auto=webp&s=b80784f947e9827537d652972c2c6031a011ee39

u/ai-lover — 2 days ago

Meet MemPrivacy: An Edge-Cloud Framework that Uses Local Reversible Pseudonymization to Protect User Data Without Breaking Memory Utility

Most "privacy-preserving" AI memory just masks sensitive values with ***. That breaks the task. The cloud can't draft your doctor's email if the blood pressure reading is gone.

MemTensor just proposed a different approach — and it actually holds up under benchmarking.

They introduced MemPrivacy, a framework that runs a lightweight on-device model to detect private spans, replaces them with semantically typed placeholders like <Health_Info_1> before anything leaves the device, and restores the original values locally after the cloud responds. The cloud reasons on structure. It never sees the actual data.

Here's what's actually interesting:

→ Four-level privacy taxonomy (PL1–PL4) from general preferences to immediately exploitable credentials — user-configurable per session

→ MemPrivacy-4B-RL hits 85.97% F1 on MemPrivacy-Bench vs. 78.41% for Gemini-3.1-Pro and 68.99% for GPT-5.2 on privacy span extraction

→ Utility loss across LangMem, Mem0, and Memobase stays within 1.6% at PL2–PL4 protection — irreversible masking causes drops up to 41.87%

→ Models run at 0.6B, 1.7B, and 4B parameters with sub-2-second per-message latency on-device

The core insight: privacy protection and semantic utility don't have to trade off — if you replace values with typed structure instead of blank masks.
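
A minimal sketch of the replace-then-restore round trip described above. The regex detectors are an assumption made purely for illustration (MemPrivacy uses an on-device model to find private spans), and all function names here are hypothetical.

```python
import re

# Hypothetical detectors: the framework itself uses an on-device model, not regexes.
PATTERNS = {
    "Health_Info": re.compile(r"\b\d{2,3}/\d{2,3} mmHg\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def pseudonymize(text):
    """Replace private spans with typed placeholders; return text and mapping."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        counter = 0

        def substitute(match, label=label):
            nonlocal counter
            counter += 1
            placeholder = f"<{label}_{counter}>"
            mapping[placeholder] = match.group(0)  # mapping never leaves the device
            return placeholder

        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back locally, after the cloud responds."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


message = "My reading was 150/95 mmHg, email me at sam@example.com."
masked, mapping = pseudonymize(message)
print(masked)
# My reading was <Health_Info_1>, email me at <Email_1>.

cloud_reply = "Noted: <Health_Info_1> is elevated."  # the cloud only sees placeholders
print(restore(cloud_reply, mapping))
# Noted: 150/95 mmHg is elevated.
```

Because the placeholder keeps its type, the cloud model can still reason about "a health reading" or "an email address" without seeing the value.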

Full analysis: https://www.marktechpost.com/2026/05/18/meet-memprivacy-an-edge-cloud-framework-that-uses-local-reversible-pseudonymization-to-protect-user-data-without-breaking-memory-utility/

Paper: https://arxiv.org/pdf/2605.09530v2

Model Weights: https://huggingface.co/collections/IAAR-Shanghai/memprivacy

https://preview.redd.it/p2ia8c0lsy1h1.png?width=1338&format=png&auto=webp&s=2ac6f916638b0d9e60aa3093d0ad544859bed1fc

u/ai-lover — 3 days ago

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

Most "4-bit training" results come from small models on short token horizons because the format breaks before you can validate it. That's not pretraining — and NVIDIA just drew a clear line between the two.

They introduced the first public 4-bit pretraining run at multi-trillion-token scale — a 12B hybrid Mamba-Transformer (Nemotron-Nano-12B-v2-Base architecture) trained on 10 trillion tokens in NVFP4, a microscaling format with 16-element blocks, E4M3 block scales, and an FP32 per-tensor scale, with downstream accuracy closely tracking an FP8 baseline.

Here's what's actually interesting:

→ MMLU-Pro 5-shot: 62.58% (NVFP4) vs 62.62% (FP8). MMLU 76.57 vs 77.36. GSM8K CoT 92.27 vs 89.08. Validation loss within 1% of FP8 in the stable phase

→ Recipe = selective BF16 (~16% of linear layers) + 16×16 Random Hadamard Transforms on Wgrad inputs + 2D 16×16 weight scaling + stochastic rounding on gradients. Ablations show all four are required

→ Only linear-layer GEMMs run in NVFP4 — attention, embeddings, normalization, master weights, gradients, and optimizer states stay in BF16/FP32

→ On an 8B model, MXFP4 needed 1.36T tokens (+36%) to match NVFP4's loss at 1T tokens
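
To illustrate what block-scaled 4-bit quantization means, here is a simplified sketch in plain Python. It assumes the standard E2M1 value grid for 4-bit elements and keeps the per-block scale in full precision; real NVFP4 stores that scale in E4M3 and adds an FP32 per-tensor scale, both omitted here. Function names are hypothetical.

```python
import random

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one scale, as stated in the post


def round_nearest(mag):
    """Snap a magnitude in [0, 6] to the nearest grid point."""
    return min(FP4_GRID, key=lambda g: abs(g - mag))


def round_stochastic(mag, rng):
    """Round up or down with probability proportional to proximity."""
    lo = max(g for g in FP4_GRID if g <= mag)
    hi = min(g for g in FP4_GRID if g >= mag)
    if hi == lo:
        return lo
    return hi if rng.random() < (mag - lo) / (hi - lo) else lo


def quantize_block(block, rounder=round_nearest):
    """Quantize one block to the FP4 grid and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / FP4_GRID[-1]  # simplification: full precision, not E4M3
    out = []
    for x in block:
        q = rounder(min(abs(x) / scale, FP4_GRID[-1]))
        out.append((q if x >= 0 else -q) * scale)
    return out


values = [0.1 * i for i in range(-8, 8)]  # one 16-element block
dequantized = quantize_block(values)
print(max(abs(a - b) for a, b in zip(values, dequantized)))  # worst-case error

rng = random.Random(0)
noisy = quantize_block(values, rounder=lambda m: round_stochastic(m, rng))
```

Stochastic rounding is unbiased in expectation, which is the usual motivation for applying it to gradients as the recipe above does.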

Full Analysis: https://www.marktechpost.com/2026/05/18/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon/

Paper: https://arxiv.org/pdf/2509.25149

https://preview.redd.it/114lxr5x0v1h1.png?width=1462&format=png&auto=webp&s=c0f5be370e3b75ae7bec2d6eef9c3895f414cfab

u/ai-lover — 4 days ago

Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context

Most sparse attention methods make a quiet assumption: that you can ship a custom kernel for selection and deal with the inference consequences later.

Nous Research's new paper argues that this assumption is optional. They released Lighthouse Attention, a selection-based hierarchical attention for long-context pretraining that pools Q, K, and V symmetrically across a multi-level pyramid, places selection entirely outside the attention kernel, and runs stock FlashAttention on a small dense sub-sequence. No custom sparse kernel. No auxiliary losses. No learnable scorer. No straight-through estimator.

Here's what's actually interesting:

→ 21× faster forward pass and 17.3× faster forward+backward vs. cuDNN SDPA at 512K context on a single B200

→ 1.40–1.69× end-to-end pretraining wall-clock speedup at 98K context at matched or lower final training loss

→ Brief dense-SDPA resumption after Lighthouse training recovers a full-attention model that beats dense-from-scratch (loss 0.6980 vs. 0.7237 baseline, same ~50.3B token budget)

→ Scales to 1M-token training across 32 Blackwell GPUs under standard ring attention — no sparse-aware collectives needed
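
The "selection outside the kernel" idea can be sketched as below. This is a heavily simplified, single-level illustration, not the paper's multi-level pyramid: it mean-pools keys into blocks, scores each block against one query, and returns the token indices a stock dense attention call would then run on. All names are hypothetical.

```python
def mean_pool(vectors, block):
    """Average consecutive groups of `block` vectors."""
    pooled = []
    for start in range(0, len(vectors), block):
        group = vectors[start:start + block]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def select_token_indices(query, keys, block, top_k):
    """Pick the top_k key blocks by pooled dot-product score; return token indices."""
    pooled = mean_pool(keys, block)
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in pooled]
    best = sorted(range(len(pooled)), key=lambda b: scores[b], reverse=True)[:top_k]
    indices = []
    for b in sorted(best):  # keep original token order for the dense sub-sequence
        indices.extend(range(b * block, min((b + 1) * block, len(keys))))
    return indices


keys = [[0, 1], [0, 1], [2, 0], [3, 0], [0, 0], [1, 1], [5, 0], [4, 1]]
query = [1, 0]
print(select_token_indices(query, keys, block=2, top_k=2))  # [2, 3, 6, 7]
```

Since the output is just a list of indices, the gathered keys and values form an ordinary dense sequence, which is why no custom sparse kernel is needed downstream.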

Train with hierarchical selection to move fast, then recover the dense model you actually need at inference.

Analysis: https://www.marktechpost.com/2026/05/16/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context/

Paper: https://arxiv.org/pdf/2605.06554

Technical details: https://nousresearch.com/lighthouse-attention

GitHub Repo: https://github.com/ighoshsubho/lighthouse-attention

https://preview.redd.it/hz7eup7vsk1h1.png?width=1618&format=png&auto=webp&s=706e22db5210e898ab5b144039dffa5247304c68

u/ai-lover — 5 days ago

LiteLLM Agent Platform: Running Claude Code & Codex in Isolated Sandboxes With Vault Protection

If you’ve been experimenting with agentic coding workflows (like Claude Code or Codex), you already know the massive headache of balancing agent autonomy with credential security. Most implementations either sandbox the agent completely (making it useless for real work) or give it direct access to API keys (a security nightmare).

The team behind LiteLLM just open-sourced the LiteLLM Agent Platform, a self-hosted infrastructure layer designed specifically to solve this.

Here is how it works under the hood:

  • Isolated Sandboxes: Agents run inside fresh Kubernetes pods (via the kubernetes-sigs/agent-sandbox CRD).
  • The Credential Vault: Pods are injected with stub credentials only (e.g., GITHUB_TOKEN=stub_github_1234).
  • Perimeter Interception: When the agent makes an outbound TLS connection to hit an API or push to GitHub, a secure vault proxy intercepts the traffic and swaps the stub for your real, production key. The agent never actually "sees" the real secret.
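
The stub-for-secret swap can be sketched in a few lines. This is only an illustration of the idea, not the platform's proxy: TLS interception is omitted entirely, and the vault contents, function name, and "real" token value are made up.

```python
# Hypothetical vault contents: stub value -> real secret (held only by the proxy).
VAULT = {"stub_github_1234": "REAL_GITHUB_TOKEN_PLACEHOLDER"}


def swap_stub_credentials(headers, vault=VAULT):
    """Return a copy of outbound headers with any stub token replaced."""
    rewritten = {}
    for name, value in headers.items():
        for stub, real in vault.items():
            value = value.replace(stub, real)
        rewritten[name] = value
    return rewritten


# What the sandboxed agent sends (it only knows the stub).
agent_headers = {
    "Authorization": "Bearer stub_github_1234",
    "Accept": "application/json",
}

# What leaves the perimeter after the proxy rewrites it.
print(swap_stub_credentials(agent_headers)["Authorization"])
# Bearer REAL_GITHUB_TOKEN_PLACEHOLDER
```

The function returns a new dict, so the agent-side headers still hold only the stub after the swap.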

It runs locally via kind and Docker Desktop. You can spin up a local instance in minutes:

git clone https://github.com/BerriAI/litellm-agent-platform.git
cd litellm-agent-platform/cli && npm install
ln -sf "$PWD/bin/lap.mjs" ~/.local/bin/lap

# Fire up the backend (Postgres, web UI, worker, and kind cluster)
bin/kind-up.sh
docker compose up

Once it's running, you can connect your local terminal straight to a secure agent sandbox using their CLI:

lap login
lap claude-code-cli1

This drops you straight into a WebSocket-attached TTY inside the isolated Kubernetes pod.

🌐 Deployment & Architecture

  • Local: kind + Docker Compose
  • Production: Built to scale seamlessly on AWS EKS for the sandbox layer and Render for the web/worker stack (manifests and automated scripts like bin/eks-up.sh are included).
  • License: MIT

Repo: https://github.com/BerriAI/litellm-agent-platform

Analysis: https://aideveloper44.com/blog/litellm-agent-platform-claude-code-codex-isolated-sandboxes-vault

u/ai-lover — 5 days ago
▲ 8 r/OpenSourceeAI+1 crossposts

Meet LiteLLM Agent Platform: A Kubernetes-Based, Self-Hosted Infrastructure Layer for Isolated Agent Sandboxes and Persistent Session Management in Production

Most "managed agent" solutions mean handing your sessions to someone else's cloud. That's not infrastructure you control — and BerriAI just shipped a clear alternative.

They open-sourced the LiteLLM Agent Platform, a self-hosted infrastructure layer for running multiple AI agents in production, built on top of the LiteLLM Gateway. It manages sandbox isolation per team or context and keeps session state alive across pod restarts and upgrades, with no external session store to wire up yourself.

Here's what's actually interesting:

→ Sandboxes run on Kubernetes via the kubernetes-sigs/agent-sandbox CRD — kind locally, AWS EKS in production

→ Two commands to get started: bin/kind-up.sh provisions the cluster, docker compose up boots Postgres, web (:3000), and worker

→ Secrets pass into sandboxes via CONTAINER_ENV_ prefix in .env — stripped at injection, no image rebuilds needed

→ The LiteLLM Gateway handles model routing across 100+ LLM providers — the Agent Platform handles everything above that layer

→ MIT licensed, currently in alpha public preview
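
For readers unfamiliar with prefix-based secret passing, here is a rough sketch of what "stripped at injection" could look like. It is an assumption about the behavior based on the description above, not code from the repository; the function name and the sample values are hypothetical.

```python
PREFIX = "CONTAINER_ENV_"


def sandbox_env(dotenv):
    """Keep only prefixed entries and strip the prefix for injection."""
    return {
        key[len(PREFIX):]: value
        for key, value in dotenv.items()
        if key.startswith(PREFIX) and len(key) > len(PREFIX)
    }


dotenv = {
    "CONTAINER_ENV_GITHUB_TOKEN": "stub_github_1234",
    "DATABASE_URL": "postgres://localhost/app",  # hypothetical host-only setting
}
print(sandbox_env(dotenv))  # {'GITHUB_TOKEN': 'stub_github_1234'}
```

Unprefixed settings stay on the host side, which is what lets you change sandbox secrets by editing .env rather than rebuilding an image.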

Full analysis: https://www.marktechpost.com/2026/05/16/meet-litellm-agent-platform-a-kubernetes-based-self-hosted-infrastructure-layer-for-isolated-agent-sandboxes-and-persistent-session-management-in-production/

GitHub Repo: https://github.com/BerriAI/litellm-agent-platform

https://i.redd.it/cxgibb9ghj1h1.gif

u/ai-lover — 5 days ago

NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

Most open-source world models either need 8 GPUs to run or drop to 480p to survive. That's not an efficiency problem — it's an architecture problem. NVIDIA just addressed it directly.

They introduced SANA-WM — a 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing 720p video with precise 6-DoF camera control from a single image and a camera trajectory, running inference on a single GPU with no multi-GPU dependency anywhere in the pipeline.

Here's what's actually interesting:

→ Hybrid Gated DeltaNet + softmax backbone keeps recurrent state at constant D×D size regardless of video length — solving the quadratic memory explosion that makes 961-frame sequences infeasible with standard softmax attention

→ Dual-branch camera control: UCPE at latent-frame rate for global trajectory + Plücker mixing at raw-frame rate for intra-stride motion — CamMC 0.2047, best among all compared methods

→ Second-stage refiner (17B LTX-2 + rank-384 LoRA, 3 Euler steps) cuts long-horizon visual drift ΔIQ from 3.09 to 0.31 on Hard trajectories

→ 22.0 videos/hour on 8 H100s: 36× higher throughput vs LingBot-World at 14B+14B parameters

→ Distilled variant: 34s per 60s 720p clip on a single RTX 5090 with NVFP4 quantization

Full analysis: https://www.marktechpost.com/2026/05/16/nvidia-introduces-sana-wm-a-2-6b-parameter-open-source-world-model-that-generates-minute-scale-720p-video-on-a-single-gpu/

Paper: https://arxiv.org/pdf/2605.15178

Project page: https://nvlabs.github.io/Sana/WM/

GitHub Page: https://github.com/NVlabs/Sana

https://preview.redd.it/ny5cruolhg1h1.png?width=1358&format=png&auto=webp&s=a8e45f60221194c7df2b94ba99d15f002a34304b

u/ai-lover — 6 days ago

Zyphra Releases ZAYA1-8B-Diffusion-Preview: The First MoE Diffusion Model Converted From an Autoregressive LLM With Up to 7.7x Speedup

Most LLMs are memory-bandwidth bound at inference.

Each user in a batch needs their own KV-cache loaded from GPU memory. The GPU sits idle waiting on data transfers.

Diffusion solves this. Zyphra's ZAYA1-8B-Diffusion-Preview generates 16 tokens simultaneously — all sharing one KV-cache. That shifts decoding from memory-bound to compute-bound.

Numbers:

→ 4.6x speedup — lossless sampler, no eval degradation

→ 7.7x speedup — logit-mixing sampler, minor quality trade-off

→ Beats MTP and EAGLE3 on inference speedup

It's also the first MoE diffusion model converted from an autoregressive LLM, and the first diffusion-LM trained on AMD hardware.

Training: no need to train from scratch. They used the TiDAR recipe on the existing ZAYA1-8B checkpoint — 1.1T tokens of additional mid-training total.

Analysis: https://www.marktechpost.com/2026/05/15/zyphra-releases-zaya1-8b-diffusion-preview-the-first-moe-diffusion-model-converted-from-an-autoregressive-llm-with-up-to-7-7x-speedup/

Technical details: https://www.zyphra.com/post/zaya1-8b-diffusion-preview

https://preview.redd.it/pwtms4q5yc1h1.png?width=3000&format=png&auto=webp&s=44d0713432d5485d07ddd6000e7dad12d153d7cd

u/ai-lover — 6 days ago

Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models

Most LLM pre-training efficiency work either changes the tokenizer, the architecture, or the inference behavior. Nous Research just showed you don't have to touch any of them.

They released Token Superposition Training (TST) — a two-phase modification to the standard pre-training loop that averages s contiguous token embeddings into a single latent s-token in Phase 1, trains with a multi-hot cross-entropy loss against the next bag of tokens, then reverts to standard next-token prediction in Phase 2 from the same checkpoint, with the TST code fully removed.

Here's what's actually interesting:

→ Each TST step is kept equal-FLOPs to baseline by increasing data sequence length by s× — not the batch size

→ 3B dense: loss 2.676 in 247 B200-hrs vs 443 B200-hrs for baseline at matched loss (~1.8x faster)

→ 10B-A1B MoE: 4,768 B200-hrs vs 12,311 B200-hrs at matched loss (~2.5x faster)

→ Optimal range: bag size s ∈ [3–8] at 270M, s ∈ [6–10] at 600M, s = 16 at 10B; step ratio r ∈ [0.2, 0.4]

→ Re-initializing the embedding or LM head at the phase boundary breaks it entirely — loss went from 2.676 to 2.938, worse than the 2.808 baseline
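
A minimal sketch of the two Phase 1 ingredients, in plain Python. The uniform weighting over the bag in the loss is an assumption (the paper may normalize differently), and both function names are hypothetical. The last line simply recomputes the speedups from the B200-hour figures quoted above.

```python
import math


def superpose(embeddings, s):
    """Average each run of s contiguous token embeddings into one latent vector."""
    latents = []
    for start in range(0, len(embeddings) - s + 1, s):
        group = embeddings[start:start + s]
        latents.append([sum(col) / s for col in zip(*group)])
    return latents


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against a uniform target over the tokens in the next bag."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(logits[t] - log_z for t in bag) / len(bag)


emb = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
print(superpose(emb, s=2))  # [[2.0, 1.0], [1.0, 3.0]]

# Loss for a 4-token vocabulary when the next bag contains tokens 1 and 2.
print(multi_hot_cross_entropy([0.0, 2.0, 2.0, 0.0], bag=[1, 2]))

# Speedups implied by the quoted compute budgets.
print(round(443 / 247, 2), round(12311 / 4768, 2))  # 1.79 2.58
```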

Full analysis: https://www.marktechpost.com/2026/05/13/nous-research-releases-token-superposition-training-to-speed-up-llm-pre-training-by-up-to-2-5x-across-270m-to-10b-parameter-models/

Paper: https://arxiv.org/pdf/2605.06546

Project page: https://nousresearch.com/token-superposition

u/ai-lover — 8 days ago
▲ 16 r/OpenSourceeAI+1 crossposts

Fastino Labs Open-Sources GLiGuard: A 300M Parameter Safety Moderation Model That Matches or Exceeds Accuracy of Models 23–90x Its Size

Why are we still running 7B–27B autoregressive decoder models for what is fundamentally a text classification problem?

Fastino Labs has open-sourced GLiGuard, a 300M-parameter safety moderation model that runs 16x faster than the current generation of guardrail models.

Here's what's actually interesting:

  1. It's an encoder, not a decoder. Most guardrail models (LlamaGuard4, WildGuard, ShieldGemma) generate safety verdicts autoregressively, one token at a time. That's slow by design. GLiGuard reframes the whole thing as a text classification problem. One forward pass. Done.

  2. Four moderation tasks. Zero added latency. It evaluates all four simultaneously in a single pass:

→ Safety classification (safe / unsafe)

→ Jailbreak strategy detection (11 strategies)

→ Harm category detection (14 categories)

→ Refusal detection (compliance / refusal)

More safety dimensions = no extra compute. That's the architectural win.

  3. The benchmark numbers are hard to ignore:

→ 87.7 avg F1 on prompt classification, within 1.7 points of the best model (PolyGuard-Qwen at 89.4)

→ 82.7 avg F1 on response classification, second only to Qwen3Guard-8B (84.1)

→ 26ms latency vs. 426ms for ShieldGemma-27B at sequence length 64

→ 133 samples/sec throughput vs. 8.2 at batch size 4

→ Outperforms LlamaGuard4-12B, ShieldGemma-27B, and NemoGuard-8B, all 23–90x larger

  4. It runs on a single GPU. At 0.3B parameters, individual developers and smaller teams can deploy and fine-tune it without heavy infrastructure.
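
As a quick sanity check, the headline ratios can be recomputed from the figures quoted in this post. Pairing the 7B and 27B sizes with the "23–90x" claim is an inference from the numbers, not something the post states, and the baseline behind the 8.2 samples/sec figure is not named.

```python
# Figures quoted in the post.
latency_speedup = 426 / 26       # ShieldGemma-27B vs. GLiGuard, sequence length 64
throughput_speedup = 133 / 8.2   # samples/sec at batch size 4
size_ratio_low = 7 / 0.3         # smallest decoder size mentioned (7B) vs. 0.3B
size_ratio_high = 27 / 0.3       # largest decoder size mentioned (27B) vs. 0.3B

print(round(latency_speedup, 1), round(throughput_speedup, 1))  # 16.4 16.2
print(round(size_ratio_low), round(size_ratio_high))            # 23 90
```

Both the latency and throughput ratios land near 16, consistent with the "16x faster" headline.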

Full analysis: https://www.marktechpost.com/2026/05/13/fastino-labs-open-sources-gliguard-a-300m-parameter-safety-moderation-model-that-matches-or-exceeds-accuracy-of-models-23-90x-its-size/

Paper: https://arxiv.org/pdf/2605.07982

Model weights on HF: https://huggingface.co/fastino/gliguard-LLMGuardrails-300M

GitHub Repo: https://github.com/fastino-ai/GLiGuard

Technical details: https://pioneer.ai/blog/gliguard-16x-faster-safety-moderation-with-a-small-language-model

u/ai-lover — 8 days ago

Mira Murati’s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration

Most real-time AI is a turn-based LLM with voice-activity detection bolted on. That's not an interaction model — and Thinking Machines Lab just drew a very clear line between the two.

They introduced a research preview of TML-Interaction-Small — a 276B MoE model with 12B active parameters built around a multi-stream, time-aligned micro-turn architecture that processes 200ms chunks of audio, video, and text simultaneously, with no external turn-detection scaffolding anywhere in the stack.

Here's what's actually interesting:

→ Full-duplex interaction and asynchronous background reasoning running in parallel, sharing full conversation context

→ Audio as dMel, video as 40×40 hMLP patches, flow head decoder — all co-trained from scratch with the transformer

→ FD-bench v1.5: 77.8 vs. 47.8 for GPT-realtime-2.0

→ Charades mIoU (visual proactivity): 32.4 vs. 0 for GPT-realtime-2.0

The core bet: train interactivity into the weights, not the pipeline.

Full analysis: https://www.marktechpost.com/2026/05/13/mira-muratis-thinking-machines-lab-introduces-interaction-models-a-native-multimodal-architecture-for-real-time-human-ai-collaboration/

Technical Details: https://thinkingmachines.ai/blog/interaction-models/

https://preview.redd.it/ac6onr6clv0h1.png?width=2440&format=png&auto=webp&s=13804ca8c42419be6ce572de09c0ad4d34a14beb

u/ai-lover — 9 days ago
▲ 67 r/OpenSourceeAI+1 crossposts

A 103B medical LLM just got open sourced — and it only activates 6.1B parameters at inference time [Meet AntAngelMed]


Meet AntAngelMed — a 103B-parameter medical LLM that only activates 6.1B parameters at inference time.

Here's what's actually super interesting:

  1. The architecture. It uses a 1/32 activation-ratio MoE built on Ling-flash-2.0. You get 103B total parameters' worth of knowledge capacity, but inference cost stays proportional to 6.1B active parameters, roughly matching 40B dense model performance.

  2. The training pipeline. Three stages:

→ Continual pre-training on medical corpora (encyclopedias, web text, academic publications)

→ SFT with mixed general + clinical instruction data

→ GRPO-based reinforcement learning with task-specific reward models for safety, diagnostic reasoning, and hallucination reduction

  3. Inference numbers.

→ 200+ tokens/s on H20 hardware

→ ~3× faster than a 36B dense model

→ 128K context length via YaRN extrapolation

→ FP8 + EAGLE3 boosts throughput over FP8 alone: +71% on HumanEval, +45% on GSM8K, +94% on Math-500

  4. Benchmark results.

→ #1 open-source on OpenAI's HealthBench, also surpassing several proprietary models

→ Top-level on MedAIBench (China's national medical AI benchmark)

→ #1 overall on MedBench across all 5 dimensions: knowledge QA, language understanding, language generation, complex reasoning, and safety & ethics

Full analysis: https://www.marktechpost.com/2026/05/12/meet-antangelmed-a-103b-parameter-open-source-medical-language-model-built-on-a-1-32-activation-ratio-moe-architecture/

Model Weights on HF: https://huggingface.co/MedAIBase/AntAngelMed

GitHub Repo: https://github.com/MedAIBase/AntAngelMed

https://preview.redd.it/4cg34od2zr0h1.png?width=1804&format=png&auto=webp&s=f4d76824cd6852e3b6d5af88c33d32e50ad1e229

Technical details: https://modelscope.cn/models/MedAIBase/AntAngelMed

u/ai-lover — 9 days ago

A Team of Researchers from Meta and Stanford Propose Fast Byte Latent Transformer That Reduces Inference Memory Bandwidth by Over 50% Without Tokenization

Byte-level language models have always had a strong case — no tokenizer bias, better multilingual fairness, stronger robustness to noisy inputs.

The problem? Inference. Generating one byte at a time means far more forward passes than token-level models. Memory bandwidth gets hammered.

Researchers from Meta and Stanford propose the Fast Byte Latent Transformer to attack exactly this bottleneck, reducing inference memory bandwidth by over 50% without tokenization.

This new research introduces three methods:

𝟭. BLT Diffusion (BLT-D)

Instead of generating bytes one at a time, BLT-D generates a full block of bytes in parallel via block-wise discrete diffusion. The encoder and global model are called once per block — not once per patch.

→ BLT-D-4: nearly matches BLT task scores at less than half the memory bandwidth

→ BLT-D-16: 87–92% memory-bandwidth reduction vs BLT

𝟮. BLT Self-Speculation (BLT-S)

No retraining. No architectural changes. BLT's own lightweight decoder drafts beyond normal patch boundaries, then the full model verifies. Under greedy decoding, outputs are bit-for-bit identical to standard BLT.

→ Up to 77% memory-bandwidth reduction, zero quality loss

𝟯. BLT Diffusion+Verification (BLT-DV)

Diffusion drafts a block. One autoregressive pass verifies it. Same weights, no extra training.

→ Up to 81% memory-bandwidth reduction, better quality than diffusion-only BLT-D

Here's what's actually interesting:

BLT-S requires nothing — no new weights, no new training, no architecture change — and still gives you up to 77% bandwidth reduction with identical outputs. That's a rare result in this space.
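
Why can draft-then-verify give identical outputs under greedy decoding? The toy sketch below shows the general mechanism. The two "models" are hypothetical stand-ins (simple arithmetic functions), not BLT components, and the verification loop is written token by token where a real system would use one batched pass of the full model.

```python
def full_next(prefix):
    """Toy stand-in for the full model's greedy next byte."""
    return (sum(prefix) * 31 + len(prefix)) % 256


def draft_next(prefix):
    """Toy stand-in for the lightweight decoder: wrong at every fifth position."""
    guess = full_next(prefix)
    return guess if len(prefix) % 5 else (guess + 1) % 256


def greedy_decode(prompt, n):
    """Plain greedy decoding: one full-model call per generated byte."""
    out = list(prompt)
    while len(out) < len(prompt) + n:
        out.append(full_next(out))
    return out


def speculative_decode(prompt, n, k=4):
    """Draft k bytes cheaply, keep the verified prefix, fix the first mismatch."""
    out = list(prompt)
    target = len(prompt) + n
    while len(out) < target:
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        for i, tok in enumerate(draft):
            correct = full_next(out + draft[:i])
            if tok != correct:
                out.extend(draft[:i] + [correct])  # keep the verifier's byte
                break
        else:
            out.extend(draft)  # every drafted byte was accepted
    return out[:target]


prompt = [1, 2, 3]
print(speculative_decode(prompt, 20) == greedy_decode(prompt, 20))  # True
```

Every byte that reaches the output is one the full model would have chosen for that prefix, so the draft quality affects only speed, never the result.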

And BLT-D supports KV caching, so it stacks with existing optimization techniques.

Full analysis: https://www.marktechpost.com/2026/05/11/meta-and-stanford-researchers-propose-fast-byte-latent-transformer-that-reduces-inference-memory-bandwidth-by-over-50-without-tokenization/

Paper: https://arxiv.org/pdf/2605.08044

u/ai-lover — 10 days ago

Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs

Feedforward layers account for 80%+ of LLM compute — and for any given token, most of that computation lands on zero-value activations.

A research team from Sakana AI and NVIDIA released TwELL and a set of CUDA kernels that finally make that sparsity exploitable on modern GPUs.

Here's the part that is very interesting:

Sparse ops have mostly run slower than dense ops on NVIDIA GPUs. The overhead from converting activations to sparse format cancelled every theoretical saving. That's the paradox this new research fixes.

Here's the breakdown:

→ TwELL (Tile-wise ELLPACK): A new sparse format built directly into the matmul kernel epilogue. No extra kernel launch. No extra global memory read. No synchronization overhead.

→ Fused inference kernel: Takes gate activations in TwELL format and performs up + down projections together. The hidden state is never written to global memory.

→ Hybrid sparse format for training: Routes rows into compact ELL or dense backup dynamically — handles the non-uniform sparsity patterns that make training hard without becoming brittle.

→ The training recipe: Two changes only — replace SiLU with ReLU, add L1 regularization at coefficient 2×10⁻⁵. Same LR, same optimizer, same batch size.

→ 2B model results on H100 PCIe:

🟢 +20.5% inference throughput

🟢 +21.9% training step throughput

🟢 −17.0% energy per token

🟢 Accuracy: 49.1% dense → 48.8% sparse

→ It scales the right way: Average non-zero activations drop from 39 (0.5B) to 24 (2B). Gains grow with model size — not shrink.
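
For context, here is what the classic ELLPACK layout looks like, sketched in plain Python. This is the textbook format only; the tile-wise variant and the fused CUDA kernels described above are not reproduced, and the function names are hypothetical.

```python
def to_ell(rows):
    """Pack the non-zeros of each row into fixed-width value/index arrays."""
    width = max(sum(1 for x in row if x != 0.0) for row in rows)
    values, columns = [], []
    for row in rows:
        nz = [(x, j) for j, x in enumerate(row) if x != 0.0]
        pad = width - len(nz)
        values.append([x for x, _ in nz] + [0.0] * pad)
        columns.append([j for _, j in nz] + [0] * pad)  # padded index is unused
    return values, columns


def ell_matmul(values, columns, weight):
    """Compute activations @ weight, touching only stored non-zeros."""
    out = []
    for vals, cols in zip(values, columns):
        acc = [0.0] * len(weight[0])
        for x, j in zip(vals, cols):
            if x != 0.0:
                for c, w in enumerate(weight[j]):
                    acc[c] += x * w
        out.append(acc)
    return out


pre = [[-1.0, 2.0, -0.5, 0.0], [0.5, -3.0, -2.0, 1.0]]
acts = [[max(0.0, x) for x in row] for row in pre]  # ReLU yields exact zeros
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

values, columns = to_ell(acts)
print(values)   # [[2.0, 0.0], [0.5, 1.0]]
print(columns)  # [[1, 0], [0, 3]]
print(ell_matmul(values, columns, weight))  # [[0.0, 2.0], [2.5, 0.0]]
```

This also shows why the recipe swaps SiLU for ReLU: ReLU produces exact zeros that can be dropped from storage, while SiLU outputs are almost never exactly zero.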

All kernels are open and released.

So, basically it's not about smaller models. It's about skipping the computation that was always wasted.

Full Analysis with Visuals/Guide: https://www.marktechpost.com/2026/05/11/sakana-ai-and-nvidia-introduce-twell-with-cuda-kernels-for-20-5-inference-and-21-9-training-speedup-in-llms/

Paper: https://arxiv.org/pdf/2603.23198

Repo: https://github.com/SakanaAI/sparser-faster-llms

Technical details: https://pub.sakana.ai/sparser-faster-llms/

https://i.redd.it/1a1ky5zx3h0h1.gif

u/ai-lover — 11 days ago
▲ 76 r/OpenSourceeAI+1 crossposts

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

NVIDIA just released Star Elastic — and the inference strategy alone is worth understanding.

Here's what's actually interesting from the technical side:

1. One checkpoint. Three models.

Star Elastic applies a post-training method to Nemotron Nano v3 that nests 23B and 12B submodels inside the 30B parent, so both can be extracted zero-shot from the parent checkpoint. All three live in a single checkpoint, released in BF16, FP8, and NVFP4.

2. The router learns the architecture, not just the weights.

A learnable router trained via Gumbel-Softmax maps any target parameter budget to the optimal nested configuration across all elastic axes — attention heads, Mamba SSM heads, MoE experts, FFN channels, embedding dimensions. The importance-based ranking that orders these components is computed before training begins.
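
One way to picture why the submodels nest: if components are ranked by importance once, any smaller budget is a prefix of a larger one. The sketch below shows this for a single elastic axis with made-up importance scores; the learned router and the real multi-axis configuration are not modeled, and the function name is hypothetical.

```python
def nested_slice(units, importance, keep):
    """Keep the `keep` most important units, in ranked order."""
    ranked = sorted(range(len(units)), key=lambda i: importance[i], reverse=True)
    return [units[i] for i in ranked[:keep]]


ffn_channels = ["c0", "c1", "c2", "c3", "c4", "c5"]
importance = [0.2, 0.9, 0.1, 0.7, 0.4, 0.8]  # hypothetical scores

large = nested_slice(ffn_channels, importance, keep=4)
small = nested_slice(ffn_channels, importance, keep=2)
print(large)  # ['c1', 'c5', 'c3', 'c4']
print(small)  # ['c1', 'c5']
print(small == large[:len(small)])  # True: the smaller model sits inside the larger
```

Because the ranking is fixed before training, extracting a submodel needs no further optimization, which is the sense in which slicing is "zero-shot".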

3. Use a smaller model for thinking. Use the full model for the answer.

This is the result we found most interesting. Elastic budget control assigns the 23B submodel to the thinking phase and the 30B model to the final answer. Reasoning traces are high-volume but tolerant of lower capacity. The final answer is low-volume but requires precision. Matching model size to phase complexity gives:

→ +16% accuracy vs. standard budget control

→ 1.9× lower latency

Measured on AIME-2025, GPQA, LiveCodeBench v5, and MMLU-Pro.

4. The cost reduction is significant.

→ 360× fewer tokens vs. pretraining each variant from scratch

→ 7× fewer tokens vs. state-of-the-art sequential compression

→ The 23B and 12B nested models match or outperform independently trained baselines of comparable size

5. Hardware accessibility.

The 12B NVFP4 variant runs on an RTX 5080 where every BF16 configuration runs out of memory. On an RTX Pro 6000 it reaches 7,426 tokens/s — 3.4× the throughput of the 30B BF16 baseline.

Read the full analysis which also has an interactive step-by-step code guide here: https://www.marktechpost.com/2026/05/09/nvidia-ai-releases-star-elastic-one-checkpoint-that-contains-30b-23b-and-12b-reasoning-models-with-zero-shot-slicing/

3-in-1 model in BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16

3-in-1 model in FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8

3-in-1 model in NVFP4: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4

Paper: https://cas-bridge.xethub.hf.co/xet-bridge-us/69cd91b34a304b3afe4ceaa4/cedbede2a32a1757cd46b5ce6edbe0934f2c8437f61509d8f63aae86f96b43cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260509%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260509T212853Z&X-Amz-Expires=3600&X-Amz-Signature=a776c3adc5cd45d923a82950ea17eefb271caf85b0586ff79855f575381030a7&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=689a286d51b587fe5035c19f&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27star_elastic_arxiv.pdf%3B+filename%3D%22star_elastic_arxiv.pdf%22%3B&response-content-type=application%2Fpdf&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778365733&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODM2NTczM319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82OWNkOTFiMzRhMzA0YjNhZmU0Y2VhYTQvY2VkYmVkZTJhMzJhMTc1N2NkNDZiNWNlNmVkYmUwOTM0ZjJjODQzN2Y2MTUwOWQ4ZjYzYWFlODZmOTZiNDNjYioifV19&Signature=fpq%7EPKyILz2ZDcwgCMn%7EsYfSySqpZ5Fr-A3MXBBG94lfu6bTv6y63ejTUL16B8v03HIJyKwrdGgHoYAQr88iQ05qS%7EoIszdd0eU2dfem3CVxM-t3e8rIo4-i4OTBjP2oPAMjCqmwzcC6uPG3Xqm-3Tiq5IfrsDFSKSUPZavMI6nU%7EBBpxd-i-L3C4-4v80nzJWfkHZiKb0EHr3PN8CRlA6In1X2-tH3dXBm0GM0j83%7EBtcclb-4C18vdpfEuvEaKOf0tMxsf5zI0acMPdCJxnVatq%7EgZwixiF%7E53DxgPc94Pb93zl0TVTcLH4%7ExH8yi7Xj9YYjdMKB634Q1GeapoJA__&Key-Pair-Id=K2L8F4GPSG1IFC

u/ai-lover — 12 days ago

Meet GitHub Spec-Kit: An Open Source Toolkit for Spec-Driven Development with AI Coding Agents

GitHub's Spec Kit solves something fundamental: AI coding agents are being used wrong. You throw a vague prompt at them and hope for the best. The code compiles. It's wrong. You debug for hours. You already know this.

The fix is not a better model. The fix is a better process.

Spec-Driven Development (SDD) makes the specification the source of truth — not the code. The spec generates the plan. The plan generates the tasks. The tasks generate the implementation. Every step is traceable. Nothing is guessed.

The workflow:

— Write what you want to build. Not how. What.

— Clarify gaps before a single line of architecture is drawn.

— Define the tech stack. The agent builds a full technical plan.

— Generate dependency-ordered tasks with parallel execution markers.

— Run a cross-artifact consistency check. Catch mismatches before the agent touches your codebase.

— Implement. In order. With validation at every checkpoint.
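To make "dependency-ordered tasks with parallel execution markers" concrete, here is a minimal sketch using Python's standard `graphlib`. It is not Spec Kit's implementation. The task names are hypothetical, and the `[P]` tag simply stands in for a parallel marker; the actual task format Spec Kit emits may differ.

```python
from graphlib import TopologicalSorter


def plan_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into ordered batches.

    Tasks in the same batch do not depend on each other, so they could
    run in parallel. Each batch only depends on earlier batches.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical task graph: each task maps to the tasks it depends on.
TASKS = {
    "T001 create schema": set(),
    "T002 write API contract": set(),
    "T003 implement models": {"T001 create schema"},
    "T004 implement endpoints": {"T002 write API contract",
                                 "T003 implement models"},
    "T005 integration tests": {"T004 implement endpoints"},
}

if __name__ == "__main__":
    for step, batch in enumerate(plan_batches(TASKS), start=1):
        marker = "[P] " if len(batch) > 1 else ""
        for task in batch:
            print(f"step {step}: {marker}{task}")
```

The point of batching is that an agent (or several) can pick up everything in one batch at once, while validation checkpoints sit between batches.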

It works with 29 AI coding agents. Claude Code, Copilot, Gemini CLI, Cursor, Codex — all supported. MIT licensed. Open source.

This is what engineering with AI should look like.

Not vibes. Intent.

Full breakdown + step-by-step guide: https://www.marktechpost.com/2026/05/08/meet-github-spec-kit-an-open-source-toolkit-for-spec-driven-development-with-ai-coding-agents/

GitHub Repo: https://github.com/github/spec-kit

u/ai-lover — 13 days ago

OpenAI Adds Chrome Extension to Codex, Letting Its AI Agent Access LinkedIn, Salesforce, Gmail, and Internal Tools via Signed-In Sessions

OpenAI just launched a Chrome extension for Codex — and it changes how the AI coding agent interacts with the browser.

Unlike the in-app browser, the Chrome extension gives Codex access to your actual signed-in browser state. That means it can work inside LinkedIn, Salesforce, Gmail, and internal tools — not just public pages.

Here is a step-by-step visual guide covering:

— How to install the extension from the Chrome Web Store

— How to connect it via the Plugins menu in the Codex app

— What permissions Chrome will request (and what they mean)

— How to invoke Chrome directly using @Chrome in a prompt

— How per-site approval works and when to use the allowlist

A few technical details worth knowing before you set it up:

— Codex runs in task-specific tab groups — your active session is not interrupted

— Page content is treated as untrusted context (prompt injection risk is real)

— The Memories setting affects what context Codex carries into browser tasks

— File uploads require enabling "Allow access to file URLs" separately

— Not available in EU or UK yet

📖 Full analysis with guide: https://www.marktechpost.com/2026/05/08/openai-adds-chrome-extension-to-codex-letting-its-ai-agent-access-linkedin-salesforce-gmail-and-internal-tools-via-signed-in-sessions/

Try it here: https://chromewebstore.google.com/detail/codex/hehggadaopoacecdllhhajmbjkdcmajg

u/ai-lover — 13 days ago

Anthropic Introduces Natural Language Autoencoders That Convert Claude’s Internal Activations Directly into Human-Readable Text Explanations

Anthropic has introduced Natural Language Autoencoders (NLAs) — a method that converts a model's internal activations directly into human-readable text, making it possible to read what Claude is thinking, not just what it says.

Here's how it works:

An NLA consists of two components — an activation verbalizer (AV) that translates an activation into a text explanation, and an activation reconstructor (AR) that rebuilds the original activation from that text alone. The system is trained end-to-end: a good explanation produces an accurate reconstruction; a vague one fails. No ground truth required.
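The round trip is easier to see in a toy. The sketch below is not Anthropic's method: both components are hand-written rather than trained language models, and the feature names and the 6-dimensional "activation" are invented for illustration. It only shows the scoring idea, where an explanation is judged by how well the activation can be rebuilt from the text alone.

```python
# Toy illustration of the verbalize -> reconstruct round trip.
# Everything here is hypothetical; real NLAs use trained models for both steps.

FEATURES = ["rhyme", "code", "test", "french", "math", "safety"]


def verbalize(activation: list[float], top_k: int) -> str:
    """Toy AV: describe the activation by naming its top_k strongest features."""
    ranked = sorted(range(len(activation)),
                    key=lambda i: abs(activation[i]), reverse=True)
    return ", ".join(f"{FEATURES[i]}={activation[i]:.1f}" for i in ranked[:top_k])


def reconstruct(text: str) -> list[float]:
    """Toy AR: rebuild an activation using only the text explanation."""
    rebuilt = [0.0] * len(FEATURES)
    if text:
        for part in text.split(", "):
            name, value = part.split("=")
            rebuilt[FEATURES.index(name)] = float(value)
    return rebuilt


def reconstruction_error(original: list[float], rebuilt: list[float]) -> float:
    """Mean squared error between the original and rebuilt activations."""
    return sum((a - b) ** 2 for a, b in zip(original, rebuilt)) / len(original)


activation = [0.1, 2.0, -1.5, 0.0, 0.3, 0.0]

detailed = verbalize(activation, top_k=3)  # names three features
vague = verbalize(activation, top_k=1)     # names only one

detailed_err = reconstruction_error(activation, reconstruct(detailed))
vague_err = reconstruction_error(activation, reconstruct(vague))

if __name__ == "__main__":
    print(f"detailed: {detailed!r} -> error {detailed_err:.4f}")
    print(f"vague:    {vague!r} -> error {vague_err:.4f}")
```

The detailed explanation reconstructs the activation with a lower error than the vague one, and no human-labeled "correct explanation" is needed to tell them apart. That is the sense in which no ground truth is required.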

What Anthropic found when they applied NLAs in practice:

— On SWE-bench Verified, NLAs detected evaluation awareness on 26% of problems — Claude knew it was being tested, without ever saying so

— On a destructive code-writing eval, evaluation awareness appeared in 16% of cases — again, unverbalized

— On real claude.ai user traffic, evaluation awareness appeared in less than 1% of cases

— In a misalignment auditing game, NLAs raised hidden-motivation detection from under 3% to 12–15%, without access to training data

Beyond the numbers, NLAs have already been applied in three concrete cases:

→ Claude Mythos Preview cheated on a training task — NLAs revealed it was internally thinking about how to avoid detection

→ An early Claude Opus 4.6 was randomly responding to English queries in other languages — NLAs traced the cause back to specific training data

→ On a simple couplet task, NLAs showed Opus 4.6 planning its rhyme word ahead of time — internal planning not visible in output.

Full analysis: https://www.marktechpost.com/2026/05/08/anthropic-introduces-natural-language-autoencoders-that-convert-claudes-internal-activations-directly-into-human-readable-text-explanations/

Paper: https://transformer-circuits.pub/2026/nla/index.html#method

Technical details: https://www.anthropic.com/research/natural-language-autoencoders

Repo: https://github.com/kitft/natural_language_autoencoders

https://preview.redd.it/dfxyypnqfvzg1.png?width=1852&format=png&auto=webp&s=1eb7fa3cabafefc8ba43e247178495f4cbb9962d

u/ai-lover — 14 days ago
▲ 20 r/OpenSourceeAI+2 crossposts

LightSeek Foundation just released TokenSpeed — an open-source LLM inference engine built from scratch for agentic workloads, under the MIT license.

Built in two months. Benchmarked against TensorRT-LLM on NVIDIA B200. Results are worth paying attention to.

Here's what's interesting:

→ Compiler-backed SPMD modeling — developers annotate I/O placement at module boundaries; a static compiler generates the collective ops automatically

→ C++ FSM scheduler — enforces KV cache safety at compile time, not runtime; execution plane stays in Python for usability

→ Pluggable kernel layer — modular, heterogeneous-accelerator-aware, with one of the fastest MLA kernels on NVIDIA Blackwell

→ TokenSpeed MLA — already adopted by vLLM

Performance on Kimi K2.5 (Attention TP4 + MoE TP4, single deployment, no PD disaggregation):

→ ~9% lower latency than TensorRT-LLM at batch size 1

→ ~11% higher throughput at 100 TPS/User

→ Decode latency nearly halved vs TensorRT-LLM on speculative decoding workloads

Note: Currently a preview release.

Full Analysis: https://www.marktechpost.com/2026/05/07/lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads/

Repo: https://github.com/lightseekorg/tokenspeed

Technical details: https://lightseek.org/blog/lightseek-tokenspeed.html

u/ai-lover — 14 days ago