Open-source 30B MoE VLM with DSA (DeepSeek Sparse Attention): Keye-VL-2.0-30B-A3B

Disclosure: I’m part of the Kwai Keye team that built this model.

We released the model weights under Apache-2.0, and I’d like feedback from people working on video understanding / temporal grounding. This isn’t meant as a product announcement; what I’d most like to know is whether the evaluation setup and failure cases are convincing to this community.

Model:
https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B

Code:
https://github.com/Kwai-Keye/Keye

What it is:
- 30B MoE model, about 3B active parameters
- Image/video-to-text VLM
- 256K context
- DSA (DeepSeek Sparse Attention) to keep long-context attention cost manageable
- Designed for long-video input
- Apache-2.0

The main CV angle is temporal grounding. We are trying to make the model retain enough visual evidence across long videos to answer “when did X happen?” and “which segment contains Y?” questions without collapsing as more frames are added.

Selected eval results from the model card:
- Charades-TimeLens: 58.4 mIoU
- ActivityNet-TimeLens: 58.5 mIoU
- QVHighlights-TimeLens: 70.1 mIoU
- VideoMME V2 accuracy improves from 35.3% at 64 frames to 42.4% at 512 frames
- LongVideoBench: 74.1
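
For anyone less familiar with the metric: mIoU for temporal grounding is usually the mean, over all queries, of the overlap between the predicted and ground-truth time spans. Here is a minimal Python sketch of that standard definition. The function names are illustrative, and it assumes one ground-truth span per query; any extra rules the TimeLens benchmarks apply are not reflected, so don't read this as the official scoring code.

```python
from typing import Sequence, Tuple

# A segment is (start_sec, end_sec) with start_sec <= end_sec (assumed).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals; 0.0 if they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a prediction of (10 s, 20 s) against a ground truth of (15 s, 25 s) overlaps for 5 s out of a 15 s union, so its IoU is 1/3.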

Caveats:
- These are our own released eval numbers.
- Full technical report and more detailed methodology are still being prepared.
- No GGUF / AWQ / MLX quantized releases yet.

I’d be very interested in feedback from this community on:
- What long-video failure modes should we test beyond benchmark accuracy?
- For practical CV use, is frame sampling, temporal localization, OCR over time, or hallucination usually the first thing that breaks?
- What kind of qualitative examples would be most useful to include in the technical report?

https://preview.redd.it/fphfdtkpwt3h1.png?width=1244&format=png&auto=webp&s=8b272a251fda28e9d4fbda4f19b231fc2b4c8c36

https://preview.redd.it/vwoj2ocswt3h1.png?width=5140&format=png&auto=webp&s=90390cc879f8c236f08fbdd988e9e8b1dfee1797

u/Individual_Soil4641 — 14 days ago

Kwai Keye-VL-2.0-30B-A3B: Apache-2.0 30B MoE VLM, 3B active params, looking for local-running feedback

Disclosure: I’m part of the Kwai Keye team that built this model.

We just released Keye-VL-2.0-30B-A3B on Hugging Face and I’m mainly posting here because I’d like feedback from people actually running local LLM/VLM setups.

Model:
https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B

Quick facts:
- 30B MoE, about 3B active parameters
- Apache-2.0
- Multimodal / long-video focused
- 256K context
- Uses DSA / DeepSeek Sparse Attention
- Built-in Code / Tool / Search capabilities
- No GGUF, AWQ, or MLX quants yet

Some eval results from our model card:
- Charades-TimeLens: 58.4 mIoU
- ActivityNet-TimeLens: 58.5 mIoU
- QVHighlights-TimeLens: 70.1 mIoU
- VideoMME V2 improves from 35.3% at 64 frames to 42.4% at 512 frames
- LongVideoBench: 74.1

Caveat: these are our released/model-card eval numbers. The full technical report is still being prepared.

What I’d really like to learn from this sub:
- What hardware would you try a 30B MoE VLM on?
- What local inference stack would you want first: GGUF, AWQ, MLX, vLLM, something else?
- For long-video use cases, what usually breaks first for you: VRAM, prefill latency, frame sampling, tool support, or model behavior?
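
As a rough starting point for the hardware question, weight memory can be estimated from the parameter count alone. The sketch below assumes "30B" means 30e9 parameters and counts weights only (no KV cache, activations, vision-encoder buffers, or quantization metadata), so real usage will be higher. The helper name is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weights-only memory in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 30e9  # assumption: "30B" taken as exactly 30e9 parameters

for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    # Prints 60, 30, and 15 GB respectively for the three precisions.
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Note that with an MoE model all experts normally have to be resident (or offloaded) somewhere, so the ~3B active parameters reduce per-token compute, not the weight footprint.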

If anyone tries it locally, failure reports would be more useful than just benchmark reactions.

https://preview.redd.it/kiaqesqays3h1.png?width=5140&format=png&auto=webp&s=ec9de0474f1b57a3c946adfd79576469c907017e

https://preview.redd.it/xcj82tqays3h1.png?width=1244&format=png&auto=webp&s=a6319c381a39fb6f860cac9a296df8888d884998

u/Individual_Soil4641 — 14 days ago

Kwai Keye-VL-2.0-30B-A3B released: 30B MoE / 3B active, Apache-2.0, first production VLM with DSA (DeepSeek Sparse Attention)

We just released Keye-VL-2.0-30B-A3B, the latest 30B-class flagship base model in the Keye series. It is built for long-video understanding and introduces the first generation of agent capabilities in the Keye family.

Highlights:

- Outstanding Video Understanding and Temporal Localization: across five video benchmarks, Keye-VL-2.0-30B-A3B leads open-source competitors; on temporal grounding it beats Gemini 3 Flash on two of the three TimeLens benchmarks and trails slightly on the third.
- DSA-Native Long-Context Architecture: sparse attention and targeted feature aggregation enable precise hour-long video understanding while keeping computation efficient.
- High-Efficiency Inference and Training Stack: DSA (DeepSeek Sparse Attention), ExtraIO, heterogeneous ViT-LM parallelism, activation optimization, and custom kernels reduce long-sequence prefill cost and boost training throughput.
- Data-Centric Multimodal Pre-Training: Keye-VL-1.5 vision encoder + synthetic CoT data strengthen perception, OCR/chart/table understanding, and reasoning continuity.
- Robust Post-Training for Reliable Reasoning: MOPD, bucket advantage scaling, Context-RL, and high-SNR data filtering improve cross-modal expert merging, reduce hallucinations, and stabilize long-context decisions.
- Agent-Ready Multimodal Capabilities: built-in Code, Tool, and Search agent abilities for repository tasks, API-style tool use, web-grounded search, and visual self-correction workflows.

As the first multimodal model to deploy DSA in production, it delivers nearly lossless reasoning over a 256K-token context.

Selected bench numbers (chart attached):

Fine-grained Temporal Understanding (TimeLens, mIoU):
- Charades-TimeLens: 58.4, about 2.8 points behind Gemini 3 Flash (61.19), one of the strongest closed-source video baselines we tested.
- ActivityNet-TimeLens: 58.5, surpassing Gemini 3 Flash (56.95).
- QVHighlights-TimeLens: 70.1, neck-and-neck with the top closed-source models on the official leaderboard and far ahead of Gemini 3 Flash (49.45).

Long-Context Scaling (VideoMME V2): where most competitors degrade as the input frame count grows, our model's accuracy increases from 35.3% at 64 frames to 42.4% at 512 frames; the non-linear reasoning score climbs from 18.5 to 24.2.
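
To give a sense of what the frame budget means for an hour-long video, here is a small Python sketch of uniform frame sampling. The sampler actually used in these evals is not described in this post, so uniform midpoint sampling and the function name are assumptions for illustration only.

```python
from typing import List


def uniform_frame_timestamps(duration_s: float, num_frames: int) -> List[float]:
    """Timestamps (seconds) at the midpoints of num_frames equal-width bins."""
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    step = duration_s / num_frames
    return [(i + 0.5) * step for i in range(num_frames)]


# One-hour video: spacing between sampled frames at each budget.
for budget in (64, 512):
    ts = uniform_frame_timestamps(3600.0, budget)
    print(f"{budget} frames: one frame every {ts[1] - ts[0]:.2f} s")
```

Under this assumption, 64 frames over one hour means one frame every 56.25 s, while 512 frames brings that down to about 7.03 s, so short events are far less likely to fall between samples.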

Comprehensive Long-Video Understanding:
- LongVideoBench: 74.1, surpassing both Qwen3.5-35B-A3B and the much larger Qwen3-VL-235B-A22B.

At 30B scale, Keye-VL-2.0-30B-A3B outperforms open-source models with 200B+ parameters (e.g., Qwen3-VL-235B) on temporal understanding, and it matches or in places exceeds the top closed-source models.

Links:
- HF: https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B
- GitHub: https://github.com/Kwai-Keye/Keye-VL

Happy to answer questions about the architecture, the DSA integration, or the video training data pipeline.

https://preview.redd.it/rm9xqrhjrs3h1.png?width=5140&format=png&auto=webp&s=12d34a48e7ea7eec17597ad1d458d18d662edb1a

https://preview.redd.it/6gsc3shjrs3h1.png?width=1244&format=png&auto=webp&s=26c601b84c0e7363d8a107711fbda9f91cf06e06

u/Individual_Soil4641 — 14 days ago