
Open-source 30B MoE VLM with DSA (DeepSeek Sparse Attention): Keye-VL-2.0-30B-A3B
Disclosure: I’m part of the Kwai Keye team that built this model.
We released the model weights under Apache-2.0, and I’d like feedback from people working on video understanding / temporal grounding. This isn’t meant as a product announcement: what I most want to know is whether the evaluation setup and failure cases are convincing to this community.
Model:
https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B
Code:
https://github.com/Kwai-Keye/Keye
What it is:
- 30B MoE model, about 3B active parameters
- Image/video-to-text VLM
- 256K context
- DSA (DeepSeek Sparse Attention) to keep attention tractable at long context
- Designed for long-video input
- Apache-2.0
The main CV angle is temporal grounding. We are trying to make the model retain enough visual evidence across long videos to answer “when did X happen?” and “which segment contains Y?” questions without collapsing as more frames are added.
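To make "as more frames are added" concrete: for a fixed video, a larger frame budget just means a denser sample of the timeline. Here is a minimal sketch of uniform frame sampling. It assumes a plain uniform sampler that takes the center of each equal-width bin; the function name `uniform_frame_indices` is hypothetical and this is not necessarily how the released code picks frames.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` frame indices spread evenly over a video.

    Illustrative only: assumes uniform sampling at bin centers,
    which may differ from the sampler used in the actual repo.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Short clip: there is nothing to subsample, keep every frame.
    if num_samples >= total_frames:
        return list(range(total_frames))
    # Split the timeline into equal-width bins and take each bin's center.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 30-minute video at 30 fps has 54,000 frames.
    total = 30 * 60 * 30
    for budget in (64, 512):
        idx = uniform_frame_indices(total, budget)
        print(budget, "frames -> one frame every", round(total / budget / 30, 1), "s")
```

Under that assumption, a 30-minute clip gets roughly one frame every 28 seconds at a 64-frame budget and one every 3.5 seconds at 512, so short events are easy to miss entirely at the lower budget.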
Selected eval results from the model card:
- Charades-TimeLens: 58.4 mIoU
- ActivityNet-TimeLens: 58.5 mIoU
- QVHighlights-TimeLens: 70.1 mIoU
- VideoMME V2 accuracy improves from 35.3% at 64 frames to 42.4% at 512 frames
- LongVideoBench: 74.1
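For anyone less familiar with the grounding metric: mIoU here is the temporal intersection-over-union between a predicted segment and the reference segment, averaged over queries. The sketch below is the textbook definition, assuming one `(start, end)` segment in seconds per query; the names `temporal_iou` and `mean_iou` are hypothetical, and the actual benchmark scripts may handle edge cases differently.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec), assumed start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    # Overlap is clipped at zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and references."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (20.0, 30.0)]
    print(mean_iou(preds, gts))  # one exact match, one miss
```

One thing this makes visible: a single completely missed segment contributes 0, so an mIoU score alone does not say whether the errors are many slightly-off boundaries or a few total misses.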
Caveats:
- These are self-reported numbers from our own evaluation runs.
- Full technical report and more detailed methodology are still being prepared.
- No GGUF / AWQ / MLX quantized releases yet.
I’d be very interested in feedback from this community on:
- What long-video failure modes should we test beyond benchmark accuracy?
- For practical CV use, is frame sampling, temporal localization, OCR over time, or hallucination usually the first thing that breaks?
- What kind of qualitative examples would be most useful to include in the technical report?