I built a local Qwen2.5-VL desktop tool that lets you ask questions about any part of your screen (using Ollama + live overlays)
I built a fully local desktop app that brings vision-language reasoning directly onto your screen. It runs Qwen2.5-VL:7B locally via Ollama and lets you query any region of your desktop in natural language.
Workflow
- Select any region of the screen (snipping-style)
- Ask a question in plain English
- The model returns structured coordinates via Ollama
- Results are rendered as a clickable overlay directly on top of the screen
What it can do
- Object localization: (“where is the cat?” → bounding box)
- Multi-object detection: (“show cat and dog”)
- Counting: (“how many people are in this region?” → numbered markers)
- Video reasoning: frame-by-frame analysis + aggregation over time
Core Idea (Coordinate Mapping)
The model outputs normalized coordinates (0–1000). A deterministic mapping layer converts them into exact screen pixels, making it stable across:
- Windows DPI scaling
- Multi-monitor setups
No heuristics - just deterministic coordinate mapping.
Video Mode
Since Qwen2.5-VL is image-based, video is handled by: frame sampling → per-frame reasoning → aggregation into final answer.
Tech Stack
- Model: Qwen2.5-VL:7B (Ollama, fully local)
- UI: PyQt6 overlay (click-through UI)
- Capture: OpenCV + mss
- Privacy: 100% offline, no telemetry, no cloud calls
MIT licensed.
Repo: https://github.com/tomaszwi66/qlens
Curious about edge cases, failure modes, or interesting things people would try to break this with.