u/Funny-Shake-2668

I built a fully local desktop app that brings vision-language reasoning directly onto your screen. It runs Qwen2.5-VL:7B locally via Ollama and lets you query any region of your desktop in natural language.

Workflow

Select any region of the screen (snipping-style)
Ask a question in plain English
The model returns structured coordinates via Ollama
Results are rendered as a clickable overlay directly on top of the screen
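A minimal sketch of the step where the selected region goes to the model, assuming Ollama's default local REST endpoint (`/api/generate` on port 11434) and the model tag `qwen2.5vl:7b`. The function names and the prompt wording are illustrative and not taken from the repo.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "qwen2.5vl:7b"  # assumed Ollama tag for Qwen2.5-VL 7B


def build_payload(png_bytes: bytes, question: str) -> dict:
    """Build a non-streaming request carrying one base64-encoded image."""
    # Hypothetical prompt: asks for a JSON list so the reply can be parsed later.
    prompt = (
        f"{question}\n"
        "Answer with JSON: a list of objects with 'label' and "
        "'bbox' = [x1, y1, x2, y2]."
    )
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,
    }


def ask(png_bytes: bytes, question: str, timeout: float = 120.0) -> str:
    """Send the region to a running Ollama server and return the raw text reply."""
    data = json.dumps(build_payload(png_bytes, question)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

`ask()` needs a running Ollama server with the model pulled; `build_payload()` can be inspected without one.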

What it can do

Object localization: “where is the cat?” → bounding box
Multi-object detection: “show cat and dog”
Counting: “how many people are in this region?” → numbered markers
Video reasoning: frame-by-frame analysis + aggregation over time
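For localization and counting, the overlay needs the reply as data rather than text. A sketch of that parsing step, assuming the model answers with a JSON list of objects that each have a `label` and a four-number `bbox` (this schema is an assumption; the repo may use a different one), possibly surrounded by extra prose:

```python
import json
import re
from collections import Counter


def parse_detections(reply: str) -> list[dict]:
    """Extract a list of {'label', 'bbox'} dicts from a model reply."""
    # Take the span from the first '[' to the last ']', skipping surrounding prose.
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    detections = []
    for item in items:
        if not isinstance(item, dict):
            continue
        bbox = item.get("bbox")
        # Keep only well-formed boxes: exactly four numbers.
        if isinstance(bbox, list) and len(bbox) == 4 and all(
            isinstance(v, (int, float)) for v in bbox
        ):
            detections.append({"label": str(item.get("label", "")), "bbox": bbox})
    return detections


def count_by_label(detections: list[dict]) -> Counter:
    """Counting query: number of valid boxes per label."""
    return Counter(d["label"] for d in detections)
```

Malformed entries are dropped rather than guessed at, so a bad reply yields fewer markers instead of misplaced ones.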

Core Idea (Coordinate Mapping)

The model outputs normalized coordinates (0–1000). A deterministic mapping layer converts them into exact screen pixels, which keeps the overlay aligned across:

Windows DPI scaling
Multi-monitor setups

No heuristics - just deterministic coordinate mapping.
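The mapping described above can be sketched as follows. It assumes the 0–1000 coordinates are relative to the captured region, that the region is known in physical pixels on the virtual desktop, and that the overlay toolkit wants logical (DPI-scaled) units, hence the division by a device pixel ratio. `Region`, `to_screen_px`, and `to_logical` are hypothetical names, not the repo's API.

```python
from dataclasses import dataclass

NORM_MAX = 1000  # the model's coordinate range is 0-1000 on both axes


@dataclass(frozen=True)
class Region:
    """Captured screen region in physical pixels (virtual-desktop coordinates)."""
    left: int
    top: int
    width: int
    height: int


def to_screen_px(bbox, region: Region):
    """Map a normalized [x1, y1, x2, y2] box to physical screen pixels."""
    # Clamp stray values so a box never leaves the selected region.
    x1, y1, x2, y2 = (min(max(v, 0), NORM_MAX) for v in bbox)
    return (
        region.left + round(x1 * region.width / NORM_MAX),
        region.top + round(y1 * region.height / NORM_MAX),
        region.left + round(x2 * region.width / NORM_MAX),
        region.top + round(y2 * region.height / NORM_MAX),
    )


def to_logical(box_px, device_pixel_ratio: float):
    """Convert physical pixels to logical (DPI-scaled) units for the overlay."""
    return tuple(round(v / device_pixel_ratio) for v in box_px)


# A region on a monitor to the left of the primary one has a negative `left`.
region = Region(left=-1920, top=100, width=800, height=600)
box = to_screen_px([250, 500, 750, 1000], region)  # (-1720, 400, -1320, 700)
```

Adding the region's offset is what makes multi-monitor layouts work, since secondary monitors can sit at negative coordinates. On mixed-DPI setups the ratio would have to come from the monitor that contains the box; this sketch uses a single ratio for simplicity.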

Video Mode

Since the model is fed still images through Ollama, video is handled by: frame sampling → per-frame reasoning → aggregation into a final answer.
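A sketch of that pipeline's bookkeeping, assuming fixed-interval sampling and, for a counting question, a median across frames to damp single-frame misses. Both choices are assumptions for illustration; frame decoding (e.g. with OpenCV) and the per-frame model call are left out.

```python
from statistics import median


def sample_indices(total_frames: int, fps: float, every_s: float = 1.0) -> list[int]:
    """Indices of frames to analyze: one frame every `every_s` seconds."""
    if total_frames <= 0 or fps <= 0 or every_s <= 0:
        return []
    step = max(1, round(fps * every_s))
    return list(range(0, total_frames, step))


def aggregate_counts(per_frame_counts: list[int]) -> dict:
    """Combine per-frame counts into one answer plus the range seen over time."""
    if not per_frame_counts:
        return {"typical": 0, "min": 0, "max": 0}
    return {
        "typical": int(median(per_frame_counts)),
        "min": min(per_frame_counts),
        "max": max(per_frame_counts),
    }
```

For a 5-second clip at 30 fps sampled every 2 seconds, `sample_indices(150, 30.0, 2.0)` selects frames 0, 60, and 120. A sparser interval cuts inference time at the cost of missing short events.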

Tech Stack

Model: Qwen2.5-VL:7B (Ollama, fully local)
UI: PyQt6 overlay (click-through UI)
Capture: OpenCV + mss
Privacy: 100% offline, no telemetry, no cloud calls

MIT licensed.

Repo: https://github.com/tomaszwi66/qlens

Curious about edge cases, failure modes, or interesting things people would try to break this with.

Hi everyone!

I just shipped Peeky, a local desktop sidekick for Windows:

[Screenshots: Peeky; Visual analysis mode]

It sits in the corner of your screen and lets you:

- talk by voice

- drag over any chart, image, or slide for analysis

- snap webcam photos and ask questions

- analyze clipboard/text

- use a guided “Video Coach” mode that watches progress through the webcam

Everything runs locally through Ollama. No cloud, no API keys, no telemetry. Works offline once installed. I mainly built it because I wanted something that feels like a desktop companion instead of a browser chatbot.

Setup is about 10 minutes:

Python + Ollama → install.bat → run.bat

MIT licensed.

Would genuinely love feedback from people running local models.

Repo:

https://github.com/tomaszwi66/peeky
