u/Funny-Shake-2668

▲ 3 r/ollama

I built a local Qwen2.5-VL desktop tool that lets you ask questions about any part of your screen (using Ollama + live overlays)

I built a fully local desktop app that brings vision-language reasoning directly onto your screen. It runs Qwen2.5-VL:7B locally via Ollama and lets you query any region of your desktop in natural language.

Workflow

  • Select any region of the screen (snipping-style)
  • Ask a question in plain English
  • The model returns structured coordinates via Ollama
  • Results are rendered as a clickable overlay directly on top of the screen

What it can do

  • Object localization: (“where is the cat?” → bounding box)
  • Multi-object detection: (“show cat and dog”)
  • Counting: (“how many people are in this region?” → numbered markers)
  • Video reasoning: frame-by-frame analysis + aggregation over time

Core Idea (Coordinate Mapping)

The model outputs normalized coordinates (0–1000). A deterministic mapping layer converts them into exact screen pixels, making it stable across:

  • Windows DPI scaling
  • Multi-monitor setups

No heuristics - just deterministic coordinate mapping.

Video Mode

Since Qwen2.5-VL is image-based, video is handled by: frame sampling → per-frame reasoning → aggregation into final answer.

Tech Stack

  • Model: Qwen2.5-VL:7B (Ollama, fully local)
  • UI: PyQt6 overlay (click-through UI)
  • Capture: OpenCV + mss
  • Privacy: 100% offline, no telemetry, no cloud calls

MIT licensed.

Repo: https://github.com/tomaszwi66/qlens

Curious about edge cases, failure modes, or interesting things people would try to break this with.

u/Funny-Shake-2668 — 3 days ago
▲ 7 r/ollama

Built an open-source desktop sidekick for Windows that runs fully local with Ollama

Hi everyone!

I just shipped Peeky, a local desktop sidekick for Windows:

Peeky

Visual analysis mode

It sits in the corner of your screen and lets you:

- talk by voice

- drag over any chart, image, or slide for analysis

- snap webcam photos and ask questions

- analyze clipboard/text

- use a guided “Video Coach” mode that watches progress through the webcam

Everything runs locally through Ollama. No cloud, no API keys, no telemetry. Works offline once installed. I mainly built it because I wanted something that feels like a desktop companion instead of a browser chatbot.

Setup is about 10 minutes:

Python + Ollama → install.bat → run.bat

MIT licensed.

Would genuinely love feedback from people running local models.

Repo:

https://github.com/tomaszwi66/peeky

reddit.com
u/Funny-Shake-2668 — 10 days ago