u/Apart-Medium6539

Desktop agent architecture: UI tree, screenshots, or hybrid?

I’m working on a desktop agent architecture and debating the perception layer.

Options:

  1. UI Automation tree only

  2. Screenshots only

  3. Hybrid: UI tree first, screenshot fallback

UI trees are structured and more private. Screenshots are more complete, but noisier and expose more data.
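To illustrate the privacy point: a UI tree can be filtered field by field before anything reaches the model, which is much harder to do with pixels. This is a minimal sketch under stated assumptions. The `UIElement` record and the `redact` helper are hypothetical names made up for illustration, not Pupil's actual code. The `is_password` flag is modeled on the IsPassword property that UI Automation exposes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class UIElement:
    # Hypothetical, simplified record of one node in the UI tree.
    role: str
    name: str
    value: str = ""
    is_password: bool = False

def redact(elements):
    """Blank the value of password fields before the tree is sent to an agent."""
    return [replace(e, value="") if e.is_password else e for e in elements]

tree = [
    UIElement("Edit", "Username", value="alice"),
    UIElement("Edit", "Password", value="hunter2", is_password=True),
    UIElement("Button", "Sign in"),
]
safe_tree = redact(tree)
```

The same idea could extend to dropping whole windows or apps from the tree, which a raw screenshot cannot do without extra image processing.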

The agent proposes an action, highlights the target, and waits for user approval.

From an architecture perspective, would you keep it UI-tree-only or design hybrid perception from the start?

u/Apart-Medium6539 — 9 days ago

Human-approved desktop agents: useful or too much friction?

I’m building Pupil, an open-source Windows tool for desktop AI automation.

The idea is simple: agents should not silently click/type on your machine.

Flow:

- agent inspects visible UI

- overlay highlights the target

- user approves or skips

- then the action runs
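The flow above can be sketched as a small gate function. This is only an illustration: `ProposedAction`, `run_with_approval`, and the three callbacks are hypothetical names, and the real overlay and input handling are replaced by plain Python callables.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    kind: str    # e.g. "click" or "type"
    target: str  # human-readable name of the UI element

def run_with_approval(action, highlight, ask_user, execute):
    """Highlight the target, then execute only if the user approves."""
    highlight(action.target)
    if not ask_user(action):
        return "skipped"
    execute(action)
    return "executed"

# Stand-ins for the overlay and the input layer: they just record what happened.
log = []
action = ProposedAction("click", "Save button")
result_yes = run_with_approval(
    action, log.append, lambda a: True, lambda a: log.append(f"{a.kind}:{a.target}")
)
result_no = run_with_approval(
    action, log.append, lambda a: False, lambda a: log.append("should not happen")
)
```

The point of the design is that `execute` is unreachable without passing through `ask_user`, so nothing runs silently.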

It uses Windows UI Automation + MCP, with no screenshots by default.

Would you use this for desktop workflows, or does human approval kill the automation value?

u/Apart-Medium6539 — 9 days ago

I built a human-approved automation layer for Windows agents

I’m building Pupil, an open-source Windows tool for desktop automation agents.

Instead of silent clicks, the agent must:

- inspect visible UI

- highlight the target

- wait for approval

- then act

It uses Windows UI Automation + MCP, with no screenshots by default.

Question: should approval always be required, or should users be able to allow repeated low-risk actions?
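One possible middle ground is a policy that remembers approvals only for action kinds classified as low risk. The sketch below assumes a hypothetical `ApprovalPolicy` class, and the contents of `LOW_RISK_KINDS` are an assumption for illustration, not a recommendation of what is actually safe.

```python
from dataclasses import dataclass, field

# Assumed classification; which kinds count as low risk is a product decision.
LOW_RISK_KINDS = {"scroll", "focus", "hover"}

@dataclass
class ApprovalPolicy:
    remembered: set = field(default_factory=set)

    def needs_prompt(self, kind, app, target):
        """Anything that is not low risk always prompts; low-risk actions
        prompt until the user has remembered that exact action."""
        if kind not in LOW_RISK_KINDS:
            return True
        return (kind, app, target) not in self.remembered

    def remember(self, kind, app, target):
        """Store an approval, but only for low-risk kinds."""
        if kind in LOW_RISK_KINDS:
            self.remembered.add((kind, app, target))

policy = ApprovalPolicy()
first_scroll = policy.needs_prompt("scroll", "Notepad", "Document")
policy.remember("scroll", "Notepad", "Document")
second_scroll = policy.needs_prompt("scroll", "Notepad", "Document")

policy.remember("type", "Notepad", "Document")  # ignored: "type" is not low risk
typing = policy.needs_prompt("type", "Notepad", "Document")
```

Keying the memory on the exact (kind, app, target) triple keeps an approval from silently widening to other apps or elements.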

u/Apart-Medium6539 — 9 days ago

Desktop agent architecture: UI tree, screenshots, or hybrid?

I’m working on a Windows desktop agent system and trying to decide the perception architecture.

The agent needs to understand what is visible on screen before proposing an action. I see three possible approaches:

  1. Use the Windows UI Automation tree only

  2. Use screenshots/screen streaming only

  3. Use a hybrid model: UIA first, screenshots as fallback

UIA gives structured data like buttons, inputs, labels, focused elements, window titles, and bounding boxes. It is faster, cheaper, and more privacy-friendly than screenshots.

But it also fails on custom controls, canvas-heavy apps, games, and some Electron apps with weak accessibility metadata.

The system is human-in-the-loop: the agent proposes an action, the UI highlights the target, and the user approves or skips before anything runs.

From an architecture perspective, would you keep perception structured-only for simplicity, or design hybrid perception from the beginning?
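For reference, the hybrid option can be kept small if the fallback decision lives in one function. This is a minimal sketch, assuming two hypothetical reader callables (`read_uia`, `read_screenshot`) and a made-up heuristic: fall back when the tree has fewer than `min_named` elements with a non-empty name. The threshold is an assumption, not a measured value.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Perception:
    source: str          # "uia" or "screenshot"
    elements: List[dict]

def perceive_hybrid(read_uia, read_screenshot, min_named=3):
    """Try the UI tree first; fall back to a screenshot when it looks too sparse."""
    named = [e for e in read_uia() if e.get("name")]
    if len(named) >= min_named:
        return Perception("uia", named)
    return Perception("screenshot", read_screenshot())

# A well-labelled app: the structured path is enough.
rich_app = lambda: [{"name": "File"}, {"name": "Edit"}, {"name": "Save"}, {"name": ""}]
# A canvas-heavy app: the tree exists but carries almost no metadata.
canvas_app = lambda: [{"name": ""}, {"name": "Canvas"}]
# Placeholder for whatever a vision step would return.
fake_screenshot = lambda: [{"name": "detected button", "from": "vision"}]

structured = perceive_hybrid(rich_app, fake_screenshot)
fallback = perceive_hybrid(canvas_app, fake_screenshot)
```

Because the result records its `source`, the approval overlay could tell the user when a screenshot was taken, which keeps the privacy trade-off visible.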

u/Apart-Medium6539 — 9 days ago

Update on Pupil: UI Automation first, or screenshot fallback?

I posted Pupil here a few days ago — an open-source Windows layer for desktop AI agents.

Current flow:

- agent reads visible UI through Windows UI Automation

- overlay highlights what it wants to click/type

- user approves or skips

- MCP layer connects it to agents

Now I’m debating the next step.

UIA is fast, structured, and more private than screenshots. But it can fail on custom UIs, canvas apps, games, and some Electron apps.

Would you keep it UIA-only for now, or add screenshot fallback early?

u/Apart-Medium6539 — 9 days ago
▲ 2 r/cursor

Would desktop UI perception be useful for Cursor agents?

I’m building an MCP tool for Cursor that lets the agent inspect visible Windows UI, highlight what it wants to click/type, and wait for user approval.

Use case: helping with desktop apps outside the codebase — settings panels, dev tools, installers, local apps, etc.

Flow:

- Cursor calls MCP

- MCP reads UI through Windows UI Automation

- overlay highlights target

- user approves/skips

Would you use this in Cursor, or is desktop control outside the IDE too much?

u/Apart-Medium6539 — 9 days ago
▲ 1 r/mcp

Pupil: an MCP layer that gives AI agents eyes on Windows desktop UI

I’m building Pupil, an open-source MCP layer for Windows desktop agents.

The problem I’m trying to solve: agents can use tools and APIs, but they’re still mostly blind when working with normal desktop apps.

Pupil exposes tools like:

  • perceive — read visible UI elements through Windows UI Automation
  • indicate — highlight what the agent wants to click/type
  • approval flow — user accepts/skips before actions happen

So the loop becomes:

agent sees UI → highlights intent → user approves → action runs
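To make the tool layer concrete, here is a simplified sketch of how a server might route an MCP `tools/call` request to `perceive` or `indicate`. It is not Pupil's implementation: the handler bodies are placeholders, the `target` argument name is hypothetical, and a real server would normally use an MCP SDK rather than hand-built JSON-RPC dictionaries.

```python
import json

def perceive(arguments):
    # Placeholder: a real implementation would query Windows UI Automation.
    return [{"role": "Button", "name": "OK"}]

def indicate(arguments):
    # Placeholder: a real implementation would draw the highlight overlay.
    return {"highlighted": arguments["target"]}

TOOLS = {"perceive": perceive, "indicate": indicate}

def handle_tools_call(request):
    """Dispatch one JSON-RPC 'tools/call' request to the matching tool."""
    params = request["params"]
    handler = TOOLS.get(params["name"])
    if handler is None:
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32602, "message": f"Unknown tool: {params['name']}"},
        }
    result = handler(params.get("arguments", {}))
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": json.dumps(result)}]},
    }

ok = handle_tools_call({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "indicate", "arguments": {"target": "OK"}},
})
missing = handle_tools_call({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "screenshot", "arguments": {}},
})
```

In this shape, a screenshot fallback would just be one more entry in `TOOLS`, so it can be added later without changing the dispatch code.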

Right now I’m debating the next architecture step:

  1. keep it UI Automation only
  2. add screenshots/screen stream fallback
  3. build a standalone app on top of the MCP server

Curious what MCP builders think. Should desktop perception stay structured/UIA-first, or should screenshot fallback be part of the protocol layer?

Repo: GitHub

Feedback very welcome.

u/Apart-Medium6539 — 9 days ago

Building a desktop agent layer: UI Automation vs screenshots?

I’m building a Windows layer for AI agents and trying to choose the right perception model.

Right now it uses Windows UI Automation to read visible UI: buttons, inputs, labels, windows, focus state, and bounding boxes. Before any click/type, an overlay highlights the target and the user approves or skips.
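As an example of what the bounding boxes are good for, the overlay rectangle can be derived directly from an element's bounds. This is a minimal sketch: the `Rect` type, the `highlight_rect` helper, and the 4-pixel padding are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int
    bottom: int

def highlight_rect(bounds, screen, padding=4):
    """Grow the element's bounding box by `padding` pixels, clamped to the screen."""
    return Rect(
        max(bounds.left - padding, screen.left),
        max(bounds.top - padding, screen.top),
        min(bounds.right + padding, screen.right),
        min(bounds.bottom + padding, screen.bottom),
    )

screen = Rect(0, 0, 1920, 1080)
button = Rect(2, 100, 82, 130)   # a button near the left edge of the screen
overlay = highlight_rect(button, screen)
```

Clamping matters for elements near a screen edge, where a padded highlight would otherwise be drawn partly off-screen.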

UIA feels fast, structured, and more private than screenshots. But it breaks on custom UIs, games, canvas apps, and some Electron apps.

So I’m debating:

  1. UIA only
  2. screenshots only
  3. hybrid: UIA first, screenshot fallback

For people building agents that act on real apps, what would you choose?

u/Apart-Medium6539 — 9 days ago

I built a Windows overlay so AI agents ask before touching your desktop

I’m building Pupil, an open-source Windows layer for desktop AI agents.

It lets an agent read visible UI, highlight what it wants to do, then wait for the user to approve or skip before any click/type happens.

Current stack:

  • Windows UI Automation
  • MCP
  • local approval overlay

I’m debating the next step:

  • stay MCP-first
  • make a standalone app
  • add screenshot/screen-stream fallback
  • keep it as “show me where to click” guidance only

Would this make desktop agents feel safer and more trustworthy, or still too creepy?

u/Apart-Medium6539 — 9 days ago

I gave AI agents eyes on my PC

I built Pupil, an open-source tool.

The pain point: I was sending too many screenshots to AI tools just to ask where to click.

Now the agent can inspect the UI, point at the target, and wait for approval.

Feedback welcome.

u/Apart-Medium6539 — 13 days ago

Pupil: I gave ChatGPT eyes on my PC

I built Pupil, an open-source tool for AI agents.

Instead of uploading screenshots to ask where to click, the agent can inspect the app, highlight the target, and wait for approval.

GitHub

github.com
u/Apart-Medium6539 — 13 days ago
▲ 9 r/Agent_AI+5 crossposts

I gave ChatGPT eyes on my PC, now it can show me where to click

I kept sending screenshots to ChatGPT just to ask “where do I click?”

So I built Pupil, an open-source tool that lets an AI agent inspect desktop apps, point at the right place, and wait for approval before clicking (press Tab to approve).

Demo: I ask it to find Discord’s data/privacy settings.

Looking for feedback on the idea and UX.

GitHub

u/Apart-Medium6539 — 13 days ago