u/Able_Programmer_2564

Hey everyone,

I’ve been spending way too much time lately trying to get agents to actually use a computer beyond the browser.

The biggest wall I kept hitting is that while multimodal LLMs are amazing at looking at a screenshot and telling you what's there, they are surprisingly bad at actually clicking the right pixel. In the browser, we have the DOM to help us out, but once you move to native OS apps, you're stuck with accessibility trees. If you’ve ever tried to automate a legacy Windows app or a custom Electron build, you know how inconsistent and "non-deterministic" those trees can be.

So, I decided to try a purely vision-based approach and built SoMatic.

It basically brings the "Set-of-Marks" (SOM) prompting style to the OS level. I used a fine-tuned YOLO model to detect buttons, icons, and text fields across Mac, Windows, and Linux. It throws a numerical overlay on the screen so the agent doesn't have to guess coordinates, it just says "click 4" and the framework handles the rest.

The part that actually shocked me: I ran some benchmarks against ScreenSpot-Pro and it’s currently beating the GPT-5.5 (high) baseline by about 20%, and OmniParser v2.0 by roughly 40%.

One weird thing I found: During ablation testing, the model actually performed better when it only had the textual coordinates of the boxes rather than seeing the visual labels on the screenshot. I'm thinking the YOLO detections might be adding too much visual noise at certain thresholds, but I’m still digging into that.

I’ve also included a stdio MCP server, so if you're using Claude Code or anything MCP-compatible, you can plug this in and it’ll start using your machine immediately.

In the video, I’m using it to have Claude Code open a random PDF, find a chess position, and then go replicate it 1-to-1 on Chess.com.

It’s all open source. If you want to play around with it or (more likely) help me find all the ways it breaks on different OS setups, I’d love the feedback!

To try it out:

npm install -g somatic-cli/cli

npx skills add Smyan1909/SoMatic

Let me know what you think about the vision-only vs. accessibility-tree approach. Is anyone else finding that metadata is becoming more of a hurdle than a help?

(GitHub link in the comments)