Computer use is 45x more expensive than a structured API call
Hi r/AI_Agents, I recently did a benchmark on computer use agents vs api calls as part of a feature launch for my company. I wanted to share the benchmark here since it seems relevant to this sub:
See, most teams default to computer use agents not because they're cheap or accurate, but because the alternative (writing an API for every single internal tool) takes too much engineering effort to be worth it for the 20+ internal tools a team could have. But skipping building APIs is a blunder IMO, especially as AI labs are subsidizing tokens less and less.
To quantify the cost difference, I ran two different agents on the same task, using a Reflex port of a React demo app. One agent was a computer-use agent driving the UI through screenshots and clicks. The other was a tool-calling agent calling the same handlers a button click would trigger, reading structured responses back instead of rendered pages (It was done this way since the feature being tested here creates APIs instantly from event handlers in an app). Same model on both sides, of course.
The computer-use agent took 53 steps and 551k input tokens. The tool-calling agent took 8 calls and 12k tokens. (45x) The vision agent was also only able to finish the task with a 14-step walkthrough naming every sidebar and tab. Sheesh.
Some of this is a model problem. The vision agent didn't scroll, so it missed content below the fold, and a more carefully prompted or differently trained model would close part of the gap. But the rest is structural. Each screenshot is thousands of input tokens, and getting to the data the API agent reads in one response requires rendering multiple intermediate states. Better models will narrow the cost per screenshot, not the number of screenshots, because that's set by the interface. The DOM is a rendering target, not a data layer, and that part of the cost doesn't close as models get better.
For apps where state is fully exposed as data, which is most internal tools anyone is building today, the choice isn't between two valid approaches. Vision agents are still the right tool for third-party SaaS and legacy systems you can't modify.
I ran this to prove to our customers paying for computer-use because building APIs per app wasn't worth the engineering effort, and that our Reflex 0.9 update made that effort zero by auto-generating the API from the app's handlers.
Full writeup with task, prompts, cost breakdown, code, pixel art, whatever, in the comments for those who are curious.