The "browser agents are expensive and still maturing" framing might be missing something architectural
There's a thread here every few weeks about browser agents — usually ending with some version of "real but expensive and still maturing." I've shared that view too. But I think the cost and reliability problems are partly an architectural mismatch rather than just the category being early.
The pattern I keep seeing: agent + headless Chrome + AI layer stacked on top. The browser controls pages; the AI layer tries to figure out what the pages mean. Those two things are disconnected. The agent burns tokens narrating its way back into context on every hop because the browser doesn't carry any understanding between steps.
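To make the "narrating its way back into context" cost concrete, here is a toy model of it. It is a back-of-envelope sketch, not a measurement: the `Step` type, both function names, and every token count are made up for illustration. The only thing it shows is how a per-hop recap grows with the number of hops.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    page_tokens: int    # tokens spent describing the current page to the model
    action_tokens: int  # tokens in the model's decision for this step


def stacked_cost(steps: List[Step], recap_tokens_per_prior_step: int) -> int:
    """Tokens used when each hop re-sends a recap of every earlier hop."""
    total = 0
    for i, step in enumerate(steps):
        recap = i * recap_tokens_per_prior_step  # hop i recaps i earlier hops
        total += recap + step.page_tokens + step.action_tokens
    return total


def carried_cost(steps: List[Step]) -> int:
    """Tokens used when context carries over and no recap is needed."""
    return sum(step.page_tokens + step.action_tokens for step in steps)


# Hypothetical 5-hop task; all numbers are illustrative assumptions.
steps = [Step(page_tokens=2000, action_tokens=150)] * 5
print("stacked:", stacked_cost(steps, recap_tokens_per_prior_step=300))
print("carried:", carried_cost(steps))
```

The recap term grows quadratically with the number of hops, which is why long multi-step tasks feel disproportionately expensive in the stacked setup.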
I've been testing a different configuration. Opera Neon has a CLI now — opera-browser-cli — that exposes the browser's native AI agents (Do, Make, Research) as terminal commands. The AI is inside the browser, not bolted on top of it. When you call it from an external orchestrator, you're not calling a page controller that needs a separate model to interpret the output. You're calling something that already knows what it's looking at.
Practically: it runs headless and locally, binds to a port, and returns output that my orchestration layer can use without a cleanup step. In my testing, token overhead has been lower than with the Playwright-plus-model-plus-prompt stack I was running before.
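For anyone wiring a local CLI into an orchestrator, a minimal sketch of the calling side might look like this. It deliberately does not guess at opera-browser-cli's actual arguments or output format: the command line is passed in by the caller, and the demo uses the Python interpreter as a stand-in. `StepResult` and `run_step` are hypothetical names.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    ok: bool      # True only if the process exited with status 0
    output: str   # captured stdout, stripped
    error: str    # captured stderr, or a description of the failure


def run_step(argv: List[str], timeout_s: float = 60.0) -> StepResult:
    """Run one CLI invocation and turn every failure into a StepResult."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return StepResult(False, "", f"timed out after {timeout_s}s")
    except FileNotFoundError:
        return StepResult(False, "", f"command not found: {argv[0]}")
    return StepResult(
        proc.returncode == 0, proc.stdout.strip(), proc.stderr.strip()
    )


# Stand-in command (an assumption): replace with the real CLI invocation.
stand_in = [sys.executable, "-c", "print('page summary')"]
result = run_step(stand_in)
print(result.ok, result.output)
```

Returning a result object instead of raising keeps timeouts and missing binaries on the same path as non-zero exits, so the orchestrator can decide whether to retry or fall back.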
This doesn't solve everything. Anti-bot layers are still messy regardless of your architecture. And you're dependent on having an active Neon session, which limits purely serverless use cases. But the failure modes are different — and more recoverable — when the browser understands what it's doing rather than just reporting what it saw.
Anyone else approaching it this way? What's your browser layer when the task genuinely requires understanding the page rather than parsing it?