u/Environmental_Owl901

I’ve been looking at mobile agents recently, and one thing feels different from browser/computer-use agents.

On desktop, a lot of tasks still happen in relatively stable surfaces: browser tabs, files, terminals, web apps, APIs, etc.

On phones, the agent has to survive messier state changes:

- keyboard opens and changes the layout

- permission popups interrupt the flow

- apps reload or switch context

- a wrong tap can land you on a totally different screen

- the same task may cross multiple apps

So my current guess is:

The hard part is not “can the agent tap a button?”

It’s whether the agent can keep track of where it is after the UI changes.
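
To make "keeping track of where it is" concrete, here is a rough sketch of one way it could be done: fingerprint each screen by the structure of its UI nodes and compare fingerprints before and after an action. Everything here is an assumption for illustration: the `Node` fields, the `fingerprint` and `classify_transition` helpers, and the 0.7 threshold are made up, not taken from any real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical, simplified UI node (e.g. parsed from an accessibility dump)."""
    resource_id: str
    class_name: str
    text: str = ""
    bounds: tuple = (0, 0, 0, 0)


def fingerprint(nodes):
    """Structural signature of a screen.

    Text and bounds are ignored on purpose: they change when the keyboard
    opens or content reloads, even though the screen is still the same one.
    """
    return frozenset((n.resource_id, n.class_name) for n in nodes if n.resource_id)


def similarity(a, b):
    """Jaccard similarity between two fingerprints, in [0.0, 1.0]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_transition(before, after, same_threshold=0.7):
    """Return 'same_screen' or 'new_screen' (threshold is an arbitrary guess)."""
    score = similarity(fingerprint(before), fingerprint(after))
    return "same_screen" if score >= same_threshold else "new_screen"


login = [
    Node("email", "EditText", bounds=(0, 400, 1080, 520)),
    Node("password", "EditText", bounds=(0, 540, 1080, 660)),
    Node("submit", "Button", bounds=(0, 700, 1080, 820)),
]
# Keyboard opens: same widgets, but everything is pushed up.
login_with_keyboard = [
    Node(n.resource_id, n.class_name, n.text, (0, n.bounds[1] - 300, 1080, n.bounds[3] - 300))
    for n in login
]
# A permission popup replaces the visible tree.
popup = [Node("dialog_title", "TextView"), Node("allow", "Button"), Node("deny", "Button")]

print(classify_transition(login, login_with_keyboard))  # same_screen
print(classify_transition(login, popup))                # new_screen
```

The idea is only that "the layout shifted" and "I am somewhere else" should be distinguishable signals; a real agent would probably need something sturdier than matching resource IDs.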

For people who have worked on mobile automation / GUI agents / Android agents:

Does this match your experience, or is another failure mode more painful?

Claude Code and Codex made coding agents feel much more real to a lot of people.

But I’m curious about the next step: agents that don’t just write code or call APIs, but actually operate real apps.

For mobile GUI agents, the hard part seems to be doing every step of the loop reliably:

- reading the current screen

- understanding UI state

- deciding the next action

- tapping, typing, going back, switching apps

- verifying whether the action worked

- recovering from popups, loading states, and layout changes
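
The steps above can be sketched as a small act / verify / recover loop. This is a toy under stated assumptions: `FakeDevice` is a made-up stand-in for a phone that throws a permission popup on the first tap, and `run_step` is a hypothetical helper, not an API from any real automation library.

```python
class FakeDevice:
    """Toy phone: the first tap on 'settings' is interrupted by a permission popup."""

    def __init__(self):
        self.screen = "home"
        self.popup_shown = False

    def observe(self):
        # Stand-in for reading the current screen (screenshot or UI tree).
        return self.screen

    def tap(self, target):
        if target == "settings" and not self.popup_shown:
            self.popup_shown = True
            self.screen = "permission_popup"
        elif self.screen == "permission_popup" and target == "allow":
            self.screen = "home"
        elif self.screen == "home" and target == "settings":
            self.screen = "settings"
        # Any other tap is a no-op in this toy model.


def run_step(device, target, expected_screen, max_attempts=3):
    """Tap, re-read the screen, verify, and recover from a known interruption."""
    trace = []
    for _ in range(max_attempts):
        device.tap(target)                # act
        seen = device.observe()           # read the screen again
        trace.append(seen)
        if seen == expected_screen:       # verify the action worked
            return True, trace
        if seen == "permission_popup":    # recover, then retry the original action
            device.tap("allow")
    return False, trace


ok, trace = run_step(FakeDevice(), "settings", "settings")
print(ok, trace)  # True ['permission_popup', 'settings']
```

The point of the sketch is that the action itself is one line; almost everything else is checking what actually happened and getting back on track.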

Do you think this is better handled with a VLM-first approach, an accessibility-tree-first approach, or a hybrid system?
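
For reference, this is roughly what I mean by "hybrid": try to ground the target in the accessibility tree first, and only fall back to a vision model when the tree has nothing usable. It is a minimal sketch with assumptions: the node dictionaries, the `locate` helper, and the `vlm_locate` callback are hypothetical placeholders, and the lambdas below just fake a vision model.

```python
def center(bounds):
    """Center point of a (left, top, right, bottom) rectangle."""
    left, top, right, bottom = bounds
    return ((left + right) // 2, (top + bottom) // 2)


def locate(label, tree, vlm_locate):
    """Return (point, source) for the element called `label`.

    tree: list of dicts with 'text', 'content_desc', and 'bounds' keys
          (a hypothetical flattened accessibility tree).
    vlm_locate: callable taking a label and returning (x, y) or None
          (a placeholder for a screenshot-based locator).
    """
    wanted = label.lower()
    for node in tree:
        names = (node.get("text", "").lower(), node.get("content_desc", "").lower())
        if wanted in names:
            return center(node["bounds"]), "accessibility"

    # Tree had no match (custom-drawn view, game canvas, etc.): ask the vision side.
    point = vlm_locate(label)
    if point is not None:
        return point, "vlm"
    return None, "none"


tree = [{"text": "Send", "content_desc": "", "bounds": (100, 200, 300, 260)}]

print(locate("send", tree, lambda label: None))      # ((200, 230), 'accessibility')
print(locate("Play", tree, lambda label: (50, 60)))  # ((50, 60), 'vlm')
```

VLM-first would flip the order, and accessibility-tree-first would drop the fallback entirely.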