Is state tracking the hardest part of phone-use AI?
I’ve been looking at mobile agents recently, and one thing feels different from browser/computer-use agents.
On desktop, a lot of tasks still happen in relatively stable surfaces: browser tabs, files, terminals, web apps, APIs, etc.
On phones, the agent has to survive messier state changes:
- keyboard opens and changes the layout
- permission popups interrupt the flow
- apps reload or switch context
- a wrong tap can land you on a totally different screen
- the same task may cross multiple apps
So my current guess is:
The hard part is not “can the agent tap a button?”
It’s whether the agent can keep track of where it is after the UI changes.
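To make "keeping track" a bit more concrete, here is a rough Python sketch of the idea: after every action, compare what the agent expected to see with what it actually observes, and label the mismatch. Everything in it is an assumption for illustration (the `Screen` fields, the `StateTracker` class, the label names, the `com.example.*` packages); it is not taken from any real agent framework, and a real system would need a far richer screen representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Screen:
    """Hypothetical snapshot of what the agent observes after an action."""
    package: str                   # foreground app, e.g. "com.example.mail"
    screen_id: str                 # coarse label for the screen, e.g. "compose"
    keyboard_visible: bool = False
    dialog: Optional[str] = None   # e.g. "permission" when a popup is on top


@dataclass
class StateTracker:
    """Compares each observation against the screen the agent expected."""
    expected: Screen
    history: List[Tuple[Screen, str]] = field(default_factory=list)

    def classify(self, observed: Screen) -> str:
        # Checks run from most disruptive to least disruptive.
        if observed.dialog is not None:
            result = "interrupted"      # a popup is covering the flow
        elif observed.package != self.expected.package:
            result = "app_switched"     # a different app is in the foreground
        elif observed.screen_id != self.expected.screen_id:
            result = "wrong_screen"     # same app, unexpected screen
        elif observed.keyboard_visible != self.expected.keyboard_visible:
            result = "layout_shifted"   # same screen, keyboard changed the layout
        else:
            result = "on_track"
        self.history.append((observed, result))
        return result


if __name__ == "__main__":
    tracker = StateTracker(expected=Screen("com.example.mail", "compose"))
    observations = [
        Screen("com.example.mail", "compose"),
        Screen("com.example.mail", "compose", keyboard_visible=True),
        Screen("com.example.mail", "compose", dialog="permission"),
        Screen("com.example.mail", "inbox"),
        Screen("com.example.files", "picker"),
    ]
    for obs in observations:
        print(obs.screen_id, "->", tracker.classify(obs))
```

The point of the sketch is only that "where am I?" becomes an explicit check after each step instead of something the agent assumes; how to recover from each label is a separate problem.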
For people who have worked on mobile automation / GUI agents / Android agents:
does this match your experience, or is another failure mode more painful?