confused mobile use agents
For mobile agents, the GUI is not just a visual layer.
In many apps, it is the only available control surface.
No clean API. No DOM. Just screen understanding and actions.
Tech Stack Required for a Solo Startup in 2026:
- Codex / Claude Code for logistics
- coremate's OpenGUI for distribution
- Stripe for payments
- PostHog for analytics
- Kit / Beehiiv for email subscriptions
- Vercel for hosting and deployment
- Supabase for database, backend, and authentication
Sidebar chats get a lot of criticism, but users are already used to them.
Right now, I see two common interaction patterns in agent products.
The first one is: conversation list on the left, code or documents on the right. Codex and Cursor’s Agent mode are good examples.
This is agent-first. The main user action is telling the AI agent what to do through chat. Manual editing is secondary, so the conversation becomes the center of the product. That’s also why Cursor built a separate Agent mode outside the traditional IDE flow, and why Codex Desktop does not even support direct file editing.
The second pattern is: the original software stays mostly the same, and the agent chat sits on the side. GitHub Copilot is a good example.
This is human-first. The user still mainly operates the software directly, and the agent is there to help with smaller edits, suggestions, or adjustments. So the sidebar makes sense because it adds AI without changing the core workflow too much.
Some products try to have both: they want an agent-first chat experience, but they also want to preserve the full traditional software UI. The result often feels messy.
Agent interaction design is still very early. There is a lot of room to explore. But I think one question has to be answered first:
Is your product centered around the agent, or is the agent just an assistant inside an existing product?
If you don’t answer that clearly, the rest of the interaction design becomes hard to get right.
Curious how others think about this.
Repo:
https://github.com/Core-Mate/open-gui
Curious what people think. Is this where agents are headed after browser automation, or is mobile UI too unreliable for long-running work?
I’m working on OpenGUI, an open-source Android GUI agent for controlling real Android devices.
The use case is not just “click this button.” I’m interested in longer mobile workflows where an agent has to keep observing, planning, acting, checking state, and recovering when the UI changes.
Examples:
- open X, search for AI news, inspect the top results, and return a structured summary
- open Reddit, search a topic, collect recent posts, and summarize them
- run repeated internal mobile workflows across multiple apps without writing one adapter per app
- trigger a phone task remotely through REST / Telegram / Feishu and get back structured results
The loop is roughly:
capture the Android screen
use a VLM to understand the current UI state
plan the next step
execute tap / swipe / type through Android AccessibilityService
re-check the screen
continue, retry, or recover if the UI changed
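Here is a minimal Python sketch of that loop, just to make the control flow concrete. It is not OpenGUI's actual code: `Action`, `run_task`, and the `capture` / `plan` / `execute` callables are hypothetical names. The "did the screen change" check is a plain byte comparison standing in for a real VLM-based re-check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    # Hypothetical action record, not an OpenGUI type.
    kind: str      # "tap", "swipe", "type", or "done"
    payload: dict  # e.g. {"x": 540, "y": 1200} or {"text": "AI news"}


def run_task(
    goal: str,
    capture: Callable[[], bytes],                        # returns a screenshot
    plan: Callable[[str, bytes, List[Action]], Action],  # VLM + planner stand-in
    execute: Callable[[Action], None],                   # device driver stand-in
    max_steps: int = 20,
    max_retries: int = 2,
) -> List[Action]:
    """Run observe -> plan -> act -> re-check; return the actions that took effect."""
    history: List[Action] = []
    retries = 0
    for _ in range(max_steps):
        before = capture()
        action = plan(goal, before, history)
        if action.kind == "done":
            return history
        execute(action)
        after = capture()
        if after == before:
            # Assumption: an unchanged screen means the action had no effect.
            retries += 1
            if retries > max_retries:
                raise RuntimeError("screen did not change after repeated attempts")
            continue  # re-observe and re-plan
        retries = 0
        history.append(action)
    raise TimeoutError("step budget exhausted before the goal was reached")
```

Passing the device and model in as callables keeps the loop testable without a phone attached. A fake `capture` that returns canned screenshots is enough to exercise the retry path.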
The hard part is long-horizon reliability. The model needs to understand mobile UI intent: search boxes, tabs, modals, feed cards, disabled buttons, ambiguous icons, loading states, and whether the previous action actually worked.
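One cheap first filter for "did the previous action actually work" is a noise-tolerant pixel diff between the screenshots taken before and after the action. The sketch below assumes both frames are flat lists of grayscale values (0-255) of equal length. The function names and both thresholds are illustrative guesses, not tuned values. It only tells you that something changed, not that the right thing changed, so the VLM still has to judge the new state.

```python
from typing import Sequence


def changed_fraction(before: Sequence[int], after: Sequence[int], tolerance: int = 8) -> float:
    """Fraction of pixels whose grayscale value moved by more than `tolerance`."""
    if len(before) != len(after):
        raise ValueError("frames must have the same number of pixels")
    if not before:
        return 0.0
    changed = sum(1 for a, b in zip(before, after) if abs(a - b) > tolerance)
    return changed / len(before)


def action_took_effect(before: Sequence[int], after: Sequence[int], min_fraction: float = 0.01) -> bool:
    """Treat the action as effective if at least `min_fraction` of the screen changed."""
    return changed_fraction(before, after) >= min_fraction
```

A blinking cursor or a spinner can push a frame over the threshold on its own, so in practice the check would probably need to wait for the screen to settle first.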
Repo for context:
https://github.com/Core-Mate/open-gui
For people running local multimodal models: what would you try first for this kind of mobile GUI task? Qwen-VL, InternVL, UI-TARS-style models, AgentCPM-GUI, or something else?
I’m especially interested in:
- mobile UI understanding
- multi-step task reliability
- grounding actions to screen coordinates/elements
- recovery after failed or ambiguous actions
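To make the grounding point concrete, here is a small sketch of the last step: turning a model's predicted element box into a tap point. It assumes the model returns a bounding box normalized to [0, 1] as (x0, y0, x1, y1). Real models differ in output format, so treat that as an assumption. `bbox_to_tap` is a hypothetical helper.

```python
from typing import Tuple


def bbox_to_tap(
    bbox: Tuple[float, float, float, float],  # assumed normalized (x0, y0, x1, y1)
    width: int,
    height: int,
) -> Tuple[int, int]:
    """Return the pixel at the center of the box, clamped to the screen bounds."""
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 * width
    cy = (y0 + y1) / 2 * height
    # Clamp so a slightly out-of-range prediction still lands on the screen.
    x = min(max(int(cx), 0), width - 1)
    y = min(max(int(cy), 0), height - 1)
    return x, y
```

Tapping the center of the box is the simplest choice. It can miss when the predicted box is loose or the element is partly covered, which is one reason the re-check step matters.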