Imagine you have a very smart, but slightly clumsy, digital assistant. You ask it to "Book a ticket to Beijing," and instead of just doing it, it gets confused, clicks the wrong button, or tries to type in a box that doesn't exist. This is the current state of many "GUI Agents" (computer programs that try to use apps and websites like humans do).
The UI-Venus-1.5 paper from Ant Group reports how they took that clumsy assistant, gave it a massive education, trained it in a high-tech gym, and turned it into a world-class personal shopper and travel agent.
Here is the story of how they did it, explained with simple analogies:
1. The Problem: The "Smart but Clueless" Assistant
Before this update, AI models were great at chatting but terrible at actually doing things on a screen. They knew what a "button" looked like in a textbook, but when faced with a real, messy app on a phone, they got lost. They were like a student who memorized the map of a city but has never actually walked the streets.
2. The Solution: The Three-Step Training Camp
The team didn't just tweak the model; they built a completely new training pipeline with three distinct phases. Think of it as a three-year university degree for a robot.
Phase 1: The "Mid-Training" Boot Camp (The Library)
- The Analogy: Imagine your assistant is a new employee. Before they can do their job, you don't just throw them into the office. You send them to a library for a month to read every manual, every user guide, and every screenshot of every app imaginable.
- What they did: They fed the model 10 billion tokens of data from over 30 different datasets. This wasn't just random chat; it was specific "GUI" (Graphical User Interface) data.
- The Result: The model stopped guessing what a "search bar" or a "submit button" was. It learned the language of screens. It went from "I think that's a button" to "I know exactly what that button does and where it lives."
Phase 2: The "Offline" Simulation (The Flight Simulator)
- The Analogy: Now that the assistant knows the theory, they need to practice without crashing a real plane. They put the assistant in a flight simulator. They run thousands of scenarios where the assistant tries to book a flight, but if they make a mistake, the simulator says, "No, that's wrong, try again," and gives them a score.
- What they did: They used Offline Reinforcement Learning. They took existing records of humans using apps and taught the AI to mimic those successful paths. They also taught the AI a crucial new skill: Refusal. If you ask the AI to click a button that doesn't exist, instead of hallucinating and clicking the wrong thing, it learns to say, "I can't find that button."
- The Result: The model became much more accurate at finding specific items on a screen (Grounding) and stopped making up fake buttons.
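The "Refusal" behavior can be sketched as a tiny grounding function: score every on-screen element against the request, and if nothing scores high enough, refuse instead of guessing. Everything here is an illustrative assumption (the element format, the word-overlap score, the 0.6 threshold), not the paper's actual method:

```python
# Hypothetical sketch of grounding-with-refusal.
# The element format, scoring rule, and threshold are illustrative, not from the paper.
def ground(query, ui_elements, min_score=0.6):
    """Return a click on the best-matching element, or refuse if nothing matches well."""
    def score(el):
        # Toy similarity: fraction of query words that appear in the element's label.
        words = query.lower().split()
        label = el["label"].lower()
        return sum(w in label for w in words) / len(words)

    best = max(ui_elements, key=score, default=None)
    if best is None or score(best) < min_score:
        # Refusal: admit the target is absent instead of clicking a wrong guess.
        return {"action": "refuse", "reason": f"no element matching '{query}'"}
    x, y, w, h = best["bbox"]
    return {"action": "click", "point": (x + w // 2, y + h // 2)}
```

The key design point is the explicit "refuse" branch: without it, the model is forced to pick *some* element, which is exactly the hallucinated-click failure mode described above.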
Phase 3: The "Online" Real-World Gym (The Live Fire Drill)
- The Analogy: Simulators are great, but real life is messy. The internet changes, apps update, and things break. So, they put the assistant in a "Live Fire Drill." They gave it thousands of real phones and computers in the cloud and told it: "Go try to do these tasks. If you fail, learn from it immediately."
- What they did: This is Online Reinforcement Learning. The model actually interacted with real apps, saw what happened, and adjusted its brain instantly. They built a massive "Device-as-a-Service" system (like a giant warehouse of thousands of virtual phones) to let the AI practice millions of times a day.
- The Result: The model learned to handle long, complex tasks (like "Find a vegetarian lasagna recipe under 600 calories") without getting lost halfway through. It learned to recover from mistakes, just like a human would.
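A minimal sketch of what such an online rollout loop might look like, assuming a simplified device fleet and a sparse success-or-failure reward. `DevicePool`, `rollout`, and the reward rule are illustrative stand-ins, not the paper's actual Device-as-a-Service system:

```python
import random

class DevicePool:
    """Stands in for a Device-as-a-Service fleet of virtual phones."""
    def __init__(self, n):
        self.devices = list(range(n))

    def acquire(self):
        # Real systems would schedule and reset devices; here we just pick one.
        return random.choice(self.devices)

def rollout(policy, device, task, max_steps=20):
    """Run one episode on a device: act, observe, repeat; score only the outcome."""
    trajectory = []
    observation = f"screen of device {device} for task {task!r}"
    for _ in range(max_steps):
        action = policy(observation)
        trajectory.append((observation, action))
        if action == "done":
            break
        observation = f"screen after {action}"
    # Sparse, outcome-based reward: 1.0 only if the episode ended with success.
    reward = 1.0 if trajectory and trajectory[-1][1] == "done" else 0.0
    return trajectory, reward
```

Online RL then repeats this at scale: collect many `(trajectory, reward)` pairs in parallel across the fleet, and update the policy to make high-reward trajectories more likely.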
3. The Magic Trick: The "Model Merge" (The All-Star Team)
Usually, to make a robot good at everything, you have to train three different robots: one for mobile phones, one for websites, and one for finding specific buttons. Then you have to switch between them.
- The Analogy: Imagine you have a soccer coach, a basketball coach, and a tennis coach. Instead of hiring three different people, the UI-Venus team took the best moves from all three coaches and blended them into one single super-coach.
- What they did: They trained three specialized models, then used a "Model Merging" technique (specifically called TIES-Merge) to combine them into one single brain.
- The Result: You now have one AI that can handle your phone, your laptop, and complex websites seamlessly. It's not just a "jack of all trades, master of none"; it's a master of all three.
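TIES-style merging follows three steps: trim each specialist's weight changes down to the largest-magnitude ones, elect a sign per parameter by majority mass, and average only the updates that agree with that sign. A minimal NumPy sketch, with flat arrays standing in for real model weights (the paper's exact merging configuration is not shown here):

```python
import numpy as np

def ties_merge(base, finetuned_models, density=0.2, lam=1.0):
    """TIES-style merge of several specialists fine-tuned from one shared base."""
    # Task vectors: how each specialist's weights drifted from the shared base.
    task_vectors = [m - base for m in finetuned_models]

    trimmed = []
    for tv in task_vectors:
        # Trim: keep only the top `density` fraction of updates by magnitude.
        k = max(1, int(density * tv.size))
        threshold = np.sort(np.abs(tv).ravel())[-k]
        trimmed.append(np.where(np.abs(tv) >= threshold, tv, 0.0))

    stacked = np.stack(trimmed)
    # Elect sign: per parameter, pick the sign with the larger total mass.
    elected = np.sign(stacked.sum(axis=0))
    # Disjoint merge: average only the updates that agree with the elected sign.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tv = (stacked * agree).sum(axis=0) / counts
    return base + lam * merged_tv
```

The sign-election step is what prevents the "three coaches shouting conflicting advice" problem: when specialists disagree on a parameter's direction, the minority updates are dropped rather than averaged into mush.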
4. The Results: A New Champion
The paper shows that this new "Super Assistant" (UI-Venus-1.5) is crushing the competition:
- On Mobile Apps: It successfully completed tasks on Android phones at a 77.6% success rate, beating all previous models.
- On Websites: It navigated complex websites with 76.0% accuracy.
- On Precision: It can find tiny icons on a screen with incredible accuracy, even in professional software like CAD tools.
Why Should You Care?
Think about the apps you use every day: booking a train ticket, buying groceries, or managing your bank account. Doing these on a phone can be tedious.
UI-Venus-1.5 is the first step toward an AI that doesn't just talk to you about doing these things, but actually does them for you. It's like having a personal assistant who can look at your screen, understand what you want, and click the right buttons to get it done, even if the app is in Chinese or the layout is confusing.
In short: They took a smart but clumsy robot, gave it a massive education, trained it in a real-world gym, and fused its skills into one super-brain. The result is an AI that can finally navigate our digital world as well as a human can.