Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Imagine you are walking through a busy digital city (your smartphone). In this city, there are two types of residents: Humans (you and me) and Robots (AI agents designed to do tasks for us).

For a long time, the city guards (apps like WeChat, Taobao, or banking apps) didn't care much about the robots. They just wanted to make sure the robots could get things done efficiently. But recently, the city guards realized something: The robots are too perfect.

They move in straight lines, tap buttons instantly, and never hesitate. To the guards, this looks suspicious. It's like seeing a person walk through a park with a ruler in their hand, moving in a perfectly straight line without ever looking at a flower or tripping over a rock. The guards started locking the doors and kicking the robots out, thinking they were hackers or spam bots.

This paper is about teaching the robots how to act more human so they can stay in the city without getting kicked out.

Here is the breakdown of the paper using simple analogies:

1. The Problem: The "Uncanny Valley" of Touch

The authors call this the "Turing Test on Screen."

The Old Turing Test: You chat with someone via text. If you can't tell if it's a human or a computer, the computer passes.
The New Screen Test: You watch someone touch a phone screen. If their finger movements look too robotic (too fast, too straight, too perfect), the system flags them as a bot.

The paper found that current AI agents move like metronomes: they hit every beat with perfect timing and travel in perfectly straight lines. Humans, on the other hand, are messy. We hesitate, our fingers trace curves, we hold a tap a little longer, and we sometimes tap the wrong spot before correcting it.

2. The Solution: "Humanization"

The researchers created an evaluation suite called the Agent Humanization Benchmark (AHB) and used it to test ways of making robots blend in. Think of those methods as lessons at a "Bot Acting School." There are four main tricks:

Trick 1: The Curved Path (B-Splines)
- The Robot: Moves in a perfectly straight line from Point A to Point B.
- The Human: Moves in a slight curve, like the arc of a smile.
- The Fix: The robot learns to bend its path slightly, adding "noise" so the swipe looks like a natural human hand movement.

Trick 2: The Borrowed Gesture (History Matching)
- The Robot: Calculates a brand-new, mathematically clean path for every swipe.
- The Human: Swipes with the small irregularities of a real hand.
- The Fix: The robot learns to copy real human movement patterns from a database. Instead of calculating a new path, it grabs a "human path" from its memory, rotates and scales it to fit the task, and uses that. It's like a robot actor memorizing a real person's walk.

Trick 3: The "Thinking" Tap (Long Presses)
- The Robot: Taps a button for almost no time at all (instantly).
- The Human: Holds the finger down for a split second (roughly 0.05 to 0.1 seconds), because a real fingertip takes time to press and lift.
- The Fix: The robot learns to hold its finger down for a realistic amount of time.

Trick 4: The "Distracted" Scroll (Fake Actions)
- The Robot: Waits motionless while it thinks, then goes straight to the goal.
- The Human: Sometimes scrolls up, realizes they went too far, scrolls back down, and then clicks.
- The Fix: The robot adds tiny, useless movements (like a tiny scroll or a hover) while it is "thinking" to make it look like it's exploring the screen, just like a human would.
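
For readers who like to see the idea in code, Trick 1 can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the function name `bspline_swipe`, the `wobble` parameter, and the choice of two sideways-nudged control points are all assumptions made for the example. It builds a uniform cubic B-spline whose endpoints are repeated so the curve starts and ends exactly where the straight swipe would.

```python
import random

def bspline_swipe(start, end, n_points=20, wobble=0.08, seed=None):
    """Return a gently curved swipe path from start to end (hypothetical helper)."""
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0

    # Two interior control points, pushed sideways (perpendicular to the swipe)
    # by a random fraction of the swipe length. `wobble` is an assumed knob.
    inner = []
    for t in (1 / 3, 2 / 3):
        off = rng.uniform(-wobble, wobble)
        inner.append((x0 + t * dx - off * dy, y0 + t * dy + off * dx))

    # Repeating each endpoint three times pins the curve to start and end.
    ctrl = [start] * 3 + inner + [end] * 3
    n_seg = len(ctrl) - 3

    path = []
    for i in range(n_points):
        u = i / (n_points - 1) * n_seg
        k = min(int(u), n_seg - 1)   # which spline segment we are in
        t = u - k                    # position inside that segment, 0..1
        # Uniform cubic B-spline basis weights (they always sum to 1).
        b = ((1 - t) ** 3 / 6,
             (3 * t ** 3 - 6 * t ** 2 + 4) / 6,
             (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6,
             t ** 3 / 6)
        p = ctrl[k:k + 4]
        path.append((sum(w * q[0] for w, q in zip(b, p)),
                     sum(w * q[1] for w, q in zip(b, p))))
    return path

# Example: an upward swipe on a phone screen, in pixels.
swipe = bspline_swipe((100, 800), (100, 200), n_points=25, seed=1)
```

Because a B-spline stays inside the hull of its control points, the sideways drift here can never exceed `wobble` times the swipe length, which keeps the curve subtle.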

3. The Catch: The "Efficiency vs. Safety" Trade-off

The paper discovered a tricky balance.

If the robot tries to act too human (adding too many fake pauses or random clicks), it might get confused and fail its actual job (like buying the wrong flight ticket).
If it acts too efficient, the guards catch it.

The researchers found that History Matching (copying real human data) produced the most convincing movement, and it worked best when paired with realistic tap durations. The "Fake Actions" trick was good at hiding the robot's timing, but sometimes it made the robot do silly things that broke the task.

4. Why This Matters

This isn't just about robots sneaking into apps. It's about User Agency.
Imagine you want to use a smart assistant to book a flight for your grandma. If the app thinks the assistant is a "bad bot" and locks your account, you lose your ability to use the tool.

The paper argues that for AI to truly live alongside us in our digital lives, it can't just be a powerful tool; it has to be a polite, natural neighbor. It needs to understand that sometimes, being "perfect" is actually suspicious.

Summary Analogy

Think of the digital world as a VIP Club.

The Bouncers (Apps) are looking for people who look like they belong.
The Robots used to walk in with stiff, military-style marching. The bouncers immediately stopped them.
This Paper teaches the robots how to walk in with a casual, slightly messy, human-like swagger. They learn to sway a little, pause to check their phone, and maybe bump into a chair.
The Result: The bouncers say, "Oh, just another human," and let them in.

The ultimate goal is a future where AI agents can do our chores without us having to worry that the digital world will reject them for being "too smart" and "too perfect."

1. Problem Statement

The rapid advancement of Large Multimodal Models (LMMs) has enabled autonomous Graphical User Interface (GUI) agents to perform complex tasks on mobile devices. However, a fundamental conflict exists between agent efficiency (optimizing for speed and bypassing ads) and platform interests (maximizing user engagement and ad revenue).

The Conflict: Digital platforms view autonomous agents as threats to their business models and security. Consequently, they deploy aggressive defenses, including login blocks, ad traps, and sophisticated detection mechanisms.
The Gap: Existing research focuses on utility (can the agent do the task?) and robustness (can the agent withstand perturbations?). It largely ignores the detectability of agents. Current agents exhibit "unnatural kinematics" (perfectly linear swipes, zero-latency taps, rigid timing) that make them trivially distinguishable from humans.
The Core Question: How can agents evolve to survive in human-centric ecosystems by mimicking human behavioral nuances without sacrificing task performance?

2. Methodology

The authors propose a formal framework and a comprehensive benchmark to address this issue.

A. Theoretical Formulation: "Turing Test on Screen"

The interaction is modeled as a Min-Max Adversarial Game between a Detector ($D_\Theta$) and a GUI Agent ($G_\Phi$):

Detector's Goal: Maximize classification accuracy to distinguish between human event streams ($H$) and agent event streams ($G_\Phi$).
Agent's Goal: Minimize the probability of detection while maintaining task utility ($R_{task}$).
Interaction Layers: The model decouples interaction into:
1. Logical Level: High-level commands (e.g., "tap," "swipe").
2. Event Level: Low-level hardware events generated by the command, including Motion Events (touch coordinates, pressure, velocity) and Sensor Events (gyroscope, accelerometer).
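
One common way to write such a game down is in the style of a generative adversarial objective. The form below is an illustrative assumption rather than the paper's exact equation: $D_\Theta(x)$ is taken to be the detector's estimated probability that an event stream $x$ is human, and $\lambda \ge 0$ is a hypothetical weight on task utility.

$$
\min_{\Phi}\;\max_{\Theta}\;\; \mathbb{E}_{x \sim H}\big[\log D_\Theta(x)\big] \;+\; \mathbb{E}_{x \sim G_\Phi}\big[\log\big(1 - D_\Theta(x)\big)\big] \;-\; \lambda\, R_{task}(G_\Phi)
$$

Under this reading, the detector raises the first two terms by classifying correctly, while the agent lowers the second term by producing streams the detector scores as human and lowers the third by keeping $R_{task}$ high.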

B. Data Collection

The authors constructed a high-fidelity dataset comprising:

Human Users: Data from four sub-populations (young men, young women, middle-aged users, and elderly users) across 21 diverse apps (Social Media, Shopping, Video, Travel, Office).
Agents: Data from state-of-the-art models including UI-TARS, MobileAgent-E (GPT-4o/Claude-3.5), AgentCPM, and AutoGLM.
Features: 24 statistical features were extracted, covering Kinematics (velocity, acceleration), Geometry (path efficiency, curvature), and Temporal Dynamics (duration, latency).
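
As a rough illustration of what such features look like, the sketch below computes four of them for a single swipe. The feature names `maxDev` and `ratio_end_to_len` appear in the paper's feature analysis, but the definitions used here (maximum perpendicular distance from the start-to-end chord, and chord length divided by path length) are assumptions, as are the function name and the input format.

```python
import math

def swipe_features(points):
    """Compute a few per-swipe statistics.

    points: list of (t_seconds, x, y) samples in time order (assumed format).
    """
    (t0, x0, y0), (t1, x1, y1) = points[0], points[-1]

    # Total distance travelled along the sampled path.
    path_len = sum(math.hypot(b[1] - a[1], b[2] - a[2])
                   for a, b in zip(points, points[1:]))
    # Straight-line distance between the first and last sample.
    chord = math.hypot(x1 - x0, y1 - y0)

    # Largest perpendicular distance of any sample from the chord.
    if chord > 0:
        max_dev = max(abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / chord
                      for _, x, y in points)
    else:
        max_dev = 0.0

    duration = t1 - t0
    return {
        "duration": duration,                                         # temporal
        "mean_velocity": path_len / duration if duration > 0 else 0.0,  # kinematic
        "ratio_end_to_len": chord / path_len if path_len > 0 else 1.0,  # geometric
        "maxDev": max_dev,                                            # geometric
    }

# A perfectly straight swipe versus one that bows 30 px to the side.
straight = swipe_features([(0.0, 0, 0), (0.1, 0, 50), (0.2, 0, 100)])
curved = swipe_features([(0.0, 0, 0), (0.1, 30, 50), (0.2, 0, 100)])
```

For the straight swipe, `ratio_end_to_len` is exactly 1 and `maxDev` is 0, which is the kind of signature the paper reports for raw agents.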

C. Humanization Strategies

The paper proposes an External Wrapper approach (post-processing) to transform raw agent actions into humanized sequences. Four strategies were tested:

Heuristic Noise Injection (B-Spline): Replaces linear swipe paths with B-spline curves to simulate natural motor noise.
Data-Driven History Matching: Retrieves real human trajectories from the dataset that match the task vector (direction/distance) and applies affine transformations (rotation/scaling) to align them with the current task.
Fake Actions: Injects micro-interactions (e.g., slight scrolls or hovers) during inference latency periods to break the long-tail distribution of action intervals.
Longer Presses: Samples tap durations from a Gaussian distribution fitted to human data to avoid "instantaneous" agent taps.
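
The History Matching and Longer Presses strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' wrapper: the function names, the nearest-displacement retrieval rule, and the Gaussian parameters (`mean`, `std`, and the clipping bounds) are all hypothetical. Complex numbers are used because multiplying by one complex number performs a rotation and a uniform scaling in a single step.

```python
import random

def retrieve(library, start, end):
    """Pick the recorded path whose start-to-end displacement best matches the task."""
    target = complex(*end) - complex(*start)
    return min(library,
               key=lambda p: abs((complex(*p[-1]) - complex(*p[0])) - target))

def align_trajectory(human_path, start, end):
    """Rotate, scale, and translate a recorded path so it runs from start to end."""
    h0, h1 = complex(*human_path[0]), complex(*human_path[-1])
    s, e = complex(*start), complex(*end)
    if h1 == h0:
        raise ValueError("recorded path has zero displacement")
    m = (e - s) / (h1 - h0)  # one complex factor = rotation + uniform scale
    out = []
    for x, y in human_path:
        z = s + (complex(x, y) - h0) * m
        out.append((z.real, z.imag))
    return out

def sample_tap_duration(mean=0.075, std=0.015, lo=0.03, hi=0.2, rng=random):
    """Draw a tap duration in seconds from a clipped Gaussian (assumed parameters)."""
    return min(max(rng.gauss(mean, std), lo), hi)

# Two toy "recorded" swipes: one mostly vertical, one mostly horizontal.
library = [[(0, 0), (5, 40), (0, 100)], [(0, 0), (40, 5), (100, 0)]]
chosen = retrieve(library, (200, 300), (200, 900))
humanized = align_trajectory(chosen, (200, 300), (200, 900))
hold = sample_tap_duration()
```

In practice the Gaussian parameters would be fitted to the collected human data rather than hard-coded, and the retrieval step could also weigh device, app, or user profile.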

3. Key Contributions

Conceptual Framework: First to extend the Turing Test to GUI agents, defining "Turing Test on Screen" as a behavioral biometric challenge rather than a linguistic one.
Agent Humanization Benchmark (AHB): A new benchmark evaluating the trade-off between Imitability (how hard is it to detect?) and Utility (does the task still succeed?).
High-Fidelity Dataset: A large-scale dataset of motion and sensor events from both humans and diverse LMM-based agents, enabling granular analysis of behavioral signatures.
Theoretical and Empirical Evidence:
- Proved theoretically (via Jensen-Shannon divergence) that variance injection and history matching reduce detectability.
- Demonstrated empirically that History Matching is superior to heuristic smoothing.
- Identified that Fake Actions effectively neutralize interval-based detection but risk task utility if not context-aware.
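
The Jensen-Shannon argument can be made concrete with a small calculation. The sketch below computes the divergence between binned feature distributions; a value of 0 means the two distributions are identical and a value of 1 (in base 2) means they do not overlap at all. The three histograms are invented for illustration and are not values from the paper.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0 here.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of the two
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical tap-duration histograms over three bins (short, medium, long).
human = [0.10, 0.60, 0.30]
raw_agent = [0.90, 0.10, 0.00]
humanized_agent = [0.20, 0.55, 0.25]

before = js_divergence(human, raw_agent)
after = js_divergence(human, humanized_agent)
```

With these made-up numbers, `after` is smaller than `before`: the humanized distribution sits closer to the human one, which leaves a detector less signal to work with.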

4. Experimental Results

Baseline Vulnerability

Raw Agents: Vanilla LMM agents are highly detectable. Machine learning classifiers (SVM, XGBoost) achieved near-perfect accuracy (>98%) in distinguishing raw agents from humans.
Key Differentiators: Raw agents failed due to:
- Linearity: Swipe paths were perfectly straight (low deviation).
- Timing: Action intervals were clustered around inference latency (5–80s), unlike the long-tailed human distribution.
- Tap Duration: Agent tap durations spiked near zero, whereas human tap durations followed a roughly Gaussian distribution (0.05s–0.10s).
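
To see why these signatures are so easy to catch, consider a toy rule-based check. It is far simpler than the SVM and XGBoost classifiers used in the paper, and the function name and both thresholds (`min_dev_px`, `min_tap_s`) are assumptions chosen only for illustration.

```python
def scripted_signals(max_dev_px, tap_duration_s, min_dev_px=1.0, min_tap_s=0.02):
    """Return the reasons a session looks machine-generated (empty list if none).

    max_dev_px: largest sideways deviation of a swipe from a straight line.
    tap_duration_s: how long a tap was held down.
    """
    reasons = []
    if max_dev_px < min_dev_px:
        reasons.append("swipe is almost perfectly straight")
    if tap_duration_s < min_tap_s:
        reasons.append("tap duration is near zero")
    return reasons

raw_agent_flags = scripted_signals(max_dev_px=0.0, tap_duration_s=0.001)
human_flags = scripted_signals(max_dev_px=12.0, tap_duration_s=0.08)
```

Two hand-written thresholds already separate the example raw agent from the example human, which helps explain why learned classifiers reach near-perfect accuracy on unmodified agents.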

Humanization Performance

Imitability:
- History Matching outperformed B-Spline noise. In the "Social Media" cluster, it reduced XGBoost detection accuracy from 0.9817 to 0.9450 in the offline setting and to 0.8750 in the online setting.
- Fake Actions successfully reduced interval-based detection accuracy to near-random guessing (~0.52), effectively bypassing time-based rules.
Utility Trade-off:
- The Pareto Frontier: There is a critical trade-off. While Fake Actions improved imitability, they caused a drastic drop in task success rates for complex tasks (e.g., Trip Planning accuracy dropped from 0.75 to 0.15) because blind noise injection disrupted task flow.
- Optimal Strategy: Combining History Matching (for path) with Longer Presses (for tap duration) offered the best balance, maintaining high utility while significantly lowering detection scores.
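
The gap-filling idea behind Fake Actions can be sketched as a small scheduler that decides when to inject filler gestures while the model is still computing. This is an assumption-laden illustration: the function name, the exponential gap distribution, and the `mean_gap_s` and `min_gap_s` defaults are hypothetical, and the sketch says nothing about which filler gesture is safe to perform.

```python
import random

def schedule_fillers(latency_s, mean_gap_s=1.5, min_gap_s=0.3, seed=None):
    """Return timestamps (seconds from the start of the wait) for filler actions.

    Gaps are drawn from an exponential distribution with the given mean and
    floored at min_gap_s, so the resulting inter-action intervals stay short.
    """
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += max(min_gap_s, rng.expovariate(1.0 / mean_gap_s))
        if t >= latency_s:
            return times  # stop once the real action is ready
        times.append(t)

# Fill a 10-second inference wait with short, irregular pauses.
filler_times = schedule_fillers(10.0, seed=0)
```

As the results above warn, timing is only half the problem. Each filler must be a gesture that cannot change the app state the agent depends on, otherwise the task-success drop reported for blind injection is the likely outcome.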

Feature Analysis

Easiest to Humanize: Path shape metrics (e.g., maxDev, ratio_end_to_len). These can be easily synthesized to mimic human imperfection.
Hardest to Humanize: Endpoint Precision and Action Intervals.
- Precision: Agents must hit UI centers; adding noise risks missing the target.
- Intervals: Simulating human "thinking time" without breaking the task logic is difficult.

5. Significance and Future Directions

Paradigm Shift: The paper argues that the future of AI agents is not just about doing tasks, but how they do them. "Humanization" is a functional necessity for survival in adversarial digital ecosystems, not just an aesthetic feature.
Defense Evolution: As agents improve at mimicking kinematics, detection will likely shift from the Execution Layer (analyzing movement) to the Intent Layer (analyzing cognitive patterns like curiosity, distraction, or indecision).
Ethical Implications: The authors position this work as "Red Teaming." By exposing the vulnerabilities of current agents and providing detection baselines, they aim to help platforms develop more nuanced authentication that distinguishes between malicious bots and legitimate AI assistants, thereby protecting User Agency.
Future Work: The authors suggest moving from post-processing wrappers to End-to-End Humanization (training models to generate human-like trajectories natively) and Personalized Humanization (mimicking specific user profiles).

In conclusion, "Turing Test on Screen" establishes that for autonomous agents to coexist with digital platforms, they must evolve from efficient machines into behavioral mimics, balancing the dual objectives of Imitability and Utility.