Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies

This paper introduces Multimodal Large Language Model-assisted Evolutionary Search (MLES), a framework that combines multimodal LLMs with evolutionary search and visual feedback to automatically generate programmatic control policies that are transparent, verifiable, and human-aligned, while matching the performance of deep reinforcement learning methods such as PPO.

Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang

Published Wed, 11 Ma

Imagine you are trying to teach a robot to drive a car or land a spaceship.

The Old Way (Deep Reinforcement Learning):
Traditionally, we teach robots by letting them crash thousands of times while we tweak invisible knobs in a giant, complex math brain (a neural network). The robot eventually learns to drive, but it's a "Black Box." You can't ask it why it turned left; it just "feels" like the right move based on billions of numbers. If it crashes, debugging it is like trying to fix a watch by guessing which gear is broken without being able to see inside. It works, but it's scary and hard to trust.

The New Way (MLES):
This paper introduces MLES (Multimodal LLM-assisted Evolutionary Search). Think of this not as training a black box, but as hiring a team of expert engineers and a super-smart AI coach to write the robot's instruction manual together.

Here is how it works, using a simple analogy:

1. The Team: The "Evolutionary" Engineers

Instead of one big brain, MLES uses a population of candidate policies. Imagine a room full of 16 different engineers, each trying to write a Python script (a set of clear, human-readable instructions) for the robot.

  • The Goal: Find the best script that makes the robot drive perfectly.
  • The Process: It's like survival of the fittest. The engineers write their scripts, the robot tries them out, and the ones that do the worst get fired. The best ones get to "reproduce" and write new, improved versions.
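The survival-of-the-fittest loop above can be sketched in a few lines of Python. This is a minimal toy sketch, not the paper's implementation: `evaluate_policy` and `mutate` stand in for running a candidate script in the simulator and having the LLM rewrite it, and the numeric "policies" stand in for real scripts.

```python
import random

def evaluate_policy(policy):
    # Toy fitness: in MLES this would run the candidate script in the
    # environment (e.g. a driving simulator) and return its score.
    return sum(policy) / len(policy)

def mutate(policy):
    # Toy "rewrite": in MLES the LLM rewrites the script; here we just
    # perturb one parameter so the loop is runnable.
    child = policy[:]
    i = random.randrange(len(child))
    child[i] += random.uniform(-0.1, 0.1)
    return child

def evolutionary_search(pop_size=16, generations=10, n_params=4):
    # Each "engineer" starts with a random candidate policy.
    population = [[random.random() for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=evaluate_policy, reverse=True)
        survivors = ranked[: pop_size // 2]        # the worst get "fired"
        children = [mutate(p) for p in survivors]  # the best "reproduce"
        population = survivors + children
    return max(population, key=evaluate_policy)

best = evolutionary_search()
```

The key design point is that selection pressure keeps the best candidates in the population every generation, so fitness never decreases.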

2. The Coach: The Multimodal AI (The "Eye" and "Brain")

This is where the magic happens. In the old days, the coach only looked at the score (e.g., "You crashed, score: 0").
In MLES, the coach is a Multimodal Large Language Model (MLLM). This is an AI that can see and think.

  • It watches the video: Instead of just seeing a low score, the AI coach watches a video of the robot crashing.
  • It diagnoses the problem: It says, "Ah, I see what happened! The robot turned too sharply at high speed and spun out. The script told it to turn too hard."
  • It gives specific advice: It tells the engineer, "Don't just change the numbers randomly. Look at the video. You need to tell the robot to slow down before the turn."
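That diagnose-then-revise cycle can be sketched as follows. Both `mllm_diagnose` and `revise_policy` are hypothetical stubs invented for illustration; a real system would send the rollout video and score to a multimodal model API and ask it to rewrite the script.

```python
def mllm_diagnose(video_frames, score):
    # Stub for the multimodal model call: a real system would send the
    # rollout video plus the score to an MLLM and return its diagnosis.
    if score < 100:
        return "Turned too sharply at high speed; reduce steering gain before turns."
    return "Policy looks stable; refine fuel usage instead."

def revise_policy(policy_code, diagnosis):
    # Stub: a real system would ask the MLLM to rewrite the script
    # guided by the visual diagnosis, rather than tweaking numbers blindly.
    return policy_code + f"\n# revision hint: {diagnosis}"

rollout_video, score = [], 40              # placeholder rollout artifacts
advice = mllm_diagnose(rollout_video, score)
new_policy = revise_policy("def act(obs): ...", advice)
```

The point of the structure is that the revision step is conditioned on a visual diagnosis, not just a scalar score.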

3. The "Thought" Process

Every time an engineer writes a new script, they also write a "Thought" (a short note explaining their logic).

  • Example: "I'm slowing down before the curve because the video showed the car spinning out."

This makes the whole process transparent. You can read the code and the notes and understand exactly why the robot makes a decision.
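One way to picture a candidate is as a (thought, code) pair, where the code is an executable script. This is an illustrative sketch, not the paper's actual representation; the observation keys are made up for the example.

```python
candidate = {
    "thought": ("I'm slowing down before the curve because the video "
                "showed the car spinning out."),
    "code": (
        "def act(obs):\n"
        "    # slow down when the track curves sharply\n"
        "    throttle = 0.2 if obs['curvature'] > 0.5 else 0.8\n"
        "    return {'throttle': throttle, 'steer': -obs['curvature']}\n"
    ),
}

# The code string is executable, so the candidate can be run directly:
namespace = {}
exec(candidate["code"], namespace)
action = namespace["act"]({"speed": 30.0, "curvature": 0.7})
```

Because the thought and the code travel together, anyone auditing the policy can check whether the stated reasoning matches what the code actually does.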

The Creative Analogy: The Driving School

Imagine a driving school for robots:

  • Old Method: You put the robot in a car, let it drive until it crashes, and then you tweak the car's internal wiring (which you can't see) to make it slightly less likely to crash next time. You never know why it crashed, you just know it crashed less often.
  • MLES Method: You have a fleet of student drivers (the scripts). A super-smart instructor (the Multimodal AI) watches their driving videos.
    • The instructor sees a student swerving.
    • The instructor says, "You're turning the wheel too fast when you're going 60 mph. Here is a new rule: 'If speed > 50, turn wheel slowly.'"
    • The student writes this new rule down in plain English.
    • The next day, the student tries again. They do better.
    • The instructor picks the best students, mixes their best rules, and creates a new, even better student.
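The instructor's rule translates naturally into readable code. Here is a hypothetical example of the kind of rule such a script might contain (the function name, threshold, and gains are made up for illustration):

```python
def steer(speed, target_angle, current_angle):
    # "If speed > 50, turn wheel slowly" -- the instructor's rule,
    # written as an explicit, auditable branch.
    error = target_angle - current_angle
    if speed > 50:
        return 0.1 * error   # high speed: turn the wheel slowly
    return 0.5 * error       # low speed: turn more assertively

# At 60 mph the steering response is gentler than at 40 mph.
gentle = steer(60, 1.0, 0.0)
sharp = steer(40, 1.0, 0.0)
```

Unlike a neural network weight, the `if speed > 50` threshold can be read, questioned, and changed by a human in one line.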

Why is this a Big Deal?

  1. Transparency: The final result isn't a mysterious black box; it's readable code with comments. A human can look at it and say, "Yes, that makes sense."
  2. Trust: Because we can see the logic, we can verify it's safe before letting it drive a real car or land a real spaceship.
  3. Efficiency: By using the AI coach to look at videos of failures (not just scores), the system learns much faster. It doesn't just guess; it diagnoses and fixes specific mistakes.

The Results

The paper tested this on two tasks:

  • Lunar Lander: Landing a spaceship.
  • Car Racing: Driving a car on a track.
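To make "readable policy" concrete, here is an illustrative heuristic for a Lunar Lander-style task, assuming Gymnasium's 8-dimensional observation and discrete action encoding (0: do nothing, 1/3: orientation engines, 2: main engine). The thresholds and tilt-correction choices are invented for illustration; this is not the paper's evolved script.

```python
def act(obs):
    # obs = [x, y, vx, vy, angle, angular_velocity, leg1_contact, leg2_contact]
    x, y, vx, vy, angle, ang_vel, leg1, leg2 = obs
    if leg1 or leg2:
        return 0          # touched down: cut the engines
    if vy < -0.5:
        return 2          # falling too fast: fire the main engine
    if angle > 0.1:
        return 1          # tilted one way: fire an orientation engine
    if angle < -0.1:
        return 3          # tilted the other way: fire the opposite engine
    return 0              # otherwise, coast
```

Every branch is inspectable: a safety reviewer can read exactly when the main engine fires, which is the transparency argument the paper makes.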

The MLES method performed just as well as the best traditional methods (which use "black box" neural networks), but did so by producing clear, understandable instructions that humans can read and verify.

In short: MLES turns the scary, opaque process of teaching robots into a transparent, collaborative workshop where AI and humans work together to write clear, safe, and effective rules for the future.