RECODE: Reasoning Through Code Generation for Visual Question Answering

The paper introduces RECODE, an agentic framework that enhances visual question answering by reverse-engineering structured visuals into executable code through iterative generation and selection. This transforms ambiguous perceptual tasks into verifiable symbolic reasoning problems, and the approach is reported to significantly outperform existing methods.

Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

Published Wed, 11 Ma

Imagine you are trying to solve a tricky math problem, but instead of numbers on a page, you are looking at a complex, colorful chart or a geometric diagram.

The Problem: The "Guesswork" Artist
Current AI models (the "artists" of the digital world) are great at looking at a picture and saying, "That looks like a bar going up!" But when you ask them, "Exactly how much did it go up, and what does that mean for the total?" they often stumble. They are like someone trying to guess the weight of a watermelon just by looking at it. They might get close, but they can't be sure because they are just guessing based on how the pixels look. They lack a way to double-check their work.

The Solution: The "Reverse-Engineer" Detective
The paper introduces a new system called RECODE. Think of RECODE not as an artist, but as a detective who speaks the language of computers.

Instead of just staring at the chart and guessing, RECODE does something clever called "derendering." It's like taking a finished cake and trying to figure out the exact recipe used to bake it.

  1. The Drafting Phase: RECODE looks at the image and writes a piece of computer code (a recipe) that, if run, would draw that exact same image from scratch. It tries this a few times, creating different "draft recipes."
  2. The Taste Test: It then runs these recipes to see if they actually produce the image. If the code draws a bar that is too short, the system knows, "Oops, that recipe is wrong."
  3. The Fix: It acts like a strict editor, picking the best recipe and tweaking it until the computer-generated image matches the original as closely as possible.
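
The three steps above can be sketched as a small loop. This is only a toy illustration, not the paper's implementation: in RECODE an AI model writes the draft programs, whereas here each "recipe" is reduced to a list of bar heights, the drafts are hard-coded, and the pixel-difference score is an assumed stand-in for whatever comparison the authors use. The names render_bars, pixel_distance, select_best, and refine are all hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # toy raster: 1 = filled pixel, 0 = empty (assumption)


def render_bars(heights: Sequence[int], canvas_height: int = 10) -> Image:
    """'Run the recipe': draw one pixel column per bar, row 0 at the top."""
    return [
        [1 if canvas_height - row <= h else 0 for h in heights]
        for row in range(canvas_height)
    ]


def pixel_distance(a: Image, b: Image) -> int:
    """'Taste test': count pixels where two renderings disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def select_best(target: Image, candidates: List[List[int]]) -> List[int]:
    """Pick the draft whose rendering is closest to the target image."""
    return min(candidates, key=lambda c: pixel_distance(render_bars(c), target))


def refine(target: Image, candidate: List[int], rounds: int = 20) -> Tuple[List[int], int]:
    """'The fix': nudge each bar by one unit while the error keeps dropping."""
    best = list(candidate)
    best_err = pixel_distance(render_bars(best), target)
    for _ in range(rounds):
        improved = False
        for i in range(len(best)):
            for delta in (-1, 1):
                trial = best.copy()
                trial[i] += delta
                err = pixel_distance(render_bars(trial), target)
                if err < best_err:
                    best, best_err, improved = trial, err, True
        if not improved:
            break  # no single nudge helps any more
    return best, best_err


# Hypothetical example: the "original image" and three hard-coded drafts.
target_image = render_bars([3, 7, 5])
drafts = [[1, 1, 1], [3, 6, 6], [9, 9, 9]]

best_draft = select_best(target_image, drafts)
final_recipe, final_error = refine(target_image, best_draft)
print(best_draft, final_recipe, final_error)
```

The point of the sketch is the feedback signal: because every draft can be rendered and compared with the original, a wrong recipe shows up as a measurable mismatch instead of an unverifiable guess.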

Why This Changes Everything
Once RECODE has the perfect "recipe" (the code), the magic happens.

  • No More Guessing: Instead of guessing the height of a bar, the AI can read the exact number straight from the code, where it is written into the instructions and has already been checked against the original image.
  • Superpowers: Now that the AI has the data in a clean, logical format (code), it can do complex math, find hidden patterns, and solve geometry problems with the precision of a calculator, rather than the uncertainty of a human squinting at a screen.
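
To make the two points above concrete, here is a minimal sketch of answering a question from a recovered recipe instead of from pixels. Everything in it is assumed for illustration: the recovered_code string, its category names and values, and the helper extract_literals are hypothetical and do not come from the paper. The helper reads the numbers with Python's standard ast module, so the plotting line is never executed.

```python
import ast

# Hypothetical recipe that a derendering step might have produced for a bar chart.
recovered_code = '''
categories = ["Q1", "Q2", "Q3"]
values = [120, 150, 180]
plt.bar(categories, values)
'''


def extract_literals(source: str) -> dict:
    """Collect top-level `name = <literal>` assignments without running the code."""
    found = {}
    for node in ast.parse(source).body:
        is_simple_assign = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if is_simple_assign:
            try:
                found[node.targets[0].id] = ast.literal_eval(node.value)
            except ValueError:
                pass  # right-hand side is not a plain literal; skip it
    return found


data = extract_literals(recovered_code)
by_category = dict(zip(data["categories"], data["values"]))

# "Exactly how much did it go up, and what does that mean for the total?"
increase = by_category["Q3"] - by_category["Q1"]
total = sum(data["values"])
print(increase, total)
```

Once the values live in code, the arithmetic is ordinary and exact. Any remaining uncertainty sits in whether the recipe matches the image, which is what the earlier render-and-compare step is meant to check.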

The Bottom Line
Think of it this way: Old AI models were like a tourist trying to navigate a city by looking at a blurry photo. RECODE is like giving that tourist a GPS and a map. It translates the confusing visual world into a clear, step-by-step set of instructions that can be checked, verified, and trusted. This makes the AI much smarter at solving problems involving charts, graphs, and diagrams.