The Big Problem: The "Text-Only" Brain
Imagine you are trying to teach a robot how to understand the world, but you only give it a library of books. The robot reads millions of stories about "buttering toast." It learns that people usually use a knife to spread butter.
But then, you ask the robot: "How do you butter toast?"
Because the robot has only read text, it might suggest a weird answer like, "Dip the toast into a tub of butter." Why? Because in some stories, people dip things. The robot doesn't know that butter is solid at room temperature, and you can't dip toast into a tub of it without making a mess.
This is called Reporting Bias. Textbooks and news articles tend to describe the "most common" way things happen, often skipping the weird, physical, or sensory details that humans know instinctively. Humans know butter is solid because they can feel it and see it. The robot, having only read about it, is blind to the texture.
The Solution: "Machine Imagination"
The researchers at Korea University came up with a clever fix. They call their new system Imagine.
Think of Imagine as giving the robot a pair of glasses and a magic sketchbook.
- The Glasses: When the robot reads a question, it doesn't just think about the words. It immediately "imagines" a picture of the scene.
- The Sketchbook: Instead of just guessing, the robot actually generates an image of the scenario using an AI artist (like DALL-E 3).
So, when asked about buttering toast, the robot generates an image. It sees the solid block of butter and the knife. It realizes, "Oh! I can't dip this toast. I need a knife!"
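The imagine-then-answer loop above can be sketched in a few lines. This is a toy illustration, not the paper's actual code: `generate_image` and `answer_with_image` are hypothetical stand-ins for a real text-to-image model (like DALL-E 3) and a real vision-language model.

```python
# Minimal sketch of the "glasses + sketchbook" loop.
# Both helpers are stubs standing in for real models.

def generate_image(question: str) -> str:
    # Stand-in: a real system would call a text-to-image model here.
    return f"<imagined scene for: {question}>"

def answer_with_image(question: str, image: str) -> str:
    # Stand-in: a real system would feed text AND image to a
    # vision-language model and return its answer.
    return f"answer grounded in {image}"

def imagine_and_answer(question: str) -> str:
    image = generate_image(question)           # the "sketchbook" step
    return answer_with_image(question, image)  # reason over text + image

print(imagine_and_answer("How do you butter toast?"))
```

The point is the shape of the pipeline: every question first becomes a picture, and the answer is produced from both inputs together.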
How They Built the "Gym" for the Robot
You can't just give a robot a sketchbook and expect it to be smart immediately. It needs to practice.
The researchers built a massive training gym called Synthetic VQA+.
- The Workout: They took thousands of common sense questions (like "Why do people stop caring about their problems?") and paired them with images.
- The Twist: Some images were real photos, but many were AI-generated based on the question.
- The Filter: They were very strict. If the AI generated a picture that looked silly or didn't match the question (like a floating toaster), they threw it away. They only kept the "plausible" images.
This training taught the robot that Text + Image = Better Answer. It learned to look at the picture to double-check its text-based logic.
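The strict filter step can be pictured as a similarity gate: score how well each image matches its question, and keep only pairs above a threshold. This is a hedged sketch, not the researchers' implementation: `clip_similarity` is a hypothetical stand-in for a real text-image scorer (such as CLIP), and the 0.6 threshold is illustrative.

```python
# Sketch of the plausibility filter for synthetic question-image pairs.
# clip_similarity is a stub; a real version would embed the question and
# the image in a shared space and take their cosine similarity.

def clip_similarity(question: str, image: str) -> float:
    # Toy scorer: pretend images tagged "match" fit the question well.
    return 0.9 if "match" in image else 0.2

def filter_plausible(pairs, threshold=0.6):
    # Keep only pairs whose text-image similarity clears the threshold.
    return [(q, img) for q, img in pairs
            if clip_similarity(q, img) >= threshold]

pairs = [
    ("Why do people stop caring about their problems?",
     "match: person shrugging at a pile of bills"),
    ("How do you butter toast?", "a floating toaster"),
]
kept = filter_plausible(pairs)
# The floating toaster gets thrown away; the plausible pair survives.
```

Only the question-image pairs that pass this gate make it into the training gym, so the model never practices on silly pictures.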
The Results: A Small Brain Beats a Giant
Here is the most surprising part. The researchers tested this system against the biggest, most famous AI models in the world (like GPT-4, which is huge and expensive).
- The Giant: GPT-4 is like a massive encyclopedia with billions of pages.
- Imagine: This model is much smaller (less than 1 billion parameters), but it has the "magic sketchbook."
The Result: Imagine beat the giants.
Why? Because the giants were still relying too much on text patterns. Imagine was using visual clues. It was like a detective who reads the report and also looks at the crime scene photos, while the giant detective only reads the police report.
Two Ways to "Imagine"
The paper also tested two ways to use this power:
- The Artist (Generation): The robot draws a new picture for every single question. This is very accurate but slow (like hiring an artist for every question).
- The Librarian (Retrieval): The robot has a huge library of pre-drawn pictures. When it gets a question, it quickly finds the picture that looks most similar. This is much faster and almost as smart.
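The "librarian" mode is a nearest-neighbor lookup: embed the question, then pick the pre-drawn picture whose embedding is most similar. Here is a self-contained sketch under toy assumptions: the 2-D vectors and the tiny `library` stand in for a real shared text-image embedding space with thousands of images.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A pretend library of pre-drawn images with precomputed embeddings.
library = {
    "knife_and_butter.jpg": [0.9, 0.1],
    "swimming_pool.jpg": [0.1, 0.9],
}

def retrieve(question_embedding):
    # Return the library image most similar to the question.
    return max(library, key=lambda name: cosine(question_embedding, library[name]))

# A question about buttering toast embeds near the butter picture.
print(retrieve([0.8, 0.2]))  # → knife_and_butter.jpg
```

This is why retrieval is so much faster than generation: a similarity search over stored vectors takes milliseconds, while drawing a fresh image takes seconds, and, as the paper reports, the answers are almost as good.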
The Takeaway
This paper proves that to make AI truly "smart" at common sense, we can't just feed it more text. We have to teach it to visualize.
By giving AI the ability to "imagine" what it's reading, we help it understand the physical world—the texture of butter, the weight of a rock, the space in a room—things that text alone often misses. It's a small step toward giving machines a human-like intuition.