Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

This paper evaluates adjective-noun compositionality in large language models using both functional and representational approaches, revealing a significant divergence where models successfully develop compositional internal representations but fail to consistently translate them into functional task success.

Ruchira Dhar, Qiwei Peng, Anders Søgaard

Published Thu, 12 Ma

Imagine you are trying to figure out if a new, super-smart robot chef truly understands cooking, or if it's just a master of mimicry.

This paper is about testing Large Language Models (LLMs)—the AI brains behind tools like ChatGPT—to see if they really understand how words combine to create new meanings. Specifically, the authors looked at Adjective-Noun combinations (like "red car" or "alleged thief").

Here is the story of their investigation, explained with some kitchen analogies.

The Big Question: Does the Robot Know or Just Guess?

In human language, we are "compositional." This means we can take simple parts (like "red" and "car") and combine them to understand a new thing ("red car") without having memorized that specific phrase before. We know that a "red car" is still a car.

But does the AI do this? Or is it just a giant autocomplete machine that guesses the next word based on patterns it saw in its training data?

The researchers decided to test the AI using two different "flashlights" to see what was really going on inside the robot's brain.

Flashlight 1: The "Performance Test" (Functional View)

The Analogy: Imagine asking the robot chef to cook a dish. You give it a recipe: "Make a 'red car'." If it serves you a red car, you say, "Good job!" If it serves you a blue truck, you say, "Fail."

This is what the researchers call Functional Evaluation. They asked the AI to solve logic puzzles:

  • Substitutivity: If "The runner set a record" is true, is "The runner set a new record" also true? (Yes, because "new" just adds detail).
  • Systematicity: If a "red car" is a "car," and a "car" is a "vehicle," is a "red car" a "red vehicle"?
  • Overgeneralization: If a "trench coat" is a type of coat, is a "turncoat" (a traitor) also a type of coat? (No! The AI should know the difference).

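The functional tests above can be sketched as yes/no entailment questions put to a model. Here is a minimal, hypothetical harness: `ask_model` is a stand-in for a real LLM call (the paper's actual prompts and scoring are more involved), and the stub deliberately answers "yes" to everything to show how an overgeneralizing model fails the third check.

```python
def make_prompt(premise: str, hypothesis: str) -> str:
    """Frame a compositionality check as a yes/no entailment question."""
    return (f'If "{premise}" is true, is "{hypothesis}" also true? '
            'Answer yes or no.')

# Each test: (premise, hypothesis, expected answer).
TEST_CASES = {
    "substitutivity": ("The runner set a record",
                       "The runner set a new record", "yes"),
    "systematicity": ("A red car is a car and a car is a vehicle",
                      "A red car is a red vehicle", "yes"),
    "overgeneralization": ("A trench coat is a type of coat",
                           "A turncoat is a type of coat", "no"),
}

def ask_model(prompt: str) -> str:
    # Stub: a real evaluation would query an LLM here.
    # This stub answers "yes" to everything, like a naively
    # pattern-matching model that overgeneralizes.
    return "yes"

def evaluate(ask) -> dict:
    """Score 1 if the model's answer matches the expected one, else 0."""
    scores = {}
    for name, (premise, hypothesis, expected) in TEST_CASES.items():
        answer = ask(make_prompt(premise, hypothesis)).strip().lower()
        scores[name] = int(answer == expected)
    return scores

scores = evaluate(ask_model)
```

The always-"yes" stub passes substitutivity and systematicity but fails overgeneralization, which is exactly the trap that test is designed to catch.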
The Result: Performance was messy. Sometimes the AI got these puzzles right. But surprisingly, when the researchers made the AI "smarter" (by making it bigger or teaching it to follow instructions better), it actually got worse at some of them. It was like giving the chef a bigger kitchen and better tools, only to watch them start burning the toast more often.

Flashlight 2: The "Brain Scan" (Representational View)

The Analogy: Now, imagine we don't just watch what the chef serves, but we put the chef under an MRI machine while they are thinking. We look at their brain activity to see if they are actually processing the ingredients correctly.

This is Representational Evaluation. The researchers looked inside the AI's "neurons" (its internal math states) to see if the concept of "red car" was being built correctly from "red" + "car," even if the AI didn't say the right answer out loud.

The Result: This is where it gets surprising. The "brain scan" showed that the AI was building the concepts correctly: the internal representation of "red car" could be reliably predicted from the representations of "red" and "car." The "ingredients" were being mixed properly in the bowl, even when the dish that came out was wrong.
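One simple way to picture this kind of representational probe: take the vectors for the adjective and the noun, compose them (here by plain addition, one common baseline; the paper's actual probes are more sophisticated), and check whether the result lands closer to the observed phrase vector than to an unrelated word. The 3-d vectors below are hand-made toys, not real model states.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (illustrative only, not from any real model).
emb = {
    "red":     np.array([1.0, 0.0, 0.2]),
    "car":     np.array([0.0, 1.0, 0.3]),
    "red car": np.array([0.9, 1.1, 0.5]),   # observed phrase vector
    "banana":  np.array([0.1, -1.0, 0.8]),  # unrelated control word
}

# Additive composition: predict the phrase vector as adjective + noun.
predicted = emb["red"] + emb["car"]

sim_phrase = cosine(predicted, emb["red car"])
sim_control = cosine(predicted, emb["banana"])

# If the representations are compositional, the composed vector sits
# much closer to the observed phrase than to an unrelated word.
assert sim_phrase > sim_control
```

The interesting point of the paper is that probes like this can succeed inside the model even while the model's spoken answers to the logic puzzles fail.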

The Great Divergence: The "Silent Genius"

Here is the twist: The AI knew the answer in its head, but failed to say it out loud.

  • Inside the brain: The AI's internal representations combined adjectives and nouns in a consistent, logical way.
  • Outside the mouth: The AI often gave the wrong answer when asked to perform the task.

It's like a brilliant student who understands the math perfectly during a test but gets so nervous they write down the wrong number. Or a chef who knows exactly how to make a soufflé but accidentally knocks the tray over before serving it.

Why Does This Matter?

The paper concludes that we can't just look at how well an AI performs a task (the "output") to judge if it's smart. We also have to look at how it thinks (the "internal state").

If we only look at the output, we might think the AI is failing at logic. But if we look inside, we see it actually has the logic; it just struggles to translate that logic into a final answer, especially when we make the models bigger or change how they are trained.

The Takeaway

The authors are telling us: Don't just judge a book by its cover (or a robot by its output).

To truly understand if AI is "compositional" (able to build complex meanings from simple parts), we need to use both flashlights:

  1. Functional: Can it do the job?
  2. Representational: Does it actually understand the job while doing it?

If we only use one, we get a confusing picture. But when we use both, we see that these AI models are often "silent geniuses"—they understand the rules of language deep down, even if they stumble when trying to show it off.