Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

This paper evaluates adjective-noun compositionality in large language models using both functional and representational approaches, revealing a significant divergence where models successfully develop compositional internal representations but fail to consistently translate them into functional task success.

Ruchira Dhar, Qiwei Peng, Anders Søgaard

Published Thu, 12 Ma

Imagine you are trying to figure out if a new, super-smart robot chef truly understands cooking, or if it's just a master of mimicry.

This paper is about testing Large Language Models (LLMs)—the AI brains behind tools like ChatGPT—to see if they really understand how words combine to create new meanings. Specifically, the authors looked at Adjective-Noun combinations (like "red car" or "alleged thief").

Here is the story of their investigation, explained with some kitchen analogies.

The Big Question: Does the Robot Know or Just Guess?

In human language, we are "compositional." This means we can take simple parts (like "red" and "car") and combine them to understand a new thing ("red car") without having memorized that specific phrase before. We know that a "red car" is still a car.

But does the AI do this? Or is it just a giant autocomplete machine that guesses the next word based on patterns it saw in its training data?

The researchers decided to test the AI using two different "flashlights" to see what was really going on inside the robot's brain.

Flashlight 1: The "Performance Test" (Functional View)

The Analogy: Imagine asking the robot chef to cook a dish. You give it a recipe: "Make a 'red car'." If it serves you a red car, you say, "Good job!" If it serves you a blue truck, you say, "Fail."

This is what the researchers call Functional Evaluation. They asked the AI to solve logic puzzles:

  • Substitutivity: If "The runner set a record" is true, is "The runner set a new record" also true? (Yes, because "new" just adds detail).
  • Systematicity: If a "red car" is a "car," and a "car" is a "vehicle," is a "red car" a "red vehicle"?
  • Overgeneralization: If a "trench coat" is a type of coat, is a "turncoat" (a traitor) also a type of coat? (No! The AI should know the difference).

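The functional tests above can be sketched as yes/no entailment questions put to a model. Here is a minimal, hypothetical harness: `ask_model` is a stand-in for a real LLM call (the paper's actual prompts and scoring are more involved), and the stub deliberately answers "yes" to everything to show how an overgeneralizing model fails the third check.

```python
def make_prompt(premise: str, hypothesis: str) -> str:
    """Frame a compositionality check as a yes/no entailment question."""
    return (f'If "{premise}" is true, is "{hypothesis}" also true? '
            'Answer yes or no.')

# Each test: (premise, hypothesis, expected answer).
TEST_CASES = {
    "substitutivity": ("The runner set a record",
                       "The runner set a new record", "yes"),
    "systematicity": ("A red car is a car and a car is a vehicle",
                      "A red car is a red vehicle", "yes"),
    "overgeneralization": ("A trench coat is a type of coat",
                           "A turncoat is a type of coat", "no"),
}

def ask_model(prompt: str) -> str:
    # Stub: a real evaluation would query an LLM here.
    # This stub answers "yes" to everything, like a naively
    # pattern-matching model that overgeneralizes.
    return "yes"

def evaluate(ask) -> dict:
    """Score 1 if the model's answer matches the expected one, else 0."""
    scores = {}
    for name, (premise, hypothesis, expected) in TEST_CASES.items():
        answer = ask(make_prompt(premise, hypothesis)).strip().lower()
        scores[name] = int(answer == expected)
    return scores

scores = evaluate(ask_model)
```

The always-"yes" stub passes substitutivity and systematicity but fails overgeneralization, which is exactly the trap that test is designed to catch.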
The Result: Performance was messy. Sometimes the AI got these puzzles right. But surprisingly, when the researchers made the AI "smarter" (by making it bigger or teaching it to follow instructions better), it actually got worse at some of them. It was like giving the chef a bigger kitchen and better tools, only to watch them start burning the toast more often.

Flashlight 2: The "Brain Scan" (Representational View)

The Analogy: Now, imagine we don't just watch what the chef serves, but we put the chef under an MRI machine while they are thinking. We look at their brain activity to see if they are actually processing the ingredients correctly.

This is Representational Evaluation. The researchers looked inside the AI's "neurons" (its internal math states) to see if the concept of "red car" was being built correctly from "red" + "car," even if the AI didn't say the right answer out loud.

The Result: This is where it gets surprising. The "brain scan" showed that the AI was building the concepts correctly: the internal representation of "red car" could be reliably predicted from the representations of "red" and "car." The "ingredients" were being mixed properly in the bowl, even when the dish that came out was wrong.
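One simple way to picture this kind of representational probe: take the vectors for the adjective and the noun, compose them (here by plain addition, one common baseline; the paper's actual probes are more sophisticated), and check whether the result lands closer to the observed phrase vector than to an unrelated word. The 3-d vectors below are hand-made toys, not real model states.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (illustrative only, not from any real model).
emb = {
    "red":     np.array([1.0, 0.0, 0.2]),
    "car":     np.array([0.0, 1.0, 0.3]),
    "red car": np.array([0.9, 1.1, 0.5]),   # observed phrase vector
    "banana":  np.array([0.1, -1.0, 0.8]),  # unrelated control word
}

# Additive composition: predict the phrase vector as adjective + noun.
predicted = emb["red"] + emb["car"]

sim_phrase = cosine(predicted, emb["red car"])
sim_control = cosine(predicted, emb["banana"])

# If the representations are compositional, the composed vector sits
# much closer to the observed phrase than to an unrelated word.
assert sim_phrase > sim_control
```

The interesting point of the paper is that probes like this can succeed inside the model even while the model's spoken answers to the logic puzzles fail.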

The Great Divergence: The "Silent Genius"

Here is the twist: The AI knew the answer in its head, but failed to say it out loud.

  • Inside the brain: The AI's internal representations combined adjectives and nouns in a consistent, logical way.
  • Outside the mouth: The AI often gave the wrong answer when asked to perform the task.

It's like a brilliant student who understands the math perfectly during a test but gets so nervous they write down the wrong number. Or a chef who knows exactly how to make a soufflé but accidentally knocks the tray over before serving it.

Why Does This Matter?

The paper concludes that we can't just look at how well an AI performs a task (the "output") to judge if it's smart. We also have to look at how it thinks (the "internal state").

If we only look at the output, we might think the AI is failing at logic. But if we look inside, we see it actually has the logic; it just struggles to translate that logic into a final answer, especially when we make the models bigger or change how they are trained.

The Takeaway

The authors are telling us: Don't just judge a book by its cover (or a robot by its output).

To truly understand if AI is "compositional" (able to build complex meanings from simple parts), we need to use both flashlights:

  1. Functional: Can it do the job?
  2. Representational: Does it actually understand the job while doing it?

If we only use one, we get a confusing picture. But when we use both, we see that these AI models are often "silent geniuses"—they understand the rules of language deep down, even if they stumble when trying to show it off.