Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents

This paper introduces the Lexical Consensus framework and shows that artificial agents can acquire and stabilize grounded word meanings, with learnability governed by perceptual distance rather than semantic relatedness. It reports a robust learning gradient in which native categories are easiest to learn while far-disjunctive concepts approach chance, and it finds that naming and retrieval rely on distinct mechanisms within a frozen perceptual geometry.

Original authors: Patricio M. Vera

Published 2026-06-23

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are teaching a robot to speak, but instead of giving it a dictionary full of definitions, you point at pictures and say, "This is a slithy," or "That is a vorpal." The robot has never heard these words before, and they mean nothing to it yet. The big question this paper asks is: Can the robot actually learn what these words mean just by looking at pictures, and will it remember them later?

The paper, by Patricio M. Vera, introduces an experimental framework called Lexical Consensus to test this. Here is how it works, explained through simple analogies.

1. The Robot's "Eyes" Are Already Organized

Before the robot learns any words, it is given a set of "eyes" (a pre-trained computer vision model called DINOv2). Think of these eyes like a highly organized library.

The library already has books sorted by genre. All the "frog" books are on one shelf, all the "horse" books on another, and all the "ship" books on a third.
The robot doesn't learn to see; it just uses this pre-organized library. The researchers wanted to see if the robot could learn to put new labels on these existing shelves.

2. The "Carroll" Vocabulary

Instead of using normal words like "dog" or "car," the researchers used made-up words drawn from Lewis Carroll's nonsense vocabulary, such as the poem "Jabberwocky" in Through the Looking-Glass (like slithy, mimsy, and vorpal).

Why? Because if you use the word "dog," the robot might already know what a dog is from its training data. By using nonsense words, the researchers ensure the robot is learning the meaning only from the pictures they show it, not from anything it already knew.

3. The Four Levels of Difficulty (The "Concept Carving")

The researchers tested the robot with four different types of lessons to see how hard it was to learn:

Level 1: Native Concepts (The Easy Shelves).
- The Lesson: "This word slithy means only frogs."
- The Result: The robot learned this almost perfectly. It's like putting a new name tag on a shelf that was already perfectly organized.
Level 2: Coherent Overextensions (The Related Shelves).
- The Lesson: "This word mimsy means frogs AND toads." (Things that look similar).
- The Result: The robot still learned this very well. It's like putting a name tag on two shelves that are right next to each other.
Level 3: Mid-Range Disjunctive (The Distant Shelves).
- The Lesson: "This word vorpal means frogs AND ships." (Things that are somewhat different).
- The Result: The robot started to struggle. It got the meaning wrong more often.
Level 4: Far-Disjunctive (The Opposite Shelves).
- The Lesson: "This word gimble means frogs AND airplanes." (Things that are totally unrelated and far apart in the library).
- The Result: The robot all but failed. Its performance dropped to nearly what it would get by guessing randomly.

The Big Discovery: The robot didn't learn words based on how closely related the grouped things were in meaning. It learned based on how close the pictures sat to each other in its internal library. If the pictures were neighbors, the robot learned the word. If the pictures were strangers living in distant parts of the library, the robot's learning fell to near guessing.
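To make the library picture concrete, here is a minimal toy sketch in Python. It is not the paper's code: the 2-D points, the cluster positions, and the `naming_accuracy` helper are all assumptions made for illustration, standing in for real DINOv2 embeddings. One word covers two clusters that sit `distance` apart, a second word covers a fixed neighboring cluster, and a single-centroid learner names held-out pictures by the nearest word prototype.

```python
import numpy as np

def naming_accuracy(distance, n_train=200, n_test=500, seed=0):
    """Toy naming test: one word covers two clusters `distance` apart."""
    rng = np.random.default_rng(seed)
    # Hypothetical 2-D "library": cluster centers are illustrative only.
    centers = {
        "frog": np.array([0.0, 0.0]),
        "partner": np.array([distance, 0.0]),  # second half of the taught word
        "horse": np.array([0.0, 4.0]),         # fixed neighbor with its own word
    }

    def sample(name, n):
        return centers[name] + rng.normal(size=(n, 2))

    # One prototype (centroid) per word, as in a centroid-based learner.
    taught = np.vstack([sample("frog", n_train), sample("partner", n_train)])
    proto_taught = taught.mean(axis=0)
    proto_other = sample("horse", n_train).mean(axis=0)

    # Held-out pictures that should all receive the taught word.
    test = np.vstack([sample("frog", n_test), sample("partner", n_test)])
    d_taught = np.linalg.norm(test - proto_taught, axis=1)
    d_other = np.linalg.norm(test - proto_other, axis=1)
    return float(np.mean(d_taught < d_other))

# Assumed distances for the four tiers; the paper's actual values are not used.
tiers = {"native": 0.0, "near": 2.0, "mid": 6.0, "far": 12.0}
gradient = {tier: naming_accuracy(d) for tier, d in tiers.items()}
for tier, acc in gradient.items():
    print(f"{tier:>6}: {acc:.2f}")
```

In this toy setup, accuracy on the taught word should stay high while the two clusters are neighbors and fall toward the two-word chance level once the prototype lands in the empty space between distant clusters.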

4. The "Name" vs. The "Memory" Test

The researchers tested the robot in two ways:

Naming (Image $\to$ Word): Show a picture, ask "What is this?"
Retrieving (Word $\to$ Image): Say "Show me a slithy," and ask the robot to pick the right picture from a pile.

They found these are different skills.

For Naming, a simple "average" memory worked fine.
For Retrieving, the robot was much better if it remembered specific examples (like a photo album) rather than just an "average" picture. It's easier to find a specific friend in a crowd if you remember their face, rather than just remembering "what an average person looks like."
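The photo-album idea can be sketched with a small toy example. Everything here is an assumption for illustration (2-D points instead of real embeddings, the cluster positions, the pool size, and the helper names), not the paper's implementation. A word covers two separate clusters, so its "average picture" falls between them, right next to a distractor cluster. Retrieval picks the candidate with the lowest score, where the score is either the distance to the average or the distance to the closest remembered example.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster(center, n, spread=0.5):
    """Draw n toy 2-D 'embeddings' around a center."""
    return np.asarray(center, dtype=float) + spread * rng.normal(size=(n, 2))

# Remembered examples of one word that covers two separate clusters.
memory = np.vstack([cluster([-3, 0], 50), cluster([3, 0], 50)])
centroid = memory.mean(axis=0)  # the "average picture"

def score_centroid(candidates):
    # Distance from each candidate to the single average.
    return np.linalg.norm(candidates - centroid, axis=1)

def score_exemplar(candidates):
    # Distance from each candidate to its closest remembered example.
    diffs = candidates[:, None, :] - memory[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

def retrieval_accuracy(score, trials=300):
    hits = 0
    for _ in range(trials):
        mode = [-3, 0] if rng.random() < 0.5 else [3, 0]
        valid = cluster(mode, 1)            # one true instance of the word
        distractors = cluster([0, 2], 3)    # three pictures of something else
        pool = np.vstack([valid, distractors])  # index 0 is the valid one
        hits += int(np.argmin(score(pool)) == 0)
    return hits / trials

acc_centroid = retrieval_accuracy(score_centroid)
acc_exemplar = retrieval_accuracy(score_exemplar)
print(f"centroid memory: {acc_centroid:.2f}")
print(f"exemplar memory: {acc_exemplar:.2f}")
```

The layout is deliberately unkind to the average: distractors sit closer to the midpoint than any real instance does, so the exemplar memory should win by a wide margin here. Real embedding spaces will show smaller gaps.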

5. The Robot Group Chat (Consensus)

The researchers then put many robots in a room and let them talk to each other to agree on the meanings of the words.

The Result: The robots quickly agreed on what the words meant.
The Catch: They agreed because they all had the same pre-organized library (the same "eyes"). They didn't change their internal libraries to match each other; they just coordinated their answers based on the library they already shared. The words didn't change how they saw the world; they just helped them agree on the labels.
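A rough sketch of why a shared library is enough for agreement, under stated assumptions: the 2-D cluster centers, the number of agents, the seed counts, and the helper names below are hypothetical, and no feedback between agents is modeled. Each agent builds its own word prototypes from a private set of examples, then all agents name the same probe pictures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared, frozen geometry: every agent sees the same three clusters (toy values).
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
words = ["slithy", "mimsy", "vorpal"]

def draw(word_idx, n):
    return centers[word_idx] + rng.normal(size=(n, 2))

def train_agent(n_seeds=10):
    # Fresh draws on every call, so each agent has its own private seed set.
    return np.stack([draw(k, n_seeds).mean(axis=0) for k in range(len(words))])

def name(prototypes, images):
    dists = np.linalg.norm(images[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

agents = [train_agent() for _ in range(5)]
probe_labels = np.repeat(np.arange(len(words)), 100)
probes = np.vstack([draw(k, 100) for k in range(len(words))])

votes = np.stack([name(p, probes) for p in agents])  # shape: (agents, probes)
unanimous = float(np.mean((votes == votes[0]).all(axis=0)))
majority = np.array([np.bincount(col, minlength=len(words)).argmax() for col in votes.T])
consensus_accuracy = float(np.mean(majority == probe_labels))

print(f"unanimous agreement: {unanimous:.2f}")
print(f"majority-vote accuracy: {consensus_accuracy:.2f}")
```

Agreement is high in this sketch even though the agents never exchange anything, which mirrors the point that the shared geometry does most of the stabilizing work.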

6. The "Falsification" Checks (Did the Robot Cheat?)

To make sure the robot wasn't just guessing or memorizing patterns, the researchers tried to break the experiment:

Random Labels: They swapped the words randomly. The robot failed.
Random Pictures: They gave the robot random noise instead of real pictures. The robot failed.
Out-of-Vocabulary: They showed the robot pictures that matched none of the words it had been taught. The robot correctly said, "I don't know a word for this."
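Two of these checks can be sketched on toy data. This is an illustration under assumptions, not the paper's protocol: the 2-D clusters, the `REJECT_RADIUS` threshold, and the idea of rejecting a picture when it is too far from every word prototype are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])  # three taught words

def draw(center, n):
    return np.asarray(center, dtype=float) + rng.normal(size=(n, 2))

def fit(embeddings, labels):
    # One prototype (centroid) per word.
    return np.stack([embeddings[labels == k].mean(axis=0) for k in range(3)])

def nearest(prototypes, embeddings):
    dists = np.linalg.norm(embeddings[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1), dists.min(axis=1)

labels = np.repeat(np.arange(3), 100)
train = np.vstack([draw(c, 100) for c in centers])
test = np.vstack([draw(c, 100) for c in centers])
prototypes = fit(train, labels)

# Grounded condition: real structure, real labels.
grounded_acc = float(np.mean(nearest(prototypes, test)[0] == labels))

# Control: replace every embedding with noise unrelated to the category.
noise_train = rng.normal(size=train.shape)
noise_test = rng.normal(size=test.shape)
random_acc = float(np.mean(nearest(fit(noise_train, labels), noise_test)[0] == labels))

# Out-of-vocabulary check: reject pictures far from every prototype.
REJECT_RADIUS = 3.5  # assumed threshold, chosen by hand for this toy layout
unseen = draw([6.0, 6.0], 100)  # a category with no taught word
_, unseen_dist = nearest(prototypes, unseen)
oov_rejection_rate = float(np.mean(unseen_dist > REJECT_RADIUS))

print(f"grounded accuracy: {grounded_acc:.2f}")
print(f"random-embedding accuracy: {random_acc:.2f}")
print(f"OOV rejection rate: {oov_rejection_rate:.2f}")
```

With three words, the random-embedding control should hover around one in three, while the grounded condition should be close to perfect on clusters this well separated.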

The Bottom Line

This paper shows that for an artificial agent to learn a new word reliably, the concept must fit neatly into how it already sees the world.

It's not magic: You can't just teach a robot that "frogs = airplanes" and expect it to work.
It's about structure: Learning happens when the new word matches the natural groups the robot already sees.
It's a boundary: The robot can learn words for things that look similar, but it hits a wall when you try to teach it words for things that look nothing alike.

In short, language learning for AI is constrained by how the AI sees the world. If the world looks organized to the AI, the words stick. If the world looks like a messy jumble to the AI, the words fall apart.

Technical Summary: Lexical Consensus

Problem Statement
Current artificial intelligence evaluation is predominantly organized around task performance, benchmark accuracy, and behavioral imitation. While valuable, these metrics fail to address a deeper question: whether an artificial agent can acquire, stabilize, and utilize new lexical meanings derived from grounded experience. Specifically, it remains unclear if agents can learn novel word-concept mappings from limited visually grounded examples, generalize these mappings bidirectionally (image-to-label and label-to-image), and stabilize them across agents. This paper addresses the gap between imitation-based assessment and acquisition-based evaluation, asking whether agents can acquire vocabulary for their surroundings without relying solely on preloaded labels or task-specific definitions.

Methodology
The paper introduces Lexical Consensus, a reproducible experimental framework designed to evaluate grounded word learning over a structured perceptual substrate. The framework isolates lexical acquisition from perceptual learning by utilizing a frozen perceptual encoder (DINOv2-small) to generate visual embeddings. The experimental design includes the following components:

Artificial Lexicon: The system uses Carroll-style nonce words (e.g., slithy, mimsy, vorpal) drawn from Lewis Carroll's vocabulary. These labels are phonotactically plausible but experimentally ungrounded, entering the system as opaque identifiers to prevent semantic leakage.
Concept-Carving Evaluation: To test if acquisition is merely the relabeling of existing clusters or if it depends on perceptual coherence, the framework defines four concept tiers based on the relationship between the taught concept and the frozen perceptual geometry:
1. Native concepts: One label corresponds to one native visual category.
2. Near-disjunctive concepts: Labels group perceptually coherent categories (overextensions).
3. Mid-disjunctive concepts: Labels group categories with intermediate perceptual distance.
4. Far-disjunctive concepts: Labels group perceptually distant categories (arbitrary unions).
Learner Agents: The study employs interpretable lexical learners, including centroid-based learners (prototypical networks with frozen encoders), multi-centroid learners, exemplar k-NN, and linear baselines (logistic regression, linear SVM).
Bidirectional Grounding: Evaluation occurs in two directions:
- Condition 1 (C1): Image-to-label naming (assigning the correct label to a new image).
- Condition 2 (C2): Label-to-image retrieval (recovering a valid instance from a candidate pool given a label).
Multi-Agent Consensus: A population of agents trained on disjoint seed sets interacts to reach a consensus on label usage, measured by agreement thresholds and information-theoretic metrics (entropy, mutual information).
Falsification Controls: The framework includes rigorous controls such as random-label assignment, random embeddings, permuted image-embedding bindings, out-of-vocabulary (OOV) rejection tests, and homogeneous candidate-pool evaluations to rule out trivial explanations.
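The information-theoretic consensus metrics named above can be sketched with standard definitions. How the paper computes entropy and mutual information is not specified in this summary, so the functions below (`entropy_bits`, `mutual_information_bits`) and the tiny label sequences are assumptions for illustration. Labels are encoded as integer indices into the lexicon.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (in bits) of a sequence of label indices."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information_bits(labels_a, labels_b):
    """I(A; B) = H(A) + H(B) - H(A, B) for two agents' label choices."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    pairs = np.stack([a, b], axis=1)
    _, joint_counts = np.unique(pairs, axis=0, return_counts=True)
    p = joint_counts / joint_counts.sum()
    joint_entropy = float(-(p * np.log2(p)).sum())
    return entropy_bits(a) + entropy_bits(b) - joint_entropy

# Two agents naming the same four probe images (hypothetical outputs).
agent_a = [0, 0, 1, 1]
agrees_with_a = [0, 0, 1, 1]
unrelated_to_a = [0, 1, 0, 1]

mi_agree = mutual_information_bits(agent_a, agrees_with_a)
mi_unrelated = mutual_information_bits(agent_a, unrelated_to_a)
print(mi_agree, mi_unrelated)
```

Two agents that always pick the same one of two equally used labels share one bit of information, while agents whose choices are unrelated share none. Higher mutual information across a population is one way to read stronger consensus.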

Key Contributions

Lexical Consensus Framework: A constrained empirical implementation of the first language-acquisition test proposed by Vera et al. (2023), providing a measurable protocol for evaluating how agents acquire, retrieve, and stabilize language-like mappings.
Perceptual-Coherence Gradient: The demonstration that lexical acquisition is not arbitrary set learning but follows a monotonic gradient governed by perceptual coherence.
Dissociation of Perception and Semantics: A pre-registered experiment over CIFAR-100 confirming that acquisition accuracy is driven by perceptual distance rather than semantic relatedness.
Bidirectional Distinction: Evidence that image-to-label naming and label-to-image retrieval expose distinct capacities (concept-geometry compatibility vs. memory fidelity).
Null Result on Representational Restructuring: Findings indicating that while agents can converge on shared lexical usage, this consensus does not substantially reorganize internal perceptual representations under the current architecture.

Results

Acquisition Gradient: Naming accuracy (C1) follows a robust, monotonic perceptual-coherence gradient. Native categories are acquired with near-perfect accuracy. Coherent overextensions remain highly learnable. Mid-disjunctive concepts show partial degradation, and far-disjunctive concepts degrade to near-chance levels. This pattern holds across centroid, exemplar, and linear learners.
Perceptual vs. Semantic Drivers: In the dissociation experiment, where perceptual and semantic distances disagreed, acquisition accuracy tracked the perceptual predictor (partial $R^2 = 0.245$, $p < 10^{-7}$). The semantic predictor added no significant explanatory power (partial $R^2 = 0.002$, $p = 0.660$). This confirms the gradient is a property of the perceptual substrate's geometry, not a measurement artifact.
Retrieval Dynamics: Label-to-image retrieval (C2) reveals a memory-fidelity dimension. Exemplar-based mechanisms consistently outperform compressed centroid prototypes, particularly for coherent but multimodal concepts. Linear discriminative baselines recover additional structure under hard candidate pools.
Consensus and Alignment: Multi-agent experiments show that agents can converge on a shared vocabulary, and feedback improves agreement. However, the no-feedback baseline already achieves high consensus accuracy, suggesting shared perceptual geometry is the dominant stabilizing force. Crucially, consensus feedback does not significantly reduce inter-agent centroid distances or reshape internal representations.
Falsification: The grounding effect collapses when embeddings are randomized or image-embedding bindings are permuted, confirming that correct grounding depends on the perceptual substrate and its binding to labels.
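For readers unfamiliar with the partial $R^2$ values quoted in the dissociation result, the sketch below uses one common definition: the share of the reduced model's residual error that disappears when the predictor of interest is added, $(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}}) / \mathrm{SSE}_{\text{reduced}}$. Whether the paper uses exactly this definition is an assumption, and the data here are synthetic, generated so that accuracy depends on a perceptual distance only. None of the numbers are the paper's.

```python
import numpy as np

def sse(predictors, y):
    """Residual sum of squares of an ordinary least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def partial_r2(target, others, y):
    """Fraction of the reduced model's error removed by adding `target`."""
    reduced = sse(others, y)
    full = sse(others + [target], y)
    return (reduced - full) / reduced

# Synthetic stand-ins: accuracy falls with perceptual distance, ignores semantic distance.
rng = np.random.default_rng(0)
n = 1000
perceptual = rng.normal(size=n)
semantic = rng.normal(size=n)
accuracy = 0.8 - 0.1 * perceptual + 0.2 * rng.normal(size=n)

r2_perceptual = partial_r2(perceptual, [semantic], accuracy)
r2_semantic = partial_r2(semantic, [perceptual], accuracy)
print(f"partial R^2 (perceptual): {r2_perceptual:.3f}")
print(f"partial R^2 (semantic):   {r2_semantic:.3f}")
```

On data built this way, the perceptual predictor should carry a clearly nonzero partial $R^2$ while the semantic one stays near zero, which is the qualitative pattern the paper reports.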

Significance and Claims
The paper positions Lexical Consensus not as a solution to full artificial language acquisition, but as a constrained empirical scaffold for studying the boundaries of grounded lexical learning.

The primary significance is the demonstration that early lexical acquisition is constrained by perceptual coherence. Agents learn labels more reliably when taught concepts correspond to coherent regions of the perceptual space. As taught concepts cut across distant regions of that space, performance degrades. This reframes the role of the perceptual substrate: its structure is not merely a confound to be hidden, but the condition under which acquisition becomes measurable.

Furthermore, the paper claims that shared lexical agreement should not be overinterpreted as representational transformation. While agents can coordinate decisions over a shared perceptual geometry, the current architecture shows that lexical feedback alone does not reorganize the underlying perceptual embeddings.

Ultimately, the work argues for a shift in AI evaluation from static performance metrics to acquisition-based tests that measure how agents acquire, retrieve, and stabilize meaning under perceptual constraints. It establishes that while agents can acquire and share lexical mappings over frozen perception, the scope of what can be learned is strictly bounded by the alignment between the taught concept and the available perceptual geometry.