A battery of image classification challenges reveals shared and distinct object categorization behavior across monkeys, humans, and deep networks
This study demonstrates that monkeys can rapidly learn, and generalize, more than ten diverse object categorization rules using natural images, exhibiting error patterns similar to those of humans but relying on visual processing that aligns more closely with language-free deep neural networks than with human categorization.
Original authors: Zhang, H., Zheng, Z., Hu, J., Wang, Q., Xu, M., Zhou, Z., Li, Z., Okazawa, G.
This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a teacher trying to figure out how smart your students are at sorting things. You have three very different "students" to test:
A Human (who can read and speak, and who knows what a "fire extinguisher" is because they've seen one in a movie).
A Monkey (who is very smart, has great eyes, but has never learned human language or cultural concepts).
A Computer AI (a digital brain that can be taught to see; some versions only "see" pixels, while others also "know" words).
This paper is about a massive classroom experiment where the researchers gave all three of these students a huge battery of sorting tests to see how they think.
The Classroom Setup: The "Drag-and-Drop" Game
Instead of asking the monkeys to talk or press buttons, the researchers gave them a touchscreen game.
The Game: A picture of an object (like a dog or a toaster) appears on the screen.
The Task: The monkey has to grab the picture with its finger and drag it into one of two boxes.
The Secret Rule: The monkey doesn't know the rule at first. It has to guess. If it drags a dog to the "Alive" box and gets a juice reward, it learns. If it drags a toaster to the "Alive" box and gets a timeout, it learns that's wrong.
The Challenge: The researchers changed the rule every few days. One day, the rule was "Alive vs. Dead." The next day, it was "Big vs. Small." Then "Natural vs. Man-made." Then "Fire-related vs. Water-related."
The Results: Who Passed the Test?
1. The Monkeys: The Visual Masters
The monkeys were surprisingly fast learners. They figured out rules like "Is this a living thing?" or "Is this a mammal?" in just a few days.
The Analogy: Imagine you show a monkey a picture of a snake and a picture of a car. The monkey quickly learns, "Snakes go in the 'Alive' box, cars go in the 'Dead' box." Even if you show it a new snake it has never seen before, it knows exactly where to put it.
The Catch: The monkeys were great at things you can see. But when the rules got too abstract—like "Is this object related to fire?" (e.g., a lighter vs. a hose) or "Is this Western or Eastern culture?" (e.g., a crown vs. a mooncake)—the monkeys got confused and failed. They couldn't "get" the concept because they couldn't see "culture" or "fire safety" just by looking at the pixels.
2. The Humans: The Word-Wizards
Humans learned the rules almost instantly.
The Analogy: If you tell a human, "Drag the fire-related things here," they immediately think, "Oh, a lighter! A fire truck! A candle!" They use their language and cultural knowledge to solve the puzzle. They didn't just look at the shape; they looked at the meaning.
The Result: Humans aced every single test, even the weird cultural ones, because they could read the "mental labels" attached to the objects.
3. The Computers (AI): The Two Types of Brains
The researchers tested two kinds of AI to see which one acted like the monkey and which acted like the human.
The "Pure Vision" AI: This AI was trained only on pictures, with no words attached. It learned to recognize shapes and textures.
Result: It acted just like the monkey! It was great at sorting "Alive vs. Dead" but terrible at "Fire vs. Water." It couldn't understand the abstract concept without a word to help it.
The "Language-Informed" AI: This AI was trained on both pictures and text descriptions of those pictures (e.g., CLIP).
Result: It acted just like the human! It could sort the "Western vs. Eastern" objects perfectly because it knew the words associated with them.
The Big Picture: What Does This Tell Us?
The study reveals a fascinating truth about how our brains work compared to computers and animals:
Monkeys (and "Pure Vision" AI) are like photographers. They are incredible at noticing visual details: shapes, colors, textures, and whether something looks alive. They can sort the world based on what it looks like.
Humans (and "Language" AI) are like librarians. We don't just see the object; we see the idea behind it. We use language as a superpower to group things that might look totally different but share a hidden meaning (like a "lighter" and a "fire truck" both being "fire-related," or a "crown" counting as "Western" and a "mooncake" as "Eastern").
The Takeaway: You don't need to speak English to be smart at recognizing a dog or a chair. Your eyes and brain can do that on their own. But to understand abstract concepts like "culture," "safety," or "religion," you need the tool of language. The monkey's brain is a powerful visual engine, but it lacks the "software update" that language provides to sort the world by meaning rather than just by appearance.
1. Problem Statement
While deep neural networks (DNNs) have demonstrated that complex visual processing alone can support high-level object categorization (e.g., animate vs. inanimate) without language, it remains unclear to what extent non-human primates (NHPs) possess similar capabilities.
The Gap: Previous behavioral studies on monkeys focused primarily on basic-level recognition (e.g., similarity judgment, match-to-sample) or single-category tasks (e.g., "animal vs. non-animal"). These approaches lack the scalability to test a broad battery of abstract rules (superordinate categories) and cannot easily distinguish between learning specific image associations versus learning abstract conceptual rules.
The Question: Can monkeys learn and generalize a diverse battery of more than 10 high-level classification rules without language, and how does their performance compare to that of humans and various computational models (from low-level vision to language-informed DNNs)?
2. Methodology
Subjects and Apparatus:
Subjects: Three adult rhesus monkeys (Macaca mulatta) and 33 human participants.
Task Paradigm: A novel "Object Drag Task" implemented on a touchscreen system.
Procedure: Monkeys fixate on a red dot; an object image then appears along with two gray target boxes. The monkey must touch the image, drag it to one of the two boxes, and release.
Feedback: Correct choices reveal the image under the correct box with a juice reward; incorrect choices result in a timeout.
Mechanism: The dragging motion (>0.5s) forces deliberate decision-making, reducing impulsive errors. The rule (e.g., "animate vs. inanimate") is hidden and must be inferred from trial-by-trial feedback.
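To make the trial structure concrete, here is a minimal, self-contained Python simulation of the hidden-rule setup: a toy value-learning agent receives trial-by-trial juice/timeout feedback and gradually infers the rule. This is an illustrative sketch, not the authors' training code; the learner, its parameters (e.g., `LEARNING_RATE`), and all names are assumptions for demonstration only.

```python
import random

# Minimal, self-contained simulation of the Object Drag Task's hidden-rule
# structure (an illustrative sketch, not the authors' code). A toy learner
# tracks a value for each (category, box) pair and updates it from
# juice/timeout feedback, gradually inferring the hidden rule.

rng = random.Random(0)
CATEGORIES = ["animate", "inanimate"]
HIDDEN_RULE = {"animate": "left", "inanimate": "right"}  # unknown to the learner
LEARNING_RATE = 0.2                                      # arbitrary value
EXPLORE_PROB = 0.1                                       # occasional random drags

values = {c: {"left": 0.5, "right": 0.5} for c in CATEGORIES}

def run_trial():
    category = rng.choice(CATEGORIES)        # an image from this category appears
    box_values = values[category]
    if rng.random() < EXPLORE_PROB:          # sometimes guess
        choice = rng.choice(["left", "right"])
    else:                                    # otherwise drag to the better box
        choice = max(box_values, key=box_values.get)
    reward = 1.0 if choice == HIDDEN_RULE[category] else 0.0  # juice vs. timeout
    box_values[choice] += LEARNING_RATE * (reward - box_values[choice])
    return reward

accuracy = sum(run_trial() for _ in range(500)) / 500
print(f"accuracy over 500 trials: {accuracy:.2f}")
```

Note that only the hidden category-to-box mapping changes between rules; the trial loop itself is identical, which is what lets the same paradigm scale across many rules.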
Experimental Design:
Stimuli: A massive battery of binary classification tasks using natural object images (grayscale, background-removed).
Initial Battery: 10+ rules including animate vs. inanimate, natural vs. artificial, mammal vs. non-mammal, big vs. small, fire-related vs. water-related, and Western vs. Eastern culture.
Scale: ~315,000 trials total. Training sets contained 60–120 images; generalization sets contained previously unseen images (some drawn from the THINGS database).
Control Experiments:
Exemplar vs. Rule: Tested if monkeys relied on specific image memorization by using "old" (seen in training) vs. "new" (unseen) categories.
Feature Control: Used cartoons, silhouettes, outlines, and texture-distorted images to rule out reliance on specific low-level features (e.g., faces, legs, color).
Random Association: Trained monkeys on random stimulus-response mappings to establish a baseline for learning speed without conceptual rules.
Model Comparison:
Low/Mid-level Models: V1 filters, luminance/color statistics, Gist, and texture models.
Deep Neural Networks (DNNs): Pretrained CNNs (AlexNet, VGG16, ResNet-50), Vision Transformers (ViT), self-supervised models (DINO), and language-informed models (CLIP, SigLIP2); a sketch of the standard linear-probe recipe for testing such models follows this list.
Neural Data: Compared against ventral pathway neural recordings (V1, V4, IT) from the THINGS dataset.
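A common way to test whether a pretrained visual DNN can support a given rule is to freeze the network and fit a linear readout on its features. The sketch below shows that generic recipe with an off-the-shelf ResNet-50 from torchvision; it is our illustration of the approach, not the authors' pipeline, and it uses random stand-in tensors (flagged in the comments) so that it runs end to end.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.linear_model import LogisticRegression

# Generic "frozen features + linear readout" recipe for probing whether a
# pretrained visual DNN supports a binary rule (our illustration, not the
# authors' pipeline). Replace the random stand-in tensors below with real,
# preprocessed images and 0/1 rule labels (e.g., animate vs. inanimate).

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()           # expose the 2048-d penultimate features
model.eval()

@torch.no_grad()
def extract(images):                     # images: (N, 3, 224, 224), normalized
    return model(images).numpy()

# Random stand-ins so the sketch runs end to end; accuracy will be ~chance.
train_x, train_y = torch.randn(40, 3, 224, 224), [i % 2 for i in range(40)]
test_x, test_y = torch.randn(20, 3, 224, 224), [i % 2 for i in range(20)]

clf = LogisticRegression(max_iter=1000).fit(extract(train_x), train_y)
print("held-out accuracy:", clf.score(extract(test_x), test_y))
```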
Analysis:
Drift-Diffusion Model (DDM): Used to derive a single metric of "stimulus difficulty" (category sensitivity) by jointly fitting choice accuracy and reaction times (RTs) for both humans and monkeys; a toy simulation of the model appears after this list.
Dissimilarity Matrix: Constructed a matrix of pairwise dissimilarities between the behavioral performance profiles of monkeys, humans, and models across all tasks to visualize how similar the agents are to one another.
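The intuition behind the DDM fit: each decision is modeled as noisy evidence accumulating toward one of two bounds, and a per-image drift rate captures "category sensitivity," so harder images yield both more errors and slower responses. Below is a minimal simulation of that idea; it is a sketch under arbitrary parameter values, not the fitting procedure used in the paper.

```python
import math
import random

# Minimal drift-diffusion simulation (a sketch; parameters are arbitrary).
# Evidence x accumulates with drift v (the per-image "category sensitivity")
# plus Gaussian noise until it hits +a (correct) or -a (error); the hit time
# plus a non-decision constant gives the reaction time.

def ddm_trial(v, rng, a=1.0, dt=0.001, sigma=1.0, t_nondecision=0.3):
    x, t = 0.0, 0.0
    while abs(x) < a:
        x += v * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return x >= a, t + t_nondecision      # (correct?, RT in seconds)

rng = random.Random(1)
for drift in (0.5, 2.0):                  # a "hard" vs. an "easy" image
    trials = [ddm_trial(drift, rng) for _ in range(2000)]
    accuracy = sum(c for c, _ in trials) / len(trials)
    mean_rt = sum(t for _, t in trials) / len(trials)
    print(f"v={drift}: accuracy={accuracy:.2f}, mean RT={mean_rt:.2f}s")
```

Running this shows the signature the paper exploits: the low-drift ("hard") condition produces both lower accuracy and longer mean RTs than the high-drift ("easy") one.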
3. Key Results
A. Rapid Learning and Generalization:
Monkeys learned diverse classification rules (e.g., animate/inanimate, natural/artificial) rapidly, reaching ~90% accuracy within 3–6 days.
They successfully generalized these rules to novel images (including those from the THINGS database) immediately upon first exposure, indicating rule learning rather than rote memorization.
Performance was consistent across different abstraction levels (basic vs. superordinate categories).
B. Feature Independence:
Monkeys performed well on control images lacking specific features (e.g., silhouettes, cartoons, grayscale), indicating that they did not rely on a single visual cue (like "faces" or "green color").
They failed to generalize to "texform" images (which retain only mid-level texture statistics), suggesting they require higher-level structural information.
Learning speed for random stimulus-response associations was significantly slower than for conceptual rules, confirming that the monkeys were learning abstract concepts.
C. Comparison with Humans and Models:
Human vs. Monkey: Humans learned tasks almost instantly and achieved near-perfect accuracy. However, the difficulty of specific images (measured by DDM drift rate) was positively correlated between humans and monkeys (R = 0.30–0.43). Both species found the same images "hard" or "easy."
Model Fitting:
Low-level and mid-level visual models performed poorly (~60% accuracy) and failed to predict monkey choices.
Visual DNNs (trained only on images, e.g., ResNet-50) achieved high accuracy (>90%) and best explained monkey behavior.
Language-informed DNNs (CLIP, SigLIP2) best explained human behavior.
Failure Cases: Monkeys (and visual DNNs) failed to learn abstract rules requiring specific cultural or functional knowledge (e.g., "fire-related vs. water-related" or "Western vs. Eastern culture"), whereas humans and language-informed models succeeded.
D. Triangular Comparison:
A dissimilarity matrix and t-SNE visualization revealed a spectrum (a code sketch of this analysis follows the list):
Low-level models are at one extreme.
Monkey behavior sits in the middle, closely aligned with visual-only DNNs.
Human behavior clusters with language-informed DNNs.
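In outline, this triangular comparison stacks each agent's per-task behavioral profile into a matrix, computes pairwise dissimilarities, and embeds the result in 2D with t-SNE. The sketch below illustrates the pipeline with random stand-in scores and illustrative agent names, since the real profiles are reported in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Outline of the dissimilarity-matrix + t-SNE analysis (a sketch). Rows are
# agents, columns are per-task behavioral scores; the numbers here are
# random stand-ins, since the real profiles are reported in the paper.

rng = np.random.default_rng(0)
agents = ["monkey", "human", "resnet50", "clip", "v1_model"]
profiles = rng.random((len(agents), 10))   # 10 tasks, stand-in scores

dissim = squareform(pdist(profiles, metric="correlation"))
print(np.round(dissim, 2))                 # agent-by-agent dissimilarity matrix

embedding = TSNE(n_components=2, perplexity=2.0,
                 random_state=0).fit_transform(profiles)
for name, (x, y) in zip(agents, embedding):
    print(f"{name}: ({x:.1f}, {y:.1f})")
```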
4. Key Contributions
Scalable Behavioral Paradigm: Developed a high-throughput "drag-and-drop" task that allows monkeys to learn and switch between >10 abstract classification rules rapidly, overcoming the training limitations of previous studies.
Systematic Characterization: Provided the first systematic quantification of which high-level object concepts monkeys can and cannot learn, moving beyond basic-level recognition.
Decoupling Vision and Language: Demonstrated that while monkeys can learn many abstract visual categories without language, their performance is fundamentally limited compared to humans on tasks requiring semantic/cultural knowledge.
Benchmark for Biological Intelligence: Established that monkey object categorization is best modeled by visual-only deep networks, whereas human categorization requires the integration of language and visual features.
5. Significance
Evolution of Object Vision: The study suggests that the core mechanisms for extracting high-level object concepts from retinal images are shared between humans and monkeys and are largely visual in nature.
Role of Language: It highlights that the human advantage in abstract categorization (especially for culturally defined or functional concepts) stems largely from language and semantic knowledge, which allows humans to go beyond purely visual statistics.
AI and Neuroscience: The findings validate visual-only DNNs as strong models for primate visual processing but also delineate their limits. They suggest that to fully model human-level cognition, AI systems must integrate visual processing with language-based conceptual frameworks.
Concept Formation: The results challenge the notion that abstract concepts (like "animacy") are purely semantic; instead, they appear to be structurally embedded in visual images to a degree that biological brains (and visual DNNs) can extract them without explicit linguistic training.