Imagine you are an art critic visiting a museum. You look at a complex painting of a chaotic scene: a man pouring water, a bird with a specific beak shape, and a group of people interacting. You want to write a detailed description of this painting so that a blind person can "see" it in their mind.
Now, imagine you have a robot (an AI) that also writes descriptions. How do you know if the robot did a good job?
This is the problem the paper POSH tries to solve. Here is the breakdown in simple terms:
1. The Problem: Old Rulers Don't Fit New Clothes
For a long time, we've used simple rulers to grade AI descriptions. These old rulers (metrics like CIDEr or SPICE) were designed for short, simple sentences, like "A dog is running." They check if the AI used the right words.
But today, AI can write long, detailed stories about images. The old rulers are broken for this.
- The Flaw: If the AI says, "The man is pouring water," but the man is actually holding the water, the old ruler might give it a high score because it got the words "man," "pouring," and "water" right. It misses the logic error.
- The Real Issue: In detailed descriptions, getting the relationships right (who is doing what to whom) is everything. If the AI gets the relationships wrong, the description is useless, even if the vocabulary is perfect.
2. The Solution: POSH (The "Scene Graph" Detective)
The authors created a new tool called POSH. Think of POSH not as a ruler, but as a detective with a checklist.
Here is how POSH works, step-by-step:
Step 1: The Blueprint (Scene Graphs):
Imagine the AI writes a description. POSH takes that text and turns it into a blueprint (called a Scene Graph). It breaks the story down into its building blocks: Who is there? What are they wearing? How are they connected?
- Analogy: If the sentence is "The tall man on the horse is waving," the blueprint lists:
[Man] + [Tall] + [On Horse] + [Waving].
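The blueprint idea can be sketched as a set of simple (subject, relation, object) facts. This is a toy stand-in: POSH builds its scene graphs with a language model, whereas here the facts for the example sentence are hard-coded.

```python
# Toy scene graph: a set of (subject, relation, object) facts.
# A real extractor (as in POSH) would derive these from free text;
# we hard-code the blueprint for the example sentence.

def blueprint(facts):
    """Normalize a list of facts into a set of tuples for easy comparison."""
    return {tuple(f) for f in facts}

# "The tall man on the horse is waving."
graph = blueprint([
    ("man", "is", "tall"),    # attribute
    ("man", "on", "horse"),   # relationship
    ("man", "is", "waving"),  # action
])

print(sorted(graph))
```

Representing facts as a set makes the later comparison steps (catching inventions and omissions) simple set operations.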
Step 2: The Interrogation (LLM-as-a-Judge):
POSH compares the AI's blueprint against the "Gold Standard" blueprint (written by a human expert). Instead of just counting matching words, POSH uses a smart AI (a "Judge") to ask specific questions.
- The Question: "Does the description mention a man on a horse?"
- The Result: If the AI said "A man on a bicycle," the Judge catches that specific error. It pinpoints exactly where the mistake happened in the text.
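The interrogation step can be sketched with a toy judge. This is a deliberately crude stand-in: the real POSH judge asks a language model whether each gold fact is supported, while this sketch uses exact tuple matching and flags a "near-miss" (same subject and relation, different object) as a contradiction.

```python
# Toy stand-in for the LLM judge: for each gold-standard fact, check
# whether the candidate description's graph supports it, and pinpoint
# the mismatch. (POSH's real judge is a language model; exact matching
# is only an illustration.)

def judge(gold_facts, candidate_facts):
    report = []
    candidate = set(candidate_facts)
    for fact in gold_facts:
        if fact in candidate:
            report.append((fact, "supported"))
        else:
            # A near-miss shares subject and relation but not the object,
            # which lets us point at the exact wrong detail.
            near = [c for c in candidate if c[:2] == fact[:2]]
            if near:
                report.append((fact, f"contradicted by {near[0]}"))
            else:
                report.append((fact, "missing"))
    return report

gold = [("man", "on", "horse")]
candidate = [("man", "on", "bicycle")]
for fact, verdict in judge(gold, candidate):
    print(fact, "->", verdict)
# ('man', 'on', 'horse') -> contradicted by ('man', 'on', 'bicycle')
```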
Step 3: The Report Card:
POSH gives a score based on two things:
- Mistakes (Precision): Did the AI invent things that weren't there? (e.g., saying there is a cat when there isn't).
- Omissions (Recall): Did the AI forget important details? (e.g., forgetting to mention the bird's beak).
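The report card reduces to standard precision and recall over matched facts. A minimal sketch, assuming exact tuple matching between fact sets (POSH itself uses the LLM judge to decide what counts as a match):

```python
# Toy scoring in the spirit of POSH's report card:
#   precision penalizes invented facts (hallucinations),
#   recall penalizes forgotten facts (omissions).

def score(gold_facts, candidate_facts):
    gold, cand = set(gold_facts), set(candidate_facts)
    matched = gold & cand
    precision = len(matched) / len(cand) if cand else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("man", "pours", "water"), ("bird", "has", "curved beak")}
cand = {("man", "pours", "water"), ("cat", "on", "table")}  # one invention, one omission
p, r, f1 = score(gold, cand)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.50 f1=0.50
```

One invented cat halves precision; one forgotten beak halves recall, which matches the intuition of the two bullet points above.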
3. The New Test: DOCENT (The Art Museum)
To prove their new tool works, the authors built a new test called DOCENT.
- What is it? A collection of 1,750 complex artworks (paintings, sketches, statues) from the National Gallery of Art.
- Why is it special? Most AI tests use photos of cats or cars. These are simple. DOCENT uses art, which is messy, emotional, and full of hidden details.
- The Human Element: They hired art history students to grade the AI's descriptions. These students didn't just say "Good" or "Bad"; they marked exactly which sentences were wrong or missing. This created a "Gold Standard" to test POSH against.
4. The Results: Why POSH Wins
When they tested POSH against other methods:
- It's Smarter: POSH correlated much better with human experts than even the most advanced AI judges (like GPT-4o). It understood why a description was bad, not just that it was bad.
- It's Cheaper: You don't need to pay for expensive API calls to use POSH; it uses open-source models.
- It's a Coach: The authors used POSH to "train" an AI. Instead of just showing the AI examples, they let the AI practice and used POSH to give it feedback. The AI got significantly better at describing art after this training.
The Big Picture
The paper argues that to make AI truly helpful for things like accessibility (writing descriptions for the blind), we need to move beyond simple word-matching. We need tools that understand the story and the relationships in an image.
POSH is that tool. It's like upgrading from a spell-checker (which just checks if words are spelled right) to a literary editor (who checks if the plot makes sense). By using this new editor and a new, harder test (DOCENT), we can finally build AI that can truly "see" and describe the world in detail.