DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

The paper introduces DepthCharge, a domain-agnostic framework that measures the depth-dependent knowledge of Large Language Models through adaptive probing and on-demand fact verification, revealing significant performance variations across domains that standard benchmarks fail to capture.

Alexander Sheppert

Published 2026-03-26

Imagine you are hiring a new employee for a high-stakes job, like a surgeon or a lawyer. You ask them, "What is the flu?" They answer perfectly. You ask, "What are the symptoms?" They nail it again. You feel confident.

But then, you ask a follow-up: "Okay, you mentioned antiviral drugs. Which specific mutation makes the virus resistant to Tamiflu?" Suddenly, they stumble. They start guessing. They might even make something up to sound smart.

This is the problem the paper DepthCharge is trying to solve.

The Problem: The "Surface Competence" Illusion

Current benchmarks for Large Language Models are like a multiple-choice pop quiz. They ask the AI 100 different questions about 100 different topics. The AI gets 85% right. We say, "Great job! This AI is smart."

But this is an illusion. It's like judging a swimmer by how well they can float on the surface. They look great, but if you push them down 10 feet, do they panic? Do they drown?

Standard tests don't push the AI down. They just ask shallow questions. The paper argues that in real life (medicine, law, engineering), we don't just need surface answers; we need deep, reliable knowledge that holds up under pressure.

The Solution: DepthCharge

The authors created a new testing framework called DepthCharge. Think of it like a submarine diving into the ocean.

Instead of asking random questions, DepthCharge dives deeper and deeper into a specific topic, but with a twist: it follows the AI's own words.

Here is how it works, using a simple metaphor:

1. The Adaptive Drill (The "Follow the Rabbit" Strategy)

Imagine the AI is a rabbit, and you are digging a hole.

  • Round 1: You ask, "What is a rabbit?" The AI says, "It's a mammal with long ears."
  • Round 2: Instead of asking a pre-written question like "What do rabbits eat?", DepthCharge listens to the AI. It hears "long ears" and asks, "How do those ears help the rabbit survive?"
  • Round 3: If the AI mentions "hearing predators," the next question is, "What specific frequencies can they hear?"

The test adapts to what the AI actually says. If the AI talks about a specific medical drug, the test drills down into that drug's chemical structure. If the AI talks about a historical battle, the test asks about the specific generals involved.
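The loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `toy_model` is a scripted stand-in for the model under test, and `extract_focus` is a deliberately naive stand-in for whatever the framework uses to pick a phrase to drill into.

```python
def toy_model(question: str) -> str:
    """Scripted stand-in for an LLM under test (hypothetical answers)."""
    canned = {
        "What is a rabbit?": "It's a mammal with long ears.",
        "Tell me more about 'long ears'.": "They help it hear predators.",
        "Tell me more about 'hear predators'.": "I'm not sure.",
    }
    return canned.get(question, "I'm not sure.")

def extract_focus(answer: str) -> str:
    """Pick a phrase from the answer to drill into (very naive heuristic)."""
    if "long ears" in answer:
        return "long ears"
    if "predators" in answer:
        return "hear predators"
    return ""

def adaptive_drill(seed_question: str, max_depth: int = 5):
    """Ask follow-ups built from the model's own previous answer."""
    transcript = []
    question = seed_question
    for _ in range(max_depth):
        answer = toy_model(question)
        transcript.append((question, answer))
        focus = extract_focus(answer)
        if not focus or "not sure" in answer:
            break  # path dies: nothing left to drill into
        question = f"Tell me more about '{focus}'."
    return transcript

dialogue = adaptive_drill("What is a rabbit?")
for q, a in dialogue:
    print(q, "->", a)
```

The key design point is that the next question is derived from the previous answer, so two models given the same seed question can end up on completely different conversational paths.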

2. The Fact-Check (The "Truth Detective")

At every step, the system has a Truth Detective (a separate AI or search tool) that knows the real answer.

  • The AI gives an answer.
  • The Detective checks authoritative sources (like Wikipedia, medical journals, or legal codes) to see if the AI is telling the truth.
  • If the AI is wrong, that "path" of the conversation dies.
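A minimal sketch of that pruning decision, assuming the "detective" can be reduced to checking each answer against a reference fact. Here the `REFERENCE` dict is a stand-in for real sources (encyclopedias, journals, legal codes), and the substring check is a placeholder for whatever matching the real verifier does.

```python
# Hypothetical reference facts the verifier would fetch from real sources.
REFERENCE = {
    "What is a rabbit?": "mammal",
    "How do its ears help it survive?": "hearing",
}

def verify(question: str, answer: str) -> bool:
    """A path stays alive only while each answer matches the reference."""
    key_fact = REFERENCE.get(question)
    return key_fact is not None and key_fact in answer.lower()

print(verify("What is a rabbit?", "A rabbit is a small mammal."))  # True
print(verify("What is a rabbit?", "A rabbit is a type of bird."))  # False
```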

3. The Survival Rate (The "Lifeboat" Analogy)

This is the most important part. The paper borrows a concept from survival statistics, the branch of statistics used to model how long patients or machines last before failing.

Imagine you have 30 people (30 different conversation paths) on a boat.

  • Depth 1: You ask a simple question. 25 people answer correctly. 5 people fall off the boat.
  • Depth 2: You ask a harder question only to the 25 survivors. Maybe 15 get it right, and 10 fall off.
  • Depth 3: You ask an expert-level question to the 15 survivors.

The paper calculates a score called Expected Valid Depth (EVD). It's not just "how many did they get right?" It's "how deep could they go before they completely lost their way?"

If an AI gets 90% right on easy questions but fails 100% of the hard follow-ups, its "depth" score is very low. It's like a person who can swim to the surface but can't dive.

What Did They Find?

The researchers tested this on five different AI models across four totally different worlds: Medicine, Law, Ancient History, and Quantum Physics.

Here are the surprising results:

  1. No "Super AI" Exists: The AI that was the best at Medicine was not the best at Law. The one that was great at History was terrible at Physics. You can't just pick the "smartest" AI; you have to pick the one smartest for your specific job.
  2. Expensive Doesn't Mean Deep: The most expensive AI models didn't always go the deepest. Sometimes, a cheaper model knew more about a specific topic than a fancy, expensive one.
  3. The "Surface Illusion" is Real: All the models looked great on standard tests (scoring 80-90% on shallow questions). But when DepthCharge pushed them down, their scores dropped dramatically. Some models could only dive 3 levels deep; others could go 7 levels deep. That's a huge difference in reliability.

Why Does This Matter?

If you are a hospital using AI to help doctors, or a law firm using it to research cases, you don't want an AI that sounds confident but fails when you ask a detailed follow-up question. That could be dangerous.

DepthCharge is like a stress test for AI. It doesn't just ask, "Do you know this?" It asks, "How deep does your knowledge go, and can you handle the pressure when I keep asking 'Why?' and 'How?'"

The Bottom Line

The paper teaches us that depth is different from breadth.

  • Breadth is knowing a little bit about everything (like a trivia champion).
  • Depth is knowing a lot about one thing and being able to explain it under scrutiny (like a specialist).

DepthCharge gives us a way to measure that depth, ensuring that when we trust an AI with important decisions, we know exactly how deep its knowledge really goes.
