DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

The paper introduces DepthCharge, a domain-agnostic framework that measures the depth-dependent knowledge of Large Language Models through adaptive probing and on-demand fact verification, revealing significant performance variations across domains that standard benchmarks fail to capture.

Alexander Sheppert

Published 2026-03-26

Imagine you are hiring a new employee for a high-stakes job, like a surgeon or a lawyer. You ask them, "What is the flu?" They answer perfectly. You ask, "What are the symptoms?" They nail it again. You feel confident.

But then, you ask a follow-up: "Okay, you mentioned antiviral drugs. Which specific mutation makes the virus resistant to Tamiflu?" Suddenly, they stumble. They start guessing. They might even make something up to sound smart.

This is the problem the paper DepthCharge is trying to solve.

The Problem: The "Surface Competence" Illusion

Current benchmarks for Large Language Models are like a multiple-choice pop quiz. They ask the AI 100 different questions about 100 different topics. The AI gets 85% right. We say, "Great job! This AI is smart."

But this is an illusion. It's like judging a swimmer by how well they can float on the surface. They look great, but if you push them down 10 feet, do they panic? Do they drown?

Standard tests don't push the AI down. They just ask shallow questions. The paper argues that in real life (medicine, law, engineering), we don't just need surface answers; we need deep, reliable knowledge that holds up under pressure.

The Solution: DepthCharge

The authors created a new testing framework called DepthCharge. Think of it like a submarine diving into the ocean.

Instead of asking random questions, DepthCharge dives deeper and deeper into a specific topic, but with a twist: it follows the AI's own words.

Here is how it works, using a simple metaphor:

1. The Adaptive Drill (The "Follow the Rabbit" Strategy)

Imagine the AI is a rabbit, and you are digging a hole.

  • Round 1: You ask, "What is a rabbit?" The AI says, "It's a mammal with long ears."
  • Round 2: Instead of asking a pre-written question like "What do rabbits eat?", DepthCharge listens to the AI. It hears "long ears" and asks, "How do those ears help the rabbit survive?"
  • Round 3: If the AI mentions "hearing predators," the next question is, "What specific frequencies can they hear?"

The test adapts to what the AI actually says. If the AI talks about a specific medical drug, the test drills down into that drug's chemical structure. If the AI talks about a historical battle, the test asks about the specific generals involved.
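The loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `toy_model` is a scripted stand-in for the model under test, and `extract_focus` is a deliberately naive stand-in for whatever the framework uses to pick a phrase to drill into.

```python
def toy_model(question: str) -> str:
    """Scripted stand-in for an LLM under test (hypothetical answers)."""
    canned = {
        "What is a rabbit?": "It's a mammal with long ears.",
        "Tell me more about 'long ears'.": "They help it hear predators.",
        "Tell me more about 'hear predators'.": "I'm not sure.",
    }
    return canned.get(question, "I'm not sure.")

def extract_focus(answer: str) -> str:
    """Pick a phrase from the answer to drill into (very naive heuristic)."""
    if "long ears" in answer:
        return "long ears"
    if "predators" in answer:
        return "hear predators"
    return ""

def adaptive_drill(seed_question: str, max_depth: int = 5):
    """Ask follow-ups built from the model's own previous answer."""
    transcript = []
    question = seed_question
    for _ in range(max_depth):
        answer = toy_model(question)
        transcript.append((question, answer))
        focus = extract_focus(answer)
        if not focus or "not sure" in answer:
            break  # path dies: nothing left to drill into
        question = f"Tell me more about '{focus}'."
    return transcript

dialogue = adaptive_drill("What is a rabbit?")
for q, a in dialogue:
    print(q, "->", a)
```

The key design point is that the next question is derived from the previous answer, so two models given the same seed question can end up on completely different conversational paths.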

2. The Fact-Check (The "Truth Detective")

At every step, the system has a Truth Detective (a separate AI or search tool) that knows the real answer.

  • The AI gives an answer.
  • The Detective checks authoritative sources (like Wikipedia, medical journals, or legal codes) to see if the AI is telling the truth.
  • If the AI is wrong, that "path" of the conversation dies.
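A minimal sketch of that pruning decision, assuming the "detective" can be reduced to checking each answer against a reference fact. Here the `REFERENCE` dict is a stand-in for real sources (encyclopedias, journals, legal codes), and the substring check is a placeholder for whatever matching the real verifier does.

```python
# Hypothetical reference facts the verifier would fetch from real sources.
REFERENCE = {
    "What is a rabbit?": "mammal",
    "How do its ears help it survive?": "hearing",
}

def verify(question: str, answer: str) -> bool:
    """A path stays alive only while each answer matches the reference."""
    key_fact = REFERENCE.get(question)
    return key_fact is not None and key_fact in answer.lower()

print(verify("What is a rabbit?", "A rabbit is a small mammal."))  # True
print(verify("What is a rabbit?", "A rabbit is a type of bird."))  # False
```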

3. The Survival Rate (The "Lifeboat" Analogy)

This is the most important part. The paper borrows a concept from survival statistics, the branch of statistics used to model how long patients or machines last before failing.

Imagine you have 30 people (30 different conversation paths) on a boat.

  • Depth 1: You ask a simple question. 25 people answer correctly. 5 people fall off the boat.
  • Depth 2: You ask a harder question only to the 25 survivors. Maybe 15 get it right, and 10 fall off.
  • Depth 3: You ask an expert-level question to the 15 survivors.

The paper calculates a score called Expected Valid Depth (EVD). It's not just "how many did they get right?" It's "how deep could they go before they completely lost their way?"

If an AI gets 90% right on easy questions but fails 100% of the hard follow-ups, its "depth" score is very low. It's like a person who can swim to the surface but can't dive.

What Did They Find?

The researchers tested this on five different AI models across four totally different worlds: Medicine, Law, Ancient History, and Quantum Physics.

Here are the surprising results:

  1. No "Super AI" Exists: The AI that was the best at Medicine was not the best at Law. The one that was great at History was terrible at Physics. You can't just pick the "smartest" AI; you have to pick the one smartest for your specific job.
  2. Expensive Doesn't Mean Deep: The most expensive AI models didn't always go the deepest. Sometimes, a cheaper model knew more about a specific topic than a fancy, expensive one.
  3. The "Surface Illusion" is Real: All the models looked great on standard tests (scoring 80-90% on shallow questions). But when DepthCharge pushed them down, their scores dropped dramatically. Some models could only dive 3 levels deep; others could go 7 levels deep. That's a huge difference in reliability.

Why Does This Matter?

If you are a hospital using AI to help doctors, or a law firm using it to research cases, you don't want an AI that sounds confident but fails when you ask a detailed follow-up question. That could be dangerous.

DepthCharge is like a stress test for AI. It doesn't just ask, "Do you know this?" It asks, "How deep does your knowledge go, and can you handle the pressure when I keep asking 'Why?' and 'How?'"

The Bottom Line

The paper teaches us that depth is different from breadth.

  • Breadth is knowing a little bit about everything (like a trivia champion).
  • Depth is knowing a lot about one thing and being able to explain it under scrutiny (like a specialist).

DepthCharge gives us a way to measure that depth, ensuring that when we trust an AI with important decisions, we know exactly how deep its knowledge really goes.
