Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models

This paper addresses the lack of specialized dental datasets by proposing a framework that uses Vision-Language Models with guided prompts to generate high-quality, holistic captions for single-tooth RGB images, thereby enabling more comprehensive dental image analysis.

Anastasiia Sukhanova, Aiden Taylor, Julian Myers, Zichun Wang, Kartha Veerya Jammuladinne, Satya Sri Rajiteswari Nimmagadda, Aniruddha Maiti, Ananya Jana

Published 2026-03-10

Imagine you have a giant library of dental photos, but the books have no titles, no summaries, and no descriptions. They are just pictures of teeth. If you wanted to teach a computer to understand dentistry, you'd need to tell it what it's looking at: "That's a molar," "That one has a cavity," "This is the chewing surface."

Right now, most dental AI is like a specialist who only knows how to do one thing perfectly, like a mechanic who can only change tires but doesn't know how to fix the engine. It can find a cavity, but it can't write a full report on the tooth's health.

This paper is about building a smart assistant that can look at a single tooth photo and write a detailed, professional description of it, just like a human dentist would.

Here is the story of how they did it, explained simply:

1. The Problem: The "Blurry Group Photo"

The researchers found that existing dental photo collections had two big problems:

  • The Group Shot Issue: Most photos showed the whole mouth, but the descriptions only talked about the front teeth or just said "gingivitis." It was like taking a photo of a whole football team but only writing a caption about the goalie's shoes. The back teeth (molars) were often hidden or ignored.
  • The "One-Note" Issue: Existing descriptions were too simple. They didn't say which tooth it was, what part of the tooth was visible, or exactly what was wrong with it.

They needed a way to turn these messy, unlabelled photos into a library where every single tooth has its own detailed biography.

2. The Solution: The "AI Intern" with a Checklist

Instead of hiring a human to write thousands of descriptions (which would take forever and cost a fortune), they used a powerful AI called GPT-4o. Think of this AI as a very smart, fast, but slightly inexperienced "AI Intern."

If you just hand the Intern a photo and say, "Describe this," it might guess wrong or miss important details. So, the researchers invented a Two-Step Prompting Strategy (a fancy way of saying "giving the AI a better checklist").

  • Step 1: The Rough Draft. They asked the AI to look at the photo and write a basic description.
  • Step 2: The Editor. They looked at the mistakes the AI made (like confusing a canine tooth for a front tooth) and gave it a better set of instructions. They said, "Hey, don't just guess! Look specifically for the tooth number, the surface (front, back, or top), and any diseases like cavities or stains."
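The refined prompt in Step 2 is essentially a structured checklist handed to the model. Here is a minimal Python sketch of how such a two-step prompt could be assembled; the checklist items and wording are illustrative assumptions, not the paper's actual prompt text:

```python
def build_caption_prompt(refined=True):
    """Assemble an instruction for a vision-language model such as GPT-4o.

    The checklist below is a hypothetical example of the kinds of details
    the researchers asked for; the paper's exact wording is not reproduced here.
    """
    base = "Describe this single-tooth photo."
    if not refined:
        # Step 1: the "rough draft" prompt with no extra guidance
        return base
    # Step 2: the "editor" prompt that spells out exactly what to look for
    checklist = [
        "the tooth type and likely position (e.g. molar, incisor, canine)",
        "which surface is visible (front, back, or chewing surface)",
        "any visible conditions (cavities, stains, chips, gum inflammation)",
    ]
    return (
        base
        + " Report: "
        + "; ".join(checklist)
        + ". Describe only what is clearly visible; do not guess."
    )
```

In practice, the returned string would be sent to the model alongside the cropped tooth image; the point of the sketch is simply that "better instructions" here means an explicit, itemized checklist rather than an open-ended request.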

This is like teaching a child to draw. First, they draw a stick figure. Then you say, "Okay, now add eyes, a nose, and make sure the arms are on the sides." The second instruction makes the drawing much better.

3. The Process: Cleaning the "Raw Ingredients"

Before the AI could write, the researchers had to prepare the "ingredients" (the photos):

  • The Filter: They took photos from public websites. Some were blurry, some were too dark, and some showed the whole mouth. They threw out the bad ones.
  • The Cutter: They used a computer program to "crop" the photos. If a photo showed a whole mouth, the program cut it out so only one single tooth was left in the frame.
  • The Masking: They made sure the file names didn't give away the answers (like naming a file "cavity.jpg"). They wanted the AI to learn by looking, not by reading the filename.
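Two of these preparation steps, filtering out unusable photos and masking label-leaking file names, can be sketched in a few lines of Python. This is an illustrative reconstruction under assumed details (the brightness threshold and the naming scheme are my inventions), not the authors' actual pipeline:

```python
import hashlib
import statistics


def is_too_dark(gray_pixels, threshold=40):
    """Crude quality filter: reject an image whose mean grayscale
    brightness (0-255) falls below a threshold.
    The threshold value is an assumption for illustration."""
    return statistics.mean(gray_pixels) < threshold


def mask_filename(original_name, index):
    """Replace a label-leaking name like 'cavity.jpg' with an opaque ID,
    so a model can't cheat by reading the answer from the file name."""
    digest = hashlib.sha1(f"{original_name}-{index}".encode()).hexdigest()[:8]
    return f"tooth_{digest}.jpg"
```

The masking step matters because a vision-language model evaluated on files named after their diagnosis would be graded on reading, not on looking.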

4. The Results: A New "Dental Dictionary"

After running their two-step process, they ended up with 1,520 high-quality photos of single teeth, each with a new, detailed caption.

What worked well?
The AI became surprisingly good at spotting the obvious stuff. It could tell the difference between a molar (the big back tooth) and an incisor (the front tooth). It could spot cavities, stains, and broken edges. It was like a detective who is great at finding big clues.

Where did it struggle?

  • The "Invisible" Clues: The AI sometimes missed subtle gum inflammation (gingivitis). It's like trying to spot a faint bruise on a person wearing a thick sweater; the AI couldn't see the gum clearly enough.
  • The "Baby Tooth" Confusion: It got confused by children's teeth, which look different from adult teeth.
  • The "Angle" Problem: Sometimes, if a tooth was turned sideways, the AI thought it was a different type of tooth.

5. Why This Matters

Why go through all this trouble?
Imagine you want to build a super-dentist AI that can diagnose any problem in a mouth. To train that super-AI, you need thousands of examples where the computer knows exactly what it's looking at.

This paper provides the training manual. By using this framework, they created a massive dataset of "single-tooth stories" without needing humans to write every single one.

The Big Takeaway:
They proved that you don't need a specialized, expensive dental AI to start. You can use a general smart AI (like GPT-4o), give it the right instructions (prompts), and it can learn to describe teeth almost as well as a human. This paves the way for future AI tools that can help dentists diagnose problems faster and more accurately, one tooth at a time.