VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

The paper proposes VLCE, a knowledge-enhanced framework that integrates external semantic knowledge from ConceptNet and WordNet into a two-stage vision-language pipeline to generate more accurate, domain-specific, and actionable image descriptions for disaster assessment, outperforming general-purpose models on satellite and UAV benchmarks.

Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George

Published 2026-03-11

Imagine you are a disaster relief worker rushing to a town hit by a hurricane. You pull up a satellite photo or a drone video on your tablet, hoping for a quick summary of what happened.

You ask a standard AI (like the ones that describe your vacation photos) to tell you what it sees. It says: "I see a picture of houses and trees. Some look broken."

That's true, but it's not helpful enough. You need to know: "The roof of the community center is gone, the main road is blocked by a wall of fallen trees, and there's standing water in the lower district."

This paper introduces VLCE (Vision-Language Caption Enhancer), a new system designed to turn that generic AI description into a professional, life-saving report.

Here is how it works, explained with simple analogies:

1. The Problem: The "Tourist" vs. The "Expert"

Think of standard AI models (like LLaVA or QwenVL) as enthusiastic tourists. They have seen millions of photos of cats, beaches, and cars. When they see a disaster, they describe it using the words they know: "house," "tree," "broken."

But disaster response needs experts. An expert knows the difference between "a broken tree" and "a structural collapse with debris fields." The tourist AI lacks the specific vocabulary and the "common sense" about how disasters actually behave.

2. The Solution: The "Two-Step Interview"

VLCE acts like a smart editor who interviews the tourist AI and then checks their notes against a massive encyclopedia. It works in two stages:

Step 1: The First Draft (The Tourist)
First, the system asks a standard AI to look at the image and write a quick description. It also uses an object detector (YOLOv8, which acts like a security guard) to point out exactly where things are (e.g., "There is a car here," "There is a building there").
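
In code, this first stage boils down to stitching the generic caption together with the detector's localized findings. Here is a minimal, illustrative sketch: `base_caption` stands in for the output of a model like LLaVA or QwenVL, and `Detection` stands in for YOLOv8 boxes. These names and the prompt format are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str   # object class, e.g. "building"
    box: tuple   # (x1, y1, x2, y2) pixel coordinates

def draft_caption(base_caption: str, detections: list) -> str:
    """Combine a generic vision-language caption with detector output
    into a single first-draft description (hypothetical format)."""
    if not detections:
        return base_caption
    located = ", ".join(f"{d.label} at {d.box}" for d in detections)
    return f"{base_caption} Detected objects: {located}."

draft = draft_caption(
    "Houses and trees; some look broken.",
    [Detection("building", (40, 60, 210, 180)),
     Detection("fallen tree", (0, 150, 90, 220))],
)
print(draft)
```

The point of the merge is that the expert-review stage (below) gets not just "what" the tourist AI saw, but "where" each object sits in the frame.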

Step 2: The Expert Review (The Librarian)
This is where VLCE shines. It takes that first draft and runs it through a Knowledge Graph (think of this as a giant, interconnected library of facts).

  • If the draft says "broken house," the system checks the library.
  • The library tells it: "In a hurricane context, 'broken house' usually means 'structural damage,' 'debris field,' or 'roof failure.'"
  • The system swaps the simple words for the expert terms.
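
The swap itself can be sketched as a lookup-and-replace over a relation table. The real framework queries ConceptNet and WordNet; the hand-written `DISASTER_KG` dictionary below is a toy stand-in for those lookups, with made-up entries chosen to mirror the "broken house" example above.

```python
# Toy stand-in for ConceptNet/WordNet lookups: generic phrase -> expert terms.
# The entries are illustrative, not taken from the actual knowledge graphs.
DISASTER_KG = {
    "broken house": ["structural damage", "roof failure"],
    "fallen trees": ["debris field", "blocked roadway"],
}

def enrich(draft: str, kg: dict) -> str:
    """Replace generic phrases in the draft with domain-specific terms."""
    out = draft
    for generic, expert_terms in kg.items():
        if generic in out:
            out = out.replace(generic, " and ".join(expert_terms))
    return out

enriched = enrich("A broken house next to fallen trees.", DISASTER_KG)
print(enriched)
```

A production version would resolve phrases against the graph at query time (and weight candidates by disaster context) rather than using a fixed table, but the flow is the same: draft in, expert vocabulary out.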

3. The Magic Ingredients

To make this work, the researchers gave the AI two special tools:

  • ConceptNet & WordNet: These are like dictionaries that don't just list synonyms, but explain how things relate. They teach the AI that "hurricane" is related to "wind," "flooding," and "evacuation," even if those words aren't in the picture.
  • Two Different Brains: They tested two ways to process this info:
    • The CNN-LSTM: Like a careful accountant who adds up visual clues and text clues one by one.
    • The Transformer: Like a detective who looks at the whole picture at once, connecting the dots between a fallen tree and a blocked road instantly.
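
The accountant-vs-detective contrast can be made concrete with toy numbers. In the sketch below, an LSTM-like update folds cue scores into a running state one at a time (so order matters), while an attention-style pass weights all cues jointly via a softmax (so the strongest cues dominate regardless of order). The cue names and scores are invented for illustration; neither loop is the paper's actual architecture.

```python
import math

# Toy relevance scores for three cues in one image (illustrative numbers).
cues = {"fallen tree": 0.9, "blocked road": 0.8, "intact fence": 0.1}

# CNN-LSTM style: fold cues in sequentially, each step blending the new
# cue into a running state (a crude stand-in for a recurrent update).
state = 0.0
for score in cues.values():
    state = 0.5 * state + 0.5 * score

# Transformer style: attend to every cue at once; softmax weights let the
# strong cues ("fallen tree", "blocked road") dominate jointly.
exps = {k: math.exp(v) for k, v in cues.items()}
total = sum(exps.values())
attn = {k: e / total for k, e in exps.items()}
context = sum(attn[k] * cues[k] for k in cues)

print(f"sequential state: {state:.3f}, attention context: {context:.3f}")
```

Note how the sequential state is dragged down by whichever cue arrives last, while the attention context reflects all cues at once; that order-independence is the intuition behind the detective analogy.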

4. The Results: From "Meh" to "Mission Critical"

The researchers tested this on two types of disaster photos:

  • Satellite Photos (xBD): Taken from high up, these images show less fine detail, so generic descriptions go further. The new system still helped, but the standard AI was already passable.
  • Drone Photos (RescueNet): Taken from low angles, showing tiny details like cracked walls and scattered trash. This is where the magic happened.

The "Without Knowledge" Disaster:
Without the library (Knowledge Graph), the AI started hallucinating. It would say things like "There are five dead animals" (when there were none) or repeat the same sentence three times. It was like a tourist who got nervous and started making things up.

The "With Knowledge" Success:
With the library, the AI became a pro.

  • On Drone photos: The new system was preferred 95% of the time over the standard AI.
  • The Difference: Instead of "I see a mess," it said, "The image shows the aftermath of Hurricane Michael with debris fields blocking roadways and structural damage to residential roofs."

The Bottom Line

VLCE is like giving a general-purpose AI a specialized field guide for disasters. It stops the AI from guessing and forces it to use the right words for the right situation.

In a real disaster, when every second counts, you don't want an AI that sounds like a confused tourist. You want one that sounds like a seasoned rescue worker. VLCE bridges that gap, turning simple image descriptions into actionable intelligence that can save lives.