Imagine you have a blueprint of a house. To an architect, it's a clear map of rooms, doors, and furniture. But to a computer, it's just a bunch of black lines on a white background. It doesn't "see" a kitchen or a bedroom; it just sees shapes.
The problem this paper tackles is: How do we teach a computer to look at a blueprint and write a beautiful, descriptive paragraph about it, like a real estate agent would?
The authors, Shreya, Chiranjoy, and Gaurav, realized that just saying "This is a house" isn't enough. You need to say, "This is a cozy two-bedroom apartment with a sunlit kitchen and a spacious living room." They built two different "AI writers" to do this job.
Here is the breakdown of their work using simple analogies:
The Challenge: The "Blank Canvas" Problem
Most AI that describes photos (like a picture of a cat) works great because every pixel has color and texture. But floor plans are like stick-figure drawings. They lack the "clues" (colors, shadows, textures) that normal photo-AI relies on. If you just ask an AI to look at a blueprint and guess what it is, it often gets confused or gives a very robotic, boring answer.
The Two Solutions: Two Different Ways to Write
The team built two models to solve this. Think of them as two different types of writers.
1. DSIC: The "Visual Detective"
- How it works: This model looks only at the picture. It scans the blueprint, finds the shapes (like a rectangle for a bed or a circle for a table), and tries to guess what they are. It then uses a "hierarchical" brain (like a manager and a worker) to stitch those guesses into sentences.
- The Analogy: Imagine a detective who is blindfolded and only allowed to feel the shape of an object with their hands. They have to guess what it is based purely on its outline.
- The Flaw: If the blueprint is drawn in a weird style or has a symbol the detective hasn't seen before, they get stuck. They might describe a "bedroom" as a "living room" because the shapes look similar. It's rigid and lacks flexibility.
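The "manager and worker" idea can be sketched with a toy example. This is not the authors' code: DSIC uses learned neural decoders, while here plain Python functions and made-up detections stand in. The detection list, the priority policy, and the sentence template are all illustrative assumptions.

```python
# Toy sketch of a hierarchical "manager and worker" captioner.
# Hypothetical detections: (shape, guessed_label) pairs from a blueprint.
detections = [("large rectangle", "living room"),
              ("rectangle with bed symbol", "bedroom"),
              ("small rectangle", "bathroom")]

def manager(detections):
    """Top level: decide the order in which regions get described."""
    # Toy policy: describe public rooms before private ones.
    priority = {"living room": 0, "kitchen": 1, "bedroom": 2, "bathroom": 3}
    return sorted(detections, key=lambda d: priority.get(d[1], 99))

def worker(shape, label):
    """Bottom level: turn one region's guess into a sentence."""
    return f"A {shape} suggests this is the {label}."

paragraph = " ".join(worker(s, l) for s, l in manager(detections))
print(paragraph)
```

Note how everything hinges on the guessed labels: feed the worker a wrong guess ("living room" for a bedroom) and the sentence is confidently wrong, which is exactly the detective's flaw.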
2. TBDG: The "Knowledgeable Architect" (The Winner)
- How it works: This model is smarter. It doesn't just look at the lines; it also uses text clues. Before writing the final paragraph, it first generates short, simple captions for specific parts of the room (e.g., "Here is a kitchen," "Here are stairs"). It then feeds these text clues into a powerful "Transformer" engine (the same tech behind modern chatbots) to weave them into a full story.
- The Analogy: Imagine an architect who has a cheat sheet. Before describing the house, they first jot down notes: "Kitchen here, bathroom there." Then, they use those notes to write a rich, flowing description. They aren't just guessing from the lines; they are using "word hints" to guide their writing.
- Why it's better: Because it uses these text hints, it understands the context. If the lines are ambiguous, the text clues help it figure out, "Ah, this must be a bathroom because the caption said 'toilet'." It's much more robust and flexible.
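The two-stage idea can be sketched as follows. Stage 1 produces the short "word hints"; stage 2 weaves them into a paragraph. In the paper, stage 2 is a Transformer; here a simple rule-based joiner stands in for it, and the caption list and hint table are invented for illustration.

```python
# Toy sketch of a caption-then-fuse pipeline (not the paper's code).
region_captions = ["here is a kitchen", "here are stairs", "here is a toilet"]

ROOM_HINTS = {"toilet": "bathroom"}  # text clue -> resolved room label

def resolve(caption):
    """Use a word hint to disambiguate what the lines alone can't."""
    for clue, room in ROOM_HINTS.items():
        if clue in caption:
            return f"a {room} (inferred from the word '{clue}')"
    return caption.replace("here is ", "").replace("here are ", "")

def fuse(captions):
    """Stand-in for the Transformer: weave the clues into one sentence."""
    parts = [resolve(c) for c in captions]
    return "This home features " + ", ".join(parts[:-1]) + ", and " + parts[-1] + "."

print(fuse(region_captions))
```

Even in this toy version, the "toilet" hint resolves an ambiguous region into a bathroom, which is the robustness the text clues buy.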
The "Training Camp" (The Dataset)
To teach these AI writers, the authors used a massive library called BRIDGE.
- Think of this as a giant library containing 13,000 blueprints.
- Crucially, every blueprint came with a "teacher's note" (a human-written paragraph describing it).
- The AI practiced by looking at the blueprint, trying to write a description, and then checking its work against the teacher's note to see how close it got.
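The practice loop above can be sketched with a toy scoring function. Real training optimizes a neural network with gradient descent against the reference paragraphs; here, simple word overlap stands in for that score, and both "attempts" are invented examples.

```python
# Toy sketch of checking an attempt against the teacher's note.
def word_overlap(prediction, reference):
    """Fraction of the reference's words the model's attempt recovered."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    return len(pred & ref) / len(ref)

teacher_note = "a cozy two bedroom apartment with a sunlit kitchen"
attempt_v1 = "a house with rooms"
attempt_v2 = "a cozy apartment with a kitchen and two bedroom suites"

# The second attempt matches the teacher's note far more closely,
# so training nudges the model toward writing like that.
print(word_overlap(attempt_v1, teacher_note), word_overlap(attempt_v2, teacher_note))
```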
The Results: Who Won?
The authors tested their models against other methods (like old-school templates or simple language bots).
- The Old Way (Templates): Like a "Mad Libs" game. "The [Room Type] has a [Furniture]." It's boring and repetitive.
- The Simple AI (LSTM/GRU): These are like students who memorized a dictionary but have never seen a house. They can write grammatically correct sentences, but they might say, "The kitchen has a dragon," because they predict words that commonly go together instead of grounding them in the drawing.
- The New Winners (DSIC & TBDG):
- DSIC was good at describing what it saw, but sometimes got the details wrong if the drawing was tricky.
- TBDG was the champion. It produced descriptions that sounded the most human. It could talk about specific details like "a walk-in closet" or "a porch," which the other models missed. It felt like a real person was describing the home.
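The "Mad Libs" template baseline is easy to see in code. These are illustrative templates, not the exact ones from the paper, but the point carries: every sentence follows one rigid pattern.

```python
# Toy template-based description generator ("Mad Libs" style).
TEMPLATE = "The {room} has a {furniture}."

rooms = [("kitchen", "sink"), ("bedroom", "bed"), ("living room", "sofa")]

description = " ".join(TEMPLATE.format(room=r, furniture=f) for r, f in rooms)
print(description)
# Every sentence has identical structure -- no "sunlit", no "cozy".
```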
The Bottom Line
This paper is about teaching computers to stop just "seeing" lines and start "understanding" spaces.
- DSIC is like a student who studies hard but relies only on what they can see.
- TBDG is like a student who studies hard and uses a textbook to double-check their work.
The result? We can now take a boring, black-and-white floor plan and instantly generate a warm, inviting description that could power real estate listings, help robots navigate homes, or assist architects in designing better spaces. The "Knowledge-Driven" approach (TBDG) proved that giving the AI a little bit of text help makes it a much better writer.