Knowledge-Driven Description Synthesis for Floor Plan Interpretation

This paper proposes two deep learning-based models, DSIC and TBDG, that generate flexible and robust textual descriptions from floor plan images, overcoming the rigid, template-bound output of earlier methods; both are validated through experiments on a large-scale dataset.

Shreya Goyal, Chiranjoy Chattopadhyay, Gaurav Bhatnagar

Published 2026-02-20

Imagine you have a blueprint of a house. To an architect, it's a clear map of rooms, doors, and furniture. But to a computer, it's just a bunch of black lines on a white background. It doesn't "see" a kitchen or a bedroom; it just sees shapes.

The problem this paper tackles is: How do we teach a computer to look at a blueprint and write a beautiful, descriptive paragraph about it, like a real estate agent would?

The authors, Goyal, Chattopadhyay, and Bhatnagar, realized that just saying "This is a house" isn't enough. You need to say, "This is a cozy two-bedroom apartment with a sunlit kitchen and a spacious living room." They built two different "AI writers" to do this job.

Here is the breakdown of their work using simple analogies:

The Challenge: The "Blank Canvas" Problem

Most AI that describes photos (like a picture of a cat) works great because every pixel has color and texture. But floor plans are like stick-figure drawings. They lack the "clues" (colors, shadows, textures) that normal photo-AI relies on. If you just ask an AI to look at a blueprint and guess what it is, it often gets confused or gives a very robotic, boring answer.

The Two Solutions: Two Different Ways to Write

The team built two models to solve this. Think of them as two different types of writers.

1. DSIC: The "Visual Detective"

  • How it works: This model looks only at the picture. It scans the blueprint, finds the shapes (like a rectangle for a bed or a circle for a table), and tries to guess what they are. It then uses a "hierarchical" brain (like a manager and a worker) to stitch those guesses into sentences.
  • The Analogy: Imagine a detective who is blindfolded and only allowed to feel the shape of an object with their hands. They have to guess what it is based purely on its outline.
  • The Flaw: If the blueprint is drawn in a weird style or has a symbol the detective hasn't seen before, they get stuck. They might describe a "bedroom" as a "living room" because the shapes look similar. It's rigid and lacks flexibility.
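
The manager/worker idea above can be sketched in a few lines. This is purely illustrative control flow, not the trained model: the detections, room names, and sentence templates are all hypothetical stand-ins for what DSIC's learned components would produce.

```python
# Toy sketch of DSIC's vision-only pipeline (illustrative control flow,
# not the trained model): detect symbols in the image, map them to room
# labels, then let a "manager" pick topics and a "worker" emit sentences.

# Hypothetical detections that a symbol-spotting stage might return.
detections = [
    {"room": "bedroom", "symbols": ["bed", "wardrobe"]},
    {"room": "kitchen", "symbols": ["sink", "stove"]},
]

def manager(detections):
    """Sentence-level step: choose one topic (room) per sentence."""
    for region in detections:
        yield region

def worker(region):
    """Word-level step: expand a topic into a concrete sentence."""
    items = " and a ".join(region["symbols"])
    return f"The {region['room']} contains a {items}."

description = " ".join(worker(r) for r in manager(detections))
print(description)
```

In the real model, both levels are learned networks rather than templates, which is exactly why an unfamiliar drawing style can throw the whole chain off.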

2. TBDG: The "Knowledgeable Architect" (The Winner)

  • How it works: This model is smarter. It doesn't just look at the lines; it also uses text clues. Before writing the final paragraph, it first generates short, simple captions for specific parts of the room (e.g., "Here is a kitchen," "Here are stairs"). It then feeds these text clues into a powerful "Transformer" engine (the same tech behind modern chatbots) to weave them into a full story.
  • The Analogy: Imagine an architect who has a cheat sheet. Before describing the house, they first jot down notes: "Kitchen here, bathroom there." Then, they use those notes to write a rich, flowing description. They aren't just guessing from the lines; they are using "word hints" to guide their writing.
  • Why it's better: Because it uses these text hints, it understands the context. If the lines are ambiguous, the text clues help it figure out, "Ah, this must be a bathroom because the caption said 'toilet'." It's much more robust and flexible.
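
The two-stage flow can be sketched as follows. This is a hedged toy, not the paper's implementation: stage 2 in TBDG is a learned Transformer, while here a simple template stands in, and the region list and caption wording are invented for illustration.

```python
# Toy sketch of TBDG's two-stage idea: stage 1 produces short region
# captions from the image; stage 2 treats those captions as text clues
# and weaves them into one flowing paragraph. In the paper, stage 2 is
# a learned Transformer; here a template stands in.

def stage1_region_captions(regions):
    """Pretend caption generator: one short caption per detected region."""
    return [f"here is a {room}" for room in regions]

def stage2_weave(captions):
    """Stand-in for the Transformer: merge text clues into a paragraph."""
    rooms = [c.replace("here is a ", "") for c in captions]
    head, last = rooms[:-1], rooms[-1]
    return f"This home offers a {', a '.join(head)} and a {last}."

clues = stage1_region_captions(["kitchen", "bathroom", "porch"])
print(stage2_weave(clues))
```

The key design point survives even in the toy: the final writer never touches raw pixels, only the intermediate text clues, which is what makes the approach robust to ambiguous line drawings.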

The "Training Camp" (The Dataset)

To teach these AI writers, the authors used a massive library called BRIDGE.

  • Think of this as a giant library containing 13,000 blueprints.
  • Crucially, every blueprint came with a "teacher's note" (a human-written paragraph describing it).
  • The AI practiced by looking at the blueprint, trying to write a description, and then checking its work against the teacher's note to see how close it got.
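
The practice loop above can be sketched like this. Everything here is a hypothetical stand-in: real training on BRIDGE minimizes a token-level loss with gradient updates, whereas this toy uses a fixed "model" and a simple missed-words score just to show the shape of the loop.

```python
# Toy sketch of the supervised training signal: for each (blueprint,
# teacher's note) pair, the model drafts a description and the loss
# measures how far the draft is from the note. File names, notes, and
# the loss are invented stand-ins for the real BRIDGE setup.

dataset = [
    ("plan_001.png", "a two bedroom apartment with a sunlit kitchen"),
    ("plan_002.png", "a studio with an open kitchen and a porch"),
]

def model_draft(blueprint):
    """Stand-in for the learned describer (returns a fixed guess here)."""
    return "a two bedroom apartment with a kitchen"

def loss(draft, note):
    """Stand-in loss: fraction of the note's words the draft missed."""
    draft_words = set(draft.split())
    note_words = note.split()
    return sum(w not in draft_words for w in note_words) / len(note_words)

for blueprint, note in dataset:
    print(blueprint, round(loss(model_draft(blueprint), note), 3))
```

A low loss on the first plan (only "sunlit" is missed) and a high loss on the second is the signal that would push a real model's weights toward better drafts.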

The Results: Who Won?

The authors tested their models against other methods (like old-school templates or simple language bots).

  • The Old Way (Templates): Like a "Mad Libs" game. "The [Room Type] has a [Furniture]." It's boring and repetitive.
  • The Simple AI (LSTM/GRU): These are like students who memorized a dictionary but haven't seen a house. They can write grammatically correct sentences, but they might say, "The kitchen has a dragon," because they are just guessing words that go together, not looking at the picture.
  • The New Winners (DSIC & TBDG):
    • DSIC was good at describing what it saw, but sometimes got the details wrong if the drawing was tricky.
    • TBDG was the champion. It produced descriptions that sounded the most human. It could talk about specific details like "a walk-in closet" or "a porch," which the other models missed. It felt like a real person was describing the home.
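
Comparisons like this are typically scored with n-gram overlap metrics such as BLEU. A minimal BLEU-1 sketch (clipped unigram precision, without the brevity penalty or smoothing used in full implementations) shows the idea; the sentences are invented examples.

```python
# Toy BLEU-1 sketch: clipped unigram precision between a generated
# description and a human reference. Real evaluations add a brevity
# penalty and higher-order n-grams; this keeps only the core idea.
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # Clip each candidate word's count by its count in the reference,
    # so repeating a common word can't inflate the score.
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)

ref = "the house has a walk-in closet and a porch"
print(bleu1("the house has a big closet and a porch", ref))
```

Note that BLEU-1 is a precision: the candidate above scores well even though it dropped "walk-in", which is one reason papers report several metrics and human judgments alongside it.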

The Bottom Line

This paper is about teaching computers to stop just "seeing" lines and start "understanding" spaces.

  • DSIC is like a student who studies hard but relies only on what they can see.
  • TBDG is like a student who studies hard and uses a textbook to double-check their work.

The result? We can now take a boring, black-and-white floor plan and instantly generate a warm, inviting description that could be used for real estate listings, helping robots navigate homes, or assisting architects in designing better spaces. The "Knowledge-Driven" approach (TBDG) proved that giving the AI a little bit of text help makes it a much better writer.
