Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

The paper introduces HouseMind, a multimodal large language model that unifies the understanding, generation, and editing of architectural floor plans by employing discrete room-instance tokens to bridge visual layouts with symbolic reasoning, thereby achieving superior geometric validity and controllability.

Sizhong Qin, Ramon Elias Weber, Xinzheng Lu

Published 2026-03-13

Imagine you want to build a house, but instead of hiring an architect, you ask a very smart robot to draw the floor plan for you. You say, "I want a big living room in the middle, a kitchen to the north, and a bedroom to the south."

For a long time, AI was like a painter who could make a picture look pretty but didn't understand how a house actually works. It might draw a kitchen that floats in the air, a bedroom with no door, or a bathroom that is inside the living room. It was good at making pixels look nice, but bad at understanding the logic of space.

Enter HouseMind, a new AI system that changes the game. Here is how it works, explained simply:

1. The Problem: The "Pixel Painter" vs. The "Architect"

Think of old AI models as Pixel Painters. They look at a floor plan and try to guess what color goes where, pixel by pixel. They are like someone trying to recreate a map by copying every single dot on a piece of paper. If they miss one dot, the road might disappear. They struggle to understand that a "kitchen" needs to be next to a "dining room" or that a "bedroom" needs a door.

HouseMind is different. It acts like a Master Architect who speaks a special language of space.

2. The Secret Sauce: "LEGO Bricks" Instead of "Pixels"

The paper's big idea is Tokenization.

Imagine you have a giant box of LEGO bricks.

  • Old AI: Tries to build the house by painting every single square inch of the drawing.
  • HouseMind: Uses pre-made LEGO bricks.
    • One brick is a "Kitchen."
    • One brick is a "Living Room."
    • One brick is a "Wall."
    • One brick is a "Door."

Instead of looking at a blurry image, HouseMind breaks the floor plan down into these discrete "Room Tokens" (like LEGO bricks). It turns the complex drawing into a simple list of words and codes, like a sentence:
<LivingRoom> <Kitchen> <Wall> <Bedroom>
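
To make the idea concrete, here is a minimal Python sketch of turning rooms into a token sequence. It is an illustration under stated assumptions, not HouseMind's actual tokenizer: the `Room` class, the 16-bin grid, and the `<x4>`-style position tokens are all hypothetical, and the paper's room-instance tokens may well be learned rather than hand-defined.

```python
from dataclasses import dataclass

GRID = 16  # assumed number of quantization bins per axis (hypothetical)


@dataclass(frozen=True)
class Room:
    """A hypothetical room instance: a type plus a box in normalized [0, 1) coordinates."""
    kind: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def quantize(value: float) -> int:
    """Map a normalized coordinate to one of GRID discrete bins."""
    return min(GRID - 1, max(0, int(value * GRID)))


def tokenize(rooms):
    """Flatten a list of rooms into a sequence of discrete string tokens."""
    tokens = []
    for room in rooms:
        tokens.append(f"<{room.kind}>")
        tokens.append(f"<x{quantize(room.x)}>")
        tokens.append(f"<y{quantize(room.y)}>")
        tokens.append(f"<w{quantize(room.w)}>")
        tokens.append(f"<h{quantize(room.h)}>")
    return tokens


plan = [
    Room("LivingRoom", 0.25, 0.25, 0.5, 0.5),
    Room("Kitchen", 0.25, 0.0, 0.25, 0.25),
]
tokens = tokenize(plan)
print(" ".join(tokens))
# <LivingRoom> <x4> <y4> <w8> <h8> <Kitchen> <x4> <y0> <w4> <h4>
```

The point of the sketch is only that a drawing becomes a short, ordered list of symbols that a language model can read and write like text.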

3. How HouseMind "Thinks"

HouseMind is a Multimodal Large Language Model (MLLM). You can think of it as a super-smart translator that speaks two languages fluently:

  1. Human Language: "Put the kitchen next to the living room."
  2. Space Language: The list of LEGO bricks (tokens) that make up the floor plan.

Because it treats the floor plan like a sentence, it can apply the same sequence logic it uses for writing a story to designing a house.

  • If you say, "The kitchen is to the left of the living room," the model understands the relationship between the bricks, not just the colors.
  • It knows that a "Bathroom" usually needs a "Wall" around it and a "Door" to enter.
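
The kind of relationship described above can be sketched in a few lines of Python. This is a hand-written geometric check, not how HouseMind reasons internally (the model learns such relations from data); the room boxes, the `horizontal_relation` helper, and the `share_wall` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical room boxes as (x, y, width, height) on an integer grid,
# with x growing to the right and y growing downward.
rooms = {
    "Kitchen": (0, 0, 4, 4),
    "LivingRoom": (4, 0, 8, 8),
    "Bathroom": (0, 6, 3, 2),
}


def center(box):
    """Return the (x, y) center of a box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def horizontal_relation(a, b):
    """Describe where box a sits relative to box b along the x axis."""
    ax, _ = center(a)
    bx, _ = center(b)
    if ax < bx:
        return "left of"
    if ax > bx:
        return "right of"
    return "aligned with"


def share_wall(a, b):
    """True if two axis-aligned boxes touch along an edge of nonzero length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    touch_x = ax + aw == bx or bx + bw == ax
    touch_y = ay + ah == by or by + bh == ay
    overlap_x = min(ax + aw, bx + bw) - max(ax, bx)
    overlap_y = min(ay + ah, by + bh) - max(ay, by)
    return (touch_x and overlap_y > 0) or (touch_y and overlap_x > 0)


print(horizontal_relation(rooms["Kitchen"], rooms["LivingRoom"]))  # left of
print(share_wall(rooms["Kitchen"], rooms["LivingRoom"]))           # True
print(share_wall(rooms["Bathroom"], rooms["Kitchen"]))             # False
```

Working on room-level boxes rather than pixels is what makes questions like "is the kitchen next to the living room?" cheap to answer.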

4. The Three Superpowers

The paper shows HouseMind doing three things that other AIs struggle with:

  • Understanding (The Detective):
    You show it a floor plan, and it can tell you exactly what is happening. "Ah, I see a large living room in the center, with a kitchen to the northeast and a bedroom to the southwest." It doesn't just guess; it reads the "sentence" of the floor plan.

  • Generating (The Creator):
    You give it a text prompt: "Design a house with 3 bedrooms and a big balcony." It doesn't just paint a picture; it assembles the LEGO bricks in the correct order to build a logical, usable house. It ensures the rooms connect properly and fit inside the outline.

  • Editing (The Renovator):
    This is the coolest part. You can say, "Remove the balcony and add a small study."

    • Old AI: Might try to "paint over" the balcony, often messing up the walls or making the study float.
    • HouseMind: It simply takes the "Balcony" brick out of the list and swaps it for a "Study" brick. It knows exactly how to rearrange the remaining bricks so the house still makes sense.
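
As a rough illustration of generation constraints and list-level editing, the Python sketch below swaps one room type for another and then re-checks that every room fits inside the outline without overlapping its neighbors. It rests on loud assumptions: the plan format, the rectangular `OUTLINE`, and the `replace_room` and `is_valid` helpers are hypothetical, and the real model produces its edits by generating new tokens rather than by a hard-coded swap.

```python
from itertools import combinations

# Hypothetical plan format: a list of (room_kind, (x, y, width, height)) on an integer grid.
OUTLINE = (0, 0, 12, 8)  # assumed rectangular outline; real outlines are usually polygons


def inside(box, outline):
    """True if the box lies entirely within the outline."""
    x, y, w, h = box
    ox, oy, ow, oh = outline
    return ox <= x and oy <= y and x + w <= ox + ow and y + h <= oy + oh


def overlaps(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def is_valid(plan, outline):
    """Check that all rooms fit in the outline and no two rooms overlap."""
    boxes = [box for _, box in plan]
    if not all(inside(box, outline) for box in boxes):
        return False
    return not any(overlaps(a, b) for a, b in combinations(boxes, 2))


def replace_room(plan, old_kind, new_kind):
    """Return a copy of the plan where the first old_kind room becomes new_kind."""
    edited = list(plan)
    for i, (kind, box) in enumerate(edited):
        if kind == old_kind:
            edited[i] = (new_kind, box)  # keep the footprint, change the label
            return edited
    raise ValueError(f"no room of kind {old_kind!r} in plan")


plan = [
    ("LivingRoom", (0, 0, 8, 8)),
    ("Bedroom", (8, 0, 4, 5)),
    ("Balcony", (8, 5, 4, 3)),
]
edited = replace_room(plan, "Balcony", "Study")
print([kind for kind, _ in edited])  # ['LivingRoom', 'Bedroom', 'Study']
print(is_valid(edited, OUTLINE))     # True
```

Because the edit touches only one entry in the list, the other rooms are left exactly as they were, which is the controllability the summary describes.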

5. Why This Matters

  • It's Logical: It doesn't just make things look pretty; it makes things work. The rooms connect, the doors open, and the flow is logical.
  • It's Fast & Small: A floor plan written as "LEGO bricks" (tokens) is far shorter than the same plan written as millions of pixels, so the model has much less to process. That keeps it light enough to run locally on a single computer rather than only on massive supercomputers.
  • It's Controllable: You can tell it exactly what to change, and it will do exactly that, without breaking the rest of the house.

The Bottom Line

HouseMind is like giving an AI a set of magic LEGO instructions instead of a paintbrush. It teaches the computer to understand that a house isn't just a picture; it's a puzzle of connected spaces. By turning floor plans into a language the AI already understands, it can design homes that are not just visually plausible, but logically sound.