Imagine a City: CityGenAgent for Procedural 3D City Generation

Imagine you want to build a massive, realistic 3D city for a video game, a self-driving car simulator, or a virtual reality world. In the past, doing this was like trying to build a city out of LEGOs by hand, one brick at a time. It took forever, required a team of experts, and if you wanted to change the color of a building or move a park, you had to tear it down and start over.

Other modern AI methods are like hiring a magic painter. You tell them, "Paint a city," and they spray a beautiful picture. But here's the catch: it's just a flat painting. If you try to walk into it, you hit a wall. You can't move the buildings, and the roads don't actually connect.

"CityGenAgent" is a new invention that solves this by acting like a super-smart City Planner and Architect who speaks your language. Instead of just painting a picture or building by hand, it writes a recipe (a computer program) that tells a robot exactly how to build the city, piece by piece.

Here is how it works, broken down into simple steps:

1. The Two-Step Recipe (The "Programs")

Instead of trying to describe the whole city in one giant paragraph, CityGenAgent breaks the job into two distinct recipes:

The "Block Program" (The City Planner):
Imagine you are drawing a map on a piece of paper. You say, "I want a big park here, a tall office building there, and a school over there."
- The AI takes your words and draws a precise map. It makes sure the buildings don't overlap (like two cars trying to park in the same spot) and that the roads make sense.
- Analogy: This is like the City Zoning Department. They decide where things go and how big the lots are.
The "Building Program" (The Architect):
Once the map is set, the AI looks at each building spot and asks, "What does this building look like?" You might say, "Make it a modern glass skyscraper with blue windows."
- The AI writes a detailed list of instructions for that specific building: "Use glass for the walls, blue for the windows, and a flat roof."
- Analogy: This is like the Architect and Exterior Designer. They decide the style, color, and materials of each individual building.

2. Learning to Be Perfect (The Training)

The AI didn't start out perfect. It learned in two stages, kind of like a student:

Stage 1: The Classroom (Supervised Fine-Tuning):
The AI was shown thousands of examples of "City Description" paired with the "Correct Map/Blueprint." It learned the rules: "If I say 'park', I must draw green space," and "Buildings cannot float in the air." This taught it the basics of grammar and geometry.
Stage 2: The Coach (Reinforcement Learning):
Just knowing the rules isn't enough; the city needs to look right and feel right. The AI started generating cities, and a "Coach" (a smart computer judge) gave it scores.
- The "No-Collision" Reward: If the AI tried to put two buildings in the same spot, the Coach gave it a bad score.
- The "Looks Like the Prompt" Reward: If you asked for a "red brick house" and it made a "glass tower," the Coach gave it a bad score.
- The AI kept practicing and adjusting to these scores, so its layouts and styles steadily improved. In the paper's tests, the share of cities with colliding buildings dropped from about 24% to about 5%.

3. The Magic of "Talking to Change"

This is the coolest part. Because the AI built the city using a recipe (the program) rather than just a static image, you can talk to it to make changes instantly.

Old Way: You want to change a building from "Modern" to "Chinese Style." You have to delete the old 3D model and build a new one from scratch.
CityGenAgent Way: You just say, "Hey, change that building to Chinese style." The AI looks at its recipe, finds the line that says "Modern Glass," and swaps it for "Chinese Tile Roof." It instantly updates the 3D model without breaking anything else.

Why is this a big deal?

It's Editable: You aren't stuck with what you see. You can tweak the city like a text document.
It's Real: It creates actual 3D shapes (meshes) that cars can drive on and robots can walk through, not just pretty pictures.
It's Fast: It can generate a whole city block in about 45 seconds, whereas manual modeling by a human expert takes about an hour.

In summary: CityGenAgent is like having a magical construction crew that listens to your ideas, draws a perfect blueprint, builds the city, and then stands by ready to rearrange the furniture or repaint the walls the moment you ask. It turns the complex, messy job of building a city into a simple conversation.

1. Problem Statement

The automated generation of interactive, high-fidelity 3D cities is a critical challenge for applications in autonomous driving, virtual reality, and embodied intelligence. Existing methods face three primary limitations:

Lack of Controllability: Rendering-based methods (e.g., NeRF) and diffusion-based methods often produce photorealistic imagery but lack precise, consistent 3D geometry, making them unsuitable for downstream simulation tasks that require editable geometry.
Poor Structural Reasoning: While some methods use Large Language Models (LLMs) to guide generation, they often rely on retrieving fixed assets or lack the ability to perform deep spatial reasoning, leading to layouts that violate physical constraints (e.g., overlapping buildings).
Data Scarcity: There is a lack of large-scale, structured datasets for city-scale scenes, making it difficult to train models that can generalize to complex urban environments.

2. Methodology: CityGenAgent

The authors propose CityGenAgent, a natural language-driven framework that decomposes city generation into two hierarchical, interpretable components, each expressed in a Domain-Specific Language (DSL).

A. Hierarchical Decomposition via Programs

Instead of generating raw 3D meshes directly, the system generates executable code (programs) that define the city structure:

Block Program ($P_{block}$): Defines the macro layout of a city block. It encodes:
- Elements: Buildings and green spaces.
- Attributes: Unique IDs, usage types (e.g., residential), and polygons (non-self-intersecting 2D footprints).
- Optional: Floor counts and facade descriptions for buildings.
Building Program ($P_{building}$): Defines the micro-structure of individual buildings. It decomposes a building into components (e.g., windows, doors, roofs) and provides natural language descriptions for their style, material, and color.
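
To make the two program types concrete, here is a minimal Python sketch of what a Block Program and a Building Program might look like as JSON-style data, together with a small validator. The field names (`id`, `type`, `usage`, `polygon`, `floors`, `facade`, `components`) and the helpers `is_simple_polygon` and `validate_block_program` are assumptions chosen for illustration; the summary only states that programs carry unique IDs, usage types, non-self-intersecting polygons, and optional floor counts and facade descriptions.

```python
import json

# Hypothetical Block Program. Field names are illustrative, not the paper's schema.
BLOCK_PROGRAM = json.loads("""
{
  "elements": [
    {"id": "b1", "type": "building", "usage": "residential",
     "polygon": [[0, 0], [20, 0], [20, 10], [0, 10]],
     "floors": 6, "facade": "red brick with white window frames"},
    {"id": "g1", "type": "green_space", "usage": "park",
     "polygon": [[30, 0], [50, 0], [50, 20], [30, 20]]}
  ]
}
""")

# Hypothetical Building Program for element "b1": one description per component.
BUILDING_PROGRAM = {
    "building_id": "b1",
    "components": {
        "window": "white-framed sash windows",
        "door": "dark green wooden door",
        "roof": "pitched roof with grey slate tiles",
    },
}

REQUIRED_FIELDS = ("id", "type", "usage", "polygon")


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): -1, 0, or 1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a point interior to both."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def is_simple_polygon(polygon):
    """Check that no two non-adjacent edges properly cross.

    Simplification: degenerate cases (collinear overlap, edges that only
    touch at a point) are not detected.
    """
    n = len(polygon)
    if n < 3:
        return False
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share a vertex
            if _segments_cross(*edges[i], *edges[j]):
                return False
    return True


def validate_block_program(program):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    seen_ids = set()
    for element in program.get("elements", []):
        eid = element.get("id", "?")
        for field in REQUIRED_FIELDS:
            if field not in element:
                errors.append(f"{eid}: missing field '{field}'")
        if "id" in element:
            if eid in seen_ids:
                errors.append(f"{eid}: duplicate id")
            seen_ids.add(eid)
        if "polygon" in element and not is_simple_polygon(element["polygon"]):
            errors.append(f"{eid}: polygon is not simple")
    return errors


print(validate_block_program(BLOCK_PROGRAM))
```

A check of this kind corresponds to the "complete fields and geometrically closed shapes" requirement: a program that fails it can be rejected before any 3D asset is retrieved or generated.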

B. Two-Stage Training Strategy

The framework employs two specialized agents, BlockGen and BuildingGen, both built on an LLM backbone (Qwen3-8B) and trained in two stages:

Supervised Fine-Tuning (SFT):
- Goal: Teach the models to follow instructions and output valid program formats (JSON) with complete fields and geometrically closed shapes.
- Data: Synthetic datasets of prompt-program pairs.
Reinforcement Learning (RL) via Proximal Policy Optimization (PPO):
- Goal: Enhance spatial reasoning and visual consistency beyond simple format compliance.
- BlockGen Rewards (Spatial Alignment Reward):
  - Semantic Consistency: Uses GPT-4o to score how well the layout matches the text description.
  - Global Plausibility: Assesses physical feasibility.
  - Geometric Overlap: Penalizes overlapping building footprints.
  - Footprint Density: Encourages building coverage within a realistic density band.
- BuildingGen Rewards (Visual Consistency Reward):
  - Uses a Vision-Language Model (VLM) to score the rendered output against the text description based on Text Alignment, Color Coherence, Style Consistency, and Material Coherence.
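
The geometric overlap and footprint density terms can be computed directly from the layout. The following is a minimal sketch, not the authors' reward implementation. It assumes axis-aligned rectangular footprints given as `(xmin, ymin, xmax, ymax)`, a hypothetical density band of 0.2 to 0.6, and equal weights for the four terms; none of these values come from the paper. The semantic consistency score (obtained from GPT-4o in the paper) and the plausibility score are treated here as externally supplied numbers in [0, 1].

```python
from itertools import combinations


def rect_area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def overlap_reward(footprints):
    """1.0 when no footprints overlap, decreasing toward 0.0 as overlap grows."""
    total = sum(rect_area(r) for r in footprints)
    if total == 0:
        return 1.0
    overlapped = sum(overlap_area(a, b) for a, b in combinations(footprints, 2))
    return max(0.0, 1.0 - overlapped / total)


def density_reward(footprints, block_area, band=(0.2, 0.6)):
    """1.0 inside the (assumed) density band, falling linearly to 0.0 outside."""
    density = sum(rect_area(r) for r in footprints) / block_area
    lo, hi = band
    if lo <= density <= hi:
        return 1.0
    if density < lo:
        return density / lo
    return max(0.0, (1.0 - density) / (1.0 - hi))


def spatial_alignment_reward(footprints, block_area, semantic, plausibility,
                             weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four BlockGen reward terms (weights are assumed)."""
    terms = (
        semantic,                                # text-layout match, supplied externally
        plausibility,                            # physical feasibility, supplied externally
        overlap_reward(footprints),              # geometric overlap penalty
        density_reward(footprints, block_area),  # footprint density band
    )
    return sum(w * t for w, t in zip(weights, terms))


clean = [(0, 0, 10, 10), (20, 0, 30, 10)]
colliding = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(spatial_alignment_reward(clean, 400, semantic=0.8, plausibility=1.0))
print(spatial_alignment_reward(colliding, 400, semantic=0.8, plausibility=1.0))
```

With identical semantic and plausibility scores, the colliding layout receives a lower total than the clean one, which is the signal PPO needs to steer the policy away from overlapping footprints.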

C. Execution and Manipulation

Execution: The generated programs are executed by an executor that retrieves assets from a database (or generates them via Text-to-3D) and assembles them into a 3D scene using geometric parameters (rotation, translation, scaling).
Interactive Editing: Because the city is represented by editable programs, users can modify the city via natural language (e.g., "Change to Chinese style" or "Increase density"). The agents update the specific program fields, and the scene is re-rendered, preserving geometric plausibility.
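
A rough sketch of why program-level edits are cheap: an edit changes one field, and the executor re-applies the same scale, rotate, translate placement to rebuild the scene. Everything here is illustrative. The `style` and `placement` fields, the helpers `place_vertices`, `apply_edit`, and `execute`, and the 2D unit-square stand-in for an asset are assumptions; in the described system the agents derive the field updates from natural language, and the executor assembles 3D assets.

```python
import copy
import math

# Stand-in for a retrieved asset: a 2D unit square instead of a 3D mesh.
UNIT_SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]


def place_vertices(vertices, scale, rotation_deg, translation):
    """Apply scale, then rotation about the origin, then translation (2D)."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    placed = []
    for x, y in vertices:
        x, y = x * scale[0], y * scale[1]
        x, y = x * cos_t - y * sin_t, x * sin_t + y * cos_t
        placed.append((x + translation[0], y + translation[1]))
    return placed


def apply_edit(program, element_id, field, value):
    """Return a copy of the program with one field of one element changed."""
    edited = copy.deepcopy(program)
    for element in edited["elements"]:
        if element["id"] == element_id:
            element[field] = value
            return edited
    raise KeyError(f"no element with id {element_id!r}")


def execute(program):
    """Map each element id to its style and placed footprint."""
    return {
        e["id"]: (e["style"], place_vertices(UNIT_SQUARE, **e["placement"]))
        for e in program["elements"]
    }


program = {"elements": [
    {"id": "b1", "style": "modern glass",
     "placement": {"scale": (20, 10), "rotation_deg": 90, "translation": (5, 5)}},
    {"id": "b2", "style": "modern glass",
     "placement": {"scale": (10, 10), "rotation_deg": 0, "translation": (40, 0)}},
]}

# "Change that building to Chinese style" becomes a single field update.
edited = apply_edit(program, "b1", "style", "Chinese tile roof")

before, after = execute(program), execute(edited)
print(after["b1"][0])                    # style of b1 after the edit
print(after["b1"][1] == before["b1"][1])  # footprint of b1 is untouched
print(after["b2"] == before["b2"])        # b2 is untouched
```

Because the edit touches only the `style` field, the placement parameters, and therefore the layout's geometric plausibility, carry over unchanged to the re-executed scene.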

3. Key Contributions

Novel Representation: Introduction of Block Program and Building Program, which provide a compact, editable, and hierarchical representation for 3D cities, bridging the gap between natural language and procedural geometry.
RL-Enhanced Agents: The design of CityGenAgent (BlockGen + BuildingGen) utilizing Spatial Alignment and Visual Consistency rewards. This significantly improves the model's ability to reason about spatial constraints and align visual outputs with textual descriptions.
Interactive Control: The system enables fine-grained, natural language manipulation of city layouts and architectural details without external plugins, a capability lacking in current diffusion or rendering-based approaches.

4. Experimental Results

The authors evaluated CityGenAgent against state-of-the-art methods (e.g., CityDreamer, CityCraft, Hunyuan3D, SGAM) using quantitative metrics and user studies.

Semantic & Visual Quality: CityGenAgent achieved the highest scores in Text Alignment (0.286 CLIP score) and Visual Consistency (6.7/10 GPT score, 5.8/10 User score), outperforming all baselines.
Geometric Quality: The method produced meshes with superior Rectilinearity/Orthogonality Score (ROS) and significantly lower Over-tessellation Ratio (OTR) compared to diffusion-based methods, indicating cleaner, more efficient geometry.
Structural Validity: The RL stage drastically reduced the Collision Rate (from 23.97% in the base model to 4.89% in the final agent) while maintaining high format accuracy (98%).
Efficiency: CityGenAgent generates a city block in 0.75 minutes, significantly faster than manual modeling (60 min) and competitive with or faster than other automated methods (Hunyuan3D: 3 min).
Ablation Studies: Confirmed that PPO-based RL outperforms Direct Preference Optimization (DPO) in handling multi-dimensional spatial rewards and capturing human preferences.
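
For a sense of scale, the collision and timing figures above imply the following relative improvements. This is simple arithmetic on the reported numbers, not an additional result from the paper; the variable names are chosen here for readability.

```python
# Reported collision rates, in percent.
base_collision, final_collision = 23.97, 4.89
relative_reduction = (base_collision - final_collision) / base_collision

# Reported generation times per city block, in minutes.
citygen_min, manual_min, hunyuan_min = 0.75, 60.0, 3.0
speedup_vs_manual = manual_min / citygen_min
speedup_vs_hunyuan = hunyuan_min / citygen_min

print(f"collision rate reduced by {relative_reduction:.1%}")   # 79.6%
print(f"{speedup_vs_manual:.0f}x faster than manual modeling")  # 80x
print(f"{speedup_vs_hunyuan:.0f}x faster than Hunyuan3D")       # 4x
```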

5. Significance

This work provides a foundation for scalable and controllable 3D city generation. By shifting from direct pixel/voxel generation to programmatic generation, CityGenAgent addresses two weaknesses of current generative approaches: geometric inconsistency and lack of editability. It demonstrates that combining LLMs with procedural constraints and RL-based reward shaping can produce high-fidelity, structurally sound, and interactive urban environments, with potential applications in simulation, gaming, and urban planning.

