TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

Imagine you are looking at a single photograph of a person doing something with an object—maybe a skateboarder mid-air or someone holding a coffee cup. Your brain instantly understands the story: how they are holding it, why they are looking at it, and the whole vibe of the scene.

Now, imagine trying to build a 3D movie version of that photo using only a computer. This is what researchers call "3D reconstruction." For a long time, computers have been terrible at this because they are too literal. They only look for physical touch.

The Problem: The Computer's "Touch-Only" Blindness

Think of old 3D reconstruction methods like a robot that only understands the world through handshakes.

If a person is holding a cup, the robot sees the hand touching the cup and says, "Okay, I'll stick the cup to the hand."
But what if the person is reaching for a cup they haven't touched yet? Or looking at a bird in a tree? Or jumping over a skateboard?

To the old robot, these are impossible. Since there is no physical contact (no handshake), the robot gets confused. It might drop the skateboard, make the person stare in the wrong direction, or even make the object float in the wrong place. It misses the "story" of the image because it ignores the context.

The Solution: TeHOR (The "Storyteller" Computer)

The paper introduces TeHOR, a new system that acts less like a robot and more like a creative director who reads a script.

Instead of just looking for where hands touch objects, TeHOR asks an AI "storyteller" (a Vision-Language Model like GPT-4) to describe the image in words.

Old way: "Hand is near cup."
TeHOR way: "A man is jumping with a skateboard while performing a trick."

Once TeHOR has this sentence, it uses a powerful "imagination engine" (a Diffusion Model, similar to the tech behind AI art generators) to build the 3D world. It doesn't just guess where things go; it asks, "If I were to draw a picture of a man jumping with a skateboard, what would it look like?" and then shapes the 3D model to match that mental image.

How It Works: The Three-Step Recipe

The Rough Draft (Initial Build):
TeHOR first builds a basic 3D skeleton of the person and the object, kind of like a clay sculpture. It uses standard tools to get the shapes right, but at this stage, the person might be floating weirdly or holding the object in the wrong way.
The Script (Text Guidance):
The system reads the text description (e.g., "A woman is holding a donkey's halter"). It knows that "holding a halter" implies a specific posture and hand position, even if the hands aren't quite touching the halter in the rough draft yet. This text acts as a magnetic guide, pulling the 3D model into the correct pose.
The Polish (Texture & Context):
Here is the magic trick. The system doesn't just care about the shape; it cares about the look. It uses the text to ensure the colors, shadows, and overall "vibe" make sense.
- Analogy: Imagine you are painting a picture. The old method only made sure the brush touched the canvas. TeHOR makes sure the brushstrokes match the feeling of the story. If the text says "sitting on a colorful mosaic bench," TeHOR ensures the bench looks colorful and the person is sitting comfortably, not just hovering above it.

Why This Matters: The "Non-Contact" Superpower

The biggest breakthrough is handling non-contact interactions.

The Old Way: If a person is pointing at a sign but not touching it, the computer fails. It doesn't know where the sign should be relative to the finger.
TeHOR: Because it understands the sentence "A man is pointing at a sign," it can infer where the sign belongs in 3D space relative to the finger, even without a physical connection. It understands intent.

The Result: A Realistic 3D World

By combining the shape (geometry) with the story (text) and the look (texture), TeHOR creates 3D models that are not only accurate but also make sense to human eyes. It can create immersive digital assets for video games, VR, and robots, allowing them to understand that a person isn't just a collection of shapes, but a character with a story, a gaze, and a relationship with the world around them.

In short: TeHOR stops the computer from being a literal-minded robot and turns it into a creative storyteller that can build 3D worlds based on the meaning of a picture, not just the pixels.

1. Problem Statement

The paper addresses the challenge of jointly reconstructing 3D humans and interacting objects from a single input image. While existing methods have made progress, they suffer from two fundamental limitations:

Over-reliance on Physical Contact: Current approaches primarily use physical contact regions (e.g., hands grasping an object) as the main cue for interaction reasoning. This fails to capture non-contact interactions (e.g., gazing at an object, pointing, or preparing to catch a frisbee) where no physical touch occurs.
Neglect of Global Context: Reconstruction is often driven by local geometric proximity, ignoring global appearance cues (color, shading, pose) that provide the holistic context necessary for semantic alignment. This leads to physically plausible but semantically incorrect reconstructions (e.g., an object oriented incorrectly relative to the human's intent).

2. Methodology

TeHOR introduces a text-guided framework that leverages natural language descriptions to enforce semantic alignment between the 3D reconstruction and the input image. The pipeline consists of three main stages:

A. 3D Representation

3D Gaussians: Both the human and the object are represented as sets of 3D Gaussians ( $\Phi_h$ and $\Phi_o$ ).
- Human: Parameterized by SMPL-X pose ( $\theta$ ) and shape ( $\beta$ ) parameters, with Gaussian attributes ( $\phi_h$ ) anchored to the canonical mesh and animated via Linear Blend Skinning (LBS).
- Object: Parameterized by rotation ( $R$ ), translation ( $t$ ), scale ( $s$ ), and Gaussian attributes ( $\phi_o$ ) in a canonical space.
Rendering: Differentiable rendering (Mip-Splatting) is used to project 3D Gaussians onto 2D image space.
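
To make the object parameterization above concrete, here is a minimal sketch of how a rotation $R$, translation $t$, and scale $s$ could carry canonical Gaussians into the scene. The function name `place_object_gaussians` and the use of full covariance matrices are assumptions for illustration; the paper's actual implementation may store Gaussian attributes differently.

```python
import numpy as np

def place_object_gaussians(centers, covs, R, t, s):
    """Apply a similarity transform to canonical 3D Gaussians (illustrative only).

    centers: (N, 3) canonical means
    covs:    (N, 3, 3) canonical covariance matrices
    R: (3, 3) rotation, t: (3,) translation, s: scalar scale
    """
    # Means transform as x' = s * R x + t.
    world_centers = s * centers @ R.T + t
    # Covariances transform as Sigma' = s^2 * R Sigma R^T.
    world_covs = (s ** 2) * (R @ covs @ R.T)
    return world_centers, world_covs

# Example: rotate 90 degrees about z, double the size, lift by 2 along z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
centers = np.array([[1.0, 0.0, 0.0]])
covs = np.array([np.diag([0.01, 0.04, 0.09])])
world_centers, world_covs = place_object_gaussians(
    centers, covs, R, np.array([0.0, 0.0, 2.0]), 2.0
)
```

Because the transform is a plain matrix expression, gradients from a differentiable renderer can flow back to $R$, $t$, and $s$ when the same operations are written in an autodiff framework.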

B. Reconstruction Stage (Initialization)

Before optimization, the framework generates initial 3D assets:

Text Captioning: A Vision-Language Model (GPT-4) generates two prompts:
- $P_{holistic}$ : Describes the global interaction context (e.g., "A man is jumping with a skateboard").
- $P_{contact}$ : Identifies specific body parts in contact (e.g., "Contact part: right foot").
Asset Generation:
- Human: The object is removed from the image (SmartEraser), and the human is segmented. LHM (Large Animatable Human Reconstruction Model) generates the initial 3D Gaussian attributes and pose.
- Object: The human is removed, and InstantMesh reconstructs a textured 3D mesh, which is converted to 3D Gaussians.
- Background: A 2D background image is generated by removing both human and object.
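
The contact prompt has to be turned into body-part labels before it can drive a loss. The sketch below shows one way to do that, assuming the prompt follows the "Contact part: ..." pattern from the example above and that multiple parts are comma-separated. Both the separator and the helper name `parse_contact_prompt` are hypothetical.

```python
def parse_contact_prompt(p_contact):
    """Extract lowercase body-part labels from a 'Contact part: ...' string.

    Returns an empty list when the prompt does not follow the assumed pattern,
    which can be read as "no contact reported".
    """
    prefix = "contact part:"
    text = p_contact.strip()
    if not text.lower().startswith(prefix):
        return []
    parts = text[len(prefix):].split(",")
    return [p.strip().lower() for p in parts if p.strip()]

single = parse_contact_prompt("Contact part: right foot")
multiple = parse_contact_prompt("Contact part: Right hand, left hand")
none = parse_contact_prompt("No contact")
```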

C. HOI Optimization Stage

The initial reconstructions are jointly refined over 200 steps using a composite loss function:
$L = L_{recon} + L_{appr} + L_{contact} + L_{collision}$

Reconstruction Loss ( $L_{recon}$ ): Minimizes the difference between the input image and the front-view rendering (RGB and silhouette).
Appearance Loss ( $L_{appr}$ ): The core innovation. It uses a pre-trained diffusion network (Stable Diffusion) conditioned on $P_{holistic}$ . Using Score Distillation Sampling (SDS), it computes gradients that align the rendered 2D appearance of the 3D scene with the semantic meaning of the text. This enforces holistic semantic consistency (e.g., correct gaze direction, object orientation) beyond geometry alone.
Contact Loss ( $L_{contact}$ ): Enforces local geometric proximity between specific human body parts (identified in $P_{contact}$ ) and the object surface.
Collision Loss ( $L_{collision}$ ): Prevents interpenetration between the human and object.
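
The two geometric terms can be illustrated on plain point sets. This is a simplified sketch, not the paper's formulation: the contact term is taken as a nearest-neighbor distance, the collision term uses a sphere as a stand-in for the object, and the four terms are summed without weights, as in the equation above. All function names are hypothetical.

```python
import numpy as np

def contact_loss(human_pts, object_pts):
    """Mean distance from each contact-region human point to its nearest object point."""
    d = np.linalg.norm(human_pts[:, None, :] - object_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def collision_loss(human_pts, center, radius):
    """Mean penetration depth of human points into a sphere proxy for the object."""
    depth = radius - np.linalg.norm(human_pts - center, axis=-1)
    return np.clip(depth, 0.0, None).mean()  # points outside the sphere contribute 0

def total_loss(l_recon, l_appr, l_contact, l_collision):
    """Unweighted sum of the four terms; real systems usually weight each one."""
    return l_recon + l_appr + l_contact + l_collision

foot_pts = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
board_pts = np.array([[0.0, 0.0, 1.0], [4.0, 3.0, 0.0]])
l_contact = contact_loss(foot_pts, board_pts)

body_pts = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
l_collision = collision_loss(body_pts, np.zeros(3), 1.0)

# The reconstruction and appearance values here are placeholders.
loss = total_loss(0.1, 0.2, l_contact, l_collision)
```

The two terms pull in opposite directions by design: the contact term draws the named body part toward the object, while the collision term pushes back once surfaces overlap.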

D. Gaussians-to-Mesh Conversion

For compatibility with existing benchmarks, the optimized 3D Gaussians are converted back to meshes. A local shift is applied to contact regions to ensure geometric consistency between the Gaussian-defined contacts and the mesh surfaces.

3. Key Contributions

Text-Guided Semantic Reasoning: TeHOR is the first framework to use text descriptions as a primary guidance signal for 3D human-object reconstruction, enabling reasoning over non-contact interactions and global context.
Holistic Appearance Alignment: By integrating a diffusion prior into the optimization loop, the method captures global visual plausibility (shading, color, pose) that local geometric fitting misses.
Full 3D Texture Reconstruction: The framework jointly reconstructs textured 3D humans and objects, producing immersive digital assets suitable for AR/VR, unlike many prior methods that focus only on geometry or non-textured surfaces.
State-of-the-Art Performance: The method achieves superior results in both general and non-contact scenarios compared to existing contact-based approaches.
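
The diffusion prior in the second contribution rests on Score Distillation Sampling. For reference, the SDS gradient in its standard published form (as introduced in DreamFusion) is written below. The notation is an assumption rather than the paper's own: $x$ is the rendered image, $\phi$ collects the optimized Gaussian and pose parameters, $\tau$ is the diffusion timestep (written $\tau$ to avoid a clash with the object translation $t$), $\epsilon$ is the injected Gaussian noise, $\hat{\epsilon}$ is the frozen network's noise prediction, and $w(\tau)$ is a timestep weighting.

```latex
\nabla_{\phi} L_{SDS}
  = \mathbb{E}_{\tau, \epsilon}\left[
      w(\tau)\,\bigl(\hat{\epsilon}(x_{\tau};\, P_{holistic},\, \tau) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \phi}
    \right],
\qquad
x_{\tau} = \sqrt{\bar{\alpha}_{\tau}}\, x + \sqrt{1 - \bar{\alpha}_{\tau}}\, \epsilon
```

The residual $\hat{\epsilon} - \epsilon$ has the same shape as the rendering, so every pixel (or latent element, when a latent diffusion model such as Stable Diffusion is used) receives its own gradient signal. How TeHOR weights or schedules this term is not specified in this summary.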

4. Experimental Results

The framework was evaluated on Open3DHOI (open-vocabulary, in-the-wild) and BEHAVE (indoor, controlled) datasets.

Quantitative Metrics: TeHOR outperforms state-of-the-art methods (PHOSA, InteractVLM, HOI-Gaussian) across all metrics:
- Chamfer Distance: Lower error for both human ( $CD_{human}$ ) and object ( $CD_{object}$ ).
- Contact Score: Higher F1-score for contact regions.
- Collision: Lower interpenetration rates.
Non-Contact Scenarios: In a subset of data excluding physical contact, TeHOR significantly outperforms competitors. While other methods fail due to the lack of contact cues, TeHOR successfully reconstructs interactions based on gaze, posture, and object orientation guided by text.
Ablation Studies:
- Removing text prompts leads to incorrect object orientation and gaze.
- Replacing the diffusion-based appearance loss with a CLIP loss results in lower accuracy, which suggests that dense, pixel-level diffusion gradients provide stronger guidance than a single global embedding vector.
- Using 3D Gaussians and a 2D background significantly improves performance over mesh-only representations.
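
For readers unfamiliar with the metrics listed above, the following sketch computes a symmetric Chamfer distance between two point sets and an F1-score over binary contact labels. Conventions vary between benchmarks (squared versus unsquared distances, sums versus means, alignment steps before measuring), so this is one common variant rather than the exact evaluation protocol used in the paper.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def contact_f1(pred, gt):
    """F1-score between predicted and ground-truth per-vertex contact labels."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.sum(pred & gt)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

pred_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(pred_pts, gt_pts)

f1 = contact_f1([1, 1, 0, 0], [1, 0, 1, 0])
```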

5. Significance

TeHOR represents a paradigm shift in 3D reconstruction from relying solely on geometric priors (contact) to leveraging semantic priors (text).

Robustness: It addresses the "contact ambiguity" problem, where physical touch is absent or ambiguous.
Applications: The ability to generate semantically coherent, textured 3D assets from a single image is crucial for Robotics (understanding human intent), AR/VR (creating realistic digital twins), and Digital Content Creation.
Future Direction: The paper highlights the potential of using text-to-video models for temporal consistency in future video-based reconstruction tasks.