HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

This paper introduces HoloLLM, a multimodal foundation model that integrates diverse sensing modalities such as LiDAR and mmWave radar. Using a novel Universal Modality-Injection Projector and a collaborative data curation pipeline, it significantly enhances language-grounded human perception and reasoning for real-world embodied agents.

Chuhao Zhou, Jianfei Yang

Published 2026-02-25

Imagine you are trying to teach a robot how to understand the world and talk to people inside a smart home.

The Problem: The "Blind" Robot

Currently, most smart home robots rely heavily on cameras (vision) to see what's happening. They are like a person navigating the world with eyesight alone.

  • The Issue: If it's pitch black, if someone is hiding behind a couch, or if you want to respect privacy (no cameras in the bedroom), the robot goes blind. It can't see the person who fell, and it can't tell you what's going on.
  • The Human Way: Humans are smarter. We don't just use our eyes. We use our ears (hearing footsteps), our sense of touch, and even our intuition. We use many senses to understand the world.

The Solution: HoloLLM (The "Super-Sense" Robot)

The authors of this paper created HoloLLM. Think of HoloLLM as a robot with a "super-sense" brain. Instead of just looking at pictures, it can "feel" the world using invisible signals like:

  • WiFi signals (which bounce off people).
  • Radar (like a bat's sonar).
  • Infrared (seeing heat in the dark).
  • LiDAR (3D laser scanning).

HoloLLM takes all these strange, invisible signals and translates them into natural language. It can look at a WiFi signal, realize "Oh, someone is walking behind that wall," and tell you, "Someone is walking in the kitchen, even though I can't see them."

The Two Big Hurdles (And How They Cleared Them)

Building this robot was hard because of two main problems:

1. The "Language Gap" (Not enough textbooks)

  • The Analogy: Imagine trying to teach a student a new language, but you only have 50 sentences to work with, while other languages have millions of books.
  • The Reality: We have millions of pictures with captions (like "a cat sitting on a mat") to train AI. But for WiFi signals or Radar, there are almost no examples of "WiFi signal + text description."
  • The Fix: They used a clever trick. They took a model that already knew how to connect pictures and text (a pre-trained vision-language model) and used it as a "starter kit." Then they built a special bridge that teaches it to translate the new signals (WiFi/Radar) into that same shared language, using very little paired data (a code sketch of this recipe follows the list).
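
For the curious, here is a minimal PyTorch sketch of that "bridge" recipe in its general form: freeze a pre-trained backbone and train only a small projector on the scarce sensor-text pairs. Every module name, dimension, and loss below is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SensorProjector(nn.Module):
    """A small trainable bridge: maps features from a new sensor
    (e.g., WiFi or radar) into the frozen model's embedding space."""
    def __init__(self, sensor_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sensor_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sensor_feats: torch.Tensor) -> torch.Tensor:
        # sensor_feats: (batch, num_tokens, sensor_dim)
        return self.proj(sensor_feats)

# Stand-in for the pre-trained "starter kit" brain; kept frozen.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False

projector = SensorProjector(sensor_dim=128, llm_dim=512)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

# One fake training step on 4 samples of 16 sensor tokens each.
sensor_feats = torch.randn(4, 16, 128)
text_aligned_target = torch.randn(4, 16, 512)  # placeholder targets

optimizer.zero_grad()
pred = backbone(projector(sensor_feats))
loss = nn.functional.mse_loss(pred, text_aligned_target)
loss.backward()  # gradients flow only into the projector
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

Because only the tiny projector learns while the big backbone stays frozen, a handful of labeled sensor recordings can go a long way.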

2. The "Translation Glitch" (Different shapes, different meanings)

  • The Analogy: Imagine trying to translate a poem written in water, a song written in smoke, and a drawing written in sand. They are all different "shapes" of information. If you try to force them all into the same box, the meaning gets lost.
  • The Reality: A WiFi signal looks nothing like a photo. Standard AI tools get confused trying to read them.
  • The Fix: They invented a special translator called UMIP (Universal Modality-Injection Projector).
    • Think of UMIP as a smart filter. It takes the raw signal (like a messy WiFi wave), runs it through a custom specialist (a "tailored encoder" that knows exactly how that specific signal behaves), and then gently injects the important details into the robot's main brain.
    • It does this in stages, like peeling an onion: each layer adds finer detail, so the robot understands the signal's nuances without getting overwhelmed (a sketch of this injection step follows below).
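
To make the "peeling an onion" image concrete, below is a loose PyTorch sketch of layer-wise injection via cross-attention: coarse, text-aligned features act as queries and absorb fine-grained details from the tailored encoder one stage at a time. The class names, shapes, and stage count are assumptions for illustration; this is not UMIP's actual implementation.

```python
import torch
import torch.nn as nn

class InjectionBlock(nn.Module):
    """One "onion layer": coarse features attend to fine-grained
    modality features and absorb the details they need."""
    def __init__(self, dim: int, nhead: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        injected, _ = self.attn(query=coarse, key=fine, value=fine)
        return self.norm(coarse + injected)  # residual keeps the coarse view

class LayeredInjector(nn.Module):
    """Stacks several injection stages, one per encoder level."""
    def __init__(self, dim: int = 256, num_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList([InjectionBlock(dim) for _ in range(num_stages)])

    def forward(self, coarse, fine_per_stage):
        # fine_per_stage: one fine-grained feature map per encoder stage
        for stage, fine in zip(self.stages, fine_per_stage):
            coarse = stage(coarse, fine)
        return coarse  # enriched features, ready for the language model

coarse = torch.randn(2, 8, 256)                      # coarse, text-aligned tokens
fines = [torch.randn(2, 32, 256) for _ in range(3)]  # tailored-encoder outputs
out = LayeredInjector()(coarse, fines)
print(out.shape)  # torch.Size([2, 8, 256])
```

The residual connection is the "gentle" part of the injection: the coarse features are refined with new detail at each stage rather than being overwritten.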

The Result: A New Kind of Intelligence

They tested HoloLLM in two new "exam rooms" (datasets) filled with people doing different actions.

  • The Score: HoloLLM didn't just pass; it crushed the competition, improving perception accuracy by up to 30% over existing multimodal models.
  • The Superpower: It can answer questions like "Is anyone in the room?" even if the lights are off and the person is behind a wall. It can also describe what the person is doing ("Someone is waving their arms") just by looking at the WiFi signals.

Why This Matters

This isn't just about a cooler robot. This is about making AI that works in the real world, not just in perfect labs.

  • Privacy: You can have a smart home that knows you're there without a camera watching you.
  • Safety: It can detect a fall in the dark when a camera would fail.
  • Reliability: It keeps working when the lighting is bad or the camera's view is blocked, exactly where vision-only systems fail.

In short: HoloLLM is the first robot that doesn't just "see" the world; it feels it through invisible waves and can chat with you about it, making our future smart homes safer, more private, and truly intelligent.
