Original authors: Yanchen Wang, Lenny Aharon, Wangshu Zhu, Kyle Daruwalla, Linghua Zhang, Jiaru Zou, Selmaan Chettih, Helen Hou, Liam Paninski, Matthew R Whiteway

Published 2026-06-03

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine trying to understand how a mouse moves, a bird flies, or a human dances just by watching footage from a single video camera. It's like trying to guess the shape of a sculpture while only looking at its shadow on the wall. You miss the depth, the hidden parts, and the true 3D structure.

Scientists have started using multiple cameras (like a surround-sound system, but for video) to capture animals in 3D. But analyzing this data is hard. Existing tools either need humans to painstakingly draw dots on every video frame (like a tedious game of "connect the dots" for every single second of footage) or they are general-purpose AI models that get confused by the specific, close-up, lab-style footage.

Enter BEAST3D. Think of BEAST3D as a "3D magic mirror" for animal behavior. It's a new computer program that teaches itself how to see in 3D without needing humans to draw any dots first.

Here is how it works, using some simple analogies:

1. The "Ghost Cloud" (Gaussian Splatting)

Instead of building a rigid 3D model (like a plastic toy), BEAST3D creates a "cloud of glowing dust" to represent the animal. In the paper, they call these Gaussian splats.

The Analogy: Imagine the animal is made of thousands of tiny, fuzzy, glowing balloons floating in space. Each balloon knows exactly where it is, what shape it is, and what color it is.
The Magic: The computer learns to arrange these balloons so that if you look at them from the angle of Camera A, Camera B, or Camera C, they look exactly like the real video.
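
To make the "cloud of balloons" picture a bit more concrete, here is a minimal sketch of what such a cloud could look like as data, and how its balloon centers land in the images of two calibrated cameras. Everything here is an illustrative assumption: the field names (`center`, `scale`, `color`, `opacity`), the camera matrices, and the random values are made up, and only the centers are projected. A real Gaussian splatting renderer also projects each balloon's shape and blends the colors.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "cloud": N fuzzy balloons, each with a 3D center, a size per axis,
# an RGB color, and an opacity. All values are invented for illustration.
n = 1000
cloud = {
    "center": rng.normal(0.0, 0.1, size=(n, 3)) + np.array([0.0, 0.0, 2.0]),
    "scale": np.full((n, 3), 0.01),
    "color": rng.uniform(0.0, 1.0, size=(n, 3)),
    "opacity": np.full(n, 0.8),
}

def project(centers, K, R, t):
    """Project 3D points to pixel coordinates for a calibrated pinhole camera."""
    cam = centers @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Hypothetical intrinsics shared by both cameras (focal length 500 px).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Camera A sits at the world origin; camera B is shifted sideways.
R_a, t_a = np.eye(3), np.zeros(3)
R_b, t_b = np.eye(3), np.array([0.5, 0.0, 0.0])

pix_a = project(cloud["center"], K, R_a, t_a)
pix_b = project(cloud["center"], K, R_b, t_b)
```

The same 3D cloud produces different 2D pictures in `pix_a` and `pix_b`. Learning amounts to adjusting the cloud until every camera's picture matches its real video frame.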

2. The "Blindfolded Artist" (Self-Supervised Learning)

How does the computer learn to arrange these balloons? It plays a game of "guess the missing piece."

The Analogy: Imagine an artist surrounded by 5 cameras filming a rat. The artist (the computer) is shown the footage from 4 of the cameras but is blindfolded to the 5th.
The Task: The computer has to look at the 4 cameras, build its "cloud of balloons" in its mind, and then try to paint what the 5th camera should be seeing.
The Learning: If the painting doesn't match the real 5th camera video, the computer adjusts the balloons. It repeats this over and over during training. Eventually, it gets so good at predicting the missing view that it has learned the 3D shape of the animal, not just the 2D picture.

3. Why It's Different from Other Tools

The "Generalist" Problem: Other 3D AI models are like tourists who have seen thousands of landscapes. They are great at guessing the shape of a mountain range from a few photos, but they get lost when shown a close-up of a mouse in a lab because the "camera angles" are too sparse and the lighting is too controlled.
BEAST3D's Edge: BEAST3D knows the exact location of the cameras (because scientists calibrated them). It doesn't waste energy guessing where the cameras are; it focuses all its brainpower on figuring out the animal's shape. It can build a good 3D model with as few as four cameras, whereas general-purpose models are designed for many densely overlapping views.

What Can It Do? (The Three Superpowers)

The paper shows that once BEAST3D learns this 3D "cloud," it can help scientists in three specific ways:

The Time-Travel Camera (Novel View Synthesis):
You can ask the computer to show you the animal from a camera angle that doesn't even exist. It takes the 3D cloud and renders a new, realistic video from a "ghost camera" hovering anywhere in the room. This is strong evidence that the computer has learned the 3D shape rather than memorizing 2D pictures.
The Skeleton Tracker (Pose Estimation):
Scientists need to track specific joints (like a knee or an elbow) to study movement. Usually, this requires labeling thousands of frames. BEAST3D, having already learned the 3D shape, can find these joints much more accurately and with far less human help. It's like the computer already knows where the skeleton is hidden inside the "cloud of balloons," so it just has to point it out.
The Brain Decoder (Neural Encoding):
This is the most distinctive part. Scientists record electrical signals from the animal's brain while it moves. They want to know: Which part of the movement makes this brain cell fire?
- Old methods used simple dots (joints) to explain the brain.
- BEAST3D uses the whole "cloud." Because the cloud is anchored to specific parts of the body, scientists can look at a brain signal and say, "Ah, this neuron fires specifically when the left ear moves," rather than just "the head moves." It connects the brain to the body with much higher precision.

The Bottom Line

BEAST3D is a tool that turns flat, multi-camera videos into a rich, 3D understanding of animal movement. It does this by teaching itself to fill in the blanks of missing camera angles, creating a "cloud" of the animal that is accurate enough to track joints and decode brain activity. It bridges the gap between fancy 3D computer vision and the specific, tricky needs of neuroscience labs.

Note: The authors mention that the current version requires powerful computers (8 high-end GPUs) to train, which might be a hurdle for smaller labs, but they see this as a solvable engineering challenge for the future.

Technical Summary: BEAST3D

Problem Statement

Advances in behavioral neuroscience increasingly rely on precise 3D quantification of animal movement. However, standard single-view video recordings are fundamentally limited by self-occlusions and the inability to recover full 3D kinematics from 2D observations. While multi-view synchronized recordings offer a solution, extracting rich 3D representations from them remains challenging due to three primary limitations in current approaches:

Supervised Pose Estimation: Requires extensive, labor-intensive manual annotation of keypoints, which must be repeated for every new species or experimental setup.
Species-Specific Mesh Models: Approaches like SMAL provide rich surface representations but require species-specific template meshes and costly per-frame optimization, limiting scalability.
General-Purpose 3D Vision Models: Recent models (e.g., VGGT, E-RayZer) trained on generic internet datasets fail on specialized laboratory imagery (close-up views, controlled lighting, fixed rigs). Furthermore, these models are designed for dense, overlapping viewpoints and devote significant capacity to estimating unknown camera parameters, a process that often fails in laboratory settings where cameras are sparse (3–6 views) but accurately calibrated.

Methodology: BEAST3D

BEAST3D is a self-supervised pretraining framework designed to learn 3D visual representations from unlabeled, calibrated multi-view animal behavior videos. It addresses the gaps above by leveraging known camera parameters and using 3D Gaussian Splatting (3DGS) as an intermediate representation.

Core Architecture

The framework operates as a masked autoencoder where the goal is to reconstruct held-out views from a subset of reference views.
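
The reference/held-out split described above can be sketched in a few lines. This is a minimal illustration, not the authors' sampling code: the function name `split_views`, the camera names, and the choice of a single held-out view are all assumptions.

```python
import random

def split_views(view_ids, n_target=1, seed=None):
    """Randomly hold out n_target views; the remaining views are references."""
    if not 0 < n_target < len(view_ids):
        raise ValueError("n_target must leave at least one reference view")
    rng = random.Random(seed)
    shuffled = list(view_ids)
    rng.shuffle(shuffled)
    target = sorted(shuffled[:n_target])
    reference = sorted(shuffled[n_target:])
    return reference, target

# Hypothetical five-camera rig: the model sees four views and must
# reconstruct the fifth.
all_views = ["cam0", "cam1", "cam2", "cam3", "cam4"]
reference, target = split_views(all_views, n_target=1, seed=0)
```

Redrawing the split for each training sample would mean every camera eventually takes a turn as the held-out view. Whether the paper samples views this way is not stated in this summary.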

Input: A set of synchronized, calibrated camera views $\{I_v\}$ and corresponding camera parameters.
Image Tokenization: A frozen DINOv3 ViT-B/16 encoder extracts patch-level features from reference images.
Camera Tokenization: Camera geometry is encoded using Plücker coordinates (6D descriptors of rays) derived from ground-truth calibration data, tokenized via linear projection.
Fusion & Transformer: Image and camera tokens are fused and processed by a geometry transformer (based on VGGT). This transformer alternates between frame-level attention (processing 2D appearance) and global attention (aggregating 3D multi-view information).
3D Gaussian Prediction: The transformer outputs are decoded into per-patch 3D Gaussian parameters (position, shape, and view-dependent color).
Differentiable Rendering: The predicted 3D Gaussians are rendered using GSplat to generate images for both reference and target (held-out) views.
Segmentation: The model simultaneously predicts per-pixel alpha values to segment the animal from the background. These masks are distilled from a foundation model (SAM3) during training, eliminating the need for segmentation at inference time.
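
As an illustration of the camera tokenization step, the sketch below computes a Plücker ray descriptor for every pixel of one calibrated camera. It assumes a pinhole model with the world-to-camera convention $x_{cam} = R\,x_{world} + t$ and uses the standard definition $(d, o \times d)$, where $o$ is the camera center and $d$ the unit ray direction. The function name `plucker_rays` and the example calibration are hypothetical, and the paper's patch-level pooling and linear projection are not shown.

```python
import numpy as np

def plucker_rays(K, R, t, height, width):
    """Return an (H, W, 6) array of Plücker coordinates (d, o x d), one per pixel."""
    origin = -R.T @ t  # camera center in world coordinates
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # World-space ray directions: R^T K^{-1} p for each pixel p.
    dirs = pix @ np.linalg.inv(K).T @ R
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Moment vector o x d; np.cross broadcasts the single origin over all rays.
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)

# Hypothetical calibration for a tiny 4 x 6 image.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 0.3])
rays = plucker_rays(K, R, t, height=4, width=6)
```

A useful property of this encoding is that the direction and moment of a valid ray are always orthogonal, which gives a quick sanity check on calibration inputs.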

Training Objective

The model is trained using a self-supervised loss computed only on the held-out target views. The total loss combines:

Photometric Loss ($L_{\ell_2}$): Pixel-level differences between ground truth and rendered images.
Perceptual Loss ($L_{\text{perc}}$): Feature-level differences between ground truth and rendered images.
Mask Loss ($L_{\text{mask}}$): Differences between the rendered alpha channel and the ground-truth segmentation masks.
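
One way the three terms could be combined is as a weighted sum, sketched below. The weights `w_perc` and `w_mask`, the squared-error form of the mask term, and the toy feature extractor are assumptions made for illustration. This summary does not give the paper's weighting, and a real perceptual loss would compare features from a pretrained network rather than image gradients.

```python
import numpy as np

def total_loss(rendered_rgb, target_rgb, rendered_alpha, target_mask,
               feature_fn, w_perc=0.5, w_mask=1.0):
    """Weighted sum of photometric, perceptual, and mask terms on one target view."""
    # Photometric term: mean squared pixel difference.
    photometric = np.mean((rendered_rgb - target_rgb) ** 2)
    # Perceptual term: mean squared difference in feature space.
    perceptual = np.mean((feature_fn(rendered_rgb) - feature_fn(target_rgb)) ** 2)
    # Mask term: mean squared difference between rendered alpha and the mask.
    mask = np.mean((rendered_alpha - target_mask) ** 2)
    return photometric + w_perc * perceptual + w_mask * mask

def toy_features(image):
    """Stand-in feature extractor: horizontal image gradients."""
    return np.diff(image, axis=1)

rng = np.random.default_rng(0)
target_rgb = rng.uniform(size=(8, 8, 3))
target_mask = (rng.uniform(size=(8, 8)) > 0.5).astype(float)

# A perfect rendering scores zero; a degraded one scores higher.
perfect = total_loss(target_rgb, target_rgb, target_mask, target_mask, toy_features)
degraded = total_loss(target_rgb + 0.1, target_rgb,
                      0.5 * target_mask, target_mask, toy_features)
```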

Crucially, BEAST3D conditions directly on known camera parameters, removing the need for a camera pose estimation branch. This allows the model to focus its capacity on learning appearance and geometry features with as few as four views.

Key Contributions

Self-Supervised 3D Pretraining: BEAST3D introduces a framework that learns 3D visual features from unlabeled multi-view videos without requiring manual keypoint annotations.
Sparse-View Adaptation: Unlike general 3D models that require dense overlapping views to estimate camera geometry, BEAST3D exploits the known calibration of laboratory rigs to function effectively with sparse views (3–6 cameras).
Integrated Segmentation: The model learns to segment the animal from the background during training, producing clean foreground representations without requiring external segmentation tools at inference.
Versatile Downstream Transfer: The framework establishes a single backbone that transfers effectively to three distinct downstream tasks: novel view synthesis, multi-view pose estimation, and neural encoding.

Results

The authors evaluated BEAST3D across four datasets spanning different species (mouse, rat, chickadee, human) and environments (head-fixed, freely moving, naturalistic).

Novel View Synthesis (NVS): BEAST3D demonstrated superior performance in reconstructing held-out views compared to baselines (E-RayZer, Pose Splatter, VGGT). It achieved higher PSNR and SSIM and lower LPIPS, producing sharp reconstructions that recover subject silhouettes and fine details. Notably, E-RayZer often failed to populate the subject's location in sparse-view regimes, while BEAST3D maintained structural consistency.
Pose Estimation: When fine-tuned for pose estimation with limited labeled data (100 instances), BEAST3D outperformed single-view baselines (DINOv3, BEAST) and other multi-view models (VGGT, E-RayZer) across most datasets. The results suggest that the specific pretraining objective (novel view synthesis via 3D splatting) is more critical for transfer performance than the multi-view architecture alone.
Neural Encoding: BEAST3D features were used to predict neural activity in mouse and chickadee datasets. The dense 3D Gaussian splat representations outperformed sparse 3D keypoints and matched the predictive power of opaque CLS tokens (from BEAST) while retaining spatial interpretability. This allows researchers to identify which specific body parts or appearance details drive neural activity.
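
To illustrate why a spatially grounded representation helps with interpretation, the sketch below fits a linear encoding model on synthetic data and reads off which body region carries the largest weight. The use of ridge regression, the per-region feature layout, and all the numbers are assumptions for illustration. This summary does not specify the encoding model used in the paper.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
n_frames, n_regions = 2000, 5

# Hypothetical per-frame behavioral features, one column per body region.
X = rng.normal(size=(n_frames, n_regions))

# Synthetic neuron whose activity is driven only by region 2, plus noise.
true_w = np.array([0.0, 0.0, 2.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=n_frames)

w = fit_ridge(X, y)
top_region = int(np.argmax(np.abs(w)))  # index of the most predictive region
```

Because each column maps to a known body region, the fitted weights point directly at the region that drives the neuron. An opaque CLS-token feature vector offers no such column-to-body mapping.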

Significance and Claims

The paper claims that BEAST3D establishes a versatile framework for behavioral analysis that bridges the gap between general-purpose 3D computer vision and the specialized demands of behavioral neuroscience. By leveraging 3D structure in modern multi-view laboratory recordings, it offers:

Viewpoint-Invariant Features: Representations that are independent of camera viewpoint, providing a natural basis for relating behavior to neural activity.
Scalability: A reduction in the need for species-specific templates or extensive manual annotation.
Interpretability: A spatially grounded representation (unlike high-dimensional CLS tokens) that allows neuroscientists to interpret the behavioral drivers of neural activity.

The authors note that while BEAST3D currently requires significant computational resources for pretraining (approx. 32 hours on 8 A100 GPUs), the framework is designed to accommodate joint training across multiple datasets, potentially yielding a single, out-of-the-box backbone for the community. The work positions self-supervised 3D representation learning as a critical tool for the next generation of behavioral neuroscience.

BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting