Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model

Imagine you want to make a movie starring your favorite toy, your pet cat, or even yourself. You have a few photos of them, and you want to tell a story: "The cat is riding a skateboard through a neon city."

In the past, AI video generators were like clumsy directors. If you showed them a photo of a cat, they might make a video where the cat stays exactly the same, frozen in the photo, or they might accidentally copy the messy bedroom from the background of your photo into the neon city scene. If you tried to put two characters in the scene (like a cat and a dog), the AI would often get confused, mixing them up or making them look like a weird monster.

Kaleido is a new, open-source AI model that acts like a super-talented, hyper-organized film director who solves these problems. Here is how it works, broken down into simple concepts:

1. The Problem: The "Bad Copy-Paste" Director

Current AI models often suffer from two main issues:

The Background Stalker: If you show a photo of a person in a messy room, the AI thinks the messy room is part of the person. So, when you ask for them to be on a beach, the AI still tries to put the messy room furniture on the sand.
The Identity Crisis: If you show photos of two different people, the AI gets confused. It might blend their faces together or forget who is who halfway through the video.

2. The Solution: Kaleido's Two Secret Weapons

Weapon A: The "Mix-and-Match" Training Camp (Data Construction)

To teach the AI how to be a good director, the researchers didn't just feed it random videos. They built a special training pipeline.

The Analogy: Imagine you are teaching a student to draw a horse. If you only show them photos of horses standing in a stable, they will think "horse" always means "horse + stable."
What Kaleido does: The researchers took thousands of videos, cut out the subjects (the "stars"), and then swapped their backgrounds. They took a photo of a dog, erased the park behind it, and pasted the dog onto a beach, a spaceship, and a kitchen.
The Result: The AI learns that the dog is the important part, and the background is just a costume that can be changed. It also learned to mix and match different subjects (cross-pairing) so it knows how to handle a scene with a cat and a dog without them turning into a cat-dog hybrid.
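The background-swapping idea can be sketched in a few lines. This is only a toy illustration, not the authors' pipeline: the "images" are tiny grids of labeled pixels, and the names `composite` and `swap_backgrounds` are hypothetical. It assumes a subject mask is already available (in the real pipeline that would come from a segmentation model).

```python
from typing import List, Sequence

Image = List[List[str]]  # toy "image": a grid of labeled pixels


def composite(subject: Image, mask: List[List[bool]], background: Image) -> Image:
    """Keep subject pixels where the mask is True; use background pixels elsewhere."""
    return [
        [s if m else b for s, m, b in zip(s_row, m_row, b_row)]
        for s_row, m_row, b_row in zip(subject, mask, background)
    ]


def swap_backgrounds(
    subject: Image, mask: List[List[bool]], backgrounds: Sequence[Image]
) -> List[Image]:
    """Paste the same masked subject onto every candidate background."""
    return [composite(subject, mask, bg) for bg in backgrounds]


# A 2x2 "photo" of a dog in a park, plus the mask marking the dog's pixels.
photo = [["park", "dog"], ["dog", "park"]]
mask = [[False, True], [True, False]]
beach = [["sand", "sand"], ["sea", "sea"]]
kitchen = [["tile", "tile"], ["tile", "tile"]]

variants = swap_backgrounds(photo, mask, [beach, kitchen])
```

Every variant keeps the dog pixels and none keeps the park, which is the pattern the training data is meant to teach.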

Weapon B: The "Name Tag" System (R-RoPE)

When you give the AI multiple photos (e.g., one of a man, one of a woman, one of a car), the AI needs to know which pixel belongs to which character.

The Analogy: Imagine a crowded party where everyone is wearing the same gray suit. If you shout "Dance!" everyone dances, but you can't tell who is who.
What Kaleido does: It gives every reference photo a special digital name tag (called Reference Rotary Positional Encoding or R-RoPE).
How it works: Instead of just shoving the photos into the AI's brain, Kaleido tells the AI: "This photo is the Man, and he lives in 'Zone A'. This photo is the Woman, and she lives in 'Zone B'."
The Result: The AI is far less likely to get confused. It can tell which character to keep consistent and which background to ignore, even when there are many characters in the scene.

3. The Results: What Can Kaleido Do?

According to the paper's evaluations, these two upgrades make Kaleido the strongest open-source video generator for this specific task.

Consistency: If you show a photo of a specific toy, the toy in the video looks exactly like the toy in the photo, not a generic toy.
Disentanglement: If you ask for the toy to be in a forest, the AI creates a forest. It doesn't accidentally paste the toy's original bedroom into the forest.
Multi-Subject: You can have a man, a woman, and a dog all interacting in the same video, and they all stay true to their original photos.

The Bottom Line

Think of previous AI video tools as a photocopier that just copies the whole picture (subject + background) and tries to animate it.

Kaleido is like a master puppeteer. It takes your photos, carefully separates the "puppets" (the subjects) from the "stage" (the background), and then lets you direct the play. You can change the stage, add new actors, and tell a story, and the puppets will look exactly like the ones you brought in.

The best part? The creators have shared the "puppeteer's manual" (the code and data) with the world, so anyone can use it to create their own movies.

1. Problem Statement

The paper addresses the limitations of current Subject-to-Video (S2V) generation models, specifically the difficulty in maintaining multi-subject consistency and achieving background disentanglement when conditioned on multiple reference images.

Existing open-source approaches suffer from two primary issues:

Data Limitations: Training datasets often lack diversity and high-quality samples. Crucially, they frequently contain "entangled" data where reference images are naively selected from video frames. This causes models to overfit to specific backgrounds or incidental objects in the reference images rather than learning the intrinsic characteristics of the subject.
Inadequate Conditioning Strategies: Current methods for injecting reference image information (e.g., simple latent concatenation or adapter-based architectures) often fail to distinguish between video tokens and reference image tokens. This leads to token disorder, spatial overlap of different subjects, and "semantic drift," where the model confuses multiple subjects or fails to separate the subject from the background.

2. Methodology

Kaleido proposes a comprehensive framework comprising a novel data construction pipeline and an improved model architecture.

A. Data Construction Pipeline

To overcome data scarcity and entanglement, the authors designed a scalable, multi-stage pipeline:

Preprocessing & Captioning: Raw videos are sliced into coherent clips and captioned using VLMs.
Subject Discovery: A broad taxonomy (100+ categories, 800+ synonyms) is used to identify candidate subjects automatically.
Grounding & Segmentation: Combines Grounding DINO for localization and SAM (Segment Anything Model) for precise segmentation to isolate subjects.
Strict Filtering: Removes low-quality samples based on size, overlap (IoU), brightness, blur, and face validity (using InsightFace for humans).
Background Disentanglement (Key Innovation): The pipeline uses inpainting to erase background information from segmented reference images. This forces the model to learn the subject's appearance independently of the original context.
Cross-Paired Data Construction: The pipeline generates "cross-paired" samples by mixing subjects from different images with new backgrounds and poses (using Flux Redux for pose enrichment). This prevents the model from memorizing specific subject-background pairings.
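The size and overlap checks in the filtering stage can be illustrated with a minimal sketch. The thresholds (`min_area_ratio`, `max_iou`), the `Box` type, and the rule of dropping every box that overlaps another too heavily are assumptions made for illustration; the summary above does not state the values or the exact rule the authors use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def filter_subjects(
    boxes: List[Box],
    frame_area: float,
    min_area_ratio: float = 0.02,  # hypothetical threshold
    max_iou: float = 0.5,          # hypothetical threshold
) -> List[Box]:
    """Drop boxes that are too small or that overlap another candidate too much."""
    sized = [b for b in boxes if b.area() / frame_area >= min_area_ratio]
    return [
        b
        for i, b in enumerate(sized)
        if all(iou(b, o) <= max_iou for j, o in enumerate(sized) if j != i)
    ]


a = Box(0, 0, 100, 100)
b = Box(10, 10, 110, 110)     # overlaps heavily with a
c = Box(200, 200, 300, 300)   # isolated, large enough
d = Box(0, 0, 5, 5)           # too small
kept = filter_subjects([a, b, c, d], frame_area=400 * 400)
```

In this toy frame only `c` survives: `d` fails the size check, and `a` and `b` overlap each other beyond the assumed IoU limit.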

B. Model Architecture & R-RoPE

The model is built upon a Diffusion Transformer (DiT) backbone (fine-tuned from Wan2.1-T2V-14B).

Condition Injection: Instead of complex adapters, Kaleido uses a simple concatenation scheme, merging reference image tokens and video noise tokens along the sequence dimension.
Reference Rotary Positional Encoding (R-RoPE): To solve the confusion caused by simple concatenation, the authors introduce R-RoPE.
- Standard video tokens use 3D RoPE with coordinates $(t, h, w)$.
- Reference image tokens are assigned shifted spatial coordinates. Their spatial dimensions start from $(H_{max}, W_{max})$ of the video sequence, and their temporal positions are individually assigned starting from $t=0$.
- Effect: This creates a distinct "embedding space" for reference images, preventing the model from misinterpreting them as consecutive video frames or causing spatial overlap between multiple subjects.
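The coordinate assignment described above can be sketched as follows. This is a reading of the summary, not the released implementation: it only builds the integer $(t, h, w)$ position IDs that a 3D RoPE would consume, and it assumes that "individually assigned starting from $t=0$" means the k-th reference image receives temporal index k. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int, int]  # (t, h, w)


def video_positions(T: int, H: int, W: int) -> List[Position]:
    """Standard 3D grid of positions for the video tokens."""
    return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]


def reference_positions(
    ref_sizes: Sequence[Tuple[int, int]], H_max: int, W_max: int
) -> List[List[Position]]:
    """Positions for each reference image's tokens.

    Spatial coordinates are shifted to start at (H_max, W_max), outside the
    video's spatial grid. Assumption: reference k gets temporal index k.
    """
    return [
        [(k, H_max + h, W_max + w) for h in range(h_ref) for w in range(w_ref)]
        for k, (h_ref, w_ref) in enumerate(ref_sizes)
    ]


# Toy sizes: a 2-frame video with a 3x4 latent grid, plus two reference images.
video = video_positions(T=2, H=3, W=4)
refs = reference_positions([(2, 2), (3, 1)], H_max=3, W_max=4)

# All tokens are then concatenated along the sequence dimension.
sequence = video + [p for ref in refs for p in ref]
```

Because the spatial offset places reference tokens outside the video grid and each reference has its own temporal index, no two tokens in the concatenated sequence share a position.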

3. Key Contributions

Comprehensive Data Pipeline: A novel pipeline that generates high-quality, background-disentangled, and cross-paired training data, addressing the lack of diverse S2V datasets.
R-RoPE Mechanism: A specialized positional encoding strategy that enables stable integration of multiple reference images without architectural bloat, significantly improving multi-subject consistency.
State-of-the-Art Open-Source Model: Kaleido achieves performance comparable to closed-source commercial models (like Kling and Vidu) while remaining open-source, setting a new benchmark for S2V generation.

4. Experimental Results

The authors evaluated Kaleido against top closed-source (Vidu Q1, Kling) and open-source (VACE, Phantom, SkyReels) models.

Quantitative Metrics:
- S2V Consistency: Kaleido achieved the highest score (0.723), outperforming all competitors, indicating superior preservation of subject identity.
- S2V Decoupling: Kaleido scored 0.319 (higher is better), demonstrating the best ability to separate subjects from irrelevant background information.
- Face Similarity: On human subjects, Kaleido achieved an average FaceSim of 0.504, surpassing other open-source models and slightly outperforming the closed-source Kling (0.495).
- General Quality: It matched or exceeded closed-source models in motion smoothness, aesthetic quality, and text alignment.
Qualitative Results:
- Visual comparisons show that Kaleido successfully generates videos with multiple subjects without overlap or confusion.
- Unlike VACE (which often carries over background artifacts) or Vidu (which sometimes repeats subjects), Kaleido maintains clean subject-background separation and consistent identity across frames.
User Study: Human raters consistently preferred Kaleido over both open-source and closed-source baselines in terms of video quality, prompt alignment, and subject consistency.

5. Significance

Kaleido is a substantial step forward for open-source video generation. By addressing two critical bottlenecks, multi-subject consistency and background disentanglement, it narrows the performance gap between open-source and proprietary commercial models. The release of both the training data pipeline and the model checkpoints gives the research community a foundation to build on and enables applications in e-commerce, advertising, and digital content creation, where precise subject control is essential. R-RoPE, in turn, offers a lightweight yet effective way to integrate multiple visual conditions into diffusion transformers.
