Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Tele-Omni is a unified multimodal framework that leverages pretrained large language models to parse diverse text, image, and video instructions for a single diffusion-based model, enabling flexible and high-quality video generation and editing across multiple tasks within a cohesive system.

Jialun Liu, Tian Li, Xiao Cao, Yukuo Ma, Gonghu Shang, Haibin Huang, Chi Zhang, Xiangzhen Chang, Zhiyong Huang, Jiakui Hu, Zuoxin Li, Yuanzhi Liang, Cong Liu, Junqi Liu, Robby T. Tan, Haitong Tang, Qizhen Weng, Yifan Xu, Liying Yang, Xiaoyan Yang, Peng Yu, Shiwen Zhang, Xuelong Li

Published 2026-02-24

Imagine you have a magical movie studio in your living room. Right now, most AI video tools are like specialized workers: one worker is great at making movies from a script (text), another is great at animating a sketch (image), and a third is great at editing out a background. But if you want to do something complex—like "Take this photo of a cat, put it in a cyberpunk city, make it rain, and then edit the video to make the cat wear sunglasses"—you usually have to hire three different workers and hope they talk to each other. They often don't.

Tele-Omni is the solution to this problem. Think of it as hiring a Super-Producer who can do everything in one go.

Here is a simple breakdown of how Tele-Omni works, using everyday analogies:

1. The Two-Brain System

Tele-Omni isn't just one big brain; it's a team of two specialists working in perfect sync:

  • The "Director" (The MLLM): This is the brain that understands language, images, and videos. It's like a seasoned film director who reads your messy notes ("Make the car red, but keep the rain falling") and looks at your reference photos. The Director doesn't paint the frames; instead, they write a very precise shot list and storyboard. They figure out what you want and how to do it.
  • The "Cinematographer" (The Diffusion Model): This is the artist who actually paints the movie. They take the Director's shot list and the reference photos and turn them into the finished video. They are incredibly skilled at making things look real and keeping the motion smooth.

The Magic: In older systems, the Director and Cinematographer spoke different languages. In Tele-Omni, they speak the same language. The Director translates your complex, mixed instructions (text + photos + video clips) into a clear plan that the Cinematographer can execute perfectly.
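To make the hand-off concrete, here is a minimal sketch of the Director-to-Cinematographer flow. It is a toy illustration, not the authors' implementation: the names `Instruction`, `Plan`, `director`, and `cinematographer` are hypothetical, and the "video" it returns is just a list of text labels standing in for frames.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instruction:
    """What the user hands over: text plus optional photos and clips."""
    text: str
    images: List[str] = field(default_factory=list)  # hypothetical file paths
    videos: List[str] = field(default_factory=list)  # hypothetical file paths


@dataclass
class Plan:
    """The Director's 'shot list': one structured object for the generator."""
    prompt: str
    reference_images: List[str]
    source_video: Optional[str]


def director(instruction: Instruction) -> Plan:
    """Stand-in for the MLLM: fold mixed inputs into a single plan."""
    source = instruction.videos[0] if instruction.videos else None
    return Plan(
        prompt=instruction.text.strip(),
        reference_images=list(instruction.images),
        source_video=source,
    )


def cinematographer(plan: Plan, num_frames: int = 4) -> List[str]:
    """Stand-in for the diffusion model: returns labels instead of pixels."""
    base = plan.source_video or "noise"  # start from a clip, or from scratch
    return [f"{base}:{plan.prompt}:frame{i}" for i in range(num_frames)]


frames = cinematographer(director(Instruction("make the car red", videos=["clip.mp4"])))
```

The point of the sketch is the single `Plan` object: whatever mix of inputs arrives, the generator only ever sees one consistent format.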

2. The "Swiss Army Knife" of Video Tasks

Most video AI tools are like a single-purpose screwdriver. Tele-Omni is a Swiss Army Knife. Because the Director understands the intent behind your request, Tele-Omni can handle a huge variety of jobs without needing to be retrained for each one:

  • Text-to-Video: You say, "A dragon flying over a castle," and it makes it.
  • Image-to-Video: You upload a photo of a sleeping cat, and it makes the cat wake up and stretch.
  • First-and-Last-Frame: You give it a photo of a person sitting down and a photo of them standing up. The Director figures out the "middle" of the story and fills in the movement smoothly.
  • Video Editing: You can say, "Remove that ugly trash can from the background," or "Change the summer clothes to winter coats," and it edits the video while keeping the rest of the scene perfect.
  • In-Context Generation: You can show a reference video of a dance and say, "Do this dance, but with a robot," and it copies the motion but changes the character.
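As a rough illustration of how one system can cover all five jobs, the sketch below guesses the task from what the user supplied. This is a rule-based stand-in: in Tele-Omni the Director is described as inferring intent from the instruction itself, and the `guess_task` function and its `video_is_reference` flag are hypothetical names invented for this example.

```python
from enum import Enum
from typing import Sequence


class Task(Enum):
    TEXT_TO_VIDEO = "text_to_video"
    IMAGE_TO_VIDEO = "image_to_video"
    FIRST_LAST_FRAME = "first_last_frame"
    VIDEO_EDITING = "video_editing"
    IN_CONTEXT = "in_context_generation"


def guess_task(
    images: Sequence[str] = (),
    videos: Sequence[str] = (),
    video_is_reference: bool = False,  # assumption: caller says how the clip is used
) -> Task:
    """Pick a task from the inputs provided (a simplification of intent parsing)."""
    if videos:
        # A clip is either the thing to edit or a reference to imitate.
        return Task.IN_CONTEXT if video_is_reference else Task.VIDEO_EDITING
    if len(images) >= 2:
        return Task.FIRST_LAST_FRAME  # start photo and end photo
    if len(images) == 1:
        return Task.IMAGE_TO_VIDEO
    return Task.TEXT_TO_VIDEO  # text only


task = guess_task(images=["sitting.jpg", "standing.jpg"])
```

Counting inputs is far cruder than reading the instruction, but it shows why a shared front end removes the need for a separate tool per task.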

3. The Secret Sauce: "Task-Aware" Training

You might wonder, "How does the AI know the difference between 'making a new video' and 'editing an old one'?"

Usually, if you mix different types of training data (like mixing cake batter with car engine oil), the AI gets confused. Tele-Omni uses a special Data Pipeline (a recipe for training) that organizes everything neatly.

Think of it like a universal remote control. Instead of having a separate remote for the TV, the stereo, and the AC, Tele-Omni teaches the AI to recognize the buttons you press.

  • If you press "Make New," it knows to create from scratch.
  • If you press "Edit," it knows to keep the background and change the subject.
  • If you press "Reference," it knows to copy the style or motion from a provided clip.

The system is trained on a massive, organized library of examples so it learns to distinguish these tasks instantly, just by looking at your instructions.
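One way to picture an "organized library" is a pile of training examples that each carry a task tag, sorted into buckets and mixed evenly into batches. The sketch below assumes exactly that; the tag names mirror the three buttons above, and the round-robin mixing, `group_by_task`, and `sample_batch` are assumptions made for illustration, since this summary does not describe the actual pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Assumed tags, mirroring the "Make New", "Edit", and "Reference" buttons.
TASK_TAGS = ("make_new", "edit", "reference")


def group_by_task(samples: List[dict]) -> Dict[str, List[dict]]:
    """Sort tagged examples into one bucket per task, rejecting unknown tags."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        if sample["task"] not in TASK_TAGS:
            raise ValueError(f"unknown task tag: {sample['task']}")
        buckets[sample["task"]].append(sample)
    return dict(buckets)


def sample_batch(buckets: Dict[str, List[dict]], batch_size: int,
                 rng: random.Random) -> List[dict]:
    """Build a batch by cycling through tasks so no single task dominates."""
    tasks = sorted(buckets)
    return [rng.choice(buckets[tasks[i % len(tasks)]]) for i in range(batch_size)]


samples = [
    {"task": "make_new", "text": "A dragon flying over a castle"},
    {"task": "edit", "text": "Remove the trash can"},
    {"task": "reference", "text": "Do this dance, but with a robot"},
    {"task": "edit", "text": "Change summer clothes to winter coats"},
]
batch = sample_batch(group_by_task(samples), batch_size=6, rng=random.Random(0))
```

Because every example keeps its tag, the model always sees which kind of job it is being shown, which is the idea behind the remote-control analogy.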

4. Why This Matters

Before Tele-Omni, if you wanted to make a complex video, you had to jump between different tools, losing quality and consistency along the way. It was like trying to build a house by switching between a hammer, a saw, and a trowel every five minutes.

Tele-Omni lets you stay in one room. You give it a mix of instructions—text, photos, and video clips—and it acts as a unified production team. It understands that "make the sky blue" applies to the whole video, while "make the car red" only applies to the car, all while keeping the movement smooth and the lighting realistic.

In short: Tele-Omni is designed to act like a true creative partner, capable of understanding your messy, multi-part ideas and turning them into a single, high-quality movie without needing a different tool for every step.
