Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

This paper presents a systematic account of the engineering challenges, design decisions, and key lessons learned in developing the Summer-22B video foundation model, emphasizing that dataset engineering and metadata-driven curation were more critical to success than architectural variations.

Simo Ryu, Chunghwan Han

Published 2026-03-03

Imagine you want to teach a robot to become a master filmmaker. You don't just hand it a camera and say, "Go make a movie." You have to build the entire school, the library, the curriculum, and the grading system from scratch.

This paper, "Summer-22B," is the story of how a team at fal.ai built a video-making AI from the ground up: rather than tweaking an existing model, they built a new one called Summer-22B, trained on about 50 million video clips.

Here is the story of their journey, explained with simple analogies.

1. The Biggest Challenge: The "Garbage In, Garbage Out" Problem

The team discovered something surprising: The architecture (the robot's brain) mattered less than the data (the robot's education).

  • The Analogy: Imagine trying to teach a student to write a novel. You could give them the most expensive pen and the most comfortable chair (the architecture), but if you feed them a diet of spam emails and broken sentences (bad data), they will never write a good book.
  • The Reality: The team spent 80% of their time cleaning and organizing the data, not designing the brain. They built a massive factory called the Lavender Data System to sort through raw video footage.
    • Shot Detection: They cut long movies into short, coherent scenes (like cutting a 2-hour movie into 30-second clips).
    • Quality Control: They threw away blurry videos, static slideshows, and clips with almost no movement.
    • Deduplication: They removed thousands of nearly identical videos so the robot didn't get bored learning the same thing twice.
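The three curation steps above can be sketched in a few lines. This is a toy illustration, not the paper's actual Lavender pipeline: the thresholds, the frame-difference cut detector, and the `dedup_key` hash are simplified stand-ins I chose for readability (production systems typically use learned embeddings and dedicated shot detectors).

```python
import numpy as np

def detect_shots(frames, threshold=30.0):
    """Split a clip (list of grayscale frames) at hard cuts, flagged when
    the mean absolute pixel difference between consecutive frames spikes.
    Returns one list of frame indices per detected shot."""
    shots, current = [], [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:          # hard cut: close the current shot
            shots.append(current)
            current = []
        current.append(i)
    shots.append(current)
    return shots

def has_motion(frames, min_mean_diff=1.0):
    """Quality filter: reject static slideshows by requiring some average
    frame-to-frame change across the whole clip."""
    diffs = [np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
             for i in range(1, len(frames))]
    return len(diffs) > 0 and float(np.mean(diffs)) >= min_mean_diff

def dedup_key(frames):
    """Crude 8x8 perceptual hash of the first frame. Clips that collide on
    this key are treated as near-duplicates and kept only once."""
    frame = frames[0].astype(float)
    h, w = frame.shape
    tiny = frame[: h - h % 8, : w - w % 8].reshape(8, (h - h % 8) // 8,
                                                   8, (w - w % 8) // 8).mean(axis=(1, 3))
    return tuple((tiny > tiny.mean()).flatten().tolist())
```

A clip would pass through all three stages: cut into shots, dropped if `has_motion` fails, then dropped again if its `dedup_key` was already seen.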

2. The "Magic Recipe" for Training (µP and Hyperspheres)

Training a giant AI is like tuning a massive orchestra. If you change the volume of one instrument, the whole song might get out of tune. Usually, you have to re-tune the whole orchestra every time you add more musicians.

  • The Analogy (µP): The team used a secret sauce called µP (Maximal Update Parameterization). Think of this as a "universal tuning fork." It allowed them to find the perfect volume settings for a small practice group (30 million parameters) and then apply those exact same settings to the full orchestra (1 billion parameters) without needing to re-tune everything.
  • The Analogy (Hypersphere Optimization): Usually, when you train an AI, the numbers inside it can grow too big or too small, causing the math to break. The team forced all the numbers to stay on a perfect "sphere" (like keeping a ball rolling on a track).
    • Why it helps: It's like putting guardrails on a highway. The AI can't drive off the road, so it doesn't need a "speed limit sign" (weight decay) to tell it to slow down. It just naturally stays on track.
    • The Breakthrough: They were the first to prove that you can use the "universal tuning fork" (µP) while driving on these "guardrails" (hypersphere constraints). It worked perfectly.
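To make the two ideas concrete, here is a minimal sketch of one simplified µP-style rule and the hypersphere "guardrail." The exact µP parameterization involves per-layer scaling of initializations and learning rates; the `mup_lr` function below shows only the best-known rule of thumb (hidden-layer learning rate shrinks inversely with width), and the row-wise renormalization is my simplified stand-in for the paper's constraint, not its actual implementation.

```python
import numpy as np

def mup_lr(base_lr, base_width, width):
    """Simplified µP transfer rule for hidden-layer matrices: widen the
    model, shrink the learning rate in proportion, so settings tuned on
    a small proxy carry over to the large model."""
    return base_lr * base_width / width

def hypersphere_step(W, grad, lr):
    """One SGD update followed by the 'guardrail': renormalize each weight
    row back onto the unit sphere. The weight norm can never drift, which
    is why no weight decay is needed to rein it in."""
    W = W - lr * grad
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # project back
    return W
```

Combining the two means the small-model sweep fixes `base_lr`, and every step at any scale ends with the projection, so stability and transferability come from the same recipe.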

3. The "Parallel Processing" Trick

When the AI generates a video, it has to do two things at once: think about the story (Attention) and draw the picture (MLP). Usually, it does one, then the other, like a chef chopping vegetables before cooking them.

  • The Analogy: The team realized they could have the chef chop and cook at the same time.
  • The Result: They built a "parallel" kitchen. This made the AI 20% faster at generating videos without making the training any harder.

4. The Results: A Cost-Effective Success

The final model, Summer-22B, was trained for a total cost of about $300,000, with roughly half of that spent on compute alone.

  • The Comparison: They tested their model against other famous video AIs (like Wan 2.2 and Veo3).
    • The Good News: Summer-22B is very good at making smooth, realistic movements and following basic physics. It's competitive with models that cost much more to train.
    • The Bad News: It's not quite as "creative" or good at following complex instructions as the biggest, most expensive models. It's a bit like a very talented student who can draw a perfect apple but struggles to invent a new type of fruit.

Key Takeaways (The "Moral of the Story")

  1. Data is King: Spending time cleaning your data is more important than spending time tweaking the model's design.
  2. Small Tests Work: You don't need to train a giant model to find the right settings. You can test on a tiny model and scale up the settings using the "universal tuning fork" (µP).
  3. Guardrails Help: Forcing the math to stay on a "sphere" makes training more stable and removes the need for complex manual adjustments.
  4. It's Accessible: You don't need billions of dollars to build a video foundation model. With smart engineering, you can do it for a few hundred thousand dollars.

In short: The team didn't just build a video AI; they built a systematic, efficient factory for making them, proving that with the right data and smart math, you can create powerful AI without breaking the bank.