Harvesting Video Foundation Models via Efficient Post-Pretraining

This paper proposes an efficient post-pretraining framework that transforms image foundation models into state-of-the-art video-language models by randomly dropping video patches and masking text, achieving high performance with minimal computational resources and data.

Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo

Published 2026-03-17

Imagine you have a brilliant art student who has spent years studying millions of paintings. They are an expert at recognizing a cat, a sunset, or a cup of coffee just by looking at a single image. This student is your Image Foundation Model (like CLIP).

Now, you want to teach this student to understand movies. Movies are just a rapid sequence of images playing one after another. But there's a problem: watching a whole movie is expensive and time-consuming. If you try to teach the student by showing them every single frame of every movie, they will get overwhelmed, and it will cost a fortune in electricity and time. Plus, movies often have a lot of "dead air"—frames that look almost exactly like the ones before them.

This paper proposes a clever, lazy (in a good way!) way to turn that image expert into a video expert without breaking the bank. They call it "Harvesting Video Foundation Models via Efficient Post-Pretraining."

Here is how they do it, using some simple analogies:

1. The "Skip-Frame" Trick (Video Patch Dropping)

Imagine you are trying to learn a dance routine by watching a video. Instead of watching every single second, you decide to ignore about 90% of what's on screen, glancing only at the key moments where the dancer jumps or spins. (Strictly speaking, the model drops random patches, small squares cut out of each frame, rather than whole frames, but the effect is the same: most of the visual input is never processed.)

  • Why do this? Because 90% of a video is just the dancer standing still or moving slightly. It's redundant.
  • The Benefit: By skipping most of the video, the computer doesn't have to do 90% of the math. This makes training 10 times faster and much cheaper. The model learns the "gist" of the action without getting bogged down in the boring details.
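The idea above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the helper name `drop_video_patches`, the 10% keep ratio, and the token shapes are all assumptions chosen for the example.

```python
import numpy as np

def drop_video_patches(patches, keep_ratio=0.1, seed=0):
    """Randomly keep a fraction of video patch tokens (hypothetical helper).

    patches: array of shape (num_patches, dim), the flattened
    spatio-temporal tokens of a video clip. Returns the surviving
    patches plus their indices, so positional information is preserved.
    """
    rng = np.random.default_rng(seed)
    num_keep = max(1, int(len(patches) * keep_ratio))
    keep_idx = np.sort(rng.choice(len(patches), size=num_keep, replace=False))
    return patches[keep_idx], keep_idx

# Example: 8 frames x 196 patches per frame, 512-dim tokens.
tokens = np.zeros((8 * 196, 512))
kept, idx = drop_video_patches(tokens, keep_ratio=0.1)
print(kept.shape)  # (156, 512): only ~10% of 1568 tokens survive
```

Because a transformer's cost grows with the number of tokens it attends over, feeding it 156 tokens instead of 1568 is where the claimed speedup comes from.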

2. The "Blindfold" Game (Text Masking)

Now, imagine you are describing a movie to a friend, but you play a game where you cover up some of the words in your description with a "BLANK" sticker. Your friend has to guess what the missing words are based on the video they are watching.

  • The Goal: This forces the computer to really understand the connection between the video and the words. It can't just guess the words; it has to look at the video to figure out if the missing word was "eating" or "sleeping."
  • The Result: This creates a deep bond between the visual and the text, making the model smarter at answering questions like, "Is the panda eating bamboo?"
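The fill-in-the-blank game is standard masked language modeling applied to captions. Here is a minimal sketch of the masking step alone; the function name `mask_caption`, the 15% mask ratio, and the `[MASK]` token string are illustrative assumptions, and the actual prediction (looking at the video to guess the word) would happen in the model, not here.

```python
import random

MASK = "[MASK]"

def mask_caption(tokens, mask_ratio=0.15, seed=0):
    """Replace a random subset of caption tokens with a mask token.

    Returns the masked caption plus (position, original word) targets
    that the model must reconstruct by looking at the video.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = []
    for p in positions:
        targets.append((p, masked[p]))
        masked[p] = MASK
    return masked, targets

masked, targets = mask_caption("a panda is eating bamboo".split())
print(masked, targets)
```

If "eating" is the word that gets masked, the model cannot recover it from grammar alone; it has to check the video to decide between "eating" and, say, "sleeping", which is exactly the cross-modal bond the section describes.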

3. The "Freeze-Frame" Strategy (Keeping the Text Expert)

The authors realized that the "text expert" part of the original image model was already incredibly smart because it was trained on a massive library of books and articles. The video data they had (WebVid-10M) was actually too small and simple to teach the text expert anything new.

So, they decided to freeze the text expert. They kept the text brain exactly as it was and only taught the video brain the new tricks. This prevented the model from "forgetting" its language skills while trying to learn video.
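In a deep-learning framework, freezing usually means excluding the text encoder's parameters from gradient updates (in PyTorch, setting `requires_grad=False` on them). The dependency-free toy below makes the idea concrete with a hand-rolled SGD step; the parameter names and numbers are invented for illustration.

```python
def sgd_step(params, grads, lr, frozen):
    """One gradient-descent step that skips frozen parameter groups.

    A toy stand-in for freezing an encoder in a real framework:
    frozen parameters keep their pretrained values, everything
    else moves against its gradient.
    """
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

params = {"text_encoder.w": 1.0, "video_encoder.w": 1.0}
grads  = {"text_encoder.w": 0.5, "video_encoder.w": 0.5}
params = sgd_step(params, grads, lr=0.1, frozen={"text_encoder.w"})
print(params)  # text weight unchanged at 1.0, video weight updated to 0.95
```

The frozen text weight never drifts, no matter how noisy the video captions are, which is how the model keeps its language skills intact.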

The Results: A Super-Efficient Video Brain

The best part? They did all of this in less than one day using just 8 standard graphics cards.

  • Other methods are like trying to build a skyscraper from scratch: they need massive teams, years of work, and huge budgets to train video models from zero.
  • This method is like taking a finished, beautiful house (the image model) and just adding a second floor (the video capabilities). It's cheap, fast, and the house is just as strong.

Why Does This Matter?

  1. Accessibility: You don't need a billion-dollar budget to build a powerful video AI. Small labs and universities can now do it.
  2. Sustainability: Because it's so efficient, it uses way less electricity, which is better for the planet.
  3. Performance: Surprisingly, this "lazy" method works just as well as the expensive, heavy-duty models on tasks like:
    • Zero-shot: Identifying new things it's never seen before (e.g., "Find a video of a cat juggling").
    • Retrieval: Finding the right video for a text search.
    • Question Answering: Answering questions about what happened in a video.

The Big Takeaway

The paper suggests that maybe we don't need to reinvent the wheel for video AI. We can just take the amazing image models we already have, give them a "skip-frame" workout, play a "fill-in-the-blank" game with them, and suddenly, they become world-class video experts.

It's a reminder that sometimes, the smartest solution isn't to work harder, but to work smarter, by realizing that video is mostly just a lot of images with a great deal of redundancy between them.