Imagine you have an incredibly talented but slightly clumsy video director. This director (the AI model) is amazing at creating beautiful scenes, but if you ask them to do something specific like "Make a robot and a wizard sneak up on each other while four pandas eat bamboo," they often get confused. They might make the robot walk backward, forget the pandas, or make the wizard float in the wrong direction. They struggle with composition—putting multiple specific things together in a specific way.
The paper introduces a new method called TTOM (Test-Time Optimization and Memorization) to fix this. Think of TTOM as giving this director a smart assistant and a personal library of past successes.
Here is how it works, broken down into simple concepts:
1. The Problem: The "One-Off" Struggle
Usually, when you ask an AI to make a video, it treats every request as a brand-new, isolated event. It doesn't remember that it just made a video about a "robot walking left" five minutes ago. It has to figure out the physics and logic of "robot walking left" from scratch every single time. This leads to mistakes in complex scenes.
2. The Solution: TTOM's Two Superpowers
Superpower A: The "Rehearsal" (Test-Time Optimization)
Instead of just blindly generating the video, TTOM pauses before the final cut.
- The Analogy: Imagine the director gets the script ("Robot and wizard sneak up..."). Before filming, they run a quick rehearsal.
- How it works: The AI uses a smart helper (a Large Language Model) to draw a rough map of where everything should be (a "layout"). Then, it tweaks the video generation process slightly to make sure the robot actually moves left and the wizard moves right, matching that map.
- The Result: The video is generated with much higher precision because the AI "rehearsed" the specific movements just for this request.
Superpower B: The "Personal Library" (Parametric Memorization)
This is the game-changer. Once the rehearsal is done and the video is generated, TTOM doesn't throw the "rehearsal notes" away.
- The Analogy: Imagine the director keeps a library of "cheat sheets." If the next prompt is "A cat walks left," the director doesn't start from zero. They look in the library, find the "Cat walking left" cheat sheet, and use it as a starting point.
- How it works:
- Insert: If the AI solves a new problem (e.g., "A blue bird flies up"), it saves the solution (the specific settings it tweaked) into its memory library.
- Read: If a new prompt comes in that is similar (e.g., "A red bird flies up"), the AI grabs the "blue bird" settings, loads them, and uses them as a head start.
- Update: If the new bird flies slightly differently, the AI tweaks the cheat sheet and saves the improved version back to the library.
- Delete: If the library gets too full, it throws out the oldest or least-used cheat sheets to make room for new ones.
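The four memory operations above can be sketched as a small class. This is a simplified, hypothetical illustration: real TTOM stores optimized model parameters and retrieves them by prompt similarity, whereas here word-overlap (Jaccard) similarity stands in for a learned prompt embedding, and an `OrderedDict` serves as a least-recently-used store.

```python
# Toy sketch of TTOM's parametric memory (hypothetical, simplified).
from collections import OrderedDict

def similarity(a, b):
    """Jaccard overlap of prompt words — a stand-in for prompt embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

class TTOMMemory:
    def __init__(self, capacity=100, threshold=0.5):
        self.store = OrderedDict()   # prompt -> optimized parameters
        self.capacity = capacity
        self.threshold = threshold

    def read(self, prompt):
        """Return params of the most similar stored prompt, or None."""
        best = max(self.store, key=lambda p: similarity(p, prompt), default=None)
        if best is None or similarity(best, prompt) < self.threshold:
            return None
        self.store.move_to_end(best)          # mark as recently used
        return self.store[best]

    def insert(self, prompt, params):
        """Save new params; delete the least-recently-used entry if full."""
        self.store[prompt] = params
        self.store.move_to_end(prompt)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict oldest "cheat sheet"

    def update(self, prompt, params):
        """Overwrite params for a prompt (same mechanics in this sketch)."""
        self.insert(prompt, params)

mem = TTOMMemory()
mem.insert("a blue bird flies up", {"offset": -0.5})
warm_start = mem.read("a red bird flies up")   # similar prompt → reuse params
print(warm_start)  # {'offset': -0.5}
```

A dissimilar prompt (say, "four pandas eat bamboo") falls below the similarity threshold and returns `None`, so the model falls back to a fresh rehearsal — exactly the insert/read/update/delete flow described above.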
3. Why This is a Big Deal
- It Learns as it Goes: Unlike other methods that forget everything after the video is made, TTOM gets smarter with every video it creates. It builds a "world knowledge" of how objects move and interact.
- It's Fast: Because it can just "read" a cheat sheet from its library instead of doing a full rehearsal every time, it becomes very efficient for similar requests.
- It Fixes the "Clumsy" Parts: The paper shows that this method drastically improves the AI's ability to handle numbers (counting pandas), spatial relationships (who is to the left of whom), and complex motions.
Summary
Think of TTOM as turning a video generator from a genius who forgets everything into a seasoned professional with a massive experience log.
- Plan: It draws a map of what should happen.
- Rehearse: It tweaks the video to match the map perfectly.
- Remember: It saves the "perfect tweak" in a library.
- Reuse: Next time, it pulls that tweak from the library to make the new video even better and faster.
The result? Videos that actually follow your instructions, even when those instructions are complex and involve many moving parts.