MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search

Imagine you are trying to teach a team of two very different experts to work together on a complex project: a Vision Expert (who sees pictures) and a Language Expert (who reads and writes).

In the world of Artificial Intelligence, these two experts are part of a "Multimodal Large Language Model" (MLLM). To make them work together on a new task (like answering science questions about images), you need to "fine-tune" them.

The Problem: The "Tug-of-War"

The paper identifies a major problem with how we usually do this. Currently, we often treat both experts the same way, giving them the same "training intensity."

But here's the catch: They learn at different speeds.

The Vision Expert might be slow to learn because it's used to just looking at static images.
The Language Expert might be a fast learner, used to processing text quickly.

If you force them to train at the same speed, chaos ensues:

The Slow Learner Bottleneck: If the Vision Expert is too slow, the Language Expert gets bored and starts guessing, ruining the team's performance.
The Fast Learner's Panic: If the Language Expert is too fast, it starts memorizing the training data (overfitting) before the Vision Expert has even figured out what it's looking at. This causes the whole team to oscillate and fail.

The Old Solution:
Previously, researchers tried to fix this by manually adjusting the "learning rate" (how fast they learn) for each expert. This is like a coach constantly running back and forth, shouting, "You, slow down!" and "You, speed up!" It's exhausting, time-consuming, and relies on a lot of guesswork.

The Solution: MARS (The Smart Coach)

The authors introduce MARS (Multimodal Adaptive Rank Search). Instead of just telling the experts to speed up or slow down, MARS changes how much capacity each expert has to learn.

Think of the "Rank" in LoRA (the method used to fine-tune) as the size of the notebook each expert gets to write their notes in.

A small notebook (low rank) means the expert can only learn a few key concepts. They learn slowly but don't get confused.
A huge notebook (high rank) means the expert can learn everything instantly, but they might get overwhelmed or memorize the wrong things.

MARS's Superpower:
MARS acts like a genius coach who knows exactly how big a notebook each expert needs so they finish learning at the exact same time.

How MARS Works: The "Crystal Ball" Strategy

Finding the perfect notebook size for each expert usually requires trying thousands of combinations, which takes years of computer time. MARS avoids this by using two "Crystal Balls" (Scaling Laws):

Crystal Ball #1: The "Time" Predictor (Scaling Law-C)
- This predicts how long it will take an expert to finish their training based on the size of their notebook and the amount of homework (data).
- The Magic: MARS uses this to instantly calculate: "If the Language Expert gets a notebook of size 32, the Vision Expert must get a notebook of size 16 to finish at the same time."
- This eliminates 99% of the bad combinations immediately.
Crystal Ball #2: The "Grade" Predictor (Scaling Law-P)
- Once MARS has a list of "balanced" teams (where both finish at the same time), it uses this second crystal ball to predict which team will get the best final grade.
- It picks the notebook sizes that will result in the highest accuracy.

The Result: A Perfectly Balanced Team

By using these predictions, MARS finds the perfect "notebook sizes" (ranks) for the Vision and Language experts without needing to test every single possibility.

Why is this amazing?

It's Fast: It cuts total search and fine-tuning time by about 11.5x on average compared with naive exhaustive search.
It's Smarter: It consistently beats other methods, improving accuracy on science questions by up to 12% and making the models more stable.
It's Automatic: You don't need a human coach running around shouting instructions; the system figures it out automatically.

In a Nutshell

Imagine a relay race where one runner is a sprinter and the other is a marathoner.

Old Way: You tell the sprinter to jog and the marathoner to sprint, hoping they arrive at the baton exchange at the same time. It's a mess.
MARS Way: MARS calculates exactly how much water and food (capacity) each runner needs so they naturally arrive at the exchange point at the exact same moment, maximizing the team's speed without anyone getting exhausted or bored.

MARS is the ultimate tool for harmonizing AI teams, ensuring that the "eyes" and the "brain" learn together, perfectly in sync.

1. Problem Statement

Multimodal Large Language Models (MLLMs) require comprehensive fine-tuning of their constituent modules (Vision Encoder, Projector, and LLM backbone) to achieve state-of-the-art performance. However, current parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), often apply a uniform rank across all modules or rely on heuristic differential learning rates.

This approach fails to address the core issue of imbalanced training dynamics:

The Disparity: Different modules (e.g., Vision Encoder vs. LLM) have distinct parameter scales and domain gaps, leading to different learning capacities and convergence speeds.
The Consequence: When modules converge at different rates, it causes negative interference.
- If the Vision Encoder (VE) is under-adapted (slow), it creates a performance bottleneck.
- If the LLM is under-adapted (slow) while the VE is fast, it causes training oscillations and instability.
The Limitation of Current Solutions: Manually tuning differential learning rates is laborious, relies on trial-and-error, and does not fundamentally address the adaptation capacity of the modules. Furthermore, an exhaustive search for optimal LoRA rank pairs is computationally prohibitive due to the vast combinatorial search space.

2. Methodology: MARS

The authors propose MARS (Multimodal Adaptive Rank Search), an automated framework that discovers optimal, modality-specific LoRA rank pairs to balance training dynamics. The core innovation is the use of Dual Scaling Laws to guide the search, replacing exhaustive grid search with a predictive, data-driven approach.

A. Dual Scaling Laws

MARS formulates two predictive models based on empirical observations:

Scaling Law-P (Performance):
- Goal: Predicts the final task performance (e.g., perplexity or accuracy).
- Formulation: $\hat{L}(r_{ve}, r_{llm}, D_f) = A \cdot \frac{1}{(r_{ve})^{\alpha_m} (r_{llm})^{\alpha_l} D_f^\beta} + E$
- Insight: Performance is not just a function of total parameters but depends on the interaction between VE and LLM ranks. An imbalance between ranks leads to suboptimal performance, especially on large datasets.
Scaling Law-C (Convergence):
- Goal: Estimates the number of training iterations ($t_i$) required for a specific module $i$ to converge.
- Formulation: $t_i(r_i, D_f) = k_i \cdot (r_i)^{\gamma_i} \cdot D_f^{\delta_i} + E_i$
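- Insight: Convergence time scales inversely with rank (higher rank = faster convergence, which in the formulation above corresponds to a negative fitted exponent $\gamma_i$) and positively with dataset size (a positive $\delta_i$).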
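The two formulas above can be written as small predictor functions. The following is a minimal sketch, not the authors' code: the class names `PerfLaw` and `ConvLaw` and every coefficient value are hypothetical and chosen only to show how the laws behave once their coefficients have been fitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfLaw:
    """Scaling Law-P: L_hat = A / (r_ve**alpha_m * r_llm**alpha_l * D_f**beta) + E."""
    A: float
    alpha_m: float
    alpha_l: float
    beta: float
    E: float

    def predict(self, r_ve: float, r_llm: float, d_f: float) -> float:
        denom = (r_ve ** self.alpha_m) * (r_llm ** self.alpha_l) * (d_f ** self.beta)
        return self.A / denom + self.E


@dataclass(frozen=True)
class ConvLaw:
    """Scaling Law-C for one module: t_i = k * r**gamma * D_f**delta + E."""
    k: float
    gamma: float
    delta: float
    E: float

    def predict(self, r: float, d_f: float) -> float:
        return self.k * (r ** self.gamma) * (d_f ** self.delta) + self.E


# Hypothetical coefficients, for illustration only (not values from the paper).
perf = PerfLaw(A=50.0, alpha_m=0.2, alpha_l=0.3, beta=0.1, E=1.5)
conv_ve = ConvLaw(k=40.0, gamma=-0.5, delta=0.6, E=10.0)

# Predicted loss at two rank settings on the same dataset size.
loss_small = perf.predict(r_ve=8, r_llm=8, d_f=10_000)
loss_large = perf.predict(r_ve=32, r_llm=32, d_f=10_000)

# Predicted iterations to converge for the vision encoder at two ranks.
steps_low_rank = conv_ve.predict(r=8, d_f=10_000)
steps_high_rank = conv_ve.predict(r=32, d_f=10_000)
```

With a negative `gamma`, as assumed here, the higher rank yields the smaller predicted iteration count, matching the "higher rank = faster convergence" insight.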

B. The Search Algorithm

MARS operates in two phases to find the optimal rank pair $(r^*_{ve}, r^*_{llm})$:

Pruning via Convergence Balancing (Scaling Law-C):
- Instead of searching all combinations, MARS enforces a balance condition: $t_{ve} \approx t_{llm}$.
- It solves for the ideal $r_{ve}$ given a candidate $r_{llm}$ using the derived convergence equation. This drastically prunes the search space to only "convergence-aligned" candidates, eliminating unstable configurations.
Selection via Performance Prediction (Scaling Law-P):
- From the pruned set of balanced candidates, MARS uses Scaling Law-P to predict the final performance.
- It selects the pair that minimizes the predicted loss (or maximizes accuracy).
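The two phases above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the rank grid, the tolerance `tol`, and all coefficient values are hypothetical. It assumes a two-module setup in which Scaling Law-C is inverted in closed form, $r_{ve} = \left( \frac{t_{llm} - E_{ve}}{k_{ve} D_f^{\delta_{ve}}} \right)^{1/\gamma_{ve}}$, and it snaps the ideal rank to the nearest value on a discrete grid.

```python
def conv_time(k, gamma, delta, E, r, d_f):
    """Scaling Law-C: predicted iterations for one module to converge."""
    return k * r ** gamma * d_f ** delta + E


def balanced_r_ve(r_llm, d_f, ve, llm):
    """Solve t_ve(r_ve) = t_llm(r_llm) for r_ve; return None if infeasible."""
    t_llm = conv_time(*llm, r_llm, d_f)
    k, gamma, delta, E = ve
    ratio = (t_llm - E) / (k * d_f ** delta)
    if ratio <= 0:
        return None
    return ratio ** (1.0 / gamma)


def pred_loss(perf, r_ve, r_llm, d_f):
    """Scaling Law-P: predicted final loss for a rank pair."""
    A, alpha_m, alpha_l, beta, E = perf
    return A / (r_ve ** alpha_m * r_llm ** alpha_l * d_f ** beta) + E


def mars_search(grid, d_f, ve, llm, perf, tol=0.1):
    """Phase 1: keep convergence-aligned pairs. Phase 2: pick the lowest predicted loss."""
    candidates = []
    for r_llm in grid:
        r_ideal = balanced_r_ve(r_llm, d_f, ve, llm)
        if r_ideal is None:
            continue
        # Snap the ideal rank to the nearest rank available on the grid.
        r_ve = min(grid, key=lambda g: abs(g - r_ideal))
        t_ve = conv_time(*ve, r_ve, d_f)
        t_llm = conv_time(*llm, r_llm, d_f)
        # Drop pairs that are no longer balanced after snapping.
        if abs(t_ve - t_llm) / t_llm <= tol:
            candidates.append((r_ve, r_llm))
    if not candidates:
        raise ValueError("no convergence-aligned rank pair on this grid")
    return min(candidates, key=lambda pair: pred_loss(perf, pair[0], pair[1], d_f))


# Hypothetical coefficients as (k, gamma, delta, E); not values from the paper.
VE = (100.0, -1.0, 0.5, 0.0)
LLM = (200.0, -1.0, 0.5, 0.0)
PERF = (50.0, 0.2, 0.3, 0.1, 1.5)  # (A, alpha_m, alpha_l, beta, E)
GRID = [4, 8, 16, 32, 64]

best = mars_search(GRID, d_f=10_000, ve=VE, llm=LLM, perf=PERF)
print(best)  # (32, 64) with these made-up coefficients
```

With these coefficients the balance condition reduces to $r_{ve} = r_{llm} / 2$, so an LLM rank of 32 pairs with a vision-encoder rank of 16. Which balanced pair wins in the second phase depends entirely on the fitted Scaling Law-P coefficients.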

C. Calibration Phase

Before the full search, MARS performs a lightweight calibration:

It runs a few short fine-tuning sessions with representative rank pairs.
It records metrics at intermediate checkpoints (simulating smaller dataset sizes) to fit the coefficients ($A, E, \alpha, \beta, \gamma, \delta$) for both scaling laws.
This ensures the predictive models are tailored to the specific model architecture and dataset.
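One way such a fit could look is sketched below for Scaling Law-C. The summary does not say which fitting procedure the authors use, so this is an assumption: the sketch drops the offset $E_i$ (sets it to zero), which makes the law linear in log space, $\log t_i = \log k_i + \gamma_i \log r_i + \delta_i \log D_f$, and solvable by ordinary least squares. The function names and the synthetic observations are hypothetical.

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_conv_law(observations):
    """Fit (k, gamma, delta) from (rank, dataset_size, steps_to_converge) tuples.

    Assumes E_i = 0 so that the law is linear in log space.
    """
    rows = [(1.0, math.log(r), math.log(d)) for r, d, _ in observations]
    ys = [math.log(t) for _, _, t in observations]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(x[i] * x[j] for x in rows) for j in range(3)] for i in range(3)]
    xty = [sum(x[i] * y for x, y in zip(rows, ys)) for i in range(3)]
    log_k, gamma, delta = solve3(xtx, xty)
    return math.exp(log_k), gamma, delta


# Synthetic calibration data generated from known (hypothetical) coefficients.
# The dataset sizes stand in for intermediate checkpoints of short runs.
true_k, true_gamma, true_delta = 40.0, -0.5, 0.6
observations = [
    (r, d, true_k * r ** true_gamma * d ** true_delta)
    for r in (8, 32, 128)
    for d in (1_000, 4_000, 16_000)
]

k, gamma, delta = fit_conv_law(observations)
```

On noise-free synthetic data the fit recovers the generating coefficients. Real calibration runs would be noisy, and fitting a non-zero offset $E_i$ would need a nonlinear solver instead.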

3. Key Contributions

Identification of Imbalanced Dynamics: The paper provides empirical evidence that imbalanced training dynamics between modalities are a primary source of suboptimal MLLM performance, caused by disparities in learning capacity and required learning budgets.
Dual Scaling Laws: The authors are the first to propose and validate scaling laws specifically for MLLM fine-tuning that model both final performance and module-specific convergence time. This makes the search for optimal ranks feasible.
Automated Optimization (MARS): A robust algorithm that replaces heuristic tuning and exhaustive search with a guided, two-step process, achieving optimal rank pairs automatically.
Scalability: The method scales linearly ($O(N)$) with the number of modalities, unlike the exponential growth ($O(C^N)$) of naive grid search, by anchoring the search on the LLM rank and solving for others.
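The gap between the two growth rates can be made concrete with a small count. This sketch rests on one reading of the scalability claim, which is an assumption: with $C$ candidate ranks per module and $N$ modules, grid search evaluates every combination, while an anchored search performs one closed-form solve per non-anchor module for each candidate LLM rank. The function names are hypothetical.

```python
def grid_search_evals(num_rank_choices: int, num_modules: int) -> int:
    """Number of rank combinations a naive grid search must evaluate: C**N."""
    return num_rank_choices ** num_modules


def anchored_search_solves(num_rank_choices: int, num_modules: int) -> int:
    """Closed-form solves when anchoring on the LLM rank: C * (N - 1)."""
    return num_rank_choices * (num_modules - 1)


for n in (2, 3, 4):
    print(n, grid_search_evals(8, n), anchored_search_solves(8, n))
# 2 64 8
# 3 512 16
# 4 4096 24
```

For a fixed number of rank choices, the anchored count grows linearly in the number of modules while the grid count grows exponentially.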

4. Experimental Results

The authors evaluated MARS on various MLLM architectures (LLaVA-OneVision, Qwen2.5-VL) and benchmarks (LLaVA Bench, ScienceQA, MME, MMStar, etc.).

Performance Gains:
- ScienceQA: Up to 12.0% higher accuracy compared to baseline methods.
- LLaVA Bench: Up to 13.2% lower perplexity.
- MARS consistently outperformed fixed-rank tuning, differential learning rate tuning, and adaptive methods designed for unimodal models (AdaLoRA, GeoLoRA).
Efficiency:
- MARS reduced the total search and fine-tuning time by an average of 11.5x compared to naive exhaustive search.
- It achieves this by avoiding full fine-tuning runs for every candidate pair, relying instead on the predictive scaling laws.
Generality: The method demonstrated robust performance across different model scales (0.5B to 7B+ parameters) and diverse tasks (generalist vs. specialist fine-tuning).
From-Scratch Validation: Experiments on models assembled from scratch (without prior multimodal exposure) confirmed that MARS effectively enables downstream knowledge acquisition where standard methods fail.

5. Significance

Paradigm Shift: MARS moves the field from heuristic, manual tuning of learning rates to a systematic, automated approach based on convergence alignment.
Fundamental Insight: It establishes that LoRA rank is a more fundamental control knob for modality-specific adaptation speed than learning rate, acting as both a capacity controller and a regularizer.
Practical Impact: By significantly reducing the computational cost of hyperparameter search and improving final model accuracy, MARS lowers the barrier for fine-tuning large multimodal models, accelerating development cycles and reducing the carbon footprint of AI training.
Future Direction: The work opens avenues for systematically characterizing the relationship between pre-training domain gaps, modality-specific learning capacities, and training dynamics in complex multimodal systems.
