MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search

MARS introduces an adaptive rank search framework that leverages dual scaling laws to automatically discover optimal LoRA rank pairs, thereby harmonizing imbalanced multimodal training dynamics and maximizing performance in Multimodal Large Language Models without manual heuristic tuning.

Minkyoung Cho, Insu Jang, Shuowei Jin, Zesen Zhao, Adityan Jothi, Ethem F. Can, Min-Hung Chen, Z. Morley Mao

Published 2026-03-03

Imagine you are trying to teach a team of two very different experts to work together on a complex project: a Vision Expert (who sees pictures) and a Language Expert (who reads and writes).

In the world of Artificial Intelligence, these two experts are part of a "Multimodal Large Language Model" (MLLM). To make them work together on a new task (like answering science questions about images), you need to "fine-tune" them.

The Problem: The "Tug-of-War"

The paper identifies a major problem with how we usually do this. Currently, we often treat both experts the same way, giving them the same "training intensity."

But here's the catch: They learn at different speeds.

  • The Vision Expert might be slow to learn because it's used to just looking at static images.
  • The Language Expert might be a fast learner, used to processing text quickly.

If you force them to train at the same speed, chaos ensues:

  1. The Slow Learner Bottleneck: If the Vision Expert is too slow, the Language Expert gets bored and starts guessing, ruining the team's performance.
  2. The Fast Learner's Panic: If the Language Expert is too fast, it starts memorizing the training data (overfitting) before the Vision Expert has even figured out what it's looking at. This causes the whole team to oscillate and fail.

The Old Solution:
Previously, researchers tried to fix this by manually adjusting the "learning rate" (how fast they learn) for each expert. This is like a coach constantly running back and forth, shouting, "You, slow down!" and "You, speed up!" It's exhausting, time-consuming, and relies on a lot of guesswork.

The Solution: MARS (The Smart Coach)

The authors introduce MARS (Multimodal Adaptive Rank Search). Instead of just telling the experts to speed up or slow down, MARS changes how much capacity each expert has to learn.

Think of the "Rank" in LoRA (the method used to fine-tune) as the size of the notebook each expert gets to write their notes in.

  • A small notebook (low rank) means the expert can only learn a few key concepts. They learn slowly but don't get confused.
  • A huge notebook (high rank) means the expert can learn everything instantly, but they might get overwhelmed or memorize the wrong things.
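
To make the notebook analogy concrete, here is a minimal sketch of how the LoRA rank sets the number of trainable parameters for a single weight matrix. It relies only on the standard LoRA formulation (a frozen weight plus a learned low-rank update); the helper name `lora_trainable_params` and the 4096 x 4096 layer size are assumptions chosen for illustration, not values from the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes the original weight W and learns a low-rank update B @ A,
    where B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical 4096 x 4096 projection layer, used only for illustration.
d_out, d_in = 4096, 4096
full = d_out * d_in
for rank in (4, 16, 64):
    extra = lora_trainable_params(d_out, d_in, rank)
    print(f"rank={rank:>2}: {extra:>7} trainable params "
          f"({extra / full:.2%} of the full matrix)")
```

Doubling the rank doubles the size of the "notebook", which is why choosing a separate rank for the vision and language parts changes how much each one can learn.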

MARS's Superpower:
MARS acts like a genius coach who knows exactly how big a notebook each expert needs so they finish learning at the exact same time.

How MARS Works: The "Crystal Ball" Strategy

Finding the perfect notebook size for each expert usually requires trying thousands of combinations, which takes years of computer time. MARS avoids this by using two "Crystal Balls" (Scaling Laws):

  1. Crystal Ball #1: The "Time" Predictor (Scaling Law-C)

    • This predicts how long it will take an expert to finish their training based on the size of their notebook and the amount of homework (data).
    • The Magic: MARS uses this to instantly calculate: "If the Language Expert gets a notebook of size 32, the Vision Expert must get a notebook of size 16 to finish at the same time."
    • This eliminates 99% of the bad combinations immediately.
  2. Crystal Ball #2: The "Grade" Predictor (Scaling Law-P)

    • Once MARS has a list of "balanced" teams (where both finish at the same time), it uses this second crystal ball to predict which team will get the best final grade.
    • It picks the notebook sizes that will result in the highest accuracy.
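
The two-step search described above can be sketched as follows. This is only an illustration of the filter-then-rank idea: the power-law form of `predicted_steps`, the shape of `predicted_score`, the rank grid, the tolerance, and every coefficient are hypothetical placeholders, not the scaling laws or fitted values from the paper.

```python
from itertools import product

RANKS = [4, 8, 16, 32, 64, 128]  # candidate LoRA ranks (illustrative grid)
DATA_SIZE = 10_000               # training-set size (made-up value)
TOLERANCE = 0.05                 # max relative gap in predicted convergence time


def predicted_steps(rank, data_size, scale, rank_exp, data_exp):
    """Hypothetical stand-in for Scaling Law-C: steps until convergence."""
    return scale * data_size ** data_exp * rank ** -rank_exp


def predicted_score(r_vision, r_language):
    """Hypothetical stand-in for Scaling Law-P: predicted final accuracy."""
    gain = 0.9 - 0.3 * r_vision ** -0.5 - 0.3 * r_language ** -0.5
    overfit_penalty = 0.001 * (r_vision + r_language)
    return gain - overfit_penalty


def is_balanced(r_vision, r_language):
    """True when both modalities are predicted to converge at about the same time."""
    # Assumed coefficients: at equal rank, vision converges more slowly (scale=2.0).
    t_v = predicted_steps(r_vision, DATA_SIZE, scale=2.0, rank_exp=0.5, data_exp=0.5)
    t_l = predicted_steps(r_language, DATA_SIZE, scale=1.0, rank_exp=0.5, data_exp=0.5)
    return abs(t_v - t_l) / max(t_v, t_l) <= TOLERANCE


# Step 1: keep only rank pairs whose predicted convergence times match.
all_pairs = list(product(RANKS, RANKS))
balanced = [pair for pair in all_pairs if is_balanced(*pair)]

# Step 2: among balanced pairs, pick the one with the best predicted score.
best = max(balanced, key=lambda pair: predicted_score(*pair))

print(f"{len(balanced)} of {len(all_pairs)} pairs are balanced: {balanced}")
print(f"selected (vision_rank, language_rank) = {best}")
```

The point of the design is that neither step trains a model: the first predictor prunes the grid cheaply, and the second only has to rank the few pairs that survive.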

The Result: A Perfectly Balanced Team

By using these predictions, MARS finds the perfect "notebook sizes" (ranks) for the Vision and Language experts without needing to test every single possibility.

Why is this amazing?

  • It's Fast: It cuts search time and computing power by a factor of about 11.5 compared to the old "guess and check" methods.
  • It's Smarter: It consistently beats other methods, improving accuracy on science questions by up to 12% and making the models more stable.
  • It's Automatic: You don't need a human coach running around shouting instructions; the system figures it out automatically.

In a Nutshell

Imagine a relay race where one runner is a sprinter and the other is a marathoner.

  • Old Way: You tell the sprinter to jog and the marathoner to sprint, hoping they arrive at the baton exchange at the same time. It's a mess.
  • MARS Way: MARS calculates exactly how much water and food (capacity) each runner needs so they naturally arrive at the exchange point at the exact same moment, maximizing the team's speed without anyone getting exhausted or bored.

MARS is the ultimate tool for harmonizing AI teams, ensuring that the "eyes" and the "brain" learn together, perfectly in sync.
