Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition

This paper introduces Modular Delta Merging with Orthogonal Constraints (MDM-OC), a scalable framework that enables interference-free, reversible, and compliant composition of fine-tuned models by projecting task-specific deltas into orthogonal subspaces and merging them via gradient-based optimization.

Haris Khan, Sadia Asif, Shumaila Asif, Muhammad Zeeshan Karamat, Rajesh Upadhayaya

Published 2026-04-14

Imagine you have a master chef, Chef Base, who knows how to cook a perfect, basic soup.

Now, imagine you want to teach this chef new recipes: one for spicy curry, one for a sweet dessert, and one for a vegan salad. In the old way of doing things (traditional machine learning), you'd send the chef to culinary school for each new dish. But there's a catch: every time they learn the curry, they start to forget how to make the soup. By the time they learn the salad, they might have forgotten the dessert entirely. This is called "Catastrophic Forgetting."

Alternatively, you could try to just mix the recipes together by averaging the ingredients. But if you mix a spicy curry recipe with a sweet dessert recipe, you end up with a gross, confusing mess that tastes like neither.

MDM-OC is a new, brilliant kitchen system that solves both problems. Here is how it works, broken down into simple concepts:

1. The "Delta" (The Change, Not the Whole)

Instead of rewriting the chef's entire brain for every new recipe, MDM-OC only writes down the changes (the "deltas").

  • Old Way: "Here is the full recipe for Curry."
  • MDM-OC Way: "Here is a small note: Add 2 extra chili peppers and swap the broth for coconut milk."
    This keeps things light and efficient.
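In code, a delta is nothing more than the element-wise difference between the fine-tuned weights and the base weights. Here is a minimal sketch, assuming the models are represented as flat NumPy arrays (real networks would have many tensors, and the names `base_weights` and `curry_weights` are illustrative):

```python
import numpy as np

# A hypothetical base model and a fine-tuned copy of it,
# each represented as a flat weight vector.
base_weights = np.array([1.0, 2.0, 3.0, 4.0])
curry_weights = np.array([1.5, 2.0, 2.5, 4.0])  # after fine-tuning on "curry"

# The delta is just the difference: a small note of what changed,
# not a full copy of the model.
curry_delta = curry_weights - base_weights

# Adding the note back to the base recovers the fine-tuned model exactly.
recovered = base_weights + curry_delta
print(np.allclose(recovered, curry_weights))  # True
```

Storing only deltas keeps each new skill small: most weights barely move during fine-tuning, so the "note" is cheap compared to a full model copy.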

2. The "Orthogonal" Trick (The Invisible Walls)

This is the magic part. In a normal kitchen, if you try to add "Spicy" and "Sweet" to the same pot, they clash.
MDM-OC builds invisible walls (orthogonal subspaces) between the recipes.

  • Imagine the "Spicy" recipe lives in a room on the Left.
  • The "Sweet" recipe lives in a room on the Right.
  • The "Vegan" recipe lives in a room Upstairs.

Because these rooms are perfectly perpendicular (orthogonal) to each other, adding chili peppers to the Left room cannot accidentally turn the dessert sweet or ruin the salad. They don't interfere with each other. The system mathematically forces these new skills to live in their own unique "directions" so they never bump into each other.
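The "invisible walls" can be sketched with a Gram-Schmidt-style projection: before storing a new delta, remove any component that points in a direction an existing skill already occupies. This is a simplified illustration of the idea, not the paper's exact projection method:

```python
import numpy as np

def orthogonalize(delta, existing_deltas):
    """Project `delta` into the subspace orthogonal to every stored delta,
    so the new skill cannot share a direction with (interfere with) old ones."""
    d = delta.astype(float).copy()
    for e in existing_deltas:
        e = e.astype(float)
        d -= (d @ e) / (e @ e) * e  # strip the component along e
    return d

spicy = np.array([1.0, 0.0, 0.0])
sweet_raw = np.array([0.7, 1.0, 0.0])  # overlaps with the "spicy" direction

sweet = orthogonalize(sweet_raw, [spicy])
print(sweet @ spicy)  # 0.0 — the two skills no longer share a direction
```

After projection, updating one delta leaves the others mathematically untouched, which is exactly the "separate rooms" guarantee.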

3. The "Merging" (The Unified Masterpiece)

When you want the chef to serve a full menu, you don't retrain them. You simply take the Base Chef and gently layer the "Change Notes" from the Left, Right, and Upstairs rooms on top of them.
Because the rooms are separate, the chef can now make the Curry, the Dessert, and the Salad perfectly at the same time, without forgetting the original soup.
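At its simplest, merging is the base plus the sum of the orthogonal deltas (the paper refines this with gradient-based optimization, but an additive sketch captures the core idea; the skill names here are illustrative):

```python
import numpy as np

base = np.array([1.0, 1.0, 1.0])

# One delta per skill, stored along mutually perpendicular directions.
deltas = {
    "curry":   np.array([0.5, 0.0, 0.0]),
    "dessert": np.array([0.0, -0.3, 0.0]),
    "salad":   np.array([0.0, 0.0, 0.2]),
}

# Merging: layer every "change note" on top of the base.
# Because the deltas are orthogonal, no skill overwrites another.
merged = base + sum(deltas.values())
print(merged)  # [1.5, 0.7, 1.2]
```

Since the deltas occupy disjoint directions, the merged model carries every skill at once without revisiting any training data.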

4. The "Un-Merging" (The GDPR Superpower)

This is the most unique feature. Imagine a customer says, "I don't want the dessert recipe in my kitchen anymore because of privacy laws."

  • Old Systems: You have to scrub the chef's brain, which is messy and might accidentally delete the soup recipe too.
  • MDM-OC: Because the dessert recipe was stored in its own separate "Right" room, you just take that specific note away. The chef instantly forgets the dessert but remembers everything else perfectly. It's like erasing a single line of code without breaking the whole program. This is crucial for laws like GDPR that require you to "delete" data or skills on demand.
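Reversibility follows directly from the additive structure: to forget a skill, subtract its delta. A minimal sketch under the same toy setup as before:

```python
import numpy as np

base = np.array([1.0, 1.0, 1.0])
deltas = {
    "curry":   np.array([0.5, 0.0, 0.0]),
    "dessert": np.array([0.0, -0.3, 0.0]),
}
merged = base + sum(deltas.values())

# "Un-merging": subtract one delta to forget that skill exactly,
# leaving the base and every other skill untouched.
unmerged = merged - deltas["dessert"]
print(np.allclose(unmerged, base + deltas["curry"]))  # True
```

Because the subtraction is exact, no retraining or approximate "machine unlearning" is needed; the removed skill leaves no trace in the remaining weights.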

5. The "Stability" (The Safety Net)

To make sure the chef doesn't get confused when switching between these rooms, the system uses two safety nets:

  • Elastic Weight Consolidation: This is like a "memory anchor." It tells the chef, "Don't change the core ingredients of the soup too much."
  • Synthetic Replay: This is like a "practice dummy." The chef occasionally practices the old recipes using fake ingredients to keep the muscle memory sharp.
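The "memory anchor" can be written down concretely. Elastic Weight Consolidation adds a penalty that makes it costly to move weights that mattered for old tasks; here is a sketch with an assumed diagonal Fisher importance estimate (`fisher`, `anchor`, and `lam` are illustrative names):

```python
import numpy as np

def ewc_penalty(weights, anchor, fisher, lam=1.0):
    """EWC-style penalty: `anchor` holds the old weights, `fisher`
    estimates how important each weight was for old tasks.
    Moving an important weight is penalized heavily."""
    return lam * np.sum(fisher * (weights - anchor) ** 2)

anchor = np.array([1.0, 2.0])
fisher = np.array([10.0, 0.1])  # first weight matters a lot, second barely

# Same-sized move, very different cost depending on importance.
print(ewc_penalty(np.array([1.5, 2.0]), anchor, fisher))  # 2.5
print(ewc_penalty(np.array([1.0, 2.5]), anchor, fisher))  # 0.025
```

Adding this term to the training loss is what tells the chef "don't change the core soup ingredients too much" while still leaving the unimportant weights free to adapt.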

Why Does This Matter?

  • No More Amnesia: The AI doesn't forget old tasks when learning new ones.
  • No More Messy Mixes: New skills don't ruin old ones.
  • Privacy & Compliance: You can surgically remove specific skills (like learning from a specific user's data) without rebuilding the whole AI.
  • Scalability: You can keep adding hundreds of new skills without the system getting slow or bloated.

In short: MDM-OC is like giving a super-intelligent assistant a set of modular, non-interfering toolkits. You can snap a new tool on to learn a new skill, or snap it off to forget it, all while keeping the original brain perfectly intact and the whole system running smoothly.
