DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

DiaBlo is a parameter-efficient fine-tuning method that updates only the diagonal blocks of model weight matrices, offering a simple, stable, and theoretically grounded alternative to LoRA that achieves competitive performance with comparable memory efficiency and training speed.

Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang

Published 2026-03-04

Imagine you have a massive, incredibly smart library (a Large Language Model, or LLM) that knows almost everything. But now, you want to teach this library a very specific new skill, like how to solve math problems or write code for a specific company.

The Problem:
The traditional way to teach the library is Full Fine-Tuning. This is like hiring a team of 100 librarians to rewrite every single book in the library to include the new information. It works great, but it's incredibly expensive, slow, and requires a massive warehouse (memory) to store all the changes.

The Current "Hack" (LoRA):
To save money, researchers invented LoRA (Low-Rank Adaptation). Instead of rewriting the books, LoRA is like adding a small, sticky-note index card to the front of the library. You only write on the index card, not the books.

  • The Catch: These index cards are made of two layers of paper glued together (a matrix product). To make them stick properly, you have to be very careful about how you glue them (special initialization) and how you press them down (special optimization). If you mess up the gluing, the notes fall off, or the library gets confused.
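In code terms, LoRA's "two glued layers" are just the product of two small matrices added on top of the frozen weight. A minimal NumPy sketch (the dimensions and variable names here are illustrative, not taken from the paper):

```python
import numpy as np

d = 8  # hypothetical weight dimension
r = 2  # LoRA rank: the "thinness" of the sticky note

W = np.random.randn(d, d)         # frozen pretrained weight ("the books")
A = np.random.randn(r, d) * 0.01  # first "layer of paper"
B = np.zeros((d, r))              # second layer, initialized to zero

# Only A and B are trained; the effective weight is W plus their product.
delta_W = B @ A          # the "glued" low-rank update (d x d, but rank r)
W_effective = W + delta_W

# With B initialized to zero, the model starts exactly as pretrained.
assert np.allclose(W_effective, W)
```

The zero initialization of `B` is one example of the careful "gluing" LoRA needs: if both factors started random, the model would be perturbed before training even begins.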

The New Solution (DiaBlo):
The authors of this paper propose DiaBlo (Diagonal Blocks). They say, "Why glue two pieces of paper together? Let's just write directly on the books, but only on specific, neat squares."

Here is the breakdown using simple analogies:

1. The "Chessboard" vs. The "Glue"

Imagine the library's knowledge is a giant chessboard.

  • LoRA tries to learn by moving two separate sets of pieces (A and B) that interact with each other. It's like trying to solve a puzzle where the pieces are magnetic but repel each other if you don't hold them perfectly. It's tricky and unstable.
  • DiaBlo says, "Let's just pick out the squares running along the main diagonal of the chessboard (the diagonal blocks) and write our new rules directly on them." We ignore all the off-diagonal squares. We don't need glue; we just write on the wood.
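The "write on the diagonal squares" idea can be sketched directly: split the weight matrix into a grid of blocks and make only the blocks on the main diagonal trainable. A minimal sketch, assuming a square weight and equal block sizes (the sizes here are illustrative, not taken from the paper):

```python
import numpy as np

d = 8  # hypothetical weight dimension
b = 2  # block size, so d // b = 4 diagonal blocks

W = np.random.randn(d, d)  # frozen pretrained weight ("the books")

# Trainable parameters: one small b x b matrix per diagonal block.
blocks = [np.zeros((b, b)) for _ in range(d // b)]

def effective_weight(W, blocks, b):
    """Add each trainable block onto the matching diagonal block of W."""
    W_eff = W.copy()
    for i, D in enumerate(blocks):
        W_eff[i*b:(i+1)*b, i*b:(i+1)*b] += D
    return W_eff

# Zero-initialized blocks mean the model starts exactly as pretrained,
# with no product of factors involved -- hence no special "glue" needed.
assert np.allclose(effective_weight(W, blocks, b), W)
```

Because the update is added directly rather than formed as a product of two trained factors, there is no interaction term between layers of "paper" for the optimizer to fight with.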

2. Why is this better?

  • No "Glue" Required: Because DiaBlo writes directly on the existing structure, it doesn't need complex "gluing" tricks (special initialization schemes) to start working; you can start writing immediately. It's like painting directly on a wall: no special primer required, you just paint.
  • Stability: Since there's no complex interaction between two layers of "glue," the learning process is much smoother. It doesn't wobble or crash as often as LoRA.
  • Efficiency: Even though we are writing on the books, we are only writing in a tiny fraction of the squares (the diagonal blocks). The number of trainable parameters stays tiny, so memory use and training speed remain comparable to LoRA.
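The savings come from simple counting. For one d x d weight matrix, full fine-tuning trains all d² entries, LoRA trains 2·d·r entries (two thin matrices), and DiaBlo trains (d/b)·b² = d·b entries (the diagonal blocks). A quick check with made-up sizes (a hidden size of 4096 is typical of a 7B-parameter model; the rank and block size are illustrative):

```python
d = 4096  # hypothetical hidden size of one layer
r = 16    # hypothetical LoRA rank
b = 64    # hypothetical DiaBlo block size

full   = d * d              # full fine-tuning: every entry
lora   = 2 * d * r          # LoRA: A (r x d) plus B (d x r)
diablo = (d // b) * b * b   # DiaBlo: d/b diagonal blocks of size b x b

print(f"LoRA trains   {lora / full:.2%} of the entries")
print(f"DiaBlo trains {diablo / full:.2%} of the entries")
```

With these (hypothetical) settings both methods train around one percent of the matrix, which is why the two end up in the same ballpark for memory and speed.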

3. The "Secret Sauce" (Why it works)

You might ask, "If we only change a few squares, won't the library forget the rest?"
The paper proves mathematically that for these giant libraries, the most important changes actually happen in those specific "diagonal" patterns.

  • Analogy: Imagine a massive orchestra. To change the sound of the song, you don't need every single musician to change their instrument. You just need the first violins, the second violins, and the cellos (the diagonal blocks) to play a slightly different note. The rest of the orchestra can stay exactly the same, and the song still sounds perfect.

4. Real-World Results

The authors tested this on:

  • Common Sense: Answering everyday reasoning questions.
  • Math: Solving complex equations.
  • Coding: Writing computer programs.
  • Safety: Teaching the AI not to say harmful things.

The Result: DiaBlo didn't just match the performance of the expensive "rewrite the whole library" method; it often beat the current "sticky note" method (LoRA) and its fancy variations, while being just as fast and cheap to run.

The Bottom Line

DiaBlo is a simpler, more robust way to teach giant AI models new tricks. Instead of building a complex, fragile "sticky note" system (LoRA), it just picks the most important parts of the AI's brain and updates them directly. It's cheaper, faster, more stable, and surprisingly, it works better.

In one sentence: DiaBlo proves that to teach a giant AI a new skill, you don't need to rebuild the whole thing or use complex glue; you just need to tweak the right few squares on the chessboard.