FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

This paper introduces FOZO, a memory-efficient, backpropagation-free test-time adaptation method. FOZO uses zeroth-order prompt optimization with dynamically decaying perturbations, outperforming existing gradient-based and forward-only approaches on resource-constrained devices and quantized models.

Xingyu Wang, Tao Wang

Published 2026-03-06

Imagine you have a highly trained chef (the AI model) who is a master at cooking perfect Italian pasta. They've spent years learning this in a specific kitchen with specific ingredients.

Now, imagine you send this chef out to a new restaurant where the ingredients are slightly different, the stove is broken, and the customers are ordering weird fusion dishes they've never seen before. This is what happens to AI models in the real world: the data they see changes (this is called distribution shift).

The paper introduces a new method called FOZO to help this chef adapt instantly without needing a full retraining course. Here is how it works, broken down into simple concepts:

1. The Problem: The "Backward" Bottleneck

Most current methods try to fix the chef by sending them back to culinary school (retraining). In AI terms, this is called Backpropagation.

  • The Issue: It's like sending the chef back to school every time they encounter a new ingredient. It takes too much time, requires a huge library of books (memory), and is impossible to do on a small, portable stove (like a phone or a low-power sensor).
  • The Alternative: Some methods try to just tweak the chef's apron or hat (adjusting normalization layers) without changing their cooking style. But these tweaks are often too weak to handle big changes.

2. The Solution: "Forward-Only" Cooking

The authors propose FOZO (Forward-Only Zeroth-Order prompt Optimization).

  • The Metaphor: Instead of sending the chef back to school, FOZO gives them a magic tasting spoon.
  • How it works: The chef tries a dish (a "forward pass"). If it tastes bad, they don't need to know exactly which chemical reaction went wrong (which requires complex math/backpropagation). They just need to know: "If I add a pinch more salt, does it get better? If I add less, does it get worse?"
  • The "Zeroth-Order" Magic: This is the "tasting spoon." It estimates the direction to improve just by trying two slightly different versions of the dish and comparing the results. It doesn't need the complex "recipe book" (exact gradients computed by backpropagation), which requires heavy memory. It just needs to taste the food.
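The two-taste comparison above can be sketched in a few lines. This is a generic two-point (SPSA-style) zeroth-order estimator, not the paper's actual code; the function and variable names are illustrative.

```python
import numpy as np

def zo_gradient_estimate(loss_fn, params, eps=1e-2, rng=None):
    """Estimate a descent direction with two forward passes only
    (two-point zeroth-order estimate; no backpropagation)."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(params.shape)   # random "tasting" direction
    loss_plus = loss_fn(params + eps * u)   # a pinch more
    loss_minus = loss_fn(params - eps * u)  # a pinch less
    # Compare the two tastes to infer which way improves the dish.
    return (loss_plus - loss_minus) / (2 * eps) * u

# Toy usage: descend a simple quadratic "loss" without any gradients.
loss = lambda p: float(np.sum(p ** 2))
p = np.array([3.0, -2.0])
rng = np.random.default_rng(0)
for _ in range(200):
    p = p - 0.1 * zo_gradient_estimate(loss, p, rng=rng)
```

In expectation this estimate points along the true gradient, but each step needs only two evaluations of the model, so memory stays at inference level.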

3. The Secret Sauce: Dynamic Perturbation

One problem with just "tasting" is that if the kitchen is chaotic (noisy data), you might taste the wrong thing and make a bad decision.

  • The Analogy: Imagine the chef is trying to find the perfect amount of salt in a foggy kitchen.
    • Early on: The fog is thick. The chef needs to take big, bold steps (large "perturbation") to feel around and find the right direction. "Maybe I need a lot of salt? Or maybe none at all?"
    • Later on: As the fog clears and the chef gets closer to the right flavor, they need to take tiny, precise steps to fine-tune the taste.
  • FOZO's Innovation: The method automatically adjusts the size of these "steps." It starts with big, bold guesses to escape bad spots quickly, then slowly shrinks the steps to perfect the result. This is called Dynamic Perturbation.
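A decaying perturbation schedule like the one described above can be sketched as follows. The exponential-decay rule and all constants here are illustrative assumptions, not the paper's exact schedule.

```python
def perturbation_scale(step, eps_start=0.05, eps_end=1e-3, decay=0.98):
    """Shrink the 'tasting' perturbation over time: bold exploratory
    steps early, tiny precise ones later, with a floor so the
    finite difference never collapses to zero."""
    return max(eps_end, eps_start * decay ** step)

# Thick fog at first, fine-tuning later:
print(perturbation_scale(0))    # 0.05  (big, bold guesses)
print(perturbation_scale(300))  # 0.001 (clamped at the floor)
```

The floor (`eps_end`) matters in practice: if the perturbation shrinks all the way to zero, the two "tastes" become identical and the comparison carries no signal.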

4. The "Prompt" Trick

Instead of changing the chef's entire brain (the model weights), FOZO only changes a tiny note attached to the order ticket (called a Prompt).

  • Why this matters: Changing the whole brain is heavy and risky. Changing a tiny note is light, fast, and safe. It's like giving the chef a sticky note that says, "Remember, today's tomatoes are sour," rather than rewriting their entire memory of what a tomato is.
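The note-versus-brain idea can be illustrated with a frozen toy model whose weights never change, while a tiny prompt vector is the only thing being adapted. Everything here (the linear "model", the additive prompt) is a simplified stand-in for the paper's setup, in which learned prompt tokens are attached to the model's input.

```python
import numpy as np

rng = np.random.default_rng(0)

# The frozen "chef": a fixed model we are not allowed to retrain.
W = rng.standard_normal((8, 4))

def frozen_model(x):
    return W @ x  # forward pass only; W never changes

# The "sticky note": a tiny learnable vector added to each input
# (a simplified stand-in for learned prompt tokens).
prompt = np.zeros(4)

def adapted_forward(x):
    return frozen_model(x + prompt)  # only `prompt` is ever updated

# The note is far smaller than the brain:
print(prompt.size, "adaptable vs", W.size, "frozen parameters")
```

Because only `prompt` is optimized (e.g., with the zeroth-order estimate above applied to it alone), the adaptation state is a handful of numbers rather than a full copy of the model's weights.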

5. The Results: Why It's a Game Changer

The paper tested this on a famous benchmark (ImageNet-C), which is like throwing the chef into a kitchen with 15 different types of disasters (blurry photos, noise, weird lighting).

  • Speed: FOZO adapts faster than the competition. It reaches a high level of accuracy in less time.
  • Efficiency: It uses very little memory. You could run this on a small device (like a drone or a smart camera) where other methods would crash because they are too heavy.
  • Robustness: It works even when the model is "quantized" (compressed to save space), which is crucial for real-world devices.

Summary

FOZO is like a smart, lightweight assistant for AI models. When the world changes and the AI gets confused, this assistant doesn't force the AI to go back to school. Instead, it whispers tiny hints ("Try adding a bit of noise here, try less there") and guides the AI to the right answer using only forward steps. It's fast, it's light, and it keeps working even when the AI is running on a tiny, low-power device.