Imagine you are trying to teach a robot to recognize pictures of cats, dogs, and cars. To do this, the robot needs to look at the picture and break it down into understandable pieces.
The Old Way: The "Global Blender"
For a long time, researchers tried to use a method called the Hadamard Transform. Think of this like taking a bowl of soup, dumping it into a giant blender, and spinning it so fast that every single ingredient mixes with every other ingredient instantly.
- The Problem: While this mixes everything together quickly, it loses the "where." You know you have carrots and potatoes, but you don't know where they are in the bowl. In image recognition, knowing where a feature is (like an eye on the left side of a face) is crucial. And while this transform is easy to run on a quantum computer (a super-fast, futuristic calculator), the results weren't always great for seeing details.
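You can watch this "global mixing" happen in a few lines of NumPy. This is an illustrative sketch (not the paper's code): we put one bright spot in a tiny 8-pixel row, apply an orthonormal Hadamard matrix, and see the spot's location smear across every coefficient.

```python
import numpy as np

# A tiny 8-pixel "image row": a single bright spot at position 2.
signal = np.zeros(8)
signal[2] = 1.0

# Build an 8x8 Hadamard matrix by doubling: H_2n = [[H, H], [H, -H]].
H = np.array([[1.0]])
while H.shape[0] < 8:
    H = np.block([[H, H], [H, -H]])
H = H / np.sqrt(8)  # normalize so the transform preserves energy

mixed = H @ signal
# The single spot is now spread across EVERY coefficient with equal
# magnitude -- the "where" is gone.
print(np.abs(mixed))  # all entries equal 1/sqrt(8), about 0.354
```

Every output coefficient carries the same amount of the spot, so no single coefficient tells you where it was.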
The New Idea: The "Smart Organizer" (WTHaar-Net)
The authors of this paper, Vittorio, Tsai, and Ahmet, came up with a better way called WTHaar-Net. Instead of the "global blender," they use the Haar Wavelet Transform.
Imagine you are organizing a messy room. Instead of throwing everything into one big pile, you use a smart system:
- Zoom Out: First, you look at the whole room and say, "Is this room generally bright or dark?" (This is the low-resolution view).
- Zoom In: Then, you look at specific corners. "Is there a toy here? Is there a sock there?" (This is the high-resolution detail).
- Keep it Local: You keep the "toy" information separate from the "sock" information. You know exactly where the mess is.
This is what the Haar Wavelet does. It breaks an image down into:
- The Big Picture: The general shape and color.
- The Details: The edges, textures, and small spots.
- The Location: It keeps track of where those details are.
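To contrast this with the Hadamard blender, here is an illustrative sketch (again, not the paper's code) of one level of the Haar wavelet on the same single bright spot: pairwise averages give the "big picture," pairwise differences give the "details," and the spot's location survives.

```python
import numpy as np

# The same single bright spot at position 2.
signal = np.zeros(8)
signal[2] = 1.0

# One Haar level: pairwise averages ("big picture") and pairwise
# differences ("details"), each scaled by 1/sqrt(2).
pairs = signal.reshape(-1, 2)
averages = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
details = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)

# Only ONE detail coefficient lights up -- the one covering
# positions 2 and 3 -- so we still know where the spot is.
print(details)  # nonzero only at index 1
```

Repeating the same step on the averages gives the progressively coarser "zoom out" views, which is exactly the multi-resolution ladder the analogy describes.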
Why Mix Quantum and Classical?
The paper is about a Hybrid approach. Think of it like a construction crew:
- The Classical Part (The Humans): These are the standard computers we use today. They are great at doing the heavy lifting, like learning from thousands of pictures and making the final decision ("That's a cat!").
- The Quantum Part (The Super-Tool): This is the new, experimental technology. It's incredibly fast at doing specific math tricks (like the "Zoom Out/Zoom In" sorting we mentioned above).
The authors built a system where the Quantum part acts as a super-efficient filter. It quickly sorts the image data into "Big Picture" and "Details" using a special set of rules (gates) that quantum computers love. Then, it hands the organized data back to the Classical part to finish the job.
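This division of labor can be sketched in miniature. Everything below is hypothetical stand-in code, not the authors' implementation: `quantum_filter` computes the Haar-style sort classically (in the real system this is a quantum circuit), and `classical_head` stands in for the trained classical network that makes the final call.

```python
import numpy as np


def quantum_filter(row):
    # Stand-in for the quantum circuit: one Haar level that sorts the
    # input into coarse ("big picture") and detail coefficients.
    pairs = row.reshape(-1, 2)
    coarse = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return np.concatenate([coarse, detail])


def classical_head(features, weights):
    # Stand-in for the classical network that makes the final decision.
    scores = features @ weights
    return int(np.argmax(scores))


rng = np.random.default_rng(0)
row = rng.random(8)                  # a toy 8-pixel input
features = quantum_filter(row)       # "quantum" sorting step
weights = rng.random((8, 3))         # hypothetical weights, 3 classes
label = classical_head(features, weights)
print(["cat", "dog", "car"][label])
```

The key design point is the hand-off: the quantum part only performs the fixed, structured transform it is naturally good at, while all the learning stays on the classical side.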
The Results: Faster, Smaller, and Stronger
Because this new method is so organized, the robot doesn't need to carry as much "brain power" (parameters) to learn.
- Smaller Footprint: They reduced the size of the model by about 26%. It's like shrinking a heavy backpack into a sleek messenger bag without losing any of the tools inside.
- Better Vision: On a harder test (Tiny-ImageNet), this new "Smart Organizer" beat the old "Global Blender" and even some standard models. It was better at seeing the forest and the trees.
- Real Hardware Test: They didn't just simulate this on a normal computer; they actually ran a small version of it on a real quantum computer in the cloud (IBM's). It worked! It proved that this idea is possible with the quantum computers we have today.
The Catch (The "Sign" Problem)
There is one small hurdle. When the quantum computer does its magic, it sometimes forgets whether a number should be positive or negative (like forgetting if a temperature is +5 or -5). The authors had to use some clever math tricks to fix this later. It's a bit like the robot knowing how much noise there is, but needing a human to tell it if it's a happy noise or a sad noise.
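The reason the sign gets lost is simple: a quantum measurement reports probabilities, which are the squares of the underlying amplitudes. A toy sketch (with made-up numbers) shows why +0.8 and -0.8 look identical after measurement.

```python
import numpy as np

# Hypothetical detail coefficients encoded as quantum amplitudes.
amplitudes = np.array([0.6, -0.8])

# Measurement yields probabilities -- the squares of the amplitudes.
probabilities = amplitudes ** 2

# Taking the square root recovers magnitudes, but every sign is gone.
magnitudes = np.sqrt(probabilities)
print(magnitudes)  # [0.6, 0.8] -- both come back positive
```

Since -0.8 and +0.8 produce the same probability, the classical side must recover (or learn to work around) the missing signs after the fact.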
The Bottom Line
WTHaar-Net is a new way to teach AI to see. It swaps out a messy, global mixing method for a tidy, local sorting method that plays nicely with quantum computers. The result? A smaller, faster, and more accurate AI that can run on the quantum hardware of the near future.