AirCNN via Reconfigurable Intelligent Surfaces: Architecture Design and Implementation

Imagine you have a super-smart AI brain (a Convolutional Neural Network, or CNN) that needs to look at a picture and tell you if it's a cat, a dog, or a car. Usually, this brain lives inside a computer chip, crunching numbers one by one in a digital factory.

This paper introduces a wild new idea called AirCNN. Instead of doing the math inside a computer chip, the authors want to do the math in the air using radio waves and special mirrors.

Here is the breakdown of how it works, using simple analogies:

1. The Problem: The "Digital Factory" vs. The "Air Highway"

The Old Way (Digital): Imagine a factory where workers (transistors) take a raw material (an image), measure it, cut it, and glue it together step-by-step. It's precise, but it takes time and energy.
The New Way (AirCNN): Imagine you want to shape a block of clay. Instead of carving it with tools, you throw it into a wind tunnel. If you arrange the wind tunnels and fans just right, the wind itself shapes the clay into the perfect form as it flies through.
- In AirCNN, the "clay" is the data (the image).
- The "wind" is the radio signal.
- The "fans" are the Reconfigurable Intelligent Surfaces (RIS).

2. The Magic Tool: Reconfigurable Intelligent Surfaces (RIS)

Think of an RIS as a smart, programmable mirror wall.

A normal mirror just reflects light.
An RIS is like a wall made of thousands of tiny, independent mirrors (like pixels on a screen) that can instantly change their angle.
By tilting these tiny mirrors, the RIS can bend, focus, or scatter radio waves exactly how we want. It turns the empty space between a transmitter and a receiver into a "programmable room."

3. How the Math Happens in the Air

In a normal computer, a "Convolution" (the core math of image recognition) is like sliding a stencil over an image and multiplying numbers.

The Trick: The authors realized that radio waves naturally add up. If you send two radio signals at the same time, they mix together in the air.
The Setup:
1. The Transmitter: Sends out the image data as radio waves.
2. The RIS (The Mirror Wall): The wall tilts its tiny mirrors to distort the waves. This distortion is actually the "math" being done. It's like the wind shaping the clay.
3. The Receiver: Catches the waves. Because the waves were shaped by the mirrors, the receiver doesn't need to do the heavy math; the answer is already "baked" into the signal.

4. Two Ways to Build the "Air Factory"

The paper tests two different ways to set up this system, like choosing between a single-lane road and a multi-lane highway:

MISO (Multiple Input, Single Output):
- Analogy: A single-lane road where cars (data) take turns driving through.
- How it works: The transmitter sends data in chunks. The RIS changes its mirror angles for every single chunk to do the math.
- Pros: It's very flexible and can do complex math very accurately.
- Cons: It takes longer because everything happens one step at a time.
MIMO (Multiple Input, Multiple Output):
- Analogy: A multi-lane highway where many cars drive side-by-side at once.
- How it works: The transmitter has many antennas, and the receiver has many antennas. They send all the data at once. The RIS sets its mirrors once for the whole batch.
- Pros: It's super fast (low latency).
- Cons: It's harder to get the math perfect, especially if the signal is weak or the room is "echoey."

5. The Results: What Did They Find?

The authors ran simulations (computer tests) to see how well this "Air Brain" could recognize images (like identifying clothes in the Fashion MNIST dataset).

It Works! The system can recognize images almost as well as a normal computer, but it does the heavy lifting using physics instead of chips.
The "Mirror Wall" Matters: Using multiple RIS walls (multiple mirrors) is much better than using just one. It's like having a whole team of sculptors instead of just one; they can shape the signal much more precisely.
The "Clear Air" Problem: If the air is too clear (a direct line of sight with no obstacles), the system actually struggles a bit. It needs a little bit of "chaos" (reflections) to have enough freedom to shape the waves.
Power vs. Speed: When the signal is weak, the "Single Lane" (MISO) approach is better. When the signal is strong, the "Multi-Lane" (MIMO) approach is faster and more efficient.

The Big Picture

This paper proposes a future where the environment itself is a computer. Instead of sending data to a server to be processed, we process the data while it travels through the air using smart mirrors.

This could lead to:

Faster AI: No waiting for data to travel to a server and back.
Lower Energy: Less power needed for digital chips because the physics of the air does the work.
6G Networks: This is a key technology for the next generation of wireless internet, where the network doesn't just carry data, it computes with it.

In short: AirCNN turns the airwaves into a giant, invisible calculator.

Here is a detailed technical summary of the paper "CNNs in the Air via Reconfigurable Intelligent Surfaces" (AirCNN).

1. Problem Statement

The paper addresses the challenge of implementing Convolutional Neural Networks (CNNs) directly within the wireless physical layer to reduce latency and energy consumption. Traditional CNNs rely on digital Multiply-Accumulate (MAC) operations, which are computationally intensive.

Core Challenge: Convolution operations cannot be directly performed over-the-air (OTA) because wireless channels naturally perform matrix multiplication, not convolution.
Goal: To engineer the wireless propagation environment using Reconfigurable Intelligent Surfaces (RIS) to emulate the mathematical operations of a CNN layer (specifically 2D convolution) via analog signal propagation, creating a "Wireless Physical Neural Network" (WPNN).
Specific Hurdles: Mapping multi-channel convolutional kernels onto physical transformations requires joint optimization of transmitter precoders, receiver combiners, and RIS phase shifts while adhering to hardware constraints (e.g., unit-modulus phase shifts, power budgets).

2. Methodology

The authors propose AirCNN, a framework that transforms convolutional operations into equivalent matrix multiplications that can be realized via OTA transmission.

A. Mathematical Transformation

Convolution to Multiplication: The paper demonstrates that a 2D convolution operation ($Y = X * W$) can be mathematically unfolded into a matrix multiplication ($Y = \bar{W} \bar{X}$) by rearranging the input image and kernel matrices (Toeplitz-like structure).
OTA Realization: Since OTA transmission inherently performs matrix multiplication ($Y = HX$), the system is designed such that the effective end-to-end channel matrix ($H$) emulates the unfolded convolution kernel ($\bar{W}$).
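The unfolding step can be checked numerically. The sketch below is an illustration rather than the paper's exact construction: it assumes stride 1, no padding, and the common "im2col" arrangement in which the image patches form the columns of $\bar{X}$ and each flattened kernel forms a row of $\bar{W}$. The function names `unfold` and `conv2d_direct` are hypothetical.

```python
import numpy as np

def unfold(x, k):
    """Rearrange a (C_in, H, W) input into a (C_in*k*k, P) patch matrix.

    Assumes stride 1 and no padding; P is the number of output positions.
    """
    c_in, h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((c_in * k * k, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = x[:, i:i + k, j:j + k].reshape(-1)
            col += 1
    return cols

def conv2d_direct(x, w):
    """Reference sliding-window convolution (cross-correlation, as in CNNs)."""
    c_out, c_in, k, _ = w.shape
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, out_h, out_w))
    for o in range(c_out):
        for i in range(out_h):
            for j in range(out_w):
                y[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))        # C_in = 3, 6x6 image (illustrative sizes)
w = rng.standard_normal((4, 3, 3, 3))     # C_out = 4, 3x3 kernels

x_bar = unfold(x, k=3)                    # shape (27, 16)
w_bar = w.reshape(4, -1)                  # shape (4, 27)
y_matmul = (w_bar @ x_bar).reshape(4, 4, 4)
y_direct = conv2d_direct(x, w)
```

Both paths produce the same output, which is the property AirCNN relies on: once the layer is a plain matrix product, the matrix $\bar{W}$ is something a channel $H$ can be shaped to approximate.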

B. System Architecture

The framework utilizes a transmitter with $N_t$ antennas, a receiver with $N_r$ antennas, and $L$ RISs, each with $M/L$ reflecting elements. The effective channel is modeled as $H = \sum_{l=1}^L \hat{H}_l \Theta_l \bar{H}_l$, where $\Theta_l$ is the phase-shift matrix of the $l$-th RIS.
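A minimal numerical sketch of this channel model follows. It makes several assumptions that are not in the summary: the dimensions are read off the formula ($\bar{H}_l$ is the $(M/L) \times N_t$ transmitter-to-RIS hop and $\hat{H}_l$ the $N_r \times (M/L)$ RIS-to-receiver hop), the channel entries are random complex Gaussians used as placeholders, and the helper names are hypothetical.

```python
import numpy as np

def effective_channel(h_hat, h_bar, theta):
    """Return H = sum_l h_hat[l] @ diag(exp(1j * theta[l])) @ h_bar[l]."""
    total = np.zeros((h_hat[0].shape[0], h_bar[0].shape[1]), dtype=complex)
    for hh, hb, th in zip(h_hat, h_bar, theta):
        # Scaling the columns of hh by the phases equals hh @ diag(phases).
        total += (hh * np.exp(1j * th)) @ hb
    return total

def complex_gaussian(rng, shape):
    """Placeholder channel draw: i.i.d. unit-variance complex Gaussian entries."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

n_t, n_r, num_ris, m_total = 4, 2, 2, 16    # illustrative sizes only
m = m_total // num_ris                      # M/L elements per RIS
rng = np.random.default_rng(0)
h_bar = [complex_gaussian(rng, (m, n_t)) for _ in range(num_ris)]   # Tx -> RIS
h_hat = [complex_gaussian(rng, (n_r, m)) for _ in range(num_ris)]   # RIS -> Rx
theta = [rng.uniform(0, 2 * np.pi, m) for _ in range(num_ris)]      # RIS phases
H = effective_channel(h_hat, h_bar, theta)                          # shape (N_r, N_t)
```

Only `theta` is under the designer's control here. Each diagonal entry of $\Theta_l$ has unit modulus, which is the hardware constraint mentioned in the problem statement.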

The authors consider two types of CNN layers:

Classic 2D Convolution (Conv2d)
Depthwise Separable Convolution (ConvSD), a lightweight variant

For both types, they design MISO (Multiple-Input Single-Output) and MIMO (Multiple-Input Multiple-Output) transmission strategies:

Conv2d MISO:
- Uses Time-Division Multiple Access (TDMA) to handle output channels sequentially.
- Employs $C_{in}$ OFDM carriers simultaneously to transmit input channels.
- Requires $C_{in} \times C_{out}$ time slots (or precoder adjustments) to emulate all kernels.
- Trade-off: High degrees of freedom (DoFs) for emulation but higher signaling overhead (more time slots).
Conv2d MIMO:
- Uses $C_{out}$ receive antennas to capture all output channels in a single time slot.
- Requires only one RIS adjustment and one set of precoders/combiners.
- Trade-off: Low overhead but fewer DoFs for kernel emulation compared to MISO.
ConvSD Strategies:
- Depthwise Convolution: Handled via the RIS/precoder/combiner design.
- Pointwise Convolution: Handled by digital processing at the receiver (linear combination of channels).
- MISO vs. MIMO: Similar trade-offs apply, but the overhead structure differs due to the separation of depthwise and pointwise stages.
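For readers unfamiliar with depthwise separable convolution, the two stages can be sketched digitally as follows. This is a generic illustration of the layer type, not the paper's over-the-air realization. It assumes stride 1 and no padding, and the function names are hypothetical.

```python
import numpy as np

def depthwise(x, dw):
    """Filter each input channel with its own k x k kernel (stride 1, no padding)."""
    c, h, w = x.shape
    k = dw.shape[-1]
    out = np.zeros((c, h - k + 1, w - k + 1))
    for ch in range(c):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[ch, i, j] = np.sum(x[ch, i:i + k, j:j + k] * dw[ch])
    return out

def pointwise(z, pw):
    """1x1 convolution: each output channel is a linear combination of input channels."""
    return np.einsum('oc,chw->ohw', pw, z)

c_in, c_out, k = 3, 4, 3                      # illustrative sizes only
rng = np.random.default_rng(0)
x = rng.standard_normal((c_in, 6, 6))
dw = rng.standard_normal((c_in, k, k))        # one spatial kernel per input channel
pw = rng.standard_normal((c_out, c_in))       # channel-mixing weights

z = depthwise(x, dw)                          # shape (3, 4, 4)
y = pointwise(z, pw)                          # shape (4, 4, 4)

params_conv2d = c_out * c_in * k * k          # classic Conv2d weight count
params_convsd = c_in * k * k + c_out * c_in   # depthwise + pointwise weight count
```

With these sizes the classic layer has 108 weights and the separable one has 39, which is why ConvSD is described as lightweight. The pointwise stage is only a per-pixel linear combination of channels, consistent with it being left to digital processing at the receiver.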

C. Optimization

The system parameters (precoder $F_1$, combiner $F_2$, and RIS phases $\Theta$) are jointly optimized to minimize the Frobenius norm error between the effective channel and the target digital kernel, subject to power and unit-modulus constraints. The entire system is trained end-to-end using a standard cross-entropy loss function for image classification.
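To give a feel for the unit-modulus part of this problem, here is a deliberately simplified sketch. It is not the paper's algorithm. It assumes a single RIS, fixed random channels, no precoder or combiner, and no power constraint, and it uses element-wise coordinate descent as a stand-in solver. The function name `fit_ris_phases` and all sizes are hypothetical.

```python
import numpy as np

def fit_ris_phases(a, b, target, sweeps=20, seed=0):
    """Coordinate descent over unit-modulus RIS coefficients v (|v_i| = 1).

    Minimises || a @ diag(v) @ b - target ||_F^2 one element at a time.
    Returns (v, history); history[0] is the error at the random start and
    history[s] is the error after sweep s.
    """
    rng = np.random.default_rng(seed)
    m = a.shape[1]
    v = np.exp(1j * rng.uniform(0, 2 * np.pi, m))
    err = (a * v) @ b - target
    history = [np.linalg.norm(err) ** 2]
    for _ in range(sweeps):
        for i in range(m):
            c = np.outer(a[:, i], b[i, :])   # contribution of element i per unit coefficient
            r = err - v[i] * c               # residual with element i removed
            inner = np.vdot(r, c)            # sum(conj(r) * c)
            if abs(inner) > 1e-12:
                # Closed-form minimiser of ||r + v_i * c||_F over |v_i| = 1.
                v[i] = -np.conj(inner) / abs(inner)
            err = r + v[i] * c
        history.append(np.linalg.norm(err) ** 2)
    return v, history

rng = np.random.default_rng(1)
n_r, n_t, m = 2, 4, 32                       # illustrative sizes only
a = (rng.standard_normal((n_r, m)) + 1j * rng.standard_normal((n_r, m))) / np.sqrt(2)
b = (rng.standard_normal((m, n_t)) + 1j * rng.standard_normal((m, n_t))) / np.sqrt(2)
target = rng.standard_normal((n_r, n_t))     # stand-in for a block of the unfolded kernel
v, history = fit_ris_phases(a, b, target)
```

Because each element update is solved exactly, the error never increases from one sweep to the next. The full problem described above additionally couples the phases with $F_1$ and $F_2$ and is trained end-to-end, which this sketch does not attempt.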

3. Key Contributions

AirCNN Framework: A novel paradigm for realizing 2D CNNs entirely via analog OTA computation using RIS, eliminating the need for digital MAC operations in the middle layer.
Dual-Architecture Design: Proposes and analyzes both MISO and MIMO architectures for realizing Conv2d and ConvSD, providing a comprehensive comparison of their performance versus communication overhead.
Joint Optimization: Develops a joint optimization algorithm for transceiver design and RIS phase shifts under practical constraints (power, unit modulus).
Extension to Lightweight CNNs: Extends the framework to Depthwise Separable Convolutions (ConvSD), addressing the specific transmission strategies required for depthwise and pointwise stages.
Multi-RIS Analysis: Demonstrates the benefits of deploying multiple RISs to enhance channel rank and degrees of freedom, particularly in Line-of-Sight (LoS) environments.

4. Numerical Results

Simulations were conducted using the Fashion MNIST dataset with a Rician fading channel model.

Performance vs. Power: Classification accuracy increases with transmit power ($P_{max}$). Conv2d-based schemes generally outperform ConvSD due to stronger feature extraction capabilities.
MISO vs. MIMO (Conv2d): Conv2d MISO consistently outperforms Conv2d MIMO across various settings. This is attributed to the MISO scheme's ability to utilize more degrees of freedom (via multiple time slots/precoder adjustments) to accurately emulate complex kernels.
MISO vs. MIMO (ConvSD): The performance depends on channel conditions. MISO is superior only under poor channel conditions (low power or low Rician factor $K$). Under high power or strong LoS conditions, MIMO performs better or comparably due to lower overhead and efficient resource usage.
Impact of RIS Elements ($M$): Increasing the number of reflecting elements significantly improves accuracy by providing more DoFs and beamforming gain.
Impact of Rician Factor ($K$):
- In Single-RIS setups, accuracy initially rises with $K$ (better channel gain) but eventually decreases when $K$ is very high (strong LoS). This is because a strong LoS component reduces the channel rank, limiting the DoFs available for neural network emulation.
- Multi-RIS setups mitigate this issue, maintaining robust performance even in strong LoS environments by increasing the effective channel rank.
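The rank argument can be illustrated with a toy model. The sketch below is not the paper's simulation setup. It assumes a single-hop Rician channel with a half-wavelength uniform linear array, arbitrary illustrative angles, and hypothetical helper names. It shows that the second singular value collapses relative to the first as $K$ grows, and that summing two LoS paths with different angles (a crude proxy for two RISs) restores rank 2.

```python
import numpy as np

def steering(n, angle):
    """Half-wavelength uniform linear array response (unit-modulus entries)."""
    return np.exp(1j * np.pi * np.arange(n) * np.sin(angle))

def rician(n_r, n_t, k_factor, angle_r, angle_t, rng):
    """Rician channel: rank-one LoS term plus i.i.d. complex Gaussian scattering."""
    los = np.outer(steering(n_r, angle_r), steering(n_t, angle_t).conj())
    nlos = (rng.standard_normal((n_r, n_t))
            + 1j * rng.standard_normal((n_r, n_t))) / np.sqrt(2)
    return np.sqrt(k_factor / (k_factor + 1)) * los + np.sqrt(1 / (k_factor + 1)) * nlos

def second_to_first(h):
    """Ratio of the second-largest to the largest singular value."""
    s = np.linalg.svd(h, compute_uv=False)
    return s[1] / s[0]

rng = np.random.default_rng(0)
ratios = {k: second_to_first(rician(8, 8, k, 0.3, -0.5, rng))
          for k in (0.1, 1.0, 10.0, 100.0)}

# Pure LoS: one path is rank one; two paths with distinct angles give rank two.
los_a = np.outer(steering(8, 0.3), steering(8, -0.5).conj())
los_b = np.outer(steering(8, -0.9), steering(8, 0.7).conj())
rank_single = np.linalg.matrix_rank(los_a, tol=1e-9)
rank_two = np.linalg.matrix_rank(los_a + los_b, tol=1e-9)
```

A channel dominated by one LoS path offers essentially one usable spatial dimension, so there is little freedom left to shape $H$ toward a target kernel. Each additional, geometrically distinct path adds a dimension back.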

5. Significance

Latency Reduction: By shifting computation from the digital domain to the physical layer, AirCNN eliminates sequential MAC operations, potentially enabling ultra-low latency inference for 6G applications.
Energy Efficiency: Analog computation via RIS consumes significantly less energy than digital processing, aligning with green communication goals.
Co-Design Paradigm: The paper establishes a critical link between communication theory (channel modeling, beamforming) and deep learning, showing that the wireless channel itself can be a programmable neural network layer.
Practical Viability: The study highlights that while MISO offers higher accuracy, MIMO is more efficient in terms of overhead, offering system designers a flexible trade-off based on specific application requirements (e.g., latency vs. accuracy vs. bandwidth). The findings regarding multi-RIS deployment are particularly crucial for future 6G networks operating in diverse propagation environments.