DroFiT: A Lightweight Band-fused Frequency Attention Toward Real-time UAV Speech Enhancement

Imagine you are flying a drone to deliver a package or film a movie. You try to pick up someone's voice or record a voice command through the drone's microphone, but all you hear is a deafening, buzzing "whirrrrr" from the drone's propellers. It's like trying to have a quiet conversation at a rock concert. This paper introduces a new digital "noise-canceling ear" called DroFiT designed specifically to solve this problem.

Here is the story of how DroFiT works, explained without the heavy math jargon.

The Problem: The "Heavy" Solution

Before DroFiT, the best tools to clean up drone noise were like giant, powerful cranes. They could definitely lift the heavy noise off your voice, but they were too big, too heavy, and ate up too much battery power. If you tried to put one of these "cranes" on a small, battery-powered drone, the drone would crash because it couldn't carry the weight or the power drain.

The Solution: DroFiT (The "Smart, Lightweight Drone")

The researchers built DroFiT (Drone Frequency lightweight Transformer). Think of DroFiT not as a giant crane, but as a swarm of tiny, hyper-efficient bees. It does the same job as the giant crane but uses a fraction of the energy and fits in a tiny backpack.

Here is how DroFiT cleans up the noise using three clever tricks:

1. The "Band-Fused" Strategy (The Orchestra Analogy)

Imagine the sound coming into the microphone is a messy orchestra playing all at once.

Old methods tried to listen to the entire orchestra at once to figure out who was playing what. This is slow and confusing.
DroFiT splits the orchestra into two groups:
- The Full Band: It listens to the whole room to get the "big picture" of the noise.
- The Sub-Bands: It zooms in on specific sections (like just the violins or just the drums) to catch the tiny, specific details of the drone's buzz.
The Magic: DroFiT combines these two views instantly. It's like having a conductor who hears the whole symphony and a specialist who knows exactly which violin string is out of tune, all at the same time. This helps it separate the human voice from the drone buzz much faster.

2. The "Frequency-Only" Brain (The Traffic Light Analogy)

Most AI models try to pay attention to time (what happened a second ago) and frequency (the pitch of the sound) simultaneously. This is like a traffic light trying to control every car on the highway and every pedestrian on the sidewalk at the exact same moment. It gets overwhelmed and slows down.

DroFiT changes the rules:

It ignores the "time" traffic for a moment and focuses only on the frequency (the pitch).
It treats the drone noise like a specific, annoying hum that stays on one "lane" of the road.
By only looking at the "frequency lanes," it can process the sound much faster, like a traffic system that only manages the main highway lanes, letting the cars (the voice) flow through without stopping.

3. The "Streaming" Design (The Conveyor Belt)

Old models worked like a laundry basket: they waited until they had a whole pile of audio (a chunk of time) before they started washing it. This caused a delay (latency) and required a huge basket (memory) to hold everything.

DroFiT works like a conveyor belt:

As soon as a tiny piece of sound comes in, it gets processed immediately and passed along.
It doesn't need to hoard a massive pile of data. It just needs a small, steady stream. This makes it perfect for real-time use on a drone where you can't wait for the audio to "buffer."

The Results: Fast, Light, and Clear

The researchers tested DroFiT against the "giant cranes" (other AI models) using recordings of people talking over loud drone noise.

Performance: DroFiT cleaned up the voice just as well as the heavy models. The voice sounded natural and clear.
Efficiency: Here is the big win. DroFiT needed about 17 times fewer calculations and about 26 times fewer parameters (the numbers the model has to store) than the biggest competitor.
Battery Life: Because it is so efficient, a drone could run this software on its own computer without draining the battery in minutes.

The Bottom Line

DroFiT is a smart, lightweight software tool that lets drones "hear" human voices clearly even when their own motors are screaming. It does this by splitting the sound into manageable chunks, focusing only on the most important parts, and processing it in real-time. This means future drones won't just be able to see us; they'll be able to hear and understand us, even while flying at full speed.

1. Problem Statement

Unmanned Aerial Vehicles (UAVs) are increasingly used for applications requiring acoustic interaction (e.g., disaster monitoring, parcel delivery). However, capturing clear speech on UAVs is severely hindered by self-noise generated by propellers and motors. This noise combines wideband components with periodic, narrowband harmonics, and it often results in extremely low Signal-to-Noise Ratios (SNR), from -5 dB down to -25 dB.

Existing solutions face two main limitations:

Hardware Constraints: Multi-microphone beamforming requires additional hardware, which adds weight and complexity to UAVs.
Computational Constraints: State-of-the-art single-channel deep learning models (e.g., DCU-Net) offer high-quality enhancement but are too computationally heavy and memory-intensive for resource-constrained, battery-powered UAV platforms. Conversely, lightweight models (e.g., SMoLnet-T) often rely on chunk-based processing that introduces latency and high peak memory usage, making real-time streaming difficult.

2. Methodology: The DroFiT Architecture

The authors propose DroFiT (Drone Frequency lightweight Transformer), a single-microphone speech enhancement network designed for real-time streaming and low resource consumption. The architecture integrates three core components:

A. Hybrid Encoder-Decoder (Full-Band & Sub-Band)

Instead of processing the entire spectrogram as a single block, DroFiT employs a parallel dual-path strategy:

Full-Band Path: Uses Conv1D-based CNA blocks and a Global Convolution (GConv) to capture long-range spectral dependencies across the entire frequency spectrum.
Sub-Band Path: Divides the input spectrogram (513 bins) into five Mel-like groups (32-32-64-128-257 bins). Each group is processed by lightweight convolutional layers to focus on fine-grained, speech-dominant low-frequency regions.
Fusion: Learnable skip-and-gate connections adaptively fuse local sub-band details with global full-band context, optimizing the reconstruction of both spectral magnitude and phase.
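The sub-band grouping above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `split_subbands` and `gated_fusion` are hypothetical, the text does not say how the gate is parameterised, and the sigmoid blend below is only one common form of a learnable gate.

```python
import numpy as np

# Group widths quoted in the text: 32 + 32 + 64 + 128 + 257 = 513 bins.
BAND_WIDTHS = (32, 32, 64, 128, 257)

def split_subbands(spec, widths=BAND_WIDTHS):
    """Split a (freq, time) spectrogram into contiguous frequency groups."""
    if spec.shape[0] != sum(widths):
        raise ValueError(f"expected {sum(widths)} bins, got {spec.shape[0]}")
    edges = np.cumsum(widths)[:-1]  # split points between groups: 32, 64, 128, 256
    return np.split(spec, edges, axis=0)

def gated_fusion(local, global_, gate_logits):
    """Blend two feature maps with a sigmoid gate (assumed fusion form)."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate * local + (1.0 - gate) * global_

# Stand-in spectrogram: 513 frequency bins, 10 time frames.
spec = np.random.default_rng(0).standard_normal((513, 10))
bands = split_subbands(spec)
band_shapes = [b.shape for b in bands]

# With all gate logits at zero the gate is 0.5, so the blend is a plain average.
fused = gated_fusion(spec, np.zeros_like(spec), np.zeros_like(spec))
```

The narrow groups sit at the low end of the spectrum, so the three lowest groups (128 bins in total) get three separate sub-band paths, while the top 257 bins share a single one.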

B. Frequency-Wise Transformer

A key innovation is the Frequency-wise Transformer, which replaces standard 2D (time-frequency) attention with frequency-only attention.

Mechanism: It applies multi-head self-attention exclusively along the frequency axis while discarding temporal attention.
Complexity Reduction: By eliminating temporal attention (which is quadratic in the number of frames, $O(T^2)$) and restricting frequency attention to local windows, the computational complexity is reduced from $O(F^2T^2d)$ to a cost that grows only linearly with the number of frames, $O(T)$.
Benefit: This design allows for efficient streaming without the need to store past key/value states, significantly reducing memory usage.
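A minimal sketch of frequency-only attention with local windows is shown below. It makes several simplifying assumptions that the text does not confirm: a single head instead of multiple heads, non-overlapping windows, and no positional encoding. The names `freq_window_attention` and `attention_score_macs` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def freq_window_attention(x, wq, wk, wv, window):
    """Single-head self-attention along frequency inside local windows.

    x has shape (T, F, d). Every frame is handled independently, so no
    keys or values from earlier frames are needed.
    """
    T, F, d = x.shape
    if F % window != 0:
        raise ValueError("F must be divisible by the window size")
    q, k, v = x @ wq, x @ wk, x @ wv                     # each (T, F, d)
    shape = (T, F // window, window, d)                  # group bins into windows
    q, k, v = (a.reshape(shape) for a in (q, k, v))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)    # (T, n_windows, w, w)
    out = softmax(scores) @ v                            # (T, n_windows, w, d)
    return out.reshape(T, F, d)

def attention_score_macs(T, F, window, d):
    """MACs for the query-key products only: T * (F / w) * w^2 * d."""
    return T * (F // window) * window * window * d

rng = np.random.default_rng(0)
T, F, d, window = 6, 16, 8, 4
x = rng.standard_normal((T, F, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Processing all frames at once and processing them one by one give the same result.
out_full = freq_window_attention(x, wq, wk, wv, window)
out_stream = np.concatenate(
    [freq_window_attention(x[t:t + 1], wq, wk, wv, window) for t in range(T)]
)
```

Because attention never crosses frame boundaries, the frame-by-frame result equals the batch result, and the cost of the score computation doubles when the number of frames doubles.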

C. Temporal Convolutional Network (TCN) Back-end

To model temporal dependencies without the latency of chunk-based Transformers, DroFiT uses a TCN after the frequency modeling stage.

The TCN captures temporal continuity using Conv1D layers with flexible receptive fields.
It processes audio in small, sequential time chunks, enabling true real-time streaming and efficient memory reuse.
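The streaming behaviour described here can be illustrated with a single causal, dilated Conv1D layer that keeps a fixed-size history buffer. This is a sketch under stated assumptions, not the DroFiT back-end itself: the class `StreamingCausalConv1D`, the kernel size, and the dilation are all hypothetical choices, and a real TCN would stack several such layers with nonlinearities.

```python
import numpy as np

class StreamingCausalConv1D:
    """Causal dilated 1-D convolution over time, fed one frame at a time."""

    def __init__(self, weight, dilation=1):
        # weight has shape (kernel_size, channels_in, channels_out).
        self.weight = weight
        self.dilation = dilation
        k, c_in, _ = weight.shape
        # The buffer spans the receptive field: (k - 1) * dilation + 1 frames.
        self.buffer = np.zeros(((k - 1) * dilation + 1, c_in))

    def step(self, frame):
        """Consume one (channels_in,) frame and return one (channels_out,) frame."""
        self.buffer = np.roll(self.buffer, -1, axis=0)   # drop the oldest frame
        self.buffer[-1] = frame                          # append the newest frame
        taps = self.buffer[::self.dilation]              # (k, c_in), oldest first
        return np.einsum("kc,kco->o", taps, self.weight)

def offline_causal_conv(x, weight, dilation=1):
    """Reference: the same convolution computed over a whole (T, c_in) signal."""
    k, c_in, c_out = weight.shape
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    out = np.zeros((x.shape[0], c_out))
    for t in range(x.shape[0]):
        taps = padded[t:t + pad + 1:dilation]
        out[t] = np.einsum("kc,kco->o", taps, weight)
    return out

rng = np.random.default_rng(0)
weight = rng.standard_normal((3, 4, 5))
x = rng.standard_normal((20, 4))

layer = StreamingCausalConv1D(weight, dilation=2)
streamed = np.stack([layer.step(frame) for frame in x])
reference = offline_causal_conv(x, weight, dilation=2)
```

The streamed output matches the offline reference, while the memory held between steps stays fixed at the size of the receptive field regardless of how long the recording is.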

D. Output Combination & Loss Function

Combine Block: The outputs of the full-band and sub-band paths are concatenated and processed through a Conv2D layer and a Learning Gate to refine the final complex-valued (real and imaginary) output.
Loss Function: A combined loss is used to balance spectral consistency and waveform fidelity:
- STFT-domain: Weighted sum of Magnitude Loss and Complex Loss.
- Time-domain: Scale-Invariant Signal-to-Distortion Ratio (SI-SDR).
- Auxiliary objectives (cMSE, LSD) are also employed.
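The loss terms listed above can be sketched as follows. The text does not give the loss weights or say whether the STFT terms use squared or absolute error, so the squared-error form and the values of `alpha` and `beta` below are assumptions, and `toy_stft` is only a non-overlapping framed FFT standing in for a real STFT. The auxiliary cMSE and LSD terms are omitted.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target            # part of the estimate aligned with the target
    residual = estimate - projection       # everything else counts as distortion
    return 10 * np.log10((np.sum(projection**2) + eps) / (np.sum(residual**2) + eps))

def stft_loss(est_spec, ref_spec, alpha=0.5):
    """Weighted sum of a magnitude loss and a complex loss (alpha is assumed)."""
    magnitude = np.mean((np.abs(est_spec) - np.abs(ref_spec)) ** 2)
    complex_ = np.mean(np.abs(est_spec - ref_spec) ** 2)
    return alpha * magnitude + (1 - alpha) * complex_

def combined_loss(est_wave, ref_wave, est_spec, ref_spec, alpha=0.5, beta=0.01):
    """STFT-domain loss minus a weighted SI-SDR (higher SI-SDR is better)."""
    return stft_loss(est_spec, ref_spec, alpha) - beta * si_sdr(est_wave, ref_wave)

def toy_stft(wave, frame=128):
    """Stand-in for an STFT: FFT of non-overlapping frames, no window."""
    return np.fft.rfft(wave.reshape(-1, frame), axis=1)

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
good = target + 0.1 * noise                # mild distortion
bad = target + 1.0 * noise                 # heavy distortion

loss_good = combined_loss(good, target, toy_stft(good), toy_stft(target))
loss_bad = combined_loss(bad, target, toy_stft(bad), toy_stft(target))
```

SI-SDR is subtracted because it is a quality score rather than an error: a training objective that is minimised has to reward higher SI-SDR.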

3. Key Contributions

Lightweight Frequency Attention: A frequency-only Transformer with local windowing drastically reduces computational complexity (to $O(T)$ in the number of frames) compared with standard time-frequency Transformers, making it suitable for embedded AI.
Band-Fused Architecture: A novel parallel processing of full-band and sub-band representations with learnable fusion, effectively capturing both global spectral context and fine-grained harmonic structures specific to UAV noise.
Real-Time Streaming Capability: By replacing chunk-based Transformer processing with a TCN back-end that consumes small, sequential chunks, DroFiT avoids the high latency and peak memory spikes associated with previous lightweight models like SMoLnet-T.
Complex-Domain Modeling: The model jointly estimates real and imaginary STFT components, leading to better phase reconstruction and more natural sound quality compared to magnitude-only masking.

4. Experimental Results

The model was trained on VoiceBank-DEMAND mixed with recorded DJI Flip drone noise at SNRs of -5 to -25 dB.
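Mixing clean speech with noise at a chosen SNR is typically done by rescaling the noise before adding it. The sketch below shows that arithmetic on random stand-in signals. It is not the authors' data pipeline: the text does not say whether the noise or the speech was rescaled, and `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals the target SNR."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve speech_power / (gain^2 * noise_power) = 10^(snr_db / 10) for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise, gain

def measured_snr_db(speech, scaled_noise):
    """SNR in dB of a speech signal against an already-scaled noise signal."""
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a clean utterance
noise = 0.3 * rng.standard_normal(16000)   # stand-in for recorded drone noise

# One mixture per SNR at the two ends of the range quoted in the text.
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (-5, -25)}
```

At -25 dB the noise power is more than 300 times the speech power, which is why this range is described as extreme.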

Performance: DroFiT achieved competitive results against heavy baselines (DCU-Net) and lightweight baselines (SMoLnet-T).
- PESQ: 2.440 (slightly above the 2.433 reported for both DCU-Net and SMoLnet-T).
- STOI/ESTOI: Consistently higher than DCU-Net across all SNR levels.
- SI-SDR: 9.764 dB.
Efficiency:
- Parameters: Reduced by 26.7× compared to DCU-Net (0.105M vs. 2.808M).
- MACs (Computational Cost): Reduced by 17.3× compared to DCU-Net and by about 10× compared to SMoLnet-T (1.86 G for DroFiT vs. 18.64 G for SMoLnet-T).
Conclusion: DroFiT maintains high speech intelligibility and quality while significantly lowering the computational and memory footprint.
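The reduction factors above can be checked with simple arithmetic. The snippet assumes that 1.86 G is DroFiT's MAC count and 18.64 G is SMoLnet-T's, which is what the quoted 10× ratio suggests. The DCU-Net MAC count is not stated in this summary, so the value computed below is only what the 17.3× figure implies, not a reported number.

```python
# Figures quoted in the text (parameters in millions, MACs in giga-MACs).
params_drofit_m = 0.105
params_dcunet_m = 2.808
macs_drofit_g = 1.86
macs_smolnet_g = 18.64     # assumed to be SMoLnet-T's figure

param_ratio = params_dcunet_m / params_drofit_m          # parameter reduction vs. DCU-Net
macs_ratio_smolnet = macs_smolnet_g / macs_drofit_g      # MAC reduction vs. SMoLnet-T

# Implied, not reported: DCU-Net MACs if the 17.3x reduction is taken at face value.
implied_dcunet_macs_g = 17.3 * macs_drofit_g

print(f"parameter reduction vs. DCU-Net: {param_ratio:.1f}x")
print(f"MAC reduction vs. SMoLnet-T:     {macs_ratio_smolnet:.2f}x")
print(f"implied DCU-Net MACs:            {implied_dcunet_macs_g:.1f} G")
```

The parameter and SMoLnet-T ratios are consistent with the quoted 26.7× and roughly 10× reductions.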

5. Significance

This work addresses a critical bottleneck in UAV technology: on-board, real-time speech processing under extreme noise conditions.

Deployment Viability: By achieving linear time complexity and low memory usage, DroFiT becomes a practical candidate for resource-constrained embedded platforms (FPGA, ASIC) and battery-powered UAVs, without requiring external processing units.
Future Impact: The architecture paves the way for advanced UAV applications such as autonomous voice interaction, disaster rescue communication, and integration with downstream tasks like Automatic Speech Recognition (ASR) and Keyword Spotting (KWS) in noisy environments.