Toward Complex-Valued Neural Networks for Waveform Generation

🎵 The Big Idea: Teaching AI to "Feel" Music, Not Just Count It

Imagine you are trying to teach a robot to paint a sunset.

The Old Way (Real-Valued Networks): You tell the robot, "Paint the red part here, and the orange part there." The robot treats red and orange as two completely separate buckets of paint. It doesn't understand that they blend together to create a gradient. It has to guess how they interact.
The New Way (ComVo): You give the robot a single brush that holds a "sunset color" which naturally contains both red and orange mixed together. The robot understands that these colors are two sides of the same coin.

This paper introduces ComVo, a new AI voice synthesizer that uses this "mixed color" approach to create human-like speech and music.

🎧 The Problem: The "Split Personality" Voice

Current AI voice generators (called Vocoders) are great, but they have a weird habit. When they look at sound, they break it down into a Spectrogram (a map of sound frequencies).

Each point on that map has two parts:

Magnitude: How loud that frequency is.
Phase: Where that frequency's wave is in its cycle at that moment, which controls how the waves line up in time.

Think of a wave in the ocean.

Magnitude is the height of the wave.
Phase is the timing of the crest.

The Flaw: Most AI models treat these two parts like they are strangers. They have one brain for "Height" and a totally separate brain for "Timing." They try to guess how the two relate, but they often miss the subtle dance between them. This leads to audio that sounds a bit robotic or "muddy."

🚀 The Solution: ComVo (The "Complex" Brain)

The authors built ComVo (Complex-valued neural Vocoder). Instead of splitting the sound into two separate brains, ComVo uses a Complex-Valued Neural Network.

The Analogy:
Imagine a dance couple.

Old AI: The leader and the follower are in different rooms. The leader shouts instructions, and the follower tries to guess the moves. They often step on each other's toes.
ComVo: The leader and follower are holding hands. They move as a single unit. If the leader turns, the follower turns instantly and perfectly because they are physically connected.

By treating the "Loudness" and "Timing" as a single, connected entity (a complex number), ComVo captures the natural structure of sound much better.
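To make the "single connected entity" idea concrete, here is a minimal sketch using only Python's standard `cmath` module. The numeric values are made up for illustration and do not come from the paper. It shows that one complex number stores both magnitude and phase, and that a single complex multiplication scales the first and shifts the second together.

```python
import cmath
import math

# One spectrogram bin as a single complex number (illustrative values).
magnitude = 2.0            # "loudness"
phase = math.pi / 4        # "timing", as an angle in radians
z = cmath.rect(magnitude, phase)

# Both pieces can be read back from the same number.
mag_back, phase_back = cmath.polar(z)

# Multiplying by another complex number scales the magnitude and
# shifts the phase in one operation.
w = cmath.rect(0.5, math.pi / 2)
out_mag, out_phase = cmath.polar(w * z)
print(out_mag, out_phase)
```

A real-valued network that sees the real and imaginary parts as two unrelated channels has no such operation built in and would have to learn this joint behavior from data.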

⚙️ Three Secret Ingredients

To make this work, the team added three special tricks:

1. The "Phase Quantization" (The Ruler)

The Problem: In the world of sound, "timing" (phase) can be messy. It's like trying to draw a perfect circle freehand; you might wobble.
The Fix: ComVo uses a "Phase Quantization" layer. Imagine a ruler with only 128 marks. Instead of letting the AI guess a timing down to the nanosecond, it snaps the timing to the nearest mark on the ruler.
Why it helps: This stops the AI from getting confused by tiny, useless wiggles. It forces the AI to learn the big picture rhythm, making the voice sound more stable and natural.
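The snapping step can be sketched in a few lines. This is only an illustration under stated assumptions: the marks are taken to be evenly spaced around the circle, the function name `quantize_phase` is hypothetical, and the training-time gradient handling used in the paper is left out entirely.

```python
import cmath
import math

def quantize_phase(z: complex, num_levels: int = 128) -> complex:
    """Snap the phase of z to the nearest of num_levels evenly spaced
    angles (an assumed layout), keeping the magnitude unchanged."""
    magnitude, phase = cmath.polar(z)
    step = 2.0 * math.pi / num_levels      # angular spacing between marks
    snapped = round(phase / step) * step   # nearest mark
    return cmath.rect(magnitude, snapped)

z = cmath.rect(1.5, 0.30)   # illustrative input: magnitude 1.5, phase 0.30 rad
zq = quantize_phase(z)
print(abs(zq), cmath.phase(zq))
```

With 128 marks the snapped phase can differ from the original by at most half a step, that is $\pi/128$ radians, while the loudness is untouched.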

2. The "Block-Matrix" (The Assembly Line)

The Problem: Doing math with these "connected" numbers is usually slow. It's like a factory where workers have to stop and switch tools every time they pick up a new part.
The Fix: The team invented a "Block-Matrix" computation scheme. Imagine a super-efficient assembly line where four different tools are fused into one giant machine.
The Result: Training takes about 25% less time. The AI does the same amount of work in less time, saving money and energy.

3. The "Complex Discriminator" (The Tough Critic)

The Problem: In AI training, a "Generator" makes the sound, and a "Discriminator" (the critic) tries to spot if it's fake. Usually, the critic either looks only at loudness or treats the two parts of the sound as unrelated inputs.
The Fix: ComVo's critic looks at the sound with "complex eyes." It sees the connection between loudness and timing directly, so it can judge the "dance" between the two parts, not just the individual steps.

🏆 The Results: Does It Sound Better?

The team tested ComVo against the best voice AIs currently available (like HiFi-GAN and Vocos).

Quality: Human listeners rated ComVo on par with or slightly above the best baseline (4.07 vs. 4.05 for Vocos on a 5-point scale), and it scored best on the automatic quality metrics.
Speed: The "Block-Matrix" trick cut total training time by about 25%.
Versatility: It worked well not just for speech, but also for music (tested on a music dataset called MUSDB18-HQ).

🌟 The Takeaway

ComVo's key move is to stop treating sound as two separate puzzles (loudness and timing) and start treating it as one unified picture. By using math that respects the natural connection between these parts, and by building a faster engine to run it, the authors have created a voice synthesizer that scores higher on quality measures while keeping training time practical.

It's like upgrading from a bicycle with a wobbly chain to a high-performance sports car: the destination is the same, but the ride is smoother, faster, and much more enjoyable.

1. Problem Statement

Neural vocoders have significantly improved speech synthesis quality, with iSTFT-based vocoders gaining popularity for their ability to synthesize waveforms directly from complex spectrograms, avoiding computationally expensive learned upsampling stages. However, current state-of-the-art iSTFT-based vocoders rely on Real-Valued Neural Networks (RVNNs). These networks treat the real and imaginary parts of complex spectrograms as independent channels.

This separation creates a fundamental limitation:

Loss of Structural Coupling: Complex spectrograms inherently possess algebraic structures where the magnitude and phase (or real and imaginary components) are coupled. RVNNs do not build this coupling into their layers because they process the two components as independent channels, so any dependency between them has to be learned implicitly.
Suboptimal Modeling: By ignoring the complex domain's natural geometry, RVNNs may struggle to model phase transformations and spectral coherence effectively, limiting synthesis quality.

2. Methodology: ComVo

The authors propose ComVo, the first iSTFT-based vocoder that employs Complex-Valued Neural Networks (CVNNs) for both the generator and the discriminator, operating entirely within the complex domain.

A. Architecture

Generator: Adapted from the Vocos architecture (ConvNeXt-based). All convolutional and normalization layers are implemented as complex-valued operations.
- Split GELU: Uses a split activation function to maintain the ConvNeXt block layout in the complex setting.
- Phase Quantization Layer: A novel component inserted after the initial complex convolution. It discretizes phase angles into a fixed set of levels ($N_q$). This acts as an inductive bias to regularize training, mitigate phase drift, and stabilize the learning of coherent phase patterns. It uses a Straight-Through Estimator (STE) to maintain differentiability.
Discriminator:
- Complex Multi-Resolution Discriminator (cMRD): Unlike traditional discriminators that concatenate real/imaginary channels or use only magnitude, cMRD operates directly on complex spectrogram inputs using complex-valued layers. It applies adversarial losses to both real and imaginary parts.
- Multi-Period Discriminator (MPD): Remains a real-valued network operating on waveform segments to capture periodic structures, providing complementary supervision.
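As an illustration of the Split GELU item in the generator description, the sketch below applies GELU to the real and imaginary parts independently, which is the usual meaning of a "split" activation in the complex-valued network literature. The function names are hypothetical, and the use of the exact (erf-based) GELU rather than an approximation is an assumption, since the summary does not say which variant ComVo uses.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def split_gelu(z: complex) -> complex:
    """Apply GELU to the real and imaginary parts independently."""
    return complex(gelu(z.real), gelu(z.imag))

out = split_gelu(complex(1.0, -1.0))   # illustrative input
print(out)
```

Because the nonlinearity acts on each part separately, the block layout of a real-valued ConvNeXt can be kept as is, while the coupling between the parts is handled by the complex-valued convolutions around it.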

B. Efficient Computation: Block-Matrix Scheme

Complex operations typically require tracking real and imaginary components separately, leading to redundant computations and inefficient memory access in standard autodifferentiation frameworks.

Solution: The authors reformulate complex-valued operations as real-valued block-matrix multiplications.
Mechanism: A complex multiplication $z' = Wz$ (where $W = W_r + iW_i$ and $z = x + iy$) is computed as a single block-matrix operation:
$\begin{bmatrix} \text{Re}(z') \\ \text{Im}(z') \end{bmatrix} = \begin{bmatrix} W_r & -W_i \\ W_i & W_r \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$
Benefit: This fuses four independent real-valued multiplications into one block-matrix multiplication, significantly reducing the computational graph size and improving GPU parallelism.
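The equivalence behind the block-matrix form can be checked numerically. The following is a minimal pure-Python sketch with made-up 2x2 weights and a hypothetical `matvec` helper. It verifies only that the block form reproduces the direct complex product; it does not measure the GPU or autodifferentiation savings reported in the paper.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Illustrative complex weight W = W_r + i W_i and input z = x + i y.
W_r = [[1.0, 2.0], [0.0, -1.0]]
W_i = [[0.5, 0.0], [1.0, 3.0]]
x = [1.0, -2.0]
y = [4.0, 0.5]

# Reference: direct complex product W z.
W = [[complex(r, i) for r, i in zip(rr, ri)] for rr, ri in zip(W_r, W_i)]
z = [complex(a, b) for a, b in zip(x, y)]
direct = matvec(W, z)

# Block form: [[W_r, -W_i], [W_i, W_r]] applied to the stacked vector [x; y].
top = [rr + [-v for v in ri] for rr, ri in zip(W_r, W_i)]
bottom = [ri + rr for rr, ri in zip(W_r, W_i)]
stacked = matvec(top + bottom, x + y)

# First half of the result is Re(z'), second half is Im(z').
n = len(x)
blockwise = [complex(re, im) for re, im in zip(stacked[:n], stacked[n:])]
print(direct)
print(blockwise)
```

Both lists hold the same values, so the four products $W_r x$, $W_i y$, $W_i x$, and $W_r y$ can be obtained from one real-valued matrix multiplication.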

3. Key Contributions

First Complex-Domain Adversarial Framework: ComVo is the first iSTFT-based vocoder to utilize CVNNs in both the generator and discriminator, establishing a training framework that respects the algebraic structure of complex spectrograms.
Phase Quantization: Introduction of a tailored non-linear transformation that discretizes phase angles. This serves as a regularizer to guide the network toward learning structured phase patterns, improving stability and perceptual quality.
Block-Matrix Computation: An efficient implementation strategy that reduces training time by 25% by eliminating redundant operations in the backward pass, making complex-valued training feasible without sacrificing model fidelity.
Empirical Validation: Comprehensive experiments demonstrating that modeling real-imaginary correlations jointly yields superior performance compared to simply scaling up real-valued models.

4. Experimental Results

The model was evaluated on the LibriTTS (speech) and MUSDB18-HQ (music) datasets, comparing against strong baselines like HiFi-GAN, iSTFTNet, BigVGAN, and Vocos.

Objective Metrics: ComVo achieved the best results across all metrics, including higher UTMOS (3.69 vs. 3.60 for Vocos), higher PESQ (3.82 vs. 3.63), and lower MR-STFT error (0.84 vs. 0.88).
Subjective Metrics: In Mean Opinion Score (MOS) tests, ComVo matched or exceeded the performance of the best baselines (4.07 vs. 4.05 for Vocos).
Ablation Studies:
- Complex vs. Real: Replacing the generator or discriminator with complex-valued versions consistently improved metrics. The full complex setup ($G_C D_C$, complex generator with complex discriminator) yielded the best results.
- Phase Quantization: Setting $N_q=128$ provided the best trade-off, smoothing phase fluctuations and boosting perceptual quality (UTMOS/PESQ) with minimal impact on reconstruction error.
- Efficiency: The block-matrix scheme reduced the number of backward graph nodes by >55% in the generator and ~67% in the cMRD, resulting in a 25% reduction in total training time.
Scaling: Even when compared to a real-valued model with double the parameters (to match the memory footprint of the complex model), ComVo still outperformed it, which suggests that the gains come from the complex domain modeling, not just increased capacity.

5. Significance

This paper argues for a shift in neural vocoder design by showing that complex-valued modeling is not just theoretically sound but also outperforms real-valued baselines in practice for waveform generation.

Theoretical Insight: The results support the view that the coupling between real and imaginary components in spectrograms matters for high-fidelity synthesis and is not adequately captured by independent channel processing.
Practical Impact: The proposed block-matrix computation scheme solves the efficiency bottleneck often associated with CVNNs, making them viable for large-scale training.
Future Direction: The work opens avenues for applying complex-domain adversarial training to other generative paradigms (e.g., diffusion models) and exploring richer complex-domain activations and loss functions.

In summary, ComVo establishes a new state-of-the-art for iSTFT-based vocoders by leveraging the full mathematical structure of complex numbers, enhanced by novel regularization (phase quantization) and computational optimizations (block-matrix).