OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs

This paper introduces OSUM-Pangu, a fully open-source speech understanding foundation model built on the Ascend NPU platform using a non-CUDA stack, which achieves performance comparable to GPU-based models by integrating an audio encoder with the openPangu-7B LLM through a sequential training approach.

Yujie Liao, Xuelong Geng, Hongfei Xue, Shuiyuan Wang, Lei Xie

Published Thu, 12 Ma

Imagine you have a brilliant, multilingual librarian named OSUM-Pangu. This librarian is incredibly smart: they can listen to a recording, understand what the speaker is saying, guess their age, detect their emotions, and even answer complex questions about the audio.

But here's the catch: most super-smart librarians in the world only work in a specific, expensive building called the NVIDIA CUDA House. If you try to bring them into a different building (like the Ascend NPU house, which is popular in China and doesn't use the same "electricity" or wiring), they get confused and stop working.

This paper introduces OSUM-Pangu, a new librarian who was born and raised entirely inside the Ascend NPU building. They don't need the expensive CUDA House to function.

Here is how they built this new librarian, explained simply:

1. The Problem: The "Language Barrier" Between Hardware

Think of the CUDA ecosystem (used by NVIDIA GPUs) as a universal language that almost all AI models speak. The Ascend NPU (a chip made by Huawei) speaks a different dialect.

  • The Old Way: Researchers built great speech models, but they were like tourists who only spoke the "CUDA dialect." If you tried to run them on an Ascend chip, they couldn't understand the instructions.
  • The Goal: The authors wanted to build a speech model that speaks the "Ascend dialect" natively, so it runs fast and smoothly without needing to translate everything first.

2. The Solution: A Custom-Built Team

The authors didn't build a librarian from scratch; they assembled a dream team using parts that already fit the Ascend building:

  • The Ears (Audio Encoder): They used a pre-trained "ear" (based on Whisper) that is great at listening to sound. This part is frozen, meaning it's a reliable, unchanging tool that just does its job perfectly.
  • The Translator (Adapter): Imagine sound waves are like a long, messy stream of water. The computer can't drink that much at once. The "Adapter" is a funnel that squishes the long stream of sound into a neat, manageable cup of "speech tokens" that the brain can understand.
  • The Brain (LLM Backbone): This is the star of the show. Instead of using a brain trained in the CUDA House (like Qwen2), they used openPangu-7B. This is a giant brain that was already trained inside the Ascend NPU building. It knows how to think, reason, and chat in the native language of the hardware.
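The "funnel" role of the Adapter can be sketched in a few lines. The shapes and the 4x downsampling factor below are illustrative assumptions, not the paper's exact configuration, and the real adapter also contains learned projection layers:

```python
import numpy as np

def adapter_funnel(frames: np.ndarray, stack: int = 4) -> np.ndarray:
    """Illustrative 'funnel': stack every `stack` consecutive encoder
    frames into one wider vector, shrinking the sequence length 4x so
    the LLM sees fewer, denser speech tokens. (This sketch shows only
    the downsampling, not the learned projections.)"""
    t, d = frames.shape
    t = t - t % stack                     # drop any ragged tail frames
    return frames[:t].reshape(t // stack, d * stack)

# A 100-frame, 80-dim encoder output becomes 25 "speech tokens" of dim 320.
enc_out = np.random.randn(100, 80)
tokens = adapter_funnel(enc_out)
print(tokens.shape)  # (25, 320)
```

The point of the funnel is purely practical: audio encoders emit many frames per second, and shortening that stream keeps the LLM's context window manageable.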

3. The Training: A Three-Step School

You can't just plug a brain into ears and expect it to work perfectly. The authors put OSUM-Pangu through a three-stage school:

  • Stage 1: Learning the Ropes (Tag-Based Alignment):
    First, they taught the model to listen to sound and answer specific, rigid questions like "What does this say?" or "How old is the speaker?" They used fixed labels (like <asr> or <age>) as training wheels. This taught the ears and the brain how to talk to each other.
  • Stage 2: Learning to Read Minds (Text Intent):
    Next, they removed the audio. They gave the model only text instructions like, "I want to know the speaker's age." They taught the model to recognize that this sentence means "Do the age task." This is like teaching the librarian to understand the intent behind a question, not just the keywords.
  • Stage 3: The Grand Finale (Joint Integration):
    Finally, they combined everything. The model now hears audio and reads natural language instructions like, "Hey, what's happening in this clip, and is the speaker a child?" The model has to figure out: "Oh, they want me to do transcription AND age detection!" It does both automatically.
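The three stages above mainly differ in how the training prompt is built. The tag names like `<asr>` and `<age>` come from the paper; the helper itself is a hypothetical sketch, not the authors' code:

```python
def build_prompt(stage: int, task: str, instruction: str = "") -> str:
    """Illustrative prompt formats for the three training stages."""
    if stage == 1:
        # Stage 1 (tag-based alignment): rigid task tags as training wheels.
        return f"<{task}>"
    if stage == 2:
        # Stage 2 (text intent): map free-form text to the intended task.
        return f"Instruction: {instruction}\nTask: <{task}>"
    # Stage 3 (joint integration): audio plus a natural-language instruction.
    return f"<audio> {instruction}"

print(build_prompt(1, "age"))
print(build_prompt(2, "age", "I want to know the speaker's age."))
print(build_prompt(3, "asr+age", "What's said, and is the speaker a child?"))
```

The progression is deliberate: rigid tags first teach the ears and brain a shared vocabulary, text-only intent teaches generalization, and the final stage fuses both.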

4. The Results: A Rival to the Giants

The team tested OSUM-Pangu against the big, famous models that run on expensive NVIDIA GPUs.

  • Performance: OSUM-Pangu performed almost as well as the giants. In some tasks, like guessing the speaker's age or style, it was actually better.
  • Flexibility: The best part? It understands natural language. You don't need to type a specific code to ask it to do something. You can just say, "Tell me what this audio is about," and it figures it out. It achieved a 90.2% success rate in following these natural instructions.
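A metric like the 90.2% instruction-following rate could be computed by checking whether the model performed exactly the tasks the instruction asked for. This scoring function is purely illustrative; the paper's evaluation protocol may differ:

```python
def success_rate(predicted: list, intended: list) -> float:
    """Fraction (in %) of examples where the predicted task set exactly
    matches the intended task set. Illustrative scoring only."""
    hits = sum(p == g for p, g in zip(predicted, intended))
    return 100.0 * hits / len(intended)

gold = [{"asr"}, {"asr", "age"}, {"emotion"}]
pred = [{"asr"}, {"asr", "age"}, {"gender"}]
print(f"{success_rate(pred, gold):.1f}%")  # 66.7%
```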

Why Does This Matter? (The Big Picture)

Think of the AI world as a city. For a long time, everyone built their houses using only NVIDIA bricks (CUDA). If you didn't have those bricks, you couldn't build a house.

This paper says: "You don't need NVIDIA bricks to build a great house."

By proving that you can build a top-tier, multi-talented speech AI entirely on Ascend bricks, they have:

  1. Opened the door for countries or companies that can't access NVIDIA chips to build their own powerful AI.
  2. Created a blueprint (open-source code) so anyone can copy this method.
  3. Proven that you don't need to sacrifice intelligence just because you are using different hardware.

In short, OSUM-Pangu is a proof-of-concept that says: "We can build world-class speech intelligence without relying on the traditional, expensive hardware everyone else uses."