OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs

This paper introduces OSUM-Pangu, a fully open-source speech understanding foundation model built on the Ascend NPU platform using a non-CUDA stack, which achieves performance comparable to GPU-based models by integrating an audio encoder with the openPangu-7B LLM through a sequential training approach.

Yujie Liao, Xuelong Geng, Hongfei Xue, Shuiyuan Wang, Lei Xie

Published Thu, 12 Ma

Imagine you have a brilliant, multilingual librarian named OSUM-Pangu. This librarian is incredibly smart: they can listen to a recording, understand what the speaker is saying, guess their age, detect their emotions, and even answer complex questions about the audio.

But here's the catch: most super-smart librarians in the world only work in a specific, expensive building called the NVIDIA CUDA House. If you try to bring them into a different building (like the Ascend NPU house, which is popular in China and doesn't use the same "electricity" or wiring), they get confused and stop working.

This paper introduces OSUM-Pangu, a new librarian who was born and raised entirely inside the Ascend NPU building. They don't need the expensive CUDA House to function.

Here is how they built this new librarian, explained simply:

1. The Problem: The "Language Barrier" Between Hardware

Think of the CUDA ecosystem (used by NVIDIA GPUs) as a universal language that almost all AI models speak. The Ascend NPU (a chip made by Huawei) speaks a different dialect.

  • The Old Way: Researchers built great speech models, but they were like tourists who only spoke the "CUDA dialect." If you tried to run them on an Ascend chip, they couldn't understand the instructions.
  • The Goal: The authors wanted to build a speech model that speaks the "Ascend dialect" natively, so it runs fast and smoothly without needing to translate everything first.

2. The Solution: A Custom-Built Team

The authors didn't build a librarian from scratch; they assembled a dream team using parts that already fit the Ascend building:

  • The Ears (Audio Encoder): They used a pre-trained "ear" (based on Whisper) that is great at listening to sound. This part is frozen, meaning it's a reliable, unchanging tool that just does its job perfectly.
  • The Translator (Adapter): Imagine sound waves are like a long, messy stream of water. The computer can't drink that much at once. The "Adapter" is a funnel that squishes the long stream of sound into a neat, manageable cup of "speech tokens" that the brain can understand.
  • The Brain (LLM Backbone): This is the star of the show. Instead of using a brain trained in the CUDA House (like Qwen2), they used openPangu-7B. This is a giant brain that was already trained inside the Ascend NPU building. It knows how to think, reason, and chat in the native language of the hardware.
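The "funnel" role of the Adapter can be sketched in a few lines. The shapes and the 4x downsampling factor below are illustrative assumptions, not the paper's exact configuration, and the real adapter also contains learned projection layers:

```python
import numpy as np

def adapter_funnel(frames: np.ndarray, stack: int = 4) -> np.ndarray:
    """Illustrative 'funnel': stack every `stack` consecutive encoder
    frames into one wider vector, shrinking the sequence length 4x so
    the LLM sees fewer, denser speech tokens. (This sketch shows only
    the downsampling, not the learned projections.)"""
    t, d = frames.shape
    t = t - t % stack                     # drop any ragged tail frames
    return frames[:t].reshape(t // stack, d * stack)

# A 100-frame, 80-dim encoder output becomes 25 "speech tokens" of dim 320.
enc_out = np.random.randn(100, 80)
tokens = adapter_funnel(enc_out)
print(tokens.shape)  # (25, 320)
```

The point of the funnel is purely practical: audio encoders emit many frames per second, and shortening that stream keeps the LLM's context window manageable.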

3. The Training: A Three-Step School

You can't just plug a brain into ears and expect it to work perfectly. The authors put OSUM-Pangu through a three-stage school:

  • Stage 1: Learning the Ropes (Tag-Based Alignment):
    First, they taught the model to listen to sound and answer specific, rigid questions like "What does this say?" or "How old is the speaker?" They used fixed labels (like <asr> or <age>) as training wheels. This taught the ears and the brain how to talk to each other.
  • Stage 2: Learning to Read Minds (Text Intent):
    Next, they removed the audio. They gave the model only text instructions like, "I want to know the speaker's age." They taught the model to recognize that this sentence means "Do the age task." This is like teaching the librarian to understand the intent behind a question, not just the keywords.
  • Stage 3: The Grand Finale (Joint Integration):
    Finally, they combined everything. The model now hears audio and reads natural language instructions like, "Hey, what's happening in this clip, and is the speaker a child?" The model has to figure out: "Oh, they want me to do transcription AND age detection!" It does both automatically.
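The three stages above mainly differ in how the training prompt is built. The tag names like `<asr>` and `<age>` come from the paper; the helper itself is a hypothetical sketch, not the authors' code:

```python
def build_prompt(stage: int, task: str, instruction: str = "") -> str:
    """Illustrative prompt formats for the three training stages."""
    if stage == 1:
        # Stage 1 (tag-based alignment): rigid task tags as training wheels.
        return f"<{task}>"
    if stage == 2:
        # Stage 2 (text intent): map free-form text to the intended task.
        return f"Instruction: {instruction}\nTask: <{task}>"
    # Stage 3 (joint integration): audio plus a natural-language instruction.
    return f"<audio> {instruction}"

print(build_prompt(1, "age"))
print(build_prompt(2, "age", "I want to know the speaker's age."))
print(build_prompt(3, "asr+age", "What's said, and is the speaker a child?"))
```

The progression is deliberate: rigid tags first teach the ears and brain a shared vocabulary, text-only intent teaches generalization, and the final stage fuses both.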

4. The Results: A Rival to the Giants

The team tested OSUM-Pangu against the big, famous models that run on expensive NVIDIA GPUs.

  • Performance: OSUM-Pangu performed almost as well as the giants. In some tasks, like guessing the speaker's age or style, it was actually better.
  • Flexibility: The best part? It understands natural language. You don't need to type a specific code to ask it to do something. You can just say, "Tell me what this audio is about," and it figures it out. It achieved a 90.2% success rate in following these natural instructions.
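A metric like the 90.2% instruction-following rate could be computed by checking whether the model performed exactly the tasks the instruction asked for. This scoring function is purely illustrative; the paper's evaluation protocol may differ:

```python
def success_rate(predicted: list, intended: list) -> float:
    """Fraction (in %) of examples where the predicted task set exactly
    matches the intended task set. Illustrative scoring only."""
    hits = sum(p == g for p, g in zip(predicted, intended))
    return 100.0 * hits / len(intended)

gold = [{"asr"}, {"asr", "age"}, {"emotion"}]
pred = [{"asr"}, {"asr", "age"}, {"gender"}]
print(f"{success_rate(pred, gold):.1f}%")  # 66.7%
```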

Why Does This Matter? (The Big Picture)

Think of the AI world as a city. For a long time, everyone built their houses using only NVIDIA bricks (CUDA). If you didn't have those bricks, you couldn't build a house.

This paper says: "You don't need NVIDIA bricks to build a great house."

By proving that you can build a top-tier, multi-talented speech AI entirely on Ascend bricks, they have:

  1. Opened the door for countries or companies that can't access NVIDIA chips to build their own powerful AI.
  2. Created a blueprint (open-source code) so anyone can copy this method.
  3. Proven that you don't need to sacrifice intelligence just because you are using different hardware.

In short, OSUM-Pangu is a proof-of-concept that says: "We can build world-class speech intelligence without relying on the traditional, expensive hardware everyone else uses."