Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization

This paper introduces SBARThez, a novel framework that leverages multimodal and multilingual sentence embeddings alongside a Named Entity Injection mechanism to enhance the factual consistency and cross-lingual capabilities of abstractive summarization for both text and speech inputs.

Chaimae Chellaf, Salima Mdhaffar, Yannick Estève, Stéphane Huet

Published Tue, 10 Ma

Imagine you have a massive library of books, podcasts, and news reports in dozens of different languages. You want to know the main points of a 50-page article or a 20-minute conversation, but you don't have time to read or listen to the whole thing. You need a summary.

For a long time, computers tried to do this by acting like photocopiers. They would scan the original text, pick out the "best" sentences, and glue them together. This is called extractive summarization. It's safe, but it's not very creative.

Then, computers got smarter and started acting like writers. They would read the whole thing and write a brand-new summary from scratch, using their own words. This is abstractive summarization. It's much better, but it has a dangerous flaw: hallucinations.

Think of a hallucination like a student who didn't study for a test but tries to bluff their way through. They might say, "The President visited the moon last Tuesday," even though that never happened. In the world of AI, this means the computer invents facts, names, or places that weren't in the original story.

The New Solution: SBARThez

The authors of this paper built a new system called SBARThez (a name that blends sentence embeddings with BARThez, a BART-style model adapted for French). Here is how it works, broken down into simple concepts:

1. The "Gist" vs. The "Letters"

Most AI models read text like a human reads a book: letter by letter, word by word. They focus on the tiny details (tokens).

  • The Old Way: Imagine trying to understand a movie by looking at every single pixel on the screen. It's detailed, but you might miss the big picture.
  • The SBARThez Way: This model looks at the sentences as whole blocks of meaning. It uses special "sentence embeddings." Think of these as summary cards. Instead of reading the whole sentence, the model looks at a single card that says, "This sentence is about a politician visiting a school."
  • Why it's cool: Because it thinks in "gists" rather than "letters," it can understand the meaning of a sentence in French, Spanish, or even a spoken audio clip, and turn it into a French summary without needing to translate word-for-word first. It's like having a universal translator that understands the idea, not just the dictionary definition.
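To make the "summary card" idea concrete, here is a toy sketch. This is not the paper's actual encoder; real systems use trained multilingual sentence encoders that map sentences in many languages into one shared dense vector space. The tiny bag-of-words vocabulary below is purely illustrative of the core idea: a whole sentence becomes one fixed-size vector, and sentences with similar meaning land close together.

```python
import math
import re

# Toy "sentence embedding": a bag-of-words count vector over a tiny fixed
# vocabulary. A real model would output a dense, language-agnostic vector.
VOCAB = ["politician", "visit", "school", "game", "player", "score"]

def embed(sentence: str) -> list[float]:
    words = re.findall(r"[a-z]+", sentence.lower())
    return [float(words.count(w)) for w in VOCAB]

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

a = embed("The politician will visit a school on Monday.")
b = embed("A school visit by the politician was announced.")
c = embed("The player tied the game with a late score.")

# Two phrasings of the same event get near-identical vectors;
# an unrelated sentence does not.
assert cosine(a, b) > cosine(a, c)
```

The key property the model relies on is exactly this one: downstream components compare and combine these fixed-size vectors instead of raw words, which is what lets text in different languages (or speech) share the same pipeline.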

2. The "Fact-Check" Safety Net (Named Entity Injection)

The biggest problem with these "gist" models is that they are so good at understanding the vibe that they sometimes forget the specific names. They might summarize a story about "Elon Musk" and just say "the famous tech CEO," or worse, invent a new CEO entirely.

To fix this, the authors added a Named Entity Injection mechanism.

  • The Analogy: Imagine you are writing a story about a football game. You know the general play-by-play (the gist), but you want to make sure you get the player names right. So, before you start writing, you pull out a list of the players on the field and tape it to your desk.
  • How it works: The system scans the original text, pulls out all the important names (People, Organizations, Locations), and feeds them directly to the "writer" part of the AI. This acts as a cheat sheet, forcing the AI to use the real names from the source text, drastically reducing the chance of making things up.
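The injection step can be sketched in a few lines. The entity extractor below is a crude capitalized-word heuristic standing in for a real NER tagger, and the `<entities>`/`<text>` markup is a hypothetical input format, not the paper's actual one; the point is only the shape of the mechanism: extract names from the source, then prepend them so the generator always sees the "cheat sheet."

```python
import re

def extract_entities(text: str) -> list[str]:
    # Crude stand-in for a trained NER model: collect capitalized words that
    # are not sentence-initial (to avoid ordinary sentence-starting words).
    entities = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for word in sentence.split()[1:]:  # skip the sentence-initial word
            token = word.strip(".,;:!?")
            if token.istitle() and token not in entities:
                entities.append(token)
    return entities

def inject_entities(source: str) -> str:
    # Prepend the extracted names as a prefix the summarizer must attend to.
    cheat_sheet = " | ".join(extract_entities(source))
    return f"<entities> {cheat_sheet} </entities> <text> {source} </text>"

doc = "Yesterday Elon Musk spoke in Austin. The Tesla CEO announced new plans."
print(inject_entities(doc))
```

Because the real names travel with the input, the generator can copy them directly instead of paraphrasing "Elon Musk" into "a famous tech CEO" or inventing someone new.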

3. Speaking and Listening

This system is also multimodal, meaning it can handle both text and speech.

  • The Analogy: Most summarizers are like librarians who only read books. SBARThez is like a librarian who can also listen to a podcast, a phone call, or a lecture, understand the main points, and write a summary.
  • It doesn't matter if the input is a typed article or a recording of a conversation; the system converts the audio into those same "summary cards" (embeddings) and processes them just like text.
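A minimal sketch of that "same cards, different sources" idea follows. The encoders here are toy placeholders (the function names and pooling logic are invented for illustration, not taken from the paper); what matters is the architecture: text and speech each get their own encoder, both emit vectors of the same dimension, and the summarizer downstream cannot tell which modality produced them.

```python
DIM = 4  # tiny shared embedding dimension, for illustration only

def encode_text(sentence: str) -> list[float]:
    # Stand-in text encoder: fold character codes into a fixed-size vector.
    vec = [0.0] * DIM
    for i, ch in enumerate(sentence.lower()):
        vec[i % DIM] += ord(ch) % 7
    return vec

def encode_speech(samples: list[float]) -> list[float]:
    # Stand-in speech encoder: pool raw audio samples into a vector of the
    # SAME size, so both modalities land in one shared space.
    vec = [0.0] * DIM
    for i, s in enumerate(samples):
        vec[i % DIM] += abs(s)
    return vec

def summarize(embeddings: list[list[float]]) -> str:
    # The "writer" only ever sees fixed-size vectors, never raw text or audio.
    return f"summary over {len(embeddings)} embeddings of dim {len(embeddings[0])}"

text_emb = encode_text("The meeting covered next year's budget.")
audio_emb = encode_speech([0.01, -0.02, 0.05, 0.00, -0.01])

assert len(text_emb) == len(audio_emb) == DIM
print(summarize([text_emb, audio_emb]))
```

This single-interface design is why one summarizer can serve articles, podcasts, and phone calls alike: only the front-end encoder changes per modality.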

Why Does This Matter?

  1. It's Great for Small Languages: Most AI is trained on English. If you try to summarize a story in a rare language (like Igbo or Kirundi), standard AI often fails or needs to translate it first (which introduces errors). SBARThez works directly on the "meaning" of these languages, making it much better for low-resource languages.
  2. It's More Concise: Because it thinks in sentences rather than words, it tends to write shorter, punchier summaries that get straight to the point, rather than just copying chunks of the original text.
  3. It's Honest: By using the "Name List" trick, it stops the AI from making up fake facts, which is crucial for news and medical summaries.

The Bottom Line

The authors created a summarizer that doesn't just copy-paste or blindly guess. It reads the "soul" of the content (using sentence embeddings), checks its facts against a list of real names (Named Entity Injection), and can handle both written text and spoken audio. It's like upgrading from a photocopier to a smart, multilingual editor who knows exactly who was in the room and what actually happened.