PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

This paper introduces PRISM, a persona-reasoned multimodal framework, together with the accompanying U-MStance dataset. They address two limitations of prior conversational stance detection, pseudo-multimodality and user homogeneity, by leveraging longitudinal user personas and chain-of-thought reasoning for more realistic, user-centric attitude interpretation.

Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song, Jingjie Lin, Sixuan Li, Jing Li, Ruifeng Xu

Published Wed, 11 Ma

Imagine you are walking through a bustling, chaotic town square (the internet). People are shouting opinions about everything from politics to the latest car models. Some are holding signs (text), while others are holding up photos or memes (images) to make their points.

The goal of this research paper is to build a super-smart "Town Square Observer" who can listen to these arguments, look at the pictures, and figure out exactly what each person really thinks about a specific topic.

Here is the story of how the authors built this observer, called PRISM, and why the old way of doing things wasn't good enough.

The Problem: The "Blindfolded" Observer

The researchers found that previous attempts to build these observers had two major flaws:

  1. The "One-Sided" View (Pseudo-multimodality):
    Imagine the observer could see the pictures only on the main stage (the original post), but when people in the crowd shouted back (comments), the observer had to pretend those people were just speaking text. In real life, people often reply with memes or photos too! The old observers were blind to this, missing half the story.
  2. The "Cookie-Cutter" Crowd (User Homogeneity):
    The old observers treated everyone in the crowd as if they were the same person. They didn't realize that "Grumpy Grandpa" is naturally more critical than "Optimistic Teen." If you don't know who is speaking, you might misinterpret a sarcastic joke as a serious attack, or a genuine compliment as a fake one.

The Solution: A New Map and a New Detective

To fix this, the team did two big things:

1. The New Map: U-MStance Dataset

They created a massive new library of real conversations called U-MStance.

  • What's special? It contains over 40,000 comments where both the original posts and the replies can have pictures.
  • The Twist: They also included the "history books" of the people speaking. They know what these users have posted for years, so they can build a profile of who they are.
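To make the two ingredients above concrete, here is a toy sketch of what a single U-MStance conversation record could look like. All field names and values are illustrative assumptions for this post, not the dataset's actual schema; the point is that *both* the post and the reply can carry an image, and that each user comes with a post history for persona building.

```python
# Hypothetical sketch of one conversation record. Field names are
# assumptions, not the real U-MStance schema.
record = {
    "target": "electric cars",
    "post": {"text": "EVs are the future of driving.", "image": "post_001.jpg"},
    "reply": {
        "text": "Sure, 'zero emissions'...",
        "image": "meme_042.jpg",   # key point: replies can carry images too
        "stance": "against",
    },
    "user_history": [              # longitudinal posts for persona building
        {"text": "Another recall? Classic.", "date": "2021-03-04"},
        {"text": "Nothing beats a V8.", "date": "2022-07-19"},
    ],
}

def is_truly_multimodal(rec):
    """True if both the original post AND the reply include an image,
    i.e. the setting the paper calls genuine (not pseudo-) multimodality."""
    return bool(rec["post"].get("image")) and bool(rec["reply"].get("image"))
```

A pseudo-multimodal dataset would only ever satisfy the first half of that check; U-MStance is built so records can satisfy both.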

2. The New Detective: PRISM

They built a new AI framework named PRISM (Persona-Reasoned multImodal Stance Model). Think of PRISM as a detective with three special tools:

  • Tool 1: The "Personality X-Ray" (Longitudinal User Persona)
    Before the detective listens to the current argument, they look at the speaker's history. They use a psychological framework (the "Big Five" personality traits) to build a profile.

    • Analogy: If the speaker is usually "Neurotic" (emotional and anxious), the detective knows to look for emotional triggers. If they are "Open" (adventurous), they might be more willing to try new ideas. This helps the AI understand why someone is saying what they are saying.
  • Tool 2: The "Translator" (Rationalized Cross-Modal Grounding)
    When someone posts a picture, it's not just a picture; it's a joke, a threat, or a compliment wrapped in an image.

    • Analogy: Imagine someone posts a picture of a burning house with the caption "Great weather!" A normal AI might just see "House" and "Weather." PRISM uses a "Chain of Thought" (a step-by-step reasoning process) to say: "Wait, the house is burning, but the caption is cheerful. The mismatch means they are being sarcastic. This picture is actually an insult to the situation." It translates the visual joke into a clear meaning.
  • Tool 3: The "Practice Run" (Mutual Task Reinforcement)
    To get really good at guessing what people think, PRISM practices a second game: writing the next comment.

    • Analogy: It's like a student trying to understand a math problem. If they can also teach the problem to someone else (generate a response), they understand the logic much better. By training the AI to both guess the stance and write a reply, it learns the nuances of human conversation much faster.
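The "practice run" idea in Tool 3 is, at its core, multi-task training: the model is optimized for stance classification and next-comment generation at the same time. Here is a minimal sketch of such a joint objective using toy hand-computed probabilities. The specific form (a weighted sum of two cross-entropy losses, with weight `lam`) is my assumption for illustration, not the paper's exact formulation.

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[target_idx])

def joint_loss(stance_probs, stance_label, token_probs, token_labels, lam=0.5):
    """Combine two objectives: stance classification plus next-comment
    generation (average token-level NLL). `lam` weighs the auxiliary
    generation task; this weighting scheme is an assumption."""
    l_stance = cross_entropy(stance_probs, stance_label)
    l_gen = sum(cross_entropy(p, t)
                for p, t in zip(token_probs, token_labels)) / len(token_labels)
    return l_stance + lam * l_gen

# Toy example: 3-way stance (favor/against/neutral) and a 2-token reply.
loss = joint_loss(
    stance_probs=[0.7, 0.2, 0.1], stance_label=0,
    token_probs=[[0.9, 0.1], [0.8, 0.2]], token_labels=[0, 0],
)
```

Because both losses share the same underlying model in the real framework, improving at writing plausible replies pushes the representations toward the conversational nuance that stance prediction also needs.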

The Results: Why It Matters

When they tested PRISM against other smart AI models:

  • It won. It was much better at figuring out if someone was for or against a topic, especially when sarcasm or complex images were involved.
  • It was adaptable. Even when the AI was tested on topics it had never seen before (like switching from talking about "Cars" to "Cryptocurrency"), it still performed well because it understood the people behind the words, not just the words themselves.

The Bottom Line

This paper teaches us that to truly understand human opinion online, we can't just look at the text or the pictures. We have to look at who is speaking, what they usually stand for, and how they use images to hide or reveal their true feelings. PRISM is the first tool that brings all these pieces together to give us a clear, human-like understanding of the digital crowd.