Exploring Semantic Labeling Strategies for Third-Party Cybersecurity Risk Assessment Questionnaires

This paper proposes and evaluates a hybrid semi-supervised semantic labeling pipeline that leverages clustering and Large Language Models to efficiently organize and retrieve third-party cybersecurity risk assessment questions. The authors demonstrate that such semantic labels improve retrieval alignment while significantly reducing the cost and effort of manual or direct LLM-based labeling.

Ali Nour Eldin, Mohamed Sellami, Walid Gaaloul, Julien Steunou

Published 2026-03-05

Imagine you are a security guard for a massive, chaotic library. This library doesn't hold books; it holds thousands of security questions used to check if new suppliers (like cloud companies or software vendors) are safe to do business with.

Every time you need to hire a new supplier, you have to pick the right questions from this giant pile.

  • The Problem: The pile is messy. There are no shelves, no categories, and no index. It's just a giant heap of paper.
  • The Old Way: To find the right questions, you (or a computer) have to read every single question and guess if it matches what you need. It's like trying to find a specific needle in a haystack by looking at every single piece of hay one by one. It's slow, expensive, and you often pick the wrong questions because they sound similar but mean something different.

This paper proposes a new, smarter way to organize this library. Here is the breakdown using simple analogies:

1. The Old Way: "The Keyword Search"

Previously, computers tried to find questions by looking for matching words.

  • The Flaw: If you ask for questions about "locking the front door," the computer might give you questions about "locking a digital file." They both have the word "lock," but they are totally different tasks. The computer lacks the "big picture" understanding of what the question is actually testing.

2. The New Solution: The "Smart Librarian" System (SSSL)

The authors created a system called SSSL (Semi-Supervised Semantic Labeling). Think of it as hiring a super-smart librarian who doesn't read every single book but uses a clever trick to organize the whole library quickly.

Here is how their three-step process works:

Step A: Grouping by Vibe (Clustering)

Instead of reading every question one by one, the system first groups questions that "feel" similar.

  • The Analogy: Imagine throwing all the questions into a room and asking them to form groups based on who they want to hang out with. Questions about "passwords" naturally group together; questions about "firewalls" form another group.
  • The Magic: They use a special math trick (possibilistic clustering) that allows a question to belong to multiple groups at once. A question about "locking a server" might belong to both the "Access Control" group and the "Incident Response" group. This is more flexible than forcing every question into just one box.
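
The clustering step above can be sketched in a few lines. Here a fuzzy c-means-style membership formula stands in for the paper's possibilistic method (both share the key idea: one question can belong to several groups at once). The 2-D "embeddings", cluster positions, and threshold are invented for illustration:

```python
import math

def memberships(point, centers, m=2.0):
    """Fuzzy c-means-style membership of one point in each cluster.
    A point sitting between two centers gets high membership in both."""
    dists = [max(math.dist(point, c), 1e-9) for c in centers]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exp for d_k in dists) for d_i in dists]

def assigned(u, threshold=0.3):
    """Multi-membership: keep every cluster whose membership clears a threshold."""
    return [i for i, v in enumerate(u) if v >= threshold]

# Hypothetical 2-D embeddings: centers for two topic clusters,
# say "Access Control" and "Incident Response".
centers = [(0.0, 0.0), (4.0, 0.0)]

near_first = memberships((0.5, 0.0), centers)  # clearly in cluster 0
between = memberships((2.0, 0.0), centers)     # halfway between both
```

Here `assigned(between)` returns both cluster indices, mirroring the "one question, two shelves" idea, while `assigned(near_first)` returns only the first.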

Step B: The "Smart Librarian" (The LLM)

Now, the system calls in a Large Language Model (LLM)—think of this as a very smart, but expensive, consultant.

  • The Trick: Instead of asking the consultant to label every single question (which would cost a fortune in time and money), they only ask the consultant to look at the groups created in Step A.
  • The Result: The consultant looks at the "Password Group" and says, "Okay, all these questions are about Access Control." They give the whole group a label. Because the consultant only has to do this for a few groups instead of thousands of questions, it's much cheaper and faster.
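
Step B can be sketched as one prompt per cluster instead of one per question. Everything below is hypothetical: the representative-selection rule, the prompt wording, and the sample questions are invented, and the actual LLM call is deliberately left out:

```python
def representatives(questions, membership, k=3):
    """Pick the k questions most central to a cluster, so the LLM
    labels a small sample instead of every question."""
    return sorted(questions, key=lambda q: membership[q], reverse=True)[:k]

def label_prompt(reps):
    """Build one labeling prompt for a whole cluster (wording is invented)."""
    joined = "\n".join(f"- {q}" for q in reps)
    return ("These security-questionnaire questions share one topic.\n"
            f"{joined}\n"
            "Reply with a short topic label.")

questions = ["How are passwords stored?", "Is MFA enforced?", "Who approves access?"]
membership = {"How are passwords stored?": 0.9,
              "Is MFA enforced?": 0.8,
              "Who approves access?": 0.7}

# One (stubbed) LLM call for the whole cluster:
prompt = label_prompt(representatives(questions, membership, k=2))
```

With thousands of questions but only a handful of clusters, the number of LLM calls drops from one per question to one per cluster, which is where the savings come from.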

Step C: The "Copy-Paste" (Propagation)

Once the groups are labeled, the system uses a simple rule to label the rest of the library.

  • The Analogy: If you have a new question that looks 90% like the "Password Group," the system just says, "Hey, you're in the Access Control group too!" It copies the label from the group to the new question without asking the expensive consultant again.
  • The Safety Net: If a new question is unusual and doesn't match any group well, the system admits, "I don't know," and flags it for a quick manual check instead of guessing a label.
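
The "copy-paste" rule can be sketched as nearest-centroid label copying with a confidence threshold. The threshold value, the 2-D embeddings, and the "NEEDS_REVIEW" marker are all assumptions made for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def propagate(question_vec, labeled_centroids, threshold=0.8):
    """Copy the label of the most similar labeled cluster, or flag the
    question for manual review if nothing matches well enough."""
    best_label, best_sim = None, -1.0
    for label, centroid in labeled_centroids.items():
        sim = cosine(question_vec, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else "NEEDS_REVIEW"

# Hypothetical 2-D embeddings for two already-labeled clusters.
centroids = {"Access Control": (1.0, 0.1), "Incident Response": (0.1, 1.0)}
```

Calling `propagate((0.9, 0.2), centroids)` copies the "Access Control" label, while an ambiguous vector such as `(0.7, 0.7)` falls below the threshold and is flagged for review. No further LLM calls are needed for the confident cases.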

3. Why This Matters: The "Label-Based Search"

Once the library is organized with these labels, search works very differently.

  • Before: You search for "Incident Response," and the computer gives you questions that contain those words, even if they are about the wrong type of incident.
  • After: You search for "Incident Response," and the system looks at the labels. It knows exactly which questions are tagged "Incident Response" and ignores the ones that just happen to have the word "response" in them. It's like searching a library by its Dewey Decimal number instead of just guessing by the title.
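
Label-based retrieval can be sketched as a filter on tags rather than a keyword match. The catalog entries and label names below are invented:

```python
def label_search(query_label, catalog):
    """Return only questions tagged with the requested label,
    ignoring keyword coincidences like the word 'response'."""
    return [q for q, labels in catalog if query_label in labels]

catalog = [
    ("How fast do you respond to a breach?", {"Incident Response"}),
    ("What is your helpdesk response time?", {"Support"}),  # keyword trap
    ("Who is on the incident response team?", {"Incident Response"}),
]

hits = label_search("Incident Response", catalog)
```

The helpdesk question contains the word "response" but carries the wrong label, so it is skipped; a plain keyword search would have returned it.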

The Results: Speed and Savings

The paper tested this system and found:

  1. Massive Cost Savings: By only asking the "Smart Librarian" (LLM) to label groups instead of individual questions, they cut the cost and time by about 40%.
  2. Speed: The "Copy-Paste" step (using math to guess the labels) is incredibly fast—thousands of times faster than asking the consultant.
  3. Better Accuracy: The questions selected were much more relevant to what the company actually needed to check.

The Bottom Line

This paper is about organizing the chaos. Instead of trying to read every single security question to find the right ones, they group them, get a smart AI to name the groups, and then let the math do the rest. It's a way to make cybersecurity checks faster, cheaper, and smarter, so companies can focus on fixing risks rather than just filling out paperwork.