Exploring Semantic Labeling Strategies for Third-Party Cybersecurity Risk Assessment Questionnaires

This paper proposes and evaluates a hybrid semi-supervised semantic labeling pipeline that leverages clustering and Large Language Models to efficiently organize and retrieve third-party cybersecurity risk assessment questions. The authors demonstrate that such semantic labels improve retrieval alignment while significantly reducing the cost and effort of manual or direct LLM-based labeling.

Ali Nour Eldin, Mohamed Sellami, Walid Gaaloul, Julien Steunou

Published 2026-03-05

Imagine you are a security guard for a massive, chaotic library. This library doesn't hold books; it holds thousands of security questions used to check if new suppliers (like cloud companies or software vendors) are safe to do business with.

Every time you need to hire a new supplier, you have to pick the right questions from this giant pile.

  • The Problem: The pile is messy. There are no shelves, no categories, and no index. It's just a giant heap of paper.
  • The Old Way: To find the right questions, you (or a computer) have to read every single question and guess if it matches what you need. It's like trying to find a specific needle in a haystack by looking at every single piece of hay one by one. It's slow, expensive, and you often pick the wrong questions because they sound similar but mean something different.

This paper proposes a new, smarter way to organize this library. Here is the breakdown using simple analogies:

1. The Old Way: "The Keyword Search"

Previously, computers tried to find questions by looking for matching words.

  • The Flaw: If you ask for questions about "locking the front door," the computer might give you questions about "locking a digital file." They both have the word "lock," but they are totally different tasks. The computer lacks the "big picture" understanding of what the question is actually testing.

2. The New Solution: The "Smart Librarian" System (SSSL)

The authors created a system called SSSL (Semi-Supervised Semantic Labeling). Think of it as hiring a super-smart librarian who doesn't read every single book but uses a clever trick to organize the whole library quickly.

Here is how their three-step process works:

Step A: Grouping by Vibe (Clustering)

Instead of reading every question one by one, the system first groups questions that "feel" similar.

  • The Analogy: Imagine throwing all the questions into a room and asking them to form groups based on who they want to hang out with. Questions about "passwords" naturally group together; questions about "firewalls" form another group.
  • The Magic: They use a special math trick (possibilistic clustering) that allows a question to belong to multiple groups at once. A question about "locking a server" might belong to both the "Access Control" group and the "Incident Response" group. This is more flexible than forcing every question into just one box.
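
The clustering step above can be sketched in a few lines. Here a fuzzy c-means-style membership formula stands in for the paper's possibilistic method (both share the key idea: one question can belong to several groups at once). The 2-D "embeddings", cluster positions, and threshold are invented for illustration:

```python
import math

def memberships(point, centers, m=2.0):
    """Fuzzy c-means-style membership of one point in each cluster.
    A point sitting between two centers gets high membership in both."""
    dists = [max(math.dist(point, c), 1e-9) for c in centers]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exp for d_k in dists) for d_i in dists]

def assigned(u, threshold=0.3):
    """Multi-membership: keep every cluster whose membership clears a threshold."""
    return [i for i, v in enumerate(u) if v >= threshold]

# Hypothetical 2-D embeddings: centers for two topic clusters,
# say "Access Control" and "Incident Response".
centers = [(0.0, 0.0), (4.0, 0.0)]

near_first = memberships((0.5, 0.0), centers)  # clearly in cluster 0
between = memberships((2.0, 0.0), centers)     # halfway between both
```

Here `assigned(between)` returns both cluster indices, mirroring the "one question, two shelves" idea, while `assigned(near_first)` returns only the first.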

Step B: The "Smart Librarian" (The LLM)

Now, the system calls in a Large Language Model (LLM)—think of this as a very smart, but expensive, consultant.

  • The Trick: Instead of asking the consultant to label every single question (which would cost a fortune in time and money), they only ask the consultant to look at the groups created in Step A.
  • The Result: The consultant looks at the "Password Group" and says, "Okay, all these questions are about Access Control." They give the whole group a label. Because the consultant only has to do this for a few groups instead of thousands of questions, it's much cheaper and faster.
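
Step B can be sketched as one prompt per cluster instead of one per question. Everything below is hypothetical: the representative-selection rule, the prompt wording, and the sample questions are invented, and the actual LLM call is deliberately left out:

```python
def representatives(questions, membership, k=3):
    """Pick the k questions most central to a cluster, so the LLM
    labels a small sample instead of every question."""
    return sorted(questions, key=lambda q: membership[q], reverse=True)[:k]

def label_prompt(reps):
    """Build one labeling prompt for a whole cluster (wording is invented)."""
    joined = "\n".join(f"- {q}" for q in reps)
    return ("These security-questionnaire questions share one topic.\n"
            f"{joined}\n"
            "Reply with a short topic label.")

questions = ["How are passwords stored?", "Is MFA enforced?", "Who approves access?"]
membership = {"How are passwords stored?": 0.9,
              "Is MFA enforced?": 0.8,
              "Who approves access?": 0.7}

# One (stubbed) LLM call for the whole cluster:
prompt = label_prompt(representatives(questions, membership, k=2))
```

With thousands of questions but only a handful of clusters, the number of LLM calls drops from one per question to one per cluster, which is where the savings come from.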

Step C: The "Copy-Paste" (Propagation)

Once the groups are labeled, the system uses a simple rule to label the rest of the library.

  • The Analogy: If you have a new question that looks 90% like the "Password Group," the system just says, "Hey, you're in the Access Control group too!" It copies the label from the group to the new question without asking the expensive consultant again.
  • The Safety Net: If a new question is unusual and doesn't match any group well, the system admits, "I don't know," and flags it for a quick manual check instead of guessing a label.
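
The "copy-paste" rule can be sketched as nearest-centroid label copying with a confidence threshold. The threshold value, the 2-D embeddings, and the "NEEDS_REVIEW" marker are all assumptions made for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def propagate(question_vec, labeled_centroids, threshold=0.8):
    """Copy the label of the most similar labeled cluster, or flag the
    question for manual review if nothing matches well enough."""
    best_label, best_sim = None, -1.0
    for label, centroid in labeled_centroids.items():
        sim = cosine(question_vec, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else "NEEDS_REVIEW"

# Hypothetical 2-D embeddings for two already-labeled clusters.
centroids = {"Access Control": (1.0, 0.1), "Incident Response": (0.1, 1.0)}
```

Calling `propagate((0.9, 0.2), centroids)` copies the "Access Control" label, while an ambiguous vector such as `(0.7, 0.7)` falls below the threshold and is flagged for review. No further LLM calls are needed for the confident cases.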

3. Why This Matters: The "Label-Based Search"

Once the library is organized with these labels, search works very differently.

  • Before: You search for "Incident Response," and the computer gives you questions that contain those words, even if they are about the wrong type of incident.
  • After: You search for "Incident Response," and the system looks at the labels. It knows exactly which questions are tagged "Incident Response" and ignores the ones that just happen to have the word "response" in them. It's like searching a library by its Dewey Decimal number instead of just guessing by the title.
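
Label-based retrieval can be sketched as a filter on tags rather than a keyword match. The catalog entries and label names below are invented:

```python
def label_search(query_label, catalog):
    """Return only questions tagged with the requested label,
    ignoring keyword coincidences like the word 'response'."""
    return [q for q, labels in catalog if query_label in labels]

catalog = [
    ("How fast do you respond to a breach?", {"Incident Response"}),
    ("What is your helpdesk response time?", {"Support"}),  # keyword trap
    ("Who is on the incident response team?", {"Incident Response"}),
]

hits = label_search("Incident Response", catalog)
```

The helpdesk question contains the word "response" but carries the wrong label, so it is skipped; a plain keyword search would have returned it.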

The Results: Speed and Savings

The paper tested this system and found:

  1. Massive Cost Savings: By only asking the "Smart Librarian" (LLM) to label groups instead of individual questions, they cut the cost and time by about 40%.
  2. Speed: The "Copy-Paste" step (using math to guess the labels) is incredibly fast—thousands of times faster than asking the consultant.
  3. Better Accuracy: The questions selected were much more relevant to what the company actually needed to check.

The Bottom Line

This paper is about organizing the chaos. Instead of trying to read every single security question to find the right ones, they group them, get a smart AI to name the groups, and then let the math do the rest. It's a way to make cybersecurity checks faster, cheaper, and smarter, so companies can focus on fixing risks rather than just filling out paperwork.