A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

This paper proposes a two-stage architecture combining LLaMA-3.1-8B-Instruct for NDA segmentation and a fine-tuned Legal-Roberta-Large for clause classification, achieving high precision (ROUGE F1 of 0.95 and weighted F1 of 0.85) to automate the analysis of complex Non-Disclosure Agreements.

Ana Begnini, Matheus Vicente, Leonardo Souza

Published Thu, 12 Ma

Imagine you are a lawyer, but instead of reviewing one contract a day, you are drowning in a sea of Non-Disclosure Agreements (NDAs). These are the legal "handshakes" companies sign to promise they won't spill each other's secrets.

The problem? Every company writes these agreements differently. Some are short and sweet; others are 50 pages of dense, confusing legal jargon with no clear structure. Reading them manually is slow, boring, and easy to mess up. You might miss a tiny clause that says, "Oh, by the way, we own your idea now," buried in paragraph 42.

This paper proposes a two-stage robot team to do the heavy lifting for you. Think of it as a high-tech assembly line for legal documents.

🏗️ The Two-Stage Assembly Line

The system is built like a relay race with two runners: The Segmenter and The Classifier.

Stage 1: The Segmenter (The "Book Binder")

The Problem: An NDA is like a giant, uncut block of marble. It's all one big chunk of text. You can't analyze it until you chop it into manageable pieces (clauses). But legal text is messy; sometimes a clause is one sentence, sometimes it's three paragraphs, and sometimes the formatting is a nightmare.

The Solution: The team uses an instruction-tuned large language model, LLaMA-3.1-8B-Instruct (think of it as a very well-read, fast-reading librarian).

  • How it works: You feed the whole NDA to the librarian. The librarian reads the whole thing and says, "Okay, here is Clause 1, here is Clause 2, here is Clause 3..."
  • The Magic Trick: The researchers taught this librarian to ignore the messy formatting (like weird columns or tables) and focus purely on the meaning to find where one idea ends and the next begins.
  • The Result: It's incredibly accurate. Measured with ROUGE F1 (a text-overlap score), segmentation reached 0.95 — the clauses it cut out almost exactly matched the reference clauses. It's like a master chef slicing a loaf of bread so perfectly that every slice is exactly the right size, even if the loaf was shaped weirdly.
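In code, that "feed the librarian, read back the clauses" pattern might look like the sketch below. The prompt wording, the `CLAUSE:` line prefix, and the helper names are illustrative assumptions, not the paper's exact setup.

```python
def build_segmentation_prompt(nda_text: str) -> str:
    """Ask the model to split an NDA into clauses by meaning, not layout."""
    return (
        "Split the following NDA into its individual clauses. "
        "Ignore formatting artifacts such as columns, tables, and line "
        "breaks; split purely on meaning. Return one clause per line, "
        "each prefixed with 'CLAUSE:'.\n\n" + nda_text
    )

def parse_clauses(model_output: str) -> list[str]:
    """Recover the clause list from the model's line-prefixed response."""
    return [
        line[len("CLAUSE:"):].strip()
        for line in model_output.splitlines()
        if line.startswith("CLAUSE:")
    ]

# A mock response, standing in for a real LLaMA-3.1-8B-Instruct call:
response = (
    "CLAUSE: The Receiving Party shall keep all Confidential Information secret.\n"
    "CLAUSE: This Agreement is governed by the laws of the State of Delaware."
)
print(len(parse_clauses(response)))  # 2
```

In the real pipeline the `response` string would come from the LLM; everything downstream only needs the clean list of clauses.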

Stage 2: The Classifier (The "Labeling Expert")

The Problem: Now that you have the individual clauses, you need to know what kind of clause they are. Is this the "Confidentiality" clause? The "Governing Law" clause? The "Who gets fired if we break the rules" clause?

The Solution: The team uses a specialized AI, a fine-tuned Legal-Roberta-Large (think of this as a legal scholar who has read thousands of contracts and memorized the patterns).

  • How it works: The Segmenter hands the chopped-up clauses to the Scholar. The Scholar reads a clause and shouts, "This is a Confidentiality clause!" or "This is a Termination clause!"
  • The Challenge: Some clauses are tricky. One paragraph might be about both "Confidentiality" and "Liability." It's a multi-label problem (like a song that is both "Rock" and "Jazz"). Also, some types of clauses are rare (like "Competition Rights"), making it hard for the robot to learn them.
  • The Result: The Scholar is very good at spotting the common clauses (a weighted F1 of 0.85 overall). It's like a seasoned detective who can instantly spot the most common types of clues, even if they sometimes miss the rare, obscure ones.
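The multi-label trick boils down to a simple decision rule: instead of picking a single winning category, each label gets its own independent sigmoid score, so one clause can be "Rock" and "Jazz" at once. The label set, logits, and threshold below are toy values for illustration; the real model is a fine-tuned transformer, not this hand-rolled function.

```python
import math

# Hypothetical label set; the paper's exact clause taxonomy may differ.
LABELS = ["Confidentiality", "Termination", "Governing Law", "Liability"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Multi-label decision: each label is an independent yes/no question,
    so a single clause can receive several labels at once."""
    return [
        label
        for label, logit in zip(LABELS, logits)
        if sigmoid(logit) >= threshold
    ]

# Toy logits for a clause touching both confidentiality and liability:
print(predict_labels([2.1, -1.5, -0.8, 1.3]))
# ['Confidentiality', 'Liability']
```

This is also why rare clause types are hard: the "yes" decision for a label the model rarely saw during fine-tuning tends to stay below the threshold.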

🧪 How Did They Test It?

The researchers didn't just guess; they put the system through a rigorous exam.

  1. The Dataset: They used 322 real NDAs from a public database. These were messy, real-world documents from different companies, not just clean, perfect examples.
  2. The "Needle in a Haystack" Check: To make sure the Segmenter didn't just guess the right number of clauses but actually got the right text, they used a classic sequence-alignment algorithm (Needleman-Wunsch). Imagine trying to match two slightly different versions of a story; this algorithm aligns them word-for-word to see how much they match.
  3. The Score:
    • Segmentation: The Segmenter scored a ROUGE F1 of 0.95. It basically cut the documents perfectly.
    • Classification: The Classifier reached a weighted F1 of 0.85, strongest on the most common clauses. It's not perfect yet (because legal language is tricky and some clauses are rare), but it's good enough to save lawyers hours of work.
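The alignment check described above can be sketched as the textbook Needleman-Wunsch dynamic program over word tokens. The +1/-1/-1 scoring scheme below is the classic default and an assumption here; the paper's exact scoring may differ.

```python
def needleman_wunsch(a: list[str], b: list[str],
                     match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score between two token sequences (classic DP)."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning the first i tokens of a with the first j of b
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap          # a-tokens aligned to gaps
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap          # b-tokens aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,               # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap, # gap in b
                           dp[i][j - 1] + gap) # gap in a
    return dp[n][m]

gold = "the receiving party shall keep the information confidential".split()
pred = "the receiving party keeps the information confidential".split()
print(needleman_wunsch(gold, pred))  # 4: six matches, one gap, one mismatch
```

The higher the score relative to the sequence lengths, the closer the predicted clause text is to the gold segmentation, which is exactly the "word-for-word" check the researchers needed.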

🚀 Why Does This Matter?

Currently, lawyers spend hours reading these documents, risking burnout and human error.

  • Before: A lawyer reads a 30-page NDA, squinting at the screen, hoping they don't miss a hidden trap.
  • After: The robot reads the 30 pages in seconds, chops it up, and hands the lawyer a neat list: "Here are the 14 clauses. Clause 3 is about money, Clause 7 is about secrecy, and Clause 12 is about who owns the IP. You just need to double-check these three."

🔮 What's Next?

The authors admit the system isn't perfect yet. It struggles a bit with the rare, weird clauses because there weren't enough examples to teach it.

  • Future Plan: They want to teach the robot to not just find and label the clauses, but to rewrite them. Imagine a robot that says, "Hey, this clause is too vague. Here is a better, safer version of it."

In a nutshell: This paper builds a smart assistant that turns a chaotic pile of legal paperwork into a clean, organized, and labeled file, letting human lawyers focus on the big decisions instead of the boring reading.