Towards Improved Sentence Representations using Token Graphs

This paper introduces GLOT, a lightweight, structure-aware pooling module that builds and refines token-similarity graphs from frozen LLM outputs. The result is robust sentence representations with significantly fewer parameters and faster training than existing methods.

Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Zorah Lähner, Moshe Eliasof

Published 2026-03-05

The Big Problem: The "Crowded Room" Confusion

Imagine you walk into a massive, noisy party (this is a Large Language Model, or LLM). Inside, there are thousands of people (these are tokens, or words) talking at once.

To understand the "vibe" of the party, you need to summarize what's happening into a single sentence.

  • Old Method (Mean/Max Pooling): The old way of doing this is like asking a security guard to stand in the middle of the room, close their eyes, and just shout out the average noise level. Or, they might just pick the loudest person and ignore everyone else.
    • The Flaw: This ignores who is talking to whom. If someone says, "The movie was not good," the security guard might just hear "good" and think the party is great, missing the crucial "not." The relationships between words get lost in the noise.

The Solution: GLOT (The "Social Network" Approach)

The authors introduce GLOT (Graph-based Token Pooling). Instead of treating the words as a random crowd, GLOT treats them like a social network.

Here is how GLOT works, step-by-step:

1. Drawing the Map (Graph Construction)

Imagine you are a detective at that party. Instead of just listening to everyone, you start drawing lines between people who are having a conversation.

  • If two words are similar or related (like "dog" and "bark"), you draw a strong line between them.
  • If they are unrelated (like "dog" and "toaster"), you don't draw a line.
  • The Magic: You create a map (a graph) of the sentence that shows exactly who is connected to whom.
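The map-drawing step above can be sketched with cosine similarity and a cutoff: connect two tokens whenever their embeddings point in roughly the same direction. This is a minimal illustrative construction; the threshold value and the exact graph-building rule are assumptions, not the paper's implementation.

```python
import numpy as np

def build_token_graph(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Connect tokens whose embeddings have cosine similarity above a threshold."""
    # Normalize each token embedding to unit length.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    # Pairwise cosine similarity between all tokens.
    sim = unit @ unit.T
    # Keep only strong connections; drop self-loops.
    adj = (sim > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

# Toy example: four made-up "token" embeddings in 3 dimensions.
tokens = np.array([
    [1.0, 0.0, 0.0],   # "dog"
    [0.9, 0.1, 0.0],   # "bark"    (similar to "dog")
    [0.0, 1.0, 0.0],   # "toaster" (unrelated to "dog")
    [0.0, 0.9, 0.1],   # "kitchen" (similar to "toaster")
])
adj = build_token_graph(tokens, threshold=0.5)
# adj[0, 1] is 1 ("dog"–"bark" connected); adj[0, 2] is 0 ("dog"–"toaster" not).
```

With this rule, "dog" and "bark" get a line between them while "dog" and "toaster" stay disconnected, exactly as in the detective analogy.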

2. The Group Chat (Token-GNN)

Now, imagine the words on your map can pass notes to their neighbors.

  • In the old method, the word "not" sits alone.
  • In GLOT, the word "not" passes a note to "good," whispering, "Hey, flip this meaning!"
  • This happens through a Graph Neural Network (GNN). It's like a group chat where every word updates its understanding based on who it's talking to. This fixes the "not good" problem because the words actually communicate.
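The note-passing step can be sketched as a single generic GNN layer: each token mixes its own state with the average of its neighbors' states. The mean aggregation, identity weights, and ReLU here are illustrative choices, not the paper's exact architecture.

```python
import numpy as np

def gnn_layer(features: np.ndarray, adj: np.ndarray,
              w_self: np.ndarray, w_neigh: np.ndarray) -> np.ndarray:
    """One round of message passing: combine each token's own state
    with the mean of its neighbors' states, then apply ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ features) / np.clip(deg, 1.0, None)
    return np.maximum(features @ w_self + neigh_mean @ w_neigh, 0.0)

# Toy example: three tokens with one-hot features. Token 0 ("not")
# and token 1 ("good") are connected; token 2 sits alone.
feats = np.eye(3)
adj = np.array([[0., 1., 0.],
                [1., 0., 0.],
                [0., 0., 0.]])
w = np.eye(3)  # identity weights, purely for illustration
out = gnn_layer(feats, adj, w, w)
# "good" (token 1) now carries information from "not" (token 0),
# while the isolated token 2 is unchanged.
```

After the update, the "good" token's vector contains a contribution from "not", which is exactly how the group chat fixes the "not good" problem.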

3. The Final Summary (Readout)

Finally, GLOT asks the group: "Who is the most important person in this conversation?"

  • It doesn't just pick the loudest person. It looks at the group chat history and realizes, "Oh, 'genome' and 'individuals' are the key players here, not the word 'What'."
  • It creates a final summary based on these refined, connected insights.
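One common way to implement this kind of readout is attention pooling: score each refined token, softmax the scores into weights, and take the weighted average. This is a generic sketch with a hand-picked scoring vector; the paper's actual readout may differ.

```python
import numpy as np

def attention_readout(features: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Collapse token states into one sentence vector, weighting
    each token by a softmax over importance scores."""
    scores = features @ w
    scores = scores - scores.max()   # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features

# Toy example: two tokens; the scoring vector strongly favors token 0
# (think "genome" outranking "What"), so it dominates the summary.
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
w = np.array([10.0, 0.0])  # illustrative scores, not learned here
sentence_vec = attention_readout(feats, w)
# sentence_vec is almost entirely token 0's features.
```

The key difference from max pooling is that the scores are computed *after* message passing, so "importance" reflects the refined, connected representations rather than raw loudness.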

Why is this a Big Deal?

1. It's a "Superpower" for Frozen Models

Usually, to make an AI smarter at a specific task, you have to "fine-tune" it. This is like hiring a new teacher to retrain the whole school. It costs a fortune and takes forever.

  • GLOT's Trick: It works with a "frozen" model (a model that isn't being retrained). It's like taking a brilliant but rigid professor and giving them a new, smart assistant (GLOT) who organizes the notes. The professor doesn't change, but the output becomes much better.
  • The Result: It's 20 times cheaper and 100 times faster than retraining the whole model.

2. The "Needle in a Haystack" Test

The authors tested GLOT with a crazy stress test. They took a sentence with a tiny, important clue (like "The file has keys but not the lock") and buried it inside a sea of 90% random garbage words (like "banana, cloud, purple, 42...").

  • Old Methods: The old methods got completely confused by the garbage. Their accuracy crashed. They couldn't find the needle.
  • GLOT: Because GLOT draws lines between the important words, it ignores the garbage. Even with 90% noise, it still found the clue with 97% accuracy. It's like having a metal detector that only beeps for gold, ignoring the sand.
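The robustness intuition can be mimicked in a few lines: with a high similarity threshold, off-topic tokens simply fail to earn edges, so they never pass messages to the signal tokens. This is a toy demonstration with made-up vectors and a threshold of 0.9, not the paper's experimental setup.

```python
import numpy as np

# Two "signal" tokens pointing the same way, plus deterministic
# "noise" tokens pointing in other directions (stand-ins for
# garbage words like "banana" and "purple").
signal = np.array([[1.0, 0.0, 0.0],
                   [0.95, 0.05, 0.0]])
noise = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, -1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
tokens = np.vstack([signal, noise])

# Cosine-similarity graph with a high threshold (illustrative choice).
unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
sim = unit @ unit.T
adj = (sim > 0.9).astype(float)
np.fill_diagonal(adj, 0.0)

degree = adj.sum(axis=1)
# The two signal tokens connect to each other (degree 1 each),
# while every noise token ends up isolated (degree 0).
```

Because the noise tokens are disconnected, they contribute nothing during message passing, which is the "metal detector that only beeps for gold" effect in miniature.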

3. It Works on Any Model

Whether you are using a small, efficient model or a giant, powerful one (like Mistral-7B or LLaMA), GLOT makes them better at understanding sentences without needing to change the model itself.

The Bottom Line

Think of GLOT as a smart translator that sits between a raw, powerful AI and the real world.

  • Before: The AI spoke in a jumble of disconnected words.
  • After: GLOT connects the dots, understands the context, filters out the noise, and gives you a clear, accurate summary.

It proves that you don't need to rebuild the engine to make the car go faster; sometimes, you just need a better navigation system to understand the map.
