The Big Question: Is the Bias Built-In or Learned?

Imagine you are hiring a librarian to find specific facts inside a massive library of books. You notice a strange problem: this librarian is terrible at finding information if it's located in the middle or at the very end of a book. They almost always find the answer if it's on the first page, but if the answer is on page 500, they often miss it entirely.

This is called Position Bias. For a long time, researchers thought this bias was "hardwired" into the librarian's brain (the computer model's architecture), like a physical limitation of their eyes or ears. They thought, "Oh, the librarian just can't see past the first page."

This paper asks a different question: What if the librarian isn't born with this bad habit? What if they just learned it from the books they were trained on?

The Experiment: Training the Librarian

To test this, the researchers created a special training camp for eight different types of librarians (computer models). These librarians had different "brain structures" (some were encoders, some were decoders, some used different math tricks), so they should have had different natural tendencies.

The researchers set up four distinct training scenarios using synthetic data:

The "Start-Only" Camp: They only showed the librarian questions where the answer was at the very beginning of the text.
The "Middle-Only" Camp: They only showed questions where the answer was in the middle.
The "End-Only" Camp: They only showed questions where the answer was at the very end.
The "Balanced" Camp: They showed a mix of all three, so the librarian learned that answers could be anywhere.

The Results: The Librarian Copies the Teacher

The results were surprising and very clear. Rather than sticking to the tendencies of their "natural" brain structures, the librarians largely adopted the habits of their training camp.

The "Start-Only" Librarians became obsessed with the beginning of the text. If the answer was there, they were great. If it was at the end, they failed miserably.
The "End-Only" Librarians flipped the script. They ignored the beginning and became experts at finding answers at the very end of the document.
The "Middle-Only" Librarians learned to look specifically in the middle.

The Analogy: Imagine you teach a dog to sit only when you stand on the left side of the room. If you then move to the right side and say "Sit," the dog won't do it. The dog isn't "bad" at sitting; it just learned that "Sit" only happens on the left. Similarly, these AI models learned that "Relevant Information" only exists where the training data told them to look.

Even the librarians who started with a slight natural preference (like a slight tendency to look at the start) shifted their behavior to follow the training data.

The Solution: The "Balanced" Diet

The paper also tested what happens if you feed the librarian a balanced diet (the "Balanced Camp").

The Result: When trained on a mix of beginning, middle, and end examples, the librarians became much more reliable. They stopped ignoring parts of the book.
The Trade-off: Did this make them worse overall? No. They remained about as good at finding answers as the biased ones, but without the "blind spots." They could find the answer whether it was on page 1 or page 500.

Why This Matters

The paper concludes that Position Bias is not a permanent flaw in the machine's design. It is a learned habit from the data it was fed.

The Problem: Many real-world datasets (like news articles or search logs) naturally put the most important info at the start. If you train an AI on this, it learns to ignore the rest of the document.
The Fix: You don't need to rebuild the AI's brain or change its complex math. You just need to curate your training data better. By ensuring the AI sees examples where the answer is in the middle and at the end, you can "unlearn" the bias and create a more robust, fair retriever.

In short: The bias isn't built-in; it's learned. And just like a student can unlearn bad study habits if you give them the right practice problems, these AI models can unlearn position bias if you give them balanced training data.

Technical Summary: Position Bias in Dense Retrievers

Problem Statement

Dense retrievers, which are central to open-domain question answering and retrieval-augmented generation (RAG), exhibit a systematic positional bias. They disproportionately favor documents where query-relevant information appears near the beginning, leading to significant performance degradation when relevant evidence is located in the middle or end of a document.

While prior research has empirically observed this bias across various training stages and positional encodings, the underlying cause remains unclear. Previous explanations have focused on architectural factors, such as causal attention in autoregressive models or specific pooling-token attention patterns. However, encoder-based dense retrievers lack causal masking yet still exhibit strong "primacy bias," suggesting that architecture alone cannot fully explain the phenomenon. A critical gap exists in understanding the extent to which the positional distribution of fine-tuning data shapes this bias, as prior work has largely relied on observation rather than direct manipulation of training data distributions.

Methodology

To isolate the effect of training data on retrieval-level position bias, the authors constructed a controlled experimental framework involving synthetic, position-targeted datasets and diverse model architectures.

1. Position-Controlled Data Construction

The authors developed a multi-step pipeline to generate training data where the location of query-relevant evidence is strictly controlled:

Corpus Preparation: Using English Wikipedia, documents were stratified by length into five bins spanning 256–8192 characters, and each document was divided into three equal segments: beginning, middle, and end.
Position-Targeted Query Generation: Using GPT-4o-mini with persona-conditioned prompting, queries were generated to be answerable only by a specific target segment (begin, middle, or end).
Multi-Reranker Verification: To ensure the generated queries were truly exclusive to the target segment, a panel of three cross-encoder rerankers (BGE, GTE, Jina) verified candidates. A candidate was retained only if all rerankers scored the target segment at least $\delta=0.3$ higher than the strongest non-target segment.
Balanced Sampling: The resulting retained pool was naturally skewed toward the beginning. To create controlled training sets, the authors downsampled within length-position cells to ensure equal representation of length bins and target positions for specific experimental configurations.
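
The segmentation and verification rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the character-based split, and the example scores are assumptions, and real reranker scores would come from the BGE, GTE, and Jina cross-encoders rather than hand-written numbers.

```python
from typing import Dict, List

DELTA = 0.3  # margin threshold taken from the text


def split_into_thirds(text: str) -> List[str]:
    """Split a document into three contiguous, near-equal segments.

    Splitting by character count is an assumption made for this sketch.
    """
    n = len(text)
    cut1, cut2 = n // 3, (2 * n) // 3
    return [text[:cut1], text[cut1:cut2], text[cut2:]]


def passes_verification(
    scores_by_reranker: Dict[str, List[float]],
    target: int,
    delta: float = DELTA,
) -> bool:
    """Return True only if EVERY reranker scores the target segment at
    least `delta` above the strongest non-target segment.

    scores_by_reranker maps a (hypothetical) reranker name to its three
    segment scores, ordered [begin, middle, end].
    """
    for scores in scores_by_reranker.values():
        best_other = max(s for i, s in enumerate(scores) if i != target)
        if scores[target] - best_other < delta:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical scores for a query meant to target the first segment.
    candidate = {
        "bge": [0.91, 0.40, 0.22],
        "gte": [0.85, 0.30, 0.10],
        "jina": [0.88, 0.52, 0.31],
    }
    print(passes_verification(candidate, target=0))
```

Requiring agreement from all rerankers makes the filter conservative: a single reranker that sees the answer elsewhere is enough to discard the query.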

2. Experimental Design

The study fine-tuned eight architecturally diverse pretrained models (including BERT, Longformer, ModernBERT, GPT-2, BLOOM, TinyLlama, and Qwen3) under four distinct training configurations:

Concentrated Configurations: Training data where 100% of queries targeted the beginning (MB), middle (MM), or end (ME) of documents.
Uniform Configuration (MU): Training data where queries were evenly distributed across all three positions.
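
One way to picture how the four configurations could be drawn from a verified pool is sketched below. The dictionary keys (`length_bin`, `position`), the `per_cell` budget, and the choice to match total training-set size across configurations are all assumptions for illustration; the source states only that sampling was equalized within length-position cells.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Which target positions each configuration draws from.
CONFIG_TARGETS = {
    "MB": ("begin",),
    "MM": ("middle",),
    "ME": ("end",),
    "MU": ("begin", "middle", "end"),
}


def build_config(pool: List[Dict], name: str, per_cell: int, seed: int = 0) -> List[Dict]:
    """Sample a training set for one configuration.

    Each example is a dict with hypothetical keys 'length_bin' and
    'position'. Concentrated configs take `per_cell` examples from every
    length bin of a single position; MU takes `per_cell // 3` from every
    length bin of each position, so all configs end up the same size
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    targets = CONFIG_TARGETS[name]
    take = per_cell // len(targets)

    # Group the pool into (length_bin, position) cells.
    cells = defaultdict(list)
    for ex in pool:
        cells[(ex["length_bin"], ex["position"])].append(ex)

    out = []
    for length_bin, position in sorted(cells):
        if position in targets:
            # random.sample raises ValueError if a cell is too small.
            out.extend(rng.sample(cells[(length_bin, position)], take))
    rng.shuffle(out)
    return out
```

Sampling the same number of examples from every length bin keeps document length from being confounded with target position.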

The models were evaluated on:

Position-Aware Benchmarks: SQuAD-PosQ, FineWeb-PosQ, and PosIR, which allow for performance measurement based on the specific location of evidence.
Standard Retrieval Benchmarks: Four BEIR subsets (SciFact, HotpotQA, FEVER, Climate-FEVER) to assess performance under conventional settings where evidence location is not controlled.
Representation Analysis: Cosine similarity analyses between query-document pairs and document segment embeddings to determine if bias exists at the embedding level.
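
The representation analysis can be illustrated with a small similarity-profile sketch. Comparing a full-document embedding against the embedding of each segment is one plausible reading of the analysis, not a confirmed description of the authors' protocol, and the toy vectors below stand in for real model embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_profile(
    doc_embedding: Sequence[float],
    segment_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Similarity of the full-document embedding to each segment embedding,
    in document order (begin, middle, end)."""
    return [cosine(doc_embedding, seg) for seg in segment_embeddings]


if __name__ == "__main__":
    # Toy 2-D vectors: the document embedding points in the same
    # direction as the first segment, mimicking a begin-skewed encoder.
    doc = [1.0, 0.0]
    segments = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(similarity_profile(doc, segments))
```

A profile that falls off from the first segment to the last would indicate that the document embedding is dominated by early content; a flat profile would indicate no positional preference at the embedding level.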

Key Results

1. Training Distribution Dictates Bias Direction

The primary finding is that retrieval-level position bias follows the training data distribution, regardless of the model's architecture.

Models trained on begin-skewed data (MB) consistently favored early evidence.
Models trained on middle-skewed data (MM) favored middle evidence.
Models trained on end-skewed data (ME) favored later evidence.
This directional shift occurred across all eight models, including those with different positional encodings (APE, RoPE, ALiBi, NoPE) and pooling strategies (CLS, Mean, Last-token).

2. Mitigation via Balanced Training

Position-balanced training (MU) significantly reduced positional sensitivity without sacrificing retrieval performance.

For all models, balanced training reduced the Position Sensitivity Index (PSI) on position-aware benchmarks by 57–87% relative to the worst skewed configuration.
For example, on SQuAD-PosQ, the PSI for GPT-2-medium dropped from 0.592 (begin-trained) to 0.080 (uniformly trained).
Crucially, the uniformly trained models maintained competitive mean retrieval performance (nDCG@10), often achieving the highest or near-highest scores across benchmarks. This indicates that reducing bias does not require a trade-off in overall retrieval quality.
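
The arithmetic behind these figures can be checked with a short sketch. The summary does not define PSI, so the formula below (one minus the ratio of the worst to the best per-position score) is an assumption based on how the index is commonly defined in position-bias evaluations; the relative-reduction calculation uses only the two GPT-2-medium values quoted above.

```python
from typing import Sequence


def position_sensitivity_index(scores_by_position: Sequence[float]) -> float:
    """ASSUMED definition: PSI = 1 - min(score) / max(score), where the
    scores are a retrieval metric (e.g. nDCG@10) computed separately for
    each evidence position. 0 means position-invariant; values near 1
    mean performance collapses at some position.
    """
    best = max(scores_by_position)
    worst = min(scores_by_position)
    if best == 0:
        return 0.0
    return 1.0 - worst / best


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    # PSI values quoted in the text for GPT-2-medium on SQuAD-PosQ.
    drop = relative_reduction(0.592, 0.080)
    print(f"{drop:.1%}")
```

For the quoted pair, (0.592 - 0.080) / 0.592 is roughly 0.865, which sits inside the reported 57–87% range.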

3. Representation-Level Shifts

Analysis of document embeddings revealed that fine-tuning reshapes learned positional preferences:

Pretrained base models showed only mild, model-specific initial tendencies (e.g., slight primacy in encoders, recency in some decoders).
After fine-tuning, the similarity profiles of document segments shifted to align with the training distribution. For instance, begin-trained models showed higher similarity to the first segment, while end-trained models showed higher similarity to the final segments.
Uniform training compressed these profiles, resulting in flatter similarity curves across positions.

4. Benchmark Specificity

The study observed that standard benchmark scores (e.g., BEIR) can be misleading regarding robustness. Benchmarks with evidence heavily concentrated at the beginning (like FEVER) favored begin-trained models, masking their lack of robustness to evidence appearing elsewhere. Conversely, models trained on balanced data performed more consistently across different evidence locations.
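
A quick audit of where evidence falls in a dataset can reveal this kind of skew before training or evaluation. The sketch below is illustrative only: it assumes each example provides character offsets for the evidence span and the document length, and it assigns the span to a third of the document by its midpoint, which is a simplification chosen for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

POSITIONS = ("begin", "middle", "end")


def evidence_third(evidence_start: int, evidence_end: int, doc_length: int) -> str:
    """Assign an evidence span to a third of the document by its midpoint.

    Offsets are in characters; doc_length must be positive.
    """
    fraction = ((evidence_start + evidence_end) / 2) / doc_length
    if fraction < 1 / 3:
        return "begin"
    if fraction < 2 / 3:
        return "middle"
    return "end"


def position_distribution(examples: Iterable[Tuple[int, int, int]]) -> Dict[str, float]:
    """Fraction of examples whose evidence falls in each third.

    Each example is (evidence_start, evidence_end, doc_length).
    Expects at least one example.
    """
    counts = Counter(evidence_third(*ex) for ex in examples)
    total = sum(counts.values())
    return {pos: counts.get(pos, 0) / total for pos in POSITIONS}


if __name__ == "__main__":
    # Hypothetical spans: two near the start, one mid-document, one at the end.
    sample = [(0, 10, 100), (5, 20, 100), (40, 60, 100), (90, 100, 100)]
    print(position_distribution(sample))
```

A distribution far from one third per position suggests that scores on that dataset mostly reflect performance at the dominant position.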

Significance and Claims

The paper claims to identify training-position distribution as a major controllable factor in retrieval-level position bias, challenging the notion that this bias is an inherent, unchangeable property of dense retriever architectures.

Causal Evidence: By directly manipulating the positional distribution of training data, the authors provide direct evidence that data curation drives the direction of bias, rather than just architecture or pretraining.
Practical Mitigation: The study proposes balanced data curation as a practical and effective strategy to mitigate position bias. It demonstrates that simply ensuring query-relevant evidence is distributed evenly across document positions during fine-tuning can produce models that are robust to evidence location while maintaining high retrieval performance.
Architectural Independence: The findings suggest that architectural factors (such as positional encodings or pooling strategies) are not the sole determinants of bias; even models with fundamentally different positional processing mechanisms can be steered toward specific bias patterns through training data.

The authors conclude that while pre-existing architectural or pretraining tendencies persist in some models, the retrieval-level bias direction is largely malleable and can be redirected through controlled training data distributions.

Is Position Bias in Dense Retrievers Built In or Learned from Data?