Imagine you are the captain of a ship (a software team) preparing for a voyage (a sprint). Before you set sail, you need to know how much fuel (effort) each task will take. In the world of software, developers don't measure this in "hours" because that's too rigid. Instead, they use Story Points, which are like a fuzzy, relative measure of "how hard this feels compared to that."
Usually, the whole team sits in a circle, plays a game called "Planning Poker," and argues until they agree on a number for every task. It's fun, but it takes a long time and depends heavily on who is in the room.
This paper asks a big question: Can a super-smart AI (a Large Language Model or LLM) look at a task description and guess the "Story Points" for us, saving us all that time?
Here is the story of what they found, explained with some everyday analogies.
1. The "Zero-Shot" Test: The Expert Who Never Met You
The Question: Can an AI guess the effort for your specific project without ever seeing your project's history?
The Analogy: Imagine hiring a world-famous chef who has never cooked in your kitchen. You hand them a recipe for "Spicy Tacos" and ask, "How hard is this to make?"
- The Old Way (Machine Learning): You'd have to show the chef 1,000 photos of your tacos and how long you took to make them before they could guess correctly.
- The New Way (LLM Zero-Shot): You just ask the chef. Surprisingly, the AI chef's guesses were often better than those of a chef who had studied 80% of your past recipes (that is, a traditional model trained on most of your project's history)!
The Result: The AI models (like Kimi and DeepSeek) were surprisingly good at guessing the difficulty just by reading the description, even without any training data. They understood the "vibe" of the task.
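In code, the zero-shot setup is nothing more than a prompt with no project history in it. Here is a minimal sketch; the wording and the Fibonacci scale are illustrative assumptions, not the paper's exact prompt, and the resulting string would be sent to whichever chat model you use:

```python
# Minimal sketch of a zero-shot story-point prompt.
# The scale and wording are illustrative; feed the string to
# whatever chat model API you have available.

FIBONACCI_SCALE = [1, 2, 3, 5, 8, 13, 21]

def build_zero_shot_prompt(title: str, description: str) -> str:
    """Ask for a single story-point value with no project history."""
    return (
        "You are an experienced agile developer.\n"
        f"Estimate the story points for this issue on the scale {FIBONACCI_SCALE}.\n"
        f"Title: {title}\n"
        f"Description: {description}\n"
        "Answer with a single number from the scale and nothing else."
    )

prompt = build_zero_shot_prompt(
    "Add login rate limiting",
    "Block more than 5 failed login attempts per minute per IP.",
)
print(prompt)
```

Note that the model never sees any of your team's past estimates: the "chef" is working from the recipe alone.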
2. The "Few-Shot" Test: Giving the AI a Cheat Sheet
The Question: What if we give the AI just five examples of tasks you've already finished, along with the points you assigned them?
The Analogy: You tell the chef, "Hey, remember that 'Spicy Taco' we made? It took 5 points. And that 'Giant Burrito'? That was 8 points. Now, look at this new 'Enchilada'—how many points?"
The Result:
- Magic Happens: Giving the AI just five examples made it much smarter. It learned your team's specific "scale."
- The Strategy Matters: The researchers tried two ways to pick those five examples:
  - The "Most Common" Strategy: Pick the five most typical tasks, which skew easy because your team usually does easy stuff. (Bad idea: the AI never sees a hard example, so it gets confused by hard tasks.)
  - The "Full Range" Strategy: Pick one easy, one medium, one hard, one very hard, and one super-hard task. (Good idea: this gave the AI a ruler to measure against.)
- Winner: The "Full Range" strategy worked best. It's like giving the AI a ruler with marks for 1, 5, and 10, rather than just showing it a pile of 1s.
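The two selection strategies above can be sketched in a few lines of Python. The task list and helper names here are illustrative, not the paper's code:

```python
# Sketch of the two few-shot selection strategies: "most common"
# (tasks whose point values appear most often in the history) vs.
# "full range" (one task per difficulty band across the scale).
# The history data is made up for illustration.
from collections import Counter

history = [
    ("Fix a typo in the footer", 1),
    ("Correct a broken link", 1),
    ("Restyle the login button", 2),
    ("Add a validation rule to the signup form", 3),
    ("Expose a new REST endpoint", 5),
    ("Migrate the user table schema", 8),
    ("Rewrite the payment integration", 13),
]

def most_common(tasks, k=5):
    """Pick k tasks whose point values occur most often in the history."""
    freq = Counter(points for _, points in tasks)
    return sorted(tasks, key=lambda t: -freq[t[1]])[:k]

def full_range(tasks, k=5):
    """Spread k examples evenly from the easiest task to the hardest."""
    ordered = sorted(tasks, key=lambda t: t[1])
    idx = [round(i * (len(ordered) - 1) / (k - 1)) for i in range(k)]
    return [ordered[i] for i in idx]

print([p for _, p in most_common(history)])  # skews toward easy tasks
print([p for _, p in full_range(history)])   # spans the whole scale
```

With this history, `most_common` returns only small tasks, while `full_range` always includes both the easiest and the hardest: the "ruler" the analogy describes.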
3. The "Comparison" Test: Which is Harder?
The Question: Humans find it easier to say "Task A is harder than Task B" than to say "Task A is 5 points." Can the AI do the same?
The Analogy: Imagine asking the chef, "Is the Taco harder than the Burrito?" vs. "How many points is the Taco?"
- Human Intuition: Humans usually say, "Comparing is easier! I don't need to count, I just know which is bigger."
- The AI Reality: The AI did not find comparing easier. In fact, it was worse at saying "A is harder than B" than it was at just guessing the number directly.
- Why? The AI seems to have a hidden "number brain." Even when you ask it only to compare, it appears to estimate a number for each task internally and then convert the two numbers into a "Yes/No." It's like asking someone which of two grocery bills is bigger: instead of eyeballing them, they add up both totals and then compare. Since the comparison is built on those internal estimates, it can't be more accurate than the numbers themselves.
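The pairwise framing the researchers tested amounts to a prompt like the one below. The wording is an illustrative guess, not the paper's exact prompt:

```python
# Sketch of the pairwise framing: ask only which of two tasks is
# harder, with no numbers involved. Prompt wording is illustrative.

def build_pairwise_prompt(task_a: str, task_b: str) -> str:
    """Ask the model for an ordinal judgment instead of a point value."""
    return (
        "You are an experienced agile developer.\n"
        f'Task A: "{task_a}"\n'
        f'Task B: "{task_b}"\n'
        "Which task requires more effort? Answer exactly 'A' or 'B'."
    )

print(build_pairwise_prompt("Fix a typo", "Migrate the database"))
```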
4. The "Comparison Cheat Sheet" Test
The Question: If the AI is bad at comparing, can we still use those comparisons as a "cheat sheet" to help it guess the numbers?
The Analogy: You tell the chef, "I know you're bad at comparing, but here are five pairs of dishes where I told you which was harder. Now, guess the points for this new dish."
The Result:
- Surprise! Even though the AI wasn't great at predicting the comparisons, using those comparisons as examples still helped it guess the numbers better.
- The Special Case: For the smaller, lighter AI models (like Gemini), using "comparisons" as examples actually worked better than giving them direct numbers. It was like a set of training wheels that helped the smaller bike stay upright.
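Using comparisons as a cheat sheet just means putting labeled pairs into the prompt instead of labeled numbers. A sketch, with pairs and wording invented for illustration:

```python
# Sketch of comparison-based few-shot context: the examples are
# ordered pairs of finished tasks rather than (task, points) pairs.
# The pairs and wording are illustrative, not the paper's prompt.

def build_comparison_context_prompt(pairs, new_task):
    """pairs: list of (easier_task, harder_task) descriptions."""
    lines = ["Here are pairs of finished tasks, easier one listed first:"]
    for easier, harder in pairs:
        lines.append(f'- "{easier}" took less effort than "{harder}"')
    lines.append(f'Estimate the story points for: "{new_task}"')
    lines.append("Answer with a single number.")
    return "\n".join(lines)

prompt = build_comparison_context_prompt(
    [("Fix a typo", "Migrate the database"),
     ("Update a label", "Rewrite the auth flow")],
    "Add pagination to the search results",
)
print(prompt)
```

The model still answers with a number; only the examples change from "task: 5 points" to "task X was easier than task Y."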
The Big Takeaways (The "So What?")
- AI is Ready to Help: You don't need years of data to get a good estimate. A smart AI can guess the effort of a new task just by reading its description.
- A Little Help Goes a Long Way: If you have just five past examples, show the AI a mix of easy and hard tasks. This calibrates the AI to your team's specific style.
- AI Thinks Differently Than Us: Humans love comparing things ("This is harder than that"). AI prefers to just guess the number directly. Don't try to force the AI to be human; let it be an AI.
- Not All AI is the Same: Big, powerful AI models love seeing direct numbers. Smaller, cheaper AI models might actually learn better if you show them comparisons instead.
In short: This paper shows that we can use AI to speed up software planning. We don't need to train it for months; we just need to give it a tiny "cheat sheet" of five examples, and it can save the team hours of meeting time!