Agentic Jackal: Live Execution and Semantic Value… — Plain-Language Explanation

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to ask a very strict, old-fashioned librarian (let's call him Jira) for a specific book. You don't know the library's secret code system (called JQL), so you just speak to him in plain English.

The problem? Jira is a bit of a stickler. If you say, "I want the book about Version 6.5," he might say, "We don't have a book called '6.5'. We have '6.5.0', '6.5.1', and '6.5 Beta'. Which one?" If you guess wrong, he hands you an empty box.

This paper introduces a new way to talk to Jira using a smart assistant named Jackal.

The Problem: The "One-Shot" Mistake

Previously, if you asked a smart AI (a Large Language Model) to translate your English request into Jira's secret code, it would try to do it in one single guess.

Think of it like taking a multiple-choice test without being allowed to check your answers.

The AI guesses: "Maybe the version is '6.5'?"
The Result: The AI writes the code, sends it to Jira, and Jira returns nothing because that exact version doesn't exist.
The Failure: The AI never knew it was wrong until it was too late. It couldn't "check the shelves" to see what was actually there.

The Solution: Agentic Jackal (The Smart Librarian's Assistant)

The authors built Agentic Jackal, which acts like a super-smart assistant who doesn't just guess; he checks the library shelves in real-time.

Here is how Jackal works, using a simple analogy:

1. The "Live Check" (Jira Search)

Instead of writing the code and hoping for the best, Jackal writes a draft, runs it through the library, and sees what happens.

Scenario: You ask for "Tasks about Build Tools."
Jackal's Move: He tries to run the search. Jira says, "I found nothing. We don't have a category called 'Build Tools'."
The Fix: Jackal realizes his mistake immediately. He goes back, thinks, "Oh, maybe they call it 'Build Tools: Other'?" He tries again. This time, it works.
The Old Way: The AI would have just guessed "Build Tools," failed, and given you a blank result, never knowing it could have been fixed.

2. The "Value Detective" (JiraAnchor)

Sometimes, the user says something vague, like "I need the 6.5 release." But the library has "6.5.0", "6.5.1", and "6.5.0 Beta". How does Jackal know which one you mean?

Enter JiraAnchor. Think of JiraAnchor as a magnifying glass that scans the library's entire catalog instantly.

You say: "6.5".
JiraAnchor whispers to Jackal: "Hey, I see '6.5.0 Beta1' and '6.5.0' in the system. '6.5' alone doesn't exist."
Jackal then picks the most likely match and writes the correct code.

The Results: Did it Work?

The researchers tested this on 9 different "smart brains" (AI models) with 1,000 different requests.

The "Guessers" (Old Way): They got about 43% of the tricky, vague requests right. They were like people guessing the password to a safe without any clues.
The "Checkers" (Agentic Jackal): With the new assistant, the success rate went up, with the largest gains on the hardest cases:
- For the hardest, most vague requests, the success rate went up by 9%.
- For requests about specific "components" (like specific parts of a project), the success rate skyrocketed from 17% to 66%. That's a massive improvement!

The Catch: It Takes a Little Longer

There is a trade-off.

The Old Way: Fast and cheap. Like ordering a pizza and hoping it's the right topping. (Takes about 5 seconds.)
The New Way: Slower but accurate. Like calling the pizza shop, asking "Do you have pepperoni?", waiting for the answer, and then ordering. (Takes about 32 seconds.)

The paper admits that this "checking" process uses more computer power and time. However, for important business tasks where getting the wrong answer is costly, the extra time is worth it.

The Big Discovery: It's Not Just About the Data

The researchers found something interesting. Even with the super-smart assistant checking the shelves, the AI still made mistakes. But these mistakes weren't because it couldn't find the right "Version 6.5."

The mistakes happened because the English language is tricky.

If you say "Find the bugs," does that mean "Find the Bug issue type" or "Find issues that contain the word 'bug' in the description"?
The AI still struggles with these human ambiguities. The tool can check the shelves, but it can't read your mind if your request is too vague.

Summary

Agentic Jackal is like giving a smart AI a live phone line to the database instead of letting it guess from memory.

Before: AI guesses, fails, and you get nothing.
After: AI guesses, checks the database, fixes its mistake, and gives you the right answer.

It's not perfect (it's slower and still gets confused by tricky English), but it turns a game of "blind guessing" into a game of "smart verification," making it much more reliable for real-world business use.

1. Problem Statement

Translating natural language (NL) into Jira Query Language (JQL) presents unique challenges that standard Large Language Models (LLMs) struggle to solve in a single pass:

Instance-Specific Ambiguity: Jira instances contain thousands of unique categorical values (e.g., component names, fix versions, labels) that vary by deployment. LLMs cannot predict these specific values from training data alone.
Verification Gap: Without access to a live database, models cannot verify if a generated query returns results or if the syntax is valid for the specific instance.
Semantic Ambiguity: User requests often lack precise terminology (e.g., "bugs" could mean issuetype=Bug or a text search in the summary field), leading to misinterpretation.
Lack of Benchmarks: Prior to this work, there was no open, execution-based benchmark for text-to-JQL. Existing datasets were small or lacked live verification.

Consequently, single-pass LLMs achieve high accuracy on literal translations (>90%) but collapse on paraphrased or under-specified queries (<30%).

2. Methodology: Agentic Jackal

The authors propose Agentic Jackal, a tool-augmented, multi-step agent designed to close the verification and value-resolution gaps. The system operates on a live Jira instance and consists of three core components:

A. The Agent Loop

Architecture: A two-node directed graph where an LLM generates a candidate JQL query, and a tool execution node runs it against the live Jira instance via the Jira MCP (Model Context Protocol) server.
Iterative Refinement: The agent receives three distinct feedback signals from the execution:
1. Non-empty result set: Query is plausible; agent may accept or refine.
2. Zero-result response: Query is too restrictive or misaligned; agent relaxes constraints.
3. Error message: Syntax or schema violation; agent corrects the specific clause.
Termination: The loop continues until the agent produces a query that executes successfully or hits a recursion limit (25 steps).
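
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `ExecResult`, `feedback_signal`, `run_agent`, and the two `fake_*` functions are hypothetical names, the "LLM" is a hard-coded stub, and the executor is a toy dictionary rather than a Jira MCP server. It also simplifies the first signal by accepting the first non-empty result instead of letting the agent decide whether to refine further.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_STEPS = 25  # recursion limit quoted in the text


@dataclass
class ExecResult:
    """Outcome of running one JQL query (hypothetical shape)."""
    issue_keys: List[str]
    error: Optional[str] = None


def feedback_signal(result: ExecResult) -> str:
    """Map an execution outcome to one of the three feedback signals."""
    if result.error is not None:
        return "error"
    if not result.issue_keys:
        return "zero_results"
    return "non_empty"


def run_agent(request: str,
              generate: Callable[[str, List[str]], str],
              execute: Callable[[str], ExecResult],
              max_steps: int = MAX_STEPS) -> Optional[str]:
    """Alternate between the generation node and the tool node."""
    history: List[str] = []
    for _ in range(max_steps):
        jql = generate(request, history)
        signal = feedback_signal(execute(jql))
        if signal == "non_empty":
            return jql  # simplification: accept the first plausible query
        history.append(f"{jql} -> {signal}")  # feedback for the next attempt
    return None  # step limit reached without an accepted query


# Toy stand-ins so the sketch runs without a Jira instance or an LLM.
KNOWN_COMPONENTS = {"Build Tools: Other": ["PROJ-1", "PROJ-7"]}


def fake_execute(jql: str) -> ExecResult:
    prefix = 'component = "'
    if not (jql.startswith(prefix) and jql.endswith('"')):
        return ExecResult([], error="unsupported query in this toy executor")
    name = jql[len(prefix):-1]
    return ExecResult(KNOWN_COMPONENTS.get(name, []))


def fake_generate(request: str, history: List[str]) -> str:
    # First try the literal phrase; after a failed attempt, try the stored name.
    name = "Build Tools" if not history else "Build Tools: Other"
    return f'component = "{name}"'


final_query = run_agent("tasks about build tools", fake_generate, fake_execute)
print(final_query)
```

In this toy run the first draft returns zero results, the failure is appended to the history, and the second draft is accepted. A real agent would pass the history back to the model as context rather than branching on whether it is empty.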

B. Jira Search

This is the standard execution tool provided by the Jira MCP server. It validates the generated JQL against the live schema and returns issue keys, result counts, or specific error messages.

C. JiraAnchor (Novel Contribution)

Purpose: A custom semantic retrieval tool designed to resolve natural language mentions of categorical values to their exact, instance-specific stored forms.
Mechanism:
1. Fetch: Retrieves all unique values for a specific field (e.g., fixVersion) from the live Jira instance via REST API.
2. Rank: Uses embedding-based similarity search (cosine similarity) and regex/approximate string matching to rank candidate values against the user's NL mention.
3. Output: Returns the top-$K$ matches (e.g., "6.5" $\to$ "6.5.0 Beta1") for the agent to use in the final query.
Design Philosophy: Unlike multi-turn value-linking agents, JiraAnchor consolidates value resolution into a single retrieval step to minimize latency while ensuring accuracy.
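
The fetch, rank, and output steps above can be illustrated with a small, self-contained sketch. Several things here are assumptions made for illustration only: the function names are hypothetical, character-trigram counts stand in for a real embedding model, the regex is a simple prefix check, the three scores are averaged with equal weight, and the "fetched" values are a hard-coded list rather than a REST API response. The summary does not specify how JiraAnchor combines its signals.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Tuple


def trigram_vector(text: str) -> Counter:
    """Toy 'embedding': character-trigram counts (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def prefix_match(mention: str, value: str) -> float:
    """1.0 if the value starts with the mention followed by a non-digit or end."""
    pattern = re.escape(mention) + r"(\D|$)"
    return 1.0 if re.match(pattern, value, re.IGNORECASE) else 0.0


def rank_values(mention: str, field_values: List[str],
                k: int = 3) -> List[Tuple[str, float]]:
    """Score every stored value against the mention and return the top k."""
    query_vec = trigram_vector(mention)
    scored = []
    for value in field_values:
        semantic = cosine(query_vec, trigram_vector(value))
        fuzzy = SequenceMatcher(None, mention.lower(), value.lower()).ratio()
        regex = prefix_match(mention, value)
        # Equal weighting is an arbitrary choice for this sketch.
        scored.append((value, (semantic + fuzzy + regex) / 3))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical fixVersion values, as if fetched from the instance.
fix_versions = ["6.5.0", "6.5.0 Beta1", "6.4.2", "7.0.0"]
top = rank_values("6.5", fix_versions, k=2)
for value, score in top:
    print(f"{value}: {score:.3f}")
```

With these toy values the two stored forms that begin with "6.5" rank above "6.4.2" and "7.0.0", which is the behavior the text describes: the agent receives real stored values to choose from instead of guessing that "6.5" exists.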

3. Key Contributions

Agentic Jackal Benchmark: The first large-scale, execution-based benchmark for text-to-JQL, comprising 100,000 validated NL–JQL pairs on a live instance with 200,000+ issues. The evaluation set (Jackal-1K) includes 1,000 stratified queries with four NL variants (Semantically Exact, Long NL, Short NL, Semantically Similar).
Agentic Framework: An open-source, model-agnostic agent that establishes the first baseline for enterprise text-to-JQL using live execution and iterative refinement.
JiraAnchor Tool: A novel semantic retrieval tool that significantly improves the resolution of instance-specific categorical values, addressing a critical failure mode in text-to-structured-query systems.
Error Taxonomy: A detailed analysis identifying that semantic interpretation ambiguities (issue type disambiguation, text field selection, version confusion) are the dominant failure modes, rather than value resolution errors.
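
To make "execution-based" concrete, here is one common way such a benchmark can score a prediction: run both the predicted and the reference query and compare the sets of issue keys they return. This is an assumption for illustration; the summary does not state the exact matching criterion the paper uses, and the function names and issue keys below are hypothetical.

```python
from typing import Iterable, List, Tuple


def execution_match(predicted_keys: Iterable[str],
                    gold_keys: Iterable[str]) -> bool:
    """A prediction counts as correct if it returns the same set of issues."""
    return set(predicted_keys) == set(gold_keys)


def execution_accuracy(pairs: Iterable[Tuple[List[str], List[str]]]) -> float:
    """Fraction of (predicted, gold) result pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(execution_match(p, g) for p, g in pairs) / len(pairs)


# Hypothetical results for three queries.
results = [
    (["PROJ-1", "PROJ-2"], ["PROJ-2", "PROJ-1"]),  # same set, order ignored
    ([], ["PROJ-3"]),                               # empty result: a miss
    (["PROJ-4"], ["PROJ-4"]),
]
accuracy = execution_accuracy(results)
print(f"{accuracy:.2f}")
```

Scoring on returned issues rather than on query text means two differently written queries can both be counted as correct, as long as they retrieve the same issues from the live instance.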

4. Experimental Results

The authors evaluated 9 frontier LLMs (including GPT-5, Claude 4, Gemini 3, and Pixtral) across two experiments.

Experiment 1: Naive vs. Agentic (Jackal-1K)

Setup: Compared single-pass generation (Naive) against the full Agentic Jackal loop.
Results:
- Overall Accuracy: Improved from 62.5% (Naive) to 64.4% (Agentic).
- Variant Impact: The agentic approach showed the most significant gains on Short NL queries (3.9% absolute improvement), where context is minimal.
- Model Performance: 7 out of 9 models improved. Gemini 3 Flash showed the largest individual gain (+8.2%), driven by a 17.3% improvement on Short NL.

Experiment 2: JiraAnchor Ablation (Field-Value Set)

Setup: Isolated the impact of JiraAnchor on queries involving categorical fields (Fix Version, Affected Version, Components).
Results:
- Overall Accuracy: Improved from 48.7% (Baseline) to 71.7% (with JiraAnchor).
- Component Field: Accuracy skyrocketed from 16.9% to 66.2%, demonstrating that without live value grounding, models fail catastrophically on complex component names.
- Version Fields: Showed consistent moderate gains (e.g., fixVersion from 63.1% to 73.0%).
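
The size of these gains is easier to compare when absolute changes (percentage points) are separated from relative changes. The short calculation below uses only the accuracy figures quoted above; the helper name `gains` is illustrative.

```python
from typing import Dict, Tuple

# (baseline accuracy %, accuracy with JiraAnchor %) as quoted in the text.
RESULTS: Dict[str, Tuple[float, float]] = {
    "overall": (48.7, 71.7),
    "component": (16.9, 66.2),
    "fixVersion": (63.1, 73.0),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for field, (before, after) in RESULTS.items():
    absolute, relative = gains(before, after)
    print(f"{field}: +{absolute} points ({relative}% relative)")
```

The component field gains 49.3 points, compared with 23.0 points overall and 9.9 points for fixVersion, which is why the text singles it out.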

Error Analysis

Dominant Failure Modes: On the hardest query variants (Short NL and Semantically Similar), semantic interpretation errors accounted for 58% to 68% of failures.
- Issue Type Interpretation: Confusing text search with specific issue types.
- Text Field Selection: Ambiguity between summary and description fields.
- Version Confusion: Mixing up fixVersion and affectedVersion.
Key Insight: Tools like JiraAnchor effectively solve value resolution, but they cannot resolve inherent linguistic ambiguities that require user clarification or richer context.
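
The three failure modes are easier to see as competing JQL readings of the same request. The snippet below is illustrative only: the requests, the version "6.5.0", and the helper `needs_clarification` are hypothetical, and the lookup table stands in for whatever mechanism a real system would use to detect ambiguity. The JQL strings use standard fields (`issuetype`, `summary`, `description`, `text`, `fixVersion`, `affectedVersion`).

```python
from typing import Dict, List

# Each request maps to several plausible JQL interpretations.
AMBIGUOUS_READINGS: Dict[str, List[str]] = {
    "find the bugs": [
        'issuetype = Bug',        # issue type interpretation
        'summary ~ "bug"',        # text search in the summary field
        'description ~ "bug"',    # text search in the description field
    ],
    "crashes in 6.5.0": [
        'text ~ "crash" AND fixVersion = "6.5.0"',       # fixed in 6.5.0
        'text ~ "crash" AND affectedVersion = "6.5.0"',  # observed in 6.5.0
    ],
}


def needs_clarification(request: str) -> bool:
    """True when more than one reading is on record for the request."""
    return len(AMBIGUOUS_READINGS.get(request.lower(), [])) > 1


for request, readings in AMBIGUOUS_READINGS.items():
    print(request)
    for jql in readings:
        print("   ", jql)
```

Every reading above is valid JQL and may return results, so neither live execution nor value lookup can tell which one the user intended.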

Operational Costs

Latency: Agentic execution increases latency significantly (from ~5s to ~32s on average) due to iterative LLM inference and API calls.
Tokens: Token usage increases roughly 15x (from ~1.8K to ~27K) due to chain-of-thought and tool interaction logs.
Trade-off: The cost is justified for ambiguous queries where accuracy is critical, but a hybrid approach (naive for clear queries, agentic for ambiguous ones) is recommended for cost-sensitive deployments.
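
A rough calculation shows what the hybrid approach could save. It uses the average latency and token figures quoted above and assumes, purely for illustration, a router that sends a given fraction of queries to the agentic path at no extra cost and never misroutes. The function name `hybrid_cost` and the 25% fraction are hypothetical.

```python
from typing import Dict

# Average per-query costs as quoted in the text.
NAIVE: Dict[str, float] = {"latency_s": 5.0, "tokens": 1_800}
AGENTIC: Dict[str, float] = {"latency_s": 32.0, "tokens": 27_000}


def hybrid_cost(ambiguous_fraction: float) -> Dict[str, float]:
    """Expected per-query cost when only ambiguous queries use the agent."""
    if not 0.0 <= ambiguous_fraction <= 1.0:
        raise ValueError("ambiguous_fraction must be between 0 and 1")
    p = ambiguous_fraction
    return {key: (1 - p) * NAIVE[key] + p * AGENTIC[key] for key in NAIVE}


token_ratio = AGENTIC["tokens"] / NAIVE["tokens"]          # 15.0
latency_ratio = AGENTIC["latency_s"] / NAIVE["latency_s"]  # 6.4

print(f"tokens: {token_ratio:.1f}x, latency: {latency_ratio:.1f}x")
print(hybrid_cost(0.25))
```

If a quarter of queries were routed to the agent, the expected cost under these assumptions would be 11.75 seconds and 8,100 tokens per query, well below running the agent on everything. A real router would add its own cost and would sometimes misclassify queries.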

5. Significance and Future Directions

Paradigm Shift: The paper demonstrates that for enterprise text-to-SQL/JQL tasks, execution-based feedback loops and live value grounding are essential, moving beyond static, single-pass prompting.
Benchmarking: It provides the first rigorous, open benchmark for text-to-JQL, enabling reproducible research in this domain.
Future Work: The authors suggest that future improvements should focus on:
- Intent Disambiguation: Mechanisms to ask users clarifying questions when semantic ambiguity is detected (e.g., "Did you mean the summary or description?").
- Selective Invocation: Using confidence scores to decide when to call JiraAnchor, reducing unnecessary overhead.
- Hybrid Strategies: Combining naive generation for well-specified queries with agentic refinement for ambiguous ones.

In summary, Agentic Jackal shows that while LLMs struggle with the specific constraints of live enterprise data, equipping them with tools for live execution and semantic value retrieval narrows the gap between natural language intent and accurate, executable queries, most clearly for instance-specific values such as component names.

Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL