The Limits of Long-Context Reasoning in Automated Bug Fixing
This paper shows that agentic workflows improve bug-fixing performance by decomposing tasks into short-context steps, but that current large language models fail to reason effectively over genuinely long contexts (e.g., 64k tokens). This failure reveals a significant gap between nominal context length and usable reasoning capacity.