SWE-Together: Evaluating Coding Agents in Interactive User Sessions
Abstract
SWE-Together is a multi-turn coding benchmark created from real user-agent interactions, featuring a reactive LLM simulator to evaluate agents based on both final correctness and interaction efficiency.
Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with recoverable repository states, clear user goals, and observable outcomes. To replay these interactions across agents, we build a reactive LLM-based user simulator that preserves the original users' intents and provides feedback when the coding agent's progress requires it. To evaluate agents as collaborators, we measure both final repository correctness and the number of corrective feedback turns required during the interaction. Experiments with frontier coding agents show that stronger agents generally achieve higher final success rates while requiring fewer interventions, suggesting an improved user experience.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions (2026)
- EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions (2026)
- $\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows (2026)
- ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End? (2026)
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation (2026)
- SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision (2026)
- Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.29957 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper
