There's a version of AI-powered software development that actually makes sense: small models running locally, quietly triaging your GitHub backlog overnight, without a cloud bill and without your proprietary code leaving the building. It's a reasonable thing to want, and when you describe it to most engineers they immediately see the appeal. The problem is that when we actually tried to build it, using real agentic frameworks and real small models on a real benchmark of bugs, almost nothing worked.
## What we're even talking about
Agentic coding tools are AI systems that don't just autocomplete. They act autonomously: reading relevant files from a GitHub issue, writing a patch, running tests, and submitting a fix without a human in the loop. SWE-Agent and OpenHands are two of the more established examples, and they've shown genuinely impressive results on standard benchmarks.
But they generally assume a powerful model underneath: GPT-4 class or similar. That works if you can accept the cost and privacy tradeoffs. Many teams can't: every token sent externally may include internal code, and at scale those tokens become a real bill.
### Quick vocabulary

- **LLM:** A large language model, typically cloud-hosted and expensive to run.
- **SLM:** A small language model (often 1-5B params) that can run locally.
- **Agentic framework:** The orchestration layer that lets AI use tools (read files, run commands, write code) across repeated loops.
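That tool-use loop is the core of every framework discussed here. A minimal sketch of the pattern, where `model_call` and the tool names are hypothetical placeholders rather than any real framework's API:

```python
def run_agent(issue, model_call, tools, max_steps=20):
    """Minimal agentic loop: ask the model for an action, execute it,
    feed the result back, and repeat until it submits a patch or the
    step budget runs out. (Illustrative sketch, not a real framework.)"""
    history = [f"Issue: {issue}"]
    for _ in range(max_steps):
        action = model_call(history)          # e.g. {"tool": "read_file", "args": {...}}
        if action["tool"] == "submit_patch":  # the agent believes it is done
            return action["args"]["patch"]
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {result}")
    return None  # step budget exhausted without a patch
```

The `max_steps` cap is the only brake in this naive version; as the results below show, that is not nearly enough when the model underneath is weak.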
## How we ran the experiment
We took four real agentic frameworks and swapped their underlying models for small, locally-runnable ones. Then we ran each setup 150 times, measuring energy, runtime, token counts, and memory use. That level of repetition is what separates an anecdote from a credible empirical result.
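A repetition harness of this shape is simple to sketch. The `run_once` callable and the metrics dictionary it returns are assumptions for illustration, not the paper's actual tooling (which also captured energy, tokens, and memory):

```python
import statistics
import time

def benchmark(run_once, repetitions=150):
    """Repeat one agent run many times and aggregate the results.
    `run_once` is assumed to return a dict with at least {"resolved": bool}."""
    runtimes, resolved = [], 0
    for _ in range(repetitions):
        start = time.perf_counter()
        result = run_once()
        runtimes.append(time.perf_counter() - start)
        resolved += int(result["resolved"])
    return {
        "runs": repetitions,
        "resolution_rate": resolved / repetitions,
        "mean_runtime_s": statistics.mean(runtimes),
    }
```

Aggregating over many runs like this is what turns a noisy per-run observation into a rate you can actually compare across frameworks.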
### The four frameworks
| Framework | Character | Approach |
|---|---|---|
| SWE-Agent | Methodical | Reads broadly before touching code |
| OpenHands | Generalist | Browser + terminal + editor, broader tool usage |
| Mini SWE Agent | Stripped down | Tighter loops and fewer prompts |
| AutoCodeRover | Targeted | Searches for relevant files before editing |
### The two small models
| Model | Size | Made by |
|---|---|---|
| Gemma-3 4B | 4 billion parameters | Google |
| Qwen 1.7B | 1.7 billion parameters | Alibaba |
The benchmark was SWE-bench Verified Mini, a curated set of real GitHub issues from real Python projects. That means the tasks represent bugs that developers actually struggled with.
## What happened: roughly nothing
Across most configurations, the resolution rate was effectively zero. Systems ran, consumed energy, generated tokens, iterated through framework loops, and returned almost no correct patches. AutoCodeRover often spent roughly 24-27 minutes per run with little useful output.
| Headline number | Meaning |
|---|---|
| ~0% | Task resolution in most setups |
| 24-27 min | Per AutoCodeRover run |
| 4 × 2 × 150 | Frameworks × SLMs × runs |
| Energy ≠ value | Compute spent without useful patches |
Framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs' limited reasoning.
The framework stayed busy while the model lost the thread. The system kept requesting more steps and burning tokens, but there was no robust mechanism to recognize failure and cut losses early.
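One mechanism this implies is missing: a stall detector that cuts a run short once recent steps stop making progress. A heuristic sketch, where both the window size and the notion of "progress" (a step counts only if it touches something new) are illustrative assumptions:

```python
from collections import deque

class StallDetector:
    """Flag an agent run for early termination when the last `window`
    steps all repeat previously seen actions, i.e. the agent is flailing."""
    def __init__(self, window=5):
        self.window = window
        self.seen = set()
        self.recent = deque(maxlen=window)  # True = this step did something new

    def record(self, step_signature):
        self.recent.append(step_signature not in self.seen)
        self.seen.add(step_signature)

    def should_stop(self):
        # Stop only once a full window of steps has produced nothing new.
        return len(self.recent) == self.window and not any(self.recent)
```

Even a crude gate like this would have converted many of the 24-27 minute dead-end runs into cheap early exits.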
## Two separate problems, often conflated
| Problem | What is happening | Where to fix it |
|---|---|---|
| Reasoning failure (model) | SLM cannot solve the bug reliably | Model capability and training |
| Energy waste (framework) | Orchestrator fails to detect flailing and stop early | Framework design and control logic |
Improving the small model alone does not automatically solve orchestration waste. These are related, but separate, engineering problems.
## Why this matters

### The energy argument cuts both ways
SLMs are often cheaper per token, but cost per successful task is what matters. A cheap model that finishes 0% of tasks is still expensive in practice.
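The arithmetic is worth making explicit. With illustrative prices and success rates (not measured values from the study), cost per successful task diverges as the success rate approaches zero:

```python
def cost_per_success(price_per_run, success_rate):
    """Expected spend to obtain one successful task.
    At a 0% success rate the cost is unbounded, however cheap each run is."""
    if success_rate == 0:
        return float("inf")
    return price_per_run / success_rate

# Hypothetical numbers: a pricey cloud LLM vs. a "cheap" local SLM.
llm = cost_per_success(price_per_run=0.50, success_rate=0.40)  # finite cost per fix
slm = cost_per_success(price_per_run=0.01, success_rate=0.0)   # unbounded
```

Per-token price only matters once the success rate is meaningfully above zero.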
### The privacy case for local AI is real

For teams with strict compliance boundaries, local deployment is not just a preference; it can be a requirement. That keeps this line of work highly relevant even when current results are weak.
### False positives are a serious risk
Some runs looked successful but produced broken patches. That means every autonomous fix should be treated as untrusted until independently verified by external CI and stronger checks.
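Treating fixes as untrusted suggests a verification gate between the agent and the repository. A sketch of the gate's logic, where `apply_patch`, `run_tests`, and `rollback` are hypothetical stand-ins for your own CI tooling (e.g. `git apply`, `pytest`, `git checkout`):

```python
def verify_patch(apply_patch, run_tests, rollback):
    """Accept an agent-generated patch only if it applies cleanly AND the
    full test suite passes; otherwise roll back so the tree stays clean."""
    if not apply_patch():   # e.g. `git apply` succeeded?
        return False
    if run_tests():         # e.g. `pytest -q` returned success?
        return True
    rollback()              # undo the patch before rejecting it
    return False
```

The key design point is that the agent's own claim of success never reaches the repository without passing an independent check.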
## The short version, by role
| If you are... | Takeaway |
|---|---|
| Building local AI coding tools | Do not just downsize an LLM stack. Build SLM-first orchestration. |
| Evaluating tools for cost | Optimize for successful task completion, not token price. |
| Working under strict data policy | Local agentic AI is valid, but framework maturity is not there yet. |
| Researching this area | SWEnergy provides a strong, reproducible baseline. |
## TL;DR
- Existing agentic coding frameworks assume strong cloud models.
- When swapped with small local models, task resolution is near zero in this study.
- Model reasoning failure and orchestration waste are distinct problems.
- False positives make independent verification mandatory.
- The likely path forward is SLM-native framework design, not just better small models.
## What our paper is actually doing
Benchmark records often get more attention, but practical negative results like this are more useful for engineering decisions. Knowing why local agentic issue resolution currently fails gives teams a concrete roadmap for what needs to change.
The private, cheap, unattended version of local agentic AI that many teams want is not ready yet. This work helps define the missing pieces.
Paper: SWEnergy. Accepted at the AGENTS workshop at ICSE 2026!