
Small AI models can't always fix your code. Here's why that's interesting.

By Arihant Tripathy · March 13, 2026 · 12 min read

There's a version of AI-powered software development that actually makes sense: small models running locally, quietly triaging your GitHub backlog overnight, without a cloud bill and without your proprietary code leaving the building. It's a reasonable thing to want, and when you describe it to most engineers they immediately see the appeal. The problem is that when we actually tried to build it, using real agentic frameworks and real small models on a real benchmark of bugs, almost nothing worked.

What we're even talking about

Agentic coding tools are AI systems that don't just autocomplete. They act autonomously: reading relevant files from a GitHub issue, writing a patch, running tests, and submitting a fix without a human in the loop. SWE-Agent and OpenHands are two of the more established examples, and they've shown genuinely impressive results on standard benchmarks.

But they generally assume a powerful model underneath: GPT-4 class or similar. That works if you can accept the cost and privacy tradeoffs. Many teams can't. Every token sent externally may include internal code, and at scale those tokens become a real bill.

Quick vocabulary

LLM: A large language model, typically cloud-hosted and expensive to run.

SLM: A small language model (often 1-5B params) that can run locally.

Agentic framework: The orchestration layer that lets AI use tools (read files, run commands, write code) across repeated loops.
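The "repeated loops" in that last definition can be sketched in a few lines. This is a toy illustration, not any framework's actual code; `complete` (the model call) and the `tools` dict are hypothetical stand-ins.

```python
# Toy agentic loop: the model repeatedly picks a tool, the framework
# executes it, and the transcript grows until the model submits a patch
# or the step budget runs out. `complete` and `tools` are hypothetical.
def run_agent(issue, complete, tools, max_steps=10):
    history = [f"Issue: {issue}"]
    for _ in range(max_steps):
        action = complete("\n".join(history))   # e.g. "read bug.py"
        name, _, arg = action.partition(" ")
        if name == "submit":                    # model declares it is done
            return arg
        tool = tools.get(name, lambda a: f"unknown tool: {name}")
        history.append(f"{action} -> {tool(arg)}")
    return None                                 # step budget exhausted
```

Real frameworks differ mainly in what goes into `tools` and how the transcript is pruned, which is exactly where the four systems below diverge.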

How we ran the experiment

We took four real agentic frameworks and swapped their underlying models for small, locally-runnable ones. Then we ran each setup 150 times, measuring energy, runtime, token counts, and memory use. That level of repetition is what separates an anecdote from a credible empirical result.
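For the non-energy metrics, a per-run harness can be sketched roughly as follows. `run_once` is a hypothetical callable wrapping one framework + model configuration; energy metering is platform-specific (e.g. RAPL counters) and omitted here.

```python
# Sketch of a repeated-measurement harness for runtime and peak memory.
# `run_once` is a hypothetical stand-in for one framework/model setup;
# energy metering is omitted because it is platform-specific.
import time
import tracemalloc

def measure(run_once, repetitions=150):
    records = []
    for _ in range(repetitions):
        tracemalloc.start()
        start = time.perf_counter()
        resolved = run_once()                      # True if the patch resolved the issue
        runtime = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # peak bytes this run
        tracemalloc.stop()
        records.append({"runtime_s": runtime,
                        "peak_bytes": peak,
                        "resolved": bool(resolved)})
    return records
```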

The four frameworks

Framework      | Character     | Approach
SWE-Agent      | Methodical    | Reads broadly before touching code
OpenHands      | Generalist    | Browser + terminal + editor, broader tool usage
Mini SWE Agent | Stripped down | Tighter loops and fewer prompts
AutoCodeRover  | Targeted      | Searches for relevant files before editing

The two small models

Model      | Size                   | Made by
Gemma-3 4B | 4 billion parameters   | Google
Qwen 1.7B  | 1.7 billion parameters | Alibaba

The benchmark was SWE-bench Verified Mini, a curated set of real GitHub issues from real Python projects. That means the tasks represent bugs that developers actually struggled with.

What happened: roughly nothing

Across most configurations, the resolution rate was effectively zero. Systems ran, consumed energy, generated tokens, iterated through framework loops, and returned almost no correct patches. AutoCodeRover often spent roughly 24-27 minutes per run with little useful output.

  • ~0% task resolution in most setups
  • 24-27 minutes per AutoCodeRover run
  • 4 frameworks x 2 SLMs x 150 runs
  • Energy != value: compute spent without useful patches

Framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs' limited reasoning.

The framework stayed busy while the model lost the thread. The system kept requesting more steps and burning tokens, but there was no robust mechanism to recognize failure and cut losses early.
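One shape such a cut-losses mechanism could take, purely as an illustration and not something the paper proposes: watch the agent's recent actions and stop the loop when it starts repeating itself.

```python
# Hypothetical flail detector: stop when the agent issues the same
# action for an entire sliding window of steps. A sketch of one way to
# "recognize failure and cut losses early", not an implemented system.
from collections import deque

class FlailDetector:
    def __init__(self, window=4):
        self.recent = deque(maxlen=window)

    def should_stop(self, action):
        self.recent.append(action)
        # a full window of identical actions means the agent is looping
        return (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1)
```

An orchestrator would call `should_stop` after every model step and abort the run instead of burning the remaining token budget.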

Two separate problems, often conflated

Problem           | Layer     | What is happening                                   | Where to fix it
Reasoning failure | Model     | SLM cannot solve the bug reliably                   | Model capability and training
Energy waste      | Framework | Orchestrator fails to detect flailing and stop early | Framework design and control logic

Improving the small model alone does not automatically solve orchestration waste. These are related, but separate, engineering problems.

Why this matters

The energy argument cuts both ways

SLMs are often cheaper per token, but cost per successful task is what matters. A cheap model that finishes 0% of tasks is still expensive in practice.
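The arithmetic is worth making explicit. The numbers below are illustrative, not from the paper:

```python
# Cost per successful task, the metric this section argues for.
# All figures here are made-up illustrations, not measured results.
def cost_per_success(cost_per_run, runs, successes):
    total = cost_per_run * runs
    return float("inf") if successes == 0 else total / successes

cheap_but_failing = cost_per_success(0.02, 150, 0)     # diverges: no successes
pricier_but_working = cost_per_success(0.50, 150, 60)  # 1.25 per resolved issue
```

A per-run price 25x higher still wins the moment the denominator stops being zero.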

The privacy case for local AI is real

For teams with strict compliance boundaries, local deployment is not just a preference; it can be a requirement. That keeps this line of work highly relevant even when current results are weak.

False positives are a serious risk

Some runs looked successful but produced broken patches. That means every autonomous fix should be treated as untrusted until independently verified by external CI and stronger checks.
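A minimal verification gate might look like the sketch below: run the project's test suite in the patched checkout and only then trust the fix. The `repo_dir` layout and the pytest invocation are assumptions about the project, and green tests are a necessary check, not a sufficient one.

```python
# Hypothetical verification gate for an autonomous patch: treat the fix
# as untrusted until the project's own test suite passes in the patched
# checkout. The pytest command is an assumption about the project.
import subprocess

def patch_is_trusted(repo_dir, test_cmd=("python", "-m", "pytest", "-q")):
    result = subprocess.run(list(test_cmd), cwd=repo_dir, capture_output=True)
    return result.returncode == 0
```

Stronger setups would add held-out regression tests, since a patch that merely makes the visible tests pass can still be the kind of false positive described above.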

The short version, by role

If you are...                  | Takeaway
Building local AI coding tools | Do not just downsize an LLM stack. Build SLM-first orchestration.
Evaluating tools for cost      | Optimize for successful task completion, not token price.
Working under strict data policy | Local agentic AI is valid, but framework maturity is not there yet.
Researching this area          | SWEnergy provides a strong, reproducible baseline.

TL;DR

  • Existing agentic coding frameworks assume strong cloud models.
  • When swapped with small local models, task resolution is near zero in this study.
  • Model reasoning failure and orchestration waste are distinct problems.
  • False positives make independent verification mandatory.
  • The likely path forward is SLM-native framework design, not just better small models.

What our paper is actually doing

Benchmark records often get more attention, but practical negative results like this are more useful for engineering decisions. Knowing why local agentic issue resolution currently fails gives teams a concrete roadmap for what needs to change.

The private, cheap, unattended version of local agentic AI that many teams want is not ready yet. This work helps define the missing pieces.


Paper: SWEnergy. Accepted at AGENTS workshop at ICSE 2026!