Top 7 LLMs for Coding in 2026 (SWE-bench, LiveCodeBench, Terminal-Bench)

We ranked the seven strongest coding models of early 2026 primarily by SWE-bench Verified, the most contamination-resistant agentic coding benchmark, with LiveCodeBench and Terminal-Bench 2.0 as secondary axes. Every score below is labeled as vendor-reported or independently measured, and the numbers reflect February to March 2026 snapshots that will shift over time.

80.8: SWE-bench Verified leader (Claude Opus 4.6). Anthropic + Onyx, Feb-Mar 2026.

83.6: LiveCodeBench leader among models with a full score set (Qwen 3.5-plus). Onyx independent tracker.

77.3: Terminal-Bench 2.0 leader (GPT-5.3 Codex). Independent, agentic shell tasks.

Seven models ranked on shared axes, verified Feb-Mar 2026.

The Full Rankings

This table covers all seven models at a glance. The Source column flags whether the figures are independently measured or include vendor-reported numbers. Scores are percentages on SWE-bench Verified, LiveCodeBench, and Terminal-Bench 2.0.

#	Model	Org	SWE-bench Verified (%)	LiveCodeBench (%)	Terminal-Bench 2.0 (%)	Source
1	Claude Opus 4.6	Anthropic	80.8 (SWE lead)	76.0	65.4	Vendor + Independent
2	Gemini 3.1 Pro	Google	80.6	81.3	68.5	Independent
3	MiniMax M2.5	MiniMax	80.2	65.0	42.2	Vendor + Independent
4	GLM-5	Zhipu AI	77.8	52.0	56.2	Independent
5	Kimi K2.5	Moonshot AI	76.8	85.0	Not reported	Independent
6	Qwen 3.5-plus	Alibaba	76.4	83.6 (LCB lead among full-profile models)	52.5	Independent
7	GPT-5.3 Codex	OpenAI	Not independently reported (SWE-bench Pro 56.8)	71.0	77.3 (Terminal lead)	Independent

The table can be re-ranked on "SWE-bench Verified", "LiveCodeBench", or "Terminal-Bench 2.0". The one row without an independent SWE-bench Verified figure (GPT-5.3 Codex) sorts to the bottom of that column.
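
The re-ranking rule can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual table code: the `SCORES` dictionary, the short axis keys (`swe`, `lcb`, `term`), and the `rank_by` helper are names assumed here. The figures are copied from the table above, with `None` standing in for a figure that is not reported.

```python
from typing import Optional

# Scores copied from the rankings table; None marks a figure that is not reported.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "Claude Opus 4.6": {"swe": 80.8, "lcb": 76.0, "term": 65.4},
    "Gemini 3.1 Pro":  {"swe": 80.6, "lcb": 81.3, "term": 68.5},
    "MiniMax M2.5":    {"swe": 80.2, "lcb": 65.0, "term": 42.2},
    "GLM-5":           {"swe": 77.8, "lcb": 52.0, "term": 56.2},
    "Kimi K2.5":       {"swe": 76.8, "lcb": 85.0, "term": None},
    "Qwen 3.5-plus":   {"swe": 76.4, "lcb": 83.6, "term": 52.5},
    "GPT-5.3 Codex":   {"swe": None, "lcb": 71.0, "term": 77.3},
}


def rank_by(axis: str) -> list[str]:
    """Return model names sorted by one axis, highest first, missing scores last."""
    def key(name: str) -> tuple[bool, float]:
        score = SCORES[name][axis]
        # False sorts before True, so rows with a missing score go to the bottom.
        return (score is None, -(score or 0.0))
    return sorted(SCORES, key=key)


for axis in ("swe", "lcb", "term"):
    print(axis, rank_by(axis))
```

Sorting on `term` moves GPT-5.3 Codex to the top and Kimi K2.5, which has no reported Terminal-Bench 2.0 figure, to the bottom. Sorting on `swe` reproduces the order of the table.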

How We Ranked These Models

Methodology

Models are ranked primarily by SWE-bench Verified (real GitHub issues, contamination-resistant), with LiveCodeBench and Terminal-Bench 2.0 as secondary axes, using figures from February to March 2026. SWE-bench scores depend on the agentic scaffold. Each figure is labeled as vendor-reported or independent. Saturated benchmarks such as HumanEval and GSM8K are excluded.

Where a model lacks an independently reported SWE-bench Verified score, we say so rather than substitute a different benchmark and treat it as equivalent. GPT-5.3 Codex is placed at the bottom of the primary axis for this reason and is included on the strength of its independent Terminal-Bench 2.0 result.

The three benchmarks measure different things, and a model can lead one while trailing another:

SWE-bench Verified runs a model inside an agent harness against human-validated GitHub issues. It is the closest public proxy for "can this fix a real bug in a real repository," which is why it is our primary axis.
LiveCodeBench uses a rolling set of recent competitive-programming problems collected after model training cutoffs, making it among the most contamination-resistant trackers for algorithmic coding.
Terminal-Bench 2.0 measures agentic shell and tool-use competence: navigating a filesystem, running commands, and completing multi-step terminal tasks.

Sources include independent trackers (Onyx AI, LLM-Stats, Artificial Analysis) and the public SWE-bench, LiveCodeBench, and Terminal-Bench leaderboards, cross-checked against vendor technical reports where independent figures were unavailable. Saturated benchmarks like HumanEval (above 93 percent across the field) and GSM8K (around 99 percent) are excluded because they no longer separate frontier models.

1. Claude Opus 4.6 (Anthropic)

Claude Opus 4.6 takes the top spot on our primary axis with the highest SWE-bench Verified score in the group at 80.8 percent. The gap over second place is small, well within scaffold-driven noise, but Opus has been the most consistent performer across independent harnesses on real repository fixes. Its Terminal-Bench 2.0 result of 65.4 is mid-pack, and its LiveCodeBench figure of 76.0 trails the algorithmic specialists, so it is a strong generalist rather than a category winner everywhere.

SWE-bench Verified: 80.8; LiveCodeBench: 76.0; Terminal-Bench 2.0: 65.4

Best for: Teams that want the most reliable agentic bug-fixing and refactoring across real codebases, especially when paired with a strong scaffold such as Claude Code.

How to read this: The SWE-bench Verified figure blends an Anthropic-reported result with an independent Onyx measurement. The lead over Gemini 3.1 Pro and MiniMax M2.5 is a fraction of a point, so treat the top three as effectively tied on this axis.

2. Gemini 3.1 Pro (Google)

Gemini 3.1 Pro is arguably the strongest all-rounder on this list. Its independent SWE-bench Verified result of 80.6 is effectively level with the leader, two-tenths of a point behind, and it pairs that with a much higher LiveCodeBench score of 81.3 and the second-best Terminal-Bench 2.0 figure at 68.5. If you want one model that is competitive on real bug-fixing, algorithmic problems, and terminal agents at once, Gemini 3.1 Pro has the most balanced profile here.

SWE-bench Verified: 80.6; LiveCodeBench: 81.3; Terminal-Bench 2.0: 68.5

Best for: Developers who want the most balanced single model across all three axes, particularly those already working in the Google ecosystem.

How to read this: All three figures here are independently measured (Onyx and officechai), which makes this one of the more defensible profiles in the group. The SWE-bench gap to rank one is two-tenths of a point.

3. MiniMax M2.5 (MiniMax)

MiniMax M2.5 rounds out the top tier on SWE-bench Verified at 80.2 percent, a strong result for real-repository bug fixing. The catch is that its strength is narrow: its LiveCodeBench score of 65.0 is the second-lowest here, and its Terminal-Bench 2.0 result of 42.2 is the weakest agentic-terminal figure in the group. It is a focused repository-fixing model more than a do-everything coding assistant.

SWE-bench Verified: 80.2; LiveCodeBench: 65.0; Terminal-Bench 2.0: 42.2

Best for: Pipelines focused tightly on resolving GitHub-style issues where terminal-agent breadth is less important than raw patch accuracy.

How to read this: The headline SWE-bench number blends a MiniMax-reported figure with an independent Onyx measurement. The low Terminal-Bench 2.0 result is a real gap, not noise, so do not assume top-three SWE-bench standing carries over to terminal agents.

4. GLM-5 (Zhipu AI)

GLM-5 posts a solid independent SWE-bench Verified score of 77.8, landing just below the top tier. Its Terminal-Bench 2.0 result of 56.2 is respectable, sitting between the leaders and the weaker terminal performers. The soft spot is LiveCodeBench at 52.0, the lowest algorithmic score here, which suggests it is more comfortable with applied repository work than with competitive-style problems.

SWE-bench Verified: 77.8; LiveCodeBench: 52.0; Terminal-Bench 2.0: 56.2

Best for: Teams wanting a capable, independently benchmarked option for applied coding and terminal tasks without paying frontier-flagship prices.

How to read this: All three figures are independently sourced. The weak LiveCodeBench result means GLM-5 is a poorer fit for algorithm-heavy or competitive-programming workloads.

5. Kimi K2.5 (Moonshot AI)

Kimi K2.5 is a study in why the primary axis matters. Its independent LiveCodeBench score of 85.0 is the highest algorithmic figure on this entire list, yet its SWE-bench Verified result of 76.8 places it fifth on our real-repository axis. No Terminal-Bench 2.0 figure has been independently reported for it. If competitive-style problem solving is your priority, Kimi punches well above its overall rank.

SWE-bench Verified: 76.8; LiveCodeBench: 85.0; Terminal-Bench 2.0: not reported

Best for: Algorithmic and competitive-programming work, where its category-leading LiveCodeBench score is the most relevant signal.

How to read this: Ranked fifth because our primary axis is SWE-bench Verified, not LiveCodeBench. The missing Terminal-Bench 2.0 figure means its agentic-terminal competence is unverified here.
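
Kimi's split standing can be checked directly from the table figures. The sketch below is illustrative only: `SCORES`, the axis keys, and the `axis_rank` helper are names assumed for this example, and `None` marks a figure that is not reported. A model's rank on an axis is one plus the number of models with a strictly higher reported score.

```python
from typing import Optional

# Scores copied from the rankings table; None marks a figure that is not reported.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "Claude Opus 4.6": {"swe": 80.8, "lcb": 76.0, "term": 65.4},
    "Gemini 3.1 Pro":  {"swe": 80.6, "lcb": 81.3, "term": 68.5},
    "MiniMax M2.5":    {"swe": 80.2, "lcb": 65.0, "term": 42.2},
    "GLM-5":           {"swe": 77.8, "lcb": 52.0, "term": 56.2},
    "Kimi K2.5":       {"swe": 76.8, "lcb": 85.0, "term": None},
    "Qwen 3.5-plus":   {"swe": 76.4, "lcb": 83.6, "term": 52.5},
    "GPT-5.3 Codex":   {"swe": None, "lcb": 71.0, "term": 77.3},
}


def axis_rank(model: str, axis: str) -> Optional[int]:
    """Rank of a model on one axis (1 = best), or None if its score is not reported."""
    score = SCORES[model][axis]
    if score is None:
        return None
    # Count the models that report a strictly higher score on this axis.
    higher = sum(
        1 for s in SCORES.values() if s[axis] is not None and s[axis] > score
    )
    return higher + 1


for axis in ("swe", "lcb", "term"):
    print(axis, axis_rank("Kimi K2.5", axis))
```

For Kimi K2.5 this gives rank 5 on `swe`, rank 1 on `lcb`, and `None` on `term`, which matches the placement described above.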

6. Qwen 3.5-plus (Alibaba)

Qwen 3.5-plus holds the LiveCodeBench crown among models with a full score set at 83.6 percent, edging out Gemini on that axis. Its SWE-bench Verified result of 76.4 is solid but places it sixth on our primary measure, and its Terminal-Bench 2.0 figure of 52.5 is the second-lowest of the six reported results. Like Kimi, it is a reminder that a high algorithmic score does not automatically translate into top real-repository performance.

SWE-bench Verified: 76.4; LiveCodeBench: 83.6; Terminal-Bench 2.0: 52.5

Best for: Algorithmic workloads where it leads LiveCodeBench among full-profile models, with an active open ecosystem behind it.

How to read this: All figures are independently sourced. Its sixth-place SWE-bench standing means it is a weaker pick for autonomous repository bug-fixing than the top three.

7. GPT-5.3 Codex (OpenAI)

GPT-5.3 Codex is the hardest model to place, which is exactly why it ranks seventh on a SWE-bench-first list. It has no independently reported SWE-bench Verified score; the closest public figure is SWE-bench Pro at 56.8, a harder and different benchmark that cannot be compared directly to the Verified numbers above. Where it clearly excels is agentic terminal work: it leads the group on Terminal-Bench 2.0 at 77.3, with a respectable LiveCodeBench result of 71.0.

SWE-bench Verified: not independently reported (SWE-bench Pro: 56.8); LiveCodeBench: 71.0; Terminal-Bench 2.0: 77.3

Best for: Agentic terminal and tool-use workflows, where its category-leading Terminal-Bench 2.0 score is the most relevant benchmark.

How to read this: Ranked last on the primary axis only because no comparable SWE-bench Verified figure exists, not because it is a weak coder. The SWE-bench Pro 56.8 number is a different, tougher benchmark and is shown for context, not as a Verified-equivalent score.

Reading the Scores: Three Things That Change the Ranking

Benchmark tables look authoritative, but coding scores carry caveats that can flip a ranking. Keep these three in mind before you treat any number as settled.

Scores are scaffold-dependent

SWE-bench Verified and Terminal-Bench 2.0 run a model inside an agent harness. The same model can score noticeably higher or lower depending on the scaffold, for example Claude Code versus a generic Codex-style CLI. Read every figure as a model-plus-scaffold result, not pure model capability.
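
One way to act on this is to group models whose scores fall within an assumed noise margin and treat each group as a tie. The sketch below is a rough illustration under stated assumptions: `NOISE_MARGIN` is a hypothetical 0.8 points chosen for the example (the article reports no measured scaffold variance), and `tie_groups` is a helper invented here. The scores are the SWE-bench Verified figures from the rankings table, with GPT-5.3 Codex left out because it has no reported figure.

```python
# SWE-bench Verified figures from the rankings table.
SWE_VERIFIED: dict[str, float] = {
    "Claude Opus 4.6": 80.8,
    "Gemini 3.1 Pro": 80.6,
    "MiniMax M2.5": 80.2,
    "GLM-5": 77.8,
    "Kimi K2.5": 76.8,
    "Qwen 3.5-plus": 76.4,
}

# Hypothetical noise margin in percentage points; not a measured value.
NOISE_MARGIN = 0.8


def tie_groups(scores: dict[str, float], margin: float) -> list[list[str]]:
    """Group models, best first, so each group spans at most `margin` points."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups: list[list[tuple[str, float]]] = []
    for name, score in ordered:
        # Compare against the top score of the current group; round to one
        # decimal place to avoid floating-point artifacts in the subtraction.
        if groups and round(groups[-1][0][1] - score, 1) <= margin:
            groups[-1].append((name, score))
        else:
            groups.append([(name, score)])
    return [[name for name, _ in group] for group in groups]


for tier, names in enumerate(tie_groups(SWE_VERIFIED, NOISE_MARGIN), start=1):
    print(f"Tier {tier}: {', '.join(names)}")
```

With this margin the top three land in one tier, GLM-5 sits alone, and Kimi K2.5 and Qwen 3.5-plus share a tier. A different margin would draw the boundaries differently, which is the point: the grouping is only as good as the noise estimate behind it.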

Saturated benchmarks are excluded

HumanEval (above 93 percent across the field) and GSM8K (around 99 percent) are saturated and contamination-prone, so they no longer separate frontier models. We exclude them from ranking. If a comparison leans on those numbers, treat its conclusions with caution.

Some standings are still pending

Newer releases such as DeepSeek V4-Pro and Kimi K2.6 lack public, independently reported coding benchmarks at the time of writing. We do not rank what we cannot verify. The table will be revised when contamination-resistant scores are published, rather than filled in with estimates.

The practical takeaway: the top three on SWE-bench Verified are effectively tied, so pick based on your actual workload. Use the SWE-bench Verified axis for real-repository bug fixing, LiveCodeBench for algorithmic problems, and Terminal-Bench 2.0 for terminal-agent tasks. Then verify current standings yourself, because these are February to March 2026 snapshots and the leaderboards move.
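
Picking by workload amounts to looking up the leader on the matching axis. The sketch below is a minimal illustration using the table figures. `SCORES`, `WORKLOAD_AXIS`, and `leader` are names assumed for this example, and the `full_profile_only` flag is a hypothetical option that restricts the comparison to models with all three figures reported.

```python
from typing import Optional

# Scores copied from the rankings table; None marks a figure that is not reported.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "Claude Opus 4.6": {"swe": 80.8, "lcb": 76.0, "term": 65.4},
    "Gemini 3.1 Pro":  {"swe": 80.6, "lcb": 81.3, "term": 68.5},
    "MiniMax M2.5":    {"swe": 80.2, "lcb": 65.0, "term": 42.2},
    "GLM-5":           {"swe": 77.8, "lcb": 52.0, "term": 56.2},
    "Kimi K2.5":       {"swe": 76.8, "lcb": 85.0, "term": None},
    "Qwen 3.5-plus":   {"swe": 76.4, "lcb": 83.6, "term": 52.5},
    "GPT-5.3 Codex":   {"swe": None, "lcb": 71.0, "term": 77.3},
}

# Workload-to-axis mapping, following the takeaway paragraph.
WORKLOAD_AXIS = {
    "real-repository bug fixing": "swe",
    "algorithmic problems": "lcb",
    "terminal-agent tasks": "term",
}


def leader(axis: str, full_profile_only: bool = False) -> tuple[str, float]:
    """Return (model, score) for the highest reported score on one axis."""
    candidates: dict[str, float] = {}
    for name, scores in SCORES.items():
        score = scores[axis]
        if score is None:
            continue  # no figure reported on this axis
        if full_profile_only and any(v is None for v in scores.values()):
            continue  # skip models missing a figure on any axis
        candidates[name] = score
    best = max(candidates, key=lambda name: candidates[name])
    return best, candidates[best]


for workload, axis in WORKLOAD_AXIS.items():
    name, score = leader(axis)
    print(f"{workload}: {name} ({score})")
```

The flag explains the two LiveCodeBench leaders named in this article: across all models the top score is Kimi K2.5 at 85.0, while restricting to full-profile models gives Qwen 3.5-plus at 83.6.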

Frequently Asked Questions

What is the best LLM for coding in 2026?

By SWE-bench Verified, the most contamination-resistant agentic coding benchmark, Claude Opus 4.6 leads at 80.8 percent, with Gemini 3.1 Pro at 80.6 and MiniMax M2.5 at 80.2 close behind. That top three is effectively tied. The right model depends on your task: Qwen 3.5-plus leads LiveCodeBench (83.6) among full-profile models for competitive-style problems, and GPT-5.3 Codex leads Terminal-Bench 2.0 (77.3) for shell and tool-use work.

Why is SWE-bench Verified used as the primary ranking signal?

SWE-bench Verified draws from real, human-validated GitHub issues, so it measures whether a model can resolve genuine software bugs rather than recite memorized answers. It is more contamination-resistant than saturated benchmarks like HumanEval (above 93 percent) and GSM8K (around 99 percent), which we exclude. The main caveat is that SWE-bench scores depend on the agentic scaffold used to run the model.

What does scaffold-dependent mean for these scores?

Agentic coding benchmarks run a model inside a harness that plans, edits files, runs tests, and retries. The same model can score noticeably higher or lower depending on the scaffold, for example Claude Code versus a generic Codex-style CLI. Compare scores as model-plus-scaffold results, not pure model capability, and treat small gaps between top models as noise.

Why is GPT-5.3 Codex ranked seventh if it leads Terminal-Bench?

Our primary ranking axis is SWE-bench Verified, and GPT-5.3 Codex has no independently reported Verified score at the time of writing. Its available figure is SWE-bench Pro at 56.8, a harder and different benchmark that cannot be placed on the same axis as the others. It still leads Terminal-Bench 2.0 at 77.3, which is why it earns a spot on the list.

Where do DeepSeek and Kimi fit in this ranking?

Kimi K2.5 is included at rank five, and it actually leads LiveCodeBench at 85.0. Newer releases such as DeepSeek V4-Pro and Kimi K2.6 lack public, independently reported coding benchmarks at the time of writing, so we do not rank them. We will revise the table once contamination-resistant scores are published rather than estimate placements.

Ranked from SWE-bench Verified, LiveCodeBench, and independent trackers, data as of Feb-Mar 2026. Scores shift; verify current standings.

Claude and Claude Opus are trademarks of Anthropic. Gemini is a trademark of Google LLC. GPT and Codex are trademarks of OpenAI. MiniMax is a trademark of MiniMax. GLM is a trademark of Zhipu AI. Kimi is a trademark of Moonshot AI. Qwen is a trademark of Alibaba Group. SWE-bench, LiveCodeBench, and Terminal-Bench are the property of their respective maintainers. All other trademarks belong to their respective owners. Tech Jacks Solutions is not affiliated with or endorsed by any of the organizations mentioned.
