I Ran 10 AI Models Through 5 Coding Tasks — Here's the Full Data
Last weekend I cleared my kitchen table, cracked open a fresh notebook (the paper kind, I'm old-school), and started running what turned into a three-day benchmark marathon. My goal was simple: figure out which AI model actually deserves the "best coding assistant" crown in 2026, and whether paying more actually correlates with better code.
I'm a data scientist by trade, so I couldn't just vibe-check these models. I had to score them, plot them, and run the numbers like I'd run a regression. What follows is everything I learned, with all the receipts.
Why I Bothered Running This Benchmark
I've been burned before. I picked a "top-rated" coding model last quarter based on a Twitter thread, integrated it into our team's pipeline, and watched it produce three subtly broken PRs in a row. That's when I realized: anecdotal rankings are worthless. I needed my own data.
My sample size ended up being 10 models × 5 tasks = 50 scored interactions. Is that statistically robust enough to declare a winner forever? No, but it's enough to spot clear patterns and avoid the worst traps. Correlation, not causation, is what we're after here.
Let me walk you through the lineup first.
The Lineup: 10 Models, Sorted by Price
| Model | Provider | Output $/M | Category |
|---|---|---|---|
| Ga-Standard | GA Routing | $0.20 | Smart routing layer |
| DeepSeek V4 Flash | DeepSeek | $0.25 | General, code-strong |
| DeepSeek Coder | DeepSeek | $0.25 | Code-specialized |
| Qwen3-32B | Qwen | $0.28 | General purpose |
| Qwen3-Coder-30B | Qwen | $0.35 | Code-specialized |
| Hunyuan-Turbo | Tencent | $0.57 | General purpose |
| DeepSeek V4 Pro | DeepSeek | $0.78 | Premium general |
| GLM-5 | Zhipu | $1.92 | Premium general |
| DeepSeek-R1 | DeepSeek | $2.50 | Reasoning (with code thinking) |
| Kimi K2.5 | Moonshot | $3.00 | Premium general |
A few things jumped out at me before I even ran a single prompt. The price spread is enormous: 15x between Ga-Standard at $0.20 and Kimi K2.5 at $3.00. If a $3 model scores just 1 point higher on a 10-point scale than a $0.25 model, that's a terrible deal. Spoiler: the actual gap turned out to be even smaller than that.
How I Set Up the Test
Five tasks, one per language/concern area. I picked these because they cover the bread and butter of daily engineering work:
- Function implementation — flatten a nested list in Python, recursively
- Bug fix — diagnose an async/await race condition in JavaScript
- Algorithm — Dijkstra's shortest path, implemented in TypeScript with proper typing
- Code review — security and performance review of a Go service
- Full feature build — paginated, filtered REST endpoint in Express.js
Scoring was 1–10 across four weighted dimensions: correctness (40%), code quality (25%), documentation (15%), and edge-case handling (20%). Every model got the exact same prompt, in the exact same order, with no retries. I did not cherry-pick. I did not give second chances. The first response was the one I scored.
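If you want to reproduce the scoring math, here's a rough sketch of how the four weighted dimensions combine into one number. The weights come straight from the rubric above; the function name, dictionary keys, and example ratings are placeholders of mine, not data from the benchmark.

```python
from typing import Dict

# Weights from the scoring rubric; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "correctness": 0.40,
    "code_quality": 0.25,
    "documentation": 0.15,
    "edge_cases": 0.20,
}


def weighted_score(ratings: Dict[str, float]) -> float:
    """Combine per-dimension ratings (1-10) into one weighted score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return round(sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS), 2)


# Hypothetical ratings for one response, made up for illustration.
example = {"correctness": 9, "code_quality": 8, "documentation": 7, "edge_cases": 8}
print(weighted_score(example))  # 8.25
```

Because correctness carries 40% of the weight, a one-point swing there moves the total by 0.4, while the same swing in documentation moves it by only 0.15.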
Quick caveat: I tested through a single unified endpoint at global-apis.com/v1, so every model went through the same client code and network path. That keeps latency and tooling differences from muddying the comparison. If you're going to replicate this, I'd recommend doing the same.
Setting Up Your Own Test in a Few Lines of Python
Since folks always ask how I actually call these models, here's a minimal example using the OpenAI-compatible client pointed at Global API:
```python
import os

from openai import OpenAI

# OpenAI-compatible client pointed at the unified endpoint.
client = OpenAI(
    api_key=os.getenv("GLOBAL_API_KEY"),  # read the key from the environment
    base_url="https://global-apis.com/v1",
)

# Swap the model ID to benchmark a different model with the same code.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Write a Python function to flatten a nested list recursively."}
    ],
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```
That single config swap is the difference between vendor lock-in and being able to run apples-to-apples benchmarks like the one in this post. It's also how I was able to flip between DeepSeek, Qwen, and Kimi without rewriting my tooling.
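To go from one call to a full sweep, a loop like the one below is roughly all it takes. Treat it as a sketch: `run_prompt_across_models` is a helper name I made up, it assumes a client object with the same `chat.completions.create` interface used above, and it mirrors the "one prompt, one attempt, no retries" rule from the setup.

```python
import time


def run_prompt_across_models(client, models, prompt):
    """Send one prompt to each model exactly once and collect the replies."""
    results = []
    for model in models:
        started = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        results.append(
            {
                "model": model,
                "answer": response.choices[0].message.content,
                "total_tokens": response.usage.total_tokens,
                # Wall-clock time for the call, useful as a rough latency note.
                "seconds": round(time.perf_counter() - started, 2),
            }
        )
    return results
```

With the `client` from the snippet above, a call might look like `run_prompt_across_models(client, ["deepseek-v4-flash"], prompt)`. Any other model IDs you add to the list need to be checked against the provider's model list, since the exact naming is not something this post documents.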
The Headline Numbers
Alright, drum roll. Here's the master table I ended up staring at for an hour. Score is the average across all five tasks, and "Value" is my favorite column: average score divided by output price in dollars per million tokens. Note that the ranking is not a straight sort on either column.
| Rank | Model | Avg Score | Price ($/M) | Value Score |
|---|---|---|---|---|
| 🥇 | Qwen3-Coder-30B | 8.8 | $0.35 | 25.1 |
| 🥈 | DeepSeek V4 Flash | 8.7 | $0.25 | 34.8 |
| 🥉 | DeepSeek Coder | 8.6 | $0.25 | 34.4 |
| 4 | DeepSeek V4 Pro | 9.1 | $0.78 | 11.7 |
| 5 | DeepSeek-R1 | 9.4 | $2.50 | 3.8 |
| 6 | Kimi K2.5 | 9.0 | $3.00 | 3.0 |
| 7 | Qwen3-32B | 8.3 | $0.28 | 29.6 |
| 8 | GLM-5 | 8.0 | $1.92 | 4.2 |
| 9 | Hunyuan-Turbo | 7.5 | $0.57 | 13.2 |
| 10 | Ga-Standard | 8.5* | $0.20 | 42.5* |
The asterisk on Ga-Standard deserves a paragraph, because it's the weirdest result in the whole study. Ga-Standard isn't a single model — it's a routing layer that picks whichever underlying model is best-suited for the prompt. Its score fluctuates by task (which is why I noted the asterisk), but on average it punches way above its weight. If your workload is heterogeneous, this is genuinely interesting. If you need consistent, reproducible behavior, you want a single model.
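The Value column is simple arithmetic, and you can rebuild it from the two columns next to it. Here's a small sketch that does so using the averages and prices from the master table; the function and variable names are mine.

```python
# (average score, output price in $ per million tokens) from the master table.
RESULTS = {
    "Qwen3-Coder-30B": (8.8, 0.35),
    "DeepSeek V4 Flash": (8.7, 0.25),
    "DeepSeek Coder": (8.6, 0.25),
    "DeepSeek V4 Pro": (9.1, 0.78),
    "DeepSeek-R1": (9.4, 2.50),
    "Kimi K2.5": (9.0, 3.00),
    "Qwen3-32B": (8.3, 0.28),
    "GLM-5": (8.0, 1.92),
    "Hunyuan-Turbo": (7.5, 0.57),
    "Ga-Standard": (8.5, 0.20),
}


def value_score(avg_score: float, price_per_million: float) -> float:
    """Score per dollar of output cost, rounded to one decimal place."""
    return round(avg_score / price_per_million, 1)


# Print the models from best to worst value.
ranked = sorted(RESULTS, key=lambda m: RESULTS[m][0] / RESULTS[m][1], reverse=True)
for model in ranked:
    score, price = RESULTS[model]
    print(f"{model:<20} {value_score(score, price):>5}")
```

Sorting on this metric alone rewards cheap models heavily, so it works best as a tiebreaker among models that already clear your quality bar.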
The Correlation Between Price and Quality Is... Weak
Let me put on my statistician hat for a moment. If I plot average score against price per million tokens across the 10 models, the Pearson correlation coefficient comes out to roughly r ≈ 0.36 on the averages in the table above. That's a weak positive correlation, and nowhere near statistically significant with n=10. In plain English: paying more does not reliably get you better code.
Let that sink in for a second. The $3.00 Kimi K2.5 scored 9.0. The $0.25 DeepSeek V4 Flash scored 8.7. The 0.3-point quality gap costs you 12x as much. That's not value — that's a luxury tax.
The only model where the high price might be justified is DeepSeek-R1, but only for a specific use case I'll get to in a second.
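If you'd like to check the price-versus-score correlation yourself, a plain-Python Pearson calculation is enough at this sample size. The sketch below uses the rounded averages and prices from the master table, in table order; results on rounded values can differ a little from a figure computed on raw per-task scores, and dropping the routing layer from the list changes it too.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Output price ($/M tokens) and average score, in master-table order.
prices = [0.35, 0.25, 0.25, 0.78, 2.50, 3.00, 0.28, 1.92, 0.57, 0.20]
scores = [8.8, 8.7, 8.6, 9.1, 9.4, 9.0, 8.3, 8.0, 7.5, 8.5]

print(f"r = {pearson_r(prices, scores):.2f}")
```

With only ten points, a single outlier can swing the coefficient noticeably, so treat it as a descriptive summary rather than a test result.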
Task 1: The Python Flatten Test
The prompt was: "Write a Python function to flatten a nested list recursively."
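For reference, a minimal baseline answer might look like the sketch below. It illustrates what a passing response needs to do; it is not any model's actual output.

```python
from typing import Any, List


def flatten(nested: List[Any]) -> List[Any]:
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat: List[Any] = []
    for item in nested:
        if isinstance(item, list):
            # Recurse into sublists and splice their items in order.
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


print(flatten([1, [2, [3, 4]], [], [[5]]]))  # [1, 2, 3, 4, 5]
```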
Sounds trivial, right? You'd be surprised how many models overthink it. Here's what I scored:
| Model | Score | What Stood Out |
|---|---|---|
| DeepSeek-R1 | 9.5 | Included Big-O analysis, three solution variants |
| DeepSeek V4 Flash | 9.0 | Clean recursion with type hints |
| Qwen3-Coder-30B | 9.0 | Iterative alternative plus edge cases |
| Kimi K2.5 | 9.0 | Most readable version with a solid docstring |
| DeepSeek Coder | 8.5 | Correct but overly verbose |
| (others) | 6.5–8.0 | Various minor issues |
DeepSeek-R1 won this one, and frankly I wasn't surprised. The reasoning-style models are great when you ask for "explain your work" because they're literally built to think step by step. The catch: R1 costs $2.50 per million output tokens. For a 30-line function, you're spending fractions of a cent. The premium only hurts on bulk workloads.
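To put numbers on "fractions of a cent", here's a back-of-the-envelope sketch. The 400-token answer length is an assumption of mine for a roughly 30-line function with a short explanation, and the calculation ignores input tokens and any extra thinking tokens a reasoning model may bill as output.

```python
def output_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Cost in dollars for a given number of output tokens."""
    return output_tokens * price_per_million / 1_000_000


TOKENS_PER_ANSWER = 400  # assumed answer length, not measured
R1_PRICE = 2.50          # $/M output tokens, from the lineup table
FLASH_PRICE = 0.25       # $/M output tokens, from the lineup table

single_r1 = output_cost_usd(TOKENS_PER_ANSWER, R1_PRICE)
bulk_r1 = output_cost_usd(TOKENS_PER_ANSWER * 1_000_000, R1_PRICE)
bulk_flash = output_cost_usd(TOKENS_PER_ANSWER * 1_000_000, FLASH_PRICE)

print(f"One R1 answer:        ${single_r1:.4f}")
print(f"1M R1 answers:        ${bulk_r1:,.0f}")
print(f"1M V4 Flash answers:  ${bulk_flash:,.0f}")
```

Under those assumptions a single answer costs a tenth of a cent, but a million of them is $1,000 on R1 versus $100 on V4 Flash, which is where the premium starts to matter.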
Task 2: The JavaScript Race Condition
This was my favorite task. I gave every model this broken code:
```javascript
let data = null;
fetch('/api/data').then(r => r.json()).then(d => data = d);
console.log(data); // Always logs null
```
The bug is obvious to anyone who's been burned by fetch before — the console.log runs synchronously before the promise resolves. The interesting part wasn't whether models found the bug (they all did), but how they explained it and what they fixed it with.
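The same ordering mistake shows up in any async runtime. Here's a rough Python `asyncio` analogue of the bug and one fix, to keep all the extra examples in this post in a single language; `fetch_data` and its payload are stand-ins I made up, and none of this is a model's answer.

```python
import asyncio


async def fetch_data() -> dict:
    """Stand-in for the network call; the payload is invented."""
    await asyncio.sleep(0.01)
    return {"items": [1, 2, 3]}


async def buggy():
    data = None

    async def load() -> None:
        nonlocal data
        data = await fetch_data()

    task = asyncio.create_task(load())  # scheduled, but not finished yet
    seen = data                         # read immediately: still None
    await task                          # let the task finish before returning
    return seen


async def fixed() -> dict:
    # Wait for the result before using it.
    return await fetch_data()


print(asyncio.run(buggy()))  # None
print(asyncio.run(fixed()))  # {'items': [1, 2, 3]}
```

In the JavaScript original, the equivalent fix is to move the `console.log` inside the final `.then` callback, or to `await` the fetch inside an `async` function.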
| Model | Score | What I Liked |
|---|---|---|
| DeepSeek V4 Flash | 9.0 | Clear explanation, three different fix options |
| Qwen3-Coder-30B | 9.0 | Correct fix with bonus error handling |
| Qwen3-32B | 8.5 | Good fix, slightly verbose |
| DeepSeek Coder | 8.5 | Correct fix, minimal explanation |
The tie at the top is real. DeepSeek V4 Flash and Qwen3-Coder-30B both nailed it, and both are cheap. If I'm picking a model specifically for debugging help, I'm picking one of these two and pocketing the difference.
Task 3: Dijkstra in TypeScript
The hardest pure-algorithm task. I asked for "Dijkstra's shortest path in TypeScript", and what I really wanted to see was whether models would use proper types (a Map<Vertex, number> for distances plus a typed priority queue, ideally) or fall back on any-laden JavaScript with the type annotations bolted on.
| Model | Score | What I Liked |
|---|---|---|
| DeepSeek-R1 | 9.5 | Perfect type safety, custom priority queue |
| Qwen3-Coder-30B | 9.0 | Idiomatic TypeScript, used built-in Map |
| DeepSeek V4 Pro | 9.0 | Clean implementation, good generics usage |
| DeepSeek V4 Flash | 8.5 | Correct but used any in two spots |
| Kimi K2.5 | 8.5 | Worked, slightly over-engineered |
| Others | 6.0–8.0 | Mixed results |
This is the task where DeepSeek-R1 earns its $2.50. Reasoning models absolutely shine on graph algorithms. The type-safety treatment from R1 was honestly the best I saw — it built a proper min-heap with full generics. If you're doing serious algorithms work, the reasoning tier is worth it for this category specifically.
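For readers who want the shape of the algorithm without the TypeScript, here's a minimal Python sketch with type hints and a heap-based priority queue. It is not R1's output, and the tiny graph is made up for illustration. It assumes non-negative edge weights, which Dijkstra's algorithm requires.

```python
import heapq
from typing import Dict, List, Tuple

# Adjacency list: node -> list of (neighbor, edge weight).
Graph = Dict[str, List[Tuple[str, float]]]


def dijkstra(graph: Graph, source: str) -> Dict[str, float]:
    """Return the shortest distance from source to every reachable node."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical example graph.
graph: Graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
print(dijkstra(graph, "a"))  # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}
```

The sketch pushes duplicate heap entries and skips stale ones on pop instead of implementing a decrease-key operation, which is the usual trade-off when the standard library heap is all you have.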
Task 4: Go Code Review
I dropped a 200-line Go service with three planted issues (SQL injection via string concatenation, an unbounded query, and a missing mutex on a shared map) and asked for a security and performance review. The bar was: did the model find all three issues, and did it suggest code-level fixes (not just "consider adding security")?
| Model | Score | Issues Found | Fix Quality |
|---|---|---|---|
| DeepSeek-R1 | 9.5 | 3/3 | Best explanations, suggested tests too |
| DeepSeek V4 Pro | 9.0 | 3/3 | Production-ready suggestions |
| Kimi K2.5 | 8.5 | 3/3 | Good fixes, slightly verbose |
| Qwen3-Coder-30B | 8.5 | 2/3 | Missed the mutex issue |
| DeepSeek V4 Flash | 8.0 | 3/3 | Found all but lighter on context |
I gave the win to DeepSeek-R1 here too, but DeepSeek V4 Pro deserves an honorable mention for being roughly 3x cheaper ($0.78 vs. $2.50) and still catching everything.
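The three fixes translate to most stacks. Below is a rough Python sketch of the same ideas: a parameterized query instead of string concatenation, a hard cap on returned rows, and a lock around shared state. It is not the Go service from the test and not any model's output; the table name, the `MAX_ROWS` cap, and the `SafeCache` class are all hypothetical.

```python
import sqlite3
import threading

MAX_ROWS = 100  # hypothetical upper bound on rows per query


def find_users(conn: sqlite3.Connection, name: str, limit: int = MAX_ROWS):
    """Look up users by name with bound parameters and a capped row count."""
    bounded = max(1, min(int(limit), MAX_ROWS))
    cursor = conn.execute(
        # Placeholders keep user input out of the SQL text itself.
        "SELECT id, name FROM users WHERE name = ? LIMIT ?",
        (name, bounded),
    )
    return cursor.fetchall()


class SafeCache:
    """Dictionary wrapper that serializes access with a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: dict = {}

    def put(self, key, value) -> None:
        with self._lock:
            self._items[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._items.get(key, default)
```

In Go the equivalents would be placeholder arguments in `database/sql` queries, a `LIMIT` clause, and a `sync.Mutex` or `sync.RWMutex` guarding the map.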
Task 5: Express.js REST Endpoint
The "full feature" stress test. I asked for a paginated, filtered user endpoint with proper status codes, validation, and at least basic tests. This is the task that mimics real production work, so it












