I Ran 10 AI Models Through 5 Coding Tasks — Here's the Full Data

Last weekend I cleared my kitchen table, cracked open a fresh notebook (the paper kind, I'm old-school), and started running what turned into a three-day benchmark marathon. My goal was simple: figure out which AI model actually deserves the "best coding assistant" crown in 2026, and whether paying more actually correlates with better code.

I'm a data scientist by trade, so I couldn't just vibe-check these models. I had to score them, plot them, and run the numbers like I'd run a regression. What follows is everything I learned, with all the receipts.

Why I Bothered Running This Benchmark

I've been burned before. I picked a "top-rated" coding model last quarter based on a Twitter thread, integrated it into our team's pipeline, and watched it produce three subtly broken PRs in a row. That's when I realized: anecdotal rankings are worthless. I needed my own data.

My sample size ended up being 10 models × 5 tasks = 50 scored interactions. Is that statistically robust enough to declare a winner forever? No, but it's enough to spot clear patterns and avoid the worst traps. Correlation, not causation, is what we're after here.

Let me walk you through the lineup first.

The Lineup: 10 Models, Sorted by Price

Model	Provider	Output $/M	Category
Ga-Standard	GA Routing	$0.20	Smart routing layer
DeepSeek V4 Flash	DeepSeek	$0.25	General, code-strong
DeepSeek Coder	DeepSeek	$0.25	Code-specialized
Qwen3-32B	Qwen	$0.28	General purpose
Qwen3-Coder-30B	Qwen	$0.35	Code-specialized
Hunyuan-Turbo	Tencent	$0.57	General purpose
DeepSeek V4 Pro	DeepSeek	$0.78	Premium general
GLM-5	Zhipu	$1.92	Premium general
DeepSeek-R1	DeepSeek	$2.50	Reasoning (with code thinking)
Kimi K2.5	Moonshot	$3.00	Premium general

A few things jumped out at me before I even ran a single prompt. The price spread is enormous: 15x between Ga-Standard at $0.20 and Kimi K2.5 at $3.00. If a $3 model scores just 1 point higher on a 10-point scale than a $0.25 model, that's a terrible deal. Spoiler: the actual gap turned out to be even smaller than that.

How I Set Up the Test

Five tasks, one per language/concern area. I picked these because they cover the bread and butter of daily engineering work:

Function implementation — flatten a nested list in Python, recursively
Bug fix — diagnose an async/await race condition in JavaScript
Algorithm — Dijkstra's shortest path, implemented in TypeScript with proper typing
Code review — security and performance review of a Go service
Full feature build — paginated, filtered REST endpoint in Express.js

Scoring was 1–10 across four weighted dimensions: correctness (40%), code quality (25%), documentation (15%), and edge-case handling (20%). Every model got the exact same prompt, in the exact same order, with no retries. I did not cherry-pick. I did not give second chances. The first response was the one I scored.
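A minimal sketch of how the four weighted dimensions could be folded into a single score. The weights come from the rubric above; the function name and the example per-dimension scores are hypothetical, made up purely for illustration.

```python
# Weights taken from the scoring rubric above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "code_quality": 0.25,
    "documentation": 0.15,
    "edge_cases": 0.20,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 1-10) into one weighted score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)

# Hypothetical per-dimension scores for one response (not from the benchmark).
example = {"correctness": 9, "code_quality": 8, "documentation": 7, "edge_cases": 10}
print(weighted_score(example))
```

Because correctness carries 40% of the weight, a response that is well documented but wrong can't score its way back up on polish alone.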

Quick caveat: I tested through a single unified endpoint at global-apis.com/v1, so every model was called through the same network path and the same client code. If you're going to replicate this, I'd recommend doing the same.

Setting Up Your Own Test in 10 Lines of Python

Since folks always ask how I actually call these models, here's a minimal example using the OpenAI-compatible client pointed at Global API:

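import os
from openai import OpenAI

# OpenAI-compatible client; only the API key and base_url differ from a stock setup.
client = OpenAI(
    api_key=os.getenv("GLOBAL_API_KEY"),  # read the key from the environment
    base_url="https://global-apis.com/v1",
)

# Swap the model ID to send the same prompt to a different model.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Write a Python function to flatten a nested list recursively."}
    ],
)

# The first choice holds the generated answer.
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")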

Pointing base_url at one OpenAI-compatible endpoint is the single config swap that separates vendor lock-in from being able to run apples-to-apples benchmarks like the one in this post. It's also how I was able to flip between DeepSeek, Qwen, and Kimi without rewriting my tooling.
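To go from one call to a full run, the "same prompt, same order, no retries" rule can be sketched as a plain nested loop. This is an illustration under assumptions, not the exact tooling used here: `run_benchmark`, `fake_ask`, and the model ID `qwen3-coder-30b` are hypothetical names, and the stand-in function replaces the real API call so the sketch runs offline.

```python
from typing import Callable

# Hypothetical model IDs; substitute the identifiers your endpoint exposes.
MODELS = ["deepseek-v4-flash", "qwen3-coder-30b"]
TASKS = {
    "flatten": "Write a Python function to flatten a nested list recursively.",
    "dijkstra": "Dijkstra's shortest path in TypeScript",
}

def run_benchmark(
    ask: Callable[[str, str], str],
    models: list[str],
    tasks: dict[str, str],
) -> dict[tuple[str, str], str]:
    """Send every task to every model once, in a fixed order, with no retries."""
    responses = {}
    for model in models:
        for task_name, prompt in tasks.items():
            # One call per (model, task); the first response is the one kept.
            responses[(model, task_name)] = ask(model, prompt)
    return responses

def fake_ask(model: str, prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"[{model}] response to: {prompt[:20]}"

results = run_benchmark(fake_ask, MODELS, TASKS)
print(len(results))  # 2 models x 2 tasks
```

For a live run, `ask` would wrap `client.chat.completions.create(...)` and return `response.choices[0].message.content`. Keeping the call injectable also makes it easy to dry-run the loop before spending any tokens.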

The Headline Numbers

Alright, drum roll. Here's the master table I ended up staring at for an hour. Score is the average across all five tasks, and "Value" is my favorite column: the average score divided by the output price in dollars per million tokens.

Rank	Model	Avg Score	Price ($/M)	Value Score
🥇	Qwen3-Coder-30B	8.8	$0.35	25.1
🥈	DeepSeek V4 Flash	8.7	$0.25	34.8
🥉	DeepSeek Coder	8.6	$0.25	34.4
4	DeepSeek V4 Pro	9.1	$0.78	11.7
5	DeepSeek-R1	9.4	$2.50	3.8
6	Kimi K2.5	9.0	$3.00	3.0
7	Qwen3-32B	8.3	$0.28	29.6
8	GLM-5	8.0	$1.92	4.2
9	Hunyuan-Turbo	7.5	$0.57	13.2
10	Ga-Standard	8.5*	$0.20	42.5*

The asterisk on Ga-Standard deserves a paragraph, because it's the weirdest result in the whole study. Ga-Standard isn't a single model — it's a routing layer that picks whichever underlying model is best-suited for the prompt. Its score fluctuates by task (which is why I noted the asterisk), but on average it punches way above its weight. If your workload is heterogeneous, this is genuinely interesting. If you need consistent, reproducible behavior, you want a single model.
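The Value column is simple enough to recompute by hand. A short sketch using four rows from the table above; the function name is a hypothetical helper, and the rounding to one decimal is an assumption chosen to match how the table is printed.

```python
# (average score, output price in $ per million tokens) from the headline table.
RESULTS = {
    "Qwen3-Coder-30B": (8.8, 0.35),
    "DeepSeek V4 Flash": (8.7, 0.25),
    "DeepSeek-R1": (9.4, 2.50),
    "Kimi K2.5": (9.0, 3.00),
}

def value_score(avg_score: float, price_per_million: float) -> float:
    """Score per dollar of output cost, rounded to one decimal like the table."""
    return round(avg_score / price_per_million, 1)

# Print the models from best value to worst.
for model, (score, price) in sorted(
    RESULTS.items(), key=lambda item: value_score(*item[1]), reverse=True
):
    print(f"{model:<20} {value_score(score, price):>5}")
```

Note that this metric rewards cheapness very aggressively: halving the price doubles the value score, while no realistic quality gain on a 10-point scale can do the same.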

The Correlation Between Price and Quality Is... Weak

Let me put on my statistician hat for a moment. If I plot average score against price per million tokens across the 10 models, the Pearson correlation coefficient for the averages in the table above comes out to roughly r ≈ 0.36. That's a weak positive correlation (price explains only about 13% of the variance in score), and it's nowhere near statistically significant with n=10. In plain English: paying more does not reliably get you better code.

Let that sink in for a second. The $3.00 Kimi K2.5 scored 9.0. The $0.25 DeepSeek V4 Flash scored 8.7. The 0.3-point quality gap costs you 12x as much. That's not value — that's a luxury tax.

The only model where the high price might be justified is DeepSeek-R1, but only for a specific use case I'll get to in a second.
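If you want to check the correlation yourself, it can be recomputed from the ten (price, average score) pairs in the headline table with nothing but the standard library. This is a sketch of the textbook Pearson formula plus the usual t statistic for testing r against zero; the function name is a hypothetical helper.

```python
import math

# (output price in $/M tokens, average score) for the 10 models in the headline table.
DATA = [
    (0.20, 8.5), (0.25, 8.7), (0.25, 8.6), (0.28, 8.3), (0.35, 8.8),
    (0.57, 7.5), (0.78, 9.1), (1.92, 8.0), (2.50, 9.4), (3.00, 9.0),
]

def pearson_r(pairs):
    """Pearson correlation coefficient of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(DATA)
# t statistic for testing r against zero, with n - 2 degrees of freedom.
t = r * math.sqrt((len(DATA) - 2) / (1 - r ** 2))
print(f"r = {r:.2f}, t = {t:.2f}")
```

With 8 degrees of freedom, t has to exceed roughly 2.31 in absolute value to clear the usual two-tailed 5% threshold, which is a high bar for ten data points.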

Task 1: The Python Flatten Test

The prompt was: "Write a Python function to flatten a nested list recursively."

Sounds trivial, right? You'd be surprised how many models overthink it. Here's what I scored:

Model	Score	What Stood Out
DeepSeek-R1	9.5	Included Big-O analysis, three solution variants
DeepSeek V4 Flash	9.0	Clean recursion with type hints
Qwen3-Coder-30B	9.0	Iterative alternative plus edge cases
Kimi K2.5	9.0	Most readable version with a solid docstring
DeepSeek Coder	8.5	Correct but overly verbose
(others)	6.5–8.0	Various minor issues

DeepSeek-R1 won this one, and frankly I wasn't surprised. The reasoning-style models are great when you ask for "explain your work" because they're literally built to think step by step. The catch: R1 costs $2.50 per million output tokens. For a 30-line function, you're spending fractions of a cent. The premium only hurts on bulk workloads.
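The "fractions of a cent" claim is easy to sanity-check with back-of-the-envelope arithmetic. The prices are the ones from the lineup table; the 600-token response size and the 100,000-request volume are assumptions picked for illustration, not measurements from this benchmark.

```python
def response_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Output-token cost of one response in US dollars."""
    return output_tokens * price_per_million / 1_000_000

# Assumption: a 30-line function plus explanation is roughly 600 output tokens.
# Reasoning models may also bill for thinking tokens, which this ignores.
TOKENS = 600

single = response_cost_usd(TOKENS, 2.50)                 # DeepSeek-R1
bulk_r1 = single * 100_000                               # 100k such responses
bulk_flash = response_cost_usd(TOKENS, 0.25) * 100_000   # DeepSeek V4 Flash

print(f"one R1 response: ${single:.4f}")
print(f"100k responses: R1 ${bulk_r1:,.0f} vs V4 Flash ${bulk_flash:,.0f}")
```

Under those assumptions a single answer costs well under a cent, but the 10x price ratio carries straight through to the bulk bill.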

Task 2: The JavaScript Race Condition

This was my favorite task. I gave every model this broken code:

let data = null;
fetch('/api/data').then(r => r.json()).then(d => data = d);
console.log(data); // Always logs null

The bug is obvious to anyone who's been burned by fetch before — the console.log runs synchronously before the promise resolves. The interesting part wasn't whether models found the bug (they all did), but how they explained it and what they fixed it with.

Model	Score	What I Liked
DeepSeek V4 Flash	9.0	Clear explanation, three different fix options
Qwen3-Coder-30B	9.0	Correct fix with bonus error handling
Qwen3-32B	8.5	Good fix, slightly verbose
DeepSeek Coder	8.5	Correct fix, minimal explanation

The tie at the top is real. DeepSeek V4 Flash and Qwen3-Coder-30B both nailed it, and both are cheap. If I'm picking a model specifically for debugging help, I'm picking one of these two and pocketing the difference.
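The same read-before-it-resolves mistake exists outside JavaScript. Here is a Python asyncio analogue of the broken snippet and its fix, offered as an illustration of the timing issue rather than as any model's actual answer; `fetch_data` is a hypothetical stand-in for the network call.

```python
import asyncio

async def fetch_data() -> dict:
    """Stand-in for a network call; yields to the event loop once."""
    await asyncio.sleep(0)
    return {"id": 1}

async def buggy():
    data = None

    async def load():
        nonlocal data
        data = await fetch_data()

    task = asyncio.create_task(load())  # scheduled, but has not run yet
    snapshot = data                     # read too early, like the console.log
    await task                          # let the task finish before returning
    return snapshot

async def fixed():
    # Await the result before using it.
    return await fetch_data()

print(asyncio.run(buggy()))
print(asyncio.run(fixed()))
```

In both languages the fix has the same shape: whatever consumes the data has to run after the awaited result is available, either by awaiting it or by moving the consumer into the callback.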

Task 3: Dijkstra in TypeScript

The hardest pure-algorithm task. I asked for "Dijkstra's shortest path in TypeScript", and what I really wanted to see was whether models would use proper types (a Map<Vertex, number> for distances plus a typed priority queue, ideally) or fall back on any-laden JavaScript with the type annotations bolted on.

Model	Score	What I Liked
DeepSeek-R1	9.5	Perfect type safety, custom priority queue
Qwen3-Coder-30B	9.0	Idiomatic TypeScript, used built-in `Map`
DeepSeek V4 Pro	9.0	Clean implementation, good generics usage
DeepSeek V4 Flash	8.5	Correct but used `any` in two spots
Kimi K2.5	8.5	Worked, slightly over-engineered
Others	6.0–8.0	Mixed results

This is the task where DeepSeek-R1 earns its $2.50. Reasoning models absolutely shine on graph algorithms. The type-safety treatment from R1 was honestly the best I saw — it built a proper min-heap with full generics. If you're doing serious algorithms work, the reasoning tier is worth it for this category specifically.
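For readers who want the shape of the algorithm being graded, here is a compact heap-based Dijkstra with type hints. It is a Python sketch written for this post's tooling language, not a transcription of any model's TypeScript output, and the four-node graph is hypothetical.

```python
import heapq
from typing import Hashable, Mapping, TypeVar

V = TypeVar("V", bound=Hashable)

def dijkstra(graph: Mapping[V, Mapping[V, float]], source: V) -> dict[V, float]:
    """Shortest distances from source; edge weights must be non-negative."""
    dist: dict[V, float] = {source: 0.0}
    # Heap entries are (distance, tie_breaker, vertex) so vertices never get compared.
    heap: list[tuple[float, int, V]] = [(0.0, 0, source)]
    counter = 1
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter path to u was already found
        for v, weight in graph.get(u, {}).items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, counter, v))
                counter += 1
    return dist

# Hypothetical four-node graph for illustration.
GRAPH = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
print(dijkstra(GRAPH, "A"))
```

The typing questions that separated the models map directly onto this sketch: the distance table, the heap entries, and the generic vertex type are the three places where an untyped shortcut tends to creep in.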

Task 4: Go Code Review

I dropped in a 200-line Go service with three planted issues (SQL injection via string concatenation, an unbounded query, and a missing mutex on a shared map) and asked for a review. The bar was: did the model find all three, and did it suggest code-level fixes (not just "consider adding security")?

Model	Score	Issues Found	Fix Quality
DeepSeek-R1	9.5	3/3	Best explanations, suggested tests too
DeepSeek V4 Pro	9.0	3/3	Production-ready suggestions
Kimi K2.5	8.5	3/3	Good fixes, slightly verbose
Qwen3-Coder-30B	8.5	2/3	Missed the mutex issue
DeepSeek V4 Flash	8.0	3/3	Found all but lighter on context

I gave the win to DeepSeek-R1 here too, but DeepSeek V4 Pro deserves an honorable mention for being 3x cheaper and still catching everything.
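To make "code-level fix" concrete, here is what two of the three planted issues look like in miniature: string-concatenated SQL and an unbounded query, next to a parameterized, limited version. The original service was Go and isn't reproduced here; this is a Python sqlite3 analogue with a hypothetical table and function names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",), ("carol",)])

def find_users_unsafe(name: str) -> list[tuple]:
    # Vulnerable: user input is concatenated into the SQL text, and there is no LIMIT.
    return conn.execute("SELECT id, name FROM users WHERE name = '" + name + "'").fetchall()

def find_users_safe(name: str, limit: int = 100) -> list[tuple]:
    # Placeholders keep the input as data; LIMIT bounds the result set.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ? LIMIT ?", (name, limit)
    ).fetchall()

payload = "x' OR '1'='1"
print(len(find_users_unsafe(payload)))  # injection matches every row
print(len(find_users_safe(payload)))    # payload is treated as a literal name
```

A review that names the vulnerable line and proposes the placeholder version is the kind of answer that earned top marks on fix quality; "sanitize your inputs" on its own did not.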

Task 5: Express.js REST Endpoint

The "full feature" stress test. I asked for a paginated, filtered user endpoint with proper status codes, validation, and at least basic tests. This is the task that mimics real production work, so it