I Benchmarked 15 AI Models for Speed – Here's What Will Blow Your Mind
So I’m building this little chat app. Nothing fancy, just a side project that might turn into something real. And you know what? Speed was killing me. Not the code – the latency. Every time a user hit send and waited, I could see them bouncing. 100ms extra? Bye bye user. I needed to KNOW which models were actually fast, not just the ones with flashy marketing.
So I did what any indie hacker would do: I spent a weekend benchmarking 15 different models through Global API. I ran them from two regions, measured Time to First Token (TTFT) and tokens per second. And I’m sharing it all here, raw and unfiltered.
TL;DR: DeepSeek V4 Flash is the all-round beast (~60 tok/s, ~180ms TTFT). Step-3.5-Flash is the speed demon at ~80 tok/s. And if you're broke but need speed? Qwen3-8B – $0.01/M output and 70 tok/s. I'm not even joking.
How I Ran the Tests
I wanted real-world results, not some synthetic benchmark. So I used a simple prompt: "Explain recursion in 200 words." Streamed via SSE. Each model got 10 runs, averaged. Here's the setup:
| Parameter | Value |
|---|---|
| Test Date | May 20, 2026 |
| Test Regions | US East (Ohio) and Asia (Singapore) |
| Prompt | "Explain recursion in 200 words" |
| Output Tokens | ~150 per run |
| Iterations | 10, averaged |
| Streaming | Yes (SSE) |
| API | Global API (https://global-apis.com/v1) |
Here's the Python code I used to measure – feel free to steal it:
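```python
import json
import time

import requests


def benchmark_model(model, api_key):
    """Stream one completion; return (TTFT in ms, streamed chunks per second)."""
    url = "https://global-apis.com/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Explain recursion in 200 words."}],
        "stream": True,
    }
    start = time.perf_counter()  # monotonic clock, safer than time.time() for intervals
    response = requests.post(url, json=payload, headers=headers, stream=True, timeout=60)
    response.raise_for_status()  # fail loudly on a bad key or unknown model
    first_token_time = None
    tokens = []
    for line in response.iter_lines():
        # SSE data lines look like b"data: {...}"; the stream ends with b"data: [DONE]"
        if line.startswith(b"data: ") and line[6:] != b"[DONE]":
            if first_token_time is None:
                first_token_time = time.perf_counter()
            tokens.append(json.loads(line[6:]))
    end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError(f"No streamed data received for {model}")
    ttft = (first_token_time - start) * 1000  # ms
    elapsed = end - start
    # Each streamed chunk is counted as one token, which is an approximation
    tok_per_sec = len(tokens) / elapsed
    return ttft, tok_per_sec


if __name__ == "__main__":
    api_key = "your-api-key-here"
    # Example: DeepSeek V4 Flash
    ttft, tps = benchmark_model("deepseek-v4-flash", api_key)
    print(f"TTFT: {ttft:.0f}ms | Tokens/s: {tps:.1f}")
```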
Super simple. You can drop in any model name from Global API's list.
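The call above gives you a single sample, while the setup table says 10 iterations, averaged. Here's a minimal sketch of that averaging step. `average_runs` is a hypothetical helper name (not part of any library), and it assumes `run_once` returns a `(ttft_ms, tok_per_sec)` pair like `benchmark_model` does. The `fake_run` stand-in is only there so the sketch runs without an API key.

```python
from statistics import mean


def average_runs(run_once, model, runs=10):
    """Call run_once(model) `runs` times and average the (ttft_ms, tok_per_sec) pairs."""
    samples = [run_once(model) for _ in range(runs)]
    avg_ttft = mean(ttft for ttft, _ in samples)
    avg_tps = mean(tps for _, tps in samples)
    return avg_ttft, avg_tps


if __name__ == "__main__":
    # Stand-in for a real benchmark call (hypothetical fixed numbers)
    def fake_run(model):
        return 180.0, 60.0

    ttft, tps = average_runs(fake_run, "deepseek-v4-flash")
    print(f"TTFT: {ttft:.0f}ms | Tokens/s: {tps:.1f}")
```

To use it for real, pass something like `lambda m: benchmark_model(m, api_key)` as `run_once` and loop over whatever model names you want to compare.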
The Speed Ranking (Fastest to Slowest)
Honestly, I was shocked by some of these results. I’ve put them in a table because I’m a data nerd, but I’ll break it down after.
| Rank | Model | TTFT (ms) | Tokens/sec | Provider | $/M Output |
|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 2 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 3 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 4 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 5 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 6 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |
Note: Reasoning models (R1, K2.5) include internal thinking time before first token – that's why TTFT is high. But they're smart.
Speed by Price Tier
Because let's be real – as an indie hacker, I care about speed AND cost. Can't be spending $3 per million tokens on a hobby project.
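To put those per-million prices in hobby-project terms, here's a quick back-of-the-envelope sketch. The 10M output tokens per month workload is a made-up number for illustration, `output_cost_usd` is a hypothetical helper, and input-token costs are ignored because I only listed output prices.

```python
def output_cost_usd(output_tokens, price_per_million):
    """Cost in USD for a number of output tokens at a given $/M output price."""
    return output_tokens / 1_000_000 * price_per_million


monthly_tokens = 10_000_000  # hypothetical workload, not a measured figure
for model, price in [("Qwen3-8B", 0.01), ("DeepSeek V4 Flash", 0.25), ("Kimi K2.5", 3.00)]:
    print(f"{model}: ${output_cost_usd(monthly_tokens, price):.2f}/month")
```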
Ultra-Budget (up to $0.15/M)
| Model | tok/s | $/M |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |
Qwen3-8B is INSANE. 70 tok/s at literally ONE CENT per million output tokens. For simple stuff – summarization, classification, chatbots that don't need deep reasoning – it's unbeatable. Step-3.5-Flash is the speed king at 80 tok/s, and only $0.15/M. Worth it if you need low latency.
Budget ($0.15–$0.30/M)
| Model | tok/s | $/M |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |
This is the sweet spot. DeepSeek V4 Flash is my go-to. 60 tok/s, 180ms TTFT, and quality that rivals GPT-4o. For $0.25/M. I mean... just use it.
Mid-Range ($0.30–$0.80/M)
| Model | tok/s | $/M |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |
These are bigger models, so speed drops. DeepSeek V4 Pro at 30 tok/s is still decent, but you pay more for quality. Honestly, unless you need the extra reasoning, stick with V4 Flash.
Premium ($0.80+/M)
| Model | tok/s | $/M |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |
These are for when correctness > speed. Like legal document analysis or code generation where a mistake costs you. But at 20 tok/s? Your users will feel it. Use only if you have to.
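If you'd rather pick by budget than scroll through tiers, here's a small sketch that returns the fastest model at or under a price cap. The numbers are copied from the tier tables above. `fastest_under` is a hypothetical helper, and it only looks at tok/s, not answer quality, so treat it as a starting point rather than a verdict.

```python
# (model, tokens/sec, $ per million output tokens), copied from the tier tables
MODELS = [
    ("Qwen3-8B", 70, 0.01),
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("Hunyuan-TurboS", 55, 0.28),
    ("Qwen3-32B", 45, 0.28),
    ("Doubao-Seed-Lite", 50, 0.40),
    ("GLM-4-32B", 38, 0.56),
    ("Hunyuan-Turbo", 42, 0.57),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("MiniMax M2.5", 28, 1.15),
    ("GLM-5", 25, 1.92),
    ("Kimi K2.5", 20, 3.00),
]


def fastest_under(max_price, models=MODELS):
    """Return the (name, tok/s, price) entry with the highest tok/s at or below max_price, or None."""
    affordable = [m for m in models if m[2] <= max_price]
    return max(affordable, key=lambda m: m[1], default=None)


for cap in (0.10, 0.30, 1.00):
    print(f"Cap ${cap:.2f}/M -> {fastest_under(cap)}")
```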
Geographic Latency: Where You Run Matters
I tested from two regions to see the network impact. You'd be surprised how much server location matters.
| Model | US East TTFT | Asia TTFT | Diff (Asia vs. US East) |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |
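All four models reach first token 16-20% sooner from Singapore than from Ohio. Obvious, right? Every provider in this table is based in China. In absolute terms DeepSeek V4 Flash moves the least (30ms), so it feels about the same from either region, while Kimi K2.5 gains the most (120ms). If your users are in Asia, consider Qwen3 models or DeepSeek.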
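You can check the percentages straight from the table. A tiny sketch, with the numbers copied from above and `relative_change_pct` as a hypothetical helper name:

```python
# (model, US East TTFT in ms, Asia TTFT in ms), copied from the table above
TTFT_BY_REGION = [
    ("DeepSeek V4 Flash", 180, 150),
    ("Qwen3-32B", 250, 210),
    ("GLM-5", 500, 420),
    ("Kimi K2.5", 600, 480),
]


def relative_change_pct(us_ms, asia_ms):
    """Percent change in TTFT going from US East to Asia (negative means faster from Asia)."""
    return (asia_ms - us_ms) / us_ms * 100


for model, us, asia in TTFT_BY_REGION:
    print(f"{model}: {asia - us:+d}ms ({relative_change_pct(us, asia):+.1f}%)")
```

Every row lands between roughly -16% and -20%, so the relative gain is similar across models. It's the absolute gap that grows with the slower ones.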
Real-World Impact: TTFT Tells the Story
I built a little chart for myself (sharing it here because why not):
| TTFT | User Perception |
|---|---|
| < 200ms | "Instant" – users stay |
| 200–400ms | "Fast" – acceptable |
| 400–800ms | "Noticeable delay" – you lose some |
| 800ms+ | "Slow" – you lose everyone |
My recommendation: Keep TTFT under 400ms for interactive chat. Use DeepSeek V4 Flash (180ms) or Qwen3-8B (150ms) or Step-3.5-Flash (120ms). Your users will thank you.
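If you want to tag your own measurements with these buckets, here's a minimal sketch. `perceived_speed` is a hypothetical helper, and since the table doesn't say which bucket the exact boundaries (200, 400, 800ms) belong to, I'm assuming they fall into the slower one.

```python
def perceived_speed(ttft_ms):
    """Map a TTFT in milliseconds to the perception buckets from the table above."""
    # Assumption: boundary values (200, 400, 800) count toward the slower bucket
    if ttft_ms < 200:
        return "Instant"
    if ttft_ms < 400:
        return "Fast"
    if ttft_ms < 800:
        return "Noticeable delay"
    return "Slow"


for model, ttft in [("Step-3.5-Flash", 120), ("Qwen3-32B", 250), ("Kimi K2.5", 600), ("Qwen3.5-397B", 1200)]:
    print(f"{model}: {ttft}ms -> {perceived_speed(ttft)}")
```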
The Bottom Line
If you're an indie hacker like me, don't overthink this. For most use cases:
- Need speed + quality? DeepSeek V4 Flash. (60 tok/s, $0.25/M)
- Need raw speed on a budget? Qwen3-8B. (70 tok/s, $0.01/M)
- Need to flex with the fastest? Step-3.5-Flash. (80 tok/s, $0.15/M)
- Building a reasoning app? Accept slower TTFT – R1 or K2.5.
I tested everything through Global API – they give you a single endpoint (https://global-apis.com/v1) and you swap model names. Super easy. If you want to run these benchmarks yourself (and you should, because your use case might differ), grab an API key.