I Benchmarked 15 AI Models for Speed – Here's What Will Blow Your Mind

So I’m building this little chat app. Nothing fancy, just a side project that might turn into something real. And you know what? Speed was killing me. Not the code – the latency. Every time a user hit send and waited, I could see them bouncing. 100ms extra? Bye bye user. I needed to KNOW which models were actually fast, not just the ones with flashy marketing.

So I did what any indie hacker would do: I spent a weekend benchmarking 15 different models through Global API. I ran them from two regions, measured Time to First Token (TTFT) and tokens per second. And I’m sharing it all here, raw and unfiltered.

TL;DR: DeepSeek V4 Flash is the all-round beast (~60 tok/s, ~180ms TTFT). Step-3.5-Flash is the speed demon at ~80 tok/s. And if you're broke but need speed? Qwen3-8B – $0.01/M output and 70 tok/s. I'm not even joking.

How I Ran the Tests

I wanted real-world results, not some synthetic benchmark. So I used a simple prompt: "Explain recursion in 200 words." Streamed via SSE. Each model got 10 runs, averaged. Here's the setup:

Parameter	Value
Test Date	May 20, 2026
Test Regions	US East (Ohio) and Asia (Singapore)
Prompt	"Explain recursion in 200 words"
Output Tokens	~150 per run
Iterations	10, averaged
Streaming	Yes (SSE)
API	Global API (`https://global-apis.com/v1`)

Here's the Python code I used to measure – feel free to steal it:

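import json
import time

import requests

def benchmark_model(model, api_key):
    """Stream one completion and return (TTFT in ms, tokens per second)."""
    url = "https://global-apis.com/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Explain recursion in 200 words."}],
        "stream": True
    }

    start = time.perf_counter()
    response = requests.post(url, json=payload, headers=headers, stream=True, timeout=60)
    response.raise_for_status()  # fail loudly on a bad key or model name
    first_token_time = None
    chunks = 0

    for line in response.iter_lines():
        # SSE data lines look like b"data: {...}"; the stream ends with b"data: [DONE]"
        if not line.startswith(b"data: ") or line[6:] == b"[DONE]":
            continue
        # Assumes OpenAI-style chunks: choices[0].delta.content holds the text
        choices = json.loads(line[6:]).get("choices") or []
        if not choices or not (choices[0].get("delta") or {}).get("content"):
            continue  # skip chunks with no text, e.g. the initial role-only delta
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1

    end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError(f"{model} returned no content")
    ttft = (first_token_time - start) * 1000  # ms
    elapsed = end - start  # whole request, including the wait for the first token
    # Each streamed chunk is counted as one token, which is an approximation
    tok_per_sec = chunks / elapsed
    return ttft, tok_per_sec

api_key = "your-api-key-here"
# Example: DeepSeek V4 Flash
ttft, tps = benchmark_model("deepseek-v4-flash", api_key)
print(f"TTFT: {ttft:.0f}ms | Tokens/s: {tps:.1f}")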

Super simple. You can drop in any model name from Global API's list.
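For the "10 runs, averaged" part, a small wrapper along these lines would do. This is a minimal sketch rather than the exact script: `average_runs` and the `measure` callable are names made up for this example, and `measure` is assumed to return a `(ttft_ms, tok_per_sec)` pair the way `benchmark_model` does.

```python
from statistics import mean
from typing import Callable, Tuple

def average_runs(measure: Callable[[], Tuple[float, float]], runs: int = 10) -> Tuple[float, float]:
    """Call `measure` `runs` times and average TTFT (ms) and tokens/sec."""
    results = [measure() for _ in range(runs)]
    avg_ttft = mean(r[0] for r in results)
    avg_tps = mean(r[1] for r in results)
    return avg_ttft, avg_tps

# Stand-in for a real network call, so the sketch runs without an API key
def fake_measure() -> Tuple[float, float]:
    return 180.0, 60.0

avg_ttft, avg_tps = average_runs(fake_measure)
print(f"TTFT: {avg_ttft:.0f}ms | Tokens/s: {avg_tps:.1f}")
```

With a real key you would pass something like `lambda: benchmark_model("deepseek-v4-flash", api_key)` as `measure`.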

The Speed Ranking (Fastest to Slowest)

Honestly, I was shocked by some of these results. I’ve put them in a table because I’m a data nerd, but I’ll break it down after.

Rank	Model	TTFT (ms)	Tokens/sec	Provider	$/M Output
1	Step-3.5-Flash	120	80	StepFun	$0.15
2	Qwen3-8B	150	70	Qwen	$0.01
3	DeepSeek V4 Flash	180	60	DeepSeek	$0.25
4	Hunyuan-TurboS	200	55	Tencent	$0.28
5	Doubao-Seed-Lite	220	50	ByteDance	$0.40
6	Qwen3-32B	250	45	Qwen	$0.28
7	Hunyuan-Turbo	280	42	Tencent	$0.57
8	GLM-4-32B	300	38	Zhipu	$0.56
9	Qwen3.5-27B	350	35	Qwen	$0.19
10	DeepSeek V4 Pro	400	30	DeepSeek	$0.78
11	MiniMax M2.5	450	28	MiniMax	$1.15
12	GLM-5	500	25	Zhipu	$1.92
13	Kimi K2.5	600	20	Moonshot	$3.00
14	DeepSeek-R1	800	15	DeepSeek	$2.50
15	Qwen3.5-397B	1200	10	Qwen	$2.34

Note: The TTFT column matches my US East runs (see the geographic table further down). Reasoning models (R1, K2.5) include internal thinking time before the first token, which is why their TTFT is high. But they're smart.

Speed by Price Tier

Because let's be real – as an indie hacker, I care about speed AND cost. Can't be spending $3 per million tokens on a hobby project.
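To put those prices in perspective, here's a rough cost sketch. The 100,000 replies a month is a made-up workload, `monthly_output_cost` is just a helper name for this example, and it only counts output tokens (the price column above is output-only) at the ~150 tokens per reply from the test setup.

```python
def monthly_output_cost(price_per_million: float, tokens_per_reply: int, replies_per_month: int) -> float:
    """Output-token cost in dollars for one month (input tokens ignored)."""
    total_tokens = tokens_per_reply * replies_per_month
    return price_per_million * total_tokens / 1_000_000

# $/M output prices taken from the ranking table
prices = {"Qwen3-8B": 0.01, "DeepSeek V4 Flash": 0.25, "Kimi K2.5": 3.00}

# Hypothetical workload: 100,000 replies a month at ~150 output tokens each
for model, price in prices.items():
    print(f"{model}: ${monthly_output_cost(price, 150, 100_000):.2f}/month")
# Qwen3-8B: $0.15/month
# DeepSeek V4 Flash: $3.75/month
# Kimi K2.5: $45.00/month
```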

Ultra-Budget (≤ $0.15/M)

Model	tok/s	$/M
Qwen3-8B	70	$0.01
Step-3.5-Flash	80	$0.15

Qwen3-8B is INSANE. 70 tok/s at literally ONE CENT per million output tokens. For simple stuff – summarization, classification, chatbots that don't need deep reasoning – it's unbeatable. Step-3.5-Flash is the speed king at 80 tok/s, and only $0.15/M. Worth it if you need low latency.

Budget ($0.15–$0.30/M)

Model	tok/s	$/M
DeepSeek V4 Flash	60	$0.25
Hunyuan-TurboS	55	$0.28
Qwen3-32B	45	$0.28

This is the sweet spot. DeepSeek V4 Flash is my go-to. 60 tok/s, 180ms TTFT, and quality that rivals GPT-4o in my own use (I didn't measure quality in this benchmark). For $0.25/M. I mean... just use it.

Mid-Range ($0.30–$0.80/M)

Model	tok/s	$/M
Doubao-Seed-Lite	50	$0.40
GLM-4-32B	38	$0.56
Hunyuan-Turbo	42	$0.57
DeepSeek V4 Pro	30	$0.78

These are bigger models, so speed drops. DeepSeek V4 Pro at 30 tok/s is still decent, but you pay more for quality. Honestly, unless you need the extra reasoning, stick with V4 Flash.

Premium ($0.80+/M)

Model	tok/s	$/M
MiniMax M2.5	28	$1.15
GLM-5	25	$1.92
Kimi K2.5	20	$3.00

These are for when correctness > speed. Like legal document analysis or code generation where a mistake costs you. But at 20 tok/s? Your users will feel it. Use only if you have to.
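To see what 20 tok/s feels like, you can turn the numbers into wait times. This is a back-of-the-envelope sketch: `full_reply_seconds` is a name made up for this example, and it assumes tok/s means tokens divided by total request time (which is how the benchmark script computes it) and a ~150-token reply.

```python
def full_reply_seconds(output_tokens: int, tok_per_sec: float) -> float:
    """Seconds from hitting send until the last token arrives."""
    return output_tokens / tok_per_sec

# (TTFT in ms, tokens/sec) from the ranking table
models = {"DeepSeek V4 Flash": (180, 60), "Kimi K2.5": (600, 20)}

for name, (ttft_ms, tps) in models.items():
    first = ttft_ms / 1000
    full = full_reply_seconds(150, tps)
    print(f"{name}: first token after {first:.2f}s, full reply after {full:.1f}s")
# DeepSeek V4 Flash: first token after 0.18s, full reply after 2.5s
# Kimi K2.5: first token after 0.60s, full reply after 7.5s
```

So the premium pick takes about three times as long to finish the same short answer.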

Geographic Latency: Where You Run Matters

I tested from two regions to see the network impact. You'd be surprised how much server location matters.

Model	US East TTFT	Asia TTFT	Diff (Asia vs. US East)
DeepSeek V4 Flash	180ms	150ms	-30ms
Qwen3-32B	250ms	210ms	-40ms
GLM-5	500ms	420ms	-80ms
Kimi K2.5	600ms	480ms	-120ms

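Every model in this table had 16-20% lower TTFT from Asia, with Kimi K2.5 showing the biggest gap (120ms, or 20%). These are all Asian models, so that's obvious, right? DeepSeek V4 Flash has the smallest absolute gap at just 30ms, so it feels about the same from either region. If your users are in Asia, consider Qwen3 models or DeepSeek.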
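The percentages come straight from the table. Here's the arithmetic as a quick sketch (`relative_change_pct` is just a helper name for this example):

```python
# (US East TTFT, Asia TTFT) in ms, copied from the table above
ttft_ms = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}

def relative_change_pct(us_east: float, asia: float) -> float:
    """Change in TTFT going from US East to Asia, as a percentage of US East."""
    return (asia - us_east) / us_east * 100

for model, (us, asia) in ttft_ms.items():
    print(f"{model}: {asia - us:+d}ms ({relative_change_pct(us, asia):+.1f}%)")
# DeepSeek V4 Flash: -30ms (-16.7%)
# Qwen3-32B: -40ms (-16.0%)
# GLM-5: -80ms (-16.0%)
# Kimi K2.5: -120ms (-20.0%)
```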

Real-World Impact: TTFT Tells the Story

I built a little chart for myself (sharing it here because why not):

TTFT	User Perception
< 200ms	"Instant" – users stay
200–400ms	"Fast" – acceptable
400–800ms	"Noticeable delay" – you lose some
800ms+	"Slow" – you lose everyone

My recommendation: Keep TTFT under 400ms for interactive chat. Use DeepSeek V4 Flash (180ms), Qwen3-8B (150ms), or Step-3.5-Flash (120ms). Your users will thank you.
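If you want to apply those buckets to your own measurements, a tiny helper like this works. It's a sketch: `perceived_speed` is a made-up name, and since the chart doesn't say which bucket the exact boundaries (200, 400, 800) fall into, I'm assuming they belong to the slower one.

```python
def perceived_speed(ttft_ms: float) -> str:
    """Map a TTFT in milliseconds to the perception buckets from the chart."""
    if ttft_ms < 200:
        return "Instant"
    if ttft_ms < 400:
        return "Fast"
    if ttft_ms < 800:
        return "Noticeable delay"
    return "Slow"

# TTFT values from the ranking table
for model, ttft in [("Step-3.5-Flash", 120), ("Qwen3-32B", 250), ("GLM-5", 500), ("Qwen3.5-397B", 1200)]:
    print(f"{model}: {ttft}ms -> {perceived_speed(ttft)}")
# Step-3.5-Flash: 120ms -> Instant
# Qwen3-32B: 250ms -> Fast
# GLM-5: 500ms -> Noticeable delay
# Qwen3.5-397B: 1200ms -> Slow
```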

The Bottom Line

If you're an indie hacker like me, don't overthink this. For most use cases:

Need speed + quality? DeepSeek V4 Flash. (60 tok/s, $0.25/M)
Need raw speed on a budget? Qwen3-8B. (70 tok/s, $0.01/M)
Need to flex with the fastest? Step-3.5-Flash. (80 tok/s, $0.15/M)
Building a reasoning app? Accept the slower TTFT and go with R1 or K2.5.

I ran every test through Global API. They give you a single endpoint (https://global-apis.com/v1) and you swap model names. Super easy. If you want to run these benchmarks yourself (and you should, because your use case might differ), grab an API key.