Z.ai shipped GLM-5.2 on 2026-06-16 with a headline pitch developers will recognize: a usable million-token window and a single dial for how hard the model thinks. The benchmarks look strong — but they're vendor-reported, so treat the numbers as claims until you've run your own.
GLM-5.2 at a glance: the 1M span and depth switch
GLM-5.2 is Z.ai's coding-focused flagship, a 753B-parameter Mixture-of-Experts model released under an MIT license on 2026-06-16. The two changes that matter for daily work are a 1,000,000-token context window (roughly 5× the ~200K of GLM-5.1) and a new reasoning_effort parameter that controls how much the model thinks before it answers.
Output is capped at 128K tokens (131,072), with a default max_tokens of 65,536 for the glm-5.2 id. The reasoning_effort dial accepts seven values (max, xhigh, high, medium, low, minimal, none), but they collapse to two effective thinking tiers, High and Max: none and minimal skip thinking; low, medium, and high map to High; xhigh and max map to Max. The default is max; Z.ai notes that higher effort raises latency and token usage materially, so the dial is a real cost lever, not a cosmetic one. Open weights are on HuggingFace as BF16/F32 (zai-org/GLM-5.2) and an FP8 variant (zai-org/GLM-5.2-FP8).
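The tier collapse is easy to get wrong when you are budgeting tokens, so here is a minimal lookup sketch. The function name `effective_tier` is hypothetical, and placing `high` in the High tier is inferred from the Claude Code `/effort` mapping described later in this article rather than from a published API table.

```python
# Mapping of reasoning_effort values to the two effective tiers described above.
# None means thinking is skipped. "high" -> "High" is an inference, not a quoted spec.
_TIER_BY_EFFORT = {
    "none": None,
    "minimal": None,
    "low": "High",
    "medium": "High",
    "high": "High",
    "xhigh": "Max",
    "max": "Max",
}


def effective_tier(reasoning_effort: str = "max"):
    """Return 'High', 'Max', or None (thinking skipped) for a reasoning_effort value."""
    try:
        return _TIER_BY_EFFORT[reasoning_effort]
    except KeyError:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}") from None
```

In practice this means switching from low to medium should change nothing, while switching from medium to xhigh moves you to the more expensive tier.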
On the scoreboard, Z.ai's own table reports SWE-bench Pro 62.1, FrontierSWE 74.4, and Terminal Bench 2.1 (Terminus-2) 81.0. Worth flagging: these are self-reported. the-decoder's coverage puts FrontierSWE 74.4 just behind Claude Opus 4.8 at 75.4, but it draws from the same vendor table, not an independent reproduction. The honest read for now: GLM-5.2 is positioned to close the gap with closed-source leaders on coding while staying open-weight, and the 74.4 figure holds up only as far as Z.ai's own harness does.
Sign up and pick: pay-as-you-go or Coding Plan
Z.ai offers two billing paths, and which you pick depends on whether you are scripting against the raw API or wiring GLM-5.2 into an IDE agent. Pay-as-you-go (PAYG) on the general API runs $1.40 per 1M input tokens and $4.40 per 1M output tokens, the same rates as GLM-5.1, with cached input temporarily free under a limited-time storage benefit. The general endpoint is https://api.z.ai/api/paas/v4/.
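To make the PAYG rates concrete, here is a small cost estimator built on the two prices quoted above. The function name `payg_cost_usd` is hypothetical, and the sketch deliberately ignores the cached-input promotion, so treat the result as an upper bound while that benefit lasts.

```python
from decimal import Decimal

# Rates quoted in the article, in USD per 1M tokens.
INPUT_USD_PER_MTOK = Decimal("1.40")
OUTPUT_USD_PER_MTOK = Decimal("4.40")


def payg_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the pay-as-you-go cost of one request, rounded to 4 decimal places."""
    million = Decimal(1_000_000)
    cost = (
        Decimal(input_tokens) * INPUT_USD_PER_MTOK
        + Decimal(output_tokens) * OUTPUT_USD_PER_MTOK
    ) / million
    return cost.quantize(Decimal("0.0001"))


# An 800K-token prompt with a 20K-token answer: 1.12 + 0.088 USD.
print(payg_cost_usd(800_000, 20_000))  # prints 1.2080
```

A single near-full-window request costs a little over a dollar at these rates, which is worth knowing before an agent loop sends twenty of them.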
The GLM Coding Plan is the other route. It starts at $18/month and covers GLM-5.2, GLM-5-Turbo, and GLM-4.7. Crucially, it exposes both an OpenAI-compatible endpoint (https://api.z.ai/api/coding/paas/v4/) and an Anthropic-compatible one (https://api.z.ai/api/anthropic), so GLM-5.2 slots into existing Claude Code workflows.
For SDK setup, install the official Python client with pip install zai-sdk==0.2.3, or reuse the OpenAI Python SDK by setting base_url='https://api.z.ai/api/paas/v4/'. Java developers pull ai.z.openapi:zai-sdk:0.3.5.
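If you want to smoke-test the endpoint before installing either SDK, a dependency-free sketch using only the standard library looks roughly like this. It is not the SDK route described above: `build_chat_request` and `send` are hypothetical helpers, and Bearer-token authentication is an assumption based on the endpoint being OpenAI-compatible.

```python
import json
import urllib.request

BASE_URL = "https://api.z.ai/api/paas/v4/"  # general PAYG endpoint from the article


def build_chat_request(prompt: str, api_key: str, model: str = "glm-5.2") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat/completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=BASE_URL + "chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: OpenAI-style Bearer auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send(build_chat_request("Say hello.", api_key))` with a real key performs the request; building the request separately keeps the payload inspectable without touching the network.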
Rule of thumb: pick the Coding Plan when pointing an IDE agent at GLM-5.2, since plan benefits only apply through officially supported integrations and may degrade through unsupported SDKs or third-party scenarios. Pick PAYG for raw scripting or one-off evals where you control the request directly.
GLM-5.2 drop-in: the exact swap in your settings
Swapping GLM-5.2 into an existing agent is a config edit, not a rewrite. Z.ai exposes an Anthropic-compatible endpoint at https://api.z.ai/api/anthropic and an OpenAI-compatible one at https://api.z.ai/api/coding/paas/v4/, so Claude Code, Cline, and similar clients point at GLM-5.2 by changing model IDs and a base URL.
For Claude Code, edit ~/.claude/settings.json: set both ANTHROPIC_DEFAULT_SONNET_MODEL and ANTHROPIC_DEFAULT_OPUS_MODEL to glm-5.2[1m], then add CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000 so auto-compaction is sized to the 1M window instead of kicking in well before the agent reaches that ceiling.
```json
{
  "env": {
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000"
  }
}
```
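If the settings file already holds other keys, hand-editing risks clobbering them. The following is a minimal sketch of a merge script that adds the three variables while preserving everything else; `merge_glm_env` is a hypothetical helper, and it assumes the file is plain JSON with an optional top-level "env" object.

```python
import json
from pathlib import Path

# The three values from the settings snippet above.
GLM_ENV = {
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
}


def merge_glm_env(settings_path: Path) -> dict:
    """Merge GLM_ENV into the file's "env" object, keeping all other settings."""
    settings = {}
    if settings_path.exists():
        # An empty file is treated as an empty settings object.
        settings = json.loads(settings_path.read_text(encoding="utf-8") or "{}")
    settings.setdefault("env", {}).update(GLM_ENV)
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

A typical call would be `merge_glm_env(Path.home() / ".claude" / "settings.json")`; back up the file first, since the script rewrites it in place.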
Run /status to confirm the model is live. Effort routing is mapped to GLM's two thinking tiers: /effort low|medium|high resolves to GLM High, while /effort xhigh|max|ultracode resolves to GLM Max. Z.ai is explicit about the trade-off:
"Max effort is recommended for complex, multi-step coding work, but higher effort raises latency and token usage," — Z.ai, GLM-5.2 release notes (source: DataCamp).
For Cline and other OpenAI-compatible clients, add a provider with base URL https://api.z.ai/api/coding/paas/v4/, model glm-5.2, image support left unchecked, and the context window field set to 1000000.
If you are calling the API directly, POST to /chat/completions with thinking enabled and stream the response:
```json
{
  "model": "glm-5.2",
  "thinking": {"type": "enabled"},
  "reasoning_effort": "max",
  "temperature": 1.0,
  "stream": true,
  "messages": [{"role": "user", "content": "Refactor this module."}]
}
```
Parse delta.reasoning_content and delta.content as two separate streams; reasoning tokens arrive before the answer. For streaming function calls, also set tool_stream=true and concatenate delta.tool_calls[*].function.arguments until the call is complete.
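The parsing rule above can be sketched as a small accumulator. This assumes each streamed chunk has already been decoded into a dict with the OpenAI-style `choices[].delta` shape, and that tool-call fragments carry an `index` field; both are assumptions about the wire format, and `collect_stream` is a hypothetical name.

```python
from collections import defaultdict


def collect_stream(chunks):
    """Accumulate reasoning text, answer text, and tool-call arguments from decoded chunks."""
    reasoning, answer = [], []
    tool_args = defaultdict(list)  # tool-call index -> list of argument fragments
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            if delta.get("content"):
                answer.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                fragment = (call.get("function") or {}).get("arguments")
                if fragment:
                    # Assumption: a missing index means a single tool call (index 0).
                    tool_args[call.get("index", 0)].append(fragment)
    return {
        "reasoning": "".join(reasoning),
        "content": "".join(answer),
        "tool_arguments": {i: "".join(parts) for i, parts in tool_args.items()},
    }
```

Keeping the three streams separate lets a UI show the reasoning trace in a collapsible pane while only parsing tool arguments as JSON once the stream has finished.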
Where GLM-5.2 bites: quota math and unsupported paths
The catch with GLM-5.2 on the Coding Plan is quota burn, not raw price. Z.ai applies a peak-hour multiplier: requests cost 3× quota during 14:00–18:00 UTC+8 and 2× off-peak, with a limited-time 1× off-peak promotion running through September 2026. Schedule long-horizon agent runs outside the afternoon window and, while the 1× promotion lasts, the same quota stretches roughly three times as far; at the standard 2× off-peak rate the gain is 1.5×.
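A scheduler can check the multiplier before launching a long run. The sketch below encodes the window and rates quoted above; `quota_multiplier` is a hypothetical helper, and treating the window as half-open (14:00 inclusive, 18:00 exclusive) is an assumption, since the boundary behavior is not stated.

```python
from datetime import datetime, time, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))
PEAK_START, PEAK_END = time(14, 0), time(18, 0)  # 14:00-18:00 UTC+8


def quota_multiplier(moment: datetime, off_peak_promo: bool = True) -> int:
    """Return the quota multiplier (3, 2, or 1) that applies at a timezone-aware moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(UTC8).time()
    if PEAK_START <= local < PEAK_END:  # assumption: end of window is exclusive
        return 3
    return 1 if off_peak_promo else 2
```

Passing `datetime.now(timezone.utc)` gives the multiplier for the current moment; set `off_peak_promo=False` once the promotion ends.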
Do the math before you commit a workflow. The Pro tier allows about 400 prompts per 5-hour window, but one prompt may invoke the model 15–20 times under agentic loops. At the 3× peak multiplier, the allowance nets out to roughly 135 usable prompts per window (400 ÷ 3 ≈ 133). That ceiling drops fast if every call reaches for the full span.
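The arithmetic behind that ceiling is simple enough to script. This sketch assumes quota is counted per prompt and that the multiplier divides the allowance directly, which is the model the figures above imply; the function names are hypothetical.

```python
def usable_prompts(window_allowance: int = 400, multiplier: int = 3) -> int:
    """Prompts that fit in one window when each prompt costs `multiplier` units of quota."""
    return window_allowance // multiplier


def model_calls_range(prompts: int, calls_per_prompt=(15, 20)):
    """Low and high estimates of model invocations for a number of agentic prompts."""
    low, high = calls_per_prompt
    return prompts * low, prompts * high


peak = usable_prompts(400, 3)
print(peak)                     # prints 133
print(model_calls_range(peak))  # prints (1995, 2660)
print(usable_prompts(400, 1))   # prints 400
```

The same 5-hour window therefore covers about 133 prompts at peak and the full 400 under the 1× off-peak promotion.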
So treat glm-5.2[1m] as a deliberate choice, not a default. Z.ai notes that selecting it carries extra cost and latency, and recommends it only when a task genuinely needs the 1M context; standard glm-5.2 is cheaper and faster for everyday single-file edits.
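One way to make that choice deliberate is to pick the model id from an estimated prompt size. The sketch below is illustrative only: the 200,000-token threshold reflects the "roughly 200K" figure used in this article, reserving the default 65,536 output tokens against the window is an assumption about how the budget is counted, and `pick_model` is a hypothetical helper.

```python
STANDARD_CONTEXT_TOKENS = 200_000   # assumption: approximate limit of plain glm-5.2
LONG_CONTEXT_TOKENS = 1_000_000     # the 1M window of glm-5.2[1m]


def pick_model(estimated_prompt_tokens: int, reserved_output_tokens: int = 65_536) -> str:
    """Return the cheaper model id unless the request needs the long-context variant."""
    needed = estimated_prompt_tokens + reserved_output_tokens
    if needed > LONG_CONTEXT_TOKENS:
        raise ValueError(f"request needs ~{needed} tokens, beyond the 1M window")
    return "glm-5.2[1m]" if needed > STANDARD_CONTEXT_TOKENS else "glm-5.2"
```

Everyday edits then stay on the standard model automatically, and only genuinely large reads pay the long-context premium.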
"Coding Plan benefits are restricted to officially supported tools and may be limited through unsupported SDKs or third-party scenarios," — Z.ai documentation (source: MarkTechPost).
The practical risk: route GLM-5.2 through an unsupported client and your calls may silently fall back to pay-as-you-go rates — $1.40 per 1M input and $4.40 per 1M output tokens — instead of your plan quota. Stick to officially supported integrations to keep billing predictable.
What to attempt with the extra span
The 1M-token window changes what fits in a single prompt: with GLM-5.2's roughly 5× jump from GLM-5.1's ~200,000-token ceiling, full-project reads become feasible where you previously had to chunk repeatedly. Concrete tasks worth running:
- Cross-repo analysis: feed several large codebases in one prompt and ask GLM-5.2 to trace a call path or shared contract across them — no manual splitting.
- Marathon refactors: pass an entire monorepo and request a structured migration. Raise reasoning_effort to Max for multi-file dependency tracking across the full pass.
- MCP orchestration: Z.ai reports an MCP-Atlas public-set score of 76.8. Run your own MCP task suite against it before wiring production flows.
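For the cross-repo and monorepo tasks above, the practical first step is packing source files into one prompt without overshooting the window. Here is a rough sketch: the 4-characters-per-token ratio is a crude heuristic and not GLM's tokenizer, the 900,000-token default budget is an arbitrary safety margin, and `pack_sources` and the file-suffix list are hypothetical.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic only; real token counts will differ


def pack_sources(roots, token_budget=900_000, suffixes=(".py", ".ts", ".go", ".java")):
    """Concatenate source files from several repos until the estimated budget is spent.

    Returns (prompt_text, estimated_tokens_used, skipped_paths).
    """
    parts, used, skipped = [], 0, []
    for root in roots:
        root = Path(root)
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            # Header line tells the model which repo and file each block came from.
            block = f"### {root.name}/{path.relative_to(root).as_posix()}\n{text}\n"
            cost = len(block) // CHARS_PER_TOKEN + 1
            if used + cost > token_budget:
                skipped.append(path)
                continue
            parts.append(block)
            used += cost
    return "".join(parts), used, skipped
```

Checking the `skipped` list before sending tells you whether the model is actually seeing the whole project or only the files that fit.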
One caveat governs all of it: the coding benchmarks are vendor-reported. SWE-bench Pro 62.1 and FrontierSWE 74.4 had no independent third-party verification at launch. The takeaway: treat the extra span as capability to test, not a result to trust — run a representative subset of your own tasks as the real measure of fit before you ship.
Frequently asked questions
Is GLM-5.2's FrontierSWE 74.4 independently verified?
No. As of its 2026-06-16 release, the FrontierSWE 74.4 figure is vendor-reported only. The-decoder's coverage, which places GLM-5.2 just behind Claude Opus 4.8 at 75.4, cites the same source table rather than a separate reproduction. Independent leaderboard entries are expected after launch. Until then, run a task-representative harness on your own codebase before committing a production flow.
What is the difference between the standard API and the Coding Plan endpoint?
They are billed and routed separately. The standard pay-as-you-go API uses https://api.z.ai/api/paas/v4/ at $1.40 per 1M input and $4.40 per 1M output tokens. The GLM Coding Plan, from $18/month, adds an OpenAI-compatible /api/coding/paas/v4 endpoint and an Anthropic-compatible /api/anthropic path for Claude Code workflows. Plan benefits (quota and pricing) apply only through officially supported integrations, not arbitrary third-party SDK calls.
When should I use glm-5.2[1m] instead of glm-5.2?
Use glm-5.2[1m] only when you genuinely need context past roughly 200K tokens — cross-repo reads, full-monorepo passes, or large document analysis. The [1m] suffix activates the 1,000,000-token variant at extra cost and latency. For most day-to-day edits, plain glm-5.2 is the cheaper and faster choice.
Can I run GLM-5.2 locally?
Yes, under MIT-licensed weights on HuggingFace: zai-org/GLM-5.2 (BF16/F32, 753B parameters) and an FP8 variant zai-org/GLM-5.2-FP8. Supported serving frameworks include Transformers, vLLM (v0.23.0+), SGLang (v0.5.13.post1+), Docker Model Runner, xLLM, and ktransformers. A 753B-parameter model requires substantial GPU infrastructure to serve.
How does reasoning_effort affect cost and speed?
The seven declared values collapse into two effective thinking tiers: High (low/medium/high) and Max (xhigh/max), while none/minimal skip thinking entirely. The default is max. Z.ai recommends Max effort for complex, multi-step coding tasks and lower settings for quick single-file edits, since higher effort meaningfully raises latency and output token count.








