China’s Kimi K2 Thinking Sets New Pace for Open Source AI, Topping GPT 5 on Key Tests

Asia Daily

A powerful open model challenges the frontier

Moonshot AI has released Kimi K2 Thinking, an open weights model that aims directly at the highest tier of reasoning systems. The company says the model outperforms leading proprietary systems on a range of demanding evaluations that measure step by step reasoning, web browsing agents, and software engineering. K2 Thinking is available through Moonshot’s platform and Kimi.com, and the weights are posted on public hubs for self hosting under a Modified MIT License that allows commercial use with a light attribution requirement for very large deployments.

The model’s design blends scale with efficiency. K2 Thinking uses a Mixture of Experts architecture with a reported one trillion total parameters, while only about 32 billion parameters activate for each token generated. It supports a 256,000 token context window and is trained for native INT4 inference using quantization aware training. That setup reduces latency and memory needs, which helps explain its competitive pricing. Beyond raw scores, the release pushes open models toward agent behavior. K2 Thinking shows long horizon planning and structured tool use across 200 to 300 sequential calls, and it can output transparent reasoning traces so users can inspect intermediate steps. Those capabilities, paired with lower costs, raise the stakes for closed rivals.

What is K2 Thinking and how it works

K2 Thinking is built as a Mixture of Experts transformer. A large pool of specialized expert modules sits behind a router that selects a small subset for each token, which reduces compute while preserving capacity. Moonshot describes a stack with 61 layers, 384 experts, eight experts selected per token, 64 attention heads, a hidden dimension of 7168, and a vocabulary size of 160,000. Only about 32 billion parameters are active at any moment, which keeps inference manageable while the total capacity remains very high. The attention mechanism is described as multi head latent attention, with SwiGLU activation in the feed forward expert layers. The approach focuses training and inference on the most relevant experts for the content in context, improving both throughput and accuracy.
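
To make the routing idea concrete, the sketch below scores every expert for a token, keeps only the top scoring few, and mixes their outputs. It uses toy dimensions so it runs anywhere; the comments note the reported K2 Thinking figures, and nothing here reflects Moonshot's actual implementation.

```python
import numpy as np

# Toy sizes so the sketch runs anywhere. K2 Thinking's reported figures are a
# hidden dimension of 7168, 384 experts, and 8 experts selected per token.
HIDDEN, NUM_EXPERTS, TOP_K = 64, 16, 2

rng = np.random.default_rng(0)
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.02 for _ in range(NUM_EXPERTS)]

def moe_layer(token):
    """Route one token vector through its top-k experts and mix the outputs."""
    logits = token @ router_w                  # score every expert for this token
    top = np.argsort(logits)[-TOP_K:]          # keep only the best scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over the selected experts
    # Only the selected experts run, so per-token compute scales with TOP_K,
    # not with the total expert count.
    return sum(wi * (token @ experts[i]) for wi, i in zip(w, top))

print(moe_layer(rng.standard_normal(HIDDEN)).shape)   # -> (64,)
```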

Sparse experts and efficiency

Mixture of Experts lets a model scale to a trillion parameters without forcing every parameter to run for each token. In practice, this means the model can exhibit deeper reasoning and better breadth of knowledge while the memory footprint and step time stay within the range of current accelerators. The 256k context window allows the model to ingest large codebases or lengthy research packets, then plan across many steps as it calls tools. That design is a strong fit for agent tasks, where a system might search, read, write, and verify results across dozens or hundreds of actions.

Native INT4 and quantization aware training

Moonshot trained K2 Thinking with quantization aware methods so it can run natively in INT4 for weight storage and inference. INT4 stores each parameter value in four bits, a single nibble, instead of higher precision formats, which cuts memory needs and speeds up generation. The company reports about a two times generation speed improvement with minimal quality loss compared to higher precision, and all benchmark scores were reported under INT4. That matters for cost and reach. Community testers report running the model on high end desktop systems, and Moonshot recommends engines like vLLM, SGLang, and KTransformers. The model also supports OpenAI and Anthropic compatible APIs, which makes integration simpler for teams that already use those patterns.
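
The sketch below illustrates the basic mechanics of 4 bit weight storage: values are rounded onto a 16 level grid with a per group scale, and two values are packed into each byte. It is a simplified illustration of the idea, not Moonshot's quantization aware training recipe.

```python
import numpy as np

def quantize_int4(weights, group_size=32):
    """Round weights onto a 16 level grid with a per group scale and pack
    two 4 bit values into each byte. Illustrative only; real INT4 pipelines
    and quantization aware training are considerably more involved."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # map each group to [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    nibbles = (q & 0x0F).astype(np.uint8)                     # keep only the low 4 bits
    packed = nibbles[:, ::2] | (nibbles[:, 1::2] << 4)        # two weights per byte
    return packed, scale

def dequantize_int4(packed, scale):
    """Unpack the nibbles, restore the sign, and rescale to floating point."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo[lo > 7] -= 16
    hi[hi > 7] -= 16
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, ::2], q[:, 1::2] = lo, hi
    return (q * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
packed, scale = quantize_int4(w)
print("bytes per weight:", packed.nbytes / w.size)            # 0.5, before scale overhead
print("mean abs error:", np.abs(w - dequantize_int4(packed, scale)).mean())
```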


Benchmarks where K2 Thinking leads

Across public tests, the model posts results that place it among the best reasoning systems available, and on several it takes the top spot. Reported numbers include 44.9 percent on Humanity’s Last Exam with tools enabled, 23.9 percent on a text only setting, and 51.0 percent on a heavy setting; 60.2 percent on BrowseComp for web browsing agents; 71.3 percent on SWE Bench Verified for software bug fixing with tools; 83.1 percent on LiveCodeBench v6; and 56.3 percent on Seal 0. On math and science, the model reaches 94.5 percent on AIME25 without tools, 99.1 percent on AIME25 when allowed to run Python, 89.4 percent on HMMT25 without tools, and about 85 to 86 percent on the GPQA Diamond scientific knowledge test. Scores are also reported for Frames at 87.0, FinSearchComp T3 at 47.4, and SWE Multilingual at 61.1. In many of these categories, K2 Thinking meets or surpasses GPT 5, and the BrowseComp score clears a wide gap over Claude Sonnet 4.5.

Agentic search and browse tasks

BrowseComp probes whether a model can plan, search the web, follow links, synthesize findings, and verify answers. K2 Thinking’s 60.2 percent BrowseComp score exceeds the 54.9 percent result reported for GPT 5 and far surpasses Claude Sonnet 4.5 at 24.1 percent. The model also posts strong numbers on Seal 0 and Frames, which measure information gathering, planning, and consistency under tool use. These results align with the model’s design goals. Interleaved thinking allows the model to pause, reason, call a tool, then update its plan. Support for 200 to 300 sequential calls helps maintain state across many steps, and the large context window keeps the working set in memory.
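
That interleaved pattern, think, call a tool, fold the result back into context, repeat, can be sketched as a plain loop. Everything below is a hypothetical stand in rather than Moonshot's actual API or tool schema.

```python
MAX_STEPS = 300   # the article reports coherent behavior across 200 to 300 calls

def call_model(messages):
    """Hypothetical stand-in for a chat completion call. A real implementation
    would hit an OpenAI compatible endpoint and may return a tool request."""
    if len(messages) < 3:   # pretend the model wants to search first
        return {"role": "assistant", "content": None,
                "tool_call": {"name": "search_web", "arguments": {"query": "example"}}}
    return {"role": "assistant", "content": "final answer", "tool_call": None}

def search_web(query):
    """Hypothetical tool; a real agent would query a search API here."""
    return f"stub results for: {query}"

TOOLS = {"search_web": search_web}

def run_agent(question):
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        reply = call_model(messages)
        messages.append(reply)                   # the full working state stays in context
        if reply["tool_call"] is None:           # no tool requested: final answer reached
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "name": call["name"], "content": result})
    return "step budget exhausted"

print(run_agent("Which film featured the actor in that role?"))
```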

Coding and software engineering

SWE Bench Verified evaluates a model’s ability to read repository issues, produce a patch, and pass tests. K2 Thinking reports a 71.3 percent score with tools, a strong result for an open weights system. On LiveCodeBench v6 it reaches 83.1 percent, and on SWE Multilingual it scores 61.1 percent. Community demos show it can build functional projects from a single prompt, including games and full web apps, and then iterate through fixes using tool loops. Long context helps for code navigation, while INT4 inference keeps latency down during extended development sessions.


Pricing, access, and license

K2 Thinking is released under a Modified MIT License. Commercial use is allowed, and very large deployments must include light attribution. The model is live on Kimi.com for chat, and it is accessible by API on the Kimi Open Platform. For self hosting, teams can obtain the weights from public model hubs like Hugging Face at huggingface.co/moonshotai/Kimi-K2-Thinking, and platform documentation is available at platform.moonshot.ai.
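
Because the platform exposes OpenAI compatible APIs, integration can follow the familiar client pattern. The base URL and model identifier below are assumptions; the exact values should be confirmed in the documentation at platform.moonshot.ai.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",      # assumed endpoint, check the platform docs
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",                   # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the SWE-Bench Verified task format."}],
)
print(response.choices[0].message.content)
```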

Pricing is positioned to undercut closed alternatives. Moonshot lists about 0.15 dollars per one million tokens for cache hits, about 0.60 dollars per one million tokens for cache misses, and about 2.50 dollars per one million tokens for output. That compares to GPT 5 pricing around 1.25 dollars per one million input tokens and 10 dollars per one million output tokens. K2 Thinking’s native INT4 support lowers runtime costs, and the Mixture of Experts setup ensures only a subset of the network runs at once. On Kimi.com, the standard chat mode uses a restricted tool set for a lighter experience, so on site results may not match agent benchmarks until the Agent mode rollout completes.
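
A rough back of the envelope comparison shows how the listed rates play out; the monthly volumes and cache hit rate below are invented purely for illustration.

```python
# Cost comparison using the listed per-million-token rates in dollars.
# The workload figures are hypothetical and chosen only for illustration.
K2 = {"input_hit": 0.15, "input_miss": 0.60, "output": 2.50}
GPT5 = {"input": 1.25, "output": 10.00}

input_tokens, output_tokens = 40_000_000, 5_000_000   # assumed monthly volume
cache_hit_rate = 0.5                                  # assumed share of cached input

k2_cost = (input_tokens * cache_hit_rate * K2["input_hit"]
           + input_tokens * (1 - cache_hit_rate) * K2["input_miss"]
           + output_tokens * K2["output"]) / 1_000_000
gpt5_cost = (input_tokens * GPT5["input"] + output_tokens * GPT5["output"]) / 1_000_000

print(f"K2 Thinking: ${k2_cost:,.2f}   GPT 5: ${gpt5_cost:,.2f}")   # $27.50 vs $100.00
```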


Why this matters for the AI race

Open models are closing the gap with the best closed systems. K2 Thinking’s benchmark wins signal that open research can now compete at the high end of reasoning, coding, and agent tasks. This puts pressure on proprietary vendors to differentiate on reliability, safety tooling, enterprise support, and integrated products rather than only headline scores. For enterprises, open weights bring control over data, compliance, and deployment footprint. For developers, they offer a path to experiment with advanced agent behavior without paying premium rates for every token.

China’s growing AI presence

China’s labs are moving faster and taking more of the conversation. Names like DeepSeek, Qwen, MiniMax, and Kimi have stepped into view, with releases that arrive at a steady clip. MiniMax’s earlier model established an open benchmark pace and a lower cost baseline. K2 Thinking now pushes further on long horizon reasoning and agent tooling. Analysts point out that Chinese teams are tightening feedback loops, leaning on test time scaling, and using release cadence as an advantage even if a few tasks still favor the best closed systems. The competition is also shifting from chat to agent intelligence, where a system can plan, use tools, and complete complex goals. K2 Thinking is aimed squarely at that shift.


Capabilities and real world examples

Moonshot’s demonstrations and user tests highlight how the model solves extended tasks. In one walkthrough, the system tackles a doctoral level mathematics question by looping through reading papers, extracting formulas, executing Python code, and verifying results across more than twenty reasoning and tool steps before producing a final solution. Another demo shows the model cycling through searches to identify actor Jimmy Gary Jr. and confirm a specific film role, a task that required careful source checking over many web pages. In coding sessions, testers report that K2 Thinking can produce a working Space Invaders style game in a single pass, then refine it through iterative tool calls. Others show it laying out full websites, building front end components, and even replicating a desktop style interface to handle files and menus.

What developers are seeing

Independent testers on developer forums describe a balance of speed and control. One experienced user ran the model on consumer accessible hardware and reported steady behavior across long tool chains and large prompts. The reviewer added that the native INT4 path improved speed while keeping quality stable.

In my tests, it handled up to 300 sequential tool calls without losing coherence, a big improvement over prior models.

That same tester cited a 256k context window that handled large code projects, multilingual coding strength, and practical gains in front end tasks. Reported throughput on high end desktop machines reached tens to hundreds of tokens per second depending on settings, which makes agent loops feel responsive enough for development work.

Expert perspective

Nathan Lambert, an AI researcher who tracks open models and agent systems, argues that K2 Thinking marks a turning point for open weights at the high end of performance. He also highlights the pace of releases from China based labs.

The release is seen as the closest open models have come to the closed frontier in performance.

Lambert also points to a speed advantage in shipping new systems and iterations on the open side, despite some areas where closed models still lead.

Open models release faster. Chinese labs release their models much more quickly than Western labs.

What K2 Thinking still needs to prove

Benchmark leadership is not the same as across the board superiority in production. Some users report occasional recall dips in very long contexts, which is a common challenge for large context systems. Benchmarks also differ in allowed tool budgets and thinking token limits, so comparisons can hinge on how much test time scaling each team uses. Running hundreds of tool calls in a single session also raises hosting challenges, since providers need to handle more state, more network calls, and larger logs.

Enterprises will watch for stability across versions, reproducibility of results under fixed settings, and safety tooling that fits into existing workflows. The reasoning trace output is a helpful step, since it allows teams to inspect intermediate steps, but it also means operators must decide how much thinking to log and when to mask sensitive content. Finally, the model’s open design invites rapid community iteration, which is a strength, but it also means many variants will appear with different tradeoffs in speed, memory use, and quality. The early results are promising, and continued testing across languages, domains, and hardware will tell developers where K2 Thinking is strongest.

The Bottom Line

  • Moonshot AI released Kimi K2 Thinking as open weights with a Modified MIT License that allows commercial use.
  • The model uses a Mixture of Experts design with one trillion total parameters and about 32 billion active per token.
  • Native INT4 inference from quantization aware training cuts cost and latency while holding quality.
  • K2 Thinking reports 44.9 percent on Humanity’s Last Exam with tools and 60.2 percent on BrowseComp, beating GPT 5 on that agent test.
  • Coding scores include 71.3 percent on SWE Bench Verified and 83.1 percent on LiveCodeBench v6.
  • The system supports a 256k context window and can run 200 to 300 sequential tool calls.
  • Pricing is about 0.15 dollars per million tokens for cache hits, 0.60 dollars for cache misses, and 2.50 dollars for output, well below GPT 5 rates.
  • Access is available on Kimi.com, via API at the Kimi Open Platform, and through public hubs like Hugging Face.
  • Open models from China, including Kimi, DeepSeek, Qwen, and MiniMax, are pushing rapid releases and higher benchmark scores.
  • Areas to watch include long context reliability, hosting support for extended tool loops, and real world performance under production constraints.