Why this launch matters now
China’s on-demand services giant Meituan has joined the open-source large language model wave with the release of LongCat Flash Chat, a system that pairs a very large parameter count with an efficiency-first design. Announced in early September 2025 and posted for developers on GitHub and Hugging Face, the model arrives with a technical report that positions it alongside current top-tier systems. Meituan says LongCat Flash Chat matches the performance of DeepSeek V3.1, Alibaba Cloud’s Qwen3 family, and Moonshot AI’s Kimi K2, and compares well with Anthropic’s Claude Sonnet and Google’s Gemini 2.5 Flash, according to the company’s benchmarks. The model is licensed for broad use and modification, a signal that Meituan wants to build an active developer ecosystem around its AI stack.
LongCat Flash Chat is framed as a fast, production-ready foundation model rather than a reasoning-oriented system. That focus reflects a clear tradeoff: less emphasis on slow, complex chains of logic and more on responsive agent behavior, tool use, and instruction following. The release follows Meituan’s 2023 acquisition of AI startup Light Year for 281 million dollars and fits an internal strategy that spans AI for internal workflows, AI features in end-user products, and the development of core models.
What Meituan released
The company’s documentation describes LongCat Flash Chat as a Mixture-of-Experts (MoE) model with 560 billion total parameters. Only a fraction of those parameters are active on any given token, between 18.6 billion and 31.3 billion, with an average around 27 billion, roughly 5 percent of the total. That selective activation is central to the model’s speed and cost profile. The model supports a long context window of up to 128k tokens, which allows it to process lengthy conversations, documents, or multi-step tasks without constant restarts.
In Meituan’s words, this is a nonthinking foundation model. Caixin Global reported that the model does not include advanced reasoning capabilities, while still delivering strong results across many instruction and agent tasks. The repository highlights competitive scores in general benchmarks, instruction following, math, coding, tool use, and safety evaluations. Developers can access the model weights and documentation on Hugging Face and browse product details on the official site at longcat.ai. The code and weights are released under the MIT License.
Inside the LongCat architecture
Mixture-of-Experts models split a giant network into many specialized components called experts. A router decides which experts to consult for each token. By activating only a small subset of experts per token, the model preserves large total capacity while keeping compute per step in check. This approach has become a common way to scale models without ballooning inference cost.
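To make the routing step concrete, here is a minimal top-k MoE forward pass in Python. This is a generic sketch with toy sizes, not LongCat’s actual router; every dimension, weight, and the choice of top-2 routing are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS, TOP_K, D = 8, 2, 16                  # toy sizes, not LongCat's
W_router = rng.normal(size=(D, NUM_EXPERTS))      # router scoring weights
W_experts = rng.normal(size=(NUM_EXPERTS, D, D))  # one FFN-like matrix per expert

def moe_forward(x):
    """Route one token vector x through only its top-k experts."""
    logits = x @ W_router                          # score every expert
    top = np.argsort(logits)[-TOP_K:]              # keep the k best-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    # Only the selected experts run; the other NUM_EXPERTS - k stay idle,
    # which is what keeps per-token compute far below total capacity.
    return sum(g * (x @ W_experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=D)
print(moe_forward(token).shape)  # (16,)
```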
Mixture of Experts in plain terms
Think of the system as a large team where only a few specialists are called in for each message. The router selects those specialists based on the content, which reduces unnecessary work. Meituan’s design adds several twists. The report describes Zero-Computation Experts, placeholders the router can choose when it needs to balance load or spend less compute on simple tokens. There is also a PID controller behind the scenes that adjusts expert bias to keep the average number of active parameters steady. PID stands for proportional, integral, derivative; it is a common control method that nudges a system toward a target level without oscillation.
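The sketch below shows the generic PID idea applied to this setting: a routing bias is nudged until a measured average of activated parameters tracks a target. The gains, the target value, and the feedback stand-in are all assumptions for illustration, not values from Meituan’s report.

```python
KP, KI, KD = 0.5, 0.05, 0.1   # proportional, integral, derivative gains (assumed)
TARGET = 27.0                  # target average active parameters, in billions

integral, prev_error = 0.0, 0.0
expert_bias = 0.0

def pid_step(measured):
    """One control step: return the updated routing bias."""
    global integral, prev_error, expert_bias
    error = TARGET - measured           # proportional term: current gap
    integral += error                   # integral term: accumulated gap
    derivative = error - prev_error     # derivative term: gap trend
    prev_error = error
    expert_bias += KP * error + KI * integral + KD * derivative
    return expert_bias

# Toy loop: a stand-in "routing response" drifts toward the target as bias adjusts.
measured = 31.0
for _ in range(5):
    bias = pid_step(measured)
    measured += 0.3 * bias              # hypothetical effect of bias on activation
    print(f"bias={bias:+.3f}  avg_active={measured:.2f}B")
```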
To push throughput, LongCat Flash Chat uses a shortcut-connected MoE layout that increases the overlap between communication and computation. In multi-GPU setups, the time spent shuttling tokens between devices can bottleneck performance. Shortcut connections reorder the pipeline so that parts of the network keep computing while expert traffic is still in flight across devices. The technical notes cite an inference rate of more than 100 tokens per second for a single user on H800-class hardware.
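The general overlap pattern looks something like the PyTorch sketch below. It assumes an already initialized torch.distributed process group and shows the generic trick of starting an asynchronous collective, doing independent work, then waiting; it is not LongCat’s specific shortcut-connected design.

```python
import torch
import torch.distributed as dist

def moe_layer_step(hidden, dense_block, expert_block):
    # Kick off the expert dispatch (all-to-all) without blocking the GPU.
    dispatched = torch.empty_like(hidden)
    work = dist.all_to_all_single(dispatched, hidden, async_op=True)

    # While tokens are in flight between devices, run computation that does
    # not depend on them; a shortcut path gives the scheduler this slack.
    shortcut_out = dense_block(hidden)

    # Only now wait for the communication, then run the local experts.
    work.wait()
    expert_out = expert_block(dispatched)
    return shortcut_out + expert_out
```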
Speed and cost claims
Public summaries and Chinese media reports say Meituan completed training in about 30 days on a highly parallel setup, and that the model can generate 100 tokens per second on H800 accelerators. The company cites an output cost near 5 yuan per million tokens under specific conditions. Actual costs vary with hardware, batch size, and deployment framework, but the headline claim is clear: Meituan is pitching LongCat Flash Chat as a model built for speed, with competitive quality at a lower compute footprint per request.
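As a quick sanity check on what those headline numbers would mean for a single request, here is some back-of-the-envelope arithmetic; the response length is an assumption, and real deployments will differ.

```python
tokens_per_second = 100         # claimed single-user decode rate on H800s
cost_per_million_yuan = 5.0     # claimed output cost under specific conditions

response_tokens = 1_000         # assumed length of a long-ish chat answer
latency_s = response_tokens / tokens_per_second
cost_yuan = response_tokens / 1_000_000 * cost_per_million_yuan

print(f"{response_tokens} tokens -> ~{latency_s:.0f}s, ~{cost_yuan:.3f} yuan")
# 1000 tokens -> ~10s, ~0.005 yuan
```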
Training recipe and stability
The repository outlines a set of stability aids: hyperparameter transfer to reuse good settings across scales, model-growth initialization so larger variants start from smaller ones, and a router gradient-balancing method with z-loss to keep the MoE router healthy. The team emphasizes deterministic computation for reproducible runs. On the data side, Meituan describes a two-stage pretraining fusion strategy and a multi-agent synthesis framework that creates complex task trajectories for agent behavior. The upshot is a training pipeline tuned for reliability and agent performance rather than deep multi-step reasoning.
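The report’s exact formulation is not reproduced here, but the z-loss common in the MoE literature penalizes large router logits so the routing softmax stays numerically stable. A minimal version, under that assumption, looks like this:

```python
import numpy as np

def router_z_loss(router_logits):
    """Penalize large log-sum-exp values of the router logits.
    router_logits: array of shape (num_tokens, num_experts)."""
    lse = np.log(np.exp(router_logits).sum(axis=-1))  # log-sum-exp per token
    return np.mean(lse ** 2)                          # squared, averaged over tokens

logits = np.random.default_rng(0).normal(size=(4, 8))
print(router_z_loss(logits))
```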
How it stacks up against rivals
Meituan’s benchmarks place LongCat Flash Chat on par with China’s leading general models. The comparisons highlight DeepSeek V3.1, Alibaba Cloud’s Qwen3 family, and Moonshot AI’s Kimi K2. The report also claims strong standing against prominent United States models such as Claude Sonnet and Gemini 2.5 Flash, and some coverage notes comparisons with GPT-4.1 on specific nonthinking tasks. Exact rankings depend on the benchmark suite and prompt settings; vendor-run tests are a starting point, and third-party evaluations will give a fuller picture.
The nonthinking label is an important qualifier. Reasoning-oriented systems aim to plan and reflect across multiple steps, often with slower generation and a higher compute bill. LongCat Flash Chat focuses on rapid instruction following, tool use, and agent orchestration. That approach fits use cases where response time, throughput, and predictable cost matter more than elaborate scratchpad reasoning.
What agentic tasks mean in practice
Agentic tasks involve a model that can call tools, retrieve information, follow workflows, and decide when to ask for more input. The engine might browse documentation, query an internal database, trigger a payment API, or draft a customer message. LongCat Flash Chat is tuned for this style of work: it aims to make quick decisions, handle long contexts, and keep latency low so it feels responsive in real applications.
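A stripped-down agent loop makes the pattern concrete. Everything below, including the tool name and the stand-in call_model function, is hypothetical; LongCat’s actual tool-calling format is defined in its own documentation.

```python
import json

# Hypothetical tool registry for a delivery-style business.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "out for delivery"},
}

def call_model(messages):
    # Stand-in for an inference call; a real system would hit the model server.
    if any(m["role"] == "tool" for m in messages):
        return {"content": "Your order A123 is out for delivery."}
    return {"tool": "lookup_order", "args": {"order_id": "A123"}}

def run_agent(user_message, max_steps=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool" in reply:                        # model decided to call a tool
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:                                      # model produced a final answer
            return reply["content"]
    return "Step limit reached."

print(run_agent("Where is my delivery order?"))
```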
Meituan has already experimented with agents across its business. The company has showcased NoCode for AI-assisted coding, Kangaroo Advisor for business decision support, and Meituan Jibai for hotel operations management. A fast agent engine can help power tasks such as merchant onboarding, customer support triage, inventory checks, route-planning support, and marketing copy generation. In e-commerce and local services, many tasks benefit more from reliable speed, tool connectivity, and long-context retention than from slow step-by-step reasoning.
Strategy, acquisitions, and the rise of open source in China
Meituan’s push into core models began in earnest with the purchase of Light Year in 2023 for 281 million dollars. Company leaders have described a three-tier plan: AI for internal work, AI in consumer products, and the building of foundation models. LongCat Flash Chat is the first public proof point at the foundational layer. It also fits a wider trend among Chinese tech firms: open-source releases from companies like Alibaba, DeepSeek, and Moonshot AI are narrowing gaps with Western peers through rapid iteration and community testing.
Open-source licensing matters here. By publishing the weights and code under the MIT License, Meituan invites startups, research groups, and enterprise teams to experiment, adapt, and deploy without heavy constraints. That widens the funnel for feedback and use cases, and it helps Meituan build technical credibility outside its core food delivery identity. Several sources noted that the company frames its AI approach as aggressive and proactive, seeking to win developer mindshare through speed and openness.
Licensing, deployment, and safety
LongCat Flash Chat is released under the MIT License, a permissive license that allows broad commercial use. The repository includes guidance for deployment with vLLM and SGLang, frameworks popular for high-throughput inference, with features like continuous batching and paged attention that can reduce serving cost. A model with partial expert activation pairs naturally with serving environments that reduce idle time on GPUs.
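For orientation, a minimal vLLM invocation might look like the sketch below. The model identifier, parallelism degree, and flags are assumptions for illustration; consult the repository’s deployment guidance for the supported configuration.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meituan-longcat/LongCat-Flash-Chat",  # assumed Hugging Face repo id
    tensor_parallel_size=8,                      # depends on your hardware; a 560B MoE needs many GPUs
    trust_remote_code=True,                      # custom MoE architectures often require this
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize our refund policy for a customer."], params)
print(outputs[0].outputs[0].text)
```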
Like all large language models, LongCat Flash Chat has limits. The documentation urges developers to evaluate safety, fairness, and legal compliance in their own settings. This is especially relevant in domains that handle personal data, finance, or health. Meituan’s materials include safety evaluations and red teaming, and they remind users to follow laws and industry rules when applying the model to sensitive contexts.
Investor reaction and market signals
Financial coverage after the release noted that Meituan’s shares traded lower even as the company joined the list of Chinese firms shipping competitive models. TipRanks cited benchmark coverage from the South China Morning Post and pointed to a wider policy backdrop, with China boosting support for its technology sector while facing trade frictions. Market reactions often lag product reality: what will matter to investors is whether the model drives adoption, whether it reduces costs inside Meituan’s operations, and whether outside developers pick it as an engine for real products.
The bigger signal is that companies far outside core AI circles are now publishing models that challenge established players on speed, cost, and capability in agent use cases. That momentum could reshape how enterprises think about where their AI stack comes from. It could also increase pressure on model makers to prove value in pragmatic, latency-sensitive tasks rather than only on benchmark leaderboards.
Key Points
- Meituan released LongCat Flash Chat, an open-source Mixture-of-Experts model with 560 billion total parameters, on GitHub and Hugging Face in early September 2025.
- The model activates between 18.6 billion and 31.3 billion parameters per token (about 27 billion on average), which lowers compute cost while keeping a large total capacity.
- Benchmarks in the technical report claim performance on par with DeepSeek V3.1, Alibaba Cloud’s Qwen3, and Moonshot AI’s Kimi K2, plus strong standing versus Claude Sonnet and Gemini 2.5 Flash.
- Meituan positions the system as a nonthinking foundation model focused on speed, instruction following, tool use, and agent workflows rather than deep reasoning.
- Engineering features include Zero-Computation Experts, a PID controller for expert bias, a shortcut-connected MoE design, and a 128k-token context window.
- Technical notes cite more than 100 tokens per second on H800 hardware and an output cost near 5 yuan per million tokens under specific conditions.
- The code and weights are under the MIT License, with deployment guidance for vLLM and SGLang.
- The launch follows Meituan’s 2023 purchase of Light Year and supports a broader three-tier AI strategy across work tools, products, and core models.
- TipRanks reported a share price decline after the announcement, suggesting that investor sentiment may depend on real-world adoption and cost savings.
- The release adds to a wave of open-source model activity in China that is closing gaps with Western peers and focusing on fast agent use cases.