Alibaba open sources Qwen3-Next, a faster AI model that costs less and runs on consumer hardware

Asia Daily

A faster and cheaper model aimed at developers and enterprises

Alibaba Cloud has released an open source family of large language models under a new architecture called Qwen3-Next. The lead model, Qwen3-Next-80B-A3B, carries 80 billion parameters but introduces design choices that allow the system to run far faster and at much lower cost than earlier Qwen models. Alibaba says training the new base model required less than one tenth of the compute used for Qwen3-32B, a dense predecessor released in April, while the new series matches the quality of the company’s larger Qwen3-235B model on many tasks.

Alibaba has made the models available for developers as open source, with download and API access through major model hubs and its own cloud. The company positions Qwen as a fast growing developer ecosystem. That strategy aligns with a wider move by Chinese AI groups to publish high quality models that others can inspect and build on, a path that can widen adoption without relying on massive centralized infrastructure.

Alongside the training efficiency claims, Qwen3-Next targets speed and long context performance at inference time. In testing shared by Alibaba, the architecture delivers more than 10 times higher throughput for prompts longer than 32,000 tokens compared with earlier Qwen releases. The context window reaches 256,000 tokens and, in some settings, can be extended to 1 million tokens. This makes the models more comfortable with book length documents, large code repositories, and other heavy input workloads.

What Qwen3-Next changes under the hood

This generation introduces several ingredients that are becoming standard tools for advanced AI. The core is a sparse Mixture of Experts (MoE), a design where many small expert networks are trained and only a few are used for each token. Qwen3-Next pairs that with a hybrid attention mechanism for handling long sequences, multi token prediction to speed up generation, and training fixes that tame instability in reinforcement learning (RL). In the Qwen3-Next-80B-A3B-Base model, only about 3 billion parameters are active at each step, even though the model carries 80 billion parameters in total.

Alibaba says the base model reaches quality similar to, and sometimes better than, the dense Qwen3-32B while using less than 10 percent of its GPU hours to train. During inference, the architecture keeps compute low yet preserves capacity, which is why the team reports strong speedups at long context and a better cost profile for both cloud and consumer grade hardware.

Mixture of Experts in plain language

In a Mixture of Experts, the network learns many specialized sub models called experts. A light gating module decides which experts to call for each token. Qwen3-Next uses a high sparsity setting, which means only a small fraction of experts fire at once. Reports cite 512 experts with 10 active per token in some layers. That makes the compute and memory bill for each forward pass comparable to a much smaller dense network, while the full set of experts gives the model a large total capacity.
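
For readers who want to see the idea in code, here is a minimal sketch of top k expert routing in PyTorch. The expert and top k counts echo the reported 512 and 10 figures, but the layer sizes are toy values, and the Python loop stands in for the fused kernels and load balancing losses a production system would use. None of this is Alibaba's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy top-k Mixture of Experts layer.

    A light gating module scores every expert, then only the top k
    run per token. Expert count and k echo the reported 512 and 10
    figures; sizes are toy values for a quick local run.
    """

    def __init__(self, d_model=256, d_ff=64, n_experts=512, top_k=10):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.gate(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):        # only k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

layer = SparseMoELayer()
tokens = torch.randn(16, 256)
print(layer(tokens).shape)  # torch.Size([16, 256])
```

The simplification keeps the key property: each token touches only `top_k` experts, so the arithmetic per forward pass grows with the active slice rather than the full expert count.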

The trade off is added complexity during training and the need for load balancing so that tokens are spread evenly across experts. Alibaba says it addressed stability issues that often appear when combining sparse MoE with RL style fine tuning. For developers, the important part is that the runtime cost scales with the number of active parameters, not the full parameter count.

Hybrid attention and long context

Attention layers are the parts of a transformer that let a model look across input tokens and decide what matters. Standard attention becomes expensive as inputs get longer because the cost scales with the square of the sequence length. Qwen3-Next mixes standard attention with methods that approximate context in a leaner way, described by the team as Gated DeltaNet combined with classic attention. The result is more throughput on long inputs and fewer slowdowns as prompts grow. Alibaba reports more than 10 times higher throughput for context lengths beyond 32,000 tokens, a 256,000 token window by default, and an extension path to 1 million tokens for special cases. The team also reports near linear scaling gains in some regimes, including up to around 7 times faster prefill and around 4 times faster decode at shorter contexts.
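
The pattern is easier to see in code. The sketch below interleaves a linear time stand in, a simple causal running average rather than a real Gated DeltaNet, with occasional full attention layers. The 3 to 1 ratio and the layer counts are assumptions chosen for illustration, not confirmed Qwen3-Next details, and causal masking in the attention layers is omitted for brevity.

```python
import torch
import torch.nn as nn

class LinearTimeMixer(nn.Module):
    """Stand-in for a linear-time layer such as Gated DeltaNet.

    A real Gated DeltaNet layer keeps a fixed-size recurrent state,
    so cost grows linearly with sequence length. A causal running
    average plays that role here purely for illustration.
    """
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        steps = torch.arange(1, x.size(1) + 1, device=x.device)
        prefix = x.cumsum(dim=1) / steps[None, :, None]  # O(seq) work
        return self.proj(prefix)

class HybridStack(nn.Module):
    """Interleave linear-time layers with occasional full attention.

    The 3-to-1 ratio is an assumption for illustration, not a
    confirmed Qwen3-Next detail.
    """
    def __init__(self, d_model=256, n_heads=4, n_layers=8, full_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            if (i + 1) % full_every == 0 else LinearTimeMixer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                x = x + layer(x, x, x, need_weights=False)[0]  # quadratic cost
            else:
                x = x + layer(x)                               # linear cost
        return x
```

Because most layers do linear work, total cost grows far more slowly with prompt length than in a pure attention stack, while the periodic full attention layers preserve the model's ability to look anywhere in the context.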

Multi token prediction and decoding

Multi token prediction drafts several future tokens at once, then verifies them and keeps only the ones the model would have produced anyway. Combined with speculative decoding, this cuts the number of expensive forward passes needed to stream a response. Qwen3-Next supports this approach and integrates with popular inference stacks such as SGLang and vLLM, which helps keep latency down in production services. The team also mentions Zero Centered RMSNorm and other optimization choices that steady training. The net effect, when paired with sparse MoE, is faster generation without a heavy quality penalty.
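
A rough sketch of the verify step shows why this saves forward passes. Here `target_model` is assumed to map token ids to logits, the draft tokens are assumed to come from a cheap source such as a multi token prediction head, and greedy matching stands in for the rejection sampling a production decoder would use.

```python
import torch

def speculative_decode_step(target_model, context, draft_tokens):
    """One verify step of speculative decoding (greedy, illustrative).

    A single forward pass of the full model scores all k drafted
    positions at once; the longest prefix the full model agrees with
    is kept, so several tokens can be accepted per expensive pass.
    """
    k = draft_tokens.size(0)
    seq = torch.cat([context, draft_tokens])    # context + k guesses
    logits = target_model(seq.unsqueeze(0))[0]  # (len, vocab), one pass
    # Position i predicts token i+1, so these slots verify the drafts.
    preds = logits[len(context) - 1 : len(context) - 1 + k].argmax(-1)
    accepted = 0
    while accepted < k and preds[accepted] == draft_tokens[accepted]:
        accepted += 1                           # keep the matching prefix
    # Even on full rejection we gain one token: the model's own pick.
    next_tok = preds[accepted] if accepted < k else logits[-1].argmax(-1)
    return torch.cat([draft_tokens[:accepted], next_tok.view(1)])
```

If the drafts are usually right, several tokens stream out for each full forward pass; if they are wrong, the step degrades gracefully to ordinary one token decoding.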

Two model variants for instruction and reasoning

On release, Alibaba provided two post trained models on top of the base network. Qwen3-Next-80B-A3B-Instruct targets everyday instruction following, chat, and tool use. In company testing it performs close to the 235 billion parameter flagship in many general tasks and shows strength when the prompt contains very long context, up to 256,000 tokens.

Qwen3-Next-80B-A3B-Thinking is tuned for complex reasoning. Alibaba highlights scores that exceed its own mid tier Qwen3 models and that surpass the closed Gemini 2.5 Flash Thinking on several of the benchmarks it selected for evaluation. The company cites a 60.8 percent result on SuperGPQA, a hard graduate level test set. Benchmarks are useful signals, yet buyers should run their own trials on domain specific workloads, including coding, retrieval augmented generation, math, and multi step planning.

Speed, cost, and hardware footprint

The headline gains are about both money and time. Alibaba says training the base model finished with less than 10 percent of the compute bill of Qwen3-32B. At inference, only about 3 billion parameters are active per token, so the model can answer with lower GPU memory pressure and energy draw. That leads to a smaller hardware footprint for a given level of service, or higher request throughput on the same setup.
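
Some back of envelope arithmetic separates the two costs. The figures below are illustrative only; real deployments also pay for the KV cache, activations, and framework overhead, and the exact numbers depend on quantization.

```python
# Back-of-envelope footprint math for a sparse MoE model (illustrative).

TOTAL_PARAMS  = 80e9   # every expert must stay resident in memory
ACTIVE_PARAMS = 3e9    # parameters that actually run per token

# Weight memory still reflects the full model...
for bits, name in [(16, "bf16"), (8, "int8"), (4, "int4")]:
    print(f"weights at {name}: ~{TOTAL_PARAMS * bits / 8 / 1e9:.0f} GB")

# ...while per-token compute reflects only the active slice,
# at roughly 2 FLOPs per active parameter per generated token.
print(f"compute per token: ~{2 * ACTIVE_PARAMS / 1e9:.0f} GFLOPs "
      f"(a dense 80B model would need ~{2 * TOTAL_PARAMS / 1e9:.0f})")
```

The split explains the consumer hardware pitch: quantization shrinks the weight footprint that must be stored, while sparsity shrinks the arithmetic that must run per token.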

For long prompts, the architecture shines. Alibaba reports order of magnitude throughput gains for inputs longer than 32,000 tokens and solid improvements at shorter sequences too, including around 7 times faster prefill and around 4 times faster decode in some tests. With a 256,000 token window, analysts can feed entire contracts, transcripts, or large code files in one request, then ask the model to summarize, compare, or generate changes. The company emphasizes that the new series is tuned to run well on consumer grade hardware, so advanced prototypes no longer require large clusters.

Availability and access

The Qwen3-Next models are published for broad access. They are listed on common model hubs such as Hugging Face and ModelScope and are also available through Alibaba Cloud Model Studio for managed deployment. Users can call hosted endpoints or download weights for local inference. The series is also accessible through the NVIDIA API Catalog, and some variants are mirrored on Kaggle for easier experimentation.
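
For local experiments, a typical Hugging Face transformers workflow looks like the sketch below. The repository id follows the published Qwen3-Next naming, but confirm the exact id, hardware requirements, and license on the model card before relying on it.

```python
# Minimal local-inference sketch with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # requires accelerate; spreads layers over GPUs
)

messages = [{"role": "user", "content": "Summarize the attached contract."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```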

The release sits alongside utilities such as Qwen3-ASR-Flash, a speech transcription component that supports 11 major languages according to the team. Alibaba encourages developers to review license terms for each model, especially for high traffic or commercial use, then test with their own datasets and workloads. Support for inference frameworks like SGLang and vLLM, along with hardware friendly activation patterns, should shorten the path from a demo to a reliable service.

How this resets competition in AI

The move underscores a broader contest among global labs to make powerful models widely usable at lower cost. Chinese companies are pushing hard on open source AI as a way to spread adoption and catch up with Western rivals on capability. Alibaba says the Qwen ecosystem has become one of the largest open source communities for developers. The group also previewed Qwen3-Max, a model with over 1 trillion parameters that recently ranked sixth on the LMArena chatbot leaderboard, a community evaluation run by researchers. Alongside Alibaba, firms such as DeepSeek and Meituan are releasing competitive models and services.

Cheaper and faster base models matter because compute is scarce and expensive. Export limits on advanced chips add pressure in China, and even in the United States and Europe many teams face GPU shortages. Architectures that keep most parameters idle during inference open the door for smaller labs, startups, and universities to run high quality models on modest servers. Investors are paying attention, with Alibaba stock performance improving this year as the company refocuses on cloud and AI. The competitive effects will spread as more organizations adopt open source AI in daily workflows.

Benchmarks, transparency, and careful evaluation

Model leaderboards are useful, yet they can conceal trade offs. Speed numbers vary by batch size, framework, quantization, and hardware. Accuracy varies by domain, prompt style, and the presence of tools such as code interpreters or retrieval. Vendor tests also pick their own suites. Independent testing by community groups and customers tends to give a fuller picture of strengths and gaps.

Teams evaluating Qwen3-Next should check more than a score on a single benchmark. Measure long context accuracy, tool use, math and code reliability, and how the model behaves under safety constraints. Track memory use and throughput at the context lengths you care about. If your application relies on grounded answers, combine the model with retrieval and check evidence traces. If you need consistent chain of thought, compare the Thinking variant against your current stack.
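
A simple harness along these lines can anchor the throughput side of that comparison. It reuses the `model` from the earlier loading sketch and assumes a CUDA device; treat the results as relative signals, since batch size, quantization, and framework all shift the absolute numbers.

```python
# Throughput check at the context lengths you actually serve.
import time
import torch

def decode_speed(prompt_tokens, new_tokens=128):
    ids = torch.randint(1000, 2000, (1, prompt_tokens), device=model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return (out.shape[-1] - prompt_tokens) / (time.perf_counter() - start)

for n in (4_000, 32_000, 128_000, 256_000):  # match your real workloads
    print(f"{n:>7}-token prompt: {decode_speed(n):.1f} tokens/sec decoded")
```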

Use cases and sectors likely to benefit

Qwen3-Next looks built for applications that need long context, fast responses, and tight cost control. Legal and financial teams can analyze large document sets and create tailored summaries that preserve citations. Customer support leaders can unify knowledge bases and long troubleshooting guides into a single assistant that reasons across thousands of pages. Software teams can feed in repositories and logs to help with code review, refactoring, and test generation.

Researchers can upload lengthy papers and datasets to compare methods and results. Media and education providers can transcribe and translate audio with Qwen3-ASR-Flash, then use the language model to create study guides or content outlines. Governments and regulated industries that require on premises deployment gain from the ability to run a capable model on consumer grade servers with careful monitoring, audit logging, and content filters.

The Bottom Line

  • Alibaba released Qwen3-Next and open sourced the Qwen3-Next-80B-A3B family.
  • Training used less than 10 percent of the compute of Qwen3-32B while matching or exceeding it in tests.
  • Only about 3 billion of 80 billion parameters are active per token thanks to sparse Mixture of Experts.
  • Throughput gains exceed 10 times for prompts above 32,000 tokens, with a 256,000 token context window.
  • Extension to 1 million tokens is available in some settings, with reported prefill and decode speedups at shorter contexts.
  • Two post trained variants cover instruction following and complex reasoning, with company tests that outperform some closed peers.
  • Models are available on Hugging Face, ModelScope, Alibaba Cloud Model Studio, and the NVIDIA API Catalog.
  • Open source access and lower hardware needs broaden adoption across startups, enterprises, and research groups.