A native multimodal challenger goes open source
Alibaba has unveiled Qwen3-Omni, a native multimodal artificial intelligence model that accepts text, images, audio, and video as input and responds in text or synthesized speech. The model is available as open source under the Apache 2.0 license, which permits commercial use, modification, and redistribution with a patent grant. That combination of broad capabilities and permissive licensing places Qwen3-Omni in direct competition with leading closed systems from the United States while lowering the barrier to adoption for startups, researchers, and large enterprises.
- A native multimodal challenger goes open source
- What makes Qwen3-Omni different
- Models, modes, and pricing
- Benchmarks and how to read them
- How it compares to GPT-4o, Gemini, and Gemma
- Enterprise and developer use cases
- Adoption, markets, and strategy
- Voices from Alibaba
- Risks, limits, and what to watch
- Highlights
Qwen3-Omni follows the industry shift toward omni models, a class of systems designed to understand and act across modalities in real time. Unlike stacks that bolt speech or vision on top of a text model, Alibaba says Qwen3-Omni was built as an end to end multimodal system. It processes audio and video directly as it reasons and generates speech through a dedicated speech module, so it can maintain conversational rhythm, tone, and context without waiting for an entire audio or video clip to finish processing.
The first release centers on a 30 billion parameter family that uses a Mixture of Experts design to keep inference fast and scalable. Alibaba reports first packet latencies as low as 234 milliseconds for audio and 547 milliseconds for video, with a real time factor (RTF) below one even under concurrent load. The system supports 119 languages for text, 19 languages for speech input, and 10 languages for speech output, which opens the door to multilingual assistants, call centers, and live translation.
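A note on reading those latency figures: the real time factor divides how long the system spends processing a clip by how long the clip itself lasts, so any value below one means the pipeline keeps up with a live stream. A minimal illustration in Python, with made-up numbers rather than Alibaba's reported measurements:

```python
# Real time factor (RTF): processing time divided by media duration.
# Below 1.0 means the pipeline runs faster than the media plays back,
# which is the requirement for live, streaming conversation.
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    return processing_seconds / media_seconds

# Hypothetical example: a 10 second voice clip fully handled in 6.5 seconds.
print(real_time_factor(6.5, 10.0))  # 0.65, comfortably under real time
```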
The launch arrives as Alibaba positions its cloud business as a full stack AI provider and as Chinese open models surge on developer platforms. Within days, variants of Qwen3-Omni rose to the top of trending rankings on Hugging Face, reflecting strong interest from the global community. The company is also distributing a faster API version and hosting the weights for direct download and local deployment.
What makes Qwen3-Omni different
Qwen3-Omni is designed to think and speak in parallel. The Thinker component handles multimodal reasoning. The Talker component turns the Thinker output into natural speech while conditioning directly on audio and visual features. That split lets developers insert retrieval, safety checks, or business logic into the Thinker stage before any words are spoken, which is useful for regulated workflows.
Thinker and Talker
The Thinker is a large language model trained across text, speech, images, and video with about 2 trillion tokens. It uses Mixture of Experts routing so only a small subset of parameters activates per token. That reduces compute cost and increases throughput. The Talker is a lightweight speech generator that uses a multi codebook autoregressive scheme to preserve timbre and prosody and a compact Code2Wav network to render audio quickly.
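The Mixture of Experts idea is what lets a 30 billion parameter model answer with the compute profile of a much smaller one: a router scores a pool of expert networks for each token and only the top few actually run. The sketch below is a generic top k MoE layer written for illustration; the expert count, hidden sizes, and routing details are placeholders, not Alibaba's implementation.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Generic top-k Mixture of Experts layer, for illustration only.

    Qwen3-Omni's actual router, expert count, and dimensions are not
    published in this article; the sizes below are placeholders.
    """
    def __init__(self, d_model: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token, so most parameters
        # stay idle on any given forward pass.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Quick check with random data: 16 tokens of width 512 in, same shape out.
print(TinyMoELayer()(torch.randn(16, 512)).shape)
```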
Real time streaming and caching
Latency matters in conversation. Alibaba reports streaming with the first audible token in roughly a quarter of a second for audio and about half a second for video contexts. The audio stack includes a new Audio Transformer encoder trained on 20 million hours of supervised audio, mostly Chinese and English ASR data with global coverage for other languages. The encoder supports both online caching for live calls and offline encoding for batch jobs.
Languages and long context
Developers can push long prompts and transcripts. Thinking Mode supports a context window of up to 65,536 tokens, with reasoning chains of up to 32,768 tokens, while Non Thinking Mode supports up to 49,152 tokens. Within that window, the model accepts up to 16,384 tokens of input and can generate up to 16,384 tokens of text output. The speech system covers 10 output languages today, a practical range for global pilots in customer support and field service.
Models, modes, and pricing
Alibaba is releasing three variants of the 30B family so teams can tune for different needs. The Instruct model combines the full Thinker and Talker stack for audio, video, and text inputs, then produces text and speech. The Thinking model focuses on complex reasoning and long chain of thought. It uses the same multimodal inputs but outputs text only, which suits research, analysis, and compliant enterprise records. The Captioner model is a fine tuned specialist for audio captioning. It aims to produce accurate descriptions of music, speech, and mixed audio with low hallucination rates.
Qwen3-Omni is available on model hubs for weight downloads, on GitHub for code, and through Alibaba Cloud as an API, including a faster Flash option for interactive services. New users receive a free quota of one million tokens, valid for 90 days after activation. Paid use is priced per thousand tokens: text input costs 0.00025 dollars, image or video input 0.00046 dollars, and audio input 0.00221 dollars. Text output costs 0.00096 dollars when the input is text only, rising to 0.00178 dollars when the input includes images or audio. When developers request both text and audio output, the audio portion costs 0.00876 dollars per thousand tokens and the text portion is free.
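Those per thousand token rates are easier to evaluate against a concrete request. The snippet below prices one hypothetical voice interaction using only the figures quoted above; the token counts themselves are invented for the example.

```python
# Published per-1,000-token rates quoted above, in dollars.
PRICE_PER_1K = {
    "text_in": 0.00025,
    "image_or_video_in": 0.00046,
    "audio_in": 0.00221,
    "text_out_text_only_input": 0.00096,
    "text_out_multimodal_input": 0.00178,
    "audio_out": 0.00876,  # text portion is free when audio output is requested
}

def cost(tokens: int, rate_key: str) -> float:
    """Cost of a token count at one of the published per-1,000-token rates."""
    return tokens / 1000 * PRICE_PER_1K[rate_key]

# Hypothetical call: 800 tokens of audio input, 200 tokens of text prompt,
# and a spoken reply billed as 500 audio tokens (the text part is free).
total = cost(800, "audio_in") + cost(200, "text_in") + cost(500, "audio_out")
print(f"${total:.5f}")  # roughly $0.00620 for this example
```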
Audio output is available only in Non Thinking Mode via the API. That choice simplifies scheduling for streaming speech and avoids conflicts with long running structured reasoning. Teams that need rich chain of thought can still use the Thinking model to plan or retrieve, then hand the final response to an audio renderer or the Instruct model for speech.
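For teams that want to try spoken output, calls to the Flash API follow the familiar chat completions shape. The sketch below is hypothetical: the base URL, model identifier, and voice name are assumptions to verify against Alibaba Cloud's current documentation, and the example deliberately stays in Non Thinking Mode, since that is where audio output is available.

```python
# Hypothetical sketch of requesting spoken output from the Flash API.
# The endpoint URL, model ID, and voice name below are assumptions;
# only the general shape (an OpenAI-compatible streaming chat call
# that asks for both text and audio) is intended.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # placeholder credential
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

stream = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize today's standup in one sentence."}],
    modalities=["text", "audio"],          # request text plus spoken output
    audio={"voice": "Cherry", "format": "wav"},  # assumed voice name
    stream=True,                            # speech arrives as streamed chunks
)

for chunk in stream:
    # Each chunk carries incremental text and/or encoded audio deltas.
    print(chunk)
```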
Benchmarks and how to read them
Alibaba reports strong results across 36 public tests, with state of the art scores on 22 and leadership among open models on 32. The numbers suggest a balanced system that keeps text and visual quality while pushing ahead in speech. Highlights include reasoning tasks where Qwen3-Omni reaches 65.0 on AIME25 and 76.0 on ZebraLogic. WritingBench lands at 82.6. On audio transcription, the model records word error rates of 4.69 and 5.89 on WenetSpeech, alongside 2.48 on LibriSpeech other, which is very low for that dataset. Music tagging and classification scores, such as 93.0 on GTZAN and 52.0 on RUL-MuchoMusic, indicate a good ear for genre and audio features.
Vision and math benchmarks also show strength. HallusionBench comes in at 59.7, MMMU-Pro at 57.0, and MathVision (full) at 56.3. On video understanding, Qwen3-Omni records 75.2 on MLVU. Those results place it above many closed models on those specific tests, including GPT-4o and Gemini 2.0 Flash, based on the figures Alibaba published.
Benchmarks do not capture every scenario that users face. Real conversations can be noisier, accents vary, lighting changes, and business tasks often chain multiple tools. Early independent reviews point out that smaller models sometimes struggle outside curated tests, even when they shine on leaderboards. Developers should run proof of concept trials with their own data, especially for edge cases like domain jargon, specialized diagrams, or overlapping speakers.
How it compares to GPT-4o, Gemini, and Gemma
Qwen3-Omni lands in a competitive field that includes GPT-4o from OpenAI and Gemini 2.5 from Google. Those systems are strong and already integrated into large platforms. They are closed source. Users pay per call and cannot legally modify the base models or redistribute them. Qwen3-Omni is released under Apache 2.0, a permissive license that allows commercial use, changes to the code and weights, and redistribution. It also includes a patent license. That mix lowers legal risk for companies that need to embed AI directly into products or run it on private infrastructure.
Functionally, Qwen3-Omni accepts text, images, audio, and video and can answer with text or speech. GPT-4o accepts text, images, and audio input and responds with text and speech, but it does not take video clips as input in its core configuration. Gemini 2.5 Pro can analyze video input, and it offers text and speech responses through Google services. Alibaba is positioning Qwen3-Omni as a native multimodal system where speech and vision are first class citizens rather than adapters on top of a text core.
Google also offers Gemma 3n with open weights, released under Google's own Gemma terms rather than Apache 2.0. Gemma 3n accepts video, audio, text, and images, but it outputs text only. For teams that want a fully open stack with live speech output, Qwen3-Omni may be the only current option with a permissive license and speech generation out of the box.
Why Apache 2.0 matters for adopters
Licensing decides where a model can run and how it can be productized. Apache 2.0 allows fine tuning and closed redistribution without forcing the release of derivative models. It includes an express patent grant and clear terms for how contributions are licensed. For regulated industries that scrutinize IP rights, those terms reduce integration risk. Companies can deploy Qwen3-Omni on their own servers, on a chosen cloud, or at the edge, and keep their custom models private.
Enterprise and developer use cases
Qwen3-Omni targets a wide range of jobs. The multilingual text and speech stack can power call center transcription and live translation across 19 input languages and 10 output languages. The audio captioner can describe music, ambient sound, and mixed audio clips for editors, archivists, and assistive tools. The vision pipeline reads screenshots and photos for OCR and layout, which helps automate back office workflows and customer onboarding.
Video understanding is a defining capability. A support agent can analyze a short clip from a phone to identify an error light or a miswired cable. A maintenance app can watch a live stream to confirm that a valve has opened. Because the Thinker and Talker run in parallel, the system can start speaking while it continues to process incoming frames, which keeps a human like pace in hands free scenarios.
The explicit separation of thinking and speaking gives developers control. Retrieval systems can bring in company documents. Policy filters can screen content. Safety checks can examine reasoning traces before the Talker produces audio. This pattern suits consumer assistants that must stay on brand, enterprise summarizers that must avoid data leakage, and field tools that require step by step guidance.
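In practice that control point is just a function boundary: the Thinker's draft text is held, checked, and only then passed to speech synthesis. The sketch below is a hypothetical orchestration layer, not part of any Alibaba SDK; think, policy_filter, and speak stand in for whatever reasoning call, moderation step, and speech renderer a team already uses.

```python
from dataclasses import dataclass

@dataclass
class DraftAnswer:
    text: str
    approved: bool = False
    reason: str = ""

def think(user_turn: str) -> DraftAnswer:
    """Placeholder for the multimodal reasoning call (the Thinker stage)."""
    return DraftAnswer(text=f"Draft reply to: {user_turn}")

def policy_filter(draft: DraftAnswer) -> DraftAnswer:
    """Placeholder business logic: retrieval grounding, PII screening,
    brand rules, and so on, applied before anything is spoken."""
    banned = ("internal", "confidential")
    if any(word in draft.text.lower() for word in banned):
        return DraftAnswer(text="I can't share that.", approved=True,
                           reason="policy rewrite")
    draft.approved = True
    return draft

def speak(draft: DraftAnswer) -> bytes:
    """Placeholder for the Talker / speech renderer; returns audio bytes."""
    assert draft.approved, "never synthesize unreviewed text"
    return draft.text.encode("utf-8")  # stand-in for real audio

audio = speak(policy_filter(think("What changed in the release?")))
print(len(audio), "bytes of (fake) audio")
```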
Tool use and agents
Qwen3-Omni can be wired to external tools and services for complex jobs. That includes calling search, querying databases, or triggering workflow engines. Alibaba is also building agent frameworks in its cloud platform so that low code and high code teams can compose tools, schedule tasks, and deploy to production. With token context windows above 49,000 in Non Thinking Mode, the model can hold long multi step sessions and keep track of previous actions.
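A minimal agent loop around a model like this looks much the same regardless of vendor: the model proposes a tool call, the application executes it, and the result is appended to the conversation for the next turn. The sketch below is generic and entirely hypothetical; call_model is a stub and lookup_order is an invented tool, not part of Alibaba's platform.

```python
import json

def call_model(messages):
    """Stub for a chat call. A real deployment would send `messages` to the
    Qwen3-Omni API and parse its reply; here we fake a single tool round."""
    if any(m["role"] == "tool" for m in messages):
        return {"content": "Order A-1042 has shipped."}  # final answer
    return {"tool": "lookup_order", "arguments": {"order_id": "A-1042"}}

# Hypothetical tools the application exposes to the model.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def run_agent(user_turn: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_turn}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool" not in reply:
            return reply["content"]                     # model is done
        result = TOOLS[reply["tool"]](**reply["arguments"])
        # Append the tool result so the next model turn can reason over it.
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step limit reached"

print(run_agent("Where is order A-1042?"))
```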
Adoption, markets, and strategy
Developer interest has been swift. Variants from the Qwen3 family occupy several top positions on the main model hub trending boards, with Qwen3-Omni-30B-A3B often ranking first. That visibility reflects a broader wave of Chinese open models that are winning attention across research and industry.
Market signals point in the same direction. Alibaba shares rallied on growing confidence in the company's AI and cloud push, with fresh analyst upgrades and renewed interest from global funds. The company is expanding its data center footprint to support customers outside China, with new facilities in Brazil, France, and the Netherlands and additional capacity planned in other regions. Alibaba has also been investing heavily in AI infrastructure, committing to spend tens of billions of dollars over the next three years and indicating that total AI capital outlay will rise.
Hardware partnerships are part of the plan. Alibaba announced a software collaboration with Nvidia to integrate AI development tools for areas such as training, simulation, and validation. The company also previewed a larger Qwen3-Max model with more than one trillion parameters for advanced code generation and agent behavior, which shows how Qwen3-Omni fits into a broader family that spans from interactive assistants to heavy research models.
Voices from Alibaba
Alibaba Group chief executive Eddie Wu set the tone at the company's annual conference in Hangzhou, where he described AI and cloud as the twin engines for the next phase of growth. He also outlined plans to expand international data centers and deepen partnerships around training and deployment.
Wu used his remarks to explain why the company is increasing its AI investment.
“AI and cloud are Alibaba’s engines of growth alongside e-commerce. We will expand spending and capacity to meet global demand.”
Lin Junyang, a researcher on the Qwen team, described how the project drew on earlier work in speech and vision to assemble the new system.
“We combined foundational audio and image projects to build Qwen3-Omni.”
Risks, limits, and what to watch
Even a strong benchmark story leaves open questions that only production use can answer. Real world audio can include heavy background noise, crosstalk, or rare accents. Videos often capture glare, motion blur, or occlusion. If your service needs very high recall on specialized diagrams, or compliance grade transcription in a narrow field, plan time for data curation and domain fine tuning.
The current release supports 10 speech output languages, which already covers many markets, but teams with long tail language needs will require text output or third party speech engines. The Thinker Talker split also brings choices. Some organizations may prefer to keep long reasoning strictly in text, then synthesize speech only for final answers. Others will accept shorter outputs in exchange for faster spoken responses. Those tradeoffs should be tested against user experience goals.
Open source licensing is a strength for product builders, but it also means anyone can adapt the model. Companies should use content filters, abuse detection, and usage monitoring to deter misuse. On the flip side, Apache 2.0 includes a patent grant that reduces intellectual property risk for adopters, a key factor for enterprise compliance teams.
The prize is a multimodal assistant that thinks across text, vision, and sound with near human timing. If Qwen3-Omni keeps proving itself outside curated evaluations, it could become a default choice for teams that want local control, predictable costs, and a broad developer community.
Highlights
- Qwen3-Omni accepts text, image, audio, and video inputs, and outputs text and speech.
- The model is open source under Apache 2.0 with a patent grant for commercial use.
- Thinker handles reasoning, Talker generates speech, both use Mixture of Experts for speed.
- Reported first packet latency is 234 milliseconds for audio and 547 milliseconds for video.
- Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
- Three variants are available: Instruct, Thinking, and Captioner.
- API pricing starts at 0.00025 dollars per thousand text tokens with a 1 million token free trial.
- Benchmark results show leadership across many speech, video, and reasoning tests.
- Hugging Face rankings and investor interest signal fast adoption.
- Alibaba is expanding data centers and deepening Nvidia collaboration while growing the Qwen3 family.