Baidu unveils open source multimodal AI it says tops GPT-5 and Gemini on vision benchmarks

Asia Daily

A new open source challenger in multimodal AI

Baidu has open sourced a new multimodal model, ERNIE 4.5 VL 28B A3B Thinking, a 28 billion parameter system that processes images, video, documents and text. The company says it beats leading closed models on vision heavy benchmarks while using fewer resources. It is licensed under Apache 2.0 for unrestricted commercial use. The claim, if it holds up in independent tests, positions Baidu as a key supplier of enterprise grade visual reasoning models that run on accessible hardware.

ERNIE 4.5 VL 28B A3B Thinking activates about 3 billion parameters per token, a small fraction of its total, by using a Mixture of Experts design that routes each input to the most relevant expert networks. This selective activation reduces compute and memory footprint while preserving the capacity of a much larger model. Baidu says the model can run inference on a single 80 GB GPU and still deliver strong results on document understanding, chart interpretation and complex visual reasoning.

The headline feature is what Baidu calls Thinking with Images. The model can zoom in and out of an image while reasoning, inspect tiny regions for clues, and combine local findings with the wider scene. That behavior matches how people solve visual problems and improves tasks like reading dense technical diagrams, spotting defects on a production line, or grounding references such as "the blue connector under the second bolt."

Baidu also highlights stronger visual grounding, tighter instruction following for image tasks, tool use that includes image search, and robust video understanding with temporal awareness. The model ships with integration kits that work with Transformers, vLLM and FastDeploy, and a training and fine tuning toolkit called ERNIEKit. In a crowded field where most top vision language systems remain closed, this release gives developers and companies a new open option with a business friendly license.


What the model claims to do

Baidu groups the model's abilities into six areas: visual reasoning, STEM problem solving, visual grounding, tool integration, video understanding, and dynamic image analysis. In practice this means the system can read and summarize invoices or financial statements, analyze charts and tables, reason through circuit diagrams and geometry problems, point to precise objects or regions in an image when asked, call external tools to retrieve reference pictures, and track events across video segments.

On leaderboards shown in Baidu documentation, the model is presented as matching or surpassing versions of OpenAI GPT-5 High and Google Gemini 2.5 Pro on a set of vision focused tests. Those include chart question answering, document understanding, and multi step visual reasoning benchmarks. The company also highlights speed advantages, claiming two to three times faster inference than comparable models with similar accuracy.

Independent labs have not yet verified these charts. Benchmark suites often reward pattern recall, prompt formatting, or narrow tricks rather than reliability in messy real deployments. Companies evaluating ERNIE 4.5 VL 28B A3B Thinking will want to run targeted pilots with their own documents, lighting conditions, camera angles and security policies before making production commitments.

Even with that caution, the scope of tasks covered is wide. The mix of reasoning, grounding, and tool use makes the model a candidate for agent style workflows such as reading incoming claims in insurance queues, flagging anomalies in warehouse operations, or helping technicians troubleshoot equipment with annotated photos and short videos.


How a 28B model runs on one 80 GB GPU

The key is Mixture of Experts, a design where many specialist subnetworks sit behind a router. For each token or image patch, the router selects a small subset of experts to activate. Only those experts compute, so the system behaves like a small model on each step, while the full pool of experts provides larger capacity when you look across many steps and tasks. This structure improves throughput because most parameters lie idle during any given operation.

Selective activation cuts memory needs and compute cost. A 28 billion parameter dense model would require far more hardware to run at useful speeds. By activating around 3 billion parameters per token, Baidu can claim single GPU inference for the vision language model, which opens testing to many labs. The tradeoff is added complexity. Engineers need a fast router, careful load balancing, and stable training techniques to prevent a few experts from doing all the work.
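To make the idea concrete, the sketch below shows a generic top-k Mixture of Experts layer in PyTorch. The layer sizes, expert count and routing details are illustrative assumptions, not ERNIE's actual architecture; the point is only that each token touches a couple of experts rather than the whole parameter pool.

```python
# Minimal sketch of top-k expert routing in PyTorch. Sizes, expert count and
# top_k are illustrative assumptions, not ERNIE's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)            # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                      # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)   # keep only the best-scoring experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():           # only the chosen experts compute
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        # Production systems also add an auxiliary load-balancing loss on the
        # router scores so a few experts do not absorb all the traffic.
        return out
```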


Thinking with images and precise grounding

Traditional vision language models take a fixed resolution image and produce a text answer. Thinking with Images adds a planning loop. The model breaks a question into steps, proposes where to look next, crops a region at high resolution, reads text or measures geometry inside that region, then updates its plan. This loop repeats until enough evidence is gathered to answer. The capability also works in reverse when the task asks for pointing, by mapping words back to pixel locations and drawing a box or a mask.
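The loop can be pictured as a short program. The sketch below is a hypothetical outline of that crop-and-inspect cycle; the helper methods such as propose_region, describe, answer_ready and answer are placeholders for illustration, not part of Baidu's published interface.

```python
# Hedged sketch of a zoom-and-inspect loop. The model methods used here are
# hypothetical placeholders, not Baidu's actual API.
from PIL import Image

def thinking_with_images(model, image: Image.Image, question: str, max_steps: int = 6):
    evidence = []
    for _ in range(max_steps):
        # The model proposes the next region worth inspecting, as a pixel box.
        box = model.propose_region(image, question, evidence)   # (left, top, right, bottom)
        if box is None:                                          # nothing left to check
            break
        crop = image.crop(box)                                   # high-resolution local view
        finding = model.describe(crop, question)                 # read text, measure geometry
        evidence.append({"box": box, "finding": finding})
        if model.answer_ready(question, evidence):               # enough evidence gathered
            break
    # Combine the local findings with the full scene for the final answer.
    return model.answer(image, question, evidence), evidence
```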

Better grounding matters when images are cluttered or technical. In manufacturing, two parts can look nearly identical until you zoom. In engineering drawings, a few pixels of notation change the meaning of a symbol. In medical imaging, tiny artifacts can swing a diagnosis. A system that can shift between scene level context and fine details, while explaining the steps it took, gives operators a clearer audit trail and a path to correct mistakes.

Benchmarks vs real workloads

Vendor charts often pool scores from many evaluations. Some benchmarks are clean images with perfect crops. Others use questions that the training set already covers. Real deployments introduce glare, motion blur, odd fonts, occlusion and multilingual labels. Performance can swing when inputs are outside the training distribution. This is why the headline claim of beating GPT-5 or Gemini on selected tests should be treated as a starting point, not a conclusion.

A more reliable approach is task specific evaluation. For documents, teams can assemble a few hundred representative files and measure exact field extraction accuracy with a human audited answer key. For vision tasks, they can simulate production conditions, then track precision and recall on boxes or masks. For charts, they can seed synthetic plots with edge cases like log scales or stacked series. This yields a clearer view of model behavior and failure modes.
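For the document case, a minimal scorer against a human-audited answer key might look like the sketch below; predict_fields stands in for whatever extraction pipeline is under test, and exact string match is the simplest possible criterion.

```python
# Sketch of exact-match field extraction scoring against a human-audited key.
# `predict_fields` is a placeholder for the extraction pipeline being evaluated.
def field_accuracy(documents, answer_key, predict_fields):
    correct, total = 0, 0
    per_field = {}
    for doc_id, truth in answer_key.items():          # truth: {"invoice_no": "INV-001", ...}
        pred = predict_fields(documents[doc_id])
        for field, expected in truth.items():
            total += 1
            hit = pred.get(field, "").strip() == expected.strip()
            correct += hit
            ok, n = per_field.get(field, (0, 0))
            per_field[field] = (ok + hit, n + 1)
    overall = correct / total
    breakdown = {f: ok / n for f, (ok, n) in per_field.items()}
    return overall, breakdown                         # overall plus per-field accuracy
```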


Training recipe and developer stack

ERNIE 4.5 models use a training path that mixes classic supervised learning with reinforcement learning on multimodal tasks. The documentation mentions strategies called GSPO and IcePop to stabilize the router and improve reward shaping when training an MoE, paired with dynamic difficulty sampling so the model sees harder examples as it improves. The result is a system that can plan multi step reasoning over pictures, charts and video, rather than guessing in a single pass.
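Dynamic difficulty sampling can be illustrated with a simple weighted sampler that favors examples the current model still gets wrong. The weighting scheme below is an assumption for illustration only, not Baidu's published recipe.

```python
import random

# Illustrative dynamic difficulty sampling: examples the model currently fails
# are drawn more often, so training concentrates on harder cases over time.
# The scoring and weighting choices here are assumptions, not Baidu's recipe.
def sample_batch(examples, recent_accuracy, batch_size=32):
    # recent_accuracy: dict mapping example id -> rolling success rate in [0, 1]
    weights = [1.0 - recent_accuracy.get(ex["id"], 0.0) + 0.05 for ex in examples]
    return random.choices(examples, weights=weights, k=batch_size)
```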

The model family spans text only and vision language variants, from very small dense models to massive MoE systems with hundreds of billions of total parameters. Baidu says the architecture allows parameter sharing across modalities, so text and vision components can benefit from each other without pulling down language quality. Training and inference run on the company's own platform, PaddlePaddle, which the team reports achieves high hardware utilization at large scale.

For developers, Baidu ships a set of tools. ERNIEKit supports fine tuning, preference optimization and quantization for resource efficient deployment. The vision language model can be served through common stacks such as Transformers and vLLM, and production inference can be handled with FastDeploy. The Apache 2.0 license removes many constraints on commercial use. The official model page is available on Hugging Face and the code and recipes live in Baidu's GitHub repository.
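A hedged loading example with Transformers is sketched below. The repository id, processor behavior and prompt handling are assumptions based on common Hugging Face conventions, so the official model page remains the reference for exact usage.

```python
# Hedged sketch of loading the released checkpoint with Hugging Face Transformers.
# The repo id, processor behavior and prompt format are assumptions based on
# common conventions; consult the official model page for the exact recipe.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"   # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open("quarterly_chart.png")          # illustrative local file
inputs = processor(text="Summarize this chart.", images=image,
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```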


What this means for enterprises

Open licensing and single GPU inference make the model attractive to teams that want private, on premises visual AI without sending data to a cloud provider. Document heavy sectors can try intelligent intake for contracts, invoices and claims. Manufacturers can build quality inspection stations that combine vision and language guidance for line workers. Retail and logistics teams can use grounding to locate items and verify labels in warehouses. Customer service groups can enrich chat answers by reading embedded charts or screenshots.

Costs matter. Many organizations cannot reserve clusters of H100 class GPUs for a single application. A system that runs on one 80 GB card reduces entry barriers for pilots and early deployments. The mix of grounding, tool use and video understanding also lends itself to agent style workflows, where the model reads, searches, clicks, and reasons across several steps to complete a task. That aligns with the shift toward computer using agents that operate standard interfaces rather than custom APIs.

The wider race: open models vs closed services

The release lands in a busy year for open models. DeepSeek introduced a very large MoE language model with hundreds of billions of total parameters and only a few dozen billion active per token, reporting strong scores on general benchmarks at a fraction of the training cost seen in closed systems. Alibaba's Marco-o1 pushed on reasoning with techniques like chain of thought fine tuning and search guided planning to handle open ended tasks that lack clear numeric rewards.

Chinese labs, including Baidu, Alibaba and others, are increasingly shipping capable models under permissive licenses. Open access gives regional developers a way to build with fewer dependencies on proprietary services. It also pressures global providers to justify premium pricing by delivering clear advantages in reliability, tooling or integrations. The gap between open and closed performance has narrowed in many areas, especially for specialized tasks like chart analysis and document extraction.

Limits and unanswered questions

There are tradeoffs. The vision language model still needs an 80 GB GPU for best results, which rules out many consumer devices. The context window is listed at 128K tokens, helpful for long PDFs but still finite when you combine many images, OCR text and instructions. MoE routing can complicate deployment at scale, since engineers must monitor expert balance and guard against latency spikes caused by routing hotspots.


Safety remains a major topic. Image systems can be tricked by adversarial patterns or spurious correlations. A model that actively crops and zooms must avoid confirmation bias, where the plan keeps looking at the wrong region. Enterprises will expect clear audit logs of each step in the chain of thought when the model runs in thinking mode, or at least structured traces of tool calls and bounding boxes. Baidu says the model underwent multimodal reinforcement learning and grounding alignment, yet broad third party evaluations will be needed to build trust.
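One minimal form such a structured trace could take is sketched below. The schema is hypothetical and only illustrates the kind of record an auditor might review; it is not a format Baidu has published.

```python
import json

# Hypothetical audit trace for one zoom-and-inspect run. The schema illustrates
# the kind of record enterprises may require; it is not Baidu's actual format.
trace = {
    "question": "Is the safety pin seated on valve 3?",
    "steps": [
        {"action": "zoom", "box": [412, 118, 640, 290], "finding": "pin partially visible"},
        {"action": "tool_call", "tool": "image_search", "query": "valve 3 reference photo"},
        {"action": "zoom", "box": [455, 150, 540, 220], "finding": "pin fully seated"},
    ],
    "answer": "Yes, the pin appears to be seated.",
}
print(json.dumps(trace, indent=2))   # log or store alongside the model's answer
```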

Key Points

  • Baidu released ERNIE 4.5 VL 28B A3B Thinking, an open source multimodal model under Apache 2.0 for commercial use.
  • The model claims wins over GPT-5 High and Gemini 2.5 Pro on vision heavy benchmarks, pending independent validation.
  • It uses a Mixture of Experts design, activating about 3 billion of 28 billion parameters per token to cut compute and memory.
  • Inference can run on a single 80 GB GPU, lowering the barrier for pilots and on premises tests.
  • Headline features include Thinking with Images for zoom and inspect workflows, stronger visual grounding, and video understanding.
  • Six core capabilities target visual reasoning, STEM tasks, grounding, tool use, video analysis, and dynamic image planning.
  • Training includes multimodal reinforcement learning with GSPO and IcePop and dynamic difficulty sampling to stabilize MoE.
  • Developer tools include ERNIEKit, compatibility with Transformers, vLLM and FastDeploy, plus a public model page and code repository.
  • Use cases span document processing, manufacturing quality control, warehouse operations, and customer support.
  • Limits include an 80 GB GPU requirement, a 128K context window, MoE routing complexity, and open questions on robustness and safety.