A new way to squeeze long documents into AI context
DeepSeek has released DeepSeek-OCR, a vision language system that compresses long passages of text by converting them into images, then recovers the content when the model needs it. The approach reduces the number of tokens a large language model must handle by a factor of seven to twenty, depending on settings and document type, while keeping high precision at moderate compression. DeepSeek says the system reaches about 97 percent decoding accuracy around the ten times compression point, and still returns useful results at higher ratios. The project ships with code and weights that developers can use today, and it targets one of the most expensive pain points in modern AI, handling long context at a reasonable cost.
- A new way to squeeze long documents into AI context
- How converting text to images reduces tokens
- Inside the model: DeepEncoder and the decoder
- What benchmarks and precision numbers show
- What this could change for AI cost, memory and product design
- Limitations, risks and open questions
- How to try DeepSeek-OCR today
- Key Points
The system works in two stages. A vision encoder called DeepEncoder turns text pages, tables, charts, and other layouts into a compact set of vision tokens. A lightweight language model, DeepSeek3B-MoE-A570M, then decodes those tokens back into structured text. This setup preserves layouts and formatting when needed, supports many languages, and can output plain text or structured results such as Markdown tables.
Why this matters comes down to tokens, the basic units that language models read and write. More tokens mean higher memory use and latency, which push up costs. Models also have hard limits on how much context they can see at once. By storing context as images and revisiting it through vision tokens, DeepSeek-OCR offers a cheaper memory tier for long sessions and large documents, with the potential to expand the effective context window far beyond what text alone allows.
Scale is a core claim. Training used about 30 million PDF pages in close to 100 languages, with extra synthetic data for charts, formulas, and figures. On the operations side, the team reports that a single Nvidia A100 GPU can process more than 200,000 pages per day. A cluster of 20 similar servers can push this to roughly 33 million pages daily. These figures hint at fast pipelines for document digitization and for generating training data at scale.
How converting text to images reduces tokens
Language models read input as tokens, not as words or characters. A typical article or report can turn into thousands of tokens. Attention mechanisms inside most transformers compare many token pairs, and that work grows quickly as sequences get longer. Long context drives up memory and time, and at large scale that becomes a budget problem.
What are tokens and context windows
Tokens are slices of language that models understand. A short word might be one token, a longer word may split into several, and punctuation counts too. The maximum number of tokens the model can consider at once is the context window. Many real world files exceed that budget, so users either truncate documents or pay high costs for special long context models.
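To make this concrete, here is a small illustration using an off the shelf tokenizer from the Transformers library. GPT-2 is used purely as an example vocabulary, not the one DeepSeek models use, and the exact splits will vary by tokenizer.

```python
from transformers import AutoTokenizer

# Illustration only: GPT-2's tokenizer, not the vocabulary DeepSeek uses.
tok = AutoTokenizer.from_pretrained("gpt2")
for text in ["cat", "tokenization", "Hello, world!"]:
    print(text, "->", tok.tokenize(text))

# Short common words usually map to a single token, longer or rarer words
# split into several pieces, and punctuation is tokenized separately.
```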
Why pixels can be cheaper than text for storage
Vision encoders turn an image into a smaller grid of patches, then compress that grid into a set of vision tokens. DeepSeek’s encoder uses a compression block to cut the token count before passing features to the decoder. An example in the technical notes describes a 1,024 by 1,024 pixel image that would begin with 4,096 patch tokens, then drop to 256 compressed tokens before flowing into the vision language stack. If those 256 tokens capture the essence of a dense page that would otherwise be 1,500 or more text tokens, the savings are immediate.
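The arithmetic behind that example is easy to reproduce. The sketch below assumes 16 pixel patches and a 16 times compression step, chosen only because they reproduce the figures above; the real patching and compression details live in the model code.

```python
# Token arithmetic for the 1,024 by 1,024 example above. The 16 px patch size
# and 16x compression factor are assumptions chosen to match those figures,
# not confirmed internals of DeepEncoder.

def vision_token_budget(width_px: int, height_px: int,
                        patch_px: int = 16, compression: int = 16) -> tuple[int, int]:
    """Return (raw patch tokens, compressed vision tokens) for one page image."""
    raw_patches = (width_px // patch_px) * (height_px // patch_px)
    return raw_patches, raw_patches // compression

raw, compressed = vision_token_budget(1024, 1024)
print(raw, compressed)       # 4096 raw patches -> 256 vision tokens
print(1500 / compressed)     # ~5.9x fewer tokens than a 1,500-token text page
```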
The trade is simple. Instead of storing long context as text tokens, the system stores snapshots as compact vision tokens, then reconstructs meaning on demand. That keeps memory budgets in check, helps preserve layout and diagrams, and avoids some quirks of text tokenizers. The cost comes in the added encode and decode steps and in a gradual loss of precision at very high compression ratios.
Inside the model: DeepEncoder and the decoder
DeepEncoder blends two well known ideas, local perception and global understanding. It applies a Segment Anything Model backbone for fine grained regions and edges, and a CLIP style module for broader semantic layout. Between them sits a convolutional compressor that slashes the number of vision tokens before handing them to the language stack. The encoder supports multiple resolution modes, so users can dial in a token budget that matches the complexity of each page. Simple slides can use few tokens. Crowded news pages or dense scientific figures may need more.
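The shape of that pipeline can be sketched in a few lines of PyTorch. This toy module is illustrative only; it swaps the SAM and CLIP components for a plain convolution and a single transformer layer, and keeps only the part that matters here, the token count dropping from 4,096 to 256 before any global attention runs.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the encoder idea: local features, a convolutional
# compressor that cuts the token count, then a global attention pass.
# The modules and sizes are placeholders, not DeepEncoder's real components.

class ToyDeepEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.local = nn.Conv2d(3, dim, kernel_size=16, stride=16)      # 16 px patches -> local features
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)   # 16x fewer spatial tokens
        self.global_attn = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.local(image)                  # (B, dim, 64, 64) for a 1024x1024 input
        feats = self.compress(feats)               # (B, dim, 16, 16) -> 256 tokens
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 256, dim)
        return self.global_attn(tokens)            # global context over compressed tokens

page = torch.randn(1, 3, 1024, 1024)
print(ToyDeepEncoder()(page).shape)                # torch.Size([1, 256, 256])
```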
The decoder is a Mixture of Experts language model built around a three billion parameter design with about 570 million active parameters per token. Expert routing helps the model focus on different content types without activating the full three billion parameters for every token. This saves compute while keeping quality high. The system can output plain text, preserve structure as Markdown, and even extract data from tables and charts into machine friendly formats. It supports multilingual OCR and can interpret diagrams, chemical structures, and geometric figures.
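The idea of activating only a slice of a large decoder per token is easiest to see in a generic top-k Mixture of Experts layer. The sketch below is not DeepSeek's architecture; the expert count, hidden sizes, and routing are placeholders that only illustrate why active parameters can be a small fraction of total parameters.

```python
import torch
import torch.nn as nn

# Generic top-k MoE feed-forward layer. Each token is processed by only k of
# the experts, so the parameters touched per token are a fraction of the total.
# Sizes and routing here are arbitrary, not DeepSeek's configuration.

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its top-k experts only.
        weights, idx = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(TinyMoE()(tokens).shape)   # torch.Size([10, 512])
```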
Configuration presets include Tiny, Small, Base, and Large, plus a dynamic mode for very complex pages. The presets balance image size and token count. DeepSeek recommends starting with Small for typical pages, then stepping up only when the content is dense or the layout is challenging.
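In application code, the presets can be treated as a simple lookup from page type to token budget. The resolutions and token counts below mirror figures reported for the release, but treat them as approximate and check the repository before depending on them.

```python
# Hypothetical preset table for picking a token budget per page. Values are
# approximate and should be verified against the published model documentation.
PRESETS = {
    "tiny":  {"resolution": (512, 512),   "vision_tokens": 64},
    "small": {"resolution": (640, 640),   "vision_tokens": 100},
    "base":  {"resolution": (1024, 1024), "vision_tokens": 256},
    "large": {"resolution": (1280, 1280), "vision_tokens": 400},
}

def pick_preset(page_kind: str) -> str:
    """Start with Small and step up only for dense layouts, as recommended above."""
    dense = {"newspaper", "scientific_figure", "multi_column"}
    return "base" if page_kind in dense else "small"

print(pick_preset("slide"))       # 'small'
print(pick_preset("newspaper"))   # 'base'
```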
What benchmarks and precision numbers show
Compression is only useful if the decoder can recover the text with high fidelity. At compression ratios under ten times, DeepSeek reports around 97 percent accuracy on decoding tests. That drops to about 60 percent around twenty times compression. The team also shows examples where 100 vision tokens represent documents that would have consumed 700 to 800 text tokens, with precision still around 97 percent on a focused benchmark. These figures line up with the claim that medium compression yields the best mix of savings and accuracy.
On OmniDocBench, a common document understanding test, the model outperforms well known OCR systems while using fewer tokens. Token needs vary by document category. Slides and simple brochures may need around 64 vision tokens per page. Magazines or newspapers with many columns and images can require several hundred. The system can preserve formatting or strip it, depending on user settings.
Operational metrics are aggressive. The team cites more than 200,000 pages processed per day on one Nvidia A100 card, and about 33 million pages per day on a 20 server cluster. Training and inference code, along with model weights, are available with a permissive license. That combination invites rapid experimentation in research and production environments.
What this could change for AI cost, memory and product design
Many AI products hit a wall when conversations or document chains get long. Storing earlier context as vision tokens gives models a cheaper memory layer. A system can summarize or compress old turns into images, then decode the details only when needed. That keeps the active text context lean while preserving the ability to recall facts, citations, and layouts from earlier stages. It also helps models handle pages with complex tables or math that traditional tokenizers treat poorly.
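A minimal sketch of that memory tier, assuming old turns are rendered to a page image with Pillow and re-read later through an OCR pass. The rendering helper and the decode placeholder are illustrative; in a real system the decode step would call a model such as DeepSeek-OCR.

```python
from dataclasses import dataclass
from PIL import Image, ImageDraw, ImageFont

# Hypothetical "cold memory" tier: older conversation turns are rendered to a
# page image so they can later be re-read through vision tokens instead of
# being kept in the text context.

@dataclass
class MemoryPage:
    image: Image.Image
    turn_range: tuple[int, int]   # which conversation turns this page covers

def render_turns(turns: list[str], start: int, size=(1024, 1024)) -> MemoryPage:
    page = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()
    y = 10
    for text in turns:
        draw.text((10, y), text, fill="black", font=font)
        y += 14
    return MemoryPage(image=page, turn_range=(start, start + len(turns) - 1))

def decode_page(page: MemoryPage) -> str:
    # Placeholder: send page.image through the OCR model and return the
    # reconstructed text when the details are actually needed.
    raise NotImplementedError

# Keep the latest turns as text, push everything older into image memory.
history = [f"turn {i}: ..." for i in range(40)]
cold_memory = render_turns(history[:-8], start=0)
active_context = history[-8:]
```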
For users, this unlocks practical workflows. An assistant can ingest full length PDFs, contracts, and spreadsheets, keep a compact memory, then restore exact sections with their structure when asked. Copying rows from a table or extracting a figure caption becomes straightforward because the decoder reconstructs structure, not just raw characters. This also hints at efficient enterprise search that indexes compressed page snapshots, then decodes snippets on demand.
For builders and researchers, throughput matters. The system can generate large training corpora for language and vision models, producing hundreds of thousands of pages per GPU per day. That speed helps create supervised data at lower cost. Industries that rely on charts and forms, like finance, healthcare, and science, may see faster digitization pipelines with better preservation of structure. Care is still needed when compressed outputs feed future training, since small decoding errors can turn into noisy labels at scale.
Limitations, risks and open questions
Higher compression yields smaller token budgets but lowers fidelity. Around twenty times compression, the reported accuracy falls to about 60 percent. Many tasks tolerate minor mistakes. Others, like names, numbers, and legal clauses, demand near perfect recovery. Users will need policies that push critical pages into lower compression modes and reserve aggressive compression for context that does not require exact reproduction.
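One way to express such a policy in code, with thresholds that are assumptions rather than recommendations from DeepSeek:

```python
# Route precision-critical pages to conservative settings and keep the
# effective compression ratio near the range where reported accuracy is high.
# The thresholds and mode names below are assumptions for illustration.

def choose_mode(contains_critical_fields: bool, estimated_text_tokens: int,
                max_safe_ratio: float = 10.0) -> str:
    if contains_critical_fields:
        return "large"        # names, numbers, legal clauses: favor fidelity
    # Base mode uses roughly 256 vision tokens; stay near or below ~10x compression.
    if estimated_text_tokens / 256 <= max_safe_ratio:
        return "base"
    return "dynamic"          # very dense pages: higher resolution tiling

print(choose_mode(False, 1800))   # 'base' (about 7x compression in base mode)
print(choose_mode(True, 1800))    # 'large'
```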
There is also a reasoning question. Models reason well over text tokens because they were trained that way. Vision tokens carry different inductive biases. The decoder must reconstruct text first, then reason. That extra step might affect tasks that rely on long chains of thought. Teams will want to test multi step reasoning when important context sits behind the image layer, and consider fallbacks for precision critical tasks.
Latency and infrastructure planning remain relevant. The extra encode and decode steps add work. The savings appear once documents grow large or conversations get long. Deployments should benchmark wall clock time, memory, and cost across several document types to set thresholds for when to compress and when to pass raw text. Storage formats also change. Images or image like tensors replace raw text files, which may affect retrieval systems and compliance pipelines.
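A starting point for that benchmarking could be a small timing harness like the one below. The run_raw_text and run_compressed callables are placeholders for the two paths in a given pipeline, and memory and cost tracking would be layered on top.

```python
import time

# Minimal wall-clock comparison per document type, to decide when the
# compressed path pays for its extra encode and decode work.

def benchmark(docs: dict, run_raw_text, run_compressed) -> None:
    for name, doc in docs.items():
        t0 = time.perf_counter(); run_raw_text(doc);   raw_s = time.perf_counter() - t0
        t0 = time.perf_counter(); run_compressed(doc); comp_s = time.perf_counter() - t0
        print(f"{name}: raw {raw_s:.2f}s, compressed {comp_s:.2f}s")
```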
Privacy and security deserve attention. Compressing sensitive files into images or vision tokens does not remove risk. Access controls, encryption, and audit trails must apply to the compressed store. Some documents, like low resolution scans or messy handwriting, will still challenge the model. Dynamic mode selection and human review remain prudent for critical workflows.
How to try DeepSeek-OCR today
The project page on Hugging Face provides model cards, quick start notes, and downloads. The code and weights are also on GitHub with a permissive license that covers both academic and commercial use. Developers can start from the Hugging Face model page or the GitHub repository.
The reference setup lists Python 3.12.9 and CUDA 11.8, with dependencies that include PyTorch, Transformers, Tokenizers, Einops, Addict, Easydict, and Flash Attention. An Nvidia GPU is recommended for speed. Preset modes let users trade detail for cost. For standard business documents, the Small mode is a reasonable starting point. Dense scientific or legal pages may benefit from higher resolution or the dynamic mode.
A typical workflow converts a PDF into page images, feeds each page into DeepEncoder with a chosen resolution, then decodes the resulting vision tokens with the language model. Outputs can be plain text for search or Markdown that preserves tables and section headers. The GitHub repository includes scripts for batch processing and throughput tips. Teams should monitor accuracy when raising the compression ratio and keep critical pages at conservative settings.
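Assuming the published checkpoint is named deepseek-ai/DeepSeek-OCR on Hugging Face and exposes the infer helper shown on its model card, a batch workflow might look roughly like the following. The prompt string, the infer arguments, and the pdf2image dependency are taken or inferred from the public examples and may change, so treat this as a sketch rather than the canonical script.

```python
from pathlib import Path
from pdf2image import convert_from_path          # requires poppler on the system
from transformers import AutoModel, AutoTokenizer

# Sketch of the PDF -> images -> vision tokens -> text workflow described above.
# The checkpoint name, prompt format, and infer() arguments follow the public
# model card as assumptions; check the repository README for the current API.

MODEL_ID = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval().cuda()

def ocr_pdf(pdf_path: str, out_dir: str = "ocr_out") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for i, page in enumerate(convert_from_path(pdf_path, dpi=200)):
        image_file = f"{out_dir}/page_{i:04d}.png"
        page.save(image_file)
        # Lower image_size to save tokens, raise it for dense or critical pages.
        model.infer(tokenizer,
                    prompt="<image>\n<|grounding|>Convert the document to markdown.",
                    image_file=image_file, output_path=out_dir,
                    base_size=1024, image_size=640, crop_mode=True)

ocr_pdf("report.pdf")
```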
Key Points
- DeepSeek-OCR compresses long text by storing it as images, then recovers the content on demand with a decoder.
- Token use drops by roughly seven to twenty times depending on settings and document complexity.
- Reported decoding precision is about 97 percent near ten times compression, and about 60 percent around twenty times.
- The architecture pairs DeepEncoder with a Mixture of Experts decoder that activates about 570 million parameters per token.
- Modes range from Tiny to Large, plus a dynamic option for very complex pages, letting users set explicit token budgets.
- Benchmarks show strong results on OmniDocBench while using fewer tokens than popular OCR baselines.
- Throughput claims include more than 200,000 pages per day on one Nvidia A100 GPU and about 33 million pages per day on a 20 server cluster.
- Structured outputs include Markdown tables and chart data, with support for diagrams, formulas, and many languages.
- Open source code and weights are available on Hugging Face and GitHub under a permissive license.
- Best use cases include cheaper long context memory, large scale document digitization, and training data generation, with careful settings for precision critical work.