dots.ocr: SOTA Multilingual Document Parsing with 1.7B Parameters - Complete Analysis
Introduction
A significant shift is taking place in the field of document parsing. Traditionally, document layout detection and text recognition required multiple independent models chained together in a pipeline. However, dots.ocr, released by the RedNote research team, integrates all of these tasks into a single vision-language model (VLM) while achieving state-of-the-art (SOTA) performance.
Notably, despite its relatively small size of 1.7B parameters, the model delivers formula recognition comparable to much larger models such as Doubao-1.5 and Gemini 2.5 Pro, alongside its SOTA results on text, tables, and reading order. This makes it a strong example of practical AI system design that pursues efficiency and performance together.
Core Features of dots.ocr
1. The Innovation of a Unified Architecture
The most significant innovation in dots.ocr is that a single vision-language model performs all of the following tasks concurrently:
- Layout detection: Identifying regions containing text, tables, images, formulas, and other elements
- Text recognition: Accurate text extraction via OCR
- Reading order: Ordering elements in the sequence a human would read
- Format conversion: Producing output in appropriate formats such as Markdown, HTML, and LaTeX
What once required a complex multi-model pipeline is now handled by one model, and switching between task modes only requires changing the prompt.
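As a minimal sketch of this prompt-based switching, the helper below maps a task mode to one of the prompt names used later in this article and builds the matching parser command. The mode labels, the `MODE_TO_PROMPT` table, and `build_parser_command` are illustrative names assumed here and are not part of dots.ocr.

```python
# Hypothetical helper: the mode labels and function name are assumptions
# for illustration; only the prompt names and CLI flags come from this article.
MODE_TO_PROMPT = {
    "layout_only": "prompt_layout_only_en",
    "ocr_only": "prompt_ocr",
    "region_ocr": "prompt_grounding_ocr",
}


def build_parser_command(input_path, mode=None, bbox=None):
    """Return the argument list for dots_ocr/parser.py in the given mode."""
    cmd = ["python3", "dots_ocr/parser.py", input_path]
    if mode is None:
        return cmd  # default: full layout analysis and recognition
    if mode not in MODE_TO_PROMPT:
        raise ValueError(f"unknown mode: {mode}")
    cmd += ["--prompt", MODE_TO_PROMPT[mode]]
    if mode == "region_ocr":
        if bbox is None or len(bbox) != 4:
            raise ValueError("region_ocr needs a bbox of four integers")
        cmd += ["--bbox", *[str(v) for v in bbox]]
    return cmd
```

The returned list can be passed directly to `subprocess.run`, so the only thing that changes between tasks is the prompt name.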
2. Strong Multilingual Support
dots.ocr demonstrates a decisive advantage in multilingual document parsing, including low-resource languages:
Supported languages (examples):
- English
- Chinese
- Tibetan
- Dutch
- Kannada
- Russian
This capability is highly valuable for organizations that need to process documents written in a variety of languages across a global business environment.
Benchmark Performance Analysis
OmniDocBench Results
dots.ocr achieved the following SOTA results on OmniDocBench:
| Task Area | dots.ocr Performance | Comparison |
|---|---|---|
| Text recognition | SOTA | Existing OCR models |
| Table recognition | SOTA | Specialized table recognition models |
| Reading order | SOTA | Layout analysis models |
| Formula recognition | On par with Doubao-1.5 / Gemini 2.5 Pro | Large-scale VLMs |
Multilingual Performance Advantage
On the model’s own multilingual benchmark, dots.ocr-bench, it demonstrated a decisive lead in both layout detection and content recognition. Whereas existing models were primarily optimized for English and Chinese, this result suggests that dots.ocr generalizes well across a wide range of languages.
Implementation and Usage
1. Environment Setup
The following steps configure the environment required to use dots.ocr:
# Download and register the model
python3 tools/download_model.py
export hf_model_path=./weights/DotsOCR
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
# vLLM server setup (note: directory names must not contain dots)
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`
2. Starting the vLLM Server
# Launch a GPU memory-optimized vLLM server
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--chat-template-content-format string \
--served-model-name model \
--trust-remote-code
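Once the server is running, it can be queried over HTTP. The sketch below only builds a request body, assuming vLLM's OpenAI-compatible chat-completions endpoint on its default port 8000 and the usual `image_url` content format. The function name, the placeholder prompt, and the temperature setting are assumptions, and the repository's own client may format the prompt differently.

```python
import base64
import json


def build_chat_request(image_bytes, prompt, model="model", mime="image/jpeg"):
    """Build an OpenAI-style chat-completions payload with one image and one prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # matches --served-model-name in the command above
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # assumption: deterministic decoding suits OCR
    }


# Assumption: vLLM's default address; adjust if the server uses another port.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

# The three bytes below are a stand-in for real JPEG data.
payload = build_chat_request(b"\xff\xd8\xff", "placeholder prompt")
body = json.dumps(payload)  # POST this to ENDPOINT with any HTTP client
```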
3. Using Different Parsing Modes
The strength of dots.ocr lies in its ability to handle diverse tasks with a single model:
Full Layout Analysis and Recognition
# Parse an image file
python3 dots_ocr/parser.py demo/demo_image1.jpg
# Parse a PDF file (increase thread count for large PDFs)
python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64
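One plausible reason a higher thread count helps on large PDFs is that each page can be sent to the server independently. The sketch below illustrates that pattern with Python's standard thread pool. `parse_page` is a hypothetical stand-in for whatever processes a single page, and this is not the parser's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_pages(pages, parse_page, num_thread=4):
    """Apply parse_page to every page concurrently, keeping page order."""
    with ThreadPoolExecutor(max_workers=num_thread) as pool:
        # pool.map returns results in input order, even if pages finish out of order
        return list(pool.map(parse_page, pages))
```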
Layout Detection Only
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
Text Extraction Only (excluding headers and footers)
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
Analysis of a Specific Region
# Analyze only a specified region using a bounding box
python3 dots_ocr/parser.py demo/demo_image1.jpg \
--prompt prompt_grounding_ocr \
--bbox 163 241 1536 705
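Before passing a region to the parser, it can help to check that the box lies inside the image. The sketch below assumes the four values are pixel coordinates in `[x1, y1, x2, y2]` order, as in the bbox format used by the layout prompt. `clamp_bbox` is a hypothetical helper and not part of dots.ocr.

```python
def clamp_bbox(bbox, image_width, image_height):
    """Clamp an [x1, y1, x2, y2] box to the image bounds and reject empty boxes."""
    x1, y1, x2, y2 = bbox
    # Clamp each coordinate to the image, then order the corners
    x1, x2 = sorted((max(0, min(x1, image_width)), max(0, min(x2, image_width))))
    y1, y2 = sorted((max(0, min(y1, image_height)), max(0, min(y2, image_height))))
    if x1 == x2 or y1 == y2:
        raise ValueError("bbox has zero area after clamping")
    return [x1, y1, x2, y2]
```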
4. Usage with HuggingFace Transformers
If you prefer HuggingFace Transformers over vLLM:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load the model
model_path = "./weights/DotsOCR"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Define the prompt
prompt = """Please output the layout information from the PDF image,
including each layout element's bbox, its category, and the corresponding
text content within the bbox.
1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: ['Caption', 'Footnote', 'Formula', 'List-item',
'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title']
3. Text Extraction & Formatting Rules:
- Picture: Text field omitted
- Formula: LaTeX format
- Table: HTML format
- Others: Markdown format
4. Output: Single JSON object sorted by reading order
"""
# Construct messages and run inference
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo/demo_image1.jpg"},
        {"type": "text", "text": prompt}
    ]
}]
# Run inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=24000)
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True, clean_up_tokenization_spaces=False
)
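The decoded string still has to be turned into structured data. The sketch below is one cautious way to do that, assuming the model returns JSON whose elements carry `bbox` and `category` keys as the prompt requests. The key names and `parse_layout_output` are assumptions, so adjust them to the output you actually observe.

```python
import json


def parse_layout_output(raw):
    """Parse model output into a list of layout cells, or return None if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # e.g. output cut off by max_new_tokens
    if isinstance(data, dict):
        data = [data]  # tolerate a single element instead of a list
    if not isinstance(data, list):
        return None
    # Keep only elements that look like layout cells
    return [
        item for item in data
        if isinstance(item, dict) and "bbox" in item and "category" in item
    ]
```

With the snippet above, this would be called as `parse_layout_output(output_text[0])`. A `None` result is a hint to retry with a larger image or a narrower prompt.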
Output Analysis
dots.ocr produces structured results in the following forms:
1. JSON Structured Data
- Bounding boxes: Precise coordinate positions for each element
- Categories: Automatic classification into 11 layout categories
- Text content: Extracted text per element
2. Markdown Conversion
- A Markdown file concatenating the text of all detected cells
- A version excluding headers and footers, provided for benchmark compatibility
3. Visualization Output
- The original image with detected layout bounding boxes overlaid
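The relationship between the JSON cells and the two Markdown variants can be sketched as follows. This assumes each cell is a dict with `category` and `text` keys and that cells are already in reading order. The function name and the `$$` wrapping for formulas are assumptions for illustration, not the repository's actual conversion code.

```python
HEADER_FOOTER = {"Page-header", "Page-footer"}


def cells_to_markdown(cells, skip_header_footer=False):
    """Join the text of layout cells into one Markdown string."""
    parts = []
    for cell in cells:
        category = cell.get("category")
        if category == "Picture":
            continue  # pictures carry no text field
        if skip_header_footer and category in HEADER_FOOTER:
            continue  # benchmark-compatible variant
        text = cell.get("text", "").strip()
        if not text:
            continue
        if category == "Formula":
            text = f"$$\n{text}\n$$"  # assumption: display-math wrapping
        parts.append(text)
    return "\n\n".join(parts)
```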
Performance Optimization and Considerations
Recommendations for Optimal Performance
Image Resolution Optimization
# DPI setting for PDF parsing (recommended: 200 DPI)
# Optimal resolution: 11,289,600 pixels or fewer
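To see how the DPI setting and the pixel limit interact, the sketch below converts a page size in PDF points (72 per inch) to pixels and computes the scale factor needed to stay within 11,289,600 pixels. The function names are hypothetical, and treating the limit as a simple width-times-height budget is an assumption.

```python
import math

MAX_PIXELS = 11_289_600  # limit quoted above (equal to 3360 * 3360)


def page_pixels(width_pt, height_pt, dpi=200):
    """Pixel dimensions of a page rendered at the given DPI (72 points per inch)."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def fit_scale(width_px, height_px, max_pixels=MAX_PIXELS):
    """Scale factor (at most 1.0) that brings the image within max_pixels."""
    total = width_px * height_px
    if total <= max_pixels:
        return 1.0
    # Scaling both sides by s scales the area by s squared
    return math.sqrt(max_pixels / total)
```

A page that already fits gets a factor of 1.0. An oversized scan should be downscaled by the returned factor on both axes before parsing.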
GPU Memory Optimization
# Adjust GPU memory utilization when starting the vLLM server
--gpu-memory-utilization 0.95 # Adjust as needed
Known Limitations
1. Complex Document Elements
- Highly complex tables: Not yet handled perfectly
- Formulas: Accuracy is limited for intricate mathematical expressions
- Images: Images embedded within documents are not currently parsed
2. Conditions That Cause Parsing Failures
- When the character-to-pixel ratio is excessively high
- Infinite repetition in output triggered by consecutive special characters (e.g., ... or ___)
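A simple guard against the repetition failure is to check whether the output ends in one short unit repeated many times. The heuristic below is a sketch under that assumption. The thresholds (units of up to 20 characters, at least 10 repeats, a 2000-character window) and the function name are arbitrary choices, not values from dots.ocr.

```python
import re

# A unit of 1-20 characters, followed by at least 9 more copies, at the end of the text
_REPEAT_TAIL = re.compile(r"(.{1,20}?)\1{9,}\s*$", re.DOTALL)


def has_runaway_repetition(text, window=2000):
    """Return True if the tail of text is one short unit repeated 10+ times."""
    return _REPEAT_TAIL.search(text[-window:]) is not None
```

When the check fires, retrying with an enlarged image or one of the alternative prompts listed below is a reasonable next step.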
3. Using Alternative Prompts
If you encounter issues, try the following prompts:
- prompt_layout_only_en: Layout detection only
- prompt_ocr: Text extraction only
- prompt_grounding_ocr: Analysis of a specific region
Practical Use Cases
1. Multilingual Corporate Document Management
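# Batch processing of multilingual contracts and reports
# (illustrative pseudocode: parse_document, extract_structured_info, and
# store_to_database are placeholder helpers, not part of dots.ocr)
for document in multilingual_documents:
    result = parse_document(document, language="auto")
    structured_data = extract_structured_info(result)
    store_to_database(structured_data)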
2. Building an Academic Paper Database
# Automated parsing of papers containing formulas and tables
# (illustrative pseudocode: dots_ocr.parse and the helper functions are
# placeholders, not a documented dots.ocr API)
papers = load_academic_papers()
for paper in papers:
    layout_info = dots_ocr.parse(paper, mode="academic")
    formulas = extract_latex_formulas(layout_info)
    tables = extract_html_tables(layout_info)
    create_searchable_index(formulas, tables)
3. Legal Document Digitization
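# Structuring complex legal documents
# (illustrative pseudocode: dots_ocr.parse and the helper functions are
# placeholders, not a documented dots.ocr API)
legal_docs = load_legal_documents()
for doc in legal_docs:
    parsed = dots_ocr.parse(doc, preserve_reading_order=True)
    sections = identify_legal_sections(parsed)
    create_legal_knowledge_base(sections)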
Future Development Directions
The RedNote research team has outlined the following planned improvements:
Short-term Goals
- Improved accuracy for table and formula parsing
- Performance optimization for large-scale PDF processing
- Adding image parsing capability within documents
Long-term Vision
- Universal recognition model: Integrating general detection, image captioning, and OCR
- More capable and efficient models: Improving both performance and efficiency simultaneously
- Community collaboration: Advancement through open-source contributions
Conclusion
dots.ocr represents a paradigm shift in the field of document parsing. With a relatively compact size of 1.7B parameters, it achieves SOTA performance while demonstrating the viability of practical deployment.
Three core strengths stand out in particular: a single model that handles diverse tasks, strong multilingual support, and an efficient architecture. Together, these point to broad applicability in real-world production environments.
dots.ocr holds significant promise for improving operational efficiency across many domains, including multilingual document processing, academic material digitization, and legal document management. With a clear roadmap for future improvement, the model is expected to grow into an even more capable tool through continued development.