dots.ocr: SOTA Multilingual Document Parsing with 1.7B Parameters - Complete Analysis

Introduction

A significant shift is taking place in the field of document parsing. Traditionally, document layout detection and text recognition required multiple independent models chained together in a pipeline. However, dots.ocr, released by the RedNote research team, integrates all of these tasks into a single vision-language model (VLM) while achieving state-of-the-art (SOTA) performance.

A particularly notable aspect is that, despite its relatively small size of 1.7B parameters, the model reaches SOTA on text recognition, table recognition, and reading order, and its formula recognition is on par with much larger models such as Doubao-1.5 and Gemini 2.5 Pro. This makes it a good example of practical AI system design that pursues efficiency and performance together.

Core Features of dots.ocr

1. The Innovation of a Unified Architecture

The most significant innovation in dots.ocr is that a single vision-language model performs all of the following tasks concurrently:

Layout detection: Identifying regions containing text, tables, images, formulas, and other elements
Text recognition: Accurate text extraction via OCR
Reading order: Ordering elements in the sequence a human would read
Format conversion: Producing output in appropriate formats such as Markdown, HTML, and LaTeX

Tasks that once required a complex multi-model pipeline are now handled by one model, and switching between them only requires changing the prompt.

2. Strong Multilingual Support

dots.ocr demonstrates a decisive advantage in multilingual document parsing, including low-resource languages:

Supported languages (examples):
- English
- Chinese
- Tibetan
- Dutch
- Kannada
- Russian

This capability is highly valuable for organizations that need to process documents written in a variety of languages across a global business environment.

Benchmark Performance Analysis

OmniDocBench Results

dots.ocr reports the following results on OmniDocBench, reaching SOTA in three of the four task areas:

Task Area	dots.ocr Result	Compared Against
Text recognition	SOTA	Existing OCR models
Table recognition	SOTA	Specialized table recognition models
Reading order	SOTA	Layout analysis models
Formula recognition	On par with Doubao-1.5 / Gemini 2.5 Pro	Large-scale VLMs

Multilingual Performance Advantage

On dots.ocr-bench, the team's own multilingual benchmark, dots.ocr shows a decisive lead in both layout detection and content recognition. Whereas existing models were optimized primarily for English and Chinese, this result suggests that dots.ocr generalizes across a wide range of languages.

Implementation and Usage

1. Environment Setup

The following steps configure the environment required to use dots.ocr:

# Download and register the model
python3 tools/download_model.py
export hf_model_path=./weights/DotsOCR
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH

# Register the model with vLLM by patching the vllm entry script
# (note: the weights directory name must not contain dots, e.g. DotsOCR, not dots.ocr)
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`
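
The restriction on dots follows from the import line added above: the weights directory is imported as a Python module, and a dot in a module name is read as a package separator. A minimal check, assuming the directory name only needs to be a valid Python identifier (the helper name `is_importable_dir_name` is hypothetical and not part of dots.ocr):

```python
import keyword
from pathlib import Path


def is_importable_dir_name(model_dir: str) -> bool:
    """Return True if the last path component can be used in `from <name> import ...`."""
    name = Path(model_dir).name
    # A usable module name is a Python identifier that is not a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_importable_dir_name("./weights/DotsOCR"))   # True
print(is_importable_dir_name("./weights/dots.ocr"))  # False
```

Running a check like this before patching the vllm script catches a badly named directory early, rather than at server start-up.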

2. Starting the vLLM Server

# Launch a GPU memory-optimized vLLM server
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --chat-template-content-format string \
  --served-model-name model \
  --trust-remote-code
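
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library. It assumes the default vLLM port 8000 and a JPEG input, and the helper names `build_chat_payload` and `send_chat_request` are hypothetical. The dots.ocr repository ships its own client, which may add model-specific placeholder tokens to the prompt text, so treat this as an outline of the request shape rather than the project's exact method.

```python
import base64
import json
import urllib.request


def build_chat_payload(image_bytes: bytes, prompt: str, model: str = "model") -> dict:
    """Build an OpenAI-style chat request carrying one JPEG image and a text prompt."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # must match the --served-model-name passed to `vllm serve`
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the generated text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, reading `demo/demo_image1.jpg` in binary mode, passing its bytes and a prompt to `build_chat_payload`, and handing the result to `send_chat_request` should return the raw model output as a string.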

3. Using Different Parsing Modes

The strength of dots.ocr lies in its ability to handle diverse tasks with a single model:

Full Layout Analysis and Recognition

# Parse an image file
python3 dots_ocr/parser.py demo/demo_image1.jpg

# Parse a PDF file (increase thread count for large PDFs)
python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64

Layout Detection Only

python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

Text Extraction Only (excluding headers and footers)

python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr

Analysis of a Specific Region

# Analyze only a specified region using a bounding box
python3 dots_ocr/parser.py demo/demo_image1.jpg \
  --prompt prompt_grounding_ocr \
  --bbox 163 241 1536 705
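
The four invocations above differ only in the `--prompt` and `--bbox` arguments, so they can be generated from a single function. The following is a sketch, not part of dots.ocr: the mode labels and the name `build_parser_command` are hypothetical, while the prompt names and flags are the ones used in the commands above.

```python
from typing import Optional, Sequence

# Prompt names as used in the commands above; None means "omit --prompt".
PROMPT_BY_MODE = {
    "full": None,
    "layout_only": "prompt_layout_only_en",
    "ocr": "prompt_ocr",
    "grounding_ocr": "prompt_grounding_ocr",
}


def build_parser_command(path: str, mode: str = "full",
                         bbox: Optional[Sequence[int]] = None) -> list:
    """Return the argv list for dots_ocr/parser.py in the requested mode."""
    if mode not in PROMPT_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    # This sketch requires a bounding box for region analysis and rejects it elsewhere.
    if (mode == "grounding_ocr") != (bbox is not None):
        raise ValueError("bbox is required for grounding_ocr and only for it")
    command = ["python3", "dots_ocr/parser.py", path]
    prompt = PROMPT_BY_MODE[mode]
    if prompt is not None:
        command += ["--prompt", prompt]
    if bbox is not None:
        if len(bbox) != 4:
            raise ValueError("bbox must be four integers: x1 y1 x2 y2")
        command += ["--bbox", *map(str, bbox)]
    return command


print(build_parser_command("demo/demo_image1.jpg", "grounding_ocr", (163, 241, 1536, 705)))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root.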

4. Usage with HuggingFace Transformers

If you prefer HuggingFace Transformers over vLLM:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model
model_path = "./weights/DotsOCR"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Define the prompt
prompt = """Please output the layout information from the PDF image, 
including each layout element's bbox, its category, and the corresponding 
text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: ['Caption', 'Footnote', 'Formula', 'List-item', 
   'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title']
3. Text Extraction & Formatting Rules:
   - Picture: Text field omitted
   - Formula: LaTeX format
   - Table: HTML format
   - Others: Markdown format
4. Output: Single JSON object sorted by reading order
"""

# Construct the chat message (image plus prompt)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo/demo_image1.jpg"},
        {"type": "text", "text": prompt}
    ]
}]

# Apply the chat template and prepare the model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, 
                  padding=True, return_tensors="pt").to("cuda")

# Generate, then drop the prompt tokens so only the new tokens are decoded
generated_ids = model.generate(**inputs, max_new_tokens=24000)
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Output Analysis

dots.ocr produces structured results in the following forms:

1. JSON Structured Data

Bounding boxes: Precise coordinate positions for each element
Categories: Automatic classification into 11 layout categories
Text content: Extracted text per element
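
To make the structure concrete, the sketch below loads such output into typed records. It assumes the output is a JSON list of objects with the keys `bbox`, `category`, and `text`. The key names are inferred from the prompt wording and are not confirmed here, and `LayoutElement` and `parse_layout` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayoutElement:
    bbox: List[int]          # [x1, y1, x2, y2]
    category: str            # one of the 11 layout categories
    text: Optional[str]      # None for Picture elements, whose text is omitted


def parse_layout(raw_output: str) -> List[LayoutElement]:
    """Parse model output, assumed to be a JSON list of element objects."""
    return [
        LayoutElement(item["bbox"], item["category"], item.get("text"))
        for item in json.loads(raw_output)
    ]


# Hand-written sample for illustration, not real model output.
sample = json.dumps([
    {"bbox": [40, 30, 560, 70], "category": "Title", "text": "# Annual Report"},
    {"bbox": [40, 90, 560, 300], "category": "Picture"},
])
for element in parse_layout(sample):
    print(element.category, element.bbox, element.text)
```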

2. Markdown Conversion

A Markdown file concatenating the text of all detected cells
A version excluding headers and footers, provided for benchmark compatibility
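
The two Markdown variants can be approximated from the JSON elements as follows. This is a sketch of the idea, not the parser's actual implementation: the function name `to_markdown` is hypothetical, and the elements are assumed to be dicts with `category` and optional `text` keys, already in reading order.

```python
from typing import Iterable, Mapping

HEADER_FOOTER = {"Page-header", "Page-footer"}


def to_markdown(elements: Iterable[Mapping], drop_header_footer: bool = False) -> str:
    """Join element texts in the given (reading) order, separated by blank lines."""
    parts = []
    for element in elements:
        if drop_header_footer and element["category"] in HEADER_FOOTER:
            continue
        text = element.get("text")
        if text:  # Picture elements carry no text
            parts.append(text.strip())
    return "\n\n".join(parts)


# Hand-written sample elements for illustration, not real model output.
sample = [
    {"category": "Page-header", "text": "ACME Corp"},
    {"category": "Section-header", "text": "## Summary"},
    {"category": "Picture"},
    {"category": "Text", "text": "Revenue grew."},
]
print(to_markdown(sample, drop_header_footer=True))
```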

3. Visualization Output

The original image with detected layout bounding boxes overlaid

Performance Optimization and Considerations

Recommendations for Optimal Performance

Image Resolution Optimization

# Render PDF pages at 200 DPI (recommended setting)
# Keep each image at or below 11,289,600 pixels in total (width x height)
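
As a worked example of the pixel budget: 11,289,600 equals 3360 x 3360, and an A4 page rendered at 200 DPI is about 1654 x 2338 pixels, which is well within it. For larger images, a downscale that preserves the aspect ratio could look like the sketch below; `fit_to_pixel_budget` is a hypothetical helper, not a dots.ocr function.

```python
import math

MAX_PIXELS = 11_289_600  # pixel budget quoted above (3360 * 3360)


def fit_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Return (width, height) scaled down, keeping aspect ratio, to fit the budget."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Round down so the resulting area stays within the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_to_pixel_budget(1654, 2338))  # A4 at 200 DPI: already within budget
print(fit_to_pixel_budget(6000, 4000))  # oversized scan: scaled down
```

The resulting size can then be passed to whatever image library performs the actual resize.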

GPU Memory Optimization

# Adjust GPU memory utilization when starting the vLLM server
--gpu-memory-utilization 0.95  # Adjust as needed

Known Limitations

1. Complex Document Elements

Highly complex tables: Not yet handled perfectly
Formulas: Accuracy is limited for intricate mathematical expressions
Images: Images embedded within documents are not currently parsed

2. Conditions That Cause Parsing Failures

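When the character-to-pixel ratio is excessively high, that is, when the text is too small for the image resolution (enlarging the image or raising the PDF rendering DPI may help)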
Infinite repetition in output triggered by consecutive special characters (e.g., ..., ___)
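
Runaway repetition is easy to detect after the fact, because the output ends with the same short unit repeated many times. The heuristic below is a sketch under assumed thresholds (units of up to 20 characters repeated at least 30 times). The name `has_runaway_repetition` is hypothetical, and the thresholds would need tuning, since legitimate content such as dot leaders can also repeat.

```python
def has_runaway_repetition(text: str, max_unit: int = 20, min_repeats: int = 30) -> bool:
    """Return True if the text ends with one short unit repeated many times."""
    stripped = text.rstrip()
    for unit_len in range(1, max_unit + 1):
        unit = stripped[-unit_len:]
        if len(unit) < unit_len:
            break  # text is shorter than the candidate unit
        repeats = 0
        end = len(stripped)
        # Walk backwards from the end while the same unit keeps appearing.
        while repeats < min_repeats and end >= unit_len and stripped[end - unit_len:end] == unit:
            repeats += 1
            end -= unit_len
        if repeats >= min_repeats:
            return True
    return False


print(has_runaway_repetition("Total" + "." * 200))          # True
print(has_runaway_repetition("A normal paragraph of text."))  # False
```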

3. Using Alternative Prompts

If you encounter issues, try the following prompts:

prompt_layout_only_en: Layout detection only
prompt_ocr: Text extraction only
prompt_grounding_ocr: Analysis of a specific region
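
One way to apply this advice automatically is to try prompts in a fixed order until the output passes a sanity check. The sketch below is an assumption-laden outline: `parse_with_fallback` is a hypothetical name, `run` stands for whatever function sends one prompt to the model, `None` stands for the default full-parse prompt, and the fallback order is a judgment call. Note that the fallback prompts return less information than a full parse.

```python
from typing import Callable, Iterable, Optional, Tuple


def parse_with_fallback(
    run: Callable[[Optional[str]], str],
    looks_ok: Callable[[str], bool],
    prompts: Iterable[Optional[str]] = (None, "prompt_layout_only_en", "prompt_ocr"),
) -> Tuple[Optional[str], str]:
    """Try each prompt in order; return (prompt, output) for the first acceptable output."""
    for prompt in prompts:
        output = run(prompt)
        if looks_ok(output):
            return prompt, output
    raise RuntimeError("no prompt produced an acceptable output")
```

`prompt_grounding_ocr` is left out of the default order because it also needs a bounding box.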

Practical Use Cases

1. Multilingual Corporate Document Management

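# Batch processing of multilingual contracts and reports
# (illustrative pseudocode: parse_document, extract_structured_info, and
# store_to_database are hypothetical helpers, not dots.ocr APIs)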
for document in multilingual_documents:
    result = parse_document(document, language="auto")
    structured_data = extract_structured_info(result)
    store_to_database(structured_data)

2. Building an Academic Paper Database

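# Automated parsing of papers containing formulas and tables
# (illustrative pseudocode: dots_ocr.parse, its mode argument, and the
# helper functions are hypothetical, not dots.ocr APIs)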
papers = load_academic_papers()
for paper in papers:
    layout_info = dots_ocr.parse(paper, mode="academic")
    formulas = extract_latex_formulas(layout_info)
    tables = extract_html_tables(layout_info)
    create_searchable_index(formulas, tables)

3. Legal Document Digitization

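# Structuring complex legal documents
# (illustrative pseudocode: dots_ocr.parse, preserve_reading_order, and the
# helper functions are hypothetical, not dots.ocr APIs)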
legal_docs = load_legal_documents()
for doc in legal_docs:
    parsed = dots_ocr.parse(doc, preserve_reading_order=True)
    sections = identify_legal_sections(parsed)
    create_legal_knowledge_base(sections)

Future Development Directions

The RedNote research team has outlined the following planned improvements:

Short-term Goals

Improved accuracy for table and formula parsing
Performance optimization for large-scale PDF processing
Adding image parsing capability within documents

Long-term Vision

Universal recognition model: Integrating general detection, image captioning, and OCR
More capable and efficient models: Improving both performance and efficiency simultaneously
Community collaboration: Advancement through open-source contributions

Conclusion

dots.ocr marks a shift in document parsing from multi-model pipelines to a single vision-language model. At a relatively compact 1.7B parameters, it reports SOTA results on OmniDocBench while remaining small enough for practical deployment.

Three core strengths stand out in particular: a single model that handles diverse tasks, strong multilingual support, and an efficient architecture. Together, these point to broad applicability in real-world production environments.

dots.ocr holds significant promise for improving operational efficiency across many domains, including multilingual document processing, academic material digitization, and legal document management. With a clear roadmap for future improvement, the model is expected to grow into an even more capable tool through continued development.
