ByteDance Dolphin Document Image Parsing: Fox Dataset and Benchmark Complete Analysis

⏱️ Estimated reading time: 18 min

Introduction

Document image parsing is a core AI technology for extracting structured information from scanned documents, PDFs, or photographed pages. ByteDance’s Dolphin project proposes an innovative approach in this space, and building on research published at ACL 2025, the team has released the Fox dataset and benchmark.

This article provides a thorough analysis of Dolphin’s core techniques together with the large-scale dataset the researchers constructed, with particular focus on the structure and practical use of the Fox-Page benchmark.

Dolphin Project Overview

🎯 Core Idea: The Analyze-then-Parse Paradigm

Dolphin adopts an “Analyze-then-Parse” approach that sets it apart from conventional document parsing methods.

Limitations of Existing Methods

# Conventional approach: specialized model pipeline
traditional_pipeline = {
    "layout_detection": "YOLO/Faster R-CNN",
    "ocr_engine": "Tesseract/PaddleOCR", 
    "table_extraction": "TableNet/CascadeTabNet",
    "formula_recognition": "Im2Latex/InftyReader"
}
# Problems: integration overhead, lack of consistency, high complexity

# Conventional approach: autoregressive generation
autoregressive_approach = {
    "input": "document_image",
    "output": "sequential_text_generation",
    "problem": "layout_structure_degradation"
}
# Problems: loss of layout structure, reduced efficiency

Dolphin’s Innovative Approach

# Dolphin: two-stage Analyze-then-Parse paradigm
dolphin_paradigm = {
    "stage_1": {
        "task": "layout_analysis",
        "output": "element_sequence_in_reading_order",
        "elements": ["text", "table", "figure", "formula"]
    },
    "stage_2": {
        "task": "parallel_element_parsing", 
        "method": "heterogeneous_anchor_prompting",
        "efficiency": "parallel_processing"
    }
}

🏗️ Model Architecture

Dolphin is built on a Vision-Encoder-Decoder structure:

Vision Encoder

Backbone: Swin Transformer
Function: Extracting visual features from document images
Resolution: Multi-scale processing supported

Text Decoder

Base: MBart architecture
Languages: Chinese and English supported simultaneously
Vocabulary size: 32K tokens

Prompt-based Interface

# Heterogeneous anchor prompting example
prompts = {
    "layout_analysis": "Analyze the layout and generate element sequence:",
    "table_parsing": "Parse the table content in the red box:",
    "formula_recognition": "Recognize the mathematical formula in the blue box:",
    "text_extraction": "Extract text content from the green box:"
}

Fox Dataset: Detailed Analysis

📊 Dataset Composition

The ByteDance research team built a large-scale multi-granularity dataset for training Dolphin.

Overall Dataset Scale

dataset_statistics:
  total_samples: 30_000_000+
  granularity_levels:
    - page_level: "full-page parsing"
    - element_level: "individual element parsing"
  
  task_distribution:
    layout_analysis: 8_500_000
    table_extraction: 7_200_000  
    formula_recognition: 5_800_000
    text_recognition: 8_500_000

Fox-Page Benchmark Characteristics

Fox-Page is a high-quality subset manually refined from the original Fox dataset.

fox_page_benchmark:
  release_date: "2025-07-10"
  quality: "manually_refined"
  purpose: "evaluation_benchmark"
  
  download_links:
    baidu_yun: "https://pan.baidu.com/..."
    google_drive: "https://drive.google.com/..."
  
  characteristics:
    diversity: "diverse document types"
    complexity: "complex layout structures"
    quality: "expert-verified"

🔍 Data Category Analysis

1. Academic Papers

academic_papers = {
    "sources": ["arXiv", "ACL", "NeurIPS", "ICLR"],
    "elements": {
        "multi_column_text": "two- and three-column text",
        "complex_tables": "statistical tables, results comparison tables",
        "mathematical_formulas": "inline and display formulas",
        "figures_with_captions": "graphs, diagrams"
    },
    "challenges": [
        "dense_layout",
        "interleaved_elements", 
        "scientific_notation"
    ]
}

2. Business Documents

business_documents = {
    "types": ["invoices", "reports", "presentations"],
    "layouts": {
        "structured_forms": "form-based documents",
        "mixed_content": "text and chart combinations",
        "branded_headers": "logos and header information"
    },
    "extraction_targets": [
        "key_value_pairs",
        "financial_data",
        "contact_information"
    ]
}

3. Educational Materials

educational_materials = {
    "categories": ["textbooks", "worksheets", "exams"],
    "special_elements": {
        "question_answer_pairs": "Q&A format",
        "step_by_step_solutions": "step-by-step solutions",
        "mixed_languages": "mixed multilingual content"
    },
    "complexity_factors": [
        "handwritten_annotations",
        "geometric_diagrams",
        "chemical_formulas"
    ]
}

📈 Benchmark Performance Metrics

Page-level Evaluation Metrics

page_level_metrics = {
    "structure_accuracy": {
        "description": "layout structure accuracy",
        "calculation": "correct_elements / total_elements",
        "weight": 0.3
    },
    "content_fidelity": {
        "description": "content fidelity", 
        "measures": ["BLEU", "ROUGE", "Edit Distance"],
        "weight": 0.4
    },
    "reading_order": {
        "description": "reading order accuracy",
        "metric": "sequence_alignment_score", 
        "weight": 0.3
    }
}

Element-level Evaluation Metrics

element_level_metrics = {
    "table_extraction": {
        "cell_accuracy": "per-cell accuracy",
        "structure_score": "table structure score", 
        "format_preservation": "degree of format preservation"
    },
    "formula_recognition": {
        "latex_accuracy": "LaTeX syntax accuracy",
        "semantic_correctness": "semantic correctness",
        "rendering_quality": "rendering quality"
    },
    "text_extraction": {
        "character_accuracy": "character-level accuracy",
        "word_accuracy": "word-level accuracy", 
        "layout_preservation": "layout preservation"
    }
}

Practical Usage Guide

🚀 Using the Dolphin Model

Installation and Setup

# Clone the repository
git clone https://github.com/bytedance/Dolphin.git
cd Dolphin

# Install dependencies
pip install -r requirements.txt

# Download the model (HuggingFace approach)
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin ./hf_model

Page-level Parsing Example

# Usage example for demo_page_hf.py
import argparse
from pathlib import Path

# Process a single document image
python demo_page_hf.py \
    --model_path ./hf_model \
    --input_path ./demo/page_imgs/academic_paper.jpeg \
    --save_dir ./results

# Process a PDF document
python demo_page_hf.py \
    --model_path ./hf_model \
    --input_path ./demo/page_imgs/business_report.pdf \
    --save_dir ./results

# Batch processing (entire directory)
python demo_page_hf.py \
    --model_path ./hf_model \
    --input_path ./demo/page_imgs \
    --save_dir ./results \
    --max_batch_size 16

Element-level Parsing Example

# Table extraction
python demo_element_hf.py \
    --model_path ./hf_model \
    --input_path ./demo/element_imgs/complex_table.jpeg \
    --element_type table

# Formula recognition
python demo_element_hf.py \
    --model_path ./hf_model \
    --input_path ./demo/element_imgs/math_formula.jpeg \
    --element_type formula

# Text paragraph extraction
python demo_element_hf.py \
    --model_path ./hf_model \
    --input_path ./demo/element_imgs/text_paragraph.jpg \
    --element_type text

📊 Performance Optimization Tips

TensorRT-LLM Acceleration (added 2025.06.30)

# Install TensorRT-LLM
pip install tensorrt-llm

# Convert the model
python convert_to_tensorrt.py \
    --model_path ./hf_model \
    --output_dir ./tensorrt_model \
    --precision fp16

# Run accelerated inference
python demo_page_tensorrt.py \
    --model_path ./tensorrt_model \
    --input_path ./test_images

vLLM Acceleration (added 2025.06.27)

# Install vLLM
pip install vllm

# Start the vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model ./hf_model \
    --tensor-parallel-size 2 \
    --dtype half

# Call from a client
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ByteDance/Dolphin",
        "messages": [{"role": "user", "content": "Parse this document"}]
    }'

🔧 Building a Custom Dataset

Data Preparation Guidelines

# Custom dataset structure
custom_dataset = {
    "images": {
        "format": ["JPEG", "PNG", "PDF"],
        "resolution": "minimum 300 DPI recommended",
        "quality": "high-resolution, sharp images"
    },
    "annotations": {
        "layout": {
            "bounding_boxes": "bounding box for each element",
            "element_types": ["text", "table", "figure", "formula"],
            "reading_order": "natural reading order"
        },
        "content": {
            "ground_truth": "accurate text content", 
            "markup": "structured markup (HTML/Markdown)",
            "latex": "LaTeX representation of formulas"
        }
    }
}

Annotation Guidelines

annotation_guidelines:
  layout_analysis:
    text_blocks:
      - "Distinguish paragraphs, headings, and captions"
      - "Assign sequence numbers that reflect reading order"
    
    tables:
      - "Distinguish header rows from data rows"
      - "Include information on merged cells"
      - "Link table captions to tables"
    
    figures:
      - "Images, charts, and diagrams"
      - "Relationship between figure and its caption"
      - "Reference number information"
    
    formulas:
      - "Distinguish inline from display formulas"
      - "Accurate LaTeX representation"
      - "Consistent use of variables and symbols"

  quality_control:
    consistency_checks:
      - "Style consistency within the same document"
      - "Unified terminology and notation"
    
    accuracy_validation:
      - "Compare OCR output with source"
      - "Verify formula rendering"
      - "Confirm table structure accuracy"

Comparison with Other Datasets

📋 Comparison of Major Document Understanding Benchmarks

Dataset	Scale	Characteristics	Limitations
Fox-Page	Refined, high quality	Multilingual, complex layouts	Relatively smaller size
DocVQA	50K+	VQA format	Limited to question-answer pairs
ChartQA	20K+	Chart-focused	Lacks non-chart elements
PubLayNet	360K+	Layout-centric	Limited content extraction
TableBank	417K+	Table-focused	Tables only

🎯 What Sets the Dolphin Fox Dataset Apart

1. Multi-Granularity Support

multi_granularity = {
    "page_level": {
        "task": "understanding full document structure",
        "output": "JSON + Markdown",
        "applications": ["document digitization", "automatic summarization"]
    },
    "element_level": {
        "task": "precise extraction of individual elements", 
        "output": "structured data",
        "applications": ["data mining", "information retrieval"]
    }
}

2. Grounded in Real-World Scenarios

real_world_scenarios = {
    "academic_research": {
        "documents": "arXiv papers, conference proceedings",
        "challenges": "complex formulas, multi-column layouts"
    },
    "business_intelligence": {
        "documents": "financial statements, business reports",
        "challenges": "table structures, chart interpretation"
    },
    "education_technology": {
        "documents": "textbooks, exam questions",
        "challenges": "multilingual content, handwriting"
    }
}

3. Comprehensive Evaluation Metrics

comprehensive_evaluation = {
    "structure_preservation": "preservation of layout structure",
    "content_accuracy": "content accuracy",
    "efficiency_metrics": "processing speed and memory usage",
    "robustness_testing": "stability across diverse conditions"
}

Research and Development Use Cases

🔬 Academic Research Applications

1. Document Understanding Model Development

research_applications = {
    "baseline_comparison": {
        "purpose": "benchmarking new model performance",
        "metrics": "Fox-Page benchmark scores",
        "advantage": "standardized evaluation environment"
    },
    "ablation_studies": {
        "components": ["vision_encoder", "text_decoder", "prompting"],
        "methodology": "per-component contribution analysis"
    },
    "cross_lingual_analysis": {
        "languages": ["Chinese", "English", "Mixed"],
        "research_questions": "analysis of performance differences by language"
    }
}

2. Validating New Techniques

technique_validation = {
    "anchor_prompting": {
        "hypothesis": "heterogeneous anchors improve performance",
        "experiment": "comparison experiments with and without prompts"
    },
    "parallel_processing": {
        "hypothesis": "parallel processing improves efficiency",
        "measurement": "processing time and memory usage"
    }
}

🏭 Industrial Applications

1. Digital Transformation Projects

digital_transformation = {
    "document_digitization": {
        "scope": "digitizing large-scale document archives",
        "pipeline": [
            "scan / photograph",
            "Dolphin parsing",
            "structured data storage",
            "search indexing"
        ]
    },
    "automated_processing": {
        "workflows": [
            "automated invoice processing",
            "contract information extraction", 
            "automated report summarization"
        ]
    }
}

2. Knowledge Management Systems

knowledge_management = {
    "academic_libraries": {
        "task": "automatic extraction of paper metadata",
        "benefits": "improved classification and search accuracy"
    },
    "corporate_archives": {
        "task": "building corporate document knowledge bases",
        "benefits": "providing information to support decision-making"
    }
}

Advanced Usage and Extension

🛠️ Model Fine-tuning Guide

1. Domain-specific Fine-tuning

# Medical document fine-tuning example
medical_domain_config = {
    "data_preparation": {
        "medical_reports": "diagnostic reports, prescriptions",
        "terminology": "adding medical terminology dictionaries",
        "privacy": "masking personally identifiable information"
    },
    "training_config": {
        "learning_rate": 1e-5,
        "batch_size": 8,
        "epochs": 10,
        "special_tokens": ["<MEDICAL>", "<PRESCRIPTION>", "<DIAGNOSIS>"]
    }
}

2. Multilingual Extension

# Korean language extension example
korean_extension = {
    "tokenizer_expansion": {
        "korean_vocab": "adding Korean vocabulary",
        "hanja_support": "supporting Chinese character notation",
        "mixed_script": "processing Korean-English mixed documents"
    },
    "dataset_creation": {
        "korean_documents": [
            "official documents", "academic papers", "news articles", "textbooks"
        ],
        "annotation_guidelines": "reflecting Korean language characteristics"
    }
}

📊 Performance Monitoring and Optimization

1. Real-time Performance Tracking

# Performance monitoring script
import time
import psutil
import torch

class PerformanceMonitor:
    def __init__(self):
        self.start_time = None
        self.memory_usage = []
        
    def start_monitoring(self):
        self.start_time = time.time()
        self.memory_usage = []
        
    def log_metrics(self, step, accuracy):
        current_memory = psutil.virtual_memory().used / 1024**3  # GB
        elapsed_time = time.time() - self.start_time
        
        metrics = {
            "step": step,
            "accuracy": accuracy,
            "memory_gb": current_memory,
            "elapsed_time": elapsed_time,
            "gpu_memory": torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0
        }
        
        return metrics

2. Deployment Optimization

# Production deployment configuration
production_config = {
    "model_optimization": {
        "quantization": "INT8 quantization",
        "pruning": "weight pruning", 
        "distillation": "knowledge distillation"
    },
    "inference_optimization": {
        "batching": "dynamic batching",
        "caching": "result caching",
        "parallel_workers": 4
    },
    "monitoring": {
        "latency_tracking": "response time tracking",
        "error_logging": "error logging",
        "usage_analytics": "usage pattern analysis"
    }
}

Conclusion and Future Outlook

🎯 Significance of Dolphin and the Fox Dataset

The Dolphin project and the Fox dataset mark an important milestone in document image parsing:

1. Technical Innovation

Analyze-then-Parse paradigm: An intuitive approach that mirrors how humans read documents
Heterogeneous anchor prompting: Effective handling of diverse document elements
Parallel processing mechanism: High efficiency and scalability

2. Dataset Value

Large-scale multi-granularity: Over 30 million diverse samples
Real-world scenario coverage: Academic, business, and educational domains included
Standardized evaluation environment: A fair comparison baseline for the research community

🚀 Future Research Directions

1. Technical Development Directions

future_directions = {
    "multimodal_fusion": {
        "vision_language": "more refined vision-language fusion",
        "3d_documents": "understanding three-dimensional document structure"
    },
    "interactive_parsing": {
        "user_feedback": "improvement based on user feedback",
        "adaptive_learning": "adaptive learning mechanisms"
    },
    "efficiency_improvements": {
        "edge_deployment": "deployment on edge devices",
        "real_time_processing": "real-time processing optimization"
    }
}

2. Application Domain Expansion

application_expansion = {
    "specialized_domains": [
        "legal_documents",
        "medical_records",
        "financial_reports",
        "historical_archives"
    ],
    "emerging_technologies": [
        "ar_vr_integration",
        "voice_interaction",
        "blockchain_verification"
    ]
}

💡 Recommendations for Practical Adoption

1. Adoption Strategy

Pilot project: Start small and expand gradually
Domain specialization: Customize for specific document types
Performance benchmarking: Establish a baseline using the Fox dataset
Continuous improvement: Update the model based on user feedback

2. Quality Assurance

quality_assurance = {
    "validation_pipeline": {
        "automated_testing": "automated accuracy testing",
        "human_review": "expert review process",
        "error_analysis": "error pattern analysis and improvement"
    },
    "continuous_monitoring": {
        "performance_tracking": "real-time performance monitoring",
        "drift_detection": "detecting model performance degradation",
        "retraining_triggers": "automatic determination of retraining timing"
    }
}

ByteDance Dolphin and the Fox dataset set a new benchmark for document understanding AI, enabling practical solutions across industries and research domains. Continued technical advancement and community contributions are expected to yield more refined and capable document parsing systems.

Further Reading: