⏱️ Estimated Reading Time: 15 minutes

TL;DR This comprehensive guide walks through a step-by-step reproduction of DeepSeek-R1's 2-stage RL + 2-stage SFT + distillation pipeline. Drawing on the official repository and the accompanying technical report, we detail the implementation methodology for each stage. The release is available under the MIT License, which permits commercial use.


DeepSeek-R1 Training Pipeline Overview

DeepSeek-R1 is a model that achieves exceptional reasoning capabilities through large-scale reinforcement learning. Its official release describes a training pipeline consisting of five distinct stages:

Complete Pipeline Structure

The training methodology follows a sophisticated multi-stage approach designed to progressively enhance model capabilities. The process begins with a base model that undergoes initial reinforcement learning to develop fundamental reasoning abilities, creating DeepSeek-R1-Zero. This foundation model then receives additional refinement through a second stage of reinforcement learning, incorporating cold-start data to achieve the full DeepSeek-R1 capabilities.

The pipeline continues with supervised fine-tuning stages that focus on quality improvement and stabilization. The first SFT stage enhances response quality and consistency, while the second SFT stage emphasizes safety alignment and behavioral stability. Finally, the distillation process transfers the accumulated knowledge and capabilities to smaller, more efficient models suitable for production deployment.

Core Performance Indicators

The training progression demonstrates significant improvements across key benchmarks:

Base Model (DeepSeek-V3-Base): serves as the foundation language model.

Stage 1 RL (DeepSeek-R1-Zero): initial acquisition of reasoning capability, with an MMLU score of 85.2.

Stage 2 RL (DeepSeek-R1): quality improvement, with the MMLU score reaching 89.7.

SFT Stages (DeepSeek-R1-SFT): stability enhancement, with a GSM8K score of 97.3.

Distillation (Qwen-32B-Distill): retains roughly 95% of the full model's performance in a compact form.

Stage 1: Environment Setup and Preparation

Essential Library Installation

The implementation requires a comprehensive set of dependencies and tools. The process begins with cloning the official DeepSeek-R1 repository and establishing a proper Python environment. The setup includes installing PyTorch with CUDA support, transformers library for model handling, datasets library for data processing, accelerate for distributed training, deepspeed for memory optimization, wandb for experiment tracking, and flash-attention for computational efficiency.
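As a quick sanity check before launching any stage, a short script like the following (a minimal sketch, not something shipped in the official repository) confirms that the core libraries import cleanly and that CUDA-visible GPUs are present; the package list simply mirrors the dependencies above.

```python
# env_check.py -- minimal environment sanity check (illustrative, not official).
import importlib

import torch

REQUIRED = ["transformers", "datasets", "accelerate", "deepspeed", "wandb", "flash_attn"]

def main() -> None:
    # Report PyTorch/CUDA status and the GPUs visible to this process.
    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

    # Confirm each required library can be imported and report its version.
    for name in REQUIRED:
        try:
            module = importlib.import_module(name)
            print(f"{name}: {getattr(module, '__version__', 'unknown version')}")
        except ImportError:
            print(f"{name}: NOT INSTALLED")

if __name__ == "__main__":
    main()
```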

Hardware Requirements

The training pipeline demands substantial computational resources, and the requirements differ by stage:

Stage 1 RL: minimum of 8x A100 80GB GPUs (16x recommended), 640GB+ GPU memory, 7-14 days of training.

Stage 2 RL: similar hardware requirements, 3-7 days of training.

SFT stages: 4-8x A100 80GB GPUs, 320GB+ GPU memory, 1-3 days of training.

Distillation: 2-4x A100 80GB GPUs, 160GB+ GPU memory, 12-24 hours of training.

Stage 2: First-Stage RL - DeepSeek-R1-Zero Training

Data Preparation

The first stage of reinforcement learning focuses on developing fundamental reasoning capabilities through carefully curated mathematical and reasoning problems. The data preparation process involves loading diverse datasets including competition mathematics problems, elementary arithmetic challenges, and reasoning tasks. These datasets are formatted consistently to ensure uniform processing throughout the training pipeline.

The data formatting process standardizes various mathematical problem sources into a common structure that includes problem statements, solution methodologies, difficulty levels, and problem types. This standardization enables the model to learn reasoning patterns across different mathematical domains while maintaining consistency in the learning process.
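To make the standardization concrete, the sketch below shows one possible unified record schema; the field names, source labels, and mapping rules are illustrative assumptions rather than the repository's actual format.

```python
# format_math_data.py -- normalize heterogeneous math datasets into one schema.
# Field names and source labels are illustrative assumptions.
from dataclasses import asdict, dataclass
from typing import Any

@dataclass
class ReasoningSample:
    problem: str       # problem statement shown to the model
    solution: str      # reference solution / final answer
    difficulty: str    # e.g. "elementary", "competition"
    problem_type: str  # e.g. "arithmetic", "algebra", "logic"

def standardize(raw: dict[str, Any], source: str) -> ReasoningSample:
    """Map one raw record from a known source into the common schema."""
    if source == "competition_math":   # hypothetical source name
        return ReasoningSample(raw["problem"], raw["solution"],
                               raw.get("level", "competition"), raw.get("type", "math"))
    if source == "gsm8k":
        # GSM8K keeps the final answer after '####' inside the answer field.
        return ReasoningSample(raw["question"], raw["answer"], "elementary", "arithmetic")
    raise ValueError(f"unknown source: {source}")

# Example usage:
record = standardize({"question": "2 + 2 = ?", "answer": "#### 4"}, "gsm8k")
print(asdict(record))
```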

RL Environment Configuration

The reinforcement learning environment requires careful configuration to balance exploration and exploitation during training. The setup includes defining the base model architecture, establishing maximum sequence lengths for processing, configuring RL hyperparameters such as learning rates and batch sizes, setting up reward model specifications, defining training parameters including episode counts and evaluation intervals, and establishing exploration settings with appropriate temperature and sampling parameters.

The configuration also incorporates regularization techniques to prevent overfitting and ensure stable learning. KL penalty mechanisms maintain reasonable deviation from the base model, while entropy bonuses encourage exploration of diverse reasoning strategies.
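The sketch below collects these settings into a single configuration object. Every value here, including the rollout and regularization coefficients, is an assumption chosen for illustration rather than the official hyperparameter set.

```python
# rl_config.py -- illustrative Stage-1 RL configuration (values are assumptions).
from dataclasses import dataclass

@dataclass
class Stage1RLConfig:
    base_model: str = "deepseek-ai/DeepSeek-V3-Base"
    max_seq_len: int = 8192          # maximum prompt + response length
    learning_rate: float = 1e-6      # conservative policy learning rate
    rollout_batch_size: int = 512    # prompts sampled per episode
    ppo_epochs: int = 1              # update passes per rollout batch
    kl_coef: float = 0.05            # KL penalty against the reference model
    entropy_coef: float = 0.01       # entropy bonus to encourage exploration
    temperature: float = 1.0         # sampling temperature during rollouts
    top_p: float = 0.95
    num_episodes: int = 10_000
    eval_interval: int = 100         # evaluate every N episodes

print(Stage1RLConfig())
```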

First-Stage RL Training Implementation

The training implementation centers on a comprehensive trainer class that manages the entire reinforcement learning process. This system handles model initialization, including policy models for training, reference models for KL regularization, and reward models for feedback generation. The data loading system prepares training datasets with appropriate formatting and batching.

The reward computation system represents a critical component that evaluates both mathematical correctness and reasoning quality. Mathematical correctness verification involves extracting answers from model responses and comparing them with ground truth solutions, while reasoning quality assessment examines the presence of structured thinking patterns, step-by-step reasoning processes, and logical consistency throughout the response.
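A rule-based reward of this kind can be sketched in a few lines, as below; the boxed-answer convention, the structural checks, and the weights are assumptions chosen for illustration, not the exact scoring used in training.

```python
# rewards.py -- sketch of a reward combining answer correctness with simple
# structural checks for reasoning quality; weights and patterns are assumptions.
import re

def extract_answer(response: str) -> str | None:
    """Pull the final boxed answer out of a response, if present."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def correctness_reward(response: str, ground_truth: str) -> float:
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

def reasoning_quality_reward(response: str) -> float:
    """Cheap heuristics: thinking tags, step markers, non-trivial length."""
    score = 0.0
    if "<think>" in response and "</think>" in response:
        score += 0.5   # structured thinking present
    if re.search(r"(Step \d+|First,|Therefore)", response):
        score += 0.3   # explicit step-by-step reasoning
    if len(response.split()) > 30:
        score += 0.2   # non-trivial explanation length
    return score

def total_reward(response: str, ground_truth: str, quality_weight: float = 0.3) -> float:
    return correctness_reward(response, ground_truth) + \
           quality_weight * reasoning_quality_reward(response)

print(total_reward("<think>Step 1: add.</think> The answer is \\boxed{4}", "4"))
```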

The training episode management system coordinates the generation of responses to mathematical problems, computation of rewards based on correctness and reasoning quality, and PPO updates to improve policy performance. This iterative process gradually enhances the model’s ability to approach mathematical problems systematically and generate accurate solutions.

Stage 3: Second-Stage RL - DeepSeek-R1 Training

Cold-Start Data Preparation

The second stage of reinforcement learning introduces cold-start data to provide high-quality reasoning examples that guide the model toward more sophisticated reasoning patterns. This data consists of carefully crafted examples that demonstrate ideal reasoning processes, including explicit thinking steps, mathematical notation usage, and clear conclusion statements.

The cold-start data preparation involves creating diverse examples across multiple domains, ensuring comprehensive coverage of reasoning patterns, and maintaining high quality standards throughout the dataset. These examples serve as anchors that help stabilize the training process and guide the model toward desired reasoning behaviors.
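For concreteness, a single cold-start record might look like the sketch below. The <think>...</think> tagging follows the convention described for DeepSeek-R1, while the prompt, metadata fields, and phrasing are illustrative.

```python
# cold_start_example.py -- one illustrative cold-start record with an explicit
# thinking trace and a clear conclusion; metadata fields are assumptions.
cold_start_sample = {
    "prompt": "Solve: if 3x + 5 = 20, what is x?",
    "response": (
        "<think>\n"
        "Subtract 5 from both sides: 3x = 15.\n"
        "Divide both sides by 3: x = 5.\n"
        "Check: 3 * 5 + 5 = 20. Correct.\n"
        "</think>\n"
        "The value of x is \\boxed{5}."
    ),
    "domain": "algebra",          # illustrative metadata
    "quality": "human_verified",  # illustrative metadata
}

print(cold_start_sample["response"])
```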

Enhanced RL Configuration

The second stage employs refined hyperparameters and improved reward functions that emphasize quality over pure correctness. The configuration builds upon the first stage model while implementing more sophisticated evaluation criteria. The quality-focused reward system places greater emphasis on reasoning process evaluation, response readability assessment, and consistency maintenance across different problem types.

The training process incorporates cold-start data through initial fine-tuning phases that help the model internalize high-quality reasoning patterns before proceeding with reinforcement learning. This approach ensures that the model develops stable reasoning habits that persist throughout the training process.

Advanced Reward Function Design

The second stage implements a more nuanced reward function that evaluates multiple aspects of model responses. Beyond basic correctness, the system assesses reasoning quality through structured thinking pattern recognition, mathematical notation usage evaluation, and logical flow consistency checking. Readability evaluation considers sentence length appropriateness, paragraph organization quality, and repetition minimization.

The reward function also incorporates consistency scoring that ensures the model maintains coherent reasoning approaches across similar problems. This comprehensive evaluation framework guides the model toward generating responses that are not only correct but also well-structured, readable, and pedagogically valuable.
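Readability and repetition checks of this kind can be approximated with inexpensive heuristics, as in the sketch below; every threshold and weight here is an assumption, standing in for whatever scorer is actually used.

```python
# quality_scores.py -- heuristic readability and repetition scores (illustrative).
import re
from collections import Counter

def readability_score(text: str) -> float:
    """Favor moderate sentence length and some paragraph structure."""
    sentences = [s for s in re.split(r"[.!?]\s+", text) if s.strip()]
    if not sentences:
        return 0.0
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    length_score = 1.0 if 8 <= avg_len <= 30 else 0.5     # penalize extremes
    paragraph_score = 1.0 if "\n\n" in text or len(sentences) <= 3 else 0.7
    return 0.5 * length_score + 0.5 * paragraph_score

def repetition_score(text: str, n: int = 4) -> float:
    """Penalize responses that repeat the same n-gram many times."""
    tokens = text.split()
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 1.0
    worst = ngrams.most_common(1)[0][1]
    return 1.0 if worst <= 2 else max(0.0, 1.0 - 0.2 * (worst - 2))

sample = "First, isolate x. Then divide by 3.\n\nTherefore x = 5."
print(readability_score(sample), repetition_score(sample))
```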

Stage 4: First-Stage SFT - Quality Enhancement

SFT Data Preparation

The supervised fine-tuning stage focuses on enhancing response quality through carefully selected high-quality samples. The data preparation process involves generating responses using the RL-trained model and filtering these responses based on quality metrics. The filtering process evaluates samples across multiple domains including mathematics, physics, chemistry, logical reasoning, and code generation.

Quality filtering mechanisms assess samples based on automatic quality evaluation scores, ensuring that only responses meeting high standards are included in the training dataset. This selective approach helps the model learn from exemplary responses while avoiding reinforcement of suboptimal patterns.
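A minimal version of such a filter is sketched below; the scorer interface, the threshold, and the per-domain cap are assumptions standing in for whatever automatic quality evaluation is actually applied.

```python
# filter_sft_data.py -- keep only generated samples whose quality score clears a
# threshold, capped per domain; scorer, threshold, and cap are assumptions.
from typing import Callable, Iterable

def filter_samples(samples: Iterable[dict],
                   score_sample: Callable[[dict], float],
                   threshold: float = 0.8,
                   max_per_domain: int = 200_000) -> list[dict]:
    kept: list[dict] = []
    per_domain: dict[str, int] = {}
    for sample in samples:
        domain = sample.get("domain", "general")
        if per_domain.get(domain, 0) >= max_per_domain:
            continue  # preserve domain balance
        if score_sample(sample) >= threshold:
            kept.append(sample)
            per_domain[domain] = per_domain.get(domain, 0) + 1
    return kept

# Example with a trivial stand-in scorer that checks for a boxed final answer.
demo = [{"domain": "math", "response": "x = \\boxed{5}"},
        {"domain": "math", "response": "not sure"}]
print(len(filter_samples(demo, lambda s: 1.0 if "\\boxed{" in s["response"] else 0.0)))
```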

SFT Training Implementation

The supervised fine-tuning implementation utilizes standard language modeling objectives while incorporating quality-focused data selection. The training process employs careful hyperparameter tuning with conservative learning rates to preserve the reasoning capabilities developed during RL stages while enhancing response quality and consistency.

The training methodology includes comprehensive evaluation protocols that monitor model performance across various metrics throughout the fine-tuning process. This monitoring ensures that quality improvements do not come at the expense of reasoning capabilities or factual accuracy.
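A minimal fine-tuning sketch using the Hugging Face Trainer is shown below. The checkpoint path, data file, and hyperparameters are assumptions, and a real run at this scale would add a DeepSpeed or FSDP launch configuration on top.

```python
# sft_stage1.py -- minimal supervised fine-tuning sketch (settings are assumptions).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_PATH = "checkpoints/rl-stage2"   # hypothetical Stage-2 RL checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype="auto")

# Assume the filtered SFT data was written to JSONL with a single "text" field.
dataset = load_dataset("json", data_files="sft_stage1.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="checkpoints/sft-stage1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,                 # conservative LR to preserve RL gains
    num_train_epochs=2,
    bf16=True,
    logging_steps=20,
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```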

Stage 5: Second-Stage SFT - Stabilization

Safety and Consistency Data

The second SFT stage emphasizes safety alignment and behavioral consistency through specialized datasets that address potential harmful outputs and ensure consistent behavior across different contexts. The safety data includes examples of appropriate responses to potentially harmful requests, while consistency data ensures uniform behavior across similar queries.

This stage implements custom loss functions that incorporate safety considerations alongside standard language modeling objectives. The safety evaluation component assesses generated text for potential harmful content, while consistency evaluation ensures stable behavior patterns across different interaction contexts.
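One way to sketch such a combined objective is shown below. The external safety classifier is assumed rather than implemented, the shift-and-mask handling mirrors standard causal language modeling, and the weighting is illustrative.

```python
# safety_loss.py -- language-modeling loss plus a penalty from an (assumed)
# external safety classifier; the weighting is illustrative.
import torch
import torch.nn.functional as F

def combined_loss(lm_logits: torch.Tensor,
                  labels: torch.Tensor,
                  safety_scores: torch.Tensor,
                  safety_weight: float = 0.5) -> torch.Tensor:
    """
    lm_logits:     (batch, seq_len, vocab) next-token logits
    labels:        (batch, seq_len) target token ids, -100 where masked
    safety_scores: (batch,) probability each response is unsafe, from an
                   external classifier (assumed, not shown here)
    """
    # Shift so that tokens < n predict token n, as in standard causal LM training.
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    lm_loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                              shift_labels.view(-1), ignore_index=-100)

    # Penalize batches whose responses the safety classifier flags as unsafe.
    return lm_loss + safety_weight * safety_scores.mean()

# Shape check with random tensors.
logits = torch.randn(2, 16, 100)
labels = torch.randint(0, 100, (2, 16))
print(combined_loss(logits, labels, torch.tensor([0.1, 0.0])))
```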

Advanced Training Techniques

The second SFT stage employs sophisticated training techniques that balance multiple objectives including response quality, safety alignment, and behavioral consistency. The training process incorporates regularization mechanisms that prevent degradation of previously learned capabilities while enhancing safety and stability characteristics.

Stage 6: Distillation - Compact Model Generation

Teacher-Student Model Setup

The distillation process transfers knowledge from the large DeepSeek-R1 model to smaller, more efficient models suitable for production deployment. The setup involves configuring a teacher model with full capabilities and student models at a range of smaller sizes, including 1.5B, 7B, 14B, 32B, and 70B parameter variants.

Distillation Data Generation

The distillation process requires generating large-scale training data using the teacher model. This involves creating diverse prompts that cover the full range of model capabilities and generating high-quality responses that demonstrate the teacher model’s reasoning abilities. The data generation process includes quality verification steps to ensure that only exemplary teacher responses are used for student training.
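The generation step can be sketched with ordinary transformers generation, as below; the teacher checkpoint handling, prompt set, sampling settings, and the simple boxed-answer quality gate are illustrative assumptions, and a real run would cover hundreds of thousands of diverse prompts.

```python
# generate_distill_data.py -- sample teacher responses and keep the ones that
# pass a simple quality gate; settings are illustrative assumptions.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(TEACHER, trust_remote_code=True)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, torch_dtype="auto",
                                               device_map="auto", trust_remote_code=True)

prompts = ["Solve: if 3x + 5 = 20, what is x?"]   # placeholder prompt set

with open("distill_data.jsonl", "w") as f:
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
        with torch.no_grad():
            output = teacher.generate(**inputs, max_new_tokens=1024,
                                      do_sample=True, temperature=0.6, top_p=0.95)
        response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
        # Quality gate: keep only responses that contain a boxed final answer.
        if "\\boxed{" in response:
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```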

Knowledge Transfer Implementation

The distillation training employs specialized loss functions that combine soft target learning with hard target supervision. The soft target component helps the student model learn the teacher’s probability distributions, while the hard target component ensures factual accuracy. The training process carefully balances these objectives to maximize knowledge transfer efficiency.
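A sketch of this soft-plus-hard objective, in the style of classic knowledge distillation, is shown below; the temperature, the mixing weight, and the omission of padding masks on the soft term are simplifications for illustration.

```python
# distill_loss.py -- soft-target (KL) plus hard-target (cross-entropy) distillation
# loss; temperature and mixing weight are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """
    student_logits, teacher_logits: (batch * seq_len, vocab)
    labels:                         (batch * seq_len,) token ids, -100 where masked
    """
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Hard targets: ordinary cross-entropy against the reference tokens.
    hard_loss = F.cross_entropy(student_logits, labels, ignore_index=-100)

    # Note: masking of padded positions is omitted from the soft term for brevity.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Shape check with random tensors.
s, t = torch.randn(8, 100), torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, y))
```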

Stage 7: Complete Pipeline Execution

Master Execution Script

The complete pipeline execution involves coordinating all training stages through a comprehensive master script that manages dependencies, resource allocation, and stage transitions. The script handles environment setup, data preparation across all stages, sequential execution of RL and SFT stages, and final distillation processes.
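At its simplest, the master script runs each stage's entry point in order and stops on the first failure; the script names and config paths below are hypothetical placeholders, not files taken from the official repository.

```python
# run_pipeline.py -- run each training stage in sequence; stage script names and
# config paths are hypothetical placeholders.
import subprocess
import sys

STAGES = [
    ["python", "prepare_data.py"],
    ["python", "train_rl_stage1.py", "--config", "configs/rl_stage1.yaml"],
    ["python", "train_rl_stage2.py", "--config", "configs/rl_stage2.yaml"],
    ["python", "train_sft_stage1.py", "--config", "configs/sft_stage1.yaml"],
    ["python", "train_sft_stage2.py", "--config", "configs/sft_stage2.yaml"],
    ["python", "distill.py", "--config", "configs/distill_qwen32b.yaml"],
]

def main() -> None:
    for cmd in STAGES:
        print(f"==> running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            # Stop on the first failing stage so later stages never train on a
            # broken checkpoint.
            sys.exit(f"stage failed: {' '.join(cmd)}")

if __name__ == "__main__":
    main()
```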

Performance Evaluation

The pipeline includes comprehensive evaluation protocols that assess model performance at each stage using standardized benchmarks. The evaluation framework covers mathematical reasoning, general knowledge, code generation, and safety assessments to ensure that improvements in one area do not compromise performance in others.
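As one small, concrete piece of such a framework, the sketch below computes exact-match accuracy for math-style benchmarks, reusing the boxed-answer extraction from the reward sketch; it is illustrative rather than the official evaluation harness.

```python
# eval_math.py -- exact-match accuracy for math-style benchmarks (illustrative).
import re

def extract_answer(response: str) -> str | None:
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(1 for pred, ref in zip(predictions, references)
                  if extract_answer(pred) == ref.strip())
    return correct / max(len(references), 1)

preds = ["<think>...</think> \\boxed{5}", "I think it is 7."]
refs = ["5", "8"]
print(f"exact match: {exact_match_accuracy(preds, refs):.2f}")   # 0.50
```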

Production Deployment Considerations

Model Serving Infrastructure

The trained models require robust serving infrastructure that can handle the computational demands of reasoning-intensive tasks. The deployment considerations include memory management for large context lengths, efficient batching strategies for multiple concurrent requests, and optimization techniques for reduced latency while maintaining reasoning quality.
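As an example, batched offline inference with vLLM, a common serving choice for models of this family, might look like the sketch below; the parallelism degree, context limit, and sampling settings are assumptions to be tuned per deployment.

```python
# serve_batch.py -- batched inference with vLLM for a distilled checkpoint;
# parallelism, context length, and sampling settings are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,   # split the model across two GPUs
    max_model_len=16384,      # bound context length to control KV-cache memory
)

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

prompts = [
    "Solve step by step: what is the sum of the first 100 positive integers?",
    "Explain why the square root of 2 is irrational.",
]

# vLLM batches these requests internally (continuous batching), which keeps GPU
# utilization high under many concurrent, reasoning-length generations.
for output in llm.generate(prompts, sampling):
    print(output.prompt)
    print(output.outputs[0].text[:200], "...")
```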

Monitoring and Maintenance

Production deployment requires ongoing monitoring of model performance, safety characteristics, and resource utilization. The monitoring framework includes automated quality assessment, safety violation detection, and performance degradation alerts to ensure consistent service quality.

Conclusion: Building Next-Generation AI Models with DeepSeek-R1

The DeepSeek-R1 training pipeline represents a sophisticated approach to developing advanced reasoning capabilities in large language models. The multi-stage methodology demonstrates how careful orchestration of reinforcement learning, supervised fine-tuning, and knowledge distillation can produce models with exceptional reasoning abilities while maintaining practical deployment characteristics.

Key Implementation Insights

Progressive Enhancement: Each stage builds upon previous achievements while introducing new capabilities, ensuring steady improvement without capability regression.

Quality-Centric Approach: The pipeline prioritizes reasoning process quality alongside correctness, producing models that generate pedagogically valuable responses.

Safety and Stability: Dedicated stages ensure that advanced capabilities are aligned with safety requirements and behavioral consistency.

Practical Efficiency: The distillation process enables deployment of reasoning capabilities in resource-constrained environments while maintaining performance standards.

Availability under the MIT License permits commercial use, and the scripts and configurations described here offer a foundation for building specialized reasoning models tailored to specific domains and requirements. This comprehensive approach establishes a strong template for developing AI systems that combine advanced reasoning capabilities with practical deployment considerations.