⏱️ Estimated reading time: 12 min

Introduction

One of the most fundamental challenges in large language model post-training is the exploration-exploitation dilemma. A model needs to explore diverse response strategies during training, but simultaneously must exploit what it has already learned to maximize performance. If either side tilts excessively, training stability suffers or the model becomes trapped in local optima.

DuPO (Dual-Policy Optimization) approaches this challenge with a novel solution: simultaneously training two distinct policies, an exploration policy and an exploitation policy, and having them complement each other. This research demonstrates that this architecture achieves 2.13 COMET improvement in translation, 6.4 point improvement in mathematical accuracy, and 9.3 point improvement in reranking performance.

Background: Limitations of Existing RL Post-Training

Problems with Single-Policy Optimization

Most existing RL-based LLM post-training methods such as PPO, GRPO, and DPO optimize a single policy. This single-policy approach has several inherent limitations.

First, there is the exploration-exploitation trade-off problem. Within a single policy, strengthening exploitation reduces exploration, and vice versa. These two goals fundamentally conflict within a single policy.

Second, reward hacking becomes an issue. When a single policy is optimized around a specific reward signal, it learns strategies that score high on that metric while not genuinely improving. This causes overfitting to the reward function.

Third, capability forgetting occurs. As the model is trained toward one direction, capabilities in other areas that had been learned tend to degrade. This is especially problematic in multi-task learning environments.

Motivation for Dual-Policy Design

DuPO’s core insight is that exploration and exploitation are two distinct roles that should not be conflated. The exploration policy is responsible for discovering diverse solution strategies, while the exploitation policy is responsible for maximally utilizing those strategies to generate high-quality responses.

These two policies influence each other’s training through knowledge sharing, ultimately achieving a superior balance of performance and stability compared to single-policy optimization.

DuPO Architecture and Design Principles

Two Policy Roles

Exploration Policy: This policy prioritizes diversity. Its role is to discover diverse response strategies beyond existing patterns, exploring regions not yet covered by the exploitation policy.

Exploitation Policy: This policy prioritizes quality. Based on strategies discovered by the exploration policy, it generates the highest quality responses. It is the model that actually produces output in the inference stage.

Unified Input Space

The key innovation of DuPO is that both policies operate in a unified input space. Formally, the input $x = x_k \oplus x_u$ consists of known components $x_k$ and unknown components $x_u$. Both policies receive the same input, but generate responses with different emphases.

This unified input space design allows information discovered by the exploration policy to be directly usable by the exploitation policy, enabling efficient knowledge transfer.

Policy Interaction Mechanism

The two policies interact through the following mechanisms during training:

  1. Response sharing: High-quality responses discovered by the exploration policy are incorporated into the exploitation policy’s training data
  2. Gradient signal sharing: The loss signals of both policies influence each other’s gradient updates
  3. Knowledge distillation: The exploitation policy learns to distill the diverse knowledge of the exploration policy

Training Objective and Optimization Method

Dual-Policy Optimization Objective

The training objective of DuPO can be expressed as follows:

\[J(\theta) = \mathbb{E}[R(x,y)]\]

Here, $J(\theta)$ is the total optimization objective, $R(x,y)$ is the reward for input $x$ and response $y$, and $\theta$ represents the parameters of both policies.

The policy gradient for this optimization is:

\[\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(y|x) \cdot R(x,y)]\]

However, in DuPO this gradient signal is decomposed into separate signals for each policy, and the interaction terms between policies are additionally included.

Balanced Training Strategy

The balance between the two policies is maintained through the following training strategy:

Alternating Training: The exploration policy and exploitation policy are trained alternately, updating one in fixed intervals while keeping the other fixed.

Adaptive Weighting: Based on the training progress of each policy, adaptive weights are applied to gradient updates. If one policy develops too quickly, its learning rate is reduced.

Cross-Policy Reward: In addition to the reward each policy receives from its own responses, it also receives rewards for how much it contributes to the other policy’s improvement.

Experimental Design and Evaluation Methodology

Diverse Task Coverage

DuPO was evaluated across three distinct tasks: machine translation, mathematical problem solving, and document reranking. This choice was intentional to verify that the framework generalizes across diverse domains beyond a specific domain.

Machine Translation: Evaluated using the COMET metric on WMT benchmarks. COMET is a neural evaluation metric that reflects actual translation quality better than traditional metrics like BLEU.

Mathematical Problem Solving: Evaluated using GSM8K and MATH benchmarks. These test the ability to solve diverse mathematical reasoning problems from elementary to competition level.

Document Reranking: Evaluated using NDCG (Normalized Discounted Cumulative Gain) on passage retrieval benchmarks. This tests the ability to rank the most relevant documents first.

Baseline Comparison

DuPO was compared against the following baseline methods:

  • PPO (Proximal Policy Optimization): Standard RL-based fine-tuning method
  • GRPO (Group Relative Policy Optimization): Group-based relative reward optimization
  • DPO (Direct Preference Optimization): Direct preference learning without RL
  • IPO (Identity Preference Optimization): Improved DPO variant

Evaluation Metrics

Main evaluation metrics:

  • COMET score: Translation quality (higher is better, range -1 to 1)
  • Accuracy: Mathematical problem solving rate (%)
  • NDCG@10: Reranking quality (higher is better, range 0 to 1)

Key Experimental Results

Machine Translation Performance

In the machine translation task, DuPO recorded 2.13 COMET improvement over the baseline. This is a significant performance gain, as in the COMET metric even a 0.5 point improvement is typically considered meaningful.

Compared to single-policy methods:

  • vs PPO: +1.87 COMET
  • vs DPO: +1.42 COMET
  • vs GRPO: +1.23 COMET

Particularly noteworthy is that DuPO showed superior performance not just in improving average performance but in consistency across diverse language pairs. This shows that the exploration policy’s contribution in discovering diverse translation strategies is effective.

Mathematical Reasoning Performance

In mathematical problem solving, DuPO achieved 6.4 point accuracy improvement. This improvement was particularly pronounced in multi-step problems, suggesting that the exploration policy’s contribution is especially significant in complex reasoning tasks.

Breakdown by difficulty:

Difficulty Baseline DuPO Improvement
Elementary 89.2% 92.1% +2.9pt
Intermediate 72.4% 78.8% +6.4pt
Advanced 45.6% 53.2% +7.6pt

As problem difficulty increases, the performance improvement from DuPO becomes larger. This is consistent with the inference that the effect of policy diversity through exploration is more pronounced for complex problems.

Document Reranking Performance

In the reranking task, DuPO recorded 9.3 point NDCG@10 improvement, which is the most dramatic result among the three tasks. The exploration policy’s contribution is especially large in the reranking task, where it needs to discover diverse relevance judgment criteria.

Analysis of Stability and Training Efficiency

Beyond performance improvements, DuPO also showed superior training stability. The standard deviation in reward signal variance during training was 34% lower compared to single-policy methods.

Regarding training efficiency, to achieve the same performance level DuPO required:

  • 23% fewer training steps compared to PPO
  • 31% fewer training steps compared to DPO

This is because the mutual complementarity of the two policies enables more efficient exploration of the solution space.

Analysis of Emergent Behaviors

Spontaneous Task Specialization

One particularly interesting observation during DuPO training is that the two policies spontaneously specialize in different aspects even for the same task.

For example, in the translation task, the exploration policy concentrated more on generating diverse grammatical structures and vocabulary choices, while the exploitation policy focused more on accuracy and fluency. This spontaneous specialization was not explicitly designed but emerged naturally from the optimization objective.

Complementary Learning Dynamics

The learning dynamics of the two policies showed a complementary pattern. In early training stages, the exploration policy develops faster, but over time the exploitation policy gradually catches up and surpasses the exploration policy’s performance.

This pattern is interpreted as the exploration policy first mapping out the solution space, and the exploitation policy then maximally utilizing that information. The dynamic interaction of this two-stage learning process contributes significantly to DuPO’s high performance.

Implementation Details

Architecture Configuration

DuPO can be implemented on top of various LLM architectures. The specific configuration used in the research is as follows:

  • Base model: LLaMA-based 7B and 13B parameter models
  • Policy implementation: Each policy shares the same base model weights but has different fine-tuning layer configurations
  • Training infrastructure: Standard distributed training setup

Hyperparameter Settings

Key hyperparameter settings:

config = {
    "exploration_lr": 2e-5,
    "exploitation_lr": 1e-5,
    "alternating_interval": 100,  # steps
    "cross_policy_reward_weight": 0.3,
    "kl_penalty_coefficient": 0.1,
    "max_sequence_length": 2048,
    "batch_size": 32
}

An important observation during implementation is the learning rate setting for the two policies. The exploration policy uses a higher learning rate to favor faster, more exploratory updates, while the exploitation policy uses a lower learning rate to prioritize stable convergence.

Memory Efficiency Optimization

Since DuPO trains two policies simultaneously, naively implemented it would approximately double memory usage. To address this, the research team applied the following optimization techniques:

Weight Sharing: Share the lower layers of both policies, only differentiating the upper layers.

Gradient Checkpointing: Apply gradient checkpointing during training to reduce memory usage.

Dynamic Batching: Adjust batch size dynamically based on the computation requirements of each policy.

These optimizations allowed DuPO to be operated in a memory footprint approximately 1.4x that of a single-policy method, rather than 2x.

Applications and Potential

Applicability to Multi-Task Learning

DuPO’s dual-policy structure is especially powerful in multi-task learning settings. The exploration policy can find solution strategies common across diverse tasks, while the exploitation policy can specialize for each task.

In practice, this enables a single DuPO model to achieve high performance across multiple tasks simultaneously, significantly improving deployment efficiency compared to training separate specialized models.

Value in Continuous Learning Settings

DuPO is also advantageous in continuous learning settings. As new tasks or data are added, the exploration policy efficiently discovers new solution spaces while the exploitation policy’s previously accumulated knowledge is preserved and utilized.

This enables gradual performance improvement without catastrophic forgetting, which is a critical advantage for AI systems deployed in real-world environments.

Possibilities in Low-Resource Scenarios

In low-resource scenarios with limited labeled data, DuPO demonstrates particular value. The exploration policy can efficiently discover diverse solution strategies even with limited data, and the exploitation policy can learn to generate high-quality responses from this.

In experiments, DuPO showed superior performance with only 30% of the data compared to single-policy methods achieving the same performance. This represents an important economic advantage in real-world deployment.

Limitations and Future Research Directions

Current Limitations

The main limitations of DuPO are as follows. First, training complexity. Managing two policies simultaneously makes the hyperparameter tuning process more complex. Particularly, finding the right balance between policies is not straightforward.

Second, interpretability challenges. When two policies interact with each other, understanding the reasoning for each policy’s decisions becomes more difficult. This can be a barrier in high-stakes domains requiring interpretability.

Third, scaling challenges. The effectiveness of the dual-policy approach when scaled to very large models (100B+ parameters) has not been fully validated.

Future Research Directions

Extension to three or more policies: Following the dual-policy design principle, extending to three or more policies could potentially achieve even greater performance. For example, combining an exploration policy, exploitation policy, and verification policy.

Dynamic policy weighting: Rather than fixed alternating training intervals, dynamically adjusting the weighting of each policy based on the current training state.

Application to multimodal settings: Extending DuPO’s principles to multimodal models that handle both image and text.

Automated hyperparameter optimization: Developing automated techniques for finding the optimal balance between the two policies.

Conclusion

DuPO presents a novel solution to the fundamental exploration-exploitation dilemma in LLM post-training. By separating the roles of exploration and exploitation into distinct policies and training them to complement each other, it achieves superior performance and stability compared to single-policy methods.

The experimental results clearly demonstrate this effectiveness: 2.13 COMET improvement in translation, 6.4 point improvement in mathematical accuracy, and 9.3 point improvement in reranking. These improvements are consistent across diverse tasks, confirming the generalizability of DuPO’s approach.

From a technical perspective, DuPO’s dual-policy structure is a new contribution to the field of LLM post-training that can be applied across diverse domains. In particular, the finding that spontaneous policy specialization and complementary learning dynamics emerge from simple dual-policy training is an important insight.

From a practical perspective, DuPO offers concrete advantages in multi-task learning, continuous learning, and low-resource scenarios. Given that these are important challenges for real AI system deployment, DuPO’s contributions have great practical value beyond pure academic advancement.

Reference: DuPO: Dual-Policy Optimization for Large Language Models - relevant research paper on dual-policy RL post-training for LLMs.