⏱️ Estimated Reading Time: 15 minutes

Introduction: New Horizons in 2025 Reinforcement Learning Post-Training

2025 has witnessed an unprecedented wave of innovation in reinforcement learning-based post-training (RL post-training). Beyond standard reinforcement learning from human feedback (RLHF), a range of techniques has emerged that enhance agentic capabilities, learn from the model’s own feedback, and pursue stability and efficiency at the same time.

This article provides an in-depth analysis of 10 key reinforcement learning post-training papers published since April 2025, selected for popularity and impact. The goal is to offer insights that can be applied directly in practice, from each paper’s core ideas to its practical applicability.

Particularly noteworthy are the combination of GLM-4.5’s large-scale MoE architecture with RL post-training, the Qwen team’s GSPO sequence-level stabilization technique, and self-feedback learning that requires no external labeling; these are innovative approaches that could shift industry paradigms.

Ranks 1-5: Innovative Architectures and Stabilization Techniques

1st Place: GLM-4.5 - New Standard for Agentic AI (2025-08-08)

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models is one of the most notable research papers in the second half of 2025. This model, based on a 355B parameter Mixture of Experts (MoE) architecture, achieved top benchmark performance in agentic, reasoning, and coding capabilities through large-scale reinforcement learning post-training.

Core Innovation Points

GLM-4.5’s most important innovation is improving agent capabilities through end-to-end, large-scale RL post-training. Rather than merely optimizing text generation, the team built a reinforcement learning framework specialized for agentic tasks that require complex multi-step reasoning and tool use.

The evaluation methodology on agentic benchmarks such as TAU-Bench offers practitioners new ideas for KPI design. Rather than relying on simple accuracy or BLEU scores, it comprehensively evaluates the agent’s tool-use efficiency, the logical consistency of its multi-step reasoning, and how accurately it understands user intent.

Practical Application Points

GLM-4.5’s results carry the following implications for building agentic AI systems in enterprise environments: benchmark design (adopt TAU-Bench-style multi-dimensional evaluation as internal KPIs), training pipelines (ensure RL stability in MoE architectures), and performance monitoring (develop qualitative evaluation metrics for agent behavior).

2nd Place: GSPO - Breakthrough in Sequence-Level Stabilization (2025-07-24)

Group Sequence Policy Optimization (GSPO) is an innovative stabilization technique proposed by the Qwen team that solves fundamental limitations of existing GRPO (Group Relative Policy Optimization).

Core of Technical Innovation

GSPO’s essence is clipping the importance ratio at the sequence level rather than the token level. Expressed mathematically, the GSPO loss is the expectation, over trajectories sampled from the old (behavior) policy, of the minimum of the importance ratio times the advantage and the clipped importance ratio times the advantage.

Here, both the importance ratio and the advantage A(τ) are defined over the entire sequence τ, which stabilizes training by taking the full sequence context into account, unlike traditional token-level clipping.
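Reconstructing this in symbols (the length-normalized form of the sequence ratio below follows the GSPO paper’s definition as I understand it; treat the exact notation as illustrative):

```latex
J_{\text{GSPO}}(\theta)
= \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\left[
    \min\Big( s(\tau)\,A(\tau),\;
              \mathrm{clip}\big(s(\tau),\,1-\varepsilon,\,1+\varepsilon\big)\,A(\tau) \Big)
  \right],
\qquad
s(\tau) = \left( \frac{\pi_{\theta}(\tau \mid x)}{\pi_{\theta_{\text{old}}}(\tau \mid x)} \right)^{1/|\tau|}
```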

Fundamental Solution to Stability Problems

Through sequence-level clipping, GSPO effectively prevents the training divergence and performance collapse that were major problems with GRPO. In particular, it keeps under control the gradient explosions that frequently occur in large MoE models.

Practical Implementation Guide

GSPO has been applied to the Qwen3 model line, where its effectiveness has been verified. Practitioners can approach adoption in steps: diagnose existing GRPO pipelines for training-stability issues, build sequence-level (not just token-level) evaluation metrics, and roll out gradually, starting with small models before expanding to large ones.
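As a rough illustration of the sequence-level clipping idea, a minimal loss function might look like the sketch below. This is not the official Qwen implementation; the function name and the length-normalized log-ratio are assumptions made for the example.

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Sequence-level clipped policy loss in the spirit of GSPO.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the
        current and old policies; mask: (batch, seq_len) 0/1 for response tokens.
    advantages: (batch,) one scalar advantage per sequence.
    """
    lengths = mask.sum(dim=1).clamp(min=1)
    # Length-normalized, sequence-level importance ratio (assumed form).
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=1) / lengths
    ratio = log_ratio.exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO-style pessimistic objective, applied once per sequence rather than per token.
    return -torch.min(unclipped, clipped).mean()
```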

3rd Place: GTPO & GRPO-S - The Science of Precise Credit Assignment (2025-08-18)

GTPO (Group Token Policy Optimization) and GRPO-S directly address one of the most difficult problems in reinforcement learning: the credit assignment problem.

Nature of the Credit Assignment Problem

The fundamental limitation of existing GRPO-series algorithms was coarse credit assignment. It was difficult to accurately identify which tokens actually contributed to final rewards in long sequences. GTPO solves this with entropy-weighted token-level rewards.

The GTPO reward at time step t equals the trajectory reward multiplied by the ratio of the policy entropy at that step to the sum of policy entropies across all steps. The policy entropy at step t represents the model’s uncertainty at that decision point, so more credit is allocated to steps with higher entropy, i.e., decisions made under greater uncertainty.
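In symbols, the entropy-weighted allocation described above can be written as follows (the notation is mine; the paper adds further shaping details):

```latex
R_t = R(\tau)\cdot \frac{H_t}{\sum_{k=1}^{T} H_k},
\qquad
H_t = -\sum_{a} \pi_\theta(a \mid s_t)\,\log \pi_\theta(a \mid s_t)
```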

Innovation in Long-Form Reasoning

GTPO is particularly effective in Chain-of-Thought (CoT) reasoning. It provides mathematical and experimental answers to the fundamental question of where rewards should be assigned in order to improve reasoning ability.

GRPO-S is a lightweight version of GTPO that maintains credit assignment precision while reducing computational complexity through sequence-level approximation.

4th Place: RLSF - Revolution in Self-Feedback (2025-07-29)

Reinforcement Learning from Self-Feedback (RLSF) is an innovative approach that enables reinforcement learning without external labeling or reward models.

Self-Confidence Based Reward System

RLSF’s core idea is to use the model’s own confidence as an intrinsic reward: outputs generated with high confidence receive positive reward, while low-confidence outputs receive negative reward.

Confidence is calculated as the average of maximum probability values at each token generation point, representing how confident the model is in its choices.
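A minimal sketch of that confidence score follows; the simple averaging of per-token maximum probabilities mirrors the description above, but the function name and interface are assumptions.

```python
import torch
import torch.nn.functional as F

def self_confidence(logits, mask):
    """Mean of the maximum token probability at each generation step.

    logits: (batch, seq_len, vocab) raw model outputs.
    mask:   (batch, seq_len) 0/1 marking generated (non-prompt) tokens.
    """
    probs = F.softmax(logits, dim=-1)
    max_probs = probs.max(dim=-1).values                        # (batch, seq_len)
    conf = (max_probs * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return conf                                                 # (batch,) in [0, 1]
```

One simple way to obtain positive and negative rewards from this score is to center it against a batch mean, though the paper’s exact mapping may differ.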

Realization of Low-Cost High-Efficiency Learning

RLSF’s biggest advantage is cost efficiency. It provides an immediately applicable RL pipeline without human labelers or separate reward model training. It’s particularly effective for early-stage startups without sufficient labeling budgets, domain-specific tasks where expert feedback is hard to collect, and prototyping that requires rapid experimentation and validation.

5th Place: ReLIFT - Harmonious Collaboration Between RL and SFT (2025-06-09)

Learning What Reinforcement Learning Can’t: Interleaved Online Fine-Tuning (ReLIFT) systematically analyzed the complementary relationship between reinforcement learning and supervised fine-tuning.

Role Division Between RL and SFT

ReLIFT’s core insight is as follows: reinforcement learning (RL) specializes in distribution alignment and performance optimization within the model’s existing knowledge, while supervised fine-tuning (SFT) excels at acquiring new knowledge and patterns.

Based on this, it proposes a pipeline that alternates between RL and online SFT: Phase 1 uses SFT to inject new knowledge and patterns, Phase 2 uses RL to optimize performance within the existing distribution, and Phase 3 switches back to SFT when a performance plateau is detected.

Practical Operation Recipe

ReLIFT offers a concrete remedy for the “performance plateau with RL alone” commonly experienced in practice. It suggests the following indicators for designing automatic transition points: learning-plateau detection (performance improvement below a threshold for N consecutive epochs), knowledge-gap identification (performance degradation on new task patterns), and distribution-shift monitoring (a widening gap between the training and evaluation distributions).
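A toy sketch of how such a transition trigger could drive the RL ↔ SFT alternation; the window, threshold, and function names are illustrative placeholders, not values from the paper.

```python
def plateaued(history, window=3, threshold=0.002):
    """True if the eval metric improved less than `threshold` over the last `window` evals."""
    if len(history) < window + 1:
        return False
    return (history[-1] - history[-1 - window]) < threshold

def relift_style_schedule(num_rounds, run_sft_epoch, run_rl_epoch, evaluate):
    """Alternate online SFT and RL, falling back to SFT whenever RL stalls."""
    phase, history = "sft", []
    for _ in range(num_rounds):
        (run_sft_epoch if phase == "sft" else run_rl_epoch)()
        history.append(evaluate())
        if phase == "rl" and plateaued(history):
            phase = "sft"   # inject new knowledge when RL plateaus
        elif phase == "sft":
            phase = "rl"    # hand back to RL for optimization within the distribution
```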

Ranks 6-10: Latest Techniques Pursuing Efficiency and Robustness

6th Place: MGRPO - Two-Stage Loop of Self-Correction (2025-06-05)

Multi-Layer GRPO (MGRPO) is an innovative approach that introduces a two-stage self-correction loop to overcome limitations of single-stage GRPO.

Two-Stage Self-Correction Mechanism

MGRPO consists of two layers: a Stage 1 GRPO pass that performs basic policy optimization, and a Stage 2 GRPO pass that takes the Stage 1 output as input and performs self-correction.

Mathematically, this amounts to choosing the Stage 2 policy θ₂ that maximizes the expected reward of the self-corrected trajectories sampled from the Stage 1 policy θ₁, where SelfCorrect is the correction function that improves the Stage 1 output.
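Written out, that objective reads as follows (this is my rendering of the sentence above, not necessarily the paper’s exact notation):

```latex
\theta_2^{*} = \arg\max_{\theta_2}\;
\mathbb{E}_{\tau \sim \pi_{\theta_1}}\!\Big[ R\big(\mathrm{SelfCorrect}_{\theta_2}(\tau)\big) \Big]
```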

Breakthrough in Mathematical Reasoning

MGRPO shows excellent performance particularly in mathematical reasoning tasks. It learns the ability to discover and correct logical errors through step-by-step self-correction without process-level supervision.

This is particularly useful in scenarios such as complex multi-step reasoning for automatic detection of intermediate step errors, mathematical proofs for correcting logical leaps or calculation mistakes, and code generation for automatic correction of syntax or logic errors.

7th Place: 1-shot RLVR - Possibility of Minimal Data Learning (2025-04-29)

1-shot RLVR: Reinforcement Learning for Reasoning with One Training Example demonstrated that effective reinforcement learning is possible even in data-scarce environments.

Power of Verifiable Rewards

The core of 1-shot RLVR is the verifiable reward. For tasks whose answers can be checked unambiguously (math problems, coding problems, and so on), the paper showed that reinforcement learning can improve reasoning ability with just one training example.

The verifiable reward is defined as 1 if verification passes and 0 otherwise. This approach is particularly useful for startups that need maximum effect from a limited labeling budget, domain-specific fields where expert feedback is expensive or hard to obtain, and early-stage prototyping that requires rapid concept validation.
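A toy example of such a 0/1 verifiable reward for numeric answers; the answer-extraction regex and exact-match rule are illustrative, since real verifiers are usually task-specific graders or unit tests.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the output matches the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

# verifiable_reward("... so the answer is 42.", "42") -> 1.0
```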

8th Place: UFT - Unified Optimization of SFT and RFT (2025-05-22)

Unifying Supervised and Reinforcement Fine-Tuning (UFT) formalized supervised learning and reinforcement learning as a single unified optimization process rather than separate stages.

Mathematical Foundation of Unified Optimization

UFT unifies SFT and RFT into a single objective function that combines a supervised learning loss, a reinforcement learning loss, and a regularization term, weighted by hyperparameters α and β.
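One natural way to write such a combined objective is sketched below; the exact weighting, the form of the regularizer Ω, and which terms α and β multiply may differ in the paper.

```latex
\mathcal{L}_{\text{UFT}}(\theta)
= \mathcal{L}_{\text{SFT}}(\theta)
+ \alpha\,\mathcal{L}_{\text{RL}}(\theta)
+ \beta\,\Omega(\theta)
```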

UFT’s theoretical contribution is breaking the exponential sample-complexity bottleneck in long-horizon reasoning tasks. By optimizing a single unified objective, it avoids the inefficiencies of the traditional sequential SFT-then-RFT pipeline.

9th Place: RLUF - Utilizing Actual User Feedback (2025-05-20)

Reinforcement Learning from User Feedback (RLUF) is a reinforcement learning framework that utilizes implicit user signals collected in actual service environments.

Systematization of Implicit Feedback

RLUF converts the following actual user behaviors into reward signals: likes and dislikes as direct preference signals, dwell time as indirect indicators of content quality, reuse rates measuring long-term satisfaction, and sharing behavior as signals of social approval.

Verification in Production Environment

RLUF achieved a concrete result: a 28% increase in positive feedback in live online A/B tests. This was implemented through the following production loop: log collection (gathering user-behavior data in real time), reward modeling (converting implicit signals into reward scores), policy updates (improving the policy via multi-objective optimization), and A/B testing (monitoring performance changes in real time).
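A minimal sketch of the “implicit signals → reward” step in such a loop; the signal names, weights, and clipping are placeholders for illustration, not the paper’s reward model.

```python
from dataclasses import dataclass

@dataclass
class InteractionLog:
    liked: bool
    disliked: bool
    dwell_seconds: float
    reused_within_7d: bool
    shared: bool

def implicit_reward(log: InteractionLog) -> float:
    """Collapse implicit user signals into one scalar reward in [-1, 1]."""
    r = 0.0
    r += 0.5 if log.liked else 0.0
    r -= 0.7 if log.disliked else 0.0
    r += min(log.dwell_seconds / 120.0, 1.0) * 0.2   # cap dwell-time credit at 2 minutes
    r += 0.2 if log.reused_within_7d else 0.0
    r += 0.1 if log.shared else 0.0
    return max(-1.0, min(1.0, r))
```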

10th Place: Robust RLHF - Robustness in Noisy Environments (2025-04-03)

Robust RLHF for LLM Fine-Tuning proposes robust training methods for the noisy and biased feedback that arises in real working environments.

Overcoming Bradley-Terry Model Limitations

The Bradley-Terry (BT) model underlying standard RLHF assumes that the probability of one response being preferred over another is proportional to its exponentiated reward, i.e., a logistic function of the reward difference. In real environments, this assumption is frequently violated. Robust RLHF proposes robust estimators that account for such model misspecification.

The robust reward estimator minimizes the sum, over preference pairs, of a robust function ρ applied to each pair’s loss, where ρ is a robustness function such as the Huber loss.
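Putting the two pieces together in symbols (my notation; ℓ_i denotes the negative log-likelihood of the i-th preference pair, and the paper’s estimator may include additional terms):

```latex
P_{\text{BT}}(y_1 \succ y_2 \mid x)
= \frac{\exp\!\big(r_\phi(x, y_1)\big)}{\exp\!\big(r_\phi(x, y_1)\big) + \exp\!\big(r_\phi(x, y_2)\big)},
\qquad
\hat{\phi} = \arg\min_{\phi} \sum_{i} \rho\big(\ell_i(\phi)\big)
```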

Ensuring Stability in Work Environments

Robust RLHF is particularly useful in practical situations such as crowdsourced labeling where the quality of large-scale feedback is inconsistent, multicultural settings with culturally biased preference data, and settings where user preferences drift over time.

Insight Map for Practical Application

Training Stability Assurance Strategy

Key techniques for operating RL post-training stably in practice include GSPO (sequence-level clipping) for preventing training divergence in large MoE models, MGRPO (two-stage self-correction) for verifying and improving intermediate output quality, Robust RLHF (robust estimation) for stability in noisy feedback environments, and GTPO/GRPO-S (precise credit assignment) for better learning efficiency in long-form generation.

Cost Optimization Methods

Strategies for getting maximum effect from limited resources include RLSF (self-feedback) to eliminate external labeling costs, 1-shot RLVR (minimal data) for efficient learning on verifiable tasks, ReLIFT (interleaved fine-tuning) to collect targeted data only in bottleneck sections, and RLUF (user behavior signals) to convert existing service logs into reward signals.

SFT-RL Integrated Pipeline Design

Guidelines for building unified or alternating pipelines include ReLIFT’s alternation of RL and SFT to exploit each method’s strengths, UFT’s integration of both into a single optimization process to reduce complexity, automatic transition points based on performance-plateau detection, and multi-objective balancing of quality, safety, and efficiency.

Agentic Performance Evaluation System

KPI construction referencing GLM-4.5’s benchmark design includes tool-use efficiency (task completion rate versus API call cost), reasoning quality (consistency and accuracy of multi-step logic), user satisfaction (feedback quality in the live service), and safety indicators (frequency of harmful or incorrect behavior).

Conclusion: Future Prospects of 2025 RL Post-Training

Analyzing the RL post-training research trends of 2025 so far reveals several clear patterns and future directions.

Major Trend Summary

Prioritization of stability is evident as multiple studies including GSPO, MGRPO, and Robust RLHF focus on training stability. This indicates that RL post-training has moved beyond research stages to actual production environments where reliability has become a core requirement.

Pursuit of cost efficiency is shown through RLSF, 1-shot RLVR, and RLUF, all methods that dramatically reduce existing labeling costs. As commercialization of large language models accelerates, economic sustainability has become an important consideration.

Emphasis on agentic capabilities is demonstrated by latest research including GLM-4.5 focusing on implementing agentic AI capable of complex reasoning and tool usage beyond simple text generation.

Action Guidelines for Practitioners

Short-term application (3-6 months) should include upgrading existing GRPO pipelines to GSPO, starting RLSF pilot projects based on self-feedback, and building RLUF frameworks utilizing user behavior logs.

Medium-term planning (6-12 months) should involve designing ReLIFT-style RL-SFT integrated pipelines, developing internal benchmarks for agentic performance evaluation, and implementing multi-objective optimization frameworks.

Long-term vision (1-2 years) should include building complete integrated learning pipelines based on UFT, developing domain-specific reinforcement learning frameworks, and completing real-time user feedback-based online learning systems.

Technical Challenges and Opportunities

Major technical challenges to be solved in the future include scaling laws for optimizing RL efficiency in larger models, multimodal expansion for applying RL beyond text modalities, real-time adaptation for online learning that immediately responds to user feedback, and safety assurance for preventing harmfulness during reinforcement learning processes.

In the second half of 2025 and 2026, these research achievements are expected to be fully applied to actual commercial services, establishing RL post-training as a standard process in AI development. Particularly in agentic AI and personalized service fields, it’s already recognized as essential technology.

Current developers and researchers should selectively apply the 10 key techniques introduced in this article according to their environments and requirements while exploring new possibilities in RL post-training. Above all, simultaneously securing stability and cost efficiency will be key to successful implementation.


References

  1. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. arXiv:2508.06471
  2. Group Sequence Policy Optimization. arXiv:2507.18071
  3. GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy. arXiv:2508.04349
  4. Reinforcement Learning from Self-Feedback. arXiv:2505.23927
  5. Learning What Reinforcement Learning Can’t: Interleaved Online Fine-Tuning. arXiv:2506.07527
  6. Multi-Layer GRPO: Enhancing Reasoning and Self-Correction. arXiv:2506.04746
  7. Reinforcement Learning for Reasoning with One Training Example. arXiv:2504.20571
  8. UFT: Unifying Supervised and Reinforcement Fine-Tuning. arXiv:2505.16984
  9. Reinforcement Learning from User Feedback. arXiv:2505.14946
  10. Robust Reinforcement Learning from Human Feedback. arXiv:2504.03784