NVIDIA OpenMathReasoning: Large-Scale Mathematical Reasoning Dataset Behind AIMO-2 Winning Model
Overview
OpenMathReasoning is a large-scale mathematical reasoning dataset developed by NVIDIA that served as the foundation for the winning model in the AIMO-2 Kaggle competition. This dataset consists of 306K unique mathematical problems and a total of 5.68 million solutions, released under the CC BY 4.0 license.
High-quality mathematical problems collected from the AoPS (Art of Problem Solving) forum were processed with the DeepSeek-R1 and QwQ-32B models to generate solutions using three reasoning methodologies: Chain-of-Thought (CoT), Tool-Integrated Reasoning (TIR), and Generation Selection (GenSelect).
Dataset Composition and Scale
Core Statistics
| Component | Scale | Description |
|---|---|---|
| Unique Math Problems | 306K | Unique problems collected from the AoPS forum |
| CoT Solutions | 3.2M | Long Chain-of-Thought solutions |
| TIR Solutions | 1.7M | Tool-Integrated Reasoning solutions |
| GenSelect Samples | 566K | Samples selecting the best solution from multiple candidates |
| Additional Problems | 193K | Additional problems without solutions |
| Total Data Points | 5,678,317 | Complete dataset size |
Data Sources
The primary source is the AoPS (Art of Problem Solving) forum, which includes categories such as high school olympiads and mathematical competitions. Additional sources include portions of the MATH training dataset, with problems refined using Qwen2.5-32B-Instruct for quality enhancement.
Dataset Field Structure
Key Field Descriptions
Each record in the dataset contains:

- the problem statement, refined from AoPS forum posts by Qwen2.5-32B-Instruct
- a synthetic solution generated by DeepSeek-R1 or QwQ-32B
- the identifier of the generation model
- the problem type classification
- the expected answer, obtained by extraction or majority voting
- the source forum name
- the inference mode (CoT, TIR, or GenSelect)
- the pass rate measured with Qwen2.5-Math-72B-Instruct in TIR mode
- an indicator of whether the record was used to train the AIMO-2 Kaggle winning model
Problem Type Classification
Problems are categorized into three main types: has_answer_extracted for problems with clearly extractable answers, no_answer_extracted for problems where answer extraction is difficult, and converted_proof for proof problems converted to answer-based questions.
Reasoning Methodologies
1. Chain-of-Thought (CoT)
Chain-of-Thought reasoning demonstrates step-by-step logical thinking processes. For example, when solving a function evaluation problem like finding f(3) for f(x) = x² + 2x + 1, the CoT approach would systematically substitute the value, calculate each term separately, and combine the results to reach the final answer of 16.
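The worked example above can be checked directly; a minimal sketch mirroring the CoT steps (substitute, evaluate each term, combine):

```python
# CoT example from the text: evaluate f(x) = x**2 + 2*x + 1 at x = 3.
def f(x):
    return x**2 + 2*x + 1

# Step-by-step terms, as a CoT trace would list them: 9 + 6 + 1
terms = [3**2, 2 * 3, 1]
assert sum(terms) == f(3) == 16
```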
2. Tool-Integrated Reasoning (TIR)
Tool-Integrated Reasoning incorporates external tools or calculators for enhanced reasoning. When solving complex integrals, this approach would apply integration rules to each term, verify results using computational tools, and present the final organized answer with proper mathematical notation.
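As a minimal stand-in for TIR-style tool use, the sketch below verifies a hand-derived antiderivative with a numerical check (standard library only; the actual TIR traces in the dataset execute Python code interleaved with reasoning):

```python
# Hand-derived result to verify: the integral of x**2 over [0, 1] equals 1/3.
def midpoint_integral(f, a, b, n=10_000):
    """Midpoint-rule approximation of the definite integral of f on [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

approx = midpoint_integral(lambda x: x * x, 0.0, 1.0)
# The "tool" confirms the symbolic derivation numerically.
assert abs(approx - 1 / 3) < 1e-6
```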
3. Generation Selection (GenSelect)
Generation Selection methodology involves creating multiple candidate solutions and selecting the optimal one based on criteria such as accuracy, efficiency, and clarity. This approach generates various solution strategies and chooses the most intuitive and accurate method.
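A simple majority-vote baseline illustrates the "select among candidates" idea; GenSelect itself uses a model to judge the candidates, but the selection step can be sketched as:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among candidate generations."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers extracted from several generations of one problem.
candidates = ["16", "16", "15", "16", "14"]
assert majority_vote(candidates) == "16"
```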
Data Generation Pipeline
Stage 1: Problem Collection and Preprocessing
The pipeline begins with loading AoPS forum data and refining problems using Qwen2.5-32B-Instruct. Only validated problems that meet quality standards are included in the refined dataset.
Stage 2: Solution Generation
Solutions are generated using both DeepSeek-R1 and QwQ-32B models. The process creates multiple CoT solutions (typically 32 per problem) and TIR solutions (typically 16 per problem) for each refined problem.
Stage 3: Quality Filtering
The final stage involves comprehensive quality validation including format constraint verification, benchmark contamination removal through 9-gram duplicate checking, and answer verification to ensure mathematical accuracy and logical consistency.
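The 9-gram contamination check can be sketched as follows (word-level n-grams; the exact tokenization used in the real pipeline is not specified here):

```python
def ngrams(text, n=9):
    """Set of word-level n-grams for contamination checking."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(problem, benchmark_problems, n=9):
    """Flag a problem that shares any n-gram with a benchmark problem."""
    p = ngrams(problem, n)
    return any(p & ngrams(b, n) for b in benchmark_problems)

bench = ["find the sum of all positive integers n such that n divides 2024"]
assert is_contaminated(
    "find the sum of all positive integers n such that n divides 2024 exactly",
    bench,
)
assert not is_contaminated("compute 2 + 2", bench)
```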
OpenMath-Nemotron Model Series
Model Lineup
The OpenMath-Nemotron series includes five main models: the 1.5B model for lightweight mathematical reasoning, the 7B model offering balanced performance and efficiency, the 14B model for high-performance mathematical reasoning, the 14B-Kaggle model specifically for AIMO-2 competition winning, and the 32B model representing the highest performance tier.
Benchmark Performance
Performance evaluation across major mathematical benchmarks shows impressive results. The OpenMath-Nemotron-7B achieves 74.8 on AIME24 and 61.2 on AIME25 with CoT methodology. The 14B model demonstrates 76.3 on AIME24 and 63.0 on AIME25, while the 32B model with TIR reaches 78.4 on AIME24 and 64.2 on AIME25.
GenSelect Effectiveness
The GenSelect methodology shows significant performance improvements across all models. The 7B model improves from 74.8 to 86.7 on AIME24 (an 11.9 percentage point increase), the 14B model advances from 52.1 to 72.4 on HMMT (a 20.3 percentage point improvement), and the 32B model progresses from 78.4 to 93.3 on AIME24 (a 14.9 percentage point enhancement).
Usage Methods and Implementation
Dataset Loading
The dataset can be accessed through the Hugging Face Datasets library by loading the complete OpenMathReasoning dataset from NVIDIA. Users can filter data by specific inference modes such as CoT, TIR, or GenSelect to focus on particular reasoning approaches.
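A sketch of the filtering step is below. With the real dataset one would call `load_dataset("nvidia/OpenMathReasoning", ...)` from the Hugging Face `datasets` library (streaming mode avoids downloading the full ~49.5 GB at once); here mock records with an assumed `inference_mode` field keep the snippet runnable offline:

```python
# Mock records with the assumed schema; the real data would come from
# datasets.load_dataset("nvidia/OpenMathReasoning", ...).
records = [
    {"problem": "Compute 2 + 2.", "inference_mode": "cot"},
    {"problem": "Evaluate the integral of x^2 on [0, 1].", "inference_mode": "tir"},
    {"problem": "Choose the best of four solutions.", "inference_mode": "genselect"},
]

def filter_by_mode(rows, mode):
    """Keep only rows generated with the given inference mode."""
    return [r for r in rows if r["inference_mode"] == mode]

assert len(filter_by_mode(records, "cot")) == 1
```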
Problem Type Analysis
Analysis of problem types reveals the distribution across different categories, helping users understand the dataset composition and select appropriate subsets for their specific applications.
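Such a distribution analysis is a one-liner with `collections.Counter`; the sketch below uses mock rows carrying the `problem_type` labels described earlier (the field name is an assumption):

```python
from collections import Counter

# Mock rows with the three problem_type labels from the classification above.
rows = [
    {"problem_type": "has_answer_extracted"},
    {"problem_type": "has_answer_extracted"},
    {"problem_type": "no_answer_extracted"},
    {"problem_type": "converted_proof"},
]

distribution = Counter(r["problem_type"] for r in rows)
total = sum(distribution.values())
shares = {k: 100 * v / total for k, v in distribution.items()}
assert distribution["has_answer_extracted"] == 2
assert shares["converted_proof"] == 25.0
```

The same pattern applies to the generation-model analysis in the next section: count the model identifier field instead of the problem type.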
Generation Model Analysis
The dataset includes contributions from multiple generation models, with detailed statistics showing the percentage contribution of each model to the overall dataset, enabling users to understand the source diversity of the solutions.
AIMO-2 Kaggle Success Case
Competition Overview
The AIMO-2 (AI Mathematical Olympiad) competition hosted by Kaggle focused on solving mathematical olympiad-level problems. NVIDIA’s team achieved victory using this dataset as the foundation for their winning solution.
Winning Strategy
The success strategy involved four key elements: utilizing high-quality AoPS forum data, combining CoT, TIR, and GenSelect reasoning approaches, employing model ensembles with various sizes, and implementing continuous pipeline optimization.
Key Success Factors
The winning model training configuration utilized 2.2 million CoT solutions and 15,000 TIR solutions based on the OpenMath-Nemotron-14B model with supervised fine-tuning on OpenMathReasoning data. Performance metrics achieved 73.7 on AIME 2024, 57.9 on AIME 2025, 50.5 on HMMT 24-25, and 5.7 on HLE Math.
License and Usage Conditions
CC BY 4.0 License
The OpenMathReasoning dataset is provided under the Creative Commons Attribution 4.0 International License, which permits commercial use, modification, distribution of original and modified versions, and private use. The license requires attribution to NVIDIA Corporation, license notice inclusion, and recommended indication of changes when modifications are made.
Recommended Use Cases
The dataset is suitable for educational purposes in training mathematical reasoning models, research applications in mathematical AI development, commercial utilization in mathematical education tool development, and evaluation purposes for model performance benchmarking.
Technical Details
Data Storage Format
The dataset is stored in Parquet format with a size of 49.5GB, utilizing efficient columnar storage and accessible through the Hugging Face Datasets API.
Quality Control Process
Quality management involves three main stages:

1. Format constraint filtering: removing yes/no questions, multiple-choice problems, and problems with inappropriate formats.
2. Benchmark deduplication: 9-gram overlap checking against existing evaluation data to prevent contamination.
3. Solution validation: checking the validity of LLM-generated solutions, verifying mathematical accuracy, and reviewing logical consistency.
Pipeline Issues and Solutions
An initial discrepancy between the reported and released problem counts (540K initially reported versus 306K actually released) was resolved through transparent disclosure of the data processing steps. In addition, a pipeline bug was found to have dropped 137K proof problems; recovery attempts degraded performance, so improvement work is ongoing.
Applications and Use Cases
Educational Applications
The dataset enables development of personalized mathematical tutoring systems that can generate step-by-step solutions and provide learner-appropriate recommendations. Mathematical problem generators can create difficulty-graded problems with detailed solutions and customized suggestions based on learner levels.
Research Applications
Research applications include reasoning capability analysis for understanding mathematical reasoning mechanisms, multi-step reasoning process studies, and logical thinking pattern identification. The dataset also supports development of intelligent learning systems, automated grading and feedback systems, and learning progress tracking tools.
Performance Comparison and Analysis
Baseline Model Comparison
Comparison with baseline models shows significant improvements. OpenMath-Nemotron-7B achieves 74.8 on AIME24 and 61.2 on AIME25, representing improvements of 20.4 and 22.6 points respectively over DeepSeek-R1-Distill-Qwen-7B. Similarly, OpenMath-Nemotron-14B demonstrates 76.3 on AIME24 and 63.0 on AIME25, showing improvements of 10.5 and 14.6 points over the corresponding baseline.
Reasoning Method Performance Analysis
Comparison between CoT and TIR approaches reveals that CoT excels in clear logical thinking processes while TIR demonstrates superiority in complex calculations. The GenSelect methodology significantly enhances both approaches across all evaluation metrics.
Future Development Directions
Dataset Expansion Plans
Future expansion includes integration of additional mathematical forums, incorporation of mathematical problems in various languages, and implementation of real-time problem updates to maintain dataset freshness and relevance.
Technical Improvements
Technical enhancements focus on applying more powerful generation models, experimenting with diverse reasoning methodologies, supporting multimodal mathematical problems, implementing more accurate performance measurement systems, providing real-time benchmark updates, and conducting comparisons with human evaluators.
Conclusion
NVIDIA OpenMathReasoning represents a new standard in mathematical reasoning datasets with 5.68 million high-quality solutions and diverse reasoning methodologies. The dataset enabled OpenMath-Nemotron series models to achieve exceptional performance, particularly demonstrated through the AIMO-2 Kaggle competition victory.
The CC BY 4.0 license allows free utilization across educational, research, and commercial applications. The innovative reasoning methodologies of CoT, TIR, and GenSelect, combined with systematic data generation pipelines, establish important benchmarks for future mathematical AI development. This dataset is expected to contribute significantly to mathematical reasoning AI advancement by enabling more researchers and developers to participate in the field.
Citation Information
@article{moshkov2025aimo2,
title = {AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset},
author = {Ivan Moshkov and Darragh Hanley and Ivan Sorokin and Shubham Toshniwal and Christof Henkel and Benedikt Schifferer and Wei Du and Igor Gitman},
year = {2025},
journal = {arXiv preprint arXiv:2504.16891}
}