NVIDIA AceReason-1.1-SFT: A Comprehensive Guide to the Specialized SFT Dataset for Math and Code Reasoning
Overview
AceReason-1.1-SFT, released by NVIDIA on June 16, 2025, is a large-scale supervised fine-tuning dataset specialized for mathematical and coding reasoning. This dataset served as the SFT training data for the AceReason-Nemotron-1.1-7B model, with all responses generated by the DeepSeek-R1 model.
The dataset comprises 3,970,332 samples in total: 2,668,741 mathematical reasoning samples and 1,301,591 coding reasoning samples.
Dataset Detailed Information
Basic Information
The AceReason-1.1-SFT dataset was developed by NVIDIA and released on June 16, 2025, under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The dataset is in English, stored in Arrow and Parquet formats, and contains approximately 4 million samples (placing it in Hugging Face's 1M-10M size category).
Technical Specifications
The dataset is documented in the arXiv paper 2506.13284 and is available on Hugging Face at nvidia/AceReason-1.1-SFT. The Hugging Face viewer reports approximately 2.19 GB for the auto-converted Parquet preview (first 5 GB of data) and estimates 3,958,018 rows in total; this viewer estimate differs slightly from the official sample count of 3,970,332.
Data Sources and Composition
Source-wise Statistics
The dataset draws from eight major data sources with varying contributions:

- OpenMathReasoning: 270,534 problems, 2,147,570 samples (54.1% of total)
- NuminaMath-CoT: 78,880 problems, 521,171 samples (13.1%)
- OpenCodeReasoning: 35,374 problems, 763,495 samples (19.2%)
- opc-sft-stage2: 79,938 problems, 323,163 samples (8.1%)
- leetcode: 5,571 problems, 126,878 samples (3.2%)
- TACO: 16,726 problems, 56,694 samples (1.4%)
- MagicoderEvolInstruct: 27,625 samples (0.7%)
- apps: 159 problems, 3,736 samples (0.1%)
Category Distribution
The dataset is divided into two main categories. Mathematical reasoning comprises 2,668,741 samples (67.2%), primarily from OpenMathReasoning with 2,147,570 samples and NuminaMath-CoT with 521,171 samples. Coding reasoning accounts for 1,301,591 samples (32.8%), including OpenCodeReasoning with 763,495 samples, opc-sft-stage2 with 323,163 samples, leetcode with 126,878 samples, TACO with 56,694 samples, MagicoderEvolInstruct with 27,625 samples, and apps with 3,736 samples.
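These figures can be sanity-checked by streaming the dataset and tallying the category and source fields described later in this guide. The following is a minimal sketch under that assumption about field names; note that a full pass over roughly 4 million rows takes a while.

```python
from collections import Counter
from datasets import load_dataset

# Stream the training split so the full ~4M-row dataset is not downloaded up front.
stream = load_dataset("nvidia/AceReason-1.1-SFT", split="train", streaming=True)

category_counts = Counter()
source_counts = Counter()
for sample in stream:  # full pass over the stream; this takes a while
    category_counts[sample["category"]] += 1
    source_counts[sample["source"]] += 1

print(category_counts)  # expected: the math vs. code split reported above
print(source_counts)    # expected: the per-source sample counts reported above
```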
Data Quality and Preprocessing
Data Refinement Process
The dataset underwent quality assurance in three main stages. First, all responses were generated by the DeepSeek-R1 model, ensuring consistent response quality. Second, benchmark decontamination removed any sample with a 9-gram overlap against test samples from mathematical and coding benchmarks. Third, quality validation retained only samples containing high-quality reasoning processes and accurate answers.
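For illustration, here is a minimal sketch of how such an n-gram overlap check can be implemented. The function names, word-level tokenization, and placeholder benchmark questions are assumptions made for this example; they are not NVIDIA's actual decontamination pipeline.

```python
def ngrams(text: str, n: int = 9) -> set:
    """Return the set of lowercase word-level n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample_text: str, benchmark_ngrams: set, n: int = 9) -> bool:
    """True if the sample shares at least one n-gram with the benchmark test set."""
    return not ngrams(sample_text, n).isdisjoint(benchmark_ngrams)

# Hypothetical benchmark test questions; in practice these would be loaded
# from the test splits of the math and coding benchmarks.
benchmark_questions = [
    "if x plus y equals 10 and x minus y equals 4 what is the value of x times y",
]
benchmark_ngrams = set()
for question in benchmark_questions:
    benchmark_ngrams |= ngrams(question)

# Hypothetical SFT samples following the input/output structure described below.
sft_samples = [
    {"input": "If x plus y equals 10 and x minus y equals 4 what is the value of x times y?", "output": "..."},
    {"input": "Prove that the sum of two even integers is even.", "output": "..."},
]
clean_samples = [s for s in sft_samples if not is_contaminated(s["input"], benchmark_ngrams)]
print(len(clean_samples))  # 1: the first sample overlaps the benchmark question and is removed
```

A production pipeline would typically also normalize punctuation and whitespace before computing n-grams so that trivial formatting differences do not mask an overlap.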
Data Structure
Each sample in the dataset follows a structured format with four fields: category ("math" or "code"), source (the originating dataset), input (the problem or question), and output (the detailed reasoning process and answer).
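A quick way to confirm this schema is to stream a single record and inspect it. This sketch assumes the field names listed above; consult the dataset card on Hugging Face for the authoritative schema.

```python
from datasets import load_dataset

# Stream one record instead of downloading the full dataset.
stream = load_dataset("nvidia/AceReason-1.1-SFT", split="train", streaming=True)
sample = next(iter(stream))

print(sample.keys())                # field names, e.g. category, source, input, output
print(sample["category"], "|", sample["source"])
print(str(sample["input"])[:200])   # problem statement (truncated)
print(str(sample["output"])[:200])  # reasoning process and answer (truncated)
```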
License and Usage Conditions
CC BY 4.0 License
The AceReason-1.1-SFT dataset is provided under the Creative Commons Attribution 4.0 International License, which permits commercial use, modification and transformation of the dataset, distribution of both original and modified versions, and private use. The license requires attribution to the original author (NVIDIA), inclusion of a license notice and a link to the license, and an indication of whether changes were made.
Utilization Methods
SFT Model Training
The dataset can be loaded using the Hugging Face Datasets library for supervised fine-tuning applications. Users can filter the dataset to focus on mathematical reasoning data by selecting samples with category “math” or coding reasoning data by filtering for category “code”.
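Below is a minimal loading-and-filtering sketch. It assumes the category, input, and output fields described earlier; the prompt/completion column names are an arbitrary choice for a generic SFT trainer, not a required format.

```python
from datasets import load_dataset

# Download the training split (~4M rows; see the streaming option below for lighter access).
ds = load_dataset("nvidia/AceReason-1.1-SFT", split="train")

# Keep only the mathematical reasoning subset; use "code" for the coding subset.
math_ds = ds.filter(lambda sample: sample["category"] == "math")

# Map each sample to a prompt/completion pair, the shape many SFT trainers expect.
def to_pair(sample):
    return {"prompt": sample["input"], "completion": sample["output"]}

sft_ds = math_ds.map(to_pair, remove_columns=math_ds.column_names)
print(sft_ds)
```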
Recommended Use Cases
The dataset is particularly suitable for mathematical reasoning model development: enhancing mathematical problem-solving capabilities, learning step-by-step reasoning processes, and improving understanding of mathematical concepts. For coding reasoning model development, it supports algorithmic problem solving, code generation and debugging, and programming logic. Because it combines mathematics and coding, the dataset also supports broader STEM reasoning model development, holistic evaluation of problem-solving capability, and educational AI system development.
Technical Details
Data Access Methods
The dataset can be accessed through the Hugging Face Datasets library by loading the complete dataset, using streaming mode for memory efficiency, or sampling specific portions such as the first 10% of the training split.
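The three access patterns mentioned above look roughly as follows with the Hugging Face Datasets library (a sketch assuming the split is named train):

```python
from datasets import load_dataset

# 1. Full download of the training split (large; ~4M rows).
ds_full = load_dataset("nvidia/AceReason-1.1-SFT", split="train")

# 2. Streaming mode: iterate lazily without downloading everything to disk.
ds_stream = load_dataset("nvidia/AceReason-1.1-SFT", split="train", streaming=True)
first_sample = next(iter(ds_stream))

# 3. Split slicing: materialize only the first 10% of the training split.
ds_10pct = load_dataset("nvidia/AceReason-1.1-SFT", split="train[:10%]")
```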
Storage Formats
The dataset is available in multiple formats including Arrow for memory-efficient columnar data format, Parquet for optimized compression efficiency and query performance, and JSON for high-compatibility text format.
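For local workflows, a loaded subset can be re-exported into any of these formats. This sketch assumes a (preferably small) Dataset object named ds, such as the 10% slice from the previous example; the output file names are arbitrary.

```python
# ds is a datasets.Dataset, e.g. the 10% slice loaded above.
ds.save_to_disk("acereason_subset_arrow")  # Arrow: memory-mapped columnar files
ds.to_parquet("acereason_subset.parquet")  # Parquet: compressed columnar file
ds.to_json("acereason_subset.jsonl")       # JSON Lines: plain-text, high compatibility
```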
Benchmarks and Performance
AceReason-Nemotron-1.1-7B Achievements
The model trained on this dataset, AceReason-Nemotron-1.1-7B, demonstrates strong performance across reasoning benchmarks: competition-level mathematics evaluations such as AIME 2024 and AIME 2025 for mathematical reasoning, and LiveCodeBench for coding reasoning. The model also shows strong comprehensive reasoning capabilities across diverse STEM-related evaluations.
Research Team and Contact Information
Key Researchers
The research team includes Zihan Liu (zihanl@nvidia.com), Zhuolin Yang (zhuoliny@nvidia.com), Yang Chen (yachen@nvidia.com), Chankyu Lee (chankyul@nvidia.com), and Wei Ping (wping@nvidia.com).
Citation Information
@article{liu2025acereason,
title={AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy},
author={Liu, Zihan and Yang, Zhuolin and Chen, Yang and Lee, Chankyu and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
journal={arXiv preprint arXiv:2506.13284},
year={2025}
}
Ethical Considerations
NVIDIA has established policies and practices for trustworthy AI development. Developers are encouraged to work with their internal model teams to ensure that the dataset and the models trained on it meet the requirements of the relevant industry and use case and address unforeseen product misuse, and to report security vulnerabilities or NVIDIA AI concerns when discovered.
Conclusion
NVIDIA AceReason-1.1-SFT is a high-quality, large-scale SFT dataset for mathematical and coding reasoning. Released under the CC BY 4.0 license, which permits commercial use, and built from responses generated by DeepSeek-R1, it is a valuable resource for developing AI models with strong reasoning capabilities.
The dataset's scale of nearly 4 million samples and its diversity across eight major data sources establish it as a new standard for mathematical and coding reasoning model development.