NVIDIA Nemotron Post-Training Dataset v1: A Comprehensive Analysis of a Large-Scale Synthetic Dataset for LLM Enhancement
Introduction
The Nemotron Post-Training Dataset v1, released by NVIDIA on July 31, 2025, is a large-scale synthetic dataset designed to improve LLM performance. With a total of 25.66 million high-quality samples, it provides training data that can significantly enhance capabilities in mathematics, coding, STEM, general reasoning, and tool calling.
This dataset was used to train the Llama-3.3-Nemotron-Super-49B-v1.5 model. Releasing the full training data for complete transparency and reproducibility represents a meaningful move in the industry.
Dataset Overview and Key Characteristics
Basic Information
| Item | Details |
|---|---|
| Dataset Name | NVIDIA Nemotron Post-Training Dataset v1 |
| Release Date | July 31, 2025 |
| Total Samples | 25,659,642 |
| File Size | 203 GB (Parquet format) |
| License | CC BY 4.0 (commercial and non-commercial use permitted) |
| Platform | Hugging Face Datasets |
Data Generation Approach
The defining characteristic of this dataset is that every response is synthetically generated, while prompts come from a mix of sources:
- Prompts: extracted from public corpora or synthetically generated
- Responses: synthetically generated by high-performance AI models
- Quality filtering: samples with low consistency, trivial answers, or grammatical errors are removed
Per-Category Detailed Analysis
Overall Data Distribution
| Category | Sample Count | Share | Primary Use |
|---|---|---|---|
| stem | 20,662,167 | 80.5% | Science, engineering, and math reasoning |
| code | 1,896,395 | 7.4% | Programming skill improvement |
| math | 2,044,407 | 8.0% | Mathematical reasoning and computation |
| chat | 746,622 | 2.9% | Conversational interaction |
| tool_calling | 310,051 | 1.2% | Tool calling and API usage |
| Total | 25,659,642 | 100% | - |
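The shares in the table follow directly from the sample counts. A minimal sketch that recomputes them, using only the counts listed above (the names `CATEGORY_COUNTS` and `category_shares` are illustrative, not part of any dataset tooling):

```python
# Sample counts copied from the distribution table above.
CATEGORY_COUNTS = {
    "stem": 20_662_167,
    "code": 1_896_395,
    "math": 2_044_407,
    "chat": 746_622,
    "tool_calling": 310_051,
}


def category_shares(counts):
    """Return each category's share of the total in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_samples = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)

print(f"Total samples: {total_samples:,}")
for name, share in shares.items():
    print(f"{name:>12}: {share}%")
```

The same helper is handy when building a custom training mix, since it makes the imbalance toward STEM explicit before any resampling decision is made.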
1. STEM Category (20.7M Samples)
The core category, accounting for 80.5% of the total dataset.
Covered Areas
- Science: physics, chemistry, biology
- Engineering: various engineering disciplines
- Mathematics: advanced math problems
- Humanities: general reasoning problems
Recommended Prompt Template
Read the following problem carefully and provide a detailed, step-by-step answer.
{problem}
Use Cases
- Improving scientific reasoning
- Developing complex problem-solving skills
- Enhancing academic writing ability
2. Math Category (2.0M Samples)
Data focused on training step-by-step mathematical problem solving.
Characteristics
- Complex math problems included
- Step-by-step solution processes provided
- Final answers formatted explicitly as \boxed{}
Recommended Prompt Template
Solve the following math problem. Explain your reasoning and put the final answer in \boxed{}.
{problem}
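Because the template asks for the final answer inside \boxed{}, the answer can be pulled out of a response programmatically, for example to compare it with a reference. The following is a minimal sketch, assuming the last \boxed{...} in the response holds the final answer; the helper name `extract_boxed_answer` is hypothetical and not part of the dataset tooling.

```python
def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # braces never closed


response = r"Adding the two halves gives \boxed{\frac{1}{2}}."
print(extract_boxed_answer(response))
```

Brace counting is used instead of a regular expression so that nested LaTeX such as `\frac{1}{2}` is returned whole.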
Training Benefits
- Improved logical reasoning
- Better understanding of mathematical notation
- Systematic problem-solving methodology
3. Code Category (1.9M Samples)
High-quality code generation data for improving programming skills.
Data Composition
- Programming challenges
- Algorithm problems
- Code explanation and optimization
- Support for multiple programming languages
Recommended Prompt Template
Write a solution for the following programming challenge. Provide a brief explanation of your approach, followed by the complete code.
{problem}
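The three recommended templates quoted so far (STEM, math, code) can be kept in one place and filled in with a small helper. This is a sketch under assumptions: the dictionary and the `build_prompt` function are illustrative names, and the single newline between instruction and problem is a guess at the intended layout.

```python
# Template text copied from the "Recommended Prompt Template" sections above.
PROMPT_TEMPLATES = {
    "stem": (
        "Read the following problem carefully and provide a detailed, "
        "step-by-step answer.\n{problem}"
    ),
    "math": (
        "Solve the following math problem. Explain your reasoning and put "
        "the final answer in \\boxed{}.\n{problem}"
    ),
    "code": (
        "Write a solution for the following programming challenge. Provide a "
        "brief explanation of your approach, followed by the complete code.\n{problem}"
    ),
}


def build_prompt(category, problem):
    """Insert problem into the template for the given category."""
    # str.replace is used instead of str.format because the math template
    # contains a literal "{}" and problems themselves may contain braces.
    return PROMPT_TEMPLATES[category].replace("{problem}", problem)


print(build_prompt("math", "What is 2 + 2?"))
```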
External Data Sources
Some prompts are sourced from external datasets such as OpenCodeReasoning. For those entries, the original source must be downloaded separately.
4. Chat Category (747K Samples)
Data for improving conversational AI performance.
Characteristics
- Natural conversational flow
- Diverse topics and situations
- Helpful and friendly AI assistant style
System Prompt
You are a helpful and friendly AI assistant.
External Data Integration
Some data is sourced from the lmsys-chat-1m dataset. If the input field is empty, the original source must be downloaded.
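Samples whose input was left empty can be detected before training so they are either skipped or filled in from the original source. A minimal sketch, assuming the "input field" is the content of the first user message and that each message carries `role` and `content` keys; both are assumptions about the record layout, and `needs_external_source` is a hypothetical helper name.

```python
def needs_external_source(sample):
    """Return True when the first user message is missing or has no content."""
    for message in sample.get("messages", []):
        if message.get("role") == "user":
            # Treat None, "" and whitespace-only content as empty.
            return not (message.get("content") or "").strip()
    return True  # no user turn at all


complete = {"messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]}
stripped = {"messages": [
    {"role": "user", "content": ""},
    {"role": "assistant", "content": "Sure, here is a summary."},
]}

print(needs_external_source(complete), needs_external_source(stripped))
```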
5. Tool Calling Category (310K Samples)
Data for improving AI agent and tool integration capabilities.
Supported Scenarios
- Single-turn: a single tool call
- Multi-turn: tool calling across multiple conversation turns
- Multi-step: complex, multi-stage tool calls
Metadata Structure
- tools: definitions of available tools
- tool_calls: the assistant's tool call records
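One simple consistency check on this metadata is that every recorded call refers to a tool that was actually defined. The sketch below assumes an OpenAI-style layout in which each entry has a `function` object with a `name`; that layout, the example records, and the helper `undefined_tool_calls` are illustrative assumptions, not taken from the dataset card.

```python
def undefined_tool_calls(tools, tool_calls):
    """Return the names of called tools that have no matching definition."""
    defined = {tool["function"]["name"] for tool in tools}
    return [
        call["function"]["name"]
        for call in tool_calls
        if call["function"]["name"] not in defined
    ]


# Hypothetical records in the assumed layout.
example_tools = [
    {"type": "function", "function": {"name": "get_weather", "parameters": {}}},
]
example_calls = [
    {"function": {"name": "get_weather", "arguments": '{"city": "Seoul"}'}},
    {"function": {"name": "send_email", "arguments": "{}"}},
]

print(undefined_tool_calls(example_tools, example_calls))
```

Inspect a real sample's `metadata` field first and adjust the key names if they differ from this assumed structure.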
Analysis of Data Generation Models
Models Used
| Model | Samples Generated | Share | Notes |
|---|---|---|---|
| DeepSeek-R1-0528 | 24,602,969 | 95.9% | Primary generation model |
| Qwen3-235B-A22B | 1,056,673 | 4.1% | Secondary generation model |
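The per-model counts can be cross-checked against the per-category counts from the distribution table. The arithmetic below shows that the DeepSeek-R1-0528 total equals the combined stem, code, and math counts, and the Qwen3-235B-A22B total equals chat plus tool_calling. This suggests, but does not by itself confirm, that each model generated those categories; the variable names are illustrative.

```python
# Counts copied from the two tables in this article.
CATEGORY_COUNTS = {
    "stem": 20_662_167,
    "code": 1_896_395,
    "math": 2_044_407,
    "chat": 746_622,
    "tool_calling": 310_051,
}
MODEL_COUNTS = {
    "DeepSeek-R1-0528": 24_602_969,
    "Qwen3-235B-A22B": 1_056_673,
}

reasoning_total = sum(CATEGORY_COUNTS[c] for c in ("stem", "code", "math"))
dialogue_total = sum(CATEGORY_COUNTS[c] for c in ("chat", "tool_calling"))

print(reasoning_total == MODEL_COUNTS["DeepSeek-R1-0528"])
print(dialogue_total == MODEL_COUNTS["Qwen3-235B-A22B"])
print(sum(MODEL_COUNTS.values()) == sum(CATEGORY_COUNTS.values()))
```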
Generation Quality Assurance
- Diversity: two distinct models used
- Reasoning mode separation: responses generated in both reasoning-on and reasoning-off modes
- Quality filtering: consistency validation and error removal
How to Use This Dataset
Loading Data via Hugging Face Datasets
from datasets import load_dataset

# Load the full dataset (all five category splits, about 203 GB of Parquet files)
dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1")

# Load specific categories only; passing a list of splits returns a list of datasets
code_data, math_data = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split=["code", "math"]
)

# Load an individual category
stem_data = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split="stem"
)
Understanding the Data Structure
# Inspect a data sample (splits are named after the categories; there is no 'train' split)
sample = dataset['math'][0]
print("UUID:", sample['uuid'])
print("Category:", sample['category'])
print("License:", sample['license'])
print("Messages:", sample['messages'])
print("Metadata:", sample['metadata'])
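Indexing `messages` by position works for simple two-message samples, but it breaks if a system message comes first or a conversation has several turns. A minimal sketch of a role-based alternative, assuming each message is a dict with `role` and `content` keys (the `role` key is an assumption here, and `first_exchange` is a hypothetical helper name):

```python
def first_exchange(messages):
    """Return (user_text, assistant_text) for the first user turn and the reply that follows it.

    Either element is None when the corresponding turn is missing.
    """
    user_text = None
    for message in messages:
        role = message.get("role")
        if role == "user" and user_text is None:
            user_text = message.get("content", "")
        elif role == "assistant" and user_text is not None:
            return user_text, message.get("content", "")
    return user_text, None


conversation = [
    {"role": "system", "content": "You are a helpful and friendly AI assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And of Italy?"},
]

print(first_exchange(conversation))
```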
Preparing Fine-Tuning Data
def format_chat_sample(sample):
    """Format chat data"""
    messages = sample['messages']
    formatted = f"<s>[INST] {messages[0]['content']} [/INST] {messages[1]['content']}</s>"
    return {"text": formatted}

def format_math_sample(sample):
    """Format math problem data"""
    problem = sample['messages'][0]['content']
    solution = sample['messages'][1]['content']
    formatted = f"Solve the following math problem. Explain your reasoning and put the final answer in \\boxed{{}}.\n{problem}\n\n{solution}"
    return {"text": formatted}

# Apply transformations
chat_formatted = dataset['chat'].map(format_chat_sample)
math_formatted = dataset['math'].map(format_math_sample)
Quality Evaluation and Benchmarks
Data Quality Indicators
- Consistency: logical coherence between prompts and responses
- Accuracy: correctness of answers for math and science problems
- Complexity: maintaining an appropriate difficulty level
- Diversity: variety in topics and formats
Model Performance Improvements
Llama-3.3-Nemotron-Super-49B-v1.5, trained on this dataset, achieves:
- More efficient reasoning compared to the base Llama-3.3-70B
- 128K context length support
- Optimized accuracy-efficiency tradeoff
License and Usage Restrictions
License Information
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
- Commercial use: permitted
- Redistribution: permitted (with attribution required)
- Modification: permitted
Ethical Considerations
- Privacy protection: confirmed to contain no PII data
- Copyright review: legal review completed
- Bias minimization: diverse perspectives incorporated
- Safety: harmful content filtered out
Data Opt-Out
If issues are found, contact ln-dataset@nvidia.com.
Practical Application Examples
1. Math Education AI Development
from datasets import load_dataset

# Prepare data for training a math tutor AI
math_subset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split="math"
)

def create_math_tutor_format(sample):
    # Assumes messages[0] is the problem and messages[1] the solution
    problem = sample['messages'][0]['content']
    solution = sample['messages'][1]['content']
    return {
        "instruction": "Please solve the following math problem step by step.",
        "input": problem,
        "output": solution
    }

math_tutor_data = math_subset.map(create_math_tutor_format)
2. Coding Assistant Development
from datasets import load_dataset

# Prepare data for training a coding assistant AI
code_subset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split="code"
)

def create_coding_assistant_format(sample):
    # Assumes messages[0] is the problem and messages[1] the solution
    problem = sample['messages'][0]['content']
    solution = sample['messages'][1]['content']
    return {
        "instruction": "Please solve the following programming problem and explain your approach.",
        "input": problem,
        "output": solution
    }

coding_assistant_data = code_subset.map(create_coding_assistant_format)
3. Scientific Research Assistant Development
from datasets import load_dataset

# Prepare data for training a STEM research assistant AI
stem_subset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split="stem"
)

def create_research_assistant_format(sample):
    # Assumes messages[0] is the question and messages[1] the answer
    query = sample['messages'][0]['content']
    response = sample['messages'][1]['content']
    return {
        "instruction": "Please provide a detailed and accurate answer to the scientific question.",
        "input": query,
        "output": response
    }

research_assistant_data = stem_subset.map(create_research_assistant_format)
Technical Implementation Guide
GPU Memory Optimization
from datasets import load_dataset

# Batch processing for memory efficiency (map-style datasets only, since it relies on len())
def process_in_batches(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        # Slicing a Dataset returns a dict mapping column names to lists
        batch = dataset[i:i+batch_size]
        yield batch

# Streaming load for large datasets
dataset_stream = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split="stem",
    streaming=True
)
Distributed Processing Setup
from datasets import load_dataset

def setup_distributed_data(rank, world_size, split="stem"):
    """Return the contiguous shard of `split` assigned to this rank."""
    # Splits are named after the categories; the dataset has no 'train' split
    dataset = load_dataset(
        "nvidia/Nemotron-Post-Training-Dataset-v1",
        split=split
    )
    shard_size = len(dataset) // world_size
    start_idx = rank * shard_size
    # The last rank also takes the remainder
    end_idx = start_idx + shard_size if rank < world_size - 1 else len(dataset)
    # select() keeps a Dataset object; slicing would return a plain dict of columns
    return dataset.select(range(start_idx, end_idx))
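The index arithmetic used for sharding can be checked in isolation, without downloading anything. The sketch below mirrors the same rule (equal-sized shards, with the last rank taking the remainder); `shard_bounds` is a hypothetical helper name introduced only for this illustration.

```python
def shard_bounds(num_samples, rank, world_size):
    """Return the half-open [start, end) index range owned by this rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    shard_size = num_samples // world_size
    start = rank * shard_size
    # The last rank absorbs the samples left over by integer division.
    end = start + shard_size if rank < world_size - 1 else num_samples
    return start, end


# Ten samples split across three ranks.
bounds = [shard_bounds(10, rank, 3) for rank in range(3)]
print(bounds)
```

With this rule the last shard can be up to `world_size - 1` samples larger than the others, which is negligible at this dataset's scale.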
Training Pipeline Configuration
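from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

def setup_training_pipeline(split="math"):
    model_name = "meta-llama/Llama-3.3-70B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Reuse EOS for padding when the tokenizer defines no pad token
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Train on one category split and hold out 1% for evaluation
    dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split=split)
    dataset = dataset.train_test_split(test_size=0.01, seed=42)

    def tokenize_function(examples):
        # The dataset has no 'text' column, so render each conversation with the
        # model's chat template (assumes messages carry 'role' and 'content' keys)
        texts = [
            tokenizer.apply_chat_template(messages, tokenize=False)
            for messages in examples['messages']
        ]
        # Padding is left to the data collator at training time
        return tokenizer(texts, truncation=True, max_length=2048)

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset['train'].column_names
    )
    training_args = TrainingArguments(
        output_dir="./nemotron-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=100,
        save_steps=1000,
        eval_steps=500,
        eval_strategy="steps",  # named evaluation_strategy in older transformers releases
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        dataloader_num_workers=4,
        fp16=True,
    )
    return model, tokenizer, tokenized_dataset, training_args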
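The batch-related settings in this configuration interact: the number of samples consumed per optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. The sketch below works through that arithmetic for the math split (2,044,407 samples). It is an estimate only; the exact step count reported by a trainer may differ slightly depending on how the last partial batch is handled, and the function names are illustrative.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_samples, batch_size):
    """Approximate optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


MATH_SAMPLES = 2_044_407  # from the distribution table

single_gpu_batch = effective_batch_size(4, 8)                 # values from the config above
eight_gpu_batch = effective_batch_size(4, 8, num_devices=8)

single_gpu_steps = optimizer_steps_per_epoch(MATH_SAMPLES, single_gpu_batch)
eight_gpu_steps = optimizer_steps_per_epoch(MATH_SAMPLES, eight_gpu_batch)

print(single_gpu_batch, single_gpu_steps)
print(eight_gpu_batch, eight_gpu_steps)
```

On a single device, the configured 500 warmup steps cover under 1% of one epoch over the math split, so the warmup length may be worth revisiting when training on a much smaller subset.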
Performance Optimization Tips
1. Memory-Efficient Processing
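from datasets import load_dataset

def memory_efficient_data_loader(dataset_name, split, chunk_size=10000):
    """Yield lists of up to chunk_size samples from a streamed split."""
    dataset = load_dataset(dataset_name, split=split, streaming=True)
    chunk = []
    for sample in dataset:
        chunk.append(sample)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    # Flush the final, possibly smaller chunk
    if chunk:
        yield chunk

def process_chunk(chunk):
    # Placeholder: replace with your own per-chunk processing
    print(f"Processing {len(chunk)} samples")

for chunk in memory_efficient_data_loader(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    "stem",
    chunk_size=5000
):
    process_chunk(chunk)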
2. Tokenization Optimization
from transformers import AutoTokenizer

def parallel_tokenization(dataset, tokenizer, num_processes=8):
    """Tokenize the 'text' column in parallel.

    Expects a dataset that already has a 'text' column, such as the output
    of the formatting functions shown earlier.
    """
    def tokenize_batch(batch):
        return tokenizer(
            batch['text'],
            truncation=True,
            max_length=2048
        )
    # Dataset.map handles batching and worker processes itself; a nested
    # function passed to multiprocessing.Pool.map cannot be pickled
    return dataset.map(tokenize_batch, batched=True, num_proc=num_processes)
Quality Validation and Monitoring
Dataset Quality Check Script
from datasets import load_dataset

def validate_dataset_quality(dataset):
    """Validate dataset quality"""
    quality_metrics = {
        'total_samples': len(dataset),
        'avg_input_length': 0,
        'avg_output_length': 0,
        'empty_samples': 0,
        'malformed_samples': 0
    }
    input_lengths = []
    output_lengths = []
    for sample in dataset:
        try:
            messages = sample['messages']
            # Note: multi-turn conversations are counted as malformed by this check
            if len(messages) != 2:
                quality_metrics['malformed_samples'] += 1
                continue
            input_text = messages[0]['content']
            output_text = messages[1]['content']
            if not input_text or not output_text:
                quality_metrics['empty_samples'] += 1
                continue
            # Lengths are measured in characters, not tokens
            input_lengths.append(len(input_text))
            output_lengths.append(len(output_text))
        except Exception:
            quality_metrics['malformed_samples'] += 1
            continue
    # Guard against division by zero when no sample passed the checks
    if input_lengths:
        quality_metrics['avg_input_length'] = sum(input_lengths) / len(input_lengths)
        quality_metrics['avg_output_length'] = sum(output_lengths) / len(output_lengths)
    return quality_metrics

dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split="math")
metrics = validate_dataset_quality(dataset)
print("Dataset quality metrics:", metrics)
Citation
@misc{bercovich2025llamanemotronefficientreasoningmodels,
title={Llama-Nemotron: Efficient Reasoning Models},
author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael 
Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk},
year={2025},
eprint={2505.00949},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.00949},
}
Conclusion
NVIDIA Nemotron Post-Training Dataset v1 is a large-scale, high-quality synthetic dataset that can serve as a meaningful resource for LLM improvement. Key strengths:
Key Advantages
- Scale: 25.66 million samples
- Quality: strict filtering and validation
- Diversity: five major categories, with STEM dominant at 80.5% of samples
- Transparency: full data and generation process disclosed
- Commercial use: CC BY 4.0 license
Application Areas
- Education AI: math and science tutoring systems
- Coding assistants: programming helper AI
- Research tools: scientific research support systems
- General-purpose AI: models with stronger reasoning capabilities
Outlook
The release of this dataset is a meaningful step toward greater transparency and collaboration in the AI industry. Developers can use it to build more capable and reliable AI systems.
For researchers and developers working on Korean-language model development, this dataset also provides a foundation for training Korean-specialized models.