
Introduction

The Nemotron Post-Training Dataset v1, released by NVIDIA on July 31, 2025, is a large-scale synthetic dataset designed to improve LLM performance. With a total of 25.66 million high-quality samples, it provides training data that can significantly enhance capabilities in mathematics, coding, STEM, general reasoning, and tool calling.

This dataset was used to train the Llama-3.3-Nemotron-Super-49B-v1.5 model. Releasing the full training data for complete transparency and reproducibility represents a meaningful move in the industry.

Dataset Overview and Key Characteristics

Basic Information

Item | Details
Dataset Name | NVIDIA Nemotron Post-Training Dataset v1
Release Date | July 31, 2025
Total Samples | 25,659,642
File Size | 203 GB (Parquet format)
License | CC BY 4.0 (commercial and non-commercial use permitted)
Platform | Hugging Face Datasets
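
As a rough sanity check on the figures above, the average on-disk size per sample can be estimated from the sample count and file size. This is only a sketch: it assumes "GB" means 10^9 bytes, which the source does not specify, and the variable names are illustrative.

```python
# Figures taken from the table above
TOTAL_SAMPLES = 25_659_642
FILE_SIZE_BYTES = 203 * 10**9  # assumption: decimal gigabytes (10^9 bytes)

# Average Parquet footprint of one sample, in bytes
avg_bytes_per_sample = FILE_SIZE_BYTES / TOTAL_SAMPLES
print(f"Average on-disk size per sample: {avg_bytes_per_sample / 1000:.1f} kB")
```

Under that assumption the average works out to roughly 8 kB per sample, which is plausible for responses that include long reasoning traces.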

Data Generation Approach

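The defining characteristic of this dataset is that all responses are synthetically generated; prompts are either taken from public corpora or synthetically generated as well: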

  • Prompts: extracted from public corpora or synthetically generated
  • Responses: synthetically generated by high-performance AI models
  • Quality filtering: samples with low consistency, trivial answers, or grammatical errors are removed

Per-Category Detailed Analysis

Overall Data Distribution

Category | Sample Count | Share | Primary Use
stem | 20,662,167 | 80.5% | Science, engineering, and math reasoning
code | 1,896,395 | 7.4% | Programming skill improvement
math | 2,044,407 | 8.0% | Mathematical reasoning and computation
chat | 746,622 | 2.9% | Conversational interaction
tool_calling | 310,051 | 1.2% | Tool calling and API usage
Total | 25,659,642 | 100% | -
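
The shares in the table follow directly from the sample counts. The short sketch below recomputes them; the dictionary name is illustrative, and the counts are copied from the table.

```python
# Per-category sample counts, copied from the distribution table
category_counts = {
    "stem": 20_662_167,
    "code": 1_896_395,
    "math": 2_044_407,
    "chat": 746_622,
    "tool_calling": 310_051,
}

total = sum(category_counts.values())

# Share of each category as a percentage, rounded to one decimal place
shares = {
    name: round(100 * count / total, 1)
    for name, count in category_counts.items()
}

for name, share in shares.items():
    print(f"{name:<13}{category_counts[name]:>12,}{share:>7.1f}%")
print(f"{'total':<13}{total:>12,}")
```

The recomputed total and percentages match the table, so the listed counts are internally consistent.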

1. STEM Category (20.7M Samples)

The core category, accounting for 80.5% of the total dataset.

Covered Areas

  • Science: physics, chemistry, biology
  • Engineering: various engineering disciplines
  • Mathematics: advanced math problems
  • Humanities: general reasoning problems

Prompt Template

Read the following problem carefully and provide a detailed, step-by-step answer.
{problem}
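
To illustrate how a template like this might be filled in, here is a minimal sketch. The helper name `build_prompt` and the example question are hypothetical; the source does not describe how NVIDIA applied the template.

```python
# STEM prompt template as quoted in this article
STEM_TEMPLATE = (
    "Read the following problem carefully and provide a detailed, "
    "step-by-step answer.\n{problem}"
)

def build_prompt(template: str, problem: str) -> str:
    """Substitute the problem text for the {problem} placeholder."""
    # str.replace is used instead of str.format so that other literal braces
    # in a template (or in the problem text) are left untouched
    return template.replace("{problem}", problem.strip())

example_problem = "Why does ice float on liquid water?"  # hypothetical question
prompt = build_prompt(STEM_TEMPLATE, example_problem)
print(prompt)
```

Plain string replacement is a deliberate choice here: the math template quoted later in this article contains a literal `\boxed{}`, which `str.format` would misread as a positional placeholder.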

Use Cases

  • Improving scientific reasoning
  • Developing complex problem-solving skills
  • Enhancing academic writing ability

2. Math Category (2.0M Samples)

Data focused on training step-by-step mathematical problem solving.

Characteristics

  • Complex math problems included
  • Step-by-step solution processes provided
  • Final answers formatted explicitly as \boxed{}

Prompt Template

Solve the following math problem. Explain your reasoning and put the final answer in \boxed{}.
{problem}
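
Because final answers are wrapped in \boxed{}, they can be pulled out of a response programmatically, for example to compare against a reference answer. The following is a minimal sketch; the function name `extract_boxed` is an assumption for illustration and is not part of the dataset or any NVIDIA tooling.

```python
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace of \boxed{ found
                return "".join(chars)
        chars.append(ch)
    return None        # braces never balanced

response = "Adding the fractions gives \\boxed{\\frac{1}{2}}."
print(extract_boxed(response))
```

Tracking brace depth rather than using a simple regular expression keeps nested LaTeX such as `\frac{1}{2}` intact.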

Training Benefits

  • Improved logical reasoning
  • Better understanding of mathematical notation
  • Systematic problem-solving methodology

3. Code Category (1.9M Samples)

High-quality code generation data for improving programming skills.

Data Composition

  • Programming challenges
  • Algorithm problems
  • Code explanation and optimization
  • Support for multiple programming languages

Prompt Template

Write a solution for the following programming challenge. Provide a brief explanation of your approach, followed by the complete code.
{problem}

External Data Sources

Some prompts are sourced from external datasets such as OpenCodeReasoning. For those entries, the original source must be downloaded separately.

4. Chat Category (747K Samples)

Data for improving conversational AI performance.

Characteristics

  • Natural conversational flow
  • Diverse topics and situations
  • Helpful and friendly AI assistant style

System Prompt

You are a helpful and friendly AI assistant.

External Data Integration

Some data is sourced from the lmsys-chat-1m dataset. If the input field is empty, the original source must be downloaded.
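
A simple way to find the samples that need this extra step is to look for empty user turns. The sketch below assumes that the "input field" corresponds to the `content` of user messages in `messages`; the function name and the two sample records are hypothetical.

```python
def needs_external_source(sample: dict) -> bool:
    """Return True when the sample has no user text and must be restored externally."""
    user_turns = [m for m in sample.get("messages", []) if m.get("role") == "user"]
    if not user_turns:
        return True
    # An empty or whitespace-only user turn means the prompt was withheld
    return any(not (m.get("content") or "").strip() for m in user_turns)

# Hypothetical records shaped like the dataset's 'messages' field
samples = [
    {"uuid": "a", "messages": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there."},
    ]},
    {"uuid": "b", "messages": [
        {"role": "user", "content": ""},
        {"role": "assistant", "content": "Sure, here is a summary."},
    ]},
]

missing = [s["uuid"] for s in samples if needs_external_source(s)]
print(missing)  # ['b']
```

Samples flagged this way would then be matched against the separately downloaded source data before training.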

5. Tool Calling Category (310K Samples)

Data for improving AI agent and tool integration capabilities.

Supported Scenarios

  • Single-turn: a single tool call
  • Multi-turn: tool calling across multiple conversation turns
  • Multi-step: complex, multi-stage tool calls

Metadata Structure

  • tools: definitions of available tools
  • tool_calls: assistant’s tool call records
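
The three scenarios can be told apart by counting user turns and assistant tool calls in a conversation. The sketch below is one possible interpretation, not the dataset's official labeling: it assumes assistant messages carry a `tool_calls` list, and both the classification rules and the example conversation are hypothetical.

```python
def classify_tool_sample(messages: list) -> str:
    """Label a conversation by how tool calls are spread across it."""
    user_turns = sum(1 for m in messages if m.get("role") == "user")
    tool_call_msgs = sum(
        1 for m in messages
        if m.get("role") == "assistant" and m.get("tool_calls")
    )
    if tool_call_msgs == 0:
        return "no-tool-call"
    if user_turns > 1:
        return "multi-turn"   # tool use spans several user turns
    if tool_call_msgs > 1:
        return "multi-step"   # one request, several chained tool calls
    return "single-turn"      # one request, one tool call

# Hypothetical conversation: one user request answered with two tool calls
example = [
    {"role": "user", "content": "What is the weather in Paris and in Rome?"},
    {"role": "assistant", "content": "",
     "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]},
    {"role": "tool", "content": "18 C, cloudy"},
    {"role": "assistant", "content": "",
     "tool_calls": [{"name": "get_weather", "arguments": {"city": "Rome"}}]},
    {"role": "tool", "content": "24 C, sunny"},
    {"role": "assistant", "content": "Paris: 18 C and cloudy. Rome: 24 C and sunny."},
]

print(classify_tool_sample(example))  # multi-step
```

Check the actual field layout of the tool_calling split before relying on these rules, since the key names used here are assumptions.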

Analysis of Data Generation Models

Models Used

Model | Samples Generated | Share | Notes
DeepSeek-R1-0528 | 24,602,969 | 95.9% | Primary generation model
Qwen3-235B-A22B | 1,056,673 | 4.1% | Secondary generation model
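
The per-model counts can be cross-checked against the per-category counts given earlier. The sketch below uses only numbers from this article. The per-model counts match two particular groupings of categories, which is consistent with a category-based split between the two models, but the source does not state that mapping, so treat it as an observation rather than a documented fact.

```python
# Counts copied from the category and model tables in this article
category_counts = {
    "stem": 20_662_167,
    "code": 1_896_395,
    "math": 2_044_407,
    "chat": 746_622,
    "tool_calling": 310_051,
}
generator_counts = {
    "DeepSeek-R1-0528": 24_602_969,
    "Qwen3-235B-A22B": 1_056_673,
}

# Both tables should describe the same total number of samples
totals_agree = sum(generator_counts.values()) == sum(category_counts.values())

# Observation: the model counts coincide with these category groupings
stem_code_math = sum(category_counts[c] for c in ("stem", "code", "math"))
chat_tools = sum(category_counts[c] for c in ("chat", "tool_calling"))

print("Totals agree:", totals_agree)
print("stem + code + math == DeepSeek count:",
      stem_code_math == generator_counts["DeepSeek-R1-0528"])
print("chat + tool_calling == Qwen3 count:",
      chat_tools == generator_counts["Qwen3-235B-A22B"])
```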

Generation Quality Assurance

  1. Diversity: two distinct models used
  2. Reasoning mode separation: responses generated in both reasoning-on and reasoning-off modes
  3. Quality filtering: consistency validation and error removal

How to Use This Dataset

Loading Data via Hugging Face Datasets

from datasets import load_dataset

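# Load the full dataset (all five category splits; about 203 GB of Parquet files)
dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1")

# Load specific categories only; passing a list of splits returns a list of datasets
code_dataset, math_dataset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split=["code", "math"]
)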

# Load an individual category
stem_data = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1", 
    split="stem"
)

Understanding the Data Structure

# Inspect a data sample (splits are named after categories; there is no 'train' split)
sample = dataset['math'][0]
print("UUID:", sample['uuid'])
print("Category:", sample['category'])
print("License:", sample['license'])
print("Messages:", sample['messages'])
print("Metadata:", sample['metadata'])

Preparing Fine-Tuning Data

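def first_turn(messages):
    """Return the first user prompt and first assistant reply, skipping any system message"""
    prompt = next((m['content'] for m in messages if m['role'] == 'user'), "")
    response = next((m['content'] for m in messages if m['role'] == 'assistant'), "")
    return prompt, response

def format_chat_sample(sample):
    """Format chat data (Mistral-style [INST] tags; adapt to your target model's template)"""
    prompt, response = first_turn(sample['messages'])
    formatted = f"<s>[INST] {prompt} [/INST] {response}</s>"
    return {"text": formatted}

def format_math_sample(sample):
    """Format math problem data"""
    problem, solution = first_turn(sample['messages'])
    # If the stored prompt already contains this instruction, drop the prefix to avoid repeating it
    formatted = f"Solve the following math problem. Explain your reasoning and put the final answer in \\boxed{{}}.\n{problem}\n\n{solution}"
    return {"text": formatted}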

# Apply transformations
chat_formatted = dataset['chat'].map(format_chat_sample)
math_formatted = dataset['math'].map(format_math_sample)

Quality Evaluation and Benchmarks

Data Quality Indicators

  1. Consistency: logical coherence between prompts and responses
  2. Accuracy: correctness of answers for math and science problems
  3. Complexity: maintaining an appropriate difficulty level
  4. Diversity: variety in topics and formats

Model Performance Improvements

Llama-3.3-Nemotron-Super-49B-v1.5, trained on this dataset, achieves:

  • More efficient reasoning compared to the base Llama-3.3-70B
  • 128K context length support
  • Optimized accuracy-efficiency tradeoff

License and Usage Restrictions

License Information

  • License: Creative Commons Attribution 4.0 International (CC BY 4.0)
  • Commercial use: permitted
  • Redistribution: permitted (with attribution required)
  • Modification: permitted

Ethical Considerations

  1. Privacy protection: confirmed to contain no PII data
  2. Copyright review: legal review completed
  3. Bias minimization: diverse perspectives incorporated
  4. Safety: harmful content filtered out

Data Opt-Out

If issues are found, contact ln-dataset@nvidia.com.

Practical Application Examples

1. Math Education AI Development

# Prepare data for training a math tutor AI
math_subset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1", 
    split="math"
)

def create_math_tutor_format(sample):
    problem = sample['messages'][0]['content']
    solution = sample['messages'][1]['content']
    
    return {
        "instruction": "Please solve the following math problem step by step.",
        "input": problem,
        "output": solution
    }

math_tutor_data = math_subset.map(create_math_tutor_format)

2. Coding Assistant Development

# Prepare data for training a coding assistant AI
code_subset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1", 
    split="code"
)

def create_coding_assistant_format(sample):
    problem = sample['messages'][0]['content']
    solution = sample['messages'][1]['content']
    
    return {
        "instruction": "Please solve the following programming problem and explain your approach.",
        "input": problem,
        "output": solution
    }

coding_assistant_data = code_subset.map(create_coding_assistant_format)

3. Scientific Research Assistant Development

# Prepare data for training a STEM research assistant AI
stem_subset = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1", 
    split="stem"
)

def create_research_assistant_format(sample):
    query = sample['messages'][0]['content']
    response = sample['messages'][1]['content']
    
    return {
        "instruction": "Please provide a detailed and accurate answer to the scientific question.",
        "input": query,
        "output": response
    }

research_assistant_data = stem_subset.map(create_research_assistant_format)

Technical Implementation Guide

GPU Memory Optimization

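from datasets import load_dataset

# Note: both techniques below limit host (CPU) memory use while reading data;
# GPU memory depends on the model size and training batch size instead.

# Batch processing for memory efficiency
def process_in_batches(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        # Slicing a Dataset returns a dict mapping column names to lists of values
        batch = dataset[i:i+batch_size]
        yield batch

# Streaming load for large datasets: samples are fetched lazily during iteration
dataset_stream = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    split="stem",
    streaming=True
)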

Distributed Processing Setup

from datasets import load_dataset

def setup_distributed_data(rank, world_size, split="stem"):
    # The dataset exposes one split per category (there is no "train" split)
    dataset = load_dataset(
        "nvidia/Nemotron-Post-Training-Dataset-v1",
        split=split
    )

    # Give each rank a contiguous shard; the last rank also takes the remainder
    shard_size = len(dataset) // world_size
    start_idx = rank * shard_size
    end_idx = start_idx + shard_size if rank < world_size - 1 else len(dataset)

    # select() returns a Dataset; slicing with [start:end] would return a dict of columns
    return dataset.select(range(start_idx, end_idx))
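
The index arithmetic in `setup_distributed_data` can be checked in isolation, without downloading anything. The helper below is a standalone sketch that reproduces the same formula; the name `shard_bounds` is illustrative.

```python
def shard_bounds(num_samples: int, rank: int, world_size: int) -> tuple:
    """Return (start, end) indices of the contiguous shard owned by `rank`."""
    shard_size = num_samples // world_size
    start = rank * shard_size
    # The last rank absorbs the remainder so that no sample is dropped
    end = start + shard_size if rank < world_size - 1 else num_samples
    return start, end

# Example: 10 samples split across 4 ranks
bounds = [shard_bounds(10, rank, 4) for rank in range(4)]
print(bounds)  # [(0, 2), (2, 4), (4, 6), (6, 10)]
```

Note that the last rank can receive noticeably more samples than the others (4 versus 2 in this example). In synchronized training, uneven shards mean some ranks run out of data first, so a balanced split or dropping the remainder may be preferable.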

Training Pipeline Configuration

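from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

def setup_training_pipeline(split="math"):
    model_name = "meta-llama/Llama-3.3-70B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Llama tokenizers define no pad token; reuse EOS so padding=True works
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Load a single category split and hold out 1% for evaluation, which
    # eval_steps and load_best_model_at_end below depend on
    dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split=split)
    dataset = dataset.train_test_split(test_size=0.01, seed=42)

    def tokenize_function(examples):
        # Samples store conversations under 'messages' (there is no 'text' column);
        # render each one with the chat template, keeping only role and content
        texts = [
            tokenizer.apply_chat_template(
                [{"role": m["role"], "content": m["content"]} for m in messages],
                tokenize=False
            )
            for messages in examples['messages']
        ]
        return tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=2048,
            add_special_tokens=False  # the chat template already inserts BOS
        )

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset["train"].column_names
    )

    training_args = TrainingArguments(
        output_dir="./nemotron-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=100,
        save_steps=1000,
        eval_steps=500,
        eval_strategy="steps",  # named evaluation_strategy in older transformers releases
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        dataloader_num_workers=4,
        fp16=True,
    )

    # tokenized_dataset is a DatasetDict with "train" and "test" splits
    return model, tokenizer, tokenized_dataset, training_args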
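
To get a feel for what these settings imply, the effective batch size and an approximate optimizer step count can be worked out by hand. The sketch below assumes training on the math category (2,044,407 samples, ignoring any held-out evaluation data) across 8 GPUs; the device count and the helper name `training_steps` are assumptions, and the `Trainer` itself may round slightly differently.

```python
import math

def training_steps(num_samples, per_device_batch_size, grad_accum_steps,
                   num_devices, num_epochs):
    """Return (effective batch size, approximate total optimizer steps)."""
    # Samples consumed per optimizer update across all devices
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs

effective_batch, total_steps = training_steps(
    num_samples=2_044_407,      # math category size from this article
    per_device_batch_size=4,    # per_device_train_batch_size above
    grad_accum_steps=8,         # gradient_accumulation_steps above
    num_devices=8,              # assumption: 8 GPUs
    num_epochs=3,               # num_train_epochs above
)
print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {total_steps:,}")
```

With these numbers, the 500 warmup steps cover only about 2% of training, and checkpoints every 1,000 steps combined with `save_total_limit=3` keep just the most recent few on disk.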

Performance Optimization Tips

1. Memory-Efficient Processing

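from datasets import load_dataset

def memory_efficient_data_loader(dataset_name, split, chunk_size=10000):
    # streaming=True iterates over samples without downloading the full split first
    dataset = load_dataset(dataset_name, split=split, streaming=True)

    chunk = []
    for sample in dataset:
        chunk.append(sample)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []

    # Yield the final, possibly smaller, chunk
    if chunk:
        yield chunk

def process_chunk(chunk):
    """Placeholder: replace with your own per-chunk processing"""
    print(f"Processing {len(chunk)} samples")

for chunk in memory_efficient_data_loader(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    "stem",
    chunk_size=5000
):
    process_chunk(chunk)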

2. Tokenization Optimization

from transformers import AutoTokenizer

def parallel_tokenization(dataset, tokenizer, num_processes=8):
    # Expects a 'text' column, e.g. produced by the formatting functions shown earlier
    def tokenize_batch(batch):
        return tokenizer(
            batch['text'],
            truncation=True,
            max_length=2048
        )

    # Dataset.map handles batching and multiprocessing itself. A multiprocessing.Pool
    # cannot pickle the nested function, and Pool.map would pass single samples
    # rather than batches.
    return dataset.map(
        tokenize_batch,
        batched=True,
        num_proc=num_processes
    )

Quality Validation and Monitoring

Dataset Quality Check Script

def validate_dataset_quality(dataset):
    """Validate dataset quality"""
    quality_metrics = {
        'total_samples': len(dataset),
        'avg_input_length': 0,
        'avg_output_length': 0,
        'empty_samples': 0,
        'malformed_samples': 0
    }
    
    input_lengths = []
    output_lengths = []
    
    for sample in dataset:
        try:
            messages = sample['messages']
            if len(messages) != 2:
                quality_metrics['malformed_samples'] += 1
                continue
                
            input_text = messages[0]['content']
            output_text = messages[1]['content']
            
            if not input_text or not output_text:
                quality_metrics['empty_samples'] += 1
                continue
                
            input_lengths.append(len(input_text))
            output_lengths.append(len(output_text))
            
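        except Exception:
            # Missing keys or unexpected types count as malformed
            quality_metrics['malformed_samples'] += 1
            continue
    
    # Guard against division by zero when no sample passed validation
    if input_lengths:
        quality_metrics['avg_input_length'] = sum(input_lengths) / len(input_lengths)
        quality_metrics['avg_output_length'] = sum(output_lengths) / len(output_lengths)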
    
    return quality_metrics

dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split="math")
metrics = validate_dataset_quality(dataset)
print("Dataset quality metrics:", metrics)

Citation

@misc{bercovich2025llamanemotronefficientreasoningmodels,
      title={Llama-Nemotron: Efficient Reasoning Models}, 
      author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and 
Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk},
      year={2025},
      eprint={2505.00949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.00949}, 
}

Conclusion

NVIDIA Nemotron Post-Training Dataset v1 is a large-scale synthetic dataset that can serve as a practical resource for improving LLMs.

Key Advantages

  1. Scale: 25.66 million samples
  2. Quality: strict filtering and validation
  3. Diversity: five categories (STEM, math, code, chat, and tool calling), with STEM accounting for 80.5% of samples
  4. Transparency: full data and generation process disclosed
  5. Commercial use: CC BY 4.0 license

Application Areas

  • Education AI: math and science tutoring systems
  • Coding assistants: programming helper AI
  • Research tools: scientific research support systems
  • General-purpose AI: models with stronger reasoning capabilities

Outlook

The release of this dataset is a meaningful step toward greater transparency and collaboration in the AI industry. Developers can use it to build more capable and reliable AI systems.

For researchers and developers working on Korean-language models, this dataset may also serve as a starting point for training Korean-specialized models.