Comprehensive Guide to LLM Dataset Curation: From Training to Preference Alignment
Explore the essential datasets and tools for LLM post-training, including supervised fine-tuning datasets, preference alignment data, and curation methodolog...
Discover the ultimate collection of curated public datasets across diverse domains, from agriculture to eSports, maintained by the global open data community.
Discover the core features and applications of Rowfill, an open-source AI platform that automatically structures PDF, image, and audio files.
Comprehensive compilation of public datasets and implementation methods for building RAG-based LLM chatbots across banking, insurance, accounting, legal, hea...
Complete analysis of OpenMathReasoning dataset with 306K math problems and 5.68M solutions - CoT, TIR, GenSelect methodologies and OpenMath-Nemotron series p...
Complete analysis of OpenCodeReasoning with 735K samples and 28K problems - synthetic data based on the R1 model, integrating 10 major platforms, optimized for SFT
Detailed analysis of NVIDIA’s AceReason-1.1-SFT dataset - CC BY 4.0 license, 4M samples, DeepSeek-R1-based high-quality math and code reasoning data