NVIDIA Eval API¶
The NVIDIA Eval API provides specialized evaluation capabilities for mathematical reasoning and coding tasks using industry-standard benchmarks. This API focuses on rigorous evaluation of large language models across different domains including competitive programming and mathematical problem-solving.
Overview
NVIDIA Eval covers two benchmark families plus a containerized execution mode:
- LiveCodeBench: Dynamic coding benchmark with recent problems
- AIME24/AIME25: Mathematical reasoning benchmarks from American Invitational Mathematics Examination
- Docker Support: Containerized execution for consistent environments
Prerequisites¶
Requirements
Before using NVIDIA Eval, ensure you have:
- A running VLLM server endpoint (e.g., `http://localhost:8000/v1`); a quick reachability check follows this list
- A Python environment with the required dependencies
- Docker installed (for containerized execution)
- Sufficient computational resources for inference
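Before kicking off a run, it can help to confirm the endpoint actually responds. A minimal check, assuming an OpenAI-compatible vLLM server on the default port with no API key required:

```bash
# List the models served by the endpoint (an OpenAI-compatible route exposed by vLLM).
# Port 8000 and the absence of authentication are assumptions; adjust for your deployment.
curl -s http://localhost:8000/v1/models | python -m json.tool
```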
LiveCodeBench Evaluation¶
LiveCodeBench provides a dynamic evaluation platform for coding tasks with problems sourced from recent competitive programming contests, ensuring minimal data contamination.
About LiveCodeBench
LiveCodeBench is continuously updated with new problems, making it ideal for evaluating code generation capabilities without training data leakage concerns.
Step-by-Step Process¶
1. Dataset Preparation¶
First, download the LiveCodeBench dataset:
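The exact download command depends on your checkout of the evaluation code and is not reproduced here. As a hedged sketch, the raw LiveCodeBench files can be pulled from the Hugging Face Hub; the dataset repository name below is an assumption, and converting the download into the expected JSONL layout is left to the repository's own tooling.

```bash
# Hypothetical sketch: fetch the raw LiveCodeBench dataset files from the Hugging Face Hub.
# The dataset repository name is a guess; your checkout may ship its own download script.
pip install -U "huggingface_hub[cli]"
huggingface-cli download livecodebench/code_generation_lite \
    --repo-type dataset \
    --local-dir data/livecodebench_raw
```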
Dataset Location
The dataset will be saved to `data/livecodebench_problems.jsonl` and contains coding problems with test cases.
2. Model Inference¶
Run inference against your model endpoint:
```bash
python inference.py \
    --api-base http://localhost:8000/v1 \
    --datapath data/livecodebench_problems.jsonl \
    --model-type opt125m \
    --output-folder "./results_livecodebench"
```
Parameter Explanation:

- `--api-base`: Your VLLM server endpoint
- `--datapath`: Path to the dataset file
- `--model-type`: Model identifier for logging
- `--output-folder`: Directory for storing inference results
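Under the hood, inference against a VLLM endpoint boils down to OpenAI-compatible completion requests. A rough illustration of a single request sent directly with curl; the model name, prompt, and token budget are placeholders rather than values taken from `inference.py`:

```bash
# Send one completion request to the OpenAI-compatible endpoint served by vLLM.
# Model name, prompt, and max_tokens are illustrative placeholders.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "facebook/opt-125m",
        "prompt": "Write a Python function that reverses a string.",
        "max_tokens": 256
      }'
```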
3. Evaluation¶
Evaluate the generated solutions:
```bash
python evaluate_livecodebench.py \
    --question-path data/livecodebench_problems.jsonl \
    --generation-path results_livecodebench/
```
Expected Output
The evaluation will generate accuracy metrics, execution success rates, and detailed analysis of coding performance.
AIME Mathematical Benchmarks¶
The American Invitational Mathematics Examination (AIME) benchmarks test advanced mathematical reasoning capabilities across algebra, geometry, number theory, and combinatorics.
Benchmark Details
- AIME24: ~30 challenging mathematical problems from 2024
- AIME25: ~15 problems from 2025
- Both datasets require multi-step reasoning and exact numerical answers
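As with LiveCodeBench, the AIME problems are expected as JSONL files under `data/`. A quick sanity check before running inference; the exact field names vary by dataset release, so treat the output as informational:

```bash
# Count the problems and pretty-print the first record of the AIME24 dataset.
wc -l data/aime24.jsonl
head -n 1 data/aime24.jsonl | python -m json.tool
```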
AIME24 Evaluation¶
1. Model Inference¶
```bash
python inference.py \
    --api-base http://localhost:8000/v1 \
    --datapath data/aime24.jsonl \
    --model-type opt125m \
    --output-folder "./results_aime24"
```
2. Evaluation¶
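The evaluation command for this step is not shown in this section. By analogy with the LiveCodeBench flow it likely resembles the sketch below; the script name `evaluate_aime.py` and its flags are assumptions, so check the repository for the actual entry point.

```bash
# Hypothetical evaluation invocation, mirroring the LiveCodeBench step above.
python evaluate_aime.py \
    --question-path data/aime24.jsonl \
    --generation-path results_aime24/
```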
AIME25 Evaluation¶
1. Model Inference¶
```bash
python inference.py \
    --api-base http://localhost:8000/v1 \
    --datapath data/aime25.jsonl \
    --model-type opt125m \
    --output-folder "./results_aime25"
```
2. Evaluation¶
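As with AIME24, the exact evaluation command is not listed here. A hedged sketch, assuming the same hypothetical script:

```bash
# Hypothetical evaluation invocation for AIME25, mirroring the AIME24 sketch.
python evaluate_aime.py \
    --question-path data/aime25.jsonl \
    --generation-path results_aime25/
```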
Mathematical Evaluation
AIME evaluations use exact match scoring for numerical answers. The evaluation script handles various answer formats and provides detailed problem-by-problem analysis.
Docker Deployment¶
For consistent execution environments and simplified deployment, use the Docker-based approach:
Building the Image¶
Build Requirements
The Docker build process requires significant resources and may take several minutes. Ensure you have adequate disk space and memory.
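A typical build invocation, assuming the Dockerfile sits at the repository root and produces the image tag used in the run command below:

```bash
# Build the evaluation image; the Dockerfile location is an assumption.
docker build -t nvidia-benchmark:latest .
```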
Running Evaluations¶
```bash
docker run --rm \
  --network host \
  -v $(pwd)/results:/app/results \
  -e MODEL_ENDPOINT="http://localhost:8080/v1" \
  -e MODEL_NAME="facebook/opt-125m" \
  -e OUTPUT_DIR="/app/results" \
  -e EVAL_TYPE="aime" \
  -e MAX_TOKENS="512" \
  nvidia-benchmark:latest
```
Environment Variables:

- `MODEL_ENDPOINT`: VLLM server endpoint URL
- `OUTPUT_DIR`: Container output directory
- `EVAL_TYPE`: Benchmark type (`aime`, `lcb`, `both`)
- `MAX_TOKENS`: Maximum output sequence length
- The volume mount maps the local `results` directory to the container's output workspace
Network Configuration
The `--network host` flag enables the container to access the host's VLLM server. Adjust network settings based on your deployment architecture.
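If host networking is not available (for example, on Docker Desktop), one alternative is to resolve the host through Docker's gateway alias. The flags below are standard Docker options, but whether this fits your setup depends on your deployment, so treat it as a sketch:

```bash
# Alternative to --network host: expose the host gateway as host.docker.internal
# (supported on recent Docker releases) and point MODEL_ENDPOINT at it.
docker run --rm \
  --add-host host.docker.internal:host-gateway \
  -v $(pwd)/results:/app/results \
  -e MODEL_ENDPOINT="http://host.docker.internal:8080/v1" \
  -e MODEL_NAME="facebook/opt-125m" \
  -e OUTPUT_DIR="/app/results" \
  -e EVAL_TYPE="aime" \
  -e MAX_TOKENS="512" \
  nvidia-benchmark:latest
```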
Output Structure¶
After running evaluations, you'll find results organized as:
```text
results/
├── results_livecodebench/   # LiveCodeBench inference results
├── results_aime24/          # AIME24 inference results
├── results_aime25/          # AIME25 inference results
├── evaluation_results.json  # Aggregated metrics
└── logs/                    # Detailed execution logs
```
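Once a run finishes, the aggregated metrics can be inspected directly; the file path comes from the layout above, and its exact fields depend on the benchmark type, so this simply pretty-prints whatever was written:

```bash
# Pretty-print the aggregated metrics produced by the evaluation run.
python -m json.tool results/evaluation_results.json
```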
Next Steps
Use the generated results for model comparison, performance analysis, and integration with the broader VLLM evaluation pipeline.