Evalchemy API

Evalchemy is a unified benchmark runner built on EleutherAI's lm-evaluation-harness, providing comprehensive evaluation capabilities for large language models. It supports both standard academic benchmarks and custom evaluation tasks with flexible API-based model integration.

Evalchemy Overview

Evalchemy provides:

  • Unified Benchmarking: Single interface for multiple evaluation frameworks
  • Standard Benchmarks: ARC, HellaSwag, MMLU, Ko-MMLU, Ko-ARC support
  • API Integration: Compatible with VLLM and other serving frameworks
  • Flexible Deployment: Docker and script-based execution options

Prerequisites

System Requirements

Before using Evalchemy, ensure you have:

  • Python 3.8+ environment
  • A running VLLM server endpoint (see the reachability check after this list)
  • Docker (for containerized execution)
  • Sufficient computational resources for benchmark evaluation
  • Network connectivity between client and model server
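
Before launching a benchmark, it is worth confirming that the server endpoint is reachable from the machine that will run Evalchemy. The check below is a minimal sketch assuming an OpenAI-compatible VLLM server listening on localhost:8000; adjust the host and port to match your deployment.

# Lists the models the server currently serves; a connection error here
# means the endpoint is not reachable from this machine
curl -s http://localhost:8000/v1/models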

Execution Methods

Evalchemy supports two primary execution approaches, each optimized for different use cases:

1. Docker Execution

Advantages:

  • Consistent execution environment
  • Simplified dependency management
  • Easy integration with Kubernetes workflows
  • Reproducible results across systems

2. Script Execution

Advantages:

  • Direct system access for debugging
  • Faster iteration during development
  • Custom environment configuration
  • Local filesystem integration

Docker-Based Evaluation

Building the Container

Create the Evalchemy Docker image:

docker build -f docker/standard-evalchemy.Dockerfile \
    -t standard-evalchemy:latest .

Image Optimization

The Docker build process includes optimized dependency installation and caching for faster subsequent builds.

Container Execution

Run comprehensive benchmarks in a containerized environment:

docker run --rm \
    --network host \
    -v $(pwd)/results:/app/results \
    -v $(pwd)/parsed:/app/parsed \
    -e MODEL_ENDPOINT="http://localhost:8080/v1/completions" \
    -e MODEL_NAME="Qwen/Qwen2-0.5B" \
    -e TOKENIZER="Qwen/Qwen2-0.5B" \
    standard-evalchemy:latest

Network Configuration

--network host gives the container direct access to the host network stack, so a MODEL_ENDPOINT that points at localhost (for example, a VLLM server running on the host) resolves correctly without additional port mapping.
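
If host networking is not available (for example, on Docker Desktop for macOS or Windows), a common alternative is to keep the default bridge network and point MODEL_ENDPOINT at host.docker.internal. The command below is a sketch rather than part of the standard workflow; it assumes Docker 20.10+ for the host-gateway alias and a VLLM server on host port 8080.

docker run --rm \
    --add-host=host.docker.internal:host-gateway \
    -v $(pwd)/results:/app/results \
    -v $(pwd)/parsed:/app/parsed \
    -e MODEL_ENDPOINT="http://host.docker.internal:8080/v1/completions" \
    -e MODEL_NAME="Qwen/Qwen2-0.5B" \
    -e TOKENIZER="Qwen/Qwen2-0.5B" \
    standard-evalchemy:latest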

Environment Variables Explained

Model Configuration:

  • MODEL_ENDPOINT: Complete API endpoint URL including path
  • MODEL_NAME: HuggingFace model identifier
  • SERVED_MODEL_NAME: Model name as configured on the VLLM server
  • TOKENIZER: Tokenizer specification for accurate token counting
  • TOKENIZER_BACKEND: Backend implementation (huggingface, tiktoken)

Evaluation Configuration:

  • MODEL_CONFIG: JSON configuration for API behavior and retries (see the sketch after this list)
  • EVALUATION_CONFIG: Benchmark parameters including limits and output format
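
The exact JSON schemas accepted by MODEL_CONFIG and EVALUATION_CONFIG depend on the Evalchemy version in use; the exports below only illustrate the general shape, and the keys num_retries, timeout, limit, and output_format are illustrative assumptions rather than a documented interface.

# Model identity and tokenizer settings (documented above)
export MODEL_ENDPOINT="http://localhost:8080/v1/completions"
export MODEL_NAME="Qwen/Qwen2-0.5B"
export SERVED_MODEL_NAME="Qwen/Qwen2-0.5B"
export TOKENIZER="Qwen/Qwen2-0.5B"
export TOKENIZER_BACKEND="huggingface"

# JSON-valued settings; key names here are illustrative, not a fixed schema
export MODEL_CONFIG='{"num_retries": 3, "timeout": 120}'
export EVALUATION_CONFIG='{"limit": 100, "output_format": "json"}'

When values are exported this way, docker run -e MODEL_CONFIG (with no value) forwards each variable from the invoking shell instead of repeating the JSON inline.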

Output Structure

After execution, results are organized in the following structure:

eval/
└── standard_evalchemy/
    ├── parsed/          # Processed benchmark data
    └── results/         # Evaluation outcomes and metrics

Output Details

  • parsed/: Contains preprocessed benchmark questions and expected answers
  • results/: Includes model responses, accuracy metrics, and detailed analysis
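
Result file names vary by benchmark and run, so there is no fixed path to point at; the commands below are just a convenient way to browse whatever a run produced, and the second one assumes the result files are JSON.

# List everything the run wrote
find eval/standard_evalchemy -type f | sort

# Pretty-print the first result file found (assumes JSON output)
python -m json.tool "$(find eval/standard_evalchemy/results -name '*.json' | head -n 1)"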

Script-Based Evaluation

Environment Setup

Install Core Dependencies

pip install -r requirements-dev.txt

Install Evalchemy Library

git clone https://github.com/ThakiCloud/evalchemy.git
cd evalchemy
pip install -e .
cd ..

Editable Installation

The -e flag installs Evalchemy in editable mode, allowing for local modifications and development.
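
To confirm the editable install is visible to the current environment, pip can report where the package resolves from; this assumes the distribution is named evalchemy, which may differ in your checkout.

# Confirms the package is installed; newer pip versions also report
# an "Editable project location" pointing into the cloned source tree
pip show evalchemy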

Running Evaluations

Execute benchmarks using the provided shell script:

./run_evalchemy.sh \
  --endpoint http://localhost:8000/v1/completions \
  --model-name "facebook/opt-125m" \
  --tokenizer "facebook/opt-125m" \
  --tokenizer-backend "huggingface" \
  --batch-size 1 \
  --run-id test_01

Script Parameters:

  • --endpoint: API endpoint for model inference
  • --model-name: Model identifier for evaluation context
  • --tokenizer: Tokenizer specification for proper preprocessing
  • --tokenizer-backend: Tokenization implementation choice
  • --batch-size: Number of parallel requests (adjust based on server capacity; see the sweep example after this list)
  • --run-id: Unique identifier for tracking evaluation runs
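
Because --batch-size is the primary throughput knob, one practical pattern is to sweep it and give each run its own --run-id so results stay separable. The loop below is a sketch that reuses only the flags documented above; adjust the endpoint and model to match your server.

# Sweep batch sizes; each run gets its own identifier
for bs in 1 2 4 8; do
  ./run_evalchemy.sh \
    --endpoint http://localhost:8000/v1/completions \
    --model-name "facebook/opt-125m" \
    --tokenizer "facebook/opt-125m" \
    --tokenizer-backend "huggingface" \
    --batch-size "$bs" \
    --run-id "batch_${bs}"
done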

Script Output Structure

Results are generated in the same directory structure:

eval/
└── standard_evalchemy/
    ├── parsed/          # Benchmark preprocessing results
    └── results/         # Evaluation metrics and analysis

Advanced Configuration

Custom Benchmark Selection

Modify evaluation parameters by editing configuration files:

# Edit benchmark selection
vim configs/evaluation_config.yaml

# Available benchmarks
benchmarks:
  - arc_easy
  - arc_challenge  
  - hellaswag
  - mmlu
  - ko_mmlu
  - ko_arc
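
For a quick smoke test, the benchmark list can be trimmed to a subset. The snippet below simply rewrites the file with two entries; it assumes benchmarks: is the only key the config needs, so if your evaluation_config.yaml carries other settings, edit the list in place as shown above instead.

# Back up the shipped config, then restrict the run to two benchmarks
cp configs/evaluation_config.yaml configs/evaluation_config.yaml.bak
cat > configs/evaluation_config.yaml <<'EOF'
benchmarks:
  - arc_easy
  - hellaswag
EOF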

Performance Tuning

Optimization Tips

For High-Throughput Evaluation:

  • Increase --batch-size based on server capacity
  • Configure appropriate retry policies in MODEL_CONFIG
  • Monitor server resource utilization during evaluation

For Memory-Constrained Environments:

  • Reduce batch size to minimize memory usage
  • Use streaming evaluation for large datasets
  • Configure appropriate timeouts to prevent hanging requests

Multi-GPU Evaluation

For accelerated evaluation on multi-GPU systems:

docker run --rm \
    --gpus all \
    --network host \
    -e CUDA_VISIBLE_DEVICES="0,1,2,3" \
    -e MODEL_ENDPOINT="http://localhost:8080/v1/completions" \
    # ... other environment variables
    standard-evalchemy:latest

Benchmark Coverage

Standard Academic Benchmarks

Supported Benchmarks

English Benchmarks:

  • ARC (Easy/Challenge): Science question answering
  • HellaSwag: Commonsense reasoning completion
  • MMLU: Massive multitask language understanding

Korean Benchmarks:

  • Ko-MMLU: Korean multitask language understanding
  • Ko-ARC: Korean science question answering

Custom Benchmark Integration

Evalchemy supports custom benchmark integration:

# Custom benchmark configuration
{
    "task": "custom_benchmark",
    "dataset_path": "/path/to/custom/data.jsonl",
    "metric": "exact_match",
    "few_shot": 5
}
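
The record layout expected at dataset_path is not specified here, so the JSONL sketch below is illustrative only; the field names question and answer are assumptions and must match whatever fields the custom task definition actually reads.

# Hypothetical records for the dataset_path used in the configuration above
cat > /path/to/custom/data.jsonl <<'EOF'
{"question": "What is the boiling point of water at sea level, in Celsius?", "answer": "100"}
{"question": "Which planet is known as the Red Planet?", "answer": "Mars"}
EOF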

Result Analysis

Metric Interpretation

Understanding Results

Key Metrics:

  • Accuracy: Percentage of correctly answered questions
  • F1 Score: Harmonic mean of precision and recall
  • Exact Match: Strict equality comparison for answers
  • BLEU Score: Text similarity for generative tasks
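
For reference, F1 is computed as 2 × (precision × recall) / (precision + recall), so it is high only when precision and recall are both high.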

Comparative Analysis

Results include comparative analysis against baseline models:

{
    "model": "Qwen/Qwen2-0.5B",
    "benchmark": "arc_easy",
    "accuracy": 0.785,
    "baseline_comparison": {
        "random_baseline": 0.25,
        "human_performance": 0.95,
        "relative_performance": 0.713
    }
}

Integration with VLLM Eval Pipeline

Automated Pipeline Integration

Evalchemy integrates seamlessly with the broader VLLM evaluation ecosystem:

# Integration with aggregation pipeline
python scripts/aggregate_metrics.py \
    --include-evalchemy \
    --results-dir results

Pipeline Compatibility

Evalchemy results are automatically compatible with the VLLM evaluation aggregation system and ClickHouse analytics pipeline.

Troubleshooting

Common Issues

Troubleshooting Guide

API Connection Issues:

  • Verify VLLM server is running and accessible
  • Check endpoint URL format and network connectivity
  • Validate API authentication if required
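
A direct request against the completions route usually separates pure connectivity failures from model-name mismatches, because the server's error body names the problem. The example assumes a VLLM server on localhost:8000 serving facebook/opt-125m; substitute your endpoint and served model name.

# A JSON completion indicates the endpoint and model name are correct;
# an HTTP error body usually identifies the misconfigured piece
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 5}'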

Memory Issues:

  • Reduce batch size for memory-constrained environments
  • Monitor system memory usage during evaluation
  • Consider using swap space for large evaluations

Performance Issues:

  • Optimize batch size based on server capacity
  • Configure appropriate retry policies
  • Monitor server resource utilization

Debug Mode

Enable detailed logging for troubleshooting:

docker run --rm \
    --network host \
    -e DEBUG_MODE="true" \
    -e LOG_LEVEL="DEBUG" \
    # ... other environment variables
    standard-evalchemy:latest

Next Steps

Advanced Usage

After successful evaluation:

  • Analyze results using provided analysis scripts
  • Integrate with monitoring and alerting systems
  • Configure automated evaluation pipelines
  • Develop custom benchmarks for domain-specific evaluation