Evalchemy API¶
Evalchemy is a unified benchmark runner built on EleutherAI's lm-evaluation-harness, providing comprehensive evaluation capabilities for large language models. It supports both standard academic benchmarks and custom evaluation tasks with flexible API-based model integration.
Evalchemy Overview
Evalchemy provides:
- Unified Benchmarking: Single interface for multiple evaluation frameworks
- Standard Benchmarks: ARC, HellaSwag, MMLU, Ko-MMLU, Ko-ARC support
- API Integration: Compatible with VLLM and other serving frameworks
- Flexible Deployment: Docker and script-based execution options
Prerequisites¶
System Requirements
Before using Evalchemy, ensure you have the following (a quick preflight check is shown after this list):
- Python 3.8+ environment
- A running VLLM server endpoint
- Docker (for containerized execution)
- Sufficient computational resources for benchmark evaluation
- Network connectivity between client and model server
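Before launching an evaluation, a preflight check along these lines can confirm the prerequisites above; the /health route is exposed by VLLM's OpenAI-compatible server, and the host and port here are assumptions to adjust for your deployment:
python --version          # expect 3.8 or newer
docker --version          # only needed for containerized runs
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health   # 200 means the VLLM server is reachable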
Execution Methods¶
Evalchemy supports two primary execution approaches, each optimized for different use cases:
1. Docker Execution (Recommended)¶
Advantages:
- Consistent execution environment
- Simplified dependency management
- Easy integration with Kubernetes workflows
- Reproducible results across systems
2. Script Execution¶
Advantages:
- Direct system access for debugging
- Faster iteration during development
- Custom environment configuration
- Local filesystem integration
Docker-Based Evaluation¶
Building the Container¶
Create the Evalchemy Docker image:
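The build command itself is not shown here; a minimal sketch, assuming the Dockerfile sits in the current directory and reusing the image tag from the run command below:
# Build the evaluation image (Dockerfile location is an assumption; adjust to your checkout)
docker build -t standard-evalchemy:latest .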
Image Optimization
The Docker build process includes optimized dependency installation and caching for faster subsequent builds.
Container Execution¶
Run comprehensive benchmarks in a containerized environment:
docker run --rm \
--network host \
-v $(pwd)/results:/app/results \
-v $(pwd)/parsed:/app/parsed \
-e MODEL_ENDPOINT="http://localhost:8080/v1/completions" \
-e MODEL_NAME="Qwen/Qwen2-0.5B" \
-e TOKENIZER="Qwen/Qwen2-0.5B" \
standard-evalchemy:latest
Network Configuration
Using --network host resolves model ID and model name resolution issues by giving the container direct access to the host network stack, for example when the VLLM server is bound to localhost.
Environment Variables Explained¶
Model Configuration:
- MODEL_ENDPOINT: Complete API endpoint URL, including the path
- MODEL_NAME: HuggingFace model identifier
- SERVED_MODEL_NAME: Model name as configured in the VLLM server
- TOKENIZER: Tokenizer specification for accurate token counting
- TOKENIZER_BACKEND: Backend implementation (huggingface or tiktoken)
Evaluation Configuration:
- MODEL_CONFIG: JSON configuration for API behavior and retries
- EVALUATION_CONFIG: Benchmark parameters, including limits and output format
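The keys these two variables accept are defined by Evalchemy's configuration files rather than documented here; purely as an illustration of how they are passed, a run with hypothetical values might look like this:
# MODEL_CONFIG and EVALUATION_CONFIG values below are hypothetical;
# consult the Evalchemy configuration schema for the actual keys
docker run --rm \
--network host \
-v $(pwd)/results:/app/results \
-v $(pwd)/parsed:/app/parsed \
-e MODEL_ENDPOINT="http://localhost:8080/v1/completions" \
-e MODEL_NAME="Qwen/Qwen2-0.5B" \
-e TOKENIZER="Qwen/Qwen2-0.5B" \
-e MODEL_CONFIG='{"num_retries": 3, "timeout": 120}' \
-e EVALUATION_CONFIG='{"limit": 100, "output_format": "json"}' \
standard-evalchemy:latest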
Output Structure¶
After execution, results are organized in the following structure:
├── eval/
│ └── standard_evalchemy/
│ ├── parsed/ # Processed benchmark data
│ └── results/ # Evaluation outcomes and metrics
Output Details
- parsed/: Contains preprocessed benchmark questions and expected answers
- results/: Includes model responses, accuracy metrics, and detailed analysis
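Once a run finishes, the generated artifacts can be inspected directly from the shell, assuming your working directory contains the eval/ tree shown above:
# List the artifacts a run produced
ls -R eval/standard_evalchemy/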
Script-Based Evaluation¶
Environment Setup¶
Install Core Dependencies¶
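The exact dependency set is defined by the repository; a minimal sketch, assuming a requirements.txt next to the evaluation scripts:
# Create an isolated environment and install the pinned dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt   # assumed filename; adjust to your checkout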
Install Evalchemy Library¶
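For Evalchemy itself, the note below describes an editable install from a local checkout; a minimal sketch:
# From inside your local Evalchemy clone
pip install -e .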
Editable Installation
The -e flag installs Evalchemy in editable mode, allowing for local modifications and development.
Running Evaluations¶
Execute benchmarks using the provided shell script:
./run_evalchemy.sh \
--endpoint http://localhost:8000/v1/completions \
--model-name "facebook/opt-125m" \
--tokenizer "facebook/opt-125m" \
--tokenizer-backend "huggingface" \
--batch-size 1 \
--run-id test_01
Script Parameters:
- --endpoint: API endpoint for model inference
- --model-name: Model identifier for the evaluation context
- --tokenizer: Tokenizer specification for proper preprocessing
- --tokenizer-backend: Tokenization implementation choice
- --batch-size: Number of parallel requests (adjust based on server capacity)
- --run-id: Unique identifier for tracking evaluation runs
Script Output Structure¶
Results are generated in the same directory structure:
├── eval/
│ └── standard_evalchemy/
│ ├── parsed/ # Benchmark preprocessing results
│ └── results/ # Evaluation metrics and analysis
Advanced Configuration¶
Custom Benchmark Selection¶
Modify evaluation parameters by editing configuration files:
# Edit benchmark selection
vim configs/evaluation_config.yaml
# Available benchmarks
benchmarks:
- arc_easy
- arc_challenge
- hellaswag
- mmlu
- ko_mmlu
- ko_arc
Performance Tuning¶
Optimization Tips
For High-Throughput Evaluation:
- Increase --batch-size based on server capacity (see the example after this list)
- Configure appropriate retry policies in MODEL_CONFIG
- Monitor server resource utilization during evaluation
For Memory-Constrained Environments:
- Reduce batch size to minimize memory usage
- Use streaming evaluation for large datasets
- Configure appropriate timeouts to prevent hanging requests
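As a concrete example of the throughput knobs above, the script invocation shown earlier can simply be re-run with a larger batch size; the value here is illustrative and should be sized to your server:
./run_evalchemy.sh \
--endpoint http://localhost:8000/v1/completions \
--model-name "facebook/opt-125m" \
--tokenizer "facebook/opt-125m" \
--tokenizer-backend "huggingface" \
--batch-size 8 \
--run-id throughput_01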
Multi-GPU Evaluation¶
For accelerated evaluation on multi-GPU systems:
docker run --rm \
--gpus all \
--network host \
-e CUDA_VISIBLE_DEVICES="0,1,2,3" \
-e MODEL_ENDPOINT="http://localhost:8080/v1/completions" \
# ... other environment variables
standard-evalchemy:latest
Benchmark Coverage¶
Standard Academic Benchmarks¶
Supported Benchmarks
English Benchmarks:
- ARC (Easy/Challenge): Science question answering
- HellaSwag: Commonsense reasoning completion
- MMLU: Massive multitask language understanding
Korean Benchmarks:
- Ko-MMLU: Korean multitask language understanding
- Ko-ARC: Korean science question answering
Custom Benchmark Integration¶
Evalchemy supports custom benchmark integration:
# Custom benchmark configuration
{
"task": "custom_benchmark",
"dataset_path": "/path/to/custom/data.jsonl",
"metric": "exact_match",
"few_shot": 5
}
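To complement the configuration above, a record in such a JSONL dataset might look like the following; the field names are illustrative and must match whatever the custom task definition expects:
{"question": "What is the boiling point of water at sea level in degrees Celsius?", "answer": "100"}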
Result Analysis¶
Metric Interpretation¶
Understanding Results
Key Metrics:
- Accuracy: Percentage of correctly answered questions
- F1 Score: Harmonic mean of precision and recall
- Exact Match: Strict equality comparison for answers
- BLEU Score: Text similarity for generative tasks
Comparative Analysis¶
Results include comparative analysis against baseline models:
{
"model": "Qwen/Qwen2-0.5B",
"benchmark": "arc_easy",
"accuracy": 0.785,
"baseline_comparison": {
"random_baseline": 0.25,
"human_performance": 0.95,
"relative_performance": 0.713
}
}
Integration with VLLM Eval Pipeline¶
Automated Pipeline Integration¶
Evalchemy integrates seamlessly with the broader VLLM evaluation ecosystem:
# Integration with aggregation pipeline
python scripts/aggregate_metrics.py \
--include-evalchemy \
--results-dir results
Pipeline Compatibility
Evalchemy results are automatically compatible with the VLLM evaluation aggregation system and ClickHouse analytics pipeline.
Troubleshooting¶
Common Issues¶
Troubleshooting Guide
API Connection Issues:
- Verify the VLLM server is running and accessible (a quick curl check is shown below)
- Check endpoint URL format and network connectivity
- Validate API authentication if required
Memory Issues:
- Reduce batch size for memory-constrained environments
- Monitor system memory usage during evaluation
- Consider using swap space for large evaluations
Performance Issues:
- Optimize batch size based on server capacity
- Configure appropriate retry policies
- Monitor server resource utilization
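For the API connection issues above, a quick manual request against the completions endpoint confirms basic connectivity; the endpoint and model name should match your VLLM server:
curl -s http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2-0.5B", "prompt": "Hello", "max_tokens": 5}'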
Debug Mode¶
Enable detailed logging for troubleshooting:
docker run --rm \
--network host \
-e DEBUG_MODE="true" \
-e LOG_LEVEL="DEBUG" \
# ... other environment variables
standard-evalchemy:latest
Next Steps¶
Advanced Usage
After successful evaluation:
- Analyze results using provided analysis scripts
- Integrate with monitoring and alerting systems
- Configure automated evaluation pipelines
- Develop custom benchmarks for domain-specific evaluation