VLLM Benchmark API¶
The VLLM Benchmark API provides comprehensive performance evaluation for VLLM-served language models using the official VLLM benchmark_serving.py tool. The benchmarking system supports multi-scenario configuration, detailed statistical analysis, and result standardization for integration with the broader evaluation pipeline.
Benchmark Overview
VLLM Benchmark provides:
- Multi-Scenario Testing: JSON-based configuration for complex benchmark scenarios
- Performance Metrics: TTFT, TPOT, ITL, E2EL with detailed percentile analysis
- Concurrency Testing: Configurable concurrent request handling evaluation
- Result Analysis: Sophisticated statistical analysis and result standardization
- Integration Support: Standardized output for pipeline integration
Prerequisites¶
Requirements
Before running VLLM benchmarks, ensure:
- VLLM server is running and accessible
- Docker is installed for containerized benchmarking
- Sufficient system resources for load testing
- Network connectivity between benchmark client and VLLM server
Performance Metrics Explained¶
Key Performance Indicators¶
Understanding Metrics
Throughput Metrics:
- Request Throughput: Completed requests per second
- Output Token Throughput: Generated tokens per second
- Total Token Throughput: Combined input/output tokens per second
Latency Metrics:
- TTFT (Time to First Token): Latency until first token generation
- TPOT (Time per Output Token): Average time between subsequent tokens
- ITL (Inter-token Latency): Real-time token generation intervals
- E2EL (End-to-End Latency): Complete request processing time
Configuration System¶
VLLM benchmarks use a JSON-based configuration system located at configs/vllm_benchmark.json:
{
"name": "VLLM Performance Benchmark",
"version": "2.0.0",
"description": "VLLM 서빙 성능 측정 벤치마크",
"defaults": {
"backend": "openai-chat",
"endpoint_path": "/v1/chat/completions",
"dataset_type": "random",
"percentile_metrics": "ttft,tpot,itl,e2el",
"metric_percentiles": "25,50,75,90,95,99"
},
"scenarios": [
{
"name": "basic_performance",
"description": "기본 성능 측정",
"max_concurrency": 1,
"random_input_len": 1024,
"random_output_len": 1024
}
],
"thresholds": {
"ttft_p95_ms": 200,
"tpot_mean_ms": 50,
"throughput_min": 10,
"success_rate": 0.95
}
}
Script-Based Execution¶
The primary benchmarking tool is eval/vllm-benchmark/run_vllm_benchmark.sh:
Basic Usage¶
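In the simplest case the script runs with its defaults. A minimal sketch, assuming you execute it from the eval/vllm-benchmark directory and keep the default environment variable values described in the next section:
# Run with defaults (endpoint http://localhost:8000, config configs/vllm_benchmark.json)
cd eval/vllm-benchmark
./run_vllm_benchmark.sh

# Or point the script at a specific server and model
MODEL_ENDPOINT="http://localhost:8000" \
MODEL_NAME="Qwen/Qwen2-0.5B" \
./run_vllm_benchmark.sh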
Environment Variables¶
- MODEL_ENDPOINT: VLLM server base URL (default: http://localhost:8000)
- CONFIG_PATH: Configuration file path (default: configs/vllm_benchmark.json)
- OUTPUT_DIR: Results output directory (default: /app/results)
- REQUEST_RATE: Request rate for load testing (default: 1.0)
- MODEL_NAME: Model identifier
- SERVED_MODEL_NAME: Model name as configured in the VLLM server
- TOKENIZER: Tokenizer specification (default: gpt2)
Advanced Configuration¶
MODEL_ENDPOINT="http://localhost:8080" \
CONFIG_PATH="configs/custom_benchmark.json" \
OUTPUT_DIR="./benchmark_results" \
MODEL_NAME="Qwen/Qwen2-0.5B" \
./run_vllm_benchmark.sh
Docker-Based Benchmarking¶
Building the Benchmark Image¶
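The run examples below use an image tagged vllm-benchmark:latest. A minimal build sketch, assuming a Dockerfile lives in eval/vllm-benchmark:
# Build the benchmark image (assumes a Dockerfile in eval/vllm-benchmark)
docker build -t vllm-benchmark:latest eval/vllm-benchmark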
Running Containerized Benchmarks¶
docker run --rm \
--network host \
-e MODEL_ENDPOINT="http://localhost:8080" \
-e MODEL_NAME="Qwen/Qwen2-0.5B" \
-e TOKENIZER="gpt2" \
-e OUTPUT_DIR="/app/results" \
-e PARSED_DIR="/app/parsed" \
-v $(pwd)/results:/app/results \
-v $(pwd)/parsed:/app/parsed \
vllm-benchmark:latest
Model Endpoint Validation
The script automatically validates the model endpoint and extracts the model ID from /v1/models before running benchmarks.
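You can reproduce this check by hand. A quick sketch that queries the OpenAI-compatible /v1/models endpoint (assumes jq is installed; drop the pipe to see the raw JSON):
# Manually confirm the server lists the expected model ID
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'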
Scenario Configuration¶
Scenario Parameters¶
Each scenario in the configuration supports the following parameters:
Core Parameters:
- name: Unique scenario identifier
- description: Human-readable scenario description
- max_concurrency: Maximum concurrent requests
- num_prompts: Total number of prompts to process
- random_input_len: Input token length for the random dataset
- random_output_len: Expected output token length
Advanced Parameters:
- backend: Backend type (openai-chat, openai, tgi)
- endpoint_path: API endpoint path (e.g., /v1/chat/completions)
- dataset_type: Dataset type (random, synthetic)
- percentile_metrics: Metrics to calculate percentiles for
- metric_percentiles: Percentile values to compute
Multi-Scenario Example¶
{
"scenarios": [
{
"name": "light_load",
"description": "Light load testing",
"max_concurrency": 1,
"num_prompts": 10,
"random_input_len": 512,
"random_output_len": 128
},
{
"name": "heavy_load",
"description": "Heavy load testing",
"max_concurrency": 8,
"num_prompts": 100,
"random_input_len": 2048,
"random_output_len": 512
}
]
}
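To run a custom scenario set like this, save it to its own file and point the script at it through CONFIG_PATH. The file name below is only an illustrative placeholder:
# configs/multi_scenario.json is a hypothetical path for the example above
CONFIG_PATH="configs/multi_scenario.json" \
MODEL_ENDPOINT="http://localhost:8080" \
./run_vllm_benchmark.sh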
Sample Benchmark Output¶
Performance Results Analysis¶
When you run the benchmark, you'll see comprehensive performance metrics:
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf RPS.
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████| 1/1 [00:15<00:00, 15.13s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Maximum request concurrency: 1
Benchmark duration (s): 15.13
Total input tokens: 1024
Total generated tokens: 1024
Request throughput (req/s): 0.07
Output token throughput (tok/s): 67.69
Total Token throughput (tok/s): 135.37
---------------Time to First Token----------------
Mean TTFT (ms): 47.19
Median TTFT (ms): 47.19
P25 TTFT (ms): 47.19
P50 TTFT (ms): 47.19
P75 TTFT (ms): 47.19
P90 TTFT (ms): 47.19
P95 TTFT (ms): 47.19
P99 TTFT (ms): 47.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.74
Median TPOT (ms): 14.74
P25 TPOT (ms): 14.74
P50 TPOT (ms): 14.74
P75 TPOT (ms): 14.74
P90 TPOT (ms): 14.74
P95 TPOT (ms): 14.74
P99 TPOT (ms): 14.74
---------------Inter-token Latency----------------
Mean ITL (ms): 14.73
Median ITL (ms): 14.62
P25 ITL (ms): 14.33
P50 ITL (ms): 14.62
P75 ITL (ms): 15.01
P90 ITL (ms): 15.71
P95 ITL (ms): 16.34
P99 ITL (ms): 20.46
----------------End-to-end Latency----------------
Mean E2EL (ms): 15126.62
Median E2EL (ms): 15126.62
P25 E2EL (ms): 15126.62
P50 E2EL (ms): 15126.62
P75 E2EL (ms): 15126.62
P90 E2EL (ms): 15126.62
P95 E2EL (ms): 15126.62
P99 E2EL (ms): 15126.62
==================================================
Result Interpretation¶
Performance Analysis
Key Insights from Results (a quick arithmetic check follows after this list):
- Throughput: 67.69 tokens/sec output rate indicates model generation speed
- TTFT: 47.19ms first token latency shows response initiation speed
- TPOT: 14.74ms per token indicates consistent generation performance
- Percentiles: P90, P95, P99 values help identify tail latency behavior
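As a rough sanity check, the headline numbers can be re-derived from the raw counts in the summary above; the sketch below does the arithmetic (small differences come from the rounded duration shown in the log):
# Output token throughput = generated tokens / duration
awk 'BEGIN { printf "output tok/s: %.2f\n", 1024 / 15.13 }'                # ~67.68
# Total token throughput = (input + output tokens) / duration
awk 'BEGIN { printf "total tok/s:  %.2f\n", (1024 + 1024) / 15.13 }'       # ~135.36
# TPOT = (E2EL - TTFT) / (output tokens - 1)
awk 'BEGIN { printf "TPOT (ms):    %.2f\n", (15126.62 - 47.19) / (1024 - 1) }'  # ~14.74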
Load Testing Scenarios¶
Single Client Performance¶
The basic benchmark runs single-client scenarios to establish baseline performance characteristics.
Multi-Client Concurrency¶
Scaling Testing
For production readiness assessment:
- Start with single-client baseline
- Gradually increase concurrency levels (see the sketch after this list)
- Monitor performance degradation patterns
- Identify optimal concurrency for your hardware
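A hedged sketch of such a progressive sweep, reusing the CONCURRENCY_LEVELS variable accepted by the containerized benchmark (see the stress test below) with moderate levels:
docker run --rm \
  --network host \
  -e MODEL_ENDPOINT="http://localhost:8080" \
  -e MODEL_NAME="Qwen/Qwen2-0.5B" \
  -e TOKENIZER="gpt2" \
  -e CONCURRENCY_LEVELS="1,2,4,8" \
  -v $(pwd)/results:/app/results \
  -v $(pwd)/parsed:/app/parsed \
  vllm-benchmark:latest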
Stress Testing¶
# High-concurrency stress test
docker run --rm \
--network host \
-e MODEL_ENDPOINT="http://localhost:8080" \
-e MODEL_NAME="Qwen/Qwen2-0.5B" \
-e TOKENIZER="gpt2" \
-e OUTPUT_DIR="/results" \
-e PARSED_DIR="/parsed" \
-e CONCURRENCY_LEVELS="50,100" \
-e DURATION="600" \
-v $(pwd)/results:/results \
-v $(pwd)/parsed:/parsed \
vllm-benchmark:latest
Result Analysis and Processing¶
Automated Analysis¶
The benchmark system includes detailed analysis capabilities via eval/vllm-benchmark/analyze_vllm_results.py:
# Analyze specific result file
python eval/vllm-benchmark/analyze_vllm_results.py /path/to/results.json
# Analyze entire results directory
python eval/vllm-benchmark/analyze_vllm_results.py /path/to/results/directory
Output Directory Structure¶
benchmark_results/
├── vllm_benchmark_main_TIMESTAMP.log # Main execution log
├── scenario_NAME_TIMESTAMP.log # Per-scenario logs
├── scenario_NAME_TIMESTAMP/ # Raw VLLM results
│ └── benchmark_results.json
└── parsed/ # Standardized results
└── SCENARIO_NAME_TIMESTAMP_standardized.json
Result Standardization¶
Results are automatically standardized for pipeline integration:
python scripts/standardize_vllm_benchmark.py \
results.json \
--output_file standardized.json \
--task_name "performance_test" \
--config_path configs/vllm_benchmark.json
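To get a quick look at what a standardized file contains, you can list its top-level fields. The exact schema depends on the standardizer, so the sketch below just prints whatever keys are present (assumes jq is installed):
# Peek at the top-level structure of standardized results
jq 'keys' benchmark_results/parsed/*_standardized.json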
Analysis Features
The analysis system provides:
- Statistical Summary: Mean, median, percentiles for all metrics
- Performance Thresholds: Automatic threshold validation
- Trend Analysis: Multi-run performance comparison
- Export Formats: JSON, CSV output for further processing
Troubleshooting¶
Common Issues¶
Troubleshooting Guide
Connection Issues:
- Verify VLLM server is running and accessible
- Check network connectivity with a curl test
- Ensure firewall rules allow benchmark traffic
Performance Issues:
- Monitor GPU/CPU utilization during benchmarks
- Check for resource contention with other processes
- Verify model loading completed successfully
Result Inconsistencies:
- Run multiple benchmark iterations for statistical stability
- Account for system warm-up time in measurements
- Monitor system load during benchmark execution
Integration with VLLM Eval Pipeline¶
Pipeline Integration¶
VLLM benchmark results integrate seamlessly with the evaluation aggregation system:
# Include VLLM benchmark results in aggregation
python scripts/aggregate_metrics.py \
--include-vllm-benchmark \
--results-dir eval/vllm-benchmark/parsed
Performance Monitoring¶
Monitoring Integration
Standardized results support:
- ClickHouse analytics database storage
- Grafana dashboard visualization
- Prometheus metrics export
- Automated regression detection
Best Practices¶
Benchmark Configuration¶
Optimization Tips
For Accurate Results:
- Run multiple iterations for statistical stability
- Allow adequate warm-up time before measurements
- Monitor system resources during benchmarking
- Use consistent hardware configurations
For Production Testing:
- Configure scenarios matching production workloads
- Test various concurrency levels progressively
- Establish baseline performance thresholds
- Automate benchmark execution in CI/CD pipelines
Troubleshooting¶
Common Issues
Configuration Issues:
- Verify JSON configuration syntax
- Ensure model endpoint accessibility
- Check parameter compatibility with VLLM version
Performance Issues:
- Monitor GPU memory utilization
- Check for CPU bottlenecks during high concurrency
- Validate network connectivity stability
Result Issues:
- Ensure sufficient disk space for results
- Verify write permissions to output directories
- Check for incomplete benchmark runs in logs
Next Steps¶
Advanced Usage
After successful benchmarking:
- Analyze results using the provided analysis tools
- Configure performance thresholds for automated validation
- Integrate with monitoring and alerting systems
- Create custom scenarios for specific use cases
- Set up automated performance regression testing