# Deepeval API
Deepeval is a comprehensive evaluation framework specifically designed for Large Language Models (LLMs), providing pytest-style unit testing capabilities for AI applications. It offers 30+ built-in metrics and supports both RAG (Retrieval-Augmented Generation) and end-to-end evaluation scenarios.
**Framework Overview**

Deepeval provides:

- **30+ Built-in Metrics**: Comprehensive evaluation across multiple dimensions
- **RAG Evaluation**: Specialized metrics for retrieval-augmented generation
- **Custom Metrics**: Extensible framework for domain-specific evaluations
- **Pytest Integration**: Familiar testing patterns for AI applications
## Prerequisites

**Setup Requirements**

Before using Deepeval, ensure you have:

- A Python 3.8+ environment
- Required dependencies installed via `requirements-dev.txt`
- A running VLLM server endpoint
- Access to evaluation datasets
## Installation & Setup

### Install Dependencies
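The project's dependencies are listed in `requirements-dev.txt` (see Prerequisites above); a minimal setup, assuming a standard pip workflow, looks like this:

```bash
# Optional: create an isolated environment
python -m venv .venv
source .venv/bin/activate

# Install Deepeval and the project's development dependencies
pip install -r requirements-dev.txt
```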
### Configuration

Deepeval uses configuration files located in `configs/`:

- `deepeval.yaml`: Main configuration for evaluation parameters
- `pytest.ini`: Pytest-specific settings
**Configuration Files**

Review and customize the configuration files in the `configs/` directory to match your evaluation requirements and model specifications.
## Core Evaluation Metrics
Deepeval provides several categories of evaluation metrics:
### RAG-Specific Metrics

**RAG Evaluation**
For Retrieval-Augmented Generation systems, Deepeval offers specialized metrics, illustrated in the sketch after this list:

- **Context Relevance**: Measures how relevant the retrieved context is to the query
- **Hallucination Detection**: Identifies generated content not supported by the context
- **RAG Precision**: Evaluates the accuracy of responses given the retrieved context
- **Context Utilization**: Assesses how effectively the model uses the provided context
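As a concrete illustration, the snippet below uses `ContextualRelevancyMetric` and `HallucinationMetric` from the public deepeval package (this repository's wrappers may use different names); note that these metrics are LLM-judged, so an evaluation model must be configured:

```python
from deepeval.metrics import ContextualRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# A test case that carries retrieved context alongside the input and output
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    context=["Paris is the capital of France."],
    retrieval_context=["Paris is the capital of France."],
)

# Context relevance: how relevant is the retrieved context to the query?
relevancy = ContextualRelevancyMetric(threshold=0.7)
relevancy.measure(test_case)

# Hallucination: is the output supported by the provided context?
hallucination = HallucinationMetric(threshold=0.3)
hallucination.measure(test_case)
```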
### General LLM Metrics

- **Semantic Similarity**: Compares generated text with reference answers
- **Toxicity Detection**: Identifies harmful or inappropriate content
- **Bias Evaluation**: Measures potential biases in model outputs
- **Factual Consistency**: Verifies the factual accuracy of generated content
## Running Deepeval Tests

### Basic Test Execution
Navigate to the deepeval tests directory and run evaluations:
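A minimal invocation, assuming the `eval/deepeval_tests/` layout described below and a reachable VLLM endpoint, looks like this:

```bash
cd eval/deepeval_tests

# Run the full Deepeval suite with verbose output
pytest -v
```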
### Custom Metric Testing
Run tests with custom metrics:
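The exact selector depends on how the tests are named; one common pattern is pytest's keyword filter:

```bash
# Run only tests whose names mention custom metrics (name filter is illustrative)
pytest -k "custom" -v
```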
**Test Structure**

Deepeval tests are located in `eval/deepeval_tests/` and follow pytest conventions:
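The layout below is indicative; apart from `test_llm_rag.py`, `metrics/`, and `results/` (all referenced elsewhere on this page), file names may vary:

```
eval/deepeval_tests/
├── test_llm_rag.py   # RAG-focused test cases
├── metrics/          # Custom metric implementations
└── results/          # Generated evaluation reports
```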
### Running Specific Metrics
Execute targeted evaluations:
```bash
# Run only RAG precision tests
pytest test_llm_rag.py::test_rag_precision -v

# Run hallucination detection
pytest test_llm_rag.py::test_hallucination_detection -v
```
## Custom Metrics Development

### Creating Custom Metrics

Deepeval supports custom metric development. Create new metrics in the `metrics/` directory:
```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class CustomRelevanceMetric(BaseMetric):
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.score = None

    def measure(self, test_case: LLMTestCase) -> float:
        # Implement your custom evaluation logic here and store the result,
        # so that is_successful() can read it afterwards
        self.score = 1.0  # placeholder: replace with real scoring
        return self.score

    def is_successful(self) -> bool:
        return self.score is not None and self.score >= self.threshold
```
**Metric Development**

When developing custom metrics:

- Inherit from the `BaseMetric` base class
- Implement the `measure()` method for scoring logic
- Define success criteria via `is_successful()`
- Add comprehensive error handling
### Registering Custom Metrics
Register your custom metrics in the test files:
```python
from deepeval.test_case import LLMTestCase

from metrics.custom_relevance import CustomRelevanceMetric


def test_custom_relevance():
    metric = CustomRelevanceMetric(threshold=0.8)
    test_case = LLMTestCase(
        input="Your test input",
        actual_output="Model's actual output",
        expected_output="Expected output",
    )
    metric.measure(test_case)
    assert metric.is_successful()
```
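Alternatively, the public deepeval package ships an `assert_test` helper that runs each metric's `measure()` and fails the test if any metric is unsuccessful; the sketch below uses the `CustomRelevanceMetric` from above:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase

from metrics.custom_relevance import CustomRelevanceMetric


def test_custom_relevance_with_helper():
    test_case = LLMTestCase(
        input="Your test input",
        actual_output="Model's actual output",
        expected_output="Expected output",
    )
    # Measures each metric and fails the test if is_successful() is False
    assert_test(test_case, [CustomRelevanceMetric(threshold=0.8)])
```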
## Docker Integration

### Building Deepeval Container
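A typical build, assuming a Dockerfile for the evaluation runner exists in the repository (the path below is illustrative), produces the `deepeval-runner:latest` image used in the run command that follows:

```bash
# Build the evaluation runner image (Dockerfile path is illustrative)
docker build -t deepeval-runner:latest -f docker/Dockerfile.deepeval .
```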
### Running Containerized Evaluations
```bash
docker run --rm \
  --network host \
  -e VLLM_ENDPOINT="http://localhost:8000/v1" \
  -e EVALUATION_CONFIG='{"metrics": ["rag_precision", "hallucination"], "threshold": 0.7}' \
  -v $(pwd)/eval/deepeval_tests/results:/workspace/results \
  deepeval-runner:latest
```
**Docker Networking**

Use `--network host` to ensure the container can access your local VLLM server endpoint.
## Advanced Configuration

### Evaluation Parameters

Configure evaluation behavior in `deepeval.yaml`:
```yaml
evaluation:
  model_endpoint: "http://localhost:8000/v1"
  batch_size: 10
  timeout: 300

metrics:
  rag_precision:
    threshold: 0.8
    strict_mode: true
  hallucination:
    threshold: 0.3
    detection_model: "gpt-4"
```
### Parallel Execution
Run evaluations in parallel for improved performance:
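Parallelism comes from the pytest-xdist plugin, which provides the `-n` flag mentioned in the optimization notes below:

```bash
# Distribute tests across 4 worker processes (requires pytest-xdist)
pytest -n 4 -v
```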
**Performance Optimization**

For large-scale evaluations:

- Use parallel execution with the `-n` flag
- Implement batching for API calls
- Configure appropriate timeouts
- Monitor resource usage during evaluation
## Output & Results

### Test Results Structure
Deepeval generates comprehensive evaluation reports:
```
eval/deepeval_tests/results/
├── test_results.json        # Aggregated test results
├── detailed_metrics.json    # Per-metric breakdown
├── failed_cases.json        # Failed test cases for analysis
└── performance_stats.json   # Execution performance metrics
```
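For a quick look at any of these reports from the command line, a JSON tool such as `jq` works well; the exact field names depend on the report schema:

```bash
# Pretty-print the aggregated results
jq '.' eval/deepeval_tests/results/test_results.json
```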
### Interpreting Results

**Result Analysis**

Deepeval results include:

- **Overall Score**: Aggregated performance across all metrics
- **Per-Metric Scores**: Individual metric performance
- **Pass/Fail Status**: Binary success indicators
- **Confidence Intervals**: Statistical confidence measures
- **Error Analysis**: Detailed failure case information
## Integration with VLLM Eval Pipeline
Deepeval integrates seamlessly with the broader VLLM evaluation ecosystem:
```bash
# Integration with aggregation pipeline
python scripts/aggregate_metrics.py --include-deepeval --results-dir eval/deepeval_tests/results
```
**Pipeline Integration**

Deepeval results are automatically compatible with the VLLM evaluation aggregation system and can be included in comprehensive model performance reports.