
Deepeval API

Deepeval is a comprehensive evaluation framework specifically designed for Large Language Models (LLMs), providing pytest-style unit testing capabilities for AI applications. It offers 30+ built-in metrics and supports both RAG (Retrieval-Augmented Generation) and end-to-end evaluation scenarios.

Framework Overview

Deepeval provides:

  • 30+ Built-in Metrics: Comprehensive evaluation across multiple dimensions
  • RAG Evaluation: Specialized metrics for retrieval-augmented generation
  • Custom Metrics: Extensible framework for domain-specific evaluations
  • Pytest Integration: Familiar testing patterns for AI applications (illustrated in the sketch below)
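
As a quick illustration of the pytest-style workflow, the sketch below uses Deepeval's assert_test helper with one built-in metric; the metric choice and sample content are assumptions, not code from this repository:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does VLLM provide?",
        actual_output="VLLM provides high-throughput serving for large language models.",
    )
    # Fails the pytest test if any metric scores below its threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])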

Prerequisites

Setup Requirements

Before using Deepeval, ensure you have:

  • Python 3.8+ environment
  • Required dependencies installed via requirements-dev.txt
  • A running VLLM server endpoint
  • Access to evaluation datasets

Installation & Setup

Install Dependencies

pip install -r requirements-dev.txt

Configuration

Deepeval uses configuration files located in configs/:

  • deepeval.yaml: Main configuration for evaluation parameters
  • pytest.ini: Pytest-specific settings

Configuration Files

Review and customize the configuration files in the configs/ directory to match your evaluation requirements and model specifications.

Core Evaluation Metrics

Deepeval provides several categories of evaluation metrics:

RAG-Specific Metrics

RAG Evaluation

For Retrieval-Augmented Generation systems, Deepeval offers specialized metrics (see the usage sketch after this list):

  • Context Relevance: Measures how relevant retrieved context is to the query
  • Hallucination Detection: Identifies generated content not supported by context
  • RAG Precision: Evaluates accuracy of responses given the retrieved context
  • Context Utilization: Assesses how effectively the model uses provided context
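
The sketch below shows how such checks might be wired up using Deepeval's built-in ContextualRelevancyMetric and HallucinationMetric; the mapping from the categories above to these classes, and the judge model used to score them (OpenAI by default unless a custom model is configured), are assumptions to verify against your installed version.

from deepeval.metrics import ContextualRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# One RAG test case carrying both the retrieved passages and the
# grounding context used for the hallucination check.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital and largest city of France."],
    context=["Paris is the capital and largest city of France."],
)

# How relevant the retrieved passages are to the query.
relevancy = ContextualRelevancyMetric(threshold=0.7)
relevancy.measure(test_case)
print("contextual relevancy:", relevancy.score)

# Flags generated content that is not supported by the provided context.
hallucination = HallucinationMetric(threshold=0.3)
hallucination.measure(test_case)
print("hallucination:", hallucination.score, hallucination.is_successful())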

General LLM Metrics

  • Semantic Similarity: Compares generated text with reference answers
  • Toxicity Detection: Identifies harmful or inappropriate content
  • Bias Evaluation: Measures potential biases in model outputs
  • Factual Consistency: Verifies factual accuracy of generated content
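
As a hedged sketch, the bias and toxicity checks roughly map onto Deepeval's built-in BiasMetric and ToxicityMetric (class names for the other categories vary by version); both rely on an LLM judge, so an evaluation model must be reachable:

from deepeval.metrics import BiasMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the candidate's qualifications.",
    actual_output="The candidate has ten years of relevant engineering experience.",
)

# For these metrics the threshold acts as an upper bound: lower scores pass.
for metric in (BiasMetric(threshold=0.5), ToxicityMetric(threshold=0.5)):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.is_successful())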

Running Deepeval Tests

Basic Test Execution

Navigate to the deepeval tests directory and run evaluations:

cd eval/deepeval_tests
pytest test_llm_rag.py -v

Custom Metric Testing

Run tests with custom metrics:

pytest test_custom_metric.py -v --model-endpoint http://localhost:8000/v1
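
Note that --model-endpoint is not a built-in pytest flag; it would normally be registered in a conftest.py along the lines of the hypothetical sketch below (the option and fixture names are illustrative, not necessarily the repository's actual code):

# conftest.py (hypothetical sketch)
import pytest

def pytest_addoption(parser):
    # Register the custom command-line flag used above.
    parser.addoption(
        "--model-endpoint",
        default="http://localhost:8000/v1",
        help="Base URL of the OpenAI-compatible VLLM server",
    )

@pytest.fixture
def model_endpoint(request):
    # Expose the flag to individual tests as a fixture.
    return request.config.getoption("--model-endpoint")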

Test Structure

Deepeval tests are located in eval/deepeval_tests/ and follow pytest conventions:

eval/deepeval_tests/
├── __init__.py
├── metrics/              # Custom metric definitions
├── test_custom_metric.py # Custom metric tests
└── test_llm_rag.py      # RAG evaluation tests

Running Specific Metrics

Execute targeted evaluations:

# Run only RAG precision tests
pytest test_llm_rag.py::test_rag_precision -v

# Run hallucination detection
pytest test_llm_rag.py::test_hallucination_detection -v

Custom Metrics Development

Creating Custom Metrics

Deepeval supports custom metric development. Create new metrics in the metrics/ directory:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomRelevanceMetric(BaseMetric):
    # Depending on your deepeval version, BaseMetric may also expect
    # an async a_measure() method and a __name__ property.
    def __init__(self, threshold: float = 0.7):
        super().__init__()
        self.threshold = threshold
        self.score = 0.0

    def measure(self, test_case: LLMTestCase) -> float:
        # Implement your custom evaluation logic here and store the
        # result so is_successful() can read it.
        self.score = 0.0  # placeholder score; replace with real scoring
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.score >= self.threshold

Metric Development

When developing custom metrics:

  • Inherit from BaseMetric base class
  • Implement measure() method for scoring logic
  • Define success criteria via is_successful()
  • Add comprehensive error handling

Registering Custom Metrics

Register your custom metrics in the test files:

from deepeval.test_case import LLMTestCase

from metrics.custom_relevance import CustomRelevanceMetric

def test_custom_relevance():
    metric = CustomRelevanceMetric(threshold=0.8)
    test_case = LLMTestCase(
        input="Your test input",
        actual_output="Model's actual output",
        expected_output="Expected output"
    )
    metric.measure(test_case)
    assert metric.is_successful()

Docker Integration

Building Deepeval Container

docker build -f docker/deepeval.Dockerfile -t deepeval-runner:latest .

Running Containerized Evaluations

docker run --rm \
    --network host \
    -e VLLM_ENDPOINT="http://localhost:8000/v1" \
    -e EVALUATION_CONFIG='{"metrics": ["rag_precision", "hallucination"], "threshold": 0.7}' \
    -v $(pwd)/eval/deepeval_tests/results:/workspace/results \
    deepeval-runner:latest

Docker Networking

Use --network host so the container can reach a VLLM server running on the host. Host networking is only available on Linux; with Docker Desktop on macOS or Windows, drop the flag and point VLLM_ENDPOINT at http://host.docker.internal:8000/v1 instead.

Advanced Configuration

Evaluation Parameters

Configure evaluation behavior in deepeval.yaml:

evaluation:
  model_endpoint: "http://localhost:8000/v1"
  batch_size: 10
  timeout: 300
  metrics:
    rag_precision:
      threshold: 0.8
      strict_mode: true
    hallucination:
      threshold: 0.3
      detection_model: "gpt-4"
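
How these values are consumed is project-specific; as an illustrative sketch (not the repository's actual loader), they could be read from configs/deepeval.yaml with PyYAML:

import yaml

with open("configs/deepeval.yaml") as f:
    config = yaml.safe_load(f)

evaluation = config["evaluation"]
endpoint = evaluation["model_endpoint"]                              # "http://localhost:8000/v1"
rag_threshold = evaluation["metrics"]["rag_precision"]["threshold"]  # 0.8
print(endpoint, rag_threshold)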

Parallel Execution

Run evaluations in parallel (requires the pytest-xdist plugin) for improved performance:

pytest test_llm_rag.py -n 4 --dist=loadfile

Performance Optimization

For large-scale evaluations:

  • Use parallel execution with the -n flag
  • Implement batching for API calls (see the sketch after this list)
  • Configure appropriate timeouts
  • Monitor resource usage during evaluation
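
For the batching point above, one hedged option is Deepeval's evaluate() helper, which scores a list of test cases against shared metric instances in a single run; the test cases and metric here are illustrative:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Score a batch of cases in one call instead of measuring them one by one.
test_cases = [
    LLMTestCase(input=q, actual_output=a)
    for q, a in [
        ("What is VLLM?", "VLLM is a high-throughput LLM serving engine."),
        ("What does Deepeval do?", "Deepeval runs pytest-style LLM evaluations."),
    ]
]

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])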

Output & Results

Test Results Structure

Deepeval generates comprehensive evaluation reports:

eval/deepeval_tests/results/
├── test_results.json         # Aggregated test results
├── detailed_metrics.json     # Per-metric breakdown
├── failed_cases.json         # Failed test cases for analysis
└── performance_stats.json    # Execution performance metrics

Interpreting Results

Result Analysis

Deepeval results include:

  • Overall Score: Aggregated performance across all metrics
  • Per-Metric Scores: Individual metric performance
  • Pass/Fail Status: Binary success indicators
  • Confidence Intervals: Statistical confidence measures
  • Error Analysis: Detailed failure case information
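
The exact schema of these files depends on how the tests are run; the snippet below is a hypothetical sketch of pulling per-metric scores out of test_results.json, with the key names treated as assumptions to adapt to your actual output:

import json
from pathlib import Path

results = json.loads(
    Path("eval/deepeval_tests/results/test_results.json").read_text()
)

# Hypothetical keys; adjust to the structure your run actually emits.
for entry in results.get("metrics", []):
    print(entry.get("name"), entry.get("score"), entry.get("passed"))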

Integration with VLLM Eval Pipeline

Deepeval integrates seamlessly with the broader VLLM evaluation ecosystem:

# Integration with aggregation pipeline
python scripts/aggregate_metrics.py --include-deepeval --results-dir eval/deepeval_tests/results

Pipeline Integration

Deepeval results are automatically compatible with the VLLM evaluation aggregation system and can be included in comprehensive model performance reports.