
πŸš€ Local VLLM Evaluation Guide for macOS OrbStack

This document is a step-by-step guide to running a VLLM model in the macOS OrbStack environment and performing LLM evaluation against it.

πŸ“‹ Table of Contents

  1. πŸ›  Prerequisites
  2. πŸ”§ Installing and Configuring OrbStack
  3. πŸ€– Running the VLLM Server
  4. πŸ”¬ Setting Up the Evaluation Environment
  5. πŸ§ͺ Running Deepeval
  6. ⚑ Running Evalchemy Benchmarks
  7. πŸ“Š Results Analysis and Visualization
  8. πŸ”§ Integrated Run Script
  9. 🚨 Troubleshooting
  10. 🎯 Run Summary

πŸ›  Prerequisites

System Requirements

  β€’ macOS: 13.0 or later (Apple Silicon recommended)
  β€’ RAM: 16GB minimum, 32GB recommended
  β€’ Disk: at least 20GB of free space
  β€’ GPU: Apple Silicon GPU or NVIDIA GPU (optional; note that Docker containers on macOS cannot access the Apple Silicon GPU, so the --gpus flag only takes effect on NVIDIA setups)

Installing Required Tools

# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install the required tools
brew install python@3.11 git curl jq

πŸ”§ Installing and Configuring OrbStack

1. Install OrbStack

# Install OrbStack (recommended instead of Docker Desktop)
brew install --cask orbstack

# Start OrbStack
open -a OrbStack

# Wait until OrbStack is ready
while ! docker info > /dev/null 2>&1; do
    echo "Waiting for OrbStack to start..."
    sleep 3
done
echo "βœ… OrbStack started successfully."

2. ν”„λ‘œμ νŠΈ 클둠 및 μ„€μ •

# ν”„λ‘œμ νŠΈ 클둠
git clone https://github.com/your-org/vllm-eval.git
cd vllm-eval

# Python κ°€μƒν™˜κ²½ 생성
python3.11 -m venv venv
source venv/bin/activate

# ν•„μˆ˜ νŒ¨ν‚€μ§€ μ„€μΉ˜
pip install --upgrade pip
pip install -r requirements-dev.txt
pip install -r requirements-deepeval.txt
pip install -r requirements-evalchemy.txt
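
As a quick sanity check, you can confirm the main evaluation packages were installed (the exact package names depend on what the repository's requirements files pin, so treat this as an assumption):

# Roughly verify that deepeval, lm-eval, and the OpenAI client are present
pip list | grep -iE 'deepeval|lm.eval|openai' || echo "Some expected packages appear to be missing"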

πŸ€– Running the VLLM Server

1. Download and Run the Model

# Run the VLLM server (example: Qwen2-7B-Instruct, exposed under the alias "qwen3-8b" via --served-model-name)
docker run -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2-7B-Instruct" \
  --served-model-name "qwen3-8b" \
  --host 0.0.0.0 \
  --port 8000

# μ„œλ²„ μƒνƒœ 확인
docker logs vllm-server

# API ν…ŒμŠ€νŠΈ
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "top_p": 0.95,
    "stream": false
  }'

2. λͺ¨λΈ μ„œλ²„ ν—¬μŠ€μ²΄ν¬

# λͺ¨λΈ λͺ©λ‘ 확인
curl http://localhost:8000/v1/models | jq

# μ„œλ²„ μƒνƒœ 확인
curl http://localhost:8000/health

πŸ”¬ Setting Up the Evaluation Environment

1. Configure Environment Variables

# Create the .env file
cat > .env << 'EOF'
# VLLM model endpoint
VLLM_MODEL_ENDPOINT=http://localhost:8000/v1
MODEL_NAME=qwen3-8b

# Evaluation settings
EVAL_CONFIG_PATH=configs/evalchemy.json
OUTPUT_DIR=./test_results
RUN_ID=local_eval_$(date +%Y%m%d_%H%M%S)

# Logging settings
LOG_LEVEL=INFO
PYTHONPATH=.
EOF

# Load the environment variables (set -a exports them so the Python scripts below can read them)
set -a && source .env && set +a

2. Prepare a Test Dataset

# Create the results directory
mkdir -p test_results

# Create a small test dataset
mkdir -p datasets/raw/local_test_dataset
cat > datasets/raw/local_test_dataset/test.jsonl << 'EOF'
{"input": "ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”?", "expected_output": "ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.", "context": "ν•œκ΅­ 지리에 κ΄€ν•œ μ§ˆλ¬Έμž…λ‹ˆλ‹€."}
{"input": "νŒŒμ΄μ¬μ—μ„œ 리슀트λ₯Ό μ •λ ¬ν•˜λŠ” 방법은?", "expected_output": "νŒŒμ΄μ¬μ—μ„œ 리슀트λ₯Ό μ •λ ¬ν•˜λ €λ©΄ sort() λ©”μ„œλ“œλ‚˜ sorted() ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.", "context": "ν”„λ‘œκ·Έλž˜λ° κ΄€λ ¨ μ§ˆλ¬Έμž…λ‹ˆλ‹€."}
{"input": "μ§€κ΅¬μ˜ λ‘˜λ ˆλŠ” μ–Όλ§ˆλ‚˜ λ©λ‹ˆκΉŒ?", "expected_output": "μ§€κ΅¬μ˜ λ‘˜λ ˆλŠ” μ•½ 40,075kmμž…λ‹ˆλ‹€.", "context": "지ꡬ과학에 κ΄€ν•œ μ§ˆλ¬Έμž…λ‹ˆλ‹€."}
EOF
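
Since jq was installed earlier, you can verify that every line of the dataset parses as valid JSON before running any evaluation:

# Each line must be a valid JSON object; jq exits non-zero on the first malformed line
jq -c . datasets/raw/local_test_dataset/test.jsonl
wc -l datasets/raw/local_test_dataset/test.jsonl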

πŸ§ͺ Running Deepeval

1. Create a Custom Evaluation Script

# Create the local evaluation script
cat > scripts/run_local_deepeval.py << 'EOF'
#!/usr/bin/env python3
"""
둜컬 VLLM μ„œλ²„λ₯Ό μ΄μš©ν•œ Deepeval 평가 슀크립트
"""

import os
import json
import asyncio
from typing import List, Dict, Any
from deepeval import evaluate
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric
)
import openai
import logging

# λ‘œκΉ… μ„€μ •
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class VLLMModel(DeepEvalBaseLLM):
    """VLLM OpenAI ν˜Έν™˜ APIλ₯Ό μœ„ν•œ λͺ¨λΈ 클래슀"""

    def __init__(self, model_name: str = "qwen3-8b", base_url: str = "http://localhost:8000/v1"):
        self.model_name = model_name
        self.client = openai.OpenAI(
            base_url=base_url,
            api_key="dummy"  # VLLMμ—μ„œλŠ” API ν‚€κ°€ ν•„μš”μ—†μŒ
        )

    def load_model(self):
        return self.model_name

    def generate(self, prompt: str, schema: Dict = None) -> str:
        """ν…μŠ€νŠΈ 생성"""
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1,
                max_tokens=512
            )
            return response.choices[0].message.content
        except Exception as e:
            logger.error(f"Generation failed: {e}")
            return ""

    async def a_generate(self, prompt: str, schema: Dict = None) -> str:
        """비동기 ν…μŠ€νŠΈ 생성"""
        return self.generate(prompt, schema)

    def get_model_name(self) -> str:
        return self.model_name

def load_test_dataset(file_path: str) -> List[Dict[str, Any]]:
    """JSONL ν…ŒμŠ€νŠΈ 데이터셋 λ‘œλ“œ"""
    test_cases = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            test_cases.append(json.loads(line.strip()))
    return test_cases

def create_test_cases(dataset: List[Dict], model: VLLMModel) -> List[LLMTestCase]:
    """ν…ŒμŠ€νŠΈ μΌ€μ΄μŠ€ 생성"""
    test_cases = []

    for item in dataset:
        # λͺ¨λΈλ‘œλΆ€ν„° μ‹€μ œ 응닡 생성
        actual_output = model.generate(item["input"])

        test_case = LLMTestCase(
            input=item["input"],
            actual_output=actual_output,
            expected_output=item["expected_output"],
            context=[item.get("context", "")]
        )
        test_cases.append(test_case)
        logger.info(f"Created test case: {item['input'][:50]}...")

    return test_cases

def main():
    """Run the full evaluation."""
    # Initialize the model
    model = VLLMModel()

    # Load the test dataset created earlier in this guide
    dataset_path = "datasets/raw/local_test_dataset/test.jsonl"
    dataset = load_test_dataset(dataset_path)

    # Build the test cases
    test_cases = create_test_cases(dataset, model)

    # Define the evaluation metrics
    metrics = [
        AnswerRelevancyMetric(
            threshold=0.7,
            model=model,
            include_reason=True
        ),
        ContextualRelevancyMetric(
            threshold=0.7,
            model=model,
            include_reason=True
        )
    ]

    # Run the evaluation
    logger.info("Starting evaluation...")
    results = evaluate(
        test_cases=test_cases,
        metrics=metrics,
        print_results=True
    )

    # Save the results
    output_dir = os.getenv("OUTPUT_DIR", "./test_results")
    os.makedirs(output_dir, exist_ok=True)

    results_file = f"{output_dir}/deepeval_results_{os.getenv('RUN_ID', 'local')}.json"
    with open(results_file, 'w', encoding='utf-8') as f:
        json.dump({
            "test_results": [
                {
                    "input": tc.input,
                    "actual_output": tc.actual_output,
                    "expected_output": tc.expected_output,
                    "metrics": {
                        metric.__class__.__name__: {
                            "score": getattr(metric, 'score', None),
                            "threshold": getattr(metric, 'threshold', None),
                            "success": getattr(metric, 'success', None),
                            "reason": getattr(metric, 'reason', None)
                        }
                        for metric in metrics
                    }
                }
                for tc in test_cases
            ]
        }, f, ensure_ascii=False, indent=2)

    logger.info(f"Results saved to: {results_file}")
    return results

if __name__ == "__main__":
    main()
EOF

# Make the script executable
chmod +x scripts/run_local_deepeval.py

2. Run Deepeval

# Run the Deepeval evaluation
python scripts/run_local_deepeval.py

# Inspect the results
ls -la test_results/
cat test_results/deepeval_results_*.json | jq
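
To pull just the per-metric scores out of the results file, a jq query like the one below works; the field names follow the JSON structure that the script above writes:

# Print each test input with its metric names and scores
jq -r '.test_results[] | .input as $q | .metrics | to_entries[] | "\($q) | \(.key): \(.value.score)"' \
  test_results/deepeval_results_*.json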

⚑ Running Evalchemy Benchmarks

1. Local Evalchemy Configuration

# Create a local Evalchemy configuration file (create the directory first in case it does not already exist)
mkdir -p eval/evalchemy/configs
cat > eval/evalchemy/configs/local_eval_config.json << 'EOF'
{
  "benchmarks": {
    "arc_easy": {
      "enabled": true,
      "tasks": ["arc_easy"],
      "num_fewshot": 5,
      "batch_size": 4,
      "limit": 10,
      "description": "ARC Easy 벀치마크 (둜컬 ν…ŒμŠ€νŠΈμš©)",
      "metrics": ["acc", "acc_norm"]
    },
    "hellaswag": {
      "enabled": true,
      "tasks": ["hellaswag"],
      "num_fewshot": 10,
      "batch_size": 4,
      "limit": 10,
      "description": "HellaSwag 벀치마크 (둜컬 ν…ŒμŠ€νŠΈμš©)",
      "metrics": ["acc", "acc_norm"]
    }
  }
}
EOF
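
You can confirm which benchmarks are enabled (and their sample limits) with jq:

# List the enabled benchmarks and their per-task sample limits
jq '.benchmarks | to_entries[] | {benchmark: .key, enabled: .value.enabled, limit: .value.limit}' \
  eval/evalchemy/configs/local_eval_config.json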

2. Local Evalchemy Runner Script

# Create the local Evalchemy runner script
cat > scripts/run_local_evalchemy.py << 'EOF'
#!/usr/bin/env python3
"""
Run Evalchemy benchmarks against a local VLLM server
"""

import os
import json
import subprocess
import logging
from typing import Dict, Any

# Logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def run_evalchemy_benchmark(config_path: str, output_dir: str) -> Dict[str, Any]:
    """Run the Evalchemy benchmarks."""

    # Set environment variables for the subprocess
    env = os.environ.copy()
    env.update({
        "VLLM_MODEL_ENDPOINT": "http://localhost:8000/v1",
        "MODEL_NAME": "qwen3-8b",
        "OUTPUT_DIR": output_dir,
        "EVAL_CONFIG_PATH": config_path,
        # Some lm-eval versions expect an API key even for local servers; a placeholder is enough
        "OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY", "dummy")
    })

    # Build the lm_eval command.
    # arc_easy/hellaswag are loglikelihood tasks, so use the completions endpoint
    # ("local-completions") rather than the chat-completions API, and point the
    # tokenizer at the underlying HF model that the server is actually running.
    cmd = [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", (
            f"base_url=http://localhost:8000/v1/completions,"
            f"model={env['MODEL_NAME']},"
            "tokenizer=Qwen/Qwen2-7B-Instruct"
        ),
        "--tasks", "arc_easy,hellaswag",
        "--num_fewshot", "5",
        "--batch_size", "4",
        "--limit", "10",
        "--output_path", f"{output_dir}/evalchemy_results.json",
        "--log_samples"
    ]

    logger.info(f"Running command: {' '.join(cmd)}")

    try:
        # Run the benchmark
        result = subprocess.run(
            cmd,
            env=env,
            capture_output=True,
            text=True,
            timeout=3600  # 1-hour timeout
        )

        if result.returncode == 0:
            logger.info("Evalchemy benchmark completed successfully")

            # Read the results file
            results_file = f"{output_dir}/evalchemy_results.json"
            if os.path.exists(results_file):
                with open(results_file, 'r') as f:
                    results = json.load(f)
                return results
            else:
                logger.warning("Results file not found")
                return {}
        else:
            logger.error(f"Benchmark failed with return code: {result.returncode}")
            logger.error(f"Error output: {result.stderr}")
            return {}

    except subprocess.TimeoutExpired:
        logger.error("Benchmark timed out")
        return {}
    except Exception as e:
        logger.error(f"Benchmark failed with exception: {e}")
        return {}

def main():
    """Main entry point."""
    config_path = "eval/evalchemy/configs/local_eval_config.json"
    output_dir = os.getenv("OUTPUT_DIR", "./test_results")

    # Create the output directory
    os.makedirs(output_dir, exist_ok=True)

    # Run the benchmark
    results = run_evalchemy_benchmark(config_path, output_dir)

    if results:
        logger.info("Benchmark results:")
        for task, metrics in results.get("results", {}).items():
            logger.info(f"  {task}: {metrics}")
    else:
        logger.error("No results obtained")

if __name__ == "__main__":
    main()
EOF

# Make the script executable
chmod +x scripts/run_local_evalchemy.py

3. Run Evalchemy

# Install lm-evaluation-harness with its API extras (needed for OpenAI-compatible endpoints), if not already present
pip install "lm_eval[api]"

# Run the Evalchemy benchmark
python scripts/run_local_evalchemy.py

# Inspect the results
ls -la test_results/
cat test_results/evalchemy_results.json | jq
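
Assuming lm_eval wrote a single JSON file at the path passed to --output_path (which is what the runner and aggregation scripts expect), the per-task scores live under the top-level "results" key:

# Show only the per-task metrics
jq '.results' test_results/evalchemy_results.json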

πŸ“Š Results Analysis and Visualization

1. Results Aggregation Script

# Create the results aggregation script
cat > scripts/aggregate_local_results.py << 'EOF'
#!/usr/bin/env python3
"""
둜컬 평가 κ²°κ³Ό 집계 및 μ‹œκ°ν™”
"""

import os
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import logging

# λ‘œκΉ… μ„€μ •
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_deepeval_results(results_dir: str) -> dict:
    """Deepeval κ²°κ³Ό λ‘œλ“œ"""
    results = {}
    for file in os.listdir(results_dir):
        if file.startswith("deepeval_results_") and file.endswith(".json"):
            with open(os.path.join(results_dir, file), 'r') as f:
                results[file] = json.load(f)
    return results

def load_evalchemy_results(results_dir: str) -> dict:
    """Evalchemy κ²°κ³Ό λ‘œλ“œ"""
    results = {}
    for file in os.listdir(results_dir):
        if file.startswith("evalchemy_results") and file.endswith(".json"):
            with open(os.path.join(results_dir, file), 'r') as f:
                results[file] = json.load(f)
    return results

def create_summary_report(deepeval_results: dict, evalchemy_results: dict, output_dir: str):
    """톡합 λ³΄κ³ μ„œ 생성"""
    report = {
        "timestamp": datetime.now().isoformat(),
        "model_name": os.getenv("MODEL_NAME", "unknown"),
        "summary": {
            "deepeval": {},
            "evalchemy": {}
        }
    }

    # Summarize Deepeval results
    if deepeval_results:
        for filename, data in deepeval_results.items():
            test_results = data.get("test_results", [])
            if test_results:
                # Collect scores per metric
                metrics_summary = {}
                for result in test_results:
                    for metric_name, metric_data in result.get("metrics", {}).items():
                        if metric_name not in metrics_summary:
                            metrics_summary[metric_name] = []
                        if metric_data.get("score") is not None:
                            metrics_summary[metric_name].append(metric_data["score"])

                # Compute averages
                avg_metrics = {}
                for metric_name, scores in metrics_summary.items():
                    if scores:
                        avg_metrics[metric_name] = {
                            "average_score": sum(scores) / len(scores),
                            "count": len(scores)
                        }

                report["summary"]["deepeval"][filename] = avg_metrics

    # Summarize Evalchemy results
    if evalchemy_results:
        for filename, data in evalchemy_results.items():
            results = data.get("results", {})
            summary = {}
            for task, metrics in results.items():
                # Newer lm-eval versions suffix metric keys with the filter name (e.g. "acc,none")
                summary[task] = {
                    "accuracy": metrics.get("acc", metrics.get("acc,none", 0)),
                    "normalized_accuracy": metrics.get("acc_norm", metrics.get("acc_norm,none", 0))
                }
            report["summary"]["evalchemy"][filename] = summary

    # Save the report
    report_file = f"{output_dir}/evaluation_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(report_file, 'w', encoding='utf-8') as f:
        json.dump(report, f, ensure_ascii=False, indent=2)

    logger.info(f"Summary report saved to: {report_file}")
    return report

def create_visualizations(report: dict, output_dir: str):
    """κ²°κ³Ό μ‹œκ°ν™”"""
    try:
        import matplotlib.pyplot as plt
        import seaborn as sns

        # μŠ€νƒ€μΌ μ„€μ •
        plt.style.use('seaborn-v0_8')
        sns.set_palette("husl")

        # Deepeval κ²°κ³Ό μ‹œκ°ν™”
        deepeval_data = report["summary"]["deepeval"]
        if deepeval_data:
            fig, axes = plt.subplots(2, 2, figsize=(15, 10))
            fig.suptitle(f'Local VLLM Evaluation Results - {report["model_name"]}', fontsize=16)

            # λ©”νŠΈλ¦­λ³„ 점수 μ‹œκ°ν™”
            all_scores = []
            all_metrics = []

            for filename, metrics in deepeval_data.items():
                for metric_name, metric_data in metrics.items():
                    all_scores.append(metric_data["average_score"])
                    all_metrics.append(metric_name.replace("Metric", ""))

            if all_scores:
                axes[0, 0].bar(all_metrics, all_scores)
                axes[0, 0].set_title('Deepeval Metrics Scores')
                axes[0, 0].set_ylabel('Score')
                axes[0, 0].tick_params(axis='x', rotation=45)

        # Evalchemy κ²°κ³Ό μ‹œκ°ν™”
        evalchemy_data = report["summary"]["evalchemy"]
        if evalchemy_data:
            tasks = []
            accuracies = []

            for filename, results in evalchemy_data.items():
                for task, metrics in results.items():
                    tasks.append(task)
                    accuracies.append(metrics["accuracy"])

            if tasks:
                axes[0, 1].bar(tasks, accuracies)
                axes[0, 1].set_title('Evalchemy Benchmark Accuracies')
                axes[0, 1].set_ylabel('Accuracy')
                axes[0, 1].tick_params(axis='x', rotation=45)

        plt.tight_layout()
        chart_file = f"{output_dir}/evaluation_charts_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
        plt.savefig(chart_file, dpi=300, bbox_inches='tight')
        logger.info(f"Charts saved to: {chart_file}")

    except ImportError:
        logger.warning("matplotlib/seaborn not installed, skipping visualization")
    except Exception as e:
        logger.error(f"Visualization failed: {e}")

def main():
    """Main entry point."""
    output_dir = os.getenv("OUTPUT_DIR", "./test_results")

    # Load results
    deepeval_results = load_deepeval_results(output_dir)
    evalchemy_results = load_evalchemy_results(output_dir)

    # Build the combined report
    report = create_summary_report(deepeval_results, evalchemy_results, output_dir)

    # Visualize
    create_visualizations(report, output_dir)

    # Console output
    print("\n=== Local VLLM Evaluation Summary ===")
    print(f"Model: {report['model_name']}")
    print(f"Timestamp: {report['timestamp']}")

    if report["summary"]["deepeval"]:
        print("\n--- Deepeval Results ---")
        for filename, metrics in report["summary"]["deepeval"].items():
            print(f"File: {filename}")
            for metric_name, data in metrics.items():
                print(f"  {metric_name}: {data['average_score']:.3f} (n={data['count']})")

    if report["summary"]["evalchemy"]:
        print("\n--- Evalchemy Results ---")
        for filename, results in report["summary"]["evalchemy"].items():
            print(f"File: {filename}")
            for task, metrics in results.items():
                print(f"  {task}: {metrics['accuracy']:.3f}")

if __name__ == "__main__":
    main()
EOF

# Make the script executable
chmod +x scripts/aggregate_local_results.py

2. Run the Aggregation

# Install the visualization libraries
pip install matplotlib seaborn pandas

# Aggregate and visualize the results
python scripts/aggregate_local_results.py

# Check the generated files
ls -la test_results/evaluation_*

πŸ”§ Integrated Run Script

# Create the full local evaluation run script
cat > scripts/run_full_local_evaluation.sh << 'EOF'
#!/bin/bash
set -e

echo "πŸš€ 둜컬 VLLM 평가 μ‹œμž‘"
echo "===================="

# ν™˜κ²½ λ³€μˆ˜ λ‘œλ“œ
source .env

# 1. VLLM μ„œλ²„ μƒνƒœ 확인
echo "πŸ“‘ VLLM μ„œλ²„ μƒνƒœ 확인 쀑..."
if ! curl -f http://localhost:8000/health > /dev/null 2>&1; then
    echo "❌ VLLM μ„œλ²„κ°€ μ‹€ν–‰λ˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€. μ„œλ²„λ₯Ό λ¨Όμ € μ‹œμž‘ν•΄μ£Όμ„Έμš”."
    exit 1
fi
echo "βœ… VLLM μ„œλ²„ 정상 μž‘λ™"

# 2. κ²°κ³Ό 디렉토리 생성
mkdir -p $OUTPUT_DIR

# 3. Deepeval μ‹€ν–‰
echo "πŸ§ͺ Deepeval 평가 μ‹€ν–‰ 쀑..."
python scripts/run_local_deepeval.py
echo "βœ… Deepeval μ™„λ£Œ"

# 4. Evalchemy μ‹€ν–‰
echo "⚑ Evalchemy 벀치마크 μ‹€ν–‰ 쀑..."
python scripts/run_local_evalchemy.py
echo "βœ… Evalchemy μ™„λ£Œ"

# 5. κ²°κ³Ό 집계
echo "πŸ“Š κ²°κ³Ό 집계 및 μ‹œκ°ν™” 쀑..."
python scripts/aggregate_local_results.py
echo "βœ… κ²°κ³Ό 집계 μ™„λ£Œ"

# 6. κ²°κ³Ό 좜λ ₯
echo ""
echo "πŸŽ‰ 둜컬 VLLM 평가 μ™„λ£Œ!"
echo "===================="
echo "κ²°κ³Ό 파일 μœ„μΉ˜: $OUTPUT_DIR"
echo "μ£Όμš” 파일:"
echo "  - Deepeval κ²°κ³Ό: $OUTPUT_DIR/deepeval_results_*.json"
echo "  - Evalchemy κ²°κ³Ό: $OUTPUT_DIR/evalchemy_results.json"
echo "  - 톡합 λ³΄κ³ μ„œ: $OUTPUT_DIR/evaluation_summary_*.json"
echo "  - μ‹œκ°ν™” 차트: $OUTPUT_DIR/evaluation_charts_*.png"
EOF

# Make the script executable
chmod +x scripts/run_full_local_evaluation.sh
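
With the script in place, the whole pipeline runs with a single command (the VLLM server from the earlier section must already be running):

# Run the full local evaluation end to end
./scripts/run_full_local_evaluation.sh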

🚨 Troubleshooting

1. VLLM Server Issues

# Check the server logs
docker logs vllm-server

# Restart the server
docker restart vllm-server

# Check which process is using the port
lsof -i :8000

2. Python μ˜μ‘΄μ„± 문제

# κ°€μƒν™˜κ²½ μž¬μƒμ„±
deactivate
rm -rf venv
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt

3. Out-of-Memory Issues

# Remove any existing container first; otherwise --name vllm-server will conflict
docker rm -f vllm-server 2>/dev/null || true

# Re-run the VLLM server with a Docker memory limit
docker run -d \
  --name vllm-server \
  --memory="8g" \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2-7B-Instruct" \
  --served-model-name "qwen3-8b" \
  --gpu-memory-utilization 0.8
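
To confirm the memory limit took effect and to watch actual usage, docker stats gives a one-shot snapshot:

# One-shot snapshot of the container's memory usage against its limit
docker stats --no-stream vllm-server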

4. No Evaluation Results Produced

# Turn on debug logging
export LOG_LEVEL=DEBUG

# Step-by-step debugging
python -c "
import openai
client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')
response = client.chat.completions.create(
    model='qwen3-8b',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
"

🎯 Run Summary

Method 1: Using the Integrated Script (recommended)

# Run the entire evaluation with one command (the script checks that the VLLM server is up before starting)
./scripts/run_full_local_evaluation.sh

방법 2: μˆ˜λ™ 단계별 μ‹€ν–‰

# 1. VLLM μ„œλ²„ μ‹œμž‘ (선택사항)
docker run -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2-7B-Instruct" \
  --served-model-name "qwen3-8b"

# 2. κ°œλ³„ ν…ŒμŠ€νŠΈ μ‹€ν–‰
python scripts/run_simple_deepeval_test.py      # Mock ν…ŒμŠ€νŠΈ
python scripts/run_vllm_deepeval_test.py        # μ‹€μ œ VLLM ν…ŒμŠ€νŠΈ

# 전체 평가 μ‹€ν–‰
python scripts/run_complete_local_evalchemy.py

# κ°œλ³„ ν…ŒμŠ€νŠΈ μ‹€ν–‰
python scripts/run_simple_deepeval_test.py
python scripts/run_simple_evalchemy_test.py

# 3. κ²°κ³Ό 확인
cat test_results/*.json | jq

Example Output

πŸš€ macOS OrbStack VLLM local evaluation - integrated run
=============================================
πŸ“‹ 1. Checking the environment
πŸ“‹ 2. Checking required packages
βœ… Required packages OK
πŸ“‹ 3. Creating the results directory
βœ… Results directory created: ./test_results
πŸ“‹ 4. Checking the VLLM server status
βœ… VLLM server found: http://localhost:8000
πŸ“‹ 5. Running the evaluation against the live VLLM server

πŸ“Š Final results:
  Total tests: 5
  Average score: 0.50
  Success rate: 50.0%

βœ… Local VLLM evaluation complete!

πŸŽ‰ Wrap-Up

With this guide you can run a complete local evaluation of a VLLM-served model in the macOS OrbStack environment:

πŸ”§ Key Features

  β€’ Server health check: the integrated script verifies the VLLM server is reachable before evaluating
  β€’ Standard benchmarks plus custom metrics: Evalchemy (lm-eval) benchmarks alongside Deepeval metrics
  β€’ Integrated run: the whole evaluation pipeline runs from a single command
  β€’ Detailed result analysis: structured JSON results and charts

πŸ“Š μƒμ„±λ˜λŠ” κ²°κ³Ό 파일

  • simple_deepeval_results.json: Mock ν…ŒμŠ€νŠΈ κ²°κ³Ό
  • vllm_deepeval_results.json: VLLM μ„œλ²„ ν…ŒμŠ€νŠΈ κ²°κ³Ό

πŸš€ λ‹€μŒ 단계

이 둜컬 평가 μ‹œμŠ€ν…œμ„ 기반으둜 λ‹€μŒκ³Ό 같은 ν™•μž₯이 κ°€λŠ₯ν•©λ‹ˆλ‹€: - μ»€μŠ€ν…€ 평가 λ©”νŠΈλ¦­ μΆ”κ°€ - λ‹€μ–‘ν•œ λͺ¨λΈ 비ꡐ 평가 - μ„±λŠ₯ 벀치마크 ν™•μž₯ - CI/CD νŒŒμ΄ν”„λΌμΈ 톡합