LM-Evaluation-Harness 완전 가이드: 언어 모델 평가의 표준 프레임워크

소개

LM-Evaluation-Harness는 EleutherAI에서 개발한 언어 모델 평가를 위한 표준 프레임워크입니다. 9.3k+ 스타를 받은 이 도구는 GPT, Claude, Llama 등 다양한 언어 모델의 성능을 일관되고 재현 가능한 방식으로 측정할 수 있게 해줍니다.

이 가이드에서는 설치부터 고급 활용법까지, LM-Evaluation-Harness를 완전히 마스터하는 방법을 단계별로 알아보겠습니다.

주요 특징

핵심 기능

Few-shot 평가: 언어 모델의 few-shot learning 능력 측정
다양한 백엔드 지원: HuggingFace, vLLM, OpenAI API, Anthropic API 등
표준화된 벤치마크: MMLU, HellaSwag, ARC, GSM8K 등 주요 벤치마크 내장
확장 가능한 아키텍처: 새로운 태스크와 메트릭 쉽게 추가 가능
시각화 도구: Weights & Biases, Zeno 연동 지원

지원하는 모델 백엔드

백엔드	설명	사용 예시
`hf`	HuggingFace Transformers	로컬 모델 평가
`vllm`	vLLM 추론 엔진	고성능 GPU 추론
`openai`	OpenAI API	GPT-4, GPT-3.5 등
`anthropic`	Anthropic API	Claude 시리즈
`local-chat-completions`	로컬 OpenAI 호환 서버	자체 배포 모델

설치 및 환경 설정

기본 설치

# 기본 설치
pip install lm-eval

# 또는 개발 버전 설치
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

선택적 의존성 설치

LM-Evaluation-Harness는 다양한 extras를 제공합니다:

# API 모델 사용 (OpenAI, Anthropic)
pip install lm-eval[api]

# vLLM 백엔드 사용
pip install lm-eval[vllm]

# Weights & Biases 연동
pip install lm-eval[wandb]

# 시각화 도구 Zeno 사용
pip install lm-eval[zeno]

# 수학 문제 평가
pip install lm-eval[math]

# 모든 extras 설치 (권장하지 않음)
pip install lm-eval[all]

환경 변수 설정

API 기반 모델 사용 시 필요한 환경 변수:

# OpenAI API
export OPENAI_API_KEY="your-openai-api-key"

# Anthropic API
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# HuggingFace Hub (private 모델용)
export HF_TOKEN="your-huggingface-token"

# Weights & Biases
export WANDB_API_KEY="your-wandb-api-key"

# Zeno 시각화
export ZENO_API_KEY="your-zeno-api-key"

기본 사용법

간단한 평가 실행

# HuggingFace 모델로 HellaSwag 평가
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks hellaswag \
        --device cuda:0 \
        --batch_size 8

주요 CLI 옵션

옵션	설명	예시
`--model`	사용할 모델 백엔드	`hf`, `openai`, `vllm`
`--model_args`	모델별 설정	`pretrained=gpt2,device=cuda`
`--tasks`	실행할 태스크 목록	`hellaswag,arc_easy,mmlu`
`--num_fewshot`	Few-shot 예시 개수	`0`, `5`, `10`
`--batch_size`	배치 크기	`1`, `8`, `16`
`--output_path`	결과 저장 경로	`results/gpt2_results.json`
`--limit`	평가할 샘플 수 제한	`100`, `1000`
`--log_samples`	개별 샘플 로그 저장	플래그만 사용

다양한 모델 백엔드 활용

HuggingFace 모델 평가

# 기본 HuggingFace 모델
lm_eval --model hf \
        --model_args pretrained=EleutherAI/gpt-j-6B \
        --tasks hellaswag,arc_easy \
        --device cuda:0 \
        --batch_size 8

# 양자화된 모델
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium,load_in_8bit=True \
        --tasks mmlu \
        --device cuda:0

vLLM 백엔드 사용

# vLLM으로 고성능 추론
lm_eval --model vllm \
        --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
        --tasks gsm8k,math \
        --batch_size 16

OpenAI API 모델 평가

# GPT-4 평가
lm_eval --model openai \
        --model_args model=gpt-4-turbo \
        --tasks mmlu,hellaswag \
        --batch_size 5

# GPT-3.5 with custom settings
lm_eval --model openai \
        --model_args model=gpt-3.5-turbo,max_tokens=512 \
        --tasks arc_challenge \
        --num_fewshot 5

Anthropic Claude 평가

# Claude-3 평가
lm_eval --model anthropic \
        --model_args model=claude-3-opus-20240229 \
        --tasks mmlu,arc_challenge \
        --batch_size 3

사내 vLLM 서버와 LM Studio 평가 완벽 가이드

사내 vLLM 서버와 LM Studio에서 모델을 호출‧평가할 때 꼭 알아야 할 핵심 절차를 단계별로 정리했습니다. 두 환경 모두 OpenAI API 호환 모드를 사용하므로, LM-Evaluation-Harness나 OpenAI 클라이언트 코드를 거의 그대로 재활용할 수 있습니다.

핵심 요약

vLLM은 python -m vllm.entrypoints.openai.api_server 명령으로 OpenAI Compatible Server를 띄울 수 있으며, 기본적으로 http://<host>:8000/v1 경로를 사용합니다.
LM-Evaluation-Harness는 --model openai-completions(또는 openai-chat-completions) 드라이버를 통해 base_url·api_key를 외부 서버로 지정해 평가할 수 있습니다.
LM Studio는 UI의 Developer ▸ Start Server 버튼을 누르면 로컬에 http://localhost:1234/v1 서버가 열리고, /v1/chat/completions 등 주요 엔드포인트를 그대로 제공합니다.
API 키 인증이 없는 환경이라면 임의의 “dummy” 문자열을 api_key에 넣어주면 됩니다. 평가 코드 내부에서 키 유무만 확인하기 때문입니다.

사내 vLLM 서버 평가 절차

1. 서버 실행

python -m vllm.entrypoints.openai.api_server \
       --model <모델이름> \
       --host 0.0.0.0 \
       --port 8000

위 명령은 GPU가 연결된 머신에서 OpenAI API와 동일한 인터페이스를 제공하는 HTTP 서버를 띄웁니다.

2. LM-Evaluation-Harness 설치

pip install lm-eval

⚠️ 권장 버전은 Harness v0.4.0 이상입니다. 최신 릴리스에서 다수의 태스크가 추가·개선되었습니다.

3. 평가 실행 예시

lm_eval --model openai-chat-completions \
        --model_args model=<모델이름>,base_url=http://<서버IP>:8000/v1,api_key=dummy \
        --tasks hellaswag,arc_easy,arc_challenge \
        --batch_size 4 \
        --output_path ./results/

주요 옵션 설명:

openai-completions → text-completion 방식, openai-chat-completions → 챗 모델용 드라이버
base_url은 꼭 /v1까지 포함해야 엔드포인트 해석이 정확합니다
네트워크 지연·GPU 메모리에 따라 --batch_size를 조정하세요. 너무 크게 잡으면 context length 초과 오류가 날 수 있습니다

4. 결과 읽기

--output_path를 주면 JSON 리포트가 저장됩니다. 태스크별 정확도·F1·페널티 등 핵심 지표를 확인하고 사내 MLOps 대시보드로 넘겨서 트렌드를 추적하면 좋습니다.

LM Studio 모델 테스트 절차

1. 로컬 서버 켜기

LM Studio 실행 → Developer 탭 → Start Server 클릭
기본 포트는 1234, 베이스 URL은 http://localhost:1234/v1입니다

2. 빠른 동작 확인

curl -X POST http://localhost:1234/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model":"<로딩한-모델명>",
           "messages":[{"role":"user","content":"Hello"}]
         }'

정상 응답이 오면 서버-모델 연결이 성공한 것입니다.

3. Python / OpenAI 클라이언트 예시

from openai import OpenAI

client = OpenAI(
    api_key="lm-studio",               # 임의 문자열
    base_url="http://localhost:1234/v1"
)

resp = client.chat.completions.create(
    model="<모델명>",
    messages=[{"role":"user","content":"Introduce yourself"}]
)

print(resp.choices[0].message.content)

OpenAI 공식 라이브러리를 그대로 쓰되 base_url만 수정하면 됩니다.

4. LM-Evaluation-Harness로 벤치마킹

lm_eval --model openai-chat-completions \
        --model_args model=<모델명>,base_url=http://localhost:1234/v1,api_key=dummy \
        --tasks kobest_hellaswag,kobest_copa \
        --batch_size 2

lm_eval --model openai-chat-completions \
        --model_args model=deepseek-ai/deepseek-ai_deepseek-r1-0528-qwen3-8b,base_url=http://localhost:1234/v1,api_key=dummy \
        --tasks kobest_hellaswag \
        --batch_size 1

LM Studio도 OpenAI 규격을 따르므로 Harness 옵션이 vLLM과 동일합니다. 단, context window(ex. 8K·32K·128K) 한도를 모델/변환기에 맞춰야 오류가 안 납니다.

⚠️ LM Studio의 중요한 제한사항

현재 LM Studio의 OpenAI 호환 API를 통해 multiple-choice 유형의 태스크(예: hellaswag)를 평가하려고 할 때, loglikelihood 계산을 지원하지 않기 때문에 문제가 발생합니다. 이는 LM-Evaluation-Harness에서 multiple-choice 태스크를 평가할 때 loglikelihood 계산이 필수적이기 때문입니다.

❌ LM Studio의 제한 사항:

Chat Completions API의 logprobs 미지원: 현재 LM Studio의 Chat Completions API는 logprobs를 지원하지 않습니다. 따라서, loglikelihood 계산이 필요한 multiple-choice 태스크를 평가할 수 없습니다.
Completions API의 제한: LM Studio는 OpenAI의 Completions API를 지원하지만, 이 또한 logprobs를 지원하지 않거나 제한적으로 지원합니다. 따라서, loglikelihood 계산에 필요한 정보를 얻기 어렵습니다.

✅ 권장 대안:

vLLM 서버 사용: vLLM은 OpenAI 호환 API를 제공하며, logprobs를 지원하므로 multiple-choice 태스크 평가에 적합합니다.

lm_eval --model openai-completions \
        --model_args model=your-model-name,base_url=http://your-vllm-server:8000/v1,api_key=dummy \
        --tasks hellaswag \
        --batch_size 1

지원되는 태스크로 변경: loglikelihood 계산이 필요하지 않은 태스크를 선택하여 평가를 진행할 수 있습니다.
LM Studio의 REST API 활용: LM Studio의 REST API를 사용하여 직접 logprobs를 계산하고, 이를 기반으로 평가를 수행하는 커스텀 스크립트를 작성할 수 있습니다. 다만, 이 방법은 추가적인 개발 작업이 필요합니다.

🔍 결론:

현재 LM Studio의 OpenAI 호환 API는 multiple-choice 태스크 평가에 필요한 loglikelihood 계산을 지원하지 않으므로, 해당 평가를 수행하려면 vLLM과 같은 대안을 사용하는 것이 권장됩니다. 또는 loglikelihood 계산이 필요하지 않은 태스크를 선택하여 평가를 진행하시기 바랍니다.

5. 주의사항

LM Studio 파일 캐시에 모델이 완전히 로드되기 전에는 첫 요청이 느릴 수 있습니다
일부 구형 GGML 포맷 모델은 Chat API를 지원하지 않을 수 있습니다. 모델 설명의 “chat compatible” 표시를 확인하세요
LM Studio UI에서 Enable network access 옵션을 켜면 같은 네트워크 대역의 다른 PC에서도 http://<PC-IP>:1234/v1로 접근할 수 있습니다

공통 베스트 프랙티스

항목	팁
API 키 관리	인증이 필요 없는 내부 환경이라도 `api_key=dummy` 패턴을 통일해 스크립트를 재사용하세요
환경 변수	`OPENAI_API_BASE`, `OPENAI_API_KEY` 환경 변수를 세팅하면 여러 스크립트를 수정할 필요가 없습니다
모델 명칭	Harness는 `engine=` 또는 `model=` 파라미터로 모델 이름을 받으니, vLLM·LM Studio 모두 실제 로드한 이름과 일치시키세요
배치 관리	`--batch_size`는 API 호출 배치 수이지 GPU 마이크로배치와 다릅니다. 적정 값(1~8)부터 점진적으로 늘리세요
스트리밍	Harness 자체는 스트리밍 출력이 없지만, OpenAI 클라이언트로 직접 테스트할 때는 `stream=True`로 실시간 토큰을 받을 수 있습니다
태스크 선택	한국어 모델이면 `kobest_*`, 다국어 모델이면 `mmlu_zh`, `tydiqa` 등 지원 태스크를 확인하세요. Harness v0.4.0부터 리더보드 태스크가 대폭 추가되었습니다

실무 활용 팁

이 가이드를 따라 vLLM과 LM Studio 모두에서 손쉽게 벤치마크를 돌릴 수 있습니다. 내부 실험 → 결과 JSON 저장 → 사내 대시보드 시각화까지 일관된 파이프라인을 구축해두면, 모델 업그레이드나 하이퍼파라미터 튜닝 효과를 빠르게 파악할 수 있으니 적극 활용해보세요!

주요 벤치마크 태스크

일반 지능 평가

# MMLU (Massive Multitask Language Understanding)
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks mmlu \
        --num_fewshot 5

# HellaSwag (상식 추론)
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks hellaswag \
        --num_fewshot 10

수학 및 논리 추론

# GSM8K (초등 수학 문제)
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks gsm8k \
        --num_fewshot 5

# ARC (AI2 Reasoning Challenge)
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks arc_easy,arc_challenge \
        --num_fewshot 25

코딩 능력 평가

# HumanEval (코드 생성)
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks humaneval \
        --batch_size 4

# MBPP (Python 프로그래밍)
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks mbpp \
        --batch_size 4

고급 활용법

배치 처리 및 성능 최적화

# 대용량 배치 처리
lm_eval --model vllm \
        --model_args pretrained=meta-llama/Llama-2-13b-hf,tensor_parallel_size=4 \
        --tasks mmlu,hellaswag,arc_easy,arc_challenge \
        --batch_size 32 \
        --output_path results/llama2_13b_comprehensive.json

샘플링 및 제한 설정

# 빠른 테스트를 위한 샘플 제한
lm_eval --model hf \
        --model_args pretrained=gpt2 \
        --tasks mmlu \
        --limit 100 \
        --output_path results/gpt2_quick_test.json

# 특정 서브태스크만 실행
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks mmlu_anatomy,mmlu_astronomy \
        --num_fewshot 5

결과 로깅 및 분석

# 상세 로그와 함께 평가
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks hellaswag \
        --log_samples \
        --output_path results/detailed_results \
        --batch_size 8

시각화 및 분석

Weights & Biases 연동

# W&B 로깅 활성화
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks mmlu,hellaswag \
        --wandb_args project=llm-evaluation,name=dialoGPT-experiment \
        --log_samples \
        --output_path results/wandb_experiment

Zeno 시각화

# Zeno용 결과 생성
lm_eval --model hf \
        --model_args pretrained=EleutherAI/gpt-j-6B \
        --tasks hellaswag \
        --log_samples \
        --output_path output/gpt-j-6B

# Zeno 업로드
python scripts/zeno_visualize.py \
        --data_path output \
        --project_name "GPT-J Evaluation"

커스텀 태스크 구현

새로운 태스크 추가

# custom_task.py
from lm_eval.api.task import Task
from lm_eval.api.metrics import mean

class CustomTask(Task):
    VERSION = 0
    DATASET_PATH = "path/to/your/dataset"
    DATASET_NAME = "custom_dataset"
    
    def has_training_docs(self):
        return True
    
    def has_validation_docs(self):
        return True
    
    def training_docs(self):
        return self.dataset["train"]
    
    def validation_docs(self):
        return self.dataset["validation"]
    
    def doc_to_text(self, doc):
        return f"Question: {doc['question']}\nAnswer:"
    
    def doc_to_target(self, doc):
        return doc['answer']
    
    def construct_requests(self, doc, ctx):
        return rf.greedy_until(ctx, ["\n"])
    
    def process_results(self, doc, results):
        return {"acc": results[0] == doc['answer']}
    
    def aggregation(self):
        return {"acc": mean}

YAML 기반 태스크 정의

# custom_task.yaml
task: custom_math_task
dataset_path: math_problems.json
training_split: train
validation_split: test
output_type: generate_until
generation_kwargs:
  until: ["\n"]
  max_gen_toks: 256
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
  description: "Custom math problem solving task"

성능 최적화 팁

메모리 최적화

# 메모리 효율적인 설정
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium,load_in_8bit=True,device_map=auto \
        --tasks mmlu \
        --batch_size 4 \
        --max_batch_size 8

병렬 처리

# 멀티 GPU 활용
lm_eval --model vllm \
        --model_args pretrained=meta-llama/Llama-2-70b-hf,tensor_parallel_size=8 \
        --tasks mmlu,hellaswag,arc_easy,arc_challenge \
        --batch_size 64

캐싱 활용

# 결과 캐싱으로 재실행 시간 단축
lm_eval --model hf \
        --model_args pretrained=microsoft/DialoGPT-medium \
        --tasks mmlu \
        --cache_requests \
        --output_path results/cached_results.json

문제 해결 가이드

일반적인 오류와 해결책

오류	원인	해결책
`CUDA out of memory`	GPU 메모리 부족	`--batch_size` 감소, `load_in_8bit=True` 사용
`Task not found`	태스크명 오타	`lm_eval --tasks list` 로 사용 가능한 태스크 확인
`API rate limit`	API 호출 한도 초과	`--batch_size` 감소, 요청 간격 조정
`Model loading failed`	모델 경로/권한 오류	HF_TOKEN 설정, 모델명 확인

디버깅 팁

# 상세 로그 출력
lm_eval --model hf \
        --model_args pretrained=gpt2 \
        --tasks hellaswag \
        --limit 10 \
        --log_samples \
        --verbosity DEBUG

# 단일 샘플 테스트
lm_eval --model hf \
        --model_args pretrained=gpt2 \
        --tasks hellaswag \
        --limit 1 \
        --log_samples

실제 사용 사례

모델 비교 평가

#!/bin/bash
# 여러 모델 성능 비교 스크립트

models=(
    "gpt2"
    "microsoft/DialoGPT-medium"
    "EleutherAI/gpt-neo-1.3B"
)

tasks="mmlu,hellaswag,arc_easy"

for model in "${models[@]}"; do
    echo "Evaluating $model..."
    lm_eval --model hf \
            --model_args pretrained="$model" \
            --tasks "$tasks" \
            --batch_size 8 \
            --output_path "results/$(echo $model | tr '/' '_').json" \
            --log_samples
done

CI/CD 파이프라인 통합

# .github/workflows/model-evaluation.yml
name: Model Evaluation

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: ["gpt2", "microsoft/DialoGPT-small"]
        task: ["hellaswag", "arc_easy"]
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.9'
    
    - name: Install dependencies
      run: |
        pip install lm-eval[api]
    
    - name: Run evaluation
      run: |
        lm_eval --model hf \
                --model_args pretrained=$ \
                --tasks $ \
                --limit 100 \
                --output_path results/$_$.json
    
    - name: Upload results
      uses: actions/upload-artifact@v3
      with:
        name: evaluation-results
        path: results/

결과 분석 및 해석

결과 파일 구조

{
  "results": {
    "hellaswag": {
      "acc": 0.5234,
      "acc_stderr": 0.0123,
      "acc_norm": 0.5456,
      "acc_norm_stderr": 0.0134
    }
  },
  "configs": {
    "model": "hf",
    "model_args": "pretrained=gpt2",
    "batch_size": 8,
    "num_fewshot": 10
  },
  "versions": {
    "hellaswag": 1
  }
}

Python으로 결과 분석

import json
import pandas as pd
import matplotlib.pyplot as plt

def analyze_results(result_files):
    """여러 모델 결과 비교 분석"""
    results = {}
    
    for file_path in result_files:
        with open(file_path, 'r') as f:
            data = json.load(f)
            model_name = data['configs']['model_args'].split('=')[1]
            results[model_name] = data['results']
    
    # DataFrame으로 변환
    df = pd.DataFrame(results).T
    
    # 시각화
    df.plot(kind='bar', figsize=(12, 6))
    plt.title('Model Performance Comparison')
    plt.ylabel('Accuracy')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    return df

# 사용 예시
result_files = [
    'results/gpt2.json',
    'results/microsoft_DialoGPT-medium.json',
    'results/EleutherAI_gpt-neo-1.3B.json'
]

comparison_df = analyze_results(result_files)
print(comparison_df)

모범 사례

평가 설계 원칙

일관성: 동일한 설정으로 모든 모델 평가
재현성: 시드 고정 및 환경 문서화
투명성: 평가 과정과 결과 공개
다양성: 여러 태스크로 종합적 평가

권장 설정

# 표준 평가 설정
lm_eval --model hf \
        --model_args pretrained=your-model,trust_remote_code=True \
        --tasks mmlu,hellaswag,arc_easy,arc_challenge,gsm8k \
        --num_fewshot 5 \
        --batch_size 8 \
        --output_path results/comprehensive_eval.json \
        --log_samples \
        --wandb_args project=model-evaluation

결론

LM-Evaluation-Harness는 언어 모델 평가의 표준 도구로 자리잡았습니다. 이 가이드를 통해 다음을 달성할 수 있습니다:

핵심 성과

표준화된 평가: 일관되고 재현 가능한 모델 성능 측정
다양한 백엔드 지원: HuggingFace부터 API 모델까지 통합 평가
확장성: 커스텀 태스크와 메트릭 쉽게 추가
시각화: W&B, Zeno를 통한 직관적인 결과 분석

활용 분야

모델 개발: 새로운 모델의 성능 검증
모델 선택: 용도에 맞는 최적 모델 선택
연구: 학술 연구의 실험 재현성 확보
프로덕션: 배포 전 모델 품질 보증

LM-Evaluation-Harness를 MLOps 파이프라인에 통합하면 체계적이고 신뢰할 수 있는 언어 모델 평가 체계를 구축할 수 있습니다. 특히 다양한 모델과 태스크를 지원하는 확장성 덕분에 빠르게 변화하는 LLM 생태계에서 경쟁력을 유지하는 데 필수적인 도구입니다.

참고 자료:

소개