Unsloth+TRL Korean LLM Training Automation - Part 2: Building a Kubernetes Pipeline
Overview
This guide picks up where Part 1 of the Unsloth+TRL Korean LLM training guide left off and covers how to fully automate the training of a Korean-specialized LLM using Kubernetes.
What you will learn:
- 🚀 Automated training pipeline: CPT → SFT → RLHF run end to end
- 🎯 GPU resource optimization: dynamic scheduling and efficient utilization
- 📊 Monitoring and logging: real-time tracking of training progress
- ⚡ Scalability: concurrent training of multiple models
1. Preparing the Kubernetes Cluster
1.1 GPU Node Configuration
# gpu-node-pool.yaml
# Note: Node objects are registered by the kubelet, not created from a manifest,
# and capacity/allocatable are status fields reported by the node. The manifest
# below documents what a registered 8x H100 node should look like.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    node-type: gpu-training
    gpu-type: h100
    gpu-count: "8"
status:
  capacity:
    nvidia.com/gpu: "8"
    memory: 1000Gi
    cpu: "128"
  allocatable:
    nvidia.com/gpu: "8"
    memory: 900Gi
    cpu: "120"
1.2 Installing Required Components
# Install the NVIDIA GPU Operator
kubectl create namespace gpu-operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=true
# Install the Kubeflow Training Operator (pin a release tag for reproducibility)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
# Install Argo Workflows
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml
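A quick sanity check that the operators came up:
kubectl get pods -n gpu-operator
kubectl get pods -n argo
kubectl get crd pytorchjobs.kubeflow.org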
1.3 Storage Configuration
# nfs-storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: llm-training-data
spec:
capacity:
storage: 10Ti
accessModes:
- ReadWriteMany
nfs:
server: nfs-server.example.com
path: /data/llm-training
storageClassName: nfs-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llm-training-data-pvc
namespace: llm-training
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Ti
storageClassName: nfs-storage
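Apply the manifest and confirm the claim binds (the llm-training namespace must exist first; it is also created in section 8.1):
kubectl create namespace llm-training
kubectl apply -f nfs-storage.yaml
kubectl get pvc llm-training-data-pvc -n llm-training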
2. Preparing the Container Image
2.1 Base Image
# Dockerfile.unsloth-korean
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
# Base packages (default-jre is required by konlpy)
RUN apt-get update && apt-get install -y \
    python3 python3-pip git wget curl default-jre \
    && rm -rf /var/lib/apt/lists/*
# Python environment
RUN pip3 install --upgrade pip
# Install PyTorch for CUDA 11.8 first so later packages build against it
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Unsloth and its dependencies
RUN pip3 install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
RUN pip3 install --no-deps xformers trl peft accelerate bitsandbytes
RUN pip3 install datasets transformers tokenizers sentencepiece
# Korean NLP libraries
RUN pip3 install konlpy kss soynlp
# Monitoring tools
RUN pip3 install wandb tensorboard prometheus_client
# Copy training scripts and configs
COPY scripts/ /app/scripts/
COPY configs/ /app/configs/
WORKDIR /app
CMD ["python3", "-u", "scripts/train.py"]
2.2 Building and Pushing the Image
# Build the image
docker build -t your-registry/unsloth-korean:latest -f Dockerfile.unsloth-korean .
# Push it to your registry
docker push your-registry/unsloth-korean:latest
3. Composing the Training Pipeline
3.1 CPT (Continued Pre-Training) Job
# cpt-job.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: korean-llm-cpt
namespace: llm-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/cpt_train.py"]
env:
- name: STAGE
value: "CPT"
- name: MODEL_SIZE
value: "7B"
- name: WANDB_PROJECT
value: "korean-llm-cpt"
resources:
requests:
nvidia.com/gpu: 4
memory: 200Gi
cpu: 32
limits:
nvidia.com/gpu: 4
memory: 200Gi
cpu: 32
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: llm-training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
nodeSelector:
node-type: gpu-training
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
3.2 SFT (Supervised Fine-Tuning) Job
# sft-job.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: korean-llm-sft
namespace: llm-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/sft_train.py"]
env:
- name: STAGE
value: "SFT"
- name: BASE_MODEL_PATH
value: "/output/cpt_model"
- name: WANDB_PROJECT
value: "korean-llm-sft"
resources:
requests:
nvidia.com/gpu: 2
memory: 100Gi
cpu: 16
limits:
nvidia.com/gpu: 2
memory: 100Gi
cpu: 16
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: llm-training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
nodeSelector:
node-type: gpu-training
3.3 RLHF (Reinforcement Learning from Human Feedback) Job
# rlhf-job.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: korean-llm-rlhf
namespace: llm-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/rlhf_train.py"]
env:
- name: STAGE
value: "RLHF"
- name: BASE_MODEL_PATH
value: "/output/sft_model"
- name: WANDB_PROJECT
value: "korean-llm-rlhf"
resources:
requests:
nvidia.com/gpu: 2
memory: 100Gi
cpu: 16
limits:
nvidia.com/gpu: 2
memory: 100Gi
cpu: 16
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: llm-training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
nodeSelector:
node-type: gpu-training
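Each stage can also be run standalone. Submission and log-tailing for the CPT job look like this (the Training Operator names the master pod <job-name>-master-0):
kubectl apply -f cpt-job.yaml
kubectl get pytorchjobs -n llm-training
kubectl logs -f korean-llm-cpt-master-0 -n llm-training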
4. Argo Workflows Pipeline
4.1 The Full Training Workflow
# korean-llm-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: korean-llm-training-
namespace: llm-training
spec:
entrypoint: korean-llm-pipeline
templates:
- name: korean-llm-pipeline
dag:
tasks:
- name: data-preprocessing
template: preprocess-data
- name: cpt-training
template: cpt-stage
dependencies: [data-preprocessing]
- name: sft-training
template: sft-stage
dependencies: [cpt-training]
- name: rlhf-training
template: rlhf-stage
dependencies: [sft-training]
- name: model-evaluation
template: evaluate-model
dependencies: [rlhf-training]
- name: model-deployment
template: deploy-model
dependencies: [model-evaluation]
- name: preprocess-data
container:
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/preprocess.py"]
volumeMounts:
- name: training-data
mountPath: /data
resources:
requests:
memory: 32Gi
cpu: 8
- name: cpt-stage
resource:
action: create
      # PyTorchJob reports per-replica counts under status.replicaStatuses;
      # matching conditions.0 would never fire, since the first condition is "Created"
      successCondition: status.replicaStatuses.Master.succeeded > 0
      failureCondition: status.replicaStatuses.Master.failed > 0
manifest: |
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
generateName: cpt-job-
namespace: llm-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/cpt_train.py"]
env:
- name: STAGE
value: "CPT"
resources:
requests:
nvidia.com/gpu: 4
memory: 200Gi
cpu: 32
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: llm-training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
nodeSelector:
node-type: gpu-training
- name: sft-stage
resource:
action: create
      successCondition: status.replicaStatuses.Master.succeeded > 0
      failureCondition: status.replicaStatuses.Master.failed > 0
manifest: |
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
generateName: sft-job-
namespace: llm-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/sft_train.py"]
env:
- name: STAGE
value: "SFT"
- name: BASE_MODEL_PATH
value: "/output/cpt_model"
resources:
requests:
nvidia.com/gpu: 2
memory: 100Gi
cpu: 16
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: llm-training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
nodeSelector:
node-type: gpu-training
- name: rlhf-stage
resource:
action: create
      successCondition: status.replicaStatuses.Master.succeeded > 0
      failureCondition: status.replicaStatuses.Master.failed > 0
manifest: |
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
generateName: rlhf-job-
namespace: llm-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/rlhf_train.py"]
env:
- name: STAGE
value: "RLHF"
- name: BASE_MODEL_PATH
value: "/output/sft_model"
resources:
requests:
nvidia.com/gpu: 2
memory: 100Gi
cpu: 16
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: llm-training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
nodeSelector:
node-type: gpu-training
- name: evaluate-model
container:
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/evaluate.py"]
env:
- name: MODEL_PATH
value: "/output/rlhf_model"
volumeMounts:
- name: model-output
mountPath: /output
resources:
requests:
nvidia.com/gpu: 1
memory: 50Gi
cpu: 8
- name: deploy-model
container:
image: your-registry/unsloth-korean:latest
command: ["python3", "-u", "scripts/deploy.py"]
env:
- name: MODEL_PATH
value: "/output/rlhf_model"
- name: DEPLOYMENT_TARGET
value: "production"
volumeMounts:
- name: model-output
mountPath: /output
  # Bind to the same PVCs the PyTorchJob stages mount, so every step in the
  # workflow shares data and model artifacts. (volumeClaimTemplates would
  # create fresh per-run PVCs that the PyTorchJobs above could not see.)
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: llm-training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
5. Training Scripts
5.1 CPT Training Script
# scripts/cpt_train.py
import os

import torch
import wandb
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

def main():
    # Stage configuration from the environment
    stage = os.getenv("STAGE", "CPT")
    model_size = os.getenv("MODEL_SIZE", "7B")

    # Initialize Weights & Biases
    wandb.init(
        project=os.getenv("WANDB_PROJECT", "korean-llm-cpt"),
        name=f"cpt-{model_size}-{wandb.util.generate_id()}",
    )

    # Load the 4-bit base model; keep max_seq_length consistent with training
    model_name = f"unsloth/Qwen2.5-{model_size}-bnb-4bit"
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=4096,
        dtype=None,
        load_in_4bit=True,
    )

    # Attach LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r=64,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    # Load the raw corpus; SFTTrainer tokenizes and packs the "text" field
    # itself, so no manual tokenization pass is needed.
    dataset = load_dataset("json", data_files="/data/korean_corpus.jsonl", split="train")

    # Training configuration
    training_args = TrainingArguments(
        output_dir="/output/cpt_model",
        overwrite_output_dir=True,
        num_train_epochs=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,
        learning_rate=1e-5,
        weight_decay=0.01,
        logging_steps=100,
        save_steps=2000,
        save_total_limit=3,
        warmup_steps=1000,
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        dataloader_pin_memory=False,
        remove_unused_columns=False,
        report_to="wandb",
    )

    # Initialize the trainer (gradient checkpointing is already handled
    # by the Unsloth option above, so it is not repeated here)
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=4096,
        packing=True,
    )

    # Train
    trainer.train()

    # Save the model/adapters
    trainer.save_model("/output/cpt_model")

    # Close out the run
    wandb.finish()

if __name__ == "__main__":
    main()
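The SFT script follows the same structure with an instruction-formatted dataset. The guide does not reproduce the RLHF script; as one option, a minimal DPO-based sketch with TRL could look like the following. The preference file /data/korean_preferences.jsonl and its prompt/chosen/rejected fields are assumptions, and the DPOTrainer signature varies across TRL versions:
# scripts/rlhf_train.py - minimal DPO sketch (assumed data layout, older TRL-style API)
import os

from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import FastLanguageModel

def main():
    # Continue from the SFT checkpoint produced by the previous stage
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=os.getenv("BASE_MODEL_PATH", "/output/sft_model"),
        max_seq_length=4096,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(model, r=64, lora_alpha=16)

    # Preference pairs with "prompt", "chosen", "rejected" fields (assumed layout)
    dataset = load_dataset("json", data_files="/data/korean_preferences.jsonl", split="train")

    trainer = DPOTrainer(
        model=model,
        ref_model=None,  # with LoRA adapters, TRL derives the reference from the base weights
        args=TrainingArguments(
            output_dir="/output/rlhf_model",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=16,
            learning_rate=5e-6,
            num_train_epochs=1,
            report_to="wandb",
        ),
        beta=0.1,
        train_dataset=dataset,
        tokenizer=tokenizer,
        max_length=2048,
        max_prompt_length=1024,
    )
    trainer.train()
    trainer.save_model("/output/rlhf_model")

if __name__ == "__main__":
    main()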
5.2 Monitoring and Alerts
# scripts/monitor.py
import os
import time

import requests
from kubernetes import client, config

class TrainingMonitor:
    def __init__(self):
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()
        self.batch_v1 = client.BatchV1Api()
        self.notified = set()  # avoid re-sending the same completion alert

    def check_job_status(self, job_name, namespace="llm-training"):
        """Return the status of a Job."""
        try:
            job = self.batch_v1.read_namespaced_job(job_name, namespace)
            return job.status
        except Exception as e:
            print(f"Error checking job status: {e}")
            return None

    def get_gpu_utilization(self, pod_prefix):
        """Fetch GPU utilization from Prometheus (DCGM exporter metric) for
        pods whose name starts with the given prefix; a Job's pods carry
        the Job name as their prefix."""
        try:
            response = requests.get(
                "http://prometheus:9090/api/v1/query",
                params={
                    "query": f'DCGM_FI_DEV_GPU_UTIL{{pod=~"{pod_prefix}.*"}}'
                },
            )
            data = response.json()
            return data["data"]["result"]
        except Exception as e:
            print(f"Error getting GPU utilization: {e}")
            return None

    def send_slack_notification(self, message):
        """Send a Slack notification."""
        webhook_url = os.getenv("SLACK_WEBHOOK_URL")
        if webhook_url:
            payload = {"text": message}
            requests.post(webhook_url, json=payload)

    def monitor_training(self):
        """Main monitoring loop."""
        while True:
            # List the batch/v1 Jobs in the namespace
            # (extend with CustomObjectsApi to also cover PyTorchJobs)
            jobs = self.batch_v1.list_namespaced_job("llm-training")
            for job in jobs.items:
                job_name = job.metadata.name
                status = job.status
                if status.failed and job_name not in self.notified:
                    self.send_slack_notification(f"🚨 Training job {job_name} failed!")
                    self.notified.add(job_name)
                elif status.succeeded and job_name not in self.notified:
                    self.send_slack_notification(f"✅ Training job {job_name} completed successfully!")
                    self.notified.add(job_name)
                # Check GPU utilization of running jobs
                if status.active:
                    gpu_util = self.get_gpu_utilization(job_name)
                    if gpu_util and float(gpu_util[0]["value"][1]) < 50:
                        self.send_slack_notification(
                            f"⚠️ Low GPU utilization in {job_name}: {gpu_util[0]['value'][1]}%"
                        )
            time.sleep(300)  # check every 5 minutes

if __name__ == "__main__":
    monitor = TrainingMonitor()
    monitor.monitor_training()
6. Monitoring and Logging
6.1 Prometheus Metric Collection
# monitoring/prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- job_name: 'nvidia-dcgm'
static_configs:
- targets: ['nvidia-dcgm-exporter:9400']
6.2 Grafana Dashboard
{
  "dashboard": {
    "title": "Korean LLM Training Dashboard",
    "panels": [
      {
        "title": "GPU Utilization",
        "type": "graph",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_GPU_UTIL",
            "legendFormat": "GPU {{gpu}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100",
            "legendFormat": "GPU Memory {{gpu}}"
          }
        ]
      },
      {
        "title": "Training Loss",
        "type": "graph",
        "targets": [
          {
            "expr": "training_loss",
            "legendFormat": "Loss"
          }
        ]
      }
    ]
  }
}
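The Training Loss panel assumes the training script itself exports a training_loss metric. A minimal sketch with prometheus_client (already in the image) via a transformers TrainerCallback; the module name and port are illustrative:
# scripts/metrics_callback.py - illustrative; exposes trainer logs to Prometheus
from prometheus_client import Gauge, start_http_server
from transformers import TrainerCallback

TRAINING_LOSS = Gauge("training_loss", "Most recent training loss")

class PrometheusCallback(TrainerCallback):
    def __init__(self, port=8000):
        # Serves /metrics; the pod needs the prometheus.io/scrape annotation
        # from the scrape config in 6.1 for Prometheus to pick it up.
        start_http_server(port)

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            TRAINING_LOSS.set(logs["loss"])

# usage inside a training script:
#   trainer.add_callback(PrometheusCallback())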
7. Autoscaling and Resource Management
7.1 HPA (Horizontal Pod Autoscaler)
Run-to-completion training Jobs are not valid HPA targets, so the autoscaler below scales the inference Deployment that serves the finished model.
# hpa-config.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-training-hpa
namespace: llm-training
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-inference-server
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
7.2 GPU Scheduler Configuration
# gpu-scheduler.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: gpu-scheduler-config
namespace: kube-system
data:
config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-scheduler
plugins:
score:
enabled:
- name: NodeResourcesFit
- name: NodeAffinity
filter:
enabled:
- name: NodeResourcesFit
- name: NodeAffinity
pluginConfig:
- name: NodeResourcesFit
args:
scoringStrategy:
type: LeastAllocated
resources:
- name: nvidia.com/gpu
weight: 100
8. Running and Managing the Pipeline
8.1 Running the Workflow
# Create the namespace
kubectl create namespace llm-training
# Create the secret holding the Wandb API key
kubectl create secret generic wandb-secret \
  --from-literal=api-key=YOUR_WANDB_API_KEY \
  -n llm-training
# Submit the workflow
argo submit korean-llm-pipeline.yaml -n llm-training
# Check its status
argo list -n llm-training
argo get korean-llm-training-xxxxx -n llm-training
# Tail the logs
argo logs korean-llm-training-xxxxx -n llm-training
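The training containers read the key through an environment variable; a snippet to add under each container's env, matching the secret created above:
env:
- name: WANDB_API_KEY
  valueFrom:
    secretKeyRef:
      name: wandb-secret
      key: api-key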
8.2 Pipeline Management Script
# scripts/pipeline_manager.py
import subprocess
import time

from kubernetes import client, config

class PipelineManager:
    def __init__(self):
        config.load_incluster_config()
        self.custom_api = client.CustomObjectsApi()

    def submit_workflow(self, workflow_file, parameters=None):
        """Submit a workflow and return its generated name."""
        # -o name makes the argo CLI print only the workflow name
        cmd = ["argo", "submit", workflow_file, "-n", "llm-training", "-o", "name"]
        if parameters:
            for key, value in parameters.items():
                cmd.extend(["-p", f"{key}={value}"])
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout.strip()

    def get_workflow_status(self, workflow_name):
        """Return the current phase of a workflow."""
        try:
            workflow = self.custom_api.get_namespaced_custom_object(
                group="argoproj.io",
                version="v1alpha1",
                namespace="llm-training",
                plural="workflows",
                name=workflow_name,
            )
            return workflow["status"]["phase"]
        except Exception as e:
            print(f"Error getting workflow status: {e}")
            return None

    def wait_for_completion(self, workflow_name, timeout=86400):
        """Block until the workflow finishes or the timeout expires."""
        start_time = time.time()
        while time.time() - start_time < timeout:
            status = self.get_workflow_status(workflow_name)
            if status in ["Succeeded", "Failed", "Error"]:
                return status
            time.sleep(60)
        return "Timeout"

    def cleanup_completed_workflows(self, days=7):
        """Delete completed workflows older than the given number of days."""
        cmd = [
            "argo", "delete", "-n", "llm-training",
            "--completed", f"--older={days}d",
        ]
        subprocess.run(cmd)

# Usage example
if __name__ == "__main__":
    manager = PipelineManager()
    # Submit the workflow
    workflow_name = manager.submit_workflow(
        "korean-llm-pipeline.yaml",
        parameters={
            "model-size": "7B",
            "learning-rate": "2e-5",
        },
    )
    print(f"Submitted workflow: {workflow_name}")
    # Wait for it to finish
    status = manager.wait_for_completion(workflow_name)
    print(f"Workflow completed with status: {status}")
    # Clean up
    manager.cleanup_completed_workflows()
9. Advanced Features
9.1 Concurrent Multi-Model Training
# multi-model-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: multi-model-training
namespace: llm-training
spec:
entrypoint: multi-model-pipeline
templates:
- name: multi-model-pipeline
steps:
- - name: train-7b-model
template: single-model-training
arguments:
parameters:
- name: model-size
value: "7B"
- name: gpu-count
value: "4"
- name: train-72b-model
template: single-model-training
arguments:
parameters:
- name: model-size
value: "72B"
- name: gpu-count
value: "8"
- name: single-model-training
inputs:
parameters:
- name: model-size
- name: gpu-count
    dag:
      tasks:
      # cpt-training / sft-training / rlhf-training are defined elsewhere,
      # following the same pattern as the stage templates in section 4.1
      - name: cpt
        template: cpt-training
        arguments:
          parameters:
          - name: model-size
            value: "{{inputs.parameters.model-size}}"
          - name: gpu-count
            value: "{{inputs.parameters.gpu-count}}"
      - name: sft
        template: sft-training
        dependencies: [cpt]
        arguments:
          parameters:
          - name: model-size
            value: "{{inputs.parameters.model-size}}"
      - name: rlhf
        template: rlhf-training
        dependencies: [sft]
        arguments:
          parameters:
          - name: model-size
            value: "{{inputs.parameters.model-size}}"
9.2 Experiment Tracking and Version Management
# scripts/experiment_tracker.py
import os
from datetime import datetime

import mlflow
import wandb

class ExperimentTracker:
    def __init__(self):
        # MLflow setup
        mlflow.set_tracking_uri("http://mlflow-server:5000")
        mlflow.set_experiment("korean-llm-training")
        # Wandb setup
        wandb.login(key=os.getenv("WANDB_API_KEY"))

    def start_experiment(self, config):
        """Start a tracked run in both backends."""
        # Start the MLflow run
        self.mlflow_run = mlflow.start_run(
            run_name=f"korean-llm-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
        )
        # Start the Wandb run
        wandb.init(
            project="korean-llm-training",
            config=config,
            name=self.mlflow_run.info.run_name,
        )
        # Log hyperparameters
        mlflow.log_params(config)

    def log_metrics(self, metrics, step=None):
        """Log metrics to both backends."""
        mlflow.log_metrics(metrics, step)
        wandb.log(metrics, step=step)

    def log_model(self, model_path, model_name):
        """Log model artifacts."""
        mlflow.log_artifacts(model_path, model_name)
        wandb.save(f"{model_path}/*")

    def end_experiment(self):
        """Close out both runs."""
        mlflow.end_run()
        wandb.finish()
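A usage sketch inside a training script (the config values are illustrative):
tracker = ExperimentTracker()
tracker.start_experiment({"model_size": "7B", "learning_rate": 2e-5})
tracker.log_metrics({"loss": 1.23}, step=100)
tracker.log_model("/output/cpt_model", "cpt_model")
tracker.end_experiment()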
Conclusion
With this guide we have built a fully automated, Kubernetes-based training pipeline for Korean LLMs.
Key outcomes:
- 🚀 Full automation: CPT → SFT → RLHF executed in sequence
- 📊 Real-time monitoring: tracking of GPU utilization and training progress
- ⚡ Efficient resource management: dynamic scheduling and autoscaling
- 🔄 Reproducibility: version control and experiment tracking
Practical value:
- Operational efficiency: minimal manual intervention enables 24/7 training
- Cost optimization: tighter resource usage reduces cloud spend
- Scalability: concurrent training of multiple models
- Reliability: built-in failure recovery and monitoring
Other posts in this series:
- Part 1: A Complete Guide to Korean-Specialized LLM Training with Unsloth
- Part 2: Unsloth Korean LLM Training Automation - Building a Kubernetes Pipeline (this post)
With this level of automation, both the productivity and the quality of Korean-specialized LLM development can improve substantially.