
Ollama recorded 52 million downloads in Q1 2026. It has moved well beyond a personal experimentation tool and is being used as team-level infrastructure. Running ollama run on a local Mac and operating a serving layer for an entire team on a Kubernetes cluster are architecturally different problems. This post covers the latter.

Why Ollama: Positioning Against vLLM

vLLM focuses on throughput optimization. PagedAttention, continuous batching, and FP8 inference are all about squeezing maximum throughput from GPU resources. Ollama’s strength is simple installation and model management: ollama pull llama3:70b downloads the model in one line, and the Ollama server exposes an OpenAI-compatible API with no further configuration.

The two tools occupy different layers rather than competing directly. vLLM fits a public inference endpoint where throughput matters. Ollama fits an internal code-assist tool or a small private chatbot used by a development team, where operational simplicity outweighs raw throughput.

Basic Kubernetes Deployment

Namespace

kubectl create namespace ollama
kubectl label namespace ollama kueue.x-k8s.io/team=internal-tools

Model PersistentVolumeClaim

Model files range from tens of GB to hundreds of GB. Without a PVC, every Pod restart triggers a full model re-download. That is an operational disaster.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-retain    # use the StorageClass appropriate for your cluster
  resources:
    requests:
      storage: 500Gi

If multiple Pods need to share the same model volume, you need a StorageClass that supports ReadWriteMany (NFS, CephFS, Azure Files, etc.). With ReadWriteOnce, the volume can be mounted by only one node at a time, so replicas scheduled onto other nodes cannot attach it.

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_MODELS
          value: "/models"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"         # number of concurrent requests to process
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"         # maximum number of models to keep in memory
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models

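OLLAMA_NUM_PARALLEL caps the number of requests each loaded model processes concurrently; requests beyond the cap are queued. Each parallel slot needs its own context, so memory use grows with OLLAMA_NUM_PARALLEL times the context length, and a value too high for the available GPU memory degrades performance or prevents the model from loading. The default depends on the Ollama version (some releases use 1, others auto-select based on available memory), so set it explicitly. At 1, requests are processed serially and latency grows under concurrent load.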
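
The memory cost of parallel slots can be estimated with simple arithmetic. The sketch below is a rough estimate, not Ollama's actual allocator: the function name kv_cache_bytes is hypothetical, it assumes a 16-bit KV cache, and it ignores runtime overhead. The layer and head counts are those of Llama 3 8B.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, num_parallel: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for num_parallel slots of num_ctx tokens each.

    bytes_per_value=2 assumes a 16-bit cache (an assumption; quantized
    caches are smaller).
    """
    # One key vector and one value vector per layer, per KV head, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx * num_parallel


# Llama 3 8B: 32 layers, 8 KV heads, head dimension 128.
total = kv_cache_bytes(32, 8, 128, num_ctx=8192, num_parallel=4)
print(f"{total / 1024**3:.1f} GiB")  # prints "4.0 GiB"
```

Under these assumptions, four parallel slots at an 8192-token context add about 4 GiB on top of the model weights, which is why the setting has to be sized against the GPU rather than raised freely.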

Service

apiVersion: v1
kind: Service
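metadata:
  name: ollama
  namespace: ollama
  labels:
    app: ollama        # lets a ServiceMonitor select this Service
spec:
  selector:
    app: ollama
  ports:
  - name: http         # named so monitoring can reference the port
    port: 11434
    targetPort: 11434
  type: ClusterIP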

For access from outside the cluster, expose via Ingress or LoadBalancer. Because Ollama has no built-in authentication, always place an auth proxy in front of any external exposure.

Custom Model Configuration with Modelfile

An Ollama Modelfile builds a custom model from a base model, with a fixed system prompt, parameters, and context length.

FROM llama3:8b

SYSTEM """
You are ThakiCloud's internal code review assistant.
You specialize in Go, Kubernetes YAML, and Python code.
Review in order: security vulnerabilities, performance issues, code style.
"""

# low temperature is better for code review
PARAMETER temperature 0.1
# enough context to handle long files
PARAMETER num_ctx 8192
PARAMETER num_predict 2048

Two approaches for building and deploying a Modelfile:

Option 1: Preload with an InitContainer

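initContainers:
- name: model-puller
  image: ollama/ollama:latest
  env:
  - name: OLLAMA_MODELS
    value: "/models"    # store models on the PVC, not in the container filesystem
  command:
  - sh
  - -c
  - |
    set -e
    ollama serve &
    SERVER_PID=$!
    # wait until the server answers instead of sleeping for a fixed time
    until ollama list >/dev/null 2>&1; do sleep 1; done
    ollama pull llama3:8b
    # the Modelfile is mounted from a ConfigMap via the modelfiles volume
    ollama create code-reviewer -f /modelfiles/Modelfile
    kill "$SERVER_PID"
  volumeMounts:
  - name: models
    mountPath: /models
  - name: modelfiles
    mountPath: /modelfiles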
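
Waiting for the server to come up before pulling is the fragile step in this pattern. The following is a minimal sketch of polling with a deadline, which is what a startup script or a Job wrapper would do. The helper names wait_until and server_is_up are hypothetical, and the address assumes a server on the default port of the same host.

```python
import time
import urllib.request
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout_s: float = 60.0,
               interval_s: float = 1.0) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def server_is_up(base_url: str = "http://127.0.0.1:11434") -> bool:
    """Return True if the server answers an HTTP GET on its root path."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error, ...
        return False
```

A caller would run wait_until(server_is_up) and abort the pull when it returns False, so a server that never starts fails the init step instead of hanging it.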

Option 2: Run as a Separate Job

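Run a separate Job after the Pod is up to pull the model and build the Modelfile. The Job can run the ollama CLI against the running server by setting OLLAMA_HOST to the Service address (for example http://ollama:11434). Run it once on initial deployment and again whenever the Modelfile changes.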

Structured Output

Ollama enforces JSON output via the format parameter:

curl http://ollama:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Find security vulnerabilities in the following code and return them as JSON:",
  "format": "json",
  "stream": false
}'
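
The same request can be issued from application code. This is a sketch using only the Python standard library; the function names are hypothetical, and the base URL assumes the in-cluster Service name used in this post.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3:8b",
                           base_url: str = "http://ollama:11434") -> urllib.request.Request:
    """Build a non-streaming /api/generate request with JSON output enabled."""
    payload = {"model": model, "prompt": prompt, "format": "json", "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request and return the generated text (the "response" field)."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Splitting request construction from sending keeps the payload easy to unit test without a running server.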

You can also pin the output format via the system prompt in the Modelfile:

SYSTEM """
Always return responses in the following JSON schema:
{"issues": [{"severity": "high|medium|low", "line": number, "description": string}]}
Include no text outside the JSON structure.
"""

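In practice, format: "json" only constrains the output to JSON syntax; the model does not always follow the schema described in the prompt. Recent Ollama versions also accept a JSON schema object as the format value, which constrains the output more tightly. Either way, a validation layer that parses each response and checks it against the schema is necessary.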
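
A validation layer for the issues schema above might look like the following. It is a sketch using only the standard library; the function name parse_review is hypothetical, and it treats line as an integer, which is stricter than the "number" in the prompt.

```python
import json

ALLOWED_SEVERITIES = {"high", "medium", "low"}


def parse_review(raw: str) -> list:
    """Parse a model response and check it against the issues schema.

    Raises ValueError when the text is not JSON or does not match.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("issues"), list):
        raise ValueError('expected an object with an "issues" list')
    for i, issue in enumerate(data["issues"]):
        if not isinstance(issue, dict):
            raise ValueError(f"issue {i} is not an object")
        severity = issue.get("severity")
        if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"issue {i}: bad severity {severity!r}")
        line = issue.get("line")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(line, int) or isinstance(line, bool):
            raise ValueError(f"issue {i}: line must be an integer")
        if not isinstance(issue.get("description"), str):
            raise ValueError(f"issue {i}: description must be a string")
    return data["issues"]
```

On a ValueError the caller can retry the request or surface the raw text for inspection, rather than passing malformed results downstream.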

Prometheus Monitoring

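The ServiceMonitor below assumes Prometheus metrics are available at /metrics. Verify that your Ollama build actually serves this endpoint; if it does not, run a metrics exporter or an instrumented proxy next to Ollama and point the ServiceMonitor at that instead: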

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama
  namespace: ollama
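spec:
  selector:
    matchLabels:
      app: ollama          # matches labels on the Service, not on the Pod
  endpoints:
  - port: http             # name of a Service port; the Service must define it
    path: /metrics
    interval: 30s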

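Key metrics (exact names depend on the Ollama version or exporter in use):

# total number of completed requests (cumulative counter; wrap in rate() for requests per second)
ollama_request_duration_seconds_count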

# average processing time
rate(ollama_request_duration_seconds_sum[5m])
/ rate(ollama_request_duration_seconds_count[5m])

# number of loaded models
ollama_loaded_model_count
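
The average processing time query divides two rates taken over the same window, so the window length cancels and what remains is the increase in total duration divided by the increase in request count. A small sketch of that arithmetic on two scrapes, with a hypothetical function name and made-up sample values:

```python
from typing import Optional


def average_latency(sum_then: float, count_then: float,
                    sum_now: float, count_now: float) -> Optional[float]:
    """Mean request duration between two scrapes of a _sum/_count pair.

    Returns None when no requests completed in the interval.
    """
    delta_count = count_now - count_then
    if delta_count <= 0:
        return None
    return (sum_now - sum_then) / delta_count


# 30 requests took 60 seconds in total between the two scrapes.
print(average_latency(100.0, 50, 160.0, 80))  # prints 2.0
```

The None case matters in dashboards: with no traffic in the window, the PromQL expression divides zero by zero and the panel shows no data rather than zero latency.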

HPA Autoscaling

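HPA for GPU workloads can scale on GPU utilization: collecting it from NVIDIA’s DCGM Exporter into Prometheus and exposing it through a custom metrics adapter such as Prometheus Adapter makes it available as a custom HPA metric. The example below uses the same mechanism with a per-Pod queue depth metric, ollama_queue_depth, which must be exported and registered with the adapter before the HPA can read it.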

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: ollama_queue_depth    # queued request count (custom metric)
      target:
        type: AverageValue
        averageValue: "10"

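When GPU nodes are insufficient, HPA scale-out attempts leave Pods in Pending state. Node-level scaling requires Cluster Autoscaler or Karpenter in addition to HPA. New replicas also need to mount the model volume, so a ReadWriteOnce PVC limits scale-out to the node that already holds it; use ReadWriteMany storage when maxReplicas is greater than 1.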
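
To see how the HPA above reacts to a given queue depth, the scaling rule from the Kubernetes documentation can be written out directly. This is a simplified sketch with a hypothetical function name: it applies the documented ratio formula and the default 10% tolerance, and leaves out stabilization windows and the handling of unready Pods.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """desired = ceil(current_replicas * current_value / target_value), clamped."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two Pods averaging 25 queued requests against a target of 10:
print(desired_replicas(2, 25, 10, min_replicas=1, max_replicas=4))  # prints 4
```

The unclamped result here is 5, so the HPA stops at maxReplicas; whether those four Pods actually start still depends on GPU nodes being available.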

Authentication Proxy Pattern

Ollama has no built-in authentication. Even on an internal service, leaving it open means anyone who can reach it can use the model. The simplest approach is to put OAuth2 Proxy, or Nginx validating an API key, in front of the Service.

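# Nginx ConfigMap example
nginx.conf: |
  events {}
  http {
    server {
      listen 8080;    # assumed proxy port
      location / {
        # reject requests lacking the expected X-Api-Key header (load the key from a Secret in production)
        if ($http_x_api_key != "your-team-key") {
          return 401;
        }
        proxy_pass http://ollama:11434;
      }
    }
  }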

Integrating with an IdP such as Keycloak allows per-team access control.
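
If the key check lives in a small custom proxy or middleware instead of Nginx, the comparison itself deserves care. The sketch below shows a constant-time check using the standard library. The function name is_authorized is hypothetical, and the X-Api-Key header name follows the Nginx example.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Return True only if the X-Api-Key header matches expected_key."""
    if not expected_key:
        # An unset key must never match an absent header.
        return False
    # HTTP header names are case-insensitive.
    supplied = next(
        (value for name, value in headers.items() if name.lower() == "x-api-key"),
        "",
    )
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```

Extending this to per-team keys is a matter of looking up the supplied key in a table, which is the point where an IdP integration becomes the better option.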

Operational Tips

Schedule model updates as a separate Job. ollama pull can run alongside a live Pod, but capacity issues during updates sometimes cause Pod restarts. Running the update as a Job during a maintenance window is safer.

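Tune OLLAMA_MAX_LOADED_MODELS to match GPU memory. Two 70B models loaded simultaneously will exhaust the VRAM of a single GPU: llama3:70b at its default 4-bit quantization is about 40 GB of weights before any context memory is counted. Calculate the model size relative to available VRAM and set this value accordingly.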

Adjust the log level. Ollama logs every request, and debug logging (OLLAMA_DEBUG=1) adds much more detail on top of that. Debug logging is off by default; make sure it is not enabled in production to keep log volume down.
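
The VRAM sizing tip above reduces to simple arithmetic. The following sketch uses a hypothetical function name and assumes all loaded models are the same size; the 2 GiB reserve for runtime overhead is an assumption, not a measured figure.

```python
def max_loaded_models(vram_gib: float, model_gib: float,
                      kv_cache_gib: float = 0.0,
                      reserve_gib: float = 2.0) -> int:
    """How many equally sized models fit in VRAM at once.

    kv_cache_gib is the context memory per model; reserve_gib is headroom
    for runtime overhead (an assumed value).
    """
    per_model = model_gib + kv_cache_gib
    usable = vram_gib - reserve_gib
    if per_model <= 0 or usable <= 0:
        return 0
    return int(usable // per_model)


# One 80 GiB GPU, a 40 GiB model, 4 GiB of context memory per model:
print(max_loaded_models(80, 40, kv_cache_gib=4))  # prints 1
```

Under these assumptions the Deployment's value of 2 for OLLAMA_MAX_LOADED_MODELS suits small models such as an 8B variant, not a pair of 70B models.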

Summary

Running Ollama properly on Kubernetes requires four things: a model PVC, GPU tolerations, an auth proxy, and monitoring. Using Modelfile to configure team-specific models puts the system prompt and parameters under version control. For internal tool serving where operational simplicity matters more than throughput, Ollama is a good choice relative to its setup cost.