GPU Workload Preemption with Kueue: ClusterQueue Design and Priority Patterns
Sharing a GPU cluster across multiple teams surfaces two recurring problems. First, a high-priority job has no way to reclaim GPUs held by a lower-priority team. Second, GPUs sit idle while a team is not actively using them. Kueue addresses both problems through Preemption and Quota Borrowing.
How Kueue Differs from the Default Kubernetes Scheduler
The default Kubernetes scheduler works one Pod at a time. With PriorityClass it can order Pending Pods and even evict lower-priority Pods from a node to make room for a higher-priority one, but it has no notion of per-team quota, and it does not treat a Job as a unit that is admitted or preempted as a whole.
Kueue inserts a Workload abstraction above Pods. Rather than acting as a scheduler, it acts as a workload queue manager. ClusterQueues define quotas; Kueue decides which Workload to Admit, when to Admit it, and whether to Preempt another Workload.
Request arrives -> LocalQueue -> ClusterQueue -> Admit or Pending
                                      |
                             when quota exceeded
                         search for preemption target
                                      |
                                      -> Preempt lower priority
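The flow above can be sketched as a small decision function. This is a toy model, not Kueue's implementation: the `Workload` class, the `decide` function, and the "lowest priority first" victim order are assumptions made for illustration, and only a single GPU quota is tracked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    priority: int
    gpus: int


def decide(pending, admitted, quota):
    """Return (action, victims) where action is 'admit', 'preempt', or 'pending'."""
    free = quota - sum(w.gpus for w in admitted)
    if pending.gpus <= free:
        return "admit", []

    # Quota exceeded: consider strictly lower-priority workloads, lowest first.
    candidates = sorted(
        (w for w in admitted if w.priority < pending.priority),
        key=lambda w: w.priority,
    )
    victims = []
    for w in candidates:
        victims.append(w)
        free += w.gpus
        if pending.gpus <= free:
            return "preempt", victims

    # Evicting every lower-priority workload would still not free enough GPUs.
    return "pending", []


running = [
    Workload("train-a", priority=100, gpus=4),
    Workload("notebook", priority=500, gpus=2),
    Workload("serve-x", priority=1000, gpus=2),
]
action, victims = decide(Workload("serve-y", 1000, 4), running, quota=8)
print(action, [v.name for v in victims])  # preempt ['train-a']
```

Here the 8-GPU quota is full, so the new serving workload is admitted only after the lowest-priority training workload is chosen as the preemption target.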
ClusterQueue Design Basics
A ClusterQueue sets quotas per team or project. GPU, CPU, and memory are allocated per resource flavor.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: inference-team
spec:
  cohort: shared-gpu-pool   # shared with training-team; borrowing needs a cohort
  namespaceSelector:
    matchLabels:
      kueue.x-k8s.io/team: inference
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]   # same order as the flavor's resources
    flavors:
    - name: h100-flavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "8"
        borrowingLimit: "4"   # can borrow up to 4 GPUs of unused quota from the cohort
      - name: cpu
        nominalQuota: "64"
      - name: memory
        nominalQuota: "256Gi"
  preemption:
    reclaimWithinCohort: LowerPriority   # preempt lower priority when reclaiming borrowed quota
    borrowWithinCohort:
      policy: LowerPriority
      maxPriorityThreshold: 100          # while borrowing, only preempt workloads at priority 100 or below
    withinClusterQueue: LowerPriority    # preempt lower priority within the same queue
A Cohort is a group of ClusterQueues that share quota. Queues within the same Cohort can borrow from one another.
Priority Design Patterns
A practical priority hierarchy for a GPU cluster typically has three tiers.
Tier 1: Production Inference (highest priority)
A serving endpoint that goes down is an outage. Set preemptionPolicy: PreemptLowerPriority so that a traffic spike can reclaim GPUs from lower-priority training Pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-prod
value: 1000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
Tier 2: Interactive Experiments (medium priority)
This tier covers workloads where researchers run Jupyter sessions or short experiments. Responsiveness matters more than for training, but less than for serving.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-experiment
value: 500
preemptionPolicy: PreemptLowerPriority
Tier 3: Batch Training (low priority)
Long training jobs are the first preemption target. Keeping checkpoint intervals short minimizes the work lost when a preemption occurs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training
value: 100
preemptionPolicy: Never   # Pods in this tier never preempt other Pods
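How the three tiers interact with the ClusterQueue preemption settings shown earlier can be checked with a few lines of Python. This is a simplified model of the LowerPriority policy and maxPriorityThreshold only; the `can_preempt` helper is hypothetical, and real Kueue decisions also depend on which queue a workload sits in and how much quota is in use.

```python
# Priority values taken from the three PriorityClass definitions above.
PRIORITIES = {
    "inference-prod": 1000,
    "interactive-experiment": 500,
    "batch-training": 100,
}


def can_preempt(pending, victim, borrowing, max_priority_threshold=100):
    """True if a workload of class `pending` may preempt one of class `victim`.

    borrowing: whether the pending workload needs borrowed quota to be admitted.
    """
    p, v = PRIORITIES[pending], PRIORITIES[victim]
    if v >= p:
        return False  # LowerPriority: victims must have strictly lower priority
    if borrowing and v > max_priority_threshold:
        return False  # while borrowing, victims must be at or below the threshold
    return True


print(can_preempt("inference-prod", "interactive-experiment", borrowing=False))  # True
print(can_preempt("inference-prod", "interactive-experiment", borrowing=True))   # False
print(can_preempt("interactive-experiment", "batch-training", borrowing=True))   # True
```

With a threshold of 100, a workload that has to borrow can displace only batch training, so interactive sessions are safe from borrowers even when those borrowers have higher priority.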
Quota Borrowing in Practice
Borrowing avoids quota waste while still providing burst capacity. If inference-team holds 8 GPU slots but is only using 4, training-team can borrow those 4.
# training-team ClusterQueue (excerpt)
spec:
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-flavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "4"
        borrowingLimit: "8"   # can borrow up to 8 (nominalQuota + borrowed = 12 max)
  cohort: shared-gpu-pool     # same cohort as inference-team
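When inference-team submits new work that fits within its nominal quota, Kueue reclaims the borrowed GPUs from training-team. With reclaimWithinCohort: LowerPriority, only borrowing workloads with a lower priority than the incoming workload are eligible for preemption; setting it to Any reclaims regardless of priority.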
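The borrowing arithmetic for the two queues can be worked through in a short sketch. It assumes a cohort with only these two ClusterQueues and ignores other Kueue settings such as lending limits; `borrowable_gpus` is a hypothetical helper, not a Kueue API.

```python
def borrowable_gpus(cohort, queue, borrowing_limit):
    """GPUs `queue` could borrow right now from the rest of its cohort.

    cohort maps queue name -> (nominal_quota, gpus_in_use).
    """
    unused_elsewhere = sum(
        max(nominal - in_use, 0)
        for name, (nominal, in_use) in cohort.items()
        if name != queue
    )
    # Borrowing is capped by the queue's own borrowingLimit and by what is idle.
    return min(borrowing_limit, unused_elsewhere)


cohort = {
    "inference-team": (8, 4),  # 8 nominal, 4 in use
    "training-team": (4, 4),   # 4 nominal, fully used
}
extra = borrowable_gpus(cohort, "training-team", borrowing_limit=8)
print(extra)      # 4
print(4 + extra)  # 8 GPUs usable by training-team in total

# inference-team ramps back up to its full nominal quota.
cohort["inference-team"] = (8, 8)
print(borrowable_gpus(cohort, "training-team", borrowing_limit=8))  # 0
```

The borrowingLimit of 8 is an upper bound, not a guarantee: training-team gets only the 4 GPUs that inference-team is leaving idle, and nothing once inference-team needs its full quota again.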
In practice, interaction with PodDisruptionBudget can produce unexpected behavior, so test it rather than assuming a PDB will protect a workload. The Pod termination grace period (terminationGracePeriodSeconds) must also be accounted for: if it is shorter than the time needed to write a checkpoint, the container is killed mid-write and the checkpoint is lost.
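A rough budget check makes the grace-period constraint concrete. The checkpoint size, write throughput, and safety factor below are made-up illustrative numbers, and `checkpoint_fits` is a hypothetical helper; measure real checkpoint times on your own storage before relying on any estimate.

```python
def checkpoint_fits(checkpoint_gib, write_gib_per_s, grace_period_s, safety_factor=2.0):
    """True if a checkpoint write, padded by safety_factor, fits in the grace period."""
    write_seconds = checkpoint_gib / write_gib_per_s
    return write_seconds * safety_factor <= grace_period_s


# 40 GiB checkpoint at 0.5 GiB/s takes about 80 s, or 160 s with the 2x margin.
print(checkpoint_fits(40, 0.5, grace_period_s=30))   # False: the 30 s Kubernetes default is too short
print(checkpoint_fits(40, 0.5, grace_period_s=300))  # True
```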
Protecting GPU Nodes
Prevent CPU workloads from landing on GPU nodes by tainting GPU nodes and attaching tolerations only to GPU workloads.
# add taint to GPU node
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule

# toleration in Kueue Workload
spec:
  podSets:
  - name: main
    template:
      spec:
        tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
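Without this combination, CPU-bound Pods, including DaemonSets such as log collectors, proxies, and monitoring agents, can be scheduled onto GPU nodes and consume the CPU and memory that GPU workloads need. Any DaemonSet that must run on GPU nodes, such as a device plugin, needs the same toleration.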
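The matching rule behind the taint and toleration pair can be sketched in Python. This is a simplified re-implementation for illustration, using plain dictionaries shaped like the YAML above; it is not the scheduler's code, and it ignores PreferNoSchedule and tolerationSeconds.

```python
def tolerates(toleration, taint):
    """Simplified Kubernetes rule for whether one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable(pod_tolerations, node_taints):
    """A Pod can land on a node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "present",
    "effect": "NoSchedule",
}

print(schedulable([gpu_toleration], [gpu_taint]))  # True: GPU workload
print(schedulable([], [gpu_taint]))                # False: plain CPU Pod
```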
MultiKueue: Distributing Work Across Clusters
MultiKueue is currently in beta and enabled by default. It is a major feature on the 2026 Kueue roadmap. A manager cluster receives jobs and distributes them across worker clusters.
     [manager cluster]
  MultiKueue ClusterQueue
             |
        -----+-----
        |         |
   [worker-1] [worker-2]
   (A100 x 8) (H100 x 4)
Register worker clusters in the manager cluster:
apiVersion: kueue.x-k8s.io/v1beta1
kind: MultiKueueCluster
metadata:
  name: worker-cluster-a100
spec:
  kubeConfig:
    locationType: Secret
    location: worker-a100-kubeconfig
The distribution algorithm can be customized through the MultiKueue Dispatcher. Custom dispatchers are wired in as plugins alongside the built-in algorithms.
Cooperative Preemption and Checkpoints
Cooperative Preemption is another notable item on the 2026 Kueue roadmap. Workloads that implement checkpointing would, upon receiving a preemption signal, save state and then exit rather than terminating immediately.
Until that feature ships, the practical equivalent is to set terminationGracePeriodSeconds longer than the worst-case checkpoint write time and write a SIGTERM handler in the training code that saves a checkpoint before exiting.
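import signal
import sys

def checkpoint_and_exit(signum, frame):
    # save_checkpoint, model, optimizer, and current_epoch come from the training script
    save_checkpoint(model, optimizer, current_epoch)
    sys.exit(0)

# Kubernetes sends SIGTERM at the start of the termination grace period
signal.signal(signal.SIGTERM, checkpoint_and_exit)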
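A variation worth considering is to have the handler only set a flag and let the training loop checkpoint at the next step boundary, so the save never interrupts a step halfway through. The sketch below is one way to do that; `GracefulStop`, `train`, and the stand-in `save_checkpoint` are hypothetical names, and the sleep stands in for a real training step.

```python
import signal
import time


class GracefulStop:
    """Signal handler that only records the request; it does no I/O itself."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def save_checkpoint(step):
    # Hypothetical stand-in for real checkpoint code (e.g. writing model state).
    print(f"checkpoint saved at step {step}")


def train(max_steps, stop):
    """Run up to max_steps; checkpoint and return early once a stop is requested."""
    for step in range(max_steps):
        time.sleep(0.01)  # stand-in for one training step
        if stop.requested:
            save_checkpoint(step)
            return step
    return max_steps


stop = GracefulStop()
signal.signal(signal.SIGTERM, stop)
```

Calling `train(1000, stop)` runs to completion unless SIGTERM arrives, in which case it saves once at the end of the current step and returns. The time for one step plus the checkpoint write still has to fit inside the grace period.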
When Cooperative Preemption is formally supported, the expected flow will have Kueue wait for the checkpoint to complete before Admitting the new Workload.
Common Pitfalls
Pitfall 1: Not validating preemption policy before going to production. Run a real preemption scenario on a development cluster. Verify that the combination of PDB, grace period, and checkpoint duration behaves as expected.
Pitfall 2: Setting borrowingLimit without a Cohort. A ClusterQueue not attached to a Cohort has nothing to borrow from, regardless of borrowingLimit.
Pitfall 3: Confusing LocalQueue and ClusterQueue. LocalQueue is namespace-scoped; ClusterQueue is cluster-scoped. Per-namespace team isolation combines the two: a LocalQueue in each team namespace points at a ClusterQueue, and the ClusterQueue's namespaceSelector limits which namespaces may submit to it.
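Pitfall 2 is easy to catch mechanically before manifests reach a cluster. The following is a minimal sketch that assumes the ClusterQueue manifests have already been parsed into Python dictionaries (for example by a YAML loader); `borrowing_without_cohort` is a hypothetical helper, not part of Kueue.

```python
def borrowing_without_cohort(cluster_queues):
    """Names of ClusterQueues that set borrowingLimit but have no spec.cohort."""
    flagged = []
    for cq in cluster_queues:
        spec = cq.get("spec", {})
        if spec.get("cohort"):
            continue
        has_limit = any(
            "borrowingLimit" in resource
            for group in spec.get("resourceGroups", [])
            for flavor in group.get("flavors", [])
            for resource in flavor.get("resources", [])
        )
        if has_limit:
            flagged.append(cq["metadata"]["name"])
    return flagged


def gpu_queue(name, cohort=None):
    """Build a minimal ClusterQueue dict with one GPU resource that sets borrowingLimit."""
    spec = {
        "resourceGroups": [{
            "coveredResources": ["nvidia.com/gpu"],
            "flavors": [{
                "name": "h100-flavor",
                "resources": [
                    {"name": "nvidia.com/gpu", "nominalQuota": "4", "borrowingLimit": "8"},
                ],
            }],
        }],
    }
    if cohort:
        spec["cohort"] = cohort
    return {"metadata": {"name": name}, "spec": spec}


queues = [gpu_queue("training-team", cohort="shared-gpu-pool"), gpu_queue("orphan-team")]
print(borrowing_without_cohort(queues))  # ['orphan-team']
```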
Summary
Kueue is one of the very few production-grade tools for managing GPU quotas on Kubernetes. The ClusterQueue-Cohort-Preemption combination lets you express fair GPU allocation across teams as code. Always validate preemption policies against real workloads, and make sure the checkpoint write time fits inside terminationGracePeriodSeconds to achieve lossless preemption.