Don’t Pick an Inference Engine. Pick a Hardware Strategy.

Abstract image of hardware foundations determining the shape of software above them Foundations of different sizes (hardware) determining the shape of the software that flows on top of them.

Overview

When you decide to serve an LLM, the first question you usually ask is “which inference engine should I use?” You list names like vLLM, SGLang, TensorRT-LLM, and llama.cpp and start comparing benchmark tables. Recently, an engineer pointed out that this very question has the order wrong, and the observation gained attention. The core argument is simple: “You are not picking an inference engine. You are picking a hardware strategy, and the engine follows from that.”

ThakiCloud runs a platform that pools GPU resources on Kubernetes and serves inference workloads for multiple customers. The hardware-first framework is therefore not just a one-line insight from a tweet. It is the language of decisions we make every day when designing GPU scheduling and serving stacks. This post summarizes the framework and explains why we start with vLLM as our default in most cases and only move to another engine when a bottleneck is measured.

The Core Argument: Pick a Hardware Strategy, Not an Engine

The original author’s argument is not to rank inference engines in an abstract performance order. An engine is not fast or slow in a vacuum. It is fast or slow on specific hardware, for a specific workload shape. So the starting point for a decision should not be “what is the fastest engine?” but “what hardware do I have, and what shape is my workload?” Once hardware and workload are defined, the viable engine candidates naturally narrow to one or two.

This way of thinking matters because people often pick engines based on benchmarks that have nothing to do with their environment. Peak throughput numbers measured on an 8-GPU H100 cluster are nearly meaningless for someone running on a MacBook or one or two RTX cards. Conversely, taking a lightweight engine that runs well on an edge box and throwing it at large-scale concurrent request serving will break it. What you need to match is “the runtime to the shape of the workload, the hardware, and the amount of operational burden you can absorb.”

The Hardware-to-Engine Map

The cheat sheet from the original post maps engines to hardware scenarios. The essentials:

Hardware / Situation                    Suitable Engine
------------------------------------------------------------
CPU, Mac, edge, unusual env, low VRAM  -> llama.cpp (GGUF, hybrid offload)
  + plenty of system RAM
Mac (Apple Silicon) centered workflow  -> MLX
Consumer GPU 1-4x RTX                  -> ExLlamaV2 / ExLlamaV3
General production serving             -> vLLM
Complex infra, long context, MoE       -> SGLang (RadixAttention)
Maximum NVIDIA performance needed      -> TensorRT-LLM (compiled engine)

A bit more on each engine’s character:

llama.cpp: Works almost everywhere. Strong when VRAM is tight and system RAM is plentiful, handling GGUF quantization and hybrid offload. Good for personal experiments, edge devices, and laptops.
MLX: Optimized for Apple Silicon. Natural choice for Mac-centered workflows.
ExLlamaV2/V3: Specialized for running quantized models efficiently on one to four consumer GPUs.
vLLM: Broadest hardware and model coverage with predictable performance. Default candidate for general production serving.
SGLang: Strong for high concurrency, long context, and MoE workloads via RadixAttention and multi-call scheduling. Comes with more infrastructure complexity.
TensorRT-LLM: Delivers the highest raw throughput via NVIDIA’s compiled engine path. Comes with substantial operational complexity.

Comparative analyses generally converge on the same conclusions. For mid-to-high volume production, vLLM or SGLang offer a reasonable throughput-developer-experience balance. TensorRT-LLM’s compiled path is worth investing in only for extreme volume or latency-critical cases. The recommended starting point for most teams is vLLM, moving to another engine only once profiling confirms a bottleneck.

Why This Order Is Right

“Engine first” is dangerous because it is premature optimization. Picking the engine that is supposedly fastest before you know what the actual measured bottleneck is means paying the cost of operational complexity upfront in exchange for uncertain gains. TensorRT-LLM’s compiled path is genuinely fast, but it comes with continuous operational overhead: engine builds, version management, and model conversion pipelines. If the workload does not justify that complexity, it is a net loss.

“Hardware and workload first” starts from measurable facts instead. The GPU type and count you have, model size, context length, and concurrent request patterns are all observable values. Once those are fixed, the candidate engines narrow, and picking the simplest engine that addresses the measured bottleneck is rational. Simplicity has its own operational value. Fewer outages, easier upgrades, fewer people tied up.

Application to ThakiCloud K8s AI/ML SaaS Platform and Implications

This framework maps almost exactly onto ThakiCloud’s serving operations philosophy. We pool GPUs on Kubernetes, manage queuing and scheduling with Kueue, and serve inference workloads in a multi-tenant setup. In this environment, “hardware strategy first” becomes concrete in the following ways.

The shape of the GPU pool determines engine candidates. The GPU types and topology of our node pools constitute our hardware strategy. For our baseline situation of providing general-purpose serving to multiple tenants on data-center-grade NVIDIA GPUs, vLLM with its broad coverage and predictable performance is a natural default. So we start with vLLM when onboarding a new model.
Workload shape drives branching. When a specific tenant’s workload skews toward long context, MoE, or high concurrency, features like SGLang’s RadixAttention become meaningful. Conversely, for an extreme-volume service with fixed hardware and latency as the critical constraint, that is when we evaluate TensorRT-LLM’s compiled path. The key is making that branching decision based on measured bottlenecks, not guesswork.
This applies directly to on-premises and self-hosting. For public institutions, research organizations, and financial customers who cannot depend on external APIs, “what hardware do you have?” determines everything. Matching the engine to available GPU generation and count, then optimizing cost-per-throughput within their own infrastructure, is exactly the differentiation. Hardware-first thinking is especially powerful in constraint-bound environments.

The most practical conclusion from an operator’s perspective is the rule “default to vLLM, migrate only when a bottleneck is proven.” In a multi-tenant environment, running different engines per tenant multiplies management overhead. Keeping one standard engine and attaching specialized engines only for measured exceptions is much cheaper in the long run.

Limitations and Counterarguments

This framework is not a universal solution. A few points worth noting.

The cheat sheet is a starting point, not a conclusion. Even within vLLM, version, quantization method, and batch/KV-cache settings can swing results significantly. “Once you have vLLM you are done” is wrong. Tuning within a chosen engine is another major variable.
The premise that you can choose your hardware does not always hold. “Hardware strategy first” is most powerful when you have latitude to choose hardware. Organizations already locked into specific GPUs effectively face an engine-only decision. Even so, narrowing candidates by workload shape remains valid.
Numbers must come from your own measurements. Throughput and latency figures in third-party comparisons depend heavily on the measurement environment. Numbers produced without profiling on your own model, your own hardware, and your own traffic patterns are reference values only. That is also why this post avoids asserting specific throughput numbers as definitive.
The engine ecosystem moves fast. The current strength distribution can shift next quarter. That is why the operational routine needs to be not “pick once and done” but “keep a standard engine and re-evaluate periodically.”

In summary, the advice “don’t pick an inference engine, pick a hardware strategy” is good discipline for avoiding premature optimization and starting from measurable facts. For ThakiCloud, this is not a new idea. It is a concise summary of the principle we already follow when designing our GPU pool and multi-tenant serving. Start with the simplest, broadest engine, and introduce complex specialized engines only when a bottleneck proves itself.

Sources

Original tweet (@TheAhmadOsman): x.com/TheAhmadOsman/status/2051779046468673986
Shared tweet (via @hjguyhan): x.com/hjguyhan/status/2069049583276298276
Supplementary comparative analyses: vLLM vs SGLang vs TensorRT-LLM vs Ollama (LeetLLM, 2026), Best LLM Inference Engines 2026 (Yotta Labs)

Don’t Pick an Inference Engine. Pick a Hardware Strategy.

Overview

The Core Argument: Pick a Hardware Strategy, Not an Engine

The Hardware-to-Engine Map

Why This Order Is Right

Application to ThakiCloud K8s AI/ML SaaS Platform and Implications

Limitations and Counterarguments

Sources

참고

추론 엔진을 고르지 마세요, 하드웨어 전략을 고르세요

Hermes Bible: Hermes Agent 문서와 실전 워크플로를 한 번에 검색합니다

Fable 5가 내 코드의 100%를 쓴다는 말, 플랫폼 관점에서 따져봅니다

Claude Code로 영상을 만든다: claude-code-video-toolkit을 직접 돌려봤습니다