753B Open Weights on a Consumer GPU: GLM-5.2 and the Economics of On-Premise LLM Serving

Running a 753B-parameter model on a single consumer GPU would have been hard to imagine a few years ago. A recently shared community report describes running the SOTA open-weight model GLM-5.2 (753B, FP8) on an RTX 4090 consumer GPU, reportedly for the first time. It manages roughly 10 tok/s, but throughput is not the point. The point is that it runs at all.

At ThakiCloud, we handle model serving on a Kubernetes (K8s)-based AI/ML SaaS platform. This post looks at what the case implies for the economics of serving large LLMs on-premise.

What Made It Possible: Porting the Sparse-Attention Kernel

The case combines two techniques to squeeze a large model onto a small GPU.

FP8 quantization: Representing weights in 8-bit floating point shrinks the memory footprint to one byte per parameter, half of what FP16 or BF16 requires.
Porting the DSA sparse-attention kernel to the Ada architecture (sm_89): GLM-5.2’s DSA (sparse attention) kernel was ported to the RTX 4090’s Ada Lovelace architecture (compute capability 8.9, build target sm_89). Sparse attention computes only the token pairs it selects as important instead of every pair, saving compute and memory on long contexts.
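
The memory arithmetic behind the FP8 point can be sketched in a few lines. This is a back-of-the-envelope sketch, not a measurement: it counts weights only (no KV cache, activations, or runtime overhead), the function name `weight_footprint_gb` is illustrative, and the 24 GB figure is the RTX 4090's VRAM capacity, which the case itself does not state.

```python
# Weight-only footprint estimate; ignores KV cache, activations, and overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_footprint_gb(num_params: float, dtype: str) -> float:
    """Return the size of the weights in GB (10^9 bytes) at a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


PARAMS = 753e9         # parameter count quoted in the case
RTX_4090_VRAM_GB = 24  # assumption: RTX 4090 VRAM, not stated in the case

for dtype in ("fp16", "fp8"):
    size = weight_footprint_gb(PARAMS, dtype)
    ratio = size / RTX_4090_VRAM_GB
    print(f"{dtype}: {size:.0f} GB of weights, {ratio:.1f}x the card's VRAM")
```

Under these assumptions, FP8 halves the footprint but still leaves the weights well beyond a single card's VRAM, so quantization is one ingredient rather than the whole explanation.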

Roughly 10 tok/s is slow for production serving. The figure also comes from a single measurement in the author’s own environment, so it is best treated as an [estimate]. What matters is that a path to running a 753B model without dedicated datacenter GPUs has opened up.

What It Means from a Data Scientist/Engineer’s Perspective

Kernel porting equals accessibility: When a model uses a new attention mechanism, how widely its kernel is ported across GPU architectures determines who can run it. Even a SOTA model reaches a narrower ecosystem if its kernel is locked to specific hardware.
Sparsity unlocks long context: Sparse attention like DSA is a key technique for lowering the compute and memory cost of long-context serving. As context grows, the cost of dense attention rises quadratically with sequence length, while sparse attention limits how many tokens each query attends to.
Throughput is a trade-off: 10 tok/s is the price of fitting a large model on small hardware. Production serving means balancing model size, hardware, and throughput against the workload.
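
To make the quadratic-versus-sparse point concrete, the sketch below counts query-key pairs for dense causal attention and for a generic top-k scheme in which each query keeps at most `k` keys. It is an illustration under stated assumptions, not a description of DSA: the selection rule, the budget `K = 2048`, and the function names are all hypothetical, and the cost of choosing which keys to keep is not counted.

```python
def dense_causal_pairs(n: int) -> int:
    """Query-key pairs scored by dense causal attention over n tokens."""
    return n * (n + 1) // 2


def topk_sparse_pairs(n: int, k: int) -> int:
    """Pairs attended when each query keeps at most k of its visible keys."""
    return sum(min(i + 1, k) for i in range(n))


K = 2048  # hypothetical per-query budget, not a GLM-5.2 setting

for n in (8_192, 65_536, 131_072):
    dense = dense_causal_pairs(n)
    sparse = topk_sparse_pairs(n, K)
    print(f"n={n:>7}: dense={dense:.2e}  sparse={sparse:.2e}  "
          f"ratio={dense / sparse:.1f}x")
```

The dense count grows with the square of the context length while the sparse count grows roughly linearly once the context exceeds the budget, which is why the gap widens as contexts get longer.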

ThakiCloud’s View: On-Premise Large-LLM Serving

The real reason this case matters is data sovereignty and the wider range of serving options. In sensitive domains there is clear demand to run 753B-class SOTA models in-house rather than sending data to an external API. 10 tok/s on a single consumer GPU is a demo, but scaling out across multiple GPUs with batching and tensor parallelism could bring it to practical throughput.
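
As a rough sizing aid for the multi-GPU case, the sketch below estimates the smallest number of GPUs whose combined memory can hold the FP8 weights when they are sharded evenly. The helper `min_gpus_for_weights`, the 90% usable-memory fraction, and the 24 GB and 80 GB card sizes are assumptions for illustration, not figures from the case.

```python
import math


def min_gpus_for_weights(weights_gb: float, vram_gb: float,
                         usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose combined usable VRAM holds the weights."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return math.ceil(weights_gb / (vram_gb * usable_fraction))


WEIGHTS_GB = 753.0  # 753B parameters at one byte each (FP8)

# Hypothetical card sizes; usable_fraction leaves headroom for the runtime.
for name, vram_gb in (("24 GB card", 24), ("80 GB card", 80)):
    count = min_gpus_for_weights(WEIGHTS_GB, vram_gb)
    print(f"{name}: at least {count} GPUs for weights alone")
```

This is only a lower bound. Batched serving also needs memory for the KV cache, and tensor-parallel degrees are usually constrained to values that divide the model's attention heads evenly, so real deployments tend to round up.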

This is exactly the space we work in: serving large open-weight models sharded across multiple GPUs on K8s, allocating GPU resources with Kueue, and integrating per-model optimizations such as sparse-attention kernels into a standardized serving stack. Growing a single-machine demo into multi-tenant production serving is the core challenge.

Closing

Running GLM-5.2 on an RTX 4090 is a signal that an on-premise serving path for large SOTA models has opened up. Kernel porting and sparse attention create accessibility, while quantization reduces the memory footprint. For engineers interested in scaling this into organization-grade serving infrastructure, this is the kind of problem we work on every day.

Sources

GLM-5 model card (Zhipu AI / Z.ai, zai-org): https://huggingface.co/zai-org/GLM-5
GLM-4.6 model card (prior generation reference): https://huggingface.co/zai-org/GLM-4.6
GLM-4.5 / 4.6 / MoE architecture docs (Hugging Face Transformers): https://huggingface.co/docs/transformers/model_doc/glm4_moe

The single RTX 4090 case and the ~10 tok/s throughput above are an [estimate] based on community reports, not an independently verified official benchmark. DSA (sparse attention) is confirmed in public materials as integrated into the GLM-5 family; details for the specific minor version (5.2) follow the official model cards.
