From TPU v2 to Ironwood: Five Generations of Google’s Training Supercomputers

How has the infrastructure for training large-scale AI models evolved? A paper Google published on arXiv, “Google’s Training Supercomputers from TPU v2 to Ironwood” (arXiv:2606.15870, submitted June 14, 2026), traces five generations of TPUs along five axes: architectural stability, scale, resilience, power efficiency, and sustainability. Because it covers how the entire system was scaled rather than single-chip performance, it offers direct lessons for teams that operate AI infrastructure.

ThakiCloud works with GPU workloads and training infrastructure on a K8s-based AI/ML SaaS platform. Let’s look at why this paper is useful for both data scientists and infrastructure engineers.

📄 Full in-depth review (DOCX): You can download the detailed peer review from Google Drive.

Evolution Viewed Through Five Axes

The five axes the paper uses to evaluate the five generations amount, in their own right, to a framework for assessing large-scale training infrastructure.

Architectural Stability: Keeping the core of the programming model and architecture consistent across generations lets the software stack and operational know-how accumulate. Not having to relearn everything from scratch each generation is itself an economy of scale.
Scale: Systems grow to pods of thousands of chips, at which point inter-chip interconnect and topology become the key factors.
Resilience: At the scale of thousands of chips, failure is a constant. A design that lets training continue even when some chips die is essential.
Power Efficiency: TFLOPS/Watt is the core metric tracked across generations. Doing the same work with less power translates directly into lower operating cost.
Sustainability: Power efficiency is also, in the end, a matter of carbon footprint.
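
To make the link between TFLOPS/Watt and operating cost concrete, here is a minimal sketch. It relies on the fact that TFLOPS/Watt is the same unit as TFLOP per joule, so dividing a fixed amount of work by the efficiency gives energy. The workload size, efficiency values, and electricity price are hypothetical numbers chosen for illustration, not figures from the paper, and the model ignores cooling overhead and utilization.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def energy_kwh(total_tflop: float, tflops_per_watt: float) -> float:
    """Energy in kWh needed to execute `total_tflop` of work.

    TFLOPS/Watt equals TFLOP per joule, so work / efficiency yields joules.
    """
    joules = total_tflop / tflops_per_watt
    return joules / JOULES_PER_KWH


def energy_cost(total_tflop: float, tflops_per_watt: float,
                price_per_kwh: float) -> float:
    """Electricity cost of the same work at a given efficiency and price."""
    return energy_kwh(total_tflop, tflops_per_watt) * price_per_kwh


if __name__ == "__main__":
    work = 3.6e9   # hypothetical training job, in TFLOP
    price = 0.10   # hypothetical price per kWh
    for efficiency in (0.5, 1.0):  # hypothetical TFLOPS/Watt values
        kwh = energy_kwh(work, efficiency)
        cost = energy_cost(work, efficiency, price)
        print(f"{efficiency} TFLOPS/W: {kwh:.0f} kWh, cost {cost:.2f}")
```

Under these assumptions, doubling the efficiency halves both the energy and the bill for the same job, which is the sense in which efficiency and cost are the same axis.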

Lessons for Data Scientists

Here is why this paper is methodologically useful beyond being a hardware paper.

System performance vs. chip performance: Looking only at single-chip FLOPS misses the real improvement. You need to look at the performance of the entire system, including interconnect, topology, and the software stack, to see the actual gain in training throughput. This applies equally to inference serving. What matters is the effective throughput of the whole cluster, not the throughput of a single GPU.
Resilience is throughput: Without a failure-recovery design, effective throughput in large-scale training drops sharply. Checkpointing and tolerance for partial failure are not optional. They are throughput itself.
Power efficiency as a first-class metric: Tracking TFLOPS/Watt as a core metric reflects an operating philosophy that treats cost as a first-class citizen.
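
The "resilience is throughput" point can be sketched with a simple first-order model. This is not the paper's method. It assumes independent chip failures, so the cluster's mean time between failures (MTBF) shrinks in proportion to chip count, and it uses Young's classic approximation for the checkpoint interval. All function names and numbers are illustrative, and the approximation only holds when the interval is much shorter than the MTBF.

```python
import math


def cluster_mtbf(chip_mtbf_s: float, n_chips: int) -> float:
    """MTBF of the whole job, assuming independent chip failures."""
    return chip_mtbf_s / n_chips


def young_interval(ckpt_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)


def goodput_fraction(interval_s: float, ckpt_cost_s: float, mtbf_s: float,
                     restart_s: float = 0.0) -> float:
    """First-order fraction of wall-clock time spent on useful training.

    Losses modeled: time spent writing checkpoints, plus, per failure,
    the restart time and on average half an interval of redone work.
    """
    ckpt_overhead = ckpt_cost_s / interval_s
    failure_loss = (interval_s / 2.0 + restart_s) / mtbf_s
    return max(0.0, 1.0 - ckpt_overhead - failure_loss)


if __name__ == "__main__":
    # Hypothetical values: per-chip MTBF of 1e7 s, 1,000 chips, 50 s checkpoints.
    mtbf = cluster_mtbf(1e7, 1000)
    interval = young_interval(50.0, mtbf)
    print(f"cluster MTBF: {mtbf:.0f} s, checkpoint every {interval:.0f} s")
    print(f"goodput fraction: {goodput_fraction(interval, 50.0, mtbf):.2f}")
```

The design point the model illustrates is that adding chips shortens the job-level MTBF, so checkpoint cost and recovery time feed directly into effective throughput rather than being a side concern.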

ThakiCloud’s Perspective: Porting Large-Scale Infrastructure Design Principles

We do not build dedicated supercomputers like TPUs, but the design principles in this paper carry over directly to a K8s-based GPU platform. Architectural stability shows up as standardized serving and training interfaces, resilience as Kueue-based job retries and checkpointing, and power efficiency as GPU utilization monitoring and workload packing.
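
As a rough illustration of the retry-plus-checkpoint pattern, the sketch below simulates a training job that resumes from its last checkpoint after a failure. It is plain Python, not Kueue or Kubernetes API code: the `run_with_retries` loop stands in for whatever the scheduler does when it restarts a failed job, and every name, the JSON checkpoint format, and the simulated failure are assumptions made for this example.

```python
import json
import os
import tempfile


def load_step(ckpt_path: str) -> int:
    """Return the last checkpointed step, or 0 when no checkpoint exists."""
    if not os.path.exists(ckpt_path):
        return 0
    with open(ckpt_path) as f:
        return json.load(f)["step"]


def save_step(ckpt_path: str, step: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(ckpt_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, ckpt_path)


def train(total_steps, ckpt_path, ckpt_every, fail_at, stats):
    """Resume from the last checkpoint; raise at `fail_at` to simulate a failure."""
    start = load_step(ckpt_path)
    for step in range(start + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        stats["steps_executed"] += 1  # placeholder for a real training step
        if step % ckpt_every == 0 or step == total_steps:
            save_step(ckpt_path, step)


def run_with_retries(total_steps, ckpt_path, ckpt_every,
                     fail_steps=(), max_retries=3):
    """Rerun `train` after each failure, as a scheduler-level retry would."""
    stats = {"attempts": 0, "steps_executed": 0}
    pending = list(fail_steps)  # one simulated failure per attempt, in order
    while True:
        stats["attempts"] += 1
        fail_at = pending.pop(0) if pending else None
        try:
            train(total_steps, ckpt_path, ckpt_every, fail_at, stats)
            return stats
        except RuntimeError:
            if stats["attempts"] > max_retries:
                raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        print(run_with_retries(100, path, ckpt_every=10, fail_steps=[37]))
```

With a checkpoint every 10 steps and a failure at step 37, the second attempt restarts from step 30, so only the steps after the last checkpoint are redone rather than the whole run.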

Applying lessons from a scale of thousands of chips to a multi-tenant platform running tens to hundreds of GPUs is exactly the territory we operate in. Treating failure as a constant, watching whole-system throughput, and tracking power and cost as first-class metrics together make up the right operating philosophy regardless of scale.

Closing Thoughts

Google’s paper on five generations of TPUs makes the case, with data, that “large-scale AI infrastructure is a system, not a chip.” Accumulate know-how through architectural stability, protect throughput through resilience, and treat power efficiency as a first-class metric. These principles apply to every team operating a GPU cluster.

Source: “Google’s Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations”, arXiv:2606.15870 (2026-06-14). https://arxiv.org/abs/2606.15870
