Google Kubernetes Engine Now Supports Trillion-Parameter AI Models
The exponential growth in large language model (LLM) size and the resulting need for high-performance computing (HPC) infrastructure are reshaping the AI landscape. Newer GenAI models have grown well past a billion parameters, with some approaching 2 trillion.
Google Cloud announced that, in anticipation of even larger models, it has upgraded Google Kubernetes Engine (GKE) to support 65,000-node clusters, up from 15,000-node clusters. According to Google Cloud, this enhancement enables GKE to operate at 10x the scale of two other major cloud providers.
While Google Cloud did not specify the names, this is likely in reference to Microsoft Azure and Amazon Web Services (AWS), two of the largest cloud providers.
The parameters of a GenAI model are variables within the model that dictate how it behaves and what output it generates. The number of parameters plays a key role in the model’s capacity to learn and represent complex patterns in language. The more parameters a model has, the more “memory” it has to generate accurate and contextually appropriate responses.
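To see where parameter counts in the billions come from, here is a back-of-the-envelope sketch of how the weight matrices of a transformer-style LLM add up. The dimensions below are illustrative assumptions, not the configuration of any real model, and the formula omits smaller terms such as biases and layer norms.

```python
# Rough parameter count for a simplified transformer-style LLM.
# Dimensions are hypothetical; biases, layer norms, etc. are omitted.

def transformer_param_count(vocab_size, d_model, n_layers, d_ff):
    embedding = vocab_size * d_model   # token embedding table
    attention = 4 * d_model * d_model  # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff           # up- and down-projection matrices
    return embedding + n_layers * (attention + mlp)

# Assumed large-model configuration:
total = transformer_param_count(
    vocab_size=50_000, d_model=12_288, n_layers=96, d_ff=4 * 12_288
)
print(f"{total / 1e9:.1f}B parameters")
```

Even with these modest assumptions the count lands well above 100 billion, which helps explain why training such models demands clusters of the scale Google Cloud is describing.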
“Scaling to 65,000 nodes provides much-needed capacity to the world’s most resource-hungry AI workloads,” Google Cloud wrote in a blog post. “Combined with innovations in accelerator computing power, this will enable customers to reduce model training time or scale models to multi-trillion parameters or more. Each node is equipped with multiple accelerators giving the ability to manage over 250,000 accelerators in one cluster.”
GKE is a Google-managed implementation of the Kubernetes open-source orchestration platform. It is designed to automatically add or remove hardware resources such as GPUs based on workload requirements. It also manages maintenance tasks and handles Kubernetes updates.
To develop advanced models, users need the ability to allocate computing resources across various tasks. The upgraded 65,000-node capacity not only provides more computing power for training but also supports tasks like inference, serving, and research, ensuring users have the resources needed throughout the full lifecycle of AI model development.
To enable this advancement, Google Cloud is shifting GKE from the open-source etcd, a distributed key-value store, to a more powerful key-value store built on Spanner, Google’s distributed database that offers virtually unlimited scalability.
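Kubernetes persists all cluster state, such as pod and node objects, in its backing store under path-like keys (for example, under the `/registry` prefix). Any replacement store, including the new Spanner-based one, must preserve the same key-value semantics: writes and reads by key, plus listing by prefix. The dict-backed class below is a minimal illustrative sketch of those semantics; it is not the real etcd or Spanner API.

```python
# Illustrative sketch of the key-value semantics a Kubernetes backing
# store provides. Not the actual etcd (or Spanner-backed) client API.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def list_prefix(self, prefix):
        # Kubernetes objects live under path-like keys, e.g. /registry/pods/...
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

store = KVStore()
store.put("/registry/pods/default/web-0", "serialized Pod object")
store.put("/registry/pods/default/web-1", "serialized Pod object")
store.put("/registry/nodes/node-a", "serialized Node object")
print(len(store.list_prefix("/registry/pods/")))  # prefix scan over pods
```

Because the interface stays the same, swapping the implementation underneath (etcd today, Spanner-backed tomorrow) is invisible to the Kubernetes components that read and write these keys, which is what makes the backward compatibility Google describes possible.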
With this transition, Google aims to support larger GKE clusters, improve reliability for users, and reduce latency in cluster operations. Additionally, the Spanner-based etcd API will maintain backward compatibility, allowing users to adopt the new technology without needing to modify core Kubernetes configurations.
Google has also undertaken a major overhaul of the GKE infrastructure that manages the Kubernetes control plane. This enables GKE to scale faster, meeting deployment demands with fewer delays. The control plane now adjusts automatically to dynamic workloads, which is particularly effective for large-scale applications such as SaaS platforms and disaster recovery.
Google claims that the GKE upgrade allows customers to meet demand significantly faster, and that GKE can run five jobs in a single cluster, matching the scale of the company’s largest LLM training job.
GenAI models with billions of parameters offer impressive potential. Industry trends suggest that parameter size will remain a focal point in AI advancements. NVIDIA introduced its new Blackwell GPU earlier this year, designed to handle trillion-parameter AI models.
While parameter size is a notable measure of progress in AI development, it does not solely define a model’s usefulness or innovation. Achieving meaningful outcomes depends on a comprehensive approach that considers scalability, efficiency, and ethical responsibility alongside technological advancements.