Google Kubernetes Engine Now Supports Trillion-Parameter AI Models
The exponential growth in large language model (LLM) size and the resulting need for high-performance computing (HPC) infrastructure are reshaping the AI landscape. Newer GenAI models have grown well past a billion parameters, with some approaching 2 trillion.
Google Cloud announced that, in anticipation of even larger models, it has upgraded Google Kubernetes Engine (GKE) to support 65,000-node clusters, up from 15,000-node clusters. According to Google Cloud, this enhancement enables GKE to operate at 10x the scale of two other major cloud providers.
While Google Cloud did not specify the names, this is likely in reference to Microsoft Azure and Amazon Web Services (AWS), two of the largest cloud providers.
The parameters of a GenAI model are variables within the model that dictate how it behaves and what output it generates. The number of parameters plays a key role in the model’s capacity to learn and represent complex patterns in language. The more parameters a model has, the more “memory” it has to generate accurate and contextually appropriate responses.
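To see where parameter counts in the billions come from, here is a back-of-the-envelope sketch of how the weight matrices of a transformer-style LLM add up. The dimensions below are illustrative assumptions, not the configuration of any real model, and the formula omits smaller terms such as biases and layer norms.

```python
# Rough parameter count for a simplified transformer-style LLM.
# Dimensions are hypothetical; biases, layer norms, etc. are omitted.

def transformer_param_count(vocab_size, d_model, n_layers, d_ff):
    embedding = vocab_size * d_model   # token embedding table
    attention = 4 * d_model * d_model  # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff           # up- and down-projection matrices
    return embedding + n_layers * (attention + mlp)

# Assumed large-model configuration:
total = transformer_param_count(
    vocab_size=50_000, d_model=12_288, n_layers=96, d_ff=4 * 12_288
)
print(f"{total / 1e9:.1f}B parameters")
```

Even with these modest assumptions the count lands well above 100 billion, which helps explain why training such models demands clusters of the scale Google Cloud is describing.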
“Scaling to 65,000 nodes provides much-needed capacity to the world’s most resource-hungry AI workloads,” Google Cloud wrote in a blog post. “Combined with innovations in accelerator computing power, this will enable customers to reduce model training time or scale models to multi-trillion parameters or more. Each node is equipped with multiple accelerators giving the ability to manage over 250,000 accelerators in one cluster.”
GKE is a Google-managed implementation of the Kubernetes open-source orchestration platform. It is designed to automatically add or remove hardware resources such as GPUs based on workload requirements. It also manages maintenance tasks and handles Kubernetes updates.
To develop advanced models, users need the ability to allocate computing resources across various tasks. The upgraded 65,000-node capacity not only provides more computing power for training but also supports tasks like inference, serving, and research, ensuring users have the resources needed throughout the full lifecycle of AI model development.
To enable this advancement, Google Cloud is shifting GKE from the open-source etcd, a distributed key-value store, to a more powerful key-value store built on Spanner, Google’s distributed database that offers virtually unlimited scalability.
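Kubernetes persists all cluster state, such as pod and node objects, in its backing store under path-like keys (for example, under the `/registry` prefix). Any replacement store, including the new Spanner-based one, must preserve the same key-value semantics: writes and reads by key, plus listing by prefix. The dict-backed class below is a minimal illustrative sketch of those semantics; it is not the real etcd or Spanner API.

```python
# Illustrative sketch of the key-value semantics a Kubernetes backing
# store provides. Not the actual etcd (or Spanner-backed) client API.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def list_prefix(self, prefix):
        # Kubernetes objects live under path-like keys, e.g. /registry/pods/...
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

store = KVStore()
store.put("/registry/pods/default/web-0", "serialized Pod object")
store.put("/registry/pods/default/web-1", "serialized Pod object")
store.put("/registry/nodes/node-a", "serialized Node object")
print(len(store.list_prefix("/registry/pods/")))  # prefix scan over pods
```

Because the interface stays the same, swapping the implementation underneath (etcd today, Spanner-backed tomorrow) is invisible to the Kubernetes components that read and write these keys, which is what makes the backward compatibility Google describes possible.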
With this transition, Google aims to support larger GKE clusters, improve reliability for users, and reduce latency in cluster operations. Additionally, the Spanner-based etcd API will maintain backward compatibility, allowing users to adopt the new technology without needing to modify core Kubernetes configurations.
Google has also undertaken a major overhaul of the GKE infrastructure that manages the Kubernetes control plane. This enables GKE to scale faster, meeting deployment demands with fewer delays. The control plane now adjusts automatically to dynamic workloads, which is particularly effective for large-scale applications such as SaaS platforms and disaster recovery.
Google claims that the GKE upgrade allows customers to meet demand significantly faster, and that GKE can run five jobs in a single cluster, matching the scale of the company’s largest LLM training job.
GenAI models with billions of parameters offer impressive potential. Industry trends suggest that parameter size will remain a focal point in AI advancements. NVIDIA introduced its new Blackwell GPU earlier this year, designed to handle trillion-parameter AI models.
While parameter size is a notable measure of progress in AI development, it does not solely define a model’s usefulness or innovation. Achieving meaningful outcomes depends on a comprehensive approach that considers scalability, efficiency, and ethical responsibility alongside technological advancements.