Building the largest known Kubernetes cluster

cloud.google.com

156 points by TangerineDream 5 days ago

hazz99 2 days ago

I’m sure this work is very impressive, but these QPS numbers don’t seem particularly high to me, at least compared to existing horizontally scalable service patterns. Why is it hard for the kube control plane to hit these numbers?

For instance, postgres can hit this sort of QPS easily, afaik. It’s not distributed, but I’m sure Vitess could do something similar. The query patterns don’t seem particularly complex either.

Not trying to be reductive - I’m sure there’s some complexity here I’m missing!

phrotoma 2 days ago

I am extremely Not A Database Person but I understand that the rationale for Kubernetes adopting etcd as its preferred data store was more about its distributed consistency features and less about query throughput. etcd is slower cause it's doing RAFT things and flushing stuff to disk.
Projects like kine allow K8s users to swap sqlite or postgres in place of etcd which (I assume, please correct me otherwise) would deliver better throughput since those backends don't need to perform consensus operations.
https://github.com/k3s-io/kine
- dijit 2 days ago
  
  You might not be a database person, but you’re spot on.
  A well managed HA postgresql (active/passive) is going to run circles around etcd for kube controlplane operations.
  The caveat here is increased risk of downtime, and a much higher management overhead, which is why it's not the default.
- travem 2 days ago
  
  There are also distributed databases that use RAFT but can still scale; delivering distributed consensus is not a challenge that can’t be solved. For example, TiDB handles millions of QPS while delivering ACID transactions, e.g. https://vivekbansal.substack.com/p/system-design-study-how-f...
- Sayrus 2 days ago
  
  GKE uses Spanner as an etcd replacement.
  
  ZeroCool2u 2 days ago
  
  But, and I'm honestly asking, you as a GKE user don't have to manage that spanner instance, right? So, you should in theory be able to just throw higher loads at it and spanner should be autoscaling?
  
  DougBTX 2 days ago
  
  Yes, from the article:
  > To support the cluster’s massive scale, we relied on a proprietary key-value store based on Google’s Spanner distributed database... We didn’t witness any bottlenecks with respect to the new storage system and it showed no signs of it not being able to support higher scales.
  
  ZeroCool2u 2 days ago
  
  Yeah, I guess my question was a bit more nuanced. What I was curious about was if they were fully relying on normal autoscaling that any customer would get or were they manually scaling the spanner instance in anticipation of the load? I guess it's unlikely we're going to get that level of detailed info from this article though.
PunchyHamster 2 days ago

it's not really bottlenecked by the store but by the calculations performed on each pod schedule/creation.
It's basically "take global state of node load and capacity, pick where to schedule it", and I'd imagine probably not running in parallel coz that would be far harder to manage.
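For anyone who wants something concrete, here is a rough sketch of that "look at global state, pick a node" loop. It is a toy and not the kube-scheduler's actual code: the `Node` and `Pod` classes, the single CPU/memory fit check and the "most free CPU wins" score are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu: float  # cores still unallocated (hypothetical field)
    free_mem: float  # GiB still unallocated (hypothetical field)


@dataclass
class Pod:
    name: str
    cpu: float
    mem: float


def schedule(pod: Pod, nodes: list[Node]) -> Optional[Node]:
    """Place one pod: drop nodes that cannot fit it, score the rest, bind to the best."""
    feasible = [n for n in nodes if n.free_cpu >= pod.cpu and n.free_mem >= pod.mem]
    if not feasible:
        return None  # nothing fits, the pod would stay pending
    # Toy score (an assumption): prefer the node with the most CPU left after placement.
    best = max(feasible, key=lambda n: n.free_cpu - pod.cpu)
    # Binding mutates the shared state, so the next pod sees the reduced capacity.
    best.free_cpu -= pod.cpu
    best.free_mem -= pod.mem
    return best


nodes = [Node("a", 4.0, 8.0), Node("b", 2.0, 16.0)]
placements = [schedule(Pod(f"p{i}", 1.5, 2.0), nodes) for i in range(4)]
names = [n.name if n else None for n in placements]
print(names)
```

Each placement changes the state the next one reads, which is why the naive version runs one pod at a time.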
- senorrib 2 days ago
  
  Not a k8s dev, but I feel like this is the answer. K8s isn't usually just scheduling pods round robin or at random. There's a lot of state to evaluate, and the problem of scheduling pods becomes an NP-hard problem similar to the bin packing problem. I doubt the implementation tries to be optimal here, but it feels like a computationally heavy problem.
  
  OvervCW 2 days ago
  
  In what way is it NP-hard? From what I can gather it just eliminates nodes where the pod wouldn't be allowed to run, calculates a score for each and then randomly selects one of the nodes that has the highest score, so trivially parallelizable.
  
  femiagbabiaka a day ago
  
  I think filtering and scoring fall under a heuristics based approach to address NP-hardness?
  Binpacking seems to be a well-defined NP-hard problem: https://en.wikipedia.org/wiki/Bin_packing_problem
  
  stevefan1999 a day ago
  
  That's greedy
- __turbobrew__ 2 days ago
  
  The k8s scheduler lets you tweak how many nodes to look at when scheduling a pod (percentage of nodes to score) so you can change how big “global state” is according to the scheduler algorithm.
nonameiguess 2 days ago

It says in the blog that they require 13,000 queries per second to update lease objects, not that 13,000 is the total for all queries. I don't know why they cite that instead of total, but etcd's normal performance testing indicates it can handle at least 50,000 writes per second and 180,000 reads: https://etcd.io/docs/v3.6/op-guide/performance/. So, without them saying what the real number is, I'm going to guess their reads and writes outside of lease updates are at least much larger than those numbers.
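A quick back-of-the-envelope check on where a number like 13,000 could come from. This assumes every node renews its lease once per interval and that the interval is the kubelet's default of 10 seconds; whether the cluster in the blog used the default is a guess on my part, and `lease_update_qps` is just a helper name made up here.

```python
def lease_update_qps(node_count: int, renew_interval_s: float) -> float:
    """Steady-state lease writes per second if each node renews once per interval."""
    return node_count / renew_interval_s


node_count = 130_000
renew_interval_s = 10.0  # assumption: kubelet default renew interval
qps = lease_update_qps(node_count, renew_interval_s)
print(f"{qps:,.0f} lease updates per second")
```

Under those assumptions the lease traffic scales linearly with node count, so it is a floor on write load rather than the total.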

__turbobrew__ 2 days ago

It makes me sad that getting these scalability numbers requires some secret sauce on top of spanner, which nobody else in the k8s community can benefit from. Etcd is the main bottleneck in upstream k8s and it seems like there is no real steam to build an upstream replacement for etcd/boltdb.

I did poke around a while ago to see what interfaces that etcd has calling into boltdb, but the interface doesn’t seem super clean right now, so the first step in getting off boltdb would be creating a clean interface that could be implemented by another db.

locknitpicker 2 days ago

> It makes me sad that to get these scalability numbers requires some secret sauce on top of spanner, which no body else in the k8s community can benefit from.
I'm not so sure. I mean, everything has tradeoffs, and what you need to do to put together the largest cluster known to man is not necessarily what you want to have to put together a mundane cluster.
dilyevsky a day ago

It's totally possible to run tens of thousands of QPS on etcd if your disks are NVMEs (or if you disable fdatasync which is not recommended). If you use kine+cockroachdb or tidb you can go even higher which is what I'm guessing is equivalent to their spanner setup.
ozgrakkurt a day ago

There was a blogpost about creating an alternative to etcd for super high scale kubernetes cluster. All code was open too. It was from someone named Benjamin I think but not sure.
I’m not able to find the blogpost but maybe someone else can!
- Saser 20 hours ago
  
  This might be what you're thinking of: https://bchess.github.io/k8s-1m/
kritr 20 hours ago

I’ve seen some talk of replacing etcd with FoundationDB, which could yield similar improvements.
iwontberude 2 days ago

For those not aware, if you create too many resources you can easily use up all of the 8GB hard coded maximum size in etcd which causes a cluster failure. With compaction and maintenance this risk is mitigated somewhat but it just takes one misbehaving operator or integration (e.g. hundreds of thousands of dex session resources created for pingdom/crawlers) to mess everything up. Backups of etcd are critical. That dex example is why I stopped it for my IDP.
- scoodah 2 days ago
  
  This is why I’ve always thought Tekton was a strange project. It feels inevitable that if you buy into Tekton CI/CD you will hit issues with etcd scaling due to the sheer number of resources you can wind up with.
  
  prescriptivist a day ago
  
  What boundaries does this 8GB etcd limit cut across? We've been using Tekton for years now but each pipeline exists in its own namespace and that namespace is deleted after each build. Presumably that kind of wholesale cleanup process keeps the DB size in check, because we've never had a problem with Etcd size...
  We have multiple hundreds of resources allocated for each build and do hundreds of builds a day. The current cluster has been doing this for a couple of years now.
  
  scoodah 5 hours ago
  
  Yeah I mean if you’re deleting namespaces after each run then sure, that may solve it. They have a pruner now that you can enable too to set up retention periods for pipeline runs.
  There’s also some issues with large Results, though I think you have to manually enable that. From their site
  > CAUTION: the larger you make the size, more likely will the CRD reach its max limit enforced by the etcd server leading to bad user experience.
  And then if you use Chains you’re opening up a whole other can of worms.
  I contracted with a large institution that was moving all of their cicd to Tekton and they hit scaling issues with etcd pretty early in the process and had to get Red Hat to address some of them. If they couldn’t get them addressed by RH they were going to scrap the whole project.
  
  iwontberude a day ago
  
  Yeah, quite unfortunate. But maybe there is hope. Apparently k3s uses Kine which is an etcd translation layer for relational databases and there is another project called Netsy which persists into s3 https://nadrama.com/netsy. Some interesting ideas. Hopefully native postgres support gets added since it's so ubiquitous and performant.
- dilyevsky a day ago
  
  It's not hardcoded and you can increase it via flag.
nonameiguess a day ago

It's possible I'm talking out of my ass and totally wrong because I'm basing this on principles, not benchmarking, but I'm pretty sure the problem is more etcd itself than boltdb. Specifically, the Raft protocol requires that the cluster leader's log has to be replicated to a quorum of voting members, who need to write to disk, including a flush, and then respond to the leader, before a write is considered committed. That's floor(n/2) + 1 disk flushes and twice as many network roundtrips to write any value. When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle, it's hard for that not to become a bottleneck. Other limitations include the 8GiB disk limit another comment mentions and etcd's hard-coded 1.5 MiB request size limit that prevents you from writing large object collections in a single bundle.
etcd is fine for what it is, but that's a system meant to be reliable and simple to implement. Those are important qualities, but it wasn't built for scale or for speed. Ironically, etcd recommends 5 as the ideal number of cluster members and 7 as a maximum based on Google's findings from running chubby, that between-member latency gets too big otherwise. With 5, that means you can't ever store more than 40GiB of data. I have no idea what a typical ratio of cluster nodes to total data is, but that only gives you about 323KiB per node for 130,000 nodes, which doesn't seem like very much.
There are other options. k3s made kine which acts as a shim intercepting the etcd API calls made by the apiserver and translating it into calls to some other dbms. Originally, this was to make a really small Kubernetes that used an embedded sqlite as its datastore, but you could do the same thing for any arbitrary backend by just changing one side of the shim.
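To put numbers on the quorum and storage points above, a small sketch. It uses the floor(n/2) + 1 quorum formula from this comment and takes the 40GiB total at face value; the function names are made up for the example.

```python
def quorum(members: int) -> int:
    """Members that must persist an entry before it counts as committed."""
    return members // 2 + 1


def per_node_budget_bytes(total_bytes: int, node_count: int) -> float:
    """Average store space available per cluster node."""
    return total_bytes / node_count


quorums = {n: quorum(n) for n in (3, 5, 7)}
for members, q in quorums.items():
    print(f"{members} members -> quorum of {q}, tolerates {members - q} failures")

budget = per_node_budget_bytes(40 * 2**30, 130_000)  # 40GiB figure from the comment
print(f"about {budget / 1024:.0f} KiB of store per node")
```

Going from 5 to 7 members buys one more tolerated failure at the cost of one more flush on the commit path.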
- __turbobrew__ a day ago
  
  I run several clusters a bit over 10k nodes and the etcd db size is about 30-50GiB depending on how long ago defragmentation was run.
  It is kind of sad as these nodes are running around 2k IOPS to the disk and are mostly sitting idle at the hardware level, but etcd still regularly chokes.
  I did look into kine in the past, but I have no idea if it is suitable for running a high performance data store.
  > When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle
  The trick is you deploy your k8s clusters in multiple datacenters in the same region (think AZs in AWS term). The control plane can span multiple AZs which are in separate buildings, but close in geography. From the setups I work on the latency between datacenters in the same region is only about 500 microseconds.

leo_e 20 hours ago

Papers like this are fascinating engineering, but dangerous marketing.

They convince every Series A startup that they need a multi-region federated control plane for their 50 microservices. I spend half my time convincing my team not to emulate Google, because we don't have Google's scale problems—we have velocity problems.

Complexity is an asset for Google (it's a moat), but a liability for the rest of us. I just want a cluster that doesn't require a dedicated ops team to upgrade.

blurrybird 2 days ago

AWS and Anthropic did this back in July: https://aws.amazon.com/blogs/containers/amazon-eks-enables-u...

cowsandmilk 2 days ago

That is 100k vs 130k for Google’s new announcement. I can’t speak as to whether the additional 30k presented new challenges though.
- Cthulhu_ 2 days ago
  
  I want to believe that this is an order-of-magnitude kind of problem, that is, if 100K is fine then 500K is also fine.
  I only skimmed the article though, but I'm confident that it's more a physical hardware, time, space and electricity problem than a software / orchestration one; the article mentions that a cluster that size needs to be multi-datacenter already given the sheer power requirements (2700 watts for one GPU in a single node).

yanhangyhy 2 days ago

there is a doc about how to do with 1M nodes: https://bchess.github.io/k8s-1m/#_why

so i guess the title is not true?

arccy 2 days ago

That's simulated using kwok, not real.
> Unfortunately running 1M real kubelets is beyond my budget.
Thaxll 2 days ago

This is a PoC not backed by a reliable etcd replacement.

xyse53 2 days ago

They mention GCS fuse. We've had nothing but performance and stability problems with this.

We treat it as a best effort alternative when native GCS access isn't possible.

dijit 2 days ago

fuse based filesystems in general shouldn’t be treated as production ready in my experience.
They’re wonderful for low volume, low performance and low reliability operations. (browsing, copying, integrating with legacy systems that do not permit native access), but beyond that they consume huge resources and do odd things when the backend is not in its most ideal state.
- dotwaffle 2 days ago
  
  I started rewriting gcsfuse using https://github.com/hanwen/go-fuse instead of https://github.com/jacobsa/fuse and found it rock-solid. FUSE has come a long way in the last few years, including things like passthrough.
  Honestly, I'd give FUSE a second chance, you'd be surprised at how useful it can be -- after all, it's literally running in userland so you don't need to do anything funky with privileges. However, if I were starting afresh on a similar project I'd probably be looking at using 9p2000.L instead.
- xyse53 a day ago
  
  I think it's possible to write a solid fuse filesystem. Not as performant as in-kernel but it could easily not be the bottleneck depending on the backend.
  I commented though because GCP highlights it in a few places as a component for AI workloads. I'm curious if anyone is using it in an important application and happy with it.
- thundergolfer 2 days ago
  
  AWS Lambda uses FUSE and that’s one of the largest prod systems in the world.
  
  dijit 2 days ago
  
  An option exists, but they prefer you use the block storage API.
  
  thundergolfer a day ago
  
  No, as in Lambda itself uses FUSE as an implementation detail of their container filesystem.
  
  dijit a day ago
  
  It seems there were some major issues, but AWS has developed around them and optimised for its needs; (https://www.madebymikal.com/on-demand-container-loading-in-a...)
  Fair, but far from common advice I’m willing to give people (other CTOs).

zkmon a day ago

What business usecase requires a single cluster with thousands of pods? Wouldn't having multiple clusters, each hosting a few namespaces, be a better architecture?

solatic 18 hours ago

This. I may not work with AI training workflows, but I struggle to understand why they supposedly require launching a thousand pods per second to use GPUs that need to fundamentally be installed across different baremetal machines. Once the GPUs are on different machines, if there are 1k+ such machines, just start putting them on different Kubernetes clusters. Build a scheduling layer above the Kubernetes control plane to decide which Kubernetes cluster to schedule the pod onto.
The whole thing stinks of, AI investors are throwing money at AI companies, so go to GCP and tell them to solve the problem at any price so that they can keep scaling without needing to build the scheduling layer above the Kubernetes control planes.
- zkmon 2 hours ago
  
  Yep, it's just saying "you should now launch 1000s of pods in a single cluster, just because we said it makes sense, and please don't look at the costs, business sense and operational issues."

Heliodex a day ago

View without needing to sign in: https://web.archive.org/web/20251124111136/https://cloud.goo...

Nextgrid 2 days ago

K8S clusters on VMs strike me as odd.

I see the appeal of K8s in dividing raw, stateful hardware to run multiple parallel workloads, but if you're dealing with stateless cloud VMs, why would you need K8S and its overhead when the VM hypervisor already gives you all that functionality?

And if you insist anyway, run a few big VMs rather than many small ones, since K8s overhead is per-node.

locknitpicker 2 days ago

> I see the appeal of K8s in dividing raw, stateful hardware to run multiple parallel workloads, but if you're dealing with stateless cloud VMs, why would you need K8S and its overhead when the VM hypervisor already gives you all that functionality?
I think you're not familiar with Kubernetes and what features it provides.
For example, kubernetes supports blue-green deployments and rollbacks, software-defined networks, DNS, node-specific purges and taints, etc. Those are not hypervisor features.
Also, VMs are the primitives of some cloud providers.
It sounds like you heard about how Borg/Kubernetes was used to simplify the task of putting together clusters with COTS hardware and you didn't bother to learn more about Kubernetes.
victorbjorklund 2 days ago

Because k8s gives you lots of other things out of the box like easy scaling of apps etc. Harder to do on VM:s where you would either have to dedicate one VM per app (might be a waste of resources) or you have to try and deploy and run multiple apps on multiple VM:s etc.
(For the record I’m not a k8s fanatic. Most of the time a regular VM is better. But a VM isn’t = a kubernetes cluster).
acedTrex 2 days ago

because if you just do a few huge VMs you still have all the problems that k8s solves out of the box. Except now you have to solve them yourself, which will likely end up being a crappier less robust version of kubernetes.
reachableceo a day ago

VMs are a standardized system primitive. The “bare metal” bit with RBAC etc through the management layer / hypervisor.
K8s is pallets; VMs are shipping containers.
Systems / storage / network team can present a standardized set of primitives for any vm to consume that are more or less independent of the underlying bare metal.
Then the VMs can be live migrated when the inevitable hardware maintenance is needed (microcode patching , storage driver upgrades , etc etc etc). With no downtime for the vm itself
GauntletWizard 2 days ago

The reason to target k8s on cloud vms is that cloud VMs don't subdivide as easily or as cleanly. Managing them is a pain. K8s is an abstraction layer for that - Rather than building whole machine images for each product, you create lighter weight docker images (how light weight is a point of some contention), and you only have to install your logging, monitoring, and etc once.
Your advice about bigger machines is spot on - K8s' biggest problem is how relatively heavyweight the kubelet is, with memory requirements of roughly half a gig. On a modern 128g server node that's a reasonable overhead, for small companies running a few workloads on 16g nodes it's a cost of doing business, but if you're running 8 or 4g nodes, it looks pretty grim for your utilization.
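To make the utilization point concrete, a tiny sketch of the overhead as a share of node memory. The 0.5 GiB constant is just the "roughly half a gig" figure from above, not a measurement, and `overhead_fraction` is a name made up for the example.

```python
KUBELET_OVERHEAD_GIB = 0.5  # assumption: the rough per-node figure quoted above


def overhead_fraction(node_mem_gib: float, overhead_gib: float = KUBELET_OVERHEAD_GIB) -> float:
    """Share of a node's memory consumed by fixed per-node overhead."""
    return overhead_gib / node_mem_gib


fractions = {size: overhead_fraction(size) for size in (128, 16, 8, 4)}
for size, frac in fractions.items():
    print(f"{size:>3} GiB node: {frac:.1%} of memory goes to per-node overhead")
```

The fixed cost is the same everywhere, so halving node size doubles the share it takes.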
- nyrikki 2 days ago
  
  You can run pods with podman and avoid the entire k8s stack or even use minikube on a machine if you wanted to. Now that rootless is the default in k8s[0] the workflow is even more convenient and you can even use systemd with isolated users on the VM to provide more modularity and separation.
  It really just depends on if you feel that you get value from the orchestration that full k8s offers.
  Note that on k8s or podman, you can get rid of most of the 'cost' of that virtualization for single placement and/or long lived pods by simply sharing an emptyDir or volume between pod members.
      # Create Pod
      podman pod create --name pgdemo-pod
      # Create client
      podman run -dt -v pgdemo:/mnt --pod pgdemo-pod -e POSTGRES_PASSWORD=password --name client docker.io/ubuntu:25.04
      # Unsafe hack to fix permissions in quick demo and install packages
      podman exec client /bin/bash -c 'chmod 0777 /mnt; apt update ; apt install -y postgresql-client'
      # Create postgres server
      podman run -dt -v pgdemo:/mnt --pod pgdemo-pod -e POSTGRES_PASSWORD=password --name pg docker.io/postgres:bookworm -c unix_socket_directories='/mnt,/var/run/postgresql/'
      # Invoke client using unix socket
      podman exec -it client /bin/bash -c "psql -U postgres -h /mnt"
      # Invoke client using localhost network
      podman exec -it client /bin/bash -c "psql -U postgres -h localhost"
  There is enough there for you to test to see that the performance is so close to native sharing unix sockets that way, that there is very little performance cost and a lot of security and workflow benefits to gain.
  As podman is daemonless, easily rootless, and on mac even allows you to ssh into the local linux vm with `podman machine ssh`, you aren't stuck with the hidden abstractions of docker-desktop, which hides that from you. It has lots of value.
  Plus you can dump a k8s like yaml to use for the above with:
  podman kube generate pgdemo-pod
  So you can gain the advantages of k8s without the overhead of the cluster, and there are ways to launch those pods from systemd even from a local user that has zero sudo abilities etc...
  I am using it to validate that upstream containers don't have dial home by producing pcap files, and I would also typically run the above with no network on the pgsql host, so it doesn't have internet access.
  IMHO what gets missed is that k8s pods, beyond being the minimal unit of deployment, are in the general form just a collection of containers with specific shared namespaces.
  As Redhat gave podman to CNCF in 2024, I have shifted to it, so haven't seen if rancher can do the same.
  The point is that you don't even need the complexity of minikube on VMs, you can use most of the workflow even for the traditional model.
  [0] https://kubernetes.io/blog/2025/04/25/userns-enabled-by-defa...
tayo42 2 days ago

In a large organization it's more efficient to run on VMs. You can colocate services that fit together on one machine.
And in reality no one sizes their machines correctly. They always do some handwavey thing like we need 4 cores, but maybe we'll burst and maybe there will be an outage so let's double it. Now all that utilization can be watched and you can take advantage of oversubscription.

sandGorgon 2 days ago

does anyone know the size at openai ? it used to run a 7500 node cluster back in 2021 https://openai.com/index/scaling-kubernetes-to-7500-nodes/

jakupovic 2 days ago

Doing this at anything > 1k nodes is a pain in the butt. We decided to run many <100 nodes clusters rather than a few big ones.

kvrty 2 days ago

Same here. Non Kubernetes project originated control plane components start failing beyond a certain limit - your ingress controllers, service meshes etc. So I don't usually take node numbers from these benchmarks seriously for our kind of workloads. We run a bunch of sub-1k node clusters.
liveoneggs 2 days ago

Same. The control plane and various controllers just aren't up to the task.
preisschild 2 days ago

Meh, I've had clusters with close to 1k nodes (w/ cilium as CNI) and didn't have major issues
- __turbobrew__ 2 days ago
  
  When I was involved about a year ago, cilium falls apart at around a few thousand nodes.
  One of the main issues of cilium is that the bpf maps scale with the number of nodes/pods in the cluster, so you get exponential memory growth as you add more nodes with the cilium agent on them. https://docs.cilium.io/en/stable/operations/performance/scal...
  
  oasisaimlessly 2 days ago
  
  Wouldn't that be quadratic rather than exponential?
  
  preisschild a day ago
  
  That's true and I definitely had to "tune" the bpf map limits, but it wasn't really that difficult to do.

moralestapia 21 hours ago

Cute. I've done ~2 million (not k8s though, that trash would only slow me down).

blamestross 2 days ago

I worked in DHTs in grad school. I still double take that Google and other companies "computers dedicated to a task" numbers are missing 2 digits from what I expected. We have a lot of room left for expansion, we just have to relax centralized management expectations.

rvz 2 days ago

> While we don’t yet officially support 130K nodes, we're very encouraged by these findings. If your workloads require this level of scale, reach out to us to discuss your specific needs

Obviously this is a typical experiment at Google on running a K8s cluster at 130K nodes but if there is a company out there that "requires" this scale, I must question their architecture and their infrastructure costs.

But of course someone will always request that they somehow need this sort of scale to run their enterprise app. But once again, let's remind the pre-revenue startups talking about scale before they hit PMF:

Unless you are ready to donate tens of billions of dollars yearly, you do not need this.

You are not Google.

game_the0ry 2 days ago

> You are not Google.
100% agree.
People at my co are horny to adopt k8s. Really, tech leads want to put it on their resume ("resume driven development") and use a tool that was made to solve a particular problem we never had. The downside is that we now need to be proficient at it, know how to troubleshoot it, etc. It was sold to leadership as something that would make our lives easier but the exact opposite has happened.
- scottyah 12 hours ago
  
  Use killercoda and get your CKA, I bet most of the confusion will be gone. I've basically started mandating it for newer folks on my team since it covers so many of the gaps that get created by people who try Just In Time learning on the systems. K9s is great for visual people who are used to vim.
- BruSwain 2 days ago
  
  I think k8s has a learning curve, absolutely, and there are absolutely cases where it can be unnecessary overhead. But I actually think those cases are pretty small. If you're running multiple apps, k8s is valuable. There is initial investment in learning the system, but it's very extensible, flexible, & portable. (Yes, every hyperscaler's implementation of k8s has its own nuance in certain places, but the core concept of k8s translates very well)
  
  game_the0ry 18 hours ago
  
  We must be terrible at implementation bc we have had a prod outage and our DX is objectively worse. It's caused more problems and headaches for us.
jcims 2 days ago

I work for a mature public company that most people in the US have at least heard of. We're far from the largest in our industry and we run jobs with more than that almost every night. Not via k8s though.
- Tostino 2 days ago
  
  You have jobs running on more than 130k different machines daily??
  Are they cloud based VMs, or your own hardware? If cloud based, do you reprovision all of them daily and incur no cost when you are not running jobs? If it's your own hardware, what else do you do with it when not batch processing?
  
  jcims 2 days ago
  
  They are provisioned on demand (cloud) and shut down when no longer needed.
dilyevsky 12 hours ago

> You are not Google.
You think they are just running it for fun? It's literally non-Google customers who wanted this as was explained in the article
mlnj 2 days ago

>You are not Google.
It's literally Google coming out with this capability and how is the criticism still "You are not Google"
- Rastonbury 2 days ago
  
  The criticism is at pre-PMF startups who believe they need something similar

blinding-streak 2 days ago

Imagine a Beowulf cluster of these

jeffbee 2 days ago

You could remove all references to AI/ML topics from this article and it would remain just as interesting and informative. I really hate that we let marketing people cram the buzzword of the day into what should be a purely technical discussion.

supportengineer 2 days ago

Imagine a Beowulf cluster of these

belter 2 days ago

130k nodes...cute...but can Google conquer the ultimate software engineering challenge they warn you about in CS school? A functional online signup flow?

chrisandchris 2 days ago

They could team up with Microsoft, because their signup flow is fine but the login flow is badly broken.
jasonvorhe 2 days ago

For what? Access to the control plane API?
- belter 2 days ago
  
  In general... Try to sign up for their AI services...

zoobab 2 days ago

The new mainframe.

bhouston 2 days ago

Sounds like hell. But I do really dislike Kubernetes: https://benhouston3d.com/blog/why-i-left-kubernetes-for-goog...