At my current employer, we use Kubernetes to run hundreds of thousands of bare metal servers, spread over hundreds of Kubernetes clusters. We use Kubernetes beyond officially supported/tested scale limits by running more than 5,000 nodes and over a hundred thousand pods in a single cluster.1 In these large scale setups, expensive “list” calls on the Kubernetes API are the Achilles' heel of control plane reliability and scalability. In this article, I'll explain which list call patterns pose the most risk, and how recent and upcoming Kubernetes versions are improving list API performance.
# kube-apiserver Concepts
We'll keep coming back to a few concepts throughout this article, so let's define them up front.
Watch Cache: kube-apiserver caches all Kubernetes resources (i.e. everything in etcd) by default in memory. Depending on your Kubernetes version and call pattern, your get/list API calls may not hit this cache (and may hit etcd instead).
ResourceVersion (RV): an opaque string that the Kubernetes API returns to indicate the version of an object or collection (in practice, this is etcd's logical clock counter).
Pagination (chunking): when you specify a limit parameter on your list request, the apiserver returns a continue token in the response, which you can use to request the next chunk (page) of results.
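For illustration, here's a minimal client-go sketch of that loop (assumptions: a typed clientset named clientset, the usual context/kubernetes/metav1 imports, and a hypothetical handlePods helper for per-chunk processing):

```go
// Page through pods 500 at a time; each response carries a continue token
// that requests the next chunk until the collection is exhausted.
func listAllPods(ctx context.Context, clientset kubernetes.Interface) error {
	opts := metav1.ListOptions{Limit: 500}
	for {
		podList, err := clientset.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return err
		}
		handlePods(podList.Items) // hypothetical per-chunk handler
		if podList.Continue == "" {
			return nil // last page
		}
		opts.Continue = podList.Continue
	}
}
```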
# List API performance characteristics
Not all list calls to the Kubernetes API have the same cost to the apiserver. In this article, we will focus on list calls over large collections of resources (e.g. listing 100k+ pods, or 200k+ custom resources on a single apiserver). Overall, the cost of a list API call depends on the query pattern and the version of Kubernetes you're running.
List API calls over large collections mainly impact the apiserver and etcd by consuming too much CPU time (starving other API requests and causing them to time out), or by allocating too much memory (causing the process to be OOM killed, which shifts the load to other apiservers and eventually cascades into taking all your apiservers down).
In practice, you create this load on the apiserver by making it query too many keys from etcd, or by making it spend too much time in serialization/deserialization (either decoding values from etcd, or encoding the response to JSON/protobuf). The apiserver will usually allocate too much memory when it works on large datasets, which causes more garbage collection in the Go runtime, which in turn burns more CPU cycles.
# Pagination
If your initial list request specifies the ?limit=... parameter, you're paginating results, so both the apiserver and etcd do a limited amount of work and allocate only a limited amount of memory per list request. In the bluntest terms, this is what prevents the apiserver from potentially allocating 5GB of memory for a single request.
Most list calls in practice happen because a Kubernetes controller establishes an informer (list+watch) on a resource type. If you use client-go or controller-runtime to establish informers, your list calls are automatically paginated. If you're using client-go to make list calls directly, make sure to use the pager package to paginate, as sketched below.
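Here's a hedged sketch of what the pager package looks like in use (clientset is a placeholder; assumed imports are metav1, runtime, corev1, and k8s.io/client-go/tools/pager):

```go
// The pager helper drives the limit/continue loop for you and invokes the
// callback once per returned object.
p := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
	return clientset.CoreV1().Pods("").List(ctx, opts)
}))
p.PageSize = 500
err := p.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
	pod := obj.(*corev1.Pod)
	_ = pod // placeholder: do your per-pod work here
	return nil
})
```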
If you don't take this advice, one day you're going to hit an incident in production (likely in your largest clusters, even if your data set grew slowly) where list calls suddenly start failing because they cause etcd to serve more than 2GB of data in a single response (a protobuf encoding limitation), and the apiserver will observe an error like the following from etcd:
code = ResourceExhausted desc = grpc: trying to send message larger than max ([....] vs. 2147483647)
But specifying ?limit=... on your list request doesn't always mean the apiserver actually respects what you asked for; keep reading.
# Hitting the watch cache
If you are on kube-apiserver v1.31+, almost all your get/list calls are served from apiserver’s watch cache. Despite being served from a cache, these reads are still consistent with what’s in etcd (imagine your redis cache acts like your db).2
Advice for those still using pre-v1.31 Kubernetes versions: you can specify the ?resourceVersion=0 parameter on the list request to specifically hit the watch cache. This parameter means “just give me whatever the watch cache has, even if it's stale.”3 If you're writing controllers or using informers through client-go or controller-runtime, the clients make the initial list call with the resourceVersion=0 parameter, so you don't have to worry about this much in pre-v1.31 versions either.
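If you make direct list calls with client-go on those older versions, it's just a field on the list options (a sketch; clientset is a placeholder):

```go
// Ask for whatever the watch cache currently has, even if slightly stale,
// instead of forcing the apiserver to go to etcd.
pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
	ResourceVersion: "0",
})
```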
The tricky thing about v1.31+ versions is that not all requests hit the watch cache, and the behavior depends on the Kubernetes version (e.g. v1.34, not yet released as of writing, has some improvements here). More on this below.
# Deadly combo: pagination + watch cache
So far we've learned that using pagination and hitting the watch cache are both good. But when the two are combined in a request like ?limit=500&resourceVersion=0, a deadly combination takes place: the apiserver ignores the limit parameter and returns the entire list result in a single response.
(This will hopefully soon change with v1.34, more on this later.)
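In client-go terms, the deadly combo is simply this (a sketch; clientset is a placeholder, and on pre-v1.34 apiservers the limit is silently ignored because resourceVersion "0" routes the request to the watch cache):

```go
// Looks like a bounded request, but the watch cache serves it and
// (pre-v1.34) returns every matching object in a single response.
pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
	Limit:           500,
	ResourceVersion: "0",
})
```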
Encoding a very large response is catastrophic for the apiserver: it needs to buffer the objects to be returned in memory, allocate a lot of additional memory for JSON/protobuf response encoding, and spend a lot of CPU and memory on a single request. v1.33+ has improvements around some of these (more on this later).
Interestingly, this is also the reason why ArgoCD intentionally establishes informers without specifying RV=0, unlike most other Kubernetes controllers in the ecosystem.
# Concurrent list calls
Usually a single API call isn’t going to crash your apiservers or fully starve their CPU, but too many list calls happening simultaneously can easily take an apiserver process down due to OOMs.4
This is so bad that your apiserver can go from using 20GB of memory to 200GB within seconds, and there's no builtin mechanism in the apiserver to prevent this completely. The builtin APF (API Priority and Fairness) feature is not good at estimating the memory cost of a list call, even though it has some heuristics based on latency and object count. For example, the deadly combo mentioned above reads like it'll return 100 objects when it will actually return 100,000, which is a gross under-estimation.
So, as your clusters grow in place, you're sitting on a ticking time bomb: eventually, you're going to hit control plane incidents due to expensive list calls happening simultaneously in your clusters.
If a single list call causes several GBs of memory to be allocated (either because it buffers a large response, or because response encoding allocates too much), too many of these at once will cause the apiserver to OOM. (We'll shortly talk about some upcoming improvements that help with this.)
# [Pre-v1.31] Label/field selectors are really bad
On Kubernetes versions older than v1.31, where list calls hit etcd, using label selectors or field selectors over large collections is really bad. Even if the response contains nothing, it will take a very long time to arrive. This is unintuitive because selectors sound like WHERE clauses in SQL, but they don't behave like them here.
In <v1.31 versions, you can try a list call on a large collection and watch it take forever. For example, in large clusters, listing pods on a node may take 15+ seconds, even if the response is an empty list:
kubectl get pods --field-selector=spec.nodeName=foo
This is because the apiserver needs to load the whole collection from etcd (e.g. all pods in a cluster), parse them, and evaluate the selector in memory. This is called out in the v1.31 watch cache optimizations blog:
While Kubernetes can filter data by namespace directly within etcd, any other filtering by labels or field selectors requires the entire dataset to be fetched from etcd and then filtered in-memory by the Kubernetes API server. This is particularly impactful for components like the kubelet, which only needs to list pods scheduled to its node - but previously required the API Server and etcd to process all pods in the cluster.
The kubelet case highlighted in this post is a really bad reliability problem: basically, if 100 kubelets were to restart at the same time and tried to list the pods on their nodes, the apiserver would just OOM. (Thankfully kubelet uses ?resourceVersion=0 on lists even on older versions, so it's not a problem in practice.)
Thankfully, in v1.31+ versions, the watch cache is able to use its in-memory indexes to serve label/field selectors, so this is no longer a problem.
# Upstream Improvements to List API performance
Lots of maintainers in SIG API Machinery have been working on improving list API performance in recent years. As of writing (v1.33 is the latest version), we're still not at the point where all the problems listed in this article are fixed.
Here’s a handy summary of the improvements we’ll talk about shortly:
Improvement | Available in | Benefit |
---|---|---|
Consistent reads from cache | v1.31+ | Reduces load on etcd; label/field selectors are fast |
Memory-efficient list encoding | v1.33+ | Reduces apiserver memory usage while encoding large list responses |
Informers stream results instead of listing | v1.34+ (?) | Controller lists have less memory impact on the apiserver during restarts |
Paginated lists from watch cache | v1.34+ (?) | Reduces etcd load; makes the apiserver not ignore the pagination limit |
# Consistent reads from cache (v1.31+)
We talked about the watch cache a little earlier in the article. With Kubernetes v1.31+, the apiserver serves most get/list calls from its in-memory cache instead of hitting etcd (KEP-2340). This takes a ton of load off etcd and improves the performance of list calls that use label/field selectors.
It’s probably the most important list API performance improvement in the last few years, but it’s not a silver bullet:
- List calls with resourceVersion=0 ignore the limit parameter and return the whole dataset in the response, as discussed earlier. If a few of these occur simultaneously, they can crash the apiserver with an OOM due to buffering the response in memory multiple times. (Subject to change in v1.34.)
- Not all list requests hit the watch cache. If the watch cache is not initialized, you'll hit etcd. Also, depending on your list parameters (and Kubernetes version), you may not hit the cache.
Reading the code is the best way to get a definitive answer. You can also check out the KEP, or this talk by Madhav, to understand under which conditions you're hitting the cache.
To benefit from this feature, you must run etcd version 3.4.31+ or 3.5.13+ as the blog post notes.
# Memory-efficient list encoding (v1.33+)
KEP-5116, which shipped in v1.33, makes encoding list responses more memory efficient. The release blog post explains where and why older versions of kube-apiserver had to keep multiple copies of the response in memory. This feature essentially prevents the apiserver memory spike from 20GB to 200GB that we mentioned earlier.
Essentially, Go's encoding/json library marshals the whole response object at once, and therefore holds onto the whole object the whole time. The feature implements a custom encoder that builds the JSON response in chunks. Here's a gross simplification of the implementation (w is the response writer, items are the decoded objects):
// Write the list envelope by hand, then encode one item at a time,
// instead of marshaling the entire PodList in a single call.
w.Write([]byte(`{"apiVersion": "v1", "kind": "PodList", "items": [`))
for i, item := range items {
    if i > 0 {
        w.Write([]byte(`,`))
    }
    buf, _ := json.Marshal(item) // only one item is buffered at a time
    w.Write(buf)
}
w.Write([]byte(`]}`))
This feature is enabled by default in v1.33+.
# Informers stream results instead of listing (v1.34+)
Today, almost all controllers establish informers by making an initial “list” call and then a “watch” call. The sequence of a controller starting up looks like this:
GET /api/v1/pods?resourceVersion=0&limit=500
-> {"metadata": {"resourceVersion": "123456"}, "items": [...]}
GET /api/v1/pods?watch=true&resourceVersion=123456
-> (long polling connection that streams new/updated/deleted objects)
KEP-3157 proposes that informers don’t do the initial list call, and instead get existing objects streamed to them as part of the watch request:
GET /api/v1/pods?watch=true&sendInitialEvents=true&resourceVersionMatch=NotOlderThan
By doing so, the API server does not have to allocate large buffers to serve list requests and can be much more memory efficient. Given that the main source of list calls is controllers/operators establishing informers, this alleviates a lot of the list API pressure originating from these clients.
Unfortunately, the WatchList apiserver feature was beta and enabled by default in v1.32, but it was reverted to disabled-by-default in v1.33.
Clients also won't automatically use this feature: they (currently) must enable the client-side WatchListClient feature flag (which can be done by setting the KUBE_FEATURE_WatchListClient=true environment variable). If you have old controllers compiled against client-go older than v0.30, you should upgrade the module to be able to set this client-side feature flag.
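The controller code itself doesn't change; a hedged sketch (clientset is a placeholder, assumed imports are k8s.io/client-go/informers and k8s.io/client-go/tools/cache):

```go
// A standard shared informer. With KUBE_FEATURE_WatchListClient=true in the
// process environment (and the server-side WatchList feature enabled), the
// reflector streams the initial state via watch; otherwise the same code
// falls back to the classic list+watch.
factory := informers.NewSharedInformerFactory(clientset, 0)
podInformer := factory.Core().V1().Pods().Informer()
factory.Start(ctx.Done())
cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced)
```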
# Paginated lists from watch cache (v1.34+)
Currently, apiserver can't serve pagination continuation tokens (the ?continue=... parameter) from its watch cache, and such requests will hit etcd. This is the case for all versions of apiserver, including the latest v1.33.
As we discussed earlier, when your controller starts its informer with the default ?limit=500&resourceVersion=0 request, the apiserver returns the entire result in the first page (so there's no need to query a second page). But obviously this huge list response is still a problem that needs to be solved.
KEP-4988 proposes that the apiserver watch cache respect the limit parameter and serve these paginated lists from memory.
This feature is alpha in v1.33, and the upcoming v1.34 release will make it available by default as beta.
# Recommendations for kube-apiserver scalability
What prompted us to learn about all these list API performance characteristics the hard way was a series of internal incidents around control plane availability and performance, combined with the fact that we were running pre-v1.31 versions of apiserver.
So definitely upgrade your clusters to at least Kubernetes v1.31, or v1.33 if you can. (That said, as of writing most cloud k8s providers don't have v1.33 in their stable release track, since they follow an N-2 version policy.) Understand which reliability risks you're carrying until the improvements listed above land in your clusters.
Another point: enable audit logs and keep an eye on who makes list calls on large collections (especially unpaginated ones, and the deadly combo) and how frequently. The apiserver's APF feature is not good at estimating the cost of list calls, so you should crack down on clients in your ecosystem making these calls. At any large company, there are typically different teams writing various controllers, CSI drivers, or other agents, so this is not easy to keep track of.
Say no to daemonsets or other per-node agents making list calls to your API server. It's a common mistake to write a daemonset that watches pods on its node by querying the apiserver for them. Next thing you know, you have 1,000+ concurrent lists happening during a daemonset rollout. Avoid this at all costs, as this query model will never scale well with the node count. (For example, node agents should use the node-local kubelet API to find the list of pods on the node.)
Use least-privilege RBAC (don't let any authenticated client get/list resources), and start assigning each client/controller an explicit APF priority. Otherwise you're letting clients run wild in your clusters. Restricting clients is much harder down the road than having a policy from the get-go and loosening it in a controlled manner later.
Lastly, experiment with the GOGC environment variable to tune how often garbage collection happens in the kube-apiserver process. In many list API-related incidents, we've seen the apiserver spend too much CPU time doing GC because list calls make the heap size double so frequently. We've found GOGC=200 to be better than not setting it. But if you use a cloud-managed Kubernetes, you probably won't get to configure this.
Hopefully this article helps you understand the list API performance characteristics in large clusters. Please don’t hesitate to reach out to me if you have any corrections.
Special thanks to Madhav Jivrajani, one of the contributors to the apiserver's storage layer (and our intern this summer), for providing pointers to a lot of these details and for corrections during our research into this topic. He also has a great talk on the subject.
-
There are some reasons to run larger Kubernetes clusters, mainly if you don't have workload federation to split workloads across multiple clusters. In our case, we also have centralized regional k8s control planes managing a fleet of hundreds of thousands of servers. See our KubeCon EU 2025 talk to learn more. We're not alone in doing so: Uber and OpenAI also run large clusters beyond the supported limits. ↩︎
-
How the apiserver ensures cached data is up-to-date with etcd is brilliant (and is explained here). Basically, on every read request, the apiserver makes a cheap request to etcd to learn the highest RV, and holds onto the read until its cache has caught up to that RV. This happens pretty fast anyway, and it's far better than querying gigabytes of data from etcd. ↩︎
-
From KEP-2340: “After almost a year in Beta, running enabled default, we have collected the following data: […] In 99.9% of cases watch cache became fresh enough within 110ms.” So even in pre-v1.31 versions, the watch cache is typically only slightly stale. ↩︎
-
If you have something like a 2,000-node cluster (or create 10k resources in a test cluster), try listing all pods with a call like kubectl get --raw=/api/v1/pods with some parallelism (say, 5) and watch your apiserver memory skyrocket. ↩︎