Last week, OpenAI suffered a several-hour outage and published a detailed postmortem about it. I highly recommend reading it. These technical reports are usually a gold mine for large-scale Kubernetes users, as we all go through a similar set of reliability issues running Kubernetes in production.
The gist of the incident is that a new monitoring agent scraping control plane telemetry DoS’ed the API server:
[…] we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
[…] so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster.
[…] Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters.
We don’t know whether OpenAI used API Priority & Fairness (APF) here, or whether it would’ve shielded the apiserver, but this situation is relatable: we also have monitoring agents on each node that query the Kubernetes API for metadata (e.g. the node’s labels) to decorate logs/metrics with extra dimensions.1
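For context, such an agent typically runs as a DaemonSet and learns which node it is on via the downward API before fetching that Node object from the API server. A minimal sketch, with made-up names and image:

```yaml
# Hypothetical monitoring-agent DaemonSet: the container gets its node name via
# the downward API, then fetches that Node object from the API server to read
# labels it stamps onto logs/metrics as extra dimensions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-agent        # hypothetical name
  namespace: observability      # hypothetical namespace
spec:
  selector:
    matchLabels: {app: monitoring-agent}
  template:
    metadata:
      labels: {app: monitoring-agent}
    spec:
      serviceAccountName: monitoring-agent   # needs RBAC to get nodes
      containers:
      - name: agent
        image: example.com/monitoring-agent:latest   # placeholder image
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName   # used for GET /api/v1/nodes/$(NODE_NAME)
```

One such GET per node is cheap; the problem in OpenAI’s case was per-node operations whose cost scaled with the size of the cluster.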
During the remediation, kube-apiserver being under load prevented the admins from deleting the offending agent. This is where APF helps us a little, as we prioritize requests from ServiceAccounts lower than requests from system administrators.
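For illustration (a hedged sketch with made-up names and numbers, assuming a recent Kubernetes release for the flowcontrol v1 API), deprioritizing a node agent’s ServiceAccount with APF looks roughly like this:

```yaml
# Illustrative APF objects: requests from the agent's ServiceAccount land in a
# small, low-priority concurrency bucket, so kubectl/admin traffic in the
# default priority levels keeps getting through when the agents misbehave.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: node-agents                    # hypothetical name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 5        # small slice of apiserver concurrency
    limitResponse:
      type: Reject                     # shed excess load instead of queueing
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: node-agents                    # hypothetical name
spec:
  priorityLevelConfiguration:
    name: node-agents
  matchingPrecedence: 500
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: monitoring-agent         # hypothetical agent ServiceAccount
        namespace: observability
    resourceRules:
    - verbs: ["get", "list", "watch"]
      apiGroups: [""]
      resources: ["nodes"]
      clusterScope: true
```

Requests shed or queued here still consume some apiserver capacity, so APF softens this failure mode rather than eliminating it.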
However, the telemetry agent rollout was only a trigger, not the root cause. The outage seems to have happened due to a coupling between the control plane and the data plane in DNS service discovery:
[…]Kubernetes API server is required for DNS resolution, which is a critical dependency for many of our services.
[…] As cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution.
Presumably, a custom DNS server is at play here, and I’m not sure why it did not continue to serve stale records for a while (e.g. CoreDNS would continue to serve stale records even if its watch connection to the API server is broken).
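For reference, a stock CoreDNS setup already behaves this way: the kubernetes plugin answers from its in-memory informer cache, so a broken watch means progressively staler answers rather than outright resolution failures. A trimmed version of the common default config, roughly:

```yaml
# Roughly the default CoreDNS ConfigMap (trimmed). The `kubernetes` plugin
# serves Service/Pod records from its local informer cache, so it keeps
# answering (with increasingly stale data) if the watch to the API server drops.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30                 # clients cache answers for 30s
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
    }
```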
That’s why asynchronous DNS announcement is inherently risky: if you roll out a new version of an app (shuffling all Pods onto new nodes) while the component responsible for keeping DNS records up to date is down, the rollout will go through, since there’s no issue bringing the Pods up and their readiness probes will pass, but no traffic will reach the new Pods, and the old Pods still served in DNS answers are long gone.
This is one of the main reasons we still do DNS announcements synchronously in Pod startup with initContainers instead of moving this service announcement logic to a controller. (Dmitry pointed out on X that one could halt the rolling update of a Deployment/ReplicaSet by attaching an additional condition, set by an asynchronous controller, to the Pod and using custom Pod readiness gates to wait on that condition.)
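A sketch of that suggestion (hypothetical names; not something we run today): the Pod declares a custom readiness gate, and kubelet won’t report it Ready, so the rolling update won’t proceed, until an external controller patches the matching condition into the Pod’s status.

```yaml
# Sketch of the readiness-gate variant: the Deployment rollout stalls until a
# DNS controller sets the "example.com/dns-announced" condition to True on the
# Pod, so a dead controller blocks the rollout instead of shipping Pods that
# never appear in DNS.
apiVersion: v1
kind: Pod
metadata:
  name: my-app                                    # hypothetical
spec:
  readinessGates:
  - conditionType: "example.com/dns-announced"    # hypothetical condition type
  containers:
  - name: app
    image: example.com/my-app:latest              # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
```

The controller would then patch `status.conditions` with that type set to `"True"` once the record is published; until then the Pod stays NotReady even though its probe passes.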
Another tidbit is that this change was tested in a staging cluster (which is likely much smaller) and the problem wasn’t caught there. This is relatable and similar to the NFD incident I recently posted about, which we only hit in our largest production clusters.
Something from the postmortem that is both impressive and concerning is how quickly a commit rolled out to the production environment:
- 2:23pm: The change [...] was merged and the deployment pipeline triggered
- 2:51pm to 3:20pm: The change was applied to all clusters
- 3:13pm: Alerts fired, notifying engineers
I’m guessing the bake time in staging was roughly 30 minutes or so, and the next step was a rollout to the production environment that ended up causing a global outage. This is a very tight window for a global change; what we typically do is go through multiple staging environments in multiple regions, then through production clusters in batches, one region at a time, with 30-60 minutes of soak time.
Right-sizing the Kubernetes control plane seems to be an active challenge for most companies, mostly because we tend to think of CP components as statically sized systems with a fixed number of replicas, whereas the size of the CP is really a function of the cluster’s size and traffic. Overprovisioning based on the largest cluster is a common practice here.
Some cloud providers that offer managed Kubernetes scale the CP as a step-function of the number of nodes in the cluster, and some are able to run the CP as Pods on another cluster (and use autoscaler components like HPA/VPA). This is an area we are actively working on as we’re trying to move our control planes to Pods.
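As a rough illustration of that model (a hosted control plane running as Pods on a management cluster; all names and numbers here are hypothetical), the API server can then be autoscaled like any other workload:

```yaml
# Hypothetical HPA for a kube-apiserver Deployment hosted on a management
# cluster: control plane capacity follows load instead of being a fixed,
# hand-picked replica count sized for the largest cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kube-apiserver             # hypothetical
  namespace: cluster-abc123        # hypothetical per-cluster namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-apiserver
  minReplicas: 3
  maxReplicas: 9
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```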
Props to the OpenAI team for sharing a detailed postmortem, and congrats on the speedy recovery.
Some other folks have published their takes on the report [1] [2] if you’re interested in their analysis.
1. Surprisingly, the kubelet API on the nodes doesn’t offer an endpoint to get the Node object to extract metadata from, so every agent has to make a GET /api/v1/nodes/{name} query to the API server. I think the kubelet API used to have a /spec endpoint, but it was removed a long time ago. ↩︎