This is the analysis of a low-severity incident that took place in the Kubernetes clusters at the company I work at. It taught me a lot about how to think about the off-the-shelf components we bring from the ecosystem into the critical path and operate at a scale much larger than these components were intended for.
Many years ago, when we were first starting to run Kubernetes, the team figured the node-feature-discovery (NFD) project looked like a DaemonSet we could install on every node to expose the bare-metal node’s features (like CPU type, PCI devices, NUMA enablement) as node labels on Kubernetes, and allow workloads to specify these labels as scheduling predicates.
Nowadays, NFD is even more popular thanks to NVIDIA’s GPU feature discovery and device plugin, both of which require NFD to be installed in the cluster to expose GPU features as node labels and use them for scheduling.
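To illustrate what consuming those labels looks like in practice, here is a minimal sketch of a workload pinning itself to NFD-labeled nodes. The label key is one of NFD’s standard CPU feature labels; the pod and image names are made up for illustration.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// avx512Pod builds a Pod that only schedules onto nodes whose CPU advertises
// AVX-512, as discovered and labeled by the NFD worker running on each node.
func avx512Pod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "avx512-workload"},
		Spec: corev1.PodSpec{
			// Scheduling predicate on a label published by nfd-worker.
			NodeSelector: map[string]string{
				"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
			},
			Containers: []corev1.Container{
				{Name: "app", Image: "registry.example.com/app:latest"},
			},
		},
	}
}
```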
After 5 years of happily running a really old version of NFD, we decided to upgrade the component. The new version was architecturally the same (an agent on the node notifies a “master” component of the node features, and the master labels the nodes). In the new version, however, the agent on the node writes into a new NodeFeature custom resource that the master component watches, instead of communicating over gRPC.
As with any big upgrade done after a long while, things went poorly. We ran into bugs in the new version that removed all node labels (which breaks pod scheduling), and we observed scale issues that manifested only in the largest (and also most critical) Kubernetes clusters in our fleet.
In the end, we decided to roll back to the five-year-old version until we phase out and remove the component from our Kubernetes clusters (except for the GPU nodes that still need it), and to give the open source project this feedback so that others can benefit from the component more safely.
Scale issues
A major architectural shift in modern NFD versions is that the NFD workers now write into the NodeFeature custom resource (which previously didn’t exist) to communicate the node features from the node to the controller that labels the Node objects on the Kubernetes API server. This didn’t seem concerning prior to the upgrade, as we write our fair share of Kubernetes controllers and custom resources.
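For a sense of the moving parts, here is a minimal sketch of that watch pattern (not NFD’s actual code), assuming the NodeFeature CRD is served under the nfd.k8s-sigs.io/v1alpha1 group/version: the master no longer receives gRPC calls from the workers, it watches NodeFeature objects through an informer and labels Nodes based on them.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// GroupVersionResource for the NodeFeature objects written by nfd-worker
	// (assumed to be nfd.k8s-sigs.io/v1alpha1).
	gvr := schema.GroupVersionResource{Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeatures"}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			u := obj.(*unstructured.Unstructured)
			// This is where a labeler would translate reported features into Node labels.
			fmt.Printf("observed NodeFeature %s/%s\n", u.GetNamespace(), u.GetName())
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	if !cache.WaitForCacheSync(stop, informer.HasSynced) {
		panic("cache failed to sync")
	}
	select {} // keep watching
}
```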
Before the upgrade, we took a backup of all the nodes and their labels in our fleet into a temporary YAML dump in case things went south. As the upgrade rolled out in our pre-production clusters, things went fairly smoothly.
It wasn’t until the new version hit the larger clusters in production that we found out the hard way that each of these NodeFeature custom resources takes up ~140 KB in the kube-apiserver/etcd. This was partially because NFD reports a ton of kernel settings by default that we didn’t use.
This is particularly problematic because the kube-apiserver stores all custom resources as JSON (built-in Kubernetes resources are stored in protobuf encoding), which is a lot less space-efficient. If you do the math for 4,000 nodes, you’re looking at roughly 540 MB just to store some features of the nodes in a cluster.
Since we were running etcd with the recommended storage limit of 8 GB (which we’re now considering increasing), and the kube-apiserver keeps a watch cache for all resources by default, this put more strain on both etcd and the kube-apiserver.
The large object sizes made the NFD controller unable to list the large number of NodeFeature objects from the apiserver, causing its list requests to repeatedly time out.
Unfortunately, the NFD controller proceeding to start its job without a successful list response from the apiserver triggered a worse bug…
Bugs leading to node label removals
We rely on node labels to route workloads to the correct node pool (and similarly, to keep unwanted workloads away from dedicated single-tenant node pools). Even though our node labels are rather static, if the component that manages your node labels decides to remove them, you’ll have a big problem at hand.
Something we relied on NFD for was to keep the node labels in place unless it was certain that a node label should be removed. However, after upgrading, we lost all node labels managed by NFD nearly simultaneously.
It turned out that in NFD v0.16.0 (and in many versions prior), the controller starts up without an authoritative list of NodeFeatures from the apiserver. So when the controller could not find a Node’s NodeFeature object because its cache was incomplete, it would treat the list of node labels as “empty” and go ahead and remove all node labels.
Normally, Kubernetes controllers must not start unless the controller has successfully built an informer cache. However, NFD did not check the return value of the WaitForCacheSync() method (which would’ve told the controller to not start with a missing cache while its list request was timing out). This issue was reported here.
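To make the failure mode concrete, here is a minimal sketch (not NFD’s actual code) of the client-go startup pattern in question: if the informer’s initial list never completes, WaitForCacheSync() returns false, and the controller must refuse to start rather than reconcile against an empty cache.

```go
package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		klog.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	stopCh := make(chan struct{})
	defer close(stopCh)

	factory := informers.NewSharedInformerFactory(clientset, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	factory.Start(stopCh)

	// This is the check that was missing: if the initial LIST times out (as it
	// did with ~140 KB NodeFeature objects at 4,000-node scale), the cache never
	// syncs and the controller must not start acting on it.
	if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
		klog.Fatal("informer cache failed to sync; refusing to start")
	}

	// ... only past this point is it safe to start reconciling ...
}
```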
This bug was easily reproducible on a kind cluster: install NFD v0.16.0 and observe that the kind node gets its feature labels. Next, create 1,000 fake Nodes and NodeFeatures, and watch the previously added feature labels on the kind node disappear as the nfd-master controller runs into list timeouts from the apiserver (which there’s now a fix for).
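For reference, a hedged sketch of one way to generate that synthetic load; the node names and count are illustrative, and the NodeFeature objects (which are what actually inflate the list responses) would be created similarly through the dynamic client.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Create 1,000 fake Node objects to bloat the apiserver's list responses.
	for i := 0; i < 1000; i++ {
		node := &corev1.Node{
			ObjectMeta: metav1.ObjectMeta{
				Name:   fmt.Sprintf("fake-node-%04d", i),
				Labels: map[string]string{"synthetic": "true"}, // mark for easy cleanup
			},
		}
		if _, err := client.CoreV1().Nodes().Create(context.TODO(), node, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```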
Upon further auditing of the code, we found several other failure modes that would similarly lead to node label removals under different conditions:
- We found other controllers like nfd-gc that also did not check the return value of the WaitForCacheSync() method, which would similarly cause node label removals. This was reported here and fixed here and here.
- We recently found newly introduced code where this mistake is repeated once again (reported here, yet to be fixed). This rather proves my point that implementing controllers correctly is inherently hard. NFD uses the Kubernetes Go client directly and manages the informer lifecycle (which is a low-level primitive); using a higher-level controller development framework like controller-runtime would make it impossible to get this wrong (see the sketch after this list).
- NodeFeature custom resources have their owner references set to the nfd-worker Pod, which is a bad idea, because these Pods get deleted all the time (during upgrades, etc.), and Kubernetes would garbage-collect the NodeFeature resources along with them. This issue is reported here. The newer v0.16.6 version changes the owner reference to the DaemonSet, which I think is still the wrong fix (deletion of a DaemonSet should not cause all node labels to be nuked). NFD would treat the lack of the NodeFeature object as “node has no labels” and proceed with node label removals as well. This was reported here and a fix was merged (however, it’s still not the default behavior and requires you to start NFD with the --no-owner-refs flag).
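As a point of comparison, here is a minimal controller-runtime sketch (a stand-in, not NFD’s actual logic) of what the framework takes off your hands: the manager owns the shared cache, waits for it to sync, and only then starts calling Reconcile, so “reconciling against an unsynced cache” is not a mistake you can make by forgetting a return-value check.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type nodeLabeler struct {
	client.Client
}

func (r *nodeLabeler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// By the time this runs, the manager has already synced its caches.
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... compute and patch node labels here ...
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	r := &nodeLabeler{Client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Node{}).Complete(r); err != nil {
		panic(err)
	}
	// Start blocks; it fails instead of running reconcilers if the caches
	// cannot be populated.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```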
Overall, we decided we’re probably better off not relying on dynamic node-labeling controllers to block scheduling, and moved all static node labels directly into the kubelet configuration file.
Conclusions
Kubernetes controllers are harder to write correctly than they appear. Frameworks like kubebuilder/controller-runtime are really good at giving you the impression that you have something working.
As we grew our Kubernetes infrastructure over time, something we observe time and time again is that very few projects¹ outside the kubernetes/kubernetes repo scale well in large clusters. As a large Kubernetes customer, it’s on you to do your own code audits, scale tests, and to understand the failure modes of the components you bring into your stack.
If you’re a controller developer, consider performing a scale test on your controller in a large cluster (tools like kwok can help with creating synthetic clusters), and monitor your controller’s behavior through audit logs and metrics: how long reconciliation loops take, whether your controller is able to keep up with the rate of change in the cluster, and so on.
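As one concrete (and hypothetical) example of the kind of metric meant here, a sketch of timing each reconciliation loop with a Prometheus histogram; the metric name and the wrapper function are made up for illustration.

```go
package controllermetrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// reconcileDuration tracks how long each reconciliation loop takes; watching
// its tail latencies tells you whether the controller keeps up with the
// cluster's rate of change. The metric name is hypothetical.
var reconcileDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_controller_reconcile_duration_seconds",
	Help:    "Time spent in a single reconciliation loop.",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(reconcileDuration)
}

// TimedReconcile wraps any reconcile function with the histogram above.
func TimedReconcile(reconcile func() error) error {
	timer := prometheus.NewTimer(reconcileDuration)
	defer timer.ObserveDuration()
	return reconcile()
}
```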
As an end user, I now feel the need to understand how an external off-the-shelf controller works (to the extent that I’ll probably audit the code) before bringing it into the critical path of my Kubernetes clusters.
¹ I also maintain some repos under the kubernetes-sigs org, and my code there wouldn’t pass my own quality and testing bar today, which makes me extra cautious about the quality of off-the-shelf non-core Kubernetes components.