Anyone who is running Kubernetes in a large-scale production setting cares about having a predictable Pod lifecycle. Having unknown actors that can terminate your Pods is a scary thought, especially when you’re running stateful workloads or care about availability in general.
There are so many ways Kubernetes terminates workloads, each with non-trivial (and not always predictable) machinery, and there's no page that lists all the eviction modes in one place. This article will dig into Kubernetes internals to walk you through all the eviction paths that can terminate your Pods, explain why "kubelet restarts don't impact running workloads" isn't always true, and leave you with a cheatsheet at the end.
Eviction API
API-initiated Pod eviction is a way to delete Pods while respecting a configured PodDisruptionBudget (PDB) –that's practically all it does differently than directly deleting a Pod. The logic behind the Eviction API is an atomic decrement of the PDB's status.disruptionsAllowed counter, which makes sure that concurrent eviction requests don't dip availability below what's allowed.
Perhaps the most amusing thing about the Eviction API and PDBs is that basically nothing¹ in the Kubernetes core uses them, so you won't see them in the rest of this article. Workload controllers like ReplicaSets, Deployments and StatefulSets go ahead and delete Pods directly, completely ignoring your PDBs during a rollout. (💡That's why the Deployment and StatefulSet APIs have their own separate maxUnavailable knob that's independent of the PDB setting with the exact same name. As an app owner, it's your job to make sure this setting doesn't allow more disruption than your PDB does, otherwise you'll dip below your PDB during rollouts.)
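To make that concrete, here's a sketch of keeping the two knobs aligned (names, replica count and image are made up): the rollout is allowed to take down at most one Pod at a time, which matches what the PDB tolerates.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1          # voluntary disruptions: at most 1 Pod down at a time
  selector:
    matchLabels:
      app: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # rollout disruption: keep this <= the PDB's budget
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx          # placeholder image
```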
What I dislike most about the Eviction API is that if you want to keep tight control over Pod lifecycle via an admission webhook, you must handle two distinct paths (Pod deletion and Pod eviction) with two distinct webhooks.²
Kubelet node-pressure eviction
The kubelet acts as a node health remediation actor and kicks Pods off the node when the node is under memory/disk/inode pressure. For the most part this is a nice feature: if the pressure isn't remediated early by kicking out a lower-priority Pod, exhausting these resources would end up impacting a random unlucky Pod instead.
This feature is rather well-documented. But if you want tight control over Pod lifecycle, or you have your own node remediation mechanisms, you probably don't want another actor (the kubelet) in this picture, as it reduces your ability to reason about Pod lifecycle overall.
When Pods are evicted by the kubelet in this path, their PDBs are not respected (the Eviction API is not used). Furthermore, hard thresholds cause the kubelet to terminate Pods immediately, and even soft evictions ignore the Pod's own graceful termination period and cap the grace period at a preconfigured value.
Thankfully it’s possible to disable hard/soft evictions in the kubelet configuration.
Taint-based eviction
If a Pod happens to be on a node that has a NoExecute taint, it gets evicted from the Node after a while. But how does this work?
Every kubelet reports a heartbeat to the Kubernetes API server every 10 seconds by updating a Lease resource. The node-lifecycle controller built into Kubernetes (a routine in the kube-controller-manager process) monitors this Lease and sets the Ready condition on the Node to Unknown if it doesn't hear a heartbeat from the kubelet for 50 seconds (configurable).
When this occurs, the node-lifecycle controller adds the "unreachable" taint to the Node with effect=NoExecute. By default, Kubernetes adds a toleration to every Pod to tolerate this NoExecute taint for 5 minutes. (This defaulting can be turned off by disabling the relevant admission controller in the apiserver, or you can specify a custom toleration in your PodSpec.)
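For instance, a custom toleration in the PodSpec like the sketch below (the 30-second value is just an illustration) replaces the 5-minute default; omitting tolerationSeconds entirely would make the Pod tolerate the taint forever and never get evicted on this path.

```yaml
spec:
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
```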
Once a Node is tainted by the node-lifecycle controller, the kubelet on that Node might not even know about it, since the taint is most likely there because the kubelet is down or network-partitioned. In this case the Pod eviction is not carried out on the node by the kubelet, but from the outside –another controller built into Kubernetes, the "taint eviction controller", evicts the Pods.
The taint manager controller used to be part of the node-lifecycle controller and couldn’t be turned off. Thanks to friends at Apple, as of Kubernetes 1.29, it’s now a separate controller, and can be entirely disabled. This is really useful if you want finer control over evictions due to NoExecute taints.
This eviction path also doesn't use the Eviction API, and directly deletes Pods. In this case, the graceful termination period is respected by the kubelet while terminating the Pod, and you'll see an Event named TaintManagerEviction on the Pod when this happens. If the kubelet is unreachable, the Pod entry in the API will remain stuck in Terminating indefinitely (unless it's force-deleted –at the risk of leaving the Pod running on the disconnected kubelet indefinitely).
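To check whether this has been happening in your cluster recently (by default events are only retained for about an hour), a field selector on the event reason should surface them:

```sh
kubectl get events --all-namespaces --field-selector reason=TaintManagerEviction
```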
Kubelet admission
This is possibly the least-known way a Pod that was running perfectly fine earlier can end up not running anymore: the kubelet can refuse to run a Pod that it was running just moments ago, and immediately terminate it without respecting PDBs.
But to explain this, we need to rewind a bit and start the story from the kube-scheduler: when the kube-scheduler tries to find a suitable Node to assign a Pod to (a one-time, permanent decision), it runs a set of filters –whether the node has enough resources, whether the node matches the specified node selectors or other affinity rules, or whether the requested host ports are already allocated to another Pod, to name a few.
What you probably didn't know is that the kubelet also runs these filters as admission checks while admitting the Pod into the node. There are good reasons the kubelet does this: for example, you can bypass the kube-scheduler and directly assign a Pod that doesn't belong there to a Node (by setting the Pod's spec.nodeName) as if you were the scheduler.
What’s even lesser known here is that kubelet runs these admission checks not once, but also after a kubelet restart. This basically means a Pod that was running earlier can be killed by the kubelet if it fails any kubelet admission checks later on due to a crash or a benign restart.
You can easily try this at home (a sample Pod manifest is given below):

- Pick a Node and label it (e.g. role=worker), and create a Pod with the same nodeSelector.
- Observe that the Pod is scheduled and is in the "Running" state.
- Now remove the Node label you've added, and observe the Pod is still Running.
- Now restart the kubelet (which should transition the Pod into the Error state):

```
kubectl get pods
NAME           READY   STATUS   ...
alpine-sleep   0/1     Error    ...
```
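Here's a minimal Pod manifest for the first step above, matching the alpine-sleep name in the output (the image, command and the role=worker label are my choices for this experiment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alpine-sleep
spec:
  nodeSelector:
    role: worker                  # must match the label you put on the Node
  containers:
  - name: sleep
    image: alpine:3               # any small image works
    command: ["sleep", "3600"]
```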
This is not really an "eviction", since the Pod entry still exists in the API –but the Pod is no longer run by the kubelet (its containers are killed). In this case, you'll see that the Pod has transitioned into status.phase=Failed permanently, and the status message will say something like this (which also used to be more cryptic and didn't tell you what was failing):

```
Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector
```
This is a terminal condition and a permanent failure for this Pod. This Pod will not be runnable again even if the admission checks pass later or the kubelet restarts again –and it will not be cleaned up from the API server until 12,500 such Pods exist (only then will the podgc controller delete them³)⁴. So I highly recommend monitoring the number of such failed Pods in your clusters and alerting on it.
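A quick one-off way to spot these is a field selector on the Pod phase; for actual alerting you'd want a metric such as kube-state-metrics' kube_pod_status_phase (assuming that's part of your monitoring stack):

```sh
# List Pods stuck in the Failed phase across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```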
This is a particularly messy eviction mode for Pods that previously ran fine on the node. Essentially, a point-in-time snapshot of node labels (you'd hope nothing changes your node labels), combined with a kubelet restart (due to cert renewal or a crash), can suddenly terminate your Pods –without respecting PodDisruptionBudgets and without graceful termination.
The kubelet has this problem because, after a restart, it (like most controllers) starts with an empty cache and thinks every Pod it sees is a newly added Pod. The code spells out that the kubelet doesn't store on disk (or rebuild after a restart) the state of Pods it has already launched, so this may be fixed in the future. "Restarting the kubelet does not impact running workloads" is a misconception [1][2] that I've also been regurgitating for many years at this point. There have been other bugs around this admission machinery, such as run-to-completion Pods that had already Succeeded being re-admitted and marked as Failed.
Most workload controllers (e.g. ReplicaSet, StatefulSet, CloneSet) seem to recover from this situation. But if you write a workload controller yourself, handling terminal Pod phases like Failed seems like a good idea.
Pod preemption
In addition to node drainers, the taint manager and the kubelet, a fourth actor that can terminate your Pods is the kube-scheduler. If there are no nodes suitable for scheduling a Pod, the kube-scheduler tries to find nodes where evicting lower-priority Pods would make room for the Pod waiting in the scheduling queue.
In doing so, the kube-scheduler ranks the candidate nodes by how few PDB violations preempting their Pods would cause. If there are nodes where preemption wouldn't violate any PDBs, those are picked first, which is good.
Once a node is selected, lower-priority Pods whose eviction doesn't violate their PDBs are evicted first. If no such Pods exist, the kube-scheduler will violate some PDBs in favor of making room for the higher-priority Pod; in that case, victims are sorted by priority and the lowest-priority Pods are preempted first. The victims' graceful termination period is respected.
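Priorities come from a PriorityClass referenced in the PodSpec. A minimal sketch (the class name, value and image are made up; Pods without a priorityClassName typically get priority 0, which makes them easy preemption victims):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important                          # made-up name
value: 100000                              # higher value wins; lower-priority Pods can be preempted for this one
preemptionPolicy: PreemptLowerPriority     # the default; set to Never so Pods of this class never preempt others
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: important
  containers:
  - name: app
    image: nginx                           # placeholder image
```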
Node deletion
Lastly, when a Node resource is removed from the Kubernetes API, you'll see that all the Pods running on that node are immediately terminated. This is not a documented eviction mode at all as far as I know, since most Kubernetes users probably don't even know who deletes the Node resource in the first place.
If you’re using a cloud provider, the cloud-controller-manager handles Node deletion, and if you’re using Cluster API (CAPI), its machine controller handles this for you. But if you implement your own cluster management tooling (like we do), you’re responsible for handling Node deletion after removing machines from your cluster. This also means you’re now subject to this eviction mode in case you have a bug unintentionally deletes your Nodes.
When a Node resource is removed, the orphaned Pods assigned to that Node are cleaned up by the built-in Kubernetes controller named "pod-garbage-collector". This Pod deletion is forcible (equivalent to kubectl delete pod --force, which specifies gracePeriodSeconds=0) and therefore does not respect graceful termination periods.
Conclusion
Here’s a quick summary of the eviction paths that we discussed above.
| Method | Eviction Actor | PDB respected? | Graceful termination? |
|---|---|---|---|
| Eviction API | node drainers, kubectl drain | yes | yes |
| Direct Pod deletion | workload controllers during rollouts (e.g. ReplicaSet, StatefulSet) or kubectl delete | no | yes |
| Node pressure (soft) | kubelet | no | yes |
| Node pressure (hard) | kubelet | no | no |
| NoExecute taint | kube-controller-manager (taint manager) | no | yes |
| Kubelet admission | kubelet | no | no |
| Priority preemption | kube-scheduler | best effort | yes |
| Node deletion | kube-controller-manager (podgc controller) | no | no |
1. The only thing in the Kubernetes core that actually uses this Eviction API path to evict Pods while respecting PDBs is the kubectl drain command. Additionally, cloud providers and open source projects like cluster-autoscaler, Karpenter and Cluster API use the Eviction API while scaling down or removing nodes.
2. Writing a webhook that intercepts /evict requests for a subset of Pods is actually not straightforward, either: you can normally intercept a subset of Pod deletion requests by filtering them based on Pod labels. However, Eviction admission requests don't carry the labels of the Pod in the request body (just the name and namespace), so you can't filter Eviction requests selectively. You'd end up writing a webhook that potentially intercepts all eviction requests in the entire cluster –and if this webhook goes down, evictions are blocked in the entire cluster. Pretty hairy stuff.
3. I actually don't know why the kubelet doesn't just delete the Pod as if it's actually evicting it in this case. If anyone knows why, I'd be happy to make the edit.
4. I found these kubelet admission checks to be rather arbitrary and not always in line with the kube-scheduler filters. For example, node selectors, affinity rules and the node OS label are enforced predicates; but cordoning a node (a NoSchedule taint) and restarting the kubelet doesn't cause the Pod to fail. Not sure why.