Anyone who is running Kubernetes in a large-scale production setting cares about having a predictable Pod lifecycle. Having unknown actors that can terminate your Pods is a scary thought, especially when you’re running stateful workloads or care about availability in general.
There are so many ways Kubernetes terminates workloads, each with non-trivial (and not always predictable) machinery, and there's no page that lists all the eviction modes in one place. This article will dig into Kubernetes internals to walk you through all the eviction paths that can terminate your Pods, explain why "kubelet restarts don't impact running workloads" isn't always true, and leave you with a cheatsheet at the end.
Eviction API
API-initiated Pod eviction is a way to delete Pods while respecting a configured PodDisruptionBudget (PDB) –that's practically all it does differently than directly deleting a Pod. The logic behind the Eviction API is an atomic decrement of the PDB's status.disruptionsAllowed counter, which makes sure that concurrent eviction requests don't dip availability below what's allowed.
Perhaps the most amusing thing about the Eviction API and PDBs is that basically nothing¹ in the Kubernetes core uses them, so you won't see them in the rest of this article. Workload controllers like ReplicaSets, Deployments and StatefulSets go ahead and delete Pods directly, completely ignoring your PDBs during a rollout. (💡That's why the Deployment and StatefulSet APIs have their own separate maxUnavailable knob that's independent of the PDB setting with the exact same name. As an app owner, it's your job to make sure this setting doesn't allow more disruption than your PDB does, otherwise you'll dip below your PDB during rollouts.)
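To make that concrete, here's a sketch of keeping the two knobs aligned (names, replica count and image are made up): the rollout is allowed to take down at most one Pod at a time, which matches what the PDB tolerates.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1          # voluntary disruptions: at most 1 Pod down at a time
  selector:
    matchLabels:
      app: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # rollout disruption: keep this <= the PDB's budget
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx          # placeholder image
```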
What I dislike most about the Eviction API is that if you want to keep tight control over Pod lifecycle via an admission webhook, you must handle two distinct paths (Pod deletion and Pod eviction) with two distinct webhooks.²
Kubelet node-pressure eviction
The kubelet acts as a node health remediation actor and kicks Pods off the node when the node is under memory/disk/inode pressure. For the most part this is a nice feature: if the pressure isn't remediated early by kicking out a lower-priority Pod, exhausting these resources would end up impacting a random unlucky Pod instead.
This feature is rather well-documented. But if you want tight control over Pod lifecycle, or you have your own node remediation mechanisms, you probably don't want another actor (the kubelet) in this picture, as it reduces your ability to reason about Pod lifecycle overall.
When Pods are evicted by the kubelet in this path, their PDBs are not respected (the Eviction API is not used). Furthermore, hard thresholds cause the kubelet to terminate Pods immediately, and even soft evictions ignore the Pod's own graceful termination period and cap the grace period at a preconfigured value.
Thankfully it’s possible to disable hard/soft evictions in the kubelet configuration.
Taint-based eviction
If a Pod happens to be on a node that has a NoExecute taint, it gets evicted from the Node after a while. But how does this work?
Every kubelet reports a heartbeat to the Kubernetes API server every 10 seconds by updating a Lease resource. The node-lifecycle controller built into Kubernetes (a routine in the kube-controller-manager process) monitors this Lease and sets the Ready condition on the Node to Unknown if it doesn't hear a heartbeat from the kubelet for 50 seconds (configurable).
When this occurs, the node-lifecycle controller adds the "unreachable" taint to the Node with effect=NoExecute. By default, Kubernetes adds a toleration to every Pod to tolerate this NoExecute taint for 5 minutes. (This defaulting can be turned off by disabling the relevant admission controller in the apiserver, or you can specify a custom toleration in your PodSpec.)
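For instance, a custom toleration in the PodSpec like the sketch below (the 30-second value is just an illustration) replaces the 5-minute default; omitting tolerationSeconds entirely would make the Pod tolerate the taint forever and never get evicted on this path.

```yaml
spec:
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
```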
Once a Node is tainted by the node-lifecycle controller, the kubelet on that Node might not even know about it, since the taint is most likely there because the kubelet is down or network-partitioned. In this case the Pod eviction is not carried out on the node by the kubelet, but from the outside –another controller built into Kubernetes, the "taint eviction controller", evicts the Pods.
The taint manager controller used to be part of the node-lifecycle controller and couldn’t be turned off. Thanks to friends at Apple, as of Kubernetes 1.29, it’s now a separate controller, and can be entirely disabled. This is really useful if you want finer control over evictions due to NoExecute taints.
This eviction path also doesn't use the Eviction API, and directly deletes Pods. In this case, the graceful termination period is respected by the kubelet while terminating the Pod, and you'll see an Event named TaintManagerEviction on the Pod when this happens. If the kubelet is unreachable, the Pod entry in the API will remain stuck in Terminating indefinitely (unless it's force-deleted –at the risk of leaving the Pod running on the disconnected kubelet indefinitely).
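To check whether this has been happening in your cluster recently (by default events are only retained for about an hour), a field selector on the event reason should surface them:

```sh
kubectl get events --all-namespaces --field-selector reason=TaintManagerEviction
```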
Kubelet admission
This is possibly the least-known way a Pod that was running perfectly fine earlier can end up not running anymore: the kubelet can refuse to run a Pod that it was running just moments ago, and immediately terminate it without respecting PDBs.
But to explain this, we need to rewind a bit and start the story from the kube-scheduler: when the kube-scheduler tries to find a suitable Node to assign a Pod to (a one-time, permanent decision), it runs a set of filters –whether the node has enough resources, whether the node matches the specified node selectors or other affinity rules, or whether the requested host ports are already allocated to another Pod, to name a few.
What you probably didn't know is that the kubelet also runs these filters as admission checks while admitting the Pod into the node. There are good reasons the kubelet does this: for example, you can bypass the kube-scheduler and directly assign a Pod that doesn't belong there to a Node (by setting the Pod's spec.nodeName) as if you were the scheduler.
What’s even lesser known here is that kubelet runs these admission checks not once, but also after a kubelet restart. This basically means a Pod that was running earlier can be killed by the kubelet if it fails any kubelet admission checks later on due to a crash or a benign restart.
You can easily try this at home (a sample Pod manifest is given below):

- Pick a Node and label it (e.g. role=worker), and create a Pod with the same nodeSelector.
- Observe that the Pod is scheduled and is in the "Running" state.
- Now remove the Node label you've added, and observe the Pod is still Running.
- Now restart the kubelet (which should transition the Pod into the Error state):

```
kubectl get pods
NAME           READY   STATUS   ...
alpine-sleep   0/1     Error    ...
```
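Here's a minimal Pod manifest for the first step above, matching the alpine-sleep name in the output (the image, command and the role=worker label are my choices for this experiment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alpine-sleep
spec:
  nodeSelector:
    role: worker                  # must match the label you put on the Node
  containers:
  - name: sleep
    image: alpine:3               # any small image works
    command: ["sleep", "3600"]
```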
This is not really an "eviction", since the Pod entry still exists in the API –but the Pod is no longer run by the kubelet (its containers are killed). In this case, you'll see that the Pod has transitioned into status.phase=Failed permanently, and the status message will say something like this (which also used to be more cryptic and didn't tell you what was failing):

```
Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector
```
This is a terminal condition and a permanent failure for this Pod. This Pod will not be runnable again even if the admission checks pass later or the kubelet restarts again –and it will not be cleaned up from the API server until 12,500 such Pods exist (only then will the podgc controller delete them³)⁴. So I highly recommend monitoring the number of such failed Pods in your clusters and alerting on it.
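A quick one-off way to spot these is a field selector on the Pod phase; for actual alerting you'd want a metric such as kube-state-metrics' kube_pod_status_phase (assuming that's part of your monitoring stack):

```sh
# List Pods stuck in the Failed phase across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```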
This is a particularly messy eviction mode for Pods that previously ran fine on the node. Essentially, a point-in-time snapshot of node labels (you'd hope nothing changes your node labels), combined with a kubelet restart (due to cert renewal or a crash), can suddenly terminate your Pods –without respecting PodDisruptionBudgets and without graceful termination.
The kubelet has this problem because, after a restart, it (like most controllers) starts with an empty cache and thinks every Pod it sees is a newly added Pod. The code spells out that the kubelet doesn't store on disk (or rebuild after a restart) the state of Pods it has already launched, so this may be fixed in the future. "Restarting the kubelet does not impact running workloads" is a misconception [1][2] that I've also been regurgitating for many years at this point. There have been other bugs around this admission machinery, such as run-to-completion Pods that had already Succeeded being re-admitted and marked as Failed.
Most workload controllers (e.g. ReplicaSet, StatefulSet, CloneSet) seem to recover from this situation. But if you write a workload controller yourself, handling terminal Pod phases like Failed seems like a good idea.
Pod preemption
In addition to node drainers, the taint manager and the kubelet, a fourth actor that can terminate your Pods is the kube-scheduler. If there are no nodes suitable for scheduling a Pod, the kube-scheduler tries to find nodes where evicting lower-priority Pods would make room for the Pod waiting in the scheduling queue.
In doing so, the kube-scheduler ranks the candidate nodes by how few PDB violations preempting their Pods would cause. If there are nodes where preemption wouldn't violate any PDBs, those are picked first, which is good.
Once a node is selected, lower-priority Pods whose eviction doesn't violate their PDBs are evicted first. If no such Pods exist, the kube-scheduler will violate some PDBs in favor of making room for the higher-priority Pod; in that case, victims are sorted by priority and the lowest-priority Pods are preempted first. The victims' graceful termination period is respected.
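Priorities come from a PriorityClass referenced in the PodSpec. A minimal sketch (the class name, value and image are made up; Pods without a priorityClassName typically get priority 0, which makes them easy preemption victims):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important                          # made-up name
value: 100000                              # higher value wins; lower-priority Pods can be preempted for this one
preemptionPolicy: PreemptLowerPriority     # the default; set to Never so Pods of this class never preempt others
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: important
  containers:
  - name: app
    image: nginx                           # placeholder image
```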
Node deletion
Lastly, when a Node resource is removed from the Kubernetes API, you'll see that all the Pods running on that node are immediately terminated. This is not a documented eviction mode at all as far as I know, since most Kubernetes users probably don't even know who deletes the Node resource in the first place.
If you’re using a cloud provider, the cloud-controller-manager handles Node deletion, and if you’re using Cluster API (CAPI), its machine controller handles this for you. But if you implement your own cluster management tooling (like we do), you’re responsible for handling Node deletion after removing machines from your cluster. This also means you’re now subject to this eviction mode in case you have a bug unintentionally deletes your Nodes.
When a Node resource is removed, the orphaned Pods assigned to that Node are cleaned up by the built-in Kubernetes controller named "pod-garbage-collector". This Pod deletion is forcible (equivalent to kubectl delete pod --force, which specifies gracePeriodSeconds=0) and therefore does not respect graceful termination periods.
Conclusion
Here’s a quick summary of the eviction paths that we discussed above.
| Method | Eviction Actor | PDB respected? | Graceful termination? |
|---|---|---|---|
| Eviction API | node drainers, kubectl drain | yes | yes |
| Direct Pod deletion | workload controllers during rollouts (e.g. ReplicaSet, StatefulSet) or kubectl delete | no | yes |
| Node pressure (soft) | kubelet | no | yes |
| Node pressure (hard) | kubelet | no | no |
| NoExecute taint | kube-controller-manager (taint manager) | no | yes |
| Kubelet admission | kubelet | no | no |
| Priority preemption | kube-scheduler | best effort | yes |
| Node deletion | kube-controller-manager (podgc controller) | no | no |
1. The only thing in the Kubernetes core that actually uses this Eviction API path to evict Pods while respecting PDBs is the kubectl drain command. Additionally, cloud providers and open source projects like cluster-autoscaler, Karpenter and Cluster API use the Eviction API while scaling down or removing nodes.
2. Writing a webhook that intercepts /evict requests for a subset of Pods is actually not straightforward, either: you can normally intercept a subset of Pod deletion requests by filtering them based on Pod labels. However, Eviction admission requests don't carry the labels of the Pod in the request body (just the name and namespace), so you can't filter Eviction requests selectively. You'd end up writing a webhook that potentially intercepts all eviction requests in the entire cluster –and if this webhook goes down, evictions are blocked in the entire cluster. Pretty hairy stuff.
3. I actually don't know why the kubelet doesn't just delete the Pod as if it's actually evicting it in this case. If anyone knows why, I'd be happy to make the edit.
4. I found these kubelet admission checks to be rather arbitrary and not always in line with the kube-scheduler filters. For example, node selectors, affinity rules and the node OS label are enforced predicates; but cordoning a node (a NoSchedule taint) and restarting the kubelet doesn't cause the Pod to fail. Not sure why.