Any company using Kubernetes eventually starts looking into developing its own custom controllers. After all, what’s not to like about provisioning resources with declarative configuration? Control loops are fun, and Kubebuilder makes it extremely easy to get started with writing Kubernetes controllers. Next thing you know, customers in production are relying on a buggy controller you developed without understanding how to design idiomatic APIs or how to build reliable controllers.
A low barrier to entry combined with good intentions and the “illusion of a working implementation1” is not a recipe for success when developing production-grade controllers. I’ve seen the real-world consequences of controllers developed without an adequate understanding of Kubernetes and the controller machinery at multiple large companies. We went back to the drawing board and rewrote nascent controller implementations a few times, observing which mistakes people new to controller development tend to make.
Design CRDs like Kubernetes APIs
It takes less than 5 minutes to write a Go struct and generate a Kubernetes CustomResourceDefinition (CRD) from it thanks to controller-gen. Then it takes several months to migrate from this poorly designed API to a better v2 design while the old API is being used in production. Don’t do that to yourself.
If you’re serious about developing long-lasting, production-grade controllers, you have to deeply understand the API Conventions that Kubernetes uses to design its builtin APIs. Then, you need to study the builtin APIs and think about things like “why is this field here”, “why is this field not a boolean”, “why is this a list of objects and not a string array”. Only when you’re able to reason about the builtin Kubernetes APIs and their design principles will you be able to design a long-lasting custom resource API.
Beginners who don’t grasp these API conventions often make these mistakes:

- They don’t understand the difference between `status` and `spec`, and who should be updating each field (more about this later).
- They don’t understand how to embed a child object within a parent object (e.g. how `Deployment.spec.template` becomes a `Pod`), so they end up re-creating the child object’s properties in the parent object, usually with a worse-organized structure.
- They don’t understand field semantics well (e.g. zero values, defaulting, validation) and end up with fields that are not set, or wrong values accepted into the API. I covered this topic in my CRD generation pitfalls article. If the behavior of the API is not clear when a field is not set, you’ve already failed. The API conventions guide covers this topic fairly well.
If you study the builtin Kubernetes APIs extensively, you’ll find out things like the `spec` field not being a “must have”2, and that not all APIs offer a `status` field.

I would go as far as to say that you should also study the custom APIs of projects like Knative, Istio and other popular controllers to develop a better understanding of organizing fields, and of how to reuse some core types Kubernetes already offers (like `ControllerRevision`, `PodTemplateSpec`).
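To make these conventions concrete, here is a minimal sketch of what a custom resource type following them could look like. The `Foo` API below is hypothetical; it shows the `spec`/`status` split, reuses `corev1.PodTemplateSpec` instead of re-declaring pod fields, and reports `metav1.Condition`-style conditions:

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Foo is a hypothetical API sketch (kubebuilder markers and deepcopy functions elided).
type Foo struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   FooSpec   `json:"spec,omitempty"`   // desired state, written by users
	Status FooStatus `json:"status,omitempty"` // observed state, written by the controller
}

type FooSpec struct {
	// Reuse the core pod template type instead of re-creating pod fields
	// in a worse-organized structure.
	Template corev1.PodTemplateSpec `json:"template"`
}

type FooStatus struct {
	// ObservedGeneration is the last metadata.generation the controller acted on.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`

	// Conditions follow the standard metav1.Condition conventions.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```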
Single-responsibility controllers
Time and time again we find engineers adding new, unrelated responsibilities to existing controllers because they seem like a convenient place to shove their thing into. Kubernetes core controllers don’t have this problem for a reason.
One of the main Kubernetes design principles is that controllers have clear inputs and outputs, and they do a well-defined job. For example, the Job controller watches `Job` objects and creates `Pod`s, which is a clear mental model to reason about. Similarly, each API is designed to offer a well-defined piece of functionality. A controller’s output can be an input to another controller. This is what the UNIX philosophy suggests for a well-reasoned system.
I recommend studying the common controller shapes (a great talk by Daniel Smith, one of the architects of kube-apiserver) and the core Kubernetes controllers. You’ll notice that each core controller has a very clear job and inputs/outputs that can be explained in a small diagram. If your controller isn’t like this, you’re probably misarchitecting either your controller or your CRDs.
If you architect your APIs and controllers correctly, your controllers will run in harmony as if they’re integrating with Kubernetes core APIs or an off-the-shelf operator.
When your controller design doesn’t quite feel right, has too many inputs/outputs, or does too much, you’re probably doing it unidiomatically. I struggled with this a lot myself, especially while developing controllers that manage external resources with a non-declarative configuration paradigm.
Reconcile() method shape
Assuming you use kubebuilder (which uses controller-runtime, like almost everyone else) to develop your controller, you implement the `Reconcile()` method that controller-runtime invokes every time one of your inputs changes. This is where your controller does its magic, and since it’s possible to implement this method in any way, most beginners dump their spaghetti here.
Therefore, large projects like Knative define their own common controller shape where every controller runs the same set of steps in the same order. By developing a common controller shape/framework, you create a “guardrail” so that other engineers can’t easily deviate and introduce bugs in the reconciliation flow.
Sadly, controller-runtime is not opinionated about this topic. Your best bet is to read other controllers (like Cluster API) to learn the idioms and master the reconciliation flow.
There are also newer projects like Apollo SDK by Reddit, which claims to offer finite-state machines for controller-runtime reconcilers.
Over time, we found that almost all our controllers have a similar reconciliation flow. Here’s pseudo-code of what our controllers look like:
```go
func (r *FooReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	foo := new(apiv1.Foo)
	// 1. Fetch the resource from the cache: r.Client.Get(ctx, req.NamespacedName, foo)
	// 2. Finalize the object if it's being deleted
	// 3. Add finalizers + r.Client.Update() if missing

	orig := foo.DeepCopy()     // take a copy of the object since we'll update it below
	foo.InitializeConditions() // set all conditions to "Unknown", if missing

	// 4. Reconcile the resource and its child resources (the magic happens here).
	//    Calculate the conditions and other "status" fields along the way.
	reconcileErr := r.reconcileKind(ctx, foo)

	// 5. Patch the status if it changed, even if reconcileKind failed.
	if !equality.Semantic.DeepEqual(orig.Status, foo.Status) {
		if err := r.Client.Status().Patch(ctx, foo, client.MergeFrom(orig)); err != nil {
			log.Error(err, "failed to patch status")
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, reconcileErr
}
```
Most notably, you’ll see here that we always initialize conditions, and we always patch the status computed inside `reconcileKind`, even if the reconciliation fails.
I recommend enforcing a similar common shape for controllers developed at your company (you can use custom kubebuilder plugins during scaffolding, though that doesn’t really enforce it either).
Report status and conditions
I’ve practically never seen a beginner engineer create a CRD that has properly designed `status` fields (if one exists at all). The Kubernetes API conventions discuss this at length, so I’ll keep it brief: if an API object is reconciled by a controller, the resource should expose its state in `status` fields. For example, there’s no ConfigMap controller, so ConfigMap doesn’t have a `status` field.
At LinkedIn, our custom API objects have a `status.conditions` field, similar to the Kubernetes core or Knative conditions, and we use something similar to Knative’s condition set manager, which provides high-level accessor methods to set the conditions, sort them, etc.
This helps us define and report conditions for API objects in a high-level way in the reconciler code:
```go
func (r *FooReconciler) reconcileKind(ctx context.Context, obj *apiv1.Foo) error {
	// Create/configure a Bar object, wait for it to become Ready.
	if err := r.reconcileBar(ctx, obj); err != nil {
		obj.MarkBarNotReady("couldn't configure the Bar resource: %v", err)
		return fmt.Errorf("failed to reconcile Bar: %w", err)
	}
	obj.MarkBarReady()
	return nil
}
```
Every time we mark a condition, the condition manager recalculates the top-level `Ready` condition, which all our objects have, as the Kubernetes API conventions suggest. Other controllers and humans consume this top-level condition to understand how the objects are doing (plus you get to use `kubectl cond` on your objects).
Learn to use observedGeneration
Something notable is that all our conditions have an `observedGeneration` field. You’ll see that even some popular community CRDs (like the ArgoCD Application) don’t offer this field.
Essentially, this field tells us whether the condition was calculated based on the latest configuration of the object, or whether we’re looking at stale status information because the controller hasn’t gotten around to reconciling the object since the update.
For example, observing a `Ready` condition set to `True` alone means nothing (other than that at some point in the past it was true). The condition offers meaningful status information if and only if `cond.observedGeneration == metadata.generation`.
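For illustration, here’s a minimal sketch of how a caller could check whether the `Ready` condition is fresh before trusting it, assuming the conditions follow the standard `metav1.Condition` conventions (the `Foo` type and the helper name are hypothetical):

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isReadyAndFresh returns true only if the Ready condition is True AND it was
// computed from the object's latest spec, i.e. the controller has observed
// the current metadata.generation.
func isReadyAndFresh(obj *Foo) bool {
	cond := meta.FindStatusCondition(obj.Status.Conditions, "Ready")
	if cond == nil {
		return false
	}
	if cond.ObservedGeneration != obj.GetGeneration() {
		// The condition is stale: the controller hasn't reconciled the latest update yet.
		return false
	}
	return cond.Status == metav1.ConditionTrue
}
```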
Real-world story: A controller we had in production didn’t have the notion of `observedGeneration`, so its callers would update the object’s `spec` and immediately check its `Ready` condition. This condition would almost always be stale, as the controller hadn’t reconciled the object yet. So the callers interpreted an app rollout as completed, even though it hadn’t even started yet (and sometimes actually failed, but that failure was never noticed).
Understand the cached clients
controller-runtime, by default, gives you a client to the Kubernetes API that serves reads from an in-memory cache (as it uses shared informers from client-go under the covers). This is mostly fine, as controllers are designed to operate on stale data, but it is detrimental if you don’t know this is the case, since you might write buggy controllers as a result (more on this later in the “expectations” section).
When you perform a write (which directly hits the API server), its results may not be immediately visible in the cached client. For example, when you delete an object, it may still show up in the list result in a subsequent reconciliation to your surprise.
A lesser-known behavior of controller-runtime that most beginners don’t realize is that it establishes new informers on the fly. Normally, when you specify explicit event sources while building your controller (e.g. in `builder.{For,Owns,Watches}`), the informers and their caches are started during startup.
However, if you try to make queries with `client.{Get,List}` on resources that you haven’t declared upfront in your controller setup, controller-runtime will initialize an informer on the fly and block on warming up its cache. This leads to issues like:
- Controller-runtime starts a watch for the resource type and caches all its objects in memory (even if you were trying to query only one resource), potentially leading to the process running out of memory.
- Unpredictable reconciliation times while the informer cache is syncing, during which your worker goroutine will be blocked from reconciling other resources.
That’s why I recommend setting `ReaderFailOnMissingInformer: true` to disable this behavior, so you’re fully aware of which kinds your controller maintains watches/caches on. Otherwise, controller-runtime doesn’t provide any observability into which informers it’s maintaining in the process.
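Here’s a sketch of how this could be wired up when constructing the manager; it assumes a recent controller-runtime version where `cache.Options` exposes the `ReaderFailOnMissingInformer` field:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Fail reads for types we haven't explicitly set up informers for,
			// instead of silently starting a new informer and caching every
			// object of that type.
			ReaderFailOnMissingInformer: true,
		},
	})
}
```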
controller-runtime offers a lot of other cache knobs, such as entirely disabling the cache for certain types, dropping some fields from the in-memory cache, or limiting the cache to certain namespaces. I recommend studying them to better understand how you can customize the cache behavior.
Fast and offline reconciliation
Reconciling an object that is already up-to-date (i.e. goal state == current state) should be really fast and offline, meaning it should not make any API calls (to external APIs, or writes to the Kubernetes API). That’s why controllers use cached clients to serve reads from the cache to determine the state of the world.
I’ve seen many real-world controllers make API calls to external systems, or make status updates to the Kubernetes API (even when nothing has changed), every time `Reconcile()` got invoked. This is an anti-pattern, and a really bad idea for writing scalable and reliable controllers:
- They bombarded the external APIs with unnecessary calls during controller startup (or during full resyncs, or when they had bugs causing infinite requeue loops).
- When the external API was down, reconciliation would fail even though nothing had changed in the object. Depending on the implementation, this can block the next steps in the reconciliation flow even though those steps don’t depend on this external API call.
- Logic that takes a long time to execute in a reconciliation loop hogs the worker goroutine, causes workqueue depth to increase, and reduces the throughput/responsiveness of the controller while the worker goroutine is occupied with the slow task.
Let’s go through a concrete example: assume you have an S3Bucket controller that creates and manages S3 buckets using the AWS S3 API. If you make a query to the S3 API on every reconciliation, you’re doing it wrong. Instead, you should store the result of the S3 API calls you made, in a field like `status.observedGeneration`, to reflect the last generation of the object that was successfully conveyed to the S3 API. If this field has a zero value, the controller knows it needs to make a “Create Bucket” call to the S3 API. When a client updates the S3Bucket custom resource, its `metadata.generation` will no longer match the stored `status.observedGeneration`, so the controller knows it needs to make an “Update Bucket” call to the S3 API, and only upon success will it update the `status.observedGeneration` field. This way, you avoid making calls to the external S3 API when the object is already up-to-date.3
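Here’s a minimal sketch of that generation check under the assumptions above (the `S3Bucket` type and the `r.s3` client with `CreateBucket`/`UpdateBucket` methods are hypothetical):

```go
func (r *S3BucketReconciler) reconcileKind(ctx context.Context, bucket *apiv1.S3Bucket) error {
	// Already synced this generation: fast, offline, no external API call.
	if bucket.Status.ObservedGeneration == bucket.Generation {
		return nil
	}

	if bucket.Status.ObservedGeneration == 0 {
		// Never conveyed to S3 before: create the bucket.
		if err := r.s3.CreateBucket(ctx, bucket.Spec); err != nil {
			return fmt.Errorf("creating bucket: %w", err)
		}
	} else {
		// Spec changed since the last successful sync: update the bucket.
		if err := r.s3.UpdateBucket(ctx, bucket.Spec); err != nil {
			return fmt.Errorf("updating bucket: %w", err)
		}
	}

	// Record the generation we just conveyed; persisted by the status patch
	// at the end of Reconcile().
	bucket.Status.ObservedGeneration = bucket.Generation
	return nil
}
```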
Reconcile return values
Your `Reconcile()` function signature returns `ctrl.Result` + `error` values. Usually beginners don’t have a solid grasp on what values to return from `Reconcile()`.
You should know that your `Reconcile()` function is invoked every time one of the event sources declared in `builder.{For,Owns,Watches}` changes. With that in mind, here’s my general advice on return values:
- If you have errors during reconciliation, return the error, not `Requeue: true`. Controller-runtime will requeue for you.
- Use `Requeue: true` only when there’s no error but something you started is still in progress, and you want to check its status with the default backoff logic.
- Use `RequeueAfter: <TIME>` only when you want to reconcile the object after a certain time has passed. This is useful for implementing wall-clock based periodic reconciliation (e.g. a CronJob controller, or retrying reconciliation at a custom poll interval).
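To make these patterns concrete, here’s a minimal sketch (the `foo.IsReady()` check and the 5-minute interval are hypothetical examples):

```go
func (r *FooReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	foo := new(apiv1.Foo)
	if err := r.Client.Get(ctx, req.NamespacedName, foo); err != nil {
		// Deleted objects are not an error; anything else is requeued with backoff.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if err := r.reconcileKind(ctx, foo); err != nil {
		// Error: return it and let controller-runtime requeue with backoff.
		return ctrl.Result{}, err
	}

	if !foo.IsReady() {
		// No error, but something we started is still in progress: requeue
		// with the default backoff to check on it again.
		return ctrl.Result{Requeue: true}, nil
	}

	// All good; re-check on a wall-clock interval (custom poll interval).
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```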
It’s a matter of preference whether your `Reconcile()` function should make as much progress as possible in a single run, or return early every time you change something and requeue itself again. You’ll find that the former approach is more unit-test friendly and what you’ll see more frequently in open-source controllers, because if your event triggers are set up correctly, the object will get requeued anyway.
Workqueue/resync mechanics
OpenKruise has an article about workqueue mechanics; go read that. I frequently see beginners not relying on the workqueue’s guarantees, such as the fact that the same object will never be reconciled at the same time by different workers, so they end up implementing unnecessary locking mechanisms in their controllers.
Similarly, beginners frequently don’t understand when and how many times an object gets reconciled. For example, when your controller updates the object it’s working on, the object will immediately be requeued for another reconciliation (because the update you made triggers a watch event).
Even when no objects were updated, all watched resources will be requeued periodically to get reconciled again (called a “resync”, configured via the `SyncPeriod` option). This is the default behavior since controllers may miss watch events (very rare), or skip processing some events during a leadership change. But this behavior causes you to do a full reconciliation of all cached objects.4 So by default, your controller should assume it’ll reconcile the entire world periodically.
Real-world story: We had a controller that managed several thousand objects and did a full resync every 20 minutes. Every object took several seconds to reconcile. So any time a client created or updated an object, it would not get reconciled until many minutes later, as it went to the back of the workqueue behind thousands of other items. If this happened during a full resync or controller startup, it took many minutes until any work was done on the object.
controller-runtime v0.20 introduced a priority queue implementation for the workqueue. It deprioritizes reconciliation of objects that were not edge-triggered (i.e. not triggered by a create/update event) and makes the controller more responsive during full resyncs and controller startup.
That’s why understanding the workqueue semantics and worker count (`MaxConcurrentReconciles`), and monitoring your controller’s reconciliation latency, workqueue depth and active worker count, is super important to know whether your controller scales or not.
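For reference, here’s a sketch of bumping the worker count when wiring up a controller with the builder (the count of 4 is an arbitrary example):

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

func (r *FooReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&apiv1.Foo{}).
		// Run up to 4 reconciliations in parallel so a single slow object
		// doesn't stall the whole workqueue.
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
		Complete(r)
}
```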
Expectations pattern
We discussed above that the controller-runtime client serves reads from an informer cache and doesn’t query the API server, except during startup/resyncs.
This cache is kept up-to-date based on the received “watch” events from the API server. Therefore, your controller will almost certainly read stale data at some point, since the watch events arrive asynchronously after the writes you make. Cached clients don’t offer read-your-writes consistency.
This means you need to program your `Reconcile()` method with this assumption at all times. This is not at all intuitive, but it’s a reality when you work with a cached client. I’ll give several real-world examples:
Example 1: You’re implementing the ReplicaSet controller. The controller sees a ReplicaSet with `replicas: 5`, so it lists the Pods with `client.List` (which is served from the cache), and gets 3 Pods. It turns out the informer cache wasn’t up-to-date, and the API actually had 5 Pods. Your controller creates 2 more Pods, and now you have 7 Pods. Definitely not what you wanted.
Example 2: Now you’re scaling down a ReplicaSet from 5 to 3. You list the Pods, you see 5 Pods, you delete 2 Pods; the next time you list the Pods, you still see 5, so you delete another 2 Pods. If your deletion logic is not deterministic (e.g. sorting Pods by name), you’ve scaled from 5 to 1, which is definitely not what you wanted.
Example 3: For every object `kind=A`, you create an object `kind=B`. When `A` gets updated, you update `B`. The update succeeds, but the next time you reconcile `A`, you don’t see the updated version of `B`, so you update `B` to the goal state again, and you get a `Conflict` error because you’re updating an old version of the object. But you already updated it, so why update again?
If you don’t know how to solve these problems in your controller, it’s likely because you haven’t seen the “expectations” pattern before.
In this case, controllers need to do in-memory bookkeeping of the expectations that resulted from the successful writes they made. Once an expectation is recorded, the controller knows it needs to wait for the cache to catch up (which will trigger another reconciliation), and should not act based on the stale result it sees from the cache.
You can see many core controllers use this pattern, and Elastic operator also has a great explanation alongside their implementation. We implemented a couple of variants of these at LinkedIn ourselves.
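Here’s a heavily simplified sketch of the idea, assuming a controller that creates child objects for a parent (the real implementations, such as the ones in kube-controller-manager or the Elastic operator, also track UIDs and timeouts):

```go
import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// expectations tracks how many child creations we've issued per parent that
// we haven't yet observed back through the informer cache.
type expectations struct {
	mu      sync.Mutex
	pending map[types.NamespacedName]int
}

// ExpectCreations records that we just successfully created n children.
func (e *expectations) ExpectCreations(parent types.NamespacedName, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[parent] += n
}

// ObserveCreation is called from the child watch handler when a creation
// shows up in the cache.
func (e *expectations) ObserveCreation(parent types.NamespacedName) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[parent] > 0 {
		e.pending[parent]--
	}
}

// Satisfied reports whether the cache has caught up with all our writes;
// Reconcile() should not count or create children based on the cache until then.
func (e *expectations) Satisfied(parent types.NamespacedName) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[parent] == 0
}
```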
Conclusion
When you have controller development questions, join the Kubernetes Slack and ask in the #controller-runtime channel; the maintainers are very helpful! If you’re looking for a good controller implementation to study, I recommend the Cluster API codebase. Also, the Operator SDK has a best practices guide you should check out.
I’m not the most experienced person to write a detailed guide on this, but I’ll be writing more about beginner pitfalls and controller development anti-patterns.
At LinkedIn, we use a controller development exercise a former colleague came up with to onboard new engineers and help them understand the controller machinery. The exercise touches many aspects of controller development and gets people familiar with core Kubernetes APIs:
Exercise: Implement a SequentialJob API and controller. The API should allow users to specify a series of run-to-completion (batch job) container images to run sequentially.
Follow up questions:
- How do users specify the list of containers? (Do you use the core types?)
- Do you report status? How is status calculated? How do you surface job failures?
- Where do you validate user inputs? Where do you report reconciliation failures?
- What happens if the SequentialJob changes while the jobs are running?
- How are the child resources you created cleaned up?
I hope this article helps you be a better controller developer. If you feel like this sort of work resonates with you, we’re usually hiring nowadays [1] [2] so reach out to me for a referral!
Thanks to Mike Helmick for reading drafts of this article and giving feedback.
-
In distributed systems, “success” is not an interesting case. A controller seeming to work okay is not an indicator of much. The hard work is designing for scale and time, understanding failure modes, proving correctness in the edge cases. ↩︎
-
See APIs like ConfigMap, Secret, ValidatingWebhookConfiguration. ↩︎
-
This example assumes the controller doesn’t need to periodically sync with the external API to correct potentially drifted configuration (e.g. updates made out of band). Whether controllers should do this or not entirely depends on your business case. Here is a Bluesky thread [1] [2] about this topic if it interests you. ↩︎
-
Note that full periodic resyncs can overload API servers prior to Kubernetes v1.27 (when your controller watches high-cardinality resources like Pods), as List calls are quite expensive in kube-apiserver (although recent versions addressed this, with more improvements on the way). Uber mentioned in one of their talks that they don’t do a full List from kube-apiserver, as the reason some objects may miss reconciliation is that events get dropped during a leadership change, not that the watch stream is missing events. So, instead of using `SyncPeriod`, which does a full List call, they requeue all objects that are already in the cache. ↩︎