We presented our LinkedIn compute infrastructure team’s journey moving LinkedIn’s fleet of 500,000+ bare metal servers, running thousands of microservices and many stateful workloads, to a Kubernetes-based platform.
In this session, we talk about LinkedIn’s scale, how we automate bare metal server management and maintenance from the ground up, how we built Kubernetes node and cluster management layers for our needs, and how we’re building workload platforms for stateless, stateful, and batch workloads.
Ronak and I were Abdel’s guests on the Kubernetes Podcast by Google ahead of our KubeCon talk in London next month. We talked about our work building the next generation of compute infrastructure at LinkedIn with Kubernetes, the challenges we faced, and our journey dealing with the scale and complexity so far.
Anyone who is running Kubernetes in a large-scale production setting cares about having a predictable Pod lifecycle. Having unknown actors that can terminate your Pods is a scary thought, especially when you’re running stateful workloads or care about availability in general.
There are many ways Kubernetes terminates workloads, each with non-trivial (and not always predictable) machinery, and there’s no single page that lists all the eviction modes in one place. This article digs into Kubernetes internals to walk you through every eviction path that can terminate your Pods, explains why “kubelet restarts don’t impact running workloads” isn’t always true, and leaves you with a cheatsheet at the end.
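As a quick taste of where those terminations leave their traces, here is a rough client-go sketch (the namespace and pod name are made up) that prints the two places worth checking after a Pod gets killed: pod-level conditions and each container’s last terminated state.

```go
// Sketch: inspect why a Pod's containers last terminated, assuming an
// in-cluster client and a hypothetical pod "my-app" in "default".
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "my-app", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Pod-level conditions (e.g. a DisruptionTarget condition on newer clusters)
	// hint at API-initiated disruptions such as evictions or preemption.
	for _, c := range pod.Status.Conditions {
		fmt.Printf("condition %s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	}
	// Each container's last termination state records reasons like OOMKilled.
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil {
			fmt.Printf("container %s last terminated: reason=%s exitCode=%d\n",
				cs.Name, t.Reason, t.ExitCode)
		}
	}
}
```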
Any company using Kubernetes eventually starts looking into developing its own custom controllers. After all, what’s not to like about provisioning resources with declarative configuration? Control loops are fun, and Kubebuilder makes it extremely easy to get started with writing Kubernetes controllers. Next thing you know, customers in production are relying on a buggy controller you developed without understanding how to design idiomatic APIs or build reliable controllers.
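To show just how little code it takes, here is roughly what a Kubebuilder-scaffolded reconciler boils down to; the Widget type and its API package are hypothetical, while the controller-runtime calls are the standard ones:

```go
// Minimal sketch of a controller-runtime reconciler for a hypothetical Widget API.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	widgetsv1 "example.com/widgets/api/v1" // hypothetical API package
)

type WidgetReconciler struct {
	client.Client
}

func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var widget widgetsv1.Widget
	if err := r.Get(ctx, req.NamespacedName, &widget); err != nil {
		// The object may have been deleted in the meantime; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... drive the world toward widget.Spec and update widget.Status ...

	return ctrl.Result{}, nil
}

func (r *WidgetReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&widgetsv1.Widget{}).
		Complete(r)
}
```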
A low barrier to entry combined with good intentions and the “illusion of a working implementation1” is not a recipe for success when developing production-grade controllers. I’ve seen the real-world consequences of controllers developed without an adequate understanding of Kubernetes and the controller machinery at multiple large companies. We went back to the drawing board and rewrote nascent controller implementations a few times, which gave me a chance to observe which mistakes people new to controller development make.
Last week, OpenAI suffered a several-hour outage and published a detailed postmortem about it. I highly recommend reading it. These technical reports are usually a gold mine for all large-scale Kubernetes users, as we all go through a similar set of reliability issues running Kubernetes in production.
This is the analysis of a low-severity incident that took place in the Kubernetes clusters at the company I work for. It taught me a lot about how to think about the off-the-shelf components we bring from the ecosystem into the critical path and operate at a scale much larger than they were intended for.
A quick code search query reveals at least 7,000 Kubernetes Custom Resource Definitions in the open source corpus,1 most of which are likely generated with controller-gen, a tool that turns Go structs with comment-based markers into Kubernetes CRD manifests, which end up as custom APIs served by the Kubernetes API server.
At LinkedIn, we develop our fair share of custom Kubernetes APIs and controllers to run workloads or manage infrastructure. In doing so, we rely heavily on the custom resource machinery and controller-gen to generate our CRDs.
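As a rough illustration of what that looks like (the Widget type and its fields are invented for the example), controller-gen reads comment markers on a plain Go struct and emits the corresponding CRD manifest, typically via something like `controller-gen crd paths=./...`:

```go
// Sketch of a hypothetical custom resource type annotated with the
// comment markers controller-gen consumes to generate a CRD manifest.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// Widget is a made-up custom resource used only for this example.
type Widget struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec WidgetSpec `json:"spec,omitempty"`
}

// WidgetSpec shows field-level validation and defaulting markers.
type WidgetSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=3
	Replicas int32 `json:"replicas"`
}
```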
Files on Kubernetes Secret and ConfigMap volumes work in peculiar and undocumented ways when it comes to watching changes to these files with the inotify(7) API. Your typical file watch that works outside Kubernetes might not work as you expect when you run the same program on Kubernetes.
On a normal filesystem, you start a watch on a file on disk with a library and
expect to get an event like IN_MODIFY
(file modified) or IN_CLOSE_WRITE (a file opened for writing was closed) when the file changes. But these filesystem
events never happen for files on Kubernetes Secret/ConfigMap volumes.
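The reason is that the kubelet publishes updates to these volumes by atomically swapping a `..data` symlink inside the volume directory rather than writing to the files in place, so the inode your watch points at never changes. A minimal fsnotify sketch that works around this, assuming a hypothetical mount path, watches the directory instead of the file:

```go
// Sketch using github.com/fsnotify/fsnotify: watch the volume directory and
// react to the ..data symlink swap instead of watching the mounted file itself.
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

func main() {
	const mountedFile = "/etc/config/app.yaml" // hypothetical ConfigMap mount

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory, not the file: per-file modify events never fire here.
	if err := watcher.Add(filepath.Dir(mountedFile)); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			// The atomic update typically surfaces as a Create of the ..data symlink.
			if event.Op&fsnotify.Create == fsnotify.Create &&
				filepath.Base(event.Name) == "..data" {
				log.Printf("volume updated, re-reading %s", mountedFile)
			}
		case err := <-watcher.Errors:
			log.Println("watch error:", err)
		}
	}
}
```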