We presented our LinkedIn compute infrastructure team’s journey moving LinkedIn’s fleet of 500,000+ bare metal servers, running thousands of microservices and many stateful workloads, to a Kubernetes-based platform.
In this session, we talk about LinkedIn’s scale, how we automate bare metal server management and maintenance from the ground up, how we built Kubernetes node and cluster management layers for our needs, and how we’re building workload platforms for stateless, stateful, and batch workloads.
Ronak and I were Abdel’s guests on the Kubernetes Podcast by Google ahead of our KubeCon talk in London next month. We talked about our work building the next generation of compute infrastructure at LinkedIn with Kubernetes, the challenges we faced, and our journey dealing with the scale and complexity so far.
Anyone who is running Kubernetes in a large-scale production setting cares about having a predictable Pod lifecycle. Having unknown actors that can terminate your Pods is a scary thought, especially when you’re running stateful workloads or care about availability in general.
There are many ways Kubernetes terminates workloads, each with non-trivial (and not always predictable) machinery, and there’s no single page that lists all the eviction modes in one place. This article digs into Kubernetes internals to walk you through every eviction path that can terminate your Pods, explains why “kubelet restarts don’t impact running workloads” isn’t always true, and leaves you with a cheatsheet at the end.
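As a quick taste of where those terminations leave their traces, here is a rough client-go sketch (the namespace and pod name are made up) that prints the two places worth checking after a Pod gets killed: pod-level conditions and each container’s last terminated state.

```go
// Sketch: inspect why a Pod's containers last terminated, assuming an
// in-cluster client and a hypothetical pod "my-app" in "default".
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "my-app", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Pod-level conditions (e.g. a DisruptionTarget condition on newer clusters)
	// hint at API-initiated disruptions such as evictions or preemption.
	for _, c := range pod.Status.Conditions {
		fmt.Printf("condition %s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	}
	// Each container's last termination state records reasons like OOMKilled.
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil {
			fmt.Printf("container %s last terminated: reason=%s exitCode=%d\n",
				cs.Name, t.Reason, t.ExitCode)
		}
	}
}
```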
Any company using Kubernetes eventually starts looking into developing its own custom controllers. After all, what’s not to like about provisioning resources with declarative configuration? Control loops are fun, and Kubebuilder makes it extremely easy to get started with writing Kubernetes controllers. Next thing you know, customers in production are relying on a buggy controller you developed without understanding how to design idiomatic APIs or build reliable controllers.
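To show just how little code it takes, here is roughly what a Kubebuilder-scaffolded reconciler boils down to; the Widget type and its API package are hypothetical, while the controller-runtime calls are the standard ones:

```go
// Minimal sketch of a controller-runtime reconciler for a hypothetical Widget API.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	widgetsv1 "example.com/widgets/api/v1" // hypothetical API package
)

type WidgetReconciler struct {
	client.Client
}

func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var widget widgetsv1.Widget
	if err := r.Get(ctx, req.NamespacedName, &widget); err != nil {
		// The object may have been deleted in the meantime; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... drive the world toward widget.Spec and update widget.Status ...

	return ctrl.Result{}, nil
}

func (r *WidgetReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&widgetsv1.Widget{}).
		Complete(r)
}
```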
A low barrier to entry combined with good intentions and the “illusion of a working implementation1” is not a recipe for success when developing production-grade controllers. I’ve seen the real-world consequences of controllers developed without an adequate understanding of Kubernetes and the controller machinery at multiple large companies. We went back to the drawing board and rewrote nascent controller implementations a few times, which gave me a chance to observe which mistakes people new to controller development make.
Last week, OpenAI suffered a several-hour outage and published a detailed postmortem about it. I highly recommend reading it. These technical reports are usually a gold mine for all large-scale Kubernetes users, as we all go through a similar set of reliability issues running Kubernetes in production.
This is the analysis of a low-severity incident that took place in the Kubernetes clusters at the company I work for. It taught me a lot about how to think about the off-the-shelf components we bring from the ecosystem into the critical path and operate at a scale much larger than they were intended for.
A quick code search query reveals at least 7,000 Kubernetes Custom Resource Definitions in the open source corpus,1 most of which are likely generated with controller-gen, a tool that turns Go structs with comment-based markers into Kubernetes CRD manifests, which end up as custom APIs served by the Kubernetes API server.
At LinkedIn, we develop our fair share of custom Kubernetes APIs and controllers to run workloads or manage infrastructure. In doing so, we rely heavily on the custom resource machinery and controller-gen to generate our CRDs.
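As a rough illustration of what that looks like (the Widget type and its fields are invented for the example), controller-gen reads comment markers on a plain Go struct and emits the corresponding CRD manifest, typically via something like `controller-gen crd paths=./...`:

```go
// Sketch of a hypothetical custom resource type annotated with the
// comment markers controller-gen consumes to generate a CRD manifest.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// Widget is a made-up custom resource used only for this example.
type Widget struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec WidgetSpec `json:"spec,omitempty"`
}

// WidgetSpec shows field-level validation and defaulting markers.
type WidgetSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=3
	Replicas int32 `json:"replicas"`
}
```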
Files on Kubernetes Secret and ConfigMap volumes work in peculiar and undocumented ways when it comes to watching changes to these files with the inotify(7) API. Your typical file watch that works outside Kubernetes might not work as you expect when you run the same program on Kubernetes.
On a normal filesystem, you start a watch on a file on disk with a library and
expect to get an event like IN_MODIFY
(file modified) or IN_CLOSE_WRITE (a file opened for writing was closed) when the file changes. But these filesystem
events never happen for files on Kubernetes Secret/ConfigMap volumes.
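The reason is that the kubelet publishes updates to these volumes by atomically swapping a `..data` symlink inside the volume directory rather than writing to the files in place, so the inode your watch points at never changes. A minimal fsnotify sketch that works around this, assuming a hypothetical mount path, watches the directory instead of the file:

```go
// Sketch using github.com/fsnotify/fsnotify: watch the volume directory and
// react to the ..data symlink swap instead of watching the mounted file itself.
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

func main() {
	const mountedFile = "/etc/config/app.yaml" // hypothetical ConfigMap mount

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory, not the file: per-file modify events never fire here.
	if err := watcher.Add(filepath.Dir(mountedFile)); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			// The atomic update typically surfaces as a Create of the ..data symlink.
			if event.Op&fsnotify.Create == fsnotify.Create &&
				filepath.Base(event.Name) == "..data" {
				log.Printf("volume updated, re-reading %s", mountedFile)
			}
		case err := <-watcher.Errors:
			log.Println("watch error:", err)
		}
	}
}
```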