I have been craving to write about this since this is what I have been up to lately at work. I spent quite some time investigating the state of instance provisioning on each cloud provider and I thought I could share these here.
If you deploy instances (a.k.a. Virtual Machines or VMs) to a public cloud (e.g. AWS, Azure), you might be wondering what your instance goes through before you can start using it.
This article is going to be about that. I hope you enjoy it. Please let me know at the end how you liked it!
Table of Contents
- What Is Provisioning?
- Key Tools for Provisioning
  - Instance Metadata API
  - Cloud-Init
- Provisioning in Public Cloud
  - AWS EC2
  - DigitalOcean
  - Google Compute Engine
  - Microsoft Azure
What Is Provisioning?
You have an application and you need to run it without purchasing physical servers. You go to a cloud provider and ask for virtual servers to run your service on. You tell the cloud provider: “I want a Debian 8 Linux machine, and I want you to add this SSH public key to the machine so that I can log in and run my app”. You get what you asked for within a matter of seconds, if not minutes.
(Very ordinary, right? I am sure this was mind-blowing a decade ago when AWS EC2 came out. It is certainly still mind-blowing to me.)
All the operations that occur from the moment you request a VM to the moment you can log in to it are called provisioning.
Most of the provisioning magic happens in the cloud provider’s proprietary/internal software that manages the physical machines in its datacenters. A physical node is picked, the VM image you specified is copied to that machine, and the hypervisor boots up your VM. This is provisioning from the infrastructure side and we are not going to talk about it here.
Then the provisioning goes on… Your machine is now up (think of a Debian or Ubuntu Server image). It has no accounts, no SSH keys. It is almost like a vanilla OS image you can download from the internet yourself. You can’t log in.
This is where the user-mode provisioning kicks in. Your machine runs some code that starts specializing this image. Some of the steps could be:
- creating the OS user you wanted
- adding SSH credentials to the machine so that you can log in
- running startup scripts you provided (to install or configure stuff)
- mounting an ephemeral/scratch disk from the physical host
All this is part of provisioning and once this is all done, you have a VM prepared for you to log in and use it!
This user-mode provisioning runs only once. When it is all set, it gets out of your way and lets you run your workloads.
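To make the "runs only once" idea concrete, here is a minimal Python sketch. The marker file name `provisioned` and the helpers `add_ssh_key` and `provision_once` are made up for illustration; real tools such as cloud-init keep track of this state in their own way. The demo writes into a throwaway directory, so it never touches real accounts.

```python
import os
import tempfile


def add_ssh_key(home_dir, public_key):
    """Append a public key to <home_dir>/.ssh/authorized_keys with strict permissions."""
    ssh_dir = os.path.join(home_dir, ".ssh")
    os.makedirs(ssh_dir, mode=0o700, exist_ok=True)
    path = os.path.join(ssh_dir, "authorized_keys")
    with open(path, "a") as f:
        f.write(public_key.strip() + "\n")
    os.chmod(path, 0o600)
    return path


def provision_once(state_dir, steps):
    """Run every step unless a marker file says provisioning already happened.

    Returns True if the steps ran, False if they were skipped.
    """
    marker = os.path.join(state_dir, "provisioned")  # hypothetical marker name
    if os.path.exists(marker):
        return False
    for step in steps:
        step()
    with open(marker, "w") as f:
        f.write("done\n")
    return True


# Demo in a temporary directory standing in for the instance's filesystem.
demo_root = tempfile.mkdtemp()
steps = [lambda: add_ssh_key(demo_root, "ssh-rsa AAAA... me@laptop")]
first_boot = provision_once(demo_root, steps)   # runs the steps
second_boot = provision_once(demo_root, steps)  # skipped: marker already exists
```

On the second call nothing is executed, which is the "gets out of your way" behavior described above.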
Key Tools for Provisioning
How does a vanilla Linux server image know about your configuration such as your credentials and set up the Virtual Machine accordingly? To understand that you should know about some tools that play key roles here:
⚒ Instance Metadata API
This is an HTTP API that runs at http://169.254.169.254/ if you have a VM running on the cloud. You make calls to it and it gives you information about your VM, such as:
- what is the name of the VM
- what is the instance size
- what region is your VM in
- what are the SSH public keys assigned to the VM
- what is the startup script that the VM should execute
It is a trivial and text-based API. You just make the request, you get what you want:
$ curl http://169.254.169.254/latest/meta-data/ami-id
ami-2bb65342
Instance Metadata is provided to the VM by the hypervisor or other underlying infrastructure of the cloud provider. This is how your VM knows about itself and what it should do.
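The same kind of call can be made from code. Below is a small Python sketch of a metadata client for the EC2-style URL layout shown above. The helper names (`metadata_url`, `http_get`, `get_metadata`) are my own, and the sketch assumes the plain unauthenticated request style from the curl example; newer EC2 setups may require a session token first, which is not handled here. The `fetch` parameter exists only so the function can be tried outside of an instance.

```python
import urllib.request

METADATA_BASE = "http://169.254.169.254"


def metadata_url(path, version="latest"):
    """Build an EC2-style metadata URL such as .../latest/meta-data/ami-id."""
    return f"{METADATA_BASE}/{version}/meta-data/{path.lstrip('/')}"


def http_get(url, timeout=2.0):
    """Plain HTTP GET; this only succeeds from inside an instance."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def get_metadata(path, fetch=http_get):
    """Return one metadata value as text. `fetch` is injectable for testing."""
    return fetch(metadata_url(path)).strip()
```

From inside a VM, `get_metadata("ami-id")` would perform the same request as the curl command; anywhere else you can pass a fake `fetch` function that returns canned text.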
In case you are interested, I have a comprehensive post on my blog comparing instance metadata APIs across public cloud providers.
⚒ Cloud-Init
cloud-init is a Linux tool that runs when your instance boots to handle provisioning from within the instance. It sets up your virtual machine by configuring networking and the hostname, placing your SSH credentials and, optionally, running startup scripts you provided.
cloud-init detects which cloud provider you are running on (by applying certain heuristics to the filesystem or the metadata API protocol), then figures out which data source class to use.
Then it calls the data source class to get the data it needs (such as SSH keys) and provisions your instance with that. Basically, the cloud-init package is what makes a server image different from a vanilla distro image.
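The detect-then-fetch flow can be sketched in a few lines of Python. This is not cloud-init's actual code: the class names, the `env` dictionary standing in for the real system, and both detection heuristics are invented purely to show the shape of the idea.

```python
class DataSource:
    """Base class: each provider knows how to detect itself and fetch its data."""
    name = "base"

    def detect(self, env):
        raise NotImplementedError

    def fetch(self, env):
        # In real life this would query a metadata API or read a file.
        return {"ssh_keys": list(env.get("keys", []))}


class FakeEc2(DataSource):
    name = "ec2"

    def detect(self, env):
        # Hypothetical heuristic: a vendor string exposed by the platform.
        return env.get("vendor", "").lower().startswith("amazon")


class FakeAzure(DataSource):
    name = "azure"

    def detect(self, env):
        # Hypothetical heuristic: a provisioning file is present on a mounted device.
        return "ovf-env.xml" in env.get("files", ())


CANDIDATES = [FakeEc2(), FakeAzure()]


def pick_data_source(env, candidates=CANDIDATES):
    """Return the first data source whose heuristic matches, or None."""
    for source in candidates:
        if source.detect(env):
            return source
    return None
```

For example, `pick_data_source({"vendor": "Amazon EC2"})` returns the `FakeEc2` instance, and calling its `fetch` gives the data the provisioning steps would then consume.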
cloud-init is written in Python and was originally developed by Canonical for Ubuntu Server. Over time it became popular, got love from other Linux distro vendors as well as cloud providers, and turned into a widely adopted package baked into many distro images on the cloud.
Provisioning in Public Cloud
In this section I am going to explain how each cloud provider provisions instances in user space. Some of them are similar, while others have differences interesting enough to point out.
☁︎ Amazon Web Services EC2
Obviously AWS started doing all this since it is the first IaaS provider, but the notion of provisioning is older than that, as people used to run virtualization software in their on-premises datacenters or servers.
The way EC2 provisions instances is plain and simple:
- Most images on EC2 (AMIs) have cloud-init baked into the image.
- cloud-init queries the EC2 Instance Metadata API and gathers data to provision the instance.
The whole AWS implementation in cloud-init is only 200 lines of code. I think having cloud-init everywhere gives EC2 a clean and unified way of provisioning Linux instances.
☁︎ DigitalOcean
I love DigitalOcean and use it personally. When it comes to provisioning, they follow the AWS EC2 principle:
- cloud-init package baked into all images
- use the metadata API to get the data about the instance.
Since DigitalOcean has only a few images available (at least today), provisioning is not very exciting here either. The cloud-init implementation for DigitalOcean is also very small, just 110 lines of code.
The only difference I spotted is that DigitalOcean has a way of reordering the steps executed by cloud-init (listed in cloud.cfg) via a cloud-init feature called vendor-data. This data also comes from the metadata API and, as far as I can tell, nobody except DigitalOcean uses this feature in the cloud-init codebase. They use it for keeping the root user enabled, managing /etc/hosts via
cloud-init, etc. (In case you want to dig deep, look at the vendor-data DigitalOcean presents and how it differs from the defaults.)
☁ Google Compute Engine
Google does not use cloud-init (BOOM!). I do not know why, but what they came up with instead is remarkably cool:
Google wrote their own instance guest agents in Python. These are installed on all stock images on GCE and open source on GitHub. And I said agents, meaning not a single monolithic agent but a bunch of small services. You can find a list of these in your GCE instance:
# systemctl list-units | grep ^google
google-accounts-daemon.service        Google Compute Engine Accounts Daemon
google-clock-skew-daemon.service      Google Compute Engine Clock Skew Daemon
google-ip-forwarding-daemon.service   Google Compute Engine IP Forwarding Daemon
google-shutdown-scripts.service       Google Compute Engine Shutdown Scripts
Self-documenting enough… The one I really want to talk about is google-accounts-daemon, the one that creates the user accounts and places the SSH keys you provided when creating your VM.
Those who are GCE customers will know these two fantastic features: if they ever lose their credentials, they can drop new SSH keys to the VM from the gcloud CLI or the web console at any time and restore access.
Or another killer feature: GCE has an “SSH” button on the web interface and the gcloud compute ssh <vm> command to give you on-the-fly SSH access to the instance by creating short-lived SSH keys and dropping them onto the instance in 10 seconds.
What is this magic? How is this possible and so fast? Well, first you need to know this: the Google Metadata API supports long polling. This means you start a long-standing HTTP request to the metadata API and, if the key you are watching changes, the server returns a response with the new values. Then google-accounts-daemon parses the instance/attributes/ssh-keys value in the response (which contains the new SSH keys), creates Linux users and adds SSH keys accordingly. This is how the magic works.
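Here is a rough Python sketch of that long-polling loop. It is not Google's agent code: the function names are mine, and I am assuming the `wait_for_change` and `last_etag` query parameters, the `Metadata-Flavor: Google` header and the `user:key` line format of the ssh-keys value as I understand them from the GCE metadata server documentation. A real daemon would also handle timeouts, retries and actually creating the users.

```python
import urllib.parse
import urllib.request

GCE_METADATA = "http://metadata.google.internal/computeMetadata/v1/"


def parse_ssh_keys(value):
    """Turn 'user:ssh-rsa AAAA... comment' lines into {user: [key, ...]}."""
    users = {}
    for line in value.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        user, key = line.split(":", 1)
        users.setdefault(user, []).append(key.strip())
    return users


def wait_for_change(path, last_etag="0", timeout=120):
    """One long-poll request: returns (value, etag) once the value differs from last_etag."""
    query = urllib.parse.urlencode({"wait_for_change": "true", "last_etag": last_etag})
    request = urllib.request.Request(
        GCE_METADATA + path + "?" + query,
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8"), resp.headers.get("ETag", "0")


def watch_ssh_keys(on_change, poll=wait_for_change, max_events=None):
    """Call on_change(users) every time the ssh-keys attribute changes."""
    etag, seen = "0", 0
    while max_events is None or seen < max_events:
        value, etag = poll("instance/attributes/ssh-keys", etag)
        on_change(parse_ssh_keys(value))
        seen += 1
```

The `poll` and `max_events` parameters are there so the loop can be exercised with canned responses outside of GCE; on a real instance you would call `watch_ssh_keys(handler)` and let it run forever.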
If you ever lose your SSH key on EC2 or DigitalOcean, you are doomed. But GCE has this (and Azure has something similar). So this is pretty cool.
I said Google does not use cloud-init, but if you bring your own custom VM image (with the cloud-init package in it), it will provision just fine as there is a GCE implementation in cloud-init. It is short (160 lines of code) and just queries the GCE Metadata API to get all the data it needs to provision. If you go down this route, you won’t be getting all these cool features.
☁︎ Microsoft Azure
(Before I begin, a quick disclaimer to save my butt: I work on the Microsoft Azure Linux team and this is precisely the area that I work on. These are my personal opinions and it goes without saying that I tried to write this section as objectively as I can.)
Azure started as a Windows PaaS provider in 2010 and stayed as such until 2013, when IaaS was made generally available. When IaaS was launched, Azure did not have many Linux images, but it was picking up. (read: Microsoft is now doing Linux, can ya believe it?!! and I was there!!1) However, as most of the infrastructure and the APIs were designed for Windows, provisioning on Azure Linux instances is a bit unconventional and non-trivial.
As of this writing, Azure does not have an instance metadata service. It has an undocumented HTTP API which is internally called “Wire Server”. One of the Red Hat engineers kindly documented it here. It is XML-based (compared to other metadata servers being JSON/text based) and a bit cryptic at first.
However, Azure does not use this Wire Server for provisioning. So where do we get the provisioning data from? When a new virtual machine boots for the first time, Hyper-V (the hypervisor of Azure datacenters) attaches a DVD-ROM device to the instance. Then the provisioning code mounts the device and reads a file called ovf-env.xml from it. This file contains username, SSH key and/or password data (yes, Azure allows creating Linux VMs with passwords) and is used to provision the instance.
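To give a feel for what reading such a file involves, here is a small Python sketch. Be warned that the sample XML is a simplified stand-in I wrote for illustration: the real ovf-env.xml uses XML namespaces and has more fields, so treat the element names here as assumptions rather than the actual schema. The function name `parse_ovf_env` is also my own.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; not the real ovf-env.xml schema.
SAMPLE_OVF_ENV = """\
<Environment>
  <ProvisioningSection>
    <LinuxProvisioningConfigurationSet>
      <HostName>myvm</HostName>
      <UserName>azureuser</UserName>
      <UserPassword>s3cret</UserPassword>
      <SSH>
        <PublicKeys>
          <PublicKey>
            <Path>/home/azureuser/.ssh/authorized_keys</Path>
          </PublicKey>
        </PublicKeys>
      </SSH>
    </LinuxProvisioningConfigurationSet>
  </ProvisioningSection>
</Environment>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix so lookups work with or without namespaces."""
    return tag.rsplit("}", 1)[-1]


def parse_ovf_env(xml_text):
    """Collect hostname, username, password and SSH key paths from the XML."""
    wanted = {"HostName": "hostname", "UserName": "username", "UserPassword": "password"}
    result = {"hostname": None, "username": None, "password": None, "key_paths": []}
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        text = (element.text or "").strip()
        if name in wanted and text:
            result[wanted[name]] = text
        elif name == "Path" and text:
            result["key_paths"].append(text)
    return result


config = parse_ovf_env(SAMPLE_OVF_ENV)
```

After this runs, `config` holds the username, password and key path that a provisioning tool would use to create the account.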
So who provisions the instance then? Well, it depends. First of all, Azure has its own guest agent running on all Linux instances, called waagent. It is written in Python and open source on GitHub.
On most Linux images on Azure, waagent is the provisioning tool. However, for images like Debian and Ubuntu Server, cloud-init does the provisioning and waagent is still there. This is mostly because the Azure guest agent does a lot more than provisioning. It has quite a few tasks; some important ones are:
- formatting the ephemeral disk to ext4 (Hyper-V gives it as NTFS)
- processing virtual machine extensions
- enabling RDMA (remote direct memory access) for HPC (high performance computing) workloads
- retrieving and placing instance certificates and private keys by converting them from PKCS#12 format to PEM (PKCS#12 works well with Windows, however it is not conventional in POSIX environments.)
In cases where cloud-init and waagent are both present, they coordinate and do not step on each other (although these parts are a bit hacky). Finally, the instance has to send a “provisioned” signal to the Wire Server; otherwise Azure thinks that provisioning is still going on, and your VM will not be listed as started on the API or the management portal.
Some of the extra features of waagent, such as VM extensions, offer flexibility to users: they let users install application bundles called extensions on their VMs via the CLI or REST APIs. Extensions can do things like restore access, run arbitrary scripts or install anti-malware software on the machine without even having to SSH in.
As I said before, provisioning in Azure is a bit non-trivial. You can see this in the cloud-init Azure data source implementation, which is about 650 lines long (plus another 280 lines of utility methods). This is mostly because Azure has no text-based metadata service, has an unconventional way of delivering pieces of information, and needs to signal that provisioning has completed via a complicated protocol.
Now you know how each cloud provider brings up your virtual machine. cloud-init is huge and plays a key role in instance provisioning across the cloud providers. Instance Metadata APIs are another key to provisioning.
Most people have no reason to care about these or even know about them, as long as the cloud providers are doing their job right. But I hope you now have more visibility into what your instances on the cloud go through before you get to use them.
If you have read this far, please let me know in the comments what you think!