It’s been over a year since my last blog post, and since then I’ve been working on Kubernetes almost exclusively for $employer. During that time, I’ve noticed a growing need for something that many people in the DevOps/SRE/Sysadmin world take for granted. I wanted to come out of my blog post hiatus to discuss that need, and try to create a call to arms around this particular problem.

The Problem

Configuration Management

Back in the “old days” - way before I even got started in this game - a bunch of sysadmins decided that manually configuring their hosts was silly. Sysadmins at the time would configure their fleet of machines, whether large or small, using either a set of custom, hand-crafted processes and runbooks, or, if they were very talented, a set of perl/bash or PHP scripts that would take weeks to figure out for anyone new to the org.

People got pretty tired of this, and a new suite of tools was born - configuration management. From CFEngine, to Puppet, to Chef, to Ansible, these tools came along and revolutionized our industry.

The driving force behind this tooling was the need to know that machines looked the way you wanted them to, and to be able to describe that state with code. The Infrastructure as Code movement started with configuration management, and has evolved even further since then, bringing us tools like Terraform, Habitat and Vagrant.

What about Kubernetes?

So how does configuration management fit in with Kubernetes? The first thing you might be thinking is that deploying things to Kubernetes already solves the two problems above. Kubernetes is a declarative system: you simply define a deployment with a yaml config, and you have the “code” right there (although yaml is a poor imitation of code!). Kubernetes then takes care of the rest - it will manage your process, ensure it converges and looks the way it should, and continue from there.
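
For example, a minimal deployment manifest might look something like this (a sketch - the image, labels and API version are illustrative and vary with cluster version):

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Apply it with kubectl apply -f deployment.yaml and Kubernetes converges the cluster towards that state.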

So what’s the problem?

The Kubernetes Abstraction Layer

When you think about what Kubernetes does for an organization, it’s worth considering the abstraction layer it provides to the engineers who use it.

Before Kubernetes, the management layer generally existed at the operating system level. You operated on individual machines when managing applications, and linking those applications together involved putting services on top of other operating systems. You probably automated this process with configuration management, but ultimately operators worked at the operating system layer.

With Kubernetes hitting the marketplace, that abstraction layer has changed. To deploy an application, operators no longer need to install software on the OS. Kubernetes takes care of all that for you.

Scaling applications at the machine layer was managed and automated by configuration management: you had many machines, and you needed to make sure they all looked the same. With Kubernetes coming along, how do you do the same thing there?

Kubernetes Components

So now we get to the meat of the problem. At $employer, we run multiple Kubernetes clusters across various “tiers” of infrastructure. Our Development environment currently has 3 clusters, for example.

Generally, we manage the installation of the components of a Kubernetes cluster (the kubelet, API server, controller manager, kube-proxy, scheduler and so on) with Puppet. All of these components live at the operating system layer, so using a configuration management tool for the installation makes perfect sense.

But there are components of a Kubernetes cluster that need to be the same across all the clusters that are being managed. Some examples:

  • Ingress Controllers. All clusters generally need an ingress controller of some kind, and you generally want it to be the same across all your clusters so there are no surprises.
  • Monitoring. Installing Prometheus usually makes sense, and monitoring at the hardware layer is taken care of by DaemonSets. However, how do you make sure the Prometheus config works the same across all clusters? Prometheus installations are designed to be small and modular, so you want to repeat this process over and over again.
  • Kubernetes Dashboard. Generally there will be some users who want to see a visual overview, and the dashboard helps with that.

These are just some examples, but there are more. So, how do you do this?

Helm

We adopted Helm charts at $employer for a few reasons:

  • The helm charts for bigger projects are generally well maintained, and have sane defaults out of the box.
  • Helm allows you to do reasonable upgrades of things using helm releases.
  • Helm’s templating system allows you to manipulate configuration values for cluster level naming etc.

The problem is that syncing helm releases across clusters isn’t really very easy. Therein lies the question: how do I sync resources across Kubernetes clusters? Specifically, how do I sync helm releases across clusters, keeping the same configuration but with slight changes in Helm values?
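
To make that concrete, the workflow I want is roughly the following (a hedged sketch - the cluster context names, chart and values files are made up for illustration):

# install/upgrade the same release on every cluster, with per-cluster overrides
for cluster in dev-1 dev-2 dev-3; do
  helm upgrade --install nginx-ingress stable/nginx-ingress \
    --kube-context "${cluster}" \
    -f values/common.yaml \
    -f "values/${cluster}.yaml"
done

The hard part isn’t running this loop once; it’s doing it declaratively, repeatably and safely across many clusters.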

I believe this problem is unsolved. Here are a few of the things I’ve tried.

The (unsatisfactory) solutions

Puppet

Because my mind initially goes towards Puppet as configuration management, my first attempted solution used Puppet alongside Helm. I found the Puppetlabs Helm Module and got to work. I even refactored it and sent some pull requests to improve its functionality.

Unfortunately, I ran across a few issues, and these mainly related to Puppet’s lack of awareness of concepts higher than the OS, Puppet’s less than ideal management of staged upgrades, as well as some missing features in the module:

  • Declaring helm charts as Puppet defined types means they get initialized on all masters. This can cause issues and race conditions. The way Puppet determines whether it should install a helm chart is by running helm ls on the machine it’s operating on and looking for the release. This isn’t ideal, because if a master gets out of sync with the cluster, you run into problems.
  • The module itself has different defined types for helm chart installations and upgrades. You can’t just change the chart version, or supply different values via --set or values.yaml. This requires custom workflows for helm upgrades and makes it really difficult to manage across multiple installations.
  • Sometimes when updating a helm release, you might want to perform certain actions before updating the release, like switching a load balancer config. You can’t do this with Puppet: it’s a manual process and requires some knowledge of multi-stage deployments. Puppet isn’t great at this, so there has to be something else involved.

This is when my thoughts around this issue originally began to form. After coming to the conclusion that Puppet wouldn’t quite cut it, I began to look at alternatives.

Ansible

My next thought was to use Ansible. Ansible has a native helm module and is (by their own admission) designed for multi-stage deployments from the get-go. Great!

Unfortunately, the reality was quite different.

  • The module has at least two different issues open stating that the module is broken and not ready: here and here. There is even a comment stating that a user switched to the k8s_raw ansible module.
  • Setting up and installing pyhelm, the module’s dependency, caused us a lot of issues on OS X. We had to run it in a docker container.
  • The ansible module relies on you configuring access to Tiller (helm’s orchestrator inside the k8s cluster) manually, which is not easy to do.

Once we realised that the ansible helm module isn’t really ready for prime time, we reluctantly let Ansible go as an option. I’m hoping this option improves in future, as I personally think it’s the best approach.

Terraform

Terraform, by Hashicorp, was also considered as a potential option. It epitomizes the ideals of Infrastructure as Code by being both idempotent and declarative. It interacts directly with the APIs of major cloud providers, and has a relatively short ramp-up/learning curve. Hashicorp also provides a well-written provider specifically for Kubernetes, as well as a community-written Helm provider, and I was keen to give it a try. Unfortunately, we encountered issues that echoed the problems faced with Puppet:

  • Terraform simply defines an end state and makes it so, similar to Puppet but at a different level. Multi-stage deployments (i.e. do X, then do Y) aren’t really supported in Terraform, which causes us problems when switching config.
  • We have several clusters in AWS, but some outside AWS. Having to store state for non-AWS resources in an S3 bucket or the like was annoying. Obviously it’s possible to store state elsewhere, but it’s not a simple process.
  • By default, Terraform destroys and recreates resources you change. You can modify this behaviour using lifecycle, but its mentality is very different from the Kubernetes/Helm concepts.

Ksonnet

I was very excited by Ksonnet at one point. Ksonnet is a Heptio project that is designed to streamline deployments to Kubernetes. It has a bunch of really useful concepts, such as environments, which allow you to tailor components to a unique cluster. This would have been perfect to run from a CI pipeline. Unfortunately, the issues were:

  • It doesn’t support helm (yet). This means that for anything you install, you need to take all of the configuration that a helm chart would have figured out for you and write it again - a lot of overhead.
  • It has a steep learning curve. Jsonnet (which ksonnet uses under the hood) is a language in itself, which takes some learning and isn’t well documented.
  • Ksonnet’s multi-cluster support still hasn’t really been figured out, so you need to manage credentials manually and switch between them across clusters.

Other Contenders

There were a few other concepts which looked ideal, but we didn’t try them, as they were missing key components.

Kubed

kubed can sync configmaps across namespaces and even clusters. If only it could do the same with other resources, it would have been perfect!

Helmfile

helmfile allows you to create a list of helm charts to install, but at the time we looked at it, it didn’t really support templating for cluster differences, so it wasn’t considered. It now seems to support that, so maybe it’s time to reconsider it.
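
For reference, a helmfile describing the ingress controller example might look something like this (a rough, untested sketch - the templating functions available have changed between helmfile versions):

# helmfile.yaml - hypothetical example
releases:
  - name: nginx-ingress
    namespace: kube-system
    chart: stable/nginx-ingress
    values:
      - values/common.yaml
      - values/{{ requiredEnv "CLUSTER_NAME" }}.yaml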

Ansible k8s_raw

We briefly considered rewriting all the helm charts into standard ansible k8s_raw tasks, and putting the templating into the ansible variables. The amount of work to do that isn’t trivial, but we may have to go down that route.

Wrap up

So, as you can see here, this problem isn’t solved. In summary, the requirements seem to be a tool that is:

  • Aware of multiple Kubernetes Clusters
  • Supports templating
  • Can perform operations in stages
  • Understands helm charts and releases
  • Is declarative

Unfortunately, I don’t believe this tool exists right now, so the search continues!


Day 3 of Kubecon! Before I begin, I have to make it clear that this was another day of frustration for me. As with yesterday, all of the talks I really wanted to see were completely overflowing, despite me making efforts to get to the talks well in advance.

The organisers did what they could to alleviate this, such as moving the deep dive track into the larger main conference room, but ultimately people paid vast sums of money to attend this conference, and I firmly believe the event was oversubscribed to the detriment of the people attending. The building can hold X number of people, and it was obvious that this number of tickets was sold, but the smaller breakout rooms were very popular and ultimately it was impossible to take in the fantastic content because of the overcrowding.

Keynotes

Huawei

I found the short Huawei talk very interesting. Firstly, the scale they’re operating at is inspirational, and the way they’ve solved those problems and the benefits they’ve reaped from implementing a cloud native approach is very impressive.

I was also intrigued by their in-house/custom networking solution, iCAN. It’s always interesting to see how large enterprises innovate, given the resources at their disposal.

Scaling Kubernetes Users

Joe Beda of Heptio talked a little bit about the user experience of Kubernetes. Starting with the mantra of “Kubernetes sucks, like all software” is an interesting opening gambit from a co-founder of Kubernetes, but he backed this up with detailed examples of how the user experience can improve. One of my favourite thoughts of his was the idea that operators build software for people like us, and this definitely ties in with my experience of Kubernetes being intimidating to the average user.

Federation

Finally, Kelsey Hightower talked about Federation, and how he sees it being used in the Kubernetes landscape. As usual, Kelsey was a quote machine. Some of the more memorable ones:

After Kelsey was done cracking the audience up, there was a demo of ingress federation and the expected results. This was quite interesting to me, because one of the things I’ve been hoping to do is use federation to move workloads, but Kelsey’s argument is that you absolutely should not do that. It’s meant to be used to get an overview of the cluster as a whole, not magically move workloads around. As always, a very useful talk from Kelsey.

Using the k8s go client

Aaron Schlesinger did a fantastic demo of writing a custom third party resource (in his case, a backup resource) using the golang API client. It was very informative, even for someone like myself who has written quite a few things that use the golang client. I highly recommend checking out the github repo to get an idea of what was covered.

Prometheus Storage

I half attended a storage talk by Björn Rabenstein detailing some of the ways of dealing with storage in Prometheus. This seemed interesting, but again it was difficult because of how full the room was. Essentially, what I took away from it was that storage in Prometheus is not plug and play, and it’s worth reading the docs to ensure you’re doing it right.

k8sniff

The guys at kubermatic detailed k8sniff, an interesting layer 3 TLS load balancer that can terminate TLS down to the pod level. For those offering kubernetes as a service to customers, this seems invaluable for separating traffic for different customers; however, this isn’t a problem I’ve personally seen.

Consul & Ingress at Concur

Finally, I saw a fantastic talk about leveraging Consul & Ingress controllers to route traffic to pods. There was a discussion of the existing method, which used a custom load balancer endpoint, and its pitfalls, and then of porting that to a consul/ingress model. I thought this was quite interesting, but I also felt like you’d be losing some of the advantages of kubernetes by using consul to hit your pod endpoint IPs directly, such as service IPs, as well as having to integrate your pod network to make it routable from external infrastructure. Nevertheless, it was very interesting to hear how a large company like Concur manages to solve these complex problems.


Day 2 of KubeCon was absolutely jam packed! There were lots of tracks, so I won’t be able to cover everything that happened, but hopefully I can recap some of the stuff I found interesting.

One thing to note is that the technical deep dive rooms were dramatically oversubscribed, to the point where I walked out of some of them halfway through because the environment was unbearable. Maybe someone else will be able to recap those.

kubernetes 1.6

The kubernetes 1.6 announcement was done during the keynote, and we had a fantastic demo by Aparna Sinha of some of these features.

Pod Affinity/Anti Affinity

Pod affinity/anti-affinity was shown off, which allows you to control which nodes in your cluster workloads are scheduled on. Anti-affinity seemed particularly interesting to me, because it means you can say “if a pod has label X, don’t schedule anything with the same label on that node”, which is very powerful for environments where you need to keep failure domains small. Affinity is the obvious counterpart, for when you need to schedule workloads on certain nodes and ensure they stay there. As our master of ceremonies Kelsey Hightower would say - super dope!
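
As a rough sketch, the anti-affinity rule described above looks something like this in a 1.6 pod spec (the label is illustrative):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname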

Dynamic Storage Provisioning

Another thing that caught my eye was dynamic storage provisioning becoming standard in 1.6. This has been around a while: it basically goes off and creates volumes for you (EBS, GCE persistent disks and so on) and binds them to a persistent volume claim for your pod to consume. Aparna did a demo of this and it really showed the power of this model when creating dynamic workloads. I’ve been doing this in the beta/alpha tree for a while, and I hope to have a blog post up soon showing this off for FlexVolumes and how you can extend it with out-of-tree provisioners!
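
The moving parts are roughly these (a sketch using the in-tree AWS EBS provisioner; names and sizes are illustrative):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data
spec:
  storageClassName: gp2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Creating the claim triggers the provisioner to create the EBS volume and bind it, ready for a pod to consume.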

Roadmap

There are some really great things on the 1.7 roadmap. Of particular interest is the service-catalog work, allowing you to provision resources that might not be in kubernetes. More about that later.

Monzo & Linkerd

I attended a fantastically interesting talk co-presented by Monzo and Buoyant about how they worked together. Monzo is incredibly interesting in that they built a bank (yes, you read that right, a fucking bank) powered by Kubernetes.

The story started with Monzo explaining how their previous architecture relied on infrastructure services like RabbitMQ to ensure high reliability, but this was failing them. They were very candid about a recent outage they had, which meant payment processing stopped. This is an incredible nightmare for a bank: if you can’t buy stuff because you can’t use your card, it can cause a real drop in trust. However, along came Linkerd, operating as a fabric in their k8s cluster, which vastly increased their reliability.

Oliver Gould then took us through some of the features of linkerd, and the sell was compelling. I’d highly recommend reviewing the slides, and I’ll be taking a look at linkerd soon.

Service Catalog

There were several talks about service catalog maturing as an API, and this is a relatively new concept to me in kubernetes, but it’s incredibly interesting. After a great chat with the folks at the Deis booth, it made more sense.

Essentially, the service catalog will allow developers/applications to consume resources that may or may not live outside the API. The best example of this that I got during the day is as follows.

  • You may or may not run your database outside your cluster.
  • Your developers want to be able to provision and consume that resource, by creating a new database for their app and/or creating credentials to log in.
  • Traditionally, this would be a manual process.
  • However, with a service catalog resource, you can make an API call, and it will provision the database and then you can bind to the resource, which will provide you with API generated credentials.
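
A rough sketch of what that might look like as Kubernetes resources (illustrative only - the service catalog API was still settling at the time, and these resource names reflect the shape it later took):

apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceInstance
metadata:
  name: orders-db
spec:
  clusterServiceClassExternalName: mysql
  clusterServicePlanExternalName: small
---
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceBinding
metadata:
  name: orders-db-creds
spec:
  instanceRef:
    name: orders-db
  secretName: orders-db-creds

The binding drops the generated credentials into a secret that your app can mount or read.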

Needless to say, the potential here is mind boggling. With full kubernetes audit logging, and API driven provisioning of external components to the cluster, it becomes really obvious how you can create a programmable datacenter even outside of AWS/Google clouds.

The most up to date example of this is Steward by Deis. This is still in alpha state, but it’s definitely something I think needs to be considered when moving towards “Cloud native” workloads.

Everything Else

Some other fun tidbits I picked up today:

James Munnelly of Jetstack has created a very awesome keepalived cloud provider, which operates as an out-of-tree load balancer implementation, allowing those of us unlucky enough to be running bare-metal kubernetes clusters to use the LoadBalancer service type.

Quay - the container registry from CoreOS - now supports serving helm charts from its registry, which is obviously super useful for those people using quay.

The kubernetes job market is looking pretty healthy, based on the job board.

Having a grafana dashboard for the conference wifi network is pretty awesome.


I was lucky enough to be able to attend CloudNativeCon/Kubecon in Berlin, Germany. This is my recap of the first half day of lightning talks, panels and project updates.

Note - this is not an exhaustive recap. The stuff here is mainly what caught my eye during the first evening. Things are definitely missing.

Fluentd update

First up, we had an update from Eduardo Silva of Treasure Data about the fantastic Fluentd project. The main highlight I can remember is that Fluentd now supports Windows logging. Eduardo made a joke about “Windows being a serious system” and got a great laugh. There was a brief discussion of how fluentd solves the logging pipeline problem, and it’s definitely something I’ll be investigating as we try to better ship logs in our Kubernetes implementation.

OpenTracing update

Next, we had Priyanka Sharma on stage to talk about OpenTracing. This is something I was aware of, but hadn’t fully investigated, and Priyanka made it much clearer by demoing a custom “donut order” application she’d written with OpenTracing built in. She did a demo of the app (as a side note, she also asked everyone in the conference room to open a website, which I thought was crazy given the traditional state of conference wifi, but it worked!) and then drilled down into performance problems using OpenTracing. It was a fantastic visualization of how OpenTracing can solve problems, and I got a lot of value out of it!

Linkerd update

Next up was Oliver Gould with an update about Linkerd, and the big news here is that Linkerd now has TCP support! This is a massive step forward, and judging by the reaction to my tweet, the community agrees!

Additionally, something that caught my eye is that linkerd now supports kubernetes ingress in its config. I found a great blog post about this on the Buoyant blog, and this has me really excited.

CoreDNS update

The next thing I really paid attention to was the CoreDNS update by Miek Gieben. This stuck in my mind because I’ve had some trouble and concerns around the kube-dns/SkyDNS implementation currently used in Kube. For some reason, a Go shim coupled with dnsmasq and a sidecar container just doesn’t feel like the right way to do something that’s a critical component of the kubernetes stack.

With that in mind, Miek provided an update on CoreDNS, what it can do and why it’s “better” than kube-dns. This slide stood out to me, and after reading some of the available middleware plugins I’m looking to replace kube-dns with CoreDNS as soon as I can!

Panel

The panel was a real disappointment for me. It consisted of three senior/executive-level employees of Ticketmaster, Amadeus and Haufe-Lexware discussing their transition to Kubernetes. I would have loved to have my bosses, and their bosses, watch this panel, because there was some good conversation around how to transition organisations to new ways of thinking, but it was nothing I hadn’t already heard. I would have loved to have had maybe 5-10 operators on stage talking war stories of Kubernetes in production, with a little less of a “media training” feel to it. I understand there’s a wide range of attendees at this conference though, and I’m sure more people got value out of it than I did.

As a side note, I was very disappointed at the lack of diversity in the panel. I’m sure we could have found someone from a more diverse background/gender to be involved in the discussions.

Helm with AppController

There was a nice short talk about the capabilities of the Mirantis AppController, which seems like a really interesting project. Essentially, it brings orchestration to your kubernetes cluster and allows staged deployments (i.e. don’t deploy the web app until your database is initialized). I can see this getting heavy usage once it’s in beta/ready to test, but as it stands it seems fairly new.

BGP Routing in Kubernetes

This talk confused me, because it basically described what Calico already does, unless I’m missing something? Would love someone to fill in the gaps here.

Fluentd Logging Pipelines

The last talk I really tuned in on was one that really stuck with me. Jakob Karalus talked about how he’d built a flexible logging service inside Kubernetes using annotations and fluentd. This tweet should give you an idea of how it works. It’s a really interesting solution to a problem that is definitely on my mind at the moment - with Docker you can simply set the logging driver per container, but this is currently not possible in kubernetes (see this issue for more details). Jakob’s solution provides an interesting mechanism for developers to decide where they want their logs to go, and I think flexibility is always key. Check out his github repo to see how it works.

Wrap Up

I’m hoping to write one of these for each day, but the amount of content may make that difficult! Let me know if you think I missed anything important!


Kubernetes has a reputation for being great for stateless application deployment. If you don’t require any kind of local storage inside your containers, the barrier to entry for you to deploy on Kubernetes is probably very, very low. However, it’s a fact of life that some applications require some kind of local storage.

Kubernetes supports this using Volumes, and out of the box there is support for more than enough volume types for the average kubernetes user. For example, if your kubernetes cluster is deployed to AWS, you’re probably going to make use of the awsElasticBlockStore volume type, and think very little of it.

There are situations however, where you might be deploying your cluster to a different platform, like physical datacenters or perhaps another “cloud” provider like DigitalOcean. In these situations, you might think you’re a little bit screwed, and up until recently you kind of were. The only way to get a new storage provider supported in Kubernetes was to write one, and then run the gauntlet of getting a merge request accepted into the main kubernetes repo.

However, a new volume type has opened up the door to custom volume providers, and they are exceptionally simple to write and use. FlexVolumes are a relatively new addition to the kubernetes volume list, and they allow you to run an arbitrary script or volume provisioner on the kubernetes host to create a volume.

Before we dive too deep into FlexVolumes, it’s worth refreshing exactly how volumes work on Kubernetes and how they are mapped into the container.

Volumes Crash Course

If you’ve been using Volumes in Kubernetes in a cloud provider, you might not be fully aware of exactly how they work. If you are aware, I suggest you skip ahead. For those that aren’t, let’s have a quick overview of how EBS volumes work in Kubernetes.

Create an EBS Volume.

The first thing you have to do is create an EBS volume. If you’re using the AWS CLI, this is as easy as:

aws ec2 create-volume --availability-zone eu-west-1c --size 10 --volume-type gp2

This will return something like:

{
    "AvailabilityZone": "eu-west-1c",
    "Encrypted": false,
    "VolumeType": "gp2",
    "VolumeId": "vol-xxxxxxxxxxxxxxxxx",
    "State": "creating",
    "Iops": 100,
    "SnapshotId": "",
    "CreateTime": "2017-03-12T14:49:36.377Z",
    "Size": 10
}

Your EBS volume is now ready to go.

Once you have the volume, you’ll probably want to attach it to a Kubernetes pod! In order to do this, you’ll need to take the volume ID and use it in your kubernetes manifest. The awsElasticBlockStore docs have an example, like so:

apiVersion: v1
kind: Pod
metadata:
  name: test-ebs
spec:
  containers:
  - image: gcr.io/google_containers/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-ebs
      name: test-volume
  volumes:
  - name: test-volume
    awsElasticBlockStore:
      volumeID: vol-xxxxxxxxxxxxxxxxx
      fsType: ext4

Now, if you look in the pod, you’ll see a mount at /test-ebs, but how has it got there? The answer is actually surprisingly simple.

If you examine the ebs volume that was created, you’ll see it’s been attached to an instance!

aws ec2 describe-volumes --volume-ids vol-xxxxxxxxxxxxxxxxx
{
    "Volumes": [
        {
            "AvailabilityZone": "eu-west-1c",
            "Attachments": [
                {
                    "AttachTime": "2017-03-12T14:53:55.000Z",
                    "InstanceId": "i-xxxxxxxxxxxxxxxxx", << --- attached to an instance
                    "VolumeId": "vol-xxxxxxxxxxxxxxxxx",
                    "State": "attached",
                    "DeleteOnTermination": false,
                    "Device": "/dev/xvdba"
                }
            ],
            "Encrypted": false,
            "VolumeType": "gp2",
            "VolumeId": vol-xxxxxxxxxxxxxxxxx",
            "State": "in-use",
            "Iops": 100,
            "SnapshotId": "",
            "CreateTime": "2017-03-12T14:49:36.377Z",
            "Size": 10
        }
    ]
}

So let’s log into this host, and find the device:

findmnt /dev/xvdba
TARGET                                                                                               SOURCE     FSTYPE OPTIONS
/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-xxxxxxxxxxxxxxxxx                          /dev/xvdba ext4   rw,relatime,data=ordered
/var/lib/kubelet/pods/b6c57370-0733-11e7-8421-06533dc554b3/volumes/kubernetes.io~aws-ebs/test-volume /dev/xvdba ext4   rw,relatime,data=ordered

As you can see here, it’s mounted on the host under the /var/lib/kubelet directory. This gives us a clue as to how this happened, but to confirm, you can examine the kubelet logs and you’ll see things like this:

Mar 12 14:54:11 ip-172-20-57-70 kubelet[1199]: I0312 14:54:11.716670    1199 operation_executor.go:832] MountVolume.WaitForAttach succeeded for volume "kubernetes.io/aws-ebs/vol-xxxxxxxxxxxxxxxxx" (spec.Name: "test-volume") pod "b6c57370-0733-11e7-8421-06533dc554b3" (UID: "b6c57370-0733-11e7-8421-06533dc554b3").
...
Mar 12 14:54:15 ip-172-20-57-70 kubelet[1199]: I0312 14:54:15.738019    1199 mount_linux.go:369] Disk successfully formatted (mkfs): ext4 - /dev/xvdba /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-xxxxxxxxxxxxxxxxx

The main point here is that when we provide a pod with a volume mount, it’s the kubelet that takes care of the process. All it does is mount the external volume (in this case the EBS volume) onto a directory on the host (under the /var/lib/kubelet dir), and from there it can map that volume into the container. There isn’t any fancy magic on the container side; it’s essentially just a normal docker volume to the container.

FlexVolumes examined

Okay, so now we know how volumes work in Kubernetes, we can start to examine how FlexVolumes work.

FlexVolumes are essentially very simple scripts executed by the kubelet on the host. The script should implement 5 functions:

  • init - initialize the volume driver. This could just be an empty function if needed.
  • attach - attach the volume to the host. In many cases this might be empty, but in some cases, like EBS, you might have to make an API call to attach it to the host.
  • mount - mount the volume on the host. This is the important part: it’s what makes the volume available to the host, which mounts it under /var/lib/kubelet.
  • unmount - hopefully self explanatory - unmount the volume.
  • detach - again, hopefully self explanatory - detach the volume from the external host.

Each of these functions is passed some parameters as script arguments (such as $1, $2, $3). The last argument is interesting, because it’s actually a JSON string with options from the driver (more on this later). These parameters specify options that are important to the function, and as we examine a real world example they should become clearer.
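
Conceptually, the kubelet invokes the driver like this (a sketch - the argument layout matches the LVM example below, but exact details vary by kubernetes version):

# <driver> <operation> [args...] - the last argument is a JSON string of options
driver init
driver attach '{"volumeID":"vol1","size":"1000m","volumegroup":"kube_vg"}'
driver mount <mount dir> <device> '{"volumeID":"vol1","kubernetes.io/fsType":"ext4"}'
driver unmount <mount dir>
driver detach <device>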

LVM Example

The kubernetes repo has a helpful LVM example in the form of a bash script, which makes it nice and readable and easy to understand. Let’s look at some of the functions.

Init

The init function is very simple, as LVM doesn’t require any initialization:

if [ "$op" = "init" ]; then
	log "{\"status\": \"Success\"}"
	exit 0
fi

Notice how we’re returning JSON here, which isn’t much fun in bash!

Attach

The attach function for the LVM example simply determines if the device exists. Because we don’t have to do any API calls to a cloud provider, this makes it quite simple:

attach() {
	JSON_PARAMS=$1
	SIZE=$(echo $1 | jq -r '.size')

	DMDEV=$(getdevice)
	if [ ! -b "${DMDEV}" ]; then
		err "{\"status\": \"Failure\", \"message\": \"Volume ${VOLUMEID} does not exist\"}"
		exit 1
	fi
	log "{\"status\": \"Success\", \"device\":\"${DMDEV}\"}"
	exit 0
}

As you saw earlier, the LVM device needs to exist before we can mount it (in the EBS example earlier, we had to create the device) and so during the attach phase, we ensure the device is available.

Mount

The final stage is the mount section.

domountdevice() {
	MNTPATH=$1
	DMDEV=$2
	FSTYPE=$(echo $3|jq -r '.["kubernetes.io/fsType"]')

	if [ ! -b "${DMDEV}" ]; then
		err "{\"status\": \"Failure\", \"message\": \"${DMDEV} does not exist\"}"
		exit 1
	fi

	if [ $(ismounted) -eq 1 ] ; then
		log "{\"status\": \"Success\"}"
		exit 0
	fi

	VOLFSTYPE=`blkid -o udev ${DMDEV} 2>/dev/null|grep "ID_FS_TYPE"|cut -d"=" -f2`
	if [ "${VOLFSTYPE}" == "" ]; then
		mkfs -t ${FSTYPE} ${DMDEV} >/dev/null 2>&1
		if [ $? -ne 0 ]; then
			err "{ \"status\": \"Failure\", \"message\": \"Failed to create fs ${FSTYPE} on device ${DMDEV}\"}"
			exit 1
		fi
	fi

	mkdir -p ${MNTPATH} &> /dev/null

	mount ${DMDEV} ${MNTPATH} &> /dev/null
	if [ $? -ne 0 ]; then
		err "{ \"status\": \"Failure\", \"message\": \"Failed to mount device ${DMDEV} at ${MNTPATH}\"}"
		exit 1
	fi
	log "{\"status\": \"Success\"}"
	exit 0
}

This is a little bit more involved, but still relatively simple. Essentially, what happens here is:

  • The passed device is formatted with the filesystem provided in the parameters, if it doesn’t already have one
  • A directory is created to mount the volume into
  • The device is then mounted at the mount path provided by the kubelet

Parameters

You may be wondering, where do these parameters I keep talking about come from? The answer is from the pod manifest sent to the kubelet. Here’s an example that uses the above LVM FlexVolume:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: test
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: test
    flexVolume:
      driver: "leebriggs.co.uk/lvm"
      fsType: "ext4"
      options:
        volumeID: "vol1"
        size: "1000m"
        volumegroup: "kube_vg"

The key section here is the “options” section. The volume ID, size and volume group are all passed to the driver in $3 as a JSON string, which is why there’s a bunch of jq munging happening in the above scripts.
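
To illustrate (a sketch - the exact keys the kubelet adds have changed between versions), the options string the driver receives for the manifest above looks roughly like this, and is typically unpacked with jq:

# roughly what arrives in $3:
# {"volumeID":"vol1","size":"1000m","volumegroup":"kube_vg","kubernetes.io/fsType":"ext4"}
OPTS=$3
VOLUMEID=$(echo "${OPTS}" | jq -r '.volumeID')
FSTYPE=$(echo "${OPTS}" | jq -r '.["kubernetes.io/fsType"]')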

Using FlexVolumes

Now you understand how FlexVolumes work, you need to make the kubelet aware of them. Currently, the only way to do this is to install them on the host under a specific directory.

FlexVolumes need a “namespace” (for want of a better word) and a name. So for example, my personally built lvm FlexVolume might be leebriggs.co.uk/lvm. When we install our script, it needs to be installed like so on the host that runs the kubelet:

mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/leebriggs.co.uk~lvm
mv lvm /usr/libexec/kubernetes/kubelet-plugins/volume/exec/leebriggs.co.uk~lvm/lvm

Once you’ve done this, restart the kubelet, and you should be able to use your FlexVolume as you need.

Manifest

The manifest above gives you an example of how to use FlexVolumes. It’s worth noting that not all FlexVolumes will be in the same format though. Make sure the driver name matches the directory under the exec folder (in our case, leebriggs.co.uk~lvm) and that you pass the required options along.

Wrapping up

This was a relative crash course in FlexVolumes for Kubernetes. There are a couple of problems with it:

  • The example is written in bash, which isn’t great at manipulating JSON
  • It uses LVM, which isn’t exactly multi host compatible

The first point is easily solved by writing a driver in a language with JSON parsing built in. There are a few FlexVolume drivers popping up in Go - I wrote one for ploop using a library which was written to ease the process, but there are others.

All of this deals with mapping single, static volumes into containers, but there is more. Currently, you have to manually provision the volumes you use before spinning up a pod, and as you start to create more and more volumes, you may want to use Persistent Volumes with a process that automatically creates the volumes for you. My next post will detail how you can use these FlexVolumes in a custom provisioner which resembles the dynamic provisioning of persistent volumes in AWS and GCE!