Day 2 of KubeCon was absolutely jam-packed! There were lots of tracks, so I won't be able to cover everything that happened, but hopefully I can recap some of the things I found interesting.

One thing to note is that the technical deep-dive rooms were dramatically oversubscribed, to the point where I walked out of some of them halfway through because the environment was unbearable. Maybe someone else will be able to recap those.

Kubernetes 1.6

The Kubernetes 1.6 announcement was made during the keynote, and we got a fantastic demo of some of the new features from Aparna Sinha.

Pod Affinity/Anti-Affinity

Pod affinity/anti-affinity was shown off, which lets you control where workloads are scheduled relative to other pods in your cluster. Anti-affinity seemed particularly interesting to me, because it means you can say "if a pod has label X, don't schedule anything else with the same label on that node", which is very powerful for environments where you need to keep failure domains small. Affinity is the obvious counterpart, for when you need to schedule workloads together on certain nodes and ensure they stay there. As our master of ceremonies Kelsey Hightower would say - super dope!
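As a quick sketch of what the 1.6 syntax looks like (the label names here are purely illustrative), a pod can declare that it must never land on a node already running another pod with the same label:

apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  affinity:
    podAntiAffinity:
      # hard requirement: never co-locate two pods carrying app=web on one node
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["web"]
        topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx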

Dynamic Storage Provisioning

Another thing that caught my eye was dynamic storage provisioning becoming standard in 1.6. This has been around a while, and it basically goes off and creates volumes for you (EBS, GCE persistent disks and so on) and then binds them to a persistent volume claim for your pod to consume. Aparna did a demo of this, and it really showed the power of this model when creating dynamic workloads. I've been doing this in the beta/alpha tree for a while, and I hope to have a blog post up soon showing this off for FlexVolumes and how you can extend it with out-of-tree provisioners!
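As a rough sketch of what this looks like (the class name and sizes here are just placeholders), an admin defines a StorageClass and developers simply ask for storage with a PersistentVolumeClaim; the provisioner creates and binds the backing volume for them:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-app-data
spec:
  storageClassName: fast
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi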

Roadmap

There are some really great things on the 1.7 roadmap. Of particular interest is the service-catalog work, allowing you to provision resources that might not live in Kubernetes. More about that later.

Monzo & Linkerd

I attended a fantastically interesting talk, co-presented by Monzo and Buoyant, about how they worked together. Monzo is incredibly interesting in that they built a bank (yes, you read that right, a fucking bank) powered by Kubernetes.

The story started with Monzo explaining how their previous architecture relied on infrastructure services like RabbitMQ to ensure high reliability, but this was failing them. They were very candid about a recent outage which stopped payment processing. That's an incredible nightmare for a bank: if people can't buy things because their card doesn't work, it causes a real drop in trust. However, along came Linkerd, operating as a fabric across their k8s cluster, which vastly increased their reliability.

Oliver Gould then took us through some of the features of Linkerd, and the sell was compelling. I'd highly recommend reviewing the slides, and I'll be taking a look at Linkerd soon.

Service Catalog

There were several talks about the service catalog maturing as an API. It's a relatively new concept to me in Kubernetes, but it's incredibly interesting, and after a great chat with the folks at the Deis booth it made a lot more sense.

Essentially, the service catalog will allow developers/applications to consume resources that may or may not live outside the Kubernetes API. The best example of this that I got during the day is as follows.

  • Your database may run outside your cluster.
  • Your developers want to be able to provision and consume that resource - creating a new database for their app and/or creating credentials to log in.
  • Traditionally, this would be a manual process.
  • With a service catalog resource, however, you make an API call which provisions the database, and then bind to the resource, which provides you with API-generated credentials (sketched below).
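The API was still alpha at the time of writing and the exact kinds and fields have moved around since, so treat this purely as an illustrative sketch (all names are made up), but provisioning and binding look something like this:

apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceInstance
metadata:
  name: my-database
spec:
  clusterServiceClassExternalName: postgresql
  clusterServicePlanExternalName: standard
---
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceBinding
metadata:
  name: my-database-binding
spec:
  instanceRef:
    name: my-database
  # credentials generated by the broker land in this secret for the app to consume
  secretName: my-database-credentials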

Needless to say, the potential here is mind boggling. With full kubernetes audit logging, and API driven provisioning of external components to the cluster, it becomes really obvious how you can create a programmable datacenter even outside of AWS/Google clouds.

The most up-to-date example of this is Steward by Deis. It's still in an alpha state, but it's definitely something I think needs to be considered when moving towards "cloud native" workloads.

Everything Else

Some other fun tidbits I picked up today:

James Munnelly of Jetstack has created a very awesome keepalived cloud provider, which operates as an out-of-tree load balancer implementation, allowing those of us unlucky enough to be running bare-metal Kubernetes clusters to use the LoadBalancer service type.
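For context, from the user's side this is just an ordinary Service definition like the one below (names are placeholders); on a cloud provider Kubernetes would provision an ELB or equivalent for it, and the keepalived cloud provider steps in to satisfy the same request with a virtual IP on bare metal:

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080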

Quay, the container registry from CoreOS, now supports serving Helm charts from its registry, which is obviously super useful for those people using Quay.

The Kubernetes job market is looking pretty healthy, based on the job board.

Having a Grafana dashboard for the conference wifi network is pretty awesome.


I was lucky enough to be able to attend CloudNativeCon/Kubecon in Berlin, Germany. This is my recap of the first half day of lightning talks, panels and project updates.

Note - this is not an exhaustive recap. The stuff here is mainly what caught my eye during the first evening, so things are definitely missing.

Fluentd update

First up, we had an update from Eduardo Silva of Treasure Data about the fantastic Fluentd project. The main highlight I can remember is that Fluentd now supports Windows logging. Eduardo made a joke about Windows "being a serious system" and got a great laugh. There was a brief discussion of how Fluentd solves the logging pipeline problem, and it's definitely something I'll be investigating as we try to better ship logs in our Kubernetes implementation.

OpenTracing update

Next, we had Priyanka Sharma on stage to talk about OpenTracing. This is something I was aware of, but hadn’t fully investigated, and Priyanka made it much clearer by demoing a custom “donut order” application she’d written with OpenTracing built in. She did a demo of the app (as a side note, she also asked everyone in the conference room to open a website, which I thought was crazy given the traditional state of conference wifi, but it worked!) and then drilled down into performance problems using OpenTracing. It was a fantastic visualization of how OpenTracing can solve problems, and I got a lot of value out of it!

Linkerd update

Next up was Oliver Gould with an update about Linkerd, and the big news here is that Linkerd now has TCP support! This is a massive step forward, and judging by the reaction to my tweet, the community agrees!

Additionally, something that caught my eye is that Linkerd now supports Kubernetes ingress in its config. I found a great blog post about this on the Buoyant blog, and it has me really excited.

CoreDNS update

The next thing I really paid attention to was the CoreDNS update by Miek Gieben, and this stuck in my mind because I've had some trouble and concerns around the kube-dns/SkyDNS implementation currently used in Kubernetes. For some reason, a Go shim coupled with dnsmasq and a sidecar container just doesn't feel like the right way to build something that's a critical component of the Kubernetes stack.

With that in mind, Miek provided an update on CoreDNS, what it can do and why it's "better" than kube-dns. This slide stood out to me, and after reading about some of the available middleware plugins, I'm looking to replace kube-dns with CoreDNS as soon as I can!
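To give a flavour of why it feels cleaner: cluster DNS in CoreDNS is just a Corefile with the kubernetes middleware enabled. This is only an illustrative sketch - the exact plugin names and options have shifted between CoreDNS versions, so check the docs for your release:

.:53 {
    errors
    log
    # answer cluster.local queries from the Kubernetes API
    kubernetes cluster.local
    # cache responses and send everything else upstream
    cache 30
    proxy . /etc/resolv.conf
}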

Panel

The panel was a real disappointment for me. It consisted of three senior/executive-level employees of Ticketmaster, Amadeus and Haufe-Lexware discussing their transition to Kubernetes. I would have loved to have my bosses, and their bosses, watch this panel, because there was some good conversation about how to transition organisations to new ways of thinking, but it was nothing I hadn't already heard. I would have loved to see maybe 5-10 operators on stage telling war stories of Kubernetes in production, with a little less of the "media training" feel to it. I understand there's a wide range of attendees at this conference though, and I'm sure more people got value out of it than I did.

As a side note, I was very disappointed at the lack of diversity in the panel. I’m sure we could have found someone from a more diverse background/gender to be involved in the discussions.

Helm with AppController

There was a nice short talk about the capabilities of the Mirantis AppController, which seems like a really interesting project. Essentially, it brings orchestration to your Kubernetes cluster and allows staged deployments (i.e. don't deploy the web app until your database is initialized). I can see this getting heavy usage once it's in beta/ready to test, but as it stands it seems fairly new.

BGP Routing in Kubernetes

This talk confused me, because it basically described what Calico already does, unless I’m missing something? Would love someone to fill in the gaps here.

Fluentd Logging Pipelines

The last talk I really tuned in on was one that really stuck with me. Jakob Karalus talked about how he'd built a flexible logging service inside Kubernetes using annotations and fluentd. This tweet should give you an idea of how it works. It's a really interesting solution to a problem that is definitely on my mind at the moment - with Docker you can simply set the logging driver per container, but this is currently not possible in Kubernetes (see this issue for more details). Jakob's solution provides an interesting mechanism for developers to decide where they want their logs to go, and I think that flexibility is key. Check out his github repo to see how it works.

Wrap Up

I’m hoping to write one of these for each day, but the amount of content may make that difficult! Let me know if you think I missed anything important!


Kubernetes has a reputation for being great for stateless application deployment. If you don’t require any kind of local storage inside your containers, the barrier to entry for you to deploy on Kubernetes is probably very, very low. However, it’s a fact of life that some applications require some kind of local storage.

Kubernetes supports this using Volumes, and out of the box there is support for more than enough volume types for the average Kubernetes user. For example, if your cluster is deployed to AWS, you're probably going to make use of the awsElasticBlockStore volume type and think very little of it.

There are situations, however, where you might be deploying your cluster to a different platform, like a physical datacenter or perhaps another "cloud" provider like DigitalOcean. In these situations, you might think you're a little bit screwed, and up until recently you kind of were. The only way to get a new storage provider supported in Kubernetes was to write one, and then run the gauntlet of getting a pull request accepted into the main Kubernetes repo.

However, a new volume type has opened up the door to custom volume providers, and they are exceptionally simple to write and use. FlexVolumes are a relatively new addition to the kubernetes volume list, and they allow you to run an arbitrary script or volume provisioner on the kubernetes host to create a volume.

Before we dive too deep into FlexVolumes, it’s worth refreshing exactly how volumes work on Kubernetes and how they are mapped into the container.

Volumes Crash Course

If you’ve been using Volumes in Kubernetes in a cloud provider, you might not be fully aware of exactly how they work. If you are aware, I suggest you skip ahead. For those that aren’t, let’s have a quick overview of how EBS volumes work in Kubernetes.

Create an EBS Volume.

The first thing you have to do is create an EBS volume. If you’re using the AWS CLI this is easy as:

aws ec2 create-volume --availability-zone eu-west-1c --size 10 --volume-type gp2

Which will return something like:

{
    "AvailabilityZone": "eu-west-1c",
    "Encrypted": false,
    "VolumeType": "gp2",
    "VolumeId": "vol-xxxxxxxxxxxxxxxxx",
    "State": "creating",
    "Iops": 100,
    "SnapshotId": "",
    "CreateTime": "2017-03-12T14:49:36.377Z",
    "Size": 10
}

Your EBS volume is now ready to go.

Once you have the volume, you'll probably want to attach it to a Kubernetes pod! In order to do this, you'll need to take the volume ID and use it in your Kubernetes manifest. The awsElasticBlockStore docs have an example, like so:

apiVersion: v1
kind: Pod
metadata:
  name: test-ebs
spec:
  containers:
  - image: gcr.io/google_containers/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-ebs
      name: test-volume
  volumes:
  - name: test-volume
    awsElasticBlockStore:
      volumeID: vol-xxxxxxxxxxxxxxxxx
      fsType: ext4

Now, if you look in the pod, you’ll see a mount at /test-ebs, but how has it got there? The answer is actually surprisingly simple.

If you examine the EBS volume that was created, you'll see it's been attached to an instance!

aws ec2 describe-volumes --volume-ids vol-xxxxxxxxxxxxxxxxx
{
    "Volumes": [
        {
            "AvailabilityZone": "eu-west-1c",
            "Attachments": [
                {
                    "AttachTime": "2017-03-12T14:53:55.000Z",
                    "InstanceId": "i-xxxxxxxxxxxxxxxxx", << --- attached to an instance
                    "VolumeId": "vol-xxxxxxxxxxxxxxxxx",
                    "State": "attached",
                    "DeleteOnTermination": false,
                    "Device": "/dev/xvdba"
                }
            ],
            "Encrypted": false,
            "VolumeType": "gp2",
            "VolumeId": vol-xxxxxxxxxxxxxxxxx",
            "State": "in-use",
            "Iops": 100,
            "SnapshotId": "",
            "CreateTime": "2017-03-12T14:49:36.377Z",
            "Size": 10
        }
    ]
}

So let’s log into this host, and find the device:

findmnt /dev/xvdba
TARGET                                                                                               SOURCE     FSTYPE OPTIONS
/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-xxxxxxxxxxxxxxxxx                          /dev/xvdba ext4   rw,relatime,data=ordered
/var/lib/kubelet/pods/b6c57370-0733-11e7-8421-06533dc554b3/volumes/kubernetes.io~aws-ebs/test-volume /dev/xvdba ext4   rw,relatime,data=ordered

As you can see here, it’s mounted on the host under the /var/lib/kubelet directory. This gives us a clue as to how this happened, but to confirm, you can examine the kubelet logs and you’ll see things like this:

Mar 12 14:54:11 ip-172-20-57-70 kubelet[1199]: I0312 14:54:11.716670    1199 operation_executor.go:832] MountVolume.WaitForAttach succeeded for volume "kubernetes.io/aws-ebs/vol-xxxxxxxxxxxxxxxxx" (spec.Name: "test-volume") pod "b6c57370-0733-11e7-8421-06533dc554b3" (UID: "b6c57370-0733-11e7-8421-06533dc554b3").
...
Mar 12 14:54:15 ip-172-20-57-70 kubelet[1199]: I0312 14:54:15.738019    1199 mount_linux.go:369] Disk successfully formatted (mkfs): ext4 - /dev/xvdba /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-xxxxxxxxxxxxxxxxx

The main point here is that when we provide a pod with a volume mount, it's the kubelet that takes care of the process. All it does is mount the external volume (in this case the EBS volume) onto a directory on the host (under the /var/lib/kubelet dir), and from there it can map that volume into the container. There isn't any fancy magic on the container side - it's essentially just a normal docker volume to the container.

FlexVolumes examined

Okay, so now we know how volumes work in Kubernetes, we can start to examine how FlexVolumes work.

FlexVolumes are essentially very simple scripts executed by the kubelet on the host. The script should implement five functions:

  • init - to initialize the volume driver. This could be just an empty function if needed
  • attach - to attach the volume to the host. In many cases, this might be empty, but in some cases, like for EBS, you might have to make an API call to attach it to the host
  • mount - mount the volume on the host. This is the important part: it makes the volume available to the host, which mounts it under /var/lib/kubelet
  • unmount - hopefully self explanatory - unmount the volume
  • detach - again, hopefully self explanatory - detach the volume from the external host.

For each of these functions, there are parameters passed as script arguments (such as $1, $2, $3). The last argument is interesting, because it's actually a JSON string with options from the driver (more on this later). These parameters specify options that are important to the function, and as we examine a real world example they should become clearer.
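To make that concrete, here's roughly what the kubelet's calls to a driver look like. The paths and values below are made up for illustration, and the exact call convention varies between Kubernetes versions, so check the FlexVolume docs for your release:

# called once when the kubelet starts
./lvm init
# attach: the last argument is a JSON blob of the options from the pod spec
./lvm attach '{"volumeID":"vol1","size":"1000m","volumegroup":"kube_vg"}'
# mount: mount path, device, then the JSON options (the $1/$2/$3 seen below)
./lvm mount /var/lib/kubelet/pods/<pod-uid>/volumes/leebriggs.co.uk~lvm/test /dev/kube_vg/vol1 '{"kubernetes.io/fsType":"ext4"}'
# and the reverse operations on teardown
./lvm unmount /var/lib/kubelet/pods/<pod-uid>/volumes/leebriggs.co.uk~lvm/test
./lvm detach /dev/kube_vg/vol1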

LVM Example

The Kubernetes repo has a helpful LVM example in the form of a bash script, which makes it nice and readable and easy to understand. Let's look at some of the functions.

Init

The init function is very simple, as LVM doesn't require any initialization:

if [ "$op" = "init" ]; then
	log "{\"status\": \"Success\"}"
	exit 0
fi

Notice how we’re returning JSON here, which isn’t much fun in bash!

Attach

The attach function for the LVM example simply determines if the device exists. Because we don't have to make any API calls to a cloud provider, this is quite simple:

attach() {
	JSON_PARAMS=$1
	SIZE=$(echo $1 | jq -r '.size')

	DMDEV=$(getdevice)
	if [ ! -b "${DMDEV}" ]; then
		err "{\"status\": \"Failure\", \"message\": \"Volume ${VOLUMEID} does not exist\"}"
		exit 1
	fi
	log "{\"status\": \"Success\", \"device\":\"${DMDEV}\"}"
	exit 0
}

As you saw earlier, the LVM device needs to exist before we can mount it (in the EBS example earlier, we had to create the device first), so during the attach phase we ensure the device is available.

Mount

The final stage is the mount section.

domountdevice() {
	MNTPATH=$1
	DMDEV=$2
	FSTYPE=$(echo $3|jq -r '.["kubernetes.io/fsType"]')

	if [ ! -b "${DMDEV}" ]; then
		err "{\"status\": \"Failure\", \"message\": \"${DMDEV} does not exist\"}"
		exit 1
	fi

	if [ $(ismounted) -eq 1 ] ; then
		log "{\"status\": \"Success\"}"
		exit 0
	fi

	VOLFSTYPE=`blkid -o udev ${DMDEV} 2>/dev/null|grep "ID_FS_TYPE"|cut -d"=" -f2`
	if [ "${VOLFSTYPE}" == "" ]; then
		mkfs -t ${FSTYPE} ${DMDEV} >/dev/null 2>&1
		if [ $? -ne 0 ]; then
			err "{ \"status\": \"Failure\", \"message\": \"Failed to create fs ${FSTYPE} on device ${DMDEV}\"}"
			exit 1
		fi
	fi

	mkdir -p ${MNTPATH} &> /dev/null

	mount ${DMDEV} ${MNTPATH} &> /dev/null
	if [ $? -ne 0 ]; then
		err "{ \"status\": \"Failure\", \"message\": \"Failed to mount device ${DMDEV} at ${MNTPATH}\"}"
		exit 1
	fi
	log "{\"status\": \"Success\"}"
	exit 0
}

This is a little bit more involved, but still relatively simple. Essentially, what happens here is:

  • The passed device is formatted with the filesystem provided in the parameters
  • A directory is created to mount the volume into
  • The device is then mounted at the mount path provided by the kubelet

Parameters

You may be wondering, where do these parameters I keep talking about come from? The answer is from the pod manifest sent to the kubelet. Here’s an example that uses the above LVM FlexVolume:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: test
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: test
    flexVolume:
      driver: "leebriggs.co.uk/lvm"
      fsType: "ext4"
      options:
        volumeID: "vol1"
        size: "1000m"
        volumegroup: "kube_vg"

The key section here is the "options" section. The volume ID, size and volume group are all passed to the driver as $3 in a JSON string, which is why there's a bunch of jq munging happening in the above scripts.

Using FlexVolumes

Now that you understand how FlexVolumes work, you need to make the kubelet aware of them. Currently, the only way to do this is to install them on the host under a specific directory.

FlexVolumes need a “namespace” (for want of a better word) and a name. So for example, my personally built lvm FlexVolume might be leebriggs.co.uk/lvm. When we install our script, it needs to be installed like so on the host that runs the kubelet:

mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/leebriggs.co.uk~lvm
mv lvm /usr/libexec/kubernetes/kubelet-plugins/volume/exec/leebriggs.co.uk~lvm/lvm

Once you’ve done this, restart the kubelet, and you should be able to use your FlexVolume as you need.

Manifest

The manifest above gives you an example of how to use FlexVolumes. It's worth noting that not all FlexVolumes will take the same options, though. Make sure the driver name matches the directory under the exec folder (in our case, leebriggs.co.uk~lvm) and that you pass the options your driver requires.

Wrapping up

This was a relative crash course in FlexVolumes for Kubernetes. There are a couple of problems with the example:

  • The example is written in bash, which isn’t great at manipulating JSON
  • It uses LVM, which isn’t exactly multi host compatible

The first point is easily solved by writing a driver in a language with JSON parsing built in. There are a few FlexVolume drivers popping up in Go - I wrote one for ploop using a library which was written to ease the process, and there are others.

All of this deals with mapping single, static volumes into containers, but there is more. Currently, you have to manually provision the volumes you use before spinning up a pod, and as you start to create more and more volumes, you may want a process that automatically creates Persistent Volumes for you. My next post will detail how you can use these FlexVolumes in a custom provisioner which resembles the dynamic provisioning available in AWS and GCE!


In the previous post, I went over some basics of how Kubernetes networking works from a fundamental standpoint. The requirements are simple: every pod needs to have connectivity to every other pod. The only differentiation between the many options were how that was achieved.

In this post, I'm going to cover some of the fundamentals of how Calico works. As I mentioned in the previous post, I really don't like the idea that with these Kubernetes deployments you simply grab a yaml file and deploy it, sometimes with little to no explanation of what's actually happening. Hopefully, this post will serve to better explain what's going on.

As before, I’m not by any means a networking expert, so if you spot any mistakes, please send a pull request!

What is Calico?

Calico is a container networking solution created by Metaswitch. While solutions like Flannel operate at layer 2, Calico makes use of layer 3 to route packets to pods. The way it does this is relatively simple in practice. Calico can also provide network policy for Kubernetes, but we'll ignore that for the time being and focus purely on how it provides container networking.

Components

Your average Calico setup has four components:

Etcd

Etcd is the backend data store for all the information Calico needs. If you’ve deployed Kubernetes already, you already have an etcd deployment, but it’s usually suggested to deploy a separate etcd for production systems, or at the very least deploy it outside of your kubernetes cluster.

You can examine the information that calico provides by using etcdctl. The default location for the calico keys is /calico

$ etcdctl ls /calico
/calico/ipam
/calico/v1
/calico/bgp

BIRD

The next key component in the Calico stack is BIRD. BIRD is a BGP routing daemon which runs on every host. Calico makes use of BGP to propagate routes between hosts. BGP (if you're not aware) is widely used to propagate routes over the internet. It's suggested you make yourself familiar with some of the concepts if you're using Calico.

BIRD runs on every host in the Kubernetes cluster, usually as a DaemonSet. It's included in the calico/node container.
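A quick way to check the BGP mesh is actually up is calicoctl. The output below is only illustrative, but on a healthy node you should see every peer in an Established state:

$ calicoctl node status
Calico process is running.

IPv4 BGP status
+----------------+-------------------+-------+----------+-------------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+----------------+-------------------+-------+----------+-------------+
| 172.29.141.96  | node-to-node mesh | up    | 10:23:42 | Established |
| 172.29.141.97  | node-to-node mesh | up    | 10:23:45 | Established |
+----------------+-------------------+-------+----------+-------------+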

Confd

Confd is a simple configuration management tool. It reads values from etcd and writes them to files on disk. If you take a look inside the calico/node container (where it usually runs), you can get an idea of what it's doing:

# ps
PID   USER     TIME   COMMAND
    1 root       0:00 /sbin/runsvdir -P /etc/service/enabled
  105 root       0:00 runsv felix
  106 root       0:00 runsv bird
  107 root       0:00 runsv bird6
  108 root       0:00 runsv confd
  109 root       0:28 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
  110 root       0:00 confd -confdir=/etc/calico/confd -interval=5 -watch --log-level=debug -node=http://etcd1:4001
  112 root       0:40 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
  230 root      31:48 calico-felix
  256 root       0:00 calico-iptables-plugin
  257 root       2:17 calico-iptables-plugin
11710 root       0:00 /bin/sh
11786 root       0:00 ps

As you can see, it’s connecting to the etcd nodes and reading from there, and it has a confd directory passed to it. The source of that confd directory can be found in the calicoctl github repository.

If you examine the repo, you’ll notice three directories.

Firstly, there’s a conf.d directory. This directory contains a bunch of toml configuration files. Let’s examine one of them:

[template]
src = "bird_ipam.cfg.template"
dest = "/etc/calico/confd/config/bird_ipam.cfg"
prefix = "/calico/v1/ipam/v4"
keys = [
    "/pool",
]
reload_cmd = "pkill -HUP bird || true"

This is pretty simple in reality. It has a source template, a destination the rendered file should be written to, and then some etcd keys to read information from. Essentially, confd is what writes the BIRD configuration for Calico. If you examine the keys, you'll see the kind of thing it reads:

$  etcdctl ls /calico/v1/ipam/v4/pool/
/calico/v1/ipam/v4/pool/192.168.0.0-16

So in this case, it's getting the pod CIDR we've assigned. I'll cover this in more detail later.

In order to understand what it does with that key, you need to take a look at the src template confd is using.

Now, this at first glance looks a little complicated, but it's not. It's a standard BIRD configuration file, written in the Go templating language that confd understands and populated with keys from etcd. Take this section for example:
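I won't reproduce the whole template, but the part of bird_ipam.cfg.template we care about looks roughly like this (paraphrased from memory, so treat it as illustrative rather than verbatim):

{{range ls "/v1/ipam/v4/pool"}}{{$data := json (getv (printf "/v1/ipam/v4/pool/%s" .))}}
  if ( net ~ {{$data.cidr}} ) then {
    accept;
  }
{{end}}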

This is essentially:

  • Looping through all the pools under the key /v1/ipam/v4/pool - in our case we only have one: 192.168.0.0-16
  • Assigning the data in the pools key to a var, $data
  • Then grabbing a value from the JSON that’s been loaded into $data - in this case the cidr key.

This makes more sense if you look at the values in the etcd key:

etcdctl get /calico/v1/ipam/v4/pool/192.168.0.0-16
{"cidr":"192.168.0.0/16","ipip":"tunl0","masquerade":true,"ipam":true,"disabled":false}

So it's grabbed the cidr value and written it to the file. The end result in the calico/node container brings this all together:

if ( net ~ 192.168.0.0/16 ) then {
    accept;
  }

Pretty simple really!

calico-felix

The final component is the calico-felix daemon. This is the tool that performs all the magic in the Calico stack. It has multiple responsibilities:

  • it programs the operating system's routing table. You'll see this in action later
  • it manipulates iptables rules on the host. Again, you'll see this in action later.

It does all this by connecting to etcd and reading information from there. It runs inside the calico/node DaemonSet alongside confd and BIRD.

Calico in Action

In order to get started, it's recommended that you've deployed Calico using the installation instructions here. Ensure that:

  • you've got a calico/node container running on every Kubernetes host
  • you can see in the calico/node logs that there are no errors or issues. Use kubectl logs on a few of the pods to ensure everything is working as expected

At this stage, you'll want to deploy something so that Calico can work its magic. I recommend deploying the guestbook to see all this in action.

Routing Table

Once you’ve deployed Calico and your guestbook, get the pod IP of the guestbook using kubectl:

kubectl get po -o wide
NAME                           READY     STATUS    RESTARTS   AGE       IP                NODE
frontend-88237173-f3sz4        1/1       Running   0          2m        192.168.15.195    node1
frontend-88237173-j407q        1/1       Running   0          2m        192.168.228.195   node2
frontend-88237173-pwqfx        1/1       Running   0          2m        192.168.175.195   node3
redis-master-343230949-zr5xg   1/1       Running   0          2m        192.168.0.130    node4
redis-slave-132015689-475lt    1/1       Running   0          2m        192.168.71.1      node5
redis-slave-132015689-dzpks    1/1       Running   0          2m        192.168.105.65   node6

If everything has worked correctly, you should be able to ping every pod from any host. Test this now:

ping -c 1 192.168.15.195
PING 192.168.15.195 (192.168.15.195) 56(84) bytes of data.
64 bytes from 192.168.15.195: icmp_seq=1 ttl=63 time=0.318 ms

--- 192.168.15.195 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.318/0.318/0.318/0.000 ms

If you have fping and jq installed, you can verify all the pods in one go:

kubectl get po -o json | jq .items[].status.podIP -r | fping
192.168.15.195 is alive
192.168.228.195 is alive
192.168.175.195 is alive
192.168.0.130 is alive
192.168.71.1 is alive
192.168.105.65 is alive

The real question is, how did this actually work? How come I can ping these endpoints? The answer becomes obvious if you print the routing table:

ip route
default via 172.29.132.1 dev eth0
169.254.0.0/16 dev eth0  scope link  metric 1002
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1
172.29.132.0/24 dev eth0  proto kernel  scope link  src 172.29.132.127
172.29.132.1 dev eth0  scope link
192.168.0.128/26 via 172.29.141.98 dev tunl0  proto bird onlink
192.168.15.192/26 via 172.29.141.95 dev tunl0  proto bird onlink
blackhole 192.168.33.0/26  proto bird
192.168.71.0/26 via 172.29.141.105 dev tunl0  proto bird onlink
192.168.105.64/26 via 172.29.141.97 dev tunl0  proto bird onlink
192.168.175.192/26 via 172.29.141.102 dev tunl0  proto bird onlink
192.168.228.192/26 via 172.29.141.96 dev tunl0  proto bird onlink

A lot has happened here, so let’s break it down in sections.

Subnets

Each host that has calico/node running on it has its own /26 subnet. You can verify this by looking in etcd:

etcdctl ls /calico/ipam/v2/host/node1/ipv4/block/
/calico/ipam/v2/host/node1/ipv4/block/192.168.228.192-26

So in this case, the host node1 has been allocated the subnet 192.168.228.192-26. Any new host that starts up, connects to kubernetes and has a calico/node container running on it, will get one of those subnets. This is a fairly standard model in Kubernetes networking.

What differs here is how Calico handles it. Let’s go back to our routing table and look at the entry for that subnet:

192.168.228.192/26 via 172.29.141.96 dev tunl0  proto bird onlink

What's happened here is that calico-felix has read etcd and determined that the IP address of node1 is 172.29.141.96. Calico now knows the IP address of the host, and also the pod subnet assigned to it. With this information, it programs routes on every node in the Kubernetes cluster. It says "if you want to hit something in this subnet, go via IP address x over the tunl0 interface".

The tunl0 interface may not be present on your host. It exists here because I’ve enabled IPIP encapsulation in Calico for the sake of testing.

Destination Host

Now the packets know their destination: they have a route defined, and they know they should head directly to the node's interface. What happens, then, when they arrive there?

The answer again, is in the routing table. On the host the pod has been scheduled on, print the routing table again:

ip route
default via 172.29.132.1 dev eth0
169.254.0.0/16 dev eth0  scope link  metric 1002
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1
172.29.132.0/24 dev eth0  proto kernel  scope link  src 172.29.132.127
172.29.132.1 dev eth0  scope link
192.168.0.128/26 via 172.29.141.98 dev tunl0  proto bird onlink
192.168.15.192/26 via 172.29.141.95 dev tunl0  proto bird onlink
blackhole 192.168.33.0/26  proto bird
192.168.71.0/26 via 172.29.141.105 dev tunl0  proto bird onlink
192.168.105.64/26 via 172.29.141.97 dev tunl0  proto bird onlink
192.168.175.192/26 via 172.29.141.102 dev tunl0  proto bird onlink
192.168.228.192/26 via 172.29.141.96 dev tunl0  proto bird onlink
192.168.228.195 dev cali7b262072819  scope link

There's an extra route! You can see the pod IP as the destination, and it's telling the OS to route it via a device, cali7b262072819.

Let’s have a look at the interfaces:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT qlen 1000
    link/ether 00:25:90:62:ed:c6 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 00:25:90:62:ed:c6 brd ff:ff:ff:ff:ff:ff
5: cali7b262072819: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 32:e9:d2:f3:17:0f brd ff:ff:ff:ff:ff:ff link-netnsid 4

There’s an interface for our pod! When the container spun up, calico (via CNI) created an interface for us and assigned it to the pod. How did it do that?

CNI

The answer lies in the setup of Calico. If you examine the yaml you used to install Calico, you'll see a setup task which runs on every node. That uses a ConfigMap, which looks like this:

# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # The location of your etcd cluster.  This uses the Service clusterIP
  # defined below.
  etcd_endpoints: "http://10.96.232.136:6666"

  # True enables BGP networking, false tells Calico to enforce
  # policy only, using native networking.
  enable_bgp: "true"

  # The CNI network configuration to install on each node.
  cni_network_config: |-
    {
        "name": "k8s-pod-network",
        "type": "calico",
        "etcd_endpoints": "__ETCD_ENDPOINTS__",
        "log_level": "info",
        "ipam": {
            "type": "calico-ipam"
        },
        "policy": {
            "type": "k8s",
             "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
             "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
        },
        "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/__KUBECONFIG_FILENAME__"
        }
    }

  # The default IP Pool to be created for the cluster.
  # Pod IP addresses will be assigned from this pool.
  ippool.yaml: |
      apiVersion: v1
      kind: ipPool
      metadata:
        cidr: 192.168.0.0/16
      spec:
        ipip:
          enabled: true
        nat-outgoing: true

This manifests itself in the /etc/cni/net.d directory on every host:

ls /etc/cni/net.d/
10-calico.conf  calico-kubeconfig  calico-tls

So essentially, when a new pod starts up, Calico will:

  • query the Kubernetes API to determine that the pod exists and is scheduled on this node
  • assign the pod an IP address from within its IPAM
  • create an interface on the host so that the container can use that address
  • tell the Kubernetes API about the new IP

Magic!

IPTables

The final piece of the puzzle here is some iptables magic. As mentioned earlier, Calico has support for network policy. Even if you're not actively using the policy components, they still exist, and you need some default policy in place for connectivity to work. If you look at the output of iptables -L you'll see a familiar string:

Chain felix-to-7b262072819 (1 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK and 0xfeffffff
MARK       all  --  anywhere             anywhere             /* Start of tier default */ MARK and 0xfdffffff
felix-p-_722590149132d26-i  all  --  anywhere             anywhere             mark match 0x0/0x2000000
RETURN     all  --  anywhere             anywhere             mark match 0x1000000/0x1000000 /* Return if policy accepted */
DROP       all  --  anywhere             anywhere             mark match 0x0/0x2000000 /* Drop if no policy in tier passed */
felix-p-k8s_ns.default-i  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere             mark match 0x1000000/0x1000000 /* Profile accepted packet */
DROP       all  --  anywhere             anywhere             /* Packet did not match any profile (endpoint eth0) */

The iptables chain here contains the same string as the calico interface name. This chain is vital for Calico to pass packets on to the container: it grabs the packet destined for the container, determines if it should be allowed, and sends it on its way if it is.

If this chain doesn't exist, the packet gets caught by the default policy and dropped. It's calico-felix that programs these rules.

Wrap Up

Hopefully, you now have a better understanding of how exactly Calico gets the job done. At its core, it's actually relatively simple: just IP routes on each host. What it does is take the difficulty of managing those routes away from you, giving you a simple, easy solution to container networking.


I have some problems with Kubernetes.

It’s a fantastic tool that is revolutionizing the way we do things at $work. However, because of its code complexity, and the vast number of features, plugins, addons and options, the documentation isn’t getting the job done.

The other issue is that too many of the “Getting Started” tutorials gloss over the parts that you actually need to know. Let’s take a look at the kubeadm page, for example. In the networking section, it says this:

You can install a pod network add-on with the following command: kubectl apply -f

Now, the ease of this is fantastic. You can initialize your network super easily, and if you’re playing around with minikube or some other small setup, this really takes the pain out of getting started.

However, take a look at the full networking documentation page. If things go wrong, are you going to have any idea what’s going on here? Do you feel comfortable running this in production?

I certainly didn't, so for the past week or so I've been learning how all this works. I'm going to detail it in two parts. First, I'm going to explain in sysadmin terms (i.e. I try to avoid network gear at all costs) how Kubernetes approaches networking. Most of the information here is in the earlier linked networking doc, but I'm going to put it in my own words.

The next post will be specifically about my chosen pod network provider, Calico and how it interacts with your OS and containers.

Disclaimer: I’m not an expert on networking by any stretch of the imagination. If any of this is wrong, please send a pull request

Basics

There are a lot of words on the earlier networking page. I'm going to sum it up a bit differently: in order for Kubernetes to work, every pod needs to have its own IP address, just like a VM.

This is in direct conflict with the default setup of standalone Docker. By default, Docker gives itself a private IP address on the host. It creates a bridge interface, docker0, and then grabs an IP, usually something like 172.17.0.1.

All the containers then get a veth interface so they can talk to each other. The problem is that they can only talk to containers on the same host. In order to talk to containers on other hosts, they have to use port mapping on the host. Anyone who's had to deal with this at scale knows it's an exercise in futility.
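If you've never looked at this on a stock Docker host, it's easy to see for yourself (output trimmed, addresses will vary):

# the docker0 bridge with its private address
$ ip addr show docker0
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
    inet 172.17.0.1/16 scope global docker0

# reaching a container from another host means publishing a port on the host
$ docker run -d -p 8080:80 nginx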

So, back to Kubernetes. Every pod gets an IP right? How does it do that?

Well, the pod network mentioned above (you know, that yaml file you downloaded and blindly installed) is usually the thing that controls that. The way it does so varies slightly depending on your chosen network provider (whether it be Flannel, Weave, Calico etc.), but the basics remain essentially the same.

An IP for every container

When the pod network starts up, you usually have to provide a relatively large subnet for configuration. The CoreOS flannel docs, for example, suggest using the subnet 10.1.0.0/16. You'll see why this is so large in a moment.

The subnet is usually predetermined and needs to be stored somewhere, which increasingly seems to be etcd. You usually have to set this before launching the pod network, and it’s often stored in the kubernetes manifest. If you look at the kube-flannel manifest, you’ll see this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "type": "flannel",
      "delegate": {
        "isDefaultGateway": true
      }
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }

This is simple, it’s setting a CNI config, which will then be shipped off to etcd to be stored for safekeeping.

When a container comes online, the pod network looks at the pre-provided subnet and gives it an IP address from within it.
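With flannel, for example, each node gets its own lease carved out of that big subnet, which you can see written out on the host and from which pod IPs are then handed out (values here are illustrative):

$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true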

Connectivity

Now, just because there’s a subnet assigned, doesn’t mean there’s connectivity. And if you remember previously, pods need to have connectivity, even across different hosts.

This is important, and is something you should ensure works before you start deploying things to Kubernetes. From any Kubernetes node, you should be able to send ICMP traffic to any pod on your network, and you should also be able to ping any pod IP from another pod. Exactly how this works depends on your pod network. With flannel, for example, you get an interface added on each host (usually flannel0) and the connectivity is provided across a layer 2 overlay network using vxlan. This is relatively simple, but there are some performance penalties. Calico uses a more elegant but more complicated solution, which I'll cover in much more detail in the next post.

In the meantime, let’s look at what a working config looks like in action.

Testing Connectivity

I’ve deployed the guestbook here, and you can see the pod ips like so:

kubectl get po -o wide
NAME                           READY     STATUS    RESTARTS   AGE       IP                NODE
frontend-88237173-jdfgg        1/1       Running   0          2h        192.168.175.197   host1
frontend-88237173-mzmjf        1/1       Running   0          4h        192.168.163.65    host2
frontend-88237173-z3ltv        1/1       Running   0          5h        192.168.173.195   host1
redis-master-343230949-2qrp7   1/1       Running   0          5h        192.168.90.131    host3
redis-slave-132015689-890b2    1/1       Running   0          5h        192.168.90.132    host1
redis-slave-132015689-k0rk5    1/1       Running   0          5h        192.168.175.196   host3

Now, in a working cluster, I should be able to get to any one of these IPs from my master:

# ping -c 1 192.168.175.196
PING 192.168.175.196 (192.168.175.196) 56(84) bytes of data.
64 bytes from 192.168.175.196: icmp_seq=1 ttl=63 time=0.433 ms

--- 192.168.175.196 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.433/0.433/0.433/0.000 ms

if this doesn’t work from any node in your cluster, something is probably wrong

Similarly, you should be able to enter another pod and ping across pods:

# ping -c 1 192.168.90.131
PING 192.168.90.131 (192.168.90.131): 48 data bytes
56 bytes from 192.168.90.131: icmp_seq=0 ttl=62 time=0.358 ms
--- 192.168.90.131 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.358/0.358/0.358/0.000 ms

This fulfills the fundamental requirements of Kubernetes, and you know things are working. If this isn’t working, you need to get troubleshooting as to why.

Now, with flannel, this is all abstracted away from you, and it’s difficult to decipher. Some troubleshooting tips I’d recommend:

  • Make sure flannel0 actually exists, and check the flannel logs
  • Break out tcpdump with tcpdump -vv icmp and check that the ICMP requests are arriving at and leaving the nodes correctly.

With Calico, this is much easier to debug (in my opinion) and I’ll detail some troubleshooting exercises in the next post.

A quick note about services

One thing that confused me when I started with Kubernetes is: why can't I ping service IPs?

# ping -c 1 kubernetes.default
PING kubernetes.default.svc.cluster.local (10.96.0.1): 48 data bytes
--- kubernetes.default.svc.cluster.local ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss

The reason for this is actually quite simple - they don’t technically exist!

kube-proxy

All the services in a cluster are handled by kube-proxy. kube-proxy runs on every node in the cluster, and what it does is write iptables rules for each service. You can see this when you run iptables-save:

-A KUBE-SERVICES -d 10.107.179.200/32 -p tcp -m comment --comment "default/redis-master: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-7GF4BJM3Z6CMNVML
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.98.90.196/32 -p tcp -m comment --comment "default/redis-slave: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.98.90.196/32 -p tcp -m comment --comment "default/redis-slave: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-AGR3D4D4FQNH4O33
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.99.237.90/32 -p tcp -m comment --comment "default/frontend: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.99.237.90/32 -p tcp -m comment --comment "default/frontend: cluster IP" -m tcp --dport 80 -j KUBE-SVC-GYQQTB6TY565JPRW
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU

This is just a taste of what you’ll see, but essentially, these iptables rules manage the traffic towards the service IPs. They don’t actually have any rules for ICMP, because it’s not needed.

So, if from a host you try to hit a service on a TCP port, you'll see it works!

# curl -k https://10.96.0.1
Unauthorized

Don't be fooled by the Unauthorized message here - it's just the Kubernetes API rejecting unauthorized requests. iptables handily translated the request towards the node where the service's backing pod is running, and made it hit the right IP for you. Here's the iptables rule:

-A KUBE-SEP-EHCIXHWU3R7SVNN2 -s 172.29.132.126/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-EHCIXHWU3R7SVNN2 -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-EHCIXHWU3R7SVNN2 --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 172.29.132.126:6443

Simple!

Wrap up

This should wrap up the basics of how Kubernetes networking works, without going into the specifics of exactly what's happening. In the next post, I'll specifically cover Calico and how it operates alongside Kubernetes, using the magic of routing to help your packets reach their destination.