In the previous post, I went over some basics of how Kubernetes networking works from a fundamental standpoint. The requirements are simple: every pod needs to have connectivity to every other pod. The only difference between the many options is how that connectivity is achieved.

In this post, I’m going to cover some of the fundamentals of how Calico works. As I mentioned in the previous post, I really don’t like the idea that with these Kubernetes deployments, you simply grab a YAML file and deploy it, sometimes with little to no explanation of what’s actually happening. Hopefully, this post will serve to help you better understand what’s going on.

As before, I’m not by any means a networking expert, so if you spot any mistakes, please send a pull request!

What is Calico?

Calico is a container networking solution created by MetaSwitch. While solutions like Flannel operate over layer 2, Calico makes use of layer 3 to route packets to pods. The way it does this is relatively simple in practice. Calico can also provide network policy for Kubernetes. We’ll ignore this for the time being, and focus purely on how it provides container networking.

Components

Your average Calico setup has four components:

Etcd

Etcd is the backend data store for all the information Calico needs. If you’ve deployed Kubernetes already, you already have an etcd deployment, but it’s usually suggested to deploy a separate etcd for production systems, or at the very least deploy it outside of your kubernetes cluster.

You can examine the information that Calico stores by using etcdctl. The default location for the Calico keys is /calico:

$ etcdctl ls /calico
/calico/ipam
/calico/v1
/calico/bgp

BIRD

The next key component in the Calico stack is BIRD. BIRD is a BGP routing daemon which runs on every host. Calico makes use of BGP to propagate routes between hosts. BGP (if you’re not aware) is widely used to propagate routes over the internet, so it’s worth familiarising yourself with some of the concepts if you’re using Calico.

BIRD runs on every host in the Kubernetes cluster, usually as part of a DaemonSet. It’s included in the calico/node container.
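If you’re running one of the standard manifests, you can confirm this is in place with kubectl. The calico-node DaemonSet name and the kube-system namespace below are assumptions based on the stock install, so adjust them if your deployment differs:

kubectl -n kube-system get daemonset calico-node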

Confd

Confd is a simple configuration management tool. It reads values from etcd and writes them to files on disk. If you take a look inside the calico/node container (where it usually runs), you can get an idea of what it’s doing:

# ps
PID   USER     TIME   COMMAND
    1 root       0:00 /sbin/runsvdir -P /etc/service/enabled
  105 root       0:00 runsv felix
  106 root       0:00 runsv bird
  107 root       0:00 runsv bird6
  108 root       0:00 runsv confd
  109 root       0:28 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
  110 root       0:00 confd -confdir=/etc/calico/confd -interval=5 -watch --log-level=debug -node=http://etcd1:4001
  112 root       0:40 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
  230 root      31:48 calico-felix
  256 root       0:00 calico-iptables-plugin
  257 root       2:17 calico-iptables-plugin
11710 root       0:00 /bin/sh
11786 root       0:00 ps

As you can see, it’s connecting to the etcd nodes and reading from there, and it has a confd directory passed to it. The source of that confd directory can be found in the calicoctl github repository.

If you examine the repo, you’ll notice three directories.

Firstly, there’s a conf.d directory. This directory contains a bunch of TOML configuration files. Let’s examine one of them:

[template]
src = "bird_ipam.cfg.template"
dest = "/etc/calico/confd/config/bird_ipam.cfg"
prefix = "/calico/v1/ipam/v4"
keys = [
    "/pool",
]
reload_cmd = "pkill -HUP bird || true"

This is pretty simple in reality. It has a source template, a destination the rendered file should be written to, and some etcd keys to read information from. Essentially, confd is what writes the BIRD configuration for Calico. If you examine the keys there, you’ll see the kind of thing it reads:

$  etcdctl ls /calico/v1/ipam/v4/pool/
/calico/v1/ipam/v4/pool/192.168.0.0-16

So in this case, it’s getting the pod CIDR we’ve assigned. I’ll cover this in more detail later.

In order to understand what it does with that key, you need to take a look at the src template confd is using.

Now, this at first glance looks a little complicated, but it’s not. It’s a template written in the Go templating language that confd understands, and it renders a standard BIRD configuration file, populated with keys from etcd.
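The real bird_ipam.cfg.template in the calicoctl repo is longer, but the section that matters here boils down to something like the following confd/Go template snippet. Treat it as an illustrative sketch rather than a verbatim copy; the key path in particular may be abbreviated differently in the actual file:

{{range ls "/v1/ipam/v4/pool"}}{{$data := json (getv (printf "/v1/ipam/v4/pool/%s" .))}}
  if ( net ~ {{$data.cidr}} ) then {
    accept;
  }
{{end}}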

This is essentially:

  • Looping through all the pools under the key /v1/ipam/v4/pool - in our case we only have one: 192.168.0.0-16
  • Assigning the data in the pools key to a var, $data
  • Then grabbing a value from the JSON that’s been loaded into $data - in this case the cidr key.

This makes more sense if you look at the values in the etcd key:

etcdctl get /calico/v1/ipam/v4/pool/192.168.0.0-16
{"cidr":"192.168.0.0/16","ipip":"tunl0","masquerade":true,"ipam":true,"disabled":false}

So it’s grabbed the cidr value and written it to the file. The end result of the file in the calico/node container brings this all together:

if ( net ~ 192.168.0.0/16 ) then {
    accept;
  }

Pretty simple really!

calico-felix

The final component in the Calico stack is the calico-felix daemon. This is the tool that performs all the magic. It has multiple responsibilities:

  • it programs routes into the operating system’s routing table. You’ll see this in action later
  • it manipulates iptables on the host. Again, you’ll see this in action later.

It does all this by connecting to etcd and reading information from there. It runs inside the calico/node DaemonSet alongside confd and BIRD.

Calico in Action

In order to get started, it’s recommended that you deploy Calico using the installation instructions here. Ensure that:

  • you’ve got a calico/node container running on every kubernetes host
  • you can see in the calico/node logs that there are no errors or issues. Use kubectl logs on a few hosts to ensure it’s working as expected (see the example below)
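For example, on a stock install the calico/node pods run in kube-system. The k8s-app=calico-node label below is an assumption based on the standard manifests, so adjust it to match your deployment:

kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
kubectl -n kube-system logs <calico-node-pod-name>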

At this stage, you’ll want to deploy something so that Calico can work its magic. I recommend deploying the guestbook to see all this in action.

Routing Table

Once you’ve deployed Calico and your guestbook, get the pod IP of the guestbook using kubectl:

kubectl get po -o wide
NAME                           READY     STATUS    RESTARTS   AGE       IP                NODE
frontend-88237173-f3sz4        1/1       Running   0          2m        192.168.15.195    node1
frontend-88237173-j407q        1/1       Running   0          2m        192.168.228.195   node2
frontend-88237173-pwqfx        1/1       Running   0          2m        192.168.175.195   node3
redis-master-343230949-zr5xg   1/1       Running   0          2m        192.168.0.130    node4
redis-slave-132015689-475lt    1/1       Running   0          2m        192.168.71.1      node5
redis-slave-132015689-dzpks    1/1       Running   0          2m        192.168.105.65   node6

If everything has worked correctly, you should be able to ping every pod from any host. Test this now:

ping -c 1 192.168.15.195
PING 192.168.15.195 (192.168.15.195) 56(84) bytes of data.
64 bytes from 192.168.15.195: icmp_seq=1 ttl=63 time=0.318 ms

--- 192.168.15.195 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.318/0.318/0.318/0.000 ms

If you have fping installed, you can verify all the pods in one go:

kubectl get po -o json | jq .items[].status.podIP -r | fping
192.168.15.195 is alive
192.168.228.195 is alive
192.168.175.195 is alive
192.168.0.130 is alive
192.168.71.1 is alive
192.168.105.65 is alive

The real question is, how did this actually work? How come I can ping these endpoints? The answer becomes obvious if you print the routing table:

ip route
default via 172.29.132.1 dev eth0
169.254.0.0/16 dev eth0  scope link  metric 1002
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1
172.29.132.0/24 dev eth0  proto kernel  scope link  src 172.29.132.127
172.29.132.1 dev eth0  scope link
192.168.0.128/26 via 172.29.141.98 dev tunl0  proto bird onlink
192.168.15.192/26 via 172.29.141.95 dev tunl0  proto bird onlink
blackhole 192.168.33.0/26  proto bird
192.168.71.0/26 via 172.29.141.105 dev tunl0  proto bird onlink
192.168.105.64/26 via 172.29.141.97 dev tunl0  proto bird onlink
192.168.175.192/26 via 172.29.141.102 dev tunl0  proto bird onlink
192.168.228.192/26 via 172.29.141.96 dev tunl0  proto bird onlink

A lot has happened here, so let’s break it down into sections.

Subnets

Each host that has calico/node running on it has its own /26 subnet. You can verify this by looking in etcd:

etcdctl ls /calico/ipam/v2/host/node1/ipv4/block/
/calico/ipam/v2/host/node1/ipv4/block/192.168.228.192-26

So in this case, the host node1 has been allocated the subnet 192.168.228.192/26. Any new host that starts up, connects to Kubernetes and runs a calico/node container will get one of these subnets. This is a fairly standard model in Kubernetes networking.

What differs here is how Calico handles it. Let’s go back to our routing table and look at the entry for that subnet:

192.168.228.192/26 via 172.29.141.96 dev tunl0  proto bird onlink

What’s happened here is that Calico has read etcd and determined that the IP address of node1 is 172.29.141.96. Calico now knows the IP address of the host, and also the pod subnet assigned to it. With this information, it programs routes on every node in the Kubernetes cluster (note the proto bird in those entries - it’s BIRD that installs them). Each route says “if you want to hit something in this subnet, go via IP address x over the tunl0 interface.”

The tunl0 interface may not be present on your host. It exists here because I’ve enabled IPIP encapsulation in Calico for the sake of testing.
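If you want to confirm that the BGP sessions behind these routes have actually been established, calicoctl can show you the peerings BIRD has formed. The subcommand below assumes a reasonably recent calicoctl; on older releases it was plain calicoctl status:

calicoctl node status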

Destination Host

Now the packets know where they’re going. There’s a route defined, telling them to head to the destination node over the tunl0 interface. What happens when they arrive there?

The answer again, is in the routing table. On the host the pod has been scheduled on, print the routing table again:

ip route
default via 172.29.132.1 dev eth0
169.254.0.0/16 dev eth0  scope link  metric 1002
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1
172.29.132.0/24 dev eth0  proto kernel  scope link  src 172.29.132.127
172.29.132.1 dev eth0  scope link
192.168.0.128/26 via 172.29.141.98 dev tunl0  proto bird onlink
192.168.15.192/26 via 172.29.141.95 dev tunl0  proto bird onlink
blackhole 192.168.33.0/26  proto bird
192.168.71.0/26 via 172.29.141.105 dev tunl0  proto bird onlink
192.168.105.64/26 via 172.29.141.97 dev tunl0  proto bird onlink
192.168.175.192/26 via 172.29.141.102 dev tunl0  proto bird onlink
192.168.228.192/26 via 172.29.141.96 dev tunl0  proto bird onlink
192.168.228.195 dev cali7b262072819  scope link

There’s an extra route! You can see that the pod IP is the destination, and the OS is being told to route it via a device, cali7b262072819.

Let’s have a look at the interfaces:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT qlen 1000
    link/ether 00:25:90:62:ed:c6 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 00:25:90:62:ed:c6 brd ff:ff:ff:ff:ff:ff
5: cali7b262072819: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 32:e9:d2:f3:17:0f brd ff:ff:ff:ff:ff:ff link-netnsid 4

There’s an interface for our pod! When the container spun up, calico (via CNI) created an interface for us and assigned it to the pod. How did it do that?
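Before we answer that, it’s worth noting that this cali interface is one end of a veth pair; the other end lives inside the pod’s network namespace as its eth0. You can confirm the link type from the host (the interface name here comes from the example above, so yours will differ):

ip -d link show dev cali7b262072819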

CNI

The answer lies in the setup of Calico. If you examine the YAML you installed when you deployed Calico, you’ll see a setup task which runs on every node. That uses a ConfigMap, which looks like this:

# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # The location of your etcd cluster.  This uses the Service clusterIP
  # defined below.
  etcd_endpoints: "http://10.96.232.136:6666"

  # True enables BGP networking, false tells Calico to enforce
  # policy only, using native networking.
  enable_bgp: "true"

  # The CNI network configuration to install on each node.
  cni_network_config: |-
    {
        "name": "k8s-pod-network",
        "type": "calico",
        "etcd_endpoints": "__ETCD_ENDPOINTS__",
        "log_level": "info",
        "ipam": {
            "type": "calico-ipam"
        },
        "policy": {
            "type": "k8s",
             "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
             "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
        },
        "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/__KUBECONFIG_FILENAME__"
        }
    }

  # The default IP Pool to be created for the cluster.
  # Pod IP addresses will be assigned from this pool.
  ippool.yaml: |
      apiVersion: v1
      kind: ipPool
      metadata:
        cidr: 192.168.0.0/16
      spec:
        ipip:
          enabled: true
        nat-outgoing: true

This manifests itself in the /etc/cni/net.d directory on every host:

ls /etc/cni/net.d/
10-calico.conf  calico-kubeconfig  calico-tls
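The 10-calico.conf file is simply the cni_network_config block from the ConfigMap above with the placeholders substituted at install time, so on a working node it will look something along these lines (trimmed, and with values specific to your cluster):

cat /etc/cni/net.d/10-calico.conf
{
    "name": "k8s-pod-network",
    "type": "calico",
    "etcd_endpoints": "http://10.96.232.136:6666",
    "ipam": {
        "type": "calico-ipam"
    },
    ...
}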

So essentially, when a new pod starts up, Calico will:

  • query the Kubernetes API to determine that the pod exists and that it’s on this node
  • assign the pod an IP address from within its IPAM
  • create an interface on the host so that the container can get an address
  • tell the Kubernetes API about this new IP

Magic!

IPTables

The final piece of the puzzle here is some iptables magic. As mentioned earlier, Calico has support for network policy. Even if you’re not actively using the policy components, they still exist, and you need some default policy in place for connectivity to work. If you look at the output of iptables -L, you’ll see a familiar string:

Chain felix-to-7b262072819 (1 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK and 0xfeffffff
MARK       all  --  anywhere             anywhere             /* Start of tier default */ MARK and 0xfdffffff
felix-p-_722590149132d26-i  all  --  anywhere             anywhere             mark match 0x0/0x2000000
RETURN     all  --  anywhere             anywhere             mark match 0x1000000/0x1000000 /* Return if policy accepted */
DROP       all  --  anywhere             anywhere             mark match 0x0/0x2000000 /* Drop if no policy in tier passed */
felix-p-k8s_ns.default-i  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere             mark match 0x1000000/0x1000000 /* Profile accepted packet */
DROP       all  --  anywhere             anywhere             /* Packet did not match any profile (endpoint eth0) */

The iptables chain here contains the same string as the Calico interface name. These rules are vital for Calico to pass packets on to the container: they grab a packet destined for the container, determine whether it should be allowed, and send it on its way if it is.

If this chain doesn’t exist, the traffic gets caught by the default policy and the packet is dropped. It’s calico-felix that programs these rules.
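If you want to find the chain for a particular pod, grep the iptables rules for the interface suffix you saw in the routing table (7b262072819 here comes from the earlier example):

iptables-save | grep felix-to-7b262072819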

Wrap Up

Hopefully, you now have a better understanding of how exactly Calico gets the job done. At its core, it’s actually relatively simple: just IP routes on each host. What it does is take the difficulty of managing those routes away from you, giving you a simple, easy solution to container networking.


I have some problems with Kubernetes.

It’s a fantastic tool that is revolutionizing the way we do things at $work. However, because of its code complexity, and the vast number of features, plugins, addons and options, the documentation isn’t getting the job done.

The other issue is that too many of the “Getting Started” tutorials gloss over the parts that you actually need to know. Let’s take a look at the kubeadm page, for example. In the networking section, it says this:

You can install a pod network add-on with the following command: kubectl apply -f

Now, the ease of this is fantastic. You can initialize your network super easily, and if you’re playing around with minikube or some other small setup, this really takes the pain out of getting started.

However, take a look at the full networking documentation page. If things go wrong, are you going to have any idea what’s going on here? Do you feel comfortable running this in production?

I certainly didn’t, so for the past week or so, I’ve been learning how all this works. I’m going to detail all this in two parts. First, I’m going to explain in sysadmin terms (i.e. I try to avoid network gear at all costs) how Kubernetes approaches networking. Most of the information here is in the earlier linked networking doc, but I’m going to put it in my own words.

The next post will be specifically about my chosen pod network provider, Calico, and how it interacts with your OS and containers.

Disclaimer: I’m not an expert on networking by any stretch of the imagination. If any of this is wrong, please send a pull request

Basics

There’s a lot of words on the earlier networking page. I’m going to sum it up a bit differently: in order for Kubernetes to work, every pod needs to have its own IP address, just like a VM.

This is in direct conflict with the default setup of standalone Docker. By default, Docker gives itself a private IP address on the host. It creates a bridge interface, docker0, and then grabs an IP, usually something like 172.17.0.1.

All the containers then get a veth interface attached to that bridge so they can talk to each other. The problem here is that they can only talk to containers on the same host. In order to talk to containers on other hosts, they have to start port mapping on the host, and anyone who’s had to deal with that at scale knows it’s an exercise in futility.
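You can see this layout on any machine running standalone Docker: the bridge holds the gateway address, and each running container gets a veth leg attached to it (the interface names will differ on your host):

ip addr show docker0
ip link show master docker0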

So, back to Kubernetes. Every pod gets an IP right? How does it do that?

Well, the pod network mentioned above (you know, that YAML file you downloaded and blindly installed) is usually the thing that controls that. The way it does that varies slightly depending on your chosen network provider (whether it be Flannel, Weave, Calico, etc.) but the basics remain essentially the same.

An IP for every container

When the pod network starts up, you usually have to provide a relatively large subnet for configuration. The CoreOS flannel docs, for example, suggest using the subnet 10.1.0.0/16. You’ll see why this is so large in a moment.

The subnet is usually predetermined and needs to be stored somewhere, which increasingly seems to be etcd. You usually have to set this before launching the pod network, and it’s often stored in the kubernetes manifest. If you look at the kube-flannel manifest, you’ll see this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "type": "flannel",
      "delegate": {
        "isDefaultGateway": true
      }
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }

This is simple: it’s setting a CNI config, which will then be shipped off to etcd to be stored for safekeeping.

When a container comes online, it gets given an IP address from within that preprovided subnet.
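With flannel, for example, you can see the per-host allocation it has made by looking at the subnet file it writes out. The path and values below assume a fairly standard flannel install, so treat them as illustrative:

cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true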

Connectivity

Now, just because there’s a subnet assigned, doesn’t mean there’s connectivity. And if you remember previously, pods need to have connectivity, even across different hosts.

This is important, and is something you should ensure works before you start deploying workloads to Kubernetes. From a Kubernetes node, you should be able to send ICMP traffic to any pod on your network, and you should also be able to ping any pod IP from another pod. How this works depends on your pod network. With flannel, for example, you get an interface added on each host (usually flannel0) and connectivity is provided across a layer 2 overlay network using VXLAN. This is relatively simple, but there are some performance penalties. Calico uses a more elegant but more complicated solution, which I’ll cover in much more detail in the next post.

In the meantime, let’s look at what a working config looks like in action.

Testing Connectivity

I’ve deployed the guestbook here, and you can see the pod IPs like so:

kubectl get po -o wide
NAME                           READY     STATUS    RESTARTS   AGE       IP                NODE
frontend-88237173-jdfgg        1/1       Running   0          2h        192.168.175.197   host1
frontend-88237173-mzmjf        1/1       Running   0          4h        192.168.163.65    host2
frontend-88237173-z3ltv        1/1       Running   0          5h        192.168.173.195   host1
redis-master-343230949-2qrp7   1/1       Running   0          5h        192.168.90.131    host3
redis-slave-132015689-890b2    1/1       Running   0          5h        192.168.90.132    host1
redis-slave-132015689-k0rk5    1/1       Running   0          5h        192.168.175.196   host3

Now, in a working cluster, I should be able to get to any one of these IPs from my master:

# ping -c 1 192.168.175.196
PING 192.168.175.196 (192.168.175.196) 56(84) bytes of data.
64 bytes from 192.168.175.196: icmp_seq=1 ttl=63 time=0.433 ms

--- 192.168.175.196 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.433/0.433/0.433/0.000 ms

If this doesn’t work from any node in your cluster, something is probably wrong.

Similarly, you should be able to enter another pod and ping across pods:

# ping -c 1 192.168.90.131
PING 192.168.90.131 (192.168.90.131): 48 data bytes
56 bytes from 192.168.90.131: icmp_seq=0 ttl=62 time=0.358 ms
--- 192.168.90.131 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.358/0.358/0.358/0.000 ms

This fulfills the fundamental requirements of Kubernetes, and you know things are working. If it isn’t working, you need to start troubleshooting why.

Now, with flannel, this is all abstracted away from you, and it’s difficult to decipher. Some troubleshooting tips I’d recommend:

  • Make sure flannel0 actually exists, and check the flannel logs
  • Break out tcpdump with tcpdump -vv icmp and check that the ICMP requests are arriving at and leaving the nodes correctly (see the example below).
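For the tcpdump option, narrowing the capture down to the pod IP you’re testing makes the output much easier to follow (the IP here is from the guestbook example above):

tcpdump -i any -nn icmp and host 192.168.90.131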

With Calico, this is much easier to debug (in my opinion) and I’ll detail some troubleshooting exercises in the next post.

A quick note about services

One thing that confused me when I started with Kubernetes was: why can’t I ping service IPs?

# ping -c 1 kubernetes.default
PING kubernetes.default.svc.cluster.local (10.96.0.1): 48 data bytes
--- kubernetes.default.svc.cluster.local ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss

The reason for this is actually quite simple - they don’t technically exist!

kube-proxy

All the services in a cluster are handled by kube-proxy. kube-proxy runs on every node in the cluster, and what it does is write iptables rules for each service. You can see this when you run iptables-save:

-A KUBE-SERVICES -d 10.107.179.200/32 -p tcp -m comment --comment "default/redis-master: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-7GF4BJM3Z6CMNVML
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.98.90.196/32 -p tcp -m comment --comment "default/redis-slave: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.98.90.196/32 -p tcp -m comment --comment "default/redis-slave: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-AGR3D4D4FQNH4O33
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.99.237.90/32 -p tcp -m comment --comment "default/frontend: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.99.237.90/32 -p tcp -m comment --comment "default/frontend: cluster IP" -m tcp --dport 80 -j KUBE-SVC-GYQQTB6TY565JPRW
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU

This is just a taste of what you’ll see, but essentially, these iptables rules manage the traffic towards the service IPs. They don’t actually have any rules for ICMP, because it’s not needed.
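If you’d rather not wade through a full iptables-save dump, the service rules live in the nat table under the KUBE-SERVICES chain, so you can list just those:

iptables -t nat -L KUBE-SERVICES -n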

So, if from a host you try to hit a service on a TCP port, you’ll see it works!

# curl -k https://10.96.0.1
Unauthorized

Don’t be fooled by the Unauthorized message here; it’s just the Kubernetes API rejecting unauthorized requests. iptables handily translated the request to the endpoint behind the service and made it hit the right IP for you. Here are the iptables rules:

-A KUBE-SEP-EHCIXHWU3R7SVNN2 -s 172.29.132.126/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-EHCIXHWU3R7SVNN2 -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-EHCIXHWU3R7SVNN2 --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 172.29.132.126:6443
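The --to-destination address in that DNAT rule is just one of the endpoints behind the service; you can cross-check it against what Kubernetes itself thinks the endpoints are, and the addresses should match:

kubectl get endpoints kubernetes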

Simple!

Wrap up

This should wrap up the basics of how Kubernetes networking works, without going into the specifics of exactly what’s happening. In the next post, I’ll specifically cover Calico and how it operates alongside Kubernetes, using the magic of routing to help your packets reach their destination.


One of the first tools I came across when I started out in the IT industry was SmokePing. It’s been around for years and solves the important job of graphing latency between two points in a reasonable way. As a company grows and scales out into multiple datacenters, latency can affect the operation of software, so having it graphed makes a lot of sense.

I was surprised that there hasn’t been any alternative to SmokePing developed in the years since it was conceived. This is probably a testament to how well it works, but in my case, I already had a kick-ass Graphite installation (with a Grafana frontend, obviously) and I wanted to get my latency metrics in there, rather than having to support RRDtool and install a Perl app.

So, I set about reinventing the wheel. Something on my radar was to get my head around Go and this seemed perfect for the task because:

  • It’s fast
  • You can build binaries with it super easily
  • It has concurrency built in

The last point was a big consideration, because pinging lots of endpoints consistently like SmokePing would be much easier if it’s trivial to launch concurrent operations. Go’s goroutines make this very easy.

Graphping is Born

So, with all this in mind, a colleague and I wrote Graphping. You can see the source code here.

In order to run it, you need to specify a config file, and an address for a statsd server. Making use of statsd means you can write metrics to any of the available statsd backends which allows you to use your existing monitoring infrastructure.

The config file makes use of HCL which means you can either write a human readable config file, or use a machine generated JSON config file. An example config file looks like this:

interval = 10 # A global interval. Can be overwritten per target group
prefix = "graphping" # A global prefix for statsd metrics


# Declare a target group with a name
target_group "search_engines" {
  # a custom ping interval for this group
  interval = 2
  # A prefix for the statsd metric for this group
  prefix = "search"
  # A name for the target. This becomes the statsd metric
  target "google" {
    address = "www.google.co.uk"
  }
  target "bing" {
    address = "www.bing.com"
  }
}

# You can specify multiple target groups
target_group "news_sites" {
  prefix = "uk"
  target "bbc" {
    address = "www.bbc.co.uk"
  }
}

This all comes together to allow you to create graphs very similar to SmokePing. Here’s an example:

This is only my second project in Go, so there might be some issues or bugs and the code quality might not be fantastic. Hopefully as time goes on, further improvements will come.


Every company that uses Puppet eventually gets to the stage in their development where they want to store “secrets” within Puppet. Usually (hopefully!) your Puppet manifests and data will be stored in version control in plaintext and therefore adding these secrets to your manifests has some clear security concerns which need to be addressed.

You could just restrict the data to a few select people, and have it in a separate control repo, but at the end of the day, your secrets will still be in plaintext and you’re at the mercy of your version control ACLs.

Fortunately, a bunch of very smart people came across this problem a while ago and gave us the tools we need to solve it.

hiera-eyaml has been around a while now and gives you the capability to encrypt secrets stored in hiera. It provides an eyaml command line tool to make use of this, and will encrypt values for you using a pluggable backend. By default, it uses asymmetric encryption (PKCS#7) and will make any value indecipherable to anyone who doesn’t have the key. You can see the example in the linked GitHub repo, but for completeness’ sake, it looks like this:

---
plain-property: You can see me

encrypted-property: >
    ENC[PKCS7,Y22exl+OvjDe+drmik2XEeD3VQtl1uZJXFFF2NnrMXDWx0csyqLB/2NOWefv
    NBTZfOlPvMlAesyr4bUY4I5XeVbVk38XKxeriH69EFAD4CahIZlC8lkE/uDh
    jJGQfh052eonkungHIcuGKY/5sEbbZl/qufjAtp/ufor15VBJtsXt17tXP4y
    l5ZP119Fwq8xiREGOL0lVvFYJz2hZc1ppPCNG5lwuLnTekXN/OazNYpf4CMd
    /HjZFXwcXRtTlzewJLc+/gox2IfByQRhsI/AgogRfYQKocZgFb/DOZoXR7wm
    IZGeunzwhqfmEtGiqpvJJQ5wVRdzJVpTnANBA5qxeA==]

In order to see the encrypted-property, you need to have access to the pre-shared key that was used to encrypt the value, which means you have to copy the pre-shared key to your master. This is fine if you’re a single user managing a small number of Puppetmasters, but as your team scales, this actually introduces a security consideration.

How do you pass the pre-shared key around? The more people that touch that key, the less secure it becomes. Distributing it to 20-odd people means that if a single user’s laptop is compromised, all your secrets are under threat. Fortunately, there’s a better way of managing this, facilitated by the plugin system hiera-eyaml supports, and the solution is hiera-eyaml-gpg.

Using GPG Keys

The problem with hiera-eyaml-gpg is that the documentation only shows you how to set it up; you then have to go off and do a bunch of reading about how GPG keys work. If you already know how GPG keys work, skip ahead, this isn’t for you! If you don’t, let’s quickly cover how they work, and how this helps us solve the single-key problem above.

Quick Overview

In a nutshell, GPG is a hybrid public and private key encryption system. In a bullet point format:

  • Each user or entity has a public and private key pair.
  • Public keys are used for encryption, private keys are used for decryption. Messages can be signed by encrypting a hash of the message using the sender’s private key, allowing the receiver to verify the integrity of the data by using the sender’s public key to decrypt the hash.
  • Private keys need to be kept secure by the owner.
  • Public keys need to be transferred reliably such that they cannot be altered, and substitute keys inserted.
  • You can encrypt data for multiple recipients, by using all of their public keys together. Any of them will be able to decrypt the data using their own private key.
  • A user can add a set of other users’ public keys to their GPG public keyring. Users create a “web of trust” by validating and signing each other’s keys in their keyrings.

Thinking of this from a Puppet perspective:

  • Each puppetmaster (or sets of puppetmasters) will have their own public and private GPG key pair. The private keys will be kept local on the puppetmasters and should not be transferred anywhere else.
  • Each user that will be editing the secure data within puppet will also have a public and private key pair. They will keep their private key secure and private to themselves.
  • Each user will need to have all of the public keys of all puppetmasters, along with all other eyaml users, added to their own public keyring.
  • When a new user or new puppetmaster is added or a key is changed, all users will need to update their keyrings with the new public keys. Additionally, all encrypted data in hiera will need to be re-encrypted so that the new puppetmasters and users are able to decrypt the encrypted data.
  • If a puppetmaster gets compromised, or a user leaves the company, only the key for that puppetmaster (or set of puppetmasters) needs to be removed from the keyrings and encrypted data. None of the other puppetmaster or user keys need to be updated.

As you can see, this drastically improves the security of your important data stored in hiera. With that in mind, let’s get started.

Generate a GPG Key

There are plenty of docs out there explaining how to generate a GPG key for each OS. In short, you should do this:

gpg --gen-key

You’ll get a handy menu prompt that will help you generate a key. SET A PASSPHRASE. Having a blank passphrase will compromise the whole web of trust for your encrypted data.

Generate a GPG Key for your Puppetmaster

Because GPG operates on the concept of each user using different keys, you’ll now need to generate a key for your Puppetmaster.

If you’re lucky, you can just use the above command and be done with it. To be more specific, here’s the way I know works to generate keys:

# Use a reasonable directory for gpghome
mkdir -m 0700 /etc/puppet/.gpghome
chown puppet:puppet /etc/puppet/.gpghome

Now, the GPG key we generate for the puppetmasters needs some special attributes, so we’ll need a custom batch config file at /etc/puppet/.gpghome/keygen.inp. Make sure you replace _keyname_ with something useful, like maybe puppetmaster:

%echo Generating a default key
Key-Type: default
Subkey-Type: default
Name-Real: _keyname_
Expire-Date: 0
%no-ask-passphrase
%no-protection
%commit
%echo done

Now, generate the key:

sudo -u puppet gpg --homedir /etc/puppet/.gpghome --gen-key --batch /etc/puppet/.gpghome/keygen.inp

Now that’s done, you should see a GPG key in your puppetmaster’s keyring:

sudo -u puppet gpg --homedir /etc/puppet/.gpghome --list-keys
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
/etc/puppet/.gpghome/pubring.gpg
--------------------------------
pub   2048R/XXXXXXXX 2016-11-14
uid                  puppetmaster
sub   2048R/XXXXXXX 2016-11-14

A web of trust

GPG keys operate under the model that everyone has their own public and private key, and everyone in your team trusts each other (hopefully you trust your colleagues!). In the previous step you generated a key; now you need to make sure all your colleagues sign your key to verify its authenticity and confirm it’s valid. In order to do this, you need to distribute your public key to everyone, and they need to sign it.

The way you distribute the public keys is up to you, but there are tools like Keybase or private keyservers available which you may choose to use. Obviously, it’s not recommended to send your puppetmasters’ GPG key to Keybase. The most important consideration here is that the public keys can’t be modified in transit. This means sending the GPG keys via email over the internet is probably not a fantastic idea; however, sending them to your colleagues via internal email probably wouldn’t be so terrible.

At a very minimum, you’ll need to sign the key from your puppetmaster that you generated earlier. In order to do that, export the key in ASCII format:

sudo -u puppet gpg --homedir /etc/puppet/.gpghome --export -a -o /etc/puppet/.gpghome/puppetmaster.pub

Copy the puppetmaster.pub file to your local machine from $distribution_method, and then import it:

gpg --import /path/to/file.pub

From here, you need to verify the key’s fingerprint, and if you’re happy, sign the key:

gpg --sign-key <keyname>

You’ll need to enter your own GPG key’s passphrase in order to sign the key.
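Once you’ve signed it, you can check that the signature actually made it onto the key; your own key’s uid should now appear in the list:

gpg --list-sigs <keyname>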

Everyone who’s going to be using encrypted YAML will need to perform this step for each of the keypairs you generate. This means that when a new user joins the company, you’ll have to import and sign the keys that user generates. There are Puppet modules which ease this process; for example, you can simply add a public key to a puppetmaster’s keyring by using golja-gnupg.

Puppet, Hiera and GPG Keys

Now that you’ve created a web of trust, you need to make Puppet aware of the GPG keys. Firstly, you’ll need to generate a GPG key for your masters. We group our masters into different tiers, dev/stg and prod, and we ensure these keys are kept distinctly separate. Then, make sure the key is signed by the relevant people, otherwise it’s pretty much useless :)

Once your keys and gpg config are set up, you’ll need to get hiera-eyaml-gpg working.

Install hiera-eyaml-gpg

The installation requirements are clearly spelled out here, but for clarity’s sake, I’ll cover the process. It’s basically the same for both the users who’ll be using eyaml to encrypt values and the puppetmasters which will be decrypting them. From an OS perspective, you’ll need to make sure you have the ruby, ruby-devel, rubygems and gpgme packages installed. On CentOS, that looks like this:

sudo yum install ruby gpgme rubygems ruby-devel

Then, install the required rubygems into the relevant Ruby path. If you’re using the latest version of puppetserver, you’ll need to install these using puppetserver gem install.

sudo gem install gpgme
sudo gem install hiera-eyaml
sudo gem install hiera-eyaml-gpg

The Recipients File

One of the main ways that hiera-eyaml-gpg differs from standard hiera-eyaml is the hiera-eyaml-gpg.recipients file. This file essentially lists the GPG keys that are able to decrypt secrets within a directory in hiera. This is an incredibly powerful tool, especially if you wish to allow users to encrypt/decrypt some secrets in your environment, but not others.

When the eyaml command is invoked, it will search the current working directory for this file, and if one is not found it will go up through the directory tree until one is found. As an example, your hieradata directory might look like this:

├── development
│   ├── app1
│   │   └── hiera-eyaml-gpg.recipients
│   └── app2
│       └── hiera-eyaml-gpg.recipients
└── production
    ├── dc1
    │   ├── base.eyaml
    │   └── hiera-eyaml-gpg.recipients
    ├── hiera-eyaml-gpg.recipients
    └── role.eyaml

With this kind of layout, it’s possible to allow users access to certain app credentials, datacenters or even environments, without compromising all the credentials in hiera.

The format of the hiera-eyaml-gpg.recipients file is simple: it just lists the GPG keys that are allowed to encrypt/decrypt values:

puppetmaster-prod
lbriggs
a_nother_user

The values here can be found in the uid field of the gpg --list-keys output.

Modify hiera.yaml

The final step in the process is to make hiera aware of the GPG plugin. Update your hiera.yaml to look like this:

---
:backends:
  - yaml
  - eyaml
:hierarchy:
  - "nodes/%{clientcert}"
:yaml:
  :datadir: "/etc/puppet/environments/%{environment}/hieradata"
:eyaml:
  :datadir: "/etc/puppet/environments/%{environment}/hieradata"
  :gpg_gnupghome: /etc/puppet/.gpghome
  :extension: 'eyaml'

At this point, Puppet should use the GPG extension, assuming you installed it correctly earlier.

Adding an Encrypted Parameter

At this stage, you’ve done the following:

  • Generated GPG keys for all the human users who will be encrypting/decrypting values
  • Generated GPG keys for the puppetmasters which will be decrypting values
  • Shared the public keys around all the above to ensure they’re trusted
  • Installed the components required for Puppet to use GPG keys
  • Set up the hiera-eyaml-gpg.recipients file so hiera-eyaml-gpg knows who can read/write values.

The final step here is adding an encrypted value to hiera. When you did gem install hiera-eyaml you also got a handy command line tool to help with this.

In order to use it simply run the following:

eyaml edit hieradata/<folder>/<file>.eyaml

You’ll be asked to enter your GPG key passphrase, and then you’ll be dropped into an editor with something like this in the header:

#| This is eyaml edit mode. This text (lines starting with #| at the top of the
#| file) will be removed when you save and exit.
#|  - To edit encrypted values, change the content of the DEC(<num>)::PKCS7[]!
#|    block (or DEC(<num>)::GPG[]!).
#|    WARNING: DO NOT change the number in the parentheses.
#|  - To add a new encrypted value copy and paste a new block from the
#|    appropriate example below. Note that:
#|     * the text to encrypt goes in the square brackets
#|     * ensure you include the exclamation mark when you copy and paste
#|     * you must not include a number when adding a new block
#|    e.g. DEC::PKCS7[]! -or- DEC::GPG[]!

As we noted, you’re using the GPG plugin, so add your value like so:

class::class::parameter: DEC::GPG[correct_horse_battery_staple]!

When you save the file, you can cat it again and you’ll see the value is now encrypted:

class::class::parameter: ENC[GPG,XXXXXXXXXXXXXXXXXXXXXXXXXXXXX=]

From here, you can push it to Git and have it downloaded using whatever method you use to grab your config (I hope you’re using r10k!), and the puppetmaster (assuming you set up the GPG encryption correctly!) will be able to decrypt these secrets and serve them to hosts.

Happy encrypting!


I love Gitlab. With every release they announce some amazing new features and it’s one of the few software suites I consider to be a joy to use. Since we adopted it at $job we’ve seen our release cycle within the OPS team improve dramatically and pushing new software seems to be a breeze.

My favourite part of Gitlab is the flexibility and robustness of the gitlab-ci.yml file. Simply by adding a file to your repository, you can have a complex pipeline running tasks which test, build and deploy your software. I remember doing things like this with Jenkins and being incredibly frustrated - with Gitlab I seem to be able to do everything I need without all the fuss.

I also make heavy use of travis-ci in my public and open source projects, and I really like the matrix feature that Travis offers. Fortunately, there’s a similar (but not quite the same) feature available in Gitlab CI but I feel like the documentation is lacking a little bit, so I figured I’d write up a step by step guide to how I’ve started to use these features for our pipelines.

A starting example

Let’s say you have a starting .gitlab-ci.yml like so:

---
stages:
  - test
  - build
  - deploy

build_rpm_centos6:
  image: centos:6
  script: 
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

build_rpm_centos7:
  image: centos:7
  script:
    - rpmbuild
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

This is a totally valid file, but there’s a whole load of repetition in here which really doesn’t need to be. We can use some features of YAML called anchors and aliases which allow us to reduce the amount of code. This is documented in the Gitlab CI README, but I want to break it down into sections.

Define a hidden job

Firstly, we need to define a “hidden job” - this is essentially a job gitlab-ci is aware of but doesn’t actually run. It defines a YAML hash which we can merge into another hash later. We’ll take all of the hash values from the above two jobs that are the same, and place them in that hidden job:

# here we define a hidden job called "build" (prefixed with a dot)
# and then we assign it to an alias &build_definition
.build: &build_definition
  script:
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

What this has done is essentially create something like a function. When we reference it via the *build_definition alias, it’ll spit out the following YAML hash:

---
  script:
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

As you can see, the above YAML hash is only missing two things: a parent hash key and a value for image.

Reduce the code

In order to make use of this alias, we first need to actually define our build jobs. Remember, the above job is hidden so if we pushed to our git repo right now, nothing would happen. Let’s define our two build jobs.

build_centos6:
  image: centos:6

build_centos7:
  image: centos:7

Obviously, this isn’t enough to actually run a build. What we now need to do is merge the hash from the hidden job/alias into our build job definitions.

build_centos6:
  <<: *build_definition # this essentially says insert the hash values from &build_definition hash
  image: centos:6

build_centos7:
  <<: *build_definition
  image: centos:7

That’s a lot less code duplication, and if you know what you’re looking at, it’s much easier to read.

Visualising your gitlab-ci.yml file

This all might seem a little confusing at first because it’s hard to visualise. The best way to get your head around what your CI file produces is to remember that all Gitlab CI does when you push the file is load it into a hash and read the values. With that in mind, try this little one-line script on your file:

ruby -e "require 'yaml'; require 'pp'; hash = YAML.load_file('.gitlab-ci.yml'); pp hash"

This is what the original yaml file hash looks like:

{"stages"=>["test", "build", "deploy"],
 "build_rpm_centos6"=>
  {"image"=>"centos:6",
   "script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"]},
 "build_rpm_centos7"=>
  {"image"=>"centos:7",
   "script"=>["rpmbuild"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"]}}

And this is what the hash from the file with the anchors and such like contains:

{"stages"=>["test", "build", "deploy"],
 ".build"=>
  {"script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"]},
 "build_centos6"=>
  {"script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"],
   "image"=>"centos:6"},
 "build_centos7"=>
  {"script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"],
   "image"=>"centos:7"}}

Hopefully that makes it easier to understand! As mentioned earlier, this isn’t as powerful (yet?) as Travis’s matrix feature, which can quickly expand your jobs multiple times over, but with nested aliases you can easily have quite a complex matrix.