I love Gitlab. With every release they announce some amazing new features and it’s one of the few software suites I consider to be a joy to use. Since we adopted it at $job we’ve seen our release cycle within the OPS team improve dramatically and pushing new software seems to be a breeze.

My favourite part of Gitlab is the flexibility and robustness of the .gitlab-ci.yml file. Simply by adding a file to your repository, you can have complex pipelines running tasks which test, build and deploy your software. I remember doing things like this with Jenkins and being incredibly frustrated - with Gitlab I seem to be able to do everything I need without all the fuss.

I also make heavy use of travis-ci in my public and open source projects, and I really like the matrix feature that Travis offers. Fortunately, there’s a similar (but not quite the same) feature available in Gitlab CI, but I feel like the documentation is lacking a little, so I figured I’d write up a step-by-step guide to how I’ve started using these features in our pipelines.

A starting example

Let’s say you have a starting .gitlab-ci.yml like so:

---
stages:
  - test
  - build
  - deploy

build_rpm_centos6:
  image: centos:6
  script: 
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

build_rpm_centos7:
  image: centos:7
  script:
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

This is a totally valid file, but there’s a whole load of repetition in here which really doesn’t need to be there. We can use two yaml features called anchors and aliases to reduce the amount of duplication. This is documented in the Gitlab CI README, but I want to break it down into sections.

Define a hidden job

Firstly, we need to define a “hidden job” - this is a job which gitlab-ci is aware of but doesn’t actually run. It defines a yaml hash which we can merge into another hash later. We’ll take all of the hash values that are the same in the two jobs above, and place them in that hidden job:

# here we define a hidden job called "build" (prefixed with a dot)
# and then we assign it to an alias &build_definition
.build: &build_definition
  script:
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

What this has done is essentially create something like a function. Whenever we reference the alias *build_definition, it’ll expand to the following yaml hash:

---
  script:
    - rpmbuild -ba
  except:
    - tags
    - master
  stage: build
  tags:
    - docker

As you can see, the above yaml hash is only missing two things: a parent hash key and a value for “image”.

Reduce the code

In order to make use of this alias, we first need to actually define our build jobs. Remember, the job above is hidden, so if we pushed to our git repo right now, nothing would happen. Let’s define our two build jobs.

build_centos6:
  image: centos:6

build_centos7:
  image: centos:7

Obviously, this isn’t enough to actually run a build. What we now need to do is merge the hash from the hidden job’s alias into our build definitions.

build_centos6:
  <<: *build_definition # merge in the hash values from the &build_definition anchor
  image: centos:6

build_centos7:
  <<: *build_definition
  image: centos:7

That’s a lot less code duplication, and if you know what you’re looking at, it’s much easier to read.

Visualising your gitlab-ci.yml file

This all might seem a little confusing at first because it’s hard to visualise. The best way to get your head around the output of your CI file is to remember that all Gitlab CI does when you push the file is load it into a hash and read the values. With that in mind, try this little one-line script on your file:

ruby -e "require 'yaml'; require 'pp'; hash = YAML.load_file('.gitlab-ci.yml'); pp hash"

This is what the original yaml file hash looks like:

{"stages"=>["test", "build", "deploy"],
 "build_rpm_centos6"=>
  {"image"=>"centos:6",
   "script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"]},
 "build_rpm_centos7"=>
  {"image"=>"centos:7",
   "script"=>["rpmbuild"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"]}}

And this is what the hash from the file with the anchors and aliases contains:

{"stages"=>["test", "build", "deploy"],
 ".build"=>
  {"script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"]},
 "build_centos6"=>
  {"script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"],
   "image"=>"centos:6"},
 "build_centos7"=>
  {"script"=>["rpmbuild -ba"],
   "except"=>["tags", "master"],
   "stage"=>"build",
   "tags"=>["docker"],
   "image"=>"centos:7"}}

Hopefully that makes it easier to understand! As mentioned earlier, this isn’t as powerful (yet?) as Travis’s matrix feature, which can quickly expand your jobs multiple times over, but with nested aliases you can easily build up quite a complex matrix.
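
As a taste of what nesting looks like, here’s a sketch - the third job and the spec file are invented, but each layer just merges the one below it:

# common settings for every build job
.build: &build_definition
  stage: build
  tags:
    - docker

# layer the image on top of the common settings
.build_el6: &build_el6_definition
  <<: *build_definition
  image: centos:6

# the final job only adds what is unique to it
build_el6_ruby:
  <<: *build_el6_definition
  script:
    - rpmbuild -ba ruby.spec

Each job now only declares the parts that are unique to it.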


We’re finally beginning to build out our production Kubernetes infrastructure at work, after some extensive testing in dev. Kubernetes relies heavily on TLS for securing communications between all of the components (quite understandably) and while you can disable TLS on many components, obviously once you get to production, you don’t really want to be doing that.

Most of the documentation shows you how to generate a self signed certificate using a CA certificate you create especially for kubernetes. Even Kelsey Hightower’s excellent “Kubernetes the Hard Way” post shows you how to generate the TLS components using a self signed CA. One of the nicest things about using Puppet is that you already have a CA set up, and best of all, there are some really nice APIs inside the puppet master/server, meaning provisioning new certs for hosts is relatively straightforward. I really wanted to take advantage of this with our kubernetes setup, so I made sure etcd was using Puppet’s certs:

#[security]
ETCD_CERT_FILE="/var/lib/puppet/ssl/certs/hostname.server.lan.pem"
ETCD_KEY_FILE="/var/lib/puppet/ssl/private_keys/hostname.server.lan.pem"
ETCD_TRUSTED_CA_FILE="/var/lib/puppet/ssl/certs/ca.pem"
ETCD_PEER_CERT_FILE="/var/lib/puppet/ssl/certs/hostname.server.lan.pem"
ETCD_PEER_KEY_FILE="/var/lib/puppet/ssl/private_keys/hostname.server.lan.pem"
ETCD_PEER_CLIENT_CERT_AUTH=true
ETCD_PEER_TRUSTED_CA_FILE="/var/lib/puppet/ssl/certs/ca.pem"

This works out of the box, because the certs for all 3 etcd hosts have been signed by the same CA.
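
You can sanity check this before wiring it up - openssl will happily verify a host cert against the Puppet CA (same paths as above):

openssl verify -CAfile /var/lib/puppet/ssl/certs/ca.pem /var/lib/puppet/ssl/certs/hostname.server.lan.pem

If the cert was signed by the Puppet CA, this prints hostname.server.lan.pem: OK.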

Securing Kubernetes with Puppet’s Certs

I figured it would be easy to use these certs for Kubernetes also. I set the following parameters in the API server config:

--service-account-key-file=/var/lib/puppet/ssl/private_keys/hostname.server.lan.pem --tls-cert-file=/var/lib/puppet/ssl/certs/hostname.server.lan.pem --tls-private-key-file=/var/lib/puppet/ssl/private_keys/hostname.server.lan.pem

but there were a multitude of problems, the main one being that when a pod starts up, it connects to the API using the kubernetes service cluster IP. You can see this in the log messages when starting a pod:

# kubectl logs kube-dns-v15-017ri --namespace=kube-system kubedns
I0821 08:48:12.808230       1 server.go:91] Using https://10.254.0.1:443 for kubernetes master
I0821 08:48:12.808304       1 server.go:92] Using kubernetes API <nil>
I0821 08:48:12.809448       1 server.go:132] Starting SkyDNS server. Listening on port:10053

I figured it would be easy enough to fix: I’d just add a SAN to the puppet cert using the dns_alt_names configuration option. Unfortunately, this didn’t work, and I got the following error message:

E1125 17:33:16.308389 1 errors.go:62] Status: x509: cannot validate certificate because it doesn't contain any IP SANs

Puppet doesn’t have an option to set IP SANs in the SSL certificate, so I had to generate the cert manually and sign it with the Puppet CA. Thankfully, this is fairly straightforward (albeit manual).

Generating Certs Manually

First, create a Kubernetes config file for OpenSSL on your puppetmaster. I created a directory /var/lib/puppet/ssl/manual_ca to do all this.

[ ca ]

default_ca      = CA_default

[ CA_default ]

dir            = /var/lib/puppet/ssl/manual_ca
certs          = $dir/certs
crl_dir        = $dir/crl
database       = $dir/index.txt
new_certs_dir  = $dir/newcerts
certificate    = /var/lib/puppet/ssl/ca/ca_crt.pem
serial         = $dir/serial
crl            = /var/lib/puppet/ssl/ca/ca_crl.pem
private_key    = /var/lib/puppet/ssl/ca/ca_key.pem
RANDFILE       = $dir/ca/.rand
default_md     = sha256
policy         = policy_any
unique_subject = no

[ policy_any ]
countryName            = supplied
stateOrProvinceName    = optional
organizationName       = optional
organizationalUnitName = optional
commonName             = supplied
emailAddress           = optional

[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
string_mask             = utf8only

[ req_distinguished_name ]
countryName             = Country
stateOrProvinceName     = State
localityName            = Locality
organizationName        = Org
organizationalUnitName  = Me
commonName              = hostname

[ v3_req ]
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names

[alt_names]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster.local
DNS.5 = kubernetes.service.discover
DNS.6 = hostname
IP.1 = 10.254.0.1
IP.2 = 192.168.4.10 # external IP

Note the two IPs here. The first is the cluster IP from the kubernetes service, you can retrieve it like so:

# kubectl get svc
NAME           CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes     10.254.0.1       <none>        443/TCP    21d

I also added the actual IP of the kubernetes host for some future proofing. The DNS names here come from the kube-dns config, so make sure they match your kube-dns setup.
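
Before generating anything, make sure the directory layout referenced by the config (and by the commands below) exists - a quick sketch:

cd /var/lib/puppet/ssl/manual_ca
mkdir -p certs crl newcerts certificate_requests private_keys
touch index.txt    # the CA database; only used if you sign with 'openssl ca'
echo 1000 > serial # starting serial; 'openssl x509 -CAcreateserial' below manages its own .srl file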

Next, we need to generate a CSR and a key:

openssl req -newkey rsa:2048 -nodes -keyout private_keys/kube-api.key -out certificate_requests/kube-api.csr -config kubernetes.cnf

Verify that your CSR has the IP SANs in it:

openssl req -text -noout -verify -in certificate_requests/kube-api.csr | grep "X509v3 Subject" -A 1

Now, we need to sign the cert with the Puppet CA:

openssl x509 -req -in certificate_requests/kube-api.csr -CA /var/lib/puppet/ssl/ca/ca_crt.pem -CAkey /var/lib/puppet/ssl/ca/ca_key.pem -CAcreateserial -out certs/kube-api.pem -days 3000 -extensions v3_req -extfile kubernetes.cnf

This will create a cert in certs/kube-api.pem. Now verify it to ensure it looks okay:

openssl x509 -in certs/kube-api.pem -text -noout
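
In particular, check that the IP SANs made it into the signed cert - the same grep trick as with the CSR works here:

openssl x509 -in certs/kube-api.pem -noout -text | grep "X509v3 Subject" -A 1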

We now have a cert we can use for our kube-apiserver, so we just need to configure kubernetes to use it.

Configuring Kubernetes to use the certs

Assuming you’ve copied the certs to your kubernetes master, we now need to configure k8s to use them. First, make sure you have the following config set in the apiserver:

--service-account-key-file=/etc/kubernetes/kube-api.key --tls-cert-file=/etc/kubernetes/kube-api.pem --tls-private-key-file=/etc/kubernetes/kube-api.key

And then configure the controller manager like so:

--root-ca-file=/var/lib/puppet/ssl/certs/ca.pem --service-account-private-key-file=/etc/kubernetes/kube-api.key

Restart all the k8s components, and you’re almost set.

Regenerate service account secrets

The final thing you’ll need to do is delete the service account secrets kubernetes generated on launch. The reason for this is that kubernetes uses the service-account-private-key-file to generate them, and if you don’t do this you’ll get all manner of permission denied errors when launching pods. It’s easy to do:

kubectl delete sa default
kubectl delete sa default --namespace=kube-system

NOTE if you’re already running pods in your kubernetes system, this may affect them and you may want to be careful doing this. YMMV.
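
Kubernetes will recreate the service accounts automatically, with token secrets signed by the new key. A quick way to check they’ve come back (assuming kubectl is configured for your cluster):

kubectl get sa default
kubectl get secrets --namespace=kube-system | grep token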

From here, you’re using Puppet’s SSL Certs for kubernetes.


So you’ve decided you want to use Configuration Management to control your infrastructure. You’ve read about all of the benefits of “infrastructure as code” and you’ve decided you’re going to use Puppet as your chosen configuration management tool.

I personally believe this to be a good choice. When making comparisons between Ansible, Chef and other configuration management tools, the true benefit of Puppet is the ecosystem that has been established around it to support your workflow. The problem with Puppet is getting started. You want to manage a bunch of stuff, but where do you start? How do all these tools fit together? What decisions do you need to make before diving in?

This is a very opinionated set of posts. I’ll try to cover the options and how I’ve set about doing things, but the main theme here is getting you off the ground with Puppet.

So if you’ve finished the Puppet learning VM, and you’ve browsed a few modules and think “this is the tool for me!” then open up your desired editor and let’s get cracking!

Decision Time

Okay, now close your editor. We’re not going anywhere near it yet, because the first thing you need to do is make a few decisions about your infrastructure and what you’ll be managing with Puppet.

There are a lot of components within Puppet that let you manage your infrastructure in a flexible manner, but before you use them you need to know exactly what you want to do with them.

Your Infrastructure Layout

The first thing to think about is how your infrastructure looks at a high level. There are a few questions to consider:

  • Do you have multiple geographic datacenters?
  • Do you have multiple deployment environments? (eg. dev, stage, production)
  • Do you have multiple infrastructure types? (eg. AWS and physical infrastructure)

The reason you need to think about these things is that they will determine how you use hiera to differentiate between these environments. There will be a full blog post about hiera later, but before you start using it you need to determine how your environment looks. The question you’re looking to answer is: how do you logically separate your infrastructure?
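
To make this concrete, here’s a sketch of the kind of hiera.yaml hierarchy you might end up with once you’ve answered those questions (hiera 3 syntax; the env and datacenter facts are ones you’d have to provide yourself, for example via custom facter facts):

---
:backends:
  - yaml
:yaml:
  :datadir: /etc/puppet/hieradata
:hierarchy:
  - "nodes/%{::fqdn}"
  - "environments/%{::env}"
  - "datacenters/%{::datacenter}"
  - common

Once you have an idea of your layout, it’s time to think about your individual hosts.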

Node Classification and Roles

The very next thing you need to decide is how you’re going to classify your nodes. Each node in a puppet infrastructure has a role - a “thing that it does”. As an example, you might have a database server (with role dbserver) and a web server (webserver). How do you determine that a webserver is a webserver? You might already know it is, but how does your infrastructure know? There are quite a few ways to do this.

  • Name based. You might always have the word “web” in the name, in which case you can use a regex match (see the sketch after this list)
  • IP address. Maybe all webservers are in a specific subnet, in which case you might want to match on IP
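
As an example of the name-based approach, regex node definitions in site.pp will do the matching - this is just a sketch, with invented hostnames and role classes:

# any node with "web" in its certname gets the webserver role
node /web/ {
  include ::role::webserver
}

# e.g. db01.example.lan, db02.example.lan
node /^db\d+/ {
  include ::role::dbserver
}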

These two options are both valid, and you can support them within Puppet. If you don’t currently have a classification system, or you want to improve it, you can use an ENC (External Node Classifier). The most popular ones are:

  • LDAP
  • Foreman
  • Something else with an HTTP API

Essentially, if you use an ENC, it becomes your source of truth for your roles. This is personally the way I think it should be done, and I highly recommend using Foreman. More to come later.


In my last post, I wrote about service discovery with my Puppetmasters using consul.

As part of this deployment, I deployed a healthcheck using Consul’s TCP Checks to check the puppetmaster was responding on its default port (8140). In Puppet, it looked like this:

::consul::check { 'puppetmaster_tcp':
    interval   => '60',
    tcp        => 'localhost:8140',
    notes      => 'Puppetmasters listen on port 8140',
    service_id => 'puppetmaster',
}

The problem with this approach is that it’s a dumb check - the puppetmaster runs in a webserver, and while the port might be open, what happens if the application is returning a 500 internal server error, for example?

In order to rectify this, I decided to make use of a Puppet HTTP API endpoint to query the status.

I must admit, I didn’t even know that Puppet had an HTTP API until recently. Looking through the docs brought up some gems, but the problem is that by default it’s pretty locked down - and rightly so. It’s a powerful API, and a compromised Puppetmaster via the API is a dangerous prospect.

Access to the API is managed via auth.conf, using the allow directive.

While digging through the API docs, I found a nice status endpoint. However, when querying it, I got an access denied error:

curl --cert /var/lib/puppet/ssl/certs/puppetmaster.example.com.pem --key /var/lib/puppet/ssl/private_keys/puppetmaster.example.com.pem --cacert /var/lib/puppet/ssl/ca/ca_crt.pem -H 'Accept: pson' https://puppetmaster.example.com:8140/production/status/test?environment=production
Forbidden request: puppetmaster.example.com(192.168.4.21) access to /status/test [find] authenticated  at :119

This seemed easily fixable, and the endpoint extremely useful. In order to make it work, I made a quick change to auth.conf:

# allow access to the status API call to test if the master is alive
path /status
auth any
method find
allow_ip 192.168.4.21,127.0.0.1

This needs to go above the default policy in auth.conf, which looks like this:

# deny everything else; this ACL is not strictly necessary, but
# illustrates the default policy.
path /
auth any

Now, when I try the curl command again, it works!

curl --cert /var/lib/puppet/ssl/certs/puppetmaster.example.com.pem --key /var/lib/puppet/ssl/private_keys/puppetmaster.example.com.pem --cacert /var/lib/puppet/ssl/ca/ca_crt.pem -H 'Accept: pson' https://puppetmaster.example.com:8140/production/status/test?environment=production
{"is_alive":true,"version":"3.8.4"}

Sweet, now we can make a proper healthcheck!

Because we set the auth.conf entry to auth any, it’s straightforward to make a query to the API endpoint. I used the nagios check_http plugin to get this looking nice. The command looks a bit like this:

/usr/lib64/nagios/plugins/check_http -H localhost -p 8140 -u /production/status/test?environment=production -S -k 'Accept: pson' -s '"is_alive":true'

Simply put, we’re querying localhost on port 8140 and providing an environment (production is my default environment). The Puppetmaster wants pson, so we send an Accept: pson header, and then we check for the string is_alive. The output looks like this:

HTTP OK: HTTP/1.1 200 OK - 312 bytes in 0.127 second response time |time=0.127082s;;;0.000000 size=312B;;;0

This is much, much better than our port check. If we get something other than a 200 OK HTTP code, we’re in trouble.

Consul

The original point of this post was replacing the TCP consul check. In Puppet code, the replacement looks like this:

  ::consul::check { 'puppetmaster_healthcheck':
    interval   => '60',
    script     => "/usr/lib64/nagios/plugins/check_http -H ${::fqdn} -p 8140 -u /production/status/test?environment=production -S -k 'Accept: pson' -s '\"is_alive\":true'",
    notes      => 'Checks the puppetmaster\'s status API to determine if the service is healthy',
    service_id => 'puppetmaster',
  }

We’ll now get an accurate and reliable healthcheck from our consul check!


I had a problem recently. I’m deploying services, and everything is Puppetized, but I have to manually tell the rest of my infrastructure that they exist. It’s frustrating. As an “ops guy” I focus on making my infrastructure services available, resilient and distributed so that they can scale well and not fail catastrophically. I think we’ve all done this when deploying $things, and most people (in my experience) go through the following stages..

Stage 1 - DNS based discovery

Everyone has used or is using a poor man’s load balancer somewhere in their infrastructure. DNS is the most basic of service discovery tools: you enter a DNS name and it provides the address of the service! Great! You can also get really fancy and use SRV records for port discovery as well, but then you realise there are quite a few problems with doing load balancing and service discovery like this:

  • One of the servers in your infrastructure breaks or goes offline.
  • The DNS record (either an A record or SRV record) for the broken service still exists
  • Your DNS server, quite rightly, keeps resolving that address for $service because it doesn’t know any better
  • Pretty much half of your requests to $service fail (assuming you have two servers for $service)

I know this pain, and I’ve had to deal with an outage like this. In order to fix this problem, I went with stage 2..

Stage 2 - An actual load balancer

Once your service fails once (and it will..) you decide you need a better solution, so you look at an actual load balancer, like HAProxy (or in my case, an F5 Big-IP). This has extra goodness, like service availability healthchecks, and will present a single VIP for the service address. You add your service to the load balancing pool, set up a healthcheck and assign a VIP to it, and it will yank out any service provider that isn’t performing as expected (perhaps the TCP port doesn’t respond?) - voila! You’re not going to have failures for $service now.

This is really great for us infrastructure guys, and a lot of people stop there. Their service is now reliable, and all you have to do is set up a DNS record for the VIP and point all your clients to it.

Well, this wasn’t good enough for me, because every time I provisioned a new instance of $service, I had to add it to the load balancer pool. Initially we did it manually, and then we got bored and used the API. I was still annoyed though, because I had to keep track of what $service was running where and make sure every instance of it was in the pool. In a land managed by configuration management, this really wasn’t much fun at all. I want to provision a VM for $service, and I want it to identify when it’s ready and start serving traffic automatically, with no manual intervention required.

The straw that broke the camel’s back for me was spinning up a new Puppetmaster. We might do this rarely, but it should be an automated job - create a new VM, assign it the Puppetmaster role in Puppet, then use a little script on VM startup to add the puppetmaster to the load balancing pool. It worked, but I wanted more.

  • Notifications when a puppetmaster failed, so I could fix
  • Service availability announcements - when the Puppetmaster was ready, I wanted it to announce its availability to the world and start serving traffic. A script just didn’t feel…right.

This is how I got to stage 3 - service discovery. Consul, a service discovery tool written by Hashicorp, was the key.

Stage 3 - A different way

Before I get started, I must note that there are many other tools in this space. Things like SmartStack and Zookeeper can do things like this for you, but I went with Consul for a few reasons:

  • It uses operationally familiar tools for service discovery, like DNS and nagios-style checks
  • It’s written in Go (performance, concurrency, language agnostic)
  • We use hashicorp tools elsewhere, and they have always proved to be very reliable and well designed.

In order for consul to do its thing, you need to understand a few basic concepts about how it works..

  • The best way to implement it is to deploy the consul agent to all of your service-providing infrastructure and clients. The way consul works means that this seems (to me) to be the best approach.
  • The consul agents join together to form a cluster; the servers within it maintain consistency using the raft consensus protocol
  • There are some agents in your infrastructure that operate in “server mode” - these participate in consensus and elect a leader. You need to decide how many there are in advance, and I suggest an odd number so that leader elections can’t tie.
  • Consul uses DNS for service discovery. To do that, it provides a DNS resolver on port 8600.
  • The way it decides what to serve from that DNS resolver is by healthchecking services and determining their health status. If the service is healthy, you can query any consul agent on port 8600 and it will provide the list of available servers.
  • The healthcheck can be done in a variety of ways, but a particularly nice way of doing it is by having consul execute nagios-style check scripts. There’s also a pure TCP check, an HTTP check and many more

This presents three interesting problems for deployment:

  • How do I get my DNS queries to Consul?
  • How do I deploy Consul?
  • How do I put an infrastructure service in Consul?

Well, I’m going to write about the ways I did this!

Deploying Consul with Puppet

Consul has an excellent Puppet Module written by @solarkennedy of Yelp which will handle the deployment for you pretty nicely. I found that it wasn’t great at bootstrapping the cluster, but once you have your servers up and running it works pretty flawlessly!

Deploying the servers

To deploy the servers with Puppet, create a consul server role and include the puppet module:

node 'consulserver' {
  class { '::consul':
    config_hash => {
      datacenter       => "home",
      client_addr      => "0.0.0.0", # ensures the server is listening on a public interface
      node_name        => $::fqdn,
      bootstrap_expect => 3, # the number of servers that should be found before attempting to create a consul cluster
      server           => true,
      data_dir         => "/opt/consul",
      ui_dir           => "/opt/consul/ui",
      recursors        => ['8.8.8.8', '192.168.0.1'], # Your upstream DNS servers
    }
  }
}

The important params here are:

  • bootstrap_expect: how large should your server cluster be?
  • node_name: make sure it’s unique; the $::fqdn fact seems reasonable to me
  • server: true - make sure it’s a server

Once you’ve deployed this to your three consul hosts, and the service is started, you’ll see something like this in the logs of each server:

[WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.

What’s happening here is that your cluster is looking for peers but can’t find any, so let’s make a cluster. From one of the servers, run the following command:

$ consul join <Node A Address> <Node B Address> <Node C Address>
Successfully joined cluster by contacting 3 nodes.

and then, in the logs, you’ll see something like this

[INFO] consul: adding server foo (Addr: 127.0.0.2:8300) (DC: dc1)
[INFO] consul: adding server bar (Addr: 127.0.0.1:8300) (DC: dc1)
[INFO] consul: Attempting bootstrap with nodes: [127.0.0.3:8300 127.0.0.2:8300 127.0.0.1:8300]
...
[INFO] consul: cluster leadership acquired

You have now bootstrapped a consul cluster, and you’re ready to start adding agents to it from the rest of your infrastructure!

Deploying the agents

As I said earlier, you’ll probably want to deploy the agent to every single host that provides or queries a service. There are other ways to do this, such as not deploying the agent everywhere and instead having your DNS servers resolve against the consul servers, but I chose to do it this way. Your mileage may vary.

Using Puppet, you deploy the agent to every server like so:

node 'default' {
  class { '::consul':
    config_hash => {
      datacenter  => "home",
      client_addr => "0.0.0.0",
      node_name   => $::fqdn,
      data_dir    => "/opt/consul",
      retry_join  => ["server_a", "server_b", "server_c"],
    }
  }
}

The key differences from the servers are:

  • the server param is not defined (it’s false by default)
  • retry_join is set: this tells the agent to retry these servers and rejoin the cluster when it starts up.

Once you’ve deployed this, you’ll have a consul cluster running with agents attached. You can see the status of the cluster like so:

[root@hostA ~]# consul members
Node          Address            Status  Type    Build  Protocol  DC
hostD         192.168.4.26:8301  alive   client  0.6.3  2         home
hostA         192.168.4.21:8301  alive   server  0.6.3  2         home
hostB         192.168.4.29:8301  alive   server  0.6.3  2         home
hostC         192.168.4.34:8301  alive   server  0.6.3  2         home

Consul Services

Now that we have our cluster deployed, we need to make it aware of services. There’s a service already registered for the consul cluster itself, and you can see its status using a DNS query to the consul agent:

dig +short @127.0.0.1 -p 8600 consul.service.home.consul. ANY
192.168.4.34
192.168.4.21
192.168.4.29

Here, consul has returned the addresses of the consul service to let me know it’s available from these IP addresses. Consul also supports SRV records, so it can even return the port it’s listening on:

dig +short @127.0.0.1 -p 8600 consul.service.home.consul. SRV
1 1 8300 nodeA.node.home.consul.
1 1 8300 nodeB.node.home.consul.
1 1 8300 nodeC.node.home.consul.

The way it determines which nodes are available to provide a service is via the checks I mentioned earlier. These can be:

  • A script which is executed and returns a nagios compliant code, where 0 is healthy and anything else is an error
  • An HTTP check which returns an HTTP response code, where anything in the 2XX range is healthy
  • TCP check, basically checking if a port is open.

There are 2 more, TTL and Docker + Interval, but for the sake of this post I’m going to refer you to the documentation for those.

In order for us to get started with a consul service, we need to deploy a check..

Puppetmaster service

I chose to deploy a puppetmaster service check first, so I’ll use that as my example. Again, I used the puppet module to do this, so in my Puppetmaster role definition, I simply did this:

node 'puppetmaster' {
  ::consul::service { 'puppetmaster':
    port => '8140',
    tags => ['puppet'],
  }
}

This defines the service that this node provides and on which port. I now need to define the healthcheck for this service - I used a simple TCP check:

::consul::check { 'puppetmaster_tcp':
    interval   => '60',
    tcp        => 'localhost:8140',
    notes      => 'Puppetmasters listen on port 8140',
    service_id => 'puppetmaster',
}

Now, when Puppet converges, I should be able to query my service on the Puppetmaster:

dig +short @127.0.0.1 -p 8600 puppetmaster.service.home.consul. SRV
1 1 8140 puppetmaster.example.lan.node.home.consul.

Excellent - the service exists, and it must be healthy because there’s a result for it. Just to confirm, let’s use consul’s HTTP API to query the service status:

[root@puppetmaster ~]# curl -s http://127.0.0.1:8500/v1/health/service/puppetmaster | jq '.'
[
  {
    "Node": {
      "Node": "puppetmaster.example.lan",
      "Address": "192.168.4.21",
      "CreateIndex": 5,
      "ModifyIndex": 11154
    },
    "Service": {
      "ID": "puppetmaster",
      "Service": "puppetmaster",
      "Tags": [
        "puppet"
      ],
      "Address": "",
      "Port": 8140,
      "EnableTagOverride": false,
      "CreateIndex": 5535,
      "ModifyIndex": 5877
    },
    "Checks": [
      {
        "Node": "puppetmaster.example.lan",
        "CheckID": "puppetmaster_tcp",
        "Name": "puppetmaster_tcp",
        "Status": "passing",
        "Notes": "Puppetmasters listen on port 8140",
        "Output": "TCP connect localhost:8140: Success",
        "ServiceID": "puppetmaster",
        "ServiceName": "puppetmaster",
        "CreateIndex": 5601,
        "ModifyIndex": 5877
      },
      {
        "Node": "puppetmaster.example.lan",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "CreateIndex": 5,
        "ModifyIndex": 11150
      }
    ]
  }
]

A failing check

Now, this is all great - at this point we have a healthy service with a passing healthcheck. But what happens when something breaks? Let’s say the Puppetmaster service is stopped - what exactly happens?

Well, let’s stop our Puppetmaster and see.

[root@puppetmaster ~]# service httpd stop
Redirecting to /bin/systemctl stop  httpd.service # I use passenger to serve puppetmasters, so we'll stop httpd

Now, let’s do our DNS query again

[root@puppetmaster ~]# dig +short @127.0.0.1 -p 8600 puppetmaster.service.home.consul. SRV
[root@puppetmaster ~]#

I’m not getting any DNS results from consul. This is because I’ve only deployed one Puppetmaster and I’ve just stopped it from running; in a multi-node setup, consul would return only the healthy nodes. I can confirm this from the consul API again:

[root@puppetmaster ~]# curl -s http://127.0.0.1:8500/v1/health/service/puppetmaster | jq '.'
[
  {
    "Node": {
      "Node": "puppetmaster.example.lan",
      "Address": "192.168.4.21",
      "CreateIndex": 5,
      "ModifyIndex": 97009
    },
    "Service": {
      "ID": "puppetmaster",
      "Service": "puppetmaster",
      "Tags": [
        "puppet"
      ],
      "Address": "",
      "Port": 8140,
      "EnableTagOverride": false,
      "CreateIndex": 5535,
      "ModifyIndex": 97009
    },
    "Checks": [
      {
        "Node": "puppetmaster.example.lan",
        "CheckID": "puppetmaster_tcp",
        "Name": "puppetmaster_tcp",
        "Status": "critical",
        "Notes": "Puppetmasters listen on port 8140",
        "Output": "dial tcp [::1]:8140: getsockopt: connection refused",
        "ServiceID": "puppetmaster",
        "ServiceName": "puppetmaster",
        "CreateIndex": 5601,
        "ModifyIndex": 97009
      },
      {
        "Node": "puppetmaster.example.lan",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "CreateIndex": 5,
        "ModifyIndex": 11150
      }
    ]
  }
]

Note here how the check is returning critical, so consul has removed the node from the DNS results! Easy!

Now if I start it back up, it will of course start serving traffic again and become available in the DNS query:

[root@puppetmaster ~]# service httpd start
Redirecting to /bin/systemctl start  httpd.service
[root@puppetmaster ~]# dig +short @127.0.0.1 -p 8600 puppetmaster.service.home.consul. SRV
1 1 8140 puppetmaster.example.lan.node.home.consul.

DNS resolution

The final piece of this puzzle is making sure regular DNS traffic can perform these queries. Because consul serves DNS on a non-standard port, we need to figure out how standard DNS queries from applications that expect DNS to always be on port 53 can get in on the action. There are a couple of ways of doing this:

  • Have your DNS servers forward queries for the .consul domain to their local agent
  • Install a stub resolver or caching resolver on each host which does support port config, like dnsmasq.

In my homelab, I went for option 2, but I would imagine in lots of production environments this wouldn’t really be an option, so forwarding with bind would be a better idea. Your mileage may vary.
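
For reference, the bind approach is just a forward zone in named.conf - an untested sketch (you may also need to relax DNSSEC validation for the consul zone):

zone "consul" IN {
    type forward;
    forward only;
    forwarders { 127.0.0.1 port 8600; };
};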

Configuring DNSmasq

Assuming dnsmasq is installed, you just need a config option in /etc/dnsmasq.d/10-consul like so:

server=/consul/127.0.0.1#8600

Now, set your resolv.conf to look at localhost first:

nameserver 127.0.0.1

And now you can make DNS queries without the port for consul services!
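
For example, the service query from earlier now works without specifying a port:

dig +short puppetmaster.service.home.consul. SRV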

Puppetmaster Service Deployment

For the final step, you need to do one last thing for your Puppetmasters. Because the puppetmaster is now being served at the address puppetmaster.service.home.consul, you’ll need to tweak your puppet config slightly to get things working. First, update the allowed cert names by adding the following to your master’s /etc/puppet/puppet.conf:

dns_alt_names=puppetmaster.service.home.consul

Then, clean out the master’s client key (not the CA!) and regenerate a new cert:

rm -rf /var/lib/puppet/ssl/private_keys/puppetmaster.example.lan.pem
puppet cert clean puppetmaster.example.lan
service httpd stop
puppet master --no-daemonize

At this point, we should be able to run puppet against this new DNS name:

[root@node ~]# puppet agent -t --server=puppetmaster.service.home.consul
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
....

Now, we just need to change the master setting in our puppet.conf, which you can do with Puppet itself of course!
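
The setting itself is just one line in the [agent] section of puppet.conf (or [main], depending on your layout):

[agent]
    server = puppetmaster.service.home.consul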

Congratulations, you’ve deployed a service with infrastructure service discovery!

Wrapping up

What we’ve deployed here only used a single node, but the real benefits should be obvious.

  • When we deploy ANY new puppetmaster now, it will use our role definition, and automatically add the service and the healthcheck
  • Using whatever deployment automation we use, we should be able to deploy a new service immediately and it will automatically start serving traffic for our infrastructure - no additional config necessary

This article only covers a few of the possibilities with consul. I didn’t cover the key/value store or adding services dynamically using the API. Consul also has first class support for distributed datacenters, which wasn’t covered here, meaning you can even distribute your services across DCs and over the WAN.