If you’ve used Terraform before, migrating to Pulumi is often an exhilarating experience. Since I started working at Pulumi back in March, I’ve heard countless stories from users about how adopting Pulumi has changed the way their organizations work and allowed them to be more expressive and productive with their cloud infrastructure.
When switching between similar software products, it’s often an instinctive reaction to try and reach for familiar concepts from the thing you know. One of the most common types of questions I’ve answered in the Pulumi community slack is “In Terraform, I can do X. How do I do that in Pulumi?”
So, in this post, I'd like to detail a few concepts I've learned and map them back to Terraform concepts. If you're picking up Pulumi for the first time, this is a great place to start - let's take a look!
The very first thing you'll come across when you fire up Pulumi is that state management gets handled differently. In Terraform, you set up your state inside a backend block within your code - for example, with the common S3 backend, something like this:
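```hcl
terraform {
  backend "s3" {
    bucket = "my-terraform-state" # bucket and key are illustrative
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}
```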
Every time this is changed, you need to run terraform init before running terraform plan or terraform apply.
Depending on how you organize your terraform code, you need to provide this configuration for each repo you’re managing. You might choose to use different state buckets or use different keys within that state.
Pulumi handles this very differently. You manage the state using the pulumi login command.
By default, pulumi will log you into its SaaS managed backend (which is free for individual use). To use an object store/bucket, you login by providing the bucket name prefixed with the type of bucket, like so:
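For an S3 bucket, for example (the bucket name is illustrative):

```
pulumi login s3://my-pulumi-state-bucket
```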
You can read more information on how to use this here.
Managing state this way has quite a few implications for how you might organize your code (which I'll get to shortly), but mainly means you no longer need to worry about specifying keys when setting things up. When you initialize a project and a stack (don't worry, I'll discuss stacks shortly!), pulumi will automatically create a path in the bucket for the project, and each stack will have a unique path.
My personal opinion is that this dramatically reduces the complexity when managing the state. Still, for those familiar with Terraform’s way of handling this, it might be a departure from the norm, so it’s worth knowing.
You may be asking yourself at this point "should I use the same state for all my environments?". The answer to that depends on your security posture and your chosen backend. As an example, you probably don't want the same state bucket for your production and development workflows, because you might want to give less access to production than development.
With the SaaS backend, you can define permissions easily in your organization using the console. If you're using the cloud storage backends, you might want to consider using different state for each environment. To do that, you need to make sure you log in to the correct backend before running pulumi up:
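```
pulumi login s3://my-company-state-production
pulumi up
```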
A quick note on sensitive data
Terraform is very explicit about how important the state file is, and about the security considerations around values like passwords ending up in it. From the Terraform docs:

> For resources such as databases, this may contain initial passwords
With Terraform, you need to be very careful with values like passwords and providing access to state files. Pulumi doesn't have this problem, as it supports encrypting sensitive values in the state with keys from your cloud provider or using a password/unique key. You can read more about this here and here.
Stacks & Projects
Stacks are a feature unique to Pulumi that might seem familiar if you've ever used Terraform Workspaces.
Stacks are incredibly flexible and powerful, and create lots of excellent scenarios around making Pulumi programs configurable and reusable. Using them is very easy: you create a stack when you initialize a new pulumi project:
```
pulumi new typescript
This command will walk you through creating a new Pulumi project.

Enter a value or leave blank to accept the (default), and press <ENTER>.
Press ^C at any time to quit.

project name: (test-project) my-first-project
Sorry, 'my-first-project' is not a valid project name. A project with this name already exists.
project name: (test-project) test-project
project description: (A minimal TypeScript Pulumi program) A project for Pulumi
Created project 'test-project'

Please enter your desired stack name.
To create a stack in an organization, use the format <org-name>/<stack-name> (e.g. `acmecorp/dev`).
stack name: (dev) dev
```
Once you've done this, you'll find a Pulumi.yaml file in your project directory describing the project, and stack configuration lives alongside it in Pulumi.<stack-name>.yaml files.
You can create more stacks very easily by using the pulumi stack init command:
```
pulumi stack init production
Created stack 'production'
```
This stack init process adds a new per-stack YAML file which can be updated. You can switch between stacks, or remove them if necessary.
Stacks work regardless of your chosen backend, but depending on which backend you're using, you might want to consider how you name things. With the Pulumi SaaS, you can specify your stack name using a slash, for example, my-company/test-stack. The part before the slash is an organizational namespace; within the SaaS, the stack will be filed under that organization and its permissions applied (note: organization support is a paid feature!).
If you're using a cloud backend, you'll need to take an additional step: when naming your stack, you'll want to encode the project name and the environment or region somehow. A commonly developed pattern is to use periods or dashes in the stack name. For example, if we have a project that manages our VPCs, we might do this for two different stacks:
```
pulumi stack init vpc.production
pulumi stack init vpc.development
```
A quick note about locking
Terraform locking is supported differently depending on which backend you're using. With Pulumi, locking is not currently supported for the cloud object store backends. You can approximate locking by wrapping the Pulumi CLI in a script, and some users are doing just this. If you're using the Pulumi SaaS backend, it handles locking for you.
Modules & Component Resources
Creating reusable code in Terraform often involves creating a module. Modules can be nested (for example, it’s often the case that a public module will have more modules within it) and they take inputs and define outputs (more on that later).
Modules are an implementation detail of Terraform that allows you to define groups of resources that live together. For example, if you want to create an EKS cluster in AWS, you'll need to create a bunch of worker nodes and the control plane, which are distinct resources. Modules allow you to define these together, and make them configurable via inputs. Invoking one looks something like this (sketched against the community EKS module's inputs of the era):
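```hcl
module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "my-cluster"
  vpc_id       = module.vpc.vpc_id
  subnets      = module.vpc.private_subnets

  # inputs like these are the "knobs" the module exposes
  worker_groups = [
    {
      instance_type = "m4.large"
      asg_max_size  = 5
    },
  ]
}
```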
Pulumi, however, doesn’t have a module system of its own. Pulumi’s use of standard programming languages (rather than HCL) mean you can leverage the package manager for your language of choice (e.g. NPM, NuGet, Pip or Go Modules) to share code.
Pulumi uses Component Resources to group resources together, and allows you to define and register resources in Pulumi with their unique name. This method of grouping resources is an incredibly powerful tool. Depending on your chosen programming language, the way you specify inputs varies, and outputs are handled slightly differently (more on that in a second). A minimal component might look like this (the type token and names here are illustrative):
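```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

export class StaticSite extends pulumi.ComponentResource {
    public readonly bucketId: pulumi.Output<string>;

    constructor(name: string, opts?: pulumi.ComponentResourceOptions) {
        // The type token is arbitrary, but should be unique to your package
        super("myorg:index:StaticSite", name, {}, opts);

        // Children are parented to the component, so they group together
        // in previews and in the resource graph
        const bucket = new aws.s3.Bucket(`${name}-bucket`, {}, { parent: this });

        this.bucketId = bucket.id;
        this.registerOutputs({ bucketId: this.bucketId });
    }
}
```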
There are some great ComponentResource examples available, but my favourite is this one written by James Nugent that defines a VPC that adheres to AWS best practices. It’s available for NodeJS and Python and is a great example of the powerful ways you can reuse code with Pulumi.
Outputs & Stack References
With Terraform, if you need to pass data between different projects or modules, you’d define an output:
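For example, exposing a VPC id (resource names are illustrative):

```hcl
output "vpc_id" {
  value = aws_vpc.main.id
}
```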
The “output” then gets stored in the terraform state in a way that makes it accessible either when a module reads the state or when the module is instantiated within your terraform code.
With Pulumi, you just need to export the resource or parameter, which varies depending on the programming language. As an example, you might create an S3 bucket and export its id in typescript like so:
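```typescript
import * as aws from "@pulumi/aws";

const bucket = new aws.s3.Bucket("my-bucket");

// Anything exported from the program becomes a stack output
export const bucketId = bucket.id;
```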
Here, we export the bucket id from the created bucket, which makes it available across stacks. You can then use a Stack Reference to use it elsewhere.
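A consuming stack can then read that output like this (the stack path is hypothetical):

```typescript
import * as pulumi from "@pulumi/pulumi";

const storage = new pulumi.StackReference("my-org/storage/dev");

// getOutput returns an Output; apply() or interpolation unwraps it
const bucketId = storage.getOutput("bucketId");
```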
In addition to this, because it's common to have different states for different components in Terraform, you might also need to use the remote state data source to reference outputs from other terraform states:
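```hcl
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state" # illustrative names
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# referenced as data.terraform_remote_state.vpc.outputs.vpc_id
```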
In Pulumi, this isn’t supported because it’s very rarely needed. I’m hoping to write a more detailed post on this soon.
Organizing your Code
Terraform's method of managing backends and workspaces, and the implementation of modules, can often mean that very quickly, your terraform code might begin to get out of control. Some interesting solutions have materialized for this, like Terragrunt and Astro, and if you're familiar with them, you might wonder how to approach this with Pulumi.
Pulumi has a page dedicated to this very question, but I’d like to add a bit of personal opinion to this.
Blast radius & rate of change
Because of how powerful pulumi is, you might be tempted to create a monolithic repository with all your logic in a single stack/project. As an example of this, with Pulumi it’s very easy to write a program that creates a VPC, Subnets, a Kubernetes cluster and installs several applications on that cluster. You can create a very useful piece of automation here, which allows users of your program to quickly deploy their entire stack. Generally, this isn’t considered a good idea. The VPC in your infrastructure is unlikely to be changing at the same rate as the applications in your EKS cluster, and you don’t want to be in a situation whereby you can accidentally nuke all your infrastructure with a bad command.
Generally, before I start writing some code, I start by considering the rate of change of a project, and what the impact of making a mistake in it would be.
A quick example
If you look at a simple hierarchy of some infrastructure I recently provisioned with Pulumi, it looks like this:
```
tree -L 1
├── README.md
├── alb
├── ecs-cluster
├── grafana
└── vpc
```
To break this down a little:
- the ALB project is shared between multiple applications. It exports the listeners and the name of the load balancer as an output, which can then be used as a stack reference later
- the ECS cluster project defines a component resource which bootstraps an ECS cluster with an autoscaling group, a launch template, autoscaling policies, cloudwatch log groups etc. This is completely reusable, and could easily be packaged as an NPM package (It's on my todo list, honest!)
- the grafana project defines an ECS task definition, an ECS service, an IAM role for it to run as, and a database for grafana to connect to. You can see here that the important project decision being made is grouping things together (similarly to Terraform modules, but not quite the same!) so that you can destroy and iterate as needed. In order to use the ALB and ECS cluster we created in the other projects, Stack References are used (see the sketch after this list)
- the VPC project defines a VPC, subnets and other lower-level components

I could (and hopefully will!) write a whole blog post on this, but essentially what I'm trying to get across is that you shouldn't just bundle all your code into a single project unless you're really happy about the implications. Use Stack References liberally where you can, and separate things into projects that make sense.
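To give a flavour of those Stack References in the grafana project, a rough sketch (stack paths and output names are hypothetical):

```typescript
import * as pulumi from "@pulumi/pulumi";

const alb = new pulumi.StackReference("my-org/alb/production");
const ecs = new pulumi.StackReference("my-org/ecs-cluster/production");

// outputs exported by the other projects
const listenerArn = alb.getOutput("listenerArn");
const clusterName = ecs.getOutput("clusterName");
```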
If you follow this approach, whether you use a mono-repo or a git repository for each project is entirely up to you. This page talks more about the trade-offs, but the choice is yours.
There are other aspects of picking up Pulumi which might catch you out, and I may write a second post along the way, but hopefully, this will give you a nice idea of how to continue down your Pulumi journey. If you’re interested in Pulumi and want to give it a try, reach out to me via twitter! Regardless of the technology you choose, enjoy building your infrastructure!
Configuration complexity chases you
This year marks my 10th anniversary as a (full time) system administrator. When I look back over that journey, remembering my first role for a large bank as a Lotus Notes administrator to now, I can safely say that one part of the job has been frustrating for me and has been ever-present in every role I've taken on. It's something that has followed me throughout every interpretation of "system administration" (and the job role has had many names, which is a thread I don't want to pull at). It's something I've seen declared "solved" multiple times by different tools and products, but it always manages to evolve as the industry changes.
When I started my career, the problem of the day was the configuration of operating systems. Workloads were beginning to scale beyond the scope of single machines, and we needed a set of solutions to ensure all those machines looked the way we wanted them to. Tooling like Puppet, Chef, and then Ansible became incredibly popular very quickly because they were declarative. You defined your desired state in code (or something like code, which I'll get to in a moment), and the tool took care of converging on that state. This pattern worked, and we all got a lot better at managing massive numbers of machines.

At this point, someone at Amazon realised that companies were spending thousands of person-hours wasting their time doing stupid things like managing servers and buying hard drives. AWS changed the way we managed our systems, and the tooling we had adapted to suit those systems. When AWS was starting to gain momentum, it was still a widespread practice to boot your EC2 instance and configure the operating system on it.

Unfortunately, this introduced another layer of complexity. Your cloud provider's API layer now needs configuration, and we had all gotten used to the idea we wanted to declare our state and have something converge on it. The existing tools in this space weren't cutting it, and then all of a sudden, Terraform emerged out of Hashicorp to solve most of our problems.
DSLs: A necessary evil
The most successful tools of this era had something in common, even if they differed in the way they solved the problem. I attribute the success of the two tools I've used the most until this point (Puppet and Terraform) to the fact that they both have very readable and powerful DSLs. The decision to use DSLs made them extremely approachable, even to people with rudimentary software engineering backgrounds. Generally, you can take a simple block of Puppet code and very quickly get an idea of what it's going to do:
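Something like this, for example:

```puppet
package { 'nginx':
  ensure => installed,
}

service { 'nginx':
  ensure  => running,
  enable  => true,
  require => Package['nginx'],
}
```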
HCL has a similar approach - its simplicity allows you to look at (basic) HCL and get a decent idea of what's going to happen when you execute it. Here's a comparably simple block in HCL (not a literal translation of the Puppet example):
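```hcl
resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # placeholder AMI id
  instance_type = "t3.micro"

  tags = {
    Name = "web"
  }
}
```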
Simplicity is fantastic when you get started. However, my experience with DSLs over the past 10 years is that you will, unfortunately, reach a tipping point at which you're going to look at what you've created in horror.
Ultimately, there's a universal truth of configuration complexity. No matter how you approach it, you're dealing with two distinct users:

- People who want to twiddle all the different knobs, so they want all the configuration options available to them.
- People who only want the defaults, and might make a few changes later.
Catering to both those users with a DSL is hard. Both of the tools I'm most familiar with, Puppet and Terraform, tried to approach this using a concept of "modules." At their core, the idea is reasonable: abstract away the configuration complexity into a set of sane defaults, and expose the knobs for people to twiddle as parameters to the module. Unfortunately, this, in my admittedly humble opinion, hasn't solved the problem.
Take a look at the main.tf of the terraform-eks module, for example: you'll probably notice it's over 450 lines of HCL. I find it very difficult to understand what this file is doing at first glance. Launch templates are incredibly complex mechanisms in AWS, with lots of different options depending on your needs. Supporting all of these different cases, for both of the users mentioned above, in a terraform module creates new configuration complexity. The terraform-eks module has so many possible inputs I genuinely couldn't be bothered to go through and count them all. In addition to this, if I want to make a configuration change for a parameter the module doesn't expose, my options are:

- Fork the module
- Don't use the module, and write everything from scratch again
Okay, we get it, what’s the answer?
Recently I changed companies, and this problem was consistently in my mind when deciding on what to do next. I've written before about the need for configuration management for Kubernetes clusters, however as time has gone on, I've realised what we need is configuration management for any abstraction layer. I even helped in trying to solve this problem at my former employer with Kr8. Ultimately, I believe that the only way to solve this configuration complexity is with a language that is expressive and flexible. I've concluded that DSLs will only ever get you part of the way there. The only solution currently on the market is something I excitedly wrote about in September 2018 - Pulumi. Pulumi allows you to take control of your configuration in your choice of programming language. With the decision to use a fully-featured language instead of a DSL, a whole world of opportunity opens up.
Pulumi provides you direct access to the configuration options you might be familiar with in Terraform (in fact, you can convert terraform providers to Pulumi providers in a relatively straightforward manner). However, by providing access to these resources using a programming language, you can be extremely creative in how they get used.
Pulumi x libraries
You can see what this flexibility looks like in practice by taking a look at the awsx library, which is maintained by the Pulumi team.
This library uses the standard aws library under the hood, but wraps it up in sane configuration defaults using standard packaging methods. I previously wrote a very ranty and frustrated post about how hard it was to stand up a service on fargate using terraform. Here’s what it looks like in awsx (using typescript):
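Roughly, with the classic awsx API as it looked at the time (names and image are illustrative):

```typescript
import * as awsx from "@pulumi/awsx";

// A listener with an ALB, target group and security groups
// created behind the scenes with sane defaults
const web = new awsx.lb.ApplicationListener("web", { port: 80 });

// A Fargate service running our container, wired to the listener
const service = new awsx.ecs.FargateService("web", {
    taskDefinitionArgs: {
        containers: {
            web: {
                image: "nginx:latest",
                memory: 512,
                portMappings: [web],
            },
        },
    },
    desiredCount: 2,
});
```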
Seeing code like this makes sense to me. If I want simple, off the shelf defaults, I can write a module/library, but if I want to get into the nuts and bolts of the configuration, I can use the @pulumi/aws library and talk to the API directly.
I joined Pulumi at the end of March, and I’m incredibly excited about being on the frontlines of battling configuration complexity. Going forward, I expect this blog to contain updates (sporadically, of course) about my journey. Already in my short time at Pulumi, I’ve dived into new programming languages (I wrote my first ever dotnet code this week!), heard from users, and been more involved than ever before in an open-source community. Most importantly, I can see a time where I don’t have to write a single line of YAML!
At $work, we have several Kubernetes clusters across different geographical and AWS regions. The reasons range from customer requirements, to our own desire to reduce operational "blast radius" issues that might come up. Our team has experienced large outages before, and we try and build the smallest unit of deployment we possibly can for our platform.
Unfortunately, this brings with it new challenges, especially when it comes to running Kubernetes clusters. I've spoken extensively about this on this blog before, particularly regarding configuration management needs and the overhead that scaling out to multiple clusters brings.
As these clusters have become more utilized by application teams, a new consideration has arisen. Deploying regional applications has the same configuration complexity problems I've spoken about with infrastructure management, and essentially the needs boil down to the same words we're familiar with: configuration management.
I set out to try and make the task of deploying applications to regional clusters as easy as possible for our teams, following the same philosophy (and frustrations) I'd written about before. The requirements looked a bit like this:
- no templating languages (no helm!)
- continuous deployment made easy
- easy for developers to grasp - low barrier to entry
- abstract as much of configuration complexity away as possible
What I came up with works very nicely, and uses largely off the shelf tooling that you can replicate very easily.
This post is the first in what I hope will be a 2 part series of posts which covers the following topics:
- Generating regional/parameterised manifests using jkcfg
- Using Gitlab, Gitlab-CI, ArgoCD and GitOps to deploy to multiple clusters
Part 1: Generate your config
It’s the first step, but it’s also the hardest. How do you generate your YAML configuration for the different clusters?
Ksonnet uses jsonnet, which for us infrastructure people didn't seem so bad (we used it in kr8), but there was very little desire for developers to actually learn this new and strange language for their application development needs. Luckily for me, it was deprecated just as I was trying to convince my developers otherwise, which meant I kept searching for other solutions.
Helm is the de facto standard for this kind of thing but again, there were some confused questions when it came to the templating of YAML. I could sympathise with this, and at the time, Helm had some serious security problems with Tiller. Helm 3 has largely addressed this, but I still can't bring myself to template yaml.
Let’s generate some manifests
The jkcfg repo has some excellent examples you can use, and getting started was generally pretty straightforward.
Download the jk binary
Okay, so we’re ready to generate some manifests. A simple deployment might look like this:
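A minimal sketch (app name and image are placeholders; jk generate consumes the default export):

```javascript
// index.js - a simple Deployment, written out as YAML
const app = 'myapp';

const deployment = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: app, labels: { app } },
  spec: {
    replicas: 2,
    selector: { matchLabels: { app } },
    template: {
      metadata: { labels: { app } },
      spec: {
        containers: [
          { name: app, image: 'myapp:latest', ports: [{ containerPort: 8080 }] },
        ],
      },
    },
  },
};

// a list of { value, file } pairs for jk generate to write out
export default [{ value: deployment, file: 'manifests/myapp.yaml' }];
```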
Okay, we have a nice deployment manifest now, but how does this help me with different regions?
jkcfg supports "parameters", which can be passed either via a command-line argument or a file. This is similar to Helm's values files, which are evaluated at compile time. Using values in jkcfg is very straightforward:
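```javascript
import * as param from '@jkcfg/std/param';

// 'replicas' can come from -p on the command line or a params file;
// the second argument is the default
const replicas = param.Number('replicas', 2);
```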
You can then use this value inside your deployment manifest. Here’s the end result:
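Putting the two together, something like:

```javascript
// index.js
import * as param from '@jkcfg/std/param';

const app = 'myapp';
const replicas = param.Number('replicas', 2);

const deployment = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: app, labels: { app } },
  spec: {
    replicas,
    selector: { matchLabels: { app } },
    template: {
      metadata: { labels: { app } },
      spec: {
        containers: [{ name: app, image: 'myapp:latest' }],
      },
    },
  },
};

export default [{ value: deployment, file: 'manifests/myapp.yaml' }];
```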
Once you’ve started using parameters, you probably don’t always want to use the same number of replicas. You can invoke the parameters in two ways. The first, and easiest, is on the command line:
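```
jk generate -p replicas=5 index.js
```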
Check the manifest now in manifests/myapp.yaml: you'll see we've set the replicas to 5!
The other way of overriding the parameters is using a parameters file. This can be YAML or JSON. Create a file called params/myapp.yaml and populate it like so:
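```yaml
replicas: 5
```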
Then use it with jk like so:
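If memory serves, -f is the flag for a parameters file:

```
jk generate -f params/myapp.yaml index.js
```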
Abstract, abstract, abstract
As I went through this journey, it became apparent there was a lot of code reuse across different services. Most of the services we build are using the same frameworks, and need a lot of similar configuration.
For example, every service we deploy to Kubernetes needs 4 basic things:
- A deployment spec
  - With KIAM annotations
  - With security contexts
  - etc etc
- A service spec
- An ingress
- A configmap
The repo contents boil down to a handful of files, something like this:
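```
├── index.js
├── kube.js
├── labels.js
└── package.json
```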
I'll break down the js files so we can get an idea of what this entails.
The main meat of the package is in kube.js. Let's take a look at this:
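A sketch showing just the deployment (the real file also defined the service and ingress; field names like iamRole are illustrative):

```javascript
// kube.js - everything is driven by a 'service' object from the params
import * as param from '@jkcfg/std/param';
import { labels } from './labels.js';

const service = param.Object('service', {});
const version = param.String('version', 'latest');

const deployment = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: service.name, labels: labels(service) },
  spec: {
    replicas: service.replicas || 2,
    selector: { matchLabels: labels(service) },
    template: {
      metadata: {
        labels: labels(service),
        // KIAM annotation so the pods can assume an IAM role
        annotations: { 'iam.amazonaws.com/role': service.iamRole },
      },
      spec: {
        containers: [
          {
            name: service.name,
            image: `${service.image}:${version}`,
            ports: [{ containerPort: service.port }],
            // app config comes from a ConfigMap that this module
            // deliberately does not manage
            envFrom: [{ configMapRef: { name: service.name } }],
          },
        ],
      },
    },
  },
};

export { deployment };
```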
Obviously this is a lot more involved than our previous, very simple deployment from earlier. What's worth noting though is that this is completely configurable by a service object in our jkcfg parameters. I've also gone for an approach here where we load the service's configuration variables from a configmap, which is not managed by this module. That way, we can use a basic boilerplate module for most of the stuff we want to deploy, and we can have a pretty high degree of confidence that the deployments meet our standards.
Next up is labels.js. This labels file is simply a way of ensuring we have the correct labels defined for all our resources. Here's what it looks like:
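A sketch (the label keys are illustrative):

```javascript
// labels.js - one set of labels for every resource we generate.
// Exported as a function so it can be evaluated against the params.
const labels = (service) => ({
  app: service.name,
  team: service.team,
});

export { labels };
```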
Notice, we’re exporting this as a function. The “why” will become apparent later…
Finally, there's index.js, where we export all this to be used:
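```javascript
// index.js - the package's public surface
export { deployment } from './kube.js';
export { labels } from './labels.js';
```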
We now have a very basic NPM package. We can push this to git or to the NPM registry and let people use it. So how do we actually use it?
Using the package
Using this is pretty straightforward. First, add it as a dependency:
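The package name here is hypothetical:

```
npm install --save jkcfg-kube
```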
Then import it to be used in your jkcfg code:
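```javascript
import { deployment } from 'jkcfg-kube';
```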
Now we've imported it, the fun stuff starts. For your average person, you can generate the deployment, service and ingress with two files. First, pad out your index.js like so:
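A sketch, continuing with the hypothetical package from above:

```javascript
// complex/index.js
import { deployment } from 'jkcfg-kube';

export default [{ value: deployment, file: 'manifests/deployment.yaml' }];
```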
Before we get too excited, we need to populate our params file:
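The field names inside the service object are whatever the package expects; with the sketch above, something like:

```yaml
service:
  name: myapp
  team: platform
  image: myorg/myapp
  port: 8080
  replicas: 3
  iamRole: myapp-role
```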
As you can see, the service object does most of the work for us, and we can tailor it per region or per environment as needed.
At this stage, we’re ready to generate our manifests again. Let’s use the command from before:
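```
cd complex
jk generate -f params/myapp.yaml -p version=1.0.1 index.js
```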
Notice how we specify the version as a parameter outside the params file, simply because we expect this to be a dynamic value. I’ll talk more about this in my next post.
This should have generated manifests for you in complex/manifests, and now you're ready to do some deployments!
Add a ConfigMap
The final part of this is the configuration. We’ve so far tried to build a jkcfg package that is as agnostic as possible, and can be reused across many different services and deployments.
The reality is though, you’re still going to need to add configuration data for the service. We can utilise what we’ve got here and add a configmap to the equation which can be customised per-deployment very easily.
To your index.js, add the following:
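```javascript
import * as param from '@jkcfg/std/param';

const service = param.Object('service', {});
const config = param.Object('config', {});

// per-deployment configuration, driven entirely by the params file
const configMap = {
  apiVersion: 'v1',
  kind: 'ConfigMap',
  metadata: { name: service.name },
  data: config,
};
```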
Your end result should be this:
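Roughly:

```javascript
// complex/index.js
import * as param from '@jkcfg/std/param';
import { deployment } from 'jkcfg-kube';

const service = param.Object('service', {});
const config = param.Object('config', {});

const configMap = {
  apiVersion: 'v1',
  kind: 'ConfigMap',
  metadata: { name: service.name },
  data: config,
};

export default [
  { value: deployment, file: 'manifests/deployment.yaml' },
  { value: configMap, file: 'manifests/configmap.yaml' },
];
```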
This will now generate a ConfigMap, but it'll be empty. You need to add a config object to your params file. It'll end up looking like this:
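```yaml
service:
  name: myapp
  team: platform
  image: myorg/myapp
  port: 8080
config:
  LOG_LEVEL: info
  DATABASE_HOST: mydb.example.com # values here are illustrative
```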
And now we have a working deployment, with all the resources we might need to run a service.
You might be wondering at this point: this is a lot of work! Why not just use a Helm chart? My personal thoughts on why I prefer this way are:
- Using Helm charts means developers have to learn a new pattern, specifically Go templates. With this method, they can use a programming language which they will undoubtedly feel more familiar with
This concludes part 1 of my series. In part 2, I'll talk a little bit about using the CI pipeline to generate these configs and push them to a deploy repo (specifically with Gitlab-CI), and then talk a little bit about using Argo to do deployments.
All the code from this post can be found in the github repo if you want to take a look!
Stay tuned for the next post!
I’ve been building a Kubernetes based platform at $work now for almost a year, and I’ve become a bit of a Kubernetes apologist. It’s true, I think the technology is fantastic. I am however under no illusions about how difficult it is to operate and maintain. I read posts like this one earlier in the year and found myself nodding along to certain aspects of the opinion. If I was in a smaller company, with 10/15 engineers, I’d be horrified if someone suggested managing and maintaining a fleet of Kubernetes clusters. The operational overhead is just too high.
Despite my love for all things Kubernetes at this point, I do remain curious about the notion that "serverless" computing will kill the ops engineer. The main source of intrigue here is the desire to stay gainfully employed in the future - if we aren't going to need ops engineers in our glorious future, I'd like to see what all the fuss is about. I've done some experimentation in Lambda and Google Cloud Functions and been impressed by what I saw, but I still firmly believe that serverless solutions only solve a percentage of the problem.
I’ve had my eye on AWS Fargate for some time now and it’s something that developers at $work have been gleefully pointing at as “serverless computing” - mainly because with Fargate, you can run your Docker container without having to manage the underlying nodes. I wanted to see what that actually meant - so I set about trying to get an app running on Fargate from scratch. I defined the success criteria here as something close-ish to a “production ready” application, so I wanted to have the following:
- A running container on Fargate
- With configuration pushed down in the form of environment variables
- “Secrets” should not be in plaintext
- Behind a loadbalancer
- TLS enabled with a valid SSL certificate
I approached this whole task from an infrastructure as code mentality, and instead of following the default AWS console wizards, I used terraform to define the infrastructure. It’s very possible this overcomplicated things, but I wanted to make sure any deployment was repeatable and discoverable to anyone else wanting to follow along.
All of the above criteria is generally achievable with a Kubernetes based platform using a few external add-ons and plugins, so I'm admittedly approaching this whole task with a comparative mentality - because I'm comparing it with my common workflow. My main goal was to see how easy this was with Fargate, especially when compared with Kubernetes. I was pretty surprised with the outcome.
AWS has overhead
I had a clean AWS account and was determined to go from zero to a deployed webapp. Like any other infrastructure in AWS, I had to get the baseline infrastructure working - so I first had to define a VPC.
I wanted to follow the best practices, so I carved the VPC up into subnets across availability zones, with a public and a private subnet. It occurred to me at this point that as long as this need was always there, I’d probably be able to find a job of some description. The notion that AWS is operationally “free” is something that has irked me for quite some time now. Many people in the developer community take for granted how much work and effort there is in setting up and defining a well designed AWS account and infrastructure. This is before we even start talking about a multi-account architecture - I’m still in a single account here and I’m already having to define infrastructure and traditional network items.
It’s also worth remembering here, I’ve done this quite a few times now, so I knew exactly what to do. I could have used the default VPC in my account, and the pre-provided subnets, which I expect many people who are getting started might do. This took me about half an hour to get running, but I couldn’t help but think here that even if I want to run lambda functions, I still need some kind of connectivity and networking. Defining NAT gateways and routing in a VPC doesn’t feel very serveless at all, but it has to be done to get things moving.
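For the record, the VPC piece boils down to something like this with the community VPC module (CIDRs and AZs are illustrative):

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "fargate-test"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  # not very serverless, but required for private subnets to reach out
  enable_nat_gateway = true
}
```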
Run my damn container
Once I had the base infrastructure up and running, I now wanted to get my docker container running. I started examining the Fargate docs and browsed through the Getting Started docs, and something immediately popped out at me: to get a container running, I'd need to define a task definition, then a service, then a cluster.
Hold on a minute, there’s at least THREE steps here just to get my container up and running? This isn’t quite how this whole thing was sold to me, but let’s get started.
A task definition defines the actual container you want to run. The problem I ran into immediately here is that this thing is insanely complicated. Lots of the options here are very straightforward, like specifying the docker image and memory limits, but I also had to define a networking model and a variety of other options that I wasn’t really familiar with. Really? If I had come into this process with absolutely no AWS knowledge I’d be incredibly overwhelmed at this stage. A full list of the parameters can be found on the AWS page, and the list is long. I knew my container needed to have some environment variables, and it needed to expose a port. So I defined that first, with the help of a fantastic terraform module which really made this easier. If I didn’t have this, I’d be hand writing JSON to define my container definition.
First, I defined some environment variables:
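Something like this (values are illustrative; note the plaintext password, which comes back to haunt us later):

```hcl
locals {
  app_environment = [
    {
      name  = "APP_PORT"
      value = "8080"
    },
    {
      name  = "DB_PASSWORD"
      value = "supersecret" # plaintext for now - fixed later with SSM
    },
  ]
}
```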
Then I compiled the task definition using the module I mentioned above:
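Something along these lines, using cloudposse's container-definition module as a stand-in:

```hcl
module "container_definition" {
  source = "cloudposse/ecs-container-definition/aws"

  container_name  = "app"
  container_image = "myorg/myapp:latest" # illustrative image

  port_mappings = [
    {
      containerPort = 8080
      hostPort      = 8080
      protocol      = "tcp"
    },
  ]

  environment = local.app_environment
}
```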
I was pretty confused at this point - I need to define a lot of configuration here to get this running and I’ve barely even started, but it made a little sense - anything running a docker container needs to have some idea of the configuration values of the docker container. I’ve previously written about the problems with Kubernetes and configuration management and the same problem seemed to be rearing its ugly head again here.
Next, I defined the task definition from the module above (which thankfully abstracted the required JSON away from me - if I had to hand write JSON at this point I'd have probably given up):
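A sketch (the module's JSON output name may differ between versions):

```hcl
resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc" # the only mode Fargate supports
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.ecs_execution.arn

  # the module renders the container definition JSON for us
  container_definitions = module.container_definition.json
}
```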
I realised immediately I was missing something as I was defining the module parameters. I need an IAM role as well! Okay, let me define that:
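```hcl
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_execution" {
  name               = "app-ecs-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

# AWS-managed policy that lets ECS pull images and write logs
resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```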
That makes sense, I’d need to define an RBAC policy in Kubernetes, so I’m still not exactly losing or gaining anything here. I am starting to think at this point that this feels very familiar from a Kubernetes perspective.
At this point, I’ve written quite a few lines of code to get this running, read a lot of ECS documentation and all I’ve done is define a task definition. I still haven’t got this thing running yet. I’m really confused at this point what the value add is here over a Kubernetes based platform, but I continued onwards.
A service is partly how to expose the container to the world, and partly how you define how many replicas it has. My first thought was “Ah! This is like a Kubernetes service!” and I set about mapping the ports and such like. Here was my first run at the terraform:
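Roughly this, before the load balancer entered the picture:

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.app.id] # the SG mentioned below
  }
}
```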
I again got frustrated when I had to define the security group for this that allowed access to the ports needed, but I did so and plugged that into the network configuration. Then I got a smack in the face.
I need to define my own loadbalancer?
LoadBalancers Never Go Away
I was honestly kinda floored by this, and I'm not even sure why. I've gotten so used to Kubernetes services and ingress objects that I completely took for granted how easy it is to get my application on the web with Kubernetes. Of course, we've spent months building a platform to make this easier at $work. I'm a heavy user of external-dns and cert-manager to automate populating DNS entries on ingress objects and automating TLS certificates, and I am very aware of the work needed to get these set up, but I honestly thought it would be easier to do this on Fargate. I recognise that Fargate isn't claiming to be the be all and end-all of how to run applications - it's just abstracting away the node management - but I have been consistently told this is easier than Kubernetes. I really was surprised. Defining a LoadBalancer (even if you don't want to use Ingresses and Ingress controllers) is part and parcel of deploying a service to Kubernetes, and I had to do the same thing again here. It just all felt so familiar.
I now realised I needed:
- A loadbalancer
- A TLS certificate
- A DNS entry
So I set about making those. I made use of some popular terraform modules, and came up with this:
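Reconstructing from memory with raw resources rather than the exact modules, it looked something like this (the domain and hosted zone are illustrative; the certificate also needs validating before it can be used):

```hcl
resource "aws_lb" "app" {
  name               = "app"
  load_balancer_type = "application"
  subnets            = module.vpc.public_subnets
  security_groups    = [aws_security_group.lb.id]
}

resource "aws_acm_certificate" "app" {
  domain_name       = "app.example.com"
  validation_method = "DNS"
}

resource "aws_route53_record" "app" {
  zone_id = var.zone_id # assumes an existing hosted zone
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}
```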
I’ll be completely honest here - I screwed this up several times. I had to fish around in the AWS console to figure out what I’d done wrong. It certainly wasn’t an “easy” process - and I’ve done this before - many times. Honestly, at this point, Kubernetes looked positively enticing to me, but I realised it was because I was very familiar with it. If I was lucky enough to be using a managed Kubernetes platform (with external-dns and cert-manager preinstalled) I’d really wonder what value add I was missing from Fargate. It just really didn’t feel that easy.
After a bit of back and forth, I now had a working ECS service. The final definition, including the service, looked a bit like this:
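Sketched out, the service had grown into this:

```hcl
resource "aws_lb_target_group" "app" {
  name        = "app"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip" # Fargate tasks register by IP, not instance
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = aws_acm_certificate.app.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.app.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }
}
```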
I felt like it was close at this point, but then I remembered I’d only done 2 of the required 3 steps from the original “Getting Started” document - I still needed to define the ECS cluster.
Thanks to a very well defined module, defining the cluster to run all this on was actually very easy.
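For Fargate, the cluster itself amounts to very little; with a raw resource it's essentially just this:

```hcl
resource "aws_ecs_cluster" "main" {
  name = "fargate-test"
}
```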
What surprised me the most here is why I had to define a cluster at all. As someone reasonably familiar with ECS it makes some sense you’d need a cluster, but I tried to consider this from the point of view of someone having to go through this process as a complete newcomer - it seems surprising to me that Fargate is billed as “serverless” but you still need to define a cluster. It’s a small detail, but it really stuck in my mind.
Tell me your secrets
At this stage of the process, I was fairly happy I'd managed to get something running. There was however something missing from my original criteria. If we go all the way back to the task definition, you'll remember my app has an environment variable for the password, which was defined in plaintext. If I looked at my task definition in the AWS console, my password was there, staring at me. I wanted this to end, so I set about trying to move this into something else, similar to Kubernetes secrets.
The way Fargate/ECS does the secret management portion is to use AWS SSM (the full name for this service is AWS Systems Manager Parameter Store, but I refuse to use that name because quite frankly it's stupid).
The AWS documentation covers this fairly well, so I set about converting this to terraform.
Specifying the Secret
First, you have to define a parameter and give it a name. In terraform, it looks like this:
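```hcl
resource "aws_ssm_parameter" "db_password" {
  name  = "/app/DB_PASSWORD"
  type  = "SecureString" # encrypted with the default KMS key
  value = var.db_password
}
```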
Obviously the key component here is the “SecureString” type. This uses the default AWS KMS key to encrypt the data, something that was not immediately obvious to me. This has a huge advantage over Kubernetes secrets, which aren’t encrypted in etcd by default.
Then I specified another local value map for ECS, and passed that as a secret parameter:
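A sketch; the container-definition module takes this list via its secrets input, alongside the plain environment list:

```hcl
locals {
  app_secrets = [
    {
      name      = "DB_PASSWORD"
      valueFrom = aws_ssm_parameter.db_password.arn
    },
  ]
}

# then, in the container definition module:
#   secrets = local.app_secrets
```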
A problem arises
At this point, I redeployed my task definition, and was very confused. Why isn’t the task rolling out properly? I kept seeing in the console that the running app was still using the previous task definition (version 7) when the new task definition (version 8) was available. This took me way longer than it should have to figure out, but in the events screen on the console, I noticed an IAM error. I had missed a step, and the container couldn’t read the secret from AWS SSM, because it didn’t have the correct IAM permissions. This was the first time I got genuinely frustrated with this whole thing. The feedback here was terrible from a user experience perspective. If I hadn’t known any better, I would have figured everything was fine, because there was still a task running, and my app was still available via the correct URL - I was just getting the old config.
In a Kubernetes world, I would have clearly seen an error in the pod definition. It’s absolutely fantastic that Fargate makes sure my app doesn’t go down, but as an operator I need some actual feedback as to what’s happening. This really wasn’t good enough. I genuinely hope someone from the Fargate team reads this and tries to improve this experience.
That’s a wrap?
This was the end of the road - my app was running and I’d met all my criteria. I did realise that I had some improvements to make, which included:
- Defining a cloudwatch log group, so I could write logs correctly
- Adding a route53 hosted zone, to make the whole thing a little easier to automate from a DNS perspective
- Fixing and rescoping the IAM permissions, which were very broad at this point
But honestly at this point, I wanted to reflect on the experience. I threw out a twitter thread about my experience and then spent the rest of the time thinking about what I really felt here.
What I realised, after an evening of reflection, was that this process is largely the same whether you're using Fargate or Kubernetes. What surprised me the most was that despite the regular claims I've heard that Fargate is "easier", I really just couldn't see any benefits over a Kubernetes based platform. Now, if you're in a world where you're building Kubernetes clusters, I can absolutely see the value here - managing nodes and the control plane is just overhead you don't really need. The problem is, most consumers of a Kubernetes based platform don't have to do this. If you're lucky enough to be using GKE, you barely even need to think about the management of the cluster; you can run a cluster with a single gcloud command nowadays. I regularly use Digital Ocean's managed Kubernetes service and I can safely say that it was as easy as spinning up a Fargate cluster - in fact in some ways it was easier.
Having to define some infrastructure to run your container is table stakes at this point. Google may have just changed the game this week with their Google Cloud Run product, but they’re massively ahead of everyone else in this field.
What I think can be safely said from this whole experience though is this: Running containers at scale is still hard. It requires thought, it requires domain knowledge, it requires collaboration between Operations and Developers. It also requires a foundation to build on - any AWS based operation is going to need to have some fundamental infrastructure defined and running. I’m very intrigued by the “NoOps” concept that some companies seem to aspire for. I guess if you’re running a stateless application, and you can put it all inside a lambda function and an API gateway you’re probably in a good position, but are we really close to this in any kind of enterprise environment? I really don’t think so.
Another realisation that struck me is that often the comparisons between technology A and technology B sometimes aren’t really fair, and I see this very often with AWS. The reality of the situation is often very different from the Jeff Barr blogpost. If you’re a small enough company that you can deploy your application in AWS using the AWS console and select all of the defaults, this absolutely is easier. However, I didn’t want to use the defaults, because the defaults are almost always not production ready. Once you start to peel back the layers of cloud provider services, you begin to realise that at the end of the day - you’re still running software. It still needs to be designed well, deployed well and operated well. I believe that the value add of AWS and Kubernetes and all the other cloud providers is it makes it much, much easier to run, design and operate things well, but it is definitely not free.
Arguing for Kubernetes
My final takeaway here is this: if you view Kubernetes purely as a container orchestration tool, you're probably going to love Fargate. However, as I've become more familiar with Kubernetes, I've come to appreciate just how important it is as a technology - not just because it's a great container orchestration tool, but also because of its design patterns: it's a declarative, API driven platform. A simple thought that occurred to me during all of this Fargate process was that if I deleted any of this stuff, Fargate isn't necessarily going to recreate it for me. Autoscaling is nice, not having to manage servers and patching and OS updates is awesome, but I felt I'd lost so much by not being able to use Kubernetes' self healing and API driven model. Sure, Kubernetes has a learning curve - but from this experience, so does Fargate.
Despite my confusion during some of this process, I really did enjoy the experience. I still believe Fargate is a fantastic technology, and what the AWS team has done with ECS/Fargate really is nothing short of remarkable. My perspective however is that this is definitely not “easier” than Kubernetes, it’s just.. different.
The problems that arise when running containers in production are largely the same. If you take anything away from this post it should be this: whichever way you choose is going to have operational overhead. Don’t fall into the trap of believing that you can just pick something and your world is going to be easier. My personal opinion is this: If you have an operations team and your company is going to be deploying containers across multiple app teams - pick a technology and build processes and tooling around it to make it easier.
I'm certainly going to take the claims from people that certain technology is easier with a grain of salt from now on. At this stage, when it comes to Fargate, that sums up my feelings.
I made a statement during the talk which ignited some fairly fierce discussion both online, and at the conference:
To put this into my own words:
At some point, we decided it was okay for us to template yaml. When did this happen? How is this acceptable?
After some conversation, I figured it was probably best to back up my claims in some way. This blog post is going to try to do that.
The configuration problem
Once the applications and infrastructure you’re going to manage grows past a certain size, you inevitably end up in some form of configuration complexity hell. If you’re only deploying 1 or maybe 2 things, you can write a yaml configuration file and be done with it. However once you grow beyond that, you need to figure out how to manage this complexity. It’s incredibly likely that the reason you have multiple configuration files is because the $thing that uses that config is slightly different from its companions. Examples of this include:
- Applications deployed in different environments, like dev, stg and prod
- Applications deployed in different regions, like Europe or North America
Obviously, not all the configuration is different here, but it’s likely the configuration differs enough that you want to be able to differentiate between the two.
This configuration complexity has been well known to Operators (System Administrators, DevOps engineers, whatever you want to call them) for some years now. An entire discipline grew up around this in Configuration Management, and each tool solved this problem in its own way, but ultimately, they used YAML to get the job done.
My favourite method has always been hiera which comes bundled with Puppet. Having the ability to hierarchically look up the variables of specific config needs is incredibly powerful and flexible, and has generally meant you don’t actually need to do any templating of yaml at all, except perhaps for embedding Puppet facts into the yaml.
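For reference, a hiera 5 hierarchy looks something like this (the paths are illustrative):

```yaml
# hiera.yaml - most specific data wins
version: 5
defaults:
  datadir: data
  data_hash: yaml_data
hierarchy:
  - name: "Per-node data"
    path: "nodes/%{trusted.certname}.yaml"
  - name: "Per-environment data"
    path: "environment/%{server_facts.environment}.yaml"
  - name: "Common defaults"
    path: "common.yaml"
```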
Did we go backwards?
Then, as our industry's needs moved above the operating system and into cloud computing, we had a whole new data plane to configure. The tooling to configure this changed, and tools like CloudFormation and Helm appeared. These tools are excellent configuration tools, but I firmly believe we (as an industry) got something really, really wrong when we designed them. To examine that, let's take a look at an example of a helm chart taking a custom parameter.
Helm charts can take external parameters defined by a values.yaml file, which you specify when rendering the chart. A simple example might look like this:
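```yaml
# templates/deployment.yaml (abridged)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: {{ .Values.image }}
```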
Let’s say my external parameter is simple - it’s a string. It’d look a bit like this:
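```yaml
# values.yaml
image: nginx:1.16
```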
That's not so bad, right? You just specify a value for image in your values.yaml and you're on your way.
The real problem starts to get highlighted when you want to do more complicated and complex things. In this particular example, you’re doing okay because you know you have to specify an image for a Kubernetes deployment. However, what if you’re working with something like an optional field? Well, then it gets a little more unwieldy:
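For instance, an optional priorityClassName (a made-up example):

```yaml
spec:
  {{- if .Values.priorityClassName }}
  priorityClassName: {{ .Values.priorityClassName }}
  {{- end }}
```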
Optional values just make things ugly in templating languages, and you can’t just leave the value blank, so you have to resort to ugly loops and conditionals that are probably going to bite you later.
Let's say you need to go a step further, and you need to push an array or map into the config. With Helm, you'd do something like this:
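```yaml
metadata:
  name: {{ .Release.Name }}
  annotations:
{{ toYaml .Values.podAnnotations | indent 4 }}
```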
Firstly, let's ignore the madness of having a templating function toYaml to convert yaml to yaml, and focus more on the whitespace issue here.
YAML has strict whitespace and indentation rules. The following, for example, is not valid yaml:
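```yaml
metadata:
  annotations:
     iam.amazonaws.com/role: my-role
    prometheus.io/scrape: "true"
```

The two annotation keys aren't aligned, so the parser rejects the whole document.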
Generally, if you’re handwriting something, this isn’t necessarily a problem because you just hit backspace twice and it’s fixed. However, if you’re generating YAML using a templating system, you can’t do that - and if you’re operating above 5 or 10 configuration files, you probably want to be generating your config rather than writing it.
So, in the above example, you want to embed the values of .Values.podAnnotations under the annotations field, which is already indented. So you're having to not only indent your values, but indent them correctly.
What makes this even more confusing is that the go parser doesn’t actually know anything about YAML at all, so if you try to keep the syntax clean and indent the templates like this:
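```yaml
metadata:
  annotations:
    {{ toYaml .Values.podAnnotations | indent 4 }}
```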
You actually can’t do that, because the templating system gets confused. This is a singular example of the complexity and difficulty you end up facing when generating config data in YAML, but when you really start to do more complex work, it really starts to become obvious that this isn’t the way to go.
Needless to say, this isn’t what I want to spend my time doing. If fiddling around with whitespace requirements in a templating system doing something it’s not really designed for is what suits you, then I’m not going to stop you. I also don’t want to spend my time writing configuration in JSON without comments and accidentally missing commas all over the shop. We (as an industry) decided a long time ago that shit wasn’t going to work and that’s why YAML exists.
So what should we do instead? That’s where jsonnet comes in.
JSON, Jsonnet & YAML
Before we actually talk about Jsonnet, it's worth reminding people of a very important (but oft forgotten) point: YAML is a superset of JSON, and converting between the two is trivial. Many applications and programming languages will parse JSON and YAML natively, and many can convert between the two very simply. For example, in Python:
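A quick sketch using PyYAML:

```python
import json

import yaml  # PyYAML

# read JSON, emit the same data as YAML
with open("config.json") as f:
    data = json.load(f)

print(yaml.safe_dump(data, default_flow_style=False))
```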
So with that in mind, let’s talk about Jsonnet.
Welcome to the church of Jsonnet
Jsonnet is a relatively new, little known (outside the Kubernetes community?) language that calls itself a data templating language. It’s definitely a good exercise to read and consume the Jsonnet design rationale page to get an idea why it exists, but if I was going to define in a nutshell what its purpose is - it’s to generate JSON config.
So, how does it help, exactly?
Well, let's take our earlier example - we want to generate some JSON config specifying a parameter (i.e. the image string). We can do that very easily with Jsonnet using external variables.
Firstly, let’s define some Jsonnet:
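```jsonnet
// image.jsonnet
{
  image: std.extVar('image'),
}
```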
Then, we can generate it using the Jsonnet command line tool, passing in the external variable as we need to:
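```
$ jsonnet --ext-str image=nginx:1.16 image.jsonnet
{
   "image": "nginx:1.16"
}
```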
Before, I noted that if you wanted to define an optional field, with YAML templating you had to define if statements for everything. With Jsonnet, you’re just defining code!
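For example, an optional field driven by an external variable (the resourceGroup name is illustrative):

```jsonnet
// optional.jsonnet
local resourceGroup = std.extVar('resourceGroup');

{
  name: 'example',
  // the field is only emitted if the external variable is set
  [if resourceGroup != null then 'resourceGroup']: resourceGroup,
}
```

```
$ jsonnet --ext-code resourceGroup=null optional.jsonnet
{
   "name": "example"
}
```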
The output here, because our variable is null, means that we never actually populate resourceGroup. If you specify a value, it will appear:
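```
$ jsonnet --ext-str resourceGroup=my-group optional.jsonnet
{
   "name": "example",
   "resourceGroup": "my-group"
}
```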
Maps and parameters
Okay, now let's look at our previous annotation example. We want to define some pod annotations, which takes a YAML map as its input. You want this map to be configurable by specifying external data, and obviously doing that on the command line sucks (you'd be very unlikely to specify this with Helm on the command line, for example), so generally you'd use Jsonnet imports to do this. I'm going to specify this config as a variable and then load that variable into the annotation:
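```jsonnet
// annotations.jsonnet
local podAnnotations = {
  'iam.amazonaws.com/role': 'my-role',
  'prometheus.io/scrape': 'true',
};

{
  metadata: {
    annotations: podAnnotations,
  },
}
```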
This might just be my bias towards Jsonnet talking, but this is so dramatically easier than faffing about with indentation that I can’t even begin to describe it.
The final thing I wanted to quickly explore, which is something that I feel can’t really be done with Helm and other yaml templating tools, is the concept of manipulating existing objects in config.
Let’s take our example above with the annotations, and look at the result file:
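```json
{
   "metadata": {
      "annotations": {
         "iam.amazonaws.com/role": "my-role",
         "prometheus.io/scrape": "true"
      }
   }
}
```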
Now, let’s say for example I wanted to append a set of annotations to this annotations map. In any templating system, I’d probably have to rewrite the whole map.
Jsonnet makes this trivial. I can simply use the + operator to add something to this. Here's a (poor) example:
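```jsonnet
local base = import 'annotations.jsonnet';

// +: merges into the nested object instead of replacing it
base + {
  metadata+: {
    annotations+: {
      'linkerd.io/inject': 'enabled',
    },
  },
}
```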
The end result is this:
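```json
{
   "metadata": {
      "annotations": {
         "iam.amazonaws.com/role": "my-role",
         "linkerd.io/inject": "enabled",
         "prometheus.io/scrape": "true"
      }
   }
}
```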
Obviously, in this case, it's more code to do this, but as your examples get more complex, it becomes extremely useful to be able to manipulate objects this way.
We use all of these methods in kr8 to make creating and manipulating configuration for multiple Kubernetes clusters easy and simple. I highly recommend you check it out if any of the concepts you’ve found here have found you nodding your head.