DSLs are a waste of time

If you’ve read this blog before, or are unfortunate enough to have an actual personal relationship with me, you’ll know that I have strong opinions and can be, shall we say, passionate about them. For posts on this blog, I try to share those opinions and avoid directly addressing the people who hold the opposing view.

However there’s a schism happening in the infrastructure as code world at the moment with the announcement that HashiCorp is changing the software license that it uses for its products. As a result, and for reasons that have been covered in great depth by people far more qualified than I am, this has resulted in a set of vocal competitors to Terraform Cloud and Terraform Enterprise to “fork” (and that’s a loose interpretation of the word because because all we have at the time of writing is a markdown file and some gifs) called OpenTF.

As we watch these two competing factions live out Friedrich Glasl’s model of conflict escalation in public, I have found myself asking a question that has always lingered in the back of mind while working in my day job at Pulumi..

Why the fuck does everyone love this domain specific language so much?

Hold your horses, Terraform isn’t just a DSL!

Your first reaction will undoubtedly be “well, it’s not the DSL we love, it’s Terraform! It’s made me more productive and solves real world problems!”

I understand the impact Terraform has had on Infrastructure as Code and DevOps in general. I’m lucky enough to consider two of the top 10 contributors of all time to Terraform close personal friends and we generally hold similar perspectives on things (although, one of them has appaling taste in beer, I know you’re reading this).

I used Terraform from the very first version, back when every time you ran terraform apply you sort of closed your eyes and hoped you didn’t break everything. I watched in wonder as Terraform introduced modules, started adding hundreds of providers to support basically every cloud provider you could think of. As our industry started to migrate to “cloud first” or as the marketing behemoth likes to call it, “cloud native” I migrated by world view from one of “configuration management” to “infrastructure as code” and became a subject matter expert.

All of this is to say, I get it. I understand why people want to use Terraform, and I can sort of understand why people are so upset about the license change and want to support the forking of the product so it stays “open source”.

The problem here is that while you can say “Terraform is more than a DSL” - if you ever back the layers or try have a conversation with someone who’s a true dyed in the wool Terraform zealot and start to try and understand where they’re coming from, you begin to realise they don’t love Terraform, they love this weird half language because they’re an expert in it, and they’ve built their entire career on being an expert on something you can only use for one specific use case.

A path to nowhere

In college I spent hundreds of hours playing Guitar Hero. I would play songs over and over again until I could get through them on expert mode. My ultimate goal was to be able to at least finish the toughest song on the game, Through The Fire and Flames on the toughest possible difficulty. A friend of mine was an accomplished guitar player and as a passing comment, pointed out “imagine if you’d spent the same amount of time playing an actual guitar”.

It never really registered for me at the time what he was saying, but as the years have gone by and I’ve put thousands of hours into being an “expert” in infrastructure, I’ve started to realise I made the same mistakes in my career as I did with guitar hero.

DSLs are everywhere

I’ve used a DSL before, you see. Even before Terraform. I spent a whole 7 or 8 years writing thousands of lines of DSL.

Puppet has its own Ruby based DSL it now calls the Puppet language. When all of the problems in the infrastructure space existed at the operating system instead of the API layer, Puppet was everywhere and I loved the language because it meant I didn’t have to be a “software developer”. I didn’t want to write PHP or Python, I wanted to manage infrastructure, and Puppet’s DSL let me do that.

The problem with this mentality was that when the industry inevitably changed around me, I was left with an expertise and knowledge of a language that was now, effectively useless. I haven’t written a single line of Puppet language in a long time, and I don’t miss it at all. I don’t miss the syntax, parsing the docs and trying to figure out how these built in functions actually worked and mapping those ideas to problem sets.

Configuration complexity clock

The fact Puppet and Terraform (and other tools, of course) tend to converge on a DSL to solve infrastructure problems doesn’t appear to be a coincidence. Infrastructure (whether it be at the operating system layer or the API layer) is at its core, a sea of configuration. When dealing with configuration, the configuration complexity clock dictates that eventually, you’re going to try and configure a DSL to manage it.

If you’re familiar with the configuration complexity clock, you’ll notice that the next step in the evolution of infrastructure configuration is to “hard code” the values again, which isn’t really a possibility because, well, have you seen how much infrastructure we’re dealing with? So what’s really happening here is that the DSLs (and particularly, the HCL DSL) is becoming it’s own progamming language, adding new constructs and methods which increase the amount of complexity in the language itself.

This is all well and good, because the features solve problems people have, the real issue here is that it’s still a DSL, and you’re still learning to play guitar hero instead of actually learning a useful skill you can learn elsewhere - playing the guitar.

Complexity is in the eye of the beholder

One of the arguments I always hear when talking to people about this problem is the argument that Terraform’s DSL reduces the amount of complexity in the code because of its limited feature set. Putting aside the fact these people are arguing that the lack of features is itself a feature it doesn’t really hold up to any scrutiny.

If you do a quick search of a sufficiently complex Terraform module, you’ll see all kinds of craziness. I just picked a random module from the Terraform registry and looked at the main.tf and found this:

resource "aws_route_table_association" "redshift" {
  count = local.create_redshift_subnets && !var.enable_public_redshift ? local.len_redshift_subnets : 0

  subnet_id = element(aws_subnet.redshift[*].id, count.index)
  route_table_id = element(
    coalescelist(aws_route_table.redshift[*].id, aws_route_table.private[*].id),
    var.single_nat_gateway || var.create_redshift_subnet_route_table ? 0 : count.index,
  )
}

resource "aws_route_table_association" "redshift_public" {
  count = local.create_redshift_subnets && var.enable_public_redshift ? local.len_redshift_subnets : 0

  subnet_id = element(aws_subnet.redshift[*].id, count.index)
  route_table_id = element(
    coalescelist(aws_route_table.redshift[*].id, aws_route_table.public[*].id),
    var.single_nat_gateway || var.create_redshift_subnet_route_table ? 0 : count.index,
  )
}

Now, don’t get me wrong, I know what this is doing, and I also understand why it has to be implemented this way, but if you’re not an expert in this space, you probably have a whole bunch of reasonable questions:

Why are we using count here?
What is coalescelist?
What is element?
What is local?

And that’s just the first few lines. I’m not trying to pick on this module, I’m just trying to demonstrate that even in a language that is supposed to be “simple” and “easy to understand” you can still end up with code that is complex and difficult to understand.

Terraform’s DSL allows you to use abstractions (modules), nest resources inside those abstractions and then create varying levels of indirection when trying to instantiate resources. These things are all necessary when trying to truly create an ecosystem that allows reusability and sharing, but of course that brings with it complexity.

My personal opinion here is that when people say “it’s less complex” - what they actually mean is “I understand it”. Which is totally fine, and a reasonable position to be in, but it’s not the same thing. If you’re a “devops engineer” or “platform engineer” (or whatever the job title is called this week) and you look at the application code you’re supporting - which is all object orientated and scary and you can’t figure out where this values comes from - put yourself in the position of an application developer trying to look at YOUR Terraform code and think about their perspective. It’s just as difficult, and it was your job to learn Terraform’s DSL, the application developer has a full time job AND now you’re expecting them to learn this weird DSL too?

There’s more to Terraform than a DSL

It’s true, that Terraform is more than a DSL. There’s an entire ecosystem of modules, which seem to be just a grouping of Terraform resources (which expose every single possible API element to the user, but that’s a post for another day) and tutorials and even entire companies set up around the Terraform ecosystem. That’s all very compelling when choosing and considering how you’re going to manage your infrastructure, but what’s pretty clear after the last few weeks is that this ecosystem is becoming increasingly fragile. Once the Terraform fork happens (and again, we don’t have anything yet other than a README and some gifs) - will that ecosystem continue to grow and thrive, or will it split and become two separate ecosystems?

If you look at the history of software that has been forked, you’ll see that rarely do the two forks grow along the same path. When the maintainers of OpenTF inevitably decide that they want to add enhancements that require language changes, you as the end user end up in a situation where you’re effectively back to square 1, learning (potentially) a new DSL.

What’s the alternative?

As you can probably ascertain, I think there’s a better path forward here. I’ve made the case before that programming languages are the better authoring model for cloud infrastructure because they allow you to express the complexity of your infrastructure in the most flexible, intuitive way. My argument here is that if you want to continue to cling on to DSLs, you’re going to end up in a situation where you’re going to have to learn a new DSL every time you want to do something new.

I’m not going to make the explicit argument you should use Pulumi as your IaC tool because that would make me even more of a shill than I am now. What I will say is this: in this pivotal moment for the industry where Terraform gets forked, where the ecosystem is going to fracture and diverge, maybe you as a user can ask yourself the question - should I really go and learn another DSL? My sincere hope is that either Terraform or the OpenTF folks decide to invest in the CDK model. It’s an inferior programming language authoring model to Pulumi’s, but at least the concepts you learn in your chosen programming language are applicable elsewhere in your career. Understanding how to handle complex data structures in say, Python is something which might help you fix bug’s in application code, you can’t throw a DSL at that.