#ai #software development

Landlord: a tenancy controller and experiment in AI driven product building

Published Feb 6, 2026 by Lee Briggs


Back in 2012 I joined a growing startup that sold SaaS software to its customers. For those of you who don’t remember how the infrastructure world looked back then: AWS, Google Cloud and Azure were all relatively niche offerings, Docker hadn’t been invented yet, and most of the tools you’d use to manage infrastructure were written in Ruby.

I had never built any SaaS software, but I was still surprised at how things worked behind the scenes when I joined. Provisioning a new customer involved my “TechOps” team running some shell scripts, data was stored in an NFS mount shared between 2 or 3 racks of whitebox servers, and if one customer’s usage got out of control, we would have to rsync the data to another server to avoid any noisy neighbour problems.

Obviously none of this was tenable in the long term, but it worked for the time we were in. There was an acknowledgement amongst our team of 5 or 6 system administrators (that’s DevOps or Platform engineers to you young ‘uns) that we couldn’t scale the business this way, and one of my more senior colleagues rolled up his sleeves and started automating some of the tasks we had in front of us.

What emerged over the course of several weeks was a project we dubbed “selfserve”. Its functionality started simple: anything we could automate with a shell script was now executed by a “runner” which was dispatched from a centralized control plane. It would SSH into a box, execute the script and return the results to the control plane, which would then be rendered in PHP to whoever ran it. Our ability to scale our processes accelerated massively. In parallel, I worked with another colleague to automate the art of bringing servers online (with Puppet and Cobbler, two technologies that seem to be on the path to extinction) and another colleague worked on streamlining our ability to get datacenters off the ground. Within the space of 2 years, we grew from 2 racks in RDU to 7 datacenters around the world with thousands of customers and millions of dollars in revenue.

Another issue we had to solve was tenancy - how should we segment our customers’ data so that we could meet compliance goals and scale effectively? Remember, this was well before Docker was considered the de facto standard, and it wasn’t a problem any of us had expertise in solving.

Tenancy as a model is relatively common nowadays, and there are many ways to solve it. At the time, the right way seemed to be to have distinct, segregated compute per tenant - and it scaled remarkably well. Back in 2012 to 2015 we felt it was the right approach, but as the company grew, we all had some regrets that we hadn’t been more ambitious about solving the problem.

My career has changed since then; I now work in a more consultative role as a solutions engineer, helping companies solve technology problems with Tailscale. What has consistently surprised me over this time is how many companies and orgs are solving the tenancy problem in the same way! When I think about it now, wearing the scars of being paged at 2am for most of my career, the “compute-per-tenant” model actually makes a remarkable amount of sense. Sure, it’s inefficient, but it’s safe, easy, and it reduces the blast radius of outages. As with all technology decisions, the tradeoffs have to be considered, but I speak to a remarkable number of customers who have solved the problem the same way.

Why am I telling you this story in a post with AI in the title? Well, I’ve had in the back of my mind that this problem space could be commoditized because of how common and ubiquitous it is. The problem was always how hard it seemed to solve on my own as a spare-time project. If you haven’t solved this problem before, you’d be forgiven for thinking it’s quite easy now: just have a provisioning system hit the Kubernetes API! It’ll handle all this stuff for you! I don’t blame anyone for thinking that’s the right way to go about it, but it misses the fact that managing the automation itself then becomes the battle. If you’re a SaaS service, you probably want an automated onboarding flow that provisions the tenant for you, but now you need to build a durable process that involves lots of distinct steps and assumes every one of them is reliable. What happens if you run out of compute in your AWS account? What if your Kubernetes API responds slowly? What if the Docker image pulls too slowly and the job times out? What if the Docker image in the request is wrong, and now you’re stuck in ImagePullBackOff and have to nurse it through automatically?

You’re essentially in a reliability maths problem: what seemed like a simple task becomes a complex web of inter-dependencies that all have to hit 99.99999% uptime for the whole flow to succeed.
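
To put rough numbers on it: if a provisioning flow has six sequential steps and each one succeeds 99.9% of the time, the whole flow only succeeds about 0.999^6 ≈ 99.4% of the time - roughly one failed provision in every 170 - and that’s before anything genuinely flaky like image pulls enters the picture. Retries and durable execution are the only realistic way out, because you will never get every dependency to behave.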

Landlord

As I mentioned earlier, I’ve seen this problem often enough that I wanted to build something to solve it for a modern tech stack, but it seemed so incredibly daunting. If I was lucky enough to not have to work and could spend my time writing code all day, I figured I could probably get it done to a reasonable standard. I’m not what you’d consider the most talented software developer (after all, I started out writing shell scripts and PHP!) but I had enough information to be dangerous, and most importantly, I feel like I have a reasonably good sense of how to design a good system.

Then the world was hit by the meteor now called agentic coding, and I had a little crack at building it. It was an incredibly frustrating experience because the LLM couldn’t hold enough context to build such a complicated system; I felt like all I did was fight with the model.

Then a few weeks ago, I started experimenting with spec-driven development. And things changed overnight.

What has emerged is landlord, an experimental, pluggable compute manager that provisions tenants via a series of different compute providers. It leverages pluggable workflow providers to handle durable execution, and looks very similar in design to how the first versions of selfserve looked back in 2012/2013. Written in Go, it consists of a single, API-managed control plane that can handle most of the work you need to do to provision tenants in a way that’s reliable and effective.
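
To make the “pluggable” part a little more concrete, the shape of the abstraction looks roughly like this. These are illustrative interfaces I’ve sketched for this post, not the actual types in the landlord repo; the method names and signatures are assumptions.

package landlordsketch

import (
	"context"
	"encoding/json"
)

// ComputeProvider turns a tenant's config into running compute.
// Docker is the only implementation today; ECS, Kubernetes and EC2 are the
// obvious future ones. (Hypothetical interface, not from the repo.)
type ComputeProvider interface {
	Provision(ctx context.Context, tenantID string, config json.RawMessage) error
	Update(ctx context.Context, tenantID string, config json.RawMessage) error
	Deprovision(ctx context.Context, tenantID string) error
}

// WorkflowProvider owns durable execution: it runs provisioning steps with
// retries and persistence so the control plane doesn't have to.
type WorkflowProvider interface {
	Dispatch(ctx context.Context, workflow string, input any) (invocationID string, err error)
	Status(ctx context.Context, invocationID string) (state string, err error)
}

The point of splitting things this way is that adding a new compute target means implementing another provider, rather than teaching the control plane anything new.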

Walkthrough

Landlord currently supports one workflow provider, Restate, which handles the durable execution. It supports two databases, SQLite and Postgres, for storing execution state, and one compute provider, Docker.
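
For reference, the Docker provider’s tenant config is plain JSON (you can see it in the create command below). Mapped into Go it would look something like this; the struct and field names are inferred from the walkthrough JSON, not copied from the repo.

package landlordsketch

// DockerTenantConfig mirrors the JSON config accepted for the Docker compute
// provider: an image, environment variables and port mappings. Field names
// here are inferred from the walkthrough JSON, not taken from the repo.
type DockerTenantConfig struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env"`
	Ports []PortMapping     `json:"ports"`
}

// PortMapping maps a container port to a host port over a protocol.
type PortMapping struct {
	ContainerPort int    `json:"container_port"`
	HostPort      int    `json:"host_port"`
	Protocol      string `json:"protocol"`
}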

We start by provisioning a tenant. Landlord accepts config for the tenant in the format expected by the compute provider you’re provisioning onto.

go run ./cmd/cli create --tenant-name lbr \
  --config '{
    "image": "nginx:1.25",
    "env": {
      "FOO": "bar"
    },
    "ports": [
      {
        "container_port": 80,
        "host_port": 8888,
        "protocol": "tcp"
      }
    ]
  }'
Tenant created
ID: 0b79eb9b-2c4c-43fb-85f0-aeb0ed63de78
Name: lbr
Status: requested
Config: {"env":{"FOO":"bar"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Compute: {"env":{"FOO":"bar"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Created At: 2026-02-06T16:00:09Z
Updated At: 2026-02-06T16:00:09Z
Version: 1

This request then kicks off a workflow in the pluggable workflow provider, which includes a worker that performs the execution. By dispatching the work to a workflow provider, we mitigate errors in the provisioning flow and can provision tenants in parallel without worrying about writing our own durable-execution code.
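
To make “durable execution” concrete: conceptually the worker breaks provisioning into small, idempotent steps and relies on the workflow engine to retry and resume them. The sketch below is a deliberately simplified, hypothetical version in plain Go - it does not use the Restate SDK, and the retry loop stands in for what the engine actually provides.

package landlordsketch

import (
	"context"
	"fmt"
	"time"
)

// step is one idempotent unit of work, safe to run again if it failed part way.
type step struct {
	name string
	run  func(ctx context.Context) error
}

// provisionTenant runs each step in order, retrying with a crude backoff.
// A real workflow engine (Restate, Temporal, Step Functions) also persists
// progress between steps, so a crashed worker resumes instead of starting over.
func provisionTenant(ctx context.Context, steps []step) error {
	for _, s := range steps {
		var err error
		for attempt := 1; attempt <= 5; attempt++ {
			if err = s.run(ctx); err == nil {
				break
			}
			time.Sleep(time.Duration(attempt) * time.Second)
		}
		if err != nil {
			return fmt.Errorf("step %q failed after retries: %w", s.name, err)
		}
	}
	return nil
}

In practice the steps are things like pulling the image, creating the container and waiting for it to become healthy, with each completion recorded durably so nothing has to start from scratch.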

The tenant has been requested, meaning the workflow job is dispatching:

go run ./cmd/cli list
ID                                    Name  Status     Workflow  Retries
0b79eb9b-2c4c-43fb-85f0-aeb0ed63de78  lbr   requested

The workflow job runs, and finally I can see my tenant is up and running!


ID: 0b79eb9b-2c4c-43fb-85f0-aeb0ed63de78
Name: lbr
Status: ready
Status Message: Workflow execution completed: inv_1gD1CwSdyy410oGeothz81216ifhbZkCAx
Workflow Sub-State: succeeded
Config: {"env":{"FOO":"bar"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Compute: {"env":{"FOO":"bar"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Created At: 2026-02-06T16:00:09Z
Updated At: 2026-02-06T16:00:22Z
Version: 3

Something I took from modern reconciliation approaches like Kubernetes was the ability to set desired state: if I want to modify the configuration, I can simply issue a set command with a new image or config (there’s a rough sketch of this reconciliation idea after the walkthrough below). Let’s add a new environment variable:

go run ./cmd/cli set --tenant-name lbr \
  --config '{
    "image": "nginx:1.25",
    "env": {
      "FOO": "bar",
      "LAND": "lord"
    },
    "ports": [
      {
        "container_port": 80,
        "host_port": 8888,
        "protocol": "tcp"
      }
    ]
  }'
Tenant updated
ID: 0b79eb9b-2c4c-43fb-85f0-aeb0ed63de78
Name: lbr
Status: updating
Status Message: Update requested
Config: {"env":{"FOO":"bar","LAND":"lord"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Compute: {"env":{"FOO":"bar","LAND":"lord"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Created At: 2026-02-06T16:00:09Z
Updated At: 2026-02-06T16:03:36Z
Version: 4

This kicks off a new workflow:

go run ./cmd/cli get --tenant-name lbr
Tenant details
ID: 0b79eb9b-2c4c-43fb-85f0-aeb0ed63de78
Name: lbr
Status: ready
Status Message: Workflow execution completed: inv_17jEoSZg1AhW2mztVGxctly24kh6hua6el
Workflow Sub-State: succeeded
Config: {"env":{"FOO":"bar","LAND":"lord"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Compute: {"env":{"FOO":"bar","LAND":"lord"},"image":"nginx:1.25","ports":[{"container_port":80,"host_port":8888,"protocol":"tcp"}]}
Created At: 2026-02-06T16:00:09Z
Updated At: 2026-02-06T16:03:52Z
Version: 6

and uses the underlying compute primitives to replace the existing tenant with new config:

docker ps --filter label=landlord.owner=landlord
CONTAINER ID   IMAGE        COMMAND                  CREATED              STATUS              PORTS                  NAMES
0962efd7c26b   nginx:1.25   "/docker-entrypoint.…"   About a minute ago   Up About a minute   0.0.0.0:8888->80/tcp   landlord-tenant-0b79eb9b-2c4c-43fb-85f0-aeb0ed63de78
docker inspect 0962efd7c26b | jq '.[0].Config.Env'
[
  "FOO=bar",
  "LAND=lord",
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  "NGINX_VERSION=1.25.5",
  "NJS_VERSION=0.8.4",
  "NJS_RELEASE=3~bookworm",
  "PKG_RELEASE=1~bookworm"
]
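
Going back to the desired-state idea from earlier: at its core, reconciliation is a diff between the config you asked for and the config that was last applied. Here’s a rough, hypothetical sketch of that decision in Go - my illustration of the concept, not code from the landlord repo.

package landlordsketch

import (
	"bytes"
	"encoding/json"
)

// needsUpdate compares the desired tenant config against whatever the compute
// provider last applied. A difference means the control plane should kick off
// an update workflow; no difference means there's nothing to do.
func needsUpdate(desired, applied json.RawMessage) (bool, error) {
	d, err := canonical(desired)
	if err != nil {
		return false, err
	}
	a, err := canonical(applied)
	if err != nil {
		return false, err
	}
	return !bytes.Equal(d, a), nil
}

// canonical re-marshals the JSON so key ordering doesn't cause false diffs.
func canonical(raw json.RawMessage) ([]byte, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

In the walkthrough above, that diff is what moves the tenant from ready to updating and back to ready once the new config has been applied.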

I personally believe this tenancy model is very common across our industry, and I hope this experimental approach goes some way towards commoditizing this particular part of the stack.

The AI of it all

I couldn’t in good faith write this post without talking about AI. As I mentioned before, I’ve had aspirations to build something like this for quite a while, but the level of effort involved seemed so incredibly daunting that I never really got started. Spec-driven development has completely changed what I can achieve in this space, and if you take a look at the openspec directory in the landlord repo, you’ll be able to see just how long it took me to build something relatively complex.

Now, how do I feel about the quality of this? Well, as I was reviewing the code generated by Codex 5.2, I did have to corral it a little bit. I’m not particularly enthused about the quality of the code here. There are several areas where the model made trade-offs I likely would not have made, and a lot of the time I had to catch it during the process of building a spec and say “actually no, please don’t do that”. The real thing worth talking about here, of course, is productivity - the fact that I could build a relatively complex system, in a domain I know well and for a problem I’ve already solved, is remarkable to me.

What’s next?

I’d like to continue adding pluggable interfaces to landlord: support for AWS Step Functions and Temporal as workflow engines is high on the agenda, as well as expanding compute support to ECS, Kubernetes and EC2. After that? Who knows - it was fun reminiscing about career years gone by.

FAQs

Remind me again why you wouldn’t just use Kubernetes or Nomad or some other cluster scheduler?

Nomad and Kubernetes solve the scheduling problem. They’re very good at placing workloads once you’ve decided what should exist. Landlord is focused on the control-plane problem that sits above that: deciding what should exist for each tenant, and driving it there reliably.

You can absolutely use Kubernetes or Nomad underneath Landlord. The point is to avoid baking all of this logic directly into cluster automation where it becomes fragile and hard to reason about.

Is this production-ready?

No - and that’s intentional.

This is an experiment in system design, workflow durability, and spec-driven development. The abstractions matter more than the current implementations. Some parts are deliberately simple so the seams are visible.

If this ever becomes “production-ready”, it will be because the design holds up as more providers and workflows are added.

Why spec-driven development instead of just writing code?

Because the hard part here isn’t syntax; it’s decision-making. Specs force clarity about what the system should do before any code gets written.

Once that’s written down, AI becomes genuinely useful as an accelerator instead of a liability. Without specs, you’re just arguing with the model.

What problem are you actually trying to commoditise?

The boring but critical middle layer: the provisioning, update, and reconciliation logic that sits between a customer signing up and a tenant actually running.

Most teams end up re-building this ad-hoc. Landlord is an attempt to make that layer explicit, inspectable, and reusable.

What models and tools did you use to build this?

I used a combination of GPT-Codex 5.2 with the Codex VSCode extension, as well as GitHub Copilot with Claude Sonnet 4.5 for some light fixes and documentation.


