A Whole New Plane
2024-02-01
Paul Butler
Last year, we open-sourced Plane, the server infrastructure for session backends that powers Jamsocket.
Over the last few months, we rewrote Plane from scratch. This post is about why we did it, and some new features that users can look forward to.
If you’re not familiar with Plane, here’s the gist of what it is:
Plane provides an API for running ephemeral processes across a cluster of computers. Plane gives each process its own hostname that you can connect to from the internet (including web browsers). When all connections to a process have been dropped, Plane shuts it down.
It was heavily inspired by Figma’s approach to scaling multiplayer backends.
Internally, Plane is a distributed system. A controller acts as the hub for a set of drones, which are responsible for running the backends.
Here’s what the (new) architecture looks like:
Deciding to rewrite
Plane started as a way to test a hypothesis: as browser-based applications get more ambitious, session backends become a useful (even necessary) primitive for building them.
When we wrote the first lines of code for what would become Plane in 2021, companies like Figma had written about how they built similar infrastructure, but as far as we knew, nobody had tried to turn that approach into a generalized piece of infrastructure software.
Two years later, I’m more convinced than ever that our core hypothesis is right. But inevitably, two years of operating and evangelizing Plane has taught us that we didn’t get everything right the first time.
For example:
- Using DNS-based routing exclusively made local testing a headache.
- Some developers want to use Plane when developing locally or in CI/CD, and then deploy to Jamsocket. The fact that they each have a different API is a barrier to that.
- Developers like to use backends in a way that ties each one to an external resource (e.g. a key on S3 or a file on NFS). They need a way to ensure that only one backend runs at a time for that resource, as well as to connect users to an existing backend if one is already running.
- Building on NATS meant that for most developers, operating Plane entailed not only learning Plane, but also learning to operate NATS.
With these in mind, we decided to make three foundational changes to the architecture as part of the rewrite.
1. Proxy-based routing
The old Plane relied on DNS for routing inbound traffic to the desired backend. Each machine in the cluster needed a public IPv4 address, and traffic to a given backend was routed to that drone’s IPv4 address using DNS. A proxy process running on the drone terminated TLS and routed traffic to the right local port.
This had some nice properties — for one, building on top of foundational internet technologies meant we could be completely agnostic to cloud vendors when deploying it. But it also added complexity, like the added factor of DNS caching, debugging against differently-behaving DNS clients, and inflexibility of network topology. It also meant that local testing was difficult — at one point we even shipped Plane with a containerized version of Firefox preconfigured with DNS and a self-signed root certificate, just so people could try it!
In the new Plane, we have moved the responsibility of routing into separate “proxy” nodes. It is these proxy nodes, rather than the drones themselves, that are exposed to traffic ingress (optionally behind a layer 3 or 4 load balancer). The proxies are responsible for terminating TLS and routing traffic toward the right drone.
Moving away from DNS for routing also makes the localhost development story better. We moved the information needed to route out of the hostname and into the HTTP path, meaning that while Plane still supports having a unique subdomain for each backend, it’s now an optional feature.
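As a rough illustration of path-based routing, the sketch below treats the first path segment as a routing token and forwards the rest of the path to the backend. The URL layout, the `ROUTES` table, and the `resolve_route` function are assumptions made for this example, not Plane's actual scheme.

```python
from urllib.parse import urlsplit

# Hypothetical route table: token -> (drone address, port).
# This is not Plane's real data model.
ROUTES = {
    "abc123": ("10.0.0.5", 8080),
}

def resolve_route(url, routes=ROUTES):
    """Return ((host, port), remaining_path) for a URL, or None if unknown."""
    path = urlsplit(url).path
    # The first path segment carries the routing token; the rest is forwarded.
    token, _, rest = path.lstrip("/").partition("/")
    upstream = routes.get(token)
    if upstream is None:
        return None
    return upstream, "/" + rest
```

Because nothing here depends on the hostname, a request to `http://localhost:8080/abc123/ws/room` can be routed without any DNS setup.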
Plane still ships with a built-in DNS server, but it is now not used for production web traffic, only for ACME DNS-01 certificate validation.
2. Locks are key
After launching the initial version of Plane, the most-requested feature was a way to ensure that two users who simultaneously access the same document are routed to the same server. This allows session backends to be used to create an authoritative backend for each document, without worrying about two backends clashing and overwriting the document.
Prior to the recent rewrite, Plane achieved this through a feature we call locks. Before starting a backend that had a lock, we would attempt a two-phase commit, piggybacking on NATS JetStream’s Raft implementation. Only if the commit succeeded would we start the backend.
Despite the name, Plane’s locks do not actually enforce anything — you can think of them as a primitive in Plane to allow your application to build and enforce its own concept of locks.
The existing implementation of locks suffered from two problems:
- As a matter of terminology, “locks” are a bad metaphor: strictly speaking, they are not locks in the computer science sense, and calling them “locks” makes a very simple concept sound more complicated than it is.
- They are implemented as a controller-side construct; the drone is blissfully unaware that they exist. This means the controller has to be maximally pessimistic: locks are only “released” when the backend attached to them is known to have terminated. In practice, this means that if a drone instance dies, the locks need to be manually released.
To solve #1, the new Plane renames locks to keys, in the key-value store sense. The new metaphor is that Plane is sort of like a distributed hashmap mapping strings (keys) to running processes.
When you want to connect to a backend, you send Plane its key, and get back a URL for that process. If there is currently no running process for that key, but your request included instructions for spawning one, Plane will spawn a new backend and give you back its URL. If you are familiar with setdefault on Python dicts, the semantics here are similar.
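The `setdefault`-like semantics can be sketched with a toy in-memory map. The `KeyMap` class, its `connect` method, the `spawn_config` parameter, and the generated URLs are all hypothetical names used for illustration; they are not Plane's API.

```python
import itertools

class KeyMap:
    """Toy in-memory stand-in for the key -> backend mapping (illustrative only)."""

    def __init__(self):
        self._backends = {}
        self._ids = itertools.count(1)

    def connect(self, key, spawn_config=None):
        """Return the URL for `key`, spawning a backend only if none is running."""
        url = self._backends.get(key)
        if url is not None:
            return url  # a backend already holds this key: reuse it
        if spawn_config is None:
            return None  # nothing running and no instructions to spawn
        # Pretend to spawn a backend; the URL format is made up for this sketch.
        url = f"http://localhost:8080/backend-{next(self._ids)}/"
        self._backends[key] = url
        return url
```

Two callers that connect with the same key get the same URL, regardless of which one arrives first or what spawn instructions the second one sends.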
To solve #2, we made drones responsible for periodically renewing keys. If a drone fails to renew a key, the contract with the controller is that it must terminate the backend within a predetermined time period.
This means that even if the drone loses its connection to the controller, the controller can assume that the backend has stopped running after a fixed period of key non-renewal.
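The renewal contract resembles a lease with two deadlines. In the sketch below, the durations, the `KeyLease` class, and its method names are assumptions for illustration. A single object stands in for both the drone's and the controller's view of the last renewal, and clock skew is ignored.

```python
from dataclasses import dataclass

# Illustrative durations in seconds; Plane's actual values are not stated here.
DRONE_TERMINATE_AFTER = 30    # drone must stop the backend by this point
CONTROLLER_EXPIRE_AFTER = 45  # controller waits longer, leaving a safety margin

@dataclass
class KeyLease:
    last_renewed: float  # time of the last successful renewal

    def renew(self, now):
        self.last_renewed = now

    def drone_must_terminate(self, now):
        """Drone-side check: has the renewal window lapsed?"""
        return now - self.last_renewed >= DRONE_TERMINATE_AFTER

    def controller_may_reassign(self, now):
        """Controller-side check: is it safe to assume the backend is gone?"""
        return now - self.last_renewed >= CONTROLLER_EXPIRE_AFTER
```

The gap between the two deadlines is what lets the controller hand the key to a new backend without hearing from the old drone: by the time the controller's deadline passes, the drone's own deadline has already forced it to terminate the backend.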
While we expect this to be a sufficient protection against concurrent backends for most use cases, we also pass a fencing token to backends for cases where extra protection is needed.
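One common way to use a fencing token is to have the storage layer reject writes from a backend holding an older token. The sketch below assumes tokens are integers that increase each time a key is handed to a new backend; the post does not specify the token format, and `FencedStore` is a hypothetical class, not part of Plane.

```python
class FencedStore:
    """Toy document store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self._highest_token = {}  # key -> newest token seen so far
        self._data = {}

    def write(self, key, token, value):
        # Accept only tokens at least as new as the newest one seen for this key.
        if token < self._highest_token.get(key, token):
            return False
        self._highest_token[key] = token
        self._data[key] = value
        return True

    def read(self, key):
        return self._data.get(key)
```

If an old backend somehow outlives its key, its writes carry a lower token than the replacement's and are refused, so it cannot overwrite newer data.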
3. Replacing NATS
From early on, Plane has used NATS as a message bus. All communication between entities in Plane has gone through a NATS node.
As we incorporated stateful features like locks, we needed a place to put durable state. To avoid adding another component, we began using NATS JetStream to provide persistence and consensus as well.
Over time, we heard feedback from would-be Plane users that administering NATS was a barrier to some people self-hosting Plane.
And frankly, we had our own struggles with operating NATS. When it came to multi-tenancy, we found NATS’ authorization model to be an awkward fit for our requirements. We also generally had trouble gaining an intuition for “the NATS way of doing things”, and had to put out a few fires caused by NATS’ behavior catching us off guard.
Since our customers depend on Jamsocket to run production software, uptime is an I-will-be-out-of-a-job-if-we-screw-this-up problem. When we looked at options for replacing our use of JetStream as the persistence layer, Postgres was a natural fit, if a little boring. Plus, we were already using it in Jamsocket, and the team knows it pretty well.
Since Postgres is so widely used, people who self-host Plane are more likely to already know it, and already have an instance available.
When it came to replacing NATS as the message bus, we leaned on the web technologies we know: drones and proxies connect directly to the controller by opening a WebSocket connection. Using WebSockets (instead of plain TCP) gives us frames, and also allows us to delegate responsibilities like authorization and TLS termination to a reverse proxy, reducing the complexity that needs to live in Plane itself without compromising flexibility.
Towards Plane 1.0
While this rewrite brings a new level of maturity to Plane, there are a number of things that we need to stabilize before a 1.0 release. We’re excited to bring outside users into the mix, whether as users of Plane through our managed Jamsocket offering, or as self-hosting users of Plane, so if this interests you, follow along on GitHub, or reach out to hi@jamsocket.com.