Almost each successful business application sooner or later enters the phase when the horizontal scaling is needed. In many cases one might simply launch the new instance and decrease the average load. In most cumbersome cases we must ensure different instances know about each other and distribute the workload gently.


Luckily enough, erlang has a first class support for distributed systems. In theory it sounds as easy as

Message passing between processes at different nodes, as well as links and monitors, are transparent […]

In practice distributed erlang was developed when container meant vessel and docker was a shorthand for longshoreman. IP4 had lots of unoccupied addresses, network splits were mostly caused by rats that chewed through the cables and the average uptime of production system was measured in decades.

Now we are all unbridledly self-contain[eriz]ed and running distributed erlang in the environment where IPs are randomly dynamic and nodes might appear and disappear just because scheduler decided to unwind is a bit tricky. To avoid an enormous boilerplate in each project running distributed erlang to deal with a hostile environment, helpers are to be created.

Sidenote: I am fully aware of the great libcluster package. If you are fine with what it provides, you should be all set. I am unfortunately not.


What I personally needed was the cluster assembly helper that:

  • transparently deals with both hardcoded node list and erlang service discovery
  • has a callback on each topology change (node up, node down, network split)
  • provides the transparent interface to run the cluster as :longnames, :shortnames, and :nonode@nohost
  • zero code support for docker

The latter means once I have tested the application locally at :nonode@nohost or in distributed environment with test_cluster_task, I want to run docker-compose up --scale my_app=3 and see it runs three instances in docker without any code change. I also want the dependent applications, like mnesia, to rehash nodes when the topology changes without any additional handling needed in the application.

Cloister does not aim to be a silver bullet, or to cover all the cases possible, or to be academically full in whatever sense computer scientists put into this term. I want it to serve the very narrow purpose, but to serve it in the best way possible. This goal would be: full transparency between local development environment and docker distributed environment.


Cloister is supposed to be run as the application, although advanced users might deal with the cluster assembly manually by directly starting Cloister.Manager within the target application supervision tree.

When started as an application, it relies on config to have the following significant values:

config :cloister,
  otp_app: :my_app,
  sentry: :"cloister.local", # or ~w|n1@foo n2@bar|a
  consensus: 3,              # number of nodes to consider
                             #    the cluster is up
  listener: MyApp.Listener   # listener to be called when
                             #    the ring has changed

Options above means precisely: Cloister is used for :my_app OTP application, with erlang service discovery to connect nodes, three at least, and MyApp.Listener to be notified about topology changes. The detailed description of the full configuration might be found here.

With this config, Cloister application would start in phases postponing the application startup process until the consensus is reached (three nodes are up and connected as by the example above.) That helps the main application to assume that when started, the cluster is already available. On each topology change (there would be many because nodes are started not fully syncronously,) MyApp.Listener.on_state_change/2 callback would be called. In most cases we would perform an action when state is %Cloister.Monitor{status: :up} meaning the cluster as assembled.

In most cases consensus: 3 setting is optimal because even if we expect more nodes to connect, the callback would go through status: :rehashing → status: :up loop on any new node added/removed.

When running in development mode, simply put consensus: 1 config value and Cloister would be happy to return for :nonode@nohost, or :node@host, or :node@host.domain depending on how the node was configured (:none | :shortnames | :longnames.)

Managing Distributed Applications

It’s very common for the distributed application to have some dependencies, like mnesia. It’s easy to handle their reconfiguration from the very same on_state_change/2 callback. Here is a detailed description of how to reconfigure mnesia on the fly in the Cloister documentation.

The main advantage of using Cloister is it performs all the necessary operations to rebuild the cluster after topology changes under the hood. The application simply starts in the already prepared distributed environment, with all the nodes connected, no matter whether we know IPs and hence node names upfront, or they were dynamically assigned/changed. It requires zero docker configuration tweaks and from the point of view of the application developer, there is no difference between running the application in distributed environment, or in local at :nonode@nohost. The detailed description might be also found in Cloister documentation.

While more sophisticated topology changes handling is possible via MyApp.Listener implementation, there could always be complicated cases where this library limitations and biased approach to configuration would be a show-stopper. That is fine, just pick up the aforementioned libcluster, which is way more generic, or even handle the cluster low-level on your own. The goal of this piece of code is not to cover all the possible scenarios, but to make the most common scenario use like a charm.

Happy clustering!