March 1st Matrix Outage

Added: March 2 2026, at 01:20am
Modified: March 7 2026, at 03:27pm

Prologue

Matrix is not a fun protocol to run, and the decisions I made early on did not help.

I made two bad decisions in the beginning.

  1. Using Synapse.
  2. Using SQLite for the entire database.

These were both fine for a very small server with only about 3 people actively using it. But as Discord has gotten worse and more of my friends have moved over, the cracks quickly began to show.

What happened?

The more people I had on my server, the slower things became. Sometimes, I'd struggle to log in. Send a message. Upload an image. This was fine, I thought - things would often return to normal after a while. But these problems kept growing, until I was completely unable to use my homeserver at all. Since Matrix is my primary chat client, this became urgent.

What I learned

It turns out that funneling all my chats and clients through both one file and one service is really taxing on the server! Synapse by default runs as a single process - everything on only a couple of cores with only a few gigs of RAM. This was exacerbated by the use of one file - now 5GB! - holding the entirety of my server's database.

I am also federating quite a bit, which brought my server down to its knees on every reboot. Having to fetch all that data was brutal.

My fix

Unfortunately, switching to Tuwunel, as much as I would like to, was not an option, since I need to preserve my DM history. However, two things proved useful:

Migrating the Database

Using Synapse's database migration script, as instructed here, fixed the backend issues: PostgreSQL could now accept writes as fast as Synapse could issue them. However, Synapse itself was still not processing fast enough.
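One gotcha worth noting: Synapse refuses to use a PostgreSQL database that was not created with C collation, so the database has to be created explicitly rather than with the defaults. A minimal NixOS sketch (the role and database names here are assumptions - match them to your own setup):

services.postgresql = {
  enable = true;
  # Synapse requires C collation, hence template0 and explicit locales.
  initialScript = pkgs.writeText "synapse-init.sql" ''
    CREATE ROLE "matrix-synapse" WITH LOGIN;
    CREATE DATABASE "matrix-synapse" WITH OWNER "matrix-synapse"
      TEMPLATE template0
      LC_COLLATE = "C"
      LC_CTYPE = "C";
  '';
};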

In addition, upping the cache factor played a big role, giving Synapse a bigger RAM buffer to work with and keeping things working when the load gets heavier.

services.matrix-synapse = {
  settings = {
    # Use PostgreSQL instead of SQLite
    database.name = "psycopg2";

    # Give Synapse a bigger RAM cache
    caches.global_factor = 2.0;
  };
};

Using Workers

Synapse can be split up into multiple mini services that each deal with individual tasks, freeing the main process to handle media, logins, and general snappiness.

On Nix, workers can be enabled like so:

services.matrix-synapse = {
  settings = {
    # Use workers
    send_federation = false;

    federation_sender_instances = [
      "federation_worker_1"
      "federation_worker_2"
      "federation_worker_3"
      "federation_worker_4"
    ];

    sync_instances = [
      "sync_worker_1"
      "sync_worker_2"
    ];
  };

  # Add workers to distribute the load
  workers = {
    "federation_worker_1".worker_app = "synapse.app.generic_worker";
    "federation_worker_2".worker_app = "synapse.app.generic_worker";
    "federation_worker_3".worker_app = "synapse.app.generic_worker";
    "federation_worker_4".worker_app = "synapse.app.generic_worker";
    "sync_worker_1".worker_app = "synapse.app.generic_worker";
    "sync_worker_2".worker_app = "synapse.app.generic_worker";
  };
};

By moving federation sending off the main process and offloading syncing to workers, the load is now distributed evenly across multiple processes, allowing DMs and logging in to work seamlessly.
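For workers to actually receive traffic, each one also needs its own listener, and the reverse proxy has to route the matching endpoints (for sync, /_matrix/client/*/sync) to it. A sketch of what one sync worker's listener might look like - the port number is an assumption, pick whatever is free on your machine:

services.matrix-synapse.workers."sync_worker_1" = {
  worker_app = "synapse.app.generic_worker";
  worker_listeners = [{
    type = "http";
    port = 8084;
    # "client" exposes the client-server API, including /sync
    resources = [{ names = [ "client" ]; }];
  }];
};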

Ratelimiting and presence

When federating from Synapse, there is no option to restrict presence - the "user is online" check - to the local homeserver only. This meant microchatter - thousands of tiny, individual pings - was being sent from homeserver to homeserver, clogging memory.

By disabling this and slowing down federation in general, we can keep a lighter load on the system.

services.matrix-synapse = {
  settings = {
    # Ratelimit inbound federation requests
    rc_federation = {
      window_size = 1000;
      sleep_limit = 5;
      sleep_delay = 1000;
      reject_limit = 20;
      concurrent = 10;
    };

    # Turn off presence
    presence.enabled = false;
  };
};

Upkeep!

While I was here, I thought it important to enable things like Element Call and to serve Matrix from a VPS rather than from my home network. All of these changes can be explored further in my Git repo.