March 1st Matrix Outage
Added: March 2 2026, at 01:20am
Modified: March 7 2026, at 03:27pm
Tags: #nixfox #tech
Prologue
Matrix is not a fun protocol to host, and the decisions I made early on did not help.
I made two bad decisions in the beginning.
- Using Synapse.
- Using SQLite for the entire database.
These were both fine for a very small server with only about 3 people actively using it. But as Discord has gotten worse and more of my friends have moved over, the cracks quickly began to show.
What happened?
The more people I had on my server, the slower things became. Sometimes, I'd struggle to log in. Send a message. Upload an image. This was fine, I thought - things would often return to normal after a while. But these problems kept growing, until I was completely unable to use my homeserver at all. Since Matrix is my primary chat platform, this became urgent.
What I learned
It turns out having all my chats and clients running through both one file and one service is really taxing on the server! Synapse by default is not very multithreaded - a single process handles everything, using only a couple of cores and a few gigs of RAM. This was exacerbated by keeping the entirety of my server's database in one SQLite file - now 5GB!
I am also federating quite a bit, which brought my server to its knees on every reboot. Having to fetch all that data was brutal.
My fix
Unfortunately, switching to Tuwunel, as much as I would like, was not a solution, as I need to preserve my DM history. However, two things proved useful:
Migrating the Database
Using Synapse Migrate DB, as instructed here, fixed the backend issues. Anything Synapse needed to write could now be accepted by the database as fast as it arrived. However, Synapse itself was still not processing requests fast enough.
In addition, raising the cache factor played a big role: it gives Synapse a bigger RAM buffer to work with, keeping things responsive when the load gets heavier.
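services.matrix-synapse = {
  settings = {
    # Use PostgreSQL instead of SQLite
    # (Synapse's database section selects the driver with `name`)
    database.name = "psycopg2";
    # Attempt a bigger RAM cache (Synapse's default global factor is 0.5)
    caches.global_factor = 2.0;
  };
};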
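For context, here is a minimal sketch of what the PostgreSQL side of this could look like on NixOS. The role and database name `matrix-synapse` are placeholder assumptions, and the sketch assumes Synapse connects over the local Unix socket with peer authentication, so no password is set. Synapse does expect its database to use the `C` collation, which is why the database is created explicitly here.

```nix
{ pkgs, ... }:
{
  services.postgresql = {
    enable = true;
    # Runs only when PostgreSQL initialises a fresh data directory,
    # so an existing cluster needs these statements run by hand.
    initialScript = pkgs.writeText "synapse-init.sql" ''
      CREATE ROLE "matrix-synapse" WITH LOGIN;
      CREATE DATABASE "matrix-synapse" WITH OWNER "matrix-synapse"
        TEMPLATE template0
        LC_COLLATE = "C"
        LC_CTYPE = "C";
    '';
  };

  services.matrix-synapse.settings.database = {
    name = "psycopg2";
    args = {
      # Assumed names; they must match the role and database above
      user = "matrix-synapse";
      database = "matrix-synapse";
    };
  };
}
```

The existing SQLite data still has to be migrated into this database before Synapse is pointed at it; the config alone only changes where Synapse looks.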
Using Workers
Synapse can be split up into multiple smaller worker processes that each deal with individual tasks, freeing the main process to handle media, logins, and general snappiness.
On NixOS, workers can be enabled like so:
services.matrix-synapse = {
  settings = {
    # Use workers: stop the main process from sending federation
    # traffic and hand that job to the workers listed here
    send_federation = false;
    federation_sender_instances = [
      "federation_worker_1"
      "federation_worker_2"
      "federation_worker_3"
      "federation_worker_4"
    ];
    sync_instances = [
      "sync_worker_1"
      "sync_worker_2"
    ];
  };
  # Add workers to distribute the load
  # (the names must match the instance lists above)
  workers = {
    "federation_worker_1".worker_app = "synapse.app.generic_worker";
    "federation_worker_2".worker_app = "synapse.app.generic_worker";
    "federation_worker_3".worker_app = "synapse.app.generic_worker";
    "federation_worker_4".worker_app = "synapse.app.generic_worker";
    "sync_worker_1".worker_app = "synapse.app.generic_worker";
    "sync_worker_2".worker_app = "synapse.app.generic_worker";
  };
};
By moving federation sending off the main process and offloading syncing to workers, this work now runs in separate processes spread across CPU cores, allowing DMs and logging in to work seamlessly.
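Declaring a generic worker does not by itself send any traffic its way: workers coordinate with the main process over Redis, and a sync worker only helps once the reverse proxy forwards `/sync` requests to it. As a rough sketch, assuming nginx is the reverse proxy, that wiring might look something like this. The port `8083` and the host name `matrix.example.org` are placeholders invented for illustration.

```nix
{
  services.matrix-synapse = {
    # Workers talk to the main process over Redis
    configureRedisLocally = true;

    workers."sync_worker_1" = {
      worker_app = "synapse.app.generic_worker";
      worker_listeners = [
        {
          type = "http";
          port = 8083; # hypothetical port
          bind_addresses = [ "127.0.0.1" ];
          tls = false;
          x_forwarded = true;
          # Serve client API requests on this listener
          resources = [ { names = [ "client" ]; } ];
        }
      ];
    };
  };

  # Hypothetical virtual host: route /sync to the worker,
  # everything else keeps going to the main process
  services.nginx.virtualHosts."matrix.example.org".locations."~ ^/_matrix/client/(r0|v3)/sync$" = {
    proxyPass = "http://127.0.0.1:8083";
  };
}
```

Federation sender workers need no listener of their own, since they only make outbound requests.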
Ratelimiting and presence
When federating from Synapse, there is no option to limit presence - the "user is online" check - to the local homeserver only. This meant microchatter - thousands of tiny, individual pings - was being sent from homeserver to homeserver, clogging memory.
By disabling this and slowing down federation in general, we can keep a lighter load on the system.
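services.matrix-synapse = {
  settings = {
    # Ratelimit federation (limits apply per remote server)
    # Length of the counting window, in milliseconds
    federation_rc_window_size = 1000;
    # Requests allowed per window before Synapse starts delaying them
    federation_rc_sleep_limit = 5;
    # How long each delayed request waits, in milliseconds
    federation_rc_sleep_delay = 1000;
    # Requests allowed to queue up before further ones are rejected
    federation_rc_reject_limit = 20;
    # Requests processed at the same time
    federation_rc_concurrent = 10;
    # Turn off presence
    presence.enabled = false;
  };
};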
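Current Synapse documentation groups the federation limits under a single `rc_federation` section instead of the flat `federation_rc_*` names. A sketch of the same values in that form, assuming nothing else about the setup changes:

```nix
{
  services.matrix-synapse.settings.rc_federation = {
    window_size = 1000;  # milliseconds
    sleep_limit = 5;     # requests per window before delaying
    sleep_delay = 1000;  # milliseconds
    reject_limit = 20;   # queued requests before rejecting
    concurrent = 10;     # requests processed in parallel
  };
}
```

Only one of the two spellings should be set at a time, to avoid confusion over which takes effect.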
Upkeep!
While I was here, I thought it important to enable things like Element Call and to serve from a VPS rather than my home. All of these changes can be explored further in my Git repo.
