Notes on Running a Full-Network atproto Relay (July 2024)

@bnewbold.net

These are some informal notes on setting up a full-network atproto Relay, using the bigsky relay software developed by Bluesky. This is the same software we run ourselves at https://bsky.network. The focus here is on the compute resources necessary to replicate the type of full-network, full-featured service that Bluesky currently operates with the size of the network that exists today.

The demo Relay described here is running at relay-ovh.demo.bsky.dev. It handles crawling and re-publishing the full network firehose, with headroom for traffic spikes and growth on a number of dimensions: new accounts (repos), new PDS instances, more content being created (firehose event rate), number of services consuming from the firehose, etc.

Changes

Running through this demo setup turned up some sharp edges and missing configuration knobs in bigsky. For example, trying to backfill the full network with default configuration was resulting in OOM errors with an instance this size. Tweaks and configuration have been merged to the main branch of bigsky, along with additions to the README.

Network Scaling

How big can this instance scale? Hard to tell exactly; my guess is that it could handle an order of magnitude higher event rate, but will run out of disk before too long (eg, within the next year).

There are a number of possibilities for improving Relay efficiency to make this kind of service cheaper. Some are implementation details, like using alternative database engines, or not storing repo data as millions of small files on disk. Data and work could be sharded across multiple machines. Not every Relay needs to crawl and mirror the entire network, and Relays could potentially be simplified to not maintain a full mirror of network content. On the other hand, actually running critical services would have a number of needs not covered here: legal and administrative burdens, monitoring and alerting, etc.

Shopping for a Server

My assumption is that the main thing needed is a relatively large and reasonably fast disk. The Bluesky production Relays currently use about 1 TByte for PostgreSQL and 1 TByte for CAR storage on local disk. The CAR storage filesystem should also be XFS (not ext4), to handle many millions of small files. When shopping for instances I looked for around 2 TByte for PostgreSQL and 2 TByte for CAR storage so that this setup would be realistic even with growth of the network over time. This storage could all be one disk/filesystem or two separate disks/filesystems. Regular SSD would probably work fine, NVMe is nice, especially for backfill.

More RAM always helps (page cache and other caches). Don’t need much CPU; the relay process is highly concurrent, but mostly I/O bound. Do want decent network monthly quota.

Disk is definitely the hard part. Network block storage (eg, AWS EBS) is pretty expensive even from cheaper providers, and usually costs more monthly than an entire bare metal instance with larger disks. Bare metal instances are mostly spinning disk or NVMe, not SSD; I assume that spinning disk isn't realistic for a fast backfill demo.

I ended up selecting an OVH instance for about $150/month:

  • ADVANCE-2-LE: https://www.ovhcloud.com/en/bare-metal/advance/adv-2/
  • 12 vCPU (Intel Xeon-E 2136 - 6c/12t - 3.3 GHz/4.5 GHz)
  • 32 GB RAM (32 GB ECC 2666 MHz)
  • disks: 2×1.92 TB NVMe
  • 1Gbit/s unmetered and guaranteed
  • $152/month plus one-time $92 setup fee (no commitment)

That exact config isn’t available now (a week later), but a very similar one is:

  • ADVANCE-1 : https://www.ovhcloud.com/en/bare-metal/advance/adv-1/
  • 12 vCPU (AMD EPYC 4244P - 6c/12t - 3.8GHz/5.1GHz)
  • 32 GB RAM (32GB DDR5 ECC 5200MHz)
  • rootfs disk: 2x NVMe 960GB (RAID)
  • data disks: 2x 1.92TB NVMe
  • 1Gbit/s unmetered and guaranteed
  • $153/month plus one-time $93 setup (no commitment)

In both cases the setup fee is waived with a 6 month commitment, and there are discounts on the monthly rate with longer commitments.

Host Provisioning

Using the OVH web interface, provisioned the server with Ubuntu 24.04. With the ADVANCE-2-LE host, I specified partitioning to not use RAID. I let the setup wizard use one of the two disks for rootfs, boot, and swap. With all defaults this resulted in ext4. The second disk was not partitioned or configured using the wizard (I got to that later on the server itself).

Configured a DNS A record to point at the IPv4 that OVH gave us.

Logged in to the server and ran commands similar to this:

hostnamectl hostname relay-example.demo.bsky.dev

apt update
apt upgrade
apt install ripgrep fd-find dstat htop iotop iftop pg-activity httpie caddy golang postgresql yarnpkg

# set up yarn command; could also have used nvm
ln -s /usr/bin/yarnpkg /usr/bin/yarn

# punch holes in default firewall for HTTP/S
ufw allow 80/tcp
ufw allow 443/tcp

Ran through partitioning of the second NVMe with XFS. Note that on a real machine you'd want to set up fstab so this mounts automatically on a reboot.

# create a partition
sudo fdisk /dev/nvme1n1
# n (new partition), default (primary), default (1), default (start sector), default (entire disk), w (write)

# create XFS filesystem on that partition
sudo mkfs.xfs /dev/nvme1n1p1

# mount that filesystem to /data
sudo mkdir -p /data
sudo mount /dev/nvme1n1p1 /data
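
The automatic mount on reboot could be set up roughly like this. This is a sketch, not something that was run on the demo host: the fstab_line helper is hypothetical, and the "defaults,nofail" mount options are an assumption (nofail lets the machine finish booting even if the data disk is missing).

```shell
# Hypothetical helper: print an /etc/fstab entry for an XFS filesystem,
# identified by UUID so the entry survives device renumbering.
fstab_line() {
  local uuid="$1" mountpoint="$2"
  printf 'UUID=%s %s xfs defaults,nofail 0 0\n' "$uuid" "$mountpoint"
}

# Usage on the server (not run here; needs root and the real device):
#   uuid="$(sudo blkid -s UUID -o value /dev/nvme1n1p1)"
#   fstab_line "$uuid" /data | sudo tee -a /etc/fstab
#   sudo mount -a    # confirm the new entry mounts cleanly
```

Referencing the filesystem by UUID rather than by /dev/nvme1n1p1 avoids surprises if the kernel enumerates the two NVMe devices in a different order after a reboot.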

Pull the indigo codebase and build; ran this as the ubuntu user not root:

# create data directories, owned by the user that will run the service
# (sudo is needed because the freshly mounted /data is owned by root)
sudo mkdir -p /data/bigsky/events
sudo chown -R ubuntu:ubuntu /data/bigsky

# pull source code and build. if you had patches or a working branch, would modify here
cd
git clone https://github.com/bluesky-social/indigo
cd indigo
make build-relay-ui build

Configure PostgreSQL (sudo -u postgres psql); replace CHANGEME with a secure password of your choice:

CREATE DATABASE bgs;
CREATE DATABASE carstore;

CREATE USER bigsky WITH PASSWORD 'CHANGEME';
GRANT ALL PRIVILEGES ON DATABASE bgs TO bigsky;
GRANT ALL PRIVILEGES ON DATABASE carstore TO bigsky;

-- these are needed on PostgreSQL 15 and newer, where non-owner roles
-- can no longer create objects in the public schema by default
\c bgs postgres
GRANT ALL ON SCHEMA public TO bigsky;

\c carstore postgres
GRANT ALL ON SCHEMA public TO bigsky;

Create a config file at ~/indigo/.env:

ENVIRONMENT=production
DATABASE_URL="postgres://bigsky:CHANGEME@localhost:5432/bgs"
CARSTORE_DATABASE_URL="postgres://bigsky:CHANGEME@localhost:5432/carstore"
DATA_DIR=/data/bigsky
RELAY_PERSISTER_DIR=/data/bigsky/events
GOLOG_LOG_LEVEL=info
# or whatever DNS you want to use for handle resolution
RESOLVE_ADDRESS="8.8.8.8:53"
FORCE_DNS_UDP=true
RELAY_COMPACT_INTERVAL=0
RELAY_DEFAULT_REPO_LIMIT=500000

# these were somewhat tuned to this instance size
MAX_CARSTORE_CONNECTIONS=12
MAX_METADB_CONNECTIONS=12
MAX_FETCH_CONCURRENCY=25
RELAY_CONCURRENCY_PER_PDS=20
RELAY_MAX_QUEUE_PER_PDS=200

#RELAY_ADMIN_KEY=CHANGEME

UPDATE: renamed BGS_COMPACT_INTERVAL to RELAY_COMPACT_INTERVAL, and added RELAY_PERSISTER_DIR.

Uncomment RELAY_ADMIN_KEY and set it to a strong random value, and substitute the database password chosen earlier into DATABASE_URL and CARSTORE_DATABASE_URL. You can generate a random key with:

openssl rand -base64 30

Create a system-wide Caddy config at /etc/caddy/Caddyfile. Substitute in your hostname, and comment out any other lines in the file:

relay-example.demo.bsky.dev {
  reverse_proxy 127.0.0.1:2470
}

Restart caddy: sudo systemctl restart caddy

Running bigsky and Backfilling

Run the actual service! For example, in a screen session, or a service management tool of your choice:

cd ~/indigo
./bigsky --api-listen 127.0.0.1:2470
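
If you would rather use systemd than a screen session, a unit file along these lines might work. This is a minimal sketch and was not part of the demo setup: the unit name, the restart policy, and the ubuntu user and /home/ubuntu/indigo paths are assumptions carried over from the earlier steps. It also assumes that bigsky picks up ~/indigo/.env from its working directory, which is how the manual launch above is arranged.

```shell
# Write a minimal systemd unit to the current directory for review.
# Assumptions: the ubuntu user and ~/indigo checkout from earlier steps;
# the unit name and restart policy are arbitrary choices.
cat > bigsky.service <<'EOF'
[Unit]
Description=bigsky atproto Relay (demo)
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
User=ubuntu
# run from the checkout, the same as launching ./bigsky by hand
WorkingDirectory=/home/ubuntu/indigo
ExecStart=/home/ubuntu/indigo/bigsky --api-listen 127.0.0.1:2470
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# To install (not run here):
#   sudo cp bigsky.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now bigsky
```

With a unit like this, logs would end up in the journal (journalctl -u bigsky) rather than in the terminal.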

Confirm that everything is working by connecting from a laptop using the gosky command (which is in the indigo repo). Won’t get events (because the Relay hasn't subscribed to anything yet), but should connect successfully:

gosky readStream wss://relay-example.demo.bsky.dev

You can also connect to the web management interface at https://relay-example.demo.bsky.dev/dash. This lets you view basic stats per PDS, modify limits, add new PDS instances to crawl, take down individual repos (by DID), block PDS instances by domain suffix, etc.

[Screenshot: Relay admin interface]

To start backfills, create a hosts.txt file with PDS hostnames (one per line), then run the initial crawl command. This can be done from a laptop:

cd ~/indigo/cmd/bigsky
export RELAY_ADMIN_KEY=CHANGEMESECRET
export RELAY_HOST=relay-example.demo.bsky.dev

cat hosts.txt | parallel -j1 ./crawl_pds.sh {}

Let that bake for a few hours or overnight. Only accounts with new commits will get backfilled. A 24 hour period is usually around 10% of the network.

Then can start explicit backfills per-PDS ("resync"). This will pull a complete list of DIDs hosted on the PDS (or at least, which the PDS thinks are still hosted on the PDS, this might not yet handle migrations). Don’t want to do full PDS backfills for all the big PDS instances at once, or the relay will get overwhelmed (eg, OOM). Instead, do 4-8 at a time, modifying hosts.txt or the head command as needed:

head -n 4 hosts.txt | parallel -j1 ./sync_pds.sh {}

# check progress
head -n 4 hosts.txt | parallel -j1 ./sync_status_pds.sh {}
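
Instead of editing hosts.txt or the head command for each round, the batches could be selected by number. A small sketch; the hosts_batch helper is hypothetical and not part of the indigo repo:

```shell
# Hypothetical helper: print batch number N (1-based) of SIZE hostnames
# from a hosts file, so each batch can be piped to the sync script.
hosts_batch() {
  local file="$1" n="$2" size="$3"
  local start=$(( (n - 1) * size + 1 ))
  local end=$(( n * size ))
  sed -n "${start},${end}p" "$file"
}

# Usage (not run here), second batch of four hosts:
#   hosts_batch hosts.txt 2 4 | parallel -j1 ./sync_pds.sh {}
#   hosts_batch hosts.txt 2 4 | parallel -j1 ./sync_status_pds.sh {}
```

This assumes the big PDS instances are listed first in hosts.txt, so that the early batches are the ones that need to be paced carefully.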

Smaller self-hosted instances can be backfilled in big batches (eg, hundreds of backfills at the same time).

While running backfill, some new PDS instances will be discovered, even if not crawled specifically, and even with “spidering” disabled. My guess is that accounts which migrate away from our PDS instances are still listed by the original PDS. When bigsky does a backfill, it resolves all the DIDs, and sees a different PDS in the atproto service entry, and adds it to the PDS list.

The entire backfill took a couple days of casual checking in and poking it along.

How does one get a complete list of PDS instances in the network? It would be helpful if Relays had a public endpoint to scroll through all known PDS instances, and indicate if they are active, blocked/suspended, and roughly how many repos there are. For now, you can pull hostnames from public listings like https://blue.mackuba.eu/directory/ and https://bsky-debug.app/. Or scrape the complete DID PLC directory (which is public and enumerable), and extract all PDS service endpoints.
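
The PLC directory approach could look something like the following. This is a hedged sketch rather than a tested crawler: the pds_hosts_from_plc_export helper is hypothetical, it requires jq, and it assumes that the directory export is a stream of JSON lines in which current-format operations carry the PDS location under operation.services.atproto_pds.endpoint. Older legacy-format operations and tombstones are simply skipped.

```shell
# Hypothetical helper: read PLC export lines (one JSON object per line) on
# stdin and print the unique PDS hostnames found in current-format operations.
pds_hosts_from_plc_export() {
  jq -r '.operation.services.atproto_pds.endpoint? // empty' \
    | sed -E 's#^https?://##; s#[/:].*$##' \
    | sort -u
}

# Usage (not run here): fetch one page of the export and list hostnames.
# Assumption: the directory serves paginated JSON lines at /export, with
# `count` and `after` (a createdAt timestamp) query parameters.
#   curl -s 'https://plc.directory/export?count=1000' | pds_hosts_from_plc_export
```

A full scrape would need to page through the export repeatedly, and would list every PDS an account has ever pointed at, so the result still needs to be filtered down to hosts that are actually reachable.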

Rough Performance Stats

Here are some quick/informal system performance snapshots. The backfill period (fetching all previous repo content from the network) is far more resource intensive than steady operation.

I didn't run any compactions manually, and disabled automatic/periodic compactions. These are resource intensive to process, but free up disk and database space. Compactions are a feature specific to bigsky and its data storage system.

UPDATE: you can re-enable compactions by editing the RELAY_COMPACT_INTERVAL environment variable. The default is 4h; it is disabled (set to zero) in the template env file above.

During early phase of backfill:

# dstat

----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
 32  16  44   5   0|  17M 1410M|  20M  879k|   0     0 | 107k  173k
 52  13  29   4   0|  13M 1495M|  18M  765k|   0     0 | 134k  117k
 74  13  13   1   0|  20M  623M|  18M  722k|   0     0 |  47k   62k
 26  14  52   6   0|  12M 1610M|  14M  613k|   0     0 | 133k  120k
 51  16  30   2   0|  19M  928M|  18M  813k|   0     0 |  55k  118k
 29  16  47   6   0|  15M 1842M|  13M  587k|   0     0 | 133k  114k
 26  14  55   3   0|  16M 1124M|  12M  537k|   0     0 |  69k  165k
 24  15  52   7   0|  14M 1600M|  13M  575k|   0     0 | 131k  138k
 30  14  51   3   0|  16M 1041M|  18M  786k|   0     0 |  62k  155k
 20  14  57   7   0|  13M 1719M|8916k  406k|   0     0 | 137k  121k

# pg_activity (as postgres user)

PostgreSQL 16.3 - relay-ovh - postgres@/var/run/postgresql:5432/postgres - Ref.: 2s -
 * Global: 38 minutes uptime, 12.54G dbs size - 14.84M/s growth, 90.60% cache hit ratio
   Sessions: 81/100 total, 42 active, 39 idle, 0 idle in txn, 0 idle in txn abrt, 0 waiting
   Activity: 4382 tps, 74089 insert/s, 164 update/s, 0 delete/s, 28383 tuples returned/s, 0
 * Worker processes: 0/8 total, 0/4 logical workers, 0/8 parallel workers
   Other processes & info: 0/3 autovacuum workers, 0/10 wal senders, 0 wal receivers, 0/10
 * Mem.: 31.12G total, 756.90M (2.38%) free, 14.14G (45.44%) used, 16.24G (52.19%)
   Swap: 512.00M total, 511.00M (99.80%) free, 1.00M (0.20%) used
   IO: 155846/s max iops, 2.15K/s - 0/s read, 608.78M/s - 155846/s write
   Load average: 8.19 7.33 5.16

I don’t have stats, but at a later phase of backfill, I/O wait was pretty high and disk read/write were more symmetrical around 500MB/sec (should have taken a snapshot of that!), and CPU wait was only single-digit.

After all major backfills, just cruising along at a normal firehose subscription:

# dstat

----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  1   1  97   1   0|5794k 8046k| 232k 6851B|   0     0 |4439  7848 
  1   0  98   0   0|3615k 7399k| 219k 5071B|   0     0 |4062  7163 
  1   1  98   0   0|4645k   13M| 198k 5184B|   0     0 |4473  7194 
  1   0  98   0   0|4831k 8142k| 242k 7273B|   0     0 |4174  7581 
  1   0  98   0   0|3264k 7092k| 178k 4784B|   0     0 |3625  6619 
  1   1  98   0   0|3564k 7336k| 153k 3394B|   0     0 |3253  5502 
  1   0  98   0   0|4930k 9139k| 239k 6719B|   0     0 |4119  7364 
  2   1  97   1   0|6430k   14M| 313k   10k|   0     0 |6372    13k
  1   0  98   0   0|3359k 7422k| 172k 5255B|   0     0 |3670  6860 
  1   0  98   0   0|3929k   10M| 206k 7088B|   0     0 |4036  8954 
  1   1  98   0   0|3732k 7560k| 212k 6173B|   0     0 |3789  6771 
  1   1  97   1   0|5694k 8819k| 267k 7758B|   0     0 |4630  8511 
  1   0  98   0   0|3480k   11M| 175k 4565B|   0     0 |3764  5758

# pg_activity

PostgreSQL 16.3 - relay-ovh - postgres@/var/run/postgresql:5432/postgres - Ref.: 2s -
 * Global: 4 days, 22 hours and 23 minutes uptime, 445.38G dbs size - 162.93K/s growth, 79.55% cache hit ratio
   Sessions: 22/100 total, 1 active, 21 idle, 0 idle in txn, 0 idle in
   Activity: 514 tps, 912 insert/s, 0 update/s, 0 delete/s, 1028 tuples returned/s, 0 temp files, 0B temp size
 * Worker processes: 0/8 total, 0/4 logical workers, 0/8 parallel workers
   Other processes & info: 0/3 autovacuum workers, 0/10 wal senders, 0
 * Mem.: 31.12G total, 676.83M (2.12%) free, 17.36G (55.78%) used, 13.10G (42.09%) buff+cached
   Swap: 512.00M total, 608.00K (0.12%) free, 511.40M (99.88%) used
   IO: 0/s max iops, 0B/s - 0/s read, 0B/s - 0/s write
   Load average: 0.46 0.44 0.38

# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  1.6M  3.2G   1% /run
efivarfs        192K   37K  151K  20% /sys/firmware/efi/efivars
/dev/nvme0n1p3  1.8T  452G  1.2T  28% /
tmpfs            16G  1.1M   16G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p2  974M  182M  725M  21% /boot
/dev/nvme0n1p1  511M  5.2M  506M   2% /boot/efi
/dev/nvme1n1p1  1.8T  722G  1.1T  41% /data
tmpfs           3.2G   12K  3.2G   1% /run/user/1000

# df -i (inodes)
Filesystem        Inodes    IUsed     IFree IUse% Mounted on
tmpfs            4078568     1019   4077549    1% /run
efivarfs               0        0         0     - /sys/firmware/efi/efivars
/dev/nvme0n1p3 117080064   230862 116849202    1% /
tmpfs            4078568        3   4078565    1% /dev/shm
tmpfs            4078568        3   4078565    1% /run/lock
/dev/nvme0n1p2     65536      603     64933    1% /boot
/dev/nvme0n1p1         0        0         0     - /boot/efi
/dev/nvme1n1p1 187537280 25138236 162399044   14% /data
tmpfs             815713       32    815681    1% /run/user/1000

# sudo du -sh /var/lib/postgresql/16/
447G    /var/lib/postgresql/16/

# lsblk (for reference)
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda           8:0    1    0B  0 disk 
sr0          11:0    1 1024M  0 rom  
nvme1n1     259:0    0  1.7T  0 disk 
└─nvme1n1p1 259:7    0  1.7T  0 part /data
nvme0n1     259:1    0  1.7T  0 disk 
├─nvme0n1p1 259:2    0  511M  0 part /boot/efi
├─nvme0n1p2 259:3    0    1G  0 part /boot
├─nvme0n1p3 259:4    0  1.7T  0 part /
├─nvme0n1p4 259:5    0  512M  0 part [SWAP]
└─nvme0n1p5 259:6    0    2M  0 part 
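
As a rough cross-check on disk headroom, the steady-state PostgreSQL growth rate from the snapshot above (162.93K/s) can be extrapolated against the free space on the root disk (1.2T). This is back-of-the-envelope arithmetic, not a forecast: it assumes binary units, a constant growth rate taken from a single snapshot, and compactions left disabled, and it ignores CAR storage growth on /data. The days_until_full helper is hypothetical.

```shell
# Hypothetical helper: days until a disk fills at a constant growth rate.
#   $1 = growth in KiB/s, $2 = free space in GiB
days_until_full() {
  awk -v rate_kib_s="$1" -v free_gib="$2" 'BEGIN {
    gib_per_day = rate_kib_s * 86400 / (1024 * 1024)
    printf "%.1f GiB/day, about %d days of headroom\n", gib_per_day, free_gib / gib_per_day
  }'
}

# 162.93 KiB/s growth, 1.2 TiB (1228.8 GiB) free on the root filesystem
days_until_full 162.93 1228.8
```

With those inputs this works out to roughly 13 GiB per day and about three months of headroom for the database disk. Re-enabling compactions, or any change in network activity, would shift that number considerably.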

PostgreSQL table sizes:

bgs=# SELECT
    table_name,
    pg_size_pretty(table_size) AS table_size,
    pg_size_pretty(indexes_size) AS indexes_size,
    pg_size_pretty(total_size) AS total_size
FROM (
    SELECT
        table_name,
        pg_table_size(table_name) AS table_size,
        pg_indexes_size(table_name) AS indexes_size,
        pg_total_relation_size(table_name) AS total_size
    FROM (
        SELECT ('"' || table_schema || '"."' || table_name || '"') AS table_name
        FROM information_schema.tables
        WHERE table_schema != 'pg_catalog' AND table_schema != 'information_schema'
    ) AS all_tables
    ORDER BY total_size DESC
) AS pretty_sizes;
          table_name           | table_size | indexes_size | total_size 
-------------------------------+------------+--------------+------------
 "public"."repo_event_records" | 18 GB      | 541 MB       | 18 GB
 "public"."actor_infos"        | 987 MB     | 993 MB       | 1980 MB
 "public"."users"              | 749 MB     | 1060 MB      | 1809 MB
 "public"."pds"                | 3752 kB    | 32 kB        | 3784 kB
 "public"."auth_tokens"        | 16 kB      | 48 kB        | 64 kB
 "public"."slurp_configs"      | 16 kB      | 32 kB        | 48 kB
 "public"."feed_posts"         | 8192 bytes | 24 kB        | 32 kB
 "public"."vote_records"       | 8192 bytes | 16 kB        | 24 kB
 "public"."follow_records"     | 8192 bytes | 16 kB        | 24 kB
 "public"."domain_bans"        | 8192 bytes | 16 kB        | 24 kB
 "public"."repost_records"     | 8192 bytes | 8192 bytes   | 16 kB
(11 rows)


carstore=# SELECT
    table_name,
    pg_size_pretty(table_size) AS table_size,
    pg_size_pretty(indexes_size) AS indexes_size,
    pg_size_pretty(total_size) AS total_size
FROM (
    SELECT
        table_name,
        pg_table_size(table_name) AS table_size,
        pg_indexes_size(table_name) AS indexes_size,
        pg_total_relation_size(table_name) AS total_size
    FROM (
        SELECT ('"' || table_schema || '"."' || table_name || '"') AS table_name
        FROM information_schema.tables
        WHERE table_schema != 'pg_catalog' AND table_schema != 'information_schema'
    ) AS all_tables
    ORDER BY total_size DESC
) AS pretty_sizes;
      table_name       | table_size | indexes_size | total_size 
-----------------------+------------+--------------+------------
 "public"."block_refs" | 192 GB     | 217 GB       | 409 GB
 "public"."car_shards" | 4011 MB    | 4088 MB      | 8098 MB
 "public"."stale_refs" | 6245 MB    | 576 MB       | 6821 MB
(3 rows)
bnewbold.net
bryan newbold

@bnewbold.net

dweb, cycling, snow, big cities, wiki. I like speculating about found objects.
protocol engineer @bsky.app. formerly archive.org
elsewhere: bnewbold.net / @bnewbold@social.coop
