it's an appview, michael, how long could it take?
tl;dr
- I'm running what is, to my knowledge, the first instance of the Bluesky AppView containing all* data in the history of the network
- This took me just over six months, for reasons entirely different than you think
- You can query it at bsky.zeppelin.social, or try out an instance of the web app using it at https://zeppelin.social
- The cost to run this is about US $200/mo, primarily due to the 16 terabytes of storage it currently uses
- This is a quantity of money that will render me broke in short order, so it's only staying up for a few weeks unless someone is interested in helping out ;)
* Thanks to imperfect error handling, I lost somewhere between a few hundred thousand & a few million records along the way, but I think that's still close enough.
the machine
This is running on a Hetzner auction server with a Ryzen 9 5950X, 8x 3.84TB SSDs, and 128GB of RAM. Of these, really only the storage space is needed. The AppView is currently, as of June 22 2025, using about 16TB of storage. It runs fine on under 32GB of RAM, though the backfill process benefits from as much CPU and RAM as you can get ahold of.
Now, on to the boring details about the process!
backfill
In the adage about the first 90% of a project taking 90% of the time, then the remaining 10% taking another 90%, this was the first 90%. Getting data from the millions of repos into the AppView is (was?) understood as the missing piece in self-hosting Bluesky from end to end. It was nontrivial!
You can find the code here. Inserting commits one by one would've taken impossibly long. I started out with a hodgepodge of CTEs to write a few hundred records at a time, but that, too, was too slow, as well as not allowing nearly enough records per query. Next was INSERT INTO ... SELECT * FROM UNNEST(...), which let me do thousands of rows at a time, but was still too slow.
The missing piece was COPY FROM, allowing for streaming in tens of thousands of rows per second. The final setup for insertion involves creating a temporary table, copying into it, then copying from the temp table into the actual table. This extra step of indirection lets me add in ON CONFLICT DO NOTHING, where duplicates would otherwise cause the COPY stream to error.
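If you're curious what that pattern looks like in practice, here's a rough sketch (not the actual backfill code) using node-postgres and pg-copy-streams; the post table and its columns are placeholders for whatever table you're loading:

```ts
import { Client } from "pg";
import { from as copyFrom } from "pg-copy-streams";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// `rows` are pre-serialized lines in COPY's default text format
// (tab-separated columns, \N for nulls)
async function bulkInsert(client: Client, rows: string[]) {
  await client.query("BEGIN");
  // stage into a temp table so a duplicate row can't abort the COPY stream
  await client.query(
    "CREATE TEMP TABLE post_tmp (LIKE post INCLUDING DEFAULTS) ON COMMIT DROP"
  );
  const copy = client.query(copyFrom("COPY post_tmp FROM STDIN"));
  await pipeline(Readable.from([Buffer.from(rows.join("\n") + "\n")]), copy);
  // merge into the real table, silently skipping anything already present
  await client.query(
    "INSERT INTO post SELECT * FROM post_tmp ON CONFLICT DO NOTHING"
  );
  await client.query("COMMIT");
}
```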
Speed does come with some tradeoffs. Checking validity against postgates/threadgates is impossible at the rate rows are inserted; so is calculating aggregates (like counts, follower counts, etc.). There's a script in the backfill repo to compute these after the fact, which will probably take about three weeks to run at current network scale.
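For a sense of what that after-the-fact pass does, one of its queries might look roughly like this (assuming a connected pg client; the post_agg and like table names follow the reference schema's naming but are illustrative here):

```ts
// recount likes per post in one shot; the real script does the same kind of
// thing for reply counts, repost counts, follower counts, and so on
await client.query(`
  INSERT INTO post_agg (uri, "likeCount")
  SELECT subject, count(*) FROM "like" GROUP BY subject
  ON CONFLICT (uri) DO UPDATE SET "likeCount" = excluded."likeCount"
`);
```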
With insertion figured out, the rest of the backfill script wasn't too complicated. The main process uses mary-ext/atproto-scraping to get a list of PDSes, then fetches every repo on the network, aiming for 50 at a time from each PDS. In practice, I get about 120 repos per second on a 1Gbps connection; this was the primary bottleneck in the backfill process. In theory, this 3-day process could take just under a day, limited only by the number of repos on the largest PDS on the network.
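The fetching itself is just the standard sync endpoints; stripped of concurrency limits, retries, and error handling, the per-PDS loop looks something like this:

```ts
// page through every repo a PDS hosts; `pds` is a base URL like https://pds.example.com
async function* listRepos(pds: string) {
  let cursor: string | undefined;
  do {
    const url = new URL("/xrpc/com.atproto.sync.listRepos", pds);
    url.searchParams.set("limit", "1000");
    if (cursor) url.searchParams.set("cursor", cursor);
    const page = await fetch(url).then((res) => res.json());
    yield* page.repos as Array<{ did: string }>;
    cursor = page.cursor;
  } while (cursor);
}

// download one repo as a CAR file, ready to be written to disk for the workers
async function fetchRepo(pds: string, did: string): Promise<Uint8Array> {
  const url = new URL("/xrpc/com.atproto.sync.getRepo", pds);
  url.searchParams.set("did", did);
  const res = await fetch(url);
  return new Uint8Array(await res.arrayBuffer());
}
```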
As the main process fetches repos, it writes them to disk for a pool of repo workers to pick up using Bun.mmap. These workers parse out commits from each repo and send them, in bulk, back to the main process. During my run, they were able to parse repos at 10x the speed repos were coming in; about 1200 per second. Once the main process receives a set of commits from however many repos each worker parsed in the last fifth of a second, they're sent to two workers. One is a collection write worker, which is allocated a subset of collections to handle writing to the database. The other is a record write worker, which writes all records to the generic record table in the database. Splitting up the workload this way prevents any worker from receiving too many records too quickly. Each collection worker writes tens of thousands of rows per second, while the record worker's load is closer to a hundred thousand records every half second. This is where having a decent CPU comes in handy!
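To make the routing concrete, here's a simplified sketch of how a batch of parsed commits gets split up; the worker files, message shapes, and collection assignments are illustrative rather than the repo's actual code:

```ts
// one worker per group of collections, plus one for the generic record table
const feedWorker = new Worker("./writer-feed.ts");
const graphWorker = new Worker("./writer-graph.ts");
const recordWorker = new Worker("./record-writer.ts");

// each collection is assigned to one of the collection workers
const collectionWorkers = new Map<string, Worker>([
  ["app.bsky.feed.post", feedWorker],
  ["app.bsky.feed.like", feedWorker],
  ["app.bsky.graph.follow", graphWorker],
  // ...one entry per collection the AppView indexes
]);

type Commit = { did: string; collection: string; rkey: string; cid: string; record: unknown };

function routeCommits(commits: Commit[]) {
  // every record goes to the generic record table...
  recordWorker.postMessage(commits);

  // ...and also to whichever worker owns that collection's dedicated table
  const byCollection = new Map<string, Commit[]>();
  for (const commit of commits) {
    const batch = byCollection.get(commit.collection) ?? [];
    batch.push(commit);
    byCollection.set(commit.collection, batch);
  }
  for (const [collection, batch] of byCollection) {
    collectionWorkers.get(collection)?.postMessage(batch);
  }
}
```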
All in all, putting this together took just over a month. Even as I successfully ran the backfill script, about five months before writing this post, I had yet to realize it was just the beginning.
indexing
This was the other 90%. The issue, it turned out, wasn't getting events that had already occurred into the AppView; it was getting events as they occurred. While the open source AppView implementation is identical in behaviour to the production AppView Bluesky runs, some of the code, namely the indexer, doesn't fare as well under real pressure. The single-threaded implementation struggled to get more than 200 events per second into the database, which would be a decent number if not for the fact that baseline activity is double that, and daily peaks are closer to ten times as many events. This meant, of course, that I had to write my own indexer.
It was this adventure that took me another few months. The basic structure remained about the same throughout: the main process listens to the relay and queues events to be picked up and processed by workers. Many weeks were spent tinkering around the edges (decoding on the main process versus in the workers, sending messages directly versus through a Redis queue), but I wasn't getting the throughput I needed.
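The skeleton is simple enough to sketch; this assumes the public relay's firehose endpoint and leaves out frame decoding, cursor tracking, and backpressure, which is where all the actual difficulty lives:

```ts
// one socket, N workers; raw firehose frames are handed straight to workers,
// which decode the DAG-CBOR/CAR payloads and write to the database
const workers = Array.from({ length: 8 }, () => new Worker("./indexer-worker.ts"));
let next = 0;

const socket = new WebSocket("wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos");
socket.binaryType = "arraybuffer";

socket.onmessage = (event) => {
  // round-robin dispatch; a real indexer also tracks the sequence number so it
  // can resume from a cursor after restarts, and keeps per-repo ordering intact
  workers[next].postMessage(event.data, [event.data as ArrayBuffer]);
  next = (next + 1) % workers.length;
};
```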
After approximately the seventh rewrite, the solution came from a very silly place. I had ruled out Bun early on because its worker threads support is buggy and prone to crashing. With this rewrite, three or four months in, I was running into some weird Node networking behaviour. I decided to switch from node api.js to deno run api.ts on a whim, just to see if it would do anything different, and that single change increased throughput by about 400%, rendering the indexer capable of handling events at full speed. Yeah, I never figured out how or why either.
With this working, I was able to run backfill a second time, start up the indexer, and have a fully functional AppView. Almost. It turned out I'd underestimated network growth since the first attempt, and had to clumsily port 12TB of data to a new server, losing about a day of events around June 1 in the process.
now what
This isn't quite a complete Bluesky clone. Still missing are:
- Labels: there's no reference implementation for indexing labels. I did write labelmuncher to listen to the Bluesky moderation service, but more work would be needed to discover & ingest from all labelers.
- Chat: the chat backend is also closed source, so the hosted web app uses Bluesky's api.bsky.chat.
- Video: the video processing service, too, is closed source, so videos just don't play.
My goal starting out with this project was really just to say, now what? There's another Bluesky, serving the same data as Bluesky Bluesky. Is the network more decentralized, whatever that means, than it was yesterday? Certainly there are benefits. Zeppelin ignores nuclear blocks and quote detaching, so we benefit from a bit less link rot. There are plenty of feed ideas made possible by having all this data. Apart from that, though, what do we really get out of having another copy of the same posts and profiles and numbers?
This is a bit of an anticlimactic ending, but I'm happy with that since that is, in a way, what I set out for. If you do see more value in this than I do, my DMs (and my GitHub Sponsors) are open, and I'm happy to maintain it for as long as I can afford to, and maybe even work on alternatives to those missing services if there's interest. If not, I'll leave behind this repo with instructions for doing all this yourself, hopefully on a much shorter time scale.
Many thanks to Skyseed for covering hosting costs on a project that spiralled past all estimates long enough for me to be able to say I did it.