For most of March 3rd my Bluesky feed generator was down. This affected all of my feeds, notably:
- Catchup (and its variants)
- Follower / Following Roulette
- The Good Stuff
- Lovelies
There is no SLA on these feeds; they are a hobby done in my spare time. Nonetheless they generally run well with few outages, so I think it's interesting to do a write-up of what happened.
The background
The feed generator runs on a 1 vCPU, 2GB RAM DigitalOcean droplet. Recently I noticed that while this is enough to keep up with the firehose, it is no longer enough to catch up if the generator falls behind for some reason; if it gets significantly behind I have to provision more compute to allow it to recover.
Amongst other things, CPU profiling identified a significant amount of time being spent in garbage collection. Profiling memory allocations identified a number of areas for improvement, which will be discussed in another blog post.
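The profiling setup itself is nothing exotic. As a rough illustration (a minimal Go sketch rather than the actual feed generator code), exposing the pprof endpoints is enough to grab both CPU and allocation profiles from the running service:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a local-only port. CPU profiles come
	// from /debug/pprof/profile and allocation profiles from
	// /debug/pprof/allocs, both readable with `go tool pprof`.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service would run here ...
	select {}
}
```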
The profiling also highlighted that one of the single biggest allocations was loading the DIDs of all 1.6 million followed posters as part of the statistics updates. These are processed in batches of 1,000 posters to gather statistics about each poster and their interactions, providing a baseline for various parts of the feed algorithms.
Statistics updates run over several hours, so the DIDs are processed in a random order; if the process is restarted it then starts with different posters rather than re-updating those that were already processed before the restart.
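Roughly, the original code had the following shape. This is an illustrative Go sketch, not the real implementation: the `posters` table and the `updateStatsForBatch` helper are stand-ins for whatever the actual schema and update code look like.

```go
package stats

import (
	"context"
	"database/sql"
)

// updateAllStats loads the DID of every followed poster in a random order and
// then updates statistics in batches of 1000. Loading all ~1.6 million DIDs
// up front is the allocation that the profiling flagged.
func updateAllStats(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx, `SELECT did FROM posters ORDER BY RANDOM()`)
	if err != nil {
		return err
	}
	defer rows.Close()

	var dids []string
	for rows.Next() {
		var did string
		if err := rows.Scan(&did); err != nil {
			return err
		}
		dids = append(dids, did)
	}
	if err := rows.Err(); err != nil {
		return err
	}

	// The random order means a restart starts with different posters instead
	// of re-updating the ones that were already processed.
	for start := 0; start < len(dids); start += 1000 {
		end := start + 1000
		if end > len(dids) {
			end = len(dids)
		}
		if err := updateStatsForBatch(ctx, db, dids[start:end]); err != nil {
			return err
		}
	}
	return nil
}

// updateStatsForBatch stands in for the real per-batch work: gathering
// interaction statistics for these posters and writing them back.
func updateStatsForBatch(ctx context.Context, db *sql.DB, dids []string) error {
	return nil
}
```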
The failure
The change that triggered the outage was an attempt to optimise this process by loading only the 1,000 DIDs needed for each batch rather than loading all 1.6 million up front.
I cannot simply page through the posters due to the random order. If I re-execute the query the order changes and the first page will likely be posters that have not been updated yet.
If I just repeatedly grab the same first page I don't know when all statistics have been updated.
Instead I opted for a long-running statement: issue the query once at the start of the statistics updates and read only 1,000 rows at a time, triggering the statistics update for those posters before reading the next thousand.
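Again as an illustrative Go sketch rather than the real code (reusing the hypothetical `posters` table and `updateStatsForBatch` placeholder from the earlier sketch), the new version looked something like this:

```go
// updateAllStatsStreaming issues the randomised query once and keeps the
// result set open for the entire statistics run, pulling 1000 DIDs at a time.
// This avoids the big up-front allocation, but it also means one connection
// holds a read open on the WAL for hours.
func updateAllStatsStreaming(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx, `SELECT did FROM posters ORDER BY RANDOM()`)
	if err != nil {
		return err
	}
	defer rows.Close()

	batch := make([]string, 0, 1000)
	for rows.Next() {
		var did string
		if err := rows.Scan(&did); err != nil {
			return err
		}
		batch = append(batch, did)
		if len(batch) == 1000 {
			if err := updateStatsForBatch(ctx, db, batch); err != nil {
				return err
			}
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		if err := updateStatsForBatch(ctx, db, batch); err != nil {
			return err
		}
	}
	return rows.Err()
}
```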
The database for the feed generator is a SQLite database running in WAL mode. WAL mode requires regular checkpointing to consolidate changes from the write-ahead log into the main database, but a checkpoint can only complete, and the WAL file can only be reset, when no other connection is still reading from the WAL file.
This long-running open statement meant the WAL file was always in use, so checkpoints could never complete. The resulting checkpoint starvation caused the WAL file to keep growing until disk space on the droplet was exhausted, at which point the feed generator both stopped consuming from the firehose and stopped responding to feed queries.
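You can watch this happening from the database itself: a manual checkpoint reports how many frames are in the WAL and how many it managed to move into the main database, and with a long-lived reader open the second number stops keeping up with the first. A small sketch, in the same illustrative Go package as above:

```go
import (
	"context"
	"database/sql"
	"log"
)

// reportCheckpoint runs a passive checkpoint and logs how far it got.
// PRAGMA wal_checkpoint returns three values: whether it was blocked (busy),
// the number of frames currently in the WAL (log), and the number of frames
// successfully checkpointed into the main database (checkpointed). Under
// checkpoint starvation, "log" keeps climbing while "checkpointed" stalls.
func reportCheckpoint(ctx context.Context, db *sql.DB) error {
	var busy, walFrames, checkpointed int
	err := db.QueryRowContext(ctx, `PRAGMA wal_checkpoint(PASSIVE)`).
		Scan(&busy, &walFrames, &checkpointed)
	if err != nil {
		return err
	}
	log.Printf("wal_checkpoint: busy=%d wal_frames=%d checkpointed=%d",
		busy, walFrames, checkpointed)
	return nil
}
```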
The recovery
That the WAL file was filling the entire disk was identified quickly; the cause of the growth was not identified until later, but was assumed to be one of the changes made to address memory consumption, as those were the only recent changes.
I did not want to simply increase the disk space; I wanted to get the database back to a more reasonable size (it is normally ~10GB, but the database and WAL file combined were over 45GB when the disk was exhausted). Because of this I decided to take the feed generator down until I could VACUUM the SQLite database back to a normal size. Recovery was delayed because, given my unfamiliarity with the process, this took several attempts:
- Initially I attempted an in-place VACUUM, but this failed because the disk was full
- After mounting a larger volume and copying the database there, I attempted a second VACUUM; this failed because it was still using the main (full) disk for its temporary space
- A third attempt, after setting the temp location to the volume with free space (sketched below), succeeded but took several hours
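For the record, the fix for the temp space was just to point SQLite's temporary directory at the volume before running the VACUUM: on Unix, SQLite honours the SQLITE_TMPDIR environment variable for its temporary files, and VACUUM needs scratch space on the order of the database size because it rebuilds the whole database. A rough illustrative Go version of the one-off recovery step (the driver choice and paths are placeholders, not the exact commands I ran):

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // illustrative driver choice
)

func main() {
	// Send SQLite's temporary files to the mounted volume that has free
	// space; this must be set before the connection is opened.
	if err := os.Setenv("SQLITE_TMPDIR", "/mnt/recovery-volume/tmp"); err != nil {
		log.Fatal(err)
	}

	db, err := sql.Open("sqlite3", "/mnt/recovery-volume/feedgen.db") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// VACUUM rebuilds the database into a temporary copy and then swaps it
	// in, which is why it needs roughly database-sized scratch space.
	if _, err := db.Exec(`VACUUM`); err != nil {
		log.Fatal(err)
	}
}
```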
Once the VACUUM had completed, the feed generator was restarted. At this point it was about 7 hours behind the firehose, so it had a large backlog of events to process. Monitoring the lag, I could see the feed generator initially catching up, then event processing slowing until it started falling further behind again.
Several targeted changes were made and deployed to try to address this before the feed generator was finally rolled back to a state from before the memory optimisations. At this point the feed generator was about 8 hours behind the firehose.
The feed generator was now essentially functional again, but the feeds were running significantly behind. It took approximately 6 hours to process the backlog of events and catch back up to current.
The aftermath
During that time I identified WAL checkpoint starvation as the cause of the growth, and as the explanation for why the event processing rate tailed off after each restart.
Having identified the technical cause, I have re-applied all of the other memory optimisations, which should make the feeds a bit more robust going forward.
Lessons learned, I mean further investigation
I'm not really going to do lessons learned on this like I would with a post mortem for my day job. The core systemic cause of this issue is that this is a hobby project I work on in my spare time, and I YOLO stuff into production when I have time, often just before I log off and go to sleep.
But I did learn some more details about running SQLite, and while I already knew about checkpoint starvation, I suspect it will be rather more front of mind the next time I make these kinds of changes.
I'm also keen to explore whether I can compact a SQLite database with a shorter outage by copying it with sqlite3_rsync rather than running an in-place VACUUM.
To me this also emphasises the need for the original changes, despite the outage. The feed generator needs enough headroom to catch up after falling behind, not just enough to keep pace with the firehose.