Hacking GraphTracks in 2024
We began GraphTracks in February 2024, shortly after Bluesky became publicly available. It was a spin-off from another idea—proof that sometimes the best ideas come when you're procrastinating on others. Bluesky was a greenfield back then, with very few tools and obscure documentation. However, due to the openness of the platform, it was easy to start with, "What if we build social media analytics just for Bluesky?" After all, who doesn't love diving into the unknown with a blindfold on?
Early Development
We had no idea where to start; we'd never built anything like this before, except for my little project for a Factorio server list chart. Because, you know, if you can manage virtual factories, real-world analytics should be a piece of cake, right? What if we just built a top 1000 for Bluesky? Not just a static one, but a sliding list that shows the most popular accounts for the last hour, day, week, and month. Ambitious? Maybe. Exciting? Definitely.
This is how the first implementation looked:
We started with the documentation and quickly discovered that Bluesky, much like Twitter and Facebook in the 2010s, has a Firehose. That was it. I quickly built a prototype using plain JavaScript and SQLite. Express.js was so fast for prototyping that it felt like it was on a caffeine rush.
You can see how simple the implementation was. The whole Firehose listener was around 100 lines of code. SQLite proved to be a real help and an awesome prototyping vehicle. JavaScript was chosen simply because the SDK looked the most mature at that point, and why not? Sometimes, you just have to go with the flow—or the Firehose.
The "frontend" was a simple static file with a little bit of JavaScript and D3 in it. It was minimalistic, like a hipster's coffee table.
This was enough to prove that whatever we were trying to build was technically possible, but not enough to show to anyone. We had a prototype, but it was like a cake without icing—functional but not enticing.
Moving Forward
We had a proof of concept, but no product. We started researching what was available at the time. Market research: because guessing is only fun in games, not in product development.
We began with simple sketches...
...and user flows.
Initially, we did not want to have any authorization, as all data is public, and we just wanted to showcase capabilities and collect interest. Plus, who needs security when you have enthusiasm, right? (Spoiler: We did.)
Adding Frontend
Even before we had a single piece of design, we started integrating SvelteKit with our proof-of-concept backend. Having the PoC gave us a head start with a ready-to-integrate API. We chose SvelteKit because it was a technology we knew. Another good choice at this point was the Carbon Design System, as it provided fully featured components like tables. Why reinvent the wheel when you can borrow a well-designed unicycle?
One of the first versions of the frontend, with change markings over it
Evolving Backend
Our first deployment was to Fly.io's free tier. Within a week, it became obvious that SQLite would not suffice for this task. I had some experience with PostgreSQL and the TimescaleDB extension, so it was the obvious choice. Fast forward a month, and we had the top 100 for the whole of Bluesky. The way we collected data allowed us to show analytics for any account on Bluesky. It was like having a backstage pass to the concert of data.
Having a working, albeit naive, SQLite implementation allowed us to simultaneously build the frontend. Multitasking: because who needs sleep anyway?
Growth Issues
When we started, the event stream was around 50-100 events per second. After half a year, we began to see peaks as groups of users joined Bluesky. Our Node.js event collector was simple but not exactly performant. We started to lag behind the Firehose cursor heavily. Playing with database configurations and moving to a more performant server helped a bit. It was like trying to fix a leaky boat with duct tape—temporary and slightly soggy.
It became obvious that our event collector's Node.js process was single CPU-bound and required parallelism. I wasn't sure how to implement parallelism in Node.js correctly and considered porting my 100-line script to the only other language I knew better—Python. When in doubt, throw a snake at the problem.
Then I discovered the Python SDK. The prototype evolved quickly, and data collection was moved to Python within a week. The first iteration did not include any multiprocessing but was a direct Python/asyncio port of the JavaScript counterpart. Surprisingly, even this was already a much more stable solution. Sometimes, a change of scenery—or syntax—does wonders.
Database
TimescaleDB is an awesome solution, but of course it required a lot of tinkering. The first iteration of the schema was a carbon copy of what I had in SQLite, plus time-based sharding from TimescaleDB. It worked until the database grew beyond 50 GB. At that point, calculating the top 100 became challenging and took more than a minute. Moving the calculation to continuous aggregates helped a lot.
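For readers unfamiliar with continuous aggregates, the shape is roughly this: raw events go into a hypertable, and TimescaleDB incrementally maintains a bucketed rollup that the top-100 query reads instead of the raw data. The table and view names below are illustrative, not our exact schema.

```sql
-- Raw events in a hypertable, partitioned by time.
CREATE TABLE events (
  time  TIMESTAMPTZ NOT NULL,
  did   TEXT        NOT NULL,
  kind  TEXT        NOT NULL
);
SELECT create_hypertable('events', 'time');

-- Hourly per-account rollup, kept up to date by TimescaleDB.
CREATE MATERIALIZED VIEW events_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       did,
       COUNT(*) AS n
FROM events
GROUP BY bucket, did;
```

Ranking queries then sum a handful of pre-aggregated hourly rows per account instead of scanning millions of raw events.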
Another optimisation was using the COPY command instead of INSERT.
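The win comes from batching: instead of paying per-statement overhead for every event, the collector streams rows in bulk. A sketch of the difference (column names illustrative):

```sql
-- One round-trip and one parse per row:
INSERT INTO events (time, did, kind) VALUES (now(), 'did:plc:alice', 'like');

-- One round-trip for an entire batch, streamed from the client:
COPY events (time, did, kind) FROM STDIN WITH (FORMAT csv);
```

Most Postgres client libraries expose COPY directly, so the collector can buffer a few seconds of events and flush them in a single stream.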
Brazil and Multiprocessing
Twitter was banned in Brazil in September 2024, and Bluesky immediately started making waves. We moved to a bigger server and returned to the idea of multiprocessing. The collector was already failing at 1,000-2,000 events per second, so a swift solution was required. At this point we were thinking about adding something like Kafka to get proper backpressure. That would have been too complex, so we just used an internal Python queue with a producer and multiple consumer workers. That put us back in the game and finally eliminated the lagging. Now the only bottlenecks were Bluesky itself and hardware.
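The producer/consumer shape is the same whether the workers are processes, threads, or tasks; here is a stripped-down asyncio sketch of the pattern, with a fake event source standing in for the firehose. All names here are illustrative, and the bounded queue is what gives the backpressure we would otherwise have reached for Kafka to get.

```python
import asyncio

async def producer(queue: asyncio.Queue, events: list[dict]) -> None:
    # In the real collector this loop reads from the Bluesky firehose;
    # a bounded queue makes the reader wait when consumers fall behind.
    for event in events:
        await queue.put(event)

async def consumer(queue: asyncio.Queue, sink: list) -> None:
    while True:
        event = await queue.get()
        sink.append(event)  # stand-in for the actual database write
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    stored: list = []
    workers = [asyncio.create_task(consumer(queue, stored)) for _ in range(4)]
    await producer(queue, [{"seq": i} for i in range(10)])
    await queue.join()  # wait until every queued event has been processed
    for w in workers:
        w.cancel()
    return stored

stored = asyncio.run(main())
print(len(stored))  # → 10
```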
We survived many other waves of exodus from X—the creators wave, the US election wave, the VTuber waves—without a single issue. Our statistics are always precise.
Caching
Caching is the true king of performance. Initially we had no caching at all, and we slowly started adding layers of it.
First came the CDN and browser cache, as these are very simple to add. All you have to do is set headers or configure the CDN to enforce caching. In our case, one TTL could not fit all: we had to differentiate between the different statistic periods accordingly. This can be achieved with a dynamic TTL like this:
const ttlByPeriod = { hour: 60, day: 3600, week: 21600, month: 86400 }; // seconds, illustrative values
const ttl = ttlByPeriod[period];
res.setHeader(
  "Cache-Control",
  `public, max-age=${ttl}, stale-while-revalidate=${ttl}, stale-if-error=${ttl}`,
);
Of course, this was not enough, as we had users in many countries and a CDN only caches responses per location. We added Memcached. Another trick we used to always serve data from cache is to refresh the Memcached key while serving "stale" data. This is really easy to achieve if you set the Memcached TTL to half of the real one and update the cache in the background.
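Here is a sketch of that "serve stale, refresh in the background" trick, with a plain dict standing in for Memcached so the logic is self-contained. Each value is stored with a soft deadline at half the real TTL; past that deadline the cached value is still returned immediately, but a background recompute is kicked off. Names and TTL values are illustrative.

```python
import threading
import time

cache: dict[str, tuple[object, float]] = {}  # key -> (value, soft_deadline)
TTL = 60.0  # real cache TTL in seconds; the soft deadline is TTL / 2

def get_stats(key: str, compute) -> object:
    now = time.time()
    entry = cache.get(key)
    if entry is None:
        value = compute()                      # cold cache: compute inline once
        cache[key] = (value, now + TTL / 2)
        return value
    value, soft_deadline = entry
    if now >= soft_deadline:
        def refresh() -> None:                 # serve stale, recompute behind
            cache[key] = (compute(), time.time() + TTL / 2)
        threading.Thread(target=refresh, daemon=True).start()
    return value

print(get_stats("top100:day", lambda: [1, 2, 3]))  # → [1, 2, 3]
```

With real Memcached the idea is the same: store the soft deadline alongside the value, and let whichever request first notices it has passed trigger the refresh.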
Materialized views are also a form of caching: they shield reads from hitting hot data and prepare a well-defined subset of it that can be queried right away.
Outro
As we conclude 2024, we're filled with gratitude for the journey of building GraphTracks. From our initial concept to the challenges we've overcome, every step has been a learning experience. We extend our heartfelt thanks to our users, supporters, and the Bluesky community for their invaluable feedback and encouragement. Your support has been instrumental in our progress. Here's to a new year filled with growth, innovation, and continued collaboration. Wishing everyone a prosperous and joyful 2025!