Some of the ways Reddit broke could be mitigated with additional error handling, but others were more complicated. For example, how should we handle a wiki page suddenly reporting that it doesn't exist, or flair returning a null value? The next move was to start migrating our source of truth away from Reddit to something more stable. The most stable option relative to us is Redis: if the bot server is down, Redis is down too -- a theoretical 100% relative uptime! Since we already have that database, why not retrofit it?
Let's talk for a moment about key-value stores. Though they are databases, they're what's commonly called NoSQL, since they are effectively "flat": there's only one table, and there's no way to have relational data because every entry is just two elements, a key and a value. The value can take different types (a set, a list, or a string, for example), but overall we can still only store two things at a time. Take a look at this:
# Here's an example of what data might look like.
post_ids: ['abc', 'def', 'hij']
accepted_code_of_conduct: ['person1', 'person2']
key: value

# If we interact with this, we can just ask Redis for the key and it spits the value back out.
> GET key
>> value
Now the problem becomes that we need to store relational data. We need to be able to keep track of how many posts each user has completed (because we can't rely on Reddit) AND which post IDs they've done (for debugging and future investigation). That's highly relational, but without building support for a standard SQL database, do we have an option that will just work without causing us too much effort? The answer is "sadly, yes" -- enter Caroline.
Caroline was a system that I wrote as a proof of concept that would go on to power this monstrosity -- it's a method of using a key-value store like Redis to hold pseudo-relational data. It's not perfect, but now we can store a lot more in the database we already have:
::user::person1: "{'gamma': 2, 'completed_posts': ['abc', 'def']}"
::user::person2: "{'gamma': 1, 'completed_posts': ['hij']}"
# Now if we ask Redis for a username, we have built-in namespacing (using `::user::`), so we know where to look for the values we want.
> GET ::user::person1
>> "{'gamma': 2, 'completed_posts': ['abc', 'def']}"
Translate that into Python and we have an actual object we can use and update, all while utilizing the speed and reliability of Redis. The problem here is that this is really not what Redis is designed for; it'll work, but we really need a SQL database. It's one of those things we can handle later, right? Right?
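Here's a minimal sketch of that idea in Python. A plain dict stands in for the Redis connection so the example is self-contained, and the helper names (`save_user`, `load_user`) are illustrative -- they're not Caroline's actual API.

```python
import json

# A plain dict stands in for Redis here; in production this would be
# a redis.Redis() client with the same GET/SET semantics.
store = {}

def user_key(username):
    # Namespacing convention from the example above: ::user::<username>
    return f"::user::{username}"

def save_user(username, record):
    # Serialize the dict to JSON so it fits in a single Redis value.
    store[user_key(username)] = json.dumps(record)

def load_user(username):
    raw = store.get(user_key(username))
    return json.loads(raw) if raw is not None else None

save_user("person1", {"gamma": 2, "completed_posts": ["abc", "def"]})
record = load_user("person1")  # a real Python object we can update
record["gamma"] += 1
save_user("person1", record)
```

The round trip through `json.dumps`/`json.loads` is what turns the flat string value back into something we can treat as a record.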
After roughly two years, we started to realize just how constrained we were by the pseudo-relational system, because extending the "models" was a trial in and of itself. Part of that was handled by processes written for Caroline, but as it became more obvious that a "real database" would be needed, we started to push off more of the features we needed in the hope that they would be implementable once we had the "proper" setup. During that time, other pain points appeared; as we introduced other bots (u/transcribot, our OCR solution, and u/ToR_Archivist, the Keeper of Completed Posts), our little microservice-oriented system became more and more complicated to maintain. We had abstracted code into shared libraries to keep everything standalone, so a change in a shared library sometimes required deploying four things at once to make sure everything got updated -- and with only a handful of people holding all the necessary permissions, that slowed us down more than it helped.
The New Infrastructure
Looking at what to fix meant we had to take a good long look at what was broken and what we needed. Things that were broken:
- our usage of the database
- deployments
- update speed
- too many things to keep track of
- inability to grow because of too much existing complexity / tech debt
With the immediate problems in mind, sometimes it's good to sit down and question everything. Why do we do X? Why do we do Y? If we need to start over, what would we do to solve this problem?
What if we took all the core logic out of the bots and just had them report back to a single thing that tells them what to do next?
Over a hurried dinner at Steak 'n Shake before they closed, David Alexander and I hashed out what a brand-new database could look like if we abandoned the microservice architecture and went a little more old-school: a monolithic application with microservice workers. The key reason we chose a monolith was simple -- it's far easier for a small team to maintain. Having a single application responsible for all primary functionality also fixed our single-source-of-truth problem; we would simply ignore whatever Reddit said and overwrite it with whatever the monolith said, completely removing flairs from our data management. The new plan looked like this:
Core monolith:
- handle all database-related information
- host all the logic related to transcriptions (u/transcribersofreddit)
- host all the logic related to OCR (u/transcribot)
- host all the logic related to expired and completed submissions (u/tor_archivist)
- host our website (where you're reading this right now!)
- act as a jumping-off point for future systems
Microservice worker (all the bots):
- remove the existing logic
- replace each logic section with a call to the monolith via HTTP
- act on the response from the monolith
- provide new information to the monolith to update the database / perform more logic
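The worker side of that plan can be sketched as a simple loop: ask the monolith for a task, act on it, report back. The endpoint paths and payload shapes below are hypothetical, not Blossom's real API, and the HTTP call is abstracted behind a `fetch` callable so the logic can be shown without a live server.

```python
def process_next(fetch):
    """One iteration of a worker: ask the monolith what to do,
    act on it, and report the result back.

    `fetch(method, path, payload)` abstracts the HTTP call; in
    production it might wrap requests.request() pointed at the
    monolith. Endpoint names here are illustrative only.
    """
    task = fetch("GET", "/api/next-task", None)
    if task is None:
        return None  # nothing to do right now
    # "Act" on the task -- a real worker would post to Reddit,
    # run OCR, archive a submission, etc.
    result = {"task_id": task["id"], "status": "done"}
    fetch("POST", "/api/report", result)
    return result

# A stub standing in for the monolith, for demonstration:
def stub_fetch(method, path, payload):
    if path == "/api/next-task":
        return {"id": 42}
    return {"ok": True}

process_next(stub_fetch)
```

Because all the decision-making lives on the other side of `fetch`, the bot itself stays thin -- exactly the point of stripping the logic out of the workers.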
Work began in earnest in late 2019, building on a proof of concept developed in late 2018. Since we had a bubbly chatbot for Slack named
Bubbles and an in-progress Discord bot named
Buttercup, it only made sense to have our monolith be the leader of the trio,
Blossom.
With the only major unsolved problem from the list being deployments, we
took another hard look at exactly what our deployments looked like and
why we did it that way. We were trying to follow best operations
practices using large systems like Chef, Packer, and Terraform, but we
hit the same wall there: the tools and environments we were trying to
use were simply not designed for the scale of our team and the type of
work that we needed to do. How can we take a step back from that and
make deployments easier?
We stumbled over the answer by accident when I wrote a system that allowed Bubbles, our Slack chatbot, to update itself by pulling from git and restarting its own service. Why not extend that capability and have it do the same thing for the other services? Deployment is now handled solely through Bubbles: a simple "@Bubbles deploy blossom" pulls the new code into production, runs all preliminary commands, restarts the service, and we're back in action. Bubbles can even recover from a failed deployment of herself or of one of the services she manages!
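The core of that chatops-style deploy is small enough to sketch. The service paths and systemd unit names below are made up for illustration; the `run` callable is injectable so the sequencing can be exercised without touching a real server.

```python
import subprocess

# Hypothetical install locations -- not our real paths.
SERVICES = {"blossom": "/opt/blossom", "buttercup": "/opt/buttercup"}

def deploy_commands(service):
    """Build the shell steps for a chatops-style deploy:
    pull the new code, then restart the service. A real deploy
    would also run preliminary commands (migrations, etc.) here."""
    path = SERVICES[service]
    return [
        ["git", "-C", path, "pull"],
        ["sudo", "systemctl", "restart", service],
    ]

def deploy(service, run=subprocess.run):
    # `run` defaults to subprocess.run but can be swapped for a stub;
    # check=True makes a failed step raise instead of continuing.
    for cmd in deploy_commands(service):
        run(cmd, check=True)
```

Because each step raises on failure, a wrapper bot can catch the exception and report a failed deploy back to Slack instead of leaving the service half-updated.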
I'd also like to take a moment and give some shoutouts to
Max van Deursen for his fantastic work on Blossom's API and Pf441 for their excellent work on Bubbles.
The Results
The glaringly obvious question is, "Was it worth it?" Overwhelmingly, the answer is yes. While Blossom has a moderate amount of complexity, having a single source for the primary system logic means that it's much easier to track down bugs and identify potential issues, and with 94% test coverage over the codebase, we're much better prepared for edge cases and the things we couldn't adequately prepare for before. We've vastly expanded our data management capabilities; for example, we can now keep track of:
Submissions:
- Reddit ID
- who claimed it and when
- who completed it and when
- where the submission came from
- what the content that we're working on is
- what transcription(s) are linked to this submission
Transcriptions:
- who did them and when
- what submission they're for
- transcription text
- OCR text
- where the transcription is posted
- whether it got eaten by the Reddit spam filter
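The two lists above translate naturally into record types. As a rough sketch, here they are as Python dataclasses -- the field names paraphrase the lists and are not Blossom's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Submission:
    # Field names paraphrase the list above, not Blossom's real models.
    reddit_id: str
    source: str                       # where the submission came from
    content_url: str                  # the content we're working on
    claimed_by: Optional[str] = None
    claimed_at: Optional[datetime] = None
    completed_by: Optional[str] = None
    completed_at: Optional[datetime] = None
    transcription_ids: list = field(default_factory=list)

@dataclass
class Transcription:
    author: str
    submission_id: str                # the submission it belongs to
    text: str
    created_at: Optional[datetime] = None
    ocr_text: Optional[str] = None
    posted_url: Optional[str] = None  # where the transcription is posted
    removed_by_spam_filter: bool = False
```

In a relational database these become two tables joined on the submission ID -- exactly the kind of relationship the old key-value setup couldn't express.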
It also allowed us to completely replace our website, grafeas.org, which meant that we could drop Ruby entirely, focus solely on Python, and reduce our external service count by one. It also consolidated several smaller services that had been hanging out on their own, like our payment processor for donations.
The overall upgrade is not without its downsides, though; as we found out during deployment, when running at full speed the system requires much more computational power than the standalone microservices do. We were previously able to run everything on a single $5 server from Linode, but the system as it stands right now keeps a $20 server at a steady 20% usage. Quadrupling our bill is not generally a good idea, but in this particular case we can handle it without too much griping. After we finish the next phase of data migration, we'll try to bring that down to a $10 server and see if it's feasible for the sake of cost. (Would you like to help us out with this?
More info here!)
And finally, as far as the end user is concerned, the experience is essentially the same. The bots are just as fast even with the much larger amount of data being processed and everything "looks" like it did before. Overall, a successful upgrade!
Epilogue
We're continuing to monitor and improve the new system; since deployments in particular are now so much easier, we've maintained roughly one deployment a day since rollout. We look forward to fully utilizing the increased flexibility the new system offers us... and that is a post for another day. Thanks so much for reading!