Some of the ways Reddit broke could be mitigated with additional error handling, but others were more complicated. For example, how should we handle a wiki page suddenly reporting that it doesn't exist, or flair returning a null value? The next move was to start migrating our source of truth away from Reddit to something more stable. The most stable option relative to us is Redis: if the bot server is down, Redis is down too -- a theoretical 100% relative uptime! Since we already have that database, why not retrofit it?
Let's talk for a moment about key-value stores. Though they are databases, they're what's commonly called NoSQL, since they are effectively "flat": there's only one table, and there's no way to have relational data because every entry is just two elements, a key and a value. The value can take different types (a set, a list, or a string, for example), but overall we can still only store two things at a time. Take a look at this:
# Here's an example of what data might look like.
post_ids: ['abc', 'def', 'hij']
accepted_code_of_conduct: ['person1', 'person2']
key: value

# If we interact with this, we can just ask Redis for the key and it spits the value back out.
> GET key
>> value
Now the problem becomes that we need to store relational data. We need to be able to keep track of how many posts each user has completed (because we can't rely on Reddit) AND which post IDs they've done (for debugging and future investigation). That's highly relational, but without building support for a standard SQL database, do we have an option that will just work without causing us too much effort? The answer is "sadly, yes" -- enter Caroline.
Caroline was a system that I wrote as a proof of concept that would go on to power this monstrosity -- it's a method of using a key-value store like Redis to hold pseudo-relational data. It's not perfect, but now we can store a lot more in the database we already have:
::user::person1: "{'gamma': 2, 'completed_posts': ['abc', 'def']}"
::user::person2: "{'gamma': 1, 'completed_posts': ['hij']}"
# Now if we ask Redis for a username, we have built-in namespacing (using `::user::`), so we know where to look for the values we want.
> GET ::user::person1
>> "{'gamma': 2, 'completed_posts': ['abc', 'def']}"
Translate that into Python and we have an actual object we can use and update, all while utilizing the speed and reliability of Redis. The problem here is that this is really not what Redis is designed for; it'll work, but we really need a SQL database. It's one of those things we can handle later, right? Right?
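Here's a minimal sketch of that idea in Python. A plain dict stands in for the Redis connection so the example is self-contained, and the helper names (`save_user`, `load_user`) are illustrative -- they're not Caroline's actual API.

```python
import json

# A plain dict stands in for Redis here; in production this would be
# a redis.Redis() client with the same GET/SET semantics.
store = {}

def user_key(username):
    # Namespacing convention from the example above: ::user::<username>
    return f"::user::{username}"

def save_user(username, record):
    # Serialize the dict to JSON so it fits in a single Redis value.
    store[user_key(username)] = json.dumps(record)

def load_user(username):
    raw = store.get(user_key(username))
    return json.loads(raw) if raw is not None else None

save_user("person1", {"gamma": 2, "completed_posts": ["abc", "def"]})
record = load_user("person1")  # a real Python object we can update
record["gamma"] += 1
save_user("person1", record)
```

The round trip through `json.dumps`/`json.loads` is what turns the flat string value back into something we can treat as a record.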
After roughly two years, we started to realize just how constrained we were by the pseudo-relational system, because extending the "models" was a trial in and of itself. Part of that was handled by processes written for Caroline, but as it became more obvious that a "real database" would be needed, we started to push off more of the features we needed in the hope that they would be implementable once we had the "proper" setup. During that time, other pain points appeared; as we introduced other bots (u/transcribot, our OCR solution, and u/ToR_Archivist, the Keeper of Completed Posts), our little microservice-oriented system became more and more complicated to maintain. We had abstracted code into shared libraries to keep everything standalone, so a change in a shared library sometimes required deploying four things at once to make sure everything got updated -- and with only a handful of people holding all the necessary permissions, that slowed us down more than it helped.
The New Infrastructure
Looking at what to fix meant we had to take a good long look at what was broken and what we needed. Things that were broken:
- our usage of the database
- deployments
- update speed
- too many things to keep track of
- inability to grow because of too much existing complexity / tech debt
With the immediate problems in mind, sometimes it's good to sit down and question everything. Why do we do X? Why do we do Y? If we need to start over, what would we do to solve this problem?
What if we took all the core logic out of the bots and just had them report back to a single thing that tells them what to do next?
Over a hurried dinner at Steak 'n Shake before they closed, David Alexander and I hashed out what a brand-new database could look like if we abandoned the microservice architecture and went a little more old-school: a monolithic application with microservice workers. The key reason we chose a monolith was simple -- it's far easier for a small team to maintain. Having a single application responsible for all primary functionality also fixed our single-source-of-truth problem; we would simply ignore whatever Reddit said and overwrite it with whatever the monolith said, completely removing flairs from our data management. The new plan looked like this:
Core monolith:
- handle all database-related information
- host all the logic related to transcriptions (u/transcribersofreddit)
- host all the logic related to OCR (u/transcribot)
- host all the logic related to expired and completed submissions (u/tor_archivist)
- host our website (where you're reading this right now!)
- act as a jumping-off point for future systems
Microservice worker (all the bots):
- remove the existing logic
- replace each logic section with a call to the monolith via HTTP
- act on the response from the monolith
- provide new information to the monolith to update the database / perform more logic
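The worker side of that plan can be sketched as a simple loop: ask the monolith for a task, act on it, report back. The endpoint paths and payload shapes below are hypothetical, not Blossom's real API, and the HTTP call is abstracted behind a `fetch` callable so the logic can be shown without a live server.

```python
def process_next(fetch):
    """One iteration of a worker: ask the monolith what to do,
    act on it, and report the result back.

    `fetch(method, path, payload)` abstracts the HTTP call; in
    production it might wrap requests.request() pointed at the
    monolith. Endpoint names here are illustrative only.
    """
    task = fetch("GET", "/api/next-task", None)
    if task is None:
        return None  # nothing to do right now
    # "Act" on the task -- a real worker would post to Reddit,
    # run OCR, archive a submission, etc.
    result = {"task_id": task["id"], "status": "done"}
    fetch("POST", "/api/report", result)
    return result

# A stub standing in for the monolith, for demonstration:
def stub_fetch(method, path, payload):
    if path == "/api/next-task":
        return {"id": 42}
    return {"ok": True}

process_next(stub_fetch)
```

Because all the decision-making lives on the other side of `fetch`, the bot itself stays thin -- exactly the point of stripping the logic out of the workers.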
Work began in earnest in late 2019, building on a proof of concept developed in late 2018. Since we had a bubbly chatbot for Slack named
Bubbles and an in-progress Discord bot named
Buttercup, it only made sense to have our monolith be the leader of the trio,
Blossom.
With the only major unsolved problem from the list being deployments, we
took another hard look at exactly what our deployments looked like and
why we did it that way. We were trying to follow best operations
practices using large systems like Chef, Packer, and Terraform, but we
hit the same wall there: the tools and environments we were trying to
use were simply not designed for the scale of our team and the type of
work that we needed to do. How can we take a step back from that and
make deployments easier?
We stumbled over the answer by accident when I wrote a system that allowed Bubbles, our Slack chatbot, to update itself by pulling from git and restarting its own service. Why not extend that capability and have it do the same thing for the other services? Deployment is now handled solely through Bubbles: a simple "@Bubbles deploy blossom" pulls the new code into production, runs all preliminary commands, restarts the service, and we're back in action. Bubbles can even recover from a failed deployment of herself or of one of the services she manages!
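The core of that chatops-style deploy is small enough to sketch. The service paths and systemd unit names below are made up for illustration; the `run` callable is injectable so the sequencing can be exercised without touching a real server.

```python
import subprocess

# Hypothetical install locations -- not our real paths.
SERVICES = {"blossom": "/opt/blossom", "buttercup": "/opt/buttercup"}

def deploy_commands(service):
    """Build the shell steps for a chatops-style deploy:
    pull the new code, then restart the service. A real deploy
    would also run preliminary commands (migrations, etc.) here."""
    path = SERVICES[service]
    return [
        ["git", "-C", path, "pull"],
        ["sudo", "systemctl", "restart", service],
    ]

def deploy(service, run=subprocess.run):
    # `run` defaults to subprocess.run but can be swapped for a stub;
    # check=True makes a failed step raise instead of continuing.
    for cmd in deploy_commands(service):
        run(cmd, check=True)
```

Because each step raises on failure, a wrapper bot can catch the exception and report a failed deploy back to Slack instead of leaving the service half-updated.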
I'd also like to take a moment and give some shoutouts to
Max van Deursen for his fantastic work on Blossom's API and Pf441 for their excellent work on Bubbles.
The Results
The glaringly obvious question is, "Was it worth it?" Overwhelmingly, the answer is yes. While Blossom has a moderate amount of complexity, having a single source for the primary system logic means that it's much easier to track down bugs and identify potential issues, and with 94% test coverage over the codebase, we're much better prepared for edge cases and the things we couldn't adequately prepare for before. We've vastly expanded our data management capabilities; for example, we can now keep track of:
Submissions:
- Reddit ID
- who claimed it and when
- who completed it and when
- where the submission came from
- what the content that we're working on is
- what transcription(s) are linked to this submission
Transcriptions:
- who did them and when
- what submission they're for
- transcription text
- OCR text
- where the transcription is posted
- whether it got eaten by the Reddit spam filter
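The two lists above translate naturally into record types. As a rough sketch, here they are as Python dataclasses -- the field names paraphrase the lists and are not Blossom's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Submission:
    # Field names paraphrase the list above, not Blossom's real models.
    reddit_id: str
    source: str                       # where the submission came from
    content_url: str                  # the content we're working on
    claimed_by: Optional[str] = None
    claimed_at: Optional[datetime] = None
    completed_by: Optional[str] = None
    completed_at: Optional[datetime] = None
    transcription_ids: list = field(default_factory=list)

@dataclass
class Transcription:
    author: str
    submission_id: str                # the submission it belongs to
    text: str
    created_at: Optional[datetime] = None
    ocr_text: Optional[str] = None
    posted_url: Optional[str] = None  # where the transcription is posted
    removed_by_spam_filter: bool = False
```

In a relational database these become two tables joined on the submission ID -- exactly the kind of relationship the old key-value setup couldn't express.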
It also allowed us to completely replace our website, grafeas.org, which meant that we could drop Ruby entirely, focus solely on Python, and reduce our external service count by one. It also consolidated several smaller services that had been hanging out on their own, like our payment processor for donations.
The overall upgrade is not without its downsides, though; as we found out during deployment, when running at full speed the system requires much more computational power than the standalone microservices do. We were previously able to run everything on a single $5 server from Linode, but the system as it stands right now keeps a $20 server at a steady 20% usage. Quadrupling our bill is not generally a good idea, but in this particular case we can handle it without too much griping. After we finish the next phase of data migration, we'll try to bring that down to a $10 server and see if it's feasible for the sake of cost. (Would you like to help us out with this?
More info here!)
And finally, as far as the end user is concerned, the experience is essentially the same. The bots are just as fast even with the much larger amount of data being processed and everything "looks" like it did before. Overall, a successful upgrade!
Epilogue
We're continuing to monitor and improve the new system; since deployments in particular are now so much easier, we've maintained roughly one deployment a day since rollout. We look forward to fully utilizing the increased flexibility the new system offers us... and that is a post for another day. Thanks so much for reading!