Degraded

Delays in background processing

Jan 30 at 04:22am EST
Affected services
Backlog

Resolved
Jan 31 at 03:52pm EST

Post mortem

What broke?

workerscheduler, our process for running asynchronous jobs that are scheduled for some date in the future, was hard down for ~six hours. This meant, amongst other things:

  1. Outbound emails were down
  2. Cron was down
  3. Other stuff, but those two dwarf everything else

Why did it break?

At a very high level, our asynchronous worker schedulers work something like this (none of this is bespoke, it's standard RQ):

  1. To enqueue a job, serialize the method name, the arguments you want to pass to that method, and a timestamp — store that in Redis (there's a sketch of this step right after this list).
  2. To find a job that should be worked on, pull all of the potential jobs. Sort those jobs by timestamp and start running the first one that is ready.
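
For the enqueue half, that amounts to something like the sketch below (minimal and illustrative; send_welcome_email is a made-up example job, and "five_minutes" mirrors the queue name used elsewhere in this post):

# A minimal sketch of step 1: RQ serializes the callable's dotted path, its
# arguments, and the target timestamp, and stores all of that in Redis.
from datetime import datetime, timedelta, timezone

import django_rq

def send_welcome_email(subscriber_id: str) -> None:
    ...

queue = django_rq.get_queue("five_minutes")
queue.enqueue_at(
    datetime.now(timezone.utc) + timedelta(minutes=5),
    send_welcome_email,
    "subscriber-1",
)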

Now, the "arguments you want to pass" part is a bit of a landmine. Consider the following job that we might enqueue:

from django_rq import job

class Email:
    id: str
    subject: str
    body: str

@job('five_minutes')
def send_email(email: Email, recipients: list[str]):
    for recipient in recipients:
        send_email_to_recipient(email, recipient)

Enqueuing this means serializing a Python object that looks like:

{
    "method_name": "path.to.module.send_email",
    "arguments": [
        {
            "class": "path.to.module.email",
            "id": "1",
            "subject": "Hi there!",
            "body": "How are you doing?"
        },
        [
            "penelope@buttondown.com",
            "telemachus@buttondown.com"
        ]
    ]
}

(This is a simplified example, but it's directionally accurate.)

Now, the tricky thing about this is that emails can get... large. Our Email object stores over sixty columns (four of which are some variation of "the fully rendered body for an email in different formats"), and some authors write extremely lengthy emails. Some emails, in memory, are >5MB!

Now, with the above toy example — we're sending a single email to a list of two recipients. In reality, Buttondown batches the total recipient list for an email based on a number of factors: an author with 30,000 subscribers might have their email batched into groups of 100 subscribers each (for a total of 300 batches).
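
In code, that batching looks roughly like this (an illustrative sketch building on the toy send_email job above; the batch size and the enqueue_email_batches helper are made up for this example):

# Illustrative only: chunk the recipient list and enqueue one job per batch.
# With the original signature, every one of these jobs carries the full Email
# object (body and all) in its serialized payload.
BATCH_SIZE = 100

def enqueue_email_batches(email: Email, recipients: list[str]) -> None:
    for start in range(0, len(recipients), BATCH_SIZE):
        batch = recipients[start:start + BATCH_SIZE]
        send_email.delay(email, batch)  # 30,000 recipients -> 300 jobs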

And this is where we get into tricky territory. 300 jobs, each containing a 5MB email... suddenly we've added 1.5GB of data to Redis, and we've also forced the workerscheduler to deserialize 1.5GB of data just to figure out which single job to run next.

This is, in fact, exactly what happened: the Heroku dyno running the workerscheduler has a memory cap of 512MB, and once we got into this state it kept on trying to read the entire list of jobs, OOM-ing, restarting, ad infinitum.

Okay, so don't store the entire thing in memory!

Right! This is not rocket science. The correct approach is to do something like this:

class Email:
    id: str
    subject: str
    body: str

@job('five_minutes')
def send_email(email_id: str, recipients: list[str]):
    email = fetch_email_from_db(email_id)
    for recipient in recipients:
        send_email_to_recipient(email, recipient)

Which then means, in exchange for a slight performance hit (because now we're hitting the database to hydrate the email), all we have to serialize is this:

{
    "method_name": "path.to.module.send_email",
    "arguments": [
        "1",
        [
            "penelope@buttondown.com",
            "telemachus@buttondown.com"
        ]
    ]
}

We do this almost everywhere... except one place: our per-domain rate limiting logic. We happened to trigger that one code path, and then we were stuck.
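
We won't reproduce the real code here, but the shape of the problem was something like this (a purely hypothetical sketch; domain_is_over_rate_limit and the retry pattern are invented for illustration, not our actual rate limiter):

# Hypothetical: a per-domain throttling path that still takes the full Email.
# Every time a domain is over its limit, the entire serialized Email (body
# included) gets written back into Redis as part of the re-enqueued job.
from datetime import timedelta

import django_rq
from django_rq import job

@job('five_minutes')
def send_email_throttled(email: Email, recipients: list[str], domain: str):
    if domain_is_over_rate_limit(domain):
        django_rq.get_queue('five_minutes').enqueue_in(
            timedelta(minutes=5), send_email_throttled, email, recipients, domain
        )
        return
    for recipient in recipients:
        send_email_to_recipient(email, recipient)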

How did we stop the bleeding?

By clearing out all of the problematically large jobs from the safety (and larger memory size) of my laptop:

import django_rq
from rq.job import Job

STRING_OF_JOB_TO_REMOVE = "send_email_to"
QUEUE_NAME = "five_minutes"
queue = django_rq.get_queue(QUEUE_NAME)

# Pull every scheduled job ID, hydrate the jobs, and drop the ones whose
# description matches the problematic code path.
jids = queue.scheduled_job_registry.get_job_ids()
jobs = Job.fetch_many(jids, connection=django_rq.get_connection(QUEUE_NAME))
for job in jobs:
    print(job.description)
    if STRING_OF_JOB_TO_REMOVE in job.description:
        print("Removing!")
        queue.scheduled_job_registry.remove(job)

Once we did that (and re-ran a bunch of stuff to get things flowing again), we were back to normal (albeit with a big backlog!).

Why did it take so long to notice/fix?

The shortest answer is: almost all of our observability runs on crons. If crons are broken, then we're flying blind. But that leads us to...

How do we make sure this never happens again?

  • At an object level, we fixed that one problematic code path. We now load emails by ID there instead of serializing the entire object.
  • At a meta level, we've added a lot of observability through Better Stack so we're not dependent on our own rails. Most notably, we now get paged if no crons have been executed in five minutes: this would have immediately caught the problem.
  • At an observability level, I've added some internal tooling around analyzing the backlog (roughly along the lines of the sketch after this list). It didn't take us too long to diagnose the issue (~thirty minutes), but that's still not great.
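
For reference, that tooling is roughly in the spirit of the sketch below (simplified, and it assumes the main question is which scheduled jobs are holding the most memory in Redis; the top-20 cutoff is arbitrary):

# A simplified sketch: list the scheduled jobs on a queue and report which
# ones are holding the most memory in Redis.
import django_rq
from rq.job import Job

QUEUE_NAME = "five_minutes"
queue = django_rq.get_queue(QUEUE_NAME)
connection = django_rq.get_connection(QUEUE_NAME)

job_ids = queue.scheduled_job_registry.get_job_ids()
jobs = Job.fetch_many(job_ids, connection=connection)

sizes = sorted(
    ((connection.memory_usage(job.key) or 0, job.description) for job in jobs if job),
    reverse=True,
)
for size_in_bytes, description in sizes[:20]:
    print(f"{size_in_bytes / 1024 / 1024:.2f}MB  {description}")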

Updated
Jan 30 at 11:21am EST

We've finished redriving the stuck backlog items! We'll follow up with a postmortem later today.

Updated
Jan 30 at 10:14am EST

We have identified the cause and are working on a fix.

In the meantime, background work that was stuck has started to be processed again (for example, emails stuck in "About to Send" are now sending).

Importantly, do not send your email a second time. Emails that you have already sent and that are waiting in the backlog will be sent out eventually.

Created
Jan 30 at 04:22am EST

Degradations to our background processing system are causing email sends, scheduled emails, and other background work to be delayed.