Delays and issues with sending
Resolved
Apr 16 at 05:00pm EDT
TL;DR
Bad configuration on one of our self-hosted SMTP servers caused a crash that proved difficult to recover from, leaving a large number of emails “stuck” to varying degrees – and that stuckness manifested in a slew of unpleasant ways. We’ve fixed the configuration, are investing (literally, right at this very moment) in better tooling and alerting, and are architecting a way to prevent this from ever happening again.
The gory details
Buttondown uses a number of providers to actually send the emails written by authors to their subscribers. In addition to using third-party vendors, we run and maintain our own fleet of servers dedicated to this purpose. We're going to refer to these servers as postal servers, a reference to the great open source project on which we rely.
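For the curious, the routing layer conceptually looks something like the sketch below. The names, weights, and shape of the code are illustrative rather than our actual implementation; the point is simply that every outgoing message is assigned to either a third-party vendor or one of our self-hosted postal servers.

```python
import random
from dataclasses import dataclass

@dataclass
class SendingRoute:
    name: str
    kind: str      # "vendor" (third-party) or "postal" (self-hosted)
    healthy: bool  # toggled by health checks or operator intervention
    weight: int    # rough share of traffic this route should receive

# Illustrative pool: a mix of third-party vendors and self-hosted postal servers.
ROUTES = [
    SendingRoute("vendor-a", "vendor", healthy=True, weight=40),
    SendingRoute("postal-1", "postal", healthy=True, weight=30),
    SendingRoute("postal-2", "postal", healthy=True, weight=30),
]

def pick_route(routes: list[SendingRoute]) -> SendingRoute:
    """Pick a healthy route, weighted by its configured share of traffic."""
    candidates = [r for r in routes if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy sending routes available")
    return random.choices(candidates, weights=[r.weight for r in candidates], k=1)[0]

print(pick_route(ROUTES).name)
```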
On Wednesday morning, we received some automated alerts from our checker system indicating that our backlog of emails was higher than it should be. After digging in a little bit, we realized the backlog was so high because each individual email was taking a huge amount of time to send from one specific server. After a couple more minutes, that server got to the point where all it was doing was trying to send for a minute and then timing out. (Software engineers reading this might already have some ideas about what had happened.) We logged into the server and quickly discovered that the issue was with the database storing messages that were pending delivery.
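To give a flavor of what that checker looks at, here is a minimal sketch of the kind of backlog check that fired; the thresholds and function names are hypothetical, not our production configuration.

```python
# Hypothetical thresholds for a single sending server.
PENDING_BACKLOG_THRESHOLD = 10_000   # messages queued on one server
SEND_LATENCY_THRESHOLD_S = 60        # a send this slow is effectively a timeout

def check_server(pending_count: int, recent_send_latencies_s: list[float]) -> list[str]:
    """Return human-readable alerts for a single sending server."""
    alerts = []
    if pending_count > PENDING_BACKLOG_THRESHOLD:
        alerts.append(f"backlog too high: {pending_count} pending messages")
    if recent_send_latencies_s:
        avg = sum(recent_send_latencies_s) / len(recent_send_latencies_s)
        if avg >= SEND_LATENCY_THRESHOLD_S:
            alerts.append(f"sends are timing out: average latency {avg:.0f}s")
    return alerts

# The failure mode we saw: each send hangs for about a minute, then times out.
print(check_server(pending_count=70_000, recent_send_latencies_s=[60.0, 59.8, 60.0]))
```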
While our initial instinct was that the problem was the overall volume of messages being sent to this particular server, we discovered that volume was actually secondary to the sheer number of connections. We were trying to connect to this database from too many worker threads, and it was not set up to recover gracefully or even notify downstream clients of what was happening. Once we discovered this, the first-order solution was pretty simple: we cycled the database, scaled down the number of workers, and got the connection count back to a manageable state.
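The arithmetic of the failure is worth spelling out. The numbers below are purely illustrative, but they show the shape of it: total connections scale with the number of workers, and once they exceed what the message database will accept, every send stalls waiting on a connection instead of failing fast.

```python
def total_connections(workers: int, threads_per_worker: int, conns_per_thread: int) -> int:
    """Every worker thread holds its own connection to the message database."""
    return workers * threads_per_worker * conns_per_thread

DB_MAX_CONNECTIONS = 200  # hypothetical limit on the message database

before = total_connections(workers=40, threads_per_worker=8, conns_per_thread=1)
after = total_connections(workers=10, threads_per_worker=8, conns_per_thread=1)

print(f"before: {before} connections (limit {DB_MAX_CONNECTIONS}) -> saturated")
print(f"after scaling workers down: {after} connections -> healthy headroom")
```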
The problem we were now left with was recovery. We had around 70,000 messages stuck in purgatory. They were technically marked as pending, but some of them – due to the flaky database connection – had actually been sent correctly; others were marked as sent but had not actually gone out, and so on.
We basically entered a fog-of-war situation where our sources of truth were no longer valid. Our SOP in these cases is to err on the side of caution. Caution in this case means hazmatting that specific server, spinning down all of its workers, leaving all of those messages as pending, and then shifting traffic over to another server or vendor to make sure we don't exacerbate the problem or accidentally act upon incorrect information.
This is exactly what we did. We shifted over traffic, the queue drained, and we resent any emails that we were very, very confident hadn't been sent. We cleared out the problematic server and resumed traffic.
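A simplified sketch of that reconciliation pass is below, assuming (hypothetically) that we can cross-check our own sent/pending flags against the sending server's delivery log; the data model and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class StuckMessage:
    id: str
    marked_sent: bool            # what our database believes
    seen_in_delivery_log: bool   # what the sending server actually did

def should_redrive(msg: StuckMessage) -> bool:
    """Only resend messages we are very confident never went out."""
    return not msg.marked_sent and not msg.seen_in_delivery_log

messages = [
    StuckMessage("a", marked_sent=False, seen_in_delivery_log=False),  # safe to resend
    StuckMessage("b", marked_sent=False, seen_in_delivery_log=True),   # actually delivered
    StuckMessage("c", marked_sent=True,  seen_in_delivery_log=False),  # ambiguous: leave alone
]

print([m.id for m in messages if should_redrive(m)])  # -> ['a']
```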
How we’re fixing it
If you've read this far, you're probably wondering what we're going to do to make this better. The first step, one that is essentially complete by the time we publish this postmortem, is a classic one: add much more monitoring and alerting. We were over-reliant on integration-level and high-level metrics for these servers, which work well when problems are obvious and well-formed, but not when they're a little further out of the mainstream.
To be specific, we already had alerting on pending or stuck messages on a per-server basis. But in order to actually fire those alerts, you needed an active connection to the database – which we couldn't get in this scenario.
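The fix, sketched below, is to run the check from outside the affected server and treat "we couldn't even connect to the message database" as an alert in its own right, rather than as a reason for the check to silently not fire. Hostnames and ports here are placeholders.

```python
import socket

def probe_message_db(host: str, port: int, timeout_s: float = 5.0) -> str | None:
    """Return an alert string if the database is unreachable, else None."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return None  # reachable; the normal backlog checks can run
    except OSError as exc:
        return f"message database {host}:{port} unreachable: {exc}"

alert = probe_message_db("postal-1.internal.example", 5432)
if alert:
    print(alert)  # in production this would page us instead of printing
```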
The second step is a little broader: we need to do a much, much better job of proactively pushing information about these kinds of sending patterns to you, the author. One of the worst feelings is sending an email and being confused because it's marked as sent but you haven't seen it in your inbox. We're going to start erring on the side of oversharing about the state of these things, so you can proactively poke around within the dashboard and understand what might be causing delays in getting your emails to your readers.
Customer impact
Over the course of the afternoon, approximately 13,000 subscribers across 40 authors experienced some combination of the following:
- Multiple hour delays before receiving a message
- Not receiving an email at all (though we've since redriven these)
- Multiple sends of the same email
Zooming out
To be blunt, we've had too many incidents lately.
We've invested a lot in fixing bugs and stability at an object level over the past six months. But we've done a poor job of investing in stability at an end-to-end, infrastructural level, and the past few weeks have driven that point home. Our most important job as a tool is to reliably send your writing to your subscribers. We have not sufficiently invested in the very boring but very important kinds of observability that we needed to, and we're shifting a lot of our roadmap over the next six months to make sure that our ability to diagnose and resolve these issues is much, much stronger than it has been.

If you're still here at the end, having read through all of this, I know it's not out of rabid curiosity but likely out of frustration: you've trusted us with a job and we haven't been up to the task. We take this stuff seriously, and we're pouring everything we have into it.
Affected services
Backlog