App is currently unavailable

Resolved
Jul 17 at 01:42pm EDT

What was happening

For the past few days, we’ve been experiencing thundering herd-esque downtime every time we deploy at the top of the hour. The top-of-the-hour detail is less unusual than it sounds: we serve a lot of RSS traffic. (A fun fact is that 80% of our page views come from RSS readers and scrapers, and the vast majority of them are fairly naive and run the equivalent of an hourly cron to ping every single RSS feed in which they are interested.)
After investigation, we were able to identify why this started happening recently, even though the above traffic pattern and our general CI/CD posture have remained unchanged. Part of our firewall system involves checking incoming IP addresses against a deny list culled from a variety of trusted sources. To make the firewall as performant as possible, we aggressively cache that list so that we're not pulling it every time a subscription attempt is made. However, we recently expanded the purview of the database table that holds those IP addresses so that it also stores aggregate-level data about IPs for telemetry purposes.
At a high level, the logic looked something like this:
@cache
def get_problematic_ip_addresses():
    ip_address_models = IPAddress.objects.all()
    return {
        ip.ip_address
        for ip in ip_address_models
        if ip.do_not_honor
    }
And that logic remained the same! But that backing IPAddress model went from a few hundred records to a few hundred thousand, replete with a JSON payload for each IP.
And because we were caching this rather than querying it per request, each new server had to build the list from scratch when it came online. Even with rolling deploys, every fresh server was unresponsive while it tried to pull and then collate every single IP address within the 30-second time span of a request.
The fix was trivial: filter in the database instead of in Python.
@cache
def get_problematic_ip_addresses():
    ip_address_models = IPAddress.objects.filter(do_not_honor=True)
    return {
        ip.ip_address
        for ip in ip_address_models
    }
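To illustrate why the change matters, here is a minimal, self-contained sketch that uses Python's built-in sqlite3 module as a stand-in for our actual database and ORM. The table layout, the helper names, and the sample data are all hypothetical; only the idea (filter in the query, not in application code) mirrors the fix above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ip_address ("
        "ip_address TEXT, do_not_honor INTEGER, payload TEXT)"
    )
    # 5,000 rows, of which only every 1,000th is on the deny list.
    rows = [
        (f"10.0.{i // 256}.{i % 256}", 1 if i % 1000 == 0 else 0, '{"hits": %d}' % i)
        for i in range(5000)
    ]
    conn.executemany("INSERT INTO ip_address VALUES (?, ?, ?)", rows)
    return conn


def blocked_ips_filter_in_python(conn: sqlite3.Connection) -> set[str]:
    """Old shape: pull every row, payload included, then filter locally."""
    rows = conn.execute(
        "SELECT ip_address, do_not_honor, payload FROM ip_address"
    ).fetchall()
    return {ip for ip, do_not_honor, _payload in rows if do_not_honor}


def blocked_ips_filter_in_db(conn: sqlite3.Connection) -> set[str]:
    """New shape: the database returns only the rows and column we need."""
    rows = conn.execute(
        "SELECT ip_address FROM ip_address WHERE do_not_honor = 1"
    )
    return {ip for (ip,) in rows}


if __name__ == "__main__":
    conn = build_demo_db()
    # Both approaches agree on the result; they differ in how much data moves.
    assert blocked_ips_filter_in_python(conn) == blocked_ips_filter_in_db(conn)
    print(len(blocked_ips_filter_in_db(conn)), "blocked addresses")
```

Both functions return the same set, but the second never transfers the telemetry payloads or the rows that are not on the deny list. In Django, a further step in the same direction would likely be to fetch only the one column with values_list("ip_address", flat=True), though we have not described doing that here.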
Going forward, we’ll be paying much closer attention to the actual timeline of the deploy process. It was easy to chalk this up to luck of the draw to a certain extent, but such “luck” is scarce and still came at the cost of severely degraded performance. By logging startup time and alerting when it deviates from the norm, we’ll be able to identify aberrations of this nature more actively in the future.
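As a rough sketch of what alerting on startup time could look like, the snippet below times a warm-up step and flags it when it runs well beyond recent history. Everything here is an assumption for illustration: the function names, the three-sigma threshold, the minimum sample count, and the idea that previous startup durations are available as a plain list. It is not a description of our actual monitoring.

```python
import logging
import statistics
import time
from typing import Callable, Sequence

logger = logging.getLogger("startup")


def is_startup_anomalous(
    duration_s: float,
    history_s: Sequence[float],
    max_sigma: float = 3.0,   # hypothetical threshold
    min_samples: int = 5,     # hypothetical minimum history size
) -> bool:
    """Return True if duration_s is far above the historical mean."""
    if len(history_s) < min_samples:
        return False  # not enough history to judge
    mean = statistics.fmean(history_s)
    stdev = statistics.stdev(history_s)
    # Guard against near-zero variance making every startup look anomalous.
    spread = max(stdev, 0.05 * mean)
    return duration_s > mean + max_sigma * spread


def timed_startup(warm_up: Callable[[], None], history_s: Sequence[float]) -> float:
    """Run warm_up, log how long it took, and warn on a large deviation."""
    start = time.monotonic()
    warm_up()
    duration_s = time.monotonic() - start
    logger.info("startup warm-up took %.2fs", duration_s)
    if is_startup_anomalous(duration_s, history_s):
        logger.warning("startup warm-up of %.2fs is well above normal", duration_s)
    return duration_s
```

With a history of roughly two-second startups, a 30-second warm-up like the one described above would be flagged, while ordinary jitter would not. A real setup would more likely ship the duration to a metrics system and alert there.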

Updated
Jul 17 at 12:10pm EDT

We've identified the issue, and see service being restored. We're continuing to monitor as there may be intermittent errors.

Created
Jul 17 at 11:50am EDT

We're currently investigating reports that the web app is unavailable. This is impacting the Dashboard, login and hosted archives.