About yesterday's downtime
On April 14, 2014, FeedHQ was down for approximately 16 hours. This is rather unusual. If you want an explanation, read on.
So what happened? Around 7AM UTC, the primary database crashed because its disk was full. This database runs on SSDs, which makes it relatively fast, but SSDs usually don't offer as much capacity as good old spinning disks.
The database couldn't be started anymore, but a replica was available on a secondary server. That server only has spinning disks, and its replica is meant to be a backup rather than a hot standby: failing over to it would have made the site extremely slow.
What do you do when a disk fills up? Either you add more space or you free some up. In this case I went for the second option. Remember the data retention feature that was introduced in November? It wasn't being enforced for about two thirds of all accounts. So I took the replica and started expiring old data. This operation took approximately 9 hours.
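Expiring old data essentially boils down to deleting entries past the retention window. A minimal sketch of that kind of cleanup, assuming a hypothetical feeds_entry table with a date column, a feedhq database, and a 180-day window (the real schema and retention rules may differ):

```shell
# Delete entries older than the retention window.
# Table, column, database name and window are all assumptions.
psql -d feedhq -c "
    DELETE FROM feeds_entry
    WHERE date < now() - interval '180 days';"
```

On a table this large, a single DELETE like this runs for hours and only marks rows as dead; it doesn't shrink the files on disk, which is exactly the problem described next.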
After that the database was theoretically much smaller, but its footprint on disk was the same: PostgreSQL doesn't release unused disk space until you VACUUM FULL the database. Vacuuming a database is a very slow operation, especially on slow disks. I went for a more radical option: instead of vacuuming the data in place, I dumped it and loaded it into a fresh database instance on the SSD-backed primary server. Between loading the data and creating the indexes, this still took an additional 7 hours. I don't know how long a VACUUM FULL would have taken, but I suspect it would have been much longer.
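The dump-and-reload route looks roughly like this; hostnames and the database name are illustrative, not the actual setup:

```shell
# Dump the replica in PostgreSQL's custom format (hostnames are hypothetical).
pg_dump -h replica.internal -Fc -f feedhq.dump feedhq

# Create a fresh database on the SSD-backed primary and restore into it;
# --jobs parallelizes the data load and index builds, which is where
# most of the restore time goes.
createdb -h primary.internal feedhq
pg_restore -h primary.internal -d feedhq --jobs=4 feedhq.dump
```

The restored database only contains live rows, so all the space previously held by expired entries is reclaimed in one pass.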
During all that time the site served an error page and feed updates were stopped. Once the data was fully loaded, the site went back online and updates resumed to catch up with the previous 16 hours of Internet activity.
Almost a day of downtime is pretty bad, especially since the root problem could have been detected in advance. Here is what's needed to improve disaster recovery:
Alerts. There are pretty graphs for almost everything that happens on the servers, but no actual alerts outside of Sentry, which catches application errors. Proper alerts will be set up so action can be taken earlier in such cases. Alerts should have fired when the disk hit 70% or 80% of its capacity, not 100%.
More elasticity. FeedHQ isn't big, but it's hitting the limits of what you can do with vertical scaling. It's time to prepare for an architecture that's more distributed and easier to deal with when a single server goes down.
More communication. People sent emails or tweeted @FeedHQ to get some news. Everyone got a reply, but that wasn't optimal. A proper status page, linked from FeedHQ's error pages, would have helped a lot to communicate more clearly about the issue.
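The disk-usage alerting mentioned above can be sketched as a small script run from cron; the 80% threshold, the mount point, and the notification hook are all assumptions to adapt:

```shell
#!/bin/sh
# Minimal disk-usage alert sketch. Threshold, path and the way the
# alert is delivered are assumptions, not FeedHQ's actual setup.

# Returns success (0) when usage_pct >= threshold_pct.
should_alert() {
    usage_pct=$1
    threshold_pct=$2
    [ "$usage_pct" -ge "$threshold_pct" ]
}

# Current usage of the database volume, as a bare number (strip the '%').
usage=$(df --output=pcent /var/lib/postgresql 2>/dev/null | tail -1 | tr -dc '0-9')

if should_alert "${usage:-0}" 80; then
    # Replace echo with email, pager or chat notification.
    echo "ALERT: disk usage at ${usage}% on database volume"
fi
```

Checking well below 100% is the point: expiring data or adding capacity takes hours, so the alert has to arrive while there is still headroom to act.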
On a final, very positive note, I'd like to thank you all for staying positive and encouraging, whether it was via email or Twitter. Awesome users are what keeps FeedHQ going!