Why is Foursquare down?
Update (5 October 2010 at 5:36 pm PDT) : The folks at Foursquare tell us why in a post-mortem. There are autosharding issues with MongoDB. Yup, my guesses were wrong, unless you consider MongoDB a kind of cache. 😉
I used to work for a few sites that required high scalability expertise. Now that we’re over 5 hours into the outage I’ll share some of my thoughts.
But before I do, I’d just like to say, I really hope that it’s nothing bad and I really like the Foursquare peeps. I’m not putting out this article to harsh on anybody, but just to share some knowledge I have. Outages happen to everybody!
- The worst case scenario is a full scale Magnolia meltdown. This is where because of a backup process that was off, they cannot restore ever from backup. Odds: unlikely.
- Someone turned off caching. I’m not sure how cache dependent the architecture is at Foursquare. If someone turned off the cache and the cache is just plain gone, then the caches have to be re-built. Rebuilding caches, depending on the time and complexity of each query can take up to 100x more time that it takes to retrieve the cache. If there’s some cached item that takes 100 seconds per user, the site will be down for a long time. They can only put a user back on foursquare at a rate of 100 per second if that’s the case, unless they can concurrently run the re-building of the cache.
- There’s an issue with a hacker who has broken through security and is wreaking havoc on Foursquare. It’s happened to the best sites, e.g. Google in the 90s, and it’s pretty tough to recover from. Sometimes you let the criminals in and do their worst while keeping the site up. Sometimes you have 0 tolerance.
I wish Foursquare the best of luck. I am more than happy to lend a hand to their issues, if they need another pair of eyes.