scalability hacking sysadmin TechBiz WebApps

Why is Foursquare Down? 3 Educated Guesses


Update (5 October 2010 at 5:36 pm PDT): The folks at Foursquare tell us why in a post-mortem: there were autosharding issues with MongoDB. Yup, my guesses were wrong, unless you consider MongoDB a kind of cache. 😉

I used to work for a few sites that required serious scalability expertise. Now that we’re over 5 hours into the outage, I’ll share some of my thoughts.

But before I do, I’d just like to say, I really hope that it’s nothing bad and I really like the Foursquare peeps. I’m not putting out this article to harsh on anybody, but just to share some knowledge I have. Outages happen to everybody!

Also, I do not feel that this meltdown is in any way indicative of Amazon’s EC2. I have a site that shares the same IP space and facility as Foursquare and we have had no outages today.

  • The worst-case scenario is a full-scale Magnolia-style meltdown, where a backup process was broken and they can never restore from backup. Odds: unlikely.
  • Someone turned off caching. I’m not sure how cache-dependent the architecture is at Foursquare. If someone turned off the cache and the cached data is just plain gone, then the caches have to be rebuilt. Depending on the time and complexity of each query, rebuilding a cache entry can take up to 100x longer than retrieving it. If there’s some cached item that takes 100 seconds per user to rebuild, the site will be down for a long time. In that case they can only put users back on Foursquare at a rate of one per 100 seconds, unless they can run the cache rebuilds concurrently.
  • A hacker has broken through security and is wreaking havoc on Foursquare. It’s happened to the best sites, e.g. Google in the 90s, and it’s pretty tough to recover from. Sometimes you let the criminals do their worst while you keep the site up. Sometimes you have zero tolerance.

I wish Foursquare the best of luck. I am more than happy to lend a hand with their issues if they need another pair of eyes.
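
To put rough numbers on the cache guess: if one user's cache entry takes 100 seconds to rebuild, a single worker brings back one user every 100 seconds, and the only way to speed that up is to rebuild in parallel. Here is a back-of-the-envelope sketch in Python. The function name, the user count, and the worker counts are all made up for illustration; I have no idea what Foursquare's real numbers look like.

```python
import math


def rebuild_time_seconds(users, seconds_per_user, workers=1):
    """Estimate wall-clock seconds to rebuild one cache entry per user.

    Assumes every entry costs the same and workers never block each other,
    which is optimistic for a database that is already struggling.
    """
    if users < 0 or seconds_per_user < 0 or workers < 1:
        raise ValueError("users and seconds_per_user must be >= 0, workers >= 1")
    # Each worker rebuilds one entry at a time, so the work happens in batches.
    batches = math.ceil(users / workers)
    return batches * seconds_per_user


if __name__ == "__main__":
    users = 1_000_000  # hypothetical user count, not a real Foursquare figure
    for workers in (1, 100, 10_000):
        hours = rebuild_time_seconds(users, 100, workers) / 3600
        print(f"{workers:>6} workers: {hours:,.1f} hours")
```

With one worker the estimate is hopeless, and it only becomes tolerable once thousands of rebuilds run at the same time, which is exactly when the backing database is likely to become the bottleneck.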

command-line sysadmin WebApps

Doing Sysadmin on the iPhone

For checking up on sites in the enterprise, I use Alertsite. It was suggested to me by a VP I work with at McCann, Ed Recinto. It’s been a great tool.

For personal websites that I manage, I’ve been using something I rolled in newLISP, sitebeagle. Why? Because beagles are great watchdogs.
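
For anyone curious what a watchdog like this boils down to, here is a minimal sketch in Python. It is not sitebeagle's code (that one is newLISP); the function name `check_site` and its `opener` parameter are invented for this example, and a real watchdog would also need scheduling and some way to send the alert.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=10, opener=urllib.request.urlopen):
    """Return (ok, detail) for a single HTTP check of url.

    `opener` is a hypothetical hook so the check can be exercised
    without touching the network; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, timeouts, and the
        # HTTPError that urlopen raises for 4xx/5xx responses.
        return False, f"unreachable: {exc}"
    if 200 <= status < 400:
        return True, f"HTTP {status}"
    return False, f"HTTP {status}"
```

Run something like `check_site("https://example.com")` from cron every few minutes and bark (email, SMS, whatever) whenever the first value comes back `False`.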

Most problems can be solved by tweaking code, changing permissions, or upgrading Apache or MySQL.

Very often, it’s the weekend, I’m sitting in a cafe, and I get an alert from Nagios or Alertsite. With iSSH on the iPhone, I can ssh into a LAMP server and do the work I need.

I can see things getting a bit more complex. What tools do you use to sysadmin from an iPhone?