This post was originally written as a guest post for RyboMedia on October 15, 2009. Thanks to Rybo for letting me post this! Hope you all found it enjoyable and informative.
We ran into a problem at work last week that was, at the same time, a nightmare and exactly the kind of problem you want to have.
The culprit was our latest Big Prize Giveaways promotion, and the problem was that our app had metaphorically gone from 0 to 60 in about two seconds, and it suffered the same thing your neck does when you accelerate that fast: whiplash.
This was my first experience with an app this big; even in my McJournal days, I rarely averaged more than 50,000 hits per day. So in the last week, I've learned a ton during our march to a million fans that I think can be useful to everyone, no matter how close your app is to that kind of scale.
Don't panic
This is the first rule in The Hitchhiker's Guide to the Galaxy, so it's the first rule here. My first (extremely narcissistic) thought was “What? Something broke? Our server sucks!”, and my second (seriously overdramatic) thought was “I tried my best when I was writing it, how am I supposed to improve it now?” Most scaling problems are definitely manageable, and even though (for me) they're not much fun and can be stressful, there are worse problems to have.
Have a plan
These types of problems aren't quick fixes; you'll need to know what you want to do, how you intend to do it, and when. Keep in mind that unless you shut down your app completely, the slowdowns will continue to affect you as you work through the problem, so having a plan is crucial to avoid wasting time thrashing.
EXPLAIN your queries
You should run EXPLAIN on every query in your app and make sure nothing jumps out at you. Many of my queries weren't using an index, meaning that while they were quick to run with only 50,000 contest entries, they slowed down as the entries piled up and became more frequent and less distinct. And when your app is doing 50 queries per second, an increase of even 10 ms per query will start to snowball and drag your app to a crawl. The point is: run EXPLAIN on your queries to ensure they're doing what you expect them to.
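Here's a minimal sketch of what that looks like, using Python's built-in sqlite3 module as a stand-in for MySQL (the two databases format their EXPLAIN output differently, but the idea is the same). The table and column names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, user_id INTEGER, created TEXT)")

# No index on user_id yet: the planner has to scan the whole table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entries WHERE user_id = 42"
).fetchall()
print(plan_before[0][3])  # e.g. "SCAN entries"

# Add an index and ask again: now it's an index search, not a scan.
conn.execute("CREATE INDEX idx_entries_user ON entries (user_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entries WHERE user_id = 42"
).fetchall()
print(plan_after[0][3])  # e.g. "SEARCH entries USING INDEX idx_entries_user (user_id=?)"
```

On a real MySQL box you'd run the same `EXPLAIN SELECT ...` in the client and watch for `type: ALL` (a full scan) in the output.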
Our CTO, Jim Rubenstein (who was the one teaching me a lot of this stuff as we worked through these issues), mentioned that for each SELECT query you run, you should have an index defined. This is particularly true for high-volume SELECT queries. Indexing costs a little extra work on each insert and some extra hard drive space, but in a day and age of multi-terabyte hard drives, some extra INSERT work is easy to work around.
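You can see the trade-off for yourself with a rough sketch like the one below, again using sqlite3 as a stand-in for MySQL. The table names and row counts are made up, and the timings will vary wildly by machine, so treat the printed numbers as illustrative only: inserts into the indexed table pay a small extra cost, and lookups against it get much cheaper.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (user_id INTEGER, score INTEGER)")
conn.execute("CREATE TABLE indexed_tbl (user_id INTEGER, score INTEGER)")
conn.execute("CREATE INDEX idx_user ON indexed_tbl (user_id)")

rows = [(i % 1000, i) for i in range(100_000)]

# Inserts: maintaining the index adds a little work per row.
t0 = time.perf_counter()
conn.executemany("INSERT INTO plain VALUES (?, ?)", rows)
plain_insert = time.perf_counter() - t0

t0 = time.perf_counter()
conn.executemany("INSERT INTO indexed_tbl VALUES (?, ?)", rows)
indexed_insert = time.perf_counter() - t0

# Lookups: the unindexed table scans all 100,000 rows; the indexed one doesn't.
t0 = time.perf_counter()
plain_hits = conn.execute("SELECT COUNT(*) FROM plain WHERE user_id = 7").fetchone()[0]
plain_select = time.perf_counter() - t0

t0 = time.perf_counter()
indexed_hits = conn.execute("SELECT COUNT(*) FROM indexed_tbl WHERE user_id = 7").fetchone()[0]
indexed_select = time.perf_counter() - t0

print(f"insert: plain {plain_insert:.3f}s vs indexed {indexed_insert:.3f}s")
print(f"select: plain {plain_select:.5f}s vs indexed {indexed_select:.5f}s")
```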
Use fewer queries
Better than even EXPLAINing and indexing, you should write your app so it uses the fewest database queries possible. This doesn't just mean reducing database queries per page load (although that's important too); it means using tools like memcached whenever possible to reduce the number of unnecessary or redundant queries to the database. memcached simply stores stuff in RAM as needed, and while the concept is simple, intelligently caching your data greatly reduces the number of database queries you have to make. Many of our sites at work use memcached, and we are looking into a tool developed by Facebook called Scribe to do for INSERTs what memcached does for SELECTs. When I first started work a few months ago, I had never used any real caching algorithms or practices in any of my projects, but now it's one of my favorite problems to solve and one of the most fascinating.
Tune your webserver(s)
Most web servers (including Apache) are configured by default to run on a server that divides its resources among other software, like a database. If you have the flexibility to allow your server more space in RAM, more CPU cycles, and more hard drive space, open 'er up and let your server work harder. To do this in Apache, we had to alter the MaxClients directive, although it may vary for you depending on what version of Apache you have. Do some research on the web to find the ideal settings and set them appropriately.
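For Apache 2.2 with the prefork MPM, the relevant knobs live in a block like the one below. These numbers are purely illustrative, not a recommendation; the right values depend on how much RAM each Apache process uses and what else is running on the box (note too that MaxClients was renamed MaxRequestWorkers in Apache 2.4):

```apache
# Illustrative values only -- size MaxClients to (available RAM) / (RAM per process)
<IfModule mpm_prefork_module>
    StartServers          10
    MinSpareServers       10
    MaxSpareServers       25
    MaxClients           256
    MaxRequestsPerChild 4000
</IfModule>
```

Setting MaxClients too high is its own failure mode: if the processes outgrow RAM, the box starts swapping and everything slows down at once.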
When all else fails, buy more hardware
Without going into the details of our configuration, we scaled from our single webserver (which is a beast of a server, by the way) to four webservers behind a load balancer. Since implementing this scheme, along with the earlier fixes, we have had few problems. Let me say, though, that unless your app or site is huge, there are probably more optimizations to be done before you reach for this option.
I think the most important piece of advice I can give you is to not panic. As Jim and I worked through this last week, he noted that it's a good problem to have, and that's true. After all, if you're getting slammed, it means people are finding compelling reasons to use your app, correct? Our president (who's had a lot more experience than us) also said that with these kinds of problems, you face growing pains, but you fix them on the fly and you learn. I think it's also true that until your app is on the web, there's no telling how it'll perform. No matter how much you test, no matter how much you benchmark, stuff is bound to go wrong; it's just a matter of coming back and making it better.