More than five years ago, I started work to with a small company with a simple idea. Rent Like a Champion was founded by alumni of the University of Notre Dame, which has a very strong culture of college football games. (Note to my non-American readers: We’re talking here about American football, rather than “real” football.) However, South Bend, Indiana doesn’t have hotels close to the stadium, meaning that people don’t have a nearby, convenient place to stay when they come into town. RLAC thus offers homeowners the chance to rent out their homes on football weekends to visitors. Homeowners make some extra money, and football fans get to see the games they wanted while staying nearby.
I took over for a previous programmer, improving the site’s management and e-commerce features. RLAC has since grown massively, and my role has grown along with them. (And, I hasten to add, I’m no longer doing day-to-day coding on the site; my employee Genadi is doing a fantastic job of that, leaving me to handle more architectural, strategic, and managerial issues on the technology side.) There are still vestiges of the code that I inherited in 2010, but a great deal has changed and improved since then. We are using the latest version of both Ruby and Ruby on Rails, the latest version of MySQL, have both staging and production servers, and have a large number of automated tests. The site rarely goes down or is overwhelmed by traffic.
On the business side, we’re now on several dozen US college campuses, handle incoming and outgoing payments automatically, and even allow people to negotiate on the dates and prices of where they’re staying. Really, I couldn’t be prouder to be associated with this company.
A few months ago, we got some big news: Rent Like a Champion was going to be on Shark Tank, a reality TV show that allows you to pitch your company to investors. We didn’t know what this would mean for the company’s future, but we did know that it would mean lots of additional traffic. The numbers we heard were in the area of 10-20 thousand requests per second. This, from a site that had only 10-20 requests per second on our busiest days. Meaning, we had to figure out a reasonable way to scale our site up so that it could handle about 1,000 times the traffic we were used to seeing.
Now, Rails has a reputation for not being scalable. But it’s clear that Rails can scale, given enough computers. When someone says that it’s “not scalable,” what they probably mean is that given a certain amount of traffic, Rails requires more servers than other languages and frameworks. And yes, that’s probably true. But we clearly weren’t going to change our entire technology stack just for a few hours of television time.
We thus took a multi-pronged approach, working on every part of the site — from the infrastructure, to the application, to the servers we were using.
The bottom line, by the way, was that we more than survived the ordeal: Not only did we handle about 8,000 simultaneous requests/second for an extended period of time, but our servers were barely breaking a sweat. And as if that weren’t enough, we got investments from both Mark Cuban and Chris Sacca, two well-known billionaire investors.
So, what did we do to scale up? And what does this mean for projects you’re doing?
No state on the server
First and foremost, modern Web applications can be inherently scalable, if you design them correctly. Rails, like many other modern frameworks, assumes a “zero state” situation on the Web application server. This means that no user-related state is stored on the server itself. Instead, we store all such state in the database. This means that we can add as many servers as we want, because the actual data won’t be stored on the Web server.
One of the potentially tricky parts has to do with user sessions. Every modern Web framework offers developers the chance to use sessions; the user’s cookie contains an ID (typically encrypted) that allows us to look on disk, in memory, or in a database for the user’s session information.
If the session information is stored in a disk file, then you can only use a single Web server. That’s because if the same user visits several servers in the same day (which happens all of the time, if you have several servers behind a load balancer), and the session info is spread across multiple servers, they’ll effectively be logged out when they reach the second server. You can avoid this by storing session information in the database, but then that’ll slow down the application.
We decided to use cookie-based session information, in which the cookie itself contains the user’s session info. That means we don’t have to worry about how many servers we’re using, because the session info will be available to all of them. If you only store a user ID in the session, then the cookie will be particularly tiny (albeit encrypted), and thus you don’t have to worry about the size.
One part of our scaling was thus to ensure that users could move freely from one server to another. We then put all of the Web servers behind a single load balancer, giving the illusion that RentLikeAChampion.com was a single machine, but actually distributing the load among many. The user’s cookie would go from their browser, to the load balancer, to the ultimate machine. From there, an ActiveRecord lookup via the user ID gave the user’s information.
Bottom line: Your app might already be more scalable than you think, if session information is stored in cookies. If you’re using files to store session information, though, your application is inherently non-scalable; once you use more than one Web server, the user might end up being logged out.
Use VMs, but not necessarily AWS
Several of the biggest trends we’ve seen in the last few years involve the combination of cloud computing and virtual machines. Thanks to VMs, we can spin up as many servers as we need, only when we need them.
I’ve long been of the opinion that putting a server “in the cloud,” as everyone likes to say nowadays, is often unnecessary. There’s nothing wrong with a plain ol’ server, after all, especially since such servers are often going to be cheaper and easier to maintain. (I’m getting to the point where I might be changing my tune, as I see the advantages of deploying to a new VM or container with each release, rather than upgrading each existing release.)
However, in the case of RLAC, moving to the cloud was a no-brainer — not because we need it for our permanent infrastructure, but because the key worry we had was being able to scale up quickly. We didn’t know how many requests we would get, and putting in place a set of “real” servers that could handle that capacity would cost a fortune. Besides, we knew that we needed to scale up quickly for (and during) the Shark Tank airing, but then scale down just as quickly following the show’s broadcast.
Amazon Web Services is the first name that people think of in this space. We decided to go a different route, in no small part thanks to the suggestions and connections of RLAC’s CIO, Mike Hostetler: We went with Server Central, a Chicago-based company that has massive bandwidth, and lots of experience configuring virtual machines for this kind of work. Sure, AWS might be better known. But one of the advantages we got from Server Central was actual, in-person help — something that Amazon wouldn’t provide for a player as small as ourselves.
I have to say that Server Central’s staff amazed and impressed me at every turn: They were available, helpful, and polite, and knew a ton about how to configure and tune our VMs for maximum benefit. They set up our MariaDB cluster (more on that below), as well as a load balancer. They helped us to clone VMs ahead of schedule, and to connect them (virtually, of course) with our network.
In the end, we didn’t go with a Chef or Puppet configuration of our VM, but rather used a combination of Capistrano to deploy our software our main VMs, which we then cloned for some backup VMs. Not super elegant, I’ll admit, but it worked just fine, and meant that we could eliminate another learning curve. Mike H. worked extensively with Server Central, and really got the hang of configuring and deploying those VMs, such that we had more than 20 available for the night of broadcast.
Bottom line: If you want to scale up and/or down quickly, then VMs are almost certainly the way to go. And if you’re looking to get actual service, rather than a faceless SaaS company, I’d definitely suggest speaking with Server Central.
Switching from Apache to nginx
I’ve said it on many occasions: I have a warm spot in my heart for htApache httpd. I’ve been doing Web development since before the first version of Apache was released, much less the Apache Foundation was founded, and I have always found it to be an easy to use, flexible HTTP server.
However, there’s no doubt that another server, nginx, offers greater performance and scalability than Apache. In the case of RLAC and Shark Tank, we were far more interested in scalability than ease of use. However, we didn’t want to have a super-hard learning curve and transition, which we feared would be the case if we moved to a combination of nginx and Unicorn, a popular duo in the Ruby on Rails world. We thus settled on using nginx and Phusion Passenger, a plugin that provides Ruby on Rails applications which we had previously used with Apache.
I have to admit that I was pleasantly surprised on all fronts by nginx. It installed without a hitch, thanks in no small part to Passenger’s super-easy installation process on Linux boxes. The configuration is quite different from Apache, but not as difficult as I would have expected. It allowed us to use our existing SSL certificates quite easily. And the performance, without a doubt, was quite impressive. Indeed, any performance issues we saw were the result of Passenger, rather than nginx; Ruby and Rails both use lots of memory, and thus there is a limit to the number of simultaneous requests you can handle per machine.
Bottom line: I hate to say it, but I don’t see a big advantage to Apache any more. nginx documentation and tutorials are quite good, support for Rails applications is excellent with Passenger, and the performance we saw quite very good. Configuration of Apache is still easier for me, but that’s going to be true for anything I’ve been doing for 20 years, I expect.
With the Web servers scaling up nicely, the biggest potential bottleneck suddenly became our database. The database on RLAC, as with most Web applications, is involved in every single page displayed, from showing the user’s name to listing homes for a particular football game, to letting homeowners set the prices for individual games and events.
The problem is that the database is a finite resource; if too many people come and visit the site at the same time, the database will cease responding to some of them, causing a domino effect that will cause many people to get errors or timeouts.
I’m a big fan of PostgreSQL, which I have been using for about 20 years. But when I inherited the RLAC code, we were already using MySQL on the site — and switching technologies is almost never a good idea if things are already working, so I didn’t move things around.
While PostgreSQL’s master-master replication is still being discussed and designed, MySQL has master-master, high-availability clustering working already. Even better, following the purchase of MySQL by Oracle (via their purchase by Sun), Monty Widenius, the original author of MySQL, has been working on MariaDB, a MySQL-compatible plugin that offers superior performance.
Server Central suggested that we use a HA master-master cluster of VMs, all running MariaDB. From our perspective as developers, it looked and felt like a normal MySQL database. But the performance was rock solid and super fast, and it meant that we didn’t need to worry about the database being a bottleneck, unless we were crushed by people actually trying to use the application.
Importing our old database (from the MySQL server on Rackspace) to our new MariaDB cluster was fast and easy. The two database servers feel almost identical, and the dump-restore cycle took a matter of minutes.
Bottom line: I feel somewhat chastened; after years of telling people how much better it is to use PostgreSQL, and how master-slave fallover is probably fine for most purposes, I was pleasantly surprised to use a master-master cluster that more than suited our needs. PostgreSQL could probably have handled the load just as well, of course, but switching to a different database just before a major PR event would have been a very bad idea.
The above was a great way to get our system to work under high loads. But it’s always possible to optimize things further, and one of the best-known ways to make Web applications increase their speed is to add caching. If you can cache a page, then everyone benefits: The user gets a must faster response, and the server doesn’t have to spend its time running Ruby and SQL, because the request has been served without even touching the server.
On RLAC, we used three different types of caches. These worked together spectacularly well, improving our performance massively. Our working assumption was that we would have a number of VMs on standby in case we would need them, and that most users would look at a few pages before bouncing off of our site. But we still knew that we would need to server thousands (and many tens of thousands) of requests per second, many of which just wanted to see our home page while watching Shark Tank on their TVs.