sprites server

The sprites project leverages Node.js for its server logic. The server is a surprisingly simple piece of JavaScript code which accepts long-poll COMET HTTP connections from its C++ clients in order to push JSON-formatted sprite information in real time. Sprites can be thrown from one client’s desktop to another via HTTP. The server maintains a network topology (which is worth a separate post in itself; it turned out to be an interesting algorithm), which it uses to send each posted sprite to the appropriate neighbor.
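
As a rough illustration (the real request paths and payload format aren’t spelled out in this post, so the endpoint names and JSON fields below are invented), the exchange from a client’s point of view looks something like this, with curl standing in for the C++ client:

```sh
#!/bin/sh
# Hypothetical sketch of the client/server exchange. The host, endpoint paths,
# and JSON fields are placeholders; the real client is a C++ program.

SERVER="http://sprites.example.com:8000"

# COMET long-poll: this GET blocks until the server has a sprite to push,
# then returns a JSON description of the incoming sprite.
curl -s "$SERVER/poll?client=desktop-42" &

# Throwing a sprite at a neighbor is just an HTTP POST of sprite data.
curl -s -X POST "$SERVER/throw" \
  -H "Content-Type: application/json" \
  -d '{"client":"desktop-42","sprite":"cat","x":640,"y":480,"vx":-12,"vy":3}'

wait
```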

The asynchronous sockets in Node.js scale very well. Combine that with JavaScript executing on the blazing-fast V8 engine, and you have the foundation of a very simple and flexible server platform which scales mightily.

So, how well does it scale?

In order to test that, I decided to leverage Amazon EC2 to run a bunch of fake clients to simulate network usage. Each EC2 instance can make thousands of connections to the sprites server, and simulate throwing sprites around.

To implement this stress test, I first booted up a single instance with one of the default Linux templates. From there it only takes about five minutes to install a basic build environment using “yum install subversion make gcc-c++”. After that, I modified /etc/rc.local to do a few things (a sketch of the finished rc.local follows the three steps below):

1) ulimit -n 999999

This is necessary to get around the default file handle limit, which applies to sockets as well, and which would otherwise prevent the instance from making more than about 1024 connections.

This command must also be used on the Node.js server, which you don’t really see documented anywhere! If you don’t increase the file handle limit on your Node.js server, somewhere down the line you are going to run into clients being unable to connect, along with the server suddenly deciding to peg the CPU at 100% and behave strangely.

I’ve seen all sorts of posts around the net from people struggling to figure out why their Node.js server doesn’t scale quite as high as they think it should. Well, if you’re not increasing your file handle limit, that would definitely be one explanation. Beyond that, it’s obviously critical to write very performance-conscious JavaScript code.

2) svn update the test working directory

This is done so that each instance is always up to date with the latest test code after a reboot, which greatly simplifies scalability stress testing. You just start and stop instances from the EC2 console to test the latest code at whatever scale you want. You could literally test a million concurrent connections using this technique (though I’m still waiting on an EC2 quota increase to actually give it a try; by default they stop you at 20 instances).

3) cd into the test working directory, build, and run the test
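
Putting the three steps together, the /etc/rc.local on each load-generator instance ends up looking roughly like the sketch below. The repository URL, working directory, and test binary name are placeholders, and the same ulimit line belongs in whatever script starts the Node.js server itself.

```sh
#!/bin/sh
# Sketch of /etc/rc.local on an EC2 load-generator instance.
# The paths, repository URL, and binary name are placeholders.

# 1) Raise the file handle limit so this instance can open far more than the
#    default ~1024 sockets. The stress client started below inherits it.
ulimit -n 999999

# 2) Pull the latest test code so a plain reboot picks up new revisions.
cd /home/ec2-user/sprites-test || exit 1
svn update

# 3) Build and run the simulated client swarm in the background.
make
./stress_client --server sprites.example.com --connections 1000 &
```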

That’s it! I use a Windows development machine to create the cross-platform test code, and whenever a change is made, I commit it to Subversion and recreate/reboot all the EC2 instances. They automatically update and swarm the server. It’s a beautiful thing 🙂

Since each EC2 instance costs $0.02 per hour (yes, 2 cents), this is actually incredibly cheap. Each test run takes well under an hour, so for 20 servers with 1000 simulated connections each, you can test 20,000 users for 40 cents! To test a million clients, it would run you about 20 bucks. Not bad… You could be even more frugal by packing more connections onto each instance, and maybe by using Spot Instances for a lower hourly rate.
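
As a quick sanity check on those numbers, the arithmetic is just clients divided by connections per instance, multiplied by the hourly rate:

```sh
# Rough cost arithmetic: instances = clients / connections-per-instance,
# hourly cost = instances * $0.02.
awk 'BEGIN {
  rate = 0.02                               # dollars per instance-hour
  printf "20,000 clients:    %4d instances  $%6.2f/hour\n",   20000/1000,   (20000/1000)*rate
  printf "1,000,000 clients: %4d instances  $%6.2f/hour\n", 1000000/1000, (1000000/1000)*rate
}'
```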

7 thoughts on “Node.js scalability testing with EC2”

  1. So how did it perform? And I was interested in how the network topology was implemented as well (how appropriate neighbors are picked, etc).

  2. The performance is really good. So far, I have only tested up to about 20,000 concurrent connections, each connection simulating a sprite send every 5 seconds. That works out to 4,000 requests per second, each of which has a corresponding response, plus the overhead of connecting and of maintaining the COMET long-poll for each client. The server was using about 40% of a single core, and memory usage was well below the maximum.

    The machine it’s running on isn’t super beefy, either. Also, when I ran the tests, the sprites server hadn’t yet been separated from the machine which runs my blogs and SVN repository.

    I have a request pending with Amazon to increase my quota. They actually replied already and said it was accepted, but it hasn’t taken effect yet. I’ll give an update once that happens, and see how far the performance can be pushed.

    I’ll be writing a separate post about the network topology. In short, it’s akin to a 2-dimensional, circular, doubly linked list. Visually, you can picture it as a torus where individual segments can be added and removed, with the network automatically re-stitching itself together in such a way as to minimize the number of nodes affected by any given add or delete.

  3. I have to say, there is a right way, and a wrong way to do things. Essentially disabling ulimit’s file descriptor handling is not the right way to do things, which is why many systems just won’t allow you to modify that parameter.

    You can theoretically reach significantly higher numbers by just doing some simple process management. You’ll end up with a much more graceful and scalable solution, and you’ll keep people like me from making rageface. 🙂

  4. That’s not always the most efficient way (dividing into multiple processes) – this particular usage of Node.js (Sprites) is far more efficient when run from a single process. It doesn’t bother me if you make a rageface, since I’ve already weighed the pros and cons and am definitely happy with the solution I picked. Rage away, though!

  5. Anyway, ulimit -n is system wide, so dividing into multiple processes still has the same issue. It’s obviously far cheaper to run 1 server, when you can, than multiple.

  6. I’m curious how you found those micro instances at 2c per hour worked, because I just did a test on one and found that the sustained request rate each can manage without getting *silenced* (at the network level) is about 100 requests per second.

    You can burst over 1000 requests per second, but within seconds get shut down.

    This is different from the limit on CPU usage, which is a known throttling problem.

    So did you strike this new-connection rate limit? Or are your EC2 load generators only opening and then holding open each socket, which perhaps gets around the micro instance limits?

    What limit on “your account” did you need lifted? Is there a global limit on simultaneous open connections across all your instances?

    thanks.
