UPDATE: Broke the 250k barrier, too :]

The node.js powered sprites fun continues, with a new milestone:

That’s right, 100,004 active connections! Note the low %CPU and %MEM numbers in the picture. To be fair, the CPU usage does wander between about 5% and 40%, but it’s also not a very beefy box. This is on a $0.12/hr Rackspace 2GB cloud server.

Each connection simulates sending a single sprite every 5 seconds. The destination for each sprite is randomized, with an equal distribution across all nodes. This means there is traffic of 20,000 sprites per second, which amounts to 40,000 JSON packets per second. This doesn’t even include the keep-alive pings which occur on a 2-minute interval per connection.
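The arithmetic behind those numbers can be sketched in a few lines. The figure of two JSON packets per sprite is an assumption inferred from the 20,000 to 40,000 ratio above (presumably one packet in and one out), and the variable names are made up for illustration:

```javascript
// Rough traffic estimate for the test described above.
const connections = 100000;        // simulated clients
const sendIntervalSec = 5;         // each client sends one sprite every 5 seconds
const packetsPerSprite = 2;        // ASSUMPTION: inferred from the 20k -> 40k ratio
const keepAliveIntervalSec = 120;  // keep-alive ping every 2 minutes per connection

const spritesPerSec = connections / sendIntervalSec;
const packetsPerSec = spritesPerSec * packetsPerSprite;
const pingsPerSec = connections / keepAliveIntervalSec;

console.log({
  spritesPerSec,
  packetsPerSec,
  pingsPerSec: Math.round(pingsPerSec),
});
```

By this estimate the keep-alive pings add less than a thousand extra requests per second, small next to the sprite traffic.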

At this scale, the sprite network topology remains very responsive. With my desktop PC set up as a neighbor of my laptop, a sprite thrown off the desktop’s screen arrives at the laptop so fast that I can’t gauge any latency at all.

Here are a few key tweaks which contribute to this performance:

1) Nagle’s algorithm is disabled

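If you’re familiar at all with real-time network programming, you’ll recognize disabling this algorithm as a common socket tweak. With Nagle’s algorithm off, small writes are sent immediately instead of being held back and coalesced, so each response leaves the server much quicker.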

The tweak is available through the node.js API “socket.setNoDelay”, which is set on each long-poll COMET connection’s socket.
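As a minimal sketch of how this could be wired up (not the actual sprites server code): the helper name disableNagle and the trivial request handler are hypothetical, while server.on('connection', ...) and socket.setNoDelay(true) are standard node.js APIs.

```javascript
const http = require('http');

// Hypothetical helper: disable Nagle's algorithm on every socket the server accepts.
function disableNagle(server) {
  server.on('connection', (socket) => {
    // true = send small writes immediately instead of buffering them
    socket.setNoDelay(true);
  });
  return server;
}

// Placeholder handler standing in for the real long-poll logic.
const server = disableNagle(http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ ok: true }));
}));

// server.listen(8080); // uncomment to accept connections; the port is arbitrary
```

Hooking the 'connection' event means the option is applied once per TCP socket, before any request arrives on it.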

2) V8’s idle garbage collection is disabled via “--nouse-idle-notification”

This was critical, as the server pre-allocates over 2 million JS Objects for the network topology. If you don’t disable idle garbage collection, you’ll see a full second of delay every few seconds, which would be an intolerable bottleneck to scalability and responsiveness. The delay appears to be caused by the garbage collector traversing this list of objects, even though none of them are actually candidates for garbage collection.
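For illustration, here is one way the flag could be passed when launching the server from another node.js script. This is a sketch under assumptions: 'server.js' is a placeholder file name, the helper names are made up, and the flag belongs to the node.js/V8 versions of this post's era, so newer releases may reject it.

```javascript
const { spawn } = require('child_process');

// Build the argument list for a node process with idle-notification GC turned off.
// The flag must come before the script name so that node/V8 consumes it.
function buildLaunchArgs(script, extraV8Flags = []) {
  return ['--nouse-idle-notification', ...extraV8Flags, script];
}

// Hypothetical launcher: starts the script in a child node process.
function launch(script) {
  return spawn(process.execPath, buildLaunchArgs(script), { stdio: 'inherit' });
}

// launch('server.js'); // placeholder script name
```

Running the equivalent command by hand (node, then the flag, then the script name) has the same effect.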

I’m eager to experiment further by scaling this up to 250k connections. The only thing keeping that test from being run is the quota on my Amazon EC2 account, which limits the number of simulated clients I can run simultaneously. They have responded to my request to increase the quota, but sadly it hasn’t taken effect yet.

The sprites source code, both client and server, is available via Subversion. The repository URLs are provided on the sprites web site.

http://sprites.caustik.com/

For more information about the testing and tweaks involved in scaling the server, check my previous post Node.js scalability testing with EC2.

28 thoughts on “Scaling node.js to 100k concurrent connections!”

  1. This is awesome stuff. I work on a lot of “traditional” stacks that often struggle with this scenario, especially if customers are flooding in requests or needing responses at a high rate. This had def piqued my interest and I look forward to your future posts on this topic.

    Just looking for the “validation” for Node in our systems.

  2. You can use the flag “--expose_gc” to make the JS function “gc();” available. That triggers garbage collection at your whim, so for example it could run every hour or so via setTimeout or setInterval.

    I’m trying to find a way to detach JS Objects from the heap, for the dual purpose of excluding them from the time consuming GC traversal, and to exclude them from the address space limitations imposed by the heap.

  3. Charlie — I have been able to hit 250k concurrent with Node.js – the only current limitation appears to be V8’s heap addressing limitation (1GB) – combined with the overhead in JS heap per-connection in Node.js — basically, once you have 250k connections going, your JS heap is riiiiight at capacity and the garbage collector isn’t having any of that. The GC just starts churning away hopelessly trying to scrounge up enough memory to continue.

    I think it would be possible to skirt this issue using “cluster” in Node.js, since each child will have its own heap address space. I have a few other things to investigate, but if none of those pan out I’m going to go that route.

  4. Thank you for the advice! Last year we built a real-time stream with multiple content types that was struggling at 30k for no apparent reason. Will definitely try your way this year.

  5. I’m new to node.js and I found your project very interesting. I couldn’t find the source code (trunk returns 404 Not Found). Is there any chance you could provide the node.js code somewhere (like GitHub)? I also wonder whether the part that fakes the 100k connections is in the code or not.

    Thanks
    Zareh

  6. Hi, the testing environment is a static page where the client does not usually change or reload the page, right?
    Did you ever test with a dynamic environment where clients change pages much more often, like a content or forum website? There the socket closes and reopens every time the client changes pages.

  7. it would have been nice if there were some solid scripts that would let me fully reproduce the outcome. (it’s the cold fusion thing all over). Trust but verify.

  8. I would like to download the client and server parts to try it out with 100k sessions. Would you please show me how? Thanks so much.

  9. Great article. I know that this is an extremely dated reply, but I have an important correction.

    One minor redux:

    “--nouse-idle-notification”, contrary to popular belief, does not stop automatic garbage collection; it just makes it run “less” frequently (i.e., the gc won’t run on idle notifications, but it can run on other occasions when it sees fit). There’s no way to completely disable node.js automatic garbage collection (unless you change a considerable amount of source code and recompile the v8 kernel from scratch).

    You can use `node --expose_gc --trace-gc --trace_gc_verbose --gc_global .` to see what the app is doing in the background.

    Keep up sharing your experiments.
    I love it!

    — Volkan.
