I’ve decided to ramp up the Node.js experiments and pass the 1 million concurrent connections milestone. It worked, using a swarm of 500 Amazon EC2 test clients, each establishing ~2000 active long-poll COMET connections to a single 15GB Rackspace cloud server.

This isn’t landing the Mars rover or curing cancer; it’s just a pretty cool milestone, IMO. Hopefully it’s of some benefit to Node developers who want to handle a large number of concurrent connections and can use these settings as a starting point in their own projects.

Here’s the connection count as displayed on the Sprites page:

Here’s a sysctl dumping the number of open file handles (sockets are file handles):
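
(On Linux, the relevant counter is fs.file-nr, which prints three values – allocated file handles, free file handles, and the fs.file-max ceiling:)

sysctl fs.file-nr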

Here’s the view of “top” showing system resources in use:

I think it’s pretty reasonable for 1M connections to consume 16GB of memory, but it could probably be trimmed down quite a bit. I haven’t spent any time optimizing that. I’ll leave that for another day.

Here’s a latency test run against the COMET URL:

The new tweaks, placed in /etc/sysctl.conf (CentOS) and then reloaded with "sysctl -p":

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535
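
One more prerequisite, covered in the previous posts and only summarized here: the per-process file descriptor limit must also be raised above 1M (each socket is a file handle), or the kernel will refuse new connections long before these TCP settings matter. For example (values illustrative):

# /etc/security/limits.conf
*    soft    nofile    1048576
*    hard    nofile    1048576

# then verify in the shell that launches node:
ulimit -n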

Other than that, the steps were identical to the steps described in my previous blog posts, except this time using Node.js version 0.8.3.

Here is the server source code, so you can get a sense of the complexity level. Each connected client is actively sending messages, for the purpose of verifying the connections are alive. I haven’t pushed that throughput yet, to see what data rate can be sustained. Since the modest 16GB of memory was already consumed, that would likely have caused swapping and meant little. I’ll give it a shot with a higher-memory server next time.
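
The full source was published alongside the original post; since it isn’t reproduced here, the sketch below only shows the general shape of such a long-poll server. This is hypothetical code, not the actual Sprites source (the per-client graph structure and two-port setup discussed in the comments are omitted):

var http = require('http');

var clients = {};   // id -> parked http.ServerResponse
var nextId = 0;

http.createServer(function (req, res) {
  var id = nextId++;
  // Park the response; with no Content-Length, Node keeps the
  // connection open using chunked transfer encoding.
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  clients[id] = res;
  req.on('close', function () { delete clients[id]; });
}).listen(8080);

// Periodically write a small message to every parked connection,
// both to verify liveness and to exercise the send path.
setInterval(function () {
  for (var id in clients) {
    clients[id].write('ping\n');
  }
}, 30000);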


71 thoughts on “Node.js w/1M concurrent connections!”

  1. I’m guessing you mean the EC2 swarm was actually test clients connecting to a single Node server? It’s kinda ambiguous – you might wanna expand on that a little.


  2. Why in the world did you need 500 EC2 instances for client connections? With the right settings, you should be able to get to 1M with about 20 instances.


    1. Tim – Let me know when you’ve pulled that off using EC2 instances, with an overall cost less than 500 micro spot instances. Also, doing it this way is more realistic – why complain that -too many- unique servers were used, when it’s meant to simulate the real-world scenario of each client being a unique IP? I’m just as likely to get a comment asking why I didn’t use -more- unique instances 😉 — anyway, my point is it’s not as brain-dead a choice as you might think. I’m erring on the side of a more expensive, but more realistic, testing procedure.


  3. Well played, sir. I will look into your tweaks some more. At some point the project I’m bootstrapping with a couple partners may need your consulting services. Cheers.


  4. Just saw some limitations and wanted to find a way around them.

    Also, it seems to really upset anti-Node web zealots, trolls, and smug redditors, so that’s pretty satisfying.


  5. Great work! Now the outcome of a similar test for a more common Node.js stack (Node.js with Connect/Express and Socket.IO WebSocket connections) would be interesting.

    I’ll have a look at your setup and check if such a test can be created that allows comparing results to your simulation outcome.


  6. Adding more inbound unique IPs into the mix isn’t testing anything further, unless you believe your TCP stack to be broken in some fashion.


  7. Nice way to use 500 EC2 instances to generate such a large volume of requests in a real-world scenario. This also gets one past the I/O and bandwidth throttling that AWS imposes on a single instance. Can you tell me which load-generation (client) software you used? And did you work with AWS before the test, like filling out the penetration test form?


  8. Yes, the images are back – thank you for such a quick response.

    “Also, it seems to really upset anti-Node web zealots, trolls, and smug redditors, so that’s pretty satisfying.”

    I liked this comment, as this issue annoys me too, especially on Reddit.


  9. I am trying to reproduce your tests (which are very interesting) and I have run into a roadblock.

    My tests differ from yours because I am running the node.js server on a 16GB m1.XL at EC2, and I am trying to run m1.L instances for the clients.

    The problem appears on the server: once I reach about 150,000 connections, node/V8 keeps trying to Scavenge memory. This causes my client connections to time out. Did you have this problem at all?

    Thanks.


  10. Caustik, I did read that, and I have been using the ulimit change as well as the following command for running the script:

    node --trace-gc --expose-gc --nouse-idle-notification --max-new-space-size=2048 --max-old-space-size=14336

    Are you saying that you needed to custom-build V8 with those two changes to get up to 1M connections?


  11. Have you tried this test on any of the v0.8.x releases? I forgot to mention that earlier, but I am trying it on v0.8.8 and I have also added the V8 settings, but I still get the same behavior.

    Here is an example of the output:

    313938 ms: Scavenge 850.2 (900.0) -> 849.6 (900.0) MB, 1 ms [Runtime::PerformGC].
    316877 ms: Scavenge 850.4 (900.0) -> 849.6 (900.0) MB, 1 ms [Runtime::PerformGC].
    320518 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [Runtime::PerformGC].
    324514 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [Runtime::PerformGC].
    328120 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [Runtime::PerformGC].
    331594 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [allocation failure].
    bdceb1f0-f0a7-11e1-8b35-4936f06b29eb
    335123 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [allocation failure].
    338720 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [Runtime::PerformGC].
    342342 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [allocation failure].
    345963 ms: Scavenge 850.6 (900.0) -> 849.6 (900.0) MB, 0 ms [Runtime::PerformGC].


  12. arnabc – the redditor hate is even better now that their own website couldn’t handle 5M requests per hour (not sure whether it was over 1M concurrent at any point during that time). It would be a nice test to replicate the load that took down reddit, and show it being handled by a single Node.js server. Of course there’s complexity in the reddit back-end, for all the various site features, but it’s still fundamentally a connected graph structure, like Sprites is.


  13. Amazing.
    What would the configuration options be to increase UDP performance in the same way? The current max on our servers is 15k msg/sec, which is no joke, but I am certain it can improve.


  14. Hi!
    How can I use 8 GB of memory?

    My test:

    var ph = [];
    while (true) {
      ph.push('7232985jkdjf');
    }

    node --trace-gc --max-old-space-size=8192 heaptest.js
    31 ms: Scavenge 2.1 (35.0) -> 1.8 (36.0) MB, 0 ms [allocation failure].
    33 ms: Scavenge 2.6 (36.0) -> 2.4 (36.0) MB, 0 ms [Runtime::PerformGC].
    ……..
    3108 ms: Mark-sweep 712.2 (746.7) -> 428.0 (462.4) MB, 504 ms [Runtime::PerformGC] [GC in old space requested].
    4641 ms: Mark-sweep 1067.6 (1102.0) -> 641.2 (675.6) MB, 752 ms [Runtime::PerformGC] [GC in old space requested].
    FATAL ERROR: JS Allocation failed – process out of memory

    ubuntu server x64
    node -p -e "process.arch" >> x64
    node -v >> v0.8.9


  15. Can you explain some of your server code – just the reasoning for using nodes? Is that just to re-use IDs? Any reason not to just dump all the connections in an array? And maybe a queue with redis or something for push/pop of IDs?

    Other than that, the server code seems pretty standard; the kernel tweaks are, IMO, the more valuable part.


    1. I decided to use nodes because each connection needs to efficiently find out who its neighbors are. If each node knew its neighbors only by ID, as opposed to holding a direct reference, each traversal step would require a look-up in an associative array. For finding the neighbor 2 positions to your left and 2 positions up, for example, a graph structure is just a little easier to work with IMO (e.g. pNode->pLeft->pLeft->pTop->pTop). Also, although chromium’s associative arrays may be O(1) complexity, they involve the overhead of a hash function and the potential for collisions, which, when dealing with hundreds of thousands of connections, adds up enough to be a prohibitive performance bottleneck.
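
      In code, the structure is roughly this (a simplified sketch, not the actual Sprites source):

      function GridNode(id) {
        this.id = id;
        // Direct references to neighbors, so traversal is pure
        // pointer chasing with no hash lookups:
        this.pLeft = null;
        this.pRight = null;
        this.pTop = null;
        this.pBottom = null;
      }

      // the neighbor two positions left and two up:
      // var n = node.pLeft.pLeft.pTop.pTop;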


  16. @caustik – Thanks, that makes sense. So traversal for sending responses is much faster, and any lookups of specific nodes to send targeted messages would be much faster too. Something I didn’t think of.


  17. I’m really interested in this topic. I’m trying to emulate something similar, but using SSE. I’m stress testing a server to see how many concurrent SSE streams it can hold open.

    Is there a chance to see the code for the client side?

    The server listens on port 8080 so the client can request a session ID and then open the SSE stream on port 8081, where the workers are listening. The problem is that the client doesn’t seem to be able to open the socket on that port. I’m still struggling to figure out how you manage the connections on both ports.

    As a single-threaded application, with no workers and listening on a single port, it works fine though.

    Thanks.


  18. Hello!

    I’m trying to use your server file to create my own server to deliver push notifications, but I have a problem. I can send the messages and receive them, but I don’t know how to use them in the client. I mean, if I point to server:8080, the page keeps loading and I can see it receiving the messages in Firebug, but I have no idea how to add a listener to it. Should it be long-poll AJAX?


  19. Just wondering, if you don’t mind, what the cost of this little test was for you?

    I’m looking at WebSockets, using Python at the moment (though possibly moving away from it), and would be interested in hearing more about this.


    1. I don’t remember the exact cost, but it was a few hundred USD, I think. Pricing may be different by now, and you can also reduce costs quite a bit by using Spot Instances on Amazon, for example.


  20. Good job, Caustik!

    Shahzad Bhatti has published some very interesting tests of connection handling with Node.js and Vert.x.

    He pushed up to 24,000 connected clients on Node.js / Vert.x, and Node.js became really slow at receiving messages under those conditions (while Vert.x seemed stable).

    http://weblog.plexobject.com/?p=1698

    So, when I see you are able to push the limit to 1M, I wonder whether it’s really possible for Node.js to work correctly under those conditions?

    It would be really interesting if you could run this kind of test!

    Or better, if you could compare with Vert.x 🙂


  21. I agree with the note above that “a high number of simultaneous clients has been achieved on different platforms such as Node”. Here is the difference: in many cases, test engineers will be assigned to create test code that emulates the complex interaction between client and server (to load test the high-performance server). So the test platform needs to make it easy to develop high-performance test scripts without callbacks or too much syntactic sugar (imagine having to teach someone what “public static” means).

    Sorry, this deviates a little from the main topic of this blog (thanks for it!), even though it’s related.


  22. Hi,
    I have a 32-core system with 48GB of RAM. What should my sysctl.conf settings be? If there is logic behind these values, please explain it.

    Thanks
    Sai


  23. Hi,

    I am running 24k connections, and after that my clients get disconnect events… I increased the ephemeral port range, and I don’t think it is a problem with the 3 simulation clients. I have absolutely no way of knowing what’s causing the disconnect events…

    Do you happen to know anything about this?

    Thanks!

