Pushing the performance limits of node.js

We love node.js and Javascript. We love them so much, in fact, that when Jut decided to build a streaming analytics platform from scratch, we put node.js at the center of it all. This decision has brought us several benefits, but along with those came a few unique scaling challenges. With some careful programming, we've been able to largely overcome node.js's limitations. I'll share some of the tricks we used, so you can go ahead and make new mistakes (err... "lessons learned") all on your own.

Wait... what did Jut build?

OK, so I need to give you a little context before diving in. Jut calls its product an "operations data hub." It's a streaming analytics platform designed for dev and ops teams to collect all their operational data, like logs, metrics, and events, and then do whiz-bang analysis and correlation against the whole lot. Forget about deploying Graphite and a separate logging tool and whatever else. Nirvana, right? Yeah, all we needed to do was build a system that can deal with real-time data, historical data, unstructured data, and structured data all at the same time.

So here's what we've built:


Architecture of Jut Operations Data Hub

As you can see, there are a few moving parts to this system. The biggest and heaviest are our big data back-ends: Elasticsearch and Cassandra. We rely on these systems for historical data processing, storage, and general data resiliency and management capabilities. Jut (and our dataflow language, Juttle) uses node.js as the "smart analytics layer" that unifies these systems and ties the whole platform together.

That should give you enough context... let's dig into how we use node.js.

Node.js is at the heart of Jut

The heart of the Jut platform is the Juttle Processing Core (JPC). The JPC is responsible for running Juttle programs. When you click play on a Juttle program, your browser sends the program to the JPC, which compiles the Juttle into Javascript. The JPC itself is written in Javascript too, using node.js. We chose to use node.js for the JPC for several reasons.

  1. Programming in a high-level language such as Javascript enables the rapid prototyping and iteration that a startup depends on.
  2. Since the front end of our platform is written in Javascript, implementing the back end in Javascript as well makes it easy for developers to implement features end-to-end, without too many context switches or handoffs between front-end developers and back-end developers.
  3. Node.js has a vibrant open-source community, so we can stand on the shoulders of giants such as moment.js, request.js, and bluebird.js. In fact, as of this writing, the JPC depends on 103 NPM packages, and Jut has open-sourced seven of our own, with more to come.

So node.js offers a lot of attractive qualities when choosing a platform to build on. However, it also imposes a few restrictions, especially when your software has to deal with significant amounts of data, as the JPC does.

Here at Jut we have employed several tricks to achieve high performance on large data sets despite these limitations.

When not to use node

One way the JPC manages large computations is by optimizing Juttle for efficiency. The JPC can break down Juttle flowgraphs into subgraphs which can then be executed more efficiently at a deeper layer of the Jut platform. A good example of this is Juttle's reduce processor. The JPC can translate Juttle programs involving the reduce processor into functions that our big data backends can execute on their own (e.g., Elasticsearch today; Cassandra and potentially other backends in the future). Thus, we come to our first node.js tip: one effective way to get high performance out of node.js is to avoid doing computation in node.js.

These optimized programs run much faster than pulling events into Javascript to perform the computation. That is because the unoptimized approach requires Elasticsearch or Cassandra to pull all the relevant event data from disk, encode it as JSON, and send it over HTTP to the JPC, which then has to decode the JSON and perform the desired calculations. Getting rid of that overhead saves a lot of time. Furthermore, both Elasticsearch and Cassandra are written in Java, so they can harness as many CPUs as are available when they need to work on big sets of data. For Jut that means building as many optimizations as possible into the JPC, whether that's for Elasticsearch or our metrics backend, Cassandra.

Here's an example of optimization with Elasticsearch. Elasticsearch has functionality called Aggregations, which perform computations across a set of data. So for a simple aggregation, like counting the number of records returned for a log search, it's easy to optimize that into an Elasticsearch aggregation (see the sketch below). Unfortunately, the optimization approach does not work all the time. Juttle is much more expressive than any single underlying component we've used to build the system. Continuing with the Elasticsearch example, Elasticsearch Aggregations have no notion of merging or joining streams as Juttle does, and users are not empowered to write their own Aggregations. For these and several other core features of Juttle, we have to do all the computation in Javascript. So our burden is to continue making the JPC as fast as possible.
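
To make the idea concrete, here's a minimal sketch of what pushing a simple count down to Elasticsearch might look like, using the elasticsearch client for node.js. This isn't the JPC's actual translation code, and the index name and query are made up for illustration; the point is that Elasticsearch does the counting and node.js only handles a tiny response.

    var elasticsearch = require('elasticsearch');
    var client = new elasticsearch.Client({ host: 'localhost:9200' });

    // Let Elasticsearch count the matching documents instead of shipping
    // every event over HTTP and counting them in Javascript.
    client.count({
        index: 'logs',                                   // hypothetical index name
        body: { query: { match: { level: 'error' } } }   // hypothetical filter
    }).then(function(response) {
        console.log('count:', response.count);
    });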

Node event loop performance

The key to understanding node.js performance is the event loop. Basically, the event loop is a list of functions that node.js will invoke when certain events occur. When you tell your node.js server to make a request to another server, read a file from the filesystem, or do anything else that depends on an outside service, you also provide it with a function to call when that operation completes. node.js puts this function on its event loop, and when the outside operation completes, node.js applies the function you provided to the result of the outside operation. For instance, you can tell node.js to read some rows from a database (outside operation), then do some math on those rows when the database query completes (event loop function). This is essentially how the JPC works.
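
Here's a minimal sketch of that pattern. The database client is hypothetical, but any callback-style node.js API has the same shape: kick off the outside operation, and hand node.js a function to run when it completes.

    var db = require('some-db-client');  // hypothetical database client

    // Outside operation: ask the database for some rows.
    db.query('SELECT value FROM metrics', function(err, rows) {
        if (err) { throw err; }
        // Event loop function: node.js calls this when the query completes,
        // and we do our math on the rows here.
        var sum = rows.reduce(function(total, row) {
            return total + row.value;
        }, 0);
        console.log('sum:', sum);
    });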


The node.js event loop (Source: exortech.github.io)

Trouble occurs, however, when one of these event loop functions takes a long time to compute. Since node.js is single-threaded, it can actively process only one of its event loop functions at any point in time. So if the aforementioned database query returns a lot of rows, and the math you want to do on those rows is particularly involved, then node.js will spend a long time exclusively working on that. If other requests to your server are made during this time, or other outside operations complete, they will just pile up on the event loop to-do list, waiting for the expensive query to finish. This will drive up the response time of your server, and if it falls too far behind it may never be able to catch up.

Therefore, avoiding situations where one function takes a long time to compute is crucial for getting good performance out of a node.js server. In order to do this, we implement paging wherever possible. That means that when we need to read points from one of our data stores, we don't request them all at once. Instead, we fetch a few of them, then have node.js handle any other functions on its event loop before fetching the next batch. Of course, there are still trade-offs with this approach: each request has some overhead of its own, so if you make too many tiny requests, the program will still be slow, even though the event loop will never be blocked for an extended period of time. For Juttle, we have found that a fetch size of 20,000 points strikes a happy medium: node.js is able to perform the required computations for almost any Juttle program on 20,000 points in a few milliseconds, and it is still a large enough fetch size that we can perform computations over millions of points without making too many requests.
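
Here's a rough sketch of that batching pattern. The fetchBatch and processPoints functions are hypothetical stand-ins, not the JPC's actual code; what matters is that we yield back to the event loop between batches so other work gets a turn.

    var BATCH_SIZE = 20000;

    // fetchBatch(offset, size, callback) is a hypothetical paged read from a
    // data store; processPoints(points) does the per-batch computation.
    function processAll(offset) {
        fetchBatch(offset, BATCH_SIZE, function(err, points) {
            if (err) { throw err; }
            if (points.length === 0) { return; }  // no more data
            processPoints(points);
            // Yield so other event loop functions can run before the next batch.
            setImmediate(function() {
                processAll(offset + points.length);
            });
        });
    }

    processAll(0);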

A Case Study

One of Jut's beta customers is NPM, the company that makes the Node Package Manager. NPM has been a Jut user since the alpha days - they talk about us a little bit here. (Thanks NPM for putting up with all the pain of an alpha AND a beta!) NPM is interested in finding the ten packages with the most downloads in the past two weeks, to fill out a table on their website. A Juttle program that computes this is:

  read -last :2 weeks: | reduce count() by package | sort count -desc | head 10 | @table

Simple! Unfortunately, the first time they tried to run this program, it tied up the JPC CPU for over 60 seconds. Jut has a process monitoring service that restarts the JPC if it does not respond to pings for a minute. This kicked in, the JPC was terminated, and NPM never got their data. I was called in to figure out what went wrong and to fix it. It turned out that the JPC had optimized the read/reduce combination here, making it into an Elasticsearch Terms Aggregation. Optimization backfired on us in this case, though, since the Terms Aggregation does not support any paging and NPM has close to a million packages. So Elasticsearch sent back a giant response with a million-item array containing all the results, with a total size of several hundred megabytes. The JPC attempted to process this all at once, and the additional overhead took us right up to the 1.5-gigabyte limit of node.js, so the JPC was stuck in garbage collection and never managed to get through all the data.
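
For reference, the optimized query has roughly the following shape (the index name and the oversized size parameter are illustrative, not the JPC's exact request). A terms aggregation hands back all of its buckets in a single response, so with close to a million packages there's no way to consume it piecemeal:

    var elasticsearch = require('elasticsearch');
    var client = new elasticsearch.Client({ host: 'localhost:9200' });

    // Roughly what the optimized read/reduce turns into. With ~1,000,000
    // distinct packages, the response is one giant, unpaged JSON body.
    client.search({
        index: 'downloads',   // hypothetical index name
        body: {
            size: 0,          // no raw hits, just the aggregation buckets
            aggs: {
                count_by_package: {
                    terms: { field: 'package', size: 1000000 }
                }
            }
        }
    });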

To fix the program, I decided that even though Elasticsearch didn't give us paging for the aggregation, we could pretend it did. Instead of processing the whole giant result all at once, we could divide it into manageable chunks and process those one by one, yielding the CPU after each one. With the help of some open-source libraries, this was easy! The resulting Javascript code looks something like this:

    var Promise = require('bluebird');
    var _ = require('underscore');

    var points = perform_elasticsearch_aggregation();

    // One chunk index per 20,000-point batch.
    Promise.each(_.range(points.length / 20000), function processChunk(n) {
        return Promise.try(function() {
            // splice removes the processed points from the array so the
            // garbage collector can reclaim them.
            process(points.splice(0, 20000));
        }).delay(1);  // yield the CPU briefly before the next chunk
    });

Promise.each is a handy utility added to the open-source bluebird.js library in 2014. Its arguments are an array and a function to perform on each item in the array. Promise.each traverses the array, calling the function on each item sequentially. If one of the function calls yields the CPU before completion, Promise.each also yields the CPU until that function resumes and completes. (This is the difference between Promise.each and the built-in Array.forEach, which will move on to the next item in the array it's traversing if one of its function calls yields the CPU). _.range is a simple function from the underscore.js library. _.range takes a number and returns an array of integers starting at 0 and ending one before that number. So for our million-item points array, _.range(points.length / 20000) returns the array [0, 1, 2, ... 49].
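
As a tiny illustration of that ordering behavior (not code from the JPC), here's Promise.each running an asynchronous step for each item strictly one at a time; Array.forEach would start all three delays immediately instead of waiting:

    var Promise = require('bluebird');

    Promise.each([1, 2, 3], function(n) {
        // Each item waits for the previous item's promise to resolve.
        return Promise.delay(10).then(function() {
            console.log('finished item', n);
        });
    }).then(function() {
        console.log('all done, in order');
    });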

Using Promise.each, we apply the function processChunk to each of these numbers, for a total of 50 calls. Each call to processChunk pulls the first 20,000 points out of our array and calls "process" on them, which performs the computations needed for the Juttle program. Since we use the splice method of the array, these 20,000 points are discarded when we are done with them. This enables the garbage collector to reclaim all the space they were using, decreasing the memory cost of the program. This call to "process" is enclosed by Promise.try. Promise.try is a wrapper from bluebird.js: it calls the function you give it and returns a promise whose methods let you control what happens next. Here, we chain the ".delay(1)" method, which yields the CPU for one millisecond before moving on to the next chunk. Altogether, this gives us an implementation that processes our giant array in manageable chunks of 20,000 points, punctuated by brief pauses that enable the server to service other requests. After deploying this change, NPM's download-ranking program, which formerly locked up the JPC for over a minute, took only about 20 seconds to complete, and the server was responsive to other requests the whole time. Cool!

Conclusion

So that's how Jut operates a big-data platform based upon node.js. By understanding and working within its CPU and memory limitations, we can get strong performance even on millions of data points. But node.js is only one (big) part of Jut's infrastructure: stay tuned for more thrilling stories on the other parts of the infrastructure that makes Jut tick (hint: you might want to polish up on your C++!).