Lazy Parsing in JavaScript Engines

Modern JavaScript engines can defer parsing a function body until it is actually needed. Why is this done and how does it work?

The latest blog post titled Advances in JavaScript Performance in IE10 and Windows 8 from the Internet Explorer team mentions the use of deferred parsing to improve performance. In fact, the stable IE 9 already implements this optimization, while IE 10 improves it further to take into account the popular module pattern. According to the team (Chakra refers to the JavaScript engine used in IE):

To further reduce the time to first executed instruction, Chakra processes and emits bytecode only for functions that are about to be executed using a mechanism called deferred parsing.

Let’s walk through a simplified example to see how this works. Suppose your web application looks like the following JavaScript code.

function add(x, y) { return x + y; }
function mul(x, y) { return x * y; }
alert(add(40, 2));

Before the engine can execute the code, it has to feed the source into its parser. The purpose of the parser is to perform syntactic analysis and to produce an abstract syntax tree (AST). For an illustrative example of what the syntax tree may look like, you can use the online parser demo of Esprima (a JavaScript parser project I started some months ago). The full syntax tree will be quite complex (a simplified tree for the add function is shown right after the list below), but if we translate the work of the parser into plain English, this is what happens:

Declare a function called add. It accepts x and y as the arguments. It has one statement, a return statement. The return value is a binary operation + of x and y.
Declare a function called mul. It accepts x and y as the arguments. It has one statement, a return statement. The return value is a binary operation * of x and y.
Create a function call to alert. The argument is the result of function add with 40 and 2 as the arguments.
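
To make that more tangible, here is roughly what the syntax tree for the add function looks like in Esprima’s output (simplified, with source location information and some default properties omitted):

{
  type: 'FunctionDeclaration',
  id: { type: 'Identifier', name: 'add' },
  params: [
    { type: 'Identifier', name: 'x' },
    { type: 'Identifier', name: 'y' }
  ],
  body: {
    type: 'BlockStatement',
    body: [{
      type: 'ReturnStatement',
      argument: {
        type: 'BinaryExpression',
        operator: '+',
        left: { type: 'Identifier', name: 'x' },
        right: { type: 'Identifier', name: 'y' }
      }
    }]
  }
}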

Based on this syntax tree, some more magic occurs. At the end of the day, when the interpreter executes your code, it pops up the dialog with the answer. Now, if you pay careful attention, there is a wasted step in the above work of the parser, namely the effort to parse function mul: it is never called at all, since the later alert only invokes function add. While this example might be really simple and obvious, in the real world (according to Microsoft’s JSMeter research) a lot of declared functions are never called at all.

Instead of dutifully parsing everything at once, modern JavaScript engines use a lazy parsing approach. The work of the parser changes into something like:

Declare a function called add with the function body “{ return x + y; }”.
Declare a function called mul with the function body “{ return x * y; }”.
Call alert with the result of function add with 40 and 2 as the arguments.

Here the parser does not bother to go deep into the statements of each and every function. At the execution stage, the sequence continues:

Call function add. Hmm, it is not parsed yet. Call the real parser for “{ return x + y; }”.
It accepts x and y as the arguments. The return value is a binary operation + of x and y.

Basically, the task of parsing the source for that function is deferred; it is only carried out when necessary, right before executing the function. The lazy parser still needs to go through the incoming source because it needs to locate the entire function body. If you see function add(x, y) { then you need to locate the matching } which ends the function body. This cannot be done with a regular expression or any simple form of scanning; the parser needs to consume the code as if it were real code. The good news is that the parser does not need to do much besides finding that closing curly brace, which means it can skip certain work. For a start, it does not need to build the syntax tree, because nobody is going to process it. In addition, this code path does not need to allocate any memory from the heap. Allocating memory eats system resources, so avoiding it leads to a speed-up.
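
To make the idea concrete, here is a minimal sketch of how an engine might record a deferred function and only parse it on the first call. The helpers realParseFunctionBody and interpret are hypothetical placeholders, not part of any real engine:

// Record a declared function without parsing its body.
// Only the offsets of the opening and closing curly braces are kept.
function declareLazyFunction(source, start, end) {
  return {
    source: source,   // the engine keeps the full program source anyway
    start: start,     // offset of the opening curly brace
    end: end,         // offset of the matching closing curly brace
    ast: null,        // no syntax tree yet
    invoke: function (args) {
      if (this.ast === null) {
        // First call: now run the real parser on the stored range.
        this.ast = realParseFunctionBody(this.source.slice(this.start, this.end + 1));
      }
      return interpret(this.ast, args);
    }
  };
}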

Here is a real-life analogy. You stumble upon a nice article (maybe this blog post) but then you realize that you only want to read it later, when you need it. You decide to stash the text (digitally, of course) in your notes. You still have to scan the blog post to find out where it starts and where it ends, though you can do this scanning rather quickly (faster than reading the entire text). Once you get the start and end markers, you can just select the text, copy it to the clipboard, switch to the note application, and finally paste the content. When it is finally time to use the information in the article, you open the note and complete the reading.

Let’s compare some hypothetical code to parse a while statement. As you already know, this statement has the following grammar:

'while' '(' Expression ')' Statement

The real parser needs to understand this and produce an abstract syntax tree (AST) which represents the construct. The code for doing that, if it were written in JavaScript, might look like:

function realParseWhileStatement()
{
  expect('while');
  expect('(');
  var expression = parseExpression();
  expect(')');
  var statement = parseStatement();
 
  // node for the AST
  return {
    type: 'WhileStatement',
    test: expression,
    body: statement
  };
}

In the case of lazy parsing, we don’t care about the result and thus the code is simplified to:

function lazyParseWhileStatement()
{
  expect('while');
  expect('(');
  parseExpression();
  expect(')');
  parseStatement();
}

Obviously there are other functions to parse the various constructs specified in the grammar. The bottom line is that the parser has to consume all the tokens and advance as far as it can until the function body is completed. That way, it knows where the closing curly brace is that matches the opening curly brace which starts the function body.
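
In the same hypothetical style, the routine that skips over an entire function body might look like the following sketch (the match helper, which peeks at the next token, is assumed):

function lazyParseFunctionBody()
{
  expect('{');
  // Consume statements without building any syntax tree nodes.
  while (!match('}')) {
    parseStatement();
  }
  expect('}');
  // At this point the position of the closing curly brace is known
  // and can be recorded for the real parse later on.
}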

What if during lazy parsing we stumble upon another nested function declaration? The same rule applies and therefore that function will be lazily parsed. Function Inception, anyone?
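
For example, in the following snippet the body of inner is skipped along with the rest of outer; it only gets parsed for real if outer, and then inner, is ever invoked:

function outer() {      // lazily parsed
  function inner() {    // also lazily parsed: only its body range is recorded
    return 42;
  }
  return inner();
}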

In reality, a lazy parser can be slightly more complicated. It involves handling strict mode properly, making sure parsing errors are taken care of, avoiding stack overflow (for a recursive descent parser), and many other delicate situations.
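
Strict mode is a good illustration: even when a body is skipped, its directive prologue still matters. In this contrived example, the duplicate parameter names are a SyntaxError because the function body is strict mode code, so even a lazy parser must still notice the "use strict" directive in order to report the error:

function sum(a, a) {   // duplicate parameter name
  'use strict';        // the body is strict mode code, so this is a SyntaxError
  return a + a;
}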

Let’s see how this lazy parsing approach is implemented in two popular JavaScript engines.

First, we look at JavaScriptCore (JSC), the engine used in WebKit, which powers the popular Apple Safari browser (desktop and mobile). The source code for JavaScriptCore resides in the Source/JavaScriptCore subdirectory if you check out the WebKit code. The files relevant to the lazy parser are:

parser/Parser.h
parser/Parser.cpp
parser/SyntaxChecker.h

The normal parser and the lazy parser in JavaScriptCore are essentially the same code; the specialization is done through C++ templates. The parser itself does not construct a syntax tree, the job is delegated to a TreeBuilder. There are two builders available, ASTBuilder and SyntaxChecker. The latter essentially does nothing, but since it is driven by the parser, the parser can go forward and consume the constructs up to any point where it wants to stop. It acts as a kind of syntax checker, hence the name.

The JSC parser uses the SyntaxChecker builder whenever it needs to parse a function body; see parseFunctionBody, which gets called from parseFunctionInfo (this leads to a funny variable name in the code, FunctionBodyBuilder). After the syntax checker stops at the end of the function body, the range which scopes the curly braces is stored; this is needed when the real parsing kicks in at some later stage, i.e. when that function is invoked. Because JSC keeps the entire source, storing the range is sufficient and there is no need to copy the source string.

The situation is similar in V8, the JavaScript engine used in Google Chrome and Node.js. Files related to lazy parsing in the V8 source code are:

src/preparser.cc
src/preparser.h
src/preparser-api.cc

Unlike JSC, V8 has two different code bases (but with similar interfaces) for the normal parser and the lazy parser. The latter is called the PreParser in V8 terminology. This PreParser kicks in when V8 encounters a function body during parsing, i.e. in Parser::ParseFunctionLiteral. Interestingly enough, V8 applies an optimization for one special case: the immediately invoked function expression, a pattern popular for providing better namespacing without polluting the global scope and therefore perfect for implementing a module, e.g.:

var foobar = (function() {
    //  do something
    //  return the module object
})();

Because this pattern is quite common nowadays, V8 detects the usage with a simple heuristic: if there is a ( right before function, then forget the lazy parsing and do the real parsing. In the above example, V8 will produce a syntax tree for the statements inside the function.
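
As a rough illustration of that heuristic, compare these two (contrived) assignments:

// No hint that the function runs right away: the body is parsed lazily.
var lazy = function () {
    return 42;
};

// The ( before function signals an immediately invoked function expression,
// so the body is parsed eagerly.
var eager = (function () {
    return 42;
})();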

What about SpiderMonkey, Mozilla’s JavaScript engine used in Firefox? Lazy parsing is not there yet but it is being implemented; just track Mozilla bug 678037. This will be another awesome move to improve Firefox performance.

Last but not least, since this blog post is quite long, you might just quickly scan it and read it later. That’s called deferred reading!