Unicode and JavaScript

Update 2013-09-29:

New sections 4.1 (“Matching any code unit”) and 4.2 (“Libraries”).

This blog post is a brief introduction to Unicode and how it is handled in JavaScript.

Unicode

History

Unicode was started in 1987, by Joe Becker (Xerox), Lee Collins (Apple) and Mark Davis (Apple). The idea was to create a universal character set, as there were many incompatible standards for encoding plain text at that time: numerous variations of 8 bit ASCII, Big Five (Traditional Chinese), GB 2312 (Simplified Chinese), etc. Before Unicode, no standard for multi-lingual plain text existed, but there were rich text systems (such as Apple’s WorldScript) that allowed one to combine multiple encodings.

The first Unicode draft proposal was published in 1988. Work continued afterwards and the working group expanded. The Unicode Consortium was incorporated on January 3, 1991:

The Unicode Consortium is a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard [...]
The first volume of the Unicode 1.0 standard was published in October 1991, the second one in June 1992.

Important Unicode concepts

The idea of a character may seem a simple one, but there are many aspects to it. That’s why Unicode is such a complex standard. The following are important basic concepts:

Code points

The range of the code points was initially 16 bits. With Unicode version 2.0 (July 1996), it was expanded: the range is now divided into 17 planes, numbered from 0 to 16. Each plane comprises 65,536 (= 2^16) code points (in hexadecimal notation: 0x0000–0xFFFF). Thus, in the hexadecimal ranges shown below, the digits beyond the lowest four contain the number of the plane. Plane 0 is called the Basic Multilingual Plane (BMP). Planes 1–16 are called supplementary planes or astral planes.
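
To illustrate how the plane number is encoded in a code point's value, here is a small sketch (the helper name getPlane is made up for this post):
    function getPlane(codePoint) {
        // The hexadecimal digits beyond the lowest four are the plane number
        return Math.floor(codePoint / 0x10000);
    }
For example:
    > getPlane(0x0041)
    0
    > getPlane(0x1F404)
    1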

Unicode encodings

UTF-32 (Unicode Transformation Format 32) is a format with 32 bit code units. Any code point can be encoded by a single code unit, making this the only fixed-length Unicode encoding. For the other encodings, the number of code units needed to encode a code point varies.

UTF-16 is a format with 16 bit code units that needs one or two units to represent a code point. BMP code points can be represented by single code units. Higher code points are 20 bit, after subtracting 0x10000 (the range of the BMP). These 20 bits are split across two code units, the so-called surrogates:

Code unit range    Name             Stored bits
0xD800–0xDBFF      lead surrogate   most significant 10 bits
0xDC00–0xDFFF      tail surrogate   least significant 10 bits

To enable this encoding scheme, the BMP has a hole with unused code points whose range is 0xD800–0xDFFF. Therefore the ranges of lead surrogates, tail surrogates and BMP code points are disjoint, making decoding robust in the face of errors. The following function encodes a code point as UTF-16. An example of using it is given later.
    function toUTF16(codePoint) {
        var TEN_BITS = parseInt('1111111111', 2);
        function u(codeUnit) {
            // Format as an escape sequence \uHHHH (pad the hexadecimal number to four digits)
            var hex = codeUnit.toString(16).toUpperCase();
            return '\\u' + '0000'.slice(hex.length) + hex;
        }

        if (codePoint <= 0xFFFF) {
            return u(codePoint);
        }
        codePoint -= 0x10000;
        
        // Shift right to get to most significant 10 bits
        var leadSurrogate = 0xD800 + (codePoint >> 10);

        // Mask to get least significant 10 bits
        var tailSurrogate = 0xDC00 + (codePoint & TEN_BITS);

        return u(leadSurrogate) + u(tailSurrogate);
    }

UCS-2, a deprecated format, uses 16 bit code units to represent (only!) the code points of the BMP. When the range of Unicode code points expanded beyond 16 bits, UTF-16 replaced UCS-2.

UTF-8. UTF-8 has 8 bit code units. It builds a bridge between the legacy ASCII encoding and Unicode. ASCII only has 128 characters, whose numbers are the same as the first 128 Unicode code points. UTF-8 is backwards compatible, because all ASCII characters are valid code units. In other words, a single code unit in the range 0–127 encodes a single code point in the same range. Such code units are marked by their highest bit being zero. If, on the other hand, the highest bit is one, then more units will follow, to provide the additional bits for the higher code points. That leads to the following encoding scheme:

Code point range    UTF-8 code units (binary)
0x00–0x7F           0xxxxxxx
0x80–0x7FF          110xxxxx 10xxxxxx
0x800–0xFFFF        1110xxxx 10xxxxxx 10xxxxxx
0x10000–0x10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If the highest bit is not 0 then the number of ones before the zero indicates how many code units there are in a sequence. All code units after the initial one have the bit prefix 10. Therefore, the ranges of initial code units and subsequent code units are disjoint, which helps with recovering from encoding errors.
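
To illustrate the scheme, the following function (a minimal sketch, not part of any library) encodes a single code point as an array of UTF-8 code units:
    function toUTF8(codePoint) {
        if (codePoint <= 0x7F) {
            // One code unit, highest bit is zero
            return [ codePoint ];
        }
        if (codePoint <= 0x7FF) {
            // Initial code unit 110xxxxx, one subsequent code unit 10xxxxxx
            return [ 0xC0 | (codePoint >> 6),
                     0x80 | (codePoint & 0x3F) ];
        }
        if (codePoint <= 0xFFFF) {
            // Initial code unit 1110xxxx, two subsequent code units
            return [ 0xE0 | (codePoint >> 12),
                     0x80 | ((codePoint >> 6) & 0x3F),
                     0x80 | (codePoint & 0x3F) ];
        }
        // Initial code unit 11110xxx, three subsequent code units
        return [ 0xF0 | (codePoint >> 18),
                 0x80 | ((codePoint >> 12) & 0x3F),
                 0x80 | ((codePoint >> 6) & 0x3F),
                 0x80 | (codePoint & 0x3F) ];
    }
For example, the code point 0x00F6 (“ö”) is encoded as two code units:
    > toUTF8(0xF6).map(function (b) { return b.toString(16); })
    [ 'c3', 'b6' ]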

UTF-8 has become the most popular Unicode format: initially due to its backwards compatibility with ASCII, later due to its broad support across operating systems, programming environments and applications.

JavaScript source code and Unicode

Source code internally

Internally, JavaScript source code is treated as a sequence of UTF-16 code units. Quoting Sect. 6 of the ECMAScript specification:
ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.
In identifiers, string literals and regular expression literals, any code unit can also be expressed via a Unicode escape sequence \uHHHH, where HHHH are four hexadecimal digits. For example:
    > var f\u006F\u006F = 'abc';
    > foo
    'abc'

    > var λ = 123;
    > \u03BB
    123
That means that you can use Unicode characters in literals and variable names, without leaving the ASCII range in the source code.

In string literals, an additional kind of escape is available: hex escape sequences with two-digit hexadecimal numbers that represent code units in the range 0x00–0xFF. For example:

    > '\xF6' === 'ö'
    true
    > '\xF6' === '\u00F6'
    true

Source code externally

While that format is used internally, JavaScript source code is usually not stored as UTF-16. When a web browser loads a source file via a script tag, it determines the encoding as follows: if the file starts with a byte order mark (BOM), the corresponding UTF encoding is used; otherwise, if the script tag has a charset attribute, the encoding named there is used; otherwise, the encoding of the document that contains the script tag is used.

Recommendation: if you want your source to work regardless of how it is decoded, keep it “7 bit clean” (ASCII only). Some minification tools can translate source with Unicode code points beyond 7 bit to such source. They do so by replacing non-ASCII characters with Unicode escapes. For example, the following invocation of UglifyJS translates the file test.js:
    uglifyjs -b beautify=false,ascii-only=true test.js
The file test.js looks like this:
    var σ = 'Köln';
The output of UglifyJS looks like this:
    var \u03c3="K\xf6ln";
Negative example: For a while, the library D3.js was published in UTF-8. That caused an error when it was loaded from a page whose encoding was not UTF-8, because the code contained statements such as
    var π = Math.PI, ε = 1e-6;
The identifiers π and ε were not decoded correctly and not recognized as valid variable names. Additionally, some string literals with code points beyond 7 bit weren’t decoded correctly, either. As a work-around, the code could be loaded by adding the appropriate charset attribute to the script tag:
    <script charset="utf-8" src="d3.js"></script>

JavaScript strings and Unicode

A JavaScript string is a sequence of UTF-16 code units. Quoting the ECMAScript specification, Sect. 8.4:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
Escape sequences. As mentioned before, you can use Unicode escape sequences and hex escape sequences in string literals. For example, you can produce the character “ö” by combining an “o” with a diaeresis (code point 0x0308):
    > console.log('o\u0308')
    ö
This works in command lines, such as web browser consoles and the Node.js REPL in a terminal. You can also insert this kind of string into the DOM of a web page.

Referring to astral plane characters via escapes. There are many nice Unicode symbol tables on the web. Take a look at Tim Whitlock’s “Emoji Unicode Tables” and be amazed by how many symbols there are in modern Unicode fonts. None of the symbols in the table are images; they are all font glyphs. Let’s assume you want to display a character via JavaScript that is in an astral plane, for example a cow (code point 0x1F404).

You can either copy the character and paste it directly into your Unicode-encoded JavaScript source:
    var str = '';
JavaScript engines will decode the source (which is most often in UTF-8) and create a string with two UTF-16 code units. Alternatively, you can compute the two code units yourself and use Unicode escape sequences. There are web apps that perform this computation; the previously defined function toUTF16 performs it, too:
    > toUTF16(0x1F404)
    '\\uD83D\\uDC04'
The UTF-16 surrogate pair (0xD83D, 0xDC04) does indeed encode the cow:
    > console.log('\uD83D\uDC04')
    🐄

Counting characters. If a string contains a surrogate pair (two code units encoding a single code point) then the length property doesn’t count characters any more; it counts code units:

    > var str = '';
    > str === '\uD83D\uDC04'
    true
    > str.length
    2
This can be fixed via libraries, such as Mathias Bynens’ Punycode.js, which is bundled with Node.js:
    > var puny = require('punycode');
    > puny.ucs2.decode(str).length
    1

Unicode normalization. If you want to search in strings or compare them then you need to normalize, e.g. via the library unorm (by Bjarke Walling).
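
A small sketch of what that looks like (assuming the unorm package has been installed via npm): normalizing to NFC makes the precomposed “ö” and the combination of “o” plus combining diaeresis compare as equal.
    var unorm = require('unorm');

    var precomposed = '\u00F6';   // “ö” as a single code point
    var decomposed = 'o\u0308';   // “o” followed by a combining diaeresis

    console.log(precomposed === decomposed);             // false
    console.log(unorm.nfc(decomposed) === precomposed);  // true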

JavaScript regular expressions and Unicode

Support for Unicode in JavaScript’s regular expressions [1] is very limited. For example, there is no way to match Unicode categories such as “uppercase letter”.

Line terminators influence matching and do have a Unicode-based definition. A line terminator is one of the following four characters:

Code unit    Name                 Character escape sequence
\u000A       Line feed            \n
\u000D       Carriage return      \r
\u2028       Line separator
\u2029       Paragraph separator
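
For example, the dot does not match any of these line terminators, while it does match other non-ASCII code units:
    > /./.test('\u2028')
    false
    > /./.test('ö')
    true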

The following regular expression constructs support Unicode:

- The escape sequence \uHHHH matches the code unit with that value.
- . (dot) matches any code unit except line terminators (whose definition is Unicode-based, see above). It does not match complete code points beyond the BMP (see below).
- \s and \S: the definition of whitespace is Unicode-based; it includes, for example, the no-break space \u00A0.

Other important character classes have definitions that are based on ASCII, not on Unicode:

- \d and \D: a digit is one of 0–9; digits from other scripts are not matched.
- \w and \W: a word character is one of [A-Za-z0-9_].
- \b and \B: word boundaries are defined in terms of \w. For example, “ü” does not count as a word character, so there is a word boundary inside the word “über”.

Matching any code unit

To match any code unit, you can use [\s\S], see [1]. To match any code point, you need to use:
    ([\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
The above pattern works like this:
    ([BMP code point]|[lead surrogate][tail surrogate])
As all of these ranges are disjoint, the pattern will correctly match code points in well-formed UTF-16 strings.
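
For example, the pattern can be used to count code points instead of code units. The following is a small sketch (the name countCodePoints is made up for this post):
    var CODE_POINT = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

    function countCodePoints(str) {
        var matches = str.match(CODE_POINT);
        return matches ? matches.length : 0;
    }
For example:
    > countCodePoints('\uD83D\uDC04X')
    2
    > '\uD83D\uDC04X'.length
    3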

Libraries

Regenerate helps with generating ranges like the one above, for matching any code point. It is meant to be used as part of a build tool, but also works dynamically, for trying out things.
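
A quick sketch of what that can look like (assuming the regenerate package from npm, whose set-based API lets you add ranges of code points and convert them to regular expression source text):
    var regenerate = require('regenerate');

    // Build a character class (as regular expression source text)
    // that matches any Unicode code point
    var anyCodePoint = regenerate().addRange(0x0, 0x10FFFF).toString();
    console.log(anyCodePoint);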

XRegExp is a regular expression library that has an official addon for matching Unicode categories, scripts, blocks and properties via one of the following three constructs:

    \p{...} \P{...} \p{^...}
For example, \p{Letter} matches letters in various alphabets.
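
A sketch of how that is used (this assumes XRegExp and its Unicode addon have been loaded, so that the global XRegExp function is available):
    // \p{Letter} is provided by the Unicode addon
    var onlyLetters = XRegExp('^\\p{Letter}+$');
    console.log(onlyLetters.test('Köln'));  // true
    console.log(onlyLetters.test('K2'));    // false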

The future of handling Unicode in JavaScript

Two new standards, one in the process of being implemented and one in the process of being designed, will bring better support for Unicode to JavaScript:

- The ECMAScript Internationalization API [2] (currently being implemented) provides Unicode-aware collation (string comparison), number formatting, and date and time formatting.
- ECMAScript 6 (currently being designed) adds, among other things, code point escape sequences for strings and regular expressions (e.g. \u{1F404}), the regular expression flag /u, and methods for working with code points (such as String.fromCodePoint and String.prototype.codePointAt).
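
For example, once engines implement the relevant ECMAScript 6 features, the cow from earlier could be handled as follows (a sketch that only works in engines that support these features):
    // Code point escape sequence in a string literal
    var str = '\u{1F404}';

    // With the /u flag, the dot matches whole code points
    console.log(/^.$/u.test(str));  // true

    // Counting and extracting code points
    console.log(Array.from(str).length);           // 1
    console.log(str.codePointAt(0).toString(16));  // '1f404'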

Recommended reading and sources of this post

Information on Unicode:

Information on Unicode support in JavaScript:

Acknowledgements

The following people helped with this blog post: Mathias Bynens (@mathias), Anne van Kesteren (@annevk), Calvin Metcalf (@CWMma).

References

  1. JavaScript: an overview of the regular expression API
  2. The ECMAScript Internationalization API