Update 2013-09-29: New sections 4.1 (“Matching any code unit”) and 4.2 (“Libraries”).
This blog post is a brief introduction to Unicode and how it is handled in JavaScript.
Unicode
History
Unicode was started in 1987, by Joe Becker (Xerox), Lee Collins (Apple) and Mark Davis (Apple). The idea was to create a universal character set, as there were many incompatible standards for encoding plain text at that time: numerous 8-bit extensions of ASCII, Big5 (Traditional Chinese), GB 2312 (Simplified Chinese), etc. Before Unicode, no standard for multi-lingual plain text existed, but there were rich text systems (such as Apple’s WorldScript) that allowed one to combine multiple encodings.
The first Unicode draft proposal was published in 1988. Work continued afterwards and the working group expanded. The Unicode Consortium was incorporated on January 3, 1991:
The Unicode Consortium is a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard [...]
The first volume of the Unicode 1.0 standard was published in October 1991, the second one in June 1992.
Important Unicode concepts
The idea of a character may seem a simple one, but there are many aspects to it. That’s why Unicode is such a complex standard. The following are important basic concepts:
- Characters and graphemes: Both terms mean something quite similar. Characters are digital entities, graphemes are atomic units of written languages (alphabetic letters, typographic ligatures, etc.). Sometimes, several characters are used to display a single grapheme.
- Glyph: A concrete way of writing a grapheme. Sometimes the same grapheme is written differently, depending on its context or other factors. For example, the graphemes f and i can be displayed as a glyph f and a glyph i, connected by a ligature glyph. Or without a ligature.
- Code points: Unicode maps the characters it supports to numbers called code points.
- Code units: To store or transmit code points, they are encoded as code units, pieces of data with a fixed length. The length is measured in bits and determined by an encoding scheme, of which Unicode has several: UTF-8, UTF-16, etc. The number in the name indicates the length of the code units, in bits.
If a code point is too large to fit into a single code unit, it must be broken up into multiple units. That is, the number of code units needed to represent a single code point can vary.
- BOM (byte order mark): If a code unit is larger than a single byte, byte ordering matters. The BOM is a single pseudo-character (possibly encoded as multiple code units) at the beginning of a text that indicates whether the code units are big endian (most significant bytes come first) or little endian (least significant bytes come first). The default, for texts without a BOM, is big endian.
The BOM also indicates the encoding that is used; it is different for UTF-8, UTF-16, etc. Additionally, it serves as a marker for Unicode if web browsers have no other information regarding the encoding of a text. However, the BOM is not used very often, for several reasons:
- UTF-8 is by far the most popular Unicode encoding and does not need a BOM, because there is only one way of ordering bytes.
- Several character encodings include byte ordering. Then a BOM must not be used. Examples: UTF-16BE (UTF-16 big endian), UTF-16LE, UTF-32BE, UTF-32LE. This is a safer way of handling byte ordering, because there is no danger of mixing up meta-data and data.
- Normalization: Sometimes the same grapheme can be represented in several ways. For example, the grapheme “ö” can be represented as a single code point or as an “o” followed by a combining character “¨” (diaeresis, double dot). Normalization is about translating a text to a canonical representation; equivalent code points and sequences of code points are all translated to the same code point (or sequence of code points). That is useful for text processing, e.g. to search for text. Unicode specifies several normalizations.
- Character properties: Each Unicode character is assigned several properties by the specification:
- Name: an English name, composed of uppercase letters A-Z, digits 0-9, hyphen - and <space>. Two examples:
- “λ” has the name “GREEK SMALL LETTER LAMBDA”
- “!” has the name “EXCLAMATION MARK”
- General category: Partitions characters into categories such as letter, uppercase letter, number, punctuation, etc.
- Age: With what version of Unicode was the character introduced (1.0, 1.1, 2.0, etc.)?
- Deprecated: Is the use of the character discouraged?
- And many more.
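The normalization issue mentioned above can be observed directly in JavaScript. The following sketch uses escape sequences (explained later in this post) to build the two representations of “ö”:

```javascript
// Two representations of the grapheme “ö”:
var precomposed = '\u00F6'; // single code point: LATIN SMALL LETTER O WITH DIAERESIS
var decomposed = 'o\u0308'; // “o” followed by COMBINING DIAERESIS

// They display the same, but compare as different:
console.log(precomposed === decomposed); // false
console.log(precomposed.length);         // 1
console.log(decomposed.length);          // 2
```

This is exactly why text must be normalized before searching or comparing.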
Code points
The range of the code points was initially 16 bits. With Unicode version 2.0 (July 1996), it was expanded: it is now divided into 17
planes, numbered from 0 to 16. Each plane comprises 16 bits’ worth of code points (in hexadecimal notation: 0x0000–0xFFFF). Thus, in the hexadecimal ranges shown below, the digits beyond the bottom four contain the number of the plane.
- Plane 0: Basic Multilingual Plane (BMP): 0x0000–0xFFFF
- Plane 1: Supplementary Multilingual Plane (SMP): 0x10000–0x1FFFF
- Plane 2: Supplementary Ideographic Plane (SIP): 0x20000–0x2FFFF
- Planes 3–13: Unassigned
- Plane 14: Supplementary Special-Purpose Plane (SSP): 0xE0000–0xEFFFF
- Planes 15–16: Supplementary Private Use Area (S PUA A/B): 0xF0000–0x10FFFF
Planes 1–16 are called supplementary planes or astral planes.
Unicode encodings
UTF-32 (Unicode Transformation Format 32) is a format with 32 bit code units. Any code point can be encoded by a single code unit, making this the only fixed-length encoding. For other encodings, the number of units needed to encode a code point varies.
UTF-16 is a format with 16 bit code units that needs one or two units to represent a code point. BMP code points can be represented by single code units. Higher code points have 20 significant bits after 0x10000 (the size of the BMP) is subtracted. These 20 bits are split across two code units:
- Lead surrogate – most significant 10 bits: stored in the range 0xD800–0xDBFF (1024 possible values = 10 bits).
- Tail surrogate – least significant 10 bits: stored in the range 0xDC00–0xDFFF (1024 possible values = 10 bits).
To enable this encoding scheme, the BMP has a hole with unused code points whose range is 0xD800–0xDFFF. Therefore the ranges of lead surrogates, tail surrogates and BMP code points are disjoint, making decoding robust in the face of errors. The following function encodes a code point as UTF-16. An example of using it is given later.
function toUTF16(codePoint) {
    var TEN_BITS = parseInt('1111111111', 2);
    function u(codeUnit) {
        // Pad to the four hexadecimal digits of \uHHHH
        var hex = codeUnit.toString(16).toUpperCase();
        while (hex.length < 4) hex = '0' + hex;
        return '\\u' + hex;
    }
    if (codePoint <= 0xFFFF) {
        return u(codePoint);
    }
    codePoint -= 0x10000;
    // Shift right to get the most significant 10 bits
    var leadSurrogate = 0xD800 + (codePoint >> 10);
    // Mask to get the least significant 10 bits
    var tailSurrogate = 0xDC00 + (codePoint & TEN_BITS);
    return u(leadSurrogate) + u(tailSurrogate);
}
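The reverse computation is equally short. The following sketch (the name fromSurrogatePair is my own, not a built-in) turns a surrogate pair back into the code point it encodes:

```javascript
// Combine a lead and a tail surrogate into the code point they encode:
// undo the 0xD800/0xDC00 offsets, reassemble the 10-bit halves, re-add 0x10000.
function fromSurrogatePair(lead, tail) {
    return ((lead - 0xD800) << 10) + (tail - 0xDC00) + 0x10000;
}

console.log(fromSurrogatePair(0xD83D, 0xDC04).toString(16)); // 1f404
```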
UCS-2, a deprecated format, uses 16 bit code units to represent (only!) the code points of the BMP. When the range of Unicode code points expanded beyond 16 bits, UTF-16 replaced UCS-2.
UTF-8 has 8 bit code units. It builds a bridge between the legacy ASCII encoding and Unicode. ASCII only has 128 characters, whose numbers are the same as the first 128 Unicode code points. UTF-8 is backwards compatible, because all ASCII characters are valid code units. In other words, a single code unit in the range 0–127 encodes a single code point in the same range. Such code units are marked by their highest bit being zero. If, on the other hand, the highest bit is one then more units will follow, to provide the additional bits for the higher code points. That leads to the following encoding scheme:
- 0000–007F: 0xxxxxxx (7 bits, stored in 1 byte)
- 0080–07FF: 110xxxxx, 10xxxxxx (5+6 bits = 11 bits, stored in 2 bytes)
- 0800–FFFF: 1110xxxx, 10xxxxxx, 10xxxxxx (4+6+6 bits = 16 bits, stored in 3 bytes)
- 10000–1FFFFF: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx (3+6+6+6 bits = 21 bits, stored in 4 bytes)
(The highest code point is 10FFFF, so UTF-8 has some extra room.)
If the highest bit is not 0 then the number of ones before the zero indicates how many code units there are in a sequence. All code units after the initial one have the bit prefix 10. Therefore, the ranges of initial code units and subsequent code units are disjoint, which helps with recovering from encoding errors.
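The scheme above can be turned into code. The following function is a sketch of my own (not part of any library) that encodes a single code point as an array of UTF-8 bytes:

```javascript
// Encode one code point as UTF-8 bytes, following the table above.
// 0xC0, 0xE0, 0xF0 are the markers for 2-, 3- and 4-byte sequences;
// 0x80 is the prefix of each continuation byte; 0x3F masks 6 bits.
function toUTF8Bytes(codePoint) {
    if (codePoint <= 0x7F) {
        return [codePoint]; // ASCII: 1 byte
    }
    if (codePoint <= 0x7FF) {
        return [0xC0 | (codePoint >> 6),
                0x80 | (codePoint & 0x3F)];
    }
    if (codePoint <= 0xFFFF) {
        return [0xE0 | (codePoint >> 12),
                0x80 | ((codePoint >> 6) & 0x3F),
                0x80 | (codePoint & 0x3F)];
    }
    return [0xF0 | (codePoint >> 18),
            0x80 | ((codePoint >> 12) & 0x3F),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
}

console.log(toUTF8Bytes(0xF6));    // [ 195, 182 ] = 0xC3 0xB6, “ö”
console.log(toUTF8Bytes(0x1F404)); // [ 240, 159, 144, 132 ], the cow
```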
UTF-8 has become the most popular Unicode format. Initially, due to its backwards compatibility with ASCII. Later, due to its broad support across operating systems, programming environments and applications.
JavaScript source code and Unicode
Source code internally
Internally, JavaScript source code is treated as a sequence of UTF-16 code units. Quoting Sect. 6 of the ECMAScript specification:
ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.
In identifiers, string literals and regular expression literals, any code unit can also be expressed via a Unicode escape sequence \uHHHH, where HHHH are four hexadecimal digits. For example:
> var f\u006F\u006F = 'abc';
> foo
'abc'
> var λ = 123;
> \u03BB
123
That means that you can use Unicode characters in literals and variable names, without leaving the ASCII range in the source code.
In string literals, an additional kind of escape is available: hex escape sequences with two-digit hexadecimal numbers that represent code units in the range 0x00–0xFF. For example:
> '\xF6' === 'ö'
true
> '\xF6' === '\u00F6'
true
Source code externally
While that format is used internally, JavaScript source code is usually not stored as UTF-16.
When a web browser loads a source file via a script tag, it determines the encoding via a byte order mark in the file, the tag’s charset attribute, or the encoding of the HTML page that contains the tag.
Recommendations:
- For your own application, you can use Unicode. But you must specify the encoding of the app’s HTML page as UTF-8.
- For libraries, it’s safest to release code that is ASCII (7 bit).
Some minification tools can translate source with Unicode code points beyond 7 bit to source that is “7 bit clean”. They do so by replacing non-ASCII characters with Unicode escapes.
For example, the following invocation of UglifyJS translates the file test.js:
uglifyjs -b beautify=false,ascii-only=true test.js
The file test.js looks like this:
var σ = 'Köln';
The output of UglifyJS looks like this:
var \u03c3="K\xf6ln";
Negative example:
For a while, the library D3.js was published in UTF-8. That caused an error when it was loaded from a page whose encoding was not UTF-8, because the code contained statements such as
var π = Math.PI, ε = 1e-6;
The identifiers π and ε were not decoded correctly and not recognized as valid variable names. Additionally, some string literals with code points beyond 7 bit weren’t decoded correctly, either.
As a work-around, the code could be loaded by adding the appropriate charset attribute to the script tag:
<script charset="utf-8" src="d3.js"></script>
JavaScript strings and Unicode
A JavaScript string is a sequence of UTF-16 code units. Quoting Sect. 8.4 of the ECMAScript specification:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
Escape sequences. As mentioned before, you can use Unicode escape sequences and hex escape sequences in string literals. For example, you can produce the character “ö” by combining an “o” with a combining diaeresis (code point 0x0308):
> console.log('o\u0308')
ö
This works in command lines, such as web browser consoles and the Node.js REPL in a terminal. You can also insert this kind of string into the DOM of a web page.
Referring to astral plane characters via escapes. There are many nice Unicode symbol tables on the web. Take a look at Tim Whitlock’s “Emoji Unicode Tables” and be amazed by how many symbols there are in modern Unicode fonts. None of the symbols in the table are images, they are all font glyphs. Let’s assume you want to display a character via JavaScript that is in an astral plane. For example, a cow (code point 0x1F404):
You can either copy the character and paste it directly into your Unicode-encoded JavaScript source:
var str = '🐄';
JavaScript engines will decode the source (which is most often in UTF-8) and create a string with two UTF-16 code units. Alternatively, you can compute the two code units yourself and use Unicode escape sequences. There are web apps that perform this computation.
The previously defined function toUTF16 performs it, too:
> toUTF16(0x1F404)
'\\uD83D\\uDC04'
The UTF-16 surrogate pair (0xD83D, 0xDC04) does indeed encode the cow:
> console.log('\uD83D\uDC04')
🐄
Counting characters. If a string contains a surrogate pair (two code units encoding a single code point) then the length property doesn’t count characters, any more. It counts code units:
> var str = '🐄';
> str === '\uD83D\uDC04'
true
> str.length
2
This can be fixed via libraries, such as Mathias Bynens’ Punycode.js, which is bundled with Node.js:
> var puny = require('punycode');
> puny.ucs2.decode(str).length
1
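If pulling in a library is not an option, the count can also be computed by hand. The following helper is a sketch of my own (it assumes well-formed UTF-16); it counts code points by ignoring tail surrogates:

```javascript
// Count code points: every code unit counts, except tail surrogates
// (0xDC00–0xDFFF), which are the second half of an already-counted pair.
function countCodePoints(str) {
    var count = 0;
    for (var i = 0; i < str.length; i++) {
        var codeUnit = str.charCodeAt(i);
        if (codeUnit < 0xDC00 || codeUnit > 0xDFFF) {
            count++;
        }
    }
    return count;
}

console.log(countCodePoints('\uD83D\uDC04')); // 1 (the cow)
console.log(countCodePoints('abc'));          // 3
```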
Unicode normalization. If you want to search in strings or compare them then you need to normalize, e.g. via the library unorm (by Bjarke Walling).
JavaScript regular expressions and Unicode
Support for Unicode in JavaScript’s regular expressions [1] is very limited. For example, there is no way to match Unicode categories such as “uppercase letter”.
Line terminators influence matching and do have a Unicode definition. A line terminator is one of the following four characters:
| Code unit | Name | Character escape sequence |
|---|---|---|
| \u000A | Line feed | \n |
| \u000D | Carriage return | \r |
| \u2028 | Line separator | |
| \u2029 | Paragraph separator | |
The following regular expression constructs support Unicode:
Other important character classes have definitions that are based on ASCII, not on Unicode:
Matching any code unit
To match any code unit, you can use [\s\S], see [1]. To match any code point, you need to use:
([\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
The above pattern works like this:
([BMP code point]|[lead surrogate][tail surrogate])
As all of these ranges are disjoint, the pattern will correctly match code units in well-formed UTF-16 strings.
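As a quick sanity check, the pattern can be combined with the flag /g to count code points (a minimal sketch, not a full-featured solution):

```javascript
// BMP code point, or lead surrogate followed by tail surrogate:
var anyCodePoint = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// One astral code point (the cow, two code units) plus four BMP code points:
var matches = '\uD83D\uDC04 moo'.match(anyCodePoint);
console.log(matches.length); // 5
```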
Libraries
Regenerate helps with generating ranges like the one above, for matching any code unit. It is meant to be used as part of a build tool, but also works dynamically, for trying things out.
XRegExp is a regular expression library that has an official addon for matching Unicode categories, scripts, blocks and properties via one of the following three constructs:
\p{...} \P{...} \p{^...}
For example,
\p{Letter} matches letters in various alphabets.
The future of handling Unicode in JavaScript
Two new standards, one in the process of being implemented and the other in the process of being designed, will bring better support for Unicode to JavaScript:
- The ECMAScript Internationalization API [2]: offers Unicode-based collation (sorting and searching) and more.
- ECMAScript 6: The next version of JavaScript will have several Unicode-related features, such as escapes for arbitrary code points and a method for accessing code points in a string (as opposed to code units). The blog post “Supplementary Characters for ECMAScript” by Norbert Lindenberg explains the plans for Unicode support in ECMAScript 6.
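For illustration, this is roughly what the proposed ECMAScript 6 features look like (a sketch based on the proposals; it requires an engine that implements them):

```javascript
// Escape for an arbitrary code point, instead of a surrogate pair:
console.log('\u{1F404}' === '\uD83D\uDC04'); // true

// Accessing the code point (not just the code unit) at a position:
console.log('\uD83D\uDC04'.codePointAt(0).toString(16)); // 1f404
```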
Recommended reading and sources of this post
Information on Unicode:
Information on Unicode support in JavaScript:
Acknowledgements
The following people helped with this blog post: Mathias Bynens (@mathias), Anne van Kesteren (@annevk), Calvin Metcalf (@CWMma).
References
[1] JavaScript: an overview of the regular expression API
[2] The ECMAScript Internationalization API