The way JavaScript handles Unicode is… surprising, to say the least. This write-up explains the pain points associated with Unicode in JavaScript, provides solutions for common problems, and explains how the upcoming ECMAScript 6 will improve the situation.
Note: It sucks to have to say this, but this document is best viewed in a browser that is capable of rendering emoji, like Firefox, Safari, or Internet Explorer. Blink (Chrome/Opera) on OS X doesn’t render these glyphs at all, which makes some of the code examples on this page pretty hard to make sense of. You’ve been warned!
Before we take a closer look at JavaScript, let’s make sure we’re all on the same page when it comes to Unicode.
It’s easiest to think of Unicode as a database that maps any symbol you can think of to a number called its code point, and to a unique name. That way, it’s easy to refer to specific symbols without actually using the symbol itself. For example, U+0041 LATIN CAPITAL LETTER A refers to the symbol A, and U+1F4A9 PILE OF POO refers to 💩.
Code points are usually formatted as hexadecimal numbers, zero-padded up to at least four digits, with a U+ prefix.
The possible code point values range from U+0000 to U+10FFFF. That’s over 1.1 million possible symbols. To keep things organised, Unicode divides this range of code points into 17 planes of 65,536 (2^16) code points each.
The first plane is called the Basic Multilingual Plane or BMP, and it’s probably the most important one, as it contains all the most commonly used symbols. Most of the time you don’t need any code points outside of the BMP for text documents in English. Just like any other Unicode plane, it groups 65,536 code points.
That leaves us about 1 million other code points that live outside the BMP. The planes these code points belong to are called supplementary planes, or astral planes.
Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
Now that we have a basic understanding of Unicode, let’s see how it applies to JavaScript strings.
You may have seen things like this before:
>> '\x41\x42\x43'
'ABC'
>> '\x61\x62\x63'
'abc'
These are called hexadecimal escape sequences. They consist of two hexadecimal digits that refer to the matching code point. For example, \x41
represents U+0041 LATIN CAPITAL LETTER A. These escape sequences can be used for code points in the range from U+0000 to U+00FF.
Also common is the following type of escape:
>> '\u0041\u0042\u0043'
'ABC'
>> 'I \u2661 JavaScript!'
'I ♡ JavaScript!'
These are called Unicode escape sequences. They consist of exactly 4 hexadecimal digits that represent a code point. For example, \u2661
represents U+2661 WHITE HEART SUIT. These escape sequences can be used for code points in the range from U+0000 to U+FFFF, i.e. the entire Basic Multilingual Plane.
But what about all the other planes — the astral planes? We need more than 4 hexadecimal digits to represent their code points… So how can we escape them?
In ECMAScript 6 this will be easy, since it introduces a new type of escape sequence: Unicode code point escapes. For example:
>> '\u{41}\u{42}\u{43}'
'ABC'
>> '\u{1F4A9}'
'💩' // U+1F4A9 PILE OF POO
Between the braces you can use up to six hexadecimal digits, which is enough to represent all Unicode code points. So, by using this type of escape sequence, you can easily escape any Unicode symbol based on its code point.
For backwards compatibility with ECMAScript 5 and older environments, the unfortunate solution is to use surrogate pairs:
>> '\uD83D\uDCA9'
'💩' // U+1F4A9 PILE OF POO
In that case, each escape represents the code point of a surrogate half. Two surrogate halves form a single astral symbol.
Note that the surrogate code points don’t look anything like the original code point. There are formulas to calculate the surrogates based on a given astral code point, and the other way around — to calculate the original astral code point based on its surrogate pair.
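For the curious, here’s what those formulas look like in JavaScript. This is a simplified sketch with made-up helper names and no input validation, purely for illustration:

// Split an astral code point into its two surrogate halves.
function getSurrogates(codePoint) { // hypothetical helper name
	var offset = codePoint - 0x10000;
	var highSurrogate = 0xD800 + (offset >> 10);  // top 10 bits
	var lowSurrogate = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
	return [highSurrogate, lowSurrogate];
}

// Combine two surrogate halves back into the original astral code point.
function getAstralCodePoint(highSurrogate, lowSurrogate) { // hypothetical helper name
	return (highSurrogate - 0xD800) * 0x400 + (lowSurrogate - 0xDC00) + 0x10000;
}

>> getSurrogates(0x1F4A9)
[0xD83D, 0xDCA9]
>> getAstralCodePoint(0xD83D, 0xDCA9)
0x1F4A9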
Using surrogate pairs, all astral code points (i.e. from U+010000 to U+10FFFF) can be represented… But the whole concept of using a single escape to represent BMP symbols, and two escapes for astral symbols, is confusing, and has lots of annoying consequences.
Let’s say you want to count the number of symbols in a given string, for example. How would you go about it?
My first thought would probably be to simply use the length
property.
>> 'A'.length // U+0041 LATIN CAPITAL LETTER A
1
>> 'A' == '\u0041'
true
>> 'B'.length // U+0042 LATIN CAPITAL LETTER B
1
>> 'B' == '\u0042'
true
In these examples, the length
property of the string happens to reflect the number of characters. This makes sense: if we use escape sequences to represent the symbols, it’s obvious that we only need a single escape for each of these symbols. But this is not always the case! Here’s a slightly different example:
>> '𝐀'.length // U+1D400 MATHEMATICAL BOLD CAPITAL A
2
>> '𝐀' == '\uD835\uDC00'
true
>> '𝐁'.length // U+1D401 MATHEMATICAL BOLD CAPITAL B
2
>> '𝐁' == '\uD835\uDC01'
true
>> '💩'.length // U+1F4A9 PILE OF POO
2
>> '💩' == '\uD83D\uDCA9'
true
Internally, JavaScript represents astral symbols as surrogate pairs, and it exposes the separate surrogate halves as separate “characters”. If you represent the symbols using nothing but ECMAScript 5-compatible escape sequences, you’ll see that two escapes are needed for each astral symbol. This is confusing, because humans generally think in terms of Unicode symbols or graphemes instead.
Getting back to the question: how to accurately count the number of symbols in a JavaScript string? The trick is to account for surrogate pairs properly, and only count each pair as a single symbol. You could use something like this:
var regexAstralSymbols = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

function countSymbols(string) {
return string
// replace every surrogate pair with a BMP symbol
.replace(regexAstralSymbols, '_')
// then get the length
.length;
}
Or, if you use Punycode.js (which ships with Node.js), make use of its utility methods to convert between JavaScript strings and Unicode code points. The punycode.ucs2.decode
method takes a string and returns an array of Unicode code points; one item for each symbol.
function countSymbols(string) {
return punycode.ucs2.decode(string).length;
}
Using either implementation, we’re now able to count code points properly, which leads to more accurate results:
>> countSymbols('A') // U+0041 LATIN CAPITAL LETTER A
1
>> countSymbols('𝐀') // U+1D400 MATHEMATICAL BOLD CAPITAL A
1
>> countSymbols('💩') // U+1F4A9 PILE OF POO
1
But if we’re being really pedantic, counting the number of symbols in a string is even more complicated. Consider this example:
>> 'mañana' == 'mañana'
false
JavaScript is telling us that these strings are different, but visually, there’s no way to tell! So what’s going on there?
As my JavaScript escapes tool would tell you, the reason is the following:
>> 'ma\xF1ana' == 'man\u0303ana'
false
>> 'ma\xF1ana'.length
6
>> 'man\u0303ana'.length
7
The first string contains U+00F1 LATIN SMALL LETTER N WITH TILDE, while the second string uses two separate code points (U+006E LATIN SMALL LETTER N and U+0303 COMBINING TILDE) to create the same glyph. That explains why they’re not equal, and why they have a different length
.
However, if we want to count the number of symbols in these strings the same way a human being would, we’d expect the answer 6
for both strings, since that’s the number of visually distinguishable glyphs in each string. How can we make this happen?
In ECMAScript 6, the solution is fairly simple:
function countSymbolsPedantically(string) {
// Unicode Normalization, NFC form, to account for lookalikes:
var normalized = string.normalize('NFC');
// Account for astral symbols / surrogates, just like we did before:
return punycode.ucs2.decode(normalized).length;
}
The normalize
method on String.prototype
performs Unicode normalization, which accounts for these differences. If there is a single code point that represents the same glyph as another code point followed by a combining mark, it will normalize it to the single code point form.
>> countSymbolsPedantically('mañana') // U+00F1
6
>> countSymbolsPedantically('mañana') // U+006E + U+0303
6
For backwards compatibility with ECMAScript 5 and older environments, a String.prototype.normalize
polyfill can be used.
This still isn’t perfect, though — code points with multiple combining marks applied to them always result in a single visual glyph, but may not have a normalized form, in which case normalization doesn’t help. For example:
>> 'q\u0307\u0323'.normalize('NFC') // `q̣̇`
'q\u0307\u0323'
>> countSymbolsPedantically('q\u0307\u0323')
3 // not 1
>> countSymbolsPedantically('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')
74 // not 6
You could use a regular expression to remove any combining marks from the input string instead if a more accurate solution is needed.
// Regex generated by this script: https://github.com/mathiasbynens/esrever/blob/master/scripts/export-data.js
var regexSymbolWithCombiningMarks = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]+)/g;

function countSymbolsIgnoringCombiningMarks(string) {
// Remove any combining marks, leaving only the symbols they belong to:
var stripped = string.replace(regexSymbolWithCombiningMarks, function($0, symbol, combiningMarks) {
return symbol;
});
// Account for astral symbols / surrogates, just like we did before:
return punycode.ucs2.decode(stripped).length;
}
This function removes any combining marks, leaving only the symbols they belong to. Any unmatched combining marks (at the start of the string) are left intact. This solution works even in ECMAScript 3 environments, and it provides the most accurate results yet:
>> countSymbolsIgnoringCombiningMarks('q\u0307\u0323')
1
>> countSymbolsIgnoringCombiningMarks('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')
6
Here’s an example of a similar problem: reversing a string in JavaScript. How hard can it be, right? A common, very simple, solution to this problem is the following:
// naive solution
function reverse(string) {
return string.split('').reverse().join('');
}
It seems to work fine in a lot of cases:
>> reverse('abc')
'cba'
>> reverse('mañana') // U+00F1
'anañam'
However, it completely messes up strings that contain combining marks or astral symbols.
>> reverse('mañana') // U+006E + U+0303
'anãnam' // note: the `~` is now applied to the `a` instead of the `n`
>> reverse('💩') // U+1F4A9
'��' // `'\uDCA9\uD83D'`, the surrogate pair for `💩` in the wrong order
Luckily, a brilliant computer scientist named Missy Elliot came up with a bulletproof algorithm that accounts for these issues. It goes:
I put my thang down, flip it, and reverse it. I put my thang down, flip it, and reverse it.
And indeed: by swapping the position of any combining marks with the symbol they belong to, as well as reversing any surrogate pairs before further processing the string, the issues are avoided successfully. Thanks, Missy!
// using Esrever (http://mths.be/esrever)
>> esrever.reverse('mañana') // U+006E + U+0303
'anañam'
>> esrever.reverse('💩') // U+1F4A9
'💩' // U+1F4A9
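If you’re curious what that trick looks like in code, here’s a minimal sketch of the idea. It is not Esrever’s actual implementation; it reuses the regexSymbolWithCombiningMarks regular expression defined earlier, and the function name is made up for illustration:

function reverseSymbols(string) {
	// Move each run of combining marks in front of its base symbol (reversing the
	// marks themselves), so the naive reversal below puts them back in place.
	string = string.replace(regexSymbolWithCombiningMarks, function($0, symbol, combiningMarks) {
		return combiningMarks.split('').reverse().join('') + symbol;
	});
	// Swap the two halves of each surrogate pair for the same reason.
	string = string.replace(/([\uD800-\uDBFF])([\uDC00-\uDFFF])/g, '$2$1');
	// Now the naive approach works on the prepared string.
	return string.split('').reverse().join('');
}

>> reverseSymbols('mañana') // U+006E + U+0303
'anañam'
>> reverseSymbols('💩') // U+1F4A9
'💩'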
This behavior affects other string methods, too.
String.fromCharCode
allows you to create a string based on a Unicode code point. But it only works correctly for code points in the BMP range (i.e. from U+0000 to U+FFFF). If you use it with an astral code point, you’ll get an unexpected result.
>> String.fromCharCode(0x0041) // U+0041
'A' // U+0041
>> String.fromCharCode(0x1F4A9) // U+1F4A9
'' // U+F4A9, not U+1F4A9
The only workaround is to calculate the code points for the surrogate halves yourself, and pass them as separate arguments.
>> String.fromCharCode(0xD83D, 0xDCA9)
'💩' // U+1F4A9
If you don’t want to go through the trouble of calculating the surrogate halves, you could resort to Punycode.js’s utility methods once again:
>> punycode.ucs2.encode([ 0x1F4A9 ])
'💩' // U+1F4A9
Luckily, ECMAScript 6 introduces String.fromCodePoint(codePoint)
which does handle astral symbols correctly. It can be used for any Unicode code point, i.e. from U+000000 to U+10FFFF.
>> String.fromCodePoint(0x1F4A9)
'💩' // U+1F4A9
For backwards compatibility with ECMAScript 5 and older environments, use a String.fromCodePoint()
polyfill.
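The essence of such a polyfill is the surrogate formula from earlier wrapped around String.fromCharCode. Here’s a simplified sketch as a standalone helper, handling only a single code point and skipping the argument validation a real polyfill needs:

function fromCodePoint(codePoint) { // simplified standalone helper, not the actual polyfill
	if (codePoint <= 0xFFFF) {
		// BMP code points need no special treatment.
		return String.fromCharCode(codePoint);
	}
	var offset = codePoint - 0x10000;
	return String.fromCharCode(
		0xD800 + (offset >> 10),  // high surrogate
		0xDC00 + (offset & 0x3FF) // low surrogate
	);
}

>> fromCodePoint(0x1F4A9)
'💩' // U+1F4A9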
If you use String.prototype.charAt(position)
to retrieve the first symbol in the string containing the pile of poo character, you’ll only get the first surrogate half instead of the whole symbol.
>> '💩'.charAt(0) // U+1F4A9
'\uD83D' // U+D83D, i.e. the first surrogate half for U+1F4A9
There’s a proposal to introduce String.prototype.at(position)
in ECMAScript 6. It would be like charAt
except it deals with full symbols instead of surrogate halves whenever possible.
>> '💩'.at(0) // U+1F4A9
'💩' // U+1F4A9
For backwards compatibility with ECMAScript 5 and older environments, a String.prototype.at()
polyfill/prollyfill is available.
Similarly, if you use String.prototype.charCodeAt(position)
to retrieve the code point of the first symbol in the string, you’ll get the code point of the first surrogate half instead of the code point of the pile of poo character.
>> '💩'.charCodeAt(0)
0xD83D
Luckily, ECMAScript 6 introduces String.prototype.codePointAt(position)
, which is like charCodeAt
except it deals with full symbols instead of surrogate halves whenever possible.
>> '💩'.codePointAt(0)
0x1F4A9
For backwards compatibility with ECMAScript 5 and older environments, use a String.prototype.codePointAt()
polyfill.
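Under the hood, such a polyfill boils down to a surrogate check plus the formula from earlier, combining the two halves into a single code point. Here’s a simplified sketch, written as a standalone helper rather than a String.prototype method:

function codePointAt(string, position) { // simplified standalone helper
	var first = string.charCodeAt(position);
	// If this is a high surrogate followed by a low surrogate, combine the halves.
	if (first >= 0xD800 && first <= 0xDBFF) {
		var second = string.charCodeAt(position + 1);
		if (second >= 0xDC00 && second <= 0xDFFF) {
			return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
		}
	}
	return first;
}

>> codePointAt('💩', 0)
0x1F4A9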
Let’s say you want to loop over every symbol in a string and do something with each separate symbol.
In ECMAScript 5 you’d have to write a lot of boilerplate code just to account for surrogate pairs:
function getSymbols(string) {
var length = string.length;
var index = -1;
var output = [];
var character;
var charCode;
while (++index < length) {
character = string.charAt(index);
charCode = character.charCodeAt(0);
if (charCode >= 0xD800 && charCode <= 0xDBFF) {
// note: this doesn’t account for lone high surrogates
output.push(character + string.charAt(++index));
} else {
output.push(character);
}
}
return output;
}

var symbols = getSymbols('💩');
symbols.forEach(function(symbol) {
assert(symbol == '💩');
});
In ECMAScript 6, you can simply use for…of
. The string iterator deals with whole symbols instead of surrogate pairs.
for (let symbol of '💩') {
assert(symbol == '💩');
}
Unfortunately there’s no way to polyfill this, as for…of
is a grammar-level construct.
This behavior affects pretty much all string methods, including those that weren’t explicitly mentioned here (such as String.prototype.substring
, String.prototype.slice
, etc.) so be careful when using them.
The dot operator (.
) in regular expressions only matches a single “character”… But since JavaScript exposes surrogate halves as separate “characters”, it won’t ever match an astral symbol.
>> /foo.bar/.test('foo💩bar')
false
Let’s think about this for a second… What regular expression could we use to match any Unicode symbol? Any ideas? As demonstrated, .
is not sufficient, because it doesn’t match line breaks or whole astral symbols.
>> /^.$/.test('💩')
false
To match line breaks correctly, we could use [\s\S]
instead, but that still won’t match whole astral symbols.
>> /^[\s\S]$/.test('💩')
false
As it turns out, the regular expression to match any Unicode code point is not straightforward at all:
>> /^[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]$/.test('💩') // wtf
true
Of course, you wouldn’t want to write these regular expressions by hand, let alone debug them. To generate the previous regex, I’ve used Regenerate, a library to easily create regular expressions based on a list of code points or symbols:
>> regenerate.fromCodePointRange(0x0, 0x10FFFF)
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
From left to right, this regex matches BMP symbols, or surrogate pairs (astral symbols), or lone surrogates.
While lone surrogates are technically allowed in JavaScript strings, they don’t map to any symbols by themselves, and should be avoided. The term Unicode scalar values refers to all code points except for surrogate code points. Here’s how to create a regular expression that matches any Unicode scalar value:
>> regenerate()
.addRange(0x0, 0x10FFFF) // all Unicode code points
.removeRange(0xD800, 0xDBFF) // minus high surrogates
.removeRange(0xDC00, 0xDFFF) // minus low surrogates
.toRegExp()
/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/
Regenerate is meant to be used as part of a build script, to create complex regular expressions while still keeping the script that generates them readable and easy to maintain.
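For example, a tiny build-time script along the following lines could generate the pattern once and write it to a file the rest of the code base can load. The file names and structure here are made up for illustration; only the Regenerate calls shown earlier are assumed:

// build-regex.js (hypothetical build script)
var regenerate = require('regenerate');
var fs = require('fs');

var pattern = regenerate()
	.addRange(0x0, 0x10FFFF)     // all Unicode code points
	.removeRange(0xD800, 0xDFFF) // minus surrogate code points
	.toString();

// Write a generated source file that the rest of the code base requires.
fs.writeFileSync(
	'unicode-scalar-value-regex.js',
	'module.exports = /' + pattern + '/;'
);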
ECMAScript 6 will hopefully introduce a u
flag for regular expressions that causes the .
operator to match whole code points instead of surrogate halves.
>> /foo.bar/.test('foo💩bar')
false
>> /foo.bar/u.test('foo💩bar')
true
When the u
flag is set, .
is equivalent to the following backwards-compatible regular expression pattern:
>> regenerate()
.addRange(0x0, 0x10FFFF) // all Unicode code points
.remove( // minus `LineTerminator`s (http://ecma-international.org/ecma-262/5.1/#sec-7.3):
0x000A, // Line Feed <LF>
0x000D, // Carriage Return <CR>
0x2028, // Line Separator <LS>
0x2029 // Paragraph Separator <PS>
)
.toString();
'[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>> /foo(?:[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])bar/u.test('foo💩bar')
true
Considering that /[a-c]/
matches any symbol from U+0061 LATIN SMALL LETTER A to U+0063 LATIN SMALL LETTER C, it might seem like /[💩-💫]/
would match any symbol from U+1F4A9 PILE OF POO to U+1F4AB DIZZY SYMBOL. This is however not the case:
>> /[💩-💫]/
SyntaxError: Invalid regular expression: Range out of order in character class
The reason this happens is because that regular expression is equivalent to:
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/
SyntaxError: Invalid regular expression: Range out of order in character class
Instead of matching U+1F4A9, U+1F4AA, and U+1F4AB like we wanted to, the regex matches U+D83D (a lone surrogate half), the range from U+DCA9 to U+D83D (which is invalid, since its starting point is greater than its ending point, hence the error message), and U+DCAB (another lone surrogate half).
ECMAScript 6 allows you to opt in to the more sensical behavior by — once again — using the magical /u
flag.
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/u.test('\uD83D\uDCA9') // match U+1F4A9
true
>> /[\u{1F4A9}-\u{1F4AB}]/u.test('\u{1F4A9}') // match U+1F4A9
true
>> /[💩-💫]/u.test('💩') // match U+1F4A9
true
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/u.test('\uD83D\uDCAA') // match U+1F4AA
true
>> /[\u{1F4A9}-\u{1F4AB}]/u.test('\u{1F4AA}') // match U+1F4AA
true
>> /[💩-💫]/u.test('💪') // match U+1F4AA
true
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/u.test('\uD83D\uDCAB') // match U+1F4AB
true
>> /[\u{1F4A9}-\u{1F4AB}]/u.test('\u{1F4AB}') // match U+1F4AB
true
>> /[💩-💫]/u.test('💫') // match U+1F4AB
true
Sadly, this solution isn’t backwards compatible with ECMAScript 5 and older environments. If that is a concern, you should use Regenerate to generate ES5-compatible regular expressions that deal with astral ranges, or astral symbols in general:
>> regenerate.fromSymbolRange('💩', '💫')
'\uD83D[\uDCA9-\uDCAB]'
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💩') // match U+1F4A9
true
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💪') // match U+1F4AA
true
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💫') // match U+1F4AB
true
This behavior leads to many issues. Twitter, for example, allows 140 characters per tweet, and their back-end doesn’t mind what kind of symbol it is — astral or not. But because the JavaScript counter on their website at some point simply read out the string’s length
without accounting for surrogate pairs, it wasn’t possible to enter more than 70 astral symbols. (The bug has since been fixed.)
Many JavaScript libraries that deal with strings fail to account for astral symbols properly.
For example, when Countable.js was released, it didn’t count astral symbols correctly.
Underscore.string has an implementation of reverse
that doesn’t handle combining marks or astral symbols. (Use Missy Elliot’s algorithm instead.)
It also incorrectly decodes HTML numeric entities for astral symbols, such as &#x1F4A9;
. Lots of other HTML entity conversion libraries have similar problems. (Until these bugs are fixed, consider using he instead for all your HTML-encoding/decoding needs.)
These are all easy mistakes to make — after all, the way JavaScript handles Unicode is just plain annoying. This write-up already demonstrated how these bugs can be fixed; but how can we prevent them?
Whenever you’re working on a piece of JavaScript code that deals with strings or regular expressions in some way, just add a unit test that contains a pile of poo (💩
) in a string, and see if anything breaks. It’s a quick, fun, and easy way to see if your code supports astral symbols. Once you’ve found a Unicode-related bug in your code, all you need to do is apply the techniques discussed in this post to fix it.
A good test string for Unicode support in general is the following: Iñtërnâtiônàlizætiøn☃💩
. Its first 20 symbols are in the range from U+0000 to U+00FF, then there’s a symbol in the range from U+0100 to U+FFFF, and finally there’s an astral symbol (from the range of U+010000 to U+10FFFF).
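As a quick sanity check, here’s what the countSymbols helper from earlier makes of that test string; the expected numbers follow directly from the symbol counts just described:

>> 'Iñtërnâtiônàlizætiøn☃💩'.length
23 // 20 + 1 + 2 code units, since the astral symbol counts as two
>> countSymbols('Iñtërnâtiônàlizætiøn☃💩')
22 // 20 + 1 + 1 symbols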
TL;DR Go forth and submit pull requests with piles of poo in them. It’s the only way to Unicode the Web Forward®.
Disclaimer: This post is based on the latest ES6 draft and the various strawmans and proposals to further improve Unicode support in JavaScript. Some of these new features may not make it to the final ES6 specification.
This write-up summarizes the various presentations I’ve given on the subject of Unicode in JavaScript over the past few years. The slides I used for those talks are embedded below.
Want me to give this presentation at your meetup/conference? Let’s talk.