Going unicode without hurting perf

2012-08-08

Bit of a cryptic topic, but this blog post concerns parsing JS source and specifically unicode characters in JS identifiers.

So ES5 allows a wide range of unicode characters in variable and function names. This is fine and I'm sure you'll appreciate that if you're in a country that actually uses them on a daily basis. My language, Dutch, doesn't really (though a few words do have diacritical marks) but for example German has the "Ringel-S", ß. Though these days they tend to write it as a double s too. Or take Greek, whose letters don't even exist in the ascii range.

Now I'm not one to write JS in any other language than English. Doesn't feel natural to me. But I can dig it if you do want to do that. In fact, I recently learned about the Babylscript project. This aims to allow you to work in JS in your native language. So all keywords are translated to your own language. In Dutch, the for keyword would become voor, etc. I shiver at those translations, but I guess some people would like them. At least I can appreciate the concept :) The Java applet is just a jab, I'm sure.

Aaaanyways, enough foreplay. Let's get down to the problem at hand. When parsing JS source you need to parse identifiers. These are the variable names and keywords that you use in the language to keep your sanity. So while you'll almost never see identifiers with non-ascii characters, as I said above, they are valid. That's for instance why ಠ_ಠ is a valid variable name. The ಠ is a letter in the Kannada script. Read more about that in this blog post.

The number of valid characters runs past ten thousand. For my first version I made a simple regex that covers them.

Running those unicode regexes is of course a waste of cpu cycles. So for the first version of ZeParser, if unicode was enabled (I figured I'd make it optional), I'd first check the next character with an object before actually running the unicode regex: {'>':1,'=':1,'!':1,'|':1,'<':1,'+':1,'-':1,'&':1,'*':1,'%':1,'^':1,'/':1,'{':1,'}':1,'(':1,')':1,'[':1,']':1,'.':1,';':1,',':1,'~':1,'?':1,':':1,'\\':1,'\'':1,'"':1,' ':1,'\t':1,'\n':1};. If the character was a key in that hash, I'd skip the unicode regex; otherwise I'd run it.
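To make that a bit more concrete, here's a rough sketch of the lookup-then-regex idea. It's not the actual ZeParser code: the names NOT_IDENTIFIER, UNICODE_ID_PART and isUnicodeIdentifierPart are made up for this example, and the regex is a tiny stand-in (a few Latin-1 and Greek letters) for the real 11k+ one.

```javascript
// The punctuator/whitespace lookup from the post.
var NOT_IDENTIFIER = {'>':1,'=':1,'!':1,'|':1,'<':1,'+':1,'-':1,'&':1,'*':1,'%':1,'^':1,'/':1,'{':1,'}':1,'(':1,')':1,'[':1,']':1,'.':1,';':1,',':1,'~':1,'?':1,':':1,'\\':1,'\'':1,'"':1,' ':1,'\t':1,'\n':1};

// Hypothetical stand-in for the full unicode regex. It only knows some
// Latin-1 letters and the basic Greek alphabet, to keep the example short.
var UNICODE_ID_PART = /[\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0391-\u03A1\u03A3-\u03A9\u03B1-\u03C9]/;

// Hypothetical helper. Assumes the caller already handled the plain ascii
// identifier characters (letters, digits, $ and _) on a fast path.
function isUnicodeIdentifierPart(chr) {
  // Cheap path: a known punctuator or whitespace never needs the regex.
  if (Object.prototype.hasOwnProperty.call(NOT_IDENTIFIER, chr)) return false;
  // Expensive path: anything else gets tested against the regex.
  return UNICODE_ID_PART.test(chr);
}
```

So isUnicodeIdentifierPart(';') bails out on the hash without touching the regex, while isUnicodeIdentifierPart('ß') falls through to the regex.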

In hindsight, that was actually quite silly. ES5 does not specify any keywords with non-ascii characters. And none of its punctuators are non-ascii (they don't exist on US qwerty keyboards, which would make them hard to type). Furthermore, of all the ascii characters, only the letters, numbers, and $ and _ are allowed in identifiers. Which brings us to the center of this blog post: the simple check to decide between taking the heavy unicode path and considering the token ended is whether the next character is an ascii character. If it is, it can't be part of the identifier (a valid one would already have been consumed, otherwise you would not have been here), so the token ends. If it's not ascii, you'll either add another character to the current token or you'll run into an error. The check is as simple as if (nextCharCode > 127) ....

Oh, note that there are three exceptions to that rule. There's a unicode character that's considered whitespace (0xFEFF) and two unicode line terminators (0x2028 and 0x2029) that are explicitly defined in the spec as such. But checking for four numbers is still cheaper than running an 11k+ regex... :)
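Put together, the check could look something like the sketch below. This is an illustration under a few assumptions, not the real ZeParser code: the function names are made up, the regex is again a small stand-in for the full one, the ascii cutoff is taken as 127 (the last ascii code point), and the whitespace exception is taken to be the byte order mark, 0xFEFF, which ES5 lists as whitespace. It also ignores the rule that an identifier can't start with a digit, and \uXXXX escapes. ES5 counts a few more non-ascii characters as whitespace (0xA0 and the other "space separator" characters); those aren't identifier characters, so here they simply fail the regex.

```javascript
// Hypothetical stand-in for the full unicode regex: some Latin-1 letters,
// basic Greek, and the Kannada letter from the look of disapproval.
var UNICODE_ID_PART = /[\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0391-\u03A1\u03A3-\u03A9\u03B1-\u03C9\u0CA0]/;

// Fast path: a-z, A-Z, 0-9, $ and _.
function isAsciiIdentifierPart(c) {
  return (c >= 97 && c <= 122) || (c >= 65 && c <= 90) ||
         (c >= 48 && c <= 57) || c === 36 || c === 95;
}

// The "four numbers": one ascii cutoff plus three explicit exceptions.
function needsUnicodeCheck(c) {
  if (c <= 127) return false;                       // ascii: token just ends
  if (c === 0x2028 || c === 0x2029) return false;   // line terminators
  if (c === 0xFEFF) return false;                   // BOM, counts as whitespace
  return true;                                      // only now pay for the regex
}

// Hypothetical scanner: returns the identifier starting at `start`.
function readIdentifier(source, start) {
  var pos = start;
  while (pos < source.length) {
    var c = source.charCodeAt(pos);
    if (isAsciiIdentifierPart(c)) { ++pos; continue; }
    if (needsUnicodeCheck(c) && UNICODE_ID_PART.test(source.charAt(pos))) { ++pos; continue; }
    break; // punctuator, whitespace, or an invalid character
  }
  return source.slice(start, pos);
}
```

For plain ascii source like foo+bar the regex never runs at all; it only gets touched when something like straße or ಠ_ಠ actually shows up.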

I do hope that ES6 will sport these unicode character groups as regex escapes, but I doubt it... :/

Hope you liked it.