Regex unicode escaped flag

2012-07-05

It must be that time of the year again. I ran into a new syntactic edge case in JS that I didn't really know about. I suppose I could have realized it sooner, but either way I never built it into ZeParser.

The case is simple: a regular expression literal has a body and a tail. The body is everything between the forward slashes (/) that delimit it. The tail is the optional flags. It turns out that, since the grammar defines this tail as zero or more IdentifierPart productions, it can also contain unicode escapes! In fact, it may contain much more than you'd expect, but I just want to focus on the escape right now. So this is valid:

Code:
/foo/\u0069

This should be equivalent to

Code:
/foo/i

because a unicode escape is treated as its "canonical" equivalent. That is, whether a character is written literally or as a unicode escape, internally it means the same thing.
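If you want to see how a particular engine handles this, here is a minimal sketch. The parses helper is a name I made up for illustration. It compiles the source with new Function without running it, so a rejected literal shows up as a catchable SyntaxError instead of killing the whole script. I deliberately don't state what the second line prints, since that depends on the engine you run it in.

```javascript
// Hypothetical helper (not part of any library): returns true if the engine
// accepts `source` as an expression, false if it throws a SyntaxError.
function parses(source) {
  try {
    // new Function only compiles the body here; the code is never executed.
    new Function('return ' + source + ';');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else (e.g. a policy that forbids eval) is not our concern
  }
}

// The doubled backslash keeps the escape intact inside the string literal,
// so the compiled source really contains /foo/\u0069.
var plain = parses('/foo/i');
var escaped = parses('/foo/\\u0069');

console.log('plain flag parses:', plain);
console.log('escaped flag parses:', escaped);
```

Running the same snippet in different browsers is a quick way to compare them on this edge case.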

Interestingly enough, Chrome chokes on it (and currently might have a more serious problem with it too...). What surprised me was to see that Firefox choked on it too.

The syntactic oddity is probably a byproduct of lazily defining the regex flags in the grammar. Since the flags are generally just zero or more ASCII letters, they simply defined them as zero or more IdentifierParts.
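For a tokenizer, following the grammar as written means the flag scanner has to accept escapes too. Here is a rough sketch of what that could look like. This is not ZeParser code. The scanRegexFlags name and its return shape are made up for illustration, and it makes two simplifying assumptions: only ASCII IdentifierPart characters are recognized (the real production allows many more), and it does not check that an escape decodes to a valid IdentifierPart character.

```javascript
// Simplifying assumption: ASCII-only IdentifierPart characters.
var ASCII_ID_PART = /[A-Za-z0-9$_]/;
// A unicode escape is a backslash, a "u", and exactly four hex digits.
var UNICODE_ESCAPE = /^\\u([0-9a-fA-F]{4})/;

// Hypothetical helper: scans flags starting at `pos`, the index right after
// the closing slash. Returns the raw text, the decoded flags, and the index
// where scanning stopped.
function scanRegexFlags(input, pos) {
  var start = pos;
  var decoded = '';
  while (pos < input.length) {
    var ch = input.charAt(pos);
    if (ASCII_ID_PART.test(ch)) {
      decoded += ch;
      pos += 1;
      continue;
    }
    var match = UNICODE_ESCAPE.exec(input.slice(pos, pos + 6));
    if (!match) break; // not part of the flags anymore
    // Store the character the escape stands for, not the escape itself.
    decoded += String.fromCharCode(parseInt(match[1], 16));
    pos += 6;
  }
  return { raw: input.slice(start, pos), decoded: decoded, end: pos };
}

// The input is the source text /foo/\u0069g; and the flags start at index 5.
var result = scanRegexFlags('/foo/\\u0069g;', 5);
console.log(result.raw, result.decoded, result.end);
```

Keeping both the raw and the decoded form lets the tokenizer report the original source text while still validating the flags as if they were written literally.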

Ok, back to work you!