Where does the regex go

2011-02-13

Parsing js (or ECMAScript) is not as straightforward as it might seem. Even though the language has a simple syntactic structure and is easy to learn, it's not as context free as one might like. Yes, the specification gives you a "context free grammar" (CFG) but it's not really Free. It could be made Free, but that entails rewriting the grammar in such a way that it's unusable for a specification.

As the title suggests, the main culprit is the regular expression literal. Of course, after that there's also the case of "Automatic Semi-colon Insertion" (ASI) and some minor cases of "strict mode". But they can usually be applied after parsing. The regex literal you need to deal with while parsing.

tl;dr: scroll down.

Multi functional forward slash

Let's review the uses of the forward slash in js...

Code: (js)
/abc/

/ab/c

a/b/c

//abc

/* abc */

a /= bc

a
/b/

a
/b/c

There are no less than six different ways for a forward slash to occur in js.

Quiz: of the eight cases presented above, only one is illegal (each case is separated by an empty line). Can you figure out which? No? Continue reading.

Quiz: can you name a punctuator that has even more contexts?

Intro to parsing (js)

The classic parsing method of dealing with it is to hook up the tokenizer to the parser. The tokenizer would accept a CFG (for ES, that's only A.1, A.2 and A.8.1, starting with either InputElementRegExp, InputElementDiv or JSONValue) and some source and spit out tokens. These are basically the atoms of the language. By definition a token is a single character of whitespace (like spaces and tabs, but not line terminators), line terminators, entire comments (single and multi-line), and a production named Token. This Token consists of an identifier (including reserved words), a punctuator (+*/{([ etc), numeric and string literals (single and double quote).

The parser then takes these tokens and tries to apply a second CFG (starting at Program or JSONText) to them. It can't just accept these tokens and put them in the parse tree because an identifier might not be valid at that specific position. For example you can't create a variable named for, but it's a valid IdentifierName, which is all the tokenizer cares about. The parser will analyze that identifier further and determine whether it's valid in that context.

Two notable exception from Punctuator is the division sign (/) and from Token we're missing the regular expression literal. This is because they both start with a forward slash and as I said before it would not be easy to build a useful CFG that's still able to make the distinction. So when the parser asks the tokenizer for the next token, it also tells it whether or not it might be looking for a regular expression. The specification literally notes that there is no context in which both a regex and a division might be possible. They had to cheat a little for this to happen, but I'll get back to that :)

Where

So at first glance it's a little puzzling where the regular expression is allowed. Especially from the CFG it's not easy to see the patterns. However, using the general knowledge of the language and writing some tests for my first successful parser gave me some insights to this problem as well. The first pattern is that a regular expression is not allowed to cross lines. Just like a string literal, except string literals can use the line continuation escape. Regular expressions don't have this (although there's talk about maybe adding this..). So that's rule one: regular expressions do not span multiple lines.

The next thing is that a regular expression is, as the name says, "an expression". That means you can apply the generic logic of where expressions are allowed to it. Let's view this from the first base element of any Program where expressions are allowed. This would be a statement. More specifically, it would be the ExpressionStatement production. I'm just gonna use "statement" though. If you're picky, it can be proven that the actual root production where you need to start looking is Expression, since a regular expression can't start with function or {. The second rule is: a statement never begins with a division

Punctuators

So we're trying to make the distinction between where a regex and a div is allowed. At the start of a statement, only regex is allowed. After a regex, only div is allowed. And likewise, after a div, only regex is allowed. That doesn't help us much :) Let's be more generic and look at punctuators. Looking at the list of punctuators ({ } ( ) [ ] . ; , < > <= >= == != === !== + - * % ++ -- << >> >>> & | ^ ! ~ && || ? : = += -= *= %= <<= >>= >>>= &= |= ^=) there are actually only three cases that need special care: {, } and ).

For any mathematical/logical operator in that list (+*&) can only be followed by a regular expression. It makes no sense (and is not allowed) to have 5 * / 6 in js. Of course, it makes no sense to multiply a regular expression with anything else, but at least syntactically it is allowed.

The compound assignments and the normal assignment may only be followed by a regular expression, never a division.

For the remaining punctuators, ( [ ; , ? : may always only be followed by a regular expression. On the other hand ] may only be followed by a division (either property or array expression). And . may be followed by neither if it's a property operator, by a division if part of a number. Note that for ;, it doesn't matter whether it occurs in a for-header or as a statement closer. In both cases, the next token may only be a regular expression.

So that leaves us with { } ).

The ) is basically either the closer of the grouping operator or a call expression (function invocation). In those cases only a division is allowed. Otherwise it's part of a statement header (for, while, etc) and the next token is a new statement so only regular expression is allowed.

The { can either be the opening to an object literal (neither allowed to follow it) or a block statement / function declaration (only regular expression allowed). Likewise, the } either ends an object literal (only division allowed) or a block statement / function declaration. In the case of a block statement, you probably need to encounter a semi-colon first. But in any case, a division is never allowed. So at best, you're looking for a regular expression.

Recap (tl;dr)

I think tracking these is much easier. You'll still have to track a few things (namely, the context of the three punctuator exceptions) but that's not as complex as figuring out where a regular expression is allowed syntactically by inspecting the CFG.

So, a regular expression may always occur at the start of a statement, after unary operators and after the mathematical, logical and assignment punctuators. A division may always but only occur after any kind of expression (and never when a regular expression is allowed).

So as long as you track the statements and keep the rules laid out above in mind, you should be fine.

Caveat

For this magic to work, one exception to ASI had to be included. Namely, if the next line starts with a forward slash AND could be interpreted as a division, a semi-colon is never included through ASI. This can lead to slightly awkward situations. The statement rule still applies, but you need to remind yourself of this special ASI rule because it's not listed in the specific ASI section in the specification.

Edit: minor addition; after re-reading the first note in chapter 7, this rule seems to be lighter than I thought at first. That is; asi is only not applied when the regex statement is preceeded by a trailing expression on the previous line. This means that var a \n /b/ is indeed valid after all, but var a=x \n /b/ would not be. Important to note here is that this only evaluates the first division and not the regex as a whole. This can make a difference when the regex has a flag, which could make the difference in making the whole valid or not. Just because the first forward slash can be a division doesn't mean that both forward slashes make the program valid.

Oh, the joys of parsing... :)