Parsing ES6: Arrows


I've started to develop the next version of my JS parser. The old one only supported ES5. However ES6 introduced new syntax changes and so the parser simply can't parse it. That means I can't create new tooling for JS (built on my own parser) and that's just holding me back for implementing my ideas. So okay, a parser.

I'm still contemplating the actual architecture to use so I figured let's try to solve one of the bigger new parsing problems: the ES6 arrow function.

The problem with arrow syntax is that it you don't know what you're parsing until you've reached the end of it. And the header can have an arbitrary size so you can't really hardcode an efficient specific path for it.

You might wonder why that's a problem. Why not just do a peek parse until you've determined arrow or group mode and restart it in that specific mode? Ideally I don't want the parser to visit the same input characters (or tokens) more than once. I'd like to think it's quite obvious why, especially in cases like these where the peeking could be of arbitrary size. I mean, in the real world the headers won't be that big (in before (f=()=>superlongfunction)=>f() becomes a thing) but when writing a parser you need to be generic. In the end I'll probably go for a hybrid solution, but more on that later.

Conceptually an arrow function is a short hand way of defining a function. There are also some other important differences, like how this and arguments are handled and how the return value is implicit. Those are not super relevant to our parsing problem, though. It's all about the ambiguous syntax.

Let's take a look at the syntax. The arrow function takes any number of parameters, including zero, that can each have an initializer. The parenthesis are only optional for a single identifier without initializer. Additionally the "regular" destructuring syntax is supported as well as the "rest parameter".

arg => arg
() => 15
(arg) => arg
(a, b, c) => [ a, b, c ]

On the one hand they are different but they're really kind of the same.

Fun fact: this is the only place in the language where the opening parenthesis may be immediately followed by a closing parenthesis.

The arrow header is pretty much the same as the regular function header. The rest parameter can only be the last one and its name is to be preceded by three dots (...).

The big problem here is that the header of an arrow is usually also a valid expression if not followed by the actual arrow (=>). And obviously the result is completely different. So when you start parsing an expression you may not be able to know what you're actually parsing until you find a token that disambiguates this for you. More often than not, this will be the arrow token itself (=>).

Let's take a look at higher level parsing patterns.

The body of an arrow function can either be a single expression, which is also the return value, or a block that has the same rules as regular function bodies. The arrow body is not really a parsing problem so I'll skip that here and call it the parseArrowBody function.

The simplest arrow header cases are ident => body and () => body. These headers are very short and simple in those cases as there is no initializer or destructuring going on. Both cases are a simple check at the start of the expression and are easy to assert.

That basically leaves us with two cases; one with plain args and one with destructuring syntax. Additionally both can have an optional initializer.

(a) => body
(a = expression) => body
([ a, b ]) => body
([ a, b ] = x) => body

In both cases there can be multiple args with commas but that's a trivial problem in the context of parsing because you simply parse "an expression" and then either expect a comma or the closing parenthesis to follow. Your expression parsing function will keep on consuming tokens until they can no longer be part of the conceptual expression. From the caller point you don't have to worry about this and can simply check the fist token after the call.

(a, b, c) => body
([ a, b ], [ c, d ]) => body
(a, [ b, c ], d) => body

And keep in mind, these too may also just be an arbitrary grouped expression. So at any time an arbitrary number of arbitrary expressions could be parsed. All of the above cases would still be valid if the => body part would be dropped. Obviously the execution semantics would be very different.

At high level this means your groupOrArrow parsing function starts with consuming an opening parenthesis. That much seems obvious. The single identifier header gets a special path. After that you'll want to check whether the next token (which is already parsed) is an identifier, an assignment, and an expression. Alternatively you'll need to check the more complex destructuring tokens and process them that way. If at any point while parsing the args something was encountered that wouldn't be valid in an arrow header the parsing can be continued with a parser that's not burdened by the possibility of parsing an arrow header. And vice versa. This can be a little complex on its own, to continue parsing midway a structure. But certainly doable.

It gets harder when going beyond "validation" parsing. When you build up scopes you'll need to make a decision on whether you want to precautionary want to build up scopes "just in case", or whether you want to parse ahead until you know for sure what to parse and then repeat parsing it in that particular mode. And when building an AST you must know whether you're currently parsing an arrow header or regular group. For these cases we can only really resort to super inefficient code. For example code that generates objects for every parse "just in case", or that creates an AST tree for a group and an arrow. Alternatively we could use some heuristics and hardcode some parameter offsets so we can revisit them once we disambiguate or reach the end of the header. If there are more args than hardcoded places, we can still choose to do a full reparse from the start of the header. But in most cases the arrow header will be short and a hardcoded fast path seems useful. Scoping gets a similar treatment.

I'm probably going to go down the heuristic path. Hardcode some offset variables to hold the first three or four argument tokens for quick rescans once the mode is known. If that doesn't work I'll probably have to fall back to a full reparse system, but I'd really rather not.

I'll probably post a new blog about the architecture once I settle on it. It's basically a toss between pure static functions, scoped functions that use a closure over state, or using a state objects. I've already got feedback that it's pretty much the same so it'll be more about personal preference and maintainability. Still, not an easy choice :)