5. Notational Conventions

2010-03-09

This post describes how the language is described and how you can apply the terminology used yourself. It's a bit abstract, but you can do it!

This specification is defined in terms of so-called "productions". They are part of a system called a "Context Free Grammar", or CFG. A CFG defines the syntax rules of a language without knowing anything about the semantics (meaning). The "context free" part means that a production can be applied wherever it occurs, regardless of what surrounds it.

What confused me personally for a long time was the usage of the words "terminal" and "non-terminal". I thought they had something to do with some kind of ending. They don't. A terminal is any character that might occur literally in the source text. A non-terminal is any production; it will never be seen literally in any JavaScript source.

For example, LineTerminator is a production (a non-terminal), whereas function is a series of eight characters that should occur literally in the source, which is why they are called terminal symbols.

A CFG is a closed set of rules able to define a language at a very low level. The type of CFG used in the ECMAScript specification is slightly different from a regular CFG in that it is able to look ahead in certain edge cases.

A CFG is made up of productions. Every production (think of it as one way to match part of the source) has a set of alternatives, themselves also called productions, which I call rules. For example:

Code: (CFG)
Example :
    abc
    Example abc

Example is a production with two productions, which I will call rules from now on to make the difference more obvious.

So the first rule is a literal abc. This should occur literally in the source text.

The second rule is the same production again ("recursive"), Example, followed by the same literal.

So when you try to match this production against the source text, the text should start with abc. A parser should try all rules and take the longest sequence of characters that matches. After it has matched one rule, it should retry all the "recursive rules", which are the rules in which the production itself also occurs. In such a rule the production itself matches nothing new but does not reject the match either (because it already matched something), and the rest of the rule is then applied.

For example, say you have the input string "abcabcabdabc". When only using Example, the first six characters are parsed as Example Example: first abc, then Example abc. After that the parser will reject, because nothing matches the remaining string, "abdabc". Note that the parser will probably return a parse tree with one Example node on top having two Example nodes as children, one for each match. (But this will probably vary per implementation.)
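
To make the walkthrough above a bit more concrete, here is a minimal sketch of a matcher for the Example production only. The function name matchExample and the shape of its return value are my own invention for illustration, and a real parser would be driven by the grammar rather than hard-coded like this.

```javascript
// Hypothetical helper (not from the spec). It matches the production
//   Example :
//       abc
//       Example abc
// at the start of `input` and reports how much was consumed.
function matchExample(input) {
  const literal = "abc";
  let pos = 0;
  let count = 0;

  // First rule: the literal abc must occur at least once.
  if (!input.startsWith(literal, pos)) {
    return null; // the production does not match at all
  }
  pos += literal.length;
  count += 1;

  // Second rule (recursive): what we matched so far stands for Example,
  // so keep retrying "Example abc" for as long as another abc follows.
  while (input.startsWith(literal, pos)) {
    pos += literal.length;
    count += 1;
  }

  return { matched: input.slice(0, pos), rest: input.slice(pos), count };
}

const result = matchExample("abcabcabdabc");
console.log(result);
// { matched: 'abcabc', rest: 'abdabc', count: 2 }
```

The loop is just one way to handle the recursive rule without recursing forever. The two matches correspond to the two Example entries in the parse tree, and the leftover "abdabc" is what makes the parser reject.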

The specification writes all productions (or non-terminals) in italics and all source that should occur literally in a fixed-width typeface. This latter convention makes it difficult to read, so in my posts I will mark them green and red for clarity. That way, when you see a dot or a parenthesis, you'll know it's a literal.

When parsing, you should be parsing for a single production called the "goal symbol". When the rules are applied in the above fashion, that goal symbol should be able to parse the entire language that is defined by the CFG.

There are four special circumstances which you will encounter for these productions.
- negative lookahead
- exceptions ("but not")
- optionals (opt in subscript)
- special descriptions

Negative lookahead is mostly used for line terminators. For instance, the ReturnStatement does not allow a LineTerminator between return and its argument, if any. So the parser should look ahead to the next token. If it matches anything in the set of things not allowed, the match fails. So even when a production might succeed, the lookahead can still cause a rejection.
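
A rough sketch of that idea, under heavy simplification: the helper names lookaheadAllows and matchReturnWithArgument are hypothetical, whitespace is limited to spaces and tabs, and the code only checks that the input starts with the characters return (it does not verify that this is really the keyword). The point is only that the lookahead inspects what comes next without consuming it.

```javascript
// The four line terminator characters.
const LINE_TERMINATORS = ["\n", "\r", "\u2028", "\u2029"];

// Returns true when none of the forbidden strings starts at `pos`.
// Nothing is consumed: the caller's position stays where it was.
function lookaheadAllows(source, pos, forbidden) {
  return !forbidden.some((text) => source.startsWith(text, pos));
}

// Simplified rule: return, no line terminator, then some argument.
function matchReturnWithArgument(source) {
  const keyword = "return";
  if (!source.startsWith(keyword)) return false;

  let pos = keyword.length;
  // Skip spaces and tabs only (a simplification of real whitespace).
  while (source[pos] === " " || source[pos] === "\t") pos += 1;

  // The rule could still succeed here, but the lookahead may reject it.
  if (!lookaheadAllows(source, pos, LINE_TERMINATORS)) return false;

  // Anything other than end of input or a semicolon counts as an argument.
  return pos < source.length && source[pos] !== ";";
}

console.log(matchReturnWithArgument("return 5;"));  // true
console.log(matchReturnWithArgument("return\n5;")); // false
```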

Exceptions are applied after something is matched. The matched substring of the source is compared with a set of productions or strings. If one of them matches, the original match is rejected after all.
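
The best known case in the spec is Identifier, which is defined as IdentifierName but not ReservedWord. Here is a small sketch of that two-step process. The regular expression is an ASCII-only simplification of IdentifierName and the reserved word list is a deliberately incomplete sample, so treat both as assumptions of mine rather than the spec's definitions.

```javascript
// Small, incomplete sample of reserved words, for illustration only.
const RESERVED_SAMPLE = new Set(["function", "return", "var", "if", "else"]);

// Simplified IdentifierName: ASCII letters, digits, $ and _,
// not starting with a digit. The real production allows far more.
function matchIdentifierName(source) {
  const found = /^[A-Za-z_$][A-Za-z0-9_$]*/.exec(source);
  return found ? found[0] : null;
}

// Identifier = IdentifierName but not ReservedWord
function matchIdentifier(source) {
  const name = matchIdentifierName(source);   // step 1: match first
  if (name === null) return null;
  if (RESERVED_SAMPLE.has(name)) return null; // step 2: reject after the fact
  return name;
}

console.log(matchIdentifier("foo = 1"));  // foo
console.log(matchIdentifier("return x")); // null
console.log(matchIdentifier("returned")); // returned
```

Note that the exception is compared against the whole matched substring, which is why returned is fine even though it starts with return.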

Optional productions may fail to match but will never cause a rule to be rejected for that production. So whenever a production is optional, it may be matched, but if that's impossible, the whole rule is not rejected (which would otherwise be the case). So this means that the parent production actually has two different rules; one with and one without that optional production. If there are two optional productions in a rule, it denotes four different rules (one rule with both, one with the first, one with the second and one with neither of the optional productions).
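
The doubling can be spelled out mechanically. Below is a small sketch that expands a rule with optional parts into all the plain rules it stands for. The representation (objects with a name and an opt flag) and the function name expandOptionals are made up for this example.

```javascript
// A rule is an array of symbols; an optional symbol has opt: true.
function expandOptionals(rule) {
  let expansions = [[]];
  for (const symbol of rule) {
    if (symbol.opt) {
      // Each rule so far splits in two: one with the symbol, one without.
      expansions = expansions.flatMap((partial) => [
        [...partial, symbol.name],
        partial,
      ]);
    } else {
      // Required symbols are simply appended to every rule so far.
      expansions = expansions.map((partial) => [...partial, symbol.name]);
    }
  }
  return expansions;
}

const rule = [
  { name: "A" },
  { name: "B", opt: true },
  { name: "C", opt: true },
];
for (const expansion of expandOptionals(rule)) {
  console.log(expansion.join(" "));
}
// A B C
// A B
// A C
// A
```

Two optional symbols give four rules, three would give eight, and so on.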

Special descriptions are specifically used for character sets such as SourceCharacter. There are simply too many characters to list them all so they are described. This concerns only about six productions in the whole spec. They all have something to do with character sets.

The colon after a production describes, to some extent, which grammar (CFG) a production belongs to. There can be one, two or three colons. Don't worry about this, it's not really important for us right now.

This is basically how the CFG works. It is used quite extensively in the specification, so knowing how it works will go a long way in helping you understand the syntactic peculiarities :)