A new GSSParser

2015-08-27

I love writing parsers. So when a new client asked me to write a CSS and GSS parser in JavaScript I jumped right in. I've had enough experience with CSS to figure that a CSS parser shouldn't take me that much time. GSS, on the other hand, I had no experience with at all (as it turned out later), but it looked easy enough. It's actually "CCSS", or constraint-driven CSS. Of course the parser doesn't deal with the constraints; it only parses them into a parse tree so another program can deal with them.

This post will discuss the CSS/GSS parser, which can be found at https://github.com/qfox/gssparser. It was written for a client, The Grid, who will use it to replace their current PEG-based parser in various products. Suffice it to say the performance gains were huge.

So I tackled this parser in three steps: CSS (+lexer), GSS, and cleanup/finalizing/etc.

CSS


The basic structure of CSS is (deliberately) quite simple. You have at-rules and other rules. Other rules have a preamble (the selector) and a body. The body is a set of declarations. And there are comments.

@-rules


At-rules start with an @-sign and act as custom syntax: the spec only defines a generic outline for them. This outline consists of pretty much anything after the @name part, as long as it ends with a body (that's a {} curly pair) or a semi-colon (;). Anything in between is pretty much fair game. The CSS spec of course makes use of this in several ways on its own. My parser will simply parse them as is and hand off a token stream for them. No special handling for certain known CSS @-rules either; let another parser deal with them. (Subject to change.)
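That generic outline can be sketched in a few lines. This is a minimal illustration of the rule, not gssparser's actual code (the function name and character-based scanning are made up for this post; the real parser works on a token stream):

```javascript
// Slice out one at-rule using only the generic outline: consume everything
// after the @name until a top-level semi-colon, or until a balanced {} body
// closes. Whatever is in between is handed off as-is.
function sliceAtRule(css, start) {
  // assumes css[start] === '@'
  let i = start;
  let depth = 0;
  while (i < css.length) {
    const c = css[i++];
    if (c === '{') depth++;
    else if (c === '}') { depth--; if (depth === 0) break; }
    else if (c === ';' && depth === 0) break;
  }
  return css.slice(start, i); // raw span; a later pass interprets it
}

sliceAtRule('@import url(x.css); p { a: b; }', 0);
// -> '@import url(x.css);'
```

Note that the semi-colon inside a body does not terminate the rule; only a top-level one does, which is exactly why the depth counter is there.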

Regular rules


So then for regular rules. You start with a so-called preamble (a CSS selector), followed by a body which contains declarations to actually set the styles: preamble body, or foo { bar: baz; }.

Preamble


The preamble is better known as a comma-delimited list of CSS selectors. Their basic structure is a series of atoms chained together in some way with (or without) combinators. An atom would be the smallest valid (sub-)selector that applies to a single element. A tricky aspect here is that a space character is also a combinator, but only if there wasn't any other combinator (besides spaces) before or after it. This is annoying because it means the lexer (the part of the parsing library that chunks the input up into tokens so it's easier to manage) has to mark its tokens with a boolean to indicate that the token was preceded by a space. Hm, come to think of it, is this any whitespace or only spaces? Hold up, I need to confirm this and write some extra tests...
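The space-flag idea can be sketched like this. The token shape and the lex function are illustrative only, not gssparser's actual format:

```javascript
// Toy lexer that records, per token, whether whitespace preceded it.
// The parser later uses this flag to decide whether a space acted as a
// descendant combinator (only when no other combinator is adjacent).
function lex(input) {
  const tokens = [];
  let i = 0;
  let sawWs = false;
  while (i < input.length) {
    const c = input[i];
    if (/\s/.test(c)) { sawWs = true; i++; continue; }
    let value = c; // single-char token (punctuators like . # > [ ])
    if (/[\w-]/.test(c)) {
      // identifier-ish run: letters, digits, underscores, minus
      let j = i + 1;
      while (j < input.length && /[\w-]/.test(input[j])) j++;
      value = input.slice(i, j);
    }
    tokens.push({ value, wsBefore: sawWs });
    sawWs = false;
    i += value.length;
  }
  return tokens;
}

// In `div .x` the `.` token carries wsBefore: true, signaling a possible
// descendant combinator; in `div.x` it would not.
lex('div .x');
```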

These "atoms" can be a limited handful of things, all selectors on their own, and can all be combined together in any order to form new selectors.

A regular identifier (div) signifies an element by its tagname. A dot prefixed identifier (.foo) means a tag with the identifier in its class list. A hash prefixed identifier (#foo) means a tag with an id by that exact name.

Then there's the arbitrary attribute selector ([foo] or [foo="bar"]) with slightly more complex rules, because it may or may not have a value, and this value may or may not be quoted. The operator need not be just = but can be one of the more exotic ones like ~=. The id selector can be expressed as an attribute selector like [id="myid"], and the class selector can usually be expressed with the more exotic operators (for example, [class~="foo"] matches like .foo).

The last one is a pseudo selector, which is an identifier with one (:hover) or two colons (::first-line) prefixed (yawn at the difference) and optionally this is "called" by appending a parens with some arguments (:nth-child(2n+1)). The arguments of nth-child actually pose the biggest nuisance here as the An+B syntactic rules are a bit iffy in the edge case department. More on that later. Pseudos are more like meta selectors or state. They select based on the position of an element in the page (fifth child), the current state (like hovering or having visited the link), etc. It's irrelevant for the parser what the pseudo does though, there's a single syntax to parse them all.
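Since An+B came up, here's a rough sketch of normalizing those arguments into an {a, b} pair. This is a simplified illustration and deliberately looser than the spec (it strips all whitespace up front, while the real microsyntax only allows whitespace around the sign); the function name is made up for this post:

```javascript
// Normalize an An+B argument (as in :nth-child(2n+1)) to { a, b }.
// Handles the keyword forms, the An+B form, and the bare-integer form.
function parseAnB(text) {
  const s = text.replace(/\s+/g, '').toLowerCase();
  if (s === 'odd') return { a: 2, b: 1 };
  if (s === 'even') return { a: 2, b: 0 };
  // An+B: optional sign/coefficient, 'n', optional signed offset
  let m = /^([+-]?\d*)n([+-]\d+)?$/.exec(s);
  if (m) {
    const a = m[1] === '' || m[1] === '+' ? 1
            : m[1] === '-' ? -1
            : parseInt(m[1], 10);
    return { a, b: m[2] ? parseInt(m[2], 10) : 0 };
  }
  // bare integer: B only
  if (/^[+-]?\d+$/.test(s)) return { a: 0, b: parseInt(s, 10) };
  return null; // parse error
}

parseAnB('2n+1'); // -> { a: 2, b: 1 }
parseAnB('-n+6'); // -> { a: -1, b: 6 }
```

The edge cases the post alludes to live mostly in the bare `n`, `-n`, and `+n` forms, where there is a coefficient of 1 or -1 with no digits to parse.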

There's also a special "star" selector (*) which matches any single element.

So while there are a few ways to select an element, at the same time, that's it for them. Tag, class, id, attribute, and pseudo. There are also namespaces, but they were out of scope for this project, and who uses them anyway? No seriously, could you quickly, off the top of your head, tell me how to use namespaces in CSS? I didn't think so.

Body


The body is simpler in concept, although it can be hairy in certain values. In principle it's a list of declarations wrapped in a pair of curly brackets.

Declaration


This is how you "assign" values to (mostly) style properties of the elements you've selected. They always have the form name: value, with an optional semi-colon for the last declaration. The value can be the tricky one here. It can be a simple keyword like the name of a color, or a simple number. It can be a value with a unit like px, rem, or %. But a CSS lexer doesn't really care and just consumes identifiers (% is a special case here) and lets the parser deal with it. The parser just sees two consecutive identifiers, or a number with an identifier, and assumes it's fine. There's more complex stuff like shorthands or even gradients and animations. But mostly you process them generically and it's all handled quite simply.
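A naive version of the declaration split can be sketched like this. It's illustrative only (real CSS values can contain semi-colons inside strings, which a character-level split like this gets wrong; gssparser works on tokens):

```javascript
// Split a rule body into { name, value } declarations. The trailing
// semi-colon is optional, which falls out of filtering empty chunks.
function parseDeclarations(body) {
  return body
    .split(';')
    .map(d => d.trim())
    .filter(d => d.length > 0)
    .map(d => {
      const colon = d.indexOf(':');
      return {
        name: d.slice(0, colon).trim(),
        value: d.slice(colon + 1).trim(), // kept raw; parsed generically later
      };
    });
}

parseDeclarations('color: red; width: 100px');
// -> [ { name: 'color', value: 'red' }, { name: 'width', value: '100px' } ]
```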

GSS


So when I finished the CSS parser side of things I figured GSS would be a shoo-in. But nope. The problem with GSS from a parser perspective is that there are basically two modes to parse in: a "selector mode" and something I've called the "calc mode". In calc mode the parser parses a value; something that can be evaluated to a number or a variable. In selector mode we are just doing the same as a CSS parser. The problem is that the change of mode is signaled by something called the "accessor", which looks something like [this]: an identifier wrapped in square brackets. In my head I interpret that as accessing said property of a selector. The accessor scopes to all the atoms of a single selector without combinators. That means .a.foo[x] is fine but .a .foo[x] is not, since .a on its own cannot evaluate to a numeric value (because it's an element). The first means the value of x of any element that matches .a.foo.

Of course the accessor only appears at the end of a selector and may not appear at all if the whole value is just a giant calc expression. Consider div[x] vs div, both are valid because the second simply interprets div as a variable.
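The idea can be illustrated with a string-level sketch. This is made up for this post (the name splitAccessor and the regex approach are not gssparser's; the real parser does this on a token stream, right to left):

```javascript
// Split one selector chunk into its selector part and its accessed
// property. A trailing [ident] is the accessor; a bare identifier is
// just a variable (no selector at all).
function splitAccessor(chunk) {
  const m = /^(.*)\[([a-zA-Z-][\w-]*)\]$/.exec(chunk.trim());
  if (m) return { selector: m[1], property: m[2] };
  return { selector: null, property: chunk.trim() }; // plain variable
}

splitAccessor('.a.foo[x]'); // -> { selector: '.a.foo', property: 'x' }
splitAccessor('div');       // -> { selector: null, property: 'div' }
```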

To make matters worse, this mode may change at (pseudo-)arbitrary points in a GSS rule. It's not exactly ambiguous but craaaap. I saw no other way than to parse GSS code from right to left. And that really messes with your brain at first, let me tell you. On top of that I basically had to duplicate the selector parsing code, backwards. It's tempting to think this part can be re-used, but since the mode can switch even half-way while parsing a selector, you may as well just dupe the selector parsing algo. Sigh.

So parsing from right to left was a new trick for me. Ultimately though it's very similar to parsing left to right. The same kind of mindset works and it works fine once you get used to it. Of course it's less efficient because you first have to scan forward to the end of a rule. In CSS this means parsing forward to a semi-colon or curly bracket open or EOF (or error). THEN we apply a small heuristic to detect whether we're parsing CSS or GSS and we continue parsing with one or the other.
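The forward scan that precedes the right-to-left pass can be sketched like this (a character-level toy; the real parser scans a token stream and also handles errors):

```javascript
// Scan forward from `start` to the end of the current rule: the first
// semi-colon or opening curly, or EOF. The right-to-left GSS pass then
// starts from the returned index and walks backwards.
function findRuleEnd(input, start) {
  for (let i = start; i < input.length; i++) {
    if (input[i] === ';' || input[i] === '{') return i;
  }
  return input.length; // EOF also terminates a rule
}

findRuleEnd('#a[x] == 100; #b[y] == 2', 0); // -> 12 (the first semi-colon)
```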

Colon ambiguity


One problem is determining whether you must now start parsing CSS or GSS code. This is slightly worsened by the fact that GSS can occur as a rule and as a declaration, and that in this parser the CSS declarations were also required to work at the top level, so without a rule.

The biggest problem here is colon ambiguity. Well, it's not actually ambiguous, but it's not that trivial to figure out the CSS/GSS path by just looking at the second token. The reason is pseudos. Consider foo:hover {} vs foo:hover[x] == y. Remember to think of this generically. Add to this that this may occur inside a body or without, and you've got scanner problems. Well, it's mostly a performance hit at this point. Unfortunate, but still not a major problem. Just for perfection :)
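One way to sketch that disambiguation, character-level and illustrative only (not gssparser's actual heuristic): scan ahead until something decides the matter, a body opener for CSS or a constraint operator for GSS.

```javascript
// Toy heuristic: `{` before the rule ends means a CSS rule; a two-char
// constraint operator (==, <=, >=) means GSS; a bare `;` first means a
// plain CSS declaration.
function guessMode(rule) {
  for (let i = 0; i < rule.length; i++) {
    const c = rule[i];
    if (c === '{') return 'css';
    if ((c === '=' || c === '<' || c === '>') && rule[i + 1] === '=') return 'gss';
    if (c === ';') break;
  }
  return 'css';
}

guessMode('foo:hover {}');      // -> 'css'
guessMode('foo:hover[x] == y'); // -> 'gss'
```

Note the two-character check: a lone `>` stays ambiguous with the child combinator (`foo > bar {}`), which is exactly why peeking at the second token alone is not enough.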

Spaces redux


One other catch in parsing right-to-left was the space combinators. The lexer only indicated whether the current token was preceded by whitespace. But when parsing from right to left you need to know whether the current token has spaces following it. It's not a problem per se, but something that I would have tackled better architecturally if I had known it ahead of time.

Overall


GSS is actually not that hard to parse, once you get through the aforementioned issues. There were some parse tree format requests and you have to take the mode switching into account, but the code itself is fairly similar to regular CSS. Sure, there's something called "splats" which needs complex parsing algorithms. And dealing with numbers and calculation expressions can be a bit of a pain, especially considering that identifiers in CSS may start with a minus. So is a+b-c a calculation with two variables, three, or an error? I'll let you figure that out... Ok no, the answer is 2! Because b-c is one identifier in CSS. But that's not very intuitive.
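The a+b-c puzzle falls out of the tokenizer directly. A toy version (illustrative, not gssparser's lexer, and simplified: it ignores numbers and idents that start with a minus):

```javascript
// CSS-style identifiers may contain `-`, so only `+` splits operands here.
// The minus gets swallowed into the identifier, yielding two operands.
function tokenizeCalc(expr) {
  return expr.match(/[a-zA-Z][\w-]*|[+]/g);
}

tokenizeCalc('a+b-c'); // -> ['a', '+', 'b-c']: two variables, not three
```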

Of course there was still some work involved, but so far nothing that couldn't be handled.

Error handling


CSS (and therefore GSS) is very forgiving while parsing. VERY forgiving. Whenever something can't be parsed, the parser is to consume input until certain markers and then proceed in whatever context the group was. This basically comes down to regular rules being parsed until the end of the body that must follow (or EOF). After the body the parser simply continues and ignores the botched rule altogether. Same for a declaration; if it's broken, parse until the first next semi-colon, discard it, and continue parsing a new declaration.
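The declaration-level recovery can be sketched in a few lines (character-level and illustrative; the real thing recovers on the token stream and also handles the rule-level case):

```javascript
// On a broken declaration, skip to the next semi-colon (or the end of the
// body) and return the position where normal parsing resumes. Everything
// skipped is simply discarded.
function recoverDeclaration(input, errorPos) {
  let i = errorPos;
  while (i < input.length && input[i] !== ';' && input[i] !== '}') i++;
  if (input[i] === ';') i++; // consume the semi-colon, resume after it
  return i;
}

// 'a: !!;' is botched at the `!`; recovery lands right after the `;`
recoverDeclaration('a: !!; b: c;', 3); // -> 6
```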

It's actually quite simple to implement this but you do have to make sure any weird unexpected input is properly handled. I've built a fuzzer to take away my biggest pains. Man I love seeing fuzzers at work :)

Conclusion


This was a fun project to do. I learned a bunch about CSS, little things I didn't know about before like how An+B works in syntax or how @-rules are supposed to be parsed. It was fun doing an all-out parser for CSS rather than JS for a change.

Parsing CSS is fairly easy, much easier than JS at least. GSS is slightly more tricky, and having to do it backwards was a new one for me. But it turns out backwards parsing is not that much different from forward parsing and only adds a thin layer of cognitive overhead to the mix for me.

I've built the parser from absolute scratch (clean room!) over the course of about three or four months (though not full time on this). Not sure about the actual hours.

Of course the GSS parser is never finished and I could work on it for like ever and ever. Just as with ZeParser. But you've got to draw a line somewhere. Unless you get paid for it of course :)

Resources


The parser can be found on GitHub: https://github.com/qfox/gssparser. To use it, run node bin/build.js, which will put the result in a build dir. Then include that in a project (with or without minification). Then something like this should work:

Code:
var tree = window.parsers.css('a { b: c; }');

// ->
["rule", ["tag", "a"], [["set", "b", ["get", "c"]]]]

See the README for more details on this.