Simple minify rules

2011-03-12

Minification in js is a craft of its own, and it comes in many levels. Today I will discuss only the simplest one: removing excessive whitespace, comments and (if allowed) newlines.

The premise is that you have a complete token stream. The stream should include comments, line terminators and whitespace, even though the specification tells a tokenizer to strip them. If yours does not include them, we'll assume asi (automatic semicolon insertion) has been properly applied where required. In my stream there are also "asi" tokens, which mark the places where a semicolon should be inserted automatically. This is important because with them we can safely remove the newlines.

So let's say we do have a token stream like that and we want to minify it. When creating the output we simply skip all whitespace tokens (tabs, spaces, that sort of thing). They are never significant by themselves; they only keep neighbouring tokens apart where the parser would otherwise be confused, and the rules below take care of that. The same goes for single line comments: never significant.

We replace a multi line comment by a newline if it contains at least one newline, and drop it otherwise. You have to check, because multi line comments can induce asi, but only if they actually contain at least one newline. After this step the token stream should no longer contain any comments or whitespace. We leave the newlines as they are for now. My token stream also has asi tokens, so I can safely remove the newline tokens as well.
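As a rough sketch of this first pass: I'm assuming here that every token is a plain object with a `type` and a `value`. The type names (`whitespace`, `lineComment`, `blockComment`, `lineTerminator`) and the function name `stripTrivia` are made up for illustration and don't come from any particular tokenizer.

```javascript
// Hypothetical token shape: { type: string, value: string }.
// All type names below are assumptions for this sketch.
function stripTrivia(tokens) {
  const out = [];
  for (const token of tokens) {
    if (token.type === 'whitespace' || token.type === 'lineComment') {
      continue; // never significant, so just drop it
    }
    if (token.type === 'blockComment') {
      // A multi line comment only matters for asi if it spans lines.
      if (/[\n\r\u2028\u2029]/.test(token.value)) {
        out.push({ type: 'lineTerminator', value: '\n' });
      }
      continue;
    }
    out.push(token);
  }
  return out;
}
```

After this pass only "real" tokens and line terminators should be left, which is the state the rules below start from.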

Now comes the tricky part: concatenating the tokens such that the result is semantically still the same source.

There are actually just five rules for this step. I originally listed three, because I had forgotten about rule 4 (but it is important!) and only discovered rule 5 later.

1: if two "identifiers" follow each other, they must be separated by whitespace

Even though in js you may never put two variables after each other with just whitespace between them, you should note that keywords (including word operators like typeof, void, delete, in and instanceof) are also identifiers here. So void and delete can be followed by a variable name, and instanceof can sit between two of them. That's why this rule is important.

2: if an "identifier" is followed by a "numeric literal", they must be separated by whitespace

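Since digits can be part of variable names, an identifier (which must be a keyword in this case) has to be split from a number that follows it. The other way around may look safe, since variable names may not start with a digit, and lenient engines accept something like (5in x). The specification does not, however: the character immediately following a numeric literal must not be the start of an identifier or a digit, and it names 3in as an example of an error. So to be safe, keep the whitespace in that direction as well.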
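Rules 1 and 2 boil down to one small check on two adjacent tokens. The sketch below assumes tokens are objects with a `type` and a `value`, where keywords and variable names both have the made-up type `identifier`; the function names are hypothetical too. It also keeps a space after a number that is followed by a word. That is extra caution on my part and not one of the rules above: the specification does not allow an identifier to start directly after a numeric literal.

```javascript
// Hypothetical token shape: { type: string, value: string }.
// 'identifier' covers keywords and variable names alike (an assumption).
function isWordLike(token) {
  return token.type === 'identifier' || token.type === 'number';
}

function needsSpace(prev, next) {
  // Rule 1 (word followed by word) and rule 2 (word followed by number).
  if (prev.type === 'identifier') return isWordLike(next);
  // Extra caution, not one of the rules: number followed by word.
  return prev.type === 'number' && next.type === 'identifier';
}
```

So `typeof x` and `in 5` keep their space, while `return[x]` can be glued together.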

3: if an identifier is one of the four flow control keywords (return, throw, continue, break) and is followed by a line terminator, replace the line terminator by a semicolon

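Newlines are only significant for determining asi. After these keywords a newline ends the statement, so a return followed by a newline and then x means "return; x" and not "return x". (For throw this should never come up in valid source, since a newline directly after throw is a syntax error.)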
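A minimal sketch of rule 3, assuming whitespace and comments have already been removed so that the keyword and the newline are direct neighbours. The token shape and the names `RESTRICTED` and `replaceNewlineAfterRestricted` are hypothetical.

```javascript
// Hypothetical token shape: { type: string, value: string }.
const RESTRICTED = new Set(['return', 'throw', 'continue', 'break']);

function replaceNewlineAfterRestricted(tokens) {
  return tokens.map((token, i) => {
    const prev = tokens[i - 1];
    const afterKeyword =
      prev !== undefined &&
      prev.type === 'identifier' &&
      RESTRICTED.has(prev.value);
    if (token.type === 'lineTerminator' && afterKeyword) {
      // The newline would have triggered asi, so make that explicit.
      return { type: 'punctuator', value: ';' };
    }
    return token;
  });
}
```

Note that this sketch does not try to detect a keyword that is used as a property name (like a.return), which a real implementation would have to rule out first.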

4: if a number is followed by a dot for property access (so not a trailing decimal point!), add whitespace between the number and the dot if there is no dot in the number.

If the number does not contain a dot (and has no trailing exponent part), the parser will eat the dot as the decimal point of the number. Since it wasn't, the minified script would break unless we keep at least some whitespace between the two tokens. For example, 5 .toString() is valid, but 5.toString() is a syntax error. This is a very obscure issue btw :)
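Something like the following could implement rule 4. It assumes tokens are objects with a `type` and a `value` holding the literal source text; the function name is made up. The check is deliberately conservative: only a number written as plain decimal digits gets the extra space.

```javascript
// Hypothetical token shape: { type: string, value: string }.
function needsSpaceBeforeDot(prev, next) {
  if (prev.type !== 'number' || next.value !== '.') return false;
  // Only plain digits are ambiguous: "5.x" would be read as the number
  // "5." followed by x. Numbers like 5.0 or 5e3 cannot absorb another dot.
  return /^\d+$/.test(prev.value);
}
```

With this, 5 .toString() keeps its space, while 5.0.toString() and 5e3.toString() are concatenated as usual.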

5: if two tokens that follow each other both contain only pluses or both contain only minuses, don't concatenate them; keep whitespace between them.

Otherwise a + + b would become a++b, which is a syntax error, and a + ++ b would become a+++b, which is parsed as a++ + b, so the variable being incremented switches from b to a.
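Rule 5 can be sketched like this, again assuming tokens with a `type` and a `value`; the type name `punctuator` and both function names are hypothetical.

```javascript
// Hypothetical token shape: { type: string, value: string }.
// True if the token is a punctuator made up of only the given character.
function isRunOf(ch, token) {
  return (
    token.type === 'punctuator' &&
    token.value.length > 0 &&
    token.value.split('').every(c => c === ch)
  );
}

function needsSpaceBetweenSigns(prev, next) {
  return (
    (isRunOf('+', prev) && isRunOf('+', next)) ||
    (isRunOf('-', prev) && isRunOf('-', next))
  );
}
```

So + followed by ++ (or - followed by --) stays separated, but + followed by - can simply be glued together.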


In my case I can skip rule three, but I have to make sure to replace all asi tokens by actual semicolons.
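For a stream that does carry asi tokens, that step might look like the sketch below. The token shape and the type names `asi` and `lineTerminator` are assumptions about my own stream; yours may well call them something else.

```javascript
// Hypothetical token shape: { type: string, value: string }.
function resolveAsi(tokens) {
  const out = [];
  for (const token of tokens) {
    if (token.type === 'lineTerminator') {
      continue; // safe to drop only because the asi tokens are there
    }
    if (token.type === 'asi') {
      out.push({ type: 'punctuator', value: ';' });
      continue;
    }
    out.push(token);
  }
  return out;
}
```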

Rule four can be specified even further: if the number has a trailing exponent part, no whitespace is necessary. The rule is only there because ambiguity would otherwise be introduced.

That's it. Now you have a very simple minification step. Of course, things can get much more interesting, but we'll leave that for another blog post ;)

What's counterintuitive is that some unexpected cases arise from these rules. Consider these valid snippets:

Code: (js)
function f(){
return[x];
return"y";
return{z};
}

log(5in y); // the specification treats 5in as an error, and engines that enforce this reject it