Building ze parser

2014-07-28

The source code of ZeParser2 was pretty clean and straightforward. Lately I've been adding some test-related code to it, though, and I've also been fine-tuning the perf of the parser, which includes changing the prototypal code to a big closure approach. Everything counts when you deal with cycles. So how does the build work right now? Let me explain...

Currently, the build file (bin/build.js) does a few things:
- concat the source files into a single file
- eliminate comments
- handle macros that eliminate dev-only code
- replace usages of constants with their literal value and eliminate the constant declarations (a sort of DSL)
- transform the prototype structure into function closures
- add the build to a special Gonzales config file for benchmarking
- construct a streaming version from the regular build
- clean up some trivial things

Concat


Combining the three source files should be trivial for anyone reading this post. You read the files, concat their contents, and write the result to a new file. In my case I don't even have to be defensive; it's my project and I know I don't have to worry about ASI, trailing semicolons, and all that stuff.

Comments


Dropping the comments is done partly to clean up the source a bit while debugging, and partly for trivial (though silly) perf gains, short of minifying the whole thing. Obviously comments are dropped after the build macros have been processed.
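A naive sketch of comment removal. Blind regexes like these would mangle comment-lookalikes inside string and regex literals, so a real pass has to be token aware; this is illustration only:

```javascript
// Strip /* block */ and // line comments from a source string.
// WARNING: regex-based, so it breaks on strings containing "//" etc.
function stripComments(source) {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, '') // block comments, non-greedy
    .replace(/\/\/[^\n]*/g, '');      // line comments up to the newline
}
```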

Macros


There are currently two macros in my build step, and I'm planning two more.

The most common macro is a single line comment that removes the entire line it appears on, no matter what. This helps eliminate unit tests and loop guards; their only purpose is sanity checking and infinite loop protection during development.
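In code, that macro boils down to a line filter. The marker comment below is made up for illustration; the actual marker in the ZeParser2 sources differs:

```javascript
// Drop every line that carries the (hypothetical) build marker.
function stripMarkedLines(source, marker) {
  marker = marker || '// @build-drop';
  return source
    .split('\n')
    .filter(function(line){ return line.indexOf(marker) === -1; })
    .join('\n');
}
```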

The other macro eliminates a .call(), although I don't seem to use that anymore. Since the dev code is prototypal, the context is sometimes important. It seems I've eliminated that need recently, since I don't see a usage of it at this moment. Oh well :)

One macro I have planned will eliminate streaming-specific code from the regular build. There are currently many "if this is EOF we need to ask for more input" checks in place. Most of them can be removed or simplified. There will also be some offset checks for the streamer which allow discarding consumed input. These add some overhead (namely, subtracting the offset from all the other positions) which is not needed in the normal build, since the offset always remains zero there.

The other future macro is simply meant to eliminate the tests from the build. Or rather, by making it a special macro, I can create a build with the guards still in place. This helps with debugging the build if it turns out to contain an error, like an infinite loop, that doesn't occur in the dev version.

Constants


I have many constants in the source files. For example, I use numbers for all the letters, digits, and other characters while parsing; these are stored as hex numbers in constants. Most if not all of the boolean arguments are constants as well, and then there are some type constants. In the build I have no need for these constants. They require a scope-crossing lookup for no other reason than legibility during development, so in the build I strip them. This is done in a DSL kind of way.

I parse the source code and look for any identifier that is all upper case. I confirm that the identifier is not a property by checking that the preceding non-whitespace token is not a dot, and I confirm that the assigned value is a primitive. If all that passes, every variable by that name is replaced with the primitive value, which will always be a single token. The original constant declarations are removed in a second pass since they no longer serve a purpose.
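A simplified sketch of those two passes. The real build works on parsed tokens (which is how it can check the preceding token for a dot); this regex version skips that check and only illustrates the collect/remove/replace flow:

```javascript
// Inline all-uppercase constants with primitive values, then drop
// their declarations. Regex-based simplification of a token pass.
function inlineConstants(source) {
  var decl = /var\s+([A-Z][A-Z0-9_]*)\s*=\s*(0x[0-9a-fA-F]+|\d+|true|false)\s*;\n?/g;
  var constants = {};
  var m;
  // Pass over declarations: collect name -> single primitive token.
  while ((m = decl.exec(source))) constants[m[1]] = m[2];
  // Second pass: remove the declarations, then replace each usage.
  var out = source.replace(decl, '');
  Object.keys(constants).forEach(function(name){
    out = out.replace(new RegExp('\\b' + name + '\\b', 'g'), constants[name]);
  });
  return out;
}
```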

Afterwards, the code will be littered with magic values, but who cares :)

Proto to closure


One of the two biggest architectural changes the build script can make is turning the prototypal structure of the code into a closure. Perf-wise, the closures turn out to be faster than the prototype. I could guess and theorize about why, but I won't; I don't have the numbers either, and it doesn't matter too much to me. I only care about which is faster, not why or by how much. I want the prototypal structure in dev because I find it a much cleaner structure to work with, but for the build only perf matters and structure becomes irrelevant.

The transformation is actually quite simple. For each file it looks for the prototype (uni.js won't have one) and gathers all the properties and functions from it. This is trivial since they are defined in an object literal and I can simply grab all the property names. (It'd be a bit harder to do generically, though you can get a long way if you wanted to.) Then each member, whether it's a function or a non-function value, is renamed to this_par_name or this_tok_name, depending on the file. The end result is a file littered with var this_par_tok = null; and function this_par_run(){...} kinds of code.
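The renaming half of this could be sketched like so. The real transform works on the parse tree and also converts the prototype members themselves into var/function declarations; this string version only shows the `this.name` to `this_tok_name` rewrite:

```javascript
// Rewrite `this.<name>` references to flat `<prefix>_<name>`
// variables, for the names gathered from the prototype literal.
function rewriteThisRefs(source, names, prefix) {
  names.forEach(function(name){
    source = source.replace(
      new RegExp('this\\.' + name + '\\b', 'g'),
      prefix + '_' + name
    );
  });
  return source;
}
```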

The actual prototype object and assignment is obviously eliminated.

What's left is the context of the original code, most notably invoking the constructor with new. Though I suppose by now I've moved away from that and just expose Par.parse. It kind of makes more sense this way, and there's no real point in returning an instance of Par anyway. Whatever you need should be returned in an object by a certain API; the rest should not be important. I'll admit this API choice is slightly influenced by this transformation, but who cares.

An important side effect of this transformation is that since there's no more instance, all variables are suddenly static. This means any instances of the parser actually share the same variable containers. This wasn't a problem before, since the parser ran sync and all relevant results were copied out before the next instance could start and clobber the variables. However, with the streaming parser this could actually become a problem, since it can yield at arbitrary points in time. Two concurrent streamers would share the same variables and things would go awry. I haven't fixed this yet since it's not likely to be a real problem in practice. The fix is kind of trivial, but I'm worried about the perf hit it brings.

Right now I expose a few debug methods to test the streamer build. These are mainly getters and setters that allow me to handle the state of the special frozen variable, get the secret token count, and expose an internal function for the test runner. I have to assign them to this['foo'] to prevent them from being rewritten to this_tok_foo, which would put them out of reach of the test runner. This is the stuff that the new macro for tests will remove from the build, as it's obviously not meant for prod.

There's still some clunky contextual stuff in there right now. I think I can eliminate it, since I've moved away from new Par and am now just returning a fresh object. There's still one place where I do par.run.call(), which I think I can eliminate too. I have to keep some of it though, because I also want to be able to run the dev code without any transformations.

Note: the transformation of prototype to closure is pretty cool. It gains some perf, though not as much as I thought at first. It's worth it for me, but it also became obvious to me that there aren't many projects that could argue closures are worth it for them in terms of perf. Heed this as a warning: you will not notice the perf difference between prototypes and closures unless you're parsing megabytes of data. I don't care what jsperf tells you.

Gonzales


I built Gonzales a while back to benchmark the parser, and released it to compare zeparser to other parsers out there. Since one of the goals of zeparser was to be the fastest parser, that only makes sense.

I still use Gonzales a lot to check for perf regressions. I created a special configuration mode where I can add different versions of zeparser, build or regular, and give them a name. I can then check them in Gonzales and quickly see whether some minor change actually regresses compared to another version. While the benchmark is very noisy and fluctuating, I believe I've reached a point where I can properly recognize a regression. It has helped me improve the perf of the parser by about 30%, though these numbers don't really mean anything and I can't clarify them any further :) So let's just say I've significantly improved the perf of the parser with Gonzales as my benchmarking tool.

The build script allows you to give the build a name (bin/build.js foo). It will create a new dir in the build directory, preceded by a timestamp and followed by the name. The build files and original files are put in that dir, so I can easily revert later if need be. It will also add this name to the special parser config of Gonzales. The build script supports flags to limit the number of builds included in the config (defaulting to 10) and to set how many of the last versions are enabled in Gonzales by default (defaulting to 5). This way you can simply check the last twenty versions, or maybe just compare the last two.
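Deriving the dir name is the easy part; something like the sketch below, though the exact timestamp format bin/build.js uses is an assumption on my part:

```javascript
// Build a "timestamp-name" directory name for a named build.
// The ISO-based stamp format here is illustrative, not canonical.
function buildDirName(name, now) {
  now = now || new Date();
  var stamp = now.toISOString().replace(/[:.]/g, '-');
  return stamp + '-' + name;
}
```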

I also modified this special Gonzales page to contain buttons that open external diff sites to compare the build files of selected versions. This allows me to compare the scripts and see which changes affected perf in which way, since at some point you just lose track. Oh man, tooling is great :)

Streaming


A normal parser processes everything in sync: it receives the entire input to parse at once, and by the time it returns it will have processed the entire input or bailed. A streaming parser does the same but is capable of yielding at any point in the parsing process and waiting for more input.

I will discuss the streaming parser in greater detail in a future post. For now it suffices to say that the build script can transform the regular build into a streaming parser automatically. It first has to eliminate the logic operators (&& and ||) from the code because of the second step: transforming the build into a kludge of switches and loops, turning pretty much every function into a generator of sorts. For this we need to eliminate the logic operators because we must be able to slice out function calls at arbitrary points in the code. Ok, so I've solved all that and will explain it later :)
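This is not the actual transform, but a tiny demonstration of why eliminating the operators is safe: a short-circuit `a && f()` is equivalent to the statement form below, which leaves the call to `f()` on its own line so it can later be sliced out into its own switch case:

```javascript
// The original short-circuit form.
function withShortCircuit(a, f) {
  return a && f();
}

// The logic-free form the elimination step produces (conceptually).
function withoutShortCircuit(a, f) {
  var pass = a;
  if (pass) pass = f(); // the call now sits on its own statement
  return pass;
}
```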

There actually isn't much more to say about this. The streaming option will generate a build/zp-nologic.js and a build/zps.js file. The first is for me to debug, in case I need to. The second is, well, the streaming parser.

Cleanup


There's some minor cleanup left: stuff like reducing multiple newlines, removing multiple spaces, and whatever. Just trivial stuff to make the build slightly cleaner.

Conclusion


So there you have it. That's the build process I've got for ZeParser2, producing zp.js and zps.js.