pvdz.ee - Introducing Drew

Last year I was refactoring some code for a client, implementing a style guide and such, when I realized I was doing the same thing over and over again. Now don't get me wrong; refactoring can be a tedious job but it's not completely mindless. To restructure a large code base without tripping up on any of the clustermi^H^H^H^H^Htests can be quite difficult. Regardless, I noticed patterns and figured I could use my JS parser and write some scripts to do some trivial style guide presets rewrites for me. Stuff like eliminating tabs, comma first, spaces with functions, blabla. The stuff eslint can do as well these days (nice!).

When writing those scripts I noticed I was still doing the same thing over and over again. Search for a comma, go back one token, check for whitespace, go forward two tokens, check for whitespace, go forw... you get the drill. Instead of doing it manually with my keyboard, I was now repetitively writing the same kinds of scripts. And that's when I started to write Drew.

Check out the demo at qfox.github.io/drew, then come back ;)

Drew is a language agnostic query language that can search for patterns in a list of tokens. Concretely that means you need something to process your language, a parser, which cuts your source code up in chunks (or "tokens"). These tokens, at minimum plain objects with a value property, can be fed to Drew and a query can be applied to them.

The Drew query, a DSL, is somewhat similar to regular expressions. Its execution is too, as it's a recursive descent search tool with naive backtracking. It's not really meant to be super fast. I don't even have benchmarks, this time. The point is that you can search through a token based stream of data. When the query matches your callback is called and you can modify this token. In the end you can reconstruct the code with your rewrites. Easy peasy. Ish.

While the DSL borrows many things from regular expressions, it'll only resemble it a little bit. A big reason for this is that you have to be explicit in matching a token. For regular expressions this is implicitly one character and these boundaries are implied. On the other hand Drew is wired to make a distinction between whitespace and non-whitespace tokens. In many languages whitespace is insignificant. But in most languages, there are many weird ways whitespace may occur. In JavaScript for example the comments are also considered whitespace. And there are "Automatically Inserted Semi-colons" (ASI), which have some importance but don't consitute a character to the source. Other languages have other rules.

Matching tokens

In Drew you can define yourself what you think is whitespace (see macros below) and automatically ignore any whitespace tokens with curly brackets {xxx}. If you do want be aware of whitespace tokens you can use square brackets `[xxx]`. The brackets signify and form the boundary of matching one token. The conditions within a bracket apply to the current token only and you can match on a string literal or with a regular expression. The details can be found in the extensive README but if we can look at a JavaScript example real quick:

Code:

if (foo) bar;

// parser would create tokens like:
// ["if", " ", "(", "foo", ")", " ", "bar", ";"

// match the spaces
[` `]

// match the opening parenthesis of an `if`
{`if`} {`(`}

// match the opening parenthesis of an if without whitespace between
{`if`} [`(`]

The queries translate to source code and allow you to completely manipulate the search. And you're not limited to just compare against the value of the current token. You can also group conditions as you'd expect to and use & and | like you'd expect to. On the top-level, tokens are implicitly & like they are in regular expressions. Inside a token condition you'll have to be explicit about it.

Seeking

You can seek through the input (change the pointer that determines which token is "current" or "active"). You can arbitrarily move to the previous or next token (< >). Even to the previous or next token while ignoring whitespace (<< >>), or move one forward while the next token or group fails (~), or to the end of the line (-->). Combined to seek to the start of the next line (-->>). You can check whether the current token is the first of a line (^), first of the input (^^), and similarly to the end of line/input ($ and $$).

Quantifiers

Just like regular expressions Drew can repeat a token or group match for you. The options are similar to those in regex, having * for "zero or more", + for "one or more", ? for "zero or one", and explicit numbers to control the repetition; a .. b or a ... or ... b, controlling the minimum and maximum number of times the last token must match in order to pass. This looks something like [`foo`]+ or [`foo`]24..34 trying to match a token with value "foo" zero or one times, and between 24 and 34 times, inclusive.

Designators / capturing groups

Drew has a powerful tool to control what your callbacks receive. Regular expressions have capturing groups, in Drew every token or group (parenthesis around tokens) and seeks can be a capturing group. If you put an assignment (=) after the quantifier (or after the token/group if it has no quantifier) you can put names there that control values received by your callback.

The idea is that you can set the start and end of the range of the previous token or group, over the entire quantified range even. And can even choose whether to get the parameters as individual arguments or as a big object (tip: if you only use numbers as names Drew will interpret them as callback parameter indexes).

This looks something like [`foo`]? = 0 (passes the token that matches "foo" as the first parameter of your callback) or ([`a`][`b`]*[`a`]) = offset, last (match any number of "b" between two tokens with value "a", passing an object to your callback with properties "offset" and "last" which contain the first and last token of the entire match).

Macros and constants

You can embed any language because Drew is highly customizable. There are macros and constants which are sort of like aliases and plugins. The macros allow you to name certain token matching conditions so you could do [space] instead of [` `]. The macro would still just do the latter but your query would be easier to read.

Constants are more like plugins; they execute a custom JavaScript script when they are used which must return a boolean to indicate whether the condition passes or not. When you use macros in a query Drew will first check the list of macros for that name, then the constants if there is no such macro.

Requirements for implementing a language

When you want to use Drew in your own language of choice you'll have to define two macros or constants yourself; IS_NEWLINE, which should determine whether a token is considered a newline in your language. And IS_BLACK, which should determine whether the current token is NOT considered whitespace in your language. It doesn't matter whether you define them as macros or constants or mixed, Drew will first check macros and then constants when looking them up.

The newline macro will drive the line conditionals (< > << >>) and line seek (-->). The other one drives the `{xxx}` "seeks". For example, with ZeParser I create a constant for IS_BLACK which contains "current().type !== Par.WHITE". For IS_NEWLINE I create a simple macro that checks whether the current value is equal to the valid newlines in JS.

Comment syntax

The last bit I want to touch on are the comments. I decided to go with : as a "comment opener". This kind of happened as I was trying to describe numbered parameters, like [`foo`]=1 : start of something big [`bar`] = 2 : end of it all. In the end I landed on three different types of comments.

The simple one you just saw and can only consist of letters and whitespace and are terminated as soon as it would otherwise be ambiguous, which you could also force with a semi-colon (parser will actively look for it). The others are more common; the single line comment starts with two colons and runs up to the next newline. The multi line comment starts with three colons and is ended by the first next set of three colons.

Code:

{`if` | `while`}       :: find a token whose entire value is `if` or `while`
  {PAREN_PAIR}=,0      :: skip the parens and pass the closing paren as first param to callback
  (~ [IS_NEWLINE])=2,3 :: skip tokens until the next newline
  {`{`}=,1             :: pass on the open curly bracket

Comments can appear anywhere where you can put a space. I think a colon works quite well as comment opening in this DSL. I also like the simple comment for a quick note. Oh and have I mentioned you can use any number of spaces, tabs, or newlines inside a query. Oh.

Timeframe

The work was kind of split in two parts. I started on it last year, worked on it for quite some time, then got a little side tracked. I also got a little feature creep. I wanted arbitrary callbacks in there but that leads to a lot of "manual" stack pointer labor (you have to track which values your callback receives, and discard them when backtracking). I also got distracted by played.today, a project that I've wrapped up by now, and oh having a kid is making minced meat of my free time as well.

But I recently got back into the project, rewrote it from near scratch, and dropped some of the more exotic features of the project. Unfortunately the tests are kind of in a bad spot. There's a new fuzzer but the old tests are incompatible with the final version of Drew and I haven't bothered updating them, pruning out the old stuff. But as I have more pressing matters on my hands, I don't have the time to properly wrap this up and I really wanted to push this out before this becomes another one of my dead-before-arrival projects.

As far as maintenance goes; look, you're kind of on your own. I'll take tickets and look to fix stuff. I'm fairly confident the code works as intended so I don't expect anything major here. Is it production ready? YMMV. Want a new feature? Well you could always pay me :)

Conclusion

I had a lot of fun writing this. I'll probably use it myself a few times. Just like I still use Zeon and HeatFiler from time to time. But apart from that I have no big plans with this as of yet.

I've built a "simple" web interface for Drew which you can find at qfox.github.io/drew. It won't resize very nicely but it's just a demo.

Besides the web interface you can npm install drew, which comes with a neat package and an example file. As you may have guessed by now, the source code can be found at github.com/qfox/drew/.

Hope you can find a use case for it. It's okay if you don't :)