Applying regex on substring

2011-08-22

While working on Zeon and it's parsing I keep running in wanting to search for a string by regular expressions. There are actually two different wishes here. One is searching with regular expressions and the other is applying a regular expression at a specific point in a string. My challenge to you is; Can you make them any faster?

The first wish is to be able to apply a regular expression in a given substring of a string (and optionally just get back the length that was matched rather than the substring). The substring should apply the start and end of the string with ^ and $. This could be simple arguments to the existing methods. regex.test(r[, from[, to]]), regex.exec(r[, from[, to]]), etc. And a new method regex.lenMatch(r, from). It could in fact be done with RegExp.prototype.test, including backwards compat, but there were quite a few catches.

The simple implementation of this is the substringing... The problem is of course that you usually don't really know how long the substring should be, so it's usually to the end of the string. Even when the regex might only match four characters.

Code: (js)
// apply regex at given position in string
RegExp.prototype.match = function(str, start){
if (typeof start != 'number' || start < 0) start = 0;
if (start) return this.exec(str.substring(0, start));
return this.exec(str);
};
// return length of match at specified pos in string
RegExp.prototype.test2 = function(str, start){
if (typeof start != 'number' || start < 0) start = 0;
var match = null;
if (start) match = this.exec(str.substring(start));
else match = this.exec(str);
if (match) return match[0].length;
return -1;
};

The good news is that this is something that might actually enter the spec, see this strawman. But that's gonna take a long time...

John-David points out that you can use lastIndex pretty safely in non-IE contexts and get this behaviour. The regex must be global though. You can even use it safely in all browsers (inc IE) as long as you explicitly set it yourself... So I'll try some of that soon.

As Mathias points out... there's String.prototype.search that does exactly the following. Ugh, fail :)

The second is regexIndexOf; finding the first position in a string where a given regular expression would match. Of course there would be a regexLastIndexOf equivalent. In this, the start-of-line character (^) would no longer matter.

A native example would be:

Code: (js)
String.prototype.regexIndexOf_1 = function(regexBody, start){
if (typeof regexBody != 'string') regexBody = regexBody.toString().slice(1, - 1);
if (typeof start != 'number' || start < 0) start = 0;
for (var i = start, len = this.length; i < len; ++ i) {
if (new RegExp('^.{' + i + '}' + regexBody, '').test(this)) {
return i;
}
}
return - 1;
};

This method accepts a regular expression or string and loops through the string, constructing a new regular expression at every iteration that skips more characters (that's the .{n} part). Though it works fine, it's dreadfully slow on longer strings (check the jsperf below).

But we can use the regex trick to improve that a little...

Code: (js)
String.prototype.regexIndexOf_3 = function(regexBody, pos){
if (typeof regexBody != 'string') regexBody = regexBody.toString().slice(1, - 1);
var startIgnore = '';
if (typeof pos != 'number' || pos < 0) pos = 0;
else startIgnore = '.{' + pos + '}';
var result = new RegExp('^' + startIgnore + '(.*?)' + regexBody, '').exec(this);
if (result && result[1]) return pos + String(result[1]).length;
return - 1;
};

This does not loop but instead uses a (non-greedy) capturing group. It will count the length of the caught group if any to determine it's position. The problem is that it's effectively doing a substring on the string, even though we just wanted the length. Regardless, this runs much much faster.

Here is the jsperf on short strings (just a few bytes) and on longer strings (190k). I've actually commented the first version out in the longer test because it be too slow to run even once. Also, it turns out that RegExp or new RegExp didn't really matter anything.

Now, if you can improve the perf on this, please let me know :) The lastIndexOf version of the second method is easily created by making the .* part greedy (by removing the question mark that follows it) and adding the start part (.{n}) at the end of it rather than the begin.

PS. There's a catch. It's going to be troublesome to remove the caret (^) from the regular expression bodies. Especially with more complex regular expressions, you'd have to explicitly parse the entire body to safely do so. A browser engine could just ignore the caret (and dollar) for these methods.