Token start stats for JS

2013-11-10

While I'm at it, below are some stats on the first character of each token, gathered from the same 8 MB test script I used for the punctuator stats. They are listed first by occurrence, then by ASCII code.
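
For reference, a tally like this could be collected roughly as follows. This is only a minimal sketch, not the script that produced the numbers below: `tallyTokenStarts` and `byOccurrence` are made-up names, and it assumes some tokenizer has already produced an array of token strings.

```javascript
// Hypothetical helper: count the char code of the first character of each token.
// `tokens` is assumed to be an array of token strings from some tokenizer.
function tallyTokenStarts(tokens) {
  const counts = new Map();
  for (const token of tokens) {
    if (token.length === 0) continue; // an empty token has no first character
    const code = token.charCodeAt(0);
    // Lump everything outside ASCII into one bucket, like the ">127" row.
    const key = code > 127 ? '>127' : code;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Hypothetical helper: list [key, count] pairs, most frequent first.
function byOccurrence(counts) {
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Tiny hand-made token list for `var x = foo.bar();`, whitespace included.
const sample = ['var', ' ', 'x', ' ', '=', ' ', 'foo', '.', 'bar', '(', ')', ';'];
const sampleCounts = tallyTokenStarts(sample);
console.log(byOccurrence(sampleCounts));
```

Sorting the same map by key instead of by count gives the second listing.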

As before, certain parts of these stats are extremely skewed! Even more so here, actually, since build scripts and minification affect naming, which affects identifier tokens. Transform scripts and code guidelines affect single quotes vs double quotes. Same for tabs vs spaces vs minified. No LFs since the test was done on linux (but at least that shouldn't really make a huge difference for the other stats).

As for the "why", I'm looking into various ways of optimizing my parser. I want to see which characters are most important, or seem to be anyway, and start from there, rather than from the token type like I'm doing right now. That means that I will no longer just start by parsing a punctuator; I'll start with space, dot, parens, semicolon, comma, equals sign, and then t. Normally the t would have come after many checks for punctuators. But obviously, it's far more likely to appear than many of the punctuator characters and/or other letters.

So I very obviously need to change my parsing order. The head of these stats clearly shows a different order of importance than the one I'm currently generalizing to. No more token-based parsing priority, but a nextChar filter first. I'll check for space, dot, parens, semi, comma, equal, t, CR, and curlies first, and then do a lowercase alpha check (that's a range check, so it might as well cover the entire ASCII range a-z). I won't even "optimize" that to a case-insensitive alpha check (by subtracting 32), since the first capital on the list is the Z, at a mere 16.7k.
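
To make that order concrete, here is a rough sketch of what such a nextChar filter could look like. The function name `classifyStart` and the labels it returns are invented for illustration; a real parser would jump straight into the matching token parser instead of returning a string.

```javascript
// Hypothetical nextChar filter: look at the char code of the first character
// and decide which kind of token to parse. The checks are ordered by how
// often each character starts a token in the stats of this post.
function classifyStart(c) {
  if (c === 32) return 'space';               // SPC
  if (c === 46) return 'dot';                 // .
  if (c === 40 || c === 41) return 'paren';   // ( )
  if (c === 59) return 'semi';                // ;
  if (c === 44) return 'comma';               // ,
  if (c === 61) return 'equal';               // =
  if (c === 116) return 'identifier';         // t, checked before the range
  if (c === 13) return 'newline';             // CR
  if (c === 123 || c === 125) return 'curly'; // { }
  // One range check covers the rest of lowercase a-z.
  if (c >= 97 && c <= 122) return 'identifier';
  // Everything else (quotes, colon, brackets, digits, capitals, ...) is the tail.
  return 'tail';
}

console.log(classifyStart('t'.charCodeAt(0)));
console.log(classifyStart('Z'.charCodeAt(0)));
```

Note that this only picks a branch. An identifier starting with t still has to be scanned to its end, and a capital like Z simply falls through to whatever the tail does.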

So the checks before the alpha will amount to 1706.9k, which is about 58% of all tokens. The lowercase alpha check will take roughly another 695k (a-z from the table, minus the t that was already counted), totalling 2.4m (or 83%). And we'll consider the rest to be the tail, with some obvious outliers like double quote, colon, and square brackets.
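
The sums can be redone from the table with a few lines. This is just a sketch for checking the arithmetic: the counts are the rounded "k" values copied from the listing in this post, the variable names are made up, and the total of 2900k is itself rounded, so the printed percentages are approximate.

```javascript
// Counts in thousands, copied from the "By occurrence" listing in this post.
// Head of the list: SPC . ( ) ; , = t CR { }
const headCounts = [268.7, 251.4, 197.8, 197.8, 140.2, 134.5, 123.9, 120.1, 107.5, 82.5, 82.5];

// Lowercase a-z, without t (it is already part of the head).
const lowerCounts = {
  a: 64.8, b: 42.5, c: 50.4, d: 35.8, e: 40.8, f: 59.5, g: 25, h: 16.6,
  i: 63.6, j: 6.9, k: 7.7, l: 19.1, m: 22.7, n: 33.7, o: 19.6, p: 25.8,
  q: 4.2, r: 42, s: 37.3, u: 11.6, v: 39, w: 9.5, x: 7.4, y: 6.1, z: 3.8
};

const totalK = 2900; // "Total tokens: 2.9m", a rounded figure

const sum = (values) => values.reduce((acc, n) => acc + n, 0);
// Round to one decimal to hide floating point noise from adding decimals.
const roundK = (n) => Math.round(n * 10) / 10;

const headK = sum(headCounts);
const lowerK = sum(Object.values(lowerCounts));

console.log('head:', roundK(headK) + 'k', (headK / totalK * 100).toFixed(1) + '%');
console.log('lowercase:', roundK(lowerK) + 'k');
console.log('head + lowercase:', roundK(headK + lowerK) + 'k',
  ((headK + lowerK) / totalK * 100).toFixed(1) + '%');
```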

Of course, this is _only_ to decide which token to parse next. Imagine that...

Total tokens: 2.9m

By occurrence:
Code:
32: SPC: 268.7k
46: . : 251.4k
40: ( : 197.8k
41: ) : 197.8k
59: ; : 140.2k
44: , : 134.5k
61: = : 123.9k
116: t : 120.1k
13: CR : 107.5k
123: { : 82.5k
125: } : 82.5k
97: a : 64.8k
105: i : 63.6k
34: " : 61k
102: f : 59.5k
99: c : 50.4k
58: : : 49.6k
91: [ : 43.7k
93: ] : 43.7k
98: b : 42.5k
114: r : 42k
101: e : 40.8k
118: v : 39k
115: s : 37.3k
100: d : 35.8k
110: n : 33.7k
95: _ : 29.9k
112: p : 25.8k
103: g : 25k
43: + : 24.7k
109: m : 22.7k
111: o : 19.6k
48: 0 : 19.5k
108: l : 19.1k
90: Z : 16.7k
104: h : 16.6k
69: E : 15.3k
33: ! : 14k
49: 1 : 14k
68: D : 12.3k
117: u : 11.6k
38: & : 11k
65: A : 10.9k
119: w : 9.5k
70: F : 9.4k
124: | : 9.1k
39: ' : 7.7k
107: k : 7.7k
45: - : 7.5k
120: x : 7.4k
9: TAB: 7.3k
67: C : 6.9k
106: j : 6.9k
63: ? : 6.7k
36: $ : 6.7k
77: M : 6.7k
121: y : 6.1k
83: S : 5.8k
47: / : 5.6k
60: < : 5.5k
66: B : 5.6k
72: H : 4.9k
73: I : 4.6k
50: 2 : 4.6k
79: O : 4.6k
42: * : 4.4k
113: q : 4.2k
71: G : 4k
80: P : 4k
84: T : 4k
82: R : 3.9k
122: z : 3.8k
78: N : 3.6k
89: Y : 3.5k
74: J : 3.4k
76: L : 3.4k
62: > : 3.2k
75: K : 2.7k
88: X : 2.7k
51: 3 : 2.5k
81: Q : 2.5k
85: U : 2.5k
86: V : 2.3k
87: W : 2k
52: 4 : 1.8k
53: 5 : 1.1k
54: 6 : 933
56: 8 : 778
55: 7 : 567
57: 9 : 546
37: % : 313
94: ^ : 130
126: ~ : 56
0: NUL: 0
1: : 0
2: : 0
3: : 0
4: : 0
5: : 0
6: : 0
7: : 0
8: : 0
10: LF : 0 (test done on linux, so no LFs seen)
11: : 0
12: : 0
14: : 0
15: : 0
16: : 0
17: : 0
18: : 0
19: : 0
20: : 0
21: : 0
22: : 0
23: : 0
24: : 0
25: : 0
26: : 0
27: : 0
28: : 0
29: : 0
30: : 0
31: : 0
35: # : 0 (does not exist in JS language)
64: @ : 0 (does not exist in JS language)
92: \ : 0 (just because no _token_ started with a backslash doesn't mean it didn't occur)
96: ` : 0 (does not exist in JS language)
127: : 0
>127: 1


By ascii:
Code:
0: NUL: 0
1: : 0
2: : 0
3: : 0
4: : 0
5: : 0
6: : 0
7: : 0
8: : 0
9: TAB: 7.3k
10: LF : 0 (test done on linux, so no LFs seen)
11: : 0
12: : 0
13: CR : 107.5k
14: : 0
15: : 0
16: : 0
17: : 0
18: : 0
19: : 0
20: : 0
21: : 0
22: : 0
23: : 0
24: : 0
25: : 0
26: : 0
27: : 0
28: : 0
29: : 0
30: : 0
31: : 0
32: SPC: 268.7k
33: ! : 14k
34: " : 61k
35: # : 0
36: $ : 6.7k
37: % : 313
38: & : 11k
39: ' : 7.7k
40: ( : 197.8k
41: ) : 197.8k
42: * : 4.4k
43: + : 24.7k
44: , : 134.5k
45: - : 7.5k
46: . : 251.4k
47: / : 5.6k
48: 0 : 19.5k
49: 1 : 14k
50: 2 : 4.6k
51: 3 : 2.5k
52: 4 : 1.8k
53: 5 : 1.1k
54: 6 : 933
55: 7 : 567
56: 8 : 778
57: 9 : 546
58: : : 49.6k
59: ; : 140.2k
60: < : 5.5k
61: = : 123.9k
62: > : 3.2k
63: ? : 6.7k
64: @ : 0
65: A : 10.9k
66: B : 5.6k
67: C : 6.9k
68: D : 12.3k
69: E : 15.3k
70: F : 9.4k
71: G : 4k
72: H : 4.9k
73: I : 4.6k
74: J : 3.4k
75: K : 2.7k
76: L : 3.4k
77: M : 6.7k
78: N : 3.6k
79: O : 4.6k
80: P : 4k
81: Q : 2.5k
82: R : 3.9k
83: S : 5.8k
84: T : 4k
85: U : 2.5k
86: V : 2.3k
87: W : 2k
88: X : 2.7k
89: Y : 3.5k
90: Z : 16.7k
91: [ : 43.7k
92: \ : 0
93: ] : 43.7k
94: ^ : 130
95: _ : 29.9k
96: ` : 0
97: a : 64.8k
98: b : 42.5k
99: c : 50.4k
100: d : 35.8k
101: e : 40.8k
102: f : 59.5k
103: g : 25k
104: h : 16.6k
105: i : 63.6k
106: j : 6.9k
107: k : 7.7k
108: l : 19.1k
109: m : 22.7k
110: n : 33.7k
111: o : 19.6k
112: p : 25.8k
113: q : 4.2k
114: r : 42k
115: s : 37.3k
116: t : 120.1k
117: u : 11.6k
118: v : 39k
119: w : 9.5k
120: x : 7.4k
121: y : 6.1k
122: z : 3.8k
123: { : 82.5k
124: | : 9.1k
125: } : 82.5k
126: ~ : 56
127: : 0
>127: 1