| 1 |
nigel |
3 |
Some comparisons of PCRE with the original Henry Spencer (1986) regular |
| 2 |
|
|
expression functions were done on a SPARCstation IPC using gcc version 2.6.3 |
| 3 |
|
|
with -O optimization, to give some idea as to how the two libraries compare. |
| 4 |
|
|
This is not a major statistical investigation. |
| 5 |
|
|
|
| 6 |
|
|
|
| 7 |
|
|
Code size |
| 8 |
|
|
--------- |
| 9 |
|
|
|
| 10 |
|
|
The code size of PCRE is a bit over twice the size of the Henry Spencer |
| 11 |
|
|
functions (roughly 33K vs 14K bytes on a SPARCstation with gcc -O). |
| 12 |
|
|
|
| 13 |
|
|
|
| 14 |
|
|
Store size for compiled expressions |
| 15 |
|
|
----------------------------------- |
| 16 |
|
|
|
| 17 |
|
|
For expressions that are compatible with both libraries, PCRE uses less store |
| 18 |
|
|
for the examples tried, except in some cases that involve the use of character |
| 19 |
|
|
classes. Except in the special case of a negated charcter class containing only |
| 20 |
|
|
one character (e.g. [^a]), PCRE uses a 32-byte bit map for each character |
| 21 |
|
|
class, in order to get the maximum matching speed. By contrast the Spencer code |
| 22 |
|
|
uses a strchr() call. |
| 23 |
|
|
|
| 24 |
|
|
The Spencer functions have an overhead of 92 bytes per expression, because |
| 25 |
|
|
there is a table for up to 10 matched substrings held with every compiled |
| 26 |
|
|
expression. In contrast, PCRE's overhead is just 9 bytes, since it requires the |
| 27 |
|
|
caller to supply a vector to receive the offsets of the matched substrings. In |
| 28 |
|
|
the table below, the size without the overhead is shown in brackets. |
| 29 |
|
|
|
| 30 |
|
|
PCRE Spencer Pattern |
| 31 |
|
|
---- ------- ------- |
| 32 |
|
|
|
| 33 |
|
|
18 (09) 109 (17) /^$/ |
| 34 |
|
|
25 (16) 120 (28) /^.*nter/ |
| 35 |
|
|
26 (17) 121 (29) /^12.34/ |
| 36 |
|
|
37 (28) 126 (34) /the quick brown fox/ |
| 37 |
|
|
50 (41) 114 (22) /^[]cde]/ |
| 38 |
|
|
50 (41) 114 (22) /^[^]cde]/ |
| 39 |
|
|
51 (42) 125 (33) /^[.^$|()*+?{,}]+/ |
| 40 |
|
|
52 (43) 126 (34) /^[0-9]+$/ |
| 41 |
|
|
56 (47) 153 (61) /^(abc)(abc)?zz/ |
| 42 |
|
|
57 (48) 133 (41) /^xxx[0-9]+$/ |
| 43 |
|
|
57 (48) 145 (53) /([0-9a-fA-F:]+)$/ |
| 44 |
|
|
62 (53) 171 (79) /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$$/ |
| 45 |
|
|
70 (61) 170 (78) /^(b+|a)(b+|a)?c/ |
| 46 |
|
|
74 (65) 173 (81) /^(ba|b*)(ba|b*)?bc/ |
| 47 |
|
|
99 (90) 235 (143) /^(a(b(c)))(d(e(f)))(h(i(j)))$/ |
| 48 |
|
|
119 (110) 157 (65) /^.+[0-9][0-9][0-9]$/ |
| 49 |
|
|
165 (156) 446 (354) /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ |
| 50 |
|
|
451 (442) 605 (513) /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ |
| 51 |
|
|
|
| 52 |
|
|
|
| 53 |
|
|
Compilation time |
| 54 |
|
|
---------------- |
| 55 |
|
|
|
| 56 |
|
|
Timing was done using the clock() function to time 2000 compilations of each |
| 57 |
|
|
expression and then dividing by twice the number of clocks per second, to get a |
| 58 |
|
|
value in milliseconds. The variation observed over several runs was never more |
| 59 |
|
|
than 0.01: |
| 60 |
|
|
|
| 61 |
|
|
PCRE Spencer Pattern |
| 62 |
|
|
---- ------- ------- |
| 63 |
|
|
|
| 64 |
|
|
0.04 0.07 /^$/ |
| 65 |
|
|
0.06 0.12 /^.*nter/ |
| 66 |
|
|
0.06 0.13 /^12.34/ |
| 67 |
|
|
0.06 0.09 /^[]cde]/ |
| 68 |
|
|
0.07 0.14 /^[0-9]+$/ |
| 69 |
|
|
0.07 0.10 /^[^]cde]/ |
| 70 |
|
|
0.08 0.17 /^xxx[0-9]+$/ |
| 71 |
|
|
0.08 0.14 /the quick brown fox/ |
| 72 |
|
|
0.09 0.14 /^[.^$|()*+?{,}]+/ |
| 73 |
|
|
0.10 0.33 /([0-9a-fA-F:]+)$/ |
| 74 |
|
|
0.12 0.26 /^.+[0-9][0-9][0-9]$/ |
| 75 |
|
|
0.12 0.42 /^(abc)(abc)?zz/ |
| 76 |
|
|
0.14 0.51 /^(b+|a)(b+|a)?c/ |
| 77 |
|
|
0.15 0.53 /^(ba|b*)(ba|b*)?bc/ |
| 78 |
|
|
0.19 0.51 /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ |
| 79 |
|
|
0.34 1.59 /^(a(b(c)))(d(e(f)))(h(i(j)))$/ |
| 80 |
|
|
0.47 1.32 /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ |
| 81 |
|
|
0.66 1.78 /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ |
| 82 |
|
|
|
| 83 |
|
|
|
| 84 |
|
|
Execution time |
| 85 |
|
|
-------------- |
| 86 |
|
|
|
| 87 |
|
|
Execution timing was done in a similar manner. Blank entries in the "pattern" |
| 88 |
|
|
column below indicate the use of the same pattern as before. |
| 89 |
|
|
|
| 90 |
|
|
PCRE Spencer Subject Pattern |
| 91 |
|
|
---- ------- ------- ------- |
| 92 |
|
|
|
| 93 |
|
|
0.03 0.02 <null string> /^$/ |
| 94 |
|
|
0.04 0.04 enter /^.*nter/ |
| 95 |
|
|
0.04 0.04 uponter |
| 96 |
|
|
0.03 0.03 12\r34 /^12.34/ |
| 97 |
|
|
0.03 0.03 0 /^[0-9]+$/ |
| 98 |
|
|
0.04 0.03 100 |
| 99 |
|
|
0.03 0.03 ]thing /^[]cde]/ |
| 100 |
|
|
0.03 0.03 ething |
| 101 |
|
|
0.03 0.03 athing /^[^]cde]/ |
| 102 |
|
|
0.04 0.04 xxx0 /^xxx[0-9]+$/ |
| 103 |
|
|
0.04 0.04 xxx1234 |
| 104 |
|
|
0.04 0.07 .^\$(*+)|{?,?} /^[.^$|()*+?{,}]+/ |
| 105 |
|
|
0.03 0.03 the quick brown fox /the quick brown fox/ |
| 106 |
|
|
0.06 0.08 What do you know about the quick brown fox? |
| 107 |
|
|
0.04 0.07 0abc /([0-9a-fA-F:]+)$/ |
| 108 |
|
|
0.04 0.07 abc |
| 109 |
|
|
0.05 0.13 5f03:12C0::932e |
| 110 |
|
|
0.06 0.07 x123 /^.+[0-9][0-9][0-9]$/ |
| 111 |
|
|
0.06 0.07 123456 |
| 112 |
|
|
0.06 0.09 abczz /^(abc)(abc)?zz/ |
| 113 |
|
|
0.06 0.12 abcabczz /^(abc)(abc)?zz/ |
| 114 |
|
|
/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ |
| 115 |
|
|
0.23 0.28 abc!pqr=apquxz.ixr.zzz.ac.uk |
| 116 |
|
|
0.09 0.15 bc /^(b+|a)(b+|a)?c/ |
| 117 |
|
|
0.09 0.15 bbc |
| 118 |
|
|
0.08 0.15 bbbc |
| 119 |
|
|
0.09 0.15 bac |
| 120 |
|
|
0.09 0.15 bbac |
| 121 |
|
|
0.07 0.14 aac |
| 122 |
|
|
0.09 0.15 abbbbbbbbbbbc |
| 123 |
|
|
0.09 0.15 bbbbbbbbbbbac |
| 124 |
|
|
0.09 0.18 babc /^(ba|b*)(ba|b*)?bc/ |
| 125 |
|
|
0.12 0.24 bbabc |
| 126 |
|
|
0.07 0.15 bababc |
| 127 |
|
|
0.06 0.10 a. /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ |
| 128 |
|
|
0.13 0.34 ab-c.pq-r. |
| 129 |
|
|
0.24 0.58 sxk.zzz.ac.uk. |
| 130 |
|
|
0.12 0.34 x-.y-. |
| 131 |
|
|
0.20 0.38 abcdefhij /^(a(b(c)))(d(e(f)))(h(i(j)))$/ |
| 132 |
|
|
/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ |
| 133 |
|
|
0.18 0.30 From abcd Mon Sep 01 12:33:02 1997 |
| 134 |
|
|
|
| 135 |
|
|
In general, PCRE runs faster than the Spencer function, but remember, this |
| 136 |
|
|
is just for one particular compiler on one set of hardware and operating |
| 137 |
|
|
system. Until comprehensive tests have been run in other environments, the most |
| 138 |
|
|
one can plausibly say is that it is probably no worse on average for the kinds |
| 139 |
|
|
of expression tested here. |
| 140 |
|
|
|
| 141 |
|
|
|
| 142 |
|
|
Speeding up matching |
| 143 |
|
|
-------------------- |
| 144 |
|
|
|
| 145 |
|
|
A character class is much more efficient than a set of bracketed alternatives. |
| 146 |
|
|
Matching /^[abc]{12}/ against "abcabcabcabc" took 0.03 ms, whereas |
| 147 |
|
|
/^(a|b|c){12}/ took 0.33 ms. This is because brackets and alternatives involve |
| 148 |
|
|
recursion. |
| 149 |
|
|
|
| 150 |
|
|
|
| 151 |
|
|
Serious test |
| 152 |
|
|
------------ |
| 153 |
|
|
|
| 154 |
|
|
One of the tests of PCRE is the monster regular expression from "Mastering |
| 155 |
|
|
Regular Expressions" (O'Reilly's "hip owls" book, 1997, ISBN 1-56592-257-3) |
| 156 |
|
|
which recognizes email addresses. There are two versions, unoptimized and |
| 157 |
|
|
optimized. For interest, here are their timings, again on a SPARCstation IPC. |
| 158 |
|
|
The compile times were 55 ms and 94 ms, and the compiled expressions |
| 159 |
|
|
occupied 11010 and 15426 bytes of store, respectively. The following strings |
| 160 |
|
|
were matched in the times shown (unoptimized first): |
| 161 |
|
|
|
| 162 |
|
|
0.34 0.38 user@dom.ain |
| 163 |
|
|
0.38 0.42 <user@dom.ain> |
| 164 |
|
|
0.88 0.60 Alan Other <user@dom.ain> |
| 165 |
|
|
1.87 0.82 "A. Other" <user.1234@dom.ain> (a comment) |
| 166 |
|
|
1.77 1.19 A. Other <user.1234@dom.ain> (a comment) |
| 167 |
|
|
2.21 0.42 "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay |
| 168 |
|
|
|
| 169 |
|
|
The optimization of the expression clearly has a dramatic effect in some cases. |
| 170 |
|
|
|
| 171 |
|
|
Philip Hazel <ph10@cus.cam.ac.uk> |
| 172 |
|
|
October 1997 |