| 1 |
Some comparisons of PCRE with the original Henry Spencer (1986) regular
|
| 2 |
expression functions were done on a SPARCstation IPC using gcc version 2.6.3
|
| 3 |
with -O optimization, to give some idea as to how the two libraries compare.
|
| 4 |
This is not a major statistical investigation.
|
| 5 |
|
| 6 |
|
| 7 |
Code size
|
| 8 |
---------
|
| 9 |
|
| 10 |
The code size of PCRE is a bit over twice the size of the Henry Spencer
|
| 11 |
functions (roughly 33K vs 14K bytes on a SPARCstation with gcc -O).
|
| 12 |
|
| 13 |
|
| 14 |
Store size for compiled expressions
|
| 15 |
-----------------------------------
|
| 16 |
|
| 17 |
For expressions that are compatible with both libraries, PCRE uses less store
|
| 18 |
for the examples tried, except in some cases that involve the use of character
|
| 19 |
classes. Except in the special case of a negated charcter class containing only
|
| 20 |
one character (e.g. [^a]), PCRE uses a 32-byte bit map for each character
|
| 21 |
class, in order to get the maximum matching speed. By contrast the Spencer code
|
| 22 |
uses a strchr() call.
|
| 23 |
|
| 24 |
The Spencer functions have an overhead of 92 bytes per expression, because
|
| 25 |
there is a table for up to 10 matched substrings held with every compiled
|
| 26 |
expression. In contrast, PCRE's overhead is just 9 bytes, since it requires the
|
| 27 |
caller to supply a vector to receive the offsets of the matched substrings. In
|
| 28 |
the table below, the size without the overhead is shown in brackets.
|
| 29 |
|
| 30 |
PCRE Spencer Pattern
|
| 31 |
---- ------- -------
|
| 32 |
|
| 33 |
18 (09) 109 (17) /^$/
|
| 34 |
25 (16) 120 (28) /^.*nter/
|
| 35 |
26 (17) 121 (29) /^12.34/
|
| 36 |
37 (28) 126 (34) /the quick brown fox/
|
| 37 |
50 (41) 114 (22) /^[]cde]/
|
| 38 |
50 (41) 114 (22) /^[^]cde]/
|
| 39 |
51 (42) 125 (33) /^[.^$|()*+?{,}]+/
|
| 40 |
52 (43) 126 (34) /^[0-9]+$/
|
| 41 |
56 (47) 153 (61) /^(abc)(abc)?zz/
|
| 42 |
57 (48) 133 (41) /^xxx[0-9]+$/
|
| 43 |
57 (48) 145 (53) /([0-9a-fA-F:]+)$/
|
| 44 |
62 (53) 171 (79) /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$$/
|
| 45 |
70 (61) 170 (78) /^(b+|a)(b+|a)?c/
|
| 46 |
74 (65) 173 (81) /^(ba|b*)(ba|b*)?bc/
|
| 47 |
99 (90) 235 (143) /^(a(b(c)))(d(e(f)))(h(i(j)))$/
|
| 48 |
119 (110) 157 (65) /^.+[0-9][0-9][0-9]$/
|
| 49 |
165 (156) 446 (354) /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
|
| 50 |
451 (442) 605 (513) /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
|
| 51 |
|
| 52 |
|
| 53 |
Compilation time
|
| 54 |
----------------
|
| 55 |
|
| 56 |
Timing was done using the clock() function to time 2000 compilations of each
|
| 57 |
expression and then dividing by twice the number of clocks per second, to get a
|
| 58 |
value in milliseconds. The variation observed over several runs was never more
|
| 59 |
than 0.01:
|
| 60 |
|
| 61 |
PCRE Spencer Pattern
|
| 62 |
---- ------- -------
|
| 63 |
|
| 64 |
0.04 0.07 /^$/
|
| 65 |
0.06 0.12 /^.*nter/
|
| 66 |
0.06 0.13 /^12.34/
|
| 67 |
0.06 0.09 /^[]cde]/
|
| 68 |
0.07 0.14 /^[0-9]+$/
|
| 69 |
0.07 0.10 /^[^]cde]/
|
| 70 |
0.08 0.17 /^xxx[0-9]+$/
|
| 71 |
0.08 0.14 /the quick brown fox/
|
| 72 |
0.09 0.14 /^[.^$|()*+?{,}]+/
|
| 73 |
0.10 0.33 /([0-9a-fA-F:]+)$/
|
| 74 |
0.12 0.26 /^.+[0-9][0-9][0-9]$/
|
| 75 |
0.12 0.42 /^(abc)(abc)?zz/
|
| 76 |
0.14 0.51 /^(b+|a)(b+|a)?c/
|
| 77 |
0.15 0.53 /^(ba|b*)(ba|b*)?bc/
|
| 78 |
0.19 0.51 /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
|
| 79 |
0.34 1.59 /^(a(b(c)))(d(e(f)))(h(i(j)))$/
|
| 80 |
0.47 1.32 /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
|
| 81 |
0.66 1.78 /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
|
| 82 |
|
| 83 |
|
| 84 |
Execution time
|
| 85 |
--------------
|
| 86 |
|
| 87 |
Execution timing was done in a similar manner. Blank entries in the "pattern"
|
| 88 |
column below indicate the use of the same pattern as before.
|
| 89 |
|
| 90 |
PCRE Spencer Subject Pattern
|
| 91 |
---- ------- ------- -------
|
| 92 |
|
| 93 |
0.03 0.02 <null string> /^$/
|
| 94 |
0.04 0.04 enter /^.*nter/
|
| 95 |
0.04 0.04 uponter
|
| 96 |
0.03 0.03 12\r34 /^12.34/
|
| 97 |
0.03 0.03 0 /^[0-9]+$/
|
| 98 |
0.04 0.03 100
|
| 99 |
0.03 0.03 ]thing /^[]cde]/
|
| 100 |
0.03 0.03 ething
|
| 101 |
0.03 0.03 athing /^[^]cde]/
|
| 102 |
0.04 0.04 xxx0 /^xxx[0-9]+$/
|
| 103 |
0.04 0.04 xxx1234
|
| 104 |
0.04 0.07 .^\$(*+)|{?,?} /^[.^$|()*+?{,}]+/
|
| 105 |
0.03 0.03 the quick brown fox /the quick brown fox/
|
| 106 |
0.06 0.08 What do you know about the quick brown fox?
|
| 107 |
0.04 0.07 0abc /([0-9a-fA-F:]+)$/
|
| 108 |
0.04 0.07 abc
|
| 109 |
0.05 0.13 5f03:12C0::932e
|
| 110 |
0.06 0.07 x123 /^.+[0-9][0-9][0-9]$/
|
| 111 |
0.06 0.07 123456
|
| 112 |
0.06 0.09 abczz /^(abc)(abc)?zz/
|
| 113 |
0.06 0.12 abcabczz /^(abc)(abc)?zz/
|
| 114 |
/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
|
| 115 |
0.23 0.28 abc!pqr=apquxz.ixr.zzz.ac.uk
|
| 116 |
0.09 0.15 bc /^(b+|a)(b+|a)?c/
|
| 117 |
0.09 0.15 bbc
|
| 118 |
0.08 0.15 bbbc
|
| 119 |
0.09 0.15 bac
|
| 120 |
0.09 0.15 bbac
|
| 121 |
0.07 0.14 aac
|
| 122 |
0.09 0.15 abbbbbbbbbbbc
|
| 123 |
0.09 0.15 bbbbbbbbbbbac
|
| 124 |
0.09 0.18 babc /^(ba|b*)(ba|b*)?bc/
|
| 125 |
0.12 0.24 bbabc
|
| 126 |
0.07 0.15 bababc
|
| 127 |
0.06 0.10 a. /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
|
| 128 |
0.13 0.34 ab-c.pq-r.
|
| 129 |
0.24 0.58 sxk.zzz.ac.uk.
|
| 130 |
0.12 0.34 x-.y-.
|
| 131 |
0.20 0.38 abcdefhij /^(a(b(c)))(d(e(f)))(h(i(j)))$/
|
| 132 |
/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
|
| 133 |
0.18 0.30 From abcd Mon Sep 01 12:33:02 1997
|
| 134 |
|
| 135 |
In general, PCRE runs faster than the Spencer function, but remember, this
|
| 136 |
is just for one particular compiler on one set of hardware and operating
|
| 137 |
system. Until comprehensive tests have been run in other environments, the most
|
| 138 |
one can plausibly say is that it is probably no worse on average for the kinds
|
| 139 |
of expression tested here.
|
| 140 |
|
| 141 |
|
| 142 |
Speeding up matching
|
| 143 |
--------------------
|
| 144 |
|
| 145 |
A character class is much more efficient than a set of bracketed alternatives.
|
| 146 |
Matching /^[abc]{12}/ against "abcabcabcabc" took 0.03 ms, whereas
|
| 147 |
/^(a|b|c){12}/ took 0.33 ms. This is because brackets and alternatives involve
|
| 148 |
recursion.
|
| 149 |
|
| 150 |
|
| 151 |
Serious test
|
| 152 |
------------
|
| 153 |
|
| 154 |
One of the tests of PCRE is the monster regular expression from "Mastering
|
| 155 |
Regular Expressions" (O'Reilly's "hip owls" book, 1997, ISBN 1-56592-257-3)
|
| 156 |
which recognizes email addresses. There are two versions, unoptimized and
|
| 157 |
optimized. For interest, here are their timings, again on a SPARCstation IPC.
|
| 158 |
The compile times were 55 ms and 94 ms, and the compiled expressions
|
| 159 |
occupied 11010 and 15426 bytes of store, respectively. The following strings
|
| 160 |
were matched in the times shown (unoptimized first):
|
| 161 |
|
| 162 |
0.34 0.38 user@dom.ain
|
| 163 |
0.38 0.42 <user@dom.ain>
|
| 164 |
0.88 0.60 Alan Other <user@dom.ain>
|
| 165 |
1.87 0.82 "A. Other" <user.1234@dom.ain> (a comment)
|
| 166 |
1.77 1.19 A. Other <user.1234@dom.ain> (a comment)
|
| 167 |
2.21 0.42 "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay
|
| 168 |
|
| 169 |
The optimization of the expression clearly has a dramatic effect in some cases.
|
| 170 |
|
| 171 |
Philip Hazel <ph10@cus.cam.ac.uk>
|
| 172 |
October 1997
|