| 1 |
Technical Notes about PCRE |
Technical Notes about PCRE |
| 2 |
-------------------------- |
-------------------------- |
| 3 |
|
|
| 4 |
|
These are very rough technical notes that record potentially useful information |
| 5 |
|
about PCRE internals. |
| 6 |
|
|
| 7 |
Historical note 1 |
Historical note 1 |
| 8 |
----------------- |
----------------- |
| 9 |
|
|
| 24 |
Historical note 2 |
Historical note 2 |
| 25 |
----------------- |
----------------- |
| 26 |
|
|
| 27 |
By contrast, the code originally written by Henry Spencer and subsequently |
By contrast, the code originally written by Henry Spencer (which was |
| 28 |
heavily modified for Perl actually compiles the expression twice: once in a |
subsequently heavily modified for Perl) compiles the expression twice: once in |
| 29 |
dummy mode in order to find out how much store will be needed, and then for |
a dummy mode in order to find out how much store will be needed, and then for |
| 30 |
real. The execution function operates by backtracking and maximizing (or, |
real. (The Perl version probably doesn't do this any more; I'm talking about |
| 31 |
optionally, minimizing in Perl) the amount of the subject that matches |
the original library.) The execution function operates by backtracking and |
| 32 |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
maximizing (or, optionally, minimizing in Perl) the amount of the subject that |
| 33 |
terminology. |
matches individual wild portions of the pattern. This is an "NFA algorithm" in |
| 34 |
|
Friedl's terminology. |
| 35 |
|
|
| 36 |
OK, here's the real stuff |
OK, here's the real stuff |
| 37 |
------------------------- |
------------------------- |
| 47 |
predicted amount of store. The idea is that this is going to turn out faster |
predicted amount of store. The idea is that this is going to turn out faster |
| 48 |
because the first pass is degenerate and the second pass can just store stuff |
because the first pass is degenerate and the second pass can just store stuff |
| 49 |
straight into the vector, which it knows is big enough. It does make the |
straight into the vector, which it knows is big enough. It does make the |
| 50 |
compiling functions bigger, of course, but they have got quite big anyway to |
compiling functions bigger, of course, but they have become quite big anyway to |
| 51 |
handle all the Perl stuff. |
handle all the Perl stuff. |
| 52 |
|
|
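The two-pass idea described above can be sketched roughly as follows. This is a toy, not PCRE's code: the opcode names, the function names toy_pass() and toy_compile(), and the tiny "pattern language" (a dot or a literal character) are all made up for illustration. The only point is that the same walk over the pattern runs once without storing anything, to predict the size, and then again to store straight into a vector of exactly that size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical opcodes for this toy only; they are not PCRE's values. */
enum { TOY_END = 0, TOY_ANY = 1, TOY_CHAR = 2 };

/* Walk the pattern once. With code == NULL this is the degenerate
   measuring pass: nothing is stored, only the length is computed.
   With a non-NULL vector, the same walk stores the opcodes as it goes. */
static size_t toy_pass(const char *pattern, unsigned char *code)
{
    size_t length = 0;
    for (const char *p = pattern; *p != '\0'; p++) {
        if (*p == '.') {
            if (code != NULL) code[length] = TOY_ANY;
            length += 1;                 /* one byte: opcode only */
        } else {
            if (code != NULL) {
                code[length] = TOY_CHAR;
                code[length + 1] = (unsigned char)*p;
            }
            length += 2;                 /* two bytes: opcode + character */
        }
    }
    if (code != NULL) code[length] = TOY_END;
    return length + 1;                   /* include the terminator */
}

/* Compile in two passes: measure, get that much store, then emit.
   Returns NULL if the store cannot be obtained; the caller frees it. */
static unsigned char *toy_compile(const char *pattern, size_t *size_out)
{
    size_t predicted = toy_pass(pattern, NULL);
    unsigned char *code = malloc(predicted);
    if (code == NULL) return NULL;

    size_t used = toy_pass(pattern, code);
    assert(used == predicted);           /* both passes must agree */
    *size_out = used;
    return code;
}
```

In this toy, compiling "a.b" needs six bytes: two for each literal, one for the dot, and one for the terminator. The second pass never has to check for overflow, because the first pass has already told it how big the vector must be.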
| 53 |
Traditional matching function |
Traditional matching function |
| 67 |
simultaneously for all possible matches that start at one point in the subject |
simultaneously for all possible matches that start at one point in the subject |
| 68 |
string. (Going back to my roots: see Historical Note 1 above.) This function |
string. (Going back to my roots: see Historical Note 1 above.) This function |
| 69 |
interprets the same compiled pattern data as pcre_exec(); however, not all the |
interprets the same compiled pattern data as pcre_exec(); however, not all the |
| 70 |
facilities are available, and those that are don't always work in quite the |
facilities are available, and those that are do not always work in quite the |
| 71 |
same way. See the user documentation for details. |
same way. See the user documentation for details. |
| 72 |
|
|
| 73 |
Format of compiled patterns |
Format of compiled patterns |
| 161 |
|
|
| 162 |
OP_PROP and OP_NOTPROP are used for positive and negative matches of a |
OP_PROP and OP_NOTPROP are used for positive and negative matches of a |
| 163 |
character by testing its Unicode property (the \p and \P escape sequences). |
character by testing its Unicode property (the \p and \P escape sequences). |
| 164 |
Each is followed by a single byte that encodes the desired property value. |
Each is followed by two bytes that encode the desired property as a type and a |
| 165 |
|
value. |
| 166 |
|
|
| 167 |
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by two |
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by |
| 168 |
bytes: OP_PROP or OP_NOTPROP and then the desired property value. |
three bytes: OP_PROP or OP_NOTPROP and then the desired property type and |
| 169 |
|
value. |
| 170 |
|
|
| 171 |
|
|
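As a rough picture of the byte layout in the revised wording above (a type byte and a value byte after OP_PROP or OP_NOTPROP, with a repeat opcode placed in front for repeats), here is a small sketch. The opcode numbers are placeholders and the helper names put_prop() and put_prop_star() are hypothetical; the real opcode values live in PCRE's own headers and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder opcode numbers for illustration only, not PCRE's values. */
enum { X_OP_PROP = 100, X_OP_NOTPROP = 101, X_OP_TYPESTAR = 102 };

/* Store a single \p or \P item: the opcode, then the property type,
   then the property value. Returns the number of bytes written (3). */
static size_t put_prop(unsigned char *code, int negated,
                       unsigned char ptype, unsigned char pvalue)
{
    assert(code != NULL);
    code[0] = negated ? X_OP_NOTPROP : X_OP_PROP;
    code[1] = ptype;
    code[2] = pvalue;
    return 3;
}

/* Store a starred property item: the repeat opcode followed by the
   same three bytes as above. Returns the number of bytes written (4). */
static size_t put_prop_star(unsigned char *code, int negated,
                            unsigned char ptype, unsigned char pvalue)
{
    assert(code != NULL);
    code[0] = X_OP_TYPESTAR;
    return 1 + put_prop(code + 1, negated, ptype, pvalue);
}
```

So a lone property test occupies three bytes in the compiled pattern, and a repeated one occupies four: the repeat opcode, then OP_PROP or OP_NOTPROP, then the type and the value.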
| 172 |
Matching literal characters |
Matching literal characters |
| 345 |
data. |
data. |
| 346 |
|
|
| 347 |
Philip Hazel |
Philip Hazel |
| 348 |
January 2006 |
June 2006 |