MAINTENANCE README FOR PCRE
---------------------------

The files in the "maint" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but which do not form
part of the PCRE distribution tarballs. This document describes these files and
also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE release
  Making a PCRE release
  Long-term ideas (wish list)


Files in the maint directory
----------------------------

Buildupctable    A script that generates the ucptable.h file
                 from two Unicode data files, which themselves are downloaded
                 from the Unicode web site. Run this script in the "maint"
                 directory.

ManyConfigTests  A shell script that runs "configure, make, test" a number of
                 times with different configuration settings.

Unicode.tables   The files in this directory, Scripts.txt and UnicodeData.txt,
                 were downloaded from the Unicode web site. They contain
                 information about Unicode characters and scripts.

ucptest.c        A short C program for testing the Unicode property functions
                 in pcre_ucp_searchfuncs.c, mainly useful after rebuilding the
                 Unicode property table. Compile and run this in the "maint"
                 directory.

ucptestdata      A directory containing two files, testinput1 and testoutput1,
                 to use in conjunction with the ucptest program.

utf8.c           A short, freestanding C program for converting a Unicode code
                 point into a sequence of bytes in the UTF-8 encoding, and vice
                 versa. If its argument is a hex number such as 0x1234, it
                 outputs the corresponding UTF-8 bytes. If its argument
                 is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                 treats them as a UTF-8 character and outputs the equivalent
                 code point in hex.

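The conversion that utf8.c performs can be sketched in a few lines. The
following is only an illustration in Python, not the actual utf8.c source. The
function names (encode_utf8, decode_utf8, convert) are hypothetical, and the
sketch assumes the argument conventions described above: a "0x" prefix means a
code point, anything else is a string of hex byte values.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def decode_utf8(data: bytes) -> int:
    """Decode the bytes of a single UTF-8 character to its code point."""
    if not data:
        raise ValueError("no bytes given")
    first = data[0]
    # The leading byte gives the sequence length and the top payload bits.
    if first < 0x80:
        length, value = 1, first
    elif first & 0xE0 == 0xC0:
        length, value = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:
        length, value = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    # Each continuation byte (10xxxxxx) contributes six more bits.
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def convert(argument: str) -> str:
    """Hypothetical front end: '0x...' is a code point, else hex bytes."""
    if argument.lower().startswith("0x"):
        return encode_utf8(int(argument, 16)).hex()
    return hex(decode_utf8(bytes.fromhex(argument)))
```

With these definitions, convert("0x1234") gives "e188b4" and convert("e188b4")
gives "0x1234", matching the example values used in the description of utf8.c.
The sketch does not reject overlong encodings or surrogate code points.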

Updating to a new Unicode release
---------------------------------

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site, and the Buildupctable script can then be run to
generate a new version of ucptable.h. The ucptest program can be used to check
that the resulting table works properly, using the data files in ucptestdata to
check a number of test characters.


Preparing for a PCRE release
----------------------------

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. The maint/ManyConfigTests script now encapsulates this testing.

. Run perltest.pl on the test data for tests 1 and 4. The output should match
  the PCRE test output, apart from the version identification at the top. The
  other tests are not Perl-compatible (they use various special PCRE options).

. Test with valgrind by running "RunTest valgrind". There is also "RunGrepTest
  valgrind", though that takes quite a long time.

. It may also be useful to test with Electric Fence, though the fact that it
  grumbles for missing free() calls can be a nuisance. (A missing free() in
  pcretest is hardly a big problem.) To build with EF, use:

    LIBS='/usr/lib/libefence.a -lpthread' with ./configure.

  Then all normal runs use it to check for buffer overflow. Also run everything
  with:

    EF_PROTECT_BELOW=1 <whatever>

  because there have been problems with lookbehinds that looked too far.

. Test with the emulated memmove() function by undefining HAVE_MEMMOVE and
  HAVE_BCOPY in config.h. You may see a number of "pcre_memmove defined but not
  used" warnings for the modules in which there is no call to memmove(). These
  can be ignored.

. Documentation: check AUTHORS, COPYING, ChangeLog (check date), INSTALL,
  LICENCE, NEWS (check date), NON-UNIX-USE, and README. Many of these won't
  need changing, but over the long term things do change.

. Man pages: Check all man pages for \ not followed by e or f or " because
  that indicates a markup error.

. When the release is built, test it on a number of different operating
  systems if possible, and using different compilers as well. For example,
  on Solaris it is helpful to test using Sun's cc compiler as a change from
  gcc. Adding -xarch=v9 to the cc options does a 64-bit test, but it also


------------------------

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for years;
others are relatively new.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required byte" (reqbyte) feature that just saves one
    byte.

  * These probably need to go in study():

    o Remember an initial string rather than just 1 char?

    o A required byte from alternatives - not just the last char, but an
      earlier one if common to all alternatives.

    o Minimum length of subject needed.

    o Friedl contains other ideas.

. If Perl gets to a consistent state over the settings of capturing
  subpatterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
  leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
  because I think it needs much more state to be remembered.

. Perl 6 will be a revolution. Is it a revolution too far for PCRE?

. Unicode

  * Note that in Perl, \s matches \pZ and similarly for \d, \w and the POSIX
    character classes. For the moment, I've chosen not to support this for
    backward compatibility, for speed, and because it would be messy to
    implement.

  * A different approach to Unicode might be to use a typedef to do everything
    in unsigned shorts instead of unsigned chars. Actually, we'd have to have a
    new typedef to distinguish data from bits of compiled pattern that are in
    bytes, I think. There would need to be conversion functions in and out. I
    don't think this is particularly trivial - and anyway, Unicode now has
    characters that need more than 16 bits, so is this at all sensible?

  * There has been a request for direct support of 16-bit characters and
    UTF-16. However, since Unicode is moving beyond purely 16-bit characters,
    is this worth it at all? One possible way of handling 16-bit characters
    would be to "load" them in the same way that UTF-8 characters are loaded.

. Allow errorptr and erroroffset to be NULL. I don't like this idea.

. Line endings:

  * Option to use NUL as a line terminator in subject strings. This could now
    be done relatively easily since the extension to support LF, CR, and CRLF.
    If this is done, a suitable option for pcregrep is also required.

. Option to provide the pattern with a length instead of with a NUL terminator.
  This probably affects quite a few places in the code.

. Catch SIGSEGV for stack overflows?

. "Cut" as described in Jeffrey Friedl's book, p364: \v and \V. The definitions
  aren't yet clear enough for me. \v flushes saved states so that no
  backtracking to anything earlier can happen; \V says "no more bumpalong", but
  does it fail the current match? As described in the book, these aren't really
  "cut" as in Prolog, are they? NOTE: (a) PCRE once had "cut", but it was
  removed when atomic groups were introduced. (b) Perl 5.10 has some (*PRUNE)
  features -- see below.

. A feature to suspend a match via a callout was once requested.

. Option to convert results into character offsets and character lengths.

. Option for pcregrep to scan only the start of a file. I am not keen - this is
  the job of "head".

. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
  preceded by a blank line, instead of adding it to every matched line, and (b)
  support --outputfile=name.

. Consider making UTF-8 and UCP the default for PCRE n.0 for some n > 7.

. Add a user pointer to pcre_malloc/free functions -- some option would be
  needed to retain backward compatibility.

. Define a union for the results from pcre_fullinfo().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE was
  compiled. PCRE would then call a function every time it wanted a character.

. There are new (*PRUNE) facilities in Perl 5.10, some of which it might be
  relatively easy to implement.

. Also in Perl 5.10 are relative subroutine references (?&-1) and (?&+1) which
  I didn't know about when I added some 5.10 features for PCRE 7.0. What about
  (?(-1)... as a condition? That's an obvious extension, even if Perl 5.10
  doesn't have it.

. Wild thought: the ability to compile from PCRE's internal byte code to a real
  FSM and a very fast (third) matcher to process the result. There would be
  even more restrictions than for pcre_dfa_exec(), however. This is not easy.

. Should pcretest have some private locale data, to avoid relying on the
  available locales for the test data, since different OS have different ideas?
  This won't be as thorough a test, but perhaps that doesn't really matter.

. pcregrep: add -rs for a sorted recurse? Having to store file names and sort
  them will of course slow it down.

. Re-arrange test 2: take out the link-size dependent stuff for a separate test
  that is run only when the link size *is* 2; leave in some non-numbered
  debugging tests using the new /Z feature.

. Stan Switzer's goto replacement for longjmp, which is apparently very slow on
  OS-X. This is used when stack recursion is disabled. It would be worth doing
  some timing tests on other OS.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. Work needs doing so that the pcregrep tests work better with different
  linebreak settings. Currently, some tests don't work when the input files
  do not have \n line endings.

. If the fr_FR locale isn't available for testing, try "french" instead,
  because this may be available on Windows. It means modifying the test data,
  however.

. These are the Perl 5.10 backtracking control features (all of which are
  described as "experimental" -- some of them "very experimental") that it
  might be easy to add to PCRE. They all succeed when encountered, but act as
  follows when backtracking:

    (*PRUNE)   fail this match attempt, but still bumpalong
    (*SKIP)    fail this match attempt, bumpalong to current match point
    (*THEN)    fail this branch, try next branch at same level or fail if none
    (*COMMIT)  fail this match attempt, suppress bumpalong
    (*FAIL)    fail and backtrack (same as (?!) and that can be optimized)
    (*F)       synonym for (*FAIL)
    (*ACCEPT)  behave as if end of pattern reached ("very experimental")

  Some of these can have arguments (*PRUNE:NAME) but I'm not sure whether they
  make sense in the PCRE context.

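To make one of the wish-list items above more concrete: the option to convert
results into character offsets and character lengths amounts to counting
characters rather than bytes in a UTF-8 subject. The sketch below is in Python
and is only an illustration under stated assumptions. The function names are
hypothetical, the subject is assumed to be valid UTF-8, and the start and end
values are assumed to be byte offsets of the kind a matcher reports. It is not
a proposal for how PCRE itself would implement the option.

```python
def byte_to_char_offset(subject: bytes, byte_offset: int) -> int:
    """Count the characters that start before byte_offset.

    In UTF-8 every byte of the form 10xxxxxx is a continuation byte, so the
    number of characters is the number of bytes that are NOT of that form.
    """
    return sum(1 for b in subject[:byte_offset] if b & 0xC0 != 0x80)


def convert_span(subject: bytes, start: int, end: int):
    """Turn a (start, end) byte span into (char_offset, char_length)."""
    char_start = byte_to_char_offset(subject, start)
    char_length = byte_to_char_offset(subject, end) - char_start
    return char_start, char_length
```

For a subject consisting of "a", e-acute (2 bytes), the euro sign (3 bytes)
and "b", the byte span 1 to 6 covers the two middle characters, and
convert_span() returns a character offset of 1 and a character length of 2.
Each call rescans the subject from the start, which is simple but would be
slow for many matches in a long subject.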
|
|
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 13 June 2007