MAINTENANCE README FOR PCRE
===========================

The files in the "maint" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but which do not form
part of the PCRE distribution tarballs. This document describes these files and
also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE release
  Making a PCRE release
  Future ideas (wish list)


Files in the maint directory
============================

---------------- This file is now OBSOLETE and no longer used ----------------
Builducptable    A Perl script that creates the contents of the ucptable.h file
                 from two Unicode data files, which themselves are downloaded
                 from the Unicode web site. Run this script in the "maint"
                 directory.
---------------- This file is now OBSOLETE and no longer used ----------------

GenerateUtt.py   A Python script to generate part of the pcre_tables.c file
                 that contains Unicode script names in a long string with
                 offsets, which is tedious to maintain by hand.

ManyConfigTests  A shell script that runs "configure, make, test" a number of
                 times with different configuration settings.

MultiStage2.py   A Python script that generates the file pcre_ucd.c from three
                 Unicode data tables, which are themselves downloaded from the
                 Unicode web site. Run this script in the "maint" directory.
                 The generated file contains the tables for a 2-stage lookup
                 of Unicode properties.

pcre_chartables.c.non-standard
                 This is a set of character tables that came from a Windows
                 system. It has characters greater than 128 that are set as
                 spaces, amongst other things. I kept it so that it can be
                 used for testing from time to time.

README           This file.

Unicode.tables   The files in this directory, DerivedGeneralCategory.txt,
                 Scripts.txt and UnicodeData.txt, were downloaded from the
                 Unicode web site. They contain information about Unicode
                 characters and scripts.

ucptest.c        A short C program for testing the Unicode property macros
                 that do lookups in the pcre_ucd.c data, mainly useful after
                 rebuilding the Unicode property table. Compile and run this in
                 the "maint" directory (see comments at its head).

ucptestdata      A directory containing two files, testinput1 and testoutput1,
                 to use in conjunction with the ucptest program.

utf8.c           A short, freestanding C program for converting a Unicode code
                 point into a sequence of bytes in the UTF-8 encoding, and vice
                 versa. If its argument is a hex number such as 0x1234, it
                 outputs a list of the equivalent UTF-8 bytes. If its argument
                 is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                 treats them as a UTF-8 character and outputs the equivalent
                 code point in hex.
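For orientation, the conversion that utf8.c performs can be sketched in Python
as below. This is not the C program itself: the function names and the
convert() helper are hypothetical, and the sketch assumes code points up to
0x10FFFF and does not reject surrogates or overlong encodings.

```python
def codepoint_to_utf8(cp):
    """Return the UTF-8 bytes for one Unicode code point (an int)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx and 3 x 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_to_codepoint(data):
    """Decode bytes that hold exactly one UTF-8 character."""
    first = data[0]
    if first < 0x80:
        length, cp = 1, first
    elif first & 0xE0 == 0xC0:
        length, cp = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:
        length, cp = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0x3F)  # append six payload bits
    return cp


def convert(arg):
    """Hypothetical helper that accepts the two argument forms described."""
    if arg.lower().startswith("0x"):
        return codepoint_to_utf8(int(arg, 16)).hex()
    return "0x%04x" % utf8_to_codepoint(bytes.fromhex(arg))


print(convert("0x1234"))   # e188b4
print(convert("e188b4"))   # 0x1234
```

The two calls at the end use the same values as the description above: 0x1234
and e188b4 are the two forms of the same character.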


Updating to a new Unicode release
=================================

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file ucp.h and both the MultiStage2.py and the
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
can be run to generate a new version of pcre_ucd.c, and GenerateUtt.py can be
run to generate the tricky tables for inclusion in pcre_tables.c.
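The "long string with offsets" idea behind GenerateUtt.py can be illustrated
with the sketch below. It is not the script itself: the exact layout used in
pcre_tables.c is not described here, so the NUL separator, the sorted order,
and all function names are assumptions made for the illustration.

```python
def build_name_table(names):
    """Concatenate names into one NUL-separated string and record offsets."""
    pieces = []
    offsets = {}
    position = 0
    for name in sorted(names):
        offsets[name] = position        # where this name starts in the string
        pieces.append(name + "\0")
        position += len(name) + 1       # +1 for the NUL terminator
    return "".join(pieces), offsets


def name_at(blob, offset):
    """Recover the name that starts at a given offset."""
    end = blob.index("\0", offset)
    return blob[offset:end]


# A small, illustrative subset of script names.
scripts = ["Latin", "Greek", "Cyrillic", "Arabic"]
blob, offsets = build_name_table(scripts)

for name in sorted(offsets):
    print("%-10s offset %d" % (name, offsets[name]))
```

Every offset depends on the lengths of all the names before it, which is why
adding one new script name by hand is error-prone and worth automating.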

If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
the cause is usually a missing (or misspelt) name in the list of scripts. I
couldn't find a straightforward list of scripts on the Unicode site, but
there's a useful Wikipedia page that lists them, and notes the Unicode version
in which they were introduced:

http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
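One way to find the missing name before hitting that error is to compare the
script names used in the data with the list in the script. The sketch below
assumes the usual "range ; Script # comment" layout of Scripts.txt. The
function name, the sample lines, and the deliberately incomplete list of known
scripts are illustrative and are not taken from MultiStage2.py.

```python
def scripts_in_data(lines):
    """Collect the script names used in Scripts.txt-style lines."""
    found = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 2:
            found.add(fields[1])
    return found


# Illustrative lines in the style of Scripts.txt (not a complete extract).
sample = [
    "# comment line",
    "0041..005A    ; Latin # L&",
    "0370..0373    ; Greek # L&",
    "0840..0858    ; Mandaic # Lo",
]

# Hypothetical list of names known to the script; Mandaic is missing.
known_scripts = ["Greek", "Latin"]

missing = sorted(scripts_in_data(sample) - set(known_scripts))
print(missing)   # ['Mandaic']
```

With the list as given, known_scripts.index("Mandaic") raises exactly the
ValueError quoted above.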

The ucptest program can be compiled and used to check that the new tables in
pcre_ucd.c work properly, using the data files in ucptestdata to check a number
of test characters. The source file ucptest.c must be updated whenever new
Unicode script names are added.
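As background to what is being checked, a two-stage lookup of the kind stored
in pcre_ucd.c can be sketched as follows. The block size, the function names,
and the toy property table are assumptions for illustration; the real tables
are produced by MultiStage2.py and may differ in detail.

```python
BLOCK_SIZE = 128   # assumed block size for this sketch


def build_two_stage(values, block_size=BLOCK_SIZE):
    """Compress a flat per-code-point list into two lookup stages."""
    if len(values) % block_size != 0:
        raise ValueError("table length must be a multiple of the block size")
    stage1 = []    # for each block of code points, the number of its block
    stage2 = []    # the distinct blocks, stored one after another
    seen = {}      # block contents -> block number within stage2
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(seen)
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, cp, block_size=BLOCK_SIZE):
    """Return the property value for code point cp."""
    block = stage1[cp // block_size]
    return stage2[block * block_size + cp % block_size]


# Toy table: 1024 code points, property 1 for A-Z and 0 elsewhere.
values = [0] * 1024
for cp in range(0x41, 0x5B):
    values[cp] = 1

stage1, stage2 = build_two_stage(values)
print(stage1)         # [0, 1, 1, 1, 1, 1, 1, 1]
print(len(stage2))    # 256
```

Identical blocks are stored only once, so large runs of code points with the
same properties cost one stage-1 entry each rather than a full block.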

Note also that both the pcresyntax.3 and pcrepattern.3 man pages contain lists
of Unicode script names.


Preparing for a PCRE release
============================

This section contains a checklist of things that I consult before building a
distribution for a new release.

. Ensure that the version number and version date are correct in configure.ac.

. If new build options have been added, ensure that they are added to the CMake
  files as well as to the autoconf files.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. The maint/ManyConfigTests script now encapsulates this testing.

. Run perltest.pl on the test data for tests 1, 4, 6, and 11. The output should
  match the PCRE test output, apart from the version identification at the
  start of each test. The other tests are not Perl-compatible (they use various
  PCRE-specific features or options).

. Test with valgrind by running "RunTest valgrind". There is also "RunGrepTest
  valgrind", though that takes quite a long time.

. It is possible to test with the emulated memmove() function by undefining
  HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often. You
  may see a number of "pcre_memmove defined but not used" warnings for the
  modules in which there is no call to memmove(). These can be ignored.

. Documentation: check AUTHORS, COPYING, ChangeLog (check version and date),
  INSTALL, LICENCE, NEWS (check version and date), NON-UNIX-USE, and README.
  Many of these won't need changing, but over the long term things do change.

. I used to test new releases myself on a number of different operating
  systems, using different compilers as well. For example, on Solaris it is
  helpful to test using Sun's cc compiler as a change from gcc. Adding
  -xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
  pcretest to increase the stack size for test 2. Since I retired I can no
  longer do this, but instead I rely on putting out release candidates for
  folks on the pcre-dev list to test.
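The "many different config options, and combinations of options" step in the
checklist above amounts to enumerating option subsets. The sketch below only
prints the command lines. It is not ManyConfigTests: the option list is an
assumed sample, and the use of "make check" as the test step is also an
assumption.

```python
import itertools

# Assumed sample of configure options; the list that ManyConfigTests
# actually uses is not given in this document.
OPTIONS = ["--enable-utf8", "--enable-unicode-properties", "--disable-shared"]


def configure_commands(options):
    """Yield one ./configure command line per subset of the options."""
    for count in range(len(options) + 1):
        for subset in itertools.combinations(options, count):
            yield " ".join(("./configure",) + subset)


def plan(options):
    """Return a configure, make, test sequence for every combination."""
    return [[command, "make", "make check"]
            for command in configure_commands(options)]


for step in plan(OPTIONS):
    print(" && ".join(step))
```

The number of combinations doubles with each option, so in practice a
hand-picked set of configurations is likely to be more useful than a full
enumeration.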


Making a PCRE release
=====================

Run PrepareRelease and commit the files that it changes (by removing trailing
spaces). The first thing this script does is to run CheckMan on the man pages;
if it finds any markup errors, it reports them and then aborts.

Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
and the zipball. Double-check with "svn status", then create an SVN tagged
copy:

  svn copy svn://vcs.exim.org/pcre/code/trunk \
           svn://vcs.exim.org/pcre/code/tags/pcre-8.xx

Don't forget to update Freshmeat when the new release is out, and to tell
webmaster@pcre.org and the mailing list. Also, update the list of version
numbers in Bugzilla (edit products).


Future ideas (wish list)
========================

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for years;
others are relatively new.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed it up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required byte" (reqbyte) feature that just remembers one
    byte.

  * These probably need to go in pcre_study():

    o Remember an initial string rather than just 1 char?

    o A required byte from alternatives - not just the last char, but an
      earlier one if common to all alternatives.

    o Friedl contains other ideas.

  * pcre_study() does not set initial byte flags for Unicode property types
    such as \p; I don't know how much benefit there would be from, for example,
    setting the bits for 0-9 and all bytes >= xC0 when a pattern starts with
    \p{N}.

  * There is scope for more "auto-possessifying" in connection with \p and \P.

. If Perl gets to a consistent state over the settings of capturing
  subpatterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
  leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
  because I think it needs much more state to be remembered.

. Perl 6 will be a revolution. Is it a revolution too far for PCRE?

. Unicode

  * There has been a request for direct support of 16-bit characters and
    UTF-16 (Bugzilla #1049). However, since Unicode is moving beyond purely
    16-bit characters, is this worth it at all? One possible way of handling
    16-bit characters would be to "load" them in the same way that UTF-8
    characters are loaded. Another possibility is to provide a set of
    translation functions, and build an index during translation so that the
    returned offsets can automatically be translated (using the index) after a
    match.

  * A different approach to Unicode might be to use a typedef to do everything
    in unsigned shorts instead of unsigned chars. Actually, we'd have to have a
    new typedef to distinguish data from bits of compiled pattern that are in
    bytes, I think. There would need to be conversion functions in and out. I
    don't think this is particularly trivial - and anyway, Unicode now has
    characters that need more than 16 bits, so is this at all sensible? I
    suspect not.
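  The "build an index during translation" idea in the first Unicode item
  above can be sketched as follows. This is only an illustration under
  stated assumptions: the function name is hypothetical, the subject is
  taken to be valid text with no lone surrogates, and the matcher is assumed
  to report byte offsets into the UTF-8 translation.

```python
def build_offset_index(text):
    """Map UTF-8 byte offsets (at character starts) to UTF-16 unit offsets."""
    index = {}
    byte_pos = 0      # position in the UTF-8 translation
    unit_pos = 0      # position in the original UTF-16 subject
    for ch in text:
        index[byte_pos] = unit_pos
        byte_pos += len(ch.encode("utf-8"))
        # Characters above U+FFFF need a surrogate pair (two 16-bit units).
        unit_pos += 2 if ord(ch) > 0xFFFF else 1
    index[byte_pos] = unit_pos   # offset just past the end of the subject
    return index


# 'a', U+1D11E (needs more than 16 bits), 'b'
subject = "a\U0001D11Eb"
index = build_offset_index(subject)

# Suppose a match on "b" was reported at UTF-8 byte offsets 5 and 6.
print(index[5], index[6])   # 3 4
```

  The example character also shows why purely 16-bit handling is not enough:
  it occupies four bytes in UTF-8 and two code units in UTF-16.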

. Allow errorptr and erroroffset to be NULL. I don't like this idea.

. Line endings:

  * Option to use NUL as a line terminator in subject strings. This could now
    be done relatively easily since the extension to support LF, CR, and CRLF.
    If it is done, a suitable option for pcregrep is also required.

. Option to provide the pattern with a length instead of with a NUL terminator.
  This affects quite a few places in the code and is not trivial.

. Catch SIGSEGV for stack overflows?

. A feature to suspend a match via a callout was once requested.

. Option to convert results into character offsets and character lengths.

. Option for pcregrep to scan only the start of a file. I am not keen - this is
  the job of "head".

. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
  preceded by a blank line, instead of adding it to every matched line, and (b)
  support --outputfile=name.

. Consider making UTF-8 and UCP the default for PCRE n.0 for some n > 8.

. Add a user pointer to pcre_malloc/free functions -- some option would be
  needed to retain backward compatibility.

. Define a union for the results from pcre_fullinfo().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE was
  compiled. PCRE would then call a function every time it wanted a character.

. Wild thought: the ability to compile from PCRE's internal byte code to a real
  FSM and a very fast (third) matcher to process the result. There would be
  even more restrictions than for pcre_dfa_exec(), however. This is not easy.

. Should pcretest have some private locale data, to avoid relying on the
  available locales for the test data, since different OSes have different
  ideas? This won't be as thorough a test, but perhaps that doesn't really
  matter.

. pcregrep: add -rs for a sorted recurse? Having to store file names and sort
  them will of course slow it down.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. Check names that consist entirely of digits: PCRE allows them, but do Perl,
  Python, etc.?

. A user suggested a parameter to limit the length of string matched. For
  example, if the parameter is N, the current match should fail if the matched
  substring is longer than N. This could apply to both match functions. The
  value could be a new field in the extra block.

. Callouts with arguments: (?Cn:ARG) for instance.

. A user is going to supply a patch to generalize the API for user-specific
  memory allocation so that it is more flexible in threaded environments. This
  was promised a long time ago, and never appeared...

. Write a function that generates random matching strings for a compiled regex.

. Write a wrapper to maintain a structure with specified runtime parameters,
  such as recurse limit, and pass these to PCRE each time it is called. Also
  maybe malloc and free. A user sent a prototype.

. Pcregrep: an option to specify the output line separator, either as a string
  or selected from a fixed list. This is not dead easy, because at the moment
  it outputs whatever is in the input file.

. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
  non-thread-safe patch showed that this can help performance for patterns
  where there are many alternatives. However, a simple thread-safe
  implementation that I tried made things worse in many simple cases, so this
  is not an obviously good thing.

. Make the longest lookbehind available via pcre_fullinfo(). This is not
  straightforward because lookbehinds can be nested inside lookbehinds. This
  case will have to be identified, and the amounts added. This should then give
  the maximum possible lookbehind length. The reason for wanting this is to
  help when implementing multi-segment matching using pcre_exec() with partial
  matching and overlapping segments.

. PCRE cannot at present distinguish between subpatterns with different names,
  but the same number (created by the use of ?|). In order to do so, a way of
  remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
  Now that (*MARK) has been implemented, it can perhaps be used as a way round
  this problem.

. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
  "something" and then the #ifdef appears only in one place, in "something".

Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 12 January 2011