Diff of /code/trunk/maint/README


revision 121 by ph10, Fri Mar 9 10:15:12 2007 UTC revision 122 by ph10, Mon Mar 12 15:10:25 2007 UTC
# Line 1  Line 1 
1    MAINTENANCE README FOR PCRE
2    ---------------------------
3    
4  The files in the "maint" directory of the PCRE source contain data, scripts,  The files in the "maint" directory of the PCRE source contain data, scripts,
5  and programs that are used for the maintenance of PCRE, but do not form part of  and programs that are used for the maintenance of PCRE, but which do not form
6  the PCRE distribution tarballs.  part of the PCRE distribution tarballs. This document describes these files and
7    also contains some notes for maintainers. Its contents are:
8    
9      Files in the maint directory
10      Updating to a new Unicode release
11      Preparing for a PCRE release
12      Making a PCRE release
13      Future ideas (wish list)
14    
15    
16    Files in the maint directory
17    ----------------------------
18    
19  Builducptable   A Perl script that creates the contents of the ucptable.h file  Builducptable   A Perl script that creates the contents of the ucptable.h file
20                  from two Unicode data files, which themselves are downloaded                  from two Unicode data files, which themselves are downloaded
# Line 15  Unicode.tables The files in this direct Line 29  Unicode.tables The files in this direct
29    
30  ucptest.c       A short C program for testing the Unicode property functions in  ucptest.c       A short C program for testing the Unicode property functions in
31                  pcre_ucp_searchfuncs.c, mainly useful after rebuilding the                  pcre_ucp_searchfuncs.c, mainly useful after rebuilding the
32                  Unicode property table. Compile and run this in the "main"                  Unicode property table. Compile and run this in the "maint"
33                  directory.                  directory.
34    
35  ucptestdata     A directory containing two files, testinput1 and testoutput1,  ucptestdata     A directory containing two files, testinput1 and testoutput1,
# Line 29  utf8.c A short, freestanding C Line 43  utf8.c A short, freestanding C
43                  them as a UTF-8 character and outputs the equivalent code point                  them as a UTF-8 character and outputs the equivalent code point
44                  in hex.                  in hex.
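
The beginning of the utf8.c entry is not shown in this diff, so the following is only a sketch of the behaviour described in the visible lines: interpret some bytes as one UTF-8 character and report its code point in hex. It is written in Python rather than C, the function name is hypothetical, and Python's strict decoder may reject sequences that the real program accepts.

```python
def utf8_bytes_to_codepoint(hex_bytes):
    """Interpret hex byte strings as one UTF-8 character; return its code point.

    Hypothetical helper, not the logic of utf8.c itself.
    """
    data = bytes(int(h, 16) for h in hex_bytes)
    text = data.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    if len(text) != 1:
        raise ValueError("expected exactly one UTF-8 character")
    return "U+%04X" % ord(text)


if __name__ == "__main__":
    # The three bytes E2 82 AC encode the euro sign.
    print(utf8_bytes_to_codepoint(["e2", "82", "ac"]))
```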
45    
46    
47    Updating to a new Unicode release
48    ---------------------------------
49    
50  When there is a new release of Unicode, the files in Unicode.tables must be  When there is a new release of Unicode, the files in Unicode.tables must be
51  refreshed from the web site, and the Builducptable script can then be run to  refreshed from the web site, and the Builducptable script can then be run to
52  generate a new version of ucptable.h. The ucptest program can be used to check  generate a new version of ucptable.h. The ucptest program can be used to check
53  that the resulting table works properly, using the data files in ucptestdata to  that the resulting table works properly, using the data files in ucptestdata to
54  check a number of test characters.  check a number of test characters.
55    
56  ****  
57    Preparing for a PCRE release
58    ----------------------------
59    
60    This section contains a checklist of things that I consult before building a
61    distribution for a new release.
62    
63    . Ensure that the version number and version date are correct in configure.ac.
64    
65    . Run ./autogen.sh to ensure everything is up-to-date.
66    
67    . Compile and test with many different config options, and combinations of
68      options:
69    
70       * Totally standard ./configure with no options
71       * --disable-shared
72       * --disable-static
73       * --enable-utf8
74       * --enable-unicode-properties
75       * --disable-cpp
76       * --with-link-size=3 (occasionally check with 4 as well)
77       * --disable-stack-for-recursion
78       * --enable-newline-is-any
79    
80      I've never automated this, but perhaps I should. The newline testing could be
81      enhanced; at present, some tests fail unless plain LF is a newline.
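
The option list above lends itself to automation. The following is a minimal sketch in Python and is not part of the PCRE tree. The names OPTIONS, configure_matrix and build_and_test are hypothetical, and it is an assumption that "make" followed by "./RunTest" is enough to build and test one configuration. Each option is tried alone and in pairs; the pair that disables both shared and static libraries is skipped on the assumption that it is not a useful build.

```python
import itertools
import subprocess

# Options taken from the checklist above.
OPTIONS = [
    "--disable-shared",
    "--disable-static",
    "--enable-utf8",
    "--enable-unicode-properties",
    "--disable-cpp",
    "--with-link-size=3",
    "--disable-stack-for-recursion",
    "--enable-newline-is-any",
]


def configure_matrix(max_combo=2):
    """Yield option lists: the standard build, then combinations of options."""
    yield []  # totally standard ./configure with no options
    for size in range(1, max_combo + 1):
        for combo in itertools.combinations(OPTIONS, size):
            # Assumption: disabling both library kinds is not worth testing.
            if "--disable-shared" in combo and "--disable-static" in combo:
                continue
            yield list(combo)


def build_and_test(options, source_dir="."):
    """Configure, build and test one configuration (assumed command sequence)."""
    for cmd in (["./configure"] + options, ["make"], ["./RunTest"]):
        subprocess.run(cmd, cwd=source_dir, check=True)


if __name__ == "__main__":
    for opts in configure_matrix():
        print("./configure " + " ".join(opts))
```

A real script would also need to clean the tree between runs and cope with the newline tests that are known to fail unless plain LF is a newline.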
82    
83    . Run perltest.pl on the test data for tests 1 and 4. The output should match
84      the PCRE test output, apart from the version identification at the top. The
85      other tests are not Perl-compatible (they use various special PCRE options).
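
The comparison described above can be sketched as a diff that skips the version identification. This is an illustration only: the function name is hypothetical, and treating the version identification as exactly the first line is an assumption that should be checked against real output.

```python
import difflib


def diff_ignoring_header(expected, actual, header_lines=1):
    """Return unified-diff lines for two outputs, skipping leading header lines.

    header_lines=1 assumes the version identification is a single line.
    """
    exp = expected.splitlines()[header_lines:]
    act = actual.splitlines()[header_lines:]
    return list(
        difflib.unified_diff(exp, act, "expected", "actual", lineterm="")
    )


if __name__ == "__main__":
    # Made-up sample text, only to show the call.
    pcre_out = "version line A\n/abc/\n 0: abc\n"
    perl_out = "version line B\n/abc/\n 0: abc\n"
    print(diff_ignoring_header(pcre_out, perl_out) or "outputs match")
```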
86    
87    . Test on a number of different operating systems. In particular, at the moment
88      I can test on Solaris, using Sun's cc compiler (as a change from gcc). Adding
89      -xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
90      pcretest to increase the stack size for test 2. I also test on FreeBSD and
91      Linux (where I develop).
92    
93    . Test with valgrind by running "RunTest valgrind". There is also "RunGrepTest
94      valgrind", though that takes quite a long time.
95    
96    . It can also be useful to test with Electric Fence, though the fact that it
97      grumbles for missing free() calls can be a nuisance. (A missing free() in
98      pcretest is hardly a big problem.) To build with EF, use:
99    
100        LIBS='/usr/lib/libefence.a -lpthread' with ./configure.
101    
102      Then all normal runs use it to check for buffer overflow. Also run everything
103      with:
104    
105        EF_PROTECT_BELOW=1 <whatever>
106    
107      because there have been problems with lookbehinds that looked too far.
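
Running everything twice, once normally and once with EF_PROTECT_BELOW=1, could be scripted along these lines. This is a sketch with hypothetical helper names; the command to run is whatever test command is in use, and the build is assumed to have been linked against Electric Fence already.

```python
import os
import subprocess


def electric_fence_env(protect_below, base=None):
    """Return an environment with EF_PROTECT_BELOW set or removed."""
    env = dict(os.environ if base is None else base)
    if protect_below:
        env["EF_PROTECT_BELOW"] = "1"
    else:
        env.pop("EF_PROTECT_BELOW", None)
    return env


def run_both_ways(cmd):
    """Run cmd without and then with EF_PROTECT_BELOW; return both exit codes."""
    return [
        subprocess.run(cmd, env=electric_fence_env(flag)).returncode
        for flag in (False, True)
    ]
```

For example, run_both_ways(["./RunTest"]) would return two exit codes, one per run.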
108    
109    . Test with the emulated memmove() function by undefining HAVE_MEMMOVE and
110      HAVE_BCOPY in config.h.
111    
112    . Documentation: check AUTHORS, COPYING, ChangeLog (check date), INSTALL,
113      LICENCE, NEWS (check date), NON-UNIX-USE, and README. Many of these won't
114      need changing, but over the long term things do change.
115    
116    . Man pages: Check all man pages for \ not followed by e or f or " because
117      that indicates a markup error.
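
The man page check above is easy to mechanize. The sketch below, with a hypothetical function name, reports every backslash that is not followed by e, f, or a double quote, as (line, column) pairs counted from 1.

```python
import re

# A backslash that is NOT followed by e, f or a double quote.
SUSPECT = re.compile(r'\\(?![ef"])')


def find_suspect_backslashes(text):
    """Return (line, column) pairs, both 1-based, for suspect backslashes."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for match in SUSPECT.finditer(line):
            hits.append((lineno, match.start() + 1))
    return hits


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, encoding="latin-1") as handle:
            for lineno, col in find_suspect_backslashes(handle.read()):
                print("%s:%d:%d: suspicious backslash" % (path, lineno, col))
```

A backslash at the very end of a line is also reported, since nothing follows it.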
118    
119    
120    Making a PCRE release
121    ---------------------
122    
123    Run PrepareRelease and commit the files that it changes (by removing trailing
124    spaces). Then run "make dist" to create the tarballs and the zipball.
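
The only PrepareRelease action mentioned here is the removal of trailing spaces. As an illustration of that one step (the real script is not shown in this diff and may do more), a hypothetical Python equivalent could look like this:

```python
def strip_trailing_spaces(text):
    """Remove trailing space characters from every line, keeping line breaks."""
    return "\n".join(line.rstrip(" ") for line in text.split("\n"))


def clean_file(path):
    """Rewrite path only if stripping changed it; return True when changed."""
    with open(path, encoding="latin-1", newline="") as handle:
        original = handle.read()
    cleaned = strip_trailing_spaces(original)
    if cleaned != original:
        with open(path, "w", encoding="latin-1", newline="") as handle:
            handle.write(cleaned)
    return cleaned != original
```

Files for which clean_file returns True are the ones that would need committing.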
125    
126    Don't forget to update Freshmeat when the new release is out, and to tell
127    webmaster@pcre.org and the mailing list.
128    
129    
130    Future ideas (wish list)
131    ------------------------
132    
133    This section records a list of ideas so that they do not get forgotten. They
134    vary enormously in their usefulness and potential for implementation. Some are
135    very sensible; some are rather wacky. Some have been on this list for years;
136    others are relatively new.
137    
138    . Optimization
139    
140      There are always ideas for new optimizations so as to speed up pattern
141      matching. Most of them try to save work by recognizing a non-match without
142      having to scan all the possibilities. These are some that I've recorded:
143    
144      * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
145        slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
146        OTOH, this is pathological - the user could easily fix it.
147    
148      * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
149        to have little effect, and maybe makes things worse.
150    
151      * "Ends with literal string" - note that a single character doesn't gain much
152        over the existing "required byte" (reqbyte) feature that just saves one
153        byte.
154    
155      * These probably need to go in study():
156    
157        o Remember an initial string rather than just 1 char?
158    
159        o A required byte from alternatives - not just the last char, but an
160          earlier one if common to all alternatives.
161    
162        o Minimum length of subject needed.
163    
164        o Friedl contains other ideas.
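
For the first idea in the optimization list, the suggested rewrite relies on three nested {0,5} repeats accepting the same strings as a single {0,125} repeat (5 x 5 x 5 = 125). The sketch below checks that claim on a small scale, using Python's re module as a stand-in for PCRE and a bound of 2 instead of 5, because a failing match with the full-size pattern is exactly the slow case described above. It says nothing about captured substrings, which the flattened form would lose. The function name is hypothetical.

```python
import re


def nested_equals_flat(bound, extra=2):
    """Check that ((A{0,b}){0,b}){0,b} and A{0,b**3} accept the same strings."""
    nested = re.compile("((A{0,%d}){0,%d}){0,%d}" % (bound, bound, bound))
    flat = re.compile("A{0,%d}" % bound ** 3)
    for length in range(bound ** 3 + extra + 1):
        subject = "A" * length
        nested_ok = nested.fullmatch(subject) is not None
        flat_ok = flat.fullmatch(subject) is not None
        if nested_ok != flat_ok:
            return False
    return True


if __name__ == "__main__":
    print(nested_equals_flat(2))
```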
165    
166    . If Perl gets to a consistent state over the settings of capturing sub-
167      patterns inside repeats, see if we can match it. One example of the
168      difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
169      leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
170      because I think it needs much more state to be remembered.
171    
172    . Perl 6 will be a revolution. Is it a revolution too far for PCRE?
173    
174    . Unicode
175    
176      * Note that in Perl, \s matches \pZ and similarly for \d, \w and the POSIX
177        character classes. For the moment, I've chosen not to support this for
178        backward compatibility, for speed, and because it would be messy to
179        implement.
180    
181      * A different approach to Unicode might be to use a typedef to do everything
182        in unsigned shorts instead of unsigned chars. Actually, we'd have to have a
183        new typedef to distinguish data from bits of compiled pattern that are in
184        bytes, I think. There would need to be conversion functions in and out. I
185        don't think this is particularly trivial - and anyway, Unicode now has
186        characters that need more than 16 bits, so is this at all sensible?
187    
188      * There has been a request for direct support of 16-bit characters and
189        UTF-16. However, since Unicode is moving beyond purely 16-bit characters,
190        is this worth it at all? One possible way of handling 16-bit characters
191        would be to "load" them in the same way that UTF-8 characters are loaded.
192    
193    . Allow errorptr and erroroffset to be NULL. I don't like this idea.
194    
195    . Line endings:
196    
197      * Option to use NUL as a line terminator in subject strings. This could now
198        be done relatively easily since the extension to support LF, CR, and CRLF.
199        If this is done, a suitable option for pcregrep is also required.
200    
201    . Option to provide the pattern with a length instead of with a NUL terminator.
202      This probably affects quite a few places in the code.
203    
204    . Catch SIGSEGV for stack overflows?
205    
206    . "Cut" as described in Jeffrey Friedl's book, p364: \v and \V. The definitions
207      aren't yet clear enough for me. \v flushes saved states so that no
208      backtracking to anything earlier can happen; \V says "no more bumpalong", but
209      does it fail the current match? As described in the book, these aren't really
210      "cut" as in Prolog, are they? NOTE: (a) PCRE once had "cut", but it was
211      removed when atomic groups were introduced. (b) Perl 5.10 has some (*PRUNE)
212      features -- see below.
213    
214    . A feature to suspend a match via a callout was once requested.
215    
216    . Option to convert results into character offsets and character lengths.
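
As an illustration of what such an option would compute for a UTF-8 subject, the sketch below converts byte offsets into character offsets and lengths by counting bytes that are not UTF-8 continuation bytes. The function names are hypothetical, and the subject is assumed to be valid UTF-8 with offsets that fall on character boundaries.

```python
def byte_to_char_offset(subject, byte_offset):
    """Count the characters that start before byte_offset in UTF-8 bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte
    # starts a new character.
    return sum(1 for b in subject[:byte_offset] if (b & 0xC0) != 0x80)


def char_span(subject, start, end):
    """Convert a byte span (start, end) to (character offset, character length)."""
    first = byte_to_char_offset(subject, start)
    last = byte_to_char_offset(subject, end)
    return first, last - first


if __name__ == "__main__":
    data = "a\u20acb".encode("utf-8")  # 'a', euro sign (3 bytes), 'b'
    print(char_span(data, 1, 4))
```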
217    
218    . Option for pcregrep to scan only the start of a file. I am not keen - this is
219      the job of "head".
220    
221    . A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
222      preceded by a blank line, instead of adding it to every matched line, and (b)
223      support --outputfile=name.
224    
225    . Consider making UTF-8 and UCP the default for PCRE n.0 for some n > 7.
226    
227    . Add a user pointer to pcre_malloc/free functions -- some option would be
228      needed to retain backward compatibility.
229    
230    . Define a union for the results from pcre_fullinfo().
231    
232    . Provide a "random access to the subject" facility so that the way in which it
233      is stored is independent of PCRE. For efficiency, it probably isn't possible
234      to switch this dynamically. It would have to be specified when PCRE was
235      compiled. PCRE would then call a function every time it wanted a character.
236    
237    . There are new (*PRUNE) facilities in Perl 5.10, some of which it might be
238      relatively easy to implement.
239    
240    . Also in Perl 5.10 are relative subroutine references (?&-1) and (?&+1) which
241      I didn't know about when I added some 5.10 features for PCRE 7.0. What about
242      (?(-1)... as a condition? That's an obvious extension, even if Perl 5.10
243      doesn't have it.
244    
245    . Wild thought: the ability to compile from PCRE's internal byte code to a real
246      FSM and a very fast (third) matcher to process the result. There would be
247      even more restrictions than for pcre_dfa_exec(), however. This is not easy.
248    
249    . Should pcretest have some private locale data, to avoid relying on the
250      available locales for the test data, since different OS have different ideas?
251      This won't be as thorough a test, but perhaps that doesn't really matter.
252    
253    . pcregrep: add -rs for a sorted recurse? Having to store file names and sort
254      them will of course slow it down.
255    
256    . Re-arrange test 2: take out the link-size dependent stuff for a separate test
257      that is run only when the link size *is* 2; leave in some non-numbered
258      debugging tests using the new /Z feature.
259    
260    . Stan Switzer's goto replacement for longjmp, which is apparently very slow on
261      OS-X. This is used when stack recursion is disabled. It would be worth doing
262      some timing tests on other OS.
263    
264    . Someone suggested --disable-callout to save code space when callouts are
265      never wanted. This seems rather marginal.
266    
267    . Automate some of the testing before release into a script that compiles with
268      different options and runs the tests in each case.
269    
270    . How about distributing a fixed pcre_chartables.c file and abandoning the
271      on-the-fly generation using dftables? This will make cross-compiling easier,
272      and in any case, locales are going out of fashion.
273    
274    Philip Hazel
275    Email local part: ph10
276    Email domain: cam.ac.uk
277    Last updated: 12 March 2007
