--- code/trunk/README 2007/02/24 21:38:09 7 +++ code/trunk/README 2009/03/17 21:30:30 392 @@ -1,94 +1,612 @@ -README file for PCRE (Perl-compatible regular expressions) ----------------------------------------------------------- +README file for PCRE (Perl-compatible regular expression library) +----------------------------------------------------------------- -The distribution should contain the following files: +The latest release of PCRE is always available in three alternative formats +from: - ChangeLog log of changes to the code - Makefile for building PCRE - Performance notes on performance - README this file - Tech.Notes notes on the encoding - pcre.3 man page for the functions - pcreposix.3 man page for the POSIX wrapper API - maketables.c auxiliary program for building chartables.c - study.c ) source of - pcre.c ) the functions - pcreposix.c ) - pcre.h header for the external API - pcreposix.h header for the external POSIX wrapper API - internal.h header for internal use - pcretest.c test program - pgrep.1 man page for pgrep - pgrep.c source of a grep utility that uses PCRE - perltest Perl test program - testinput test data, compatible with Perl - testinput2 test data for error messages and non-Perl things - testoutput test results corresponding to testinput - testoutput2 test results corresponding to testinput2 - -To build PCRE, edit Makefile for your system (it is a fairly simple make file) -and then run it. It builds a two libraries called libpcre.a and libpcreposix.a, -a test program called pcretest, and the pgrep command. - -To test PCRE, run pcretest on the file testinput, and compare the output with -the contents of testoutput. There should be no differences. For example: - - pcretest testinput some.file - diff some.file testoutput - -Do the same with testinput2, comparing the output with testoutput2, but this -time using the -i flag for pcretest, i.e. - - pcretest -i testinput2 some.file - diff some.file testoutput2 - -The make target "runtest" runs both these tests, using the file "testtry" to -store the intermediate output, deleting it at the end if all goes well. - -There are two sets of tests because the first set can also be fed directly into -the perltest program to check that Perl gives the same results. The second set -of tests check pcre_info(), pcre_study(), error detection and run-time flags -that are specific to PCRE, as well as the POSIX wrapper API. - -To install PCRE, copy libpcre.a to any suitable library directory (e.g. -/usr/local/lib), pcre.h to any suitable include directory (e.g. -/usr/local/include), and pcre.3 to any suitable man directory (e.g. -/usr/local/man/man3). - -To install the pgrep command, copy it to any suitable binary directory, (e.g. -/usr/local/bin) and pgrep.1 to any suitable man directory (e.g. -/usr/local/man/man1). - -PCRE has its own native API, but a set of "wrapper" functions that are based on -the POSIX API are also supplied in the library libpcreposix.a. Note that this -just provides a POSIX calling interface to PCRE: the regular expressions -themselves still follow Perl syntax and semantics. The header file -for the POSIX-style functions is called pcreposix.h. The official POSIX name is -regex.h, but I didn't want to risk possible problems with existing files of -that name by distributing it that way. To use it with an existing program that -uses the POSIX API it will have to be renamed or pointed at by a link. + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.bz2 + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.zip + +There is a mailing list for discussion about the development of PCRE at + + pcre-dev@exim.org + +Please read the NEWS file if you are upgrading from a previous release. +The contents of this README file are: + + The PCRE APIs + Documentation for PCRE + Contributions by users of PCRE + Building PCRE on non-Unix systems + Building PCRE on Unix-like systems + Retrieving configuration information on Unix-like systems + Shared libraries on Unix-like systems + Cross-compiling on Unix-like systems + Using HP's ANSI C++ compiler (aCC) + Making new tarballs + Testing PCRE + Character tables + File manifest + + +The PCRE APIs +------------- + +PCRE is written in C, and it has its own API. The distribution also includes a +set of C++ wrapper functions (see the pcrecpp man page for details), courtesy +of Google Inc. + +In addition, there is a set of C wrapper functions that are based on the POSIX +regular expression API (see the pcreposix man page). These end up in the +library called libpcreposix. Note that this just provides a POSIX calling +interface to PCRE; the regular expressions themselves still follow Perl syntax +and semantics. The POSIX API is restricted, and does not give full access to +all of PCRE's facilities. + +The header file for the POSIX-style functions is called pcreposix.h. The +official POSIX name is regex.h, but I did not want to risk possible problems +with existing files of that name by distributing it that way. To use PCRE with +an existing program that uses the POSIX API, pcreposix.h will have to be +renamed or pointed at by a link. + +If you are using the POSIX interface to PCRE and there is already a POSIX regex +library installed on your system, as well as worrying about the regex.h header +file (as mentioned above), you must also take care when linking programs to +ensure that they link with PCRE's libpcreposix library. Otherwise they may pick +up the POSIX functions of the same name from the other library. + +One way of avoiding this confusion is to compile PCRE with the addition of +-Dregcomp=PCREregcomp (and similarly for the other POSIX functions) to the +compiler flags (CFLAGS if you are using "configure" -- see below). This has the +effect of renaming the functions so that the names no longer clash. Of course, +you have to do the same thing for your applications, or write them using the +new names. + + +Documentation for PCRE +---------------------- + +If you install PCRE in the normal way on a Unix-like system, you will end up +with a set of man pages whose names all start with "pcre". The one that is just +called "pcre" lists all the others. In addition to these man pages, the PCRE +documentation is supplied in two other forms: + + 1. There are files called doc/pcre.txt, doc/pcregrep.txt, and + doc/pcretest.txt in the source distribution. The first of these is a + concatenation of the text forms of all the section 3 man pages except + those that summarize individual functions. The other two are the text + forms of the section 1 man pages for the pcregrep and pcretest commands. + These text forms are provided for ease of scanning with text editors or + similar tools. They are installed in /share/doc/pcre, where + is the installation prefix (defaulting to /usr/local). + + 2. A set of files containing all the documentation in HTML form, hyperlinked + in various ways, and rooted in a file called index.html, is distributed in + doc/html and installed in /share/doc/pcre/html. + + +Contributions by users of PCRE +------------------------------ + +You can find contributions from PCRE users in the directory + + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib + +There is a README file giving brief descriptions of what they are. Some are +complete in themselves; others are pointers to URLs containing relevant files. +Some of this material is likely to be well out-of-date. Several of the earlier +contributions provided support for compiling PCRE on various flavours of +Windows (I myself do not use Windows). Nowadays there is more Windows support +in the standard distribution, so these contibutions have been archived. + + +Building PCRE on non-Unix systems +--------------------------------- + +For a non-Unix system, please read the comments in the file NON-UNIX-USE, +though if your system supports the use of "configure" and "make" you may be +able to build PCRE in the same way as for Unix-like systems. PCRE can also be +configured in many platform environments using the GUI facility of CMake's +CMakeSetup. It creates Makefiles, solution files, etc. + +PCRE has been compiled on many different operating systems. It should be +straightforward to build PCRE on any system that has a Standard C compiler and +library, because it uses only Standard C functions. + + +Building PCRE on Unix-like systems +---------------------------------- + +If you are using HP's ANSI C++ compiler (aCC), please see the special note +in the section entitled "Using HP's ANSI C++ compiler (aCC)" below. + +The following instructions assume the use of the widely used "configure, make, +make install" process. There is also support for CMake in the PCRE +distribution; there are some comments about using CMake in the NON-UNIX-USE +file, though it can also be used in Unix-like systems. + +To build PCRE on a Unix-like system, first run the "configure" command from the +PCRE distribution directory, with your current directory set to the directory +where you want the files to be created. This command is a standard GNU +"autoconf" configuration script, for which generic instructions are supplied in +the file INSTALL. + +Most commonly, people build PCRE within its own distribution directory, and in +this case, on many systems, just running "./configure" is sufficient. However, +the usual methods of changing standard defaults are available. For example: + +CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local + +specifies that the C compiler should be run with the flags '-O2 -Wall' instead +of the default, and that "make install" should install PCRE under /opt/local +instead of the default /usr/local. + +If you want to build in a different directory, just run "configure" with that +directory as current. For example, suppose you have unpacked the PCRE source +into /source/pcre/pcre-xxx, but you want to build it in /build/pcre/pcre-xxx: + +cd /build/pcre/pcre-xxx +/source/pcre/pcre-xxx/configure + +PCRE is written in C and is normally compiled as a C library. However, it is +possible to build it as a C++ library, though the provided building apparatus +does not have any features to support this. + +There are some optional features that can be included or omitted from the PCRE +library. You can read more about them in the pcrebuild man page. + +. If you want to suppress the building of the C++ wrapper library, you can add + --disable-cpp to the "configure" command. Otherwise, when "configure" is run, + it will try to find a C++ compiler and C++ header files, and if it succeeds, + it will try to build the C++ wrapper. + +. If you want to make use of the support for UTF-8 Unicode character strings in + PCRE, you must add --enable-utf8 to the "configure" command. Without it, the + code for handling UTF-8 is not included in the library. Even when included, + it still has to be enabled by an option at run time. When PCRE is compiled + with this option, its input can only either be ASCII or UTF-8, even when + running on EBCDIC platforms. It is not possible to use both --enable-utf8 and + --enable-ebcdic at the same time. + +. If, in addition to support for UTF-8 character strings, you want to include + support for the \P, \p, and \X sequences that recognize Unicode character + properties, you must add --enable-unicode-properties to the "configure" + command. This adds about 30K to the size of the library (in the form of a + property table); only the basic two-letter properties such as Lu are + supported. + +. You can build PCRE to recognize either CR or LF or the sequence CRLF or any + of the preceding, or any of the Unicode newline sequences as indicating the + end of a line. Whatever you specify at build time is the default; the caller + of PCRE can change the selection at run time. The default newline indicator + is a single LF character (the Unix standard). You can specify the default + newline indicator by adding --enable-newline-is-cr or --enable-newline-is-lf + or --enable-newline-is-crlf or --enable-newline-is-anycrlf or + --enable-newline-is-any to the "configure" command, respectively. + + If you specify --enable-newline-is-cr or --enable-newline-is-crlf, some of + the standard tests will fail, because the lines in the test files end with + LF. Even if the files are edited to change the line endings, there are likely + to be some failures. With --enable-newline-is-anycrlf or + --enable-newline-is-any, many tests should succeed, but there may be some + failures. + +. By default, the sequence \R in a pattern matches any Unicode line ending + sequence. This is independent of the option specifying what PCRE considers to + be the end of a line (see above). However, the caller of PCRE can restrict \R + to match only CR, LF, or CRLF. You can make this the default by adding + --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R"). + +. When called via the POSIX interface, PCRE uses malloc() to get additional + storage for processing capturing parentheses if there are more than 10 of + them in a pattern. You can increase this threshold by setting, for example, + + --with-posix-malloc-threshold=20 + + on the "configure" command. + +. PCRE has a counter that can be set to limit the amount of resources it uses. + If the limit is exceeded during a match, the match fails. The default is ten + million. You can change the default by setting, for example, + + --with-match-limit=500000 + + on the "configure" command. This is just the default; individual calls to + pcre_exec() can supply their own value. There is more discussion on the + pcreapi man page. + +. There is a separate counter that limits the depth of recursive function calls + during a matching process. This also has a default of ten million, which is + essentially "unlimited". You can change the default by setting, for example, + + --with-match-limit-recursion=500000 + + Recursive function calls use up the runtime stack; running out of stack can + cause programs to crash in strange ways. There is a discussion about stack + sizes in the pcrestack man page. + +. The default maximum compiled pattern size is around 64K. You can increase + this by adding --with-link-size=3 to the "configure" command. You can + increase it even more by setting --with-link-size=4, but this is unlikely + ever to be necessary. Increasing the internal link size will reduce + performance. + +. You can build PCRE so that its internal match() function that is called from + pcre_exec() does not call itself recursively. Instead, it uses memory blocks + obtained from the heap via the special functions pcre_stack_malloc() and + pcre_stack_free() to save data that would otherwise be saved on the stack. To + build PCRE like this, use + + --disable-stack-for-recursion + + on the "configure" command. PCRE runs more slowly in this mode, but it may be + necessary in environments with limited stack sizes. This applies only to the + pcre_exec() function; it does not apply to pcre_dfa_exec(), which does not + use deeply nested recursion. There is a discussion about stack sizes in the + pcrestack man page. + +. For speed, PCRE uses four tables for manipulating and identifying characters + whose code point values are less than 256. By default, it uses a set of + tables for ASCII encoding that is part of the distribution. If you specify + + --enable-rebuild-chartables + + a program called dftables is compiled and run in the default C locale when + you obey "make". It builds a source file called pcre_chartables.c. If you do + not specify this option, pcre_chartables.c is created as a copy of + pcre_chartables.c.dist. See "Character tables" below for further information. + +. It is possible to compile PCRE for use on systems that use EBCDIC as their + character code (as opposed to ASCII) by specifying + + --enable-ebcdic + + This automatically implies --enable-rebuild-chartables (see above). However, + when PCRE is built this way, it always operates in EBCDIC. It cannot support + both EBCDIC and UTF-8. + +. It is possible to compile pcregrep to use libz and/or libbz2, in order to + read .gz and .bz2 files (respectively), by specifying one or both of + + --enable-pcregrep-libz + --enable-pcregrep-libbz2 + + Of course, the relevant libraries must be installed on your system. + +. It is possible to compile pcretest so that it links with the libreadline + library, by specifying + + --enable-pcretest-libreadline + + If this is done, when pcretest's input is from a terminal, it reads it using + the readline() function. This provides line-editing and history facilities. + Note that libreadline is GPL-licenced, so if you distribute a binary of + pcretest linked in this way, there may be licensing issues. + + Setting this option causes the -lreadline option to be added to the pcretest + build. In many operating environments with a sytem-installed readline + library this is sufficient. However, in some environments (e.g. if an + unmodified distribution version of readline is in use), it may be necessary + to specify something like LIBS="-lncurses" as well. This is because, to quote + the readline INSTALL, "Readline uses the termcap functions, but does not link + with the termcap or curses library itself, allowing applications which link + with readline the to choose an appropriate library." If you get error + messages about missing functions tgetstr, tgetent, tputs, tgetflag, or tgoto, + this is the problem, and linking with the ncurses library should fix it. + +The "configure" script builds the following files for the basic C library: + +. Makefile is the makefile that builds the library +. config.h contains build-time configuration options for the library +. pcre.h is the public PCRE header file +. pcre-config is a script that shows the settings of "configure" options +. libpcre.pc is data for the pkg-config command +. libtool is a script that builds shared and/or static libraries +. RunTest is a script for running tests on the basic C library +. RunGrepTest is a script for running tests on the pcregrep command + +Versions of config.h and pcre.h are distributed in the PCRE tarballs under +the names config.h.generic and pcre.h.generic. These are provided for the +benefit of those who have to built PCRE without the benefit of "configure". If +you use "configure", the .generic versions are not used. + +If a C++ compiler is found, the following files are also built: + +. libpcrecpp.pc is data for the pkg-config command +. pcrecpparg.h is a header file for programs that call PCRE via the C++ wrapper +. pcre_stringpiece.h is the header for the C++ "stringpiece" functions + +The "configure" script also creates config.status, which is an executable +script that can be run to recreate the configuration, and config.log, which +contains compiler output from tests that "configure" runs. + +Once "configure" has run, you can run "make". It builds two libraries, called +libpcre and libpcreposix, a test program called pcretest, and the pcregrep +command. If a C++ compiler was found on your system, "make" also builds the C++ +wrapper library, which is called libpcrecpp, and some test programs called +pcrecpp_unittest, pcre_scanner_unittest, and pcre_stringpiece_unittest. +Building the C++ wrapper can be disabled by adding --disable-cpp to the +"configure" command. + +The command "make check" runs all the appropriate tests. Details of the PCRE +tests are given below in a separate section of this document. + +You can use "make install" to install PCRE into live directories on your +system. The following are installed (file names are all relative to the + that is set when "configure" is run): + + Commands (bin): + pcretest + pcregrep + pcre-config + + Libraries (lib): + libpcre + libpcreposix + libpcrecpp (if C++ support is enabled) + + Configuration information (lib/pkgconfig): + libpcre.pc + libpcrecpp.pc (if C++ support is enabled) + + Header files (include): + pcre.h + pcreposix.h + pcre_scanner.h ) + pcre_stringpiece.h ) if C++ support is enabled + pcrecpp.h ) + pcrecpparg.h ) + + Man pages (share/man/man{1,3}): + pcregrep.1 + pcretest.1 + pcre.3 + pcre*.3 (lots more pages, all starting "pcre") + + HTML documentation (share/doc/pcre/html): + index.html + *.html (lots more pages, hyperlinked from index.html) + + Text file documentation (share/doc/pcre): + AUTHORS + COPYING + ChangeLog + LICENCE + NEWS + README + pcre.txt (a concatenation of the man(3) pages) + pcretest.txt the pcretest man page + pcregrep.txt the pcregrep man page + +If you want to remove PCRE from your system, you can run "make uninstall". +This removes all the files that "make install" installed. However, it does not +remove any directories, because these are often shared with other programs. + + +Retrieving configuration information on Unix-like systems +--------------------------------------------------------- + +Running "make install" installs the command pcre-config, which can be used to +recall information about the PCRE configuration and installation. For example: + + pcre-config --version + +prints the version number, and + + pcre-config --libs + +outputs information about where the library is installed. This command can be +included in makefiles for programs that use PCRE, saving the programmer from +having to remember too many details. + +The pkg-config command is another system for saving and retrieving information +about installed libraries. Instead of separate commands for each library, a +single command is used. For example: + + pkg-config --cflags pcre + +The data is held in *.pc files that are installed in a directory called +/lib/pkgconfig. + + +Shared libraries on Unix-like systems +------------------------------------- + +The default distribution builds PCRE as shared libraries and static libraries, +as long as the operating system supports shared libraries. Shared library +support relies on the "libtool" script which is built as part of the +"configure" process. + +The libtool script is used to compile and link both shared and static +libraries. They are placed in a subdirectory called .libs when they are newly +built. The programs pcretest and pcregrep are built to use these uninstalled +libraries (by means of wrapper scripts in the case of shared libraries). When +you use "make install" to install shared libraries, pcregrep and pcretest are +automatically re-built to use the newly installed shared libraries before being +installed themselves. However, the versions left in the build directory still +use the uninstalled libraries. + +To build PCRE using static libraries only you must use --disable-shared when +configuring it. For example: + +./configure --prefix=/usr/gnu --disable-shared + +Then run "make" in the usual way. Similarly, you can use --disable-static to +build only shared libraries. + + +Cross-compiling on Unix-like systems +------------------------------------ + +You can specify CC and CFLAGS in the normal way to the "configure" command, in +order to cross-compile PCRE for some other host. However, you should NOT +specify --enable-rebuild-chartables, because if you do, the dftables.c source +file is compiled and run on the local host, in order to generate the inbuilt +character tables (the pcre_chartables.c file). This will probably not work, +because dftables.c needs to be compiled with the local compiler, not the cross +compiler. + +When --enable-rebuild-chartables is not specified, pcre_chartables.c is created +by making a copy of pcre_chartables.c.dist, which is a default set of tables +that assumes ASCII code. Cross-compiling with the default tables should not be +a problem. + +If you need to modify the character tables when cross-compiling, you should +move pcre_chartables.c.dist out of the way, then compile dftables.c by hand and +run it on the local host to make a new version of pcre_chartables.c.dist. +Then when you cross-compile PCRE this new version of the tables will be used. + + +Using HP's ANSI C++ compiler (aCC) +---------------------------------- + +Unless C++ support is disabled by specifying the "--disable-cpp" option of the +"configure" script, you must include the "-AA" option in the CXXFLAGS +environment variable in order for the C++ components to compile correctly. + +Also, note that the aCC compiler on PA-RISC platforms may have a defect whereby +needed libraries fail to get included when specifying the "-AA" compiler +option. If you experience unresolved symbols when linking the C++ programs, +use the workaround of specifying the following environment variable prior to +running the "configure" script: + + CXXLDFLAGS="-lstd_v2 -lCsup_v2" + + +Making new tarballs +------------------- + +The command "make dist" creates three PCRE tarballs, in tar.gz, tar.bz2, and +zip formats. The command "make distcheck" does the same, but then does a trial +build of the new distribution to ensure that it works. + +If you have modified any of the man page sources in the doc directory, you +should first run the PrepareRelease script before making a distribution. This +script creates the .txt and HTML forms of the documentation from the man pages. + + +Testing PCRE +------------ + +To test the basic PCRE library on a Unix system, run the RunTest script that is +created by the configuring process. There is also a script called RunGrepTest +that tests the options of the pcregrep command. If the C++ wrapper library is +built, three test programs called pcrecpp_unittest, pcre_scanner_unittest, and +pcre_stringpiece_unittest are also built. + +Both the scripts and all the program tests are run if you obey "make check" or +"make test". For other systems, see the instructions in NON-UNIX-USE. + +The RunTest script runs the pcretest test program (which is documented in its +own man page) on each of the testinput files in the testdata directory in +turn, and compares the output with the contents of the corresponding testoutput +files. A file called testtry is used to hold the main output from pcretest +(testsavedregex is also used as a working file). To run pcretest on just one of +the test files, give its number as an argument to RunTest, for example: + + RunTest 2 + +The first test file can also be fed directly into the perltest.pl script to +check that Perl gives the same results. The only difference you should see is +in the first few lines, where the Perl version is given instead of the PCRE +version. + +The second set of tests check pcre_fullinfo(), pcre_info(), pcre_study(), +pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error +detection, and run-time flags that are specific to PCRE, as well as the POSIX +wrapper API. It also uses the debugging flags to check some of the internals of +pcre_compile(). + +If you build PCRE with a locale setting that is not the standard C locale, the +character tables may be different (see next paragraph). In some cases, this may +cause failures in the second set of tests. For example, in a locale where the +isprint() function yields TRUE for characters in the range 128-255, the use of +[:isascii:] inside a character class defines a different set of characters, and +this shows up in this test as a difference in the compiled code, which is being +listed for checking. Where the comparison test output contains [\x00-\x7f] the +test will contain [\x00-\xff], and similarly in some other cases. This is not a +bug in PCRE. + +The third set of tests checks pcre_maketables(), the facility for building a +set of character tables for a specific locale and using them instead of the +default tables. The tests make use of the "fr_FR" (French) locale. Before +running the test, the script checks for the presence of this locale by running +the "locale" command. If that command fails, or if it doesn't include "fr_FR" +in the list of available locales, the third test cannot be run, and a comment +is output to say why. If running this test produces instances of the error + + ** Failed to set locale "fr_FR" + +in the comparison output, it means that locale is not available on your system, +despite being listed by "locale". This does not mean that PCRE is broken. + +[If you are trying to run this test on Windows, you may be able to get it to +work by changing "fr_FR" to "french" everywhere it occurs. Alternatively, use +RunTest.bat. The version of RunTest.bat included with PCRE 7.4 and above uses +Windows versions of test 2. More info on using RunTest.bat is included in the +document entitled NON-UNIX-USE.] + +The fourth test checks the UTF-8 support. It is not run automatically unless +PCRE is built with UTF-8 support. To do this you must set --enable-utf8 when +running "configure". This file can be also fed directly to the perltest script, +provided you are running Perl 5.8 or higher. (For Perl 5.6, a small patch, +commented in the script, can be be used.) + +The fifth test checks error handling with UTF-8 encoding, and internal UTF-8 +features of PCRE that are not relevant to Perl. + +The sixth test checks the support for Unicode character properties. It it not +run automatically unless PCRE is built with Unicode property support. To to +this you must set --enable-unicode-properties when running "configure". + +The seventh, eighth, and ninth tests check the pcre_dfa_exec() alternative +matching function, in non-UTF-8 mode, UTF-8 mode, and UTF-8 mode with Unicode +property support, respectively. The eighth and ninth tests are not run +automatically unless PCRE is build with the relevant support. Character tables ---------------- -PCRE uses four tables for manipulating and identifying characters. These are -compiled from a source file called chartables.c. This is not supplied in -the distribution, but is built by the program maketables (compiled from -maketables.c), which uses the ANSI C character handling functions such as -isalnum(), isalpha(), isupper(), islower(), etc. to build the table sources. -This means that the default C locale set in your system may affect the contents -of the tables. You can change the tables by editing chartables.c and then -re-building PCRE. If you do this, you should probably also edit Makefile to -ensure that the file doesn't ever get re-generated. - -The first two tables pcre_lcc[] and pcre_fcc[] provide lower casing and a -case flipping functions, respectively. The pcre_cbits[] table consists of four -32-byte bit maps which identify digits, letters, "word" characters, and white -space, respectively. These are used when building 32-byte bit maps that -represent character classes. +For speed, PCRE uses four tables for manipulating and identifying characters +whose code point values are less than 256. The final argument of the +pcre_compile() function is a pointer to a block of memory containing the +concatenated tables. A call to pcre_maketables() can be used to generate a set +of tables in the current locale. If the final argument for pcre_compile() is +passed as NULL, a set of default tables that is built into the binary is used. + +The source file called pcre_chartables.c contains the default set of tables. By +default, this is created as a copy of pcre_chartables.c.dist, which contains +tables for ASCII coding. However, if --enable-rebuild-chartables is specified +for ./configure, a different version of pcre_chartables.c is built by the +program dftables (compiled from dftables.c), which uses the ANSI C character +handling functions such as isalnum(), isalpha(), isupper(), islower(), etc. to +build the table sources. This means that the default C locale which is set for +your system will control the contents of these default tables. You can change +the default tables by editing pcre_chartables.c and then re-building PCRE. If +you do this, you should take care to ensure that the file does not get +automatically re-generated. The best way to do this is to move +pcre_chartables.c.dist out of the way and replace it with your customized +tables. + +When the dftables program is run as a result of --enable-rebuild-chartables, +it uses the default C locale that is set on your system. It does not pay +attention to the LC_xxx environment variables. In other words, it uses the +system's default locale rather than whatever the compiling user happens to have +set. If you really do want to build a source set of character tables in a +locale that is specified by the LC_xxx variables, you can run the dftables +program by hand with the -L option. For example: + + ./dftables -L pcre_chartables.c.special + +The first two 256-byte tables provide lower casing and case flipping functions, +respectively. The next table consists of three 32-byte bit maps which identify +digits, "word" characters, and white space, respectively. These are used when +building 32-byte bit maps that represent character classes for code points less +than 256. -The pcre_ctypes[] table has bits indicating various character types, as +The final 256-byte table has bits indicating various character types, as follows: 1 white space character @@ -102,135 +620,144 @@ will cause PCRE to malfunction. -The pcretest program --------------------- +File manifest +------------- + +The distribution should contain the following files: -This program is intended for testing PCRE, but it can also be used for -experimenting with regular expressions. +(A) Source files of the PCRE library functions and their headers: -If it is given two filename arguments, it reads from the first and writes to -the second. If it is given only one filename argument, it reads from that file -and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and -prompts for each line of input. - -The program handles any number of sets of input on a single input file. Each -set starts with a regular expression, and continues with any number of data -lines to be matched against the pattern. An empty line signals the end of the -set. The regular expressions are given enclosed in any non-alphameric -delimiters, for example - - /(a|bc)x+yz/ - -and may be followed by i, m, s, or x to set the PCRE_CASELESS, PCRE_MULTILINE, -PCRE_DOTALL, or PCRE_EXTENDED options, respectively. These options have the -same effect as they do in Perl. - -There are also some upper case options that do not match Perl options: /A, /E, -and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively. -The /D option is a PCRE debugging feature. It causes the internal form of -compiled regular expressions to be output after compilation. The /S option -causes pcre_study() to be called after the expression has been compiled, and -the results used when the expression is matched. If /I is present as well as -/S, then pcre_study() is called with the PCRE_CASELESS option. - -Finally, the /P option causes pcretest to call PCRE via the POSIX wrapper API -rather than its native API. When this is done, all other options except /i and -/m are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m -is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and -PCRE_DOTALL unless REG_NEWLINE is set. - -A regular expression can extend over several lines of input; the newlines are -included in it. See the testinput file for many examples. - -Before each data line is passed to pcre_exec(), leading and trailing whitespace -is removed, and it is then scanned for \ escapes. The following are recognized: - - \a alarm (= BEL) - \b backspace - \e escape - \f formfeed - \n newline - \r carriage return - \t tab - \v vertical tab - \nnn octal character (up to 3 octal digits) - \xhh hexadecimal character (up to 2 hex digits) - - \A pass the PCRE_ANCHORED option to pcre_exec() - \B pass the PCRE_NOTBOL option to pcre_exec() - \E pass the PCRE_DOLLAR_ENDONLY option to pcre_exec() - \I pass the PCRE_CASELESS option to pcre_exec() - \M pass the PCRE_MULTILINE option to pcre_exec() - \S pass the PCRE_DOTALL option to pcre_exec() - \Odd set the size of the output vector passed to pcre_exec() to dd - (any number of decimal digits) - \Z pass the PCRE_NOTEOL option to pcre_exec() - -A backslash followed by anything else just escapes the anything else. If the -very last character is a backslash, it is ignored. This gives a way of passing -an empty line as data, since a real empty line terminates the data input. - -If /P was present on the regex, causing the POSIX wrapper API to be used, only -\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to -regexec() respectively. - -When a match succeeds, pcretest outputs the list of identified substrings that -pcre_exec() returns, starting with number 0 for the string that matched the -whole pattern. Here is an example of an interactive pcretest run. - - $ pcretest - Testing Perl-Compatible Regular Expressions - PCRE version 0.90 08-Sep-1997 - - re> /^abc(\d+)/ - data> abc123 - 0: abc123 - 1: 123 - data> xyz - No match - -Note that while patterns can be continued over several lines (a plain ">" -prompt is used for continuations), data lines may not. However newlines can be -included in data by means of the \n escape. - -If the -p option is given to pcretest, it is equivalent to adding /P to each -regular expression: the POSIX wrapper API is used to call PCRE. None of the -following flags has any effect in this case. - -If the option -d is given to pcretest, it is equivalent to adding /D to each -regular expression: the internal form is output after compilation. - -If the option -i (for "information") is given to pcretest, it calls pcre_info() -after compiling an expression, and outputs the information it gets back. If the -pattern is studied, the results of that are also output. - -If the option -s is given to pcretest, it outputs the size of each compiled -pattern after it has been compiled. - -If the -t option is given, each compile, study, and match is run 2000 times -while being timed, and the resulting time per compile or match is output in -milliseconds. Do not set -t with -s, because you will then get the size output -2000 times and the timing will be distorted. - - - -The perltest program --------------------- - -The perltest program tests Perl's regular expressions; it has the same -specification as pcretest, and so can be given identical input, except that -input patterns can be followed only by Perl's lower case options. - -The data lines are processed as Perl strings, so if they contain $ or @ -characters, these have to be escaped. For this reason, all such characters in -the testinput file are escaped so that it can be used for perltest as well as -for pcretest, and the special upper case options such as /A that pcretest -recognizes are not used in this file. The output should be identical, apart -from the initial identifying banner. - -The testinput2 file is not suitable for feeding to Perltest, since it does -make use of the special upper case options and escapes that pcretest uses to -test additional features of PCRE. + dftables.c auxiliary program for building pcre_chartables.c + when --enable-rebuild-chartables is specified -Philip Hazel -October 1997 + pcre_chartables.c.dist a default set of character tables that assume ASCII + coding; used, unless --enable-rebuild-chartables is + specified, by copying to pcre_chartables.c + + pcreposix.c ) + pcre_compile.c ) + pcre_config.c ) + pcre_dfa_exec.c ) + pcre_exec.c ) + pcre_fullinfo.c ) + pcre_get.c ) sources for the functions in the library, + pcre_globals.c ) and some internal functions that they use + pcre_info.c ) + pcre_maketables.c ) + pcre_newline.c ) + pcre_ord2utf8.c ) + pcre_refcount.c ) + pcre_study.c ) + pcre_tables.c ) + pcre_try_flipped.c ) + pcre_ucd.c ) + pcre_valid_utf8.c ) + pcre_version.c ) + pcre_xclass.c ) + pcre_printint.src ) debugging function that is #included in pcretest, + ) and can also be #included in pcre_compile() + pcre.h.in template for pcre.h when built by "configure" + pcreposix.h header for the external POSIX wrapper API + pcre_internal.h header for internal use + ucp.h header for Unicode property handling + + config.h.in template for config.h, which is built by "configure" + + pcrecpp.h public header file for the C++ wrapper + pcrecpparg.h.in template for another C++ header file + pcre_scanner.h public header file for C++ scanner functions + pcrecpp.cc ) + pcre_scanner.cc ) source for the C++ wrapper library + + pcre_stringpiece.h.in template for pcre_stringpiece.h, the header for the + C++ stringpiece functions + pcre_stringpiece.cc source for the C++ stringpiece functions + +(B) Source files for programs that use PCRE: + + pcredemo.c simple demonstration of coding calls to PCRE + pcregrep.c source of a grep utility that uses PCRE + pcretest.c comprehensive test program + +(C) Auxiliary files: + + 132html script to turn "man" pages into HTML + AUTHORS information about the author of PCRE + ChangeLog log of changes to the code + CleanTxt script to clean nroff output for txt man pages + Detrail script to remove trailing spaces + HACKING some notes about the internals of PCRE + INSTALL generic installation instructions + LICENCE conditions for the use of PCRE + COPYING the same, using GNU's standard name + Makefile.in ) template for Unix Makefile, which is built by + ) "configure" + Makefile.am ) the automake input that was used to create + ) Makefile.in + NEWS important changes in this release + NON-UNIX-USE notes on building PCRE on non-Unix systems + PrepareRelease script to make preparations for "make dist" + README this file + RunTest a Unix shell script for running tests + RunGrepTest a Unix shell script for pcregrep tests + aclocal.m4 m4 macros (generated by "aclocal") + config.guess ) files used by libtool, + config.sub ) used only when building a shared library + configure a configuring shell script (built by autoconf) + configure.ac ) the autoconf input that was used to build + ) "configure" and config.h + depcomp ) script to find program dependencies, generated by + ) automake + doc/*.3 man page sources for the PCRE functions + doc/*.1 man page sources for pcregrep and pcretest + doc/index.html.src the base HTML page + doc/html/* HTML documentation + doc/pcre.txt plain text version of the man pages + doc/pcretest.txt plain text documentation of test program + doc/perltest.txt plain text documentation of Perl test program + install-sh a shell script for installing files + libpcre.pc.in template for libpcre.pc for pkg-config + libpcrecpp.pc.in template for libpcrecpp.pc for pkg-config + ltmain.sh file used to build a libtool script + missing ) common stub for a few missing GNU programs while + ) installing, generated by automake + mkinstalldirs script for making install directories + perltest.pl Perl test program + pcre-config.in source of script which retains PCRE information + pcrecpp_unittest.cc ) + pcre_scanner_unittest.cc ) test programs for the C++ wrapper + pcre_stringpiece_unittest.cc ) + testdata/testinput* test data for main library tests + testdata/testoutput* expected test results + testdata/grep* input and output for pcregrep tests + +(D) Auxiliary files for cmake support + + cmake/COPYING-CMAKE-SCRIPTS + cmake/FindPackageHandleStandardArgs.cmake + cmake/FindReadline.cmake + CMakeLists.txt + config-cmake.h.in + +(E) Auxiliary files for VPASCAL + + makevp.bat + makevp_c.txt + makevp_l.txt + pcregexp.pas + +(F) Auxiliary files for building PCRE "by hand" + + pcre.h.generic ) a version of the public PCRE header file + ) for use in non-"configure" environments + config.h.generic ) a version of config.h for use in non-"configure" + ) environments + +(F) Miscellaneous + + RunTest.bat a script for running tests under Windows + +Philip Hazel +Email local part: ph10 +Email domain: cam.ac.uk +Last updated: 17 March 2009