--- code/trunk/doc/html/pcreapi.html 2007/02/24 21:41:29 90
+++ code/trunk/doc/html/pcreapi.html 2007/02/24 21:41:34 91
@@ -15,21 +15,23 @@
PCRE NATIVE API

@@ -83,6 +85,10 @@ const char *name);

+int pcre_get_stringtable_entries(const pcre *code,
+const char *name, char **first, char **last);
+

+

int pcre_get_substring(const char *subject, int *ovector, int stringcount, int stringnumber, const char **stringptr); @@ -163,8 +169,8 @@

A second matching function, pcre_dfa_exec(), which is not Perl-compatible, is also provided. This uses a different algorithm for the -matching. This allows it to find all possible matches (at a given point in the -subject), not just one. However, this algorithm does not return captured +matching. The alternative algorithm finds all possible matches (at a given +point in the subject). However, this algorithm does not return captured substrings. A description of the two matching algorithms and their advantages and disadvantages is given in the pcrematching @@ -181,6 +187,7 @@ pcre_get_named_substring() pcre_get_substring_list() pcre_get_stringnumber() + pcre_get_stringtable_entries() pcre_free_substring() and pcre_free_substring_list() are also provided, to free the memory used for extracted strings. @@ -215,13 +222,17 @@ The global variables pcre_stack_malloc and pcre_stack_free are also indirections to memory management functions. These special functions are used only when PCRE is compiled to use the heap for remembering data, instead of -recursive function calls, when running the pcre_exec() function. This is -a non-standard way of building PCRE, for use in environments that have limited -stacks. Because of the greater use of memory management, it runs more slowly. -Separate functions are provided so that special-purpose external code can be -used for this case. When used, these functions are always called in a -stack-like manner (last obtained, first freed), and always for memory blocks of -the same size. +recursive function calls, when running the pcre_exec() function. See the +pcrebuild +documentation for details of how to do this. It is a non-standard way of +building PCRE, for use in environments that have limited stacks. Because of the +greater use of memory management, it runs more slowly. Separate functions are +provided so that special-purpose external code can be used for this case. 
When +used, these functions are always called in a stack-like manner (last obtained, +first freed), and always for memory blocks of the same size. There is a +discussion about PCRE's stack usage in the +pcrestack +documentation.

The global variable pcre_callout initially contains NULL. It can be set @@ -230,7 +241,20 @@ pcrecallout documentation.

-
MULTITHREADING
+
NEWLINES
+

+PCRE supports three different conventions for indicating line breaks in +strings: a single CR character, a single LF character, or the two-character +sequence CRLF. All three are used as "standard" by different operating systems. +When PCRE is built, a default can be specified. The default default is LF, +which is the Unix standard. When PCRE is run, the default can be overridden, +either when a pattern is compiled, or when it is matched. +
+
+In the PCRE documentation the word "newline" is used to mean "the character or +pair of characters that indicate a line break". +

+
MULTITHREADING

The PCRE functions can be used in multi-threading applications, with the proviso that the memory management functions pointed to by pcre_malloc, @@ -241,7 +265,7 @@ The compiled form of a regular expression is not altered during matching, so the same compiled pattern can safely be used by several threads at once.

-
SAVING PRECOMPILED PATTERNS FOR LATER USE
+
SAVING PRECOMPILED PATTERNS FOR LATER USE

The compiled form of a regular expression can be saved and re-used at a later time, possibly by a different program, and even on a host other than the one on @@ -249,7 +273,7 @@ pcreprecompile documentation.

-
CHECKING BUILD-TIME OPTIONS
+
CHECKING BUILD-TIME OPTIONS

int pcre_config(int what, void *where);

@@ -276,9 +300,10 @@
   PCRE_CONFIG_NEWLINE
 
-The output is an integer that is set to the value of the code that is used for -the newline character. It is either linefeed (10) or carriage return (13), and -should normally be the standard character for your operating system. +The output is an integer whose value specifies the default character sequence +that is recognized as meaning "newline". The three values that are supported +are: 10 for LF, 13 for CR, and 3338 for CRLF. The default should normally be +the standard sequence for your operating system.
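The three codes listed above can be turned back into the byte sequences they stand for. The following is a minimal sketch, not part of PCRE: `newline_bytes()` is a hypothetical helper, and it relies only on the arithmetic observation that 3338 equals 13 * 256 + 10, that is, the CR code in the high byte and the LF code in the low byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): expand a value reported for
   PCRE_CONFIG_NEWLINE into the byte sequence it stands for. Returns the
   length of the sequence (1 or 2), or 0 if the value is not one of the
   three documented codes. */
static int newline_bytes(int value, char out[2])
{
  assert(out != NULL);
  switch (value)
    {
    case 10:                          /* LF */
    case 13:                          /* CR */
      out[0] = (char)value;
      return 1;
    case 3338:                        /* CRLF: 13 * 256 + 10 */
      out[0] = (char)(value >> 8);    /* 13, CR */
      out[1] = (char)(value & 0xff);  /* 10, LF */
      return 2;
    default:
      return 0;
    }
}
```

In a real program the value would presumably come from a call such as `pcre_config(PCRE_CONFIG_NEWLINE, &value)`, with `value` declared as an `int`.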
   PCRE_CONFIG_LINK_SIZE
 
@@ -318,7 +343,7 @@ pcre_stack_free are called to manage memory blocks on the heap, thus avoiding the use of the stack.

-
COMPILING A PATTERN
+
COMPILING A PATTERN

pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, @@ -340,7 +365,7 @@ via pcre_malloc is returned. This contains the compiled code and related data. The pcre type is defined for the returned block; this is a typedef for a structure whose contents are not externally defined. It is up to the -caller to free the memory when it is no longer required. +caller to free the memory (via pcre_free) when it is no longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it does not @@ -357,8 +382,8 @@ pcrepattern documentation). For these options, the contents of the options argument specifies their initial settings at the start of compilation and execution. The -PCRE_ANCHORED option can be set at the time of matching as well as at compile -time. +PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time of +matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. @@ -431,27 +456,37 @@ If this bit is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches -immediately before the final character if it is a newline (but not before any -other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is -set. There is no equivalent to this option in Perl, and no way to set it within -a pattern. +immediately before a newline at the end of the string (but not before any other +newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. +There is no equivalent to this option in Perl, and no way to set it within a +pattern.

   PCRE_DOTALL
 
If this bit is set, a dot metacharacter in the pattern matches all characters,
-including newlines. Without it, newlines are excluded. This option is
-equivalent to Perl's /s option, and it can be changed within a pattern by a
-(?s) option setting. A negative class such as [^a] always matches a newline
-character, independent of the setting of this option.
+including those that indicate newline. Without it, a dot does not match when
+the current position is at a newline. This option is equivalent to Perl's /s
+option, and it can be changed within a pattern by a (?s) option setting. A
+negative class such as [^a] always matches newlines, independent of the setting
+of this option.
+
+  PCRE_DUPNAMES
+
+If this bit is set, names used to identify capturing subpatterns need not be +unique. This can be helpful for certain types of pattern when it is known that +only one instance of the named subpattern can ever be matched. There are more +details of named subpatterns below; see also the +pcrepattern +documentation.
   PCRE_EXTENDED
 
If this bit is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class. Whitespace does not include the VT character (code 11). In addition, characters between an -unescaped # outside a character class and the next newline character, -inclusive, are also ignored. This is equivalent to Perl's /x option, and it can -be changed within a pattern by a (?x) option setting. +unescaped # outside a character class and the next newline, inclusive, are also +ignored. This is equivalent to Perl's /x option, and it can be changed within a +pattern by a (?x) option setting.

This option makes it possible to include comments inside complicated patterns. @@ -466,15 +501,15 @@ set, any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no -special meaning is treated as a literal. There are at present no other features -controlled by this option. It can also be set by a (?X) option setting within a -pattern. +special meaning is treated as a literal. (Perl can, however, be persuaded to +give a warning for this.) There are at present no other features controlled by +this option. It can also be set by a (?X) option setting within a pattern.

   PCRE_FIRSTLINE
 
If this option is set, an unanchored pattern is required to match before or at -the first newline character in the subject string, though the matched text may -continue over the newline. +the first newline in the subject string, though the matched text may continue +over the newline.
   PCRE_MULTILINE
 
@@ -487,12 +522,29 @@

When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
-match immediately following or immediately before any newline in the subject
-string, respectively, as well as at the very start and end. This is equivalent
-to Perl's /m option, and it can be changed within a pattern by a (?m) option
-setting. If there are no "\n" characters in a subject string, or no
+match immediately following or immediately before internal newlines in the
+subject string, respectively, as well as at the very start and end. This is
+equivalent to Perl's /m option, and it can be changed within a pattern by a
+(?m) option setting. If there are no newlines in a subject string, or no
occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.

+  PCRE_NEWLINE_CR
+  PCRE_NEWLINE_LF
+  PCRE_NEWLINE_CRLF
+
+These options override the default newline definition that was chosen when PCRE +was built. Setting the first or the second specifies that a newline is +indicated by a single character (CR or LF, respectively). Setting both of them +specifies that a newline is indicated by the two-character CRLF sequence. For +convenience, PCRE_NEWLINE_CRLF is defined to contain both bits. The only time +that a line break is relevant when compiling a pattern is if PCRE_EXTENDED is +set, and an unescaped # outside a character class is encountered. This +indicates a comment that lasts until after the next newline. +

+

+The newline option set at compile time becomes the default that is used for +pcre_exec() and pcre_dfa_exec(), but it can be overridden. +

   PCRE_NO_AUTO_CAPTURE
 
If this option is set, it disables the use of numbered capturing parentheses in @@ -531,7 +583,7 @@ pcre_dfa_exec(), to suppress the UTF-8 validity checking of subject strings.

-
COMPILATION ERROR CODES
+
COMPILATION ERROR CODES

The following table lists the error codes that may be returned by pcre_compile2(), along with the error messages that may be returned by
@@ -563,7 +615,7 @@
  23  internal error: code overflow
  24  unrecognized character after (?<
  25  lookbehind assertion is not fixed length
- 26  malformed number after (?(
+ 26  malformed number or name after (?(
  27  conditional group contains more than two branches
  28  assertion expected after (?(
  29  (?R or (?digits must be followed by )
@@ -580,14 +632,18 @@
  40  recursive call could loop indefinitely
  41  unrecognized character after (?P
  42  syntax error after (?P
- 43  two named groups have the same name
+ 43  two named subpatterns have the same name
  44  invalid UTF-8 string
  45  support for \P, \p, and \X has not been compiled
  46  malformed \P or \p sequence
  47  unknown property name after \P or \p
+ 48  subpattern name is too long (maximum 32 characters)
+ 49  too many named subpatterns (maximum 10,000)
+ 50  repeated subpattern is too long
+ 51  octal value is greater than \377 (not in UTF-8 mode)

-
STUDYING A PATTERN
+
STUDYING A PATTERN

pcre_extra *pcre_study(const pcre *code, int options, const char **errptr);
@@ -640,7 +696,7 @@
not have a single fixed starting character. A bitmap of possible starting bytes is created.

-
LOCALE SUPPORT
+
LOCALE SUPPORT

PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character
@@ -688,7 +744,7 @@
one in which it was compiled. Passing table pointers at run time is discussed below in the section on matching a pattern.

-
INFORMATION ABOUT A PATTERN
+
INFORMATION ABOUT A PATTERN

int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, void *where); @@ -716,7 +772,7 @@ pcre_fullinfo(), to obtain the length of the compiled pattern:

   int rc;
-  unsigned long int length;
+  size_t length;
   rc = pcre_fullinfo(
     re,               /* result of pcre_compile() */
     pe,               /* result of pcre_study(), or NULL */
@@ -748,13 +804,13 @@
   PCRE_INFO_FIRSTBYTE
 
Return information about the first byte of any matched string, for a -non-anchored pattern. (This option used to be called PCRE_INFO_FIRSTCHAR; the -old name is still recognized for backwards compatibility.) +non-anchored pattern. The fourth argument should point to an int +variable. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name is +still recognized for backwards compatibility.)

If there is a fixed first byte, for example, from a pattern such as -(cat|cow|coyote), it is returned in the integer pointed to by where. -Otherwise, if either +(cat|cow|coyote). Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch @@ -792,12 +848,13 @@ PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still -acquire numbers. A convenience function called pcre_get_named_substring() -is provided for extracting an individual captured substring by name. It is also -possible to extract the data directly, by first converting the name to a number -in order to access the correct pointers in the output vector (described with -pcre_exec() below). To do the conversion, you need to use the -name-to-number map, which is described by these three values. +acquire numbers. Several convenience functions such as +pcre_get_named_substring() are provided for extracting captured +substrings by name. It is also possible to extract the data directly, by first +converting the name to a number in order to access the correct pointers in the +output vector (described with pcre_exec() below). To do the conversion, +you need to use the name-to-number map, which is described by these three +values.

The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives @@ -807,7 +864,8 @@ entry of the table (a pointer to char). The first two bytes of each entry are the number of the capturing parenthesis, most significant byte first. The rest of the entry is the corresponding name, zero terminated. The names are in -alphabetical order. For example, consider the following pattern (assume +alphabetical order. When PCRE_DUPNAMES is set, duplicate names are in order of +their parentheses numbers. For example, consider the following pattern (assume PCRE_EXTENDED is set, so white space - including newlines - is ignored):

   (?P<date> (?P<year>(\d\d)?\d\d) - (?P<month>\d\d) - (?P<day>\d\d) )
@@ -822,7 +880,7 @@
   00 02 y  e  a  r  00 ??
 
When writing code to extract data from named subpatterns using the -name-to-number map, remember that the length of each entry is likely to be +name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
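The entry layout described above (a two-byte group number, most significant byte first, followed by the zero-terminated name) can be walked with a simple loop. The following is a sketch under stated assumptions rather than PCRE's own code: `lookup_group_number()` is a hypothetical helper, and its `table`, `count` and `entry_size` parameters stand for the table pointer, the PCRE_INFO_NAMECOUNT value and the per-entry length that a caller would obtain from pcre_fullinfo().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): search a name-to-number table
   for "name". Each entry is entry_size bytes long: two bytes holding the
   group number (most significant byte first), then the zero-terminated
   name. Returns the group number, or -1 if the name is not in the table. */
static int lookup_group_number(const unsigned char *table, int count,
                               int entry_size, const char *name)
{
  int i;
  assert(table != NULL && name != NULL && entry_size > 2);
  for (i = 0; i < count; i++)
    {
    const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}
```

Because the entry length differs from one compiled pattern to another, the sketch takes it as a parameter instead of hard-coding it.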
   PCRE_INFO_OPTIONS
@@ -859,7 +917,7 @@
 created by pcre_study(). The fourth argument should point to a
 size_t variable.
 

-
OBSOLETE INFO FUNCTION
+
OBSOLETE INFO FUNCTION

int pcre_info(const pcre *code, int *optptr, int *firstcharptr); @@ -883,7 +941,7 @@ it is used to pass back information about the first character of any matched string (see PCRE_INFO_FIRSTBYTE above).

-
REFERENCE COUNTS
+
REFERENCE COUNTS

int pcre_refcount(pcre *code, int adjust);

@@ -907,7 +965,7 @@ pattern is compiled on one host and then transferred to a host whose byte-order is different. (This seems a highly unlikely scenario.)

-
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
+
MATCHING A PATTERN: THE TRADITIONAL FUNCTION

int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, @@ -1045,8 +1103,8 @@

The unused bits of the options argument for pcre_exec() must be -zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, -PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. +zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, +PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.

   PCRE_ANCHORED
 
@@ -1055,6 +1113,15 @@
to be anchored by virtue of its contents, it cannot be made unanchored at matching time.
+  PCRE_NEWLINE_CR
+  PCRE_NEWLINE_LF
+  PCRE_NEWLINE_CRLF
+
+These options override the newline definition that was chosen or defaulted when
+the pattern was compiled. For details, see the description of pcre_compile()
+above. During matching, the newline choice affects the behaviour of the dot,
+circumflex, and dollar metacharacters.
+
   PCRE_NOTBOL
 
This option specifies that the first character of the subject string is not the
@@ -1193,20 +1260,10 @@
first pair, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec()
-is the number of pairs that have been set. If there are no capturing
-subpatterns, the return value from a successful match is 1, indicating that
-just the first pair of offsets has been set.
-

-

-Some convenience functions are provided for extracting the captured substrings -as separate strings. These are described in the following section. -

-

-It is possible for an capturing subpattern number n+1 to match some -part of the subject when subpattern n has not been used at all. For -example, if the string "abc" is matched against the pattern (a|(z))(bc) -subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset -values corresponding to the unused subpattern are set to -1. +is one more than the highest numbered pair that has been set. For example, if +two substrings have been captured, the returned value is 3. If there are no +capturing subpatterns, the return value from a successful match is 1, +indicating that just the first pair of offsets has been set.

If a capturing subpattern is matched repeatedly, it is the last portion of the @@ -1223,13 +1280,34 @@ advisable to supply an ovector.

-Note that pcre_info() can be used to find out how many capturing +The pcre_info() function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3. +

+

+It is possible for capturing subpattern number n+1 to match some part of +the subject when subpattern n has not been used at all. For example, if +the string "abc" is matched against the pattern (a|(z))(bc) the return from the +function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this +happens, both values in the offset pairs corresponding to unused subpatterns +are set to -1. +

+

+Offset values that correspond to unused subpatterns at the end of the +expression are also set to -1. For example, if the string "abc" is matched +against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The +return from the function is 2, because the highest used capturing subpattern +number is 1. However, you can refer to the offsets for the second and third +capturing subpatterns if you wish (assuming the vector is large enough, of +course). +

+

+Some convenience functions are provided for extracting the captured substrings +as separate strings. These are described below.
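The rules above for the offset vector (pairs of offsets, -1 for unused subpatterns, and a minimum size of (n+1)*3) can be illustrated without calling the matcher at all. The following is a minimal sketch: `captured_length()` and `min_ovector_size()` are hypothetical helpers, and the vectors they are applied to would in practice be filled in by pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): return the length of the
   substring captured by subpattern n, or -1 if that subpattern is unset.
   "pairs" is the number of offset pairs that are valid in ovector. */
static int captured_length(const int *ovector, int pairs, int n)
{
  assert(ovector != NULL);
  if (n < 0 || n >= pairs) return -1;
  if (ovector[2*n] < 0 || ovector[2*n + 1] < 0) return -1;   /* unused */
  return ovector[2*n + 1] - ovector[2*n];
}

/* Smallest ovector size that allows for n captured substrings in addition
   to the offsets of the whole match, following the (n+1)*3 rule. */
static int min_ovector_size(int n)
{
  return (n + 1) * 3;
}
```

For the "abc" against (a|(z))(bc) example, the first eight elements of the vector would be expected to hold 0, 3, 0, 1, -1, -1, 1, 3, so `captured_length()` reports 3, 1, -1 and 2 for subpatterns 0 to 3.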


-Return values from pcre_exec() +Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are @@ -1326,7 +1404,7 @@

This error is given if the value of the ovecsize argument is negative.

-
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
+
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

int pcre_copy_substring(const char *subject, int *ovector, int stringcount, int stringnumber, char *buffer, @@ -1348,9 +1426,16 @@ pcre_get_substring_list() are provided for extracting captured substrings as new, separate, zero-terminated strings. These functions identify substrings by number. The next section describes functions for extracting named -substrings. A substring that contains a binary zero is correctly extracted and -has a further zero added on the end, but the result is not, of course, -a C string. +substrings. +

+

+A substring that contains a binary zero is correctly extracted and has a +further zero added on the end, but the result is not, of course, a C string. +However, you can process such a string by referring to the length that is +returned by pcre_copy_substring() and pcre_get_substring(). +Unfortunately, the interface to pcre_get_substring_list() is not adequate +for handling strings containing binary zeros, because the end of the final +string is not independently indicated.

The first three arguments are the same for all three of these functions: @@ -1410,11 +1495,11 @@ pcre_get_substring_list(), respectively. They do nothing more than call the function pointed to by pcre_free, which of course could be called directly from a C program. However, PCRE is used in some situations where it is -linked via a special interface to another programming language which cannot use +linked via a special interface to another programming language that cannot use pcre_free directly; it is for these cases that the functions are provided.

-
EXTRACTING CAPTURED SUBSTRINGS BY NAME
+
EXTRACTING CAPTURED SUBSTRINGS BY NAME

int pcre_get_stringnumber(const pcre *code, const char *name); @@ -1437,9 +1522,10 @@

   (a+)b(?P<xxx>\d+)...
 
-the number of the subpattern called "xxx" is 2. You can find the number from -the name by calling pcre_get_stringnumber(). The first argument is the -compiled pattern, and the second is the name. The yield of the function is the +the number of the subpattern called "xxx" is 2. If the name is known to be +unique (PCRE_DUPNAMES was not set), you can find the number from the name by +calling pcre_get_stringnumber(). The first argument is the compiled +pattern, and the second is the name. The yield of the function is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no subpattern of that name.

@@ -1449,8 +1535,8 @@ two functions that do the whole job.

-Most of the arguments of pcre_copy_named_substring() and -pcre_get_named_substring() are the same as those for the similarly named +Most of the arguments of pcre_copy_named_substring() and +pcre_get_named_substring() are the same as those for the similarly named functions that extract by number. As these are described in the previous section, they are not re-described here. There are just two differences:

@@ -1465,7 +1551,36 @@ then call pcre_copy_substring() or pcre_get_substring(), as appropriate.

-
FINDING ALL POSSIBLE MATCHES
+
DUPLICATE SUBPATTERN NAMES
+

+int pcre_get_stringtable_entries(const pcre *code,
+const char *name, char **first, char **last);
+

+

+When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns +are not required to be unique. Normally, patterns with duplicate names are such +that in any one match, only one of the named subpatterns participates. An +example is shown in the +pcrepattern +documentation. When duplicates are present, pcre_copy_named_substring() +and pcre_get_named_substring() return the first substring corresponding +to the given name that is set. If none are set, an empty string is returned. +The pcre_get_stringnumber() function returns one of the numbers that are +associated with the name, but it is not defined which it is. +
+
+If you want to get full details of all captured substrings for a given name, +you must use the pcre_get_stringtable_entries() function. The first +argument is the compiled pattern, and the second is the name. The third and +fourth are pointers to variables which are updated by the function. After it +has run, they point to the first and last entries in the name-to-number table +for the given name. The function itself returns the length of each entry, or +PCRE_ERROR_NOSUBSTRING if there are none. The format of the table is described +above in the section entitled Information about a pattern. Given all the +relevant entries for the name, you can extract each of their numbers, and hence +the captured data, if any. +
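Once the first and last table entries for a name are known, finding the subpattern that actually took part in the match is a short loop over the entries in between. The following is a sketch under stated assumptions: `first_set_group()` is a hypothetical helper, its `first`, `last` and `entry_size` arguments stand for the two pointers and the return value described above for pcre_get_stringtable_entries(), and `rc` and `ovector` stand for the result of a successful pcre_exec() call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): scan the table entries from
   "first" to "last" inclusive, each entry_size bytes long, and return the
   number of the first subpattern among them that is set in ovector.
   Returns -1 if none of them took part in the match. */
static int first_set_group(const char *first, const char *last,
                           int entry_size, const int *ovector, int rc)
{
  const char *entry;
  assert(first != NULL && last != NULL && ovector != NULL);
  assert(first <= last && entry_size > 2);
  for (entry = first; entry <= last; entry += entry_size)
    {
    /* The first two bytes hold the group number, high byte first. */
    int n = (((unsigned char)entry[0]) << 8) | (unsigned char)entry[1];
    if (n < rc && ovector[2*n] >= 0) return n;
    }
  return -1;
}
```

Subpatterns numbered at or above `rc` are treated as unset, which matches the description of the pcre_exec() return value as one more than the highest numbered pair that has been set.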

+
FINDING ALL POSSIBLE MATCHES

The traditional matching function uses a similar algorithm to Perl, which stops when it finds the first match, starting at a given point in the subject. If you @@ -1484,7 +1599,7 @@ other alternatives. Ultimately, when it runs out of matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.

-
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
+
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, @@ -1512,7 +1627,7 @@ The two additional arguments provide workspace for the function. The workspace vector should contain at least 20 elements. It is used for keeping track of multiple paths through the pattern tree. More workspace will be needed for -patterns and subjects where there are a lot of possible matches. +patterns and subjects where there are a lot of potential matches.

Here is an example of a simple call to pcre_dfa_exec(): @@ -1538,8 +1653,8 @@

The unused bits of the options argument for pcre_dfa_exec() must be -zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, -PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, +zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, +PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of these are the same as for pcre_exec(), so their description is not repeated here.

@@ -1647,7 +1762,7 @@
 extremely rare, as a vector of size 1000 is used.
 

-Last updated: 18 January 2006 +Last updated: 08 June 2006
Copyright © 1997-2006 University of Cambridge.