Diff of /code/trunk/doc/pcre.3


revision 42 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 43 by nigel, Sat Feb 24 21:39:21 2007 UTC
# Line 47  pcre - Perl-compatible regular expressio Line 47  pcre - Perl-compatible regular expressio
47  .B const unsigned char *pcre_maketables(void);  .B const unsigned char *pcre_maketables(void);
48  .PP  .PP
49  .br  .br
50    .B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR,"
51    .ti +5n
52    .B int \fIwhat\fR, void *\fIwhere\fR);
53    .PP
54    .br
55  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int
56  .B *\fIfirstcharptr\fR);  .B *\fIfirstcharptr\fR);
57  .PP  .PP
# Line 64  pcre - Perl-compatible regular expressio Line 69  pcre - Perl-compatible regular expressio
69  .SH DESCRIPTION  .SH DESCRIPTION
70  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
71  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
72  differences (see below). The current implementation corresponds to Perl 5.005.  differences (see below). The current implementation corresponds to Perl 5.005,
73    with some additional features from the Perl development release.
74    
75  PCRE has its own native API, which is described in this document. There is also  PCRE has its own native API, which is described in this document. There is also
76  a set of wrapper functions that correspond to the POSIX API. These are  a set of wrapper functions that correspond to the POSIX regular expression API.
77  described in the \fBpcreposix\fR documentation.  These are described in the \fBpcreposix\fR documentation.
78    
79  The native API function prototypes are defined in the header file \fBpcre.h\fR,  The native API function prototypes are defined in the header file \fBpcre.h\fR,
80  and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be  and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be
81  accessed by adding \fB-lpcre\fR to the command for linking an application which  accessed by adding \fB-lpcre\fR to the command for linking an application which
82  calls it.  calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to
83    contain the major and minor release numbers for the library. Applications can
84    use these to include support for different releases.
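
As a hedged illustration of how an application might use these macros, the sketch below chooses between the two information functions at compile time. The PCRE_AT_LEAST macro and the info_function_name() helper are hypothetical names, not part of PCRE. The fallback values 3 and 0 are stand-ins so that the fragment compiles without pcre.h; in a real program the values come from the header. Treating 3.0 as the first release that provides pcre_fullinfo() is likewise an assumption.

```c
/* Normally supplied by pcre.h. The values below are stand-ins
   (an assumption) so that this sketch compiles on its own. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 3
#define PCRE_MINOR 0
#endif

/* Hypothetical helper macro, not part of PCRE: true when the library
   release is at least maj.min. */
#define PCRE_AT_LEAST(maj, min) \
  (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Hypothetical helper: names the information function this build
   would call. The 3.0 threshold is an assumption. */
const char *info_function_name(void)
{
#if PCRE_AT_LEAST(3, 0)
  return "pcre_fullinfo";
#else
  return "pcre_info";
#endif
}
```

With the stand-in values, info_function_name() returns "pcre_fullinfo". Doing the comparison in the preprocessor means the branch for the other release is never compiled.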
85    
86  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
87  are used for compiling and matching regular expressions, while  are used for compiling and matching regular expressions, while
# Line 83  captured substrings from a matched subje Line 91  captured substrings from a matched subje
91  \fBpcre_maketables()\fR is used (optionally) to build a set of character tables  \fBpcre_maketables()\fR is used (optionally) to build a set of character tables
92  in the current locale for passing to \fBpcre_compile()\fR.  in the current locale for passing to \fBpcre_compile()\fR.
93    
94  The function \fBpcre_info()\fR is used to find out information about a compiled  The function \fBpcre_fullinfo()\fR is used to find out information about a
95  pattern, while the function \fBpcre_version()\fR returns a pointer to a string  compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only
96  containing the version of PCRE and its date of release.  some of the available information, but is retained for backwards compatibility.
97    The function \fBpcre_version()\fR returns a pointer to a string containing the
98    version of PCRE and its date of release.
99    
100  The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain  The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain
101  the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions  the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions
# Line 182  sequence (?( which introduces a conditio Line 192  sequence (?( which introduces a conditio
192    
193    PCRE_EXTRA    PCRE_EXTRA
194    
195  This option turns on additional functionality of PCRE that is incompatible with  This option was invented in order to turn on additional functionality of PCRE
196  Perl. Any backslash in a pattern that is followed by a letter that has no  that is incompatible with Perl, but it is currently of very little use. When
197    set, any backslash in a pattern that is followed by a letter that has no
198  special meaning causes an error, thus reserving these combinations for future  special meaning causes an error, thus reserving these combinations for future
199  expansion. By default, as in Perl, a backslash followed by a letter with no  expansion. By default, as in Perl, a backslash followed by a letter with no
200  special meaning is treated as a literal. There are at present no other features  special meaning is treated as a literal. There are at present no other features
201  controlled by this option.  controlled by this option. It can also be set by a (?X) option setting within a
202    pattern.
203    
204    PCRE_MULTILINE    PCRE_MULTILINE
205    
# Line 261  memory containing the tables remains ava Line 273  memory containing the tables remains ava
273    
274    
275  .SH INFORMATION ABOUT A PATTERN  .SH INFORMATION ABOUT A PATTERN
276  The \fBpcre_info()\fR function returns information about a compiled pattern.  The \fBpcre_fullinfo()\fR function returns information about a compiled
277  Its yield is the number of capturing subpatterns, or one of the following  pattern. It replaces the obsolete \fBpcre_info()\fR function, which is
278  negative numbers:  nevertheless retained for backwards compatibility (and is documented below).
279    
280    The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled
281    pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if
282    the pattern was not studied. The third argument specifies which piece of
283    information is required, while the fourth argument is a pointer to a variable
284    to receive the data. The yield of the function is zero for success, or one of
285    the following negative numbers:
286    
287    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
288                            the argument \fIwhere\fR was NULL
289    PCRE_ERROR_BADMAGIC   the "magic number" was not found    PCRE_ERROR_BADMAGIC   the "magic number" was not found
290      PCRE_ERROR_BADOPTION  the value of \fIwhat\fR was invalid
291    
292  If the \fIoptptr\fR argument is not NULL, a copy of the options with which the  The possible values for the third argument are defined in \fBpcre.h\fR, and are
293  pattern was compiled is placed in the integer it points to. These option bits  as follows:
294    
295      PCRE_INFO_OPTIONS
296    
297    Return a copy of the options with which the pattern was compiled. The fourth
298    argument should point to an \fBunsigned long int\fR variable. These option bits
299  are those specified in the call to \fBpcre_compile()\fR, modified by any  are those specified in the call to \fBpcre_compile()\fR, modified by any
300  top-level option settings within the pattern itself, and with the PCRE_ANCHORED  top-level option settings within the pattern itself, and with the PCRE_ANCHORED
301  bit set if the form of the pattern implies that it can match only at the start  bit forcibly set if the form of the pattern implies that it can match only at
302  of a subject string.  the start of a subject string.
303    
304  If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,    PCRE_INFO_SIZE
305  it is used to pass back information about the first character of any matched  
306  string. If there is a fixed first character, e.g. from a pattern such as  Return the size of the compiled pattern, that is, the value that was passed as
307  (cat|cow|coyote), then it is returned in the integer pointed to by  the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to
308  \fIfirstcharptr\fR. Otherwise, if either  place the compiled data. The fourth argument should point to a \fBsize_t\fR
309    variable.
310    
311      PCRE_INFO_CAPTURECOUNT
312    
313    Return the number of capturing subpatterns in the pattern. The fourth argument
314    should point to an \fBint\fR variable.
315    
316      PCRE_INFO_BACKREFMAX
317    
318    Return the number of the highest back reference in the pattern. The fourth
319    argument should point to an \fBint\fR variable. Zero is returned if there are
320    no back references.
321    
322      PCRE_INFO_FIRSTCHAR
323    
324    Return information about the first character of any matched string, for a
325    non-anchored pattern. If there is a fixed first character, e.g. from a pattern
326    such as (cat|cow|coyote), then it is returned in the integer pointed to by
327    \fIwhere\fR. Otherwise, if either
328    
329  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
330  starts with "^", or  starts with "^", or
# Line 289  starts with "^", or Line 334  starts with "^", or
334    
335  then -1 is returned, indicating that the pattern matches only at the  then -1 is returned, indicating that the pattern matches only at the
336  start of a subject string or after any "\\n" within the string. Otherwise -2 is  start of a subject string or after any "\\n" within the string. Otherwise -2 is
337  returned.  returned. For anchored patterns, -2 is returned.
338    
339      PCRE_INFO_FIRSTTABLE
340    
341    If the pattern was studied, and this resulted in the construction of a 256-bit
342    table indicating a fixed set of characters for the first character in any
343    matching string, a pointer to the table is returned. Otherwise NULL is
344    returned. The fourth argument should point to an \fBunsigned char *\fR
345    variable.
346    
347      PCRE_INFO_LASTLITERAL
348    
349    For a non-anchored pattern, return the value of the rightmost literal character
350    which must exist in any matched string, other than at its start. The fourth
351    argument should point to an \fBint\fR variable. If there is no such character,
352    or if the pattern is anchored, -1 is returned. For example, for the pattern
353    /a\\d+z\\d+/ the returned value is 'z'.
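
To make the PCRE_INFO_FIRSTTABLE description above more concrete, here is a minimal sketch of how a 256-bit table can be held in 32 bytes and consulted for one character. The bit layout (bit c%8 of byte c/8) and the names table_set(), table_allows(), and demo_allows() are assumptions made for illustration; the text above does not specify the layout, and no PCRE function is called.

```c
#include <string.h>

/* 256 bits = 32 bytes, one bit per possible character value. */
#define FIRST_TABLE_BYTES 32

/* Hypothetical helper: mark character c as a possible first character. */
void table_set(unsigned char *table, unsigned char c)
{
  table[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Hypothetical helper: non-zero if c has its bit set in the table. */
int table_allows(const unsigned char *table, unsigned char c)
{
  return (table[c / 8] & (1u << (c % 8))) != 0;
}

/* Builds the table that a pattern such as (cat|dog|emu) would imply,
   where every match starts with 'c', 'd', or 'e', then looks up c. */
int demo_allows(unsigned char c)
{
  unsigned char table[FIRST_TABLE_BYTES];
  memset(table, 0, sizeof(table));
  table_set(table, 'c');
  table_set(table, 'd');
  table_set(table, 'e');
  return table_allows(table, c);
}
```

A matcher holding such a table could skip any subject position whose character has no bit set, without trying the pattern there.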
354    
355    The \fBpcre_info()\fR function is now obsolete because its interface is too
356    restrictive to return all the available data about a compiled pattern. New
357    programs should use \fBpcre_fullinfo()\fR instead. The yield of
358    \fBpcre_info()\fR is the number of capturing subpatterns, or one of the
359    following negative numbers:
360    
361      PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
362      PCRE_ERROR_BADMAGIC   the "magic number" was not found
363    
364    If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
365    pattern was compiled is placed in the integer it points to (see
366    PCRE_INFO_OPTIONS above).
367    
368    If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,
369    it is used to pass back information about the first character of any matched
370    string (see PCRE_INFO_FIRSTCHAR above).
371    
372    
373  .SH MATCHING A PATTERN  .SH MATCHING A PATTERN
# Line 564  are not part of its pattern matching eng Line 642  are not part of its pattern matching eng
642  6. The Perl \\G assertion is not supported as it is not relevant to single  6. The Perl \\G assertion is not supported as it is not relevant to single
643  pattern matches.  pattern matches.
644    
645  7. Fairly obviously, PCRE does not support the (?{code}) construction.  7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
646    constructions. However, there is some experimental support for recursive
647    patterns using the non-Perl item (?R).
648    
649  8. There are at the time of writing some oddities in Perl 5.005_02 concerned  8. There are at the time of writing some oddities in Perl 5.005_02 concerned
650  with the settings of captured strings when part of a pattern is repeated. For  with the settings of captured strings when part of a pattern is repeated. For
# Line 602  of the subject. Line 682  of the subject.
682  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
683  \fBpcre_exec()\fR have no Perl equivalents.  \fBpcre_exec()\fR have no Perl equivalents.
684    
685    (g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do
686    this using the (?p{code}) construct, which PCRE cannot of course support).
687    
688    
689  .SH REGULAR EXPRESSION DETAILS  .SH REGULAR EXPRESSION DETAILS
690  The syntax and semantics of the regular expressions supported by PCRE are  The syntax and semantics of the regular expressions supported by PCRE are
691  described below. Regular expressions are also described in the Perl  described below. Regular expressions are also described in the Perl
692  documentation and in a number of other books, some of which have copious  documentation and in a number of other books, some of which have copious
693  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
694  O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description  O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description
695  here is intended as reference documentation.  here is intended as reference documentation.
696    
697  A regular expression is a pattern that is matched against a subject string from  A regular expression is a pattern that is matched against a subject string from
# Line 906  terminating ] are non-special in charact Line 989  terminating ] are non-special in charact
989  are escaped.  are escaped.
990    
991    
992    .SH POSIX CHARACTER CLASSES
993    Perl 5.6 (not yet released at the time of writing) is going to support the
994    POSIX notation for character classes, which uses names enclosed by [: and :]
995    within the enclosing square brackets. PCRE supports this notation. For example,
996    
997      [01[:alpha:]%]
998    
999    matches "0", "1", any alphabetic character, or "%". The supported class names
1000    are
1001    
1002      alnum    letters and digits
1003      alpha    letters
1004      ascii    character codes 0 - 127
1005      cntrl    control characters
1006      digit    decimal digits (same as \\d)
1007      graph    printing characters, excluding space
1008      lower    lower case letters
1009      print    printing characters, including space
1010      punct    printing characters, excluding letters and digits
1011      space    white space (same as \\s)
1012      upper    upper case letters
1013      word     "word" characters (same as \\w)
1014      xdigit   hexadecimal digits
1015    
1016    The names "ascii" and "word" are Perl extensions. Another Perl extension is
1017    negation, which is indicated by a ^ character after the colon. For example,
1018    
1019      [12[:^digit:]]
1020    
1021    matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1022    syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1023    supported, and an error is given if they are encountered.
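
The two example classes above can be read as simple per-character tests. The sketch below spells them out in C using the standard <ctype.h> functions as a stand-in for PCRE's character tables; it assumes the default "C" locale, and the function names are hypothetical.

```c
#include <ctype.h>

/* Mirrors [01[:alpha:]%]: "0", "1", any alphabetic character, or "%". */
int in_class_01_alpha_percent(int c)
{
  return c == '0' || c == '1' || c == '%' ||
         isalpha((unsigned char)c) != 0;
}

/* Mirrors [12[:^digit:]]: "1", "2", or any character that is not a
   decimal digit. */
int in_class_12_nondigit(int c)
{
  return c == '1' || c == '2' || !isdigit((unsigned char)c);
}
```

Note that the negated form only negates the named class: "1" and "2" are still accepted by the second test even though they are digits.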
1024    
1025    
1026  .SH VERTICAL BAR  .SH VERTICAL BAR
1027  Vertical bar characters are used to separate alternative patterns. For example,  Vertical bar characters are used to separate alternative patterns. For example,
1028  the pattern  the pattern
# Line 1352  pattern such as Line 1469  pattern such as
1469    
1470    abcd$    abcd$
1471    
1472  when applied to a long string which does not match it. Because matching  when applied to a long string which does not match. Because matching proceeds
1473  proceeds from left to right, PCRE will look for each "a" in the subject and  from left to right, PCRE will look for each "a" in the subject and then see if
1474  then see if what follows matches the rest of the pattern. If the pattern is  what follows matches the rest of the pattern. If the pattern is specified as
 specified as  
1475    
1476    ^.*abcd$    ^.*abcd$
1477    
1478  then the initial .* matches the entire string at first, but when this fails, it  then the initial .* matches the entire string at first, but when this fails
1479  backtracks to match all but the last character, then all but the last two  (because there is no following "a"), it backtracks to match all but the last
1480  characters, and so on. Once again the search for "a" covers the entire string,  character, then all but the last two characters, and so on. Once again the
1481  from right to left, so we are no better off. However, if the pattern is written  search for "a" covers the entire string, from right to left, so we are no
1482  as  better off. However, if the pattern is written as
1483    
1484    ^(?>.*)(?<=abcd)    ^(?>.*)(?<=abcd)
1485    
# Line 1372  string. The subsequent lookbehind assert Line 1488  string. The subsequent lookbehind assert
1488  characters. If it fails, the match fails immediately. For long strings, this  characters. If it fails, the match fails immediately. For long strings, this
1489  approach makes a significant difference to the processing time.  approach makes a significant difference to the processing time.
1490    
1491    When a pattern contains an unlimited repeat inside a subpattern that can itself
1492    be repeated an unlimited number of times, the use of a once-only subpattern is
1493    the only way to avoid some failing matches taking a very long time indeed.
1494    The pattern
1495    
1496      (\\D+|<\\d+>)*[!?]
1497    
1498    matches an unlimited number of substrings that either consist of non-digits, or
1499    digits enclosed in <>, followed by either ! or ?. When it matches, it runs
1500    quickly. However, if it is applied to
1501    
1502      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1503    
1504    it takes a long time before reporting failure. This is because the string can
1505    be divided between the two repeats in a large number of ways, and all have to
1506    be tried. (The example used [!?] rather than a single character at the end,
1507    because both PCRE and Perl have an optimization that allows for fast failure
1508    when a single character is used. They remember the last single character that
1509    is required for a match, and fail early if it is not present in the string.)
1510    If the pattern is changed to
1511    
1512      ((?>\\D+)|<\\d+>)*[!?]
1513    
1514    sequences of non-digits cannot be broken, and failure happens quickly.
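
The "large number of ways" mentioned above can be put into figures with a short worked calculation. A run of n non-digits can be cut into consecutive non-empty pieces, each matched by one iteration of the inner repeat, in 2^(n-1) different ways. The function below (a hypothetical name, written only for this illustration) counts those divisions; it does not measure the work PCRE actually does, so treat the result as a rough indication of the growth rate.

```c
/* Number of ways to cut a run of n characters into consecutive
   non-empty pieces. ways(0) = 1 and ways(i) is the sum of ways(k)
   for all k < i, which works out to 2^(i-1) for i >= 1.
   The running total fits in 64 bits for n up to 63. */
unsigned long long count_splits(unsigned int n)
{
  unsigned long long total = 1;  /* sum of ways(0) .. ways(i-1) */
  unsigned long long ways = 1;   /* ways(0) */
  unsigned int i;

  for (i = 1; i <= n; i++) {
    ways = total;    /* choose where the last piece starts */
    total += ways;
  }
  return ways;
}
```

For a run of 20 characters this gives 524288 divisions, and each extra character doubles the figure. With the once-only form there is only one division to try, because a piece that has been matched cannot be shortened.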
1515    
1516    
1517  .SH CONDITIONAL SUBPATTERNS  .SH CONDITIONAL SUBPATTERNS
1518  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
# Line 1431  character class introduces a comment tha Line 1572  character class introduces a comment tha
1572  character in the pattern.  character in the pattern.
1573    
1574    
1575    .SH RECURSIVE PATTERNS
1576    Consider the problem of matching a string in parentheses, allowing for
1577    unlimited nested parentheses. Without the use of recursion, the best that can
1578    be done is to use a pattern that matches up to some fixed depth of nesting. It
1579    is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an
1580    experimental facility that allows regular expressions to recurse (amongst other
1581    things). It does this by interpolating Perl code in the expression at run time,
1582    and the code can refer to the expression itself. A Perl pattern to solve the
1583    parentheses problem can be created like this:
1584    
1585      $re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x;
1586    
1587    The (?p{...}) item interpolates Perl code at run time, and in this case refers
1588    recursively to the pattern in which it appears. Obviously, PCRE cannot support
1589    the interpolation of Perl code. Instead, the special item (?R) is provided for
1590    the specific case of recursion. This PCRE pattern solves the parentheses
1591    problem (assume the PCRE_EXTENDED option is set so that white space is
1592    ignored):
1593    
1594      \\( ( (?>[^()]+) | (?R) )* \\)
1595    
1596    First it matches an opening parenthesis. Then it matches any number of
1597    substrings which can either be a sequence of non-parentheses, or a recursive
1598    match of the pattern itself (i.e. a correctly parenthesized substring). Finally
1599    there is a closing parenthesis.
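
The three steps just described can also be followed in ordinary code. The function below is a hand-written sketch of the same logic in plain C; it is not how PCRE implements (?R), and the name match_group() is hypothetical. It tries the match only at the position it is given, whereas a pattern match would also try later starting positions in the subject.

```c
#include <stddef.h>

/* Returns a pointer just past the parenthesized group that starts at s,
   or NULL if there is no correctly nested group at that position. */
const char *match_group(const char *s)
{
  if (*s != '(') return NULL;          /* the opening parenthesis */
  s++;
  for (;;) {                           /* the repeated group */
    if (*s == '(') {
      s = match_group(s);              /* the recursive alternative */
      if (s == NULL) return NULL;
    } else if (*s != '\0' && *s != ')') {
      /* the run of non-parentheses, taken whole and never given back */
      while (*s != '\0' && *s != '(' && *s != ')') s++;
    } else {
      break;
    }
  }
  return (*s == ')') ? s + 1 : NULL;   /* the closing parenthesis */
}
```

Called on "(ab(cd)ef)" it returns a pointer to the end of the string; called on a string whose first parenthesis is never closed it returns NULL after a single left-to-right scan.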
1600    
1601    This particular example pattern contains nested unlimited repeats, and so the
1602    use of a once-only subpattern for matching strings of non-parentheses is
1603    important when applying the pattern to strings that do not match. For example,
1604    when it is applied to
1605    
1606      (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1607    
1608    it yields "no match" quickly. However, if a once-only subpattern is not used,
1609    the match runs for a very long time indeed because there are so many different
1610    ways the + and * repeats can carve up the subject, and all have to be tested
1611    before failure can be reported.
1612    
1613    The values set for any capturing subpatterns are those from the outermost level
1614    of the recursion at which the subpattern value is set. If the pattern above is
1615    matched against
1616    
1617      (ab(cd)ef)
1618    
1619    the value for the capturing parentheses is "ef", which is the last value taken
1620    on at the top level. If additional parentheses are added, giving
1621    
1622      \\( ( ( (?>[^()]+) | (?R) )* ) \\)
1623         ^                        ^
1624         ^                        ^
1625    then the string they capture is "ab(cd)ef", the contents of the top level
1626    parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
1627    has to obtain extra memory to store data during a recursion, which it does by
1628    using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no
1629    memory can be obtained, it saves data for the first 15 capturing parentheses
1630    only, as there is no way to give an out-of-memory error from within a
1631    recursion.
1632    
1633    
1634  .SH PERFORMANCE  .SH PERFORMANCE
1635  Certain items that may appear in patterns are more efficient than others. It is  Certain items that may appear in patterns are more efficient than others. It is
1636  more efficient to use a character class like [aeiou] than a set of alternatives  more efficient to use a character class like [aeiou] than a set of alternatives
# Line 1497  Cambridge CB2 3QG, England. Line 1697  Cambridge CB2 3QG, England.
1697  .br  .br
1698  Phone: +44 1223 334714  Phone: +44 1223 334714
1699    
1700  Last updated: 29 July 1999  Last updated: 27 January 2000
1701  .br  .br
1702  Copyright (c) 1997-1999 University of Cambridge.  Copyright (c) 1997-2000 University of Cambridge.

