Diff of /code/tags/pcre-6.7/doc/pcre.3

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 49 by nigel, Sat Feb 24 21:39:33 2007 UTC
# Line 44  pcre - Perl-compatible regular expressio Line 44  pcre - Perl-compatible regular expressio
44  .B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"  .B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"
45  .PP  .PP
46  .br  .br
47    .B void pcre_free_substring(const char *\fIstringptr\fR);
48    .PP
49    .br
50    .B void pcre_free_substring_list(const char **\fIstringptr\fR);
51    .PP
52    .br
53  .B const unsigned char *pcre_maketables(void);  .B const unsigned char *pcre_maketables(void);
54  .PP  .PP
55  .br  .br
56    .B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR,"
57    .ti +5n
58    .B int \fIwhat\fR, void *\fIwhere\fR);
59    .PP
60    .br
61  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int
62  .B *\fIfirstcharptr\fR);  .B *\fIfirstcharptr\fR);
63  .PP  .PP
# Line 64  pcre - Perl-compatible regular expressio Line 75  pcre - Perl-compatible regular expressio
75  .SH DESCRIPTION  .SH DESCRIPTION
76  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
77  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
78  differences (see below). The current implementation corresponds to Perl 5.005.  differences (see below). The current implementation corresponds to Perl 5.005,
79    with some additional features from later versions. This includes some
80    experimental, incomplete support for UTF-8 encoded strings. Details of exactly
81    what is and what is not supported are given below.
82    
83  PCRE has its own native API, which is described in this document. There is also  PCRE has its own native API, which is described in this document. There is also
84  a set of wrapper functions that correspond to the POSIX API. These are  a set of wrapper functions that correspond to the POSIX regular expression API.
85  described in the \fBpcreposix\fR documentation.  These are described in the \fBpcreposix\fR documentation.
86    
87  The native API function prototypes are defined in the header file \fBpcre.h\fR,  The native API function prototypes are defined in the header file \fBpcre.h\fR,
88  and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be  and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be
89  accessed by adding \fB-lpcre\fR to the command for linking an application which  accessed by adding \fB-lpcre\fR to the command for linking an application which
90  calls it.  calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to
91    contain the major and minor release numbers for the library. Applications can
92    use these to include support for different releases.
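
As an illustration of the version macros described above, the following C sketch shows one way an application might select code at compile time. The fallback values and the helper macro PCRE_AT_LEAST are assumptions introduced so that the fragment compiles on its own; in a real program the values come from pcre.h, and the 2.0 threshold is arbitrary.

```c
/* Stand-in values so this sketch compiles without pcre.h; a real
   application would #include <pcre.h> and drop this block. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 3
#define PCRE_MINOR 0
#endif

/* Hypothetical helper (not part of PCRE): true when the release
   described by the header is at least maj.min. */
#define PCRE_AT_LEAST(maj, min) \
    (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Returns 1 if the header describes release 2.0 or later, else 0.
   The threshold only illustrates the comparison. */
static int header_is_at_least_2_0(void)
{
#if PCRE_AT_LEAST(2, 0)
    return 1;
#else
    return 0;
#endif
}
```

Because the comparison is made by the preprocessor, code for releases that lack a given function is never compiled.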
93    
94  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
95  are used for compiling and matching regular expressions, while  are used for compiling and matching regular expressions.
96  \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and  
97    The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
98  \fBpcre_get_substring_list()\fR are convenience functions for extracting  \fBpcre_get_substring_list()\fR are convenience functions for extracting
99  captured substrings from a matched subject string. The function  captured substrings from a matched subject string; \fBpcre_free_substring()\fR
100  \fBpcre_maketables()\fR is used (optionally) to build a set of character tables  and \fBpcre_free_substring_list()\fR are also provided, to free the memory used
101  in the current locale for passing to \fBpcre_compile()\fR.  for extracted strings.
102    
103  The function \fBpcre_info()\fR is used to find out information about a compiled  The function \fBpcre_maketables()\fR is used (optionally) to build a set of
104  pattern, while the function \fBpcre_version()\fR returns a pointer to a string  character tables in the current locale for passing to \fBpcre_compile()\fR.
105  containing the version of PCRE and its date of release.  
106    The function \fBpcre_fullinfo()\fR is used to find out information about a
107    compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only
108    some of the available information, but is retained for backwards compatibility.
109    The function \fBpcre_version()\fR returns a pointer to a string containing the
110    version of PCRE and its date of release.
111    
112  The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain  The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain
113  the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions  the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions
# Line 182  sequence (?( which introduces a conditio Line 204  sequence (?( which introduces a conditio
204    
205    PCRE_EXTRA    PCRE_EXTRA
206    
207  This option turns on additional functionality of PCRE that is incompatible with  This option was invented in order to turn on additional functionality of PCRE
208  Perl. Any backslash in a pattern that is followed by a letter that has no  that is incompatible with Perl, but it is currently of very little use. When
209    set, any backslash in a pattern that is followed by a letter that has no
210  special meaning causes an error, thus reserving these combinations for future  special meaning causes an error, thus reserving these combinations for future
211  expansion. By default, as in Perl, a backslash followed by a letter with no  expansion. By default, as in Perl, a backslash followed by a letter with no
212  special meaning is treated as a literal. There are at present no other features  special meaning is treated as a literal. There are at present no other features
213  controlled by this option.  controlled by this option. It can also be set by a (?X) option setting within a
214    pattern.
215    
216    PCRE_MULTILINE    PCRE_MULTILINE
217    
# Line 211  This option inverts the "greediness" of Line 235  This option inverts the "greediness" of
235  greedy by default, but become greedy if followed by "?". It is not compatible  greedy by default, but become greedy if followed by "?". It is not compatible
236  with Perl. It can also be set by a (?U) option setting within the pattern.  with Perl. It can also be set by a (?U) option setting within the pattern.
237    
238      PCRE_UTF8
239    
240    This option causes PCRE to regard both the pattern and the subject as strings
241    of UTF-8 characters instead of just byte strings. However, it is available only
242    if PCRE has been built to include UTF-8 support. If not, the use of this option
243    provokes an error. Support for UTF-8 is new, experimental, and incomplete.
244    Details of exactly what it entails are given below.
245    
246    
247  .SH STUDYING A PATTERN  .SH STUDYING A PATTERN
248  When a pattern is going to be used several times, it is worth spending more  When a pattern is going to be used several times, it is worth spending more
# Line 261  memory containing the tables remains ava Line 293  memory containing the tables remains ava
293    
294    
295  .SH INFORMATION ABOUT A PATTERN  .SH INFORMATION ABOUT A PATTERN
296  The \fBpcre_info()\fR function returns information about a compiled pattern.  The \fBpcre_fullinfo()\fR function returns information about a compiled
297  Its yield is the number of capturing subpatterns, or one of the following  pattern. It replaces the obsolete \fBpcre_info()\fR function, which is
298  negative numbers:  nevertheless retained for backwards compatibility (and is documented below).
299    
300    The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled
301    pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if
302    the pattern was not studied. The third argument specifies which piece of
303    information is required, while the fourth argument is a pointer to a variable
304    to receive the data. The yield of the function is zero for success, or one of
305    the following negative numbers:
306    
307    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
308                            the argument \fIwhere\fR was NULL
309    PCRE_ERROR_BADMAGIC   the "magic number" was not found    PCRE_ERROR_BADMAGIC   the "magic number" was not found
310      PCRE_ERROR_BADOPTION  the value of \fIwhat\fR was invalid
311    
312  If the \fIoptptr\fR argument is not NULL, a copy of the options with which the  The possible values for the third argument are defined in \fBpcre.h\fR, and are
313  pattern was compiled is placed in the integer it points to. These option bits  as follows:
314    
315      PCRE_INFO_OPTIONS
316    
317    Return a copy of the options with which the pattern was compiled. The fourth
318    argument should point to an \fBunsigned long int\fR variable. These option bits
319  are those specified in the call to \fBpcre_compile()\fR, modified by any  are those specified in the call to \fBpcre_compile()\fR, modified by any
320  top-level option settings within the pattern itself, and with the PCRE_ANCHORED  top-level option settings within the pattern itself, and with the PCRE_ANCHORED
321  bit set if the form of the pattern implies that it can match only at the start  bit forcibly set if the form of the pattern implies that it can match only at
322  of a subject string.  the start of a subject string.
323    
324  If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,    PCRE_INFO_SIZE
325  it is used to pass back information about the first character of any matched  
326  string. If there is a fixed first character, e.g. from a pattern such as  Return the size of the compiled pattern, that is, the value that was passed as
327  (cat|cow|coyote), then it is returned in the integer pointed to by  the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to
328  \fIfirstcharptr\fR. Otherwise, if either  place the compiled data. The fourth argument should point to a \fBsize_t\fR
329    variable.
330    
331      PCRE_INFO_CAPTURECOUNT
332    
333    Return the number of capturing subpatterns in the pattern. The fourth argument
334    should point to an \fBint\fR variable.
335    
336      PCRE_INFO_BACKREFMAX
337    
338    Return the number of the highest back reference in the pattern. The fourth
339    argument should point to an \fBint\fR variable. Zero is returned if there are
340    no back references.
341    
342      PCRE_INFO_FIRSTCHAR
343    
344    Return information about the first character of any matched string, for a
345    non-anchored pattern. If there is a fixed first character, e.g. from a pattern
346    such as (cat|cow|coyote), it is returned in the integer pointed to by
347    \fIwhere\fR. Otherwise, if either
348    
349  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
350  starts with "^", or  starts with "^", or
# Line 287  starts with "^", or Line 352  starts with "^", or
352  (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set  (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
353  (if it were set, the pattern would be anchored),  (if it were set, the pattern would be anchored),
354    
355  then -1 is returned, indicating that the pattern matches only at the  -1 is returned, indicating that the pattern matches only at the start of a
356  start of a subject string or after any "\\n" within the string. Otherwise -2 is  subject string or after any "\\n" within the string. Otherwise -2 is returned.
357  returned.  For anchored patterns, -2 is returned.
358    
359      PCRE_INFO_FIRSTTABLE
360    
361    If the pattern was studied, and this resulted in the construction of a 256-bit
362    table indicating a fixed set of characters for the first character in any
363    matching string, a pointer to the table is returned. Otherwise NULL is
364    returned. The fourth argument should point to an \fBunsigned char *\fR
365    variable.
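
A 256-bit table occupies 32 bytes, one bit per possible byte value. The C sketch below shows how such a table might be consulted. The bit layout (bit c & 7 of byte c / 8) and both helper names are assumptions made for illustration; the text above does not specify the layout.

```c
#include <stddef.h>

#define FIRST_TABLE_BYTES 32   /* 256 bits */

/* Returns nonzero if character c may start a match according to the
   table. A NULL table carries no information, so every character
   has to be treated as a possible first character.
   Assumed layout: bit (c & 7) of byte (c / 8). */
static int first_table_allows(const unsigned char *table, unsigned char c)
{
    if (table == NULL)
        return 1;
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Hypothetical helper used only to build a demonstration table. */
static void first_table_add(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```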
366    
367      PCRE_INFO_LASTLITERAL
368    
369    For a non-anchored pattern, return the value of the rightmost literal character
370    which must exist in any matched string, other than at its start. The fourth
371    argument should point to an \fBint\fR variable. If there is no such character,
372    or if the pattern is anchored, -1 is returned. For example, for the pattern
373    /a\\d+z\\d+/ the returned value is 'z'.
374    
375    The \fBpcre_info()\fR function is now obsolete because its interface is too
376    restrictive to return all the available data about a compiled pattern. New
377    programs should use \fBpcre_fullinfo()\fR instead. The yield of
378    \fBpcre_info()\fR is the number of capturing subpatterns, or one of the
379    following negative numbers:
380    
381      PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
382      PCRE_ERROR_BADMAGIC   the "magic number" was not found
383    
384    If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
385    pattern was compiled is placed in the integer it points to (see
386    PCRE_INFO_OPTIONS above).
387    
388    If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,
389    it is used to pass back information about the first character of any matched
390    string (see PCRE_INFO_FIRSTCHAR above).
391    
392    
393  .SH MATCHING A PATTERN  .SH MATCHING A PATTERN
# Line 472  is a pointer to the vector of integer of Line 570  is a pointer to the vector of integer of
570  were captured by the match, including the substring that matched the entire  were captured by the match, including the substring that matched the entire
571  regular expression. This is the value returned by \fBpcre_exec\fR if it  regular expression. This is the value returned by \fBpcre_exec\fR if it
572  is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it  is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it
573  ran out of space in \fIovector\fR, then the value passed as  ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should
574  \fIstringcount\fR should be the size of the vector divided by three.  be the size of the vector divided by three.
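
The rule for choosing stringcount can be written as a small C helper. This is a sketch only: the function name is hypothetical, and returning 0 for a negative result (an error or a failed match, where there is nothing to extract) is a choice made for this example.

```c
/* rc is the value returned by pcre_exec(); ovecsize is the number of
   ints in ovector. Returns the value to pass as stringcount, or 0
   when rc is negative and there are no substrings to extract. */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* number of substrings captured */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small: use its capacity */
    return 0;                 /* error or no match */
}
```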
575    
576  The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR  The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR
577  extract a single substring, whose number is given as \fIstringnumber\fR. A  extract a single substring, whose number is given as \fIstringnumber\fR. A
578  value of zero extracts the substring that matched the entire pattern, while  value of zero extracts the substring that matched the entire pattern, while
579  higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,  higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,
580  the string is placed in \fIbuffer\fR, whose length is given by  the string is placed in \fIbuffer\fR, whose length is given by
581  \fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is  \fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is
582  obtained via \fBpcre_malloc\fR, and its address is returned via  obtained via \fBpcre_malloc\fR, and its address is returned via
583  \fIstringptr\fR. The yield of the function is the length of the string, not  \fIstringptr\fR. The yield of the function is the length of the string, not
584  including the terminating zero, or one of  including the terminating zero, or one of
# Line 512  string. This can be distinguished from a Line 610  string. This can be distinguished from a
610  inspecting the appropriate offset in \fIovector\fR, which is negative for unset  inspecting the appropriate offset in \fIovector\fR, which is negative for unset
611  substrings.  substrings.
612    
613    The two convenience functions \fBpcre_free_substring()\fR and
614    \fBpcre_free_substring_list()\fR can be used to free the memory returned by
615    a previous call of \fBpcre_get_substring()\fR or
616    \fBpcre_get_substring_list()\fR, respectively. They do nothing more than call
617    the function pointed to by \fBpcre_free\fR, which of course could be called
618    directly from a C program. However, PCRE is used in some situations where it is
619    linked via a special interface to another programming language which cannot use
620    \fBpcre_free\fR directly; it is for these cases that the functions are
621    provided.
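
The relationship described above can be sketched in C without linking against PCRE. Every name beginning with demo_ is a stand-in invented for this example: demo_pcre_free plays the part of the pcre_free global, and demo_free_substring mirrors what the text says pcre_free_substring() does.

```c
#include <stdlib.h>

/* Stand-in for PCRE's global pcre_free pointer, which initially
   holds the entry point of free(). */
static void (*demo_pcre_free)(void *) = free;

/* Sketch of the wrapper: it passes the pointer to whatever function
   the global currently points to, and does nothing else. */
static void demo_free_substring(const char *stringptr)
{
    demo_pcre_free((void *)stringptr);
}

/* Hypothetical replacement free function that counts its calls,
   showing that the wrapper honours a reassigned pointer. */
static int demo_free_calls = 0;

static void demo_counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}
```

Assigning demo_counting_free to demo_pcre_free before calling demo_free_substring routes the release through the counting function, which is the behaviour a foreign-language binding relies on when it cannot reach the pointer itself.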
622    
623    
624  .SH LIMITATIONS  .SH LIMITATIONS
# Line 564  are not part of its pattern matching eng Line 671  are not part of its pattern matching eng
671  6. The Perl \\G assertion is not supported as it is not relevant to single  6. The Perl \\G assertion is not supported as it is not relevant to single
672  pattern matches.  pattern matches.
673    
674  7. Fairly obviously, PCRE does not support the (?{code}) construction.  7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
675    constructions. However, there is some experimental support for recursive
676    patterns using the non-Perl item (?R).
677    
678  8. There are at the time of writing some oddities in Perl 5.005_02 concerned  8. There are at the time of writing some oddities in Perl 5.005_02 concerned
679  with the settings of captured strings when part of a pattern is repeated. For  with the settings of captured strings when part of a pattern is repeated. For
680  example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value  example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
681  "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if  "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if
682  the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set.  the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set.
683    
684  In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the  In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the
685  future Perl changes to a consistent state that is different, PCRE may change to  future Perl changes to a consistent state that is different, PCRE may change to
# Line 602  of the subject. Line 711  of the subject.
711  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
712  \fBpcre_exec()\fR have no Perl equivalents.  \fBpcre_exec()\fR have no Perl equivalents.
713    
714    (g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do
715    this using the (?p{code}) construct, which PCRE cannot of course support).
716    
717    
718  .SH REGULAR EXPRESSION DETAILS  .SH REGULAR EXPRESSION DETAILS
719  The syntax and semantics of the regular expressions supported by PCRE are  The syntax and semantics of the regular expressions supported by PCRE are
720  described below. Regular expressions are also described in the Perl  described below. Regular expressions are also described in the Perl
721  documentation and in a number of other books, some of which have copious  documentation and in a number of other books, some of which have copious
722  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
723  O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description  O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
724  here is intended as reference documentation.  
725    The description here is intended as reference documentation. The basic
726    operation of PCRE is on strings of bytes. However, there are the beginnings of
727    some support for UTF-8 character strings. To use this support you must
728    configure PCRE to include it, and then call \fBpcre_compile()\fR with the
729    PCRE_UTF8 option. How this affects the pattern matching is described in the
730    final section of this document.
731    
732  A regular expression is a pattern that is matched against a subject string from  A regular expression is a pattern that is matched against a subject string from
733  left to right. Most characters stand for themselves in a pattern, and match the  left to right. Most characters stand for themselves in a pattern, and match the
# Line 837  end of the subject in both modes, and if Line 955  end of the subject in both modes, and if
955  .SH FULL STOP (PERIOD, DOT)  .SH FULL STOP (PERIOD, DOT)
956  Outside a character class, a dot in the pattern matches any one character in  Outside a character class, a dot in the pattern matches any one character in
957  the subject, including a non-printing character, but not (by default) newline.  the subject, including a non-printing character, but not (by default) newline.
958  If the PCRE_DOTALL option is set, then dots match newlines as well. The  If the PCRE_DOTALL option is set, dots match newlines as well. The handling of
959  handling of dot is entirely independent of the handling of circumflex and  dot is entirely independent of the handling of circumflex and dollar, the only
960  dollar, the only relationship being that they both involve newline characters.  relationship being that they both involve newline characters. Dot has no
961  Dot has no special meaning in a character class.  special meaning in a character class.
962    
963    
964  .SH SQUARE BRACKETS  .SH SQUARE BRACKETS
# Line 906  terminating ] are non-special in charact Line 1024  terminating ] are non-special in charact
1024  are escaped.  are escaped.
1025    
1026    
1027    .SH POSIX CHARACTER CLASSES
1028    Perl 5.6 (not yet released at the time of writing) is going to support the
1029    POSIX notation for character classes, which uses names enclosed by [: and :]
1030    within the enclosing square brackets. PCRE supports this notation. For example,
1031    
1032      [01[:alpha:]%]
1033    
1034    matches "0", "1", any alphabetic character, or "%". The supported class names
1035    are
1036    
1037      alnum    letters and digits
1038      alpha    letters
1039      ascii    character codes 0 - 127
1040      cntrl    control characters
1041      digit    decimal digits (same as \\d)
1042      graph    printing characters, excluding space
1043      lower    lower case letters
1044      print    printing characters, including space
1045      punct    printing characters, excluding letters and digits
1046      space    white space (same as \\s)
1047      upper    upper case letters
1048      word     "word" characters (same as \\w)
1049      xdigit   hexadecimal digits
1050    
1051    The names "ascii" and "word" are Perl extensions. Another Perl extension is
1052    negation, which is indicated by a ^ character after the colon. For example,
1053    
1054      [12[:^digit:]]
1055    
1056    matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1057    syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1058    supported, and an error is given if they are encountered.
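
For readers who know the C library better than the POSIX notation, the two example classes above can be expressed as single-byte tests. This is a rough sketch, assuming the default "C" locale and that each class name behaves like the <ctype.h> function of the same name; the function names are invented for this example.

```c
#include <ctype.h>

/* Single-byte equivalent of [01[:alpha:]%]:
   "0", "1", "%", or any alphabetic character. */
static int in_class_01_alpha_percent(unsigned char c)
{
    return c == '0' || c == '1' || c == '%' || isalpha(c) != 0;
}

/* Single-byte equivalent of [12[:^digit:]]:
   "1", "2", or any character that is not a decimal digit. */
static int in_class_12_nondigit(unsigned char c)
{
    return c == '1' || c == '2' || !isdigit(c);
}
```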
1059    
1060    
1061  .SH VERTICAL BAR  .SH VERTICAL BAR
1062  Vertical bar characters are used to separate alternative patterns. For example,  Vertical bar characters are used to separate alternative patterns. For example,
1063  the pattern  the pattern
# Line 1096  to the string Line 1248  to the string
1248  fails, because it matches the entire string due to the greediness of the .*  fails, because it matches the entire string due to the greediness of the .*
1249  item.  item.
1250    
1251  However, if a quantifier is followed by a question mark, then it ceases to be  However, if a quantifier is followed by a question mark, it ceases to be
1252  greedy, and instead matches the minimum number of times possible, so the  greedy, and instead matches the minimum number of times possible, so the
1253  pattern  pattern
1254    
# Line 1112  own right. Because it has two uses, it c Line 1264  own right. Because it has two uses, it c
1264  which matches one digit by preference, but can match two if that is the only  which matches one digit by preference, but can match two if that is the only
1265  way the rest of the pattern matches.  way the rest of the pattern matches.
1266    
1267  If the PCRE_UNGREEDY option is set (an option which is not available in Perl)  If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
1268  then the quantifiers are not greedy by default, but individual ones can be made  the quantifiers are not greedy by default, but individual ones can be made
1269  greedy by following them with a question mark. In other words, it inverts the  greedy by following them with a question mark. In other words, it inverts the
1270  default behaviour.  default behaviour.
1271    
# Line 1122  is greater than 1 or with a limited maxi Line 1274  is greater than 1 or with a limited maxi
1274  compiled pattern, in proportion to the size of the minimum or maximum.  compiled pattern, in proportion to the size of the minimum or maximum.
1275    
1276  If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent  If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
1277  to Perl's /s) is set, thus allowing the . to match newlines, then the pattern  to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
1278  is implicitly anchored, because whatever follows will be tried against every  implicitly anchored, because whatever follows will be tried against every
1279  character position in the subject string, so there is no point in retrying the  character position in the subject string, so there is no point in retrying the
1280  overall match at any position after the first. PCRE treats such a pattern as  overall match at any position after the first. PCRE treats such a pattern as
1281  though it were preceded by \\A. In cases where it is known that the subject  though it were preceded by \\A. In cases where it is known that the subject
# Line 1167  itself. So the pattern Line 1319  itself. So the pattern
1319    
1320  matches "sense and sensibility" and "response and responsibility", but not  matches "sense and sensibility" and "response and responsibility", but not
1321  "sense and responsibility". If caseful matching is in force at the time of the  "sense and responsibility". If caseful matching is in force at the time of the
1322  back reference, then the case of letters is relevant. For example,  back reference, the case of letters is relevant. For example,
1323    
1324    ((?i)rah)\\s+\\1    ((?i)rah)\\s+\\1
1325    
# Line 1175  matches "rah rah" and "RAH RAH", but not Line 1327  matches "rah rah" and "RAH RAH", but not
1327  capturing subpattern is matched caselessly.  capturing subpattern is matched caselessly.
1328    
1329  There may be more than one back reference to the same subpattern. If a  There may be more than one back reference to the same subpattern. If a
1330  subpattern has not actually been used in a particular match, then any back  subpattern has not actually been used in a particular match, any back
1331  references to it always fail. For example, the pattern  references to it always fail. For example, the pattern
1332    
1333    (a|(bc))\\2    (a|(bc))\\2
# Line 1183  references to it always fail. For exampl Line 1335  references to it always fail. For exampl
1335  always fails if it starts to match "a" rather than "bc". Because there may be  always fails if it starts to match "a" rather than "bc". Because there may be
1336  up to 99 back references, all digits following the backslash are taken  up to 99 back references, all digits following the backslash are taken
1337  as part of a potential back reference number. If the pattern continues with a  as part of a potential back reference number. If the pattern continues with a
1338  digit character, then some delimiter must be used to terminate the back  digit character, some delimiter must be used to terminate the back reference.
1339  reference. If the PCRE_EXTENDED option is set, this can be whitespace.  If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty
1340  Otherwise an empty comment can be used.  comment can be used.
1341    
1342  A back reference that occurs inside the parentheses to which it refers fails  A back reference that occurs inside the parentheses to which it refers fails
1343  when the subpattern is first used, so, for example, (a\\1) never matches.  when the subpattern is first used, so, for example, (a\\1) never matches.
# Line 1194  example, the pattern Line 1346  example, the pattern
1346    
1347    (a|b\\1)+    (a|b\\1)+
1348    
1349  matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of  matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1350  the subpattern, the back reference matches the character string corresponding  the subpattern, the back reference matches the character string corresponding
1351  to the previous iteration. In order for this to work, the pattern must be such  to the previous iteration. In order for this to work, the pattern must be such
1352  that the first iteration does not need to match the back reference. This can be  that the first iteration does not need to match the back reference. This can be
# Line 1273  Several assertions (of any sort) may occ Line 1425  Several assertions (of any sort) may occ
1425  matches "foo" preceded by three digits that are not "999". Notice that each of  matches "foo" preceded by three digits that are not "999". Notice that each of
1426  the assertions is applied independently at the same point in the subject  the assertions is applied independently at the same point in the subject
1427  string. First there is a check that the previous three characters are all  string. First there is a check that the previous three characters are all
1428  digits, then there is a check that the same three characters are not "999".  digits, and then there is a check that the same three characters are not "999".
1429  This pattern does \fInot\fR match "foo" preceded by six characters, the first  This pattern does \fInot\fR match "foo" preceded by six characters, the first
1430  of which are digits and the last three of which are not "999". For example, it  of which are digits and the last three of which are not "999". For example, it
1431  doesn't match "123abcfoo". A pattern to do that is  doesn't match "123abcfoo". A pattern to do that is
# Line 1352  pattern such as Line 1504  pattern such as
1504    
1505    abcd$    abcd$
1506    
1507  when applied to a long string which does not match it. Because matching  when applied to a long string which does not match. Because matching proceeds
1508  proceeds from left to right, PCRE will look for each "a" in the subject and  from left to right, PCRE will look for each "a" in the subject and then see if
1509  then see if what follows matches the rest of the pattern. If the pattern is  what follows matches the rest of the pattern. If the pattern is specified as
 specified as  
1510    
1511    ^.*abcd$    ^.*abcd$
1512    
1513  then the initial .* matches the entire string at first, but when this fails, it  the initial .* matches the entire string at first, but when this fails (because
1514  backtracks to match all but the last character, then all but the last two  there is no following "a"), it backtracks to match all but the last character,
1515  characters, and so on. Once again the search for "a" covers the entire string,  then all but the last two characters, and so on. Once again the search for "a"
1516  from right to left, so we are no better off. However, if the pattern is written  covers the entire string, from right to left, so we are no better off. However,
1517  as  if the pattern is written as
1518    
1519    ^(?>.*)(?<=abcd)    ^(?>.*)(?<=abcd)
1520    
1521  then there can be no backtracking for the .* item; it can match only the entire  there can be no backtracking for the .* item; it can match only the entire
1522  string. The subsequent lookbehind assertion does a single test on the last four  string. The subsequent lookbehind assertion does a single test on the last four
1523  characters. If it fails, the match fails immediately. For long strings, this  characters. If it fails, the match fails immediately. For long strings, this
1524  approach makes a significant difference to the processing time.  approach makes a significant difference to the processing time.
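
As a rough illustration of why the last form is cheap: once the once-only .* has consumed the whole subject, the lookbehind amounts to one comparison against the final four characters. The C sketch below is not PCRE code. The name ends_with is a hypothetical helper, and the sketch assumes the subject contains no newlines, so that .* would reach the end of the string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if the subject (of length len) ends with suffix, else 0.
   This mirrors the single test made by (?<=abcd) after (?>.*) has
   consumed the entire subject. */
static int ends_with(const char *subject, size_t len, const char *suffix)
{
    size_t slen;

    assert(subject != NULL && suffix != NULL);
    slen = strlen(suffix);
    if (slen > len)
        return 0;                 /* subject too short to end with suffix */
    return memcmp(subject + len - slen, suffix, slen) == 0;
}
```

A call such as ends_with(subject, strlen(subject), "abcd") does a fixed amount of work after the length is known, whatever the length of the subject.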
1525    
1526    When a pattern contains an unlimited repeat inside a subpattern that can itself
1527    be repeated an unlimited number of times, the use of a once-only subpattern is
1528    the only way to avoid some failing matches taking a very long time indeed.
1529    The pattern
1530    
1531      (\\D+|<\\d+>)*[!?]
1532    
1533    matches an unlimited number of substrings that either consist of non-digits, or
1534    digits enclosed in <>, followed by either ! or ?. When it matches, it runs
1535    quickly. However, if it is applied to
1536    
1537      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1538    
1539    it takes a long time before reporting failure. This is because the string can
1540    be divided between the two repeats in a large number of ways, and all have to
1541    be tried. (The example used [!?] rather than a single character at the end,
1542    because both PCRE and Perl have an optimization that allows for fast failure
1543    when a single character is used. They remember the last single character that
1544    is required for a match, and fail early if it is not present in the string.)
1545    If the pattern is changed to
1546    
1547      ((?>\\D+)|<\\d+>)*[!?]
1548    
1549    sequences of non-digits cannot be broken, and failure happens quickly.
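
The growth in work can be made concrete with a simplified model of a backtracking matcher. The C sketch below is not PCRE's matching code. It assumes a subject made up only of non-digit characters that contains no "!" or "?", so every attempt fails and only the \D+ alternative is ever used, and it considers a single starting position. The name tail_attempts is hypothetical. It counts how often the final [!?] test is tried, with and without the once-only group.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified model, not PCRE code. Counts how many times the trailing
   [!?] test is attempted when a backtracking matcher tries (\D+)*[!?]
   at offset pos. If atomic is non-zero, the model is ((?>\D+))*[!?].
   Assumes the subject has no '!' or '?', so each attempt fails. */
static unsigned long tail_attempts(const char *s, size_t pos, int atomic)
{
    size_t run = 0;
    size_t k;
    unsigned long attempts = 0;

    assert(s != NULL);

    /* Length of the longest run of non-digits starting at pos. */
    while (s[pos + run] != '\0' && !isdigit((unsigned char)s[pos + run]))
        run++;

    /* Greedy: try the longest \D+ first. A normal group then backs off
       one character at a time; a once-only group never backs off. */
    for (k = run; k >= 1; k--) {
        attempts += tail_attempts(s, pos + k, atomic);
        if (atomic)
            break;
    }

    /* Last resort: stop repeating the group and test [!?] here. */
    return attempts + 1;
}
```

Under this model a subject of n such characters costs 2^n attempts without the once-only group (1048576 for n = 20) and 2 attempts with it.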
1550    
1551    
1552  .SH CONDITIONAL SUBPATTERNS  .SH CONDITIONAL SUBPATTERNS
1553  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
# Line 1387  no-pattern (if present) is used. If ther Line 1563  no-pattern (if present) is used. If ther
1563  subpattern, a compile-time error occurs.  subpattern, a compile-time error occurs.
1564    
1565  There are two kinds of condition. If the text between the parentheses consists  There are two kinds of condition. If the text between the parentheses consists
1566  of a sequence of digits, then the condition is satisfied if the capturing  of a sequence of digits, the condition is satisfied if the capturing subpattern
1567  subpattern of that number has previously matched. Consider the following  of that number has previously matched. Consider the following pattern, which
1568  pattern, which contains non-significant white space to make it more readable  contains non-significant white space to make it more readable (assume the
1569  (assume the PCRE_EXTENDED option) and to divide it into three parts for ease  PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
 of discussion:  
1570    
1571    ( \\( )?    [^()]+    (?(1) \\) )    ( \\( )?    [^()]+    (?(1) \\) )
1572    
# Line 1431  character class introduces a comment tha Line 1606  character class introduces a comment tha
1606  character in the pattern.  character in the pattern.
1607    
1608    
1609    .SH RECURSIVE PATTERNS
1610    Consider the problem of matching a string in parentheses, allowing for
1611    unlimited nested parentheses. Without the use of recursion, the best that can
1612    be done is to use a pattern that matches up to some fixed depth of nesting. It
1613    is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an
1614    experimental facility that allows regular expressions to recurse (amongst other
1615    things). It does this by interpolating Perl code in the expression at run time,
1616    and the code can refer to the expression itself. A Perl pattern to solve the
1617    parentheses problem can be created like this:
1618    
1619      $re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x;
1620    
1621    The (?p{...}) item interpolates Perl code at run time, and in this case refers
1622    recursively to the pattern in which it appears. Obviously, PCRE cannot support
1623    the interpolation of Perl code. Instead, the special item (?R) is provided for
1624    the specific case of recursion. This PCRE pattern solves the parentheses
1625    problem (assume the PCRE_EXTENDED option is set so that white space is
1626    ignored):
1627    
1628      \\( ( (?>[^()]+) | (?R) )* \\)
1629    
1630    First it matches an opening parenthesis. Then it matches any number of
1631    substrings which can either be a sequence of non-parentheses, or a recursive
1632    match of the pattern itself (i.e. a correctly parenthesized substring). Finally
1633    there is a closing parenthesis.
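
The three steps just described can be sketched as a small recursive C function. This is only an illustration of what the pattern accepts when it is tried at one given offset. It is not how PCRE implements (?R), and the name match_parens is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   If s[pos] begins a correctly parenthesized substring, returns the
   offset just past its closing ')'. Otherwise returns 0 (a successful
   match always ends at offset 2 or more, so 0 is unambiguous). */
static size_t match_parens(const char *s, size_t pos)
{
    assert(s != NULL);

    if (s[pos] != '(')
        return 0;                          /* opening parenthesis */
    pos++;

    for (;;) {
        if (s[pos] == '(') {
            /* Recursive match of the whole pattern, like (?R). */
            size_t end = match_parens(s, pos);
            if (end == 0)
                return 0;
            pos = end;
        } else if (s[pos] != '\0' && s[pos] != ')') {
            /* A run of non-parentheses, like (?>[^()]+). */
            while (s[pos] != '\0' && s[pos] != '(' && s[pos] != ')')
                pos++;
        } else {
            break;                         /* no further repetition */
        }
    }

    return s[pos] == ')' ? pos + 1 : 0;    /* closing parenthesis */
}
```

For example, match_parens("(ab(cd)ef)", 0) returns 10, the length of the whole string.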
1634    
1635    This particular example pattern contains nested unlimited repeats, and so the
1636    use of a once-only subpattern for matching strings of non-parentheses is
1637    important when applying the pattern to strings that do not match. For example,
1638    when it is applied to
1639    
1640      (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1641    
1642    it yields "no match" quickly. However, if a once-only subpattern is not used,
1643    the match runs for a very long time indeed because there are so many different
1644    ways the + and * repeats can carve up the subject, and all have to be tested
1645    before failure can be reported.
1646    
1647    The values set for any capturing subpatterns are those from the outermost level
1648    of the recursion at which the subpattern value is set. If the pattern above is
1649    matched against
1650    
1651      (ab(cd)ef)
1652    
1653    the value for the capturing parentheses is "ef", which is the last value taken
1654    on at the top level. If additional parentheses are added, giving
1655    
1656      \\( ( ( (?>[^()]+) | (?R) )* ) \\)
1657         ^                        ^
1658         ^                        ^
1659    the string they capture is "ab(cd)ef", the contents of the top level
1660    parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
1661    has to obtain extra memory to store data during a recursion, which it does by
1662    using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no
1663    memory can be obtained, it saves data for the first 15 capturing parentheses
1664    only, as there is no way to give an out-of-memory error from within a
1665    recursion.
1666    
1667    
1668  .SH PERFORMANCE  .SH PERFORMANCE
1669  Certain items that may appear in patterns are more efficient than others. It is  Certain items that may appear in patterns are more efficient than others. It is
1670  more efficient to use a character class like [aeiou] than a set of alternatives  more efficient to use a character class like [aeiou] than a set of alternatives
# Line 1486  with the pattern above. The former gives Line 1720  with the pattern above. The former gives
1720  applied to a whole line of "a" characters, whereas the latter takes an  applied to a whole line of "a" characters, whereas the latter takes an
1721  appreciable time with strings longer than about 20 characters.  appreciable time with strings longer than about 20 characters.
1722    
1723    
1724    .SH UTF-8 SUPPORT
1725    Starting at release 3.3, PCRE has some support for character strings encoded
1726    in the UTF-8 format. This is incomplete, and is regarded as experimental. In
1727    order to use it, you must configure PCRE to include UTF-8 support in the code,
1728    and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option
1729    flag. When you do this, both the pattern and any subject strings that are
1730    matched against it are treated as UTF-8 strings instead of just strings of
1731    bytes, but only in the cases that are mentioned below.
1732    
1733    If you compile PCRE with UTF-8 support, but do not use it at run time, the
1734    library will be a bit bigger, but the additional run time overhead is limited
1735    to testing the PCRE_UTF8 flag in several places, so should not be very large.
1736    
1737    PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
1738    not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
1739    the results are undefined.
1740    
1741    Running with PCRE_UTF8 set causes these changes in the way PCRE works:
1742    
1743    1. In a pattern, the escape sequence \\x{...}, where the contents of the braces
1744    are a string of hexadecimal digits, is interpreted as a UTF-8 character whose
1745    code number is the given hexadecimal number, for example: \\x{1234}. This
1746    inserts from one to six literal bytes into the pattern, using the UTF-8
1747    encoding. If a non-hexadecimal digit appears between the braces, the item is
1748    not recognized.
1749    
1750    2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8
1751    character if its value is greater than 127.
1752    
1753    3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
1754    character. For example, \\x{100}* and \\xc3+ do not work. At present, if you
1755    want to repeat such characters, you must enclose them in non-capturing
1756    parentheses, for example (?:\\x{100}).
1757    
1758    4. The dot metacharacter matches one UTF-8 character instead of a single byte.
1759    
1760    5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
1761    repeat quantifier does operate correctly on UTF-8 characters instead of
1762    single bytes.
1763    
1764    6. Although the \\x{...} escape is permitted in a character class, characters
1765    whose values are greater than 255 cannot be included in a class.
1766    
1767    7. A class is matched against a UTF-8 character instead of just a single byte,
1768    but it can match only characters whose values are less than 256. Characters
1769    with greater values always fail to match a class.
1770    
1771    8. Repeated classes work correctly on multiple characters.
1772    
1773    9. Classes containing just a single character whose value is greater than 127
1774    (but less than 256), for example, [\\x80] or [^\\x{93}], do not work because
1775    these are optimized into single byte matches. In the first case, of course,
1776    the class brackets are just redundant.
1777    
1778    10. Lookbehind assertions move backwards in the subject by a fixed number of
1779    characters instead of a fixed number of bytes. Simple cases have been tested
1780    to work correctly, but there may be hidden gotchas herein.
1781    
1782    11. The character types such as \\d and \\w do not work correctly with UTF-8
1783    characters. They continue to test a single byte.
1784    
1785    12. Anything not explicitly mentioned here continues to work in bytes rather
1786    than in characters.
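
To make the "one to six literal bytes" in the first item of the list above concrete, the following C sketch encodes a code number using the original UTF-8 scheme, which allows values up to 0x7fffffff. It is an illustration only, not code taken from PCRE, and the name utf8_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Encodes cp in the original 1-6 byte UTF-8 scheme. Writes the bytes
   to out and returns how many were written, or 0 if cp is greater
   than 0x7fffffff. */
static size_t utf8_encode(unsigned long cp, unsigned char out[6])
{
    /* Largest value that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL
    };
    /* Marker bits of the first byte for each length. */
    static const unsigned char lead[6] = {
        0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc
    };
    size_t n, i;

    assert(out != NULL);

    /* n becomes the number of continuation bytes needed. */
    for (n = 0; n < 6 && cp > limit[n]; n++)
        ;
    if (n == 6)
        return 0;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

With this scheme \\x{1234} becomes the three bytes e1 88 b4, and a value such as 0xe9, which is greater than 127, becomes the two bytes c3 a9, in line with the second item of the list.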
1787    
1788    The following UTF-8 features of Perl 5.6 are not implemented:
1789    
1790    1. The escape sequence \\C to match a single byte.
1791    
1792    2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X.
1793    
1794  .SH AUTHOR  .SH AUTHOR
1795  Philip Hazel <ph10@cam.ac.uk>  Philip Hazel <ph10@cam.ac.uk>
1796  .br  .br
# Line 1497  Cambridge CB2 3QG, England. Line 1802  Cambridge CB2 3QG, England.
1802  .br  .br
1803  Phone: +44 1223 334714  Phone: +44 1223 334714
1804    
1805  Last updated: 29 July 1999  Last updated: 28 August 2000,
1806    .br
1807      the 250th anniversary of the death of J.S. Bach.
1808  .br  .br
1809  Copyright (c) 1997-1999 University of Cambridge.  Copyright (c) 1997-2000 University of Cambridge.
