
Diff of /code/trunk/doc/pcre.txt


revision 207 by ph10, Fri Aug 3 09:44:26 2007 UTC revision 208 by ph10, Mon Aug 6 15:23:29 2007 UTC
# Line 45  INTRODUCTION Line 45  INTRODUCTION
45    
46         Details of exactly which Perl regular expression features are  and  are         Details of exactly which Perl regular expression features are  and  are
47         not supported by PCRE are given in separate documents. See the pcrepat-         not supported by PCRE are given in separate documents. See the pcrepat-
48         tern and pcrecompat pages.         tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
49           page.
50    
51         Some features of PCRE can be included, excluded, or  changed  when  the         Some  features  of  PCRE can be included, excluded, or changed when the
52         library  is  built.  The pcre_config() function makes it possible for a         library is built. The pcre_config() function makes it  possible  for  a
53         client to discover which features are  available.  The  features  them-         client  to  discover  which  features are available. The features them-
54         selves  are described in the pcrebuild page. Documentation about build-         selves are described in the pcrebuild page. Documentation about  build-
55         ing PCRE for various operating systems can be found in the README  file         ing  PCRE for various operating systems can be found in the README file
56         in the source distribution.         in the source distribution.
57    
58         The  library  contains  a number of undocumented internal functions and         The library contains a number of undocumented  internal  functions  and
59         data tables that are used by more than one  of  the  exported  external         data  tables  that  are  used by more than one of the exported external
60         functions,  but  which  are  not  intended for use by external callers.         functions, but which are not intended  for  use  by  external  callers.
61         Their names all begin with "_pcre_", which hopefully will  not  provoke         Their  names  all begin with "_pcre_", which hopefully will not provoke
62         any name clashes. In some environments, it is possible to control which         any name clashes. In some environments, it is possible to control which
63         external symbols are exported when a shared library is  built,  and  in         external  symbols  are  exported when a shared library is built, and in
64         these cases the undocumented symbols are not exported.         these cases the undocumented symbols are not exported.
65    
66    
67  USER DOCUMENTATION  USER DOCUMENTATION
68    
69         The  user  documentation  for PCRE comprises a number of different sec-         The user documentation for PCRE comprises a number  of  different  sec-
70         tions. In the "man" format, each of these is a separate "man page".  In         tions.  In the "man" format, each of these is a separate "man page". In
71         the  HTML  format, each is a separate page, linked from the index page.         the HTML format, each is a separate page, linked from the  index  page.
72         In the plain text format, all the sections are concatenated,  for  ease         In  the  plain text format, all the sections are concatenated, for ease
73         of searching. The sections are as follows:         of searching. The sections are as follows:
74    
75           pcre              this document           pcre              this document
# Line 83  USER DOCUMENTATION Line 84  USER DOCUMENTATION
84           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
85           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
86                               regular expressions                               regular expressions
87             pcresyntax        quick syntax reference
88           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
89           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
90           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
# Line 90  USER DOCUMENTATION Line 92  USER DOCUMENTATION
92           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
93           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
94    
95         In addition, in the "man" and HTML formats, there is a short  page  for         In  addition,  in the "man" and HTML formats, there is a short page for
96         each C library function, listing its arguments and results.         each C library function, listing its arguments and results.
97    
98    
99  LIMITATIONS  LIMITATIONS
100    
101         There  are some size limitations in PCRE but it is hoped that they will         There are some size limitations in PCRE but it is hoped that they  will
102         never in practice be relevant.         never in practice be relevant.
103    
104         The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE         The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
105         is compiled with the default internal linkage size of 2. If you want to         is compiled with the default internal linkage size of 2. If you want to
106         process regular expressions that are truly enormous,  you  can  compile         process  regular  expressions  that are truly enormous, you can compile
107         PCRE  with  an  internal linkage size of 3 or 4 (see the README file in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
108         the source distribution and the pcrebuild documentation  for  details).         the  source  distribution and the pcrebuild documentation for details).
109         In  these  cases the limit is substantially larger.  However, the speed         In these cases the limit is substantially larger.  However,  the  speed
110         of execution is slower.         of execution is slower.
111    
112         All values in repeating quantifiers must be less than 65536. The  maxi-         All values in repeating quantifiers must be less than 65536.
        mum  compiled  length  of  subpattern  with an explicit repeat count is  
        30000 bytes. The maximum number of capturing subpatterns is 65535.  
113    
114         There is no limit to the number of parenthesized subpatterns, but there         There is no limit to the number of parenthesized subpatterns, but there
115         can be no more than 65535 capturing subpatterns.         can be no more than 65535 capturing subpatterns.
116    
        If  a  non-capturing subpattern with an unlimited repetition quantifier  
        can match an empty string, there is a limit of 1000 on  the  number  of  
        times  it  can  be  repeated while not matching an empty string - if it  
        does match an empty string, the loop is immediately broken.  
   
117         The maximum length of a name for a named subpattern is 32 characters, and         The maximum length of a name for a named subpattern is 32 characters, and
118         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
119    
# Line 231  AUTHOR Line 226  AUTHOR
226    
227  REVISION  REVISION
228    
229         Last updated: 30 July 2007         Last updated: 06 August 2007
230         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
231  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
232    
# Line 2212  DUPLICATE SUBPATTERN NAMES Line 2207  DUPLICATE SUBPATTERN NAMES
2207         subpatterns  are  not  required  to  be unique. Normally, patterns with         subpatterns  are  not  required  to  be unique. Normally, patterns with
2208         duplicate names are such that in any one match, only one of  the  named         duplicate names are such that in any one match, only one of  the  named
2209         subpatterns  participates. An example is shown in the pcrepattern docu-         subpatterns  participates. An example is shown in the pcrepattern docu-
2210         mentation. When duplicates are present, pcre_copy_named_substring() and         mentation.
2211    
2212           When   duplicates   are   present,   pcre_copy_named_substring()    and
2213         pcre_get_named_substring()  return the first substring corresponding to         pcre_get_named_substring()  return the first substring corresponding to
2214         the given name that is set.  If  none  are  set,  an  empty  string  is         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
2215         returned.  The pcre_get_stringnumber() function returns one of the num-         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
2216         bers that are associated with the name, but it is not defined which  it         function returns one of the numbers that are associated with the  name,
2217         is.         but it is not defined which it is.
2218    
2219         If  you want to get full details of all captured substrings for a given         If  you want to get full details of all captured substrings for a given
2220         name, you must use  the  pcre_get_stringtable_entries()  function.  The         name, you must use  the  pcre_get_stringtable_entries()  function.  The
# Line 2732  NAME Line 2729  NAME
2729    
2730  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
2731    
2732         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax and semantics of the regular expressions that are supported
2733         are described below. Regular expressions are also described in the Perl         by PCRE are described in detail below. There is a quick-reference  syn-
2734         documentation  and  in  a  number  of books, some of which have copious         tax  summary  in  the  pcresyntax  page. Perl's regular expressions are
2735         examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published         described in its own documentation, and regular expressions in  general
2736         by  O'Reilly, covers regular expressions in great detail. This descrip-         are  covered in a number of books, some of which have copious examples.
2737         tion of PCRE's regular expressions is intended as reference material.         Jeffrey  Friedl's  "Mastering  Regular   Expressions",   published   by
2738           O'Reilly,  covers regular expressions in great detail. This description
2739           of PCRE's regular expressions is intended as reference material.
2740    
2741         The original operation of PCRE was on strings of  one-byte  characters.         The original operation of PCRE was on strings of  one-byte  characters.
2742         However,  there is now also support for UTF-8 character strings. To use         However,  there is now also support for UTF-8 character strings. To use
# Line 2939  BACKSLASH Line 2938  BACKSLASH
2938    
2939     Absolute and relative back references     Absolute and relative back references
2940    
2941         The  sequence  \g followed by a positive or negative number, optionally         The  sequence  \g followed by an unsigned or a negative number, option-
2942         enclosed in braces, is an absolute or relative back reference. A  named         ally enclosed in braces, is an absolute or relative back  reference.  A
2943         back  reference can be coded as \g{name}. Back references are discussed         named back reference can be coded as \g{name}. Back references are dis-
2944         later, following the discussion of parenthesized subpatterns.         cussed later, following the discussion of parenthesized subpatterns.
2945    
2946     Generic character types     Generic character types
2947    
# Line 3878  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3877  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3877    
3878           \d++foo           \d++foo
3879    
3880         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Note that a possessive quantifier can be used with an entire group, for
3881           example:
3882    
3883             (abc|xyz){2,3}+
3884    
3885           Possessive   quantifiers   are   always  greedy;  the  setting  of  the
3886         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
3887         simpler forms of atomic group. However, there is no difference  in  the         simpler  forms  of atomic group. However, there is no difference in the
3888         meaning  of  a  possessive  quantifier and the equivalent atomic group,         meaning of a possessive quantifier and  the  equivalent  atomic  group,
3889         though there may be a performance  difference;  possessive  quantifiers         though  there  may  be a performance difference; possessive quantifiers
3890         should be slightly faster.         should be slightly faster.
3891    
3892         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-         The possessive quantifier syntax is an extension to the Perl  5.8  syn-
3893         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
3894         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
3895         built Sun's Java package, and PCRE copied it from there. It  ultimately         built  Sun's Java package, and PCRE copied it from there. It ultimately
3896         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
3897    
3898         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
3899         ple pattern constructs. For example, the sequence  A+B  is  treated  as         ple  pattern  constructs.  For  example, the sequence A+B is treated as
3900         A++B  because  there is no point in backtracking into a sequence of A's         A++B because there is no point in backtracking into a sequence  of  A's
3901         when B must follow.         when B must follow.
3902    
3903         When a pattern contains an unlimited repeat inside  a  subpattern  that         When  a  pattern  contains an unlimited repeat inside a subpattern that
3904         can  itself  be  repeated  an  unlimited number of times, the use of an         can itself be repeated an unlimited number of  times,  the  use  of  an
3905         atomic group is the only way to avoid some  failing  matches  taking  a         atomic  group  is  the  only way to avoid some failing matches taking a
3906         very long time indeed. The pattern         very long time indeed. The pattern
3907    
3908           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
3909    
3910         matches  an  unlimited number of substrings that either consist of non-         matches an unlimited number of substrings that either consist  of  non-
3911         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
3912         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
3913    
3914           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3915    
3916         it  takes  a  long  time  before reporting failure. This is because the         it takes a long time before reporting  failure.  This  is  because  the
3917         string can be divided between the internal \D+ repeat and the  external         string  can be divided between the internal \D+ repeat and the external
3918         *  repeat  in  a  large  number of ways, and all have to be tried. (The         * repeat in a large number of ways, and all  have  to  be  tried.  (The
3919         example uses [!?] rather than a single character at  the  end,  because         example  uses  [!?]  rather than a single character at the end, because
3920         both  PCRE  and  Perl have an optimization that allows for fast failure         both PCRE and Perl have an optimization that allows  for  fast  failure
3921         when a single character is used. They remember the last single  charac-         when  a single character is used. They remember the last single charac-
3922         ter  that  is required for a match, and fail early if it is not present         ter that is required for a match, and fail early if it is  not  present
3923         in the string.) If the pattern is changed so that  it  uses  an  atomic         in  the  string.)  If  the pattern is changed so that it uses an atomic
3924         group, like this:         group, like this:
3925    
3926           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
3927    
3928         sequences  of non-digits cannot be broken, and failure happens quickly.         sequences of non-digits cannot be broken, and failure happens  quickly.
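
The "large number of ways" mentioned above can be made concrete. The following is a rough sketch in Python, not part of PCRE; the function names are hypothetical. It models only one thing: the number of ways a run of n non-digit characters can be cut into consecutive non-empty pieces, where each piece stands for one iteration of the outer * repeat matched by the inner \D+. There are 2^(n-1) such splits, which is why a backtracking matcher that has to rule them all out slows down so sharply. A real matcher also tries other starting positions and applies optimizations, so this is an illustration of the growth rate, not a measurement.

```python
from itertools import combinations


def count_splits(n):
    """Ways to cut a run of n characters into consecutive non-empty pieces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each of the n-1 gaps between characters is either a cut point or not.
    return 2 ** (n - 1)


def enumerate_splits(text):
    """Brute-force list of every split, usable for checking small cases."""
    n = len(text)
    splits = []
    for k in range(n):  # k = how many of the n-1 gaps are used as cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([text[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits


if __name__ == "__main__":
    for pieces in enumerate_splits("aaa"):
        print(pieces)
    print(count_splits(20))  # 524288
```

With the atomic group, each run of non-digits can only be taken whole, so only one of these splits is ever tried.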
3929    
3930    
3931  BACK REFERENCES  BACK REFERENCES
3932    
3933         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
3934         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
3935         pattern earlier (that is, to its left) in the pattern,  provided  there         pattern  earlier  (that is, to its left) in the pattern, provided there
3936         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
3937    
3938         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
3939         it is always taken as a back reference, and causes  an  error  only  if         it  is  always  taken  as a back reference, and causes an error only if
3940         there  are  not that many capturing left parentheses in the entire pat-         there are not that many capturing left parentheses in the  entire  pat-
3941         tern. In other words, the parentheses that are referenced need  not  be         tern.  In  other words, the parentheses that are referenced need not be
3942         to  the left of the reference for numbers less than 10. A "forward back         to the left of the reference for numbers less than 10. A "forward  back
3943         reference" of this type can make sense when a  repetition  is  involved         reference"  of  this  type can make sense when a repetition is involved
3944         and  the  subpattern to the right has participated in an earlier itera-         and the subpattern to the right has participated in an  earlier  itera-
3945         tion.         tion.
3946    
3947         It is not possible to have a numerical "forward back  reference"  to  a         It  is  not  possible to have a numerical "forward back reference" to a
3948         subpattern  whose  number  is  10  or  more using this syntax because a         subpattern whose number is 10 or  more  using  this  syntax  because  a
3949         sequence such as \50 is interpreted as a character  defined  in  octal.         sequence  such  as  \50 is interpreted as a character defined in octal.
3950         See the subsection entitled "Non-printing characters" above for further         See the subsection entitled "Non-printing characters" above for further
3951         details of the handling of digits following a backslash.  There  is  no         details  of  the  handling of digits following a backslash. There is no
3952         such  problem  when named parentheses are used. A back reference to any         such problem when named parentheses are used. A back reference  to  any
3953         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
3954    
3955         Another way of avoiding the ambiguity inherent in  the  use  of  digits         Another  way  of  avoiding  the ambiguity inherent in the use of digits
3956         following a backslash is to use the \g escape sequence, which is a fea-         following a backslash is to use the \g escape sequence, which is a fea-
3957         ture introduced in Perl 5.10. This escape must be followed by  a  posi-         ture  introduced  in  Perl  5.10.  This  escape  must be followed by an
3958         tive  or  a negative number, optionally enclosed in braces. These exam-         unsigned number or a negative number, optionally  enclosed  in  braces.
3959         ples are all identical:         These examples are all identical:
3960    
3961           (ring), \1           (ring), \1
3962           (ring), \g1           (ring), \g1
3963           (ring), \g{1}           (ring), \g{1}
3964    
3965         A positive number specifies an absolute reference without the ambiguity         An  unsigned number specifies an absolute reference without the ambigu-
3966         that  is  present  in  the older syntax. It is also useful when literal         ity that is present in the older syntax. It is also useful when literal
3967         digits follow the reference. A negative number is a relative reference.         digits follow the reference. A negative number is a relative reference.
3968         Consider this example:         Consider this example:
3969    
3970           (abc(def)ghi)\g{-1}           (abc(def)ghi)\g{-1}
3971    
3972         The sequence \g{-1} is a reference to the most recently started captur-         The sequence \g{-1} is a reference to the most recently started captur-
3973         ing subpattern before \g, that is, it is equivalent to  \2.  Similarly,         ing  subpattern  before \g, that is, it is equivalent to \2. Similarly,
3974         \g{-2} would be equivalent to \1. The use of relative references can be         \g{-2} would be equivalent to \1. The use of relative references can be
3975         helpful in long patterns, and also in  patterns  that  are  created  by         helpful  in  long  patterns,  and  also in patterns that are created by
3976         joining together fragments that contain references within themselves.         joining together fragments that contain references within themselves.
3977    
3978         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
3979         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
3980         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3981         of doing that). So the pattern         of doing that). So the pattern
3982    
3983           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3984    
3985         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
3986         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
3987         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
3988         ple,         ple,
3989    
3990           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3991    
3992         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
3993         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
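
The behaviour described above can be tried out with a short script. This sketch uses Python's standard re module as a stand-in, on the assumption that it treats numbered back references the same way for these two patterns; it does not call PCRE. One deliberate difference: recent Python versions reject an unscoped (?i) in the middle of a pattern, so the second pattern uses the scoped form (?i:rah) to get the same effect of caseless matching inside the group only. The helper name is hypothetical.

```python
import re

# A back reference repeats the text the group matched, not the group's pattern.
SENSE = re.compile(r"(sens|respons)e and \1ibility")

# Caseless matching applies inside the group only; the back reference itself
# is outside that scope and so compares case-sensitively.
RAH = re.compile(r"((?i:rah))\s+\1")


def whole_match(pattern, text):
    """Return True if the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("sense and sensibility",
                 "response and responsibility",
                 "sense and responsibility"):
        print(text, whole_match(SENSE, text))
    for text in ("rah rah", "RAH RAH", "RAH rah"):
        print(text, whole_match(RAH, text))
```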
3994    
3995         There are several different ways of writing back  references  to  named         There  are  several  different ways of writing back references to named
3996         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
3997         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
3998         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
3999         and named references, is also supported. We  could  rewrite  the  above         and  named  references,  is  also supported. We could rewrite the above
4000         example in any of the following ways:         example in any of the following ways:
4001    
4002           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 4000  BACK REFERENCES Line 4004  BACK REFERENCES
4004           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4005           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
4006    
4007         A  subpattern  that  is  referenced  by  name may appear in the pattern         A subpattern that is referenced by  name  may  appear  in  the  pattern
4008         before or after the reference.         before or after the reference.
4009    
4010         There may be more than one back reference to the same subpattern. If  a         There  may be more than one back reference to the same subpattern. If a
4011         subpattern  has  not actually been used in a particular match, any back         subpattern has not actually been used in a particular match,  any  back
4012         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
4013    
4014           (a|(bc))\2           (a|(bc))\2
4015    
4016         always fails if it starts to match "a" rather than "bc". Because  there         always  fails if it starts to match "a" rather than "bc". Because there
4017         may  be  many  capturing parentheses in a pattern, all digits following         may be many capturing parentheses in a pattern,  all  digits  following
4018         the backslash are taken as part of a potential back  reference  number.         the  backslash  are taken as part of a potential back reference number.
4019         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
4020         used to terminate the back reference. If the  PCRE_EXTENDED  option  is         used  to  terminate  the back reference. If the PCRE_EXTENDED option is
4021         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-         set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com-
4022         ments" below) can be used.         ments" below) can be used.
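
A small check of the rule that a back reference to an unused subpattern always fails. As before, this is a sketch using Python's re module as a stand-in for PCRE, assuming it agrees on this pattern; the helper name is hypothetical.

```python
import re

# Group 2 is set only when the "bc" alternative is taken.
UNSET = re.compile(r"(a|(bc))\2")


def matched_text(text):
    """Return the whole-string match, or None if the pattern fails."""
    m = UNSET.fullmatch(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(matched_text("bcbc"))  # group 2 was used, so \2 can match "bc"
    print(matched_text("a"))     # group 2 was never set, so \2 fails
    print(matched_text("aa"))    # \2 does not fall back to group 1
```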
4023    
4024         A back reference that occurs inside the parentheses to which it  refers         A  back reference that occurs inside the parentheses to which it refers
4025         fails  when  the subpattern is first used, so, for example, (a\1) never         fails when the subpattern is first used, so, for example,  (a\1)  never
4026         matches.  However, such references can be useful inside  repeated  sub-         matches.   However,  such references can be useful inside repeated sub-
4027         patterns. For example, the pattern         patterns. For example, the pattern
4028    
4029           (a|b\1)+           (a|b\1)+
4030    
4031         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4032         ation of the subpattern,  the  back  reference  matches  the  character         ation  of  the  subpattern,  the  back  reference matches the character
4033         string  corresponding  to  the previous iteration. In order for this to         string corresponding to the previous iteration. In order  for  this  to
4034         work, the pattern must be such that the first iteration does  not  need         work,  the  pattern must be such that the first iteration does not need
4035         to  match the back reference. This can be done using alternation, as in         to match the back reference. This can be done using alternation, as  in
4036         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4037    
4038    
4039  ASSERTIONS  ASSERTIONS
4040    
4041         An assertion is a test on the characters  following  or  preceding  the         An  assertion  is  a  test on the characters following or preceding the
4042         current  matching  point that does not actually consume any characters.         current matching point that does not actually consume  any  characters.
4043         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
4044         described above.         described above.
4045    
4046         More  complicated  assertions  are  coded as subpatterns. There are two         More complicated assertions are coded as  subpatterns.  There  are  two
4047         kinds: those that look ahead of the current  position  in  the  subject         kinds:  those  that  look  ahead of the current position in the subject
4048         string,  and  those  that  look  behind  it. An assertion subpattern is         string, and those that look  behind  it.  An  assertion  subpattern  is
4049         matched in the normal way, except that it does not  cause  the  current         matched  in  the  normal way, except that it does not cause the current
4050         matching position to be changed.         matching position to be changed.
4051    
4052         Assertion  subpatterns  are  not  capturing subpatterns, and may not be         Assertion subpatterns are not capturing subpatterns,  and  may  not  be
4053         repeated, because it makes no sense to assert the  same  thing  several         repeated,  because  it  makes no sense to assert the same thing several
4054         times.  If  any kind of assertion contains capturing subpatterns within         times. If any kind of assertion contains capturing  subpatterns  within
4055         it, these are counted for the purposes of numbering the capturing  sub-         it,  these are counted for the purposes of numbering the capturing sub-
4056         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
4057         out only for positive assertions, because it does not  make  sense  for         out  only  for  positive assertions, because it does not make sense for
4058         negative assertions.         negative assertions.
4059    
4060     Lookahead assertions     Lookahead assertions
# Line 4060  ASSERTIONS Line 4064  ASSERTIONS
4064    
4065           \w+(?=;)           \w+(?=;)
4066    
4067         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
4068         colon in the match, and         colon in the match, and
4069    
4070           foo(?!bar)           foo(?!bar)
4071    
4072         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
4073         that the apparently similar pattern         that the apparently similar pattern
4074    
4075           (?!foo)bar           (?!foo)bar
4076    
4077         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
4078         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
4079         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
4080         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
4081    
4082         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
4083         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
4084         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
4085         string must always fail.         string must always fail.
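
       The lookahead behaviour described above can be tried out with a short
       sketch. It uses Python's standard re module rather than PCRE itself,
       on the assumption that the two engines agree for these simple
       patterns; the subject strings and variable names are invented for
       illustration.

```python
import re

# Positive lookahead: a semicolon must follow, but it is not part of the match.
word_text = re.search(r"\w+(?=;)", "count;").group(0)

# Negative lookahead: "foo" that is not followed by "bar".
foo_ok = re.search(r"foo(?!bar)", "foobaz") is not None
foo_blocked = re.search(r"foo(?!bar)", "foobar") is None

# (?!foo)bar still finds the "bar" in "foobar": the assertion is tested at
# the "b", where the next three characters are "bar", not "foo".
bar_pos = re.search(r"(?!foo)bar", "foobar").start()

# An empty negative lookahead can never succeed, so the match is forced to fail.
never_matches = re.search(r"(?!)", "anything") is None
```

       Here word_text holds the word without its semicolon, and bar_pos is
       the offset of "bar" inside "foobar".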
4086    
4087     Lookbehind assertions     Lookbehind assertions
4088    
4089         Lookbehind assertions start with (?<= for positive assertions and  (?<!         Lookbehind  assertions start with (?<= for positive assertions and (?<!
4090         for negative assertions. For example,         for negative assertions. For example,
4091    
4092           (?<!foo)bar           (?<!foo)bar
4093    
4094         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does find an occurrence of "bar" that is not  preceded  by  "foo".  The
4095         contents of a lookbehind assertion are restricted  such  that  all  the         contents  of  a  lookbehind  assertion are restricted such that all the
4096         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4097         eral top-level alternatives, they do not all  have  to  have  the  same         eral  top-level  alternatives,  they  do  not all have to have the same
4098         fixed length. Thus         fixed length. Thus
4099    
4100           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 4099  ASSERTIONS Line 4103  ASSERTIONS
4103    
4104           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4105    
4106         causes  an  error at compile time. Branches that match different length         causes an error at compile time. Branches that match  different  length
4107         strings are permitted only at the top level of a lookbehind  assertion.         strings  are permitted only at the top level of a lookbehind assertion.
4108         This  is  an  extension  compared  with  Perl (at least for 5.8), which         This is an extension compared with  Perl  (at  least  for  5.8),  which
4109         requires all branches to match the same length of string. An  assertion         requires  all branches to match the same length of string. An assertion
4110         such as         such as
4111    
4112           (?<=ab(c|de))           (?<=ab(c|de))
4113    
4114         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
4115         different lengths, but it is acceptable if rewritten to  use  two  top-         different  lengths,  but  it is acceptable if rewritten to use two top-
4116         level branches:         level branches:
4117    
4118           (?<=abc|abde)           (?<=abc|abde)
4119    
4120         In some cases, the Perl 5.10 escape sequence \K (see above) can be used         In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4121         instead of a lookbehind assertion; this is not restricted to  a  fixed-         instead  of  a lookbehind assertion; this is not restricted to a fixed-
4122         length.         length.
4123    
4124         The  implementation  of lookbehind assertions is, for each alternative,         The implementation of lookbehind assertions is, for  each  alternative,
4125         to temporarily move the current position back by the fixed  length  and         to  temporarily  move the current position back by the fixed length and
4126         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4127         rent position, the assertion fails.         rent position, the assertion fails.
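
       A minimal sketch of these lookbehind rules follows, again using
       Python's re module as a stand-in for PCRE. Note one difference:
       Python is stricter than PCRE and requires every alternative in a
       lookbehind to have the same length, so only single-branch examples
       are shown. The helper name compiles() is hypothetical.

```python
import re

neg = re.compile(r"(?<!foo)bar")
hit = neg.search("bazbar")    # "bar" preceded by "baz": matches
miss = neg.search("foobar")   # "bar" preceded by "foo": no match

# With fewer than three characters before "bar", the inner "foo" cannot
# match, so the negative assertion succeeds.
short = neg.search("bar")

# A positive lookbehind fails when there are too few preceding characters.
too_short = re.search(r"(?<=abc)d", "bcd")


def compiles(pattern):
    """Return True if the pattern compiles, False on a compile-time error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


fixed_ok = compiles(r"(?<=abc)")          # fixed length: accepted
variable_ok = compiles(r"(?<=ab(c|de))")  # one branch, two lengths: rejected
```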
4128    
4129         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4130         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode) to appear in lookbehind assertions, because it makes it  impossi-
4131         ble to calculate the length of the lookbehind. The \X and  \R  escapes,         ble  to  calculate the length of the lookbehind. The \X and \R escapes,
4132         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
4133    
4134         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
4135         assertions to specify efficient matching at  the  end  of  the  subject         assertions  to  specify  efficient  matching  at the end of the subject
4136         string. Consider a simple pattern such as         string. Consider a simple pattern such as
4137    
4138           abcd$           abcd$
4139    
4140         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
4141         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
4142         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
4143         pattern is specified as         pattern is specified as
4144    
4145           ^.*abcd$           ^.*abcd$
4146    
4147         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
4148         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
4149         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
4150         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
4151         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4152    
4153           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4154    
4155         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
4156         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
4157         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
4158         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
4159         processing time.         processing time.
4160    
4161     Using multiple assertions     Using multiple assertions
# Line 4160  ASSERTIONS Line 4164  ASSERTIONS
4164    
4165           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
4166    
4167         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
4168         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
4169         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
4170         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
4171         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
4172         ceded by six characters, the first three of which are digits and the last         ceded by six characters, the first three of which are digits and the last
4173         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
4174         foo". A pattern to do that is         foo". A pattern to do that is
4175    
4176           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
4177    
4178         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
4179         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
4180         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
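
       The difference between the two patterns can be checked with a small
       sketch. It uses Python's re module, on the assumption that it treats
       these fixed-length lookbehinds the same way as PCRE; the subject
       strings are made up for illustration.

```python
import re

# Both assertions are applied at the same point, just before "foo".
three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")

three_plain = three.search("123foo")      # three digits, not "999"
three_nines = three.search("999foo")      # rejected by (?<!999)
three_far = three.search("123abcfoo")     # "abc" are not digits
six_far = six.search("123abcfoo")         # digits, then three other characters
six_nines = six.search("123999foo")       # last three characters are "999"
```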
4181    
# Line 4179  ASSERTIONS Line 4183  ASSERTIONS
4183    
4184           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
4185    
4186         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
4187         is not preceded by "foo", while         is not preceded by "foo", while
4188    
4189           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
4190    
4191         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
4192         three characters that are not "999".         three characters that are not "999".
4193    
4194    
4195  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
4196    
4197         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
4198         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
4199         on  the result of an assertion, or whether a previous capturing subpat-         on the result of an assertion, or whether a previous capturing  subpat-
4200         tern matched or not. The two possible forms of  conditional  subpattern         tern  matched  or not. The two possible forms of conditional subpattern
4201         are         are
4202    
4203           (?(condition)yes-pattern)           (?(condition)yes-pattern)
4204           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
4205    
4206         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
4207         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
4208         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4209    
4210         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
4211         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
4212    
4213     Checking for a used subpattern by number     Checking for a used subpattern by number
4214    
4215         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
4216         the  condition  is  true if the capturing subpattern of that number has         the condition is true if the capturing subpattern of  that  number  has
4217         previously matched. An alternative notation is to  precede  the  digits         previously  matched.  An  alternative notation is to precede the digits
4218         with a plus or minus sign. In this case, the subpattern number is rela-         with a plus or minus sign. In this case, the subpattern number is rela-
4219         tive rather than absolute.  The most recently opened parentheses can be         tive rather than absolute.  The most recently opened parentheses can be
4220         referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In         referenced by (?(-1), the next most recent by (?(-2),  and  so  on.  In
4221         looping constructs it can also make sense to refer to subsequent groups         looping constructs it can also make sense to refer to subsequent groups
4222         with constructs such as (?(+2).         with constructs such as (?(+2).
4223    
4224         Consider  the  following  pattern, which contains non-significant white         Consider the following pattern, which  contains  non-significant  white
4225         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
4226         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
4227    
4228           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
4229    
4230         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
4231         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
4232         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
4233         third part is a conditional subpattern that tests whether the first set         third part is a conditional subpattern that tests whether the first set
4234         of parentheses matched or not. If they did, that is, if subject started         of parentheses matched or not. If they did, that is, if subject started
4235         with an opening parenthesis, the condition is true, and so the yes-pat-         with an opening parenthesis, the condition is true, and so the yes-pat-
4236         tern  is  executed  and  a  closing parenthesis is required. Otherwise,         tern is executed and a  closing  parenthesis  is  required.  Otherwise,
4237         since no-pattern is not present, the  subpattern  matches  nothing.  In         since  no-pattern  is  not  present, the subpattern matches nothing. In
4238         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other words,  this  pattern  matches  a  sequence  of  non-parentheses,
4239         optionally enclosed in parentheses.         optionally enclosed in parentheses.
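
       As a rough illustration, the same conditional pattern can be run with
       Python's re module, which also supports the (?(1)...) form. This is a
       sketch under that assumption, with re.VERBOSE standing in for
       PCRE_EXTENDED; relative references such as (?(-1) are not available
       there and are not shown.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

enclosed = pattern.fullmatch("(abc)")   # both parentheses present
bare = pattern.fullmatch("abc")         # no parentheses at all
unclosed = pattern.fullmatch("(abc")    # opening parenthesis without closing
unopened = pattern.fullmatch("abc)")    # closing parenthesis without opening
```

       Only the first two subjects match as a whole; the unbalanced ones are
       rejected.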
4240    
4241         If you were embedding this pattern in a larger one,  you  could  use  a         If  you  were  embedding  this pattern in a larger one, you could use a
4242         relative reference:         relative reference:
4243    
4244           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4245    
4246         This  makes  the  fragment independent of the parentheses in the larger         This makes the fragment independent of the parentheses  in  the  larger
4247         pattern.         pattern.
4248    
4249     Checking for a used subpattern by name     Checking for a used subpattern by name
4250    
4251         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
4252         used  subpattern  by  name.  For compatibility with earlier versions of         used subpattern by name. For compatibility  with  earlier  versions  of
4253         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
4254         also  recognized. However, there is a possible ambiguity with this syn-         also recognized. However, there is a possible ambiguity with this  syn-
4255         tax, because subpattern names may  consist  entirely  of  digits.  PCRE         tax,  because  subpattern  names  may  consist entirely of digits. PCRE
4256         looks  first for a named subpattern; if it cannot find one and the name         looks first for a named subpattern; if it cannot find one and the  name
4257         consists entirely of digits, PCRE looks for a subpattern of  that  num-         consists  entirely  of digits, PCRE looks for a subpattern of that num-
4258         ber,  which must be greater than zero. Using subpattern names that con-         ber, which must be greater than zero. Using subpattern names that  con-
4259         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
4260    
4261         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
# Line 4262  CONDITIONAL SUBPATTERNS Line 4266  CONDITIONAL SUBPATTERNS
4266     Checking for pattern recursion     Checking for pattern recursion
4267    
4268         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
4269         name  R, the condition is true if a recursive call to the whole pattern         name R, the condition is true if a recursive call to the whole  pattern
4270         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
4271         sand follow the letter R, for example:         sand follow the letter R, for example:
4272    
4273           (?(R3)...) or (?(R&name)...)           (?(R3)...) or (?(R&name)...)
4274    
4275         the  condition is true if the most recent recursion is into the subpat-         the condition is true if the most recent recursion is into the  subpat-
4276         tern whose number or name is given. This condition does not  check  the         tern  whose  number or name is given. This condition does not check the
4277         entire recursion stack.         entire recursion stack.
4278    
4279         At  "top  level", all these recursion test conditions are false. Recur-         At "top level", all these recursion test conditions are  false.  Recur-
4280         sive patterns are described below.         sive patterns are described below.
4281    
4282     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
4283    
4284         If the condition is the string (DEFINE), and  there  is  no  subpattern         If  the  condition  is  the string (DEFINE), and there is no subpattern
4285         with  the  name  DEFINE,  the  condition is always false. In this case,         with the name DEFINE, the condition is  always  false.  In  this  case,
4286         there may be only one alternative  in  the  subpattern.  It  is  always         there  may  be  only  one  alternative  in the subpattern. It is always
4287         skipped  if  control  reaches  this  point  in the pattern; the idea of         skipped if control reaches this point  in  the  pattern;  the  idea  of
4288         DEFINE is that it can be used to define "subroutines" that can be  ref-         DEFINE  is that it can be used to define "subroutines" that can be ref-
4289         erenced  from elsewhere. (The use of "subroutines" is described below.)         erenced from elsewhere. (The use of "subroutines" is described  below.)
4290         For example, a pattern to match an IPv4 address could be  written  like         For  example,  a pattern to match an IPv4 address could be written like
4291         this (ignore whitespace and line breaks):         this (ignore whitespace and line breaks):
4292    
4293           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
4294           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
4295    
4296         The  first  part of the pattern is a DEFINE group inside which another         The first part of the pattern is a DEFINE group  inside  which  another
4297         group named "byte" is defined. This matches an individual component  of         group  named "byte" is defined. This matches an individual component of
4298         an  IPv4  address  (a number less than 256). When matching takes place,         an IPv4 address (a number less than 256). When  matching  takes  place,
4299         this part of the pattern is skipped because DEFINE acts  like  a  false         this  part  of  the pattern is skipped because DEFINE acts like a false
4300         condition.         condition.
4301    
4302         The rest of the pattern uses references to the named group to match the         The rest of the pattern uses references to the named group to match the
4303         four dot-separated components of an IPv4 address, insisting on  a  word         four  dot-separated  components of an IPv4 address, insisting on a word
4304         boundary at each end.         boundary at each end.
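
       Python's re module has neither DEFINE nor (?&name) calls, so the
       sketch below swaps in a different technique: the "byte" fragment is
       defined once as a string and interpolated where PCRE would call the
       named group. The names BYTE, IPV4 and is_ipv4 are hypothetical.

```python
import re

# One component of an IPv4 address (a number less than 256), defined once.
BYTE = r"(?: 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b {BYTE} (?: \. {BYTE} ){{3}} \b", re.VERBOSE)


def is_ipv4(text):
    """Return True if the whole of text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

       For example, is_ipv4("192.168.1.1") is true, while "256.1.1.1" and
       "1.2.3" are rejected.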
4305    
4306     Assertion conditions     Assertion conditions
4307    
4308         If  the  condition  is  not  in any of the above formats, it must be an         If the condition is not in any of the above  formats,  it  must  be  an
4309         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.   This may be a positive or negative lookahead or lookbehind
4310         assertion.  Consider  this  pattern,  again  containing non-significant         assertion. Consider  this  pattern,  again  containing  non-significant
4311         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
4312    
4313           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
4314           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
4315    
4316         The condition  is  a  positive  lookahead  assertion  that  matches  an         The  condition  is  a  positive  lookahead  assertion  that  matches an
4317         optional  sequence of non-letters followed by a letter. In other words,         optional sequence of non-letters followed by a letter. In other  words,
4318         it tests for the presence of at least one letter in the subject.  If  a         it  tests  for the presence of at least one letter in the subject. If a
4319         letter  is found, the subject is matched against the first alternative;         letter is found, the subject is matched against the first  alternative;
4320         otherwise it is  matched  against  the  second.  This  pattern  matches         otherwise  it  is  matched  against  the  second.  This pattern matches
4321         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
4322         letters and dd are digits.         letters and dd are digits.
4323    
4324    
4325  COMMENTS  COMMENTS
4326    
4327         The sequence (?# marks the start of a comment that continues up to  the         The  sequence (?# marks the start of a comment that continues up to the
4328         next  closing  parenthesis.  Nested  parentheses are not permitted. The         next closing parenthesis. Nested parentheses  are  not  permitted.  The
4329         characters that make up a comment play no part in the pattern  matching         characters  that make up a comment play no part in the pattern matching
4330         at all.         at all.
4331    
4332         If  the PCRE_EXTENDED option is set, an unescaped # character outside a         If the PCRE_EXTENDED option is set, an unescaped # character outside  a
4333         character class introduces a  comment  that  continues  to  immediately         character  class  introduces  a  comment  that continues to immediately
4334         after the next newline in the pattern.         after the next newline in the pattern.
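
       Both comment forms can be seen in the following sketch, which uses
       Python's re module on the assumption that it handles them like PCRE,
       with re.VERBOSE playing the part of PCRE_EXTENDED.

```python
import re

# An inline (?#...) comment plays no part in the match.
inline = re.compile(r"ab(?#this text is ignored)c")

# In extended mode, an unescaped # starts a comment that runs to the newline.
extended = re.compile(
    r"""
    \d{2}    # two digits
    -        # a literal hyphen
    [a-z]+   # one or more letters
    """,
    re.VERBOSE,
)

# Inside a character class, # is an ordinary character, not a comment.
hash_class = re.compile(r"[#]x", re.VERBOSE)

inline_hit = inline.fullmatch("abc")
extended_hit = extended.fullmatch("12-abc")
hash_hit = hash_class.fullmatch("#x")
```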
4335    
4336    
4337  RECURSIVE PATTERNS  RECURSIVE PATTERNS
4338    
4339         Consider  the problem of matching a string in parentheses, allowing for         Consider the problem of matching a string in parentheses, allowing  for
4340         unlimited nested parentheses. Without the use of  recursion,  the  best         unlimited  nested  parentheses.  Without the use of recursion, the best
4341         that  can  be  done  is  to use a pattern that matches up to some fixed         that can be done is to use a pattern that  matches  up  to  some  fixed
4342         depth of nesting. It is not possible to  handle  an  arbitrary  nesting         depth  of  nesting.  It  is not possible to handle an arbitrary nesting
4343         depth.         depth.
4344    
4345         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
4346         sions to recurse (amongst other things). It does this by  interpolating         sions  to recurse (amongst other things). It does this by interpolating
4347         Perl  code in the expression at run time, and the code can refer to the         Perl code in the expression at run time, and the code can refer to  the
4348         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
4349         parentheses problem can be created like this:         parentheses problem can be created like this:
4350    
# Line 4350  RECURSIVE PATTERNS Line 4354  RECURSIVE PATTERNS
4354         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
4355    
4356         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4357         it  supports  special  syntax  for recursion of the entire pattern, and         it supports special syntax for recursion of  the  entire  pattern,  and
4358         also for individual subpattern recursion.  After  its  introduction  in         also  for  individual  subpattern  recursion. After its introduction in
4359         PCRE  and  Python,  this  kind of recursion was introduced into Perl at         PCRE and Python, this kind of recursion was  introduced  into  Perl  at
4360         release 5.10.         release 5.10.
4361    
4362         A special item that consists of (? followed by a  number  greater  than         A  special  item  that consists of (? followed by a number greater than
4363         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
4364         the given number, provided that it occurs inside that  subpattern.  (If         the  given  number, provided that it occurs inside that subpattern. (If
4365         not,  it  is  a  "subroutine" call, which is described in the next sec-         not, it is a "subroutine" call, which is described  in  the  next  sec-
4366         tion.) The special item (?R) or (?0) is a recursive call of the  entire         tion.)  The special item (?R) or (?0) is a recursive call of the entire
4367         regular expression.         regular expression.
4368    
4369         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
4370         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
4371         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
4372         alternatives and there is a subsequent matching failure.         alternatives and there is a subsequent matching failure.
4373    
4374         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
4375         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
4376    
4377           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
4378    
4379         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
4380         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
4381         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
4382         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
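
       Python's built-in re module has no recursion syntax, so the sketch
       below is not PCRE's mechanism. It is a hand-written recursive
       function that accepts the same strings as the pattern above, to make
       the three steps concrete. The name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    # First, an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Finally, the closing parenthesis ends this group.
            return pos + 1
        if ch == "(":
            # A nested group: recurse, as (?R) does in the pattern.
            end = match_parens(subject, pos)
            if end < 0:
                return -1
            pos = end
        else:
            # An ordinary non-parenthesis character.
            pos += 1
    # Ran out of subject before the group was closed.
    return -1
```

       Calling match_parens("(ab(cd)ef)") returns the length of the whole
       string, while an unbalanced subject such as "(ab(cd" gives -1.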
4383    
4384         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
4385         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
4386    
4387           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4388    
4389         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
4390         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
4391    
4392         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
4393         tricky.  This is made easier by the use of relative references. (A Perl         tricky. This is made easier by the use of relative references. (A  Perl
4394         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write         5.10  feature.)   Instead  of  (?1)  in the pattern above you can write
4395         (?-2) to refer to the second most recently opened parentheses preceding         (?-2) to refer to the second most recently opened parentheses preceding
4396         the recursion. In other  words,  a  negative  number  counts  capturing         the  recursion.  In  other  words,  a  negative number counts capturing
4397         parentheses leftwards from the point at which it is encountered.         parentheses leftwards from the point at which it is encountered.
4398    
4399         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
4400         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
4401         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
4402         enced. They are always "subroutine" calls, as  described  in  the  next         enced.  They  are  always  "subroutine" calls, as described in the next
4403         section.         section.
4404    
4405         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
4406         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
4407         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
4408    
4409           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4410    
4411         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
4412         one is used.         one is used.
4413    
4414         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
4415         nested  unlimited repeats, and so the use of atomic grouping for match-         nested unlimited repeats, and so the use of atomic grouping for  match-
4416         ing strings of non-parentheses is important when applying  the  pattern         ing  strings  of non-parentheses is important when applying the pattern
4417         to strings that do not match. For example, when this pattern is applied         to strings that do not match. For example, when this pattern is applied
4418         to         to
4419    
4420           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4421    
4422         it yields "no match" quickly. However, if atomic grouping is not  used,         it  yields "no match" quickly. However, if atomic grouping is not used,
4423         the  match  runs  for a very long time indeed because there are so many         the match runs for a very long time indeed because there  are  so  many
4424         different ways the + and * repeats can carve up the  subject,  and  all         different  ways  the  + and * repeats can carve up the subject, and all
4425         have to be tested before failure can be reported.         have to be tested before failure can be reported.
4426    
4427         At the end of a match, the values set for any capturing subpatterns are         At the end of a match, the values set for any capturing subpatterns are
4428         those from the outermost level of the recursion at which the subpattern         those from the outermost level of the recursion at which the subpattern
4429         value  is  set.   If  you want to obtain intermediate values, a callout         value is set.  If you want to obtain  intermediate  values,  a  callout
4430         function can be used (see below and the pcrecallout documentation).  If         function  can be used (see below and the pcrecallout documentation). If
4431         the pattern above is matched against         the pattern above is matched against
4432    
4433           (ab(cd)ef)           (ab(cd)ef)
4434    
4435         the  value  for  the  capturing  parentheses is "ef", which is the last         the value for the capturing parentheses is  "ef",  which  is  the  last
4436         value taken on at the top level. If additional parentheses  are  added,         value  taken  on at the top level. If additional parentheses are added,
4437         giving         giving
4438    
4439           \( ( ( (?>[^()]+) | (?R) )* ) \)           \( ( ( (?>[^()]+) | (?R) )* ) \)
4440              ^                        ^              ^                        ^
4441              ^                        ^              ^                        ^
4442    
4443         the  string  they  capture is "ab(cd)ef", the contents of the top level         the string they capture is "ab(cd)ef", the contents of  the  top  level
4444         parentheses. If there are more than 15 capturing parentheses in a  pat-         parentheses.  If there are more than 15 capturing parentheses in a pat-
4445         tern, PCRE has to obtain extra memory to store data during a recursion,         tern, PCRE has to obtain extra memory to store data during a recursion,
4446         which it does by using pcre_malloc, freeing  it  via  pcre_free  after-         which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
4447         wards.  If  no  memory  can  be  obtained,  the  match  fails  with the         wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
4448         PCRE_ERROR_NOMEMORY error.         PCRE_ERROR_NOMEMORY error.
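To see why the reported value is "ef", it can help to list what the capturing group matches on each top-level iteration. The following Python sketch does that by hand. It is not PCRE code, the helper name top_level_pieces is hypothetical, and it assumes its argument is a single balanced "(...)" string.

```python
def top_level_pieces(s):
    # Assumes s is one balanced "(...)" string, e.g. "(ab(cd)ef)".
    # Returns the substrings that the group ( (?>[^()]+) | (?R) ) would
    # match on successive iterations at the outermost level.
    inner = s[1:-1]
    pieces = []
    i = 0
    while i < len(inner):
        start = i
        if inner[i] == ")":
            raise ValueError("unbalanced input")
        if inner[i] == "(":
            # One recursive match: skip to the matching close parenthesis.
            depth = 0
            while True:
                if i >= len(inner):
                    raise ValueError("unbalanced input")
                if inner[i] == "(":
                    depth += 1
                elif inner[i] == ")":
                    depth -= 1
                i += 1
                if depth == 0:
                    break
        else:
            # One run of non-parentheses.
            while i < len(inner) and inner[i] not in "()":
                i += 1
        pieces.append(inner[start:i])
    return pieces
```

For "(ab(cd)ef)" the pieces are "ab", "(cd)" and "ef". The last piece is the value left in the group at the top level, and the pieces joined together give "ab(cd)ef", which is what the additional parentheses capture.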
4449    
4450         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
4451         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
4452         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
4453         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
4454         ted at the outer level.         ted at the outer level.
4455    
4456           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
4457    
4458         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
4459         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
4460         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
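The difference between the (R) condition and the (?R) call can be sketched with a hand-written matcher in Python. This is an illustration only, not PCRE code: the function name match_angle and its recursing flag are hypothetical, the flag plays the role of the (R) test, and the function call plays the role of (?R). The match is anchored at the given position.

```python
DIGITS = "0123456789"

def match_angle(s, pos=0, recursing=False):
    # Hand-written analogue of  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
    # Returns the index just past the bracketed text starting at s[pos],
    # or None if nothing matches there.
    if pos >= len(s) or s[pos] != "<":
        return None
    pos += 1
    while pos < len(s) and s[pos] != ">":
        if s[pos] == "<":
            # (?R) : the actual recursive call
            end = match_angle(s, pos, recursing=True)
            if end is None:
                return None
            pos = end
        elif recursing:
            # (?(R) \d++ ... : inside a recursion only digits are allowed
            if s[pos] not in DIGITS:
                return None
            while pos < len(s) and s[pos] in DIGITS:
                pos += 1
        else:
            # ... | [^<>]*+) : at the outer level any characters are allowed
            while pos < len(s) and s[pos] not in "<>":
                pos += 1
    if pos < len(s) and s[pos] == ">":
        return pos + 1
    return None
```

With this sketch, "<abc<123>def>" is accepted in full, whereas "<abc<12a>def>" is rejected at position 0 because the nested brackets contain a non-digit. Starting at position 4 of the second string, "<12a>" is accepted, since there it is at the outer level.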
4461    
4462    
4463  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
4464    
4465         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
4466         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
4467         ates like a subroutine in a programming language. The "called"  subpat-         ates  like a subroutine in a programming language. The "called" subpat-
4468         tern may be defined before or after the reference. A numbered reference         tern may be defined before or after the reference. A numbered reference
4469         can be absolute or relative, as in these examples:         can be absolute or relative, as in these examples:
4470    
# Line 4472  SUBPATTERNS AS SUBROUTINES Line 4476  SUBPATTERNS AS SUBROUTINES
4476    
4477           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4478    
4479         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
4480         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
4481    
4482           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
4483    
4484         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
4485         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
4486         above.         above.
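The contrast between a back reference and a subroutine call can be tried out in Python. This is a sketch under one stated assumption: Python's standard re module supports \1 but not (?1), so the subroutine call is emulated by writing the called subpattern out again as a non-capturing group. The variable names are hypothetical.

```python
import re

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again,
# so either alternative may match the second time.
inlined = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# Pair each subject with (matches as back reference, matches as call).
results = {
    text: (backref.fullmatch(text) is not None,
           inlined.fullmatch(text) is not None)
    for text in subjects
}
```

Only the third subject distinguishes the two forms: the back reference rejects it, while the emulated call accepts it.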
4487    
4488         Like recursive subpatterns, a "subroutine" call is always treated as an         Like recursive subpatterns, a "subroutine" call is always treated as an
4489         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
4490         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
4491         there is a subsequent matching failure.         there is a subsequent matching failure.
4492    
4493         When a subpattern is used as a subroutine, processing options  such  as         When  a  subpattern is used as a subroutine, processing options such as
4494         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
4495         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
4496    
4497           (abc)(?i:(?-1))           (abc)(?i:(?-1))
4498    
4499         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
4500         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
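The same behaviour can be approximated in Python, again as a sketch rather than PCRE code. Python's re module has no (?-1) call, so the called subpattern is written out inside (?-i:...) to model the fact that it keeps the case-sensitive setting it had where it was defined. A plain textual copy placed under (?i:...) is shown for contrast. The variable names are hypothetical.

```python
import re

# Models (abc)(?i:(?-1)): the called subpattern keeps its own,
# case-sensitive option even though the call sits inside (?i:...).
fixed_options = re.compile(r"(abc)(?i:(?-i:abc))")

# A naive textual copy would instead pick up the caseless option
# of the enclosing group.
textual_copy = re.compile(r"(abc)(?i:abc)")

same_case = fixed_options.fullmatch("abcabc") is not None
mixed_case = fixed_options.fullmatch("abcABC") is not None
copy_mixed_case = textual_copy.fullmatch("abcABC") is not None
```

Here same_case is True and mixed_case is False, matching the behaviour described above, while copy_mixed_case is True, which shows that a subroutine call is not the same as pasting the subpattern text at the point of the call.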
4501    
4502    
4503  CALLOUTS  CALLOUTS
4504    
4505         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
4506         Perl code to be obeyed in the middle of matching a regular  expression.         Perl  code to be obeyed in the middle of matching a regular expression.
4507         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
4508         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
4509         tion.         tion.
4510    
4511         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
4512         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
4513         an  external function by putting its entry point in the global variable         an external function by putting its entry point in the global  variable
4514         pcre_callout.  By default, this variable contains NULL, which  disables         pcre_callout.   By default, this variable contains NULL, which disables
4515         all calling out.         all calling out.
4516    
4517         Within  a  regular  expression,  (?C) indicates the points at which the         Within a regular expression, (?C) indicates the  points  at  which  the
4518         external function is to be called. If you want  to  identify  different         external  function  is  to be called. If you want to identify different
4519         callout  points, you can put a number less than 256 after the letter C.         callout points, you can put a number less than 256 after the letter  C.
4520         The default value is zero.  For example, this pattern has  two  callout         The  default  value is zero.  For example, this pattern has two callout
4521         points:         points:
4522    
4523           (?C1)abc(?C2)def           (?C1)abc(?C2)def
4524    
4525         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
4526         automatically installed before each item in the pattern. They  are  all         automatically  installed  before each item in the pattern. They are all
4527         numbered 255.         numbered 255.
4528    
4529         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
4530         set), the external function is called. It is provided with  the  number         set),  the  external function is called. It is provided with the number
4531         of  the callout, the position in the pattern, and, optionally, one item         of the callout, the position in the pattern, and, optionally, one  item
4532         of data originally supplied by the caller of pcre_exec().  The  callout         of  data  originally supplied by the caller of pcre_exec(). The callout
4533         function  may cause matching to proceed, to backtrack, or to fail alto-         function may cause matching to proceed, to backtrack, or to fail  alto-
4534         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4535         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4536    
# Line 4545  AUTHOR Line 4549  AUTHOR
4549    
4550  REVISION  REVISION
4551    
4552         Last updated: 19 June 2007         Last updated: 06 August 2007
4553           Copyright (c) 1997-2007 University of Cambridge.
4554    ------------------------------------------------------------------------------
4555    
4556    
4557    PCRESYNTAX(3)                                                    PCRESYNTAX(3)
4558    
4559    
4560    NAME
4561           PCRE - Perl-compatible regular expressions
4562    
4563    
4564    PCRE REGULAR EXPRESSION SYNTAX SUMMARY
4565    
4566           The  full syntax and semantics of the regular expressions that are sup-
4567           ported by PCRE are described in  the  pcrepattern  documentation.  This
4568           document contains just a quick-reference summary of the syntax.
4569    
4570    
4571    QUOTING
4572    
4573             \x         where x is non-alphanumeric is a literal x
4574             \Q...\E    treat enclosed characters as literal
4575    
4576    
4577    CHARACTERS
4578    
4579             \a         alarm, that is, the BEL character (hex 07)
4580             \cx        "control-x", where x is any character
4581             \e         escape (hex 1B)
4582             \f         formfeed (hex 0C)
4583             \n         newline (hex 0A)
4584             \r         carriage return (hex 0D)
4585             \t         tab (hex 09)
4586             \ddd       character with octal code ddd, or backreference
4587             \xhh       character with hex code hh
4588             \x{hhh..}  character with hex code hhh..
4589    
4590    
4591    CHARACTER TYPES
4592    
4593             .          any character except newline;
4594                          in dotall mode, any character whatsoever
4595             \C         one byte, even in UTF-8 mode (best avoided)
4596             \d         a decimal digit
4597             \D         a character that is not a decimal digit
4598             \h         a horizontal whitespace character
4599             \H         a character that is not a horizontal whitespace character
4600             \p{xx}     a character with the xx property
4601             \P{xx}     a character without the xx property
4602             \R         a newline sequence
4603             \s         a whitespace character
4604             \S         a character that is not a whitespace character
4605             \v         a vertical whitespace character
4606             \V         a character that is not a vertical whitespace character
4607             \w         a "word" character
4608             \W         a "non-word" character
4609             \X         an extended Unicode sequence
4610    
4611           In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
4612    
4613    
4614    GENERAL CATEGORY PROPERTY CODES FOR \p and \P
4615    
4616             C          Other
4617             Cc         Control
4618             Cf         Format
4619             Cn         Unassigned
4620             Co         Private use
4621             Cs         Surrogate
4622    
4623             L          Letter
4624             Ll         Lower case letter
4625             Lm         Modifier letter
4626             Lo         Other letter
4627             Lt         Title case letter
4628             Lu         Upper case letter
4629             L&         Ll, Lu, or Lt
4630    
4631             M          Mark
4632             Mc         Spacing mark
4633             Me         Enclosing mark
4634             Mn         Non-spacing mark
4635    
4636             N          Number
4637             Nd         Decimal number
4638             Nl         Letter number
4639             No         Other number
4640    
4641             P          Punctuation
4642             Pc         Connector punctuation
4643             Pd         Dash punctuation
4644             Pe         Close punctuation
4645             Pf         Final punctuation
4646             Pi         Initial punctuation
4647             Po         Other punctuation
4648             Ps         Open punctuation
4649    
4650             S          Symbol
4651             Sc         Currency symbol
4652             Sk         Modifier symbol
4653             Sm         Mathematical symbol
4654             So         Other symbol
4655    
4656             Z          Separator
4657             Zl         Line separator
4658             Zp         Paragraph separator
4659             Zs         Space separator
4660    
4661    
4662    SCRIPT NAMES FOR \p AND \P
4663    
4664           Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
4665           Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,
4666           Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
4667           Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-
4668           gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,
4669           Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
4670           Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
4671           Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
4672           Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
4673    
4674    
4675    CHARACTER CLASSES
4676    
4677             [...]       positive character class
4678             [^...]      negative character class
4679             [x-y]       range (can be used for hex characters)
4680             [[:xxx:]]   positive POSIX named set
4681             [[:^xxx:]]  negative POSIX named set
4682    
4683             alnum       alphanumeric
4684             alpha       alphabetic
4685             ascii       0-127
4686             blank       space or tab
4687             cntrl       control character
4688             digit       decimal digit
4689             graph       printing, excluding space
4690             lower       lower case letter
4691             print       printing, including space
4692             punct       printing, excluding alphanumeric
4693             space       whitespace
4694             upper       upper case letter
4695             word        same as \w
4696             xdigit      hexadecimal digit
4697    
4698           In PCRE, POSIX character set names recognize only ASCII characters. You
4699           can use \Q...\E inside a character class.
4700    
4701    
4702    QUANTIFIERS
4703    
4704             ?           0 or 1, greedy
4705             ?+          0 or 1, possessive
4706             ??          0 or 1, lazy
4707             *           0 or more, greedy
4708             *+          0 or more, possessive
4709             *?          0 or more, lazy
4710             +           1 or more, greedy
4711             ++          1 or more, possessive
4712             +?          1 or more, lazy
4713             {n}         exactly n
4714             {n,m}       at least n, no more than m, greedy
4715             {n,m}+      at least n, no more than m, possessive
4716             {n,m}?      at least n, no more than m, lazy
4717             {n,}        n or more, greedy
4718             {n,}+       n or more, possessive
4719             {n,}?       n or more, lazy
4720    
4721    
4722    ANCHORS AND SIMPLE ASSERTIONS
4723    
4724             \b          word boundary
4725             \B          not a word boundary
4726             ^           start of subject
4727                          also after internal newline in multiline mode
4728             \A          start of subject
4729             $           end of subject
4730                          also before newline at end of subject
4731                          also before internal newline in multiline mode
4732             \Z          end of subject
4733                          also before newline at end of subject
4734             \z          end of subject
4735             \G          first matching position in subject
4736    
4737    
4738    MATCH POINT RESET
4739    
4740             \K          reset start of match
4741    
4742    
4743    ALTERNATION
4744    
4745             expr|expr|expr...
4746    
4747    
4748    CAPTURING
4749    
4750             (...)          capturing group
4751             (?<name>...)   named capturing group (Perl)
4752             (?'name'...)   named capturing group (Perl)
4753             (?P<name>...)  named capturing group (Python)
4754             (?:...)        non-capturing group
4755             (?|...)        non-capturing group; reset group numbers for
4756                             capturing groups in each alternative
4757    
4758    
4759    ATOMIC GROUPS
4760    
4761             (?>...)        atomic, non-capturing group
4762    
4763    
4764    COMMENT
4765    
4766             (?#....)       comment (not nestable)
4767    
4768    
4769    OPTION SETTING
4770    
4771             (?i)           caseless
4772             (?J)           allow duplicate names
4773             (?m)           multiline
4774             (?s)           single line (dotall)
4775             (?U)           default ungreedy (lazy)
4776             (?x)           extended (ignore white space)
4777             (?-...)        unset option(s)
4778    
4779    
4780    LOOKAHEAD AND LOOKBEHIND ASSERTIONS
4781    
4782             (?=...)        positive look ahead
4783             (?!...)        negative look ahead
4784             (?<=...)       positive look behind
4785             (?<!...)       negative look behind
4786    
4787           Each top-level branch of a look behind must be of a fixed length.
4788    
4789    
4790    BACKREFERENCES
4791    
4792             \n             reference by number (can be ambiguous)
4793             \gn            reference by number
4794             \g{n}          reference by number
4795             \g{-n}         relative reference by number
4796             \k<name>       reference by name (Perl)
4797             \k'name'       reference by name (Perl)
4798             \g{name}       reference by name (Perl)
4799             \k{name}       reference by name (.NET)
4800             (?P=name)      reference by name (Python)
4801    
4802    
4803    SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
4804    
4805             (?R)           recurse whole pattern
4806             (?n)           call subpattern by absolute number
4807             (?+n)          call subpattern by relative number
4808             (?-n)          call subpattern by relative number
4809             (?&name)       call subpattern by name (Perl)
4810             (?P>name)      call subpattern by name (Python)
4811    
4812    
4813    CONDITIONAL PATTERNS
4814    
4815             (?(condition)yes-pattern)
4816             (?(condition)yes-pattern|no-pattern)
4817    
4818             (?(n)...       absolute reference condition
4819             (?(+n)...      relative reference condition
4820             (?(-n)...      relative reference condition
4821             (?(<name>)...  named reference condition (Perl)
4822             (?('name')...  named reference condition (Perl)
4823             (?(name)...    named reference condition (PCRE)
4824             (?(R)...       overall recursion condition
4825             (?(Rn)...      specific group recursion condition
4826             (?(R&name)...  specific recursion condition
4827             (?(DEFINE)...  define subpattern for reference
4828             (?(assert)...  assertion condition
4829    
4830    
4831    CALLOUTS
4832    
4833             (?C)      callout
4834             (?Cn)     callout with data n
4835    
4836    
4837    SEE ALSO
4838    
4839           pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
4840    
4841    
4842    AUTHOR
4843    
4844           Philip Hazel
4845           University Computing Service
4846           Cambridge CB2 3QH, England.
4847    
4848    
4849    REVISION
4850    
4851           Last updated: 06 August 2007
4852         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4853  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4854    

