Diff of /code/trunk/doc/pcre.txt

revision 834 by ph10, Tue Dec 6 15:38:01 2011 UTC revision 835 by ph10, Wed Dec 28 16:10:09 2011 UTC
# Line 633  THE ALTERNATIVE MATCHING ALGORITHM Line 633  THE ALTERNATIVE MATCHING ALGORITHM
633         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
634    
635         7.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
636         single byte, even in UTF-8  mode,  is  not  supported  in  UTF-8  mode,         single byte, even in UTF-8 mode, is not supported because the  alterna-
637         because  the alternative algorithm moves through the subject string one         tive  algorithm  moves  through  the  subject string one character at a
638         character at a time, for all active paths through the tree.         time, for all active paths through the tree.
639    
640         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
641         are  not  supported.  (*FAIL)  is supported, and behaves like a failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
# Line 685  AUTHOR Line 685  AUTHOR
685    
686  REVISION  REVISION
687    
688         Last updated: 19 November 2011         Last updated: 17 November 2010
689         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
690  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
691    
# Line 1256  COMPILING A PATTERN Line 1256  COMPILING A PATTERN
1256         set  (assuming  it can find an "a" in the subject), whereas it fails by         set  (assuming  it can find an "a" in the subject), whereas it fails by
1257         default, for Perl compatibility.         default, for Perl compatibility.
1258    
        (3) \U matches an upper case "U" character; by default \U causes a com-  
        pile time error (Perl uses \U to upper case subsequent characters).  
   
        (4) \u matches a lower case "u" character unless it is followed by four  
        hexadecimal digits, in which case the hexadecimal  number  defines  the  
        code  point  to match. By default, \u causes a compile time error (Perl  
        uses it to upper case the following character).  
   
        (5) \x matches a lower case "x" character unless it is followed by  two  
        hexadecimal  digits,  in  which case the hexadecimal number defines the  
        code point to match. By default, as in Perl, a  hexadecimal  number  is  
        always expected after \x, but it may have zero, one, or two digits (so,  
        for example, \xz matches a binary zero character followed by z).  
   
1259           PCRE_MULTILINE           PCRE_MULTILINE
1260    
1261         By default, PCRE treats the subject string as consisting  of  a  single         By default, PCRE treats the subject string as consisting  of  a  single
# Line 1724  INFORMATION ABOUT A PATTERN Line 1710  INFORMATION ABOUT A PATTERN
1710         compiler could not handle this particular pattern. See the pcrejit doc-         compiler could not handle this particular pattern. See the pcrejit doc-
1711         umentation for details of what can and cannot be handled.         umentation for details of what can and cannot be handled.
1712    
          PCRE_INFO_JITSIZE  
   
        If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE  
        option, return the size of the  JIT  compiled  code,  otherwise  return  
        zero. The fourth argument should point to a size_t variable.  
   
1713           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1714    
1715         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 1838  INFORMATION ABOUT A PATTERN Line 1818  INFORMATION ABOUT A PATTERN
1818    
1819           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1820    
1821         Return  the  size  of  the compiled pattern. The fourth argument should         Return  the  size  of the compiled pattern, that is, the value that was
1822         point to a size_t variable. This value does not include the size of the         passed as the argument to pcre_malloc() when PCRE was getting memory in
1823         pcre  structure  that  is returned by pcre_compile(). The value that is         which to place the compiled data. The fourth argument should point to a
1824         passed as the argument to pcre_malloc() when pcre_compile() is  getting         size_t variable.
        memory  in  which  to  place the compiled data is the value returned by  
        this option plus the size of the pcre structure.  Studying  a  compiled  
        pattern, with or without JIT, does not alter the value returned by this  
        option.  
1825    
1826           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1827    
# Line 3004  AUTHOR Line 2980  AUTHOR
2980    
2981  REVISION  REVISION
2982    
2983         Last updated: 02 December 2011         Last updated: 23 September 2011
2984         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
2985  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2986    
# Line 3167  THE CALLOUT INTERFACE Line 3143  THE CALLOUT INTERFACE
3143    
3144         The mark field is present from version 2 of the pcre_callout structure.         The mark field is present from version 2 of the pcre_callout structure.
3145         In  callouts  from pcre_exec() it contains a pointer to the zero-termi-         In  callouts  from pcre_exec() it contains a pointer to the zero-termi-
3146         nated name of the most recently passed (*MARK),  (*PRUNE),  or  (*THEN)         nated name of the most recently passed (*MARK) item in  the  match,  or
3147         item in the match, or NULL if no such items have been passed. Instances         NULL if there are no (*MARK)s in the current matching path. In callouts
3148         of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a  previous         from pcre_dfa_exec() this field always contains NULL.
        (*MARK).  In  callouts  from pcre_dfa_exec() this field always contains  
        NULL.  
3149    
3150    
3151  RETURN VALUES  RETURN VALUES
# Line 3199  AUTHOR Line 3173  AUTHOR
3173    
3174  REVISION  REVISION
3175    
3176         Last updated: 30 November 2011         Last updated: 26 August 2011
3177         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3178  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3179    
# Line 3244  DIFFERENCES BETWEEN PCRE AND PERL Line 3218  DIFFERENCES BETWEEN PCRE AND PERL
3218         its own, matching a non-newline character, is supported.) In fact these         its own, matching a non-newline character, is supported.) In fact these
3219         are implemented by Perl's general string-handling and are not  part  of         are implemented by Perl's general string-handling and are not  part  of
3220         its  pattern  matching engine. If any of these are encountered by PCRE,         its  pattern  matching engine. If any of these are encountered by PCRE,
3221         an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-         an error is generated.
        PAT  option  is set, \U and \u are interpreted as JavaScript interprets  
        them.  
3222    
3223         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
3224         is  built  with Unicode character property support. The properties that         is  built  with Unicode character property support. The properties that
# Line 3373  AUTHOR Line 3345  AUTHOR
3345    
3346  REVISION  REVISION
3347    
3348         Last updated: 14 November 2011         Last updated: 09 October 2011
3349         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3350  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3351    
# Line 3600  BACKSLASH Line 3572  BACKSLASH
3572           \t        tab (hex 09)           \t        tab (hex 09)
3573           \ddd      character with octal code ddd, or back reference           \ddd      character with octal code ddd, or back reference
3574           \xhh      character with hex code hh           \xhh      character with hex code hh
3575           \x{hhh..} character with hex code hhh.. (non-JavaScript mode)           \x{hhh..} character with hex code hhh..
          \uhhhh    character with hex code hhhh (JavaScript mode only)  
3576    
3577         The precise effect of \cx is as follows: if x is a lower  case  letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
3578         it  is converted to upper case. Then bit 6 of the character (hex 40) is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
# Line 3612  BACKSLASH Line 3583  BACKSLASH
3583         is compiled in EBCDIC mode, all byte values are  valid.  A  lower  case         is compiled in EBCDIC mode, all byte values are  valid.  A  lower  case
3584         letter is converted to upper case, and then the 0xc0 bits are flipped.)         letter is converted to upper case, and then the 0xc0 bits are flipped.)
3585    
3586         By  default,  after  \x,  from  zero to two hexadecimal digits are read         After  \x, from zero to two hexadecimal digits are read (letters can be
3587         (letters can be in upper or lower case). Any number of hexadecimal dig-         in upper or lower case). Any number of hexadecimal  digits  may  appear
3588         its  may  appear between \x{ and }, but the value of the character code         between  \x{  and  },  but the value of the character code must be less
3589         must be less than 256 in non-UTF-8 mode, and less than 2**31  in  UTF-8         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
3590         mode.  That is, the maximum value in hexadecimal is 7FFFFFFF. Note that         the  maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
3591         this is bigger than the largest Unicode code point, which is 10FFFF.         than the largest Unicode code point, which is 10FFFF.
3592    
3593         If characters other than hexadecimal digits appear between \x{  and  },         If characters other than hexadecimal digits appear between \x{  and  },
3594         or if there is no terminating }, this form of escape is not recognized.         or if there is no terminating }, this form of escape is not recognized.
# Line 3625  BACKSLASH Line 3596  BACKSLASH
3596         escape,  with  no  following  digits, giving a character whose value is         escape,  with  no  following  digits, giving a character whose value is
3597         zero.         zero.
3598    
        If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation  of  \x  
        is  as  just described only when it is followed by two hexadecimal dig-  
        its.  Otherwise, it matches a  literal  "x"  character.  In  JavaScript  
        mode, support for code points greater than 256 is provided by \u, which  
        must be followed by four hexadecimal digits;  otherwise  it  matches  a  
        literal "u" character.  
   
3599         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
3600         two syntaxes for \x (or by \u in JavaScript mode). There is no  differ-         two  syntaxes  for  \x. There is no difference in the way they are han-
3601         ence in the way they are handled. For example, \xdc is exactly the same         dled. For example, \xdc is exactly the same as \x{dc}.
        as \x{dc} (or \u00dc in JavaScript mode).  
3602    
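The rules for \x and \u described above can be sketched in Python. This is only an illustration of the documented behaviour, not PCRE's own parser. The function names and the javascript_compat parameter are hypothetical, the limit on code point values is not enforced, and treating an empty \x{} as unrecognized is an assumption.

```python
HEX = set("0123456789abcdefABCDEF")


def read_x_escape(pattern, pos, javascript_compat=False):
    """Interpret a backslash-x escape that starts at pattern[pos].

    Returns (code_point, next_pos). Hypothetical helper, not a PCRE function.
    """
    assert pattern.startswith("\\x", pos)
    i = pos + 2
    # \x{hhh..} is only available outside JavaScript mode.
    if not javascript_compat and i < len(pattern) and pattern[i] == "{":
        close = pattern.find("}", i)
        digits = pattern[i + 1:close] if close != -1 else ""
        # Assumption: an empty \x{} counts as "not recognized".
        if digits and all(c in HEX for c in digits):
            return int(digits, 16), close + 1
        # Not a valid \x{...}: fall through to the basic form.
    digits = ""
    while i < len(pattern) and len(digits) < 2 and pattern[i] in HEX:
        digits += pattern[i]
        i += 1
    if javascript_compat and len(digits) != 2:
        # JavaScript mode: without exactly two hex digits, \x is a literal "x".
        return ord("x"), pos + 2
    # Default mode: zero, one, or two digits; no digits gives the value zero.
    return int(digits or "0", 16), i


def read_u_escape_js(pattern, pos):
    """JavaScript mode only: \\u needs four hex digits, else it is a literal "u"."""
    assert pattern.startswith("\\u", pos)
    digits = pattern[pos + 2:pos + 6]
    if len(digits) == 4 and all(c in HEX for c in digits):
        return int(digits, 16), pos + 6
    return ord("u"), pos + 2
```

With these helpers, \xdc, \x{dc}, and (in JavaScript mode) \u00dc all yield the code point 0xdc, while \xz yields zero by default and a literal "x" in JavaScript mode; in both cases the "z" is left to be read as the next character.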
3603         After \0 up to two further octal digits are read. If  there  are  fewer         After \0 up to two further octal digits are read. If  there  are  fewer
3604         than  two  digits,  just  those  that  are  present  are used. Thus the         than  two  digits,  just  those  that  are  present  are used. Thus the
# Line 3679  BACKSLASH Line 3642  BACKSLASH
3642    
3643         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
3644         inside and outside character classes. In addition, inside  a  character         inside and outside character classes. In addition, inside  a  character
3645         class, \b is interpreted as the backspace character (hex 08).         class,  the  sequence \b is interpreted as the backspace character (hex
3646           08). The sequences \B, \N, \R, and \X are not special inside a  charac-
3647         \N  is not allowed in a character class. \B, \R, and \X are not special         ter  class.  Like  any  other  unrecognized  escape sequences, they are
3648         inside a character class. Like  other  unrecognized  escape  sequences,         treated as the literal characters "B", "N", "R", and  "X"  by  default,
3649         they  are  treated  as  the  literal  characters  "B",  "R", and "X" by         but cause an error if the PCRE_EXTRA option is set. Outside a character
3650         default, but cause an error if the PCRE_EXTRA option is set. Outside  a         class, these sequences have different meanings.
        character class, these sequences have different meanings.  
   
    Unsupported escape sequences  
   
        In  Perl, the sequences \l, \L, \u, and \U are recognized by its string  
        handler and used  to  modify  the  case  of  following  characters.  By  
        default,  PCRE does not support these escape sequences. However, if the  
        PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U"  character,  and  
        \u can be used to define a character by code point, as described in the  
        previous section.  
3651    
3652     Absolute and relative back references     Absolute and relative back references
3653    
# Line 3729  BACKSLASH Line 3682  BACKSLASH
3682    
3683         There is also the single sequence \N, which matches a non-newline char-         There is also the single sequence \N, which matches a non-newline char-
3684         acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is         acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
3685         not set. Perl also uses \N to match characters by name; PCRE  does  not         not set.
        support this.  
3686    
3687         Each  pair of lower and upper case escape sequences partitions the com-         Each pair of lower and upper case escape sequences partitions the  com-
3688         plete set of characters into two disjoint  sets.  Any  given  character         plete  set  of  characters  into two disjoint sets. Any given character
3689         matches  one, and only one, of each pair. The sequences can appear both         matches one, and only one, of each pair. The sequences can appear  both
3690         inside and outside character classes. They each match one character  of         inside  and outside character classes. They each match one character of
3691         the  appropriate  type.  If the current matching point is at the end of         the appropriate type. If the current matching point is at  the  end  of
3692         the subject string, all of them fail, because there is no character  to         the  subject string, all of them fail, because there is no character to
3693         match.         match.
3694    
3695         For  compatibility  with Perl, \s does not match the VT character (code         For compatibility with Perl, \s does not match the VT  character  (code
3696         11).  This makes it different from the POSIX "space" class. The  \s         11).   This makes it different from the POSIX "space" class. The \s
3697         characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
3698         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
3699         ter. In PCRE, it never does.         ter. In PCRE, it never does.
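The difference from the POSIX "space" class can be written out as two small sets. This is a sketch in Python; the names are hypothetical, and the POSIX set shown is the one for the default "C" locale.

```python
# Characters matched by \s in PCRE by default: HT, LF, FF, CR, space.
PCRE_DEFAULT_SPACE = {9, 10, 12, 13, 32}

# The POSIX "space" class in the C locale also contains VT (code 11).
POSIX_SPACE = PCRE_DEFAULT_SPACE | {11}


def matches_backslash_s(ch):
    """Return True if ch would match \\s under the default (non-UCP) rules."""
    return ord(ch) in PCRE_DEFAULT_SPACE
```

The only member of POSIX_SPACE that matches_backslash_s() rejects is the vertical tab.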
3700    
3701         A  "word"  character is an underscore or any character that is a letter         A "word" character is an underscore or any character that is  a  letter
3702         or digit.  By default, the definition of letters  and  digits  is  con-         or  digit.   By  default,  the definition of letters and digits is con-
3703         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
3704         specific matching is taking place (see "Locale support" in the  pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
3705         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3706         systems, or "french" in Windows, some character codes greater than  128         systems,  or "french" in Windows, some character codes greater than 128
3707         are  used  for  accented letters, and these are then matched by \w. The         are used for accented letters, and these are then matched  by  \w.  The
3708         use of locales with Unicode is discouraged.         use of locales with Unicode is discouraged.
3709    
3710         By default, in UTF-8 mode, characters  with  values  greater  than  128         By  default,  in  UTF-8  mode,  characters with values greater than 128
3711         never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These         never match \d, \s, or \w, and always  match  \D,  \S,  and  \W.  These
3712         sequences retain their original meanings from before UTF-8 support  was         sequences  retain their original meanings from before UTF-8 support was
3713         available,  mainly for efficiency reasons. However, if PCRE is compiled         available, mainly for efficiency reasons. However, if PCRE is  compiled
3714         with Unicode property support, and the PCRE_UCP option is set, the  be-         with  Unicode property support, and the PCRE_UCP option is set, the be-
3715         haviour  is  changed  so  that Unicode properties are used to determine         haviour is changed so that Unicode properties  are  used  to  determine
3716         character types, as follows:         character types, as follows:
3717    
3718           \d  any character that \p{Nd} matches (decimal digit)           \d  any character that \p{Nd} matches (decimal digit)
3719           \s  any character that \p{Z} matches, plus HT, LF, FF, CR           \s  any character that \p{Z} matches, plus HT, LF, FF, CR
3720           \w  any character that \p{L} or \p{N} matches, plus underscore           \w  any character that \p{L} or \p{N} matches, plus underscore
3721    
3722         The upper case escapes match the inverse sets of characters. Note  that         The  upper case escapes match the inverse sets of characters. Note that
3723         \d  matches  only decimal digits, whereas \w matches any Unicode digit,         \d matches only decimal digits, whereas \w matches any  Unicode  digit,
3724         as well as any Unicode letter, and underscore. Note also that  PCRE_UCP         as  well as any Unicode letter, and underscore. Note also that PCRE_UCP
3725         affects  \b,  and  \B  because  they are defined in terms of \w and \W.         affects \b, and \B because they are defined in  terms  of  \w  and  \W.
3726         Matching these sequences is noticeably slower when PCRE_UCP is set.         Matching these sequences is noticeably slower when PCRE_UCP is set.
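The PCRE_UCP definitions above can be approximated with Python's unicodedata module. This is a sketch only: the function names are hypothetical, and Python's Unicode tables may come from a different Unicode version than the one a given PCRE build uses.

```python
import unicodedata


def ucp_digit(ch):
    """\\d under PCRE_UCP: any character that \\p{Nd} matches."""
    return unicodedata.category(ch) == "Nd"


def ucp_space(ch):
    """\\s under PCRE_UCP: any character that \\p{Z} matches, plus HT, LF, FF, CR."""
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"


def ucp_word(ch):
    """\\w under PCRE_UCP: any character that \\p{L} or \\p{N} matches, plus underscore."""
    return unicodedata.category(ch)[0] in "LN" or ch == "_"
```

For example, U+00B2 (superscript two) has category No, so it counts as a word character but not as a decimal digit, which is the distinction between \d and \w noted above.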
3727    
3728         The sequences \h, \H, \v, and \V are features that were added  to  Perl         The  sequences  \h, \H, \v, and \V are features that were added to Perl
3729         at  release  5.10. In contrast to the other sequences, which match only         at release 5.10. In contrast to the other sequences, which  match  only
3730         ASCII characters by default, these  always  match  certain  high-valued         ASCII  characters  by  default,  these always match certain high-valued
3731         codepoints  in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-         codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The  horizon-
3732         tal space characters are:         tal space characters are:
3733    
3734           U+0009     Horizontal tab           U+0009     Horizontal tab
# Line 3811  BACKSLASH Line 3763  BACKSLASH
3763    
3764     Newline sequences     Newline sequences
3765    
3766         Outside a character class, by default, the escape sequence  \R  matches         Outside  a  character class, by default, the escape sequence \R matches
3767         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
3768         following:         following:
3769    
3770           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3771    
3772         This is an example of an "atomic group", details  of  which  are  given         This  is  an  example  of an "atomic group", details of which are given
3773         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
3774         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
3775         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3776         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
3777         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
3778    
3779         In  UTF-8  mode, two additional characters whose codepoints are greater         In UTF-8 mode, two additional characters whose codepoints  are  greater
3780         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3781         rator,  U+2029).   Unicode character property support is not needed for         rator, U+2029).  Unicode character property support is not  needed  for
3782         these characters to be recognized.         these characters to be recognized.
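The default behaviour of \R can be sketched as a function that reports how many characters it would consume at a given position. This is an illustration in Python operating on a string of code points rather than bytes; the function name and the utf8 parameter are hypothetical. Trying CR LF first and never revisiting that choice mirrors the atomic group.

```python
SINGLE_NEWLINES = {"\n", "\x0b", "\f", "\r", "\x85"}  # LF, VT, FF, CR, NEL
UTF8_EXTRA = {"\u2028", "\u2029"}                     # LS, PS (UTF-8 mode only)


def match_backslash_r(subject, pos, utf8=False):
    """Return the number of characters \\R would match at pos, or 0 for no match."""
    # The two-character sequence CR LF is a single unit that cannot be split.
    if subject.startswith("\r\n", pos):
        return 2
    if pos < len(subject):
        ch = subject[pos]
        if ch in SINGLE_NEWLINES or (utf8 and ch in UTF8_EXTRA):
            return 1
    return 0
```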
3783    
3784         It is possible to restrict \R to match only CR, LF, or CRLF (instead of         It is possible to restrict \R to match only CR, LF, or CRLF (instead of
3785         the  complete  set  of  Unicode  line  endings)  by  setting the option         the complete set  of  Unicode  line  endings)  by  setting  the  option
3786         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
3787         (BSR is an abbreviation for "backslash R".) This can be made the default         (BSR is an abbreviation for "backslash R".) This can be made the default
3788         when PCRE is built; if this is the case, the  other  behaviour  can  be         when  PCRE  is  built;  if this is the case, the other behaviour can be
3789         requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to         requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
3790         specify these settings by starting a pattern string  with  one  of  the         specify  these  settings  by  starting a pattern string with one of the
3791         following sequences:         following sequences:
3792    
3793           (*BSR_ANYCRLF)   CR, LF, or CRLF only           (*BSR_ANYCRLF)   CR, LF, or CRLF only
3794           (*BSR_UNICODE)   any Unicode newline sequence           (*BSR_UNICODE)   any Unicode newline sequence
3795    
3796         These  override  the default and the options given to pcre_compile() or         These override the default and the options given to  pcre_compile()  or
3797         pcre_compile2(), but  they  can  be  overridden  by  options  given  to         pcre_compile2(),  but  they  can  be  overridden  by  options  given to
3798         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
3799         are not Perl-compatible, are recognized only at the  very  start  of  a         are  not  Perl-compatible,  are  recognized only at the very start of a
3800         pattern,  and that they must be in upper case. If more than one of them         pattern, and that they must be in upper case. If more than one of  them
3801         is present, the last one is used. They can be combined with a change of         is present, the last one is used. They can be combined with a change of
3802         newline convention; for example, a pattern can start with:         newline convention; for example, a pattern can start with:
3803    
3804           (*ANY)(*BSR_ANYCRLF)           (*ANY)(*BSR_ANYCRLF)
3805    
3806         They can also be combined with the (*UTF8) or (*UCP) special sequences.         They can also be combined with the (*UTF8) or (*UCP) special sequences.
3807         Inside a character class, \R  is  treated  as  an  unrecognized  escape         Inside  a  character  class,  \R  is  treated as an unrecognized escape
3808         sequence, and so matches the letter "R" by default, but causes an error         sequence, and so matches the letter "R" by default, but causes an error
3809         if PCRE_EXTRA is set.         if PCRE_EXTRA is set.
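The rules for the leading (*BSR_...) settings (recognized only at the very start of a pattern, upper case only, last one wins) can be sketched as follows. This is a simplified Python illustration, not PCRE's parser: the function name and return values are hypothetical, the default is a parameter because it depends on how PCRE was built, and the only other leading items it skips are the ones named in this passage.

```python
import re

# A leading special item such as (*ANY) or (*BSR_ANYCRLF); upper case only.
_LEADING_ITEM = re.compile(r"\(\*([A-Z0-9_]+)\)")

# Other leading items mentioned in the text above (assumption: not a full list).
_OTHER_ITEMS = {"ANY", "UTF8", "UCP"}


def leading_bsr_setting(pattern, default="unicode"):
    """Return "anycrlf" or "unicode" according to the start of the pattern."""
    setting = default
    pos = 0
    while True:
        m = _LEADING_ITEM.match(pattern, pos)
        if m is None:
            break
        name = m.group(1)
        if name == "BSR_ANYCRLF":
            setting = "anycrlf"
        elif name == "BSR_UNICODE":
            setting = "unicode"
        elif name not in _OTHER_ITEMS:
            break  # not a leading setting; the pattern proper starts here
        pos = m.end()
    return setting
```

A pattern that starts with (*ANY)(*BSR_ANYCRLF) yields "anycrlf", whereas the same item written in lower case, or placed after the first pattern character, leaves the default unchanged.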
3810    
3811     Unicode character properties     Unicode character properties
3812    
3813         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3814         tional  escape sequences that match characters with specific properties         tional escape sequences that match characters with specific  properties
3815         are available.  When not in UTF-8 mode, these sequences are  of  course         are  available.   When not in UTF-8 mode, these sequences are of course
3816         limited  to  testing characters whose codepoints are less than 256, but         limited to testing characters whose codepoints are less than  256,  but
3817         they do work in this mode.  The extra escape sequences are:         they do work in this mode.  The extra escape sequences are:
3818    
3819           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3820           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3821           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3822    
3823         The property names represented by xx above are limited to  the  Unicode         The  property  names represented by xx above are limited to the Unicode
3824         script names, the general category properties, "Any", which matches any         script names, the general category properties, "Any", which matches any
3825         character  (including  newline),  and  some  special  PCRE   properties         character   (including  newline),  and  some  special  PCRE  properties
3826         (described  in the next section).  Other Perl properties such as "InMu-         (described in the next section).  Other Perl properties such as  "InMu-
3827         sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}         sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
3828         does not match any characters, so always causes a match failure.         does not match any characters, so always causes a match failure.
3829    
3830         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3831         A character from one of these sets can be matched using a script  name.         A  character from one of these sets can be matched using a script name.
3832         For example:         For example:
3833    
3834           \p{Greek}           \p{Greek}
3835           \P{Han}           \P{Han}
3836    
3837         Those  that are not part of an identified script are lumped together as         Those that are not part of an identified script are lumped together  as
3838         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3839    
3840         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
3841         Buginese,  Buhid,  Canadian_Aboriginal, Carian, Cham, Cherokee, Common,         Buginese, Buhid, Canadian_Aboriginal, Carian, Cham,  Cherokee,  Common,
3842         Coptic,  Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,   Egyp-         Coptic,   Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,  Egyp-
3843         tian_Hieroglyphs,   Ethiopic,   Georgian,  Glagolitic,  Gothic,  Greek,         tian_Hieroglyphs,  Ethiopic,  Georgian,  Glagolitic,   Gothic,   Greek,
3844         Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana,  Impe-         Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana, Impe-
3845         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
3846         Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer,  Lao,         Javanese,  Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
3847         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,
3848         Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham,  Old_Italic,         Meetei_Mayek,  Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
3849         Old_Persian,  Old_South_Arabian,  Old_Turkic, Ol_Chiki, Oriya, Osmanya,         Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,  Oriya,  Osmanya,
3850         Phags_Pa, Phoenician, Rejang, Runic,  Samaritan,  Saurashtra,  Shavian,         Phags_Pa,  Phoenician,  Rejang,  Runic, Samaritan, Saurashtra, Shavian,
3851         Sinhala,  Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le,         Sinhala, Sundanese, Syloti_Nagri, Syriac,  Tagalog,  Tagbanwa,  Tai_Le,
3852         Tai_Tham, Tai_Viet, Tamil, Telugu,  Thaana,  Thai,  Tibetan,  Tifinagh,         Tai_Tham,  Tai_Viet,  Tamil,  Telugu,  Thaana, Thai, Tibetan, Tifinagh,
3853         Ugaritic, Vai, Yi.         Ugaritic, Vai, Yi.
3854    
3855         Each character has exactly one Unicode general category property, spec-         Each character has exactly one Unicode general category property, spec-
3856         ified by a two-letter abbreviation. For compatibility with Perl,  nega-         ified  by a two-letter abbreviation. For compatibility with Perl, nega-
3857         tion  can  be  specified  by including a circumflex between the opening         tion can be specified by including a  circumflex  between  the  opening
3858         brace and the property name.  For  example,  \p{^Lu}  is  the  same  as         brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
3859         \P{Lu}.         \P{Lu}.
3860    
3861         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3862         eral category properties that start with that letter. In this case,  in         eral  category properties that start with that letter. In this case, in
3863         the  absence of negation, the curly brackets in the escape sequence are         the absence of negation, the curly brackets in the escape sequence  are
3864         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3865    
3866           \p{L}           \p{L}
# Line 3960  BACKSLASH Line 3912  BACKSLASH
3912           Zp    Paragraph separator           Zp    Paragraph separator
3913           Zs    Space separator           Zs    Space separator
3914    
3915         The special property L& is also supported: it matches a character  that         The  special property L& is also supported: it matches a character that
3916         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
3917         classified as a modifier or "other".         classified as a modifier or "other".
3918    
3919         The Cs (Surrogate) property applies only to  characters  in  the  range         The  Cs  (Surrogate)  property  applies only to characters in the range
3920         U+D800  to  U+DFFF. Such characters are not valid in UTF-8 strings (see         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see
3921         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
3922         ing  has  been  turned off (see the discussion of PCRE_NO_UTF8_CHECK in         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in
3923         the pcreapi page). Perl does not support the Cs property.         the pcreapi page). Perl does not support the Cs property.
3924    
3925         The long synonyms for  property  names  that  Perl  supports  (such  as         The  long  synonyms  for  property  names  that  Perl supports (such as
3926         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
3927         any of these properties with "Is".         any of these properties with "Is".
3928    
3929         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3930         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3931         in the Unicode table.         in the Unicode table.
3932    
3933         Specifying caseless matching does not affect  these  escape  sequences.         Specifying  caseless  matching  does not affect these escape sequences.
3934         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3935    
3936         The  \X  escape  matches  any number of Unicode characters that form an         The \X escape matches any number of Unicode  characters  that  form  an
3937         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3938    
3939           (?>\PM\pM*)           (?>\PM\pM*)
3940    
3941         That is, it matches a character without the "mark"  property,  followed         That  is,  it matches a character without the "mark" property, followed
3942         by  zero  or  more  characters with the "mark" property, and treats the         by zero or more characters with the "mark"  property,  and  treats  the
3943         sequence as an atomic group (see below).  Characters  with  the  "mark"         sequence  as  an  atomic group (see below).  Characters with the "mark"
3944         property  are  typically  accents  that affect the preceding character.         property are typically accents that  affect  the  preceding  character.
3945         None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X         None  of  them  have  codepoints less than 256, so in non-UTF-8 mode \X
3946         matches any one character.         matches any one character.
3947    
3948         Note that recent versions of Perl have changed \X to match what Unicode         Note that recent versions of Perl have changed \X to match what Unicode
3949         calls an "extended grapheme cluster", which has a more complicated def-         calls an "extended grapheme cluster", which has a more complicated def-
3950         inition.         inition.
3951    
3952         Matching  characters  by Unicode property is not fast, because PCRE has         Matching characters by Unicode property is not fast, because  PCRE  has
3953         to search a structure that contains  data  for  over  fifteen  thousand         to  search  a  structure  that  contains data for over fifteen thousand
3954         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3955         \w do not use Unicode properties in PCRE by  default,  though  you  can         \w  do  not  use  Unicode properties in PCRE by default, though you can
3956         make them do so by setting the PCRE_UCP option for pcre_compile() or by         make them do so by setting the PCRE_UCP option for pcre_compile() or by
3957         starting the pattern with (*UCP).         starting the pattern with (*UCP).
3958    
3959     PCRE's additional properties     PCRE's additional properties
3960    
3961         As well as the standard Unicode properties described  in  the  previous         As  well  as  the standard Unicode properties described in the previous
3962         section,  PCRE supports four more that make it possible to convert tra-         section, PCRE supports four more that make it possible to convert  tra-
3963         ditional escape sequences such as \w and \s and POSIX character classes         ditional escape sequences such as \w and \s and POSIX character classes
3964         to use Unicode properties. PCRE uses these non-standard, non-Perl prop-         to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
3965         erties internally when PCRE_UCP is set. They are:         erties internally when PCRE_UCP is set. They are:
# Line 4017  BACKSLASH Line 3969  BACKSLASH
3969           Xsp   Any Perl space character           Xsp   Any Perl space character
3970           Xwd   Any Perl "word" character           Xwd   Any Perl "word" character
3971    
3972         Xan matches characters that have either the L (letter) or the  N  (num-         Xan  matches  characters that have either the L (letter) or the N (num-
3973         ber)  property. Xps matches the characters tab, linefeed, vertical tab,         ber) property. Xps matches the characters tab, linefeed, vertical  tab,
3974         formfeed, or carriage return, and any other character that  has  the  Z         formfeed,  or  carriage  return, and any other character that has the Z
3975         (separator) property.  Xsp is the same as Xps, except that vertical tab         (separator) property.  Xsp is the same as Xps, except that vertical tab
3976         is excluded. Xwd matches the same characters as Xan, plus underscore.         is excluded. Xwd matches the same characters as Xan, plus underscore.
3977    
3978     Resetting the match start     Resetting the match start
3979    
3980         The escape sequence \K causes any previously matched characters not  to         The  escape sequence \K causes any previously matched characters not to
3981         be included in the final matched sequence. For example, the pattern:         be included in the final matched sequence. For example, the pattern:
3982    
3983           foo\Kbar           foo\Kbar
3984    
3985         matches  "foobar",  but reports that it has matched "bar". This feature         matches "foobar", but reports that it has matched "bar".  This  feature
3986         is similar to a lookbehind assertion (described  below).   However,  in         is  similar  to  a lookbehind assertion (described below).  However, in
3987         this  case, the part of the subject before the real match does not have         this case, the part of the subject before the real match does not  have
3988         to be of fixed length, as lookbehind assertions do. The use of \K  does         to  be of fixed length, as lookbehind assertions do. The use of \K does
3989         not  interfere  with  the setting of captured substrings.  For example,         not interfere with the setting of captured  substrings.   For  example,
3990         when the pattern         when the pattern
3991    
3992           (foo)\Kbar           (foo)\Kbar
3993    
3994         matches "foobar", the first substring is still set to "foo".         matches "foobar", the first substring is still set to "foo".
3995    
3996         Perl documents that the use  of  \K  within  assertions  is  "not  well         Perl  documents  that  the  use  of  \K  within assertions is "not well
3997         defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive         defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
3998         assertions, but is ignored in negative assertions.         assertions, but is ignored in negative assertions.
3999    
4000     Simple assertions     Simple assertions
4001    
4002         The final use of backslash is for certain simple assertions. An  asser-         The  final use of backslash is for certain simple assertions. An asser-
4003         tion  specifies a condition that has to be met at a particular point in         tion specifies a condition that has to be met at a particular point  in
4004         a match, without consuming any characters from the subject string.  The         a  match, without consuming any characters from the subject string. The
4005         use  of subpatterns for more complicated assertions is described below.         use of subpatterns for more complicated assertions is described  below.
4006         The backslashed assertions are:         The backslashed assertions are:
4007    
4008           \b     matches at a word boundary           \b     matches at a word boundary
# Line 4061  BACKSLASH Line 4013  BACKSLASH
4013           \z     matches only at the end of the subject           \z     matches only at the end of the subject
4014           \G     matches at the first matching position in the subject           \G     matches at the first matching position in the subject
4015    
4016         Inside a character class, \b has a different meaning;  it  matches  the         Inside  a  character  class, \b has a different meaning; it matches the
4017         backspace  character.  If  any  other  of these assertions appears in a         backspace character. If any other of  these  assertions  appears  in  a
4018         character class, by default it matches the corresponding literal  char-         character  class, by default it matches the corresponding literal char-
4019         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
4020         PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-         PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-
4021         ated instead.         ated instead.
4022    
4023         A  word  boundary is a position in the subject string where the current         A word boundary is a position in the subject string where  the  current
4024         character and the previous character do not both match \w or  \W  (i.e.         character  and  the previous character do not both match \w or \W (i.e.
4025         one  matches  \w  and the other matches \W), or the start or end of the         one matches \w and the other matches \W), or the start or  end  of  the
4026         string if the first or last  character  matches  \w,  respectively.  In         string  if  the  first  or  last character matches \w, respectively. In
4027         UTF-8  mode,  the  meanings  of \w and \W can be changed by setting the         UTF-8 mode, the meanings of \w and \W can be  changed  by  setting  the
4028         PCRE_UCP option. When this is done, it also affects \b and \B.  Neither         PCRE_UCP  option. When this is done, it also affects \b and \B. Neither
4029         PCRE  nor  Perl has a separate "start of word" or "end of word" metase-         PCRE nor Perl has a separate "start of word" or "end of  word"  metase-
4030         quence. However, whatever follows \b normally determines which  it  is.         quence.  However,  whatever follows \b normally determines which it is.
4031         For example, the fragment \ba matches "a" at the start of a word.         For example, the fragment \ba matches "a" at the start of a word.
4032    
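       The word-boundary rules above can be tried out with a short  sketch.
       It uses Python's re module rather than PCRE itself, on the assumption
       that the two engines treat \b the same way for  plain  ASCII  text;
       the subject string is invented for illustration.

```python
import re

subject = "a banana apple"

# \ba matches "a" only where a word starts: offsets 0 ("a") and 9 ("apple").
# The "a" characters inside "banana" follow a word character, so \b fails.
word_initial = [m.start() for m in re.finditer(r"\ba", subject)]

# Inside a character class, \b means the backspace character (0x08).
backspace = re.search(r"[\b]", "x\by")

print(word_initial)        # [0, 9]
print(backspace.start())   # 1
```

       Because \b consumes nothing, whatever follows it decides whether  the
       boundary is a start of word or an end of word.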
4033         The  \A,  \Z,  and \z assertions differ from the traditional circumflex         The \A, \Z, and \z assertions differ from  the  traditional  circumflex
4034         and dollar (described in the next section) in that they only ever match         and dollar (described in the next section) in that they only ever match
4035         at  the  very start and end of the subject string, whatever options are         at the very start and end of the subject string, whatever  options  are
4036         set. Thus, they are independent of multiline mode. These  three  asser-         set.  Thus,  they are independent of multiline mode. These three asser-
4037         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
4038         affect only the behaviour of the circumflex and dollar  metacharacters.         affect  only the behaviour of the circumflex and dollar metacharacters.
4039         However,  if the startoffset argument of pcre_exec() is non-zero, indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
4040         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
4041         the  subject,  \A  can never match. The difference between \Z and \z is         the subject, \A can never match. The difference between \Z  and  \z  is
4042         that \Z matches before a newline at the end of the string as well as at         that \Z matches before a newline at the end of the string as well as at
4043         the very end, whereas \z matches only at the end.         the very end, whereas \z matches only at the end.
4044    
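       As a rough illustration of the difference between \Z and \z, here is
       a sketch in Python's re module, not PCRE. Note the  naming  mismatch,
       which is an assumption to keep in mind when reading it: Python's  \Z
       matches only at the very end (like PCRE's \z), whereas Python's  de-
       fault "$" also matches before a final newline (like PCRE's \Z).

```python
import re

subject = "abc\n"

# Python's default "$" plays the role of PCRE's \Z here: it matches at the
# very end of the subject or just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", subject)

# Python's \Z plays the role of PCRE's \z: only the very end will do, so
# "abc" followed by a newline does not satisfy it.
very_end_only = re.search(r"abc\Z", subject)

# Consuming the newline explicitly lets the end-only assertion succeed.
with_newline = re.search(r"abc\n\Z", subject)

print(before_final_newline.span())  # (0, 3)
print(very_end_only)                # None
print(with_newline.span())          # (0, 4)
```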
4045         The  \G assertion is true only when the current matching position is at         The \G assertion is true only when the current matching position is  at
4046         the start point of the match, as specified by the startoffset  argument         the  start point of the match, as specified by the startoffset argument
4047         of  pcre_exec().  It  differs  from \A when the value of startoffset is         of pcre_exec(). It differs from \A when the  value  of  startoffset  is
4048         non-zero. By calling pcre_exec() multiple times with appropriate  argu-         non-zero.  By calling pcre_exec() multiple times with appropriate argu-
4049         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
4050         mentation where \G can be useful.         mentation where \G can be useful.
4051    
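       The idea of mimicking /g by repeated calls with an advancing  start-
       offset can be sketched as follows. This is Python, not the PCRE API:
       the pos argument of Pattern.match() stands in for startoffset,  and
       match() itself is anchored at pos, which is assumed here to be a fair
       analogue of a pattern that begins with \G. The tokens() helper and
       the token pattern are hypothetical.

```python
import re

# Hypothetical token pattern: a run of digits or a run of lower case letters.
token = re.compile(r"\d+|[a-z]+")

def tokens(subject):
    """Collect consecutive tokens, stopping at the first gap."""
    found = []
    pos = 0
    while pos < len(subject):
        # match() succeeds only if a token starts exactly at pos, in the
        # way a \G-anchored pattern succeeds only at startoffset.
        m = token.match(subject, pos)
        if m is None:
            break
        found.append(m.group())
        pos = m.end()  # the next call resumes where this match ended
    return found

print(tokens("abc123def!456"))  # ['abc', '123', 'def']
```

       The loop stops at "!" because no token starts there; an  unanchored
       search would instead skip ahead and find "456".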
4052         Note, however, that PCRE's interpretation of \G, as the  start  of  the         Note,  however,  that  PCRE's interpretation of \G, as the start of the
4053         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
4054         end of the previous match. In Perl, these can  be  different  when  the         end  of  the  previous  match. In Perl, these can be different when the
4055         previously  matched  string was empty. Because PCRE does just one match         previously matched string was empty. Because PCRE does just  one  match
4056         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
4057    
4058         If all the alternatives of a pattern begin with \G, the  expression  is         If  all  the alternatives of a pattern begin with \G, the expression is
4059         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
4060         in the compiled regular expression.         in the compiled regular expression.
4061    
# Line 4111  BACKSLASH Line 4063  BACKSLASH
4063  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
4064    
4065         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
4066         character  is  an  assertion  that is true only if the current matching         character is an assertion that is true only  if  the  current  matching
4067         point is at the start of the subject string. If the  startoffset  argu-         point  is  at the start of the subject string. If the startoffset argu-
4068         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
4069         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
4070         has an entirely different meaning (see below).         has an entirely different meaning (see below).
4071    
4072         Circumflex  need  not be the first character of the pattern if a number         Circumflex need not be the first character of the pattern if  a  number
4073         of alternatives are involved, but it should be the first thing in  each         of  alternatives are involved, but it should be the first thing in each
4074         alternative  in  which  it appears if the pattern is ever to match that         alternative in which it appears if the pattern is ever  to  match  that
4075         branch. If all possible alternatives start with a circumflex, that  is,         branch.  If all possible alternatives start with a circumflex, that is,
4076         if  the  pattern  is constrained to match only at the start of the sub-         if the pattern is constrained to match only at the start  of  the  sub-
4077         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject,  it  is  said  to be an "anchored" pattern. (There are also other
4078         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
4079    
4080         A  dollar  character  is  an assertion that is true only if the current         A dollar character is an assertion that is true  only  if  the  current
4081         matching point is at the end of  the  subject  string,  or  immediately         matching  point  is  at  the  end of the subject string, or immediately
4082         before a newline at the end of the string (by default). Dollar need not         before a newline at the end of the string (by default). Dollar need not
4083         be the last character of the pattern if a number  of  alternatives  are         be  the  last  character of the pattern if a number of alternatives are
4084         involved,  but  it  should  be  the last item in any branch in which it         involved, but it should be the last item in  any  branch  in  which  it
4085         appears. Dollar has no special meaning in a character class.         appears. Dollar has no special meaning in a character class.
4086    
4087         The meaning of dollar can be changed so that it  matches  only  at  the         The  meaning  of  dollar  can be changed so that it matches only at the
4088         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
4089         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
4090    
4091         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
4092         PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
4093         matches immediately after internal newlines as well as at the start  of         matches  immediately after internal newlines as well as at the start of
4094         the  subject  string.  It  does not match after a newline that ends the         the subject string. It does not match after a  newline  that  ends  the
4095         string. A dollar matches before any newlines in the string, as well  as         string.  A dollar matches before any newlines in the string, as well as
4096         at  the very end, when PCRE_MULTILINE is set. When newline is specified         at the very end, when PCRE_MULTILINE is set. When newline is  specified
4097         as the two-character sequence CRLF, isolated CR and  LF  characters  do         as  the  two-character  sequence CRLF, isolated CR and LF characters do
4098         not indicate newlines.         not indicate newlines.
4099    
4100         For  example, the pattern /^abc$/ matches the subject string "def\nabc"         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
4101         (where \n represents a newline) in multiline mode, but  not  otherwise.         (where  \n  represents a newline) in multiline mode, but not otherwise.
4102         Consequently,  patterns  that  are anchored in single line mode because         Consequently, patterns that are anchored in single  line  mode  because
4103         all branches start with ^ are not anchored in  multiline  mode,  and  a         all  branches  start  with  ^ are not anchored in multiline mode, and a
4104         match  for  circumflex  is  possible  when  the startoffset argument of         match for circumflex is  possible  when  the  startoffset  argument  of
4105         pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
4106         PCRE_MULTILINE is set.         PCRE_MULTILINE is set.
4107    
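       The /^abc$/ example and the remark about startoffset can be  checked
       with a small sketch. It uses Python's re module as a stand-in, assum-
       ing that re.MULTILINE corresponds to PCRE_MULTILINE and that the pos
       argument of search() corresponds to startoffset; it does  not  call
       PCRE.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject, so the "abc"
# after the internal newline cannot be matched.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# With a non-zero start position, ^ cannot match in the default mode...
anchored_from_offset = re.compile(r"^abc").search("xxabc", 2)

# ...but in multiline mode it can, if the position follows a newline.
multiline_from_offset = re.compile(r"^abc", re.MULTILINE).search(subject, 4)

print(single_line)                   # None
print(multi_line.span())             # (4, 7)
print(anchored_from_offset)          # None
print(multiline_from_offset.span())  # (4, 7)
```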
4108         Note  that  the sequences \A, \Z, and \z can be used to match the start         Note that the sequences \A, \Z, and \z can be used to match  the  start
4109         and end of the subject in both modes, and if all branches of a  pattern         and  end of the subject in both modes, and if all branches of a pattern
4110         start  with  \A it is always anchored, whether or not PCRE_MULTILINE is         start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
4111         set.         set.
4112    
4113    
4114  FULL STOP (PERIOD, DOT) AND \N  FULL STOP (PERIOD, DOT) AND \N
4115    
4116         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
4117         ter  in  the subject string except (by default) a character that signi-         ter in the subject string except (by default) a character  that  signi-
4118         fies the end of a line. In UTF-8 mode, the  matched  character  may  be         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
4119         more than one byte long.         more than one byte long.
4120    
4121         When  a line ending is defined as a single character, dot never matches         When a line ending is defined as a single character, dot never  matches
4122         that character; when the two-character sequence CRLF is used, dot  does         that  character; when the two-character sequence CRLF is used, dot does
4123         not  match  CR  if  it  is immediately followed by LF, but otherwise it         not match CR if it is immediately followed  by  LF,  but  otherwise  it
4124         matches all characters (including isolated CRs and LFs). When any  Uni-         matches  all characters (including isolated CRs and LFs). When any Uni-
4125         code  line endings are being recognized, dot does not match CR or LF or         code line endings are being recognized, dot does not match CR or LF  or
4126         any of the other line ending characters.         any of the other line ending characters.
4127    
4128         The behaviour of dot with regard to newlines can  be  changed.  If  the         The  behaviour  of  dot  with regard to newlines can be changed. If the
4129         PCRE_DOTALL  option  is  set,  a dot matches any one character, without         PCRE_DOTALL option is set, a dot matches  any  one  character,  without
4130         exception. If the two-character sequence CRLF is present in the subject         exception. If the two-character sequence CRLF is present in the subject
4131         string, it takes two dots to match it.         string, it takes two dots to match it.
4132    
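       A brief sketch of the dot behaviour described above, again in Python
       rather than PCRE. Python's re module always treats LF as  the  only
       line ending, so this corresponds to the case where newline is a sin-
       gle LF character; re.DOTALL is assumed to be the counterpart of
       PCRE_DOTALL.

```python
import re

# By default a dot does not match the newline character.
default_dot = re.search(r"a.c", "a\nc")

# With DOTALL set, a dot matches any one character, without exception.
dotall_dot = re.search(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
one_dot_crlf = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)

print(default_dot)           # None
print(dotall_dot.span())     # (0, 3)
print(one_dot_crlf)          # None
print(two_dots_crlf.span())  # (0, 4)
```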
4133         The  handling of dot is entirely independent of the handling of circum-         The handling of dot is entirely independent of the handling of  circum-
4134         flex and dollar, the only relationship being  that  they  both  involve         flex  and  dollar,  the  only relationship being that they both involve
4135         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
4136    
4137         The  escape  sequence  \N  behaves  like  a  dot, except that it is not         The escape sequence \N behaves like  a  dot,  except  that  it  is  not
4138         affected by the PCRE_DOTALL option. In  other  words,  it  matches  any         affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
4139         character  except  one that signifies the end of a line. Perl also uses         character except one that signifies the end of a line.
        \N to match characters by name; PCRE does not support this.  
4140    
4141    
4142  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
# Line 4202  MATCHING A SINGLE BYTE Line 4153  MATCHING A SINGLE BYTE
4153         PCRE_NO_UTF8_CHECK option is used).         PCRE_NO_UTF8_CHECK option is used).
4154    
4155         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE  does  not  allow \C to appear in lookbehind assertions (described
4156         below) in UTF-8 mode, because this would make it impossible  to  calcu-         below), because in UTF-8 mode this would make it impossible  to  calcu-
4157         late the length of the lookbehind.         late the length of the lookbehind.
4158    
4159         In  general, the \C escape sequence is best avoided in UTF-8 mode. How-         In  general, the \C escape sequence is best avoided in UTF-8 mode. How-
# Line 5109  ASSERTIONS Line 5060  ASSERTIONS
5060         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
5061         rent position, the assertion fails.         rent position, the assertion fails.
5062    
5063         In  UTF-8 mode, PCRE does not allow the \C escape (which matches a sin-         PCRE does not allow the \C escape (which matches a single byte in UTF-8
5064         gle byte, even in UTF-8  mode)  to  appear  in  lookbehind  assertions,         mode) to appear in lookbehind assertions, because it makes it  impossi-
5065         because  it  makes it impossible to calculate the length of the lookbe-         ble  to  calculate the length of the lookbehind. The \X and \R escapes,
5066         hind. The \X and \R escapes,  which  can  match  different  numbers  of         which can match different numbers of bytes, are also not permitted.
        bytes, are also not permitted.  
5067    
5068         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
5069         lookbehinds, as long as the subpattern matches a  fixed-length  string.         lookbehinds,  as  long as the subpattern matches a fixed-length string.
5070         Recursion, however, is not supported.         Recursion, however, is not supported.
5071    
5072         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
5073         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
5074         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
5075    
5076           abcd$           abcd$
5077    
5078         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
5079         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
5080         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
5081         pattern is specified as         pattern is specified as
5082    
5083           ^.*abcd$           ^.*abcd$
5084    
5085         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
5086         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
5087         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
5088         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
5089         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
5090    
5091           ^.*+(?<=abcd)           ^.*+(?<=abcd)
5092    
5093         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
5094         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
5095         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
5096         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
5097         processing time.         processing time.
5098    
5099     Using multiple assertions     Using multiple assertions
# Line 5152  ASSERTIONS Line 5102  ASSERTIONS
5102    
5103           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
5104    
5105         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
5106         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
5107         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
5108         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
5109         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
5110         ceded  by  six  characters,  the first three of which are digits and the last         ceded by six characters, the first three of which are  digits  and  the  last
5111         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
5112         foo". A pattern to do that is         foo". A pattern to do that is
5113    
5114           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
5115    
5116         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
5117         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
5118         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
5119    
# Line 5171  ASSERTIONS Line 5121  ASSERTIONS
5121    
5122           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
5123    
5124         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
5125         is not preceded by "foo", while         is not preceded by "foo", while
5126    
5127           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
5128    
5129         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
5130         three characters that are not "999".         three characters that are not "999".
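
The behaviour of these stacked and nested assertions can be checked with a short sketch. It uses Python's re module rather than PCRE itself, on the assumption that fixed-length lookbehinds behave the same way in both for these patterns. The names below are illustrative only.

```python
import re

# Two lookbehinds applied independently at the same point in the subject.
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Six-character lookbehind: three digits, then any three characters,
# followed by a separate check that those last three are not "999".
digits_then_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# The same condition written as a lookahead nested inside the lookbehind.
nested = re.compile(r"(?<=\d{3}(?!999)...)foo")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None
```

With these definitions, "123abcfoo" is rejected by the first pattern and accepted by the other two, while "123999foo" is rejected by all three.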
5131    
5132    
5133  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
5134    
5135         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
5136         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
5137         on  the result of an assertion, or whether a specific capturing subpat-         on the result of an assertion, or whether a specific capturing  subpat-
5138         tern has already been matched. The two possible  forms  of  conditional         tern  has  already  been matched. The two possible forms of conditional
5139         subpattern are:         subpattern are:
5140    
5141           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5142           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5143    
5144         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
5145         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
5146         tives  in  the subpattern, a compile-time error occurs. Each of the two         tives in the subpattern, a compile-time error occurs. Each of  the  two
5147         alternatives may itself contain nested subpatterns of any form, includ-         alternatives may itself contain nested subpatterns of any form, includ-
5148         ing  conditional  subpatterns;  the  restriction  to  two  alternatives         ing  conditional  subpatterns;  the  restriction  to  two  alternatives
5149         applies only at the level of the condition. This pattern fragment is an         applies only at the level of the condition. This pattern fragment is an
# Line 5202  CONDITIONAL SUBPATTERNS Line 5152  CONDITIONAL SUBPATTERNS
5152           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
5153    
5154    
5155         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
5156         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
5157    
5158     Checking for a used subpattern by number     Checking for a used subpattern by number
5159    
5160         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
5161         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
5162         viously matched. If there is more than one  capturing  subpattern  with         viously  matched.  If  there is more than one capturing subpattern with
5163         the  same  number  (see  the earlier section about duplicate subpattern         the same number (see the earlier  section  about  duplicate  subpattern
5164         numbers), the condition is true if any of them have matched. An  alter-         numbers),  the condition is true if any of them have matched. An alter-
5165         native  notation is to precede the digits with a plus or minus sign. In         native notation is to precede the digits with a plus or minus sign.  In
5166         this case, the subpattern number is relative rather than absolute.  The         this  case, the subpattern number is relative rather than absolute. The
5167         most  recently opened parentheses can be referenced by (?(-1), the next         most recently opened parentheses can be referenced by (?(-1), the  next
5168         most recent by (?(-2), and so on. Inside loops it can also  make  sense         most  recent  by (?(-2), and so on. Inside loops it can also make sense
5169         to refer to subsequent groups. The next parentheses to be opened can be         to refer to subsequent groups. The next parentheses to be opened can be
5170         referenced as (?(+1), and so on. (The value zero in any of these  forms         referenced  as (?(+1), and so on. (The value zero in any of these forms
5171         is not used; it provokes a compile-time error.)         is not used; it provokes a compile-time error.)
5172    
5173         Consider  the  following  pattern, which contains non-significant white         Consider the following pattern, which  contains  non-significant  white
5174         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
5175         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
5176    
5177           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
5178    
5179         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
5180         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
5181         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
5182         third part is a conditional subpattern that tests whether  or  not  the         third  part  is  a conditional subpattern that tests whether or not the
5183         first  set  of  parentheses  matched.  If they did, that is, if the subject         first set of parentheses matched. If they  did,  that  is,  if  the subject
5184         started with an opening parenthesis, the condition is true, and so  the         started  with an opening parenthesis, the condition is true, and so the
5185         yes-pattern  is  executed and a closing parenthesis is required. Other-         yes-pattern is executed and a closing parenthesis is  required.  Other-
5186         wise, since no-pattern is not present, the subpattern matches  nothing.         wise,  since no-pattern is not present, the subpattern matches nothing.
5187         In  other  words,  this  pattern matches a sequence of non-parentheses,         In other words, this pattern matches  a  sequence  of  non-parentheses,
5188         optionally enclosed in parentheses.         optionally enclosed in parentheses.
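
As a minimal sketch, the same conditional can be tried in Python's re module, which accepts the absolute form (?(1)...) but not the relative forms (?(-1) or (?(+1). The re.VERBOSE flag stands in for PCRE_EXTENDED here; the variable name is hypothetical.

```python
import re

# Optional "(" captured as group 1; ")" is required only if group 1 matched.
OPTIONALLY_PARENTHESIZED = re.compile(
    r"""
    ( \( )?        # part 1: optional opening parenthesis
    [^()]+         # part 2: one or more non-parentheses
    (?(1) \) )     # part 3: closing parenthesis only if part 1 matched
    """,
    re.VERBOSE,
)
```

fullmatch() is the convenient way to exercise it: "abc" and "(abc)" are accepted, whereas "(abc" and "abc)" are not.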
5189    
5190         If you were embedding this pattern in a larger one,  you  could  use  a         If  you  were  embedding  this pattern in a larger one, you could use a
5191         relative reference:         relative reference:
5192    
5193           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5194    
5195         This  makes  the  fragment independent of the parentheses in the larger         This makes the fragment independent of the parentheses  in  the  larger
5196         pattern.         pattern.
5197    
5198     Checking for a used subpattern by name     Checking for a used subpattern by name
5199    
5200         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
5201         used  subpattern  by  name.  For compatibility with earlier versions of         used subpattern by name. For compatibility  with  earlier  versions  of
5202         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
5203         also  recognized. However, there is a possible ambiguity with this syn-         also recognized. However, there is a possible ambiguity with this  syn-
5204         tax, because subpattern names may  consist  entirely  of  digits.  PCRE         tax,  because  subpattern  names  may  consist entirely of digits. PCRE
5205         looks  first for a named subpattern; if it cannot find one and the name         looks first for a named subpattern; if it cannot find one and the  name
5206         consists entirely of digits, PCRE looks for a subpattern of  that  num-         consists  entirely  of digits, PCRE looks for a subpattern of that num-
5207         ber,  which must be greater than zero. Using subpattern names that con-         ber, which must be greater than zero. Using subpattern names that  con-
5208         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
5209    
5210         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
5211    
5212           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5213    
5214         If the name used in a condition of this kind is a duplicate,  the  test         If  the  name used in a condition of this kind is a duplicate, the test
5215         is  applied to all subpatterns of the same name, and is true if any one         is applied to all subpatterns of the same name, and is true if any  one
5216         of them has matched.         of them has matched.
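
The named form can be sketched in Python's re module as well. Python spells the group as (?P<OPEN>...) and the test as (?(OPEN)...), which corresponds to the older (?(name)...) syntax mentioned above. This illustrates the idea and is not PCRE's own API.

```python
import re

# Same pattern as the numbered example, but the condition tests a named group.
NAMED_CONDITION = re.compile(
    r"""
    (?P<OPEN> \( )?    # optional opening parenthesis, named OPEN
    [^()]+             # one or more non-parentheses
    (?(OPEN) \) )      # closing parenthesis required only if OPEN matched
    """,
    re.VERBOSE,
)
```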
5217    
5218     Checking for pattern recursion     Checking for pattern recursion
5219    
5220         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
5221         name  R, the condition is true if a recursive call to the whole pattern         name R, the condition is true if a recursive call to the whole  pattern
5222         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
5223         sand follow the letter R, for example:         sand follow the letter R, for example:
5224    
# Line 5276  CONDITIONAL SUBPATTERNS Line 5226  CONDITIONAL SUBPATTERNS
5226    
5227         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
5228         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
5229         recursion  stack.  If  the  name  used in a condition of this kind is a         recursion stack. If the name used in a condition  of  this  kind  is  a
5230         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
5231         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
5232    
5233         At  "top  level",  all  these recursion test conditions are false.  The         At "top level", all these recursion test  conditions  are  false.   The
5234         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
5235    
5236     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
5237    
5238         If the condition is the string (DEFINE), and  there  is  no  subpattern         If  the  condition  is  the string (DEFINE), and there is no subpattern
5239         with  the  name  DEFINE,  the  condition is always false. In this case,         with the name DEFINE, the condition is  always  false.  In  this  case,
5240         there may be only one alternative  in  the  subpattern.  It  is  always         there  may  be  only  one  alternative  in the subpattern. It is always
5241         skipped  if  control  reaches  this  point  in the pattern; the idea of         skipped if control reaches this point  in  the  pattern;  the  idea  of
5242         DEFINE is that it can be used to define subroutines that can be  refer-         DEFINE  is that it can be used to define subroutines that can be refer-
5243         enced  from elsewhere. (The use of subroutines is described below.) For         enced from elsewhere. (The use of subroutines is described below.)  For
5244         example, a pattern to match an IPv4 address  such  as  "192.168.23.245"         example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
5245         could be written like this (ignore whitespace and line breaks):         could be written like this (ignore whitespace and line breaks):
5246    
5247           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5248           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
5249    
5250         The  first part of the pattern is a DEFINE group inside which another         The first part of the pattern is a DEFINE group inside which another
5251         group named "byte" is defined. This matches an individual component  of         group  named "byte" is defined. This matches an individual component of
5252         an  IPv4  address  (a number less than 256). When matching takes place,         an IPv4 address (a number less than 256). When  matching  takes  place,
5253         this part of the pattern is skipped because DEFINE acts  like  a  false         this  part  of  the pattern is skipped because DEFINE acts like a false
5254         condition.  The  rest of the pattern uses references to the named group         condition. The rest of the pattern uses references to the  named  group
5255         to match the four dot-separated components of an IPv4 address,  insist-         to  match the four dot-separated components of an IPv4 address, insist-
5256         ing on a word boundary at each end.         ing on a word boundary at each end.
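
Python's re module has no DEFINE group or subroutine calls, so the following sketch approximates the same "define once, reference several times" idea by composing the pattern from a string. The names BYTE and IPV4 are assumptions made for illustration, and the repeated group is non-capturing here, unlike in the pattern above.

```python
import re

# One IPv4 component (0-255), written once and reused, mirroring (?<byte> ...).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end,
# mirroring: \b (?&byte) (\.(?&byte)){3} \b
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")
```

For example, IPV4.fullmatch("192.168.23.245") succeeds, while a final component of 256 causes the match to fail.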
5257    
5258     Assertion conditions     Assertion conditions
5259    
5260         If  the  condition  is  not  in any of the above formats, it must be an         If the condition is not in any of the above  formats,  it  must  be  an
5261         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.   This may be a positive or negative lookahead or lookbehind
5262         assertion.  Consider  this  pattern,  again  containing non-significant         assertion. Consider  this  pattern,  again  containing  non-significant
5263         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
5264    
5265           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
5266           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
5267    
5268         The condition  is  a  positive  lookahead  assertion  that  matches  an         The  condition  is  a  positive  lookahead  assertion  that  matches an
5269         optional  sequence of non-letters followed by a letter. In other words,         optional sequence of non-letters followed by a letter. In other  words,
5270         it tests for the presence of at least one letter in the subject.  If  a         it  tests  for the presence of at least one letter in the subject. If a
5271         letter  is found, the subject is matched against the first alternative;         letter is found, the subject is matched against the first  alternative;
5272         otherwise it is  matched  against  the  second.  This  pattern  matches         otherwise  it  is  matched  against  the  second.  This pattern matches
5273         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
5274         letters and dd are digits.         letters and dd are digits.
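
Python's re module does not accept an assertion as a condition, so this sketch rewrites the conditional as an alternation in which each branch is guarded by the assertion or its negation. That rewrite is an assumption about an equivalent formulation, not PCRE syntax.

```python
import re

# Branch 1 is tried only if a lowercase letter appears ahead;
# branch 2 only if none does. Together they emulate
# (?(?=[^a-z]*[a-z]) yes-pattern | no-pattern ).
DATE_LIKE = re.compile(
    r"""
      (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}
    | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}
    """,
    re.VERBOSE,
)
```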
5275    
5276    
# Line 5329  COMMENTS Line 5279  COMMENTS
5279         There are two ways of including comments in patterns that are processed         There are two ways of including comments in patterns that are processed
5280         by PCRE. In both cases, the start of the comment must not be in a char-         by PCRE. In both cases, the start of the comment must not be in a char-
5281         acter class, nor in the middle of any other sequence of related charac-         acter class, nor in the middle of any other sequence of related charac-
5282         ters  such  as  (?: or a subpattern name or number. The characters that         ters such as (?: or a subpattern name or number.  The  characters  that
5283         make up a comment play no part in the pattern matching.         make up a comment play no part in the pattern matching.
5284    
5285         The sequence (?# marks the start of a comment that continues up to  the         The  sequence (?# marks the start of a comment that continues up to the
5286         next  closing parenthesis. Nested parentheses are not permitted. If the         next closing parenthesis. Nested parentheses are not permitted. If  the
5287         PCRE_EXTENDED option is set, an unescaped # character also introduces a         PCRE_EXTENDED option is set, an unescaped # character also introduces a
5288         comment,  which  in  this  case continues to immediately after the next         comment, which in this case continues to  immediately  after  the  next
5289         newline character or character sequence in the pattern.  Which  charac-         newline  character  or character sequence in the pattern. Which charac-
5290         ters are interpreted as newlines is controlled by the options passed to         ters are interpreted as newlines is controlled by the options passed to
5291         pcre_compile() or by a special sequence at the start of the pattern, as         pcre_compile() or by a special sequence at the start of the pattern, as
5292         described  in  the  section  entitled "Newline conventions" above. Note         described in the section entitled  "Newline  conventions"  above.  Note
5293         that the end of this type of comment is a literal newline  sequence  in         that  the  end of this type of comment is a literal newline sequence in
5294         the pattern; escape sequences that happen to represent a newline do not         the pattern; escape sequences that happen to represent a newline do not
5295         count. For example, consider this pattern when  PCRE_EXTENDED  is  set,         count.  For  example,  consider this pattern when PCRE_EXTENDED is set,
5296         and the default newline convention is in force:         and the default newline convention is in force:
5297    
5298           abc #comment \n still comment           abc #comment \n still comment
5299    
5300         On  encountering  the  # character, pcre_compile() skips along, looking         On encountering the # character, pcre_compile()  skips  along,  looking
5301         for a newline in the pattern. The sequence \n is still literal at  this         for  a newline in the pattern. The sequence \n is still literal at this
5302         stage,  so  it does not terminate the comment. Only an actual character         stage, so it does not terminate the comment. Only an  actual  character
5303         with the code value 0x0a (the default newline) does so.         with the code value 0x0a (the default newline) does so.
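
The distinction between an escape sequence and a literal newline can be sketched with Python's re module. Its re.VERBOSE mode treats # comments the same way for this particular case, although it is not PCRE and has no configurable newline convention.

```python
import re

# Raw string: "\n" is the two characters backslash and n, so the comment
# runs to the end of the pattern and only "abc" remains.
literal_escape = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Non-raw string: "\n" is a real 0x0a character, which ends the comment,
# so "def" is part of the pattern again.
real_newline = re.compile("abc #comment \n def", re.VERBOSE)
```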
5304    
5305    
5306  RECURSIVE PATTERNS  RECURSIVE PATTERNS
5307    
5308         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
5309         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
5310         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
5311         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
5312         depth.         depth.
5313    
5314         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
5315         sions  to recurse (amongst other things). It does this by interpolating         sions to recurse (amongst other things). It does this by  interpolating
5316         Perl code in the expression at run time, and the code can refer to  the         Perl  code in the expression at run time, and the code can refer to the
5317         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
5318         parentheses problem can be created like this:         parentheses problem can be created like this:
5319    
# Line 5373  RECURSIVE PATTERNS Line 5323  RECURSIVE PATTERNS
5323         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
5324    
5325         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5326         it supports special syntax for recursion of  the  entire  pattern,  and         it  supports  special  syntax  for recursion of the entire pattern, and
5327         also  for  individual  subpattern  recursion. After its introduction in         also for individual subpattern recursion.  After  its  introduction  in
5328         PCRE and Python, this kind of  recursion  was  subsequently  introduced         PCRE  and  Python,  this  kind of recursion was subsequently introduced
5329         into Perl at release 5.10.         into Perl at release 5.10.
5330    
5331         A  special  item  that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
5332         zero and a closing parenthesis is a recursive subroutine  call  of  the         zero  and  a  closing parenthesis is a recursive subroutine call of the
5333         subpattern  of  the  given  number, provided that it occurs inside that         subpattern of the given number, provided that  it  occurs  inside  that
5334         subpattern. (If not, it is a non-recursive subroutine  call,  which  is         subpattern.  (If  not,  it is a non-recursive subroutine call, which is
5335         described  in  the  next  section.)  The special item (?R) or (?0) is a         described in the next section.) The special item  (?R)  or  (?0)  is  a
5336         recursive call of the entire regular expression.         recursive call of the entire regular expression.
5337    
5338         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
5339         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
5340    
5341           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
5342    
5343         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
5344         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
5345         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
5346         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
5347         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
5348         parentheses.         parentheses.
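
To make the structure of the recursive pattern concrete, here is a hand-written Python function with the same shape. It is only an analogue (Python's re module has no (?R)), and the function name is hypothetical. Like the possessive [^()]++, it never re-divides a run of non-parentheses once consumed.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1.

    Mirrors \\( ( [^()]++ | (?R) )* \\): an opening parenthesis, then any
    mix of non-parenthesis characters and nested groups, then a closing one.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                      # closing parenthesis found
        if ch == "(":
            pos = match_parens(subject, pos)    # the (?R) step
            if pos == -1:
                return -1
        else:
            pos += 1                            # a non-parenthesis character
    return -1                                   # ran out of input before ")"
```

Applied to "(ab(cd)ef)" it consumes the whole string. An unbalanced subject such as "(aaaa()" is rejected after a single left-to-right pass.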
5349    
5350         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
5351         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
5352    
5353           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
5354    
5355         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
5356         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5357    
5358         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5359         tricky.  This is made easier by the use of relative references. Instead         tricky. This is made easier by the use of relative references.  Instead
5360         of (?1) in the pattern above you can write (?-2) to refer to the second         of (?1) in the pattern above you can write (?-2) to refer to the second
5361         most  recently  opened  parentheses  preceding  the recursion. In other         most recently opened parentheses  preceding  the  recursion.  In  other
5362         words, a negative number counts capturing  parentheses  leftwards  from         words,  a  negative  number counts capturing parentheses leftwards from
5363         the point at which it is encountered.         the point at which it is encountered.
5364    
5365         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
5366         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
5367         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
5368         enced. They are always non-recursive subroutine calls, as described  in         enced.  They are always non-recursive subroutine calls, as described in
5369         the next section.         the next section.
5370    
5371         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
5372         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
5373         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
5374    
5375           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5376    
5377         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
5378         one is used.         one is used.
5379    
5380         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
5381         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
5382         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
5383         tern  to  strings  that do not match. For example, when this pattern is         tern to strings that do not match. For example, when  this  pattern  is
5384         applied to         applied to
5385    
5386           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5387    
5388         it yields "no match" quickly. However, if a  possessive  quantifier  is         it  yields  "no  match" quickly. However, if a possessive quantifier is
5389         not  used, the match runs for a very long time indeed because there are         not used, the match runs for a very long time indeed because there  are
5390         so many different ways the + and * repeats can carve  up  the  subject,         so  many  different  ways the + and * repeats can carve up the subject,
5391         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
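
As a rough illustration of why the non-possessive form is so slow, the following Python sketch counts the ways a run of n non-parenthesis characters can be carved into one or more non-empty pieces, which is what an inner + inside an outer * is free to do. The function name count_carvings is hypothetical, and the count only indicates the growth rate; it is not a model of PCRE's actual search.

```python
def count_carvings(n):
    """Count ways to split a run of n characters into non-empty pieces."""
    ways = [0] * (n + 1)
    ways[0] = 1  # base case: nothing left to split
    for length in range(1, n + 1):
        # The last piece may have any size from 1 up to the whole run.
        ways[length] = sum(ways[length - last] for last in range(1, length + 1))
    return ways[n]


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, count_carvings(n))
```

The count doubles with each extra character (it equals 2 to the power n-1), so a run of a few dozen characters already leaves far too many combinations to test one by one.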
5392    
5393         At  the  end  of a match, the values of capturing parentheses are those         At the end of a match, the values of capturing  parentheses  are  those
5394         from the outermost level. If you want to obtain intermediate values,  a         from  the outermost level. If you want to obtain intermediate values, a
5395         callout  function can be used (see below and the pcrecallout documenta-         callout function can be used (see below and the pcrecallout  documenta-
5396         tion). If the pattern above is matched against         tion). If the pattern above is matched against
5397    
5398           (ab(cd)ef)           (ab(cd)ef)
5399    
5400         the value for the inner capturing parentheses  (numbered  2)  is  "ef",         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
5401         which  is the last value taken on at the top level. If a capturing sub-         which is the last value taken on at the top level. If a capturing  sub-
5402         pattern is not matched at the top level, its final  captured  value  is         pattern  is  not  matched at the top level, its final captured value is
5403         unset,  even  if  it was (temporarily) set at a deeper level during the         unset, even if it was (temporarily) set at a deeper  level  during  the
5404         matching process.         matching process.
5405    
5406         If there are more than 15 capturing parentheses in a pattern, PCRE  has         If  there are more than 15 capturing parentheses in a pattern, PCRE has
5407         to  obtain extra memory to store data during a recursion, which it does         to obtain extra memory to store data during a recursion, which it  does
5408         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5409         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5410    
5411         Do  not  confuse  the (?R) item with the condition (R), which tests for         Do not confuse the (?R) item with the condition (R),  which  tests  for
5412         recursion.  Consider this pattern, which matches text in  angle  brack-         recursion.   Consider  this pattern, which matches text in angle brack-
5413         ets,  allowing for arbitrary nesting. Only digits are allowed in nested         ets, allowing for arbitrary nesting. Only digits are allowed in  nested
5414         brackets (that is, when recursing), whereas any characters are  permit-         brackets  (that is, when recursing), whereas any characters are permit-
5415         ted at the outer level.         ted at the outer level.
5416    
5417           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5418    
5419         In  this  pattern, (?(R) is the start of a conditional subpattern, with         In this pattern, (?(R) is the start of a conditional  subpattern,  with
5420         two different alternatives for the recursive and  non-recursive  cases.         two  different  alternatives for the recursive and non-recursive cases.
5421         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
5422    
5423     Differences in recursion processing between PCRE and Perl     Differences in recursion processing between PCRE and Perl
5424    
5425         Recursion  processing  in PCRE differs from Perl in two important ways.         Recursion processing in PCRE differs from Perl in two  important  ways.
5426         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
5427         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
5428         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
5429         alternatives  and  there  is a subsequent matching failure. This can be         alternatives and there is a subsequent matching failure.  This  can  be
5430         illustrated by the following pattern, which purports to match a  palin-         illustrated  by the following pattern, which purports to match a palin-
5431         dromic  string  that contains an odd number of characters (for example,         dromic string that contains an odd number of characters  (for  example,
5432         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
5433    
5434           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
5435    
5436         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
5437         characters  surrounding  a sub-palindrome. In Perl, this pattern works;         characters surrounding a sub-palindrome. In Perl, this  pattern  works;
5438         in PCRE it does not if the subject is  longer  than  three  characters.         in  PCRE  it  does  not if the subject is longer than three characters.
5439         Consider the subject string "abcba":         Consider the subject string "abcba":
5440    
5441         At  the  top level, the first character is matched, but as it is not at         At the top level, the first character is matched, but as it is  not  at
5442         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
5443         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
5444         tern 1 successfully matches the next character ("b").  (Note  that  the         tern  1  successfully  matches the next character ("b"). (Note that the
5445         beginning and end of line tests are not part of the recursion.)         beginning and end of line tests are not part of the recursion.)
5446    
5447         Back  at  the top level, the next character ("c") is compared with what         Back at the top level, the next character ("c") is compared  with  what
5448         subpattern 2 matched, which was "a". This fails. Because the  recursion         subpattern  2 matched, which was "a". This fails. Because the recursion
5449         is  treated  as  an atomic group, there are now no backtracking points,         is treated as an atomic group, there are now  no  backtracking  points,
5450         and so the entire match fails. (Perl is able, at  this  point,  to  re-         and  so  the  entire  match fails. (Perl is able, at this point, to re-
5451         enter  the  recursion  and try the second alternative.) However, if the         enter the recursion and try the second alternative.)  However,  if  the
5452         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
5453         different:         different:
5454    
5455           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
5456    
5457         This  time,  the recursing alternative is tried first, and continues to         This time, the recursing alternative is tried first, and  continues  to
5458         recurse until it runs out of characters, at which point  the  recursion         recurse  until  it runs out of characters, at which point the recursion
5459         fails.  But  this  time  we  do  have another alternative to try at the         fails. But this time we do have  another  alternative  to  try  at  the
5460         higher level. That is the big difference:  in  the  previous  case  the         higher  level.  That  is  the  big difference: in the previous case the
5461         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
5462         use.         use.
5463    
5464         To change the pattern so that it matches all palindromic  strings,  not         To  change  the pattern so that it matches all palindromic strings, not
5465         just  those  with an odd number of characters, it is tempting to change         just those with an odd number of characters, it is tempting  to  change
5466         the pattern to this:         the pattern to this:
5467    
5468           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
5469    
5470         Again, this works in Perl, but not in PCRE, and for  the  same  reason.         Again,  this  works  in Perl, but not in PCRE, and for the same reason.
5471         When  a  deeper  recursion has matched a single character, it cannot be         When a deeper recursion has matched a single character,  it  cannot  be
5472         entered again in order to match an empty string.  The  solution  is  to         entered  again  in  order  to match an empty string. The solution is to
5473         separate  the two cases, and write out the odd and even cases as alter-         separate the two cases, and write out the odd and even cases as  alter-
5474         natives at the higher level:         natives at the higher level:
5475    
5476           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
5477    
5478         If you want to match typical palindromic phrases, the  pattern  has  to         If  you  want  to match typical palindromic phrases, the pattern has to
5479         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
5480    
5481           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
5482    
5483         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
5484         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
5485         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
5486         ing into sequences of non-word characters. Without this, PCRE  takes  a         ing  into  sequences of non-word characters. Without this, PCRE takes a
5487         great  deal  longer  (ten  times or more) to match typical phrases, and         great deal longer (ten times or more) to  match  typical  phrases,  and
5488         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
5489    
5490         WARNING: The palindrome-matching patterns above work only if  the  sub-         WARNING:  The  palindrome-matching patterns above work only if the sub-
5491         ject  string  does not start with a palindrome that is shorter than the         ject string does not start with a palindrome that is shorter  than  the
5492         entire string.  For example, although "abcba" is correctly matched,  if         entire  string.  For example, although "abcba" is correctly matched, if
5493         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
5494         then fails at top level because the end of the string does not  follow.         then  fails at top level because the end of the string does not follow.
5495         Once  again, it cannot jump back into the recursion to try other alter-         Once again, it cannot jump back into the recursion to try other  alter-
5496         natives, so the entire match fails.         natives, so the entire match fails.
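
For comparison, the property that the phrase pattern is meant to test can be checked directly, without a regular expression engine. The Python sketch below is a reference check only, with a hypothetical function name; it assumes ASCII word characters and caseless comparison, mirroring \W and PCRE_CASELESS. It is not an emulation of PCRE's recursion, and so it is not subject to the limitation described in the warning.

```python
import re


def is_palindromic_phrase(text):
    """Return True if text reads the same both ways, ignoring non-word characters and case."""
    # Drop everything that \W would match, then compare caselessly.
    letters = re.sub(r"\W", "", text, flags=re.ASCII).lower()
    return letters == letters[::-1]


if __name__ == "__main__":
    print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
    print(is_palindromic_phrase("ababa"))
```

A check of this kind accepts "ababa", which the recursive patterns above fail to match in PCRE.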
5497    
5498         The second way in which PCRE and Perl differ in  their  recursion  pro-         The  second  way  in which PCRE and Perl differ in their recursion pro-
5499         cessing  is in the handling of captured values. In Perl, when a subpat-         cessing is in the handling of captured values. In Perl, when a  subpat-
5500         tern is called recursively or as a subroutine (see the  next  section),         tern  is  called recursively or as a subroutine (see the next section),
5501         it  has  no  access to any values that were captured outside the recur-         it has no access to any values that were captured  outside  the  recur-
5502         sion, whereas in PCRE these values can  be  referenced.  Consider  this         sion,  whereas  in  PCRE  these values can be referenced. Consider this
5503         pattern:         pattern:
5504    
5505           ^(.)(\1|a(?2))           ^(.)(\1|a(?2))
5506    
5507         In  PCRE,  this  pattern matches "bab". The first capturing parentheses         In PCRE, this pattern matches "bab". The  first  capturing  parentheses
5508         match "b", then in the second group, when the back reference  \1  fails         match  "b",  then in the second group, when the back reference \1 fails
5509         to  match "b", the second alternative matches "a" and then recurses. In         to match "b", the second alternative matches "a" and then recurses.  In
5510         the recursion, \1 does now match "b" and so the whole  match  succeeds.         the  recursion,  \1 does now match "b" and so the whole match succeeds.
5511         In  Perl,  the pattern fails to match because inside the recursive call         In Perl, the pattern fails to match because inside the  recursive  call
5512         \1 cannot access the externally set value.         \1 cannot access the externally set value.
5513    
5514    
5515  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
5516    
5517         If the syntax for a recursive subpattern call (either by number  or  by         If  the  syntax for a recursive subpattern call (either by number or by
5518         name)  is  used outside the parentheses to which it refers, it operates         name) is used outside the parentheses to which it refers,  it  operates
5519         like a subroutine in a programming language. The called subpattern  may         like  a subroutine in a programming language. The called subpattern may
5520         be  defined  before or after the reference. A numbered reference can be         be defined before or after the reference. A numbered reference  can  be
5521         absolute or relative, as in these examples:         absolute or relative, as in these examples:
5522    
5523           (...(absolute)...)...(?2)...           (...(absolute)...)...(?2)...
# Line 5578  SUBPATTERNS AS SUBROUTINES Line 5528  SUBPATTERNS AS SUBROUTINES
5528    
5529           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5530    
5531         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
5532         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
5533    
5534           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
5535    
5536         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
5537         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
5538         above.         above.
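
The difference between a back reference and a subroutine call can be sketched in Python. Python's re module does not support (?1), so this example assumes a textual stand-in: the group's pattern text is repeated in a non-capturing group, which is what a non-recursive subroutine call amounts to here. The names WORD, backref and subroutine_like are hypothetical.

```python
import re

# Pattern text of group 1, reused below in place of a (?1) call.
WORD = r"sens|respons"

# \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({WORD})e and \1ibility")

# Reusing the pattern text allows either alternative the second time.
subroutine_like = re.compile(rf"({WORD})e and (?:{WORD})ibility")

if __name__ == "__main__":
    for phrase in ("sense and sensibility",
                   "response and responsibility",
                   "sense and responsibility"):
        print(phrase,
              bool(backref.fullmatch(phrase)),
              bool(subroutine_like.fullmatch(phrase)))
```

Only the pattern built by reusing the pattern text accepts "sense and responsibility"; the back reference insists on the same word both times.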
5539    
5540         All  subroutine  calls, whether recursive or not, are always treated as         All subroutine calls, whether recursive or not, are always  treated  as
5541         atomic groups. That is, once a subroutine has matched some of the  sub-         atomic  groups. That is, once a subroutine has matched some of the sub-
5542         ject string, it is never re-entered, even if it contains untried alter-         ject string, it is never re-entered, even if it contains untried alter-
5543         natives and there is  a  subsequent  matching  failure.  Any  capturing         natives  and  there  is  a  subsequent  matching failure. Any capturing
5544         parentheses  that  are  set  during the subroutine call revert to their         parentheses that are set during the subroutine  call  revert  to  their
5545         previous values afterwards.         previous values afterwards.
5546    
5547         Processing options such as case-independence are fixed when  a  subpat-         Processing  options  such as case-independence are fixed when a subpat-
5548         tern  is defined, so if it is used as a subroutine, such options cannot         tern is defined, so if it is used as a subroutine, such options  cannot
5549         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
5550    
5551           (abc)(?i:(?-1))           (abc)(?i:(?-1))
5552    
5553         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
5554         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
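
The same behaviour can be imitated in Python, whose re module has no subroutine calls but does support scoped option settings. In this sketch, which is an assumption-laden stand-in rather than PCRE itself, the called group is written out as (?-i:abc) to represent a subpattern whose case-sensitivity was fixed when it was defined. The names fixed_options and inherited_options are hypothetical.

```python
import re

# Stand-in for (abc)(?i:(?-1)): the "called" copy keeps the case-sensitive
# setting it had where it was defined, despite the enclosing (?i:...).
fixed_options = re.compile(r"(abc)(?i:(?-i:abc))")

# For contrast: a copy that simply inherits the enclosing (?i:...) setting.
inherited_options = re.compile(r"(abc)(?i:abc)")

if __name__ == "__main__":
    for subject in ("abcabc", "abcABC"):
        print(subject,
              bool(fixed_options.fullmatch(subject)),
              bool(inherited_options.fullmatch(subject)))
```

The first pattern rejects "abcABC", as PCRE does for the subroutine call; the second accepts it, which is what would happen if the options in force at the call were applied.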
5555    
5556    
5557  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
5558    
5559         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
5560         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
5561         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
5562         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
5563         ten using this syntax:         ten using this syntax:
5564    
5565           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
5566           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
5567    
5568         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
5569         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
5570    
5571           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
5572    
5573         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
5574         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
5575         call.         call.
5576    
5577    
5578  CALLOUTS  CALLOUTS
5579    
5580         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
5581         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
5582         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
5583         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
5584         tion.         tion.
5585    
5586         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
5587         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
5588         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
5589         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
5590         all calling out.         all calling out.
5591    
5592         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
5593         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
5594         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
5595         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
5596         points:         points:
5597    
5598           (?C1)abc(?C2)def           (?C1)abc(?C2)def
5599    
5600         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
5601         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
5602         numbered 255.         numbered 255.
5603    
5604         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
5605         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
5606         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
5607         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
5608         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
5609         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
5610         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
5611    
5612    
5613  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5614    
5615         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5616         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5617         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5618         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5619         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5620         in this section.         in this section.
5621    
5622         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5623         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5624         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5625         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5626         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5627    
5628         If  any of these verbs are used in an assertion or in a subpattern that         If any of these verbs are used in an assertion or in a subpattern  that
5629         is called as a subroutine (whether or not recursively), their effect is         is called as a subroutine (whether or not recursively), their effect is
5630         confined to that subpattern; it does not extend to the surrounding pat-         confined to that subpattern; it does not extend to the surrounding pat-
5631         tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)         tern, with one exception:  a (*MARK) that is encountered in a  positive
5632         that  is  encountered in a successful positive assertion is passed back         assertion is passed back (compare capturing parentheses in assertions).
        when a match succeeds (compare capturing  parentheses  in  assertions).  
5633         Note that such subpatterns are processed as anchored at the point where         Note that such subpatterns are processed as anchored at the point where
5634         they are tested. Note also that Perl's treatment of subroutines is dif-         they are tested. Note also that Perl's treatment of subroutines is dif-
5635         ferent in some cases.         ferent in some cases.
# Line 5703  BACKTRACKING CONTROL Line 5652  BACKTRACKING CONTROL
5652         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-
5653         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
5654    
        Experiments  with  Perl  suggest that it too has similar optimizations,  
        sometimes leading to anomalous results.  
   
5655     Verbs that act immediately     Verbs that act immediately
5656    
5657         The following verbs act as soon as they are encountered. They  may  not         The  following  verbs act as soon as they are encountered. They may not
5658         be followed by a name.         be followed by a name.
5659    
5660            (*ACCEPT)            (*ACCEPT)
5661    
5662         This  verb causes the match to end successfully, skipping the remainder         This verb causes the match to end successfully, skipping the  remainder
5663         of the pattern. However, when it is inside a subpattern that is  called         of  the pattern. However, when it is inside a subpattern that is called
5664         as  a  subroutine, only that subpattern is ended successfully. Matching         as a subroutine, only that subpattern is ended  successfully.  Matching
5665         then continues at the outer level. If  (*ACCEPT)  is  inside  capturing         then  continues  at  the  outer level. If (*ACCEPT) is inside capturing
5666         parentheses, the data so far is captured. For example:         parentheses, the data so far is captured. For example:
5667    
5668           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
5669    
5670         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
5671         tured by the outer parentheses.         tured by the outer parentheses.
5672    
5673           (*FAIL) or (*F)           (*FAIL) or (*F)
5674    
5675         This verb causes a matching failure, forcing backtracking to occur.  It         This  verb causes a matching failure, forcing backtracking to occur. It
5676         is  equivalent to (?!) but easier to read. The Perl documentation notes         is equivalent to (?!) but easier to read. The Perl documentation  notes
5677         that it is probably useful only when combined  with  (?{})  or  (??{}).         that  it  is  probably  useful only when combined with (?{}) or (??{}).
5678         Those  are,  of course, Perl features that are not present in PCRE. The         Those are, of course, Perl features that are not present in  PCRE.  The
5679         nearest equivalent is the callout feature, as for example in this  pat-         nearest  equivalent is the callout feature, as for example in this pat-
5680         tern:         tern:
5681    
5682           a+(?C)(*FAIL)           a+(?C)(*FAIL)
5683    
5684         A  match  with the string "aaaa" always fails, but the callout is taken         A match with the string "aaaa" always fails, but the callout  is  taken
5685         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
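
The figure of 10 can be reproduced by counting. The Python sketch below assumes an unanchored match with no start-of-match optimizations, so that a match is attempted at every starting position, and that a+ first takes the longest run and then gives back one character per backtrack. The function name count_callouts is hypothetical, and nothing here calls PCRE.

```python
def count_callouts(subject):
    """Count how often the callout after a+ would be reached for a+(?C)(*FAIL)."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start point.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # One callout for each length a+ tries: run, run-1, ..., 1.
        calls += run
    return calls


if __name__ == "__main__":
    print(count_callouts("aaaa"))
```

For "aaaa" the starting positions contribute 4, 3, 2 and 1 callouts respectively.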
5686    
5687     Recording which path was taken     Recording which path was taken
5688    
5689         There is one verb whose main purpose  is  to  track  how  a  match  was         There  is  one  verb  whose  main  purpose  is to track how a match was
5690         arrived  at,  though  it  also  has a secondary use in conjunction with         arrived at, though it also has a  secondary  use  in  conjunction  with
5691         advancing the match starting point (see (*SKIP) below).         advancing the match starting point (see (*SKIP) below).
5692    
5693           (*MARK:NAME) or (*:NAME)           (*MARK:NAME) or (*:NAME)
5694    
5695         A name is always  required  with  this  verb.  There  may  be  as  many         A  name  is  always  required  with  this  verb.  There  may be as many
5696         instances  of  (*MARK) as you like in a pattern, and their names do not         instances of (*MARK) as you like in a pattern, and their names  do  not
5697         have to be unique.         have to be unique.
5698    
5699         When a match succeeds, the name of the last-encountered (*MARK) on  the         When  a  match  succeeds,  the  name of the last-encountered (*MARK) is
5700         matching  path  is  passed  back  to the caller via the pcre_extra data         passed back to  the  caller  via  the  pcre_extra  data  structure,  as
5701         structure, as described in the section on  pcre_extra  in  the  pcreapi         described in the section on pcre_extra in the pcreapi documentation. No
5702         documentation. Here is an example of pcretest output, where the /K mod-         data is returned for a partial match. Here is an  example  of  pcretest
5703         ifier requests the retrieval and outputting of (*MARK) data:         output,  where the /K modifier requests the retrieval and outputting of
5704           (*MARK) data:
5705    
5706             re> /X(*MARK:A)Y|X(*MARK:B)Z/K           /X(*MARK:A)Y|X(*MARK:B)Z/K
5707           data> XY           XY
5708            0: XY            0: XY
5709           MK: A           MK: A
5710           XZ           XZ
# Line 5773  BACKTRACKING CONTROL Line 5720  BACKTRACKING CONTROL
5720         and passed back if it is the last-encountered. This does not happen for         and passed back if it is the last-encountered. This does not happen for
5721         negative assertions.         negative assertions.
5722    
5723         After  a  partial match or a failed match, the name of the last encoun-         A  name  may  also  be  returned after a failed match if the final path
5724         tered (*MARK) in the entire match process is returned. For example:         through the pattern involves (*MARK). However, unless (*MARK) is used in
5725           conjunction  with  (*COMMIT),  this  is unlikely to happen for an unan-
5726           chored pattern because, as the starting point for matching is advanced,
5727           the final check is often with an empty string, causing a failure before
5728           (*MARK) is reached. For example:
5729    
5730             /X(*MARK:A)Y|X(*MARK:B)Z/K
5731             XP
5732             No match
5733    
5734           There are three potential starting points for this match (starting with
5735           X,  starting  with  P,  and  with  an  empty string). If the pattern is
5736           anchored, the result is different:
5737    
5738             re> /X(*MARK:A)Y|X(*MARK:B)Z/K           /^X(*MARK:A)Y|^X(*MARK:B)Z/K
5739           data> XP           XP
5740           No match, mark = B           No match, mark = B
5741    
5742         Note that in this unanchored example the  mark  is  retained  from  the         PCRE's start-of-match optimizations can also interfere with  this.  For
5743         match attempt that started at the letter "X". Subsequent match attempts         example,  if, as a result of a call to pcre_study(), it knows the mini-
5744         starting at "P" and then with an empty string do not get as far as  the         mum subject length for a match, a shorter subject will not  be  scanned
5745         (*MARK) item, but nevertheless do not reset it.         at all.
5746    
5747           Note that similar anomalies (though different in detail) exist in Perl,
5748           no doubt for the same reasons. The use of (*MARK) data after  a  failed
5749           match  of an unanchored pattern is not recommended, unless (*COMMIT) is
5750           involved.
5751    
5752     Verbs that act after backtracking     Verbs that act after backtracking
5753    
5754         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
5755         tinues with what follows, but if there is no subsequent match,  causing         tinues  with what follows, but if there is no subsequent match, causing
5756         a  backtrack  to  the  verb, a failure is forced. That is, backtracking         a backtrack to the verb, a failure is  forced.  That  is,  backtracking
5757         cannot pass to the left of the verb. However, when one of  these  verbs         cannot  pass  to the left of the verb. However, when one of these verbs
5758         appears  inside  an atomic group, its effect is confined to that group,         appears inside an atomic group, its effect is confined to  that  group,
5759         because once the group has been matched, there is never any  backtrack-         because  once the group has been matched, there is never any backtrack-
5760         ing  into  it.  In  this situation, backtracking can "jump back" to the         ing into it. In this situation, backtracking can  "jump  back"  to  the
5761         left of the entire atomic group. (Remember also, as stated above,  that         left  of the entire atomic group. (Remember also, as stated above, that
5762         this localization also applies in subroutine calls and assertions.)         this localization also applies in subroutine calls and assertions.)
5763    
5764         These  verbs  differ  in exactly what kind of failure occurs when back-         These verbs differ in exactly what kind of failure  occurs  when  back-
5765         tracking reaches them.         tracking reaches them.
5766    
5767           (*COMMIT)           (*COMMIT)
5768    
5769         This verb, which may not be followed by a name, causes the whole  match         This  verb, which may not be followed by a name, causes the whole match
5770         to fail outright if the rest of the pattern does not match. Even if the         to fail outright if the rest of the pattern does not match. Even if the
5771         pattern is unanchored, no further attempts to find a match by advancing         pattern is unanchored, no further attempts to find a match by advancing
5772         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
5773         pcre_exec() is committed to finding a match  at  the  current  starting         pcre_exec()  is  committed  to  finding a match at the current starting
5774         point, or not at all. For example:         point, or not at all. For example:
5775    
5776           a+(*COMMIT)b           a+(*COMMIT)b
5777    
5778         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
5779         of dynamic anchor, or "I've started, so I must finish." The name of the         of dynamic anchor, or "I've started, so I must finish." The name of the
5780         most  recently passed (*MARK) in the path is passed back when (*COMMIT)         most recently passed (*MARK) in the path is passed back when  (*COMMIT)
5781         forces a match failure.         forces a match failure.
5782    
5783         Note that (*COMMIT) at the start of a pattern is not  the  same  as  an         Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
5784         anchor,  unless  PCRE's start-of-match optimizations are turned off, as         anchor, unless PCRE's start-of-match optimizations are turned  off,  as
5785         shown in this pcretest example:         shown in this pcretest example:
5786    
5787             re> /(*COMMIT)abc/           /(*COMMIT)abc/
5788           data> xyzabc           xyzabc
5789            0: abc            0: abc
5790           xyzabc\Y           xyzabc\Y
5791           No match           No match
5792    
5793         PCRE knows that any match must start  with  "a",  so  the  optimization         PCRE  knows  that  any  match  must start with "a", so the optimization
5794         skips  along the subject to "a" before running the first match attempt,         skips along the subject to "a" before running the first match  attempt,
5795         which succeeds. When the optimization is disabled by the \Y  escape  in         which  succeeds.  When the optimization is disabled by the \Y escape in
5796         the second subject, the match starts at "x" and so the (*COMMIT) causes         the second subject, the match starts at "x" and so the (*COMMIT) causes
5797         it to fail without trying any other starting points.         it to fail without trying any other starting points.
5798    
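The interaction between (*COMMIT) and the start-of-match optimization described above can be imitated with a small hand-written model. This is only a sketch: the function name commit_abc() and its optimize flag are invented for illustration, and the code does not call PCRE or reflect how pcre_exec() is implemented. It hard-codes the pattern /(*COMMIT)abc/ and assumes that the optimization simply skips to the first "a" in the subject.

```c
#include <assert.h>
#include <string.h>

/* Toy model of /(*COMMIT)abc/; this is not PCRE code.
   "optimize" stands for PCRE's start-of-match optimization.
   Returns the offset of the match, or -1 for "No match". */
int commit_abc(const char *subject, int optimize)
{
    const char *start = subject;

    assert(subject != NULL);

    if (optimize) {
        /* Skip along the subject to the first "a" before the first attempt. */
        start = strchr(subject, 'a');
        if (start == NULL)
            return -1;
    }

    /* (*COMMIT) is passed immediately, so this is the only attempt:
       on failure there is no bumpalong to a later starting point. */
    if (strncmp(start, "abc", 3) == 0)
        return (int)(start - subject);
    return -1;
}
```

In this model, commit_abc("xyzabc", 1) gives offset 3, corresponding to the first pcretest subject above, while commit_abc("xyzabc", 0) gives -1, corresponding to the subject with \Y.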
5799           (*PRUNE) or (*PRUNE:NAME)           (*PRUNE) or (*PRUNE:NAME)
5800    
5801         This verb causes the match to fail at the current starting position  in         This  verb causes the match to fail at the current starting position in
5802         the  subject  if the rest of the pattern does not match. If the pattern         the subject if the rest of the pattern does not match. If  the  pattern
5803         is unanchored, the normal "bumpalong"  advance  to  the  next  starting         is  unanchored,  the  normal  "bumpalong"  advance to the next starting
5804         character  then happens. Backtracking can occur as usual to the left of         character then happens. Backtracking can occur as usual to the left  of
5805         (*PRUNE), before it is reached,  or  when  matching  to  the  right  of         (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
5806         (*PRUNE),  but  if  there is no match to the right, backtracking cannot         (*PRUNE), but if there is no match to the  right,  backtracking  cannot
5807         cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter-         cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
5808         native  to an atomic group or possessive quantifier, but there are some         native to an atomic group or possessive quantifier, but there are  some
5809         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
5810         iour  of  (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE). In an         iour of (*PRUNE:NAME) is the  same  as  (*MARK:NAME)(*PRUNE)  when  the
5811         anchored pattern (*PRUNE) has the same effect as (*COMMIT).         match  fails  completely;  the name is passed back if this is the final
5812           attempt.  (*PRUNE:NAME) does not pass back a name  if  the  match  suc-
5813           ceeds.  In  an  anchored pattern (*PRUNE) has the same effect as (*COM-
5814           MIT).
5815    
5816           (*SKIP)           (*SKIP)
5817    
# Line 5871  BACKTRACKING CONTROL Line 5838  BACKTRACKING CONTROL
5838         is searched for the most recent (*MARK) that has the same name. If  one         is searched for the most recent (*MARK) that has the same name. If  one
5839         is  found, the "bumpalong" advance is to the subject position that cor-         is  found, the "bumpalong" advance is to the subject position that cor-
5840         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
5841         If no (*MARK) with a matching name is found, the (*SKIP) is ignored.         If  no (*MARK) with a matching name is found, normal "bumpalong" of one
5842           character happens (that is, the (*SKIP) is ignored).
5843    
5844           (*THEN) or (*THEN:NAME)           (*THEN) or (*THEN:NAME)
5845    
5846         This  verb  causes a skip to the next innermost alternative if the rest         This verb causes a skip to the next innermost alternative if  the  rest
5847         of the pattern does not match. That is, it cancels  pending  backtrack-         of  the  pattern does not match. That is, it cancels pending backtrack-
5848         ing,  but  only within the current alternative. Its name comes from the         ing, but only within the current alternative. Its name comes  from  the
5849         observation that it can be used for a pattern-based if-then-else block:         observation that it can be used for a pattern-based if-then-else block:
5850    
5851           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5852    
5853         If the COND1 pattern matches, FOO is tried (and possibly further  items         If  the COND1 pattern matches, FOO is tried (and possibly further items
5854         after  the  end  of the group if FOO succeeds); on failure, the matcher         after the end of the group if FOO succeeds); on  failure,  the  matcher
5855         skips to the second alternative and tries COND2,  without  backtracking         skips  to  the second alternative and tries COND2, without backtracking
5856         into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as         into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
5857         (*MARK:NAME)(*THEN).  If (*THEN) is not inside an alternation, it  acts         (*MARK:NAME)(*THEN)  if  the  overall  match  fails.  If (*THEN) is not
5858         like (*PRUNE).         inside an alternation, it acts like (*PRUNE).
5859    
5860         Note  that  a  subpattern that does not contain a | character is just a         Note that a subpattern that does not contain a | character  is  just  a
5861         part of the enclosing alternative; it is not a nested alternation  with         part  of the enclosing alternative; it is not a nested alternation with
5862         only  one alternative. The effect of (*THEN) extends beyond such a sub-         only one alternative. The effect of (*THEN) extends beyond such a  sub-
5863         pattern to the enclosing alternative. Consider this pattern,  where  A,         pattern  to  the enclosing alternative. Consider this pattern, where A,
5864         B, etc. are complex pattern fragments that do not contain any | charac-         B, etc. are complex pattern fragments that do not contain any | charac-
5865         ters at this level:         ters at this level:
5866    
5867           A (B(*THEN)C) | D           A (B(*THEN)C) | D
5868    
5869         If A and B are matched, but there is a failure in C, matching does  not         If  A and B are matched, but there is a failure in C, matching does not
5870         backtrack into A; instead it moves to the next alternative, that is, D.         backtrack into A; instead it moves to the next alternative, that is, D.
5871         However, if the subpattern containing (*THEN) is given an  alternative,         However,  if the subpattern containing (*THEN) is given an alternative,
5872         it behaves differently:         it behaves differently:
5873    
5874           A (B(*THEN)C | (*FAIL)) | D           A (B(*THEN)C | (*FAIL)) | D
5875    
5876         The  effect of (*THEN) is now confined to the inner subpattern. After a         The effect of (*THEN) is now confined to the inner subpattern. After  a
5877         failure in C, matching moves to (*FAIL), which causes the whole subpat-         failure in C, matching moves to (*FAIL), which causes the whole subpat-
5878         tern  to  fail  because  there are no more alternatives to try. In this         tern to fail because there are no more alternatives  to  try.  In  this
5879         case, matching does now backtrack into A.         case, matching does now backtrack into A.
5880    
5881         Note also that a conditional subpattern is not considered as having two         Note also that a conditional subpattern is not considered as having two
5882         alternatives,  because  only  one  is  ever used. In other words, the |         alternatives, because only one is ever used.  In  other  words,  the  |
5883         character in a conditional subpattern has a different meaning. Ignoring         character in a conditional subpattern has a different meaning. Ignoring
5884         white space, consider:         white space, consider:
5885    
5886           ^.*? (?(?=a) a | b(*THEN)c )           ^.*? (?(?=a) a | b(*THEN)c )
5887    
5888         If  the  subject  is  "ba", this pattern does not match. Because .*? is         If the subject is "ba", this pattern does not  match.  Because  .*?  is
5889         ungreedy, it initially matches zero  characters.  The  condition  (?=a)         ungreedy,  it  initially  matches  zero characters. The condition (?=a)
5890         then  fails,  the  character  "b"  is  matched, but "c" is not. At this         then fails, the character "b" is matched,  but  "c"  is  not.  At  this
5891         point, matching does not backtrack to .*? as might perhaps be  expected         point,  matching does not backtrack to .*? as might perhaps be expected
5892         from  the  presence  of  the | character. The conditional subpattern is         from the presence of the | character.  The  conditional  subpattern  is
5893         part of the single alternative that comprises the whole pattern, and so         part of the single alternative that comprises the whole pattern, and so
5894         the  match  fails.  (If  there was a backtrack into .*?, allowing it to         the match fails. (If there was a backtrack into  .*?,  allowing  it  to
5895         match "b", the match would succeed.)         match "b", the match would succeed.)
5896    
5897         The verbs just described provide four different "strengths" of  control         The  verbs just described provide four different "strengths" of control
5898         when subsequent matching fails. (*THEN) is the weakest, carrying on the         when subsequent matching fails. (*THEN) is the weakest, carrying on the
5899         match at the next alternative. (*PRUNE) comes next, failing  the  match         match  at  the next alternative. (*PRUNE) comes next, failing the match
5900         at  the  current starting position, but allowing an advance to the next         at the current starting position, but allowing an advance to  the  next
5901         character (for an unanchored pattern). (*SKIP) is similar, except  that         character  (for an unanchored pattern). (*SKIP) is similar, except that
5902         the advance may be more than one character. (*COMMIT) is the strongest,         the advance may be more than one character. (*COMMIT) is the strongest,
5903         causing the entire match to fail.         causing the entire match to fail.
5904    
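The different "strengths" can be compared with a hand-written model of the unanchored pattern a+(*VERB)b. This is a sketch under stated assumptions, not PCRE code: the names match_model() and enum verb are invented for illustration, the start-of-match optimizations are ignored, and (*THEN) is left out because the pattern has no alternation. For this particular pattern, backtracking into a+ can never help (the character after a shorter run of "a" is always another "a"), so the model treats "no verb" and (*PRUNE) alike.

```c
#include <assert.h>
#include <string.h>

enum verb { VERB_NONE, VERB_PRUNE, VERB_SKIP, VERB_COMMIT };

/* Toy model of the unanchored pattern a+(*VERB)b; this is not PCRE code.
   Returns the offset of the match or -1, and stores in *attempts the
   number of starting positions that were tried. */
int match_model(const char *s, enum verb v, int *attempts)
{
    size_t len, start = 0;

    assert(s != NULL && attempts != NULL);
    len = strlen(s);
    *attempts = 0;

    while (start < len) {
        size_t end = start;
        size_t next = start + 1;        /* normal "bumpalong" of one character */

        (*attempts)++;
        while (end < len && s[end] == 'a')
            end++;                      /* greedy a+ */

        if (end > start) {
            /* The verb is passed here, at subject position "end". */
            if (s[end] == 'b')
                return (int)start;
            if (v == VERB_COMMIT)
                return -1;              /* the entire match fails */
            if (v == VERB_SKIP)
                next = end;             /* advance to where (*SKIP) was met */
            /* VERB_PRUNE and VERB_NONE: fail at this starting position only. */
        }
        start = next;
    }
    return -1;
}
```

For the subject "aaacaab", the model finds the match at offset 4 after five attempts with (*PRUNE) and after three attempts with (*SKIP), and reports no match after a single attempt with (*COMMIT). It also reproduces the earlier (*COMMIT) example: "xxaab" matches at offset 2, and "aacaab" does not match.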
# Line 5940  BACKTRACKING CONTROL Line 5908  BACKTRACKING CONTROL
5908    
5909           (A(*COMMIT)B(*THEN)C|D)           (A(*COMMIT)B(*THEN)C|D)
5910    
5911         Once A has matched, PCRE is committed to this  match,  at  the  current         Once  A  has  matched,  PCRE is committed to this match, at the current
5912         starting  position. If subsequently B matches, but C does not, the nor-         starting position. If subsequently B matches, but C does not, the  nor-
5913         mal (*THEN) action of trying the next alternative (that is, D) does not         mal (*THEN) action of trying the next alternative (that is, D) does not
5914         happen because (*COMMIT) overrides.         happen because (*COMMIT) overrides.
5915    
# Line 5960  AUTHOR Line 5928  AUTHOR
5928    
5929  REVISION  REVISION
5930    
5931         Last updated: 29 November 2011         Last updated: 19 October 2011
5932         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
5933  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5934    
# Line 6529  AVAILABILITY OF JIT SUPPORT Line 6497  AVAILABILITY OF JIT SUPPORT
6497         been  fully  tested. If --enable-jit is set on an unsupported platform,         been  fully  tested. If --enable-jit is set on an unsupported platform,
6498         compilation fails.         compilation fails.
6499    
6500         A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-         A program can tell if JIT support is available by calling pcre_config()
6501         port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT         with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available,
6502         option. The result is 1 when JIT is available, and  0  otherwise.  How-         and 0 otherwise. However, a simple program does not need to check  this
6503         ever, a simple program does not need to check this in order to use JIT.         in order to use JIT. The API is implemented in a way that falls back to
6504         The API is implemented in a way that falls back to  the  ordinary  PCRE         the ordinary PCRE code if JIT is not available.
        code if JIT is not available.  
   
        If  your program may sometimes be linked with versions of PCRE that are  
        older than 8.20, but you want to use JIT when it is available, you  can  
        test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT  
        macro such as PCRE_CONFIG_JIT, for compile-time control of your code.  
6505    
6506    
6507  SIMPLE USE OF JIT  SIMPLE USE OF JIT
# Line 6555  SIMPLE USE OF JIT Line 6517  SIMPLE USE OF JIT
6517               no longer needed instead of just freeing it yourself. This               no longer needed instead of just freeing it yourself. This
6518               ensures that any JIT data is also freed.               ensures that any JIT data is also freed.
6519    
        For  a  program  that may be linked with pre-8.20 versions of PCRE, you  
        can insert  
   
          #ifndef PCRE_STUDY_JIT_COMPILE  
          #define PCRE_STUDY_JIT_COMPILE 0  
          #endif  
   
        so that no option is passed to pcre_study(),  and  then  use  something  
        like this to free the study data:  
   
          #ifdef PCRE_CONFIG_JIT  
              pcre_free_study(study_ptr);  
          #else  
              pcre_free(study_ptr);  
          #endif  
   
6520         In  some circumstances you may need to call additional functions. These         In  some circumstances you may need to call additional functions. These
6521         are described in the  section  entitled  "Controlling  the  JIT  stack"         are described in the  section  entitled  "Controlling  the  JIT  stack"
6522         below.         below.
# Line 6609  UNSUPPORTED OPTIONS AND PATTERN ITEMS Line 6555  UNSUPPORTED OPTIONS AND PATTERN ITEMS
6555    
6556         The unsupported pattern items are:         The unsupported pattern items are:
6557    
6558           \C             match a single byte; not supported in UTF-8 mode           \C            match a single byte; not supported in UTF-8 mode
6559           (?Cn)          callouts           (?Cn)          callouts
6560             (?(<name>)...  conditional test on setting of a named subpattern
6561             (?(R)...       conditional test on whole pattern recursion
6562             (?(Rn)...      conditional test on recursion, by number
6563             (?(R&name)...  conditional test on recursion, by name
6564           (*COMMIT)      )           (*COMMIT)      )
6565           (*MARK)        )           (*MARK)        )
6566           (*PRUNE)       ) the backtracking control verbs           (*PRUNE)       ) the backtracking control verbs
# Line 6659  CONTROLLING THE JIT STACK Line 6609  CONTROLLING THE JIT STACK
6609         large  or  complicated  patterns  need  more  than  this.   The   error         large  or  complicated  patterns  need  more  than  this.   The   error
6610         PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.         PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.
6611         Three functions are provided for managing blocks of memory for  use  as         Three functions are provided for managing blocks of memory for  use  as
6612         JIT  stacks. There is further discussion about the use of JIT stacks in         JIT stacks.
        the section entitled "JIT stack FAQ" below.  
6613    
6614         The pcre_jit_stack_alloc() function creates a JIT stack. Its  arguments         The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
6615         are  a starting size and a maximum size, and it returns a pointer to an         are a starting size and a maximum size, and it returns a pointer to  an
6616         opaque structure of type pcre_jit_stack, or NULL if there is an  error.         opaque  structure of type pcre_jit_stack, or NULL if there is an error.
6617         The  pcre_jit_stack_free() function can be used to free a stack that is         The pcre_jit_stack_free() function can be used to free a stack that  is
6618         no longer needed. (For the technically minded:  the  address  space  is         no  longer  needed.  (For  the technically minded: the address space is
6619         allocated by mmap or VirtualAlloc.)         allocated by mmap or VirtualAlloc.)
6620    
6621         JIT  uses far less memory for recursion than the interpretive code, and         JIT uses far less memory for recursion than the interpretive code,  and
6622         a maximum stack size of 512K to 1M should be more than enough  for  any         a  maximum  stack size of 512K to 1M should be more than enough for any
6623         pattern.         pattern.
6624    
6625         The  pcre_assign_jit_stack()  function  specifies  which stack JIT code         The pcre_assign_jit_stack() function specifies  which  stack  JIT  code
6626         should use. Its arguments are as follows:         should use. Its arguments are as follows:
6627    
6628           pcre_extra         *extra           pcre_extra         *extra
6629           pcre_jit_callback  callback           pcre_jit_callback  callback
6630           void               *data           void               *data
6631    
6632         The extra argument must be  the  result  of  studying  a  pattern  with         The  extra  argument  must  be  the  result  of studying a pattern with
6633         PCRE_STUDY_JIT_COMPILE.  There  are  three  cases for the values of the         PCRE_STUDY_JIT_COMPILE. There are three cases for  the  values  of  the
6634         other two options:         other two options:
6635    
6636           (1) If callback is NULL and data is NULL, an internal 32K block           (1) If callback is NULL and data is NULL, an internal 32K block
# Line 6696  CONTROLLING THE JIT STACK Line 6645  CONTROLLING THE JIT STACK
6645               is used; otherwise the return value must be a valid JIT stack,               is used; otherwise the return value must be a valid JIT stack,
6646               the result of calling pcre_jit_stack_alloc().               the result of calling pcre_jit_stack_alloc().
6647    
6648         You may safely assign the same JIT stack to more than one  pattern,  as         You  may  safely assign the same JIT stack to more than one pattern, as
6649         long as they are all matched sequentially in the same thread. In a mul-         long as they are all matched sequentially in the same thread. In a mul-
6650         tithread application, each thread must use its own JIT stack.         tithread application, each thread must use its own JIT stack.
6651    
6652         Strictly speaking, even more is allowed. You can assign the same  stack         Strictly  speaking, even more is allowed. You can assign the same stack
6653         to  any number of patterns as long as they are not used for matching by         to any number of patterns as long as they are not used for matching  by
6654         multiple threads at the same time. For example, you can assign the same         multiple threads at the same time. For example, you can assign the same
6655         stack  to all compiled patterns, and use a global mutex in the callback         stack to all compiled patterns, and use a global mutex in the  callback
6656         to wait until the stack is available for use. However, this is an inef-         to wait until the stack is available for use. However, this is an inef-
6657         ficient solution, and not recommended.         ficient solution, and not recommended.
6658    
6659         This  is  a  suggestion  for  how a typical multithreaded program might         This is a suggestion for how  a  typical  multithreaded  program  might
6660         operate:         operate:
6661    
6662           During thread initialization           During thread initialization
# Line 6719  CONTROLLING THE JIT STACK Line 6668  CONTROLLING THE JIT STACK
6668           Use a one-line callback function           Use a one-line callback function
6669             return thread_local_var             return thread_local_var
6670    
6671         All the functions described in this section do nothing if  JIT  is  not         All  the  functions  described in this section do nothing if JIT is not
6672         available,  and  pcre_assign_jit_stack()  does nothing unless the extra         available, and pcre_assign_jit_stack() does nothing  unless  the  extra
6673         argument is non-NULL and points to  a  pcre_extra  block  that  is  the         argument  is  non-NULL  and  points  to  a pcre_extra block that is the
6674         result of a successful study with PCRE_STUDY_JIT_COMPILE.         result of a successful study with PCRE_STUDY_JIT_COMPILE.
6675    
6676    
 JIT STACK FAQ  
   
        (1) Why do we need JIT stacks?  
   
        PCRE  (and JIT) is a recursive, depth-first engine, so it needs a stack  
        where the local data of the current node is pushed before checking  its  
        child nodes.  Allocating real machine stack on some platforms is diffi-  
        cult. For example, the stack chain needs to be updated every time if we  
        extend  the  stack  on  PowerPC.  Although it is possible, its updating  
        time overhead decreases performance. So we do the recursion in memory.  
   
        (2) Why don't we simply allocate blocks of memory with malloc()?  
   
        Modern operating systems have a  nice  feature:  they  can  reserve  an  
        address space instead of allocating memory. We can safely allocate mem-  
        ory pages inside this address space, so the stack  could  grow  without  
        moving memory data (this is important because of pointers). Thus we can  
        allocate 1M address space, and use only a single memory  page  (usually  
        4K)  if  that is enough. However, we can still grow up to 1M anytime if  
        needed.  
   
        (3) Who "owns" a JIT stack?  
   
        The owner of the stack is the user program, not the JIT studied pattern  
        or  anything else. The user program must ensure that if a stack is used  
        by pcre_exec(), (that is, it is assigned to the pattern currently  run-  
        ning), that stack must not be used by any other threads (to avoid over-  
        writing the same memory area). The best practice for multithreaded pro-  
        grams  is  to  allocate  a stack for each thread, and return this stack  
        through the JIT callback function.  
   
        (4) When should a JIT stack be freed?  
   
        You can free a JIT stack at any time, as long as it will not be used by  
        pcre_exec()  again.  When  you  assign  the  stack to a pattern, only a  
        pointer is set. There is no reference counting or any other magic.  You  
        can  free  the  patterns  and stacks in any order, anytime. Just do not  
        call pcre_exec() with a pattern pointing to an already freed stack,  as  
        that  will cause SEGFAULT. (Also, do not free a stack currently used by  
        pcre_exec() in another thread). You can also replace the  stack  for  a  
        pattern  at  any  time.  You  can  even  free the previous stack before  
        assigning a replacement.  
   
        (5) Should I allocate/free a  stack  every  time  before/after  calling  
        pcre_exec()?  
   
        No,  because  this  is  too  costly in terms of resources. However, you  
        could implement some clever idea which releases the stack if it  is  not  
        used in let's say two minutes. The JIT callback can help to achieve this  
        without keeping a list of the currently JIT studied patterns.  
   
        (6) OK, the stack is for long term memory allocation. But what  happens  
        if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept  
        until the stack is freed?  
   
        Especially on embedded systems, it might be a good idea to release  mem-  
        ory  sometimes  without  freeing the stack. There is no API for this at  
        the moment. Probably a function call which returns with  the  currently  
        allocated  memory for any stack and another which allows releasing mem-  
        ory (shrinking the stack) would be a good idea if someone needs this.  
   
        (7) This is too much of a headache. Isn't there any better solution for  
        JIT stack handling?  
   
        No,  thanks to Windows. If POSIX threads were used everywhere, we could  
        throw out this complicated API.  
   
   
6677  EXAMPLE CODE  EXAMPLE CODE
6678    
6679         This is a single-threaded example that specifies a  JIT  stack  without         This is a single-threaded example that specifies a  JIT  stack  without
# Line 6824  SEE ALSO Line 6705  SEE ALSO
6705    
6706  AUTHOR  AUTHOR
6707    
6708         Philip Hazel (FAQ by Zoltan Herczeg)         Philip Hazel
6709         University Computing Service         University Computing Service
6710         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
6711    
6712    
6713  REVISION  REVISION
6714    
6715         Last updated: 26 November 2011         Last updated: 19 October 2011
6716         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
6717  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6718    
# Line 8272  SIZE AND OTHER LIMITATIONS Line 8153  SIZE AND OTHER LIMITATIONS
8153         There is no limit to the number of parenthesized subpatterns, but there         There is no limit to the number of parenthesized subpatterns, but there
8154         can be no more than 65535 capturing subpatterns.         can be no more than 65535 capturing subpatterns.
8155    
        There is a limit to the number of forward references to subsequent sub-  
        patterns of around 200,000.  Repeated  forward  references  with  fixed  
        upper  limits,  for example, (?2){0,100} when subpattern number 2 is to  
        the right, are included in the count. There is no limit to  the  number  
        of backward references.  
   
8156         The maximum length of name for a named subpattern is 32 characters, and         The maximum length of name for a named subpattern is 32 characters, and
8157         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
8158    
# Line 8298  AUTHOR Line 8173  AUTHOR
8173    
8174  REVISION  REVISION
8175    
8176         Last updated: 30 November 2011         Last updated: 24 August 2011
8177         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
8178  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
8179    

