Diff of /code/trunk/doc/pcre.txt

revision 654 by ph10, Tue Aug 2 11:00:40 2011 UTC revision 835 by ph10, Wed Dec 28 16:10:09 2011 UTC
# Line 85  USER DOCUMENTATION Line 85  USER DOCUMENTATION
85           pcrecpp           details of the C++ wrapper           pcrecpp           details of the C++ wrapper
86           pcredemo          a demonstration C program that uses PCRE           pcredemo          a demonstration C program that uses PCRE
87           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
88             pcrejit           discussion of the just-in-time optimization support
89             pcrelimits        details of size and other limits
90           pcrematching      discussion of the two matching algorithms           pcrematching      discussion of the two matching algorithms
91           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
92           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
# Line 96  USER DOCUMENTATION Line 98  USER DOCUMENTATION
98           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
99           pcresyntax        quick syntax reference           pcresyntax        quick syntax reference
100           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
101             pcreunicode       discussion of Unicode and UTF-8 support
102    
103         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
104         each C library function, listing its arguments and results.         each C library function, listing its arguments and results.
105    
106    
 LIMITATIONS  
   
        There are some size limitations in PCRE but it is hoped that they  will  
        never in practice be relevant.  
   
        The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE  
        is compiled with the default internal linkage size of 2. If you want to  
        process  regular  expressions  that are truly enormous, you can compile  
        PCRE with an internal linkage size of 3 or 4 (see the  README  file  in  
        the  source  distribution and the pcrebuild documentation for details).  
        In these cases the limit is substantially larger.  However,  the  speed  
        of execution is slower.  
   
        All values in repeating quantifiers must be less than 65536.  
   
        There is no limit to the number of parenthesized subpatterns, but there  
        can be no more than 65535 capturing subpatterns.  
   
        The maximum length of name for a named subpattern is 32 characters, and  
        the maximum number of named subpatterns is 10000.  
   
        The  maximum  length of a subject string is the largest positive number  
        that an integer variable can hold. However, when using the  traditional  
        matching function, PCRE uses recursion to handle subpatterns and indef-  
        inite repetition.  This means that the available stack space may  limit  
        the size of a subject string that can be processed by certain patterns.  
        For a discussion of stack issues, see the pcrestack documentation.  
   
   
 UTF-8 AND UNICODE PROPERTY SUPPORT  
   
        From release 3.3, PCRE has  had  some  support  for  character  strings  
        encoded  in the UTF-8 format. For release 4.0 this was greatly extended  
        to cover most common requirements, and in release 5.0  additional  sup-  
        port for Unicode general category properties was added.  
   
        In order to process UTF-8 strings, you must build PCRE to include UTF-8
        support in the code, and, in addition,  you  must  call  pcre_compile()  
        with  the  PCRE_UTF8  option  flag,  or the pattern must start with the  
        sequence (*UTF8). When either of these is the case,  both  the  pattern  
        and  any  subject  strings  that  are matched against it are treated as  
        UTF-8 strings instead of strings of 1-byte characters.  
   
        If you compile PCRE with UTF-8 support, but do not use it at run  time,  
        the  library will be a bit bigger, but the additional run time overhead  
        is limited to testing the PCRE_UTF8 flag occasionally, so should not be  
        very big.  
   
        If PCRE is built with Unicode character property support (which implies  
        UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-  
        ported.  The available properties that can be tested are limited to the  
        general category properties such as Lu for an upper case letter  or  Nd  
        for  a  decimal number, the Unicode script names such as Arabic or Han,  
        and the derived properties Any and L&. A full  list  is  given  in  the  
        pcrepattern documentation. Only the short names for properties are sup-  
        ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-  
        ter},  is  not  supported.   Furthermore,  in Perl, many properties may  
        optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE  
        does not support this.  
   
    Validity of UTF-8 strings  
   
        When  you  set  the  PCRE_UTF8 flag, the strings passed as patterns and  
        subjects are (by default) checked for validity on entry to the relevant  
        functions. From release 7.3 of PCRE, the check is according to the rules
        of RFC 3629, which are themselves derived from the  Unicode  specifica-  
        tion.  Earlier  releases  of PCRE followed the rules of RFC 2279, which  
        allows the full range of 31-bit values (0 to 0x7FFFFFFF).  The  current  
        check allows only values in the range U+0 to U+10FFFF, excluding U+D800  
        to U+DFFF.  
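
        The range check described above can be sketched in C as follows.
        This is an illustration only, not PCRE's own code: the function
        name utf8_first_invalid() is hypothetical, and the rejection of
        overlong encodings is taken from RFC 3629 rather than from the
        text above. It returns the offset of the first byte of the
        failing character, or -1 if the whole buffer passes.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns -1 when the buffer is
   valid UTF-8 under RFC 3629, otherwise the offset of the first byte of
   the first failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t extra, k;

        if (c < 0x80) {                 /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F;   /* 2-byte sequence */
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F;   /* 3-byte sequence */
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07;   /* 4-byte sequence */
        } else {
            return (long)i;             /* stray continuation or 5/6-byte lead */
        }

        if (len - i <= extra)
            return (long)i;             /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;         /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        /* Reject overlong forms (RFC 3629). */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000))
            return (long)i;

        /* Only U+0 to U+10FFFF, excluding U+D800 to U+DFFF. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;

        i += extra + 1;
    }
    return -1;
}
```

        A caller that has run a check of this kind on its own data is in
        the position where skipping the library's check may be
        reasonable.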
   
        The excluded code points are the "Low Surrogate Area"  of  Unicode,  of  
        which  the Unicode Standard says this: "The Low Surrogate Area does not  
        contain any  character  assignments,  consequently  no  character  code  
        charts or namelists are provided for this area. Surrogates are reserved  
        for use with UTF-16 and then must be used in pairs."  The  code  points  
        that  are  encoded  by  UTF-16  pairs are available as independent code  
        points in the UTF-8 encoding. (In  other  words,  the  whole  surrogate  
        thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)  
   
        If an invalid UTF-8 string is passed to PCRE, an error return is given.  
        At compile time, the only additional information is the offset  to  the  
        first byte of the failing character. The runtime functions (pcre_exec()  
        and pcre_dfa_exec()) pass back this information as well as a more
        detailed  reason  code if the caller has provided memory in which to do  
        this.  
   
        In some situations, you may already know that your strings  are  valid,  
        and  therefore  want  to  skip these checks in order to improve perfor-  
        mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run  
        time,  PCRE  assumes  that  the pattern or subject it is given (respec-  
        tively) contains only valid UTF-8 codes. In  this  case,  it  does  not  
        diagnose an invalid UTF-8 string.  
   
        If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,  
        what happens depends on why the string is invalid. If the  string  con-  
        forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a  
        string of characters in the range 0  to  0x7FFFFFFF.  In  other  words,  
        apart from the initial validity test, PCRE (when in UTF-8 mode) handles  
        strings according to the more liberal rules of RFC  2279.  However,  if  
        the  string does not even conform to RFC 2279, the result is undefined.  
        Your program may crash.  
   
        If you want to process strings  of  values  in  the  full  range  0  to  
        0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can  
        set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in  
        this situation, you will have to apply your own validity check.  
   
    General comments about UTF-8 mode  
   
        1.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a  
        two-byte UTF-8 character if the value is greater than 127.  
   
        2. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8  
        characters for values greater than \177.  
   
        3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-  
        vidual bytes, for example: \x{100}{3}.  
   
        4. The dot metacharacter matches one UTF-8 character instead of a  sin-  
        gle byte.  
   
        5.  The  escape sequence \C can be used to match a single byte in UTF-8  
        mode, but its use can lead to some strange effects.  This  facility  is  
        not available in the alternative matching function, pcre_dfa_exec().  
   
        6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly  
        test characters of any code value, but, by default, the characters that  
        PCRE  recognizes  as digits, spaces, or word characters remain the same  
        set as before, all with values less than 256. This  remains  true  even  
        when  PCRE  is built to include Unicode property support, because to do  
        otherwise would slow down PCRE in many common cases. Note in particular  
        that this applies to \b and \B, because they are defined in terms of \w  
        and \W. If you really want to test for a wider sense of, say,  "digit",  
        you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-  
        tively, if you set the PCRE_UCP option,  the  way  that  the  character  
        escapes  work  is changed so that Unicode properties are used to deter-  
        mine which characters match. There are more details in the  section  on  
        generic character types in the pcrepattern documentation.  
   
        7.  Similarly,  characters that match the POSIX named character classes  
        are all low-valued characters, unless the PCRE_UCP option is set.  
   
        8. However, the horizontal and  vertical  whitespace  matching  escapes  
        (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,  
        whether or not PCRE_UCP is set.  
   
        9. Case-insensitive matching applies only to  characters  whose  values  
        are  less than 128, unless PCRE is built with Unicode property support.  
        Even when Unicode property support is available, PCRE  still  uses  its  
        own  character  tables when checking the case of low-valued characters,  
        so as not to degrade performance.  The Unicode property information  is  
        used only for characters with higher values. Furthermore, PCRE supports  
        case-insensitive matching only  when  there  is  a  one-to-one  mapping  
        between  a letter's cases. There are a small number of many-to-one map-  
        pings in Unicode; these are not supported by PCRE.  
   
   
107  AUTHOR  AUTHOR
108    
109         Philip Hazel         Philip Hazel
# Line 272  AUTHOR Line 117  AUTHOR
117    
118  REVISION  REVISION
119    
120         Last updated: 07 May 2011         Last updated: 24 August 2011
121         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
122  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
123    
# Line 372  UNICODE CHARACTER PROPERTY SUPPORT Line 217  UNICODE CHARACTER PROPERTY SUPPORT
217         are supported. Details are given in the pcrepattern documentation.         are supported. Details are given in the pcrepattern documentation.
218    
219    
220    JUST-IN-TIME COMPILER SUPPORT
221    
222           Just-in-time compiler support is included in the build by specifying
223    
224             --enable-jit
225    
226           This  support  is available only for certain hardware architectures. If
227           this option is set for an  unsupported  architecture,  a  compile  time
228           error  occurs.   See  the pcrejit documentation for a discussion of JIT
229           usage. When JIT support is enabled, pcregrep automatically makes use of
230           it, unless you add
231    
232             --disable-pcregrep-jit
233    
234           to the "configure" command.
235    
236    
237  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
238    
239         By  default,  PCRE interprets the linefeed (LF) character as indicating         By  default,  PCRE interprets the linefeed (LF) character as indicating
# Line 619  AUTHOR Line 481  AUTHOR
481    
482  REVISION  REVISION
483    
484         Last updated: 02 August 2011         Last updated: 06 September 2011
485         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
486  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
487    
# Line 835  NAME Line 697  NAME
697         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
698    
699    
700  PCRE NATIVE API  PCRE NATIVE API BASIC FUNCTIONS
701    
702         #include <pcre.h>         #include <pcre.h>
703    
# Line 851  PCRE NATIVE API Line 713  PCRE NATIVE API
713         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
714              const char **errptr);              const char **errptr);
715    
716           void pcre_free_study(pcre_extra *extra);
717    
718         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
719              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
720              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
721    
722    
723    PCRE NATIVE API AUXILIARY FUNCTIONS
724    
725           pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
726    
727           void pcre_jit_stack_free(pcre_jit_stack *stack);
728    
729           void pcre_assign_jit_stack(pcre_extra *extra,
730                pcre_jit_callback callback, void *data);
731    
732         int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,         int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
733              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
734              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
# Line 904  PCRE NATIVE API Line 778  PCRE NATIVE API
778    
779         char *pcre_version(void);         char *pcre_version(void);
780    
781    
782    PCRE NATIVE API INDIRECTED FUNCTIONS
783    
784         void *(*pcre_malloc)(size_t);         void *(*pcre_malloc)(size_t);
785    
786         void (*pcre_free)(void *);         void (*pcre_free)(void *);
# Line 919  PCRE API OVERVIEW Line 796  PCRE API OVERVIEW
796    
797         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
798         are also some wrapper functions that correspond to  the  POSIX  regular         are also some wrapper functions that correspond to  the  POSIX  regular
799         expression  API.  These  are  described in the pcreposix documentation.         expression  API,  but they do not give access to all the functionality.
800         Both of these APIs define a set of C function calls. A C++  wrapper  is         They are described in the pcreposix documentation. Both of  these  APIs
801         distributed with PCRE. It is documented in the pcrecpp page.         define  a  set  of  C function calls. A C++ wrapper is also distributed
802           with PCRE. It is documented in the pcrecpp page.
803    
804         The  native  API  C  function prototypes are defined in the header file         The native API C function prototypes are defined  in  the  header  file
805         pcre.h, and on Unix systems the library itself is called  libpcre.   It         pcre.h,  and  on Unix systems the library itself is called libpcre.  It
806         can normally be accessed by adding -lpcre to the command for linking an         can normally be accessed by adding -lpcre to the command for linking an
807         application  that  uses  PCRE.  The  header  file  defines  the  macros         application  that  uses  PCRE.  The  header  file  defines  the  macros
808         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-         PCRE_MAJOR and PCRE_MINOR to contain the major and minor  release  num-
809         bers for the library.  Applications can use these  to  include  support         bers  for  the  library.  Applications can use these to include support
810         for different releases of PCRE.         for different releases of PCRE.
811    
812         In a Windows environment, if you want to statically link an application         In a Windows environment, if you want to statically link an application
813         program against a non-dll pcre.a  file,  you  must  define  PCRE_STATIC         program  against  a  non-dll  pcre.a  file, you must define PCRE_STATIC
814         before  including  pcre.h or pcrecpp.h, because otherwise the pcre_mal-         before including pcre.h or pcrecpp.h, because otherwise  the  pcre_mal-
815         loc()   and   pcre_free()   exported   functions   will   be   declared         loc()   and   pcre_free()   exported   functions   will   be   declared
816         __declspec(dllimport), with unwanted results.         __declspec(dllimport), with unwanted results.
817    
818         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and         The  functions  pcre_compile(),  pcre_compile2(),   pcre_study(),   and
819         pcre_exec() are used for compiling and matching regular expressions  in         pcre_exec()  are used for compiling and matching regular expressions in
820         a  Perl-compatible  manner. A sample program that demonstrates the sim-         a Perl-compatible manner. A sample program that demonstrates  the  sim-
821         plest way of using them is provided in the file  called  pcredemo.c  in         plest  way  of  using them is provided in the file called pcredemo.c in
822         the PCRE source distribution. A listing of this program is given in the         the PCRE source distribution. A listing of this program is given in the
823         pcredemo documentation, and the pcresample documentation describes  how         pcredemo  documentation, and the pcresample documentation describes how
824         to compile and run it.         to compile and run it.
825    
826           Just-in-time compiler support is an optional feature of PCRE  that  can
827           be built in appropriate hardware environments. It greatly speeds up the
828           matching performance of  many  patterns.  Simple  programs  can  easily
829           request  that  it  be  used  if available, by setting an option that is
830           ignored when it is not relevant. More complicated programs  might  need
831           to     make    use    of    the    functions    pcre_jit_stack_alloc(),
832           pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to  control
833           the  JIT  code's  memory  usage.   These functions are discussed in the
834           pcrejit documentation.
835    
836         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
837         ble, is also provided. This uses a different algorithm for  the  match-         ble,  is  also provided. This uses a different algorithm for the match-
838         ing.  The  alternative algorithm finds all possible matches (at a given         ing. The alternative algorithm finds all possible matches (at  a  given
839         point in the subject), and scans the subject just  once  (unless  there         point  in  the  subject), and scans the subject just once (unless there
840         are  lookbehind  assertions).  However,  this algorithm does not return         are lookbehind assertions). However, this  algorithm  does  not  return
841         captured substrings. A description of the two matching  algorithms  and         captured  substrings.  A description of the two matching algorithms and
842         their  advantages  and disadvantages is given in the pcrematching docu-         their advantages and disadvantages is given in the  pcrematching  docu-
843         mentation.         mentation.
844    
845         In addition to the main compiling and  matching  functions,  there  are         In  addition  to  the  main compiling and matching functions, there are
846         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
847         string that is matched by pcre_exec(). They are:         string that is matched by pcre_exec(). They are:
848    
# Line 969  PCRE API OVERVIEW Line 857  PCRE API OVERVIEW
857         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
858         to free the memory used for extracted strings.         to free the memory used for extracted strings.
859    
860         The  function  pcre_maketables()  is  used  to build a set of character         The function pcre_maketables() is used to  build  a  set  of  character
861         tables  in  the  current  locale   for   passing   to   pcre_compile(),         tables   in   the   current   locale  for  passing  to  pcre_compile(),
862         pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is         pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
863         provided for specialist use.  Most  commonly,  no  special  tables  are         provided  for  specialist  use.  Most  commonly,  no special tables are
864         passed,  in  which case internal tables that are generated when PCRE is         passed, in which case internal tables that are generated when  PCRE  is
865         built are used.         built are used.
866    
867         The function pcre_fullinfo() is used to find out  information  about  a         The  function  pcre_fullinfo()  is used to find out information about a
868         compiled  pattern; pcre_info() is an obsolete version that returns only         compiled pattern; pcre_info() is an obsolete version that returns  only
869         some of the available information, but is retained for  backwards  com-         some  of  the available information, but is retained for backwards com-
870         patibility.   The function pcre_version() returns a pointer to a string         patibility.  The function pcre_version() returns a pointer to a  string
871         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
872    
873         The function pcre_refcount() maintains a  reference  count  in  a  data         The  function  pcre_refcount()  maintains  a  reference count in a data
874         block  containing  a compiled pattern. This is provided for the benefit         block containing a compiled pattern. This is provided for  the  benefit
875         of object-oriented applications.         of object-oriented applications.
876    
877         The global variables pcre_malloc and pcre_free  initially  contain  the         The  global  variables  pcre_malloc and pcre_free initially contain the
878         entry  points  of  the  standard malloc() and free() functions, respec-         entry points of the standard malloc()  and  free()  functions,  respec-
879         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
880         so  a  calling  program  can replace them if it wishes to intercept the         so a calling program can replace them if it  wishes  to  intercept  the
881         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
882    
883         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
884         indirections  to  memory  management functions. These special functions         indirections to memory management functions.  These  special  functions
885         are used only when PCRE is compiled to use  the  heap  for  remembering         are  used  only  when  PCRE is compiled to use the heap for remembering
886         data, instead of recursive function calls, when running the pcre_exec()         data, instead of recursive function calls, when running the pcre_exec()
887         function. See the pcrebuild documentation for  details  of  how  to  do         function.  See  the  pcrebuild  documentation  for details of how to do
888         this.  It  is  a non-standard way of building PCRE, for use in environ-         this. It is a non-standard way of building PCRE, for  use  in  environ-
889         ments that have limited stacks. Because of the greater  use  of  memory         ments  that  have  limited stacks. Because of the greater use of memory
890         management,  it  runs  more  slowly. Separate functions are provided so         management, it runs more slowly. Separate  functions  are  provided  so
891         that special-purpose external code can be  used  for  this  case.  When         that  special-purpose  external  code  can  be used for this case. When
892         used,  these  functions  are always called in a stack-like manner (last         used, these functions are always called in a  stack-like  manner  (last
893         obtained, first freed), and always for memory blocks of the same  size.         obtained,  first freed), and always for memory blocks of the same size.
894         There  is  a discussion about PCRE's stack usage in the pcrestack docu-         There is a discussion about PCRE's stack usage in the  pcrestack  docu-
895         mentation.         mentation.
896    
897         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
898         by  the  caller  to  a "callout" function, which PCRE will then call at         by the caller to a "callout" function, which PCRE  will  then  call  at
899         specified points during a matching operation. Details are given in  the         specified  points during a matching operation. Details are given in the
900         pcrecallout documentation.         pcrecallout documentation.
901    
902    
903  NEWLINES  NEWLINES
904    
905         PCRE  supports five different conventions for indicating line breaks in         PCRE supports five different conventions for indicating line breaks  in
906         strings: a single CR (carriage return) character, a  single  LF  (line-         strings:  a  single  CR (carriage return) character, a single LF (line-
907         feed) character, the two-character sequence CRLF, any of the three pre-         feed) character, the two-character sequence CRLF, any of the three pre-
908         ceding, or any Unicode newline sequence. The Unicode newline  sequences         ceding,  or any Unicode newline sequence. The Unicode newline sequences
909         are  the  three just mentioned, plus the single characters VT (vertical         are the three just mentioned, plus the single characters  VT  (vertical
910         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line         tab,  U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
911         separator, U+2028), and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
912    
913         Each  of  the first three conventions is used by at least one operating         Each of the first three conventions is used by at least  one  operating
914         system as its standard newline sequence. When PCRE is built, a  default         system  as its standard newline sequence. When PCRE is built, a default
915         can  be  specified.  The default default is LF, which is the Unix stan-         can be specified.  The default default is LF, which is the  Unix  stan-
916         dard. When PCRE is run, the default can be overridden,  either  when  a         dard.  When  PCRE  is run, the default can be overridden, either when a
917         pattern is compiled, or when it is matched.         pattern is compiled, or when it is matched.
918    
919         At compile time, the newline convention can be specified by the options         At compile time, the newline convention can be specified by the options
920         argument of pcre_compile(), or it can be specified by special  text  at         argument  of  pcre_compile(), or it can be specified by special text at
921         the start of the pattern itself; this overrides any other settings. See         the start of the pattern itself; this overrides any other settings. See
922         the pcrepattern page for details of the special character sequences.         the pcrepattern page for details of the special character sequences.
923    
924         In the PCRE documentation the word "newline" is used to mean "the char-         In the PCRE documentation the word "newline" is used to mean "the char-
925         acter  or pair of characters that indicate a line break". The choice of         acter or pair of characters that indicate a line break". The choice  of
926         newline convention affects the handling of  the  dot,  circumflex,  and         newline  convention  affects  the  handling of the dot, circumflex, and
927         dollar metacharacters, the handling of #-comments in /x mode, and, when         dollar metacharacters, the handling of #-comments in /x mode, and, when
928         CRLF is a recognized line ending sequence, the match position  advance-         CRLF  is a recognized line ending sequence, the match position advance-
929         ment for a non-anchored pattern. There is more detail about this in the         ment for a non-anchored pattern. There is more detail about this in the
930         section on pcre_exec() options below.         section on pcre_exec() options below.
931    
932         The choice of newline convention does not affect the interpretation  of         The  choice of newline convention does not affect the interpretation of
933         the  \n  or  \r  escape  sequences, nor does it affect what \R matches,         the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
934         which is controlled in a similar way, but by separate options.         which is controlled in a similar way, but by separate options.
935    
936    
937  MULTITHREADING  MULTITHREADING
938    
939         The PCRE functions can be used in  multi-threading  applications,  with         The  PCRE  functions  can be used in multi-threading applications, with
940         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
941         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
942         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
943    
944         The  compiled form of a regular expression is not altered during match-         The compiled form of a regular expression is not altered during  match-
945         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
946         at once.         at once.
947    
948           If the just-in-time optimization feature is being used, it needs  sepa-
949           rate  memory stack areas for each thread. See the pcrejit documentation
950           for more details.
951    
952    
953  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
954    
955         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
956         later time, possibly by a different program, and even on a  host  other         later  time,  possibly by a different program, and even on a host other
957         than  the  one  on  which  it  was  compiled.  Details are given in the         than the one on which  it  was  compiled.  Details  are  given  in  the
958         pcreprecompile documentation. However, compiling a  regular  expression         pcreprecompile  documentation.  However, compiling a regular expression
959         with  one version of PCRE for use with a different version is not guar-         with one version of PCRE for use with a different version is not  guar-
960         anteed to work and may cause crashes.         anteed to work and may cause crashes.
961    
962    
# Line 1072  CHECKING BUILD-TIME OPTIONS Line 964  CHECKING BUILD-TIME OPTIONS
964    
965         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
966    
967         The function pcre_config() makes it possible for a PCRE client to  dis-         The  function pcre_config() makes it possible for a PCRE client to dis-
968         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
969         The pcrebuild documentation has more details about these optional  fea-         The  pcrebuild documentation has more details about these optional fea-
970         tures.         tures.
971    
972         The  first  argument  for pcre_config() is an integer, specifying which         The first argument for pcre_config() is an  integer,  specifying  which
973         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
974         into  which  the  information  is  placed. The following information is         into which the information is  placed.  The  following  information  is
975         available:         available:
976    
977           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
978    
979         The output is an integer that is set to one if UTF-8 support is  avail-         The  output is an integer that is set to one if UTF-8 support is avail-
980         able; otherwise it is set to zero.         able; otherwise it is set to zero.
981    
982           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
983    
984         The  output  is  an  integer  that is set to one if support for Unicode         The output is an integer that is set to  one  if  support  for  Unicode
985         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
986    
987             PCRE_CONFIG_JIT
988    
989           The output is an integer that is set to one if support for just-in-time
990           compiling is available; otherwise it is set to zero.
991    
992           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
993    
994         The output is an integer whose value specifies  the  default  character         The output is an integer whose value specifies  the  default  character
# Line 1453  COMPILING A PATTERN Line 1350  COMPILING A PATTERN
1350         strings  of  UTF-8 characters instead of single-byte character strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1351         However, it is available only when PCRE is built to include UTF-8  sup-         However, it is available only when PCRE is built to include UTF-8  sup-
1352         port.  If not, the use of this option provokes an error. Details of how         port.  If not, the use of this option provokes an error. Details of how
1353         this option changes the behaviour of PCRE are given in the  section  on         this option changes the behaviour of PCRE are given in the  pcreunicode
1354         UTF-8 support in the main pcre page.         page.
1355    
1356           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1357    
# Line 1514  COMPILATION ERROR CODES Line 1411  COMPILATION ERROR CODES
1411           34  character value in \x{...} sequence is too large           34  character value in \x{...} sequence is too large
1412           35  invalid condition (?(0)           35  invalid condition (?(0)
1413           36  \C not allowed in lookbehind assertion           36  \C not allowed in lookbehind assertion
1414           37  PCRE does not support \L, \l, \N, \U, or \u           37  PCRE does not support \L, \l, \N{name}, \U, or \u
1415           38  number after (?C is > 255           38  number after (?C is > 255
1416           39  closing ) for (?C expected           39  closing ) for (?C expected
1417           40  recursive call could loop indefinitely           40  recursive call could loop indefinitely
# Line 1548  COMPILATION ERROR CODES Line 1445  COMPILATION ERROR CODES
1445                 not allowed                 not allowed
1446           66  (*MARK) must have an argument           66  (*MARK) must have an argument
1447           67  this version of PCRE is not compiled with PCRE_UCP support           67  this version of PCRE is not compiled with PCRE_UCP support
1448             68  \c must be followed by an ASCII character
1449             69  \k is not followed by a braced, angle-bracketed, or quoted name
1450    
1451         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
1452         values may be used if the limits were changed when PCRE was built.         values may be used if the limits were changed when PCRE was built.
# Line 1576  STUDYING A PATTERN Line 1475  STUDYING A PATTERN
1475         wants   to   pass   any   of   the   other  fields  to  pcre_exec()  or         wants   to   pass   any   of   the   other  fields  to  pcre_exec()  or
1476         pcre_dfa_exec(), it must set up its own pcre_extra block.         pcre_dfa_exec(), it must set up its own pcre_extra block.
1477    
1478         The second argument of pcre_study() contains option bits.  At  present,         The second argument of pcre_study() contains option bits. There is only
1479         no options are defined, and this argument should always be zero.         one  option:  PCRE_STUDY_JIT_COMPILE.  If this is set, and the just-in-
1480           time compiler is  available,  the  pattern  is  further  compiled  into
1481           machine  code  that  executes much faster than the pcre_exec() matching
1482           function. If the just-in-time compiler is not available, this option is
1483           ignored. All other bits in the options argument must be zero.
1484    
1485           JIT  compilation  is  a heavyweight optimization. It can take some time
1486           for patterns to be analyzed, and for one-off matches  and  simple  pat-
1487           terns  the benefit of faster execution might be offset by a much slower
1488           study time.  Not all patterns can be optimized by the JIT compiler. For
1489           those  that cannot be handled, matching automatically falls back to the
1490           pcre_exec() interpreter. For more details, see the  pcrejit  documenta-
1491           tion.
1492    
1493         The  third argument for pcre_study() is a pointer for an error message.         The  third argument for pcre_study() is a pointer for an error message.
1494         If studying succeeds (even if no data is  returned),  the  variable  it         If studying succeeds (even if no data is  returned),  the  variable  it
# Line 1586  STUDYING A PATTERN Line 1497  STUDYING A PATTERN
1497         must  not  try  to  free it. You should test the error pointer for NULL         must  not  try  to  free it. You should test the error pointer for NULL
1498         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1499    
1500         This is a typical call to pcre_study():         When you are finished with a pattern, you can free the memory used  for
1501           the study data by calling pcre_free_study(). This function was added to
1502           the API for release 8.20. For earlier versions,  the  memory  could  be
1503           freed  with  pcre_free(), just like the pattern itself. This will still
1504           work in cases where PCRE_STUDY_JIT_COMPILE  is  not  used,  but  it  is
1505           advisable to change to the new function when convenient.
1506    
1507           This  is  a typical way in which pcre_study() is used (except that in a
1508           real application there should be tests for errors):
1509    
1510           pcre_extra *pe;           int rc;
1511           pe = pcre_study(           pcre *re;
1512             pcre_extra *sd;
1513             re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
1514             sd = pcre_study(
1515             re,             /* result of pcre_compile() */             re,             /* result of pcre_compile() */
1516             0,              /* no options exist */             0,              /* no options */
1517             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1518             rc = pcre_exec(   /* see below for details of pcre_exec() options */
1519               re, sd, "subject", 7, 0, 0, ovector, 30);
1520             ...
1521             pcre_free_study(sd);
1522             pcre_free(re);
1523    
1524         Studying a pattern does two things: first, a lower bound for the length         Studying a pattern does two things: first, a lower bound for the length
1525         of subject string that is needed to match the pattern is computed. This         of subject string that is needed to match the pattern is computed. This
# Line 1607  STUDYING A PATTERN Line 1534  STUDYING A PATTERN
1534         bytes is created. This speeds up finding a position in the  subject  at         bytes is created. This speeds up finding a position in the  subject  at
1535         which to start matching.         which to start matching.
1536    
1537         The  two  optimizations  just  described can be disabled by setting the         These  two optimizations apply to both pcre_exec() and pcre_dfa_exec().
1538         PCRE_NO_START_OPTIMIZE   option    when    calling    pcre_exec()    or         However, they are not used by pcre_exec()  if  pcre_study()  is  called
1539         pcre_dfa_exec().  You  might  want  to do this if your pattern contains         with  the  PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling is
1540         callouts or (*MARK), and you want to make use of  these  facilities  in         successful.  The  optimizations  can  be  disabled   by   setting   the
1541         cases  where  matching fails. See the discussion of PCRE_NO_START_OPTI-         PCRE_NO_START_OPTIMIZE    option    when    calling    pcre_exec()   or
1542         MIZE below.         pcre_dfa_exec(). You might want to do this  if  your  pattern  contains
1543           callouts  or (*MARK) (which cannot be handled by the JIT compiler), and
1544           you want to make use of these facilities in cases where matching fails.
1545           See the discussion of PCRE_NO_START_OPTIMIZE below.
1546    
1547    
1548  LOCALE SUPPORT  LOCALE SUPPORT
1549    
1550         PCRE handles caseless matching, and determines whether  characters  are         PCRE  handles  caseless matching, and determines whether characters are
1551         letters,  digits, or whatever, by reference to a set of tables, indexed         letters, digits, or whatever, by reference to a set of tables,  indexed
1552         by character value. When running in UTF-8 mode, this  applies  only  to         by  character  value.  When running in UTF-8 mode, this applies only to
1553         characters  with  codes  less than 128. By default, higher-valued codes         characters with codes less than 128. By  default,  higher-valued  codes
1554         never match escapes such as \w or \d, but they can be tested with \p if         never match escapes such as \w or \d, but they can be tested with \p if
1555         PCRE  is  built with Unicode character property support. Alternatively,         PCRE is built with Unicode character property  support.  Alternatively,
1556         the PCRE_UCP option can be set at compile  time;  this  causes  \w  and         the  PCRE_UCP  option  can  be  set at compile time; this causes \w and
1557         friends to use Unicode property support instead of built-in tables. The         friends to use Unicode property support instead of built-in tables. The
1558         use of locales with Unicode is discouraged. If you are handling charac-         use of locales with Unicode is discouraged. If you are handling charac-
1559         ters  with codes greater than 128, you should either use UTF-8 and Uni-         ters with codes greater than 128, you should either use UTF-8 and  Uni-
1560         code, or use locales, but not try to mix the two.         code, or use locales, but not try to mix the two.
1561    
1562         PCRE contains an internal set of tables that are used  when  the  final         PCRE  contains  an  internal set of tables that are used when the final
1563         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1564         applications.  Normally, the internal tables recognize only ASCII char-         applications.  Normally, the internal tables recognize only ASCII char-
1565         acters. However, when PCRE is built, it is possible to cause the inter-         acters. However, when PCRE is built, it is possible to cause the inter-
1566         nal tables to be rebuilt in the default "C" locale of the local system,         nal tables to be rebuilt in the default "C" locale of the local system,
1567         which may cause them to be different.         which may cause them to be different.
1568    
1569         The  internal tables can always be overridden by tables supplied by the         The internal tables can always be overridden by tables supplied by  the
1570         application that calls PCRE. These may be created in a different locale         application that calls PCRE. These may be created in a different locale
1571         from  the  default.  As more and more applications change to using Uni-         from the default. As more and more applications change  to  using  Uni-
1572         code, the need for this locale support is expected to die away.         code, the need for this locale support is expected to die away.
1573    
1574         External tables are built by calling  the  pcre_maketables()  function,         External  tables  are  built by calling the pcre_maketables() function,
1575         which  has no arguments, in the relevant locale. The result can then be         which has no arguments, in the relevant locale. The result can then  be
1576         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1577         example,  to  build  and use tables that are appropriate for the French         example, to build and use tables that are appropriate  for  the  French
1578         locale (where accented characters with  values  greater  than  128  are         locale  (where  accented  characters  with  values greater than 128 are
1579         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1580    
1581           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1582           tables = pcre_maketables();           tables = pcre_maketables();
1583           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1584    
1585         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1586         if you are using Windows, the name for the French locale is "french".         if you are using Windows, the name for the French locale is "french".
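
The idea of a table "indexed by character value" that depends on the current locale can be sketched in plain C. This is an illustration only: the flag names and the one-byte-per-character layout are assumptions made for this example, not PCRE's internal table format, and the helper names are hypothetical.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical flag bits for this sketch; PCRE's real tables differ. */
enum { CT_LETTER = 0x01, CT_DIGIT = 0x02, CT_SPACE = 0x04 };

/* Fill a 256-entry table, indexed by character value, from the <ctype.h>
   classification of whichever locale is current when this is called. */
static void build_ctype_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        unsigned char flags = 0;
        if (isalpha(c)) flags |= CT_LETTER;
        if (isdigit(c)) flags |= CT_DIGIT;
        if (isspace(c)) flags |= CT_SPACE;
        table[c] = flags;
    }
}

/* Select a locale by name, then build the table in it.
   Returns 1 on success, 0 if the locale could not be selected. */
static int build_table_for_locale(const char *name, unsigned char table[256])
{
    if (setlocale(LC_CTYPE, name) == NULL) return 0;
    build_ctype_table(table);
    return 1;
}

/* Once the table exists, classification is a single lookup. */
static int is_letter(const unsigned char table[256], unsigned char c)
{
    return (table[c] & CT_LETTER) != 0;
}
```

In the "C" locale only ASCII letters are flagged as letters. In a single-byte French locale, if one is installed, accented characters with values above 127 would normally be flagged as well, which is why the locale must be set before the tables are built.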
1587    
1588         When pcre_maketables() runs, the tables are built  in  memory  that  is         When  pcre_maketables()  runs,  the  tables are built in memory that is
1589         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained via pcre_malloc. It is the caller's responsibility  to  ensure
1590         that the memory containing the tables remains available for as long  as         that  the memory containing the tables remains available for as long as
1591         it is needed.         it is needed.
1592    
1593         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1594         pattern, and the same tables are used via this pointer by  pcre_study()         pattern,  and the same tables are used via this pointer by pcre_study()
1595         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1596         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1597         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1598    
1599         It  is  possible to pass a table pointer or NULL (indicating the use of         It is possible to pass a table pointer or NULL (indicating the  use  of
1600         the internal tables) to pcre_exec(). Although  not  intended  for  this         the  internal  tables)  to  pcre_exec(). Although not intended for this
1601         purpose,  this facility could be used to match a pattern in a different         purpose, this facility could be used to match a pattern in a  different
1602         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1603         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1604    
# Line 1678  INFORMATION ABOUT A PATTERN Line 1608  INFORMATION ABOUT A PATTERN
1608         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1609              int what, void *where);              int what, void *where);
1610    
1611         The  pcre_fullinfo() function returns information about a compiled pat-         The pcre_fullinfo() function returns information about a compiled  pat-
1612         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1613         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1614    
1615         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
1616         pattern. The second argument is the result of pcre_study(), or NULL  if         pattern.  The second argument is the result of pcre_study(), or NULL if
1617         the  pattern  was not studied. The third argument specifies which piece         the pattern was not studied. The third argument specifies  which  piece
1618         of information is required, and the fourth argument is a pointer  to  a         of  information  is required, and the fourth argument is a pointer to a
1619         variable  to  receive  the  data. The yield of the function is zero for         variable to receive the data. The yield of the  function  is  zero  for
1620         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1621    
1622           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 1694  INFORMATION ABOUT A PATTERN Line 1624  INFORMATION ABOUT A PATTERN
1624           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1625           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1626    
1627         The "magic number" is placed at the start of each compiled  pattern  as         The  "magic  number" is placed at the start of each compiled pattern as
1628         a  simple check against passing an arbitrary memory pointer. Here is a         a simple check against passing an arbitrary memory pointer. Here is  a
1629         typical call of pcre_fullinfo(), to obtain the length of  the  compiled         typical  call  of pcre_fullinfo(), to obtain the length of the compiled
1630         pattern:         pattern:
1631    
1632           int rc;           int rc;
1633           size_t length;           size_t length;
1634           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1635             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1636             pe,               /* result of pcre_study(), or NULL */             sd,               /* result of pcre_study(), or NULL */
1637             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1638             &length);         /* where to put the data */             &length);         /* where to put the data */
1639    
1640         The  possible  values for the third argument are defined in pcre.h, and         The possible values for the third argument are defined in  pcre.h,  and
1641         are as follows:         are as follows:
1642    
1643           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1644    
1645         Return the number of the highest back reference  in  the  pattern.  The         Return  the  number  of  the highest back reference in the pattern. The
1646         fourth  argument  should  point to an int variable. Zero is returned if         fourth argument should point to an int variable. Zero  is  returned  if
1647         there are no back references.         there are no back references.
1648    
1649           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1650    
1651         Return the number of capturing subpatterns in the pattern.  The  fourth         Return  the  number of capturing subpatterns in the pattern. The fourth
1652         argument should point to an int variable.         argument should point to an int variable.
1653    
1654           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
1655    
1656         Return  a pointer to the internal default character tables within PCRE.         Return a pointer to the internal default character tables within  PCRE.
1657         The fourth argument should point to an unsigned char *  variable.  This         The  fourth  argument should point to an unsigned char * variable. This
1658         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1659         tion. External callers can cause PCRE to use  its  internal  tables  by         tion.  External  callers  can  cause PCRE to use its internal tables by
1660         passing a NULL table pointer.         passing a NULL table pointer.
1661    
1662           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1663    
1664         Return  information  about  the first byte of any matched string, for a         Return information about the first byte of any matched  string,  for  a
1665         non-anchored pattern. The fourth argument should point to an int  vari-         non-anchored  pattern. The fourth argument should point to an int vari-
1666         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
1667         is still recognized for backwards compatibility.)         is still recognized for backwards compatibility.)
1668    
1669         If there is a fixed first byte, for example, from  a  pattern  such  as         If  there  is  a  fixed first byte, for example, from a pattern such as
1670         (cat|cow|coyote), its value is returned. Otherwise, if either         (cat|cow|coyote), its value is returned. Otherwise, if either
1671    
1672         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1673         branch starts with "^", or         branch starts with "^", or
1674    
1675         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1676         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1677    
1678         -1  is  returned, indicating that the pattern matches only at the start         -1 is returned, indicating that the pattern matches only at  the  start
1679         of a subject string or after any newline within the  string.  Otherwise         of  a  subject string or after any newline within the string. Otherwise
1680         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
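
The three kinds of value that PCRE_INFO_FIRSTBYTE can yield may be easier to handle once they are mapped to named cases. The following is a small sketch; the enum and function names are hypothetical and are not part of the PCRE API.

```c
/* Hypothetical classification of the int obtained via PCRE_INFO_FIRSTBYTE. */
typedef enum {
    FIRST_FIXED_BYTE,     /* value >= 0: every match starts with this byte   */
    FIRST_AT_LINE_START,  /* value == -1: matches only at the subject start  */
                          /* or after a newline within the subject           */
    FIRST_NO_INFORMATION  /* value == -2: nothing recorded, or the pattern   */
                          /* is anchored                                     */
} first_byte_kind;

static first_byte_kind classify_first_byte(int value)
{
    if (value >= 0) return FIRST_FIXED_BYTE;
    if (value == -1) return FIRST_AT_LINE_START;
    return FIRST_NO_INFORMATION;
}
```

For the pattern (cat|cow|coyote) mentioned above, the value would be the byte 'c', which this sketch classifies as FIRST_FIXED_BYTE.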
1681    
1682           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1683    
1684         If  the pattern was studied, and this resulted in the construction of a         If the pattern was studied, and this resulted in the construction of  a
1685         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1686         matching  string, a pointer to the table is returned. Otherwise NULL is         matching string, a pointer to the table is returned. Otherwise NULL  is
1687         returned. The fourth argument should point to an unsigned char *  vari-         returned.  The fourth argument should point to an unsigned char * vari-
1688         able.         able.
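
A 256-bit table occupies 32 bytes, with one bit per possible byte value. The sketch below shows how such a table could be tested and used to skip to a candidate starting position. It assumes that the bit for byte value c is bit (c % 8) of table[c / 8]; that layout is an assumption of this example and should be checked against the PCRE source before relying on it. The function names are hypothetical.

```c
/* Test whether byte value c is marked in a 32-byte (256-bit) table.
   Assumed layout: bit (c % 8) of table[c / 8]. */
static int table_has_byte(const unsigned char table[32], unsigned char c)
{
    return (table[c / 8] >> (c % 8)) & 1;
}

/* Mark byte value c in the table (used here only to build test data). */
static void table_set_byte(unsigned char table[32], unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Return the first offset >= start whose byte is marked in the table,
   or -1 if there is none. A matcher can skip all earlier offsets,
   because no match can begin with an unmarked byte. */
static long first_candidate(const unsigned char table[32],
                            const char *subject, long length, long start)
{
    long i;
    for (i = start; i < length; i++)
        if (table_has_byte(table, (unsigned char)subject[i]))
            return i;
    return -1;
}
```

With a table in which only 'c' and 'd' are marked, scanning the subject "a bird, a cow" from offset 0 stops at offset 5 (the 'd'), and scanning from offset 6 stops at offset 10 (the 'c').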
1689    
1690           PCRE_INFO_HASCRORLF           PCRE_INFO_HASCRORLF
1691    
1692         Return  1  if  the  pattern  contains any explicit matches for CR or LF         Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
1693         characters, otherwise 0. The fourth argument should  point  to  an  int         characters,  otherwise  0.  The  fourth argument should point to an int
1694         variable.  An explicit match is either a literal CR or LF character, or         variable. An explicit match is either a literal CR or LF character,  or
1695         \r or \n.         \r or \n.
1696    
1697           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
1698    
1699         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,         Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
1700         otherwise  0. The fourth argument should point to an int variable. (?J)         otherwise 0. The fourth argument should point to an int variable.  (?J)
1701         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1702    
1703             PCRE_INFO_JIT
1704    
1705           Return  1  if  the  pattern was studied with the PCRE_STUDY_JIT_COMPILE
1706           option, and just-in-time compiling was successful. The fourth  argument
1707           should  point  to  an  int variable. A return value of 0 means that JIT
1708           support is not available in this version of PCRE, or that  the  pattern
1709           was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT
1710           compiler could not handle this particular pattern. See the pcrejit doc-
1711           umentation for details of what can and cannot be handled.
1712    
1713           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1714    
1715         Return the value of the rightmost literal byte that must exist  in  any         Return  the  value of the rightmost literal byte that must exist in any
1716         matched  string,  other  than  at  its  start,  if such a byte has been         matched string, other than at its  start,  if  such  a  byte  has  been
1717         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1718         is  no such byte, -1 is returned. For anchored patterns, a last literal         is no such byte, -1 is returned. For anchored patterns, a last  literal
1719         byte is recorded only if it follows something of variable  length.  For         byte  is  recorded only if it follows something of variable length. For
1720         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1721         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1722    
1723           PCRE_INFO_MINLENGTH           PCRE_INFO_MINLENGTH
1724    
1725         If the pattern was studied and a minimum length  for  matching  subject         If  the  pattern  was studied and a minimum length for matching subject
1726         strings  was  computed,  its  value is returned. Otherwise the returned         strings was computed, its value is  returned.  Otherwise  the  returned
1727         value is -1. The value is a number of characters, not bytes  (this  may         value  is  -1. The value is a number of characters, not bytes (this may
1728         be  relevant in UTF-8 mode). The fourth argument should point to an int         be relevant in UTF-8 mode). The fourth argument should point to an  int
1729         variable. A non-negative value is a lower bound to the  length  of  any         variable.  A  non-negative  value is a lower bound to the length of any
1730         matching  string.  There  may not be any strings of that length that do         matching string. There may not be any strings of that  length  that  do
1731         actually match, but every string that does match is at least that long.         actually match, but every string that does match is at least that long.
1732    
1733           PCRE_INFO_NAMECOUNT           PCRE_INFO_NAMECOUNT
1734           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1735           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1736    
1737         PCRE supports the use of named as well as numbered capturing  parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1738         ses.  The names are just an additional way of identifying the parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1739         ses, which still acquire numbers. Several convenience functions such as         ses, which still acquire numbers. Several convenience functions such as
1740         pcre_get_named_substring()  are  provided  for extracting captured sub-         pcre_get_named_substring() are provided for  extracting  captured  sub-
1741         strings by name. It is also possible to extract the data  directly,  by         strings  by  name. It is also possible to extract the data directly, by
1742         first  converting  the  name to a number in order to access the correct         first converting the name to a number in order to  access  the  correct
1743         pointers in the output vector (described with pcre_exec() below). To do         pointers in the output vector (described with pcre_exec() below). To do
1744         the  conversion,  you  need  to  use  the  name-to-number map, which is         the conversion, you need  to  use  the  name-to-number  map,  which  is
1745         described by these three values.         described by these three values.
1746    
1747         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1748         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1749         of each entry; both of these  return  an  int  value.  The  entry  size         of  each  entry;  both  of  these  return  an int value. The entry size
1750         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1751         a pointer to the first entry of the table  (a  pointer  to  char).  The         a  pointer  to  the  first  entry of the table (a pointer to char). The
1752         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1753         sis, most significant byte first. The rest of the entry is  the  corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1754         sponding name, zero terminated.         sponding name, zero terminated.
1755    
1756         The  names are in alphabetical order. Duplicate names may appear if (?|         The names are in alphabetical order. Duplicate names may appear if  (?|
1757         is used to create multiple groups with the same number, as described in         is used to create multiple groups with the same number, as described in
1758         the  section  on  duplicate subpattern numbers in the pcrepattern page.         the section on duplicate subpattern numbers in  the  pcrepattern  page.
1759         Duplicate names for subpatterns with different  numbers  are  permitted         Duplicate  names  for  subpatterns with different numbers are permitted
1760         only  if  PCRE_DUPNAMES  is  set. In all cases of duplicate names, they         only if PCRE_DUPNAMES is set. In all cases  of  duplicate  names,  they
1761         appear in the table in the order in which they were found in  the  pat-         appear  in  the table in the order in which they were found in the pat-
1762         tern.  In  the  absence  of (?| this is the order of increasing number;         tern. In the absence of (?| this is the  order  of  increasing  number;
1763         when (?| is used this is not necessarily the case because later subpat-         when (?| is used this is not necessarily the case because later subpat-
1764         terns may have lower numbers.         terns may have lower numbers.
1765    
1766         As  a  simple  example of the name/number table, consider the following         As a simple example of the name/number table,  consider  the  following
1767         pattern (assume PCRE_EXTENDED is set, so white space -  including  new-         pattern  (assume  PCRE_EXTENDED is set, so white space - including new-
1768         lines - is ignored):         lines - is ignored):
1769    
1770           (?<date> (?<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1771           (?<month>\d\d) - (?<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1772    
1773         There  are  four  named subpatterns, so the table has four entries, and         There are four named subpatterns, so the table has  four  entries,  and
1774         each entry in the table is eight bytes long. The table is  as  follows,         each  entry  in the table is eight bytes long. The table is as follows,
1775         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1776         as ??:         as ??:
1777    
# Line 1840  INFORMATION ABOUT A PATTERN Line 1780  INFORMATION ABOUT A PATTERN
1780           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1781           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1782    
1783         When writing code to extract data  from  named  subpatterns  using  the         When  writing  code  to  extract  data from named subpatterns using the
1784         name-to-number  map,  remember that the length of the entries is likely         name-to-number map, remember that the length of the entries  is  likely
1785         to be different for each compiled pattern.         to be different for each compiled pattern.
1786    
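The lookup described above can be sketched as follows. This is only an
illustration, not code from PCRE: the function name name_to_number() and the
array example_table are hypothetical. The table is written out by hand for the
date pattern above, with its first two rows derived from the pattern's group
numbering and the undefined ?? bytes filled with zero. In a real program the
table pointer, entry count, and entry size would come from pcre_fullinfo()
with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan a name table laid out as described above
   (two bytes of group number, most significant byte first, followed by a
   zero-terminated name) and return the group number for `name`, or -1 if
   the name is not in the table. With duplicate names, the first entry wins. */
static int name_to_number(const unsigned char *table, int name_count,
                          int entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size >= 3);
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built table for the date pattern: four entries of eight bytes each.
   Bytes shown as ?? in the text are set to zero here (an assumption). */
static const unsigned char example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

The entry size is passed in rather than hard-coded because, as noted above, it
is likely to differ from one compiled pattern to the next.
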
1787           PCRE_INFO_OKPARTIAL           PCRE_INFO_OKPARTIAL
1788    
1789         Return 1  if  the  pattern  can  be  used  for  partial  matching  with         Return  1  if  the  pattern  can  be  used  for  partial  matching with
1790         pcre_exec(),  otherwise  0.  The fourth argument should point to an int         pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
1791         variable. From  release  8.00,  this  always  returns  1,  because  the         variable.  From  release  8.00,  this  always  returns  1,  because the
1792         restrictions  that  previously  applied  to  partial matching have been         restrictions that previously applied  to  partial  matching  have  been
1793         lifted. The pcrepartial documentation gives details of  partial  match-         lifted.  The  pcrepartial documentation gives details of partial match-
1794         ing.         ing.
1795    
1796           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1797    
1798         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1799         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1800         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1801         by any top-level option settings at the start of the pattern itself. In         by any top-level option settings at the start of the pattern itself. In
1802         other  words,  they are the options that will be in force when matching         other words, they are the options that will be in force  when  matching
1803         starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with         starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1804         the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,         the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1805         and PCRE_EXTENDED.         and PCRE_EXTENDED.
1806    
1807         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1808         alternatives begin with one of the following:         alternatives begin with one of the following:
1809    
1810           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1878  INFORMATION ABOUT A PATTERN Line 1818  INFORMATION ABOUT A PATTERN
1818    
1819           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1820    
1821         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1822         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1823         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1824         size_t variable.         size_t variable.
# Line 1886  INFORMATION ABOUT A PATTERN Line 1826  INFORMATION ABOUT A PATTERN
1826           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1827    
1828         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1829         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block. If pcre_extra is NULL, or there is no study data,
1830         pcre_malloc() when PCRE was getting memory into which to place the data         zero is returned. The fourth argument should point to  a  size_t  vari-
1831         created by pcre_study(). If pcre_extra is NULL, or there  is  no  study         able.   The  study_data field is set by pcre_study() to record informa-
1832         data,  zero  is  returned. The fourth argument should point to a size_t         tion that will speed up matching (see the section entitled "Studying  a
1833         variable.         pattern" above). The format of the study_data block is private, but its
1834           length is made available via this option so that it can  be  saved  and
1835           restored (see the pcreprecompile documentation for details).
1836    
1837    
1838  OBSOLETE INFO FUNCTION  OBSOLETE INFO FUNCTION
1839    
1840         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1841    
1842         The pcre_info() function is now obsolete because its interface  is  too         The  pcre_info()  function is now obsolete because its interface is too
1843         restrictive  to return all the available data about a compiled pattern.         restrictive to return all the available data about a compiled  pattern.
1844         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
1845         pcre_info()  is the number of capturing subpatterns, or one of the fol-         pcre_info() is the number of capturing subpatterns, or one of the  fol-
1846         lowing negative numbers:         lowing negative numbers:
1847    
1848           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1849           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1850    
1851         If the optptr argument is not NULL, a copy of the  options  with  which         If  the  optptr  argument is not NULL, a copy of the options with which
1852         the  pattern  was  compiled  is placed in the integer it points to (see         the pattern was compiled is placed in the integer  it  points  to  (see
1853         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1854    
1855         If the pattern is not anchored and the  firstcharptr  argument  is  not         If  the  pattern  is  not anchored and the firstcharptr argument is not
1856         NULL,  it is used to pass back information about the first character of         NULL, it is used to pass back information about the first character  of
1857         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1858    
1859    
# Line 1919  REFERENCE COUNTS Line 1861  REFERENCE COUNTS
1861    
1862         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1863    
1864         The pcre_refcount() function is used to maintain a reference  count  in         The  pcre_refcount()  function is used to maintain a reference count in
1865         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1866         benefit of applications that  operate  in  an  object-oriented  manner,         benefit  of  applications  that  operate  in an object-oriented manner,
1867         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1868         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1869    
1870         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1871         zero.   It is changed only by calling this function, whose action is to         zero.  It is changed only by calling this function, whose action is  to
1872         add the adjust value (which may be positive or  negative)  to  it.  The         add  the  adjust  value  (which may be positive or negative) to it. The
1873         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1874         is constrained to lie between 0 and 65535, inclusive. If the new  value         is  constrained to lie between 0 and 65535, inclusive. If the new value
1875         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
1876    
1877         Except  when it is zero, the reference count is not correctly preserved         Except when it is zero, the reference count is not correctly  preserved
1878         if a pattern is compiled on one host and then  transferred  to  a  host         if  a  pattern  is  compiled on one host and then transferred to a host
1879         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1880    
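The clamping rule for the reference count can be illustrated with a small
stand-alone sketch. The function clamp_refcount() is hypothetical and is not
how PCRE stores or updates the count; it only models the arithmetic described
above, in which the adjusted value is forced into the range 0 to 65535.

```c
#include <assert.h>

/* Hypothetical model of the rule above: add `adjust` to the current count
   and force the result to lie between 0 and 65535 inclusive. */
static int clamp_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    long long updated = (long long)current + adjust;  /* avoid int overflow */
    if (updated < 0)
        updated = 0;
    if (updated > 65535)
        updated = 65535;
    return (int)updated;
}
```

One consequence of the clamping is that increments and decrements do not
always cancel: a count that has been forced to a limit no longer records how
many adjustments were made.
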
1881    
# Line 1943  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1885  MATCHING A PATTERN: THE TRADITIONAL FUNC
1885              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1886              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1887    
1888         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1889         compiled pattern, which is passed in the code argument. If the  pattern         compiled  pattern, which is passed in the code argument. If the pattern
1890         was  studied,  the  result  of  the study should be passed in the extra         was studied, the result of the study should  be  passed  in  the  extra
1891         argument. This function is the main matching facility of  the  library,         argument.  You  can call pcre_exec() with the same code and extra argu-
1892         and it operates in a Perl-like manner. For specialist use there is also         ments as many times as you like, in order to  match  different  subject
1893         an alternative matching function, which is described below in the  sec-         strings with the same pattern.
1894         tion about the pcre_dfa_exec() function.  
1895           This  function  is  the  main  matching facility of the library, and it
1896           operates in a Perl-like manner. For specialist use  there  is  also  an
1897           alternative  matching function, which is described below in the section
1898           about the pcre_dfa_exec() function.
1899    
1900         In  most applications, the pattern will have been compiled (and option-         In most applications, the pattern will have been compiled (and  option-
1901         ally studied) in the same process that calls pcre_exec().  However,  it         ally  studied)  in the same process that calls pcre_exec(). However, it
1902         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1903         later in different processes, possibly even on different hosts.  For  a         later  in  different processes, possibly even on different hosts. For a
1904         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1905    
1906         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1973  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1919  MATCHING A PATTERN: THE TRADITIONAL FUNC
1919    
1920     Extra data for pcre_exec()     Extra data for pcre_exec()
1921    
1922         If  the  extra argument is not NULL, it must point to a pcre_extra data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1923         block. The pcre_study() function returns such a block (when it  doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1924         return  NULL), but you can also create one for yourself, and pass addi-         return NULL), but you can also create one for yourself, and pass  addi-
1925         tional information in it. The pcre_extra block contains  the  following         tional  information  in it. The pcre_extra block contains the following
1926         fields (not necessarily in this order):         fields (not necessarily in this order):
1927    
1928           unsigned long int flags;           unsigned long int flags;
1929           void *study_data;           void *study_data;
1930             void *executable_jit;
1931           unsigned long int match_limit;           unsigned long int match_limit;
1932           unsigned long int match_limit_recursion;           unsigned long int match_limit_recursion;
1933           void *callout_data;           void *callout_data;
1934           const unsigned char *tables;           const unsigned char *tables;
1935           unsigned char **mark;           unsigned char **mark;
1936    
1937         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1938         are set. The flag bits are:         are set. The flag bits are:
1939    
1940           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1941             PCRE_EXTRA_EXECUTABLE_JIT
1942           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1943           PCRE_EXTRA_MATCH_LIMIT_RECURSION           PCRE_EXTRA_MATCH_LIMIT_RECURSION
1944           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1945           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1946           PCRE_EXTRA_MARK           PCRE_EXTRA_MARK
1947    
1948         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field and some-
1949         the  pcre_extra  block  that is returned by pcre_study(), together with         times the executable_jit field are set in the pcre_extra block that  is
1950         the appropriate flag bit. You should not set this yourself, but you may         returned  by pcre_study(), together with the appropriate flag bits. You
1951         add  to  the  block by setting the other fields and their corresponding         should not set these yourself, but you may add to the block by  setting
1952         flag bits.         the other fields and their corresponding flag bits.
1953    
1954         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1955         a  vast amount of resources when running patterns that are not going to         a vast amount of resources when running patterns that are not going  to
1956         match, but which have a very large number  of  possibilities  in  their         match,  but  which  have  a very large number of possibilities in their
1957         search  trees. The classic example is a pattern that uses nested unlim-         search trees. The classic example is a pattern that uses nested  unlim-
1958         ited repeats.         ited repeats.
1959    
1960         Internally, PCRE uses a function called match() which it calls  repeat-         Internally,  pcre_exec() uses a function called match(), which it calls
1961         edly  (sometimes  recursively). The limit set by match_limit is imposed         repeatedly (sometimes recursively). The limit  set  by  match_limit  is
1962         on the number of times this function is called during  a  match,  which         imposed  on the number of times this function is called during a match,
1963         has  the  effect  of  limiting the amount of backtracking that can take         which has the effect of limiting the amount of  backtracking  that  can
1964         place. For patterns that are not anchored, the count restarts from zero         take place. For patterns that are not anchored, the count restarts from
1965         for each position in the subject string.         zero for each position in the subject string.
1966    
1967           When pcre_exec() is called with a pattern that was successfully studied
1968           with  the  PCRE_STUDY_JIT_COMPILE  option, the way that the matching is
1969           executed is entirely different. However, there is still the possibility
1970           of  runaway  matching  that  goes  on  for a very long time, and so the
1971           match_limit value is also used in this case (but in a different way) to
1972           limit how long the matching can continue.
1973    
1974         The  default  value  for  the  limit can be set when PCRE is built; the         The  default  value  for  the  limit can be set when PCRE is built; the
1975         default default is 10 million, which handles all but the  most  extreme         default default is 10 million, which handles all but the  most  extreme
# Line 2029  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1984  MATCHING A PATTERN: THE TRADITIONAL FUNC
1984         the  total number of calls, because not all calls to match() are recur-         the  total number of calls, because not all calls to match() are recur-
1985         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1986    
1987         Limiting the recursion depth limits the amount of  stack  that  can  be         Limiting the recursion depth limits the amount of  machine  stack  that
1988         used, or, when PCRE has been compiled to use memory on the heap instead         can  be used, or, when PCRE has been compiled to use memory on the heap
1989         of the stack, the amount of heap memory that can be used.         instead of the stack, the amount of heap memory that can be used.  This
1990           limit  is not relevant, and is ignored, if the pattern was successfully
1991           studied with PCRE_STUDY_JIT_COMPILE.
1992    
1993         The default value for match_limit_recursion can be  set  when  PCRE  is         The default value for match_limit_recursion can be  set  when  PCRE  is
1994         built;  the  default  default  is  the  same  value  as the default for         built;  the  default  default  is  the  same  value  as the default for
# Line 2074  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2031  MATCHING A PATTERN: THE TRADITIONAL FUNC
2031         PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,   and         PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,   and
2032         PCRE_PARTIAL_HARD.         PCRE_PARTIAL_HARD.
2033    
2034           If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
2035           option,  the   only   supported   options   for   JIT   execution   are
2036           PCRE_NO_UTF8_CHECK,   PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,  and
2037           PCRE_NOTEMPTY_ATSTART. Note in particular that partial matching is  not
2038           supported.  If an unsupported option is used, JIT execution is disabled
2039           and the normal interpretive code in pcre_exec() is run.
2040    
2041           PCRE_ANCHORED           PCRE_ANCHORED
2042    
2043         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
2044         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
2045         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
2046         unanchored at matching time.         unanchored at matching time.
2047    
2048           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
2049           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
2050    
2051         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
2052         sequence  matches.  The choice is either to match only CR, LF, or CRLF,         sequence matches. The choice is either to match only CR, LF,  or  CRLF,
2053         or to match any Unicode newline sequence. These  options  override  the         or  to  match  any Unicode newline sequence. These options override the
2054         choice that was made or defaulted when the pattern was compiled.         choice that was made or defaulted when the pattern was compiled.
2055    
2056           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 2095  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2059  MATCHING A PATTERN: THE TRADITIONAL FUNC
2059           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
2060           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
2061    
2062         These  options  override  the  newline  definition  that  was chosen or         These options override  the  newline  definition  that  was  chosen  or
2063         defaulted when the pattern was compiled. For details, see the  descrip-         defaulted  when the pattern was compiled. For details, see the descrip-
2064         tion  of  pcre_compile()  above.  During  matching,  the newline choice         tion of pcre_compile()  above.  During  matching,  the  newline  choice
2065         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
2066         ters.  It may also alter the way the match position is advanced after a         ters. It may also alter the way the match position is advanced after  a
2067         match failure for an unanchored pattern.         match failure for an unanchored pattern.
2068    
2069         When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY  is         When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
2070         set,  and a match attempt for an unanchored pattern fails when the cur-         set, and a match attempt for an unanchored pattern fails when the  cur-
2071         rent position is at a  CRLF  sequence,  and  the  pattern  contains  no         rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
2072         explicit  matches  for  CR  or  LF  characters,  the  match position is         explicit matches for  CR  or  LF  characters,  the  match  position  is
2073         advanced by two characters instead of one, in other words, to after the         advanced by two characters instead of one, in other words, to after the
2074         CRLF.         CRLF.
2075    
2076         The above rule is a compromise that makes the most common cases work as         The above rule is a compromise that makes the most common cases work as
2077         expected. For example, if the  pattern  is  .+A  (and  the  PCRE_DOTALL         expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
2078         option is not set), it does not match the string "\r\nA" because, after         option is not set), it does not match the string "\r\nA" because, after
2079         failing at the start, it skips both the CR and the LF before  retrying.         failing  at the start, it skips both the CR and the LF before retrying.
2080         However,  the  pattern  [\r\n]A does match that string, because it con-         However, the pattern [\r\n]A does match that string,  because  it  con-
2081         tains an explicit CR or LF reference, and so advances only by one char-         tains an explicit CR or LF reference, and so advances only by one char-
2082         acter after the first failure.         acter after the first failure.
2083    
2084         An explicit match for CR or LF is either a literal appearance of one of         An explicit match for CR or LF is either a literal appearance of one of
2085         those characters, or one of the \r or  \n  escape  sequences.  Implicit         those  characters,  or  one  of the \r or \n escape sequences. Implicit
2086         matches  such  as [^X] do not count, nor does \s (which includes CR and         matches such as [^X] do not count, nor does \s (which includes  CR  and
2087         LF in the characters that it matches).         LF in the characters that it matches).
2088    
2089         Notwithstanding the above, anomalous effects may still occur when  CRLF         Notwithstanding  the above, anomalous effects may still occur when CRLF
2090         is a valid newline sequence and explicit \r or \n escapes appear in the         is a valid newline sequence and explicit \r or \n escapes appear in the
2091         pattern.         pattern.
2092    
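The advance rule for CRLF described above can be sketched as a small decision
function. This is an illustration only, not PCRE's internal code: the name
next_start_offset() and its two flag parameters are hypothetical, and the
sketch assumes single-byte characters (in UTF-8 mode the ordinary advance is
one character, which may be more than one byte).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: given that a match attempt for an unanchored pattern
   failed at offset `pos`, return the offset of the next attempt.
   crlf_is_newline is non-zero when CRLF is a valid newline sequence
   (PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY).
   pattern_has_explicit_cr_or_lf is non-zero when the pattern contains a
   literal CR or LF, or a \r or \n escape. */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t pos, int crlf_is_newline,
                                int pattern_has_explicit_cr_or_lf)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && !pattern_has_explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* skip the whole CRLF */
    return pos + 1;       /* ordinary advance by one character */
}
```

For the subject "\r\nA", a pattern such as .+A takes the first branch and
retries at offset 2, whereas [\r\n]A contains an explicit CR or LF reference
and retries at offset 1, where it matches.
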
2093           PCRE_NOTBOL           PCRE_NOTBOL
2094    
2095         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
2096         the  beginning  of  a  line, so the circumflex metacharacter should not         the beginning of a line, so the  circumflex  metacharacter  should  not
2097         match before it. Setting this without PCRE_MULTILINE (at compile  time)         match  before it. Setting this without PCRE_MULTILINE (at compile time)
2098         causes  circumflex  never to match. This option affects only the behav-         causes circumflex never to match. This option affects only  the  behav-
2099         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
2100    
2101           PCRE_NOTEOL           PCRE_NOTEOL
2102    
2103         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
2104         of  a line, so the dollar metacharacter should not match it nor (except         of a line, so the dollar metacharacter should not match it nor  (except
2105         in multiline mode) a newline immediately before it. Setting this  with-         in  multiline mode) a newline immediately before it. Setting this with-
2106         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
2107         option affects only the behaviour of the dollar metacharacter. It  does         option  affects only the behaviour of the dollar metacharacter. It does
2108         not affect \Z or \z.         not affect \Z or \z.
2109    
2110           PCRE_NOTEMPTY           PCRE_NOTEMPTY
2111    
2112         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
2113         set. If there are alternatives in the pattern, they are tried.  If  all         set.  If  there are alternatives in the pattern, they are tried. If all
2114         the  alternatives  match  the empty string, the entire match fails. For         the alternatives match the empty string, the entire  match  fails.  For
2115         example, if the pattern         example, if the pattern
2116    
2117           a?b?           a?b?
2118    
2119         is applied to a string not beginning with "a" or  "b",  it  matches  an         is  applied  to  a  string not beginning with "a" or "b", it matches an
2120         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
2121         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
2122         rences of "a" or "b".         rences of "a" or "b".
2123    
2124           PCRE_NOTEMPTY_ATSTART           PCRE_NOTEMPTY_ATSTART
2125    
2126         This  is  like PCRE_NOTEMPTY, except that an empty string match that is         This is like PCRE_NOTEMPTY, except that an empty string match  that  is
2127         not at the start of  the  subject  is  permitted.  If  the  pattern  is         not  at  the  start  of  the  subject  is  permitted. If the pattern is
2128         anchored, such a match can occur only if the pattern contains \K.         anchored, such a match can occur only if the pattern contains \K.
2129    
2130         Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or         Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or
2131         PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern         PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern
2132         match  of  the empty string within its split() function, and when using         match of the empty string within its split() function, and  when  using
2133         the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after         the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after
2134         matching a null string by first trying the match again at the same off-         matching a null string by first trying the match again at the same off-
2135         set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that         set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
2136         fails, by advancing the starting offset (see below) and trying an ordi-         fails, by advancing the starting offset (see below) and trying an ordi-
2137         nary match again. There is some code that demonstrates how to  do  this         nary  match  again. There is some code that demonstrates how to do this
2138         in  the  pcredemo sample program. In the most general case, you have to         in the pcredemo sample program. In the most general case, you  have  to
2139         check to see if the newline convention recognizes CRLF  as  a  newline,         check  to  see  if the newline convention recognizes CRLF as a newline,
2140         and  if so, and the current character is CR followed by LF, advance the         and if so, and the current character is CR followed by LF, advance  the
2141         starting offset by two characters instead of one.         starting offset by two characters instead of one.
2142    
2143           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2144    
2145         There are a number of optimizations that pcre_exec() uses at the  start         There  are a number of optimizations that pcre_exec() uses at the start
2146         of  a  match,  in  order to speed up the process. For example, if it is         of a match, in order to speed up the process. For  example,  if  it  is
2147         known that an unanchored match must start with a specific character, it         known that an unanchored match must start with a specific character, it
2148         searches  the  subject  for that character, and fails immediately if it         searches the subject for that character, and fails  immediately  if  it
2149         cannot find it, without actually running the  main  matching  function.         cannot  find  it,  without actually running the main matching function.
2150         This means that a special item such as (*COMMIT) at the start of a pat-         This means that a special item such as (*COMMIT) at the start of a pat-
2151         tern is not considered until after a suitable starting  point  for  the         tern  is  not  considered until after a suitable starting point for the
2152         match  has been found. When callouts or (*MARK) items are in use, these         match has been found. When callouts or (*MARK) items are in use,  these
2153         "start-up" optimizations can cause them to be skipped if the pattern is         "start-up" optimizations can cause them to be skipped if the pattern is
2154         never  actually  used.  The start-up optimizations are in effect a pre-         never actually used. The start-up optimizations are in  effect  a  pre-
2155         scan of the subject that takes place before the pattern is run.         scan of the subject that takes place before the pattern is run.
2156    
2157         The PCRE_NO_START_OPTIMIZE option disables the start-up  optimizations,         The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
2158         possibly  causing  performance  to  suffer,  but ensuring that in cases         possibly causing performance to suffer,  but  ensuring  that  in  cases
2159         where the result is "no match", the callouts do occur, and  that  items         where  the  result is "no match", the callouts do occur, and that items
2160         such as (*COMMIT) and (*MARK) are considered at every possible starting         such as (*COMMIT) and (*MARK) are considered at every possible starting
2161         position in the subject string. If  PCRE_NO_START_OPTIMIZE  is  set  at         position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
2162         compile time, it cannot be unset at matching time.         compile time, it cannot be unset at matching time.
2163    
2164         Setting  PCRE_NO_START_OPTIMIZE  can  change  the outcome of a matching         Setting PCRE_NO_START_OPTIMIZE can change the  outcome  of  a  matching
2165         operation.  Consider the pattern         operation.  Consider the pattern
2166    
2167           (*COMMIT)ABC           (*COMMIT)ABC
2168    
2169         When this is compiled, PCRE records the fact that a  match  must  start         When  this  is  compiled, PCRE records the fact that a match must start
2170         with  the  character  "A".  Suppose the subject string is "DEFABC". The         with the character "A". Suppose the subject  string  is  "DEFABC".  The
2171         start-up optimization scans along the subject, finds "A" and  runs  the         start-up  optimization  scans along the subject, finds "A" and runs the
2172         first  match attempt from there. The (*COMMIT) item means that the pat-         first match attempt from there. The (*COMMIT) item means that the  pat-
2173         tern must match the current starting position, which in this  case,  it         tern  must  match the current starting position, which in this case, it
2174         does.  However,  if  the  same match is run with PCRE_NO_START_OPTIMIZE         does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE
2175         set, the initial scan along the subject string  does  not  happen.  The         set,  the  initial  scan  along the subject string does not happen. The
2176         first  match  attempt  is  run  starting  from "D" and when this fails,         first match attempt is run starting  from  "D"  and  when  this  fails,
2177         (*COMMIT) prevents any further matches  being  tried,  so  the  overall         (*COMMIT)  prevents  any  further  matches  being tried, so the overall
2178         result  is  "no  match". If the pattern is studied, more start-up opti-         result is "no match". If the pattern is studied,  more  start-up  opti-
2179         mizations may be used. For example, a minimum length  for  the  subject         mizations  may  be  used. For example, a minimum length for the subject
2180         may be recorded. Consider the pattern         may be recorded. Consider the pattern
2181    
2182           (*MARK:A)(X|Y)           (*MARK:A)(X|Y)
2183    
2184         The  minimum  length  for  a  match is one character. If the subject is         The minimum length for a match is one  character.  If  the  subject  is
2185         "ABC", there will be attempts to  match  "ABC",  "BC",  "C",  and  then         "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then
2186         finally  an empty string.  If the pattern is studied, the final attempt         finally an empty string.  If the pattern is studied, the final  attempt
2187         does not take place, because PCRE knows that the subject is too  short,         does  not take place, because PCRE knows that the subject is too short,
2188         and  so  the  (*MARK) is never encountered.  In this case, studying the         and so the (*MARK) is never encountered.  In this  case,  studying  the
2189         pattern does not affect the overall match result, which  is  still  "no         pattern  does  not  affect the overall match result, which is still "no
2190         match", but it does affect the auxiliary information that is returned.         match", but it does affect the auxiliary information that is returned.
2191    
2192           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2193    
2194         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
2195         UTF-8 string is automatically checked when pcre_exec() is  subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
2196         called.   The  value  of  startoffset is also checked to ensure that it         called.  The value of startoffset is also checked  to  ensure  that  it
2197         points to the start of a UTF-8 character. There is a  discussion  about         points  to  the start of a UTF-8 character. There is a discussion about
2198         the  validity  of  UTF-8 strings in the section on UTF-8 support in the         the validity of UTF-8 strings in the section on UTF-8  support  in  the
2199         main pcre page. If  an  invalid  UTF-8  sequence  of  bytes  is  found,         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2200         pcre_exec()  returns  the  error  PCRE_ERROR_BADUTF8  or,  if PCRE_PAR-         pcre_exec() returns  the  error  PCRE_ERROR_BADUTF8  or,  if  PCRE_PAR-
2201         TIAL_HARD is set and the problem is a truncated UTF-8 character at  the         TIAL_HARD  is set and the problem is a truncated UTF-8 character at the
2202         end  of  the  subject, PCRE_ERROR_SHORTUTF8. In both cases, information         end of the subject, PCRE_ERROR_SHORTUTF8. In  both  cases,  information
2203         about the precise nature of the error may also  be  returned  (see  the         about  the  precise  nature  of the error may also be returned (see the
2204         descriptions  of these errors in the section entitled Error return val-         descriptions of these errors in the section entitled Error return  val-
2205         ues from pcre_exec() below).  If startoffset contains a value that does         ues from pcre_exec() below).  If startoffset contains a value that does
2206         not  point to the start of a UTF-8 character (or to the end of the sub-         not point to the start of a UTF-8 character (or to the end of the  sub-
2207         ject), PCRE_ERROR_BADUTF8_OFFSET is returned.         ject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2208    
2209         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
2210         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
2211         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2212         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
2213         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
2214         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
2215         points to the start of a UTF-8 character (or the end of  the  subject).         points  to  the start of a UTF-8 character (or the end of the subject).
2216         When  PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8         When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid  UTF-8
2217         string as a subject or an invalid value of  startoffset  is  undefined.         string  as  a  subject or an invalid value of startoffset is undefined.
2218         Your program may crash.         Your program may crash.
2219    
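The startoffset requirement described above can be checked cheaply before PCRE_NO_UTF8_CHECK is set. The following is a minimal sketch, assuming the caller only wants to confirm that the offset does not land on a UTF-8 continuation byte; offset_is_char_start() is a hypothetical helper, not a PCRE function, and it does not validate the subject as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE). Returns 1 if startoffset is the
   end of the subject or points at a byte that is not a UTF-8 continuation
   byte (binary 10xxxxxx); returns 0 otherwise, including for offsets that
   are negative or beyond the end of the subject. */
int offset_is_char_start(const char *subject, int length, int startoffset)
{
  assert(subject != NULL && length >= 0);
  if (startoffset < 0 || startoffset > length) return 0;
  if (startoffset == length) return 1;
  return (((unsigned char)subject[startoffset]) & 0xC0) != 0x80;
}
```

A caller might run this test once per call and fall back to a checked pcre_exec() call whenever it returns 0.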
2220           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2221           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2222    
2223         These  options turn on the partial matching feature. For backwards com-         These options turn on the partial matching feature. For backwards  com-
2224         patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2225         match  occurs if the end of the subject string is reached successfully,         match occurs if the end of the subject string is reached  successfully,
2226         but there are not enough subject characters to complete the  match.  If         but  there  are not enough subject characters to complete the match. If
2227         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2228         matching continues by testing any remaining alternatives.  Only  if  no         matching  continues  by  testing any remaining alternatives. Only if no
2229         complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of         complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
2230         PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the         PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
2231         caller  is  prepared to handle a partial match, but only if no complete         caller is prepared to handle a partial match, but only if  no  complete
2232         match can be found.         match can be found.
2233    
2234         If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this         If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
2235         case,  if  a  partial  match  is found, pcre_exec() immediately returns         case, if a partial match  is  found,  pcre_exec()  immediately  returns
2236         PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In         PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
2237         other  words, when PCRE_PARTIAL_HARD is set, a partial match is         other words, when PCRE_PARTIAL_HARD is set, a partial match is
2238         considered to be more important than an alternative complete match.         considered to be more important than an alternative complete match.
2239    
2240         In both cases, the portion of the string that was  inspected  when  the         In  both  cases,  the portion of the string that was inspected when the
2241         partial match was found is set as the first matching string. There is a         partial match was found is set as the first matching string. There is a
2242         more detailed discussion of partial and  multi-segment  matching,  with         more  detailed  discussion  of partial and multi-segment matching, with
2243         examples, in the pcrepartial documentation.         examples, in the pcrepartial documentation.
2244    
2245     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2246    
2247         The  subject string is passed to pcre_exec() as a pointer in subject, a         The subject string is passed to pcre_exec() as a pointer in subject,  a
2248         length (in bytes) in length, and a starting byte offset in startoffset.         length (in bytes) in length, and a starting byte offset in startoffset.
2249         If  this  is  negative  or  greater  than  the  length  of the subject,         If this is  negative  or  greater  than  the  length  of  the  subject,
2250         pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is         pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is
2251         zero,  the  search  for a match starts at the beginning of the subject,         zero, the search for a match starts at the beginning  of  the  subject,
2252         and this is by far the most common case. In UTF-8 mode, the byte offset         and this is by far the most common case. In UTF-8 mode, the byte offset
2253         must  point  to  the start of a UTF-8 character (or the end of the sub-         must point to the start of a UTF-8 character (or the end  of  the  sub-
2254         ject). Unlike the pattern string, the subject may contain  binary  zero         ject).  Unlike  the pattern string, the subject may contain binary zero
2255         bytes.         bytes.
2256    
2257         A  non-zero  starting offset is useful when searching for another match         A non-zero starting offset is useful when searching for  another  match
2258         in the same subject by calling pcre_exec() again after a previous  suc-         in  the same subject by calling pcre_exec() again after a previous suc-
2259         cess.   Setting  startoffset differs from just passing over a shortened         cess.  Setting startoffset differs from just passing over  a  shortened
2260         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
2261         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2262    
2263           \Biss\B           \Biss\B
2264    
2265         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
2266         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
2267         When  applied  to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call  to  pcre_exec()
2268         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
2269         the  remainder  of  the  subject,  namely  "issippi", it does not match,         the remainder of the subject,  namely  "issippi",  it  does  not  match,
2270         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2271         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
2272         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
2273         rence  of "iss" because it is able to look behind the starting point to         rence of "iss" because it is able to look behind the starting point  to
2274         discover that it is preceded by a letter.         discover that it is preceded by a letter.
2275    
2276         Finding all the matches in a subject is tricky  when  the  pattern  can         Finding  all  the  matches  in a subject is tricky when the pattern can
2277         match an empty string. It is possible to emulate Perl's /g behaviour by         match an empty string. It is possible to emulate Perl's /g behaviour by
2278         first  trying  the  match  again  at  the   same   offset,   with   the         first   trying   the   match   again  at  the  same  offset,  with  the
2279         PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that         PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
2280         fails, advancing the starting  offset  and  trying  an  ordinary  match         fails,  advancing  the  starting  offset  and  trying an ordinary match
2281         again. There is some code that demonstrates how to do this in the pcre-         again. There is some code that demonstrates how to do this in the pcre-
2282         demo sample program. In the most general case, you have to check to see         demo sample program. In the most general case, you have to check to see
2283         if  the newline convention recognizes CRLF as a newline, and if so, and         if the newline convention recognizes CRLF as a newline, and if so,  and
2284         the current character is CR followed by LF, advance the starting offset         the current character is CR followed by LF, advance the starting offset
2285         by two characters instead of one.         by two characters instead of one.
2286    
2287         If  a  non-zero starting offset is passed when the pattern is anchored,         If a non-zero starting offset is passed when the pattern  is  anchored,
2288         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
2289         if  the  pattern  does  not require the match to be at the start of the         if the pattern does not require the match to be at  the  start  of  the
2290         subject.         subject.
2291    
2292     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
2293    
2294         In general, a pattern matches a certain portion of the subject, and  in         In  general, a pattern matches a certain portion of the subject, and in
2295         addition,  further  substrings  from  the  subject may be picked out by         addition, further substrings from the subject  may  be  picked  out  by
2296         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
2297         this  is  called "capturing" in what follows, and the phrase "capturing         this is called "capturing" in what follows, and the  phrase  "capturing
2298         subpattern" is used for a fragment of a pattern that picks out  a  sub-         subpattern"  is  used for a fragment of a pattern that picks out a sub-
2299         string.  PCRE  supports several other kinds of parenthesized subpattern         string. PCRE supports several other kinds of  parenthesized  subpattern
2300         that do not cause substrings to be captured.         that do not cause substrings to be captured.
2301    
2302         Captured substrings are returned to the caller via a vector of integers         Captured substrings are returned to the caller via a vector of integers
2303         whose  address is passed in ovector. The number of elements in the vec-         whose address is passed in ovector. The number of elements in the  vec-
2304         tor is passed in ovecsize, which must be a non-negative  number.  Note:         tor  is  passed in ovecsize, which must be a non-negative number. Note:
2305         this argument is NOT the size of ovector in bytes.         this argument is NOT the size of ovector in bytes.
2306    
2307         The  first  two-thirds of the vector is used to pass back captured sub-         The first two-thirds of the vector is used to pass back  captured  sub-
2308         strings, each substring using a pair of integers. The  remaining  third         strings,  each  substring using a pair of integers. The remaining third
2309         of  the  vector is used as workspace by pcre_exec() while matching cap-         of the vector is used as workspace by pcre_exec() while  matching  cap-
2310         turing subpatterns, and is not available for passing back  information.         turing  subpatterns, and is not available for passing back information.
2311         The  number passed in ovecsize should always be a multiple of three. If         The number passed in ovecsize should always be a multiple of three.  If
2312         it is not, it is rounded down.         it is not, it is rounded down.
2313    
2314         When a match is successful, information about  captured  substrings  is         When  a  match  is successful, information about captured substrings is
2315         returned  in  pairs  of integers, starting at the beginning of ovector,         returned in pairs of integers, starting at the  beginning  of  ovector,
2316         and continuing up to two-thirds of its length at the  most.  The  first         and  continuing  up  to two-thirds of its length at the most. The first
2317         element  of  each pair is set to the byte offset of the first character         element of each pair is set to the byte offset of the  first  character
2318         in a substring, and the second is set to the byte offset of  the  first         in  a  substring, and the second is set to the byte offset of the first
2319         character  after  the end of a substring. Note: these values are always         character after the end of a substring. Note: these values  are  always
2320         byte offsets, even in UTF-8 mode. They are not character counts.         byte offsets, even in UTF-8 mode. They are not character counts.
2321    
2322         The first pair of integers, ovector[0]  and  ovector[1],  identify  the         The  first  pair  of  integers, ovector[0] and ovector[1], identify the
2323         portion  of  the subject string matched by the entire pattern. The next         portion of the subject string matched by the entire pattern.  The  next
2324         pair is used for the first capturing subpattern, and so on.  The  value         pair  is  used for the first capturing subpattern, and so on. The value
2325         returned by pcre_exec() is one more than the highest numbered pair that         returned by pcre_exec() is one more than the highest numbered pair that
2326         has been set.  For example, if two substrings have been  captured,  the         has  been  set.  For example, if two substrings have been captured, the
2327         returned  value is 3. If there are no capturing subpatterns, the return         returned value is 3. If there are no capturing subpatterns, the  return
2328         value from a successful match is 1, indicating that just the first pair         value from a successful match is 1, indicating that just the first pair
2329         of offsets has been set.         of offsets has been set.
2330    
2331         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
2332         of the string that it matched that is returned.         of the string that it matched that is returned.
2333    
2334         If the vector is too small to hold all the captured substring  offsets,         If  the vector is too small to hold all the captured substring offsets,
2335         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
2336         function returns a value of zero. If the substring offsets are  not  of         function  returns a value of zero. If neither the actual string matched
2337         interest,  pcre_exec()  may  be  called with ovector passed as NULL and         nor any captured substrings are of interest, pcre_exec() may be  called
2338         ovecsize as zero. However, if the pattern contains back references  and         with  ovector passed as NULL and ovecsize as zero. However, if the pat-
2339         the  ovector is not big enough to remember the related substrings, PCRE         tern contains back references and the ovector  is  not  big  enough  to
2340         has to get additional memory for use during matching. Thus it  is  usu-         remember  the related substrings, PCRE has to get additional memory for
2341         ally advisable to supply an ovector.         use during matching. Thus it is usually advisable to supply an  ovector
2342           of reasonable size.
2343    
2344           There  are  some  cases where zero is returned (indicating vector over-
2345           flow) when in fact the vector is exactly the right size for  the  final
2346           match. For example, consider the pattern
2347    
2348             (a)(?:(b)c|bd)
2349    
2350           If  a  vector of 6 elements (allowing for only 1 captured substring) is
2351           given with subject string "abd", pcre_exec() will try to set the second
2352           captured string, thereby recording a vector overflow, before failing to
2353           match "c" and backing up  to  try  the  second  alternative.  The  zero
2354           return,  however,  does  correctly  indicate that the maximum number of
2355           slots (namely 2) have been filled. In similar cases where there is tem-
2356           porary  overflow,  but  the final number of used slots is actually less
2357           than the maximum, a non-zero value is returned.
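
The zero return described above can be folded into ordinary result handling. The following is a minimal sketch, not code from PCRE: the helper name usable_pairs is hypothetical, and it assumes only what the text states, namely that a zero return means the vector was used up to two-thirds of its length.

```c
/* Hypothetical helper: how many offset pairs are valid after a
   pcre_exec()-style return value rc, given the vector length ovecsize.
   rc > 0  : pairs 0..rc-1 were set.
   rc == 0 : the vector overflowed; per the text it was used as far as
             possible, up to two-thirds of its length, i.e. ovecsize/3 pairs.
   rc < 0  : an error code, so no pairs are treated as valid. */
int usable_pairs(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return 0;
}
```

For the 6-element vector in the example above, a zero return yields 2 usable pairs, which agrees with the statement that the maximum number of slots (namely 2) has been filled.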
2358    
2359         The pcre_fullinfo() function can be used to find out how many capturing         The pcre_fullinfo() function can be used to find out how many capturing
2360         subpatterns there are in a compiled  pattern.  The  smallest  size  for         subpatterns  there  are  in  a  compiled pattern. The smallest size for
2361         ovector  that  will allow for n captured substrings, in addition to the         ovector that will allow for n captured substrings, in addition  to  the
2362         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
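
As a small illustration of the sizing rule above, the sketch below computes the minimum vector length from a capture count. The function name ovector_size_for is hypothetical; in real code the count would typically be obtained with pcre_fullinfo() (for example via PCRE_INFO_CAPTURECOUNT) rather than passed in by hand.

```c
/* Hypothetical helper: smallest ovector length (in ints) that allows for
   capture_count captured substrings plus the offsets of the whole match.
   This is the (n+1)*3 rule quoted in the text. */
int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

A pattern with two capturing subpatterns would therefore need a vector of at least 9 ints, of which only the first 6 carry offsets back to the caller.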
2363    
2364         It is possible for capturing subpattern number n+1 to match  some  part         It  is  possible for capturing subpattern number n+1 to match some part
2365         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
2366         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
2367         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
2368         2 is not. When this happens, both values in  the  offset  pairs  corre-         2  is  not.  When  this happens, both values in the offset pairs corre-
2369         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
2370    
2371         Offset  values  that correspond to unused subpatterns at the end of the         Offset values that correspond to unused subpatterns at the end  of  the
2372         expression are also set to -1. For example,  if  the  string  "abc"  is         expression  are  also  set  to  -1. For example, if the string "abc" is
2373         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
2374         matched. The return from the function is 2, because  the  highest  used         matched.  The  return  from the function is 2, because the highest used
2375         capturing  subpattern  number  is 1, and the  offsets  for  the  second         capturing  subpattern  number  is 1, and the  offsets  for  the  second
2376         and third capturing subpatterns (assuming the vector is  large  enough,         and  third  capturing subpatterns (assuming the vector is large enough,
2377         of course) are set to -1.         of course) are set to -1.
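
The -1 convention for unused subpatterns can be checked with a few lines of C. This is a sketch under stated assumptions: group_is_set is a hypothetical helper, and example_ovector is hand-written data showing what the vector would hold for "abc" matched against (a|(z))(bc), as described above. It is not output captured from a real pcre_exec() call.

```c
/* Hand-written illustration of the offsets for "abc" against (a|(z))(bc):
   pair 0 = whole match, pair 1 = "a", pair 2 = unused, pair 3 = "bc". */
const int example_ovector[8] = { 0, 3,  0, 1,  -1, -1,  1, 3 };

/* Hypothetical helper: returns 1 if capturing subpattern n took part in
   the match, 0 otherwise. rc is the positive return value (highest used
   subpattern number plus one). */
int group_is_set(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return 0;                       /* beyond the highest used group */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

With a return value of 4, the helper reports subpatterns 1 and 3 as set and subpattern 2 as unset. Checking n against rc as well as testing for -1 covers the case where trailing subpatterns do not fit in the vector at all.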
2378    
2379         Note: Elements of ovector that do not correspond to capturing parenthe-         Note: Elements in the first two-thirds of ovector that  do  not  corre-
2380         ses in the pattern are never changed. That is, if a pattern contains  n         spond  to  capturing parentheses in the pattern are never changed. That
2381         capturing parentheses, no more than ovector[0] to ovector[2n+1] are set         is, if a pattern contains n capturing parentheses, no more  than  ovec-
2382         by pcre_exec(). The other elements retain whatever values  they  previ-         tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in
2383         ously had.         the first two-thirds) retain whatever values they previously had.
2384    
2385         Some  convenience  functions  are  provided for extracting the captured         Some convenience functions are provided  for  extracting  the  captured
2386         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
2387    
2388     Error return values from pcre_exec()     Error return values from pcre_exec()
2389    
2390         If pcre_exec() fails, it returns a negative number. The  following  are         If  pcre_exec()  fails, it returns a negative number. The following are
2391         defined in the header file:         defined in the header file:
2392    
2393           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 2416  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2396  MATCHING A PATTERN: THE TRADITIONAL FUNC
2396    
2397           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
2398    
2399         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
2400         ovecsize was not zero.         ovecsize was not zero.
2401    
2402           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 2425  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2405  MATCHING A PATTERN: THE TRADITIONAL FUNC
2405    
2406           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
2407    
2408         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
2409         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
2410         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
2411         an  environment  with the other endianness. This is the error that PCRE         an environment with the other endianness. This is the error  that  PCRE
2412         gives when the magic number is not present.         gives when the magic number is not present.
2413    
2414           PCRE_ERROR_UNKNOWN_OPCODE (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
2415    
2416         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
2417         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
2418         overwriting of the compiled pattern.         overwriting of the compiled pattern.
2419    
2420           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2421    
2422         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
2423         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
2424         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
2425         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
2426         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
2427    
2428         This error is also given if pcre_stack_malloc() fails  in  pcre_exec().         This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
2429         This  can happen only when PCRE has been compiled with --disable-stack-         This can happen only when PCRE has been compiled with  --disable-stack-
2430         for-recursion.         for-recursion.
2431    
2432           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2433    
2434         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
2435         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2436         returned by pcre_exec().         returned by pcre_exec().
2437    
2438           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2439    
2440         The backtracking limit, as specified by  the  match_limit  field  in  a         The  backtracking  limit,  as  specified  by the match_limit field in a
2441         pcre_extra  structure  (or  defaulted) was reached. See the description         pcre_extra structure (or defaulted) was reached.  See  the  description
2442         above.         above.
2443    
2444           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2445    
2446         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
2447         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
2448         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
2449    
2450           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2451    
2452         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
2453         subject,  and the PCRE_NO_UTF8_CHECK option was not set. If the size of         subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
2454         the output vector (ovecsize) is at least 2,  the  byte  offset  to  the         the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
2455         start  of  the  invalid  UTF-8  character  is  placed in the first ele-         start  of  the  invalid  UTF-8 character is placed in  the  first  ele-
2456         ment, and a reason code is placed in the  second  element.  The  reason         ment,  and  a  reason  code is placed in the second element. The reason
2457         codes are listed in the following section.  For backward compatibility,         codes are listed in the following section.  For backward compatibility,
2458         if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8  char-         if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
2459         acter   at   the   end   of   the   subject  (reason  codes  1  to  5),         acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
2460         PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.         PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
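
A caller might turn the two returned elements into a diagnostic along the following lines. This is only a sketch: describe_bad_utf8 and the EX_ constants are hypothetical stand-ins whose values are copied from the error listing in this section. Real code would include pcre.h and use PCRE_ERROR_BADUTF8 and PCRE_ERROR_SHORTUTF8 directly.

```c
#include <stdio.h>

#define EX_ERROR_BADUTF8   (-10)   /* value copied from the listing */
#define EX_ERROR_SHORTUTF8 (-25)   /* value copied from the listing */

/* Hypothetical helper: writes a message into buf and returns 1 when the
   offset and reason code were available, 0 otherwise. The details are
   present only if ovecsize is at least 2, as the text states. */
int describe_bad_utf8(int rc, const int *ovector, int ovecsize,
                      char *buf, size_t buflen)
{
    if (rc != EX_ERROR_BADUTF8 && rc != EX_ERROR_SHORTUTF8) {
        snprintf(buf, buflen, "not a UTF-8 error");
        return 0;
    }
    if (ovector == NULL || ovecsize < 2) {
        snprintf(buf, buflen, "invalid UTF-8 (no details available)");
        return 0;
    }
    /* Reason codes 1 to 5 indicate a truncated character at the end
       of the subject, according to the text above. */
    snprintf(buf, buflen, "invalid UTF-8 at byte offset %d, reason %d%s",
             ovector[0], ovector[1],
             (ovector[1] >= 1 && ovector[1] <= 5) ? " (truncated)" : "");
    return 1;
}
```

Treating both error codes the same way reflects the note that PCRE_ERROR_SHORTUTF8 returns its information in the same form as PCRE_ERROR_BADUTF8.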
2461    
2462           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2463    
2464         The UTF-8 byte sequence that was passed as a subject  was  checked  and         The  UTF-8  byte  sequence that was passed as a subject was checked and
2465         found  to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the         found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
2466         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
2467         ter or the end of the subject.         ter or the end of the subject.
2468    
2469           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2470    
2471         The  subject  string did not match, but it did match partially. See the         The subject string did not match, but it did match partially.  See  the
2472         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2473    
2474           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2475    
2476         This code is no longer in  use.  It  was  formerly  returned  when  the         This  code  is  no  longer  in  use.  It was formerly returned when the
2477         PCRE_PARTIAL  option  was used with a compiled pattern containing items         PCRE_PARTIAL option was used with a compiled pattern  containing  items
2478         that were  not  supported  for  partial  matching.  From  release  8.00         that  were  not  supported  for  partial  matching.  From  release 8.00
2479         onwards, there are no restrictions on partial matching.         onwards, there are no restrictions on partial matching.
2480    
2481           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2482    
2483         An  unexpected  internal error has occurred. This error could be caused         An unexpected internal error has occurred. This error could  be  caused
2484         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2485    
2486           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
# Line 2510  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2490  MATCHING A PATTERN: THE TRADITIONAL FUNC
2490           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2491    
2492         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2493         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
2494         description above.         description above.
2495    
2496           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
# Line 2524  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2504  MATCHING A PATTERN: THE TRADITIONAL FUNC
2504    
2505           PCRE_ERROR_SHORTUTF8      (-25)           PCRE_ERROR_SHORTUTF8      (-25)
2506    
2507         This  error  is returned instead of PCRE_ERROR_BADUTF8 when the subject         This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
2508         string ends with a truncated UTF-8 character and the  PCRE_PARTIAL_HARD         string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
2509         option  is  set.   Information  about  the  failure  is returned as for         option is set.  Information  about  the  failure  is  returned  as  for
2510         PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this  case,  but         PCRE_ERROR_BADUTF8.  It  is in fact sufficient to detect this case, but
2511         this  special error code for PCRE_PARTIAL_HARD precedes the implementa-         this special error code for PCRE_PARTIAL_HARD precedes the  implementa-
2512         tion of returned information; it is retained for backwards  compatibil-         tion  of returned information; it is retained for backwards compatibil-
2513         ity.         ity.
2514    
2515           PCRE_ERROR_RECURSELOOP    (-26)           PCRE_ERROR_RECURSELOOP    (-26)
2516    
2517         This error is returned when pcre_exec() detects a recursion loop within         This error is returned when pcre_exec() detects a recursion loop within
2518         the pattern. Specifically, it means that either the whole pattern or  a         the  pattern. Specifically, it means that either the whole pattern or a
2519         subpattern  has been called recursively for the second time at the same         subpattern has been called recursively for the second time at the  same
2520         position in the subject string. Some simple patterns that might do this         position in the subject string. Some simple patterns that might do this
2521         are  detected  and faulted at compile time, but more complicated cases,         are detected and faulted at compile time, but more  complicated  cases,
2522         in particular mutual recursions between two different subpatterns, can-         in particular mutual recursions between two different subpatterns, can-
2523         not be detected until run time.         not be detected until run time.
2524    
2525             PCRE_ERROR_JIT_STACKLIMIT (-27)
2526    
2527           This error is returned when a pattern  that  was  successfully  studied
2528           using  the PCRE_STUDY_JIT_COMPILE option is being matched, but the mem-
2529           ory available for  the  just-in-time  processing  stack  is  not  large
2530           enough. See the pcrejit documentation for more details.
2531    
2532         Error numbers -16 to -20 and -22 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
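
One way to act on these return values is to sort them into a few coarse outcomes before deciding what to do. The sketch below is an assumption-laden illustration, not part of PCRE: the enum, the function name classify_exec_result, and the grouping of codes are all hypothetical, and the numeric values are copied from the listing above. Real code would include pcre.h and switch on the PCRE_ERROR_* names.

```c
/* Hypothetical outcome categories; the grouping is illustrative only. */
enum exec_outcome {
    EX_MATCHED,           /* rc > 0                                   */
    EX_MATCHED_OVERFLOW,  /* rc == 0: matched, ovector too small      */
    EX_NO_MATCH,          /* PCRE_ERROR_NOMATCH (-1)                  */
    EX_PARTIAL_MATCH,     /* PCRE_ERROR_PARTIAL (-12)                 */
    EX_RESOURCE_LIMIT,    /* NOMEMORY, MATCHLIMIT, RECURSIONLIMIT,
                             JIT_STACKLIMIT                           */
    EX_BAD_SUBJECT,       /* BADUTF8, BADUTF8_OFFSET, SHORTUTF8       */
    EX_OTHER_ERROR        /* everything else                          */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return EX_MATCHED;
    if (rc == 0)
        return EX_MATCHED_OVERFLOW;
    switch (rc) {
    case -1:
        return EX_NO_MATCH;
    case -12:
        return EX_PARTIAL_MATCH;
    case -6: case -8: case -21: case -27:
        return EX_RESOURCE_LIMIT;
    case -10: case -11: case -25:
        return EX_BAD_SUBJECT;
    default:
        return EX_OTHER_ERROR;
    }
}
```

Separating resource limits from bad input is a design choice: the former may be worth retrying with larger limits or a bigger JIT stack, while the latter indicates a problem with the subject string itself.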
2533    
2534     Reason codes for invalid UTF-8 strings     Reason codes for invalid UTF-8 strings
# Line 2936  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2923  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2923         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2924         est  matching  string is given first. If there were too many matches to         est  matching  string is given first. If there were too many matches to
2925         fit into ovector, the yield of the function is zero, and the vector  is         fit into ovector, the yield of the function is zero, and the vector  is
2926         filled with the longest matches.         filled  with  the  longest matches. Unlike pcre_exec(), pcre_dfa_exec()
2927           can use the entire ovector for returning matched strings.
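
The difference from pcre_exec() matters when interpreting a zero yield. The sketch below assumes that "the entire ovector" means every pair of ints can hold one match, so that a zero return leaves ovecsize/2 matches in the vector. That reading is an inference from the sentence above, and the helper name dfa_usable_pairs is hypothetical.

```c
/* Hypothetical helper: number of matched strings available after a
   pcre_dfa_exec()-style return value rc.
   rc > 0  : rc matches were stored, longest first.
   rc == 0 : too many matches; assumed here that the whole vector was
             filled, giving ovecsize/2 pairs.
   rc < 0  : an error code, so nothing is treated as valid. */
int dfa_usable_pairs(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 2;
    return 0;
}
```

Compare this with the traditional function, where only the first two-thirds of the vector carries offsets and an overflow leaves ovecsize/3 pairs.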
2928    
2929     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
2930    
2931         The  pcre_dfa_exec()  function returns a negative number when it fails.         The pcre_dfa_exec() function returns a negative number when  it  fails.
2932         Many of the errors are the same  as  for  pcre_exec(),  and  these  are         Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2933         described  above.   There are in addition the following errors that are         described above.  There are in addition the following errors  that  are
2934         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
2935    
2936           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
2937    
2938         This return is given if pcre_dfa_exec() encounters an item in the  pat-         This  return is given if pcre_dfa_exec() encounters an item in the pat-
2939         tern  that  it  does not support, for instance, the use of \C or a back         tern that it does not support, for instance, the use of \C  or  a  back
2940         reference.         reference.
2941    
2942           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2943    
2944         This return is given if pcre_dfa_exec()  encounters  a  condition  item         This  return  is  given  if pcre_dfa_exec() encounters a condition item
2945         that  uses  a back reference for the condition, or a test for recursion         that uses a back reference for the condition, or a test  for  recursion
2946         in a specific group. These are not supported.         in a specific group. These are not supported.
2947    
2948           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2949    
2950         This return is given if pcre_dfa_exec() is called with an  extra  block         This  return  is given if pcre_dfa_exec() is called with an extra block
2951         that contains a setting of the match_limit field. This is not supported         that contains a setting of  the  match_limit  or  match_limit_recursion
2952         (it is meaningless).         fields.  This  is  not  supported (these fields are meaningless for DFA
2953           matching).
2954    
2955           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
2956    
# Line 2991  AUTHOR Line 2980  AUTHOR
2980    
2981  REVISION  REVISION
2982    
2983         Last updated: 28 July 2011         Last updated: 23 September 2011
2984         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
2985  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2986    
# Line 3039  PCRE CALLOUTS Line 3028  PCRE CALLOUTS
3028         pattern is matched. This is useful information when you are  trying  to         pattern is matched. This is useful information when you are  trying  to
3029         optimize the performance of a particular pattern.         optimize the performance of a particular pattern.
3030    
3031           The  use  of callouts in a pattern makes it ineligible for optimization
3032           by  the  just-in-time  compiler.  Studying  such  a  pattern  with  the
3033           PCRE_STUDY_JIT_COMPILE option always fails.
3034    
3035    
3036  MISSING CALLOUTS  MISSING CALLOUTS
3037    
# Line 3180  AUTHOR Line 3173  AUTHOR
3173    
3174  REVISION  REVISION
3175    
3176         Last updated: 31 July 2011         Last updated: 26 August 2011
3177         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3178  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3179    
# Line 3199  DIFFERENCES BETWEEN PCRE AND PERL Line 3192  DIFFERENCES BETWEEN PCRE AND PERL
3192         respect to Perl versions 5.10 and above.         respect to Perl versions 5.10 and above.
3193    
3194         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
3195         of what it does have are given in the section on UTF-8 support  in  the         of what it does have are given in the pcreunicode page.
        main pcre page.  
3196    
3197         2. PCRE allows repeat quantifiers only on parenthesized assertions, but         2. PCRE allows repeat quantifiers only on parenthesized assertions, but
3198         they do not mean what you might think. For example, (?!a){3}  does  not         they  do  not mean what you might think. For example, (?!a){3} does not
3199         assert that the next three characters are not "a". It just asserts that         assert that the next three characters are not "a". It just asserts that
3200         the next character is not "a" three times (in principle: PCRE optimizes         the next character is not "a" three times (in principle: PCRE optimizes
3201         this to run the assertion just once). Perl allows repeat quantifiers on         this to run the assertion just once). Perl allows repeat quantifiers on
3202         other assertions such as \b, but these do not seem to have any use.         other assertions such as \b, but these do not seem to have any use.
3203    
3204         3. Capturing subpatterns that occur inside  negative  lookahead  asser-         3.  Capturing  subpatterns  that occur inside negative lookahead asser-
3205         tions  are  counted,  but their entries in the offsets vector are never         tions are counted, but their entries in the offsets  vector  are  never
3206         set. Perl sets its numerical variables from any such patterns that  are         set.  Perl sets its numerical variables from any such patterns that are
3207         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
3208         ing), but only if the negative lookahead assertion  contains  just  one         ing),  but  only  if the negative lookahead assertion contains just one
3209         branch.         branch.
3210    
3211         4.  Though  binary zero characters are supported in the subject string,         4. Though binary zero characters are supported in the  subject  string,
3212         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
3213         mal C string, terminated by zero. The escape sequence \0 can be used in         mal C string, terminated by zero. The escape sequence \0 can be used in
3214         the pattern to represent a binary zero.         the pattern to represent a binary zero.
3215    
3216         5. The following Perl escape sequences are not supported: \l,  \u,  \L,         5.  The  following Perl escape sequences are not supported: \l, \u, \L,
3217         \U,  and  \N when followed by a character name or Unicode value. (\N on         \U, and \N when followed by a character name or Unicode value.  (\N  on
3218         its own, matching a non-newline character, is supported.) In fact these         its own, matching a non-newline character, is supported.) In fact these
3219         are  implemented  by Perl's general string-handling and are not part of         are implemented by Perl's general string-handling and are not  part  of
3220         its pattern matching engine. If any of these are encountered  by  PCRE,         its  pattern  matching engine. If any of these are encountered by PCRE,
3221         an error is generated.         an error is generated.
3222    
3223         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
3224         is built with Unicode character property support. The  properties  that         is  built  with Unicode character property support. The properties that
3225         can  be tested with \p and \P are limited to the general category prop-         can be tested with \p and \P are limited to the general category  prop-
3226         erties such as Lu and Nd, script names such as Greek or  Han,  and  the         erties  such  as  Lu and Nd, script names such as Greek or Han, and the
3227         derived  properties  Any  and  L&. PCRE does support the Cs (surrogate)         derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
3228         property, which Perl does not; the  Perl  documentation  says  "Because         property,  which  Perl  does  not; the Perl documentation says "Because
3229         Perl hides the need for the user to understand the internal representa-         Perl hides the need for the user to understand the internal representa-
3230         tion of Unicode characters, there is no need to implement the  somewhat         tion  of Unicode characters, there is no need to implement the somewhat
3231         messy concept of surrogates."         messy concept of surrogates."
3232    
3233         7.  PCRE implements a simpler version of \X than Perl, which changed to         7. PCRE implements a simpler version of \X than Perl, which changed  to
3234         make \X match what Unicode calls an "extended grapheme  cluster".  This         make  \X  match what Unicode calls an "extended grapheme cluster". This
3235         is  more  complicated  than an extended Unicode sequence, which is what         is more complicated than an extended Unicode sequence,  which  is  what
3236         PCRE matches.         PCRE matches.
3237    
3238         8. PCRE does support the \Q...\E escape for quoting substrings. Charac-         8. PCRE does support the \Q...\E escape for quoting substrings. Charac-
3239         ters  in  between  are  treated as literals. This is slightly different         ters in between are treated as literals.  This  is  slightly  different
3240         from Perl in that $ and @ are  also  handled  as  literals  inside  the         from  Perl  in  that  $  and  @ are also handled as literals inside the
3241         quotes.  In Perl, they cause variable interpolation (but of course PCRE         quotes. In Perl, they cause variable interpolation (but of course  PCRE
3242         does not have variables). Note the following examples:         does not have variables). Note the following examples:
3243    
3244             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 3256  DIFFERENCES BETWEEN PCRE AND PERL Line 3248  DIFFERENCES BETWEEN PCRE AND PERL
3248             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
3249             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
3250    
3251         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
3252         classes.         classes.
3253    
3254         9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})         9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
3255         constructions. However, there is support for recursive  patterns.  This         constructions.  However,  there is support for recursive patterns. This
3256         is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE         is not available in Perl 5.8, but it is in Perl 5.10.  Also,  the  PCRE
3257         "callout" feature allows an external function to be called during  pat-         "callout"  feature allows an external function to be called during pat-
3258         tern matching. See the pcrecallout documentation for details.         tern matching. See the pcrecallout documentation for details.
3259    
3260         10.  Subpatterns  that  are  called recursively or as "subroutines" are         10. Subpatterns that are called as subroutines (whether or  not  recur-
3261         always treated as atomic groups in  PCRE.  This  is  like  Python,  but         sively)  are  always  treated  as  atomic  groups in PCRE. This is like
3262         unlike  Perl. There is a discussion of an example that explains this in         Python, but unlike Perl.  Captured values that are set outside  a  sub-
3263         more detail in the section on recursion differences from  Perl  in  the         routine  call  can  be referenced from inside in PCRE, but not in Perl.
3264         pcrepattern page.         There is a discussion that explains these differences in more detail in
3265           the section on recursion differences from Perl in the pcrepattern page.
3266    
3267           11.  If  (*THEN)  is present in a group that is called as a subroutine,
3268           its action is limited to that group, even if the group does not contain
3269           any | characters.
3270    
3271         11.  There are some differences that are concerned with the settings of         12.  There are some differences that are concerned with the settings of
3272         captured strings when part of  a  pattern  is  repeated.  For  example,         captured strings when part of  a  pattern  is  repeated.  For  example,
3273         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
3274         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
3275    
3276         12. PCRE's handling of duplicate subpattern numbers and duplicate  sub-         13. PCRE's handling of duplicate subpattern numbers and duplicate  sub-
3277         pattern names is not as general as Perl's. This is a consequence of the         pattern names is not as general as Perl's. This is a consequence of the
3278         fact that PCRE works internally just with numbers, using an external ta-         fact that PCRE works internally just with numbers, using an external ta-
3279         ble  to  translate  between numbers and names. In particular, a pattern         ble  to  translate  between numbers and names. In particular, a pattern
# Line 3287  DIFFERENCES BETWEEN PCRE AND PERL Line 3284  DIFFERENCES BETWEEN PCRE AND PERL
3284         turing subpattern number 1. To avoid this confusing situation, an error         turing subpattern number 1. To avoid this confusing situation, an error
3285         is given at compile time.         is given at compile time.
3286    
3287         13.  Perl  recognizes  comments  in some places that PCRE does not, for         14.  Perl  recognizes  comments  in some places that PCRE does not, for
3288         example, between the ( and ? at the start of a subpattern.  If  the  /x         example, between the ( and ? at the start of a subpattern.  If  the  /x
3289         modifier  is set, Perl allows whitespace between ( and ? but PCRE never         modifier  is set, Perl allows whitespace between ( and ? but PCRE never
3290         does, even if the PCRE_EXTENDED option is set.         does, even if the PCRE_EXTENDED option is set.
3291    
3292         14. PCRE provides some extensions to the Perl regular expression facil-         15. PCRE provides some extensions to the Perl regular expression facil-
3293         ities.   Perl  5.10  includes new features that are not in earlier ver-         ities.   Perl  5.10  includes new features that are not in earlier ver-
3294         sions of Perl, some of which (such as named parentheses) have  been  in         sions of Perl, some of which (such as named parentheses) have  been  in
3295         PCRE for some time. This list is with respect to Perl 5.10:         PCRE for some time. This list is with respect to Perl 5.10:
# Line 3328  DIFFERENCES BETWEEN PCRE AND PERL Line 3325  DIFFERENCES BETWEEN PCRE AND PERL
3325         (i) The partial matching facility is PCRE-specific.         (i) The partial matching facility is PCRE-specific.
3326    
3327         (j) Patterns compiled by PCRE can be saved and re-used at a later time,         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
3328         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.  However,  this
3329           does not apply to optimized data created by the just-in-time compiler.
3330    
3331         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a
3332         different way and is not Perl-compatible.         different way and is not Perl-compatible.
3333    
3334         (l)  PCRE  recognizes some special sequences such as (*CR) at the start         (l) PCRE recognizes some special sequences such as (*CR) at  the  start
3335         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
3336         pattern.         pattern.
3337    
# Line 3347  AUTHOR Line 3345  AUTHOR
3345    
3346  REVISION  REVISION
3347    
3348         Last updated: 24 July 2011         Last updated: 09 October 2011
3349         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3350  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3351    
# Line 3387  PCRE REGULAR EXPRESSION DETAILS Line 3385  PCRE REGULAR EXPRESSION DETAILS
3385         Starting a pattern with this sequence  is  equivalent  to  setting  the         Starting a pattern with this sequence  is  equivalent  to  setting  the
3386         PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting         PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
3387         UTF-8 mode affects pattern matching  is  mentioned  in  several  places         UTF-8 mode affects pattern matching  is  mentioned  in  several  places
3388         below.  There  is  also  a  summary of UTF-8 features in the section on         below.  There  is  also  a summary of UTF-8 features in the pcreunicode
3389         UTF-8 support in the main pcre page.         page.
3390    
3391         Another special sequence that may appear at the start of a  pattern  or         Another special sequence that may appear at the start of a  pattern  or
3392         in combination with (*UTF8) is:         in combination with (*UTF8) is:
# Line 4144  FULL STOP (PERIOD, DOT) AND \N Line 4142  FULL STOP (PERIOD, DOT) AND \N
4142  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
4143    
4144         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
4145         both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any         both  in  and  out of UTF-8 mode. Unlike a dot, it always matches line-
4146         line-ending characters. The feature is provided in  Perl  in  order  to         ending characters. The feature is provided in Perl in  order  to  match
4147         match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-         individual  bytes  in UTF-8 mode, but it is unclear how it can usefully
4148         acters into individual bytes, the rest of the string may start  with  a         be used. Because \C breaks up characters into individual bytes,  match-
4149         malformed  UTF-8  character. For this reason, the \C escape sequence is         ing  one  byte  with \C in UTF-8 mode means that the rest of the string
4150         best avoided.         may start with a malformed UTF-8 character. This has undefined results,
4151           because  PCRE  assumes that it is dealing with valid UTF-8 strings (and
4152           by default it checks  this  at  the  start  of  processing  unless  the
4153           PCRE_NO_UTF8_CHECK option is used).
4154    
4155         PCRE does not allow \C to appear in  lookbehind  assertions  (described         PCRE  does  not  allow \C to appear in lookbehind assertions (described
4156         below),  because  in UTF-8 mode this would make it impossible to calcu-         below), because in UTF-8 mode this would make it impossible  to  calcu-
4157         late the length of the lookbehind.         late the length of the lookbehind.
4158    
4159           In  general, the \C escape sequence is best avoided in UTF-8 mode. How-
4160           ever, one way of using it that avoids the problem  of  malformed  UTF-8
4161           characters  is to use a lookahead to check the length of the next char-
4162           acter, as in this pattern (ignore white space and line breaks):
4163    
4164             (?| (?=[\x00-\x7f])(\C) |
4165                 (?=[\x80-\x{7ff}])(\C)(\C) |
4166                 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
4167                 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
4168    
4169           A group that starts with (?| resets the capturing  parentheses  numbers
4170           in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
4171           assertions at the start of each branch check the next  UTF-8  character
4172           for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
4173           character's individual bytes are then captured by the appropriate  num-
4174           ber of groups.
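
The four lookahead ranges in the pattern above correspond to UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. The following is a minimal sketch of that mapping in Python, not PCRE code; the function names utf8_length and utf8_bytes are hypothetical and chosen only for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return the byte count that the lookahead ranges assign to a code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:        # branch 1: [\x00-\x7f]
        return 1
    if code_point <= 0x7FF:       # branch 2: [\x80-\x{7ff}]
        return 2
    if code_point <= 0xFFFF:      # branch 3: [\x{800}-\x{ffff}]
        return 3
    if code_point <= 0x1FFFFF:    # branch 4: [\x{10000}-\x{1fffff}]
        return 4
    raise ValueError("code point is outside the ranges used in the pattern")


def utf8_bytes(character: str) -> list:
    """Split one character into its individual UTF-8 bytes, as the \\C groups would."""
    return [bytes([b]) for b in character.encode("utf-8")]
```

For a character that Python can encode, len(utf8_bytes(ch)) agrees with utf8_length(ord(ch)), which is the number of (\C) groups in the branch that matches it.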
4175    
4176    
4177  SQUARE BRACKETS AND CHARACTER CLASSES  SQUARE BRACKETS AND CHARACTER CLASSES
4178    
# Line 4162  SQUARE BRACKETS AND CHARACTER CLASSES Line 4180  SQUARE BRACKETS AND CHARACTER CLASSES
4180         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
4181         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
4182         a lone closing square bracket causes a compile-time error. If a closing         a lone closing square bracket causes a compile-time error. If a closing
4183         square bracket is required as a member of the class, it should  be  the         square  bracket  is required as a member of the class, it should be the
4184         first  data  character  in  the  class (after an initial circumflex, if         first data character in the class  (after  an  initial  circumflex,  if
4185         present) or escaped with a backslash.         present) or escaped with a backslash.
4186    
4187         A character class matches a single character in the subject.  In  UTF-8         A  character  class matches a single character in the subject. In UTF-8
4188         mode, the character may be more than one byte long. A matched character         mode, the character may be more than one byte long. A matched character
4189         must be in the set of characters defined by the class, unless the first         must be in the set of characters defined by the class, unless the first
4190         character  in  the  class definition is a circumflex, in which case the         character in the class definition is a circumflex, in  which  case  the
4191         subject character must not be in the set defined by  the  class.  If  a         subject  character  must  not  be in the set defined by the class. If a
4192         circumflex  is actually required as a member of the class, ensure it is         circumflex is actually required as a member of the class, ensure it  is
4193         not the first character, or escape it with a backslash.         not the first character, or escape it with a backslash.
4194    
4195         For example, the character class [aeiou] matches any lower case  vowel,         For  example, the character class [aeiou] matches any lower case vowel,
4196         while  [^aeiou]  matches  any character that is not a lower case vowel.         while [^aeiou] matches any character that is not a  lower  case  vowel.
4197         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
4198         characters  that  are in the class by enumerating those that are not. A         characters that are in the class by enumerating those that are  not.  A
4199         class that starts with a circumflex is not an assertion; it still  con-         class  that starts with a circumflex is not an assertion; it still con-
4200         sumes  a  character  from the subject string, and therefore it fails if         sumes a character from the subject string, and therefore  it  fails  if
4201         the current pointer is at the end of the string.         the current pointer is at the end of the string.
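
The point that a negated class still consumes a character can be illustrated with a short sketch. It uses Python's re module as a stand-in, on the assumption that it behaves like PCRE for these two simple patterns; the variable names are hypothetical.

```python
import re

# A negated class must consume one character, so it cannot match at end of string.
negated_class = re.compile(r"a[^b]")

# A negative lookahead is an assertion: it consumes nothing and succeeds at the end.
negative_lookahead = re.compile(r"a(?!b)")

for subject in ("a", "ac", "ab"):
    print(subject,
          bool(negated_class.search(subject)),
          bool(negative_lookahead.search(subject)))
```

Against the subject "a", only the lookahead version matches, because there is no character left for [^b] to consume.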
4202    
4203         In UTF-8 mode, characters with values greater than 255 can be  included         In  UTF-8 mode, characters with values greater than 255 can be included
4204         in  a  class as a literal string of bytes, or by using the \x{ escaping         in a class as a literal string of bytes, or by using the  \x{  escaping
4205         mechanism.         mechanism.
4206    
4207         When caseless matching is set, any letters in a  class  represent  both         When  caseless  matching  is set, any letters in a class represent both
4208         their  upper  case  and lower case versions, so for example, a caseless         their upper case and lower case versions, so for  example,  a  caseless
4209         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
4210         match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always         match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
4211         understands the concept of case for characters whose  values  are  less         understands  the  concept  of case for characters whose values are less
4212         than  128, so caseless matching is always possible. For characters with         than 128, so caseless matching is always possible. For characters  with
4213         higher values, the concept of case is supported  if  PCRE  is  compiled         higher  values,  the  concept  of case is supported if PCRE is compiled
4214         with  Unicode  property support, but not otherwise.  If you want to use         with Unicode property support, but not otherwise.  If you want  to  use
4215         caseless matching in UTF-8 mode for characters 128 and above,  you  must         caseless  matching  in UTF-8 mode for characters 128 and above, you must
4216         ensure  that  PCRE is compiled with Unicode property support as well as         ensure that PCRE is compiled with Unicode property support as  well  as
4217         with UTF-8 support.         with UTF-8 support.
4218    
4219         Characters that might indicate line breaks are  never  treated  in  any         Characters  that  might  indicate  line breaks are never treated in any
4220         special  way  when  matching  character  classes,  whatever line-ending         special way  when  matching  character  classes,  whatever  line-ending
4221         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and         sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
4222         PCRE_MULTILINE options is used. A class such as [^a] always matches one         PCRE_MULTILINE options is used. A class such as [^a] always matches one
4223         of these characters.         of these characters.
4224    
4225         The minus (hyphen) character can be used to specify a range of  charac-         The  minus (hyphen) character can be used to specify a range of charac-
4226         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters in a character  class.  For  example,  [d-m]  matches  any  letter
4227         between d and m, inclusive. If a  minus  character  is  required  in  a         between  d  and  m,  inclusive.  If  a minus character is required in a
4228         class,  it  must  be  escaped  with a backslash or appear in a position         class, it must be escaped with a backslash  or  appear  in  a  position
4229         where it cannot be interpreted as indicating a range, typically as  the         where  it cannot be interpreted as indicating a range, typically as the
4230         first or last character in the class.         first or last character in the class.
4231    
4232         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
4233         ter of a range. A pattern such as [W-]46] is interpreted as a class  of         ter  of a range. A pattern such as [W-]46] is interpreted as a class of
4234         two  characters ("W" and "-") followed by a literal string "46]", so it         two characters ("W" and "-") followed by a literal string "46]", so  it
4235         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
4236         backslash  it is interpreted as the end of range, so [W-\]46] is inter-         backslash it is interpreted as the end of range, so [W-\]46] is  inter-
4237         preted as a class containing a range followed by two other  characters.         preted  as a class containing a range followed by two other characters.
4238         The  octal or hexadecimal representation of "]" can also be used to end         The octal or hexadecimal representation of "]" can also be used to  end
4239         a range.         a range.
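
The two readings of "]" described above can be checked with a small sketch. It uses Python's re module rather than PCRE, assuming the two agree on these particular classes; the variable names are hypothetical.

```python
import re

# Unescaped "]" closes the class: [W-] is "W" or "-", followed by the literal "46]".
class_then_literal = re.compile(r"[W-]46]")

# Escaped "]" ends a range: the class holds W..] plus the characters "4" and "6".
range_plus_two = re.compile(r"[W-\]46]")

print([s for s in ("W46]", "-46]", "X46]") if class_then_literal.fullmatch(s)])
print([c for c in "WX]46-5" if range_plus_two.fullmatch(c)])
```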
4240    
4241         Ranges operate in the collating sequence of character values. They  can         Ranges  operate in the collating sequence of character values. They can
4242         also   be  used  for  characters  specified  numerically,  for  example         also  be  used  for  characters  specified  numerically,  for   example
4243         [\000-\037]. In UTF-8 mode, ranges can include characters whose  values         [\000-\037].  In UTF-8 mode, ranges can include characters whose values
4244         are greater than 255, for example [\x{100}-\x{2ff}].         are greater than 255, for example [\x{100}-\x{2ff}].
4245    
4246         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
4247         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
4248         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
4249         character tables for a French locale are in  use,  [\xc8-\xcb]  matches         character  tables  for  a French locale are in use, [\xc8-\xcb] matches
4250         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
4251         concept of case for characters with values greater than 128  only  when         concept  of  case for characters with values greater than 128 only when
4252         it is compiled with Unicode property support.         it is compiled with Unicode property support.
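
As a rough illustration of the [W-c] example, the sketch below lists which ASCII characters a caseless [W-c] accepts. It relies on Python's re module, not PCRE, with the re.ASCII flag so that only ASCII case folding applies; that the two engines agree here is an assumption.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE | re.ASCII)

# Collect every ASCII character that the caseless range accepts.
accepted = "".join(chr(c) for c in range(128) if caseless_range.fullmatch(chr(c)))
print(accepted)
```

The accepted set consists of the punctuation between "Z" and "a" together with a, b, c, w, x, y, and z in both cases.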
4253    
4254         The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,         The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
4255         \w, and \W may appear in a character class, and add the characters that         \w, and \W may appear in a character class, and add the characters that
4256         they  match to the class. For example, [\dABCDEF] matches any hexadeci-         they match to the class. For example, [\dABCDEF] matches any  hexadeci-
4257         mal digit. In UTF-8 mode, the PCRE_UCP option affects the  meanings  of         mal  digit.  In UTF-8 mode, the PCRE_UCP option affects the meanings of
4258         \d,  \s,  \w  and  their upper case partners, just as it does when they         \d, \s, \w and their upper case partners, just as  it  does  when  they
4259         appear outside a character class, as described in the section  entitled         appear  outside a character class, as described in the section entitled
4260         "Generic character types" above. The escape sequence \b has a different         "Generic character types" above. The escape sequence \b has a different
4261         meaning inside a character class; it matches the  backspace  character.         meaning  inside  a character class; it matches the backspace character.
4262         The  sequences  \B,  \N,  \R, and \X are not special inside a character         The sequences \B, \N, \R, and \X are not  special  inside  a  character
4263         class. Like any other unrecognized escape sequences, they  are  treated         class.  Like  any other unrecognized escape sequences, they are treated
4264         as  the literal characters "B", "N", "R", and "X" by default, but cause         as the literal characters "B", "N", "R", and "X" by default, but  cause
4265         an error if the PCRE_EXTRA option is set.         an error if the PCRE_EXTRA option is set.
4266    
4267         A circumflex can conveniently be used with  the  upper  case  character         A  circumflex  can  conveniently  be used with the upper case character
4268         types  to specify a more restricted set of characters than the matching         types to specify a more restricted set of characters than the  matching
4269         lower case type.  For example, the class [^\W_] matches any  letter  or         lower  case  type.  For example, the class [^\W_] matches any letter or
4270         digit, but not underscore, whereas [\w] includes underscore. A positive         digit, but not underscore, whereas [\w] includes underscore. A positive
4271         character class should be read as "something OR something OR ..." and a         character class should be read as "something OR something OR ..." and a
4272         negative class as "NOT something AND NOT something AND NOT ...".         negative class as "NOT something AND NOT something AND NOT ...".
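
The [^\W_] idiom can be tried out with the following sketch. It uses Python's re module in ASCII mode as a stand-in for PCRE without PCRE_UCP, which is an assumption; the variable names are hypothetical.

```python
import re

# NOT non-word AND NOT underscore: letters and digits only.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# The plain word class includes underscore.
word_char = re.compile(r"[\w]", re.ASCII)

for ch in ("a", "Z", "7", "_", " "):
    print(repr(ch), bool(letter_or_digit.fullmatch(ch)), bool(word_char.fullmatch(ch)))
```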
4273    
4274         The  only  metacharacters  that are recognized in character classes are         The only metacharacters that are recognized in  character  classes  are
4275         backslash, hyphen (only where it can be  interpreted  as  specifying  a         backslash,  hyphen  (only  where  it can be interpreted as specifying a
4276         range),  circumflex  (only  at the start), opening square bracket (only         range), circumflex (only at the start), opening  square  bracket  (only
4277         when it can be interpreted as introducing a POSIX class name - see  the         when  it can be interpreted as introducing a POSIX class name - see the
4278         next  section),  and  the  terminating closing square bracket. However,         next section), and the terminating  closing  square  bracket.  However,
4279         escaping other non-alphanumeric characters does no harm.         escaping other non-alphanumeric characters does no harm.
4280    
4281    
4282  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
4283    
4284         Perl supports the POSIX notation for character classes. This uses names         Perl supports the POSIX notation for character classes. This uses names
4285         enclosed  by  [: and :] within the enclosing square brackets. PCRE also         enclosed by [: and :] within the enclosing square brackets.  PCRE  also
4286         supports this notation. For example,         supports this notation. For example,
4287    
4288           [01[:alpha:]%]           [01[:alpha:]%]
# Line 4287  POSIX CHARACTER CLASSES Line 4305  POSIX CHARACTER CLASSES
4305           word     "word" characters (same as \w)           word     "word" characters (same as \w)
4306           xdigit   hexadecimal digits           xdigit   hexadecimal digits
4307    
4308         The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),         The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
4309         and space (32). Notice that this list includes the VT  character  (code         and  space  (32). Notice that this list includes the VT character (code
4310         11). This makes "space" different to \s, which does not include VT (for         11). This makes "space" different to \s, which does not include VT (for
4311         Perl compatibility).         Perl compatibility).
4312    
4313         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension         The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
4314         from  Perl  5.8. Another Perl extension is negation, which is indicated         from Perl 5.8. Another Perl extension is negation, which  is  indicated
4315         by a ^ character after the colon. For example,         by a ^ character after the colon. For example,
4316    
4317           [12[:^digit:]]           [12[:^digit:]]
4318    
4319         matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the         matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
4320         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
4321         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
4322    
4323         By default, in UTF-8 mode, characters with values greater than  128  do         By  default,  in UTF-8 mode, characters with values greater than 128 do
4324         not  match any of the POSIX character classes. However, if the PCRE_UCP         not match any of the POSIX character classes. However, if the  PCRE_UCP
4325         option is passed to pcre_compile(), some of the classes are changed  so         option  is passed to pcre_compile(), some of the classes are changed so
4326         that Unicode character properties are used. This is achieved by replac-         that Unicode character properties are used. This is achieved by replac-
4327         ing the POSIX classes by other sequences, as follows:         ing the POSIX classes by other sequences, as follows:
4328    
# Line 4317  POSIX CHARACTER CLASSES Line 4335  POSIX CHARACTER CLASSES
4335           [:upper:]  becomes  \p{Lu}           [:upper:]  becomes  \p{Lu}
4336           [:word:]   becomes  \p{Xwd}           [:word:]   becomes  \p{Xwd}
4337    
4338         Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other         Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
4339         POSIX classes are unchanged, and match only characters with code points         POSIX classes are unchanged, and match only characters with code points
4340         less than 128.         less than 128.
4341    
4342    
4343  VERTICAL BAR  VERTICAL BAR
4344    
4345         Vertical bar characters are used to separate alternative patterns.  For         Vertical  bar characters are used to separate alternative patterns. For
4346         example, the pattern         example, the pattern
4347    
4348           gilbert|sullivan           gilbert|sullivan
4349    
4350         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches either "gilbert" or "sullivan". Any number of alternatives  may
4351         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear,  and  an  empty  alternative  is  permitted (matching the empty
4352         string). The matching process tries each alternative in turn, from left         string). The matching process tries each alternative in turn, from left
4353         to right, and the first one that succeeds is used. If the  alternatives         to  right, and the first one that succeeds is used. If the alternatives
4354         are  within a subpattern (defined below), "succeeds" means matching the         are within a subpattern (defined below), "succeeds" means matching  the
4355         rest of the main pattern as well as the alternative in the subpattern.         rest of the main pattern as well as the alternative in the subpattern.
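
The left-to-right rule and the meaning of "succeeds" inside a subpattern can be sketched as follows. Python's re module is used here as a stand-in, on the assumption that it matches PCRE's behaviour for these simple alternations; the variable names are hypothetical.

```python
import re

# Alternatives are tried left to right, so "a" wins even though "ab" would also match.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, an alternative succeeds only if the rest of the pattern
# matches too: "a" is tried first, "c" then fails against "b", so "ab" is used.
rest_must_match = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|sullivan|", "") is not None

print(first_wins, rest_must_match, empty_alternative)
```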
4356    
4357    
4358  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
4359    
4360         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and         The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
4361         PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from         PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
4362         within the pattern by  a  sequence  of  Perl  option  letters  enclosed         within  the  pattern  by  a  sequence  of  Perl option letters enclosed
4363         between "(?" and ")".  The option letters are         between "(?" and ")".  The option letters are
4364    
4365           i  for PCRE_CASELESS           i  for PCRE_CASELESS
# Line 4351  INTERNAL OPTION SETTING Line 4369  INTERNAL OPTION SETTING
4369    
4370         For example, (?im) sets caseless, multiline matching. It is also possi-         For example, (?im) sets caseless, multiline matching. It is also possi-
4371         ble to unset these options by preceding the letter with a hyphen, and a         ble to unset these options by preceding the letter with a hyphen, and a
4372         combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-         combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
4373         LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,         LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
4374         is  also  permitted.  If  a  letter  appears  both before and after the         is also permitted. If a  letter  appears  both  before  and  after  the
4375         hyphen, the option is unset.         hyphen, the option is unset.
4376    
4377         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
4378         can  be changed in the same way as the Perl-compatible options by using         can be changed in the same way as the Perl-compatible options by  using
4379         the characters J, U and X respectively.         the characters J, U and X respectively.
4380    
4381         When one of these option changes occurs at  top  level  (that  is,  not         When  one  of  these  option  changes occurs at top level (that is, not
4382         inside  subpattern parentheses), the change applies to the remainder of         inside subpattern parentheses), the change applies to the remainder  of
4383         the pattern that follows. If the change is placed right at the start of         the pattern that follows. If the change is placed right at the start of
4384         a pattern, PCRE extracts it into the global options (and it will there-         a pattern, PCRE extracts it into the global options (and it will there-
4385         fore show up in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
4386    
4387         An option change within a subpattern (see below for  a  description  of         An  option  change  within a subpattern (see below for a description of
4388         subpatterns)  affects only that part of the subpattern that follows it,         subpatterns) affects only that part of the subpattern that follows  it,
4389         so         so
4390    
4391           (a(?i)b)c           (a(?i)b)c
4392    
4393         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
4394         used).   By  this means, options can be made to have different settings         used).  By this means, options can be made to have  different  settings
4395         in different parts of the pattern. Any changes made in one  alternative         in  different parts of the pattern. Any changes made in one alternative
4396         do  carry  on  into subsequent branches within the same subpattern. For         do carry on into subsequent branches within the  same  subpattern.  For
4397         example,         example,
4398    
4399           (a(?i)b|c)           (a(?i)b|c)
4400    
4401         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
4402         first  branch  is  abandoned before the option setting. This is because         first branch is abandoned before the option setting.  This  is  because
4403         the effects of option settings happen at compile time. There  would  be         the  effects  of option settings happen at compile time. There would be
4404         some very weird behaviour otherwise.         some very weird behaviour otherwise.
4405    
4406         Note:  There  are  other  PCRE-specific  options that can be set by the         Note: There are other PCRE-specific options that  can  be  set  by  the
4407         application when the compile or match functions  are  called.  In  some         application  when  the  compile  or match functions are called. In some
4408         cases the pattern can contain special leading sequences such as (*CRLF)         cases the pattern can contain special leading sequences such as (*CRLF)
4409         to override what the application has set or what  has  been  defaulted.         to  override  what  the application has set or what has been defaulted.
4410         Details  are  given  in the section entitled "Newline sequences" above.         Details are given in the section entitled  "Newline  sequences"  above.
4411         There are also the (*UTF8) and (*UCP) leading  sequences  that  can  be         There  are  also  the  (*UTF8) and (*UCP) leading sequences that can be
4412         used  to  set  UTF-8 and Unicode property modes; they are equivalent to         used to set UTF-8 and Unicode property modes; they  are  equivalent  to
4413         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.
4414    
4415    
# Line 4404  SUBPATTERNS Line 4422  SUBPATTERNS
4422    
4423           cat(aract|erpillar|)           cat(aract|erpillar|)
4424    
4425         matches  "cataract",  "caterpillar", or "cat". Without the parentheses,         matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
4426         it would match "cataract", "erpillar" or an empty string.         it would match "cataract", "erpillar" or an empty string.
4427    
4428         2. It sets up the subpattern as  a  capturing  subpattern.  This  means         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
4429         that,  when  the  whole  pattern  matches,  that portion of the subject         that, when the whole pattern  matches,  that  portion  of  the  subject
4430         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
4431         ovector  argument  of pcre_exec(). Opening parentheses are counted from         ovector argument of pcre_exec(). Opening parentheses are  counted  from
4432         left to right (starting from 1) to obtain  numbers  for  the  capturing         left  to  right  (starting  from 1) to obtain numbers for the capturing
4433         subpatterns.  For  example,  if  the  string  "the red king" is matched         subpatterns. For example, if the  string  "the  red  king"  is  matched
4434         against the pattern         against the pattern
4435    
4436           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 4420  SUBPATTERNS Line 4438  SUBPATTERNS
4438         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
4439         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
4440    
4441         The  fact  that  plain  parentheses  fulfil two functions is not always         The fact that plain parentheses fulfil  two  functions  is  not  always
4442         helpful.  There are often times when a grouping subpattern is  required         helpful.   There are often times when a grouping subpattern is required
4443         without  a capturing requirement. If an opening parenthesis is followed         without a capturing requirement. If an opening parenthesis is  followed
4444         by a question mark and a colon, the subpattern does not do any  captur-         by  a question mark and a colon, the subpattern does not do any captur-
4445         ing,  and  is  not  counted when computing the number of any subsequent         ing, and is not counted when computing the  number  of  any  subsequent
4446         capturing subpatterns. For example, if the string "the white queen"  is         capturing  subpatterns. For example, if the string "the white queen" is
4447         matched against the pattern         matched against the pattern
4448    
4449           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
# Line 4433  SUBPATTERNS Line 4451  SUBPATTERNS
4451         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
4452         1 and 2. The maximum number of capturing subpatterns is 65535.         1 and 2. The maximum number of capturing subpatterns is 65535.
4453    
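The numbering rules above can be checked with a small sketch. It assumes Python's standard `re` module as a stand-in for PCRE; Python numbers capturing and non-capturing groups the same way for these two patterns, but it is a different engine, and the variable names are illustrative only.

```python
import re

# Plain parentheses both group and capture. Groups are numbered by the
# position of their opening parenthesis, counting from 1.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# (?:...) groups without capturing, so it is skipped when numbering
# the capturing groups that follow it.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())
```

With PCRE itself the same substrings would be read from the ovector filled in by pcre_exec() rather than from a match object.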
4454         As a convenient shorthand, if any option settings are required  at  the         As  a  convenient shorthand, if any option settings are required at the
4455         start  of  a  non-capturing  subpattern,  the option letters may appear         start of a non-capturing subpattern,  the  option  letters  may  appear
4456         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
4457    
4458           (?i:saturday|sunday)           (?i:saturday|sunday)
4459           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
4460    
4461         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
4462         tried  from  left  to right, and options are not reset until the end of         tried from left to right, and options are not reset until  the  end  of
4463         the subpattern is reached, an option setting in one branch does  affect         the  subpattern is reached, an option setting in one branch does affect
4464         subsequent  branches,  so  the above patterns match "SUNDAY" as well as         subsequent branches, so the above patterns match "SUNDAY"  as  well  as
4465         "Saturday".         "Saturday".
4466    
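A minimal sketch of the option shorthand, again assuming Python's `re` module rather than PCRE. Only the (?i:...) spelling is exercised, because Python treats an option setting in the middle of a pattern differently from PCRE.

```python
import re

# The caseless option applies to every branch inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")
matched = [bool(weekend.fullmatch(s)) for s in ("Saturday", "SUNDAY", "sunday")]
print(matched)

# Outside the group the option is no longer in force.
scoped = re.compile(r"(?i:sun)day")
print(bool(scoped.fullmatch("SUNday")), bool(scoped.fullmatch("SUNDAY")))
```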
4467    
4468  DUPLICATE SUBPATTERN NUMBERS  DUPLICATE SUBPATTERN NUMBERS
4469    
4470         Perl 5.10 introduced a feature whereby each alternative in a subpattern         Perl 5.10 introduced a feature whereby each alternative in a subpattern
4471         uses  the same numbers for its capturing parentheses. Such a subpattern         uses the same numbers for its capturing parentheses. Such a  subpattern
4472         starts with (?| and is itself a non-capturing subpattern. For  example,         starts  with (?| and is itself a non-capturing subpattern. For example,
4473         consider this pattern:         consider this pattern:
4474    
4475           (?|(Sat)ur|(Sun))day           (?|(Sat)ur|(Sun))day
4476    
4477         Because  the two alternatives are inside a (?| group, both sets of cap-         Because the two alternatives are inside a (?| group, both sets of  cap-
4478         turing parentheses are numbered one. Thus, when  the  pattern  matches,         turing  parentheses  are  numbered one. Thus, when the pattern matches,
4479         you  can  look  at captured substring number one, whichever alternative         you can look at captured substring number  one,  whichever  alternative
4480         matched. This construct is useful when you want to  capture  part,  but         matched.  This  construct  is useful when you want to capture part, but
4481         not all, of one of a number of alternatives. Inside a (?| group, paren-         not all, of one of a number of alternatives. Inside a (?| group, paren-
4482         theses are numbered as usual, but the number is reset at the  start  of         theses  are  numbered as usual, but the number is reset at the start of
4483         each  branch.  The numbers of any capturing parentheses that follow the         each branch. The numbers of any capturing parentheses that  follow  the
4484         subpattern start after the highest number used in any branch. The  fol-         subpattern  start after the highest number used in any branch. The fol-
4485         lowing example is taken from the Perl documentation. The numbers under-         lowing example is taken from the Perl documentation. The numbers under-
4486         neath show in which buffer the captured content will be stored.         neath show in which buffer the captured content will be stored.
4487    
# Line 4471  DUPLICATE SUBPATTERN NUMBERS Line 4489  DUPLICATE SUBPATTERN NUMBERS
4489           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
4490           # 1            2         2  3        2     3     4           # 1            2         2  3        2     3     4
4491    
4492         A back reference to a numbered subpattern uses the  most  recent  value         A  back  reference  to a numbered subpattern uses the most recent value
4493         that  is  set  for that number by any subpattern. The following pattern         that is set for that number by any subpattern.  The  following  pattern
4494         matches "abcabc" or "defdef":         matches "abcabc" or "defdef":
4495    
4496           /(?|(abc)|(def))\1/           /(?|(abc)|(def))\1/
4497    
4498         In contrast, a recursive or "subroutine" call to a numbered  subpattern         In  contrast,  a subroutine call to a numbered subpattern always refers
4499         always  refers  to  the first one in the pattern with the given number.         to the first one in the pattern with the given  number.  The  following
4500         The following pattern matches "abcabc" or "defabc":         pattern matches "abcabc" or "defabc":
4501    
4502           /(?|(abc)|(def))(?1)/           /(?|(abc)|(def))(?1)/
4503    
4504         If a condition test for a subpattern's having matched refers to a  non-         If  a condition test for a subpattern's having matched refers to a non-
4505         unique  number, the test is true if any of the subpatterns of that num-         unique number, the test is true if any of the subpatterns of that  num-
4506         ber have matched.         ber have matched.
4507    
4508         An alternative approach to using this "branch reset" feature is to  use         An  alternative approach to using this "branch reset" feature is to use
4509         duplicate named subpatterns, as described in the next section.         duplicate named subpatterns, as described in the next section.
4510    
4511    
4512  NAMED SUBPATTERNS  NAMED SUBPATTERNS
4513    
4514         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
4515         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
4516         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
4517         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
4518         patterns. This feature was not added to Perl until release 5.10. Python         patterns. This feature was not added to Perl until release 5.10. Python
4519         had the feature earlier, and PCRE introduced it at release  4.0,  using         had  the  feature earlier, and PCRE introduced it at release 4.0, using
4520         the  Python syntax. PCRE now supports both the Perl and the Python syn-         the Python syntax. PCRE now supports both the Perl and the Python  syn-
4521         tax. Perl allows identically numbered  subpatterns  to  have  different         tax.  Perl  allows  identically  numbered subpatterns to have different
4522         names, but PCRE does not.         names, but PCRE does not.
4523    
4524         In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
4525         or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
4526         to  capturing parentheses from other parts of the pattern, such as back         to capturing parentheses from other parts of the pattern, such as  back
4527         references, recursion, and conditions, can be made by name as  well  as         references,  recursion,  and conditions, can be made by name as well as
4528         by number.         by number.
4529    
4530         Names  consist  of  up  to  32 alphanumeric characters and underscores.         Names consist of up to  32  alphanumeric  characters  and  underscores.
4531         Named capturing parentheses are still  allocated  numbers  as  well  as         Named  capturing  parentheses  are  still  allocated numbers as well as
4532         names,  exactly as if the names were not present. The PCRE API provides         names, exactly as if the names were not present. The PCRE API  provides
4533         function calls for extracting the name-to-number translation table from         function calls for extracting the name-to-number translation table from
4534         a compiled pattern. There is also a convenience function for extracting         a compiled pattern. There is also a convenience function for extracting
4535         a captured substring by name.         a captured substring by name.
4536    
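As an illustration of names and numbers coexisting, here is a sketch using the Python syntax (?P<name>...), run under Python's `re` module as an assumed stand-in for PCRE. The pattern and the names "year" and "month" are made up for the example; `groupindex` is Python's own name-to-number mapping and only plays a role analogous to the table the PCRE API exposes.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2011-12")

# A named group can be read by name or by the number it was allocated.
print(m.group("year"), m.group(1))
print(m.group("month"), m.group(2))

# The name-to-number translation for the compiled pattern.
name_table = dict(date.groupindex)
print(name_table)
```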
4537         By default, a name must be unique within a pattern, but it is  possible         By  default, a name must be unique within a pattern, but it is possible
4538         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
4539         time. (Duplicate names are also always permitted for  subpatterns  with         time.  (Duplicate  names are also always permitted for subpatterns with
4540         the  same  number, set up as described in the previous section.) Dupli-         the same number, set up as described in the previous  section.)  Dupli-
4541         cate names can be useful for patterns where only one  instance  of  the         cate  names  can  be useful for patterns where only one instance of the
4542         named  parentheses  can  match. Suppose you want to match the name of a         named parentheses can match. Suppose you want to match the  name  of  a
4543         weekday, either as a 3-letter abbreviation or as the full name, and  in         weekday,  either as a 3-letter abbreviation or as the full name, and in
4544         both cases you want to extract the abbreviation. This pattern (ignoring         both cases you want to extract the abbreviation. This pattern (ignoring
4545         the line breaks) does the job:         the line breaks) does the job:
4546    
# Line 4532  NAMED SUBPATTERNS Line 4550  NAMED SUBPATTERNS
4550           (?<DN>Thu)(?:rsday)?|           (?<DN>Thu)(?:rsday)?|
4551           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
4552    
4553         There are five capturing substrings, but only one is ever set  after  a         There  are  five capturing substrings, but only one is ever set after a
4554         match.  (An alternative way of solving this problem is to use a "branch         match.  (An alternative way of solving this problem is to use a "branch
4555         reset" subpattern, as described in the previous section.)         reset" subpattern, as described in the previous section.)
4556    
4557         The convenience function for extracting the data by  name  returns  the         The  convenience  function  for extracting the data by name returns the
4558         substring  for  the first (and in this example, the only) subpattern of         substring for the first (and in this example, the only)  subpattern  of
4559         that name that matched. This saves searching  to  find  which  numbered         that  name  that  matched.  This saves searching to find which numbered
4560         subpattern it was.         subpattern it was.
4561    
4562         If  you  make  a  back  reference to a non-unique named subpattern from         If you make a back reference to  a  non-unique  named  subpattern  from
4563         elsewhere in the pattern, the one that corresponds to the first  occur-         elsewhere  in the pattern, the one that corresponds to the first occur-
4564         rence of the name is used. In the absence of duplicate numbers (see the         rence of the name is used. In the absence of duplicate numbers (see the
4565         previous section) this is the one with the lowest number. If you use  a         previous  section) this is the one with the lowest number. If you use a
4566         named  reference  in a condition test (see the section about conditions         named reference in a condition test (see the section  about  conditions
4567         below), either to check whether a subpattern has matched, or  to  check         below),  either  to check whether a subpattern has matched, or to check
4568         for  recursion,  all  subpatterns with the same name are tested. If the         for recursion, all subpatterns with the same name are  tested.  If  the
4569         condition is true for any one of them, the overall condition  is  true.         condition  is  true for any one of them, the overall condition is true.
4570         This is the same behaviour as testing by number. For further details of         This is the same behaviour as testing by number. For further details of
4571         the interfaces for handling named subpatterns, see the pcreapi documen-         the interfaces for handling named subpatterns, see the pcreapi documen-
4572         tation.         tation.
4573    
4574         Warning: You cannot use different names to distinguish between two sub-         Warning: You cannot use different names to distinguish between two sub-
4575         patterns with the same number because PCRE uses only the  numbers  when         patterns  with  the same number because PCRE uses only the numbers when
4576         matching. For this reason, an error is given at compile time if differ-         matching. For this reason, an error is given at compile time if differ-
4577         ent names are given to subpatterns with the same number.  However,  you         ent  names  are given to subpatterns with the same number. However, you
4578         can  give  the same name to subpatterns with the same number, even when         can give the same name to subpatterns with the same number,  even  when
4579         PCRE_DUPNAMES is not set.         PCRE_DUPNAMES is not set.
4580    
4581    
4582  REPETITION  REPETITION
4583    
4584         Repetition is specified by quantifiers, which can  follow  any  of  the         Repetition  is  specified  by  quantifiers, which can follow any of the
4585         following items:         following items:
4586    
4587           a literal data character           a literal data character
# Line 4575  REPETITION Line 4593  REPETITION
4593           a character class           a character class
4594           a back reference (see next section)           a back reference (see next section)
4595           a parenthesized subpattern (including assertions)           a parenthesized subpattern (including assertions)
4596           a recursive or "subroutine" call to a subpattern           a subroutine call to a subpattern (recursive or otherwise)
4597    
4598         The  general repetition quantifier specifies a minimum and maximum num-         The general repetition quantifier specifies a minimum and maximum  num-
4599         ber of permitted matches, by giving the two numbers in  curly  brackets         ber  of  permitted matches, by giving the two numbers in curly brackets
4600         (braces),  separated  by  a comma. The numbers must be less than 65536,         (braces), separated by a comma. The numbers must be  less  than  65536,
4601         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
4602    
4603           z{2,4}           z{2,4}
4604    
4605         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
4606         special  character.  If  the second number is omitted, but the comma is         special character. If the second number is omitted, but  the  comma  is
4607         present, there is no upper limit; if the second number  and  the  comma         present,  there  is  no upper limit; if the second number and the comma
4608         are  both omitted, the quantifier specifies an exact number of required         are both omitted, the quantifier specifies an exact number of  required
4609         matches. Thus         matches. Thus
4610    
4611           [aeiou]{3,}           [aeiou]{3,}
# Line 4596  REPETITION Line 4614  REPETITION
4614    
4615           \d{8}           \d{8}
4616    
4617         matches exactly 8 digits. An opening curly bracket that  appears  in  a         matches  exactly  8  digits. An opening curly bracket that appears in a
4618         position  where a quantifier is not allowed, or one that does not match         position where a quantifier is not allowed, or one that does not  match
4619         the syntax of a quantifier, is taken as a literal character. For  exam-         the  syntax of a quantifier, is taken as a literal character. For exam-
4620         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
4621    
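The three quantifier forms shown above can be tried out as follows. This is a sketch under the assumption that Python's `re` module is an acceptable stand-in for PCRE here. The {,6} case is deliberately left out, because Python reads it as a quantifier with a lower bound of zero, whereas PCRE (as described above) takes it literally.

```python
import re

# {2,4}: between two and four repetitions.
z = re.compile(r"z{2,4}")
z_results = [bool(z.fullmatch("z" * n)) for n in range(1, 6)]
print(z_results)

# {3,}: at least three, with no upper limit.
vowels = re.compile(r"[aeiou]{3,}")
print(bool(vowels.fullmatch("ae")), bool(vowels.fullmatch("aeiouaeiou")))

# {8}: exactly eight.
digits = re.compile(r"\d{8}")
print(bool(digits.fullmatch("2011122")), bool(digits.fullmatch("20111228")))
```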
4622         In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to         In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
4623         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
4624         acters, each of which is represented by a two-byte sequence. Similarly,         acters, each of which is represented by a two-byte sequence. Similarly,
4625         when Unicode property support is available, \X{3} matches three Unicode         when Unicode property support is available, \X{3} matches three Unicode
4626         extended  sequences,  each of which may be several bytes long (and they         extended sequences, each of which may be several bytes long  (and  they
4627         may be of different lengths).         may be of different lengths).
4628    
4629         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
4630         the previous item and the quantifier were not present. This may be use-         the previous item and the quantifier were not present. This may be use-
4631         ful for subpatterns that are referenced as subroutines  from  elsewhere         ful  for  subpatterns that are referenced as subroutines from elsewhere
4632         in the pattern (but see also the section entitled "Defining subpatterns         in the pattern (but see also the section entitled "Defining subpatterns
4633         for use by reference only" below). Items other  than  subpatterns  that         for  use  by  reference only" below). Items other than subpatterns that
4634         have a {0} quantifier are omitted from the compiled pattern.         have a {0} quantifier are omitted from the compiled pattern.
4635    
4636         For  convenience, the three most common quantifiers have single-charac-         For convenience, the three most common quantifiers have  single-charac-
4637         ter abbreviations:         ter abbreviations:
4638    
4639           *    is equivalent to {0,}           *    is equivalent to {0,}
4640           +    is equivalent to {1,}           +    is equivalent to {1,}
4641           ?    is equivalent to {0,1}           ?    is equivalent to {0,1}
4642    
4643         It is possible to construct infinite loops by  following  a  subpattern         It  is  possible  to construct infinite loops by following a subpattern
4644         that can match no characters with a quantifier that has no upper limit,         that can match no characters with a quantifier that has no upper limit,
4645         for example:         for example:
4646    
4647           (a?)*           (a?)*
4648    
4649         Earlier versions of Perl and PCRE used to give an error at compile time         Earlier versions of Perl and PCRE used to give an error at compile time
4650         for  such  patterns. However, because there are cases where this can be         for such patterns. However, because there are cases where this  can  be
4651         useful, such patterns are now accepted, but if any  repetition  of  the         useful,  such  patterns  are now accepted, but if any repetition of the
4652         subpattern  does in fact match no characters, the loop is forcibly bro-         subpattern does in fact match no characters, the loop is forcibly  bro-
4653         ken.         ken.
4654    
4655         By default, the quantifiers are "greedy", that is, they match  as  much         By  default,  the quantifiers are "greedy", that is, they match as much
4656         as  possible  (up  to  the  maximum number of permitted times), without         as possible (up to the maximum  number  of  permitted  times),  without
4657         causing the rest of the pattern to fail. The classic example  of  where         causing  the  rest of the pattern to fail. The classic example of where
4658         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
4659         appear between /* and */ and within the comment,  individual  *  and  /         appear  between  /*  and  */ and within the comment, individual * and /
4660         characters  may  appear. An attempt to match C comments by applying the         characters may appear. An attempt to match C comments by  applying  the
4661         pattern         pattern
4662    
4663           /\*.*\*/           /\*.*\*/
# Line 4648  REPETITION Line 4666  REPETITION
4666    
4667           /* first comment */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
4668    
4669         fails, because it matches the entire string owing to the greediness  of         fails,  because it matches the entire string owing to the greediness of
4670         the .*  item.         the .*  item.
4671    
4672         However,  if  a quantifier is followed by a question mark, it ceases to         However, if a quantifier is followed by a question mark, it  ceases  to
4673         be greedy, and instead matches the minimum number of times possible, so         be greedy, and instead matches the minimum number of times possible, so
4674         the pattern         the pattern
4675    
4676           /\*.*?\*/           /\*.*?\*/
4677    
4678         does  the  right  thing with the C comments. The meaning of the various         does the right thing with the C comments. The meaning  of  the  various
4679         quantifiers is not otherwise changed,  just  the  preferred  number  of         quantifiers  is  not  otherwise  changed,  just the preferred number of
4680         matches.   Do  not  confuse this use of question mark with its use as a         matches.  Do not confuse this use of question mark with its  use  as  a
4681         quantifier in its own right. Because it has two uses, it can  sometimes         quantifier  in its own right. Because it has two uses, it can sometimes
4682         appear doubled, as in         appear doubled, as in
4683    
4684           \d??\d           \d??\d
# Line 4668  REPETITION Line 4686  REPETITION
4686         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
4687         only way the rest of the pattern matches.         only way the rest of the pattern matches.
4688    
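The difference between greedy and minimizing repetition can be seen on the C comment example. The sketch assumes Python's `re` module, whose greedy and lazy quantifiers behave like PCRE's for these patterns; it does not use the PCRE API.

```python
import re

text = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string.
greedy = re.search(r"/\*.*\*/", text).group()

# Lazy .*? stops at the first */ that lets the pattern match.
lazy = re.search(r"/\*.*?\*/", text).group()
comments = re.findall(r"/\*.*?\*/", text)
print(greedy == text, lazy, comments)

# \d??\d prefers one digit, but takes two if the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
print(one_digit, two_digits)
```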
4689         If the PCRE_UNGREEDY option is set (an option that is not available  in         If  the PCRE_UNGREEDY option is set (an option that is not available in
4690         Perl),  the  quantifiers are not greedy by default, but individual ones         Perl), the quantifiers are not greedy by default, but  individual  ones
4691         can be made greedy by following them with a  question  mark.  In  other         can  be  made  greedy  by following them with a question mark. In other
4692         words, it inverts the default behaviour.         words, it inverts the default behaviour.
4693    
4694         When  a  parenthesized  subpattern  is quantified with a minimum repeat         When a parenthesized subpattern is quantified  with  a  minimum  repeat
4695         count that is greater than 1 or with a limited maximum, more memory  is         count  that is greater than 1 or with a limited maximum, more memory is
4696         required  for  the  compiled  pattern, in proportion to the size of the         required for the compiled pattern, in proportion to  the  size  of  the
4697         minimum or maximum.         minimum or maximum.
4698    
4699         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
4700         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,         alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
4701         the pattern is implicitly anchored, because whatever  follows  will  be         the  pattern  is  implicitly anchored, because whatever follows will be
4702         tried  against every character position in the subject string, so there         tried against every character position in the subject string, so  there
4703         is no point in retrying the overall match at  any  position  after  the         is  no  point  in  retrying the overall match at any position after the
4704         first.  PCRE  normally treats such a pattern as though it were preceded         first. PCRE normally treats such a pattern as though it  were  preceded
4705         by \A.         by \A.
4706    
4707         In cases where it is known that the subject  string  contains  no  new-         In  cases  where  it  is known that the subject string contains no new-
4708         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
4709         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
4710    
4711         However, there is one situation where the optimization cannot be  used.         However,  there is one situation where the optimization cannot be used.
4712         When .*  is inside capturing parentheses that are the subject of a back         When .*  is inside capturing parentheses that are the subject of a back
4713         reference elsewhere in the pattern, a match at the start may fail where         reference elsewhere in the pattern, a match at the start may fail where
4714         a later one succeeds. Consider, for example:         a later one succeeds. Consider, for example:
4715    
4716           (.*)abc\1           (.*)abc\1
4717    
4718         If  the subject is "xyz123abc123" the match point is the fourth charac-         If the subject is "xyz123abc123" the match point is the fourth  charac-
4719         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
4720    
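The back reference case can be confirmed with a short sketch, assuming Python's `re` module in place of PCRE. It shows only where the match is found; it says nothing about the anchoring optimization itself, which is internal to PCRE.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# No match is possible at offsets 0, 1 or 2, because the text after "abc"
# must repeat whatever (.*) captured. The match starts at offset 3, the
# fourth character.
print(m.start(), m.group(1), m.group(0))
```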
4721         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 4706  REPETITION Line 4724  REPETITION
4724           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
4725    
4726         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
4727         is "tweedledee". However, if there are  nested  capturing  subpatterns,         is  "tweedledee".  However,  if there are nested capturing subpatterns,
4728         the  corresponding captured values may have been set in previous itera-         the corresponding captured values may have been set in previous  itera-
4729         tions. For example, after         tions. For example, after
4730    
4731           /(a|(b))+/           /(a|(b))+/
# Line 4717  REPETITION Line 4735  REPETITION
4735    
4736  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
4737    
4738         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")         With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
4739         repetition,  failure  of what follows normally causes the repeated item         repetition, failure of what follows normally causes the  repeated  item
4740         to be re-evaluated to see if a different number of repeats  allows  the         to  be  re-evaluated to see if a different number of repeats allows the
4741         rest  of  the pattern to match. Sometimes it is useful to prevent this,         rest of the pattern to match. Sometimes it is useful to  prevent  this,
4742         either to change the nature of the match, or to cause it to fail earlier         either to change the nature of the match, or to cause it to fail earlier
4743         than  it otherwise might, when the author of the pattern knows there is         than it otherwise might, when the author of the pattern knows there  is
4744         no point in carrying on.         no point in carrying on.
4745    
4746         Consider, for example, the pattern \d+foo when applied to  the  subject         Consider,  for  example, the pattern \d+foo when applied to the subject
4747         line         line
4748    
4749           123456bar           123456bar
4750    
4751         After matching all 6 digits and then failing to match "foo", the normal         After matching all 6 digits and then failing to match "foo", the normal
4752         action of the matcher is to try again with only 5 digits  matching  the         action  of  the matcher is to try again with only 5 digits matching the
4753         \d+  item,  and  then  with  4,  and  so on, before ultimately failing.         \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
4754         "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides         "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
4755         the  means for specifying that once a subpattern has matched, it is not         the means for specifying that once a subpattern has matched, it is  not
4756         to be re-evaluated in this way.         to be re-evaluated in this way.
4757    
4758         If we use atomic grouping for the previous example, the  matcher  gives         If  we  use atomic grouping for the previous example, the matcher gives
4759         up  immediately  on failing to match "foo" the first time. The notation         up immediately on failing to match "foo" the first time.  The  notation
4760         is a kind of special parenthesis, starting with (?> as in this example:         is a kind of special parenthesis, starting with (?> as in this example:
4761    
4762           (?>\d+)foo           (?>\d+)foo
4763    
4764         This kind of parenthesis "locks up" the  part of the  pattern  it  con-         This  kind  of  parenthesis "locks up" the  part of the pattern it con-
4765         tains  once  it  has matched, and a failure further into the pattern is         tains once it has matched, and a failure further into  the  pattern  is
4766         prevented from backtracking into it. Backtracking past it  to  previous         prevented  from  backtracking into it. Backtracking past it to previous
4767         items, however, works as normal.         items, however, works as normal.
4768    
4769         An  alternative  description  is that a subpattern of this type matches         An alternative description is that a subpattern of  this  type  matches
4770         the string of characters that an  identical  standalone  pattern  would         the  string  of  characters  that an identical standalone pattern would
4771         match, if anchored at the current point in the subject string.         match, if anchored at the current point in the subject string.
4772    
4773         Atomic grouping subpatterns are not capturing subpatterns. Simple cases         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
4774         such as the above example can be thought of as a maximizing repeat that         such as the above example can be thought of as a maximizing repeat that
4775         must  swallow  everything  it can. So, while both \d+ and \d+? are pre-         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
4776         pared to adjust the number of digits they match in order  to  make  the         pared  to  adjust  the number of digits they match in order to make the
4777         rest of the pattern match, (?>\d+) can only match an entire sequence of         rest of the pattern match, (?>\d+) can only match an entire sequence of
4778         digits.         digits.
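
The difference between an ordinary repeat and an atomic one can be observed from a script. The sketch below uses Python's re module rather than PCRE, and it does not use the (?> syntax: instead it emulates an atomic group with a lookahead that captures the digit run followed by a back reference that consumes it. That emulation is a substitute technique chosen for portability, and the group name "run" is an assumption made for the example.

```python
import re

# Ordinary greedy repeat: \d+ can give back digits so that "6" matches.
plain = re.compile(r"\d+6")

# Emulated atomic group: the lookahead captures the whole digit run, the
# back reference consumes exactly that run, and nothing can be given back.
atomic = re.compile(r"(?=(?P<run>\d+))(?P=run)6")

plain_hit = plain.search("123456")
atomic_hit = atomic.search("123456")

# When the rest of the pattern really does follow the full run of digits,
# the atomic form still matches.
atomic_foo = re.compile(r"(?=(?P<run>\d+))(?P=run)foo")
foo_hit = atomic_foo.search("123foo")
bar_hit = atomic_foo.search("123456bar")

print(plain_hit, atomic_hit, foo_hit, bar_hit)
```

The first pattern succeeds because the repeat releases the final digit; the emulated atomic version cannot release it and therefore fails at every starting position.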
4779    
4780         Atomic groups in general can of course contain arbitrarily  complicated         Atomic  groups in general can of course contain arbitrarily complicated
4781         subpatterns,  and  can  be  nested. However, when the subpattern for an         subpatterns, and can be nested. However, when  the  subpattern  for  an
4782         atomic group is just a single repeated item, as in the example above, a         atomic group is just a single repeated item, as in the example above, a
4783         simpler notation, called a "possessive quantifier", can be used. This         simpler notation, called a "possessive quantifier", can be used. This
4784         consists of an additional + character  following  a  quantifier.  Using         consists  of  an  additional  + character following a quantifier. Using
4785         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
4786    
4787           \d++foo           \d++foo
# Line 4773  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 4791  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
4791    
4792           (abc|xyz){2,3}+           (abc|xyz){2,3}+
4793    
4794         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive   quantifiers   are   always  greedy;  the  setting  of  the
4795         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
4796         simpler forms of atomic group. However, there is no difference  in  the         simpler  forms  of atomic group. However, there is no difference in the
4797         meaning  of  a  possessive  quantifier and the equivalent atomic group,         meaning of a possessive quantifier and  the  equivalent  atomic  group,
4798         though there may be a performance  difference;  possessive  quantifiers         though  there  may  be a performance difference; possessive quantifiers
4799         should be slightly faster.         should be slightly faster.
4800    
4801         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-         The possessive quantifier syntax is an extension to the Perl  5.8  syn-
4802         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
4803         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
4804         built Sun's Java package, and PCRE copied it from there. It  ultimately         built  Sun's Java package, and PCRE copied it from there. It ultimately
4805         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
4806    
4807         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
4808         ple pattern constructs. For example, the sequence  A+B  is  treated  as         ple  pattern  constructs.  For  example, the sequence A+B is treated as
4809         A++B  because  there is no point in backtracking into a sequence of A's         A++B because there is no point in backtracking into a sequence  of  A's
4810         when B must follow.         when B must follow.
4811    
4812         When a pattern contains an unlimited repeat inside  a  subpattern  that         When  a  pattern  contains an unlimited repeat inside a subpattern that
4813         can  itself  be  repeated  an  unlimited number of times, the use of an         can itself be repeated an unlimited number of  times,  the  use  of  an
4814         atomic group is the only way to avoid some  failing  matches  taking  a         atomic  group  is  the  only way to avoid some failing matches taking a
4815         very long time indeed. The pattern         very long time indeed. The pattern
4816    
4817           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
4818    
4819         matches  an  unlimited number of substrings that either consist of non-         matches an unlimited number of substrings that either consist  of  non-
4820         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
4821         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
4822    
4823           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
4824    
4825         it  takes  a  long  time  before reporting failure. This is because the         it takes a long time before reporting  failure.  This  is  because  the
4826         string can be divided between the internal \D+ repeat and the  external         string  can be divided between the internal \D+ repeat and the external
4827         *  repeat  in  a  large  number of ways, and all have to be tried. (The         * repeat in a large number of ways, and all  have  to  be  tried.  (The
4828         example uses [!?] rather than a single character at  the  end,  because         example  uses  [!?]  rather than a single character at the end, because
4829         both  PCRE  and  Perl have an optimization that allows for fast failure         both PCRE and Perl have an optimization that allows  for  fast  failure
4830         when a single character is used. They remember the last single  charac-         when  a single character is used. They remember the last single charac-
4831         ter  that  is required for a match, and fail early if it is not present         ter that is required for a match, and fail early if it is  not  present
4832         in the string.) If the pattern is changed so that  it  uses  an  atomic         in  the  string.)  If  the pattern is changed so that it uses an atomic
4833         group, like this:         group, like this:
4834    
4835           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
# Line 4823  BACK REFERENCES Line 4841  BACK REFERENCES
4841    
4842         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
4843         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
4844         pattern  earlier  (that is, to its left) in the pattern, provided there         pattern earlier (that is, to its left) in the pattern,  provided  there
4845         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
4846    
4847         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
4848         it  is  always  taken  as a back reference, and causes an error only if         it is always taken as a back reference, and causes  an  error  only  if
4849         there are not that many capturing left parentheses in the  entire  pat-         there  are  not that many capturing left parentheses in the entire pat-
4850         tern.  In  other words, the parentheses that are referenced need not be         tern. In other words, the parentheses that are referenced need  not  be
4851         to the left of the reference for numbers less than 10. A "forward  back         to  the left of the reference for numbers less than 10. A "forward back
4852         reference"  of  this  type can make sense when a repetition is involved         reference" of this type can make sense when a  repetition  is  involved
4853         and the subpattern to the right has participated in an  earlier  itera-         and  the  subpattern to the right has participated in an earlier itera-
4854         tion.         tion.
4855    
4856         It  is  not  possible to have a numerical "forward back reference" to a         It is not possible to have a numerical "forward back  reference"  to  a
4857         subpattern whose number is 10 or  more  using  this  syntax  because  a         subpattern  whose  number  is  10  or  more using this syntax because a
4858         sequence  such  as  \50 is interpreted as a character defined in octal.         sequence such as \50 is interpreted as a character  defined  in  octal.
4859         See the subsection entitled "Non-printing characters" above for further         See the subsection entitled "Non-printing characters" above for further
4860         details  of  the  handling of digits following a backslash. There is no         details of the handling of digits following a backslash.  There  is  no
4861         such problem when named parentheses are used. A back reference  to  any         such  problem  when named parentheses are used. A back reference to any
4862         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
4863    
4864         Another  way  of  avoiding  the ambiguity inherent in the use of digits         Another way of avoiding the ambiguity inherent in  the  use  of  digits
4865         following a backslash is to use the \g  escape  sequence.  This  escape         following  a  backslash  is  to use the \g escape sequence. This escape
4866         must be followed by an unsigned number or a negative number, optionally         must be followed by an unsigned number or a negative number, optionally
4867         enclosed in braces. These examples are all identical:         enclosed in braces. These examples are all identical:
4868    
# Line 4852  BACK REFERENCES Line 4870  BACK REFERENCES
4870           (ring), \g1           (ring), \g1
4871           (ring), \g{1}           (ring), \g{1}
4872    
4873         An unsigned number specifies an absolute reference without the  ambigu-         An  unsigned number specifies an absolute reference without the ambigu-
4874         ity that is present in the older syntax. It is also useful when literal         ity that is present in the older syntax. It is also useful when literal
4875         digits follow the reference. A negative number is a relative reference.         digits follow the reference. A negative number is a relative reference.
4876         Consider this example:         Consider this example:
# Line 4861  BACK REFERENCES Line 4879  BACK REFERENCES
4879    
4880         The sequence \g{-1} is a reference to the most recently started         The sequence \g{-1} is a reference to the most recently started
4881         capturing subpattern before \g, that is, it is equivalent to \2 in this         capturing subpattern before \g, that is, it is equivalent to \2 in this
4882         example. Similarly, \g{-2} would be equivalent to \1. The use of relative         example. Similarly, \g{-2} would be equivalent to \1. The use of relative
4883         references can be helpful in long patterns, and also in  patterns  that         references  can  be helpful in long patterns, and also in patterns that
4884         are  created  by  joining  together  fragments  that contain references         are created by  joining  together  fragments  that  contain  references
4885         within themselves.         within themselves.
4886    
4887         A back reference matches whatever actually matched the  capturing  sub-         A  back  reference matches whatever actually matched the capturing sub-
4888         pattern  in  the  current subject string, rather than anything matching         pattern in the current subject string, rather  than  anything  matching
4889         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
4890         of doing that). So the pattern         of doing that). So the pattern
4891    
4892           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4893    
4894         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
4895         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
4896         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
4897         ple,         ple,
4898    
4899           ((?i)rah)\s+\1           ((?i)rah)\s+\1
4900    
4901         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
4902         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
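
Both examples above can be tried out as follows. This is a sketch using Python's re module, not PCRE itself. Python does not accept an option setting such as (?i) in the middle of a pattern in the same way, so the second pattern is written with the scoped form (?i:rah); treating that as equivalent to the PCRE pattern is an assumption of this example.

```python
import re

# The back reference repeats the text that was matched, not the subpattern.
pair = re.compile(r"(sens|respons)e and \1ibility")
same_1 = pair.fullmatch("sense and sensibility")
same_2 = pair.fullmatch("response and responsibility")
mixed = pair.fullmatch("sense and responsibility")

# The group is matched caselessly, but the back reference sits outside the
# scoped option, so it is compared casefully against the captured text.
rah = re.compile(r"((?i:rah))\s+\1")
lower = rah.fullmatch("rah rah")
upper = rah.fullmatch("RAH RAH")
differ = rah.fullmatch("RAH rah")

print(bool(same_1), bool(same_2), bool(mixed))
print(bool(lower), bool(upper), bool(differ))
```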
4903    
4904         There  are  several  different ways of writing back references to named         There are several different ways of writing back  references  to  named
4905         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
4906         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
4907         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
4908         and  named  references,  is  also supported. We could rewrite the above         and named references, is also supported. We  could  rewrite  the  above
4909         example in any of the following ways:         example in any of the following ways:
4910    
4911           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 4895  BACK REFERENCES Line 4913  BACK REFERENCES
4913           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4914           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
4915    
4916         A subpattern that is referenced by  name  may  appear  in  the  pattern         A  subpattern  that  is  referenced  by  name may appear in the pattern
4917         before or after the reference.         before or after the reference.
4918    
4919         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
4920         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
4921         references to it always fail by default. For example, the pattern         references to it always fail by default. For example, the pattern
4922    
4923           (a|(bc))\2           (a|(bc))\2
4924    
4925         always  fails  if  it starts to match "a" rather than "bc". However, if         always fails if it starts to match "a" rather than  "bc".  However,  if
4926         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
4927         ence to an unset value matches an empty string.         ence to an unset value matches an empty string.
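
The default behaviour for an unset group can be illustrated with the pattern above. The sketch uses Python's re module, which also makes a back reference to an unset group fail; it therefore mirrors the PCRE default only, not the PCRE_JAVASCRIPT_COMPAT behaviour.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# The first alternative matches "a", so group 2 is never set and the
# back reference \2 fails.
took_a = pattern.search("a")

# The second alternative sets group 2 to "bc", which \2 then repeats.
took_bc = pattern.fullmatch("bcbc")
group_two = took_bc.group(2)

print(took_a, group_two)
```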
4928    
4929         Because  there may be many capturing parentheses in a pattern, all dig-         Because there may be many capturing parentheses in a pattern, all  dig-
4930         its following a backslash are taken as part of a potential back  refer-         its  following a backslash are taken as part of a potential back refer-
4931         ence  number.   If  the  pattern continues with a digit character, some         ence number.  If the pattern continues with  a  digit  character,  some
4932         delimiter must  be  used  to  terminate  the  back  reference.  If  the         delimiter  must  be  used  to  terminate  the  back  reference.  If the
4933         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
4934         syntax or an empty comment (see "Comments" below) can be used.         syntax or an empty comment (see "Comments" below) can be used.
4935    
4936     Recursive back references     Recursive back references
4937    
4938         A back reference that occurs inside the parentheses to which it  refers         A  back reference that occurs inside the parentheses to which it refers
4939         fails  when  the subpattern is first used, so, for example, (a\1) never         fails when the subpattern is first used, so, for example,  (a\1)  never
4940         matches.  However, such references can be useful inside  repeated  sub-         matches.   However,  such references can be useful inside repeated sub-
4941         patterns. For example, the pattern         patterns. For example, the pattern
4942    
4943           (a|b\1)+           (a|b\1)+
4944    
4945         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4946         ation of the subpattern,  the  back  reference  matches  the  character         ation  of  the  subpattern,  the  back  reference matches the character
4947         string  corresponding  to  the previous iteration. In order for this to         string corresponding to the previous iteration. In order  for  this  to
4948         work, the pattern must be such that the first iteration does  not  need         work,  the  pattern must be such that the first iteration does not need
4949         to  match the back reference. This can be done using alternation, as in         to match the back reference. This can be done using alternation, as  in
4950         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4951    
4952         Back references of this type cause the group that they reference to  be         Back  references of this type cause the group that they reference to be
4953         treated  as  an atomic group.  Once the whole group has been matched, a         treated as an atomic group.  Once the whole group has been  matched,  a
4954         subsequent matching failure cannot cause backtracking into  the  middle         subsequent  matching  failure cannot cause backtracking into the middle
4955         of the group.         of the group.
4956    
4957    
4958  ASSERTIONS  ASSERTIONS
4959    
4960         An  assertion  is  a  test on the characters following or preceding the         An assertion is a test on the characters  following  or  preceding  the
4961         current matching point that does not actually consume  any  characters.         current  matching  point that does not actually consume any characters.
4962         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
4963         described above.         described above.
4964    
4965         More complicated assertions are coded as  subpatterns.  There  are  two         More  complicated  assertions  are  coded as subpatterns. There are two
4966         kinds:  those  that  look  ahead of the current position in the subject         kinds: those that look ahead of the current  position  in  the  subject
4967         string, and those that look  behind  it.  An  assertion  subpattern  is         string,  and  those  that  look  behind  it. An assertion subpattern is
4968         matched  in  the  normal way, except that it does not cause the current         matched in the normal way, except that it does not  cause  the  current
4969         matching position to be changed.         matching position to be changed.
4970    
4971         Assertion subpatterns are not capturing subpatterns. If such an  asser-         Assertion  subpatterns are not capturing subpatterns. If such an asser-
4972         tion  contains  capturing  subpatterns within it, these are counted for         tion contains capturing subpatterns within it, these  are  counted  for
4973         the purposes of numbering the capturing subpatterns in the  whole  pat-         the  purposes  of numbering the capturing subpatterns in the whole pat-
4974         tern.  However,  substring  capturing  is carried out only for positive         tern. However, substring capturing is carried  out  only  for  positive
4975         assertions, because it does not make sense for negative assertions.         assertions, because it does not make sense for negative assertions.
4976    
4977         For compatibility with Perl, assertion  subpatterns  may  be  repeated;         For  compatibility  with  Perl,  assertion subpatterns may be repeated;
4978         though  it  makes  no sense to assert the same thing several times, the         though it makes no sense to assert the same thing  several  times,  the
4979         side effect of capturing parentheses may  occasionally  be  useful.  In         side  effect  of  capturing  parentheses may occasionally be useful. In
4980         practice, there are only three cases:         practice, there are only three cases:
4981    
4982         (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during         (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during
4983         matching.  However, it may  contain  internal  capturing  parenthesized         matching.   However,  it  may  contain internal capturing parenthesized
4984         groups that are called from elsewhere via the subroutine mechanism.         groups that are called from elsewhere via the subroutine mechanism.
4985    
4986         (2) If the quantifier is {0,n} where n is greater than zero, it is treated         (2) If the quantifier is {0,n} where n is greater than zero, it is treated
4987         as if it were {0,1}. At run time, the rest  of  the  pattern  match  is         as  if  it  were  {0,1}.  At run time, the rest of the pattern match is
4988         tried with and without the assertion, the order depending on the greed-         tried with and without the assertion, the order depending on the greed-
4989         iness of the quantifier.         iness of the quantifier.
4990    
4991         (3) If the minimum repetition is greater than zero, the  quantifier  is         (3)  If  the minimum repetition is greater than zero, the quantifier is
4992         ignored.   The  assertion  is  obeyed just once when encountered during         ignored.  The assertion is obeyed just  once  when  encountered  during
4993         matching.         matching.
4994    
4995     Lookahead assertions     Lookahead assertions
# Line 4981  ASSERTIONS Line 4999  ASSERTIONS
4999    
5000           \w+(?=;)           \w+(?=;)
5001    
5002         matches  a word followed by a semicolon, but does not include the semi-         matches a word followed by a semicolon, but does not include the  semi-
5003         colon in the match, and         colon in the match, and
5004    
5005           foo(?!bar)           foo(?!bar)
5006    
5007         matches any occurrence of "foo" that is not  followed  by  "bar".  Note         matches  any  occurrence  of  "foo" that is not followed by "bar". Note
5008         that the apparently similar pattern         that the apparently similar pattern
5009    
5010           (?!foo)bar           (?!foo)bar
5011    
5012         does  not  find  an  occurrence  of "bar" that is preceded by something         does not find an occurrence of "bar"  that  is  preceded  by  something
5013         other than "foo"; it finds any occurrence of "bar" whatsoever,  because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
5014         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
5015         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
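
The three lookahead patterns above, and the lookbehind form that gives the "not preceded by" effect, can be compared side by side. This is a sketch using Python's re module, on the assumption that it handles these simple assertions as PCRE does; the subject strings are made up for the example.

```python
import re

# Positive lookahead: the semicolon must follow but is not part of the match.
word = re.search(r"\w+(?=;)", "word;").group()

# Negative lookahead: "foo" not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks ahead from the position where "bar" begins, so it never
# sees a preceding "foo" and finds every "bar".
every_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bazbar")]

# A negative lookbehind excludes a "bar" that is preceded by "foo".
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

print(word, foo_starts, every_bar, not_after_foo)
```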
5016    
5017         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
5018         most  convenient  way  to  do  it  is with (?!) because an empty string         most convenient way to do it is  with  (?!)  because  an  empty  string
5019         always matches, so an assertion that requires there not to be an  empty         always  matches, so an assertion that requires there not to be an empty
5020         string must always fail.  The backtracking control verb (*FAIL) or (*F)         string must always fail.  The backtracking control verb (*FAIL) or (*F)
5021         is a synonym for (?!).         is a synonym for (?!).
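
A forced failure of this kind can be demonstrated briefly. The sketch uses Python's re module, which accepts (?!) but has no (*FAIL) verb, so only the assertion form is shown; the subject strings are arbitrary.

```python
import re

# An empty negative lookahead can never succeed, whatever the subject is.
always_fail = re.compile(r"(?!)")
results = [always_fail.search(s) for s in ("", "abc", "123")]

# Placed at the end of a branch, it forces that branch to fail, so the
# matcher has to fall back on the other alternative.
chosen = re.search(r"a(?!)|b", "ab").group()

print(results, chosen)
```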
5022    
5023     Lookbehind assertions     Lookbehind assertions
5024    
5025         Lookbehind assertions start with (?<= for positive assertions and  (?<!         Lookbehind  assertions start with (?<= for positive assertions and (?<!
5026         for negative assertions. For example,         for negative assertions. For example,
5027    
5028           (?<!foo)bar           (?<!foo)bar
5029    
5030         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does find an occurrence of "bar" that is not  preceded  by  "foo".  The
5031         contents of a lookbehind assertion are restricted  such  that  all  the         contents  of  a  lookbehind  assertion are restricted such that all the
5032         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
5033         eral top-level alternatives, they do not all  have  to  have  the  same         eral  top-level  alternatives,  they  do  not all have to have the same
5034         fixed length. Thus         fixed length. Thus
5035    
5036           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 5021  ASSERTIONS Line 5039  ASSERTIONS
5039    
5040           (?<!dogs?|cats?)           (?<!dogs?|cats?)
5041    
5042         causes  an  error at compile time. Branches that match different length         causes an error at compile time. Branches that match  different  length
5043         strings are permitted only at the top level of a lookbehind  assertion.         strings  are permitted only at the top level of a lookbehind assertion.
5044         This is an extension compared with Perl, which requires all branches to         This is an extension compared with Perl, which requires all branches to
5045         match the same length of string. An assertion such as         match the same length of string. An assertion such as
5046    
5047           (?<=ab(c|de))           (?<=ab(c|de))
5048    
5049         is not permitted, because its single top-level  branch  can  match  two         is  not  permitted,  because  its single top-level branch can match two
5050         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
5051         top-level branches:         top-level branches:
5052    
5053           (?<=abc|abde)           (?<=abc|abde)
5054    
5055         In some cases, the escape sequence \K (see above) can be  used  instead         In  some  cases, the escape sequence \K (see above) can be used instead
5056         of a lookbehind assertion to get round the fixed-length restriction.         of a lookbehind assertion to get round the fixed-length restriction.
5057    
5058         The  implementation  of lookbehind assertions is, for each alternative,         The implementation of lookbehind assertions is, for  each  alternative,
5059         to temporarily move the current position back by the fixed  length  and         to  temporarily  move the current position back by the fixed length and
5060         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
5061         rent position, the assertion fails.         rent position, the assertion fails.
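
The lookbehind behaviour described above can be tried out with a short sketch. It uses Python's standard re module rather than PCRE itself, on the assumption that the two agree for these simple fixed-length cases. The helper find_start and the compiled pattern names are illustrative only and do not appear in the PCRE documentation.

```python
import re

# Negative lookbehind: "bar" that is not preceded by "foo".
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")

# Positive lookbehind: "d" that is preceded by the three characters "abc".
AFTER_ABC = re.compile(r"(?<=abc)d")


def find_start(pattern, subject):
    """Return the offset of the first match, or None if there is none."""
    match = pattern.search(subject)
    return match.start() if match else None


if __name__ == "__main__":
    for subject in ("foobar", "bazbar", "bar", "foobar bar"):
        print(subject, find_start(NOT_AFTER_FOO, subject))
    for subject in ("abcd", "bcd"):
        print(subject, find_start(AFTER_ABC, subject))
```

In "bcd" only two characters precede the "d", so the positive lookbehind cannot step back three characters and the assertion fails. In "bar" the same shortage makes the inner test fail, which means the negative assertion succeeds.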
5062    
5063         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
5064         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode) to appear in lookbehind assertions, because it makes it  impossi-
5065         ble to calculate the length of the lookbehind. The \X and  \R  escapes,         ble  to  calculate the length of the lookbehind. The \X and \R escapes,
5066         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
5067    
5068         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
5069         lookbehinds, as long as the subpattern matches a  fixed-length  string.         lookbehinds,  as  long as the subpattern matches a fixed-length string.
5070         Recursion, however, is not supported.         Recursion, however, is not supported.
5071    
5072         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
5073         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
5074         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
5075    
5076           abcd$           abcd$
5077    
5078         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
5079         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
5080         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
5081         pattern is specified as         pattern is specified as
5082    
5083           ^.*abcd$           ^.*abcd$
5084    
5085         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
5086         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
5087         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
5088         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
5089         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
5090    
5091           ^.*+(?<=abcd)           ^.*+(?<=abcd)
5092    
5093         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
5094         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
5095         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
5096         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
5097         processing time.         processing time.
5098    
5099     Using multiple assertions     Using multiple assertions
# Line 5084  ASSERTIONS Line 5102  ASSERTIONS
5102    
5103           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
5104    
5105         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
5106         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
5107         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
5108         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
5109         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
5110         ceded  by  six  characters,  the first of which are digits and the last         ceded by six characters, the first of which are  digits  and  the  last
5111         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
5112         foo". A pattern to do that is         foo". A pattern to do that is
5113    
5114           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
5115    
5116         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
5117         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
5118         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
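
As a rough check of the two patterns just discussed, the following sketch runs them through Python's re module, which is assumed here to treat these fixed-length lookbehinds the same way as PCRE. The names THREE_DIGITS_NOT_999, DIGITS_THEN_THREE_NOT_999 and matches are made up for the example.

```python
import re

# Both assertions look at the same three characters before "foo".
THREE_DIGITS_NOT_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
DIGITS_THEN_THREE_NOT_999 = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(
            subject,
            matches(THREE_DIGITS_NOT_999, subject),
            matches(DIGITS_THEN_THREE_NOT_999, subject),
        )
```

The subject "123abcfoo" is rejected by the first pattern, because the three characters before "foo" are not digits, but accepted by the second.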
5119    
# Line 5103  ASSERTIONS Line 5121  ASSERTIONS
5121    
5122           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
5123    
5124         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
5125         is not preceded by "foo", while         is not preceded by "foo", while
5126    
5127           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
5128    
5129         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
5130         three characters that are not "999".         three characters that are not "999".
5131    
5132    
5133  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
5134    
5135         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
5136         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
5137         on  the result of an assertion, or whether a specific capturing subpat-         on the result of an assertion, or whether a specific capturing  subpat-
5138         tern has already been matched. The two possible  forms  of  conditional         tern  has  already  been matched. The two possible forms of conditional
5139         subpattern are:         subpattern are:
5140    
5141           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5142           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5143    
5144         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
5145         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
5146         tives  in  the subpattern, a compile-time error occurs. Each of the two         tives in the subpattern, a compile-time error occurs. Each of  the  two
5147         alternatives may itself contain nested subpatterns of any form, includ-         alternatives may itself contain nested subpatterns of any form, includ-
5148         ing  conditional  subpatterns;  the  restriction  to  two  alternatives         ing  conditional  subpatterns;  the  restriction  to  two  alternatives
5149         applies only at the level of the condition. This pattern fragment is an         applies only at the level of the condition. This pattern fragment is an
# Line 5134  CONDITIONAL SUBPATTERNS Line 5152  CONDITIONAL SUBPATTERNS
5152           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
5153    
5154    
5155         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
5156         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
5157    
5158     Checking for a used subpattern by number     Checking for a used subpattern by number
5159    
5160         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
5161         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
5162         viously matched. If there is more than one  capturing  subpattern  with         viously  matched.  If  there is more than one capturing subpattern with
5163         the  same  number  (see  the earlier section about duplicate subpattern         the same number (see the earlier  section  about  duplicate  subpattern
5164         numbers), the condition is true if any of them have matched. An  alter-         numbers),  the condition is true if any of them have matched. An alter-
5165         native  notation is to precede the digits with a plus or minus sign. In         native notation is to precede the digits with a plus or minus sign.  In
5166         this case, the subpattern number is relative rather than absolute.  The         this  case, the subpattern number is relative rather than absolute. The
5167         most  recently opened parentheses can be referenced by (?(-1), the next         most recently opened parentheses can be referenced by (?(-1), the  next
5168         most recent by (?(-2), and so on. Inside loops it can also  make  sense         most  recent  by (?(-2), and so on. Inside loops it can also make sense
5169         to refer to subsequent groups. The next parentheses to be opened can be         to refer to subsequent groups. The next parentheses to be opened can be
5170         referenced as (?(+1), and so on. (The value zero in any of these  forms         referenced  as (?(+1), and so on. (The value zero in any of these forms
5171         is not used; it provokes a compile-time error.)         is not used; it provokes a compile-time error.)
5172    
5173         Consider  the  following  pattern, which contains non-significant white         Consider the following pattern, which  contains  non-significant  white
5174         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
5175         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
5176    
5177           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
5178    
5179         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
5180         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
5181         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
5182         third part is a conditional subpattern that tests whether  or  not  the         third  part  is  a conditional subpattern that tests whether or not the
5183         first  set  of  parentheses  matched.  If they did, that is, if subject         first set of parentheses matched. If they  did,  that  is,  if  subject
5184         started with an opening parenthesis, the condition is true, and so  the         started  with an opening parenthesis, the condition is true, and so the
5185         yes-pattern  is  executed and a closing parenthesis is required. Other-         yes-pattern is executed and a closing parenthesis is  required.  Other-
5186         wise, since no-pattern is not present, the subpattern matches  nothing.         wise,  since no-pattern is not present, the subpattern matches nothing.
5187         In  other  words,  this  pattern matches a sequence of non-parentheses,         In other words, this pattern matches  a  sequence  of  non-parentheses,
5188         optionally enclosed in parentheses.         optionally enclosed in parentheses.
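
The conditional pattern above can be exercised with the sketch below. It uses Python's re module, which accepts the same (?(1)...) syntax for a numbered condition, and re.VERBOSE stands in for PCRE_EXTENDED. The relative form (?(-1)...) is not assumed to be available there. The names OPTIONALLY_PARENTHESIZED and is_whole_match are illustrative.

```python
import re

# Part 1: optional "(" captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: require ")" only if group 1 took part in the match.
OPTIONALLY_PARENTHESIZED = re.compile(
    r"( \( )?    [^()]+    (?(1) \) )",
    re.VERBOSE,
)


def is_whole_match(subject):
    """True if the entire subject is matched by the pattern."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("(abc)", "abc", "(abc", "abc)"):
        print(repr(subject), is_whole_match(subject))
```

Matching against the whole subject makes the effect of the condition visible: an unbalanced opening or closing parenthesis causes the match to fail.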
5189    
5190         If you were embedding this pattern in a larger one,  you  could  use  a         If  you  were  embedding  this pattern in a larger one, you could use a
5191         relative reference:         relative reference:
5192    
5193           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5194    
5195         This  makes  the  fragment independent of the parentheses in the larger         This makes the fragment independent of the parentheses  in  the  larger
5196         pattern.         pattern.
5197    
5198     Checking for a used subpattern by name     Checking for a used subpattern by name
5199    
5200         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
5201         used  subpattern  by  name.  For compatibility with earlier versions of         used subpattern by name. For compatibility  with  earlier  versions  of
5202         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
5203         also  recognized. However, there is a possible ambiguity with this syn-         also recognized. However, there is a possible ambiguity with this  syn-
5204         tax, because subpattern names may  consist  entirely  of  digits.  PCRE         tax,  because  subpattern  names  may  consist entirely of digits. PCRE
5205         looks  first for a named subpattern; if it cannot find one and the name         looks first for a named subpattern; if it cannot find one and the  name
5206         consists entirely of digits, PCRE looks for a subpattern of  that  num-         consists  entirely  of digits, PCRE looks for a subpattern of that num-
5207         ber,  which must be greater than zero. Using subpattern names that con-         ber, which must be greater than zero. Using subpattern names that  con-
5208         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
5209    
5210         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
5211    
5212           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5213    
5214         If the name used in a condition of this kind is a duplicate,  the  test         If  the  name used in a condition of this kind is a duplicate, the test
5215         is  applied to all subpatterns of the same name, and is true if any one         is applied to all subpatterns of the same name, and is true if any  one
5216         of them has matched.         of them has matched.
5217    
5218     Checking for pattern recursion     Checking for pattern recursion
5219    
5220         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
5221         name  R, the condition is true if a recursive call to the whole pattern         name R, the condition is true if a recursive call to the whole  pattern
5222         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
5223         sand follow the letter R, for example:         sand follow the letter R, for example:
5224    
# Line 5208  CONDITIONAL SUBPATTERNS Line 5226  CONDITIONAL SUBPATTERNS
5226    
5227         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
5228         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
5229         recursion  stack.  If  the  name  used in a condition of this kind is a         recursion stack. If the name used in a condition  of  this  kind  is  a
5230         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
5231         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
5232    
5233         At  "top  level",  all  these recursion test conditions are false.  The         At "top level", all these recursion test  conditions  are  false.   The
5234         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
5235    
5236     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
5237    
5238         If the condition is the string (DEFINE), and  there  is  no  subpattern         If  the  condition  is  the string (DEFINE), and there is no subpattern
5239         with  the  name  DEFINE,  the  condition is always false. In this case,         with the name DEFINE, the condition is  always  false.  In  this  case,
5240         there may be only one alternative  in  the  subpattern.  It  is  always         there  may  be  only  one  alternative  in the subpattern. It is always
5241         skipped  if  control  reaches  this  point  in the pattern; the idea of         skipped if control reaches this point  in  the  pattern;  the  idea  of
5242         DEFINE is that it can be used to define "subroutines" that can be  ref-         DEFINE  is that it can be used to define subroutines that can be refer-
5243         erenced  from elsewhere. (The use of "subroutines" is described below.)         enced from elsewhere. (The use of subroutines is described below.)  For
5244         For  example,  a  pattern  to   match   an   IPv4   address   such   as         example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
5245         "192.168.23.245" could be written like this (ignore whitespace and line         could be written like this (ignore whitespace and line breaks):
        breaks):  
5246    
5247           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5248           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
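
Python's standard re module has no DEFINE condition and no (?&name) subroutine calls, so the sketch below is not the PCRE mechanism. It substitutes ordinary string interpolation to reuse the same "byte" alternatives and so approximate the IPv4 pattern above. The names BYTE, IPV4 and is_ipv4 are hypothetical, and re.ASCII is an added assumption that restricts \d to the digits 0-9.

```python
import re

# The same four alternatives as the "byte" subpattern, as a reusable string.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Interpolate BYTE where the PCRE pattern would call (?&byte).
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text):
    """True if the whole text is a dotted-quad address with bytes 0-255."""
    return IPV4.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("192.168.23.245", "192.168.23.256", "1.2.3"):
        print(text, is_ipv4(text))
```

Unlike a DEFINE group, the interpolated text is compiled once for each place it is used, so this only imitates the reuse at the source level.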
# Line 5312  RECURSIVE PATTERNS Line 5329  RECURSIVE PATTERNS
5329         into Perl at release 5.10.         into Perl at release 5.10.
5330    
5331         A special item that consists of (? followed by a  number  greater  than         A special item that consists of (? followed by a  number  greater  than
5332         zero and a closing parenthesis is a recursive call of the subpattern of         zero  and  a  closing parenthesis is a recursive subroutine call of the
5333         the given number, provided that it occurs inside that  subpattern.  (If         subpattern of the given number, provided that  it  occurs  inside  that
5334         not,  it  is  a  "subroutine" call, which is described in the next sec-         subpattern.  (If  not,  it is a non-recursive subroutine call, which is
5335         tion.) The special item (?R) or (?0) is a recursive call of the  entire         described in the next section.) The special item  (?R)  or  (?0)  is  a
5336         regular expression.         recursive call of the entire regular expression.
5337    
5338         This  PCRE  pattern  solves  the nested parentheses problem (assume the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
5339         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
# Line 5348  RECURSIVE PATTERNS Line 5365  RECURSIVE PATTERNS
5365         It is also possible to refer to  subsequently  opened  parentheses,  by         It is also possible to refer to  subsequently  opened  parentheses,  by
5366         writing  references  such  as (?+2). However, these cannot be recursive         writing  references  such  as (?+2). However, these cannot be recursive
5367         because the reference is not inside the  parentheses  that  are  refer-         because the reference is not inside the  parentheses  that  are  refer-
5368         enced.  They  are  always  "subroutine" calls, as described in the next         enced.  They are always non-recursive subroutine calls, as described in
5369         section.         the next section.
5370    
5371         An alternative approach is to use named parentheses instead.  The  Perl         An alternative approach is to use named parentheses instead.  The  Perl
5372         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
# Line 5382  RECURSIVE PATTERNS Line 5399<