| | revision 654 by ph10, Tue Aug 2 11:00:40 2011 UTC | revision 835 by ph10, Wed Dec 28 16:10:09 2011 UTC |
|---|---|---|
| # | Line 85 USER DOCUMENTATION | Line 85 USER DOCUMENTATION |
| 85 | pcrecpp details of the C++ wrapper | pcrecpp details of the C++ wrapper |
| 86 | pcredemo a demonstration C program that uses PCRE | pcredemo a demonstration C program that uses PCRE |
| 87 | pcregrep description of the pcregrep command | pcregrep description of the pcregrep command |
| 88 | | pcrejit discussion of the just-in-time optimization support |
| 89 | | pcrelimits details of size and other limits |
| 90 | pcrematching discussion of the two matching algorithms | pcrematching discussion of the two matching algorithms |
| 91 | pcrepartial details of the partial matching facility | pcrepartial details of the partial matching facility |
| 92 | pcrepattern syntax and semantics of supported | pcrepattern syntax and semantics of supported |
| # | Line 96 USER DOCUMENTATION | Line 98 USER DOCUMENTATION |
| 98 | pcrestack discussion of stack usage | pcrestack discussion of stack usage |
| 99 | pcresyntax quick syntax reference | pcresyntax quick syntax reference |
| 100 | pcretest description of the pcretest testing command | pcretest description of the pcretest testing command |
| 101 | | pcreunicode discussion of Unicode and UTF-8 support |
| 102 | ||
| 103 | In addition, in the "man" and HTML formats, there is a short page for | In addition, in the "man" and HTML formats, there is a short page for |
| 104 | each C library function, listing its arguments and results. | each C library function, listing its arguments and results. |
| 105 | ||
| 106 | ||
| | LIMITATIONS | |
| | There are some size limitations in PCRE but it is hoped that they will | |
| | never in practice be relevant. | |
| | The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE | |
| | is compiled with the default internal linkage size of 2. If you want to | |
| | process regular expressions that are truly enormous, you can compile | |
| | PCRE with an internal linkage size of 3 or 4 (see the README file in | |
| | the source distribution and the pcrebuild documentation for details). | |
| | In these cases the limit is substantially larger. However, the speed | |
| | of execution is slower. | |
| | All values in repeating quantifiers must be less than 65536. | |
| | There is no limit to the number of parenthesized subpatterns, but there | |
| | can be no more than 65535 capturing subpatterns. | |
| | The maximum length of name for a named subpattern is 32 characters, and | |
| | the maximum number of named subpatterns is 10000. | |
| | The maximum length of a subject string is the largest positive number | |
| | that an integer variable can hold. However, when using the traditional | |
| | matching function, PCRE uses recursion to handle subpatterns and indef- | |
| | inite repetition. This means that the available stack space may limit | |
| | the size of a subject string that can be processed by certain patterns. | |
| | For a discussion of stack issues, see the pcrestack documentation. | |
| | UTF-8 AND UNICODE PROPERTY SUPPORT | |
| | From release 3.3, PCRE has had some support for character strings | |
| | encoded in the UTF-8 format. For release 4.0 this was greatly extended | |
| | to cover most common requirements, and in release 5.0 additional sup- | |
| | port for Unicode general category properties was added. | |
| | In order to process UTF-8 strings, you must build PCRE to include UTF-8 | |
| | support in the code, and, in addition, you must call pcre_compile() | |
| | with the PCRE_UTF8 option flag, or the pattern must start with the | |
| | sequence (*UTF8). When either of these is the case, both the pattern | |
| | and any subject strings that are matched against it are treated as | |
| | UTF-8 strings instead of strings of 1-byte characters. | |
| | If you compile PCRE with UTF-8 support, but do not use it at run time, | |
| | the library will be a bit bigger, but the additional run time overhead | |
| | is limited to testing the PCRE_UTF8 flag occasionally, so should not be | |
| | very big. | |
| | If PCRE is built with Unicode character property support (which implies | |
| | UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- | |
| | ported. The available properties that can be tested are limited to the | |
| | general category properties such as Lu for an upper case letter or Nd | |
| | for a decimal number, the Unicode script names such as Arabic or Han, | |
| | and the derived properties Any and L&. A full list is given in the | |
| | pcrepattern documentation. Only the short names for properties are sup- | |
| | ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- | |
| | ter}, is not supported. Furthermore, in Perl, many properties may | |
| | optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE | |
| | does not support this. | |
| | Validity of UTF-8 strings | |
| | When you set the PCRE_UTF8 flag, the strings passed as patterns and | |
| | subjects are (by default) checked for validity on entry to the relevant | |
| | functions. From release 7.3 of PCRE, the check is according to the rules | |
| | of RFC 3629, which are themselves derived from the Unicode specifica- | |
| | tion. Earlier releases of PCRE followed the rules of RFC 2279, which | |
| | allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current | |
| | check allows only values in the range U+0 to U+10FFFF, excluding U+D800 | |
| | to U+DFFF. | |
| | The excluded code points are the "Low Surrogate Area" of Unicode, of | |
| | which the Unicode Standard says this: "The Low Surrogate Area does not | |
| | contain any character assignments, consequently no character code | |
| | charts or namelists are provided for this area. Surrogates are reserved | |
| | for use with UTF-16 and then must be used in pairs." The code points | |
| | that are encoded by UTF-16 pairs are available as independent code | |
| | points in the UTF-8 encoding. (In other words, the whole surrogate | |
| | thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) | |
| | If an invalid UTF-8 string is passed to PCRE, an error return is given. | |
| | At compile time, the only additional information is the offset to the | |
| | first byte of the failing character. The runtime functions (pcre_exec() | |
| | and pcre_dfa_exec()), pass back this information as well as a more | |
| | detailed reason code if the caller has provided memory in which to do | |
| | this. | |
| | In some situations, you may already know that your strings are valid, | |
| | and therefore want to skip these checks in order to improve perfor- | |
| | mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run | |
| | time, PCRE assumes that the pattern or subject it is given (respec- | |
| | tively) contains only valid UTF-8 codes. In this case, it does not | |
| | diagnose an invalid UTF-8 string. | |
| | If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, | |
| | what happens depends on why the string is invalid. If the string con- | |
| | forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a | |
| | string of characters in the range 0 to 0x7FFFFFFF. In other words, | |
| | apart from the initial validity test, PCRE (when in UTF-8 mode) handles | |
| | strings according to the more liberal rules of RFC 2279. However, if | |
| | the string does not even conform to RFC 2279, the result is undefined. | |
| | Your program may crash. | |
| | If you want to process strings of values in the full range 0 to | |
| | 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can | |
| | set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in | |
| | this situation, you will have to apply your own validity check. | |
| | General comments about UTF-8 mode | |
| | 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a | |
| | two-byte UTF-8 character if the value is greater than 127. | |
| | 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 | |
| | characters for values greater than \177. | |
| | 3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- | |
| | vidual bytes, for example: \x{100}{3}. | |
| | 4. The dot metacharacter matches one UTF-8 character instead of a sin- | |
| | gle byte. | |
| | 5. The escape sequence \C can be used to match a single byte in UTF-8 | |
| | mode, but its use can lead to some strange effects. This facility is | |
| | not available in the alternative matching function, pcre_dfa_exec(). | |
| | 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly | |
| | test characters of any code value, but, by default, the characters that | |
| | PCRE recognizes as digits, spaces, or word characters remain the same | |
| | set as before, all with values less than 256. This remains true even | |
| | when PCRE is built to include Unicode property support, because to do | |
| | otherwise would slow down PCRE in many common cases. Note in particular | |
| | that this applies to \b and \B, because they are defined in terms of \w | |
| | and \W. If you really want to test for a wider sense of, say, "digit", | |
| | you can use explicit Unicode property tests such as \p{Nd}. Alterna- | |
| | tively, if you set the PCRE_UCP option, the way that the character | |
| | escapes work is changed so that Unicode properties are used to deter- | |
| | mine which characters match. There are more details in the section on | |
| | generic character types in the pcrepattern documentation. | |
| | 7. Similarly, characters that match the POSIX named character classes | |
| | are all low-valued characters, unless the PCRE_UCP option is set. | |
| | 8. However, the horizontal and vertical whitespace matching escapes | |
| | (\h, \H, \v, and \V) do match all the appropriate Unicode characters, | |
| | whether or not PCRE_UCP is set. | |
| | 9. Case-insensitive matching applies only to characters whose values | |
| | are less than 128, unless PCRE is built with Unicode property support. | |
| | Even when Unicode property support is available, PCRE still uses its | |
| | own character tables when checking the case of low-valued characters, | |
| | so as not to degrade performance. The Unicode property information is | |
| | used only for characters with higher values. Furthermore, PCRE supports | |
| | case-insensitive matching only when there is a one-to-one mapping | |
| | between a letter's cases. There are a small number of many-to-one map- | |
| | pings in Unicode; these are not supported by PCRE. | |
| 107 | AUTHOR | AUTHOR |
| 108 | ||
| 109 | Philip Hazel | Philip Hazel |
| # | Line 272 AUTHOR | Line 117 AUTHOR |
| 117 | ||
| 118 | REVISION | REVISION |
| 119 | ||
| 120 | Last updated: 07 May 2011 | Last updated: 24 August 2011 |
| 121 | Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2011 University of Cambridge. |
| 122 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| 123 | ||
| # | Line 372 UNICODE CHARACTER PROPERTY SUPPORT | Line 217 UNICODE CHARACTER PROPERTY SUPPORT |
| 217 | are supported. Details are given in the pcrepattern documentation. | are supported. Details are given in the pcrepattern documentation. |
| 218 | ||
| 219 | ||
| 220 | | JUST-IN-TIME COMPILER SUPPORT |
| 221 | ||
| 222 | | Just-in-time compiler support is included in the build by specifying |
| 223 | ||
| 224 | | --enable-jit |
| 225 | ||
| 226 | | This support is available only for certain hardware architectures. If |
| 227 | | this option is set for an unsupported architecture, a compile time |
| 228 | | error occurs. See the pcrejit documentation for a discussion of JIT |
| 229 | | usage. When JIT support is enabled, pcregrep automatically makes use of |
| 230 | | it, unless you add |
| 231 | ||
| 232 | | --disable-pcregrep-jit |
| 233 | ||
| 234 | | to the "configure" command. |
| 235 | ||
| 236 | ||
| 237 | CODE VALUE OF NEWLINE | CODE VALUE OF NEWLINE |
| 238 | ||
| 239 | By default, PCRE interprets the linefeed (LF) character as indicating | By default, PCRE interprets the linefeed (LF) character as indicating |
| # | Line 619 AUTHOR | Line 481 AUTHOR |
| 481 | ||
| 482 | REVISION | REVISION |
| 483 | ||
| 484 | Last updated: 02 August 2011 | Last updated: 06 September 2011 |
| 485 | Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2011 University of Cambridge. |
| 486 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| 487 | ||
| # | Line 835 NAME | Line 697 NAME |
| 697 | PCRE - Perl-compatible regular expressions | PCRE - Perl-compatible regular expressions |
| 698 | ||
| 699 | ||
| 700 | PCRE NATIVE API | PCRE NATIVE API BASIC FUNCTIONS |
| 701 | ||
| 702 | #include <pcre.h> | #include <pcre.h> |
| 703 | ||
| # | Line 851 PCRE NATIVE API | Line 713 PCRE NATIVE API |
| 713 | pcre_extra *pcre_study(const pcre *code, int options, | pcre_extra *pcre_study(const pcre *code, int options, |
| 714 | const char **errptr); | const char **errptr); |
| 715 | ||
| 716 | | void pcre_free_study(pcre_extra *extra); |
| 717 | ||
| 718 | int pcre_exec(const pcre *code, const pcre_extra *extra, | int pcre_exec(const pcre *code, const pcre_extra *extra, |
| 719 | const char *subject, int length, int startoffset, | const char *subject, int length, int startoffset, |
| 720 | int options, int *ovector, int ovecsize); | int options, int *ovector, int ovecsize); |
| 721 | ||
| 722 | ||
| 723 | | PCRE NATIVE API AUXILIARY FUNCTIONS |
| 724 | ||
| 725 | | pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize); |
| 726 | ||
| 727 | | void pcre_jit_stack_free(pcre_jit_stack *stack); |
| 728 | ||
| 729 | | void pcre_assign_jit_stack(pcre_extra *extra, |
| 730 | | pcre_jit_callback callback, void *data); |
| 731 | ||
| 732 | int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, | int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, |
| 733 | const char *subject, int length, int startoffset, | const char *subject, int length, int startoffset, |
| 734 | int options, int *ovector, int ovecsize, | int options, int *ovector, int ovecsize, |
| # | Line 904 PCRE NATIVE API | Line 778 PCRE NATIVE API |
| 778 | ||
| 779 | char *pcre_version(void); | char *pcre_version(void); |
| 780 | ||
| 781 | ||
| 782 | | PCRE NATIVE API INDIRECTED FUNCTIONS |
| 783 | ||
| 784 | void *(*pcre_malloc)(size_t); | void *(*pcre_malloc)(size_t); |
| 785 | ||
| 786 | void (*pcre_free)(void *); | void (*pcre_free)(void *); |
| # | Line 919 PCRE API OVERVIEW | Line 796 PCRE API OVERVIEW |
| 796 | ||
| 797 | PCRE has its own native API, which is described in this document. There | PCRE has its own native API, which is described in this document. There |
| 798 | are also some wrapper functions that correspond to the POSIX regular | are also some wrapper functions that correspond to the POSIX regular |
| 799 | expression API. These are described in the pcreposix documentation. | expression API, but they do not give access to all the functionality. |
| 800 | Both of these APIs define a set of C function calls. A C++ wrapper is | They are described in the pcreposix documentation. Both of these APIs |
| 801 | distributed with PCRE. It is documented in the pcrecpp page. | define a set of C function calls. A C++ wrapper is also distributed |
| 802 | | with PCRE. It is documented in the pcrecpp page. |
| 803 | ||
| 804 | The native API C function prototypes are defined in the header file | The native API C function prototypes are defined in the header file |
| 805 | pcre.h, and on Unix systems the library itself is called libpcre. It | pcre.h, and on Unix systems the library itself is called libpcre. It |
| 806 | can normally be accessed by adding -lpcre to the command for linking an | can normally be accessed by adding -lpcre to the command for linking an |
| 807 | application that uses PCRE. The header file defines the macros | application that uses PCRE. The header file defines the macros |
| 808 | PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num- | PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num- |
| 809 | bers for the library. Applications can use these to include support | bers for the library. Applications can use these to include support |
| 810 | for different releases of PCRE. | for different releases of PCRE. |
| 811 | ||
| 812 | In a Windows environment, if you want to statically link an application | In a Windows environment, if you want to statically link an application |
| 813 | program against a non-dll pcre.a file, you must define PCRE_STATIC | program against a non-dll pcre.a file, you must define PCRE_STATIC |
| 814 | before including pcre.h or pcrecpp.h, because otherwise the pcre_mal- | before including pcre.h or pcrecpp.h, because otherwise the pcre_mal- |
| 815 | loc() and pcre_free() exported functions will be declared | loc() and pcre_free() exported functions will be declared |
| 816 | __declspec(dllimport), with unwanted results. | __declspec(dllimport), with unwanted results. |
| 817 | ||
| 818 | The functions pcre_compile(), pcre_compile2(), pcre_study(), and | The functions pcre_compile(), pcre_compile2(), pcre_study(), and |
| 819 | pcre_exec() are used for compiling and matching regular expressions in | pcre_exec() are used for compiling and matching regular expressions in |
| 820 | a Perl-compatible manner. A sample program that demonstrates the sim- | a Perl-compatible manner. A sample program that demonstrates the sim- |
| 821 | plest way of using them is provided in the file called pcredemo.c in | plest way of using them is provided in the file called pcredemo.c in |
| 822 | the PCRE source distribution. A listing of this program is given in the | the PCRE source distribution. A listing of this program is given in the |
| 823 | pcredemo documentation, and the pcresample documentation describes how | pcredemo documentation, and the pcresample documentation describes how |
| 824 | to compile and run it. | to compile and run it. |
| 825 | ||
| 826 | | Just-in-time compiler support is an optional feature of PCRE that can |
| 827 | | be built in appropriate hardware environments. It greatly speeds up the |
| 828 | | matching performance of many patterns. Simple programs can easily |
| 829 | | request that it be used if available, by setting an option that is |
| 830 | | ignored when it is not relevant. More complicated programs might need |
| 831 | | to make use of the functions pcre_jit_stack_alloc(), |
| 832 | | pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control |
| 833 | | the JIT code's memory usage. These functions are discussed in the |
| 834 | | pcrejit documentation. |
| 835 | ||
| 836 | A second matching function, pcre_dfa_exec(), which is not Perl-compati- | A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
| 837 | ble, is also provided. This uses a different algorithm for the match- | ble, is also provided. This uses a different algorithm for the match- |
| 838 | ing. The alternative algorithm finds all possible matches (at a given | ing. The alternative algorithm finds all possible matches (at a given |
| 839 | point in the subject), and scans the subject just once (unless there | point in the subject), and scans the subject just once (unless there |
| 840 | are lookbehind assertions). However, this algorithm does not return | are lookbehind assertions). However, this algorithm does not return |
| 841 | captured substrings. A description of the two matching algorithms and | captured substrings. A description of the two matching algorithms and |
| 842 | their advantages and disadvantages is given in the pcrematching docu- | their advantages and disadvantages is given in the pcrematching docu- |
| 843 | mentation. | mentation. |
| 844 | ||
| 845 | In addition to the main compiling and matching functions, there are | In addition to the main compiling and matching functions, there are |
| 846 | convenience functions for extracting captured substrings from a subject | convenience functions for extracting captured substrings from a subject |
| 847 | string that is matched by pcre_exec(). They are: | string that is matched by pcre_exec(). They are: |
| 848 | ||
| # | Line 969 PCRE API OVERVIEW | Line 857 PCRE API OVERVIEW |
| 857 | pcre_free_substring() and pcre_free_substring_list() are also provided, | pcre_free_substring() and pcre_free_substring_list() are also provided, |
| 858 | to free the memory used for extracted strings. | to free the memory used for extracted strings. |
| 859 | ||
| 860 | The function pcre_maketables() is used to build a set of character | The function pcre_maketables() is used to build a set of character |
| 861 | tables in the current locale for passing to pcre_compile(), | tables in the current locale for passing to pcre_compile(), |
| 862 | pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is | pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is |
| 863 | provided for specialist use. Most commonly, no special tables are | provided for specialist use. Most commonly, no special tables are |
| 864 | passed, in which case internal tables that are generated when PCRE is | passed, in which case internal tables that are generated when PCRE is |
| 865 | built are used. | built are used. |
| 866 | ||
| 867 | The function pcre_fullinfo() is used to find out information about a | The function pcre_fullinfo() is used to find out information about a |
| 868 | compiled pattern; pcre_info() is an obsolete version that returns only | compiled pattern; pcre_info() is an obsolete version that returns only |
| 869 | some of the available information, but is retained for backwards com- | some of the available information, but is retained for backwards com- |
| 870 | patibility. The function pcre_version() returns a pointer to a string | patibility. The function pcre_version() returns a pointer to a string |
| 871 | containing the version of PCRE and its date of release. | containing the version of PCRE and its date of release. |
| 872 | ||
| 873 | The function pcre_refcount() maintains a reference count in a data | The function pcre_refcount() maintains a reference count in a data |
| 874 | block containing a compiled pattern. This is provided for the benefit | block containing a compiled pattern. This is provided for the benefit |
| 875 | of object-oriented applications. | of object-oriented applications. |
| 876 | ||
| 877 | The global variables pcre_malloc and pcre_free initially contain the | The global variables pcre_malloc and pcre_free initially contain the |
| 878 | entry points of the standard malloc() and free() functions, respec- | entry points of the standard malloc() and free() functions, respec- |
| 879 | tively. PCRE calls the memory management functions via these variables, | tively. PCRE calls the memory management functions via these variables, |
| 880 | so a calling program can replace them if it wishes to intercept the | so a calling program can replace them if it wishes to intercept the |
| 881 | calls. This should be done before calling any PCRE functions. | calls. This should be done before calling any PCRE functions. |
| 882 | ||
| 883 | The global variables pcre_stack_malloc and pcre_stack_free are also | The global variables pcre_stack_malloc and pcre_stack_free are also |
| 884 | indirections to memory management functions. These special functions | indirections to memory management functions. These special functions |
| 885 | are used only when PCRE is compiled to use the heap for remembering | are used only when PCRE is compiled to use the heap for remembering |
| 886 | data, instead of recursive function calls, when running the pcre_exec() | data, instead of recursive function calls, when running the pcre_exec() |
| 887 | function. See the pcrebuild documentation for details of how to do | function. See the pcrebuild documentation for details of how to do |
| 888 | this. It is a non-standard way of building PCRE, for use in environ- | this. It is a non-standard way of building PCRE, for use in environ- |
| 889 | ments that have limited stacks. Because of the greater use of memory | ments that have limited stacks. Because of the greater use of memory |
| 890 | management, it runs more slowly. Separate functions are provided so | management, it runs more slowly. Separate functions are provided so |
| 891 | that special-purpose external code can be used for this case. When | that special-purpose external code can be used for this case. When |
| 892 | used, these functions are always called in a stack-like manner (last | used, these functions are always called in a stack-like manner (last |
| 893 | obtained, first freed), and always for memory blocks of the same size. | obtained, first freed), and always for memory blocks of the same size. |
| 894 | There is a discussion about PCRE's stack usage in the pcrestack docu- | There is a discussion about PCRE's stack usage in the pcrestack docu- |
| 895 | mentation. | mentation. |
| 896 | ||
| 897 | The global variable pcre_callout initially contains NULL. It can be set | The global variable pcre_callout initially contains NULL. It can be set |
| 898 | by the caller to a "callout" function, which PCRE will then call at | by the caller to a "callout" function, which PCRE will then call at |
| 899 | specified points during a matching operation. Details are given in the | specified points during a matching operation. Details are given in the |
| 900 | pcrecallout documentation. | pcrecallout documentation. |
| 901 | ||
| 902 | ||
| 903 | NEWLINES | NEWLINES |
| 904 | ||
| 905 | PCRE supports five different conventions for indicating line breaks in | PCRE supports five different conventions for indicating line breaks in |
| 906 | strings: a single CR (carriage return) character, a single LF (line- | strings: a single CR (carriage return) character, a single LF (line- |
| 907 | feed) character, the two-character sequence CRLF, any of the three pre- | feed) character, the two-character sequence CRLF, any of the three pre- |
| 908 | ceding, or any Unicode newline sequence. The Unicode newline sequences | ceding, or any Unicode newline sequence. The Unicode newline sequences |
| 909 | are the three just mentioned, plus the single characters VT (vertical | are the three just mentioned, plus the single characters VT (vertical |
| 910 | tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line | tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
| 911 | separator, U+2028), and PS (paragraph separator, U+2029). | separator, U+2028), and PS (paragraph separator, U+2029). |
| 912 | ||
| 913 | Each of the first three conventions is used by at least one operating | Each of the first three conventions is used by at least one operating |
| 914 | system as its standard newline sequence. When PCRE is built, a default | system as its standard newline sequence. When PCRE is built, a default |
| 915 | can be specified. The default default is LF, which is the Unix stan- | can be specified. The default default is LF, which is the Unix stan- |
| 916 | dard. When PCRE is run, the default can be overridden, either when a | dard. When PCRE is run, the default can be overridden, either when a |
| 917 | pattern is compiled, or when it is matched. | pattern is compiled, or when it is matched. |
| 918 | ||
| 919 | At compile time, the newline convention can be specified by the options | At compile time, the newline convention can be specified by the options |
| 920 | argument of pcre_compile(), or it can be specified by special text at | argument of pcre_compile(), or it can be specified by special text at |
| 921 | the start of the pattern itself; this overrides any other settings. See | the start of the pattern itself; this overrides any other settings. See |
| 922 | the pcrepattern page for details of the special character sequences. | the pcrepattern page for details of the special character sequences. |
| 923 | ||
| 924 | In the PCRE documentation the word "newline" is used to mean "the char- | In the PCRE documentation the word "newline" is used to mean "the char- |
| 925 | acter or pair of characters that indicate a line break". The choice of | acter or pair of characters that indicate a line break". The choice of |
| 926 | newline convention affects the handling of the dot, circumflex, and | newline convention affects the handling of the dot, circumflex, and |
| 927 | dollar metacharacters, the handling of #-comments in /x mode, and, when | dollar metacharacters, the handling of #-comments in /x mode, and, when |
| 928 | CRLF is a recognized line ending sequence, the match position advance- | CRLF is a recognized line ending sequence, the match position advance- |
| 929 | ment for a non-anchored pattern. There is more detail about this in the | ment for a non-anchored pattern. There is more detail about this in the |
| 930 | section on pcre_exec() options below. | section on pcre_exec() options below. |
| 931 | ||
| 932 | The choice of newline convention does not affect the interpretation of | The choice of newline convention does not affect the interpretation of |
| 933 | the \n or \r escape sequences, nor does it affect what \R matches, | the \n or \r escape sequences, nor does it affect what \R matches, |
| 934 | which is controlled in a similar way, but by separate options. | which is controlled in a similar way, but by separate options. |
| 935 | ||
| 936 | ||
| 937 | MULTITHREADING | MULTITHREADING |
| 938 | ||
| 939 | The PCRE functions can be used in multi-threading applications, with | The PCRE functions can be used in multi-threading applications, with |
| 940 | the proviso that the memory management functions pointed to by | the proviso that the memory management functions pointed to by |
| 941 | pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the | pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
| 942 | callout function pointed to by pcre_callout, are shared by all threads. | callout function pointed to by pcre_callout, are shared by all threads. |
| 943 | ||
| 944 | The compiled form of a regular expression is not altered during match- | The compiled form of a regular expression is not altered during match- |
| 945 | ing, so the same compiled pattern can safely be used by several threads | ing, so the same compiled pattern can safely be used by several threads |
| 946 | at once. | at once. |
| 947 | ||
| 948 | | If the just-in-time optimization feature is being used, it needs sepa- |
| 949 | | rate memory stack areas for each thread. See the pcrejit documentation |
| 950 | | for more details. |
| 951 | ||
| 952 | ||
| 953 | SAVING PRECOMPILED PATTERNS FOR LATER USE | SAVING PRECOMPILED PATTERNS FOR LATER USE |
| 954 | ||
| 955 | The compiled form of a regular expression can be saved and re-used at a | The compiled form of a regular expression can be saved and re-used at a |
| 956 | later time, possibly by a different program, and even on a host other | later time, possibly by a different program, and even on a host other |
| 957 | than the one on which it was compiled. Details are given in the | than the one on which it was compiled. Details are given in the |
| 958 | pcreprecompile documentation. However, compiling a regular expression | pcreprecompile documentation. However, compiling a regular expression |
| 959 | with one version of PCRE for use with a different version is not guar- | with one version of PCRE for use with a different version is not guar- |
| 960 | anteed to work and may cause crashes. | anteed to work and may cause crashes. |
| 961 | ||
| 962 | ||
| # | Line 1072 CHECKING BUILD-TIME OPTIONS | Line 964 CHECKING BUILD-TIME OPTIONS |
| 964 | ||
| 965 | int pcre_config(int what, void *where); | int pcre_config(int what, void *where); |
| 966 | ||
| 967 | The function pcre_config() makes it possible for a PCRE client to dis- | The function pcre_config() makes it possible for a PCRE client to dis- |
| 968 | cover which optional features have been compiled into the PCRE library. | cover which optional features have been compiled into the PCRE library. |
| 969 | The pcrebuild documentation has more details about these optional fea- | The pcrebuild documentation has more details about these optional fea- |
| 970 | tures. | tures. |
| 971 | ||
| 972 | The first argument for pcre_config() is an integer, specifying which | The first argument for pcre_config() is an integer, specifying which |
| 973 | information is required; the second argument is a pointer to a variable | information is required; the second argument is a pointer to a variable |
| 974 | into which the information is placed. The following information is | into which the information is placed. The following information is |
| 975 | available: | available: |
| 976 | ||
| 977 | PCRE_CONFIG_UTF8 | PCRE_CONFIG_UTF8 |
| 978 | ||
| 979 | The output is an integer that is set to one if UTF-8 support is avail- | The output is an integer that is set to one if UTF-8 support is avail- |
| 980 | able; otherwise it is set to zero. | able; otherwise it is set to zero. |
| 981 | ||
| 982 | PCRE_CONFIG_UNICODE_PROPERTIES | PCRE_CONFIG_UNICODE_PROPERTIES |
| 983 | ||
| 984 | The output is an integer that is set to one if support for Unicode | The output is an integer that is set to one if support for Unicode |
| 985 | character properties is available; otherwise it is set to zero. | character properties is available; otherwise it is set to zero. |
| 986 | ||
| 987 | PCRE_CONFIG_JIT | |
| 988 | ||
| 989 | The output is an integer that is set to one if support for just-in-time | |
| 990 | compiling is available; otherwise it is set to zero. | |
| 991 | ||
| 992 | PCRE_CONFIG_NEWLINE | PCRE_CONFIG_NEWLINE |
| 993 | ||
| 994 | The output is an integer whose value specifies the default character | The output is an integer whose value specifies the default character |
| # | Line 1453 COMPILING A PATTERN | Line 1350 COMPILING A PATTERN |
| 1350 | strings of UTF-8 characters instead of single-byte character strings. | strings of UTF-8 characters instead of single-byte character strings. |
| 1351 | However, it is available only when PCRE is built to include UTF-8 sup- | However, it is available only when PCRE is built to include UTF-8 sup- |
| 1352 | port. If not, the use of this option provokes an error. Details of how | port. If not, the use of this option provokes an error. Details of how |
| 1353 | this option changes the behaviour of PCRE are given in the section on | this option changes the behaviour of PCRE are given in the pcreunicode |
| 1354 | UTF-8 support in the main pcre page. | page. |
| 1355 | ||
| 1356 | PCRE_NO_UTF8_CHECK | PCRE_NO_UTF8_CHECK |
| 1357 | ||
| # | Line 1514 COMPILATION ERROR CODES | Line 1411 COMPILATION ERROR CODES |
| 1411 | 34 character value in \x{...} sequence is too large | 34 character value in \x{...} sequence is too large |
| 1412 | 35 invalid condition (?(0) | 35 invalid condition (?(0) |
| 1413 | 36 \C not allowed in lookbehind assertion | 36 \C not allowed in lookbehind assertion |
| 1414 | 37 PCRE does not support \L, \l, \N, \U, or \u | 37 PCRE does not support \L, \l, \N{name}, \U, or \u |
| 1415 | 38 number after (?C is > 255 | 38 number after (?C is > 255 |
| 1416 | 39 closing ) for (?C expected | 39 closing ) for (?C expected |
| 1417 | 40 recursive call could loop indefinitely | 40 recursive call could loop indefinitely |
| # | Line 1548 COMPILATION ERROR CODES | Line 1445 COMPILATION ERROR CODES |
| 1445 | not allowed | not allowed |
| 1446 | 66 (*MARK) must have an argument | 66 (*MARK) must have an argument |
| 1447 | 67 this version of PCRE is not compiled with PCRE_UCP support | 67 this version of PCRE is not compiled with PCRE_UCP support |
| 1448 | 68 \c must be followed by an ASCII character | |
| 1449 | 69 \k is not followed by a braced, angle-bracketed, or quoted name | |
| 1450 | ||
| 1451 | The numbers 32 and 10000 in errors 48 and 49 are defaults; different | The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
| 1452 | values may be used if the limits were changed when PCRE was built. | values may be used if the limits were changed when PCRE was built. |
| # | Line 1576 STUDYING A PATTERN | Line 1475 STUDYING A PATTERN |
| 1475 | wants to pass any of the other fields to pcre_exec() or | wants to pass any of the other fields to pcre_exec() or |
| 1476 | pcre_dfa_exec(), it must set up its own pcre_extra block. | pcre_dfa_exec(), it must set up its own pcre_extra block. |
| 1477 | ||
| 1478 | The second argument of pcre_study() contains option bits. At present, | The second argument of pcre_study() contains option bits. There is only |
| 1479 | no options are defined, and this argument should always be zero. | one option: PCRE_STUDY_JIT_COMPILE. If this is set, and the just-in- |
| 1480 | time compiler is available, the pattern is further compiled into | |
| 1481 | machine code that executes much faster than the pcre_exec() matching | |
| 1482 | function. If the just-in-time compiler is not available, this option is | |
| 1483 | ignored. All other bits in the options argument must be zero. | |
| 1484 | ||
| 1485 | JIT compilation is a heavyweight optimization. It can take some time | |
| 1486 | for patterns to be analyzed, and for one-off matches and simple pat- | |
| 1487 | terns the benefit of faster execution might be offset by a much slower | |
| 1488 | study time. Not all patterns can be optimized by the JIT compiler. For | |
| 1489 | those that cannot be handled, matching automatically falls back to the | |
| 1490 | pcre_exec() interpreter. For more details, see the pcrejit documenta- | |
| 1491 | tion. | |
| 1492 | ||
| 1493 | The third argument for pcre_study() is a pointer for an error message. | The third argument for pcre_study() is a pointer for an error message. |
| 1494 | If studying succeeds (even if no data is returned), the variable it | If studying succeeds (even if no data is returned), the variable it |
| # | Line 1586 STUDYING A PATTERN | Line 1497 STUDYING A PATTERN |
| 1497 | must not try to free it. You should test the error pointer for NULL | must not try to free it. You should test the error pointer for NULL |
| 1498 | after calling pcre_study(), to be sure that it has run successfully. | after calling pcre_study(), to be sure that it has run successfully. |
| 1499 | ||
| 1500 | This is a typical call to pcre_study(): | When you are finished with a pattern, you can free the memory used for |
| 1501 | the study data by calling pcre_free_study(). This function was added to | |
| 1502 | the API for release 8.20. For earlier versions, the memory could be | |
| 1503 | freed with pcre_free(), just like the pattern itself. This will still | |
| 1504 | work in cases where PCRE_STUDY_JIT_COMPILE is not used, but it is | |
| 1505 | advisable to change to the new function when convenient. | |
| 1506 | ||
| 1507 | This is a typical way in which pcre_study() is used (except that in a | |
| 1508 | real application there should be tests for errors): | |
| 1509 | ||
| 1510 | pcre_extra *pe; | int rc; |
| 1511 | pe = pcre_study( | pcre *re; |
| 1512 | pcre_extra *sd; | |
| 1513 | re = pcre_compile("pattern", 0, &error, &erroroffset, NULL); | |
| 1514 | sd = pcre_study( | |
| 1515 | re, /* result of pcre_compile() */ | re, /* result of pcre_compile() */ |
| 1516 | 0, /* no options exist */ | 0, /* no options */ |
| 1517 | &error); /* set to NULL or points to a message */ | &error); /* set to NULL or points to a message */ |
| 1518 | rc = pcre_exec( /* see below for details of pcre_exec() options */ | |
| 1519 | re, sd, "subject", 7, 0, 0, ovector, 30); | |
| 1520 | ... | |
| 1521 | pcre_free_study(sd); | |
| 1522 | pcre_free(re); | |
| 1523 | ||
| 1524 | Studying a pattern does two things: first, a lower bound for the length | Studying a pattern does two things: first, a lower bound for the length |
| 1525 | of subject string that is needed to match the pattern is computed. This | of subject string that is needed to match the pattern is computed. This |
| # | Line 1607 STUDYING A PATTERN | Line 1534 STUDYING A PATTERN |
| 1534 | bytes is created. This speeds up finding a position in the subject at | bytes is created. This speeds up finding a position in the subject at |
| 1535 | which to start matching. | which to start matching. |
| 1536 | ||
| 1537 | The two optimizations just described can be disabled by setting the | These two optimizations apply to both pcre_exec() and pcre_dfa_exec(). |
| 1538 | PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or | However, they are not used by pcre_exec() if pcre_study() is called |
| 1539 | pcre_dfa_exec(). You might want to do this if your pattern contains | with the PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling is |
| 1540 | callouts or (*MARK), and you want to make use of these facilities in | successful. The optimizations can be disabled by setting the |
| 1541 | cases where matching fails. See the discussion of PCRE_NO_START_OPTI- | PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or |
| 1542 | MIZE below. | pcre_dfa_exec(). You might want to do this if your pattern contains |
| 1543 | callouts or (*MARK) (which cannot be handled by the JIT compiler), and | |
| 1544 | you want to make use of these facilities in cases where matching fails. | |
| 1545 | See the discussion of PCRE_NO_START_OPTIMIZE below. | |
| 1546 | ||
| 1547 | ||
| 1548 | LOCALE SUPPORT | LOCALE SUPPORT |
| 1549 | ||
| 1550 | PCRE handles caseless matching, and determines whether characters are | PCRE handles caseless matching, and determines whether characters are |
| 1551 | letters, digits, or whatever, by reference to a set of tables, indexed | letters, digits, or whatever, by reference to a set of tables, indexed |
| 1552 | by character value. When running in UTF-8 mode, this applies only to | by character value. When running in UTF-8 mode, this applies only to |
| 1553 | characters with codes less than 128. By default, higher-valued codes | characters with codes less than 128. By default, higher-valued codes |
| 1554 | never match escapes such as \w or \d, but they can be tested with \p if | never match escapes such as \w or \d, but they can be tested with \p if |
| 1555 | PCRE is built with Unicode character property support. Alternatively, | PCRE is built with Unicode character property support. Alternatively, |
| 1556 | the PCRE_UCP option can be set at compile time; this causes \w and | the PCRE_UCP option can be set at compile time; this causes \w and |
| 1557 | friends to use Unicode property support instead of built-in tables. The | friends to use Unicode property support instead of built-in tables. The |
| 1558 | use of locales with Unicode is discouraged. If you are handling charac- | use of locales with Unicode is discouraged. If you are handling charac- |
| 1559 | ters with codes greater than 127, you should either use UTF-8 and Uni- | ters with codes greater than 127, you should either use UTF-8 and Uni- |
| 1560 | code, or use locales, but not try to mix the two. | code, or use locales, but not try to mix the two. |
| 1561 | ||
| 1562 | PCRE contains an internal set of tables that are used when the final | PCRE contains an internal set of tables that are used when the final |
| 1563 | argument of pcre_compile() is NULL. These are sufficient for many | argument of pcre_compile() is NULL. These are sufficient for many |
| 1564 | applications. Normally, the internal tables recognize only ASCII char- | applications. Normally, the internal tables recognize only ASCII char- |
| 1565 | acters. However, when PCRE is built, it is possible to cause the inter- | acters. However, when PCRE is built, it is possible to cause the inter- |
| 1566 | nal tables to be rebuilt in the default "C" locale of the local system, | nal tables to be rebuilt in the default "C" locale of the local system, |
| 1567 | which may cause them to be different. | which may cause them to be different. |
| 1568 | ||
| 1569 | The internal tables can always be overridden by tables supplied by the | The internal tables can always be overridden by tables supplied by the |
| 1570 | application that calls PCRE. These may be created in a different locale | application that calls PCRE. These may be created in a different locale |
| 1571 | from the default. As more and more applications change to using Uni- | from the default. As more and more applications change to using Uni- |
| 1572 | code, the need for this locale support is expected to die away. | code, the need for this locale support is expected to die away. |
| 1573 | ||
| 1574 | External tables are built by calling the pcre_maketables() function, | External tables are built by calling the pcre_maketables() function, |
| 1575 | which has no arguments, in the relevant locale. The result can then be | which has no arguments, in the relevant locale. The result can then be |
| 1576 | passed to pcre_compile() or pcre_exec() as often as necessary. For | passed to pcre_compile() or pcre_exec() as often as necessary. For |
| 1577 | example, to build and use tables that are appropriate for the French | example, to build and use tables that are appropriate for the French |
| 1578 | locale (where accented characters with values greater than 128 are | locale (where accented characters with values greater than 128 are |
| 1579 | treated as letters), the following code could be used: | treated as letters), the following code could be used: |
| 1580 | ||
| 1581 | setlocale(LC_CTYPE, "fr_FR"); | setlocale(LC_CTYPE, "fr_FR"); |
| 1582 | tables = pcre_maketables(); | tables = pcre_maketables(); |
| 1583 | re = pcre_compile(..., tables); | re = pcre_compile(..., tables); |
| 1584 | ||
| 1585 | The locale name "fr_FR" is used on Linux and other Unix-like systems; | The locale name "fr_FR" is used on Linux and other Unix-like systems; |
| 1586 | if you are using Windows, the name for the French locale is "french". | if you are using Windows, the name for the French locale is "french". |
| 1587 | ||
| 1588 | When pcre_maketables() runs, the tables are built in memory that is | When pcre_maketables() runs, the tables are built in memory that is |
| 1589 | obtained via pcre_malloc. It is the caller's responsibility to ensure | obtained via pcre_malloc. It is the caller's responsibility to ensure |
| 1590 | that the memory containing the tables remains available for as long as | that the memory containing the tables remains available for as long as |
| 1591 | it is needed. | it is needed. |
| 1592 | ||
| 1593 | The pointer that is passed to pcre_compile() is saved with the compiled | The pointer that is passed to pcre_compile() is saved with the compiled |
| 1594 | pattern, and the same tables are used via this pointer by pcre_study() | pattern, and the same tables are used via this pointer by pcre_study() |
| 1595 | and normally also by pcre_exec(). Thus, by default, for any single pat- | and normally also by pcre_exec(). Thus, by default, for any single pat- |
| 1596 | tern, compilation, studying and matching all happen in the same locale, | tern, compilation, studying and matching all happen in the same locale, |
| 1597 | but different patterns can be compiled in different locales. | but different patterns can be compiled in different locales. |
| 1598 | ||
| 1599 | It is possible to pass a table pointer or NULL (indicating the use of | It is possible to pass a table pointer or NULL (indicating the use of |
| 1600 | the internal tables) to pcre_exec(). Although not intended for this | the internal tables) to pcre_exec(). Although not intended for this |
| 1601 | purpose, this facility could be used to match a pattern in a different | purpose, this facility could be used to match a pattern in a different |
| 1602 | locale from the one in which it was compiled. Passing table pointers at | locale from the one in which it was compiled. Passing table pointers at |
| 1603 | run time is discussed below in the section on matching a pattern. | run time is discussed below in the section on matching a pattern. |
| 1604 | ||
| # | Line 1678 INFORMATION ABOUT A PATTERN | Line 1608 INFORMATION ABOUT A PATTERN |
| 1608 | int pcre_fullinfo(const pcre *code, const pcre_extra *extra, | int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
| 1609 | int what, void *where); | int what, void *where); |
| 1610 | ||
| 1611 | The pcre_fullinfo() function returns information about a compiled pat- | The pcre_fullinfo() function returns information about a compiled pat- |
| 1612 | tern. It replaces the obsolete pcre_info() function, which is neverthe- | tern. It replaces the obsolete pcre_info() function, which is neverthe- |
| 1613 | less retained for backwards compatibility (and is documented below). | less retained for backwards compatibility (and is documented below). |
| 1614 | ||
| 1615 | The first argument for pcre_fullinfo() is a pointer to the compiled | The first argument for pcre_fullinfo() is a pointer to the compiled |
| 1616 | pattern. The second argument is the result of pcre_study(), or NULL if | pattern. The second argument is the result of pcre_study(), or NULL if |
| 1617 | the pattern was not studied. The third argument specifies which piece | the pattern was not studied. The third argument specifies which piece |
| 1618 | of information is required, and the fourth argument is a pointer to a | of information is required, and the fourth argument is a pointer to a |
| 1619 | variable to receive the data. The yield of the function is zero for | variable to receive the data. The yield of the function is zero for |
| 1620 | success, or one of the following negative numbers: | success, or one of the following negative numbers: |
| 1621 | ||
| 1622 | PCRE_ERROR_NULL the argument code was NULL | PCRE_ERROR_NULL the argument code was NULL |
| # | Line 1694 INFORMATION ABOUT A PATTERN | Line 1624 INFORMATION ABOUT A PATTERN |
| 1624 | PCRE_ERROR_BADMAGIC the "magic number" was not found | PCRE_ERROR_BADMAGIC the "magic number" was not found |
| 1625 | PCRE_ERROR_BADOPTION the value of what was invalid | PCRE_ERROR_BADOPTION the value of what was invalid |
| 1626 | ||
| 1627 | The "magic number" is placed at the start of each compiled pattern as | The "magic number" is placed at the start of each compiled pattern as |
| 1628 | a simple check against passing an arbitrary memory pointer. Here is a | a simple check against passing an arbitrary memory pointer. Here is a |
| 1629 | typical call of pcre_fullinfo(), to obtain the length of the compiled | typical call of pcre_fullinfo(), to obtain the length of the compiled |
| 1630 | pattern: | pattern: |
| 1631 | ||
| 1632 | int rc; | int rc; |
| 1633 | size_t length; | size_t length; |
| 1634 | rc = pcre_fullinfo( | rc = pcre_fullinfo( |
| 1635 | re, /* result of pcre_compile() */ | re, /* result of pcre_compile() */ |
| 1636 | pe, /* result of pcre_study(), or NULL */ | sd, /* result of pcre_study(), or NULL */ |
| 1637 | PCRE_INFO_SIZE, /* what is required */ | PCRE_INFO_SIZE, /* what is required */ |
| 1638 | &length); /* where to put the data */ | &length); /* where to put the data */ |
| 1639 | ||
| 1640 | The possible values for the third argument are defined in pcre.h, and | The possible values for the third argument are defined in pcre.h, and |
| 1641 | are as follows: | are as follows: |
| 1642 | ||
| 1643 | PCRE_INFO_BACKREFMAX | PCRE_INFO_BACKREFMAX |
| 1644 | ||
| 1645 | Return the number of the highest back reference in the pattern. The | Return the number of the highest back reference in the pattern. The |
| 1646 | fourth argument should point to an int variable. Zero is returned if | fourth argument should point to an int variable. Zero is returned if |
| 1647 | there are no back references. | there are no back references. |
| 1648 | ||
| 1649 | PCRE_INFO_CAPTURECOUNT | PCRE_INFO_CAPTURECOUNT |
| 1650 | ||
| 1651 | Return the number of capturing subpatterns in the pattern. The fourth | Return the number of capturing subpatterns in the pattern. The fourth |
| 1652 | argument should point to an int variable. | argument should point to an int variable. |
| 1653 | ||
| 1654 | PCRE_INFO_DEFAULT_TABLES | PCRE_INFO_DEFAULT_TABLES |
| 1655 | ||
| 1656 | Return a pointer to the internal default character tables within PCRE. | Return a pointer to the internal default character tables within PCRE. |
| 1657 | The fourth argument should point to an unsigned char * variable. This | The fourth argument should point to an unsigned char * variable. This |
| 1658 | information call is provided for internal use by the pcre_study() func- | information call is provided for internal use by the pcre_study() func- |
| 1659 | tion. External callers can cause PCRE to use its internal tables by | tion. External callers can cause PCRE to use its internal tables by |
| 1660 | passing a NULL table pointer. | passing a NULL table pointer. |
| 1661 | ||
| 1662 | PCRE_INFO_FIRSTBYTE | PCRE_INFO_FIRSTBYTE |
| 1663 | ||
| 1664 | Return information about the first byte of any matched string, for a | Return information about the first byte of any matched string, for a |
| 1665 | non-anchored pattern. The fourth argument should point to an int vari- | non-anchored pattern. The fourth argument should point to an int vari- |
| 1666 | able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name | able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
| 1667 | is still recognized for backwards compatibility.) | is still recognized for backwards compatibility.) |
| 1668 | ||
| 1669 | If there is a fixed first byte, for example, from a pattern such as | If there is a fixed first byte, for example, from a pattern such as |
| 1670 | (cat|cow|coyote), its value is returned. Otherwise, if either | (cat|cow|coyote), its value is returned. Otherwise, if either |
| 1671 | ||
| 1672 | (a) the pattern was compiled with the PCRE_MULTILINE option, and every | (a) the pattern was compiled with the PCRE_MULTILINE option, and every |
| 1673 | branch starts with "^", or | branch starts with "^", or |
| 1674 | ||
| 1675 | (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not | (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
| 1676 | set (if it were set, the pattern would be anchored), | set (if it were set, the pattern would be anchored), |
| 1677 | ||
| 1678 | -1 is returned, indicating that the pattern matches only at the start | -1 is returned, indicating that the pattern matches only at the start |
| 1679 | of a subject string or after any newline within the string. Otherwise | of a subject string or after any newline within the string. Otherwise |
| 1680 | -2 is returned. For anchored patterns, -2 is returned. | -2 is returned. For anchored patterns, -2 is returned. |
| 1681 | ||
| 1682 | PCRE_INFO_FIRSTTABLE | PCRE_INFO_FIRSTTABLE |
| 1683 | ||
| 1684 | If the pattern was studied, and this resulted in the construction of a | If the pattern was studied, and this resulted in the construction of a |
| 1685 | 256-bit table indicating a fixed set of bytes for the first byte in any | 256-bit table indicating a fixed set of bytes for the first byte in any |
| 1686 | matching string, a pointer to the table is returned. Otherwise NULL is | matching string, a pointer to the table is returned. Otherwise NULL is |
| 1687 | returned. The fourth argument should point to an unsigned char * vari- | returned. The fourth argument should point to an unsigned char * vari- |
| 1688 | able. | able. |
| 1689 | ||
| 1690 | PCRE_INFO_HASCRORLF | PCRE_INFO_HASCRORLF |
| 1691 | ||
| 1692 | Return 1 if the pattern contains any explicit matches for CR or LF | Return 1 if the pattern contains any explicit matches for CR or LF |
| 1693 | characters, otherwise 0. The fourth argument should point to an int | characters, otherwise 0. The fourth argument should point to an int |
| 1694 | variable. An explicit match is either a literal CR or LF character, or | variable. An explicit match is either a literal CR or LF character, or |
| 1695 | \r or \n. | \r or \n. |
| 1696 | ||
| 1697 | PCRE_INFO_JCHANGED | PCRE_INFO_JCHANGED |
| 1698 | ||
| 1699 | Return 1 if the (?J) or (?-J) option setting is used in the pattern, | Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
| 1700 | otherwise 0. The fourth argument should point to an int variable. (?J) | otherwise 0. The fourth argument should point to an int variable. (?J) |
| 1701 | and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. | and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
| 1702 | ||
| 1703 | PCRE_INFO_JIT | |
| 1704 | ||
| 1705 | Return 1 if the pattern was studied with the PCRE_STUDY_JIT_COMPILE | |
| 1706 | option, and just-in-time compiling was successful. The fourth argument | |
| 1707 | should point to an int variable. A return value of 0 means that JIT | |
| 1708 | support is not available in this version of PCRE, or that the pattern | |
| 1709 | was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT | |
| 1710 | compiler could not handle this particular pattern. See the pcrejit doc- | |
| 1711 | umentation for details of what can and cannot be handled. | |
| 1712 | ||
| 1713 | PCRE_INFO_LASTLITERAL | PCRE_INFO_LASTLITERAL |
| 1714 | ||
| 1715 | Return the value of the rightmost literal byte that must exist in any | Return the value of the rightmost literal byte that must exist in any |
| 1716 | matched string, other than at its start, if such a byte has been | matched string, other than at its start, if such a byte has been |
| 1717 | recorded. The fourth argument should point to an int variable. If there | recorded. The fourth argument should point to an int variable. If there |
| 1718 | is no such byte, -1 is returned. For anchored patterns, a last literal | is no such byte, -1 is returned. For anchored patterns, a last literal |
| 1719 | byte is recorded only if it follows something of variable length. For | byte is recorded only if it follows something of variable length. For |
| 1720 | example, for the pattern /^a\d+z\d+/ the returned value is "z", but for | example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
| 1721 | /^a\dz\d/ the returned value is -1. | /^a\dz\d/ the returned value is -1. |
| 1722 | ||
| 1723 | PCRE_INFO_MINLENGTH | PCRE_INFO_MINLENGTH |
| 1724 | ||
| 1725 | If the pattern was studied and a minimum length for matching subject | If the pattern was studied and a minimum length for matching subject |
| 1726 | strings was computed, its value is returned. Otherwise the returned | strings was computed, its value is returned. Otherwise the returned |
| 1727 | value is -1. The value is a number of characters, not bytes (this may | value is -1. The value is a number of characters, not bytes (this may |
| 1728 | be relevant in UTF-8 mode). The fourth argument should point to an int | be relevant in UTF-8 mode). The fourth argument should point to an int |
| 1729 | variable. A non-negative value is a lower bound to the length of any | variable. A non-negative value is a lower bound to the length of any |
| 1730 | matching string. There may not be any strings of that length that do | matching string. There may not be any strings of that length that do |
| 1731 | actually match, but every string that does match is at least that long. | actually match, but every string that does match is at least that long. |
| 1732 | ||
| 1733 | PCRE_INFO_NAMECOUNT | PCRE_INFO_NAMECOUNT |
| 1734 | PCRE_INFO_NAMEENTRYSIZE | PCRE_INFO_NAMEENTRYSIZE |
| 1735 | PCRE_INFO_NAMETABLE | PCRE_INFO_NAMETABLE |
| 1736 | ||
| 1737 | PCRE supports the use of named as well as numbered capturing parenthe- | PCRE supports the use of named as well as numbered capturing parenthe- |
| 1738 | ses. The names are just an additional way of identifying the parenthe- | ses. The names are just an additional way of identifying the parenthe- |
| 1739 | ses, which still acquire numbers. Several convenience functions such as | ses, which still acquire numbers. Several convenience functions such as |
| 1740 | pcre_get_named_substring() are provided for extracting captured sub- | pcre_get_named_substring() are provided for extracting captured sub- |
| 1741 | strings by name. It is also possible to extract the data directly, by | strings by name. It is also possible to extract the data directly, by |
| 1742 | first converting the name to a number in order to access the correct | first converting the name to a number in order to access the correct |
| 1743 | pointers in the output vector (described with pcre_exec() below). To do | pointers in the output vector (described with pcre_exec() below). To do |
| 1744 | the conversion, you need to use the name-to-number map, which is | the conversion, you need to use the name-to-number map, which is |
| 1745 | described by these three values. | described by these three values. |
| 1746 | ||
| 1747 | The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT | The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
| 1748 | gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size | gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
| 1749 | of each entry; both of these return an int value. The entry size | of each entry; both of these return an int value. The entry size |
| 1750 | depends on the length of the longest name. PCRE_INFO_NAMETABLE returns | depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
| 1751 | a pointer to the first entry of the table (a pointer to char). The | a pointer to the first entry of the table (a pointer to char). The |
| 1752 | first two bytes of each entry are the number of the capturing parenthe- | first two bytes of each entry are the number of the capturing parenthe- |
| 1753 | sis, most significant byte first. The rest of the entry is the corre- | sis, most significant byte first. The rest of the entry is the corre- |
| 1754 | sponding name, zero terminated. | sponding name, zero terminated. |
| 1755 | ||
| 1756 | The names are in alphabetical order. Duplicate names may appear if (?| | The names are in alphabetical order. Duplicate names may appear if (?| |
| 1757 | is used to create multiple groups with the same number, as described in | is used to create multiple groups with the same number, as described in |
| 1758 | the section on duplicate subpattern numbers in the pcrepattern page. | the section on duplicate subpattern numbers in the pcrepattern page. |
| 1759 | Duplicate names for subpatterns with different numbers are permitted | Duplicate names for subpatterns with different numbers are permitted |
| 1760 | only if PCRE_DUPNAMES is set. In all cases of duplicate names, they | only if PCRE_DUPNAMES is set. In all cases of duplicate names, they |
| 1761 | appear in the table in the order in which they were found in the pat- | appear in the table in the order in which they were found in the pat- |
| 1762 | tern. In the absence of (?| this is the order of increasing number; | tern. In the absence of (?| this is the order of increasing number; |
| 1763 | when (?| is used this is not necessarily the case because later subpat- | when (?| is used this is not necessarily the case because later subpat- |
| 1764 | terns may have lower numbers. | terns may have lower numbers. |
| 1765 | ||
| 1766 | As a simple example of the name/number table, consider the following | As a simple example of the name/number table, consider the following |
| 1767 | pattern (assume PCRE_EXTENDED is set, so white space - including new- | pattern (assume PCRE_EXTENDED is set, so white space - including new- |
| 1768 | lines - is ignored): | lines - is ignored): |
| 1769 | ||
| 1770 | (?<date> (?<year>(\d\d)?\d\d) - | (?<date> (?<year>(\d\d)?\d\d) - |
| 1771 | (?<month>\d\d) - (?<day>\d\d) ) | (?<month>\d\d) - (?<day>\d\d) ) |
| 1772 | ||
| 1773 | There are four named subpatterns, so the table has four entries, and | There are four named subpatterns, so the table has four entries, and |
| 1774 | each entry in the table is eight bytes long. The table is as follows, | each entry in the table is eight bytes long. The table is as follows, |
| 1775 | with non-printing bytes shown in hexadecimal, and undefined bytes shown | with non-printing bytes shown in hexadecimal, and undefined bytes shown |
| 1776 | as ??: | as ??: |
| 1777 | ||
| # | Line 1840 INFORMATION ABOUT A PATTERN | Line 1780 INFORMATION ABOUT A PATTERN |
| 1780 | 00 04 m o n t h 00 | 00 04 m o n t h 00 |
| 1781 | 00 02 y e a r 00 ?? | 00 02 y e a r 00 ?? |
| 1782 | ||
| 1783 | When writing code to extract data from named subpatterns using the | When writing code to extract data from named subpatterns using the |
| 1784 | name-to-number map, remember that the length of the entries is likely | name-to-number map, remember that the length of the entries is likely |
| 1785 | to be different for each compiled pattern. | to be different for each compiled pattern. |
| 1786 | ||
| 1787 | PCRE_INFO_OKPARTIAL | PCRE_INFO_OKPARTIAL |
| 1788 | ||
| 1789 | Return 1 if the pattern can be used for partial matching with | Return 1 if the pattern can be used for partial matching with |
| 1790 | pcre_exec(), otherwise 0. The fourth argument should point to an int | pcre_exec(), otherwise 0. The fourth argument should point to an int |
| 1791 | variable. From release 8.00, this always returns 1, because the | variable. From release 8.00, this always returns 1, because the |
| 1792 | restrictions that previously applied to partial matching have been | restrictions that previously applied to partial matching have been |
| 1793 | lifted. The pcrepartial documentation gives details of partial match- | lifted. The pcrepartial documentation gives details of partial match- |
| 1794 | ing. | ing. |
| 1795 | ||
| 1796 | PCRE_INFO_OPTIONS | PCRE_INFO_OPTIONS |
| 1797 | ||
| 1798 | Return a copy of the options with which the pattern was compiled. The | Return a copy of the options with which the pattern was compiled. The |
| 1799 | fourth argument should point to an unsigned long int variable. These | fourth argument should point to an unsigned long int variable. These |
| 1800 | option bits are those specified in the call to pcre_compile(), modified | option bits are those specified in the call to pcre_compile(), modified |
| 1801 | by any top-level option settings at the start of the pattern itself. In | by any top-level option settings at the start of the pattern itself. In |
| 1802 | other words, they are the options that will be in force when matching | other words, they are the options that will be in force when matching |
| 1803 | starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with | starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with |
| 1804 | the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, | the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, |
| 1805 | and PCRE_EXTENDED. | and PCRE_EXTENDED. |
| 1806 | ||
| 1807 | A pattern is automatically anchored by PCRE if all of its top-level | A pattern is automatically anchored by PCRE if all of its top-level |
| 1808 | alternatives begin with one of the following: | alternatives begin with one of the following: |
| 1809 | ||
| 1810 | ^ unless PCRE_MULTILINE is set | ^ unless PCRE_MULTILINE is set |
| # | Line 1878 INFORMATION ABOUT A PATTERN | Line 1818 INFORMATION ABOUT A PATTERN |
| 1818 | ||
| 1819 | PCRE_INFO_SIZE | PCRE_INFO_SIZE |
| 1820 | ||
| 1821 | Return the size of the compiled pattern, that is, the value that was | Return the size of the compiled pattern, that is, the value that was |
| 1822 | passed as the argument to pcre_malloc() when PCRE was getting memory in | passed as the argument to pcre_malloc() when PCRE was getting memory in |
| 1823 | which to place the compiled data. The fourth argument should point to a | which to place the compiled data. The fourth argument should point to a |
| 1824 | size_t variable. | size_t variable. |
| # | Line 1886 INFORMATION ABOUT A PATTERN | Line 1826 INFORMATION ABOUT A PATTERN |
| 1826 | PCRE_INFO_STUDYSIZE | PCRE_INFO_STUDYSIZE |
| 1827 | ||
| 1828 | Return the size of the data block pointed to by the study_data field in | Return the size of the data block pointed to by the study_data field in |
| 1829 | a pcre_extra block. That is, it is the value that was passed to | a pcre_extra block. If pcre_extra is NULL, or there is no study data, |
| 1830 | pcre_malloc() when PCRE was getting memory into which to place the data | zero is returned. The fourth argument should point to a size_t vari- |
| 1831 | created by pcre_study(). If pcre_extra is NULL, or there is no study | able. The study_data field is set by pcre_study() to record informa- |
| 1832 | data, zero is returned. The fourth argument should point to a size_t | tion that will speed up matching (see the section entitled "Studying a |
| 1833 | variable. | pattern" above). The format of the study_data block is private, but its |
| 1834 | length is made available via this option so that it can be saved and | |
| 1835 | restored (see the pcreprecompile documentation for details). | |
| 1836 | ||
| 1837 | ||
| 1838 | OBSOLETE INFO FUNCTION | OBSOLETE INFO FUNCTION |
| 1839 | ||
| 1840 | int pcre_info(const pcre *code, int *optptr, int *firstcharptr); | int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
| 1841 | ||
| 1842 | The pcre_info() function is now obsolete because its interface is too | The pcre_info() function is now obsolete because its interface is too |
| 1843 | restrictive to return all the available data about a compiled pattern. | restrictive to return all the available data about a compiled pattern. |
| 1844 | New programs should use pcre_fullinfo() instead. The yield of | New programs should use pcre_fullinfo() instead. The yield of |
| 1845 | pcre_info() is the number of capturing subpatterns, or one of the fol- | pcre_info() is the number of capturing subpatterns, or one of the fol- |
| 1846 | lowing negative numbers: | lowing negative numbers: |
| 1847 | ||
| 1848 | PCRE_ERROR_NULL the argument code was NULL | PCRE_ERROR_NULL the argument code was NULL |
| 1849 | PCRE_ERROR_BADMAGIC the "magic number" was not found | PCRE_ERROR_BADMAGIC the "magic number" was not found |
| 1850 | ||
| 1851 | If the optptr argument is not NULL, a copy of the options with which | If the optptr argument is not NULL, a copy of the options with which |
| 1852 | the pattern was compiled is placed in the integer it points to (see | the pattern was compiled is placed in the integer it points to (see |
| 1853 | PCRE_INFO_OPTIONS above). | PCRE_INFO_OPTIONS above). |
| 1854 | ||
| 1855 | If the pattern is not anchored and the firstcharptr argument is not | If the pattern is not anchored and the firstcharptr argument is not |
| 1856 | NULL, it is used to pass back information about the first character of | NULL, it is used to pass back information about the first character of |
| 1857 | any matched string (see PCRE_INFO_FIRSTBYTE above). | any matched string (see PCRE_INFO_FIRSTBYTE above). |
| 1858 | ||
| 1859 | ||
| # | Line 1919 REFERENCE COUNTS | Line 1861 REFERENCE COUNTS |
| 1861 | ||
| 1862 | int pcre_refcount(pcre *code, int adjust); | int pcre_refcount(pcre *code, int adjust); |
| 1863 | ||
| 1864 | The pcre_refcount() function is used to maintain a reference count in | The pcre_refcount() function is used to maintain a reference count in |
| 1865 | the data block that contains a compiled pattern. It is provided for the | the data block that contains a compiled pattern. It is provided for the |
| 1866 | benefit of applications that operate in an object-oriented manner, | benefit of applications that operate in an object-oriented manner, |
| 1867 | where different parts of the application may be using the same compiled | where different parts of the application may be using the same compiled |
| 1868 | pattern, but you want to free the block when they are all done. | pattern, but you want to free the block when they are all done. |
| 1869 | ||
| 1870 | When a pattern is compiled, the reference count field is initialized to | When a pattern is compiled, the reference count field is initialized to |
| 1871 | zero. It is changed only by calling this function, whose action is to | zero. It is changed only by calling this function, whose action is to |
| 1872 | add the adjust value (which may be positive or negative) to it. The | add the adjust value (which may be positive or negative) to it. The |
| 1873 | yield of the function is the new value. However, the value of the count | yield of the function is the new value. However, the value of the count |
| 1874 | is constrained to lie between 0 and 65535, inclusive. If the new value | is constrained to lie between 0 and 65535, inclusive. If the new value |
| 1875 | is outside these limits, it is forced to the appropriate limit value. | is outside these limits, it is forced to the appropriate limit value. |
| 1876 | ||
| 1877 | Except when it is zero, the reference count is not correctly preserved | Except when it is zero, the reference count is not correctly preserved |
| 1878 | if a pattern is compiled on one host and then transferred to a host | if a pattern is compiled on one host and then transferred to a host |
| 1879 | whose byte-order is different. (This seems a highly unlikely scenario.) | whose byte-order is different. (This seems a highly unlikely scenario.) |
| 1880 | ||
| 1881 | ||
| # | Line 1943 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 1885 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 1885 | const char *subject, int length, int startoffset, | const char *subject, int length, int startoffset, |
| 1886 | int options, int *ovector, int ovecsize); | int options, int *ovector, int ovecsize); |
| 1887 | ||
| 1888 | The function pcre_exec() is called to match a subject string against a | The function pcre_exec() is called to match a subject string against a |
| 1889 | compiled pattern, which is passed in the code argument. If the pattern | compiled pattern, which is passed in the code argument. If the pattern |
| 1890 | was studied, the result of the study should be passed in the extra | was studied, the result of the study should be passed in the extra |
| 1891 | argument. This function is the main matching facility of the library, | argument. You can call pcre_exec() with the same code and extra argu- |
| 1892 | and it operates in a Perl-like manner. For specialist use there is also | ments as many times as you like, in order to match different subject |
| 1893 | an alternative matching function, which is described below in the sec- | strings with the same pattern. |
| 1894 | tion about the pcre_dfa_exec() function. | |
| 1895 | This function is the main matching facility of the library, and it | |
| 1896 | operates in a Perl-like manner. For specialist use there is also an | |
| 1897 | alternative matching function, which is described below in the section | |
| 1898 | about the pcre_dfa_exec() function. | |
| 1899 | ||
| 1900 | In most applications, the pattern will have been compiled (and option- | In most applications, the pattern will have been compiled (and option- |
| 1901 | ally studied) in the same process that calls pcre_exec(). However, it | ally studied) in the same process that calls pcre_exec(). However, it |
| 1902 | is possible to save compiled patterns and study data, and then use them | is possible to save compiled patterns and study data, and then use them |
| 1903 | later in different processes, possibly even on different hosts. For a | later in different processes, possibly even on different hosts. For a |
| 1904 | discussion about this, see the pcreprecompile documentation. | discussion about this, see the pcreprecompile documentation. |
| 1905 | ||
| 1906 | Here is an example of a simple call to pcre_exec(): | Here is an example of a simple call to pcre_exec(): |
| # | Line 1973 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 1919 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 1919 | ||
| 1920 | Extra data for pcre_exec() | Extra data for pcre_exec() |
| 1921 | ||
| 1922 | If the extra argument is not NULL, it must point to a pcre_extra data | If the extra argument is not NULL, it must point to a pcre_extra data |
| 1923 | block. The pcre_study() function returns such a block (when it doesn't | block. The pcre_study() function returns such a block (when it doesn't |
| 1924 | return NULL), but you can also create one for yourself, and pass addi- | return NULL), but you can also create one for yourself, and pass addi- |
| 1925 | tional information in it. The pcre_extra block contains the following | tional information in it. The pcre_extra block contains the following |
| 1926 | fields (not necessarily in this order): | fields (not necessarily in this order): |
| 1927 | ||
| 1928 | unsigned long int flags; | unsigned long int flags; |
| 1929 | void *study_data; | void *study_data; |
| 1930 | void *executable_jit; | |
| 1931 | unsigned long int match_limit; | unsigned long int match_limit; |
| 1932 | unsigned long int match_limit_recursion; | unsigned long int match_limit_recursion; |
| 1933 | void *callout_data; | void *callout_data; |
| 1934 | const unsigned char *tables; | const unsigned char *tables; |
| 1935 | unsigned char **mark; | unsigned char **mark; |
| 1936 | ||
| 1937 | The flags field is a bitmap that specifies which of the other fields | The flags field is a bitmap that specifies which of the other fields |
| 1938 | are set. The flag bits are: | are set. The flag bits are: |
| 1939 | ||
| 1940 | PCRE_EXTRA_STUDY_DATA | PCRE_EXTRA_STUDY_DATA |
| 1941 | PCRE_EXTRA_EXECUTABLE_JIT | |
| 1942 | PCRE_EXTRA_MATCH_LIMIT | PCRE_EXTRA_MATCH_LIMIT |
| 1943 | PCRE_EXTRA_MATCH_LIMIT_RECURSION | PCRE_EXTRA_MATCH_LIMIT_RECURSION |
| 1944 | PCRE_EXTRA_CALLOUT_DATA | PCRE_EXTRA_CALLOUT_DATA |
| 1945 | PCRE_EXTRA_TABLES | PCRE_EXTRA_TABLES |
| 1946 | PCRE_EXTRA_MARK | PCRE_EXTRA_MARK |
| 1947 | ||
| 1948 | Other flag bits should be set to zero. The study_data field is set in | Other flag bits should be set to zero. The study_data field and some- |
| 1949 | the pcre_extra block that is returned by pcre_study(), together with | times the executable_jit field are set in the pcre_extra block that is |
| 1950 | the appropriate flag bit. You should not set this yourself, but you may | returned by pcre_study(), together with the appropriate flag bits. You |
| 1951 | add to the block by setting the other fields and their corresponding | should not set these yourself, but you may add to the block by setting |
| 1952 | flag bits. | the other fields and their corresponding flag bits. |
| 1953 | ||
| 1954 | The match_limit field provides a means of preventing PCRE from using up | The match_limit field provides a means of preventing PCRE from using up |
| 1955 | a vast amount of resources when running patterns that are not going to | a vast amount of resources when running patterns that are not going to |
| 1956 | match, but which have a very large number of possibilities in their | match, but which have a very large number of possibilities in their |
| 1957 | search trees. The classic example is a pattern that uses nested unlim- | search trees. The classic example is a pattern that uses nested unlim- |
| 1958 | ited repeats. | ited repeats. |
| 1959 | ||
| 1960 | Internally, PCRE uses a function called match() which it calls repeat- | Internally, pcre_exec() uses a function called match(), which it calls |
| 1961 | edly (sometimes recursively). The limit set by match_limit is imposed | repeatedly (sometimes recursively). The limit set by match_limit is |
| 1962 | on the number of times this function is called during a match, which | imposed on the number of times this function is called during a match, |
| 1963 | has the effect of limiting the amount of backtracking that can take | which has the effect of limiting the amount of backtracking that can |
| 1964 | place. For patterns that are not anchored, the count restarts from zero | take place. For patterns that are not anchored, the count restarts from |
| 1965 | for each position in the subject string. | zero for each position in the subject string. |
| 1966 | ||
| 1967 | When pcre_exec() is called with a pattern that was successfully studied | |
| 1968 | with the PCRE_STUDY_JIT_COMPILE option, the way that the matching is | |
| 1969 | executed is entirely different. However, there is still the possibility | |
| 1970 | of runaway matching that goes on for a very long time, and so the | |
| 1971 | match_limit value is also used in this case (but in a different way) to | |
| 1972 | limit how long the matching can continue. | |
| 1973 | ||
| 1974 | The default value for the limit can be set when PCRE is built; the | The default value for the limit can be set when PCRE is built; the |
| 1975 | default default is 10 million, which handles all but the most extreme | default default is 10 million, which handles all but the most extreme |
| # | Line 2029 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 1984 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 1984 | the total number of calls, because not all calls to match() are recur- | the total number of calls, because not all calls to match() are recur- |
| 1985 | sive. This limit is of use only if it is set smaller than match_limit. | sive. This limit is of use only if it is set smaller than match_limit. |
| 1986 | ||
| 1987 | Limiting the recursion depth limits the amount of stack that can be | Limiting the recursion depth limits the amount of machine stack that |
| 1988 | used, or, when PCRE has been compiled to use memory on the heap instead | can be used, or, when PCRE has been compiled to use memory on the heap |
| 1989 | of the stack, the amount of heap memory that can be used. | instead of the stack, the amount of heap memory that can be used. This |
| 1990 | limit is not relevant, and is ignored, if the pattern was successfully | |
| 1991 | studied with PCRE_STUDY_JIT_COMPILE. | |
| 1992 | ||
| 1993 | The default value for match_limit_recursion can be set when PCRE is | The default value for match_limit_recursion can be set when PCRE is |
| 1994 | built; the default default is the same value as the default for | built; the default default is the same value as the default for |
| # | Line 2074 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2031 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 2031 | PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and | PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and |
| 2032 | PCRE_PARTIAL_HARD. | PCRE_PARTIAL_HARD. |
| 2033 | ||
| 2034 | If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE | |
| 2035 | option, the only supported options for JIT execution are | |
| 2036 | PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and | |
| 2037 | PCRE_NOTEMPTY_ATSTART. Note in particular that partial matching is not | |
| 2038 | supported. If an unsupported option is used, JIT execution is disabled | |
| 2039 | and the normal interpretive code in pcre_exec() is run. | |
| 2040 | ||
| 2041 | PCRE_ANCHORED | PCRE_ANCHORED |
| 2042 | ||
| 2043 | The PCRE_ANCHORED option limits pcre_exec() to matching at the first | The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
| 2044 | matching position. If a pattern was compiled with PCRE_ANCHORED, or | matching position. If a pattern was compiled with PCRE_ANCHORED, or |
| 2045 | turned out to be anchored by virtue of its contents, it cannot be made | turned out to be anchored by virtue of its contents, it cannot be made |
| 2046 | unanchored at matching time. | unanchored at matching time. |
| 2047 | ||
| 2048 | PCRE_BSR_ANYCRLF | PCRE_BSR_ANYCRLF |
| 2049 | PCRE_BSR_UNICODE | PCRE_BSR_UNICODE |
| 2050 | ||
| 2051 | These options (which are mutually exclusive) control what the \R escape | These options (which are mutually exclusive) control what the \R escape |
| 2052 | sequence matches. The choice is either to match only CR, LF, or CRLF, | sequence matches. The choice is either to match only CR, LF, or CRLF, |
| 2053 | or to match any Unicode newline sequence. These options override the | or to match any Unicode newline sequence. These options override the |
| 2054 | choice that was made or defaulted when the pattern was compiled. | choice that was made or defaulted when the pattern was compiled. |
| 2055 | ||
| 2056 | PCRE_NEWLINE_CR | PCRE_NEWLINE_CR |
| # | Line 2095 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2059 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 2059 | PCRE_NEWLINE_ANYCRLF | PCRE_NEWLINE_ANYCRLF |
| 2060 | PCRE_NEWLINE_ANY | PCRE_NEWLINE_ANY |
| 2061 | ||
| 2062 | These options override the newline definition that was chosen or | These options override the newline definition that was chosen or |
| 2063 | defaulted when the pattern was compiled. For details, see the descrip- | defaulted when the pattern was compiled. For details, see the descrip- |
| 2064 | tion of pcre_compile() above. During matching, the newline choice | tion of pcre_compile() above. During matching, the newline choice |
| 2065 | affects the behaviour of the dot, circumflex, and dollar metacharac- | affects the behaviour of the dot, circumflex, and dollar metacharac- |
| 2066 | ters. It may also alter the way the match position is advanced after a | ters. It may also alter the way the match position is advanced after a |
| 2067 | match failure for an unanchored pattern. | match failure for an unanchored pattern. |
| 2068 | ||
| 2069 | When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is | When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
| 2070 | set, and a match attempt for an unanchored pattern fails when the cur- | set, and a match attempt for an unanchored pattern fails when the cur- |
| 2071 | rent position is at a CRLF sequence, and the pattern contains no | rent position is at a CRLF sequence, and the pattern contains no |
| 2072 | explicit matches for CR or LF characters, the match position is | explicit matches for CR or LF characters, the match position is |
| 2073 | advanced by two characters instead of one, in other words, to after the | advanced by two characters instead of one, in other words, to after the |
| 2074 | CRLF. | CRLF. |
| 2075 | ||
| 2076 | The above rule is a compromise that makes the most common cases work as | The above rule is a compromise that makes the most common cases work as |
| 2077 | expected. For example, if the pattern is .+A (and the PCRE_DOTALL | expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
| 2078 | option is not set), it does not match the string "\r\nA" because, after | option is not set), it does not match the string "\r\nA" because, after |
| 2079 | failing at the start, it skips both the CR and the LF before retrying. | failing at the start, it skips both the CR and the LF before retrying. |
| 2080 | However, the pattern [\r\n]A does match that string, because it con- | However, the pattern [\r\n]A does match that string, because it con- |
| 2081 | tains an explicit CR or LF reference, and so advances only by one char- | tains an explicit CR or LF reference, and so advances only by one char- |
| 2082 | acter after the first failure. | acter after the first failure. |
| 2083 | ||
| 2084 | An explicit match for CR or LF is either a literal appearance of one of | An explicit match for CR or LF is either a literal appearance of one of |
| 2085 | those characters, or one of the \r or \n escape sequences. Implicit | those characters, or one of the \r or \n escape sequences. Implicit |
| 2086 | matches such as [^X] do not count, nor does \s (which includes CR and | matches such as [^X] do not count, nor does \s (which includes CR and |
| 2087 | LF in the characters that it matches). | LF in the characters that it matches). |
| 2088 | ||
| 2089 | Notwithstanding the above, anomalous effects may still occur when CRLF | Notwithstanding the above, anomalous effects may still occur when CRLF |
| 2090 | is a valid newline sequence and explicit \r or \n escapes appear in the | is a valid newline sequence and explicit \r or \n escapes appear in the |
| 2091 | pattern. | pattern. |
| 2092 | ||
| 2093 | PCRE_NOTBOL | PCRE_NOTBOL |
| 2094 | ||
| 2095 | This option specifies that the first character of the subject string is not | This option specifies that the first character of the subject string is not |
| 2096 | the beginning of a line, so the circumflex metacharacter should not | the beginning of a line, so the circumflex metacharacter should not |
| 2097 | match before it. Setting this without PCRE_MULTILINE (at compile time) | match before it. Setting this without PCRE_MULTILINE (at compile time) |
| 2098 | causes circumflex never to match. This option affects only the behav- | causes circumflex never to match. This option affects only the behav- |
| 2099 | iour of the circumflex metacharacter. It does not affect \A. | iour of the circumflex metacharacter. It does not affect \A. |
| 2100 | ||
| 2101 | PCRE_NOTEOL | PCRE_NOTEOL |
| 2102 | ||
| 2103 | This option specifies that the end of the subject string is not the end | This option specifies that the end of the subject string is not the end |
| 2104 | of a line, so the dollar metacharacter should not match it nor (except | of a line, so the dollar metacharacter should not match it nor (except |
| 2105 | in multiline mode) a newline immediately before it. Setting this with- | in multiline mode) a newline immediately before it. Setting this with- |
| 2106 | out PCRE_MULTILINE (at compile time) causes dollar never to match. This | out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
| 2107 | option affects only the behaviour of the dollar metacharacter. It does | option affects only the behaviour of the dollar metacharacter. It does |
| 2108 | not affect \Z or \z. | not affect \Z or \z. |
| 2109 | ||
| 2110 | PCRE_NOTEMPTY | PCRE_NOTEMPTY |
| 2111 | ||
| 2112 | An empty string is not considered to be a valid match if this option is | An empty string is not considered to be a valid match if this option is |
| 2113 | set. If there are alternatives in the pattern, they are tried. If all | set. If there are alternatives in the pattern, they are tried. If all |
| 2114 | the alternatives match the empty string, the entire match fails. For | the alternatives match the empty string, the entire match fails. For |
| 2115 | example, if the pattern | example, if the pattern |
| 2116 | ||
| 2117 | a?b? | a?b? |
| 2118 | ||
| 2119 | is applied to a string not beginning with "a" or "b", it matches an | is applied to a string not beginning with "a" or "b", it matches an |
| 2120 | empty string at the start of the subject. With PCRE_NOTEMPTY set, this | empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
| 2121 | match is not valid, so PCRE searches further into the string for occur- | match is not valid, so PCRE searches further into the string for occur- |
| 2122 | rences of "a" or "b". | rences of "a" or "b". |
| 2123 | ||
| 2124 | PCRE_NOTEMPTY_ATSTART | PCRE_NOTEMPTY_ATSTART |
| 2125 | ||
| 2126 | This is like PCRE_NOTEMPTY, except that an empty string match that is | This is like PCRE_NOTEMPTY, except that an empty string match that is |
| 2127 | not at the start of the subject is permitted. If the pattern is | not at the start of the subject is permitted. If the pattern is |
| 2128 | anchored, such a match can occur only if the pattern contains \K. | anchored, such a match can occur only if the pattern contains \K. |
| 2129 | ||
| 2130 | Perl has no direct equivalent of PCRE_NOTEMPTY or | Perl has no direct equivalent of PCRE_NOTEMPTY or |
| 2131 | PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern | PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern |
| 2132 | match of the empty string within its split() function, and when using | match of the empty string within its split() function, and when using |
| 2133 | the /g modifier. It is possible to emulate Perl's behaviour after | the /g modifier. It is possible to emulate Perl's behaviour after |
| 2134 | matching a null string by first trying the match again at the same off- | matching a null string by first trying the match again at the same off- |
| 2135 | set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that | set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that |
| 2136 | fails, by advancing the starting offset (see below) and trying an ordi- | fails, by advancing the starting offset (see below) and trying an ordi- |
| 2137 | nary match again. There is some code that demonstrates how to do this | nary match again. There is some code that demonstrates how to do this |
| 2138 | in the pcredemo sample program. In the most general case, you have to | in the pcredemo sample program. In the most general case, you have to |
| 2139 | check to see if the newline convention recognizes CRLF as a newline, | check to see if the newline convention recognizes CRLF as a newline, |
| 2140 | and if so, and the current character is CR followed by LF, advance the | and if so, and the current character is CR followed by LF, advance the |
| 2141 | starting offset by two characters instead of one. | starting offset by two characters instead of one. |
| 2142 | ||
| 2143 | PCRE_NO_START_OPTIMIZE | PCRE_NO_START_OPTIMIZE |
| 2144 | ||
| 2145 | There are a number of optimizations that pcre_exec() uses at the start | There are a number of optimizations that pcre_exec() uses at the start |
| 2146 | of a match, in order to speed up the process. For example, if it is | of a match, in order to speed up the process. For example, if it is |
| 2147 | known that an unanchored match must start with a specific character, it | known that an unanchored match must start with a specific character, it |
| 2148 | searches the subject for that character, and fails immediately if it | searches the subject for that character, and fails immediately if it |
| 2149 | cannot find it, without actually running the main matching function. | cannot find it, without actually running the main matching function. |
| 2150 | This means that a special item such as (*COMMIT) at the start of a pat- | This means that a special item such as (*COMMIT) at the start of a pat- |
| 2151 | tern is not considered until after a suitable starting point for the | tern is not considered until after a suitable starting point for the |
| 2152 | match has been found. When callouts or (*MARK) items are in use, these | match has been found. When callouts or (*MARK) items are in use, these |
| 2153 | "start-up" optimizations can cause them to be skipped if the pattern is | "start-up" optimizations can cause them to be skipped if the pattern is |
| 2154 | never actually used. The start-up optimizations are in effect a pre- | never actually used. The start-up optimizations are in effect a pre- |
| 2155 | scan of the subject that takes place before the pattern is run. | scan of the subject that takes place before the pattern is run. |
| 2156 | ||
| 2157 | The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, | The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, |
| 2158 | possibly causing performance to suffer, but ensuring that in cases | possibly causing performance to suffer, but ensuring that in cases |
| 2159 | where the result is "no match", the callouts do occur, and that items | where the result is "no match", the callouts do occur, and that items |
| 2160 | such as (*COMMIT) and (*MARK) are considered at every possible starting | such as (*COMMIT) and (*MARK) are considered at every possible starting |
| 2161 | position in the subject string. If PCRE_NO_START_OPTIMIZE is set at | position in the subject string. If PCRE_NO_START_OPTIMIZE is set at |
| 2162 | compile time, it cannot be unset at matching time. | compile time, it cannot be unset at matching time. |
| 2163 | ||
| 2164 | Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching | Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching |
| 2165 | operation. Consider the pattern | operation. Consider the pattern |
| 2166 | ||
| 2167 | (*COMMIT)ABC | (*COMMIT)ABC |
| 2168 | ||
| 2169 | When this is compiled, PCRE records the fact that a match must start | When this is compiled, PCRE records the fact that a match must start |
| 2170 | with the character "A". Suppose the subject string is "DEFABC". The | with the character "A". Suppose the subject string is "DEFABC". The |
| 2171 | start-up optimization scans along the subject, finds "A" and runs the | start-up optimization scans along the subject, finds "A" and runs the |
| 2172 | first match attempt from there. The (*COMMIT) item means that the pat- | first match attempt from there. The (*COMMIT) item means that the pat- |
| 2173 | tern must match the current starting position, which in this case, it | tern must match the current starting position, which in this case, it |
| 2174 | does. However, if the same match is run with PCRE_NO_START_OPTIMIZE | does. However, if the same match is run with PCRE_NO_START_OPTIMIZE |
| 2175 | set, the initial scan along the subject string does not happen. The | set, the initial scan along the subject string does not happen. The |
| 2176 | first match attempt is run starting from "D" and when this fails, | first match attempt is run starting from "D" and when this fails, |
| 2177 | (*COMMIT) prevents any further matches being tried, so the overall | (*COMMIT) prevents any further matches being tried, so the overall |
| 2178 | result is "no match". If the pattern is studied, more start-up opti- | result is "no match". If the pattern is studied, more start-up opti- |
| 2179 | mizations may be used. For example, a minimum length for the subject | mizations may be used. For example, a minimum length for the subject |
| 2180 | may be recorded. Consider the pattern | may be recorded. Consider the pattern |
| 2181 | ||
| 2182 | (*MARK:A)(X|Y) | (*MARK:A)(X|Y) |
| 2183 | ||
| 2184 | The minimum length for a match is one character. If the subject is | The minimum length for a match is one character. If the subject is |
| 2185 | "ABC", there will be attempts to match "ABC", "BC", "C", and then | "ABC", there will be attempts to match "ABC", "BC", "C", and then |
| 2186 | finally an empty string. If the pattern is studied, the final attempt | finally an empty string. If the pattern is studied, the final attempt |
| 2187 | does not take place, because PCRE knows that the subject is too short, | does not take place, because PCRE knows that the subject is too short, |
| 2188 | and so the (*MARK) is never encountered. In this case, studying the | and so the (*MARK) is never encountered. In this case, studying the |
| 2189 | pattern does not affect the overall match result, which is still "no | pattern does not affect the overall match result, which is still "no |
| 2190 | match", but it does affect the auxiliary information that is returned. | match", but it does affect the auxiliary information that is returned. |
| 2191 | ||
| 2192 | PCRE_NO_UTF8_CHECK | PCRE_NO_UTF8_CHECK |
| 2193 | ||
| 2194 | When PCRE_UTF8 is set at compile time, the validity of the subject as a | When PCRE_UTF8 is set at compile time, the validity of the subject as a |
| 2195 | UTF-8 string is automatically checked when pcre_exec() is subsequently | UTF-8 string is automatically checked when pcre_exec() is subsequently |
| 2196 | called. The value of startoffset is also checked to ensure that it | called. The value of startoffset is also checked to ensure that it |
| 2197 | points to the start of a UTF-8 character. There is a discussion about | points to the start of a UTF-8 character. There is a discussion about |
| 2198 | the validity of UTF-8 strings in the section on UTF-8 support in the | the validity of UTF-8 strings in the section on UTF-8 support in the |
| 2199 | main pcre page. If an invalid UTF-8 sequence of bytes is found, | main pcre page. If an invalid UTF-8 sequence of bytes is found, |
| 2200 | pcre_exec() returns the error PCRE_ERROR_BADUTF8 or, if PCRE_PAR- | pcre_exec() returns the error PCRE_ERROR_BADUTF8 or, if PCRE_PAR- |
| 2201 | TIAL_HARD is set and the problem is a truncated UTF-8 character at the | TIAL_HARD is set and the problem is a truncated UTF-8 character at the |
| 2202 | end of the subject, PCRE_ERROR_SHORTUTF8. In both cases, information | end of the subject, PCRE_ERROR_SHORTUTF8. In both cases, information |
| 2203 | about the precise nature of the error may also be returned (see the | about the precise nature of the error may also be returned (see the |
| 2204 | descriptions of these errors in the section entitled Error return val- | descriptions of these errors in the section entitled Error return val- |
| 2205 | ues from pcre_exec() below). If startoffset contains a value that does | ues from pcre_exec() below). If startoffset contains a value that does |
| 2206 | not point to the start of a UTF-8 character (or to the end of the sub- | not point to the start of a UTF-8 character (or to the end of the sub- |
| 2207 | ject), PCRE_ERROR_BADUTF8_OFFSET is returned. | ject), PCRE_ERROR_BADUTF8_OFFSET is returned. |
| 2208 | ||
| 2209 | If you already know that your subject is valid, and you want to skip | If you already know that your subject is valid, and you want to skip |
| 2210 | these checks for performance reasons, you can set the | these checks for performance reasons, you can set the |
| 2211 | PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to | PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
| 2212 | do this for the second and subsequent calls to pcre_exec() if you are | do this for the second and subsequent calls to pcre_exec() if you are |
| 2213 | making repeated calls to find all the matches in a single subject | making repeated calls to find all the matches in a single subject |
| 2214 | string. However, you should be sure that the value of startoffset | string. However, you should be sure that the value of startoffset |
| 2215 | points to the start of a UTF-8 character (or the end of the subject). | points to the start of a UTF-8 character (or the end of the subject). |
| 2216 | When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 | When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 |
| 2217 | string as a subject or an invalid value of startoffset is undefined. | string as a subject or an invalid value of startoffset is undefined. |
| 2218 | Your program may crash. | Your program may crash. |
| 2219 | ||
| 2220 | PCRE_PARTIAL_HARD | PCRE_PARTIAL_HARD |
| 2221 | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_SOFT |
| 2222 | ||
| 2223 | These options turn on the partial matching feature. For backwards com- | These options turn on the partial matching feature. For backwards com- |
| 2224 | patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial | patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial |
| 2225 | match occurs if the end of the subject string is reached successfully, | match occurs if the end of the subject string is reached successfully, |
| 2226 | but there are not enough subject characters to complete the match. If | but there are not enough subject characters to complete the match. If |
| 2227 | this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, | this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, |
| 2228 | matching continues by testing any remaining alternatives. Only if no | matching continues by testing any remaining alternatives. Only if no |
| 2229 | complete match can be found is PCRE_ERROR_PARTIAL returned instead of | complete match can be found is PCRE_ERROR_PARTIAL returned instead of |
| 2230 | PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the | PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the |
| 2231 | caller is prepared to handle a partial match, but only if no complete | caller is prepared to handle a partial match, but only if no complete |
| 2232 | match can be found. | match can be found. |
| 2233 | ||
| 2234 | If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this | If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this |
| 2235 | case, if a partial match is found, pcre_exec() immediately returns | case, if a partial match is found, pcre_exec() immediately returns |
| 2236 | PCRE_ERROR_PARTIAL, without considering any other alternatives. In | PCRE_ERROR_PARTIAL, without considering any other alternatives. In |
| 2237 | other words, when PCRE_PARTIAL_HARD is set, a partial match is consid- | other words, when PCRE_PARTIAL_HARD is set, a partial match is consid- |
| 2238 | ered to be more important than an alternative complete match. | ered to be more important than an alternative complete match. |
| 2239 | ||
| 2240 | In both cases, the portion of the string that was inspected when the | In both cases, the portion of the string that was inspected when the |
| 2241 | partial match was found is set as the first matching string. There is a | partial match was found is set as the first matching string. There is a |
| 2242 | more detailed discussion of partial and multi-segment matching, with | more detailed discussion of partial and multi-segment matching, with |
| 2243 | examples, in the pcrepartial documentation. | examples, in the pcrepartial documentation. |
| 2244 | ||
| 2245 | The string to be matched by pcre_exec() | The string to be matched by pcre_exec() |
| 2246 | ||
| 2247 | The subject string is passed to pcre_exec() as a pointer in subject, a | The subject string is passed to pcre_exec() as a pointer in subject, a |
| 2248 | length (in bytes) in length, and a starting byte offset in startoffset. | length (in bytes) in length, and a starting byte offset in startoffset. |
| 2249 | If this is negative or greater than the length of the subject, | If this is negative or greater than the length of the subject, |
| 2250 | pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is | pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is |
| 2251 | zero, the search for a match starts at the beginning of the subject, | zero, the search for a match starts at the beginning of the subject, |
| 2252 | and this is by far the most common case. In UTF-8 mode, the byte offset | and this is by far the most common case. In UTF-8 mode, the byte offset |
| 2253 | must point to the start of a UTF-8 character (or the end of the sub- | must point to the start of a UTF-8 character (or the end of the sub- |
| 2254 | ject). Unlike the pattern string, the subject may contain binary zero | ject). Unlike the pattern string, the subject may contain binary zero |
| 2255 | bytes. | bytes. |
| 2256 | ||
| 2257 | A non-zero starting offset is useful when searching for another match | A non-zero starting offset is useful when searching for another match |
| 2258 | in the same subject by calling pcre_exec() again after a previous suc- | in the same subject by calling pcre_exec() again after a previous suc- |
| 2259 | cess. Setting startoffset differs from just passing over a shortened | cess. Setting startoffset differs from just passing over a shortened |
| 2260 | string and setting PCRE_NOTBOL in the case of a pattern that begins | string and setting PCRE_NOTBOL in the case of a pattern that begins |
| 2261 | with any kind of lookbehind. For example, consider the pattern | with any kind of lookbehind. For example, consider the pattern |
| 2262 | ||
| 2263 | \Biss\B | \Biss\B |
| 2264 | ||
| 2265 | which finds occurrences of "iss" in the middle of words. (\B matches | which finds occurrences of "iss" in the middle of words. (\B matches |
| 2266 | only if the current position in the subject is not a word boundary.) | only if the current position in the subject is not a word boundary.) |
| 2267 | When applied to the string "Mississipi" the first call to pcre_exec() | When applied to the string "Mississipi" the first call to pcre_exec() |
| 2268 | finds the first occurrence. If pcre_exec() is called again with just | finds the first occurrence. If pcre_exec() is called again with just |
| 2269 | the remainder of the subject, namely "issipi", it does not match, | the remainder of the subject, namely "issipi", it does not match, |
| 2270 | because \B is always false at the start of the subject, which is deemed | because \B is always false at the start of the subject, which is deemed |
| 2271 | to be a word boundary. However, if pcre_exec() is passed the entire | to be a word boundary. However, if pcre_exec() is passed the entire |
| 2272 | string again, but with startoffset set to 4, it finds the second occur- | string again, but with startoffset set to 4, it finds the second occur- |
| 2273 | rence of "iss" because it is able to look behind the starting point to | rence of "iss" because it is able to look behind the starting point to |
| 2274 | discover that it is preceded by a letter. | discover that it is preceded by a letter. |
| 2275 | ||
| 2276 | Finding all the matches in a subject is tricky when the pattern can | Finding all the matches in a subject is tricky when the pattern can |
| 2277 | match an empty string. It is possible to emulate Perl's /g behaviour by | match an empty string. It is possible to emulate Perl's /g behaviour by |
| 2278 | first trying the match again at the same offset, with the | first trying the match again at the same offset, with the |
| 2279 | PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that | PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that |
| 2280 | fails, advancing the starting offset and trying an ordinary match | fails, advancing the starting offset and trying an ordinary match |
| 2281 | again. There is some code that demonstrates how to do this in the pcre- | again. There is some code that demonstrates how to do this in the pcre- |
| 2282 | demo sample program. In the most general case, you have to check to see | demo sample program. In the most general case, you have to check to see |
| 2283 | if the newline convention recognizes CRLF as a newline, and if so, and | if the newline convention recognizes CRLF as a newline, and if so, and |
| 2284 | the current character is CR followed by LF, advance the starting offset | the current character is CR followed by LF, advance the starting offset |
| 2285 | by two characters instead of one. | by two characters instead of one. |
| 2286 | ||
| 2287 | If a non-zero starting offset is passed when the pattern is anchored, | If a non-zero starting offset is passed when the pattern is anchored, |
| 2288 | one attempt to match at the given offset is made. This can only succeed | one attempt to match at the given offset is made. This can only succeed |
| 2289 | if the pattern does not require the match to be at the start of the | if the pattern does not require the match to be at the start of the |
| 2290 | subject. | subject. |
| 2291 | ||
| 2292 | How pcre_exec() returns captured substrings | How pcre_exec() returns captured substrings |
| 2293 | ||
| 2294 | In general, a pattern matches a certain portion of the subject, and in | In general, a pattern matches a certain portion of the subject, and in |
| 2295 | addition, further substrings from the subject may be picked out by | addition, further substrings from the subject may be picked out by |
| 2296 | parts of the pattern. Following the usage in Jeffrey Friedl's book, | parts of the pattern. Following the usage in Jeffrey Friedl's book, |
| 2297 | this is called "capturing" in what follows, and the phrase "capturing | this is called "capturing" in what follows, and the phrase "capturing |
| 2298 | subpattern" is used for a fragment of a pattern that picks out a sub- | subpattern" is used for a fragment of a pattern that picks out a sub- |
| 2299 | string. PCRE supports several other kinds of parenthesized subpattern | string. PCRE supports several other kinds of parenthesized subpattern |
| 2300 | that do not cause substrings to be captured. | that do not cause substrings to be captured. |
| 2301 | ||
| 2302 | Captured substrings are returned to the caller via a vector of integers | Captured substrings are returned to the caller via a vector of integers |
| 2303 | whose address is passed in ovector. The number of elements in the vec- | whose address is passed in ovector. The number of elements in the vec- |
| 2304 | tor is passed in ovecsize, which must be a non-negative number. Note: | tor is passed in ovecsize, which must be a non-negative number. Note: |
| 2305 | this argument is NOT the size of ovector in bytes. | this argument is NOT the size of ovector in bytes. |
| 2306 | ||
| 2307 | The first two-thirds of the vector is used to pass back captured sub- | The first two-thirds of the vector is used to pass back captured sub- |
| 2308 | strings, each substring using a pair of integers. The remaining third | strings, each substring using a pair of integers. The remaining third |
| 2309 | of the vector is used as workspace by pcre_exec() while matching cap- | of the vector is used as workspace by pcre_exec() while matching cap- |
| 2310 | turing subpatterns, and is not available for passing back information. | turing subpatterns, and is not available for passing back information. |
| 2311 | The number passed in ovecsize should always be a multiple of three. If | The number passed in ovecsize should always be a multiple of three. If |
| 2312 | it is not, it is rounded down. | it is not, it is rounded down. |
| 2313 | ||
| 2314 | When a match is successful, information about captured substrings is | When a match is successful, information about captured substrings is |
| 2315 | returned in pairs of integers, starting at the beginning of ovector, | returned in pairs of integers, starting at the beginning of ovector, |
| 2316 | and continuing up to two-thirds of its length at the most. The first | and continuing up to two-thirds of its length at the most. The first |
| 2317 | element of each pair is set to the byte offset of the first character | element of each pair is set to the byte offset of the first character |
| 2318 | in a substring, and the second is set to the byte offset of the first | in a substring, and the second is set to the byte offset of the first |
| 2319 | character after the end of a substring. Note: these values are always | character after the end of a substring. Note: these values are always |
| 2320 | byte offsets, even in UTF-8 mode. They are not character counts. | byte offsets, even in UTF-8 mode. They are not character counts. |
| 2321 | ||
| 2322 | The first pair of integers, ovector[0] and ovector[1], identify the | The first pair of integers, ovector[0] and ovector[1], identify the |
| 2323 | portion of the subject string matched by the entire pattern. The next | portion of the subject string matched by the entire pattern. The next |
| 2324 | pair is used for the first capturing subpattern, and so on. The value | pair is used for the first capturing subpattern, and so on. The value |
| 2325 | returned by pcre_exec() is one more than the highest numbered pair that | returned by pcre_exec() is one more than the highest numbered pair that |
| 2326 | has been set. For example, if two substrings have been captured, the | has been set. For example, if two substrings have been captured, the |
| 2327 | returned value is 3. If there are no capturing subpatterns, the return | returned value is 3. If there are no capturing subpatterns, the return |
| 2328 | value from a successful match is 1, indicating that just the first pair | value from a successful match is 1, indicating that just the first pair |
| 2329 | of offsets has been set. | of offsets has been set. |
| 2330 | ||
| 2331 | If a capturing subpattern is matched repeatedly, it is the last portion | If a capturing subpattern is matched repeatedly, it is the last portion |
| 2332 | of the string that it matched that is returned. | of the string that it matched that is returned. |
| 2333 | ||
| 2334 | If the vector is too small to hold all the captured substring offsets, | If the vector is too small to hold all the captured substring offsets, |
| 2335 | it is used as far as possible (up to two-thirds of its length), and the | it is used as far as possible (up to two-thirds of its length), and the |
| 2336 | function returns a value of zero. If the substring offsets are not of | function returns a value of zero. If neither the actual string matched |
| 2337 | interest, pcre_exec() may be called with ovector passed as NULL and | nor any captured substrings are of interest, pcre_exec() may be called |
| 2338 | ovecsize as zero. However, if the pattern contains back references and | with ovector passed as NULL and ovecsize as zero. However, if the pat- |
| 2339 | the ovector is not big enough to remember the related substrings, PCRE | tern contains back references and the ovector is not big enough to |
| 2340 | has to get additional memory for use during matching. Thus it is usu- | remember the related substrings, PCRE has to get additional memory for |
| 2341 | ally advisable to supply an ovector. | use during matching. Thus it is usually advisable to supply an ovector |
| 2342 | of reasonable size. | |
| 2343 | ||
| 2344 | There are some cases where zero is returned (indicating vector over- | |
| 2345 | flow) when in fact the vector is exactly the right size for the final | |
| 2346 | match. For example, consider the pattern | |
| 2347 | ||
| 2348 | (a)(?:(b)c|bd) | |
| 2349 | ||
| 2350 | If a vector of 6 elements (allowing for only 1 captured substring) is | |
| 2351 | given with subject string "abd", pcre_exec() will try to set the second | |
| 2352 | captured string, thereby recording a vector overflow, before failing to | |
| 2353 | match "c" and backing up to try the second alternative. The zero | |
| 2354 | return, however, does correctly indicate that the maximum number of | |
| 2355 | slots (namely 2) have been filled. In similar cases where there is tem- | |
| 2356 | porary overflow, but the final number of used slots is actually less | |
| 2357 | than the maximum, a non-zero value is returned. | |
| 2358 | ||
| 2359 | The pcre_fullinfo() function can be used to find out how many capturing | The pcre_fullinfo() function can be used to find out how many capturing |
| 2360 | subpatterns there are in a compiled pattern. The smallest size for | subpatterns there are in a compiled pattern. The smallest size for |
| 2361 | ovector that will allow for n captured substrings, in addition to the | ovector that will allow for n captured substrings, in addition to the |
| 2362 | offsets of the substring matched by the whole pattern, is (n+1)*3. | offsets of the substring matched by the whole pattern, is (n+1)*3. |
| 2363 | ||
| 2364 | It is possible for capturing subpattern number n+1 to match some part | It is possible for capturing subpattern number n+1 to match some part |
| 2365 | of the subject when subpattern n has not been used at all. For example, | of the subject when subpattern n has not been used at all. For example, |
| 2366 | if the string "abc" is matched against the pattern (a|(z))(bc) the | if the string "abc" is matched against the pattern (a|(z))(bc) the |
| 2367 | return from the function is 4, and subpatterns 1 and 3 are matched, but | return from the function is 4, and subpatterns 1 and 3 are matched, but |
| 2368 | 2 is not. When this happens, both values in the offset pairs corre- | 2 is not. When this happens, both values in the offset pairs corre- |
| 2369 | sponding to unused subpatterns are set to -1. | sponding to unused subpatterns are set to -1. |
| 2370 | ||
| 2371 | Offset values that correspond to unused subpatterns at the end of the | Offset values that correspond to unused subpatterns at the end of the |
| 2372 | expression are also set to -1. For example, if the string "abc" is | expression are also set to -1. For example, if the string "abc" is |
| 2373 | matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not | matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
| 2374 | matched. The return from the function is 2, because the highest used | matched. The return from the function is 2, because the highest used |
| 2375 | capturing subpattern number is 1, and the offsets for the second | capturing subpattern number is 1, and the offsets for the second |
| 2376 | and third capturing subpatterns (assuming the vector is large enough, | and third capturing subpatterns (assuming the vector is large enough, |
| 2377 | of course) are set to -1. | of course) are set to -1. |
| 2378 | ||
| 2379 | Note: Elements of ovector that do not correspond to capturing parenthe- | Note: Elements in the first two-thirds of ovector that do not corre- |
| 2380 | ses in the pattern are never changed. That is, if a pattern contains n | spond to capturing parentheses in the pattern are never changed. That |
| 2381 | capturing parentheses, no more than ovector[0] to ovector[2n+1] are set | is, if a pattern contains n capturing parentheses, no more than ovec- |
| 2382 | by pcre_exec(). The other elements retain whatever values they previ- | tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in |
| 2383 | ously had. | the first two-thirds) retain whatever values they previously had. |
| 2384 | ||
| 2385 | Some convenience functions are provided for extracting the captured | Some convenience functions are provided for extracting the captured |
| 2386 | substrings as separate strings. These are described below. | substrings as separate strings. These are described below. |
| 2387 | ||
| 2388 | Error return values from pcre_exec() | Error return values from pcre_exec() |
| 2389 | ||
| 2390 | If pcre_exec() fails, it returns a negative number. The following are | If pcre_exec() fails, it returns a negative number. The following are |
| 2391 | defined in the header file: | defined in the header file: |
| 2392 | ||
| 2393 | PCRE_ERROR_NOMATCH (-1) | PCRE_ERROR_NOMATCH (-1) |
| # | Line 2416 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2396 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 2396 | ||
| 2397 | PCRE_ERROR_NULL (-2) | PCRE_ERROR_NULL (-2) |
| 2398 | ||
| 2399 | Either code or subject was passed as NULL, or ovector was NULL and | Either code or subject was passed as NULL, or ovector was NULL and |
| 2400 | ovecsize was not zero. | ovecsize was not zero. |
| 2401 | ||
| 2402 | PCRE_ERROR_BADOPTION (-3) | PCRE_ERROR_BADOPTION (-3) |
| # | Line 2425 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2405 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 2405 | ||
| 2406 | PCRE_ERROR_BADMAGIC (-4) | PCRE_ERROR_BADMAGIC (-4) |
| 2407 | ||
| 2408 | PCRE stores a 4-byte "magic number" at the start of the compiled code, | PCRE stores a 4-byte "magic number" at the start of the compiled code, |
| 2409 | to catch the case when it is passed a junk pointer and to detect when a | to catch the case when it is passed a junk pointer and to detect when a |
| 2410 | pattern that was compiled in an environment of one endianness is run in | pattern that was compiled in an environment of one endianness is run in |
| 2411 | an environment with the other endianness. This is the error that PCRE | an environment with the other endianness. This is the error that PCRE |
| 2412 | gives when the magic number is not present. | gives when the magic number is not present. |
| 2413 | ||
| 2414 | PCRE_ERROR_UNKNOWN_OPCODE (-5) | PCRE_ERROR_UNKNOWN_OPCODE (-5) |
| 2415 | ||
| 2416 | While running the pattern match, an unknown item was encountered in the | While running the pattern match, an unknown item was encountered in the |
| 2417 | compiled pattern. This error could be caused by a bug in PCRE or by | compiled pattern. This error could be caused by a bug in PCRE or by |
| 2418 | overwriting of the compiled pattern. | overwriting of the compiled pattern. |
| 2419 | ||
| 2420 | PCRE_ERROR_NOMEMORY (-6) | PCRE_ERROR_NOMEMORY (-6) |
| 2421 | ||
| 2422 | If a pattern contains back references, but the ovector that is passed | If a pattern contains back references, but the ovector that is passed |
| 2423 | to pcre_exec() is not big enough to remember the referenced substrings, | to pcre_exec() is not big enough to remember the referenced substrings, |
| 2424 | PCRE gets a block of memory at the start of matching to use for this | PCRE gets a block of memory at the start of matching to use for this |
| 2425 | purpose. If the call via pcre_malloc() fails, this error is given. The | purpose. If the call via pcre_malloc() fails, this error is given. The |
| 2426 | memory is automatically freed at the end of matching. | memory is automatically freed at the end of matching. |
| 2427 | ||
| 2428 | This error is also given if pcre_stack_malloc() fails in pcre_exec(). | This error is also given if pcre_stack_malloc() fails in pcre_exec(). |
| 2429 | This can happen only when PCRE has been compiled with --disable-stack- | This can happen only when PCRE has been compiled with --disable-stack- |
| 2430 | for-recursion. | for-recursion. |
| 2431 | ||
| 2432 | PCRE_ERROR_NOSUBSTRING (-7) | PCRE_ERROR_NOSUBSTRING (-7) |
| 2433 | ||
| 2434 | This error is used by the pcre_copy_substring(), pcre_get_substring(), | This error is used by the pcre_copy_substring(), pcre_get_substring(), |
| 2435 | and pcre_get_substring_list() functions (see below). It is never | and pcre_get_substring_list() functions (see below). It is never |
| 2436 | returned by pcre_exec(). | returned by pcre_exec(). |
| 2437 | ||
| 2438 | PCRE_ERROR_MATCHLIMIT (-8) | PCRE_ERROR_MATCHLIMIT (-8) |
| 2439 | ||
| 2440 | The backtracking limit, as specified by the match_limit field in a | The backtracking limit, as specified by the match_limit field in a |
| 2441 | pcre_extra structure (or defaulted) was reached. See the description | pcre_extra structure (or defaulted) was reached. See the description |
| 2442 | above. | above. |
| 2443 | ||
| 2444 | PCRE_ERROR_CALLOUT (-9) | PCRE_ERROR_CALLOUT (-9) |
| 2445 | ||
| 2446 | This error is never generated by pcre_exec() itself. It is provided for | This error is never generated by pcre_exec() itself. It is provided for |
| 2447 | use by callout functions that want to yield a distinctive error code. | use by callout functions that want to yield a distinctive error code. |
| 2448 | See the pcrecallout documentation for details. | See the pcrecallout documentation for details. |
| 2449 | ||
| 2450 | PCRE_ERROR_BADUTF8 (-10) | PCRE_ERROR_BADUTF8 (-10) |
| 2451 | ||
| 2452 | A string that contains an invalid UTF-8 byte sequence was passed as a | A string that contains an invalid UTF-8 byte sequence was passed as a |
| 2453 | subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of | subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of |
| 2454 | the output vector (ovecsize) is at least 2, the byte offset to the | the output vector (ovecsize) is at least 2, the byte offset to the |
| 2455 | start of the invalid UTF-8 character is placed in the first ele- | start of the invalid UTF-8 character is placed in the first ele- |
| 2456 | ment, and a reason code is placed in the second element. The reason | ment, and a reason code is placed in the second element. The reason |
| 2457 | codes are listed in the following section. For backward compatibility, | codes are listed in the following section. For backward compatibility, |
| 2458 | if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char- | if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char- |
| 2459 | acter at the end of the subject (reason codes 1 to 5), | acter at the end of the subject (reason codes 1 to 5), |
| 2460 | PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8. | PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8. |
| 2461 | ||
| 2462 | PCRE_ERROR_BADUTF8_OFFSET (-11) | PCRE_ERROR_BADUTF8_OFFSET (-11) |
| 2463 | ||
| 2464 | The UTF-8 byte sequence that was passed as a subject was checked and | The UTF-8 byte sequence that was passed as a subject was checked and |
| 2465 | found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the | found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the |
| 2466 | value of startoffset did not point to the beginning of a UTF-8 charac- | value of startoffset did not point to the beginning of a UTF-8 charac- |
| 2467 | ter or the end of the subject. | ter or the end of the subject. |
| 2468 | ||
| 2469 | PCRE_ERROR_PARTIAL (-12) | PCRE_ERROR_PARTIAL (-12) |
| 2470 | ||
| 2471 | The subject string did not match, but it did match partially. See the | The subject string did not match, but it did match partially. See the |
| 2472 | pcrepartial documentation for details of partial matching. | pcrepartial documentation for details of partial matching. |
| 2473 | ||
| 2474 | PCRE_ERROR_BADPARTIAL (-13) | PCRE_ERROR_BADPARTIAL (-13) |
| 2475 | ||
| 2476 | This code is no longer in use. It was formerly returned when the | This code is no longer in use. It was formerly returned when the |
| 2477 | PCRE_PARTIAL option was used with a compiled pattern containing items | PCRE_PARTIAL option was used with a compiled pattern containing items |
| 2478 | that were not supported for partial matching. From release 8.00 | that were not supported for partial matching. From release 8.00 |
| 2479 | onwards, there are no restrictions on partial matching. | onwards, there are no restrictions on partial matching. |
| 2480 | ||
| 2481 | PCRE_ERROR_INTERNAL (-14) | PCRE_ERROR_INTERNAL (-14) |
| 2482 | ||
| 2483 | An unexpected internal error has occurred. This error could be caused | An unexpected internal error has occurred. This error could be caused |
| 2484 | by a bug in PCRE or by overwriting of the compiled pattern. | by a bug in PCRE or by overwriting of the compiled pattern. |
| 2485 | ||
| 2486 | PCRE_ERROR_BADCOUNT (-15) | PCRE_ERROR_BADCOUNT (-15) |
| # | Line 2510 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2490 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 2490 | PCRE_ERROR_RECURSIONLIMIT (-21) | PCRE_ERROR_RECURSIONLIMIT (-21) |
| 2491 | ||
| 2492 | The internal recursion limit, as specified by the match_limit_recursion | The internal recursion limit, as specified by the match_limit_recursion |
| 2493 | field in a pcre_extra structure (or defaulted) was reached. See the | field in a pcre_extra structure (or defaulted) was reached. See the |
| 2494 | description above. | description above. |
| 2495 | ||
| 2496 | PCRE_ERROR_BADNEWLINE (-23) | PCRE_ERROR_BADNEWLINE (-23) |
| # | Line 2524 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2504 MATCHING A PATTERN: THE TRADITIONAL FUNC |
| 2504 | ||
| 2505 | PCRE_ERROR_SHORTUTF8 (-25) | PCRE_ERROR_SHORTUTF8 (-25) |
| 2506 | ||
| 2507 | This error is returned instead of PCRE_ERROR_BADUTF8 when the subject | This error is returned instead of PCRE_ERROR_BADUTF8 when the subject |
| 2508 | string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD | string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD |
| 2509 | option is set. Information about the failure is returned as for | option is set. Information about the failure is returned as for |
| 2510 | PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but | PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but |
| 2511 | this special error code for PCRE_PARTIAL_HARD precedes the implementa- | this special error code for PCRE_PARTIAL_HARD precedes the implementa- |
| 2512 | tion of returned information; it is retained for backwards compatibil- | tion of returned information; it is retained for backwards compatibil- |
| 2513 | ity. | ity. |
| 2514 | ||
| 2515 | PCRE_ERROR_RECURSELOOP (-26) | PCRE_ERROR_RECURSELOOP (-26) |
| 2516 | ||
| 2517 | This error is returned when pcre_exec() detects a recursion loop within | This error is returned when pcre_exec() detects a recursion loop within |
| 2518 | the pattern. Specifically, it means that either the whole pattern or a | the pattern. Specifically, it means that either the whole pattern or a |
| 2519 | subpattern has been called recursively for the second time at the same | subpattern has been called recursively for the second time at the same |
| 2520 | position in the subject string. Some simple patterns that might do this | position in the subject string. Some simple patterns that might do this |
| 2521 | are detected and faulted at compile time, but more complicated cases, | are detected and faulted at compile time, but more complicated cases, |
| 2522 | in particular mutual recursions between two different subpatterns, can- | in particular mutual recursions between two different subpatterns, can- |
| 2523 | not be detected until run time. | not be detected until run time. |
| 2524 | ||
| 2525 | PCRE_ERROR_JIT_STACKLIMIT (-27) | |
| 2526 | ||
| 2527 | This error is returned when a pattern that was successfully studied | |
| 2528 | using the PCRE_STUDY_JIT_COMPILE option is being matched, but the mem- | |
| 2529 | ory available for the just-in-time processing stack is not large | |
| 2530 | enough. See the pcrejit documentation for more details. | |
| 2531 | ||
| 2532 | Error numbers -16 to -20 and -22 are not used by pcre_exec(). | Error numbers -16 to -20 and -22 are not used by pcre_exec(). |
| 2533 | ||
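
The negative return values listed above can be turned into readable names when logging a failed match. The following is only a minimal sketch: the enum values are local stand-ins copied from the list above (a real program would include pcre.h and drop the enum), and exec_error_name() is a hypothetical helper, not part of the PCRE API.

```c
#include <stddef.h>

/* Local stand-ins for the codes documented above. When building against
   PCRE itself, include <pcre.h> and remove this enum. */
enum {
  PCRE_ERROR_NOSUBSTRING    = -7,
  PCRE_ERROR_MATCHLIMIT     = -8,
  PCRE_ERROR_CALLOUT        = -9,
  PCRE_ERROR_BADUTF8        = -10,
  PCRE_ERROR_BADUTF8_OFFSET = -11,
  PCRE_ERROR_PARTIAL        = -12,
  PCRE_ERROR_BADPARTIAL     = -13,
  PCRE_ERROR_INTERNAL       = -14,
  PCRE_ERROR_RECURSIONLIMIT = -21,
  PCRE_ERROR_SHORTUTF8      = -25,
  PCRE_ERROR_RECURSELOOP    = -26,
  PCRE_ERROR_JIT_STACKLIMIT = -27
};

/* Hypothetical helper: map a negative pcre_exec() yield to its symbolic
   name. Returns NULL for codes that this sketch does not cover. */
static const char *exec_error_name(int rc)
{
  switch (rc)
    {
    case PCRE_ERROR_NOSUBSTRING:    return "PCRE_ERROR_NOSUBSTRING";
    case PCRE_ERROR_MATCHLIMIT:     return "PCRE_ERROR_MATCHLIMIT";
    case PCRE_ERROR_CALLOUT:        return "PCRE_ERROR_CALLOUT";
    case PCRE_ERROR_BADUTF8:        return "PCRE_ERROR_BADUTF8";
    case PCRE_ERROR_BADUTF8_OFFSET: return "PCRE_ERROR_BADUTF8_OFFSET";
    case PCRE_ERROR_PARTIAL:        return "PCRE_ERROR_PARTIAL";
    case PCRE_ERROR_BADPARTIAL:     return "PCRE_ERROR_BADPARTIAL";
    case PCRE_ERROR_INTERNAL:       return "PCRE_ERROR_INTERNAL";
    case PCRE_ERROR_RECURSIONLIMIT: return "PCRE_ERROR_RECURSIONLIMIT";
    case PCRE_ERROR_SHORTUTF8:      return "PCRE_ERROR_SHORTUTF8";
    case PCRE_ERROR_RECURSELOOP:    return "PCRE_ERROR_RECURSELOOP";
    case PCRE_ERROR_JIT_STACKLIMIT: return "PCRE_ERROR_JIT_STACKLIMIT";
    default:                        return NULL;
    }
}
```

A caller might log exec_error_name(rc) whenever the yield of pcre_exec() is negative, falling back to printing the number when NULL is returned.
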
| 2534 | Reason codes for invalid UTF-8 strings | Reason codes for invalid UTF-8 strings |
| # | Line 2936 MATCHING A PATTERN: THE ALTERNATIVE FUNC | Line 2923 MATCHING A PATTERN: THE ALTERNATIVE FUNC |
| 2923 | The strings are returned in reverse order of length; that is, the long- | The strings are returned in reverse order of length; that is, the long- |
| 2924 | est matching string is given first. If there were too many matches to | est matching string is given first. If there were too many matches to |
| 2925 | fit into ovector, the yield of the function is zero, and the vector is | fit into ovector, the yield of the function is zero, and the vector is |
| 2926 | filled with the longest matches. | filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec() |
| 2927 | can use the entire ovector for returning matched strings. | |
| 2928 | ||
| 2929 | Error returns from pcre_dfa_exec() | Error returns from pcre_dfa_exec() |
| 2930 | ||
| 2931 | The pcre_dfa_exec() function returns a negative number when it fails. | The pcre_dfa_exec() function returns a negative number when it fails. |
| 2932 | Many of the errors are the same as for pcre_exec(), and these are | Many of the errors are the same as for pcre_exec(), and these are |
| 2933 | described above. There are in addition the following errors that are | described above. There are in addition the following errors that are |
| 2934 | specific to pcre_dfa_exec(): | specific to pcre_dfa_exec(): |
| 2935 | ||
| 2936 | PCRE_ERROR_DFA_UITEM (-16) | PCRE_ERROR_DFA_UITEM (-16) |
| 2937 | ||
| 2938 | This return is given if pcre_dfa_exec() encounters an item in the pat- | This return is given if pcre_dfa_exec() encounters an item in the pat- |
| 2939 | tern that it does not support, for instance, the use of \C or a back | tern that it does not support, for instance, the use of \C or a back |
| 2940 | reference. | reference. |
| 2941 | ||
| 2942 | PCRE_ERROR_DFA_UCOND (-17) | PCRE_ERROR_DFA_UCOND (-17) |
| 2943 | ||
| 2944 | This return is given if pcre_dfa_exec() encounters a condition item | This return is given if pcre_dfa_exec() encounters a condition item |
| 2945 | that uses a back reference for the condition, or a test for recursion | that uses a back reference for the condition, or a test for recursion |
| 2946 | in a specific group. These are not supported. | in a specific group. These are not supported. |
| 2947 | ||
| 2948 | PCRE_ERROR_DFA_UMLIMIT (-18) | PCRE_ERROR_DFA_UMLIMIT (-18) |
| 2949 | ||
| 2950 | This return is given if pcre_dfa_exec() is called with an extra block | This return is given if pcre_dfa_exec() is called with an extra block |
| 2951 | that contains a setting of the match_limit field. This is not supported | that contains a setting of the match_limit or match_limit_recursion |
| 2952 | (it is meaningless). | fields. This is not supported (these fields are meaningless for DFA |
| 2953 | matching). | |
| 2954 | ||
| 2955 | PCRE_ERROR_DFA_WSSIZE (-19) | PCRE_ERROR_DFA_WSSIZE (-19) |
| 2956 | ||
| # | Line 2991 AUTHOR | Line 2980 AUTHOR |
| 2980 | ||
| 2981 | REVISION | REVISION |
| 2982 | ||
| 2983 | Last updated: 28 July 2011 | Last updated: 23 September 2011 |
| 2984 | Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2011 University of Cambridge. |
| 2985 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| 2986 | ||
| # | Line 3039 PCRE CALLOUTS | Line 3028 PCRE CALLOUTS |
| 3028 | pattern is matched. This is useful information when you are trying to | pattern is matched. This is useful information when you are trying to |
| 3029 | optimize the performance of a particular pattern. | optimize the performance of a particular pattern. |
| 3030 | ||
| 3031 | The use of callouts in a pattern makes it ineligible for optimization | |
| 3032 | by the just-in-time compiler. Studying such a pattern with the | |
| 3033 | PCRE_STUDY_JIT_COMPILE option always fails. | |
| 3034 | ||
| 3035 | ||
| 3036 | MISSING CALLOUTS | MISSING CALLOUTS |
| 3037 | ||
| # | Line 3180 AUTHOR | Line 3173 AUTHOR |
| 3173 | ||
| 3174 | REVISION | REVISION |
| 3175 | ||
| 3176 | Last updated: 31 July 2011 | Last updated: 26 August 2011 |
| 3177 | Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2011 University of Cambridge. |
| 3178 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| 3179 | ||
| # | Line 3199 DIFFERENCES BETWEEN PCRE AND PERL | Line 3192 DIFFERENCES BETWEEN PCRE AND PERL |
| 3192 | respect to Perl versions 5.10 and above. | respect to Perl versions 5.10 and above. |
| 3193 | ||
| 3194 | 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details | 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
| 3195 | of what it does have are given in the section on UTF-8 support in the | of what it does have are given in the pcreunicode page. |
| main pcre page. | ||
| 3196 | ||
| 3197 | 2. PCRE allows repeat quantifiers only on parenthesized assertions, but | 2. PCRE allows repeat quantifiers only on parenthesized assertions, but |
| 3198 | they do not mean what you might think. For example, (?!a){3} does not | they do not mean what you might think. For example, (?!a){3} does not |
| 3199 | assert that the next three characters are not "a". It just asserts that | assert that the next three characters are not "a". It just asserts that |
| 3200 | the next character is not "a" three times (in principle: PCRE optimizes | the next character is not "a" three times (in principle: PCRE optimizes |
| 3201 | this to run the assertion just once). Perl allows repeat quantifiers on | this to run the assertion just once). Perl allows repeat quantifiers on |
| 3202 | other assertions such as \b, but these do not seem to have any use. | other assertions such as \b, but these do not seem to have any use. |
| 3203 | ||
| 3204 | 3. Capturing subpatterns that occur inside negative lookahead asser- | 3. Capturing subpatterns that occur inside negative lookahead asser- |
| 3205 | tions are counted, but their entries in the offsets vector are never | tions are counted, but their entries in the offsets vector are never |
| 3206 | set. Perl sets its numerical variables from any such patterns that are | set. Perl sets its numerical variables from any such patterns that are |
| 3207 | matched before the assertion fails to match something (thereby succeed- | matched before the assertion fails to match something (thereby succeed- |
| 3208 | ing), but only if the negative lookahead assertion contains just one | ing), but only if the negative lookahead assertion contains just one |
| 3209 | branch. | branch. |
| 3210 | ||
| 3211 | 4. Though binary zero characters are supported in the subject string, | 4. Though binary zero characters are supported in the subject string, |
| 3212 | they are not allowed in a pattern string because it is passed as a nor- | they are not allowed in a pattern string because it is passed as a nor- |
| 3213 | mal C string, terminated by zero. The escape sequence \0 can be used in | mal C string, terminated by zero. The escape sequence \0 can be used in |
| 3214 | the pattern to represent a binary zero. | the pattern to represent a binary zero. |
| 3215 | ||
| 3216 | 5. The following Perl escape sequences are not supported: \l, \u, \L, | 5. The following Perl escape sequences are not supported: \l, \u, \L, |
| 3217 | \U, and \N when followed by a character name or Unicode value. (\N on | \U, and \N when followed by a character name or Unicode value. (\N on |
| 3218 | its own, matching a non-newline character, is supported.) In fact these | its own, matching a non-newline character, is supported.) In fact these |
| 3219 | are implemented by Perl's general string-handling and are not part of | are implemented by Perl's general string-handling and are not part of |
| 3220 | its pattern matching engine. If any of these are encountered by PCRE, | its pattern matching engine. If any of these are encountered by PCRE, |
| 3221 | an error is generated. | an error is generated. |
| 3222 | ||
| 3223 | 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE | 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
| 3224 | is built with Unicode character property support. The properties that | is built with Unicode character property support. The properties that |
| 3225 | can be tested with \p and \P are limited to the general category prop- | can be tested with \p and \P are limited to the general category prop- |
| 3226 | erties such as Lu and Nd, script names such as Greek or Han, and the | erties such as Lu and Nd, script names such as Greek or Han, and the |
| 3227 | derived properties Any and L&. PCRE does support the Cs (surrogate) | derived properties Any and L&. PCRE does support the Cs (surrogate) |
| 3228 | property, which Perl does not; the Perl documentation says "Because | property, which Perl does not; the Perl documentation says "Because |
| 3229 | Perl hides the need for the user to understand the internal representa- | Perl hides the need for the user to understand the internal representa- |
| 3230 | tion of Unicode characters, there is no need to implement the somewhat | tion of Unicode characters, there is no need to implement the somewhat |
| 3231 | messy concept of surrogates." | messy concept of surrogates." |
| 3232 | ||
| 3233 | 7. PCRE implements a simpler version of \X than Perl, which changed to | 7. PCRE implements a simpler version of \X than Perl, which changed to |
| 3234 | make \X match what Unicode calls an "extended grapheme cluster". This | make \X match what Unicode calls an "extended grapheme cluster". This |
| 3235 | is more complicated than an extended Unicode sequence, which is what | is more complicated than an extended Unicode sequence, which is what |
| 3236 | PCRE matches. | PCRE matches. |
| 3237 | ||
| 3238 | 8. PCRE does support the \Q...\E escape for quoting substrings. Charac- | 8. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
| 3239 | ters in between are treated as literals. This is slightly different | ters in between are treated as literals. This is slightly different |
| 3240 | from Perl in that $ and @ are also handled as literals inside the | from Perl in that $ and @ are also handled as literals inside the |
| 3241 | quotes. In Perl, they cause variable interpolation (but of course PCRE | quotes. In Perl, they cause variable interpolation (but of course PCRE |
| 3242 | does not have variables). Note the following examples: | does not have variables). Note the following examples: |
| 3243 | ||
| 3244 | Pattern PCRE matches Perl matches | Pattern PCRE matches Perl matches |
| # | Line 3256 DIFFERENCES BETWEEN PCRE AND PERL | Line 3248 DIFFERENCES BETWEEN PCRE AND PERL |
| 3248 | \Qabc\$xyz\E abc\$xyz abc\$xyz | \Qabc\$xyz\E abc\$xyz abc\$xyz |
| 3249 | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| 3250 | ||
| 3251 | The \Q...\E sequence is recognized both inside and outside character | The \Q...\E sequence is recognized both inside and outside character |
| 3252 | classes. | classes. |
| 3253 | ||
| 3254 | 9. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) | 9. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) |
| 3255 | constructions. However, there is support for recursive patterns. This | constructions. However, there is support for recursive patterns. This |
| 3256 | is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE | is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE |
| 3257 | "callout" feature allows an external function to be called during pat- | "callout" feature allows an external function to be called during pat- |
| 3258 | tern matching. See the pcrecallout documentation for details. | tern matching. See the pcrecallout documentation for details. |
| 3259 | ||
| 3260 | 10. Subpatterns that are called recursively or as "subroutines" are | 10. Subpatterns that are called as subroutines (whether or not recur- |
| 3261 | always treated as atomic groups in PCRE. This is like Python, but | sively) are always treated as atomic groups in PCRE. This is like |
| 3262 | unlike Perl. There is a discussion of an example that explains this in | Python, but unlike Perl. Captured values that are set outside a |
| 3263 | more detail in the section on recursion differences from Perl in the | subroutine call can be referenced from inside in PCRE, but not in Perl. |
| 3264 | pcrepattern page. | There is a discussion that explains these differences in more detail in |
| 3265 | the section on recursion differences from Perl in the pcrepattern page. | |
| 3266 | ||
| 3267 | 11. If (*THEN) is present in a group that is called as a subroutine, | |
| 3268 | its action is limited to that group, even if the group does not contain | |
| 3269 | any | characters. | |
| 3270 | ||
| 3271 | 11. There are some differences that are concerned with the settings of | 12. There are some differences that are concerned with the settings of |
| 3272 | captured strings when part of a pattern is repeated. For example, | captured strings when part of a pattern is repeated. For example, |
| 3273 | matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 | matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
| 3274 | unset, but in PCRE it is set to "b". | unset, but in PCRE it is set to "b". |
| 3275 | ||
| 3276 | 12. PCRE's handling of duplicate subpattern numbers and duplicate | 13. PCRE's handling of duplicate subpattern numbers and duplicate |
| 3277 | subpattern names is not as general as Perl's. This is a consequence of the | subpattern names is not as general as Perl's. This is a consequence of the |
| 3278 | fact that PCRE works internally just with numbers, using an external | fact that PCRE works internally just with numbers, using an external |
| 3279 | table to translate between numbers and names. In particular, a pattern | table to translate between numbers and names. In particular, a pattern |
| # | Line 3287 DIFFERENCES BETWEEN PCRE AND PERL | Line 3284 DIFFERENCES BETWEEN PCRE AND PERL |
| 3284 | turing subpattern number 1. To avoid this confusing situation, an error | turing subpattern number 1. To avoid this confusing situation, an error |
| 3285 | is given at compile time. | is given at compile time. |
| 3286 | ||
| 3287 | 13. Perl recognizes comments in some places that PCRE does not, for | 14. Perl recognizes comments in some places that PCRE does not, for |
| 3288 | example, between the ( and ? at the start of a subpattern. If the /x | example, between the ( and ? at the start of a subpattern. If the /x |
| 3289 | modifier is set, Perl allows whitespace between ( and ? but PCRE never | modifier is set, Perl allows whitespace between ( and ? but PCRE never |
| 3290 | does, even if the PCRE_EXTENDED option is set. | does, even if the PCRE_EXTENDED option is set. |
| 3291 | ||
| 3292 | 14. PCRE provides some extensions to the Perl regular expression facil- | 15. PCRE provides some extensions to the Perl regular expression facil- |
| 3293 | ities. Perl 5.10 includes new features that are not in earlier ver- | ities. Perl 5.10 includes new features that are not in earlier ver- |
| 3294 | sions of Perl, some of which (such as named parentheses) have been in | sions of Perl, some of which (such as named parentheses) have been in |
| 3295 | PCRE for some time. This list is with respect to Perl 5.10: | PCRE for some time. This list is with respect to Perl 5.10: |
| # | Line 3328 DIFFERENCES BETWEEN PCRE AND PERL | Line 3325 DIFFERENCES BETWEEN PCRE AND PERL |
| 3325 | (i) The partial matching facility is PCRE-specific. | (i) The partial matching facility is PCRE-specific. |
| 3326 | ||
| 3327 | (j) Patterns compiled by PCRE can be saved and re-used at a later time, | (j) Patterns compiled by PCRE can be saved and re-used at a later time, |
| 3328 | even on different hosts that have the other endianness. | even on different hosts that have the other endianness. However, this |
| 3329 | does not apply to optimized data created by the just-in-time compiler. | |
| 3330 | ||
| 3331 | (k) The alternative matching function (pcre_dfa_exec()) matches in a | (k) The alternative matching function (pcre_dfa_exec()) matches in a |
| 3332 | different way and is not Perl-compatible. | different way and is not Perl-compatible. |
| 3333 | ||
| 3334 | (l) PCRE recognizes some special sequences such as (*CR) at the start | (l) PCRE recognizes some special sequences such as (*CR) at the start |
| 3335 | of a pattern that set overall options that cannot be changed within the | of a pattern that set overall options that cannot be changed within the |
| 3336 | pattern. | pattern. |
| 3337 | ||
| # | Line 3347 AUTHOR | Line 3345 AUTHOR |
| 3345 | ||
| 3346 | REVISION | REVISION |
| 3347 | ||
| 3348 | Last updated: 24 July 2011 | Last updated: 09 October 2011 |
| 3349 | Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2011 University of Cambridge. |
| 3350 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| 3351 | ||
| # | Line 3387 PCRE REGULAR EXPRESSION DETAILS | Line 3385 PCRE REGULAR EXPRESSION DETAILS |
| 3385 | Starting a pattern with this sequence is equivalent to setting the | Starting a pattern with this sequence is equivalent to setting the |
| 3386 | PCRE_UTF8 option. This feature is not Perl-compatible. How setting | PCRE_UTF8 option. This feature is not Perl-compatible. How setting |
| 3387 | UTF-8 mode affects pattern matching is mentioned in several places | UTF-8 mode affects pattern matching is mentioned in several places |
| 3388 | below. There is also a summary of UTF-8 features in the section on | below. There is also a summary of UTF-8 features in the pcreunicode |
| 3389 | UTF-8 support in the main pcre page. | page. |
| 3390 | ||
| 3391 | Another special sequence that may appear at the start of a pattern or | Another special sequence that may appear at the start of a pattern or |
| 3392 | in combination with (*UTF8) is: | in combination with (*UTF8) is: |
| # | Line 4144 FULL STOP (PERIOD, DOT) AND \N | Line 4142 FULL STOP (PERIOD, DOT) AND \N |
| 4142 | MATCHING A SINGLE BYTE | MATCHING A SINGLE BYTE |
| 4143 | ||
| 4144 | Outside a character class, the escape sequence \C matches any one byte, | Outside a character class, the escape sequence \C matches any one byte, |
| 4145 | both in and out of UTF-8 mode. Unlike a dot, it always matches any | both in and out of UTF-8 mode. Unlike a dot, it always matches line- |
| 4146 | line-ending characters. The feature is provided in Perl in order to | ending characters. The feature is provided in Perl in order to match |
| 4147 | match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- | individual bytes in UTF-8 mode, but it is unclear how it can usefully |
| 4148 | acters into individual bytes, the rest of the string may start with a | be used. Because \C breaks up characters into individual bytes, match- |
| 4149 | malformed UTF-8 character. For this reason, the \C escape sequence is | ing one byte with \C in UTF-8 mode means that the rest of the string |
| 4150 | best avoided. | may start with a malformed UTF-8 character. This has undefined results, |
| 4151 | because PCRE assumes that it is dealing with valid UTF-8 strings (and | |
| 4152 | by default it checks this at the start of processing unless the | |
| 4153 | PCRE_NO_UTF8_CHECK option is used). | |
| 4154 | ||
| 4155 | PCRE does not allow \C to appear in lookbehind assertions (described | PCRE does not allow \C to appear in lookbehind assertions (described |
| 4156 | below), because in UTF-8 mode this would make it impossible to calcu- | below), because in UTF-8 mode this would make it impossible to calcu- |
| 4157 | late the length of the lookbehind. | late the length of the lookbehind. |
| 4158 | ||
| 4159 | In general, the \C escape sequence is best avoided in UTF-8 mode. How- | |
| 4160 | ever, one way of using it that avoids the problem of malformed UTF-8 | |
| 4161 | characters is to use a lookahead to check the length of the next char- | |
| 4162 | acter, as in this pattern (ignore white space and line breaks): | |
| 4163 | ||
| 4164 | (?| (?=[\x00-\x7f])(\C) | | |
| 4165 | (?=[\x80-\x{7ff}])(\C)(\C) | | |
| 4166 | (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | | |
| 4167 | (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) | |
| 4168 | ||
| 4169 | A group that starts with (?| resets the capturing parentheses numbers | |
| 4170 | in each alternative (see "Duplicate Subpattern Numbers" below). The | |
| 4171 | assertions at the start of each branch check the next UTF-8 character | |
| 4172 | for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The | |
| 4173 | character's individual bytes are then captured by the appropriate num- | |
| 4174 | ber of groups. | |
| 4175 | ||
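
The four lookahead ranges in the pattern above correspond to the code point ranges whose UTF-8 encodings occupy 1, 2, 3, or 4 bytes. As a rough illustration only, the same boundaries can be written as a small C function; utf8_encoded_length() is a hypothetical name chosen for this sketch and is not a PCRE function.

```c
/* Hypothetical helper: number of bytes in the UTF-8 encoding of code
   point c, using the same range boundaries as the lookaheads in the
   pattern above. Returns 0 for values beyond the last range. */
static int utf8_encoded_length(unsigned long c)
{
  if (c <= 0x7fUL)     return 1;   /* [\x00-\x7f]           */
  if (c <= 0x7ffUL)    return 2;   /* [\x80-\x{7ff}]        */
  if (c <= 0xffffUL)   return 3;   /* [\x{800}-\x{ffff}]    */
  if (c <= 0x1fffffUL) return 4;   /* [\x{10000}-\x{1fffff}] */
  return 0;
}
```

The value returned for a given character is the number of (\C) groups that the matching branch of the pattern would capture.
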
| 4176 | ||
| 4177 | SQUARE BRACKETS AND CHARACTER CLASSES | SQUARE BRACKETS AND CHARACTER CLASSES |
| 4178 | ||
| # | Line 4162 SQUARE BRACKETS AND CHARACTER CLASSES | Line 4180 SQUARE BRACKETS AND CHARACTER CLASSES |
| 4180 | closing square bracket. A closing square bracket on its own is not spe- | closing square bracket. A closing square bracket on its own is not spe- |
| 4181 | cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set, | cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set, |
| 4182 | a lone closing square bracket causes a compile-time error. If a closing | a lone closing square bracket causes a compile-time error. If a closing |
| 4183 | square bracket is required as a member of the class, it should be the | square bracket is required as a member of the class, it should be the |
| 4184 | first data character in the class (after an initial circumflex, if | first data character in the class (after an initial circumflex, if |
| 4185 | present) or escaped with a backslash. | present) or escaped with a backslash. |
| 4186 | ||
| 4187 | A character class matches a single character in the subject. In UTF-8 | A character class matches a single character in the subject. In UTF-8 |
| 4188 | mode, the character may be more than one byte long. A matched character | mode, the character may be more than one byte long. A matched character |
| 4189 | must be in the set of characters defined by the class, unless the first | must be in the set of characters defined by the class, unless the first |
| 4190 | character in the class definition is a circumflex, in which case the | character in the class definition is a circumflex, in which case the |
| 4191 | subject character must not be in the set defined by the class. If a | subject character must not be in the set defined by the class. If a |
| 4192 | circumflex is actually required as a member of the class, ensure it is | circumflex is actually required as a member of the class, ensure it is |
| 4193 | not the first character, or escape it with a backslash. | not the first character, or escape it with a backslash. |
| 4194 | ||
| 4195 | For example, the character class [aeiou] matches any lower case vowel, | For example, the character class [aeiou] matches any lower case vowel, |
| 4196 | while [^aeiou] matches any character that is not a lower case vowel. | while [^aeiou] matches any character that is not a lower case vowel. |
| 4197 | Note that a circumflex is just a convenient notation for specifying the | Note that a circumflex is just a convenient notation for specifying the |
| 4198 | characters that are in the class by enumerating those that are not. A | characters that are in the class by enumerating those that are not. A |
| 4199 | class that starts with a circumflex is not an assertion; it still con- | class that starts with a circumflex is not an assertion; it still con- |
| 4200 | sumes a character from the subject string, and therefore it fails if | sumes a character from the subject string, and therefore it fails if |
| 4201 | the current pointer is at the end of the string. | the current pointer is at the end of the string. |
| 4202 | ||
| 4203 | In UTF-8 mode, characters with values greater than 255 can be included | In UTF-8 mode, characters with values greater than 255 can be included |
| 4204 | in a class as a literal string of bytes, or by using the \x{ escaping | in a class as a literal string of bytes, or by using the \x{ escaping |
| 4205 | mechanism. | mechanism. |
| 4206 | ||
| 4207 | When caseless matching is set, any letters in a class represent both | When caseless matching is set, any letters in a class represent both |
| 4208 | their upper case and lower case versions, so for example, a caseless | their upper case and lower case versions, so for example, a caseless |
| 4209 | [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not | [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not |
| 4210 | match "A", whereas a caseful version would. In UTF-8 mode, PCRE always | match "A", whereas a caseful version would. In UTF-8 mode, PCRE always |
| 4211 | understands the concept of case for characters whose values are less | understands the concept of case for characters whose values are less |
| 4212 | than 128, so caseless matching is always possible. For characters with | than 128, so caseless matching is always possible. For characters with |
| 4213 | higher values, the concept of case is supported if PCRE is compiled | higher values, the concept of case is supported if PCRE is compiled |
| 4214 | with Unicode property support, but not otherwise. If you want to use | with Unicode property support, but not otherwise. If you want to use |
| 4215 | caseless matching in UTF-8 mode for characters 128 and above, you must | caseless matching in UTF-8 mode for characters 128 and above, you must |
| 4216 | ensure that PCRE is compiled with Unicode property support as well as | ensure that PCRE is compiled with Unicode property support as well as |
| 4217 | with UTF-8 support. | with UTF-8 support. |
| 4218 | ||
| 4219 | Characters that might indicate line breaks are never treated in any | Characters that might indicate line breaks are never treated in any |
| 4220 | special way when matching character classes, whatever line-ending | special way when matching character classes, whatever line-ending |
| 4221 | sequence is in use, and whatever setting of the PCRE_DOTALL and | sequence is in use, and whatever setting of the PCRE_DOTALL and |
| 4222 | PCRE_MULTILINE options is used. A class such as [^a] always matches one | PCRE_MULTILINE options is used. A class such as [^a] always matches one |
| 4223 | of these characters. | of these characters. |
| 4224 | ||
| 4225 | The minus (hyphen) character can be used to specify a range of charac- | The minus (hyphen) character can be used to specify a range of charac- |
| 4226 | ters in a character class. For example, [d-m] matches any letter | ters in a character class. For example, [d-m] matches any letter |
| 4227 | between d and m, inclusive. If a minus character is required in a | between d and m, inclusive. If a minus character is required in a |
| 4228 | class, it must be escaped with a backslash or appear in a position | class, it must be escaped with a backslash or appear in a position |
| 4229 | where it cannot be interpreted as indicating a range, typically as the | where it cannot be interpreted as indicating a range, typically as the |
| 4230 | first or last character in the class. | first or last character in the class. |
| 4231 | ||
| 4232 | It is not possible to have the literal character "]" as the end charac- | It is not possible to have the literal character "]" as the end charac- |
| 4233 | ter of a range. A pattern such as [W-]46] is interpreted as a class of | ter of a range. A pattern such as [W-]46] is interpreted as a class of |
| 4234 | two characters ("W" and "-") followed by a literal string "46]", so it | two characters ("W" and "-") followed by a literal string "46]", so it |
| 4235 | would match "W46]" or "-46]". However, if the "]" is escaped with a | would match "W46]" or "-46]". However, if the "]" is escaped with a |
| 4236 | backslash it is interpreted as the end of range, so [W-\]46] is inter- | backslash it is interpreted as the end of range, so [W-\]46] is inter- |
| 4237 | preted as a class containing a range followed by two other characters. | preted as a class containing a range followed by two other characters. |
| 4238 | The octal or hexadecimal representation of "]" can also be used to end | The octal or hexadecimal representation of "]" can also be used to end |
| 4239 | a range. | a range. |
| 4240 | ||
| 4241 | Ranges operate in the collating sequence of character values. They can | Ranges operate in the collating sequence of character values. They can |
| 4242 | also be used for characters specified numerically, for example | also be used for characters specified numerically, for example |
| 4243 | [\000-\037]. In UTF-8 mode, ranges can include characters whose values | [\000-\037]. In UTF-8 mode, ranges can include characters whose values |
| 4244 | are greater than 255, for example [\x{100}-\x{2ff}]. | are greater than 255, for example [\x{100}-\x{2ff}]. |
| 4245 | ||
| 4246 | If a range that includes letters is used when caseless matching is set, | If a range that includes letters is used when caseless matching is set, |
| 4247 | it matches the letters in either case. For example, [W-c] is equivalent | it matches the letters in either case. For example, [W-c] is equivalent |
| 4248 | to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if | to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if |
| 4249 | character tables for a French locale are in use, [\xc8-\xcb] matches | character tables for a French locale are in use, [\xc8-\xcb] matches |
| 4250 | accented E characters in both cases. In UTF-8 mode, PCRE supports the | accented E characters in both cases. In UTF-8 mode, PCRE supports the |
| 4251 | concept of case for characters with values greater than 127 only when | concept of case for characters with values greater than 127 only when |
| 4252 | it is compiled with Unicode property support. | it is compiled with Unicode property support. |
| 4253 | ||
| 4254 | The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, | The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, |
| 4255 | \w, and \W may appear in a character class, and add the characters that | \w, and \W may appear in a character class, and add the characters that |
| 4256 | they match to the class. For example, [\dABCDEF] matches any hexadeci- | they match to the class. For example, [\dABCDEF] matches any hexadeci- |
| 4257 | mal digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of | mal digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of |
| 4258 | \d, \s, \w and their upper case partners, just as it does when they | \d, \s, \w and their upper case partners, just as it does when they |
| 4259 | appear outside a character class, as described in the section entitled | appear outside a character class, as described in the section entitled |
| 4260 | "Generic character types" above. The escape sequence \b has a different | "Generic character types" above. The escape sequence \b has a different |
| 4261 | meaning inside a character class; it matches the backspace character. | meaning inside a character class; it matches the backspace character. |
| 4262 | The sequences \B, \N, \R, and \X are not special inside a character | The sequences \B, \N, \R, and \X are not special inside a character |
| 4263 | class. Like any other unrecognized escape sequences, they are treated | class. Like any other unrecognized escape sequences, they are treated |
| 4264 | as the literal characters "B", "N", "R", and "X" by default, but cause | as the literal characters "B", "N", "R", and "X" by default, but cause |
| 4265 | an error if the PCRE_EXTRA option is set. | an error if the PCRE_EXTRA option is set. |
| 4266 | ||
| 4267 | A circumflex can conveniently be used with the upper case character | A circumflex can conveniently be used with the upper case character |
| 4268 | types to specify a more restricted set of characters than the matching | types to specify a more restricted set of characters than the matching |
| 4269 | lower case type. For example, the class [^\W_] matches any letter or | lower case type. For example, the class [^\W_] matches any letter or |
| 4270 | digit, but not underscore, whereas [\w] includes underscore. A positive | digit, but not underscore, whereas [\w] includes underscore. A positive |
| 4271 | character class should be read as "something OR something OR ..." and a | character class should be read as "something OR something OR ..." and a |
| 4272 | negative class as "NOT something AND NOT something AND NOT ...". | negative class as "NOT something AND NOT something AND NOT ...". |
| 4273 | ||
| 4274 | The only metacharacters that are recognized in character classes are | The only metacharacters that are recognized in character classes are |
| 4275 | backslash, hyphen (only where it can be interpreted as specifying a | backslash, hyphen (only where it can be interpreted as specifying a |
| 4276 | range), circumflex (only at the start), opening square bracket (only | range), circumflex (only at the start), opening square bracket (only |
| 4277 | when it can be interpreted as introducing a POSIX class name - see the | when it can be interpreted as introducing a POSIX class name - see the |
| 4278 | next section), and the terminating closing square bracket. However, | next section), and the terminating closing square bracket. However, |
| 4279 | escaping other non-alphanumeric characters does no harm. | escaping other non-alphanumeric characters does no harm. |
| 4280 | ||
| 4281 | ||
| 4282 | POSIX CHARACTER CLASSES | POSIX CHARACTER CLASSES |
| 4283 | ||
| 4284 | Perl supports the POSIX notation for character classes. This uses names | Perl supports the POSIX notation for character classes. This uses names |
| 4285 | enclosed by [: and :] within the enclosing square brackets. PCRE also | enclosed by [: and :] within the enclosing square brackets. PCRE also |
| 4286 | supports this notation. For example, | supports this notation. For example, |
| 4287 | ||
| 4288 | [01[:alpha:]%] | [01[:alpha:]%] |
| # | Line 4287 POSIX CHARACTER CLASSES | Line 4305 POSIX CHARACTER CLASSES |
| 4305 | word "word" characters (same as \w) | word "word" characters (same as \w) |
| 4306 | xdigit hexadecimal digits | xdigit hexadecimal digits |
| 4307 | ||
| 4308 | The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), | The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), |
| 4309 | and space (32). Notice that this list includes the VT character (code | and space (32). Notice that this list includes the VT character (code |
| 4310 | 11). This makes "space" different to \s, which does not include VT (for | 11). This makes "space" different to \s, which does not include VT (for |
| 4311 | Perl compatibility). | Perl compatibility). |
| 4312 | ||
| 4313 | The name "word" is a Perl extension, and "blank" is a GNU extension | The name "word" is a Perl extension, and "blank" is a GNU extension |
| 4314 | from Perl 5.8. Another Perl extension is negation, which is indicated | from Perl 5.8. Another Perl extension is negation, which is indicated |
| 4315 | by a ^ character after the colon. For example, | by a ^ character after the colon. For example, |
| 4316 | ||
| 4317 | [12[:^digit:]] | [12[:^digit:]] |
| 4318 | ||
| 4319 | matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the | matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the |
| 4320 | POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but | POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but |
| 4321 | these are not supported, and an error is given if they are encountered. | these are not supported, and an error is given if they are encountered. |
| 4322 | ||
| 4323 | By default, in UTF-8 mode, characters with values greater than 127 do | By default, in UTF-8 mode, characters with values greater than 127 do |
| 4324 | not match any of the POSIX character classes. However, if the PCRE_UCP | not match any of the POSIX character classes. However, if the PCRE_UCP |
| 4325 | option is passed to pcre_compile(), some of the classes are changed so | option is passed to pcre_compile(), some of the classes are changed so |
| 4326 | that Unicode character properties are used. This is achieved by replac- | that Unicode character properties are used. This is achieved by replac- |
| 4327 | ing the POSIX classes by other sequences, as follows: | ing the POSIX classes by other sequences, as follows: |
| 4328 | ||
| # | Line 4317 POSIX CHARACTER CLASSES | Line 4335 POSIX CHARACTER CLASSES |
| 4335 | [:upper:] becomes \p{Lu} | [:upper:] becomes \p{Lu} |
| 4336 | [:word:] becomes \p{Xwd} | [:word:] becomes \p{Xwd} |
| 4337 | ||
| 4338 | Negated versions, such as [:^alpha:] use \P instead of \p. The other | Negated versions, such as [:^alpha:] use \P instead of \p. The other |
| 4339 | POSIX classes are unchanged, and match only characters with code points | POSIX classes are unchanged, and match only characters with code points |
| 4340 | less than 128. | less than 128. |
| 4341 | ||
| 4342 | ||
| 4343 | VERTICAL BAR | VERTICAL BAR |
| 4344 | ||
| 4345 | Vertical bar characters are used to separate alternative patterns. For | Vertical bar characters are used to separate alternative patterns. For |
| 4346 | example, the pattern | example, the pattern |
| 4347 | ||
| 4348 | gilbert|sullivan | gilbert|sullivan |
| 4349 | ||
| 4350 | matches either "gilbert" or "sullivan". Any number of alternatives may | matches either "gilbert" or "sullivan". Any number of alternatives may |
| 4351 | appear, and an empty alternative is permitted (matching the empty | appear, and an empty alternative is permitted (matching the empty |
| 4352 | string). The matching process tries each alternative in turn, from left | string). The matching process tries each alternative in turn, from left |
| 4353 | to right, and the first one that succeeds is used. If the alternatives | to right, and the first one that succeeds is used. If the alternatives |
| 4354 | are within a subpattern (defined below), "succeeds" means matching the | are within a subpattern (defined below), "succeeds" means matching the |
| 4355 | rest of the main pattern as well as the alternative in the subpattern. | rest of the main pattern as well as the alternative in the subpattern. |
| 4356 | ||
| 4357 | ||
| 4358 | INTERNAL OPTION SETTING | INTERNAL OPTION SETTING |
| 4359 | ||
| 4360 | The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and | The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and |
| 4361 | PCRE_EXTENDED options (which are Perl-compatible) can be changed from | PCRE_EXTENDED options (which are Perl-compatible) can be changed from |
| 4362 | within the pattern by a sequence of Perl option letters enclosed | within the pattern by a sequence of Perl option letters enclosed |
| 4363 | between "(?" and ")". The option letters are | between "(?" and ")". The option letters are |
| 4364 | ||
| 4365 | i for PCRE_CASELESS | i for PCRE_CASELESS |
| # | Line 4351 INTERNAL OPTION SETTING | Line 4369 INTERNAL OPTION SETTING |
| 4369 | ||
| 4370 | For example, (?im) sets caseless, multiline matching. It is also possi- | For example, (?im) sets caseless, multiline matching. It is also possi- |
| 4371 | ble to unset these options by preceding the letter with a hyphen, and a | ble to unset these options by preceding the letter with a hyphen, and a |
| 4372 | combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- | combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- |
| 4373 | LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, | LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, |
| 4374 | is also permitted. If a letter appears both before and after the | is also permitted. If a letter appears both before and after the |
| 4375 | hyphen, the option is unset. | hyphen, the option is unset. |
| 4376 | ||
| 4377 | The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA | The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
| 4378 | can be changed in the same way as the Perl-compatible options by using | can be changed in the same way as the Perl-compatible options by using |
| 4379 | the characters J, U and X respectively. | the characters J, U and X respectively. |
| 4380 | ||
| 4381 | When one of these option changes occurs at top level (that is, not | When one of these option changes occurs at top level (that is, not |
| 4382 | inside subpattern parentheses), the change applies to the remainder of | inside subpattern parentheses), the change applies to the remainder of |
| 4383 | the pattern that follows. If the change is placed right at the start of | the pattern that follows. If the change is placed right at the start of |
| 4384 | a pattern, PCRE extracts it into the global options (and it will there- | a pattern, PCRE extracts it into the global options (and it will there- |
| 4385 | fore show up in data extracted by the pcre_fullinfo() function). | fore show up in data extracted by the pcre_fullinfo() function). |
| 4386 | ||
| 4387 | An option change within a subpattern (see below for a description of | An option change within a subpattern (see below for a description of |
| 4388 | subpatterns) affects only that part of the subpattern that follows it, | subpatterns) affects only that part of the subpattern that follows it, |
| 4389 | so | so |
| 4390 | ||
| 4391 | (a(?i)b)c | (a(?i)b)c |
| 4392 | ||
| 4393 | matches abc and aBc and no other strings (assuming PCRE_CASELESS is not | matches abc and aBc and no other strings (assuming PCRE_CASELESS is not |
| 4394 | used). By this means, options can be made to have different settings | used). By this means, options can be made to have different settings |
| 4395 | in different parts of the pattern. Any changes made in one alternative | in different parts of the pattern. Any changes made in one alternative |
| 4396 | do carry on into subsequent branches within the same subpattern. For | do carry on into subsequent branches within the same subpattern. For |
| 4397 | example, | example, |
| 4398 | ||
| 4399 | (a(?i)b|c) | (a(?i)b|c) |
| 4400 | ||
| 4401 | matches "ab", "aB", "c", and "C", even though when matching "C" the | matches "ab", "aB", "c", and "C", even though when matching "C" the |
| 4402 | first branch is abandoned before the option setting. This is because | first branch is abandoned before the option setting. This is because |
| 4403 | the effects of option settings happen at compile time. There would be | the effects of option settings happen at compile time. There would be |
| 4404 | some very weird behaviour otherwise. | some very weird behaviour otherwise. |
| 4405 | ||
| 4406 | Note: There are other PCRE-specific options that can be set by the | Note: There are other PCRE-specific options that can be set by the |
| 4407 | application when the compile or match functions are called. In some | application when the compile or match functions are called. In some |
| 4408 | cases the pattern can contain special leading sequences such as (*CRLF) | cases the pattern can contain special leading sequences such as (*CRLF) |
| 4409 | to override what the application has set or what has been defaulted. | to override what the application has set or what has been defaulted. |
| 4410 | Details are given in the section entitled "Newline sequences" above. | Details are given in the section entitled "Newline sequences" above. |
| 4411 | There are also the (*UTF8) and (*UCP) leading sequences that can be | There are also the (*UTF8) and (*UCP) leading sequences that can be |
| 4412 | used to set UTF-8 and Unicode property modes; they are equivalent to | used to set UTF-8 and Unicode property modes; they are equivalent to |
| 4413 | setting the PCRE_UTF8 and the PCRE_UCP options, respectively. | setting the PCRE_UTF8 and the PCRE_UCP options, respectively. |
| 4414 | ||
| 4415 | ||
| # | Line 4404 SUBPATTERNS | Line 4422 SUBPATTERNS |
| 4422 | ||
| 4423 | cat(aract|erpillar|) | cat(aract|erpillar|) |
| 4424 | ||
| 4425 | matches "cataract", "caterpillar", or "cat". Without the parentheses, | matches "cataract", "caterpillar", or "cat". Without the parentheses, |
| 4426 | it would match "cataract", "erpillar" or an empty string. | it would match "cataract", "erpillar" or an empty string. |
| 4427 | ||
| 4428 | 2. It sets up the subpattern as a capturing subpattern. This means | 2. It sets up the subpattern as a capturing subpattern. This means |
| 4429 | that, when the whole pattern matches, that portion of the subject | that, when the whole pattern matches, that portion of the subject |
| 4430 | string that matched the subpattern is passed back to the caller via the | string that matched the subpattern is passed back to the caller via the |
| 4431 | ovector argument of pcre_exec(). Opening parentheses are counted from | ovector argument of pcre_exec(). Opening parentheses are counted from |
| 4432 | left to right (starting from 1) to obtain numbers for the capturing | left to right (starting from 1) to obtain numbers for the capturing |
| 4433 | subpatterns. For example, if the string "the red king" is matched | subpatterns. For example, if the string "the red king" is matched |
| 4434 | against the pattern | against the pattern |
| 4435 | ||
| 4436 | the ((red|white) (king|queen)) | the ((red|white) (king|queen)) |
| # | Line 4420 SUBPATTERNS | Line 4438 SUBPATTERNS |
| 4438 | the captured substrings are "red king", "red", and "king", and are num- | the captured substrings are "red king", "red", and "king", and are num- |
| 4439 | bered 1, 2, and 3, respectively. | bered 1, 2, and 3, respectively. |
| 4440 | ||
| 4441 | The fact that plain parentheses fulfil two functions is not always | The fact that plain parentheses fulfil two functions is not always |
| 4442 | helpful. There are often times when a grouping subpattern is required | helpful. There are often times when a grouping subpattern is required |
| 4443 | without a capturing requirement. If an opening parenthesis is followed | without a capturing requirement. If an opening parenthesis is followed |
| 4444 | by a question mark and a colon, the subpattern does not do any captur- | by a question mark and a colon, the subpattern does not do any captur- |
| 4445 | ing, and is not counted when computing the number of any subsequent | ing, and is not counted when computing the number of any subsequent |
| 4446 | capturing subpatterns. For example, if the string "the white queen" is | capturing subpatterns. For example, if the string "the white queen" is |
| 4447 | matched against the pattern | matched against the pattern |
| 4448 | ||
| 4449 | the ((?:red|white) (king|queen)) | the ((?:red|white) (king|queen)) |
| # | Line 4433 SUBPATTERNS | Line 4451 SUBPATTERNS |
| 4451 | the captured substrings are "white queen" and "queen", and are numbered | the captured substrings are "white queen" and "queen", and are numbered |
| 4452 | 1 and 2. The maximum number of capturing subpatterns is 65535. | 1 and 2. The maximum number of capturing subpatterns is 65535. |
| 4453 | ||
| 4454 | As a convenient shorthand, if any option settings are required at the | As a convenient shorthand, if any option settings are required at the |
| 4455 | start of a non-capturing subpattern, the option letters may appear | start of a non-capturing subpattern, the option letters may appear |
| 4456 | between the "?" and the ":". Thus the two patterns | between the "?" and the ":". Thus the two patterns |
| 4457 | ||
| 4458 | (?i:saturday|sunday) | (?i:saturday|sunday) |
| 4459 | (?:(?i)saturday|sunday) | (?:(?i)saturday|sunday) |
| 4460 | ||
| 4461 | match exactly the same set of strings. Because alternative branches are | match exactly the same set of strings. Because alternative branches are |
| 4462 | tried from left to right, and options are not reset until the end of | tried from left to right, and options are not reset until the end of |
| 4463 | the subpattern is reached, an option setting in one branch does affect | the subpattern is reached, an option setting in one branch does affect |
| 4464 | subsequent branches, so the above patterns match "SUNDAY" as well as | subsequent branches, so the above patterns match "SUNDAY" as well as |
| 4465 | "Saturday". | "Saturday". |
| 4466 | ||
| 4467 | ||
| 4468 | DUPLICATE SUBPATTERN NUMBERS | DUPLICATE SUBPATTERN NUMBERS |
| 4469 | ||
| 4470 | Perl 5.10 introduced a feature whereby each alternative in a subpattern | Perl 5.10 introduced a feature whereby each alternative in a subpattern |
| 4471 | uses the same numbers for its capturing parentheses. Such a subpattern | uses the same numbers for its capturing parentheses. Such a subpattern |
| 4472 | starts with (?| and is itself a non-capturing subpattern. For example, | starts with (?| and is itself a non-capturing subpattern. For example, |
| 4473 | consider this pattern: | consider this pattern: |
| 4474 | ||
| 4475 | (?|(Sat)ur|(Sun))day | (?|(Sat)ur|(Sun))day |
| 4476 | ||
| 4477 | Because the two alternatives are inside a (?| group, both sets of cap- | Because the two alternatives are inside a (?| group, both sets of cap- |
| 4478 | turing parentheses are numbered one. Thus, when the pattern matches, | turing parentheses are numbered one. Thus, when the pattern matches, |
| 4479 | you can look at captured substring number one, whichever alternative | you can look at captured substring number one, whichever alternative |
| 4480 | matched. This construct is useful when you want to capture part, but | matched. This construct is useful when you want to capture part, but |
| 4481 | not all, of one of a number of alternatives. Inside a (?| group, paren- | not all, of one of a number of alternatives. Inside a (?| group, paren- |
| 4482 | theses are numbered as usual, but the number is reset at the start of | theses are numbered as usual, but the number is reset at the start of |
| 4483 | each branch. The numbers of any capturing parentheses that follow the | each branch. The numbers of any capturing parentheses that follow the |
| 4484 | subpattern start after the highest number used in any branch. The fol- | subpattern start after the highest number used in any branch. The fol- |
| 4485 | lowing example is taken from the Perl documentation. The numbers under- | lowing example is taken from the Perl documentation. The numbers under- |
| 4486 | neath show in which buffer the captured content will be stored. | neath show in which buffer the captured content will be stored. |
| 4487 | ||
| # | Line 4471 DUPLICATE SUBPATTERN NUMBERS | Line 4489 DUPLICATE SUBPATTERN NUMBERS |
| 4489 | / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x | / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| 4490 | # 1 2 2 3 2 3 4 | # 1 2 2 3 2 3 4 |
| 4491 | ||
| 4492 | A back reference to a numbered subpattern uses the most recent value | A back reference to a numbered subpattern uses the most recent value |
| 4493 | that is set for that number by any subpattern. The following pattern | that is set for that number by any subpattern. The following pattern |
| 4494 | matches "abcabc" or "defdef": | matches "abcabc" or "defdef": |
| 4495 | ||
| 4496 | /(?|(abc)|(def))\1/ | /(?|(abc)|(def))\1/ |
| 4497 | ||
| 4498 | In contrast, a recursive or "subroutine" call to a numbered subpattern | In contrast, a subroutine call to a numbered subpattern always refers |
| 4499 | always refers to the first one in the pattern with the given number. | to the first one in the pattern with the given number. The following |
| 4500 | The following pattern matches "abcabc" or "defabc": | pattern matches "abcabc" or "defabc": |
| 4501 | ||
| 4502 | /(?|(abc)|(def))(?1)/ | /(?|(abc)|(def))(?1)/ |
| 4503 | ||
| 4504 | If a condition test for a subpattern's having matched refers to a non- | If a condition test for a subpattern's having matched refers to a non- |
| 4505 | unique number, the test is true if any of the subpatterns of that num- | unique number, the test is true if any of the subpatterns of that num- |
| 4506 | ber have matched. | ber have matched. |
| 4507 | ||
| 4508 | An alternative approach to using this "branch reset" feature is to use | An alternative approach to using this "branch reset" feature is to use |
| 4509 | duplicate named subpatterns, as described in the next section. | duplicate named subpatterns, as described in the next section. |
| 4510 | ||
| 4511 | ||
| 4512 | NAMED SUBPATTERNS | NAMED SUBPATTERNS |
| 4513 | ||
| 4514 | Identifying capturing parentheses by number is simple, but it can be | Identifying capturing parentheses by number is simple, but it can be |
| 4515 | very hard to keep track of the numbers in complicated regular expres- | very hard to keep track of the numbers in complicated regular expres- |
| 4516 | sions. Furthermore, if an expression is modified, the numbers may | sions. Furthermore, if an expression is modified, the numbers may |
| 4517 | change. To help with this difficulty, PCRE supports the naming of sub- | change. To help with this difficulty, PCRE supports the naming of sub- |
| 4518 | patterns. This feature was not added to Perl until release 5.10. Python | patterns. This feature was not added to Perl until release 5.10. Python |
| 4519 | had the feature earlier, and PCRE introduced it at release 4.0, using | had the feature earlier, and PCRE introduced it at release 4.0, using |
| 4520 | the Python syntax. PCRE now supports both the Perl and the Python syn- | the Python syntax. PCRE now supports both the Perl and the Python syn- |
| 4521 | tax. Perl allows identically numbered subpatterns to have different | tax. Perl allows identically numbered subpatterns to have different |
| 4522 | names, but PCRE does not. | names, but PCRE does not. |
| 4523 | ||
| 4524 | In PCRE, a subpattern can be named in one of three ways: (?<name>...) | In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
| 4525 | or (?'name'...) as in Perl, or (?P<name>...) as in Python. References | or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
| 4526 | to capturing parentheses from other parts of the pattern, such as back | to capturing parentheses from other parts of the pattern, such as back |
| 4527 | references, recursion, and conditions, can be made by name as well as | references, recursion, and conditions, can be made by name as well as |
| 4528 | by number. | by number. |
| 4529 | ||
| 4530 | Names consist of up to 32 alphanumeric characters and underscores. | Names consist of up to 32 alphanumeric characters and underscores. |
| 4531 | Named capturing parentheses are still allocated numbers as well as | Named capturing parentheses are still allocated numbers as well as |
| 4532 | names, exactly as if the names were not present. The PCRE API provides | names, exactly as if the names were not present. The PCRE API provides |
| 4533 | function calls for extracting the name-to-number translation table from | function calls for extracting the name-to-number translation table from |
| 4534 | a compiled pattern. There is also a convenience function for extracting | a compiled pattern. There is also a convenience function for extracting |
| 4535 | a captured substring by name. | a captured substring by name. |
| 4536 | ||
| 4537 | By default, a name must be unique within a pattern, but it is possible | By default, a name must be unique within a pattern, but it is possible |
| 4538 | to relax this constraint by setting the PCRE_DUPNAMES option at compile | to relax this constraint by setting the PCRE_DUPNAMES option at compile |
| 4539 | time. (Duplicate names are also always permitted for subpatterns with | time. (Duplicate names are also always permitted for subpatterns with |
| 4540 | the same number, set up as described in the previous section.) Dupli- | the same number, set up as described in the previous section.) Dupli- |
| 4541 | cate names can be useful for patterns where only one instance of the | cate names can be useful for patterns where only one instance of the |
| 4542 | named parentheses can match. Suppose you want to match the name of a | named parentheses can match. Suppose you want to match the name of a |
| 4543 | weekday, either as a 3-letter abbreviation or as the full name, and in | weekday, either as a 3-letter abbreviation or as the full name, and in |
| 4544 | both cases you want to extract the abbreviation. This pattern (ignoring | both cases you want to extract the abbreviation. This pattern (ignoring |
| 4545 | the line breaks) does the job: | the line breaks) does the job: |
| 4546 | ||
| # | Line 4532 NAMED SUBPATTERNS | Line 4550 NAMED SUBPATTERNS |
| 4550 | (?<DN>Thu)(?:rsday)?| | (?<DN>Thu)(?:rsday)?| |
| 4551 | (?<DN>Sat)(?:urday)? | (?<DN>Sat)(?:urday)? |
| 4552 | ||
| 4553 | There are five capturing substrings, but only one is ever set after a | There are five capturing substrings, but only one is ever set after a |
| 4554 | match. (An alternative way of solving this problem is to use a "branch | match. (An alternative way of solving this problem is to use a "branch |
| 4555 | reset" subpattern, as described in the previous section.) | reset" subpattern, as described in the previous section.) |
| 4556 | ||
| 4557 | The convenience function for extracting the data by name returns the | The convenience function for extracting the data by name returns the |
| 4558 | substring for the first (and in this example, the only) subpattern of | substring for the first (and in this example, the only) subpattern of |
| 4559 | that name that matched. This saves searching to find which numbered | that name that matched. This saves searching to find which numbered |
| 4560 | subpattern it was. | subpattern it was. |
| 4561 | ||
| 4562 | If you make a back reference to a non-unique named subpattern from | If you make a back reference to a non-unique named subpattern from |
| 4563 | elsewhere in the pattern, the one that corresponds to the first occur- | elsewhere in the pattern, the one that corresponds to the first occur- |
| 4564 | rence of the name is used. In the absence of duplicate numbers (see the | rence of the name is used. In the absence of duplicate numbers (see the |
| 4565 | previous section) this is the one with the lowest number. If you use a | previous section) this is the one with the lowest number. If you use a |
| 4566 | named reference in a condition test (see the section about conditions | named reference in a condition test (see the section about conditions |
| 4567 | below), either to check whether a subpattern has matched, or to check | below), either to check whether a subpattern has matched, or to check |
| 4568 | for recursion, all subpatterns with the same name are tested. If the | for recursion, all subpatterns with the same name are tested. If the |
| 4569 | condition is true for any one of them, the overall condition is true. | condition is true for any one of them, the overall condition is true. |
| 4570 | This is the same behaviour as testing by number. For further details of | This is the same behaviour as testing by number. For further details of |
| 4571 | the interfaces for handling named subpatterns, see the pcreapi documen- | the interfaces for handling named subpatterns, see the pcreapi documen- |
| 4572 | tation. | tation. |
| 4573 | ||
| 4574 | Warning: You cannot use different names to distinguish between two sub- | Warning: You cannot use different names to distinguish between two sub- |
| 4575 | patterns with the same number because PCRE uses only the numbers when | patterns with the same number because PCRE uses only the numbers when |
| 4576 | matching. For this reason, an error is given at compile time if differ- | matching. For this reason, an error is given at compile time if differ- |
| 4577 | ent names are given to subpatterns with the same number. However, you | ent names are given to subpatterns with the same number. However, you |
| 4578 | can give the same name to subpatterns with the same number, even when | can give the same name to subpatterns with the same number, even when |
| 4579 | PCRE_DUPNAMES is not set. | PCRE_DUPNAMES is not set. |
| 4580 | ||
| 4581 | ||
| 4582 | REPETITION | REPETITION |
| 4583 | ||
| 4584 | Repetition is specified by quantifiers, which can follow any of the | Repetition is specified by quantifiers, which can follow any of the |
| 4585 | following items: | following items: |
| 4586 | ||
| 4587 | a literal data character | a literal data character |
| # | Line 4575 REPETITION | Line 4593 REPETITION |
| 4593 | a character class | a character class |
| 4594 | a back reference (see next section) | a back reference (see next section) |
| 4595 | a parenthesized subpattern (including assertions) | a parenthesized subpattern (including assertions) |
| 4596 | a recursive or "subroutine" call to a subpattern | a subroutine call to a subpattern (recursive or otherwise) |
| 4597 | ||
| 4598 | The general repetition quantifier specifies a minimum and maximum num- | The general repetition quantifier specifies a minimum and maximum num- |
| 4599 | ber of permitted matches, by giving the two numbers in curly brackets | ber of permitted matches, by giving the two numbers in curly brackets |
| 4600 | (braces), separated by a comma. The numbers must be less than 65536, | (braces), separated by a comma. The numbers must be less than 65536, |
| 4601 | and the first must be less than or equal to the second. For example: | and the first must be less than or equal to the second. For example: |
| 4602 | ||
| 4603 | z{2,4} | z{2,4} |
| 4604 | ||
| 4605 | matches "zz", "zzz", or "zzzz". A closing brace on its own is not a | matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
| 4606 | special character. If the second number is omitted, but the comma is | special character. If the second number is omitted, but the comma is |
| 4607 | present, there is no upper limit; if the second number and the comma | present, there is no upper limit; if the second number and the comma |
| 4608 | are both omitted, the quantifier specifies an exact number of required | are both omitted, the quantifier specifies an exact number of required |
| 4609 | matches. Thus | matches. Thus |
| 4610 | ||
| 4611 | [aeiou]{3,} | [aeiou]{3,} |
| # | Line 4596 REPETITION | Line 4614 REPETITION |
| 4614 | ||
| 4615 | \d{8} | \d{8} |
| 4616 | ||
| 4617 | matches exactly 8 digits. An opening curly bracket that appears in a | matches exactly 8 digits. An opening curly bracket that appears in a |
| 4618 | position where a quantifier is not allowed, or one that does not match | position where a quantifier is not allowed, or one that does not match |
| 4619 | the syntax of a quantifier, is taken as a literal character. For exam- | the syntax of a quantifier, is taken as a literal character. For exam- |
| 4620 | ple, {,6} is not a quantifier, but a literal string of four characters. | ple, {,6} is not a quantifier, but a literal string of four characters. |
| 4621 | ||
| 4622 | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
| 4623 | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- |
| 4624 | acters, each of which is represented by a two-byte sequence. Similarly, | acters, each of which is represented by a two-byte sequence. Similarly, |
| 4625 | when Unicode property support is available, \X{3} matches three Unicode | when Unicode property support is available, \X{3} matches three Unicode |
| 4626 | extended sequences, each of which may be several bytes long (and they | extended sequences, each of which may be several bytes long (and they |
| 4627 | may be of different lengths). | may be of different lengths). |
| 4628 | ||
| 4629 | The quantifier {0} is permitted, causing the expression to behave as if | The quantifier {0} is permitted, causing the expression to behave as if |
| 4630 | the previous item and the quantifier were not present. This may be use- | the previous item and the quantifier were not present. This may be use- |
| 4631 | ful for subpatterns that are referenced as subroutines from elsewhere | ful for subpatterns that are referenced as subroutines from elsewhere |
| 4632 | in the pattern (but see also the section entitled "Defining subpatterns | in the pattern (but see also the section entitled "Defining subpatterns |
| 4633 | for use by reference only" below). Items other than subpatterns that | for use by reference only" below). Items other than subpatterns that |
| 4634 | have a {0} quantifier are omitted from the compiled pattern. | have a {0} quantifier are omitted from the compiled pattern. |
| 4635 | ||
| 4636 | For convenience, the three most common quantifiers have single-charac- | For convenience, the three most common quantifiers have single-charac- |
| 4637 | ter abbreviations: | ter abbreviations: |
| 4638 | ||
| 4639 | * is equivalent to {0,} | * is equivalent to {0,} |
| 4640 | + is equivalent to {1,} | + is equivalent to {1,} |
| 4641 | ? is equivalent to {0,1} | ? is equivalent to {0,1} |
| 4642 | ||
| 4643 | It is possible to construct infinite loops by following a subpattern | It is possible to construct infinite loops by following a subpattern |
| 4644 | that can match no characters with a quantifier that has no upper limit, | that can match no characters with a quantifier that has no upper limit, |
| 4645 | for example: | for example: |
| 4646 | ||
| 4647 | (a?)* | (a?)* |
| 4648 | ||
| 4649 | Earlier versions of Perl and PCRE used to give an error at compile time | Earlier versions of Perl and PCRE used to give an error at compile time |
| 4650 | for such patterns. However, because there are cases where this can be | for such patterns. However, because there are cases where this can be |
| 4651 | useful, such patterns are now accepted, but if any repetition of the | useful, such patterns are now accepted, but if any repetition of the |
| 4652 | subpattern does in fact match no characters, the loop is forcibly bro- | subpattern does in fact match no characters, the loop is forcibly bro- |
| 4653 | ken. | ken. |
| 4654 | ||
| 4655 | By default, the quantifiers are "greedy", that is, they match as much | By default, the quantifiers are "greedy", that is, they match as much |
| 4656 | as possible (up to the maximum number of permitted times), without | as possible (up to the maximum number of permitted times), without |
| 4657 | causing the rest of the pattern to fail. The classic example of where | causing the rest of the pattern to fail. The classic example of where |
| 4658 | this gives problems is in trying to match comments in C programs. These | this gives problems is in trying to match comments in C programs. These |
| 4659 | appear between /* and */ and within the comment, individual * and / | appear between /* and */ and within the comment, individual * and / |
| 4660 | characters may appear. An attempt to match C comments by applying the | characters may appear. An attempt to match C comments by applying the |
| 4661 | pattern | pattern |
| 4662 | ||
| 4663 | /\*.*\*/ | /\*.*\*/ |
| # | Line 4648 REPETITION | Line 4666 REPETITION |
| 4666 | ||
| 4667 | /* first comment */ not comment /* second comment */ | /* first comment */ not comment /* second comment */ |
| 4668 | ||
| 4669 | fails, because it matches the entire string owing to the greediness of | fails, because it matches the entire string owing to the greediness of |
| 4670 | the .* item. | the .* item. |
| 4671 | ||
| 4672 | However, if a quantifier is followed by a question mark, it ceases to | However, if a quantifier is followed by a question mark, it ceases to |
| 4673 | be greedy, and instead matches the minimum number of times possible, so | be greedy, and instead matches the minimum number of times possible, so |
| 4674 | the pattern | the pattern |
| 4675 | ||
| 4676 | /\*.*?\*/ | /\*.*?\*/ |
| 4677 | ||
| 4678 | does the right thing with the C comments. The meaning of the various | does the right thing with the C comments. The meaning of the various |
| 4679 | quantifiers is not otherwise changed, just the preferred number of | quantifiers is not otherwise changed, just the preferred number of |
| 4680 | matches. Do not confuse this use of question mark with its use as a | matches. Do not confuse this use of question mark with its use as a |
| 4681 | quantifier in its own right. Because it has two uses, it can sometimes | quantifier in its own right. Because it has two uses, it can sometimes |
| 4682 | appear doubled, as in | appear doubled, as in |
| 4683 | ||
| 4684 | \d??\d | \d??\d |
| # | Line 4668 REPETITION | Line 4686 REPETITION |
| 4686 | which matches one digit by preference, but can match two if that is the | which matches one digit by preference, but can match two if that is the |
| 4687 | only way the rest of the pattern matches. | only way the rest of the pattern matches. |
| 4688 | ||
| 4689 | If the PCRE_UNGREEDY option is set (an option that is not available in | If the PCRE_UNGREEDY option is set (an option that is not available in |
| 4690 | Perl), the quantifiers are not greedy by default, but individual ones | Perl), the quantifiers are not greedy by default, but individual ones |
| 4691 | can be made greedy by following them with a question mark. In other | can be made greedy by following them with a question mark. In other |
| 4692 | words, it inverts the default behaviour. | words, it inverts the default behaviour. |
| 4693 | ||
| 4694 | When a parenthesized subpattern is quantified with a minimum repeat | When a parenthesized subpattern is quantified with a minimum repeat |
| 4695 | count that is greater than 1 or with a limited maximum, more memory is | count that is greater than 1 or with a limited maximum, more memory is |
| 4696 | required for the compiled pattern, in proportion to the size of the | required for the compiled pattern, in proportion to the size of the |
| 4697 | minimum or maximum. | minimum or maximum. |
| 4698 | ||
| 4699 | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- |
| 4700 | alent to Perl's /s) is set, thus allowing the dot to match newlines, | alent to Perl's /s) is set, thus allowing the dot to match newlines, |
| 4701 | the pattern is implicitly anchored, because whatever follows will be | the pattern is implicitly anchored, because whatever follows will be |
| 4702 | tried against every character position in the subject string, so there | tried against every character position in the subject string, so there |
| 4703 | is no point in retrying the overall match at any position after the | is no point in retrying the overall match at any position after the |
| 4704 | first. PCRE normally treats such a pattern as though it were preceded | first. PCRE normally treats such a pattern as though it were preceded |
| 4705 | by \A. | by \A. |
| 4706 | ||
| 4707 | In cases where it is known that the subject string contains no new- | In cases where it is known that the subject string contains no new- |
| 4708 | lines, it is worth setting PCRE_DOTALL in order to obtain this opti- | lines, it is worth setting PCRE_DOTALL in order to obtain this opti- |
| 4709 | mization, or alternatively using ^ to indicate anchoring explicitly. | mization, or alternatively using ^ to indicate anchoring explicitly. |
| 4710 | ||
| 4711 | However, there is one situation where the optimization cannot be used. | However, there is one situation where the optimization cannot be used. |
| 4712 | When .* is inside capturing parentheses that are the subject of a back | When .* is inside capturing parentheses that are the subject of a back |
| 4713 | reference elsewhere in the pattern, a match at the start may fail where | reference elsewhere in the pattern, a match at the start may fail where |
| 4714 | a later one succeeds. Consider, for example: | a later one succeeds. Consider, for example: |
| 4715 | ||
| 4716 | (.*)abc\1 | (.*)abc\1 |
| 4717 | ||
| 4718 | If the subject is "xyz123abc123" the match point is the fourth charac- | If the subject is "xyz123abc123" the match point is the fourth charac- |
| 4719 | ter. For this reason, such a pattern is not implicitly anchored. | ter. For this reason, such a pattern is not implicitly anchored. |
| 4720 | ||
| 4721 | When a capturing subpattern is repeated, the value captured is the sub- | When a capturing subpattern is repeated, the value captured is the sub- |
| # | Line 4706 REPETITION | Line 4724 REPETITION |
| 4724 | (tweedle[dume]{3}\s*)+ | (tweedle[dume]{3}\s*)+ |
| 4725 | ||
| 4726 | has matched "tweedledum tweedledee" the value of the captured substring | has matched "tweedledum tweedledee" the value of the captured substring |
| 4727 | is "tweedledee". However, if there are nested capturing subpatterns, | is "tweedledee". However, if there are nested capturing subpatterns, |
| 4728 | the corresponding captured values may have been set in previous itera- | the corresponding captured values may have been set in previous itera- |
| 4729 | tions. For example, after | tions. For example, after |
| 4730 | ||
| 4731 | /(a|(b))+/ | /(a|(b))+/ |
| # | Line 4717 REPETITION | Line 4735 REPETITION |
| 4735 | ||
| 4736 | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
| 4737 | ||
| 4738 | With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") | With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
| 4739 | repetition, failure of what follows normally causes the repeated item | repetition, failure of what follows normally causes the repeated item |
| 4740 | to be re-evaluated to see if a different number of repeats allows the | to be re-evaluated to see if a different number of repeats allows the |
| 4741 | rest of the pattern to match. Sometimes it is useful to prevent this, | rest of the pattern to match. Sometimes it is useful to prevent this, |
| 4742 | either to change the nature of the match, or to cause it to fail earlier | either to change the nature of the match, or to cause it to fail earlier |
| 4743 | than it otherwise might, when the author of the pattern knows there is | than it otherwise might, when the author of the pattern knows there is |
| 4744 | no point in carrying on. | no point in carrying on. |
| 4745 | ||
| 4746 | Consider, for example, the pattern \d+foo when applied to the subject | Consider, for example, the pattern \d+foo when applied to the subject |
| 4747 | line | line |
| 4748 | ||
| 4749 | 123456bar | 123456bar |
| 4750 | ||
| 4751 | After matching all 6 digits and then failing to match "foo", the normal | After matching all 6 digits and then failing to match "foo", the normal |
| 4752 | action of the matcher is to try again with only 5 digits matching the | action of the matcher is to try again with only 5 digits matching the |
| 4753 | \d+ item, and then with 4, and so on, before ultimately failing. | \d+ item, and then with 4, and so on, before ultimately failing. |
| 4754 | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides |
| 4755 | the means for specifying that once a subpattern has matched, it is not | the means for specifying that once a subpattern has matched, it is not |
| 4756 | to be re-evaluated in this way. | to be re-evaluated in this way. |
| 4757 | ||
| 4758 | If we use atomic grouping for the previous example, the matcher gives | If we use atomic grouping for the previous example, the matcher gives |
| 4759 | up immediately on failing to match "foo" the first time. The notation | up immediately on failing to match "foo" the first time. The notation |
| 4760 | is a kind of special parenthesis, starting with (?> as in this example: | is a kind of special parenthesis, starting with (?> as in this example: |
| 4761 | ||
| 4762 | (?>\d+)foo | (?>\d+)foo |
| 4763 | ||
| 4764 | This kind of parenthesis "locks up" the part of the pattern it con- | This kind of parenthesis "locks up" the part of the pattern it con- |
| 4765 | tains once it has matched, and a failure further into the pattern is | tains once it has matched, and a failure further into the pattern is |
| 4766 | prevented from backtracking into it. Backtracking past it to previous | prevented from backtracking into it. Backtracking past it to previous |
| 4767 | items, however, works as normal. | items, however, works as normal. |
| 4768 | ||
| 4769 | An alternative description is that a subpattern of this type matches | An alternative description is that a subpattern of this type matches |
| 4770 | the string of characters that an identical standalone pattern would | the string of characters that an identical standalone pattern would |
| 4771 | match, if anchored at the current point in the subject string. | match, if anchored at the current point in the subject string. |
| 4772 | ||
| 4773 | Atomic grouping subpatterns are not capturing subpatterns. Simple cases | Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
| 4774 | such as the above example can be thought of as a maximizing repeat that | such as the above example can be thought of as a maximizing repeat that |
| 4775 | must swallow everything it can. So, while both \d+ and \d+? are pre- | must swallow everything it can. So, while both \d+ and \d+? are pre- |
| 4776 | pared to adjust the number of digits they match in order to make the | pared to adjust the number of digits they match in order to make the |
| 4777 | rest of the pattern match, (?>\d+) can only match an entire sequence of | rest of the pattern match, (?>\d+) can only match an entire sequence of |
| 4778 | digits. | digits. |
| 4779 | ||
| 4780 | Atomic groups in general can of course contain arbitrarily complicated | Atomic groups in general can of course contain arbitrarily complicated |
| 4781 | subpatterns, and can be nested. However, when the subpattern for an | subpatterns, and can be nested. However, when the subpattern for an |
| 4782 | atomic group is just a single repeated item, as in the example above, a | atomic group is just a single repeated item, as in the example above, a |
| 4783 | simpler notation, called a "possessive quantifier", can be used. This | simpler notation, called a "possessive quantifier", can be used. This |
| 4784 | consists of an additional + character following a quantifier. Using | consists of an additional + character following a quantifier. Using |
| 4785 | this notation, the previous example can be rewritten as | this notation, the previous example can be rewritten as |
| 4786 | ||
| 4787 | \d++foo | \d++foo |
| # | Line 4773 ATOMIC GROUPING AND POSSESSIVE QUANTIFIE | Line 4791 ATOMIC GROUPING AND POSSESSIVE QUANTIFIE |
| 4791 | ||
| 4792 | (abc|xyz){2,3}+ | (abc|xyz){2,3}+ |
| 4793 | ||
| 4794 | Possessive quantifiers are always greedy; the setting of the | Possessive quantifiers are always greedy; the setting of the |
| 4795 | PCRE_UNGREEDY option is ignored. They are a convenient notation for the | PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
| 4796 | simpler forms of atomic group. However, there is no difference in the | simpler forms of atomic group. However, there is no difference in the |
| 4797 | meaning of a possessive quantifier and the equivalent atomic group, | meaning of a possessive quantifier and the equivalent atomic group, |
| 4798 | though there may be a performance difference; possessive quantifiers | though there may be a performance difference; possessive quantifiers |
| 4799 | should be slightly faster. | should be slightly faster. |
| 4800 | ||
| 4801 | The possessive quantifier syntax is an extension to the Perl 5.8 syn- | The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
| 4802 | tax. Jeffrey Friedl originated the idea (and the name) in the first | tax. Jeffrey Friedl originated the idea (and the name) in the first |
| 4803 | edition of his book. Mike McCloskey liked it, so implemented it when he | edition of his book. Mike McCloskey liked it, so implemented it when he |
| 4804 | built Sun's Java package, and PCRE copied it from there. It ultimately | built Sun's Java package, and PCRE copied it from there. It ultimately |
| 4805 | found its way into Perl at release 5.10. | found its way into Perl at release 5.10. |
| 4806 | ||
| 4807 | PCRE has an optimization that automatically "possessifies" certain sim- | PCRE has an optimization that automatically "possessifies" certain sim- |
| 4808 | ple pattern constructs. For example, the sequence A+B is treated as | ple pattern constructs. For example, the sequence A+B is treated as |
| 4809 | A++B because there is no point in backtracking into a sequence of A's | A++B because there is no point in backtracking into a sequence of A's |
| 4810 | when B must follow. | when B must follow. |
| 4811 | ||
| 4812 | When a pattern contains an unlimited repeat inside a subpattern that | When a pattern contains an unlimited repeat inside a subpattern that |
| 4813 | can itself be repeated an unlimited number of times, the use of an | can itself be repeated an unlimited number of times, the use of an |
| 4814 | atomic group is the only way to avoid some failing matches taking a | atomic group is the only way to avoid some failing matches taking a |
| 4815 | very long time indeed. The pattern | very long time indeed. The pattern |
| 4816 | ||
| 4817 | (\D+|<\d+>)*[!?] | (\D+|<\d+>)*[!?] |
| 4818 | ||
| 4819 | matches an unlimited number of substrings that either consist of non- | matches an unlimited number of substrings that either consist of non- |
| 4820 | digits, or digits enclosed in <>, followed by either ! or ?. When it | digits, or digits enclosed in <>, followed by either ! or ?. When it |
| 4821 | matches, it runs quickly. However, if it is applied to | matches, it runs quickly. However, if it is applied to |
| 4822 | ||
| 4823 | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| 4824 | ||
| 4825 | it takes a long time before reporting failure. This is because the | it takes a long time before reporting failure. This is because the |
| 4826 | string can be divided between the internal \D+ repeat and the external | string can be divided between the internal \D+ repeat and the external |
| 4827 | * repeat in a large number of ways, and all have to be tried. (The | * repeat in a large number of ways, and all have to be tried. (The |
| 4828 | example uses [!?] rather than a single character at the end, because | example uses [!?] rather than a single character at the end, because |
| 4829 | both PCRE and Perl have an optimization that allows for fast failure | both PCRE and Perl have an optimization that allows for fast failure |
| 4830 | when a single character is used. They remember the last single charac- | when a single character is used. They remember the last single charac- |
| 4831 | ter that is required for a match, and fail early if it is not present | ter that is required for a match, and fail early if it is not present |
| 4832 | in the string.) If the pattern is changed so that it uses an atomic | in the string.) If the pattern is changed so that it uses an atomic |
| 4833 | group, like this: | group, like this: |
| 4834 | ||
| 4835 | ((?>\D+)|<\d+>)*[!?] | ((?>\D+)|<\d+>)*[!?] |
| # | Line 4823 BACK REFERENCES | Line 4841 BACK REFERENCES |
| 4841 | ||
| 4842 | Outside a character class, a backslash followed by a digit greater than | Outside a character class, a backslash followed by a digit greater than |
| 4843 | 0 (and possibly further digits) is a back reference to a capturing sub- | 0 (and possibly further digits) is a back reference to a capturing sub- |
| 4844 | pattern earlier (that is, to its left) in the pattern, provided there | pattern earlier (that is, to its left) in the pattern, provided there |
| 4845 | have been that many previous capturing left parentheses. | have been that many previous capturing left parentheses. |
| 4846 | ||
| 4847 | However, if the decimal number following the backslash is less than 10, | However, if the decimal number following the backslash is less than 10, |
| 4848 | it is always taken as a back reference, and causes an error only if | it is always taken as a back reference, and causes an error only if |
| 4849 | there are not that many capturing left parentheses in the entire pat- | there are not that many capturing left parentheses in the entire pat- |
| 4850 | tern. In other words, the parentheses that are referenced need not be | tern. In other words, the parentheses that are referenced need not be |
| 4851 | to the left of the reference for numbers less than 10. A "forward back | to the left of the reference for numbers less than 10. A "forward back |
| 4852 | reference" of this type can make sense when a repetition is involved | reference" of this type can make sense when a repetition is involved |
| 4853 | and the subpattern to the right has participated in an earlier itera- | and the subpattern to the right has participated in an earlier itera- |
| 4854 | tion. | tion. |
| 4855 | ||
| 4856 | It is not possible to have a numerical "forward back reference" to a | It is not possible to have a numerical "forward back reference" to a |
| 4857 | subpattern whose number is 10 or more using this syntax because a | subpattern whose number is 10 or more using this syntax because a |
| 4858 | sequence such as \50 is interpreted as a character defined in octal. | sequence such as \50 is interpreted as a character defined in octal. |
| 4859 | See the subsection entitled "Non-printing characters" above for further | See the subsection entitled "Non-printing characters" above for further |
| 4860 | details of the handling of digits following a backslash. There is no | details of the handling of digits following a backslash. There is no |
| 4861 | such problem when named parentheses are used. A back reference to any | such problem when named parentheses are used. A back reference to any |
| 4862 | subpattern is possible using named parentheses (see below). | subpattern is possible using named parentheses (see below). |
| 4863 | ||
| 4864 | Another way of avoiding the ambiguity inherent in the use of digits | Another way of avoiding the ambiguity inherent in the use of digits |
| 4865 | following a backslash is to use the \g escape sequence. This escape | following a backslash is to use the \g escape sequence. This escape |
| 4866 | must be followed by an unsigned number or a negative number, optionally | must be followed by an unsigned number or a negative number, optionally |
| 4867 | enclosed in braces. These examples are all identical: | enclosed in braces. These examples are all identical: |
| 4868 | ||
| # | Line 4852 BACK REFERENCES | Line 4870 BACK REFERENCES |
| 4870 | (ring), \g1 | (ring), \g1 |
| 4871 | (ring), \g{1} | (ring), \g{1} |
| 4872 | ||
| 4873 | An unsigned number specifies an absolute reference without the ambigu- | An unsigned number specifies an absolute reference without the ambigu- |
| 4874 | ity that is present in the older syntax. It is also useful when literal | ity that is present in the older syntax. It is also useful when literal |
| 4875 | digits follow the reference. A negative number is a relative reference. | digits follow the reference. A negative number is a relative reference. |
| 4876 | Consider this example: | Consider this example: |
| # | Line 4861 BACK REFERENCES | Line 4879 BACK REFERENCES |
| 4879 | ||
| 4880 | The sequence \g{-1} is a reference to the most recently started | The sequence \g{-1} is a reference to the most recently started |
| 4881 | capturing subpattern before \g, that is, it is equivalent to \2 in this | capturing subpattern before \g, that is, it is equivalent to \2 in this |
| 4882 | example. Similarly, \g{-2} would be equivalent to \1. The use of relative | example. Similarly, \g{-2} would be equivalent to \1. The use of relative |
| 4883 | references can be helpful in long patterns, and also in patterns that | references can be helpful in long patterns, and also in patterns that |
| 4884 | are created by joining together fragments that contain references | are created by joining together fragments that contain references |
| 4885 | within themselves. | within themselves. |
| 4886 | ||
| 4887 | A back reference matches whatever actually matched the capturing sub- | A back reference matches whatever actually matched the capturing sub- |
| 4888 | pattern in the current subject string, rather than anything matching | pattern in the current subject string, rather than anything matching |
| 4889 | the subpattern itself (see "Subpatterns as subroutines" below for a way | the subpattern itself (see "Subpatterns as subroutines" below for a way |
| 4890 | of doing that). So the pattern | of doing that). So the pattern |
| 4891 | ||
| 4892 | (sens|respons)e and \1ibility | (sens|respons)e and \1ibility |
| 4893 | ||
| 4894 | matches "sense and sensibility" and "response and responsibility", but | matches "sense and sensibility" and "response and responsibility", but |
| 4895 | not "sense and responsibility". If caseful matching is in force at the | not "sense and responsibility". If caseful matching is in force at the |
| 4896 | time of the back reference, the case of letters is relevant. For exam- | time of the back reference, the case of letters is relevant. For exam- |
| 4897 | ple, | ple, |
| 4898 | ||
| 4899 | ((?i)rah)\s+\1 | ((?i)rah)\s+\1 |
| 4900 | ||
| 4901 | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| 4902 | original capturing subpattern is matched caselessly. | original capturing subpattern is matched caselessly. |
| 4903 | ||
| 4904 | There are several different ways of writing back references to named | There are several different ways of writing back references to named |
| 4905 | subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or | subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
| 4906 | \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's | \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
| 4907 | unified back reference syntax, in which \g can be used for both numeric | unified back reference syntax, in which \g can be used for both numeric |
| 4908 | and named references, is also supported. We could rewrite the above | and named references, is also supported. We could rewrite the above |
| 4909 | example in any of the following ways: | example in any of the following ways: |
| 4910 | ||
| 4911 | (?<p1>(?i)rah)\s+\k<p1> | (?<p1>(?i)rah)\s+\k<p1> |
| # | Line 4895 BACK REFERENCES | Line 4913 BACK REFERENCES |
| 4913 | (?P<p1>(?i)rah)\s+(?P=p1) | (?P<p1>(?i)rah)\s+(?P=p1) |
| 4914 | (?<p1>(?i)rah)\s+\g{p1} | (?<p1>(?i)rah)\s+\g{p1} |
| 4915 | ||
| 4916 | A subpattern that is referenced by name may appear in the pattern | A subpattern that is referenced by name may appear in the pattern |
| 4917 | before or after the reference. | before or after the reference. |
| 4918 | ||
| 4919 | There may be more than one back reference to the same subpattern. If a | There may be more than one back reference to the same subpattern. If a |
| 4920 | subpattern has not actually been used in a particular match, any back | subpattern has not actually been used in a particular match, any back |
| 4921 | references to it always fail by default. For example, the pattern | references to it always fail by default. For example, the pattern |
| 4922 | ||
| 4923 | (a|(bc))\2 | (a|(bc))\2 |
| 4924 | ||
| 4925 | always fails if it starts to match "a" rather than "bc". However, if | always fails if it starts to match "a" rather than "bc". However, if |
| 4926 | the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer- | the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer- |
| 4927 | ence to an unset value matches an empty string. | ence to an unset value matches an empty string. |
| 4928 | ||
| 4929 | Because there may be many capturing parentheses in a pattern, all dig- | Because there may be many capturing parentheses in a pattern, all dig- |
| 4930 | its following a backslash are taken as part of a potential back refer- | its following a backslash are taken as part of a potential back refer- |
| 4931 | ence number. If the pattern continues with a digit character, some | ence number. If the pattern continues with a digit character, some |
| 4932 | delimiter must be used to terminate the back reference. If the | delimiter must be used to terminate the back reference. If the |
| 4933 | PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ | PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
| 4934 | syntax or an empty comment (see "Comments" below) can be used. | syntax or an empty comment (see "Comments" below) can be used. |
| 4935 | ||
| 4936 | Recursive back references | Recursive back references |
| 4937 | ||
| 4938 | A back reference that occurs inside the parentheses to which it refers | A back reference that occurs inside the parentheses to which it refers |
| 4939 | fails when the subpattern is first used, so, for example, (a\1) never | fails when the subpattern is first used, so, for example, (a\1) never |
| 4940 | matches. However, such references can be useful inside repeated sub- | matches. However, such references can be useful inside repeated sub- |
| 4941 | patterns. For example, the pattern | patterns. For example, the pattern |
| 4942 | ||
| 4943 | (a|b\1)+ | (a|b\1)+ |
| 4944 | ||
| 4945 | matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- | matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
| 4946 | ation of the subpattern, the back reference matches the character | ation of the subpattern, the back reference matches the character |
| 4947 | string corresponding to the previous iteration. In order for this to | string corresponding to the previous iteration. In order for this to |
| 4948 | work, the pattern must be such that the first iteration does not need | work, the pattern must be such that the first iteration does not need |
| 4949 | to match the back reference. This can be done using alternation, as in | to match the back reference. This can be done using alternation, as in |
| 4950 | the example above, or by a quantifier with a minimum of zero. | the example above, or by a quantifier with a minimum of zero. |
| 4951 | ||
| 4952 | Back references of this type cause the group that they reference to be | Back references of this type cause the group that they reference to be |
| 4953 | treated as an atomic group. Once the whole group has been matched, a | treated as an atomic group. Once the whole group has been matched, a |
| 4954 | subsequent matching failure cannot cause backtracking into the middle | subsequent matching failure cannot cause backtracking into the middle |
| 4955 | of the group. | of the group. |
| 4956 | ||
| 4957 | ||
| 4958 | ASSERTIONS | ASSERTIONS |
| 4959 | ||
| 4960 | An assertion is a test on the characters following or preceding the | An assertion is a test on the characters following or preceding the |
| 4961 | current matching point that does not actually consume any characters. | current matching point that does not actually consume any characters. |
| 4962 | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
| 4963 | described above. | described above. |
| 4964 | ||
| 4965 | More complicated assertions are coded as subpatterns. There are two | More complicated assertions are coded as subpatterns. There are two |
| 4966 | kinds: those that look ahead of the current position in the subject | kinds: those that look ahead of the current position in the subject |
| 4967 | string, and those that look behind it. An assertion subpattern is | string, and those that look behind it. An assertion subpattern is |
| 4968 | matched in the normal way, except that it does not cause the current | matched in the normal way, except that it does not cause the current |
| 4969 | matching position to be changed. | matching position to be changed. |
| 4970 | ||
| 4971 | Assertion subpatterns are not capturing subpatterns. If such an asser- | Assertion subpatterns are not capturing subpatterns. If such an asser- |
| 4972 | tion contains capturing subpatterns within it, these are counted for | tion contains capturing subpatterns within it, these are counted for |
| 4973 | the purposes of numbering the capturing subpatterns in the whole pat- | the purposes of numbering the capturing subpatterns in the whole pat- |
| 4974 | tern. However, substring capturing is carried out only for positive | tern. However, substring capturing is carried out only for positive |
| 4975 | assertions, because it does not make sense for negative assertions. | assertions, because it does not make sense for negative assertions. |
| 4976 | ||
| 4977 | For compatibility with Perl, assertion subpatterns may be repeated; | For compatibility with Perl, assertion subpatterns may be repeated; |
| 4978 | though it makes no sense to assert the same thing several times, the | though it makes no sense to assert the same thing several times, the |
| 4979 | side effect of capturing parentheses may occasionally be useful. In | side effect of capturing parentheses may occasionally be useful. In |
| 4980 | practice, there are only three cases: | practice, there are only three cases: |
| 4981 | ||
| 4982 | (1) If the quantifier is {0}, the assertion is never obeyed during | (1) If the quantifier is {0}, the assertion is never obeyed during |
| 4983 | matching. However, it may contain internal capturing parenthesized | matching. However, it may contain internal capturing parenthesized |
| 4984 | groups that are called from elsewhere via the subroutine mechanism. | groups that are called from elsewhere via the subroutine mechanism. |
| 4985 | ||
| 4986 | (2) If the quantifier is {0,n} where n is greater than zero, it is treated | (2) If the quantifier is {0,n} where n is greater than zero, it is treated |
| 4987 | as if it were {0,1}. At run time, the rest of the pattern match is | as if it were {0,1}. At run time, the rest of the pattern match is |
| 4988 | tried with and without the assertion, the order depending on the greed- | tried with and without the assertion, the order depending on the greed- |
| 4989 | iness of the quantifier. | iness of the quantifier. |
| 4990 | ||
| 4991 | (3) If the minimum repetition is greater than zero, the quantifier is | (3) If the minimum repetition is greater than zero, the quantifier is |
| 4992 | ignored. The assertion is obeyed just once when encountered during | ignored. The assertion is obeyed just once when encountered during |
| 4993 | matching. | matching. |
| 4994 | ||
| 4995 | Lookahead assertions | Lookahead assertions |
| # | Line 4981 ASSERTIONS | Line 4999 ASSERTIONS |
| 4999 | ||
| 5000 | \w+(?=;) | \w+(?=;) |
| 5001 | ||
| 5002 | matches a word followed by a semicolon, but does not include the semi- | matches a word followed by a semicolon, but does not include the semi- |
| 5003 | colon in the match, and | colon in the match, and |
| 5004 | ||
| 5005 | foo(?!bar) | foo(?!bar) |
| 5006 | ||
| 5007 | matches any occurrence of "foo" that is not followed by "bar". Note | matches any occurrence of "foo" that is not followed by "bar". Note |
| 5008 | that the apparently similar pattern | that the apparently similar pattern |
| 5009 | ||
| 5010 | (?!foo)bar | (?!foo)bar |
| 5011 | ||
| 5012 | does not find an occurrence of "bar" that is preceded by something | does not find an occurrence of "bar" that is preceded by something |
| 5013 | other than "foo"; it finds any occurrence of "bar" whatsoever, because | other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| 5014 | the assertion (?!foo) is always true when the next three characters are | the assertion (?!foo) is always true when the next three characters are |
| 5015 | "bar". A lookbehind assertion is needed to achieve the other effect. | "bar". A lookbehind assertion is needed to achieve the other effect. |
| 5016 | ||
| 5017 | If you want to force a matching failure at some point in a pattern, the | If you want to force a matching failure at some point in a pattern, the |
| 5018 | most convenient way to do it is with (?!) because an empty string | most convenient way to do it is with (?!) because an empty string |
| 5019 | always matches, so an assertion that requires there not to be an empty | always matches, so an assertion that requires there not to be an empty |
| 5020 | string must always fail. The backtracking control verb (*FAIL) or (*F) | string must always fail. The backtracking control verb (*FAIL) or (*F) |
| 5021 | is a synonym for (?!). | is a synonym for (?!). |
| 5022 | ||
| 5023 | Lookbehind assertions | Lookbehind assertions |
| 5024 | ||
| 5025 | Lookbehind assertions start with (?<= for positive assertions and (?<! | Lookbehind assertions start with (?<= for positive assertions and (?<! |
| 5026 | for negative assertions. For example, | for negative assertions. For example, |
| 5027 | ||
| 5028 | (?<!foo)bar | (?<!foo)bar |
| 5029 | ||
| 5030 | does find an occurrence of "bar" that is not preceded by "foo". The | does find an occurrence of "bar" that is not preceded by "foo". The |
| 5031 | contents of a lookbehind assertion are restricted such that all the | contents of a lookbehind assertion are restricted such that all the |
| 5032 | strings it matches must have a fixed length. However, if there are sev- | strings it matches must have a fixed length. However, if there are sev- |
| 5033 | eral top-level alternatives, they do not all have to have the same | eral top-level alternatives, they do not all have to have the same |
| 5034 | fixed length. Thus | fixed length. Thus |
| 5035 | ||
| 5036 | (?<=bullock|donkey) | (?<=bullock|donkey) |
| # | Line 5021 ASSERTIONS | Line 5039 ASSERTIONS |
| 5039 | ||
| 5040 | (?<!dogs?|cats?) | (?<!dogs?|cats?) |
| 5041 | ||
| 5042 | causes an error at compile time. Branches that match different length | causes an error at compile time. Branches that match different length |
| 5043 | strings are permitted only at the top level of a lookbehind assertion. | strings are permitted only at the top level of a lookbehind assertion. |
| 5044 | This is an extension compared with Perl, which requires all branches to | This is an extension compared with Perl, which requires all branches to |
| 5045 | match the same length of string. An assertion such as | match the same length of string. An assertion such as |
| 5046 | ||
| 5047 | (?<=ab(c|de)) | (?<=ab(c|de)) |
| 5048 | ||
| 5049 | is not permitted, because its single top-level branch can match two | is not permitted, because its single top-level branch can match two |
| 5050 | different lengths, but it is acceptable to PCRE if rewritten to use two | different lengths, but it is acceptable to PCRE if rewritten to use two |
| 5051 | top-level branches: | top-level branches: |
| 5052 | ||
| 5053 | (?<=abc|abde) | (?<=abc|abde) |
| 5054 | ||
| 5055 | In some cases, the escape sequence \K (see above) can be used instead | In some cases, the escape sequence \K (see above) can be used instead |
| 5056 | of a lookbehind assertion to get round the fixed-length restriction. | of a lookbehind assertion to get round the fixed-length restriction. |
| 5057 | ||
| 5058 | The implementation of lookbehind assertions is, for each alternative, | The implementation of lookbehind assertions is, for each alternative, |
| 5059 | to temporarily move the current position back by the fixed length and | to temporarily move the current position back by the fixed length and |
| 5060 | then try to match. If there are insufficient characters before the cur- | then try to match. If there are insufficient characters before the cur- |
| 5061 | rent position, the assertion fails. | rent position, the assertion fails. |
| 5062 | ||
| 5063 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
| 5064 | mode) to appear in lookbehind assertions, because it makes it impossi- | mode) to appear in lookbehind assertions, because it makes it impossi- |
| 5065 | ble to calculate the length of the lookbehind. The \X and \R escapes, | ble to calculate the length of the lookbehind. The \X and \R escapes, |
| 5066 | which can match different numbers of bytes, are also not permitted. | which can match different numbers of bytes, are also not permitted. |
| 5067 | ||
| 5068 | "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in | "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
| 5069 | lookbehinds, as long as the subpattern matches a fixed-length string. | lookbehinds, as long as the subpattern matches a fixed-length string. |
| 5070 | Recursion, however, is not supported. | Recursion, however, is not supported. |
| 5071 | ||
| 5072 | Possessive quantifiers can be used in conjunction with lookbehind | Possessive quantifiers can be used in conjunction with lookbehind |
| 5073 | assertions to specify efficient matching of fixed-length strings at the | assertions to specify efficient matching of fixed-length strings at the |
| 5074 | end of subject strings. Consider a simple pattern such as | end of subject strings. Consider a simple pattern such as |
| 5075 | ||
| 5076 | abcd$ | abcd$ |
| 5077 | ||
| 5078 | when applied to a long string that does not match. Because matching | when applied to a long string that does not match. Because matching |
| 5079 | proceeds from left to right, PCRE will look for each "a" in the subject | proceeds from left to right, PCRE will look for each "a" in the subject |
| 5080 | and then see if what follows matches the rest of the pattern. If the | and then see if what follows matches the rest of the pattern. If the |
| 5081 | pattern is specified as | pattern is specified as |
| 5082 | ||
| 5083 | ^.*abcd$ | ^.*abcd$ |
| 5084 | ||
| 5085 | the initial .* matches the entire string at first, but when this fails | the initial .* matches the entire string at first, but when this fails |
| 5086 | (because there is no following "a"), it backtracks to match all but the | (because there is no following "a"), it backtracks to match all but the |
| 5087 | last character, then all but the last two characters, and so on. Once | last character, then all but the last two characters, and so on. Once |
| 5088 | again the search for "a" covers the entire string, from right to left, | again the search for "a" covers the entire string, from right to left, |
| 5089 | so we are no better off. However, if the pattern is written as | so we are no better off. However, if the pattern is written as |
| 5090 | ||
| 5091 | ^.*+(?<=abcd) | ^.*+(?<=abcd) |
| 5092 | ||
| 5093 | there can be no backtracking for the .*+ item; it can match only the | there can be no backtracking for the .*+ item; it can match only the |
| 5094 | entire string. The subsequent lookbehind assertion does a single test | entire string. The subsequent lookbehind assertion does a single test |
| 5095 | on the last four characters. If it fails, the match fails immediately. | on the last four characters. If it fails, the match fails immediately. |
| 5096 | For long strings, this approach makes a significant difference to the | For long strings, this approach makes a significant difference to the |
| 5097 | processing time. | processing time. |
| 5098 | ||
| 5099 | Using multiple assertions | Using multiple assertions |
| # | Line 5084 ASSERTIONS | Line 5102 ASSERTIONS |
| 5102 | ||
| 5103 | (?<=\d{3})(?<!999)foo | (?<=\d{3})(?<!999)foo |
| 5104 | ||
| 5105 | matches "foo" preceded by three digits that are not "999". Notice that | matches "foo" preceded by three digits that are not "999". Notice that |
| 5106 | each of the assertions is applied independently at the same point in | each of the assertions is applied independently at the same point in |
| 5107 | the subject string. First there is a check that the previous three | the subject string. First there is a check that the previous three |
| 5108 | characters are all digits, and then there is a check that the same | characters are all digits, and then there is a check that the same |
| 5109 | three characters are not "999". This pattern does not match "foo" pre- | three characters are not "999". This pattern does not match "foo" pre- |
| 5110 | ceded by six characters, the first three of which are digits and the last | ceded by six characters, the first three of which are digits and the last |
| 5111 | three of which are not "999". For example, it doesn't match "123abc- | three of which are not "999". For example, it doesn't match "123abc- |
| 5112 | foo". A pattern to do that is | foo". A pattern to do that is |
| 5113 | ||
| 5114 | (?<=\d{3}...)(?<!999)foo | (?<=\d{3}...)(?<!999)foo |
| 5115 | ||
| 5116 | This time the first assertion looks at the preceding six characters, | This time the first assertion looks at the preceding six characters, |
| 5117 | checking that the first three are digits, and then the second assertion | checking that the first three are digits, and then the second assertion |
| 5118 | checks that the preceding three characters are not "999". | checks that the preceding three characters are not "999". |
| 5119 | ||
| # | Line 5103 ASSERTIONS | Line 5121 ASSERTIONS |
| 5121 | ||
| 5122 | (?<=(?<!foo)bar)baz | (?<=(?<!foo)bar)baz |
| 5123 | ||
| 5124 | matches an occurrence of "baz" that is preceded by "bar" which in turn | matches an occurrence of "baz" that is preceded by "bar" which in turn |
| 5125 | is not preceded by "foo", while | is not preceded by "foo", while |
| 5126 | ||
| 5127 | (?<=\d{3}(?!999)...)foo | (?<=\d{3}(?!999)...)foo |
| 5128 | ||
| 5129 | is another pattern that matches "foo" preceded by three digits and any | is another pattern that matches "foo" preceded by three digits and any |
| 5130 | three characters that are not "999". | three characters that are not "999". |
| 5131 | ||
| 5132 | ||
| 5133 | CONDITIONAL SUBPATTERNS | CONDITIONAL SUBPATTERNS |
| 5134 | ||
| 5135 | It is possible to cause the matching process to obey a subpattern con- | It is possible to cause the matching process to obey a subpattern con- |
| 5136 | ditionally or to choose between two alternative subpatterns, depending | ditionally or to choose between two alternative subpatterns, depending |
| 5137 | on the result of an assertion, or whether a specific capturing subpat- | on the result of an assertion, or whether a specific capturing subpat- |
| 5138 | tern has already been matched. The two possible forms of conditional | tern has already been matched. The two possible forms of conditional |
| 5139 | subpattern are: | subpattern are: |
| 5140 | ||
| 5141 | (?(condition)yes-pattern) | (?(condition)yes-pattern) |
| 5142 | (?(condition)yes-pattern|no-pattern) | (?(condition)yes-pattern|no-pattern) |
| 5143 | ||
| 5144 | If the condition is satisfied, the yes-pattern is used; otherwise the | If the condition is satisfied, the yes-pattern is used; otherwise the |
| 5145 | no-pattern (if present) is used. If there are more than two alterna- | no-pattern (if present) is used. If there are more than two alterna- |
| 5146 | tives in the subpattern, a compile-time error occurs. Each of the two | tives in the subpattern, a compile-time error occurs. Each of the two |
| 5147 | alternatives may itself contain nested subpatterns of any form, includ- | alternatives may itself contain nested subpatterns of any form, includ- |
| 5148 | ing conditional subpatterns; the restriction to two alternatives | ing conditional subpatterns; the restriction to two alternatives |
| 5149 | applies only at the level of the condition. This pattern fragment is an | applies only at the level of the condition. This pattern fragment is an |
| # | Line 5134 CONDITIONAL SUBPATTERNS | Line 5152 CONDITIONAL SUBPATTERNS |
| 5152 | (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) | (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
| 5153 | ||
| 5154 | ||
| 5155 | There are four kinds of condition: references to subpatterns, refer- | There are four kinds of condition: references to subpatterns, refer- |
| 5156 | ences to recursion, a pseudo-condition called DEFINE, and assertions. | ences to recursion, a pseudo-condition called DEFINE, and assertions. |
| 5157 | ||
| 5158 | Checking for a used subpattern by number | Checking for a used subpattern by number |
| 5159 | ||
| 5160 | If the text between the parentheses consists of a sequence of digits, | If the text between the parentheses consists of a sequence of digits, |
| 5161 | the condition is true if a capturing subpattern of that number has pre- | the condition is true if a capturing subpattern of that number has pre- |
| 5162 | viously matched. If there is more than one capturing subpattern with | viously matched. If there is more than one capturing subpattern with |
| 5163 | the same number (see the earlier section about duplicate subpattern | the same number (see the earlier section about duplicate subpattern |
| 5164 | numbers), the condition is true if any of them have matched. An alter- | numbers), the condition is true if any of them have matched. An alter- |
| 5165 | native notation is to precede the digits with a plus or minus sign. In | native notation is to precede the digits with a plus or minus sign. In |
| 5166 | this case, the subpattern number is relative rather than absolute. The | this case, the subpattern number is relative rather than absolute. The |
| 5167 | most recently opened parentheses can be referenced by (?(-1), the next | most recently opened parentheses can be referenced by (?(-1), the next |
| 5168 | most recent by (?(-2), and so on. Inside loops it can also make sense | most recent by (?(-2), and so on. Inside loops it can also make sense |
| 5169 | to refer to subsequent groups. The next parentheses to be opened can be | to refer to subsequent groups. The next parentheses to be opened can be |
| 5170 | referenced as (?(+1), and so on. (The value zero in any of these forms | referenced as (?(+1), and so on. (The value zero in any of these forms |
| 5171 | is not used; it provokes a compile-time error.) | is not used; it provokes a compile-time error.) |
| 5172 | ||
| 5173 | Consider the following pattern, which contains non-significant white | Consider the following pattern, which contains non-significant white |
| 5174 | space to make it more readable (assume the PCRE_EXTENDED option) and to | space to make it more readable (assume the PCRE_EXTENDED option) and to |
| 5175 | divide it into three parts for ease of discussion: | divide it into three parts for ease of discussion: |
| 5176 | ||
| 5177 | ( \( )? [^()]+ (?(1) \) ) | ( \( )? [^()]+ (?(1) \) ) |
| 5178 | ||
| 5179 | The first part matches an optional opening parenthesis, and if that | The first part matches an optional opening parenthesis, and if that |
| 5180 | character is present, sets it as the first captured substring. The sec- | character is present, sets it as the first captured substring. The sec- |
| 5181 | ond part matches one or more characters that are not parentheses. The | ond part matches one or more characters that are not parentheses. The |
| 5182 | third part is a conditional subpattern that tests whether or not the | third part is a conditional subpattern that tests whether or not the |
| 5183 | first set of parentheses matched. If they did, that is, if the subject | first set of parentheses matched. If they did, that is, if the subject |
| 5184 | started with an opening parenthesis, the condition is true, and so the | started with an opening parenthesis, the condition is true, and so the |
| 5185 | yes-pattern is executed and a closing parenthesis is required. Other- | yes-pattern is executed and a closing parenthesis is required. Other- |
| 5186 | wise, since no-pattern is not present, the subpattern matches nothing. | wise, since no-pattern is not present, the subpattern matches nothing. |
| 5187 | In other words, this pattern matches a sequence of non-parentheses, | In other words, this pattern matches a sequence of non-parentheses, |
| 5188 | optionally enclosed in parentheses. | optionally enclosed in parentheses. |
| 5189 | ||
| 5190 | If you were embedding this pattern in a larger one, you could use a | If you were embedding this pattern in a larger one, you could use a |
| 5191 | relative reference: | relative reference: |
| 5192 | ||
| 5193 | ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... | ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| 5194 | ||
| 5195 | This makes the fragment independent of the parentheses in the larger | This makes the fragment independent of the parentheses in the larger |
| 5196 | pattern. | pattern. |
| 5197 | ||
| 5198 | Checking for a used subpattern by name | Checking for a used subpattern by name |
| 5199 | ||
| 5200 | Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a | Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| 5201 | used subpattern by name. For compatibility with earlier versions of | used subpattern by name. For compatibility with earlier versions of |
| 5202 | PCRE, which had this facility before Perl, the syntax (?(name)...) is | PCRE, which had this facility before Perl, the syntax (?(name)...) is |
| 5203 | also recognized. However, there is a possible ambiguity with this syn- | also recognized. However, there is a possible ambiguity with this syn- |
| 5204 | tax, because subpattern names may consist entirely of digits. PCRE | tax, because subpattern names may consist entirely of digits. PCRE |
| 5205 | looks first for a named subpattern; if it cannot find one and the name | looks first for a named subpattern; if it cannot find one and the name |
| 5206 | consists entirely of digits, PCRE looks for a subpattern of that num- | consists entirely of digits, PCRE looks for a subpattern of that num- |
| 5207 | ber, which must be greater than zero. Using subpattern names that con- | ber, which must be greater than zero. Using subpattern names that con- |
| 5208 | sist entirely of digits is not recommended. | sist entirely of digits is not recommended. |
| 5209 | ||
| 5210 | Rewriting the above example to use a named subpattern gives this: | Rewriting the above example to use a named subpattern gives this: |
| 5211 | ||
| 5212 | (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) | (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
| 5213 | ||
| 5214 | If the name used in a condition of this kind is a duplicate, the test | If the name used in a condition of this kind is a duplicate, the test |
| 5215 | is applied to all subpatterns of the same name, and is true if any one | is applied to all subpatterns of the same name, and is true if any one |
| 5216 | of them has matched. | of them has matched. |
| 5217 | ||
| 5218 | Checking for pattern recursion | Checking for pattern recursion |
| 5219 | ||
| 5220 | If the condition is the string (R), and there is no subpattern with the | If the condition is the string (R), and there is no subpattern with the |
| 5221 | name R, the condition is true if a recursive call to the whole pattern | name R, the condition is true if a recursive call to the whole pattern |
| 5222 | or any subpattern has been made. If digits or a name preceded by amper- | or any subpattern has been made. If digits or a name preceded by amper- |
| 5223 | sand follow the letter R, for example: | sand follow the letter R, for example: |
| 5224 | ||
| # | Line 5208 CONDITIONAL SUBPATTERNS | Line 5226 CONDITIONAL SUBPATTERNS |
| 5226 | ||
| 5227 | the condition is true if the most recent recursion is into a subpattern | the condition is true if the most recent recursion is into a subpattern |
| 5228 | whose number or name is given. This condition does not check the entire | whose number or name is given. This condition does not check the entire |
| 5229 | recursion stack. If the name used in a condition of this kind is a | recursion stack. If the name used in a condition of this kind is a |
| 5230 | duplicate, the test is applied to all subpatterns of the same name, and | duplicate, the test is applied to all subpatterns of the same name, and |
| 5231 | is true if any one of them is the most recent recursion. | is true if any one of them is the most recent recursion. |
| 5232 | ||
| 5233 | At "top level", all these recursion test conditions are false. The | At "top level", all these recursion test conditions are false. The |
| 5234 | syntax for recursive patterns is described below. | syntax for recursive patterns is described below. |
| 5235 | ||
| 5236 | Defining subpatterns for use by reference only | Defining subpatterns for use by reference only |
| 5237 | ||
| 5238 | If the condition is the string (DEFINE), and there is no subpattern | If the condition is the string (DEFINE), and there is no subpattern |
| 5239 | with the name DEFINE, the condition is always false. In this case, | with the name DEFINE, the condition is always false. In this case, |
| 5240 | there may be only one alternative in the subpattern. It is always | there may be only one alternative in the subpattern. It is always |
| 5241 | skipped if control reaches this point in the pattern; the idea of | skipped if control reaches this point in the pattern; the idea of |
| 5242 | DEFINE is that it can be used to define "subroutines" that can be ref- | DEFINE is that it can be used to define subroutines that can be refer- |
| 5243 | erenced from elsewhere. (The use of "subroutines" is described below.) | enced from elsewhere. (The use of subroutines is described below.) For |
| 5244 | For example, a pattern to match an IPv4 address such as | example, a pattern to match an IPv4 address such as "192.168.23.245" |
| 5245 | "192.168.23.245" could be written like this (ignore whitespace and line | could be written like this (ignore whitespace and line breaks): |
| breaks): | ||
| 5246 | ||
| 5247 | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| 5248 | \b (?&byte) (\.(?&byte)){3} \b | \b (?&byte) (\.(?&byte)){3} \b |
| # | Line 5312 RECURSIVE PATTERNS | Line 5329 RECURSIVE PATTERNS |
| 5329 | into Perl at release 5.10. | into Perl at release 5.10. |
| 5330 | ||
| 5331 | A special item that consists of (? followed by a number greater than | A special item that consists of (? followed by a number greater than |
| 5332 | zero and a closing parenthesis is a recursive call of the subpattern of | zero and a closing parenthesis is a recursive subroutine call of the |
| 5333 | the given number, provided that it occurs inside that subpattern. (If | subpattern of the given number, provided that it occurs inside that |
| 5334 | not, it is a "subroutine" call, which is described in the next sec- | subpattern. (If not, it is a non-recursive subroutine call, which is |
| 5335 | tion.) The special item (?R) or (?0) is a recursive call of the entire | described in the next section.) The special item (?R) or (?0) is a |
| 5336 | regular expression. | recursive call of the entire regular expression. |
| 5337 | ||
| 5338 | This PCRE pattern solves the nested parentheses problem (assume the | This PCRE pattern solves the nested parentheses problem (assume the |
| 5339 | PCRE_EXTENDED option is set so that white space is ignored): | PCRE_EXTENDED option is set so that white space is ignored): |
| # | Line 5348 RECURSIVE PATTERNS | Line 5365 RECURSIVE PATTERNS |
| 5365 | It is also possible to refer to subsequently opened parentheses, by | It is also possible to refer to subsequently opened parentheses, by |
| 5366 | writing references such as (?+2). However, these cannot be recursive | writing references such as (?+2). However, these cannot be recursive |
| 5367 | because the reference is not inside the parentheses that are refer- | because the reference is not inside the parentheses that are refer- |
| 5368 | enced. They are always "subroutine" calls, as described in the next | enced. They are always non-recursive subroutine calls, as described in |
| 5369 | section. | the next section. |
| 5370 | ||
| 5371 | An alternative approach is to use named parentheses instead. The Perl | An alternative approach is to use named parentheses instead. The Perl |
| 5372 | syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also | syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
| # | Line 5382 RECURSIVE PATTERNS | Line 5399< |