| 81 |
pcreposix the POSIX-compatible C API |
pcreposix the POSIX-compatible C API |
| 82 |
pcreprecompile details of saving and re-using precompiled patterns |
pcreprecompile details of saving and re-using precompiled patterns |
| 83 |
pcresample discussion of the sample program |
pcresample discussion of the sample program |
| 84 |
|
pcrestack discussion of stack usage |
| 85 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
| 86 |
|
|
| 87 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
| 101 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
| 102 |
of execution will be slower. |
of execution will be slower. |
| 103 |
|
|
| 104 |
All values in repeating quantifiers must be less than 65536. The maxi- |
All values in repeating quantifiers must be less than 65536. The maxi- |
| 105 |
mum number of capturing subpatterns is 65535. |
mum compiled length of subpattern with an explicit repeat count is |
| 106 |
|
30000 bytes. The maximum number of capturing subpatterns is 65535. |
| 107 |
|
|
| 108 |
There is no limit to the number of non-capturing subpatterns, but the |
There is no limit to the number of non-capturing subpatterns, but the |
| 109 |
maximum depth of nesting of all kinds of parenthesized subpattern, |
maximum depth of nesting of all kinds of parenthesized subpattern, |
| 110 |
including capturing subpatterns, assertions, and other types of subpat- |
including capturing subpatterns, assertions, and other types of subpat- |
| 111 |
tern, is 200. |
tern, is 200. |
| 112 |
|
|
| 113 |
|
The maximum length of name for a named subpattern is 32, and the maxi- |
| 114 |
|
mum number of named subpatterns is 10000. |
| 115 |
|
|
| 116 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
| 117 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
| 118 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
| 119 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
| 120 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
| 121 |
|
For a discussion of stack issues, see the pcrestack documentation. |
| 122 |
|
|
| 123 |
|
|
| 124 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
| 168 |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
| 169 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
| 170 |
|
|
| 171 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
| 172 |
|
characters for values greater than \177. |
| 173 |
|
|
| 174 |
|
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
| 175 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
| 176 |
|
|
| 177 |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
| 178 |
gle byte. |
gle byte. |
| 179 |
|
|
| 180 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
| 181 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
| 182 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
| 183 |
|
|
| 184 |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| 185 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
| 186 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
| 187 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
| 188 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
| 189 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
| 190 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
| 191 |
\p{Nd}. |
\p{Nd}. |
| 192 |
|
|
| 193 |
7. Similarly, characters that match the POSIX named character classes |
8. Similarly, characters that match the POSIX named character classes |
| 194 |
are all low-valued characters. |
are all low-valued characters. |
| 195 |
|
|
| 196 |
8. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
| 197 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
| 198 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
| 199 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
| 200 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
| 201 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
| 202 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
| 203 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
| 204 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
| 205 |
ported by PCRE. |
ported by PCRE. |
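The byte counts mentioned in the list above (for example, \xb3 and \777 each becoming a two-byte character in UTF-8 mode) follow from the UTF-8 encoding rules. The sketch below illustrates them. utf8_encode() is a hypothetical helper written for this note, not a PCRE function, and it assumes the modern code point range of 0 to 0x10FFFF.

```c
#include <stddef.h>

/* Encode the code point cp as UTF-8 into buf, which must have room for at
   least 4 bytes. Returns the number of bytes written, or 0 if cp is above
   0x10FFFF. Hypothetical helper for illustration; not part of the PCRE API.
   Surrogate values are not rejected, to keep the sketch short. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                 /* 0..127: a single byte */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: two bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 2048..65535: three bytes */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {             /* up to 0x10FFFF: four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this helper, 0xb3 encodes as the two bytes C2 B3 and 0x100 as C4 80, while any value up to octal 177 stays a single byte. A quantifier such as {3} after \x{100} therefore repeats the whole two-byte character, not its last byte.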
| 206 |
|
|
| 207 |
|
|
| 211 |
University Computing Service, |
University Computing Service, |
| 212 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
| 213 |
|
|
| 214 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
| 215 |
so I've taken it away. If you want to email me, use my initial and sur- |
so I've taken it away. If you want to email me, use my initial and sur- |
| 216 |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
| 217 |
|
|
| 218 |
Last updated: 24 January 2006 |
Last updated: 05 June 2006 |
| 219 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 220 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 221 |
|
|
| 290 |
|
|
| 291 |
CODE VALUE OF NEWLINE |
CODE VALUE OF NEWLINE |
| 292 |
|
|
| 293 |
By default, PCRE treats character 10 (linefeed) as the newline charac- |
By default, PCRE interprets character 10 (linefeed, LF) as indicating |
| 294 |
ter. This is the normal newline character on Unix-like systems. You can |
the end of a line. This is the normal newline character on Unix-like |
| 295 |
compile PCRE to use character 13 (carriage return) instead by adding |
systems. You can compile PCRE to use character 13 (carriage return, CR) |
| 296 |
|
instead, by adding |
| 297 |
|
|
| 298 |
--enable-newline-is-cr |
--enable-newline-is-cr |
| 299 |
|
|
| 300 |
to the configure command. For completeness there is also a --enable- |
to the configure command. There is also a --enable-newline-is-lf |
| 301 |
newline-is-lf option, which explicitly specifies linefeed as the new- |
option, which explicitly specifies linefeed as the newline character. |
| 302 |
line character. |
|
| 303 |
|
Alternatively, you can specify that line endings are to be indicated by |
| 304 |
|
the two character sequence CRLF. If you want this, add |
| 305 |
|
|
| 306 |
|
--enable-newline-is-crlf |
| 307 |
|
|
| 308 |
|
to the configure command. Whatever line ending convention is selected |
| 309 |
|
when PCRE is built can be overridden when the library functions are |
| 310 |
|
called. At build time it is conventional to use the standard for your |
| 311 |
|
operating system. |
| 312 |
|
|
| 313 |
|
|
| 314 |
BUILDING SHARED AND STATIC LIBRARIES |
BUILDING SHARED AND STATIC LIBRARIES |
| 339 |
to the configure command. |
to the configure command. |
| 340 |
|
|
| 341 |
|
|
|
LIMITING PCRE RESOURCE USAGE |
|
|
|
|
|
Internally, PCRE has a function called match(), which it calls repeat- |
|
|
edly (possibly recursively) when matching a pattern with the |
|
|
pcre_exec() function. By controlling the maximum number of times this |
|
|
function may be called during a single matching operation, a limit can |
|
|
be placed on the resources used by a single call to pcre_exec(). The |
|
|
limit can be changed at run time, as described in the pcreapi documen- |
|
|
tation. The default is 10 million, but this can be changed by adding a |
|
|
setting such as |
|
|
|
|
|
--with-match-limit=500000 |
|
|
|
|
|
to the configure command. This setting has no effect on the |
|
|
pcre_dfa_exec() matching function. |
|
|
|
|
|
|
|
| 342 |
HANDLING VERY LARGE PATTERNS |
HANDLING VERY LARGE PATTERNS |
| 343 |
|
|
| 344 |
Within a compiled pattern, offset values are used to point from one |
Within a compiled pattern, offset values are used to point from one |
| 368 |
ing by making recursive calls to an internal function called match(). |
ing by making recursive calls to an internal function called match(). |
| 369 |
In environments where the size of the stack is limited, this can se- |
In environments where the size of the stack is limited, this can se- |
| 370 |
verely limit PCRE's operation. (The Unix environment does not usually |
verely limit PCRE's operation. (The Unix environment does not usually |
| 371 |
suffer from this problem.) An alternative approach that uses memory |
suffer from this problem, but it may sometimes be necessary to increase |
| 372 |
from the heap to remember data, instead of using recursive function |
the maximum stack size. There is a discussion in the pcrestack docu- |
| 373 |
calls, has been implemented to work round this problem. If you want to |
mentation.) An alternative approach to recursion that uses memory from |
| 374 |
build a version of PCRE that works this way, add |
the heap to remember data, instead of using recursive function calls, |
| 375 |
|
has been implemented to work round the problem of limited stack size. |
| 376 |
|
If you want to build a version of PCRE that works this way, add |
| 377 |
|
|
| 378 |
--disable-stack-for-recursion |
--disable-stack-for-recursion |
| 379 |
|
|
| 388 |
function; it is not relevant for the pcre_dfa_exec() function. |
function; it is not relevant for the pcre_dfa_exec() function. |
| 389 |
|
|
| 390 |
|
|
| 391 |
|
LIMITING PCRE RESOURCE USAGE |
| 392 |
|
|
| 393 |
|
Internally, PCRE has a function called match(), which it calls repeat- |
| 394 |
|
edly (sometimes recursively) when matching a pattern with the |
| 395 |
|
pcre_exec() function. By controlling the maximum number of times this |
| 396 |
|
function may be called during a single matching operation, a limit can |
| 397 |
|
be placed on the resources used by a single call to pcre_exec(). The |
| 398 |
|
limit can be changed at run time, as described in the pcreapi documen- |
| 399 |
|
tation. The default is 10 million, but this can be changed by adding a |
| 400 |
|
setting such as |
| 401 |
|
|
| 402 |
|
--with-match-limit=500000 |
| 403 |
|
|
| 404 |
|
to the configure command. This setting has no effect on the |
| 405 |
|
pcre_dfa_exec() matching function. |
| 406 |
|
|
| 407 |
|
In some environments it is desirable to limit the depth of recursive |
| 408 |
|
calls of match() more strictly than the total number of calls, in order |
| 409 |
|
to restrict the maximum amount of stack (or heap, if --disable-stack- |
| 410 |
|
for-recursion is specified) that is used. A second limit controls this; |
| 411 |
|
it defaults to the value that is set for --with-match-limit, which |
| 412 |
|
imposes no additional constraints. However, you can set a lower limit |
| 413 |
|
by adding, for example, |
| 414 |
|
|
| 415 |
|
--with-match-limit-recursion=10000 |
| 416 |
|
|
| 417 |
|
to the configure command. This value can also be overridden at run |
| 418 |
|
time. |
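The difference between the two limits can be illustrated with a toy backtracking matcher. This is only a sketch: toy_match() and toy_exec() are hypothetical names, the matcher supports just literals, '.', and 'c*', and it is in no way PCRE's match() function. It counts every call (the analogue of --with-match-limit) and tracks the current nesting depth (the analogue of --with-match-limit-recursion).

```c
#include <stddef.h>

/* Result codes for the toy matcher (invented for this sketch). */
enum { MATCH_FAIL = 0, MATCH_OK = 1, ERR_CALL_LIMIT = -1, ERR_DEPTH_LIMIT = -2 };

typedef struct {
    unsigned long calls;        /* calls to toy_match() made so far */
    unsigned long call_limit;   /* cap on the total number of calls */
    unsigned long depth_limit;  /* cap on the depth of recursion */
} toy_limits;

/* Match the whole of text against pat. Supports literal characters, '.'
   (any character), and a trailing '*' (zero or more, greedy). */
static int toy_match(const char *pat, const char *text,
                     unsigned long depth, toy_limits *lim)
{
    if (++lim->calls > lim->call_limit) return ERR_CALL_LIMIT;
    if (depth > lim->depth_limit) return ERR_DEPTH_LIMIT;

    if (*pat == '\0')
        return *text == '\0' ? MATCH_OK : MATCH_FAIL;

    if (pat[1] == '*') {
        /* Greedy: first try to consume one more character ... */
        if (*text != '\0' && (*pat == '.' || *pat == *text)) {
            int rc = toy_match(pat, text + 1, depth + 1, lim);
            if (rc != MATCH_FAIL) return rc;  /* a match, or a limit error */
        }
        /* ... then backtrack and try the rest of the pattern here. */
        return toy_match(pat + 2, text, depth + 1, lim);
    }

    if (*text != '\0' && (*pat == '.' || *pat == *text))
        return toy_match(pat + 1, text + 1, depth + 1, lim);

    return MATCH_FAIL;
}

/* Entry point: run one matching operation under the two limits. */
int toy_exec(const char *pat, const char *text,
             unsigned long call_limit, unsigned long depth_limit)
{
    toy_limits lim = { 0, call_limit, depth_limit };
    return toy_match(pat, text, 1, &lim);
}
```

In this sketch, a pattern such as a*a*a*c applied to a long run of "a" characters backtracks heavily and hits the call limit while the recursion stays shallow. A low depth limit, by contrast, stops even a simple match on a long subject, which is the case that matters when stack (or heap) space is the scarce resource.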
| 419 |
|
|
| 420 |
|
|
| 421 |
USING EBCDIC CODE |
USING EBCDIC CODE |
| 422 |
|
|
| 423 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
| 424 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
| 425 |
PCRE can, however, be compiled to run in an EBCDIC environment by |
PCRE can, however, be compiled to run in an EBCDIC environment by |
| 426 |
adding |
adding |
| 427 |
|
|
| 428 |
--enable-ebcdic |
--enable-ebcdic |
| 429 |
|
|
| 430 |
to the configure command. |
to the configure command. |
| 431 |
|
|
| 432 |
Last updated: 15 August 2005 |
Last updated: 06 June 2006 |
| 433 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 434 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 435 |
|
|
| 436 |
|
|
| 475 |
resented as a tree structure. An unlimited repetition in the pattern |
resented as a tree structure. An unlimited repetition in the pattern |
| 476 |
makes the tree of infinite size, but it is still a tree. Matching the |
makes the tree of infinite size, but it is still a tree. Matching the |
| 477 |
pattern to a given subject string (from a given starting point) can be |
pattern to a given subject string (from a given starting point) can be |
| 478 |
thought of as a search of the tree. There are two standard ways to |
thought of as a search of the tree. There are two ways to search a |
| 479 |
search a tree: depth-first and breadth-first, and these correspond to |
tree: depth-first and breadth-first, and these correspond to the two |
| 480 |
the two matching algorithms provided by PCRE. |
matching algorithms provided by PCRE. |
| 481 |
|
|
| 482 |
|
|
| 483 |
THE STANDARD MATCHING ALGORITHM |
THE STANDARD MATCHING ALGORITHM |
| 597 |
but does not provide the advantage that it does for the standard algo- |
but does not provide the advantage that it does for the standard algo- |
| 598 |
rithm. |
rithm. |
| 599 |
|
|
| 600 |
Last updated: 28 February 2005 |
Last updated: 06 June 2006 |
| 601 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 602 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 603 |
|
|
| 604 |
|
|
| 651 |
int pcre_get_stringnumber(const pcre *code, |
int pcre_get_stringnumber(const pcre *code, |
| 652 |
const char *name); |
const char *name); |
| 653 |
|
|
| 654 |
|
int pcre_get_stringtable_entries(const pcre *code, |
| 655 |
|
const char *name, char **first, char **last); |
| 656 |
|
|
| 657 |
int pcre_get_substring(const char *subject, int *ovector, |
int pcre_get_substring(const char *subject, int *ovector, |
| 658 |
int stringcount, int stringnumber, |
int stringcount, int stringnumber, |
| 659 |
const char **stringptr); |
const char **stringptr); |
| 714 |
|
|
| 715 |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
| 716 |
ble, is also provided. This uses a different algorithm for the match- |
ble, is also provided. This uses a different algorithm for the match- |
| 717 |
ing. This allows it to find all possible matches (at a given point in |
ing. The alternative algorithm finds all possible matches (at a given |
| 718 |
the subject), not just one. However, this algorithm does not return |
point in the subject). However, this algorithm does not return captured |
| 719 |
captured substrings. A description of the two matching algorithms and |
substrings. A description of the two matching algorithms and their |
| 720 |
their advantages and disadvantages is given in the pcrematching docu- |
advantages and disadvantages is given in the pcrematching documenta- |
| 721 |
mentation. |
tion. |
| 722 |
|
|
| 723 |
In addition to the main compiling and matching functions, there are |
In addition to the main compiling and matching functions, there are |
| 724 |
convenience functions for extracting captured substrings from a subject |
convenience functions for extracting captured substrings from a subject |
| 730 |
pcre_get_named_substring() |
pcre_get_named_substring() |
| 731 |
pcre_get_substring_list() |
pcre_get_substring_list() |
| 732 |
pcre_get_stringnumber() |
pcre_get_stringnumber() |
| 733 |
|
pcre_get_stringtable_entries() |
| 734 |
|
|
| 735 |
pcre_free_substring() and pcre_free_substring_list() are also provided, |
pcre_free_substring() and pcre_free_substring_list() are also provided, |
| 736 |
to free the memory used for extracted strings. |
to free the memory used for extracted strings. |
| 762 |
indirections to memory management functions. These special functions |
indirections to memory management functions. These special functions |
| 763 |
are used only when PCRE is compiled to use the heap for remembering |
are used only when PCRE is compiled to use the heap for remembering |
| 764 |
data, instead of recursive function calls, when running the pcre_exec() |
data, instead of recursive function calls, when running the pcre_exec() |
| 765 |
function. This is a non-standard way of building PCRE, for use in envi- |
function. See the pcrebuild documentation for details of how to do |
| 766 |
ronments that have limited stacks. Because of the greater use of memory |
this. It is a non-standard way of building PCRE, for use in environ- |
| 767 |
management, it runs more slowly. Separate functions are provided so |
ments that have limited stacks. Because of the greater use of memory |
| 768 |
that special-purpose external code can be used for this case. When |
management, it runs more slowly. Separate functions are provided so |
| 769 |
used, these functions are always called in a stack-like manner (last |
that special-purpose external code can be used for this case. When |
| 770 |
obtained, first freed), and always for memory blocks of the same size. |
used, these functions are always called in a stack-like manner (last |
| 771 |
|
obtained, first freed), and always for memory blocks of the same size. |
| 772 |
|
There is a discussion about PCRE's stack usage in the pcrestack docu- |
| 773 |
|
mentation. |
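Because the blocks are requested in a stack-like manner and are all the same size, special-purpose code can keep freed blocks for reuse instead of returning them to the system each time. The following is a minimal sketch of that idea. frame_alloc() and frame_free() are hypothetical names; their signatures are assumed to match what pcre_stack_malloc and pcre_stack_free point to (taking a size_t, and taking a void pointer). The sketch is not thread-safe.

```c
#include <stdlib.h>

/* A freed block is reused as a node in a singly linked free list. */
typedef struct free_node {
    struct free_node *next;
} free_node;

static free_node *free_list = NULL;  /* most recently freed block first */
static size_t block_size = 0;        /* size fixed by the first request */

/* Allocate one block. Assumes every request has the same size; a request
   of a different size is refused by returning NULL. */
void *frame_alloc(size_t size)
{
    if (block_size == 0)
        block_size = size;           /* first request fixes the block size */
    else if (size != block_size)
        return NULL;                 /* same-size assumption violated */

    if (free_list != NULL) {         /* reuse the last block that was freed */
        free_node *node = free_list;
        free_list = node->next;
        return node;
    }

    /* Nothing cached: get fresh memory, large enough to hold a list node. */
    return malloc(size < sizeof(free_node) ? sizeof(free_node) : size);
}

/* Release one block by pushing it onto the free list for later reuse. */
void frame_free(void *block)
{
    free_node *node = block;
    if (node == NULL) return;
    node->next = free_list;
    free_list = node;
}
```

In a build made with --disable-stack-for-recursion, functions like these would presumably be assigned to pcre_stack_malloc and pcre_stack_free before pcre_exec() is called. Cached blocks are never returned to the system here; a real implementation would want a way to release them.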
| 774 |
|
|
| 775 |
The global variable pcre_callout initially contains NULL. It can be set |
The global variable pcre_callout initially contains NULL. It can be set |
| 776 |
by the caller to a "callout" function, which PCRE will then call at |
by the caller to a "callout" function, which PCRE will then call at |
| 778 |
pcrecallout documentation. |
pcrecallout documentation. |
| 779 |
|
|
| 780 |
|
|
| 781 |
|
NEWLINES |
| 782 |
|
PCRE supports three different conventions for indicating line breaks in |
| 783 |
|
strings: a single CR character, a single LF character, or the two-char- |
| 784 |
|
acter sequence CRLF. All three are used as "standard" by different |
| 785 |
|
operating systems. When PCRE is built, a default can be specified. The |
| 786 |
|
default default is LF, which is the Unix standard. When PCRE is run, |
| 787 |
|
the default can be overridden, either when a pattern is compiled, or |
| 788 |
|
when it is matched. |
| 789 |
|
|
| 790 |
|
In the PCRE documentation the word "newline" is used to mean "the char- |
| 791 |
|
acter or pair of characters that indicate a line break". |
| 792 |
|
|
| 793 |
|
|
| 794 |
MULTITHREADING |
MULTITHREADING |
| 795 |
|
|
| 796 |
The PCRE functions can be used in multi-threading applications, with |
The PCRE functions can be used in multi-threading applications, with |
| 797 |
the proviso that the memory management functions pointed to by |
the proviso that the memory management functions pointed to by |
| 798 |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
| 799 |
callout function pointed to by pcre_callout, are shared by all threads. |
callout function pointed to by pcre_callout, are shared by all threads. |
| 800 |
|
|
| 801 |
The compiled form of a regular expression is not altered during match- |
The compiled form of a regular expression is not altered during match- |
| 802 |
ing, so the same compiled pattern can safely be used by several threads |
ing, so the same compiled pattern can safely be used by several threads |
| 803 |
at once. |
at once. |
| 804 |
|
|
| 806 |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
| 807 |
|
|
| 808 |
The compiled form of a regular expression can be saved and re-used at a |
The compiled form of a regular expression can be saved and re-used at a |
| 809 |
later time, possibly by a different program, and even on a host other |
later time, possibly by a different program, and even on a host other |
| 810 |
than the one on which it was compiled. Details are given in the |
than the one on which it was compiled. Details are given in the |
| 811 |
pcreprecompile documentation. |
pcreprecompile documentation. |
| 812 |
|
|
| 813 |
|
|
| 815 |
|
|
| 816 |
int pcre_config(int what, void *where); |
int pcre_config(int what, void *where); |
| 817 |
|
|
| 818 |
The function pcre_config() makes it possible for a PCRE client to dis- |
The function pcre_config() makes it possible for a PCRE client to dis- |
| 819 |
cover which optional features have been compiled into the PCRE library. |
cover which optional features have been compiled into the PCRE library. |
| 820 |
The pcrebuild documentation has more details about these optional fea- |
The pcrebuild documentation has more details about these optional fea- |
| 821 |
tures. |
tures. |
| 822 |
|
|
| 823 |
The first argument for pcre_config() is an integer, specifying which |
The first argument for pcre_config() is an integer, specifying which |
| 824 |
information is required; the second argument is a pointer to a variable |
information is required; the second argument is a pointer to a variable |
| 825 |
into which the information is placed. The following information is |
into which the information is placed. The following information is |
| 826 |
available: |
available: |
| 827 |
|
|
| 828 |
PCRE_CONFIG_UTF8 |
PCRE_CONFIG_UTF8 |
| 829 |
|
|
| 830 |
The output is an integer that is set to one if UTF-8 support is avail- |
The output is an integer that is set to one if UTF-8 support is avail- |
| 831 |
able; otherwise it is set to zero. |
able; otherwise it is set to zero. |
| 832 |
|
|
| 833 |
PCRE_CONFIG_UNICODE_PROPERTIES |
PCRE_CONFIG_UNICODE_PROPERTIES |
| 834 |
|
|
| 835 |
The output is an integer that is set to one if support for Unicode |
The output is an integer that is set to one if support for Unicode |
| 836 |
character properties is available; otherwise it is set to zero. |
character properties is available; otherwise it is set to zero. |
| 837 |
|
|
| 838 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
| 839 |
|
|
| 840 |
The output is an integer that is set to the value of the code that is |
The output is an integer whose value specifies the default character |
| 841 |
used for the newline character. It is either linefeed (10) or carriage |
sequence that is recognized as meaning "newline". The three values that |
| 842 |
return (13), and should normally be the standard character for your |
are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default |
| 843 |
operating system. |
should normally be the standard sequence for your operating system. |
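The three values can be decoded with simple arithmetic, since 3338 equals 13*256 + 10, that is, CR in the high byte and LF in the low byte. The sketch below relies on that observation. Treating it as a general packing rule is an assumption made for illustration, and newline_decode() is a hypothetical helper, not part of the PCRE API.

```c
/* Decode a newline code as reported for PCRE_CONFIG_NEWLINE (10, 13, or
   3338) into the character sequence it stands for. Stores the characters
   in seq and returns how many there are (1 or 2). Assumes that values
   above 255 pack two characters as (first << 8) | second. */
int newline_decode(int code, unsigned char seq[2])
{
    if (code > 255) {
        seq[0] = (unsigned char)((code >> 8) & 0xFF);  /* 13 (CR) for 3338 */
        seq[1] = (unsigned char)(code & 0xFF);         /* 10 (LF) for 3338 */
        return 2;
    }
    seq[0] = (unsigned char)code;                      /* 10 (LF) or 13 (CR) */
    seq[1] = 0;
    return 1;
}
```

A caller would obtain the code by passing PCRE_CONFIG_NEWLINE and a pointer to an int to pcre_config(), then hand the result to a helper such as this one.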
| 844 |
|
|
| 845 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
| 846 |
|
|
| 847 |
The output is an integer that contains the number of bytes used for |
The output is an integer that contains the number of bytes used for |
| 848 |
internal linkage in compiled regular expressions. The value is 2, 3, or |
internal linkage in compiled regular expressions. The value is 2, 3, or |
| 849 |
4. Larger values allow larger regular expressions to be compiled, at |
4. Larger values allow larger regular expressions to be compiled, at |
| 850 |
the expense of slower matching. The default value of 2 is sufficient |
the expense of slower matching. The default value of 2 is sufficient |
| 851 |
for all but the most massive patterns, since it allows the compiled |
for all but the most massive patterns, since it allows the compiled |
| 852 |
pattern to be up to 64K in size. |
pattern to be up to 64K in size. |
| 853 |
|
|
| 854 |
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
| 855 |
|
|
| 856 |
The output is an integer that contains the threshold above which the |
The output is an integer that contains the threshold above which the |
| 857 |
POSIX interface uses malloc() for output vectors. Further details are |
POSIX interface uses malloc() for output vectors. Further details are |
| 858 |
given in the pcreposix documentation. |
given in the pcreposix documentation. |
| 859 |
|
|
| 860 |
PCRE_CONFIG_MATCH_LIMIT |
PCRE_CONFIG_MATCH_LIMIT |
| 861 |
|
|
| 862 |
The output is an integer that gives the default limit for the number of |
The output is an integer that gives the default limit for the number of |
| 863 |
internal matching function calls in a pcre_exec() execution. Further |
internal matching function calls in a pcre_exec() execution. Further |
| 864 |
details are given with pcre_exec() below. |
details are given with pcre_exec() below. |
| 865 |
|
|
| 866 |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
| 867 |
|
|
| 868 |
The output is an integer that gives the default limit for the depth of |
The output is an integer that gives the default limit for the depth of |
| 869 |
recursion when calling the internal matching function in a pcre_exec() |
recursion when calling the internal matching function in a pcre_exec() |
| 870 |
execution. Further details are given with pcre_exec() below. |
execution. Further details are given with pcre_exec() below. |
| 871 |
|
|
| 872 |
PCRE_CONFIG_STACKRECURSE |
PCRE_CONFIG_STACKRECURSE |
| 873 |
|
|
| 874 |
The output is an integer that is set to one if internal recursion when |
The output is an integer that is set to one if internal recursion when |
| 875 |
running pcre_exec() is implemented by recursive function calls that use |
running pcre_exec() is implemented by recursive function calls that use |
| 876 |
the stack to remember their state. This is the usual way that PCRE is |
the stack to remember their state. This is the usual way that PCRE is |
| 877 |
compiled. The output is zero if PCRE was compiled to use blocks of data |
compiled. The output is zero if PCRE was compiled to use blocks of data |
| 878 |
on the heap instead of recursive function calls. In this case, |
on the heap instead of recursive function calls. In this case, |
| 879 |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
| 880 |
blocks on the heap, thus avoiding the use of the stack. |
blocks on the heap, thus avoiding the use of the stack. |
| 881 |
|
|
| 882 |
|
|
| 893 |
|
|
| 894 |
Either of the functions pcre_compile() or pcre_compile2() can be called |
Either of the functions pcre_compile() or pcre_compile2() can be called |
| 895 |
to compile a pattern into an internal form. The only difference between |
to compile a pattern into an internal form. The only difference between |
| 896 |
the two interfaces is that pcre_compile2() has an additional argument, |
the two interfaces is that pcre_compile2() has an additional argument, |
| 897 |
errorcodeptr, via which a numerical error code can be returned. |
errorcodeptr, via which a numerical error code can be returned. |
| 898 |
|
|
| 899 |
The pattern is a C string terminated by a binary zero, and is passed in |
The pattern is a C string terminated by a binary zero, and is passed in |
| 900 |
the pattern argument. A pointer to a single block of memory that is |
the pattern argument. A pointer to a single block of memory that is |
| 901 |
obtained via pcre_malloc is returned. This contains the compiled code |
obtained via pcre_malloc is returned. This contains the compiled code |
| 902 |
and related data. The pcre type is defined for the returned block; this |
and related data. The pcre type is defined for the returned block; this |
| 903 |
is a typedef for a structure whose contents are not externally defined. |
is a typedef for a structure whose contents are not externally defined. |
| 904 |
It is up to the caller to free the memory when it is no longer |
It is up to the caller to free the memory (via pcre_free) when it is no |
| 905 |
required. |
longer required. |
| 906 |
|
|
| 907 |
Although the compiled code of a PCRE regex is relocatable, that is, it |
Although the compiled code of a PCRE regex is relocatable, that is, it |
| 908 |
does not depend on memory location, the complete pcre data block is not |
does not depend on memory location, the complete pcre data block is not |
| 909 |
fully relocatable, because it may contain a copy of the tableptr argu- |
fully relocatable, because it may contain a copy of the tableptr argu- |
| 910 |
ment, which is an address (see below). |
ment, which is an address (see below). |
| 911 |
|
|
| 912 |
The options argument contains independent bits that affect the compila- |
The options argument contains independent bits that affect the compila- |
| 913 |
tion. It should be zero if no options are required. The available |
tion. It should be zero if no options are required. The available |
| 914 |
options are described below. Some of them, in particular, those that |
options are described below. Some of them, in particular, those that |
| 915 |
are compatible with Perl, can also be set and unset from within the |
are compatible with Perl, can also be set and unset from within the |
| 916 |
pattern (see the detailed description in the pcrepattern documenta- |
pattern (see the detailed description in the pcrepattern documenta- |
| 917 |
tion). For these options, the contents of the options argument spec- |
tion). For these options, the contents of the options argument spec- |
| 918 |
ify their initial settings at the start of compilation and execution. |
ify their initial settings at the start of compilation and execution. |
| 919 |
The PCRE_ANCHORED option can be set at the time of matching as well as |
The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time |
| 920 |
at compile time. |
of matching as well as at compile time. |
| 921 |
|
|
| 922 |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
| 923 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
| 924 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
| 925 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
| 926 |
try to free it. The offset from the start of the pattern to the charac- |
try to free it. The offset from the start of the pattern to the charac- |
| 927 |
ter where the error was discovered is placed in the variable pointed to |
ter where the error was discovered is placed in the variable pointed to |
| 928 |
by erroffset, which must not be NULL. If it is, an immediate error is |
by erroffset, which must not be NULL. If it is, an immediate error is |
| 929 |
given. |
given. |
| 930 |
|
|
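The error message and offset described above are usually shown to the user together. The following is a minimal sketch of one way to do that; format_compile_error() is a hypothetical helper, not part of PCRE, and it assumes that the offset is a byte offset into a pattern that contains no newlines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Writes the pattern, then a
   second line with a caret under the byte at erroffset, followed by
   the message text. Returns the value returned by snprintf(). */
int format_compile_error(char *out, size_t outsize,
                         const char *pattern, const char *message,
                         int erroffset)
{
    size_t len, off;

    assert(out != NULL && pattern != NULL && message != NULL);
    len = strlen(pattern);
    off = (erroffset < 0) ? 0 : (size_t)erroffset;
    if (off > len) off = len;          /* clamp to the end of the pattern */

    /* "%*s" with an empty string emits exactly 'off' spaces. */
    return snprintf(out, outsize, "%s\n%*s^ %s (offset %d)",
                    pattern, (int)off, "", message, erroffset);
}
```

A caller would pass the pattern it gave to pcre_compile() together with the values that were stored through errptr and erroffset.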
| 931 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
| 932 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
| 933 |
via this argument in the event of an error. This is in addition to the |
via this argument in the event of an error. This is in addition to the |
| 934 |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
| 935 |
|
|
| 936 |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
| 937 |
character tables that are built when PCRE is compiled, using the |
character tables that are built when PCRE is compiled, using the |
| 938 |
default C locale. Otherwise, tableptr must be an address that is the |
default C locale. Otherwise, tableptr must be an address that is the |
| 939 |
result of a call to pcre_maketables(). This value is stored with the |
result of a call to pcre_maketables(). This value is stored with the |
| 940 |
compiled pattern, and used again by pcre_exec(), unless another table |
compiled pattern, and used again by pcre_exec(), unless another table |
| 941 |
pointer is passed to it. For more discussion, see the section on locale |
pointer is passed to it. For more discussion, see the section on locale |
| 942 |
support below. |
support below. |
| 943 |
|
|
| 944 |
This code fragment shows a typical straightforward call to pcre_com- |
This code fragment shows a typical straightforward call to pcre_com- |
| 945 |
pile(): |
pile(): |
| 946 |
|
|
| 947 |
pcre *re; |
pcre *re; |
| 954 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
| 955 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
| 956 |
|
|
| 957 |
The following names for option bits are defined in the pcre.h header |
The following names for option bits are defined in the pcre.h header |
| 958 |
file: |
file: |
| 959 |
|
|
| 960 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 961 |
|
|
| 962 |
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
| 963 |
is constrained to match only at the first matching point in the string |
is constrained to match only at the first matching point in the string |
| 964 |
that is being searched (the "subject string"). This effect can also be |
that is being searched (the "subject string"). This effect can also be |
| 965 |
achieved by appropriate constructs in the pattern itself, which is the |
achieved by appropriate constructs in the pattern itself, which is the |
| 966 |
only way to do it in Perl. |
only way to do it in Perl. |
| 967 |
|
|
| 968 |
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
| 969 |
|
|
| 970 |
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
| 971 |
all with number 255, before each pattern item. For discussion of the |
all with number 255, before each pattern item. For discussion of the |
| 972 |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
| 973 |
|
|
| 974 |
PCRE_CASELESS |
PCRE_CASELESS |
| 975 |
|
|
| 976 |
If this bit is set, letters in the pattern match both upper and lower |
If this bit is set, letters in the pattern match both upper and lower |
| 977 |
case letters. It is equivalent to Perl's /i option, and it can be |
case letters. It is equivalent to Perl's /i option, and it can be |
| 978 |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
| 979 |
always understands the concept of case for characters whose values are |
always understands the concept of case for characters whose values are |
| 980 |
less than 128, so caseless matching is always possible. For characters |
less than 128, so caseless matching is always possible. For characters |
| 981 |
with higher values, the concept of case is supported if PCRE is com- |
with higher values, the concept of case is supported if PCRE is com- |
| 982 |
piled with Unicode property support, but not otherwise. If you want to |
piled with Unicode property support, but not otherwise. If you want to |
| 983 |
use caseless matching for characters 128 and above, you must ensure |
use caseless matching for characters 128 and above, you must ensure |
| 984 |
that PCRE is compiled with Unicode property support as well as with |
that PCRE is compiled with Unicode property support as well as with |
| 985 |
UTF-8 support. |
UTF-8 support. |
| 986 |
|
|
| 987 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
| 988 |
|
|
| 989 |
If this bit is set, a dollar metacharacter in the pattern matches only |
If this bit is set, a dollar metacharacter in the pattern matches only |
| 990 |
at the end of the subject string. Without this option, a dollar also |
at the end of the subject string. Without this option, a dollar also |
| 991 |
matches immediately before the final character if it is a newline (but |
matches immediately before a newline at the end of the string (but not |
| 992 |
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
| 993 |
ignored if PCRE_MULTILINE is set. There is no equivalent to this option |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
| 994 |
in Perl, and no way to set it within a pattern. |
Perl, and no way to set it within a pattern. |
| 995 |
|
|
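The rule for where a dollar can match may be easier to see as code. The following is a small model of the description above, not PCRE's implementation. It assumes that PCRE_MULTILINE is not set and that newline is the single character LF; dollar_can_match() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Model only. Returns 1 if a dollar may match at byte offset pos in a
   subject of length len, and 0 otherwise. dollar_endonly is non-zero
   when PCRE_DOLLAR_ENDONLY is assumed to be set. */
int dollar_can_match(const char *subject, size_t len, size_t pos,
                     int dollar_endonly)
{
    assert(subject != NULL && pos <= len);

    if (pos == len)
        return 1;                      /* the very end always qualifies */

    /* Without the option, also just before a final newline. */
    if (!dollar_endonly && pos + 1 == len && subject[pos] == '\n')
        return 1;

    return 0;
}
```

For the subject "abc\n", offset 3 qualifies only when the option is unset, whereas offset 4 qualifies in both cases.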
| 996 |
PCRE_DOTALL |
PCRE_DOTALL |
| 997 |
|
|
| 998 |
If this bit is set, a dot metacharacter in the pattern matches all char- |
If this bit is set, a dot metacharacter in the pattern matches all char- |
| 999 |
acters, including newlines. Without it, newlines are excluded. This |
acters, including those that indicate newline. Without it, a dot does |
| 1000 |
option is equivalent to Perl's /s option, and it can be changed within |
not match when the current position is at a newline. This option is |
| 1001 |
a pattern by a (?s) option setting. A negative class such as [^a] |
equivalent to Perl's /s option, and it can be changed within a pattern |
| 1002 |
always matches a newline character, independent of the setting of this |
by a (?s) option setting. A negative class such as [^a] always matches |
| 1003 |
option. |
newlines, independent of the setting of this option. |
| 1004 |
|
|
| 1005 |
|
PCRE_DUPNAMES |
| 1006 |
|
|
| 1007 |
|
If this bit is set, names used to identify capturing subpatterns need |
| 1008 |
|
not be unique. This can be helpful for certain types of pattern when it |
| 1009 |
|
is known that only one instance of the named subpattern can ever be |
| 1010 |
|
matched. There are more details of named subpatterns below; see also |
| 1011 |
|
the pcrepattern documentation. |
| 1012 |
|
|
| 1013 |
PCRE_EXTENDED |
PCRE_EXTENDED |
| 1014 |
|
|
| 1015 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, whitespace data characters in the pattern are |
| 1016 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White- |
| 1017 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
| 1018 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
| 1019 |
line character, inclusive, are also ignored. This is equivalent to |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
| 1020 |
Perl's /x option, and it can be changed within a pattern by a (?x) |
option, and it can be changed within a pattern by a (?x) option set- |
| 1021 |
option setting. |
ting. |
| 1022 |
|
|
| 1023 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
| 1024 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
| 1025 |
Whitespace characters may never appear within special character |
Whitespace characters may never appear within special character |
| 1026 |
sequences in a pattern, for example within the sequence (?( which |
sequences in a pattern, for example within the sequence (?( which |
| 1027 |
introduces a conditional subpattern. |
introduces a conditional subpattern. |
| 1028 |
|
|
| 1029 |
PCRE_EXTRA |
PCRE_EXTRA |
| 1030 |
|
|
| 1031 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
| 1032 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
| 1033 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
| 1034 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
| 1035 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
| 1036 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
| 1037 |
literal. There are at present no other features controlled by this |
literal. (Perl can, however, be persuaded to give a warning for this.) |
| 1038 |
option. It can also be set by a (?X) option setting within a pattern. |
There are at present no other features controlled by this option. It |
| 1039 |
|
can also be set by a (?X) option setting within a pattern. |
| 1040 |
|
|
| 1041 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
| 1042 |
|
|
| 1043 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
| 1044 |
before or at the first newline character in the subject string, though |
before or at the first newline in the subject string, though the |
| 1045 |
the matched text may continue over the newline. |
matched text may continue over the newline. |
| 1046 |
|
|
| 1047 |
PCRE_MULTILINE |
PCRE_MULTILINE |
| 1048 |
|
|
| 1054 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
| 1055 |
|
|
| 1056 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
| 1057 |
constructs match immediately following or immediately before any new- |
constructs match immediately following or immediately before internal |
| 1058 |
line in the subject string, respectively, as well as at the very start |
newlines in the subject string, respectively, as well as at the very |
| 1059 |
and end. This is equivalent to Perl's /m option, and it can be changed |
start and end. This is equivalent to Perl's /m option, and it can be |
| 1060 |
within a pattern by a (?m) option setting. If there are no "\n" charac- |
changed within a pattern by a (?m) option setting. If there are no new- |
| 1061 |
ters in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
| 1062 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
| 1063 |
|
|
| 1064 |
|
PCRE_NEWLINE_CR |
| 1065 |
|
PCRE_NEWLINE_LF |
| 1066 |
|
PCRE_NEWLINE_CRLF |
| 1067 |
|
|
| 1068 |
|
These options override the default newline definition that was chosen |
| 1069 |
|
when PCRE was built. Setting the first or the second specifies that a |
| 1070 |
|
newline is indicated by a single character (CR or LF, respectively). |
| 1071 |
|
Setting both of them specifies that a newline is indicated by the two- |
| 1072 |
|
character CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined |
| 1073 |
|
to contain both bits. The only time that a line break is relevant when |
| 1074 |
|
compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out- |
| 1075 |
|
side a character class is encountered. This indicates a comment that |
| 1076 |
|
lasts until after the next newline. |
| 1077 |
|
|
| 1078 |
|
The newline option set at compile time becomes the default that is used |
| 1079 |
|
for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
| 1080 |
|
|
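The way the two newline bits combine, and the way a setting given at match time overrides the compiled default, can be sketched as follows. The SKETCH_ names and their bit values are illustrative assumptions; real programs should use the PCRE_NEWLINE_xxx names from pcre.h. Neither function is part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; use the names in pcre.h in real code. */
#define SKETCH_NEWLINE_CR    0x00100000
#define SKETCH_NEWLINE_LF    0x00200000
#define SKETCH_NEWLINE_CRLF  (SKETCH_NEWLINE_CR | SKETCH_NEWLINE_LF)

/* Returns the newline bits that apply to a match: the bits given at
   match time if any are set, otherwise those given at compile time.
   A result of zero stands for the default chosen when PCRE was built. */
int sketch_effective_newline(int compile_options, int exec_options)
{
    int chosen = exec_options & SKETCH_NEWLINE_CRLF;
    if (chosen == 0)
        chosen = compile_options & SKETCH_NEWLINE_CRLF;
    assert((chosen & ~SKETCH_NEWLINE_CRLF) == 0);
    return chosen;
}

/* Maps newline bits to the character sequence they denote, or NULL
   when no bits are set (the build-time default applies). */
const char *sketch_newline_text(int bits)
{
    switch (bits & SKETCH_NEWLINE_CRLF) {
    case SKETCH_NEWLINE_CR:   return "\r";
    case SKETCH_NEWLINE_LF:   return "\n";
    case SKETCH_NEWLINE_CRLF: return "\r\n";
    default:                  return NULL;
    }
}
```

Because the CRLF value is the union of the other two, setting both single-character bits is indistinguishable from setting the CRLF option.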
| 1081 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
| 1082 |
|
|
| 1083 |
If this option is set, it disables the use of numbered capturing paren- |
If this option is set, it disables the use of numbered capturing paren- |
| 1084 |
theses in the pattern. Any opening parenthesis that is not followed by |
theses in the pattern. Any opening parenthesis that is not followed by |
| 1085 |
? behaves as if it were followed by ?: but named parentheses can still |
? behaves as if it were followed by ?: but named parentheses can still |
| 1086 |
be used for capturing (and they acquire numbers in the usual way). |
be used for capturing (and they acquire numbers in the usual way). |
| 1087 |
There is no equivalent of this option in Perl. |
There is no equivalent of this option in Perl. |
| 1088 |
|
|
| 1089 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
| 1090 |
|
|
| 1091 |
This option inverts the "greediness" of the quantifiers so that they |
This option inverts the "greediness" of the quantifiers so that they |
| 1092 |
are not greedy by default, but become greedy if followed by "?". It is |
are not greedy by default, but become greedy if followed by "?". It is |
| 1093 |
not compatible with Perl. It can also be set by a (?U) option setting |
not compatible with Perl. It can also be set by a (?U) option setting |
| 1094 |
within the pattern. |
within the pattern. |
| 1095 |
|
|
| 1096 |
PCRE_UTF8 |
PCRE_UTF8 |
| 1097 |
|
|
| 1098 |
This option causes PCRE to regard both the pattern and the subject as |
This option causes PCRE to regard both the pattern and the subject as |
| 1099 |
strings of UTF-8 characters instead of single-byte character strings. |
strings of UTF-8 characters instead of single-byte character strings. |
| 1100 |
However, it is available only when PCRE is built to include UTF-8 sup- |
However, it is available only when PCRE is built to include UTF-8 sup- |
| 1101 |
port. If not, the use of this option provokes an error. Details of how |
port. If not, the use of this option provokes an error. Details of how |
| 1102 |
this option changes the behaviour of PCRE are given in the section on |
this option changes the behaviour of PCRE are given in the section on |
| 1103 |
UTF-8 support in the main pcre page. |
UTF-8 support in the main pcre page. |
| 1104 |
|
|
| 1105 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
| 1106 |
|
|
| 1107 |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
| 1108 |
automatically checked. If an invalid UTF-8 sequence of bytes is found, |
automatically checked. If an invalid UTF-8 sequence of bytes is found, |
| 1109 |
pcre_compile() returns an error. If you already know that your pattern |
pcre_compile() returns an error. If you already know that your pattern |
| 1110 |
is valid, and you want to skip this check for performance reasons, you |
is valid, and you want to skip this check for performance reasons, you |
| 1111 |
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
| 1112 |
passing an invalid UTF-8 string as a pattern is undefined. It may cause |
passing an invalid UTF-8 string as a pattern is undefined. It may cause |
| 1113 |
your program to crash. Note that this option can also be passed to |
your program to crash. Note that this option can also be passed to |
| 1114 |
pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check- |
pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check- |
| 1115 |
ing of subject strings. |
ing of subject strings. |
| 1116 |
|
|
| 1117 |
|
|
| 1118 |
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
| 1119 |
|
|
| 1120 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
| 1121 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
| 1122 |
both compiling functions. |
both compiling functions. |
| 1123 |
|
|
| 1124 |
0 no error |
0 no error |
| 1147 |
23 internal error: code overflow |
23 internal error: code overflow |
| 1148 |
24 unrecognized character after (?< |
24 unrecognized character after (?< |
| 1149 |
25 lookbehind assertion is not fixed length |
25 lookbehind assertion is not fixed length |
| 1150 |
26 malformed number after (?( |
26 malformed number or name after (?( |
| 1151 |
27 conditional group contains more than two branches |
27 conditional group contains more than two branches |
| 1152 |
28 assertion expected after (?( |
28 assertion expected after (?( |
| 1153 |
29 (?R or (?digits must be followed by ) |
29 (?R or (?digits must be followed by ) |
| 1164 |
40 recursive call could loop indefinitely |
40 recursive call could loop indefinitely |
| 1165 |
41 unrecognized character after (?P |
41 unrecognized character after (?P |
| 1166 |
42 syntax error after (?P |
42 syntax error after (?P |
| 1167 |
43 two named groups have the same name |
43 two named subpatterns have the same name |
| 1168 |
44 invalid UTF-8 string |
44 invalid UTF-8 string |
| 1169 |
45 support for \P, \p, and \X has not been compiled |
45 support for \P, \p, and \X has not been compiled |
| 1170 |
46 malformed \P or \p sequence |
46 malformed \P or \p sequence |
| 1171 |
47 unknown property name after \P or \p |
47 unknown property name after \P or \p |
| 1172 |
|
48 subpattern name is too long (maximum 32 characters) |
| 1173 |
|
49 too many named subpatterns (maximum 10,000) |
| 1174 |
|
50 repeated subpattern is too long |
| 1175 |
|
51 octal value is greater than \377 (not in UTF-8 mode) |
| 1176 |
|
|
| 1177 |
|
|
| 1178 |
STUDYING A PATTERN |
STUDYING A PATTERN |
| 1180 |
pcre_extra *pcre_study(const pcre *code, int options, |
pcre_extra *pcre_study(const pcre *code, int options, |
| 1181 |
const char **errptr); |
const char **errptr); |
| 1182 |
|
|
| 1183 |
If a compiled pattern is going to be used several times, it is worth |
If a compiled pattern is going to be used several times, it is worth |
| 1184 |
spending more time analyzing it in order to speed up the time taken for |
spending more time analyzing it in order to speed up the time taken for |
| 1185 |
matching. The function pcre_study() takes a pointer to a compiled pat- |
matching. The function pcre_study() takes a pointer to a compiled pat- |
| 1186 |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
| 1187 |
information that will help speed up matching, pcre_study() returns a |
information that will help speed up matching, pcre_study() returns a |
| 1188 |
pointer to a pcre_extra block, in which the study_data field points to |
pointer to a pcre_extra block, in which the study_data field points to |
| 1189 |
the results of the study. |
the results of the study. |
| 1190 |
|
|
| 1191 |
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
| 1192 |
pcre_exec(). However, a pcre_extra block also contains other fields |
pcre_exec(). However, a pcre_extra block also contains other fields |
| 1193 |
that can be set by the caller before the block is passed; these are |
that can be set by the caller before the block is passed; these are |
| 1194 |
described below in the section on matching a pattern. |
described below in the section on matching a pattern. |
| 1195 |
|
|
| 1196 |
If studying the pattern does not produce any additional information |
If studying the pattern does not produce any additional information |
| 1197 |
pcre_study() returns NULL. In that circumstance, if the calling program |
pcre_study() returns NULL. In that circumstance, if the calling program |
| 1198 |
wants to pass any of the other fields to pcre_exec(), it must set up |
wants to pass any of the other fields to pcre_exec(), it must set up |
| 1199 |
its own pcre_extra block. |
its own pcre_extra block. |
| 1200 |
|
|
| 1201 |
The second argument of pcre_study() contains option bits. At present, |
The second argument of pcre_study() contains option bits. At present, |
| 1202 |
no options are defined, and this argument should always be zero. |
no options are defined, and this argument should always be zero. |
| 1203 |
|
|
| 1204 |
The third argument for pcre_study() is a pointer for an error message. |
The third argument for pcre_study() is a pointer for an error message. |
| 1205 |
If studying succeeds (even if no data is returned), the variable it |
If studying succeeds (even if no data is returned), the variable it |
| 1206 |
points to is set to NULL. Otherwise it is set to point to a textual |
points to is set to NULL. Otherwise it is set to point to a textual |
| 1207 |
error message. This is a static string that is part of the library. You |
error message. This is a static string that is part of the library. You |
| 1208 |
must not try to free it. You should test the error pointer for NULL |
must not try to free it. You should test the error pointer for NULL |
| 1209 |
after calling pcre_study(), to be sure that it has run successfully. |
after calling pcre_study(), to be sure that it has run successfully. |
| 1210 |
|
|
| 1211 |
This is a typical call to pcre_study(): |
This is a typical call to pcre_study(): |
| 1217 |
&error); /* set to NULL or points to a message */ |
&error); /* set to NULL or points to a message */ |
| 1218 |
|
|
| 1219 |
At present, studying a pattern is useful only for non-anchored patterns |
At present, studying a pattern is useful only for non-anchored patterns |
| 1220 |
that do not have a single fixed starting character. A bitmap of possi- |
that do not have a single fixed starting character. A bitmap of possi- |
| 1221 |
ble starting bytes is created. |
ble starting bytes is created. |
| 1222 |
|
|
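The idea of a bitmap of possible starting bytes can be illustrated with a short sketch. This is not PCRE's internal code; the type and function names are hypothetical. It shows how a 256-bit map lets a matcher skip subject positions at which no match could begin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit for each possible byte value (256 bits). */
typedef struct { unsigned char bits[32]; } start_bitmap;

void bitmap_clear(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

void bitmap_add(start_bitmap *bm, unsigned char c)
{
    bm->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

int bitmap_has(const start_bitmap *bm, unsigned char c)
{
    return (bm->bits[c >> 3] >> (c & 7)) & 1;
}

/* Returns the offset of the first byte at or after 'from' whose value
   is in the bitmap, or len if there is no such byte. */
size_t first_candidate(const start_bitmap *bm, const char *subject,
                       size_t len, size_t from)
{
    assert(bm != NULL && subject != NULL && from <= len);
    while (from < len && !bitmap_has(bm, (unsigned char)subject[from]))
        from++;
    return from;
}
```

For a pattern such as cat|dog, a study step might add 'c' and 'd' to the bitmap, so that a full match attempt is made only at positions that hold one of those two bytes.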
| 1223 |
|
|
| 1224 |
LOCALE SUPPORT |
LOCALE SUPPORT |
| 1225 |
|
|
| 1226 |
PCRE handles caseless matching, and determines whether characters are |
PCRE handles caseless matching, and determines whether characters are |
| 1227 |
letters, digits, or whatever, by reference to a set of tables, indexed |
letters, digits, or whatever, by reference to a set of tables, indexed |
| 1228 |
by character value. When running in UTF-8 mode, this applies only to |
by character value. When running in UTF-8 mode, this applies only to |
| 1229 |
characters with codes less than 128. Higher-valued codes never match |
characters with codes less than 128. Higher-valued codes never match |
| 1230 |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
| 1231 |
with Unicode character property support. The use of locales with Uni- |
with Unicode character property support. The use of locales with Uni- |
| 1232 |
code is discouraged. |
code is discouraged. |
| 1233 |
|
|
| 1234 |
An internal set of tables is created in the default C locale when PCRE |
An internal set of tables is created in the default C locale when PCRE |
| 1235 |
is built. This is used when the final argument of pcre_compile() is |
is built. This is used when the final argument of pcre_compile() is |
| 1236 |
NULL, and is sufficient for many applications. An alternative set of |
NULL, and is sufficient for many applications. An alternative set of |
| 1237 |
tables can, however, be supplied. These may be created in a different |
tables can, however, be supplied. These may be created in a different |
| 1238 |
locale from the default. As more and more applications change to using |
locale from the default. As more and more applications change to using |
| 1239 |
Unicode, the need for this locale support is expected to die away. |
Unicode, the need for this locale support is expected to die away. |
| 1240 |
|
|
| 1241 |
External tables are built by calling the pcre_maketables() function, |
External tables are built by calling the pcre_maketables() function, |
| 1242 |
which has no arguments, in the relevant locale. The result can then be |
which has no arguments, in the relevant locale. The result can then be |
| 1243 |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
| 1244 |
example, to build and use tables that are appropriate for the French |
example, to build and use tables that are appropriate for the French |
| 1245 |
locale (where accented characters with values greater than 128 are |
locale (where accented characters with values greater than 128 are |
| 1246 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
| 1247 |
|
|
| 1248 |
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
| 1249 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
| 1250 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
| 1251 |
|
|
| 1252 |
When pcre_maketables() runs, the tables are built in memory that is |
When pcre_maketables() runs, the tables are built in memory that is |
| 1253 |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
| 1254 |
that the memory containing the tables remains available for as long as |
that the memory containing the tables remains available for as long as |
| 1255 |
it is needed. |
it is needed. |
| 1256 |
|
|
| 1257 |
The pointer that is passed to pcre_compile() is saved with the compiled |
The pointer that is passed to pcre_compile() is saved with the compiled |
| 1258 |
pattern, and the same tables are used via this pointer by pcre_study() |
pattern, and the same tables are used via this pointer by pcre_study() |
| 1259 |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
| 1260 |
tern, compilation, studying and matching all happen in the same locale, |
tern, compilation, studying and matching all happen in the same locale, |
| 1261 |
but different patterns can be compiled in different locales. |
but different patterns can be compiled in different locales. |
| 1262 |
|
|
| 1263 |
It is possible to pass a table pointer or NULL (indicating the use of |
It is possible to pass a table pointer or NULL (indicating the use of |
| 1264 |
the internal tables) to pcre_exec(). Although not intended for this |
the internal tables) to pcre_exec(). Although not intended for this |
| 1265 |
purpose, this facility could be used to match a pattern in a different |
purpose, this facility could be used to match a pattern in a different |
| 1266 |
locale from the one in which it was compiled. Passing table pointers at |
locale from the one in which it was compiled. Passing table pointers at |
| 1267 |
run time is discussed below in the section on matching a pattern. |
run time is discussed below in the section on matching a pattern. |
| 1268 |
|
|
| 1272 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
| 1273 |
int what, void *where); |
int what, void *where); |
| 1274 |
|
|
| 1275 |
The pcre_fullinfo() function returns information about a compiled pat- |
The pcre_fullinfo() function returns information about a compiled pat- |
| 1276 |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
| 1277 |
less retained for backwards compatibility (and is documented below). |
less retained for backwards compatibility (and is documented below). |
| 1278 |
|
|
| 1279 |
The first argument for pcre_fullinfo() is a pointer to the compiled |
The first argument for pcre_fullinfo() is a pointer to the compiled |
| 1280 |
pattern. The second argument is the result of pcre_study(), or NULL if |
pattern. The second argument is the result of pcre_study(), or NULL if |
| 1281 |
the pattern was not studied. The third argument specifies which piece |
the pattern was not studied. The third argument specifies which piece |
| 1282 |
of information is required, and the fourth argument is a pointer to a |
of information is required, and the fourth argument is a pointer to a |
| 1283 |
variable to receive the data. The yield of the function is zero for |
variable to receive the data. The yield of the function is zero for |
| 1284 |
success, or one of the following negative numbers: |
success, or one of the following negative numbers: |
| 1285 |
|
|
| 1286 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
| 1288 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
| 1289 |
PCRE_ERROR_BADOPTION the value of what was invalid |
PCRE_ERROR_BADOPTION the value of what was invalid |
| 1290 |
|
|
| 1291 |
The "magic number" is placed at the start of each compiled pattern as |
The "magic number" is placed at the start of each compiled pattern as |
| 1292 |
a simple check against passing an arbitrary memory pointer. Here is a |
a simple check against passing an arbitrary memory pointer. Here is a |
| 1293 |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
| 1294 |
pattern: |
pattern: |
| 1295 |
|
|
| 1296 |
int rc; |
int rc; |
| 1297 |
unsigned long int length; |
size_t length; |
| 1298 |
rc = pcre_fullinfo( |
rc = pcre_fullinfo( |
| 1299 |
re, /* result of pcre_compile() */ |
re, /* result of pcre_compile() */ |
| 1300 |
pe, /* result of pcre_study(), or NULL */ |
pe, /* result of pcre_study(), or NULL */ |
| 1301 |
PCRE_INFO_SIZE, /* what is required */ |
PCRE_INFO_SIZE, /* what is required */ |
| 1302 |
&length); /* where to put the data */ |
&length); /* where to put the data */ |
| 1303 |
|
|
| 1304 |
The possible values for the third argument are defined in pcre.h, and |
The possible values for the third argument are defined in pcre.h, and |
| 1305 |
are as follows: |
are as follows: |
| 1306 |
|
|
| 1307 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_BACKREFMAX |
| 1308 |
|
|
| 1309 |
Return the number of the highest back reference in the pattern. The |
Return the number of the highest back reference in the pattern. The |
| 1310 |
fourth argument should point to an int variable. Zero is returned if |
fourth argument should point to an int variable. Zero is returned if |
| 1311 |
there are no back references. |
there are no back references. |
| 1312 |
|
|
| 1313 |
PCRE_INFO_CAPTURECOUNT |
PCRE_INFO_CAPTURECOUNT |
| 1314 |
|
|
| 1315 |
Return the number of capturing subpatterns in the pattern. The fourth |
Return the number of capturing subpatterns in the pattern. The fourth |
| 1316 |
argument should point to an int variable. |
argument should point to an int variable. |
| 1317 |
|
|
| 1318 |
PCRE_INFO_DEFAULT_TABLES |
PCRE_INFO_DEFAULT_TABLES |
| 1319 |
|
|
| 1320 |
Return a pointer to the internal default character tables within PCRE. |
Return a pointer to the internal default character tables within PCRE. |
| 1321 |
The fourth argument should point to an unsigned char * variable. This |
The fourth argument should point to an unsigned char * variable. This |
| 1322 |
information call is provided for internal use by the pcre_study() func- |
information call is provided for internal use by the pcre_study() func- |
| 1323 |
tion. External callers can cause PCRE to use its internal tables by |
tion. External callers can cause PCRE to use its internal tables by |
| 1324 |
passing a NULL table pointer. |
passing a NULL table pointer. |
| 1325 |
|
|
| 1326 |
PCRE_INFO_FIRSTBYTE |
PCRE_INFO_FIRSTBYTE |
| 1327 |
|
|
| 1328 |
Return information about the first byte of any matched string, for a |
Return information about the first byte of any matched string, for a |
| 1329 |
non-anchored pattern. (This option used to be called |
non-anchored pattern. The fourth argument should point to an int vari- |
| 1330 |
PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards |
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
| 1331 |
compatibility.) |
is still recognized for backwards compatibility.) |
| 1332 |
|
|
| 1333 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
| 1334 |
(cat|cow|coyote), it is returned in the integer pointed to by where. |
(cat|cow|coyote). Otherwise, if either |
|
Otherwise, if either |
|
| 1335 |
|
|
| 1336 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
| 1337 |
branch starts with "^", or |
branch starts with "^", or |
| 1367 |
|
|
| 1368 |
PCRE supports the use of named as well as numbered capturing parenthe- |
PCRE supports the use of named as well as numbered capturing parenthe- |
| 1369 |
ses. The names are just an additional way of identifying the parenthe- |
ses. The names are just an additional way of identifying the parenthe- |
| 1370 |
ses, which still acquire numbers. A convenience function called |
ses, which still acquire numbers. Several convenience functions such as |
| 1371 |
pcre_get_named_substring() is provided for extracting an individual |
pcre_get_named_substring() are provided for extracting captured sub- |
| 1372 |
captured substring by name. It is also possible to extract the data |
strings by name. It is also possible to extract the data directly, by |
| 1373 |
directly, by first converting the name to a number in order to access |
first converting the name to a number in order to access the correct |
| 1374 |
the correct pointers in the output vector (described with pcre_exec() |
pointers in the output vector (described with pcre_exec() below). To do |
| 1375 |
below). To do the conversion, you need to use the name-to-number map, |
the conversion, you need to use the name-to-number map, which is |
| 1376 |
which is described by these three values. |
described by these three values. |
| 1377 |
|
|
| 1378 |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
| 1379 |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
| 1383 |
first two bytes of each entry are the number of the capturing parenthe- |
first two bytes of each entry are the number of the capturing parenthe- |
| 1384 |
sis, most significant byte first. The rest of the entry is the corre- |
sis, most significant byte first. The rest of the entry is the corre- |
| 1385 |
sponding name, zero terminated. The names are in alphabetical order. |
sponding name, zero terminated. The names are in alphabetical order. |
| 1386 |
For example, consider the following pattern (assume PCRE_EXTENDED is |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
| 1387 |
set, so white space - including newlines - is ignored): |
theses numbers. For example, consider the following pattern (assume |
| 1388 |
|
PCRE_EXTENDED is set, so white space - including newlines - is |
| 1389 |
|
ignored): |
| 1390 |
|
|
| 1391 |
(?P<date> (?P<year>(\d\d)?\d\d) - |
(?P<date> (?P<year>(\d\d)?\d\d) - |
| 1392 |
(?P<month>\d\d) - (?P<day>\d\d) ) |
(?P<month>\d\d) - (?P<day>\d\d) ) |
| 1402 |
00 02 y e a r 00 ?? |
00 02 y e a r 00 ?? |
| 1403 |
|
|
| 1404 |
When writing code to extract data from named subpatterns using the |
When writing code to extract data from named subpatterns using the |
| 1405 |
name-to-number map, remember that the length of each entry is likely to |
name-to-number map, remember that the length of the entries is likely |
| 1406 |
be different for each compiled pattern. |
to be different for each compiled pattern. |
| 1407 |
|
|
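The entry layout described above (a two-byte number, most significant byte first, followed by a zero-terminated name) can be walked by hand. The following is only a sketch under stated assumptions: the function name lookup_name() is hypothetical and not part of PCRE, the table pointer, entry count and entry size are assumed to have been obtained beforehand via pcre_fullinfo(), and the return value of -1 for "not found" is this sketch's own convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): search a name-to-number
 * table for "name".
 *   table      - start of the table, as obtained from pcre_fullinfo()
 *   name_count - number of entries (PCRE_INFO_NAMECOUNT)
 *   entry_size - size of each entry in bytes (PCRE_INFO_NAMEENTRYSIZE)
 * Returns the capturing parenthesis number, or -1 if the name is absent
 * (the -1 is this sketch's convention, not a PCRE error code). */
int lookup_name(const unsigned char *table, int name_count,
                int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry_size. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* The zero-terminated name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0) {
            /* Number is stored most significant byte first. */
            return (entry[0] << 8) | entry[1];
        }
    }
    return -1;
}
```

A linear scan is used for simplicity. Because the names are in alphabetical order, a binary search would also be possible for large tables.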
| 1408 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
| 1409 |
|
|
| 1602 |
Option bits for pcre_exec() |
Option bits for pcre_exec() |
| 1603 |
|
|
| 1604 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
| 1605 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
| 1606 |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and |
| 1607 |
|
PCRE_PARTIAL. |
| 1608 |
|
|
| 1609 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1610 |
|
|
| 1611 |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
| 1612 |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
| 1613 |
turned out to be anchored by virtue of its contents, it cannot be made |
turned out to be anchored by virtue of its contents, it cannot be made |
| 1614 |
unanchored at matching time. |
unanchored at matching time. |
| 1615 |
|
|
| 1616 |
|
PCRE_NEWLINE_CR |
| 1617 |
|
PCRE_NEWLINE_LF |
| 1618 |
|
PCRE_NEWLINE_CRLF |
| 1619 |
|
|
| 1620 |
|
These options override the newline definition that was chosen or |
| 1621 |
|
defaulted when the pattern was compiled. For details, see the descrip- |
| 1622 |
|
tion of pcre_compile() above. During matching, the newline choice affects |
| 1623 |
|
the behaviour of the dot, circumflex, and dollar metacharacters. |
| 1624 |
|
|
| 1625 |
PCRE_NOTBOL |
PCRE_NOTBOL |
| 1626 |
|
|
| 1627 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
| 1757 |
after the end of a substring. The first pair, ovector[0] and ovec- |
after the end of a substring. The first pair, ovector[0] and ovec- |
| 1758 |
tor[1], identify the portion of the subject string matched by the |
tor[1], identify the portion of the subject string matched by the |
| 1759 |
entire pattern. The next pair is used for the first capturing subpat- |
entire pattern. The next pair is used for the first capturing subpat- |
| 1760 |
tern, and so on. The value returned by pcre_exec() is the number of |
tern, and so on. The value returned by pcre_exec() is one more than the |
| 1761 |
pairs that have been set. If there are no capturing subpatterns, the |
highest numbered pair that has been set. For example, if two substrings |
| 1762 |
return value from a successful match is 1, indicating that just the |
have been captured, the returned value is 3. If there are no capturing |
| 1763 |
first pair of offsets has been set. |
subpatterns, the return value from a successful match is 1, indicating |
| 1764 |
|
that just the first pair of offsets has been set. |
|
Some convenience functions are provided for extracting the captured |
|
|
substrings as separate strings. These are described in the following |
|
|
section. |
|
|
|
|
|
It is possible for a capturing subpattern number n+1 to match some |
|
|
part of the subject when subpattern n has not been used at all. For |
|
|
example, if the string "abc" is matched against the pattern (a|(z))(bc) |
|
|
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both |
|
|
offset values corresponding to the unused subpattern are set to -1. |
|
| 1765 |
|
|
| 1766 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
| 1767 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
| 1768 |
|
|
| 1769 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
| 1770 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
| 1771 |
function returns a value of zero. In particular, if the substring off- |
function returns a value of zero. In particular, if the substring off- |
| 1772 |
sets are not of interest, pcre_exec() may be called with ovector passed |
sets are not of interest, pcre_exec() may be called with ovector passed |
| 1773 |
as NULL and ovecsize as zero. However, if the pattern contains back |
as NULL and ovecsize as zero. However, if the pattern contains back |
| 1774 |
references and the ovector is not big enough to remember the related |
references and the ovector is not big enough to remember the related |
| 1775 |
substrings, PCRE has to get additional memory for use during matching. |
substrings, PCRE has to get additional memory for use during matching. |
| 1776 |
Thus it is usually advisable to supply an ovector. |
Thus it is usually advisable to supply an ovector. |
| 1777 |
|
|
| 1778 |
Note that pcre_info() can be used to find out how many capturing sub- |
The pcre_info() function can be used to find out how many capturing |
| 1779 |
patterns there are in a compiled pattern. The smallest size for ovector |
subpatterns there are in a compiled pattern. The smallest size for |
| 1780 |
that will allow for n captured substrings, in addition to the offsets |
ovector that will allow for n captured substrings, in addition to the |
| 1781 |
of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
| 1782 |
|
|
| 1783 |
|
It is possible for capturing subpattern number n+1 to match some part |
| 1784 |
|
of the subject when subpattern n has not been used at all. For example, |
| 1785 |
|
if the string "abc" is matched against the pattern (a|(z))(bc) the |
| 1786 |
|
return from the function is 4, and subpatterns 1 and 3 are matched, but |
| 1787 |
|
2 is not. When this happens, both values in the offset pairs corre- |
| 1788 |
|
sponding to unused subpatterns are set to -1. |
| 1789 |
|
|
| 1790 |
|
Offset values that correspond to unused subpatterns at the end of the |
| 1791 |
|
expression are also set to -1. For example, if the string "abc" is |
| 1792 |
|
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
| 1793 |
|
matched. The return from the function is 2, because the highest used |
| 1794 |
|
capturing subpattern number is 1. However, you can refer to the offsets |
| 1795 |
|
for the second and third capturing subpatterns if you wish (assuming |
| 1796 |
|
the vector is large enough, of course). |
| 1797 |
|
|
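The offset rules above can be illustrated without calling PCRE at all. This is a minimal sketch: the helper names min_ovector_size() and substring_length() are hypothetical, and the ovector contents are assumed to have been filled in by a successful pcre_exec() call.

```c
/* Hypothetical helpers (not PCRE functions) that apply the rules
 * described in the text to an offsets vector. */

/* Smallest ovector size that allows for n captured substrings in
 * addition to the whole match: (n+1)*3, as stated above. */
int min_ovector_size(int n)
{
    return (n + 1) * 3;
}

/* Length of substring number n (0 is the whole match), or -1 when the
 * corresponding subpattern was not used, in which case both of its
 * offsets are -1. The -1 result is this sketch's own convention. */
int substring_length(const int *ovector, int n)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];

    if (start < 0 || end < 0)
        return -1;              /* unset subpattern */
    return end - start;
}
```

For the example in the text, "abc" matched against (a|(z))(bc), the first eight elements of ovector would be 0, 3, 0, 1, -1, -1, 1, 3, so substring_length() would give 3, 1, -1 and 2 for substrings 0 to 3.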
| 1798 |
|
Some convenience functions are provided for extracting the captured |
| 1799 |
|
substrings as separate strings. These are described below. |
| 1800 |
|
|
| 1801 |
Return values from pcre_exec() |
Error return values from pcre_exec() |
| 1802 |
|
|
| 1803 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
| 1804 |
defined in the header file: |
defined in the header file: |
| 1805 |
|
|
| 1806 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
| 1809 |
|
|
| 1810 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
| 1811 |
|
|
| 1812 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
| 1813 |
ovecsize was not zero. |
ovecsize was not zero. |
| 1814 |
|
|
| 1815 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
| 1818 |
|
|
| 1819 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
| 1820 |
|
|
| 1821 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
| 1822 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
| 1823 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
| 1824 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
| 1825 |
gives when the magic number is not present. |
gives when the magic number is not present. |
| 1826 |
|
|
| 1827 |
PCRE_ERROR_UNKNOWN_NODE (-5) |
PCRE_ERROR_UNKNOWN_NODE (-5) |
| 1828 |
|
|
| 1829 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
| 1830 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
| 1831 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
| 1832 |
|
|
| 1833 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 1834 |
|
|
| 1835 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
| 1836 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
| 1837 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
| 1838 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
| 1839 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
| 1840 |
|
|
| 1841 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 1842 |
|
|
| 1843 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
| 1844 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
| 1845 |
returned by pcre_exec(). |
returned by pcre_exec(). |
| 1846 |
|
|
| 1847 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
| 1848 |
|
|
| 1849 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
| 1850 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
| 1851 |
above. |
above. |
| 1852 |
|
|
| 1853 |
PCRE_ERROR_RECURSIONLIMIT (-21) |
PCRE_ERROR_RECURSIONLIMIT (-21) |
| 1854 |
|
|
| 1855 |
The internal recursion limit, as specified by the match_limit_recursion |
The internal recursion limit, as specified by the match_limit_recursion |
| 1856 |
field in a pcre_extra structure (or defaulted) was reached. See the |
field in a pcre_extra structure (or defaulted) was reached. See the |
| 1857 |
description above. |
description above. |
| 1858 |
|
|
| 1859 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
| 1860 |
|
|
| 1861 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
| 1862 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
| 1863 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
| 1864 |
|
|
| 1865 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
| 1866 |
|
|
| 1867 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
| 1868 |
subject. |
subject. |
| 1869 |
|
|
| 1870 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
| 1871 |
|
|
| 1872 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
| 1873 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
| 1874 |
ter. |
ter. |
| 1875 |
|
|
| 1876 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
| 1877 |
|
|
| 1878 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
| 1879 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
| 1880 |
|
|
| 1881 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
| 1882 |
|
|
| 1883 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
| 1884 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
| 1885 |
documentation for details of partial matching. |
documentation for details of partial matching. |
| 1886 |
|
|
| 1887 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
| 1888 |
|
|
| 1889 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
| 1890 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
| 1891 |
|
|
| 1892 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
| 1893 |
|
|
| 1894 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
| 1895 |
|
|
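When reporting failures it can be convenient to turn a negative return value back into its symbolic name. The sketch below does this with the numeric values listed above. The function name exec_error_name() is hypothetical; real code would normally include pcre.h and use the PCRE_ERROR_xxx macros rather than literal numbers, which are repeated here only so that the fragment is self-contained.

```c
/* Hypothetical helper (not a PCRE function): map a pcre_exec() return
 * value to the name used in this documentation. The numbers are copied
 * from the list above. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_NODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    default:
        /* Zero or positive values indicate a match, not an error. */
        return rc >= 0 ? "(not an error)" : "(unknown error)";
    }
}
```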
| 1896 |
|
|
| 1897 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| 1907 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
| 1908 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
| 1909 |
|
|
| 1910 |
Captured substrings can be accessed directly by using the offsets |
Captured substrings can be accessed directly by using the offsets |
| 1911 |
returned by pcre_exec() in ovector. For convenience, the functions |
returned by pcre_exec() in ovector. For convenience, the functions |
| 1912 |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
| 1913 |
string_list() are provided for extracting captured substrings as new, |
string_list() are provided for extracting captured substrings as new, |
| 1914 |
separate, zero-terminated strings. These functions identify substrings |
separate, zero-terminated strings. These functions identify substrings |
| 1915 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
| 1916 |
substrings. A substring that contains a binary zero is correctly |
substrings. |
| 1917 |
extracted and has a further zero added on the end, but the result is |
|
| 1918 |
not, of course, a C string. |
A substring that contains a binary zero is correctly extracted and has |
| 1919 |
|
a further zero added on the end, but the result is not, of course, a C |
| 1920 |
|
string. However, you can process such a string by referring to the |
| 1921 |
|
length that is returned by pcre_copy_substring() and pcre_get_sub- |
| 1922 |
|
string(). Unfortunately, the interface to pcre_get_substring_list() is |
| 1923 |
|
not adequate for handling strings containing binary zeros, because the |
| 1924 |
|
end of the final string is not independently indicated. |
| 1925 |
|
|
| 1926 |
The first three arguments are the same for all three of these func- |
The first three arguments are the same for all three of these func- |
| 1927 |
tions: subject is the subject string that has just been successfully |
tions: subject is the subject string that has just been successfully |
| 1928 |
matched, ovector is a pointer to the vector of integer offsets that was |
matched, ovector is a pointer to the vector of integer offsets that was |
| 1929 |
passed to pcre_exec(), and stringcount is the number of substrings that |
passed to pcre_exec(), and stringcount is the number of substrings that |
| 1930 |
were captured by the match, including the substring that matched the |
were captured by the match, including the substring that matched the |
| 1931 |
entire regular expression. This is the value returned by pcre_exec() if |
entire regular expression. This is the value returned by pcre_exec() if |
| 1932 |
it is greater than zero. If pcre_exec() returned zero, indicating that |
it is greater than zero. If pcre_exec() returned zero, indicating that |
| 1933 |
it ran out of space in ovector, the value passed as stringcount should |
it ran out of space in ovector, the value passed as stringcount should |
| 1934 |
be the number of elements in the vector divided by three. |
be the number of elements in the vector divided by three. |
| 1935 |
|
|
| 1936 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
| 1937 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
| 1938 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
| 1939 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
| 1940 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
| 1941 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
| 1942 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
| 1943 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
| 1944 |
the terminating zero, or one of |
the terminating zero, or one of |
| 1945 |
|
|
| 1946 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 1947 |
|
|
| 1948 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
| 1949 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
| 1950 |
|
|
| 1951 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 1952 |
|
|
| 1953 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
| 1954 |
|
|
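The behaviour described for pcre_copy_substring() can be approximated in a few lines, which may help when reading the offsets directly. This is an illustrative sketch only, not the library's implementation: the names effective_stringcount() and copy_substring_sketch() and the SKETCH_xxx macros are hypothetical, and the error values simply repeat the numbers given above.

```c
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)   /* same value as PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING (-7)   /* same value as PCRE_ERROR_NOSUBSTRING */

/* Value to pass as stringcount: the pcre_exec() return value if it is
 * greater than zero, or the vector size divided by three if pcre_exec()
 * returned zero because it ran out of space in ovector. */
int effective_stringcount(int rc, int ovecsize)
{
    return (rc == 0) ? ovecsize / 3 : rc;
}

/* Copy substring "stringnumber" into buffer as a zero-terminated string.
 * Returns its length, not including the terminating zero, or one of the
 * two error values above. */
int copy_substring_sketch(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    length = ovector[2 * stringnumber + 1] - start;

    /* An unset substring has both offsets set to -1; treat it as empty. */
    if (start < 0) {
        start = 0;
        length = 0;
    }

    if (buffersize < length + 1)
        return SKETCH_ERROR_NOMEMORY;   /* buffer too small */

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = 0;
    return length;
}
```

Because the length is returned separately, a caller can still process a substring that contains a binary zero, as noted above.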
| 1955 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
| 1956 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
| 1957 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
| 1958 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
| 1959 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
| 1960 |
pointer. The yield of the function is zero if all went well, or |
pointer. The yield of the function is zero if all went well, or |
| 1961 |
|
|
| 1962 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 1963 |
|
|
| 1964 |
if the attempt to get the memory block failed. |
if the attempt to get the memory block failed. |
| 1965 |
|
|
| 1966 |
When any of these functions encounters a substring that is unset, which |
When any of these functions encounters a substring that is unset, which |
| 1967 |
can happen when capturing subpattern number n+1 matches some part of |
can happen when capturing subpattern number n+1 matches some part of |
| 1968 |
the subject, but subpattern n has not been used at all, they return an |
the subject, but subpattern n has not been used at all, they return an |
| 1969 |
empty string. This can be distinguished from a genuine zero-length sub- |
empty string. This can be distinguished from a genuine zero-length sub- |
| 1970 |
string by inspecting the appropriate offset in ovector, which is nega- |
string by inspecting the appropriate offset in ovector, which is nega- |
| 1971 |
tive for unset substrings. |
tive for unset substrings. |
| 1972 |
|
|
| 1973 |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
| 1974 |
string_list() can be used to free the memory returned by a previous |
string_list() can be used to free the memory returned by a previous |
| 1975 |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
| 1976 |
tively. They do nothing more than call the function pointed to by |
tively. They do nothing more than call the function pointed to by |
| 1977 |
pcre_free, which of course could be called directly from a C program. |
pcre_free, which of course could be called directly from a C program. |
| 1978 |
However, PCRE is used in some situations where it is linked via a spe- |
However, PCRE is used in some situations where it is linked via a spe- |
| 1979 |
cial interface to another programming language which cannot use |
cial interface to another programming language that cannot use |
| 1980 |
pcre_free directly; it is for these cases that the functions are pro- |
pcre_free directly; it is for these cases that the functions are pro- |
| 1981 |
vided. |
vided. |
| 1982 |
|
|
| 1983 |
|
|
| 1996 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
| 1997 |
const char **stringptr); |
const char **stringptr); |
| 1998 |
|
|
| 1999 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
| 2000 |
number. For example, for this pattern |
number. For example, for this pattern |
| 2001 |
|
|
| 2002 |
(a+)b(?P<xxx>\d+)... |
(a+)b(?P<xxx>\d+)... |
| 2003 |
|
|
| 2004 |
the number of the subpattern called "xxx" is 2. You can find the number |
the number of the subpattern called "xxx" is 2. If the name is known to |
| 2005 |
from the name by calling pcre_get_stringnumber(). The first argument is |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
| 2006 |
the compiled pattern, and the second is the name. The yield of the |
name by calling pcre_get_stringnumber(). The first argument is the com- |
| 2007 |
function is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if |
piled pattern, and the second is the name. The yield of the function is |
| 2008 |
there is no subpattern of that name. |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
| 2009 |
|
subpattern of that name. |
| 2010 |
|
|
| 2011 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
| 2012 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
| 2028 |
ate. |
ate. |
| 2029 |
|
|
| 2030 |
|
|
| 2031 |
|
DUPLICATE SUBPATTERN NAMES |
| 2032 |
|
|
| 2033 |
|
int pcre_get_stringtable_entries(const pcre *code, |
| 2034 |
|
const char *name, char **first, char **last); |
| 2035 |
|
|
| 2036 |
|
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
| 2037 |
|
subpatterns are not required to be unique. Normally, patterns with |
| 2038 |
|
duplicate names are such that in any one match, only one of the named |
| 2039 |
|
subpatterns participates. An example is shown in the pcrepattern docu- |
| 2040 |
|
mentation. When duplicates are present, pcre_copy_named_substring() and |
| 2041 |
|
pcre_get_named_substring() return the first substring corresponding to |
| 2042 |
|
the given name that is set. If none are set, an empty string is |
| 2043 |
|
returned. The pcre_get_stringnumber() function returns one of the num- |
| 2044 |
|
bers that are associated with the name, but it is not defined which it |
| 2045 |
|
is. |
| 2046 |
|
|
| 2047 |
|
If you want to get full details of all captured substrings for a given |
| 2048 |
|
name, you must use the pcre_get_stringtable_entries() function. The |
| 2049 |
|
first argument is the compiled pattern, and the second is the name. The |
| 2050 |
|
third and fourth are pointers to variables which are updated by the |
| 2051 |
|
function. After it has run, they point to the first and last entries in |
| 2052 |
|
the name-to-number table for the given name. The function itself |
| 2053 |
|
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING if there |
| 2054 |
|
are none. The format of the table is described above in the section |
| 2055 |
|
entitled Information about a pattern. Given all the relevant entries |
| 2056 |
|
for the name, you can extract each of their numbers, and hence the cap- |
| 2057 |
|
tured data, if any. |
| 2058 |
|
|
| 2059 |
|
|
| 2060 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
| 2061 |
|
|
| 2062 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
| 2063 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
| 2064 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
| 2065 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
| 2066 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
| 2067 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
| 2068 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
| 2069 |
tation. |
tation. |
| 2070 |
|
|
| 2071 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
| 2072 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
| 2073 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
| 2074 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
| 2075 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
| 2076 |
|
|
| 2077 |
|
|
| 2082 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
| 2083 |
int *workspace, int wscount); |
int *workspace, int wscount); |
| 2084 |
|
|
| 2085 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
| 2086 |
against a compiled pattern, using a "DFA" matching algorithm. This has |
against a compiled pattern, using a "DFA" matching algorithm. This has |
| 2087 |
different characteristics to the normal algorithm, and is not compati- |
different characteristics to the normal algorithm, and is not compati- |
| 2088 |
ble with Perl. Some of the features of PCRE patterns are not supported. |
ble with Perl. Some of the features of PCRE patterns are not supported. |
| 2089 |
Nevertheless, there are times when this kind of matching can be useful. |
Nevertheless, there are times when this kind of matching can be useful. |
| 2090 |
For a discussion of the two matching algorithms, see the pcrematching |
For a discussion of the two matching algorithms, see the pcrematching |
| 2091 |
documentation. |
documentation. |
| 2092 |
|
|
| 2093 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
| 2094 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
| 2095 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
| 2096 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
| 2097 |
repeated here. |
repeated here. |
| 2098 |
|
|
| 2099 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
| 2100 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
| 2101 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
| 2102 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
| 2103 |
lot of possible matches. |
lot of potential matches. |
| 2104 |
|
|
| 2105 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
| 2106 |
|
|
| 2121 |
|
|
| 2122 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
| 2123 |
|
|
| 2124 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
| 2125 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
| 2126 |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
| 2127 |
PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
| 2128 |
these are the same as for pcre_exec(), so their description is not |
three of these are the same as for pcre_exec(), so their description is |
| 2129 |
repeated here. |
not repeated here. |
| 2130 |
|
|
| 2131 |
PCRE_PARTIAL |
PCRE_PARTIAL |
| 2132 |
|
|
| 2133 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
| 2134 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
| 2135 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
| 2136 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
| 2137 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
| 2138 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
| 2139 |
set as the first matching string. |
set as the first matching string. |
| 2140 |
|
|
| 2141 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
| 2142 |
|
|
| 2143 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
| 2144 |
stop as soon as it has found one match. Because of the way the DFA |
stop as soon as it has found one match. Because of the way the DFA |
| 2145 |
algorithm works, this is necessarily the shortest possible match at the |
algorithm works, this is necessarily the shortest possible match at the |
| 2146 |
first possible matching point in the subject string. |
first possible matching point in the subject string. |
| 2147 |
|
|
| 2148 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
| 2149 |
|
|
| 2150 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
| 2151 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
| 2152 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
| 2153 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
| 2154 |
workspace and wscount options must reference the same vector as before |
workspace and wscount options must reference the same vector as before |
| 2155 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
| 2156 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
| 2157 |
documentation. |
documentation. |
| 2158 |
|
|
| 2159 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
| 2160 |
|
|
| 2161 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
| 2162 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
| 2163 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
| 2164 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
| 2165 |
if the pattern |
if the pattern |
| 2166 |
|
|
| 2167 |
<.*> |
<.*> |
| 2176 |
<something> <something else> |
<something> <something else> |
| 2177 |
<something> <something else> <something further> |
<something> <something else> <something further> |
| 2178 |
|
|
| 2179 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
| 2180 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
| 2181 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
| 2182 |
the offset to the start, and the second is the offset to the end. All |
the offset to the start, and the second is the offset to the end. All |
| 2183 |
the strings have the same start offset. (Space could have been saved by |
the strings have the same start offset. (Space could have been saved by |
| 2184 |
giving this only once, but it was decided to retain some compatibility |
giving this only once, but it was decided to retain some compatibility |
| 2185 |
with the way pcre_exec() returns data, even though the meaning of the |
with the way pcre_exec() returns data, even though the meaning of the |
| 2186 |
strings is different.) |
strings is different.) |
| 2187 |
|
|
| 2188 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
| 2189 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
| 2190 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
| 2191 |
filled with the longest matches. |
filled with the longest matches. |
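The layout described above can be read back with a few lines of C. This is a minimal sketch, not code from the PCRE distribution: the function names are hypothetical, and it assumes that after a zero return every one of the ovecsize / 2 pairs in the vector has been filled.

```c
#include <stdio.h>

/* Number of (start, end) pairs available in ovector after a call that
   returned rc. A zero return means the vector overflowed; this sketch
   assumes all ovecsize / 2 pairs were filled in that case. */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc < 0)
        return 0;               /* error or no match: nothing to read */
    return (rc == 0) ? ovecsize / 2 : rc;
}

/* Length in bytes of the i-th match; index 0 is the longest. */
int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Print each match on its own line, longest first. */
void dfa_print_matches(FILE *out, const char *subject,
                       const int *ovector, int count)
{
    int i;

    for (i = 0; i < count; i++)
        fprintf(out, "%2d: %.*s\n", i,
                dfa_match_length(ovector, i), subject + ovector[2 * i]);
}
```

A caller would pass the return value of pcre_dfa_exec() and the ovecsize it supplied to dfa_match_count(), then hand the result to dfa_print_matches().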
| 2192 |
|
|
| 2193 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
| 2194 |
|
|
| 2195 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
| 2196 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
| 2197 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
| 2198 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
| 2199 |
|
|
| 2200 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
| 2201 |
|
|
| 2202 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
| 2203 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
| 2204 |
reference. |
reference. |
| 2205 |
|
|
| 2206 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
| 2207 |
|
|
| 2208 |
This return is given if pcre_dfa_exec() encounters a condition item in |
This return is given if pcre_dfa_exec() encounters a condition item in |
| 2209 |
a pattern that uses a back reference for the condition. This is not |
a pattern that uses a back reference for the condition. This is not |
| 2210 |
supported. |
supported. |
| 2211 |
|
|
| 2212 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
| 2213 |
|
|
| 2214 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
| 2215 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
| 2216 |
(it is meaningless). |
(it is meaningless). |
| 2217 |
|
|
| 2218 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
| 2219 |
|
|
| 2220 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
| 2221 |
workspace vector. |
workspace vector. |
| 2222 |
|
|
| 2223 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
| 2224 |
|
|
| 2225 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
| 2226 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
| 2227 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
| 2228 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
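For diagnostics, the five codes listed above can be turned into short messages. The sketch below is only illustrative: it uses local ERR_DFA_* constants (hypothetical names carrying the numeric values quoted in this section) so that it compiles without pcre.h, and the message wording is not taken from PCRE. Real code would normally switch on the PCRE_ERROR_DFA_* macros instead.

```c
#include <stddef.h>

/* Local stand-ins for the PCRE_ERROR_DFA_* values quoted above. */
enum {
    ERR_DFA_UITEM   = -16,
    ERR_DFA_UCOND   = -17,
    ERR_DFA_UMLIMIT = -18,
    ERR_DFA_WSSIZE  = -19,
    ERR_DFA_RECURSE = -20
};

/* Return a short description of a pcre_dfa_exec()-specific error,
   or NULL if rc is not one of the five codes handled here. */
const char *dfa_error_text(int rc)
{
    switch (rc) {
    case ERR_DFA_UITEM:   return "unsupported item in pattern";
    case ERR_DFA_UCOND:   return "unsupported back reference condition";
    case ERR_DFA_UMLIMIT: return "match_limit is not supported";
    case ERR_DFA_WSSIZE:  return "workspace vector too small";
    case ERR_DFA_RECURSE: return "recursion output vector too small";
    default:              return NULL;
    }
}
```

Of these, only the workspace error is normally recoverable at run time, by retrying the call with a larger workspace vector.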
| 2229 |
|
|
| 2230 |
Last updated: 18 January 2006 |
Last updated: 08 June 2006 |
| 2231 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 2232 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2233 |
|
|
| 2474 |
meta-character matches only at the very end of the string. |
meta-character matches only at the very end of the string. |
| 2475 |
|
|
| 2476 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
| 2477 |
cial meaning is faulted. |
cial meaning is faulted. Otherwise, like Perl, the backslash is |
| 2478 |
|
ignored. (Perl can be made to issue a warning.) |
| 2479 |
|
|
| 2480 |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
| 2481 |
fiers is inverted, that is, by default they are not greedy, but if fol- |
fiers is inverted, that is, by default they are not greedy, but if fol- |
| 2482 |
lowed by a question mark they are. |
lowed by a question mark they are. |
| 2483 |
|
|
| 2484 |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
| 2485 |
tried only at the first matching position in the subject string. |
tried only at the first matching position in the subject string. |
| 2486 |
|
|
| 2487 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
| 2488 |
TURE options for pcre_exec() have no Perl equivalents. |
TURE options for pcre_exec() have no Perl equivalents. |
| 2489 |
|
|
| 2490 |
(g) The (?R), (?number), and (?P>name) constructs allow for recursive |
(g) The (?R), (?number), and (?P>name) constructs allow for recursive |
| 2491 |
pattern matching (Perl can do this using the (?p{code}) construct, |
pattern matching (Perl can do this using the (?p{code}) construct, |
| 2492 |
which PCRE cannot support.) |
which PCRE cannot support.) |
| 2493 |
|
|
| 2494 |
(h) PCRE supports named capturing substrings, using the Python syntax. |
(h) PCRE supports named capturing substrings, using the Python syntax. |
| 2495 |
|
|
| 2496 |
(i) PCRE supports the possessive quantifier "++" syntax, taken from |
(i) PCRE supports the possessive quantifier "++" syntax, taken from |
| 2497 |
Sun's Java package. |
Sun's Java package. |
| 2498 |
|
|
| 2499 |
(j) The (R) condition, for testing recursion, is a PCRE extension. |
(j) The (R) condition, for testing recursion, is a PCRE extension. |
| 2505 |
(m) Patterns compiled by PCRE can be saved and re-used at a later time, |
(m) Patterns compiled by PCRE can be saved and re-used at a later time, |
| 2506 |
even on different hosts that have the other endianness. |
even on different hosts that have the other endianness. |
| 2507 |
|
|
| 2508 |
(n) The alternative matching function (pcre_dfa_exec()) matches in a |
(n) The alternative matching function (pcre_dfa_exec()) matches in a |
| 2509 |
different way and is not Perl-compatible. |
different way and is not Perl-compatible. |
| 2510 |
|
|
| 2511 |
Last updated: 24 January 2006 |
Last updated: 06 June 2006 |
| 2512 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 2513 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2514 |
|
|
| 2617 |
|
|
| 2618 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
| 2619 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
| 2620 |
# outside a character class and the next newline character are ignored. |
# outside a character class and the next newline are ignored. An escap- |
| 2621 |
An escaping backslash can be used to include a whitespace or # charac- |
ing backslash can be used to include a whitespace or # character as |
| 2622 |
ter as part of the pattern. |
part of the pattern. |
| 2623 |
|
|
| 2624 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
| 2625 |
ters, you can do so by putting them between \Q and \E. This is differ- |
ters, you can do so by putting them between \Q and \E. This is differ- |
| 2676 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x. There is no difference in the way they are han- |
| 2677 |
dled. For example, \xdc is exactly the same as \x{dc}. |
dled. For example, \xdc is exactly the same as \x{dc}. |
| 2678 |
|
|
| 2679 |
After \0 up to two further octal digits are read. In both cases, if |
After \0 up to two further octal digits are read. If there are fewer |
| 2680 |
there are fewer than two digits, just those that are present are used. |
than two digits, just those that are present are used. Thus the |
| 2681 |
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
| 2682 |
character (code value 7). Make sure you supply two digits after the |
(code value 7). Make sure you supply two digits after the initial zero |
| 2683 |
initial zero if the pattern character that follows is itself an octal |
if the pattern character that follows is itself an octal digit. |
|
digit. |
|
| 2684 |
|
|
| 2685 |
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
| 2686 |
cated. Outside a character class, PCRE reads it and any following dig- |
cated. Outside a character class, PCRE reads it and any following dig- |
| 2687 |
its as a decimal number. If the number is less than 10, or if there |
its as a decimal number. If the number is less than 10, or if there |
| 2688 |
have been at least that many previous capturing left parentheses in the |
have been at least that many previous capturing left parentheses in the |
| 2689 |
expression, the entire sequence is taken as a back reference. A |
expression, the entire sequence is taken as a back reference. A |
| 2690 |
description of how this works is given later, following the discussion |
description of how this works is given later, following the discussion |
| 2691 |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
| 2692 |
|
|
| 2693 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
| 2694 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
| 2695 |
up to three octal digits following the backslash, and generates a sin- |
up to three octal digits following the backslash, and uses them to gen- |
| 2696 |
gle byte from the least significant 8 bits of the value. Any subsequent |
erate a data character. Any subsequent digits stand for themselves. In |
| 2697 |
digits stand for themselves. For example: |
non-UTF-8 mode, the value of a character specified in octal must be |
| 2698 |
|
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
| 2699 |
|
example: |
| 2700 |
|
|
| 2701 |
\040 is another way of writing a space |
\040 is another way of writing a space |
| 2702 |
\40 is the same, provided there are fewer than 40 |
\40 is the same, provided there are fewer than 40 |
| 2713 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
| 2714 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
| 2715 |
|
|
| 2716 |
Note that octal values of 100 or greater must not be introduced by a |
Note that octal values of 100 or greater must not be introduced by a |
| 2717 |
leading zero, because no more than three octal digits are ever read. |
leading zero, because no more than three octal digits are ever read. |
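The decision rule for a backslash followed by a non-zero digit can be sketched as follows. This is a simplified model of the rule as described above, not PCRE's actual parser: classify_digit_escape and its types are hypothetical, and the range checks on the octal value (below \400 outside UTF-8 mode) are left out.

```c
#include <ctype.h>

typedef enum { ESC_BACKREF, ESC_OCTAL } esc_kind;

typedef struct {
    esc_kind kind;   /* how the escape was interpreted            */
    int value;       /* group number, or character code           */
    int consumed;    /* digits consumed after the backslash       */
} esc_result;

/* digits points just past the backslash, at a digit other than 0.
   captures_so_far counts the capturing left parentheses seen so far;
   in_class is non-zero inside a character class. */
esc_result classify_digit_escape(const char *digits,
                                 int captures_so_far, int in_class)
{
    esc_result r;
    int n = 0, len = 0;

    /* First reading: all following digits as one decimal number
       (capped at 9 digits so the int cannot overflow). */
    while (len < 9 && isdigit((unsigned char)digits[len])) {
        n = n * 10 + (digits[len] - '0');
        len++;
    }
    if (!in_class && (n < 10 || n <= captures_so_far)) {
        r.kind = ESC_BACKREF;
        r.value = n;
        r.consumed = len;
        return r;
    }

    /* Second reading: up to three octal digits; the rest are literal. */
    n = 0;
    len = 0;
    while (len < 3 && digits[len] >= '0' && digits[len] <= '7') {
        n = n * 8 + (digits[len] - '0');
        len++;
    }
    r.kind = ESC_OCTAL;
    r.value = n;
    r.consumed = len;
    return r;
}
```

With no capturing groups, the input "81" yields an octal result with value 0 and no digits consumed, which corresponds to a binary zero followed by the literal characters "8" and "1".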
| 2718 |
|
|
| 2719 |
All the sequences that define a single byte value or a single UTF-8 |
All the sequences that define a single character value can be used both |
| 2720 |
character (in UTF-8 mode) can be used both inside and outside character |
inside and outside character classes. In addition, inside a character |
| 2721 |
classes. In addition, inside a character class, the sequence \b is |
class, the sequence \b is interpreted as the backspace character (hex |
| 2722 |
interpreted as the backspace character (hex 08), and the sequence \X is |
08), and the sequence \X is interpreted as the character "X". Outside a |
| 2723 |
interpreted as the character "X". Outside a character class, these |
character class, these sequences have different meanings (see below). |
|
sequences have different meanings (see below). |
|
| 2724 |
|
|
| 2725 |
Generic character types |
Generic character types |
| 2726 |
|
|
| 2745 |
|
|
| 2746 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
| 2747 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
| 2748 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If |
| 2749 |
|
"use locale;" is included in a Perl script, \s may match the VT charac- |
| 2750 |
|
ter. In PCRE, it never does.) |
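The character set listed for \s can be written as a small predicate, which makes the difference from the C library's isspace() easy to see. This is an illustrative helper with a hypothetical name, not a PCRE function, and it covers only the five code values named above.

```c
/* Return 1 if c is one of the characters that \s matches in the PCRE
   version described here: HT (9), LF (10), FF (12), CR (13) and
   space (32). VT (11) is deliberately excluded. */
int matches_backslash_s(int c)
{
    switch (c) {
    case 9:
    case 10:
    case 12:
    case 13:
    case 32:
        return 1;
    default:
        return 0;
    }
}
```

In the "C" locale isspace(11) is true, whereas matches_backslash_s(11) is 0, which is the VT difference from the POSIX "space" class noted above.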
| 2751 |
|
|
| 2752 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
| 2753 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
| 2862 |
classified as a modifier or "other". |
classified as a modifier or "other". |
| 2863 |
|
|
| 2864 |
The long synonyms for these properties that Perl supports (such as |
The long synonyms for these properties that Perl supports (such as |
| 2865 |
\p{Letter}) are not supported by PCRE. Nor is is permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
| 2866 |
any of these properties with "Is". |
any of these properties with "Is". |
| 2867 |
|
|
| 2868 |
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
| 2920 |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
| 2921 |
cating that matching is to start at a point other than the beginning of |
cating that matching is to start at a point other than the beginning of |
| 2922 |
the subject, \A can never match. The difference between \Z and \z is |
the subject, \A can never match. The difference between \Z and \z is |
| 2923 |
that \Z matches before a newline that is the last character of the |
that \Z matches before a newline at the end of the string as well as at |
| 2924 |
string as well as at the end of the string, whereas \z matches only at |
the very end, whereas \z matches only at the end. |
| 2925 |
the end. |
|
| 2926 |
|
The \G assertion is true only when the current matching position is at |
| 2927 |
The \G assertion is true only when the current matching position is at |
the start point of the match, as specified by the startoffset argument |
| 2928 |
the start point of the match, as specified by the startoffset argument |
of pcre_exec(). It differs from \A when the value of startoffset is |
| 2929 |
of pcre_exec(). It differs from \A when the value of startoffset is |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
|
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
|
| 2930 |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
| 2931 |
mentation where \G can be useful. |
mentation where \G can be useful. |
| 2932 |
|
|
| 2933 |
Note, however, that PCRE's interpretation of \G, as the start of the |
Note, however, that PCRE's interpretation of \G, as the start of the |
| 2934 |
current match, is subtly different from Perl's, which defines it as the |
current match, is subtly different from Perl's, which defines it as the |
| 2935 |
end of the previous match. In Perl, these can be different when the |
end of the previous match. In Perl, these can be different when the |
| 2936 |
previously matched string was empty. Because PCRE does just one match |
previously matched string was empty. Because PCRE does just one match |
| 2937 |
at a time, it cannot reproduce this behaviour. |
at a time, it cannot reproduce this behaviour. |
| 2938 |
|
|
| 2939 |
If all the alternatives of a pattern begin with \G, the expression is |
If all the alternatives of a pattern begin with \G, the expression is |
| 2940 |
anchored to the starting match position, and the "anchored" flag is set |
anchored to the starting match position, and the "anchored" flag is set |
| 2941 |
in the compiled regular expression. |
in the compiled regular expression. |
| 2942 |
|
|
| 2944 |
CIRCUMFLEX AND DOLLAR |
CIRCUMFLEX AND DOLLAR |
| 2945 |
|
|
| 2946 |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
| 2947 |
character is an assertion that is true only if the current matching |
character is an assertion that is true only if the current matching |
| 2948 |
point is at the start of the subject string. If the startoffset argu- |
point is at the start of the subject string. If the startoffset argu- |
| 2949 |
ment of pcre_exec() is non-zero, circumflex can never match if the |
ment of pcre_exec() is non-zero, circumflex can never match if the |
| 2950 |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
| 2951 |
has an entirely different meaning (see below). |
has an entirely different meaning (see below). |
| 2952 |
|
|
| 2953 |
Circumflex need not be the first character of the pattern if a number |
Circumflex need not be the first character of the pattern if a number |
| 2954 |
of alternatives are involved, but it should be the first thing in each |
of alternatives are involved, but it should be the first thing in each |
| 2955 |
alternative in which it appears if the pattern is ever to match that |
alternative in which it appears if the pattern is ever to match that |
| 2956 |
branch. If all possible alternatives start with a circumflex, that is, |
branch. If all possible alternatives start with a circumflex, that is, |
| 2957 |
if the pattern is constrained to match only at the start of the sub- |
if the pattern is constrained to match only at the start of the sub- |
| 2958 |
ject, it is said to be an "anchored" pattern. (There are also other |
ject, it is said to be an "anchored" pattern. (There are also other |
| 2959 |
constructs that can cause a pattern to be anchored.) |
constructs that can cause a pattern to be anchored.) |
| 2960 |
|
|
| 2961 |
A dollar character is an assertion that is true only if the current |
A dollar character is an assertion that is true only if the current |
| 2962 |
matching point is at the end of the subject string, or immediately |
matching point is at the end of the subject string, or immediately |
| 2963 |
before a newline character that is the last character in the string (by |
before a newline at the end of the string (by default). Dollar need not |
| 2964 |
default). Dollar need not be the last character of the pattern if a |
be the last character of the pattern if a number of alternatives are |
| 2965 |
number of alternatives are involved, but it should be the last item in |
involved, but it should be the last item in any branch in which it |
| 2966 |
any branch in which it appears. Dollar has no special meaning in a |
appears. Dollar has no special meaning in a character class. |
|
character class. |
|
| 2967 |
|
|
| 2968 |
The meaning of dollar can be changed so that it matches only at the |
The meaning of dollar can be changed so that it matches only at the |
| 2969 |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
| 2970 |
compile time. This does not affect the \Z assertion. |
compile time. This does not affect the \Z assertion. |
| 2971 |
|
|
| 2972 |
The meanings of the circumflex and dollar characters are changed if the |
The meanings of the circumflex and dollar characters are changed if the |
| 2973 |
PCRE_MULTILINE option is set. When this is the case, they match immedi- |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
| 2974 |
ately after and immediately before an internal newline character, |
matches immediately after internal newlines as well as at the start of |
| 2975 |
respectively, in addition to matching at the start and end of the sub- |
the subject string. It does not match after a newline that ends the |
| 2976 |
ject string. For example, the pattern /^abc$/ matches the subject |
string. A dollar matches before any newlines in the string, as well as |
| 2977 |
string "def\nabc" (where \n represents a newline character) in multi- |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
| 2978 |
line mode, but not otherwise. Consequently, patterns that are anchored |
as the two-character sequence CRLF, isolated CR and LF characters do |
| 2979 |
in single line mode because all branches start with ^ are not anchored |
not indicate newlines. |
| 2980 |
in multiline mode, and a match for circumflex is possible when the |
|
| 2981 |
startoffset argument of pcre_exec() is non-zero. The PCRE_DOL- |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
| 2982 |
LAR_ENDONLY option is ignored if PCRE_MULTILINE is set. |
(where \n represents a newline) in multiline mode, but not otherwise. |
| 2983 |
|
Consequently, patterns that are anchored in single line mode because |
| 2984 |
Note that the sequences \A, \Z, and \z can be used to match the start |
all branches start with ^ are not anchored in multiline mode, and a |
| 2985 |
and end of the subject in both modes, and if all branches of a pattern |
match for circumflex is possible when the startoffset argument of |
| 2986 |
start with \A it is always anchored, whether PCRE_MULTILINE is set or |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
| 2987 |
not. |
PCRE_MULTILINE is set. |
| 2988 |
|
|
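The effect of multiline mode on circumflex and dollar can be sketched with Python's re module. This is an illustration using a different regex engine, not a call into the PCRE library: it assumes that re.MULTILINE behaves like PCRE_MULTILINE when the newline is a single LF. The helper name found() is hypothetical.

```python
import re

def found(pattern, subject, flags=0):
    """Return True if pattern matches anywhere in subject."""
    return re.search(pattern, subject, flags) is not None

subject = "def\nabc"

# Single-line mode: ^ matches only at the very start of the subject.
single_line = found(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi_line = found(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the string.
before_final_newline = found(r"abc$", "abc\n")

print(single_line, multi_line, before_final_newline)
```

The CRLF handling described above is specific to PCRE and is not reproduced by this sketch.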
| 2989 |
|
Note that the sequences \A, \Z, and \z can be used to match the start |
| 2990 |
|
and end of the subject in both modes, and if all branches of a pattern |
| 2991 |
|
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
| 2992 |
|
set. |
| 2993 |
|
|
| 2994 |
|
|
| 2995 |
FULL STOP (PERIOD, DOT) |
FULL STOP (PERIOD, DOT) |
| 2996 |
|
|
| 2997 |
Outside a character class, a dot in the pattern matches any one charac- |
Outside a character class, a dot in the pattern matches any one charac- |
| 2998 |
ter in the subject, including a non-printing character, but not (by |
ter in the subject string except (by default) a character that signi- |
| 2999 |
default) newline. In UTF-8 mode, a dot matches any UTF-8 character, |
fies the end of a line. In UTF-8 mode, the matched character may be |
| 3000 |
which might be more than one byte long, except (by default) newline. If |
more than one byte long. When a line ending is defined as a single |
| 3001 |
the PCRE_DOTALL option is set, dots match newlines as well. The han- |
character (CR or LF), dot never matches that character; when the two- |
| 3002 |
dling of dot is entirely independent of the handling of circumflex and |
character sequence CRLF is used, dot does not match CR if it is immedi- |
| 3003 |
dollar, the only relationship being that they both involve newline |
ately followed by LF, but otherwise it matches all characters (includ- |
| 3004 |
characters. Dot has no special meaning in a character class. |
ing isolated CRs and LFs). |
| 3005 |
|
|
| 3006 |
|
The behaviour of dot with regard to newlines can be changed. If the |
| 3007 |
|
PCRE_DOTALL option is set, a dot matches any one character, without |
| 3008 |
|
exception. If newline is defined as the two-character sequence CRLF, it |
| 3009 |
|
takes two dots to match it. |
| 3010 |
|
|
| 3011 |
|
The handling of dot is entirely independent of the handling of circum- |
| 3012 |
|
flex and dollar, the only relationship being that they both involve |
| 3013 |
|
newlines. Dot has no special meaning in a character class. |
| 3014 |
|
|
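A minimal sketch of the dot behaviour described above, again using Python's re module as a stand-in and assuming that re.DOTALL corresponds to PCRE_DOTALL. Python treats only LF as the line ending for this purpose, so the CR and CRLF cases are not covered.

```python
import re

# By default, a dot does not match the newline character.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None

# With DOTALL set, a dot matches any one character, including newline.
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL) is not None

# Inside a character class a dot is an ordinary character.
class_literal = re.fullmatch(r"a[.]b", "a.b") is not None
class_other = re.fullmatch(r"a[.]b", "axb") is not None

print(dot_default, dot_all, class_literal, class_other)
```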
| 3015 |
|
|
| 3016 |
MATCHING A SINGLE BYTE |
MATCHING A SINGLE BYTE |
| 3017 |
|
|
| 3018 |
Outside a character class, the escape sequence \C matches any one byte, |
Outside a character class, the escape sequence \C matches any one byte, |
| 3019 |
both in and out of UTF-8 mode. Unlike a dot, it can match a newline. |
both in and out of UTF-8 mode. Unlike a dot, it always matches CR and |
| 3020 |
The feature is provided in Perl in order to match individual bytes in |
LF. The feature is provided in Perl in order to match individual bytes |
| 3021 |
UTF-8 mode. Because it breaks up UTF-8 characters into individual |
in UTF-8 mode. Because it breaks up UTF-8 characters into individual |
| 3022 |
bytes, what remains in the string may be a malformed UTF-8 string. For |
bytes, what remains in the string may be a malformed UTF-8 string. For |
| 3023 |
this reason, the \C escape sequence is best avoided. |
this reason, the \C escape sequence is best avoided. |
| 3024 |
|
|
| 3067 |
PCRE is compiled with Unicode property support as well as with UTF-8 |
PCRE is compiled with Unicode property support as well as with UTF-8 |
| 3068 |
support. |
support. |
| 3069 |
|
|
| 3070 |
The newline character is never treated in any special way in character |
Characters that might indicate line breaks (CR and LF) are never |
| 3071 |
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE |
treated in any special way when matching character classes, whatever |
| 3072 |
options is. A class such as [^a] will always match a newline. |
line-ending sequence is in use, and whatever setting of the PCRE_DOTALL |
| 3073 |
|
and PCRE_MULTILINE options is used. A class such as [^a] always matches |
| 3074 |
|
one of these characters. |
| 3075 |
|
|
| 3076 |
The minus (hyphen) character can be used to specify a range of charac- |
The minus (hyphen) character can be used to specify a range of charac- |
| 3077 |
ters in a character class. For example, [d-m] matches any letter |
ters in a character class. For example, [d-m] matches any letter |
| 3172 |
|
|
| 3173 |
matches either "gilbert" or "sullivan". Any number of alternatives may |
matches either "gilbert" or "sullivan". Any number of alternatives may |
| 3174 |
appear, and an empty alternative is permitted (matching the empty |
appear, and an empty alternative is permitted (matching the empty |
| 3175 |
string). The matching process tries each alternative in turn, from |
string). The matching process tries each alternative in turn, from left |
| 3176 |
left to right, and the first one that succeeds is used. If the alterna- |
to right, and the first one that succeeds is used. If the alternatives |
| 3177 |
tives are within a subpattern (defined below), "succeeds" means match- |
are within a subpattern (defined below), "succeeds" means matching the |
| 3178 |
ing the rest of the main pattern as well as the alternative in the sub- |
rest of the main pattern as well as the alternative in the subpattern. |
|
pattern. |
|
| 3179 |
|
|
| 3180 |
|
|
| 3181 |
INTERNAL OPTION SETTING |
INTERNAL OPTION SETTING |
| 3221 |
the effects of option settings happen at compile time. There would be |
the effects of option settings happen at compile time. There would be |
| 3222 |
some very weird behaviour otherwise. |
some very weird behaviour otherwise. |
| 3223 |
|
|
| 3224 |
The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed |
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
| 3225 |
in the same way as the Perl-compatible options by using the characters |
can be changed in the same way as the Perl-compatible options by using |
| 3226 |
U and X respectively. The (?X) flag setting is special in that it must |
the characters J, U and X respectively. |
|
always occur earlier in the pattern than any of the additional features |
|
|
it turns on, even when it is at top level. It is best to put it at the |
|
|
start. |
|
| 3227 |
|
|
| 3228 |
|
|
| 3229 |
SUBPATTERNS |
SUBPATTERNS |
| 3235 |
|
|
| 3236 |
cat(aract|erpillar|) |
cat(aract|erpillar|) |
| 3237 |
|
|
| 3238 |
matches one of the words "cat", "cataract", or "caterpillar". Without |
matches one of the words "cat", "cataract", or "caterpillar". Without |
| 3239 |
the parentheses, it would match "cataract", "erpillar" or the empty |
the parentheses, it would match "cataract", "erpillar" or the empty |
| 3240 |
string. |
string. |
| 3241 |
|
|
| 3242 |
2. It sets up the subpattern as a capturing subpattern. This means |
2. It sets up the subpattern as a capturing subpattern. This means |
| 3243 |
that, when the whole pattern matches, that portion of the subject |
that, when the whole pattern matches, that portion of the subject |
| 3244 |
string that matched the subpattern is passed back to the caller via the |
string that matched the subpattern is passed back to the caller via the |
| 3245 |
ovector argument of pcre_exec(). Opening parentheses are counted from |
ovector argument of pcre_exec(). Opening parentheses are counted from |
| 3246 |
left to right (starting from 1) to obtain numbers for the capturing |
left to right (starting from 1) to obtain numbers for the capturing |
| 3247 |
subpatterns. |
subpatterns. |
| 3248 |
|
|
| 3249 |
For example, if the string "the red king" is matched against the pat- |
For example, if the string "the red king" is matched against the pat- |
| 3250 |
tern |
tern |
| 3251 |
|
|
| 3252 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
| 3254 |
the captured substrings are "red king", "red", and "king", and are num- |
the captured substrings are "red king", "red", and "king", and are num- |
| 3255 |
bered 1, 2, and 3, respectively. |
bered 1, 2, and 3, respectively. |
| 3256 |
|
|
| 3257 |
The fact that plain parentheses fulfil two functions is not always |
The fact that plain parentheses fulfil two functions is not always |
| 3258 |
helpful. There are often times when a grouping subpattern is required |
helpful. There are often times when a grouping subpattern is required |
| 3259 |
without a capturing requirement. If an opening parenthesis is followed |
without a capturing requirement. If an opening parenthesis is followed |
| 3260 |
by a question mark and a colon, the subpattern does not do any captur- |
by a question mark and a colon, the subpattern does not do any captur- |
| 3261 |
ing, and is not counted when computing the number of any subsequent |
ing, and is not counted when computing the number of any subsequent |
| 3262 |
capturing subpatterns. For example, if the string "the white queen" is |
capturing subpatterns. For example, if the string "the white queen" is |
| 3263 |
matched against the pattern |
matched against the pattern |
| 3264 |
|
|
| 3265 |
the ((?:red|white) (king|queen)) |
the ((?:red|white) (king|queen)) |
| 3266 |
|
|
| 3267 |
the captured substrings are "white queen" and "queen", and are numbered |
the captured substrings are "white queen" and "queen", and are numbered |
| 3268 |
1 and 2. The maximum number of capturing subpatterns is 65535, and the |
1 and 2. The maximum number of capturing subpatterns is 65535, and the |
| 3269 |
maximum depth of nesting of all subpatterns, both capturing and non- |
maximum depth of nesting of all subpatterns, both capturing and non- |
| 3270 |
capturing, is 200. |
capturing, is 200. |
| 3271 |
|
|
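The numbering of capturing and non-capturing parentheses in the two examples above can be checked with a short sketch. It uses Python's re module rather than pcre_exec() and its ovector argument, on the assumption that both engines number groups the same way for these patterns.

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?:...) groups without capturing, so later groups are renumbered.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

captured_all = capturing.groups()
captured_some = non_capturing.groups()

print(captured_all)
print(captured_some)
```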
| 3272 |
As a convenient shorthand, if any option settings are required at the |
As a convenient shorthand, if any option settings are required at the |
| 3273 |
start of a non-capturing subpattern, the option letters may appear |
start of a non-capturing subpattern, the option letters may appear |
| 3274 |
between the "?" and the ":". Thus the two patterns |
between the "?" and the ":". Thus the two patterns |
| 3275 |
|
|
| 3276 |
(?i:saturday|sunday) |
(?i:saturday|sunday) |
| 3277 |
(?:(?i)saturday|sunday) |
(?:(?i)saturday|sunday) |
| 3278 |
|
|
| 3279 |
match exactly the same set of strings. Because alternative branches are |
match exactly the same set of strings. Because alternative branches are |
| 3280 |
tried from left to right, and options are not reset until the end of |
tried from left to right, and options are not reset until the end of |
| 3281 |
the subpattern is reached, an option setting in one branch does affect |
the subpattern is reached, an option setting in one branch does affect |
| 3282 |
subsequent branches, so the above patterns match "SUNDAY" as well as |
subsequent branches, so the above patterns match "SUNDAY" as well as |
| 3283 |
"Saturday". |
"Saturday". |
| 3284 |
|
|
| 3285 |
|
|
| 3286 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
| 3287 |
|
|
| 3288 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
| 3289 |
very hard to keep track of the numbers in complicated regular expres- |
very hard to keep track of the numbers in complicated regular expres- |
| 3290 |
sions. Furthermore, if an expression is modified, the numbers may |
sions. Furthermore, if an expression is modified, the numbers may |
| 3291 |
change. To help with this difficulty, PCRE supports the naming of sub- |
change. To help with this difficulty, PCRE supports the naming of sub- |
| 3292 |
patterns, something that Perl does not provide. The Python syntax |
patterns, something that Perl does not provide. The Python syntax |
| 3293 |
(?P<name>...) is used. Names consist of alphanumeric characters and |
(?P<name>...) is used. References to capturing parentheses from other |
| 3294 |
underscores, and must be unique within a pattern. |
parts of the pattern, such as backreferences, recursion, and condi- |
| 3295 |
|
tions, can be made by name as well as by number. |
| 3296 |
|
|
| 3297 |
Named capturing parentheses are still allocated numbers as well as |
Names consist of up to 32 alphanumeric characters and underscores. |
| 3298 |
|
Named capturing parentheses are still allocated numbers as well as |
| 3299 |
names. The PCRE API provides function calls for extracting the name-to- |
names. The PCRE API provides function calls for extracting the name-to- |
| 3300 |
number translation table from a compiled pattern. There is also a con- |
number translation table from a compiled pattern. There is also a con- |
| 3301 |
venience function for extracting a captured substring by name. For fur- |
venience function for extracting a captured substring by name. |
| 3302 |
ther details see the pcreapi documentation. |
|
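As a rough illustration of names coexisting with numbers, the following sketch uses Python's re module, where the (?P<name>...) syntax originated. The pattern and the names year and month are made up for the example. Python's groupindex attribute is only an analogue of the name-to-number table mentioned above; it is not the PCRE API.

```python
import re

# Hypothetical pattern with two named capturing groups.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date.fullmatch("2006-06")

# The same substring is reachable by name and by number.
by_name = match.group("year")
by_number = match.group(1)

# Name-to-number translation table for the compiled pattern.
name_table = dict(date.groupindex)

print(by_name, by_number, name_table)
```

Python's re module rejects duplicate group names, so the behaviour controlled by PCRE_DUPNAMES cannot be shown this way.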
| 3303 |
|
By default, a name must be unique within a pattern, but it is possible |
| 3304 |
|
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
| 3305 |
|
time. This can be useful for patterns where only one instance of the |
| 3306 |
|
named parentheses can match. Suppose you want to match the name of a |
| 3307 |
|
weekday, either as a 3-letter abbreviation or as the full name, and in |
| 3308 |
|
both cases you want to extract the abbreviation. This pattern (ignoring |
| 3309 |
|
the line breaks) does the job: |
| 3310 |
|
|
| 3311 |
|
(?P<DN>Mon|Fri|Sun)(?:day)?| |
| 3312 |
|
(?P<DN>Tue)(?:sday)?| |
| 3313 |
|
(?P<DN>Wed)(?:nesday)?| |
| 3314 |
|
(?P<DN>Thu)(?:rsday)?| |
| 3315 |
|
(?P<DN>Sat)(?:urday)? |
| 3316 |
|
|
| 3317 |
|
There are five capturing substrings, but only one is ever set after a |
| 3318 |
|
match. The convenience function for extracting the data by name |
| 3319 |
|
returns the substring for the first, and in this example, the only, |
| 3320 |
|
subpattern of that name that matched. This saves searching to find |
| 3321 |
|
which numbered subpattern it was. If you make a reference to a non- |
| 3322 |
|
unique named subpattern from elsewhere in the pattern, the one that |
| 3323 |
|
corresponds to the lowest number is used. For further details of the |
| 3324 |
|
interfaces for handling named subpatterns, see the pcreapi documenta- |
| 3325 |
|
tion. |
| 3326 |
|
|
| 3327 |
|
|
| 3328 |
REPETITION |
REPETITION |
| 3531 |
meaning or processing of a possessive quantifier and the equivalent |
meaning or processing of a possessive quantifier and the equivalent |
| 3532 |
atomic group. |
atomic group. |
| 3533 |
|
|
| 3534 |
The possessive quantifier syntax is an extension to the Perl syntax. It |
The possessive quantifier syntax is an extension to the Perl syntax. |
| 3535 |
originates in Sun's Java package. |
Jeffrey Friedl originated the idea (and the name) in the first edition |
| 3536 |
|
of his book. Mike McCloskey liked it, so implemented it when he built |
| 3537 |
|
Sun's Java package, and PCRE copied it from there. |
| 3538 |
|
|
| 3539 |
When a pattern contains an unlimited repeat inside a subpattern that |
When a pattern contains an unlimited repeat inside a subpattern that |
| 3540 |
can itself be repeated an unlimited number of times, the use of an |
can itself be repeated an unlimited number of times, the use of an |
| 3575 |
it is always taken as a back reference, and causes an error only if |
it is always taken as a back reference, and causes an error only if |
| 3576 |
there are not that many capturing left parentheses in the entire pat- |
there are not that many capturing left parentheses in the entire pat- |
| 3577 |
tern. In other words, the parentheses that are referenced need not be |
tern. In other words, the parentheses that are referenced need not be |
| 3578 |
to the left of the reference for numbers less than 10. See the subsec- |
to the left of the reference for numbers less than 10. A "forward back |
| 3579 |
tion entitled "Non-printing characters" above for further details of |
reference" of this type can make sense when a repetition is involved |
| 3580 |
the handling of digits following a backslash. |
and the subpattern to the right has participated in an earlier itera- |
| 3581 |
|
tion. |
| 3582 |
|
|
| 3583 |
|
It is not possible to have a numerical "forward back reference" to a |
| 3584 |
|
subpattern whose number is 10 or more. However, a back reference to any |
| 3585 |
|
subpattern is possible using named parentheses (see below). See also |
| 3586 |
|
the subsection entitled "Non-printing characters" above for further |
| 3587 |
|
details of the handling of digits following a backslash. |
| 3588 |
|
|
| 3589 |
A back reference matches whatever actually matched the capturing sub- |
A back reference matches whatever actually matched the capturing sub- |
| 3590 |
pattern in the current subject string, rather than anything matching |
pattern in the current subject string, rather than anything matching |
| 3591 |
the subpattern itself (see "Subpatterns as subroutines" below for a way |
the subpattern itself (see "Subpatterns as subroutines" below for a way |
| 3592 |
of doing that). So the pattern |
of doing that). So the pattern |
| 3593 |
|
|
| 3594 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 3595 |
|
|
| 3596 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
| 3597 |
not "sense and responsibility". If caseful matching is in force at the |
not "sense and responsibility". If caseful matching is in force at the |
| 3598 |
time of the back reference, the case of letters is relevant. For exam- |
time of the back reference, the case of letters is relevant. For exam- |
| 3599 |
ple, |
ple, |
| 3600 |
|
|
| 3601 |
((?i)rah)\s+\1 |
((?i)rah)\s+\1 |
| 3602 |
|
|
| 3603 |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| 3604 |
original capturing subpattern is matched caselessly. |
original capturing subpattern is matched caselessly. |
| 3605 |
|
|
| 3606 |
Back references to named subpatterns use the Python syntax (?P=name). |
Back references to named subpatterns use the Python syntax (?P=name). |
| 3607 |
We could rewrite the above example as follows: |
We could rewrite the above example as follows: |
| 3608 |
|
|
| 3609 |
(?<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
| 3610 |
|
|
| 3611 |
|
A subpattern that is referenced by name may appear in the pattern |
| 3612 |
|
before or after the reference. |
| 3613 |
|
|
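The back reference examples above can be tried out with Python's re module, used here as a stand-in for PCRE. One assumption is needed: Python does not scope a (?i) placed inside a group the way PCRE does, so the sketch uses the scoped form (?i:rah) to keep caseless matching local to the group.

```python
import re

# \1 must repeat the text that group 1 actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")
same_sens = pair.fullmatch("sense and sensibility") is not None
same_respons = pair.fullmatch("response and responsibility") is not None
mixed = pair.fullmatch("sense and responsibility") is not None

# The group matches caselessly, but the named back reference is caseful.
rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
lower = rah.fullmatch("rah rah") is not None
upper = rah.fullmatch("RAH RAH") is not None
mixed_case = rah.fullmatch("RAH rah") is not None

print(same_sens, same_respons, mixed, lower, upper, mixed_case)
```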
| 3614 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
| 3615 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
| 3698 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
| 3699 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
| 3700 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
| 3701 |
eral alternatives, they do not all have to have the same fixed length. |
eral top-level alternatives, they do not all have to have the same |
| 3702 |
Thus |
fixed length. Thus |
| 3703 |
|
|
| 3704 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
| 3705 |
|
|
| 3812 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
| 3813 |
|
|
| 3814 |
There are three kinds of condition. If the text between the parentheses |
There are three kinds of condition. If the text between the parentheses |
| 3815 |
consists of a sequence of digits, the condition is satisfied if the |
consists of a sequence of digits, or a sequence of alphanumeric charac- |
| 3816 |
capturing subpattern of that number has previously matched. The number |
ters and underscores, the condition is satisfied if the capturing sub- |
| 3817 |
must be greater than zero. Consider the following pattern, which con- |
pattern of that number or name has previously matched. There is a pos- |
| 3818 |
tains non-significant white space to make it more readable (assume the |
sible ambiguity here, because subpattern names may consist entirely of |
| 3819 |
PCRE_EXTENDED option) and to divide it into three parts for ease of |
digits. PCRE looks first for a named subpattern; if it cannot find one |
| 3820 |
discussion: |
and the text consists entirely of digits, it looks for a subpattern of |
| 3821 |
|
that number, which must be greater than zero. Using subpattern names |
| 3822 |
|
that consist entirely of digits is not recommended. |
| 3823 |
|
|
| 3824 |
|
Consider the following pattern, which contains non-significant white |
| 3825 |
|
space to make it more readable (assume the PCRE_EXTENDED option) and to |
| 3826 |
|
divide it into three parts for ease of discussion: |
| 3827 |
|
|
| 3828 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
| 3829 |
|
|
| 3836 |
tern is executed and a closing parenthesis is required. Otherwise, |
tern is executed and a closing parenthesis is required. Otherwise, |
| 3837 |
since no-pattern is not present, the subpattern matches nothing. In |
since no-pattern is not present, the subpattern matches nothing. In |
| 3838 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
| 3839 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. Rewriting it to use a named subpat- |
| 3840 |
|
tern gives this: |
| 3841 |
|
|
| 3842 |
|
(?P<OPEN> \( )? [^()]+ (?(OPEN) \) ) |
| 3843 |
|
|
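Both forms of the conditional pattern above can be exercised in a small sketch. It uses Python's re module, which accepts the same (?(1)...) and (?(name)...) syntax, and assumes that re.VERBOSE plays the role of PCRE_EXTENDED. The variable names are hypothetical.

```python
import re

numbered = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)
named = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]

# A closing parenthesis is required only if an opening one was matched.
numbered_results = [numbered.fullmatch(s) is not None for s in subjects]
named_results = [named.fullmatch(s) is not None for s in subjects]

print(numbered_results)
print(named_results)
```

fullmatch() is used so that a subject with an unbalanced parenthesis is rejected outright instead of matching on an inner substring.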
| 3844 |
If the condition is the string (R), it is satisfied if a recursive call |
If the condition is the string (R), and there is no subpattern with the |
| 3845 |
to the pattern or subpattern has been made. At "top level", the condi- |
name R, the condition is satisfied if a recursive call to the pattern |
| 3846 |
tion is false. This is a PCRE extension. Recursive patterns are |
or subpattern has been made. At "top level", the condition is false. |
| 3847 |
described in the next section. |
This is a PCRE extension. Recursive patterns are described in the next |
| 3848 |
|
section. |
| 3849 |
|
|
| 3850 |
If the condition is not a sequence of digits or (R), it must be an |
If the condition is not a sequence of digits or (R), it must be an |
| 3851 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
| 3872 |
at all. |
at all. |
| 3873 |
|
|
| 3874 |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
| 3875 |
character class introduces a comment that continues up to the next new- |
character class introduces a comment that continues to immediately |
| 3876 |
line character in the pattern. |
after the next newline in the pattern. |
| 3877 |
|
|
| 3878 |
|
|
| 3879 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
| 3996 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
| 3997 |
|
|
| 3998 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
| 3999 |
two strings. Such references must, however, follow the subpattern to |
two strings. Such references, if given numerically, must follow the |
| 4000 |
which they refer. |
subpattern to which they refer. However, named references can refer to |
| 4001 |
|
later subpatterns. |
| 4002 |
|
|
| 4003 |
Like recursive subpatterns, a "subroutine" call is always treated as an |
Like recursive subpatterns, a "subroutine" call is always treated as an |
| 4004 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
| 4005 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
| 4006 |
there is a subsequent matching failure. |
there is a subsequent matching failure. |
| 4007 |
|
|
| 4008 |
|
|
| 4009 |
CALLOUTS |
CALLOUTS |
| 4010 |
|
|
| 4011 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| 4012 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
| 4013 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
| 4014 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
| 4015 |
tion. |
tion. |
| 4016 |
|
|
| 4017 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
| 4018 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
| 4019 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
| 4020 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
| 4021 |
all calling out. |
all calling out. |
| 4022 |
|
|
| 4023 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
| 4024 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
| 4025 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
| 4026 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
| 4027 |
points: |
points: |
| 4028 |
|
|
| 4029 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
| 4030 |
|
|
| 4031 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
| 4032 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
| 4033 |
numbered 255. |
numbered 255. |
| 4034 |
|
|
| 4035 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
| 4036 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
| 4037 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
| 4038 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
| 4039 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
| 4040 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
| 4041 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
| 4042 |
|
|
| 4043 |
Last updated: 24 January 2006 |
Last updated: 06 June 2006 |
| 4044 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 4045 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4046 |
|
|
| 5048 |
Last updated: 09 September 2004 |
Last updated: 09 September 2004 |
| 5049 |
Copyright (c) 1997-2004 University of Cambridge. |
Copyright (c) 1997-2004 University of Cambridge. |
| 5050 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5051 |
|
PCRESTACK(3) PCRESTACK(3) |
| 5052 |
|
|
| 5053 |
|
|
| 5054 |
|
NAME |
| 5055 |
|
PCRE - Perl-compatible regular expressions |
| 5056 |
|
|
| 5057 |
|
|
| 5058 |
|
PCRE DISCUSSION OF STACK USAGE |
| 5059 |
|
|
| 5060 |
|
When you call pcre_exec(), it makes use of an internal function called |
| 5061 |
|
match(). This calls itself recursively at branch points in the pattern, |
| 5062 |
|
in order to remember the state of the match so that it can back up and |
| 5063 |
|
try a different alternative if the first one fails. As matching pro- |
| 5064 |
|
ceeds deeper and deeper into the tree of possibilities, the recursion |
| 5065 |
|
depth increases. |
| 5066 |
|
|
| 5067 |
|
Not all calls of match() increase the recursion depth; for an item such |
| 5068 |
|
as a* it may be called several times at the same level, after matching |
| 5069 |
|
different numbers of a's. Furthermore, in a number of cases where the |
| 5070 |
|
result of the recursive call would immediately be passed back as the |
| 5071 |
|
result of the current call (a "tail recursion"), the function is just |
| 5072 |
|
restarted instead. |
| 5073 |
|
|
| 5074 |
|
The pcre_dfa_exec() function operates in an entirely different way, and |
| 5075 |
|
hardly uses recursion at all. The limit on its complexity is the amount |
| 5076 |
|
of workspace it is given. The comments that follow do NOT apply to |
| 5077 |
|
pcre_dfa_exec(); they are relevant only for pcre_exec(). |
| 5078 |
|
|
| 5079 |
|
You can set limits on the number of times that match() is called, both |
| 5080 |
|
in total and recursively. If the limit is exceeded, an error occurs. |
| 5081 |
|
For details, see the section on extra data for pcre_exec() in the |
| 5082 |
|
pcreapi documentation. |
| 5083 |
|
|
| 5084 |
|
Each time that match() is actually called recursively, it uses memory |
| 5085 |
|
from the process stack. For certain kinds of pattern and data, very |
| 5086 |
|
large amounts of stack may be needed, despite the recognition of "tail |
| 5087 |
|
recursion". You can often reduce the amount of recursion, and there- |
| 5088 |
|
fore the amount of stack used, by modifying the pattern that is being |
| 5089 |
|
matched. Consider, for example, this pattern: |
| 5090 |
|
|
| 5091 |
|
([^<]|<(?!inet))+ |
| 5092 |
|
|
| 5093 |
|
It matches from wherever it starts until it encounters "<inet" or the
end of the data, and is the kind of pattern that might be used when
processing an XML file. Each iteration of the outer parentheses matches
either one character that is not "<" or a "<" that is not followed by
"inet". However, each time a parenthesis is processed, a recursion
occurs, so this formulation uses a stack frame for each matched
character. For a long string, a lot of stack is required. Consider now
this rewritten pattern, which matches exactly the same strings:

  ([^<]++|<(?!inet))+

This uses far less stack, because runs of characters that do not
contain "<" are "swallowed" in one item inside the parentheses.
Recursion happens only when a "<" character that is not followed by
"inet" is encountered (and we assume this is relatively rare). A
possessive quantifier is used to stop any backtracking into the runs of
non-"<" characters, but that is not related to stack usage.

In environments where stack memory is constrained, you might want to
compile PCRE to use heap memory instead of stack for remembering
back-up points. This makes it run a lot more slowly, however. Details
of how to do this are given in the pcrebuild documentation.

In Unix-like environments, the stack is rarely a problem, though the
default limit on stack size varies from system to system. Values from
8MB to 64MB are common. You can find your default limit by running the
command:

  ulimit -s

The effect of running out of stack is often SIGSEGV, though sometimes
an error message is given. You can normally increase the limit on stack
size by code such as this:

  #include <sys/resource.h>

  struct rlimit rlim;
  getrlimit(RLIMIT_STACK, &rlim);
  rlim.rlim_cur = 100*1024*1024;
  setrlimit(RLIMIT_STACK, &rlim);

This reads the current limits (soft and hard) using getrlimit(), then
attempts to increase the soft limit to 100MB using setrlimit(). You
must do this before calling pcre_exec().

PCRE has an internal counter that can be used to limit the depth of
recursion, and thus cause pcre_exec() to give an error code before it
runs out of stack. By default, the limit is very large, and unlikely
ever to operate. It can be changed when PCRE is built, and it can also
be set when pcre_exec() is called. For details of these interfaces, see
the pcrebuild and pcreapi documentation.

As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8MB, you
should set the limit at 16000 recursions. A 64MB stack, on the other
hand, can support around 128000 recursions. The pcretest test program
has a command line option (-S) that can be used to increase its stack.

Last updated: 29 June 2006
Copyright (c) 1997-2006 University of Cambridge.
------------------------------------------------------------------------------