| 75 |
.B const char *\fIname\fP); |
.B const char *\fIname\fP); |
| 76 |
.PP |
.PP |
| 77 |
.br |
.br |
| 78 |
|
.B int pcre_get_stringtable_entries(const pcre *\fIcode\fP, |
| 79 |
|
.ti +5n |
| 80 |
|
.B const char *\fIname\fP, char **\fIfirst\fP, char **\fIlast\fP); |
| 81 |
|
.PP |
| 82 |
|
.br |
| 83 |
.B int pcre_get_substring(const char *\fIsubject\fP, int *\fIovector\fP, |
.B int pcre_get_substring(const char *\fIsubject\fP, int *\fIovector\fP, |
| 84 |
.ti +5n |
.ti +5n |
| 85 |
.B int \fIstringcount\fP, int \fIstringnumber\fP, |
.B int \fIstringcount\fP, int \fIstringnumber\fP, |
| 169 |
.P |
.P |
| 170 |
A second matching function, \fBpcre_dfa_exec()\fP, which is not |
A second matching function, \fBpcre_dfa_exec()\fP, which is not |
| 171 |
Perl-compatible, is also provided. This uses a different algorithm for the |
Perl-compatible, is also provided. This uses a different algorithm for the |
| 172 |
matching. This allows it to find all possible matches (at a given point in the |
matching. The alternative algorithm finds all possible matches (at a given |
| 173 |
subject), not just one. However, this algorithm does not return captured |
point in the subject). However, this algorithm does not return captured |
| 174 |
substrings. A description of the two matching algorithms and their advantages |
substrings. A description of the two matching algorithms and their advantages |
| 175 |
and disadvantages is given in the |
and disadvantages is given in the |
| 176 |
.\" HREF |
.\" HREF |
| 188 |
\fBpcre_get_named_substring()\fP |
\fBpcre_get_named_substring()\fP |
| 189 |
\fBpcre_get_substring_list()\fP |
\fBpcre_get_substring_list()\fP |
| 190 |
\fBpcre_get_stringnumber()\fP |
\fBpcre_get_stringnumber()\fP |
| 191 |
|
\fBpcre_get_stringtable_entries()\fP |
| 192 |
.sp |
.sp |
| 193 |
\fBpcre_free_substring()\fP and \fBpcre_free_substring_list()\fP are also |
\fBpcre_free_substring()\fP and \fBpcre_free_substring_list()\fP are also |
| 194 |
provided, to free the memory used for extracted strings. |
provided, to free the memory used for extracted strings. |
| 218 |
The global variables \fBpcre_stack_malloc\fP and \fBpcre_stack_free\fP are also |
The global variables \fBpcre_stack_malloc\fP and \fBpcre_stack_free\fP are also |
| 219 |
indirections to memory management functions. These special functions are used |
indirections to memory management functions. These special functions are used |
| 220 |
only when PCRE is compiled to use the heap for remembering data, instead of |
only when PCRE is compiled to use the heap for remembering data, instead of |
| 221 |
recursive function calls, when running the \fBpcre_exec()\fP function. This is |
recursive function calls, when running the \fBpcre_exec()\fP function. See the |
| 222 |
a non-standard way of building PCRE, for use in environments that have limited |
.\" HREF |
| 223 |
stacks. Because of the greater use of memory management, it runs more slowly. |
\fBpcrebuild\fP |
| 224 |
Separate functions are provided so that special-purpose external code can be |
.\" |
| 225 |
used for this case. When used, these functions are always called in a |
documentation for details of how to do this. It is a non-standard way of |
| 226 |
stack-like manner (last obtained, first freed), and always for memory blocks of |
building PCRE, for use in environments that have limited stacks. Because of the |
| 227 |
the same size. |
greater use of memory management, it runs more slowly. Separate functions are |
| 228 |
|
provided so that special-purpose external code can be used for this case. When |
| 229 |
|
used, these functions are always called in a stack-like manner (last obtained, |
| 230 |
|
first freed), and always for memory blocks of the same size. There is a |
| 231 |
|
discussion about PCRE's stack usage in the |
| 232 |
|
.\" HREF |
| 233 |
|
\fBpcrestack\fP |
| 234 |
|
.\" |
| 235 |
|
documentation. |
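The guarantees stated above (blocks are released in last-obtained, first-freed order, and every block has the same size) allow a very simple replacement allocator. The following is a minimal sketch under those assumptions, not part of PCRE: the names my_stack_malloc, my_stack_free, frame_block, free_list and block_size are hypothetical, and the code keeps static state, so it is not safe for use from several threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache of released blocks; all names are illustrative only. */
typedef struct frame_block {
    struct frame_block *next;
} frame_block;

static frame_block *free_list = NULL;  /* most recently released block first */
static size_t block_size = 0;          /* size recorded on the first request */

/* Candidate for pcre_stack_malloc: reuse a cached block when one exists. */
void *my_stack_malloc(size_t size)
{
    if (block_size == 0)
        block_size = size < sizeof(frame_block) ? sizeof(frame_block) : size;
    /* Relies on the documented guarantee that all requests have one size. */
    assert(size <= block_size);
    if (free_list != NULL) {
        frame_block *b = free_list;
        free_list = b->next;
        return b;
    }
    return malloc(block_size);
}

/* Candidate for pcre_stack_free: keep the block for the next request
   instead of returning it to the system. */
void my_stack_free(void *block)
{
    frame_block *b = block;
    if (b == NULL)
        return;
    b->next = free_list;
    free_list = b;
}
```

In a build of PCRE that uses the heap instead of recursion, such functions would be installed by assigning them to \fBpcre_stack_malloc\fP and \fBpcre_stack_free\fP before calling \fBpcre_exec()\fP. The cached blocks are never handed back to the system in this sketch.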
| 236 |
.P |
.P |
| 237 |
The global variable \fBpcre_callout\fP initially contains NULL. It can be set |
The global variable \fBpcre_callout\fP initially contains NULL. It can be set |
| 238 |
by the caller to a "callout" function, which PCRE will then call at specified |
by the caller to a "callout" function, which PCRE will then call at specified |
| 243 |
documentation. |
documentation. |
| 244 |
. |
. |
| 245 |
. |
. |
| 246 |
|
.SH NEWLINES |
| 247 |
|
PCRE supports three different conventions for indicating line breaks in |
| 248 |
|
strings: a single CR character, a single LF character, or the two-character |
| 249 |
|
sequence CRLF. All three are used as "standard" by different operating systems. |
| 250 |
|
When PCRE is built, a default can be specified. The built-in default is LF, |
| 251 |
|
which is the Unix standard. When PCRE is run, the default can be overridden, |
| 252 |
|
either when a pattern is compiled, or when it is matched. |
| 253 |
|
.sp |
| 254 |
|
In the PCRE documentation the word "newline" is used to mean "the character or |
| 255 |
|
pair of characters that indicate a line break". |
| 256 |
|
. |
| 257 |
|
. |
| 258 |
.SH MULTITHREADING |
.SH MULTITHREADING |
| 259 |
.rs |
.rs |
| 260 |
.sp |
.sp |
| 307 |
.sp |
.sp |
| 308 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
| 309 |
.sp |
.sp |
| 310 |
The output is an integer that is set to the value of the code that is used for |
The output is an integer whose value specifies the default character sequence |
| 311 |
the newline character. It is either linefeed (10) or carriage return (13), and |
that is recognized as meaning "newline". The three values that are supported |
| 312 |
should normally be the standard character for your operating system. |
are: 10 for LF, 13 for CR, and 3338 for CRLF. The default should normally be |
| 313 |
|
the standard sequence for your operating system. |
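The three values listed above can be turned into the actual character sequence with a small helper. This is an illustrative sketch only: newline_sequence is a hypothetical name, and the remark that 3338 equals 13*256 + 10 is plain arithmetic on the documented values, not a statement about how PCRE stores the setting internally.

```c
/* Translate a PCRE_CONFIG_NEWLINE value into the newline sequence.
   Writes the characters into seq and returns their count (1 or 2),
   or -1 if the value is not one of the three documented ones. */
int newline_sequence(int value, char seq[2])
{
    if (value == 10 || value == 13) {   /* LF or CR on its own */
        seq[0] = (char)value;
        return 1;
    }
    if (value == 3338) {                /* 3338 == 13*256 + 10: CR, then LF */
        seq[0] = 13;
        seq[1] = 10;
        return 2;
    }
    return -1;
}
```

The value itself would come from a call such as pcre_config(PCRE_CONFIG_NEWLINE, &value), with value declared as an \fBint\fP.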
| 314 |
.sp |
.sp |
| 315 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
| 316 |
.sp |
.sp |
| 380 |
via \fBpcre_malloc\fP is returned. This contains the compiled code and related |
via \fBpcre_malloc\fP is returned. This contains the compiled code and related |
| 381 |
data. The \fBpcre\fP type is defined for the returned block; this is a typedef |
data. The \fBpcre\fP type is defined for the returned block; this is a typedef |
| 382 |
for a structure whose contents are not externally defined. It is up to the |
for a structure whose contents are not externally defined. It is up to the |
| 383 |
caller to free the memory when it is no longer required. |
caller to free the memory (via \fBpcre_free\fP) when it is no longer required. |
| 384 |
.P |
.P |
| 385 |
Although the compiled code of a PCRE regex is relocatable, that is, it does not |
Although the compiled code of a PCRE regex is relocatable, that is, it does not |
| 386 |
depend on memory location, the complete \fBpcre\fP data block is not |
depend on memory location, the complete \fBpcre\fP data block is not |
| 397 |
.\" |
.\" |
| 398 |
documentation). For these options, the contents of the \fIoptions\fP argument |
documentation). For these options, the contents of the \fIoptions\fP argument |
| 399 |
specifies their initial settings at the start of compilation and execution. The |
specifies their initial settings at the start of compilation and execution. The |
| 400 |
PCRE_ANCHORED option can be set at the time of matching as well as at compile |
PCRE_ANCHORED and PCRE_NEWLINE_\fIxxx\fP options can be set at the time of |
| 401 |
time. |
matching as well as at compile time. |
| 402 |
.P |
.P |
| 403 |
If \fIerrptr\fP is NULL, \fBpcre_compile()\fP returns NULL immediately. |
If \fIerrptr\fP is NULL, \fBpcre_compile()\fP returns NULL immediately. |
| 404 |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns |
| 469 |
.sp |
.sp |
| 470 |
If this bit is set, a dollar metacharacter in the pattern matches only at the |
If this bit is set, a dollar metacharacter in the pattern matches only at the |
| 471 |
end of the subject string. Without this option, a dollar also matches |
end of the subject string. Without this option, a dollar also matches |
| 472 |
immediately before the final character if it is a newline (but not before any |
immediately before a newline at the end of the string (but not before any other |
| 473 |
other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is |
newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. |
| 474 |
set. There is no equivalent to this option in Perl, and no way to set it within |
There is no equivalent to this option in Perl, and no way to set it within a |
| 475 |
a pattern. |
pattern. |
| 476 |
.sp |
.sp |
| 477 |
PCRE_DOTALL |
PCRE_DOTALL |
| 478 |
.sp |
.sp |
| 479 |
If this bit is set, a dot metacharacter in the pattern matches all characters, |
If this bit is set, a dot metacharacter in the pattern matches all characters, |
| 480 |
including newlines. Without it, newlines are excluded. This option is |
including those that indicate newline. Without it, a dot does not match when |
| 481 |
equivalent to Perl's /s option, and it can be changed within a pattern by a |
the current position is at a newline. This option is equivalent to Perl's /s |
| 482 |
(?s) option setting. A negative class such as [^a] always matches a newline |
option, and it can be changed within a pattern by a (?s) option setting. A |
| 483 |
character, independent of the setting of this option. |
negative class such as [^a] always matches newlines, independent of the setting |
| 484 |
|
of this option. |
| 485 |
|
.sp |
| 486 |
|
PCRE_DUPNAMES |
| 487 |
|
.sp |
| 488 |
|
If this bit is set, names used to identify capturing subpatterns need not be |
| 489 |
|
unique. This can be helpful for certain types of pattern when it is known that |
| 490 |
|
only one instance of the named subpattern can ever be matched. There are more |
| 491 |
|
details of named subpatterns below; see also the |
| 492 |
|
.\" HREF |
| 493 |
|
\fBpcrepattern\fP |
| 494 |
|
.\" |
| 495 |
|
documentation. |
| 496 |
.sp |
.sp |
| 497 |
PCRE_EXTENDED |
PCRE_EXTENDED |
| 498 |
.sp |
.sp |
| 499 |
If this bit is set, whitespace data characters in the pattern are totally |
If this bit is set, whitespace data characters in the pattern are totally |
| 500 |
ignored except when escaped or inside a character class. Whitespace does not |
ignored except when escaped or inside a character class. Whitespace does not |
| 501 |
include the VT character (code 11). In addition, characters between an |
include the VT character (code 11). In addition, characters between an |
| 502 |
unescaped # outside a character class and the next newline character, |
unescaped # outside a character class and the next newline, inclusive, are also |
| 503 |
inclusive, are also ignored. This is equivalent to Perl's /x option, and it can |
ignored. This is equivalent to Perl's /x option, and it can be changed within a |
| 504 |
be changed within a pattern by a (?x) option setting. |
pattern by a (?x) option setting. |
| 505 |
.P |
.P |
| 506 |
This option makes it possible to include comments inside complicated patterns. |
This option makes it possible to include comments inside complicated patterns. |
| 507 |
Note, however, that this applies only to data characters. Whitespace characters |
Note, however, that this applies only to data characters. Whitespace characters |
| 515 |
set, any backslash in a pattern that is followed by a letter that has no |
set, any backslash in a pattern that is followed by a letter that has no |
| 516 |
special meaning causes an error, thus reserving these combinations for future |
special meaning causes an error, thus reserving these combinations for future |
| 517 |
expansion. By default, as in Perl, a backslash followed by a letter with no |
expansion. By default, as in Perl, a backslash followed by a letter with no |
| 518 |
special meaning is treated as a literal. There are at present no other features |
special meaning is treated as a literal. (Perl can, however, be persuaded to |
| 519 |
controlled by this option. It can also be set by a (?X) option setting within a |
give a warning for this.) There are at present no other features controlled by |
| 520 |
pattern. |
this option. It can also be set by a (?X) option setting within a pattern. |
| 521 |
.sp |
.sp |
| 522 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
| 523 |
.sp |
.sp |
| 524 |
If this option is set, an unanchored pattern is required to match before or at |
If this option is set, an unanchored pattern is required to match before or at |
| 525 |
the first newline character in the subject string, though the matched text may |
the first newline in the subject string, though the matched text may continue |
| 526 |
continue over the newline. |
over the newline. |
| 527 |
.sp |
.sp |
| 528 |
PCRE_MULTILINE |
PCRE_MULTILINE |
| 529 |
.sp |
.sp |
| 535 |
Perl. |
Perl. |
| 536 |
.P |
.P |
| 537 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs |
When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs |
| 538 |
match immediately following or immediately before any newline in the subject |
match immediately following or immediately before internal newlines in the |
| 539 |
string, respectively, as well as at the very start and end. This is equivalent |
subject string, respectively, as well as at the very start and end. This is |
| 540 |
to Perl's /m option, and it can be changed within a pattern by a (?m) option |
equivalent to Perl's /m option, and it can be changed within a pattern by a |
| 541 |
setting. If there are no "\en" characters in a subject string, or no |
(?m) option setting. If there are no newlines in a subject string, or no |
| 542 |
occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect. |
occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect. |
| 543 |
.sp |
.sp |
| 544 |
|
PCRE_NEWLINE_CR |
| 545 |
|
PCRE_NEWLINE_LF |
| 546 |
|
PCRE_NEWLINE_CRLF |
| 547 |
|
.sp |
| 548 |
|
These options override the default newline definition that was chosen when PCRE |
| 549 |
|
was built. Setting the first or the second specifies that a newline is |
| 550 |
|
indicated by a single character (CR or LF, respectively). Setting both of them |
| 551 |
|
specifies that a newline is indicated by the two-character CRLF sequence. For |
| 552 |
|
convenience, PCRE_NEWLINE_CRLF is defined to contain both bits. The only time |
| 553 |
|
that a line break is relevant when compiling a pattern is if PCRE_EXTENDED is |
| 554 |
|
set, and an unescaped # outside a character class is encountered. This |
| 555 |
|
indicates a comment that lasts until after the next newline. |
| 556 |
|
.P |
| 557 |
|
The newline option set at compile time becomes the default that is used for |
| 558 |
|
\fBpcre_exec()\fP and \fBpcre_dfa_exec()\fP, but it can be overridden. |
| 559 |
|
.sp |
| 560 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
| 561 |
.sp |
.sp |
| 562 |
If this option is set, it disables the use of numbered capturing parentheses in |
If this option is set, it disables the use of numbered capturing parentheses in |
| 634 |
23 internal error: code overflow |
23 internal error: code overflow |
| 635 |
24 unrecognized character after (?< |
24 unrecognized character after (?< |
| 636 |
25 lookbehind assertion is not fixed length |
25 lookbehind assertion is not fixed length |
| 637 |
26 malformed number after (?( |
26 malformed number or name after (?( |
| 638 |
27 conditional group contains more than two branches |
27 conditional group contains more than two branches |
| 639 |
28 assertion expected after (?( |
28 assertion expected after (?( |
| 640 |
29 (?R or (?digits must be followed by ) |
29 (?R or (?digits must be followed by ) |
| 651 |
40 recursive call could loop indefinitely |
40 recursive call could loop indefinitely |
| 652 |
41 unrecognized character after (?P |
41 unrecognized character after (?P |
| 653 |
42 syntax error after (?P |
42 syntax error after (?P |
| 654 |
43 two named groups have the same name |
43 two named subpatterns have the same name |
| 655 |
44 invalid UTF-8 string |
44 invalid UTF-8 string |
| 656 |
45 support for \eP, \ep, and \eX has not been compiled |
45 support for \eP, \ep, and \eX has not been compiled |
| 657 |
46 malformed \eP or \ep sequence |
46 malformed \eP or \ep sequence |
| 658 |
47 unknown property name after \eP or \ep |
47 unknown property name after \eP or \ep |
| 659 |
|
48 subpattern name is too long (maximum 32 characters) |
| 660 |
|
49 too many named subpatterns (maximum 10,000) |
| 661 |
|
50 repeated subpattern is too long |
| 662 |
|
51 octal value is greater than \e377 (not in UTF-8 mode) |
| 663 |
. |
. |
| 664 |
. |
. |
| 665 |
.SH "STUDYING A PATTERN" |
.SH "STUDYING A PATTERN" |
| 790 |
\fBpcre_fullinfo()\fP, to obtain the length of the compiled pattern: |
\fBpcre_fullinfo()\fP, to obtain the length of the compiled pattern: |
| 791 |
.sp |
.sp |
| 792 |
int rc; |
int rc; |
| 793 |
unsigned long int length; |
size_t length; |
| 794 |
rc = pcre_fullinfo( |
rc = pcre_fullinfo( |
| 795 |
re, /* result of pcre_compile() */ |
re, /* result of pcre_compile() */ |
| 796 |
pe, /* result of pcre_study(), or NULL */ |
pe, /* result of pcre_study(), or NULL */ |
| 822 |
PCRE_INFO_FIRSTBYTE |
PCRE_INFO_FIRSTBYTE |
| 823 |
.sp |
.sp |
| 824 |
Return information about the first byte of any matched string, for a |
Return information about the first byte of any matched string, for a |
| 825 |
non-anchored pattern. (This option used to be called PCRE_INFO_FIRSTCHAR; the |
non-anchored pattern. The fourth argument should point to an \fBint\fP |
| 826 |
old name is still recognized for backwards compatibility.) |
variable. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name is |
| 827 |
|
still recognized for backwards compatibility.) |
| 828 |
.P |
.P |
| 829 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
| 830 |
(cat|cow|coyote), it is returned in the integer pointed to by \fIwhere\fP. |
(cat|cow|coyote), its value is returned. Otherwise, if either
|
Otherwise, if either |
|
| 831 |
.sp |
.sp |
| 832 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
| 833 |
starts with "^", or |
starts with "^", or |
| 862 |
.sp |
.sp |
| 863 |
PCRE supports the use of named as well as numbered capturing parentheses. The |
PCRE supports the use of named as well as numbered capturing parentheses. The |
| 864 |
names are just an additional way of identifying the parentheses, which still |
names are just an additional way of identifying the parentheses, which still |
| 865 |
acquire numbers. A convenience function called \fBpcre_get_named_substring()\fP |
acquire numbers. Several convenience functions such as |
| 866 |
is provided for extracting an individual captured substring by name. It is also |
\fBpcre_get_named_substring()\fP are provided for extracting captured |
| 867 |
possible to extract the data directly, by first converting the name to a number |
substrings by name. It is also possible to extract the data directly, by first |
| 868 |
in order to access the correct pointers in the output vector (described with |
converting the name to a number in order to access the correct pointers in the |
| 869 |
\fBpcre_exec()\fP below). To do the conversion, you need to use the |
output vector (described with \fBpcre_exec()\fP below). To do the conversion, |
| 870 |
name-to-number map, which is described by these three values. |
you need to use the name-to-number map, which is described by these three |
| 871 |
|
values. |
| 872 |
.P |
.P |
| 873 |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives |
| 874 |
the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each |
the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each |
| 877 |
entry of the table (a pointer to \fBchar\fP). The first two bytes of each entry |
entry of the table (a pointer to \fBchar\fP). The first two bytes of each entry |
| 878 |
are the number of the capturing parenthesis, most significant byte first. The |
are the number of the capturing parenthesis, most significant byte first. The |
| 879 |
rest of the entry is the corresponding name, zero terminated. The names are in |
rest of the entry is the corresponding name, zero terminated. The names are in |
| 880 |
alphabetical order. For example, consider the following pattern (assume |
alphabetical order. When PCRE_DUPNAMES is set, duplicate names are in order of |
| 881 |
|
their parentheses numbers. For example, consider the following pattern (assume |
| 882 |
PCRE_EXTENDED is set, so white space - including newlines - is ignored): |
PCRE_EXTENDED is set, so white space - including newlines - is ignored): |
| 883 |
.sp |
.sp |
| 884 |
.\" JOIN |
.\" JOIN |
| 895 |
00 02 y e a r 00 ?? |
00 02 y e a r 00 ?? |
| 896 |
.sp |
.sp |
| 897 |
When writing code to extract data from named subpatterns using the |
When writing code to extract data from named subpatterns using the |
| 898 |
name-to-number map, remember that the length of each entry is likely to be |
name-to-number map, remember that the length of the entries is likely to be |
| 899 |
different for each compiled pattern. |
different for each compiled pattern. |
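The entry layout described above (two bytes holding the subpattern number, most significant byte first, followed by the zero-terminated name) can be decoded as follows. This is a sketch under stated assumptions: lookup_name and sample_table are hypothetical names, and the sample table is written by hand with an entry size of 8 for illustration. Only the "year" entry is taken from the listing above; the other entries and the zero padding bytes are assumptions, since the real padding is unspecified.

```c
#include <stddef.h>
#include <string.h>

/* Look up a name in a name-to-number table laid out as described above.
   Returns the subpattern number, or -1 if the name is not present. */
int lookup_name(const unsigned char *table, int count, int entry_size,
                const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* The name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* most significant byte first */
    }
    return -1;
}

/* Hand-written illustrative table: 4 entries of 8 bytes each. */
static const unsigned char sample_table[] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

In real use the table pointer, the entry count and the entry size would be obtained from \fBpcre_fullinfo()\fP rather than hard-coded, because the entry size depends on the longest name in the pattern.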
| 900 |
.sp |
.sp |
| 901 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
| 1118 |
.rs |
.rs |
| 1119 |
.sp |
.sp |
| 1120 |
The unused bits of the \fIoptions\fP argument for \fBpcre_exec()\fP must be |
The unused bits of the \fIoptions\fP argument for \fBpcre_exec()\fP must be |
| 1121 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP, |
| 1122 |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
| 1123 |
.sp |
.sp |
| 1124 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1125 |
.sp |
.sp |
| 1128 |
to be anchored by virtue of its contents, it cannot be made unanchored at |
to be anchored by virtue of its contents, it cannot be made unanchored at |
| 1129 |
matching time. |
matching time. |
| 1130 |
.sp |
.sp |
| 1131 |
|
PCRE_NEWLINE_CR |
| 1132 |
|
PCRE_NEWLINE_LF |
| 1133 |
|
PCRE_NEWLINE_CRLF |
| 1134 |
|
.sp |
| 1135 |
|
These options override the newline definition that was chosen or defaulted when |
| 1136 |
|
the pattern was compiled. For details, see the description of \fBpcre_compile()\fP |
| 1137 |
|
above. During matching, the newline choice affects the behaviour of the dot, |
| 1138 |
|
circumflex, and dollar metacharacters. |
| 1139 |
|
.sp |
| 1140 |
PCRE_NOTBOL |
PCRE_NOTBOL |
| 1141 |
.sp |
.sp |
| 1142 |
This option specifies that the first character of the subject string is not the |
This option specifies that the first character of the subject string is not the |
| 1268 |
first pair, \fIovector[0]\fP and \fIovector[1]\fP, identify the portion of the |
first pair, \fIovector[0]\fP and \fIovector[1]\fP, identify the portion of the |
| 1269 |
subject string matched by the entire pattern. The next pair is used for the |
subject string matched by the entire pattern. The next pair is used for the |
| 1270 |
first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fP |
first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fP |
| 1271 |
is the number of pairs that have been set. If there are no capturing |
is one more than the highest numbered pair that has been set. For example, if |
| 1272 |
subpatterns, the return value from a successful match is 1, indicating that |
two substrings have been captured, the returned value is 3. If there are no |
| 1273 |
just the first pair of offsets has been set. |
capturing subpatterns, the return value from a successful match is 1, |
| 1274 |
.P |
indicating that just the first pair of offsets has been set. |
|
Some convenience functions are provided for extracting the captured substrings |
|
|
as separate strings. These are described in the following section. |
|
|
.P |
|
|
It is possible for a capturing subpattern number \fIn+1\fP to match some
|
|
part of the subject when subpattern \fIn\fP has not been used at all. For |
|
|
example, if the string "abc" is matched against the pattern (a|(z))(bc) |
|
|
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset |
|
|
values corresponding to the unused subpattern are set to -1. |
|
| 1275 |
.P |
.P |
| 1276 |
If a capturing subpattern is matched repeatedly, it is the last portion of the |
If a capturing subpattern is matched repeatedly, it is the last portion of the |
| 1277 |
string that it matched that is returned. |
string that it matched that is returned. |
| 1285 |
has to get additional memory for use during matching. Thus it is usually |
has to get additional memory for use during matching. Thus it is usually |
| 1286 |
advisable to supply an \fIovector\fP. |
advisable to supply an \fIovector\fP. |
| 1287 |
.P |
.P |
| 1288 |
Note that \fBpcre_info()\fP can be used to find out how many capturing |
The \fBpcre_info()\fP function can be used to find out how many capturing |
| 1289 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
| 1290 |
\fIovector\fP that will allow for \fIn\fP captured substrings, in addition to |
\fIovector\fP that will allow for \fIn\fP captured substrings, in addition to |
| 1291 |
the offsets of the substring matched by the whole pattern, is (\fIn\fP+1)*3. |
the offsets of the substring matched by the whole pattern, is (\fIn\fP+1)*3. |
| 1292 |
|
.P |
| 1293 |
|
It is possible for capturing subpattern number \fIn+1\fP to match some part of |
| 1294 |
|
the subject when subpattern \fIn\fP has not been used at all. For example, if |
| 1295 |
|
the string "abc" is matched against the pattern (a|(z))(bc) the return from the |
| 1296 |
|
function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this |
| 1297 |
|
happens, both values in the offset pairs corresponding to unused subpatterns |
| 1298 |
|
are set to -1. |
| 1299 |
|
.P |
| 1300 |
|
Offset values that correspond to unused subpatterns at the end of the |
| 1301 |
|
expression are also set to -1. For example, if the string "abc" is matched |
| 1302 |
|
against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The |
| 1303 |
|
return from the function is 2, because the highest used capturing subpattern |
| 1304 |
|
number is 1. However, you can refer to the offsets for the second and third |
| 1305 |
|
capturing subpatterns if you wish (assuming the vector is large enough, of |
| 1306 |
|
course). |
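The offset pairs described above can be read directly to pull out each captured substring, treating a pair of -1 values as "not set". The helper below is a minimal sketch, not one of the library's convenience functions: copy_capture and its return codes are hypothetical, and the caller is assumed to pass a group number whose pair fits inside the vector.

```c
#include <stddef.h>
#include <string.h>

/* Copy capture number n (0 is the whole match) into buf.
   Returns the substring length, -1 if the group is unset,
   or -2 if buf is too small to hold the text plus a terminating zero. */
int copy_capture(const char *subject, const int *ovector, int n,
                 char *buf, int buflen)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    int len;

    if (start < 0 || end < 0)
        return -1;                    /* unused subpattern: both offsets are -1 */
    len = end - start;
    if (len + 1 > buflen)
        return -2;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the example of "abc" matched against (a|(z))(bc), the first eight elements of the vector would be 0,3, 0,1, -1,-1, 1,3, so this helper yields "abc" for group 0, "a" for group 1, the unset result for group 2, and "bc" for group 3.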
| 1307 |
|
.P |
| 1308 |
|
Some convenience functions are provided for extracting the captured substrings |
| 1309 |
|
as separate strings. These are described below. |
| 1310 |
. |
. |
| 1311 |
.\" HTML <a name="errorlist"></a> |
.\" HTML <a name="errorlist"></a> |
| 1312 |
.SS "Return values from \fBpcre_exec()\fP" |
.SS "Error return values from \fBpcre_exec()\fP" |
| 1313 |
.rs |
.rs |
| 1314 |
.sp |
.sp |
| 1315 |
If \fBpcre_exec()\fP fails, it returns a negative number. The following are |
If \fBpcre_exec()\fP fails, it returns a negative number. The following are |
| 1440 |
\fBpcre_get_substring_list()\fP are provided for extracting captured substrings |
\fBpcre_get_substring_list()\fP are provided for extracting captured substrings |
| 1441 |
as new, separate, zero-terminated strings. These functions identify substrings |
as new, separate, zero-terminated strings. These functions identify substrings |
| 1442 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
| 1443 |
substrings. A substring that contains a binary zero is correctly extracted and |
substrings. |
| 1444 |
has a further zero added on the end, but the result is not, of course, |
.P |
| 1445 |
a C string. |
A substring that contains a binary zero is correctly extracted and has a |
| 1446 |
|
further zero added on the end, but the result is not, of course, a C string. |
| 1447 |
|
However, you can process such a string by referring to the length that is |
| 1448 |
|
returned by \fBpcre_copy_substring()\fP and \fBpcre_get_substring()\fP. |
| 1449 |
|
Unfortunately, the interface to \fBpcre_get_substring_list()\fP is not adequate |
| 1450 |
|
for handling strings containing binary zeros, because the end of the final |
| 1451 |
|
string is not independently indicated. |
| 1452 |
.P |
.P |
| 1453 |
The first three arguments are the same for all three of these functions: |
The first three arguments are the same for all three of these functions: |
| 1454 |
\fIsubject\fP is the subject string that has just been successfully matched, |
\fIsubject\fP is the subject string that has just been successfully matched, |
| 1503 |
\fBpcre_get_substring_list()\fP, respectively. They do nothing more than call |
\fBpcre_get_substring_list()\fP, respectively. They do nothing more than call |
| 1504 |
the function pointed to by \fBpcre_free\fP, which of course could be called |
the function pointed to by \fBpcre_free\fP, which of course could be called |
| 1505 |
directly from a C program. However, PCRE is used in some situations where it is |
directly from a C program. However, PCRE is used in some situations where it is |
| 1506 |
linked via a special interface to another programming language which cannot use |
linked via a special interface to another programming language that cannot use |
| 1507 |
\fBpcre_free\fP directly; it is for these cases that the functions are |
\fBpcre_free\fP directly; it is for these cases that the functions are |
| 1508 |
provided. |
provided. |
| 1509 |
. |
. |
| 1538 |
.sp |
.sp |
| 1539 |
(a+)b(?P<xxx>\ed+)... |
(a+)b(?P<xxx>\ed+)... |
| 1540 |
.sp |
.sp |
| 1541 |
the number of the subpattern called "xxx" is 2. You can find the number from |
the number of the subpattern called "xxx" is 2. If the name is known to be |
| 1542 |
the name by calling \fBpcre_get_stringnumber()\fP. The first argument is the |
unique (PCRE_DUPNAMES was not set), you can find the number from the name by |
| 1543 |
compiled pattern, and the second is the name. The yield of the function is the |
calling \fBpcre_get_stringnumber()\fP. The first argument is the compiled |
| 1544 |
|
pattern, and the second is the name. The yield of the function is the |
| 1545 |
subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no subpattern of |
subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no subpattern of |
| 1546 |
that name. |
that name. |
| 1547 |
.P |
.P |
| 1549 |
functions described in the previous section. For convenience, there are also |
functions described in the previous section. For convenience, there are also |
| 1550 |
two functions that do the whole job. |
two functions that do the whole job. |
| 1551 |
.P |
.P |
| 1552 |
Most of the arguments of \fIpcre_copy_named_substring()\fP and |
Most of the arguments of \fBpcre_copy_named_substring()\fP and |
| 1553 |
\fIpcre_get_named_substring()\fP are the same as those for the similarly named |
\fBpcre_get_named_substring()\fP are the same as those for the similarly named |
| 1554 |
functions that extract by number. As these are described in the previous |
functions that extract by number. As these are described in the previous |
| 1555 |
section, they are not re-described here. There are just two differences: |
section, they are not re-described here. There are just two differences: |
| 1556 |
.P |
.P |
| 1564 |
appropriate. |
appropriate. |
| 1565 |
. |
. |
| 1566 |
. |
. |
| 1567 |
|
.SH "DUPLICATE SUBPATTERN NAMES" |
| 1568 |
|
.rs |
| 1569 |
|
.sp |
| 1570 |
|
.B int pcre_get_stringtable_entries(const pcre *\fIcode\fP, |
| 1571 |
|
.ti +5n |
| 1572 |
|
.B const char *\fIname\fP, char **\fIfirst\fP, char **\fIlast\fP); |
| 1573 |
|
.PP |
| 1574 |
|
When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns |
| 1575 |
|
are not required to be unique. Normally, patterns with duplicate names are such |
| 1576 |
|
that in any one match, only one of the named subpatterns participates. An |
| 1577 |
|
example is shown in the |
| 1578 |
|
.\" HREF |
| 1579 |
|
\fBpcrepattern\fP |
| 1580 |
|
.\" |
| 1581 |
|
documentation. When duplicates are present, \fBpcre_copy_named_substring()\fP |
| 1582 |
|
and \fBpcre_get_named_substring()\fP return the first substring corresponding |
| 1583 |
|
to the given name that is set. If none are set, an empty string is returned. |
| 1584 |
|
The \fBpcre_get_stringnumber()\fP function returns one of the numbers that are |
| 1585 |
|
associated with the name, but which one it returns is not defined. |
| 1586 |
|
.sp |
| 1587 |
|
If you want to get full details of all captured substrings for a given name, |
| 1588 |
|
you must use the \fBpcre_get_stringtable_entries()\fP function. The first |
| 1589 |
|
argument is the compiled pattern, and the second is the name. The third and |
| 1590 |
|
fourth are pointers to variables that are updated by the function. After it |
| 1591 |
|
has run, they point to the first and last entries in the name-to-number table |
| 1592 |
|
for the given name. The function itself returns the length of each entry, or |
| 1593 |
|
PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is described |
| 1594 |
|
above in the section entitled \fIInformation about a pattern\fP. Given all the |
| 1595 |
|
relevant entries for the name, you can extract each of their numbers, and hence |
| 1596 |
|
the captured data, if any. |
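As a rough illustration of that last step, the sketch below walks the entries between the two pointers and collects the subpattern numbers. It assumes the entry layout given in the section on information about a pattern: two bytes holding the subpattern number, most significant byte first, followed by the zero-terminated name. The helper name collect_group_numbers() and its parameters are hypothetical and are not part of the PCRE API. It takes the pointers and the entry length as plain arguments, so it does not call the library itself.

```c
/* Hypothetical helper, not part of PCRE. Walks the name-table entries from
   first to last (inclusive) and stores each entry's subpattern number in
   out[]. Assumed entry layout: two bytes of subpattern number, most
   significant byte first, then the zero-terminated name.

   With a compiled pattern, the inputs would normally come from a call such as:
       char *first, *last;
       int entrysize = pcre_get_stringtable_entries(re, "xxx", &first, &last);

   Returns the number of entries found, or -1 if entrysize is not positive
   (for example, a negative error code was passed through), if last precedes
   first, or if out[] is too small. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entrysize, int *out, int maxout)
{
  int i, count;

  if (entrysize <= 0 || last < first) return -1;

  /* Entries are contiguous and all have the same length. */
  count = (int)((last - first) / entrysize) + 1;
  if (count > maxout) return -1;

  for (i = 0; i < count; i++)
    {
    const unsigned char *p = (const unsigned char *)(first + i * entrysize);
    out[i] = (p[0] << 8) | p[1];   /* most significant byte first */
    }
  return count;
}
```

Each number obtained this way can then be passed to the extract-by-number functions, such as pcre_get_substring(), to fetch the captured data for whichever of the duplicates took part in the match.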
| 1597 |
|
. |
| 1598 |
|
. |
| 1599 |
.SH "FINDING ALL POSSIBLE MATCHES" |
.SH "FINDING ALL POSSIBLE MATCHES" |
| 1600 |
.rs |
.rs |
| 1601 |
.sp |
.sp |
| 1650 |
The two additional arguments provide workspace for the function. The workspace |
The two additional arguments provide workspace for the function. The workspace |
| 1651 |
vector should contain at least 20 elements. It is used for keeping track of |
vector should contain at least 20 elements. It is used for keeping track of |
| 1652 |
multiple paths through the pattern tree. More workspace will be needed for |
multiple paths through the pattern tree. More workspace will be needed for |
| 1653 |
patterns and subjects where there are a lot of possible matches. |
patterns and subjects where there are a lot of potential matches. |
| 1654 |
.P |
.P |
| 1655 |
Here is an example of a simple call to \fBpcre_dfa_exec()\fP: |
Here is an example of a simple call to \fBpcre_dfa_exec()\fP: |
| 1656 |
.sp |
.sp |
| 1673 |
.rs |
.rs |
| 1674 |
.sp |
.sp |
| 1675 |
The unused bits of the \fIoptions\fP argument for \fBpcre_dfa_exec()\fP must be |
The unused bits of the \fIoptions\fP argument for \fBpcre_dfa_exec()\fP must be |
| 1676 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP, |
| 1677 |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, |
| 1678 |
PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of these are |
PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of these are |
| 1679 |
the same as for \fBpcre_exec()\fP, so their description is not repeated here. |
the same as for \fBpcre_exec()\fP, so their description is not repeated here. |
| 1680 |
.sp |
.sp |
| 1784 |
extremely rare, as a vector of size 1000 is used. |
extremely rare, as a vector of size 1000 is used. |
| 1785 |
.P |
.P |
| 1786 |
.in 0 |
.in 0 |
| 1787 |
Last updated: 18 January 2006 |
Last updated: 08 June 2006 |
| 1788 |
.br |
.br |
| 1789 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |