| 92 |
use these to include support for different releases. |
use these to include support for different releases. |
| 93 |
|
|
| 94 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
| 95 |
are used for compiling and matching regular expressions. |
are used for compiling and matching regular expressions. A sample program that |
| 96 |
|
demonstrates the simplest way of using them is given in the file |
| 97 |
|
\fIpcredemo.c\fR. The last section of this man page describes how to run it. |
| 98 |
|
|
| 99 |
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
| 100 |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
| 131 |
The function \fBpcre_compile()\fR is called to compile a pattern into an |
The function \fBpcre_compile()\fR is called to compile a pattern into an |
| 132 |
internal form. The pattern is a C string terminated by a binary zero, and |
internal form. The pattern is a C string terminated by a binary zero, and |
| 133 |
is passed in the argument \fIpattern\fR. A pointer to a single block of memory |
is passed in the argument \fIpattern\fR. A pointer to a single block of memory |
| 134 |
that is obtained via \fBpcre_malloc\fR is returned. This contains the |
that is obtained via \fBpcre_malloc\fR is returned. This contains the compiled |
| 135 |
compiled code and related data. The \fBpcre\fR type is defined for this for |
code and related data. The \fBpcre\fR type is defined for the returned block; |
| 136 |
convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the |
this is a typedef for a structure whose contents are not externally defined. It |
| 137 |
contents of the block are not externally defined. It is up to the caller to |
is up to the caller to free the memory when it is no longer required. |
| 138 |
free the memory when it is no longer required. |
|
| 139 |
.PP |
Although the compiled code of a PCRE regex is relocatable, that is, it does not |
| 140 |
|
depend on memory location, the complete \fBpcre\fR data block is not |
| 141 |
|
fully relocatable, because it contains a copy of the \fItableptr\fR argument, |
| 142 |
|
which is an address (see below). |
| 143 |
|
|
| 144 |
The size of a compiled pattern is roughly proportional to the length of the |
The size of a compiled pattern is roughly proportional to the length of the |
| 145 |
pattern string, except that each character class (other than those containing |
pattern string, except that each character class (other than those containing |
| 146 |
just a single character, negated or not) requires 33 bytes, and repeat |
just a single character, negated or not) requires 33 bytes, and repeat |
| 147 |
quantifiers with a minimum greater than one or a bounded maximum cause the |
quantifiers with a minimum greater than one or a bounded maximum cause the |
| 148 |
relevant portions of the compiled pattern to be replicated. |
relevant portions of the compiled pattern to be replicated. |
| 149 |
.PP |
|
| 150 |
The \fIoptions\fR argument contains independent bits that affect the |
The \fIoptions\fR argument contains independent bits that affect the |
| 151 |
compilation. It should be zero if no options are required. Some of the options, |
compilation. It should be zero if no options are required. Some of the options, |
| 152 |
in particular, those that are compatible with Perl, can also be set and unset |
in particular, those that are compatible with Perl, can also be set and unset |
| 155 |
their initial settings at the start of compilation and execution. The |
their initial settings at the start of compilation and execution. The |
| 156 |
PCRE_ANCHORED option can be set at the time of matching as well as at compile |
PCRE_ANCHORED option can be set at the time of matching as well as at compile |
| 157 |
time. |
time. |
| 158 |
.PP |
|
| 159 |
If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. |
If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. |
| 160 |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns |
| 161 |
NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual |
NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual |
| 162 |
error message. The offset from the start of the pattern to the character where |
error message. The offset from the start of the pattern to the character where |
| 163 |
the error was discovered is placed in the variable pointed to by |
the error was discovered is placed in the variable pointed to by |
| 164 |
\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. |
\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. |
| 165 |
.PP |
|
| 166 |
If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of |
If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of |
| 167 |
character tables which are built when it is compiled, using the default C |
character tables which are built when it is compiled, using the default C |
| 168 |
locale. Otherwise, \fItableptr\fR must be the result of a call to |
locale. Otherwise, \fItableptr\fR must be the result of a call to |
| 169 |
\fBpcre_maketables()\fR. See the section on locale support below. |
\fBpcre_maketables()\fR. See the section on locale support below. |
| 170 |
.PP |
|
| 171 |
|
This code fragment shows a typical straightforward call to \fBpcre_compile()\fR: |
| 172 |
|
|
| 173 |
|
pcre *re; |
| 174 |
|
const char *error; |
| 175 |
|
int erroffset; |
| 176 |
|
re = pcre_compile( |
| 177 |
|
"^A.*Z", /* the pattern */ |
| 178 |
|
0, /* default options */ |
| 179 |
|
&error, /* for error message */ |
| 180 |
|
&erroffset, /* for error offset */ |
| 181 |
|
NULL); /* use default character tables */ |
| 182 |
|
|
| 183 |
The following option bits are defined in the header file: |
The following option bits are defined in the header file: |
| 184 |
|
|
| 185 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 266 |
When a pattern is going to be used several times, it is worth spending more |
When a pattern is going to be used several times, it is worth spending more |
| 267 |
time analyzing it in order to speed up the time taken for matching. The |
time analyzing it in order to speed up the time taken for matching. The |
| 268 |
function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first |
function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first |
| 269 |
argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR |
argument, and returns a pointer to a \fBpcre_extra\fR block (another typedef |
| 270 |
typedef) containing additional information about the pattern; this can be |
for a structure with hidden contents) containing additional information about |
| 271 |
passed to \fBpcre_exec()\fR. If no additional information is available, NULL |
the pattern; this can be passed to \fBpcre_exec()\fR. If no additional |
| 272 |
is returned. |
information is available, NULL is returned. |
| 273 |
|
|
| 274 |
The second argument contains option bits. At present, no options are defined |
The second argument contains option bits. At present, no options are defined |
| 275 |
for \fBpcre_study()\fR, and this argument should always be zero. |
for \fBpcre_study()\fR, and this argument should always be zero. |
| 278 |
studying succeeds (even if no data is returned), the variable it points to is |
studying succeeds (even if no data is returned), the variable it points to is |
| 279 |
set to NULL. Otherwise it points to a textual error message. |
set to NULL. Otherwise it points to a textual error message. |
| 280 |
|
|
| 281 |
|
This is a typical call to \fBpcre_study()\fR:
| 282 |
|
|
| 283 |
|
pcre_extra *pe; |
| 284 |
|
pe = pcre_study( |
| 285 |
|
re, /* result of pcre_compile() */ |
| 286 |
|
0, /* no options exist */ |
| 287 |
|
&error); /* set to NULL or points to a message */ |
| 288 |
|
|
| 289 |
At present, studying a pattern is useful only for non-anchored patterns that do |
At present, studying a pattern is useful only for non-anchored patterns that do |
| 290 |
not have a single fixed starting character. A bitmap of possible starting |
not have a single fixed starting character. A bitmap of possible starting |
| 291 |
characters is created. |
characters is created. |
| 335 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
| 336 |
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid |
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid |
| 337 |
|
|
| 338 |
|
Here is a typical call of \fBpcre_fullinfo()\fR, to obtain the length of the |
| 339 |
|
compiled pattern: |
| 340 |
|
|
| 341 |
|
int rc; |
| 342 |
|
unsigned long int length; |
| 343 |
|
rc = pcre_fullinfo( |
| 344 |
|
re, /* result of pcre_compile() */ |
| 345 |
|
pe, /* result of pcre_study(), or NULL */ |
| 346 |
|
PCRE_INFO_SIZE, /* what is required */ |
| 347 |
|
&length); /* where to put the data */ |
| 348 |
|
|
| 349 |
The possible values for the third argument are defined in \fBpcre.h\fR, and are |
The possible values for the third argument are defined in \fBpcre.h\fR, and are |
| 350 |
as follows: |
as follows: |
| 351 |
|
|
| 352 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
| 353 |
|
|
| 354 |
Return a copy of the options with which the pattern was compiled. The fourth |
Return a copy of the options with which the pattern was compiled. The fourth |
| 355 |
argument should point to au \fBunsigned long int\fR variable. These option bits |
argument should point to an \fBunsigned long int\fR variable. These option bits |
| 356 |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
| 357 |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
| 358 |
bit forcibly set if the form of the pattern implies that it can match only at |
bit forcibly set if the form of the pattern implies that it can match only at |
| 433 |
pattern has been studied, the result of the study should be passed in the |
pattern has been studied, the result of the study should be passed in the |
| 434 |
\fIextra\fR argument. Otherwise this must be NULL. |
\fIextra\fR argument. Otherwise this must be NULL. |
| 435 |
|
|
| 436 |
|
Here is an example of a simple call to \fBpcre_exec()\fR: |
| 437 |
|
|
| 438 |
|
int rc; |
| 439 |
|
int ovector[30]; |
| 440 |
|
rc = pcre_exec( |
| 441 |
|
re, /* result of pcre_compile() */ |
| 442 |
|
NULL, /* we didn't study the pattern */ |
| 443 |
|
"some string", /* the subject string */ |
| 444 |
|
11, /* the length of the subject string */ |
| 445 |
|
0, /* start at offset 0 in the subject */ |
| 446 |
|
0, /* default options */ |
| 447 |
|
ovector, /* vector for substring information */ |
| 448 |
|
30); /* number of elements in the vector */ |
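The call above passes a 30-element vector. The sample program at the end of this page defines its vector length as a multiple of 3 and reports room for OVECCOUNT/3 - 1 captured substrings, which suggests one group of three ints for each reported string (the whole match plus each capture). The following is a minimal sketch of that arithmetic only. The names \fBovector_ints_for()\fR and \fBovector_capacity()\fR are hypothetical helpers, not part of the PCRE API.

```c
#include <assert.h>

/* Hypothetical helper, not part of PCRE: number of ints to allocate so that
   the whole match plus `captures` captured substrings can be reported,
   assuming three ints per reported string as in the sample program. */
int ovector_ints_for(int captures)
{
  assert(captures >= 0);
  return (captures + 1) * 3;
}

/* Hypothetical helper: number of captured substrings that a vector of
   `ints` elements has room for. One of the ints/3 slots is taken by the
   whole match, so the answer is ints/3 - 1 (or 0 for a tiny vector). */
int ovector_capacity(int ints)
{
  int slots;
  assert(ints >= 0);
  slots = ints / 3;
  return (slots > 0) ? slots - 1 : 0;
}
```

Under these assumptions, ovector_ints_for(9) gives the 30 used in the example, and ovector_capacity(30) gives 9.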
| 449 |
|
|
| 450 |
The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose |
The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose |
| 451 |
unused bits must be zero. However, if a pattern was compiled with |
unused bits must be zero. However, if a pattern was compiled with |
| 452 |
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it |
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it |
| 488 |
|
|
| 489 |
The subject string is passed as a pointer in \fIsubject\fR, a length in |
The subject string is passed as a pointer in \fIsubject\fR, a length in |
| 490 |
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern |
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern |
| 491 |
string, it may contain binary zero characters. When the starting offset is |
string, the subject may contain binary zero characters. When the starting |
| 492 |
zero, the search for a match starts at the beginning of the subject, and this |
offset is zero, the search for a match starts at the beginning of the subject, |
| 493 |
is by far the most common case. |
and this is by far the most common case. |
| 494 |
|
|
| 495 |
A non-zero starting offset is useful when searching for another match in the |
A non-zero starting offset is useful when searching for another match in the |
| 496 |
same subject by calling \fBpcre_exec()\fR again after a previous success. |
same subject by calling \fBpcre_exec()\fR again after a previous success. |
| 677 |
practice be relevant. |
practice be relevant. |
| 678 |
The maximum length of a compiled pattern is 65539 (sic) bytes. |
The maximum length of a compiled pattern is 65539 (sic) bytes. |
| 679 |
All values in repeating quantifiers must be less than 65536. |
All values in repeating quantifiers must be less than 65536. |
| 680 |
The maximum number of capturing subpatterns is 99. |
The maximum number of capturing subpatterns is 65535.
| 681 |
The maximum number of all parenthesized subpatterns, including capturing |
There is no limit to the number of non-capturing subpatterns, but the maximum |
| 682 |
|
depth of nesting of all kinds of parenthesized subpattern, including capturing |
| 683 |
subpatterns, assertions, and other types of subpattern, is 200. |
subpatterns, assertions, and other types of subpattern, is 200. |
| 684 |
|
|
| 685 |
The maximum length of a subject string is the largest positive number that an |
The maximum length of a subject string is the largest positive number that an |
| 1001 |
|
|
| 1002 |
Note that the sequences \\A, \\Z, and \\z can be used to match the start and |
Note that the sequences \\A, \\Z, and \\z can be used to match the start and |
| 1003 |
end of the subject in both modes, and if all branches of a pattern start with |
end of the subject in both modes, and if all branches of a pattern start with |
| 1004 |
\\A is it always anchored, whether PCRE_MULTILINE is set or not. |
\\A it is always anchored, whether PCRE_MULTILINE is set or not. |
| 1005 |
|
|
| 1006 |
|
|
| 1007 |
.SH FULL STOP (PERIOD, DOT) |
.SH FULL STOP (PERIOD, DOT) |
| 1105 |
|
|
| 1106 |
[12[:^digit:]] |
[12[:^digit:]] |
| 1107 |
|
|
| 1108 |
matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX |
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX |
| 1109 |
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
| 1110 |
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
| 1111 |
|
|
| 1203 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
| 1204 |
|
|
| 1205 |
the captured substrings are "red king", "red", and "king", and are numbered 1, |
the captured substrings are "red king", "red", and "king", and are numbered 1, |
| 1206 |
2, and 3. |
2, and 3, respectively. |
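The numbering in this example follows the opening parentheses from left to right. As an illustration, the sketch below counts capturing subpatterns in a pattern string the same way. The function \fBcount_capturing()\fR is a hypothetical helper, not a PCRE function, and it rests on simplifying assumptions that are listed in its comment; it is not a substitute for compiling the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: count capturing subpatterns by
   counting opening parentheses from left to right.
   Simplifying assumptions: a backslash escapes exactly one character; a
   character class ends at the first unescaped ']' that is not the first
   class character; any "(?" sequence is treated as non-capturing. */
int count_capturing(const char *pattern)
{
  int count = 0;
  size_t i = 0;

  assert(pattern != NULL);
  while (pattern[i] != '\0')
    {
    char c = pattern[i];
    if (c == '\\')
      {
      if (pattern[i + 1] == '\0') break;   /* trailing backslash */
      i += 2;                              /* skip the escaped character */
      }
    else if (c == '[')
      {
      i++;
      if (pattern[i] == '^') i++;          /* negated class */
      if (pattern[i] == ']') i++;          /* a leading ']' is a literal */
      while (pattern[i] != '\0' && pattern[i] != ']')
        {
        if (pattern[i] == '\\' && pattern[i + 1] != '\0') i++;
        i++;
        }
      if (pattern[i] == ']') i++;          /* step past the closing ']' */
      }
    else
      {
      if (c == '(' && pattern[i + 1] != '?') count++;
      i++;
      }
    }
  return count;
}
```

For the pattern shown above, "the ((red|white) (king|queen))", this sketch counts three capturing subpatterns, matching the numbers 1, 2, and 3 in the text.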
| 1207 |
|
|
| 1208 |
The fact that plain parentheses fulfil two functions is not always helpful. |
The fact that plain parentheses fulfil two functions is not always helpful. |
| 1209 |
There are often times when a grouping subpattern is required without a |
There are often times when a grouping subpattern is required without a |
| 1844 |
|
|
| 1845 |
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
| 1846 |
|
|
| 1847 |
|
|
| 1848 |
|
.SH SAMPLE PROGRAM |
| 1849 |
|
The code below is a simple, complete demonstration program, to get you started |
| 1850 |
|
with using PCRE. This code is also supplied in the file \fIpcredemo.c\fR in the |
| 1851 |
|
PCRE distribution. |
| 1852 |
|
|
| 1853 |
|
The program compiles the regular expression that is its first argument, and |
| 1854 |
|
matches it against the subject string in its second argument. No options are |
| 1855 |
|
set, and default character tables are used. If matching succeeds, the program |
| 1856 |
|
outputs the portion of the subject that matched, together with the contents of |
| 1857 |
|
any captured substrings. |
| 1858 |
|
|
| 1859 |
|
On a Unix system that has PCRE installed in \fI/usr/local\fR, you can compile |
| 1860 |
|
the demonstration program using a command like this: |
| 1861 |
|
|
| 1862 |
|
gcc -o pcredemo pcredemo.c -I/usr/local/include -L/usr/local/lib -lpcre |
| 1863 |
|
|
| 1864 |
|
Then you can run simple tests like this: |
| 1865 |
|
|
| 1866 |
|
./pcredemo 'cat|dog' 'the cat sat on the mat' |
| 1867 |
|
|
| 1868 |
|
Note that there is a much more comprehensive test program, called |
| 1869 |
|
\fBpcretest\fR, which supports many more facilities for testing regular |
| 1870 |
|
expressions. The \fBpcredemo\fR program is provided as a simple coding example. |
| 1871 |
|
|
| 1872 |
|
On some operating systems (e.g. Solaris) you may get an error like this when |
| 1873 |
|
you try to run \fBpcredemo\fR: |
| 1874 |
|
|
| 1875 |
|
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory |
| 1876 |
|
|
| 1877 |
|
This is caused by the way shared library support works on those systems. You |
| 1878 |
|
need to add |
| 1879 |
|
|
| 1880 |
|
-R/usr/local/lib |
| 1881 |
|
|
| 1882 |
|
to the compile command to get round this problem. Here's the code: |
| 1883 |
|
|
| 1884 |
|
#include <stdio.h> |
| 1885 |
|
#include <string.h> |
| 1886 |
|
#include <pcre.h> |
| 1887 |
|
|
| 1888 |
|
#define OVECCOUNT 30 /* should be a multiple of 3 */ |
| 1889 |
|
|
| 1890 |
|
int main(int argc, char **argv) |
| 1891 |
|
{ |
| 1892 |
|
pcre *re; |
| 1893 |
|
const char *error; |
| 1894 |
|
int erroffset; |
| 1895 |
|
int ovector[OVECCOUNT]; |
| 1896 |
|
int rc, i; |
| 1897 |
|
|
| 1898 |
|
if (argc != 3) |
| 1899 |
|
{ |
| 1900 |
|
printf("Two arguments required: a regex and a " |
| 1901 |
|
"subject string\\n"); |
| 1902 |
|
return 1; |
| 1903 |
|
} |
| 1904 |
|
|
| 1905 |
|
/* Compile the regular expression in the first argument */ |
| 1906 |
|
|
| 1907 |
|
re = pcre_compile( |
| 1908 |
|
argv[1], /* the pattern */ |
| 1909 |
|
0, /* default options */ |
| 1910 |
|
&error, /* for error message */ |
| 1911 |
|
&erroffset, /* for error offset */ |
| 1912 |
|
NULL); /* use default character tables */ |
| 1913 |
|
|
| 1914 |
|
/* Compilation failed: print the error message and exit */ |
| 1915 |
|
|
| 1916 |
|
if (re == NULL) |
| 1917 |
|
{ |
| 1918 |
|
printf("PCRE compilation failed at offset %d: %s\\n", |
| 1919 |
|
erroffset, error); |
| 1920 |
|
return 1; |
| 1921 |
|
} |
| 1922 |
|
|
| 1923 |
|
/* Compilation succeeded: match the subject in the second |
| 1924 |
|
argument */ |
| 1925 |
|
|
| 1926 |
|
rc = pcre_exec( |
| 1927 |
|
re, /* the compiled pattern */ |
| 1928 |
|
NULL, /* we didn't study the pattern */ |
| 1929 |
|
argv[2], /* the subject string */ |
| 1930 |
|
(int)strlen(argv[2]), /* the length of the subject */ |
| 1931 |
|
0, /* start at offset 0 in the subject */ |
| 1932 |
|
0, /* default options */ |
| 1933 |
|
ovector, /* vector for substring information */ |
| 1934 |
|
OVECCOUNT); /* number of elements in the vector */ |
| 1935 |
|
|
| 1936 |
|
/* Matching failed: handle error cases */ |
| 1937 |
|
|
| 1938 |
|
if (rc < 0) |
| 1939 |
|
{ |
| 1940 |
|
switch(rc) |
| 1941 |
|
{ |
| 1942 |
|
case PCRE_ERROR_NOMATCH: printf("No match\\n"); break; |
| 1943 |
|
/* |
| 1944 |
|
Handle other special cases if you like |
| 1945 |
|
*/ |
| 1946 |
|
default: printf("Matching error %d\\n", rc); break; |
| 1947 |
|
} |
| 1948 |
|
return 1; |
| 1949 |
|
} |
| 1950 |
|
|
| 1951 |
|
/* Match succeeded */
| 1952 |
|
|
| 1953 |
|
printf("Match succeeded\\n"); |
| 1954 |
|
|
| 1955 |
|
/* The output vector wasn't big enough */ |
| 1956 |
|
|
| 1957 |
|
if (rc == 0) |
| 1958 |
|
{ |
| 1959 |
|
rc = OVECCOUNT/3; |
| 1960 |
|
printf("ovector only has room for %d captured " |
| 1961 |
|
"substrings\\n", rc - 1);
| 1962 |
|
} |
| 1963 |
|
|
| 1964 |
|
/* Show substrings stored in the output vector */ |
| 1965 |
|
|
| 1966 |
|
for (i = 0; i < rc; i++) |
| 1967 |
|
{ |
| 1968 |
|
char *substring_start = argv[2] + ovector[2*i]; |
| 1969 |
|
int substring_length = ovector[2*i+1] - ovector[2*i]; |
| 1970 |
|
printf("%2d: %.*s\\n", i, substring_length, |
| 1971 |
|
substring_start); |
| 1972 |
|
} |
| 1973 |
|
|
| 1974 |
|
return 0; |
| 1975 |
|
} |
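The final loop of the program reads each string from a pair of offsets: ovector[2*i] is where the string starts in the subject and ovector[2*i+1] is where it ends. The sketch below packages that step as a bounds-checked copy into a caller-supplied buffer. The function \fBcopy_from_ovector()\fR is a hypothetical helper written only to illustrate the offset layout; in real code the convenience function \fBpcre_copy_substring()\fR provided by the library is the more natural choice. Treating a negative offset as an unset pair is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: copy string number `n` (0 is the
   whole match) out of `subject` into `buf`, using the offset pairs left in
   `ovector` by a successful match. `count` is the number of pairs that were
   set. Returns the length copied, or -1 if `n` is out of range, the pair
   holds negative offsets, or `buf` is too small. */
int copy_from_ovector(const char *subject, const int *ovector, int count,
  int n, char *buf, int bufsize)
{
  int start, length;

  assert(subject != NULL && ovector != NULL && buf != NULL);
  if (n < 0 || n >= count) return -1;

  start = ovector[2*n];
  length = ovector[2*n + 1] - start;
  if (start < 0 || length < 0) return -1;  /* treated as unset here */
  if (length >= bufsize) return -1;        /* no room for the terminator */

  memcpy(buf, subject + start, (size_t)length);
  buf[length] = 0;                         /* terminate the copy */
  return length;
}
```

With the subject 'the cat sat on the mat' and a whole-match pair of 4 and 7, a call with n set to 0 copies "cat" and returns 3.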
| 1976 |
|
|
| 1977 |
|
|
| 1978 |
.SH AUTHOR |
.SH AUTHOR |
| 1979 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
| 1980 |
.br |
.br |
| 1986 |
.br |
.br |
| 1987 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
| 1988 |
|
|
| 1989 |
Last updated: 28 August 2000, |
Last updated: 15 August 2001 |
|
.br |
|
|
the 250th anniversary of the death of J.S. Bach. |
|
| 1990 |
.br |
.br |
| 1991 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2001 University of Cambridge. |