Contents of /code/trunk/doc/pcretest.1

Revision 599
Sat May 7 16:09:06 2011 UTC by ph10
File size: 31768 byte(s)
Fix typos in pcregrep and pcretest man pages.

1 nigel 53 .TH PCRETEST 1
2     .SH NAME
3     pcretest - a program for testing Perl-compatible regular expressions.
4     .SH SYNOPSIS
5 nigel 75 .rs
6     .sp
7 nigel 91 .B pcretest "[options] [source] [destination]"
8     .sp
9 nigel 75 \fBpcretest\fP was written as a test program for the PCRE regular expression
10 nigel 53 library itself, but it can also be used for experimenting with regular
11 nigel 63 expressions. This document describes the features of the test program; for
12     details of the regular expressions themselves, see the
13     .\" HREF
14 nigel 75 \fBpcrepattern\fP
15 nigel 63 .\"
16 nigel 75 documentation. For details of the PCRE library function calls and their
17     options, see the
18 nigel 63 .\" HREF
19 nigel 75 \fBpcreapi\fP
20 nigel 63 .\"
21     documentation.
22 nigel 75 .
23     .
24 nigel 53 .SH OPTIONS
25 nigel 63 .rs
26 nigel 53 .TP 10
27 nigel 93 \fB-b\fP
28 ph10 599 Behave as if each regex has the \fB/B\fP (show byte code) modifier; the
29     internal form is output after compilation.
30 nigel 93 .TP 10
31 nigel 75 \fB-C\fP
32 nigel 63 Output the version number of the PCRE library, and all available information
33     about the optional features that are included, and then exit.
34     .TP 10
35 nigel 75 \fB-d\fP
36 nigel 77 Behave as if each regex has the \fB/D\fP (debug) modifier; the internal
37 nigel 93 form and information about the compiled pattern are output after compilation;
38     \fB-d\fP is equivalent to \fB-b -i\fP.
39 nigel 53 .TP 10
40 nigel 77 \fB-dfa\fP
41     Behave as if each data line contains the \eD escape sequence; this causes the
42     alternative matching function, \fBpcre_dfa_exec()\fP, to be used instead of the
43     standard \fBpcre_exec()\fP function (more detail is given below).
44     .TP 10
45 nigel 93 \fB-help\fP
46     Output a brief summary of these options and then exit.
47     .TP 10
48 nigel 75 \fB-i\fP
49 nigel 77 Behave as if each regex has the \fB/I\fP modifier; information about the
50 nigel 53 compiled pattern is given after compilation.
51     .TP 10
52 ph10 386 \fB-M\fP
53     Behave as if each data line contains the \eM escape sequence; this causes
54 ph10 392 PCRE to discover the minimum MATCH_LIMIT and MATCH_LIMIT_RECURSION settings by
55 ph10 386 calling \fBpcre_exec()\fP repeatedly with different limits.
56     .TP 10
57 nigel 75 \fB-m\fP
58 nigel 53 Output the size of each compiled pattern after it has been compiled. This is
59 nigel 75 equivalent to adding \fB/M\fP to each regular expression. For compatibility
60     with earlier versions of pcretest, \fB-s\fP is a synonym for \fB-m\fP.
61 nigel 53 .TP 10
62 nigel 75 \fB-o\fP \fIosize\fP
63     Set the number of elements in the output vector that is used when calling
64 nigel 93 \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP to be \fIosize\fP. The default value
65     is 45, which is enough for 14 capturing subexpressions for \fBpcre_exec()\fP or
66     22 different matches for \fBpcre_dfa_exec()\fP. The vector size can be
67     changed for individual matching calls by including \eO in the data line (see
68     below).
69 nigel 53 .TP 10
70 nigel 75 \fB-p\fP
71 nigel 77 Behave as if each regex has the \fB/P\fP modifier; the POSIX wrapper API is
72     used to call PCRE. None of the other options has any effect when \fB-p\fP is
73     set.
74 nigel 53 .TP 10
75 nigel 91 \fB-q\fP
76 nigel 87 Do not output the version number of \fBpcretest\fP at the start of execution.
77     .TP 10
78 nigel 91 \fB-S\fP \fIsize\fP
79 ph10 599 On Unix-like systems, set the size of the run-time stack to \fIsize\fP
80 nigel 91 megabytes.
81     .TP 10
82 nigel 75 \fB-t\fP
83 nigel 63 Run each compile, study, and match many times with a timer, and output the
84 nigel 75 resulting time per compile or match (in milliseconds). Do not set \fB-m\fP with
85     \fB-t\fP, because the size is then output on every iteration, and the
86 nigel 93 timing will be distorted. You can control the number of iterations that are
87     used for timing by following \fB-t\fP with a number (as a separate item on the
88     command line). For example, "-t 1000" would iterate 1000 times. The default is
89     to iterate 500000 times.
90     .TP 10
91     \fB-tm\fP
92     This is like \fB-t\fP except that it times only the matching phase, not the
93     compile or study phases.
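.P
The default vector size of 45 given for \fB-o\fP can be checked with a little
arithmetic. The following Python sketch is not part of \fBpcretest\fP. It
assumes, as the pcreapi documentation describes, that pcre_exec() needs three
vector elements per returned substring (two offsets plus workspace) and that
pcre_dfa_exec() needs two elements per match. The function names are invented
for illustration.

```python
def exec_capture_capacity(osize):
    """Capturing subexpressions that fit in a pcre_exec() vector of osize ints.

    Assumption: three ints per substring, and substring 0 (the whole match)
    takes one of the available slots.
    """
    return osize // 3 - 1


def dfa_match_capacity(osize):
    """Matches that fit in a pcre_dfa_exec() vector of osize ints.

    Assumption: two ints (start and end offset) per match.
    """
    return osize // 2
```

With osize set to 45 these give 14 and 22, the figures quoted for the default.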
94 nigel 75 .
95     .
96 nigel 53 .SH DESCRIPTION
97 nigel 63 .rs
98     .sp
99 nigel 75 If \fBpcretest\fP is given two filename arguments, it reads from the first and
100 nigel 53 writes to the second. If it is given only one filename argument, it reads from
101     that file and writes to stdout. Otherwise, it reads from stdin and writes to
102     stdout, and prompts for each line of input, using "re>" to prompt for regular
103     expressions, and "data>" to prompt for data lines.
104 nigel 75 .P
105 ph10 289 When \fBpcretest\fP is built, a configuration option can specify that it should
106 ph10 287 be linked with the \fBlibreadline\fP library. When this is done, if the input
107     is from a terminal, it is read using the \fBreadline()\fP function. This
108     provides line-editing and history facilities. The output from the \fB-help\fP
109     option states whether or not \fBreadline()\fP will be used.
110     .P
111 nigel 53 The program handles any number of sets of input on a single input file. Each
112     set starts with a regular expression, and continues with any number of data
113 nigel 63 lines to be matched against the pattern.
114 nigel 75 .P
115     Each data line is matched separately and independently. If you want to do
116 nigel 91 multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
117 nigel 93 etc., depending on the newline setting) in a single line of input to encode the
118     newline sequences. There is no limit on the length of data lines; the input
119 nigel 91 buffer is automatically extended if it is too small.
120 nigel 75 .P
121 nigel 63 An empty line signals the end of the data lines, at which point a new regular
122     expression is read. The regular expressions are given enclosed in any
123 nigel 91 non-alphanumeric delimiters other than backslash, for example:
124 nigel 75 .sp
125 nigel 53 /(a|bc)x+yz/
126 nigel 75 .sp
127 nigel 53 White space before the initial delimiter is ignored. A regular expression may
128     be continued over several input lines, in which case the newline characters are
129     included within it. It is possible to include the delimiter within the pattern
130     by escaping it, for example
131 nigel 75 .sp
132     /abc\e/def/
133     .sp
134 nigel 53 If you do so, the escape and the delimiter form part of the pattern, but since
135 nigel 75 delimiters are always non-alphanumeric, this does not affect its interpretation.
136 nigel 53 If the terminating delimiter is immediately followed by a backslash, for
137     example,
138 nigel 75 .sp
139     /abc/\e
140     .sp
141 nigel 53 then a backslash is added to the end of the pattern. This is done to provide a
142     way of testing the error condition that arises if a pattern finishes with a
143     backslash, because
144 nigel 75 .sp
145     /abc\e/
146     .sp
147 nigel 53 is interpreted as the first line of a pattern that starts with "abc/", causing
148     pcretest to read the next line as a continuation of the regular expression.
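.P
The delimiter rules above can be sketched in Python. This is an illustration
only, not the code that \fBpcretest\fP uses; the function name and the use of
an exception to signal a continued pattern are assumptions.

```python
def split_pattern(line):
    """Split one input line into (pattern, modifiers) using the rules above.

    Raises ValueError when no terminating delimiter is found, which is the
    case where pcretest would read a continuation line.
    """
    text = line.lstrip()              # white space before the delimiter is ignored
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")
    i = 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2                    # escape and escaped character both stay in the pattern
            continue
        if text[i] == delim:
            break
        i += 1
    else:
        raise ValueError("no terminating delimiter on this line")
    pattern = text[1:i]
    rest = text[i + 1:]
    if rest.startswith("\\"):         # delimiter immediately followed by a backslash
        pattern += "\\"
        rest = rest[1:]
    return pattern, rest.strip()
```

A line consisting of /abc followed by a backslash and a slash raises
ValueError here, which corresponds to the continuation case described above.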
149 nigel 75 .
150     .
151     .SH "PATTERN MODIFIERS"
152 nigel 63 .rs
153     .sp
154 nigel 75 A pattern may be followed by any number of modifiers, which are mostly single
155     characters. Following Perl usage, these are referred to below as, for example,
156     "the \fB/i\fP modifier", even though the delimiter of the pattern need not
157 ph10 599 always be a slash, and no slash is used when writing modifiers. White space may
158 nigel 75 appear between the final pattern delimiter and the first modifier, and between
159     the modifiers themselves.
160     .P
161     The \fB/i\fP, \fB/m\fP, \fB/s\fP, and \fB/x\fP modifiers set the PCRE_CASELESS,
162     PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED options, respectively, when
163     \fBpcre_compile()\fP is called. These four modifier letters have the same
164     effect as they do in Perl. For example:
165     .sp
166 nigel 53 /caseless/i
167 nigel 75 .sp
168 ph10 518 The following table shows additional modifiers for setting PCRE compile-time
169     options that do not correspond to anything in Perl:
170 nigel 75 .sp
171 ph10 518 \fB/8\fP PCRE_UTF8
172 ph10 535 \fB/?\fP PCRE_NO_UTF8_CHECK
173 ph10 231 \fB/A\fP PCRE_ANCHORED
174     \fB/C\fP PCRE_AUTO_CALLOUT
175     \fB/E\fP PCRE_DOLLAR_ENDONLY
176     \fB/f\fP PCRE_FIRSTLINE
177     \fB/J\fP PCRE_DUPNAMES
178     \fB/N\fP PCRE_NO_AUTO_CAPTURE
179     \fB/U\fP PCRE_UNGREEDY
180 ph10 535 \fB/W\fP PCRE_UCP
181 ph10 231 \fB/X\fP PCRE_EXTRA
182 ph10 579 \fB/Y\fP PCRE_NO_START_OPTIMIZE
183 ph10 345 \fB/<JS>\fP PCRE_JAVASCRIPT_COMPAT
184 ph10 231 \fB/<cr>\fP PCRE_NEWLINE_CR
185     \fB/<lf>\fP PCRE_NEWLINE_LF
186     \fB/<crlf>\fP PCRE_NEWLINE_CRLF
187     \fB/<anycrlf>\fP PCRE_NEWLINE_ANYCRLF
188     \fB/<any>\fP PCRE_NEWLINE_ANY
189     \fB/<bsr_anycrlf>\fP PCRE_BSR_ANYCRLF
190     \fB/<bsr_unicode>\fP PCRE_BSR_UNICODE
191 nigel 75 .sp
192 ph10 518 The modifiers that are enclosed in angle brackets are literal strings as shown,
193     including the angle brackets, but the letters can be in either case. This
194     example sets multiline matching with CRLF as the line ending sequence:
195 nigel 93 .sp
196     /^abc/m<crlf>
197     .sp
198 ph10 518 As well as turning on the PCRE_UTF8 option, the \fB/8\fP modifier also causes
199     any non-printing characters in output strings to be printed using the
200     \ex{hh...} notation if they are valid UTF-8 sequences. Full details of the PCRE
201     options are given in the
202 nigel 91 .\" HREF
203     \fBpcreapi\fP
204     .\"
205 ph10 535 documentation.
206 nigel 91 .
207     .
208     .SS "Finding all matches in a string"
209     .rs
210     .sp
211 nigel 53 Searching for all possible matches within each subject string can be requested
212 nigel 75 by the \fB/g\fP or \fB/G\fP modifier. After finding a match, PCRE is called
213 nigel 53 again to search the remainder of the subject string. The difference between
214 nigel 75 \fB/g\fP and \fB/G\fP is that the former uses the \fIstartoffset\fP argument to
215     \fBpcre_exec()\fP to start searching at a new point within the entire string
216 nigel 53 (which is in effect what Perl does), whereas the latter passes over a shortened
217     substring. This makes a difference to the matching process if the pattern
218 nigel 75 begins with a lookbehind assertion (including \eb or \eB).
219     .P
220     If any call to \fBpcre_exec()\fP in a \fB/g\fP or \fB/G\fP sequence matches an
221 ph10 442 empty string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
222     PCRE_ANCHORED flags set in order to search for another, non-empty, match at the
223 ph10 566 same point. If this second match fails, the start offset is advanced, and the
224     normal match is retried. This imitates the way Perl handles such cases when
225 ph10 579 using the \fB/g\fP modifier or the \fBsplit()\fP function. Normally, the start
226     offset is advanced by one character, but if the newline convention recognizes
227     CRLF as a newline, and the current character is CR followed by LF, an advance
228 ph10 566 of two is used.
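.P
The empty-match handling for \fB/g\fP can be sketched as a loop. The Python
code below is an illustration under stated assumptions: find_all() takes a
callback that stands in for a call to pcre_exec(), and toy_a_star() is a
hand-written stand-in for the single pattern a*. Neither name exists in PCRE
or \fBpcretest\fP.

```python
def toy_a_star(subject, start, anchored=False, notempty_atstart=False):
    """Toy stand-in for pcre_exec() with the fixed pattern a*.

    Returns (match_start, match_end) or None.
    """
    pos = start
    while pos <= len(subject):
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        empty_at_start = (end == pos and pos == start)
        if not (notempty_atstart and empty_at_start):
            return pos, end
        if anchored:
            return None               # anchored: no later start points are tried
        pos += 1
    return None


def find_all(subject, matcher, crlf_is_newline=False):
    """Collect successive matches the way the /g description above outlines."""
    results = []
    start = 0
    retry_nonempty = False            # True right after an empty match
    while start <= len(subject):
        found = matcher(subject, start,
                        anchored=retry_nonempty,
                        notempty_atstart=retry_nonempty)
        if found is None:
            if not retry_nonempty:
                break                 # an ordinary match failed: finished
            # The anchored, non-empty retry failed: advance and match normally.
            at_crlf = crlf_is_newline and subject.startswith("\r\n", start)
            start += 2 if at_crlf else 1
            retry_nonempty = False
            continue
        results.append(found)
        start = found[1]
        retry_nonempty = found[0] == found[1]
    return results
```

For the subject "baac" the loop reports an empty match at offset 0, the
match "aa", and empty matches at offsets 3 and 4.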
229 nigel 91 .
230     .
231     .SS "Other modifiers"
232     .rs
233     .sp
234 nigel 75 There are yet more modifiers for controlling the way \fBpcretest\fP
235 nigel 53 operates.
236 nigel 75 .P
237     The \fB/+\fP modifier requests that as well as outputting the substring that
238 nigel 53 matched the entire pattern, pcretest should in addition output the remainder of
239     the subject string. This is useful for tests where the subject contains
240     multiple copies of the same substring.
241 nigel 75 .P
242 nigel 93 The \fB/B\fP modifier is a debugging feature. It requests that \fBpcretest\fP
243 ph10 123 output a representation of the compiled byte code after compilation. Normally
244 ph10 116 this information contains length and offset values; however, if \fB/Z\fP is
245     also present, this data is replaced by spaces. This is a special feature for
246     use in the automatic test scripts; it ensures that the same output is generated
247     for different internal link sizes.
248 nigel 93 .P
249     The \fB/D\fP modifier is a PCRE debugging feature, and is equivalent to
250 ph10 148 \fB/BI\fP, that is, both the \fB/B\fP and the \fB/I\fP modifiers.
251 nigel 75 .P
252     The \fB/F\fP modifier causes \fBpcretest\fP to flip the byte order of the
253     fields in the compiled pattern that contain 2-byte and 4-byte numbers. This
254     facility is for testing the feature in PCRE that allows it to execute patterns
255     that were compiled on a host with a different endianness. This feature is not
256     available when the POSIX interface to PCRE is being used, that is, when the
257     \fB/P\fP pattern modifier is specified. See also the section about saving and
258     reloading compiled patterns below.
259     .P
260 ph10 510 The \fB/I\fP modifier requests that \fBpcretest\fP output information about the
261     compiled pattern (whether it is anchored, has a fixed first character, and
262     so on). It does this by calling \fBpcre_fullinfo()\fP after compiling a
263     pattern. If the pattern is studied, the results of that are also output.
264 nigel 75 .P
265 ph10 510 The \fB/K\fP modifier requests \fBpcretest\fP to show names from backtracking
266     control verbs that are returned from calls to \fBpcre_exec()\fP. It causes
267     \fBpcretest\fP to create a \fBpcre_extra\fP block if one has not already been
268     created by a call to \fBpcre_study()\fP, and to set the PCRE_EXTRA_MARK flag
269     and the \fBmark\fP field within it, every time that \fBpcre_exec()\fP is
270     called. If the variable that the \fBmark\fP field points to is non-NULL for a
271 ph10 512 match, non-match, or partial match, \fBpcretest\fP prints the string to which
272     it points. For a match, this is shown on a line by itself, tagged with "MK:".
273 ph10 510 For a non-match it is added to the message.
274     .P
275     The \fB/L\fP modifier must be followed directly by the name of a locale, for
276     example,
277     .sp
278     /pattern/Lfr_FR
279     .sp
280     For this reason, it must be the last modifier. The given locale is set,
281     \fBpcre_maketables()\fP is called to build a set of character tables for the
282     locale, and this is then passed to \fBpcre_compile()\fP when compiling the
283 ph10 541 regular expression. Without an \fB/L\fP (or \fB/T\fP) modifier, NULL is passed
284     as the tables pointer; that is, \fB/L\fP applies only to the expression on
285     which it appears.
286 ph10 510 .P
287 nigel 75 The \fB/M\fP modifier causes the size of the memory block used to hold the compiled
288 nigel 53 pattern to be output.
289 nigel 75 .P
290 ph10 510 The \fB/S\fP modifier causes \fBpcre_study()\fP to be called after the
291     expression has been compiled, and the results used when the expression is
292     matched.
293 ph10 541 .P
294 ph10 545 The \fB/T\fP modifier must be followed by a single digit. It causes a specific
295     set of built-in character tables to be passed to \fBpcre_compile()\fP. It is
296     used in the standard PCRE tests to check behaviour with different character
297 ph10 541 tables. The digit specifies the tables as follows:
298     .sp
299 ph10 545 0 the default ASCII tables, as distributed in
300 ph10 541 pcre_chartables.c.dist
301     1 a set of tables defining ISO 8859 characters
302     .sp
303 ph10 545 In table 1, some characters whose codes are greater than 128 are identified as
304 ph10 541 letters, digits, spaces, etc.
305 nigel 75 .
306     .
307 ph10 518 .SS "Using the POSIX wrapper API"
308     .rs
309     .sp
310     The \fB/P\fP modifier causes \fBpcretest\fP to call PCRE via the POSIX wrapper
311 ph10 535 API rather than its native API. When \fB/P\fP is set, the following modifiers
312 ph10 518 set options for the \fBregcomp()\fP function:
313     .sp
314     /i REG_ICASE
315     /m REG_NEWLINE
316     /N REG_NOSUB
317     /s REG_DOTALL )
318 ph10 535 /U REG_UNGREEDY ) These options are not part of
319 ph10 518 /W REG_UCP ) the POSIX standard
320     /8 REG_UTF8 )
321     .sp
322     The \fB/+\fP modifier works as described above. All other modifiers are
323     ignored.
324     .
325     .
326 nigel 75 .SH "DATA LINES"
327 nigel 63 .rs
328     .sp
329 nigel 75 Before each data line is passed to \fBpcre_exec()\fP, leading and trailing
330 ph10 599 white space is removed, and it is then scanned for \e escapes. Some of these
331     are pretty esoteric features, intended for checking out some of the more
332 nigel 63 complicated features of PCRE. If you are just testing "ordinary" regular
333     expressions, you probably don't need any of these. The following escapes are
334 nigel 53 recognized:
335 nigel 75 .sp
336 nigel 93 \ea alarm (BEL, \ex07)
337     \eb backspace (\ex08)
338     \ee escape (\ex1b)
339 ph10 599 \ef form feed (\ex0c)
340 nigel 93 \en newline (\ex0a)
341 nigel 91 .\" JOIN
342     \eqdd set the PCRE_MATCH_LIMIT limit to dd
343     (any number of digits)
344 nigel 93 \er carriage return (\ex0d)
345     \et tab (\ex09)
346     \ev vertical tab (\ex0b)
347 nigel 75 \ennn octal character (up to 3 octal digits)
348 ph10 579 always a byte unless > 255 in UTF-8 mode
349 ph10 570 \exhh hexadecimal byte (up to 2 hex digits)
350 nigel 75 .\" JOIN
351     \ex{hh...} hexadecimal character, any number of digits
352 nigel 63 in UTF-8 mode
353 nigel 91 .\" JOIN
354 nigel 75 \eA pass the PCRE_ANCHORED option to \fBpcre_exec()\fP
355 nigel 91 or \fBpcre_dfa_exec()\fP
356     .\" JOIN
357 nigel 75 \eB pass the PCRE_NOTBOL option to \fBpcre_exec()\fP
358 nigel 91 or \fBpcre_dfa_exec()\fP
359 nigel 75 .\" JOIN
360     \eCdd call pcre_copy_substring() for substring dd
361     after a successful match (number less than 32)
362     .\" JOIN
363     \eCname call pcre_copy_named_substring() for substring
364 nigel 63 "name" after a successful match (name terminated
365     by next non-alphanumeric character)
366 nigel 75 .\" JOIN
367     \eC+ show the current captured substrings at callout
368 nigel 63 time
369 nigel 75 \eC- do not supply a callout function
370     .\" JOIN
371     \eC!n return 1 instead of 0 when callout number n is
372 nigel 63 reached
373 nigel 75 .\" JOIN
374     \eC!n!m return 1 instead of 0 when callout number n is
375 nigel 63 reached for the mth time
376 nigel 75 .\" JOIN
377     \eC*n pass the number n (may be negative) as callout
378     data; this is used as the callout return value
379 nigel 77 \eD use the \fBpcre_dfa_exec()\fP match function
380     \eF only shortest match for \fBpcre_dfa_exec()\fP
381 nigel 75 .\" JOIN
382     \eGdd call pcre_get_substring() for substring dd
383     after a successful match (number less than 32)
384     .\" JOIN
385     \eGname call pcre_get_named_substring() for substring
386 nigel 63 "name" after a successful match (name terminated
387     by next non-alphanumeric character)
388 nigel 75 .\" JOIN
389     \eL call pcre_get_substringlist() after a
390 nigel 63 successful match
391 nigel 91 .\" JOIN
392 nigel 87 \eM discover the minimum MATCH_LIMIT and
393     MATCH_LIMIT_RECURSION settings
394 nigel 91 .\" JOIN
395 nigel 75 \eN pass the PCRE_NOTEMPTY option to \fBpcre_exec()\fP
396 ph10 442 or \fBpcre_dfa_exec()\fP; if used twice, pass the
397 ph10 461 PCRE_NOTEMPTY_ATSTART option
398 nigel 75 .\" JOIN
399     \eOdd set the size of the output vector passed to
400     \fBpcre_exec()\fP to dd (any number of digits)
401 nigel 77 .\" JOIN
402 ph10 428 \eP pass the PCRE_PARTIAL_SOFT option to \fBpcre_exec()\fP
403     or \fBpcre_dfa_exec()\fP; if used twice, pass the
404 ph10 461 PCRE_PARTIAL_HARD option
405 nigel 91 .\" JOIN
406     \eQdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
407     (any number of digits)
408 nigel 77 \eR pass the PCRE_DFA_RESTART option to \fBpcre_dfa_exec()\fP
409 nigel 75 \eS output details of memory get/free calls during matching
410 nigel 91 .\" JOIN
411 ph10 455 \eY pass the PCRE_NO_START_OPTIMIZE option to \fBpcre_exec()\fP
412     or \fBpcre_dfa_exec()\fP
413     .\" JOIN
414 nigel 75 \eZ pass the PCRE_NOTEOL option to \fBpcre_exec()\fP
415 nigel 91 or \fBpcre_dfa_exec()\fP
416 nigel 75 .\" JOIN
417     \e? pass the PCRE_NO_UTF8_CHECK option to
418 nigel 91 \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP
419     .\" JOIN
420 ph10 567 \e>dd start the match at offset dd (optional "-"; then
421 ph10 579 any number of digits); this sets the \fIstartoffset\fP
422 ph10 567 argument for \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP
423 nigel 91 .\" JOIN
424     \e<cr> pass the PCRE_NEWLINE_CR option to \fBpcre_exec()\fP
425     or \fBpcre_dfa_exec()\fP
426     .\" JOIN
427     \e<lf> pass the PCRE_NEWLINE_LF option to \fBpcre_exec()\fP
428     or \fBpcre_dfa_exec()\fP
429     .\" JOIN
430     \e<crlf> pass the PCRE_NEWLINE_CRLF option to \fBpcre_exec()\fP
431     or \fBpcre_dfa_exec()\fP
432 nigel 93 .\" JOIN
433 ph10 149 \e<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to \fBpcre_exec()\fP
434     or \fBpcre_dfa_exec()\fP
435     .\" JOIN
436 nigel 93 \e<any> pass the PCRE_NEWLINE_ANY option to \fBpcre_exec()\fP
437     or \fBpcre_dfa_exec()\fP
438 nigel 75 .sp
439 ph10 579 Note that \exhh always specifies one byte, even in UTF-8 mode; this makes it
440     possible to construct invalid UTF-8 sequences for testing purposes. On the
441 ph10 570 other hand, \ex{hh} is interpreted as a UTF-8 character in UTF-8 mode,
442 ph10 579 generating more than one byte if the value is greater than 127. When not in
443     UTF-8 mode, it generates one byte for values less than 256, and causes an error
444 ph10 570 for greater values.
445     .P
446 nigel 93 The escapes that specify line ending sequences are literal strings, exactly as
447     shown. No more than one newline setting should be present in any data line.
448 nigel 75 .P
449 nigel 93 A backslash followed by any other character simply escapes that character. If
450     the very last character is a backslash, it is ignored. This gives a way of
451     passing an empty line as data, since a real empty line terminates the data
452     input.
453     .P
454 nigel 75 If \eM is present, \fBpcretest\fP calls \fBpcre_exec()\fP several times, with
455 nigel 87 different values in the \fImatch_limit\fP and \fImatch_limit_recursion\fP
456     fields of the \fBpcre_extra\fP data structure, until it finds the minimum
457     numbers for each parameter that allow \fBpcre_exec()\fP to complete. The
458     \fImatch_limit\fP number is a measure of the amount of backtracking that takes
459     place, and checking it out can be instructive. For most simple matches, the
460     number is quite small, but for patterns with very large numbers of matching
461     possibilities, it can become large very quickly with increasing length of
462     subject string. The \fImatch_limit_recursion\fP number is a measure of how much
463     stack (or, if PCRE is compiled with NO_RECURSE, how much heap) memory is needed
464     to complete the match attempt.
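.P
The text above says only that pcre_exec() is called several times with
different limits. One plausible way to find a minimum is to double a probe
value until the match completes and then bisect. The Python sketch below
shows that strategy; the strategy itself, the starting probe of 64, and the
completes() callback (standing in for "pcre_exec() finished without a limit
error at this setting") are all assumptions, not a description of what
\fBpcretest\fP actually does.

```python
def minimum_limit(completes, upper=2**31 - 1):
    """Smallest limit in 1..upper for which completes(limit) is true.

    Assumes completes() is monotonic: once a limit is large enough, every
    larger limit is large enough too.
    """
    low = 0                           # treated as a setting that is too small
    probe = 64                        # hypothetical starting point
    while not completes(probe):
        low = probe
        if probe >= upper:
            raise ValueError("match does not complete even at the upper bound")
        probe = min(probe * 2, upper)
    high = probe                      # known to be large enough
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high
```

The same routine could be run once for each of the two limits.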
465 nigel 75 .P
466     When \eO is used, the value specified may be higher or lower than the size set
467     by the \fB-o\fP command line option (or defaulted to 45); \eO applies only to
468     the call of \fBpcre_exec()\fP for the line in which it appears.
469     .P
470     If the \fB/P\fP modifier was present on the pattern, causing the POSIX wrapper
471 ph10 518 API to be used, the only option-setting sequences that have any effect are \eB,
472     \eN, and \eZ, causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively,
473     to be passed to \fBregexec()\fP.
474 nigel 75 .P
475     The use of \ex{hh...} to represent UTF-8 characters is not dependent on the use
476     of the \fB/8\fP modifier on the pattern. It is recognized always. There may be
477 nigel 53 any number of hexadecimal digits inside the braces. The result is from one to
478 ph10 211 six bytes, encoded according to the original UTF-8 rules of RFC 2279. This
479     allows for values in the range 0 to 0x7FFFFFFF. Note that not all of those are
480     valid Unicode code points, or indeed valid UTF-8 characters according to the
481     later rules in RFC 3629.
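.P
The one-to-six-byte encoding of RFC 2279 can be written out in a few lines.
The Python sketch below is an illustration, not \fBpcretest\fP's own code;
the function name is invented.

```python
def encode_rfc2279(value):
    """Encode an integer in 0..0x7FFFFFFF using the original UTF-8 rules."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range for RFC 2279 UTF-8")
    if value < 0x80:
        return bytes([value])         # one byte, same as ASCII
    # (number of bytes, first value too large for that length, lead-byte prefix)
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in forms:
        if value < limit:
            tail = []
            for _ in range(nbytes - 1):
                tail.append(0x80 | (value & 0x3F))   # six payload bits per byte
                value >>= 6
            return bytes([lead | value] + tail[::-1])
```

For values that are valid Unicode code points the result agrees with
Python's built-in UTF-8 encoder; the five-byte and six-byte forms are the
ones that RFC 3629 later disallowed.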
482 nigel 75 .
483     .
484 nigel 77 .SH "THE ALTERNATIVE MATCHING FUNCTION"
485 nigel 63 .rs
486     .sp
487 nigel 77 By default, \fBpcretest\fP uses the standard PCRE matching function,
488     \fBpcre_exec()\fP, to match each data line. From release 6.0, PCRE supports an
489     alternative matching function, \fBpcre_dfa_exec()\fP, which operates in a
490     different way, and has some restrictions. The differences between the two
491     functions are described in the
492     .\" HREF
493     \fBpcrematching\fP
494     .\"
495     documentation.
496     .P
497     If a data line contains the \eD escape sequence, or if the command line
498     contains the \fB-dfa\fP option, the alternative matching function is called.
499     This function finds all possible matches at a given point. If, however, the \eF
500     escape sequence is present in the data line, it stops after the first match is
501     found. This is always the shortest possible match.
502     .
503     .
504     .SH "DEFAULT OUTPUT FROM PCRETEST"
505     .rs
506     .sp
507     This section describes the output when the normal matching function,
508     \fBpcre_exec()\fP, is being used.
509     .P
510 ph10 598 When a match succeeds, \fBpcretest\fP outputs the list of captured substrings
511     that \fBpcre_exec()\fP returns, starting with number 0 for the string that
512     matched the whole pattern. Otherwise, it outputs "No match" when the return is
513 ph10 435 PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
514 ph10 553 substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. (Note that this is
515     the entire substring that was inspected during the partial match; it may
516     include characters before the actual match start if a lookbehind assertion,
517 ph10 598 \eK, \eb, or \eB was involved.) For any other return, \fBpcretest\fP outputs
518     the PCRE negative error number and a short descriptive phrase. If the error is
519     a failed UTF-8 string check, the byte offset of the start of the failing
520     character and the reason code are also output, provided that the size of the
521     output vector is at least two. Here is an example of an interactive
522     \fBpcretest\fP run.
523 nigel 75 .sp
524 nigel 53 $ pcretest
525 ph10 598 PCRE version 8.13 2011-04-30
526 nigel 75 .sp
527     re> /^abc(\ed+)/
528 nigel 53 data> abc123
529     0: abc123
530     1: 123
531     data> xyz
532     No match
533 nigel 75 .sp
534 ph10 598 Unset capturing substrings that are not followed by one that is set are not
535     returned by \fBpcre_exec()\fP, and are not shown by \fBpcretest\fP. In the
536     following example, there are two capturing substrings, but when the first data
537     line is matched, the second, unset substring is not shown. An "internal" unset
538     substring is shown as "<unset>", as for the second data line.
539 ph10 273 .sp
540     re> /(a)|(b)/
541     data> a
542     0: a
543     1: a
544     data> b
545     0: b
546     1: <unset>
547 ph10 286 2: b
548 ph10 273 .sp
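.P
The rule for unset substrings can be imitated with Python's re module, which
also reports unset groups. This sketch uses Python's regex engine rather than
PCRE, and format_captures() is an invented name; it only mirrors the numbering
and the "<unset>" marker shown above.

```python
import re


def format_captures(match):
    """List captured substrings up to the highest-numbered group that is set."""
    last_set = max(i for i in range(match.re.groups + 1)
                   if match.group(i) is not None)
    lines = []
    for i in range(last_set + 1):
        value = match.group(i)
        lines.append(f"{i}: {'<unset>' if value is None else value}")
    return lines


pattern = re.compile(r"(a)|(b)")
first = format_captures(pattern.search("a"))    # trailing unset group is dropped
second = format_captures(pattern.search("b"))   # internal unset group is shown
```

Here first holds two lines and second holds three, matching the two data
lines in the example.
.P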
549 nigel 75 If the strings contain any non-printing characters, they are output as \exhh
550     escapes, or as \ex{...} escapes if the \fB/8\fP modifier was present on the
551 nigel 93 pattern. See below for the definition of non-printing characters. If the
552     pattern has the \fB/+\fP modifier, the output for substring 0 is followed by
553     the rest of the subject string, identified by "0+" like this:
554 nigel 75 .sp
555 nigel 53 re> /cat/+
556     data> cataract
557     0: cat
558     0+ aract
559 nigel 75 .sp
560     If the pattern has the \fB/g\fP or \fB/G\fP modifier, the results of successive
561 nigel 53 matching attempts are output in sequence, like this:
562 nigel 75 .sp
563     re> /\eBi(\ew\ew)/g
564 nigel 53 data> Mississippi
565     0: iss
566     1: ss
567     0: iss
568     1: ss
569     0: ipp
570     1: pp
571 nigel 75 .sp
572 ph10 598 "No match" is output only if the first match attempt fails. Here is an example
573     of a failure message (the offset 4 that is specified by \e>4 is past the end of
574     the subject string):
575     .sp
576     re> /xyz/
577     data> xyz\e>4
578     Error -24 (bad offset value)
579 nigel 75 .P
580     If any of the sequences \fB\eC\fP, \fB\eG\fP, or \fB\eL\fP are present in a
581 nigel 53 data line that is successfully matched, the substrings extracted by the
582     convenience functions are output with C, G, or L after the string number
583     instead of a colon. This is in addition to the normal full list. The string
584     length (that is, the return from the extraction function) is given in
585 nigel 75 parentheses after each string for \fB\eC\fP and \fB\eG\fP.
586     .P
587 nigel 93 Note that whereas patterns can be continued over several lines (a plain ">"
588 nigel 53 prompt is used for continuations), data lines may not. However, newlines can be
589 nigel 93 included in data by means of the \en escape (or \er, \er\en, etc., depending on
590     the newline sequence setting).
591 nigel 75 .
592     .
593 nigel 93 .
594 nigel 77 .SH "OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION"
595     .rs
596     .sp
597     When the alternative matching function, \fBpcre_dfa_exec()\fP, is used (by
598     means of the \eD escape sequence or the \fB-dfa\fP command line option), the
599     output consists of a list of all the matches that start at the first point in
600     the subject where there is at least one match. For example:
601     .sp
602     re> /(tang|tangerine|tan)/
603     data> yellow tangerine\eD
604     0: tangerine
605     1: tang
606     2: tan
607     .sp
608     (Using the normal matching function on this data finds only "tang".) The
609 ph10 428 longest matching string is always given first (and numbered zero). After a
610 ph10 461 PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
611 ph10 553 partially matching substring. (Note that this is the entire substring that was
612     inspected during the partial match; it may include characters before the actual
613     match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
614 nigel 77 .P
615 nigel 93 If \fB/g\fP is present on the pattern, the search for further matches resumes
616 nigel 77 at the end of the longest match. For example:
617     .sp
618     re> /(tang|tangerine|tan)/g
619     data> yellow tangerine and tangy sultana\eD
620     0: tangerine
621     1: tang
622     2: tan
623     0: tang
624     1: tan
625     0: tan
626     .sp
627     Since the matching function does not support substring capture, the escape
628     sequences that are concerned with captured substrings are not relevant.
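.P
The shape of this output can be reproduced for the simple case of literal
alternatives. The Python sketch below is not a DFA matcher and does not call
PCRE; it only illustrates "all matches at the first matching point, longest
first" and the \fB/g\fP rule of resuming at the end of the longest match. The
function names are invented, and the alternatives are assumed to be non-empty
literal strings.

```python
def matches_at_first_point(alternatives, subject, start=0):
    """Return (position, matches sorted longest first) or None."""
    for pos in range(start, len(subject) + 1):
        found = sorted(
            {alt for alt in alternatives if alt and subject.startswith(alt, pos)},
            key=len,
            reverse=True,
        )
        if found:
            return pos, found
    return None


def all_match_sets(alternatives, subject):
    """Repeat the search, resuming at the end of the longest match each time."""
    results = []
    start = 0
    while True:
        hit = matches_at_first_point(alternatives, subject, start)
        if hit is None:
            return results
        pos, found = hit
        results.append(found)
        start = pos + len(found[0])   # found[0] is the longest match


words = ["tang", "tangerine", "tan"]
sets = all_match_sets(words, "yellow tangerine and tangy sultana")
```

The three lists in sets correspond to the three groups of numbered lines in
the \fB/g\fP example above.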
629     .
630     .
631     .SH "RESTARTING AFTER A PARTIAL MATCH"
632     .rs
633     .sp
634     When the alternative matching function has given the PCRE_ERROR_PARTIAL return,
635     indicating that the subject partially matched the pattern, you can restart the
636     match with additional subject data by means of the \eR escape sequence. For
637     example:
638     .sp
639 ph10 155 re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
640 nigel 77 data> 23ja\eP\eD
641     Partial match: 23ja
642     data> n05\eR\eD
643     0: n05
644     .sp
645     For further information about partial matching, see the
646     .\" HREF
647     \fBpcrepartial\fP
648     .\"
649     documentation.
650     .
651     .
652 nigel 75 .SH CALLOUTS
653     .rs
654     .sp
655     If the pattern contains any callout requests, \fBpcretest\fP's callout function
656 nigel 77 is called during matching. This works with both matching functions. By default,
657     the called function displays the callout number, the start and current
658     positions in the text at the callout time, and the next pattern item to be
659     tested. For example, the output
660 nigel 75 .sp
661       --->pqrabcdef
662         0    ^  ^     \ed
663     .sp
664     indicates that callout number 0 occurred for a match attempt starting at the
665     fourth character of the subject string, when the pointer was at the seventh
666     character of the data, and when the next pattern item was \ed. Just one
667     circumflex is output if the start and current positions are the same.
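.P
The placement of the circumflexes can be sketched as follows. This is an illustration under stated assumptions, not the code that pcretest uses: the function name caret_markers is hypothetical, offsets are zero-based, and the result covers only the part of the marker line that sits beneath the subject text (the callout number and the next pattern item are omitted).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: build the caret markers for a callout display.
   start and current are zero-based offsets into the subject, with
   start <= current. Returns 0 on success, or -1 if the offsets are
   invalid or the buffer is too small. */
int caret_markers(char *out, size_t outsize, int start, int current)
{
    if (start < 0 || current < start || (size_t)current + 2 > outsize)
        return -1;
    memset(out, ' ', (size_t)current);  /* padding up to the current position */
    out[start] = '^';                   /* where the match attempt started */
    out[current] = '^';                 /* same cell when start == current */
    out[current + 1] = '\0';
    return 0;
}
```

With start 3 and current 6 (the fourth and seventh characters), the markers fall under the "a" and the "d" of "pqrabcdef". When the two offsets are equal, a single circumflex results.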
668     .P
669     Callouts numbered 255 are assumed to be automatic callouts, inserted as a
670     result of the \fB/C\fP pattern modifier. In this case, instead of showing the
671     callout number, the offset in the pattern, preceded by a plus, is output. For
672     example:
673     .sp
674     re> /\ed?[A-E]\e*/C
675     data> E*
676       --->E*
677        +0 ^      \ed?
678        +3 ^      [A-E]
679        +8 ^^     \e*
680       +10 ^ ^
681     0: E*
682     .sp
683     The callout function in \fBpcretest\fP returns zero (carry on matching) by
684 nigel 77 default, but you can use a \eC item in a data line (as described above) to
685 nigel 75 change this.
686     .P
687     Inserting callouts can be helpful when using \fBpcretest\fP to check
688     complicated regular expressions. For further information about callouts, see
689     the
690     .\" HREF
691     \fBpcrecallout\fP
692     .\"
693     documentation.
694     .
695     .
696 nigel 93 .
697     .SH "NON-PRINTING CHARACTERS"
698     .rs
699     .sp
700     When \fBpcretest\fP is outputting text in the compiled version of a pattern,
701     bytes other than 32-126 are always treated as non-printing characters and are
702     therefore shown as hex escapes.
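.P
A minimal sketch of this rule follows. The function name escape_bytes is hypothetical, and the exact escape spelling (a backslash, "x", and two lower-case hex digits) is an assumption made for the illustration rather than something stated in this document.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy len bytes from s into out, replacing every byte
   outside the range 32-126 with a four-character hex escape. Returns the
   length of the resulting string, or -1 if out is too small. */
int escape_bytes(const unsigned char *s, size_t len, char *out, size_t outsize)
{
    size_t used = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 32 && s[i] <= 126) {
            if (used + 2 > outsize) return -1;   /* byte plus terminator */
            out[used++] = (char)s[i];
        } else {
            if (used + 5 > outsize) return -1;   /* escape plus terminator */
            snprintf(out + used, outsize - used, "\\x%02x", (unsigned)s[i]);
            used += 4;
        }
    }
    if (used + 1 > outsize) return -1;
    out[used] = '\0';
    return (int)used;
}
```

The range test is done on byte values directly, so the result does not depend on the current locale.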
703     .P
704     When \fBpcretest\fP is outputting text that is a matched part of a subject
705     string, it behaves in the same way, unless a different locale has been set for
706     the pattern (using the \fB/L\fP modifier). In this case, the \fBisprint()\fP
707     function is used to distinguish printing and non-printing characters.
708     .
709     .
710     .
711 nigel 75 .SH "SAVING AND RELOADING COMPILED PATTERNS"
712     .rs
713     .sp
714     The facilities described in this section are not available when the POSIX
715 ph10 599 interface to PCRE is being used, that is, when the \fB/P\fP pattern modifier is
716 nigel 75 specified.
717     .P
718     When the POSIX interface is not in use, you can cause \fBpcretest\fP to write a
719     compiled pattern to a file, by following the modifiers with > and a file name.
720     For example:
721     .sp
722     /pattern/im >/some/file
723     .sp
724     See the
725     .\" HREF
726     \fBpcreprecompile\fP
727     .\"
728     documentation for a discussion about saving and re-using compiled patterns.
729     .P
730     The data that is written is binary. The first eight bytes are the length of the
731     compiled pattern data followed by the length of the optional study data, each
732     written as four bytes in big-endian order (most significant byte first). If
733     there is no study data (either the pattern was not studied, or studying did not
734     return any data), the second length is zero. The lengths are followed by an
735     exact copy of the compiled pattern. If there is additional study data, this
736     follows immediately after the compiled pattern. After writing the file,
737     \fBpcretest\fP expects to read a new pattern.
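.P
The header layout described above can be decoded as in the following sketch. The function names be32 and parse_saved_header are hypothetical; the only facts taken from this document are that the file starts with two four-byte lengths, most significant byte first, giving the size of the compiled pattern and then the size of the study data.

```c
#include <stdint.h>

/* Decode four bytes, most significant first, into an unsigned value. */
uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: split the eight-byte header of a saved pattern file
   into the compiled pattern length and the study data length. A study
   length of zero means that no study data follows the pattern. */
void parse_saved_header(const unsigned char hdr[8],
                        uint32_t *pattern_len, uint32_t *study_len)
{
    *pattern_len = be32(hdr);
    *study_len   = be32(hdr + 4);
}
```

A reader would fetch the first eight bytes of the file, pass them to parse_saved_header, and then read pattern_len bytes of compiled pattern followed by study_len bytes of study data. Decoding byte by byte in this way gives the same result on hosts of either endianness.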
738     .P
739 ph10 599 A saved pattern can be reloaded into \fBpcretest\fP by specifying < and a file
740 nigel 75 name instead of a pattern. The name of the file must not contain a < character,
741     as otherwise \fBpcretest\fP will interpret the line as a pattern delimited by <
742     characters.
743     For example:
744     .sp
745     re> </some/file
746     Compiled regex loaded from /some/file
747     No study data
748     .sp
749     When the pattern has been loaded, \fBpcretest\fP proceeds to read data lines in
750     the usual way.
751     .P
752     You can copy a file written by \fBpcretest\fP to a different host and reload it
753     there, even if the new host has opposite endianness to the one on which the
754     pattern was compiled. For example, you can compile on an i86 machine and run on
755     a SPARC machine.
756     .P
757     File names for saving and reloading can be absolute or relative, but note that
758     the shell facility of expanding a file name that starts with a tilde (~) is not
759     available.
760     .P
761     The ability to save and reload files in \fBpcretest\fP is intended for testing
762     and experimentation. It is not intended for production use because only a
763     single pattern can be written to a file. Furthermore, there is no facility for
764     supplying custom character tables for use with a reloaded pattern. If the
765     original pattern was compiled with custom tables, an attempt to match a subject
766     string using a reloaded pattern is likely to cause \fBpcretest\fP to crash.
767     Finally, if you attempt to load a file that is not in the correct format, the
768     result is undefined.
769     .
770     .
771 nigel 93 .SH "SEE ALSO"
772     .rs
773     .sp
774     \fBpcre\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
775 ph10 148 \fBpcrepartial\fP(3), \fBpcrepattern\fP(3), \fBpcreprecompile\fP(3).
776 nigel 93 .
777     .
778 nigel 53 .SH AUTHOR
779 nigel 63 .rs
780     .sp
781 ph10 99 .nf
782 nigel 77 Philip Hazel
783 ph10 99 University Computing Service
784 nigel 93 Cambridge CB2 3QH, England.
785 ph10 99 .fi
786     .
787     .
788     .SH REVISION
789     .rs
790     .sp
791     .nf
792 ph10 599 Last updated: 07 May 2011
793 ph10 598 Copyright (c) 1997-2011 University of Cambridge.
794 ph10 99 .fi
