Revision 79, Sat Feb 24 21:40:52 2007 UTC, by nigel: Load pcre-6.1 into code/trunk.

PCRETEST(1)                                                        PCRETEST(1)


NAME
       pcretest - a program for testing Perl-compatible regular expressions.


SYNOPSIS

       pcretest [-C] [-d] [-dfa] [-i] [-m] [-o osize] [-p] [-t] [source]
            [destination]

       pcretest was written as a test program for the PCRE regular expression
       library itself, but it can also be used for experimenting with regular
       expressions. This document describes the features of the test program;
       for details of the regular expressions themselves, see the pcrepattern
       documentation. For details of the PCRE library function calls and their
       options, see the pcreapi documentation.


OPTIONS

       -C        Output the version number of the PCRE library, and all
                 available information about the optional features that are
                 included, and then exit.

       -d        Behave as if each regex has the /D (debug) modifier; the
                 internal form is output after compilation.

       -dfa      Behave as if each data line contains the \D escape sequence;
                 this causes the alternative matching function,
                 pcre_dfa_exec(), to be used instead of the standard
                 pcre_exec() function (more detail is given below).

       -i        Behave as if each regex has the /I modifier; information
                 about the compiled pattern is given after compilation.

       -m        Output the size of each compiled pattern after it has been
                 compiled. This is equivalent to adding /M to each regular
                 expression. For compatibility with earlier versions of
                 pcretest, -s is a synonym for -m.

       -o osize  Set the number of elements in the output vector that is used
                 when calling pcre_exec() to be osize. The default value is
                 45, which is enough for 14 capturing subexpressions. The
                 vector size can be changed for individual matching calls by
                 including \O in the data line (see below).

       -p        Behave as if each regex has the /P modifier; the POSIX
                 wrapper API is used to call PCRE. None of the other options
                 has any effect when -p is set.

       -t        Run each compile, study, and match many times with a timer,
                 and output resulting time per compile or match (in
                 milliseconds). Do not set -m with -t, because you will then
                 get the size output a zillion times, and the timing will be
                 distorted.
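
       The relationship between osize and the number of capturing
       subexpressions quoted under -o can be checked with a little
       arithmetic. The sketch below is illustrative only: the function name
       is hypothetical, and it assumes the rule from the pcreapi
       documentation that pcre_exec() uses two-thirds of the vector for
       (start, end) offset pairs and the remaining third as workspace.

```python
def max_capture_groups(osize: int) -> int:
    """Capturing subexpressions that an output vector of osize ints can report.

    Hypothetical helper, not part of pcretest.
    """
    # Every three vector elements yield one usable (start, end) offset pair;
    # the third element of each triple is workspace for pcre_exec().
    pairs = osize // 3
    # Pair 0 describes the whole match, so it is not a capturing group.
    return max(pairs - 1, 0)


print(max_capture_groups(45))  # 14
```

       With the default of 45 this gives 15 pairs: one for the whole match
       and 14 for capturing subexpressions.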


DESCRIPTION

       If pcretest is given two filename arguments, it reads from the first
       and writes to the second. If it is given only one filename argument, it
       reads from that file and writes to stdout. Otherwise, it reads from
       stdin and writes to stdout, and prompts for each line of input, using
       "re>" to prompt for regular expressions, and "data>" to prompt for data
       lines.

       The program handles any number of sets of input on a single input file.
       Each set starts with a regular expression, and continues with any
       number of data lines to be matched against the pattern.

       Each data line is matched separately and independently. If you want to
       do multiple-line matches, you have to use the \n escape sequence in a
       single line of input to encode the newline characters. The maximum
       length of a data line is 30,000 characters.

       An empty line signals the end of the data lines, at which point a new
       regular expression is read. The regular expressions are given enclosed
       in any non-alphanumeric delimiters other than backslash, for example

         /(a|bc)x+yz/

       White space before the initial delimiter is ignored. A regular
       expression may be continued over several input lines, in which case
       the newline characters are included within it. It is possible to
       include the delimiter within the pattern by escaping it, for example

         /abc\/def/

       If you do so, the escape and the delimiter form part of the pattern,
       but since delimiters are always non-alphanumeric, this does not affect
       its interpretation. If the terminating delimiter is immediately
       followed by a backslash, for example,

         /abc/\

       then a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with "abc/",
       causing pcretest to read the next line as a continuation of the regular
       expression.


PATTERN MODIFIERS

       A pattern may be followed by any number of modifiers, which are mostly
       single characters. Following Perl usage, these are referred to below
       as, for example, "the /i modifier", even though the delimiter of the
       pattern need not always be a slash, and no slash is used when writing
       modifiers. Whitespace may appear between the final pattern delimiter
       and the first modifier, and between the modifiers themselves.

       The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE,
       PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when
       pcre_compile() is called. These four modifier letters have the same
       effect as they do in Perl. For example:

         /caseless/i

       The following table shows additional modifiers for setting PCRE options
       that do not correspond to anything in Perl:

         /A    PCRE_ANCHORED
         /C    PCRE_AUTO_CALLOUT
         /E    PCRE_DOLLAR_ENDONLY
         /f    PCRE_FIRSTLINE
         /N    PCRE_NO_AUTO_CAPTURE
         /U    PCRE_UNGREEDY
         /X    PCRE_EXTRA

       Searching for all possible matches within each subject string can be
       requested by the /g or /G modifier. After finding a match, PCRE is
       called again to search the remainder of the subject string. The
       difference between /g and /G is that the former uses the startoffset
       argument to pcre_exec() to start searching at a new point within the
       entire string (which is in effect what Perl does), whereas the latter
       passes over a shortened substring. This makes a difference to the
       matching process if the pattern begins with a lookbehind assertion
       (including \b or \B).
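
       The difference between the two strategies can be sketched outside
       pcretest. The example below is only an analogy: it uses Python's re
       module as a stand-in for PCRE, the function names are hypothetical,
       and it does not model the special handling of empty matches. The pos
       argument of search() plays the role of startoffset, so a lookbehind
       can still see the text before the starting point; slicing the subject
       throws that text away.

```python
import re


def matches_with_startoffset(pattern, subject):
    """/g style: search the whole subject, moving only the start offset."""
    found, pos = [], 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Step past the match; advance by one after an empty match.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def matches_with_substring(pattern, subject):
    """/G style: search a shortened copy of the subject each time."""
    found, rest = [], subject
    while True:
        m = pattern.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        # Everything before the cut is no longer visible to lookbehinds.
        rest = rest[cut:]
    return found


pattern = re.compile(r"a|(?<=a)b")
print(matches_with_startoffset(pattern, "ab"))  # ['a', 'b']
print(matches_with_substring(pattern, "ab"))    # ['a']
```

       In the second call the "b" is never found, because the "a" that the
       lookbehind needs has been cut off the front of the subject.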

       If any call to pcre_exec() in a /g or /G sequence matches an empty
       string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
       flags set in order to search for another, non-empty, match at the same
       point. If this second match fails, the start offset is advanced by
       one, and the normal match is retried. This imitates the way Perl
       handles such cases when using the /g modifier or the split() function.

       There are yet more modifiers for controlling the way pcretest operates.

       The /+ modifier requests that as well as outputting the substring that
       matched the entire pattern, pcretest should in addition output the
       remainder of the subject string. This is useful for tests where the
       subject contains multiple copies of the same substring.

       The /L modifier must be followed directly by the name of a locale, for
       example,

         /pattern/Lfr_FR

       For this reason, it must be the last modifier. The given locale is set,
       pcre_maketables() is called to build a set of character tables for the
       locale, and this is then passed to pcre_compile() when compiling the
       regular expression. Without an /L modifier, NULL is passed as the
       tables pointer; that is, /L applies only to the expression on which it
       appears.

       The /I modifier requests that pcretest output information about the
       compiled pattern (whether it is anchored, has a fixed first character,
       and so on). It does this by calling pcre_fullinfo() after compiling a
       pattern. If the pattern is studied, the results of that are also
       output.

       The /D modifier is a PCRE debugging feature, which also assumes /I. It
       causes the internal form of compiled regular expressions to be output
       after compilation. If the pattern was studied, the information returned
       is also output.

       The /F modifier causes pcretest to flip the byte order of the fields in
       the compiled pattern that contain 2-byte and 4-byte numbers. This
       facility is for testing the feature in PCRE that allows it to execute
       patterns that were compiled on a host with a different endianness. This
       feature is not available when the POSIX interface to PCRE is being
       used, that is, when the /P pattern modifier is specified. See also the
       section about saving and reloading compiled patterns below.

       The /S modifier causes pcre_study() to be called after the expression
       has been compiled, and the results used when the expression is matched.

       The /M modifier causes the size of memory block used to hold the
       compiled pattern to be output.

       The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
       rather than its native API. When this is done, all other modifiers
       except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
       and REG_NEWLINE is set if /m is present. The wrapper functions force
       PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.

       The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
       set. This turns on support for UTF-8 character handling in PCRE,
       provided that it was compiled with this support enabled. This modifier
       also causes any non-printing characters in output strings to be printed
       using the \x{hh...} notation if they are valid UTF-8 sequences.

       If the /? modifier is used with /8, it causes pcretest to call
       pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
       checking of the string for UTF-8 validity.


DATA LINES

       Before each data line is passed to pcre_exec(), leading and trailing
       whitespace is removed, and it is then scanned for \ escapes. Some of
       these are pretty esoteric features, intended for checking out some of
       the more complicated features of PCRE. If you are just testing
       "ordinary" regular expressions, you probably don't need any of these.
       The following escapes are recognized:

         \a         alarm (= BEL)
         \b         backspace
         \e         escape
         \f         formfeed
         \n         newline
         \r         carriage return
         \t         tab
         \v         vertical tab
         \nnn       octal character (up to 3 octal digits)
         \xhh       hexadecimal character (up to 2 hex digits)
         \x{hh...}  hexadecimal character, any number of digits
                      in UTF-8 mode
         \A         pass the PCRE_ANCHORED option to pcre_exec()
         \B         pass the PCRE_NOTBOL option to pcre_exec()
         \Cdd       call pcre_copy_substring() for substring dd
                      after a successful match (number less than 32)
         \Cname     call pcre_copy_named_substring() for substring
                      "name" after a successful match (name terminated
                      by next non-alphanumeric character)
         \C+        show the current captured substrings at callout
                      time
         \C-        do not supply a callout function
         \C!n       return 1 instead of 0 when callout number n is
                      reached
         \C!n!m     return 1 instead of 0 when callout number n is
                      reached for the mth time
         \C*n       pass the number n (may be negative) as callout
                      data; this is used as the callout return value
         \D         use the pcre_dfa_exec() match function
         \F         only shortest match for pcre_dfa_exec()
         \Gdd       call pcre_get_substring() for substring dd
                      after a successful match (number less than 32)
         \Gname     call pcre_get_named_substring() for substring
                      "name" after a successful match (name terminated
                      by next non-alphanumeric character)
         \L         call pcre_get_substringlist() after a
                      successful match
         \M         discover the minimum MATCH_LIMIT setting
         \N         pass the PCRE_NOTEMPTY option to pcre_exec()
         \Odd       set the size of the output vector passed to
                      pcre_exec() to dd (any number of digits)
         \P         pass the PCRE_PARTIAL option to pcre_exec()
                      or pcre_dfa_exec()
         \R         pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
         \S         output details of memory get/free calls during matching
         \Z         pass the PCRE_NOTEOL option to pcre_exec()
         \?         pass the PCRE_NO_UTF8_CHECK option to
                      pcre_exec()
         \>dd       start the match at offset dd (any number of digits);
                      this sets the startoffset argument for pcre_exec()

       A backslash followed by anything else just escapes the anything else.
       If the very last character is a backslash, it is ignored. This gives a
       way of passing an empty line as data, since a real empty line
       terminates the data input.

       If \M is present, pcretest calls pcre_exec() several times, with
       different values in the match_limit field of the pcre_extra data
       structure, until it finds the minimum number that is needed for
       pcre_exec() to complete. This number is a measure of the amount of
       recursion and backtracking that takes place, and checking it out can
       be instructive. For most simple matches, the number is quite small,
       but for patterns with very large numbers of matching possibilities, it
       can become large very quickly with increasing length of subject
       string.

       When \O is used, the value specified may be higher or lower than the
       size set by the -o command line option (or defaulted to 45); \O applies
       only to the call of pcre_exec() for the line in which it appears.

       If the /P modifier was present on the pattern, causing the POSIX
       wrapper API to be used, only \B and \Z have any effect, causing
       REG_NOTBOL and REG_NOTEOL to be passed to regexec() respectively.

       The use of \x{hh...} to represent UTF-8 characters is not dependent on
       the use of the /8 modifier on the pattern. It is recognized always.
       There may be any number of hexadecimal digits inside the braces. The
       result is from one to six bytes, encoded according to the UTF-8 rules.
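
       To make "from one to six bytes" concrete, the sketch below encodes a
       character value using the original UTF-8 scheme, which allows
       sequences of up to six bytes for values up to 0x7fffffff. It is a
       stand-alone illustration rather than pcretest's own code; the
       function name and the range check are assumptions.

```python
def encode_utf8_extended(cp: int) -> bytes:
    """Encode cp as 1 to 6 bytes using the original (pre-2003) UTF-8 rules.

    Hypothetical helper written for this illustration.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in six UTF-8 bytes")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII: a single byte
    # Pick the shortest length whose exclusive upper bound exceeds cp.
    for nbytes, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if cp < limit:
            break
    # The lead byte starts with nbytes one-bits: 0xC0, 0xE0, 0xF0, 0xF8, 0xFC.
    lead_marker = (0xFF << (8 - nbytes)) & 0xFF
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([lead_marker | cp] + tail[::-1])


print(encode_utf8_extended(0xE9).hex())        # c3a9
print(len(encode_utf8_extended(0x7FFFFFFF)))   # 6
```

       For values that modern UTF-8 also covers, the result agrees with an
       ordinary UTF-8 encoder; only the five-byte and six-byte forms are
       specific to the older rules.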


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcretest uses the standard PCRE matching function,
       pcre_exec(), to match each data line. From release 6.0, PCRE supports
       an alternative matching function, pcre_dfa_exec(), which operates in a
       different way, and has some restrictions. The differences between the
       two functions are described in the pcrematching documentation.

       If a data line contains the \D escape sequence, or if the command line
       contains the -dfa option, the alternative matching function is called.
       This function finds all possible matches at a given point. If, however,
       the \F escape sequence is present in the data line, it stops after the
       first match is found. This is always the shortest possible match.


DEFAULT OUTPUT FROM PCRETEST

       This section describes the output when the normal matching function,
       pcre_exec(), is being used.

       When a match succeeds, pcretest outputs the list of captured substrings
       that pcre_exec() returns, starting with number 0 for the string that
       matched the whole pattern. Otherwise, it outputs "No match" or "Partial
       match" when pcre_exec() returns PCRE_ERROR_NOMATCH or
       PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE negative error
       number. Here is an example of an interactive pcretest run.

         $ pcretest
         PCRE version 5.00 07-Sep-2004

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       If the strings contain any non-printing characters, they are output as
       \xhh escapes, or as \x{...} escapes if the /8 modifier was present on
       the pattern. If the pattern has the /+ modifier, the output for
       substring 0 is followed by the rest of the subject string, identified
       by "0+" like this:

           re> /cat/+
         data> cataract
          0: cat
          0+ aract

       If the pattern has the /g or /G modifier, the results of successive
       matching attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails.

       If any of the sequences \C, \G, or \L are present in a data line that
       is successfully matched, the substrings extracted by the convenience
       functions are output with C, G, or L after the string number instead of
       a colon. This is in addition to the normal full list. The string length
       (that is, the return from the extraction function) is given in
       parentheses after each string for \C and \G.

       Note that while patterns can be continued over several lines (a plain
       ">" prompt is used for continuations), data lines may not. However,
       newlines can be included in data by means of the \n escape.


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre_dfa_exec(), is used (by
       means of the \D escape sequence or the -dfa command line option), the
       output consists of a list of all the matches that start at the first
       point in the subject where there is at least one match. For example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\D
          0: tangerine
          1: tang
          2: tan

       (Using the normal matching function on this data finds only "tang".)
       The longest matching string is always given first (and numbered zero).
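
       The shape of this output can be reproduced with a few lines of code
       for the special case where the pattern is nothing more than an
       alternation of literal strings. The sketch below is not how
       pcre_dfa_exec() works internally, and the function name is
       hypothetical; it only illustrates "all matches at the first matching
       point, longest first".

```python
def all_matches_at_first_point(alternatives, subject):
    """List every literal alternative that matches at the first position
    where at least one of them matches, longest first.

    Hypothetical helper; handles literal strings only, not real patterns.
    """
    for start in range(len(subject)):
        hits = [alt for alt in alternatives if subject.startswith(alt, start)]
        if hits:
            # Longest match first, mirroring the numbering in the output.
            return sorted(hits, key=len, reverse=True)
    return []


print(all_matches_at_first_point(["tang", "tangerine", "tan"],
                                 "yellow tangerine"))
# ['tangerine', 'tang', 'tan']
```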

       If /g is present on the pattern, the search for further matches
       resumes at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\D
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       Since the matching function does not support substring capture, the
       escape sequences that are concerned with captured substrings are not
       relevant.


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the PCRE_ERROR_PARTIAL
       return, indicating that the subject partially matched the pattern, you
       can restart the match with additional subject data by means of the \R
       escape sequence. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       For further information about partial matching, see the pcrepartial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcretest's callout
       function is called during matching. This works with both matching
       functions. By default, the called function displays the callout
       number, the start and current positions in the text at the callout
       time, and the next pattern item to be tested. For example, the output

         --->pqrabcdef
           0    ^  ^     \d

       indicates that callout number 0 occurred for a match attempt starting
       at the fourth character of the subject string, when the pointer was at
       the seventh character of the data, and when the next pattern item was
       \d. Just one circumflex is output if the start and current positions
       are the same.

       Callouts numbered 255 are assumed to be automatic callouts, inserted as
       a result of the /C pattern modifier. In this case, instead of showing
       the callout number, the offset in the pattern, preceded by a plus, is
       output. For example:

           re> /\d?[A-E]\*/C
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       The callout function in pcretest returns zero (carry on matching) by
       default, but you can use a \C item in a data line (as described above)
       to change this.

       Inserting callouts can be helpful when using pcretest to check
       complicated regular expressions. For further information about
       callouts, see the pcrecallout documentation.


SAVING AND RELOADING COMPILED PATTERNS

       The facilities described in this section are not available when the
       POSIX interface to PCRE is being used, that is, when the /P pattern
       modifier is specified.

       When the POSIX interface is not in use, you can cause pcretest to write
       a compiled pattern to a file, by following the modifiers with > and a
       file name. For example:

         /pattern/im >/some/file

       See the pcreprecompile documentation for a discussion about saving and
       re-using compiled patterns.

       The data that is written is binary. The first eight bytes are the
       length of the compiled pattern data followed by the length of the
       optional study data, each written as four bytes in big-endian order
       (most significant byte first). If there is no study data (either the
       pattern was not studied, or studying did not return any data), the
       second length is zero. The lengths are followed by an exact copy of the
       compiled pattern. If there is additional study data, this follows
       immediately after the compiled pattern. After writing the file,
       pcretest expects to read a new pattern.
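
       The layout just described is simple enough to take apart by hand. The
       sketch below splits such a file image into its two parts; the
       function name is hypothetical, the compiled pattern and study data
       are treated as opaque bytes, and the sample input is fabricated for
       illustration rather than produced by pcretest.

```python
import struct


def split_saved_pattern(blob: bytes):
    """Split a saved-pattern file image into (compiled_pattern, study_data).

    Hypothetical helper; the payload bytes are not interpreted in any way.
    """
    if len(blob) < 8:
        raise ValueError("file is shorter than the 8-byte header")
    # Two 4-byte unsigned lengths, most significant byte first.
    pattern_len, study_len = struct.unpack(">II", blob[:8])
    if len(blob) < 8 + pattern_len + study_len:
        raise ValueError("file is shorter than its header claims")
    pattern = blob[8:8 + pattern_len]
    study = blob[8 + pattern_len:8 + pattern_len + study_len]
    return pattern, study


# Fabricated image: a 3-byte "pattern" and no study data.
sample = struct.pack(">II", 3, 0) + b"abc"
print(split_saved_pattern(sample))  # (b'abc', b'')
```

       An empty second element corresponds to the "No study data" case.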

       A saved pattern can be reloaded into pcretest by specifying < and a
       file name instead of a pattern. The name of the file must not contain
       a < character, as otherwise pcretest will interpret the line as a
       pattern delimited by < characters. For example:

          re> </some/file
         Compiled regex loaded from /some/file
         No study data

       When the pattern has been loaded, pcretest proceeds to read data lines
       in the usual way.

       You can copy a file written by pcretest to a different host and reload
       it there, even if the new host has opposite endianness to the one on
       which the pattern was compiled. For example, you can compile on an i86
       machine and run on a SPARC machine.

       File names for saving and reloading can be absolute or relative, but
       note that the shell facility of expanding a file name that starts with
       a tilde (~) is not available.

       The ability to save and reload files in pcretest is intended for
       testing and experimentation. It is not intended for production use
       because only a single pattern can be written to a file. Furthermore,
       there is no facility for supplying custom character tables for use
       with a reloaded pattern. If the original pattern was compiled with
       custom tables, an attempt to match a subject string using a reloaded
       pattern is likely to cause pcretest to crash. Finally, if you attempt
       to load a file that is not in the correct format, the result is
       undefined.


AUTHOR

       Philip Hazel
       University Computing Service,
       Cambridge CB2 3QG, England.

Last updated: 28 February 2005
Copyright (c) 1997-2005 University of Cambridge.
