Revision 73, committed Sat Feb 24 21:40:30 2007 UTC by nigel: Load pcre-4.5 into code/trunk.

PCRETEST(1)                                                  PCRETEST(1)


NAME
pcretest - a program for testing Perl-compatible regular expressions.

SYNOPSIS
pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]

pcretest was written as a test program for the PCRE regular expression
library itself, but it can also be used for experimenting with regular
expressions. This document describes the features of the test program;
for details of the regular expressions themselves, see the pcrepattern
documentation. For details of PCRE and its options, see the pcreapi
documentation.


OPTIONS

-C        Output the version number of the PCRE library, and all
          available information about the optional features that are
          included, and then exit.

-d        Behave as if each regex had the /D modifier (see below); the
          internal form is output after compilation.

-i        Behave as if each regex had the /I modifier; information
          about the compiled pattern is given after compilation.

-m        Output the size of each compiled pattern after it has been
          compiled. This is equivalent to adding /M to each regular
          expression. For compatibility with earlier versions of
          pcretest, -s is a synonym for -m.

-o osize  Set the number of elements in the output vector that is used
          when calling PCRE to be osize. The default value is 45, which
          is enough for 14 capturing subexpressions. The vector size
          can be changed for individual matching calls by including \O
          in the data line (see below).

-p        Behave as if each regex had the /P modifier; the POSIX wrapper
          API is used to call PCRE. None of the other options has any
          effect when -p is set.

-t        Run each compile, study, and match many times with a timer,
          and output the resulting time per compile or match (in
          milliseconds). Do not set -t with -m, because you will then
          get the size output 20000 times and the timing will be
          distorted.
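
The relationship between the output vector size and the number of
capturing subexpressions is simple arithmetic. The sketch below assumes
the pcre_exec() convention described in the pcreapi documentation, where
only the first two-thirds of the vector hold offset pairs and pair 0
belongs to the whole match. The function name max_captures is made up
for this illustration.

```python
def max_captures(osize):
    """Return how many capturing subexpressions fit in a vector of osize ints."""
    pairs = osize // 3          # one (start, end) pair per three elements
    return max(pairs - 1, 0)    # pair 0 is reserved for the whole match

print(max_captures(45))  # prints 14
```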


DESCRIPTION

If pcretest is given two filename arguments, it reads from the first
and writes to the second. If it is given only one filename argument, it
reads from that file and writes to stdout. Otherwise, it reads from
stdin and writes to stdout, and prompts for each line of input, using
"re>" to prompt for regular expressions, and "data>" to prompt for data
lines.

The program handles any number of sets of input in a single input file.
Each set starts with a regular expression, and continues with any
number of data lines to be matched against the pattern.

Each line is matched separately and independently. If you want to do
multiple-line matches, you have to use the \n escape sequence in a
single line of input to encode the newline characters. The maximum
length of a data line is 30,000 characters.

An empty line signals the end of the data lines, at which point a new
regular expression is read. The regular expressions are given enclosed
in any non-alphanumeric delimiters other than backslash, for example

    /(a|bc)x+yz/

White space before the initial delimiter is ignored. A regular
expression may be continued over several input lines, in which case the
newline characters are included within it. It is possible to include
the delimiter within the pattern by escaping it, for example

    /abc\/def/

If you do so, the escape and the delimiter form part of the pattern,
but since delimiters are always non-alphanumeric, this does not affect
its interpretation. If the terminating delimiter is immediately
followed by a backslash, for example,

    /abc/\

then a backslash is added to the end of the pattern. This is done to
provide a way of testing the error condition that arises if a pattern
finishes with a backslash, because

    /abc\/

is interpreted as the first line of a pattern that starts with "abc/",
causing pcretest to read the next line as a continuation of the regular
expression.
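
The delimiter rules above can be sketched as a small single-line
parser. This is an illustrative approximation, not pcretest's actual
code: the function name split_pattern_line is hypothetical, continuation
lines are not read (the function returns None where pcretest would ask
for more input), and everything after the closing delimiter is returned
unexamined as the modifier string.

```python
def split_pattern_line(line):
    """Split one input line into (pattern, modifiers), or return None if
    the closing delimiter is missing and a continuation line is needed."""
    line = line.lstrip()                  # white space before the delimiter is ignored
    if not line:
        raise ValueError("empty line")
    delim = line[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")

    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            i += 2                        # escaped character, kept in the pattern
        elif line[i] == delim:
            break                         # unescaped closing delimiter
        else:
            i += 1
    else:
        return None                       # no closing delimiter on this line

    pattern, rest = line[1:i], line[i + 1:]
    if rest.startswith("\\"):
        pattern += "\\"                   # "/abc/\" adds a backslash to the pattern
        rest = rest[1:]
    return pattern, rest

print(split_pattern_line(r"/abc\/def/"))
print(split_pattern_line("/caseless/i"))
```

Note that the escaped delimiter stays in the returned pattern together
with its backslash, as the text describes.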


PATTERN MODIFIERS

The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively.
For example:

    /caseless/i

These modifier letters have the same effect as they do in Perl. There
are others that set PCRE options that do not correspond to anything in
Perl: /A, /E, /N, /U, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY,
PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA respectively.
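
For reference, the letter-to-option correspondence stated above can be
written as a lookup table. The sketch only maps letters to the option
names given in this document; the names MODIFIER_OPTIONS and
option_names are invented for the example, and no PCRE call is made.

```python
# Modifier letters and the PCRE options they set, as listed in the text.
MODIFIER_OPTIONS = {
    "i": "PCRE_CASELESS",
    "m": "PCRE_MULTILINE",
    "s": "PCRE_DOTALL",
    "x": "PCRE_EXTENDED",
    "A": "PCRE_ANCHORED",
    "E": "PCRE_DOLLAR_ENDONLY",
    "N": "PCRE_NO_AUTO_CAPTURE",
    "U": "PCRE_UNGREEDY",
    "X": "PCRE_EXTRA",
}

def option_names(modifiers):
    """Return the option names for the option-setting letters in modifiers."""
    return [MODIFIER_OPTIONS[c] for c in modifiers if c in MODIFIER_OPTIONS]

print(option_names("ix"))
```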

Searching for all possible matches within each subject string can be
requested by the /g or /G modifier. After finding a match, PCRE is
called again to search the remainder of the subject string. The
difference between /g and /G is that the former uses the startoffset
argument to pcre_exec() to start searching at a new point within the
entire string (which is in effect what Perl does), whereas the latter
passes over a shortened substring. This makes a difference to the
matching process if the pattern begins with a lookbehind assertion
(including \b or \B).
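
The difference is easy to reproduce outside PCRE. The following sketch
uses Python's re module purely as a stand-in: the pos argument of
search() plays the role of startoffset (characters before it remain
visible to \B), while slicing the subject mimics the shortened
substring. The helper names scan_g and scan_G are invented here, and
the simplified empty-match handling (step forward one character) is an
assumption of the sketch.

```python
import re

def scan_g(pattern, subject):
    """Repeated search in the style of /g: same string, moving start offset."""
    rx, pos, found = re.compile(pattern), 0, []
    while pos <= len(subject):
        m = rx.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found

def scan_G(pattern, subject):
    """Repeated search in the style of /G: pass only the remaining substring."""
    rx, rest, found = re.compile(pattern), subject, []
    while True:
        m = rx.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]
    return found

print(scan_g(r"\Bi(\w\w)", "Mississippi"))  # ['iss', 'iss', 'ipp']
print(scan_G(r"\Bi(\w\w)", "Mississippi"))  # ['iss', 'ipp']
```

In the second call the scan resumes on a substring that begins with
"i", where \B can no longer see the preceding "s", so the middle match
is lost.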

If any call to pcre_exec() in a /g or /G sequence matches an empty
string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
flags set in order to search for another, non-empty, match at the same
point. If this second match fails, the start offset is advanced by
one, and the normal match is retried. This imitates the way Perl
handles such cases when using the /g modifier or the split() function.

There are a number of other modifiers for controlling the way pcretest
operates.

The /+ modifier requests that as well as outputting the substring that
matched the entire pattern, pcretest should in addition output the
remainder of the subject string. This is useful for tests where the
subject contains multiple copies of the same substring.

The /L modifier must be followed directly by the name of a locale, for
example,

    /pattern/Lfr

For this reason, it must be the last modifier letter. The given locale
is set, pcre_maketables() is called to build a set of character tables
for the locale, and this is then passed to pcre_compile() when
compiling the regular expression. Without an /L modifier, NULL is
passed as the tables pointer; that is, /L applies only to the
expression on which it appears.

The /I modifier requests that pcretest output information about the
compiled expression (whether it is anchored, has a fixed first
character, and so on). It does this by calling pcre_fullinfo() after
compiling an expression, and outputting the information it gets back.
If the pattern is studied, the results of that are also output.

The /D modifier is a PCRE debugging feature, which also assumes /I. It
causes the internal form of compiled regular expressions to be output
after compilation. If the pattern was studied, the information returned
is also output.

The /S modifier causes pcre_study() to be called after the expression
has been compiled, and the results used when the expression is matched.

The /M modifier causes the size of the memory block used to hold the
compiled pattern to be output.

The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
rather than its native API. When this is done, all other modifiers
except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
and REG_NEWLINE is set if /m is present. The wrapper functions force
PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.

The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
set. This turns on support for UTF-8 character handling in PCRE,
provided that it was compiled with this support enabled. This modifier
also causes any non-printing characters in output strings to be printed
using the \x{hh...} notation if they are valid UTF-8 sequences.

If the /? modifier is used with /8, it causes pcretest to call
pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
checking of the string for UTF-8 validity.


CALLOUTS

If the pattern contains any callout requests, pcretest's callout
function will be called. By default, it displays the callout number, and
the start and current positions in the text at the callout time. For
example, the output

    --->pqrabcdef
      0    ^  ^

indicates that callout number 0 occurred for a match attempt starting
at the fourth character of the subject string, when the pointer was at
the seventh character. The callout function returns zero (carry on
matching) by default.
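
The two-line display can be reproduced from the callout number and the
two offsets. The sketch below is inferred from the example above rather
than taken from pcretest's source: the function name callout_display is
hypothetical, and the three-column number field and caret layout are
assumptions chosen so that the carets fall under the fourth and seventh
characters.

```python
def callout_display(subject, number, start, current):
    """Return the subject line and the marker line for one callout."""
    first = "--->" + subject
    marker = f"{number:3d} " + " " * start + "^"       # first caret: match start
    if current > start:
        marker += " " * (current - start - 1) + "^"    # second caret: current position
    return [first, marker]

# Offsets are zero-based: 3 is the fourth character, 6 the seventh.
for line in callout_display("pqrabcdef", 0, 3, 6):
    print(line)
```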

Inserting callouts may be helpful when using pcretest to check
complicated regular expressions. For further information about
callouts, see the pcrecallout documentation.

For testing the PCRE library, additional control of callout behaviour
is available via escape sequences in the data, as described in the
following section. In particular, it is possible to pass in a number as
callout data (the default is zero). If the callout function receives a
non-zero number, it returns that value instead of zero.


DATA LINES

Before each data line is passed to pcre_exec(), leading and trailing
whitespace is removed, and it is then scanned for \ escapes. Some of
these are pretty esoteric features, intended for checking out some of
the more complicated features of PCRE. If you are just testing
"ordinary" regular expressions, you probably don't need any of these.
The following escapes are recognized:

    \a         alarm (= BEL)
    \b         backspace
    \e         escape
    \f         formfeed
    \n         newline
    \r         carriage return
    \t         tab
    \v         vertical tab
    \nnn       octal character (up to 3 octal digits)
    \xhh       hexadecimal character (up to 2 hex digits)
    \x{hh...}  hexadecimal character, any number of digits
                 in UTF-8 mode
    \A         pass the PCRE_ANCHORED option to pcre_exec()
    \B         pass the PCRE_NOTBOL option to pcre_exec()
    \Cdd       call pcre_copy_substring() for substring dd
                 after a successful match (any decimal number
                 less than 32)
    \Cname     call pcre_copy_named_substring() for substring
                 "name" after a successful match (name terminated
                 by next non-alphanumeric character)
    \C+        show the current captured substrings at callout
                 time
    \C-        do not supply a callout function
    \C!n       return 1 instead of 0 when callout number n is
                 reached
    \C!n!m     return 1 instead of 0 when callout number n is
                 reached for the mth time
    \C*n       pass the number n (may be negative) as callout
                 data
    \Gdd       call pcre_get_substring() for substring dd
                 after a successful match (any decimal number
                 less than 32)
    \Gname     call pcre_get_named_substring() for substring
                 "name" after a successful match (name terminated
                 by next non-alphanumeric character)
    \L         call pcre_get_substringlist() after a
                 successful match
    \M         discover the minimum MATCH_LIMIT setting
    \N         pass the PCRE_NOTEMPTY option to pcre_exec()
    \Odd       set the size of the output vector passed to
                 pcre_exec() to dd (any number of decimal
                 digits)
    \S         output details of memory get/free calls during
                 matching
    \Z         pass the PCRE_NOTEOL option to pcre_exec()
    \?         pass the PCRE_NO_UTF8_CHECK option to
                 pcre_exec()

If \M is present, pcretest calls pcre_exec() several times, with
different values in the match_limit field of the pcre_extra data
structure, until it finds the minimum number that is needed for
pcre_exec() to complete. This number is a measure of the amount of
recursion and backtracking that takes place, and checking it out can be
instructive. For most simple matches, the number is quite small, but
for patterns with very large numbers of matching possibilities, it can
become large very quickly with increasing length of subject string.
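
One way to carry out such a search is to double a trial limit until the
match completes and then bisect between the last failure and the first
success. The text above says only that several values are tried, so
this strategy is an assumption of the sketch, as are the name
find_min_limit and the completes(limit) callback, which stands in for a
pcre_exec() call that reports whether matching finished within the
given match_limit.

```python
def find_min_limit(completes, ceiling=10_000_000):
    """Return the smallest limit for which completes(limit) is true.

    Assumes completes is monotonic: once a limit suffices, every larger
    limit suffices too. Returns None if even the ceiling is not enough.
    """
    lo, hi = 0, 1                 # lo is known (or assumed) too small; hi is the next trial
    while not completes(hi):
        if hi >= ceiling:
            return None
        lo, hi = hi, min(hi * 2, ceiling)
    while hi - lo > 1:            # completes(hi) holds; lo is still too small
        mid = (lo + hi) // 2
        if completes(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Stand-in for a match that needs a limit of at least 37.
print(find_min_limit(lambda limit: limit >= 37))  # prints 37
```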

When \O is used, the value given may be higher or lower than the size
set by the -o option (or defaulted to 45); \O applies only to the call
of pcre_exec() for the line in which it appears.

A backslash followed by anything else just escapes the anything else.
If the very last character is a backslash, it is ignored. This gives a
way of passing an empty line as data, since a real empty line
terminates the data input.
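
As an illustration of the character escapes, here is a small decoder
covering only a subset of the rules above: the single-letter control
characters, \nnn, \xhh, the catch-all rule for other non-alphanumeric
characters, and the ignored trailing backslash. It is a sketch, not
pcretest's implementation: the name decode_data_line is hypothetical,
\x{hh...} and the escapes that set options or call functions are
deliberately rejected, and treating \x with no digits as character zero
is an assumption.

```python
import string

CONTROL = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
           "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

def decode_data_line(line):
    """Decode the character escapes in one data line (subset only)."""
    line = line.strip()                   # leading/trailing whitespace is removed
    out, i = [], 0
    while i < len(line):
        c = line[i]
        i += 1
        if c != "\\":
            out.append(c)
            continue
        if i == len(line):
            break                         # a final backslash is ignored
        c = line[i]
        i += 1
        if c in CONTROL:
            out.append(CONTROL[c])
        elif c in string.octdigits:       # \nnn: up to 3 octal digits
            digits = c
            while i < len(line) and len(digits) < 3 and line[i] in string.octdigits:
                digits += line[i]
                i += 1
            out.append(chr(int(digits, 8)))
        elif c == "x" and not line.startswith("{", i):   # \xhh: up to 2 hex digits
            digits = ""
            while i < len(line) and len(digits) < 2 and line[i] in string.hexdigits:
                digits += line[i]
                i += 1
            out.append(chr(int(digits, 16) if digits else 0))
        elif c.isalnum() or c == "?":
            raise ValueError("escape \\%s is outside this sketch" % c)
        else:
            out.append(c)                 # anything else just stands for itself
    return "".join(out)

print(repr(decode_data_line(r"one\ttwo\x21")))
```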

If /P was present on the regex, causing the POSIX wrapper API to be
used, only \B and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL
to be passed to regexec() respectively.

The use of \x{hh...} to represent UTF-8 characters is not dependent on
the use of the /8 modifier on the pattern. It is recognized always.
There may be any number of hexadecimal digits inside the braces. The
result is from one to six bytes, encoded according to the UTF-8 rules.
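
The one-to-six-byte range corresponds to the original UTF-8 scheme,
which covers values up to 0x7FFFFFFF. A minimal encoder for that scheme
is sketched below; the function name utf8_encode is invented for the
illustration, and the code is not taken from pcretest.

```python
def utf8_encode(cp):
    """Encode one code point (0..0x7FFFFFFF) as 1 to 6 UTF-8 bytes."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])                # single byte, as in ASCII
    # (largest value, total length, lead-byte marker) for each longer form
    forms = [(0x7FF, 2, 0xC0), (0xFFFF, 3, 0xE0), (0x1FFFFF, 4, 0xF0),
             (0x3FFFFFF, 5, 0xF8), (0x7FFFFFFF, 6, 0xFC)]
    for limit, length, lead in forms:
        if cp <= limit:
            out = bytearray(length)
            for i in range(length - 1, 0, -1):
                out[i] = 0x80 | (cp & 0x3F)   # continuation byte: 10xxxxxx
                cp >>= 6
            out[0] = lead | cp                # remaining high bits go in the lead byte
            return bytes(out)
    raise ValueError("value too large for six bytes")

print(utf8_encode(0x20AC).hex())  # prints e282ac
```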


OUTPUT FROM PCRETEST

When a match succeeds, pcretest outputs the list of captured substrings
that pcre_exec() returns, starting with number 0 for the string that
matched the whole pattern. Here is an example of an interactive
pcretest run.

    $ pcretest
    PCRE version 4.00 08-Jan-2003

      re> /^abc(\d+)/
    data> abc123
     0: abc123
     1: 123
    data> xyz
    No match

If the strings contain any non-printing characters, they are output as
\xhh escapes, or as \x{...} escapes if the /8 modifier was present on
the pattern. If the pattern has the /+ modifier, then the output for
substring 0 is followed by the rest of the subject string, identified
by "0+" like this:

      re> /cat/+
    data> cataract
     0: cat
     0+ aract
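
The layout of these result lines can be imitated with a few lines of
code. The sketch below uses Python's re module as a stand-in for PCRE,
so it illustrates only the output format; the function name
format_result, the two-column substring number, and the "<unset>" text
for groups that did not participate are assumptions made for this
example.

```python
import re

def format_result(pattern, subject, show_rest=False):
    """Return output lines in the style of the examples above."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    lines = []
    for n in range(m.re.groups + 1):
        text = m.group(n)
        lines.append("%2d: %s" % (n, "<unset>" if text is None else text))
        if n == 0 and show_rest:          # corresponds to the /+ modifier
            lines.append(" 0+ " + subject[m.end():])
    return lines

for line in format_result(r"^abc(\d+)", "abc123"):
    print(line)
```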

If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:

      re> /\Bi(\w\w)/g
    data> Mississippi
     0: iss
     1: ss
     0: iss
     1: ss
     0: ipp
     1: pp

"No match" is output only if the first match attempt fails.

If any of the sequences \C, \G, or \L are present in a data line that
is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
(that is, the return from the extraction function) is given in
parentheses after each string for \C and \G.

Note that while patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However,
newlines can be included in data by means of the \n escape.


AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
