Contents of /code/trunk/doc/pcretest.txt

Revision 71, Sat Feb 24 21:40:24 2007 UTC, by nigel
Load pcre-4.4 into code/trunk.

NAME
     pcretest - a program for testing Perl-compatible regular
     expressions.

SYNOPSIS
     pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
     [destination]

     pcretest was written as a test program for the PCRE regular
     expression library itself, but it can also be used for
     experimenting with regular expressions. This document
     describes the features of the test program; for details of
     the regular expressions themselves, see the pcrepattern
     documentation. For details of PCRE and its options, see the
     pcreapi documentation.


OPTIONS

     -C        Output the version number of the PCRE library, and
               all available information about the optional
               features that are included, and then exit.

     -d        Behave as if each regex had the /D modifier (see
               below); the internal form is output after
               compilation.

     -i        Behave as if each regex had the /I modifier;
               information about the compiled pattern is given
               after compilation.

     -m        Output the size of each compiled pattern after it
               has been compiled. This is equivalent to adding /M
               to each regular expression. For compatibility with
               earlier versions of pcretest, -s is a synonym for
               -m.

     -o osize  Set the number of elements in the output vector
               that is used when calling PCRE to be osize. The
               default value is 45, which is enough for 14
               capturing subexpressions. The vector size can be
               changed for individual matching calls by including
               \O in the data line (see below).
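
     The relationship between osize and the number of capturing
     subexpressions can be sketched as follows. The sketch
     assumes the layout described in the pcreapi documentation:
     the vector is used in groups of three integers, two holding
     the offsets of one substring and the third serving as
     workspace, with the first pair reserved for the whole match.
     The function name is chosen here for illustration and is not
     part of PCRE or pcretest.

```python
def max_capturing_subexpressions(osize):
    """Number of capturing subexpressions an output vector can report.

    Assumes one usable offset pair per three vector elements, with
    pair 0 taken by the string that matched the whole pattern.
    """
    pairs = osize // 3
    return max(pairs - 1, 0)


print(max_capturing_subexpressions(45))  # 14, the figure quoted for the default
```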

     -p        Behave as if each regex had the /P modifier; the
               POSIX wrapper API is used to call PCRE. None of
               the other options has any effect when -p is set.

     -t        Run each compile, study, and match many times with
               a timer, and output the resulting time per compile
               or match (in milliseconds). Do not set -t with -m,
               because you will then get the size output 20000
               times and the timing will be distorted.


DESCRIPTION

     If pcretest is given two filename arguments, it reads from
     the first and writes to the second. If it is given only one
     filename argument, it reads from that file and writes to
     stdout. Otherwise, it reads from stdin and writes to stdout,
     and prompts for each line of input, using "re>" to prompt
     for regular expressions, and "data>" to prompt for data
     lines.

     The program handles any number of sets of input on a single
     input file. Each set starts with a regular expression, and
     continues with any number of data lines to be matched
     against the pattern.

     Each line is matched separately and independently. If you
     want to do multiple-line matches, you have to use the \n
     escape sequence in a single line of input to encode the
     newline characters. The maximum length of a data line is
     30,000 characters.

     An empty line signals the end of the data lines, at which
     point a new regular expression is read. The regular
     expressions are given enclosed in any non-alphameric
     delimiters other than backslash, for example

       /(a|bc)x+yz/

     White space before the initial delimiter is ignored. A
     regular expression may be continued over several input
     lines, in which case the newline characters are included
     within it. It is possible to include the delimiter within
     the pattern by escaping it, for example

       /abc\/def/

     If you do so, the escape and the delimiter form part of the
     pattern, but since delimiters are always non-alphameric,
     this does not affect its interpretation. If the terminating
     delimiter is immediately followed by a backslash, for
     example,

       /abc/\

     then a backslash is added to the end of the pattern. This is
     done to provide a way of testing the error condition that
     arises if a pattern finishes with a backslash, because

       /abc\/

     is interpreted as the first line of a pattern that starts
     with "abc/", causing pcretest to read the next line as a
     continuation of the regular expression.
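
     The delimiter rules above can be illustrated with a small
     Python sketch. It is not pcretest's own code: split_pattern
     is a hypothetical helper that takes the text read so far and
     returns the pattern and the trailing modifier text, or None
     when the closing delimiter has not been seen yet and another
     input line would be needed.

```python
def split_pattern(text):
    """Split '/pattern/modifiers' following the rules described above.

    Returns (pattern, modifiers), or None if the pattern is not yet
    terminated. Hypothetical helper, written for illustration only.
    """
    text = text.lstrip()                 # white space before the delimiter is ignored
    if not text:
        return None
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphameric and not a backslash")
    i = 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2                       # the escape and the escaped character both stay
            continue
        if text[i] == delim:
            break
        i += 1
    else:
        return None                      # no closing delimiter: continuation needed
    pattern, rest = text[1:i], text[i + 1:]
    if rest.startswith("\\"):            # '/abc/\' adds a backslash to the pattern
        pattern += "\\"
        rest = rest[1:]
    return pattern, rest.rstrip()


print(split_pattern(r"/abc\/def/"))
print(split_pattern("/abc/\\"))
print(split_pattern("/abc\\/"))
```

     The three calls correspond to the three examples in the
     text: an escaped delimiter kept inside the pattern, a
     trailing backslash appended to the pattern, and an
     unterminated pattern that would be continued on the next
     line.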


PATTERN MODIFIERS

     The pattern may be followed by i, m, s, or x to set the
     PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
     options, respectively. For example:

       /caseless/i

     These modifier letters have the same effect as they do in
     Perl. There are others that set PCRE options that do not
     correspond to anything in Perl: /A, /E, /N, /U, and /X set
     PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE,
     PCRE_UNGREEDY, and PCRE_EXTRA respectively.
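
     As a quick reference, the letter-to-option mapping can be
     written as a lookup table. The option names are those
     documented in pcreapi; the table and the options_for helper
     are illustrative only and are not part of pcretest.

```python
# Modifier letter -> PCRE compile option, as documented in pcreapi.
MODIFIER_OPTIONS = {
    "i": "PCRE_CASELESS",
    "m": "PCRE_MULTILINE",
    "s": "PCRE_DOTALL",
    "x": "PCRE_EXTENDED",
    "A": "PCRE_ANCHORED",
    "E": "PCRE_DOLLAR_ENDONLY",
    "N": "PCRE_NO_AUTO_CAPTURE",
    "U": "PCRE_UNGREEDY",
    "X": "PCRE_EXTRA",
}


def options_for(modifiers):
    """List the PCRE options selected by a string of modifier letters.

    Letters that control pcretest itself (such as g or I) are skipped.
    """
    return [MODIFIER_OPTIONS[m] for m in modifiers if m in MODIFIER_OPTIONS]


print(options_for("ix"))  # ['PCRE_CASELESS', 'PCRE_EXTENDED']
```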

     Searching for all possible matches within each subject
     string can be requested by the /g or /G modifier. After
     finding a match, PCRE is called again to search the
     remainder of the subject string. The difference between /g
     and /G is that the former uses the startoffset argument to
     pcre_exec() to start searching at a new point within the
     entire string (which is in effect what Perl does), whereas
     the latter passes over a shortened substring. This makes a
     difference to the matching process if the pattern begins
     with a lookbehind assertion (including \b or \B).
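
     The effect of the two strategies can be seen with any
     matcher that accepts a start offset. The sketch below uses
     Python's re module as a stand-in for PCRE, which is an
     assumption made purely for illustration: the pos argument of
     search() plays the role of startoffset, and slicing the
     string plays the role of the shortened substring.

```python
import re

rx = re.compile(r"\Bi")
subject = "Mississippi"
offset = 1

# /g style: the whole subject is visible, matching starts at the offset,
# so \B can see the "M" that precedes the first "i".
g_start = rx.search(subject, offset).start()

# /G style: the matcher only sees "ississippi", so the first "i" now sits
# at the start of the string, where \B cannot match.
G_start = rx.search(subject[offset:]).start() + offset

print(g_start, G_start)
```

     With the start offset, the match is found at the first "i";
     with the shortened substring, that "i" is skipped and the
     match moves to a later one.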

     If any call to pcre_exec() in a /g or /G sequence matches an
     empty string, the next call is done with the PCRE_NOTEMPTY
     and PCRE_ANCHORED flags set in order to search for another,
     non-empty, match at the same point. If this second match
     fails, the start offset is advanced by one, and the normal
     match is retried. This imitates the way Perl handles such
     cases when using the /g modifier or the split() function.
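
     The retry logic can be sketched as a loop. Python's re
     module has no equivalent of PCRE_NOTEMPTY, so the helper
     anchored_nonempty below only approximates it: it accepts an
     anchored match if it happens to be non-empty, whereas PCRE
     would also backtrack into non-empty alternatives. All names
     are hypothetical.

```python
import re


def anchored_nonempty(rx, subject, pos):
    """Approximate PCRE_NOTEMPTY | PCRE_ANCHORED at position pos."""
    m = rx.match(subject, pos)
    return m if m is not None and m.end() > m.start() else None


def find_all(pattern, subject):
    """Collect (start, end) spans the way a /g sequence is described above."""
    rx = re.compile(pattern)
    spans = []
    pos = 0
    retry_nonempty = False
    while pos <= len(subject):
        if retry_nonempty:
            m = anchored_nonempty(rx, subject, pos)
            if m is None:
                pos += 1                 # second attempt failed: advance by one
                retry_nonempty = False   # and go back to a normal match
                continue
        else:
            m = rx.search(subject, pos)
            if m is None:
                break
        spans.append((m.start(), m.end()))
        retry_nonempty = m.end() == m.start()   # empty match: retry at same point
        pos = m.end()
    return spans


print(find_all("a*", "baaa"))  # [(0, 0), (1, 4), (4, 4)]
```

     Without the special handling of empty matches, the loop
     would find the empty string at offset 0 again and again.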

     There are a number of other modifiers for controlling the
     way pcretest operates.

     The /+ modifier requests that as well as outputting the
     substring that matched the entire pattern, pcretest should
     in addition output the remainder of the subject string. This
     is useful for tests where the subject contains multiple
     copies of the same substring.

     The /L modifier must be followed directly by the name of a
     locale, for example,

       /pattern/Lfr

     For this reason, it must be the last modifier letter. The
     given locale is set, pcre_maketables() is called to build a
     set of character tables for the locale, and this is then
     passed to pcre_compile() when compiling the regular
     expression. Without an /L modifier, NULL is passed as the
     tables pointer; that is, /L applies only to the expression
     on which it appears.

     The /I modifier requests that pcretest output information
     about the compiled expression (whether it is anchored, has a
     fixed first character, and so on). It does this by calling
     pcre_fullinfo() after compiling an expression, and
     outputting the information it gets back. If the pattern is
     studied, the results of that are also output.

     The /D modifier is a PCRE debugging feature, which also
     assumes /I. It causes the internal form of compiled regular
     expressions to be output after compilation. If the pattern
     was studied, the information returned is also output.

     The /S modifier causes pcre_study() to be called after the
     expression has been compiled, and the results used when the
     expression is matched.

     The /M modifier causes the size of memory block used to hold
     the compiled pattern to be output.

     The /P modifier causes pcretest to call PCRE via the POSIX
     wrapper API rather than its native API. When this is done,
     all other modifiers except /i, /m, and /+ are ignored.
     REG_ICASE is set if /i is present, and REG_NEWLINE is set if
     /m is present. The wrapper functions force
     PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
     REG_NEWLINE is set.
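
     The /P rules amount to a small decision table, sketched
     below. The function name is hypothetical, and the sketch
     models only what the paragraph above states: which regcomp()
     flags pcretest sets, and which PCRE options the wrapper then
     forces.

```python
def posix_wrapper_flags(modifiers):
    """Return (regcomp flags, forced PCRE options) for a /P pattern."""
    regcomp_flags = set()
    if "i" in modifiers:
        regcomp_flags.add("REG_ICASE")
    if "m" in modifiers:
        regcomp_flags.add("REG_NEWLINE")

    forced = {"PCRE_DOLLAR_ENDONLY"}         # always forced by the wrapper
    if "REG_NEWLINE" not in regcomp_flags:
        forced.add("PCRE_DOTALL")            # forced unless REG_NEWLINE is set
    return regcomp_flags, forced


print(posix_wrapper_flags("Pi"))
print(posix_wrapper_flags("Pm"))
```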

     The /8 modifier causes pcretest to call PCRE with the
     PCRE_UTF8 option set. This turns on support for UTF-8
     character handling in PCRE, provided that it was compiled
     with this support enabled. This modifier also causes any
     non-printing characters in output strings to be printed
     using the \x{hh...} notation if they are valid UTF-8
     sequences.

     If the /? modifier is used with /8, it causes pcretest to
     call pcre_compile() with the PCRE_NO_UTF8_CHECK option, to
     suppress the checking of the string for UTF-8 validity.


CALLOUTS

     If the pattern contains any callout requests, pcretest's
     callout function will be called. By default, it displays the
     callout number, and the start and current positions in the
     text at the callout time. For example, the output

       --->pqrabcdef
         0    ^  ^

     indicates that callout number 0 occurred for a match attempt
     starting at the fourth character of the subject string, when
     the pointer was at the seventh character. The callout
     function returns zero (carry on matching) by default.

     Inserting callouts may be helpful when using pcretest to
     check complicated regular expressions. For further
     information about callouts, see the pcrecallout
     documentation.

     For testing the PCRE library, additional control of callout
     behaviour is available via escape sequences in the data, as
     described in the following section. In particular, it is
     possible to pass in a number as callout data (the default is
     zero). If the callout function receives a non-zero number,
     it returns that value instead of zero.
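
     To make the default display concrete, here is a sketch of
     how such a two-line report could be produced. The column
     layout is an assumption reconstructed from the example above
     (a "--->" prefix, then carets under the start and current
     positions); format_callout is not a pcretest function.

```python
def format_callout(number, subject, start, current):
    """Render a callout report: the subject line plus a line of carets.

    start and current are 0-based offsets into the subject.
    """
    marks = [" "] * (max(start, current) + 1)
    marks[start] = "^"
    marks[current] = "^"
    # The number is right-justified under the 4-character "--->" prefix.
    return "--->" + subject + "\n" + f"{number:>3} " + "".join(marks)


print(format_callout(0, "pqrabcdef", 3, 6))
```

     The offsets 3 and 6 are the 0-based equivalents of the
     fourth and seventh characters mentioned in the text.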


DATA LINES

     Before each data line is passed to pcre_exec(), leading and
     trailing whitespace is removed, and it is then scanned for \
     escapes. Some of these are pretty esoteric features,
     intended for checking out some of the more complicated
     features of PCRE. If you are just testing "ordinary" regular
     expressions, you probably don't need any of these. The
     following escapes are recognized:

       \a         alarm (= BEL)
       \b         backspace
       \e         escape
       \f         formfeed
       \n         newline
       \r         carriage return
       \t         tab
       \v         vertical tab
       \nnn       octal character (up to 3 octal digits)
       \xhh       hexadecimal character (up to 2 hex digits)
       \x{hh...}  hexadecimal character, any number of digits
                  in UTF-8 mode
       \A         pass the PCRE_ANCHORED option to pcre_exec()
       \B         pass the PCRE_NOTBOL option to pcre_exec()
       \Cdd       call pcre_copy_substring() for substring dd
                  after a successful match (any decimal number
                  less than 32)
       \Cname     call pcre_copy_named_substring() for substring
                  "name" after a successful match (name
                  terminated by next non-alphanumeric character)
       \C+        show the current captured substrings at callout
                  time
       \C-        do not supply a callout function
       \C!n       return 1 instead of 0 when callout number n is
                  reached
       \C!n!m     return 1 instead of 0 when callout number n is
                  reached for the mth time
       \C*n       pass the number n (may be negative) as callout
                  data
       \Gdd       call pcre_get_substring() for substring dd
                  after a successful match (any decimal number
                  less than 32)
       \Gname     call pcre_get_named_substring() for substring
                  "name" after a successful match (name
                  terminated by next non-alphanumeric character)
       \L         call pcre_get_substringlist() after a
                  successful match
       \M         discover the minimum MATCH_LIMIT setting
       \N         pass the PCRE_NOTEMPTY option to pcre_exec()
       \Odd       set the size of the output vector passed to
                  pcre_exec() to dd (any number of decimal
                  digits)
       \Z         pass the PCRE_NOTEOL option to pcre_exec()
       \?         pass the PCRE_NO_UTF8_CHECK option to
                  pcre_exec()

     If \M is present, pcretest calls pcre_exec() several times,
     with different values in the match_limit field of the
     pcre_extra data structure, until it finds the minimum number
     that is needed for pcre_exec() to complete. This number is a
     measure of the amount of recursion and backtracking that
     takes place, and checking it out can be instructive. For
     most simple matches, the number is quite small, but for
     patterns with very large numbers of matching possibilities,
     it can become large very quickly with increasing length of
     subject string.
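
     One way such a search could be organised is shown below. The
     text does not say which search strategy pcretest uses, so
     the doubling-then-bisecting approach is an assumption, and
     completes is a hypothetical callback standing in for "call
     pcre_exec() with this match_limit and report whether it ran
     to completion".

```python
def find_min_match_limit(completes, upper=10_000_000):
    """Smallest limit for which completes(limit) is true, or None.

    Relies on completes() being monotonic: once a limit is large
    enough, every larger limit is large enough too.
    """
    lo, hi = 0, 1                  # lo is always a limit known to be too small
    while not completes(hi):       # phase 1: double until a limit works
        if hi >= upper:
            return None
        lo, hi = hi, min(hi * 2, upper)
    while hi - lo > 1:             # phase 2: bisect between failure and success
        mid = (lo + hi) // 2
        if completes(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Stand-in for a match that needs a limit of at least 37.
print(find_min_match_limit(lambda limit: limit >= 37))  # 37
```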

     When \O is used, the value may be higher or lower than the
     size set by the -o option (or defaulted to 45); \O applies
     only to the call of pcre_exec() for the line in which it
     appears.

     A backslash followed by anything else just escapes the
     anything else. If the very last character is a backslash, it
     is ignored. This gives a way of passing an empty line as
     data, since a real empty line terminates the data input.
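
     The character escapes and the simplest option escapes can be
     modelled in a few lines. This is a partial sketch, not
     pcretest's parser: decode_data_line is a hypothetical name,
     input is assumed to be 8-bit text, octal values above 255
     are truncated to one byte, \x with no digits is assumed to
     yield a zero byte, and the \C, \G, \L, \M, \O, \? and
     \x{hh...} forms are deliberately left out.

```python
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
FLAGS = {"A": "PCRE_ANCHORED", "B": "PCRE_NOTBOL",
         "N": "PCRE_NOTEMPTY", "Z": "PCRE_NOTEOL"}
OCT = "01234567"
HEX = "0123456789abcdefABCDEF"


def decode_data_line(line):
    """Return (subject bytes, set of pcre_exec option names)."""
    text = line.strip()                      # leading/trailing whitespace removed
    out, flags, i = bytearray(), set(), 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            out += ch.encode("latin-1")
            continue
        if i == len(text):
            break                            # a final backslash is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc in OCT:                     # \nnn: up to 3 octal digits
            digits = esc
            while len(digits) < 3 and i < len(text) and text[i] in OCT:
                digits += text[i]
                i += 1
            out.append(int(digits, 8) & 0xFF)
        elif esc == "x":                     # \xhh: up to 2 hex digits
            if i < len(text) and text[i] == "{":
                raise NotImplementedError("\\x{hh...} is outside this sketch")
            digits = ""
            while len(digits) < 2 and i < len(text) and text[i] in HEX:
                digits += text[i]
                i += 1
            out.append(int(digits or "0", 16))
        elif esc in FLAGS:
            flags.add(FLAGS[esc])
        elif esc in "CGLMO?":
            raise NotImplementedError("\\" + esc + " is outside this sketch")
        else:
            out += esc.encode("latin-1")     # anything else escapes itself
    return bytes(out), flags


print(decode_data_line(r"  abc\n\101\x41\Z  "))
print(decode_data_line("\\"))                # an empty subject
```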

     If /P was present on the regex, causing the POSIX wrapper
     API to be used, only \B and \Z have any effect, causing
     REG_NOTBOL and REG_NOTEOL to be passed to regexec()
     respectively.

     The use of \x{hh...} to represent UTF-8 characters is not
     dependent on the use of the /8 modifier on the pattern. It
     is recognized always. There may be any number of hexadecimal
     digits inside the braces. The result is from one to six
     bytes, encoded according to the UTF-8 rules.
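
     The "one to six bytes" refers to the original UTF-8 scheme,
     which covers values up to 31 bits. Python's built-in encoder
     stops at four bytes, so the sketch below spells the rule
     out by hand. The function name is hypothetical; it is an
     illustration of the encoding rule, not code taken from
     pcretest.

```python
def utf8_encode_extended(cp):
    """Encode a value of up to 31 bits as 1 to 6 UTF-8 style bytes."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper bound, total byte count, lead-byte marker)
    forms = [(0x800, 2, 0xC0), (0x10000, 3, 0xE0), (0x200000, 4, 0xF0),
             (0x4000000, 5, 0xF8), (0x80000000, 6, 0xFC)]
    for bound, length, lead in forms:
        if cp < bound:
            tail = []
            for _ in range(length - 1):      # low 6 bits per continuation byte
                tail.append(0x80 | (cp & 0x3F))
                cp >>= 6
            return bytes([lead | cp] + tail[::-1])


print(utf8_encode_extended(0x20AC).hex())        # e282ac
print(len(utf8_encode_extended(0x7FFFFFFF)))     # 6
```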


OUTPUT FROM PCRETEST

     When a match succeeds, pcretest outputs the list of captured
     substrings that pcre_exec() returns, starting with number 0
     for the string that matched the whole pattern. Here is an
     example of an interactive pcretest run.

       $ pcretest
       PCRE version 4.00 08-Jan-2003

         re> /^abc(\d+)/
       data> abc123
        0: abc123
        1: 123
       data> xyz
       No match

     If the strings contain any non-printing characters, they are
     output as \0x escapes, or as \x{...} escapes if the /8
     modifier was present on the pattern. If the pattern has the
     /+ modifier, then the output for substring 0 is followed by
     the rest of the subject string, identified by "0+" like
     this:

         re> /cat/+
       data> cataract
        0: cat
        0+ aract

     If the pattern has the /g or /G modifier, the results of
     successive matching attempts are output in sequence, like
     this:

         re> /\Bi(\w\w)/g
       data> Mississippi
        0: iss
        1: ss
        0: iss
        1: ss
        0: ipp
        1: pp

     "No match" is output only if the first match attempt fails.
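
     The listing above can be reproduced with a short sketch that
     uses Python's re module in place of PCRE, which is an
     assumption made for illustration; for this pattern the two
     engines agree. format_matches is a hypothetical helper, and
     the two-column number format is inferred from the examples
     in this section.

```python
import re


def format_matches(pattern, subject):
    """Format successive matches as numbered substring lists."""
    lines = []
    for m in re.finditer(pattern, subject):
        for n in range(m.re.groups + 1):     # 0 is the whole match
            lines.append(f"{n:2d}: {m.group(n)}")
    return lines or ["No match"]             # only when nothing matched at all


for line in format_matches(r"\Bi(\w\w)", "Mississippi"):
    print(line)
```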

     If any of the sequences \C, \G, or \L are present in a data
     line that is successfully matched, the substrings extracted
     by the convenience functions are output with C, G, or L
     after the string number instead of a colon. This is in
     addition to the normal full list. The string length (that
     is, the return from the extraction function) is given in
     parentheses after each string for \C and \G.

     Note that while patterns can be continued over several lines
     (a plain ">" prompt is used for continuations), data lines
     may not. However, newlines can be included in data by means
     of the \n escape.


AUTHOR

     Philip Hazel <ph10@cam.ac.uk>
     University Computing Service,
     Cambridge CB2 3QG, England.

Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
