Contents of /code/trunk/doc/pcretest.txt

Revision 63, Sat Feb 24 21:40:03 2007 UTC, by nigel
Load pcre-4.0 into code/trunk.

NAME
     pcretest - a program for testing Perl-compatible regular
     expressions.

SYNOPSIS
     pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
     [destination]

     pcretest was written as a test program for the PCRE regular
     expression library itself, but it can also be used for
     experimenting with regular expressions. This document
     describes the features of the test program; for details of
     the regular expressions themselves, see the pcrepattern
     documentation. For details of PCRE and its options, see the
     pcreapi documentation.

OPTIONS

     -C        Output the version number of the PCRE library, and
               all available information about the optional
               features that are included, and then exit.

     -d        Behave as if each regex had the /D modifier (see
               below); the internal form is output after
               compilation.

     -i        Behave as if each regex had the /I modifier;
               information about the compiled pattern is given
               after compilation.

     -m        Output the size of each compiled pattern after it
               has been compiled. This is equivalent to adding /M
               to each regular expression. For compatibility with
               earlier versions of pcretest, -s is a synonym for
               -m.

     -o osize  Set the number of elements in the output vector
               that is used when calling PCRE to be osize. The
               default value is 45, which is enough for 14
               capturing subexpressions. The vector size can be
               changed for individual matching calls by including
               \O in the data line (see below).

     -p        Behave as if each regex had the /P modifier; the
               POSIX wrapper API is used to call PCRE. None of the
               other options has any effect when -p is set.

     -t        Run each compile, study, and match many times with
               a timer, and output the resulting time per compile
               or match (in milliseconds). Do not set -t with -m,
               because you will then get the size output 20000
               times and the timing will be distorted.

DESCRIPTION

     If pcretest is given two filename arguments, it reads from
     the first and writes to the second. If it is given only one
     filename argument, it reads from that file and writes to
     stdout. Otherwise, it reads from stdin and writes to stdout,
     and prompts for each line of input, using "re>" to prompt
     for regular expressions, and "data>" to prompt for data
     lines.

     The program handles any number of sets of input on a single
     input file. Each set starts with a regular expression, and
     continues with any number of data lines to be matched
     against the pattern.

     Each line is matched separately and independently. If you
     want to do multiple-line matches, you have to use the \n
     escape sequence in a single line of input to encode the
     newline characters. The maximum length of a data line is
     30,000 characters.

     An empty line signals the end of the data lines, at which
     point a new regular expression is read. The regular
     expressions are given enclosed in any non-alphanumeric
     delimiters other than backslash, for example

       /(a|bc)x+yz/

     White space before the initial delimiter is ignored. A
     regular expression may be continued over several input
     lines, in which case the newline characters are included
     within it. It is possible to include the delimiter within
     the pattern by escaping it, for example

       /abc\/def/

     If you do so, the escape and the delimiter form part of the
     pattern, but since delimiters are always non-alphanumeric,
     this does not affect its interpretation. If the terminating
     delimiter is immediately followed by a backslash, for
     example,

       /abc/\

     then a backslash is added to the end of the pattern. This is
     done to provide a way of testing the error condition that
     arises if a pattern finishes with a backslash, because

       /abc\/

     is interpreted as the first line of a pattern that starts
     with "abc/", causing pcretest to read the next line as a
     continuation of the regular expression.
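
     The delimiter rules described above can be illustrated with
     a short Python sketch. It is not pcretest's own parser: the
     function name split_pattern_line is hypothetical, only a
     single input line is handled, and a return value of None
     stands for the case in which pcretest would go on to read a
     continuation line.

```python
def split_pattern_line(line):
    """Split one input line into (pattern, modifiers).

    Illustrative only. Returns None when no closing delimiter is
    found, i.e. when the pattern would continue on the next line.
    """
    text = line.lstrip()              # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty line")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")

    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2                    # the escape and the escaped character both stay
            continue
        if ch == delim:
            break                     # unescaped closing delimiter
        i += 1
    else:
        return None                   # no closing delimiter on this line

    pattern = text[1:i]
    rest = text[i + 1:]
    if rest.startswith("\\"):
        pattern += "\\"               # "/abc/\" puts a backslash at the end of the pattern
        rest = rest[1:]
    return pattern, rest
```

     With this sketch, the line /abc\/def/ yields the pattern
     abc\/def with no modifiers, /abc/\ yields abc followed by a
     backslash, and /abc\/ yields None because the only "/" after
     the opening delimiter is escaped.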

PATTERN MODIFIERS

     The pattern may be followed by i, m, s, or x to set the
     PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
     options, respectively. For example:

       /caseless/i

     These modifier letters have the same effect as they do in
     Perl. There are others which set PCRE options that do not
     correspond to anything in Perl: /A, /E, and /X set
     PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA
     respectively.
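
     As a rough illustration of what the four Perl-style letters
     do, the following sketch maps each one to the PCRE option
     name used in the pcreapi documentation and to the closest
     flag in Python's re module. Python's re is a different
     engine from PCRE, so this is only an analogy; the names
     MODIFIERS and compile_with_modifiers are hypothetical.

```python
import re

# Modifier letter -> (PCRE option name, closest Python re flag).
# The Python flags are stand-ins for illustration, not PCRE itself.
MODIFIERS = {
    "i": ("PCRE_CASELESS", re.IGNORECASE),
    "m": ("PCRE_MULTILINE", re.MULTILINE),
    "s": ("PCRE_DOTALL", re.DOTALL),
    "x": ("PCRE_EXTENDED", re.VERBOSE),
}


def compile_with_modifiers(pattern, letters):
    """Compile pattern with the re flags selected by the modifier letters."""
    flags = 0
    for letter in letters:
        flags |= MODIFIERS[letter][1]
    return re.compile(pattern, flags)
```

     For example, compile_with_modifiers("caseless", "i") matches
     "CASELESS", and compile_with_modifiers("^b", "m") matches
     the "b" that follows the newline in "a\nb".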

     Searching for all possible matches within each subject
     string can be requested by the /g or /G modifier. After
     finding a match, PCRE is called again to search the
     remainder of the subject string. The difference between /g
     and /G is that the former uses the startoffset argument to
     pcre_exec() to start searching at a new point within the
     entire string (which is in effect what Perl does), whereas
     the latter passes over a shortened substring. This makes a
     difference to the matching process if the pattern begins
     with a lookbehind assertion (including \b or \B).
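
     The difference between the two approaches can be seen in a
     small sketch that uses Python's re module as a stand-in for
     PCRE. The pos argument of search() plays the role of
     startoffset, and slicing plays the role of the shortened
     substring. The function names are hypothetical, and empty
     matches are deliberately left out of scope.

```python
import re


def matches_g(pattern, subject):
    """/g style: search the entire string from an advancing offset."""
    regex = re.compile(pattern)
    found, pos = [], 0
    while True:
        m = regex.search(subject, pos)   # text before pos is still visible to \b, \B
        if m is None or m.end() == m.start():
            break                        # stop on failure (empty matches not handled here)
        found.append(m.group(0))
        pos = m.end()
    return found


def matches_G(pattern, subject):
    """/G style: search a shortened substring each time."""
    regex = re.compile(pattern)
    found, rest = [], subject
    while True:
        m = regex.search(rest)           # earlier text has been cut away
        if m is None or m.end() == m.start():
            break
        found.append(m.group(0))
        rest = rest[m.end():]
    return found
```

     For the pattern \Bi(\w\w) and the subject "Mississippi",
     matches_g finds "iss", "iss", and "ipp", whereas matches_G
     finds only "iss" and "ipp": once the first match has been
     cut away, the second "i" sits at the start of the substring,
     where \B cannot match.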

     If any call to pcre_exec() in a /g or /G sequence matches an
     empty string, the next call is done with the PCRE_NOTEMPTY
     and PCRE_ANCHORED flags set in order to search for another,
     non-empty, match at the same point. If this second match
     fails, the start offset is advanced by one, and the normal
     match is retried. This imitates the way Perl handles such
     cases when using the /g modifier or the split() function.
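
     The control flow of this empty-match handling can be
     sketched in Python. Python's re module has no equivalent of
     PCRE_NOTEMPTY, so the sketch emulates the anchored,
     non-empty retry by matching a slice of the subject with a
     trailing (?!\A) assertion; that trick, the use of slices
     (which loses lookbehind context, as with /G), and the name
     global_match_spans are all assumptions of this illustration
     rather than anything pcretest does.

```python
import re


def global_match_spans(pattern, subject):
    """Return (start, end) spans of successive matches, empty ones included."""
    normal = re.compile(pattern)
    # match() anchors at the start of the slice; (?!\A) rejects any match
    # that ends where it began, forcing the engine to look for a non-empty one.
    nonempty_anchored = re.compile(r"(?:%s)(?!\A)" % pattern)

    spans = []
    pos = 0
    retry_nonempty = False
    while pos <= len(subject):
        rest = subject[pos:]
        if retry_nonempty:
            m = nonempty_anchored.match(rest)
            retry_nonempty = False
            if m is None:
                pos += 1                 # second attempt failed: advance by one
                continue                 # and retry the normal match
        else:
            m = normal.search(rest)
            if m is None:
                break
        start, end = pos + m.start(), pos + m.end()
        spans.append((start, end))
        retry_nonempty = (start == end)  # an empty match triggers the special retry
        pos = end
    return spans
```

     For the pattern a* and the subject "baac", the sketch
     reports an empty match at offset 0, "aa" at offsets 1 to 3,
     and empty matches at offsets 3 and 4.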

     There are a number of other modifiers for controlling the
     way pcretest operates.

     The /+ modifier requests that as well as outputting the
     substring that matched the entire pattern, pcretest should
     in addition output the remainder of the subject string. This
     is useful for tests where the subject contains multiple
     copies of the same substring.

     The /L modifier must be followed directly by the name of a
     locale, for example,

       /pattern/Lfr

     For this reason, it must be the last modifier letter. The
     given locale is set, pcre_maketables() is called to build a
     set of character tables for the locale, and this is then
     passed to pcre_compile() when compiling the regular
     expression. Without an /L modifier, NULL is passed as the
     tables pointer; that is, /L applies only to the expression
     on which it appears.

     The /I modifier requests that pcretest output information
     about the compiled expression (whether it is anchored, has a
     fixed first character, and so on). It does this by calling
     pcre_fullinfo() after compiling an expression, and
     outputting the information it gets back. If the pattern is
     studied, the results of that are also output.

     The /D modifier is a PCRE debugging feature, which also
     assumes /I. It causes the internal form of compiled regular
     expressions to be output after compilation. If the pattern
     was studied, the information returned is also output.

     The /S modifier causes pcre_study() to be called after the
     expression has been compiled, and the results used when the
     expression is matched.

     The /M modifier causes the size of the memory block used to
     hold the compiled pattern to be output.

     The /P modifier causes pcretest to call PCRE via the POSIX
     wrapper API rather than its native API. When this is done,
     all other modifiers except /i, /m, and /+ are ignored.
     REG_ICASE is set if /i is present, and REG_NEWLINE is set if
     /m is present. The wrapper functions force
     PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
     REG_NEWLINE is set.
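
     The /P rules in the preceding paragraph amount to a small
     mapping, shown here as a Python sketch. The function name
     posix_wrapper_settings is hypothetical and the flags are
     represented only by their names; no POSIX or PCRE call is
     made.

```python
def posix_wrapper_settings(modifiers):
    """Return (POSIX cflags, PCRE options forced by the wrapper) as name sets."""
    cflags = set()
    if "i" in modifiers:
        cflags.add("REG_ICASE")
    if "m" in modifiers:
        cflags.add("REG_NEWLINE")

    forced = {"PCRE_DOLLAR_ENDONLY"}      # always forced by the wrapper
    if "REG_NEWLINE" not in cflags:
        forced.add("PCRE_DOTALL")         # forced unless REG_NEWLINE is set
    return cflags, forced
```

     So a pattern written as /abc/Pi is compiled with REG_ICASE,
     and the wrapper adds both PCRE_DOLLAR_ENDONLY and
     PCRE_DOTALL.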

     The /8 modifier causes pcretest to call PCRE with the
     PCRE_UTF8 option set. This turns on support for UTF-8
     character handling in PCRE, provided that it was compiled
     with this support enabled. This modifier also causes any
     non-printing characters in output strings to be printed
     using the \x{hh...} notation if they are valid UTF-8
     sequences.

CALLOUTS

     If the pattern contains any callout requests, pcretest's
     callout function will be called. By default, it displays the
     callout number, and the start and current positions in the
     text at the callout time. For example, the output

       --->pqrabcdef
         0    ^  ^

     indicates that callout number 0 occurred for a match attempt
     starting at the fourth character of the subject string, when
     the pointer was at the seventh character. The callout
     function returns zero (carry on matching) by default.
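
     The layout of this display can be reproduced with a few
     lines of Python. This is a sketch inferred from the example
     above, not pcretest's own code: the function name
     callout_display is hypothetical, offsets are counted from
     zero, and the three-column width for the callout number is
     an assumption chosen so that the carets line up under the
     subject text.

```python
def callout_display(subject, number, start, current):
    """Return the two display lines for a callout.

    start and current are zero-based offsets into subject.
    """
    width = max(start, current) + 1
    marks = [" "] * width
    marks[start] = "^"                    # where the match attempt started
    marks[current] = "^"                  # where the pointer is now
    # "%3d " is four characters wide, the same as the "--->" prefix.
    return ["--->" + subject, "%3d " % number + "".join(marks)]
```

     Calling callout_display("pqrabcdef", 0, 3, 6) gives the two
     lines of the example: the carets fall under the fourth and
     seventh characters of the subject.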

     Inserting callouts may be helpful when using pcretest to
     check complicated regular expressions. For further
     information about callouts, see the pcrecallout
     documentation.

     For testing the PCRE library, additional control of callout
     behaviour is available via escape sequences in the data, as
     described in the following section. In particular, it is
     possible to pass in a number as callout data (the default is
     zero). If the callout function receives a non-zero number,
     it returns that value instead of zero.

DATA LINES

     Before each data line is passed to pcre_exec(), leading and
     trailing whitespace is removed, and it is then scanned for \
     escapes. Some of these are fairly esoteric features,
     intended for checking out some of the more complicated
     features of PCRE. If you are just testing "ordinary" regular
     expressions, you probably don't need any of these. The
     following escapes are recognized:

       \a         alarm (= BEL)
       \b         backspace
       \e         escape
       \f         formfeed
       \n         newline
       \r         carriage return
       \t         tab
       \v         vertical tab
       \nnn       octal character (up to 3 octal digits)
       \xhh       hexadecimal character (up to 2 hex digits)
       \x{hh...}  hexadecimal character, any number of digits
                    in UTF-8 mode
       \A         pass the PCRE_ANCHORED option to pcre_exec()
       \B         pass the PCRE_NOTBOL option to pcre_exec()
       \Cdd       call pcre_copy_substring() for substring dd
                    after a successful match (any decimal number
                    less than 32)
       \Cname     call pcre_copy_named_substring() for substring
                    "name" after a successful match (name
                    terminated by next non-alphanumeric character)
       \C+        show the current captured substrings at callout
                    time
       \C-        do not supply a callout function
       \C!n       return 1 instead of 0 when callout number n is
                    reached
       \C!n!m     return 1 instead of 0 when callout number n is
                    reached for the mth time
       \C*n       pass the number n (may be negative) as callout
                    data
       \Gdd       call pcre_get_substring() for substring dd
                    after a successful match (any decimal number
                    less than 32)
       \Gname     call pcre_get_named_substring() for substring
                    "name" after a successful match (name
                    terminated by next non-alphanumeric character)
       \L         call pcre_get_substring_list() after a
                    successful match
       \M         discover the minimum MATCH_LIMIT setting
       \N         pass the PCRE_NOTEMPTY option to pcre_exec()
       \Odd       set the size of the output vector passed to
                    pcre_exec() to dd (any number of decimal
                    digits)
       \Z         pass the PCRE_NOTEOL option to pcre_exec()

     If \M is present, pcretest calls pcre_exec() several times,
     with different values in the match_limit field of the
     pcre_extra data structure, until it finds the minimum number
     that is needed for pcre_exec() to complete. This number is a
     measure of the amount of recursion and backtracking that
     takes place, and checking it out can be instructive. For
     most simple matches, the number is quite small, but for
     patterns with very large numbers of matching possibilities,
     it can become large very quickly with increasing length of
     subject string.
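
     One plausible way to find such a minimum is to double a
     trial limit until matching completes and then bisect, as in
     the Python sketch below. The text does not say which search
     strategy pcretest actually uses, so this is an assumption;
     completes is a hypothetical callback that stands for "run
     pcre_exec() with this match_limit and report whether it
     finished without hitting the limit", and it is assumed that
     a limit of zero never suffices.

```python
def minimum_limit(completes, start=64):
    """Smallest limit for which completes(limit) is true.

    Assumes completes is monotone: once a limit is large enough,
    every larger limit is large enough too.
    """
    low, high = 0, start                  # low is always a failing limit
    while not completes(high):
        low, high = high, high * 2        # grow until matching completes
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid                    # mid is enough; try smaller
        else:
            low = mid                     # mid is too small
    return high
```

     With a stand-in such as lambda limit: limit >= 37, the
     function returns 37 after a handful of trials.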

     When \O is used, the value specified may be higher or lower
     than the size set by the -o option (or defaulted to 45); \O
     applies only to the call of pcre_exec() for the line in
     which it appears.
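
     The relationship between the vector size and the number of
     capturing subexpressions it can report is simple arithmetic.
     The sketch below assumes the layout described in the pcreapi
     documentation, in which the first two-thirds of the vector
     hold pairs of offsets and the final third is workspace; the
     function name capturing_capacity is hypothetical.

```python
def capturing_capacity(osize):
    """Number of capturing subexpressions an output vector of osize can report."""
    pairs = osize // 3                    # two offsets per pair, one-third is workspace
    return max(pairs - 1, 0)              # pair 0 is the match of the whole pattern
```

     For the default size this gives capturing_capacity(45) ==
     14, in agreement with the figure quoted for the -o option.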

     A backslash followed by anything else just escapes the
     anything else. If the very last character is a backslash, it
     is ignored. This gives a way of passing an empty line as
     data, since a real empty line terminates the data input.
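
     The scanning of a data line can be sketched in Python as
     follows. This is a simplified illustration, not pcretest's
     code: decode_data_line is a hypothetical name, only the
     character escapes and the \A, \B, \N, and \Z option escapes
     are handled, \x{hh...} is limited to code points that
     Python's chr() accepts, and treating \x with no hex digits
     as a zero byte is an assumption.

```python
SIMPLE = {"a": 0x07, "b": 0x08, "e": 0x1B, "f": 0x0C,
          "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B}
OPTIONS = {"A": "PCRE_ANCHORED", "B": "PCRE_NOTBOL",
           "N": "PCRE_NOTEMPTY", "Z": "PCRE_NOTEOL"}
OCTAL = "01234567"
HEXDIGITS = "0123456789abcdefABCDEF"


def decode_data_line(line):
    """Return (subject bytes, set of option names) for one data line."""
    text = line.strip()                   # leading and trailing whitespace is removed
    data = bytearray()
    options = set()
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            data += ch.encode("utf-8")
            continue
        if i == len(text):
            break                         # a final backslash is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            data.append(SIMPLE[esc])
        elif esc in OPTIONS:
            options.add(OPTIONS[esc])
        elif esc in "CGLMO":
            raise NotImplementedError("escape not covered by this sketch")
        elif esc in OCTAL:
            end = i
            while end < len(text) and end < i + 2 and text[end] in OCTAL:
                end += 1                  # up to 3 octal digits in total
            data.append(int(text[i - 1:end], 8) & 0xFF)
            i = end
        elif esc == "x" and text[i:i + 1] == "{" and "}" in text[i:]:
            close = text.index("}", i)
            data += chr(int(text[i + 1:close], 16)).encode("utf-8")
            i = close + 1
        elif esc == "x":
            end = i
            while end < len(text) and end < i + 2 and text[end] in HEXDIGITS:
                end += 1                  # up to 2 hex digits
            data.append(int(text[i:end] or "0", 16))
            i = end
        else:
            data += esc.encode("utf-8")   # anything else just stands for itself
    return bytes(data), options
```

     A data line consisting of a single backslash therefore
     decodes to an empty subject, which is how an empty line can
     be passed as data.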

     If /P was present on the regex, causing the POSIX wrapper
     API to be used, only \B and \Z have any effect, causing
     REG_NOTBOL and REG_NOTEOL to be passed to regexec()
     respectively.

     The use of \x{hh...} to represent UTF-8 characters is not
     dependent on the use of the /8 modifier on the pattern. It
     is recognized always. There may be any number of hexadecimal
     digits inside the braces. The result is from one to six
     bytes, encoded according to the UTF-8 rules.
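
     The "one to six bytes" refers to the original UTF-8 scheme,
     which covers values up to 0x7FFFFFFF; the current standard
     stops at four bytes. A minimal Python sketch of that
     encoding is given below. The function name utf8_encode is
     hypothetical, and the code illustrates the encoding rules
     rather than reproducing pcretest's implementation.

```python
def utf8_encode(value):
    """Encode an integer as 1 to 6 bytes using the original UTF-8 rules."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if value < 0x80:
        return bytes([value])             # single byte, same as ASCII

    # Find the shortest form whose payload can hold the value.
    for nbytes, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if value < limit:
            break

    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (value & 0x3F))  # continuation byte: 10xxxxxx
        value >>= 6
    lead = ((0xFF << (8 - nbytes)) & 0xFF) | value  # nbytes leading 1 bits, then a 0
    return bytes([lead] + tail[::-1])
```

     For instance, utf8_encode(0xE9) produces the two bytes C3
     A9, and utf8_encode(0x7FFFFFFF) produces the six bytes FD BF
     BF BF BF BF.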

OUTPUT FROM PCRETEST

     When a match succeeds, pcretest outputs the list of captured
     substrings that pcre_exec() returns, starting with number 0
     for the string that matched the whole pattern. Here is an
     example of an interactive pcretest run.

       $ pcretest
       PCRE version 4.00 08-Jan-2003

         re> /^abc(\d+)/
       data> abc123
        0: abc123
        1: 123
       data> xyz
       No match

     If the strings contain any non-printing characters, they are
     output as \xhh escapes, or as \x{...} escapes if the /8
     modifier was present on the pattern. If the pattern has the
     /+ modifier, then the output for substring 0 is followed by
     the rest of the subject string, identified by "0+" like
     this:

         re> /cat/+
       data> cataract
        0: cat
        0+ aract
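
     The numbered output format can be imitated with Python's re
     module, as in the sketch below. It is an illustration only:
     format_match is a hypothetical name, Python's re stands in
     for PCRE, non-printing characters are not escaped, and the
     "<unset>" text and the two-column width for the substring
     number are assumptions of this sketch.

```python
import re


def format_match(pattern, subject, show_rest=False):
    """Return output lines in the style shown above for one data line."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    lines = []
    for number in range(m.re.groups + 1):
        text = m.group(number)
        shown = "<unset>" if text is None else text   # assumption, see lead-in
        lines.append("%2d: %s" % (number, shown))
        if number == 0 and show_rest:
            # Corresponds to the /+ modifier: the rest of the subject string.
            lines.append(" 0+ " + subject[m.end():])
    return lines
```

     Calling format_match("cat", "cataract", show_rest=True)
     returns the lines " 0: cat" and " 0+ aract".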

     If the pattern has the /g or /G modifier, the results of
     successive matching attempts are output in sequence, like
     this:

         re> /\Bi(\w\w)/g
       data> Mississippi
        0: iss
        1: ss
        0: iss
        1: ss
        0: ipp
        1: pp

     "No match" is output only if the first match attempt fails.

     If any of the sequences \C, \G, or \L are present in a data
     line that is successfully matched, the substrings extracted
     by the convenience functions are output with C, G, or L
     after the string number instead of a colon. This is in
     addition to the normal full list. The string length (that
     is, the return from the extraction function) is given in
     parentheses after each string for \C and \G.

     Note that while patterns can be continued over several lines
     (a plain ">" prompt is used for continuations), data lines
     may not. However, newlines can be included in data by means
     of the \n escape.

AUTHOR

     Philip Hazel <ph10@cam.ac.uk>
     University Computing Service,
     Cambridge CB2 3QG, England.

     Last updated: 03 February 2003
     Copyright (c) 1997-2003 University of Cambridge.
