Contents of /code/tags/pcre-6.2/doc/pcretest.txt

Revision 53
Sat Feb 24 21:39:42 2007 UTC by nigel
Original Path: code/trunk/doc/pcretest.txt
File MIME type: text/plain
File size: 12689 byte(s)
Load pcre-3.5 into code/trunk.

NAME
pcretest - a program for testing Perl-compatible regular
expressions.


SYNOPSIS
pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]

pcretest was written as a test program for the PCRE regular
expression library itself, but it can also be used for
experimenting with regular expressions. This man page
describes the features of the test program; for details of
the regular expressions themselves, see the pcre man page.


OPTIONS
-d        Behave as if each regex had the /D modifier (see
          below); the internal form is output after
          compilation.

-i        Behave as if each regex had the /I modifier;
          information about the compiled pattern is given
          after compilation.

-m        Output the size of each compiled pattern after it
          has been compiled. This is equivalent to adding /M
          to each regular expression. For compatibility with
          earlier versions of pcretest, -s is a synonym for
          -m.

-o osize  Set the number of elements in the output vector
          that is used when calling PCRE to be osize. The
          default value is 45, which is enough for 14
          capturing subexpressions. The vector size can be
          changed for individual matching calls by including
          \O in the data line (see below).
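The figure of 14 can be checked by hand. As the PCRE API documentation describes the output vector, each captured substring takes a pair of offsets, pair 0 holds the whole match, and the final third of the vector is reserved as workspace. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and is not part of pcretest.

```python
def max_capturing_subexpressions(osize):
    """Number of capturing subexpressions an output vector of osize can report."""
    # Only two-thirds of the vector holds offset pairs, so there is
    # room for osize // 3 pairs in total.
    pairs = osize // 3
    # Pair 0 is used for the whole match; the rest are capture groups.
    return max(pairs - 1, 0)


print(max_capturing_subexpressions(45))  # 14
```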

-p        Behave as if each regex had the /P modifier; the POSIX
          wrapper API is used to call PCRE. None of the
          other options has any effect when -p is set.

-t        Run each compile, study, and match 20000 times
          with a timer, and output the resulting time per
          compile or match (in milliseconds). Do not set -t
          with -m, because you will then get the size output
          20000 times and the timing will be distorted.


DESCRIPTION
If pcretest is given two filename arguments, it reads from
the first and writes to the second. If it is given only one
filename argument, it reads from that file and writes to
stdout. Otherwise, it reads from stdin and writes to stdout,
and prompts for each line of input, using "re>" to prompt
for regular expressions, and "data>" to prompt for data
lines.

The program handles any number of sets of input on a single
input file. Each set starts with a regular expression, and
continues with any number of data lines to be matched
against the pattern. An empty line signals the end of the
data lines, at which point a new regular expression is read.
The regular expressions are given enclosed in any
non-alphameric delimiters other than backslash, for example

    /(a|bc)x+yz/

White space before the initial delimiter is ignored. A
regular expression may be continued over several input lines, in
which case the newline characters are included within it. It
is possible to include the delimiter within the pattern by
escaping it, for example

    /abc\/def/

If you do so, the escape and the delimiter form part of the
pattern, but since delimiters are always non-alphameric,
this does not affect its interpretation. If the terminating
delimiter is immediately followed by a backslash, for
example,

    /abc/\

then a backslash is added to the end of the pattern. This is
done to provide a way of testing the error condition that
arises if a pattern finishes with a backslash, because

    /abc\/

is interpreted as the first line of a pattern that starts
with "abc/", causing pcretest to read the next line as a
continuation of the regular expression.
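The delimiter rules above can be sketched in a few lines of Python. This is an illustration of the rules as described here, not pcretest's actual code: the function name is hypothetical, and it handles a single input line, returning None where pcretest would go on to read a continuation line.

```python
def split_pattern_line(text):
    """Split one pattern line into (pattern, modifiers).

    Returns None when no terminating delimiter is found, which is the
    case where the pattern would continue on the next input line.
    """
    text = text.lstrip()              # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty line")
    delimiter = text[0]
    if delimiter.isalnum() or delimiter == "\\":
        raise ValueError("delimiter must be non-alphameric and not a backslash")

    i = 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2                    # the escape and the escaped character both stay
            continue
        if text[i] == delimiter:
            pattern, rest = text[1:i], text[i + 1:]
            if rest.startswith("\\"):
                pattern += "\\"       # "/abc/\" adds a backslash to the pattern
                rest = rest[1:]
            return pattern, rest.strip()
        i += 1
    return None


print(split_pattern_line("  /(a|bc)x+yz/"))
print(split_pattern_line("/abc\\/def/i"))
print(split_pattern_line("/abc/\\"))
print(split_pattern_line("/abc\\/"))
```

The last call returns None because the escaped delimiter is treated as part of the pattern, which matches the "/abc\/" case described above.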


PATTERN MODIFIERS
The pattern may be followed by i, m, s, or x to set the
PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
options, respectively. For example:

    /caseless/i

These modifier letters have the same effect as they do in
Perl. There are others which set PCRE options that do not
correspond to anything in Perl: /A, /E, and /X set
PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA
respectively.

Searching for all possible matches within each subject
string can be requested by the /g or /G modifier. After
finding a match, PCRE is called again to search the
remainder of the subject string. The difference between /g
and /G is that the former uses the startoffset argument to
pcre_exec() to start searching at a new point within the
entire string (which is in effect what Perl does), whereas
the latter passes over a shortened substring. This makes a
difference to the matching process if the pattern begins
with a lookbehind assertion (including \b or \B).
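The difference between the two strategies can be seen with a rough Python analogue. This sketch uses Python's re module rather than PCRE: the pos argument of Pattern.search() stands in for startoffset, and slicing the string stands in for passing a shortened substring. Both function names are hypothetical, and empty matches are simply stepped over by one character, which is a simplification.

```python
import re


def match_all_g(pattern, subject):
    """Like /g: keep the whole subject and move the start offset."""
    found, pos = [], 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Simplification: step over an empty match by one character.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def match_all_G(pattern, subject):
    """Like /G: pass only the remaining substring each time."""
    found, rest = [], subject
    while True:
        m = pattern.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        step = m.end() if m.end() > m.start() else m.end() + 1
        if step > len(rest):
            break
        rest = rest[step:]
    return found


pattern = re.compile(r"\Bi(\w\w)")
print(match_all_g(pattern, "Mississippi"))  # ['iss', 'iss', 'ipp']
print(match_all_G(pattern, "Mississippi"))  # ['iss', 'ipp']
```

With the substring approach the second "iss" is lost: once the text before it has been cut away, it sits at the start of the string, where \B can no longer see the preceding "s".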

If any call to pcre_exec() in a /g or /G sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY
and PCRE_ANCHORED flags set in order to search for another,
non-empty, match at the same point. If this second match
fails, the start offset is advanced by one, and the normal
match is retried. This imitates the way Perl handles such
cases when using the /g modifier or the split() function.

There are a number of other modifiers for controlling the
way pcretest operates.

The /+ modifier requests that as well as outputting the
substring that matched the entire pattern, pcretest should in
addition output the remainder of the subject string. This is
useful for tests where the subject contains multiple copies
of the same substring.

The /L modifier must be followed directly by the name of a
locale, for example,

    /pattern/Lfr

For this reason, it must be the last modifier letter. The
given locale is set, pcre_maketables() is called to build a
set of character tables for the locale, and this is then
passed to pcre_compile() when compiling the regular
expression. Without an /L modifier, NULL is passed as the tables
pointer; that is, /L applies only to the expression on which
it appears.

The /I modifier requests that pcretest output information
about the compiled expression (whether it is anchored, has a
fixed first character, and so on). It does this by calling
pcre_fullinfo() after compiling an expression, and
outputting the information it gets back. If the pattern is
studied, the results of that are also output.

The /D modifier is a PCRE debugging feature, which also
assumes /I. It causes the internal form of compiled regular
expressions to be output after compilation.

The /S modifier causes pcre_study() to be called after the
expression has been compiled, and the results used when the
expression is matched.

The /M modifier causes the size of memory block used to hold
the compiled pattern to be output.

The /P modifier causes pcretest to call PCRE via the POSIX
wrapper API rather than its native API. When this is done,
all other modifiers except /i, /m, and /+ are ignored.
REG_ICASE is set if /i is present, and REG_NEWLINE is set if
/m is present. The wrapper functions force
PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
REG_NEWLINE is set.

The /8 modifier causes pcretest to call PCRE with the
PCRE_UTF8 option set. This turns on the (currently
incomplete) support for UTF-8 character handling in PCRE,
provided that it was compiled with this support enabled. This
modifier also causes any non-printing characters in output
strings to be printed using the \x{hh...} notation if they
are valid UTF-8 sequences.


DATA LINES
Before each data line is passed to pcre_exec(), leading and
trailing whitespace is removed, and it is then scanned for \
escapes. The following are recognized:

    \a         alarm (= BEL)
    \b         backspace
    \e         escape
    \f         formfeed
    \n         newline
    \r         carriage return
    \t         tab
    \v         vertical tab
    \nnn       octal character (up to 3 octal digits)
    \xhh       hexadecimal character (up to 2 hex digits)
    \x{hh...}  hexadecimal UTF-8 character

    \A         pass the PCRE_ANCHORED option to pcre_exec()
    \B         pass the PCRE_NOTBOL option to pcre_exec()
    \Cdd       call pcre_copy_substring() for substring dd
               after a successful match (any decimal number
               less than 32)
    \Gdd       call pcre_get_substring() for substring dd
               after a successful match (any decimal number
               less than 32)
    \L         call pcre_get_substring_list() after a
               successful match
    \N         pass the PCRE_NOTEMPTY option to pcre_exec()
    \Odd       set the size of the output vector passed to
               pcre_exec() to dd (any number of decimal
               digits)
    \Z         pass the PCRE_NOTEOL option to pcre_exec()

When \O is used, it may be higher or lower than the size set
by the -o option (or defaulted to 45); \O applies only to
the call of pcre_exec() for the line in which it appears.

A backslash followed by anything else just escapes the
anything else. If the very last character is a backslash, it is
ignored. This gives a way of passing an empty line as data,
since a real empty line terminates the data input.
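As a rough illustration of the scanning rules above, the following Python sketch decodes one data line into the subject text plus a list of the option escapes it contained. It is not pcretest's own code: the function name and the returned tuple are hypothetical, the hex escapes (\xhh and \x{hh...}) are deliberately left out and raise an error, and octal values are not truncated to a byte.

```python
SIMPLE_ESCAPES = {"a": "\x07", "b": "\x08", "e": "\x1b", "f": "\x0c",
                  "n": "\n", "r": "\r", "t": "\t", "v": "\x0b"}
OCTAL_DIGITS = "01234567"
DECIMAL_DIGITS = "0123456789"


def decode_data_line(line):
    """Return (subject, options) for one data line.

    options collects the option escapes as strings such as "A" or "O6".
    """
    line = line.strip()                # leading and trailing whitespace is removed
    subject, options = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        i += 1
        if ch != "\\":
            subject.append(ch)
            continue
        if i == len(line):
            break                      # a final backslash is ignored
        esc = line[i]
        i += 1
        if esc in SIMPLE_ESCAPES:
            subject.append(SIMPLE_ESCAPES[esc])
        elif esc in OCTAL_DIGITS:
            digits = esc               # up to 3 octal digits in total
            while len(digits) < 3 and i < len(line) and line[i] in OCTAL_DIGITS:
                digits += line[i]
                i += 1
            subject.append(chr(int(digits, 8)))
        elif esc == "x":
            raise NotImplementedError("hex escapes are omitted from this sketch")
        elif esc in "ABLNZ":
            options.append(esc)        # option escapes that take no number
        elif esc in "CGO":
            digits = ""                # option escapes followed by decimal digits
            while i < len(line) and line[i] in DECIMAL_DIGITS:
                digits += line[i]
                i += 1
            options.append(esc + digits)
        else:
            subject.append(esc)        # anything else just escapes itself
    return "".join(subject), options


print(decode_data_line("  abc\\tdef  "))
print(decode_data_line("abc\\O6\\A"))
print(decode_data_line("\\"))
```

The last call shows the empty-line trick: a line holding only a backslash decodes to an empty subject.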

If /P was present on the regex, causing the POSIX wrapper
API to be used, only \B and \Z have any effect, causing
REG_NOTBOL and REG_NOTEOL to be passed to regexec()
respectively.

The use of \x{hh...} to represent UTF-8 characters is not
dependent on the use of the /8 modifier on the pattern. It
is recognized always. There may be any number of hexadecimal
digits inside the braces. The result is from one to six
bytes, encoded according to the UTF-8 rules.
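The "one to six bytes" figure corresponds to the original UTF-8 definition, which covers values up to 31 bits. A minimal Python sketch of that encoding is shown below. The function name is hypothetical, and Python's built-in encoder is not used because it stops at four bytes.

```python
def encode_utf8(code_point):
    """Encode a value of up to 31 bits using the original 1-6 byte UTF-8 scheme."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if code_point < 0x80:
        return bytes([code_point])     # plain ASCII is a single byte
    # (largest value, total bytes, marker bits of the leading byte)
    ranges = [(0x7FF, 2, 0xC0), (0xFFFF, 3, 0xE0), (0x1FFFFF, 4, 0xF0),
              (0x3FFFFFF, 5, 0xF8), (0x7FFFFFFF, 6, 0xFC)]
    for limit, length, lead in ranges:
        if code_point <= limit:
            tail = []
            for _ in range(length - 1):
                # Each continuation byte carries the next 6 low-order bits.
                tail.append(0x80 | (code_point & 0x3F))
                code_point >>= 6
            return bytes([lead | code_point] + tail[::-1])


print(encode_utf8(0x41).hex())         # 41
print(encode_utf8(0xE9).hex())         # c3a9
print(encode_utf8(0x20AC).hex())       # e282ac
print(len(encode_utf8(0x7FFFFFFF)))    # 6
```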


OUTPUT FROM PCRETEST
When a match succeeds, pcretest outputs the list of captured
substrings that pcre_exec() returns, starting with number 0
for the string that matched the whole pattern. Here is an
example of an interactive pcretest run.

    $ pcretest
    PCRE version 2.06 08-Jun-1999

    re> /^abc(\d+)/
    data> abc123
    0: abc123
    1: 123
    data> xyz
    No match

If the strings contain any non-printing characters, they are
output as \0x escapes, or as \x{...} escapes if the /8
modifier was present on the pattern. If the pattern has the
/+ modifier, then the output for substring 0 is followed by
the rest of the subject string, identified by "0+" like
this:

    re> /cat/+
    data> cataract
    0: cat
    0+ aract
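The numbered output can be imitated with a short Python sketch. It uses Python's re module in place of PCRE, so it only approximates pcretest for patterns the two engines treat alike. The function name, the two-character number field, and the "<unset>" marker for a group that did not take part in the match are assumptions, and non-printing characters are not escaped.

```python
import re


def pcretest_style_output(pattern, subject, show_rest=False):
    """Return output lines resembling pcretest's for a single match attempt."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    lines = []
    for number in range(m.re.groups + 1):
        text = m.group(number)
        # Assumption: show "<unset>" for a group that captured nothing.
        shown = text if text is not None else "<unset>"
        lines.append(f"{number:2d}: {shown}")
        if number == 0 and show_rest:
            # Imitates the /+ modifier: the rest of the subject after the match.
            lines.append(f"{number:2d}+ {subject[m.end():]}")
    return lines


for line in pcretest_style_output(r"^abc(\d+)", "abc123"):
    print(line)
for line in pcretest_style_output("cat", "cataract", show_rest=True):
    print(line)
```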

If the pattern has the /g or /G modifier, the results of
successive matching attempts are output in sequence, like
this:

    re> /\Bi(\w\w)/g
    data> Mississippi
    0: iss
    1: ss
    0: iss
    1: ss
    0: ipp
    1: pp

"No match" is output only if the first match attempt fails.

If any of the sequences \C, \G, or \L are present in a data
line that is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L
after the string number instead of a colon. This is in
addition to the normal full list. The string length (that is,
the return from the extraction function) is given in
parentheses after each string for \C and \G.

Note that while patterns can be continued over several lines
(a plain ">" prompt is used for continuations), data lines
may not. However, newlines can be included in data by means
of the \n escape.


AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Last updated: 15 August 2001
Copyright (c) 1997-2001 University of Cambridge.
