Contents of /code/tags/pcre-6.2/doc/pcretest.txt

Revision 53
Sat Feb 24 21:39:42 2007 UTC (7 years, 9 months ago) by nigel
Original Path: code/trunk/doc/pcretest.txt
File MIME type: text/plain
File size: 12689 byte(s)
Load pcre-3.5 into code/trunk.

NAME
     pcretest - a program for testing Perl-compatible regular
     expressions.


SYNOPSIS
     pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
     [destination]

     pcretest was written as a test program for the PCRE regular
     expression library itself, but it can also be used for
     experimenting with regular expressions. This man page
     describes the features of the test program; for details of
     the regular expressions themselves, see the pcre man page.


OPTIONS
     -d        Behave as if each regex had the /D modifier (see
               below); the internal form is output after
               compilation.

     -i        Behave as if each regex had the /I modifier;
               information about the compiled pattern is given
               after compilation.

     -m        Output the size of each compiled pattern after it
               has been compiled. This is equivalent to adding /M
               to each regular expression. For compatibility with
               earlier versions of pcretest, -s is a synonym for
               -m.

     -o osize  Set the number of elements in the output vector
               that is used when calling PCRE to be osize. The
               default value is 45, which is enough for 14
               capturing subexpressions. The vector size can be
               changed for individual matching calls by including
               \O in the data line (see below).
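
     The link between osize and the number of capturing
     subexpressions can be checked with a little arithmetic. The
     sketch below is in Python rather than C, and it assumes the
     pcre_exec() convention that only the first two-thirds of the
     output vector hold start/end offset pairs, the rest being
     workspace. The function name is made up for illustration.

```python
def max_capturing_subexpressions(osize):
    """Largest number of capture groups whose offsets fit in the vector."""
    # Assumption: two-thirds of the vector is usable, two integers per pair.
    pairs = (osize * 2 // 3) // 2
    # Pair 0 always describes the whole match, so it is not a capture group.
    return max(pairs - 1, 0)


print(max_capturing_subexpressions(45))  # 14, the figure quoted for the default
```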

     -p        Behave as if each regex had the /P modifier; the
               POSIX wrapper API is used to call PCRE. None of the
               other options has any effect when -p is set.

     -t        Run each compile, study, and match 20000 times
               with a timer, and output the resulting time per
               compile or match (in milliseconds). Do not set -t
               with -m, because you will then get the size output
               20000 times and the timing will be distorted.
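
     The idea behind -t, repeating an operation many times and
     reporting the average, can be sketched as follows. This is
     only an illustration: it uses Python's re module and
     time.perf_counter() as stand-ins, and says nothing about how
     pcretest itself measures time. The helper name is
     hypothetical.

```python
import re
import time


def time_per_call_ms(func, repeats=20000):
    """Average wall-clock time of one func() call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


pattern = re.compile(r"^abc(\d+)")
match_ms = time_per_call_ms(lambda: pattern.search("abc123"))
print(f"Execute time {match_ms:.6f} milliseconds")
```

     Averaging over many repeats is what makes the figure
     meaningful for operations that take far less than one timer
     tick.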


DESCRIPTION
     If pcretest is given two filename arguments, it reads from
     the first and writes to the second. If it is given only one
     filename argument, it reads from that file and writes to
     stdout. Otherwise, it reads from stdin and writes to stdout,
     and prompts for each line of input, using "re>" to prompt
     for regular expressions, and "data>" to prompt for data
     lines.

     The program handles any number of sets of input on a single
     input file. Each set starts with a regular expression, and
     continues with any number of data lines to be matched
     against the pattern. An empty line signals the end of the
     data lines, at which point a new regular expression is read.
     The regular expressions are given enclosed in any
     non-alphameric delimiters other than backslash, for example

       /(a|bc)x+yz/

     White space before the initial delimiter is ignored. A
     regular expression may be continued over several input
     lines, in which case the newline characters are included
     within it. It is possible to include the delimiter within
     the pattern by escaping it, for example

       /abc\/def/

     If you do so, the escape and the delimiter form part of the
     pattern, but since delimiters are always non-alphameric,
     this does not affect its interpretation. If the terminating
     delimiter is immediately followed by a backslash, for
     example,

       /abc/\

     then a backslash is added to the end of the pattern. This is
     done to provide a way of testing the error condition that
     arises if a pattern finishes with a backslash, because

       /abc\/

     is interpreted as the first line of a pattern that starts
     with "abc/", causing pcretest to read the next line as a
     continuation of the regular expression.
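
     The delimiter rules above can be sketched as a small
     parser. This is a Python illustration written from the
     description in this section, not a transcription of the
     pcretest source; the function name and the choice to return
     None for "needs another line" are assumptions, and the
     caller is expected to have removed the trailing newline.

```python
def split_pattern_line(text):
    """Split '/pattern/modifiers' into (pattern, modifiers).

    Returns None when no closing delimiter is found, meaning the
    caller should read another line and append it to the pattern.
    """
    text = text.lstrip()          # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty pattern line")
    delimiter = text[0]
    if delimiter.isalnum() or delimiter == "\\":
        raise ValueError("delimiter must be non-alphameric and not a backslash")
    i = 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2                # keep the escape and the escaped character
            continue
        if text[i] == delimiter:
            break
        i += 1
    else:
        return None               # no terminator yet: pattern continues
    pattern, rest = text[1:i], text[i + 1:]
    if rest.startswith("\\"):
        pattern += "\\"           # '/abc/\' adds a backslash to the pattern
        rest = rest[1:]
    return pattern, rest


print(split_pattern_line("/abc\\/def/"))  # escape and delimiter stay in the pattern
print(split_pattern_line("/abc\\/"))      # None: the next line is a continuation
```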


PATTERN MODIFIERS
     The pattern may be followed by i, m, s, or x to set the
     PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
     options, respectively. For example:

       /caseless/i

     These modifier letters have the same effect as they do in
     Perl. There are others which set PCRE options that do not
     correspond to anything in Perl: /A, /E, and /X set
     PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA
     respectively.

     Searching for all possible matches within each subject
     string can be requested by the /g or /G modifier. After
     finding a match, PCRE is called again to search the
     remainder of the subject string. The difference between /g
     and /G is that the former uses the startoffset argument to
     pcre_exec() to start searching at a new point within the
     entire string (which is in effect what Perl does), whereas
     the latter passes over a shortened substring. This makes a
     difference to the matching process if the pattern begins
     with a lookbehind assertion (including \b or \B).
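
     The effect of the two strategies can be reproduced outside
     PCRE. The sketch below uses Python's re module as a stand-in
     (an assumption: it is not PCRE, but its search() with a
     start position also lets \B see the characters before that
     position). The function names are made up, and empty
     matches are handled by simply stepping forward one
     character, which is cruder than what pcretest does.

```python
import re


def matches_g(pattern, subject):
    """Like /g: keep the whole subject and move the start offset."""
    found, pos = [], 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Simplification: step past an empty match to guarantee progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def matches_G(pattern, subject):
    """Like /G: pass only the unmatched remainder as a new subject."""
    found = []
    while True:
        m = pattern.search(subject)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(subject):
            break
        subject = subject[cut:]   # earlier context is lost from here on
    return found


pattern = re.compile(r"\Bi(\w\w)")
print(matches_g(pattern, "Mississippi"))  # ['iss', 'iss', 'ipp']
print(matches_G(pattern, "Mississippi"))  # ['iss', 'ipp']
```

     With the shortened subject, the second "iss" starts the new
     string, so \B cannot match in front of it and that match is
     lost.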

     If any call to pcre_exec() in a /g or /G sequence matches an
     empty string, the next call is done with the PCRE_NOTEMPTY
     and PCRE_ANCHORED flags set in order to search for another,
     non-empty, match at the same point. If this second match
     fails, the start offset is advanced by one, and the normal
     match is retried. This imitates the way Perl handles such
     cases when using the /g modifier or the split() function.

     There are a number of other modifiers for controlling the
     way pcretest operates.

     The /+ modifier requests that as well as outputting the
     substring that matched the entire pattern, pcretest should
     in addition output the remainder of the subject string. This
     is useful for tests where the subject contains multiple
     copies of the same substring.

     The /L modifier must be followed directly by the name of a
     locale, for example,

       /pattern/Lfr

     For this reason, it must be the last modifier letter. The
     given locale is set, pcre_maketables() is called to build a
     set of character tables for the locale, and this is then
     passed to pcre_compile() when compiling the regular
     expression. Without an /L modifier, NULL is passed as the
     tables pointer; that is, /L applies only to the expression
     on which it appears.

     The /I modifier requests that pcretest output information
     about the compiled expression (whether it is anchored, has a
     fixed first character, and so on). It does this by calling
     pcre_fullinfo() after compiling an expression, and
     outputting the information it gets back. If the pattern is
     studied, the results of that are also output.

     The /D modifier is a PCRE debugging feature, which also
     assumes /I. It causes the internal form of compiled regular
     expressions to be output after compilation.

     The /S modifier causes pcre_study() to be called after the
     expression has been compiled, and the results used when the
     expression is matched.

     The /M modifier causes the size of memory block used to hold
     the compiled pattern to be output.

     The /P modifier causes pcretest to call PCRE via the POSIX
     wrapper API rather than its native API. When this is done,
     all other modifiers except /i, /m, and /+ are ignored.
     REG_ICASE is set if /i is present, and REG_NEWLINE is set if
     /m is present. The wrapper functions force
     PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
     REG_NEWLINE is set.
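
     The /P rules reduce to a small decision table. The sketch
     below restates only what the paragraph above says, using
     flag names as plain strings; the function is hypothetical
     and does not call any real POSIX or PCRE API.

```python
def posix_wrapper_flags(modifiers):
    """Return (regcomp() flag names, PCRE option names forced by the wrapper)."""
    cflags = set()
    if "i" in modifiers:
        cflags.add("REG_ICASE")
    if "m" in modifiers:
        cflags.add("REG_NEWLINE")
    # Modifiers other than i, m and + are ignored under /P.
    forced = {"PCRE_DOLLAR_ENDONLY"}          # always forced
    if "REG_NEWLINE" not in cflags:
        forced.add("PCRE_DOTALL")             # forced unless REG_NEWLINE is set
    return cflags, forced


print(posix_wrapper_flags("i"))
print(posix_wrapper_flags("mS+"))
```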

     The /8 modifier causes pcretest to call PCRE with the
     PCRE_UTF8 option set. This turns on the (currently
     incomplete) support for UTF-8 character handling in PCRE,
     provided that it was compiled with this support enabled.
     This modifier also causes any non-printing characters in
     output strings to be printed using the \x{hh...} notation if
     they are valid UTF-8 sequences.


DATA LINES
     Before each data line is passed to pcre_exec(), leading and
     trailing whitespace is removed, and it is then scanned for \
     escapes. The following are recognized:

       \a         alarm (= BEL)
       \b         backspace
       \e         escape
       \f         formfeed
       \n         newline
       \r         carriage return
       \t         tab
       \v         vertical tab
       \nnn       octal character (up to 3 octal digits)
       \xhh       hexadecimal character (up to 2 hex digits)
       \x{hh...}  hexadecimal UTF-8 character

       \A         pass the PCRE_ANCHORED option to pcre_exec()
       \B         pass the PCRE_NOTBOL option to pcre_exec()
       \Cdd       call pcre_copy_substring() for substring dd
                   after a successful match (any decimal number
                   less than 32)
       \Gdd       call pcre_get_substring() for substring dd
                   after a successful match (any decimal number
                   less than 32)
       \L         call pcre_get_substring_list() after a
                   successful match
       \N         pass the PCRE_NOTEMPTY option to pcre_exec()
       \Odd       set the size of the output vector passed to
                   pcre_exec() to dd (any number of decimal
                   digits)
       \Z         pass the PCRE_NOTEOL option to pcre_exec()

     When \O is used, the size it sets may be higher or lower
     than the size set by the -o option (or defaulted to 45); \O
     applies only to the call of pcre_exec() for the line in
     which it appears.

     A backslash followed by anything else just escapes the
     anything else. If the very last character is a backslash, it
     is ignored. This gives a way of passing an empty line as
     data, since a real empty line terminates the data input.
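
     The scanning of a data line can be sketched as below. This
     Python illustration covers only a subset of the escapes:
     the character escapes, \nnn, \xhh, and the four option
     escapes \A, \B, \N and \Z. The sequences \C, \G, \L, \O and
     \x{hh...} are deliberately left out and raise an error. The
     function name, the returned tuple, and the restriction to
     Latin-1 input are assumptions made for the example.

```python
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
FLAGS = {"A": "PCRE_ANCHORED", "B": "PCRE_NOTBOL",
         "N": "PCRE_NOTEMPTY", "Z": "PCRE_NOTEOL"}
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"


def parse_data_line(line):
    """Return (subject bytes, set of pcre_exec() option names)."""
    line = line.strip()                   # whitespace goes before escapes are read
    out, options, i = bytearray(), set(), 0
    while i < len(line):
        ch = line[i]
        i += 1
        if ch != "\\":
            out.extend(ch.encode("latin-1"))
            continue
        if i == len(line):
            break                         # a final backslash is ignored
        esc = line[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc in FLAGS:
            options.add(FLAGS[esc])
        elif esc in OCTAL:
            start = i - 1                 # up to 3 octal digits in total
            while i < len(line) and i - start < 3 and line[i] in OCTAL:
                i += 1
            out.append(int(line[start:i], 8) & 0xFF)
        elif esc == "x":
            if line[i:i + 1] == "{":
                raise NotImplementedError("\\x{hh...} is outside this sketch")
            start = i                     # up to 2 hex digits
            while i < len(line) and i - start < 2 and line[i] in HEX:
                i += 1
            out.append(int(line[start:i] or "0", 16))
        elif esc in "CGLO":
            raise NotImplementedError("\\" + esc + " is outside this sketch")
        else:
            out.extend(esc.encode("latin-1"))   # anything else escapes itself
    return bytes(out), options


print(parse_data_line(r"  abc\n\x41\101\Z  "))
print(parse_data_line("\\"))              # a lone backslash gives an empty subject
```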

     If /P was present on the regex, causing the POSIX wrapper
     API to be used, only \B and \Z have any effect, causing
     REG_NOTBOL and REG_NOTEOL to be passed to regexec()
     respectively.

     The use of \x{hh...} to represent UTF-8 characters is not
     dependent on the use of the /8 modifier on the pattern. It
     is recognized always. There may be any number of hexadecimal
     digits inside the braces. The result is from one to six
     bytes, encoded according to the UTF-8 rules.
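
     The "one to six bytes" figure follows from the original
     UTF-8 scheme, which covers values up to 31 bits. Python's
     built-in encoder stops at U+10FFFF (four bytes), so the
     sketch below spells the rule out by hand. It is an
     illustration of the encoding only, with a made-up function
     name, and makes no claim about how pcretest implements it.

```python
def encode_utf8_extended(value):
    """Encode a value below 2**31 using the original 1-6 byte UTF-8 scheme."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        return bytes([value])             # plain ASCII: one byte
    # Largest value representable with 1, 2, ... 5 continuation bytes.
    limits = [0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    extra = next(n for n, top in enumerate(limits, start=1) if value <= top)
    tail = []
    for _ in range(extra):
        tail.append(0x80 | (value & 0x3F))    # continuation byte 10xxxxxx
        value >>= 6
    lead_marker = (0xFF00 >> (extra + 1)) & 0xFF   # 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    return bytes([lead_marker | value] + tail[::-1])


print(encode_utf8_extended(0x20AC).hex())        # e282ac
print(len(encode_utf8_extended(0x7FFFFFFF)))     # 6
```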


OUTPUT FROM PCRETEST
     When a match succeeds, pcretest outputs the list of captured
     substrings that pcre_exec() returns, starting with number 0
     for the string that matched the whole pattern. Here is an
     example of an interactive pcretest run.

       $ pcretest
       PCRE version 2.06 08-Jun-1999

         re> /^abc(\d+)/
       data> abc123
        0: abc123
        1: 123
       data> xyz
       No match
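
     The numbered listing can be imitated with a few lines of
     Python. This sketch uses the re module in place of
     pcre_exec(), so it only mimics the shape of the output. The
     two-column numbering, the "<unset>" text for groups that
     did not take part in the match, and the function name are
     all assumptions.

```python
import re


def show_match(pattern, data):
    """Return pcretest-style result lines for one data line."""
    m = re.search(pattern, data)
    if m is None:
        return ["No match"]
    lines = []
    for number in range(m.re.groups + 1):     # group 0 is the whole match
        text = m.group(number)
        shown = "<unset>" if text is None else text
        lines.append(f"{number:2d}: {shown}")
    return lines


for line in show_match(r"^abc(\d+)", "abc123"):
    print(line)
print(show_match(r"^abc(\d+)", "xyz")[0])
```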

     If the strings contain any non-printing characters, they are
     output as \0x escapes, or as \x{...} escapes if the /8
     modifier was present on the pattern. If the pattern has the
     /+ modifier, then the output for substring 0 is followed by
     the rest of the subject string, identified by "0+" like
     this:

         re> /cat/+
       data> cataract
        0: cat
        0+ aract

     If the pattern has the /g or /G modifier, the results of
     successive matching attempts are output in sequence, like
     this:

         re> /\Bi(\w\w)/g
       data> Mississippi
        0: iss
        1: ss
        0: iss
        1: ss
        0: ipp
        1: pp

     "No match" is output only if the first match attempt fails.

     If any of the sequences \C, \G, or \L are present in a data
     line that is successfully matched, the substrings extracted
     by the convenience functions are output with C, G, or L
     after the string number instead of a colon. This is in
     addition to the normal full list. The string length (that
     is, the return from the extraction function) is given in
     parentheses after each string for \C and \G.

     Note that while patterns can be continued over several lines
     (a plain ">" prompt is used for continuations), data lines
     may not. However, newlines can be included in data by means
     of the \n escape.


AUTHOR
     Philip Hazel <ph10@cam.ac.uk>
     University Computing Service,
     New Museums Site,
     Cambridge CB2 3QG, England.
     Phone: +44 1223 334714

     Last updated: 15 August 2001
     Copyright (c) 1997-2001 University of Cambridge.
