Diff of /code/trunk/doc/pcretest.txt

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 53 by nigel, Sat Feb 24 21:39:42 2007 UTC
# Line 1  Line 1 
 The pcretest program
 --------------------
1    NAME
2         pcretest - a program  for  testing  Perl-compatible  regular
3         expressions.
4    
5    
6    
7    SYNOPSIS
8         pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]
10    
11         pcretest was written as a test program for the PCRE  regular
12         expression  library  itself,  but  it  can  also be used for
13         experimenting  with  regular  expressions.  This  man   page
14         describes  the  features of the test program; for details of
15         the regular expressions themselves, see the pcre man page.
16    
17    
18    
19    OPTIONS
20         -d        Behave as if each regex had the /D  modifier  (see
21                   below); the internal form is output after compila-
22                   tion.
23    
24         -i        Behave as if  each  regex  had  the  /I  modifier;
25                   information  about  the  compiled pattern is given
26                   after compilation.
27    
28         -m        Output the size of each compiled pattern after  it
29                   has been compiled. This is equivalent to adding /M
30                   to each regular expression. For compatibility with
31                   earlier  versions of pcretest, -s is a synonym for
32                   -m.
33    
34         -o osize  Set the number of elements in  the  output  vector
35                   that  is  used  when calling PCRE to be osize. The
36                   default value is 45, which is enough for  14  cap-
37                   turing  subexpressions.  The  vector  size  can be
38                   changed for individual matching calls by including
39                   \O in the data line (see below).
40    
41         -p        Behave as if each regex had the /P modifier; the POSIX
42                   wrapper  API  is  used  to  call PCRE. None of the
43                   other options has any effect when -p is set.
44    
45         -t        Run each compile, study,  and  match  20000  times
46                   with  a  timer, and output resulting time per com-
47                   pile or match (in milliseconds).  Do  not  set  -t
48                   with -m, because you will then get the size output
49                   20000 times and the timing will be distorted.
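The repeat-and-average measurement that -t performs can be sketched as
follows. This is only an illustration: Python's re module stands in for
PCRE, and the names REPEAT and time_per_call_ms are hypothetical, not part
of pcretest.

```python
import re
import time

REPEAT = 20000  # repetition count quoted above for -t


def time_per_call_ms(func, repeat=REPEAT):
    """Run func() `repeat` times; return the mean time per call in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        func()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeat


# Time a match with Python's re module standing in for PCRE.
pattern = re.compile(r"^abc(\d+)")
per_match = time_per_call_ms(lambda: pattern.search("abc123"))
print(f"match time: {per_match:.6f} milliseconds")
```

Any extra output produced inside the timed loop is included in the
measurement, which is why combining -t with -m distorts the result.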
50    
51    
52    
53    DESCRIPTION
54         If pcretest is given two filename arguments, it  reads  from
55         the  first and writes to the second. If it is given only one
64         filename argument, it reads from that  file  and  writes  to
65         stdout. Otherwise, it reads from stdin and writes to stdout,
66         and prompts for each line of input, using  "re>"  to  prompt
67         for  regular  expressions,  and  "data>"  to prompt for data
68         lines.
69    
70         The program handles any number of sets of input on a  single
71         input  file.  Each set starts with a regular expression, and
72         continues with any  number  of  data  lines  to  be  matched
73         against  the  pattern.  An empty line signals the end of the
74         data lines, at which point a new regular expression is read.
75         The  regular  expressions  are  given  enclosed  in any non-
76         alphameric delimiters other than backslash, for example
77    
78           /(a|bc)x+yz/
79    
80         White space before the initial delimiter is ignored. A regu-
81         lar expression may be continued over several input lines, in
82         which case the newline characters are included within it. It
83         is  possible  to include the delimiter within the pattern by
84         escaping it, for example
85    
86           /abc\/def/
87    
88         If you do so, the escape and the delimiter form part of  the
89         pattern,  but  since  delimiters  are always non-alphameric,
90         this does not affect its interpretation.  If the terminating
91         delimiter  is immediately followed by a backslash, for exam-
92         ple,
93    
94           /abc/\
95    
96         then a backslash is added to the end of the pattern. This is
97         done  to  provide  a way of testing the error condition that
98         arises if a pattern finishes with a backslash, because
99    
100           /abc\/
101    
102         is interpreted as the first line of a  pattern  that  starts
103         with  "abc/",  causing  pcretest  to read the next line as a
104         continuation of the regular expression.
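The delimiter rules above can be sketched in a few lines. This is a
simplified model written for illustration, not the code pcretest uses; the
function name split_pattern is hypothetical, and continuation over several
input lines is reduced to returning None.

```python
def split_pattern(line):
    """Split '/pattern/mods' into (pattern, modifiers).

    Returns None when no closing delimiter is found, meaning the pattern
    would continue on the next input line.
    """
    line = line.lstrip()  # white space before the delimiter is ignored
    if not line:
        raise ValueError("empty line")
    delim = line[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")
    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            i += 2  # the escape and the escaped character both stay in the pattern
            continue
        if line[i] == delim:
            pattern, rest = line[1:i], line[i + 1:]
            if rest.startswith("\\"):
                # /abc/\ : a backslash is added to the end of the pattern
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None


print(split_pattern(r"/abc\/def/"))   # escaped delimiter kept in the pattern
print(split_pattern("/abc/\\"))       # pattern ends with a backslash
print(split_pattern("/abc\\/"))       # unterminated: needs another line
```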
105    
106    
107    
108    PATTERN MODIFIERS
109         The pattern may be followed by i, m, s,  or  x  to  set  the
110         PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
111         options, respectively. For example:
112    
113           /caseless/i
114    
115         These modifier letters have the same effect as  they  do  in
116         Perl.  There  are  others which set PCRE options that do not
117         correspond  to  anything  in  Perl:   /A,  /E,  and  /X  set
118         PCRE_ANCHORED,  PCRE_DOLLAR_ENDONLY,  and PCRE_EXTRA respec-
119         tively.
120    
121         Searching for  all  possible  matches  within  each  subject
122         string  can  be  requested  by  the /g or /G modifier. After
123         finding  a  match,  PCRE  is  called  again  to  search  the
124         remainder  of  the subject string. The difference between /g
125         and /G is that the former uses the startoffset  argument  to
126         pcre_exec()  to  start  searching  at a new point within the
127         entire string (which is in effect what Perl  does),  whereas
128         the  latter  passes over a shortened substring. This makes a
129         difference to the matching process  if  the  pattern  begins
130         with a lookbehind assertion (including \b or \B).
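The difference between the two modifiers can be reproduced with Python's re
module, used here only as a stand-in for PCRE. The pos argument of search()
plays the role of startoffset (characters before it remain visible to \B),
while slicing the string models passing a shortened substring. The function
names are hypothetical, and empty matches are handled crudely because the
example pattern cannot match an empty string.

```python
import re


def matches_like_g(pattern, subject):
    """/g style: keep the whole subject and move the start offset."""
    found, pos = [], 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def matches_like_G(pattern, subject):
    """/G style: pass over a shortened substring after each match."""
    found, rest = [], subject
    while True:
        m = pattern.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]
    return found


pat = re.compile(r"\Bi(\w\w)")
print(matches_like_g(pat, "Mississippi"))
print(matches_like_G(pat, "Mississippi"))
```

With the shortened substring, the second "iss" is lost: it now sits at the
start of the string, where \B cannot match before a word character.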
131    
132         If any call to pcre_exec() in a /g or /G sequence matches an
133         empty  string,  the next call is done with the PCRE_NOTEMPTY
134         and PCRE_ANCHORED flags set in order to search for  another,
135         non-empty,  match  at  the same point.  If this second match
136         fails, the start offset is advanced by one, and  the  normal
137         match  is  retried.  This imitates the way Perl handles such
138         cases when using the /g modifier or the split() function.
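The retry logic described above can be sketched as a loop around a matching
callback. Everything here is illustrative: exec_fn stands in for a call to
pcre_exec(), its third argument models setting PCRE_NOTEMPTY and
PCRE_ANCHORED together, and exec_a_star is a hand-written toy matcher for
the pattern /a*/ rather than a real regex engine.

```python
def global_matches(exec_fn, subject):
    """Collect (start, end) offsets the way a /g sequence is described above."""
    results = []
    start = 0
    retry = False  # True right after an empty match
    while start <= len(subject):
        m = exec_fn(subject, start, retry)
        if m is None:
            if not retry:
                break          # a normal match failed: the sequence is over
            start += 1         # the non-empty retry failed: advance by one
            retry = False
            continue
        results.append(m)
        s, e = m
        start = e
        retry = (s == e)       # empty match: retry at the same point
    return results


def exec_a_star(subject, start, notempty_anchored):
    """Toy matcher for /a*/ beginning at `start` (hypothetical helper)."""
    end = start
    while end < len(subject) and subject[end] == "a":
        end += 1
    if notempty_anchored and end == start:
        return None            # only an empty match is possible here
    return (start, end)


print(global_matches(exec_a_star, "baac"))
```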
139    
140         There are a number of other modifiers  for  controlling  the
141         way pcretest operates.
142    
143         The /+ modifier requests that as well as outputting the sub-
144         string  that  matched the entire pattern, pcretest should in
145         addition output the remainder of the subject string. This is
146         useful  for tests where the subject contains multiple copies
147         of the same substring.
148    
149         The /L modifier must be followed directly by the name  of  a
150         locale, for example,
151    
152           /pattern/Lfr
153    
154         For this reason, it must be the last  modifier  letter.  The
155         given  locale is set, pcre_maketables() is called to build a
156         set of character tables for the locale,  and  this  is  then
157         passed  to pcre_compile() when compiling the regular expres-
158         sion. Without an /L modifier, NULL is passed as  the  tables
159         pointer; that is, /L applies only to the expression on which
160         it appears.
161    
162         The /I modifier requests that  pcretest  output  information
163         about the compiled expression (whether it is anchored, has a
164         fixed first character, and so on). It does this  by  calling
165         pcre_fullinfo() after compiling an expression, and outputting
166         the information it gets back. If the pattern is studied,
167         the results of that are also output.

168         The /D modifier is a  PCRE  debugging  feature,  which  also
169         assumes /I.  It causes the internal form of compiled regular
170         expressions to be output after compilation.
171    
172         The /S modifier causes pcre_study() to be called  after  the
173         expression  has been compiled, and the results used when the
174         expression is matched.
175    
176         The /M modifier causes the size of memory block used to hold
177         the compiled pattern to be output.
178    
179         The /P modifier causes pcretest to call PCRE via  the  POSIX
180         wrapper  API  rather than its native API. When this is done,
181         all other modifiers except  /i,  /m,  and  /+  are  ignored.
182         REG_ICASE is set if /i is present, and REG_NEWLINE is set if
183         /m    is    present.    The    wrapper    functions    force
184         PCRE_DOLLAR_ENDONLY    always,    and   PCRE_DOTALL   unless
185         REG_NEWLINE is set.
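The effect of /P on the options in force can be summarised with a small
table-like function. This is a sketch, not pcretest code: the function name
is hypothetical, and the mapping of REG_ICASE to PCRE_CASELESS and of
REG_NEWLINE to PCRE_MULTILINE is taken from the POSIX wrapper's documented
behaviour rather than from this page.

```python
def posix_effective_options(modifiers):
    """Return (regcomp flags, resulting PCRE options) for a /P pattern."""
    cflags = set()
    if "i" in modifiers:
        cflags.add("REG_ICASE")
    if "m" in modifiers:
        cflags.add("REG_NEWLINE")

    pcre_options = {"PCRE_DOLLAR_ENDONLY"}       # forced always
    if "REG_ICASE" in cflags:
        pcre_options.add("PCRE_CASELESS")
    if "REG_NEWLINE" in cflags:
        pcre_options.add("PCRE_MULTILINE")
    else:
        pcre_options.add("PCRE_DOTALL")          # forced unless REG_NEWLINE is set
    return cflags, pcre_options


print(posix_effective_options("P"))
print(posix_effective_options("imP"))
```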
186    
187         The /8 modifier  causes  pcretest  to  call  PCRE  with  the
188         PCRE_UTF8  option  set.  This turns on the (currently incom-
189         plete) support for UTF-8 character handling  in  PCRE,  pro-
190         vided  that  it was compiled with this support enabled. This
191         modifier also causes any non-printing characters  in  output
192         strings  to  be printed using the \x{hh...} notation if they
193         are valid UTF-8 sequences.
194    
195    
196    
197    DATA LINES
198         Before each data line is passed to pcre_exec(), leading  and
199         trailing whitespace is removed, and it is then scanned for \
200         escapes. The following are recognized:
201    
202           \a         alarm (= BEL)
203           \b         backspace
204           \e         escape
205           \f         formfeed
206           \n         newline
207           \r         carriage return
208           \t         tab
209           \v         vertical tab
210           \nnn       octal character (up to 3 octal digits)
211           \xhh       hexadecimal character (up to 2 hex digits)
212           \x{hh...}  hexadecimal UTF-8 character
213    
214           \A         pass the PCRE_ANCHORED option to pcre_exec()
215           \B         pass the PCRE_NOTBOL option to pcre_exec()
216           \Cdd       call pcre_copy_substring() for substring dd
217                         after a successful match (any decimal number
218                         less than 32)
219           \Gdd       call pcre_get_substring() for substring dd
221                         after a successful match (any decimal number
222                         less than 32)
223           \L         call pcre_get_substringlist() after a
224                         successful match
225           \N         pass the PCRE_NOTEMPTY option to pcre_exec()
226           \Odd       set the size of the output vector passed to
227                         pcre_exec() to dd (any number of decimal
228                         digits)
229           \Z         pass the PCRE_NOTEOL option to pcre_exec()
230    
231         When \O is used, the value may be higher or lower than the
232         size set by the -o option (or the default of 45); \O applies only
233         to the call of pcre_exec() for the line in which it appears.
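The relationship between the vector size and the number of capturing
subexpressions (45 elements for 14 of them) can be checked with a little
arithmetic. The sketch assumes the pcre_exec() convention that only the
first two-thirds of the vector holds offset pairs, with the remainder used
as workspace; the function name is hypothetical.

```python
def max_capturing_subexpressions(osize):
    """How many capturing subexpressions fit in an output vector of osize ints."""
    usable_ints = (osize // 3) * 2   # first two-thirds hold start/end offsets
    pairs = usable_ints // 2         # one pair per reported substring
    return max(pairs - 1, 0)         # pair 0 is the whole match


print(max_capturing_subexpressions(45))
```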
234    
235         A backslash followed by anything else just escapes the  any-
236         thing else. If the very last character is a backslash, it is
237         ignored. This gives a way of passing an empty line as  data,
238         since a real empty line terminates the data input.
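A rough model of the data-line scan is shown below. It is a sketch for
illustration only: the function names are hypothetical, options are merely
recorded rather than passed to anything, and the \x{hh...} form is
deliberately left unimplemented.

```python
import string

SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
OCTAL = "01234567"


def _take(line, i, allowed, limit):
    """Collect up to `limit` characters from `allowed`, starting at index i."""
    j = i
    while j < len(line) and j - i < limit and line[j] in allowed:
        j += 1
    return line[i:j], j


def decode_data_line(line):
    """Return (subject_bytes, options) for one data line."""
    line = line.strip()                      # leading/trailing whitespace removed
    subject, options = bytearray(), []
    i = 0
    while i < len(line):
        ch = line[i]
        i += 1
        if ch != "\\":
            subject += ch.encode("utf-8")
            continue
        if i == len(line):                   # a final backslash is ignored
            break
        ch = line[i]
        i += 1
        if ch in SIMPLE:
            subject.append(SIMPLE[ch])
        elif ch in OCTAL:                    # \nnn, up to 3 octal digits
            digits, i = _take(line, i - 1, OCTAL, 3)
            subject.append(int(digits, 8) & 0xFF)
        elif ch == "x":
            if line[i:i + 1] == "{":
                raise NotImplementedError("\\x{hh...} is not handled in this sketch")
            digits, i = _take(line, i, string.hexdigits, 2)
            subject.append(int(digits or "0", 16))
        elif ch in "ABLNZ":                  # option escapes without a number
            options.append(ch)
        elif ch in "CGO":                    # option escapes followed by digits
            digits, i = _take(line, i, string.digits, len(line))
            options.append((ch, int(digits or "0")))
        else:                                # anything else escapes itself
            subject += ch.encode("utf-8")
    return bytes(subject), options


print(decode_data_line(r"\Aabc\n\x41\101\O6"))
print(decode_data_line("\\"))                # an empty subject
```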
239    
240         If /P was present on the regex, causing  the  POSIX  wrapper
241         API  to  be  used,  only \B and \Z have any effect, causing
242         REG_NOTBOL and REG_NOTEOL to be passed to regexec(), respectively.
244    
245         The use of \x{hh...} to represent UTF-8  characters  is  not
246         dependent  on  the use of the /8 modifier on the pattern. It
247         is recognized always. There may be any number of hexadecimal
248         digits  inside  the  braces.  The  result is from one to six
249         bytes, encoded according to the UTF-8 rules.
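The "one to six bytes" rule corresponds to the original UTF-8 scheme, which
covers values up to 0x7FFFFFFF. The encoder below is a self-contained sketch
of that scheme with a hypothetical function name; it is not taken from
pcretest, and Python's built-in encoder is not used because it stops at
four bytes.

```python
def utf8_encode(code):
    """Encode an integer below 0x80000000 as 1 to 6 bytes of UTF-8."""
    if code < 0:
        raise ValueError("negative value")
    if code < 0x80:
        return bytes([code])
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in layouts:
        if code < limit:
            tail = []
            for _ in range(nbytes - 1):
                tail.append(0x80 | (code & 0x3F))  # low 6 bits per trailing byte
                code >>= 6
            return bytes([lead | code] + tail[::-1])
    raise ValueError("value too large for 6-byte UTF-8")


for value in (0x41, 0xE9, 0x20AC, 0x7FFFFFFF):
    print(hex(value), utf8_encode(value).hex())
```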
250    
251    
252    
253    OUTPUT FROM PCRETEST
254         When a match succeeds, pcretest outputs the list of captured
255         substrings  that pcre_exec() returns, starting with number 0
256         for the string that matched the whole pattern.  Here  is  an
257         example of an interactive pcretest run.
258    
259           $ pcretest
260           PCRE version 2.06 08-Jun-1999
261    
262             re> /^abc(\d+)/
263           data> abc123
264            0: abc123
265            1: 123
266           data> xyz
267           No match
268    
269         If the strings contain any non-printing characters, they are
270         output  as  \0x  escapes,  or  as  \x{...} escapes if the /8
271         modifier was present on the pattern. If the pattern has  the
272         /+  modifier, then the output for substring 0 is followed by
273         the rest of the subject string, identified by "0+"  like
274         this:
275    
276             re> /cat/+
277           data> cataract
278            0: cat
279            0+ aract
280    
281         If the pattern has the /g or /G  modifier,  the  results  of
282         successive  matching  attempts  are output in sequence, like
283         this:
284    
285             re> /\Bi(\w\w)/g
286           data> Mississippi
287            0: iss
288            1: ss
289            0: iss
290            1: ss
291            0: ipp
292            1: pp
293    
294         "No match" is output only if the first match attempt fails.
295    
296         If any of the sequences \C, \G, or \L are present in a  data
297         line  that is successfully matched, the substrings extracted
298         by the convenience functions are output  with  C,  G,  or  L
299         after the string number instead of a colon. This is in addi-
300         tion to the normal full list. The string  length  (that  is,
301         the  return  from  the  extraction  function)  is  given  in
302         parentheses after each string for \C and \G.
303    
304         Note that while patterns can be continued over several lines
305         (a  plain  ">" prompt is used for continuations), data lines
306         may not. However, newlines can be included in data by means
307         of the \n escape.
308    
309    
310    
311    AUTHOR
312         Philip Hazel <ph10@cam.ac.uk>
313         University Computing Service,
314         New Museums Site,
315         Cambridge CB2 3QG, England.
316         Phone: +44 1223 334714
317    
318    Last updated: 15 August 2001
319    Copyright (c) 1997-2001 University of Cambridge.

 This program is intended for testing PCRE, but it can also be used for
 experimenting with regular expressions.
   
 If it is given two filename arguments, it reads from the first and writes to  
 the second. If it is given only one filename argument, it reads from that file  
 and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and  
 prompts for each line of input.  
   
 The program handles any number of sets of input on a single input file. Each  
 set starts with a regular expression, and continues with any number of data  
 lines to be matched against the pattern. An empty line signals the end of the  
 set. The regular expressions are given enclosed in any non-alphameric  
 delimiters other than backslash, for example  
   
   /(a|bc)x+yz/  
   
 White space before the initial delimiter is ignored. A regular expression may  
 be continued over several input lines, in which case the newline characters are  
 included within it. See the testinput files for many examples. It is possible  
 to include the delimiter within the pattern by escaping it, for example  
   
   /abc\/def/  
   
 If you do so, the escape and the delimiter form part of the pattern, but since  
 delimiters are always non-alphameric, this does not affect its interpretation.  
 If the terminating delimiter is immediately followed by a backslash, for  
 example,  
   
   /abc/\  
   
 then a backslash is added to the end of the pattern. This is done to provide a  
 way of testing the error condition that arises if a pattern finishes with a  
 backslash, because  
   
   /abc\/  
   
 is interpreted as the first line of a pattern that starts with "abc/", causing  
 pcretest to read the next line as a continuation of the regular expression.  
   
 The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,  
 PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For  
 example:  
   
   /caseless/i  
   
 These modifier letters have the same effect as they do in Perl. There are  
 others which set PCRE options that do not correspond to anything in Perl: /A,  
 /E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.  
   
 Searching for all possible matches within each subject string can be requested  
 by the /g or /G modifier. After finding a match, PCRE is called again to search  
 the remainder of the subject string. The difference between /g and /G is that  
 the former uses the startoffset argument to pcre_exec() to start searching at  
 a new point within the entire string (which is in effect what Perl does),  
 whereas the latter passes over a shortened substring. This makes a difference  
 to the matching process if the pattern begins with a lookbehind assertion  
 (including \b or \B).  
   
 If any call to pcre_exec() in a /g or /G sequence matches an empty string, the  
 next call is done with the PCRE_NOTEMPTY flag set so that it cannot match an  
 empty string again at the same point. If, however, this second match fails, the
 start offset is advanced by one, and the match is retried. This imitates the  
 way Perl handles such cases when using the /g modifier or the split() function.  
   
 There are a number of other modifiers for controlling the way pcretest  
 operates.  
   
 The /+ modifier requests that as well as outputting the substring that matched  
 the entire pattern, pcretest should in addition output the remainder of the  
 subject string. This is useful for tests where the subject contains multiple  
 copies of the same substring.  
   
 The /L modifier must be followed directly by the name of a locale, for example,  
   
   /pattern/Lfr  
   
 For this reason, it must be the last modifier letter. The given locale is set,  
 pcre_maketables() is called to build a set of character tables for the locale,  
 and this is then passed to pcre_compile() when compiling the regular  
 expression. Without an /L modifier, NULL is passed as the tables pointer; that  
 is, /L applies only to the expression on which it appears.  
   
 The /I modifier requests that pcretest output information about the compiled  
 expression (whether it is anchored, has a fixed first character, and so on). It  
 does this by calling pcre_info() after compiling an expression, and outputting  
 the information it gets back. If the pattern is studied, the results of that  
 are also output.  
   
 The /D modifier is a PCRE debugging feature, which also assumes /I. It causes  
 the internal form of compiled regular expressions to be output after  
 compilation.  
   
 The /S modifier causes pcre_study() to be called after the expression has been  
 compiled, and the results used when the expression is matched.  
   
 The /M modifier causes the size of memory block used to hold the compiled  
 pattern to be output.  
   
 Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API  
 rather than its native API. When this is done, all other modifiers except /i,  
 /m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is  
 set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,  
 and PCRE_DOTALL unless REG_NEWLINE is set.  
   
 Before each data line is passed to pcre_exec(), leading and trailing whitespace  
 is removed, and it is then scanned for \ escapes. The following are recognized:  
   
   \a     alarm (= BEL)  
   \b     backspace  
   \e     escape  
   \f     formfeed  
   \n     newline  
   \r     carriage return  
   \t     tab  
   \v     vertical tab  
   \nnn   octal character (up to 3 octal digits)  
   \xhh   hexadecimal character (up to 2 hex digits)  
   
   \A     pass the PCRE_ANCHORED option to pcre_exec()  
   \B     pass the PCRE_NOTBOL option to pcre_exec()  
   \Cdd   call pcre_copy_substring() for substring dd after a successful match  
            (any decimal number less than 32)  
   \Gdd   call pcre_get_substring() for substring dd after a successful match  
            (any decimal number less than 32)  
   \L     call pcre_get_substringlist() after a successful match  
   \N     pass the PCRE_NOTEMPTY option to pcre_exec()  
   \Odd   set the size of the output vector passed to pcre_exec() to dd  
            (any number of decimal digits)  
   \Z     pass the PCRE_NOTEOL option to pcre_exec()  
   
 A backslash followed by anything else just escapes the anything else. If the  
 very last character is a backslash, it is ignored. This gives a way of passing  
 an empty line as data, since a real empty line terminates the data input.  
   
 If /P was present on the regex, causing the POSIX wrapper API to be used, only  
 \B and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
 regexec() respectively.  
   
 When a match succeeds, pcretest outputs the list of captured substrings that  
 pcre_exec() returns, starting with number 0 for the string that matched the  
 whole pattern. Here is an example of an interactive pcretest run.  
   
   $ pcretest  
   PCRE version 2.06 08-Jun-1999  
   
     re> /^abc(\d+)/  
   data> abc123  
    0: abc123  
    1: 123  
   data> xyz  
   No match  
   
 If the strings contain any non-printing characters, they are output as \0x  
 escapes. If the pattern has the /+ modifier, then the output for substring 0 is  
 followed by the rest of the subject string, identified by "0+" like this:
   
     re> /cat/+  
   data> cataract  
    0: cat  
    0+ aract  
   
 If the pattern has the /g or /G modifier, the results of successive matching  
 attempts are output in sequence, like this:  
   
     re> /\Bi(\w\w)/g  
   data> Mississippi  
    0: iss  
    1: ss  
    0: iss  
    1: ss  
    0: ipp  
    1: pp  
   
 "No match" is output only if the first match attempt fails.  
   
 If any of \C, \G, or \L are present in a data line that is successfully  
 matched, the substrings extracted by the convenience functions are output with  
 C, G, or L after the string number instead of a colon. This is in addition to  
 the normal full list. The string length (that is, the return from the  
 extraction function) is given in parentheses after each string for \C and \G.  
   
 Note that while patterns can be continued over several lines (a plain ">"  
 prompt is used for continuations), data lines may not. However newlines can be  
 included in data by means of the \n escape.  
   
 If the -p option is given to pcretest, it is equivalent to adding /P to each  
 regular expression: the POSIX wrapper API is used to call PCRE. None of the  
 following flags has any effect in this case.  
   
 If the option -d is given to pcretest, it is equivalent to adding /D to each  
 regular expression: the internal form is output after compilation.  
   
 If the option -i is given to pcretest, it is equivalent to adding /I to each  
 regular expression: information about the compiled pattern is given after  
 compilation.  
   
 If the option -m is given to pcretest, it outputs the size of each compiled  
 pattern after it has been compiled. It is equivalent to adding /M to each  
 regular expression. For compatibility with earlier versions of pcretest, -s is  
 a synonym for -m.  
   
 If the -t option is given, each compile, study, and match is run 20000 times  
 while being timed, and the resulting time per compile or match is output in  
 milliseconds. Do not set -t with -s, because you will then get the size output  
 20000 times and the timing will be distorted. If you want to change the number  
 of repetitions used for timing, edit the definition of LOOPREPEAT at the top of  
 pcretest.c.
   
 Philip Hazel <ph10@cam.ac.uk>  
 January 2000  

