Diff of /code/trunk/doc/pcretest.txt

revision 49 by nigel, Sat Feb 24 21:39:33 2007 UTC revision 53 by nigel, Sat Feb 24 21:39:42 2007 UTC
# Line 1  Line 1 
1  The pcretest program  NAME
2  --------------------       pcretest - a program  for  testing  Perl-compatible  regular
3         expressions.
4    
 This program is intended for testing PCRE, but it can also be used for  
 experimenting with regular expressions.  
5    
 If it is given two filename arguments, it reads from the first and writes to  
 the second. If it is given only one filename argument, it reads from that file  
 and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and  
 prompts for each line of input, using "re>" to prompt for regular expressions,  
 and "data>" to prompt for data lines.  
   
 The program handles any number of sets of input on a single input file. Each  
 set starts with a regular expression, and continues with any number of data  
 lines to be matched against the pattern. An empty line signals the end of the  
 data lines, at which point a new regular expression is read. The regular  
 expressions are given enclosed in any non-alphameric delimiters other than  
 backslash, for example  
   
   /(a|bc)x+yz/  
   
 White space before the initial delimiter is ignored. A regular expression may  
 be continued over several input lines, in which case the newline characters are  
 included within it. See the test input files in the testdata directory for many  
 examples. It is possible to include the delimiter within the pattern by  
 escaping it, for example  
   
   /abc\/def/  
   
 If you do so, the escape and the delimiter form part of the pattern, but since  
 delimiters are always non-alphameric, this does not affect its interpretation.  
 If the terminating delimiter is immediately followed by a backslash, for  
 example,  
   
   /abc/\  
   
 then a backslash is added to the end of the pattern. This is done to provide a  
 way of testing the error condition that arises if a pattern finishes with a  
 backslash, because  
6    
7    /abc\/  SYNOPSIS
8         pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source]  [des-
9         tination]
10    
11         pcretest was written as a test program for the PCRE  regular
12         expression  library  itself,  but  it  can  also be used for
13         experimenting  with  regular  expressions.  This  man   page
14         describes  the  features of the test program; for details of
15         the regular expressions themselves, see the pcre man page.
16    
17    
18    
19    OPTIONS
20         -d        Behave as if each regex had the /D  modifier  (see
21                   below); the internal form is output after compila-
22                   tion.
23    
24         -i        Behave as if  each  regex  had  the  /I  modifier;
25                   information  about  the  compiled pattern is given
26                   after compilation.
27    
28         -m        Output the size of each compiled pattern after  it
29                   has been compiled. This is equivalent to adding /M
30                   to each regular expression. For compatibility with
31                   earlier  versions of pcretest, -s is a synonym for
32                   -m.
33    
34         -o osize  Set the number of elements in  the  output  vector
35                   that  is  used  when calling PCRE to be osize. The
36                   default value is 45, which is enough for  14  cap-
37                   turing  subexpressions.  The  vector  size  can be
38                   changed for individual matching calls by including
39                   \O in the data line (see below).
40    
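The default of 45 and the figure of 14 capturing subexpressions are related by simple arithmetic. The sketch below assumes the rule from the pcre_exec() documentation (it is not stated on this page) that the output vector is consumed in groups of three elements per reported substring, and that substring 0 is the whole match. The function names are illustrative only.

```python
def ovector_size(captures: int) -> int:
    """Elements needed for `captures` capturing groups plus substring 0.

    Assumption: three vector elements per reported substring.
    """
    return 3 * (captures + 1)


def max_captures(osize: int) -> int:
    """Capturing groups that fit in a vector of `osize` elements."""
    return osize // 3 - 1


print(max_captures(45))   # 14, the figure quoted for the default size
print(ovector_size(14))   # 45
```

Under the same assumption, a data line containing \O6 would leave room for the whole match and one capturing group.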
41         -p        Behave as if each regex had the /P modifier; the POSIX
42                   wrapper  API  is  used  to  call PCRE. None of the
43                   other options has any effect when -p is set.
44    
45         -t        Run each compile, study,  and  match  20000  times
46                   with a timer, and output the resulting time per
47                   compile or match (in milliseconds).  Do  not  set  -t
48                   with -m, because you will then get the size output
49                   20000 times and the timing will be distorted.
50    
51    
52    
53    DESCRIPTION
54         If pcretest is given two filename arguments, it  reads  from
55         the  first and writes to the second. If it is given only one
64         filename argument, it reads from that  file  and  writes  to
65         stdout. Otherwise, it reads from stdin and writes to stdout,
66         and prompts for each line of input, using  "re>"  to  prompt
67         for  regular  expressions,  and  "data>"  to prompt for data
68         lines.
69    
70         The program handles any number of sets of input on a  single
71         input  file.  Each set starts with a regular expression, and
72         continues with any  number  of  data  lines  to  be  matched
73         against  the  pattern.  An empty line signals the end of the
74         data lines, at which point a new regular expression is read.
75         The  regular  expressions  are  given  enclosed  in any non-
76         alphameric delimiters other than backslash, for example
77    
78           /(a|bc)x+yz/
79    
80         White space before the initial delimiter is ignored. A regu-
81         lar expression may be continued over several input lines, in
82         which case the newline characters are included within it. It
83         is  possible  to include the delimiter within the pattern by
84         escaping it, for example
85    
86           /abc\/def/
87    
88         If you do so, the escape and the delimiter form part of  the
89         pattern,  but  since  delimiters  are always non-alphameric,
90         this does not affect its interpretation.  If the terminating
91         delimiter  is immediately followed by a backslash, for exam-
92         ple,
93    
94           /abc/\
95    
96         then a backslash is added to the end of the pattern. This is
97         done  to  provide  a way of testing the error condition that
98         arises if a pattern finishes with a backslash, because
99    
100           /abc\/
101    
102         is interpreted as the first line of a  pattern  that  starts
103         with  "abc/",  causing  pcretest  to read the next line as a
104         continuation of the regular expression.
105    
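The delimiter rules above can be sketched as a small parser. This is an illustration of the rules as described, not the code in pcretest.c; the name split_pattern and the convention of returning None when a continuation line is needed are assumptions made for the example.

```python
def split_pattern(line):
    """Split one input line into (pattern, modifiers).

    Returns None when no unescaped closing delimiter is found, which is
    the case where the next line would be read as a continuation.
    """
    line = line.lstrip()              # white space before the delimiter is ignored
    if not line:
        raise ValueError("empty line")
    delim = line[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphameric and not a backslash")
    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            i += 2                    # an escaped character stays in the pattern
            continue
        if line[i] == delim:
            break
        i += 1
    else:
        return None                   # pattern continues on the next line
    pattern, rest = line[1:i], line[i + 1:]
    if rest.startswith("\\"):
        pattern += "\\"               # /abc/\ adds a backslash to the pattern
        rest = rest[1:]
    return pattern, rest.strip()


print(split_pattern("  /(a|bc)x+yz/"))   # ('(a|bc)x+yz', '')
print(split_pattern(r"/abc\/def/"))      # ('abc\\/def', '')
print(split_pattern("/abc/\\"))          # ('abc\\', '')
print(split_pattern("/abc\\/"))          # None
```

Whatever follows the closing delimiter is returned unparsed, so a line such as /caseless/i yields the modifier string "i".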
 is interpreted as the first line of a pattern that starts with "abc/", causing  
 pcretest to read the next line as a continuation of the regular expression.  
106    
107    
108  PATTERN MODIFIERS  PATTERN MODIFIERS
109  -----------------       The pattern may be followed by i, m, s,  or  x  to  set  the
110         PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
111         options, respectively. For example:
112    
113           /caseless/i
114    
115         These modifier letters have the same effect as  they  do  in
116         Perl.  There  are  others which set PCRE options that do not
117         correspond  to  anything  in  Perl:   /A,  /E,  and  /X  set
118         PCRE_ANCHORED,  PCRE_DOLLAR_ENDONLY,  and PCRE_EXTRA respec-
119         tively.
120    
121         Searching for  all  possible  matches  within  each  subject
122         string  can  be  requested  by  the /g or /G modifier. After
123         finding  a  match,  PCRE  is  called  again  to  search  the
124         remainder  of  the subject string. The difference between /g
125         and /G is that the former uses the startoffset  argument  to
126         pcre_exec()  to  start  searching  at a new point within the
127         entire string (which is in effect what Perl  does),  whereas
128         the  latter  passes over a shortened substring. This makes a
129         difference to the matching process  if  the  pattern  begins
130         with a lookbehind assertion (including \b or \B).
131    
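The difference between /g and /G can be illustrated by analogy with Python's re module, which is not PCRE. The sketch assumes that the pos argument of a compiled pattern's search() method behaves like the startoffset argument (text before pos remains visible to \B), and that the pattern never matches the empty string. The function names are illustrative only.

```python
import re


def matches_with_offset(pattern, subject):
    """/g style: search the whole string again from a new start offset."""
    results, pos = [], 0
    while True:
        m = pattern.search(subject, pos)
        if m is None:
            return results
        results.append(m.group(0))
        pos = m.end()


def matches_with_substring(pattern, subject):
    """/G style: search a shortened copy of the subject each time."""
    results, rest = [], subject
    while True:
        m = pattern.search(rest)
        if m is None:
            return results
        results.append(m.group(0))
        rest = rest[m.end():]


pattern = re.compile(r"\Bi(\w\w)")
print(matches_with_offset(pattern, "Mississippi"))     # ['iss', 'iss', 'ipp']
print(matches_with_substring(pattern, "Mississippi"))  # ['iss', 'ipp']
```

In the second function the shortened string "issippi" starts with "i", so \B fails at its first character and the second "iss" is not found.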
132         If any call to pcre_exec() in a /g or /G sequence matches an
133         empty  string,  the next call is done with the PCRE_NOTEMPTY
134         and PCRE_ANCHORED flags set in order to search for  another,
135         non-empty,  match  at  the same point.  If this second match
136         fails, the start offset is advanced by one, and  the  normal
137         match  is  retried.  This imitates the way Perl handles such
138         cases when using the /g modifier or the split() function.
139    
140         There are a number of other modifiers  for  controlling  the
141         way pcretest operates.
142    
143         The /+ modifier requests that as well as outputting the sub-
144         string  that  matched the entire pattern, pcretest should in
145         addition output the remainder of the subject string. This is
146         useful  for tests where the subject contains multiple copies
147         of the same substring.
148    
149         The /L modifier must be followed directly by the name  of  a
150         locale, for example,
151    
152           /pattern/Lfr
153    
154         For this reason, it must be the last  modifier  letter.  The
155         given  locale is set, pcre_maketables() is called to build a
156         set of character tables for the locale,  and  this  is  then
157         passed  to pcre_compile() when compiling the regular expres-
158         sion. Without an /L modifier, NULL is passed as  the  tables
159         pointer; that is, /L applies only to the expression on which
160         it appears.
161    
162         The /I modifier requests that  pcretest  output  information
163         about the compiled expression (whether it is anchored, has a
164         fixed first character, and so on). It does this  by  calling
165         pcre_fullinfo()  after  compiling an expression, and output-
166         ting the information it gets back. If the  pattern  is  stu-
167         died, the results of that are also output.
168         The /D modifier is a  PCRE  debugging  feature,  which  also
169         assumes /I.  It causes the internal form of compiled regular
170         expressions to be output after compilation.
171    
172         The /S modifier causes pcre_study() to be called  after  the
173         expression  has been compiled, and the results used when the
174         expression is matched.
175    
176         The /M modifier causes the size of the memory block used to
177         hold the compiled pattern to be output.
178    
179         The /P modifier causes pcretest to call PCRE via  the  POSIX
180         wrapper  API  rather than its native API. When this is done,
181         all other modifiers except  /i,  /m,  and  /+  are  ignored.
182         REG_ICASE is set if /i is present, and REG_NEWLINE is set if
183         /m    is    present.    The    wrapper    functions    force
184         PCRE_DOLLAR_ENDONLY    always,    and   PCRE_DOTALL   unless
185         REG_NEWLINE is set.
186    
187         The /8 modifier  causes  pcretest  to  call  PCRE  with  the
188         PCRE_UTF8  option  set.  This turns on the (currently incom-
189         plete) support for UTF-8 character handling  in  PCRE,  pro-
190         vided  that  it was compiled with this support enabled. This
191         modifier also causes any non-printing characters  in  output
192         strings  to  be printed using the \x{hh...} notation if they
193         are valid UTF-8 sequences.
194    
 The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,  
 PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For  
 example:  
   
   /caseless/i  
   
 These modifier letters have the same effect as they do in Perl. There are  
 others which set PCRE options that do not correspond to anything in Perl: /A,  
 /E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.  
   
 Searching for all possible matches within each subject string can be requested  
 by the /g or /G modifier. After finding a match, PCRE is called again to search  
 the remainder of the subject string. The difference between /g and /G is that  
 the former uses the startoffset argument to pcre_exec() to start searching at  
 a new point within the entire string (which is in effect what Perl does),  
 whereas the latter passes over a shortened substring. This makes a difference  
 to the matching process if the pattern begins with a lookbehind assertion  
 (including \b or \B).  
   
 If any call to pcre_exec() in a /g or /G sequence matches an empty string, the  
 next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set in order  
 to search for another, non-empty, match at the same point. If this second match  
 fails, the start offset is advanced by one, and the normal match is retried.  
 This imitates the way Perl handles such cases when using the /g modifier or the  
 split() function.  
   
 There are a number of other modifiers for controlling the way pcretest  
 operates.  
   
 The /+ modifier requests that as well as outputting the substring that matched  
 the entire pattern, pcretest should in addition output the remainder of the  
 subject string. This is useful for tests where the subject contains multiple  
 copies of the same substring.  
   
 The /L modifier must be followed directly by the name of a locale, for example,  
   
   /pattern/Lfr  
   
 For this reason, it must be the last modifier letter. The given locale is set,  
 pcre_maketables() is called to build a set of character tables for the locale,  
 and this is then passed to pcre_compile() when compiling the regular  
 expression. Without an /L modifier, NULL is passed as the tables pointer; that  
 is, /L applies only to the expression on which it appears.  
   
 The /I modifier requests that pcretest output information about the compiled  
 expression (whether it is anchored, has a fixed first character, and so on). It  
 does this by calling pcre_fullinfo() after compiling an expression, and  
 outputting the information it gets back. If the pattern is studied, the results  
 of that are also output.  
   
 The /D modifier is a PCRE debugging feature, which also assumes /I. It causes  
 the internal form of compiled regular expressions to be output after  
 compilation.  
   
 The /S modifier causes pcre_study() to be called after the expression has been  
 compiled, and the results used when the expression is matched.  
   
 The /M modifier causes the size of memory block used to hold the compiled  
 pattern to be output.  
   
 The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather  
 than its native API. When this is done, all other modifiers except /i, /m, and  
 /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m  
 is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and  
 PCRE_DOTALL unless REG_NEWLINE is set.  
   
 The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option set.  
 This turns on the (currently incomplete) support for UTF-8 character handling  
 in PCRE, provided that it was compiled with this support enabled. This modifier  
 also causes any non-printing characters in output strings to be printed using  
 the \x{hh...} notation if they are valid UTF-8 sequences.  
195    
196    
197  DATA LINES  DATA LINES
198  ----------       Before each data line is passed to pcre_exec(), leading  and
199         trailing whitespace is removed, and it is then scanned for \
200  Before each data line is passed to pcre_exec(), leading and trailing whitespace       escapes. The following are recognized:
201  is removed, and it is then scanned for \ escapes. The following are recognized:  
202           \a         alarm (= BEL)
203           \b         backspace
204           \e         escape
205           \f         formfeed
206           \n         newline
207           \r         carriage return
208           \t         tab
209           \v         vertical tab
210           \nnn       octal character (up to 3 octal digits)
211           \xhh       hexadecimal character (up to 2 hex digits)
212           \x{hh...}  hexadecimal UTF-8 character
213    
214           \A         pass the PCRE_ANCHORED option to pcre_exec()
215           \B         pass the PCRE_NOTBOL option to pcre_exec()
216           \Cdd       call pcre_copy_substring() for substring dd
217                         after a successful match (any decimal number
218                         less than 32)
219           \Gdd       call pcre_get_substring() for substring dd
220    
221                         after a successful match (any decimal number
222                         less than 32)
223           \L         call pcre_get_substringlist() after a
224                         successful match
225           \N         pass the PCRE_NOTEMPTY option to pcre_exec()
226           \Odd       set the size of the output vector passed to
227                         pcre_exec() to dd (any number of decimal
228                         digits)
229           \Z         pass the PCRE_NOTEOL option to pcre_exec()
230    
231         When \O is used, it may be higher or lower than the size set
232         by  the  -o  option (or defaulted to 45); \O applies only to
233         the call of pcre_exec() for the line in which it appears.
234    
235         A backslash followed by anything else just escapes the  any-
236         thing else. If the very last character is a backslash, it is
237         ignored. This gives a way of passing an empty line as  data,
238         since a real empty line terminates the data input.
239    
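The escape rules for data lines can be sketched as follows. This is an illustration of the rules listed above, not the code in pcretest.c. Several points are assumptions of the example: the result is returned as a Python string of code points rather than bytes, option escapes such as \A or \Odd are only recorded in a list, \x with no hexadecimal digits is taken to yield a zero character, and values above 0x10FFFF in \x{hh...} are not supported because chr() rejects them. The names decode_data_line and take are hypothetical.

```python
import string

# Character escapes from the list above.
SIMPLE = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
          "n": "\n", "r": "\r", "t": "\t", "v": "\v"}
OCTAL = "01234567"


def take(line, i, allowed, limit=None):
    """Collect characters from line[i:] while they are in `allowed`."""
    j = i
    while j < len(line) and line[j] in allowed and (limit is None or j - i < limit):
        j += 1
    return line[i:j], j


def decode_data_line(line):
    """Return (subject text, list of option escapes) for one data line."""
    line = line.strip()                      # leading and trailing whitespace removed
    text, options = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        i += 1
        if ch != "\\":
            text.append(ch)
            continue
        if i == len(line):
            break                            # a final backslash is ignored
        esc = line[i]
        i += 1
        if esc in SIMPLE:
            text.append(SIMPLE[esc])
        elif esc in OCTAL:
            digits, i = take(line, i - 1, OCTAL, 3)
            text.append(chr(int(digits, 8)))
        elif esc == "x":
            close = line.find("}", i)
            body = line[i + 1:close] if line[i:i + 1] == "{" and close > i else ""
            if body and all(c in string.hexdigits for c in body):
                text.append(chr(int(body, 16)))           # \x{hh...}
                i = close + 1
            else:
                digits, i = take(line, i, string.hexdigits, 2)
                text.append(chr(int(digits or "0", 16)))  # \xhh
        elif esc in "ABLNZ":
            options.append(esc)              # option escape without a number
        elif esc in "CGO":
            digits, i = take(line, i, string.digits)
            options.append(esc + digits)     # option escape followed by digits
        else:
            text.append(esc)                 # anything else escapes itself
    return "".join(text), options


print(decode_data_line(r"\Aabc\O6"))   # ('abc', ['A', 'O6'])
print(decode_data_line("\\"))          # ('', [])
```

The second call shows how a line consisting of a single backslash yields an empty subject.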
240         If /P was present on the regex, causing  the  POSIX  wrapper
241         API  to  be  used, only \B and \Z have any effect, causing
242         REG_NOTBOL and REG_NOTEOL to be passed to regexec()  respec-
243         tively.
244    
245         The use of \x{hh...} to represent UTF-8  characters  is  not
246         dependent  on  the use of the /8 modifier on the pattern. It
247         is recognized always. There may be any number of hexadecimal
248         digits  inside  the  braces.  The  result is from one to six
249         bytes, encoded according to the UTF-8 rules.
250    
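The "one to six bytes" figure corresponds to the original UTF-8 definition, which covers values up to 0x7FFFFFFF. The sketch below encodes a value according to that scheme; it is an illustration only, not the code used by pcretest, and the name utf8_encode is hypothetical.

```python
def utf8_encode(value: int) -> bytes:
    """Encode a value in the range 0..0x7FFFFFFF as 1 to 6 UTF-8 bytes."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range for six-byte UTF-8")
    if value < 0x80:
        return bytes([value])                 # single byte, same as ASCII
    # Smallest sequence length whose payload can hold the value.
    for nbytes, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if value < limit:
            break
    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (value & 0x3F))     # continuation byte: 10xxxxxx
        value >>= 6
    lead = (0xFF << (8 - nbytes)) & 0xFF      # 110xxxxx, 1110xxxx, and so on
    out.append(lead | value)
    return bytes(reversed(out))


print(utf8_encode(0xE9).hex())                # c3a9
print(utf8_encode(0x20AC).hex())              # e282ac
print(len(utf8_encode(0x7FFFFFFF)))           # 6
```

For values up to 0x10FFFF (excluding surrogates) the result agrees with Python's own str.encode("utf-8").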
   \a         alarm (= BEL)  
   \b         backspace  
   \e         escape  
   \f         formfeed  
   \n         newline  
   \r         carriage return  
   \t         tab  
   \v         vertical tab  
   \nnn       octal character (up to 3 octal digits)  
   \xhh       hexadecimal character (up to 2 hex digits)  
   \x{hh...}  hexadecimal UTF-8 character  
   
   \A         pass the PCRE_ANCHORED option to pcre_exec()  
   \B         pass the PCRE_NOTBOL option to pcre_exec()  
   \Cdd       call pcre_copy_substring() for substring dd after a successful  
                match (any decimal number less than 32)  
   \Gdd       call pcre_get_substring() for substring dd after a successful  
                match (any decimal number less than 32)  
   \L         call pcre_get_substringlist() after a successful match  
   \N         pass the PCRE_NOTEMPTY option to pcre_exec()  
   \Odd       set the size of the output vector passed to pcre_exec() to dd  
                (any number of decimal digits)  
   \Z         pass the PCRE_NOTEOL option to pcre_exec()  
   
 A backslash followed by anything else just escapes the anything else. If the  
 very last character is a backslash, it is ignored. This gives a way of passing  
 an empty line as data, since a real empty line terminates the data input.  
   
 If /P was present on the regex, causing the POSIX wrapper API to be used, only  
 \B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to  
 regexec() respectively.  
   
 The use of \x{hh...} to represent UTF-8 characters is not dependent on the use  
 of the /8 modifier on the pattern. It is recognized always. There may be any  
 number of hexadecimal digits inside the braces. The result is from one to six  
 bytes, encoded according to the UTF-8 rules.  
251    
252    
253  OUTPUT FROM PCRETEST  OUTPUT FROM PCRETEST
254  --------------------       When a match succeeds, pcretest outputs the list of captured
255         substrings  that pcre_exec() returns, starting with number 0
256  When a match succeeds, pcretest outputs the list of captured substrings that       for the string that matched the whole pattern.  Here  is  an
257  pcre_exec() returns, starting with number 0 for the string that matched the       example of an interactive pcretest run.
258  whole pattern. Here is an example of an interactive pcretest run.  
259           $ pcretest
260    $ pcretest         PCRE version 2.06 08-Jun-1999
261    PCRE version 2.06 08-Jun-1999  
262             re> /^abc(\d+)/
263      re> /^abc(\d+)/         data> abc123
264    data> abc123          0: abc123
265     0: abc123          1: 123
266     1: 123         data> xyz
267    data> xyz         No match
268    No match  
269         If the strings contain any non-printing characters, they are
270  If the strings contain any non-printing characters, they are output as \0x       output  as  \0x  escapes,  or  as  \x{...} escapes if the /8
271  escapes, or as \x{...} escapes if the /8 modifier was present on the pattern.       modifier was present on the pattern. If the pattern has  the
272  If the pattern has the /+ modifier, then the output for substring 0 is followed       /+  modifier, then the output for substring 0 is followed by
273  by the rest of the subject string, identified by "0+" like this:       the rest of the subject string, identified by "0+"  like
274         this:
275      re> /cat/+  
276    data> cataract           re> /cat/+
277     0: cat         data> cataract
278     0+ aract          0: cat
279            0+ aract
280  If the pattern has the /g or /G modifier, the results of successive matching  
281  attempts are output in sequence, like this:       If the pattern has the /g or /G  modifier,  the  results  of
282         successive  matching  attempts  are output in sequence, like
283      re> /\Bi(\w\w)/g       this:
284    data> Mississippi  
285     0: iss           re> /\Bi(\w\w)/g
286     1: ss         data> Mississippi
287     0: iss          0: iss
288     1: ss          1: ss
289     0: ipp          0: iss
290     1: pp          1: ss
291            0: ipp
292  "No match" is output only if the first match attempt fails.          1: pp
293    
294  If any of \C, \G, or \L are present in a data line that is successfully       "No match" is output only if the first match attempt fails.
295  matched, the substrings extracted by the convenience functions are output with  
296  C, G, or L after the string number instead of a colon. This is in addition to       If any of the sequences \C, \G, or \L are present in a  data
297  the normal full list. The string length (that is, the return from the       line  that is successfully matched, the substrings extracted
298  extraction function) is given in parentheses after each string for \C and \G.       by the convenience functions are output  with  C,  G,  or  L
299         after the string number instead of a colon. This is in addi-
300  Note that while patterns can be continued over several lines (a plain ">"       tion to the normal full list. The string  length  (that  is,
301  prompt is used for continuations), data lines may not. However newlines can be       the  return  from  the  extraction  function)  is  given  in
302  included in data by means of the \n escape.       parentheses after each string for \C and \G.
303    
304         Note that while patterns can be continued over several lines
305  COMMAND LINE OPTIONS       (a  plain  ">" prompt is used for continuations), data lines
306  --------------------       may not. However newlines can be included in data  by  means
307         of the \n escape.
308  If the -p option is given to pcretest, it is equivalent to adding /P to each  
309  regular expression: the POSIX wrapper API is used to call PCRE. None of the  
310  following flags has any effect in this case.  
311    AUTHOR
312  If the option -d is given to pcretest, it is equivalent to adding /D to each       Philip Hazel <ph10@cam.ac.uk>
313  regular expression: the internal form is output after compilation.       University Computing Service,
314         New Museums Site,
315  If the option -i is given to pcretest, it is equivalent to adding /I to each       Cambridge CB2 3QG, England.
316  regular expression: information about the compiled pattern is given after       Phone: +44 1223 334714
 compilation.  
   
 If the option -m is given to pcretest, it outputs the size of each compiled  
 pattern after it has been compiled. It is equivalent to adding /M to each  
 regular expression. For compatibility with earlier versions of pcretest, -s is  
 a synonym for -m.  
   
 If the -t option is given, each compile, study, and match is run 20000 times  
 while being timed, and the resulting time per compile or match is output in  
 milliseconds. Do not set -t with -m, because you will then get the size output  
 20000 times and the timing will be distorted. If you want to change the number  
 of repetitions used for timing, edit the definition of LOOPREPEAT at the top of  
 pcretest.c  
317    
318  Philip Hazel <ph10@cam.ac.uk>       Last updated: 15 August 2001
319  August 2000       Copyright (c) 1997-2001 University of Cambridge.

Legend:
Removed from v.49  
changed lines
  Added in v.53
