Diff of /code/trunk/doc/pcre.txt

revision 51 by nigel, Sat Feb 24 21:39:37 2007 UTC revision 416 by ph10, Sat Apr 11 14:34:02 2009 UTC
1    -----------------------------------------------------------------------------
2    This file contains a concatenation of the PCRE man pages, converted to plain
3    text format for ease of searching with a text editor, or for use on systems
4    that do not have a man page processor. The small individual files that give
5    synopses of each function in the library have not been included. There are
6    separate text files for the pcregrep and pcretest commands.
7    -----------------------------------------------------------------------------
8    
9    
10    PCRE(3)                                                                PCRE(3)
11    
12    
13    NAME
14           PCRE - Perl-compatible regular expressions
15    
16    
17    INTRODUCTION
18    
19           The  PCRE  library is a set of functions that implement regular expres-
20           sion pattern matching using the same syntax and semantics as Perl, with
21           just  a  few  differences. Certain features that appeared in Python and
22           PCRE before they appeared in Perl are also available using  the  Python
23           syntax.  There is also some support for certain .NET and Oniguruma syn-
24           tax items, and there is an option for  requesting  some  minor  changes
25           that give better JavaScript compatibility.
26    
27           The  current  implementation of PCRE (release 7.x) corresponds approxi-
28           mately with Perl 5.10, including support for UTF-8 encoded strings  and
29           Unicode general category properties. However, UTF-8 and Unicode support
30           has to be explicitly enabled; it is not the default. The Unicode tables
31           correspond to Unicode release 5.1.
32    
33           In  addition to the Perl-compatible matching function, PCRE contains an
34           alternative matching function that matches the same  compiled  patterns
35           in  a different way. In certain circumstances, the alternative function
36           has some advantages. For a discussion of the two  matching  algorithms,
37           see the pcrematching page.
38    
39           PCRE  is  written  in C and released as a C library. A number of people
40           have written wrappers and interfaces of various kinds.  In  particular,
41           Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
42           included as part of the PCRE distribution. The pcrecpp page has details
43           of  this  interface.  Other  people's contributions can be found in the
44           Contrib directory at the primary FTP site, which is:
45    
46           ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
47    
48           Details of exactly which Perl regular expression features are  and  are
49           not supported by PCRE are given in separate documents. See the pcrepat-
50           tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
51           page.
52    
53           Some  features  of  PCRE can be included, excluded, or changed when the
54           library is built. The pcre_config() function makes it  possible  for  a
55           client  to  discover  which  features are available. The features them-
56           selves are described in the pcrebuild page. Documentation about  build-
57           ing  PCRE for various operating systems can be found in the README file
58           in the source distribution.
59    
60           The library contains a number of undocumented  internal  functions  and
61           data  tables  that  are  used by more than one of the exported external
62           functions, but which are not intended  for  use  by  external  callers.
63           Their  names  all begin with "_pcre_", which hopefully will not provoke
64           any name clashes. In some environments, it is possible to control which
65           external  symbols  are  exported when a shared library is built, and in
66           these cases the undocumented symbols are not exported.
67    
68    
69    USER DOCUMENTATION
70    
71           The user documentation for PCRE comprises a number  of  different  sec-
72           tions.  In the "man" format, each of these is a separate "man page". In
73           the HTML format, each is a separate page, linked from the  index  page.
74           In  the  plain text format, all the sections are concatenated, for ease
75           of searching. The sections are as follows:
76    
77             pcre              this document
78             pcre-config       show PCRE installation configuration information
79             pcreapi           details of PCRE's native C API
80             pcrebuild         options for building PCRE
81             pcrecallout       details of the callout feature
82             pcrecompat        discussion of Perl compatibility
83             pcrecpp           details of the C++ wrapper
84             pcregrep          description of the pcregrep command
85             pcrematching      discussion of the two matching algorithms
86             pcrepartial       details of the partial matching facility
87             pcrepattern       syntax and semantics of supported
88                                 regular expressions
89             pcresyntax        quick syntax reference
90             pcreperform       discussion of performance issues
91             pcreposix         the POSIX-compatible C API
92             pcreprecompile    details of saving and re-using precompiled patterns
93             pcresample        discussion of the sample program
94             pcrestack         discussion of stack usage
95             pcretest          description of the pcretest testing command
96    
97           In addition, in the "man" and HTML formats, there is a short  page  for
98           each C library function, listing its arguments and results.
99    
100    
101    LIMITATIONS
102    
103           There  are some size limitations in PCRE but it is hoped that they will
104           never in practice be relevant.
105    
106           The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
107           is compiled with the default internal linkage size of 2. If you want to
108           process regular expressions that are truly enormous,  you  can  compile
109           PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
110           the source distribution and the pcrebuild documentation  for  details).
111           In  these  cases the limit is substantially larger.  However, the speed
112           of execution is slower.
113    
114           All values in repeating quantifiers must be less than 65536.
115    
116           There is no limit to the number of parenthesized subpatterns, but there
117           can be no more than 65535 capturing subpatterns.
118    
119           The maximum length of a name for a named subpattern is 32 characters, and
120           the maximum number of named subpatterns is 10000.
121    
122           The maximum length of a subject string is the largest  positive  number
123           that  an integer variable can hold. However, when using the traditional
124           matching function, PCRE uses recursion to handle subpatterns and indef-
125           inite  repetition.  This means that the available stack space may limit
126           the size of a subject string that can be processed by certain patterns.
127           For a discussion of stack issues, see the pcrestack documentation.
128    
129    
130    UTF-8 AND UNICODE PROPERTY SUPPORT
131    
132           From  release  3.3,  PCRE  has  had  some support for character strings
133           encoded in the UTF-8 format. For release 4.0 this was greatly  extended
134           to  cover  most common requirements, and in release 5.0 additional sup-
135           port for Unicode general category properties was added.
136    
137           In order to process UTF-8 strings, you must build PCRE to include  UTF-8
138           support  in  the  code,  and, in addition, you must call pcre_compile()
139           with the PCRE_UTF8 option flag, or the  pattern  must  start  with  the
140           sequence  (*UTF8).  When  either of these is the case, both the pattern
141           and any subject strings that are matched  against  it  are  treated  as
142           UTF-8 strings instead of just strings of bytes.
143    
144           If  you compile PCRE with UTF-8 support, but do not use it at run time,
145           the library will be a bit bigger, but the additional run time  overhead
146           is limited to testing the PCRE_UTF8 flag occasionally, so should not be
147           very big.
148    
149           If PCRE is built with Unicode character property support (which implies
150           UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
151           ported.  The available properties that can be tested are limited to the
152           general  category  properties such as Lu for an upper case letter or Nd
153           for a decimal number, the Unicode script names such as Arabic  or  Han,
154           and  the  derived  properties  Any  and L&. A full list is given in the
155           pcrepattern documentation. Only the short names for properties are sup-
156           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
157           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
158           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
159           does not support this.
160    
161       Validity of UTF-8 strings
162    
163           When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and
164           subjects are (by default) checked for validity on entry to the relevant
165           functions. From release 7.3 of PCRE, the check is according to the rules
166           of  RFC  3629, which are themselves derived from the Unicode specifica-
167           tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which
168           allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current
169           check allows only values in the range U+0 to U+10FFFF, excluding U+D800
170           to U+DFFF.
171    
172           The  excluded  code  points are the "Low Surrogate Area" of Unicode, of
173           which the Unicode Standard says this: "The Low Surrogate Area does  not
174           contain  any  character  assignments,  consequently  no  character code
175           charts or namelists are provided for this area. Surrogates are reserved
176           for  use  with  UTF-16 and then must be used in pairs." The code points
177           that are encoded by UTF-16 pairs  are  available  as  independent  code
178           points  in  the  UTF-8  encoding.  (In other words, the whole surrogate
179           thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
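The range rule described above (U+0 to U+10FFFF, excluding U+D800 to U+DFFF) can be sketched as a standalone check. This is only an illustration under stated assumptions, not PCRE's internal validator: the function name is_valid_utf8 is hypothetical, and the sketch also rejects overlong encodings, which RFC 3629 forbids.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns 1 if the len bytes at s
   are well-formed UTF-8 under RFC 3629, otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;      /* code point being decoded */
        size_t extra, k;       /* number of continuation bytes expected */

        if (c < 0x80) { i++; continue; }              /* one-byte character */
        else if (c < 0xC0) return 0;                  /* stray continuation byte */
        else if (c < 0xE0) { extra = 1; cp = c & 0x1F; }
        else if (c < 0xF0) { extra = 2; cp = c & 0x0F; }
        else if (c < 0xF8) { extra = 3; cp = c & 0x07; }
        else return 0;          /* 5- and 6-byte forms exist only in RFC 2279 */

        if (len - i <= extra) return 0;               /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        /* Reject overlong forms: each length has a minimum code point. */
        if (extra == 1 && cp < 0x80) return 0;
        if (extra == 2 && cp < 0x800) return 0;
        if (extra == 3 && cp < 0x10000) return 0;

        if (cp > 0x10FFFF) return 0;                  /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate code point */

        i += extra + 1;
    }
    return 1;
}
```

With this sketch, the two bytes C2 B3 (U+00B3) are accepted, while the three bytes ED A0 80 (U+D800) and the four bytes F4 90 80 80 (U+110000) are rejected.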
180    
181           If an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error  return
182           (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
183           that your strings are valid, and therefore want to skip these checks in
184           order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
185           compile time or at run time, PCRE assumes that the pattern  or  subject
186           it  is  given  (respectively)  contains only valid UTF-8 codes. In this
187           case, it does not diagnose an invalid UTF-8 string.
188    
189           If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,
190           what  happens  depends on why the string is invalid. If the string con-
191           forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
192           string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,
193           apart from the initial validity test, PCRE (when in UTF-8 mode) handles
194           strings  according  to  the more liberal rules of RFC 2279. However, if
195           the string does not even conform to RFC 2279, the result is  undefined.
196           Your program may crash.
197    
198           If  you  want  to  process  strings  of  values  in the full range 0 to
199           0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can
200           set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
201           this situation, you will have to apply your own validity check.
202    
203       General comments about UTF-8 mode
204    
205           1. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
206           two-byte UTF-8 character if the value is greater than 127.
207    
208           2.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
209           characters for values greater than \177.
210    
211           3. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
212           vidual bytes, for example: \x{100}{3}.
213    
214           4.  The dot metacharacter matches one UTF-8 character instead of a sin-
215           gle byte.
216    
217           5. The escape sequence \C can be used to match a single byte  in  UTF-8
218           mode,  but  its  use can lead to some strange effects. This facility is
219           not available in the alternative matching function, pcre_dfa_exec().
220    
221           6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
222           test  characters of any code value, but the characters that PCRE recog-
223           nizes as digits, spaces, or word characters  remain  the  same  set  as
224           before, all with values less than 256. This remains true even when PCRE
225           includes Unicode property support, because to do otherwise  would  slow
226           down  PCRE in many common cases. If you really want to test for a wider
227           sense of, say, "digit", you must use Unicode  property  tests  such  as
228           \p{Nd}.  Note  that  this  also applies to \b, because it is defined in
229           terms of \w and \W.
230    
231           7. Similarly, characters that match the POSIX named  character  classes
232           are all low-valued characters.
233    
234           8.  However,  the Perl 5.10 horizontal and vertical whitespace matching
235           escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
236           acters.
237    
238           9.  Case-insensitive  matching  applies only to characters whose values
239           are less than 128, unless PCRE is built with Unicode property  support.
240           Even  when  Unicode  property support is available, PCRE still uses its
241           own character tables when checking the case of  low-valued  characters,
242           so  as not to degrade performance.  The Unicode property information is
243           used only for characters with higher values. Even when Unicode property
244           support is available, PCRE supports case-insensitive matching only when
245           there is a one-to-one mapping between a letter's  cases.  There  are  a
246           small  number  of  many-to-one  mappings in Unicode; these are not sup-
247           ported by PCRE.
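Items 1 and 2 above say that values greater than 127, such as \xb3 or \777, match two-byte UTF-8 characters. The following sketch shows where those two bytes come from. The function name utf8_encode is hypothetical and is not part of the PCRE API; it follows the standard UTF-8 bit layout for code points up to U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Encodes code point cp as UTF-8
   into out, which must have room for 4 bytes. Returns the number of bytes
   written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the value 0xb3 is encoded as the two bytes C2 B3, octal 777 (0x1FF) as C7 BF, and 0x100 as C4 80, whereas any value below 0x80 remains a single byte.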
248    
249    
250    AUTHOR
251    
252           Philip Hazel
253           University Computing Service
254           Cambridge CB2 3QH, England.
255    
256           Putting an actual email address here seems to have been a spam  magnet,
257           so  I've  taken  it away. If you want to email me, use my two initials,
258           followed by the two digits 10, at the domain cam.ac.uk.
259    
260    
261    REVISION
262    
263           Last updated: 11 April 2009
264           Copyright (c) 1997-2009 University of Cambridge.
265    ------------------------------------------------------------------------------
266    
267    
268    PCREBUILD(3)                                                      PCREBUILD(3)
269    
270    
271    NAME
272           PCRE - Perl-compatible regular expressions
273    
274    
275    PCRE BUILD-TIME OPTIONS
276    
277           This  document  describes  the  optional  features  of PCRE that can be
278           selected when the library is compiled. It assumes use of the  configure
279           script,  where the optional features are selected or deselected by pro-
280           viding options to configure before running the make  command.  However,
281           the  same  options  can be selected in both Unix-like and non-Unix-like
282           environments using the GUI facility of  CMakeSetup  if  you  are  using
283           CMake instead of configure to build PCRE.
284    
285           The complete list of options for configure (which includes the standard
286           ones such as the  selection  of  the  installation  directory)  can  be
287           obtained by running
288    
289             ./configure --help
290    
291           The  following  sections  include  descriptions  of options whose names
292           begin with --enable or --disable. These settings specify changes to the
293           defaults  for  the configure command. Because of the way that configure
294           works, --enable and --disable always come in pairs, so  the  complemen-
295           tary  option always exists as well, but as it specifies the default, it
296           is not described.
297    
298    
299    C++ SUPPORT
300    
301           By default, the configure script will search for a C++ compiler and C++
302           header files. If it finds them, it automatically builds the C++ wrapper
303           library for PCRE. You can disable this by adding
304    
305             --disable-cpp
306    
307           to the configure command.
308    
309    
310    UTF-8 SUPPORT
311    
312           To build PCRE with support for UTF-8 Unicode character strings, add
313    
314             --enable-utf8
315    
316           to the configure command. Of itself, this  does  not  make  PCRE  treat
317           strings  as UTF-8. As well as compiling PCRE with this option, you also
318           have to set the PCRE_UTF8 option when you call the  pcre_compile()
319           function.
320    
321           If  you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
322           expects its input to be either ASCII or UTF-8 (depending on the runtime
323           option).  It  is not possible to support both EBCDIC and UTF-8 codes in
324           the same  version  of  the  library.  Consequently,  --enable-utf8  and
325           --enable-ebcdic are mutually exclusive.
326    
327    
328    UNICODE CHARACTER PROPERTY SUPPORT
329    
330           UTF-8  support allows PCRE to process character values greater than 255
331           in the strings that it handles. On its own, however, it does  not  pro-
332           vide any facilities for accessing the properties of such characters. If
333           you want to be able to use the pattern escapes \P, \p,  and  \X,  which
334           refer to Unicode character properties, you must add
335    
336             --enable-unicode-properties
337    
338           to  the configure command. This implies UTF-8 support, even if you have
339           not explicitly requested it.
340    
341           Including Unicode property support adds around 30K  of  tables  to  the
342           PCRE  library.  Only  the general category properties such as Lu and Nd
343           are supported. Details are given in the pcrepattern documentation.
344    
345    
346    CODE VALUE OF NEWLINE
347    
348           By default, PCRE interprets the linefeed (LF) character  as  indicating
349           the  end  of  a line. This is the normal newline character on Unix-like
350           systems. You can compile PCRE to use carriage return (CR)  instead,  by
351           adding
352    
353             --enable-newline-is-cr
354    
355           to  the  configure  command.  There  is  also  a --enable-newline-is-lf
356           option, which explicitly specifies linefeed as the newline character.
357    
358           Alternatively, you can specify that line endings are to be indicated by
359           the two character sequence CRLF. If you want this, add
360    
361             --enable-newline-is-crlf
362    
363           to the configure command. There is a fourth option, specified by
364    
365             --enable-newline-is-anycrlf
366    
367           which  causes  PCRE  to recognize any of the three sequences CR, LF, or
368           CRLF as indicating a line ending. Finally, a fifth option, specified by
369    
370             --enable-newline-is-any
371    
372           causes PCRE to recognize any Unicode newline sequence.
373    
374           Whatever line ending convention is selected when PCRE is built  can  be
375           overridden  when  the library functions are called. At build time it is
376           conventional to use the standard for your operating system.
377    
378    
379    WHAT \R MATCHES
380    
381           By default, the sequence \R in a pattern matches  any  Unicode  newline
382           sequence,  whatever  has  been selected as the line ending sequence. If
383           you specify
384    
385             --enable-bsr-anycrlf
386    
387           the default is changed so that \R matches only CR, LF, or  CRLF.  What-
388           ever  is selected when PCRE is built can be overridden when the library
389           functions are called.
390    
391    
392    BUILDING SHARED AND STATIC LIBRARIES
393    
394           The PCRE building process uses libtool to build both shared and  static
395           Unix  libraries by default. You can suppress one of these by adding one
396           of
397    
398             --disable-shared
399             --disable-static
400    
401           to the configure command, as required.
402    
403    
404    POSIX MALLOC USAGE
405    
406           When PCRE is called through the POSIX interface (see the pcreposix doc-
407           umentation),  additional  working  storage  is required for holding the
408           pointers to capturing substrings, because PCRE requires three  integers
409           per  substring,  whereas  the POSIX interface provides only two. If the
410           number of expected substrings is small, the wrapper function uses space
411           on the stack, because this is faster than using malloc() for each call.
412           The default threshold above which the stack is no longer used is 10; it
413           can be changed by adding a setting such as
414    
415             --with-posix-malloc-threshold=20
416    
417           to the configure command.
418    
419    
420    HANDLING VERY LARGE PATTERNS
421    
422           Within  a  compiled  pattern,  offset values are used to point from one
423           part to another (for example, from an opening parenthesis to an  alter-
424           nation  metacharacter).  By default, two-byte values are used for these
425           offsets, leading to a maximum size for a  compiled  pattern  of  around
426           64K.  This  is sufficient to handle all but the most gigantic patterns.
427           Nevertheless, some people do want to process enormous patterns,  so  it
428           is  possible  to compile PCRE to use three-byte or four-byte offsets by
429           adding a setting such as
430    
431             --with-link-size=3
432    
433           to the configure command. The value given must be 2,  3,  or  4.  Using
434           longer  offsets slows down the operation of PCRE because it has to load
435           additional bytes when handling them.
436    
437    
438    AVOIDING EXCESSIVE STACK USAGE
439    
440           When matching with the pcre_exec() function, PCRE implements backtrack-
441           ing  by  making recursive calls to an internal function called match().
442           In environments where the size of the stack is limited,  this  can  se-
443           verely  limit  PCRE's operation. (The Unix environment does not usually
444           suffer from this problem, but it may sometimes be necessary to increase
445           the  maximum  stack size.  There is a discussion in the pcrestack docu-
446           mentation.) An alternative approach to recursion that uses memory  from
447           the  heap  to remember data, instead of using recursive function calls,
448           has been implemented to work round the problem of limited  stack  size.
449           If you want to build a version of PCRE that works this way, add
450    
451             --disable-stack-for-recursion
452    
453           to  the  configure  command. With this configuration, PCRE will use the
454           pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
455           ment  functions. By default these point to malloc() and free(), but you
456           can replace the pointers so that your own functions are used.
457    
458           Separate functions are  provided  rather  than  using  pcre_malloc  and
459           pcre_free  because  the  usage  is  very  predictable:  the block sizes
460           requested are always the same, and  the  blocks  are  always  freed  in
461           reverse  order.  A calling program might be able to implement optimized
462           functions that perform better  than  malloc()  and  free().  PCRE  runs
463           noticeably more slowly when built in this way. This option affects only
464           the  pcre_exec()  function;  it  is  not  relevant  for  the
465           pcre_dfa_exec() function.
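Because the blocks are all the same size and are freed in reverse order, a calling program could keep released blocks on a small list and hand them back on the next request. The following is a minimal sketch of that idea, not code from PCRE. The names my_stack_malloc, my_stack_free and my_stack_release_all are hypothetical, the sketch is not thread-safe, and whether it actually outperforms malloc() and free() on a given system is an assumption that would need measuring.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block. The union pads the header so
   that the user area after it stays suitably aligned. */
typedef union block_header {
    struct {
        size_t size;                 /* usable size of this block */
        union block_header *next;    /* link used while the block is cached */
    } h;
    long double align_ld;
    long long align_ll;
    void *align_p;
} block_header;

static block_header *free_list = NULL;   /* most recently freed block first */

/* Hypothetical replacement for the function behind pcre_stack_malloc. */
void *my_stack_malloc(size_t size)
{
    block_header *b;

    if (free_list != NULL && free_list->h.size == size) {
        b = free_list;                   /* reuse the last block freed */
        free_list = b->h.next;
    } else {
        if (size > (size_t)-1 - sizeof(block_header)) return NULL;
        b = malloc(sizeof(block_header) + size);
        if (b == NULL) return NULL;
        b->h.size = size;
    }
    return b + 1;                        /* user area follows the header */
}

/* Hypothetical replacement for the function behind pcre_stack_free.
   The block is cached for reuse instead of being returned to the system. */
void my_stack_free(void *p)
{
    block_header *b;

    if (p == NULL) return;
    b = (block_header *)p - 1;
    b->h.next = free_list;
    free_list = b;
}

/* Returns every cached block to the system, for example at program exit. */
void my_stack_release_all(void)
{
    while (free_list != NULL) {
        block_header *next = free_list->h.next;
        free(free_list);
        free_list = next;
    }
}
```

In a PCRE built with --disable-stack-for-recursion, such functions would be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before calling pcre_exec().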
466    
467    
468    LIMITING PCRE RESOURCE USAGE
469    
470           Internally,  PCRE has a function called match(), which it calls repeat-
471           edly  (sometimes  recursively)  when  matching  a  pattern   with   the
472           pcre_exec()  function.  By controlling the maximum number of times this
473           function may be called during a single matching operation, a limit  can
474           be  placed  on  the resources used by a single call to pcre_exec(). The
475           limit can be changed at run time, as described in the pcreapi  documen-
476           tation.  The default is 10 million, but this can be changed by adding a
477           setting such as
478    
479             --with-match-limit=500000
480    
481           to  the  configure  command.  This  setting  has  no  effect   on   the
482           pcre_dfa_exec() matching function.
483    
484           In  some  environments  it is desirable to limit the depth of recursive
485           calls of match() more strictly than the total number of calls, in order
486           to  restrict  the maximum amount of stack (or heap, if --disable-stack-
487           for-recursion is specified) that is used. A second limit controls this;
488           it  defaults  to  the  value  that is set for --with-match-limit, which
489           imposes no additional constraints. However, you can set a  lower  limit
490           by adding, for example,
491    
492             --with-match-limit-recursion=10000
493    
494           to  the  configure  command.  This  value can also be overridden at run
495           time.
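The effect of a limit on the number of calls can be illustrated with a toy backtracking matcher. This is a deliberately tiny stand-in written for this illustration, not PCRE's match() function: the names toy_match, toy_budget and the TOY_ constants are hypothetical, and the toy supports only literal characters, the dot, and a star following either.

```c
#include <stddef.h>

#define TOY_NOMATCH     0
#define TOY_MATCH       1
#define TOY_LIMIT_HIT (-1)

typedef struct {
    unsigned long calls;   /* calls to toy_match() made so far */
    unsigned long limit;   /* maximum number of calls allowed */
} toy_budget;

/* Matches the whole of text t against pattern p, backtracking over
   greedy stars. Every call is counted against the budget; once the
   budget is exhausted the search is abandoned with TOY_LIMIT_HIT. */
int toy_match(const char *p, const char *t, toy_budget *b)
{
    if (++b->calls > b->limit) return TOY_LIMIT_HIT;

    if (*p == '\0') return (*t == '\0') ? TOY_MATCH : TOY_NOMATCH;

    if (p[1] == '*') {
        const char *s = t;
        int rc;

        /* Greedy: consume as many characters as possible... */
        while (*s != '\0' && (*p == '.' || *p == *s)) s++;

        /* ...then give them back one at a time until the rest matches. */
        for (;;) {
            rc = toy_match(p + 2, s, b);
            if (rc != TOY_NOMATCH) return rc;   /* match found or limit hit */
            if (s == t) return TOY_NOMATCH;
            s--;
        }
    }

    if (*t != '\0' && (*p == '.' || *p == *t))
        return toy_match(p + 1, t + 1, b);

    return TOY_NOMATCH;
}
```

Matching the pattern a*a*a*b against a string of twenty "a" characters cannot succeed, and in this toy it takes about two thousand calls to establish that. With a budget of 1000 calls the search is abandoned early with TOY_LIMIT_HIT; with a budget of 10000 it runs to completion and returns TOY_NOMATCH.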
496    
497    
498    CREATING CHARACTER TABLES AT BUILD TIME
499    
500           PCRE uses fixed tables for processing characters whose code values  are
501           less  than 256. By default, PCRE is built with a set of tables that are
502           distributed in the file pcre_chartables.c.dist. These  tables  are  for
503           ASCII codes only. If you add
504    
505             --enable-rebuild-chartables
506    
507           to  the  configure  command, the distributed tables are no longer used.
508           Instead, a program called dftables is compiled and  run.  This  outputs
509           the source for a new set of tables, created in the default locale of your
510           C runtime system. (This method of replacing the tables does not work if
511           you  are cross compiling, because dftables is run on the local host. If
512           you need to create alternative tables when cross  compiling,  you  will
513           have to do so "by hand".)
514    
515    
516    USING EBCDIC CODE
517    
518           PCRE  assumes  by  default that it will run in an environment where the
519           character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
520           This  is  the  case for most computer operating systems. PCRE can, how-
521           ever, be compiled to run in an EBCDIC environment by adding
522    
523             --enable-ebcdic
524    
525           to the configure command. This setting implies --enable-rebuild-charta-
526           bles.  You  should  only  use  it if you know that you are in an EBCDIC
527           environment (for example,  an  IBM  mainframe  operating  system).  The
528           --enable-ebcdic option is incompatible with --enable-utf8.
529    
530    
531    PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
532    
533           By default, pcregrep reads all files as plain text. You can build it so
534           that it recognizes files whose names end in .gz or .bz2, and reads them
535           with libz or libbz2, respectively, by adding one or both of
536    
537             --enable-pcregrep-libz
538             --enable-pcregrep-libbz2
539    
540           to the configure command. These options naturally require that the rel-
541           evant libraries are installed on your system. Configuration  will  fail
542           if they are not.
543    
544    
545    PCRETEST OPTION FOR LIBREADLINE SUPPORT
546    
547           If you add
548    
549             --enable-pcretest-libreadline
550    
551           to  the  configure  command,  pcretest  is  linked with the libreadline
552           library, and when its input is from a terminal, it reads it  using  the
553           readline() function. This provides line-editing and history facilities.
554           Note that libreadline is GPL-licenced, so if you distribute a binary of
555           pcretest linked in this way, there may be licensing issues.
556    
557           Setting  this  option  causes  the -lreadline option to be added to the
558           pcretest build. In many operating environments with a system-installed
559           libreadline this is sufficient. However, in some environments (e.g.  if
560           an unmodified distribution version of readline is in use),  some  extra
561           configuration  may  be necessary. The INSTALL file for libreadline says
562           this:
563    
564             "Readline uses the termcap functions, but does not link with the
565             termcap or curses library itself, allowing applications which link
566             with readline to choose an appropriate library."
567    
568           If your environment has not been set up so that an appropriate  library
569           is automatically included, you may need to add something like
570    
571             LIBS="-lncurses"
572    
573           immediately before the configure command.
574    
575    
576    SEE ALSO
577    
578           pcreapi(3), pcre_config(3).
579    
580    
581    AUTHOR
582    
583           Philip Hazel
584           University Computing Service
585           Cambridge CB2 3QH, England.
586    
587    
588    REVISION
589    
590           Last updated: 17 March 2009
591           Copyright (c) 1997-2009 University of Cambridge.
592    ------------------------------------------------------------------------------
593    
594    
595    PCREMATCHING(3)                                                PCREMATCHING(3)
596    
597    
598    NAME
599           PCRE - Perl-compatible regular expressions
600    
601    
602    PCRE MATCHING ALGORITHMS
603    
604           This document describes the two different algorithms that are available
605           in PCRE for matching a compiled regular expression against a given sub-
606           ject  string.  The  "standard"  algorithm  is  the  one provided by the
607           pcre_exec() function.  This works in the same way  as  Perl's  matching
608           function, and provides a Perl-compatible matching operation.
609    
610           An  alternative  algorithm is provided by the pcre_dfa_exec() function;
611           this operates in a different way, and is not  Perl-compatible.  It  has
612           advantages  and disadvantages compared with the standard algorithm, and
613           these are described below.
614    
615           When there is only one possible way in which a given subject string can
616           match  a pattern, the two algorithms give the same answer. A difference
617           arises, however, when there are multiple possibilities. For example, if
618           the pattern
619    
620             ^<.*>
621    
622           is matched against the string
623    
624             <something> <something else> <something further>
625    
626           there are three possible answers. The standard algorithm finds only one
627           of them, whereas the alternative algorithm finds all three.
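As an illustration only, the three candidate matches for this particular
pattern can be enumerated by hand in plain C. The sketch below does not call
PCRE: the function name possible_match_ends is hypothetical, and the logic is
hard-wired to ^<.*> under the default setting in which dot does not match a
newline. Every '>' that follows the leading '<' ends one possible match.

```c
#include <stddef.h>

/* Hand-written illustration for the single pattern ^<.*> (not PCRE code).
 * Stores the end offset (one past the '>') of each possible match that
 * starts at offset 0, and returns how many were stored. */
int possible_match_ends(const char *subject, size_t ends[], int max_ends)
{
    int count = 0;
    size_t i;

    if (subject[0] != '<')          /* ^< must match at the very start */
        return 0;

    for (i = 1; subject[i] != '\0'; i++) {
        if (subject[i] == '\n')     /* by default, dot stops at a newline */
            break;
        if (subject[i] == '>' && count < max_ends)
            ends[count++] = i + 1;  /* each '>' closes one candidate match */
    }
    return count;
}
```

For the subject string shown above, this reports the end offsets 11, 28, and
48. With the greedy quantifier as written, pcre_exec() would return only the
last of these.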
628    
629    
630    REGULAR EXPRESSIONS AS TREES
631    
632           The set of strings that are matched by a regular expression can be rep-
633           resented  as  a  tree structure. An unlimited repetition in the pattern
634           makes the tree of infinite size, but it is still a tree.  Matching  the
635           pattern  to a given subject string (from a given starting point) can be
636           thought of as a search of the tree.  There are two  ways  to  search  a
637           tree:  depth-first  and  breadth-first, and these correspond to the two
638           matching algorithms provided by PCRE.
639    
640    
641    THE STANDARD MATCHING ALGORITHM
642    
643           In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
644           sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
645           depth-first search of the pattern tree. That is, it  proceeds  along  a
646           single path through the tree, checking that the subject matches what is
647           required. When there is a mismatch, the algorithm  tries  any  alterna-
648           tives  at  the  current point, and if they all fail, it backs up to the
649           previous branch point in the  tree,  and  tries  the  next  alternative
650           branch  at  that  level.  This often involves backing up (moving to the
651           left) in the subject string as well.  The  order  in  which  repetition
652           branches  are  tried  is controlled by the greedy or ungreedy nature of
653           the quantifier.
654    
655           If a leaf node is reached, a matching string has  been  found,  and  at
656           that  point the algorithm stops. Thus, if there is more than one possi-
657           ble match, this algorithm returns the first one that it finds.  Whether
658           this  is the shortest, the longest, or some intermediate length depends
659           on the way the greedy and ungreedy repetition quantifiers are specified
660           in the pattern.
661    
662           Because  it  ends  up  with a single path through the tree, it is rela-
663           tively straightforward for this algorithm to keep  track  of  the  sub-
664           strings  that  are  matched  by portions of the pattern in parentheses.
665           This provides support for capturing parentheses and back references.
666    
667    
668    THE ALTERNATIVE MATCHING ALGORITHM
669    
670           This algorithm conducts a breadth-first search of  the  tree.  Starting
671           from  the  first  matching  point  in the subject, it scans the subject
672           string from left to right, once, character by character, and as it does
673           this,  it remembers all the paths through the tree that represent valid
674           matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
675           though  it is not implemented as a traditional finite state machine (it
676           keeps multiple states active simultaneously).
677    
678           The scan continues until either the end of the subject is  reached,  or
679           there  are  no more unterminated paths. At this point, terminated paths
680           represent the different matching possibilities (if there are none,  the
681           match  has  failed).   Thus,  if there is more than one possible match,
682           this algorithm finds all of them, and in particular, it finds the long-
683           est.  In PCRE, there is an option to stop the algorithm after the first
684           match (which is necessarily the shortest) has been found.
685    
686           Note that all the matches that are found start at the same point in the
687           subject. If the pattern
688    
689             cat(er(pillar)?)
690    
691           is  matched  against the string "the caterpillar catchment", the result
692           will be the three strings "cat", "cater", and "caterpillar" that  start
693           at  the fifth character of the subject. The algorithm does not automat-
694           ically move on to find matches that start at later positions.
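The single left-to-right scan can be sketched in plain C for this one pattern.
This is a hand-written illustration, not PCRE's implementation: the function
name scan_cat_pattern is hypothetical, and the accepting lengths are worked
out by hand from cat(er(pillar)?). The scan starts at a caller-supplied offset
and records every length at which the pattern could terminate.

```c
#include <stddef.h>
#include <string.h>

/* Illustration for the single pattern cat(er(pillar)?) (not PCRE code).
 * Scans the subject once from 'start' and stores in lens[] the length of
 * every match that begins there. Returns the number of matches found. */
int scan_cat_pattern(const char *subject, size_t start, size_t lens[3])
{
    static const char longest[] = "caterpillar";
    /* Lengths of "cat", "cater", and "caterpillar". */
    static const size_t accept[3] = { 3, 5, 11 };
    int found = 0;
    size_t i;

    if (start > strlen(subject))
        return 0;

    /* Advance while the subject agrees with the longest alternative; the
     * terminating zero of the subject never equals a pattern character. */
    for (i = 0; longest[i] != '\0' && subject[start + i] == longest[i]; i++) {
        if (found < 3 && i + 1 == accept[found])
            lens[found++] = i + 1;   /* one path terminates here */
    }
    return found;
}
```

Called with "the caterpillar catchment" and start offset 4, the sketch finds
the lengths 3, 5, and 11. It reports nothing about "catchment" unless it is
called again with start offset 16, which mirrors the point that the scan does
not move on to later starting positions by itself.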
695    
696           There are a number of features of PCRE regular expressions that are not
697           supported by the alternative matching algorithm. They are as follows:
698    
699           1.  Because  the  algorithm  finds  all possible matches, the greedy or
700           ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
701           ungreedy quantifiers are treated in exactly the same way. However, pos-
702           sessive quantifiers can make a difference when what follows could  also
703           match what is quantified, for example in a pattern like this:
704    
705             ^a++\w!
706    
707           This  pattern matches "aaab!" but not "aaa!", which would be matched by
708           a non-possessive quantifier. Similarly, if an atomic group is  present,
709           it  is matched as if it were a standalone pattern at the current point,
710           and the longest match is then "locked in" for the rest of  the  overall
711           pattern.
712    
713           2. When dealing with multiple paths through the tree simultaneously, it
714           is not straightforward to keep track of  captured  substrings  for  the
715           different  matching  possibilities,  and  PCRE's implementation of this
716           algorithm does not attempt to do this. This means that no captured sub-
717           strings are available.
718    
719           3.  Because no substrings are captured, back references within the pat-
720           tern are not supported, and cause errors if encountered.
721    
722           4. For the same reason, conditional expressions that use  a  backrefer-
723           ence  as  the  condition or test for a specific group recursion are not
724           supported.
725    
726           5. Because many paths through the tree may be  active,  the  \K  escape
727           sequence, which resets the start of the match when encountered (but may
728           be on some paths and not on others), is not  supported.  It  causes  an
729           error if encountered.
730    
731           6.  Callouts  are  supported, but the value of the capture_top field is
732           always 1, and the value of the capture_last field is always -1.
733    
734           7. The \C escape sequence, which (in the standard algorithm) matches  a
735           single  byte, even in UTF-8 mode, is not supported because the alterna-
736           tive algorithm moves through the subject  string  one  character  at  a
737           time, for all active paths through the tree.
738    
739           8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
740           are not supported. (*FAIL) is supported, and  behaves  like  a  failing
741           negative assertion.
742    
743    
744    ADVANTAGES OF THE ALTERNATIVE ALGORITHM
745    
746           Using  the alternative matching algorithm provides the following advan-
747           tages:
748    
749           1. All possible matches (at a single point in the subject) are automat-
750           ically  found,  and  in particular, the longest match is found. To find
751           more than one match using the standard algorithm, you have to do kludgy
752           things with callouts.
753    
754           2.  There is much better support for partial matching. The restrictions
755           on the content of the pattern that apply when using the standard  algo-
756           rithm  for  partial matching do not apply to the alternative algorithm.
757           For non-anchored patterns, the starting position of a partial match  is
758           available.
759    
760           3.  Because  the  alternative  algorithm  scans the subject string just
761           once, and never needs to backtrack, it is possible to  pass  very  long
762           subject  strings  to  the matching function in several pieces, checking
763           for partial matching each time.
764    
765    
766    DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
767    
768           The alternative algorithm suffers from a number of disadvantages:
769    
770           1. It is substantially slower than  the  standard  algorithm.  This  is
771           partly  because  it has to search for all possible matches, but is also
772           because it is less susceptible to optimization.
773    
774           2. Capturing parentheses and back references are not supported.
775    
776           3. Although atomic groups are supported, their use does not provide the
777           performance advantage that it does for the standard algorithm.
778    
779    
780    AUTHOR
781    
782           Philip Hazel
783           University Computing Service
784           Cambridge CB2 3QH, England.
785    
786    
787    REVISION
788    
789           Last updated: 19 April 2008
790           Copyright (c) 1997-2008 University of Cambridge.
791    ------------------------------------------------------------------------------
792    
793    
794    PCREAPI(3)                                                          PCREAPI(3)
795    
796    
797    NAME
798           PCRE - Perl-compatible regular expressions
799    
800    
801    PCRE NATIVE API
802    
803           #include <pcre.h>
804    
805           pcre *pcre_compile(const char *pattern, int options,
806                const char **errptr, int *erroffset,
807                const unsigned char *tableptr);
808    
809           pcre *pcre_compile2(const char *pattern, int options,
810                int *errorcodeptr,
811                const char **errptr, int *erroffset,
812                const unsigned char *tableptr);
813    
814           pcre_extra *pcre_study(const pcre *code, int options,
815                const char **errptr);
816    
817           int pcre_exec(const pcre *code, const pcre_extra *extra,
818                const char *subject, int length, int startoffset,
819                int options, int *ovector, int ovecsize);
820    
821           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
822                const char *subject, int length, int startoffset,
823                int options, int *ovector, int ovecsize,
824                int *workspace, int wscount);
825    
826           int pcre_copy_named_substring(const pcre *code,
827                const char *subject, int *ovector,
828                int stringcount, const char *stringname,
829                char *buffer, int buffersize);
830    
831           int pcre_copy_substring(const char *subject, int *ovector,
832                int stringcount, int stringnumber, char *buffer,
833                int buffersize);
834    
835           int pcre_get_named_substring(const pcre *code,
836                const char *subject, int *ovector,
837                int stringcount, const char *stringname,
838                const char **stringptr);
839    
840           int pcre_get_stringnumber(const pcre *code,
841                const char *name);
842    
843           int pcre_get_stringtable_entries(const pcre *code,
844                const char *name, char **first, char **last);
845    
846           int pcre_get_substring(const char *subject, int *ovector,
847                int stringcount, int stringnumber,
848                const char **stringptr);
849    
850           int pcre_get_substring_list(const char *subject,
851                int *ovector, int stringcount, const char ***listptr);
852    
853           void pcre_free_substring(const char *stringptr);
854    
855           void pcre_free_substring_list(const char **stringptr);
856    
857           const unsigned char *pcre_maketables(void);
858    
859           int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
860                int what, void *where);
861    
862           int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
863    
864           int pcre_refcount(pcre *code, int adjust);
865    
866           int pcre_config(int what, void *where);
867    
868           char *pcre_version(void);
869    
870           void *(*pcre_malloc)(size_t);
871    
872           void (*pcre_free)(void *);
873    
874           void *(*pcre_stack_malloc)(size_t);
875    
876           void (*pcre_stack_free)(void *);
877    
878           int (*pcre_callout)(pcre_callout_block *);
879    
880    
881    PCRE API OVERVIEW
882    
883           PCRE has its own native API, which is described in this document. There
884           are also some wrapper functions that correspond to  the  POSIX  regular
885           expression  API.  These  are  described in the pcreposix documentation.
886           Both of these APIs define a set of C function calls. A C++  wrapper  is
887           distributed with PCRE. It is documented in the pcrecpp page.
888    
889           The  native  API  C  function prototypes are defined in the header file
890           pcre.h, and on Unix systems the library itself is called  libpcre.   It
891           can normally be accessed by adding -lpcre to the command for linking an
892           application  that  uses  PCRE.  The  header  file  defines  the  macros
893           PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
894           bers for the library.  Applications can use these  to  include  support
895           for different releases of PCRE.
896    
897           The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
898           pcre_exec() are used for compiling and matching regular expressions  in
899           a  Perl-compatible  manner. A sample program that demonstrates the sim-
900           plest way of using them is provided in the file  called  pcredemo.c  in
901           the  source distribution. The pcresample documentation describes how to
902           compile and run it.
903    
904           A second matching function, pcre_dfa_exec(), which is not Perl-compati-
905           ble,  is  also provided. This uses a different algorithm for the match-
906           ing. The alternative algorithm finds all possible matches (at  a  given
907           point  in  the subject), and scans the subject just once. However, this
908           algorithm does not return captured substrings. A description of the two
909           matching  algorithms and their advantages and disadvantages is given in
910           the pcrematching documentation.
911    
912           In addition to the main compiling and  matching  functions,  there  are
913           convenience functions for extracting captured substrings from a subject
914           string that is matched by pcre_exec(). They are:
915    
916             pcre_copy_substring()
917             pcre_copy_named_substring()
918             pcre_get_substring()
919             pcre_get_named_substring()
920             pcre_get_substring_list()
921             pcre_get_stringnumber()
922             pcre_get_stringtable_entries()
923    
924           pcre_free_substring() and pcre_free_substring_list() are also provided,
925           to free the memory used for extracted strings.
926    
927           The  function  pcre_maketables()  is  used  to build a set of character
928           tables  in  the  current  locale   for   passing   to   pcre_compile(),
929           pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
930           provided for specialist use.  Most  commonly,  no  special  tables  are
931           passed,  in  which case internal tables that are generated when PCRE is
932           built are used.
933    
934           The function pcre_fullinfo() is used to find out  information  about  a
935           compiled  pattern; pcre_info() is an obsolete version that returns only
936           some of the available information, but is retained for  backwards  com-
937           patibility.   The function pcre_version() returns a pointer to a string
938           containing the version of PCRE and its date of release.
939    
940           The function pcre_refcount() maintains a  reference  count  in  a  data
941           block  containing  a compiled pattern. This is provided for the benefit
942           of object-oriented applications.
943    
944           The global variables pcre_malloc and pcre_free  initially  contain  the
945           entry  points  of  the  standard malloc() and free() functions, respec-
946           tively. PCRE calls the memory management functions via these variables,
947           so  a  calling  program  can replace them if it wishes to intercept the
948           calls. This should be done before calling any PCRE functions.
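A minimal sketch of replacement functions follows. The names counting_malloc,
counting_free, and live_block_count are hypothetical; the only requirement
taken from the text above is that the replacements have the same signatures as
malloc() and free(). The block itself does not include pcre.h, so the
assignments to the PCRE variables appear only in a comment.

```c
#include <stdlib.h>

/* Number of blocks handed out and not yet freed. */
static size_t live_blocks = 0;

/* Same signature as malloc(); counts successful allocations. */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Same signature as free(); counts releases of non-NULL blocks. */
void counting_free(void *p)
{
    if (p != NULL)
        live_blocks--;
    free(p);
}

size_t live_block_count(void)
{
    return live_blocks;
}

/* In a program that includes <pcre.h>, the hooks would be installed
 * before calling any PCRE function:
 *
 *     pcre_malloc = counting_malloc;
 *     pcre_free   = counting_free;
 */
```

A non-zero count at program exit would suggest that some compiled pattern or
study block was never passed to pcre_free. Because the counter is a plain
global, this sketch is not safe for use from several threads at once.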
949    
950           The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
951           indirections  to  memory  management functions. These special functions
952           are used only when PCRE is compiled to use  the  heap  for  remembering
953           data, instead of recursive function calls, when running the pcre_exec()
954           function. See the pcrebuild documentation for  details  of  how  to  do
955           this.  It  is  a non-standard way of building PCRE, for use in environ-
956           ments that have limited stacks. Because of the greater  use  of  memory
957           management,  it  runs  more  slowly. Separate functions are provided so
958           that special-purpose external code can be  used  for  this  case.  When
959           used,  these  functions  are always called in a stack-like manner (last
960           obtained, first freed), and always for memory blocks of the same  size.
961           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
962           mentation.
963    
964           The global variable pcre_callout initially contains NULL. It can be set
965           by  the  caller  to  a "callout" function, which PCRE will then call at
966           specified points during a matching operation. Details are given in  the
967           pcrecallout documentation.
968    
969    
970    NEWLINES
971    
972           PCRE  supports five different conventions for indicating line breaks in
973           strings: a single CR (carriage return) character, a  single  LF  (line-
974           feed) character, the two-character sequence CRLF, any of the three pre-
975           ceding, or any Unicode newline sequence. The Unicode newline  sequences
976           are  the  three just mentioned, plus the single characters VT (vertical
977           tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
978           separator, U+2028), and PS (paragraph separator, U+2029).
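The single characters in that list can be written out as a small C predicate.
This is an illustrative sketch, not part of the PCRE API: the function name
is_unicode_newline_char is hypothetical, and the two-character sequence CRLF
has to be recognized separately by the caller.

```c
/* Returns 1 if the code point is one of the single characters that count
 * as a line break under the "any Unicode newline sequence" convention. */
int is_unicode_newline_char(unsigned long c)
{
    switch (c) {
    case 0x000A:    /* LF,  linefeed            */
    case 0x000B:    /* VT,  vertical tab        */
    case 0x000C:    /* FF,  formfeed            */
    case 0x000D:    /* CR,  carriage return     */
    case 0x0085:    /* NEL, next line           */
    case 0x2028:    /* LS,  line separator      */
    case 0x2029:    /* PS,  paragraph separator */
        return 1;
    default:
        return 0;
    }
}
```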
979    
980           Each  of  the first three conventions is used by at least one operating
981           system as its standard newline sequence. When PCRE is built, a  default
982           can  be  specified. If none is given, the built-in default is LF, which
983           is the Unix standard. When PCRE is run, the default can be  overridden,  either  when  a
984           pattern is compiled, or when it is matched.
985    
986           At compile time, the newline convention can be specified by the options
987           argument of pcre_compile(), or it can be specified by special  text  at
988           the start of the pattern itself; this overrides any other settings. See
989           the pcrepattern page for details of the special character sequences.
990    
991           In the PCRE documentation the word "newline" is used to mean "the char-
992           acter  or pair of characters that indicate a line break". The choice of
993           newline convention affects the handling of  the  dot,  circumflex,  and
994           dollar metacharacters, the handling of #-comments in /x mode, and, when
995           CRLF is a recognized line ending sequence, the match position  advance-
996           ment for a non-anchored pattern. There is more detail about this in the
997           section on pcre_exec() options below.
998    
999           The choice of newline convention does not affect the interpretation  of
1000           the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
1001           which is controlled in a similar way, but by separate options.
1002    
1003    
1004    MULTITHREADING
1005    
1006           The PCRE functions can be used in  multi-threaded  applications,  with
1007           the  proviso  that  the  memory  management  functions  pointed  to  by
1008           pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1009           callout function pointed to by pcre_callout, are shared by all threads.
1010    
1011           The  compiled form of a regular expression is not altered during match-
1012           ing, so the same compiled pattern can safely be used by several threads
1013           at once.
1014    
1015    
1016    SAVING PRECOMPILED PATTERNS FOR LATER USE
1017    
1018           The compiled form of a regular expression can be saved and re-used at a
1019           later time, possibly by a different program, and even on a  host  other
1020           than  the  one  on  which  it  was  compiled.  Details are given in the
1021           pcreprecompile documentation. However, compiling a  regular  expression
1022           with  one version of PCRE for use with a different version is not guar-
1023           anteed to work and may cause crashes.
1024    
1025    
1026    CHECKING BUILD-TIME OPTIONS
1027    
1028           int pcre_config(int what, void *where);
1029    
1030           The function pcre_config() makes it possible for a PCRE client to  dis-
1031           cover which optional features have been compiled into the PCRE library.
1032           The pcrebuild documentation has more details about these optional  fea-
1033           tures.
1034    
1035           The  first  argument  for pcre_config() is an integer, specifying which
1036           information is required; the second argument is a pointer to a variable
1037           into  which  the  information  is  placed. The following information is
1038           available:
1039    
1040             PCRE_CONFIG_UTF8
1041    
1042           The output is an integer that is set to one if UTF-8 support is  avail-
1043           able; otherwise it is set to zero.
1044    
1045             PCRE_CONFIG_UNICODE_PROPERTIES
1046    
1047           The  output  is  an  integer  that is set to one if support for Unicode
1048           character properties is available; otherwise it is set to zero.
1049    
1050             PCRE_CONFIG_NEWLINE
1051    
1052           The output is an integer whose value specifies  the  default  character
1053           sequence  that is recognized as meaning "newline". The five values that
1054           are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1055           and  -1  for  ANY.  Though they are derived from ASCII, the same values
1056           are returned in EBCDIC environments. The default should normally corre-
1057           spond to the standard sequence for your operating system.
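The value can be decoded with a small helper such as the one sketched here.
The function name newline_convention_name is hypothetical. The value would be
obtained from pcre_config(PCRE_CONFIG_NEWLINE, &value), a call that is shown
only in the comment so that the block compiles without pcre.h. Note that 3338
is the two character codes combined: (13 << 8) | 10.

```c
/* Maps a value returned for PCRE_CONFIG_NEWLINE to a readable name.
 *
 * In a program that includes <pcre.h>, the value would come from:
 *
 *     int value;
 *     pcre_config(PCRE_CONFIG_NEWLINE, &value);
 */
const char *newline_convention_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* (13 << 8) | 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";   /* not one of the documented values */
    }
}
```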
1058    
1059             PCRE_CONFIG_BSR
1060    
1061           The output is an integer whose value indicates what character sequences
1062           the \R escape sequence matches by default. A value of 0 means  that  \R
1063           matches  any  Unicode  line ending sequence; a value of 1 means that \R
1064           matches only CR, LF, or CRLF. The default can be overridden when a pat-
1065           tern is compiled or matched.
1066    
1067             PCRE_CONFIG_LINK_SIZE
1068    
1069           The  output  is  an  integer that contains the number of bytes used for
1070           internal linkage in compiled regular expressions. The value is 2, 3, or
1071           4.  Larger  values  allow larger regular expressions to be compiled, at
1072           the expense of slower matching. The default value of  2  is  sufficient
1073           for  all  but  the  most massive patterns, since it allows the compiled
1074           pattern to be up to 64K in size.
1075    
1076             PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1077    
1078           The output is an integer that contains the threshold  above  which  the
1079           POSIX  interface  uses malloc() for output vectors. Further details are
1080           given in the pcreposix documentation.
1081    
1082             PCRE_CONFIG_MATCH_LIMIT
1083    
1084           The output is a long integer that gives the default limit for the  num-
1085           ber  of  internal  matching  function calls in a pcre_exec() execution.
1086           Further details are given with pcre_exec() below.
1087    
1088             PCRE_CONFIG_MATCH_LIMIT_RECURSION
1089    
1090           The output is a long integer that gives the default limit for the depth
1091           of   recursion  when  calling  the  internal  matching  function  in  a
1092           pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1093           below.
1094    
1095             PCRE_CONFIG_STACKRECURSE
1096    
1097           The  output is an integer that is set to one if internal recursion when
1098           running pcre_exec() is implemented by recursive function calls that use
1099           the  stack  to remember their state. This is the usual way that PCRE is
1100           compiled. The output is zero if PCRE was compiled to use blocks of data
1101           on  the  heap  instead  of  recursive  function  calls.  In  this case,
1102           pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
1103           blocks on the heap, thus avoiding the use of the stack.
1104    
1105    
1106    COMPILING A PATTERN
1107    
1108           pcre *pcre_compile(const char *pattern, int options,
1109                const char **errptr, int *erroffset,
1110                const unsigned char *tableptr);
1111    
1112           pcre *pcre_compile2(const char *pattern, int options,
1113                int *errorcodeptr,
1114                const char **errptr, int *erroffset,
1115                const unsigned char *tableptr);
1116    
1117           Either of the functions pcre_compile() or pcre_compile2() can be called
1118           to compile a pattern into an internal form. The only difference between
1119           the  two interfaces is that pcre_compile2() has an additional argument,
1120           errorcodeptr, via which a numerical error code can be returned.
1121    
1122           The pattern is a C string terminated by a binary zero, and is passed in
1123           the  pattern  argument.  A  pointer to a single block of memory that is
1124           obtained via pcre_malloc is returned. This contains the  compiled  code
1125           and related data. The pcre type is defined for the returned block; this
1126           is a typedef for a structure whose contents are not externally defined.
1127           It is up to the caller to free the memory (via pcre_free) when it is no
1128           longer required.
1129    
1130           Although the compiled code of a PCRE regex is relocatable, that is,  it
1131           does not depend on memory location, the complete pcre data block is not
1132           fully relocatable, because it may contain a copy of the tableptr  argu-
1133           ment, which is an address (see below).
1134    
1135           The options argument contains various bit settings that affect the com-
1136           pilation. It should be zero if no options are required.  The  available
1137           options  are  described  below. Some of them (in particular, those that
1138           are compatible with Perl, but also some others) can  also  be  set  and
1139           unset  from  within  the  pattern  (see the detailed description in the
1140           pcrepattern documentation). For those options that can be different  in
1141           different  parts  of  the pattern, the contents of the options argument
1142           specify  their initial settings at the start of compilation and execu-
1143           tion.  The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
1144           time of matching as well as at compile time.
1145    
1146           If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1147           if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1148           sets the variable pointed to by errptr to point to a textual error mes-
1149           sage. This is a static string that is part of the library. You must not
1150           try to free it. The offset from the start of the pattern to the charac-
1151           ter where the error was discovered is placed in the variable pointed to
1152           by erroffset, which must not be NULL. If it is, an immediate  error  is
1153           given.
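One way to present the message and offset to a user is sketched below. The
helper format_compile_error is hypothetical and does not call PCRE itself: it
takes the pattern together with the values that pcre_compile() is described
as placing in the variables pointed to by errptr and erroffset, and builds a
report with a caret under the offending position.

```c
#include <stdio.h>
#include <string.h>

/* Builds a three-line report: the pattern, a caret under the error
 * offset, and the message. 'error' and 'erroffset' are the values set
 * through the errptr and erroffset arguments of a failed compile.
 * Returns the value returned by snprintf(). */
int format_compile_error(char *out, size_t outsize, const char *pattern,
                         const char *error, int erroffset)
{
    int len = (int)strlen(pattern);

    /* Clamp defensively so the caret always lands on or just after
     * the pattern text. */
    if (erroffset < 0)
        erroffset = 0;
    if (erroffset > len)
        erroffset = len;

    /* "%*s" with an empty string emits 'erroffset' spaces. */
    return snprintf(out, outsize,
                    "%s\n%*s^\ncompile error at offset %d: %s\n",
                    pattern, erroffset, "", erroffset, error);
}
```

The caret position assumes one display column per byte of the pattern, which
does not hold for multi-byte UTF-8 characters or tabs.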
1154    
1155           If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1156           codeptr argument is not NULL, a non-zero error code number is  returned
1157           via  this argument in the event of an error. This is in addition to the
1158           textual error message. Error codes and messages are listed below.
1159    
1160           If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1161           character  tables  that  are  built  when  PCRE  is compiled, using the
1162           default C locale. Otherwise, tableptr must be an address  that  is  the
1163           result  of  a  call to pcre_maketables(). This value is stored with the
1164           compiled pattern, and used again by pcre_exec(), unless  another  table
1165           pointer is passed to it. For more discussion, see the section on locale
1166           support below.
1167    
1168           This code fragment shows a typical straightforward  call  to  pcre_com-
1169           pile():
1170    
1171             pcre *re;
1172             const char *error;
1173             int erroffset;
1174             re = pcre_compile(
1175               "^A.*Z",          /* the pattern */
1176               0,                /* default options */
1177               &error,           /* for error message */
1178               &erroffset,       /* for error offset */
1179               NULL);            /* use default character tables */
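
               If the call fails, the message and the offset are usually  reported
               together. The following is a minimal sketch of one way to do that; it
               is not something PCRE provides. format_compile_error() is a hypotheti-
               cal helper that writes the pattern, then a caret under the offset,
               then the message. It assumes that the offset does not exceed the pat-
               tern length and that each byte occupies one column, which does not
               hold for multi-byte UTF-8 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Writes into out:
 *   line 1: the pattern
 *   line 2: a caret under byte offset erroffset, then the message
 * Returns what snprintf() returns (the length that was needed). */
int format_compile_error(char *out, size_t outsize, const char *pattern,
                         int erroffset, const char *message)
{
    assert(out != NULL && pattern != NULL && message != NULL);
    /* Assumption: the offset lies within the pattern (or just past it). */
    assert(erroffset >= 0 && (size_t)erroffset <= strlen(pattern));

    /* "%*s" with an empty string emits erroffset spaces of padding. */
    return snprintf(out, outsize, "%s\n%*s^ %s",
                    pattern, erroffset, "", message);
}
```

               A caller would pass the error and erroffset variables  set  by
               pcre_compile(), after checking that the returned pointer is NULL.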
1180    
1181           The  following  names  for option bits are defined in the pcre.h header
1182           file:
1183    
1184             PCRE_ANCHORED
1185    
1186           If this bit is set, the pattern is forced to be "anchored", that is, it
1187           is  constrained to match only at the first matching point in the string
1188           that is being searched (the "subject string"). This effect can also  be
1189           achieved  by appropriate constructs in the pattern itself, which is the
1190           only way to do it in Perl.
1191    
1192             PCRE_AUTO_CALLOUT
1193    
1194           If this bit is set, pcre_compile() automatically inserts callout items,
1195           all  with  number  255, before each pattern item. For discussion of the
1196           callout facility, see the pcrecallout documentation.
1197    
1198             PCRE_BSR_ANYCRLF
1199             PCRE_BSR_UNICODE
1200    
1201           These options (which are mutually exclusive) control what the \R escape
1202           sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1203           or to match any Unicode newline sequence. The default is specified when
1204           PCRE is built. It can be overridden from within the pattern, or by set-
1205           ting an option when a compiled pattern is matched.
1206    
1207             PCRE_CASELESS
1208    
1209           If this bit is set, letters in the pattern match both upper  and  lower
1210           case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1211           changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1212           always  understands the concept of case for characters whose values are
1213           less than 128, so caseless matching is always possible. For  characters
1214           with  higher  values,  the concept of case is supported if PCRE is com-
1215           piled with Unicode property support, but not otherwise. If you want  to
1216           use  caseless  matching  for  characters 128 and above, you must ensure
1217           that PCRE is compiled with Unicode property support  as  well  as  with
1218           UTF-8 support.
1219    
1220             PCRE_DOLLAR_ENDONLY
1221    
1222           If  this bit is set, a dollar metacharacter in the pattern matches only
1223           at the end of the subject string. Without this option,  a  dollar  also
1224           matches  immediately before a newline at the end of the string (but not
1225           before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1226           if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1227           Perl, and no way to set it within a pattern.
1228    
1229             PCRE_DOTALL
1230    
1231           If this bit is set, a dot metacharacter in the pattern matches all char-
1232           acters,  including  those that indicate newline. Without it, a dot does
1233           not match when the current position is at a  newline.  This  option  is
1234           equivalent  to Perl's /s option, and it can be changed within a pattern
1235           by a (?s) option setting. A negative class such as [^a] always  matches
1236           newline characters, independent of the setting of this option.
1237    
1238             PCRE_DUPNAMES
1239    
1240           If  this  bit is set, names used to identify capturing subpatterns need
1241           not be unique. This can be helpful for certain types of pattern when it
1242           is  known  that  only  one instance of the named subpattern can ever be
1243           matched. There are more details of named subpatterns  below;  see  also
1244           the pcrepattern documentation.
1245    
1246             PCRE_EXTENDED
1247    
1248           If  this  bit  is  set,  whitespace  data characters in the pattern are
1249           totally ignored except when escaped or inside a character class. White-
1250           space does not include the VT character (code 11). In addition, charac-
1251           ters between an unescaped # outside a character class and the next new-
1252           line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1253           option, and it can be changed within a pattern by a  (?x)  option  set-
1254           ting.
1255    
1256           This  option  makes  it possible to include comments inside complicated
1257           patterns.  Note, however, that this applies only  to  data  characters.
1258           Whitespace   characters  may  never  appear  within  special  character
1259           sequences in a pattern, for  example  within  the  sequence  (?(  which
1260           introduces a conditional subpattern.
1261    
1262             PCRE_EXTRA
1263    
1264           This  option  was invented in order to turn on additional functionality
1265           of PCRE that is incompatible with Perl, but it  is  currently  of  very
1266           little  use. When set, any backslash in a pattern that is followed by a
1267           letter that has no special meaning  causes  an  error,  thus  reserving
1268           these  combinations  for  future  expansion.  By default, as in Perl, a
1269           backslash followed by a letter with no special meaning is treated as  a
1270           literal.  (Perl can, however, be persuaded to give a warning for this.)
1271           There are at present no other features controlled by  this  option.  It
1272           can also be set by a (?X) option setting within a pattern.
1273    
1274             PCRE_FIRSTLINE
1275    
1276           If  this  option  is  set,  an  unanchored pattern is required to match
1277           before or at the first  newline  in  the  subject  string,  though  the
1278           matched text may continue over the newline.
1279    
1280             PCRE_JAVASCRIPT_COMPAT
1281    
1282           If this option is set, PCRE's behaviour is changed in some ways so that
1283           it is compatible with JavaScript rather than Perl. The changes  are  as
1284           follows:
1285    
1286           (1)  A  lone  closing square bracket in a pattern causes a compile-time
1287           error, because this is illegal in JavaScript (by default it is  treated
1288           as a data character). Thus, the pattern AB]CD becomes illegal when this
1289           option is set.
1290    
1291           (2) At run time, a back reference to an unset subpattern group  matches
1292           an  empty  string (by default this causes the current matching alterna-
1293           tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1294           set  (assuming  it can find an "a" in the subject), whereas it fails by
1295           default, for Perl compatibility.
1296    
1297             PCRE_MULTILINE
1298    
1299           By default, PCRE treats the subject string as consisting  of  a  single
1300           line  of characters (even if it actually contains newlines). The "start
1301           of line" metacharacter (^) matches only at the  start  of  the  string,
1302           while  the  "end  of line" metacharacter ($) matches only at the end of
1303           the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1304           is set). This is the same as Perl.
1305    
1306           When PCRE_MULTILINE is set, the "start  of  line"  and  "end  of  line"
1307           constructs match immediately following or immediately  before  internal
1308           newlines  in  the  subject string, respectively, as well as at the very
1309           start and end. This is equivalent to Perl's /m option, and  it  can  be
1310           changed within a pattern by a (?m) option setting. If there are no new-
1311           lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1312           setting PCRE_MULTILINE has no effect.
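
               The rules for the "end of line" metacharacter  given  under
               PCRE_DOLLAR_ENDONLY and PCRE_MULTILINE can be summarised in a few
               lines. The sketch below illustrates the documented behaviour; it is
               not PCRE's implementation. dollar_can_match() is a hypothetical func-
               tion, the two flags are plain ints standing in for the option bits,
               and the newline convention is assumed to be a single LF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not PCRE code. Returns 1 if "$" is allowed
 * to match at byte offset pos of subject (length len), else 0.
 * Assumption: newline is a single LF character. */
int dollar_can_match(const char *subject, size_t len, size_t pos,
                     int multiline, int dollar_endonly)
{
    assert(subject != NULL && pos <= len);

    if (pos == len)
        return 1;               /* the very end always qualifies */
    if (subject[pos] != '\n')
        return 0;               /* otherwise "$" needs a newline next */
    if (multiline)
        return 1;               /* before any newline; ENDONLY is ignored */

    /* Single-line mode: only before a newline that ends the subject,
     * and only when PCRE_DOLLAR_ENDONLY is not set. */
    return !dollar_endonly && pos + 1 == len;
}
```

               For the subject "a\nb\n", offset 1 qualifies only in multiline mode,
               while offset 3 qualifies unless PCRE_DOLLAR_ENDONLY is set without
               PCRE_MULTILINE.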
1313    
1314             PCRE_NEWLINE_CR
1315             PCRE_NEWLINE_LF
1316             PCRE_NEWLINE_CRLF
1317             PCRE_NEWLINE_ANYCRLF
1318             PCRE_NEWLINE_ANY
1319    
1320           These  options  override the default newline definition that was chosen
1321           when PCRE was built. Setting the first or the second specifies  that  a
1322           newline  is  indicated  by a single character (CR or LF, respectively).
1323           Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1324           two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1325           that any of the three preceding sequences should be recognized. Setting
1326           PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1327           recognized. The Unicode newline sequences are the three just mentioned,
1328           plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1329           U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1330           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1331           UTF-8 mode.
1332    
1333           The newline setting in the  options  word  uses  three  bits  that  are
1334           treated as a number, giving eight possibilities. Currently only six are
1335           used (default plus the five values above). This means that if  you  set
1336           more  than one newline option, the combination may or may not be sensi-
1337           ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1338           PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1339           cause an error.
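
               The way the three bits combine can be checked with a few lines of C.
               This is a sketch under stated assumptions: the SKETCH_ constants are
               local stand-ins whose values are assumed to mirror the PCRE_NEWLINE_
               definitions in the pcre.h of the 7.x series, and newline_field_ok()
               is a hypothetical function. Real code should include pcre.h and use
               its names.

```c
#include <assert.h>

/* Assumed to mirror pcre.h (7.x); verify against your own header. */
#define SKETCH_NEWLINE_CR      0x00100000UL
#define SKETCH_NEWLINE_LF      0x00200000UL
#define SKETCH_NEWLINE_CRLF    0x00300000UL
#define SKETCH_NEWLINE_ANY     0x00400000UL
#define SKETCH_NEWLINE_ANYCRLF 0x00500000UL
/* Hypothetical name for the whole three-bit field. */
#define SKETCH_NEWLINE_MASK    0x00700000UL

/* Returns 1 if the newline field of options is the default (zero) or
 * one of the five defined settings, and 0 if it is an unused number. */
int newline_field_ok(unsigned long options)
{
    switch (options & SKETCH_NEWLINE_MASK) {
    case 0:
    case SKETCH_NEWLINE_CR:
    case SKETCH_NEWLINE_LF:
    case SKETCH_NEWLINE_CRLF:
    case SKETCH_NEWLINE_ANY:
    case SKETCH_NEWLINE_ANYCRLF:
        return 1;
    default:
        return 0;
    }
}
```

               With these values, CR combined with LF gives exactly the CRLF  set-
               ting, whereas CRLF combined with ANY gives one of the two unused num-
               bers.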
1340    
1341           The only time that a line break is specially recognized when  compiling
1342           a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
1343           character class is encountered. This indicates  a  comment  that  lasts
1344           until  after the next line break sequence. In other circumstances, line
1345           break  sequences  are  treated  as  literal  data,   except   that   in
1346           PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1347           and are therefore ignored.
1348    
1349           The newline option that is set at compile time becomes the default that
1350           is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1351    
1352             PCRE_NO_AUTO_CAPTURE
1353    
1354           If this option is set, it disables the use of numbered capturing paren-
1355           theses in the pattern. Any opening parenthesis that is not followed  by
1356           ?  behaves as if it were followed by ?: but named parentheses can still
1357           be used for capturing (and they acquire  numbers  in  the  usual  way).
1358           There is no equivalent of this option in Perl.
1359    
1360             PCRE_UNGREEDY
1361    
1362           This  option  inverts  the "greediness" of the quantifiers so that they
1363           are not greedy by default, but become greedy if followed by "?". It  is
1364           not  compatible  with Perl. It can also be set by a (?U) option setting
1365           within the pattern.
1366    
1367             PCRE_UTF8
1368    
1369           This option causes PCRE to regard both the pattern and the  subject  as
1370           strings  of  UTF-8 characters instead of single-byte character strings.
1371           However, it is available only when PCRE is built to include UTF-8  sup-
1372           port.  If not, the use of this option provokes an error. Details of how
1373           this option changes the behaviour of PCRE are given in the  section  on
1374           UTF-8 support in the main pcre page.
1375    
1376             PCRE_NO_UTF8_CHECK
1377    
1378           When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1379           automatically checked. There is a  discussion  about  the  validity  of
1380           UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1381           bytes is found, pcre_compile() returns an error. If  you  already  know
1382           that your pattern is valid, and you want to skip this check for perfor-
1383           mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1384           set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1385           undefined. It may cause your program to crash. Note  that  this  option
1386           can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1387           UTF-8 validity checking of subject strings.
1388    
1389    
1390    COMPILATION ERROR CODES
1391    
1392           The following table lists the error  codes  that  may  be  returned  by
1393           pcre_compile2(),  along with the error messages that may be returned by
1394           both compiling functions. As PCRE has developed, some error codes  have
1395           fallen out of use. To avoid confusion, they have not been re-used.
1396    
1397              0  no error
1398              1  \ at end of pattern
1399              2  \c at end of pattern
1400              3  unrecognized character follows \
1401              4  numbers out of order in {} quantifier
1402              5  number too big in {} quantifier
1403              6  missing terminating ] for character class
1404              7  invalid escape sequence in character class
1405              8  range out of order in character class
1406              9  nothing to repeat
1407             10  [this code is not in use]
1408             11  internal error: unexpected repeat
1409             12  unrecognized character after (? or (?-
1410             13  POSIX named classes are supported only within a class
1411             14  missing )
1412             15  reference to non-existent subpattern
1413             16  erroffset passed as NULL
1414             17  unknown option bit(s) set
1415             18  missing ) after comment
1416             19  [this code is not in use]
1417             20  regular expression is too large
1418             21  failed to get memory
1419             22  unmatched parentheses
1420             23  internal error: code overflow
1421             24  unrecognized character after (?<
1422             25  lookbehind assertion is not fixed length
1423             26  malformed number or name after (?(
1424             27  conditional group contains more than two branches
1425             28  assertion expected after (?(
1426             29  (?R or (?[+-]digits must be followed by )
1427             30  unknown POSIX class name
1428             31  POSIX collating elements are not supported
1429             32  this version of PCRE is not compiled with PCRE_UTF8 support
1430             33  [this code is not in use]
1431             34  character value in \x{...} sequence is too large
1432             35  invalid condition (?(0)
1433             36  \C not allowed in lookbehind assertion
1434             37  PCRE does not support \L, \l, \N, \U, or \u
1435             38  number after (?C is > 255
1436             39  closing ) for (?C expected
1437             40  recursive call could loop indefinitely
1438             41  unrecognized character after (?P
1439             42  syntax error in subpattern name (missing terminator)
1440             43  two named subpatterns have the same name
1441             44  invalid UTF-8 string
1442             45  support for \P, \p, and \X has not been compiled
1443             46  malformed \P or \p sequence
1444             47  unknown property name after \P or \p
1445             48  subpattern name is too long (maximum 32 characters)
1446             49  too many named subpatterns (maximum 10000)
1447             50  [this code is not in use]
1448             51  octal value is greater than \377 (not in UTF-8 mode)
1449             52  internal error: overran compiling workspace
1450             53  internal error: previously-checked referenced subpattern
1451                   not found
1452             54  DEFINE group contains more than one branch
1453             55  repeating a DEFINE group is not allowed
1454             56  inconsistent NEWLINE options
1455             57  \g is not followed by a braced, angle-bracketed, or quoted
1456                   name/number or by a plain number
1457             58  a numbered reference must not be zero
1458             59  (*VERB) with an argument is not supported
1459             60  (*VERB) not recognized
1460             61  number is too big
1461             62  subpattern name expected
1462             63  digit expected after (?+
1463             64  ] is an invalid data character in JavaScript compatibility mode
1464    
1465           The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1466           values may be used if the limits were changed when PCRE was built.
1467    
1468    
1469    STUDYING A PATTERN
1470    
1471           pcre_extra *pcre_study(const pcre *code, int options,
1472                const char **errptr);
1473    
1474           If  a  compiled  pattern is going to be used several times, it is worth
1475           spending more time analyzing it in order to speed up the time taken for
1476           matching.  The function pcre_study() takes a pointer to a compiled pat-
1477           tern as its first argument. If studying the pattern produces additional
1478           information  that  will  help speed up matching, pcre_study() returns a
1479           pointer to a pcre_extra block, in which the study_data field points  to
1480           the results of the study.
1481    
1482           The  returned  value  from  pcre_study()  can  be  passed  directly  to
1483           pcre_exec(). However, a pcre_extra block  also  contains  other  fields
1484           that  can  be  set  by the caller before the block is passed; these are
1485           described below in the section on matching a pattern.
1486    
1487           If studying the pattern does not produce any additional information,
1488           pcre_study() returns NULL. In that circumstance, if the calling program
1489           wants to pass any of the other fields to pcre_exec(), it  must  set  up
1490           its own pcre_extra block.
1491    
1492           The  second  argument of pcre_study() contains option bits. At present,
1493           no options are defined, and this argument should always be zero.
1494    
1495           The third argument for pcre_study() is a pointer for an error  message.
1496           If  studying  succeeds  (even  if no data is returned), the variable it
1497           points to is set to NULL. Otherwise it is set to  point  to  a  textual
1498           error message. This is a static string that is part of the library. You
1499           must not try to free it. You should test the  error  pointer  for  NULL
1500           after calling pcre_study(), to be sure that it has run successfully.
1501    
1502           This is a typical call to pcre_study():
1503    
1504             pcre_extra *pe;
1505             pe = pcre_study(
1506               re,             /* result of pcre_compile() */
1507               0,              /* no options exist */
1508               &error);        /* set to NULL or points to a message */
1509    
1510           At present, studying a pattern is useful only for non-anchored patterns
1511           that do not have a single fixed starting character. A bitmap of  possi-
1512           ble starting bytes is created.
1513    
1514    
1515    LOCALE SUPPORT
1516    
1517           PCRE  handles  caseless matching, and determines whether characters are
1518           letters, digits, or whatever, by reference to a set of tables,  indexed
1519           by  character  value.  When running in UTF-8 mode, this applies only to
1520           characters with codes less than 128. Higher-valued  codes  never  match
1521           escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
1522           with Unicode character property support. The use of locales  with  Uni-
1523           code  is discouraged. If you are handling characters with codes greater
1524           than 128, you should either use UTF-8 and Unicode, or use locales,  but
1525           not try to mix the two.
1526    
1527           PCRE  contains  an  internal set of tables that are used when the final
1528           argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1529           applications.  Normally, the internal tables recognize only ASCII char-
1530           acters. However, when PCRE is built, it is possible to cause the inter-
1531           nal tables to be rebuilt in the default "C" locale of the local system,
1532           which may cause them to be different.
1533    
1534           The internal tables can always be overridden by tables supplied by  the
1535           application that calls PCRE. These may be created in a different locale
1536           from the default. As more and more applications change  to  using  Uni-
1537           code, the need for this locale support is expected to die away.
1538    
1539           External  tables  are  built by calling the pcre_maketables() function,
1540           which has no arguments, in the relevant locale. The result can then  be
1541           passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1542           example, to build and use tables that are appropriate  for  the  French
1543           locale  (where  accented  characters  with  values greater than 128 are
1544           treated as letters), the following code could be used:
1545    
1546             setlocale(LC_CTYPE, "fr_FR");
1547             tables = pcre_maketables();
1548             re = pcre_compile(..., tables);
1549    
1550           The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1551           if you are using Windows, the name for the French locale is "french".
1552    
1553           When  pcre_maketables()  runs,  the  tables are built in memory that is
1554           obtained via pcre_malloc. It is the caller's responsibility  to  ensure
1555           that  the memory containing the tables remains available for as long as
1556           it is needed.
1557    
1558           The pointer that is passed to pcre_compile() is saved with the compiled
1559           pattern,  and the same tables are used via this pointer by pcre_study()
1560           and normally also by pcre_exec(). Thus, by default, for any single pat-
1561           tern, compilation, studying and matching all happen in the same locale,
1562           but different patterns can be compiled in different locales.
1563    
1564           It is possible to pass a table pointer or NULL (indicating the  use  of
1565           the  internal  tables)  to  pcre_exec(). Although not intended for this
1566           purpose, this facility could be used to match a pattern in a  different
1567           locale from the one in which it was compiled. Passing table pointers at
1568           run time is discussed below in the section on matching a pattern.
1569    
1570    
1571    INFORMATION ABOUT A PATTERN
1572    
1573           int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1574                int what, void *where);
1575    
1576           The pcre_fullinfo() function returns information about a compiled  pat-
1577           tern. It replaces the obsolete pcre_info() function, which is neverthe-
1578           less retained for backwards compatibility (and is documented below).
1579    
1580           The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
1581           pattern.  The second argument is the result of pcre_study(), or NULL if
1582           the pattern was not studied. The third argument specifies  which  piece
1583           of  information  is required, and the fourth argument is a pointer to a
1584           variable to receive the data. The yield of the  function  is  zero  for
1585           success, or one of the following negative numbers:
1586    
1587             PCRE_ERROR_NULL       the argument code was NULL
1588                                   the argument where was NULL
1589             PCRE_ERROR_BADMAGIC   the "magic number" was not found
1590             PCRE_ERROR_BADOPTION  the value of what was invalid
1591    
1592           The  "magic  number" is placed at the start of each compiled pattern as
1593           a simple check against passing an arbitrary memory pointer.  Here is  a
1594           typical  call  of pcre_fullinfo(), to obtain the length of the compiled
1595           pattern:
1596    
1597             int rc;
1598             size_t length;
1599             rc = pcre_fullinfo(
1600               re,               /* result of pcre_compile() */
1601               pe,               /* result of pcre_study(), or NULL */
1602               PCRE_INFO_SIZE,   /* what is required */
1603               &length);         /* where to put the data */
1604    
1605           The possible values for the third argument are defined in  pcre.h,  and
1606           are as follows:
1607    
1608             PCRE_INFO_BACKREFMAX
1609    
1610           Return  the  number  of  the highest back reference in the pattern. The
1611           fourth argument should point to an int variable. Zero  is  returned  if
1612           there are no back references.
1613    
1614             PCRE_INFO_CAPTURECOUNT
1615    
1616           Return  the  number of capturing subpatterns in the pattern. The fourth
1617           argument should point to an int variable.
1618    
1619             PCRE_INFO_DEFAULT_TABLES
1620    
1621           Return a pointer to the internal default character tables within  PCRE.
1622           The  fourth  argument should point to an unsigned char * variable. This
1623           information call is provided for internal use by the pcre_study() func-
1624           tion.  External  callers  can  cause PCRE to use its internal tables by
1625           passing a NULL table pointer.
1626    
1627             PCRE_INFO_FIRSTBYTE
1628    
1629           Return information about the first byte of any matched  string,  for  a
1630           non-anchored  pattern. The fourth argument should point to an int vari-
1631           able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
1632           is still recognized for backwards compatibility.)
1633    
1634           If  there  is  a  fixed first byte, for example, from a pattern such as
1635           (cat|cow|coyote), its value is returned. Otherwise, if either
1636    
1637           (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1638           branch starts with "^", or
1639    
1640           (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1641           set (if it were set, the pattern would be anchored),
1642    
1643           -1 is returned, indicating that the pattern matches only at  the  start
1644           of  a  subject string or after any newline within the string. Otherwise
1645           -2 is returned. For anchored patterns, -2 is returned.
1646    
1647             PCRE_INFO_FIRSTTABLE
1648    
1649           If the pattern was studied, and this resulted in the construction of  a
1650           256-bit table indicating a fixed set of bytes for the first byte in any
1651           matching string, a pointer to the table is returned. Otherwise NULL  is
1652           returned.  The fourth argument should point to an unsigned char * vari-
1653           able.
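
               A returned table can be consulted with a simple bit test. The sketch
               below is illustrative only: byte_can_start_match() is a hypothetical
               function, and it assumes that byte value c is represented by bit
               (c & 7) of table byte (c / 8). The text above does not state the bit
               order, so that layout should be confirmed before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. table is the pointer obtained
 * with PCRE_INFO_FIRSTTABLE (32 bytes = 256 bits), or NULL.
 * Assumption: bit (c & 7) of table[c / 8] stands for byte value c.
 * Returns 1 if a match could begin with byte c, else 0. */
int byte_can_start_match(const unsigned char *table, unsigned char c)
{
    if (table == NULL)
        return 1;               /* no table: nothing can be ruled out */
    return (table[c / 8] & (1u << (c & 7))) != 0;
}
```

               A NULL table is treated as "any byte may start a match", because in
               that case studying produced no restriction.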
1654    
1655             PCRE_INFO_HASCRORLF
1656    
1657           Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
1658           characters,  otherwise  0.  The  fourth argument should point to an int
1659           variable. An explicit match is either a literal CR or LF character,  or
1660           \r or \n.
1661    
1662             PCRE_INFO_JCHANGED
1663    
1664           Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
1665           otherwise 0. The fourth argument should point to an int variable.  (?J)
1666           and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1667    
1668             PCRE_INFO_LASTLITERAL
1669    
1670           Return  the  value of the rightmost literal byte that must exist in any
1671           matched string, other than at its  start,  if  such  a  byte  has  been
1672           recorded. The fourth argument should point to an int variable. If there
1673           is no such byte, -1 is returned. For anchored patterns, a last  literal
1674           byte  is  recorded only if it follows something of variable length. For
1675           example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1676           /^a\dz\d/ the returned value is -1.
1677    
1678             PCRE_INFO_NAMECOUNT
1679             PCRE_INFO_NAMEENTRYSIZE
1680             PCRE_INFO_NAMETABLE
1681    
1682           PCRE  supports the use of named as well as numbered capturing parenthe-
1683           ses. The names are just an additional way of identifying the  parenthe-
1684           ses, which still acquire numbers. Several convenience functions such as
1685           pcre_get_named_substring() are provided for  extracting  captured  sub-
1686           strings  by  name. It is also possible to extract the data directly, by
1687           first converting the name to a number in order to  access  the  correct
1688           pointers in the output vector (described with pcre_exec() below). To do
1689           the conversion, you need  to  use  the  name-to-number  map,  which  is
1690           described by these three values.
1691    
1692           The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1693           gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1694           of  each  entry;  both  of  these  return  an int value. The entry size
1695           depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1696           a  pointer  to  the  first  entry of the table (a pointer to char). The
1697           first two bytes of each entry are the number of the capturing parenthe-
1698           sis,  most  significant byte first. The rest of the entry is the corre-
1699           sponding name, zero terminated. The names are  in  alphabetical  order.
1700           When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1701           theses numbers. For example, consider  the  following  pattern  (assume
1702           PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1703           ignored):
1704    
1705             (?<date> (?<year>(\d\d)?\d\d) -
1706             (?<month>\d\d) - (?<day>\d\d) )
1707    
1708           There are four named subpatterns, so the table has  four  entries,  and
1709           each  entry  in the table is eight bytes long. The table is as follows,
1710           with non-printing bytes shown in hexadecimal, and undefined bytes shown
1711           as ??:
1712    
1713             00 01 d  a  t  e  00 ??
1714             00 05 d  a  y  00 ?? ??
1715             00 04 m  o  n  t  h  00
1716             00 02 y  e  a  r  00 ??
1717    
1718           When  writing  code  to  extract  data from named subpatterns using the
1719           name-to-number map, remember that the length of the entries  is  likely
1720           to be different for each compiled pattern.
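
               The lookup described above can be sketched as follows. This is a min-
               imal illustration, not PCRE's own code: name_to_number() is a hypo-
               thetical function, and its table, count and entry_size arguments are
               assumed to be the values obtained from pcre_fullinfo() with
               PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.
               A linear scan is used for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Scans the name-to-number map
 * and returns the parenthesis number for name, or -1 if it is absent.
 * With duplicate names, the first (lowest-numbered) entry is returned. */
int name_to_number(const char *table, int count, int entry_size,
                   const char *name)
{
    int i;

    assert(table != NULL && name != NULL);
    assert(count >= 0 && entry_size > 2);

    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;

        /* Bytes 0-1: group number, most significant byte first.
         * Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

               Applied to the four-entry table shown above, the name "month" yields
               4 and "day" yields 5. Because the entries are sorted, a binary search
               would also work.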
1721    
1722             PCRE_INFO_OKPARTIAL
1723    
1724           Return  1 if the pattern can be used for partial matching, otherwise 0.
1725           The fourth argument should point to an int  variable.  The  pcrepartial
1726           documentation  lists  the restrictions that apply to patterns when par-
1727           tial matching is used.
1728    
1729             PCRE_INFO_OPTIONS
1730    
1731           Return a copy of the options with which the pattern was  compiled.  The
1732           fourth  argument  should  point to an unsigned long int variable. These
1733           option bits are those specified in the call to pcre_compile(), modified
1734           by any top-level option settings at the start of the pattern itself. In
1735           other words, they are the options that will be in force  when  matching
1736           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1737           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1738           and PCRE_EXTENDED.
1739    
1740           A  pattern  is  automatically  anchored by PCRE if all of its top-level
1741           alternatives begin with one of the following:
1742    
1743             ^     unless PCRE_MULTILINE is set
1744             \A    always
1745             \G    always
1746             .*    if PCRE_DOTALL is set and there are no back
1747                     references to the subpattern in which .* appears
1748    
1749           For such patterns, the PCRE_ANCHORED bit is set in the options returned
1750           by pcre_fullinfo().
1751    
1752             PCRE_INFO_SIZE
1753    
1754           Return  the  size  of the compiled pattern, that is, the value that was
1755           passed as the argument to pcre_malloc() when PCRE was getting memory in
1756           which to place the compiled data. The fourth argument should point to a
1757           size_t variable.
1758    
1759             PCRE_INFO_STUDYSIZE
1760    
1761           Return the size of the data block pointed to by the study_data field in
1762           a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1763           pcre_malloc() when PCRE was getting memory into which to place the data
1764           created  by  pcre_study(). The fourth argument should point to a size_t
1765           variable.
1766    
1767    
1768    OBSOLETE INFO FUNCTION
1769    
1770           int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1771    
1772           The pcre_info() function is now obsolete because its interface  is  too
1773           restrictive  to return all the available data about a compiled pattern.
1774           New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1775           pcre_info()  is the number of capturing subpatterns, or one of the fol-
1776           lowing negative numbers:
1777    
1778             PCRE_ERROR_NULL       the argument code was NULL
1779             PCRE_ERROR_BADMAGIC   the "magic number" was not found
1780    
1781           If the optptr argument is not NULL, a copy of the  options  with  which
1782           the  pattern  was  compiled  is placed in the integer it points to (see
1783           PCRE_INFO_OPTIONS above).
1784    
1785           If the pattern is not anchored and the  firstcharptr  argument  is  not
1786           NULL,  it is used to pass back information about the first character of
1787           any matched string (see PCRE_INFO_FIRSTBYTE above).
1788    
1789    
1790    REFERENCE COUNTS
1791    
1792           int pcre_refcount(pcre *code, int adjust);
1793    
1794           The pcre_refcount() function is used to maintain a reference  count  in
1795           the data block that contains a compiled pattern. It is provided for the
1796           benefit of applications that  operate  in  an  object-oriented  manner,
1797           where different parts of the application may be using the same compiled
1798           pattern, but you want to free the block when they are all done.
1799    
1800           When a pattern is compiled, the reference count field is initialized to
1801           zero.   It is changed only by calling this function, whose action is to
1802           add the adjust value (which may be positive or  negative)  to  it.  The
1803           yield of the function is the new value. However, the value of the count
1804           is constrained to lie between 0 and 65535, inclusive. If the new  value
1805           is outside these limits, it is forced to the appropriate limit value.
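
               The clamping rule can be modelled as below. This is a sketch of the
               documented arithmetic only; adjust_refcount() is a hypothetical
               stand-in and is not taken from the PCRE source.

```c
#include <assert.h>

/* Model of the documented reference-count arithmetic: add adjust to the
 * current count and force the result into the range 0..65535.
 * Hypothetical stand-in for illustration; not PCRE source code. */
int adjust_refcount(int current, int adjust)
{
    long long value;
    assert(current >= 0 && current <= 65535);
    value = (long long)current + adjust;   /* widen to avoid int overflow */
    if (value < 0) value = 0;              /* forced to the lower limit */
    if (value > 65535) value = 65535;      /* forced to the upper limit */
    return (int)value;
}
```

               One consequence of the clamp is that an application cannot detect
               an unbalanced decrement from the returned value alone, because the
               count simply stays at zero.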
1806    
1807           Except  when it is zero, the reference count is not correctly preserved
1808           if a pattern is compiled on one host and then  transferred  to  a  host
1809           whose byte-order is different. (This seems a highly unlikely scenario.)
1810    
1811    
1812    MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1813    
1814           int pcre_exec(const pcre *code, const pcre_extra *extra,
1815                const char *subject, int length, int startoffset,
1816                int options, int *ovector, int ovecsize);
1817    
1818           The  function pcre_exec() is called to match a subject string against a
1819           compiled pattern, which is passed in the code argument. If the  pattern
1820           has been studied, the result of the study should be passed in the extra
1821           argument. This function is the main matching facility of  the  library,
1822           and it operates in a Perl-like manner. For specialist use there is also
1823           an alternative matching function, which is described below in the  sec-
1824           tion about the pcre_dfa_exec() function.
1825    
1826           In  most applications, the pattern will have been compiled (and option-
1827           ally studied) in the same process that calls pcre_exec().  However,  it
1828           is possible to save compiled patterns and study data, and then use them
1829           later in different processes, possibly even on different hosts.  For  a
1830           discussion about this, see the pcreprecompile documentation.
1831    
1832           Here is an example of a simple call to pcre_exec():
1833    
1834             int rc;
1835             int ovector[30];
1836             rc = pcre_exec(
1837               re,             /* result of pcre_compile() */
1838               NULL,           /* we didn't study the pattern */
1839               "some string",  /* the subject string */
1840               11,             /* the length of the subject string */
1841               0,              /* start at offset 0 in the subject */
1842               0,              /* default options */
1843               ovector,        /* vector of integers for substring information */
1844               30);            /* number of elements (NOT size in bytes) */
1845    
1846       Extra data for pcre_exec()
1847    
1848           If  the  extra argument is not NULL, it must point to a pcre_extra data
1849           block. The pcre_study() function returns such a block (when it  doesn't
1850           return  NULL), but you can also create one for yourself, and pass addi-
1851           tional information in it. The pcre_extra block contains  the  following
1852           fields (not necessarily in this order):
1853    
1854             unsigned long int flags;
1855             void *study_data;
1856             unsigned long int match_limit;
1857             unsigned long int match_limit_recursion;
1858             void *callout_data;
1859             const unsigned char *tables;
1860    
1861           The  flags  field  is a bitmap that specifies which of the other fields
1862           are set. The flag bits are:
1863    
1864             PCRE_EXTRA_STUDY_DATA
1865             PCRE_EXTRA_MATCH_LIMIT
1866             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1867             PCRE_EXTRA_CALLOUT_DATA
1868             PCRE_EXTRA_TABLES
1869    
1870           Other flag bits should be set to zero. The study_data field is  set  in
1871           the  pcre_extra  block  that is returned by pcre_study(), together with
1872           the appropriate flag bit. You should not set this yourself, but you may
1873           add  to  the  block by setting the other fields and their corresponding
1874           flag bits.
1875    
1876           The match_limit field provides a means of preventing PCRE from using up
1877           a  vast amount of resources when running patterns that are not going to
1878           match, but which have a very large number  of  possibilities  in  their
1879           search  trees.  The  classic  example  is  the  use of nested unlimited
1880           repeats.
1881    
1882           Internally, PCRE uses a function called match() which it calls  repeat-
1883           edly  (sometimes  recursively). The limit set by match_limit is imposed
1884           on the number of times this function is called during  a  match,  which
1885           has  the  effect  of  limiting the amount of backtracking that can take
1886           place. For patterns that are not anchored, the count restarts from zero
1887           for each position in the subject string.
1888    
1889           The  default  value  for  the  limit can be set when PCRE is built; the
1890           built-in default is 10 million, which handles all but the most  extreme
1891           cases.  You  can  override  the default by supplying pcre_exec() with a
1892           pcre_extra    block    in    which    match_limit    is    set,     and
1893           PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1894           exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1895    
1896           The match_limit_recursion field is similar to match_limit, but  instead
1897           of limiting the total number of times that match() is called, it limits
1898           the depth of recursion. The recursion depth is a  smaller  number  than
1899           the  total number of calls, because not all calls to match() are recur-
1900           sive.  This limit is of use only if it is set smaller than match_limit.
1901    
1902           Limiting the recursion depth limits the amount of  stack  that  can  be
1903           used, or, when PCRE has been compiled to use memory on the heap instead
1904           of the stack, the amount of heap memory that can be used.
1905    
1906           The default value for match_limit_recursion can be  set  when  PCRE  is
1907           built;  the  built-in  default  is  the  same  value as the default for
1908           match_limit. You can override the default by supplying pcre_exec() with
1909           a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1910           PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1911           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1912    
1913           The callout_data field is used in conjunction with the  "callout"  fea-
1914           ture, which is described in the pcrecallout documentation.
1915    
1916           The tables field  is  used  to  pass  a  character  tables  pointer  to
1917           pcre_exec();  this overrides the value that is stored with the compiled
1918           pattern. A non-NULL value is stored with the compiled pattern  only  if
1919           custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1920           ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1921           PCRE's  internal  tables  to be used. This facility is helpful when re-
1922           using patterns that have been saved after compiling  with  an  external
1923           set  of  tables,  because  the  external tables might be at a different
1924           address when pcre_exec() is called. See the  pcreprecompile  documenta-
1925           tion for a discussion of saving compiled patterns for later use.
1926    
1927       Option bits for pcre_exec()
1928    
1929           The  unused  bits of the options argument for pcre_exec() must be zero.
1930           The only  bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_BSR_xxx,
1931           PCRE_NEWLINE_xxx,     PCRE_NOTBOL,     PCRE_NOTEOL,     PCRE_NOTEMPTY,
1932           PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
1933    
1934             PCRE_ANCHORED
1935    
1936           The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1937           matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1938           turned out to be anchored by virtue of its contents, it cannot be  made
1939           unanchored at matching time.
1940    
1941             PCRE_BSR_ANYCRLF
1942             PCRE_BSR_UNICODE
1943    
1944           These options (which are mutually exclusive) control what the \R escape
1945           sequence matches. The choice is either to match only CR, LF,  or  CRLF,
1946           or  to  match  any Unicode newline sequence. These options override the
1947           choice that was made or defaulted when the pattern was compiled.
1948    
1949             PCRE_NEWLINE_CR
1950             PCRE_NEWLINE_LF
1951             PCRE_NEWLINE_CRLF
1952             PCRE_NEWLINE_ANYCRLF
1953             PCRE_NEWLINE_ANY
1954    
1955           These options override  the  newline  definition  that  was  chosen  or
1956           defaulted  when the pattern was compiled. For details, see the descrip-
1957           tion of pcre_compile()  above.  During  matching,  the  newline  choice
1958           affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
1959           ters. It may also alter the way the match position is advanced after  a
1960           match failure for an unanchored pattern.
1961    
1962           When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
1963           set, and a match attempt for an unanchored pattern fails when the  cur-
1964           rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
1965           explicit matches for  CR  or  LF  characters,  the  match  position  is
1966           advanced by two characters instead of one, in other words, to after the
1967           CRLF.
1968    
1969           The above rule is a compromise that makes the most common cases work as
1970           expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
1971           option is not set), it does not match the string "\r\nA" because, after
1972           failing  at the start, it skips both the CR and the LF before retrying.
1973           However, the pattern [\r\n]A does match that string,  because  it  con-
1974           tains an explicit CR or LF reference, and so advances only by one char-
1975           acter after the first failure.
1976    
1977           An explicit match for CR or LF is either a literal appearance of one of
1978           those  characters,  or  one  of the \r or \n escape sequences. Implicit
1979           matches such as [^X] do not count, nor does \s (which includes  CR  and
1980           LF in the characters that it matches).
1981    
1982           Notwithstanding  the above, anomalous effects may still occur when CRLF
1983           is a valid newline sequence and explicit \r or \n escapes appear in the
1984           pattern.
1985    
1986             PCRE_NOTBOL
1987    
1988           This option specifies that the first character of the subject string is
1989           not the beginning of a line, so the circumflex metacharacter should not
1990           match  before it. Setting this without PCRE_MULTILINE (at compile time)
1991           causes circumflex never to match. This option affects only  the  behav-
1992           iour of the circumflex metacharacter. It does not affect \A.
1993    
1994             PCRE_NOTEOL
1995    
1996           This option specifies that the end of the subject string is not the end
1997           of a line, so the dollar metacharacter should not match it nor  (except
1998           in  multiline mode) a newline immediately before it. Setting this with-
1999           out PCRE_MULTILINE (at compile time) causes dollar never to match. This
2000           option  affects only the behaviour of the dollar metacharacter. It does
2001           not affect \Z or \z.
2002    
2003             PCRE_NOTEMPTY
2004    
2005           An empty string is not considered to be a valid match if this option is
2006           set.  If  there are alternatives in the pattern, they are tried. If all
2007           the alternatives match the empty string, the entire  match  fails.  For
2008           example, if the pattern
2009    
2010             a?b?
2011    
2012           is  applied  to  a string not beginning with "a" or "b", it matches the
2013           empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
2014           match is not valid, so PCRE searches further into the string for occur-
2015           rences of "a" or "b".
2016    
2017           Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
2018           cial  case  of  a  pattern match of the empty string within its split()
2019           function, and when using the /g modifier. It  is  possible  to  emulate
2020           Perl's behaviour after matching a null string by first trying the match
2021           again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
2022           if  that  fails by advancing the starting offset (see below) and trying
2023           an ordinary match again. There is some code that demonstrates how to do
2024           this in the pcredemo.c sample program.
2025    
2026             PCRE_NO_START_OPTIMIZE
2027    
2028           There  are a number of optimizations that pcre_exec() uses at the start
2029           of a match, in order to speed up the process. For  example,  if  it  is
2030           known  that  a  match must start with a specific character, it searches
2031           the subject for that character, and fails immediately if it cannot find
2032           it,  without actually running the main matching function. When callouts
2033           are in use, these optimizations can cause  them  to  be  skipped.  This
2034           option  disables  the  "start-up" optimizations, causing performance to
2035           suffer, but ensuring that the callouts do occur.
2036    
2037             PCRE_NO_UTF8_CHECK
2038    
2039           When PCRE_UTF8 is set at compile time, the validity of the subject as a
2040           UTF-8  string is automatically checked when pcre_exec() is subsequently
2041           called.  The value of startoffset is also checked  to  ensure  that  it
2042           points  to  the start of a UTF-8 character. There is a discussion about
2043           the validity of UTF-8 strings in the section on UTF-8  support  in  the
2044           main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2045           pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-
2046           tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
2047    
2048           If  you  already  know that your subject is valid, and you want to skip
2049           these   checks   for   performance   reasons,   you   can    set    the
2050           PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2051           do this for the second and subsequent calls to pcre_exec() if  you  are
2052           making  repeated  calls  to  find  all  the matches in a single subject
2053           string. However, you should be  sure  that  the  value  of  startoffset
2054           points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
2055           set, the effect of passing an invalid UTF-8 string as a subject,  or  a
2056           value  of startoffset that does not point to the start of a UTF-8 char-
2057           acter, is undefined. Your program may crash.
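
               When the checks are skipped, the caller becomes responsible for the
               startoffset condition. The sketch below shows one way to test it,
               on the assumption that the subject is already known to be valid
               UTF-8; is_utf8_char_start() is a hypothetical helper, not part of
               PCRE, and it does not validate the string as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if offset can be the start of a UTF-8 character in a subject
 * of the given byte length, otherwise 0. A byte of the form 10xxxxxx is
 * a continuation byte and therefore cannot begin a character.
 * Hypothetical helper for illustration; not part of the PCRE API. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    assert(subject != NULL);
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;                 /* the end of the subject is a boundary */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

               Only the single byte at the offset is inspected, so the cost is
               constant no matter how long the subject is.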
2058    
2059             PCRE_PARTIAL
2060    
2061           This option turns on the  partial  matching  feature.  If  the  subject
2062           string  fails to match the pattern, but at some point during the match-
2063           ing process the end of the subject was reached (that  is,  the  subject
2064           partially  matches  the  pattern and the failure to match occurred only
2065           because there were not enough subject characters), pcre_exec()  returns
2066           PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
2067           used, there are restrictions on what may appear in the  pattern.  These
2068           are discussed in the pcrepartial documentation.
2069    
2070       The string to be matched by pcre_exec()
2071    
2072           The  subject string is passed to pcre_exec() as a pointer in subject, a
2073           length (in bytes) in length, and a starting byte offset in startoffset.
2074           In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
2075           acter. Unlike the pattern string, the subject may contain  binary  zero
2076           bytes.  When the starting offset is zero, the search for a match starts
2077           at the beginning of the subject, and this is by  far  the  most  common
2078           case.
2079    
2080           A  non-zero  starting offset is useful when searching for another match
2081           in the same subject by calling pcre_exec() again after a previous  suc-
2082           cess.   Setting  startoffset differs from just passing over a shortened
2083           string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
2084           with any kind of lookbehind. For example, consider the pattern
2085    
2086             \Biss\B
2087    
2088           which  finds  occurrences  of "iss" in the middle of words. (\B matches
2089           only if the current position in the subject is not  a  word  boundary.)
2090           When  applied to the string "Mississippi" the first call to pcre_exec()
2091           finds the first occurrence. If pcre_exec() is called  again  with  just
2092           the  remainder  of  the  subject,  namely "issippi", it does not match,
2093           because \B is always false at the start of the subject, which is deemed
2094           to  be  a  word  boundary. However, if pcre_exec() is passed the entire
2095           string again, but with startoffset set to 4, it finds the second occur-
2096           rence  of "iss" because it is able to look behind the starting point to
2097           discover that it is preceded by a letter.
2098    
2099           If a non-zero starting offset is passed when the pattern  is  anchored,
2100           one attempt to match at the given offset is made. This can only succeed
2101           if the pattern does not require the match to be at  the  start  of  the
2102           subject.
2103    
2104       How pcre_exec() returns captured substrings
2105    
2106           In  general, a pattern matches a certain portion of the subject, and in
2107           addition, further substrings from the subject  may  be  picked  out  by
2108           parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
2109           this is called "capturing" in what follows, and the  phrase  "capturing
2110           subpattern"  is  used for a fragment of a pattern that picks out a sub-
2111           string. PCRE supports several other kinds of  parenthesized  subpattern
2112           that do not cause substrings to be captured.
2113    
2114           Captured substrings are returned to the caller via a vector of integers
2115           whose address is passed in ovector. The number of elements in the  vec-
2116           tor  is  passed in ovecsize, which must be a non-negative number. Note:
2117           this argument is NOT the size of ovector in bytes.
2118    
2119           The first two-thirds of the vector is used to pass back  captured  sub-
2120           strings,  each  substring using a pair of integers. The remaining third
2121           of the vector is used as workspace by pcre_exec() while  matching  cap-
2122           turing  subpatterns, and is not available for passing back information.
2123           The number passed in ovecsize should always be a multiple of three.  If
2124           it is not, it is rounded down.
2125    
2126           When  a  match  is successful, information about captured substrings is
2127           returned in pairs of integers, starting at the  beginning  of  ovector,
2128           and  continuing  up  to two-thirds of its length at the most. The first
2129           element of each pair is set to the byte offset of the  first  character
2130           in  a  substring, and the second is set to the byte offset of the first
2131           character after the end of a substring. Note: these values  are  always
2132           byte offsets, even in UTF-8 mode. They are not character counts.
2133    
2134           The  first  pair  of  integers, ovector[0] and ovector[1], identify the
2135           portion of the subject string matched by the entire pattern.  The  next
2136           pair  is  used for the first capturing subpattern, and so on. The value
2137           returned by pcre_exec() is one more than the highest numbered pair that
2138           has  been  set.  For example, if two substrings have been captured, the
2139           returned value is 3. If there are no capturing subpatterns, the  return
2140           value from a successful match is 1, indicating that just the first pair
2141           of offsets has been set.
2142    
2143           If a capturing subpattern is matched repeatedly, it is the last portion
2144           of the string that it matched that is returned.
2145    
2146           If  the vector is too small to hold all the captured substring offsets,
2147           it is used as far as possible (up to two-thirds of its length), and the
2148           function  returns  a value of zero. If the substring offsets are not of
2149           interest, pcre_exec() may be called with ovector  passed  as  NULL  and
2150           ovecsize  as zero. However, if the pattern contains back references and
2151           the ovector is not big enough to remember the related substrings,  PCRE
2152           has  to  get additional memory for use during matching. Thus it is usu-
2153           ally advisable to supply an ovector.
2154    
2155           The pcre_fullinfo() function can be used to find out how many capturing
2156           subpatterns  there  are  in  a  compiled pattern. The smallest size for
2157           ovector that will allow for n captured substrings, in addition  to  the
2158           offsets of the substring matched by the whole pattern, is (n+1)*3.
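
               The sizing rule and the zero return value described above reduce to
               two small calculations. Both function names below are hypothetical
               and are used for illustration only.

```c
#include <assert.h>

/* Smallest ovecsize that can hold the whole-match pair plus n capture
 * pairs, including the final third that is used as workspace. */
int min_ovecsize(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that may be read after a successful match.
 * rc is the non-negative return value of the match call. A value of
 * zero means the vector was too small, in which case only the first
 * ovecsize/3 pairs (two-thirds of the vector) were filled in. */
int pairs_filled(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    return (rc == 0) ? ovecsize / 3 : rc;
}
```

               The integer division in pairs_filled() matches the rule that an
               ovecsize which is not a multiple of three is rounded down.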
2159    
2160           It  is  possible for capturing subpattern number n+1 to match some part
2161           of the subject when subpattern n has not been used at all. For example,
2162           if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
2163           return from the function is 4, and subpatterns 1 and 3 are matched, but
2164           2  is  not.  When  this happens, both values in the offset pairs corre-
2165           sponding to unused subpatterns are set to -1.
2166    
2167           Offset values that correspond to unused subpatterns at the end  of  the
2168           expression  are  also  set  to  -1. For example, if the string "abc" is
2169           matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
2170           matched.  The  return  from the function is 2, because the highest used
2171           capturing subpattern number is 1. However, you can refer to the offsets
2172           for  the  second  and third capturing subpatterns if you wish (assuming
2173           the vector is large enough, of course).
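
               Reading a substring directly from the offset pairs, including the
               -1 case for unused subpatterns, might look like the sketch below.
               copy_capture() and its return codes are hypothetical; the library's
               own pcre_copy_substring() serves the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy captured substring n from subject into buffer, using the offset
 * pairs in ovector. rc is the positive return value of the match call.
 * Returns the substring length, -1 if n is out of range or the
 * subpattern was not used, or -2 if the buffer is too small.
 * Hypothetical helper for illustration; not part of the PCRE API. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buffer, int buffersize)
{
    int start, end, len;
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];        /* offset of the first byte */
    end = ovector[2 * n + 1];      /* offset just past the last byte */
    if (start < 0 || end < start)
        return -1;                 /* unused subpatterns are -1, -1 */
    len = end - start;
    if (len + 1 > buffersize)
        return -2;                 /* no room for the terminating zero */
    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

               memcpy() is used rather than a string copy so that a substring
               containing binary zero bytes is copied in full.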
2174    
2175           Some convenience functions are provided  for  extracting  the  captured
2176           substrings as separate strings. These are described below.
2177    
2178       Error return values from pcre_exec()
2179    
2180           If  pcre_exec()  fails, it returns a negative number. The following are
2181           defined in the header file:
2182    
2183             PCRE_ERROR_NOMATCH        (-1)
2184    
2185           The subject string did not match the pattern.
2186    
2187             PCRE_ERROR_NULL           (-2)
2188    
2189           Either code or subject was passed as NULL,  or  ovector  was  NULL  and
2190           ovecsize was not zero.
2191    
2192             PCRE_ERROR_BADOPTION      (-3)
2193    
2194           An unrecognized bit was set in the options argument.
2195    
2196             PCRE_ERROR_BADMAGIC       (-4)
2197    
2198           PCRE  stores a 4-byte "magic number" at the start of the compiled code,
2199           to catch the case when it is passed a junk pointer and to detect when a
2200           pattern that was compiled in an environment of one endianness is run in
2201           an environment with the other endianness. This is the error  that  PCRE
2202           gives when the magic number is not present.
2203    
2204             PCRE_ERROR_UNKNOWN_OPCODE (-5)
2205    
2206           While running the pattern match, an unknown item was encountered in the
2207           compiled pattern. This error could be caused by a bug  in  PCRE  or  by
2208           overwriting of the compiled pattern.
2209    
2210             PCRE_ERROR_NOMEMORY       (-6)
2211    
2212           If  a  pattern contains back references, but the ovector that is passed
2213           to pcre_exec() is not big enough to remember the referenced substrings,
2214           PCRE  gets  a  block of memory at the start of matching to use for this
2215           purpose. If the call via pcre_malloc() fails, this error is given.  The
2216           memory is automatically freed at the end of matching.
2217    
2218             PCRE_ERROR_NOSUBSTRING    (-7)
2219    
2220           This  error is used by the pcre_copy_substring(), pcre_get_substring(),
2221           and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2222           returned by pcre_exec().
2223    
2224             PCRE_ERROR_MATCHLIMIT     (-8)
2225    
2226           The  backtracking  limit,  as  specified  by the match_limit field in a
2227           pcre_extra structure (or defaulted) was reached.  See  the  description
2228           above.
2229    
2230             PCRE_ERROR_CALLOUT        (-9)
2231    
2232           This error is never generated by pcre_exec() itself. It is provided for
2233           use by callout functions that want to yield a distinctive  error  code.
2234           See the pcrecallout documentation for details.
2235    
2236             PCRE_ERROR_BADUTF8        (-10)
2237    
2238           A  string  that contains an invalid UTF-8 byte sequence was passed as a
2239           subject.
2240    
2241             PCRE_ERROR_BADUTF8_OFFSET (-11)
2242    
2243           The UTF-8 byte sequence that was passed as a subject was valid, but the
2244           value  of startoffset did not point to the beginning of a UTF-8 charac-
2245           ter.
2246    
2247             PCRE_ERROR_PARTIAL        (-12)
2248    
2249           The subject string did not match, but it did match partially.  See  the
2250           pcrepartial documentation for details of partial matching.
2251    
2252             PCRE_ERROR_BADPARTIAL     (-13)
2253    
2254           The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
2255           items that are not supported for partial matching. See the  pcrepartial
2256           documentation for details of partial matching.
2257    
2258             PCRE_ERROR_INTERNAL       (-14)
2259    
2260           An  unexpected  internal error has occurred. This error could be caused
2261           by a bug in PCRE or by overwriting of the compiled pattern.
2262    
2263             PCRE_ERROR_BADCOUNT       (-15)
2264    
2265           This error is given if the value of the ovecsize argument is negative.
2266    
2267             PCRE_ERROR_RECURSIONLIMIT (-21)
2268    
2269           The internal recursion limit, as specified by the match_limit_recursion
2270           field  in  a  pcre_extra structure (or defaulted), was reached. See the
2271           description above.
2272    
2273             PCRE_ERROR_BADNEWLINE     (-23)
2274    
2275           An invalid combination of PCRE_NEWLINE_xxx options was given.
2276    
2277           Error numbers -16 to -20 and -22 are not used by pcre_exec().
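
               A caller typically turns these codes into diagnostics. The sketch
               below is one way to do that and is not part of PCRE: the helper
               name exec_error_text() and the SK_ constants are hypothetical,
               with values copied from the list above so that the fragment
               stands alone. Real code would include pcre.h and use the
               PCRE_ERROR_xxx names instead.

```c
#include <stddef.h>

/* Stand-ins for the PCRE_ERROR_xxx macros; the values are the ones
   listed above. Real code should take them from pcre.h. */
enum {
  SK_ERROR_MATCHLIMIT     = -8,
  SK_ERROR_CALLOUT        = -9,
  SK_ERROR_BADUTF8        = -10,
  SK_ERROR_BADUTF8_OFFSET = -11,
  SK_ERROR_PARTIAL        = -12,
  SK_ERROR_BADPARTIAL     = -13,
  SK_ERROR_INTERNAL       = -14,
  SK_ERROR_BADCOUNT       = -15,
  SK_ERROR_RECURSIONLIMIT = -21,
  SK_ERROR_BADNEWLINE     = -23
};

/* Hypothetical helper: returns a short description of one of the
   negative results listed above, or NULL for any other value. */
const char *exec_error_text(int rc)
{
  switch (rc) {
    case SK_ERROR_MATCHLIMIT:     return "backtracking limit reached";
    case SK_ERROR_CALLOUT:        return "error code chosen by a callout function";
    case SK_ERROR_BADUTF8:        return "invalid UTF-8 in subject";
    case SK_ERROR_BADUTF8_OFFSET: return "startoffset is not at a UTF-8 character boundary";
    case SK_ERROR_PARTIAL:        return "partial match only";
    case SK_ERROR_BADPARTIAL:     return "pattern item not supported for partial matching";
    case SK_ERROR_INTERNAL:       return "unexpected internal error";
    case SK_ERROR_BADCOUNT:       return "negative ovecsize";
    case SK_ERROR_RECURSIONLIMIT: return "recursion limit reached";
    case SK_ERROR_BADNEWLINE:     return "invalid PCRE_NEWLINE_xxx combination";
    default:                      return NULL;
  }
}
```

               Returning NULL for unlisted values lets the caller decide how to
               report codes that pcre_exec() does not use, such as -16 to -20
               and -22.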
2278    
2279    
2280    EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2281    
2282           int pcre_copy_substring(const char *subject, int *ovector,
2283                int stringcount, int stringnumber, char *buffer,
2284                int buffersize);
2285    
2286           int pcre_get_substring(const char *subject, int *ovector,
2287                int stringcount, int stringnumber,
2288                const char **stringptr);
2289    
2290           int pcre_get_substring_list(const char *subject,
2291                int *ovector, int stringcount, const char ***listptr);
2292    
2293           Captured substrings can be  accessed  directly  by  using  the  offsets
2294           returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2295           pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2296           string_list()  are  provided for extracting captured substrings as new,
2297           separate, zero-terminated strings. These functions identify  substrings
2298           by  number.  The  next section describes functions for extracting named
2299           substrings.
2300    
2301           A substring that contains a binary zero is correctly extracted and  has
2302           a  further zero added on the end, but the result is not, of course, a C
2303           string.  However, you can process such a string  by  referring  to  the
2304           length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2305           string().  Unfortunately, the interface to pcre_get_substring_list() is
2306           not  adequate for handling strings containing binary zeros, because the
2307           end of the final string is not independently indicated.
2308    
2309           The first three arguments are the same for all  three  of  these  func-
2310           tions:  subject  is  the subject string that has just been successfully
2311           matched, ovector is a pointer to the vector of integer offsets that was
2312           passed to pcre_exec(), and stringcount is the number of substrings that
2313           were captured by the match, including the substring  that  matched  the
2314           entire regular expression. This is the value returned by pcre_exec() if
2315           it is greater than zero. If pcre_exec() returned zero, indicating  that
2316           it  ran out of space in ovector, the value passed as stringcount should
2317           be the number of elements in the vector divided by three.
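
               The rule for choosing stringcount can be written down directly.
               This is a small sketch under the assumptions stated in the
               paragraph above; the helper name stringcount_for() is
               hypothetical and is not a PCRE function.

```c
/* Hypothetical helper: derive the stringcount argument from the value
   returned by pcre_exec() (rc) and the ovecsize that was passed to it.
   Returns -1 when rc is negative, because there was no successful match
   and therefore nothing to extract. */
int stringcount_for(int rc, int ovecsize)
{
  if (rc > 0) return rc;             /* number of substrings captured */
  if (rc == 0) return ovecsize / 3;  /* ovector ran out of space */
  return -1;                         /* no match, or an error */
}
```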
2318    
2319           The functions pcre_copy_substring() and pcre_get_substring() extract  a
2320           single  substring,  whose  number  is given as stringnumber. A value of
2321           zero extracts the substring that matched the  entire  pattern,  whereas
2322           higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2323           string(), the string is placed in buffer,  whose  length  is  given  by
2324           buffersize,  while  for  pcre_get_substring()  a new block of memory is
2325           obtained via pcre_malloc, and its address is  returned  via  stringptr.
2326           The  yield  of  the function is the length of the string, not including
2327           the terminating zero, or one of these error codes:
2328    
2329             PCRE_ERROR_NOMEMORY       (-6)
2330    
2331           The buffer was too small for pcre_copy_substring(), or the  attempt  to
2332           get memory failed for pcre_get_substring().
2333    
2334             PCRE_ERROR_NOSUBSTRING    (-7)
2335    
2336           There is no substring whose number is stringnumber.
2337    
2338           The  pcre_get_substring_list()  function  extracts  all  available sub-
2339           strings and builds a list of pointers to them. All this is  done  in  a
2340           single block of memory that is obtained via pcre_malloc. The address of
2341           the memory block is returned via listptr, which is also  the  start  of
2342           the  list  of  string pointers. The end of the list is marked by a NULL
2343           pointer. The yield of the function is zero if all  went  well,  or  the
2344           error code
2345    
2346             PCRE_ERROR_NOMEMORY       (-6)
2347    
2348           if the attempt to get the memory block failed.
2349    
2350           When  any of these functions encounter a substring that is unset, which
2351           can happen when capturing subpattern number n+1 matches  some  part  of
2352           the  subject, but subpattern n has not been used at all, they return an
2353           empty string. This can be distinguished from a genuine zero-length sub-
2354           string  by inspecting the appropriate offset in ovector, which is nega-
2355           tive for unset substrings.
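
               The following sketch shows direct use of the offsets in ovector,
               including the test for an unset substring. It imitates the
               documented behaviour of pcre_copy_substring() but is not the
               library's code; the name copy_capture() and the is_unset output
               argument are hypothetical, and the return values -6 and -7 are
               the documented values of PCRE_ERROR_NOMEMORY and
               PCRE_ERROR_NOSUBSTRING.

```c
#include <string.h>

/* Sketch of direct extraction from ovector. Returns the length of the
   substring, -6 if the buffer is too small, or -7 if there is no
   substring with the given number. *is_unset is set to 1 when the
   start offset is negative, which marks an unset substring. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int stringnumber, char *buffer, int buffersize,
                 int *is_unset)
{
  int start, end, length;

  if (stringnumber < 0 || stringnumber >= stringcount) return -7;

  start = ovector[2 * stringnumber];      /* offset of first character */
  end = ovector[2 * stringnumber + 1];    /* offset just past the last one */
  *is_unset = (start < 0);
  length = (start < 0) ? 0 : end - start; /* unset yields an empty string */

  if (buffersize < length + 1) return -6;
  if (length > 0) memcpy(buffer, subject + start, (size_t)length);
  buffer[length] = '\0';
  return length;
}
```

               For the pattern (a)|(b) matched against "b", ovector would hold
               0, 1, -1, -1, 0, 1 with a stringcount of 3; asking for substring
               1 then gives an empty string with is_unset set, while substring
               2 gives "b".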
2356    
2357           The two convenience functions pcre_free_substring() and  pcre_free_sub-
2358           string_list()  can  be  used  to free the memory returned by a previous
2359           call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2360           tively.  They  do  nothing  more  than  call the function pointed to by
2361           pcre_free, which of course could be called directly from a  C  program.
2362           However,  PCRE is used in some situations where it is linked via a spe-
2363           cial  interface  to  another  programming  language  that  cannot   use
2364           pcre_free  directly;  it is for these cases that the functions are pro-
2365           vided.
2366    
2367    
2368    EXTRACTING CAPTURED SUBSTRINGS BY NAME
2369    
2370           int pcre_get_stringnumber(const pcre *code,
2371                const char *name);
2372    
2373           int pcre_copy_named_substring(const pcre *code,
2374                const char *subject, int *ovector,
2375                int stringcount, const char *stringname,
2376                char *buffer, int buffersize);
2377    
2378           int pcre_get_named_substring(const pcre *code,
2379                const char *subject, int *ovector,
2380                int stringcount, const char *stringname,
2381                const char **stringptr);
2382    
2383           To extract a substring by name, you first have to find the  associated
2384           number. For example, for this pattern
2385    
2386             (a+)b(?<xxx>\d+)...
2387    
2388           the number of the subpattern called "xxx" is 2. If the name is known to
2389           be unique (PCRE_DUPNAMES was not set), you can find the number from the
2390           name by calling pcre_get_stringnumber(). The first argument is the com-
2391           piled pattern, and the second is the name. The yield of the function is
2392           the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2393           subpattern of that name.
2394    
2395           Given the number, you can extract the substring directly, or use one of
2396           the functions described in the previous section. For convenience, there
2397           are also two functions that do the whole job.
2398    
2399           Most   of   the   arguments    of    pcre_copy_named_substring()    and
2400           pcre_get_named_substring()  are  the  same  as  those for the similarly
2401           named functions that extract by number. As these are described  in  the
2402           previous  section,  they  are not re-described here. There are just two
2403           differences:
2404    
2405           First, instead of a substring number, a substring name is  given.  Sec-
2406           ond, there is an extra argument, given at the start, which is a pointer
2407           to the compiled pattern. This is needed in order to gain access to  the
2408           name-to-number translation table.
2409    
2410           These  functions call pcre_get_stringnumber(), and if it succeeds, they
2411           then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2412           ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2413           behaviour may not be what you want (see the next section).
2414    
2415           Warning: If the pattern uses the "(?|" feature to set up multiple  sub-
2416           patterns  with  the  same  number,  you cannot use names to distinguish
2417           them, because names are not included in the compiled code. The matching
2418           process uses only numbers.
2419    
2420    
2421    DUPLICATE SUBPATTERN NAMES
2422    
2423           int pcre_get_stringtable_entries(const pcre *code,
2424                const char *name, char **first, char **last);
2425    
2426           When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2427           subpatterns are not required to  be  unique.  Normally,  patterns  with
2428           duplicate  names  are such that in any one match, only one of the named
2429           subpatterns participates. An example is shown in the pcrepattern  docu-
2430           mentation.
2431    
2432           When    duplicates   are   present,   pcre_copy_named_substring()   and
2433           pcre_get_named_substring() return the first substring corresponding  to
2434           the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
2435           (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
2436           function  returns one of the numbers that are associated with the name,
2437           but it is not defined which it is.
2438    
2439           If you want to get full details of all captured substrings for a  given
2440           name,  you  must  use  the pcre_get_stringtable_entries() function. The
2441           first argument is the compiled pattern, and the second is the name. The
2442           third  and  fourth  are  pointers to variables which are updated by the
2443           function. After it has run, they point to the first and last entries in
2444           the  name-to-number  table  for  the  given  name.  The function itself
2445           returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
2446           there  are none. The format of the table is described above in the sec-
2447           tion entitled Information about a  pattern.   Given  all  the  relevant
2448           entries  for the name, you can extract each of their numbers, and hence
2449           the captured data, if any.
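
               Walking the entries might look like the sketch below. It assumes
               the table layout given in the section on information about a
               pattern: each entry starts with the subpattern number in two
               bytes, most significant byte first, followed by the
               zero-terminated name. The helper name entry_numbers() is
               hypothetical; first, last, and entry_size stand for the two
               pointers set by pcre_get_stringtable_entries() and its return
               value.

```c
/* Sketch: collect the subpattern number from every table entry between
   first and last (inclusive). Stores at most max_numbers values in
   numbers[] and returns how many were stored. */
int entry_numbers(const char *first, const char *last, int entry_size,
                  int *numbers, int max_numbers)
{
  const unsigned char *entry = (const unsigned char *)first;
  int count = 0;

  if (entry_size <= 0) return 0;  /* e.g. a negative error return */

  while (count < max_numbers) {
    /* The first two bytes hold the number, high byte first. */
    numbers[count++] = (entry[0] << 8) | entry[1];
    if (entry == (const unsigned char *)last) break;
    entry += entry_size;
  }
  return count;
}
```

               Each number obtained this way can then be used with the
               functions that extract by number, or to index ovector directly.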
2450    
2451    
2452    FINDING ALL POSSIBLE MATCHES
2453    
2454           The traditional matching function uses a  similar  algorithm  to  Perl,
2455           which stops when it finds the first match, starting at a given point in
2456           the subject. If you want to find all possible matches, or  the  longest
2457           possible  match,  consider using the alternative matching function (see
2458           below) instead. If you cannot use the alternative function,  but  still
2459           need  to  find all possible matches, you can kludge it up by making use
2460           of the callout facility, which is described in the pcrecallout documen-
2461           tation.
2462    
2463           What you have to do is to insert a callout right at the end of the pat-
2464           tern.  When your callout function is called, extract and save the  cur-
2465           rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2466           backtrack and try other alternatives. Ultimately, when it runs  out  of
2467           matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
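
               A callout function of this kind could be sketched as follows. To
               keep the fragment free of any dependency on pcre.h, it uses a
               stand-in structure containing only the fields that it reads; a
               real callout receives a pcre_callout_block pointer, as described
               in the pcrecallout documentation. The names
               sketch_callout_block, saved_matches, and save_match_callout()
               are hypothetical, and the collector is assumed to have been
               passed in through the callout_data field of a pcre_extra block.

```c
#include <string.h>

/* Stand-in for pcre_callout_block with only the fields used here. */
typedef struct {
  const char *subject;
  int start_match;
  int current_position;
  void *callout_data;
} sketch_callout_block;

#define MAX_SAVED 8
#define MAX_LEN 64

/* Hypothetical collector for the matches seen so far. */
typedef struct {
  int count;
  char text[MAX_SAVED][MAX_LEN];
} saved_matches;

/* Saves the text matched so far, then returns 1 so that matching fails
   at this point and the matcher backtracks to try other alternatives. */
int save_match_callout(sketch_callout_block *block)
{
  saved_matches *saved = (saved_matches *)block->callout_data;
  int length = block->current_position - block->start_match;

  if (saved != NULL && saved->count < MAX_SAVED &&
      length >= 0 && length < MAX_LEN) {
    memcpy(saved->text[saved->count],
           block->subject + block->start_match, (size_t)length);
    saved->text[saved->count][length] = '\0';
    saved->count++;
  }
  return 1;
}
```

               Matches that do not fit the fixed-size collector are silently
               skipped here; a real program would probably grow the storage
               instead.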
2468    
2469    
2470    MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
2471    
2472           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
2473                const char *subject, int length, int startoffset,
2474                int options, int *ovector, int ovecsize,
2475                int *workspace, int wscount);
2476    
2477           The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2478           against a compiled pattern, using a matching algorithm that  scans  the
2479           subject  string  just  once, and does not backtrack. This has different
2480           characteristics to the normal algorithm, and  is  not  compatible  with
2481           Perl.  Some  of the features of PCRE patterns are not supported. Never-
2482           theless, there are times when this kind of matching can be useful.  For
2483           a discussion of the two matching algorithms, see the pcrematching docu-
2484           mentation.
2485    
2486           The arguments for the pcre_dfa_exec() function  are  the  same  as  for
2487           pcre_exec(), plus two extras. The ovector argument is used in a differ-
2488           ent way, and this is described below. The other  common  arguments  are
2489           used  in  the  same way as for pcre_exec(), so their description is not
2490           repeated here.
2491    
2492           The two additional arguments provide workspace for  the  function.  The
2493           workspace  vector  should  contain at least 20 elements. It is used for
2494           keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2495           workspace  will  be  needed for patterns and subjects where there are a
2496           lot of potential matches.
2497    
2498           Here is an example of a simple call to pcre_dfa_exec():
2499    
2500             int rc;
2501             int ovector[10];
2502             int wspace[20];
2503             rc = pcre_dfa_exec(
2504               re,             /* result of pcre_compile() */
2505               NULL,           /* we didn't study the pattern */
2506               "some string",  /* the subject string */
2507               11,             /* the length of the subject string */
2508               0,              /* start at offset 0 in the subject */
2509               0,              /* default options */
2510               ovector,        /* vector of integers for substring information */
2511               10,             /* number of elements (NOT size in bytes) */
2512               wspace,         /* working space vector */
2513               20);            /* number of elements (NOT size in bytes) */
2514    
2515       Option bits for pcre_dfa_exec()
2516    
2517           The unused bits of the options argument  for  pcre_dfa_exec()  must  be
2518           zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
2519           LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,
2520           PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2521           three of these are the same as for pcre_exec(), so their description is
2522           not repeated here.
2523    
2524             PCRE_PARTIAL
2525    
2526           This  has  the  same general effect as it does for pcre_exec(), but the
2527           details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for
2528           pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into
2529           PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have
2530           been no complete matches, but there is still at least one matching pos-
2531           sibility. The portion of the string that provided the partial match  is
2532           set as the first matching string.
2533    
2534             PCRE_DFA_SHORTEST
2535    
2536           Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
2537           stop as soon as it has found one match. Because of the way the alterna-
2538           tive  algorithm  works, this is necessarily the shortest possible match
2539           at the first possible matching point in the subject string.
2540    
2541             PCRE_DFA_RESTART
2542    
2543           When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and
2544           returns  a  partial  match, it is possible to call it again, with addi-
2545           tional subject characters, and have it continue with  the  same  match.
2546           The  PCRE_DFA_RESTART  option requests this action; when it is set, the
2547           workspace and wscount arguments must reference the same vector as before
2548           because  data  about  the  match so far is left in them after a partial
2549           match. There is more discussion of this  facility  in  the  pcrepartial
2550           documentation.
2551    
2552       Successful returns from pcre_dfa_exec()
2553    
2554           When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
2555           string in the subject. Note, however, that all the matches from one run
2556           of  the  function  start  at the same point in the subject. The shorter
2557           matches are all initial substrings of the longer matches. For  example,
2558           if the pattern
2559    
2560             <.*>
2561    
2562           is matched against the string
2563    
2564             This is <something> <something else> <something further> no more
2565    
2566           the three matched strings are
2567    
2568             <something>
2569             <something> <something else>
2570             <something> <something else> <something further>
2571    
2572           On  success,  the  yield of the function is a number greater than zero,
2573           which is the number of matched substrings.  The  substrings  themselves
2574           are  returned  in  ovector. Each string uses two elements; the first is
2575           the offset to the start, and the second is the offset to  the  end.  In
2576           fact,  all  the  strings  have the same start offset. (Space could have
2577           been saved by giving this only once, but it was decided to retain  some
2578           compatibility  with  the  way pcre_exec() returns data, even though the
2579           meaning of the strings is different.)
2580    
2581           The strings are returned in reverse order of length; that is, the long-
2582           est  matching  string is given first. If there were too many matches to
2583           fit into ovector, the yield of the function is zero, and the vector  is
2584           filled with the longest matches.
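
               Reading the results back out of ovector can be sketched as
               below. The helper name copy_dfa_match() is hypothetical. When
               the return value rc is zero, the sketch assumes that the vector
               was completely filled, so that ovecsize / 2 pairs are present.

```c
#include <string.h>

/* Sketch: copy match number i (0 is the longest) reported by
   pcre_dfa_exec() into buffer. Returns the length of the match, or -1
   if i is out of range or the buffer is too small. */
int copy_dfa_match(const char *subject, const int *ovector, int rc,
                   int ovecsize, int i, char *buffer, int buffersize)
{
  /* rc > 0: rc matches; rc == 0: vector full; rc < 0: nothing to read. */
  int count = (rc > 0) ? rc : (rc == 0 ? ovecsize / 2 : 0);
  int start, length;

  if (i < 0 || i >= count) return -1;

  start = ovector[2 * i];               /* the same for every match */
  length = ovector[2 * i + 1] - start;
  if (length < 0 || buffersize < length + 1) return -1;

  memcpy(buffer, subject + start, (size_t)length);
  buffer[length] = '\0';
  return length;
}
```

               For the example above, the offset pairs would be (8,56), (8,36),
               and (8,19), so match 0 is the 48-character string and match 2 is
               "<something>".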
2585    
2586       Error returns from pcre_dfa_exec()
2587    
2588           The  pcre_dfa_exec()  function returns a negative number when it fails.
2589           Many of the errors are the same  as  for  pcre_exec(),  and  these  are
2590           described  above.   There are in addition the following errors that are
2591           specific to pcre_dfa_exec():
2592    
2593             PCRE_ERROR_DFA_UITEM      (-16)
2594    
2595           This return is given if pcre_dfa_exec() encounters an item in the  pat-
2596           tern  that  it  does not support, for instance, the use of \C or a back
2597           reference.
2598    
2599             PCRE_ERROR_DFA_UCOND      (-17)
2600    
2601           This return is given if pcre_dfa_exec()  encounters  a  condition  item
2602           that  uses  a back reference for the condition, or a test for recursion
2603           in a specific group. These are not supported.
2604    
2605             PCRE_ERROR_DFA_UMLIMIT    (-18)
2606    
2607           This return is given if pcre_dfa_exec() is called with an  extra  block
2608           that contains a setting of the match_limit field. This is not supported
2609           (it is meaningless).
2610    
2611             PCRE_ERROR_DFA_WSSIZE     (-19)
2612    
2613           This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
2614           workspace vector.
2615    
2616             PCRE_ERROR_DFA_RECURSE    (-20)
2617    
2618           When  a  recursive subpattern is processed, the matching function calls
2619           itself recursively, using private vectors for  ovector  and  workspace.
2620           This  error  is  given  if  the output vector is not large enough. This
2621           should be extremely rare, as a vector of size 1000 is used.
2622    
2623    
2624    SEE ALSO
2625    
2626           pcrebuild(3),  pcrecallout(3),  pcrecpp(3),  pcrematching(3),  pcrepar-
2627           tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2628    
2629    
2630    AUTHOR
2631    
2632           Philip Hazel
2633           University Computing Service
2634           Cambridge CB2 3QH, England.
2635    
2636    
2637    REVISION
2638    
2639           Last updated: 11 April 2009
2640           Copyright (c) 1997-2009 University of Cambridge.
2641    ------------------------------------------------------------------------------
2642    
2643    
2644    PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2645    
2646    
2647    NAME
2648           PCRE - Perl-compatible regular expressions
2649    
2650    
2651    PCRE CALLOUTS
2652    
2653           int (*pcre_callout)(pcre_callout_block *);
2654    
2655           PCRE provides a feature called "callout", which is a means of temporar-
2656           ily passing control to the caller of PCRE  in  the  middle  of  pattern
2657           matching.  The  caller of PCRE provides an external function by putting
2658           its entry point in the global variable pcre_callout. By  default,  this
2659           variable contains NULL, which disables all calling out.
2660    
2661           Within  a  regular  expression,  (?C) indicates the points at which the
2662           external function is to be called.  Different  callout  points  can  be
2663           identified  by  putting  a number less than 256 after the letter C. The
2664           default value is zero.  For  example,  this  pattern  has  two  callout
2665           points:
2666    
2667             (?C1)abc(?C2)def
2668    
2669           If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2670           called, PCRE automatically  inserts  callouts,  all  with  number  255,
2671           before  each  item in the pattern. For example, if PCRE_AUTO_CALLOUT is
2672           used with the pattern
2673    
2674             A(\d{2}|--)
2675    
2676           it is processed as if it were
2677    
2678           (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
2679    
2680           Notice that there is a callout before and after  each  parenthesis  and
2681           alternation  bar.  Automatic  callouts  can  be  used  for tracking the
2682           progress of pattern matching. The pcretest command has an  option  that
2683           sets  automatic callouts; when it is used, the output indicates how the
2684           pattern is matched. This is useful information when you are  trying  to
2685           optimize the performance of a particular pattern.
2686    
2687    
2688    MISSING CALLOUTS
2689    
2690           You  should  be  aware  that,  because of optimizations in the way PCRE
2691           matches patterns by default, callouts  sometimes  do  not  happen.  For
2692           example, if the pattern is
2693    
2694             ab(?C4)cd
2695    
2696           PCRE knows that any matching string must contain the letter "d". If the
2697           subject string is "abyz", the lack of "d" means that  matching  doesn't
2698           ever  start,  and  the  callout is never reached. However, with "abyd",
2699           though the result is still no match, the callout is obeyed.
2700    
2701           You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
2702           MIZE  option  to  pcre_exec()  or  pcre_dfa_exec(). This slows down the
2703           matching process, but does ensure that callouts  such  as  the  example
2704           above are obeyed.
2705    
2706    
2707    THE CALLOUT INTERFACE
2708    
2709           During  matching, when PCRE reaches a callout point, the external func-
2710           tion defined by pcre_callout is called (if it is set). This applies  to
2711           both  the  pcre_exec()  and the pcre_dfa_exec() matching functions. The
2712           only argument to the callout function is a pointer  to  a  pcre_callout
2713           block. This structure contains the following fields:
2714    
2715             int          version;
2716             int          callout_number;
2717             int         *offset_vector;
2718             const char  *subject;
2719             int          subject_length;
2720             int          start_match;
2721             int          current_position;
2722             int          capture_top;
2723             int          capture_last;
2724             void        *callout_data;
2725             int          pattern_position;
2726             int          next_item_length;
2727    
2728           The  version  field  is an integer containing the version number of the
2729           block format. The initial version was 0; the current version is 1.  The
2730           version  number  will  change  again in future if additional fields are
2731           added, but the intention is never to remove any of the existing fields.
2732    
2733           The callout_number field contains the number of the  callout,  as  com-
2734           piled  into  the pattern (that is, the number after ?C for manual call-
2735           outs, and 255 for automatically generated callouts).
2736    
2737           The offset_vector field is a pointer to the vector of offsets that  was
2738           passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
2739           pcre_exec() is used, the contents can be inspected in order to  extract
2740           substrings  that  have  been  matched  so  far,  in the same way as for
2741           extracting substrings after a match has completed. For  pcre_dfa_exec()
2742           this field is not useful.
2743    
2744           The subject and subject_length fields contain copies of the values that
2745           were passed to pcre_exec().
2746    
2747           The start_match field normally contains the offset within  the  subject
2748           at  which  the  current  match  attempt started. However, if the escape
2749           sequence \K has been encountered, this value is changed to reflect  the
2750           modified  starting  point.  If the pattern is not anchored, the callout
2751           function may be called several times from the same point in the pattern
2752           for different starting points in the subject.
2753    
2754           The  current_position  field  contains the offset within the subject of
2755           the current match pointer.
2756    
2757           When the pcre_exec() function is used, the capture_top  field  contains
2758           one  more than the number of the highest numbered captured substring so
2759           far. If no substrings have been captured, the value of  capture_top  is
2760           one.  This  is always the case when pcre_dfa_exec() is used, because it
2761           does not support captured substrings.
2762    
2763           The capture_last field contains the number of the  most  recently  cap-
2764           tured  substring. If no substrings have been captured, its value is -1.
2765           This is always the case when pcre_dfa_exec() is used.
2766    
2767           The callout_data field contains a value that is passed  to  pcre_exec()
2768           or  pcre_dfa_exec() specifically so that it can be passed back in call-
2769           outs. It is passed in the callout_data field  of  the  pcre_extra  data
2770           structure.  If  no such data was passed, the value of callout_data in a
2771           pcre_callout block is NULL. There is a description  of  the  pcre_extra
2772           structure in the pcreapi documentation.
2773    
2774           The  pattern_position field is present from version 1 of the pcre_call-
2775           out structure. It contains the offset to the next item to be matched in
2776           the pattern string.
2777    
2778           The  next_item_length field is present from version 1 of the pcre_call-
2779           out structure. It contains the length of the next item to be matched in
2780           the  pattern  string. When the callout immediately precedes an alterna-
2781           tion bar, a closing parenthesis, or the end of the pattern, the  length
2782           is  zero.  When the callout precedes an opening parenthesis, the length
2783           is that of the entire subpattern.
2784    
2785           The pattern_position and next_item_length fields are intended  to  help
2786           in  distinguishing between different automatic callouts, which all have
2787           the same callout number. However, they are set for all callouts.
2788    
2789    
2790    RETURN VALUES
2791    
2792           The external callout function returns an integer to PCRE. If the  value
2793           is  zero,  matching  proceeds  as  normal. If the value is greater than
2794           zero, matching fails at the current point, but  the  testing  of  other
2795           matching possibilities goes ahead, just as if a lookahead assertion had
2796           failed. If the value is less than zero, the  match  is  abandoned,  and
2797           pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2798    
2799           Negative   values   should   normally   be   chosen  from  the  set  of
2800           PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2801           dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
2802           reserved for use by callout functions; it will never be  used  by  PCRE
2803           itself.
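
               The three kinds of return value can be illustrated with a small
               sketch. It uses a stand-in structure with only the fields that
               it reads, so that it does not depend on pcre.h; a real callout
               receives a pcre_callout_block pointer. The names
               policy_callout_block and policy_callout(), the use of
               callout_data as a pointer to an int budget, and the choice of
               callout number 2 as the point to reject are all assumptions made
               for the example. The value -9 is the documented value of
               PCRE_ERROR_CALLOUT.

```c
#include <stddef.h>

#define SKETCH_ERROR_CALLOUT (-9)  /* stand-in for PCRE_ERROR_CALLOUT */

/* Stand-in for pcre_callout_block with only the fields used here. */
typedef struct {
  int callout_number;
  void *callout_data;
} policy_callout_block;

/* Returns a negative value to abandon the match once the caller's
   budget of callouts is used up, 1 to fail at callout number 2 so that
   other possibilities are tried, and 0 to carry on normally. */
int policy_callout(policy_callout_block *block)
{
  int *budget = (int *)block->callout_data;

  if (budget != NULL) {
    if (*budget <= 0) return SKETCH_ERROR_CALLOUT;  /* abandon the match */
    (*budget)--;
  }
  if (block->callout_number == 2) return 1;  /* fail here, then backtrack */
  return 0;                                  /* matching proceeds */
}
```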
2804    
2805    
2806    AUTHOR
2807    
2808           Philip Hazel
2809           University Computing Service
2810           Cambridge CB2 3QH, England.
2811    
2812    
2813    REVISION
2814    
2815           Last updated: 15 March 2009
2816           Copyright (c) 1997-2009 University of Cambridge.
2817    ------------------------------------------------------------------------------
2818    
2819    
2820    PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2821    
2822    
2823    NAME
2824           PCRE - Perl-compatible regular expressions
2825    
2826    
2827    DIFFERENCES BETWEEN PCRE AND PERL
2828    
2829           This  document describes the differences in the ways that PCRE and Perl
2830           handle regular expressions. The differences described here  are  mainly
2831           with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2832           some features that are expected to be in the forthcoming Perl 5.10.
2833    
2834           1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2835           of  what  it does have are given in the section on UTF-8 support in the
2836           main pcre page.
2837    
2838           2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2839           permits  them,  but they do not mean what you might think. For example,
2840           (?!a){3} does not assert that the next three characters are not "a". It
2841           just asserts that the next character is not "a" three times.
2842    
2843           3.  Capturing  subpatterns  that occur inside negative lookahead asser-
2844           tions are counted, but their entries in the offsets  vector  are  never
2845           set.  Perl sets its numerical variables from any such patterns that are
2846           matched before the assertion fails to match something (thereby succeed-
2847           ing),  but  only  if the negative lookahead assertion contains just one
2848           branch.
2849    
2850           4. Though binary zero characters are supported in the  subject  string,
2851           they are not allowed in a pattern string because it is passed as a nor-
2852           mal C string, terminated by zero. The escape sequence \0 can be used in
2853           the pattern to represent a binary zero.
2854    
2855           5.  The  following Perl escape sequences are not supported: \l, \u, \L,
2856           \U, and \N. In fact these are implemented by Perl's general string-han-
2857           dling  and are not part of its pattern matching engine. If any of these
2858           are encountered by PCRE, an error is generated.
2859    
2860           6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
2861           is  built  with Unicode character property support. The properties that
2862           can be tested with \p and \P are limited to the general category  prop-
2863           erties  such  as  Lu and Nd, script names such as Greek or Han, and the
2864           derived properties Any and L&.
2865    
2866           7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2867           ters  in  between  are  treated as literals. This is slightly different
2868           from Perl in that $ and @ are  also  handled  as  literals  inside  the
2869           quotes.  In Perl, they cause variable interpolation (but of course PCRE
2870           does not have variables). Note the following examples:
2871    
2872               Pattern            PCRE matches      Perl matches
2873    
2874               \Qabc$xyz\E        abc$xyz           abc followed by the
2875                                                      contents of $xyz
2876               \Qabc\$xyz\E       abc\$xyz          abc\$xyz
2877               \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
2878    
2879           The \Q...\E sequence is recognized both inside  and  outside  character
2880           classes.
2881    
2882           8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2883           constructions. However, there is support for recursive  patterns.  This
2884           is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
2885           "callout" feature allows an external function to be called during  pat-
2886           tern matching. See the pcrecallout documentation for details.
2887    
2888           9.  Subpatterns  that  are  called  recursively or as "subroutines" are
2889           always treated as atomic groups in  PCRE.  This  is  like  Python,  but
2890           unlike Perl.
2891    
2892           10.  There are some differences that are concerned with the settings of
2893           captured strings when part of  a  pattern  is  repeated.  For  example,
2894           matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
2895           unset, but in PCRE it is set to "b".
2896    
2897           11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
2898           (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
2899           the forms without an  argument.  PCRE  does  not  support  (*MARK).  If
2900           (*ACCEPT)  is within capturing parentheses, PCRE does not set that cap-
2901           ture group; this is different to Perl.
2902    
2903           12. PCRE provides some extensions to the Perl regular expression facil-
2904           ities.   Perl  5.10  will  include new features that are not in earlier
2905           versions, some of which (such as named parentheses) have been  in  PCRE
2906           for some time. This list is with respect to Perl 5.10:
2907    
2908           (a)  Although  lookbehind  assertions  must match fixed length strings,
2909           each alternative branch of a lookbehind assertion can match a different
2910           length of string. Perl requires them all to have the same length.
2911    
2912           (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
2913           meta-character matches only at the very end of the string.
2914    
2915           (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2916           cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2917           ignored.  (Perl can be made to issue a warning.)
2918    
2919           (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2920           fiers is inverted, that is, by default they are not greedy, but if fol-
2921           lowed by a question mark they are.
2922    
2923           (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2924           tried only at the first matching position in the subject string.
2925    
2926           (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2927           TURE options for pcre_exec() have no Perl equivalents.
2928    
2929           (g) The \R escape sequence can be restricted to match only CR,  LF,  or
2930           CRLF by the PCRE_BSR_ANYCRLF option.
2931    
2932           (h) The callout facility is PCRE-specific.
2933    
2934           (i) The partial matching facility is PCRE-specific.
2935    
2936           (j) Patterns compiled by PCRE can be saved and re-used at a later time,
2937           even on different hosts that have the other endianness.
2938    
2939           (k) The alternative matching function (pcre_dfa_exec())  matches  in  a
2940           different way and is not Perl-compatible.
2941    
2942           (l)  PCRE  recognizes some special sequences such as (*CR) at the start
2943           of a pattern that set overall options that cannot be changed within the
2944           pattern.
2945    
2946    
2947    AUTHOR
2948    
2949           Philip Hazel
2950           University Computing Service
2951           Cambridge CB2 3QH, England.
2952    
2953    
2954    REVISION
2955    
2956           Last updated: 11 September 2007
2957           Copyright (c) 1997-2007 University of Cambridge.
2958    ------------------------------------------------------------------------------
2959    
2960    
2961    PCREPATTERN(3)                                                  PCREPATTERN(3)
2962    
2963    
2964    NAME
2965           PCRE - Perl-compatible regular expressions
2966    
2967    
2968    PCRE REGULAR EXPRESSION DETAILS
2969    
2970           The  syntax and semantics of the regular expressions that are supported
2971           by PCRE are described in detail below. There is a quick-reference  syn-
2972           tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
2973           semantics as closely as it can. PCRE  also  supports  some  alternative
2974           regular  expression  syntax (which does not conflict with the Perl syn-
2975           tax) in order to provide some compatibility with regular expressions in
2976           Python, .NET, and Oniguruma.
2977    
2978           Perl's  regular expressions are described in its own documentation, and
2979           regular expressions in general are covered in a number of  books,  some
2980           of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
2981           Expressions", published by  O'Reilly,  covers  regular  expressions  in
2982           great  detail.  This  description  of  PCRE's  regular  expressions  is
2983           intended as reference material.
2984    
2985           The original operation of PCRE was on strings of  one-byte  characters.
2986           However,  there is now also support for UTF-8 character strings. To use
2987           this, you must build PCRE to  include  UTF-8  support,  and  then  call
2988           pcre_compile()  with  the  PCRE_UTF8  option.  There  is also a special
2989           sequence that can be given at the start of a pattern:
2990    
2991             (*UTF8)
2992    
2993           Starting a pattern with this sequence  is  equivalent  to  setting  the
2994           PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
2995           UTF-8 mode affects pattern matching  is  mentioned  in  several  places
2996           below.  There  is  also  a  summary of UTF-8 features in the section on
2997           UTF-8 support in the main pcre page.
2998    
2999           The remainder of this document discusses the  patterns  that  are  sup-
3000           ported  by  PCRE when its main matching function, pcre_exec(), is used.
3001           From  release  6.0,   PCRE   offers   a   second   matching   function,
3002           pcre_dfa_exec(),  which matches using a different algorithm that is not
3003           Perl-compatible. Some of the features discussed below are not available
3004           when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
3005           alternative function, and how it differs from the normal function,  are
3006           discussed in the pcrematching page.
3007    
3008    
3009    NEWLINE CONVENTIONS
3010    
3011           PCRE  supports five different conventions for indicating line breaks in
3012           strings: a single CR (carriage return) character, a  single  LF  (line-
3013           feed) character, the two-character sequence CRLF, any of the three pre-
3014           ceding, or any Unicode newline sequence. The pcreapi page  has  further
3015           discussion  about newlines, and shows how to set the newline convention
3016           in the options arguments for the compiling and matching functions.
3017    
3018           It is also possible to specify a newline convention by starting a  pat-
3019           tern string with one of the following five sequences:
3020    
3021             (*CR)        carriage return
3022             (*LF)        linefeed
3023             (*CRLF)      carriage return, followed by linefeed
3024             (*ANYCRLF)   any of the three above
3025             (*ANY)       all Unicode newline sequences
3026    
3027           These override the default and the options given to pcre_compile(). For
3028           example, on a Unix system where LF is the default newline sequence, the
3029           pattern
3030    
3031             (*CR)a.b
3032    
3033           changes the convention to CR. That pattern matches "a\nb" because LF is
3034           no longer a newline. Note that these special settings,  which  are  not
3035           Perl-compatible,  are  recognized  only at the very start of a pattern,
3036           and that they must be in upper case.  If  more  than  one  of  them  is
3037           present, the last one is used.
3038    
3039           The  newline  convention  does  not  affect what the \R escape sequence
3040           matches. By default, this is any Unicode  newline  sequence,  for  Perl
3041           compatibility.  However, this can be changed; see the description of \R
3042           in the section entitled "Newline sequences" below. A change of \R  set-
3043           ting can be combined with a change of newline convention.
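
The rules for leading settings can be sketched as a small Python model.
This is an illustration of the text above, not PCRE's own code: the
function name split_leading_settings and its default arguments are
hypothetical, and the sketch ignores other leading items such as
(*UTF8).

```python
NEWLINE_SETTINGS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}
BSR_SETTINGS = {"BSR_ANYCRLF", "BSR_UNICODE"}


def split_leading_settings(pattern, newline="LF", bsr="BSR_UNICODE"):
    """Return (newline, bsr, remainder) for a pattern string.

    The newline and bsr arguments stand in for whatever would apply
    otherwise (the build default or the pcre_compile() options).
    """
    pos = 0
    while pattern.startswith("(*", pos):
        close = pattern.find(")", pos)
        if close == -1:
            break
        name = pattern[pos + 2:close]
        if name in NEWLINE_SETTINGS:
            newline = name      # if several are present, the last one wins
        elif name in BSR_SETTINGS:
            bsr = name
        else:
            break               # not a setting: the pattern proper starts here
        pos = close + 1
    return newline, bsr, pattern[pos:]
```

With these assumptions, "(*CR)a.b" yields the CR convention and the
remaining pattern "a.b", while a lower case "(*cr)" is left in place
because the settings are recognized only in upper case.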
3044    
3045    
3046    CHARACTERS AND METACHARACTERS
3047    
3048           A  regular  expression  is  a pattern that is matched against a subject
3049           string from left to right. Most characters stand for  themselves  in  a
3050           pattern,  and  match  the corresponding characters in the subject. As a
3051           trivial example, the pattern
3052    
3053             The quick brown fox
3054    
3055           matches a portion of a subject string that is identical to itself. When
3056           caseless  matching is specified (the PCRE_CASELESS option), letters are
3057           matched independently of case. In UTF-8 mode, PCRE  always  understands
3058           the  concept  of case for characters whose values are less than 128, so
3059           caseless matching is always possible. For characters with  higher  val-
3060           ues,  the concept of case is supported if PCRE is compiled with Unicode
3061           property support, but not otherwise.   If  you  want  to  use  caseless
3062           matching  for  characters  128  and above, you must ensure that PCRE is
3063           compiled with Unicode property support as well as with UTF-8 support.
3064    
3065           The power of regular expressions comes  from  the  ability  to  include
3066           alternatives  and  repetitions in the pattern. These are encoded in the
3067           pattern by the use of metacharacters, which do not stand for themselves
3068           but instead are interpreted in some special way.
3069    
3070           There  are  two different sets of metacharacters: those that are recog-
3071           nized anywhere in the pattern except within square brackets, and  those
3072           that  are  recognized  within square brackets. Outside square brackets,
3073           the metacharacters are as follows:
3074    
3075             \      general escape character with several uses
3076             ^      assert start of string (or line, in multiline mode)
3077             $      assert end of string (or line, in multiline mode)
3078             .      match any character except newline (by default)
3079             [      start character class definition
3080             |      start of alternative branch
3081             (      start subpattern
3082             )      end subpattern
3083             ?      extends the meaning of (
3084                    also 0 or 1 quantifier
3085                    also quantifier minimizer
3086             *      0 or more quantifier
3087             +      1 or more quantifier
3088                    also "possessive quantifier"
3089             {      start min/max quantifier
3090    
3091           Part of a pattern that is in square brackets  is  called  a  "character
3092           class". In a character class the only metacharacters are:
3093    
3094             \      general escape character
3095             ^      negate the class, but only if the first character
3096             -      indicates character range
3097             [      POSIX character class (only if followed by POSIX
3098                      syntax)
3099             ]      terminates the character class
3100    
3101           The following sections describe the use of each of the metacharacters.
3102    
3103    
3104    BACKSLASH
3105    
3106           The backslash character has several uses. Firstly, if it is followed by
3107           a non-alphanumeric character, it takes away any  special  meaning  that
3108           character  may  have.  This  use  of  backslash  as an escape character
3109           applies both inside and outside character classes.
3110    
3111           For example, if you want to match a * character, you write  \*  in  the
3112           pattern.   This  escaping  action  applies whether or not the following
3113           character would otherwise be interpreted as a metacharacter, so  it  is
3114           always  safe  to  precede  a non-alphanumeric with backslash to specify
3115           that it stands for itself. In particular, if you want to match a  back-
3116           slash, you write \\.
3117    
3118           If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
3119           the pattern (other than in a character class) and characters between  a
3120           # outside a character class and the next newline are ignored. An escap-
3121           ing backslash can be used to include a whitespace  or  #  character  as
3122           part of the pattern.
3123    
3124           If  you  want  to remove the special meaning from a sequence of charac-
3125           ters, you can do so by putting them between \Q and \E. This is  differ-
3126           ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
3127           sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
3128           tion. Note the following examples:
3129    
3130             Pattern            PCRE matches   Perl matches
3131    
3132             \Qabc$xyz\E        abc$xyz        abc followed by the
3133                                                 contents of $xyz
3134             \Qabc\$xyz\E       abc\$xyz       abc\$xyz
3135             \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
3136    
3137           The  \Q...\E  sequence  is recognized both inside and outside character
3138           classes.
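
The PCRE column of the table can be reproduced with a short Python
sketch that rewrites each \Q...\E run as escaped literal text. The
helper name expand_quoted_runs is hypothetical, and two simplifications
are assumed: a \Q with no matching \E is taken to quote to the end of
the pattern, and a \Q that is itself preceded by an escaping backslash
is not handled.

```python
import re


def expand_quoted_runs(pattern):
    """Replace every \\Q...\\E run with an escaped, literal equivalent."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])       # no more quoted runs
            break
        out.append(pattern[pos:start])      # text before \Q is kept as is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: an unterminated \Q quotes the rest of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)
```

Because $ and @ inside the quotes are escaped like any other character,
the three patterns in the table match "abc$xyz", "abc\$xyz", and
"abc$xyz" respectively, as the PCRE column states.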
3139    
3140       Non-printing characters
3141    
3142           A second use of backslash provides a way of encoding non-printing char-
3143           acters  in patterns in a visible manner. There is no restriction on the
3144           appearance of non-printing characters, apart from the binary zero  that
3145           terminates  a  pattern,  but  when  a pattern is being prepared by text
3146           editing, it is usually easier  to  use  one  of  the  following  escape
3147           sequences than the binary character it represents:
3148    
3149             \a        alarm, that is, the BEL character (hex 07)
3150             \cx       "control-x", where x is any character
3151             \e        escape (hex 1B)
3152             \f        formfeed (hex 0C)
3153             \n        linefeed (hex 0A)
3154             \r        carriage return (hex 0D)
3155             \t        tab (hex 09)
3156             \ddd      character with octal code ddd, or backreference
3157             \xhh      character with hex code hh
3158             \x{hhh..} character with hex code hhh..
3159    
3160           The  precise  effect of \cx is as follows: if x is a lower case letter,
3161           it is converted to upper case. Then bit 6 of the character (hex 40)  is
3162           inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
3163           becomes hex 7B.
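
The arithmetic behind \cx is small enough to check by hand. The Python
sketch below follows the two steps just described; the function name
control_escape is hypothetical.

```python
def control_escape(x):
    """Return the character code produced by a control escape on x."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are upper-cased first
    return code ^ 0x40          # then bit 6 (hex 40) is inverted
```

For "z" this gives hex 1A, for "{" (hex 7B) it gives hex 3B, and for
";" (hex 3B) it gives hex 7B, matching the three cases above.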
3164    
3165           After \x, from zero to two hexadecimal digits are read (letters can  be
3166           in  upper  or  lower case). Any number of hexadecimal digits may appear
3167           between \x{ and }, but the value of the character  code  must  be  less
3168           than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
3169           the maximum value in hexadecimal is 7FFFFFFF. Note that this is  bigger
3170           than the largest Unicode code point, which is 10FFFF.
3171    
3172           If  characters  other than hexadecimal digits appear between \x{ and },
3173           or if there is no terminating }, this form of escape is not recognized.
3174           Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
3175           escape, with no following digits, giving a  character  whose  value  is
3176           zero.
3177    
3178           Characters whose value is less than 256 can be defined by either of the
3179           two syntaxes for \x. There is no difference in the way  they  are  han-
3180           dled. For example, \xdc is exactly the same as \x{dc}.
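
The \x rules above can be modelled in a few lines of Python. This is a
sketch under stated assumptions rather than PCRE's implementation: the
name parse_x_escape is hypothetical, an over-large value is reported
with a Python exception, and empty braces are treated as "not
recognized", which the text does not spell out.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_x_escape(rest, utf8=False):
    """Interpret the text that follows a backslash-x.

    Returns (character value, number of characters consumed from rest).
    """
    if rest.startswith("{"):
        close = rest.find("}")
        digits = rest[1:close] if close != -1 else ""
        # Assumption: empty braces count as a malformed sequence.
        if digits and all(d in HEX_DIGITS for d in digits):
            value = int(digits, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                raise ValueError("character value too large for this mode")
            return value, close + 1
        # Malformed braces: fall through to the basic form, which finds
        # no hex digits at this point and therefore yields zero.
    count = 0
    while count < 2 and count < len(rest) and rest[count] in HEX_DIGITS:
        count += 1
    return (int(rest[:count], 16) if count else 0), count
```

Both "dc" and "{dc}" give the value hex DC, whereas "{zz}" gives zero
and consumes nothing, so the brace and what follows it remain in the
pattern as ordinary characters.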
3181    
3182           After  \0  up  to two further octal digits are read. If there are fewer
3183           than two digits, just  those  that  are  present  are  used.  Thus  the
3184           sequence \0\x\07 specifies two binary zeros followed by a BEL character
3185           (code value 7). Make sure you supply two digits after the initial  zero
3186           if the pattern character that follows is itself an octal digit.
3187    
3188           The handling of a backslash followed by a digit other than 0 is compli-
3189           cated.  Outside a character class, PCRE reads it and any following dig-
3190           its  as  a  decimal  number. If the number is less than 10, or if there
3191           have been at least that many previous capturing left parentheses in the
3192           expression,  the  entire  sequence  is  taken  as  a  back reference. A
3193           description of how this works is given later, following the  discussion
3194           of parenthesized subpatterns.
3195    
3196           Inside  a  character  class, or if the decimal number is greater than 9
3197           and there have not been that many capturing subpatterns, PCRE  re-reads
3198           up to three octal digits following the backslash, and uses them to gen-
3199           erate a data character. Any subsequent digits stand for themselves.  In
3200           non-UTF-8  mode,  the  value  of a character specified in octal must be
3201           less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
3202           example:
3203    
3204             \040   is another way of writing a space
3205             \40    is the same, provided there are fewer than 40
3206                       previous capturing subpatterns
3207             \7     is always a back reference
3208             \11    might be a back reference, or another way of
3209                       writing a tab
3210             \011   is always a tab
3211             \0113  is a tab followed by the character "3"
3212             \113   might be a back reference, otherwise the
3213                       character with octal code 113
3214             \377   might be a back reference, otherwise
3215                       the byte consisting entirely of 1 bits
3216             \81    is either a back reference, or a binary zero
3217                       followed by the two characters "8" and "1"
3218    
3219           Note  that  octal  values of 100 or greater must not be introduced by a
3220           leading zero, because no more than three octal digits are ever read.
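
The decision between a back reference and an octal character can be
sketched as follows. The Python function interpret_digit_escape is a
hypothetical model of the rules above: it receives the run of decimal
digits that follows the backslash and the number of capturing left
parentheses seen so far. It does not model the mode-dependent limits on
the octal value, nor the error that arises when a back reference names
a group that does not exist.

```python
OCTAL_DIGITS = "01234567"


def interpret_digit_escape(digits, groups_so_far, in_class=False):
    """Classify a backslash followed by the given run of digits.

    Returns ("backref", n) or ("char", value, leftover_digits).
    """
    if not in_class and digits[0] != "0":
        number = int(digits)                 # read as a decimal number
        if number < 10 or number <= groups_so_far:
            return ("backref", number)
    # Otherwise re-read up to three octal digits as a data character.
    count = 0
    while count < min(3, len(digits)) and digits[count] in OCTAL_DIGITS:
        count += 1
    value = int(digits[:count], 8) if count else 0
    return ("char", value, digits[count:])   # leftovers stand for themselves
```

Under this model "0113" is a tab followed by a literal "3", and "81"
with fewer than 81 groups is a binary zero followed by "8" and "1", in
agreement with the table.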
3221    
3222           All the sequences that define a single character value can be used both
3223           inside  and  outside character classes. In addition, inside a character
3224           class, the sequence \b is interpreted as the backspace  character  (hex
3225           08),  and the sequences \R and \X are interpreted as the characters "R"
3226           and "X", respectively. Outside a character class, these sequences  have
3227           different meanings (see below).
3228    
3229       Absolute and relative back references
3230    
3231           The  sequence  \g followed by an unsigned or a negative number, option-
3232           ally enclosed in braces, is an absolute or relative back  reference.  A
3233           named back reference can be coded as \g{name}. Back references are dis-
3234           cussed later, following the discussion of parenthesized subpatterns.
3235    
3236       Absolute and relative subroutine calls
3237    
3238           For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
3239           name or a number enclosed either in angle brackets or single quotes, is
3240           an alternative syntax for referencing a subpattern as  a  "subroutine".
3241           Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and
3242           \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
3243           reference; the latter is a subroutine call.
3244    
3245       Generic character types
3246    
3247           Another use of backslash is for specifying generic character types. The
3248           following are always recognized:
3249    
3250             \d     any decimal digit
3251             \D     any character that is not a decimal digit
3252             \h     any horizontal whitespace character
3253             \H     any character that is not a horizontal whitespace character
3254             \s     any whitespace character
3255             \S     any character that is not a whitespace character
3256             \v     any vertical whitespace character
3257             \V     any character that is not a vertical whitespace character
3258             \w     any "word" character
3259             \W     any "non-word" character
3260    
3261           Each pair of escape sequences partitions the complete set of characters
3262           into  two disjoint sets. Any given character matches one, and only one,
3263           of each pair.
3264    
3265           These character type sequences can appear both inside and outside char-
3266           acter  classes.  They each match one character of the appropriate type.
3267           If the current matching point is at the end of the subject string,  all
3268           of them fail, since there is no character to match.
3269    
3270           For  compatibility  with Perl, \s does not match the VT character (code
3271           11).  This makes it different from the POSIX "space" class.  The  \s
3272           characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
3273           "use locale;" is included in a Perl script, \s may match the VT charac-
3274           ter. In PCRE, it never does.
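
The difference from the POSIX "space" class is a single character, as
this Python sketch of the two sets shows. The names PCRE_SPACE,
POSIX_SPACE, and pcre_backslash_s are hypothetical, and the sets cover
only the default (non-locale) case listed above.

```python
PCRE_SPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}   # HT, LF, FF, CR, space
POSIX_SPACE = PCRE_SPACE | {0x0B}             # POSIX "space" also has VT


def pcre_backslash_s(ch):
    """Return True if ch is one of the five characters listed for \\s."""
    return ord(ch) in PCRE_SPACE
```

Note that Python's own \s does match VT, so it cannot be used as a
stand-in for the PCRE behaviour described here.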
3275    
3276           In  UTF-8 mode, characters with values greater than 128 never match \d,
3277           \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
3278           code  character  property  support is available. These sequences retain
3279           their original meanings from before UTF-8 support was available, mainly
3280           for  efficiency  reasons. Note that this also affects \b, because it is
3281           defined in terms of \w and \W.
3282    
3283           The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
3284           the  other  sequences, these do match certain high-valued codepoints in
3285           UTF-8 mode.  The horizontal space characters are:
3286    
3287             U+0009     Horizontal tab
3288             U+0020     Space
3289             U+00A0     Non-break space
3290             U+1680     Ogham space mark
3291             U+180E     Mongolian vowel separator
3292             U+2000     En quad
3293             U+2001     Em quad
3294             U+2002     En space
3295             U+2003     Em space
3296             U+2004     Three-per-em space
3297             U+2005     Four-per-em space
3298             U+2006     Six-per-em space
3299             U+2007     Figure space
3300             U+2008     Punctuation space
3301             U+2009     Thin space
3302             U+200A     Hair space
3303             U+202F     Narrow no-break space
3304             U+205F     Medium mathematical space
3305             U+3000     Ideographic space
3306    
3307           The vertical space characters are:
3308    
3309             U+000A     Linefeed
3310             U+000B     Vertical tab
3311             U+000C     Formfeed
3312             U+000D     Carriage return
3313             U+0085     Next line
3314             U+2028     Line separator
3315             U+2029     Paragraph separator
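
The two tables can be written down compactly as sets. The Python sketch
below is only a transcription of the lists above; the names
HORIZONTAL_SPACE, VERTICAL_SPACE, and classify_space are hypothetical.

```python
HORIZONTAL_SPACE = (
    {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E}
    | set(range(0x2000, 0x200B))        # U+2000 to U+200A inclusive
    | {0x202F, 0x205F, 0x3000}
)
VERTICAL_SPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def classify_space(ch):
    """Return "h", "v", or None for a single character."""
    code = ord(ch)
    if code in HORIZONTAL_SPACE:
        return "h"      # would match \h (and so not \H)
    if code in VERTICAL_SPACE:
        return "v"      # would match \v (and so not \V)
    return None
```

The two sets are disjoint, and VT (U+000B) is a vertical space even
though it is not matched by \s.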
3316    
3317           A "word" character is an underscore or any character less than 256 that
3318           is  a  letter  or  digit.  The definition of letters and digits is con-
3319           trolled by PCRE's low-valued character tables, and may vary if  locale-
3320           specific  matching is taking place (see "Locale support" in the pcreapi
3321           page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3322           systems,  or "french" in Windows, some character codes greater than 128
3323           are used for accented letters, and these are matched by \w. The use  of
3324           locales with Unicode is discouraged.
3325    
3326       Newline sequences
3327    
3328           Outside  a  character class, by default, the escape sequence \R matches
3329           any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
3330           mode \R is equivalent to the following:
3331    
3332             (?>\r\n|\n|\x0b|\f|\r|\x85)
3333    
3334           This  is  an  example  of an "atomic group", details of which are given
3335           below.  This particular group matches either the two-character sequence
3336           CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
3337           U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3338           return, U+000D), or NEL (next line, U+0085). The two-character sequence
3339           is treated as a single unit that cannot be split.
3340    
3341           In UTF-8 mode, two additional characters whose codepoints  are  greater
3342           than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3343           rator, U+2029).  Unicode character property support is not  needed  for
3344           these characters to be recognized.
3345    
3346           It is possible to restrict \R to match only CR, LF, or CRLF (instead of
3347           the complete set  of  Unicode  line  endings)  by  setting  the  option
3348           PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
3349           (BSR is an abbreviation for "backslash R".) This can be made the default
3350           when  PCRE  is  built;  if this is the case, the other behaviour can be
3351           requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
3352           specify  these  settings  by  starting a pattern string with one of the
3353           following sequences:
3354    
3355             (*BSR_ANYCRLF)   CR, LF, or CRLF only
3356             (*BSR_UNICODE)   any Unicode newline sequence
3357    
3358           These override the default and the options given to pcre_compile(), but
3359           they can be overridden by options given to pcre_exec(). Note that these
3360           special settings, which are not Perl-compatible, are recognized only at
3361           the  very  start  of a pattern, and that they must be in upper case. If
3362           more than one of them is present, the last one is  used.  They  can  be
3363           combined  with  a  change of newline convention, for example, a pattern
3364           can start with:
3365    
3366             (*ANY)(*BSR_ANYCRLF)
3367    
3368           Inside a character class, \R matches the letter "R".
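
What \R matches outside a character class can be summarized by a small
Python model. The function match_backslash_r and its keyword arguments
are hypothetical: utf8 stands for UTF-8 mode and anycrlf for the
PCRE_BSR_ANYCRLF setting. It is a sketch of the rules in this section,
not of PCRE's matching code.

```python
def match_backslash_r(subject, pos=0, utf8=False, anycrlf=False):
    """Return the length of the newline sequence at subject[pos:], or 0."""
    if subject.startswith("\r\n", pos):
        return 2                    # CRLF is one unit and is never split
    if pos >= len(subject):
        return 0
    single = {"\n", "\r"}
    if not anycrlf:
        single |= {"\x0b", "\x0c", "\x85"}      # VT, FF, NEL
        if utf8:
            single |= {"\u2028", "\u2029"}      # LS, PS
    return 1 if subject[pos] in single else 0
```

Returning 2 for CRLF without ever offering a length of 1 mirrors the
atomic group: once CR followed by LF has been matched, the match cannot
be retried with CR alone.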
3369