Diff of /code/trunk/doc/html/pcrepattern.html

revision 834 by ph10, Tue Dec 6 15:38:01 2011 UTC revision 835 by ph10, Wed Dec 28 16:10:09 2011 UTC
# Line 268  one of the following escape sequences th Line 268  one of the following escape sequences th
268    \t        tab (hex 09)    \t        tab (hex 09)
269    \ddd      character with octal code ddd, or back reference    \ddd      character with octal code ddd, or back reference
270    \xhh      character with hex code hh    \xhh      character with hex code hh
271    \x{hhh..} character with hex code hhh.. (non-JavaScript mode)    \x{hhh..} character with hex code hhh..
   \uhhhh    character with hex code hhhh (JavaScript mode only)  
272  </pre>  </pre>
273  The precise effect of \cx is as follows: if x is a lower case letter, it  The precise effect of \cx is as follows: if x is a lower case letter, it
274  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 281  values are valid. A lower case letter is Line 280  values are valid. A lower case letter is
280  0xc0 bits are flipped.)  0xc0 bits are flipped.)
281  </P>  </P>
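As a rough illustration of the \cx rule quoted in this hunk (a lower case letter is converted to upper case, then bit 6, hex 40, is inverted), here is a minimal Python sketch. The function name control_escape is hypothetical and is not part of PCRE; the sketch models only the arithmetic for ASCII input and ignores the EBCDIC case.

```python
def control_escape(x: str) -> int:
    # Model the arithmetic of the \cx escape for one ASCII character x.
    code = ord(x)
    if code > 127:
        raise ValueError("this sketch only handles ASCII characters")
    # A lower case letter is first converted to upper case.
    if ord("a") <= code <= ord("z"):
        code -= 0x20
    # Then bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


if __name__ == "__main__":
    for ch in "zA{;":
        print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

Under this model \cz gives hex 1A (z is hex 7A), while \c{ gives hex 3B and \c; gives hex 7B, because only bit 6 changes for characters that are not letters.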
282  <P>  <P>
283  By default, after \x, from zero to two hexadecimal digits are read (letters  After \x, from zero to two hexadecimal digits are read (letters can be in
284  can be in upper or lower case). Any number of hexadecimal digits may appear  upper or lower case). Any number of hexadecimal digits may appear between \x{
285  between \x{ and }, but the value of the character code must be less than 256  and }, but the value of the character code must be less than 256 in non-UTF-8
286  in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum  mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
287  value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest  hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
288  Unicode code point, which is 10FFFF.  point, which is 10FFFF.
289  </P>  </P>
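The limits on \x{hhh..} stated in this paragraph can be sketched as a small range check. This is an illustrative Python model, not PCRE code: the function name hex_escape_value and the constant names are assumptions, and the sketch simply rejects malformed input rather than modelling what PCRE does when non-hexadecimal characters appear between the braces.

```python
import string

MAX_NON_UTF8 = 0xFF        # value must be less than 256 in non-UTF-8 mode
MAX_UTF8 = 0x7FFFFFFF      # value must be less than 2**31 in UTF-8 mode
MAX_UNICODE = 0x10FFFF     # largest Unicode code point, for comparison


def hex_escape_value(digits: str, utf8: bool) -> int:
    # Return the character code for \x{digits}, or raise ValueError.
    # Assumption: digits is the text between the braces.
    if not digits or any(c not in string.hexdigits for c in digits):
        raise ValueError("sketch expects one or more hexadecimal digits")
    value = int(digits, 16)
    limit = MAX_UTF8 if utf8 else MAX_NON_UTF8
    if value > limit:
        raise ValueError(f"value {value:#x} exceeds the limit {limit:#x}")
    return value


if __name__ == "__main__":
    print(hex_escape_value("dc", utf8=False))        # letters may be lower case
    print(hex_escape_value("7FFFFFFF", utf8=True))   # largest accepted value
    print(MAX_UTF8 > MAX_UNICODE)                    # limit exceeds Unicode range
```

Note that the UTF-8 limit is deliberately compared with 10FFFF here: the check accepts values that are not valid Unicode code points, which is what the paragraph points out.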
290  <P>  <P>
291  If characters other than hexadecimal digits appear between \x{ and }, or if  If characters other than hexadecimal digits appear between \x{ and }, or if
# Line 295  initial \x will be interpreted as a basi Line 294  initial \x will be interpreted as a basi
294  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
295  </P>  </P>
296  <P>  <P>
 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is  
 as just described only when it is followed by two hexadecimal digits.  
 Otherwise, it matches a literal "x" character. In JavaScript mode, support for  
 code points greater than 256 is provided by \u, which must be followed by  
 four hexadecimal digits; otherwise it matches a literal "u" character.  
 </P>  
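The PCRE_JAVASCRIPT_COMPAT rule described in the paragraph on this side of the diff can be modelled roughly as follows. This is a Python sketch under stated assumptions: js_compat_escape is a hypothetical helper, not a PCRE function, and it looks only at the start of a string that begins with \x or \u.

```python
import string
from typing import Tuple

HEX_DIGITS = set(string.hexdigits)


def js_compat_escape(text: str) -> Tuple[int, int]:
    # text is assumed to start with a backslash followed by "x" or "u".
    # Returns (character code, number of pattern characters consumed).
    kind = text[1]
    width = {"x": 2, "u": 4}[kind]   # \x needs two hex digits, \u needs four
    digits = text[2:2 + width]
    if len(digits) == width and all(c in HEX_DIGITS for c in digits):
        return int(digits, 16), 2 + width
    # Otherwise the sequence matches a literal "x" or "u" character.
    return ord(kind), 2


if __name__ == "__main__":
    print(js_compat_escape(r"\xdc"))
    print(js_compat_escape(r"\u00dc"))
    print(js_compat_escape(r"\xg1"))
```

In this model \xdc and \u00dc both yield the code hex DC, while \xg1 falls back to a literal "x" and leaves "g1" to be read as ordinary pattern text.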
 <P>  
297  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
298  syntaxes for \x (or by \u in JavaScript mode). There is no difference in the  syntaxes for \x. There is no difference in the way they are handled. For
299  way they are handled. For example, \xdc is exactly the same as \x{dc} (or  example, \xdc is exactly the same as \x{dc}.
 \u00dc in JavaScript mode).  
300  </P>  </P>
301  <P>  <P>
302  After \0 up to two further octal digits are read. If there are fewer than two  After \0 up to two further octal digits are read. If there are fewer than two
# Line 347  zero, because no more than three octal d Line 338  zero, because no more than three octal d
338  </P>  </P>
339  <P>  <P>
340  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
341  and outside character classes. In addition, inside a character class, \b is  and outside character classes. In addition, inside a character class, the
342  interpreted as the backspace character (hex 08).  sequence \b is interpreted as the backspace character (hex 08). The sequences
343  </P>  \B, \N, \R, and \X are not special inside a character class. Like any other
344  <P>  unrecognized escape sequences, they are treated as the literal characters "B",
345  \N is not allowed in a character class. \B, \R, and \X are not special  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is
346  inside a character class. Like other unrecognized escape sequences, they are  set. Outside a character class, these sequences have different meanings.
 treated as the literal characters "B", "R", and "X" by default, but cause an  
 error if the PCRE_EXTRA option is set. Outside a character class, these  
 sequences have different meanings.  
 </P>  
 <br><b>  
 Unsupported escape sequences  
 </b><br>  
 <P>  
 In Perl, the sequences \l, \L, \u, and \U are recognized by its string  
 handler and used to modify the case of following characters. By default, PCRE  
 does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT  
 option is set, \U matches a "U" character, and \u can be used to define a  
 character by code point, as described in the previous section.  
347  </P>  </P>
348  <br><b>  <br><b>
349  Absolute and relative back references  Absolute and relative back references
# Line 411  Another use of backslash is for specifyi Line 389  Another use of backslash is for specifyi
389  There is also the single sequence \N, which matches a non-newline character.  There is also the single sequence \N, which matches a non-newline character.
390  This is the same as  This is the same as
391  <a href="#fullstopdot">the "." metacharacter</a>  <a href="#fullstopdot">the "." metacharacter</a>
392  when PCRE_DOTALL is not set. Perl also uses \N to match characters by name;  when PCRE_DOTALL is not set.
 PCRE does not support this.  
393  </P>  </P>
394  <P>  <P>
395  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
# Line 986  special meaning in a character class. Line 963  special meaning in a character class.
963  <P>  <P>
964  The escape sequence \N behaves like a dot, except that it is not affected by  The escape sequence \N behaves like a dot, except that it is not affected by
965  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
966  that signifies the end of a line. Perl also uses \N to match characters by  that signifies the end of a line.
 name; PCRE does not support this.  
967  </P>  </P>
968  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
969  <P>  <P>
# Line 1003  processing unless the PCRE_NO_UTF8_CHECK Line 979  processing unless the PCRE_NO_UTF8_CHECK
979  </P>  </P>
980  <P>  <P>
981  PCRE does not allow \C to appear in lookbehind assertions  PCRE does not allow \C to appear in lookbehind assertions
982  <a href="#lookbehind">(described below)</a>  <a href="#lookbehind">(described below),</a>
983  in UTF-8 mode, because this would make it impossible to calculate the length of  because in UTF-8 mode this would make it impossible to calculate the length of
984  the lookbehind.  the lookbehind.
985  </P>  </P>
986  <P>  <P>
# Line 1950  match. If there are insufficient charact Line 1926  match. If there are insufficient charact
1926  assertion fails.  assertion fails.
1927  </P>  </P>
1928  <P>  <P>
1929  In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte,  PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
1930  even in UTF-8 mode) to appear in lookbehind assertions, because it makes it  to appear in lookbehind assertions, because it makes it impossible to calculate
1931  impossible to calculate the length of the lookbehind. The \X and \R escapes,  the length of the lookbehind. The \X and \R escapes, which can match
1932  which can match different numbers of bytes, are also not permitted.  different numbers of bytes, are also not permitted.
1933  </P>  </P>
1934  <P>  <P>
1935  <a href="#subpatternsassubroutines">"Subroutine"</a>  <a href="#subpatternsassubroutines">"Subroutine"</a>
# Line 2535  failing negative assertion, they cause a Line 2511  failing negative assertion, they cause a
2511  If any of these verbs are used in an assertion or in a subpattern that is  If any of these verbs are used in an assertion or in a subpattern that is
2512  called as a subroutine (whether or not recursively), their effect is confined  called as a subroutine (whether or not recursively), their effect is confined
2513  to that subpattern; it does not extend to the surrounding pattern, with one  to that subpattern; it does not extend to the surrounding pattern, with one
2514  exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in  exception: a (*MARK) that is encountered in a positive assertion <i>is</i> passed
2515  a successful positive assertion <i>is</i> passed back when a match succeeds  back (compare capturing parentheses in assertions). Note that such subpatterns
2516  (compare capturing parentheses in assertions). Note that such subpatterns are  are processed as anchored at the point where they are tested. Note also that
2517  processed as anchored at the point where they are tested. Note also that Perl's  Perl's treatment of subroutines is different in some cases.
 treatment of subroutines is different in some cases.  
2518  </P>  </P>
2519  <P>  <P>
2520  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
# Line 2561  the start-of-match optimizations by sett Line 2536  the start-of-match optimizations by sett
2536  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
2537  pattern with (*NO_START_OPT).  pattern with (*NO_START_OPT).
2538  </P>  </P>
 <P>  
 Experiments with Perl suggest that it too has similar optimizations, sometimes  
 leading to anomalous results.  
 </P>  
2539  <br><b>  <br><b>
2540  Verbs that act immediately  Verbs that act immediately
2541  </b><br>  </b><br>
# Line 2612  A name is always required with this verb Line 2583  A name is always required with this verb
2583  (*MARK) as you like in a pattern, and their names do not have to be unique.  (*MARK) as you like in a pattern, and their names do not have to be unique.
2584  </P>  </P>
2585  <P>  <P>
2586  When a match succeeds, the name of the last-encountered (*MARK) on the matching  When a match succeeds, the name of the last-encountered (*MARK) is passed back
2587  path is passed back to the caller via the <i>pcre_extra</i> data structure, as  to the caller via the <i>pcre_extra</i> data structure, as described in the
 described in the  
2588  <a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>  <a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>
2589  in the  in the
2590  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
2591  documentation. Here is an example of <b>pcretest</b> output, where the /K  documentation. No data is returned for a partial match. Here is an example of
2592  modifier requests the retrieval and outputting of (*MARK) data:  <b>pcretest</b> output, where the /K modifier requests the retrieval and
2593    outputting of (*MARK) data:
2594  <pre>  <pre>
2595      re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K    /X(*MARK:A)Y|X(*MARK:B)Z/K
2596    data&#62; XY    XY
2597     0: XY     0: XY
2598    MK: A    MK: A
2599    XZ    XZ
# Line 2640  passed back if it is the last-encountere Line 2611  passed back if it is the last-encountere
2611  assertions.  assertions.
2612  </P>  </P>
2613  <P>  <P>
2614  After a partial match or a failed match, the name of the last encountered  A name may also be returned after a failed match if the final path through the
2615  (*MARK) in the entire match process is returned. For example:  pattern involves (*MARK). However, unless (*MARK) is used in conjunction with
2616    (*COMMIT), this is unlikely to happen for an unanchored pattern because, as the
2617    starting point for matching is advanced, the final check is often with an empty
2618    string, causing a failure before (*MARK) is reached. For example:
2619  <pre>  <pre>
2620      re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K    /X(*MARK:A)Y|X(*MARK:B)Z/K
2621    data&#62; XP    XP
2622      No match
2623    </pre>
2624    There are three potential starting points for this match (starting with X,
2625    starting with P, and with an empty string). If the pattern is anchored, the
2626    result is different:
2627    <pre>
2628      /^X(*MARK:A)Y|^X(*MARK:B)Z/K
2629      XP
2630    No match, mark = B    No match, mark = B
2631  </pre>  </pre>
2632  Note that in this unanchored example the mark is retained from the match  PCRE's start-of-match optimizations can also interfere with this. For example,
2633  attempt that started at the letter "X". Subsequent match attempts starting at  if, as a result of a call to <b>pcre_study()</b>, it knows the minimum
2634  "P" and then with an empty string do not get as far as the (*MARK) item, but  subject length for a match, a shorter subject will not be scanned at all.
2635  nevertheless do not reset it.  </P>
2636    <P>
2637    Note that similar anomalies (though different in detail) exist in Perl, no
2638    doubt for the same reasons. The use of (*MARK) data after a failed match of an
2639    unanchored pattern is not recommended, unless (*COMMIT) is involved.
2640  </P>  </P>
2641  <br><b>  <br><b>
2642  Verbs that act after backtracking  Verbs that act after backtracking
# Line 2689  Note that (*COMMIT) at the start of a pa Line 2675  Note that (*COMMIT) at the start of a pa
2675  unless PCRE's start-of-match optimizations are turned off, as shown in this  unless PCRE's start-of-match optimizations are turned off, as shown in this
2676  <b>pcretest</b> example:  <b>pcretest</b> example:
2677  <pre>  <pre>
2678      re&#62; /(*COMMIT)abc/    /(*COMMIT)abc/
2679    data&#62; xyzabc    xyzabc
2680     0: abc     0: abc
2681    xyzabc\Y    xyzabc\Y
2682    No match    No match
# Line 2711  reached, or when matching to the right o Line 2697  reached, or when matching to the right o
2697  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2698  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
2699  but there are some uses of (*PRUNE) that cannot be expressed in any other way.  but there are some uses of (*PRUNE) that cannot be expressed in any other way.
2700  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
2701  anchored pattern (*PRUNE) has the same effect as (*COMMIT).  match fails completely; the name is passed back if this is the final attempt.
2702    (*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
2703    pattern (*PRUNE) has the same effect as (*COMMIT).
2704  <pre>  <pre>
2705    (*SKIP)    (*SKIP)
2706  </pre>  </pre>
# Line 2738  following pattern fails to match, the pr Line 2726  following pattern fails to match, the pr
2726  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2727  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2728  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2729  matching name is found, the (*SKIP) is ignored.  matching name is found, normal "bumpalong" of one character happens (that is,
2730    the (*SKIP) is ignored).
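The choice of the next starting position after a failing (*SKIP:NAME), as described in this hunk, can be sketched like this. The Python below is a simplified model, not PCRE's implementation: bumpalong_after_skip and its parameters are hypothetical, the list of marks is assumed to hold the (*MARK) items seen on the failing path in the order they were encountered, and edge cases are not modelled.

```python
from typing import List, Tuple


def bumpalong_after_skip(start: int, skip_name: str,
                         marks: List[Tuple[str, int]]) -> int:
    # start: subject offset where the failed match attempt began.
    # skip_name: the NAME given in (*SKIP:NAME).
    # marks: (name, subject position) pairs for each (*MARK) encountered.
    # Search for the most recent (*MARK) that has the same name.
    for name, position in reversed(marks):
        if name == skip_name:
            # Advance to the subject position that corresponds to that mark.
            return position
    # No matching name: the (*SKIP) is ignored and the normal
    # "bumpalong" of one character happens.
    return start + 1


if __name__ == "__main__":
    seen = [("A", 3), ("B", 5), ("A", 7)]
    print(bumpalong_after_skip(0, "A", seen))
    print(bumpalong_after_skip(0, "C", seen))
```

With the sample data the first call picks the later of the two marks named A, and the second call falls back to advancing by one character.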
2731  <pre>  <pre>
2732    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2733  </pre>  </pre>
# Line 2752  be used for a pattern-based if-then-else Line 2741  be used for a pattern-based if-then-else
2741  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2742  the end of the group if FOO succeeds); on failure, the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2743  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2744  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
2745  If (*THEN) is not inside an alternation, it acts like (*PRUNE).  overall match fails. If (*THEN) is not inside an alternation, it acts like
2746    (*PRUNE).
2747  </P>  </P>
2748  <P>  <P>
2749  Note that a subpattern that does not contain a | character is just a part of  Note that a subpattern that does not contain a | character is just a part of
# Line 2829  Cambridge CB2 3QH, England. Line 2819  Cambridge CB2 3QH, England.
2819  </P>  </P>
2820  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2821  <P>  <P>
2822  Last updated: 29 November 2011  Last updated: 19 October 2011
2823  <br>  <br>
2824  Copyright &copy; 1997-2011 University of Cambridge.  Copyright &copy; 1997-2011 University of Cambridge.
2825  <br>  <br>

