
Diff of /code/trunk/doc/pcrepattern.3


revision 491 by ph10, Mon Mar 1 17:45:08 2010 UTC revision 513 by ph10, Mon May 3 11:13:37 2010 UTC
# Line 295  zero, because no more than three octal d Line 295  zero, because no more than three octal d
295  .P  .P
296  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
297  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, the
298  sequence \eb is interpreted as the backspace character (hex 08), and the  sequence \eb is interpreted as the backspace character (hex 08). The sequences
299  sequences \eR and \eX are interpreted as the characters "R" and "X",  \eB, \eR, and \eX are not special inside a character class. Like any other
300  respectively. Outside a character class, these sequences have different  unrecognized escape sequences, they are treated as the literal characters "B",
301  meanings  "R", and "X" by default, but cause an error if the PCRE_EXTRA option is set.
302    Outside a character class, these sequences have different meanings
303  .\" HTML <a href="#uniextseq">  .\" HTML <a href="#uniextseq">
304  .\" </a>  .\" </a>
305  (see below).  (see below).
# Line 478  convention, for example, a pattern can s Line 479  convention, for example, a pattern can s
479  .sp  .sp
480    (*ANY)(*BSR_ANYCRLF)    (*ANY)(*BSR_ANYCRLF)
481  .sp  .sp
482  Inside a character class, \eR matches the letter "R".  Inside a character class, \eR is treated as an unrecognized escape sequence,
483    and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is
484    set.
485  .  .
486  .  .
487  .\" HTML <a name="uniextseq"></a>  .\" HTML <a name="uniextseq"></a>
# Line 737  For example, when the pattern Line 740  For example, when the pattern
740    (foo)\eKbar    (foo)\eKbar
741  .sp  .sp
742  matches "foobar", the first substring is still set to "foo".  matches "foobar", the first substring is still set to "foo".
743    .P
744    Perl documents that the use of \eK within assertions is "not well defined". In
745    PCRE, \eK is acted upon when it occurs inside positive assertions, but is
746    ignored in negative assertions.
747  .  .
748  .  .
749  .\" HTML <a name="smallassertions"></a>  .\" HTML <a name="smallassertions"></a>
# Line 761  The backslashed assertions are: Line 768  The backslashed assertions are:
768    \ez     matches only at the end of the subject    \ez     matches only at the end of the subject
769    \eG     matches at the first matching position in the subject    \eG     matches at the first matching position in the subject
770  .sp  .sp
771  These assertions may not appear in character classes (but note that \eb has a  Inside a character class, \eb has a different meaning; it matches the backspace
772  different meaning, namely the backspace character, inside a character class).  character. If any other of these assertions appears in a character class, by
773    default it matches the corresponding literal character (for example, \eB
774    matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
775    escape sequence" error is generated instead.
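The two meanings of the backslash-b sequence can be illustrated with a short sketch. It uses Python's standard re module rather than PCRE, purely as an assumption to keep the example self-contained; Python's re happens to share this convention for \b, but it is stricter than PCRE about other unrecognized escapes in a class, so only the \b case is shown.

```python
import re

# Outside a character class, \b is a zero-width word-boundary assertion.
word = re.compile(r"\bcat\b")
boundary_hit = word.search("a cat sat") is not None     # "cat" stands alone
boundary_miss = word.search("concatenate") is None      # "cat" is inside a word

# Inside a character class, \b stands for the backspace character (hex 08).
backspace_class = re.compile(r"[\b]")
class_hit = backspace_class.fullmatch("\x08") is not None
class_miss = backspace_class.search("a cat sat") is None  # no backspace present

print(boundary_hit, boundary_miss, class_hit, class_miss)
```

The same pattern text therefore selects either a position or a literal control character, depending only on whether it sits inside square brackets.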
776  .P  .P
777  A word boundary is a position in the subject string where the current character  A word boundary is a position in the subject string where the current character
778  and the previous character do not both match \ew or \eW (i.e. one matches  and the previous character do not both match \ew or \eW (i.e. one matches
# Line 2314  description of the interface to the call Line 2324  description of the interface to the call
2324  documentation.  documentation.
2325  .  .
2326  .  .
2327    .\" HTML <a name="backtrackcontrol"></a>
2328  .SH "BACKTRACKING CONTROL"  .SH "BACKTRACKING CONTROL"
2329  .rs  .rs
2330  .sp  .sp
# Line 2335  it does not extend to the surrounding pa Line 2346  it does not extend to the surrounding pa
2346  processed as anchored at the point where they are tested.  processed as anchored at the point where they are tested.
2347  .P  .P
2348  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2349  parenthesis followed by an asterisk. In Perl, they are generally of the form  parenthesis followed by an asterisk. They are generally of the form
2350  (*VERB:ARG) but PCRE does not support the use of arguments, so its general  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2351  form is just (*VERB). Any number of these verbs may occur in a pattern. There  depending on whether or not an argument is present. A name is a sequence of
2352  are two kinds:  letters, digits, and underscores. If the name is empty, that is, if the closing
2353    parenthesis immediately follows the colon, the effect is as if the colon were
2354    not there. Any number of these verbs may occur in a pattern.
2355    .P
2356    PCRE contains some optimizations that are used to speed up matching by running
2357    some checks at the start of each match attempt. For example, it may know the
2358    minimum length of a matching subject, or that a particular character must be
2359    present. When one of these optimizations suppresses the running of a match, any
2360    included backtracking verbs will not, of course, be processed. You can suppress
2361    the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
2362    when calling \fBpcre_exec()\fP.
2363    .
2364  .  .
2365  .SS "Verbs that act immediately"  .SS "Verbs that act immediately"
2366  .rs  .rs
2367  .sp  .sp
2368  The following verbs act as soon as they are encountered:  The following verbs act as soon as they are encountered. They may not be
2369    followed by a name.
2370  .sp  .sp
2371     (*ACCEPT)     (*ACCEPT)
2372  .sp  .sp
# Line 2370  callout feature, as for example in this Line 2393  callout feature, as for example in this
2393  A match with the string "aaaa" always fails, but the callout is taken before  A match with the string "aaaa" always fails, but the callout is taken before
2394  each backtrack happens (in this example, 10 times).  each backtrack happens (in this example, 10 times).
2395  .  .
2396    .
2397    .SS "Recording which path was taken"
2398    .rs
2399    .sp
2400    There is one verb whose main purpose is to track how a match was arrived at,
2401    though it also has a secondary use in conjunction with advancing the match
2402    starting point (see (*SKIP) below).
2403    .sp
2404      (*MARK:NAME) or (*:NAME)
2405    .sp
2406    A name is always required with this verb. There may be as many instances of
2407    (*MARK) as you like in a pattern, and their names do not have to be unique.
2408    .P
2409    When a match succeeds, the name of the last-encountered (*MARK) is passed back
2410    to the caller via the \fIpcre_extra\fP data structure, as described in the
2411    .\" HTML <a href="pcreapi.html#extradata">
2412    .\" </a>
2413    section on \fIpcre_extra\fP
2414    .\"
2415    in the
2416    .\" HREF
2417    \fBpcreapi\fP
2418    .\"
2419    documentation. No data is returned for a partial match. Here is an example of
2420    \fBpcretest\fP output, where the /K modifier requests the retrieval and
2421    outputting of (*MARK) data:
2422    .sp
2423      /X(*MARK:A)Y|X(*MARK:B)Z/K
2424      XY
2425       0: XY
2426      MK: A
2427      XZ
2428       0: XZ
2429      MK: B
2430    .sp
2431    The (*MARK) name is tagged with "MK:" in this output, and in this example it
2432    indicates which of the two alternatives matched. This is a more efficient way
2433    of obtaining this information than putting each alternative in its own
2434    capturing parentheses.
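For comparison, the capturing-parentheses approach mentioned above might look like the following sketch. It is written with Python's standard re module, which has no (*MARK) verb, so it shows only the alternative that (*MARK) is said to improve on. The group names A and B and the helper which_alternative are hypothetical, chosen to mirror the mark names in the pcretest example.

```python
import re

# One named capturing group per alternative. The names A and B are
# illustrative; they mirror the (*MARK) names used in the example above.
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")

def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = pattern.fullmatch(subject)
    # lastgroup is the name of the last capturing group that matched.
    return m.lastgroup if m else None

print(which_alternative("XY"), which_alternative("XZ"), which_alternative("XP"))
```

Each alternative here costs a capture group that the matcher has to maintain, which is the overhead that a mark avoids.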
2435    .P
2436    A name may also be returned after a failed match if the final path through the
2437    pattern involves (*MARK). However, unless (*MARK) is used in conjunction with
2438    (*COMMIT), this is unlikely to happen for an unanchored pattern because, as the
2439    starting point for matching is advanced, the final check is often with an empty
2440    string, causing a failure before (*MARK) is reached. For example:
2441    .sp
2442      /X(*MARK:A)Y|X(*MARK:B)Z/K
2443      XP
2444      No match
2445    .sp
2446    There are three potential starting points for this match (starting with X,
2447    starting with P, and with an empty string). If the pattern is anchored, the
2448    result is different:
2449    .sp
2450      /^X(*MARK:A)Y|^X(*MARK:B)Z/K
2451      XP
2452      No match, mark = B
2453    .sp
2454    PCRE's start-of-match optimizations can also interfere with this. For example,
2455    if, as a result of a call to \fBpcre_study()\fP, it knows the minimum
2456    subject length for a match, a shorter subject will not be scanned at all.
2457    .P
2458    Note that similar anomalies (though different in detail) exist in Perl, no
2459    doubt for the same reasons. The use of (*MARK) data after a failed match of an
2460    unanchored pattern is not recommended, unless (*COMMIT) is involved.
2461    .
2462    .
2463  .SS "Verbs that act after backtracking"  .SS "Verbs that act after backtracking"
2464  .rs  .rs
2465  .sp  .sp
2466  The following verbs do nothing when they are encountered. Matching continues  The following verbs do nothing when they are encountered. Matching continues
2467  with what follows, but if there is no subsequent match, a failure is forced.  with what follows, but if there is no subsequent match, causing a backtrack to
2468  The verbs differ in exactly what kind of failure occurs.  the verb, a failure is forced. That is, backtracking cannot pass to the left of
2469    the verb. However, when one of these verbs appears inside an atomic group, its
2470    effect is confined to that group, because once the group has been matched,
2471    there is never any backtracking into it. In this situation, backtracking can
2472    "jump back" to the left of the entire atomic group. (Remember, as stated
2473    above, that this localization also applies in subroutine calls and assertions.)
2474    .P
2475    These verbs differ in exactly what kind of failure occurs when backtracking
2476    reaches them.
2477  .sp  .sp
2478    (*COMMIT)    (*COMMIT)
2479  .sp  .sp
2480  This verb causes the whole match to fail outright if the rest of the pattern  This verb, which may not be followed by a name, causes the whole match to fail
2481  does not match. Even if the pattern is unanchored, no further attempts to find  outright if the rest of the pattern does not match. Even if the pattern is
2482  a match by advancing the starting point take place. Once (*COMMIT) has been  unanchored, no further attempts to find a match by advancing the starting point
2483  passed, \fBpcre_exec()\fP is committed to finding a match at the current  take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to
2484  starting point, or not at all. For example:  finding a match at the current starting point, or not at all. For example:
2485  .sp  .sp
2486    a+(*COMMIT)b    a+(*COMMIT)b
2487  .sp  .sp
2488  This matches "xxaab" but not "aacaab". It can be thought of as a kind of  This matches "xxaab" but not "aacaab". It can be thought of as a kind of
2489  dynamic anchor, or "I've started, so I must finish."  dynamic anchor, or "I've started, so I must finish." The name of the most
2490  .sp  recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
2491    (*PRUNE)  match failure.
2492  .sp  .P
2493  This verb causes the match to fail at the current position if the rest of the  Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
2494  pattern does not match. If the pattern is unanchored, the normal "bumpalong"  unless PCRE's start-of-match optimizations are turned off, as shown in this
2495  advance to the next starting character then happens. Backtracking can occur as  \fBpcretest\fP example:
2496  usual to the left of (*PRUNE), or when matching to the right of (*PRUNE), but  .sp
2497  if there is no match to the right, backtracking cannot cross (*PRUNE).    /(*COMMIT)abc/
2498  In simple cases, the use of (*PRUNE) is just an alternative to an atomic    xyzabc
2499  group or possessive quantifier, but there are some uses of (*PRUNE) that cannot     0: abc
2500  be expressed in any other way.    xyzabc\eY
2501      No match
2502    .sp
2503    PCRE knows that any match must start with "a", so the optimization skips along
2504    the subject to "a" before running the first match attempt, which succeeds. When
2505    the optimization is disabled by the \eY escape in the second subject, the match
2506    starts at "x" and so the (*COMMIT) causes it to fail without trying any other
2507    starting points.
2508    .sp
2509      (*PRUNE) or (*PRUNE:NAME)
2510    .sp
2511    This verb causes the match to fail at the current starting position in the
2512    subject if the rest of the pattern does not match. If the pattern is
2513    unanchored, the normal "bumpalong" advance to the next starting character then
2514    happens. Backtracking can occur as usual to the left of (*PRUNE), before it is
2515    reached, or when matching to the right of (*PRUNE), but if there is no match to
2516    the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2517    (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
2518    but there are some uses of (*PRUNE) that cannot be expressed in any other way.
2519    The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
2520    match fails completely; the name is passed back if this is the final attempt.
2521    (*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
2522    pattern (*PRUNE) has the same effect as (*COMMIT).
2523  .sp  .sp
2524    (*SKIP)    (*SKIP)
2525  .sp  .sp
2526  This verb is like (*PRUNE), except that if the pattern is unanchored, the  This verb, when given without a name, is like (*PRUNE), except that if the
2527  "bumpalong" advance is not to the next character, but to the position in the  pattern is unanchored, the "bumpalong" advance is not to the next character,
2528  subject where (*SKIP) was encountered. (*SKIP) signifies that whatever text  but to the position in the subject where (*SKIP) was encountered. (*SKIP)
2529  was matched leading up to it cannot be part of a successful match. Consider:  signifies that whatever text was matched leading up to it cannot be part of a
2530    successful match. Consider:
2531  .sp  .sp
2532    a+(*SKIP)b    a+(*SKIP)b
2533  .sp  .sp
# Line 2417  effect as this example; although it woul Line 2538  effect as this example; although it woul
2538  first match attempt, the second attempt would start at the second character  first match attempt, the second attempt would start at the second character
2539  instead of skipping on to "c".  instead of skipping on to "c".
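The difference between the three verbs in how the starting point advances can be sketched with a toy simulation. This is not PCRE code: it is a hand-written Python model of the single pattern shape a+(*VERB)b, the function name simulate is hypothetical, and it ignores PCRE's start-of-match optimizations, which could skip some of the attempts listed.

```python
def simulate(verb, subject):
    """Model a+(*VERB)b for VERB in COMMIT, PRUNE or SKIP.

    Returns (span, starts): span is (start, end) of the match or None,
    and starts lists every starting position that was attempted.
    """
    if verb not in ("COMMIT", "PRUNE", "SKIP"):
        raise ValueError("unsupported verb: " + verb)
    starts = []
    start = 0
    while start < len(subject):
        starts.append(start)
        pos = start
        while pos < len(subject) and subject[pos] == "a":
            pos += 1                      # greedy a+
        if pos == start:
            start += 1                    # a+ failed before the verb: normal bumpalong
            continue
        # The verb has now been passed at subject position pos.
        if pos < len(subject) and subject[pos] == "b":
            return (start, pos + 1), starts
        # "b" failed, so backtracking reaches the verb and a failure is forced.
        if verb == "COMMIT":
            return None, starts           # no further starting points are tried
        if verb == "PRUNE":
            start += 1                    # bumpalong by one character
        else:
            start = pos                   # SKIP: resume where the verb was met
    return None, starts

for verb in ("COMMIT", "PRUNE", "SKIP"):
    print(verb, simulate(verb, "aaaac"))
```

On the subject "aaaac" the model tries only position 0 for COMMIT, positions 0 to 4 for PRUNE, and positions 0 and 4 for SKIP, which is the saving that (*SKIP) is intended to provide.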
2540  .sp  .sp
2541    (*THEN)    (*SKIP:NAME)
2542    .sp
2543    When (*SKIP) has an associated name, its behaviour is modified. If the
2544    following pattern fails to match, the previous path through the pattern is
2545    searched for the most recent (*MARK) that has the same name. If one is found,
2546    the "bumpalong" advance is to the subject position that corresponds to that
2547    (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2548    matching name is found, normal "bumpalong" of one character happens (the
2549    (*SKIP) is ignored).
2550    .sp
2551      (*THEN) or (*THEN:NAME)
2552  .sp  .sp
2553  This verb causes a skip to the next alternation if the rest of the pattern does  This verb causes a skip to the next alternation if the rest of the pattern does
2554  not match. That is, it cancels pending backtracking, but only within the  not match. That is, it cancels pending backtracking, but only within the
# Line 2428  for a pattern-based if-then-else block: Line 2559  for a pattern-based if-then-else block:
2559  .sp  .sp
2560  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2561  the end of the group if FOO succeeds); on failure the matcher skips to the  the end of the group if FOO succeeds); on failure the matcher skips to the
2562  second alternative and tries COND2, without backtracking into COND1. If (*THEN)  second alternative and tries COND2, without backtracking into COND1. The
2563  is used outside of any alternation, it acts exactly like (*PRUNE).  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
2564    overall match fails. If (*THEN) is not directly inside an alternation, it acts
2565    like (*PRUNE).
2566  .  .
2567  .  .
2568  .SH "SEE ALSO"  .SH "SEE ALSO"
# Line 2453  Cambridge CB2 3QH, England. Line 2586  Cambridge CB2 3QH, England.
2586  .rs  .rs
2587  .sp  .sp
2588  .nf  .nf
2589  Last updated: 01 March 2010  Last updated: 03 May 2010
2590  Copyright (c) 1997-2010 University of Cambridge.  Copyright (c) 1997-2010 University of Cambridge.
2591  .fi  .fi

