--- code/trunk/doc/pcrepattern.3 2007/08/08 14:24:50 210 +++ code/trunk/doc/pcrepattern.3 2007/08/09 09:52:43 211 @@ -168,11 +168,14 @@ After \ex, from zero to two hexadecimal digits are read (letters can be in upper or lower case). Any number of hexadecimal digits may appear between \ex{ and }, but the value of the character code must be less than 256 in non-UTF-8 -mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value -is 7FFFFFFF). If characters other than hexadecimal digits appear between \ex{ -and }, or if there is no terminating }, this form of escape is not recognized. -Instead, the initial \ex will be interpreted as a basic hexadecimal escape, -with no following digits, giving a character whose value is zero. +mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in +hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code +point, which is 10FFFF. +.P +If characters other than hexadecimal digits appear between \ex{ and }, or if +there is no terminating }, this form of escape is not recognized. Instead, the +initial \ex will be interpreted as a basic hexadecimal escape, with no +following digits, giving a character whose value is zero. .P Characters whose value is less than 256 can be defined by either of the two syntaxes for \ex. There is no difference in the way they are handled. For @@ -535,6 +538,15 @@ the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other". .P +The Cs (Surrogate) property applies only to characters in the range U+D800 to +U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so +cannot be tested by PCRE, unless UTF-8 validity checking has been turned off +(see the discussion of PCRE_NO_UTF8_CHECK in the +.\" HREF +\fBpcreapi\fP +.\" +page). +.P The long synonyms for these properties that Perl supports (such as \ep{Letter}) are not supported by PCRE, nor is it permitted to prefix any of these properties with "Is". @@ -1969,18 +1981,18 @@ .SH "BACTRACKING CONTROL" .rs .sp -Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which +Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which are described in the Perl documentation as "experimental and subject to change -or removal in a future version of Perl". It goes on to say: "Their usage in -production code should be noted to avoid problems during upgrades." The same +or removal in a future version of Perl". It goes on to say: "Their usage in +production code should be noted to avoid problems during upgrades." The same remarks apply to the PCRE features described in this section. .P -Since these verbs are specifically related to backtracking, they can be used -only when the pattern is to be matched using \fBpcre_exec()\fP, which uses a -backtracking algorithm. They cause an error if encountered by +Since these verbs are specifically related to backtracking, they can be used +only when the pattern is to be matched using \fBpcre_exec()\fP, which uses a +backtracking algorithm. They cause an error if encountered by \fBpcre_dfa_exec()\fP. .P -The new verbs make use of what was previously invalid syntax: an opening +The new verbs make use of what was previously invalid syntax: an opening parenthesis followed by an asterisk. In Perl, they are generally of the form (*VERB:ARG) but PCRE does not support the use of arguments, so its general form is just (*VERB). Any number of these verbs may occur in a pattern. There @@ -1994,19 +2006,19 @@ (*ACCEPT) .sp This verb causes the match to end successfully, skipping the remainder of the -pattern. When inside a recursion, only the innermost pattern is ended -immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside +pattern. When inside a recursion, only the innermost pattern is ended +immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside capturing parentheses. In Perl, the data so far is captured: in PCRE no data is captured. For example: .sp A(A|B(*ACCEPT)|C)D .sp -This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is +This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is captured. .sp (*FAIL) or (*F) .sp -This verb causes the match to fail, forcing backtracking to occur. It is +This verb causes the match to fail, forcing backtracking to occur. It is equivalent to (?!) but easier to read. The Perl documentation notes that it is probably useful only when combined with (?{}) or (??{}). Those are, of course, Perl features that are not present in PCRE. The nearest equivalent is the @@ -2014,58 +2026,58 @@ .sp a+(?C)(*FAIL) .sp -A match with the string "aaaa" always fails, but the callout is taken before -each backtrack happens (in this example, 10 times). +A match with the string "aaaa" always fails, but the callout is taken before +each backtrack happens (in this example, 10 times). . .SS "Verbs that act after backtracking" .rs .sp -The following verbs do nothing when they are encountered. Matching continues -with what follows, but if there is no subsequent match, a failure is forced. +The following verbs do nothing when they are encountered. Matching continues +with what follows, but if there is no subsequent match, a failure is forced. The verbs differ in exactly what kind of failure occurs. .sp (*COMMIT) .sp -This verb causes the whole match to fail outright if the rest of the pattern +This verb causes the whole match to fail outright if the rest of the pattern does not match. Even if the pattern is unanchored, no further attempts to find a match by advancing the start point take place. Once (*COMMIT) has been -passed, \fBpcre_exec()\fP is committed to finding a match at the current +passed, \fBpcre_exec()\fP is committed to finding a match at the current starting point, or not at all. For example: .sp a+(*COMMIT)b .sp -This matches "xxaab" but not "aacaab". It can be thought of as a kind of +This matches "xxaab" but not "aacaab". It can be thought of as a kind of dynamic anchor, or "I've started, so I must finish." .sp (*PRUNE) .sp -This verb causes the match to fail at the current position if the rest of the -pattern does not match. If the pattern is unanchored, the normal "bumpalong" +This verb causes the match to fail at the current position if the rest of the +pattern does not match. If the pattern is unanchored, the normal "bumpalong" advance to the next starting character then happens. Backtracking can occur as usual to the left of (*PRUNE), or when matching to the right of (*PRUNE), but if there is no match to the right, backtracking cannot cross (*PRUNE). -In simple cases, the use of (*PRUNE) is just an alternative to an atomic +In simple cases, the use of (*PRUNE) is just an alternative to an atomic group or possessive quantifier, but there are some uses of (*PRUNE) that cannot be expressed in any other way. .sp (*SKIP) .sp -This verb is like (*PRUNE), except that if the pattern is unanchored, the +This verb is like (*PRUNE), except that if the pattern is unanchored, the "bumpalong" advance is not to the next character, but to the position in the subject where (*SKIP) was encountered. (*SKIP) signifies that whatever text was matched leading up to it cannot be part of a successful match. Consider: .sp a+(*SKIP)b .sp -If the subject is "aaaac...", after the first match attempt fails (starting at +If the subject is "aaaac...", after the first match attempt fails (starting at the first character in the string), the starting point skips on to start the -next attempt at "c". Note that a possessive quantifer does not have the same +next attempt at "c". Note that a possessive quantifer does not have the same effect in this example; although it would suppress backtracking during the first match attempt, the second attempt would start at the second character instead of skipping on to "c". .sp (*THEN) -.sp +.sp This verb causes a skip to the next alternation if the rest of the pattern does not match. That is, it cancels pending backtracking, but only within the current alternation. Its name comes from the observation that it can be used @@ -2073,9 +2085,9 @@ .sp ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... .sp -If the COND1 pattern matches, FOO is tried (and possibly further items after +If the COND1 pattern matches, FOO is tried (and possibly further items after the end of the group if FOO succeeds); on failure the matcher skips to the -second alternative and tries COND2, without backtracking into COND1. If (*THEN) +second alternative and tries COND2, without backtracking into COND1. If (*THEN) is used outside of any alternation, it acts exactly like (*PRUNE). . . @@ -2099,6 +2111,6 @@ .rs .sp .nf -Last updated: 08 August 2007 +Last updated: 09 August 2007 Copyright (c) 1997-2007 University of Cambridge. .fi