Diff of /code/trunk/doc/pcrepattern.3

revision 86 by nigel, Sat Feb 24 21:40:52 2007 UTC revision 87 by nigel, Sat Feb 24 21:41:21 2007 UTC
# Line 148  represents: Line 148  represents:
148    \et        tab (hex 09)    \et        tab (hex 09)
149    \eddd      character with octal code ddd, or backreference    \eddd      character with octal code ddd, or backreference
150    \exhh      character with hex code hh    \exhh      character with hex code hh
151    \ex{hhh..} character with hex code hhh... (UTF-8 mode only)    \ex{hhh..} character with hex code hhh..
152  .sp  .sp
153  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
154  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 156  Thus \ecz becomes hex 1A, but \ec{ becom Line 156  Thus \ecz becomes hex 1A, but \ec{ becom
156  7B.  7B.
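
The two-step rule for \ecx described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not PCRE's own code; the function name control_escape is hypothetical, and the sketch assumes x is a single ASCII character.

```python
def control_escape(x):
    # Illustrative helper (not part of PCRE): the code that \cx yields
    # for a single ASCII character x, following the rule in the text.
    code = ord(x)
    if ord("a") <= code <= ord("z"):
        code -= 0x20          # a lower case letter is converted to upper case
    return code ^ 0x40        # then bit 6 (hex 40) is inverted


print(hex(control_escape("z")))   # 0x1a, as stated in the text
```

Because the second step is an inversion rather than a clearing of the bit, a character below hex 40 is mapped upwards instead of to a control code.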
157  .P  .P
158  After \ex, from zero to two hexadecimal digits are read (letters can be in  After \ex, from zero to two hexadecimal digits are read (letters can be in
159  upper or lower case). In UTF-8 mode, any number of hexadecimal digits may  upper or lower case). Any number of hexadecimal digits may appear between \ex{
160  appear between \ex{ and }, but the value of the character code must be less  and }, but the value of the character code must be less than 256 in non-UTF-8
161  than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters  mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value
162  other than hexadecimal digits appear between \ex{ and }, or if there is no  is 7FFFFFFF). If characters other than hexadecimal digits appear between \ex{
163  terminating }, this form of escape is not recognized. Instead, the initial  and }, or if there is no terminating }, this form of escape is not recognized.
164  \ex will be interpreted as a basic hexadecimal escape, with no following  Instead, the initial \ex will be interpreted as a basic hexadecimal escape,
165  digits, giving a character whose value is zero.  with no following digits, giving a character whose value is zero.
166  .P  .P
167  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
168  syntaxes for \ex when PCRE is in UTF-8 mode. There is no difference in the  syntaxes for \ex. There is no difference in the way they are handled. For
169  way they are handled. For example, \exdc is exactly the same as \ex{dc}.  example, \exdc is exactly the same as \ex{dc}.
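
As a rough sketch of the revised \ex rules (right-hand column), the Python below parses the text that follows a \ex. It is an assumption-laden illustration, not PCRE's implementation: the name parse_hex_escape, the ValueError for an out-of-range value, and the choice to treat an empty pair of braces as "not recognized" are all assumptions made for this example.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hex_escape(text, utf8=False):
    # Hypothetical helper: 'text' is whatever follows \x in the pattern.
    # Returns (character value, number of characters consumed).
    if text.startswith("{"):
        close = text.find("}")
        body = text[1:close] if close != -1 else None
        if body and all(ch in HEX_DIGITS for ch in body):
            value = int(body, 16)
            limit = 2**31 if utf8 else 256   # limits stated in the text
            if value >= limit:
                # Assumption: an oversized value is reported as an error.
                raise ValueError("character code too large")
            return value, close + 1
        # Otherwise the brace form is not recognized; fall through.
    # Basic form: from zero to two hexadecimal digits.
    count = 0
    while count < 2 and count < len(text) and text[count] in HEX_DIGITS:
        count += 1
    value = int(text[:count], 16) if count else 0
    return value, count


print(parse_hex_escape("dc"))     # (220, 2)
print(parse_hex_escape("{dc}"))   # (220, 4)
print(parse_hex_escape("{zz}"))   # (0, 0): falls back to \x with no digits
```

In the fallback case nothing after \ex is consumed, so the opening brace and what follows it remain in the pattern as ordinary text.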
170  .P  .P
171  After \e0 up to two further octal digits are read. In both cases, if there  After \e0 up to two further octal digits are read. In both cases, if there
172  are fewer than two digits, just those that are present are used. Thus the  are fewer than two digits, just those that are present are used. Thus the
# Line 272  greater than 128 are used for accented l Line 272  greater than 128 are used for accented l
272  .P  .P
273  In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or  In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or
274  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode
275  character property support is available.  character property support is available. The use of locales with Unicode is
276    discouraged.
277  .  .
278  .  .
279  .\" HTML <a name="uniextseq"></a>  .\" HTML <a name="uniextseq"></a>
# Line 280  character property support is available. Line 281  character property support is available.
281  .rs  .rs
282  .sp  .sp
283  When PCRE is built with Unicode character property support, three additional  When PCRE is built with Unicode character property support, three additional
284  escape sequences to match generic character types are available when UTF-8 mode  escape sequences to match character properties are available when UTF-8 mode
285  is selected. They are:  is selected. They are:
286  .sp  .sp
287   \ep{\fIxx\fP}   a character with the \fIxx\fP property    \ep{\fIxx\fP}   a character with the \fIxx\fP property
288   \eP{\fIxx\fP}   a character without the \fIxx\fP property    \eP{\fIxx\fP}   a character without the \fIxx\fP property
289   \eX       an extended Unicode sequence    \eX       an extended Unicode sequence
290  .sp  .sp
291  The property names represented by \fIxx\fP above are limited to the  The property names represented by \fIxx\fP above are limited to the Unicode
292  Unicode general category properties. Each character has exactly one such  script names, the general category properties, and "Any", which matches any
293  property, specified by a two-letter abbreviation. For compatibility with Perl,  character (including newline). Other properties such as "InMusicalSymbols" are
294  negation can be specified by including a circumflex between the opening brace  not currently supported by PCRE. Note that \eP{Any} does not match any
295  and the property name. For example, \ep{^Lu} is the same as \eP{Lu}.  characters, so always causes a match failure.
296  .P  .P
297  If only one letter is specified with \ep or \eP, it includes all the properties  Sets of Unicode characters are defined as belonging to certain scripts. A
298  that start with that letter. In this case, in the absence of negation, the  character from one of these sets can be matched using a script name. For
299  curly brackets in the escape sequence are optional; these two examples have  example:
300  the same effect:  .sp
301      \ep{Greek}
302      \eP{Han}
303    .sp
304    Those that are not part of an identified script are lumped together as
305    "Common". The current list of scripts is:
306    .P
307    Arabic,
308    Armenian,
309    Bengali,
310    Bopomofo,
311    Braille,
312    Buginese,
313    Buhid,
314    Canadian_Aboriginal,
315    Cherokee,
316    Common,
317    Coptic,
318    Cypriot,
319    Cyrillic,
320    Deseret,
321    Devanagari,
322    Ethiopic,
323    Georgian,
324    Glagolitic,
325    Gothic,
326    Greek,
327    Gujarati,
328    Gurmukhi,
329    Han,
330    Hangul,
331    Hanunoo,
332    Hebrew,
333    Hiragana,
334    Inherited,
335    Kannada,
336    Katakana,
337    Kharoshthi,
338    Khmer,
339    Lao,
340    Latin,
341    Limbu,
342    Linear_B,
343    Malayalam,
344    Mongolian,
345    Myanmar,
346    New_Tai_Lue,
347    Ogham,
348    Old_Italic,
349    Old_Persian,
350    Oriya,
351    Osmanya,
352    Runic,
353    Shavian,
354    Sinhala,
355    Syloti_Nagri,
356    Syriac,
357    Tagalog,
358    Tagbanwa,
359    Tai_Le,
360    Tamil,
361    Telugu,
362    Thaana,
363    Thai,
364    Tibetan,
365    Tifinagh,
366    Ugaritic,
367    Yi.
368    .P
369    Each character has exactly one general category property, specified by a
370    two-letter abbreviation. For compatibility with Perl, negation can be specified
371    by including a circumflex between the opening brace and the property name. For
372    example, \ep{^Lu} is the same as \eP{Lu}.
373    .P
374    If only one letter is specified with \ep or \eP, it includes all the general
375    category properties that start with that letter. In this case, in the absence
376    of negation, the curly brackets in the escape sequence are optional; these two
377    examples have the same effect:
378  .sp  .sp
379    \ep{L}    \ep{L}
380    \epL    \epL
381  .sp  .sp
382  The following property codes are supported:  The following general category property codes are supported:
383  .sp  .sp
384    C     Other    C     Other
385    Cc    Control    Cc    Control
# Line 347  The following property codes are support Line 425  The following property codes are support
425    Zp    Paragraph separator    Zp    Paragraph separator
426    Zs    Space separator    Zs    Space separator
427  .sp  .sp
428  Extended properties such as "Greek" or "InMusicalSymbols" are not supported by  The special property L& is also supported: it matches a character that has
429  PCRE.  the Lu, Ll, or Lt property, in other words, a letter that is not classified as
430    a modifier or "other".
431    .P
432    The long synonyms for these properties that Perl supports (such as \ep{Letter})
433    are not supported by PCRE. Nor is it permitted to prefix any of these
434    properties with "Is".
435    .P
436    No character that is in the Unicode table has the Cn (unassigned) property.
437    Instead, this property is assumed for any code point that is not in the
438    Unicode table.
439  .P  .P
440  Specifying caseless matching does not affect these escape sequences. For  Specifying caseless matching does not affect these escape sequences. For
441  example, \ep{Lu} always matches only upper case letters.  example, \ep{Lu} always matches only upper case letters.
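
The general category rules above (one-letter prefixes, circumflex negation, L&, and "Any") can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions: has_property is a hypothetical name, script names such as Greek are not modelled because the standard library does not expose them, and the Unicode tables are those shipped with the Python build, which may differ from the tables compiled into PCRE.

```python
import unicodedata


def has_property(ch, prop):
    # Hypothetical helper: would \p{prop} match the single character ch?
    # Only general categories, "L&" and "Any" are handled here.
    negate = prop.startswith("^")            # \p{^Lu} is the same as \P{Lu}
    name = prop[1:] if negate else prop
    category = unicodedata.category(ch)      # two-letter code, e.g. "Lu"
    if name == "Any":
        matched = True
    elif name == "L&":
        matched = category in ("Lu", "Ll", "Lt")
    else:
        # A single letter such as "L" covers every category that starts
        # with it; a two-letter name must match exactly.
        matched = category.startswith(name)
    return matched != negate


print(has_property("A", "Lu"))    # True
print(has_property("a", "^Lu"))   # True
print(has_property("1", "L"))     # False
```

unicodedata.category also reports Cn for a code point that is absent from its table, which is the same convention the text describes for unassigned characters.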
# Line 1346  number, provided that it occurs inside t Line 1433  number, provided that it occurs inside t
1433  "subroutine" call, which is described in the next section.) The special item  "subroutine" call, which is described in the next section.) The special item
1434  (?R) is a recursive call of the entire regular expression.  (?R) is a recursive call of the entire regular expression.
1435  .P  .P
1436  For example, this PCRE pattern solves the nested parentheses problem (assume  A recursive subpattern call is always treated as an atomic group. That is, once
1437  the PCRE_EXTENDED option is set so that white space is ignored):  it has matched some of the subject string, it is never re-entered, even if
1438    it contains untried alternatives and there is a subsequent matching failure.
1439    .P
1440    This PCRE pattern solves the nested parentheses problem (assume the
1441    PCRE_EXTENDED option is set so that white space is ignored):
1442  .sp  .sp
1443    \e( ( (?>[^()]+) | (?R) )* \e)    \e( ( (?>[^()]+) | (?R) )* \e)
1444  .sp  .sp
1445  First it matches an opening parenthesis. Then it matches any number of  First it matches an opening parenthesis. Then it matches any number of
1446  substrings which can either be a sequence of non-parentheses, or a recursive  substrings which can either be a sequence of non-parentheses, or a recursive
1447  match of the pattern itself (that is a correctly parenthesized substring).  match of the pattern itself (that is, a correctly parenthesized substring).
1448  Finally there is a closing parenthesis.  Finally there is a closing parenthesis.
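
To make the walk-through above concrete, here is a hand-written Python recogniser that follows the same three steps. It is not how PCRE executes the pattern; match_parens is a hypothetical name, and the function only attempts a match at one given position rather than searching the subject.

```python
def match_parens(subject, pos=0):
    # Hypothetical helper mirroring  \( ( (?>[^()]+) | (?R) )* \)
    # Returns the index just past the match starting at pos, or -1.
    if pos >= len(subject) or subject[pos] != "(":
        return -1                              # opening parenthesis
    i = pos + 1
    while i < len(subject) and subject[i] != ")":
        if subject[i] == "(":
            end = match_parens(subject, i)     # plays the role of (?R)
            if end == -1:
                return -1
            i = end
        else:
            # plays the role of (?>[^()]+): consume the whole run of
            # non-parentheses and never give any of it back
            while i < len(subject) and subject[i] not in "()":
                i += 1
    return i + 1 if i < len(subject) else -1   # closing parenthesis


print(match_parens("(ab(cd)ef)"))   # 10
print(match_parens("(a(b)"))        # -1
```

The recogniser never revisits a recursive call once it has returned, which happens to agree with the atomic treatment of recursion described in the right-hand column. For very deeply nested input, Python's recursion limit would apply.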
1449  .P  .P
1450  If this were part of a larger pattern, you would not want to recurse the entire  If this were part of a larger pattern, you would not want to recurse the entire
# Line 1437  matches "sense and sensibility" and "res Line 1528  matches "sense and sensibility" and "res
1528  is used, it does match "sense and responsibility" as well as the other two  is used, it does match "sense and responsibility" as well as the other two
1529  strings. Such references must, however, follow the subpattern to which they  strings. Such references must, however, follow the subpattern to which they
1530  refer.  refer.
1531    .P
1532    Like recursive subpatterns, a "subroutine" call is always treated as an atomic
1533    group. That is, once it has matched some of the subject string, it is never
1534    re-entered, even if it contains untried alternatives and there is a subsequent
1535    matching failure.
1536  .  .
1537  .  .
1538  .SH CALLOUTS  .SH CALLOUTS
# Line 1475  description of the interface to the call Line 1571  description of the interface to the call
1571  documentation.  documentation.
1572  .P  .P
1573  .in 0  .in 0
1574  Last updated: 28 February 2005  Last updated: 24 January 2006
1575  .br  .br
1576  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
