.TH PCRESYNTAX 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains just a quick-reference summary of the
syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any character
  \ee         escape (hex 1B)
  \ef         formfeed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \eddd       character with octal code ddd, or backreference
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one byte, even in UTF-8 mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal whitespace character
  \eH         a character that is not a horizontal whitespace character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a whitespace character
  \eS         a character that is not a whitespace character
  \ev         a vertical whitespace character
  \eV         a character that is not a vertical whitespace character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         an extended Unicode sequence
.sp
In PCRE, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII characters.
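.sp
The ASCII-only behaviour noted above can be tried out with a minimal sketch.
It assumes Python's standard re module as a stand-in, since its re.ASCII flag
gives \ed, \ew and \es the same ASCII-only meaning; it is not PCRE itself, and
the sample strings and variable names are made up for illustration.

```python
import re

# Hypothetical sample: ASCII text plus an accented letter and an Arabic-Indic digit.
text = "id_7 = caf\u00e9 \u0663"

# re.ASCII restricts \w, \d and \s to ASCII characters only.
ascii_words = re.findall(r"\w+", text, re.ASCII)
ascii_digits = re.findall(r"\d", text, re.ASCII)

# \D is the complement of \d: delete every character that is not a decimal digit.
digits_only = re.sub(r"\D", "", "tel: 555-0100", flags=re.ASCII)

print(ascii_words, ascii_digits, digits_only)
```

The accented letter and the non-ASCII digit are skipped by both \ew and \ed in
this mode.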
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
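.sp
The difference between the greedy and lazy forms listed above can be sketched
as follows. This is an illustration only: it assumes Python's standard re
module, which shares this part of the syntax with PCRE, and the subject
strings are invented. The possessive forms are left out because older Python
versions do not accept them.

```python
import re

subject = "<a><b>"

# Greedy: + takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: +? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m}: at least 2 and no more than 3 digits; greedy, so 3 are taken.
bounded = re.search(r"\d{2,3}", "12345").group(0)

# {n,m}?: the lazy form settles for the minimum of 2.
bounded_lazy = re.search(r"\d{2,3}?", "12345").group(0)

print(greedy, lazy, bounded, bounded_lazy)
```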
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary (only ASCII letters recognized)
  \eB          not a word boundary
  ^           start of subject
                also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \eZ          end of subject
                also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
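.sp
A short sketch of how $, ^, \eA and \eb behave, assuming Python's standard re
module as a stand-in for PCRE. Only anchors that behave the same way in both
are used here; \eZ in particular is avoided because Python gives it a different
meaning. The subject strings and variable names are hypothetical.

```python
import re

subject = "one\ntwo\n"

# $ matches before a newline at the very end of the subject ...
end_default = re.search(r"two$", subject) is not None

# ... but before an internal newline only in multiline mode.
internal_default = re.search(r"one$", subject) is not None
internal_multiline = re.search(r"one$", subject, re.MULTILINE) is not None

# ^ matches after an internal newline in multiline mode; \A never does.
caret_pos = re.search(r"^two", subject, re.MULTILINE).start()
a_anchor_fails = re.search(r"\Atwo", subject, re.MULTILINE) is None

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(end_default, internal_default, internal_multiline, caret_pos, whole_words)
```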
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
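.sp
How numbered, named and non-capturing groups interact can be sketched like
this. The example assumes Python's standard re module and therefore uses the
Python-style (?P<name>...) form from the list above; the date string and
variable names are invented for illustration.

```python
import re

# One named group, one non-capturing group, one plain numbered group.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2010-05-05")

year_by_name = m.group("year")   # a named group is also group 1
year_by_number = m.group(1)
day = m.group(2)                 # (?:...) did not use up a group number
group_count = len(m.groups())

print(year_by_name, year_by_number, day, group_count)
```

The non-capturing group in the middle takes part in matching but is skipped
when groups are numbered, so the day is group 2 rather than group 3.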
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following is recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*UTF8)         set UTF-8 mode
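.sp
The inline options (?i), (?s), (?m) and (?x) listed above can be tried out with
a minimal sketch. It assumes Python's standard re module, which accepts these
four at the start of a pattern; (?J), (?U), (?-...) in this form and (*UTF8)
are PCRE features that the sketch does not cover. The sample strings are
hypothetical.

```python
import re

# (?i) caseless
caseless = re.search(r"(?i)pcre", "Uses PCRE").group(0)

# (?s) dotall: without it, . does not match a newline
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"(?s)a.b", "a\nb")

# (?m) multiline: $ also matches before an internal newline
line_ends = re.findall(r"(?m)\w+$", "one\ntwo")

# (?x) extended: white space in the pattern is ignored
extended = re.search(r"(?x) \d+ \s* px", "12 px").group(0)

print(caseless, without_s, with_s is not None, line_ends, extended)
```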
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
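.sp
The four assertions and the fixed-length rule can be illustrated with a short
sketch. It assumes Python's standard re module as a stand-in for PCRE. Python
is stricter than the rule stated above, because it requires the whole look
behind (not just each branch) to have one fixed length, so only a plainly
variable-length case is shown as rejected. All sample strings are invented.

```python
import re

prices = "$15 and 27"

# Look behind: digits that are, or are not, preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

# Look ahead: the "px" is checked but not included in the match.
sizes = "10px 20em 30px"
px_values = re.findall(r"\d+(?=px)", sizes)
# The \d inside the assertion stops a shorter run such as "1" from matching.
not_px = re.findall(r"\b\d+(?!\d|px)", sizes)

# A look behind with no fixed length is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(after_dollar, not_after_dollar, px_values, not_px, variable_ok)
```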
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
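.sp
Two of the forms above, \en and (?P=name), can be sketched as follows. The
example assumes Python's standard re module, which accepts these two forms but
not the \eg or \ek variants; the sample text is made up.

```python
import re

# \1 must match the same text that group 1 captured: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)

# (?P=q) refers back to the named group q: the closing quote must be
# the same character as the opening one.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say \"hi\" now").group(0)

# A mismatched pair of quotes does not match.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say \"hi' now")

print(doubled, quoted, mismatched)
```

A backreference matches what the group matched, not the group's pattern, which
is why the mismatched quotes fail.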
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...          absolute reference condition
  (?(+n)...         relative reference condition
  (?(-n)...         relative reference condition
  (?(<name>)...     named reference condition (Perl)
  (?('name')...     named reference condition (Perl)
  (?(name)...       named reference condition (PCRE)
  (?(R)...          overall recursion condition
  (?(Rn)...         specific group recursion condition
  (?(R&name)...     specific recursion condition
  (?(DEFINE)...     define subpattern for reference
  (?(assert)...     assertion condition
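.sp
The absolute reference condition (?(n)... can be sketched in both of its
forms. The example assumes Python's standard re module, which supports
conditions on a group number or name but none of the recursion, DEFINE or
assertion conditions listed above. The patterns and subjects are hypothetical.

```python
import re

# yes-pattern only: a closing '>' is required if, and only if,
# group 1 matched an opening '<'.
tag = re.compile(r"^(<)?\w+(?(1)>)$")
tag_results = {s: tag.search(s) is not None for s in ["<a>", "a", "<a", "a>"]}

# yes-pattern|no-pattern: digits end with ')' when they began with '(',
# and with ';' otherwise.
num = re.compile(r"^(\()?\d+(?(1)\)|;)$")
num_results = {s: num.search(s) is not None for s in ["(12)", "12;", "12", "(12;"]}

print(tag_results)
print(num_results)
```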
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act immediately they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 mode.
.sp
  (*BSR_ANYCRLF)   CR, LF, or CRLF
  (*BSR_UNICODE)   any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)       callout
  (?Cn)      callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 05 May 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
