.\" Revision 518, Tue May 18 15:47:01 2010 UTC, by ph10:
.\" Added PCRE_UCP and related stuff to make \w etc use Unicode properties.
.TH PCRESYNTAX 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains just a quick-reference summary of the
syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any character
  \ee         escape (hex 1B)
  \ef         formfeed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \eddd       character with octal code ddd, or backreference
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one byte, even in UTF-8 mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal whitespace character
  \eH         a character that is not a horizontal whitespace character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a whitespace character
  \eS         a character that is not a whitespace character
  \ev         a vertical whitespace character
  \eV         a character that is not a vertical whitespace character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         an extended Unicode sequence
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting the
PCRE_UCP option.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
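.sp
The greedy and lazy forms in the table above accept the same strings but prefer
different matches. The following is a minimal sketch, assuming Python's re
module as a stand-in: it is a separate engine, not PCRE, and is used here only
because it gives these particular quantifiers the same meaning. The subject
strings are made up for illustration.

```python
import re

subject = "<a><b>"

# Greedy: ".+" takes as much as it can, then gives back until ">" matches.
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: ".+?" takes as little as it can, so it stops at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m} bounds the repeat count; the lazy form prefers the lower bound.
bounded_greedy = re.search(r"a{2,4}", "aaaaa").group(0)
bounded_lazy = re.search(r"a{2,4}?", "aaaaa").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```

The possessive forms are left out of the sketch because older versions of the
stand-in engine do not accept them.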
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \eZ          end of subject
               also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
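.sp
The effect of multiline mode on the circumflex and dollar anchors can be
sketched as follows. This assumes Python's re module as a stand-in for PCRE; the
two engines agree on these two anchors, but not on every escape in the table
(the upper case Z escape, for one, behaves differently there), so only the
circumflex and dollar are shown. The subject string is made up.

```python
import re

subject = "one\ntwo\n"

# Default mode: ^ matches only at the very start of the subject.
starts_default = re.findall(r"^\w+", subject)

# Multiline mode, set inline with (?m): ^ also matches after each newline.
starts_multiline = re.findall(r"(?m)^\w+", subject)

# Default mode: $ matches at the end, and also before a final newline ...
end_before_final_newline = re.search(r"two$", subject) is not None

# ... but not before an internal newline unless multiline mode is set.
internal_default = re.search(r"one$", subject) is not None
internal_multiline = re.search(r"(?m)one$", subject) is not None

print(starts_default, starts_multiline)
```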
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
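.sp
As an illustration of how capturing and non-capturing groups are numbered, here
is a short sketch using the Python-style named group syntax from the table. It
assumes Python's re module as a stand-in for PCRE; the date string and the group
names are invented for the example.

```python
import re

# Two named capturing groups with a non-capturing group between them.
pattern = r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})"
m = re.match(pattern, "2010-05-12")

# Named groups can be fetched by name or by number.
year_by_name = m.group("year")
year_by_number = m.group(1)

# The (?:...) group takes no number, so the day is group 2, not group 3.
day = m.group(2)
all_groups = m.groups()

print(year_by_name, day, all_groups)
```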
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
  (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
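.sp
The four assertions test the text around the current position without consuming
it. A minimal sketch, assuming Python's re module as a stand-in for PCRE (it
accepts the same four forms and also requires fixed-length look behind); the
subject string is made up:

```python
import re

subject = "foobar foobaz xbaz"

# Positive look ahead: "foo" only where "bar" follows; "bar" is not consumed.
ahead_pos = [m.start() for m in re.finditer(r"foo(?=bar)", subject)]

# Negative look ahead: "foo" only where "bar" does not follow.
ahead_neg = [m.start() for m in re.finditer(r"foo(?!bar)", subject)]

# Positive look behind: "ba" plus one character, only where "foo" precedes.
behind_pos = re.findall(r"(?<=foo)ba.", subject)

# Negative look behind: the same, only where "foo" does not precede.
behind_neg = [m.start() for m in re.finditer(r"(?<!foo)ba.", subject)]

print(ahead_pos, ahead_neg, behind_pos, behind_neg)
```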
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
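.sp
A backreference matches the same text that the referenced group matched, not
the group's pattern. The sketch below shows the plain numbered form and the
Python-style named form from the table; it assumes Python's re module as a
stand-in for PCRE, and the subject strings and the group name are invented.

```python
import re

# Numbered reference: group 1 must be followed by a space and the same word.
doubled = re.findall(r"\b(\w+) \1\b", "it was was the the end")

# Named reference: the closing quote must be the same character as the
# opening quote captured by the group called "q".
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

# Mismatched quotes do not satisfy the reference, so there is no match.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi\" now")

print(doubled, quoted, mismatched)
```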
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
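.sp
The absolute reference condition selects the yes-pattern only if the numbered
group took part in the match. A minimal sketch, assuming Python's re module as
a stand-in for PCRE (it supports this one form of condition, though not the
recursion, DEFINE, or assertion conditions); the subject strings are made up:

```python
import re

# Group 1 optionally captures "<". The condition (?(1)>) then demands a
# closing ">" only when group 1 matched; the no-pattern is omitted.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

bracketed = pattern.match("<abc>") is not None
bare = pattern.match("abc") is not None
unclosed = pattern.match("<abc") is not None
unopened = pattern.match("abc>") is not None

print(bracketed, bare, unclosed, unopened)
```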
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act immediately they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 12 May 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
