
Contents of /code/trunk/doc/pcresyntax.3



Revision 535
Thu Jun 3 19:18:24 2010 UTC, by ph10
File size: 11505 byte(s)
Prepare for release candidate.

.TH PCRESYNTAX 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains just a quick-reference summary of the
syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
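
A minimal sketch of backslash quoting, for illustration only. It uses Python's
re module as a stand-in for PCRE, which is an assumption of this example and
not part of PCRE itself. Python's re does not implement the quoted-span syntax
listed above, so re.escape() is shown as the nearest equivalent.

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
literal_dot = re.compile(r"a\.b")
matches_dot = literal_dot.fullmatch("a.b") is not None     # True
matches_other = literal_dot.fullmatch("axb") is not None   # False

# re.escape() quotes every special character in a string, which is the
# closest Python analogue to quoting a whole span of characters.
quoted = re.compile(re.escape("1+1=2?"))
matches_quoted = quoted.fullmatch("1+1=2?") is not None    # True
```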
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea          alarm, that is, the BEL character (hex 07)
  \ecx         "control-x", where x is any character
  \ee          escape (hex 1B)
  \ef          formfeed (hex 0C)
  \en          newline (hex 0A)
  \er          carriage return (hex 0D)
  \et          tab (hex 09)
  \eddd        character with octal code ddd, or backreference
  \exhh        character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one byte, even in UTF-8 mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal whitespace character
  \eH         a character that is not a horizontal whitespace character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a whitespace character
  \eS         a character that is not a whitespace character
  \ev         a vertical whitespace character
  \eV         a character that is not a vertical whitespace character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         an extended Unicode sequence
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting the
PCRE_UCP option.
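
The ASCII-only default described above can be approximated as follows. This is
a hedged sketch using Python's re module, not PCRE: Python's defaults are the
reverse of PCRE's, so the re.ASCII flag plays the role of PCRE's default
behaviour, and Python's default for str patterns stands in for PCRE_UCP. The
variable names are illustrative.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# With re.ASCII the decimal-digit escape matches only 0-9, which mirrors
# PCRE's default behaviour without PCRE_UCP.
ascii_only = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

# Without the flag, Python consults Unicode properties for str patterns,
# roughly what setting PCRE_UCP enables.
unicode_aware = re.fullmatch(r"\d", arabic_three) is not None
```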
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate
.sp
  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter
  L&    Ll, Lu, or Lt
.sp
  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark
.sp
  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number
.sp
  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation
.sp
  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol
.sp
  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan    Alphanumeric: union of properties L and N
  Xps    POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp    Perl space: property Z or tab, NL, FF, CR
  Xwd    Perl word: property Xan or underscore
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]        positive character class
  [^...]       negative character class
  [x-y]        range (can be used for hex characters)
  [[:xxx:]]    positive POSIX named set
  [[:^xxx:]]   negative POSIX named set
.sp
  alnum     alphanumeric
  alpha     alphabetic
  ascii     0-127
  blank     space or tab
  cntrl     control character
  digit     decimal digit
  graph     printing, excluding space
  lower     lower case letter
  print     printing, including space
  punct     printing, excluding alphanumeric
  space     whitespace
  upper     upper case letter
  word      same as \ew
  xdigit    hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?         0 or 1, greedy
  ?+        0 or 1, possessive
  ??        0 or 1, lazy
  *         0 or more, greedy
  *+        0 or more, possessive
  *?        0 or more, lazy
  +         1 or more, greedy
  ++        1 or more, possessive
  +?        1 or more, lazy
  {n}       exactly n
  {n,m}     at least n, no more than m, greedy
  {n,m}+    at least n, no more than m, possessive
  {n,m}?    at least n, no more than m, lazy
  {n,}      n or more, greedy
  {n,}+     n or more, possessive
  {n,}?     n or more, lazy
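
The difference between greedy and lazy quantifiers can be sketched as below.
The example assumes Python's re module as a stand-in for PCRE. The possessive
forms are left out because Python's re accepts them only from version 3.11
onwards.

```python
import re

subject = "<a><b>"

# Greedy: take as much as possible, then give back only if forced to.
greedy = re.match(r"<.+>", subject).group()      # "<a><b>"
# Lazy: take as little as possible, extending only if forced to.
lazy = re.match(r"<.+?>", subject).group()       # "<a>"

# The same distinction applies to bounded repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()    # "aaa"
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()     # "aa"
```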
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb      word boundary
  \eB      not a word boundary
  ^       start of subject
            also after internal newline in multiline mode
  \eA      start of subject
  $       end of subject
            also before newline at end of subject
            also before internal newline in multiline mode
  \eZ      end of subject
            also before newline at end of subject
  \ez      end of subject
  \eG      first matching position in subject
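
A short sketch of the circumflex, dollar, and word-boundary assertions, again
assuming Python's re module as a stand-in for PCRE. The end-of-subject escapes
are not shown because Python's differ from PCRE's: Python's \eZ behaves like
PCRE's \ez.

```python
import re

subject = "one\ntwo\n"

# By default the circumflex matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)         # ["one"]
# In multiline mode it also matches after each internal newline.
every_line = re.findall(r"(?m)^\w+", subject)     # ["one", "two"]

# The dollar also matches just before a newline at the end of the subject.
before_final_newline = re.search(r"two$", subject) is not None

# A word boundary lies between a word character and a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
```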
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK      reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)            capturing group
  (?<name>...)     named capturing group (Perl)
  (?'name'...)     named capturing group (Perl)
  (?P<name>...)    named capturing group (Python)
  (?:...)          non-capturing group
  (?|...)          non-capturing group; reset group numbers for
                     capturing groups in each alternative
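
The Python-style named group and the non-capturing group can be illustrated as
follows. This sketch assumes Python's re module rather than PCRE, and the
group names are made up for the example.

```python
import re

date = re.match(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})", "2010-06-03")

year = date.group("year")     # "2010"
day = date.group("day")       # "03"

# The month is matched by a non-capturing group, so it takes no group
# number and only two captured substrings exist.
captured = date.groups()      # ("2010", "03")
```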
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)          atomic, non-capturing group
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)         comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
  (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
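
Three of the inline options can be sketched as below, assuming Python's re
module as a stand-in for PCRE. Python accepts only some of these letters and
requires global inline options at the very start of the pattern, so the
example is limited to that subset.

```python
import re

# (?i): letters match regardless of case.
caseless = re.fullmatch(r"(?i)pcre", "PCRE") is not None

# (?s): dot matches any character, including newline.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dot_all = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# (?x): unescaped white space in the pattern is ignored.
extended = re.fullmatch(r"(?x) \d+ \s* kg", "12 kg") is not None
```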
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive lookahead
  (?!...)         negative lookahead
  (?<=...)        positive lookbehind
  (?<!...)        negative lookbehind
.sp
Each top-level branch of a lookbehind must be of a fixed length.
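
The assertions above, and the fixed-length rule for lookbehind, can be
sketched as follows. Python's re module is assumed as a stand-in for PCRE. It
is stricter than PCRE, requiring the whole lookbehind to have one fixed width,
so only a case that both engines reject is shown.

```python
import re

prices = "10 USD, 20 EUR, 30 USD"
# Positive lookahead: digits only when " USD" follows; it is not consumed.
usd_amounts = re.findall(r"\d+(?= USD)", prices)

text = "$5 and 7 and $12"
# Positive lookbehind: digits immediately preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# A lookbehind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_width_rejected = False
except re.error:
    variable_width_rejected = True
```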
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en            reference by number (can be ambiguous)
  \egn           reference by number
  \eg{n}         reference by number
  \eg{-n}        relative reference by number
  \ek<name>      reference by name (Perl)
  \ek'name'      reference by name (Perl)
  \eg{name}      reference by name (Perl)
  \ek{name}      reference by name (.NET)
  (?P=name)     reference by name (Python)
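
A sketch of a numbered and a Python-style named backreference, assuming
Python's re module as a stand-in for PCRE. The other reference forms in the
table are not accepted by Python's re and are not shown. The group name q is
illustrative.

```python
import re

# Numbered backreference: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
repeated_word = doubled.group(1)

# Named backreference: the closing quote must be the same character as
# the opening quote captured by group q.
quoted = re.search(r"""(?P<q>['"]).*?(?P=q)""", 'say "hi" now')
quoted_text = quoted.group()
```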
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)          recurse whole pattern
  (?n)          call subpattern by absolute number
  (?+n)         call subpattern by relative number
  (?-n)         call subpattern by relative number
  (?&name)      call subpattern by name (Perl)
  (?P>name)     call subpattern by name (Python)
  \eg<name>      call subpattern by name (Oniguruma)
  \eg'name'      call subpattern by name (Oniguruma)
  \eg<n>         call subpattern by absolute number (Oniguruma)
  \eg'n'         call subpattern by absolute number (Oniguruma)
  \eg<+n>        call subpattern by relative number (PCRE extension)
  \eg'+n'        call subpattern by relative number (PCRE extension)
  \eg<-n>        call subpattern by relative number (PCRE extension)
  \eg'-n'        call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...            absolute reference condition
  (?(+n)...           relative reference condition
  (?(-n)...           relative reference condition
  (?(<name>)...       named reference condition (Perl)
  (?('name')...       named reference condition (Perl)
  (?(name)...         named reference condition (PCRE)
  (?(R)...            overall recursion condition
  (?(Rn)...           specific group recursion condition
  (?(R&name)...       specific recursion condition
  (?(DEFINE)...       define subpattern for reference
  (?(assert)...       assertion condition
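
The absolute reference condition can be sketched as below, assuming Python's
re module as a stand-in for PCRE. Python supports only the numbered and named
reference conditions, so the recursion, DEFINE, and assertion conditions are
not shown.

```python
import re

# Group 1 optionally matches "<". The condition then requires a closing
# ">" only when group 1 took part in the match; there is no no-pattern.
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")

with_brackets = bracketed.match("<a>") is not None
without_brackets = bracketed.match("a") is not None
unbalanced_open = bracketed.match("<a") is not None
unbalanced_close = bracketed.match("a>") is not None
```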
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)     force successful match
  (*FAIL)       force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)     overall failure, no advance of starting point
  (*PRUNE)      advance to next starting character
  (*SKIP)       advance start to current matching position
  (*THEN)       local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) or (*UCP) option.
.sp
  (*CR)         carriage return only
  (*LF)         linefeed only
  (*CRLF)       carriage return followed by linefeed
  (*ANYCRLF)    all three of the above
  (*ANY)        any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 or UCP mode.
.sp
  (*BSR_ANYCRLF)    CR, LF, or CRLF
  (*BSR_UNICODE)    any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 12 May 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
