
Contents of /code/trunk/doc/html/pcresyntax.html

Revision 518
Tue May 18 15:47:01 2010 UTC by ph10
File MIME type: text/html
File size: 14436 byte(s)
Added PCRE_UCP and related stuff to make \w etc use Unicode properties.

1 <html>
2 <head>
3 <title>pcresyntax specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcresyntax man page</h1>
7 <p>
8 Return to the <a href="index.html">PCRE index page</a>.
9 </p>
10 <p>
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
14 <br>
15 <ul>
16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
18 <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21 <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22 <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23 <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24 <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25 <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26 <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
27 <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28 <li><a name="TOC13" href="#SEC13">CAPTURING</a>
29 <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30 <li><a name="TOC15" href="#SEC15">COMMENT</a>
31 <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32 <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
33 <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
34 <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
35 <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
36 <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
37 <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
38 <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
39 <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40 <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41 <li><a name="TOC26" href="#SEC26">AUTHOR</a>
42 <li><a name="TOC27" href="#SEC27">REVISION</a>
43 </ul>
44 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
45 <P>
46 The full syntax and semantics of the regular expressions that are supported by
47 PCRE are described in the
48 <a href="pcrepattern.html"><b>pcrepattern</b></a>
49 documentation. This document contains just a quick-reference summary of the
50 syntax.
51 </P>
52 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
53 <P>
54 <pre>
55 \x where x is non-alphanumeric is a literal x
56 \Q...\E treat enclosed characters as literal
57 </PRE>
58 </P>
59 <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
60 <P>
61 <pre>
62 \a alarm, that is, the BEL character (hex 07)
63 \cx "control-x", where x is any character
64 \e escape (hex 1B)
65 \f formfeed (hex 0C)
66 \n newline (hex 0A)
67 \r carriage return (hex 0D)
68 \t tab (hex 09)
69 \ddd character with octal code ddd, or backreference
70 \xhh character with hex code hh
71 \x{hhh..} character with hex code hhh..
72 </PRE>
73 </P>
74 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
75 <P>
76 <pre>
77 . any character except newline;
78     in dotall mode, any character whatsoever
79 \C one byte, even in UTF-8 mode (best avoided)
80 \d a decimal digit
81 \D a character that is not a decimal digit
82 \h a horizontal whitespace character
83 \H a character that is not a horizontal whitespace character
84 \N a character that is not a newline
85 \p{<i>xx</i>} a character with the <i>xx</i> property
86 \P{<i>xx</i>} a character without the <i>xx</i> property
87 \R a newline sequence
88 \s a whitespace character
89 \S a character that is not a whitespace character
90 \v a vertical whitespace character
91 \V a character that is not a vertical whitespace character
92 \w a "word" character
93 \W a "non-word" character
94 \X an extended Unicode sequence
95 </pre>
96 In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII characters. Setting the PCRE_UCP option makes them use Unicode properties instead.
97 </P>
98 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
99 <P>
100 <pre>
101 C Other
102 Cc Control
103 Cf Format
104 Cn Unassigned
105 Co Private use
106 Cs Surrogate
107
108 L Letter
109 Ll Lower case letter
110 Lm Modifier letter
111 Lo Other letter
112 Lt Title case letter
113 Lu Upper case letter
114 L&amp; Ll, Lu, or Lt
115
116 M Mark
117 Mc Spacing mark
118 Me Enclosing mark
119 Mn Non-spacing mark
120
121 N Number
122 Nd Decimal number
123 Nl Letter number
124 No Other number
125
126 P Punctuation
127 Pc Connector punctuation
128 Pd Dash punctuation
129 Pe Close punctuation
130 Pf Final punctuation
131 Pi Initial punctuation
132 Po Other punctuation
133 Ps Open punctuation
134
135 S Symbol
136 Sc Currency symbol
137 Sk Modifier symbol
138 Sm Mathematical symbol
139 So Other symbol
140
141 Z Separator
142 Zl Line separator
143 Zp Paragraph separator
144 Zs Space separator
145 </PRE>
146 </P>
147 <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
148 <P>
149 <pre>
150 Xan Alphanumeric: union of properties L and N
151 Xps POSIX space: property Z or tab, NL, VT, FF, CR
152 Xsp Perl space: property Z or tab, NL, FF, CR
153 Xwd Perl word: property Xan or underscore
154 </PRE>
155 </P>
156 <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
157 <P>
158 Arabic,
159 Armenian,
160 Avestan,
161 Balinese,
162 Bamum,
163 Bengali,
164 Bopomofo,
165 Braille,
166 Buginese,
167 Buhid,
168 Canadian_Aboriginal,
169 Carian,
170 Cham,
171 Cherokee,
172 Common,
173 Coptic,
174 Cuneiform,
175 Cypriot,
176 Cyrillic,
177 Deseret,
178 Devanagari,
179 Egyptian_Hieroglyphs,
180 Ethiopic,
181 Georgian,
182 Glagolitic,
183 Gothic,
184 Greek,
185 Gujarati,
186 Gurmukhi,
187 Han,
188 Hangul,
189 Hanunoo,
190 Hebrew,
191 Hiragana,
192 Imperial_Aramaic,
193 Inherited,
194 Inscriptional_Pahlavi,
195 Inscriptional_Parthian,
196 Javanese,
197 Kaithi,
198 Kannada,
199 Katakana,
200 Kayah_Li,
201 Kharoshthi,
202 Khmer,
203 Lao,
204 Latin,
205 Lepcha,
206 Limbu,
207 Linear_B,
208 Lisu,
209 Lycian,
210 Lydian,
211 Malayalam,
212 Meetei_Mayek,
213 Mongolian,
214 Myanmar,
215 New_Tai_Lue,
216 Nko,
217 Ogham,
218 Old_Italic,
219 Old_Persian,
220 Old_South_Arabian,
221 Old_Turkic,
222 Ol_Chiki,
223 Oriya,
224 Osmanya,
225 Phags_Pa,
226 Phoenician,
227 Rejang,
228 Runic,
229 Samaritan,
230 Saurashtra,
231 Shavian,
232 Sinhala,
233 Sundanese,
234 Syloti_Nagri,
235 Syriac,
236 Tagalog,
237 Tagbanwa,
238 Tai_Le,
239 Tai_Tham,
240 Tai_Viet,
241 Tamil,
242 Telugu,
243 Thaana,
244 Thai,
245 Tibetan,
246 Tifinagh,
247 Ugaritic,
248 Vai,
249 Yi.
250 </P>
251 <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
252 <P>
253 <pre>
254 [...] positive character class
255 [^...] negative character class
256 [x-y] range (can be used for hex characters)
257 [[:xxx:]] positive POSIX named set
258 [[:^xxx:]] negative POSIX named set
259
260 alnum alphanumeric
261 alpha alphabetic
262 ascii 0-127
263 blank space or tab
264 cntrl control character
265 digit decimal digit
266 graph printing, excluding space
267 lower lower case letter
268 print printing, including space
269 punct printing, excluding alphanumeric
270 space whitespace
271 upper upper case letter
272 word same as \w
273 xdigit hexadecimal digit
274 </pre>
275 In PCRE, by default, POSIX character set names recognize only ASCII characters. You can use
276 \Q...\E inside a character class.
277 </P>
278 <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
279 <P>
280 <pre>
281 ? 0 or 1, greedy
282 ?+ 0 or 1, possessive
283 ?? 0 or 1, lazy
284 * 0 or more, greedy
285 *+ 0 or more, possessive
286 *? 0 or more, lazy
287 + 1 or more, greedy
288 ++ 1 or more, possessive
289 +? 1 or more, lazy
290 {n} exactly n
291 {n,m} at least n, no more than m, greedy
292 {n,m}+ at least n, no more than m, possessive
293 {n,m}? at least n, no more than m, lazy
294 {n,} n or more, greedy
295 {n,}+ n or more, possessive
296 {n,}? n or more, lazy
297 </PRE>
298 </P>
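<P>
As a rough illustration of the greedy and lazy rows in the table above, the
following sketch uses Python's <b>re</b> module as a stand-in for PCRE. That
substitution is an assumption made for convenience; the two engines agree on
these particular quantifiers. The sample string and variable names are invented
for the example.
</P>

```python
import re

subject = 'say "a" and "b"'

# Greedy: .+ takes as much as it can, then backs off to the LAST double quote.
greedy = re.search(r'".+"', subject).group(0)

# Lazy: .+? takes as little as it can, stopping at the FIRST closing quote.
lazy = re.search(r'".+?"', subject).group(0)

print(greedy)  # "a" and "b"
print(lazy)    # "a"
```

<P>
Possessive forms such as .++ are left out of the sketch because older versions
of Python's <b>re</b> do not accept them.
</P>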
299 <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
300 <P>
301 <pre>
302 \b word boundary (only ASCII letters recognized)
303 \B not a word boundary
304 ^ start of subject
305     also after internal newline in multiline mode
306 \A start of subject
307 $ end of subject
308     also before newline at end of subject
309     also before internal newline in multiline mode
310 \Z end of subject
311     also before newline at end of subject
312 \z end of subject
313 \G first matching position in subject
314 </PRE>
315 </P>
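<P>
The effect of multiline mode on ^ and the behaviour of $ before a final newline
can be sketched as follows. Python's <b>re</b> module is used here as a
stand-in for PCRE, which is an assumption of this example. Only ^, $ and the
(?m) option are exercised, because Python gives \Z a different meaning from the
one listed above.
</P>

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the very start of the subject.
default_starts = re.findall(r'^\w+', subject)

# With (?m), ^ also matches just after each internal newline.
multiline_starts = re.findall(r'(?m)^\w+', subject)

# $ matches at the very end, and also just before a newline that ends the subject.
dollar_before_final_newline = re.search(r'two$', subject) is not None

print(default_starts)               # ['one']
print(multiline_starts)             # ['one', 'two']
print(dollar_before_final_newline)  # True
```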
316 <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
317 <P>
318 <pre>
319 \K reset start of match
320 </PRE>
321 </P>
322 <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
323 <P>
324 <pre>
325 expr|expr|expr...
326 </PRE>
327 </P>
328 <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
329 <P>
330 <pre>
331 (...) capturing group
332 (?&#60;name&#62;...) named capturing group (Perl)
333 (?'name'...) named capturing group (Perl)
334 (?P&#60;name&#62;...) named capturing group (Python)
335 (?:...) non-capturing group
336 (?|...) non-capturing group; reset group numbers for
337     capturing groups in each alternative
338 </PRE>
339 </P>
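<P>
A small sketch of how named, numbered, and non-capturing groups interact, using
the Python-style named group syntax from the table. It runs under Python's
<b>re</b> module, used here as a stand-in for PCRE (an assumption of this
example). The date pattern and the group names are illustrative only.
</P>

```python
import re

# Two named groups, then an optional non-capturing group wrapping a third,
# unnamed capturing group.
date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})(?:-(\d{2}))?')

m = date.match("2010-05-18")
year = m.group("year")
month = m.group("month")

# Named groups are numbered too; (?:...) does not consume a group number,
# so the day is group 3.
day = m.group(3)
group_count = date.groups

print(year, month, day, group_count)  # 2010 05 18 3
```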
340 <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
341 <P>
342 <pre>
343 (?&#62;...) atomic, non-capturing group
344 </PRE>
345 </P>
346 <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
347 <P>
348 <pre>
349 (?#....) comment (not nestable)
350 </PRE>
351 </P>
352 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
353 <P>
354 <pre>
355 (?i) caseless
356 (?J) allow duplicate names
357 (?m) multiline
358 (?s) single line (dotall)
359 (?U) default ungreedy (lazy)
360 (?x) extended (ignore white space)
361 (?-...) unset option(s)
362 </pre>
363 The following is recognized only at the start of a pattern or after one of the
364 newline-setting options with similar syntax:
365 <pre>
366 (*UTF8) set UTF-8 mode
367 </PRE>
368 </P>
369 <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
370 <P>
371 <pre>
372 (?=...) positive look ahead
373 (?!...) negative look ahead
374 (?&#60;=...) positive look behind
375 (?&#60;!...) negative look behind
376 </pre>
377 Each top-level branch of a look behind must be of a fixed length.
378 </P>
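<P>
The four assertion forms can be tried out with the sketch below, which uses
Python's <b>re</b> module as a stand-in for PCRE (an assumption of this
example). Python is stricter than the rule stated above, because it requires the
whole look behind to be of fixed width, so each look behind here has a single
one-character branch that both engines accept. The sample string is invented.
</P>

```python
import re

subject = "cost $15, then 20 more"

# Positive look behind: digits that directly follow a dollar sign.
after_dollar = re.findall(r'(?<=\$)\d+', subject)

# Negative look behind: digit runs not preceded by a dollar sign or a digit.
not_after_dollar = re.findall(r'(?<![\$\d])\d+', subject)

# Positive look ahead: digits that are directly followed by a comma.
before_comma = re.findall(r'\d+(?=,)', subject)

# Negative look ahead: digit runs not followed by a comma or another digit.
not_before_comma = re.findall(r'\d+(?![,\d])', subject)

print(after_dollar, not_after_dollar, before_comma, not_before_comma)
```

<P>
Assertions consume no characters, so the dollar sign and the comma never appear
in the matched text.
</P>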
379 <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
380 <P>
381 <pre>
382 \n reference by number (can be ambiguous)
383 \gn reference by number
384 \g{n} reference by number
385 \g{-n} relative reference by number
386 \k&#60;name&#62; reference by name (Perl)
387 \k'name' reference by name (Perl)
388 \g{name} reference by name (Perl)
389 \k{name} reference by name (.NET)
390 (?P=name) reference by name (Python)
391 </PRE>
392 </P>
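<P>
A numbered backreference matches the text that the group actually captured, not
the group's pattern. The sketch below shows this with a doubled-word check. It
runs under Python's <b>re</b> module, used as a stand-in for PCRE (an
assumption of this example); the pattern and sample strings are illustrative.
</P>

```python
import re

# \1 must repeat exactly what group 1 matched; \b keeps the match on whole words.
doubled = re.compile(r'\b(\w+) \1\b')

hit = doubled.search("this is is a test")
miss = doubled.search("this is a test")

print(hit.group(0))  # is is
print(miss)          # None
```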
393 <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
394 <P>
395 <pre>
396 (?R) recurse whole pattern
397 (?n) call subpattern by absolute number
398 (?+n) call subpattern by relative number
399 (?-n) call subpattern by relative number
400 (?&amp;name) call subpattern by name (Perl)
401 (?P&#62;name) call subpattern by name (Python)
402 \g&#60;name&#62; call subpattern by name (Oniguruma)
403 \g'name' call subpattern by name (Oniguruma)
404 \g&#60;n&#62; call subpattern by absolute number (Oniguruma)
405 \g'n' call subpattern by absolute number (Oniguruma)
406 \g&#60;+n&#62; call subpattern by relative number (PCRE extension)
407 \g'+n' call subpattern by relative number (PCRE extension)
408 \g&#60;-n&#62; call subpattern by relative number (PCRE extension)
409 \g'-n' call subpattern by relative number (PCRE extension)
410 </PRE>
411 </P>
412 <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
413 <P>
414 <pre>
415 (?(condition)yes-pattern)
416 (?(condition)yes-pattern|no-pattern)
417
418 (?(n)... absolute reference condition
419 (?(+n)... relative reference condition
420 (?(-n)... relative reference condition
421 (?(&#60;name&#62;)... named reference condition (Perl)
422 (?('name')... named reference condition (Perl)
423 (?(name)... named reference condition (PCRE)
424 (?(R)... overall recursion condition
425 (?(Rn)... specific group recursion condition
426 (?(R&amp;name)... specific recursion condition
427 (?(DEFINE)... define subpattern for reference
428 (?(assert)... assertion condition
429 </PRE>
430 </P>
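<P>
The absolute reference condition (?(n)...) can be sketched as follows: a closing
parenthesis is required only if an opening one was captured. The example runs
under Python's <b>re</b> module, used as a stand-in for PCRE (an assumption of
this example); Python supports only the numbered and named reference conditions,
not the recursion, DEFINE, or assertion forms. The pattern is illustrative.
</P>

```python
import re

# Group 1 optionally captures "(". The condition (?(1)\)) then demands ")"
# only when group 1 took part in the match; otherwise it matches nothing.
number = re.compile(r'^(\()?\d+(?(1)\))$')

samples = ["42", "(42)", "(42", "42)"]
results = {s: bool(number.match(s)) for s in samples}

print(results)
```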
431 <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
432 <P>
433 The following act as soon as they are reached:
434 <pre>
435 (*ACCEPT) force successful match
436 (*FAIL) force backtrack; synonym (*F)
437 </pre>
438 The following act only when a subsequent match failure causes a backtrack to
439 reach them. They all force a match failure, but they differ in what happens
440 afterwards. Those that advance the start-of-match point do so only if the
441 pattern is not anchored.
442 <pre>
443 (*COMMIT) overall failure, no advance of starting point
444 (*PRUNE) advance to next starting character
445 (*SKIP) advance start to current matching position
446 (*THEN) local failure, backtrack to next alternation
447 </PRE>
448 </P>
449 <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
450 <P>
451 These are recognized only at the very start of the pattern or after a
452 (*BSR_...) or (*UTF8) option.
453 <pre>
454 (*CR) carriage return only
455 (*LF) linefeed only
456 (*CRLF) carriage return followed by linefeed
457 (*ANYCRLF) all three of the above
458 (*ANY) any Unicode newline sequence
459 </PRE>
460 </P>
461 <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
462 <P>
463 These are recognized only at the very start of the pattern or after a
464 (*...) option that sets the newline convention or UTF-8 mode.
465 <pre>
466 (*BSR_ANYCRLF) CR, LF, or CRLF
467 (*BSR_UNICODE) any Unicode newline sequence
468 </PRE>
469 </P>
470 <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
471 <P>
472 <pre>
473 (?C) callout
474 (?Cn) callout with data n
475 </PRE>
476 </P>
477 <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
478 <P>
479 <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
480 <b>pcrematching</b>(3), <b>pcre</b>(3).
481 </P>
482 <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
483 <P>
484 Philip Hazel
485 <br>
486 University Computing Service
487 <br>
488 Cambridge CB2 3QH, England.
489 <br>
490 </P>
491 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
492 <P>
493 Last updated: 05 May 2010
494 <br>
495 Copyright &copy; 1997-2010 University of Cambridge.
496 <br>
497 <p>
498 Return to the <a href="index.html">PCRE index page</a>.
499 </p>
