
Contents of /code/trunk/doc/html/pcresyntax.html



Revision 416
Sat Apr 11 14:34:02 2009 UTC by ph10
File MIME type: text/html
File size: 13777 byte(s)
File tidies for 7.9 release.

1 <html>
2 <head>
3 <title>pcresyntax specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcresyntax man page</h1>
7 <p>
8 Return to the <a href="index.html">PCRE index page</a>.
9 </p>
10 <p>
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
14 <br>
15 <ul>
16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
18 <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTY CODES FOR \p and \P</a>
21 <li><a name="TOC6" href="#SEC6">SCRIPT NAMES FOR \p AND \P</a>
22 <li><a name="TOC7" href="#SEC7">CHARACTER CLASSES</a>
23 <li><a name="TOC8" href="#SEC8">QUANTIFIERS</a>
24 <li><a name="TOC9" href="#SEC9">ANCHORS AND SIMPLE ASSERTIONS</a>
25 <li><a name="TOC10" href="#SEC10">MATCH POINT RESET</a>
26 <li><a name="TOC11" href="#SEC11">ALTERNATION</a>
27 <li><a name="TOC12" href="#SEC12">CAPTURING</a>
28 <li><a name="TOC13" href="#SEC13">ATOMIC GROUPS</a>
29 <li><a name="TOC14" href="#SEC14">COMMENT</a>
30 <li><a name="TOC15" href="#SEC15">OPTION SETTING</a>
31 <li><a name="TOC16" href="#SEC16">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
32 <li><a name="TOC17" href="#SEC17">BACKREFERENCES</a>
33 <li><a name="TOC18" href="#SEC18">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
34 <li><a name="TOC19" href="#SEC19">CONDITIONAL PATTERNS</a>
35 <li><a name="TOC20" href="#SEC20">BACKTRACKING CONTROL</a>
36 <li><a name="TOC21" href="#SEC21">NEWLINE CONVENTIONS</a>
37 <li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a>
38 <li><a name="TOC23" href="#SEC23">CALLOUTS</a>
39 <li><a name="TOC24" href="#SEC24">SEE ALSO</a>
40 <li><a name="TOC25" href="#SEC25">AUTHOR</a>
41 <li><a name="TOC26" href="#SEC26">REVISION</a>
42 </ul>
43 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
44 <P>
45 The full syntax and semantics of the regular expressions that are supported by
46 PCRE are described in the
47 <a href="pcrepattern.html"><b>pcrepattern</b></a>
48 documentation. This document contains just a quick-reference summary of the
49 syntax.
50 </P>
51 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
52 <P>
53 <pre>
54 \x where x is non-alphanumeric is a literal x
55 \Q...\E treat enclosed characters as literal
56 </PRE>
57 </P>
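<P>
As a rough illustration of backslash quoting, the sketch below uses Python's
<b>re</b> module rather than PCRE itself (an assumption made so that the
example runs with a standard library only). Python's <b>re</b> does not
implement \Q...\E, so <b>re.escape()</b> stands in for it here.
</P>

```python
import re

# A backslash before a non-alphanumeric character makes that character literal.
literal_dot = re.compile(r"3\.14")
dot_hit = literal_dot.search("pi is 3.14") is not None
dot_miss = literal_dot.search("3x14") is None

# Stand-in for \Q...\E (not available in Python's re): escape a whole string.
quoted = re.compile(re.escape("a+b*c"))
quoted_hit = quoted.fullmatch("a+b*c") is not None

# Without quoting, + and * act as quantifiers, so the same text does not match.
unquoted_hit = re.fullmatch("a+b*c", "a+b*c") is not None
```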
58 <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
59 <P>
60 <pre>
61 \a alarm, that is, the BEL character (hex 07)
62 \cx "control-x", where x is any character
63 \e escape (hex 1B)
64 \f formfeed (hex 0C)
65 \n newline (hex 0A)
66 \r carriage return (hex 0D)
67 \t tab (hex 09)
68 \ddd character with octal code ddd, or backreference
69 \xhh character with hex code hh
70 \x{hhh..} character with hex code hhh..
71 </PRE>
72 </P>
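<P>
The escapes above can be tried out with a small sketch. It assumes Python's
<b>re</b> module as a stand-in for PCRE; that module handles \a, \f, \n, \r,
\t, \xhh and octal escapes in the same way, but it does not support \e, \cx or
\x{hhh..}, so those are left out.
</P>

```python
import re

# \xhh and \ddd (octal) both name a character by its code: 0x41 == 0o101 == "A".
hex_match = re.fullmatch(r"\x41", "A") is not None
octal_match = re.fullmatch(r"\101", "A") is not None

# \a, \t, \n, \r and \f stand for BEL, tab, newline, carriage return, formfeed.
control_match = re.fullmatch(r"\a\t\n\r\f", "\x07\x09\x0a\x0d\x0c") is not None
```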
73 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
74 <P>
75 <pre>
76 . any character except newline;
77      in dotall mode, any character whatsoever
78 \C one byte, even in UTF-8 mode (best avoided)
79 \d a decimal digit
80 \D a character that is not a decimal digit
81 \h a horizontal whitespace character
82 \H a character that is not a horizontal whitespace character
83 \p{<i>xx</i>} a character with the <i>xx</i> property
84 \P{<i>xx</i>} a character without the <i>xx</i> property
85 \R a newline sequence
86 \s a whitespace character
87 \S a character that is not a whitespace character
88 \v a vertical whitespace character
89 \V a character that is not a vertical whitespace character
90 \w a "word" character
91 \W a "non-word" character
92 \X an extended Unicode sequence
93 </pre>
94 In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
95 </P>
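<P>
The ASCII-only behaviour of \d, \s and \w can be approximated outside PCRE. The
sketch below assumes Python's <b>re</b> module, where the <b>re.ASCII</b> flag
gives the ASCII-only behaviour described above and the default for text
patterns is Unicode-aware. The sample string is made up for illustration.
</P>

```python
import re

text = "caf\u00e9 42 na\u00efve"  # "café 42 naïve"

# re.ASCII restricts \w and \d to ASCII, like the PCRE behaviour described above.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Python's default for str patterns is Unicode-aware, so accented letters count.
unicode_words = re.findall(r"\w+", text)

digits = re.findall(r"\d+", text, re.ASCII)
non_space = re.findall(r"\S+", text, re.ASCII)
```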
96 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTY CODES FOR \p and \P</a><br>
97 <P>
98 <pre>
99 C Other
100 Cc Control
101 Cf Format
102 Cn Unassigned
103 Co Private use
104 Cs Surrogate
105
106 L Letter
107 Ll Lower case letter
108 Lm Modifier letter
109 Lo Other letter
110 Lt Title case letter
111 Lu Upper case letter
112 L&amp; Ll, Lu, or Lt
113
114 M Mark
115 Mc Spacing mark
116 Me Enclosing mark
117 Mn Non-spacing mark
118
119 N Number
120 Nd Decimal number
121 Nl Letter number
122 No Other number
123
124 P Punctuation
125 Pc Connector punctuation
126 Pd Dash punctuation
127 Pe Close punctuation
128 Pf Final punctuation
129 Pi Initial punctuation
130 Po Other punctuation
131 Ps Open punctuation
132
133 S Symbol
134 Sc Currency symbol
135 Sk Modifier symbol
136 Sm Mathematical symbol
137 So Other symbol
138
139 Z Separator
140 Zl Line separator
141 Zp Paragraph separator
142 Zs Space separator
143 </PRE>
144 </P>
145 <br><a name="SEC6" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
146 <P>
147 Arabic,
148 Armenian,
149 Balinese,
150 Bengali,
151 Bopomofo,
152 Braille,
153 Buginese,
154 Buhid,
155 Canadian_Aboriginal,
156 Carian,
157 Cham,
158 Cherokee,
159 Common,
160 Coptic,
161 Cuneiform,
162 Cypriot,
163 Cyrillic,
164 Deseret,
165 Devanagari,
166 Ethiopic,
167 Georgian,
168 Glagolitic,
169 Gothic,
170 Greek,
171 Gujarati,
172 Gurmukhi,
173 Han,
174 Hangul,
175 Hanunoo,
176 Hebrew,
177 Hiragana,
178 Inherited,
179 Kannada,
180 Katakana,
181 Kayah_Li,
182 Kharoshthi,
183 Khmer,
184 Lao,
185 Latin,
186 Lepcha,
187 Limbu,
188 Linear_B,
189 Lycian,
190 Lydian,
191 Malayalam,
192 Mongolian,
193 Myanmar,
194 New_Tai_Lue,
195 Nko,
196 Ogham,
197 Old_Italic,
198 Old_Persian,
199 Ol_Chiki,
200 Oriya,
201 Osmanya,
202 Phags_Pa,
203 Phoenician,
204 Rejang,
205 Runic,
206 Saurashtra,
207 Shavian,
208 Sinhala,
209 Sundanese,
210 Syloti_Nagri,
211 Syriac,
212 Tagalog,
213 Tagbanwa,
214 Tai_Le,
215 Tamil,
216 Telugu,
217 Thaana,
218 Thai,
219 Tibetan,
220 Tifinagh,
221 Ugaritic,
222 Vai,
223 Yi.
224 </P>
225 <br><a name="SEC7" href="#TOC1">CHARACTER CLASSES</a><br>
226 <P>
227 <pre>
228 [...] positive character class
229 [^...] negative character class
230 [x-y] range (can be used for hex characters)
231 [[:xxx:]] positive POSIX named set
232 [[:^xxx:]] negative POSIX named set
233
234 alnum alphanumeric
235 alpha alphabetic
236 ascii 0-127
237 blank space or tab
238 cntrl control character
239 digit decimal digit
240 graph printing, excluding space
241 lower lower case letter
242 print printing, including space
243 punct printing, excluding alphanumeric
244 space whitespace
245 upper upper case letter
246 word same as \w
247 xdigit hexadecimal digit
248 </pre>
249 In PCRE, POSIX character set names recognize only ASCII characters. You can use
250 \Q...\E inside a character class.
251 </P>
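<P>
A minimal sketch of positive and negative classes and of a range written with
hex escapes follows. It assumes Python's <b>re</b> module in place of PCRE.
Python's <b>re</b> has no POSIX named sets such as [[:xdigit:]], so an explicit
class is written out instead.
</P>

```python
import re

# Positive class with a range given as hex escapes: \x30-\x39 is 0-9.
digit_runs = re.findall(r"[\x30-\x39]+", "a1b22c333")

# Negative class: delete everything that is not an ASCII digit.
digits_only = re.sub(r"[^0-9]", "", "+44 (0)1223")

# Explicit equivalent of the xdigit set, since [[:xdigit:]] is not available here.
hex_words = re.findall(r"\b[0-9A-Fa-f]+\b", "ff 1A zz beef")
```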
252 <br><a name="SEC8" href="#TOC1">QUANTIFIERS</a><br>
253 <P>
254 <pre>
255 ? 0 or 1, greedy
256 ?+ 0 or 1, possessive
257 ?? 0 or 1, lazy
258 * 0 or more, greedy
259 *+ 0 or more, possessive
260 *? 0 or more, lazy
261 + 1 or more, greedy
262 ++ 1 or more, possessive
263 +? 1 or more, lazy
264 {n} exactly n
265 {n,m} at least n, no more than m, greedy
266 {n,m}+ at least n, no more than m, possessive
267 {n,m}? at least n, no more than m, lazy
268 {n,} n or more, greedy
269 {n,}+ n or more, possessive
270 {n,}? n or more, lazy
271 </PRE>
272 </P>
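<P>
The difference between greedy and lazy quantifiers is easiest to see on a small
subject string. The sketch below assumes Python's <b>re</b> module as a
stand-in for PCRE. Possessive quantifiers are omitted because older Python
versions do not accept them.
</P>

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m}: at least 2 and no more than 3 digits per match.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")
```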
273 <br><a name="SEC9" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
274 <P>
275 <pre>
276   \b          word boundary (only ASCII letters recognized)
277   \B          not a word boundary
278   ^           start of subject
279                also after internal newline in multiline mode
280   \A          start of subject
281   $           end of subject
282                also before newline at end of subject
283                also before internal newline in multiline mode
284   \Z          end of subject
285                also before newline at end of subject
286   \z          end of subject
287   \G          first matching position in subject
288 </PRE>
289 </P>
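<P>
The effect of multiline mode on ^ and $ can be sketched as follows, assuming
Python's <b>re</b> module in place of PCRE. Only ^, $ and \b are shown. In
Python's <b>re</b>, \Z matches only at the very end of the subject (like \z
above) and \G is not available, so those are left out.
</P>

```python
import re

subject = "first line\nsecond line\n"

# Default mode: ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# Multiline mode: ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# Default mode: $ matches at the end, and before a newline at the end.
ends_default = re.findall(r"\w+$", subject)

# Multiline mode: $ also matches before each internal newline.
ends_multiline = re.findall(r"\w+$", subject, re.MULTILINE)

# \b: "cat" only as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
```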
290 <br><a name="SEC10" href="#TOC1">MATCH POINT RESET</a><br>
291 <P>
292 <pre>
293 \K reset start of match
294 </PRE>
295 </P>
296 <br><a name="SEC11" href="#TOC1">ALTERNATION</a><br>
297 <P>
298 <pre>
299 expr|expr|expr...
300 </PRE>
301 </P>
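<P>
Alternatives are tried from left to right, and the first one that leads to a
match is used, even if a later one would match more text. A small sketch,
assuming Python's <b>re</b> module (which behaves the same way here) in place
of PCRE:
</P>

```python
import re

# The first alternative that matches wins, so "a" is chosen over "ab".
first_wins = re.search(r"a|ab", "ab").group(0)

# Reordering the alternatives changes the result.
longer_first = re.search(r"ab|a", "ab").group(0)

# Any number of alternatives can be listed.
pets = re.findall(r"cat|dog|bird", "dog, cat, fish, bird")
```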
302 <br><a name="SEC12" href="#TOC1">CAPTURING</a><br>
303 <P>
304 <pre>
305   (...)           capturing group
306   (?&#60;name&#62;...)   named capturing group (Perl)
307   (?'name'...)    named capturing group (Perl)
308   (?P&#60;name&#62;...)  named capturing group (Python)
309   (?:...)         non-capturing group
310   (?|...)         non-capturing group; reset group numbers for
311                    capturing groups in each alternative
312 </PRE>
313 </P>
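<P>
A short sketch of numbered, named and non-capturing groups, assuming Python's
<b>re</b> module in place of PCRE. Only the Python-style
(?P&#60;name&#62;...) form is used, because Python's <b>re</b> does not accept
the Perl-style named groups or (?|...). The date string is made up for
illustration.
</P>

```python
import re

# Two named capturing groups, then a non-capturing group for the day.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})")
m = date.search("released 2009-04-11")

year = m.group("year")      # access by name
month = m.group(2)          # named groups are also numbered
all_groups = m.groups()     # the non-capturing group is not included
```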
314 <br><a name="SEC13" href="#TOC1">ATOMIC GROUPS</a><br>
315 <P>
316 <pre>
317 (?&#62;...) atomic, non-capturing group
318 </PRE>
319 </P>
320 <br><a name="SEC14" href="#TOC1">COMMENT</a><br>
321 <P>
322 <pre>
323 (?#....) comment (not nestable)
324 </PRE>
325 </P>
326 <br><a name="SEC15" href="#TOC1">OPTION SETTING</a><br>
327 <P>
328 <pre>
329 (?i) caseless
330 (?J) allow duplicate names
331 (?m) multiline
332 (?s) single line (dotall)
333 (?U) default ungreedy (lazy)
334 (?x) extended (ignore white space)
335 (?-...) unset option(s)
336 </pre>
337 The following is recognized only at the start of a pattern or after one of the
338 newline-setting options with similar syntax:
339 <pre>
340 (*UTF8) set UTF-8 mode
341 </PRE>
342 </P>
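<P>
Inline option settings can be sketched with Python's <b>re</b> module, which is
assumed here as a stand-in for PCRE. It accepts (?i), (?m), (?s) and (?x) at
the start of a pattern, but not (?J), (?U) or (*UTF8), so only the first four
are shown.
</P>

```python
import re

# (?i) caseless
caseless = re.search(r"(?i)pcre", "The PCRE library") is not None

# (?s) dotall: "." also matches a newline
dotall = re.search(r"(?s)a.b", "a\nb") is not None
no_dotall = re.search(r"a.b", "a\nb") is not None

# (?m) multiline: ^ matches after internal newlines
line_starts = re.findall(r"(?m)^\w+", "one\ntwo")

# (?x) extended: white space in the pattern is ignored, # starts a comment
extended = re.search(r"(?x) \d+ \s* px  # number, optional space, unit", "12 px")
extended_text = extended.group(0)
```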
343 <br><a name="SEC16" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
344 <P>
345 <pre>
346 (?=...) positive look ahead
347 (?!...) negative look ahead
348 (?&#60;=...) positive look behind
349 (?&#60;!...) negative look behind
350 </pre>
351 Each top-level branch of a look behind must be of a fixed length.
352 </P>
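<P>
The four assertions, and the fixed-length rule for look behind, can be sketched
as follows. Python's <b>re</b> module is assumed in place of PCRE. Its rule is
slightly stricter (the whole look behind must have one fixed width), but a
variable-length look behind such as (?&#60;=a+) is rejected in both.
</P>

```python
import re

# Positive look ahead: digits that are followed by "px".
px_values = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative look ahead: "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Positive look behind: digits that are preceded by "$".
prices = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

# Negative look behind: whole numbers that are not preceded by "$".
plain = re.findall(r"(?<!\$)\b\d+", "cost $15, qty 3")

# A look behind of variable length is refused when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False
```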
353 <br><a name="SEC17" href="#TOC1">BACKREFERENCES</a><br>
354 <P>
355 <pre>
356 \n reference by number (can be ambiguous)
357 \gn reference by number
358 \g{n} reference by number
359 \g{-n} relative reference by number
360 \k&#60;name&#62; reference by name (Perl)
361 \k'name' reference by name (Perl)
362 \g{name} reference by name (Perl)
363 \k{name} reference by name (.NET)
364 (?P=name) reference by name (Python)
365 </PRE>
366 </P>
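<P>
Two of the forms above, \n and (?P=name), can be tried with Python's <b>re</b>
module, which is assumed here as a stand-in for PCRE. The \g and \k forms are
not accepted inside Python patterns and are therefore not shown.
</P>

```python
import re

# \1 refers back to whatever group 1 matched: find a doubled word.
doubled = re.compile(r"\b(\w+) \1\b")
repeated_word = doubled.search("it was the the best").group(1)

# (?P=name) refers back by name: the closing quote must equal the opening one.
quoted = re.compile(r"(?P<q>['\"]).*?(?P=q)")
quoted_text = quoted.search("say 'hi' now").group(0)
mismatched = quoted.search("say 'hi\" now")
```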
367 <br><a name="SEC18" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
368 <P>
369 <pre>
370 (?R) recurse whole pattern
371 (?n) call subpattern by absolute number
372 (?+n) call subpattern by relative number
373 (?-n) call subpattern by relative number
374 (?&amp;name) call subpattern by name (Perl)
375 (?P&#62;name) call subpattern by name (Python)
376 \g&#60;name&#62; call subpattern by name (Oniguruma)
377 \g'name' call subpattern by name (Oniguruma)
378 \g&#60;n&#62; call subpattern by absolute number (Oniguruma)
379 \g'n' call subpattern by absolute number (Oniguruma)
380 \g&#60;+n&#62; call subpattern by relative number (PCRE extension)
381 \g'+n' call subpattern by relative number (PCRE extension)
382 \g&#60;-n&#62; call subpattern by relative number (PCRE extension)
383 \g'-n' call subpattern by relative number (PCRE extension)
384 </PRE>
385 </P>
386 <br><a name="SEC19" href="#TOC1">CONDITIONAL PATTERNS</a><br>
387 <P>
388 <pre>
389 (?(condition)yes-pattern)
390 (?(condition)yes-pattern|no-pattern)
391
392 (?(n)... absolute reference condition
393 (?(+n)... relative reference condition
394 (?(-n)... relative reference condition
395 (?(&#60;name&#62;)... named reference condition (Perl)
396 (?('name')... named reference condition (Perl)
397 (?(name)... named reference condition (PCRE)
398 (?(R)... overall recursion condition
399 (?(Rn)... specific group recursion condition
400 (?(R&amp;name)... specific recursion condition
401 (?(DEFINE)... define subpattern for reference
402 (?(assert)... assertion condition
403 </PRE>
404 </P>
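<P>
The absolute reference condition (?(n)... can be sketched as follows, assuming
Python's <b>re</b> module in place of PCRE. Python's <b>re</b> supports only
the numbered and plain-name conditions, not the recursion, DEFINE or assertion
conditions. The address pattern is deliberately simplistic and is made up for
illustration.
</P>

```python
import re

# Group 1 optionally matches "<". The condition (?(1)>) then requires a
# closing ">" only if group 1 took part in the match.
addr = re.compile(r"^(<)?(\w+@\w+\.\w+)(?(1)>)$")

bracketed = addr.match("<user@host.org>") is not None
bare = addr.match("user@host.org") is not None
missing_close = addr.match("<user@host.org") is not None
stray_close = addr.match("user@host.org>") is not None
```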
405 <br><a name="SEC20" href="#TOC1">BACKTRACKING CONTROL</a><br>
406 <P>
407 The following act as soon as they are reached:
408 <pre>
409 (*ACCEPT) force successful match
410 (*FAIL) force backtrack; synonym (*F)
411 </pre>
412 The following act only when a subsequent match failure causes a backtrack to
413 reach them. They all force a match failure, but they differ in what happens
414 afterwards. Those that advance the start-of-match point do so only if the
415 pattern is not anchored.
416 <pre>
417 (*COMMIT) overall failure, no advance of starting point
418 (*PRUNE) advance to next starting character
419 (*SKIP) advance start to current matching position
420 (*THEN) local failure, backtrack to next alternation
421 </PRE>
422 </P>
423 <br><a name="SEC21" href="#TOC1">NEWLINE CONVENTIONS</a><br>
424 <P>
425 These are recognized only at the very start of the pattern or after a
426 (*BSR_...) or (*UTF8) option.
427 <pre>
428 (*CR) carriage return only
429 (*LF) linefeed only
430 (*CRLF) carriage return followed by linefeed
431 (*ANYCRLF) all three of the above
432 (*ANY) any Unicode newline sequence
433 </PRE>
434 </P>
435 <br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
436 <P>
437 These are recognized only at the very start of the pattern or after a
438 (*...) option that sets the newline convention or UTF-8 mode.
439 <pre>
440 (*BSR_ANYCRLF) CR, LF, or CRLF
441 (*BSR_UNICODE) any Unicode newline sequence
442 </PRE>
443 </P>
444 <br><a name="SEC23" href="#TOC1">CALLOUTS</a><br>
445 <P>
446 <pre>
447 (?C) callout
448 (?Cn) callout with data n
449 </PRE>
450 </P>
451 <br><a name="SEC24" href="#TOC1">SEE ALSO</a><br>
452 <P>
453 <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
454 <b>pcrematching</b>(3), <b>pcre</b>(3).
455 </P>
456 <br><a name="SEC25" href="#TOC1">AUTHOR</a><br>
457 <P>
458 Philip Hazel
459 <br>
460 University Computing Service
461 <br>
462 Cambridge CB2 3QH, England.
463 <br>
464 </P>
465 <br><a name="SEC26" href="#TOC1">REVISION</a><br>
466 <P>
467 Last updated: 11 April 2009
468 <br>
469 Copyright &copy; 1997-2009 University of Cambridge.
470 <br>
471 <p>
472 Return to the <a href="index.html">PCRE index page</a>.
473 </p>
