Revision 1335
Tue May 28 09:13:59 2013 UTC by ph10
File MIME type: text/html
File size: 15629 byte(s)
Final source file tidies for 8.33 release.

<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
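<P>
As a rough illustration of quoting, the sketch below uses Python's <b>re</b>
module as a stand-in for PCRE; that substitution is an assumption of this
example, not something PCRE provides. Python's <b>re</b> does not accept
\Q...\E, so <b>re.escape()</b> is used as the nearest counterpart. The variable
names are hypothetical.
</P>

```python
import re

# A backslash before a non-alphanumeric character makes it a literal.
literal_dot = re.compile(r"a\.b")
wildcard_dot = re.compile(r"a.b")

# Assumption: re.escape() plays the role of \Q...\E here. It
# backslash-escapes every character that could be special.
user_text = "1+1=2?"
quoted = re.compile(re.escape(user_text))
unquoted = re.compile(user_text)

print(bool(literal_dot.fullmatch("a.b")))   # True
print(bool(literal_dot.fullmatch("axb")))   # False
print(bool(wildcard_dot.fullmatch("axb")))  # True
print(bool(quoted.fullmatch("1+1=2?")))     # True
print(bool(unquoted.fullmatch("1+1=2?")))   # False: + and ? act as quantifiers
```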
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
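<P>
The ASCII-only default for \d and friends can be sketched with Python's
<b>re</b> module, used here only as a stand-in for PCRE (an assumption of this
example). Python's <b>re.ASCII</b> flag gives behaviour comparable to PCRE's
default, and omitting it is roughly comparable to setting PCRE_UCP.
</P>

```python
import re

# The digits one, two, three in Arabic-Indic script (Unicode category Nd).
arabic_indic = "\u0661\u0662\u0663"

# re.ASCII restricts \d, \s and \w to ASCII characters, which is
# comparable to what PCRE does by default.
ascii_only = re.fullmatch(r"\d+", arabic_indic, re.ASCII)

# Without re.ASCII, Python str patterns consult Unicode properties,
# roughly like a PCRE pattern compiled with PCRE_UCP.
unicode_aware = re.fullmatch(r"\d+", arabic_indic)

print(ascii_only)           # None
print(bool(unicode_aware))  # True
```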
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
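<P>
A minimal sketch of positive classes, negated classes, and ranges, written
against Python's <b>re</b> module as a stand-in for PCRE (an assumption of this
example). POSIX named sets are not shown because Python's <b>re</b> does not
accept them. The sample text is hypothetical.
</P>

```python
import re

text = "ff 1G 0x2a"

# [x-y] ranges: runs of hexadecimal digits.
hex_runs = re.findall(r"[0-9a-fA-F]+", text)

# [^...] negated class: runs of anything that is neither a hex digit
# nor a space.
non_hex = re.findall(r"[^0-9a-fA-F ]+", text)

print(hex_runs)  # ['ff', '1', '0', '2a']
print(non_hex)   # ['G', 'x']
```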
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
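<P>
The difference between greedy and lazy quantifiers can be sketched as follows,
using Python's <b>re</b> module as a stand-in for PCRE (an assumption of this
example). The possessive forms are left out because older Python versions do
not accept them.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as possible, so the match stops at the first '>'.
lazy = re.match(r"<.+?>", subject).group()

# {n,m}: at least 2 and no more than 3 digits, greedy.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)   # <a><b>
print(lazy)     # <a>
print(bounded)  # ['22', '333', '444']
```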
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
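<P>
The following sketch shows ^, $ and \b, again using Python's <b>re</b> module
as a stand-in for PCRE (an assumption of this example). \Z and \z are not shown
because Python's \Z does not behave like PCRE's \Z: it matches only at the very
end of the string.
</P>

```python
import re

subject = "one\ntwo\n"

# ^ matches only at the start of the subject by default ...
first_only = re.findall(r"^\w+", subject)

# ... and also after internal newlines in multiline mode.
every_line = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that ends the subject.
before_final_newline = re.search(r"two$", subject)

# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats")

print(first_only)                  # ['one']
print(every_line)                  # ['one', 'two']
print(bool(before_final_newline))  # True
print(whole_words)                 # ['cat']
```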
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
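<P>
A short sketch of named and non-capturing groups, using the Python-style
(?P&#60;name&#62;...) form from the table. It runs on Python's <b>re</b> module
as a stand-in for PCRE (an assumption of this example); the pattern and group
names are hypothetical.
</P>

```python
import re

# Two named capturing groups, then a non-capturing group for the day.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})")
m = date.fullmatch("2013-05-28")

year = m.group("year")
month = m.group("month")
captured = m.groups()  # the (?:...) group does not appear here

print(year, month)  # 2013 05
print(captured)     # ('2013', '05')
```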
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
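<P>
The inline options (?i), (?s) and (?x) can be sketched with Python's <b>re</b>
module as a stand-in for PCRE (an assumption of this example). Python only
accepts these whole-pattern options at the very start of the pattern, and it
has no counterpart to (?J), (?U) or the (*...) items, so those are not shown.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library")

# (?s): dotall, so '.' also matches a newline.
dotall = re.search(r"(?s)a.b", "a\nb")
no_dotall = re.search(r"a.b", "a\nb")

# (?x): extended, so white space and # comments in the pattern are ignored.
extended = re.search(r"""(?x)
    \d+   # one or more digits
    \s*   # optional white space
    kg    # the unit
""", "12 kg")

print(caseless.group())  # PCRE
print(bool(dotall))      # True
print(no_dotall)         # None
print(extended.group())  # 12 kg
```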
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
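<P>
A minimal sketch of the assertions, using Python's <b>re</b> module as a
stand-in for PCRE (an assumption of this example). Each look behind here is a
single fixed-length character, in line with the fixed-length rule stated above.
The sample strings are hypothetical.
</P>

```python
import re

text = "cost $42, id 17"

# Positive look behind: digits that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: whole numbers not preceded by a dollar sign.
non_prices = re.findall(r"(?<!\$)\b\d+", text)

# Negative look ahead: "foo" not followed by "bar".
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Positive look ahead: "foo" followed by "bar" ("bar" is not consumed).
foo_before_bar = re.search(r"foo(?=bar)", "foobar foobaz")

print(prices)                  # ['42']
print(non_prices)              # ['17']
print(foo_positions)           # [7]
print(foo_before_bar.span())   # (0, 3)
```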
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
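<P>
Backreferences by number and by name can be sketched as follows, using
Python's <b>re</b> module as a stand-in for PCRE (an assumption of this
example). Only the \n and (?P=name) forms are shown, since Python's <b>re</b>
does not accept the \g and \k forms inside a pattern. The sample strings are
hypothetical.
</P>

```python
import re

# \1 must repeat whatever group 1 matched: this finds doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")

# (?P=quote) must repeat the opening quote character, so the
# apostrophe inside the double-quoted text does not end the match.
subject = "He said \"don't\" loudly"
quoted = re.search(r"""(?P<quote>['"]).*?(?P=quote)""", subject)

print(doubled)         # ['is', 'it']
print(quoted.group())  # "don't"
```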
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
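<P>
The absolute reference condition (?(n)... can be sketched with Python's
<b>re</b> module as a stand-in for PCRE (an assumption of this example). The
pattern requires a closing bracket only when group 1 matched an opening one;
the names and sample strings are hypothetical.
</P>

```python
import re

# (<)? optionally captures an opening bracket as group 1.
# (?(1)>) then demands a closing bracket only if group 1 took part
# in the match; there is no no-pattern, so otherwise it matches empty.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

samples = ["<user>", "user", "<user", "user>"]
results = {s: bool(bracketed.fullmatch(s)) for s in samples}

print(results)
# {'<user>': True, 'user': True, '<user': False, 'user>': False}
```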
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act immediately they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 26 April 2013
<br>
Copyright &copy; 1997-2013 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>