Contents of /code/trunk/doc/html/pcresyntax.html

Revision 1502
Mon Sep 15 13:56:18 2014 UTC by ph10
File MIME type: text/html
File size: 16694 byte(s)
Files tidied for 8.36-RC1.

<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".
</P>
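<P>
As a rough illustration only, the sketch below checks a few of these escapes
with Python's standard <b>re</b> module rather than with PCRE itself. That
choice is an assumption made for convenience: <b>re</b> is a different engine
that happens to accept the same spelling for \a, \t, \xhh, and three-digit
octal codes. It does not accept \e, \cx, \o{ddd..}, or \x{hhh..}.
</P>

```python
import re

# "A" is hex 41 and octal 101; BEL is hex 07; tab is hex 09.
hex_form = re.fullmatch(r"\x41", "A")
# Three octal digits and no group 101, so this is read as an octal code.
octal_form = re.fullmatch(r"\101", "A")
bel_form = re.fullmatch(r"\a", "\x07")
tab_form = re.fullmatch(r"\t", "\x09")

all_matched = all(m is not None
                  for m in (hex_form, octal_form, bel_form, tab_form))
print(all_matched)
```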
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
</P>
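<P>
The difference between ASCII-only and Unicode-property matching for \d can be
sketched with Python's standard <b>re</b> module. This is an analogy, not PCRE
code: the <b>re.ASCII</b> flag is assumed here to stand in for PCRE's default
behaviour, and Python's default Unicode matching for loosely what PCRE_UCP
enables. The sample text is made up for the example.
</P>

```python
import re

# Assumed sample text: ASCII "1" plus U+0662 (ARABIC-INDIC DIGIT TWO).
text = "a1 b\u0662"

# re.ASCII keeps \d to 0-9, comparable to PCRE's default behaviour.
ascii_digits = re.findall(r"\d", text, re.ASCII)

# Without re.ASCII, Python uses Unicode properties, loosely like PCRE_UCP.
unicode_digits = re.findall(r"\d", text)

print(ascii_digits)
print(unicode_digits)
```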
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Malayalam,
Mandaic,
Manichaean,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
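<P>
A minimal sketch of positive classes, negative classes, and ranges follows. It
uses Python's standard <b>re</b> module as a stand-in, on the assumption that
these three forms behave the same way there. POSIX named sets such as
[[:alpha:]] and \Q...\E are not available in <b>re</b>, so they are not shown.
The sample text is invented for the example.
</P>

```python
import re

text = "PCRE 8.36-RC1"

# Positive class built from three ranges: the hexadecimal digits.
hex_digits = re.findall(r"[0-9A-Fa-f]", text)

# Negative class: anything that is not an ASCII letter or digit.
non_alnum = re.findall(r"[^A-Za-z0-9]", text)

print(hex_digits)
print(non_alnum)
```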
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
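<P>
The effect of greedy and lazy quantifiers can be seen in the short sketch
below. It uses Python's standard <b>re</b> module, which is assumed to treat
the greedy and lazy forms in the same way as PCRE. Possessive forms are left
out because only newer Python releases accept them. The subject strings are
made up for the example.
</P>

```python
import re

subject = "<b>bold</b>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.match(r"<.+?>", subject).group()

# {n,m}: five digits exceed the maximum of four, so the full match fails.
bounded = re.fullmatch(r"\d{2,4}", "12345")

print(greedy)
print(lazy)
print(bounded)
```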
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
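<P>
The sketch below illustrates ^, $, \A, and \b, using Python's standard
<b>re</b> module on the assumption that these four behave as described above.
\Z, \z, and \G are deliberately omitted: Python's \Z matches only at the very
end of the subject, which differs from PCRE's \Z, and \z and \G are not
assumed to be available. The subject strings are invented for the example.
</P>

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, $ still matches before a newline at the end.
default_end = re.search(r"two$", subject)

# In multiline mode, ^ and $ also match around internal newlines.
each_line = re.findall(r"^\w+$", subject, re.MULTILINE)

# \A matches only at the start of the subject, even in multiline mode.
start_only = re.findall(r"\A\w+", subject, re.MULTILINE)

# \b requires a word boundary on both sides of "cat".
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(default_end is not None, each_line, start_only, whole_word)
```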
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
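<P>
A short sketch of numbered, named, and non-capturing groups follows. It uses
Python's standard <b>re</b> module, so only the Python-style
(?P&#60;name&#62;...) spelling is shown. The Perl-style spellings and (?|...)
are not assumed to work there. The pattern and the version string are
hypothetical.
</P>

```python
import re

# Hypothetical pattern: a named group, a non-capturing group, then a plain group.
pattern = re.compile(r"(?P<major>\d+)\.(?:\d+)-(\w+)")
m = pattern.fullmatch("8.36-RC1")

major = m.group("major")   # the named group, which is also group 1
suffix = m.group(2)        # the plain group; (?:\d+) did not take a number
captured = m.groups()

print(major, suffix, captured)
```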
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear.
<pre>
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE_NO_AUTO_POSSESS)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
</P>
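<P>
The inline options (?i), (?s), and (?x) can be tried out with Python's
standard <b>re</b> module, which is assumed here to give them the same meaning
when they appear at the start of a pattern. This is only a sketch: (?J), (?U),
and the (*...) items above are PCRE features and are not shown. The subject
strings are made up.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library")

# (?s): dotall, so "." also matches the newline.
dotall = re.search(r"(?s)a.b", "a\nb")
no_dotall = re.search(r"a.b", "a\nb")

# (?x): extended, so the white space in the pattern is ignored.
extended = re.search(r"(?x) \d+ \s* kb", "size: 16 kb")

print(caseless.group(), dotall is not None, no_dotall, extended.group())
```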
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
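<P>
The four assertions can be compared side by side in the sketch below, which
uses Python's standard <b>re</b> module as a stand-in. One assumption to note:
Python requires the whole look behind to be of fixed width, which is stricter
than the per-branch rule stated above, so the example keeps to single-character
look behinds. The sample text is invented.
</P>

```python
import re

text = "price $42, tax 8%, total $45"

# Positive look behind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: whole numbers that do not follow a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Positive look ahead: digits that are directly followed by a percent sign.
before_percent = re.findall(r"\d+(?=%)", text)

# Negative look ahead: whole numbers that are not followed by a percent sign.
not_before_percent = re.findall(r"\b\d+\b(?!%)", text)

print(after_dollar, not_after_dollar, before_percent, not_before_percent)
```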
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
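<P>
Two of the forms above, \n and (?P=name), can be sketched with Python's
standard <b>re</b> module, which is assumed to handle them as PCRE does. The
\g and \k forms are not shown because <b>re</b> is not assumed to accept them
in patterns. The subject strings are made up for the example.
</P>

```python
import re

# \1 refers back to group 1, so this finds a word that is repeated.
by_number = re.search(r"\b(\w+) \1\b", "this is is a test")

# (?P=quote) must match the same quote character that opened the string.
by_name = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi' now")

print(by_number.group(1))
print(by_name.group())
```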
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
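<P>
An absolute reference condition, (?(n)..., is sketched below using Python's
standard <b>re</b> module, which is assumed to support this one form in the
same way. The recursion, DEFINE, and assertion conditions are PCRE features
and are not shown. The pattern is hypothetical: it requires a closing ">" only
if an opening "&#60;" was matched by group 1.
</P>

```python
import re

# Group 1 is the optional "<". The condition (?(1)>) demands ">" only if
# group 1 took part in the match; otherwise the empty no-pattern is used.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

subjects = ["abc", "<abc>", "<abc", "abc>"]
results = {s: pattern.search(s) is not None for s in subjects}

print(results)
```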
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 January 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
