<!-- Revision 1459, Tue Mar 4 10:45:15 2014 UTC, by ph10: Preparations for next release. -->
<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".
</P>
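<P>
A minimal sketch of the hex and octal escapes listed above. It uses Python's
<b>re</b> module purely as an illustrative stand-in (an assumption of this
example, not part of PCRE); the escapes shown are ones that both engines spell
the same way. The variable names are hypothetical.
</P>

```python
import re

# \x41 is the hex code for "A", \101 is the octal code for "A", \t is a tab.
escape_match = re.fullmatch(r"\x41\101\t", "AA\t")

# \0dd is always read as an octal code: \012 is a newline (hex 0A).
newline_match = re.fullmatch(r"\012", "\n")

print(escape_match is not None, newline_match is not None)
```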
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
</P>
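<P>
The difference between ASCII-only and Unicode-property matching for \d can be
sketched as follows. Python's <b>re</b> module stands in for PCRE here, which
is an assumption of this example: its <b>re.ASCII</b> flag is roughly
comparable to PCRE's default behaviour, and its default Unicode matching is
comparable in spirit to setting PCRE_UCP. The variable names are hypothetical.
</P>

```python
import re

# Arabic-Indic digits 1, 2, 3 (Unicode general category Nd).
arabic_indic = "\u0661\u0662\u0663"

# ASCII-only matching: \d accepts only 0-9, so these digits are rejected.
ascii_only = re.fullmatch(r"\d+", arabic_indic, re.ASCII)

# Unicode-aware matching: \d accepts any decimal digit character.
unicode_aware = re.fullmatch(r"\d+", arabic_indic)

print(ascii_only is None, unicode_aware is not None)
```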
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
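<P>
The greedy and lazy forms above can be contrasted with a short sketch. Python's
<b>re</b> module is used as a stand-in for PCRE (an assumption of this
example); the possessive forms are left out because older Python versions do
not accept them. The variable names are hypothetical.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", subject).group()

# The same distinction applies to bounded repetition.
bounded_greedy = re.search(r"a{2,3}", "aaaa").group()
bounded_lazy = re.search(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```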
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
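<P>
A small sketch of how ^ and $ behave with and without multiline mode. Python's
<b>re</b> module stands in for PCRE here (an assumption of this example). \Z
is deliberately avoided, because Python's \Z matches only at the very end of
the string and so corresponds to PCRE's \z rather than its \Z. The variable
names are hypothetical.
</P>

```python
import re

subject = "one\ntwo\nthree"

# Without multiline mode, ^ and $ anchor to the whole subject, and \w+
# cannot cross the internal newlines, so nothing matches.
single_line = re.findall(r"^\w+$", subject)

# With (?m), ^ also matches after, and $ before, each internal newline.
multi_line = re.findall(r"(?m)^\w+$", subject)

# Even without multiline mode, $ matches before a newline at the very end.
before_final_newline = re.search(r"end$", "the end\n")

print(single_line, multi_line, before_final_newline is not None)
```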
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
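<P>
The Python-style named group and the non-capturing group from the table above
can be sketched as follows, using Python's <b>re</b> module as a stand-in for
PCRE (an assumption of this example). The pattern and variable names are
hypothetical.
</P>

```python
import re

# (?P<year>...) is a named capturing group; (?:-|/) groups the separator
# without capturing it, so the month is group 2 rather than group 3.
date = re.match(r"(?P<year>\d{4})(?:-|/)(\d{2})", "2014-03")

year_by_name = date.group("year")
year_by_number = date.group(1)
month = date.group(2)
all_groups = date.groups()

print(year_by_name, month, all_groups)
```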
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear.
<pre>
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE_NO_AUTO_POSSESS)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
</P>
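<P>
A brief sketch of the inline options (?i), (?s) and (?x), using Python's
<b>re</b> module as a stand-in for PCRE (an assumption of this example). Of
the option letters listed above, Python accepts only i, m, s and x, and it has
no (*...) start-of-pattern items, so those are not shown. The variable names
are hypothetical.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library")

# By default "." does not match a newline; (?s) turns on dotall mode.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?x): white space in the pattern is ignored and "#" starts a comment.
extended = re.search(r"(?x) \d{4} - \d{2}  # year and month", "2014-03")

print(caseless is not None, dot_default is None,
      dot_all is not None, extended is not None)
```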
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
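<P>
The assertions above, including the fixed-length rule for look behind, can be
sketched with Python's <b>re</b> module as a stand-in for PCRE (an assumption
of this example; Python applies a similar fixed-width restriction, though its
exact rules differ from PCRE's). The variable names are hypothetical.
</P>

```python
import re

# Positive look behind: digits that are immediately preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15 and 20")

# Negative look ahead: "foo" that is not followed by "bar".
not_before_bar = re.findall(r"foo(?!bar)", "foobar foobaz")

# A look behind whose length is not fixed is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True

print(after_dollar, not_before_bar, variable_length_rejected)
```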
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
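<P>
A short sketch of a numbered backreference and the Python-style named
backreference from the table above. Python's <b>re</b> module stands in for
PCRE (an assumption of this example); the \g and \k forms are PCRE syntax that
Python does not accept in patterns, so they are not shown. The variable names
are hypothetical.
</P>

```python
import re

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_word = doubled.group(1)

# (?P=quote) must match whichever quote character the named group captured.
quoted = re.search(r"""(?P<quote>["']).*?(?P=quote)""", 'say "hi" now')
quoted_text = quoted.group()

print(doubled_word, quoted_text)
```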
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
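<P>
The absolute reference condition (?(n)...) can be sketched as below: a closing
parenthesis is required only if an opening one was captured. Python's
<b>re</b> module stands in for PCRE (an assumption of this example); it
supports only the numbered and named reference conditions, not the recursion,
DEFINE or assertion forms. The pattern and variable names are hypothetical.
</P>

```python
import re

# Group 1 optionally captures "("; the conditional then demands ")"
# only when group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

with_parens = balanced.match("(42)")
without_parens = balanced.match("42")
missing_close = balanced.match("(42")
stray_close = balanced.match("42)")

print(with_parens is not None, without_parens is not None,
      missing_close is None, stray_close is None)
```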
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 January 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
