Contents of /code/trunk/doc/pcresyntax.3

Revision 412
Sat Apr 11 10:34:37 2009 UTC by ph10
File size: 10745 byte(s)
Add support for (*UTF8).

.TH PCRESYNTAX 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains just a quick-reference summary of the
syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex          where x is non-alphanumeric is a literal x
  \eQ...\eE     treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea          alarm, that is, the BEL character (hex 07)
  \ecx         "control-x", where x is any character
  \ee          escape (hex 1B)
  \ef          formfeed (hex 0C)
  \en          newline (hex 0A)
  \er          carriage return (hex 0D)
  \et          tab (hex 09)
  \eddd        character with octal code ddd, or backreference
  \exhh        character with hex code hh
  \ex{hhh..}   character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .           any character except newline;
                in dotall mode, any character whatsoever
  \eC          one byte, even in UTF-8 mode (best avoided)
  \ed          a decimal digit
  \eD          a character that is not a decimal digit
  \eh          a horizontal whitespace character
  \eH          a character that is not a horizontal whitespace character
  \ep{\fIxx\fP}      a character with the \fIxx\fP property
  \eP{\fIxx\fP}      a character without the \fIxx\fP property
  \eR          a newline sequence
  \es          a whitespace character
  \eS          a character that is not a whitespace character
  \ev          a vertical whitespace character
  \eV          a character that is not a vertical whitespace character
  \ew          a "word" character
  \eW          a "non-word" character
  \eX          an extended Unicode sequence
.sp
In PCRE, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII characters.
.
.
.SH "GENERAL CATEGORY PROPERTY CODES FOR \ep and \eP"
.rs
.sp
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate
.sp
  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter
  L&    Ll, Lu, or Lt
.sp
  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark
.sp
  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number
.sp
  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation
.sp
  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol
.sp
  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Balinese,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Inherited,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lycian,
Lydian,
Malayalam,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
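.sp
As an illustration of the greedy and lazy forms listed above, the sketch below
uses Python's re module rather than PCRE itself. That choice is an assumption
made so that the example runs without linking the PCRE library; for these
particular quantifiers the two engines are expected to behave alike. The
subject strings and variable names are invented for the example. Possessive
quantifiers are left out because older Python versions do not support them.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.*>", subject).group(0)

# Lazy: .*? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.*?>", subject).group(0)

# Bounded repetition: {2,3} prefers three digits, {2,3}? prefers two.
bounded_greedy = re.search(r"\d{2,3}", "12345").group(0)
bounded_lazy = re.search(r"\d{2,3}?", "12345").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```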
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary (only ASCII letters recognized)
  \eB          not a word boundary
  ^           start of subject
                also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \eZ          end of subject
                also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
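.sp
The difference between the start-of-subject and multiline behaviour of the
anchors above can be sketched as follows. The example assumes Python's re
module as a stand-in for PCRE; the subject strings and names are hypothetical.
The upper-case Z and lower-case z escapes are deliberately not shown, because
Python's treatment of them may differ from the PCRE meanings given in the
table.

```python
import re

subject = "first\nsecond\n"

# Default mode: ^ matches only at the very start of the subject.
default_starts = re.findall(r"^\w+", subject)

# Multiline mode: ^ also matches after each internal newline.
multiline_starts = re.findall(r"(?m)^\w+", subject)

# $ matches before a newline that is the last character of the subject.
before_final_newline = re.search(r"second$", subject) is not None

# \A stays pinned to the start of the subject even in multiline mode.
subject_start_only = re.findall(r"(?m)\A\w+", subject)

# \b matches between a word character and a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(default_starts, multiline_starts, subject_start_only, whole_words)
```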
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                    capturing groups in each alternative
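.sp
A short sketch of named, numbered, and non-capturing groups follows. It
assumes Python's re module in place of PCRE, so only the Python-style named
group syntax from the table is exercised; the Perl-style forms are not
accepted by Python. The pattern, subject, and variable names are invented for
illustration.

```python
import re

# Two named capturing groups followed by one non-capturing group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})")

m = date.search("released 2009-04-11")

year = m.group("year")      # look a group up by name
month = m.group(2)          # named groups are also numbered
group_count = date.groups   # the (?:...) group is not counted

print(year, month, group_count)
```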
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following is recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*UTF8)         set UTF-8 mode
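.sp
The effect of some of the inline option letters listed above can be sketched
as below. Python's re module is assumed as a stand-in for PCRE, and it accepts
only a subset of these settings: the i, s, and x letters are shown, while the
J and U letters and the (*UTF8) item have no counterpart there. Subject
strings and names are hypothetical.

```python
import re

# (?i): letters match regardless of case.
caseless = re.search(r"(?i)pcre", "Using PCRE").group(0)

# Without (?s), the dot does not match a newline.
dot_default = re.search(r"a.b", "a\nb") is not None

# With (?s), the dot matches any character, including a newline.
dot_all = re.search(r"(?s)a.b", "a\nb") is not None

# (?x): unescaped white space inside the pattern is ignored.
extended = re.search(r"(?x) \d+  \s*  kg ", "12 kg").group(0)

print(caseless, dot_default, dot_all, extended)
```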
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
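.sp
The four assertions and the fixed-length rule can be sketched as follows,
assuming Python's re module instead of PCRE. Python is somewhat stricter: it
requires the whole look behind to have one fixed width, whereas the rule above
applies to each top-level branch separately. The subject strings and names are
invented for the example.

```python
import re

prices = "cost: $40, weight: 40kg, refund: $15"

# Positive look behind: digits that directly follow a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", prices)

# Negative look behind: digit runs not preceded by '$' or another digit.
other_numbers = re.findall(r"(?<![\$\d])\d+", prices)

# Positive look ahead: digits that are directly followed by "kg".
weights = re.findall(r"\d+(?=kg)", prices)

# Negative look ahead: a 'q' that is not followed by 'u'.
lone_q = re.findall(r"q(?!u)", "Iraq quit")

# A look behind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True

print(dollar_amounts, other_numbers, weights, lone_q, variable_length_rejected)
```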
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
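.sp
A backreference matches the same text that the referenced group matched, not
the group's pattern. The sketch below shows the plain numbered form and the
Python-style named form from the table. It assumes Python's re module as a
stand-in for PCRE, which does not accept the g and k escape forms; the
subjects and names are hypothetical.

```python
import re

# Numbered reference: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)

# Named reference: the closing quote must be the same character as the
# opening quote, so the apostrophe inside the string does not end the match.
quoted = re.search(r"""(?P<q>['"]).*?(?P=q)""", "say \"it's\" now").group(0)

print(doubled, quoted)
```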
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...            absolute reference condition
  (?(+n)...           relative reference condition
  (?(-n)...           relative reference condition
  (?(<name>)...       named reference condition (Perl)
  (?('name')...       named reference condition (Perl)
  (?(name)...         named reference condition (PCRE)
  (?(R)...            overall recursion condition
  (?(Rn)...           specific group recursion condition
  (?(R&name)...       specific recursion condition
  (?(DEFINE)...       define subpattern for reference
  (?(assert)...       assertion condition
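.sp
The absolute reference condition can be sketched as below: a closing bracket
is required only if the optional opening bracket (group 1) took part in the
match. The example assumes Python's re module in place of PCRE; Python accepts
only the numbered and plain-name conditions, not the recursion, DEFINE, or
assertion conditions. The pattern and the helper name is_balanced are invented
for illustration.

```python
import re

# Group 1 optionally captures '<'. The condition (?(1)>) then demands a
# closing '>' only when group 1 matched; otherwise it matches nothing.
address = re.compile(r"^(<)?\w+@\w+(?(1)>)$")

def is_balanced(text):
    """Return True if text is an address with matching or no brackets."""
    return address.search(text) is not None

results = [is_balanced(s) for s in
           ["<user@host>", "user@host", "<user@host", "user@host>"]]

print(results)
```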
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)     force successful match
  (*FAIL)       force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)     overall failure, no advance of starting point
  (*PRUNE)      advance to next starting character
  (*SKIP)       advance start to current matching position
  (*THEN)       local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) option.
.sp
  (*CR)         carriage return only
  (*LF)         linefeed only
  (*CRLF)       carriage return followed by linefeed
  (*ANYCRLF)    all three of the above
  (*ANY)        any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)    callout
  (?Cn)   callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
