Contents of /code/trunk/doc/html/pcreposix.html

Revision 429
Tue Sep 1 16:10:16 2009 UTC by ph10
File MIME type: text/html
File size: 11249 byte(s)
Add pcredemo man page, containing a listing of pcredemo.c.

1 nigel 63 <html>
2     <head>
3     <title>pcreposix specification</title>
4     </head>
5     <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 nigel 75 <h1>pcreposix man page</h1>
7     <p>
8     Return to the <a href="index.html">PCRE index page</a>.
9     </p>
10 ph10 111 <p>
11 nigel 75 This page is part of the PCRE HTML documentation. It was generated automatically
12     from the original man page. If there is any nonsense in it, please consult the
13     man page, in case the conversion went wrong.
14 ph10 111 <br>
15 nigel 63 <ul>
16     <li><a name="TOC1" href="#SEC1">SYNOPSIS OF POSIX API</a>
17     <li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
18     <li><a name="TOC3" href="#SEC3">COMPILING A PATTERN</a>
19     <li><a name="TOC4" href="#SEC4">MATCHING NEWLINE CHARACTERS</a>
20     <li><a name="TOC5" href="#SEC5">MATCHING A PATTERN</a>
21     <li><a name="TOC6" href="#SEC6">ERROR MESSAGES</a>
22 nigel 75 <li><a name="TOC7" href="#SEC7">MEMORY USAGE</a>
23 nigel 63 <li><a name="TOC8" href="#SEC8">AUTHOR</a>
24 ph10 99 <li><a name="TOC9" href="#SEC9">REVISION</a>
25 nigel 63 </ul>
26     <br><a name="SEC1" href="#TOC1">SYNOPSIS OF POSIX API</a><br>
27     <P>
28     <b>#include &#60;pcreposix.h&#62;</b>
29     </P>
30     <P>
31     <b>int regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
32     <b>int <i>cflags</i>);</b>
33     </P>
34     <P>
35     <b>int regexec(regex_t *<i>preg</i>, const char *<i>string</i>,</b>
36     <b>size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
37     </P>
38     <P>
39     <b>size_t regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
40     <b>char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
41     </P>
42     <P>
43     <b>void regfree(regex_t *<i>preg</i>);</b>
44     </P>
45     <br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
46     <P>
47     This set of functions provides a POSIX-style API to the PCRE regular expression
48     package. See the
49     <a href="pcreapi.html"><b>pcreapi</b></a>
50 nigel 77 documentation for a description of PCRE's native API, which contains much
51     additional functionality.
52 nigel 63 </P>
53     <P>
54     The functions described here are just wrapper functions that ultimately call
55     the PCRE native API. Their prototypes are defined in the <b>pcreposix.h</b>
56     header file, and on Unix systems the library itself is called
57     <b>pcreposix.a</b>, so can be accessed by adding <b>-lpcreposix</b> to the
58 nigel 75 command for linking an application that uses them. Because the POSIX functions
59     call the native ones, it is also necessary to add <b>-lpcre</b>.
60 nigel 63 </P>
61     <P>
62 ph10 392 I have implemented only those POSIX option bits that can be reasonably mapped
63     to PCRE native options. In addition, the option REG_EXTENDED is defined with
64     the value zero. This has no effect, but since programs that are written to the
65     POSIX interface often use it, this makes it easier to slot in PCRE as a
66     replacement library. Other POSIX options are not even defined.
67 nigel 63 </P>
68     <P>
69     When PCRE is called via these functions, it is only the API that is POSIX-like
70     in style. The syntax and semantics of the regular expressions themselves are
71     still those of Perl, subject to the setting of various PCRE options, as
72 nigel 69 described below. "POSIX-like in style" means that the API approximates to the
73     POSIX definition; it is not fully POSIX-compatible, and in multi-byte encoding
74     domains it is probably even less compatible.
75 nigel 63 </P>
76     <P>
77     The header for these functions is supplied as <b>pcreposix.h</b> to avoid any
78     potential clash with other POSIX libraries. It can, of course, be renamed or
79     aliased as <b>regex.h</b>, which is the "correct" name. It provides two
80     structure types, <i>regex_t</i> for compiled internal forms, and
81     <i>regmatch_t</i> for returning captured substrings. It also defines some
82     constants whose names start with "REG_"; these are used for setting options and
83     identifying error codes.
84     </P>
87 nigel 63 <br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
88     <P>
89     The function <b>regcomp()</b> is called to compile a pattern into an
90     internal form. The pattern is a C string terminated by a binary zero, and
91     is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
92 nigel 75 to a <b>regex_t</b> structure that is used as a base for storing information
93 nigel 87 about the compiled regular expression.
94 nigel 63 </P>
95     <P>
96     The argument <i>cflags</i> is either zero, or contains one or more of the bits
97     defined by the following macros:
98     <pre>
99 nigel 77 REG_DOTALL
100     </pre>
101 nigel 87 The PCRE_DOTALL option is set when the regular expression is passed for
102     compilation to the native function. Note that REG_DOTALL is not part of the
103     POSIX standard.
104 nigel 77 <pre>
105 nigel 63 REG_ICASE
106 nigel 75 </pre>
107 nigel 87 The PCRE_CASELESS option is set when the regular expression is passed for
108     compilation to the native function.
109 nigel 63 <pre>
110     REG_NEWLINE
111 nigel 75 </pre>
112 nigel 87 The PCRE_MULTILINE option is set when the regular expression is passed for
113     compilation to the native function. Note that this does <i>not</i> mimic the
114     defined POSIX behaviour for REG_NEWLINE (see the following section).
115     <pre>
116     REG_NOSUB
117     </pre>
118     The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is passed
119     for compilation to the native function. In addition, when a pattern that is
120     compiled with this flag is passed to <b>regexec()</b> for matching, the
121     <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
122     are returned.
123     <pre>
124     REG_UTF8
125     </pre>
126     The PCRE_UTF8 option is set when the regular expression is passed for
127     compilation to the native function. This causes the pattern itself and all data
128     strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF8
129     is not part of the POSIX standard.
130 nigel 63 </P>
131     <P>
132     In the absence of these flags, no options are passed to the native function.
133     This means that the regex is compiled with PCRE default semantics. In
134     particular, the way it handles newline characters in the subject string is the
135     Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
136     <i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
137     newlines are matched by . (they aren't) or by a negative class such as [^a]
138     (they are).
139     </P>
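<P>
As a rough illustration of combining <i>cflags</i> bits, the sketch below ORs
REG_ICASE with REG_NOSUB to get a plain yes/no test. The helper name
<b>matches_icase()</b> is hypothetical. The code includes <b>regex.h</b> on the
assumption that <b>pcreposix.h</b> has been renamed or aliased under that name;
otherwise include <b>pcreposix.h</b> directly. REG_EXTENDED is passed only so
that the same source also builds against other POSIX libraries.
</P>

```c
#include <stddef.h>
#include <regex.h>   /* assumption: pcreposix.h installed under this name */

/* Hypothetical helper: returns 1 if `text` matches `pattern` ignoring case,
   0 if it does not, and -1 if the pattern fails to compile. */
static int matches_icase(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is wanted, so no captures are set up. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB, nmatch and pmatch are ignored by regexec(). */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```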
140     <P>
141     The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
142     <i>preg</i> structure is filled in on success, and one member of the structure
143     is public: <i>re_nsub</i> contains the number of capturing subpatterns in
144     the regular expression. Various error codes are defined in the header file.
145     </P>
146 ph10 429 <P>
147     NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
148     use the contents of the <i>preg</i> structure. If, for example, you pass it to
149     <b>regexec()</b>, the result is undefined and your program is likely to crash.
150     </P>
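<P>
The rule in the NOTE above might be enforced with a small wrapper along the
following lines. This is only a sketch: <b>compile_checked()</b> is a
hypothetical name, the 128-byte message buffer is an arbitrary choice, and
<b>regex.h</b> is included on the assumption that <b>pcreposix.h</b> has been
aliased under that name.
</P>

```c
#include <stdio.h>
#include <regex.h>   /* assumption: pcreposix.h installed under this name */

/* Hypothetical wrapper: compiles `pattern` into `re`. Returns 0 on success.
   On failure it reports the error and returns -1; the caller must then
   neither match with `re` nor free it. */
static int compile_checked(regex_t *re, const char *pattern, int cflags)
{
    int rc = regcomp(re, pattern, cflags | REG_EXTENDED);

    if (rc != 0) {
        char msg[128];   /* arbitrary size; longer messages are truncated */
        regerror(rc, re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return -1;
    }
    return 0;
}
```

On success, <i>re_nsub</i> can be read from the structure, and the structure
should eventually be passed to <b>regfree()</b>.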
151 nigel 63 <br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
152     <P>
153     This area is not simple, because POSIX and Perl take different views of things.
154     It is not possible to get PCRE to obey POSIX semantics, but then PCRE was never
155     intended to be a POSIX engine. The following table lists the different
156     possibilities for matching newline characters in PCRE:
157     <pre>
158                                     Default   Change with
159 nigel 75
160 nigel 63 . matches newline          no        PCRE_DOTALL
161          newline matches [^a]       yes       not changeable
162          $ matches \n at end        yes       PCRE_DOLLAR_ENDONLY
163          $ matches \n in middle     no        PCRE_MULTILINE
164          ^ matches \n in middle     no        PCRE_MULTILINE
165 nigel 75 </pre>
166 nigel 63 This is the equivalent table for POSIX:
167     <pre>
168                                     Default   Change with
169 nigel 75
170          . matches newline          yes       REG_NEWLINE
171          newline matches [^a]       yes       REG_NEWLINE
172          $ matches \n at end        no        REG_NEWLINE
173          $ matches \n in middle     no        REG_NEWLINE
174          ^ matches \n in middle     no        REG_NEWLINE
175     </pre>
176 nigel 63 PCRE's behaviour is the same as Perl's, except that there is no equivalent for
177 nigel 75 PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is no way to stop
178 nigel 63 newline from matching [^a].
179     </P>
180     <P>
181     The default POSIX newline handling can be obtained by setting PCRE_DOTALL and
182 nigel 75 PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE behave exactly as for the
183 nigel 63 REG_NEWLINE action.
184     </P>
185     <br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
186     <P>
187 nigel 75 The function <b>regexec()</b> is called to match a compiled pattern <i>preg</i>
188 ph10 345 against a given <i>string</i>, which is by default terminated by a zero byte
189     (but see REG_STARTEND below), subject to the options in <i>eflags</i>. These can
190     be:
191 nigel 63 <pre>
192     REG_NOTBOL
193 nigel 75 </pre>
194 nigel 63 The PCRE_NOTBOL option is set when calling the underlying PCRE matching
195     function.
196     <pre>
197 ph10 392 REG_NOTEMPTY
198     </pre>
199     The PCRE_NOTEMPTY option is set when calling the underlying PCRE matching
200     function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
201     setting this option can give more POSIX-like behaviour in some situations.
202     <pre>
203 nigel 63 REG_NOTEOL
204 nigel 75 </pre>
205 nigel 63 The PCRE_NOTEOL option is set when calling the underlying PCRE matching
206     function.
207 ph10 345 <pre>
208     REG_STARTEND
209     </pre>
210     The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
211     to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
212     (there need not actually be a NUL at that location), regardless of the value of
213     <i>nmatch</i>. This is a BSD extension, compatible with but not specified by
214     IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
215     intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
216     not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
217     how it is matched.
218 nigel 63 </P>
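<P>
One common use of <i>eflags</i> is scanning a string for successive matches:
after the first match the search resumes part-way through the subject, so
REG_NOTBOL is set to stop ^ from matching at the resumption point. The sketch
below illustrates this under the assumption that <b>pcreposix.h</b> has been
aliased as <b>regex.h</b>; <b>count_matches()</b> is a hypothetical helper, and
the handling of empty matches is one simple choice among several.
</P>

```c
#include <regex.h>   /* assumption: pcreposix.h installed under this name */

/* Hypothetical helper: counts non-overlapping matches of `pattern` in `text`.
   Returns -1 if the pattern fails to compile. */
static int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, text, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step one character forward to guarantee progress. */
            if (text[m.rm_eo] == '\0')
                break;
            text += m.rm_eo + 1;
        } else {
            text += m.rm_eo;
        }
        /* Later searches do not begin at the true start of the subject. */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```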
219     <P>
220 nigel 87 If the pattern was compiled with the REG_NOSUB flag, no data about any matched
221     strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
222     <b>regexec()</b> are ignored.
223 nigel 63 </P>
224     <P>
225 nigel 87 Otherwise, the portion of the string that was matched, and also any captured
226     substrings, are returned via the <i>pmatch</i> argument, which points to an
227     array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
228     members <i>rm_so</i> and <i>rm_eo</i>. These contain the offset to the first
229     character of each substring and the offset to the first character after the end
230     of each substring, respectively. The 0th element of the vector relates to the
231     entire portion of <i>string</i> that was matched; subsequent elements relate to
232     the capturing subpatterns of the regular expression. Unused entries in the
233     array have both structure members set to -1.
234     </P>
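<P>
The offsets in <i>pmatch</i> can be turned into substrings as sketched below.
The helper <b>capture()</b> and its limit of ten <i>regmatch_t</i> entries are
illustrative assumptions, as is the use of <b>regex.h</b> as an alias for
<b>pcreposix.h</b>.
</P>

```c
#include <string.h>
#include <regex.h>   /* assumption: pcreposix.h installed under this name */

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: copies substring number `group` of the first match
   into `out` (truncating if necessary). Returns the number of bytes copied,
   or -1 on compile error, no match, or an unset group. */
static int capture(const char *pattern, const char *text, size_t group,
                   char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (group >= MAX_GROUPS || outsize == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Entry 0 is the whole match; unused entries have rm_so set to -1. */
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[group].rm_so != -1) {
        len = (int)(pm[group].rm_eo - pm[group].rm_so);
        if ((size_t)len >= outsize)
            len = (int)outsize - 1;
        memcpy(out, text + pm[group].rm_so, (size_t)len);
        out[len] = '\0';
    }

    regfree(&re);
    return len;
}
```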
235     <P>
236 nigel 63 A successful match yields a zero return; various error codes are defined in the
237     header file, of which REG_NOMATCH is the "expected" failure code.
238     </P>
239     <br><a name="SEC6" href="#TOC1">ERROR MESSAGES</a><br>
240     <P>
241     The <b>regerror()</b> function maps a non-zero error code from either
242     <b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
243     NULL, the error should have arisen from the use of that structure. A message
244     terminated by a binary zero is placed in <i>errbuf</i>. The length of the
245     message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
246     function is the size of buffer needed to hold the whole message.
247     </P>
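<P>
Because the yield is the buffer size needed for the whole message, a caller
can ask for the size first and then allocate exactly that much. The following
is a sketch of that two-call approach, not the only way to use the function;
<b>error_message()</b> is a hypothetical helper, and <b>regex.h</b> is assumed
to be an alias for <b>pcreposix.h</b>.
</P>

```c
#include <stdlib.h>
#include <regex.h>   /* assumption: pcreposix.h installed under this name */

/* Hypothetical helper: returns a malloc'd copy of the message for `errcode`,
   or NULL if allocation fails. The caller frees the result. */
static char *error_message(int errcode, const regex_t *re)
{
    /* First call: a zero-sized buffer, so nothing is written; the yield is
       the size needed, including the terminating zero. */
    size_t needed = regerror(errcode, re, NULL, 0);
    char *msg = malloc(needed);

    /* Second call: fill the exactly-sized buffer. */
    if (msg != NULL)
        regerror(errcode, re, msg, needed);
    return msg;
}
```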
248 nigel 75 <br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
249 nigel 63 <P>
250     Compiling a regular expression causes memory to be allocated and associated
251     with the <i>preg</i> structure. The function <b>regfree()</b> frees all such
252     memory, after which <i>preg</i> may no longer be used as a compiled expression.
253     </P>
254     <br><a name="SEC8" href="#TOC1">AUTHOR</a><br>
255     <P>
256 nigel 77 Philip Hazel
257 nigel 63 <br>
258 ph10 99 University Computing Service
259 nigel 63 <br>
260 nigel 93 Cambridge CB2 3QH, England.
261 ph10 99 <br>
262 nigel 63 </P>
263 ph10 99 <br><a name="SEC9" href="#TOC1">REVISION</a><br>
264 nigel 63 <P>
265 ph10 429 Last updated: 15 August 2009
266 nigel 63 <br>
267 ph10 392 Copyright &copy; 1997-2009 University of Cambridge.
268 ph10 99 <br>
269 nigel 75 <p>
270     Return to the <a href="index.html">PCRE index page</a>.
271     </p>

