| 6958 |
restarted instead. |
restarted instead. |
| 6959 |
|
|
| 6960 |
The pcre_dfa_exec() function operates in an entirely different way, and |
The pcre_dfa_exec() function operates in an entirely different way, and |
| 6961 |
hardly uses recursion at all. The limit on its complexity is the amount |
uses recursion only when there is a regular expression recursion or |
| 6962 |
of workspace it is given. The comments that follow do NOT apply to |
subroutine call in the pattern. This includes the processing of asser- |
| 6963 |
pcre_dfa_exec(); they are relevant only for pcre_exec(). |
tion and "once-only" subpatterns, which are handled like subroutine |
| 6964 |
|
calls. Normally, these are never very deep, and the limit on the com- |
| 6965 |
You can set limits on the number of times that match() is called, both |
plexity of pcre_dfa_exec() is controlled by the amount of workspace it |
| 6966 |
in total and recursively. If the limit is exceeded, an error occurs. |
is given. However, it is possible to write patterns with runaway infi- |
| 6967 |
For details, see the section on extra data for pcre_exec() in the |
nite recursions; such patterns will cause pcre_dfa_exec() to run out of |
| 6968 |
pcreapi documentation. |
stack. At present, there is no protection against this. |
| 6969 |
|
|
| 6970 |
Each time that match() is actually called recursively, it uses memory |
The comments that follow do NOT apply to pcre_dfa_exec(); they are rel- |
| 6971 |
from the process stack. For certain kinds of pattern and data, very |
evant only for pcre_exec(). |
| 6972 |
large amounts of stack may be needed, despite the recognition of "tail |
|
| 6973 |
recursion". You can often reduce the amount of recursion, and there- |
Reducing pcre_exec()'s stack usage |
| 6974 |
fore the amount of stack used, by modifying the pattern that is being |
|
| 6975 |
|
Each time that match() is actually called recursively, it uses memory |
| 6976 |
|
from the process stack. For certain kinds of pattern and data, very |
| 6977 |
|
large amounts of stack may be needed, despite the recognition of "tail |
| 6978 |
|
recursion". You can often reduce the amount of recursion, and there- |
| 6979 |
|
fore the amount of stack used, by modifying the pattern that is being |
| 6980 |
matched. Consider, for example, this pattern: |
matched. Consider, for example, this pattern: |
| 6981 |
|
|
| 6982 |
([^<]|<(?!inet))+ |
([^<]|<(?!inet))+ |
| 6983 |
|
|
| 6984 |
It matches from wherever it starts until it encounters "<inet" or the |
It matches from wherever it starts until it encounters "<inet" or the |
| 6985 |
end of the data, and is the kind of pattern that might be used when |
end of the data, and is the kind of pattern that might be used when |
| 6986 |
processing an XML file. Each iteration of the outer parentheses matches |
processing an XML file. Each iteration of the outer parentheses matches |
| 6987 |
either one character that is not "<" or a "<" that is not followed by |
either one character that is not "<" or a "<" that is not followed by |
| 6988 |
"inet". However, each time a parenthesis is processed, a recursion |
"inet". However, each time a parenthesis is processed, a recursion |
| 6989 |
occurs, so this formulation uses a stack frame for each matched charac- |
occurs, so this formulation uses a stack frame for each matched charac- |
| 6990 |
ter. For a long string, a lot of stack is required. Consider now this |
ter. For a long string, a lot of stack is required. Consider now this |
| 6991 |
rewritten pattern, which matches exactly the same strings: |
rewritten pattern, which matches exactly the same strings: |
| 6992 |
|
|
| 6993 |
([^<]++|<(?!inet))+ |
([^<]++|<(?!inet))+ |
| 6994 |
|
|
| 6995 |
This uses very much less stack, because runs of characters that do not |
This uses very much less stack, because runs of characters that do not |
| 6996 |
contain "<" are "swallowed" in one item inside the parentheses. Recur- |
contain "<" are "swallowed" in one item inside the parentheses. Recur- |
| 6997 |
sion happens only when a "<" character that is not followed by "inet" |
sion happens only when a "<" character that is not followed by "inet" |
| 6998 |
is encountered (and we assume this is relatively rare). A possessive |
is encountered (and we assume this is relatively rare). A possessive |
| 6999 |
quantifier is used to stop any backtracking into the runs of non-"<" |
quantifier is used to stop any backtracking into the runs of non-"<" |
| 7000 |
characters, but that is not related to stack usage. |
characters, but that is not related to stack usage. |
| 7001 |
|
|
| 7002 |
This example shows that one way of avoiding stack problems when match- |
This example shows that one way of avoiding stack problems when match- |
| 7003 |
ing long subject strings is to write repeated parenthesized subpatterns |
ing long subject strings is to write repeated parenthesized subpatterns |
| 7004 |
to match more than one character whenever possible. |
to match more than one character whenever possible. |
| 7005 |
|
|
| 7006 |
Compiling PCRE to use heap instead of stack |
Compiling PCRE to use heap instead of stack for pcre_exec() |
| 7007 |
|
|
| 7008 |
|
In environments where stack memory is constrained, you might want to |
| 7009 |
|
compile PCRE to use heap memory instead of stack for remembering back- |
| 7010 |
|
up points when pcre_exec() is running. This makes it run a lot more |
| 7011 |
|
slowly, however. Details of how to do this are given in the pcrebuild |
| 7012 |
|
documentation. When built in this way, instead of using the stack, PCRE |
| 7013 |
|
obtains and frees memory by calling the functions that are pointed to |
| 7014 |
|
by the pcre_stack_malloc and pcre_stack_free variables. By default, |
| 7015 |
|
these point to malloc() and free(), but you can replace the pointers to |
| 7016 |
|
cause PCRE to use your own functions. Since the block sizes are always |
| 7017 |
|
the same, and are always freed in reverse order, it may be possible to |
| 7018 |
|
implement customized memory handlers that are more efficient than the |
| 7019 |
|
standard functions. |
| 7020 |
|
|
| 7021 |
In environments where stack memory is constrained, you might want to |
Limiting pcre_exec()'s stack usage |
| 7022 |
compile PCRE to use heap memory instead of stack for remembering back- |
|
| 7023 |
up points. This makes it run a lot more slowly, however. Details of how |
You can set limits on the number of times that match() is called, both |
| 7024 |
to do this are given in the pcrebuild documentation. When built in this |
in total and recursively. If a limit is exceeded, pcre_exec() returns |
| 7025 |
way, instead of using the stack, PCRE obtains and frees memory by call- |
an error code. Setting suitable limits should prevent it from running |
| 7026 |
ing the functions that are pointed to by the pcre_stack_malloc and |
out of stack. The default values of the limits are very large, and |
| 7027 |
pcre_stack_free variables. By default, these point to malloc() and |
unlikely ever to operate. They can be changed when PCRE is built, and |
| 7028 |
free(), but you can replace the pointers to cause PCRE to use your own |
they can also be set when pcre_exec() is called. For details of these |
| 7029 |
functions. Since the block sizes are always the same, and are always |
interfaces, see the pcrebuild documentation and the section on extra |
| 7030 |
freed in reverse order, it may be possible to implement customized mem- |
data for pcre_exec() in the pcreapi documentation. |
|
ory handlers that are more efficient than the standard functions. |
|
|
|
|
|
Limiting PCRE's stack usage |
|
|
|
|
|
PCRE has an internal counter that can be used to limit the depth of |
|
|
recursion, and thus cause pcre_exec() to give an error code before it |
|
|
runs out of stack. By default, the limit is very large, and unlikely |
|
|
ever to operate. It can be changed when PCRE is built, and it can also |
|
|
be set when pcre_exec() is called. For details of these interfaces, see |
|
|
the pcrebuild and pcreapi documentation. |
|
| 7031 |
|
|
| 7032 |
As a very rough rule of thumb, you should reckon on about 500 bytes per |
As a very rough rule of thumb, you should reckon on about 500 bytes per |
| 7033 |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
| 7034 |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
| 7035 |
hand, can support around 128000 recursions. The pcretest test program |
hand, can support around 128000 recursions. |
| 7036 |
has a command line option (-S) that can be used to increase the size of |
|
| 7037 |
its stack. |
In Unix-like environments, the pcretest test program has a command line |
| 7038 |
|
option (-S) that can be used to increase the size of its stack. As long |
| 7039 |
|
as the stack is large enough, another option (-M) can be used to find |
| 7040 |
|
the smallest limits that allow a particular pattern to match a given |
| 7041 |
|
subject string. This is done by calling pcre_exec() repeatedly with |
| 7042 |
|
different limits. |
| 7043 |
|
|
| 7044 |
Changing stack size in Unix-like systems |
Changing stack size in Unix-like systems |
| 7045 |
|
|
| 7046 |
In Unix-like environments, there is not often a problem with the stack |
In Unix-like environments, there is not often a problem with the stack |
| 7047 |
unless very long strings are involved, though the default limit on |
unless very long strings are involved, though the default limit on |
| 7048 |
stack size varies from system to system. Values from 8Mb to 64Mb are |
stack size varies from system to system. Values from 8Mb to 64Mb are |
| 7049 |
common. You can find your default limit by running the command: |
common. You can find your default limit by running the command: |
| 7050 |
|
|
| 7051 |
ulimit -s |
ulimit -s |
| 7052 |
|
|
| 7053 |
Unfortunately, the effect of running out of stack is often SIGSEGV, |
Unfortunately, the effect of running out of stack is often SIGSEGV, |
| 7054 |
though sometimes a more explicit error message is given. You can nor- |
though sometimes a more explicit error message is given. You can nor- |
| 7055 |
mally increase the limit on stack size by code such as this: |
mally increase the limit on stack size by code such as this: |
| 7056 |
|
|
| 7057 |
struct rlimit rlim; |
struct rlimit rlim; |
| 7059 |
rlim.rlim_cur = 100*1024*1024; |
rlim.rlim_cur = 100*1024*1024; |
| 7060 |
setrlimit(RLIMIT_STACK, &rlim); |
setrlimit(RLIMIT_STACK, &rlim); |
| 7061 |
|
|
| 7062 |
This reads the current limits (soft and hard) using getrlimit(), then |
This reads the current limits (soft and hard) using getrlimit(), then |
| 7063 |
attempts to increase the soft limit to 100Mb using setrlimit(). You |
attempts to increase the soft limit to 100Mb using setrlimit(). You |
| 7064 |
must do this before calling pcre_exec(). |
must do this before calling pcre_exec(). |
| 7065 |
|
|
| 7066 |
Changing stack size in Mac OS X |
Changing stack size in Mac OS X |
| 7067 |
|
|
| 7068 |
Using setrlimit(), as described above, should also work on Mac OS X. It |
Using setrlimit(), as described above, should also work on Mac OS X. It |
| 7069 |
is also possible to set a stack size when linking a program. There is a |
is also possible to set a stack size when linking a program. There is a |
| 7070 |
discussion about stack sizes in Mac OS X at this web site: |
discussion about stack sizes in Mac OS X at this web site: |
| 7071 |
http://developer.apple.com/qa/qa2005/qa1419.html. |
http://developer.apple.com/qa/qa2005/qa1419.html. |
| 7072 |
|
|
| 7073 |
|
|
| 7080 |
|
|
| 7081 |
REVISION |
REVISION |
| 7082 |
|
|
| 7083 |
Last updated: 09 July 2008 |
Last updated: 03 January 2010 |
| 7084 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
| 7085 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 7086 |
|
|
| 7087 |
|
|