ChangeLog for PCRE
------------------

Version 6.7 04-Jul-06
---------------------

 1. In order to handle tests when input lines are enormously long, pcretest has
    been re-factored so that it automatically extends its buffers when
    necessary. The code is crude, but this _is_ just a test program. The
    default size has been increased from 32K to 50K.

 2. The code in pcre_study() was using the value of the re argument before
    testing it for NULL. (Of course, in any sensible call of the function, it
    won't be NULL.)

 3. The memmove() emulation function in pcre_internal.h, which is used on
    systems that lack both memmove() and bcopy() - that is, hardly ever -
    was missing a "static" storage class specifier.

 4. When UTF-8 mode was not set, PCRE looped when compiling certain patterns
    containing an extended class (one that cannot be represented by a bitmap
    because it contains high-valued characters or Unicode property items, e.g.
    [\pZ]). Almost always one would set UTF-8 mode when processing such a
    pattern, but PCRE should not loop if you do not (it no longer does).
    [Detail: two cases were found: (a) a repeated subpattern containing an
    extended class; (b) a recursive reference to a subpattern that followed a
    previous extended class. It wasn't skipping over the extended class
    correctly when UTF-8 mode was not set.]

 5. A negated single-character class was not being recognized as fixed-length
    in lookbehind assertions such as (?<=[^f]), leading to an incorrect
    compile error "lookbehind assertion is not fixed length".

 6. The RunPerlTest auxiliary script was showing an unexpected difference
    between PCRE and Perl for UTF-8 tests. It turns out that it is hard to
    write a Perl script that can interpret lines of an input file either as
    byte characters or as UTF-8, which is what "perltest" was being required to
    do for the non-UTF-8 and UTF-8 tests, respectively. Essentially what you
    can't do is switch easily at run time between having the "use utf8;" pragma
    or not. In the end, I fudged it by using the RunPerlTest script to insert
    "use utf8;" explicitly for the UTF-8 tests.

 7. In multiline (/m) mode, PCRE was matching ^ after a terminating newline at
    the end of the subject string, contrary to the documentation and to what
    Perl does. This was true of both matching functions. Now it matches only at
    the start of the subject and immediately after *internal* newlines.

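    The rule in item 7 can be sketched in a few lines of C. This is only an
    illustration, not PCRE's own code: the function name is hypothetical, and
    it assumes LF is the only newline character.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Reports whether ^ may match at
   offset pos of a subject of the given length in multiline mode, assuming
   that LF is the newline character. */
bool circumflex_matches_at(const char *subject, size_t length, size_t pos)
{
  if (pos == 0) return true;         /* start of the subject */
  if (pos >= length) return false;   /* end of subject: never after a terminating newline */
  return subject[pos - 1] == '\n';   /* immediately after an internal newline */
}
```

    For the 8-character subject "abc\ndef\n", this accepts offsets 0 and 4
    and rejects offset 8, which follows the terminating newline.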
 8. A call of pcre_fullinfo() from pcretest to get the option bits was passing
    a pointer to an int instead of a pointer to an unsigned long int. This
    caused problems on 64-bit systems.

 9. Applied a patch from the folks at Google to pcrecpp.cc, to fix "another
    instance of the 'standard' template library not being so standard".

10. There was no check on the number of named subpatterns nor the maximum
    length of a subpattern name. The product of these values is used to compute
    the size of the memory block for a compiled pattern. By supplying a very
    long subpattern name and a large number of named subpatterns, the size
    computation could be caused to overflow. This is now prevented by limiting
    the length of names to 32 characters, and the number of named subpatterns
    to 10,000.

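    The idea behind the fix in item 10 is to bound both factors before
    multiplying them. The following is a minimal sketch under that
    assumption; the macro and function names are made up for illustration
    and are not taken from the PCRE source.

```c
#include <stdbool.h>
#include <stddef.h>

/* The limits quoted in item 10. The macro names are illustrative only. */
#define MAX_NAME_LENGTH  32
#define MAX_NAME_COUNT   10000

/* Hypothetical helper, not part of PCRE. Rejects out-of-range inputs before
   forming the product, so the multiplication can never overflow: the largest
   possible result is 10000 * 32 = 320000. */
bool name_table_size(size_t name_count, size_t longest_name, size_t *size_out)
{
  if (name_count > MAX_NAME_COUNT || longest_name > MAX_NAME_LENGTH)
    return false;
  *size_out = name_count * longest_name;
  return true;
}
```

    A caller would treat a false return as a compile error rather than going
    on to allocate memory from an unchecked product.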
11. Subpatterns that are repeated with specific counts have to be replicated in
    the compiled pattern. The size of memory for this was computed from the
    length of the subpattern and the repeat count. The latter is limited to
    65535, but there was no limit on the former, meaning that integer overflow
    could in principle occur. The compiled length of a repeated subpattern is
    now limited to 30,000 bytes in order to prevent this.

12. Added the optional facility to have named substrings with the same name.

13. Added the ability to use a named substring as a condition, using the
    Python syntax: (?(name)yes|no). This overloads (?(R)... and names that
    are numbers (not recommended). Forward references are permitted.

14. Added forward references in named backreferences (if you see what I mean).

15. In UTF-8 mode, with the PCRE_DOTALL option set, a quantified dot in the
    pattern could run off the end of the subject. For example, the pattern
    "(?s)(.{1,5})"8 did this with the subject "ab".

16. If PCRE_DOTALL or PCRE_MULTILINE were set, pcre_dfa_exec() behaved as if
    PCRE_CASELESS was set when matching characters that were quantified with ?
    or *.

17. A character class other than a single negated character that had a minimum
    but no maximum quantifier - for example [ab]{6,} - was not handled
    correctly by pcre_dfa_exec(). It would match only one character.

18. A valid (though odd) pattern that looked like a POSIX character
    class but used an invalid character after [ (for example [[,abc,]]) caused
    pcre_compile() to give the error "Failed: internal error: code overflow" or
    in some cases to crash with a glibc free() error. This could even happen if
    the pattern terminated after [[ but there just happened to be a sequence of
    letters, a binary zero, and a closing ] in the memory that followed.

19. Perl's treatment of octal escapes in the range \400 to \777 has changed
    over the years. Originally (before any Unicode support), just the bottom 8
    bits were taken. Thus, for example, \500 really meant \100. Nowadays the
    output from "man perlunicode" includes this:

      The regular expression compiler produces polymorphic opcodes. That
      is, the pattern adapts to the data and automatically switches to
      the Unicode character scheme when presented with Unicode data--or
      instead uses a traditional byte scheme when presented with byte
      data.

    Sadly, a wide octal escape does not cause a switch, and in a string with
    no other multibyte characters, these octal escapes are treated as before.
    Thus, in Perl, the pattern /\500/ actually matches \100 but the pattern
    /\500|\x{1ff}/ matches \500 or \777 because the whole thing is treated as a
    Unicode string.

    I have not perpetrated such confusion in PCRE. Up till now, it took just
    the bottom 8 bits, as in old Perl. I have now made octal escapes with
    values greater than \377 illegal in non-UTF-8 mode. In UTF-8 mode they
    translate to the appropriate multibyte character.

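    The old and new treatment of octal escapes described in item 19 can be
    compared with a short sketch. The function names are hypothetical and
    this is not PCRE's own code. The new-style function returns the
    character value, with -1 standing in for a compile error; the encoding
    of that value as a multibyte UTF-8 character is not shown.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of PCRE. Reads up to three octal digits
   (the text that follows the backslash) and returns their value. */
int octal_digits_value(const char *digits)
{
  int value = 0;
  int i;
  for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
    value = value * 8 + (digits[i] - '0');
  return value;
}

/* Old behaviour: keep just the bottom 8 bits, so \500 becomes \100. */
int octal_escape_old(const char *digits)
{
  return octal_digits_value(digits) & 0xff;
}

/* New behaviour: values above \377 are an error (-1 here) outside UTF-8
   mode; in UTF-8 mode the full value is kept. */
int octal_escape_new(const char *digits, bool utf8_mode)
{
  int value = octal_digits_value(digits);
  if (value > 0377 && !utf8_mode) return -1;
  return value;
}
```

    With these definitions, octal_escape_old("500") yields 0100, while
    octal_escape_new("500", false) yields -1 and octal_escape_new("500", true)
    yields 0500.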
20. Applied some refactoring to reduce the number of warnings from Microsoft
    and Borland compilers. This has included removing the fudge introduced
    seven years ago for the OS/2 compiler (see 2.02/2 below) because it caused
    a warning about an unused variable.

21. PCRE has not included VT (character 0x0b) in the set of whitespace
    characters since release 4.0, because Perl (from release 5.004) does not.
    [Or at least, is documented not to: some releases seem to be in conflict
    with the documentation.] However, when a pattern was studied with
    pcre_study() and all its branches started with \s, PCRE still included VT
    as a possible starting character. Of course, this did no harm; it just
    caused an unnecessary match attempt.

22. Removed a now-redundant internal flag bit that recorded the fact that case
    dependency changed within the pattern. This was once needed for "required
    byte" processing, but is no longer used. This recovers a now-scarce options
    bit. Also moved the least significant internal flag bit to the most-
    significant bit of the word, which was not previously used (hangover from
    the days when it was an int rather than a uint) to free up another bit for
    the future.

23. Added support for CRLF line endings as well as CR and LF. As well as the
    default being selectable at build time, it can now be changed at runtime
    via the PCRE_NEWLINE_xxx flags. There are now options for pcregrep to
    specify that it is scanning data with non-default line endings.

24. Changed the definition of CXXLINK to make it agree with the definition of
    LINK in the Makefile, by changing LDFLAGS to CXXFLAGS.

25. Applied Ian Taylor's patches to avoid using another stack frame for tail
    recursions. This makes a big difference to stack usage for some patterns.

26. If a subpattern containing a named recursion or subroutine reference such
    as (?P>B) was quantified, for example (xxx(?P>B)){3}, the calculation of
    the space required for the compiled pattern went wrong and gave too small a
    value. Depending on the environment, this could lead to "Failed: internal
    error: code overflow at offset 49" or "glibc detected double free or
    corruption" errors.

27. Applied patches from Google (a) to support the new newline modes and (b) to
    advance over multibyte UTF-8 characters in GlobalReplace.

28. Changed free() to pcre_free() in pcredemo.c. Apparently this makes a
    difference for some implementation of PCRE in some Windows version.

29. Added some extra testing facilities to pcretest:

    \q<number>   in a data line sets the "match limit" value
    \Q<number>   in a data line sets the "match recursion limit" value
    -S <number>  sets the stack size, where <number> is in megabytes

    The -S option isn't available for Windows.


Version 6.6 06-Feb-06
---------------------
