ChangeLog for PCRE
------------------

Version 4.4 13-Aug-03
---------------------

 1. In UTF-8 mode, a character class containing characters with values between
    127 and 255 was not handled correctly if the compiled pattern was studied.
    In fixing this, I have also improved the studying algorithm for such
    classes (slightly).

 2. Three internal functions had redundant arguments passed to them. Removal
    might give a very teeny performance improvement.

 3. Documentation bug: the value of the capture_top field in a callout is *one
    more than* the number of the highest numbered captured substring.

 4. The Makefile linked pcretest and pcregrep with -lpcre, which could result
    in incorrectly linking with a previously installed version. They now link
    explicitly with libpcre.la.

 5. configure.in no longer needs to recognize Cygwin specially.

 6. A problem in pcre.in for Windows platforms is fixed.

 7. If a pattern was successfully studied, and the -d (or /D) flag was given to
    pcretest, it used to include the size of the study block as part of its
    output. Unfortunately, the structure contains a field that has a different
    size on different hardware architectures. This meant that the tests that
    showed this size failed. As the block is currently always of a fixed size,
    this information isn't actually particularly useful in pcretest output, so
    I have just removed it.

 8. Three pre-processor statements accidentally did not start in column 1.
    Sadly, there are *still* compilers around that complain, even though
    standard C has not required this for well over a decade. Sigh.

 9. In pcretest, the code for checking callouts passed small integers in the
    callout_data field, which is a void * field. However, some picky compilers
    complained about the casts involved for this on 64-bit systems. Now
    pcretest passes the address of the small integer instead, which should get
    rid of the warnings.

10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
    both compile and run time, and gives an error if an invalid UTF-8 sequence
    is found. There is an option for disabling this check in cases where the
    string is known to be correct and/or maximum performance is wanted.

11. In response to a bug report, I changed one line in Makefile.in from

        -Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \
    to
        -Wl,--out-implib,.libs/@WIN_PREFIX@libpcreposix.dll.a \

    to look similar to other lines, but I have no way of telling whether this
    is the right thing to do, as I do not use Windows. No doubt I'll get told
    if it's wrong...
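
Illustration for item 10 of 4.4: a minimal sketch of the kind of structural
check a UTF-8 validity test performs. This is not PCRE's own code. The function
name utf8_well_formed is made up for the example, and the sketch is
deliberately simplified: it accepts sequences of up to 4 bytes and does not
reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE's own code: returns 1 if the first `len`
   bytes of `s` are structurally well-formed UTF-8, 0 otherwise.
   Simplifications: sequences of up to 4 bytes only; overlong encodings and
   surrogate code points are not rejected. */
int utf8_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i++];
        size_t extra;                        /* continuation bytes expected */
        if (c < 0x80) continue;              /* single-byte (ASCII) character */
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;                       /* stray continuation or bad lead */
        if (len - i < extra) return 0;       /* sequence is truncated */
        while (extra-- > 0) {
            /* every continuation byte must look like 10xxxxxx */
            if ((s[i++] & 0xC0) != 0x80) return 0;
        }
    }
    return 1;
}
```

A caller that already knows its subject is valid could skip such a check
altogether, which is the trade-off the disabling option in item 10 offers.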


Version 4.3 21-May-03
---------------------

1. Two instances of @WIN_PREFIX@ omitted from the Windows targets in the
   Makefile.

2. Some refactoring to improve the quality of the code:

   (i)   The utf8_table... variables are now declared "const".

   (ii)  The code for \cx, which used the "case flipping" table to upper case
         lower case letters, now just subtracts 32. This is ASCII-specific,
         but the whole concept of \cx is ASCII-specific, so it seems
         reasonable.

   (iii) PCRE was using its character types table to recognize decimal and
         hexadecimal digits in the pattern. This is silly, because it handles
         only 0-9, a-f, and A-F, but the character types table is locale-
         specific, which means strange things might happen. A private
         table is now used for this - though it costs 256 bytes, a table is
         much faster than multiple explicit tests. Of course, the standard
         character types table is still used for matching digits in subject
         strings against \d.

   (iv)  Strictly, the identifier ESC_t is reserved by POSIX (all identifiers
         ending in _t are). So I've renamed it as ESC_tee.

3. The first argument for regexec() in the POSIX wrapper should have been
   defined as "const".

4. Changed pcretest to use malloc() for its buffers so that they can be
   Electric Fenced for debugging.

5. There were several places in the code where, in UTF-8 mode, PCRE would try
   to read one or more bytes before the start of the subject string. Often this
   had no effect on PCRE's behaviour, but in some circumstances it could
   provoke a segmentation fault.

6. A lookbehind at the start of a pattern in UTF-8 mode could also cause PCRE
   to try to read one or more bytes before the start of the subject string.

7. A lookbehind in a pattern matched in non-UTF-8 mode on a PCRE compiled with
   UTF-8 support could misbehave in various ways if the subject string
   contained bytes with the 0x80 bit set and the 0x40 bit unset in a lookbehind
   area. (PCRE was not checking for the UTF-8 mode flag, and was trying to move
   back over UTF-8 characters.)
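
Illustration for items 5 to 7 of 4.3: a minimal sketch of stepping back one
character in a subject string while respecting both constraints described
above, namely never reading before the start of the subject and skipping
continuation bytes only when UTF-8 mode is actually in force. This is not
PCRE's own code; the function name back_one_char and its parameters are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE's own code. Moves `p` back by one character,
   never going before `start`. Returns the new position, or NULL if `p` is
   already at the start of the subject. */
const unsigned char *back_one_char(const unsigned char *start,
                                   const unsigned char *p, int utf8)
{
    assert(start != NULL && p != NULL && p >= start);
    if (p == start) return NULL;             /* nothing before the subject */
    p--;
    if (utf8) {
        /* Skip continuation bytes (10xxxxxx: 0x80 set, 0x40 unset), but
           stop at the start of the subject. */
        while (p > start && (*p & 0xC0) == 0x80) p--;
    }
    return p;
}
```

With utf8 set to 0 the function always moves back exactly one byte, so bytes
in the range 0x80 to 0xBF in a non-UTF-8 subject are treated as ordinary
characters.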


Version 4.2 14-Apr-03
---------------------

1. Typo "#if SUPPORT_UTF8" instead of "#ifdef SUPPORT_UTF8" fixed.

2. Changes to the building process, supplied by Ronald Landheer-Cieslak
   [ON_WINDOWS]: new variable, "#" on non-Windows platforms
   [NOT_ON_WINDOWS]: new variable, "#" on Windows platforms
   [WIN_PREFIX]: new variable, "cyg" for Cygwin
   * Makefile.in: use autoconf substitution for OBJEXT, EXEEXT, BUILD_OBJEXT
     and BUILD_EXEEXT
     Note: automatic setting of the BUILD variables is not yet working
     set CPPFLAGS and BUILD_CPPFLAGS (but don't use yet) - should be used at
       compile-time but not at link-time
   [LINK]: use for linking executables only
     make different versions for Windows and non-Windows
   [LINKLIB]: new variable, copy of UNIX-style LINK, used for linking
     libraries
   [LINK_FOR_BUILD]: new variable
   [OBJEXT]: use throughout
   [EXEEXT]: use throughout
   <winshared>: new target
   <wininstall>: new target
   <dftables.o>: use native compiler
   <dftables>: use native linker
   <install>: handle Windows platform correctly
   <clean>: ditto
   <check>: ditto
     copy DLL to top builddir before testing

   As part of these changes, -no-undefined was removed again. This was reported
   to give trouble on HP-UX 11.0, so getting rid of it seems like a good idea
   in any case.

3. Some tidies to get rid of compiler warnings:

   . In the match_data structure, match_limit was an unsigned long int, whereas
     match_call_count was an int. I've made them both unsigned long ints.

   . In pcretest the fact that a const uschar * doesn't automatically cast to
     a void * provoked a warning.

   . Turning on some more compiler warnings threw up some "shadow" variables
     and a few more missing casts.

4. If PCRE was compiled with UTF-8 support, but called without the PCRE_UTF8
   option, a class that contained a single character with a value between 128
   and 255 (e.g. /[\xFF]/) caused PCRE to crash.

5. If PCRE was compiled with UTF-8 support, but called without the PCRE_UTF8
   option, a class that contained several characters, but with at least one
   whose value was between 128 and 255 caused PCRE to crash.
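
Illustration for item 1 of 4.2: a minimal sketch of why the "#if" typo
matters. It assumes, purely for the example, that SUPPORT_UTF8 is defined with
no value, as a hand-written "#define SUPPORT_UTF8" in a config header would do.
The function name utf8_support_enabled is made up.

```c
/* Assumption for illustration: the macro is defined with no value. */
#define SUPPORT_UTF8

/* Hypothetical helper: returns 1 if UTF-8 support was selected at build
   time, 0 otherwise. */
int utf8_support_enabled(void)
{
#ifdef SUPPORT_UTF8          /* tests only whether the macro is defined */
    return 1;
#else
    return 0;
#endif
    /* With the empty definition above, "#if SUPPORT_UTF8" would expand to
       "#if" with no expression, which a conforming preprocessor rejects. */
}
```

When the macro is given a numeric value instead (for example by -DSUPPORT_UTF8
on the compiler command line, which defines it as 1), both forms are accepted,
which is presumably how a typo like this can go unnoticed for a while.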


Version 4.1 12-Mar-03
---------------------

1. Compiling with gcc -pedantic found a couple of places where casts were
needed, and a string in dftables.c that was longer than standard compilers are
required to support.

2. Compiling with Sun's compiler found a few more places where the code could
be tidied up in order to avoid warnings.

3. The variables for cross-compiling were called HOST_CC and HOST_CFLAGS; the
first of these names is deprecated in the latest Autoconf in favour of the name
CC_FOR_BUILD, because "host" is typically used to mean the system on which the
compiled code will be run. I can't find a reference for HOST_CFLAGS, but by
analogy I have changed it to CFLAGS_FOR_BUILD.

4. Added -no-undefined to the linking command in the Makefile, because this is
apparently helpful for Windows. To make it work, also added "-L. -lpcre" to the
linking step for the pcreposix library.

5. PCRE was failing to diagnose the case of two named groups with the same
name.

6. A problem with one of PCRE's optimizations was discovered. PCRE remembers a
literal character that is needed in the subject for a match, and scans along to
ensure that it is present before embarking on the full matching process. This
saves time in cases of nested unlimited repeats that are never going to match.
Problem: the scan can take a lot of time if the subject is very long (e.g.
megabytes), thus penalizing straightforward matches. It is now done only if the
amount of subject to be scanned is less than 1000 bytes.

7. A lesser problem with the same optimization is that it was recording the
first character of an anchored pattern as "needed", thus provoking a search
right along the subject, even when the first match of the pattern was going to
fail. The "needed" character is now not set for anchored patterns, unless it
follows something in the pattern that is of non-fixed length. Thus, it still
fulfils its original purpose of finding quick non-matches in cases of nested
unlimited repeats, but isn't used for simple anchored patterns such as /^abc/.
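
Illustration for item 6 of 4.1: a minimal sketch of a "needed character"
pre-scan that is applied only to short subjects. This is not PCRE's own code.
The function name worth_matching and the macro name SCAN_LIMIT are made up for
the example; only the 1000-byte threshold comes from the entry above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCAN_LIMIT 1000   /* threshold from item 6; the macro name is made up */

/* Hypothetical helper, not PCRE's own code. Returns 0 if a full match attempt
   can be skipped because `needed` does not occur in the subject, 1 otherwise.
   Subjects of SCAN_LIMIT bytes or more are never pre-scanned. */
int worth_matching(const unsigned char *subject, size_t len,
                   unsigned char needed)
{
    assert(subject != NULL || len == 0);
    if (len >= SCAN_LIMIT) return 1;         /* too long: skip the pre-scan */
    if (len == 0) return 0;                  /* empty subject cannot hold it */
    return memchr(subject, needed, len) != NULL;
}
```

The design choice is a plain cost cap: for short subjects the scan is cheap
and can avoid an expensive failing match, while for long subjects the matcher
is simply started without it.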


Version 4.0 17-Feb-03
---------------------

1. If a comment in an extended regex that started immediately after a meta-item
extended to the end of string, PCRE compiled incorrect data. This could lead to