--- code/trunk/ChangeLog 2007/03/26 16:00:17 134 +++ code/trunk/ChangeLog 2007/12/07 19:32:32 282 @@ -1,7 +1,400 @@ ChangeLog for PCRE ------------------ -Version 7.1 12-Mar-07 +Version 7.5 12-Nov-07 +--------------------- + +1. Applied a patch from Craig: "This patch makes it possible to 'ignore' + values in parens when parsing an RE using the C++ wrapper." + +2. Negative specials like \S did not work in character classes in UTF-8 mode. + Characters greater than 255 were excluded from the class instead of being + included. + +3. The same bug as (2) above applied to negated POSIX classes such as + [:^space:]. + +4. PCRECPP_STATIC was referenced in pcrecpp_internal.h, but nowhere was it + defined or documented. It seems to have been a typo for PCRE_STATIC, so + I have changed it. + +5. The construct (?&) was not diagnosed as a syntax error (it referenced the + first named subpattern) and a construct such as (?&a) would reference the + first named subpattern whose name started with "a" (in other words, the + length check was missing). Both these problems are fixed. "Subpattern name + expected" is now given for (?&) (a zero-length name), and this patch also + makes it give the same error for \k'' (previously it complained that that + was a reference to a non-existent subpattern). + +6. The erroneous patterns (?+-a) and (?-+a) give different error messages; + this is right because (?- can be followed by option settings as well as by + digits. I have, however, made the messages clearer. + +7. Patterns such as (?(1)a|b) (a pattern that contains fewer subpatterns + than the number used in the conditional) now cause a compile-time error. + This is actually not compatible with Perl, which accepts such patterns, but + treats the conditional as always being FALSE (as PCRE used to), but it + seems to me that giving a diagnostic is better. + +8. Change "alphameric" to the more common word "alphanumeric" in comments + and messages. + +9. Fix two occurrences of "backslash" in comments that should have been + "backspace". + +10. Remove two redundant lines of code that can never be obeyed (their function + was moved elsewhere). + +11. The program that makes PCRE's Unicode character property table had a bug + which caused it to generate incorrect table entries for sequences of + characters that have the same character type, but are in different scripts. + It amalgamated them into a single range, with the script of the first of + them. In other words, some characters were in the wrong script. There were + thirteen such cases, affecting characters in the following ranges: + + U+002b0 - U+002c1 + U+0060c - U+0060d + U+0061e - U+00612 + U+0064b - U+0065e + U+0074d - U+0076d + U+01800 - U+01805 + U+01d00 - U+01d77 + U+01d9b - U+01dbf + U+0200b - U+0200f + U+030fc - U+030fe + U+03260 - U+0327f + U+0fb46 - U+0fbb1 + U+10450 - U+1049d + +12. The -o option (show only the matching part of a line) for pcregrep was not + compatible with GNU grep in that, if there was more than one match in a + line, it showed only the first of them. It now behaves in the same way as + GNU grep. + +13. If the -o and -v options were combined for pcregrep, it printed a blank + line for every non-matching line. GNU grep prints nothing, and pcregrep now + does the same. The return code can be used to tell if there were any + non-matching lines. + +14. The pattern (?=something)(?R) was not being diagnosed as a potentially + infinitely looping recursion. The bug was that positive lookaheads were not + being skipped when checking for a possible empty match (negative lookaheads + and both kinds of lookbehind were skipped). + + +Version 7.4 21-Sep-07 +--------------------- + +1. Change 7.3/28 was implemented for classes by looking at the bitmap. This + means that a class such as [\s] counted as "explicit reference to CR or + LF". That isn't really right - the whole point of the change was to try to + help when there was an actual mention of one of the two characters. So now + the change happens only if \r or \n (or a literal CR or LF) character is + encountered. + +2. The 32-bit options word was also used for 6 internal flags, but the numbers + of both had grown to the point where there were only 3 bits left. + Fortunately, there was spare space in the data structure, and so I have + moved the internal flags into a new 16-bit field to free up more option + bits. + +3. The appearance of (?J) at the start of a pattern set the DUPNAMES option, + but did not set the internal JCHANGED flag - either of these is enough to + control the way the "get" function works - but the PCRE_INFO_JCHANGED + facility is supposed to tell if (?J) was ever used, so now (?J) at the + start sets both bits. + +4. Added options (at build time, compile time, exec time) to change \R from + matching any Unicode line ending sequence to just matching CR, LF, or CRLF. + +5. doc/pcresyntax.html was missing from the distribution. + +6. Put back the definition of PCRE_ERROR_NULLWSLIMIT, for backward + compatibility, even though it is no longer used. + +7. Added macro for snprintf to pcrecpp_unittest.cc and also for strtoll and + strtoull to pcrecpp.cc to select the available functions in WIN32 when the + windows.h file is present (where different names are used). [This was + reversed later after testing - see 16 below.] + +8. Changed all #include to #include "config.h". There were also + some further cases that I changed to "pcre.h". + +9. When pcregrep was used with the --colour option, it missed the line ending + sequence off the lines that it output. + +10. It was pointed out to me that arrays of string pointers cause lots of + relocations when a shared library is dynamically loaded. A technique of + using a single long string with a table of offsets can drastically reduce + these. I have refactored PCRE in four places to do this. The result is + dramatic: + + Originally: 290 + After changing UCP table: 187 + After changing error message table: 43 + After changing table of "verbs" 36 + After changing table of Posix names 22 + + Thanks to the folks working on Gregex for glib for this insight. + +11. --disable-stack-for-recursion caused compiling to fail unless -enable- + unicode-properties was also set. + +12. Updated the tests so that they work when \R is defaulted to ANYCRLF. + +13. Added checks for ANY and ANYCRLF to pcrecpp.cc where it previously + checked only for CRLF. + +14. Added casts to pcretest.c to avoid compiler warnings. + +15. Added Craig's patch to various pcrecpp modules to avoid compiler warnings. + +16. Added Craig's patch to remove the WINDOWS_H tests, that were not working, + and instead check for _strtoi64 explicitly, and avoid the use of snprintf() + entirely. This removes changes made in 7 above. + +17. The CMake files have been updated, and there is now more information about + building with CMake in the NON-UNIX-USE document. + + +Version 7.3 28-Aug-07 +--------------------- + + 1. In the rejigging of the build system that eventually resulted in 7.1, the + line "#include " was included in pcre_internal.h. The use of angle + brackets there is not right, since it causes compilers to look for an + installed pcre.h, not the version that is in the source that is being + compiled (which of course may be different). I have changed it back to: + + #include "pcre.h" + + I have a vague recollection that the change was concerned with compiling in + different directories, but in the new build system, that is taken care of + by the VPATH setting the Makefile. + + 2. The pattern .*$ when run in not-DOTALL UTF-8 mode with newline=any failed + when the subject happened to end in the byte 0x85 (e.g. if the last + character was \x{1ec5}). *Character* 0x85 is one of the "any" newline + characters but of course it shouldn't be taken as a newline when it is part + of another character. The bug was that, for an unlimited repeat of . in + not-DOTALL UTF-8 mode, PCRE was advancing by bytes rather than by + characters when looking for a newline. + + 3. A small performance improvement in the DOTALL UTF-8 mode .* case. + + 4. Debugging: adjusted the names of opcodes for different kinds of parentheses + in debug output. + + 5. Arrange to use "%I64d" instead of "%lld" and "%I64u" instead of "%llu" for + long printing in the pcrecpp unittest when running under MinGW. + + 6. ESC_K was left out of the EBCDIC table. + + 7. Change 7.0/38 introduced a new limit on the number of nested non-capturing + parentheses; I made it 1000, which seemed large enough. Unfortunately, the + limit also applies to "virtual nesting" when a pattern is recursive, and in + this case 1000 isn't so big. I have been able to remove this limit at the + expense of backing off one optimization in certain circumstances. Normally, + when pcre_exec() would call its internal match() function recursively and + immediately return the result unconditionally, it uses a "tail recursion" + feature to save stack. However, when a subpattern that can match an empty + string has an unlimited repetition quantifier, it no longer makes this + optimization. That gives it a stack frame in which to save the data for + checking that an empty string has been matched. Previously this was taken + from the 1000-entry workspace that had been reserved. So now there is no + explicit limit, but more stack is used. + + 8. Applied Daniel's patches to solve problems with the import/export magic + syntax that is required for Windows, and which was going wrong for the + pcreposix and pcrecpp parts of the library. These were overlooked when this + problem was solved for the main library. + + 9. There were some crude static tests to avoid integer overflow when computing + the size of patterns that contain repeated groups with explicit upper + limits. As the maximum quantifier is 65535, the maximum group length was + set at 30,000 so that the product of these two numbers did not overflow a + 32-bit integer. However, it turns out that people want to use groups that + are longer than 30,000 bytes (though not repeat them that many times). + Change 7.0/17 (the refactoring of the way the pattern size is computed) has + made it possible to implement the integer overflow checks in a much more + dynamic way, which I have now done. The artificial limitation on group + length has been removed - we now have only the limit on the total length of + the compiled pattern, which depends on the LINK_SIZE setting. + +10. Fixed a bug in the documentation for get/copy named substring when + duplicate names are permitted. If none of the named substrings are set, the + functions return PCRE_ERROR_NOSUBSTRING (7); the doc said they returned an + empty string. + +11. Because Perl interprets \Q...\E at a high level, and ignores orphan \E + instances, patterns such as [\Q\E] or [\E] or even [^\E] cause an error, + because the ] is interpreted as the first data character and the + terminating ] is not found. PCRE has been made compatible with Perl in this + regard. Previously, it interpreted [\Q\E] as an empty class, and [\E] could + cause memory overwriting. + +10. Like Perl, PCRE automatically breaks an unlimited repeat after an empty + string has been matched (to stop an infinite loop). It was not recognizing + a conditional subpattern that could match an empty string if that + subpattern was within another subpattern. For example, it looped when + trying to match (((?(1)X|))*) but it was OK with ((?(1)X|)*) where the + condition was not nested. This bug has been fixed. + +12. A pattern like \X?\d or \P{L}?\d in non-UTF-8 mode could cause a backtrack + past the start of the subject in the presence of bytes with the top bit + set, for example "\x8aBCD". + +13. Added Perl 5.10 experimental backtracking controls (*FAIL), (*F), (*PRUNE), + (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT). + +14. Optimized (?!) to (*FAIL). + +15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629. + This restricts code points to be within the range 0 to 0x10FFFF, excluding + the "low surrogate" sequence 0xD800 to 0xDFFF. Previously, PCRE allowed the + full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still + does: it's just the validity check that is more restrictive. + +16. Inserted checks for integer overflows during escape sequence (backslash) + processing, and also fixed erroneous offset values for syntax errors during + backslash processing. + +17. Fixed another case of looking too far back in non-UTF-8 mode (cf 12 above) + for patterns like [\PPP\x8a]{1,}\x80 with the subject "A\x80". + +18. An unterminated class in a pattern like (?1)\c[ with a "forward reference" + caused an overrun. + +19. A pattern like (?:[\PPa*]*){8,} which had an "extended class" (one with + something other than just ASCII characters) inside a group that had an + unlimited repeat caused a loop at compile time (while checking to see + whether the group could match an empty string). + +20. Debugging a pattern containing \p or \P could cause a crash. For example, + [\P{Any}] did so. (Error in the code for printing property names.) + +21. An orphan \E inside a character class could cause a crash. + +22. A repeated capturing bracket such as (A)? could cause a wild memory + reference during compilation. + +23. There are several functions in pcre_compile() that scan along a compiled + expression for various reasons (e.g. to see if it's fixed length for look + behind). There were bugs in these functions when a repeated \p or \P was + present in the pattern. These operators have additional parameters compared + with \d, etc, and these were not being taken into account when moving along + the compiled data. Specifically: + + (a) A item such as \p{Yi}{3} in a lookbehind was not treated as fixed + length. + + (b) An item such as \pL+ within a repeated group could cause crashes or + loops. + + (c) A pattern such as \p{Yi}+(\P{Yi}+)(?1) could give an incorrect + "reference to non-existent subpattern" error. + + (d) A pattern like (\P{Yi}{2}\277)? could loop at compile time. + +24. A repeated \S or \W in UTF-8 mode could give wrong answers when multibyte + characters were involved (for example /\S{2}/8g with "A\x{a3}BC"). + +25. Using pcregrep in multiline, inverted mode (-Mv) caused it to loop. + +26. Patterns such as [\P{Yi}A] which include \p or \P and just one other + character were causing crashes (broken optimization). + +27. Patterns such as (\P{Yi}*\277)* (group with possible zero repeat containing + \p or \P) caused a compile-time loop. + +28. More problems have arisen in unanchored patterns when CRLF is a valid line + break. For example, the unstudied pattern [\r\n]A does not match the string + "\r\nA" because change 7.0/46 below moves the current point on by two + characters after failing to match at the start. However, the pattern \nA + *does* match, because it doesn't start till \n, and if [\r\n]A is studied, + the same is true. There doesn't seem any very clean way out of this, but + what I have chosen to do makes the common cases work: PCRE now takes note + of whether there can be an explicit match for \r or \n anywhere in the + pattern, and if so, 7.0/46 no longer applies. As part of this change, + there's a new PCRE_INFO_HASCRORLF option for finding out whether a compiled + pattern has explicit CR or LF references. + +29. Added (*CR) etc for changing newline setting at start of pattern. + + +Version 7.2 19-Jun-07 +--------------------- + + 1. If the fr_FR locale cannot be found for test 3, try the "french" locale, + which is apparently normally available under Windows. + + 2. Re-jig the pcregrep tests with different newline settings in an attempt + to make them independent of the local environment's newline setting. + + 3. Add code to configure.ac to remove -g from the CFLAGS default settings. + + 4. Some of the "internals" tests were previously cut out when the link size + was not 2, because the output contained actual offsets. The recent new + "Z" feature of pcretest means that these can be cut out, making the tests + usable with all link sizes. + + 5. Implemented Stan Switzer's goto replacement for longjmp() when not using + stack recursion. This gives a massive performance boost under BSD, but just + a small improvement under Linux. However, it saves one field in the frame + in all cases. + + 6. Added more features from the forthcoming Perl 5.10: + + (a) (?-n) (where n is a string of digits) is a relative subroutine or + recursion call. It refers to the nth most recently opened parentheses. + + (b) (?+n) is also a relative subroutine call; it refers to the nth next + to be opened parentheses. + + (c) Conditions that refer to capturing parentheses can be specified + relatively, for example, (?(-2)... or (?(+3)... + + (d) \K resets the start of the current match so that everything before + is not part of it. + + (e) \k{name} is synonymous with \k and \k'name' (.NET compatible). + + (f) \g{name} is another synonym - part of Perl 5.10's unification of + reference syntax. + + (g) (?| introduces a group in which the numbering of parentheses in each + alternative starts with the same number. + + (h) \h, \H, \v, and \V match horizontal and vertical whitespace. + + 7. Added two new calls to pcre_fullinfo(): PCRE_INFO_OKPARTIAL and + PCRE_INFO_JCHANGED. + + 8. A pattern such as (.*(.)?)* caused pcre_exec() to fail by either not + terminating or by crashing. Diagnosed by Viktor Griph; it was in the code + for detecting groups that can match an empty string. + + 9. A pattern with a very large number of alternatives (more than several + hundred) was running out of internal workspace during the pre-compile + phase, where pcre_compile() figures out how much memory will be needed. A + bit of new cunning has reduced the workspace needed for groups with + alternatives. The 1000-alternative test pattern now uses 12 bytes of + workspace instead of running out of the 4096 that are available. + +10. Inserted some missing (unsigned int) casts to get rid of compiler warnings. + +11. Applied patch from Google to remove an optimization that didn't quite work. + The report of the bug said: + + pcrecpp::RE("a*").FullMatch("aaa") matches, while + pcrecpp::RE("a*?").FullMatch("aaa") does not, and + pcrecpp::RE("a*?\\z").FullMatch("aaa") does again. + +12. If \p or \P was used in non-UTF-8 mode on a character greater than 127 + it matched the wrong number of bytes. + + +Version 7.1 24-Apr-07 --------------------- 1. Applied Bob Rossi and Daniel G's patches to convert the build system to one @@ -33,7 +426,7 @@ 5. Updated the support (such as it is) for Virtual Pascal, thanks to Stefan Weber: (1) pcre_internal.h was missing some function renames; (2) updated makevp.bat for the current PCRE, using the additional files - makevp-c.txt, makevp-l.txt, and pcregexp.pas. + makevp_c.txt, makevp_l.txt, and pcregexp.pas. 6. A Windows user reported a minor discrepancy with test 2, which turned out to be caused by a trailing space on an input line that had got lost in his @@ -127,6 +520,30 @@ in an attempt to make files that differ only in their line terminators compare equal. This works on Linux. +17. Under certain error circumstances pcregrep might try to free random memory + as it exited. This is now fixed, thanks to valgrind. + +19. In pcretest, if the pattern /(?m)^$/g was matched against the string + "abc\r\n\r\n", it found an unwanted second match after the second \r. This + was because its rules for how to advance for /g after matching an empty + string at the end of a line did not allow for this case. They now check for + it specially. + +20. pcretest is supposed to handle patterns and data of any length, by + extending its buffers when necessary. It was getting this wrong when the + buffer for a data line had to be extended. + +21. Added PCRE_NEWLINE_ANYCRLF which is like ANY, but matches only CR, LF, or + CRLF as a newline sequence. + +22. Code for handling Unicode properties in pcre_dfa_exec() wasn't being cut + out by #ifdef SUPPORT_UCP. This did no harm, as it could never be used, but + I have nevertheless tidied it up. + +23. Added some casts to kill warnings from HP-UX ia64 compiler. + +24. Added a man page for pcre-config. + Version 7.0 19-Dec-06 ---------------------