--- code/trunk/ChangeLog 2011/01/15 17:26:57 591 +++ code/trunk/ChangeLog 2011/07/09 10:48:16 614 @@ -1,6 +1,129 @@ ChangeLog for PCRE ------------------ +Version 8.13 30-Apr-2011 +------------------------ + +1. The Unicode data tables have been updated to Unicode 6.0.0. + +2. Two minor typos in pcre_internal.h have been fixed. + +3. Added #include to pcre_scanner_unittest.cc, pcrecpp.cc, and + pcrecpp_unittest.cc. They are needed for strcmp(), memset(), and strchr() + in some environments (e.g. Solaris 10/SPARC using Sun Studio 12U2). + +4. There were a number of related bugs in the code for matching backrefences + caselessly in UTF-8 mode when codes for the characters concerned were + different numbers of bytes. For example, U+023A and U+2C65 are an upper + and lower case pair, using 2 and 3 bytes, respectively. The main bugs were: + (a) A reference to 3 copies of a 2-byte code matched only 2 of a 3-byte + code. (b) A reference to 2 copies of a 3-byte code would not match 2 of a + 2-byte code at the end of the subject (it thought there wasn't enough data + left). + +5. Comprehensive information about what went wrong is now returned by + pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as long + as the output vector has at least 2 elements. The offset of the start of + the failing character and a reason code are placed in the vector. + +6. When the UTF-8 string check fails for pcre_compile(), the offset that is + now returned is for the first byte of the failing character, instead of the + last byte inspected. This is an incompatible change, but I hope it is small + enough not to be a problem. It makes the returned offset consistent with + pcre_exec() and pcre_dfa_exec(). + +7. pcretest now gives a text phrase as well as the error number when + pcre_exec() or pcre_dfa_exec() fails; if the error is a UTF-8 check + failure, the offset and reason code are output. + +8. When \R was used with a maximizing quantifier it failed to skip backwards + over a \r\n pair if the subsequent match failed. Instead, it just skipped + back over a single character (\n). This seems wrong (because it treated the + two characters as a single entity when going forwards), conflicts with the + documentation that \R is equivalent to (?>\r\n|\n|...etc), and makes the + behaviour of \R* different to (\R)*, which also seems wrong. The behaviour + has been changed. + +9. Some internal refactoring has changed the processing so that the handling + of the PCRE_CASELESS and PCRE_MULTILINE options is done entirely at compile + time (the PCRE_DOTALL option was changed this way some time ago: version + 7.7 change 16). This has made it possible to abolish the OP_OPT op code, + which was always a bit of a fudge. It also means that there is one less + argument for the match() function, which reduces its stack requirements + slightly. This change also fixes an incompatibility with Perl: the pattern + (?i:([^b]))(?1) should not match "ab", but previously PCRE gave a match. + +10. More internal refactoring has drastically reduced the number of recursive + calls to match() for possessively repeated groups such as (abc)++ when + using pcre_exec(). + +11. While implementing 10, a number of bugs in the handling of groups were + discovered and fixed: + + (?<=(a)+) was not diagnosed as invalid (non-fixed-length lookbehind). + (a|)*(?1) gave a compile-time internal error. + ((a|)+)+ did not notice that the outer group could match an empty string. + (^a|^)+ was not marked as anchored. + (.*a|.*)+ was not marked as matching at start or after a newline. + +12. Yet more internal refactoring has removed another argument from the match() + function. Special calls to this function are now indicated by setting a + value in a variable in the "match data" data block. + +13. Be more explicit in pcre_study() instead of relying on "default" for + opcodes that mean there is no starting character; this means that when new + ones are added and accidentally left out of pcre_study(), testing should + pick them up. + +14. The -s option of pcretest has been documented for ages as being an old + synonym of -m (show memory usage). I have changed it to mean "force study + for every regex", that is, assume /S for every regex. This is similar to -i + and -d etc. It's slightly incompatible, but I'm hoping nobody is still + using it. It makes it easier to run collections of tests with and without + study enabled, and thereby test pcre_study() more easily. All the standard + tests are now run with and without -s (but some patterns can be marked as + "never study" - see 20 below). + +15. When (*ACCEPT) was used in a subpattern that was called recursively, the + restoration of the capturing data to the outer values was not happening + correctly. + +16. If a recursively called subpattern ended with (*ACCEPT) and matched an + empty string, and PCRE_NOTEMPTY was set, pcre_exec() thought the whole + pattern had matched an empty string, and so incorrectly returned a no + match. + +17. There was optimizing code for the last branch of non-capturing parentheses, + and also for the obeyed branch of a conditional subexpression, which used + tail recursion to cut down on stack usage. Unfortunately, not that there is + the possibility of (*THEN) occurring in these branches, tail recursion is + no longer possible because the return has to be checked for (*THEN). These + two optimizations have therefore been removed. + +18. If a pattern containing \R was studied, it was assumed that \R always + matched two bytes, thus causing the minimum subject length to be + incorrectly computed because \R can also match just one byte. + +19. If a pattern containing (*ACCEPT) was studied, the minimum subject length + was incorrectly computed. + +20. If /S is present twice on a test pattern in pcretest input, it *disables* + studying, thereby overriding the use of -s on the command line. This is + necessary for one or two tests to keep the output identical in both cases. + +21. When (*ACCEPT) was used in an assertion that matched an empty string and + PCRE_NOTEMPTY was set, PCRE applied the non-empty test to the assertion. + +22. When an atomic group that contained a capturing parenthesis was + successfully matched, but the branch in which it appeared failed, the + capturing was not being forgotten if a higher numbered group was later + captured. For example, /(?>(a))b|(a)c/ when matching "ac" set capturing + group 1 to "a", when in fact it should be unset. This applied to multi- + branched capturing and non-capturing groups, repeated or not, and also to + positive assertions (capturing in negative assertions is not well defined + in PCRE) and also to nested atomic groups. + + Version 8.12 15-Jan-2011 ------------------------