--- code/trunk/ChangeLog 2011/06/12 16:25:55 608
+++ code/trunk/ChangeLog 2011/08/23 16:45:55 672
@@ -1,7 +1,36 @@
 ChangeLog for PCRE
 ------------------

-Version 8.13 30-Apr-2011
+Version 8.20
+------------
+
+1. Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
+ a POSIX class. After further experiments with Perl, which convinced me that
+ Perl has bugs and confusions, a closing square bracket is no longer allowed
+ in a POSIX name.
+
+2. If a pattern such as /(a)b|ac/ is matched against "ac", there is no captured
+ substring, but while checking the failing first alternative, substring 1 is
+ temporarily captured. If the output vector supplied to pcre_exec() was not
+ big enough for this capture, the yield of the function was still zero
+ ("insufficient space for captured substrings"). This cannot be totally fixed
+ without adding another stack variable, which seems a lot of expense for a
+ edge case. However, I have improved the situation in cases such as
+ /(a)(b)x|abc/ matched against "abc", where the return code indicates that
+ fewer than the maximum number of slots in the ovector have been set.
+
+3. Related to (2) above: when there are more back references in a pattern than
+ slots in the output vector, pcre_exec() uses temporary memory during
+ matching, and copies in the captures as far as possible afterwards. It was
+ using the entire output vector, but this conflicts with the specification
+ that only 2/3 is used for passing back captured substrings. Now it uses only
+ the first 2/3, for compatibility. This is, of course, another edge case.
+
+4. Zoltan Herczeg's just-in-time compiler support has been integrated into the
+ main code base, and can be used by building with --enable-jit.
+
+
+Version 8.13 16-Aug-2011
 ------------------------

 1. The Unicode data tables have been updated to Unicode 6.0.0.
@@ -20,77 +49,235 @@
 code.
 (b) A reference to 2 copies of a 3-byte code would not match 2 of a 2-byte code at the end of the subject (it thought there wasn't enough data left).
-
-5. Comprehensive information about what went wrong is now returned by
- pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as long
- as the output vector has at least 2 elements. The offset of the start of
+
+5. Comprehensive information about what went wrong is now returned by
+ pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as long
+ as the output vector has at least 2 elements. The offset of the start of
 the failing character and a reason code are placed in the vector.
-
-6. When the UTF-8 string check fails for pcre_compile(), the offset that is
- now returned is for the first byte of the failing character, instead of the
- last byte inspected. This is an incompatible change, but I hope it is small
+
+6. When the UTF-8 string check fails for pcre_compile(), the offset that is
+ now returned is for the first byte of the failing character, instead of the
+ last byte inspected. This is an incompatible change, but I hope it is small
 enough not to be a problem. It makes the returned offset consistent with pcre_exec() and pcre_dfa_exec().
-
+
 7. pcretest now gives a text phrase as well as the error number when pcre_exec() or pcre_dfa_exec() fails; if the error is a UTF-8 check failure, the offset and reason code are output.
-
-8. When \R was used with a maximizing quantifier it failed to skip backwards
+
+8. When \R was used with a maximizing quantifier it failed to skip backwards
 over a \r\n pair if the subsequent match failed. Instead, it just skipped
- back over a single character (\n). This seems wrong (because it treated the
+ back over a single character (\n). This seems wrong (because it treated the
 two characters as a single entity when going forwards), conflicts with the documentation that \R is equivalent to (?>\r\n|\n|...etc), and makes the
- behaviour of \R* different to (\R)*, which also seems wrong. The behaviour
+ behaviour of \R* different to (\R)*, which also seems wrong. The behaviour
 has been changed.
-
-9. Some internal refactoring has changed the processing so that the handling
+
+9. Some internal refactoring has changed the processing so that the handling
 of the PCRE_CASELESS and PCRE_MULTILINE options is done entirely at compile time (the PCRE_DOTALL option was changed this way some time ago: version
- 7.7 change 16). This has made it possible to abolish the OP_OPT op code,
- which was always a bit of a fudge. It also means that there is one less
- argument for the match() function, which reduces its stack requirements
+ 7.7 change 16). This has made it possible to abolish the OP_OPT op code,
+ which was always a bit of a fudge. It also means that there is one less
+ argument for the match() function, which reduces its stack requirements
 slightly. This change also fixes an incompatibility with Perl: the pattern (?i:([^b]))(?1) should not match "ab", but previously PCRE gave a match.
-
+
 10. More internal refactoring has drastically reduced the number of recursive
- calls to match() for possessively repeated groups such as (abc)++ when
+ calls to match() for possessively repeated groups such as (abc)++ when
 using pcre_exec().
-
+
 11. While implementing 10, a number of bugs in the handling of groups were discovered and fixed:
-
+
 (?<=(a)+) was not diagnosed as invalid (non-fixed-length lookbehind).
 (a|)*(?1) gave a compile-time internal error.
- ((a|)+)+ did not notice that the outer group could match an empty string.
+ ((a|)+)+ did not notice that the outer group could match an empty string.
 (^a|^)+ was not marked as anchored.
- (.*a|.*)+ was not marked as matching at start or after a newline.
-
+ (.*a|.*)+ was not marked as matching at start or after a newline.
+
 12. Yet more internal refactoring has removed another argument from the match()
- function. Special calls to this function are now indicated by setting a
- value in a variable in the "match data" data block.
-
-13. Be more explicit in pcre_study() instead of relying on "default" for
- opcodes that mean there is no starting character; this means that when new
- ones are added and accidentally left out of pcre_study(), testing should
+ function. Special calls to this function are now indicated by setting a
+ value in a variable in the "match data" data block.
+
+13. Be more explicit in pcre_study() instead of relying on "default" for
+ opcodes that mean there is no starting character; this means that when new
+ ones are added and accidentally left out of pcre_study(), testing should
 pick them up.
-
-14. The -s option of pcretest has been documented for ages as being an old
- synonym of -m (show memory usage). I have changed it to mean "force study
- for every regex", that is, assume /S for every regex. This is similar to -i
- and -d etc. It's slightly incompatible, but I'm hoping nobody is still
- using it. It makes it easier to run collection of tests with study enabled,
- and thereby test pcre_study() more easily.
-
+
+14. The -s option of pcretest has been documented for ages as being an old
+ synonym of -m (show memory usage). I have changed it to mean "force study
+ for every regex", that is, assume /S for every regex. This is similar to -i
+ and -d etc. It's slightly incompatible, but I'm hoping nobody is still
+ using it. It makes it easier to run collections of tests with and without
+ study enabled, and thereby test pcre_study() more easily. All the standard
+ tests are now run with and without -s (but some patterns can be marked as
+ "never study" - see 20 below).
+
 15. When (*ACCEPT) was used in a subpattern that was called recursively, the
- restoration of the capturing data to the outer values was not happening
+ restoration of the capturing data to the outer values was not happening
 correctly.
-
+
 16. If a recursively called subpattern ended with (*ACCEPT) and matched an empty string, and PCRE_NOTEMPTY was set, pcre_exec() thought the whole pattern had matched an empty string, and so incorrectly returned a no match.

+17. There was optimizing code for the last branch of non-capturing parentheses,
+ and also for the obeyed branch of a conditional subexpression, which used
+ tail recursion to cut down on stack usage. Unfortunately, now that there is
+ the possibility of (*THEN) occurring in these branches, tail recursion is
+ no longer possible because the return has to be checked for (*THEN). These
+ two optimizations have therefore been removed.
+
+18. If a pattern containing \R was studied, it was assumed that \R always
+ matched two bytes, thus causing the minimum subject length to be
+ incorrectly computed because \R can also match just one byte.
+
+19. If a pattern containing (*ACCEPT) was studied, the minimum subject length
+ was incorrectly computed.
+
+20. If /S is present twice on a test pattern in pcretest input, it now
+ *disables* studying, thereby overriding the use of -s on the command line
+ (see 14 above). This is necessary for one or two tests to keep the output
+ identical in both cases.
+
+21. When (*ACCEPT) was used in an assertion that matched an empty string and
+ PCRE_NOTEMPTY was set, PCRE applied the non-empty test to the assertion.
+
+22. When an atomic group that contained a capturing parenthesis was
+ successfully matched, but the branch in which it appeared failed, the
+ capturing was not being forgotten if a higher numbered group was later
+ captured. For example, /(?>(a))b|(a)c/ when matching "ac" set capturing
+ group 1 to "a", when in fact it should be unset. This applied to multi-
+ branched capturing and non-capturing groups, repeated or not, and also to
+ positive assertions (capturing in negative assertions does not happen
+ in PCRE) and also to nested atomic groups.
+
+23. Add the ++ qualifier feature to pcretest, to show the remainder of the
+ subject after a captured substring, to make it easier to tell which of a
+ number of identical substrings has been captured.
+
+24. The way atomic groups are processed by pcre_exec() has been changed so that
+ if they are repeated, backtracking one repetition now resets captured
+ values correctly. For example, if ((?>(a+)b)+aabab) is matched against
+ "aaaabaaabaabab" the value of captured group 2 is now correctly recorded as
+ "aaa". Previously, it would have been "a". As part of this code
+ refactoring, the way recursive calls are handled has also been changed.
+
+25. If an assertion condition captured any substrings, they were not passed
+ back unless some other capturing happened later. For example, if
+ (?(?=(a))a) was matched against "a", no capturing was returned.
+
+26. When studying a pattern that contained subroutine calls or assertions,
+ the code for finding the minimum length of a possible match was handling
+ direct recursions such as (xxx(?1)|yyy) but not mutual recursions (where
+ group 1 called group 2 while simultaneously a separate group 2 called group
+ 1). A stack overflow occurred in this case. I have fixed this by limiting
+ the recursion depth to 10.
+
+27. Updated RunTest.bat in the distribution to the version supplied by Tom
+ Fortmann. This supports explicit test numbers on the command line, and has
+ argument validation and error reporting.
+
+28. An instance of \X with an unlimited repeat could fail if at any point the
+ first character it looked at was a mark character.
+
+29. Some minor code refactoring concerning Unicode properties and scripts
+ should reduce the stack requirement of match() slightly.
+
+30. Added the '=' option to pcretest to check the setting of unused capturing
+ slots at the end of the pattern, which are documented as being -1, but are
+ not included in the return count.
+
+31. If \k was not followed by a braced, angle-bracketed, or quoted name, PCRE
+ compiled something random. Now it gives a compile-time error (as does
+ Perl).
+
+32. A *MARK encountered during the processing of a positive assertion is now
+ recorded and passed back (compatible with Perl).
+
+33. If --only-matching or --colour was set on a pcregrep call whose pattern
+ had alternative anchored branches, the search for a second match in a line
+ was done as if at the line start. Thus, for example, /^01|^02/ incorrectly
+ matched the line "0102" twice. The same bug affected patterns that started
+ with a backwards assertion. For example /\b01|\b02/ also matched "0102"
+ twice.
+
+34. Previously, PCRE did not allow quantification of assertions. However, Perl
+ does, and because of capturing effects, quantifying parenthesized
+ assertions may at times be useful. Quantifiers are now allowed for
+ parenthesized assertions.
+
+35. A minor code tidy in pcre_compile() when checking options for \R usage.
+
+36. \g was being checked for fancy things in a character class, when it should
+ just be a literal "g".
+
+37. PCRE was rejecting [:a[:digit:]] whereas Perl was not. It seems that the
+ appearance of a nested POSIX class supersedes an apparent external class.
+ For example, [:a[:digit:]b:] matches "a", "b", ":", or a digit. Also,
+ unescaped square brackets may also appear as part of class names. For
+ example, [:a[:abc]b:] gives unknown class "[:abc]b:]". PCRE now behaves
+ more like Perl. (But see 8.20/1 above.)
+
+38. PCRE was giving an error for \N with a braced quantifier such as {1,} (this
+ was because it thought it was \N{name}, which is not supported).
+
+39. Add minix to OS list not supporting the -S option in pcretest.
+
+40. PCRE tries to detect cases of infinite recursion at compile time, but it
+ cannot analyze patterns in sufficient detail to catch mutual recursions
+ such as ((?1))((?2)). There is now a runtime test that gives an error if a
+ subgroup is called recursively as a subpattern for a second time at the
+ same position in the subject string. In previous releases this might have
+ been caught by the recursion limit, or it might have run out of stack.
+
+41. A pattern such as /(?(R)a+|(?R)b)/ is quite safe, as the recursion can
+ happen only once. PCRE was, however incorrectly giving a compile time error
+ "recursive call could loop indefinitely" because it cannot analyze the
+ pattern in sufficient detail. The compile time test no longer happens when
+ PCRE is compiling a conditional subpattern, but actual runaway loops are
+ now caught at runtime (see 40 above).
+
+42. It seems that Perl allows any characters other than a closing parenthesis
+ to be part of the NAME in (*MARK:NAME) and other backtracking verbs. PCRE
+ has been changed to be the same.
+
+43. Updated configure.ac to put in more quoting round AC_LANG_PROGRAM etc. so
+ as not to get warnings when autogen.sh is called. Also changed
+ AC_PROG_LIBTOOL (deprecated) to LT_INIT (the current macro).
+
+44. To help people who use pcregrep to scan files containing exceedingly long
+ lines, the following changes have been made:
+
+ (a) The default value of the buffer size parameter has been increased from
+ 8K to 20K. (The actual buffer used is three times this size.)
+
+ (b) The default can be changed by ./configure --with-pcregrep-bufsize when
+ PCRE is built.
+
+ (c) A --buffer-size=n option has been added to pcregrep, to allow the size
+ to be set at run time.
+
+ (d) Numerical values in pcregrep options can be followed by K or M, for
+ example --buffer-size=50K.
+
+ (e) If a line being scanned overflows pcregrep's buffer, an error is now
+ given and the return code is set to 2.
+
+45. Add a pointer to the latest mark to the callout data block.
+
+46. The pattern /.(*F)/, when applied to "abc" with PCRE_PARTIAL_HARD, gave a
+ partial match of an empty string instead of no match. This was specific to
+ the use of ".".
+
+47. The pattern /f.*/8s, when applied to "for" with PCRE_PARTIAL_HARD, gave a
+ complete match instead of a partial match. This bug was dependent on both
+ the PCRE_UTF8 and PCRE_DOTALL options being set.
+
+48. For a pattern such as /\babc|\bdef/ pcre_study() was failing to set up the
+ starting byte set, because \b was not being ignored.
+

 Version 8.12 15-Jan-2011
 ------------------------