--- code/trunk/ChangeLog 2009/03/08 16:56:58 385 +++ code/trunk/ChangeLog 2009/12/11 16:42:50 472 @@ -1,56 +1,307 @@ ChangeLog for PCRE ------------------ -Version 7.9 xx-xxx-09 +Version 8.01 11-Dec-09 +---------------------- + +1. If a pattern contained a conditional subpattern with only one branch (in + particular, this includes all (DEFINE) patterns), a call to pcre_study() + computed the wrong minimum data length (which is of course zero for such + subpatterns). + +2. For patterns such as (?i)a(?-i)b|c where an option setting at the start of + the pattern is reset in the first branch, pcre_compile() failed with + "internal error: code overflow at offset...". This happened only when + the reset was to the original external option setting. (An optimization + abstracts leading options settings into an external setting, which was the + cause of this.) + + +Version 8.00 19-Oct-09 +---------------------- + +1. The table for translating pcre_compile() error codes into POSIX error codes + was out-of-date, and there was no check on the pcre_compile() error code + being within the table. This could lead to an OK return being given in + error. + +2. Changed the call to open a subject file in pcregrep from fopen(pathname, + "r") to fopen(pathname, "rb"), which fixed a problem with some of the tests + in a Windows environment. + +3. The pcregrep --count option prints the count for each file even when it is + zero, as does GNU grep. However, pcregrep was also printing all files when + --files-with-matches was added. Now, when both options are given, it prints + counts only for those files that have at least one match. (GNU grep just + prints the file name in this circumstance, but including the count seems + more useful - otherwise, why use --count?) Also ensured that the + combination -clh just lists non-zero counts, with no names. + +4. The long form of the pcregrep -F option was incorrectly implemented as + --fixed_strings instead of --fixed-strings. This is an incompatible change, + but it seems right to fix it, and I didn't think it was worth preserving + the old behaviour. + +5. The command line items --regex=pattern and --regexp=pattern were not + recognized by pcregrep, which required --regex pattern or --regexp pattern + (with a space rather than an '='). The man page documented the '=' forms, + which are compatible with GNU grep; these now work. + +6. No libpcreposix.pc file was created for pkg-config; there was just + libpcre.pc and libpcrecpp.pc. The omission has been rectified. + +7. Added #ifndef SUPPORT_UCP into the pcre_ucd.c module, to reduce its size + when UCP support is not needed, by modifying the Python script that + generates it from Unicode data files. This should not matter if the module + is correctly used as a library, but I received one complaint about 50K of + unwanted data. My guess is that the person linked everything into his + program rather than using a library. Anyway, it does no harm. + +8. A pattern such as /\x{123}{2,2}+/8 was incorrectly compiled; the trigger + was a minimum greater than 1 for a wide character in a possessive + repetition. The same bug could also affect patterns like /(\x{ff}{0,2})*/8 + which had an unlimited repeat of a nested, fixed maximum repeat of a wide + character. Chaos in the form of incorrect output or a compiling loop could + result. + +9. The restrictions on what a pattern can contain when partial matching is + requested for pcre_exec() have been removed. All patterns can now be + partially matched by this function. In addition, if there are at least two + slots in the offset vector, the offset of the earliest inspected character + for the match and the offset of the end of the subject are set in them when + PCRE_ERROR_PARTIAL is returned. + +10. Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is + synonymous with PCRE_PARTIAL, for backwards compatibility, and + PCRE_PARTIAL_HARD, which causes a partial match to supersede a full match, + and may be more useful for multi-segment matching. + +11. Partial matching with pcre_exec() is now more intuitive. A partial match + used to be given if ever the end of the subject was reached; now it is + given only if matching could not proceed because another character was + needed. This makes a difference in some odd cases such as Z(*FAIL) with the + string "Z", which now yields "no match" instead of "partial match". In the + case of pcre_dfa_exec(), "no match" is given if every matching path for the + final character ended with (*FAIL). + +12. Restarting a match using pcre_dfa_exec() after a partial match did not work + if the pattern had a "must contain" character that was already found in the + earlier partial match, unless partial matching was again requested. For + example, with the pattern /dog.(body)?/, the "must contain" character is + "g". If the first part-match was for the string "dog", restarting with + "sbody" failed. This bug has been fixed. + +13. The string returned by pcre_dfa_exec() after a partial match has been + changed so that it starts at the first inspected character rather than the + first character of the match. This makes a difference only if the pattern + starts with a lookbehind assertion or \b or \B (\K is not supported by + pcre_dfa_exec()). It's an incompatible change, but it makes the two + matching functions compatible, and I think it's the right thing to do. + +14. Added a pcredemo man page, created automatically from the pcredemo.c file, + so that the demonstration program is easily available in environments where + PCRE has not been installed from source. + +15. Arranged to add -DPCRE_STATIC to cflags in libpcre.pc, libpcreposix.cp, + libpcrecpp.pc and pcre-config when PCRE is not compiled as a shared + library. + +16. Added REG_UNGREEDY to the pcreposix interface, at the request of a user. + It maps to PCRE_UNGREEDY. It is not, of course, POSIX-compatible, but it + is not the first non-POSIX option to be added. Clearly some people find + these options useful. + +17. If a caller to the POSIX matching function regexec() passes a non-zero + value for nmatch with a NULL value for pmatch, the value of + nmatch is forced to zero. + +18. RunGrepTest did not have a test for the availability of the -u option of + the diff command, as RunTest does. It now checks in the same way as + RunTest, and also checks for the -b option. + +19. If an odd number of negated classes containing just a single character + interposed, within parentheses, between a forward reference to a named + subpattern and the definition of the subpattern, compilation crashed with + an internal error, complaining that it could not find the referenced + subpattern. An example of a crashing pattern is /(?&A)(([^m])(?))/. + [The bug was that it was starting one character too far in when skipping + over the character class, thus treating the ] as data rather than + terminating the class. This meant it could skip too much.] + +20. Added PCRE_NOTEMPTY_ATSTART in order to be able to correctly implement the + /g option in pcretest when the pattern contains \K, which makes it possible + to have an empty string match not at the start, even when the pattern is + anchored. Updated pcretest and pcredemo to use this option. + +21. If the maximum number of capturing subpatterns in a recursion was greater + than the maximum at the outer level, the higher number was returned, but + with unset values at the outer level. The correct (outer level) value is + now given. + +22. If (*ACCEPT) appeared inside capturing parentheses, previous releases of + PCRE did not set those parentheses (unlike Perl). I have now found a way to + make it do so. The string so far is captured, making this feature + compatible with Perl. + +23. The tests have been re-organized, adding tests 11 and 12, to make it + possible to check the Perl 5.10 features against Perl 5.10. + +24. Perl 5.10 allows subroutine calls in lookbehinds, as long as the subroutine + pattern matches a fixed length string. PCRE did not allow this; now it + does. Neither allows recursion. + +25. I finally figured out how to implement a request to provide the minimum + length of subject string that was needed in order to match a given pattern. + (It was back references and recursion that I had previously got hung up + on.) This code has now been added to pcre_study(); it finds a lower bound + to the length of subject needed. It is not necessarily the greatest lower + bound, but using it to avoid searching strings that are too short does give + some useful speed-ups. The value is available to calling programs via + pcre_fullinfo(). + +26. While implementing 25, I discovered to my embarrassment that pcretest had + not been passing the result of pcre_study() to pcre_dfa_exec(), so the + study optimizations had never been tested with that matching function. + Oops. What is worse, even when it was passed study data, there was a bug in + pcre_dfa_exec() that meant it never actually used it. Double oops. There + were also very few tests of studied patterns with pcre_dfa_exec(). + +27. If (?| is used to create subpatterns with duplicate numbers, they are now + allowed to have the same name, even if PCRE_DUPNAMES is not set. However, + on the other side of the coin, they are no longer allowed to have different + names, because these cannot be distinguished in PCRE, and this has caused + confusion. (This is a difference from Perl.) + +28. When duplicate subpattern names are present (necessarily with different + numbers, as required by 27 above), and a test is made by name in a + conditional pattern, either for a subpattern having been matched, or for + recursion in such a pattern, all the associated numbered subpatterns are + tested, and the overall condition is true if the condition is true for any + one of them. This is the way Perl works, and is also more like the way + testing by number works. + + +Version 7.9 11-Apr-09 --------------------- -1. When building with support for bzlib/zlib (pcregrep) and/or readline +1. When building with support for bzlib/zlib (pcregrep) and/or readline (pcretest), all targets were linked against these libraries. This included libpcre, libpcreposix, and libpcrecpp, even though they do not use these libraries. This caused unwanted dependencies to be created. This problem - has been fixed, and now only pcregrep is linked with bzlib/zlib and only + has been fixed, and now only pcregrep is linked with bzlib/zlib and only pcretest is linked with readline. - + 2. The "typedef int BOOL" in pcre_internal.h that was included inside the "#ifndef FALSE" condition by an earlier change (probably 7.8/18) has been moved outside it again, because FALSE and TRUE are already defined in AIX, but BOOL is not. - -3. The pcre_config() function was treating the PCRE_MATCH_LIMIT and - PCRE_MATCH_LIMIT_RETURSION values as ints, when they should be long ints. - + +3. The pcre_config() function was treating the PCRE_MATCH_LIMIT and + PCRE_MATCH_LIMIT_RECURSION values as ints, when they should be long ints. + 4. The pcregrep documentation said spaces were inserted as well as colons (or hyphens) following file names and line numbers when outputting matching - lines. This is not true; no spaces are inserted. I have also clarified the + lines. This is not true; no spaces are inserted. I have also clarified the wording for the --colour (or --color) option. - + 5. In pcregrep, when --colour was used with -o, the list of matching strings was not coloured; this is different to GNU grep, so I have changed it to be the same. - + 6. When --colo(u)r was used in pcregrep, only the first matching substring in - each matching line was coloured. Now it goes on to look for further matches - of any of the test patterns, which is the same behaviour as GNU grep. - -7. A pattern that could match an empty string could cause pcregrep to loop; it - doesn't make sense to accept an empty string match in pcregrep, so I have + each matching line was coloured. Now it goes on to look for further matches + of any of the test patterns, which is the same behaviour as GNU grep. + +7. A pattern that could match an empty string could cause pcregrep to loop; it + doesn't make sense to accept an empty string match in pcregrep, so I have locked it out (using PCRE's PCRE_NOTEMPTY option). By experiment, this seems to be how GNU grep behaves. - + 8. The pattern (?(?=.*b)b|^) was incorrectly compiled as "match must be at - start or after a newline", because the conditional assertion was not being - correctly handled. The rule now is that both the assertion and what follows - in the first alternative must satisfy the test. - -9. If auto-callout was enabled in a pattern with a conditional group, PCRE - could crash during matching. - -10. The PCRE_DOLLAR_ENDONLY option was not working when pcre_dfa_exec() was - used for matching. - -11. Unicode property support in character classes was not working for + start or after a newline", because the conditional assertion was not being + correctly handled. The rule now is that both the assertion and what follows + in the first alternative must satisfy the test. + +9. If auto-callout was enabled in a pattern with a conditional group whose + condition was an assertion, PCRE could crash during matching, both with + pcre_exec() and pcre_dfa_exec(). + +10. The PCRE_DOLLAR_ENDONLY option was not working when pcre_dfa_exec() was + used for matching. + +11. Unicode property support in character classes was not working for characters (bytes) greater than 127 when not in UTF-8 mode. - + +12. Added the -M command line option to pcretest. + +14. Added the non-standard REG_NOTEMPTY option to the POSIX interface. + +15. Added the PCRE_NO_START_OPTIMIZE match-time option. + +16. Added comments and documentation about mis-use of no_arg in the C++ + wrapper. + +17. Implemented support for UTF-8 encoding in EBCDIC environments, a patch + from Martin Jerabek that uses macro names for all relevant character and + string constants. + +18. Added to pcre_internal.h two configuration checks: (a) If both EBCDIC and + SUPPORT_UTF8 are set, give an error; (b) If SUPPORT_UCP is set without + SUPPORT_UTF8, define SUPPORT_UTF8. The "configure" script handles both of + these, but not everybody uses configure. + +19. A conditional group that had only one branch was not being correctly + recognized as an item that could match an empty string. This meant that an + enclosing group might also not be so recognized, causing infinite looping + (and probably a segfault) for patterns such as ^"((?(?=[a])[^"])|b)*"$ + with the subject "ab", where knowledge that the repeated group can match + nothing is needed in order to break the loop. + +20. If a pattern that was compiled with callouts was matched using pcre_dfa_ + exec(), but without supplying a callout function, matching went wrong. + +21. If PCRE_ERROR_MATCHLIMIT occurred during a recursion, there was a memory + leak if the size of the offset vector was greater than 30. When the vector + is smaller, the saved offsets during recursion go onto a local stack + vector, but for larger vectors malloc() is used. It was failing to free + when the recursion yielded PCRE_ERROR_MATCH_LIMIT (or any other "abnormal" + error, in fact). + +22. There was a missing #ifdef SUPPORT_UTF8 round one of the variables in the + heapframe that is used only when UTF-8 support is enabled. This caused no + problem, but was untidy. + +23. Steven Van Ingelgem's patch to CMakeLists.txt to change the name + CMAKE_BINARY_DIR to PROJECT_BINARY_DIR so that it works when PCRE is + included within another project. + +24. Steven Van Ingelgem's patches to add more options to the CMake support, + slightly modified by me: + + (a) PCRE_BUILD_TESTS can be set OFF not to build the tests, including + not building pcregrep. + + (b) PCRE_BUILD_PCREGREP can be see OFF not to build pcregrep, but only + if PCRE_BUILD_TESTS is also set OFF, because the tests use pcregrep. + +25. Forward references, both numeric and by name, in patterns that made use of + duplicate group numbers, could behave incorrectly or give incorrect errors, + because when scanning forward to find the reference group, PCRE was not + taking into account the duplicate group numbers. A pattern such as + ^X(?3)(a)(?|(b)|(q))(Y) is an example. + +26. Changed a few more instances of "const unsigned char *" to USPTR, making + the feature of a custom pointer more persuasive (as requested by a user). + +27. Wrapped the definitions of fileno and isatty for Windows, which appear in + pcretest.c, inside #ifndefs, because it seems they are sometimes already + pre-defined. + +28. Added support for (*UTF8) at the start of a pattern. + +29. Arrange for flags added by the "release type" setting in CMake to be shown + in the configuration summary. + Version 7.8 05-Sep-08 ---------------------