ChangeLog for PCRE
------------------

Version 8.12 12-Jan-2011
------------------------

1.  Fixed some typos in the markup of the man pages, and wrote a script that
    checks for such things as part of the documentation building process.

2.  On a big-endian 64-bit system, pcregrep did not correctly process the
    --match-limit and --recursion-limit options (added for 8.11). In
    particular, this made one of the standard tests crash. (The integer value
    went into the wrong half of a long int.)
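
    A minimal sketch of the failure mode described in item 2, written in
    Python for illustration only (pcregrep itself is C, and its actual code is
    not reproduced here). It assumes the option value was written as a 4-byte
    int at the start of an 8-byte long slot and later read back as a long. The
    helper name store_int_into_long_slot is hypothetical.

```python
import struct


def store_int_into_long_slot(value: int, byteorder: str) -> int:
    """Write a 4-byte int at the start of an 8-byte slot, then read the slot as a long."""
    prefix = ">" if byteorder == "big" else "<"
    slot = bytearray(8)                             # zero-initialised 64-bit slot
    struct.pack_into(prefix + "i", slot, 0, value)  # assumed bug: only 4 bytes are written
    return struct.unpack(prefix + "q", slot)[0]     # the slot is later read as 8 bytes


print(store_int_into_long_slot(1000, "little"))  # 1000
print(store_int_into_long_slot(1000, "big"))     # 4294967296000
```

    On a little-endian layout the four bytes land in the low half and the
    value survives. On a big-endian layout they land in the high half, so the
    limit is read back as 1000 << 32.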

3.  If the --colour option was given to pcregrep with -v (invert match), it
    did strange things, either producing crazy output, or crashing. It should,
    of course, ignore a request for colour when reporting lines that do not
    match.

4.  Another pcregrep bug caused similar problems if --colour was specified with
    -M (multiline) and the pattern match finished with a line ending.

5.  In pcregrep, when a pattern that ended with a literal newline sequence was
    matched in multiline mode, the following line was shown as part of the
    match. This seems wrong, so I have changed it.

6.  If pcregrep was compiled under Windows, there was a reference to the
    function pcregrep_exit() before it was defined. I am assuming this was
    the cause of the "error C2371: 'pcregrep_exit' : redefinition;" that was
    reported by a user. I've moved the definition above the reference.


Version 8.11 10-Dec-2010
------------------------

1.  (*THEN) was not working properly if there were untried alternatives prior
    to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it
    backtracked to try for "b" instead of moving to the next alternative branch
    at the same level (in this case, to look for "c"). The Perl documentation
    is clear that when (*THEN) is backtracked onto, it goes to the "next
    alternative in the innermost enclosing group".

2.  (*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern
    such as (A(*COMMIT)B(*THEN)C|D) any failure after matching A should
    result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and
    (*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides
    (*THEN).

3.  If \s appeared in a character class, it removed the VT character from
    the class, even if it had been included by some previous item, for example
    in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
    of \s, but is part of the POSIX "space" class.)

4.  A partial match never returns an empty string (because you can always
    match an empty string at the end of the subject); however the checking for
    an empty string was starting at the "start of match" point. This has been
    [...]
    (previously it gave "no match").

5.  Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
    of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
    previously a full match would be given. However, setting PCRE_PARTIAL_HARD
    has an implication that the given string is incomplete (because a partial
    match is preferred over a full match). For this reason, these items now
    give a partial match in this situation. [Aside: previously, the one case
    /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
    match rather than a full match, which was wrong by the old rules, but is
    now correct.]

6.  There was a bug in the handling of #-introduced comments, recognized when
    PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set.
    If a UTF-8 multi-byte character included the byte 0x85 (e.g. U+0445, whose
    UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when
    scanning for the end of the comment. (*Character* 0x85 is an "any" newline,
    but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several
    places in pcre_compile().
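
    The byte-versus-character distinction in item 6 can be sketched as
    follows. This is an illustrative Python model, not the code in
    pcre_compile(). It assumes the pattern is valid UTF-8 and, for brevity,
    checks only LF and NEL (U+0085) out of the full "any" newline set. All
    three function names are hypothetical.

```python
def comment_end_bytewise(pattern: bytes, start: int) -> int:
    """Buggy scan: treats any single byte 0x0A or 0x85 as a newline."""
    i = start
    while i < len(pattern) and pattern[i] not in (0x0A, 0x85):
        i += 1
    return i


def utf8_char_len(lead: int) -> int:
    """Length of a UTF-8 character given its lead byte (valid input assumed)."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def comment_end_charwise(pattern: bytes, start: int) -> int:
    """Character-aware scan: NEL (U+0085) is the two bytes 0xC2 0x85 in UTF-8."""
    i = start
    while i < len(pattern):
        if pattern[i] == 0x0A or pattern[i:i + 2] == b"\xc2\x85":
            break
        i += utf8_char_len(pattern[i])
    return i


# "#" starts the comment at offset 1; U+0445 encodes as 0xD1 0x85.
pat = "a#x\u0445y\nb".encode("utf-8")
print(comment_end_bytewise(pat, 1))  # 4
print(comment_end_charwise(pat, 1))  # 6
```

    The byte-wise scan stops in the middle of U+0445, while the
    character-wise scan runs on to the real line feed.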

7.  Related to (6) above, when pcre_compile() was skipping #-introduced
    comments when looking ahead for named forward references to subpatterns,
    the only newline sequence it recognized was NL. It now handles newlines
    according to the set newline convention.

8.  SunOS4 doesn't have strerror() or strtoul(); pcregrep dealt with the
    former, but used strtoul(), whereas pcretest avoided strtoul() but did not
    cater for a lack of strerror(). These oversights have been fixed.

9.  Added --match-limit and --recursion-limit to pcregrep.

10. Added two casts needed to build with Visual Studio when NO_RECURSE is set.

11. When the -o option was used, pcregrep was setting a return code of 1, even
    when matches were found, and --line-buffered was not being honoured.

12. Added an optional parentheses number to the -o and --only-matching options
    of pcregrep.

13. Imitating Perl's /g action for multiple matches is tricky when the pattern
    can match an empty string. The code to do it in pcretest and pcredemo
    needed fixing:

    (a) When the newline convention was "crlf", pcretest got it wrong, skipping
        only one byte after an empty string match just before CRLF (this case
        just got forgotten; "any" and "anycrlf" were OK).

    (b) The pcretest code also had a bug, causing it to loop forever in UTF-8
        mode when an empty string match preceded an ASCII character followed by
        a non-ASCII character. (The code for advancing by one character rather
        than one byte was nonsense.)

    (c) The pcredemo.c sample program did not have any code at all to handle
        the cases when CRLF is a valid newline sequence.
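
    The "advance" step that items 13(a) to 13(c) are concerned with might look
    like the sketch below. It is a Python model under stated assumptions, not
    the code from pcretest or pcredemo.c: the function name and parameters are
    hypothetical, and it covers only the step taken after an empty-string
    match when no non-empty match can be found at the same offset.

```python
def advance_after_empty_match(subject: bytes, offset: int,
                              crlf_is_newline: bool, utf8: bool) -> int:
    """Return the next offset at which to restart matching."""
    # Never restart between CR and LF when CRLF is a valid newline sequence.
    if crlf_is_newline and subject[offset:offset + 2] == b"\r\n":
        return offset + 2
    offset += 1
    if utf8:
        # Skip continuation bytes (10xxxxxx) to land on a character boundary.
        while offset < len(subject) and (subject[offset] & 0xC0) == 0x80:
            offset += 1
    return offset


print(advance_after_empty_match(b"a\r\nb", 1, True, False))   # 3
print(advance_after_empty_match(b"a\xc3\xa9", 1, False, True))  # 3
```

    The loop in the UTF-8 branch always moves forward by at least one byte,
    so the restart offset cannot get stuck in front of an ASCII character.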

14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given
    as a starting offset was within the subject string. There is now a new
    error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is
    negative or greater than the length of the string. In order to test this,
    pcretest is extended to allow the setting of negative starting offsets.

15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the
    starting offset points to the beginning of a UTF-8 character was
    unnecessarily clumsy. I tidied it up.
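
    The two checks described in items 14 and 15 can be modelled roughly as
    below. This is a hedged Python sketch, not the PCRE source: the function
    name and the string results are hypothetical stand-ins for the C error
    codes, and the boundary test simply rejects an offset that points at a
    UTF-8 continuation byte.

```python
def check_start_offset(subject: bytes, start_offset: int, utf8: bool) -> str:
    """Classify a starting offset before any matching is attempted."""
    # Item 14: negative, or greater than the subject length, is an error.
    if start_offset < 0 or start_offset > len(subject):
        return "PCRE_ERROR_BADOFFSET"
    # Item 15: in UTF-8 mode the offset must not point at a continuation byte.
    if (utf8 and start_offset < len(subject)
            and (subject[start_offset] & 0xC0) == 0x80):
        return "not at a character boundary"
    return "ok"


subject = "a\u0445".encode("utf-8")  # b"a\xd1\x85", three bytes
for offset in (-1, 1, 2, 3, 4):
    print(offset, check_start_offset(subject, offset, True))
```

    An offset equal to the subject length is accepted, since item 14 rejects
    only offsets that are negative or greater than the length.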

16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
    bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
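
    To illustrate the distinction item 16 draws, here is a simplified Python
    classifier. It is an assumption-laden sketch rather than PCRE's validity
    check: the name classify_utf8 and its "ok", "short" and "bad" results are
    hypothetical, and it ignores finer rules such as overlong three-byte
    forms and surrogates.

```python
def classify_utf8(data: bytes) -> str:
    """Return "ok", "short" (truncated at the end) or "bad" (malformed)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            need = 0
        elif 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:
            return "bad"                  # stray continuation or invalid lead byte
        tail = data[i + 1:i + 1 + need]
        if any((b & 0xC0) != 0x80 for b in tail):
            return "bad"                  # a present byte is not a continuation
        if len(tail) < need:
            return "short"                # the subject ended mid-character
        i += 1 + need
    return "ok"


print(classify_utf8(b"caf\xc3\xa9"))  # ok
print(classify_utf8(b"caf\xc3"))      # short
print(classify_utf8(b"caf\xc3("))     # bad
```

    With partial matching, a "short" result suggests that more input may
    complete the character, whereas a "bad" result cannot be repaired by
    appending data.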

17. Nobody had reported that the --include_dir option, which was added in
    release 7.7, should have been called --include-dir (hyphen, not underscore)
    for compatibility with GNU grep. I have changed it to --include-dir, but
    left --include_dir as an undocumented synonym, and the same for
    --exclude-dir, though that is not available in GNU grep, at least as of
    release 2.5.4.

18. At a user's suggestion, the macros GETCHAR and friends (which pick up UTF-8
    characters from a string of bytes) have been redefined so as not to use
    loops, in order to improve performance in some environments. At the same
    time, I abstracted some of the common code into auxiliary macros to save
    repetition (this should not affect the compiled code).
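
    The idea behind item 18, decoding a UTF-8 character with a fixed branch
    per sequence length instead of a loop over continuation bytes, can be
    sketched in Python as follows. This is not the GETCHAR macro itself: the
    function get_char is hypothetical, it assumes valid input, and it handles
    only sequences of up to four bytes.

```python
def get_char(buf: bytes, i: int) -> tuple:
    """Decode the UTF-8 character at buf[i]; return (code point, byte length)."""
    c = buf[i]
    if c < 0x80:                       # 0xxxxxxx: one byte
        return c, 1
    if c < 0xE0:                       # 110xxxxx: two bytes
        return ((c & 0x1F) << 6) | (buf[i + 1] & 0x3F), 2
    if c < 0xF0:                       # 1110xxxx: three bytes
        return (((c & 0x0F) << 12) | ((buf[i + 1] & 0x3F) << 6)
                | (buf[i + 2] & 0x3F)), 3
    return (((c & 0x07) << 18) | ((buf[i + 1] & 0x3F) << 12)
            | ((buf[i + 2] & 0x3F) << 6) | (buf[i + 3] & 0x3F)), 4


print(get_char("\u0445".encode("utf-8"), 0))  # (1093, 2)
```

    Each branch is a single expression, so no loop counter or per-byte test
    is needed once the lead byte has been classified.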

19. If \c was followed by a multibyte UTF-8 character, bad things happened. A
    compile-time error is now given if \c is not followed by an ASCII
    character, that is, a byte less than 128. (In EBCDIC mode, the code is
    different, and any byte value is allowed.)
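
    A small Python model of the rule in item 19, for the non-EBCDIC case only.
    The function name and the use of ValueError are hypothetical. The mapping
    itself (convert a lower case letter to upper case, then invert bit 6)
    follows the documented behaviour of \c rather than the code in
    pcre_compile().

```python
def control_escape(ch: str) -> int:
    """Return the character code that \\c followed by `ch` stands for."""
    code = ord(ch)
    if code >= 128:
        # Models the compile-time error now given for non-ASCII characters.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= ch <= "z":
        code -= 32                     # lower case letters act as upper case
    return code ^ 0x40                 # invert bit 6


print(control_escape("A"))       # 1
print(hex(control_escape("[")))  # 0x1b
```

    Passing a non-ASCII character such as U+0445 raises the error instead of
    producing a code from only part of a multibyte sequence.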

20. Recognize (*NO_START_OPT) at the start of a pattern to set the
    PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time - but
    just passed through to pcre_exec() or pcre_dfa_exec(). This makes it
    available to pcregrep and other applications that have no direct access to
    PCRE options. The new /Y option in pcretest sets this option when calling
    pcre_compile().

21. Change 18 of release 8.01 broke the use of named subpatterns for recursive
    back references. Groups containing recursive back references were forced to
    be atomic by that change, but in the case of named groups, the amount of
    memory required was incorrectly computed, leading to "Failed: internal
    error: code overflow". This has been fixed.

22. Some patches to pcre_stringpiece.h, pcre_stringpiece_unittest.cc, and
    pcretest.c, to avoid build problems in some Borland environments.


Version 8.10 25-Jun-2010