.\"
page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns
the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is
a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
both cases, information about the precise nature of the error may also be
returned (see the descriptions of these errors in the section entitled \fIError
return values from\fP \fBpcre_exec()\fP
.sp
  PCRE_ERROR_BADUTF8_OFFSET (-11)
.sp
The UTF-8 byte sequence that was passed as a subject was checked and found to
be valid (the PCRE_NO_UTF8_CHECK option was not set), but the value of
\fIstartoffset\fP did not point to the beginning of a UTF-8 character or the
end of the subject.
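.sp
A caller can guard against this error by testing the offset before the call.
The following is a minimal sketch, not part of the PCRE API: the helper name
utf8_is_char_start() is hypothetical, and it assumes that the subject is
already known to be valid UTF-8. An offset is accepted if it is the end of the
subject or if the byte at that offset is not a continuation byte (binary
10xxxxxx).
.sp
.nf
```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Returns 1 if offset is an
   acceptable startoffset for a valid UTF-8 subject of the given length:
   either the end of the subject or the first byte of a character. */
int utf8_is_char_start(const unsigned char *subject, size_t length,
                       size_t offset)
{
  if (offset > length) return 0;     /* beyond the end of the subject */
  if (offset == length) return 1;    /* the end of the subject is allowed */
  /* Continuation bytes have 10 as their top two bits; in valid UTF-8 any
     other byte starts a character. */
  return (subject[offset] & 0xc0) != 0x80;
}
```
.fi
.sp
For the four-byte subject 0x61, 0xc3, 0xa9, 0x62, this sketch accepts offsets
0, 1, 3, and 4, and rejects offset 2, which lies inside the 2-byte character.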
.sp
  PCRE_ERROR_RECURSELOOP (-26)
.sp
This error is returned when \fBpcre_exec()\fP detects a recursion loop within
the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same position
in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until run
.SS "Reason codes for invalid UTF-8 strings"
.rs
.sp
When \fBpcre_exec()\fP returns either PCRE_ERROR_BADUTF8 or
PCRE_ERROR_SHORTUTF8, and the size of the output vector (\fIovecsize\fP) is at
least 2, the offset of the start of the invalid UTF-8 character is placed in
the first output vector element (\fIovector[0]\fP) and a reason code is placed
in the second element (\fIovector[1]\fP). The reason codes are given names in
the \fBpcre.h\fP header file:
.sp
  PCRE_UTF8_ERR4
  PCRE_UTF8_ERR5
.sp
The string ends with a truncated UTF-8 character; the code specifies how many
bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
allows for up to 6 bytes, and this is checked first; hence the possibility of
4 or 5 missing bytes.
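.sp
To illustrate how a missing-byte count of this kind can arise, the following
is a minimal sketch under stated assumptions; it is not PCRE's own code. The
helper names utf8_expected_length() and utf8_missing_bytes() are hypothetical.
The length implied by the first byte follows the RFC 2279 scheme (up to 6
bytes), and the count is the difference between that length and the number of
bytes that remain in the subject.
.sp
.nf
```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Returns the total length (1 to 6)
   implied by a first byte under the RFC 2279 scheme, or 0 if the byte cannot
   start a character. */
int utf8_expected_length(unsigned char lead)
{
  if (lead < 0x80) return 1;    /* 0xxxxxxx */
  if (lead < 0xc0) return 0;    /* 10xxxxxx is a continuation byte */
  if (lead < 0xe0) return 2;    /* 110xxxxx */
  if (lead < 0xf0) return 3;    /* 1110xxxx */
  if (lead < 0xf8) return 4;    /* 11110xxx */
  if (lead < 0xfc) return 5;    /* 111110xx */
  if (lead < 0xfe) return 6;    /* 1111110x */
  return 0;                     /* 0xfe and 0xff never start a character */
}

/* Returns the number of bytes missing from the character that starts at
   subject[offset], 0 if enough bytes are present, or -1 if offset is out of
   range or the first byte cannot start a character. The continuation bytes
   themselves are not examined here. */
int utf8_missing_bytes(const unsigned char *subject, size_t length,
                       size_t offset)
{
  int expected;
  size_t available;
  if (offset >= length) return -1;
  expected = utf8_expected_length(subject[offset]);
  if (expected == 0) return -1;
  available = length - offset;
  if (available >= (size_t)expected) return 0;
  return expected - (int)available;
}
```
.fi
.sp
With this sketch, a subject that ends with the two bytes 0xe2, 0x82 has one
byte missing, and a subject that ends with the single byte 0xfc has five bytes
missing.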
.sp
  PCRE_UTF8_ERR6
  PCRE_UTF8_ERR9
  PCRE_UTF8_ERR10
.sp
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
character do not have the binary value 0b10 (that is, either the most
significant bit is 0, or the next bit is 1).
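.sp
The bit test described above can be sketched as follows. This is an
illustration only, not PCRE's own code; the helper name
utf8_first_bad_continuation() is hypothetical. A byte is a well-formed
continuation byte when masking it with 0xc0 gives 0x80.
.sp
.nf
```c
/* Hypothetical helper, not a PCRE function. The character starts at s[0] and
   is expected to occupy nbytes bytes (2 to 6), all of which must be readable.
   Returns the 1-based position (2 to nbytes) of the first byte whose top two
   bits are not 10, or 0 if every trailing byte is well formed. */
int utf8_first_bad_continuation(const unsigned char *s, int nbytes)
{
  int i;
  for (i = 1; i < nbytes; i++)
    {
    if ((s[i] & 0xc0) != 0x80) return i + 1;
    }
  return 0;
}
```
.fi
.sp
For the three bytes 0xe2, 0x41, 0xac this sketch returns 2, because the 2nd
byte has a most significant bit of 0.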
.sp
  PCRE_UTF8_ERR11
  PCRE_UTF8_ERR12
.sp
A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
these code points are excluded by RFC 3629.
.sp
  PCRE_UTF8_ERR13
.sp
A 4-byte character has a value greater than 0x10ffff; these code points are
excluded by RFC 3629.
.sp
  PCRE_UTF8_ERR14
.sp
A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
code points is reserved by RFC 3629 for use with UTF-16, and so is excluded
from UTF-8.
.sp
  PCRE_UTF8_ERR15
  PCRE_UTF8_ERR16
  PCRE_UTF8_ERR17
  PCRE_UTF8_ERR18
  PCRE_UTF8_ERR19
.sp
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
value that can be represented by fewer bytes, which is invalid. For example,
the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
one byte.
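.sp
The arithmetic behind the 2-byte example can be sketched as follows. This is
an illustration only, not PCRE's own code; the helper names utf8_decode2() and
utf8_is_overlong2() are hypothetical. A 2-byte sequence carries 5 payload bits
in its first byte and 6 in its second, and it is overlong when the decoded
value is below 0x80, because such values fit in one byte.
.sp
.nf
```c
/* Hypothetical helper, not a PCRE function. Decodes a 2-byte sequence whose
   bytes are assumed to have the forms 110xxxxx and 10xxxxxx, and returns the
   code point that it encodes. */
unsigned int utf8_decode2(unsigned char b0, unsigned char b1)
{
  return ((unsigned int)(b0 & 0x1f) << 6) | (unsigned int)(b1 & 0x3f);
}

/* Returns 1 if the 2-byte sequence is overlong, that is, if its value is
   below 0x80 and could have been coded in a single byte. */
int utf8_is_overlong2(unsigned char b0, unsigned char b1)
{
  return utf8_decode2(b0, b1) < 0x80;
}
```
.fi
.sp
For 0xc0, 0xae the payload bits are 00000 and 101110, giving 0x2e, so the
sequence is overlong. For 0xc3, 0xa9 the decoded value is 0xe9, which needs
two bytes, so that sequence is not overlong.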
.sp
  PCRE_UTF8_ERR20
.sp
The two most significant bits of the first byte of a character have the binary
value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
byte can only validly occur as the second or subsequent byte of a multi-byte
character.
.sp
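As an illustration of how a caller might use the offset and reason code, the
following is a minimal sketch under stated assumptions. The helper name
describe_utf8_error() and the EX_ macros are hypothetical; the numeric values
given to the macros are assumptions, and real code should include
\fBpcre.h\fP and use PCRE_ERROR_BADUTF8 and PCRE_ERROR_SHORTUTF8 instead. The
sketch reads \fIovector[0]\fP and \fIovector[1]\fP only when the return code is
one of the two UTF-8 errors and \fIovecsize\fP is at least 2.
.sp
.nf
```c
#include <stdio.h>

/* Assumed values for illustration; real code should include pcre.h and use
   PCRE_ERROR_BADUTF8 and PCRE_ERROR_SHORTUTF8. */
#define EX_ERROR_BADUTF8   (-10)
#define EX_ERROR_SHORTUTF8 (-25)

/* Hypothetical helper, not a PCRE function. If rc is one of the two UTF-8
   errors and the output vector has at least two elements, writes a short
   description into buf and returns 1. Otherwise leaves buf empty and
   returns 0. */
int describe_utf8_error(int rc, const int *ovector, int ovecsize,
                        char *buf, size_t bufsize)
{
  if (bufsize > 0) buf[0] = 0;
  if (rc != EX_ERROR_BADUTF8 && rc != EX_ERROR_SHORTUTF8) return 0;
  if (ovecsize < 2) return 0;    /* no offset or reason code was stored */
  snprintf(buf, bufsize, "%s UTF-8 character at offset %d, reason code %d",
           (rc == EX_ERROR_SHORTUTF8) ? "truncated" : "invalid",
           ovector[0], ovector[1]);
  return 1;
}
```
.fi
.sp
The reason code is reported here as a number; a caller that needs to
distinguish the cases can compare it with the names listed above.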