1) #8271: when a byte sequence is invalid, only the start byte and all the
valid continuation bytes are now replaced by U+FFFD, instead of replacing
the number of bytes specified by the start byte.
See http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
2) 5- and 6-bytes-long UTF-8 sequences are now considered invalid (no changes
in behavior);
3) Add code and tests to reject surrogates (U+D800-U+DFFF) as defined in
RFC 3629, but leave it commented out since it's not backward compatible;
4) Change the error messages "unexpected code byte" to "invalid start byte"
and "invalid data" to "invalid continuation byte";
5) Add an extensive set of tests in test_unicode;
6) Fix test_codeccallbacks because it was failing after this change.
about illegal code points. The codec now supports PEP 293 style error handlers.
(This is a variant of the Nik Haldimann's patch that detects truncated data)
charmaptranslate_makespace() allocated more memory than required for the
next replacement but didn't remember that fact, so memory size was growing
exponentially every time a replacement string is longer that one character.
This fixes SF bug #828737.
and test_support.run_classtests() into run_unittest()
and use it wherever possible.
Also don't use "from test.test_support import ...", but
"from test import test_support" in a few spots.
From SF patch #662807.
error handers in the Unicode codecs: Negative
positions are treated as being relative to the end of
the input and out of bounds positions result in an
IndexError.
Also update the PEP and include an explanation of
this in the documentation for codecs.register_error.
Fixes a small bug in iconv_codecs: if the position
from the callback is negative *add* it to the size
instead of substracting it.
From SF patch #677429.
From:
69.73% of 294 source lines executed in file ./Modules/_codecsmodule.c
79.47% of 487 source lines executed in file Python/codecs.c
78.45% of 3643 source lines executed in file Objects/unicodeobject.c
To:
70.41% of 294 source lines executed in file ./Modules/_codecsmodule.c
82.75% of 487 source lines executed in file Python/codecs.c
80.76% of 3638 source lines executed in file Objects/unicodeobject.c
This actually unearthed a bug in the handling of None
values in PyUnicode_EncodeCharmap.