svn+ssh://pythondev@svn.python.org/python/branches/py3k
........
r83395 | georg.brandl | 2010-08-01 10:49:18 +0200 (So, 01 Aug 2010) | 1 line
#8821: do not rely on Unicode strings being terminated with a \u0000, rather explicitly check range before looking for a second surrogate character.
........
1) #8271: when a byte sequence is invalid, only the start byte and all the
valid continuation bytes are now replaced by U+FFFD, instead of replacing
the number of bytes specified by the start byte.
See http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
2) 5- and 6-bytes-long UTF-8 sequences are now considered invalid (no changes
in behavior);
3) Add code and tests to reject surrogates (U+D800-U+DFFF) as defined in
RFC 3629, but leave it commented out since it's not backward compatible;
4) Change the error messages "unexpected code byte" to "invalid start byte"
and "invalid data" to "invalid continuation byte";
5) Add an extensive set of tests in test_unicode;
6) Fix test_codeccallbacks because it was failing after this change.
any errors that might occur during coercion of the left operand and
turning them into a TypeError with a message text that was confusing in
the given context. This patch lets any errors through, as was already
done during coercion of the right hand side.
Without this change, test_unicode.UnicodeTest.test_codecs_utf7 crashes in
debug mode. What happens is the unicode string u'\U000abcde' with a length
of 1 encodes to the string '+2m/c3g-' of length 8. Since only 5 bytes is
reserved in the buffer, a buffer overrun occurs.
PyUnicode_DecodeUTF8() once, remember the result and output it in a second
step. This avoids problems with counting UTF-8 bytes that ignores the effect
of using the replace error handler in PyUnicode_DecodeUTF8().
If anyone wants to clean up the documentation, feel free. It's my first documentation foray, and it's not that great.
Will port to py3k with a different strategy.
sizeof(Py_UNICODE) == 2, PyUnicode_FromWideChar now converts
each character outside the BMP to the appropriate surrogate pair.
Thanks Victor Stinner for the patch.
(backport of r70452 from py3k to trunk)
the freelist contained half-initialized objects with freed pointers.
The comment
/* XXX UNREF/NEWREF interface should be more symmetrical */
was copied from tupleobject.c, and appears in some other places.
I sign the petition.