It's a ParseTuple converter: decode bytes objects to unicode using
PyUnicode_DecodeFSDefaultAndSize(); str objects are output as-is.
* Don't specify surrogateescape error handler in the comments nor the
documentation, but PyUnicode_DecodeFSDefaultAndSize() and
PyUnicode_EncodeFSDefault() because these functions use strict error handler
for the mbcs encoding (on Windows).
* Remove PyUnicode_FSConverter() comment in unicodeobject.c to avoid
inconsistency with unicodeobject.h.
1) #8271: when a byte sequence is invalid, only the start byte and all the
valid continuation bytes are now replaced by U+FFFD, instead of replacing
the number of bytes specified by the start byte.
See http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
2) 5- and 6-bytes-long UTF-8 sequences are now considered invalid (no changes
in behavior);
3) Change the error messages "unexpected code byte" to "invalid start byte"
and "invalid data" to "invalid continuation byte";
4) Add an extensive set of tests in test_unicode;
5) Fix test_codeccallbacks because it was failing after this change.
mode raises unicode errors. The encoder only supports "strict" and "replace"
error handlers, the decoder only supports "strict" and "ignore" error handlers.
svn+ssh://pythondev@svn.python.org/python/trunk
........
r81907 | antoine.pitrou | 2010-06-11 23:42:26 +0200 (ven., 11 juin 2010) | 5 lines
Issue #8941: decoding big endian UTF-32 data in UCS-2 builds could crash
the interpreter with characters outside the Basic Multilingual Plane
(higher than 0x10000).
........
object to Py_FileSystemDefaultEncoding with the "surrogateescape" error
handler, return a bytes object. If Py_FileSystemDefaultEncoding is not set,
fall back to UTF-8.
This function is only used to decode Python module filenames, but Python
doesn't support surrogates in modules filenames yet. So nobody noticed this
minor bug.
svn+ssh://pythondev@svn.python.org/python/trunk
........
r79494 | florent.xicluna | 2010-03-30 10:24:06 +0200 (mar, 30 mar 2010) | 2 lines
#7643: Unicode codepoints VT (0x0B) and FF (0x0C) are linebreaks according to Unicode Standard Annex #14.
........
r79496 | florent.xicluna | 2010-03-30 18:29:03 +0200 (mar, 30 mar 2010) | 2 lines
Highlight the change of behavior related to r79494. Now VT and FF are linebreaks.
........
svn+ssh://pythondev@svn.python.org/python/trunk
........
r79278 | victor.stinner | 2010-03-22 13:24:37 +0100 (lun., 22 mars 2010) | 2 lines
Issue #1583863: An unicode subclass can now override the __str__ method
........
r79280 | victor.stinner | 2010-03-22 13:36:28 +0100 (lun., 22 mars 2010) | 5 lines
Fix the NEWS about my last commit: an unicode subclass can now override the
__unicode__ method (and not the __str__ method).
Simplify also the testcase.
........
svn+ssh://pythondev@svn.python.org/python/trunk
........
r77461 | antoine.pitrou | 2010-01-13 08:55:48 +0100 (mer., 13 janv. 2010) | 5 lines
Issue #7622: Improve the split(), rsplit(), splitlines() and replace()
methods of bytes, bytearray and unicode objects by using a common
implementation based on stringlib's fast search. Patch by Florent Xicluna.
........