* The UTF-8 incremental decoders fails now fast if encounter
a sequence that can't be handled by the error handler.
* The UTF-16 incremental decoders with the surrogatepass error
handler decodes now a lone low surrogate with final=False.
(cherry picked from commit 894263ba80)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
_PyCoreConfig: Change filesystem_encoding, filesystem_errors,
stdio_encoding and stdio_errors fields type from char* to wchar_t*.
Changes:
* PyInterpreterState: replace fscodec_initialized (int) with fs_codec
structure.
* Add get_error_handler_wide() and unicode_encode_utf8() helper
functions.
* Add error_handler parameter to unicode_encode_locale()
and unicode_decode_locale().
* Remove _PyCoreConfig_SetString().
* Rename _PyCoreConfig_SetWideString() to _PyCoreConfig_SetString().
* Rename _PyCoreConfig_SetWideStringFromString()
to _PyCoreConfig_DecodeLocale().
Add support for the "surrogatepass" error handler in
PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault()
for the UTF-8 encoding.
Changes:
* _Py_DecodeUTF8Ex() and _Py_EncodeUTF8Ex() now support the
surrogatepass error handler (_Py_ERROR_SURROGATEPASS).
* _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() now use
the _Py_error_handler enum instead of "int surrogateescape" to pass
the error handler. These functions now return -3 if the error
handler is unknown.
* Add unit tests on _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx()
in test_codecs.
* Rename get_error_handler() to _Py_GetErrorHandler() and expose it
as a private function.
* _freeze_importlib doesn't need config.filesystem_errors="strict"
workaround anymore.
string is pure ASCII: use _PyBytesWriter_WriteBytes(), don't check individual
character.
Cleanup unicode_encode_ucs1():
* Rename repunicode to rep
* Clear rep object on error
* Factorize code between bytes and unicode path
Issue #25318: Optimize backslashreplace and xmlcharrefreplace error handlers in
UTF-8 encoder. Optimize also backslashreplace error handler for ASCII and
Latin1 encoders.
Use the new _PyBytesWriter API to optimize these error handlers for the
encoders. It avoids to create an exception and call the slow implementation of
the error handler.
Add a new private API to optimize Unicode encoders. It uses a small buffer
allocated on the stack and supports overallocation.
Use _PyBytesWriter API for UCS1 (ASCII and Latin1) and UTF-8 encoders. Enable
overallocation for the UTF-8 encoder with error handlers.
unicode_encode_ucs1(): initialize collend to collstart+1 to not check the
current character twice, we already know that it is not ASCII.
The utf-16* and utf-32* encoders no longer allow surrogate code points
(U+D800-U+DFFF) to be encoded.
The utf-32* decoders no longer decode byte sequences that correspond to
surrogate code points.
The surrogatepass error handler now works with the utf-16* and utf-32* codecs.
Based on patches by Victor Stinner and Kang-Hao (Kenny) Lu.