cpython

Commit Graph

Author	SHA1	Message	Date
Serhiy Storchaka	894263ba80	bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) * The UTF-8 incremental decoders fails now fast if encounter a sequence that can't be handled by the error handler. * The UTF-16 incremental decoders with the surrogatepass error handler decodes now a lone low surrogate with final=False.	2019-06-25 11:54:18 +03:00
Victor Stinner	709d23dee6	bpo-36775: _PyCoreConfig only uses wchar_t* (GH-13062) _PyCoreConfig: Change filesystem_encoding, filesystem_errors, stdio_encoding and stdio_errors fields type from char* to wchar_t. Changes: PyInterpreterState: replace fscodec_initialized (int) with fs_codec structure. * Add get_error_handler_wide() and unicode_encode_utf8() helper functions. * Add error_handler parameter to unicode_encode_locale() and unicode_decode_locale(). * Remove _PyCoreConfig_SetString(). * Rename _PyCoreConfig_SetWideString() to _PyCoreConfig_SetString(). * Rename _PyCoreConfig_SetWideStringFromString() to _PyCoreConfig_DecodeLocale().	2019-05-02 14:56:30 -04:00
Victor Stinner	3d4226a832	bpo-34523: Support surrogatepass in locale codecs (GH-8995) Add support for the "surrogatepass" error handler in PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault() for the UTF-8 encoding. Changes: * _Py_DecodeUTF8Ex() and _Py_EncodeUTF8Ex() now support the surrogatepass error handler (_Py_ERROR_SURROGATEPASS). * _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() now use the _Py_error_handler enum instead of "int surrogateescape" to pass the error handler. These functions now return -3 if the error handler is unknown. * Add unit tests on _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() in test_codecs. * Rename get_error_handler() to _Py_GetErrorHandler() and expose it as a private function. * _freeze_importlib doesn't need config.filesystem_errors="strict" workaround anymore.	2018-08-29 22:21:32 +02:00
Stefan Krah	f432a3234f	bpo-30923: Silence fall-through warnings included in -Wextra since gcc-7.0. (#3157 )	2017-08-21 13:09:59 +02:00
Serhiy Storchaka	998c9cdd42	Issue #28561 : Clean up UTF-8 encoder: remove dead code, update comments, etc. Patch by Xiang Zhang.	2016-10-30 18:25:27 +02:00
Victor Stinner	1a05d6c04d	PEP 7 style for if/else in C Add also a newline for readability in normalize_encoding().	2016-09-02 12:12:23 +02:00
Raymond Hettinger	15f44ab043	Issue #27895 : Spelling fixes (Contributed by Ville Skyttä).	2016-08-30 10:47:49 -07:00
Serhiy Storchaka	bcde10aa7e	Issue #26765 : Ensure that bytes- and unicode-specific stringlib files are used with correct type.	2016-05-16 09:42:29 +03:00
Victor Stinner	6bd525b656	Optimize error handlers of ASCII and Latin1 encoders when the replacement string is pure ASCII: use _PyBytesWriter_WriteBytes(), don't check individual character. Cleanup unicode_encode_ucs1(): * Rename repunicode to rep * Clear rep object on error * Factorize code between bytes and unicode path	2015-10-09 13:10:05 +02:00
Victor Stinner	ce179bf6ba	Add _PyBytesWriter_WriteBytes() to factorize the code	2015-10-09 12:57:22 +02:00
Victor Stinner	ad7715891e	_PyBytesWriter: simplify code to avoid "prealloc" parameters Substract preallocate bytes from min_size before calling _PyBytesWriter_Prepare().	2015-10-09 12:38:53 +02:00
Victor Stinner	e7bf86cd7d	Optimize backslashreplace error handler Issue #25318: Optimize backslashreplace and xmlcharrefreplace error handlers in UTF-8 encoder. Optimize also backslashreplace error handler for ASCII and Latin1 encoders. Use the new _PyBytesWriter API to optimize these error handlers for the encoders. It avoids to create an exception and call the slow implementation of the error handler.	2015-10-09 01:39:28 +02:00
Victor Stinner	fdfbf78114	Issue #25318 : Add _PyBytesWriter API Add a new private API to optimize Unicode encoders. It uses a small buffer allocated on the stack and supports overallocation. Use _PyBytesWriter API for UCS1 (ASCII and Latin1) and UTF-8 encoders. Enable overallocation for the UTF-8 encoder with error handlers. unicode_encode_ucs1(): initialize collend to collstart+1 to not check the current character twice, we already know that it is not ASCII.	2015-10-09 00:33:49 +02:00
Victor Stinner	01ada3996b	Issue #25267 : The UTF-8 encoder is now up to 75 times as fast for error handlers: ``ignore``, ``replace``, ``surrogateescape``, ``surrogatepass``. Patch co-written with Serhiy Storchaka.	2015-10-01 21:54:51 +02:00
Serhiy Storchaka	9ce71a6475	Fixed typos in comments.	2015-05-18 22:20:18 +03:00
Serhiy Storchaka	7e29eea926	Fixed typos in comments.	2015-05-18 22:19:42 +03:00
Serhiy Storchaka	0d4df752ac	Issue #15027 : The UTF-32 encoder is now 3x to 7x faster.	2015-05-12 23:12:45 +03:00
Serhiy Storchaka	3079328d29	Reverted changeset b72c5573c5e7 (issue #15027 ).	2014-01-04 22:44:01 +02:00
Serhiy Storchaka	583a93943c	Issue #15027 : Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster.	2014-01-04 19:25:37 +02:00
Serhiy Storchaka	dc2fd5101a	Remove dead code committed in issue #12892 .	2013-11-19 15:56:05 +02:00
Serhiy Storchaka	58cf607d13	Issue #12892 : The utf-16* and utf-32* codecs now reject (lone) surrogates. The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800-U+DFFF) to be encoded. The utf-32* decoders no longer decode byte sequences that correspond to surrogate code points. The surrogatepass error handler now works with the utf-16* and utf-32* codecs. Based on patches by Victor Stinner and Kang-Hao (Kenny) Lu.	2013-11-19 11:32:41 +02:00
Antoine Pitrou	9ed5f27266	Issue #18722 : Remove uses of the "register" keyword in C code.	2013-08-13 20:18:52 +02:00
Victor Stinner	6caa6fb535	(Merge 3.3) Issue #8271 : Fix compilation on Windows	2012-11-05 00:00:50 +01:00
Victor Stinner	ab60de478d	Issue #8271 : Fix compilation on Windows	2012-11-04 23:59:15 +01:00
Ezio Melotti	cfa9636404	#8271 : merge with 3.3.	2012-11-04 23:23:09 +02:00
Ezio Melotti	f7ed5d111b	#8271 : the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti.	2012-11-04 23:21:38 +02:00
Christian Heimes	743e0cd6b5	Issue #16166 : Add PY_LITTLE_ENDIAN and PY_BIG_ENDIAN macros and unified endianess detection and handling.	2012-10-17 23:52:17 +02:00
Antoine Pitrou	ca8aa4acf6	Issue #15144 : Fix possible integer overflow when handling pointers as integer values, by using Py_uintptr_t instead of size_t. Patch by Serhiy Storchaka.	2012-09-20 20:56:47 +02:00
Mark Dickinson	01ac8b6ab1	Use correct types for ASCII_CHAR_MASK integer constants.	2012-07-07 14:08:48 +02:00
Mark Dickinson	106c4145ff	Issue #14923 : Optimize continuation-byte check in UTF-8 decoding. Patch by Serhiy Storchaka.	2012-06-23 21:45:14 +01:00
Antoine Pitrou	27f6a3b0bf	Issue #15026 : utf-16 encoding is now significantly faster (up to 10x). Patch by Serhiy Storchaka.	2012-06-15 22:15:23 +02:00
Antoine Pitrou	63065d761e	Issue #14624 : UTF-16 decoding is now 3x to 4x faster on various inputs. Patch by Serhiy Storchaka.	2012-05-15 23:48:04 +02:00
Antoine Pitrou	ca5f91b888	Issue #14738 : Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka.	2012-05-10 16:36:02 +02:00
Victor Stinner	6099a03202	Issue #13624 : Write a specialized UTF-8 encoder to allow more optimization The main bottleneck was the PyUnicode_READ() macro.	2011-12-18 14:22:26 +01:00
Antoine Pitrou	0a3229de6b	Issue #13417 : speed up utf-8 decoding by around 2x for the non-fully-ASCII case. This almost catches up with pre-PEP 393 performance, when decoding needed only one pass.	2011-11-21 20:39:13 +01:00

35 Commits