Commit Graph

45 Commits

Author SHA1 Message Date
Serhiy Storchaka 18b07d773e
bpo-36819: Fix crashes in built-in encoders with weird error handlers (GH-28593)
If the error handler returns position less or equal than the starting
position of non-encodable characters, most of built-in encoders didn't
properly re-size the output buffer. This led to out-of-bounds writes,
and segfaults.
2022-05-02 12:37:48 +03:00
Jessica Clarke dec0757549
bpo-43179: Generalise alignment for optimised string routines (GH-24624)
* Remove m68k-specific hack from ascii_decode

On m68k, alignments of primitives is more relaxed, with 4-byte and
8-byte types only requiring 2-byte alignment, thus using sizeof(size_t)
does not work. Instead, use the portable alternative.

Note that this is a minimal fix that only relaxes the assertion and the
condition for when to use the optimised version remains overly strict.
Such issues will be fixed tree-wide in the next commit.

NB: In C11 we could use _Alignof(size_t) instead, but for compatibility
we use autoconf.

* Optimise string routines for architectures with non-natural alignment

C only requires that sizeof(x) is a multiple of alignof(x), not that the
two are equal. Thus anywhere where we optimise based on alignment we
should be using alignof(x) not sizeof(x).

This is more annoying than it would be in C11 where we could just use
_Alignof(x) (and alignof(x) in C++11), but since we still require only
C99 we must plumb the information all the way from autoconf through the
various typedefs and defines.
2021-03-31 12:12:39 +02:00
Ma Lin a0c603cb9d
bpo-38252: Use 8-byte step to detect ASCII sequence in 64bit Windows build (GH-16334) 2020-10-18 17:48:38 +03:00
Victor Stinner c6b292cdee
bpo-29882: Add _Py_popcount32() function (GH-20518)
* Rename pycore_byteswap.h to pycore_bitutils.h.
* Move popcount_digit() to pycore_bitutils.h as _Py_popcount32().
* _Py_popcount32() uses GCC and clang builtin function if available.
* Add unit tests to _Py_popcount32().
2020-06-08 16:30:33 +02:00
Victor Stinner d7c657d4b1
bpo-40302: UTF-32 encoder SWAB4() macro use a|b rather than a+b (GH-19572) 2020-04-17 19:13:34 +02:00
Victor Stinner 1ae035b7e8
bpo-40302: Add pycore_byteswap.h header file (GH-19552)
Add a new internal pycore_byteswap.h header file with the following
functions:

* _Py_bswap16()
* _Py_bswap32()
* _Py_bswap64()

Use these functions in _ctypes, sha256 and sha512 modules,
and also use in the UTF-32 encoder.

sha256, sha512 and _ctypes modules are now built with the internal
C API.
2020-04-17 17:47:20 +02:00
Serhiy Storchaka cd8295ff75
bpo-39943: Add the const qualifier to pointers on non-mutable PyUnicode data. (GH-19345) 2020-04-11 10:48:40 +03:00
Benjamin Peterson 51796e5d26
Update some www.unicode.org URLs to use HTTPS. (GH-18912) 2020-03-10 21:10:59 -07:00
Inada Naoki 02a4d57263
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
Avoid using temporary bytes object.
2020-02-27 13:48:59 +09:00
Andy Lester e6be9b59a9
closes bpo-39605: Fix some casts to not cast away const. (GH-18453)
gcc -Wcast-qual turns up a number of instances of casting away constness of pointers. Some of these can be safely modified, by either:

Adding the const to the type cast, as in:

-    return _PyUnicode_FromUCS1((unsigned char*)s, size);
+    return _PyUnicode_FromUCS1((const unsigned char*)s, size);

or, Removing the cast entirely, because it's not necessary (but probably was at one time), as in:

-    PyDTrace_FUNCTION_ENTRY((char *)filename, (char *)funcname, lineno);
+    PyDTrace_FUNCTION_ENTRY(filename, funcname, lineno);

These changes will not change code, but they will make it much easier to check for errors in consts
2020-02-11 18:28:35 -08:00
Serhiy Storchaka 894263ba80
bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304)
* The UTF-8 incremental decoders fails now fast if encounter
  a sequence that can't be handled by the error handler.
* The UTF-16 incremental decoders with the surrogatepass error
  handler decodes now a lone low surrogate with final=False.
2019-06-25 11:54:18 +03:00
Victor Stinner 709d23dee6
bpo-36775: _PyCoreConfig only uses wchar_t* (GH-13062)
_PyCoreConfig: Change filesystem_encoding, filesystem_errors,
stdio_encoding and stdio_errors fields type from char* to wchar_t*.

Changes:

* PyInterpreterState: replace fscodec_initialized (int) with fs_codec
  structure.
* Add get_error_handler_wide() and unicode_encode_utf8() helper
  functions.
* Add error_handler parameter to unicode_encode_locale()
  and unicode_decode_locale().
* Remove _PyCoreConfig_SetString().
* Rename _PyCoreConfig_SetWideString() to _PyCoreConfig_SetString().
* Rename _PyCoreConfig_SetWideStringFromString()
  to _PyCoreConfig_DecodeLocale().
2019-05-02 14:56:30 -04:00
Victor Stinner 3d4226a832
bpo-34523: Support surrogatepass in locale codecs (GH-8995)
Add support for the "surrogatepass" error handler in
PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault()
for the UTF-8 encoding.

Changes:

* _Py_DecodeUTF8Ex() and _Py_EncodeUTF8Ex() now support the
  surrogatepass error handler (_Py_ERROR_SURROGATEPASS).
* _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() now use
  the _Py_error_handler enum instead of "int surrogateescape" to pass
  the error handler. These functions now return -3 if the error
  handler is unknown.
* Add unit tests on _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx()
  in test_codecs.
* Rename get_error_handler() to _Py_GetErrorHandler() and expose it
  as a private function.
* _freeze_importlib doesn't need config.filesystem_errors="strict"
  workaround anymore.
2018-08-29 22:21:32 +02:00
Stefan Krah f432a3234f bpo-30923: Silence fall-through warnings included in -Wextra since gcc-7.0. (#3157) 2017-08-21 13:09:59 +02:00
Serhiy Storchaka 998c9cdd42 Issue #28561: Clean up UTF-8 encoder: remove dead code, update comments, etc.
Patch by Xiang Zhang.
2016-10-30 18:25:27 +02:00
Victor Stinner 1a05d6c04d PEP 7 style for if/else in C
Add also a newline for readability in normalize_encoding().
2016-09-02 12:12:23 +02:00
Raymond Hettinger 15f44ab043 Issue #27895: Spelling fixes (Contributed by Ville Skyttä). 2016-08-30 10:47:49 -07:00
Serhiy Storchaka bcde10aa7e Issue #26765: Ensure that bytes- and unicode-specific stringlib files are used
with correct type.
2016-05-16 09:42:29 +03:00
Victor Stinner 6bd525b656 Optimize error handlers of ASCII and Latin1 encoders when the replacement
string is pure ASCII: use _PyBytesWriter_WriteBytes(), don't check individual
character.

Cleanup unicode_encode_ucs1():

* Rename repunicode to rep
* Clear rep object on error
* Factorize code between bytes and unicode path
2015-10-09 13:10:05 +02:00
Victor Stinner ce179bf6ba Add _PyBytesWriter_WriteBytes() to factorize the code 2015-10-09 12:57:22 +02:00
Victor Stinner ad7715891e _PyBytesWriter: simplify code to avoid "prealloc" parameters
Substract preallocate bytes from min_size before calling
_PyBytesWriter_Prepare().
2015-10-09 12:38:53 +02:00
Victor Stinner e7bf86cd7d Optimize backslashreplace error handler
Issue #25318: Optimize backslashreplace and xmlcharrefreplace error handlers in
UTF-8 encoder. Optimize also backslashreplace error handler for ASCII and
Latin1 encoders.

Use the new _PyBytesWriter API to optimize these error handlers for the
encoders. It avoids to create an exception and call the slow implementation of
the error handler.
2015-10-09 01:39:28 +02:00
Victor Stinner fdfbf78114 Issue #25318: Add _PyBytesWriter API
Add a new private API to optimize Unicode encoders. It uses a small buffer
allocated on the stack and supports overallocation.

Use _PyBytesWriter API for UCS1 (ASCII and Latin1) and UTF-8 encoders. Enable
overallocation for the UTF-8 encoder with error handlers.

unicode_encode_ucs1(): initialize collend to collstart+1 to not check the
current character twice, we already know that it is not ASCII.
2015-10-09 00:33:49 +02:00
Victor Stinner 01ada3996b Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error
handlers: ``ignore``, ``replace``, ``surrogateescape``, ``surrogatepass``.
Patch co-written with Serhiy Storchaka.
2015-10-01 21:54:51 +02:00
Serhiy Storchaka 9ce71a6475 Fixed typos in comments. 2015-05-18 22:20:18 +03:00
Serhiy Storchaka 7e29eea926 Fixed typos in comments. 2015-05-18 22:19:42 +03:00
Serhiy Storchaka 0d4df752ac Issue #15027: The UTF-32 encoder is now 3x to 7x faster. 2015-05-12 23:12:45 +03:00
Serhiy Storchaka 3079328d29 Reverted changeset b72c5573c5e7 (issue #15027). 2014-01-04 22:44:01 +02:00
Serhiy Storchaka 583a93943c Issue #15027: Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster. 2014-01-04 19:25:37 +02:00
Serhiy Storchaka dc2fd5101a Remove dead code committed in issue #12892. 2013-11-19 15:56:05 +02:00
Serhiy Storchaka 58cf607d13 Issue #12892: The utf-16* and utf-32* codecs now reject (lone) surrogates.
The utf-16* and utf-32* encoders no longer allow surrogate code points
(U+D800-U+DFFF) to be encoded.
The utf-32* decoders no longer decode byte sequences that correspond to
surrogate code points.
The surrogatepass error handler now works with the utf-16* and utf-32* codecs.

Based on patches by Victor Stinner and Kang-Hao (Kenny) Lu.
2013-11-19 11:32:41 +02:00
Antoine Pitrou 9ed5f27266 Issue #18722: Remove uses of the "register" keyword in C code. 2013-08-13 20:18:52 +02:00
Victor Stinner 6caa6fb535 (Merge 3.3) Issue #8271: Fix compilation on Windows 2012-11-05 00:00:50 +01:00
Victor Stinner ab60de478d Issue #8271: Fix compilation on Windows 2012-11-04 23:59:15 +01:00
Ezio Melotti cfa9636404 #8271: merge with 3.3. 2012-11-04 23:23:09 +02:00
Ezio Melotti f7ed5d111b #8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti. 2012-11-04 23:21:38 +02:00
Christian Heimes 743e0cd6b5 Issue #16166: Add PY_LITTLE_ENDIAN and PY_BIG_ENDIAN macros and unified
endianess detection and handling.
2012-10-17 23:52:17 +02:00
Antoine Pitrou ca8aa4acf6 Issue #15144: Fix possible integer overflow when handling pointers as integer values, by using Py_uintptr_t instead of size_t.
Patch by Serhiy Storchaka.
2012-09-20 20:56:47 +02:00
Mark Dickinson 01ac8b6ab1 Use correct types for ASCII_CHAR_MASK integer constants. 2012-07-07 14:08:48 +02:00
Mark Dickinson 106c4145ff Issue #14923: Optimize continuation-byte check in UTF-8 decoding. Patch by Serhiy Storchaka. 2012-06-23 21:45:14 +01:00
Antoine Pitrou 27f6a3b0bf Issue #15026: utf-16 encoding is now significantly faster (up to 10x).
Patch by Serhiy Storchaka.
2012-06-15 22:15:23 +02:00
Antoine Pitrou 63065d761e Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs.
Patch by Serhiy Storchaka.
2012-05-15 23:48:04 +02:00
Antoine Pitrou ca5f91b888 Issue #14738: Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka. 2012-05-10 16:36:02 +02:00
Victor Stinner 6099a03202 Issue #13624: Write a specialized UTF-8 encoder to allow more optimization
The main bottleneck was the PyUnicode_READ() macro.
2011-12-18 14:22:26 +01:00
Antoine Pitrou 0a3229de6b Issue #13417: speed up utf-8 decoding by around 2x for the non-fully-ASCII case.
This almost catches up with pre-PEP 393 performance, when decoding needed
only one pass.
2011-11-21 20:39:13 +01:00