Commit Graph

1386 Commits

Author SHA1 Message Date
Victor Stinner 0016507c16 Issue #25318: Move _PyBytesWriter to bytesobject.c
Declare also the private API in bytesobject.h.
2015-10-09 01:53:21 +02:00
Victor Stinner e7bf86cd7d Optimize backslashreplace error handler
Issue #25318: Optimize backslashreplace and xmlcharrefreplace error handlers in
UTF-8 encoder. Optimize also backslashreplace error handler for ASCII and
Latin1 encoders.

Use the new _PyBytesWriter API to optimize these error handlers for the
encoders. It avoids to create an exception and call the slow implementation of
the error handler.
2015-10-09 01:39:28 +02:00
Victor Stinner fdfbf78114 Issue #25318: Add _PyBytesWriter API
Add a new private API to optimize Unicode encoders. It uses a small buffer
allocated on the stack and supports overallocation.

Use _PyBytesWriter API for UCS1 (ASCII and Latin1) and UTF-8 encoders. Enable
overallocation for the UTF-8 encoder with error handlers.

unicode_encode_ucs1(): initialize collend to collstart+1 to not check the
current character twice, we already know that it is not ASCII.
2015-10-09 00:33:49 +02:00
Victor Stinner 74e8fac3c8 Issue #25301: Fix compatibility with ISO C90 2015-10-05 13:49:26 +02:00
Victor Stinner 1d65d9192d Issue #25301: The UTF-8 decoder is now up to 15 times as fast for error
handlers: ``ignore``, ``replace`` and ``surrogateescape``.
2015-10-05 13:43:50 +02:00
Victor Stinner eb36fdaad8 Fix _PyUnicodeWriter_PrepareKind()
Initialize kind to 0 (PyUnicode_WCHAR_KIND) to ensure that
_PyUnicodeWriter_PrepareKind() handles correctly read-only buffer: copy the
buffer.
2015-10-03 01:55:51 +02:00
Serhiy Storchaka 29e68edbf4 Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
1. Non-ASCII bytes were accepted after shift sequence.
2. A low surrogate could be emitted in case of error in high surrogate.
3. In some circumstances the '\xfd' character was produced instead of the
replacement character '\ufffd' (due to a bug in _PyUnicodeWriter).
2015-10-02 13:14:03 +03:00
Serhiy Storchaka 58c8f2bb6d Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
1. Non-ASCII bytes were accepted after shift sequence.
2. A low surrogate could be emitted in case of error in high surrogate.
3. In some circumstances the '\xfd' character was produced instead of the
replacement character '\ufffd' (due to a bug in _PyUnicodeWriter).
2015-10-02 13:13:14 +03:00
Serhiy Storchaka 28b21e50c8 Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
1. Non-ASCII bytes were accepted after shift sequence.
2. A low surrogate could be emitted in case of error in high surrogate.
2015-10-02 13:07:28 +03:00
Victor Stinner 3222da26fe Make _PyUnicode_TranslateCharmap() symbol private
unicodeobject.h exposes PyUnicode_TranslateCharmap() and PyUnicode_Translate().
2015-10-01 22:07:32 +02:00
Victor Stinner 01ada3996b Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error
handlers: ``ignore``, ``replace``, ``surrogateescape``, ``surrogatepass``.
Patch co-written with Serhiy Storchaka.
2015-10-01 21:54:51 +02:00
Victor Stinner c3713e9706 Optimize ascii/latin1+surrogateescape encoders
Issue #25227: Optimize ASCII and latin1 encoders with the ``surrogateescape``
error handler: the encoders are now up to 3 times as fast.

Initial patch written by Serhiy Storchaka.
2015-09-29 12:32:13 +02:00
Victor Stinner 0030cd52da Issue #25227: Cleanup unicode_encode_ucs1() error handler
* Change limit type from unsigned int to Py_UCS4, to use the same type than the
  "ch" variable (an Unicode character).
* Reuse ch variable for _Py_ERROR_XMLCHARREFREPLACE
* Add some newlines for readability
2015-09-24 14:45:00 +02:00
Victor Stinner 54385b206d Issue #24870: revert unwanted change
Sorry, I pushed the patch on the UTF-8 decoder by mistake :-(
2015-09-22 10:46:52 +02:00
Victor Stinner 5ebae87628 Issue #25207, #14626: Fix my commit.
It doesn't work to use #define XXX defined(YYY)" and then "#ifdef XXX"
to check YYY.
2015-09-22 01:29:33 +02:00
Victor Stinner 6174474bea _PyUnicodeWriter_PrepareInternal(): make the assertion more strict 2015-09-22 01:01:17 +02:00
Victor Stinner ca9381ea01 Issue #24870: Add _PyUnicodeWriter_PrepareKind() macro
Add a macro which ensures that the writer has at least the requested kind.
2015-09-22 00:58:32 +02:00
Victor Stinner 5014920cb7 Issue #24870: Reuse the new _Py_error_handler enum
Factorize code with the new get_error_handler() function.

Add some empty lines for readability.
2015-09-22 00:26:54 +02:00
Victor Stinner f96418de05 Issue #24870: Optimize the ASCII decoder for error handlers: surrogateescape,
ignore and replace. Initial patch written by Naoki Inada.

The decoder is now up to 60 times as fast for these error handlers.

Add also unit tests for the ASCII decoder.
2015-09-21 23:06:27 +02:00
Zachary Ware 070bd62cfa Closes #21279: Merge with 3.5 2015-08-06 00:05:13 -05:00
Zachary Ware d987a81d29 Issue #21279: Merge with 3.4 2015-08-06 00:04:23 -05:00
Zachary Ware 79b98df023 Issue #21279: Flesh out str.translate docs
Initial patch by Kinga Farkas, Martin Panter, and John Posner.
2015-08-05 23:54:15 -05:00
Raymond Hettinger ac2ef65c32 Make the unicode equality test an external function rather than in-lining it.
The real benefit of the unicode specialized function comes from
bypassing the overhead of PyObject_RichCompareBool() and not
from being in-lined (especially since there was almost no shared
data between the caller and callee).  Also, the in-lining was
having a negative effect on code generation for the callee.
2015-07-04 16:04:44 -07:00
Serhiy Storchaka d4ea03c785 Issue #24284: The startswith and endswith methods of the str class no longer
return True when finding the empty string and the indexes are completely out
of range.
2015-05-31 09:15:51 +03:00
Antoine Pitrou 873e0df946 Fix some compilation warnings when using gcc (-Wmaybe-uninitialized). 2015-05-19 21:06:04 +02:00
Antoine Pitrou f6d1f1fa8a Fix some compilation warnings when using gcc (-Wmaybe-uninitialized). 2015-05-19 21:04:33 +02:00
Serhiy Storchaka 0d4df752ac Issue #15027: The UTF-32 encoder is now 3x to 7x faster. 2015-05-12 23:12:45 +03:00
Serhiy Storchaka 7e9d1d1a1b Issue #23908: os functions now reject paths with embedded null character
on Windows instead of silently truncate them.

Removed no longer used _PyUnicode_HasNULChars().
2015-04-20 10:12:28 +03:00
Serhiy Storchaka 1009bf18b3 Issue #23501: Argumen Clinic now generates code into separate files by default. 2015-04-03 23:53:51 +03:00
Victor Stinner 1912b39def _PyUnicodeWriter_WriteStr() now checks that the input string is consistent
in debug mode to detect bugs earlier.

_PyUnicodeWriter_Finish() doesn't check if the read only string is consistent,
whereas it does check consistency for strings built by itself.
2015-03-26 09:37:23 +01:00
Serhiy Storchaka d9d769fcdd Issue #23573: Increased performance of string search operations (str.find,
str.index, str.count, the in operator, str.split, str.partition) with
arguments of different kinds (UCS1, UCS2, UCS4).
2015-03-24 21:55:47 +02:00
Victor Stinner f50e187724 Fix compiler warnings: comparison between signed and unsigned numbers 2015-03-20 11:32:24 +01:00
Victor Stinner 0c39b1b970 Initialize variables to prevent GCC warnings 2015-03-18 15:02:06 +01:00
Benjamin Peterson e5a853c390 use PyMem_NEW to detect overflow (closes #23362) 2015-03-02 13:23:25 -05:00
Steve Dower 3e96f324dc Issue #23451: Update pyconfig.h for Windows to require Vista headers and remove unnecessary version checks. 2015-03-02 08:01:10 -08:00
Serhiy Storchaka 78a8249127 Issue #23490: Fixed possible crashes related to interoperability between
old-style and new API for string with 2**30-1 characters.
2015-02-20 21:34:39 +02:00
Serhiy Storchaka e55181f517 Issue #23490: Fixed possible crashes related to interoperability between
old-style and new API for string with 2**30-1 characters.
2015-02-20 21:34:06 +02:00
Serhiy Storchaka 4d0d982985 Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer
overflows.  Added few missed PyErr_NoMemory().
2015-02-16 13:33:32 +02:00
Serhiy Storchaka 1a1ff29659 Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer
overflows.  Added few missed PyErr_NoMemory().
2015-02-16 13:28:22 +02:00
Serhiy Storchaka 4dbc305002 Issue #23055: Fixed a buffer overflow in PyUnicode_FromFormatV. Analysis
and fix by Guido Vranken.
2015-01-27 22:18:46 +02:00
Victor Stinner 29dacf2e97 Issue #15859: PyUnicode_EncodeFSDefault(), PyUnicode_EncodeMBCS() and
PyUnicode_EncodeCodePage() now raise an exception if the object is not an
Unicode object. For PyUnicode_EncodeFSDefault(), it was already the case on
platforms other than Windows. Patch written by Campbell Barton.
2015-01-26 16:41:32 +01:00
Serhiy Storchaka bbd3aa8ece Issue #23321: Fixed a crash in str.decode() when error handler returned
replacment string longer than mailformed input data.
2015-01-26 01:24:31 +02:00
Serhiy Storchaka 7e4b9057b3 Issue #23321: Fixed a crash in str.decode() when error handler returned
replacment string longer than mailformed input data.
2015-01-26 01:22:54 +02:00
Ethan Furman b95b56150f Issue20284: Implement PEP461 2015-01-23 20:05:18 -08:00
Serhiy Storchaka 82e07b92b3 Issue #23181: More "codepoint" -> "code point". 2015-01-18 11:33:31 +02:00
Serhiy Storchaka d3faf43f9b Issue #23181: More "codepoint" -> "code point". 2015-01-18 11:28:37 +02:00
Serhiy Storchaka b757c83ec6 Issue #22581: Use more "bytes-like object" throughout the docs and comments. 2014-12-05 22:25:22 +02:00
Serhiy Storchaka 133b11b566 Issue #22975: Close block at right place. 2014-12-01 18:56:28 +02:00
Serhiy Storchaka 92bf919ed0 Issue #22581: Use more "bytes-like object" throughout the docs and comments. 2014-12-05 22:26:10 +02:00
Serhiy Storchaka 407249c62b Issue #22975: Close block at right place. 2014-12-01 18:56:54 +02:00
Victor Stinner 3aa979e0cd Issue #20948: Inline makefmt() in unicode_fromformat_arg() 2014-11-18 21:40:51 +01:00
Antoine Pitrou b6dc9b7554 Fixed signed/unsigned comparison warning 2014-10-15 23:14:53 +02:00
Antoine Pitrou 4e334241b7 Fixed signed/unsigned comparison warning 2014-10-15 23:14:53 +02:00
Benjamin Peterson 736982d36d merge 3.4 (closes #22643) 2014-10-15 12:17:47 -04:00
Benjamin Peterson 9c422f3c3d merge 3.3 2014-10-15 12:17:33 -04:00
Benjamin Peterson 1e211ff10d it suffices to check for PY_SSIZE_T_MAX overflow (#22643) 2014-10-15 12:17:21 -04:00
Benjamin Peterson 315aa40403 Merge 3.4 2014-10-15 11:51:17 -04:00
Benjamin Peterson 60d7a73194 Merge 3.3 2014-10-15 11:51:12 -04:00
Benjamin Peterson c0e64f5027 make sure length is unsigned 2014-10-15 11:51:05 -04:00
Benjamin Peterson 6925264334 merge 3.4 (#22643) 2014-10-15 11:49:15 -04:00
Benjamin Peterson 1cbb3fe775 merge 3.3 (#22643) 2014-10-15 11:48:41 -04:00
Benjamin Peterson e1bd38c03c fix integer overflow in unicode case operations (closes #22643) 2014-10-15 11:47:36 -04:00
Gregory P. Smith 8486f9b134 Fix "warning: comparison between signed and unsigned integer expressions"
-Wsign-compare warnings in unicodeobject.c.  These were all a result
of sizeof() being unsigned and being compared to a Py_ssize_t.
Not actual problems.
2014-09-30 00:33:24 -07:00
Benjamin Peterson fd97a6fb2d merge 3.4 (#22520) 2014-09-29 23:02:56 -04:00
Benjamin Peterson 43030ee780 merge 3.3 (#22520) 2014-09-29 23:02:35 -04:00
Benjamin Peterson 736b8012b4 prevent overflow in unicode_repr (closes #22520) 2014-09-29 23:02:15 -04:00
Benjamin Peterson 10e4b2545e merge 3.4 (closes #22518) 2014-09-29 18:53:58 -04:00
Benjamin Peterson 2b76ce6d27 merge 3.3 (closes #22518) 2014-09-29 18:50:06 -04:00
Benjamin Peterson a1c1be4e03 cleanup overflowing handling in unicode_decode_call_errorhandler and unicode_encode_ucs1 (closes #22518) 2014-09-29 18:18:57 -04:00
Serhiy Storchaka 20b39b27d9 Removed redundant casts to `char *`.
Corresponding functions now accept `const char *` (issue #1772673).
2014-09-28 11:27:24 +03:00
Benjamin Peterson fa5021699a Merge 3.3 2014-10-15 23:58:32 -04:00
Serhiy Storchaka d8a1447c99 Issue #22215: Now ValueError is raised instead of TypeError when str or bytes
argument contains not permitted null character or byte.
2014-09-06 20:07:17 +03:00
Victor Stinner 12174a5dca Issue #22156: Fix "comparison between signed and unsigned integers" compiler
warnings in the Objects/ subdirectory.

PyType_FromSpecWithBases() and PyType_FromSpec() now reject explicitly negative
slot identifiers.
2014-08-15 23:17:38 +02:00
Victor Stinner f6a271ae98 Issue #18395: Rename ``_Py_char2wchar()`` to :c:func:`Py_DecodeLocale`, rename
``_Py_wchar2char()`` to :c:func:`Py_EncodeLocale`, and document these
functions.
2014-08-01 12:28:48 +02:00
Victor Stinner e1f17c6c0b unicodeobject.c: fix a compiler warning on Windows 64 bits 2014-07-25 14:03:03 +02:00
Victor Stinner c68b7fba86 (Merge 3.4) Issue #21892, #21893: Partial revert of changeset 4f55e802baf0,
PyErr_Format() uses "%zd" for Py_ssize_t, not PY_FORMAT_SIZE_T
2014-07-04 22:50:13 +02:00
Victor Stinner a33bce0945 Issue #21892, #21893: Partial revert of changeset 4f55e802baf0, PyErr_Format()
uses "%zd" for Py_ssize_t, not PY_FORMAT_SIZE_T
2014-07-04 22:47:46 +02:00
Victor Stinner 9f43505f3d (Merge 3.4) Closes #21892, #21893: Use PY_FORMAT_SIZE_T instead of %zi or %zu
to format C size_t, because %zi/%u is not supported on all platforms.
2014-07-01 08:57:54 +02:00
Victor Stinner 293f3f526d Closes #21892, #21893: Use PY_FORMAT_SIZE_T instead of %zi or %zu to format C
size_t, because %zi/%u is not supported on all platforms.
2014-07-01 08:57:10 +02:00
Serhiy Storchaka 48070c1248 Issue #23803: Fixed str.partition() and str.rpartition() when a separator
is wider then partitioned string.
2015-03-29 19:21:02 +03:00
Benjamin Peterson 92ce1b4392 merge 3.3 (#23362) 2015-03-02 13:23:41 -05:00
Victor Stinner 4dd25256e2 Issue #21118: PyLong_AS_LONG() result type is long
Even if PyLong_AS_LONG() cannot fail, I prefer to use the right type.
2014-04-08 09:14:21 +02:00
Benjamin Peterson 1365de764e fix reference leaks in the translate fast path (closes #21175)
Patch by Josh Rosenberg.
2014-04-07 20:15:41 -04:00
Victor Stinner 872b291b96 Issue #21118: Optimize also str.translate() for ASCII => ASCII deletion 2014-04-05 14:27:07 +02:00
Victor Stinner 4ff33af257 Issue #21118: Add unit test for invalid character replacement (code point higher than U+10ffff) 2014-04-05 11:56:37 +02:00
Victor Stinner 89a76abf20 Issue #21118: Optimize str.translate() for ASCII => ASCII translation 2014-04-05 11:44:04 +02:00
Victor Stinner 8a4422e78d Issue #21118: Remove unused variable 2014-04-05 00:15:52 +02:00
Victor Stinner 1194ea020c Issue #21118: Use _PyUnicodeWriter API in str.translate() to simplify and
factorize the code
2014-04-04 19:37:40 +02:00
Ethan Furman 9ab748013b Issue19995: more informative error message; spelling corrections; use operator.mod instead of __mod__ 2014-03-21 06:38:46 -07:00
Ethan Furman 38d872ee5d Issue19995: passing a non-int to %o, %c, %x, or %X now raises an exception 2014-03-19 08:38:52 -07:00
Victor Stinner 7d00cc1a64 Issue #20574: Implement incremental decoder for cp65001 code
(Windows code page 65001, Microsoft UTF-8).
2014-03-17 23:08:06 +01:00
Kristján Valur Jónsson 25dded041f Make the various iterators' "setstate" sliently and consistently clip the
index.  This avoids the possibility of setting an iterator to an invalid
state.
2014-03-05 13:47:57 +00:00
Kristján Valur Jónsson c5cc5011ac Make the various iterators' "setstate" sliently and consistently clip the
index.  This avoids the possibility of setting an iterator to an invalid
state.
2014-03-05 15:23:07 +00:00
Serhiy Storchaka 94ee389308 Issue #19619: Blacklist non-text codecs in method API
str.encode, bytes.decode and bytearray.decode now use an
internal API to throw LookupError for known non-text encodings,
rather than attempting the encoding or decoding operation and
then throwing a TypeError for an unexpected output type.

The latter mechanism remains in place for third party non-text
encodings.

Backported changeset d68df99d7a57.
2014-02-24 14:43:03 +02:00
Benjamin Peterson 4267869ad8 merge 3.3 (#20507) 2014-02-15 13:03:20 -05:00
Benjamin Peterson 9743b2c2b5 give non-iterable TypeError a message (closes #20507) 2014-02-15 13:02:52 -05:00
Serhiy Storchaka dfe98a102e Issue #20437: Fixed 22 potential bugs when deleting objects references. 2014-02-09 13:46:20 +02:00
Serhiy Storchaka 505ff755d7 Issue #20437: Fixed 21 potential bugs when deleting objects references. 2014-02-09 13:33:53 +02:00
Larry Hastings 2623c8c23c Issue #20530: Argument Clinic's signature format has been revised again.
The new syntax is highly human readable while still preventing false
positives.  The syntax also extends Python syntax to denote "self" and
positional-only parameters, allowing inspect.Signature objects to be
totally accurate for all supported builtins in Python 3.4.
2014-02-08 22:15:29 -08:00
Serhiy Storchaka 6cbf151032 Issue #20538: UTF-7 incremental decoder produced inconsistant string when
input was truncated in BASE64 section.
2014-02-08 14:06:33 +02:00
Serhiy Storchaka 016a3f33a5 Issue #20538: UTF-7 incremental decoder produced inconsistant string when
input was truncated in BASE64 section.
2014-02-08 14:01:29 +02:00
Larry Hastings 581ee3618c Issue #20326: Argument Clinic now uses a simple, unique signature to
annotate text signatures in docstrings, resulting in fewer false
positives.  "self" parameters are also explicitly marked, allowing
inspect.Signature() to authoritatively detect (and skip) said parameters.

Issue #20326: Argument Clinic now generates separate checksums for the
input and output sections of the block, allowing external tools to verify
that the input has not changed (and thus the output is not out-of-date).
2014-01-28 05:00:08 -08:00
Larry Hastings c20472640c Issue #20390: Small fixes and improvements for Argument Clinic. 2014-01-25 20:43:29 -08:00
Larry Hastings 5c66189e88 Issue #20189: Four additional builtin types (PyTypeObject,
PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type)
have been modified to provide introspection information for builtins.
Also: many additional Lib, test suite, and Argument Clinic fixes.
2014-01-24 06:17:25 -08:00
Ethan Furman a70805e1fa Issue19995: fixed typo; switched from test.support.check_warnings to assertWarns 2014-01-12 08:42:35 -08:00
Ethan Furman f9bba9c67f Issue19995: issue deprecation warning for non-integer values to %c, %o, %x, %X 2014-01-11 23:20:58 -08:00
Larry Hastings 61272b77b0 Issue #19273: The marker comments Argument Clinic uses have been changed
to improve readability.
2014-01-07 12:41:53 -08:00
Ethan Furman df3ed242c0 Issue19995: %o, %x, %X now only accept ints 2014-01-05 06:50:30 -08:00
Serhiy Storchaka 3079328d29 Reverted changeset b72c5573c5e7 (issue #15027). 2014-01-04 22:44:01 +02:00
Serhiy Storchaka 583a93943c Issue #15027: Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster. 2014-01-04 19:25:37 +02:00
Victor Stinner fa4e68d425 Remove deadcode (HASH macro is no more defined) 2014-01-03 17:42:18 +01:00
Victor Stinner 92a419eea4 Remove now unused variables 2014-01-03 17:39:40 +01:00
Victor Stinner f3b46b4a66 unicode_char() uses get_latin1_char() to get latin1 singleton characters 2014-01-03 13:16:00 +01:00
Victor Stinner 985a82a6d2 add unicode_char() in unicodeobject.c to factorize code 2014-01-03 12:53:47 +01:00
Larry Hastings 44e2eaab54 Issue #19674: inspect.signature() now produces a correct signature
for some builtins.
2013-11-23 15:37:55 -08:00
Larry Hastings ebdcb50b8a Issue #19730: Argument Clinic now supports all the existing PyArg
"format units" as legacy converters, as well as two new features:
"self converters" and the "version" directive.
2013-11-23 14:54:00 -08:00
Nick Coghlan c72e4e6dcc Issue #19619: Blacklist non-text codecs in method API
str.encode, bytes.decode and bytearray.decode now use an
internal API to throw LookupError for known non-text encodings,
rather than attempting the encoding or decoding operation and
then throwing a TypeError for an unexpected output type.

The latter mechanism remains in place for third party non-text
encodings.
2013-11-22 22:39:36 +10:00
Christian Heimes 985ecdcfc2 ssue #19183: Implement PEP 456 'secure and interchangeable hash algorithm'.
Python now uses SipHash24 on all major platforms.
2013-11-20 11:46:18 +01:00
Victor Stinner 4a58707a34 Add _PyUnicodeWriter_WriteASCIIString() function 2013-11-19 12:54:53 +01:00
Serhiy Storchaka 58cf607d13 Issue #12892: The utf-16* and utf-32* codecs now reject (lone) surrogates.
The utf-16* and utf-32* encoders no longer allow surrogate code points
(U+D800-U+DFFF) to be encoded.
The utf-32* decoders no longer decode byte sequences that correspond to
surrogate code points.
The surrogatepass error handler now works with the utf-16* and utf-32* codecs.

Based on patches by Victor Stinner and Kang-Hao (Kenny) Lu.
2013-11-19 11:32:41 +02:00
Victor Stinner 6989ba0174 Issue #19581: Change the overallocation factor of _PyUnicodeWriter on Windows
On Windows, a factor of 50% gives best performances.
2013-11-18 21:08:39 +01:00
Larry Hastings ed4a1c5703 Argument Clinic: rename "self" to "module" for module-level functions. 2013-11-18 09:32:13 -08:00
Ezio Melotti 745d54d2fa #17806: Added keyword-argument support for "tabsize" to str/bytes.expandtabs(). 2013-11-16 19:10:57 +02:00
Nick Coghlan 8b097b4ed7 Close #17828: better handling of codec errors
- output type errors now redirect users to the type-neutral
  convenience functions in the codecs module
- stateless errors that occur during encoding and decoding
  will now be automatically wrapped in exceptions that give
  the name of the codec involved
2013-11-13 23:49:21 +10:00
Victor Stinner 66b3270975 _Py_normalize_encoding(): explain how the value 6 was computed 2013-11-07 23:12:23 +01:00
Victor Stinner df23e30bea Fix _Py_normalize_encoding(): ensure that buffer is big enough to store "utf-8"
if the input string is NULL
2013-11-07 13:33:36 +01:00
Victor Stinner ad14ccd047 Issue #19512: add _PyUnicode_CompareWithId() function
_PyUnicode_CompareWithId() is faster than PyUnicode_CompareWithASCIIString()
when both strings are equal and interned.

Add also _PyId_builtins identifier for "builtins" common string.
2013-11-07 00:46:04 +01:00
Victor Stinner 21ea21ef6d Issue #19424: PyUnicode_CompareWithASCIIString() normalizes memcmp() result
to -1, 0, 1
2013-11-04 11:28:26 +01:00
Victor Stinner f0c7b2af05 Issue #16286: remove duplicated identity check from unicode_compare()
Move the test to PyUnicode_Compare()
2013-11-04 11:27:14 +01:00
Victor Stinner fd9e44db37 Issue #16286: optimize PyUnicode_RichCompare() for identical strings (same
pointer) for any operator, not only Py_EQ and Py_NE.

Code of bytes_richcompare() and PyUnicode_RichCompare() is now closer.
2013-11-04 11:23:05 +01:00
Victor Stinner c8bc5377ac Issue #16286: write a new subfunction bytes_compare_eq()
* cleanup bytes_richcompare()
* PyUnicode_RichCompare(): replace a test with a XOR
2013-11-04 11:08:10 +01:00
Victor Stinner e1b1592fd4 Issue #19424: Fix a compiler warning on comparing signed/unsigned size_t
Patch written by Zachary Ware.
2013-11-03 13:53:12 +01:00
Victor Stinner a6b9b071a3 Issue #19424: Fix a compiler warning
memcmp() just takes raw pointers
2013-10-30 18:27:13 +01:00
Victor Stinner 602f7cf0b9 Issue #19424: Optimize PyUnicode_CompareWithASCIIString()
Use fast memcmp() instead of a loop using the slow PyUnicode_READ() macro.
strlen() is still necessary to check Unicode string containing null bytes.
2013-10-29 23:31:50 +01:00
Victor Stinner 68b674c9d4 Issue #19437: Fix _PyUnicode_New() (constructor of legacy string), set all
attributes before checking for error. The destructor expects all attributes to
be set. It is now safe to call Py_DECREF(unicode) in the constructor.
2013-10-29 19:31:43 +01:00
Victor Stinner fa3ba4c3bc Issue #18609: Add a fast-path for "iso8859-1" encoding
On AIX, the locale encoding may be "iso8859-1", which was not a known syntax of
the legacy ISO 8859-1 encoding.

Using a C codec instead of a Python codec is faster but also avoids tricky
issues during Python startup or complex code.
2013-10-29 11:34:05 +01:00
Victor Stinner a5afb58986 Issue #18408: Fix PyUnicode_AsUTF8AndSize(), raise MemoryError exception on
memory allocation failure
2013-10-29 01:28:23 +01:00
Serhiy Storchaka c679227e31 Issue #1772673: The type of `char*` arguments now changed to `const char*`. 2013-10-19 21:03:34 +03:00
Serhiy Storchaka 55e092f545 Issue #19279: UTF-7 decoder no more produces illegal strings. 2013-10-19 20:39:28 +03:00
Serhiy Storchaka 35804e4c63 Issue #19279: UTF-7 decoder no more produces illegal strings. 2013-10-19 20:38:19 +03:00
Larry Hastings 3182680210 Issue #16612: Add "Argument Clinic", a compile-time preprocessor
for C files to generate argument parsing code.  (See PEP 436.)
2013-10-19 00:09:25 -07:00
Ethan Furman fb13721b1b Close #18780: %-formatting now prints value for int subclasses with %d, %i, and %u codes. 2013-08-31 10:18:55 -07:00
Antoine Pitrou 9ed5f27266 Issue #18722: Remove uses of the "register" keyword in C code. 2013-08-13 20:18:52 +02:00
Raymond Hettinger e56666d17f Silence compiler warning about an uninitialized variable 2013-08-04 11:51:03 -07:00
Raymond Hettinger 5ed1b38a7d merge 2013-08-04 11:51:35 -07:00
Christian Heimes b578735dff Check return value of PyType_Ready(&EncodingMapType)
CID 486654
2013-07-20 14:57:28 +02:00
Christian Heimes 26532f7519 Check return value of PyType_Ready(&EncodingMapType)
CID 486654
2013-07-20 14:57:16 +02:00
Victor Stinner e699e5a218 Issue #18408: Don't check unicode consistency in _PyUnicode_HAS_UTF8_MEMORY()
and _PyUnicode_HAS_WSTR_MEMORY() macros

These macros are called in unicode_dealloc(), whereas the unicode object can be
"inconsistent" if the creation of the object failed.

For example, when unicode_subtype_new() fails on a memory allocation,
_PyUnicode_CheckConsistency() fails with an assertion error because data is
NULL.
2013-07-15 18:22:47 +02:00
Victor Stinner 9e6b4d715c Issue #18408: _PyUnicodeWriter_Finish() now clears its buffer attribute in all
cases, so _PyUnicodeWriter_Dealloc() can be called after finish.
2013-07-09 00:37:24 +02:00
Victor Stinner 15a0bd3965 Issue #18408: Fix _PyUnicodeWriter_Finish(): clear writer->buffer,
so _PyUnicodeWriter_Dealloc() can be called on the writer after finish.
2013-07-08 22:29:55 +02:00
Victor Stinner 6f8eeee7b9 Issue #18203: Fix _Py_DecodeUTF8_surrogateescape(), use PyMem_RawMalloc() as _Py_char2wchar() 2013-07-07 22:57:45 +02:00
Victor Stinner 1a7425f67a Issue #18203: Replace malloc() with PyMem_RawMalloc() at Python initialization
* Replace malloc() with PyMem_RawMalloc()
* Replace PyMem_Malloc() with PyMem_RawMalloc() where the GIL is not held.
* _Py_char2wchar() now returns a buffer allocated by PyMem_RawMalloc(), instead
  of PyMem_Malloc()
2013-07-07 16:25:15 +02:00
Christian Heimes d47802eef7 Fix ref leak in error case of unicode find, count, formatlong
CID 983315: Resource leak (RESOURCE_LEAK)
CID 983316: Resource leak (RESOURCE_LEAK)
CID 983317: Resource leak (RESOURCE_LEAK)
2013-06-29 21:33:36 +02:00
Christian Heimes d47a0456b1 Fix ref leak in error case of unicode index
CID 983319 (#1 of 2): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable substring going out of scope leaks the storage it points to.
2013-06-29 21:21:37 +02:00
Christian Heimes ea71a525c3 Fix ref leak in error case of unicode rindex and rfind
CID 983320: Resource leak (RESOURCE_LEAK)
CID 983321: Resource leak (RESOURCE_LEAK)
leaked_storage: Variable substring going out of scope leaks the storage it points to.
2013-06-29 21:17:34 +02:00
Christian Heimes 305e49e17e Fix memory leak in endswith
CID 1040368 (#1 of 1): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable substring going out of scope leaks the storage it points to.
2013-06-29 20:41:06 +02:00
Serhiy Storchaka c89533f72f Issue #18184: PyUnicode_FromFormat() and PyUnicode_FromFormatV() now raise
OverflowError when an argument of %c format is out of range.
2013-06-23 20:21:16 +03:00
Serhiy Storchaka 8eeae2126c Issue #18184: PyUnicode_FromFormat() and PyUnicode_FromFormatV() now raise
OverflowError when an argument of %c format is out of range.
2013-06-23 20:12:14 +03:00
Benjamin Peterson 3164f5d565 merge 3.3 (#18183) 2013-06-10 09:24:01 -07:00
Benjamin Peterson 7e30373126 remove MAX_MAXCHAR because it's unsafe for computing maximum codepoitn value (see #18183) 2013-06-10 09:19:46 -07:00
Victor Stinner 9f067f490f Issue #9566: Fix compiler warning on Windows 64-bit 2013-06-05 00:21:31 +02:00
Antoine Pitrou 7ce35a1816 Issue #17237: Fix crash in the ASCII decoder on m68k. 2013-05-11 15:59:37 +02:00
Antoine Pitrou 8b0e98426d Issue #17237: Fix crash in the ASCII decoder on m68k. 2013-05-11 15:58:34 +02:00
Victor Stinner f4f24248dc Fix uninitialized value in charmap_decode_mapping() 2013-05-07 01:01:31 +02:00
Victor Stinner 8cecc8c262 Issue #7330: Implement width and precision (ex: "%5.3s") for the format string
of PyUnicode_FromFormat() function, original patch written by Ysj Ray.
2013-05-06 23:11:54 +02:00
Victor Stinner bb4503f61e Partial revert of changeset 9744b2df134c
PyUnicode_Append() cannot call directly resize_compact(): I forgot that a
string can be ready *and* not compact (a legacy string can also be ready).
2013-04-18 09:41:34 +02:00
Victor Stinner fb161b1b6d Split PyUnicode_DecodeCharmap() into subfunction for readability 2013-04-18 01:44:27 +02:00
Victor Stinner 170ca6f84b Fix bug in Unicode decoders related to _PyUnicodeWriter
Bug introduced by changesets 7ed9993d53b4 and edf029fc9591.
2013-04-18 00:25:28 +02:00
Victor Stinner 376cfa122d Fix typo in unicode_decode_call_errorhandler_writer()
Bug introduced by changeset 7ed9993d53b4.
2013-04-17 23:58:16 +02:00
Victor Stinner 8f674ccd64 Close #17694: Add minimum length to _PyUnicodeWriter
* Add also min_char attribute to _PyUnicodeWriter structure (currently unused)
 * _PyUnicodeWriter_Init() has no more argument (except the writer itself):
   min_length and overallocate must be set explicitly
 * In error handlers, only enable overallocation if the replacement string
   is longer than 1 character
 * CJK decoders don't use overallocation anymore
 * Set min_length, instead of preallocating memory using
   _PyUnicodeWriter_Prepare(), in many decoders
 * _PyUnicode_DecodeUnicodeInternal() checks for integer overflow
2013-04-17 23:02:17 +02:00
Victor Stinner 77282cb4f8 Cleanup PyUnicode_Contains()
* No need to double-check that strings are ready: test already done by
   PyUnicode_FromObject()
 * Remove useless kind variable (use kind1 instead)
2013-04-14 19:22:47 +02:00
Victor Stinner d92e078c8d Minor change: fix character in do_strip() for the ASCII case 2013-04-14 19:17:42 +02:00
Victor Stinner f033510fee Cleanup PyUnicode_Append()
* Check also that right is a Unicode object
 * call directly resize_compact() instead of unicode_resize() for a more
   explicit error handling, and to avoid testing some properties twice
   (ex: unicode_modifiable())
2013-04-14 19:13:03 +02:00
Victor Stinner 4560f9c63f PyUnicode_Join(): move use_memcpy test out of the loop to cleanup and optimize the code 2013-04-14 18:56:46 +02:00
Victor Stinner 55c08781e8 Optimize repr(str): use _PyUnicode_FastCopyCharacters() when no character is escaped 2013-04-14 18:45:39 +02:00
Victor Stinner af03757d20 Optimize ascii(str): don't encode/decode repr if repr is already ASCII 2013-04-14 18:44:10 +02:00
Victor Stinner 8a1a6cffd6 Add _PyUnicodeWriter_WriteCharInline() 2013-04-14 02:35:33 +02:00
Serhiy Storchaka e2cef885a2 Issue #16061: Speed up str.replace() for replacing 1-character strings. 2013-04-13 22:45:04 +03:00
Victor Stinner a0dd0213cc Close #17693: Rewrite CJK decoders to use the _PyUnicodeWriter API instead of
the legacy Py_UNICODE API.

Add also a new _PyUnicodeWriter_WriteChar() function.
2013-04-11 22:09:04 +02:00
Victor Stinner 247109e74d Issue #17615: On Windows (VS2010), Performances of wmemcmp() to compare Unicode
strings are not convincing. For UCS2 (16-bit wchar_t type), use a dummy loop
instead of wmemcmp(). The dummy loop is as fast, or a little bit faster.

wchar_t is only 16-bit long on Windows. wmemcmp() is still used for 32-bit
wchar_t.
2013-04-09 23:53:26 +02:00
Victor Stinner 0cff4b16d9 replace(): only call PyUnicode_DATA(u) once 2013-04-09 22:52:48 +02:00
Victor Stinner cc7af72192 Write super-fast version of str.strip(), str.lstrip() and str.rstrip() for pure ASCII 2013-04-09 22:39:24 +02:00
Victor Stinner f50a4e9bc9 Don't calls macros in PyUnicode_WRITE() parameters
PyUnicode_WRITE() expands some parameters twice or more.
2013-04-09 22:38:52 +02:00
Victor Stinner 9c79e41fc5 Fix do_strip(): don't call PyUnicode_READ() in Py_UNICODE_ISSPACE() to not call
it twice
2013-04-09 22:21:08 +02:00
Victor Stinner b3a6014504 Fix _PyUnicode_XStrip()
Inline the BLOOM_MEMBER() to only call PyUnicode_READ() only once (per loop
iteration). Store also the length of the seperator in a variable to avoid calls
to PyUnicode_GET_LENGTH().
2013-04-09 22:19:21 +02:00
Victor Stinner 63d5c1a14a Optimize PyUnicode_DecodeCharmap()
Avoid expensive PyUnicode_READ() and PyUnicode_WRITE(), manipulate pointers
instead.
2013-04-09 22:13:33 +02:00
Victor Stinner a85af502a4 Optimize make_bloom_mask(), used by str.strip(), str.lstrip() and str.rstrip()
Write specialized functions per Unicode kind to avoid the expensive
PyUnicode_READ() macro.
2013-04-09 21:53:54 +02:00
Victor Stinner 69ed0f4c86 Use PyUnicode_READ() instead of PyUnicode_READ_CHAR()
"PyUnicode_READ_CHAR() is less efficient than PyUnicode_READ() because it calls
PyUnicode_KIND() and might call it twice." according to its documentation.
2013-04-09 21:48:24 +02:00
Victor Stinner 03c3e35d42 Add fast-path in PyUnicode_DecodeCharmap() for pure 8 bit encodings:
cp037, cp500 and iso8859_1 codecs
2013-04-09 21:53:09 +02:00
Victor Stinner cd777eaf53 Issue #17615: Comparing two Unicode strings now uses wmemcmp() when possible
wmemcmp() is twice faster than a dummy loop (342 usec vs 744 usec) on Fedora
18/x86_64, GCC 4.7.2.
2013-04-08 22:43:44 +02:00
Victor Stinner c1302bba4c Issue #17615: Expand expensive PyUnicode_READ() macro in unicode_compare():
write specialized functions for each combination of Unicode kinds.
2013-04-08 21:50:54 +02:00
Victor Stinner 207dd38726 fix unused variable 2013-04-03 03:14:58 +02:00
Victor Stinner eb4b5ac8af Close #16757: Avoid calling the expensive _PyUnicode_FindMaxChar() function
when possible
2013-04-03 02:02:33 +02:00
Victor Stinner cfc4c13b04 Add _PyUnicodeWriter_WriteSubstring() function
Write a function to enable more optimizations:

 * If the substring is the whole string and overallocation is disabled, just
   keep a reference to the string, don't copy characters
 * Avoid a call to the expensive _PyUnicode_FindMaxChar() function when
   possible
2013-04-03 01:48:39 +02:00
Raymond Hettinger 51612fd803 merge 2013-03-23 08:21:52 -07:00
Raymond Hettinger 378170d5d9 Issue 17447: Clarify that str.isidentifier doesn't check for reserved keywords. 2013-03-23 08:21:12 -07:00
Victor Stinner fb84b5d48d (Merge 3.3) _PyUnicode_Writer() now also reuses Unicode singletons:
empty string and latin1 single character
2013-03-06 19:29:09 +01:00
Victor Stinner 2cb16aa3cb _PyUnicode_Writer() now also reuses Unicode singletons:
empty string and latin1 single character
2013-03-06 19:28:37 +01:00
Victor Stinner cf77da9fb5 Backed out changeset b9f7b1bf36aa 2013-03-06 01:09:24 +01:00
Victor Stinner 313cac88c5 Issue #17223: Fix PyUnicode_FromUnicode() on Windows (16-bit wchar_t type)
to reject invalid UTF-16 surrogate.
2013-03-06 00:41:50 +01:00