Commit Graph

1188 Commits

Author SHA1 Message Date
Victor Stinner fc009eff9e Close #16311: Use the _PyUnicodeWriter API in text decoders
* Remove unicode_widen(): replaced with _PyUnicodeWriter_Prepare()
 * Remove unicode_putchar(): replaced with
   PyUnicodeWriter_Prepare() + PyUnicode_WRITER()
 * When handling an decoding error, only overallocate the buffer by +25%
   instead of +100%
2012-11-07 00:36:38 +01:00
Ezio Melotti cfa9636404 #8271: merge with 3.3. 2012-11-04 23:23:09 +02:00
Ezio Melotti f7ed5d111b #8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti. 2012-11-04 23:21:38 +02:00
Benjamin Peterson 7ff2094bc7 merge 3.3 (#16369) 2012-10-30 23:31:12 -04:00
Benjamin Peterson e8ea97fffb merge 3.2 (#16369) 2012-10-30 23:27:52 -04:00
Benjamin Peterson c43112823b initialize more global type objects (closes #16369) 2012-10-30 23:21:10 -04:00
Victor Stinner e64322e034 Close #14625: Rewrite the UTF-32 decoder. It is now 3x to 4x faster
Patch written by Serhiy Storchaka.
2012-10-30 23:12:47 +01:00
Victor Stinner 76df43de30 Issue #16330: Use surrogate-related macros
Patch written by Serhiy Storchaka.
2012-10-30 01:42:39 +01:00
Mark Dickinson fb90c0934c Issue #14700: Fix buggy overflow checks for large precision and width in new-style and old-style formatting. 2012-10-28 10:18:03 +00:00
Victor Stinner c6cf1ba29e Replace usage of the deprecated Py_UNICODE_COPY() with Py_MEMCPY() in resize_copy() 2012-10-23 02:54:47 +02:00
Victor Stinner fe75fb4b3e Optimize _PyUnicode_HasNULChars(): use findchar() instead of PyUnicode_Contains() 2012-10-23 02:52:18 +02:00
Victor Stinner 6fa627578a Inline raise_translate_exception(): it is only used once 2012-10-23 02:51:50 +02:00
Victor Stinner e5567ad236 Optimize PyUnicode_RichCompare() for Py_EQ and Py_NE: always use memcmp() 2012-10-23 02:48:49 +02:00
Christian Heimes 743e0cd6b5 Issue #16166: Add PY_LITTLE_ENDIAN and PY_BIG_ENDIAN macros and unified
endianess detection and handling.
2012-10-17 23:52:17 +02:00
Chris Jerdonek 4a7df9aba9 Issue #14783: Merge changes from 3.3. 2012-10-07 15:02:16 -07:00
Chris Jerdonek 042fa653ab Issue #14783: Merge changes from 3.2. 2012-10-07 14:56:27 -07:00
Chris Jerdonek 83fe2e1c22 Issue #14783: Improve int() docstring and also str(), range(), and slice().
This commit rewrites the docstring for int() to incorporate the documentation
changes made in issue #16036.  It also switches the docstrings for int(),
str(), range(), and slice() to use multi-line signatures.
2012-10-07 14:48:36 -07:00
Victor Stinner 4c63a972d1 Cleanup PyUnicode_FromFormatV() for zero padding
Skip the "0" instead of parsing it twice: detect zero padding and then parsed
as a digit of the width.
2012-10-06 23:55:33 +02:00
Victor Stinner 15a1136547 Issue #16147: PyUnicode_FromFormatV() doesn't need anymore to allocate a buffer
on the heap to format numbers.
2012-10-06 23:48:20 +02:00
Victor Stinner ff5a848db5 Issue #16147: PyUnicode_FromFormatV() now raises an error if the argument of
'%c' is not in the range(0x110000).
2012-10-06 23:05:45 +02:00
Victor Stinner 3921e90c5a Issue #16147: PyUnicode_FromFormatV() now detects integer overflow when parsing
width and precision
2012-10-06 23:05:00 +02:00
Victor Stinner e215d960be Issue #16147: Rewrite PyUnicode_FromFormatV() to use _PyUnicodeWriter API
* Simplify the code: replace 4 steps with one unique step using the
   _PyUnicodeWriter API. PyUnicode_Format() has the same design. It avoids to
   store intermediate results which require to allocate an array of pointers on
   the heap.
 * Use the _PyUnicodeWriter API for speed (and its convinient API):
   overallocate the buffer to reduce the number of "realloc()"
 * Implement "width" and "precision" in Python, don't rely on sprintf(). It
   avoids to need of a temporary buffer allocated on the heap: only use a small
   buffer allocated in the stack.
 * Add _PyUnicodeWriter_WriteCstr() function
 * Split PyUnicode_FromFormatV() into two functions: add
   unicode_fromformat_arg().
 * Inline parse_format_flags(): the format of an argument is now only parsed
   once, it's no more needed to have a subfunction.
 * Optimize PyUnicode_FromFormatV() for characters between two "%" arguments:
   search the next "%" and copy the substring in one chunk, instead of copying
   character per character.
2012-10-06 23:03:36 +02:00
Mark Dickinson ff9c54aca2 Issue #16096: Merge fixes from 3.3. 2012-10-06 18:05:14 +01:00
Mark Dickinson c04ddff290 Issue #16096: Fix several occurrences of potential signed integer overflow. Thanks Serhiy Storchaka. 2012-10-06 18:04:49 +01:00
Victor Stinner 8c6db45d3e In debug mode, unicode_write_cstr() now checks that non-ASCII characters are
not written into an ASCII string
2012-10-06 00:40:45 +02:00
Ezio Melotti 080a2c087e #16127: merge with 3.3. 2012-10-05 03:34:02 +03:00
Ezio Melotti e7f90375b1 #16127: remove outdated references to narrow builds. Patch by Serhiy Storchaka. 2012-10-05 03:33:31 +03:00
Victor Stinner 1929407406 Fix PyUnicode_Format(): return NULL if PyUnicode_READY(uformat) failed
This error cannot occur in practice: PyUnicode_FromObject() always return
a "ready" string.
2012-10-05 00:09:33 +02:00
Victor Stinner 770e19e0cc Optimize unicode_compare(): use memcmp() when comparing two UCS1 strings 2012-10-04 22:59:45 +02:00
Victor Stinner 90db9c47dc Enable also ptr==ptr optimization in PyUnicode_Compare()
It was already implemented in PyUnicode_RichCompare()
2012-10-04 21:53:50 +02:00
Victor Stinner aa7712711d unicode_result_wchar(): move the assert() to the "#ifdef Py_DEBUG" block 2012-10-04 02:32:58 +02:00
Victor Stinner a4708231e6 Split the huge PyUnicode_Format() function (+540 lines) into subfunctions 2012-10-04 02:19:54 +02:00
Victor Stinner a049443fab PyUnicode_Format(): disable overallocation when we are writing the last part
of the output string
2012-10-03 23:03:46 +02:00
Victor Stinner afffce489b Unicode: resize_compact() and resize_inplace() fills also the Unicode strings
with invalid bytes in debug mode, as done by PyUnicode_New()
2012-10-03 23:03:17 +02:00
Victor Stinner c89d28fdfc Issue #15609: Fix refleak introduced by my last optimization 2012-10-02 12:54:07 +02:00
Victor Stinner 621ef3d84f Issue #15609: Optimize str%args for integer argument
- Use _PyLong_FormatWriter() instead of formatlong() when possible, to avoid
   a temporary buffer
 - Enable the fast path when width is smaller or equals to the length,
   and when the precision is bigger or equals to the length
 - Add unit tests!
 - formatlong() uses PyUnicode_Resize() instead of _PyUnicode_FromASCII()
   to resize the output string
2012-10-02 00:33:47 +02:00
Antoine Pitrou a1f7655fa7 Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
Patch by Serhiy Storchaka.
2012-09-23 20:00:04 +02:00
Antoine Pitrou 6f80f5d444 Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
Patch by Serhiy Storchaka.
2012-09-23 19:55:21 +02:00
Antoine Pitrou ca8aa4acf6 Issue #15144: Fix possible integer overflow when handling pointers as integer values, by using Py_uintptr_t instead of size_t.
Patch by Serhiy Storchaka.
2012-09-20 20:56:47 +02:00
Christian Heimes 5f520f4fed Issue #15900: Fixed reference leak in PyUnicode_TranslateCharmap() 2012-09-11 14:03:25 +02:00
Christian Heimes f4f9939a96 Fixed memory leak in error branch of formatfloat(). CID 719687 2012-09-10 11:48:41 +02:00
Antoine Pitrou 057119b0b7 Fix C++-style comment (xlc compilation failure) 2012-09-02 17:56:33 +02:00
Benjamin Peterson 59043f96ea merge 3.2 (#15801) 2012-08-28 18:01:45 -04:00
Benjamin Peterson 28a6cfaefc use the stricter PyMapping_Check (closes #15801) 2012-08-28 17:55:35 -04:00
Stefan Krah 8528c3145e Issue #15728: Fix leak in PyUnicode_AsWideCharString(). Found by Coverity. 2012-08-19 21:52:43 +02:00
Nick Coghlan 0e41628d35 Merge str docstring fix from 3.2 2012-08-16 14:14:30 +10:00
Nick Coghlan 573b1fd779 Fix str docstring 2012-08-16 14:13:07 +10:00
Antoine Pitrou b4bbee25b1 Issue #14579: Fix CVE-2012-2135: vulnerability in the utf-16 decoder after error handling.
Patch by Serhiy Storchaka.
2012-07-21 00:45:14 +02:00
Mark Dickinson 01ac8b6ab1 Use correct types for ASCII_CHAR_MASK integer constants. 2012-07-07 14:08:48 +02:00
Antoine Pitrou aaefac76dd Issue #14874: Restore charmap decoding speed to pre-PEP 393 levels.
Patch by Serhiy Storchaka.
2012-06-16 22:48:21 +02:00
Victor Stinner f185226244 _copy_characters(): move debug code at the top to avoid noisy #ifdef
And don't use assert() anymore if check_maxchar is set: return -1 on error
instead.
2012-06-16 16:38:26 +02:00
Victor Stinner 07621338fb Fix PyUnicode_GetSize(): Don't replace _PyUnicode_Ready() exception 2012-06-16 04:53:46 +02:00
Victor Stinner 8a8b3eaabe Fix a compiler warning in _copy_characters() and remove debug code 2012-06-16 04:53:25 +02:00
Victor Stinner 24e403bbee Oops, fix my previous change on _copy_characters() 2012-06-16 04:53:00 +02:00
Victor Stinner ca439eecea Fix unicode_adjust_maxchar(): catch PyUnicode_New() failure 2012-06-16 03:17:34 +02:00
Victor Stinner 184252ad3f Fix "%f" format of str%args if the result is not an ASCII or latin1 string 2012-06-16 02:57:41 +02:00
Victor Stinner 9a77770add Remove debug code 2012-06-16 02:44:43 +02:00
Victor Stinner c9d369f1bf Optimize _PyUnicode_FastCopyCharacters() when maxchar(from) > maxchar(to) 2012-06-16 02:22:37 +02:00
Victor Stinner f05e17ece9 unicodeobject.c: Remove debug code 2012-06-16 01:53:04 +02:00
Antoine Pitrou 27f6a3b0bf Issue #15026: utf-16 encoding is now significantly faster (up to 10x).
Patch by Serhiy Storchaka.
2012-06-15 22:15:23 +02:00
Kristján Valur Jónsson 55e5dc8371 Rearrange code to beat an optimizer bug affecting Release x64 on windows
with VS2010sp1
2012-06-06 21:58:08 +00:00
Victor Stinner d7b7c7472b Issue #14993: Use standard "unsigned char" instead of a unsigned char bitfield 2012-06-04 22:52:12 +02:00
Kristjan Valur Jonsson 85634d7a2e Issue #14909: A number of places were using PyMem_Realloc() apis and
PyObject_GC_Resize() with incorrect error handling.  In case of errors,
the original object would be leaked.  This checkin fixes those cases.
2012-05-31 09:37:31 +00:00
Victor Stinner 3a7d096f2f Issue #14744: Fix compilation on Windows (part 2) 2012-05-29 18:53:56 +02:00
Victor Stinner d3f0882dfb Issue #14744: Use the new _PyUnicodeWriter internal API to speed up str%args and str.format(args)
* Formatting string, int, float and complex use the _PyUnicodeWriter API. It
   avoids a temporary buffer in most cases.
 * Add _PyUnicodeWriter_WriteStr() to restore the PyAccu optimization: just
   keep a reference to the string if the output is only composed of one string
 * Disable overallocation when formatting the last argument of str%args and
   str.format(args)
 * Overallocation allocates at least 100 characters: add min_length attribute
   to the _PyUnicodeWriter structure
 * Add new private functions: _PyUnicode_FastCopyCharacters(),
   _PyUnicode_FastFill() and _PyUnicode_FromASCII()

The speed up is around 20% in average.
2012-05-29 12:57:52 +02:00
Antoine Pitrou 63065d761e Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs.
Patch by Serhiy Storchaka.
2012-05-15 23:48:04 +02:00
Martin v. Löwis b05c0738d8 Silence VS 2010 signed/unsigned warnings. 2012-05-15 13:45:49 +02:00
Antoine Pitrou 758153badb Fix refleaks introduced by 83da67651687. 2012-05-12 15:51:51 +02:00
Antoine Pitrou e45c0c5cef Fix logic error introduced by 83da67651687. 2012-05-12 15:49:07 +02:00
Benjamin Peterson 1ff2e35e84 simplify by shortcutting when the kind of the needle is larger than the haystack 2012-05-11 17:41:20 -05:00
Antoine Pitrou ca5f91b888 Issue #14738: Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka. 2012-05-10 16:36:02 +02:00
Victor Stinner 3b1a74a9c3 Rename unicode_write_t structure and its methods to "_PyUnicodeWriter" 2012-05-09 22:25:00 +02:00
Victor Stinner ee4544c920 Issue #14744: Inline unicode_writer_write_char() and unicode_write_str()
Optimize also PyUnicode_Format(): call unicode_writer_prepare() only once
per argument.
2012-05-09 22:24:08 +02:00
Victor Stinner f59c28c930 unicode_writer_finish() checks string consistency 2012-05-09 03:24:14 +02:00
Victor Stinner 106802547c Backout ab500b297900: the check for integer overflow is wrong
Issue #14716: Change integer overflow check in unicode_writer_prepare()
to compute the limit at compile time instead of runtime. Patch writen by Serhiy
Storchaka.
2012-05-07 23:50:05 +02:00
Victor Stinner 0576f9b4cf Issue #14716: Change integer overflow check in unicode_writer_prepare()
to compute the limit at compile time instead of runtime. Patch writen by Serhiy
Storchaka.
2012-05-07 13:02:44 +02:00
Victor Stinner 202fdca133 Close #14716: str.format() now uses the new "unicode writer" API instead of the
PyAccu API. For example, it makes str.format() from 25% to 30% faster on Linux.
2012-05-07 12:47:02 +02:00
Mark Dickinson 99e2e5552a Issue #14700: Fix two broken and undefined-behaviour-inducing overflow checks in old-style string formatting. Thanks Serhiy Storchaka for report and original patch. 2012-05-07 11:20:50 +01:00
Victor Stinner d0dba6eee8 unicode_writer: don't force inline when it is not necessary
Keep inline for performance critical functions (functions used in loops)
2012-05-04 01:19:15 +02:00
Benjamin Peterson b63f49f2b4 if the kind of the string to count is larger than the string to search, shortcut to 0 2012-05-03 18:31:07 -04:00
Victor Stinner a7b654be30 unicode_writer: add finish() method and assertions to write_str() method
* The write_str() method does nothing if the length is zero.
 * Replace "struct unicode_writer_t" with "unicode_writer_t"
2012-05-03 23:58:55 +02:00
Victor Stinner bf4e266397 Issue #14687: Remove redundant length attribute of unicode_write_t
The length can be read directly from the buffer
2012-05-03 19:27:14 +02:00
Victor Stinner 7989157e49 Issue #14687: Cleanup unicode_writer_prepare()
"Inline" PyUnicode_Resize(): call directly resize_compact()
2012-05-03 13:43:07 +02:00
Victor Stinner f2c76aa6cb Issue #14687: str%tuple now uses an optimistic "unicode writer" instead of an
accumulator. Directly write characters into the output (don't use a temporary
list): resize and widen the string on demand.
2012-05-03 13:10:40 +02:00
Victor Stinner 1b487b467b Issue #14624, #14687: Optimize unicode_widen()
Don't convert uninitialized characters. Patch written by Serhiy Storchaka.
2012-05-03 12:29:04 +02:00
Victor Stinner 3a7f7977f1 Remove buggy assertion in PyUnicode_Substring()
Use also directly unicode_empty, instead of PyUnicode_New(0,0).
2012-05-03 03:36:40 +02:00
Victor Stinner 684d5fd420 Fix PyUnicode_Substring() for start >= length and start > end
Remove the fast-path for 1-character string: unicode_fromascii() and
_PyUnicode_FromUCS*() now have their own fast-path for 1-character strings.
2012-05-03 02:32:34 +02:00
Victor Stinner b6cd014d75 Unicode: optimize creating of 1-character strings 2012-05-03 02:17:04 +02:00
Victor Stinner bff7c96834 Issue #14687: Optimize str%tuple for the "%(name)s" syntax
Avoid an useless and expensive call to PyUnicode_READ().
2012-05-03 01:44:59 +02:00
Victor Stinner e6abb488c9 unicodeobject.c: Add MAX_MAXCHAR() macro to (micro-)optimize the computation
of the second argument of PyUnicode_New().

 * Create also align_maxchar() function
 * Optimize fix_decimal_and_space_to_ascii(): don't compute the maximum
   character when ch <= 127 (it is ASCII)
2012-05-02 01:15:40 +02:00
Victor Stinner 438106b66e Issue #14687: Cleanup PyUnicode_Format() 2012-05-02 00:41:57 +02:00
Victor Stinner b5c3ea3af3 Issue #14687: Optimize str%args
* formatfloat() uses unicode_fromascii() instead of PyUnicode_DecodeASCII()
   to not have to check characters, we know that it is really ASCII
 * Use PyUnicode_FromOrdinal() instead of _PyUnicode_FromUCS4() to format
   a character: if avoids a call to ucs4lib_find_max_char() to compute
   the maximum character (whereas we already know it, it is just the character
   itself)
2012-05-02 00:29:36 +02:00
Victor Stinner b80e46eca4 Issue #14687: Avoid an useless duplicated string in PyUnicode_Format() 2012-04-30 05:21:52 +02:00
Victor Stinner aff3cc659b Issue #14687: Cleanup PyUnicode_Format() 2012-04-30 05:19:21 +02:00
Victor Stinner b11d91d969 Fix my previous commit: bool is a long, restore the specical case for bool 2012-04-28 00:25:34 +02:00
Victor Stinner d0880d57b0 Simplify and optimize formatlong()
* Remove _PyBytes_FormatLong(): inline it into formatlong()
 * the input type is always a long, so remove the code for bool
 * don't duplicate the string if the length does not change
 * Use PyUnicode_DATA() instead of _PyUnicode_AsString()
2012-04-27 23:40:13 +02:00
Victor Stinner 94d558b063 Optimize _PyUnicode_FindMaxChar() find pure ASCII strings 2012-04-27 22:26:58 +02:00
Victor Stinner 8f825060f1 Check newly created consistency using _PyUnicode_CheckConsistency(str, 1)
* In debug mode, fill the string data with invalid characters
 * Simplify also reference counting in PyCodec_BackslashReplaceErrors()
   and PyCodec_XMLCharRefReplaceError()
2012-04-27 13:55:39 +02:00
Victor Stinner 718fbf078c _PyUnicode_CheckConsistency() ensures that the unicode string ends with a
null character
2012-04-26 00:39:37 +02:00
Benjamin Peterson b9f4c9daad make pointer arith c89 2012-04-23 21:45:40 -04:00
Benjamin Peterson f3b7d86e25 use correct base ptr 2012-04-23 18:07:01 -04:00
Benjamin Peterson 2844a7a6d3 simplify and reformat 2012-04-23 18:00:25 -04:00
Victor Stinner ece58deb9f Close #14648: Compute correctly maxchar in str.format() for substrin 2012-04-23 23:36:38 +02:00
Benjamin Peterson 64ed576de8 merge 3.2 (#14509) 2012-04-09 15:04:39 -04:00
Benjamin Peterson ca819c3c9d merge 3.1 (#14509) 2012-04-09 15:01:02 -04:00
Benjamin Peterson f6622c8a3e fix build without Py_DEBUG and DNDEBUG (closes #14509) 2012-04-09 14:53:07 -04:00
Victor Stinner afb5205c48 Close #14249: Use bit shifts instead of an union, it's more efficient.
Patch written by Serhiy Storchaka
2012-04-05 22:54:49 +02:00
Victor Stinner e7eee01f36 Close #14249: Use an union instead of a long to short pointer to avoid aliasing
issue. Speed up UTF-16 by 20%.
2012-04-05 13:44:34 +02:00
Antoine Pitrou a701388de1 Rename _PyIter_GetBuiltin to _PyObject_GetBuiltin, and do not include it in the stable ABI. 2012-04-05 00:04:20 +02:00
Kristján Valur Jónsson 31668b8f7a Issue #14288: Serialization support for builtin iterators. 2012-04-03 10:49:41 +00:00
Benjamin Peterson 0df542985a grammar 2012-03-26 14:50:32 -04:00
Benjamin Peterson c067d6661f merge 3.2 2012-03-25 22:41:16 -04:00
Benjamin Peterson a8755c586e kill this terribly outdated comment 2012-03-25 22:40:54 -04:00
Victor Stinner 0d03478b88 Remove an unused variable 2012-03-06 02:06:01 +01:00
Victor Stinner c9590ad745 Close #14085: remove assertions from PyUnicode_WRITE macro
Add checks in PyUnicode_WriteChar() and convert PyUnicode_New() assertion to a
test raising a Python exception.
2012-03-04 01:34:37 +01:00
Ezio Melotti cda6b6d60d #14081: The sep and maxsplit parameter to str.split, bytes.split, and bytearray.split may now be passed as keyword arguments. 2012-02-26 09:39:55 +02:00
Victor Stinner b0800dc53b Oops, revert unwanted changes 2012-02-25 00:47:08 +01:00
Victor Stinner abc649ddbe Issue #14107: fix bigmem tests on str.capitalize(), str.swapcase() and
str.title(). Compute correctly how much memory is required for the test
(memuse).
2012-02-25 00:43:27 +01:00
Antoine Pitrou 842c0f17eb Fix compilation error under Windows (and warnings too). 2012-02-24 13:30:46 +01:00
Victor Stinner 90f50d4df9 Issue #13706: Fix format(float, "n") for locale with non-ASCII decimal point (e.g. ps_aF) 2012-02-24 01:44:47 +01:00
Victor Stinner 41a863cb81 Issue #13706: Fix format(int, "n") for locale with non-ASCII thousands separator
* Decode thousands separator and decimal point using PyUnicode_DecodeLocale()
   (from the locale encoding), instead of decoding them implicitly from latin1
 * Remove _PyUnicode_InsertThousandsGroupingLocale(), it was not used
 * Change _PyUnicode_InsertThousandsGrouping() API to return the maximum
   character if unicode is NULL
 * Replace MIN/MAX macros by Py_MIN/Py_MAX
 * stringlib/undef.h undefines STRINGLIB_IS_UNICODE
 * stringlib/localeutil.h only supports Unicode
2012-02-24 00:37:51 +01:00
Victor Stinner b429d3b09c Fix doc of an internal function: unicode_write_cstr() 2012-02-22 21:22:20 +01:00
Antoine Pitrou ba6bafcfbe Fix compile failure under Windows 2012-02-22 16:41:50 +01:00
Victor Stinner c516610f0b Optimize str%arg for number formats: %i, %d, %u, %x, %p
Write a specialized function to write an ASCII/latin1 C char* string into a
Python Unicode string.
2012-02-22 13:55:02 +01:00
Victor Stinner 99d7ad0bb0 Micro-optimize computation of maxchar in PyUnicode_TransformDecimalToASCII() 2012-02-22 13:37:39 +01:00
Victor Stinner da79e632c4 Micro-optimize unicode_expandtabs(): use FILL() macro to write N spaces 2012-02-22 13:37:04 +01:00
Victor Stinner 15e9ed299c PyUnicode_New() and unicode_putchar() check for MAX_UNICODE maximum (U+10FFFF) 2012-02-22 13:36:20 +01:00
Benjamin Peterson d9a3591ed1 merge 3.2 2012-02-21 11:12:14 -05:00
Benjamin Peterson e249dcab7a merge 3.2 2012-02-21 11:09:13 -05:00
Benjamin Peterson 69e9727657 ensure no one tries to hash things before the random seed is found 2012-02-21 11:08:50 -05:00
Georg Brandl 16fa2a1097 Forgot the "empty string -> hash == 0" special case for strings. 2012-02-21 00:50:13 +01:00
Georg Brandl 2fb477c0f0 Merge 3.2: Issue #13703 plus some related test suite fixes. 2012-02-21 00:33:36 +01:00
Georg Brandl 09a7c72cad Merge from 3.1: Issue #13703: add a way to randomize the hash values of basic types (str, bytes, datetime)
in order to make algorithmic complexity attacks on (e.g.) web apps much more complicated.

The environment variable PYTHONHASHSEED and the new command line flag -R control this
behavior.
2012-02-20 21:31:46 +01:00
Georg Brandl 2daf6ae249 Issue #13703: add a way to randomize the hash values of basic types (str, bytes, datetime)
in order to make algorithmic complexity attacks on (e.g.) web apps much more complicated.

The environment variable PYTHONHASHSEED and the new command line flag -R control this
behavior.
2012-02-20 19:54:16 +01:00
Victor Stinner c3a6b02d70 (Merge 3.2) Issue #13913: normalize utf-8 codec name in UTF-8 decoder 2012-02-14 01:18:10 +01:00
Victor Stinner cbe01342bc Issue #13913: normalize utf-8 codec name in UTF-8 decoder 2012-02-14 01:17:45 +01:00
Victor Stinner d1cd99b533 Backout d2c1521ad0a1: _Py_IDENTIFIER() uses UTF-8 again 2012-02-07 23:05:55 +01:00
Victor Stinner d446d8e09a _Py_Identifier are always ASCII strings 2012-02-05 01:45:45 +01:00
Antoine Pitrou 7ab4af0427 Issue #13848: open() and the FileIO constructor now check for NUL characters in the file name.
Patch by Hynek Schlawack.
2012-01-29 18:43:36 +01:00
Antoine Pitrou 1334884ff2 Issue #13848: open() and the FileIO constructor now check for NUL characters in the file name.
Patch by Hynek Schlawack.
2012-01-29 18:36:34 +01:00
Benjamin Peterson eea4846d23 don't ready in case_operation, since most callers do it themselves 2012-01-16 14:28:50 -05:00
Gregory P. Smith f5b62a9b31 Consolidate the occurrances of the prime used as the multiplier when hashing. 2012-01-14 15:45:13 -08:00
Gregory P. Smith 63e6c3222f Consolidate the occurrances of the prime used as the multiplier when hashing
to a single #define instead of having several copies in several files.

This excludes the Modules/ tree (datetime and expat both have a copy
for their own purposes with no need for it to be the same).
2012-01-14 15:31:34 -08:00
Benjamin Peterson c8d8b8861e fix possible refleaks if PyUnicode_READY fails 2012-01-14 13:37:31 -05:00
Benjamin Peterson bac79498c8 always explicitly check for -1 from PyUnicode_READY 2012-01-14 13:34:47 -05:00
Benjamin Peterson d5890c8db5 add str.casefold() (closes #13752) 2012-01-14 13:23:30 -05:00
Benjamin Peterson 53aa1d7c57 fix possible if unlikely leak 2011-12-20 13:29:45 -06:00
Benjamin Peterson e51757f6de move do_title to a better place 2012-01-12 21:10:29 -05:00
Benjamin Peterson 821e4cfd01 make fix_decimal_and_space_to_ascii check if it modifies the string 2012-01-12 15:40:18 -05:00
Benjamin Peterson 0c91392fe6 kill capwords implementation which has been disabled since the begining 2012-01-12 15:25:41 -05:00
Benjamin Peterson b2bf01d824 use full unicode mappings for upper/lower/title case (#12736)
Also broaden the category of characters that count as lowercase/uppercase.
2012-01-11 18:17:06 -05:00
Victor Stinner 3fe553160c Add a new PyUnicode_Fill() function
It is faster than the unicode_fill() function which was implemented in
formatter_unicode.c.
2012-01-04 00:33:50 +01:00
Benjamin Peterson 5e458f520c also decref the right thing 2012-01-02 10:12:13 -06:00
Benjamin Peterson 4c13a4a352 ready the correct string 2012-01-02 09:07:38 -06:00
Benjamin Peterson 22a29708fd fix some possible refleaks from PyUnicode_READY error conditions 2012-01-02 09:00:30 -06:00
Benjamin Peterson 9ca3ffac94 == -1 is convention 2012-01-01 16:04:29 -06:00
Benjamin Peterson e157cf1012 make switch more robust 2012-01-01 15:56:20 -06:00
Benjamin Peterson c0b95d18fa 4 space indentation 2011-12-20 17:24:05 -06:00
Benjamin Peterson ead6b53659 fix spacing around switch statements 2011-12-20 17:23:42 -06:00
Benjamin Peterson 822c790527 merge 3.2 2011-12-20 13:32:50 -06:00
Victor Stinner 6099a03202 Issue #13624: Write a specialized UTF-8 encoder to allow more optimization
The main bottleneck was the PyUnicode_READ() macro.
2011-12-18 14:22:26 +01:00
Victor Stinner 73f53b57d1 Optimize str * n for len(str)==1 and UCS-2 or UCS-4 2011-12-18 03:26:31 +01:00
Victor Stinner f644110816 Issue #13621: Optimize str.replace(char1, char2)
Use findchar() which is more optimized than a dummy loop using
PyUnicode_READ().  PyUnicode_READ() is a complex and slow macro.
2011-12-18 02:43:08 +01:00
Victor Stinner ab870218e3 Issue #10951: Fix compiler warnings in timemodule.c and unicodeobject.c
Thanks Jérémy Anger for the fix.
2011-12-17 22:39:43 +01:00
Victor Stinner 2f197078fb The locale decoder raises a UnicodeDecodeError instead of an OSError
Search the invalid character using mbrtowc().
2011-12-17 07:08:30 +01:00
Victor Stinner 1b57967b96 Issue #13560: Locale codec functions use the classic "errors" parameter,
instead of surrogateescape

So it would be possible to support more error handlers later.
2011-12-17 05:47:23 +01:00
Victor Stinner ab59594326 What's New in Python 3.3: complete the deprecation list
Add also FIXMEs in unicodeobject.c
2011-12-17 04:59:06 +01:00
Victor Stinner 1f33f2b0c3 Issue #13560: os.strerror() now uses the current locale encoding instead of UTF-8 2011-12-17 04:45:09 +01:00
Victor Stinner f2ea71fcc8 Issue #13560: Add PyUnicode_EncodeLocale()
* Use PyUnicode_EncodeLocale() in time.strftime() if wcsftime() is not
   available
 * Document my last changes in Misc/NEWS
2011-12-17 04:13:41 +01:00
Victor Stinner af02e1c85a Add PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale()
* PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale() decode a string
   from the current locale encoding
 * _Py_char2wchar() writes an "error code" in the size argument to indicate
   if the function failed because of memory allocation failure or because of a
   decoding error. The function doesn't write the error message directly to
   stderr.
 * Fix time.strftime() (if wcsftime() is missing): decode strftime() result
   from the current locale encoding, not from the filesystem encoding.
2011-12-16 23:56:01 +01:00
Victor Stinner 16e6a80923 PyUnicode_Resize(): warn about canonical representation
Call also directly unicode_resize() in unicodeobject.c
2011-12-12 13:24:15 +01:00
Victor Stinner b0a82a6a7f Fix PyUnicode_Resize() for compact string: leave the string unchanged on error
Fix also PyUnicode_Resize() doc
2011-12-12 13:08:33 +01:00
Victor Stinner bf6e560d0c Make PyUnicode_Copy() private => _PyUnicode_Copy()
Undocument the function.

Make also decode_utf8_errors() as private (static).
2011-12-12 01:53:47 +01:00
Victor Stinner 7a9105a380 resize_copy() now supports legacy ready strings 2011-12-12 00:13:42 +01:00
Victor Stinner 488fa49acf Rewrite PyUnicode_Append(); unicode_modifiable() is more strict
* Rename unicode_resizable() to unicode_modifiable()
 * Rename _PyUnicode_Dirty() to unicode_check_modifiable() to make it clear
   that the function is private
 * Inline PyUnicode_Concat() and unicode_append_inplace() in PyUnicode_Append()
   to simplify the code
 * unicode_modifiable() return 0 if the hash has been computed or if the string
   is not an exact unicode string
 * Remove _PyUnicode_DIRTY(): no need to reset the hash anymore, because if the
   hash has already been computed, you cannot modify a string inplace anymore
 * PyUnicode_Concat() checks for integer overflow
2011-12-12 00:01:39 +01:00
Victor Stinner c4b495497a Create unicode_result_unchanged() subfunction 2011-12-11 22:44:26 +01:00
Victor Stinner eaab604829 Fix fixup() for unchanged unicode subtype
If maxchar_new == 0 and self is a unicode subtype, return u instead of duplicating u.
2011-12-11 22:22:39 +01:00
Victor Stinner e6b2d4407a unicode_fromascii() doesn't check string content twice in debug mode
_PyUnicode_CheckConsistency() also checks string content.
2011-12-11 21:54:30 +01:00
Victor Stinner a1d12bb119 Call directly PyUnicode_DecodeUTF8Stateful() instead of PyUnicode_DecodeUTF8()
* Remove micro-optimization from PyUnicode_FromStringAndSize():
   PyUnicode_DecodeUTF8Stateful() has already these optimizations (for size=0
   and one ascii char).
 * Rename utf8_max_char_size_and_char_count() to utf8_scanner(), and remove an
   useless variable
2011-12-11 21:53:09 +01:00
Victor Stinner 382955ff4e Use directly unicode_empty instead of PyUnicode_New(0, 0) 2011-12-11 21:44:00 +01:00
Victor Stinner 785938eebd Move the slowest UTF-8 decoder to its own subfunction
* Create decode_utf8_errors()
 * Reuse unicode_fromascii()
 * decode_utf8_errors() doesn't refit at the beginning
 * Remove refit_partial_string(), use unicode_adjust_maxchar() instead
2011-12-11 20:09:03 +01:00
Victor Stinner 84def3774d Fix error handling in resize_compact() 2011-12-11 20:04:56 +01:00
Victor Stinner 8faf8216e4 PyUnicode_FromWideChar() and PyUnicode_FromUnicode() raise a ValueError if a
character in not in range [U+0000; U+10ffff].
2011-12-08 22:14:11 +01:00
Victor Stinner 551ac95733 Py_UNICODE_HIGH_SURROGATE() and Py_UNICODE_LOW_SURROGATE() macros
And use surrogates macros everywhere in unicodeobject.c
2011-11-29 22:58:13 +01:00
Victor Stinner 6345be9a14 Close #13093: PyUnicode_EncodeDecimal() doesn't support error handlers
different than "strict" anymore. The caller was unable to compute the
size of the output buffer: it depends on the error handler.
2011-11-25 20:09:01 +01:00
Benjamin Peterson 1518e8713d and back to the "magic" formula (with a comment) it is 2011-11-23 10:44:52 -06:00
Benjamin Peterson 5944c36931 cave to those who like readable code 2011-11-22 19:05:49 -06:00
Benjamin Peterson 0268675193 fix compiler warning by implementing this more cleverly 2011-11-22 15:29:32 -05:00
Victor Stinner ca4f20782e find_maxchar_surrogates() reuses surrogate macros 2011-11-22 03:38:40 +01:00
Victor Stinner 0d3721d986 Issue #13441: Disable temporary the check on the maximum character until
the Solaris issue is solved.

But add assertion on the maximum character in various encoders: UTF-7, UTF-8,
wide character (wchar_t*, Py_UNICODE*), unicode-escape, raw-unicode-escape.

Fix also unicode_encode_ucs1() for backslashreplace error handler: Python is
now always "wide".
2011-11-22 03:27:53 +01:00
Victor Stinner f8facacf30 Fix compiler warnings 2011-11-22 02:30:47 +01:00
Victor Stinner b84d723509 (Merge 3.2) Issue #13093: Fix error handling on PyUnicode_EncodeDecimal() 2011-11-22 01:50:07 +01:00
Victor Stinner cfed46e00a PyUnicode_FromKindAndData() fails with a ValueError if size < 0 2011-11-22 01:29:14 +01:00
Victor Stinner 42885206ec UTF-8 decoder: set consumed value in the latin1 fast-path 2011-11-22 01:23:02 +01:00
Victor Stinner d3df8ab377 Replace _PyUnicode_READY_REPLACE() and _PyUnicode_ReadyReplace() with unicode_ready()
* unicode_ready() has a simpler API
 * try to reuse unicode_empty and latin1_char singleton everywhere
 * Fix a reference leak in _PyUnicode_TranslateCharmap()
 * PyUnicode_InternInPlace() doesn't try to get a singleton anymore, to avoid
   having to handle a failure
2011-11-22 01:22:34 +01:00
Victor Stinner f01245067a Rewrite PyUnicode_TransformDecimalToASCII() to use the new Unicode API 2011-11-21 23:12:56 +01:00
Victor Stinner 2d718f39a5 Remove an unused variable from PyUnicode_Copy() 2011-11-21 23:11:52 +01:00
Victor Stinner 87af4f2f3a Simplify PyUnicode_Copy()
USe PyUnicode_Copy() in fixup()
2011-11-21 23:03:47 +01:00
Victor Stinner 5bbe5e7c85 Fix a compiler warning in _PyUnicode_CheckConsistency() 2011-11-21 22:54:05 +01:00
Victor Stinner 42bf77537e Rewrite PyUnicode_EncodeDecimal() to use the new Unicode API
Add tests for PyUnicode_EncodeDecimal() and
PyUnicode_TransformDecimalToASCII().
2011-11-21 22:52:58 +01:00
Antoine Pitrou 0a3229de6b Issue #13417: speed up utf-8 decoding by around 2x for the non-fully-ASCII case.
This almost catches up with pre-PEP 393 performance, when decoding needed
only one pass.
2011-11-21 20:39:13 +01:00
Victor Stinner da29cc36aa Issue #13441: _PyUnicode_CheckConsistency() dumps the string if the maximum
character is bigger than U+10FFFF and locale.localeconv() dumps the string
before decoding it.

Temporary hack to debug the issue #13441.
2011-11-21 14:31:41 +01:00
Victor Stinner 9e30aa52fd Fix misuse of PyUnicode_GET_SIZE() => PyUnicode_GET_LENGTH()
And PyUnicode_GetSize() => PyUnicode_GetLength()
2011-11-21 02:49:52 +01:00
Victor Stinner 4ead7c7be8 PyObject_Str() ensures that the result string is ready
and check the string consistency.

_PyUnicode_CheckConsistency() doesn't check the hash anymore. It should be
possible to call this function even if hash(str) was already called.
2011-11-20 19:48:36 +01:00
Victor Stinner b960b34577 PyUnicode_AsUTF32String() calls directly _PyUnicode_EncodeUTF32(),
instead of calling the deprecated PyUnicode_EncodeUTF32() function
2011-11-20 19:12:52 +01:00
Victor Stinner 77faf69ca1 _PyUnicode_CheckConsistency() also checks maxchar maximum value,
not only its minimum value
2011-11-20 18:56:05 +01:00
Victor Stinner d5c4022d2a Remove the two ugly and unused WRITE_ASCII_OR_WSTR and WRITE_WSTR macros 2011-11-20 18:41:31 +01:00
Victor Stinner 2e9cfadd7c Reuse surrogate macros in UTF-16 decoder 2011-11-20 18:40:27 +01:00
Victor Stinner ae4f7c8e59 charmap_encoding_error() uses the new Unicode API 2011-11-20 18:28:55 +01:00
Victor Stinner ac931b1e5b Use PyUnicode_EncodeCodePage() instead of PyUnicode_EncodeMBCS() with
PyUnicode_AsUnicodeAndSize()
2011-11-20 18:27:03 +01:00
Victor Stinner 22168998f5 charmap encoders uses Py_UCS4, not Py_UNICODE 2011-11-20 17:09:18 +01:00
Victor Stinner 1f7951711c Catch PyUnicode_AS_UNICODE() errors 2011-11-17 00:45:54 +01:00
Ezio Melotti 11060a4a48 #13406: silence deprecation warnings in test_codecs. 2011-11-16 09:39:10 +02:00
Antoine Pitrou 78edf7576e Issue #13333: The UTF-7 decoder now accepts lone surrogates
(the encoder already accepts them).
2011-11-15 01:44:16 +01:00
Antoine Pitrou 5418ee0b9a Issue #13333: The UTF-7 decoder now accepts lone surrogates
(the encoder already accepts them).
2011-11-15 01:42:21 +01:00
Antoine Pitrou 31b92a534f Sanitize reference management in the utf-8 encoder 2011-11-12 18:35:19 +01:00
Antoine Pitrou 0290c7a811 Fix regression on 2-byte wchar_t systems (Windows) 2011-11-11 13:29:12 +01:00
Antoine Pitrou 44c6affc79 Avoid crashing because of an unaligned word access 2011-11-11 02:59:42 +01:00
Antoine Pitrou de20b0b50e Issue #13149: Speed up append-only StringIO objects.
This is very similar to the "lazy strings" idea.
2011-11-10 21:47:38 +01:00
Victor Stinner 9f4b1e9c50 Fix and deprecated the unicode_internal codec
unicode_internal codec uses Py_UNICODE instead of the real internal
representation (PEP 393: Py_UCS1, Py_UCS2 or Py_UCS4) for backward
compatibility.
2011-11-10 20:56:30 +01:00
Victor Stinner 24729f36bf Prefer Py_UCS4 or wchar_t over Py_UNICODE 2011-11-10 20:31:37 +01:00
Victor Stinner ebf3ba808e PyUnicode_DecodeCharmap() uses the new Unicode API 2011-11-10 20:30:22 +01:00
Victor Stinner a98b28c1bf Avoid PyUnicode_AS_UNICODE in the UTF-8 encoder 2011-11-10 20:21:49 +01:00
Victor Stinner 3326cb6a36 Fix "unicode_escape" encoder 2011-11-10 20:15:25 +01:00
Victor Stinner 0e36826a04 Fix UTF-7 encoder on Windows 2011-11-10 20:12:49 +01:00
Martin v. Löwis 1db7c13be1 Port encoders from Py_UNICODE API to unicode object API. 2011-11-10 18:24:32 +01:00
Victor Stinner 62aa4d086a Strip trailing spaces 2011-11-09 00:03:45 +01:00
Victor Stinner 0a045efb49 Fix a compiler warning: use unsiged for maxchar in unicode_widen() 2011-11-09 00:02:42 +01:00
Victor Stinner 596a6c4ffc Fix the code page decoder
* unicode_decode_call_errorhandler() now supports the PyUnicode_WCHAR_KIND
   kind
 * unicode_decode_call_errorhandler() calls copy_characters() instead of
   PyUnicode_CopyCharacters()
2011-11-09 00:02:18 +01:00
Antoine Pitrou a8f63c02ef Fix missing goto 2011-11-08 18:37:16 +01:00
Martin v. Löwis d10759f6ed Make _PyUnicode_FromId return borrowed references.
http://mail.python.org/pipermail/python-dev/2011-November/114347.html
2011-11-07 13:00:05 +01:00
Martin v. Löwis e9b11c1cd8 Change decoders to use Unicode API instead of Py_UNICODE. 2011-11-08 17:35:34 +01:00
Victor Stinner e30c0a1014 Fix gdb/libpython.py for not ready Unicode strings
_PyUnicode_CheckConsistency() checks also hash and length value for not ready
Unicode strings.
2011-11-04 20:54:05 +01:00
Victor Stinner 2fc507fe45 Replace tabs by spaces 2011-11-04 20:06:39 +01:00
Martin v. Löwis 12be46ca84 Drop Py_UNICODE based encode exceptions. 2011-11-04 19:04:15 +01:00
Martin v. Löwis 3d325191bf Port code page codec to Unicode API. 2011-11-04 18:23:06 +01:00
Victor Stinner fcd9653667 Fix a compiler warning in unicode_encode_ucs1() 2011-11-04 00:28:50 +01:00
Victor Stinner fc026c98d8 Fix PyUnicode_EncodeCharmap() 2011-11-04 00:24:51 +01:00
Victor Stinner 7931d9a951 Replace PyUnicodeObject type by PyObject
* _PyUnicode_CheckConsistency() now takes a PyObject* instead of void*
 * Remove now useless casts to PyObject*
2011-11-04 00:22:48 +01:00
Victor Stinner 76a31a6bff Cleanup decode_code_page_stateful() and encode_code_page()
* Fix decode_code_page_errors() result
 * Inline decode_code_page() and encode_code_page_chunk()
 * Replace the PyUnicodeObject type by PyObject
2011-11-04 00:05:13 +01:00
Victor Stinner 7581cef699 Adapt the code page encoder to the new unicode_encode_call_errorhandler()
The code is not correct, but at least it doesn't crash anymore.
2011-11-03 22:32:33 +01:00
Brian Curtin 2787ea41fd Fix a compile error (apparently Windows only) introduced in 295fdfd4f422 2011-11-02 15:09:37 -05:00
Martin v. Löwis 23e275b3ad Port UCS1 and charmap codecs to new API. 2011-11-02 18:02:51 +01:00
Martin v. Löwis 9e8166843c Introduce PyObject* API for raising encode errors. 2011-11-02 12:45:42 +01:00
Martin v. Löwis 0d3072e98d Drop Py_UCS4_ functions. Closes #13246. 2011-10-31 08:40:56 +01:00
Victor Stinner 57ffa9d4ff PyUnicode_AsUnicodeCopy() uses PyUnicode_AsUnicodeAndSize() to get directly the length 2011-10-23 20:10:08 +02:00
Victor Stinner af9e4b8c29 Fix PyUnicode_InternImmortal(): PyUnicode_InternInPlace() may changes *p 2011-10-23 20:07:00 +02:00
Victor Stinner 9faa384bed Cast directly to unsigned char, instead of using Py_CHARMASK
We don't need "& 0xff" on an unsigned char.
2011-10-23 20:06:00 +02:00
Victor Stinner 9db1a8b69f Replace PyUnicodeObject* by PyObject* where it was irrevelant
A Unicode string can now be a PyASCIIObject, PyCompactUnicodeObject or
PyUnicodeObject. Aliasing a PyASCIIObject* or PyCompactUnicodeObject* to
PyUnicodeObject* is wrong
2011-10-23 20:04:37 +02:00
Victor Stinner 0d60e87ad6 Fix data variable in _PyUnicode_Dump() for compact ASCII 2011-10-23 19:47:19 +02:00