885 Commits

Author SHA1 Message Date
Benjamin Peterson 4c13a4a352 ready the correct string 2012-01-02 09:07:38 -06:00
Benjamin Peterson 22a29708fd fix some possible refleaks from PyUnicode_READY error conditions 2012-01-02 09:00:30 -06:00
Benjamin Peterson 9ca3ffac94 == -1 is convention 2012-01-01 16:04:29 -06:00
Benjamin Peterson e157cf1012 make switch more robust 2012-01-01 15:56:20 -06:00
Benjamin Peterson c0b95d18fa 4 space indentation 2011-12-20 17:24:05 -06:00
Benjamin Peterson ead6b53659 fix spacing around switch statements 2011-12-20 17:23:42 -06:00
Benjamin Peterson 822c790527 merge 3.2 2011-12-20 13:32:50 -06:00
Victor Stinner 6099a03202 Issue #13624: Write a specialized UTF-8 encoder to allow more optimization
The main bottleneck was the PyUnicode_READ() macro.
2011-12-18 14:22:26 +01:00
Victor Stinner 73f53b57d1 Optimize str * n for len(str)==1 and UCS-2 or UCS-4 2011-12-18 03:26:31 +01:00
Victor Stinner f644110816 Issue #13621: Optimize str.replace(char1, char2)
Use findchar() which is more optimized than a dummy loop using
PyUnicode_READ().  PyUnicode_READ() is a complex and slow macro.
2011-12-18 02:43:08 +01:00
Victor Stinner ab870218e3 Issue #10951: Fix compiler warnings in timemodule.c and unicodeobject.c
Thanks Jérémy Anger for the fix.
2011-12-17 22:39:43 +01:00
Victor Stinner 2f197078fb The locale decoder raises a UnicodeDecodeError instead of an OSError
Search the invalid character using mbrtowc().
2011-12-17 07:08:30 +01:00
Victor Stinner 1b57967b96 Issue #13560: Locale codec functions use the classic "errors" parameter,
instead of surrogateescape

So it would be possible to support more error handlers later.
2011-12-17 05:47:23 +01:00
Victor Stinner ab59594326 What's New in Python 3.3: complete the deprecation list
Add also FIXMEs in unicodeobject.c
2011-12-17 04:59:06 +01:00
Victor Stinner 1f33f2b0c3 Issue #13560: os.strerror() now uses the current locale encoding instead of UTF-8 2011-12-17 04:45:09 +01:00
Victor Stinner f2ea71fcc8 Issue #13560: Add PyUnicode_EncodeLocale()
* Use PyUnicode_EncodeLocale() in time.strftime() if wcsftime() is not
  available
* Document my last changes in Misc/NEWS
2011-12-17 04:13:41 +01:00
Victor Stinner af02e1c85a Add PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale()
* PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale() decode a string
  from the current locale encoding
* _Py_char2wchar() writes an "error code" in the size argument to indicate
  if the function failed because of memory allocation failure or because of a
  decoding error. The function doesn't write the error message directly to
  stderr.
* Fix time.strftime() (if wcsftime() is missing): decode strftime() result
  from the current locale encoding, not from the filesystem encoding.
2011-12-16 23:56:01 +01:00
Victor Stinner 16e6a80923 PyUnicode_Resize(): warn about canonical representation
Call also directly unicode_resize() in unicodeobject.c
2011-12-12 13:24:15 +01:00
Victor Stinner b0a82a6a7f Fix PyUnicode_Resize() for compact string: leave the string unchanged on error
Fix also PyUnicode_Resize() doc
2011-12-12 13:08:33 +01:00
Victor Stinner bf6e560d0c Make PyUnicode_Copy() private => _PyUnicode_Copy()
Undocument the function.

Also make decode_utf8_errors() private (static).
2011-12-12 01:53:47 +01:00
Victor Stinner 7a9105a380 resize_copy() now supports legacy ready strings 2011-12-12 00:13:42 +01:00
Victor Stinner 488fa49acf Rewrite PyUnicode_Append(); unicode_modifiable() is more strict
* Rename unicode_resizable() to unicode_modifiable()
* Rename _PyUnicode_Dirty() to unicode_check_modifiable() to make it clear
  that the function is private
* Inline PyUnicode_Concat() and unicode_append_inplace() in PyUnicode_Append()
  to simplify the code
* unicode_modifiable() returns 0 if the hash has been computed or if the string
  is not an exact unicode string
* Remove _PyUnicode_DIRTY(): no need to reset the hash anymore, because if the
  hash has already been computed, you cannot modify a string in place anymore
* PyUnicode_Concat() checks for integer overflow
2011-12-12 00:01:39 +01:00
Victor Stinner c4b495497a Create unicode_result_unchanged() subfunction 2011-12-11 22:44:26 +01:00
Victor Stinner eaab604829 Fix fixup() for unchanged unicode subtype
If maxchar_new == 0 and self is a unicode subtype, return u instead of duplicating u.
2011-12-11 22:22:39 +01:00
Victor Stinner e6b2d4407a unicode_fromascii() doesn't check string content twice in debug mode
_PyUnicode_CheckConsistency() also checks string content.
2011-12-11 21:54:30 +01:00
Victor Stinner a1d12bb119 Call directly PyUnicode_DecodeUTF8Stateful() instead of PyUnicode_DecodeUTF8()
* Remove micro-optimization from PyUnicode_FromStringAndSize():
  PyUnicode_DecodeUTF8Stateful() already has these optimizations (for size=0
  and one ascii char).
* Rename utf8_max_char_size_and_char_count() to utf8_scanner(), and remove a
  useless variable
2011-12-11 21:53:09 +01:00
Victor Stinner 382955ff4e Use directly unicode_empty instead of PyUnicode_New(0, 0) 2011-12-11 21:44:00 +01:00
Victor Stinner 785938eebd Move the slowest UTF-8 decoder to its own subfunction
* Create decode_utf8_errors()
* Reuse unicode_fromascii()
* decode_utf8_errors() doesn't refit at the beginning
* Remove refit_partial_string(), use unicode_adjust_maxchar() instead
2011-12-11 20:09:03 +01:00
Victor Stinner 84def3774d Fix error handling in resize_compact() 2011-12-11 20:04:56 +01:00
Victor Stinner 8faf8216e4 PyUnicode_FromWideChar() and PyUnicode_FromUnicode() raise a ValueError if a
character is not in range [U+0000; U+10ffff].
2011-12-08 22:14:11 +01:00
Victor Stinner 551ac95733 Py_UNICODE_HIGH_SURROGATE() and Py_UNICODE_LOW_SURROGATE() macros
And use surrogates macros everywhere in unicodeobject.c
2011-11-29 22:58:13 +01:00
Victor Stinner 6345be9a14 Close #13093: PyUnicode_EncodeDecimal() doesn't support error handlers
different than "strict" anymore. The caller was unable to compute the
size of the output buffer: it depends on the error handler.
2011-11-25 20:09:01 +01:00
Benjamin Peterson 1518e8713d and back to the "magic" formula (with a comment) it is 2011-11-23 10:44:52 -06:00
Benjamin Peterson 5944c36931 cave to those who like readable code 2011-11-22 19:05:49 -06:00
Benjamin Peterson 0268675193 fix compiler warning by implementing this more cleverly 2011-11-22 15:29:32 -05:00
Victor Stinner ca4f20782e find_maxchar_surrogates() reuses surrogate macros 2011-11-22 03:38:40 +01:00
Victor Stinner 0d3721d986 Issue #13441: Temporarily disable the check on the maximum character until
the Solaris issue is solved.

But add assertion on the maximum character in various encoders: UTF-7, UTF-8,
wide character (wchar_t*, Py_UNICODE*), unicode-escape, raw-unicode-escape.

Fix also unicode_encode_ucs1() for backslashreplace error handler: Python is
now always "wide".
2011-11-22 03:27:53 +01:00
Victor Stinner f8facacf30 Fix compiler warnings 2011-11-22 02:30:47 +01:00
Victor Stinner b84d723509 (Merge 3.2) Issue #13093: Fix error handling on PyUnicode_EncodeDecimal() 2011-11-22 01:50:07 +01:00
Victor Stinner cfed46e00a PyUnicode_FromKindAndData() fails with a ValueError if size < 0 2011-11-22 01:29:14 +01:00
Victor Stinner 42885206ec UTF-8 decoder: set consumed value in the latin1 fast-path 2011-11-22 01:23:02 +01:00
Victor Stinner d3df8ab377 Replace _PyUnicode_READY_REPLACE() and _PyUnicode_ReadyReplace() with unicode_ready()
* unicode_ready() has a simpler API
* try to reuse unicode_empty and latin1_char singleton everywhere
* Fix a reference leak in _PyUnicode_TranslateCharmap()
* PyUnicode_InternInPlace() doesn't try to get a singleton anymore, to avoid
  having to handle a failure
2011-11-22 01:22:34 +01:00
Victor Stinner f01245067a Rewrite PyUnicode_TransformDecimalToASCII() to use the new Unicode API 2011-11-21 23:12:56 +01:00
Victor Stinner 2d718f39a5 Remove an unused variable from PyUnicode_Copy() 2011-11-21 23:11:52 +01:00
Victor Stinner 87af4f2f3a Simplify PyUnicode_Copy()
Use PyUnicode_Copy() in fixup()
2011-11-21 23:03:47 +01:00
Victor Stinner 5bbe5e7c85 Fix a compiler warning in _PyUnicode_CheckConsistency() 2011-11-21 22:54:05 +01:00
Victor Stinner 42bf77537e Rewrite PyUnicode_EncodeDecimal() to use the new Unicode API
Add tests for PyUnicode_EncodeDecimal() and
PyUnicode_TransformDecimalToASCII().
2011-11-21 22:52:58 +01:00
Antoine Pitrou 0a3229de6b Issue #13417: speed up utf-8 decoding by around 2x for the non-fully-ASCII case.
This almost catches up with pre-PEP 393 performance, when decoding needed
only one pass.
2011-11-21 20:39:13 +01:00
Victor Stinner da29cc36aa Issue #13441: _PyUnicode_CheckConsistency() dumps the string if the maximum
character is bigger than U+10FFFF and locale.localeconv() dumps the string
before decoding it.

Temporary hack to debug the issue #13441.
2011-11-21 14:31:41 +01:00
Victor Stinner 9e30aa52fd Fix misuse of PyUnicode_GET_SIZE() => PyUnicode_GET_LENGTH()
And PyUnicode_GetSize() => PyUnicode_GetLength()
2011-11-21 02:49:52 +01:00
Victor Stinner 4ead7c7be8 PyObject_Str() ensures that the result string is ready
and check the string consistency.

_PyUnicode_CheckConsistency() doesn't check the hash anymore. It should be
possible to call this function even if hash(str) was already called.
2011-11-20 19:48:36 +01:00
Victor Stinner b960b34577 PyUnicode_AsUTF32String() calls directly _PyUnicode_EncodeUTF32(),
instead of calling the deprecated PyUnicode_EncodeUTF32() function
2011-11-20 19:12:52 +01:00
Victor Stinner 77faf69ca1 _PyUnicode_CheckConsistency() also checks maxchar maximum value,
not only its minimum value
2011-11-20 18:56:05 +01:00
Victor Stinner d5c4022d2a Remove the two ugly and unused WRITE_ASCII_OR_WSTR and WRITE_WSTR macros 2011-11-20 18:41:31 +01:00
Victor Stinner 2e9cfadd7c Reuse surrogate macros in UTF-16 decoder 2011-11-20 18:40:27 +01:00
Victor Stinner ae4f7c8e59 charmap_encoding_error() uses the new Unicode API 2011-11-20 18:28:55 +01:00
Victor Stinner ac931b1e5b Use PyUnicode_EncodeCodePage() instead of PyUnicode_EncodeMBCS() with
PyUnicode_AsUnicodeAndSize()
2011-11-20 18:27:03 +01:00
Victor Stinner 22168998f5 charmap encoders use Py_UCS4, not Py_UNICODE 2011-11-20 17:09:18 +01:00
Victor Stinner 1f7951711c Catch PyUnicode_AS_UNICODE() errors 2011-11-17 00:45:54 +01:00
Ezio Melotti 11060a4a48 #13406: silence deprecation warnings in test_codecs. 2011-11-16 09:39:10 +02:00
Antoine Pitrou 78edf7576e Issue #13333: The UTF-7 decoder now accepts lone surrogates
(the encoder already accepts them).
2011-11-15 01:44:16 +01:00
Antoine Pitrou 5418ee0b9a Issue #13333: The UTF-7 decoder now accepts lone surrogates
(the encoder already accepts them).
2011-11-15 01:42:21 +01:00
Antoine Pitrou 31b92a534f Sanitize reference management in the utf-8 encoder 2011-11-12 18:35:19 +01:00
Antoine Pitrou 0290c7a811 Fix regression on 2-byte wchar_t systems (Windows) 2011-11-11 13:29:12 +01:00
Antoine Pitrou 44c6affc79 Avoid crashing because of an unaligned word access 2011-11-11 02:59:42 +01:00
Antoine Pitrou de20b0b50e Issue #13149: Speed up append-only StringIO objects.
This is very similar to the "lazy strings" idea.
2011-11-10 21:47:38 +01:00
Victor Stinner 9f4b1e9c50 Fix and deprecate the unicode_internal codec
unicode_internal codec uses Py_UNICODE instead of the real internal
representation (PEP 393: Py_UCS1, Py_UCS2 or Py_UCS4) for backward
compatibility.
2011-11-10 20:56:30 +01:00
Victor Stinner 24729f36bf Prefer Py_UCS4 or wchar_t over Py_UNICODE 2011-11-10 20:31:37 +01:00
Victor Stinner ebf3ba808e PyUnicode_DecodeCharmap() uses the new Unicode API 2011-11-10 20:30:22 +01:00
Victor Stinner a98b28c1bf Avoid PyUnicode_AS_UNICODE in the UTF-8 encoder 2011-11-10 20:21:49 +01:00
Victor Stinner 3326cb6a36 Fix "unicode_escape" encoder 2011-11-10 20:15:25 +01:00
Victor Stinner 0e36826a04 Fix UTF-7 encoder on Windows 2011-11-10 20:12:49 +01:00
Martin v. Löwis 1db7c13be1 Port encoders from Py_UNICODE API to unicode object API. 2011-11-10 18:24:32 +01:00
Victor Stinner 62aa4d086a Strip trailing spaces 2011-11-09 00:03:45 +01:00
Victor Stinner 0a045efb49 Fix a compiler warning: use unsigned for maxchar in unicode_widen() 2011-11-09 00:02:42 +01:00
Victor Stinner 596a6c4ffc Fix the code page decoder
* unicode_decode_call_errorhandler() now supports the PyUnicode_WCHAR_KIND
  kind
* unicode_decode_call_errorhandler() calls copy_characters() instead of
  PyUnicode_CopyCharacters()
2011-11-09 00:02:18 +01:00
Antoine Pitrou a8f63c02ef Fix missing goto 2011-11-08 18:37:16 +01:00
Martin v. Löwis d10759f6ed Make _PyUnicode_FromId return borrowed references.
http://mail.python.org/pipermail/python-dev/2011-November/114347.html
2011-11-07 13:00:05 +01:00
Martin v. Löwis e9b11c1cd8 Change decoders to use Unicode API instead of Py_UNICODE. 2011-11-08 17:35:34 +01:00
Victor Stinner e30c0a1014 Fix gdb/libpython.py for not ready Unicode strings
_PyUnicode_CheckConsistency() checks also hash and length value for not ready
Unicode strings.
2011-11-04 20:54:05 +01:00
Victor Stinner 2fc507fe45 Replace tabs by spaces 2011-11-04 20:06:39 +01:00
Martin v. Löwis 12be46ca84 Drop Py_UNICODE based encode exceptions. 2011-11-04 19:04:15 +01:00
Martin v. Löwis 3d325191bf Port code page codec to Unicode API. 2011-11-04 18:23:06 +01:00
Victor Stinner fcd9653667 Fix a compiler warning in unicode_encode_ucs1() 2011-11-04 00:28:50 +01:00
Victor Stinner fc026c98d8 Fix PyUnicode_EncodeCharmap() 2011-11-04 00:24:51 +01:00
Victor Stinner 7931d9a951 Replace PyUnicodeObject type by PyObject
* _PyUnicode_CheckConsistency() now takes a PyObject* instead of void*
* Remove now useless casts to PyObject*
2011-11-04 00:22:48 +01:00
Victor Stinner 76a31a6bff Cleanup decode_code_page_stateful() and encode_code_page()
* Fix decode_code_page_errors() result
* Inline decode_code_page() and encode_code_page_chunk()
* Replace the PyUnicodeObject type by PyObject
2011-11-04 00:05:13 +01:00
Victor Stinner 7581cef699 Adapt the code page encoder to the new unicode_encode_call_errorhandler()
The code is not correct, but at least it doesn't crash anymore.
2011-11-03 22:32:33 +01:00
Brian Curtin 2787ea41fd Fix a compile error (apparently Windows only) introduced in 295fdfd4f422 2011-11-02 15:09:37 -05:00
Martin v. Löwis 23e275b3ad Port UCS1 and charmap codecs to new API. 2011-11-02 18:02:51 +01:00
Martin v. Löwis 9e8166843c Introduce PyObject* API for raising encode errors. 2011-11-02 12:45:42 +01:00
Martin v. Löwis 0d3072e98d Drop Py_UCS4_ functions. Closes #13246. 2011-10-31 08:40:56 +01:00
Victor Stinner 57ffa9d4ff PyUnicode_AsUnicodeCopy() uses PyUnicode_AsUnicodeAndSize() to get directly the length 2011-10-23 20:10:08 +02:00
Victor Stinner af9e4b8c29 Fix PyUnicode_InternImmortal(): PyUnicode_InternInPlace() may change *p 2011-10-23 20:07:00 +02:00
Victor Stinner 9faa384bed Cast directly to unsigned char, instead of using Py_CHARMASK
We don't need "& 0xff" on an unsigned char.
2011-10-23 20:06:00 +02:00
Victor Stinner 9db1a8b69f Replace PyUnicodeObject* by PyObject* where it was irrelevant
A Unicode string can now be a PyASCIIObject, PyCompactUnicodeObject or
PyUnicodeObject. Aliasing a PyASCIIObject* or PyCompactUnicodeObject* to
PyUnicodeObject* is wrong
2011-10-23 20:04:37 +02:00
Victor Stinner 0d60e87ad6 Fix data variable in _PyUnicode_Dump() for compact ASCII 2011-10-23 19:47:19 +02:00
Victor Stinner d8e61c348e Remove last references to the removed Unicode free list 2011-10-23 19:43:33 +02:00
Victor Stinner 065836ec9c PyUnicode_FSDecoder() ensures that the decoded string is ready 2011-10-27 01:56:33 +02:00
Victor Stinner dd18d3ad9e Fix unicode_subtype_new() on debug build
Patch written by Stefan Behnel.
2011-10-22 11:08:10 +02:00