cpython

Commit Graph

Author	SHA1	Message	Date
Marc-André Lemburg	040f76b79c	Slightly revised version of patch #1538956: Replace UnicodeDecodeErrors raised during == and != compares of Unicode and other objects with a new UnicodeWarning. All other comparisons continue to raise exceptions. Exceptions other than UnicodeDecodeErrors are also left untouched.	2006-08-14 10:55:19 +00:00
Martin v. Löwis	d825143be1	Patch #1455898: Incremental mode for "mbcs" codec.	2006-06-14 05:21:04 +00:00
Martin v. Löwis	3f767795f6	Patch #1359618: Speed-up charmap encoder.	2006-06-04 19:36:28 +00:00
Fredrik Lundh	80f8e80c15	needforspeed: added Py_MEMCPY macro (currently tuned for Visual C only), and use it for string copy operations. this gives a 20% speedup on some string benchmarks.	2006-05-28 12:06:46 +00:00
Fredrik Lundh	b3167cbcd7	needforspeed: added rpartition implementation	2006-05-26 18:15:38 +00:00
Fredrik Lundh	06a69dd8ff	needforspeed: partition implementation, part two. feel free to improve the documentation and the docstrings.	2006-05-26 08:54:28 +00:00
Fredrik Lundh	3d885e0195	needforspeed: check first and last character before doing a full memcmp	2006-05-23 10:10:57 +00:00
Fredrik Lundh	8a8e05a2b9	needforspeed: use memcpy for "long" strings; use a better algorithm for long repeats.	2006-05-22 17:12:58 +00:00
Fredrik Lundh	f1d60a5384	needforspeed: speed up unicode repeat, unicode string copy	2006-05-22 16:29:30 +00:00
Martin v. Löwis	18e165558b	Merge ssize_t branch.	2006-02-15 17:27:45 +00:00
Tim Peters	2576c97f52	_PyUnicode_IsWhitespace(), _PyUnicode_IsLinebreak(): Changed the declarations to match the definitions. Don't know why they differed; MSVC warned about it; don't know why only these two functions use "const". Someone who does may want to do something saner ;-).	2005-10-29 02:33:18 +00:00
Walter Dörwald	a47d1c08d0	SF bug #1251300: On UCS-4 builds the "unicode-internal" codec will now complain about illegal code points. The codec now supports PEP 293 style error handlers. (This is a variant of the Nik Haldimann's patch that detects truncated data)	2005-08-30 10:23:14 +00:00
Marc-André Lemburg	a9cadcd41b	Correct the handling of 0-termination of PyUnicode_AsWideChar() and its usage in PyLocale_strcoll(). Clarify the documentation on this. Thanks to Andreas Degert for pointing this out.	2004-11-22 13:02:31 +00:00
Raymond Hettinger	57341c37c9	SF patch #1056231: typo in comment (unicodeobject.h)	2004-10-31 05:46:59 +00:00
Walter Dörwald	69652035bc	SF patch #998993: The UTF-8 and the UTF-16 stateful decoders now support decoding incomplete input (when the input stream is temporarily exhausted). codecs.StreamReader now implements buffering, which enables proper readline support for the UTF-16 decoders. codecs.StreamReader.read() has a new argument chars which specifies the number of characters to return. codecs.StreamReader.readline() and codecs.StreamReader.readlines() have a new argument keepends. Trailing "\n"s will be stripped from the lines if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and PyUnicode_DecodeUTF16Stateful.	2004-09-07 20:24:22 +00:00
Hye-Shik Chang	e9ddfbb412	SF #989185: Drop unicode.iswide() and unicode.width() and add unicodedata.east_asian_width(). You can still implement your own simple width() function using it like this: def width(u): w = 0 for c in unicodedata.normalize('NFC', u): cwidth = unicodedata.east_asian_width(c) if cwidth in ('W', 'F'): w += 2 else: w += 1 return w	2004-08-04 07:38:35 +00:00
Marc-André Lemburg	d2d4598ec2	Allow string and unicode return types from .encode()/.decode() methods on string and unicode objects. Added unicode.decode() which was missing for no apparent reason.	2004-07-08 17:57:32 +00:00
Hye-Shik Chang	974ed7cfa5	- SF #962502: Add two more methods for unicode type; width() and iswide() for east asian width manipulation. (Inspired by David Goodger, Reviewed by Martin v. Loewis) - Move _PyUnicode_TypeRecord.flags to the end of the struct so that no padding is added for UCS-4 builds. (Suggested by Martin v. Loewis)	2004-06-02 16:49:17 +00:00
Hye-Shik Chang	3ae811b57d	Add rsplit method for str and unicode builtin types. SF feature request #801847. Original patch is written by Sean Reifschneider.	2003-12-15 18:49:53 +00:00
Marc-André Lemburg	9c329de47e	Add name mangling for new PyUnicode_FromOrdinal() and fix declaration to use new extern macro.	2002-08-12 08:19:10 +00:00
Mark Hammond	91a681debf	Excise DL_EXPORT from Include. Thanks to Skip Montanaro and Kalle Svensson for the patches.	2002-08-12 07:21:58 +00:00
Marc-André Lemburg	cc8764ca9d	Add C API PyUnicode_FromOrdinal() which exposes unichr() at C level. u'%c' will now raise a ValueError in case the argument is an integer outside the valid range of Unicode code point ordinals. Closes SF bug #593581.	2002-08-11 12:23:04 +00:00
Marc-André Lemburg	4da6fd63bc	Fix for bug [ 561796 ] string.find causes lazy error	2002-05-29 11:33:13 +00:00
Walter Dörwald	de02bcb265	Apply patch diff.txt from SF feature request http://www.python.org/sf/444708 This adds the optional argument for str.strip to unicode.strip too and makes it possible to call str.strip with a unicode argument and unicode.strip with a str argument.	2002-04-22 17:42:37 +00:00
Guido van Rossum	b8c65bc27f	SF patch #470578 : Fixes to synchronize unicode() and str() This patch implements what we have discussed on python-dev late in September: str(obj) and unicode(obj) should behave similar, while the old behaviour is retained for unicode(obj, encoding, errors). The patch also adds a new feature with which objects can provide unicode(obj) with input data: the __unicode__ method. Currently no new tp_unicode slot is implemented; this is left as option for the future. Note that PyUnicode_FromEncodedObject() no longer accepts Unicode objects as input. The API name already suggests that Unicode objects do not belong in the list of acceptable objects and the functionality was only needed because PyUnicode_FromEncodedObject() was being used directly by unicode(). The latter was changed in the discussed way: * unicode(obj) calls PyObject_Unicode() * unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject() One thing left open to discussion is whether to leave the PyUnicode_FromObject() API as a thin API extension on top of PyUnicode_FromEncodedObject() or to turn it into a (macro) alias for PyObject_Unicode() and deprecate it. Doing so would have some surprising consequences though, e.g. u"abc" + 123 would turn out as u"abc123"... [Marc-Andre didn't have time to check this in before the deadline. I hope this is OK, Marc-Andre! You can still make changes and commit them on the trunk after the branch has been made, but then please mail Barry a context diff if you want the change to be merged into the 2.2b1 release branch. GvR]	2001-10-19 02:01:31 +00:00
Marc-André Lemburg	c60e6f7771	Patch #435971: UTF-7 codec by Brian Quinlan.	2001-09-20 10:35:46 +00:00
Marc-André Lemburg	5e6007c5db	Fix for bug #462737.	2001-09-19 11:21:03 +00:00
Tim Peters	78e0fc74bc	Possibly the end of SF [#460020] bug or feature: unicode() and subclasses. Changed unicode(i) to return a true Unicode object when i is an instance of a unicode subclass. Added PyUnicode_CheckExact macro.	2001-09-11 03:07:38 +00:00
Guido van Rossum	5eef77a21b	Make the Py<type>_Check() macro use PyObject_TypeCheck().	2001-08-30 03:08:07 +00:00
Martin v. Löwis	339d0f720e	Patch #445762: Support --disable-unicode - Do not compile unicodeobject, unicodectype, and unicodedata if Unicode is disabled - check for Py_USING_UNICODE in all places that use Unicode functions - disables unicode literals, and the builtin functions - add the types.StringTypes list - remove Unicode literals from most tests.	2001-08-17 18:39:25 +00:00
Tim Peters	772747b3f1	SF patch #438013 Remove 2-byte Py_UCS2 assumptions Removed all instances of Py_UCS2 from the codebase, and so also (I hope) the last remaining reliance on the platform having an integral type with exactly 16 bits. PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write one byte at a time.	2001-08-09 22:21:55 +00:00
Marc-André Lemburg	b5ac6f62c7	As discussed on python-dev: this patch adds name mangling to assure that extensions and interpreters using the Unicode APIs were compiled using the same Unicode width.	2001-07-31 14:30:16 +00:00
Jeremy Hylton	3ce45389bd	Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h. And remove all the extern decls in the middle of .c files. Apparently, it was excluded from the header file because it is intended for internal use by the interpreter. It's still intended for internal use and documented as such in the header file.	2001-07-30 22:34:24 +00:00
Fredrik Lundh	72b068566a	removed "register const" from scalar arguments to the unicode predicates	2001-06-27 22:08:26 +00:00
Fredrik Lundh	8f4558583f	use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE tests.	2001-06-27 18:59:43 +00:00
Martin v. Löwis	ce9b5a55e1	Encode surrogates in UTF-8 even for a wide Py_UNICODE. Implement sys.maxunicode. Explicitly wrap around upper/lower computations for wide Py_UNICODE. When decoding large characters with UTF-8, represent expected test results using the \U notation.	2001-06-27 06:28:56 +00:00
Fredrik Lundh	9b14ab367a	Make Unicode work a bit better on Windows...	2001-06-26 22:59:49 +00:00
Martin v. Löwis	0ba70cc3c8	Support using UCS-4 as the Py_UNICODE type: Add configure option --enable-unicode. Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE, SIZEOF_WCHAR_T. Define Py_UCS2. Encode and decode large UTF-8 characters into single Py_UNICODE values for wide Unicode types; likewise for UTF-16. Remove test whether sizeof Py_UNICODE is two.	2001-06-26 22:22:37 +00:00
Fredrik Lundh	1294ad0c59	experimental UCS-4 support: added USE_UCS4_STORAGE define to unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4). (this may be good enough for platforms that don't have a 16-bit type. the UTF-16 codecs don't work, though)	2001-06-26 17:17:07 +00:00
Marc-André Lemburg	489b56e044	This patch changes the behaviour of the UTF-16 codec family. Only the UTF-16 codec will now interpret and remove a leading BOM mark. Subsequent BOM characters are no longer interpreted and removed. UTF-16-LE and -BE pass through all BOM mark characters. These changes should get the UTF-16 codec more in line with what the Unicode FAQ recommends w/r to BOM marks.	2001-05-21 20:30:15 +00:00
Marc-André Lemburg	8155e0e541	This patch originated from an idea by Martin v. Loewis who submitted a patch for sharing single character Unicode objects. Martin's patch had to be reworked in a number of ways to take Unicode resizing into consideration as well. Here's what the updated patch implements: * Single character Unicode strings in the Latin-1 range are shared (not only ASCII chars as in Martin's original patch). * The ASCII and Latin-1 codecs make use of this optimization, providing a noticeable speedup for single character strings. Most Unicode methods can use the optimization as well (by virtue of using PyUnicode_FromUnicode()). * Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY) * The PyUnicode_Resize() can now also handle the case of resizing unicode_empty which previously resulted in an error. * Modified the internal API _PyUnicode_Resize() and the public PyUnicode_Resize() API to handle references to shared objects correctly. The _PyUnicode_Resize() signature changed due to this. * Callers of PyUnicode_FromUnicode() may now only modify the Unicode object contents of the returned object in case they called the API with NULL as content template. Note that even though this patch passes the regression tests, there may still be subtle bugs in the sharing code.	2001-04-23 14:44:21 +00:00
Marc-André Lemburg	1a731c60a3	Added #ifndef's to avoid compiler errors.	2000-08-11 11:43:10 +00:00
Marc-André Lemburg	bff879cabb	This patch finalizes the move from UTF-8 to a default encoding in the Python Unicode implementation. The internal buffer used for implementing the buffer protocol is renamed to defenc to make this change visible. It now holds the default encoded version of the Unicode object and is calculated on demand (NULL otherwise). Since the default encoding defaults to ASCII, this will mean that Unicode objects which hold non-ASCII characters will no longer work on C APIs using the "s" or "t" parser markers. C APIs must now explicitly provide Unicode support via the "u", "U" or "es"/"es#" parser markers in order to work with non-ASCII Unicode strings. (Note: this patch will also have to be applied to the 1.6 branch of the CVS tree.)	2000-08-03 18:46:08 +00:00
Guido van Rossum	16b1ad9c7d	Changing the CNRI copyright notice according to CNRI's instructions. This is a notice without a date, which apparently is not a claim to copyright but only advice to the reader. IANAL. :-)	2000-08-03 16:24:25 +00:00
Thomas Wouters	5f37591a16	ANSIfications: fix empty arglists, and remove the checks for 'HAVE_STDARG_PROTOTYPES' (consider it true, remove false branch)	2000-07-22 23:30:03 +00:00
Thomas Wouters	7e47402264	Spelling fixes supplied by Rob W. W. Hooft. All these are fixes in either comments, docstrings or error messages. I fixed two minor things in test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't"). There is a minor style issue involved: Guido seems to have preferred English grammar (behaviour, honour) in a couple places. This patch changes that to American, which is the more prominent style in the source. I prefer English myself, so if English is preferred, I'd be happy to supply a patch myself ;)	2000-07-16 12:04:32 +00:00
Marc-André Lemburg	5a5c81a0e9	Added new API PyUnicode_FromEncodedObject() which supports decoding objects including instance objects. The old API PyUnicode_FromObject() is still available as shortcut.	2000-07-07 13:46:42 +00:00
Marc-André Lemburg	43279100f4	Bill Tutt: Added Py_UCS4 typedef to hold UCS4 values (these need at least 32 bits as opposed to Py_UNICODE which rely on having 16 bits).	2000-07-07 09:01:41 +00:00
Marc-André Lemburg	f03e74126e	Modified the ISALPHA and ISALNUM macros to use the new lookup APIs from unicodectype.c	2000-07-05 09:45:59 +00:00
Marc-André Lemburg	a9c103bc09	Added new Py_UNICODE_ISALPHA() and Py_UNICODE_ISALNUM() macros which are true for alphabetic and alphanumeric characters resp. The macros are currently implemented using the existing is* tables but will have to be updated to meet the Unicode standard definitions (add tables for non-cased letters and letter modifiers).	2000-07-03 10:52:13 +00:00
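
The message of commit e9ddfbb412 (2004-08-04) above carries a small width() helper whose line breaks were lost in this listing. A laid-out version is sketched below. The body follows the commit message; running it on a Python 3 `str` rather than the Python 2 `unicode` type of the time is an assumption of this sketch, as are the comments and the sample strings.

```python
import unicodedata


def width(u):
    """Approximate display width of a string in terminal columns."""
    w = 0
    # Normalize first so decomposed sequences are composed where possible.
    for c in unicodedata.normalize('NFC', u):
        cwidth = unicodedata.east_asian_width(c)
        if cwidth in ('W', 'F'):
            # Wide and Fullwidth characters take two columns.
            w += 2
        else:
            # Everything else is counted as one column.
            w += 1
    return w


print(width("abc"))      # three narrow characters
print(width("日本語"))    # three wide characters
```

As the commit message says, this is a simple width function: it counts every non-wide code point as one column, including combining marks that NFC leaves uncomposed.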


60 Commits
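
The BOM handling described in commit 489b56e044 (2001-05-21) can be illustrated with a short sketch. It assumes a current Python 3 interpreter, where the same rules can still be observed, rather than the 2001 codebase; the variable names and sample data are hypothetical.

```python
import codecs

# A little-endian BOM followed by "hi" encoded as UTF-16-LE.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic "utf-16" codec interprets and removes the leading BOM.
generic = data.decode("utf-16")

# "utf-16-le" passes the BOM through as the character U+FEFF.
explicit = data.decode("utf-16-le")

# Only the first BOM is consumed; a second one stays in the text.
doubled = (codecs.BOM_UTF16_LE + data).decode("utf-16")

print(repr(generic), repr(explicit), repr(doubled))
```

In practice this means that code which already knows the byte order and uses `utf-16-le` or `utf-16-be` has to strip a leading U+FEFF itself if the input may carry one.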