cpython

Commit Graph

Author	SHA1	Message	Date
Jeremy Hylton	3ce45389bd	Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h. And remove all the extern decls in the middle of .c files. Apparently, it was excluded from the header file because it is intended for internal use by the interpreter. It's still intended for internal use and documented as such in the header file.	2001-07-30 22:34:24 +00:00
Marc-André Lemburg	80d1dd5f3b	Fix for bug #444493 : u'\U00010001' segfaults with current CVS on wide builds.	2001-07-25 16:05:59 +00:00
Marc-André Lemburg	6c6bfb7c70	Make the unicode-escape and the UTF-16 codecs handle surrogates correctly and thus roundtrip-safe. Some minor cleanups of the code. Added tests for the roundtrip-safety.	2001-07-20 17:39:11 +00:00
Guido van Rossum	0d42e0c54a	#ifdef out generation of \U escapes unless Py_UNICODE_WIDE. This caused warnings with the VMS C compiler. (SF bug #442998, in part.) On a narrow system the current code should never be executed since ch will always be < 0x10000. Marc-Andre: you may end up fixing this a different way, since I believe you have plans to generate \U for surrogate pairs. I'll leave that to you.	2001-07-20 16:36:21 +00:00
Fredrik Lundh	8f4558583f	use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE tests.	2001-06-27 18:59:43 +00:00
Martin v. Löwis	ce9b5a55e1	Encode surrogates in UTF-8 even for a wide Py_UNICODE. Implement sys.maxunicode. Explicitly wrap around upper/lower computations for wide Py_UNICODE. When decoding large characters with UTF-8, represent expected test results using the \U notation.	2001-06-27 06:28:56 +00:00
Martin v. Löwis	ac93bc2501	When decoding UTF-16, don't assume that the buffer is in native endianness when checking surrogates.	2001-06-26 22:43:40 +00:00
Martin v. Löwis	0ba70cc3c8	Support using UCS-4 as the Py_UNICODE type: Add configure option --enable-unicode. Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE, SIZEOF_WCHAR_T. Define Py_UCS2. Encode and decode large UTF-8 characters into single Py_UNICODE values for wide Unicode types; likewise for UTF-16. Remove test whether sizeof Py_UNICODE is two.	2001-06-26 22:22:37 +00:00
Fredrik Lundh	1294ad0c59	experimental UCS-4 support: added USE_UCS4_STORAGE define to unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4). (this may be good enough for platforms that don't have a 16-bit type. the UTF-16 codecs don't work, though)	2001-06-26 17:17:07 +00:00
Fredrik Lundh	45714e9ecb	experimental UCS-4 support: made compare a bit more robust, in case sizeof(Py_UNICODE) >= sizeof(long). also changed surrogate expansion to work if sizeof(Py_UNICODE) > 2.	2001-06-26 16:39:36 +00:00
Fredrik Lundh	3083163dc1	experimental UCS-4 support: don't assume that MS_WIN32 implies HAVE_USABLE_WCHAR_T	2001-06-26 15:11:00 +00:00
Guido van Rossum	ad98db1d9e	Fix a mis-indentation in _PyUnicode_New() that caused me to stare at some code for longer than needed.	2001-06-14 17:52:02 +00:00
Marc-André Lemburg	8879a33613	Fixes [ #430986 ] Buglet in PyUnicode_FromUnicode.	2001-06-07 12:26:56 +00:00
Jeremy Hylton	9cea41c195	fix bogus indentation	2001-05-29 17:13:15 +00:00
Marc-André Lemburg	489b56e044	This patch changes the behaviour of the UTF-16 codec family. Only the UTF-16 codec will now interpret and remove a leading BOM mark. Subsequent BOM characters are no longer interpreted and removed. UTF-16-LE and -BE pass through all BOM mark characters. These changes should get the UTF-16 codec more in line with what the Unicode FAQ recommends w/r to BOM marks.	2001-05-21 20:30:15 +00:00
Jeremy Hylton	d37292bb8d	Remove unused variable	2001-05-08 04:00:45 +00:00
Tim Peters	2cfe368283	Make unicode.join() work nice with iterators. This also required a change to string.join(), so that when the latter figures out in midstream that it really needs unicode.join() instead, unicode.join() can actually get all the sequence elements (i.e., there's no guarantee that the sequence passed to string.join() can be iterated over again by unicode.join(), so string.join() must not pass on the original sequence object anymore).	2001-05-05 05:36:48 +00:00
Tim Peters	b3d8d1f76c	A different approach to the problem reported in Patch #419651: Metrowerks on Mac adds 0x itself. C std says %#x and %#X conversion of 0 do not add the 0x/0X base marker. Metrowerks apparently does. Mark Favas reported the same bug under a Compaq compiler on Tru64 Unix, but no other libc broken in this respect is known (known to be OK under MSVC and gcc). So just try the damn thing at runtime and see what the platform does. Note that we've always had bugs here, but never knew it before because a relevant test case didn't exist before 2.1.	2001-04-28 05:38:26 +00:00
Marc-André Lemburg	8155e0e541	This patch originated from an idea by Martin v. Loewis who submitted a patch for sharing single character Unicode objects. Martin's patch had to be reworked in a number of ways to take Unicode resizing into consideration as well. Here's what the updated patch implements: * Single character Unicode strings in the Latin-1 range are shared (not only ASCII chars as in Martin's original patch). * The ASCII and Latin-1 codecs make use of this optimization, providing a noticeable speedup for single character strings. Most Unicode methods can use the optimization as well (by virtue of using PyUnicode_FromUnicode()). * Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY) * The PyUnicode_Resize() can now also handle the case of resizing unicode_empty which previously resulted in an error. * Modified the internal API _PyUnicode_Resize() and the public PyUnicode_Resize() API to handle references to shared objects correctly. The _PyUnicode_Resize() signature changed due to this. * Callers of PyUnicode_FromUnicode() may now only modify the Unicode object contents of the returned object in case they called the API with NULL as content template. Note that even though this patch passes the regression tests, there may still be subtle bugs in the sharing code.	2001-04-23 14:44:21 +00:00
Tim Peters	cf96de052f	SF bug #417587 : compiler warnings compiling 2.1. Repaired some of the SGI compiler warnings Sjoerd Mullender reported.	2001-04-21 02:46:11 +00:00
Tim Peters	78fe5308b4	CVS patch 416248: 2.1c1 unicodeobject: unused vrbl cleanup, from Mark Favas.	2001-04-19 21:55:14 +00:00
Jeremy Hylton	b8a93215c2	Revert previous checkin, which caused test_unicodedata to fail.	2001-04-19 16:43:49 +00:00
Martin v. Löwis	da3dc5b892	Patch #416953 : Cache ASCII characters to speed up ASCII decoding.	2001-04-18 12:49:15 +00:00
Tim Peters	fff5325078	Bug 415514 reported that e.g. "%#x" % 0 blew up, at heart because C sprintf supplies a base marker if and only if the value is not 0. I then fixed that, by tolerating C's inconsistency when it does %#x, and taking away that Python produced 0x0 when formatting 0L (the "long" flavor of 0) under %#x itself. But after talking with Guido, we agreed it would be better to supply 0x for the short int case too, despite that it's inconsistent with C, because C is inconsistent with itself and with Python's hex(0) (plus, while "%#x" % 0 didn't work before, "%#x" % 0L did, and returned "0x0"). Similarly for %#X conversion.	2001-04-12 18:38:48 +00:00
Tim Peters	711088d9b8	Fix for SF bug #415514 : "%#x" % 0 caused assertion failure/abort. http://sourceforge.net/tracker/index.php?func=detail&aid=415514&group_id=5470&atid=105470 For short ints, Python defers to the platform C library to figure out what %#x should do. The code asserted that the platform C returned a string beginning with "0x". However, that's not true when-- and only when --the value being formatted is 0. Changed the code to live with C's inconsistency here. In the meantime, the problem does not arise if you format a long 0 (0L) instead. However, that's because the code we wrote to do %#x conversions on longs produces a leading "0x" regardless of value. That's probably wrong too: we should drop leading "0x", for consistency with C, when (& only when) formatting 0L. So I changed the long formatting code to do that too.	2001-04-12 00:35:51 +00:00
Fredrik Lundh	ccc7473fc8	reorganized PyUnicode_DecodeUnicodeEscape a bit (in order to make it less likely that bug #132817 ever appears again)	2001-02-18 22:13:49 +00:00
Marc-André Lemburg	fde66e1bcc	Fixed .capitalize() method of Unicode objects to work like the corresponding string method. Added tests for this too. Patch written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.	2001-01-29 11:14:16 +00:00
Ka-Ping Yee	fa004ad36c	Show '\011', '\012', and '\015' as '\t', '\n', '\r' in strings. Switch from octal escapes to hex escapes for other nonprintable characters.	2001-01-24 17:19:08 +00:00
Fredrik Lundh	06d126803c	Move ucnhash functionality into unicodedata (after the recent crop of changes, the files are small enough to do this). Also adds "name" and "lookup" functions to unicodedata.	2001-01-24 07:59:11 +00:00
Fredrik Lundh	f60560626c	Better error message if ucnhash cannot be found (obscure attribute errors aren't that helpful), or doesn't contain what's expected from it. Also tweaked the test script so it compiles even if ucnhash is missing.	2001-01-20 11:15:25 +00:00
Fredrik Lundh	0fdb90cafe	refactored the unicodeobject/ucnhash interface, to hide the implementation details inside the ucnhash module. also cleaned up the unicode copyright blurb a little; Secret Labs' internal revision history isn't that interesting...	2001-01-19 09:45:02 +00:00
Marc-André Lemburg	ad7c98e264	This patch adds a new builtin unistr() which behaves like str() except that it always returns Unicode objects. A new C API PyObject_Unicode() is also provided. This closes patch #101664. Written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.	2001-01-17 17:09:53 +00:00
Marc-André Lemburg	3a645e4dd4	Added checks to prevent PyUnicode_Count() from dumping core in case the parameters are out of bounds and fixes error handling for .count(), .startswith() and .endswith() for the case of mixed string/Unicode objects. This patch adds Python style index semantics to PyUnicode_Count() indices (including the special handling of negative indices). The patch is an extended version of patch #103249 submitted by Michael Hudson (mwh) on SF. It also includes new test cases.	2001-01-16 11:54:12 +00:00
Marc-André Lemburg	ec233e5803	This patch adds a new feature to the builtin charmap codec: The mapping dictionaries can now contain 1-n mappings, meaning that character ordinals may be mapped to strings or Unicode objects, e.g. 0x0078 ('x') -> u"abc", causing the ordinal to be replaced by the complete string or Unicode object instead of just one character. Another feature introduced by the patch is that of mapping ordinals to the empty string. This allows removing characters. The patch is different from patch #103100 in that it does not cause a performance hit for the normal use case of 1-1 mappings. Written by Marc-Andre Lemburg, copyright assigned to Guido van Rossum.	2001-01-06 14:59:58 +00:00
Marc-André Lemburg	a866df806d	This patch changes the default behaviour of the builtin charmap codec to not apply Latin-1 mappings for keys which are not found in the mapping dictionaries, but instead treat them as undefined mappings. The patch was originally written by Martin v. Loewis with some additional (cosmetic) changes and an updated test script by Marc-Andre Lemburg. The standard codecs were recreated from the most current files available at the Unicode.org site using the Tools/scripts/gencodec.py tool. This patch closes the bugs #116285 and #119960.	2001-01-03 21:29:14 +00:00
Andrew M. Kuchling	f947ffe951	Patch #102940 : use only printable Unicode chars in reporting incorrect % characters; characters outside the printable range are replaced with '?'	2000-12-19 22:49:06 +00:00
Guido van Rossum	cda4f9a8dc	Fix off-by-one error in split_substring(). Fixes SF bug #122162 .	2000-12-19 02:23:19 +00:00
Andrew M. Kuchling	6ca8917758	[ Patch #102852 ] Make % error a bit more informative by indicating the index at which an unknown %-escape was found	2000-12-15 13:07:46 +00:00
Tim Peters	a3a3a030af	Fix for SF bug #123859 : %[duxXo] long formats inconsistent.	2000-11-30 05:22:44 +00:00
Barry Warsaw	5b4c22806f	_PyUnicode_Fini(): Initialize the local freelist walking variable `u' after unicode_empty has been freed, otherwise it might not point to the real start of the unicode_freelist. Final closure for SF bug #110681, Jitterbug PR#398.	2000-10-03 20:45:26 +00:00
Guido van Rossum	4ae8ef84da	In _PyUnicode_Fini(), decref unicode_empty before tearing down the free list. Discovered by Barry, fix approved by MAL.	2000-10-03 18:09:04 +00:00
Fred Drake	d5fadf75e4	Rationalize use of limits.h, moving the inclusion to Python.h. Add definitions of INT_MAX and LONG_MAX to pyport.h. Remove includes of limits.h and conditional definitions of INT_MAX and LONG_MAX elsewhere. This closes SourceForge patch #101659 and bug #115323.	2000-09-26 05:46:01 +00:00
Tim Peters	38fd5b6413	Derived from Martin's SF patch 110609: support unbounded ints in %d,i,u,x,X,o formats. Note a curious extension to the std C rules: x, X and o formatting can never produce a sign character in C, so the '+' and ' ' flags are meaningless for them. But unbounded ints can produce a sign character under these conversions (no fixed-width bitstring is wide enough to hold all negative values in 2's-comp form). So these flags become meaningful in Python when formatting a Python long which is too big to fit in a C long. This required shuffling around existing code, which hacked x and X conversions to death when both the '#' and '0' flags were specified: the hacks weren't strong enough to deal with the simultaneous possibility of the ' ' or '+' flags too, since signs were always meaningless before for x and X conversions. Isomorphic shuffling was required in unicodeobject.c. Also added dozens of non-trivial new unbounded-int test cases to test_format.py.	2000-09-21 05:43:11 +00:00
Tim Peters	8f422461b4	Fix for bug 113934. stringn and unicoden did no overflow checking at all, either to see whether the # of chars fit in an int, or that the amount of memory needed fit in a size_t. Checking these is expensive, but the alternative is silently wrong answers (as in the bug report) or core dumps (which were easy to provoke using Unicode strings).	2000-09-09 06:13:41 +00:00
Fredrik Lundh	df84675f93	changed \x to consume exactly two hex digits, also for unicode strings. closes PEP-223. also added \U escape (eight hex digits).	2000-09-03 11:29:49 +00:00
Barry Warsaw	ce4dc41b1a	PyUnicode_AsUTF8String(): /F picks up what I missed: the local var `str' is no longer necessary. Gotta turn on -Wall!	2000-08-18 19:30:40 +00:00
Barry Warsaw	2dd4abf277	PyUnicode_AsUTF8String(): Don't need to explicitly incref str since PyUnicode_EncodeUTF8() already returns the created object with the proper reference count. This fixes an Insure reported memory leak.	2000-08-18 06:58:15 +00:00
Marc-André Lemburg	b7520774e2	Fixed a couple of instances where a 0-length string was being resized after creation. 0-length strings are usually shared and _PyString_Resize() fails on these shared strings. Fixes [ Bug #111667 ] unicode core dump.	2000-08-14 11:29:19 +00:00
Trent Mick	20abf573ef	Clean up warning from Monterey compiler. Properly end a comment block. It was terminated fine later but by a subsequent block and. It was also in #if 0. This patch is so trivial I can't believe I am talking about it. :)	2000-08-12 22:14:34 +00:00
Marc-André Lemburg	e5034378cc	Removing UTF-16 aware Unicode comparison code. This kind of compare function (together with other locale aware ones) should go into a new collation support module. See python-dev for a discussion of this removal. Note: This patch should also be applied to the 1.6 branch.	2000-08-08 08:04:29 +00:00

105 Commits