cpython

Commit Graph

Author	SHA1	Message	Date
Anthony Baxter	a62862120d	More low-hanging fruit. Still need to re-arrange some code (or find a better solution) in the same way as listobject.c got changed. Hoping for a better solution.	2006-04-11 07:42:36 +00:00
Georg Brandl	ecdc0a9f46	That one was a mistake.	2006-03-30 12:19:07 +00:00
Georg Brandl	347b30042b	Remove unnecessary casts in type object initializers.	2006-03-30 11:57:00 +00:00
Thomas Wouters	a96affe1fc	- Reindent a confusingly indented piece of code (no intended code changes there) - Add missing DECREFs of inner-scope 'temp' variable - Add various missing DECREFs by changing 'return NULL' into 'goto onError' - Avoid double DECREF when last _PyUnicode_Resize() fails Coverity found one of the missing DECREFs, but oddly enough not the others.	2006-03-12 00:29:36 +00:00
Martin v. Löwis	480f1bb67b	Update Unicode database to Unicode 4.1.	2006-03-09 23:38:20 +00:00
Guido van Rossum	38fff8c4e4	Checking in the code for PEP 357. This was mostly written by Travis Oliphant. I've inspected it all; Neal Norwitz and MvL have also looked at it (in an earlier incarnation).	2006-03-07 18:50:55 +00:00
Hye-Shik Chang	4af5c8cee4	SF #1444030 : Fix several potential defects found by Coverity. (reviewed by Neal Norwitz)	2006-03-07 15:39:21 +00:00
Martin v. Löwis	15e62742fa	Revert backwards-incompatible const changes.	2006-02-27 16:46:16 +00:00
Thomas Wouters	de01774dae	Use correct PyArg_Parse format char for Py_ssize_t in unicode.center(). Fixes: >>> u"".center(10) Traceback (most recent call last): File "<stdin>", line 1, in <module> MemoryError on 64-bit systems.	2006-02-16 19:34:37 +00:00
Martin v. Löwis	eb079f1c25	Use Py_ssize_t for counts and sizes. Convert Py_ssize_t using PyInt_FromSsize_t	2006-02-16 14:32:27 +00:00
Martin v. Löwis	2c95cc6d72	Support %zd in PyErr_Format and PyString_FromFormat.	2006-02-16 06:54:25 +00:00
Tim Peters	15231548d2	doubletounicode(), longtounicode(): Py_SAFE_DOWNCAST can evaluate its first argument multiple times in a debug build. This caused two distinct assert-failures in test_unicode run under a debug build. Rewrote the code in trivial ways so that multiple evaluation of the first argument doesn't hurt.	2006-02-16 01:08:01 +00:00
Thomas Wouters	4701af5bf5	Remove two unused Py_ssize_t variables (merge glitches, looks like.)	2006-02-15 23:10:32 +00:00
Martin v. Löwis	18e165558b	Merge ssize_t branch.	2006-02-15 17:27:45 +00:00
Neal Norwitz	fc76d633e8	- Patch #1400181 , fix unicode string formatting to not use the locale. This is how string objects work. u'%f' could use , instead of . for the decimal point. Now both strings and unicode always use periods. This is the code that would break: import locale locale.setlocale(locale.LC_NUMERIC, 'de_DE') u'%.1f' % 1.0 assert '1.0' == u'%.1f' % 1.0 I couldn't create a test case which fails, but this fixes the problem. Will backport.	2006-01-10 06:03:13 +00:00
Neal Norwitz	d43069ce95	Fix icc warnings: remove (sometimes) unused variable conditionally	2006-01-08 01:12:10 +00:00
Martin v. Löwis	dea59e5755	Stop maintaining the buildno file. Also, stop determining Unicode sizes with PyString_GET_SIZE.	2006-01-05 10:00:36 +00:00
Hye-Shik Chang	835b243c71	Bug #1379994 : Fix *unicode_escape codecs to encode r'\' as r'\\' just like string codecs.	2005-12-17 04:38:31 +00:00
Jeremy Hylton	af68c874a6	Add const to several API functions that take char *. In C++, it's an error to pass a string literal to a char* function without a const_cast(). Rather than require every C++ extension module to put a cast around string literals, fix the API to state the const-ness. I focused on parts of the API where people usually pass literals: PyArg_ParseTuple() and friends, Py_BuildValue(), PyMethodDef, the type slots, etc. Predictably, there were a large set of functions that needed to be fixed as a result of these changes. The most pervasive change was to make the keyword args list passed to PyArg_ParseTupleAndKeywords() to be a const char *kwlist[]. One cast was required as a result of the changes: A type object mallocs the memory for its tp_doc slot and later frees it. PyTypeObject says that tp_doc is const char *; but if the type was created by type_new(), we know it is safe to cast to char *.	2005-12-10 18:50:16 +00:00
Walter Dörwald	d4fff1731c	Fix leaked reference to None.	2005-11-28 22:15:56 +00:00
Andrew M. Kuchling	8294de5673	Another comment typo fix	2005-11-02 16:36:12 +00:00
Walter Dörwald	2e2c02fedb	Fix typo in comment.	2005-11-02 08:57:11 +00:00
Fred Drake	db390c1ad8	fix typos, mostly in comments	2005-10-28 14:39:47 +00:00
Michael W. Hudson	b2308bb9be	Fix bug: [ 1327110 ] wrong TypeError traceback in generator expressions by removing the code that can stomp on the users' TypeError raised by the iterable argument to ''.join() -- PySequence_Fast (now?) gives a perfectly reasonable message itself. Also, a couple of tests.	2005-10-21 11:45:01 +00:00
Marc-André Lemburg	5c4a9d6591	Whitespace corrections.	2005-10-19 22:39:02 +00:00
Marc-André Lemburg	e115ec832c	Bug fix for [ 1331062 ] utf 7 codec broken. Backport candidate.	2005-10-19 22:33:31 +00:00
Walter Dörwald	d1c1e10f70	Part of SF patch #1313939 : Speedup charmap decoding by extending PyUnicode_DecodeCharmap() to accept a unicode string as the mapping argument which is used as a mapping table. This code isn't used by any of the codecs yet.	2005-10-06 20:29:57 +00:00
Walter Dörwald	a47d1c08d0	SF bug #1251300 : On UCS-4 builds the "unicode-internal" codec will now complain about illegal code points. The codec now supports PEP 293 style error handlers. (This is a variant of the Nik Haldimann's patch that detects truncated data)	2005-08-30 10:23:14 +00:00
Marc-André Lemburg	a9cadcd41b	Correct the handling of 0-termination of PyUnicode_AsWideChar() and its usage in PyLocale_strcoll(). Clarify the documentation on this. Thanks to Andreas Degert for pointing this out.	2004-11-22 13:02:31 +00:00
Marc-André Lemburg	204bd6d9d2	Applied patch for [ 1047269 ] Buffer overwrite in PyUnicode_AsWideChar. Python 2.3.x candidate.	2004-10-15 07:45:05 +00:00
Skip Montanaro	6543b45b0c	Initialize sep and seplen to suppress warning from gcc.	2004-09-16 03:28:13 +00:00
Thomas Heller	ca0d2cb66e	Add a missing line continuation character.	2004-09-15 11:41:32 +00:00
Walter Dörwald	065a32f550	Make the hint about the None default less ambiguous.	2004-09-14 09:45:10 +00:00
Walter Dörwald	782afc5927	Enhance the docstrings for unicode.split() and string.split() to make it clear that it is possible to pass None as the separator argument to get the default "any whitespace" separator.	2004-09-14 09:40:45 +00:00
Walter Dörwald	69652035bc	SF patch #998993 : The UTF-8 and the UTF-16 stateful decoders now support decoding incomplete input (when the input stream is temporarily exhausted). codecs.StreamReader now implements buffering, which enables proper readline support for the UTF-16 decoders. codecs.StreamReader.read() has a new argument chars which specifies the number of characters to return. codecs.StreamReader.readline() and codecs.StreamReader.readlines() have a new argument keepends. Trailing "\n"s will be stripped from the lines if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and PyUnicode_DecodeUTF16Stateful.	2004-09-07 20:24:22 +00:00
Tim Peters	91879ab8ea	PyUnicode_Join(): Bozo Alert. While this is chugging along, it may need to convert str objects from the iterable to unicode. So, if someone set the system default encoding to something nasty enough, the conversion process could mutate the input iterable as a side effect, and PySequence_Fast doesn't hide that from us if the input was a list. IOW, can't assume the size of PySequence_Fast's result is invariant across PyUnicode_FromObject() calls.	2004-08-27 22:35:44 +00:00
Tim Peters	05eba1fdc8	PyUnicode_Join(): Rewrote to use PySequence_Fast(). This doesn't do much to reduce the size of the code, but greatly improves its clarity. It's also quicker in what's probably the most common case (the argument iterable is a list). Against it, if the iterable isn't a list or a tuple, a temp tuple is materialized containing the entire input sequence, and that's a bigger temp memory burden. Yawn.	2004-08-27 21:32:02 +00:00
Tim Peters	894c512c2f	PyUnicode_Join(): Missed a spot where I intended a cast from size_t to int. I sure wish MS would gripe about that! Whatever, note that the statement above it guarantees that the cast loses no info.	2004-08-27 05:08:36 +00:00
Tim Peters	8ce9f16259	PyUnicode_Join(): Two primary aims: 1. u1.join([u2]) is u2 2. Be more careful about C-level int overflow. Since PySequence_Fast() isn't needed to achieve #1, it's not used -- but the code could sure be simpler if it were.	2004-08-27 01:49:32 +00:00
Hye-Shik Chang	e9ddfbb412	SF #989185 : Drop unicode.iswide() and unicode.width() and add unicodedata.east_asian_width(). You can still implement your own simple width() function using it like this: def width(u): w = 0 for c in unicodedata.normalize('NFC', u): cwidth = unicodedata.east_asian_width(c) if cwidth in ('W', 'F'): w += 2 else: w += 1 return w	2004-08-04 07:38:35 +00:00
Marc-André Lemburg	d25c650461	Let u'%s' % obj try obj.__unicode__() first and fallback to obj.__str__().	2004-07-23 16:13:25 +00:00
Nicholas Bastin	9ba301e589	Moved SunPro warning suppression into pyport.h and out of individual modules and objects.	2004-07-15 15:54:05 +00:00
Marc-André Lemburg	126b44cd41	Fix a copy&paste typo.	2004-07-10 12:04:20 +00:00
Marc-André Lemburg	1dffb120b7	.encode()/.decode() patch part 2.	2004-07-08 19:13:55 +00:00
Marc-André Lemburg	d2d4598ec2	Allow string and unicode return types from .encode()/.decode() methods on string and unicode objects. Added unicode.decode() which was missing for no apparent reason.	2004-07-08 17:57:32 +00:00
Nicholas Bastin	1ce9e4cfc1	Fixed end-of-loop code not reached warning when using SunPro C	2004-06-17 18:27:18 +00:00
Hye-Shik Chang	974ed7cfa5	- SF #962502 : Add two more methods for unicode type; width() and iswide() for east asian width manipulation. (Inspired by David Goodger, Reviewed by Martin v. Loewis) - Move _PyUnicode_TypeRecord.flags to the end of the struct so that no padding is added for UCS-4 builds. (Suggested by Martin v. Loewis)	2004-06-02 16:49:17 +00:00
Hye-Shik Chang	4057483164	SF Patch #926375 : Remove useless UTF-16 support code that has never been used. (Suggested by Martin v. Loewis)	2004-04-06 07:24:51 +00:00
Walter Dörwald	cd736e71a3	Fix reallocation bug in unicode.translate(): The code was comparing characters instead of character pointers to determine space requirements.	2004-02-05 17:36:00 +00:00
Hye-Shik Chang	1bc09b7c2a	Cosmetic fix for wrongly indented tabs with ts=4.	2004-01-03 19:35:43 +00:00
Hye-Shik Chang	7fc4cf57b8	Fix unicode.rsplit()'s bug that ignores separator on the end of string when using specialized splitter for 1 char sep.	2003-12-23 09:10:16 +00:00
Hye-Shik Chang	40e9509dc7	Fix broken xmlcharrefreplace by rev 2.204. (Pointy hat goes to perky)	2003-12-22 01:31:13 +00:00
Hye-Shik Chang	4a264fb054	SF #859573 : Reduce compiler warnings on gcc 3.2 and above.	2003-12-19 01:59:56 +00:00
Hye-Shik Chang	3ae811b57d	Add rsplit method for str and unicode builtin types. SF feature request #801847. Original patch is written by Sean Reifschneider.	2003-12-15 18:49:53 +00:00
Guido van Rossum	6c9e130524	- Removed FutureWarnings related to hex/oct literals and conversions and left shifts. (Thanks to Kalle Svensson for SF patch 849227.) This addresses most of the remaining semantic changes promised by PEP 237, except for repr() of a long, which still shows the trailing 'L'. The PEP appears to promise warnings for operations that changed semantics compared to Python 2.3, but this is not implemented; we've suffered through enough warnings related to hex/oct literals and I think it's best to be silent now.	2003-11-29 23:52:13 +00:00
Raymond Hettinger	4f8f976576	Add optional fillchar argument to ljust(), rjust(), and center() string methods.	2003-11-26 08:21:35 +00:00
Walter Dörwald	4894c30626	Fix a bug in the memory reallocation code of PyUnicode_TranslateCharmap(). charmaptranslate_makespace() allocated more memory than required for the next replacement but didn't remember that fact, so memory size was growing exponentially every time a replacement string is longer than one character. This fixes SF bug #828737.	2003-10-24 14:25:28 +00:00
Martin v. Löwis	6828e18a6a	Patch #825679 : Clarify semantics of .isfoo on empty strings. Backported to 2.3.	2003-10-18 09:55:08 +00:00
Jeremy Hylton	504de6bd2c	Fix for SF bug [ 817156 ] invalid \U escape gives 0=length unistr.	2003-10-06 05:08:26 +00:00
Tim Peters	ced69f8a20	On c.l.py, Martin v. Löwis said that Py_UNICODE could be of a signed type, so fiddle Jeremy's fix to live with that. Also added more comments. Bugfix candidate (this bug is in all versions of Python, at least since 2.1).	2003-09-16 20:30:58 +00:00
Jeremy Hylton	d808279be3	Double-fix of crash in Unicode freelist handling. If a length-1 Unicode string was in the freelist and it was uninitialized or pointed to a very large (magnitude) negative number, the check unicode_latin1[unicode->str[0]] == unicode could cause a segmentation violation, e.g. unicode->str[0] is 0xcbcbcbcb. Fix this in two ways: 1. Change guard before unicode_latin1[] to test against 256U. If I understand correctly, the unsigned long used to store UCS4 on my box was getting converted to a signed long to compare with the signed constant 256. 2. Change _PyUnicode_New() to make sure the first element of str is always initialized to zero. There are several places in the code where the caller can exit with an error before initializing any of str, which would leave junk in str[0]. Also, silence a compiler warning on pointer vs. int arithmetic. Bug fix candidate.	2003-09-16 19:41:39 +00:00
Jeremy Hylton	deb2dc6658	Change checks of PyUnicode_Resize() return value for clarity. The unicode_resize() family only returns -1 or 0 so simply checking for != 0 is sufficient, but somewhat unclear. Many Python API functions return < 0 on error, reserving the right to return 0 or 1 on success. Change the call sites for consistency with these calls.	2003-09-16 03:41:45 +00:00
Raymond Hettinger	9bfe533c69	SF bug #795506 : Wrong handling of string format code for float values. Adding missing support for '%F'. Will backport to 2.3.1.	2003-08-27 04:55:52 +00:00
Walter Dörwald	150523efa5	Fix refcounting leak in charmaptranslate_lookup()	2003-08-15 16:52:19 +00:00
Walter Dörwald	9b30f206ee	Fix another refcounting leak in PyUnicode_EncodeCharmap().	2003-08-15 16:26:34 +00:00
Walter Dörwald	d4ade0885c	Fix another refcounting leak (in PyUnicode_DecodeUnicodeEscape()).	2003-08-15 15:00:26 +00:00
Walter Dörwald	e5402fb340	Fix refcount leak in PyUnicode_EncodeCharmap(). The bug surfaces when an encoding error occurs and the callback name is unknown, i.e. when the callback has to be called. The problem was that the fact that the callback has already been looked up was only recorded in a local variable in charmap_encoding_error(), because charmap_encoding_error() got its own copy of the errorHandler pointer instead of a pointer to the pointer in PyUnicode_EncodeCharmap().	2003-08-14 20:25:29 +00:00
Mark Hammond	0ccda1ee10	Support 'mbcs' as a 'built-in' encoding, so the C API can use it without deferring to the encodings package. As described in [ 763111 ] mbcs encoding should skip encodings package	2003-07-01 00:13:27 +00:00
Raymond Hettinger	f466793fcc	SF patch 703666: Several objects don't decref tmp on failure in subtype_new Submitted By: Christopher A. Craig Fill in some missing decrefs.	2003-06-28 20:04:25 +00:00
Martin v. Löwis	9a3a9f7791	Consider \U-escapes in raw-unicode-escape. Fixes #444514 .	2003-05-18 12:31:09 +00:00
Neal Norwitz	ffe33b7f24	Attempt to make all the various string strip methods the same. Doc - add doc for when functions were added * UserString * string object methods * string module functions 'chars' is used for the last parameter everywhere. These changes will be backported, since part of the changes have already been made, but they were inconsistent.	2003-04-10 22:35:32 +00:00
Guido van Rossum	a7132189d2	Reformat a few docstrings that caused line wraps in help() output.	2003-04-09 19:32:45 +00:00
Walter Dörwald	44f527fea4	Change formatchar(), so that u"%c" % 0xffffffff now raises an OverflowError instead of a TypeError to be consistent with "%c" % 256. See SF patch #710127.	2003-04-02 16:37:24 +00:00
Raymond Hettinger	c8df5780e1	Sf patch #700047 : unicode object leaks refcount on resizing Contributed by Hye-Shik Chang.	2003-03-09 07:30:43 +00:00
Neal Norwitz	ec74f2fda7	Add more missing PyErr_NoMemory() after failed memory allocs	2003-02-11 23:05:40 +00:00
Walter Dörwald	f6b56aecad	Fix two refcounting bugs	2003-02-09 23:42:56 +00:00
Walter Dörwald	2e0b18af30	Change the treatment of positions returned by PEP293 error handlers in the Unicode codecs: Negative positions are treated as being relative to the end of the input and out of bounds positions result in an IndexError. Also update the PEP and include an explanation of this in the documentation for codecs.register_error. Fixes a small bug in iconv_codecs: if the position from the callback is negative add it to the size instead of subtracting it. From SF patch #677429.	2003-01-31 17:19:08 +00:00
Guido van Rossum	5d9113d8be	Implement appropriate __getnewargs__ for all immutable subclassable builtin types. The special handling for these can now be removed from save_newobj(). Add some testing for this. Also add support for setting the 'fast' flag on the Python Pickler class, which suppresses use of the memo.	2003-01-29 17:58:45 +00:00
Walter Dörwald	adc727490b	Fix charmapencode_lookup(), so that a None value in the mapping is treated as "character maps to <undefined>" and not as "character mapping must return integer, None or str".	2003-01-08 22:01:33 +00:00
Walter Dörwald	034d97605d	Remove variable owned from PyUnicode_FromEncodedObject, which is unused (except for Py_DECREF calls) since the introduction of __unicode__.	2003-01-08 20:38:39 +00:00
Marc-André Lemburg	79f57833f3	Patch for bug #659709 : bogus computation of float length Python 2.2.x backport candidate. (This bug has been around since Python 1.6.)	2002-12-29 19:44:06 +00:00
Neil Schemenauer	ce30bc9f49	Add nb_remainder (i.e. __mod__) slot to unicode type. Fixes SF bug #615506 .	2002-11-18 16:10:18 +00:00
Neal Norwitz	80a1bf4b5d	Fix SF # 635969, No error "not all arguments converted" When mwh added extended slicing, strings and unicode became mappings. Thus, dict was set which prevented an error when doing: newstr = 'format without a percent' % string_value This fix raises an exception again when there are no formats and % with a string value.	2002-11-12 23:01:12 +00:00
Marc-André Lemburg	9cd87aaa54	Fix for bug #626172 : crash using unicode latin1 single char Python 2.2.3 candidate.	2002-10-23 09:02:46 +00:00
Guido van Rossum	049cd6b563	Fix a nasty endcase reported by Armin Rigo in SF bug 618623: '%2147483647d' % -123 segfaults. This was because an integer overflow in a comparison caused the string resize to be skipped. After fixing the overflow, this could call _PyString_Resize() with a negative size, so I (1) test for that and raise MemoryError instead; (2) also added a test for negative newsize to _PyString_Resize(), raising SystemError as for all bad arguments. An identical bug existed in unicodeobject.c, of course. Will backport to 2.2.2.	2002-10-11 00:43:48 +00:00
Marc-André Lemburg	24e53b6d91	Add cast to avoid compiler warning.	2002-09-24 09:32:14 +00:00
Neal Norwitz	a0378e1eda	Fix part of SF bug # 544248 gcc warning in unicodeobject.c When --enable-unicode=ucs4, need to cast Py_UNICODE to a char	2002-09-13 13:47:06 +00:00
Guido van Rossum	efc1188239	Fix warnings on 64-bit platforms about casts from pointers to ints. Two of these were real bugs.	2002-09-12 14:43:41 +00:00
Walter Dörwald	5c1ee17742	Change the unicode.translate docstring to document that Unicode strings (with arbitrary length) are allowed as entries in the unicode.translate mapping. Add a test case for multicharacter replacements. (Multicharacter replacements were enabled by the PEP 293 patch)	2002-09-04 20:31:32 +00:00
Walter Dörwald	3aeb632c31	PEP 293 implementation (from SF patch http://www.python.org/sf/432401 )	2002-09-02 13:14:32 +00:00
Guido van Rossum	2023c9b84a	Fix SF bug 599128, submitted by Inyeol Lee: .replace() would do the wrong thing for a unicode subclass when there were zero string replacements. The example given in the SF bug report was only one way to trigger this; replacing a string of length >= 2 that's not found is another. The code would actually write outside allocated memory if replacement string was longer than the search string. (I wonder how many more of these are lurking? The unicode code base is full of wonders.) Bugfix candidate; this same bug is present in 2.2.1.	2002-08-23 18:50:21 +00:00
Guido van Rossum	8b1a6d694f	Code by Inyeol Lee, submitted to SF bug 595350, to implement the string/unicode method .replace() with a zero-length first argument. Inyeol contributed tests for this too.	2002-08-23 18:21:28 +00:00
Guido van Rossum	76afbd9aa4	Fix some endcase bugs in unicode rfind()/rindex() and endswith(). These were reported and fixed by Inyeol Lee in SF bug 595350. The endswith() bug was already fixed in 2.3, but this adds some more test cases.	2002-08-20 17:29:29 +00:00
Guido van Rossum	54df53a352	More changes of DeprecationWarning to FutureWarning.	2002-08-14 18:38:27 +00:00
Marc-André Lemburg	cc8764ca9d	Add C API PyUnicode_FromOrdinal() which exposes unichr() at C level. u'%c' will now raise a ValueError in case the argument is an integer outside the valid range of Unicode code point ordinals. Closes SF bug #593581.	2002-08-11 12:23:04 +00:00
Guido van Rossum	078151da90	Implement stage B0 of PEP 237: add warnings for operations that currently return inconsistent results for ints and longs; in particular: hex/oct/%u/%o/%x/%X of negative short ints, and x<<n that either loses bits or changes sign. (No warnings for repr() of a long, though that will also change to lose the trailing 'L' eventually.) This introduces some warnings in the test suite; I'll take care of those later.	2002-08-11 04:24:12 +00:00
Guido van Rossum	f36921c4b0	Unicode replace() method with empty pattern argument should fail, like it does for 8-bit strings.	2002-08-09 15:36:48 +00:00
Barry Warsaw	6a043f3fe8	PyUnicode_Contains(): The memcmp() call didn't take into account the width of Py_UNICODE. Good catch, MAL.	2002-08-06 19:03:17 +00:00
Barry Warsaw	817918cc3c	Committing patch #591250 which provides "str1 in str2" when str1 is a string of longer than 1 character.	2002-08-06 16:58:21 +00:00
Skip Montanaro	35b37a5c11	tighten up the unicode object's docstring a tad	2002-07-26 16:22:46 +00:00
Jeremy Hylton	938ace69a0	staticforward bites the dust. The staticforward define was needed to support certain broken C compilers (notably SCO ODT 3.0, perhaps early AIX as well) botched the static keyword when it was used with a forward declaration of a static initialized structure. Standard C allows the forward declaration with static, and we've decided to stop catering to broken C compilers. (In fact, we expect that the compilers are all fixed eight years later.) I'm leaving staticforward and statichere defined in object.h as static. This is only for backwards compatibility with C extensions that might still use it. XXX I haven't updated the documentation.	2002-07-17 16:30:39 +00:00
Martin v. Löwis	6238d2b024	Patch #569753 : Remove support for WIN16. Rename all occurrences of MS_WIN32 to MS_WINDOWS.	2002-06-30 15:26:10 +00:00
Neal Norwitz	20e72130c4	Fix typo in exception message	2002-06-13 21:25:17 +00:00
Martin v. Löwis	14f8b4cfcb	Patch #568124 : Add doc string macros.	2002-06-13 20:33:02 +00:00
Michael W. Hudson	5efaf7eac8	This is my nearly two year old patch [ 400998 ] experimental support for extended slicing on lists somewhat spruced up and better tested than it was when I wrote it. Includes docs & tests. The whatsnew section needs expanding, and arrays should support extended slices -- later.	2002-06-11 10:55:12 +00:00
Marc-André Lemburg	4164439240	Fix a possible segfault. Found by Neal Norwitz.	2002-05-29 13:46:29 +00:00
Marc-André Lemburg	4da6fd63bc	Fix for bug [ 561796 ] string.find causes lazy error	2002-05-29 11:33:13 +00:00
Guido van Rossum	cacfc07d08	- A new type object, 'string', is added. This is a common base type for 'str' and 'unicode', and can be used instead of types.StringTypes, e.g. to test whether something is "a string": isinstance(x, string) is True for Unicode and 8-bit strings. This is an abstract base class and cannot be instantiated directly.	2002-05-24 19:01:59 +00:00
Raymond Hettinger	0ebac97058	Patch 549187. Improve string formatting error message.	2002-05-21 15:14:57 +00:00
Tim Peters	5de9842b34	Repair widespread misuse of _PyString_Resize. Since it's clear people don't understand how this function works, also beefed up the docs. The most common usage error is of this form (often spread out across gotos): if (_PyString_Resize(&s, n) < 0) { Py_DECREF(s); s = NULL; goto outtahere; } The error is that if _PyString_Resize runs out of memory, it automatically decrefs the input string object s (which also deallocates it, since its refcount must be 1 upon entry), and sets s to NULL. So if the "if" branch ever triggers, it's an error to call Py_DECREF(s): s is already NULL! A correct way to write the above is the simpler (and intended) if (_PyString_Resize(&s, n) < 0) goto outtahere; Bugfix candidate.	2002-04-27 18:44:32 +00:00
Tim Peters	602f740bc2	SF patch 549375: Compromise PyUnicode_EncodeUTF8 This implements ideas from Marc-Andre, Martin, Guido and me on Python-Dev. "Short" Unicode strings are encoded into a "big enough" stack buffer, then exactly as much string space as they turn out to need is allocated at the end. This should have speed benefits akin to Martin's "measure once, allocate once" strategy, but without needing a distinct measuring pass. "Long" Unicode strings allocate as much heap space as they could possibly need (4 x # Unicode chars), and do a realloc at the end to return the untouched excess. Since the overallocation is likely to be substantial, this shouldn't burden the platform realloc with unusably small excess blocks. Also simplified uses of the PyString_xyz functions. Also added a release-build check that 4*size doesn't overflow a C int. Sooner or later, that's going to happen.	2002-04-27 18:03:26 +00:00
Tim Peters	030a5cebf4	unicode_memchr(): Squashed gratuitous int-vs-size_t mismatch (which gives a compiler wng under MSVC because of the resulting signed-vs-unsigned comparison).	2002-04-22 19:00:10 +00:00
Walter Dörwald	de02bcb265	Apply patch diff.txt from SF feature request http://www.python.org/sf/444708 This adds the optional argument for str.strip to unicode.strip too and makes it possible to call str.strip with a unicode argument and unicode.strip with a str argument.	2002-04-22 17:42:37 +00:00
Tim Peters	0eca65c4c5	PyUnicode_EncodeUTF8(): tightened the memory asserts a bit, and at least tried to catch some possible arithmetic overflows in the debug build.	2002-04-21 17:28:06 +00:00
Martin v. Löwis	2a7ff35a07	Back out 2.140.	2002-04-21 09:59:45 +00:00
Tim Peters	7e3d961fc1	PyUnicode_EncodeUTF8: squash compiler wng. The difference of two pointers is a signed type. Changing "allocated" to a signed int makes undetected overflow more likely, but there was no overflow detection before either.	2002-04-21 03:26:37 +00:00
Martin v. Löwis	a4eb14b7a4	Patch #495401 : Count number of required bytes for encoding UTF-8 before allocating the target buffer.	2002-04-20 13:44:01 +00:00
Walter Dörwald	0fe940c862	Return the original string only if it's a real str or unicode instance, otherwise make a copy.	2002-04-15 18:42:15 +00:00
Walter Dörwald	068325ef92	Apply the second version of SF patch http://www.python.org/sf/536241 Add a method zfill to str, unicode and UserString and change Lib/string.py accordingly. This activates the zfill version in unicodeobject.c that was commented out and implements the same in stringobject.c. It also adds the test for unicode support in Lib/string.py back in and uses repr() instead of str() (as it was before Lib/string.py 1.62)	2002-04-15 13:36:47 +00:00
Neil Schemenauer	58aa861fa2	Remove PyMalloc_*.	2002-04-12 03:07:20 +00:00
Marc-André Lemburg	68e69338ae	Bug fix for UTF-8 encoding bug (buffer overrun) #541828 .	2002-04-10 20:36:13 +00:00
Marc-André Lemburg	ce0b664af2	Added test case for UTF-8 encoding bug #541828 .	2002-04-10 17:18:02 +00:00
Guido van Rossum	77f6a65eb0	Add the 'bool' type and its values 'False' and 'True', as described in PEP 285. Everything described in the PEP is here, and there is even some documentation. I had to fix 12 unit tests; all but one of these were printing Boolean outcomes that changed from 0/1 to False/True. (The exception is test_unicode.py, which did a type(x) == type(y) style comparison. I could've fixed that with a single line using issubtype(x, type(y)), but instead chose to be explicit about those places where a bool is expected. Still to do: perhaps more documentation; change standard library modules to return False/True from predicates.	2002-04-03 22:41:51 +00:00
Walter Dörwald	8c077227f2	Fix whitespace.	2002-03-25 11:16:18 +00:00
Neil Schemenauer	dcc819a5c9	Use pymalloc if it's enabled.	2002-03-22 15:33:15 +00:00
Martin v. Löwis	047c05ebc4	Do not insert characters for unicode-escape decoders if the error mode is "ignore". Fixes #529104.	2002-03-21 08:55:28 +00:00
Andrew MacIntyre	5e9c80d906	%#x/%#X format conversion cleanup (see patch #450267 ): Objects/ stringobject.c unicodeobject.c	2002-02-28 11:38:24 +00:00
Andrew MacIntyre	c487439aa7	OS/2 EMX port changes (Objects part of patch #450267 ): Objects/ fileobject.c stringobject.c unicodeobject.c This commit doesn't include the cleanup patches for stringobject.c and unicodeobject.c which are shown separately in the patch manager. Those patches will be regenerated and applied in a subsequent commit, so as to preserve a fallback position (this commit to those files).	2002-02-26 11:36:35 +00:00
Marc-André Lemburg	bd3be8f0ca	Fix to the UTF-8 encoder: it failed on 0-length input strings. Fix for the UTF-8 decoder: it will now accept isolated surrogates (previously it raised an exception which causes round-trips to fail). Added new tests for UTF-8 round-trip safety (we rely on UTF-8 for marshalling Unicode objects, so we better make sure it works for all Unicode code points, including isolated surrogates). Bumped the PYC magic in a non-standard way -- please review. This was needed because the old PYC format used illegal UTF-8 sequences for isolated high surrogates which now raise an exception.	2002-02-07 11:33:49 +00:00
Marc-André Lemburg	dc724d6e35	Cosmetics.	2002-02-06 18:20:19 +00:00
Marc-André Lemburg	e7c6ee4b8a	Whitespace fixes.	2002-02-06 18:18:03 +00:00
Marc-André Lemburg	3688a882d3	Fix for the UTF-8 memory allocation bug and the UTF-8 encoding bug related to lone high surrogates.	2002-02-06 18:09:02 +00:00
Guido van Rossum	604ddf80d8	Fix for #489669 (Neal Norwitz): memory leak in test_descr (unicode). This is best reproduced by while 1: class U(unicode): pass U(u"xxxxxx") The unicode_dealloc() code wasn't properly freeing the str and defenc fields of the Unicode object when freeing a subtype instance. Fixed this by a subtle refactoring that actually reduces the amount of code slightly.	2001-12-06 20:03:56 +00:00
Barry Warsaw	e5c492d72a	formatfloat(), formatint(): Conversion of sprintf() to PyOS_snprintf() for buffer overrun avoidance.	2001-11-28 21:00:41 +00:00
Marc-André Lemburg	11326de657	Fix for bug #485951 : repr diff between string and unicode.	2001-11-28 12:56:20 +00:00
Marc-André Lemburg	72f8213ba4	Fix for bug #438164 : %-formatting using Unicode objects. This patch also does away with an incompatibility between Jython and CPython.	2001-11-20 15:18:49 +00:00
Marc-André Lemburg	b5507ecd3c	Additional test and documentation for the unicode() changes. This patch should also be applied to the 2.2b1 trunk.	2001-10-19 12:02:29 +00:00
Guido van Rossum	b8c65bc27f	SF patch #470578 : Fixes to synchronize unicode() and str() This patch implements what we have discussed on python-dev late in September: str(obj) and unicode(obj) should behave similar, while the old behaviour is retained for unicode(obj, encoding, errors). The patch also adds a new feature with which objects can provide unicode(obj) with input data: the __unicode__ method. Currently no new tp_unicode slot is implemented; this is left as option for the future. Note that PyUnicode_FromEncodedObject() no longer accepts Unicode objects as input. The API name already suggests that Unicode objects do not belong in the list of acceptable objects and the functionality was only needed because PyUnicode_FromEncodedObject() was being used directly by unicode(). The latter was changed in the discussed way: * unicode(obj) calls PyObject_Unicode() * unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject() One thing left open to discussion is whether to leave the PyUnicode_FromObject() API as a thin API extension on top of PyUnicode_FromEncodedObject() or to turn it into a (macro) alias for PyObject_Unicode() and deprecate it. Doing so would have some surprising consequences though, e.g. u"abc" + 123 would turn out as u"abc123"... [Marc-Andre didn't have time to check this in before the deadline. I hope this is OK, Marc-Andre! You can still make changes and commit them on the trunk after the branch has been made, but then please mail Barry a context diff if you want the change to be merged into the 2.2b1 release branch. GvR]	2001-10-19 02:01:31 +00:00
Guido van Rossum	9475a2310d	Enable GC for new-style instances. This touches lots of files, since many types were subclassable but had a xxx_dealloc function that called PyObject_DEL(self) directly instead of deferring to self->ob_type->tp_free(self). It is permissible to set tp_free in the type object directly to _PyObject_Del, for non-GC types, or to _PyObject_GC_Del, for GC types. Still, PyObject_DEL was a tad faster, so I'm fearing that our pystone rating is going down again. I'm not sure if doing something like void xxx_dealloc(PyObject *self) { if (PyXxxCheckExact(self)) PyObject_DEL(self); else self->ob_type->tp_free(self); } is any faster than always calling the else branch, so I haven't attempted that -- however those types whose own dealloc is fancier (int, float, unicode) do use this pattern.	2001-10-05 20:51:39 +00:00
Guido van Rossum	ad9744a67a	Fix a bug in rendering of \\ by repr() -- it rendered as \\\ instead of \\.	2001-09-21 15:38:17 +00:00
Marc-André Lemburg	3508e30861	Fix Unicode .join() method to raise a TypeError for sequence elements which are not Unicode objects or strings. (This matches the string.join() behaviour.) Fix a memory leak in the .join() method which occurs in case the Unicode resize fails. Restore the test_unicode output.	2001-09-20 17:22:58 +00:00
Marc-André Lemburg	6871f6ac57	Implement the changes proposed in patch #413333. unicode(obj) now works just like str(obj) in that it tries __str__/tp_str on the object in case it finds that the object is not a string or buffer.	2001-09-20 12:53:16 +00:00
Marc-André Lemburg	c60e6f7771	Patch #435971: UTF-7 codec by Brian Quinlan.	2001-09-20 10:35:46 +00:00
Tim Peters	af90b3e610	str_subtype_new, unicode_subtype_new: + These were leaving the hash fields at 0, which all string and unicode routines believe is a legitimate hash code. As a result, hash() applied to str and unicode subclass instances always returned 0, which in turn confused dict operations, etc. + Changed local names "new"; no point to antagonizing C++ compilers.	2001-09-12 05:18:58 +00:00
Tim Peters	7a29bd5861	More on bug 460020: disable many optimizations of unicode subclasses.	2001-09-12 03:03:31 +00:00
Tim Peters	78e0fc74bc	Possibly the end of SF [#460020] bug or feature: unicode() and subclasses. Changed unicode(i) to return a true Unicode object when i is an instance of a unicode subclass. Added PyUnicode_CheckExact macro.	2001-09-11 03:07:38 +00:00
Tim Peters	0ebeb584a4	PyUnicode_FromEncodedObject(): Repair memory leak in an error case.	2001-09-11 02:00:50 +00:00
Guido van Rossum	e023fe0eef	Make unicode subclassable.	2001-08-30 03:12:59 +00:00
Martin v. Löwis	e3eb1f2b23	Patch #427190: Implement and use METH_NOARGS and METH_O.	2001-08-16 13:15:00 +00:00
Tim Peters	772747b3f1	SF patch #438013 Remove 2-byte Py_UCS2 assumptions Removed all instances of Py_UCS2 from the codebase, and so also (I hope) the last remaining reliance on the platform having an integral type with exactly 16 bits. PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write one byte at a time.	2001-08-09 22:21:55 +00:00
Tim Peters	6d6c1a35e0	Merge of descr-branch back into trunk.	2001-08-02 04:15:00 +00:00
Jeremy Hylton	3ce45389bd	Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h. And remove all the extern decls in the middle of .c files. Apparently, it was excluded from the header file because it is intended for internal use by the interpreter. It's still intended for internal use and documented as such in the header file.	2001-07-30 22:34:24 +00:00
Marc-André Lemburg	80d1dd5f3b	Fix for bug #444493: u'\U00010001' segfaults with current CVS on wide builds.	2001-07-25 16:05:59 +00:00
Marc-André Lemburg	6c6bfb7c70	Make the unicode-escape and the UTF-16 codecs handle surrogates correctly and thus roundtrip-safe. Some minor cleanups of the code. Added tests for the roundtrip-safety.	2001-07-20 17:39:11 +00:00
Guido van Rossum	0d42e0c54a	#ifdef out generation of \U escapes unless Py_UNICODE_WIDE. This caused warnings with the VMS C compiler. (SF bug #442998, in part.) On a narrow system the current code should never be executed since ch will always be < 0x10000. Marc-Andre: you may end up fixing this a different way, since I believe you have plans to generate \U for surrogate pairs. I'll leave that to you.	2001-07-20 16:36:21 +00:00
Fredrik Lundh	8f4558583f	use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE tests.	2001-06-27 18:59:43 +00:00
Martin v. Löwis	ce9b5a55e1	Encode surrogates in UTF-8 even for a wide Py_UNICODE. Implement sys.maxunicode. Explicitly wrap around upper/lower computations for wide Py_UNICODE. When decoding large characters with UTF-8, represent expected test results using the \U notation.	2001-06-27 06:28:56 +00:00
Martin v. Löwis	ac93bc2501	When decoding UTF-16, don't assume that the buffer is in native endianness when checking surrogates.	2001-06-26 22:43:40 +00:00
Martin v. Löwis	0ba70cc3c8	Support using UCS-4 as the Py_UNICODE type: Add configure option --enable-unicode. Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE, SIZEOF_WCHAR_T. Define Py_UCS2. Encode and decode large UTF-8 characters into single Py_UNICODE values for wide Unicode types; likewise for UTF-16. Remove test whether sizeof Py_UNICODE is two.	2001-06-26 22:22:37 +00:00
Fredrik Lundh	1294ad0c59	experimental UCS-4 support: added USE_UCS4_STORAGE define to unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4). (this may be good enough for platforms that don't have a 16-bit type. the UTF-16 codecs don't work, though)	2001-06-26 17:17:07 +00:00
Fredrik Lundh	45714e9ecb	experimental UCS-4 support: made compare a bit more robust, in case sizeof(Py_UNICODE) >= sizeof(long). also changed surrogate expansion to work if sizeof(Py_UNICODE) > 2.	2001-06-26 16:39:36 +00:00
Fredrik Lundh	3083163dc1	experimental UCS-4 support: don't assume that MS_WIN32 implies HAVE_USABLE_WCHAR_T	2001-06-26 15:11:00 +00:00
Guido van Rossum	ad98db1d9e	Fix a mis-indentation in _PyUnicode_New() that caused me to stare at some code for longer than needed.	2001-06-14 17:52:02 +00:00
Marc-André Lemburg	8879a33613	Fixes [ #430986 ] Buglet in PyUnicode_FromUnicode.	2001-06-07 12:26:56 +00:00
Jeremy Hylton	9cea41c195	fix bogus indentation	2001-05-29 17:13:15 +00:00
Marc-André Lemburg	489b56e044	This patch changes the behaviour of the UTF-16 codec family. Only the UTF-16 codec will now interpret and remove a leading BOM mark. Sub- sequent BOM characters are no longer interpreted and removed. UTF-16-LE and -BE pass through all BOM mark characters. These changes should get the UTF-16 codec more in line with what the Unicode FAQ recommends w/r to BOM marks.	2001-05-21 20:30:15 +00:00
Jeremy Hylton	d37292bb8d	Remove unused variable	2001-05-08 04:00:45 +00:00
Tim Peters	2cfe368283	Make unicode.join() work nice with iterators. This also required a change to string.join(), so that when the latter figures out in midstream that it really needs unicode.join() instead, unicode.join() can actually get all the sequence elements (i.e., there's no guarantee that the sequence passed to string.join() can be iterated over again by unicode.join(), so string.join() must not pass on the original sequence object anymore).	2001-05-05 05:36:48 +00:00
Tim Peters	b3d8d1f76c	A different approach to the problem reported in Patch #419651: Metrowerks on Mac adds 0x itself C std says %#x and %#X conversion of 0 do not add the 0x/0X base marker. Metrowerks apparently does. Mark Favas reported the same bug under a Compaq compiler on Tru64 Unix, but no other libc broken in this respect is known (known to be OK under MSVC and gcc). So just try the damn thing at runtime and see what the platform does. Note that we've always had bugs here, but never knew it before because a relevant test case didn't exist before 2.1.	2001-04-28 05:38:26 +00:00
Marc-André Lemburg	8155e0e541	This patch originated from an idea by Martin v. Loewis who submitted a patch for sharing single character Unicode objects. Martin's patch had to be reworked in a number of ways to take Unicode resizing into consideration as well. Here's what the updated patch implements: * Single character Unicode strings in the Latin-1 range are shared (not only ASCII chars as in Martin's original patch). * The ASCII and Latin-1 codecs make use of this optimization, providing a noticeable speedup for single character strings. Most Unicode methods can use the optimization as well (by virtue of using PyUnicode_FromUnicode()). * Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY) * The PyUnicode_Resize() can now also handle the case of resizing unicode_empty which previously resulted in an error. * Modified the internal API _PyUnicode_Resize() and the public PyUnicode_Resize() API to handle references to shared objects correctly. The _PyUnicode_Resize() signature changed due to this. * Callers of PyUnicode_FromUnicode() may now only modify the Unicode object contents of the returned object in case they called the API with NULL as content template. Note that even though this patch passes the regression tests, there may still be subtle bugs in the sharing code.	2001-04-23 14:44:21 +00:00
Tim Peters	cf96de052f	SF bug #417587: compiler warnings compiling 2.1. Repaired some of the SGI compiler warnings Sjoerd Mullender reported.	2001-04-21 02:46:11 +00:00
Tim Peters	78fe5308b4	CVS patch 416248: 2.1c1 unicodeobject: unused vrbl cleanup, from Mark Favas.	2001-04-19 21:55:14 +00:00
Jeremy Hylton	b8a93215c2	Revert previous checkin, which caused test_unicodedata to fail.	2001-04-19 16:43:49 +00:00
Martin v. Löwis	da3dc5b892	Patch #416953: Cache ASCII characters to speed up ASCII decoding.	2001-04-18 12:49:15 +00:00
Tim Peters	fff5325078	Bug 415514 reported that e.g. "%#x" % 0 blew up, at heart because C sprintf supplies a base marker if and only if the value is not 0. I then fixed that, by tolerating C's inconsistency when it does %#x, and taking away that Python produced 0x0 when formatting 0L (the "long" flavor of 0) under %#x itself. But after talking with Guido, we agreed it would be better to supply 0x for the short int case too, despite that it's inconsistent with C, because C is inconsistent with itself and with Python's hex(0) (plus, while "%#x" % 0 didn't work before, "%#x" % 0L did, and returned "0x0"). Similarly for %#X conversion.	2001-04-12 18:38:48 +00:00
Tim Peters	711088d9b8	Fix for SF bug #415514 : "%#x" % 0 caused assertion failure/abort. http://sourceforge.net/tracker/index.php?func=detail&aid=415514&group_id=5470&atid=105470 For short ints, Python defers to the platform C library to figure out what %#x should do. The code asserted that the platform C returned a string beginning with "0x". However, that's not true when-- and only when --the value being formatted is 0. Changed the code to live with C's inconsistency here. In the meantime, the problem does not arise if you format a long 0 (0L) instead. However, that's because the code we wrote to do %#x conversions on longs produces a leading "0x" regardless of value. That's probably wrong too: we should drop leading "0x", for consistency with C, when (& only when) formatting 0L. So I changed the long formatting code to do that too.	2001-04-12 00:35:51 +00:00
Fredrik Lundh	ccc7473fc8	reorganized PyUnicode_DecodeUnicodeEscape a bit (in order to make it less likely that bug #132817 ever appears again)	2001-02-18 22:13:49 +00:00
Marc-André Lemburg	fde66e1bcc	Fixed .capitalize() method of Unicode objects to work like the corresponding string method. Added tests for this too. Patch written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.	2001-01-29 11:14:16 +00:00
Ka-Ping Yee	fa004ad36c	Show '\011', '\012', and '\015' as '\t', '\n', '\r' in strings. Switch from octal escapes to hex escapes for other nonprintable characters.	2001-01-24 17:19:08 +00:00
Fredrik Lundh	06d126803c	Move ucnhash functionality into unicodedata (after the recent crop of changes, the files are small enough to do this). Also adds "name" and "lookup" functions to unicodedata.	2001-01-24 07:59:11 +00:00
Fredrik Lundh	f60560626c	Better error message if ucnhash cannot be found (obscure attribute errors aren't that helpful), or doesn't contain what's expected from it. Also tweaked the test script so it compiles even if ucnhash is missing.	2001-01-20 11:15:25 +00:00
Fredrik Lundh	0fdb90cafe	refactored the unicodeobject/ucnhash interface, to hide the implementation details inside the ucnhash module. also cleaned up the unicode copyright blurb a little; Secret Labs' internal revision history isn't that interesting...	2001-01-19 09:45:02 +00:00
Marc-André Lemburg	ad7c98e264	This patch adds a new builtin unistr() which behaves like str() except that it always returns Unicode objects. A new C API PyObject_Unicode() is also provided. This closes patch #101664. Written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.	2001-01-17 17:09:53 +00:00
Marc-André Lemburg	3a645e4dd4	Added checks to prevent PyUnicode_Count() from dumping core in case the parameters are out of bounds and fixes error handling for .count(), .startswith() and .endswith() for the case of mixed string/Unicode objects. This patch adds Python style index semantics to PyUnicode_Count() indices (including the special handling of negative indices). The patch is an extended version of patch #103249 submitted by Michael Hudson (mwh) on SF. It also includes new test cases.	2001-01-16 11:54:12 +00:00
Marc-André Lemburg	ec233e5803	This patch adds a new feature to the builtin charmap codec: The mapping dictionaries can now contain 1-n mappings, meaning that character ordinals may be mapped to strings or Unicode objects, e.g. 0x0078 ('x') -> u"abc", causing the ordinal to be replaced by the complete string or Unicode object instead of just one character. Another feature introduced by the patch is that of mapping ordinals to the empty string. This allows removing characters. The patch is different from patch #103100 in that it does not cause a performance hit for the normal use case of 1-1 mappings. Written by Marc-Andre Lemburg, copyright assigned to Guido van Rossum.	2001-01-06 14:59:58 +00:00
Marc-André Lemburg	a866df806d	This patch changes the default behaviour of the builtin charmap codec to not apply Latin-1 mappings for keys which are not found in the mapping dictionaries, but instead treat them as undefined mappings. The patch was originally written by Martin v. Loewis with some additional (cosmetic) changes and an updated test script by Marc-Andre Lemburg. The standard codecs were recreated from the most current files available at the Unicode.org site using the Tools/scripts/gencodec.py tool. This patch closes the bugs #116285 and #119960.	2001-01-03 21:29:14 +00:00
Andrew M. Kuchling	f947ffe951	Patch #102940: use only printable Unicode chars in reporting incorrect % characters; characters outside the printable range are replaced with '?'	2000-12-19 22:49:06 +00:00
Guido van Rossum	cda4f9a8dc	Fix off-by-one error in split_substring(). Fixes SF bug #122162.	2000-12-19 02:23:19 +00:00
Andrew M. Kuchling	6ca8917758	[ Patch #102852 ] Make % error a bit more informative by indicating the index at which an unknown %-escape was found	2000-12-15 13:07:46 +00:00
Tim Peters	a3a3a030af	Fix for SF bug #123859: %[duxXo] long formats inconsistent.	2000-11-30 05:22:44 +00:00
Barry Warsaw	5b4c22806f	_PyUnicode_Fini(): Initialize the local freelist walking variable `u' after unicode_empty has been freed, otherwise it might not point to the real start of the unicode_freelist. Final closure for SF bug #110681, Jitterbug PR#398.	2000-10-03 20:45:26 +00:00
Guido van Rossum	4ae8ef84da	In _PyUnicode_Fini(), decref unicode_empty before tearing down the free list. Discovered by Barry, fix approved by MAL.	2000-10-03 18:09:04 +00:00
Fred Drake	d5fadf75e4	Rationalize use of limits.h, moving the inclusion to Python.h. Add definitions of INT_MAX and LONG_MAX to pyport.h. Remove includes of limits.h and conditional definitions of INT_MAX and LONG_MAX elsewhere. This closes SourceForge patch #101659 and bug #115323.	2000-09-26 05:46:01 +00:00
Tim Peters	38fd5b6413	Derived from Martin's SF patch 110609: support unbounded ints in %d,i,u,x,X,o formats. Note a curious extension to the std C rules: x, X and o formatting can never produce a sign character in C, so the '+' and ' ' flags are meaningless for them. But unbounded ints can produce a sign character under these conversions (no fixed- width bitstring is wide enough to hold all negative values in 2's-comp form). So these flags become meaningful in Python when formatting a Python long which is too big to fit in a C long. This required shuffling around existing code, which hacked x and X conversions to death when both the '#' and '0' flags were specified: the hacks weren't strong enough to deal with the simultaneous possibility of the ' ' or '+' flags too, since signs were always meaningless before for x and X conversions. Isomorphic shuffling was required in unicodeobject.c. Also added dozens of non-trivial new unbounded-int test cases to test_format.py.	2000-09-21 05:43:11 +00:00
Tim Peters	8f422461b4	Fix for bug 113934. stringn and unicoden did no overflow checking at all, either to see whether the # of chars fit in an int, or that the amount of memory needed fit in a size_t. Checking these is expensive, but the alternative is silently wrong answers (as in the bug report) or core dumps (which were easy to provoke using Unicode strings).	2000-09-09 06:13:41 +00:00
Fredrik Lundh	df84675f93	changed \x to consume exactly two hex digits, also for unicode strings. closes PEP-223. also added \U escape (eight hex digits).	2000-09-03 11:29:49 +00:00
Barry Warsaw	ce4dc41b1a	PyUnicode_AsUTF8String(): /F picks up what I missed: the local var `str' is no longer necessary. Gotta turn on -Wall!	2000-08-18 19:30:40 +00:00
Barry Warsaw	2dd4abf277	PyUnicode_AsUTF8String(): Don't need to explicitly incref str since PyUnicode_EncodeUTF8() already returns the created object with the proper reference count. This fixes an Insure reported memory leak.	2000-08-18 06:58:15 +00:00
Marc-André Lemburg	b7520774e2	Fixed a couple of instances where a 0-length string was being resized after creation. 0-length strings are usually shared and _PyString_Resize() fails on these shared strings. Fixes [ Bug #111667 ] unicode core dump.	2000-08-14 11:29:19 +00:00
Trent Mick	20abf573ef	Clean up warning from Monterey compiler. Properly end a comment block. It was terminated fine later but by a subsequent block and. It was also in #if 0. This patch is so trivial I can't believe I am talking about it. :)	2000-08-12 22:14:34 +00:00
Marc-André Lemburg	e5034378cc	Removing UTF-16 aware Unicode comparison code. This kind of compare function (together with other locale aware ones) should go into a new collation support module. See python-dev for a discussion of this removal. Note: This patch should also be applied to the 1.6 branch.	2000-08-08 08:04:29 +00:00
Marc-André Lemburg	bff879cabb	This patch finalizes the move from UTF-8 to a default encoding in the Python Unicode implementation. The internal buffer used for implementing the buffer protocol is renamed to defenc to make this change visible. It now holds the default encoded version of the Unicode object and is calculated on demand (NULL otherwise). Since the default encoding defaults to ASCII, this will mean that Unicode objects which hold non-ASCII characters will no longer work on C APIs using the "s" or "t" parser markers. C APIs must now explicitly provide Unicode support via the "u", "U" or "es"/"es#" parser markers in order to work with non-ASCII Unicode strings. (Note: this patch will also have to be applied to the 1.6 branch of the CVS tree.)	2000-08-03 18:46:08 +00:00
Guido van Rossum	16b1ad9c7d	Changing the CNRI copyright notice according to CNRI's instructions. This is a notice without a date, which apparently is not a claim to copyright but only advice to the reader. IANAL. :-)	2000-08-03 16:24:25 +00:00
Peter Schneider-Kamp	7e01890986	merge Include/my.h into Include/pyport.h marked my.h as obsolete	2000-07-31 15:28:04 +00:00
Thomas Wouters	7889010731	Miscellaneous ANSIfications. I'm assuming here 'main' should take (int, char**) and return an int even on PC platforms. If not, please fix PC/utils/makesrc.c ;-P	2000-07-22 19:25:51 +00:00
Marc-André Lemburg	9542f48fd5	Fixed problems with UTF error reporting macros and some formatting bugs.	2000-07-17 18:23:13 +00:00
Greg Stein	af36a3aa20	gcc is being stupid with if/else constructs clean out some other warnings	2000-07-17 09:04:43 +00:00
Greg Stein	ff975003cf	stop messing around with goto and just write the macro correctly.	2000-07-16 21:39:49 +00:00
Fredrik Lundh	0e19e76aba	- change \x to mean "byte" also in unicode literals (patch #100912)	2000-07-16 18:47:43 +00:00
Tim Peters	855ffac224	Fix fatal compiler (MSVC6) error: unicodeobject.c(735) : error C2143: syntax error : missing ';' before '}'	2000-07-16 17:10:50 +00:00
Marc-André Lemburg	fb625847bf	Fix to a bug found by Florian Weimer: The UTF-8 decoder is still buggy (i.e. it doesn't pass Markus Kuhn's stress test), mainly due to the following construct: #define UTF8_ERROR(details) do { \ if (utf8_decoding_error(&s, &p, errors, details)) \ goto onError; \ continue; \ } while (0) (The "continue" statement is supposed to exit from the outer loop, but of course, it doesn't. Indeed, this is a marvelous example of the dangers of the C programming language and especially of the C preprocessor.)	2000-07-16 13:29:13 +00:00
Thomas Wouters	7e47402264	Spelling fixes supplied by Rob W. W. Hooft. All these are fixes in either comments, docstrings or error messages. I fixed two minor things in test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't"). There is a minor style issue involved: Guido seems to have preferred English grammar (behaviour, honour) in a couple places. This patch changes that to American, which is the more prominent style in the source. I prefer English myself, so if English is preferred, I'd be happy to supply a patch myself ;)	2000-07-16 12:04:32 +00:00
Jeremy Hylton	03657cfdb0	replace PyXXX_Length calls with PyXXX_Size calls	2000-07-12 13:05:33 +00:00
Marc-André Lemburg	566d8a64eb	Jeremy Hylton: better error message for unicode coercion failure	2000-07-11 09:47:04 +00:00
Fredrik Lundh	dde6164402	- changed hash calculation for unicode strings. the new value is calculated from the character values, in a way that makes sure an 8-bit ASCII string and a unicode string with the same contents get the same hash value. (as a side effect, this also works for ISO Latin 1 strings). for more details, see the python-dev discussion.	2000-07-10 18:27:47 +00:00
Marc-André Lemburg	e12896ec98	New surrogate support in the UTF-8 codec. By Bill Tutt.	2000-07-07 17:51:08 +00:00
Marc-André Lemburg	5a5c81a0e9	Added new API PyUnicode_FromEncodedObject() which supports decoding objects including instance objects. The old API PyUnicode_FromObject() is still available as shortcut.	2000-07-07 13:46:42 +00:00
Marc-André Lemburg	063e0cb4c6	Fix to bug #393 (UTF16 codec didn't like empty strings) and corrected some usage of 'unsigned long' where Py_UNICODE should have been used.	2000-07-07 11:27:45 +00:00
Sjoerd Mullender	2629bd5a33	Two more places where long should be used instead of int. Especially true after revision 2.36 was checked in...	2000-07-07 09:47:24 +00:00
Marc-André Lemburg	449c325303	Fixed some code that used 'short' to use 'long' instead.	2000-07-06 20:13:23 +00:00
Marc-André Lemburg	85cc4d8940	Fixed a couple of places where 'int' was used where 'long' should have been used.	2000-07-06 19:43:31 +00:00
Marc-André Lemburg	a7acf425f6	Added new .isalpha() and .isalnum() methods which provide interfaces to the new alphabetic lookup APIs in unicodectype.c.	2000-07-05 09:49:44 +00:00
Marc-André Lemburg	1e7205a62a	Bill Tutt: Make unicode_compare a true UTF-16 compare function (includes support for surrogates).	2000-07-04 09:51:07 +00:00
Marc-André Lemburg	d49e5b4667	Marc-Andre Lemburg <mal@lemburg.com>: A previous patch by Jack Jansen was accidentally reverted.	2000-06-30 14:58:20 +00:00
Marc-André Lemburg	f28dd83b86	Marc-Andre Lemburg <mal@lemburg.com>: New buffer overflow checks for formatting strings. By Trent Mick.	2000-06-30 10:29:57 +00:00
Guido van Rossum	4f4b799b33	Jack Jansen: Use include "" instead of <>; and staticforward declarations	2000-06-29 00:06:39 +00:00
Marc-André Lemburg	0f774e3987	Marc-Andre Lemburg <mal@lemburg.com>: Patch to the standard unicode-escape codec which dynamically loads the Unicode name to ordinal mapping from the module ucnhash. By Bill Tutt.	2000-06-28 16:43:35 +00:00
Marc-André Lemburg	7c014684c2	Marc-Andre Lemburg <mal@lemburg.com>: Better error message for "1 in unicodestring". Submitted by Andrew Kuchling.	2000-06-28 08:11:47 +00:00
Marc-André Lemburg	49ef6dc1f4	Marc-Andre Lemburg <mal@lemburg.com>: Fixed a bug in PyUnicode_Count() which would have caused a core dump in case of substring coercion failure. Synchronized .count() with the string method of the same name to return len(s)+1 for s.count('').	2000-06-18 22:25:22 +00:00
Marc-André Lemburg	bea47e768d	Vladimir MARANGOZOV <Vladimir.Marangozov@inrialpes.fr>: This patch fixes an optimisation mystery in _PyUnicodeNew causing segfaults on AIX when the interpreter is compiled with -O.	2000-06-17 20:31:17 +00:00
Marc-André Lemburg	60bc809d9a	Marc-Andre Lemburg <mal@lemburg.com>: Added code so that .isXXX() testing returns 0 for empty strings.	2000-06-14 09:18:32 +00:00
Marc-André Lemburg	07ceb67d9c	Marc-Andre Lemburg <mal@lemburg.com>: Fixed a typo and removed a debug printf(). Thanks to Finn Bock for finding these.	2000-06-10 09:32:51 +00:00
Andrew M. Kuchling	cb95a1470a	Patch from Michael Hudson: improve unclear error message	2000-06-09 14:04:53 +00:00
Marc-André Lemburg	d4ab4a5905	Marc-Andre Lemburg <mal@lemburg.com>: Fixed %c formatting to check for one character arguments. Thanks to Finn Bock for finding this bug. Added a fix for bug PR#348 which originated from not resetting the globals correctly in _PyUnicode_Fini().	2000-06-08 17:54:00 +00:00
Marc-André Lemburg	90e8147118	Marc-Andre Lemburg <mal@lemburg.com>: Change the default encoding to 'ascii' (it was previously defined as UTF-8). Note: The implementation still uses UTF-8 to implement the buffer protocol, so C APIs will still see UTF-8. This is on purpose: rather than fixing the Unicode implementation, the C APIs should be made Unicode aware.	2000-06-07 09:13:21 +00:00
Fred Drake	785d14f965	Minimal change so I can add the rest of MAL's checkin message: M.-A. Lemburg <mal@lemburg.com>: Fixed a core dump in PyUnicode_Format().	2000-05-09 19:54:43 +00:00
Fred Drake	e4315f58d2	M.-A. Lemburg <mal@lemburg.com>: Added support for user settable default encodings. The current implementation uses a per-process global which defines the value of the encoding parameter in case it is set to NULL (meaning: use the default encoding).	2000-05-09 19:53:39 +00:00
Guido van Rossum	b8872e61c6	Trent Mick: Fix the string methods that implement slice-like semantics with optional args (count, find, endswith, etc.) to properly handle indices outside [INT_MIN, INT_MAX]. Previously the "i" formatter for PyArg_ParseTuple was used to get the indices. These could overflow. This patch changes the string methods to use the "O&" formatter with the slice_index() function from ceval.c which is used to do the same job for Python code slices (e.g. 'abcabcabc'[0:1000000000L]).	2000-05-09 14:14:27 +00:00
Guido van Rossum	03e29f1ae9	Mark Hammond should get his act into gear (his words :-). Zero length strings _are_ valid!	2000-05-04 15:52:20 +00:00
Guido van Rossum	42c29aaeb5	Fix warning detected by VC++ on assignment of Py_UNICODE to char.	2000-05-03 23:58:29 +00:00
Guido van Rossum	b18618dab7	Vladimir Marangozov's long-awaited malloc restructuring. For more comments, read the patches@python.org archives. For documentation read the comments in mymalloc.h and objimpl.h. (This is not exactly what Vladimir posted to the patches list; I've made a few changes, and Vladimir sent me a fix in private email for a problem that only occurs in debug mode. I'm also holding back on his change to main.c, which seems unnecessary to me.)	2000-05-03 23:44:39 +00:00
Guido van Rossum	4e751c3d12	Mark Hammond withdraws his fix -- the size includes the trailing 0 so a size of 0 is illegal.	2000-05-03 12:27:22 +00:00
Guido van Rossum	a6edfd9737	Mark Hammond: Fixes the MBCS codec to work correctly with zero length strings.	2000-05-03 11:03:24 +00:00
Guido van Rossum	0e4f657a50	Marc-Andre Lemburg: Fixed \OOO interpretation for Unicode objects. \777 now correctly produces the Unicode character with ordinal 511.	2000-05-01 21:27:20 +00:00
Guido van Rossum	3c1bb8043f	Marc-Andre Lemburg: Fixed a reference leak in the allocator. Renamed utf8_string to _PyUnicode_AsUTF8String() and made it external for use by other parts of the interpreter.	2000-04-27 20:13:50 +00:00
Guido van Rossum	86662914be	Marc-Andre Lemburg: The maxsplit functionality in .splitlines() was replaced by the keepends functionality which allows keeping the line end markers together with the string.	2000-04-11 15:38:46 +00:00
Guido van Rossum	fd4b957b06	Marc-Andre Lemburg: * New exported API PyUnicode_Resize() * The experimental Keep-Alive optimization was turned back on after some tweaks to the implementation. It should now work without causing core dumps... this has yet to be tested though (switching it off is easy: see the unicodeobject.c file for details). * Fixed a memory leak in the Unicode freelist cleanup code. * Added tests to correctly process the return code from _PyUnicode_Resize(). * Fixed a bug in the 'ignore' error handling routines of some builtin codecs. Added test cases for these to test_unicode.py.	2000-04-10 13:51:10 +00:00
Guido van Rossum	5db862dd0c	Skip Montanaro: add string precisions to calls to PyErr_Format to prevent possible buffer overruns.	2000-04-10 12:46:51 +00:00
Guido van Rossum	ba47704943	Conrad Huang points out that "if (0 < ch < 256)", while legal C, doesn't mean what the Python programmer thought...	2000-04-06 18:18:10 +00:00
Guido van Rossum	34888ed689	Fredrik Lundh: eliminate a MSVC compiler warning.	2000-04-05 21:29:50 +00:00
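
Note on commit ba47704943 above ("if (0 < ch < 256)"): the pitfall it mentions can be sketched in a few lines. This is an illustration only, not code from unicodeobject.c; the helper names c_style_range_check and python_range_check are made up for the example. In C, the expression parses as (0 < ch) < 256, and since a relational expression yields 0 or 1, the test is always true. Python chains the comparison instead.

```python
def c_style_range_check(ch: int) -> bool:
    """Emulate how C evaluates `0 < ch < 256` (hypothetical helper)."""
    # C evaluates left to right: (0 < ch) produces the int 0 or 1 ...
    first = 1 if 0 < ch else 0
    # ... and that 0 or 1 is then compared with 256, which is always true.
    return first < 256


def python_range_check(ch: int) -> bool:
    """Python chains the comparison: (0 < ch) and (ch < 256)."""
    return 0 < ch < 256


if __name__ == "__main__":
    # Print both results side by side for a few sample ordinals.
    for ch in (-5, 0, 65, 255, 256, 0x20AC):
        print(ch, c_style_range_check(ch), python_range_check(ch))
```

The C form that matches the Python meaning is the explicit conjunction, 0 < ch && ch < 256.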


456 Commits