Commit Graph

344 Commits

Author SHA1 Message Date
Thomas Wouters 4701af5bf5 Remove two unused Py_ssize_t variables (merge glitches, looks like.) 2006-02-15 23:10:32 +00:00
Martin v. Löwis 18e165558b Merge ssize_t branch. 2006-02-15 17:27:45 +00:00
Neal Norwitz fc76d633e8 - Patch #1400181, fix unicode string formatting to not use the locale.
This is how string objects work.  u'%f' could use , instead of .
  for the decimal point.  Now both strings and unicode always use periods.

This is the code that would break:

import locale
locale.setlocale(locale.LC_NUMERIC, 'de_DE')
u'%.1f' % 1.0
assert '1.0' == u'%.1f' % 1.0

I couldn't create a test case which fails, but this fixes the problem.

Will backport.
2006-01-10 06:03:13 +00:00
Neal Norwitz d43069ce95 Fix icc warnings: remove (sometimes) unused variable conditionally 2006-01-08 01:12:10 +00:00
Martin v. Löwis dea59e5755 Stop maintaining the buildno file.
Also, stop determining Unicode sizes with PyString_GET_SIZE.
2006-01-05 10:00:36 +00:00
Hye-Shik Chang 835b243c71 Bug #1379994: Fix *unicode_escape codecs to encode r'\' as r'\\'
just like string codecs.
2005-12-17 04:38:31 +00:00
Jeremy Hylton af68c874a6 Add const to several API functions that take char *.
In C++, it's an error to pass a string literal to a char* function
without a const_cast().  Rather than require every C++ extension
module to put a cast around string literals, fix the API to state the
const-ness.

I focused on parts of the API where people usually pass literals:
PyArg_ParseTuple() and friends, Py_BuildValue(), PyMethodDef, the type
slots, etc.  Predictably, there were a large set of functions that
needed to be fixed as a result of these changes.  The most pervasive
change was to make the keyword args list passed to
PyArg_ParseTupleAndKewords() to be a const char *kwlist[].

One cast was required as a result of the changes:  A type object
mallocs the memory for its tp_doc slot and later frees it.
PyTypeObject says that tp_doc is const char *; but if the type was
created by type_new(), we know it is safe to cast to char *.
2005-12-10 18:50:16 +00:00
Walter Dörwald d4fff1731c Fix leaked reference to None. 2005-11-28 22:15:56 +00:00
Andrew M. Kuchling 8294de5673 Another comment typo fix 2005-11-02 16:36:12 +00:00
Walter Dörwald 2e2c02fedb Fix typo in comment. 2005-11-02 08:57:11 +00:00
Fred Drake db390c1ad8 fix typos, mostly in comments 2005-10-28 14:39:47 +00:00
Michael W. Hudson b2308bb9be Fix bug:
[ 1327110 ] wrong TypeError traceback in generator expressions

by removing the code that can stomp on the users' TypeError raised by the
iterable argument to ''.join() -- PySequence_Fast (now?) gives a perfectly
reasonable message itself.  Also, a couple of tests.
2005-10-21 11:45:01 +00:00
Marc-André Lemburg 5c4a9d6591 Whitespace corrections. 2005-10-19 22:39:02 +00:00
Marc-André Lemburg e115ec832c Bug fix for [ 1331062 ] utf 7 codec broken.
Backport candidate.
2005-10-19 22:33:31 +00:00
Walter Dörwald d1c1e10f70 Part of SF patch #1313939: Speedup charmap decoding by extending
PyUnicode_DecodeCharmap() the accept a unicode string as the mapping
argument which is used as a mapping table.

This code isn't used by any of the codecs yet.
2005-10-06 20:29:57 +00:00
Walter Dörwald a47d1c08d0 SF bug #1251300: On UCS-4 builds the "unicode-internal" codec will now complain
about illegal code points. The codec now supports PEP 293 style error handlers.
(This is a variant of the Nik Haldimann's patch that detects truncated data)
2005-08-30 10:23:14 +00:00
Marc-André Lemburg a9cadcd41b Correct the handling of 0-termination of PyUnicode_AsWideChar()
and its usage in PyLocale_strcoll().

Clarify the documentation on this.

Thanks to Andreas Degert for pointing this out.
2004-11-22 13:02:31 +00:00
Marc-André Lemburg 204bd6d9d2 Applied patch for [ 1047269 ] Buffer overwrite in PyUnicode_AsWideChar.
Python 2.3.x candidate.
2004-10-15 07:45:05 +00:00
Skip Montanaro 6543b45b0c Initialize sep and seplen to suppress warning from gcc. 2004-09-16 03:28:13 +00:00
Thomas Heller ca0d2cb66e Add a missing line continuation character. 2004-09-15 11:41:32 +00:00
Walter Dörwald 065a32f550 Make the hint about the None default less ambiguous. 2004-09-14 09:45:10 +00:00
Walter Dörwald 782afc5927 Enhance the docstrings for unicode.split() and string.split()
to make it clear that it is possible to pass None as the
separator argument to get the default "any whitespace" separator.
2004-09-14 09:40:45 +00:00
Walter Dörwald 69652035bc SF patch #998993: The UTF-8 and the UTF-16 stateful decoders now support
decoding incomplete input (when the input stream is temporarily exhausted).
codecs.StreamReader now implements buffering, which enables proper
readline support for the UTF-16 decoders. codecs.StreamReader.read()
has a new argument chars which specifies the number of characters to
return. codecs.StreamReader.readline() and codecs.StreamReader.readlines()
have a new argument keepends. Trailing "\n"s will be stripped from the lines
if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and
PyUnicode_DecodeUTF16Stateful.
2004-09-07 20:24:22 +00:00
Tim Peters 91879ab8ea PyUnicode_Join(): Bozo Alert. While this is chugging along, it may
need to convert str objects from the iterable to unicode.  So, if
someone set the system default encoding to something nasty enough,
the conversion process could mutate the input iterable as a side
effect, and PySequence_Fast doesn't hide that from us if the input was
a list.  IOW, can't assume the size of PySequence_Fast's result is
invariant across PyUnicode_FromObject() calls.
2004-08-27 22:35:44 +00:00
Tim Peters 05eba1fdc8 PyUnicode_Join(): Rewrote to use PySequence_Fast(). This doesn't do
much to reduce the size of the code, but greatly improves its clarity.
It's also quicker in what's probably the most common case (the argument
iterable is a list).  Against it, if the iterable isn't a list or a tuple,
a temp tuple is materialized containing the entire input sequence, and
that's a bigger temp memory burden.  Yawn.
2004-08-27 21:32:02 +00:00
Tim Peters 894c512c2f PyUnicode_Join(): Missed a spot where I intended a cast from size_t to
int.  I sure wish MS would gripe about that!  Whatever, note that the
statement above it guarantees that the cast loses no info.
2004-08-27 05:08:36 +00:00
Tim Peters 8ce9f16259 PyUnicode_Join(): Two primary aims:
1. u1.join([u2]) is u2
2. Be more careful about C-level int overflow.

Since PySequence_Fast() isn't needed to achieve #1, it's not used -- but
the code could sure be simpler if it were.
2004-08-27 01:49:32 +00:00
Hye-Shik Chang e9ddfbb412 SF #989185: Drop unicode.iswide() and unicode.width() and add
unicodedata.east_asian_width().  You can still implement your own
simple width() function using it like this:
    def width(u):
        w = 0
        for c in unicodedata.normalize('NFC', u):
            cwidth = unicodedata.east_asian_width(c)
            if cwidth in ('W', 'F'): w += 2
            else: w += 1
        return w
2004-08-04 07:38:35 +00:00
Marc-André Lemburg d25c650461 Let u'%s' % obj try obj.__unicode__() first and fallback to obj.__str__(). 2004-07-23 16:13:25 +00:00
Nicholas Bastin 9ba301e589 Moved SunPro warning suppression into pyport.h and out of individual
modules and objects.
2004-07-15 15:54:05 +00:00
Marc-André Lemburg 126b44cd41 Fix a copy&paste typo. 2004-07-10 12:04:20 +00:00
Marc-André Lemburg 1dffb120b7 .encode()/.decode() patch part 2. 2004-07-08 19:13:55 +00:00
Marc-André Lemburg d2d4598ec2 Allow string and unicode return types from .encode()/.decode()
methods on string and unicode objects. Added unicode.decode()
which was missing for no apparent reason.
2004-07-08 17:57:32 +00:00
Nicholas Bastin 1ce9e4cfc1 Fixed end-of-loop code not reached warning when using SunPro C 2004-06-17 18:27:18 +00:00
Hye-Shik Chang 974ed7cfa5 - SF #962502: Add two more methods for unicode type; width() and
iswide() for east asian width manipulation. (Inspired by David
Goodger, Reviewed by Martin v. Loewis)
- Move _PyUnicode_TypeRecord.flags to the end of the struct so that
no padding is added for UCS-4 builds. (Suggested by Martin v. Loewis)
2004-06-02 16:49:17 +00:00
Hye-Shik Chang 4057483164 SF Patch #926375: Remove a useless UTF-16 support code that is never
been used. (Suggested by Martin v. Loewis)
2004-04-06 07:24:51 +00:00
Walter Dörwald cd736e71a3 Fix reallocation bug in unicode.translate(): The code was comparing
characters instead of character pointers to determine space requirements.
2004-02-05 17:36:00 +00:00
Hye-Shik Chang 1bc09b7c2a Cosmetic fix for wrongly indented tabs with ts=4. 2004-01-03 19:35:43 +00:00
Hye-Shik Chang 7fc4cf57b8 Fix unicode.rsplit()'s bug that ignores separater on the end of string when
using specialized splitter for 1 char sep.
2003-12-23 09:10:16 +00:00
Hye-Shik Chang 40e9509dc7 Fix broken xmlcharrefreplace by rev 2.204.
(Pointy hat goes to perky)
2003-12-22 01:31:13 +00:00
Hye-Shik Chang 4a264fb054 SF #859573: Reduce compiler warnings on gcc 3.2 and above. 2003-12-19 01:59:56 +00:00
Hye-Shik Chang 3ae811b57d Add rsplit method for str and unicode builtin types.
SF feature request #801847.
Original patch is written by Sean Reifschneider.
2003-12-15 18:49:53 +00:00
Guido van Rossum 6c9e130524 - Removed FutureWarnings related to hex/oct literals and conversions
and left shifts.  (Thanks to Kalle Svensson for SF patch 849227.)
  This addresses most of the remaining semantic changes promised by
  PEP 237, except for repr() of a long, which still shows the trailing
  'L'.  The PEP appears to promise warnings for operations that
  changed semantics compared to Python 2.3, but this is not
  implemented; we've suffered through enough warnings related to
  hex/oct literals and I think it's best to be silent now.
2003-11-29 23:52:13 +00:00
Raymond Hettinger 4f8f976576 Add optional fillchar argument to ljust(), rjust(), and center() string methods. 2003-11-26 08:21:35 +00:00
Walter Dörwald 4894c30626 Fix a bug in the memory reallocation code of PyUnicode_TranslateCharmap().
charmaptranslate_makespace() allocated more memory than required for the
next replacement but didn't remember that fact, so memory size was growing
exponentially every time a replacement string is longer that one character.
This fixes SF bug #828737.
2003-10-24 14:25:28 +00:00
Martin v. Löwis 6828e18a6a Patch #825679: Clarify semantics of .isfoo on empty strings.
Backported to 2.3.
2003-10-18 09:55:08 +00:00
Jeremy Hylton 504de6bd2c Fix for SF bug [ 817156 ] invalid \U escape gives 0=length unistr. 2003-10-06 05:08:26 +00:00
Tim Peters ced69f8a20 On c.l.py, Martin v. Löwis said that Py_UNICODE could be of a signed type,
so fiddle Jeremy's fix to live with that.  Also added more comments.

Bugfix candidate (this bug is in all versions of Python, at least since
2.1).
2003-09-16 20:30:58 +00:00
Jeremy Hylton d808279be3 Double-fix of crash in Unicode freelist handling.
If a length-1 Unicode string was in the freelist and it was
uninitialized or pointed to a very large (magnitude) negative number,
the check

	 unicode_latin1[unicode->str[0]] == unicode

could cause a segmentation violation, e.g. unicode->str[0] is 0xcbcbcbcb.

Fix this in two ways:

1. Change guard befor unicode_latin1[] to test against 256U.  If I
   understand correctly, the unsigned long used to store UCS4 on my
   box was getting converted to a signed long to compare with the
   signed constant 256.

2. Change _PyUnicode_New() to make sure the first element of str is
   always initialized to zero.  There are several places in the code
   where the caller can exit with an error before initializing any
   of str, which would leave junk in str[0].

Also, silence a compiler warning on pointer vs. int arithmetic.

Bug fix candidate.
2003-09-16 19:41:39 +00:00
Jeremy Hylton deb2dc6658 Change checks of PyUnicode_Resize() return value for clarity.
The unicode_resize() family only returns -1 or 0 so simply checking
for != 0 is sufficient, but somewhat unclear.  Many Python API
functions return < 0 on error, reserving the right to return 0 or 1 on
success.  Change the call sites for consistency with these calls.
2003-09-16 03:41:45 +00:00
Raymond Hettinger 9bfe533c69 SF bug #795506: Wrong handling of string format code for float values.
Adding missing support for '%F'.

Will backport to 2.3.1.
2003-08-27 04:55:52 +00:00
Walter Dörwald 150523efa5 Fix refcounting leak in charmaptranslate_lookup() 2003-08-15 16:52:19 +00:00
Walter Dörwald 9b30f206ee Fix another refcounting leak in PyUnicode_EncodeCharmap(). 2003-08-15 16:26:34 +00:00
Walter Dörwald d4ade0885c Fix another refcounting leak (in PyUnicode_DecodeUnicodeEscape()). 2003-08-15 15:00:26 +00:00
Walter Dörwald e5402fb340 Fix refcount leak in PyUnicode_EncodeCharmap(). The bug surfaces
when an encoding error occurs and the callback name is unknown,
i.e. when the callback has to be called. The problem was that
the fact that the callback has already been looked up was only
recorded in a local variable in charmap_encoding_error(), because
charmap_encoding_error() got it's own copy of the errorHandler
pointer instead of a pointer to the pointer in
PyUnicode_EncodeCharmap().
2003-08-14 20:25:29 +00:00
Mark Hammond 0ccda1ee10 Support 'mbcs' as a 'built-in' encoding, so the C API can use it without
defering to the encodings package.
As described in [ 763111 ] mbcs encoding should skip encodings package
2003-07-01 00:13:27 +00:00
Raymond Hettinger f466793fcc SF patch 703666: Several objects don't decref tmp on failure in subtype_new
Submitted By: Christopher A. Craig

Fillin some missing decrefs.
2003-06-28 20:04:25 +00:00
Martin v. Löwis 9a3a9f7791 Consider \U-escapes in raw-unicode-escape. Fixes #444514. 2003-05-18 12:31:09 +00:00
Neal Norwitz ffe33b7f24 Attempt to make all the various string *strip methods the same.
* Doc - add doc for when functions were added
 * UserString
 * string object methods
 * string module functions
'chars' is used for the last parameter everywhere.

These changes will be backported, since part of the changes
have already been made, but they were inconsistent.
2003-04-10 22:35:32 +00:00
Guido van Rossum a7132189d2 Reformat a few docstrings that caused line wraps in help() output. 2003-04-09 19:32:45 +00:00
Walter Dörwald 44f527fea4 Change formatchar(), so that u"%c" % 0xffffffff now raises
an OverflowError instead of a TypeError to be consistent
with "%c" % 256. See SF patch #710127.
2003-04-02 16:37:24 +00:00
Raymond Hettinger c8df5780e1 Sf patch #700047: unicode object leaks refcount on resizing
Contributed by Hye-Shik Chang.
2003-03-09 07:30:43 +00:00
Neal Norwitz ec74f2fda7 Add more missing PyErr_NoMemory() after failled memory allocs 2003-02-11 23:05:40 +00:00
Walter Dörwald f6b56aecad Fix two refcounting bugs 2003-02-09 23:42:56 +00:00
Walter Dörwald 2e0b18af30 Change the treatment of positions returned by PEP293
error handers in the Unicode codecs: Negative
positions are treated as being relative to the end of
the input and out of bounds positions result in an
IndexError.

Also update the PEP and include an explanation of
this in the documentation for codecs.register_error.

Fixes a small bug in iconv_codecs: if the position
from the callback is negative *add* it to the size
instead of substracting it.

From SF patch #677429.
2003-01-31 17:19:08 +00:00
Guido van Rossum 5d9113d8be Implement appropriate __getnewargs__ for all immutable subclassable builtin
types.  The special handling for these can now be removed from save_newobj().
Add some testing for this.

Also add support for setting the 'fast' flag on the Python Pickler class,
which suppresses use of the memo.
2003-01-29 17:58:45 +00:00
Walter Dörwald adc727490b Fix charmapencode_lookup(), so that a None value in the mapping
is treated as "character maps to <undefined>" and not as
"character mapping must return integer, None or str".
2003-01-08 22:01:33 +00:00
Walter Dörwald 034d97605d Remove variable owned from PyUnicode_FromEncodedObject, which is unused
(except for Py_DECREF calls) since the introduction of __unicode__.
2003-01-08 20:38:39 +00:00
Marc-André Lemburg 79f57833f3 Patch for bug #659709: bogus computation of float length
Python 2.2.x backport candidate. (This bug has been around since
Python 1.6.)
2002-12-29 19:44:06 +00:00
Neil Schemenauer ce30bc9f49 Add nb_remainder (i.e. __mod__) slot to unicode type. Fixes SF bug #615506. 2002-11-18 16:10:18 +00:00
Neal Norwitz 80a1bf4b5d Fix SF # 635969, No error "not all arguments converted"
When mwh added extended slicing, strings and unicode became mappings.
Thus, dict was set which prevented an error when doing:
	newstr = 'format without a percent' % string_value

This fix raises an exception again when there are no formats
and % with a string value.
2002-11-12 23:01:12 +00:00
Marc-André Lemburg 9cd87aaa54 Fix for bug #626172: crash using unicode latin1 single char
Python 2.2.3 candidate.
2002-10-23 09:02:46 +00:00
Guido van Rossum 049cd6b563 Fix a nasty endcase reported by Armin Rigo in SF bug 618623:
'%2147483647d' % -123 segfaults.  This was because an integer overflow
in a comparison caused the string resize to be skipped.  After fixing
the overflow, this could call _PyString_Resize() with a negative size,
so I (1) test for that and raise MemoryError instead; (2) also added a
test for negative newsize to _PyString_Resize(), raising SystemError
as for all bad arguments.

An identical bug existed in unicodeobject.c, of course.

Will backport to 2.2.2.
2002-10-11 00:43:48 +00:00
Marc-André Lemburg 24e53b6d91 Add cast to avoid compiler warning. 2002-09-24 09:32:14 +00:00
Neal Norwitz a0378e1eda Fix part of SF bug # 544248 gcc warning in unicodeobject.c
When --enable-unicode=ucs4, need to cast Py_UNICODE to a char
2002-09-13 13:47:06 +00:00
Guido van Rossum efc1188239 Fix warnings on 64-bit platforms about casts from pointers to ints.
Two of these were real bugs.
2002-09-12 14:43:41 +00:00
Walter Dörwald 5c1ee17742 Change the unicode.translate docstring to document that
Unicode strings (with arbitrary length) are allowed
as entries in the unicode.translate mapping.

Add a test case for multicharacter replacements.

(Multicharacter replacements were enabled by the
PEP 293 patch)
2002-09-04 20:31:32 +00:00
Walter Dörwald 3aeb632c31 PEP 293 implemention (from SF patch http://www.python.org/sf/432401) 2002-09-02 13:14:32 +00:00
Guido van Rossum 2023c9b84a Fix SF bug 599128, submitted by Inyeol Lee: .replace() would do the
wrong thing for a unicode subclass when there were zero string
replacements.  The example given in the SF bug report was only one way
to trigger this; replacing a string of length >= 2 that's not found is
another.  The code would actually write outside allocated memory if
replacement string was longer than the search string.

(I wonder how many more of these are lurking?  The unicode code base
is full of wonders.)

Bugfix candidate; this same bug is present in 2.2.1.
2002-08-23 18:50:21 +00:00
Guido van Rossum 8b1a6d694f Code by Inyeol Lee, submitted to SF bug 595350, to implement
the string/unicode method .replace() with a zero-lengt first argument.
Inyeol contributed tests for this too.
2002-08-23 18:21:28 +00:00
Guido van Rossum 76afbd9aa4 Fix some endcase bugs in unicode rfind()/rindex() and endswith().
These were reported and fixed by Inyeol Lee in SF bug 595350.  The
endswith() bug was already fixed in 2.3, but this adds some more test
cases.
2002-08-20 17:29:29 +00:00
Guido van Rossum 54df53a352 More changes of DeprecationWarning to FutureWarning. 2002-08-14 18:38:27 +00:00
Marc-André Lemburg cc8764ca9d Add C API PyUnicode_FromOrdinal() which exposes unichr() at C level.
u'%c' will now raise a ValueError in case the argument is an
integer outside the valid range of Unicode code point ordinals.

Closes SF bug #593581.
2002-08-11 12:23:04 +00:00
Guido van Rossum 078151da90 Implement stage B0 of PEP 237: add warnings for operations that
currently return inconsistent results for ints and longs; in
particular: hex/oct/%u/%o/%x/%X of negative short ints, and x<<n that
either loses bits or changes sign.  (No warnings for repr() of a long,
though that will also change to lose the trailing 'L' eventually.)

This introduces some warnings in the test suite; I'll take care of
those later.
2002-08-11 04:24:12 +00:00
Guido van Rossum f36921c4b0 Unicode replace() method with empty pattern argument should fail, like
it does for 8-bit strings.
2002-08-09 15:36:48 +00:00
Barry Warsaw 6a043f3fe8 PyUnicode_Contains(): The memcmp() call didn't take into account the
width of Py_UNICODE.  Good catch, MAL.
2002-08-06 19:03:17 +00:00
Barry Warsaw 817918cc3c Committing patch #591250 which provides "str1 in str2" when str1 is a
string of longer than 1 character.
2002-08-06 16:58:21 +00:00
Skip Montanaro 35b37a5c11 tighten up the unicode object's docstring a tad 2002-07-26 16:22:46 +00:00
Jeremy Hylton 938ace69a0 staticforward bites the dust.
The staticforward define was needed to support certain broken C
compilers (notably SCO ODT 3.0, perhaps early AIX as well) botched the
static keyword when it was used with a forward declaration of a static
initialized structure.  Standard C allows the forward declaration with
static, and we've decided to stop catering to broken C compilers.  (In
fact, we expect that the compilers are all fixed eight years later.)

I'm leaving staticforward and statichere defined in object.h as
static.  This is only for backwards compatibility with C extensions
that might still use it.

XXX I haven't updated the documentation.
2002-07-17 16:30:39 +00:00
Martin v. Löwis 6238d2b024 Patch #569753: Remove support for WIN16.
Rename all occurrences of MS_WIN32 to MS_WINDOWS.
2002-06-30 15:26:10 +00:00
Neal Norwitz 20e72130c4 Fix typo in exception message 2002-06-13 21:25:17 +00:00
Martin v. Löwis 14f8b4cfcb Patch #568124: Add doc string macros. 2002-06-13 20:33:02 +00:00
Michael W. Hudson 5efaf7eac8 This is my nearly two year old patch
[ 400998 ] experimental support for extended slicing on lists

somewhat spruced up and better tested than it was when I wrote it.

Includes docs & tests.  The whatsnew section needs expanding, and arrays
should support extended slices -- later.
2002-06-11 10:55:12 +00:00
Marc-André Lemburg 4164439240 Fix a possible segfault. Found be Neal Norvitz. 2002-05-29 13:46:29 +00:00
Marc-André Lemburg 4da6fd63bc Fix for bug [ 561796 ] string.find causes lazy error 2002-05-29 11:33:13 +00:00
Guido van Rossum cacfc07d08 - A new type object, 'string', is added. This is a common base type
for 'str' and 'unicode', and can be used instead of
  types.StringTypes, e.g. to test whether something is "a string":
  isinstance(x, string) is True for Unicode and 8-bit strings.  This
  is an abstract base class and cannot be instantiated directly.
2002-05-24 19:01:59 +00:00
Raymond Hettinger 0ebac97058 Patch 549187. Improve string formatting error message. 2002-05-21 15:14:57 +00:00
Tim Peters 5de9842b34 Repair widespread misuse of _PyString_Resize. Since it's clear people
don't understand how this function works, also beefed up the docs.  The
most common usage error is of this form (often spread out across gotos):

	if (_PyString_Resize(&s, n) < 0) {
		Py_DECREF(s);
		s = NULL;
		goto outtahere;
	}

The error is that if _PyString_Resize runs out of memory, it automatically
decrefs the input string object s (which also deallocates it, since its
refcount must be 1 upon entry), and sets s to NULL.  So if the "if"
branch ever triggers, it's an error to call Py_DECREF(s):  s is already
NULL!  A correct way to write the above is the simpler (and intended)

	if (_PyString_Resize(&s, n) < 0)
		goto outtahere;

Bugfix candidate.
2002-04-27 18:44:32 +00:00
Tim Peters 602f740bc2 SF patch 549375: Compromise PyUnicode_EncodeUTF8
This implements ideas from Marc-Andre, Martin, Guido and me on Python-Dev.

"Short" Unicode strings are encoded into a "big enough" stack buffer,
then exactly as much string space as they turn out to need is allocated
at the end.  This should have speed benefits akin to Martin's "measure
once, allocate once" strategy, but without needing a distinct measuring
pass.

"Long" Unicode strings allocate as much heap space as they could possibly
need (4 x # Unicode chars), and do a realloc at the end to return the
untouched excess.  Since the overallocation is likely to be substantial,
this shouldn't burden the platform realloc with unusably small excess
blocks.

Also simplified uses of the PyString_xyz functions.  Also added a release-
build check that 4*size doesn't overflow a C int.  Sooner or later, that's
going to happen.
2002-04-27 18:03:26 +00:00
Tim Peters 030a5cebf4 unicode_memchr(): Squashed gratuitous int-vs-size_t mismatch (which
gives a compiler wng under MSVC because of the resulting signed-vs-
unsigned comparison).
2002-04-22 19:00:10 +00:00
Walter Dörwald de02bcb265 Apply patch diff.txt from SF feature request
http://www.python.org/sf/444708

This adds the optional argument for str.strip
to unicode.strip too and makes it possible
to call str.strip with a unicode argument
and unicode.strip with a str argument.
2002-04-22 17:42:37 +00:00
Tim Peters 0eca65c4c5 PyUnicode_EncodeUTF8(): tightened the memory asserts a bit, and at least
tried to catch some possible arithmetic overflows in the debug build.
2002-04-21 17:28:06 +00:00
Martin v. Löwis 2a7ff35a07 Back out 2.140. 2002-04-21 09:59:45 +00:00
Tim Peters 7e3d961fc1 PyUnicode_EncodeUTF8: squash compiler wng. The difference of two
pointers is a signed type.  Changing "allocated" to a signed int makes
undetected overflow more likely, but there was no overflow detection
before either.
2002-04-21 03:26:37 +00:00
Martin v. Löwis a4eb14b7a4 Patch #495401: Count number of required bytes for encoding UTF-8 before
allocating the target buffer.
2002-04-20 13:44:01 +00:00
Walter Dörwald 0fe940c862 Return the orginal string only if it's a real str or unicode
instance, otherwise make a copy.
2002-04-15 18:42:15 +00:00
Walter Dörwald 068325ef92 Apply the second version of SF patch http://www.python.org/sf/536241
Add a method zfill to str, unicode and UserString and change
Lib/string.py accordingly.

This activates the zfill version in unicodeobject.c that was
commented out and implements the same in stringobject.c. It also
adds the test for unicode support in Lib/string.py back in and
uses repr() instead() of str() (as it was before Lib/string.py 1.62)
2002-04-15 13:36:47 +00:00
Neil Schemenauer 58aa861fa2 Remove PyMalloc_*. 2002-04-12 03:07:20 +00:00
Marc-André Lemburg 68e69338ae Bug fix for UTF-8 encoding bug (buffer overrun) #541828. 2002-04-10 20:36:13 +00:00
Marc-André Lemburg ce0b664af2 Added test case for UTF-8 encoding bug #541828. 2002-04-10 17:18:02 +00:00
Guido van Rossum 77f6a65eb0 Add the 'bool' type and its values 'False' and 'True', as described in
PEP 285.  Everything described in the PEP is here, and there is even
some documentation.  I had to fix 12 unit tests; all but one of these
were printing Boolean outcomes that changed from 0/1 to False/True.
(The exception is test_unicode.py, which did a type(x) == type(y)
style comparison.  I could've fixed that with a single line using
issubtype(x, type(y)), but instead chose to be explicit about those
places where a bool is expected.

Still to do: perhaps more documentation; change standard library
modules to return False/True from predicates.
2002-04-03 22:41:51 +00:00
Walter Dörwald 8c077227f2 Fix whitespace. 2002-03-25 11:16:18 +00:00
Neil Schemenauer dcc819a5c9 Use pymalloc if it's enabled. 2002-03-22 15:33:15 +00:00
Martin v. Löwis 047c05ebc4 Do not insert characters for unicode-escape decoders if the error mode
is "ignore". Fixes #529104.
2002-03-21 08:55:28 +00:00
Andrew MacIntyre 5e9c80d906 %#x/%#X format conversion cleanup (see patch #450267):
Objects/
    stringobject.c
    unicodeobject.c
2002-02-28 11:38:24 +00:00
Andrew MacIntyre c487439aa7 OS/2 EMX port changes (Objects part of patch #450267):
Objects/
    fileobject.c
    stringobject.c
    unicodeobject.c

This commit doesn't include the cleanup patches for stringobject.c and
unicodeobject.c which are shown separately in the patch manager.  Those
patches will be regenerated and applied in a subsequent commit, so as
to preserve a fallback position (this commit to those files).
2002-02-26 11:36:35 +00:00
Marc-André Lemburg bd3be8f0ca Fix to the UTF-8 encoder: it failed on 0-length input strings.
Fix for the UTF-8 decoder: it will now accept isolated surrogates
(previously it raised an exception which causes round-trips to
fail).

Added new tests for UTF-8 round-trip safety (we rely on UTF-8 for
marshalling Unicode objects, so we better make sure it works for
all Unicode code points, including isolated surrogates).

Bumped the PYC magic in a non-standard way -- please review. This
was needed because the old PYC format used illegal UTF-8 sequences
for isolated high surrogates which now raise an exception.
2002-02-07 11:33:49 +00:00
Marc-André Lemburg dc724d6e35 Cosmetics. 2002-02-06 18:20:19 +00:00
Marc-André Lemburg e7c6ee4b8a Whitespace fixes. 2002-02-06 18:18:03 +00:00
Marc-André Lemburg 3688a882d3 Fix for the UTF-8 memory allocation bug and the UTF-8 encoding
bug related to lone high surrogates.
2002-02-06 18:09:02 +00:00
Guido van Rossum 604ddf80d8 Fix for #489669 (Neil Norwitz): memory leak in test_descr (unicode).
This is best reproduced by

  while 1:
      class U(unicode):
          pass
      U(u"xxxxxx")

The unicode_dealloc() code wasn't properly freeing the str and defenc
fields of the Unicode object when freeing a subtype instance.  Fixed
this by a subtle refactoring that actually reduces the amount of code
slightly.
2001-12-06 20:03:56 +00:00
Barry Warsaw e5c492d72a formatfloat(), formatint(): Conversion of sprintf() to PyOS_snprintf()
for buffer overrun avoidance.
2001-11-28 21:00:41 +00:00
Marc-André Lemburg 11326de657 Fix for bug #485951: repr diff between string and unicode. 2001-11-28 12:56:20 +00:00
Marc-André Lemburg 72f8213ba4 Fix for bug #438164: %-formatting using Unicode objects.
This patch also does away with an incompatibility between Jython
and CPython.
2001-11-20 15:18:49 +00:00
Marc-André Lemburg b5507ecd3c Additional test and documentation for the unicode() changes.
This patch should also be applied to the 2.2b1 trunk.
2001-10-19 12:02:29 +00:00
Guido van Rossum b8c65bc27f SF patch #470578: Fixes to synchronize unicode() and str()
This patch implements what we have discussed on python-dev late in
    September: str(obj) and unicode(obj) should behave similar, while
    the old behaviour is retained for unicode(obj, encoding, errors).

    The patch also adds a new feature with which objects can provide
    unicode(obj) with input data: the __unicode__ method. Currently no
    new tp_unicode slot is implemented; this is left as option for the
    future.

    Note that PyUnicode_FromEncodedObject() no longer accepts Unicode
    objects as input. The API name already suggests that Unicode
    objects do not belong in the list of acceptable objects and the
    functionality was only needed because
    PyUnicode_FromEncodedObject() was being used directly by
    unicode(). The latter was changed in the discussed way:

    * unicode(obj) calls PyObject_Unicode()
    * unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject()

    One thing left open to discussion is whether to leave the
    PyUnicode_FromObject() API as a thin API extension on top of
    PyUnicode_FromEncodedObject() or to turn it into a (macro) alias
    for PyObject_Unicode() and deprecate it. Doing so would have some
    surprising consequences though, e.g.  u"abc" + 123 would turn out
    as u"abc123"...

[Marc-Andre didn't have time to check this in before the deadline.  I
hope this is OK, Marc-Andre!  You can still make changes and commit
them on the trunk after the branch has been made, but then please mail
Barry a context diff if you want the change to be merged into the
2.2b1 release branch.  GvR]
2001-10-19 02:01:31 +00:00
Guido van Rossum 9475a2310d Enable GC for new-style instances. This touches lots of files, since
many types were subclassable but had a xxx_dealloc function that
called PyObject_DEL(self) directly instead of deferring to
self->ob_type->tp_free(self).  It is permissible to set tp_free in the
type object directly to _PyObject_Del, for non-GC types, or to
_PyObject_GC_Del, for GC types.  Still, PyObject_DEL was a tad faster,
so I'm fearing that our pystone rating is going down again.  I'm not
sure if doing something like

void xxx_dealloc(PyObject *self)
{
	if (PyXxxCheckExact(self))
		PyObject_DEL(self);
	else
		self->ob_type->tp_free(self);
}

is any faster than always calling the else branch, so I haven't
attempted that -- however those types whose own dealloc is fancier
(int, float, unicode) do use this pattern.
2001-10-05 20:51:39 +00:00
Guido van Rossum ad9744a67a Fix a bug in rendering of \\ by repr() -- it rendered as \\\ instead
of \\.
2001-09-21 15:38:17 +00:00
Marc-André Lemburg 3508e30861 Fix Unicode .join() method to raise a TypeError for sequence
elements which are not Unicode objects or strings. (This matches
the string.join() behaviour.)

Fix a memory leak in the .join() method which occurs in case
the Unicode resize fails.

Restore the test_unicode output.
2001-09-20 17:22:58 +00:00
Marc-André Lemburg 6871f6ac57 Implement the changes proposed in patch #413333. unicode(obj) now
works just like str(obj) in that it tries __str__/tp_str on the object
in case it finds that the object is not a string or buffer.
2001-09-20 12:53:16 +00:00
Marc-André Lemburg c60e6f7771 Patch #435971: UTF-7 codec by Brian Quinlan. 2001-09-20 10:35:46 +00:00
Tim Peters af90b3e610 str_subtype_new, unicode_subtype_new:
+ These were leaving the hash fields at 0, which all string and unicode
  routines believe is a legitimate hash code.  As a result, hash() applied
  to str and unicode subclass instances always returned 0, which in turn
  confused dict operations, etc.
+ Changed local names "new"; no point to antagonizing C++ compilers.
2001-09-12 05:18:58 +00:00
Tim Peters 7a29bd5861 More on bug 460020: disable many optimizations of unicode subclasses. 2001-09-12 03:03:31 +00:00
Tim Peters 78e0fc74bc Possibly the end of SF [#460020] bug or feature: unicode() and subclasses.
Changed unicode(i) to return a true Unicode object when i is an instance of
a unicode subclass.  Added PyUnicode_CheckExact macro.
2001-09-11 03:07:38 +00:00
Tim Peters 0ebeb584a4 PyUnicode_FromEncodedObject(): Repair memory leak in an error case. 2001-09-11 02:00:50 +00:00
Guido van Rossum e023fe0eef Make unicode subclassable. 2001-08-30 03:12:59 +00:00
Martin v. Löwis e3eb1f2b23 Patch #427190: Implement and use METH_NOARGS and METH_O. 2001-08-16 13:15:00 +00:00
Tim Peters 772747b3f1 SF patch #438013 Remove 2-byte Py_UCS2 assumptions
Removed all instances of Py_UCS2 from the codebase, and so also (I hope)
the last remaining reliance on the platform having an integral type
with exactly 16 bits.
PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write
one byte at a time.
2001-08-09 22:21:55 +00:00
Tim Peters 6d6c1a35e0 Merge of descr-branch back into trunk. 2001-08-02 04:15:00 +00:00
Jeremy Hylton 3ce45389bd Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h.
And remove all the extern decls in the middle of .c files.
Apparently, it was excluded from the header file because it is
intended for internal use by the interpreter.  It's still intended for
internal use and documented as such in the header file.
2001-07-30 22:34:24 +00:00
Marc-André Lemburg 80d1dd5f3b Fix for bug #444493: u'\U00010001' segfaults with current CVS on
wide builds.
2001-07-25 16:05:59 +00:00
Marc-André Lemburg 6c6bfb7c70 Make the unicode-escape and the UTF-16 codecs handle surrogates
correctly and thus roundtrip-safe.

Some minor cleanups of the code.

Added tests for the roundtrip-safety.
2001-07-20 17:39:11 +00:00
Guido van Rossum 0d42e0c54a #ifdef out generation of \U escapes unless Py_UNICODE_WIDE. This
#caused warnings with the VMS C compiler.  (SF bug #442998, in part.)
On a narrow system the current code should never be executed since ch
will always be < 0x10000.

Marc-Andre: you may end up fixing this a different way, since I
believe you have plans to generate \U for surrogate pairs.  I'll leave
that to you.
2001-07-20 16:36:21 +00:00
Fredrik Lundh 8f4558583f use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE
tests.
2001-06-27 18:59:43 +00:00
Martin v. Löwis ce9b5a55e1 Encode surrogates in UTF-8 even for a wide Py_UNICODE.
Implement sys.maxunicode.
Explicitly wrap around upper/lower computations for wide Py_UNICODE.
When decoding large characters with UTF-8, represent expected test
results using the \U notation.
2001-06-27 06:28:56 +00:00
Martin v. Löwis ac93bc2501 When decoding UTF-16, don't assume that the buffer is in native endianness
when checking surrogates.
2001-06-26 22:43:40 +00:00
Martin v. Löwis 0ba70cc3c8 Support using UCS-4 as the Py_UNICODE type:
Add configure option --enable-unicode.
Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE,
                    SIZEOF_WCHAR_T.
Define Py_UCS2.
Encode and decode large UTF-8 characters into single Py_UNICODE values
for wide Unicode types; likewise for UTF-16.
Remove test whether sizeof Py_UNICODE is two.
2001-06-26 22:22:37 +00:00
Fredrik Lundh 1294ad0c59 experimental UCS-4 support: added USE_UCS4_STORAGE define to
unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4).
(this may be good enough for platforms that doesn't have a 16-bit
type.  the UTF-16 codecs don't work, though)
2001-06-26 17:17:07 +00:00
Fredrik Lundh 45714e9ecb experimental UCS-4 support: made compare a bit more robust, in case
sizeof(Py_UNICODE) >= sizeof(long).  also changed surrogate expansion
to work if sizeof(Py_UNICODE) > 2.
2001-06-26 16:39:36 +00:00
Fredrik Lundh 3083163dc1 experimental UCS-4 support: don't assume that MS_WIN32 implies
HAVE_USABLE_WCHAR_T
2001-06-26 15:11:00 +00:00