Commit Graph

146 Commits

Author SHA1 Message Date
Tim Peters 602f740bc2 SF patch 549375: Compromise PyUnicode_EncodeUTF8
This implements ideas from Marc-Andre, Martin, Guido and me on Python-Dev.

"Short" Unicode strings are encoded into a "big enough" stack buffer,
then exactly as much string space as they turn out to need is allocated
at the end.  This should have speed benefits akin to Martin's "measure
once, allocate once" strategy, but without needing a distinct measuring
pass.

"Long" Unicode strings allocate as much heap space as they could possibly
need (4 x # Unicode chars), and do a realloc at the end to return the
untouched excess.  Since the overallocation is likely to be substantial,
this shouldn't burden the platform realloc with unusably small excess
blocks.

Also simplified uses of the PyString_xyz functions.  Also added a release-
build check that 4*size doesn't overflow a C int.  Sooner or later, that's
going to happen.
2002-04-27 18:03:26 +00:00
Tim Peters 030a5cebf4 unicode_memchr(): Squashed gratuitous int-vs-size_t mismatch (which
gives a compiler wng under MSVC because of the resulting signed-vs-
unsigned comparison).
2002-04-22 19:00:10 +00:00
Walter Dörwald de02bcb265 Apply patch diff.txt from SF feature request
http://www.python.org/sf/444708

This adds the optional argument for str.strip
to unicode.strip too and makes it possible
to call str.strip with a unicode argument
and unicode.strip with a str argument.
2002-04-22 17:42:37 +00:00
Tim Peters 0eca65c4c5 PyUnicode_EncodeUTF8(): tightened the memory asserts a bit, and at least
tried to catch some possible arithmetic overflows in the debug build.
2002-04-21 17:28:06 +00:00
Martin v. Löwis 2a7ff35a07 Back out 2.140. 2002-04-21 09:59:45 +00:00
Tim Peters 7e3d961fc1 PyUnicode_EncodeUTF8: squash compiler wng. The difference of two
pointers is a signed type.  Changing "allocated" to a signed int makes
undetected overflow more likely, but there was no overflow detection
before either.
2002-04-21 03:26:37 +00:00
Martin v. Löwis a4eb14b7a4 Patch #495401: Count number of required bytes for encoding UTF-8 before
allocating the target buffer.
2002-04-20 13:44:01 +00:00
Walter Dörwald 0fe940c862 Return the orginal string only if it's a real str or unicode
instance, otherwise make a copy.
2002-04-15 18:42:15 +00:00
Walter Dörwald 068325ef92 Apply the second version of SF patch http://www.python.org/sf/536241
Add a method zfill to str, unicode and UserString and change
Lib/string.py accordingly.

This activates the zfill version in unicodeobject.c that was
commented out and implements the same in stringobject.c. It also
adds the test for unicode support in Lib/string.py back in and
uses repr() instead() of str() (as it was before Lib/string.py 1.62)
2002-04-15 13:36:47 +00:00
Neil Schemenauer 58aa861fa2 Remove PyMalloc_*. 2002-04-12 03:07:20 +00:00
Marc-André Lemburg 68e69338ae Bug fix for UTF-8 encoding bug (buffer overrun) #541828. 2002-04-10 20:36:13 +00:00
Marc-André Lemburg ce0b664af2 Added test case for UTF-8 encoding bug #541828. 2002-04-10 17:18:02 +00:00
Guido van Rossum 77f6a65eb0 Add the 'bool' type and its values 'False' and 'True', as described in
PEP 285.  Everything described in the PEP is here, and there is even
some documentation.  I had to fix 12 unit tests; all but one of these
were printing Boolean outcomes that changed from 0/1 to False/True.
(The exception is test_unicode.py, which did a type(x) == type(y)
style comparison.  I could've fixed that with a single line using
issubtype(x, type(y)), but instead chose to be explicit about those
places where a bool is expected.

Still to do: perhaps more documentation; change standard library
modules to return False/True from predicates.
2002-04-03 22:41:51 +00:00
Walter Dörwald 8c077227f2 Fix whitespace. 2002-03-25 11:16:18 +00:00
Neil Schemenauer dcc819a5c9 Use pymalloc if it's enabled. 2002-03-22 15:33:15 +00:00
Martin v. Löwis 047c05ebc4 Do not insert characters for unicode-escape decoders if the error mode
is "ignore". Fixes #529104.
2002-03-21 08:55:28 +00:00
Andrew MacIntyre 5e9c80d906 %#x/%#X format conversion cleanup (see patch #450267):
Objects/
    stringobject.c
    unicodeobject.c
2002-02-28 11:38:24 +00:00
Andrew MacIntyre c487439aa7 OS/2 EMX port changes (Objects part of patch #450267):
Objects/
    fileobject.c
    stringobject.c
    unicodeobject.c

This commit doesn't include the cleanup patches for stringobject.c and
unicodeobject.c which are shown separately in the patch manager.  Those
patches will be regenerated and applied in a subsequent commit, so as
to preserve a fallback position (this commit to those files).
2002-02-26 11:36:35 +00:00
Marc-André Lemburg bd3be8f0ca Fix to the UTF-8 encoder: it failed on 0-length input strings.
Fix for the UTF-8 decoder: it will now accept isolated surrogates
(previously it raised an exception which causes round-trips to
fail).

Added new tests for UTF-8 round-trip safety (we rely on UTF-8 for
marshalling Unicode objects, so we better make sure it works for
all Unicode code points, including isolated surrogates).

Bumped the PYC magic in a non-standard way -- please review. This
was needed because the old PYC format used illegal UTF-8 sequences
for isolated high surrogates which now raise an exception.
2002-02-07 11:33:49 +00:00
Marc-André Lemburg dc724d6e35 Cosmetics. 2002-02-06 18:20:19 +00:00
Marc-André Lemburg e7c6ee4b8a Whitespace fixes. 2002-02-06 18:18:03 +00:00
Marc-André Lemburg 3688a882d3 Fix for the UTF-8 memory allocation bug and the UTF-8 encoding
bug related to lone high surrogates.
2002-02-06 18:09:02 +00:00
Guido van Rossum 604ddf80d8 Fix for #489669 (Neil Norwitz): memory leak in test_descr (unicode).
This is best reproduced by

  while 1:
      class U(unicode):
          pass
      U(u"xxxxxx")

The unicode_dealloc() code wasn't properly freeing the str and defenc
fields of the Unicode object when freeing a subtype instance.  Fixed
this by a subtle refactoring that actually reduces the amount of code
slightly.
2001-12-06 20:03:56 +00:00
Barry Warsaw e5c492d72a formatfloat(), formatint(): Conversion of sprintf() to PyOS_snprintf()
for buffer overrun avoidance.
2001-11-28 21:00:41 +00:00
Marc-André Lemburg 11326de657 Fix for bug #485951: repr diff between string and unicode. 2001-11-28 12:56:20 +00:00
Marc-André Lemburg 72f8213ba4 Fix for bug #438164: %-formatting using Unicode objects.
This patch also does away with an incompatibility between Jython
and CPython.
2001-11-20 15:18:49 +00:00
Marc-André Lemburg b5507ecd3c Additional test and documentation for the unicode() changes.
This patch should also be applied to the 2.2b1 trunk.
2001-10-19 12:02:29 +00:00
Guido van Rossum b8c65bc27f SF patch #470578: Fixes to synchronize unicode() and str()
This patch implements what we have discussed on python-dev late in
    September: str(obj) and unicode(obj) should behave similar, while
    the old behaviour is retained for unicode(obj, encoding, errors).

    The patch also adds a new feature with which objects can provide
    unicode(obj) with input data: the __unicode__ method. Currently no
    new tp_unicode slot is implemented; this is left as option for the
    future.

    Note that PyUnicode_FromEncodedObject() no longer accepts Unicode
    objects as input. The API name already suggests that Unicode
    objects do not belong in the list of acceptable objects and the
    functionality was only needed because
    PyUnicode_FromEncodedObject() was being used directly by
    unicode(). The latter was changed in the discussed way:

    * unicode(obj) calls PyObject_Unicode()
    * unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject()

    One thing left open to discussion is whether to leave the
    PyUnicode_FromObject() API as a thin API extension on top of
    PyUnicode_FromEncodedObject() or to turn it into a (macro) alias
    for PyObject_Unicode() and deprecate it. Doing so would have some
    surprising consequences though, e.g.  u"abc" + 123 would turn out
    as u"abc123"...

[Marc-Andre didn't have time to check this in before the deadline.  I
hope this is OK, Marc-Andre!  You can still make changes and commit
them on the trunk after the branch has been made, but then please mail
Barry a context diff if you want the change to be merged into the
2.2b1 release branch.  GvR]
2001-10-19 02:01:31 +00:00
Guido van Rossum 9475a2310d Enable GC for new-style instances. This touches lots of files, since
many types were subclassable but had a xxx_dealloc function that
called PyObject_DEL(self) directly instead of deferring to
self->ob_type->tp_free(self).  It is permissible to set tp_free in the
type object directly to _PyObject_Del, for non-GC types, or to
_PyObject_GC_Del, for GC types.  Still, PyObject_DEL was a tad faster,
so I'm fearing that our pystone rating is going down again.  I'm not
sure if doing something like

void xxx_dealloc(PyObject *self)
{
	if (PyXxxCheckExact(self))
		PyObject_DEL(self);
	else
		self->ob_type->tp_free(self);
}

is any faster than always calling the else branch, so I haven't
attempted that -- however those types whose own dealloc is fancier
(int, float, unicode) do use this pattern.
2001-10-05 20:51:39 +00:00
Guido van Rossum ad9744a67a Fix a bug in rendering of \\ by repr() -- it rendered as \\\ instead
of \\.
2001-09-21 15:38:17 +00:00
Marc-André Lemburg 3508e30861 Fix Unicode .join() method to raise a TypeError for sequence
elements which are not Unicode objects or strings. (This matches
the string.join() behaviour.)

Fix a memory leak in the .join() method which occurs in case
the Unicode resize fails.

Restore the test_unicode output.
2001-09-20 17:22:58 +00:00
Marc-André Lemburg 6871f6ac57 Implement the changes proposed in patch #413333. unicode(obj) now
works just like str(obj) in that it tries __str__/tp_str on the object
in case it finds that the object is not a string or buffer.
2001-09-20 12:53:16 +00:00
Marc-André Lemburg c60e6f7771 Patch #435971: UTF-7 codec by Brian Quinlan. 2001-09-20 10:35:46 +00:00
Tim Peters af90b3e610 str_subtype_new, unicode_subtype_new:
+ These were leaving the hash fields at 0, which all string and unicode
  routines believe is a legitimate hash code.  As a result, hash() applied
  to str and unicode subclass instances always returned 0, which in turn
  confused dict operations, etc.
+ Changed local names "new"; no point to antagonizing C++ compilers.
2001-09-12 05:18:58 +00:00
Tim Peters 7a29bd5861 More on bug 460020: disable many optimizations of unicode subclasses. 2001-09-12 03:03:31 +00:00
Tim Peters 78e0fc74bc Possibly the end of SF [#460020] bug or feature: unicode() and subclasses.
Changed unicode(i) to return a true Unicode object when i is an instance of
a unicode subclass.  Added PyUnicode_CheckExact macro.
2001-09-11 03:07:38 +00:00
Tim Peters 0ebeb584a4 PyUnicode_FromEncodedObject(): Repair memory leak in an error case. 2001-09-11 02:00:50 +00:00
Guido van Rossum e023fe0eef Make unicode subclassable. 2001-08-30 03:12:59 +00:00
Martin v. Löwis e3eb1f2b23 Patch #427190: Implement and use METH_NOARGS and METH_O. 2001-08-16 13:15:00 +00:00
Tim Peters 772747b3f1 SF patch #438013 Remove 2-byte Py_UCS2 assumptions
Removed all instances of Py_UCS2 from the codebase, and so also (I hope)
the last remaining reliance on the platform having an integral type
with exactly 16 bits.
PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write
one byte at a time.
2001-08-09 22:21:55 +00:00
Tim Peters 6d6c1a35e0 Merge of descr-branch back into trunk. 2001-08-02 04:15:00 +00:00
Jeremy Hylton 3ce45389bd Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h.
And remove all the extern decls in the middle of .c files.
Apparently, it was excluded from the header file because it is
intended for internal use by the interpreter.  It's still intended for
internal use and documented as such in the header file.
2001-07-30 22:34:24 +00:00
Marc-André Lemburg 80d1dd5f3b Fix for bug #444493: u'\U00010001' segfaults with current CVS on
wide builds.
2001-07-25 16:05:59 +00:00
Marc-André Lemburg 6c6bfb7c70 Make the unicode-escape and the UTF-16 codecs handle surrogates
correctly and thus roundtrip-safe.

Some minor cleanups of the code.

Added tests for the roundtrip-safety.
2001-07-20 17:39:11 +00:00
Guido van Rossum 0d42e0c54a #ifdef out generation of \U escapes unless Py_UNICODE_WIDE. This
#caused warnings with the VMS C compiler.  (SF bug #442998, in part.)
On a narrow system the current code should never be executed since ch
will always be < 0x10000.

Marc-Andre: you may end up fixing this a different way, since I
believe you have plans to generate \U for surrogate pairs.  I'll leave
that to you.
2001-07-20 16:36:21 +00:00
Fredrik Lundh 8f4558583f use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE
tests.
2001-06-27 18:59:43 +00:00
Martin v. Löwis ce9b5a55e1 Encode surrogates in UTF-8 even for a wide Py_UNICODE.
Implement sys.maxunicode.
Explicitly wrap around upper/lower computations for wide Py_UNICODE.
When decoding large characters with UTF-8, represent expected test
results using the \U notation.
2001-06-27 06:28:56 +00:00
Martin v. Löwis ac93bc2501 When decoding UTF-16, don't assume that the buffer is in native endianness
when checking surrogates.
2001-06-26 22:43:40 +00:00
Martin v. Löwis 0ba70cc3c8 Support using UCS-4 as the Py_UNICODE type:
Add configure option --enable-unicode.
Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE,
                    SIZEOF_WCHAR_T.
Define Py_UCS2.
Encode and decode large UTF-8 characters into single Py_UNICODE values
for wide Unicode types; likewise for UTF-16.
Remove test whether sizeof Py_UNICODE is two.
2001-06-26 22:22:37 +00:00
Fredrik Lundh 1294ad0c59 experimental UCS-4 support: added USE_UCS4_STORAGE define to
unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4).
(this may be good enough for platforms that doesn't have a 16-bit
type.  the UTF-16 codecs don't work, though)
2001-06-26 17:17:07 +00:00