This patch implements what we have discussed on python-dev late in
September: str(obj) and unicode(obj) should behave similar, while
the old behaviour is retained for unicode(obj, encoding, errors).
The patch also adds a new feature with which objects can provide
unicode(obj) with input data: the __unicode__ method. Currently no
new tp_unicode slot is implemented; this is left as option for the
future.
Note that PyUnicode_FromEncodedObject() no longer accepts Unicode
objects as input. The API name already suggests that Unicode
objects do not belong in the list of acceptable objects and the
functionality was only needed because
PyUnicode_FromEncodedObject() was being used directly by
unicode(). The latter was changed in the discussed way:
* unicode(obj) calls PyObject_Unicode()
* unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject()
One thing left open to discussion is whether to leave the
PyUnicode_FromObject() API as a thin API extension on top of
PyUnicode_FromEncodedObject() or to turn it into a (macro) alias
for PyObject_Unicode() and deprecate it. Doing so would have some
surprising consequences though, e.g. u"abc" + 123 would turn out
as u"abc123"...
[Marc-Andre didn't have time to check this in before the deadline. I
hope this is OK, Marc-Andre! You can still make changes and commit
them on the trunk after the branch has been made, but then please mail
Barry a context diff if you want the change to be merged into the
2.2b1 release branch. GvR]
many types were subclassable but had a xxx_dealloc function that
called PyObject_DEL(self) directly instead of deferring to
self->ob_type->tp_free(self). It is permissible to set tp_free in the
type object directly to _PyObject_Del, for non-GC types, or to
_PyObject_GC_Del, for GC types. Still, PyObject_DEL was a tad faster,
so I'm fearing that our pystone rating is going down again. I'm not
sure if doing something like
void xxx_dealloc(PyObject *self)
{
if (PyXxxCheckExact(self))
PyObject_DEL(self);
else
self->ob_type->tp_free(self);
}
is any faster than always calling the else branch, so I haven't
attempted that -- however those types whose own dealloc is fancier
(int, float, unicode) do use this pattern.
elements which are not Unicode objects or strings. (This matches
the string.join() behaviour.)
Fix a memory leak in the .join() method which occurs in case
the Unicode resize fails.
Restore the test_unicode output.
+ These were leaving the hash fields at 0, which all string and unicode
routines believe is a legitimate hash code. As a result, hash() applied
to str and unicode subclass instances always returned 0, which in turn
confused dict operations, etc.
+ Changed local names "new"; no point to antagonizing C++ compilers.
Removed all instances of Py_UCS2 from the codebase, and so also (I hope)
the last remaining reliance on the platform having an integral type
with exactly 16 bits.
PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write
one byte at a time.
And remove all the extern decls in the middle of .c files.
Apparently, it was excluded from the header file because it is
intended for internal use by the interpreter. It's still intended for
internal use and documented as such in the header file.
#caused warnings with the VMS C compiler. (SF bug #442998, in part.)
On a narrow system the current code should never be executed since ch
will always be < 0x10000.
Marc-Andre: you may end up fixing this a different way, since I
believe you have plans to generate \U for surrogate pairs. I'll leave
that to you.
Implement sys.maxunicode.
Explicitly wrap around upper/lower computations for wide Py_UNICODE.
When decoding large characters with UTF-8, represent expected test
results using the \U notation.
Add configure option --enable-unicode.
Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE,
SIZEOF_WCHAR_T.
Define Py_UCS2.
Encode and decode large UTF-8 characters into single Py_UNICODE values
for wide Unicode types; likewise for UTF-16.
Remove test whether sizeof Py_UNICODE is two.
unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4).
(this may be good enough for platforms that doesn't have a 16-bit
type. the UTF-16 codecs don't work, though)
UTF-16 codec will now interpret and remove a *leading* BOM mark. Sub-
sequent BOM characters are no longer interpreted and removed.
UTF-16-LE and -BE pass through all BOM mark characters.
These changes should get the UTF-16 codec more in line with what
the Unicode FAQ recommends w/r to BOM marks.
to string.join(), so that when the latter figures out in midstream that
it really needs unicode.join() instead, unicode.join() can actually get
all the sequence elements (i.e., there's no guarantee that the sequence
passed to string.join() can be iterated over *again* by unicode.join(),
so string.join() must not pass on the original sequence object anymore).
Patch #419651: Metrowerks on Mac adds 0x itself
C std says %#x and %#X conversion of 0 do not add the 0x/0X base marker.
Metrowerks apparently does. Mark Favas reported the same bug under a
Compaq compiler on Tru64 Unix, but no other libc broken in this respect
is known (known to be OK under MSVC and gcc).
So just try the damn thing at runtime and see what the platform does.
Note that we've always had bugs here, but never knew it before because
a relevant test case didn't exist before 2.1.
patch for sharing single character Unicode objects.
Martin's patch had to be reworked in a number of ways to take Unicode
resizing into consideration as well. Here's what the updated patch
implements:
* Single character Unicode strings in the Latin-1 range are shared
(not only ASCII chars as in Martin's original patch).
* The ASCII and Latin-1 codecs make use of this optimization,
providing a noticable speedup for single character strings. Most
Unicode methods can use the optimization as well (by virtue
of using PyUnicode_FromUnicode()).
* Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY)
* The PyUnicode_Resize() can now also handle the case of resizing
unicode_empty which previously resulted in an error.
* Modified the internal API _PyUnicode_Resize() and
the public PyUnicode_Resize() API to handle references to
shared objects correctly. The _PyUnicode_Resize() signature
changed due to this.
* Callers of PyUnicode_FromUnicode() may now only modify the Unicode
object contents of the returned object in case they called the API
with NULL as content template.
Note that even though this patch passes the regression tests, there
may still be subtle bugs in the sharing code.
"%#x" % 0
blew up, at heart because C sprintf supplies a base marker if and only if
the value is not 0. I then fixed that, by tolerating C's inconsistency
when it does %#x, and taking away that *Python* produced 0x0 when
formatting 0L (the "long" flavor of 0) under %#x itself. But after talking
with Guido, we agreed it would be better to supply 0x for the short int
case too, despite that it's inconsistent with C, because C is inconsistent
with itself and with Python's hex(0) (plus, while "%#x" % 0 didn't work
before, "%#x" % 0L *did*, and returned "0x0"). Similarly for %#X conversion.
http://sourceforge.net/tracker/index.php?func=detail&aid=415514&group_id=5470&atid=105470
For short ints, Python defers to the platform C library to figure out what
%#x should do. The code asserted that the platform C returned a string
beginning with "0x". However, that's not true when-- and only when --the
*value* being formatted is 0. Changed the code to live with C's inconsistency
here. In the meantime, the problem does not arise if you format a long 0 (0L)
instead. However, that's because the code *we* wrote to do %#x conversions on
longs produces a leading "0x" regardless of value. That's probably wrong too:
we should drop leading "0x", for consistency with C, when (& only when) formatting
0L. So I changed the long formatting code to do that too.
implementation details inside the ucnhash module.
also cleaned up the unicode copyright blurb a little; Secret Labs'
internal revision history isn't that interesting...
except that it always returns Unicode objects.
A new C API PyObject_Unicode() is also provided.
This closes patch #101664.
Written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.
in case the parameters are out of bounds and fixes error handling
for .count(), .startswith() and .endswith() for the case of
mixed string/Unicode objects.
This patch adds Python style index semantics to PyUnicode_Count()
indices (including the special handling of negative indices).
The patch is an extended version of patch #103249 submitted
by Michael Hudson (mwh) on SF. It also includes new test cases.
The mapping dictionaries can now contain 1-n mappings, meaning
that character ordinals may be mapped to strings or Unicode object,
e.g. 0x0078 ('x') -> u"abc", causing the ordinal to be replaced by
the complete string or Unicode object instead of just one character.
Another feature introduced by the patch is that of mapping oridnals to
the emtpy string. This allows removing characters.
The patch is different from patch #103100 in that it does not cause a
performance hit for the normal use case of 1-1 mappings.
Written by Marc-Andre Lemburg, copyright assigned to Guido van Rossum.
codec to not apply Latin-1 mappings for keys which are not found
in the mapping dictionaries, but instead treat them as undefined
mappings.
The patch was originally written by Martin v. Loewis with some
additional (cosmetic) changes and an updated test script
by Marc-Andre Lemburg.
The standard codecs were recreated from the most current files
available at the Unicode.org site using the Tools/scripts/gencodec.py
tool.
This patch closes the bugs #116285 and #119960.