Commit Graph

47 Commits

Author SHA1 Message Date
Raymond Hettinger 57341c37c9 SF patch #1056231: typo in comment (unicodeobject.h) 2004-10-31 05:46:59 +00:00
Walter Dörwald 69652035bc SF patch #998993: The UTF-8 and the UTF-16 stateful decoders now support
decoding incomplete input (when the input stream is temporarily exhausted).
codecs.StreamReader now implements buffering, which enables proper
readline support for the UTF-16 decoders. codecs.StreamReader.read()
has a new argument chars which specifies the number of characters to
return. codecs.StreamReader.readline() and codecs.StreamReader.readlines()
have a new argument keepends. Trailing "\n"s will be stripped from the lines
if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and
PyUnicode_DecodeUTF16Stateful.
2004-09-07 20:24:22 +00:00
Hye-Shik Chang e9ddfbb412 SF #989185: Drop unicode.iswide() and unicode.width() and add
unicodedata.east_asian_width().  You can still implement your own
simple width() function using it like this:
    def width(u):
        w = 0
        for c in unicodedata.normalize('NFC', u):
            cwidth = unicodedata.east_asian_width(c)
            if cwidth in ('W', 'F'): w += 2
            else: w += 1
        return w
2004-08-04 07:38:35 +00:00
Marc-André Lemburg d2d4598ec2 Allow string and unicode return types from .encode()/.decode()
methods on string and unicode objects. Added unicode.decode()
which was missing for no apparent reason.
2004-07-08 17:57:32 +00:00
Hye-Shik Chang 974ed7cfa5 - SF #962502: Add two more methods for unicode type; width() and
iswide() for east asian width manipulation. (Inspired by David
Goodger, Reviewed by Martin v. Loewis)
- Move _PyUnicode_TypeRecord.flags to the end of the struct so that
no padding is added for UCS-4 builds. (Suggested by Martin v. Loewis)
2004-06-02 16:49:17 +00:00
Hye-Shik Chang 3ae811b57d Add rsplit method for str and unicode builtin types.
SF feature request #801847.
Original patch is written by Sean Reifschneider.
2003-12-15 18:49:53 +00:00
Marc-André Lemburg 9c329de47e Add name mangling for new PyUnicode_FromOrdinal() and fix declaration
to use new extern macro.
2002-08-12 08:19:10 +00:00
Mark Hammond 91a681debf Excise DL_EXPORT from Include.
Thanks to Skip Montanaro and Kalle Svensson for the patches.
2002-08-12 07:21:58 +00:00
Marc-André Lemburg cc8764ca9d Add C API PyUnicode_FromOrdinal() which exposes unichr() at C level.
u'%c' will now raise a ValueError in case the argument is an
integer outside the valid range of Unicode code point ordinals.

Closes SF bug #593581.
2002-08-11 12:23:04 +00:00
Marc-André Lemburg 4da6fd63bc Fix for bug [ 561796 ] string.find causes lazy error 2002-05-29 11:33:13 +00:00
Walter Dörwald de02bcb265 Apply patch diff.txt from SF feature request
http://www.python.org/sf/444708

This adds the optional argument for str.strip
to unicode.strip too and makes it possible
to call str.strip with a unicode argument
and unicode.strip with a str argument.
2002-04-22 17:42:37 +00:00
Guido van Rossum b8c65bc27f SF patch #470578: Fixes to synchronize unicode() and str()
This patch implements what we have discussed on python-dev late in
    September: str(obj) and unicode(obj) should behave similar, while
    the old behaviour is retained for unicode(obj, encoding, errors).

    The patch also adds a new feature with which objects can provide
    unicode(obj) with input data: the __unicode__ method. Currently no
    new tp_unicode slot is implemented; this is left as option for the
    future.

    Note that PyUnicode_FromEncodedObject() no longer accepts Unicode
    objects as input. The API name already suggests that Unicode
    objects do not belong in the list of acceptable objects and the
    functionality was only needed because
    PyUnicode_FromEncodedObject() was being used directly by
    unicode(). The latter was changed in the discussed way:

    * unicode(obj) calls PyObject_Unicode()
    * unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject()

    One thing left open to discussion is whether to leave the
    PyUnicode_FromObject() API as a thin API extension on top of
    PyUnicode_FromEncodedObject() or to turn it into a (macro) alias
    for PyObject_Unicode() and deprecate it. Doing so would have some
    surprising consequences though, e.g.  u"abc" + 123 would turn out
    as u"abc123"...

[Marc-Andre didn't have time to check this in before the deadline.  I
hope this is OK, Marc-Andre!  You can still make changes and commit
them on the trunk after the branch has been made, but then please mail
Barry a context diff if you want the change to be merged into the
2.2b1 release branch.  GvR]
2001-10-19 02:01:31 +00:00
Marc-André Lemburg c60e6f7771 Patch #435971: UTF-7 codec by Brian Quinlan. 2001-09-20 10:35:46 +00:00
Marc-André Lemburg 5e6007c5db Fix for bug #462737. 2001-09-19 11:21:03 +00:00
Tim Peters 78e0fc74bc Possibly the end of SF [#460020] bug or feature: unicode() and subclasses.
Changed unicode(i) to return a true Unicode object when i is an instance of
a unicode subclass.  Added PyUnicode_CheckExact macro.
2001-09-11 03:07:38 +00:00
Guido van Rossum 5eef77a21b Make the Py<type>_Check() macro use PyObject_TypeCheck(). 2001-08-30 03:08:07 +00:00
Martin v. Löwis 339d0f720e Patch #445762: Support --disable-unicode
- Do not compile unicodeobject, unicodectype, and unicodedata if Unicode is disabled
- check for Py_USING_UNICODE in all places that use Unicode functions
- disables unicode literals, and the builtin functions
- add the types.StringTypes list
- remove Unicode literals from most tests.
2001-08-17 18:39:25 +00:00
Tim Peters 772747b3f1 SF patch #438013 Remove 2-byte Py_UCS2 assumptions
Removed all instances of Py_UCS2 from the codebase, and so also (I hope)
the last remaining reliance on the platform having an integral type
with exactly 16 bits.
PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write
one byte at a time.
2001-08-09 22:21:55 +00:00
Marc-André Lemburg b5ac6f62c7 As discussed on python-dev: this patch adds name mangling to
assure that extensions and interpreters using the Unicode APIs
were compiled using the same Unicode width.
2001-07-31 14:30:16 +00:00
Jeremy Hylton 3ce45389bd Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h.
And remove all the extern decls in the middle of .c files.
Apparently, it was excluded from the header file because it is
intended for internal use by the interpreter.  It's still intended for
internal use and documented as such in the header file.
2001-07-30 22:34:24 +00:00
Fredrik Lundh 72b068566a removed "register const" from scalar arguments to the unicode
predicates
2001-06-27 22:08:26 +00:00
Fredrik Lundh 8f4558583f use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE
tests.
2001-06-27 18:59:43 +00:00
Martin v. Löwis ce9b5a55e1 Encode surrogates in UTF-8 even for a wide Py_UNICODE.
Implement sys.maxunicode.
Explicitly wrap around upper/lower computations for wide Py_UNICODE.
When decoding large characters with UTF-8, represent expected test
results using the \U notation.
2001-06-27 06:28:56 +00:00
Fredrik Lundh 9b14ab367a Make Unicode work a bit better on Windows... 2001-06-26 22:59:49 +00:00
Martin v. Löwis 0ba70cc3c8 Support using UCS-4 as the Py_UNICODE type:
Add configure option --enable-unicode.
Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE,
                    SIZEOF_WCHAR_T.
Define Py_UCS2.
Encode and decode large UTF-8 characters into single Py_UNICODE values
for wide Unicode types; likewise for UTF-16.
Remove test whether sizeof Py_UNICODE is two.
2001-06-26 22:22:37 +00:00
Fredrik Lundh 1294ad0c59 experimental UCS-4 support: added USE_UCS4_STORAGE define to
unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4).
(this may be good enough for platforms that doesn't have a 16-bit
type.  the UTF-16 codecs don't work, though)
2001-06-26 17:17:07 +00:00
Marc-André Lemburg 489b56e044 This patch changes the behaviour of the UTF-16 codec family. Only the
UTF-16 codec will now interpret and remove a *leading* BOM mark. Sub-
sequent BOM characters are no longer interpreted and removed.
UTF-16-LE and -BE pass through all BOM mark characters.

These changes should get the UTF-16 codec more in line with what
the Unicode FAQ recommends w/r to BOM marks.
2001-05-21 20:30:15 +00:00
Marc-André Lemburg 8155e0e541 This patch originated from an idea by Martin v. Loewis who submitted a
patch for sharing single character Unicode objects.

Martin's patch had to be reworked in a number of ways to take Unicode
resizing into consideration as well. Here's what the updated patch
implements:

* Single character Unicode strings in the Latin-1 range are shared
  (not only ASCII chars as in Martin's original patch).

* The ASCII and Latin-1 codecs make use of this optimization,
  providing a noticable speedup for single character strings. Most
  Unicode methods can use the optimization as well (by virtue
  of using PyUnicode_FromUnicode()).

* Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY)

* The PyUnicode_Resize() can now also handle the case of resizing
  unicode_empty which previously resulted in an error.

* Modified the internal API _PyUnicode_Resize() and
  the public PyUnicode_Resize() API to handle references to
  shared objects correctly. The _PyUnicode_Resize() signature
  changed due to this.

* Callers of PyUnicode_FromUnicode() may now only modify the Unicode
  object contents of the returned object in case they called the API
  with NULL as content template.

Note that even though this patch passes the regression tests, there
may still be subtle bugs in the sharing code.
2001-04-23 14:44:21 +00:00
Marc-André Lemburg 1a731c60a3 Added #fndef's to avoid compiler errors. 2000-08-11 11:43:10 +00:00
Marc-André Lemburg bff879cabb This patch finalizes the move from UTF-8 to a default encoding in
the Python Unicode implementation.

The internal buffer used for implementing the buffer protocol
is renamed to defenc to make this change visible. It now holds the
default encoded version of the Unicode object and is calculated
on demand (NULL otherwise).

Since the default encoding defaults to ASCII, this will mean that
Unicode objects which hold non-ASCII characters will no longer
work on C APIs using the "s" or "t" parser markers. C APIs must now
explicitly provide Unicode support via the "u", "U" or "es"/"es#"
parser markers in order to work with non-ASCII Unicode strings.

(Note: this patch will also have to be applied to the 1.6 branch
 of the CVS tree.)
2000-08-03 18:46:08 +00:00
Guido van Rossum 16b1ad9c7d Changing the CNRI copyright notice according to CNRI's instructions.
This is a notice without a date, which apparently is not a claim to
copyright but only advice to the reader.  IANAL. :-)
2000-08-03 16:24:25 +00:00
Thomas Wouters 5f37591a16 ANSIfications: fix empty arglists, and remove the checks for
'HAVE_STDARG_PROTOTYPES' (consider it true, remove false branch)
2000-07-22 23:30:03 +00:00
Thomas Wouters 7e47402264 Spelling fixes supplied by Rob W. W. Hooft. All these are fixes in either
comments, docstrings or error messages. I fixed two minor things in
test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't").

There is a minor style issue involved: Guido seems to have preferred English
grammar (behaviour, honour) in a couple places. This patch changes that to
American, which is the more prominent style in the source. I prefer English
myself, so if English is preferred, I'd be happy to supply a patch myself ;)
2000-07-16 12:04:32 +00:00
Marc-André Lemburg 5a5c81a0e9 Added new API PyUnicode_FromEncodedObject() which supports decoding
objects including instance objects.

The old API PyUnicode_FromObject() is still available as shortcut.
2000-07-07 13:46:42 +00:00
Marc-André Lemburg 43279100f4 Bill Tutt: Added Py_UCS4 typedef to hold UCS4 values (these need
at least 32 bits as opposed to Py_UNICODE which rely on having
16 bits).
2000-07-07 09:01:41 +00:00
Marc-André Lemburg f03e74126e Modified the ISALPHA and ISALNUM macros to use the new lookup APIs
from unicodectype.c
2000-07-05 09:45:59 +00:00
Marc-André Lemburg a9c103bc09 Added new Py_UNICODE_ISALPHA() and Py_UNICODE_ISALNUM() macros
which are true for alphabetic and alphanumeric characters resp.

The macros are currently implemented using the existing is* tables
but will have to be updated to meet the Unicode standard definitions
(add tables for non-cased letters and letter modifiers).
2000-07-03 10:52:13 +00:00
Marc-André Lemburg 2f4d0e9bb6 Marc-Andre Lemburg <mal@lemburg.com>:
Added optimization proposed by Andrew Kuchling to the Unicode
matching macro.
2000-06-18 22:22:27 +00:00
Fred Drake cb093fe890 M.-A. Lemburg <mal@lemburg.com>:
Added PyUnicode_GetDefaultEncoding() and
PyUnicode_GetDefaultEncoding() APIs.
2000-05-09 19:51:53 +00:00
Guido van Rossum 004d64f362 Marc-Andre Lemburg:
Changed PyUnicode_Splitlines() maxsplit argument to keepends.
The maxsplit functionality was replaced by the keepends
functionality which allows keeping the line end markers together
with the string.
2000-04-11 15:39:46 +00:00
Guido van Rossum 52c2359a59 Marc-Andre Lemburg: New exported API PyUnicode_Resize(). 2000-04-10 13:41:41 +00:00
Guido van Rossum 9e896b37c7 Marc-Andre's third try at this bulk patch seems to work (except that
his copy of test_contains.py seems to be broken -- the lines he
deleted were already absent).  Checkin messages:


New Unicode support for int(), float(), complex() and long().

- new APIs PyInt_FromUnicode() and PyLong_FromUnicode()
- added support for Unicode to PyFloat_FromString()
- new encoding API PyUnicode_EncodeDecimal() which converts
  Unicode to a decimal char* string (used in the above new
  APIs)
- shortcuts for calls like int(<int object>) and float(<float obj>)
- tests for all of the above

Unicode compares and contains checks:
- comparing Unicode and non-string types now works; TypeErrors
  are masked, all other errors such as ValueError during
  Unicode coercion are passed through (note that PyUnicode_Compare
  does not implement the masking -- PyObject_Compare does this)
- contains now works for non-string types too; TypeErrors are
  masked and 0 returned; all other errors are passed through

Better testing support for the standard codecs.

Misc minor enhancements, such as an alias dbcs for the mbcs codec.

Changes:
- PyLong_FromString() now applies the same error checks as
  does PyInt_FromString(): trailing garbage is reported
  as error and not longer silently ignored. The only characters
  which may be trailing the digits are 'L' and 'l' -- these
  are still silently ignored.
- string.ato?() now directly interface to int(), long() and
  float(). The error strings are now a little different, but
  the type still remains the same. These functions are now
  ready to get declared obsolete ;-)
- PyNumber_Int() now also does a check for embedded NULL chars
  in the input string; PyNumber_Long() already did this (and
  still does)

Followed by:

Looks like I've gone a step too far there... (and test_contains.py
seem to have a bug too).

I've changed back to reporting all errors in PyUnicode_Contains()
and added a few more test cases to test_contains.py (plus corrected
the join() NameError).
2000-04-05 20:11:21 +00:00
Guido van Rossum 24bdb0474f Marc-Andre Lemburg:
The attached patch set includes a workaround to get Python with
Unicode compile on BSDI 4.x (courtesy Thomas Wouters; the cause
is a bug in the BSDI wchar.h header file) and Python interfaces
for the MBCS codec donated by Mark Hammond.

Also included are some minor corrections w/r to the docs of
the new "es" and "es#" parser markers (use PyMem_Free() instead
of free(); thanks to Mark Hammond for finding these).

The unicodedata tests are now in a separate file
(test_unicodedata.py) to avoid problems if the module cannot
be found.
2000-03-28 20:29:59 +00:00
Guido van Rossum efec1158c1 Prototypes added for MBCS codecs. (Win32 only.) 2000-03-28 02:01:15 +00:00
Barry Warsaw 51ac58039f On 17-Mar-2000, Marc-Andre Lemburg said:
Attached you find an update of the Unicode implementation.

    The patch is against the current CVS version. I would appreciate
    if someone with CVS checkin permissions could check the changes
    in.

    The patch contains all bugs and patches sent this week and also
    fixes a leak in the codecs code and a bug in the free list code
    for Unicode objects (which only shows up when compiling Python
    with Py_DEBUG; thanks to MarkH for spotting this one).
2000-03-20 16:36:48 +00:00
Guido van Rossum d0d366b5e6 Marc-Andre Lemburg: add declaration for PyUnicode_Contains(). 2000-03-13 23:22:24 +00:00
Guido van Rossum d822518fa8 Unicode implementation by Marc-Andre Lemburg based on original code by Fredrik Lundh. 2000-03-10 22:33:05 +00:00