decoding incomplete input (when the input stream is temporarily exhausted).
codecs.StreamReader now implements buffering, which enables proper
readline support for the UTF-16 decoders. codecs.StreamReader.read()
has a new argument chars which specifies the number of characters to
return. codecs.StreamReader.readline() and codecs.StreamReader.readlines()
have a new argument keepends. Trailing "\n"s will be stripped from the lines
if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and
PyUnicode_DecodeUTF16Stateful.
and installed layouts to make maintenance simple and easy. And it
also adds four new codecs; big5hkscs, euc-jis-2004, shift-jis-2004
and iso2022-jp-2004.
__init__.py module to raise errors which can be catched as LookupErrors
as well as SystemErrors.
Modified the error messages to include more information about the
failing module.
write a BOM at the start of the stream and also to only read it as
BOM at the start of a stream.
Subsequent reading/writing of BOMs will read/write the BOM as ZWNBSP
character. This is in sync with the Unicode specifications.
Note that UTF-16 files will now *have* to start with a BOM mark
in order to be readable by the codec.
codec files to codecs.py and added logic so that multi mappings
in the decoding maps now result in mappings to None (undefined mapping)
in the encoding maps.
and introduces a new method .decode().
The major change is that strg.encode() will no longer try to convert
Unicode returns from the codec into a string, but instead pass along
the Unicode object as-is. The same is now true for all other codec
return types. The underlying C APIs were changed accordingly.
Note that even though this does have the potential of breaking
existing code, the chances are low since conversion from Unicode
previously took place using the default encoding which is normally
set to ASCII rendering this auto-conversion mechanism useless for
most Unicode encodings.
The good news is that you can now use .encode() and .decode() with
much greater ease and that the door was opened for better accessibility
of the builtin codecs.
As demonstration of the new feature, the patch includes a few new
codecs which allow string to string encoding and decoding (rot13,
hex, zip, uu, base64).
Written by Marc-Andre Lemburg. Copyright assigned to the PSF.
codec to not apply Latin-1 mappings for keys which are not found
in the mapping dictionaries, but instead treat them as undefined
mappings.
The patch was originally written by Martin v. Loewis with some
additional (cosmetic) changes and an updated test script
by Marc-Andre Lemburg.
The standard codecs were recreated from the most current files
available at the Unicode.org site using the Tools/scripts/gencodec.py
tool.
This patch closes the bugs #116285 and #119960.
encodings package aliases mapping dictionary rather than in the
internal cache used by the search function.
This enables aliases to take advantage of the full normalization
process applied to encoding names which was previously not available.
The patch restricts alias registration to new aliases. Existing
aliases cannot be overridden anymore.
comments, docstrings or error messages. I fixed two minor things in
test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't").
There is a minor style issue involved: Guido seems to have preferred English
grammar (behaviour, honour) in a couple places. This patch changes that to
American, which is the more prominent style in the source. I prefer English
myself, so if English is preferred, I'd be happy to supply a patch myself ;)
his copy of test_contains.py seems to be broken -- the lines he
deleted were already absent). Checkin messages:
New Unicode support for int(), float(), complex() and long().
- new APIs PyInt_FromUnicode() and PyLong_FromUnicode()
- added support for Unicode to PyFloat_FromString()
- new encoding API PyUnicode_EncodeDecimal() which converts
Unicode to a decimal char* string (used in the above new
APIs)
- shortcuts for calls like int(<int object>) and float(<float obj>)
- tests for all of the above
Unicode compares and contains checks:
- comparing Unicode and non-string types now works; TypeErrors
are masked, all other errors such as ValueError during
Unicode coercion are passed through (note that PyUnicode_Compare
does not implement the masking -- PyObject_Compare does this)
- contains now works for non-string types too; TypeErrors are
masked and 0 returned; all other errors are passed through
Better testing support for the standard codecs.
Misc minor enhancements, such as an alias dbcs for the mbcs codec.
Changes:
- PyLong_FromString() now applies the same error checks as
does PyInt_FromString(): trailing garbage is reported
as error and not longer silently ignored. The only characters
which may be trailing the digits are 'L' and 'l' -- these
are still silently ignored.
- string.ato?() now directly interface to int(), long() and
float(). The error strings are now a little different, but
the type still remains the same. These functions are now
ready to get declared obsolete ;-)
- PyNumber_Int() now also does a check for embedded NULL chars
in the input string; PyNumber_Long() already did this (and
still does)
Followed by:
Looks like I've gone a step too far there... (and test_contains.py
seem to have a bug too).
I've changed back to reporting all errors in PyUnicode_Contains()
and added a few more test cases to test_contains.py (plus corrected
the join() NameError).
The attached patch set includes a workaround to get Python with
Unicode compile on BSDI 4.x (courtesy Thomas Wouters; the cause
is a bug in the BSDI wchar.h header file) and Python interfaces
for the MBCS codec donated by Mark Hammond.
Also included are some minor corrections w/r to the docs of
the new "es" and "es#" parser markers (use PyMem_Free() instead
of free(); thanks to Mark Hammond for finding these).
The unicodedata tests are now in a separate file
(test_unicodedata.py) to avoid problems if the module cannot
be found.
Attached you find an update of the Unicode implementation.
The patch is against the current CVS version. I would appreciate
if someone with CVS checkin permissions could check the changes
in.
The patch contains all bugs and patches sent this week and also
fixes a leak in the codecs code and a bug in the free list code
for Unicode objects (which only shows up when compiling Python
with Py_DEBUG; thanks to MarkH for spotting this one).