mirror of https://github.com/python/cpython
Issue 19548: update codecs module documentation
- clarified the distinction between text encodings and other codecs
- clarified relationship with builtin open and the io module
- consolidated documentation of error handlers into one section
- clarified type constraints of some behaviours
- added tests for some of the new statements in the docs
parent fcfed19913
commit b9fdb7a452
|
@ -820,10 +820,13 @@ Glossary
|
||||||
:meth:`~collections.somenamedtuple._asdict`. Examples of struct sequences
|
:meth:`~collections.somenamedtuple._asdict`. Examples of struct sequences
|
||||||
include :data:`sys.float_info` and the return value of :func:`os.stat`.
|
include :data:`sys.float_info` and the return value of :func:`os.stat`.
|
||||||
|
|
||||||
|
text encoding
|
||||||
|
A codec which encodes Unicode strings to bytes.
|
||||||
|
|
||||||
text file
|
text file
|
||||||
A :term:`file object` able to read and write :class:`str` objects.
|
A :term:`file object` able to read and write :class:`str` objects.
|
||||||
Often, a text file actually accesses a byte-oriented datastream
|
Often, a text file actually accesses a byte-oriented datastream
|
||||||
and handles the text encoding automatically.
|
and handles the :term:`text encoding` automatically.
|
||||||
|
|
||||||
.. seealso::
|
.. seealso::
|
||||||
A :term:`binary file` reads and write :class:`bytes` objects.
|
A :term:`binary file` reads and write :class:`bytes` objects.
|
||||||
|
|
|
@ -17,10 +17,17 @@
|
||||||
pair: stackable; streams
|
pair: stackable; streams
|
||||||
|
|
||||||
This module defines base classes for standard Python codecs (encoders and
|
This module defines base classes for standard Python codecs (encoders and
|
||||||
decoders) and provides access to the internal Python codec registry which
|
decoders) and provides access to the internal Python codec registry, which
|
||||||
manages the codec and error handling lookup process.
|
manages the codec and error handling lookup process. Most standard codecs
|
||||||
|
are :term:`text encodings <text encoding>`, which encode text to bytes,
|
||||||
|
but there are also codecs provided that encode text to text, and bytes to
|
||||||
|
bytes. Custom codecs may encode and decode between arbitrary types, but some
|
||||||
|
module features are restricted to use specifically with
|
||||||
|
:term:`text encodings <text encoding>`, or with codecs that encode to
|
||||||
|
:class:`bytes`.
|
||||||
|
|
||||||
It defines the following functions:
|
The module defines the following functions for encoding and decoding with
|
||||||
|
any codec:
|
||||||
|
|
||||||
.. function:: encode(obj, [encoding[, errors]])
|
.. function:: encode(obj, [encoding[, errors]])
|
||||||
|
|
||||||
|
@ -28,7 +35,7 @@ It defines the following functions:
|
||||||
encoding is ``utf-8``.
|
encoding is ``utf-8``.
|
||||||
|
|
||||||
*Errors* may be given to set the desired error handling scheme. The
|
*Errors* may be given to set the desired error handling scheme. The
|
||||||
default error handler is ``strict`` meaning that encoding errors raise
|
default error handler is ``'strict'`` meaning that encoding errors raise
|
||||||
:exc:`ValueError` (or a more codec specific subclass, such as
|
:exc:`ValueError` (or a more codec specific subclass, such as
|
||||||
:exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
|
:exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
|
||||||
information on codec error handling.
|
information on codec error handling.
|
||||||
|
@ -39,90 +46,63 @@ It defines the following functions:
|
||||||
encoding is ``utf-8``.
|
encoding is ``utf-8``.
|
||||||
|
|
||||||
*Errors* may be given to set the desired error handling scheme. The
|
*Errors* may be given to set the desired error handling scheme. The
|
||||||
default error handler is ``strict`` meaning that decoding errors raise
|
default error handler is ``'strict'`` meaning that decoding errors raise
|
||||||
:exc:`ValueError` (or a more codec specific subclass, such as
|
:exc:`ValueError` (or a more codec specific subclass, such as
|
||||||
:exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
|
:exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
|
||||||
information on codec error handling.
|
information on codec error handling.
|
||||||
|
|
||||||
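As an illustration of the ``encode`` and ``decode`` functions described above, here is a minimal sketch. It assumes the standard ``utf-8``, ``ascii`` and ``hex`` codecs are available; the variable names are chosen for this example only.

```python
import codecs

# Text encoding: str -> bytes and back again.
encoded = codecs.encode("café", "utf-8")
decoded = codecs.decode(encoded, "utf-8")

# A bytes-to-bytes codec goes through the same two functions.
hexed = codecs.encode(b"abc", "hex")

# With the default 'strict' handler, a failure raises a ValueError subclass.
strict_failure = None
try:
    codecs.encode("café", "ascii")
except UnicodeEncodeError as exc:
    strict_failure = exc
```

The last step shows why catching :exc:`ValueError` is enough to cover text encoding failures as well.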
.. function:: register(search_function)
|
The full details for each codec can also be looked up directly:
|
||||||
|
|
||||||
Register a codec search function. Search functions are expected to take one
|
|
||||||
argument, the encoding name in all lower case letters, and return a
|
|
||||||
:class:`CodecInfo` object having the following attributes:
|
|
||||||
|
|
||||||
* ``name`` The name of the encoding;
|
|
||||||
|
|
||||||
* ``encode`` The stateless encoding function;
|
|
||||||
|
|
||||||
* ``decode`` The stateless decoding function;
|
|
||||||
|
|
||||||
* ``incrementalencoder`` An incremental encoder class or factory function;
|
|
||||||
|
|
||||||
* ``incrementaldecoder`` An incremental decoder class or factory function;
|
|
||||||
|
|
||||||
* ``streamwriter`` A stream writer class or factory function;
|
|
||||||
|
|
||||||
* ``streamreader`` A stream reader class or factory function.
|
|
||||||
|
|
||||||
The various functions or classes take the following arguments:
|
|
||||||
|
|
||||||
*encode* and *decode*: These must be functions or methods which have the same
|
|
||||||
interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
|
|
||||||
instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
|
|
||||||
are expected to work in a stateless mode.
|
|
||||||
|
|
||||||
*incrementalencoder* and *incrementaldecoder*: These have to be factory
|
|
||||||
functions providing the following interface:
|
|
||||||
|
|
||||||
``factory(errors='strict')``
|
|
||||||
|
|
||||||
The factory functions must return objects providing the interfaces defined by
|
|
||||||
the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
|
|
||||||
respectively. Incremental codecs can maintain state.
|
|
||||||
|
|
||||||
*streamreader* and *streamwriter*: These have to be factory functions providing
|
|
||||||
the following interface:
|
|
||||||
|
|
||||||
``factory(stream, errors='strict')``
|
|
||||||
|
|
||||||
The factory functions must return objects providing the interfaces defined by
|
|
||||||
the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
|
|
||||||
Stream codecs can maintain state.
|
|
||||||
|
|
||||||
Possible values for errors are
|
|
||||||
|
|
||||||
* ``'strict'``: raise an exception in case of an encoding error
|
|
||||||
* ``'replace'``: replace malformed data with a suitable replacement marker,
|
|
||||||
such as ``'?'`` or ``'\ufffd'``
|
|
||||||
* ``'ignore'``: ignore malformed data and continue without further notice
|
|
||||||
* ``'xmlcharrefreplace'``: replace with the appropriate XML character
|
|
||||||
reference (for encoding only)
|
|
||||||
* ``'backslashreplace'``: replace with backslashed escape sequences (for
|
|
||||||
encoding only)
|
|
||||||
* ``'surrogateescape'``: on decoding, replace with code points in the Unicode
|
|
||||||
Private Use Area ranging from U+DC80 to U+DCFF. These private code
|
|
||||||
points will then be turned back into the same bytes when the
|
|
||||||
``surrogateescape`` error handler is used when encoding the data.
|
|
||||||
(See :pep:`383` for more.)
|
|
||||||
|
|
||||||
as well as any other error handling name defined via :func:`register_error`.
|
|
||||||
|
|
||||||
In case a search function cannot find a given encoding, it should return
|
|
||||||
``None``.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: lookup(encoding)
|
.. function:: lookup(encoding)
|
||||||
|
|
||||||
Looks up the codec info in the Python codec registry and returns a
|
Looks up the codec info in the Python codec registry and returns a
|
||||||
:class:`CodecInfo` object as defined above.
|
:class:`CodecInfo` object as defined below.
|
||||||
|
|
||||||
Encodings are first looked up in the registry's cache. If not found, the list of
|
Encodings are first looked up in the registry's cache. If not found, the list of
|
||||||
registered search functions is scanned. If no :class:`CodecInfo` object is
|
registered search functions is scanned. If no :class:`CodecInfo` object is
|
||||||
found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
|
found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
|
||||||
is stored in the cache and returned to the caller.
|
is stored in the cache and returned to the caller.
|
||||||
|
|
||||||
To simplify access to the various codecs, the module provides these additional
|
.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)
|
||||||
functions which use :func:`lookup` for the codec lookup:
|
|
||||||
|
Codec details when looking up the codec registry. The constructor
|
||||||
|
arguments are stored in attributes of the same name:
|
||||||
|
|
||||||
|
|
||||||
|
.. attribute:: name
|
||||||
|
|
||||||
|
The name of the encoding.
|
||||||
|
|
||||||
|
|
||||||
|
.. attribute:: encode
|
||||||
|
decode
|
||||||
|
|
||||||
|
The stateless encoding and decoding functions. These must be
|
||||||
|
functions or methods which have the same interface as
|
||||||
|
the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
|
||||||
|
instances (see :ref:`Codec Interface <codec-objects>`).
|
||||||
|
The functions or methods are expected to work in a stateless mode.
|
||||||
|
|
||||||
|
|
||||||
|
.. attribute:: incrementalencoder
|
||||||
|
incrementaldecoder
|
||||||
|
|
||||||
|
Incremental encoder and decoder classes or factory functions.
|
||||||
|
These have to provide the interface defined by the base classes
|
||||||
|
:class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
|
||||||
|
respectively. Incremental codecs can maintain state.
|
||||||
|
|
||||||
|
|
||||||
|
.. attribute:: streamwriter
|
||||||
|
streamreader
|
||||||
|
|
||||||
|
Stream writer and reader classes or factory functions. These have to
|
||||||
|
provide the interface defined by the base classes
|
||||||
|
:class:`StreamWriter` and :class:`StreamReader`, respectively.
|
||||||
|
Stream codecs can maintain state.
|
||||||
|
|
||||||
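The attributes listed above can be explored with a short sketch like the following. It assumes the standard ``utf-8`` codec; the codec name ``no-such-codec-demo`` is made up so that the lookup fails.

```python
import codecs

info = codecs.lookup("UTF-8")  # the name is lower-cased before the search

# Stateless functions return a tuple (output object, length consumed).
data, consumed = info.encode("héllo")
text, used = info.decode(data)

# Incremental codecs keep state between calls, so a multi-byte
# sequence may be split across two chunks.
decoder = info.incrementaldecoder(errors="strict")
parts = [decoder.decode(b"\xc3"), decoder.decode(b"\xa9", final=True)]

# An unknown name raises LookupError.
missing = False
try:
    codecs.lookup("no-such-codec-demo")
except LookupError:
    missing = True
```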
|
To simplify access to the various codec components, the module provides
|
||||||
|
these additional functions which use :func:`lookup` for the codec lookup:
|
||||||
|
|
||||||
|
|
||||||
.. function:: getencoder(encoding)
|
.. function:: getencoder(encoding)
|
||||||
|
@ -172,90 +152,43 @@ functions which use :func:`lookup` for the codec lookup:
|
||||||
|
|
||||||
Raises a :exc:`LookupError` in case the encoding cannot be found.
|
Raises a :exc:`LookupError` in case the encoding cannot be found.
|
||||||
|
|
||||||
|
Custom codecs are made available by registering a suitable codec search
|
||||||
|
function:
|
||||||
|
|
||||||
.. function:: register_error(name, error_handler)
|
.. function:: register(search_function)
|
||||||
|
|
||||||
Register the error handling function *error_handler* under the name *name*.
|
Register a codec search function. Search functions are expected to take one
|
||||||
*error_handler* will be called during encoding and decoding in case of an error,
|
argument, being the encoding name in all lower case letters, and return a
|
||||||
when *name* is specified as the errors parameter.
|
:class:`CodecInfo` object. In case a search function cannot find
|
||||||
|
a given encoding, it should return ``None``.
|
||||||
For encoding *error_handler* will be called with a :exc:`UnicodeEncodeError`
|
|
||||||
instance, which contains information about the location of the error. The
|
|
||||||
error handler must either raise this or a different exception or return a
|
|
||||||
tuple with a replacement for the unencodable part of the input and a position
|
|
||||||
where encoding should continue. The replacement may be either :class:`str` or
|
|
||||||
:class:`bytes`. If the replacement is bytes, the encoder will simply copy
|
|
||||||
them into the output buffer. If the replacement is a string, the encoder will
|
|
||||||
encode the replacement. Encoding continues on original input at the
|
|
||||||
specified position. Negative position values will be treated as being
|
|
||||||
relative to the end of the input string. If the resulting position is out of
|
|
||||||
bound an :exc:`IndexError` will be raised.
|
|
||||||
|
|
||||||
Decoding and translating works similar, except :exc:`UnicodeDecodeError` or
|
|
||||||
:exc:`UnicodeTranslateError` will be passed to the handler and that the
|
|
||||||
replacement from the error handler will be put into the output directly.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: lookup_error(name)
|
|
||||||
|
|
||||||
Return the error handler previously registered under the name *name*.
|
|
||||||
|
|
||||||
Raises a :exc:`LookupError` in case the handler cannot be found.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: strict_errors(exception)
|
|
||||||
|
|
||||||
Implements the ``strict`` error handling: each encoding or decoding error
|
|
||||||
raises a :exc:`UnicodeError`.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: replace_errors(exception)
|
|
||||||
|
|
||||||
Implements the ``replace`` error handling: malformed data is replaced with a
|
|
||||||
suitable replacement character such as ``'?'`` in bytestrings and
|
|
||||||
``'\ufffd'`` in Unicode strings.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: ignore_errors(exception)
|
|
||||||
|
|
||||||
Implements the ``ignore`` error handling: malformed data is ignored and
|
|
||||||
encoding or decoding is continued without further notice.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: xmlcharrefreplace_errors(exception)
|
|
||||||
|
|
||||||
Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
|
|
||||||
unencodable character is replaced by an appropriate XML character reference.
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: backslashreplace_errors(exception)
|
|
||||||
|
|
||||||
Implements the ``backslashreplace`` error handling (for encoding only): the
|
|
||||||
unencodable character is replaced by a backslashed escape sequence.
|
|
||||||
|
|
||||||
To simplify working with encoded files or stream, the module also defines these
|
|
||||||
utility functions:
|
|
||||||
|
|
||||||
|
|
||||||
.. function:: open(filename, mode[, encoding[, errors[, buffering]]])
|
|
||||||
|
|
||||||
Open an encoded file using the given *mode* and return a wrapped version
|
|
||||||
providing transparent encoding/decoding. The default file mode is ``'r'``
|
|
||||||
meaning to open the file in read mode.
|
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
The wrapped version's methods will accept and return strings only. Bytes
|
Search function registration is not currently reversible,
|
||||||
arguments will be rejected.
|
which may cause problems in some cases, such as unit testing or
|
||||||
|
module reloading.
|
||||||
|
|
||||||
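A search function might look like the sketch below. The codec name ``demo_latin1`` and the function ``demo_search`` are hypothetical; the example simply reuses the components of the standard ``latin-1`` codec under a new name rather than implementing a real codec.

```python
import codecs

def demo_search(name):
    """Return a CodecInfo for the hypothetical 'demo_latin1' codec, else None."""
    if name != "demo_latin1":
        return None  # let other search functions handle the name
    base = codecs.lookup("latin-1")
    return codecs.CodecInfo(
        encode=base.encode,
        decode=base.decode,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        name="demo_latin1",
    )

codecs.register(demo_search)
encoded = codecs.encode("café", "demo_latin1")
```

Because of the note above about registration, a sketch like this is best run in a throwaway interpreter session rather than inside a long-lived test suite.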
|
While the builtin :func:`open` and the associated :mod:`io` module are the
|
||||||
|
recommended approach for working with encoded text files, this module
|
||||||
|
provides additional utility functions and classes that allow the use of a
|
||||||
|
wider range of codecs when working with binary files:
|
||||||
|
|
||||||
|
.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=1)
|
||||||
|
|
||||||
|
Open an encoded file using the given *mode* and return an instance of
|
||||||
|
:class:`StreamReaderWriter`, providing transparent encoding/decoding.
|
||||||
|
The default file mode is ``'r'``, meaning to open the file in read mode.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
Files are always opened in binary mode, even if no binary mode was
|
Underlying encoded files are always opened in binary mode.
|
||||||
specified. This is done to avoid data loss due to encodings using 8-bit
|
No automatic conversion of ``'\n'`` is done on reading and writing.
|
||||||
values. This means that no automatic conversion of ``b'\n'`` is done
|
The *mode* argument may be any binary mode acceptable to the built-in
|
||||||
on reading and writing.
|
:func:`open` function; the ``'b'`` is automatically added.
|
||||||
|
|
||||||
*encoding* specifies the encoding which is to be used for the file.
|
*encoding* specifies the encoding which is to be used for the file.
|
||||||
|
Any encoding that encodes to and decodes from bytes is allowed, and
|
||||||
|
the data types supported by the file methods depend on the codec used.
|
||||||
|
|
||||||
*errors* may be given to define the error handling. It defaults to ``'strict'``
|
*errors* may be given to define the error handling. It defaults to ``'strict'``
|
||||||
which causes a :exc:`ValueError` to be raised in case an encoding error occurs.
|
which causes a :exc:`ValueError` to be raised in case an encoding error occurs.
|
||||||
|
@ -266,12 +199,15 @@ utility functions:
|
||||||
|
|
||||||
.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')
|
.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')
|
||||||
|
|
||||||
Return a wrapped version of file which provides transparent encoding
|
Return a :class:`StreamRecoder` instance, a wrapped version of *file*
|
||||||
translation.
|
which provides transparent transcoding. The original file is closed
|
||||||
|
when the wrapped version is closed.
|
||||||
|
|
||||||
Bytes written to the wrapped file are interpreted according to the given
|
Data written to the wrapped file is decoded according to the given
|
||||||
*data_encoding* and then written to the original file as bytes using the
|
*data_encoding* and then written to the original file as bytes using
|
||||||
*file_encoding*.
|
*file_encoding*. Bytes read from the original file are decoded
|
||||||
|
according to *file_encoding*, and the result is encoded
|
||||||
|
using *data_encoding*.
|
||||||
|
|
||||||
If *file_encoding* is not given, it defaults to *data_encoding*.
|
If *file_encoding* is not given, it defaults to *data_encoding*.
|
||||||
|
|
||||||
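The transcoding direction can be easy to mix up, so here is a small sketch using in-memory files. It assumes ``utf-8`` as *data_encoding* and ``latin-1`` as *file_encoding*; the variable names are illustrative only.

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes reach the underlying file.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, data_encoding="utf-8", file_encoding="latin-1")
wrapped.write("café".encode("utf-8"))
stored = raw.getvalue()

# Reading: Latin-1 bytes in the file come back as UTF-8 bytes.
reader = codecs.EncodedFile(io.BytesIO(b"caf\xe9"), "utf-8", "latin-1")
recoded = reader.read()
```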
|
@ -283,14 +219,16 @@ utility functions:
|
||||||
.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)
|
.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)
|
||||||
|
|
||||||
Uses an incremental encoder to iteratively encode the input provided by
|
Uses an incremental encoder to iteratively encode the input provided by
|
||||||
*iterator*. This function is a :term:`generator`. *errors* (as well as any
|
*iterator*. This function is a :term:`generator`.
|
||||||
|
The *errors* argument (as well as any
|
||||||
other keyword argument) is passed through to the incremental encoder.
|
other keyword argument) is passed through to the incremental encoder.
|
||||||
|
|
||||||
|
|
||||||
.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)
|
.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)
|
||||||
|
|
||||||
Uses an incremental decoder to iteratively decode the input provided by
|
Uses an incremental decoder to iteratively decode the input provided by
|
||||||
*iterator*. This function is a :term:`generator`. *errors* (as well as any
|
*iterator*. This function is a :term:`generator`.
|
||||||
|
The *errors* argument (as well as any
|
||||||
other keyword argument) is passed through to the incremental decoder.
|
other keyword argument) is passed through to the incremental decoder.
|
||||||
|
|
||||||
|
|
||||||
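A brief sketch of both generators follows, assuming the standard ``utf-8`` codec. The second call deliberately splits a two-byte sequence across chunks to show that the incremental decoder carries state from one chunk to the next.

```python
import codecs

# Each input string is encoded as it is consumed from the iterator.
encoded_chunks = list(codecs.iterencode(["caf", "é"], "utf-8"))

# The bytes for 'é' arrive in two pieces; nothing is yielded for the
# first piece because the decoder is still waiting for the second.
decoded_chunks = list(codecs.iterdecode([b"caf", b"\xc3", b"\xa9"], "utf-8"))
joined = "".join(decoded_chunks)
```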
|
@ -309,9 +247,10 @@ and writing to platform dependent files:
|
||||||
BOM_UTF32_BE
|
BOM_UTF32_BE
|
||||||
BOM_UTF32_LE
|
BOM_UTF32_LE
|
||||||
|
|
||||||
These constants define various encodings of the Unicode byte order mark (BOM)
|
These constants define various byte sequences,
|
||||||
used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
|
being Unicode byte order marks (BOMs) for several encodings. They are
|
||||||
stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
|
used in UTF-16 and UTF-32 data streams to indicate the byte order used,
|
||||||
|
and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
|
||||||
:const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
|
:const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
|
||||||
native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
|
native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
|
||||||
:const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
|
:const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
|
||||||
|
@ -325,20 +264,25 @@ Codec Base Classes
|
||||||
------------------
|
------------------
|
||||||
|
|
||||||
The :mod:`codecs` module defines a set of base classes which define the
|
The :mod:`codecs` module defines a set of base classes which define the
|
||||||
interface and can also be used to easily write your own codecs for use in
|
interfaces for working with codec objects, and can also be used as the basis
|
||||||
Python.
|
for custom codec implementations.
|
||||||
|
|
||||||
Each codec has to define four interfaces to make it usable as codec in Python:
|
Each codec has to define four interfaces to make it usable as codec in Python:
|
||||||
stateless encoder, stateless decoder, stream reader and stream writer. The
|
stateless encoder, stateless decoder, stream reader and stream writer. The
|
||||||
stream reader and writers typically reuse the stateless encoder/decoder to
|
stream reader and writers typically reuse the stateless encoder/decoder to
|
||||||
implement the file protocols.
|
implement the file protocols. Codec authors also need to define how the
|
||||||
|
codec will handle encoding and decoding errors.
|
||||||
|
|
||||||
The :class:`Codec` class defines the interface for stateless encoders/decoders.
|
|
||||||
|
|
||||||
To simplify and standardize error handling, the :meth:`~Codec.encode` and
|
.. _error-handlers:
|
||||||
:meth:`~Codec.decode` methods may implement different error handling schemes by
|
|
||||||
providing the *errors* string argument. The following string values are defined
|
Error Handlers
|
||||||
and implemented by all standard Python codecs:
|
^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
To simplify and standardize error handling,
|
||||||
|
codecs may implement different error handling schemes by
|
||||||
|
accepting the *errors* string argument. The following string values are
|
||||||
|
defined and implemented by all standard Python codecs:
|
||||||
|
|
||||||
.. tabularcolumns:: |l|L|
|
.. tabularcolumns:: |l|L|
|
||||||
|
|
||||||
|
@ -346,36 +290,52 @@ and implemented by all standard Python codecs:
|
||||||
| Value | Meaning |
|
| Value | Meaning |
|
||||||
+=========================+===============================================+
|
+=========================+===============================================+
|
||||||
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
|
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
|
||||||
| | this is the default. |
|
| | this is the default. Implemented in |
|
||||||
|
| | :func:`strict_errors`. |
|
||||||
+-------------------------+-----------------------------------------------+
|
+-------------------------+-----------------------------------------------+
|
||||||
| ``'ignore'`` | Ignore the character and continue with the |
|
| ``'ignore'`` | Ignore the malformed data and continue |
|
||||||
| | next. |
|
| | without further notice. Implemented in |
|
||||||
|
| | :func:`ignore_errors`. |
|
||||||
+-------------------------+-----------------------------------------------+
|
+-------------------------+-----------------------------------------------+
|
||||||
|
|
||||||
|
The following error handlers are only applicable to
|
||||||
|
:term:`text encodings <text encoding>`:
|
||||||
|
|
||||||
|
+-------------------------+-----------------------------------------------+
|
||||||
|
| Value | Meaning |
|
||||||
|
+=========================+===============================================+
|
||||||
| ``'replace'`` | Replace with a suitable replacement |
|
| ``'replace'`` | Replace with a suitable replacement |
|
||||||
| | character; Python will use the official |
|
| | marker; Python will use the official |
|
||||||
| | U+FFFD REPLACEMENT CHARACTER for the built-in |
|
| | ``U+FFFD`` REPLACEMENT CHARACTER for the |
|
||||||
| | Unicode codecs on decoding and '?' on |
|
| | built-in codecs on decoding, and '?' on |
|
||||||
| | encoding. |
|
| | encoding. Implemented in |
|
||||||
|
| | :func:`replace_errors`. |
|
||||||
+-------------------------+-----------------------------------------------+
|
+-------------------------+-----------------------------------------------+
|
||||||
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character |
|
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character |
|
||||||
| | reference (only for encoding). |
|
| | reference (only for encoding). Implemented |
|
||||||
|
| | in :func:`xmlcharrefreplace_errors`. |
|
||||||
+-------------------------+-----------------------------------------------+
|
+-------------------------+-----------------------------------------------+
|
||||||
| ``'backslashreplace'`` | Replace with backslashed escape sequences |
|
| ``'backslashreplace'`` | Replace with backslashed escape sequences |
|
||||||
| | (only for encoding). |
|
| | (only for encoding). Implemented in |
|
||||||
|
| | :func:`backslashreplace_errors`. |
|
||||||
+-------------------------+-----------------------------------------------+
|
+-------------------------+-----------------------------------------------+
|
||||||
| ``'surrogateescape'`` | Replace byte with surrogate U+DCxx, as defined|
|
| ``'surrogateescape'`` | On decoding, replace byte with individual |
|
||||||
| | in :pep:`383`. |
|
| | surrogate code ranging from ``U+DC80`` to |
|
||||||
|
| | ``U+DCFF``. This code will then be turned |
|
||||||
|
| | back into the same byte when the |
|
||||||
|
| | ``'surrogateescape'`` error handler is used |
|
||||||
|
| | when encoding the data. (See :pep:`383` for |
|
||||||
|
| | more.) |
|
||||||
+-------------------------+-----------------------------------------------+
|
+-------------------------+-----------------------------------------------+
|
||||||
|
|
||||||
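The handlers in the tables above can be compared side by side with a sketch like this one. It uses the ``ascii`` and ``utf-8`` codecs through :meth:`str.encode` and :meth:`bytes.decode`; the sample strings are arbitrary.

```python
text = "naïve €5"

# Encoding with a codec that cannot represent 'ï' or '€'.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
xml = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# Decoding a byte that is not valid UTF-8.
bad = b"abc\xff"
marked = bad.decode("utf-8", "replace")
smuggled = bad.decode("utf-8", "surrogateescape")
restored = smuggled.encode("utf-8", "surrogateescape")
```

Only ``'surrogateescape'`` preserves enough information to get the original bytes back.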
In addition, the following error handlers are specific to Unicode encoding
|
In addition, the following error handler is specific to the given codecs:
|
||||||
schemes:
|
|
||||||
|
|
||||||
+-------------------+------------------------+-------------------------------------------+
|
+-------------------+------------------------+-------------------------------------------+
|
||||||
| Value | Codec | Meaning |
|
| Value | Codecs | Meaning |
|
||||||
+===================+========================+===========================================+
|
+===================+========================+===========================================+
|
||||||
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate |
|
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate |
|
||||||
| | utf-16-be, utf-16-le, | codes in all the Unicode encoding schemes.|
|
| | utf-16-be, utf-16-le, | codes. These codecs normally treat the |
|
||||||
| | utf-32-be, utf-32-le | |
|
| | utf-32-be, utf-32-le | presence of surrogates as an error. |
|
||||||
+-------------------+------------------------+-------------------------------------------+
|
+-------------------+------------------------+-------------------------------------------+
|
||||||
|
|
||||||
.. versionadded:: 3.1
|
.. versionadded:: 3.1
|
||||||
|
@ -384,26 +344,96 @@ schemes:
|
||||||
.. versionchanged:: 3.4
|
.. versionchanged:: 3.4
|
||||||
The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
|
The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
|
||||||
|
|
||||||
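As a small sketch of ``'surrogatepass'``, the following encodes a lone surrogate with two of the listed codecs. It assumes Python 3.4 or later for the ``utf-16-le`` case, per the version note above.

```python
lone = "\ud800"  # a lone surrogate code point

# The codec normally treats a surrogate as an error.
rejected = False
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    rejected = True

# With 'surrogatepass' the surrogate is encoded and decoded as-is.
passed = lone.encode("utf-8", "surrogatepass")
roundtrip = passed.decode("utf-8", "surrogatepass")
passed_utf16 = lone.encode("utf-16-le", "surrogatepass")
```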
The set of allowed values can be extended via :meth:`register_error`.
|
The set of allowed values can be extended by registering a new named error
|
||||||
|
handler:
|
||||||
|
|
||||||
|
.. function:: register_error(name, error_handler)
|
||||||
|
|
||||||
|
Register the error handling function *error_handler* under the name *name*.
|
||||||
|
The *error_handler* argument will be called during encoding and decoding
|
||||||
|
in case of an error, when *name* is specified as the errors parameter.
|
||||||
|
|
||||||
|
For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
|
||||||
|
instance, which contains information about the location of the error. The
|
||||||
|
error handler must either raise this or a different exception, or return a
|
||||||
|
tuple with a replacement for the unencodable part of the input and a position
|
||||||
|
where encoding should continue. The replacement may be either :class:`str` or
|
||||||
|
:class:`bytes`. If the replacement is bytes, the encoder will simply copy
|
||||||
|
them into the output buffer. If the replacement is a string, the encoder will
|
||||||
|
encode the replacement. Encoding continues on original input at the
|
||||||
|
specified position. Negative position values will be treated as being
|
||||||
|
relative to the end of the input string. If the resulting position is out of
|
||||||
|
bound an :exc:`IndexError` will be raised.
|
||||||
|
|
||||||
|
Decoding and translating works similarly, except :exc:`UnicodeDecodeError` or
|
||||||
|
:exc:`UnicodeTranslateError` will be passed to the handler and that the
|
||||||
|
replacement from the error handler will be put into the output directly.
|
||||||
|
|
||||||
|
|
||||||
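A custom handler could be sketched as follows. The handler name ``demo_underscore`` and the function ``underscore_errors`` are hypothetical; the handler only deals with encoding errors and re-raises anything else.

```python
import codecs

def underscore_errors(exc):
    """Replace each unencodable character with an underscore."""
    if isinstance(exc, UnicodeEncodeError):
        # Return (replacement, position at which encoding should continue).
        return ("_" * (exc.end - exc.start), exc.end)
    raise exc  # decoding and translating are not handled in this sketch

codecs.register_error("demo_underscore", underscore_errors)
result = "naïve café".encode("ascii", "demo_underscore")
```

Returning ``exc.end`` as the position tells the encoder to resume right after the offending span.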
|
Previously registered error handlers (including the standard error handlers)
|
||||||
|
can be looked up by name:
|
||||||
|
|
||||||
|
.. function:: lookup_error(name)
|
||||||
|
|
||||||
|
Return the error handler previously registered under the name *name*.
|
||||||
|
|
||||||
|
Raises a :exc:`LookupError` in case the handler cannot be found.
|
||||||
|
|
||||||
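For example, the standard handlers can be retrieved by name, as in this short sketch. The name ``no-such-handler-demo`` is made up so that the lookup fails.

```python
import codecs

# The standard handlers are registered under their documented names.
strict = codecs.lookup_error("strict")
same_object = strict is codecs.strict_errors

# An unknown name raises LookupError.
unknown = False
try:
    codecs.lookup_error("no-such-handler-demo")
except LookupError:
    unknown = True
```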
|
The following standard error handlers are also made available as module level
|
||||||
|
functions:
|
||||||
|
|
||||||
|
.. function:: strict_errors(exception)
|
||||||
|
|
||||||
|
Implements the ``'strict'`` error handling: each encoding or
|
||||||
|
decoding error raises a :exc:`UnicodeError`.
|
||||||
|
|
||||||
|
|
||||||
|
.. function:: replace_errors(exception)
|
||||||
|
|
||||||
|
Implements the ``'replace'`` error handling (for :term:`text encodings
|
||||||
|
<text encoding>` only): substitutes ``'?'`` for encoding errors
|
||||||
|
(to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
|
||||||
|
character, ``'<27>'``) for decoding errors.
|
||||||
|
|
||||||
|
|
||||||
|
.. function:: ignore_errors(exception)
|
||||||
|
|
||||||
|
Implements the ``'ignore'`` error handling: malformed data is ignored and
|
||||||
|
encoding or decoding is continued without further notice.
|
||||||
|
|
||||||
|
|
||||||
|
.. function:: xmlcharrefreplace_errors(exception)
|
||||||
|
|
||||||
|
Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
|
||||||
|
:term:`text encodings <text encoding>` only): the
|
||||||
|
unencodable character is replaced by an appropriate XML character reference.
|
||||||
|
|
||||||
|
|
||||||
|
.. function:: backslashreplace_errors(exception)
|
||||||
|
|
||||||
|
Implements the ``'backslashreplace'`` error handling (for encoding with
|
||||||
|
:term:`text encodings <text encoding>` only): the
|
||||||
|
unencodable character is replaced by a backslashed escape sequence.
|
||||||
|
|
||||||
|
|
||||||
.. _codec-objects:
|
.. _codec-objects:
|
||||||
|
|
||||||
Codec Objects
|
Stateless Encoding and Decoding
|
||||||
^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
The :class:`Codec` class defines these methods which also define the function
|
The base :class:`Codec` class defines these methods which also define the
|
||||||
interfaces of the stateless encoder and decoder:
|
function interfaces of the stateless encoder and decoder:
|
||||||
|
|
||||||
|
|
||||||
.. method:: Codec.encode(input[, errors])
|
.. method:: Codec.encode(input[, errors])
|
||||||
|
|
||||||
Encodes the object *input* and returns a tuple (output object, length consumed).
|
Encodes the object *input* and returns a tuple (output object, length consumed).
|
||||||
Encoding converts a string object to a bytes object using a particular
|
For instance, :term:`text encoding` converts
|
||||||
|
a string object to a bytes object using a particular
|
||||||
character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).
|
character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).
|
||||||
|
|
||||||
*errors* defines the error handling to apply. It defaults to ``'strict'``
|
The *errors* argument defines the error handling to apply.
|
||||||
handling.
|
It defaults to ``'strict'`` handling.
|
||||||
|
|
||||||
The method may not store state in the :class:`Codec` instance. Use
|
The method may not store state in the :class:`Codec` instance. Use
|
||||||
:class:`StreamCodec` for codecs which have to keep state in order to make
|
:class:`StreamCodec` for codecs which have to keep state in order to make
|
||||||
|
@ -416,14 +446,16 @@ interfaces of the stateless encoder and decoder:
|
||||||
.. method:: Codec.decode(input[, errors])
|
.. method:: Codec.decode(input[, errors])
|
||||||
|
|
||||||
Decodes the object *input* and returns a tuple (output object, length
|
Decodes the object *input* and returns a tuple (output object, length
|
||||||
consumed). Decoding converts a bytes object encoded using a particular
|
consumed). For instance, for a :term:`text encoding`, decoding converts
|
||||||
|
a bytes object encoded using a particular
|
||||||
character set encoding to a string object.
|
character set encoding to a string object.
|
||||||
|
|
||||||
*input* must be a bytes object or one which provides the read-only character
|
For text encodings and bytes-to-bytes codecs,
|
||||||
|
*input* must be a bytes object or one which provides the read-only
|
||||||
buffer interface -- for example, buffer objects and memory mapped files.
|
buffer interface -- for example, buffer objects and memory mapped files.
|
||||||
|
|
||||||
*errors* defines the error handling to apply. It defaults to ``'strict'``
|
The *errors* argument defines the error handling to apply.
|
||||||
handling.
|
It defaults to ``'strict'`` handling.
|
||||||
|
|
||||||
The method may not store state in the :class:`Codec` instance. Use
|
The method may not store state in the :class:`Codec` instance. Use
|
||||||
:class:`StreamCodec` for codecs which have to keep state in order to make
|
:class:`StreamCodec` for codecs which have to keep state in order to make
|
||||||
|
@ -432,6 +464,10 @@ interfaces of the stateless encoder and decoder:
|
||||||
The decoder must be able to handle zero length input and return an empty object
|
The decoder must be able to handle zero length input and return an empty object
|
||||||
of the output object type in this situation.
|
of the output object type in this situation.
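
A short sketch of the stateless interface, assuming the ``utf-8`` codec and obtaining its functions through :func:`lookup`: both calls return an ``(output, length consumed)`` tuple, and empty input yields an empty output object.

```python
import codecs

info = codecs.lookup("utf-8")

# Encoding: the consumed length counts input code points.
data, chars_consumed = info.encode("h\u00e9llo")

# Decoding: the consumed length counts input bytes.
text, bytes_consumed = info.decode(data)

# Zero-length input must be accepted and produce an empty result.
empty_text, empty_consumed = info.decode(b"")

print(data, chars_consumed, text, bytes_consumed, repr(empty_text))
```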
|
||||||
|
|
||||||
|
|
||||||
|
Incremental Encoding and Decoding
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
|
The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
|
||||||
the basic interface for incremental encoding and decoding. Encoding/decoding the
|
the basic interface for incremental encoding and decoding. Encoding/decoding the
|
||||||
input isn't done with one call to the stateless encoder/decoder function, but
|
input isn't done with one call to the stateless encoder/decoder function, but
|
||||||
|
@ -449,14 +485,14 @@ encoded/decoded with the stateless encoder/decoder.
|
||||||
.. _incremental-encoder-objects:
|
.. _incremental-encoder-objects:
|
||||||
|
|
||||||
IncrementalEncoder Objects
|
IncrementalEncoder Objects
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The :class:`IncrementalEncoder` class is used for encoding an input in multiple
|
The :class:`IncrementalEncoder` class is used for encoding an input in multiple
|
||||||
steps. It defines the following methods which every incremental encoder must
|
steps. It defines the following methods which every incremental encoder must
|
||||||
define in order to be compatible with the Python codec registry.
|
define in order to be compatible with the Python codec registry.
|
||||||
|
|
||||||
|
|
||||||
.. class:: IncrementalEncoder([errors])
|
.. class:: IncrementalEncoder(errors='strict')
|
||||||
|
|
||||||
Constructor for an :class:`IncrementalEncoder` instance.
|
Constructor for an :class:`IncrementalEncoder` instance.
|
||||||
|
|
||||||
|
@ -465,26 +501,14 @@ define in order to be compatible with the Python codec registry.
|
||||||
the Python codec registry.
|
the Python codec registry.
|
||||||
|
|
||||||
The :class:`IncrementalEncoder` may implement different error handling schemes
|
The :class:`IncrementalEncoder` may implement different error handling schemes
|
||||||
by providing the *errors* keyword argument. These parameters are predefined:
|
by providing the *errors* keyword argument. See :ref:`error-handlers` for
|
||||||
|
possible values.
|
||||||
* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
|
|
||||||
|
|
||||||
* ``'ignore'`` Ignore the character and continue with the next.
|
|
||||||
|
|
||||||
* ``'replace'`` Replace with a suitable replacement character
|
|
||||||
|
|
||||||
* ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference
|
|
||||||
|
|
||||||
* ``'backslashreplace'`` Replace with backslashed escape sequences.
|
|
||||||
|
|
||||||
The *errors* argument will be assigned to an attribute of the same name.
|
The *errors* argument will be assigned to an attribute of the same name.
|
||||||
Assigning to this attribute makes it possible to switch between different error
|
Assigning to this attribute makes it possible to switch between different error
|
||||||
handling strategies during the lifetime of the :class:`IncrementalEncoder`
|
handling strategies during the lifetime of the :class:`IncrementalEncoder`
|
||||||
object.
|
object.
|
||||||
|
|
||||||
The set of allowed values for the *errors* argument can be extended with
|
|
||||||
:func:`register_error`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: encode(object[, final])
|
.. method:: encode(object[, final])
|
||||||
|
|
||||||
|
@ -496,7 +520,8 @@ define in order to be compatible with the Python codec registry.
|
||||||
.. method:: reset()
|
.. method:: reset()
|
||||||
|
|
||||||
Reset the encoder to the initial state. The output is discarded: call
|
Reset the encoder to the initial state. The output is discarded: call
|
||||||
``.encode('', final=True)`` to reset the encoder and to get the output.
|
``.encode(object, final=True)``, passing an empty byte or text string
|
||||||
|
if necessary, to reset the encoder and to get the output.
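
The following sketch illustrates incremental encoding with the ``ascii`` codec (chosen here only to keep the output easy to read). It also switches the ``errors`` attribute partway through and finishes with an empty string and ``final=True``.

```python
import codecs

encoder = codecs.getincrementalencoder("ascii")(errors="strict")

pieces = [encoder.encode("abc")]

# Assigning to the attribute changes the error handling for later calls.
encoder.errors = "replace"
pieces.append(encoder.encode("d\u00e9f"))

# Flush any pending output by passing an empty string with final=True.
pieces.append(encoder.encode("", final=True))

encoded = b"".join(pieces)
print(encoded)
```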
|
||||||
|
|
||||||
|
|
||||||
.. method:: IncrementalEncoder.getstate()
|
.. method:: IncrementalEncoder.getstate()
|
||||||
|
@ -517,14 +542,14 @@ define in order to be compatible with the Python codec registry.
|
||||||
.. _incremental-decoder-objects:
|
.. _incremental-decoder-objects:
|
||||||
|
|
||||||
IncrementalDecoder Objects
|
IncrementalDecoder Objects
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The :class:`IncrementalDecoder` class is used for decoding an input in multiple
|
The :class:`IncrementalDecoder` class is used for decoding an input in multiple
|
||||||
steps. It defines the following methods which every incremental decoder must
|
steps. It defines the following methods which every incremental decoder must
|
||||||
define in order to be compatible with the Python codec registry.
|
define in order to be compatible with the Python codec registry.
|
||||||
|
|
||||||
|
|
||||||
.. class:: IncrementalDecoder([errors])
|
.. class:: IncrementalDecoder(errors='strict')
|
||||||
|
|
||||||
Constructor for an :class:`IncrementalDecoder` instance.
|
Constructor for an :class:`IncrementalDecoder` instance.
|
||||||
|
|
||||||
|
@ -533,22 +558,14 @@ define in order to be compatible with the Python codec registry.
|
||||||
the Python codec registry.
|
the Python codec registry.
|
||||||
|
|
||||||
The :class:`IncrementalDecoder` may implement different error handling schemes
|
The :class:`IncrementalDecoder` may implement different error handling schemes
|
||||||
by providing the *errors* keyword argument. These parameters are predefined:
|
by providing the *errors* keyword argument. See :ref:`error-handlers` for
|
||||||
|
possible values.
|
||||||
* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
|
|
||||||
|
|
||||||
* ``'ignore'`` Ignore the character and continue with the next.
|
|
||||||
|
|
||||||
* ``'replace'`` Replace with a suitable replacement character.
|
|
||||||
|
|
||||||
The *errors* argument will be assigned to an attribute of the same name.
|
The *errors* argument will be assigned to an attribute of the same name.
|
||||||
Assigning to this attribute makes it possible to switch between different error
|
Assigning to this attribute makes it possible to switch between different error
|
||||||
handling strategies during the lifetime of the :class:`IncrementalDecoder`
|
handling strategies during the lifetime of the :class:`IncrementalDecoder`
|
||||||
object.
|
object.
|
||||||
|
|
||||||
The set of allowed values for the *errors* argument can be extended with
|
|
||||||
:func:`register_error`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: decode(object[, final])
|
.. method:: decode(object[, final])
|
||||||
|
|
||||||
|
@ -587,6 +604,10 @@ define in order to be compatible with the Python codec registry.
|
||||||
returned by :meth:`getstate`.
|
returned by :meth:`getstate`.
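
As an illustration of why incremental decoding keeps state, this sketch feeds a multi-byte UTF-8 sequence to the decoder in two pieces. The split point is arbitrary and chosen for demonstration.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
euro = "\u20ac".encode("utf-8")  # three bytes: b'\xe2\x82\xac'

# The first byte alone is incomplete, so nothing is returned yet.
first = decoder.decode(euro[:1])

# The undecoded byte is held in the decoder state.
pending = decoder.getstate()[0]

# Supplying the rest completes the character.
second = decoder.decode(euro[1:], final=True)

print(repr(first), pending, second)
```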
|
||||||
|
|
||||||
|
|
||||||
|
Stream Encoding and Decoding
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
|
||||||
The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
|
The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
|
||||||
working interfaces which can be used to implement new encoding submodules very
|
working interfaces which can be used to implement new encoding submodules very
|
||||||
easily. See :mod:`encodings.utf_8` for an example of how this is done.
|
easily. See :mod:`encodings.utf_8` for an example of how this is done.
|
||||||
|
@ -595,14 +616,14 @@ easily. See :mod:`encodings.utf_8` for an example of how this is done.
|
||||||
.. _stream-writer-objects:
|
.. _stream-writer-objects:
|
||||||
|
|
||||||
StreamWriter Objects
|
StreamWriter Objects
|
||||||
^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
|
The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
|
||||||
following methods which every stream writer must define in order to be
|
following methods which every stream writer must define in order to be
|
||||||
compatible with the Python codec registry.
|
compatible with the Python codec registry.
|
||||||
|
|
||||||
|
|
||||||
.. class:: StreamWriter(stream[, errors])
|
.. class:: StreamWriter(stream, errors='strict')
|
||||||
|
|
||||||
Constructor for a :class:`StreamWriter` instance.
|
Constructor for a :class:`StreamWriter` instance.
|
||||||
|
|
||||||
|
@ -610,29 +631,17 @@ compatible with the Python codec registry.
|
||||||
additional keyword arguments, but only the ones defined here are used by the
|
additional keyword arguments, but only the ones defined here are used by the
|
||||||
Python codec registry.
|
Python codec registry.
|
||||||
|
|
||||||
*stream* must be a file-like object open for writing binary data.
|
The *stream* argument must be a file-like object open for writing
|
||||||
|
text or binary data, as appropriate for the specific codec.
|
||||||
|
|
||||||
The :class:`StreamWriter` may implement different error handling schemes by
|
The :class:`StreamWriter` may implement different error handling schemes by
|
||||||
providing the *errors* keyword argument. These parameters are predefined:
|
providing the *errors* keyword argument. See :ref:`error-handlers` for
|
||||||
|
the standard error handlers the underlying stream codec may support.
|
||||||
* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
|
|
||||||
|
|
||||||
* ``'ignore'`` Ignore the character and continue with the next.
|
|
||||||
|
|
||||||
* ``'replace'`` Replace with a suitable replacement character
|
|
||||||
|
|
||||||
* ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference
|
|
||||||
|
|
||||||
* ``'backslashreplace'`` Replace with backslashed escape sequences.
|
|
||||||
|
|
||||||
The *errors* argument will be assigned to an attribute of the same name.
|
The *errors* argument will be assigned to an attribute of the same name.
|
||||||
Assigning to this attribute makes it possible to switch between different error
|
Assigning to this attribute makes it possible to switch between different error
|
||||||
handling strategies during the lifetime of the :class:`StreamWriter` object.
|
handling strategies during the lifetime of the :class:`StreamWriter` object.
|
||||||
|
|
||||||
The set of allowed values for the *errors* argument can be extended with
|
|
||||||
:func:`register_error`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: write(object)
|
.. method:: write(object)
|
||||||
|
|
||||||
Writes the object's contents encoded to the stream.
|
Writes the object's contents encoded to the stream.
|
||||||
|
@ -641,7 +650,8 @@ compatible with the Python codec registry.
|
||||||
.. method:: writelines(list)
|
.. method:: writelines(list)
|
||||||
|
|
||||||
Writes the concatenated list of strings to the stream (possibly by reusing
|
Writes the concatenated list of strings to the stream (possibly by reusing
|
||||||
the :meth:`write` method).
|
the :meth:`write` method). The standard bytes-to-bytes codecs
|
||||||
|
do not support this method.
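
A minimal sketch of a stream writer for a text encoding, assuming an in-memory :class:`io.BytesIO` object as the underlying binary stream:

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, errors="strict")

# Strings are encoded and written to the underlying byte stream.
writer.write("gr\u00fc\u00df")
writer.writelines([" one", " two"])

written = raw.getvalue()
print(written)
```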
|
||||||
|
|
||||||
|
|
||||||
.. method:: reset()
|
.. method:: reset()
|
||||||
|
@ -660,14 +670,14 @@ all other methods and attributes from the underlying stream.
|
||||||
.. _stream-reader-objects:
|
.. _stream-reader-objects:
|
||||||
|
|
||||||
StreamReader Objects
|
StreamReader Objects
|
||||||
^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
|
The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
|
||||||
following methods which every stream reader must define in order to be
|
following methods which every stream reader must define in order to be
|
||||||
compatible with the Python codec registry.
|
compatible with the Python codec registry.
|
||||||
|
|
||||||
|
|
||||||
.. class:: StreamReader(stream[, errors])
|
.. class:: StreamReader(stream, errors='strict')
|
||||||
|
|
||||||
Constructor for a :class:`StreamReader` instance.
|
Constructor for a :class:`StreamReader` instance.
|
||||||
|
|
||||||
|
@ -675,16 +685,12 @@ compatible with the Python codec registry.
|
||||||
additional keyword arguments, but only the ones defined here are used by the
|
additional keyword arguments, but only the ones defined here are used by the
|
||||||
Python codec registry.
|
Python codec registry.
|
||||||
|
|
||||||
*stream* must be a file-like object open for reading (binary) data.
|
The *stream* argument must be a file-like object open for reading
|
||||||
|
text or binary data, as appropriate for the specific codec.
|
||||||
|
|
||||||
The :class:`StreamReader` may implement different error handling schemes by
|
The :class:`StreamReader` may implement different error handling schemes by
|
||||||
providing the *errors* keyword argument. These parameters are defined:
|
providing the *errors* keyword argument. See :ref:`error-handlers` for
|
||||||
|
the standard error handlers the underlying stream codec may support.
|
||||||
* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
|
|
||||||
|
|
||||||
* ``'ignore'`` Ignore the character and continue with the next.
|
|
||||||
|
|
||||||
* ``'replace'`` Replace with a suitable replacement character.
|
|
||||||
|
|
||||||
The *errors* argument will be assigned to an attribute of the same name.
|
The *errors* argument will be assigned to an attribute of the same name.
|
||||||
Assigning to this attribute makes it possible to switch between different error
|
Assigning to this attribute makes it possible to switch between different error
|
||||||
|
@ -698,17 +704,20 @@ compatible with the Python codec registry.
|
||||||
|
|
||||||
Decodes data from the stream and returns the resulting object.
|
Decodes data from the stream and returns the resulting object.
|
||||||
|
|
||||||
*chars* indicates the number of characters to read from the
|
The *chars* argument indicates the number of decoded
|
||||||
stream. :func:`read` will never return more than *chars* characters, but
|
code points or bytes to return. The :func:`read` method will
|
||||||
it might return less, if there are not enough characters available.
|
never return more data than requested, but it might return less,
|
||||||
|
if there is not enough available.
|
||||||
|
|
||||||
*size* indicates the approximate maximum number of bytes to read from the
|
The *size* argument indicates the approximate maximum
|
||||||
stream for decoding purposes. The decoder can modify this setting as
|
number of encoded bytes or code points to read
|
||||||
|
for decoding. The decoder can modify this setting as
|
||||||
appropriate. The default value -1 indicates to read and decode as much as
|
appropriate. The default value -1 indicates to read and decode as much as
|
||||||
possible. *size* is intended to prevent having to decode huge files in
|
possible. This parameter is intended to
|
||||||
one step.
|
prevent having to decode huge files in one step.
|
||||||
|
|
||||||
*firstline* indicates that it would be sufficient to only return the first
|
The *firstline* flag indicates that
|
||||||
|
it would be sufficient to only return the first
|
||||||
line, if there are decoding errors on later lines.
|
line, if there are decoding errors on later lines.
|
||||||
|
|
||||||
The method should use a greedy read strategy meaning that it should read
|
The method should use a greedy read strategy meaning that it should read
|
||||||
|
@ -751,17 +760,13 @@ compatible with the Python codec registry.
|
||||||
In addition to the above methods, the :class:`StreamReader` must also inherit
|
In addition to the above methods, the :class:`StreamReader` must also inherit
|
||||||
all other methods and attributes from the underlying stream.
|
all other methods and attributes from the underlying stream.
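
The *chars* argument of ``read()`` can be sketched as follows, assuming a UTF-8 reader over an in-memory byte stream. The sample text is arbitrary.

```python
import codecs
import io

raw = io.BytesIO("h\u00e9llo\nworld\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# Ask for at most two decoded code points.
first_two = reader.read(chars=2)

# A plain read() returns everything that remains.
rest = reader.read()

print(repr(first_two), repr(rest))
```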
|
||||||
|
|
||||||
The next two base classes are included for convenience. They are not needed by
|
|
||||||
the codec registry, but may provide useful in practice.
|
|
||||||
|
|
||||||
|
|
||||||
.. _stream-reader-writer:
|
.. _stream-reader-writer:
|
||||||
|
|
||||||
StreamReaderWriter Objects
|
StreamReaderWriter Objects
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The :class:`StreamReaderWriter` allows wrapping streams which work in both read
|
The :class:`StreamReaderWriter` is a convenience class that allows wrapping
|
||||||
and write modes.
|
streams which work in both read and write modes.
|
||||||
|
|
||||||
The design is such that one can use the factory functions returned by the
|
The design is such that one can use the factory functions returned by the
|
||||||
:func:`lookup` function to construct the instance.
|
:func:`lookup` function to construct the instance.
|
||||||
|
@ -782,9 +787,9 @@ methods and attributes from the underlying stream.
|
||||||
.. _stream-recoder-objects:
|
.. _stream-recoder-objects:
|
||||||
|
|
||||||
StreamRecoder Objects
|
StreamRecoder Objects
|
||||||
^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The :class:`StreamRecoder` provide a frontend - backend view of encoding data
|
The :class:`StreamRecoder` translates data from one encoding to another,
|
||||||
which is sometimes useful when dealing with different encoding environments.
|
which is sometimes useful when dealing with different encoding environments.
|
||||||
|
|
||||||
The design is such that one can use the factory functions returned by the
|
The design is such that one can use the factory functions returned by the
|
||||||
|
@ -794,22 +799,20 @@ The design is such that one can use the factory functions returned by the
|
||||||
.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)
|
.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)
|
||||||
|
|
||||||
Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
|
Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
|
||||||
*encode* and *decode* work on the frontend (the input to :meth:`read` and output
|
*encode* and *decode* work on the frontend — the data visible to
|
||||||
of :meth:`write`) while *Reader* and *Writer* work on the backend (reading and
|
code calling :meth:`read` and :meth:`write`, while *Reader* and *Writer*
|
||||||
writing to the stream).
|
work on the backend — the data in *stream*.
|
||||||
|
|
||||||
You can use these objects to do transparent direct recodings from e.g. Latin-1
|
You can use these objects to do transparent transcodings from e.g. Latin-1
|
||||||
to UTF-8 and back.
|
to UTF-8 and back.
|
||||||
|
|
||||||
*stream* must be a file-like object.
|
The *stream* argument must be a file-like object.
|
||||||
|
|
||||||
*encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
|
The *encode* and *decode* arguments must
|
||||||
|
adhere to the :class:`Codec` interface. *Reader* and
|
||||||
*Writer* must be factory functions or classes providing objects of the
|
*Writer* must be factory functions or classes providing objects of the
|
||||||
:class:`StreamReader` and :class:`StreamWriter` interface respectively.
|
:class:`StreamReader` and :class:`StreamWriter` interface respectively.
|
||||||
|
|
||||||
*encode* and *decode* are needed for the frontend translation, *Reader* and
|
|
||||||
*Writer* for the backend translation.
|
|
||||||
|
|
||||||
Error handling is done in the same way as defined for the stream readers and
|
Error handling is done in the same way as defined for the stream readers and
|
||||||
writers.
|
writers.
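
A sketch of the frontend/backend split, assuming UTF-8 on the frontend and Latin-1 in the underlying stream, with the codec functions taken from :func:`lookup`:

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")      # frontend: what callers read and write
latin1 = codecs.lookup("latin-1")  # backend: what is stored in the stream

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend, utf8.encode, utf8.decode,
    latin1.streamreader, latin1.streamwriter, "strict",
)

# Callers write UTF-8 bytes; the stream receives Latin-1 bytes.
recoder.write("caf\u00e9".encode("utf-8"))
stored = backend.getvalue()

# Reading converts the Latin-1 stream contents back to UTF-8.
backend.seek(0)
read_back = recoder.read()

print(stored, read_back)
```

:func:`codecs.EncodedFile` builds the same kind of object from two encoding names.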
|
||||||
|
|
||||||
|
@ -824,20 +827,23 @@ methods and attributes from the underlying stream.
|
||||||
Encodings and Unicode
|
Encodings and Unicode
|
||||||
---------------------
|
---------------------
|
||||||
|
|
||||||
Strings are stored internally as sequences of codepoints in range ``0 - 10FFFF``
|
Strings are stored internally as sequences of codepoints in
|
||||||
(see :pep:`393` for more details about the implementation).
|
range ``0x0``-``0x10FFFF``. (See :pep:`393` for
|
||||||
Once a string object is used outside of CPU and memory, CPU endianness
|
more details about the implementation.)
|
||||||
and how these arrays are stored as bytes become an issue. Transforming a
|
Once a string object is used outside of CPU and memory, endianness
|
||||||
string object into a sequence of bytes is called encoding and recreating the
|
and how these arrays are stored as bytes become an issue. As with other
|
||||||
string object from the sequence of bytes is known as decoding. There are many
|
codecs, serialising a string into a sequence of bytes is known as *encoding*,
|
||||||
different methods for how this transformation can be done (these methods are
|
and recreating the string from the sequence of bytes is known as *decoding*.
|
||||||
also called encodings). The simplest method is to map the codepoints 0-255 to
|
There are a variety of different text serialisation codecs, which are
|
||||||
the bytes ``0x0``-``0xff``. This means that a string object that contains
|
collectively referred to as :term:`text encodings <text encoding>`.
|
||||||
codepoints above ``U+00FF`` can't be encoded with this method (which is called
|
|
||||||
``'latin-1'`` or ``'iso-8859-1'``). :func:`str.encode` will raise a
|
The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
|
||||||
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
|
the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
|
||||||
codec can't encode character '\u1234' in position 3: ordinal not in
|
object that contains codepoints above ``U+00FF`` can't be encoded with this
|
||||||
range(256)``.
|
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
|
||||||
|
like the following (although the details of the error message may differ):
|
||||||
|
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
|
||||||
|
position 3: ordinal not in range(256)``.
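
The Latin-1 limit can be sketched as follows. The input string is an arbitrary example with an unencodable character at index 3.

```python
# Code points up to U+00FF map directly to single bytes.
ok = "\u00ff".encode("latin-1")

# Anything above U+00FF cannot be represented.
try:
    "abc\u1234".encode("latin-1")
    error_position = None
except UnicodeEncodeError as exc:
    error_position = exc.start  # index of the offending character
    print(exc)

print(ok, error_position)
```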
|
||||||
|
|
||||||
There's another group of encodings (the so called charmap encodings) that choose
|
There's another group of encodings (the so called charmap encodings) that choose
|
||||||
a different subset of all Unicode code points and how these codepoints are
|
a different subset of all Unicode code points and how these codepoints are
|
||||||
|
@ -1184,7 +1190,8 @@ particular, the following variants typically exist:
|
||||||
|
|
||||||
.. versionchanged:: 3.4
|
.. versionchanged:: 3.4
|
||||||
The utf-16\* and utf-32\* encoders no longer allow surrogate code points
|
The utf-16\* and utf-32\* encoders no longer allow surrogate code points
|
||||||
(U+D800--U+DFFF) to be encoded. The utf-32\* decoders no longer decode
|
(``U+D800``--``U+DFFF``) to be encoded.
|
||||||
|
The utf-32\* decoders no longer decode
|
||||||
byte sequences that correspond to surrogate code points.
|
byte sequences that correspond to surrogate code points.
|
||||||
|
|
||||||
|
|
||||||
|
@ -1212,7 +1219,9 @@ encodings.
|
||||||
+====================+=========+===========================+
|
+====================+=========+===========================+
|
||||||
| idna | | Implements :rfc:`3490`, |
|
| idna | | Implements :rfc:`3490`, |
|
||||||
| | | see also |
|
| | | see also |
|
||||||
| | | :mod:`encodings.idna` |
|
| | | :mod:`encodings.idna`. |
|
||||||
|
| | | Only ``errors='strict'`` |
|
||||||
|
| | | is supported. |
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| mbcs | dbcs | Windows only: Encode |
|
| mbcs | dbcs | Windows only: Encode |
|
||||||
| | | operand according to the |
|
| | | operand according to the |
|
||||||
|
@ -1220,31 +1229,44 @@ encodings.
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| palmos | | Encoding of PalmOS 3.5 |
|
| palmos | | Encoding of PalmOS 3.5 |
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| punycode | | Implements :rfc:`3492` |
|
| punycode | | Implements :rfc:`3492`. |
|
||||||
|
| | | Stateful codecs are not |
|
||||||
|
| | | supported. |
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| raw_unicode_escape | | Produce a string that is |
|
| raw_unicode_escape | | Latin-1 encoding with |
|
||||||
| | | suitable as raw Unicode |
|
| | | ``\uXXXX`` and |
|
||||||
| | | literal in Python source |
|
| | | ``\UXXXXXXXX`` for other |
|
||||||
| | | code |
|
| | | code points. Existing |
|
||||||
|
| | | backslashes are not |
|
||||||
|
| | | escaped in any way. |
|
||||||
|
| | | It is used in the Python |
|
||||||
|
| | | pickle protocol. |
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| undefined | | Raise an exception for |
|
| undefined | | Raise an exception for |
|
||||||
| | | all conversions. Can be |
|
| | | all conversions, even |
|
||||||
| | | used as the system |
|
| | | empty strings. The error |
|
||||||
| | | encoding if no automatic |
|
| | | handler is ignored. |
|
||||||
| | | coercion between byte and |
|
|
||||||
| | | Unicode strings is |
|
|
||||||
| | | desired. |
|
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| unicode_escape | | Produce a string that is |
|
| unicode_escape | | Encoding suitable as the |
|
||||||
| | | suitable as Unicode |
|
| | | contents of a Unicode |
|
||||||
| | | literal in Python source |
|
| | | literal in ASCII-encoded |
|
||||||
| | | code |
|
| | | Python source code, |
|
||||||
|
| | | except that quotes are |
|
||||||
|
| | | not escaped. Decodes from |
|
||||||
|
| | | Latin-1 source code. |
|
||||||
|
| | | Beware that Python source |
|
||||||
|
| | | code actually uses UTF-8 |
|
||||||
|
| | | by default. |
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
| unicode_internal | | Return the internal |
|
| unicode_internal | | Return the internal |
|
||||||
| | | representation of the |
|
| | | representation of the |
|
||||||
| | | operand |
|
| | | operand. Stateful codecs |
|
||||||
|
| | | are not supported. |
|
||||||
| | | |
|
| | | |
|
||||||
| | | .. deprecated:: 3.3 |
|
| | | .. deprecated:: 3.3 |
|
||||||
|
| | | This representation is |
|
||||||
|
| | | obsoleted by |
|
||||||
|
| | | :pep:`393`. |
|
||||||
+--------------------+---------+---------------------------+
|
+--------------------+---------+---------------------------+
|
||||||
|
|
||||||
.. _binary-transforms:
|
.. _binary-transforms:
|
||||||
|
@ -1253,7 +1275,8 @@ Binary Transforms
|
||||||
^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
The following codecs provide binary transforms: :term:`bytes-like object`
|
The following codecs provide binary transforms: :term:`bytes-like object`
|
||||||
to :class:`bytes` mappings.
|
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
|
||||||
|
(which only produces :class:`str` output).
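
A sketch of the difference, using the ``hex_codec`` binary transform as an example: :func:`codecs.encode` and :func:`codecs.decode` accept it, while :meth:`bytes.decode` rejects it.

```python
import codecs

# Bytes-to-bytes transforms go through the codecs module functions.
packed = codecs.encode(b"hello", "hex_codec")
restored = codecs.decode(packed, "hex_codec")

# bytes.decode() only accepts text encodings.
try:
    packed.decode("hex_codec")
    rejected = False
except LookupError:
    rejected = True

print(packed, restored, rejected)
```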
|
||||||
|
|
||||||
|
|
||||||
.. tabularcolumns:: |l|L|L|L|
|
.. tabularcolumns:: |l|L|L|L|
|
||||||
|
@ -1308,7 +1331,8 @@ Text Transforms
|
||||||
^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
The following codec provides a text transform: a :class:`str` to :class:`str`
|
The following codec provides a text transform: a :class:`str` to :class:`str`
|
||||||
mapping.
|
mapping. It is not supported by :meth:`str.encode` (which only produces
|
||||||
|
:class:`bytes` output).
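
The same restriction can be sketched for the ``rot_13`` text transform: it works through :func:`codecs.encode` but not through :meth:`str.encode`.

```python
import codecs

# str-to-str transforms go through the codecs module functions.
scrambled = codecs.encode("Hello", "rot_13")
plain = codecs.decode(scrambled, "rot_13")

# str.encode() only accepts text encodings.
try:
    "Hello".encode("rot_13")
    rejected = False
except LookupError:
    rejected = True

print(scrambled, plain, rejected)
```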
|
||||||
|
|
||||||
.. tabularcolumns:: |l|l|L|
|
.. tabularcolumns:: |l|l|L|
|
||||||
|
|
||||||
|
|
|
@ -939,15 +939,17 @@ are always available. They are listed here in alphabetical order.
|
||||||
*encoding* is the name of the encoding used to decode or encode the file.
|
*encoding* is the name of the encoding used to decode or encode the file.
|
||||||
This should only be used in text mode. The default encoding is platform
|
This should only be used in text mode. The default encoding is platform
|
||||||
dependent (whatever :func:`locale.getpreferredencoding` returns), but any
|
dependent (whatever :func:`locale.getpreferredencoding` returns), but any
|
||||||
encoding supported by Python can be used. See the :mod:`codecs` module for
|
:term:`text encoding` supported by Python
|
||||||
|
can be used. See the :mod:`codecs` module for
|
||||||
the list of supported encodings.
|
the list of supported encodings.
|
||||||
|
|
||||||
*errors* is an optional string that specifies how encoding and decoding
|
*errors* is an optional string that specifies how encoding and decoding
|
||||||
errors are to be handled--this cannot be used in binary mode.
|
errors are to be handled--this cannot be used in binary mode.
|
||||||
A variety of standard error handlers are available, though any
|
A variety of standard error handlers are available
|
||||||
|
(listed under :ref:`error-handlers`), though any
|
||||||
error handling name that has been registered with
|
error handling name that has been registered with
|
||||||
:func:`codecs.register_error` is also valid. The standard names
|
:func:`codecs.register_error` is also valid. The standard names
|
||||||
are:
|
include:
|
||||||
|
|
||||||
* ``'strict'`` to raise a :exc:`ValueError` exception if there is
|
* ``'strict'`` to raise a :exc:`ValueError` exception if there is
|
||||||
an encoding error. The default value of ``None`` has the same
|
an encoding error. The default value of ``None`` has the same
|
||||||
|
|
|
@ -1512,7 +1512,7 @@ expression support in the :mod:`re` module).
|
||||||
a :exc:`UnicodeError`. Other possible
|
a :exc:`UnicodeError`. Other possible
|
||||||
values are ``'ignore'``, ``'replace'``, ``'xmlcharrefreplace'``,
|
values are ``'ignore'``, ``'replace'``, ``'xmlcharrefreplace'``,
|
||||||
``'backslashreplace'`` and any other name registered via
|
``'backslashreplace'`` and any other name registered via
|
||||||
:func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
|
:func:`codecs.register_error`, see section :ref:`error-handlers`. For a
|
||||||
list of possible encodings, see section :ref:`standard-encodings`.
|
list of possible encodings, see section :ref:`standard-encodings`.
|
||||||
|
|
||||||
.. versionchanged:: 3.1
|
.. versionchanged:: 3.1
|
||||||
|
@ -2384,7 +2384,7 @@ arbitrary binary data.
|
||||||
error handling scheme. The default for *errors* is ``'strict'``, meaning
|
error handling scheme. The default for *errors* is ``'strict'``, meaning
|
||||||
that encoding errors raise a :exc:`UnicodeError`. Other possible values are
|
that encoding errors raise a :exc:`UnicodeError`. Other possible values are
|
||||||
``'ignore'``, ``'replace'`` and any other name registered via
|
``'ignore'``, ``'replace'`` and any other name registered via
|
||||||
:func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
|
:func:`codecs.register_error`, see section :ref:`error-handlers`. For a
|
||||||
list of possible encodings, see section :ref:`standard-encodings`.
|
list of possible encodings, see section :ref:`standard-encodings`.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
|
@ -794,7 +794,7 @@ metadata must be either decoded or encoded. If *encoding* is not set
|
||||||
appropriately, this conversion may fail.
|
appropriately, this conversion may fail.
|
||||||
|
|
||||||
The *errors* argument defines how characters are treated that cannot be
|
The *errors* argument defines how characters are treated that cannot be
|
||||||
converted. Possible values are listed in section :ref:`codec-base-classes`.
|
converted. Possible values are listed in section :ref:`error-handlers`.
|
||||||
The default scheme is ``'surrogateescape'`` which Python also uses for its
|
The default scheme is ``'surrogateescape'`` which Python also uses for its
|
||||||
file system calls, see :ref:`os-filenames`.
|
file system calls, see :ref:`os-filenames`.
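Several hunks above now point at the consolidated ``error-handlers`` section. As a rough illustration of what the standard handler names do (a sketch using only built-in handlers; the sample strings are arbitrary):

```python
text = "caf\u00e9"  # 'café'; the last character is not representable in ASCII

# Encoding-side handlers: each one deals with the offending character
# in a different way instead of raising UnicodeEncodeError.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# 'surrogateescape' smuggles undecodable bytes through str as lone
# surrogates, so the original bytes can be recovered on re-encoding.
raw = b"caf\xe9"  # not valid UTF-8
smuggled = raw.decode("utf-8", "surrogateescape")
recovered = smuggled.encode("utf-8", "surrogateescape")
```

The round trip in the last three lines is why ``'surrogateescape'`` suits file names and archive metadata of unknown encoding.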
|
||||||
|
|
||||||
|
|
|
@ -346,8 +346,7 @@ class StreamWriter(Codec):
|
||||||
|
|
||||||
""" Creates a StreamWriter instance.
|
""" Creates a StreamWriter instance.
|
||||||
|
|
||||||
stream must be a file-like object open for writing
|
stream must be a file-like object open for writing.
|
||||||
(binary) data.
|
|
||||||
|
|
||||||
The StreamWriter may use different error handling
|
The StreamWriter may use different error handling
|
||||||
schemes by providing the errors keyword argument. These
|
schemes by providing the errors keyword argument. These
|
||||||
|
@ -421,8 +420,7 @@ class StreamReader(Codec):
|
||||||
|
|
||||||
""" Creates a StreamReader instance.
|
""" Creates a StreamReader instance.
|
||||||
|
|
||||||
stream must be a file-like object open for reading
|
stream must be a file-like object open for reading.
|
||||||
(binary) data.
|
|
||||||
|
|
||||||
The StreamReader may use different error handling
|
The StreamReader may use different error handling
|
||||||
schemes by providing the errors keyword argument. These
|
schemes by providing the errors keyword argument. These
|
||||||
|
@ -450,13 +448,12 @@ class StreamReader(Codec):
|
||||||
""" Decodes data from the stream self.stream and returns the
|
""" Decodes data from the stream self.stream and returns the
|
||||||
resulting object.
|
resulting object.
|
||||||
|
|
||||||
chars indicates the number of characters to read from the
|
chars indicates the number of decoded code points or bytes to
|
||||||
stream. read() will never return more than chars
|
return. read() will never return more data than requested,
|
||||||
characters, but it might return less, if there are not enough
|
but it might return less, if there is not enough available.
|
||||||
characters available.
|
|
||||||
|
|
||||||
size indicates the approximate maximum number of bytes to
|
size indicates the approximate maximum number of decoded
|
||||||
read from the stream for decoding purposes. The decoder
|
bytes or code points to read for decoding. The decoder
|
||||||
can modify this setting as appropriate. The default value
|
can modify this setting as appropriate. The default value
|
||||||
-1 indicates to read and decode as much as possible. size
|
-1 indicates to read and decode as much as possible. size
|
||||||
is intended to prevent having to decode huge files in one
|
is intended to prevent having to decode huge files in one
|
||||||
|
@ -467,7 +464,7 @@ class StreamReader(Codec):
|
||||||
will be returned, the rest of the input will be kept until the
|
will be returned, the rest of the input will be kept until the
|
||||||
next call to read().
|
next call to read().
|
||||||
|
|
||||||
The method should use a greedy read strategy meaning that
|
The method should use a greedy read strategy, meaning that
|
||||||
it should read as much data as is allowed within the
|
it should read as much data as is allowed within the
|
||||||
definition of the encoding and the given size, e.g. if
|
definition of the encoding and the given size, e.g. if
|
||||||
optional encoding endings or state markers are available
|
optional encoding endings or state markers are available
|
||||||
|
@ -602,7 +599,7 @@ class StreamReader(Codec):
|
||||||
def readlines(self, sizehint=None, keepends=True):
|
def readlines(self, sizehint=None, keepends=True):
|
||||||
|
|
||||||
""" Read all lines available on the input stream
|
""" Read all lines available on the input stream
|
||||||
and return them as list of lines.
|
and return them as a list.
|
||||||
|
|
||||||
Line breaks are implemented using the codec's decoder
|
Line breaks are implemented using the codec's decoder
|
||||||
method and are included in the list entries.
|
method and are included in the list entries.
|
||||||
|
@ -750,19 +747,18 @@ class StreamReaderWriter:
|
||||||
|
|
||||||
class StreamRecoder:
|
class StreamRecoder:
|
||||||
|
|
||||||
""" StreamRecoder instances provide a frontend - backend
|
""" StreamRecoder instances translate data from one encoding to another.
|
||||||
view of encoding data.
|
|
||||||
|
|
||||||
They use the complete set of APIs returned by the
|
They use the complete set of APIs returned by the
|
||||||
codecs.lookup() function to implement their task.
|
codecs.lookup() function to implement their task.
|
||||||
|
|
||||||
Data written to the stream is first decoded into an
|
Data written to the StreamRecoder is first decoded into an
|
||||||
intermediate format (which is dependent on the given codec
|
intermediate format (depending on the "decode" codec) and then
|
||||||
combination) and then written to the stream using an instance
|
written to the underlying stream using an instance of the provided
|
||||||
of the provided Writer class.
|
Writer class.
|
||||||
|
|
||||||
In the other direction, data is read from the stream using a
|
In the other direction, data is read from the underlying stream using
|
||||||
Reader instance and then return encoded data to the caller.
|
a Reader instance and then encoded and returned to the caller.
|
||||||
|
|
||||||
"""
|
"""
|
||||||
# Optional attributes set by the file wrappers below
|
# Optional attributes set by the file wrappers below
|
||||||
|
@ -774,22 +770,17 @@ class StreamRecoder:
|
||||||
|
|
||||||
""" Creates a StreamRecoder instance which implements a two-way
|
""" Creates a StreamRecoder instance which implements a two-way
|
||||||
conversion: encode and decode work on the frontend (the
|
conversion: encode and decode work on the frontend (the
|
||||||
input to .read() and output of .write()) while
|
data visible to .read() and .write()) while Reader and Writer
|
||||||
Reader and Writer work on the backend (reading and
|
work on the backend (the data in stream).
|
||||||
writing to the stream).
|
|
||||||
|
|
||||||
You can use these objects to do transparent direct
|
You can use these objects to do transparent
|
||||||
recodings from e.g. latin-1 to utf-8 and back.
|
transcodings from e.g. latin-1 to utf-8 and back.
|
||||||
|
|
||||||
stream must be a file-like object.
|
stream must be a file-like object.
|
||||||
|
|
||||||
encode, decode must adhere to the Codec interface, Reader,
|
encode and decode must adhere to the Codec interface; Reader and
|
||||||
Writer must be factory functions or classes providing the
|
Writer must be factory functions or classes providing the
|
||||||
StreamReader, StreamWriter interface resp.
|
StreamReader and StreamWriter interfaces resp.
|
||||||
|
|
||||||
encode and decode are needed for the frontend translation,
|
|
||||||
Reader and Writer for the backend translation. Unicode is
|
|
||||||
used as intermediate encoding.
|
|
||||||
|
|
||||||
Error handling is done in the same way as defined for the
|
Error handling is done in the same way as defined for the
|
||||||
StreamWriter/Readers.
|
StreamWriter/Readers.
|
||||||
|
@ -864,7 +855,7 @@ class StreamRecoder:
|
||||||
|
|
||||||
### Shortcuts
|
### Shortcuts
|
||||||
|
|
||||||
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
|
def open(filename, mode='r', encoding=None, errors='strict', buffering=1):
|
||||||
|
|
||||||
""" Open an encoded file using the given mode and return
|
""" Open an encoded file using the given mode and return
|
||||||
a wrapped version providing transparent encoding/decoding.
|
a wrapped version providing transparent encoding/decoding.
|
||||||
|
@ -874,10 +865,8 @@ def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
|
||||||
codecs. Output is also codec dependent and will usually be
|
codecs. Output is also codec dependent and will usually be
|
||||||
Unicode as well.
|
Unicode as well.
|
||||||
|
|
||||||
Files are always opened in binary mode, even if no binary mode
|
Underlying encoded files are always opened in binary mode.
|
||||||
was specified. This is done to avoid data loss due to encodings
|
The default file mode is 'r', meaning to open the file in read mode.
|
||||||
using 8-bit values. The default file mode is 'rb' meaning to
|
|
||||||
open the file in binary read mode.
|
|
||||||
|
|
||||||
encoding specifies the encoding which is to be used for the
|
encoding specifies the encoding which is to be used for the
|
||||||
file.
|
file.
|
||||||
|
@ -913,13 +902,13 @@ def EncodedFile(file, data_encoding, file_encoding=None, errors='strict'):
|
||||||
""" Return a wrapped version of file which provides transparent
|
""" Return a wrapped version of file which provides transparent
|
||||||
encoding translation.
|
encoding translation.
|
||||||
|
|
||||||
Strings written to the wrapped file are interpreted according
|
Data written to the wrapped file is decoded according
|
||||||
to the given data_encoding and then written to the original
|
to the given data_encoding and then encoded to the underlying
|
||||||
file as string using file_encoding. The intermediate encoding
|
file using file_encoding. The intermediate data type
|
||||||
will usually be Unicode but depends on the specified codecs.
|
will usually be Unicode but depends on the specified codecs.
|
||||||
|
|
||||||
Strings are read from the file using file_encoding and then
|
Bytes read from the file are decoded using file_encoding and then
|
||||||
passed back to the caller as string using data_encoding.
|
passed back to the caller encoded using data_encoding.
|
||||||
|
|
||||||
If file_encoding is not given, it defaults to data_encoding.
|
If file_encoding is not given, it defaults to data_encoding.
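The reworded docstring can be read alongside a small sketch of both directions. It uses in-memory ``io.BytesIO`` objects as stand-ins for real files; the variable names are illustrative and not part of the patch.

```python
import codecs
import io

# Writing: the caller supplies latin-1 bytes; the underlying file
# receives the same text re-encoded as UTF-8.
sink = io.BytesIO()
writer = codecs.EncodedFile(sink, "latin-1", "utf-8")
writer.write(b"\xfc")          # 'ü' in latin-1
stored = sink.getvalue()       # 'ü' in UTF-8

# Reading: the underlying file holds UTF-8; the caller gets latin-1 bytes.
source = io.BytesIO(b"\xc3\xbc")
reader = codecs.EncodedFile(source, "latin-1", "utf-8")
fetched = reader.read()
```

In both directions the caller only ever sees bytes in ``data_encoding``; the intermediate ``str`` stays internal to the wrapper.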
|
||||||
|
|
||||||
|
|
|
@ -1139,6 +1139,8 @@ class RecodingTest(unittest.TestCase):
|
||||||
# Python used to crash on this at exit because of a refcount
|
# Python used to crash on this at exit because of a refcount
|
||||||
# bug in _codecsmodule.c
|
# bug in _codecsmodule.c
|
||||||
|
|
||||||
|
self.assertTrue(f.closed)
|
||||||
|
|
||||||
# From RFC 3492
|
# From RFC 3492
|
||||||
punycode_testcases = [
|
punycode_testcases = [
|
||||||
# A Arabic (Egyptian):
|
# A Arabic (Egyptian):
|
||||||
|
@ -1591,6 +1593,16 @@ class IDNACodecTest(unittest.TestCase):
|
||||||
self.assertEqual(encoder.encode("ample.org."), b"xn--xample-9ta.org.")
|
self.assertEqual(encoder.encode("ample.org."), b"xn--xample-9ta.org.")
|
||||||
self.assertEqual(encoder.encode("", True), b"")
|
self.assertEqual(encoder.encode("", True), b"")
|
||||||
|
|
||||||
|
def test_errors(self):
|
||||||
|
"""Only supports "strict" error handler"""
|
||||||
|
"python.org".encode("idna", "strict")
|
||||||
|
b"python.org".decode("idna", "strict")
|
||||||
|
for errors in ("ignore", "replace", "backslashreplace",
|
||||||
|
"surrogateescape"):
|
||||||
|
self.assertRaises(Exception, "python.org".encode, "idna", errors)
|
||||||
|
self.assertRaises(Exception,
|
||||||
|
b"python.org".decode, "idna", errors)
|
||||||
|
|
||||||
class CodecsModuleTest(unittest.TestCase):
|
class CodecsModuleTest(unittest.TestCase):
|
||||||
|
|
||||||
def test_decode(self):
|
def test_decode(self):
|
||||||
|
@ -1668,6 +1680,24 @@ class CodecsModuleTest(unittest.TestCase):
|
||||||
for api in codecs.__all__:
|
for api in codecs.__all__:
|
||||||
getattr(codecs, api)
|
getattr(codecs, api)
|
||||||
|
|
||||||
|
def test_open(self):
|
||||||
|
self.addCleanup(support.unlink, support.TESTFN)
|
||||||
|
for mode in ('w', 'r', 'r+', 'w+', 'a', 'a+'):
|
||||||
|
with self.subTest(mode), \
|
||||||
|
codecs.open(support.TESTFN, mode, 'ascii') as file:
|
||||||
|
self.assertIsInstance(file, codecs.StreamReaderWriter)
|
||||||
|
|
||||||
|
def test_undefined(self):
|
||||||
|
self.assertRaises(UnicodeError, codecs.encode, 'abc', 'undefined')
|
||||||
|
self.assertRaises(UnicodeError, codecs.decode, b'abc', 'undefined')
|
||||||
|
self.assertRaises(UnicodeError, codecs.encode, '', 'undefined')
|
||||||
|
self.assertRaises(UnicodeError, codecs.decode, b'', 'undefined')
|
||||||
|
for errors in ('strict', 'ignore', 'replace', 'backslashreplace'):
|
||||||
|
self.assertRaises(UnicodeError,
|
||||||
|
codecs.encode, 'abc', 'undefined', errors)
|
||||||
|
self.assertRaises(UnicodeError,
|
||||||
|
codecs.decode, b'abc', 'undefined', errors)
|
||||||
|
|
||||||
class StreamReaderTest(unittest.TestCase):
|
class StreamReaderTest(unittest.TestCase):
|
||||||
|
|
||||||
def setUp(self):
|
def setUp(self):
|
||||||
|
@ -1801,13 +1831,10 @@ if hasattr(codecs, "mbcs_encode"):
|
||||||
# "undefined"
|
# "undefined"
|
||||||
|
|
||||||
# The following encodings don't work in stateful mode
|
# The following encodings don't work in stateful mode
|
||||||
broken_unicode_with_streams = [
|
broken_unicode_with_stateful = [
|
||||||
"punycode",
|
"punycode",
|
||||||
"unicode_internal"
|
"unicode_internal"
|
||||||
]
|
]
|
||||||
broken_incremental_coders = broken_unicode_with_streams + [
|
|
||||||
"idna",
|
|
||||||
]
|
|
||||||
|
|
||||||
class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||||
def test_basics(self):
|
def test_basics(self):
|
||||||
|
@ -1827,7 +1854,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||||
(chars, size) = codecs.getdecoder(encoding)(b)
|
(chars, size) = codecs.getdecoder(encoding)(b)
|
||||||
self.assertEqual(chars, s, "encoding=%r" % encoding)
|
self.assertEqual(chars, s, "encoding=%r" % encoding)
|
||||||
|
|
||||||
if encoding not in broken_unicode_with_streams:
|
if encoding not in broken_unicode_with_stateful:
|
||||||
# check stream reader/writer
|
# check stream reader/writer
|
||||||
q = Queue(b"")
|
q = Queue(b"")
|
||||||
writer = codecs.getwriter(encoding)(q)
|
writer = codecs.getwriter(encoding)(q)
|
||||||
|
@ -1845,7 +1872,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||||
decodedresult += reader.read()
|
decodedresult += reader.read()
|
||||||
self.assertEqual(decodedresult, s, "encoding=%r" % encoding)
|
self.assertEqual(decodedresult, s, "encoding=%r" % encoding)
|
||||||
|
|
||||||
if encoding not in broken_incremental_coders:
|
if encoding not in broken_unicode_with_stateful:
|
||||||
# check incremental decoder/encoder and iterencode()/iterdecode()
|
# check incremental decoder/encoder and iterencode()/iterdecode()
|
||||||
try:
|
try:
|
||||||
encoder = codecs.getincrementalencoder(encoding)()
|
encoder = codecs.getincrementalencoder(encoding)()
|
||||||
|
@ -1894,7 +1921,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||||
from _testcapi import codec_incrementalencoder, codec_incrementaldecoder
|
from _testcapi import codec_incrementalencoder, codec_incrementaldecoder
|
||||||
s = "abc123" # all codecs should be able to encode these
|
s = "abc123" # all codecs should be able to encode these
|
||||||
for encoding in all_unicode_encodings:
|
for encoding in all_unicode_encodings:
|
||||||
if encoding not in broken_incremental_coders:
|
if encoding not in broken_unicode_with_stateful:
|
||||||
# check incremental decoder/encoder (fetched via the C API)
|
# check incremental decoder/encoder (fetched via the C API)
|
||||||
try:
|
try:
|
||||||
cencoder = codec_incrementalencoder(encoding)
|
cencoder = codec_incrementalencoder(encoding)
|
||||||
|
@ -1934,7 +1961,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||||
for encoding in all_unicode_encodings:
|
for encoding in all_unicode_encodings:
|
||||||
if encoding == "idna": # FIXME: See SF bug #1163178
|
if encoding == "idna": # FIXME: See SF bug #1163178
|
||||||
continue
|
continue
|
||||||
if encoding in broken_unicode_with_streams:
|
if encoding in broken_unicode_with_stateful:
|
||||||
continue
|
continue
|
||||||
reader = codecs.getreader(encoding)(io.BytesIO(s.encode(encoding)))
|
reader = codecs.getreader(encoding)(io.BytesIO(s.encode(encoding)))
|
||||||
for t in range(5):
|
for t in range(5):
|
||||||
|
@ -1967,7 +1994,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||||
# Check that getstate() and setstate() handle the state properly
|
# Check that getstate() and setstate() handle the state properly
|
||||||
u = "abc123"
|
u = "abc123"
|
||||||
for encoding in all_unicode_encodings:
|
for encoding in all_unicode_encodings:
|
||||||
if encoding not in broken_incremental_coders:
|
if encoding not in broken_unicode_with_stateful:
|
||||||
self.check_state_handling_decode(encoding, u, u.encode(encoding))
|
self.check_state_handling_decode(encoding, u, u.encode(encoding))
|
||||||
self.check_state_handling_encode(encoding, u, u.encode(encoding))
|
self.check_state_handling_encode(encoding, u, u.encode(encoding))
|
||||||
|
|
||||||
|
@ -2171,6 +2198,7 @@ class WithStmtTest(unittest.TestCase):
|
||||||
f = io.BytesIO(b"\xc3\xbc")
|
f = io.BytesIO(b"\xc3\xbc")
|
||||||
with codecs.EncodedFile(f, "latin-1", "utf-8") as ef:
|
with codecs.EncodedFile(f, "latin-1", "utf-8") as ef:
|
||||||
self.assertEqual(ef.read(), b"\xfc")
|
self.assertEqual(ef.read(), b"\xfc")
|
||||||
|
self.assertTrue(f.closed)
|
||||||
|
|
||||||
def test_streamreaderwriter(self):
|
def test_streamreaderwriter(self):
|
||||||
f = io.BytesIO(b"\xc3\xbc")
|
f = io.BytesIO(b"\xc3\xbc")
|
||||||
|
|
|
@ -265,6 +265,10 @@ IDLE
|
||||||
Tests
|
Tests
|
||||||
-----
|
-----
|
||||||
|
|
||||||
|
- Issue #19548: Added some additional checks to test_codecs to ensure that
|
||||||
|
statements in the updated documentation remain accurate. Patch by Martin
|
||||||
|
Panter.
|
||||||
|
|
||||||
- Issue #22838: All test_re tests now work with unittest test discovery.
|
- Issue #22838: All test_re tests now work with unittest test discovery.
|
||||||
|
|
||||||
- Issue #22173: Update lib2to3 tests to use unittest test discovery.
|
- Issue #22173: Update lib2to3 tests to use unittest test discovery.
|
||||||
|
@ -297,6 +301,10 @@ Build
|
||||||
Documentation
|
Documentation
|
||||||
-------------
|
-------------
|
||||||
|
|
||||||
|
- Issue #19548: Update the codecs module documentation to better cover the
|
||||||
|
distinction between text encodings and other codecs, together with other
|
||||||
|
clarifications. Patch by Martin Panter.
|
||||||
|
|
||||||
- Issue #22914: Update the Python 2/3 porting HOWTO to describe a more automated
|
- Issue #22914: Update the Python 2/3 porting HOWTO to describe a more automated
|
||||||
approach.
|
approach.
|
||||||
|
|
||||||
|
|
|
@ -54,9 +54,9 @@ PyDoc_STRVAR(register__doc__,
|
||||||
"register(search_function)\n\
|
"register(search_function)\n\
|
||||||
\n\
|
\n\
|
||||||
Register a codec search function. Search functions are expected to take\n\
|
Register a codec search function. Search functions are expected to take\n\
|
||||||
one argument, the encoding name in all lower case letters, and return\n\
|
one argument, the encoding name in all lower case letters, and either\n\
|
||||||
a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\
|
return None, or a tuple of functions (encoder, decoder, stream_reader,\n\
|
||||||
(or a CodecInfo object).");
|
stream_writer) (or a CodecInfo object).");
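The updated docstring spells out that a search function may return ``None``. A minimal sketch of such a function follows; the codec name ``demoalias`` and the function ``demo_search`` are hypothetical, invented for illustration, and the codec simply reuses the UTF-8 implementation.

```python
import codecs

def demo_search(name):
    """Hypothetical search function that knows a single made-up name."""
    if name != "demoalias":
        # Returning None tells the registry to try the next search function.
        return None
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        encode=utf8.encode,
        decode=utf8.decode,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        name="demoalias",
    )

codecs.register(demo_search)

encoded = "\u00fc".encode("demoalias")
decoded = encoded.decode("demoalias")

# A name that no search function recognises ends in LookupError.
try:
    codecs.lookup("nosuchcodecxyz")
    unknown_found = True
except LookupError:
    unknown_found = False
```

Note that registration is global for the process, so a real search function should match names narrowly.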
|
||||||
|
|
||||||
static
|
static
|
||||||
PyObject *codec_register(PyObject *self, PyObject *search_function)
|
PyObject *codec_register(PyObject *self, PyObject *search_function)
|
||||||
|
|