mirror of https://github.com/python/cpython
Merge issue 19548 changes from 3.4
commit
582acb75e9
|
@ -834,10 +834,13 @@ Glossary
|
|||
:meth:`~collections.somenamedtuple._asdict`. Examples of struct sequences
|
||||
include :data:`sys.float_info` and the return value of :func:`os.stat`.
|
||||
|
||||
text encoding
|
||||
A codec which encodes Unicode strings to bytes.
|
||||
|
||||
text file
|
||||
A :term:`file object` able to read and write :class:`str` objects.
|
||||
Often, a text file actually accesses a byte-oriented datastream
|
||||
and handles the text encoding automatically.
|
||||
and handles the :term:`text encoding` automatically.
|
||||
|
||||
.. seealso::
|
||||
A :term:`binary file` reads and writes :class:`bytes` objects.
|
||||
|
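The glossary entries in this hunk describe a text file as a layer that applies a text encoding on top of a byte-oriented stream. A minimal sketch of that relationship, assuming an in-memory `io.BytesIO` as the underlying stream and UTF-8 as the text encoding (both chosen only for illustration):

```python
import io

# An in-memory binary file stands in for the byte-oriented datastream.
raw = io.BytesIO()

# The text file wraps it and applies the text encoding on every write.
text = io.TextIOWrapper(raw, encoding="utf-8")
text.write("naïve")
text.flush()  # push buffered characters down to the binary layer

encoded = raw.getvalue()
print(encoded)  # bytes, as seen by the binary file

# Reading the same bytes through a fresh wrapper decodes them to str again.
decoded = io.TextIOWrapper(io.BytesIO(encoded), encoding="utf-8").read()
print(decoded)
```

The caller of `text` only ever handles `str` objects; the `bytes` exist solely in `raw`.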
|
File diff suppressed because it is too large
|
@ -940,15 +940,17 @@ are always available. They are listed here in alphabetical order.
|
|||
*encoding* is the name of the encoding used to decode or encode the file.
|
||||
This should only be used in text mode. The default encoding is platform
|
||||
dependent (whatever :func:`locale.getpreferredencoding` returns), but any
|
||||
encoding supported by Python can be used. See the :mod:`codecs` module for
|
||||
:term:`text encoding` supported by Python
|
||||
can be used. See the :mod:`codecs` module for
|
||||
the list of supported encodings.
|
||||
|
||||
*errors* is an optional string that specifies how encoding and decoding
|
||||
errors are to be handled--this cannot be used in binary mode.
|
||||
A variety of standard error handlers are available, though any
|
||||
A variety of standard error handlers are available
|
||||
(listed under :ref:`error-handlers`), though any
|
||||
error handling name that has been registered with
|
||||
:func:`codecs.register_error` is also valid. The standard names
|
||||
are:
|
||||
include:
|
||||
|
||||
* ``'strict'`` to raise a :exc:`ValueError` exception if there is
|
||||
an encoding error. The default value of ``None`` has the same
|
||||
|
|
|
@ -1512,7 +1512,7 @@ expression support in the :mod:`re` module).
|
|||
a :exc:`UnicodeError`. Other possible
|
||||
values are ``'ignore'``, ``'replace'``, ``'xmlcharrefreplace'``,
|
||||
``'backslashreplace'`` and any other name registered via
|
||||
:func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
|
||||
:func:`codecs.register_error`, see section :ref:`error-handlers`. For a
|
||||
list of possible encodings, see section :ref:`standard-encodings`.
|
||||
|
||||
.. versionchanged:: 3.1
|
||||
|
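The hunks above point readers at the error handler names accepted by `str.encode()`. As a rough illustration of how the standard names differ, the sketch below encodes one string to ASCII under each handler. The handler name `demo-underscore` and the function `underscore_handler` are hypothetical and exist only for this example.

```python
import codecs

text = "café"

# 'strict' is the default: an unencodable character raises UnicodeError.
strict_failed = False
try:
    text.encode("ascii")
except UnicodeError:
    strict_failed = True

ignored = text.encode("ascii", "ignore")             # drops the character
replaced = text.encode("ascii", "replace")           # substitutes '?'
xml = text.encode("ascii", "xmlcharrefreplace")      # numeric character reference
escaped = text.encode("ascii", "backslashreplace")   # backslash escape


def underscore_handler(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement text and the position at which to resume.
        return ("_" * (exc.end - exc.start), exc.end)
    raise exc


# Any name registered this way becomes valid for the *errors* argument.
codecs.register_error("demo-underscore", underscore_handler)
custom = text.encode("ascii", "demo-underscore")

print(ignored, replaced, xml, escaped, custom)
```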
@ -2384,7 +2384,7 @@ arbitrary binary data.
|
|||
error handling scheme. The default for *errors* is ``'strict'``, meaning
|
||||
that encoding errors raise a :exc:`UnicodeError`. Other possible values are
|
||||
``'ignore'``, ``'replace'`` and any other name registered via
|
||||
:func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
|
||||
:func:`codecs.register_error`, see section :ref:`error-handlers`. For a
|
||||
list of possible encodings, see section :ref:`standard-encodings`.
|
||||
|
||||
.. note::
|
||||
|
|
|
@ -798,7 +798,7 @@ metadata must be either decoded or encoded. If *encoding* is not set
|
|||
appropriately, this conversion may fail.
|
||||
|
||||
The *errors* argument defines how characters are treated that cannot be
|
||||
converted. Possible values are listed in section :ref:`codec-base-classes`.
|
||||
converted. Possible values are listed in section :ref:`error-handlers`.
|
||||
The default scheme is ``'surrogateescape'`` which Python also uses for its
|
||||
file system calls, see :ref:`os-filenames`.
|
||||
|
||||
|
|
|
@ -347,8 +347,7 @@ class StreamWriter(Codec):
|
|||
|
||||
""" Creates a StreamWriter instance.
|
||||
|
||||
stream must be a file-like object open for writing
|
||||
(binary) data.
|
||||
stream must be a file-like object open for writing.
|
||||
|
||||
The StreamWriter may use different error handling
|
||||
schemes by providing the errors keyword argument. These
|
||||
|
@ -422,8 +421,7 @@ class StreamReader(Codec):
|
|||
|
||||
""" Creates a StreamReader instance.
|
||||
|
||||
stream must be a file-like object open for reading
|
||||
(binary) data.
|
||||
stream must be a file-like object open for reading.
|
||||
|
||||
The StreamReader may use different error handling
|
||||
schemes by providing the errors keyword argument. These
|
||||
|
@ -451,13 +449,12 @@ class StreamReader(Codec):
|
|||
""" Decodes data from the stream self.stream and returns the
|
||||
resulting object.
|
||||
|
||||
chars indicates the number of characters to read from the
|
||||
stream. read() will never return more than chars
|
||||
characters, but it might return less, if there are not enough
|
||||
characters available.
|
||||
chars indicates the number of decoded code points or bytes to
|
||||
return. read() will never return more data than requested,
|
||||
but it might return less, if there is not enough available.
|
||||
|
||||
size indicates the approximate maximum number of bytes to
|
||||
read from the stream for decoding purposes. The decoder
|
||||
size indicates the approximate maximum number of decoded
|
||||
bytes or code points to read for decoding. The decoder
|
||||
can modify this setting as appropriate. The default value
|
||||
-1 indicates to read and decode as much as possible. size
|
||||
is intended to prevent having to decode huge files in one
|
||||
|
@ -468,7 +465,7 @@ class StreamReader(Codec):
|
|||
will be returned, the rest of the input will be kept until the
|
||||
next call to read().
|
||||
|
||||
The method should use a greedy read strategy meaning that
|
||||
The method should use a greedy read strategy, meaning that
|
||||
it should read as much data as is allowed within the
|
||||
definition of the encoding and the given size, e.g. if
|
||||
optional encoding endings or state markers are available
|
||||
|
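The revised `read()` docstring separates *chars* (an upper bound on what is returned) from *size* (an approximate bound on what is read for decoding). A small sketch of the *chars* behaviour, assuming a UTF-8 `StreamReader` over an in-memory stream; the sample string is arbitrary:

```python
import codecs
import io

payload = "héllo wörld".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(payload))

# Ask for at most three decoded code points. The reader may consume more
# bytes than that from the stream; the surplus is kept in its buffer.
first = reader.read(chars=3)

# With no limits, read() returns everything that is still available,
# starting with the buffered remainder.
rest = reader.read()

print(repr(first), repr(rest))
```

Note that the limit counts decoded code points, not bytes: the three characters returned here occupy four bytes in the stream.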
@ -603,7 +600,7 @@ class StreamReader(Codec):
|
|||
def readlines(self, sizehint=None, keepends=True):
|
||||
|
||||
""" Read all lines available on the input stream
|
||||
and return them as list of lines.
|
||||
and return them as a list.
|
||||
|
||||
Line breaks are implemented using the codec's decoder
|
||||
method and are included in the list entries.
|
||||
|
@ -751,19 +748,18 @@ class StreamReaderWriter:
|
|||
|
||||
class StreamRecoder:
|
||||
|
||||
""" StreamRecoder instances provide a frontend - backend
|
||||
view of encoding data.
|
||||
""" StreamRecoder instances translate data from one encoding to another.
|
||||
|
||||
They use the complete set of APIs returned by the
|
||||
codecs.lookup() function to implement their task.
|
||||
|
||||
Data written to the stream is first decoded into an
|
||||
intermediate format (which is dependent on the given codec
|
||||
combination) and then written to the stream using an instance
|
||||
of the provided Writer class.
|
||||
Data written to the StreamRecoder is first decoded into an
|
||||
intermediate format (depending on the "decode" codec) and then
|
||||
written to the underlying stream using an instance of the provided
|
||||
Writer class.
|
||||
|
||||
In the other direction, data is read from the stream using a
|
||||
Reader instance and then return encoded data to the caller.
|
||||
In the other direction, data is read from the underlying stream using
|
||||
a Reader instance and then encoded and returned to the caller.
|
||||
|
||||
"""
|
||||
# Optional attributes set by the file wrappers below
|
||||
|
@ -775,22 +771,17 @@ class StreamRecoder:
|
|||
|
||||
""" Creates a StreamRecoder instance which implements a two-way
|
||||
conversion: encode and decode work on the frontend (the
|
||||
input to .read() and output of .write()) while
|
||||
Reader and Writer work on the backend (reading and
|
||||
writing to the stream).
|
||||
data visible to .read() and .write()) while Reader and Writer
|
||||
work on the backend (the data in stream).
|
||||
|
||||
You can use these objects to do transparent direct
|
||||
recodings from e.g. latin-1 to utf-8 and back.
|
||||
You can use these objects to do transparent
|
||||
transcodings from e.g. latin-1 to utf-8 and back.
|
||||
|
||||
stream must be a file-like object.
|
||||
|
||||
encode, decode must adhere to the Codec interface, Reader,
|
||||
encode and decode must adhere to the Codec interface; Reader and
|
||||
Writer must be factory functions or classes providing the
|
||||
StreamReader, StreamWriter interface resp.
|
||||
|
||||
encode and decode are needed for the frontend translation,
|
||||
Reader and Writer for the backend translation. Unicode is
|
||||
used as intermediate encoding.
|
||||
StreamReader and StreamWriter interfaces resp.
|
||||
|
||||
Error handling is done in the same way as defined for the
|
||||
StreamWriter/Readers.
|
||||
|
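The frontend/backend wording in the `StreamRecoder` docstrings can be made concrete by wiring one up by hand. This is only a sketch, assuming latin-1 on the frontend and UTF-8 on the backend, with the pieces taken from `codecs.lookup()`:

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend: what the caller sees
utf8 = codecs.lookup("utf-8")       # backend: what the stream holds

# The underlying stream contains UTF-8 encoded data.
backend = io.BytesIO("ü".encode("utf-8"))

recoder = codecs.StreamRecoder(
    backend,
    latin1.encode, latin1.decode,         # encode/decode work on the frontend
    utf8.streamreader, utf8.streamwriter  # Reader/Writer work on the backend
)

# Reading: UTF-8 bytes are decoded by the Reader, then encoded to latin-1.
front_bytes = recoder.read()

# Writing: latin-1 bytes are decoded, then written as UTF-8 by the Writer.
recoder.write(b"\xe9")  # latin-1 for 'é'

print(front_bytes, backend.getvalue())
```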
@ -865,7 +856,7 @@ class StreamRecoder:
|
|||
|
||||
### Shortcuts
|
||||
|
||||
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
|
||||
def open(filename, mode='r', encoding=None, errors='strict', buffering=1):
|
||||
|
||||
""" Open an encoded file using the given mode and return
|
||||
a wrapped version providing transparent encoding/decoding.
|
||||
|
@ -875,10 +866,8 @@ def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
|
|||
codecs. Output is also codec dependent and will usually be
|
||||
Unicode as well.
|
||||
|
||||
Files are always opened in binary mode, even if no binary mode
|
||||
was specified. This is done to avoid data loss due to encodings
|
||||
using 8-bit values. The default file mode is 'rb' meaning to
|
||||
open the file in binary read mode.
|
||||
Underlying encoded files are always opened in binary mode.
|
||||
The default file mode is 'r', meaning to open the file in read mode.
|
||||
|
||||
encoding specifies the encoding which is to be used for the
|
||||
file.
|
||||
|
@ -914,13 +903,13 @@ def EncodedFile(file, data_encoding, file_encoding=None, errors='strict'):
|
|||
""" Return a wrapped version of file which provides transparent
|
||||
encoding translation.
|
||||
|
||||
Strings written to the wrapped file are interpreted according
|
||||
to the given data_encoding and then written to the original
|
||||
file as string using file_encoding. The intermediate encoding
|
||||
Data written to the wrapped file is decoded according
|
||||
to the given data_encoding and then encoded to the underlying
|
||||
file using file_encoding. The intermediate data type
|
||||
will usually be Unicode but depends on the specified codecs.
|
||||
|
||||
Strings are read from the file using file_encoding and then
|
||||
passed back to the caller as string using data_encoding.
|
||||
Bytes read from the file are decoded using file_encoding and then
|
||||
passed back to the caller encoded using data_encoding.
|
||||
|
||||
If file_encoding is not given, it defaults to data_encoding.
|
||||
|
||||
|
|
|
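The reworded `EncodedFile` docstring describes the write path as "decode with data_encoding, then encode with file_encoding". A minimal sketch of that path, assuming latin-1 data and a UTF-8 file, plus the default case where *file_encoding* is omitted:

```python
import codecs
import io

# Write path: latin-1 bytes in, UTF-8 bytes stored.
underlying = io.BytesIO()
wrapped = codecs.EncodedFile(underlying, "latin-1", "utf-8")
wrapped.write(b"\xfc")            # latin-1 for 'ü'
stored = underlying.getvalue()

# When file_encoding is not given it defaults to data_encoding,
# so the bytes round-trip unchanged.
same = io.BytesIO()
codecs.EncodedFile(same, "latin-1").write(b"\xfc")
passthrough = same.getvalue()

print(stored, passthrough)
```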
@ -1140,6 +1140,8 @@ class RecodingTest(unittest.TestCase):
|
|||
# Python used to crash on this at exit because of a refcount
|
||||
# bug in _codecsmodule.c
|
||||
|
||||
self.assertTrue(f.closed)
|
||||
|
||||
# From RFC 3492
|
||||
punycode_testcases = [
|
||||
# A Arabic (Egyptian):
|
||||
|
@ -1592,6 +1594,16 @@ class IDNACodecTest(unittest.TestCase):
|
|||
self.assertEqual(encoder.encode("ample.org."), b"xn--xample-9ta.org.")
|
||||
self.assertEqual(encoder.encode("", True), b"")
|
||||
|
||||
def test_errors(self):
|
||||
"""Only supports "strict" error handler"""
|
||||
"python.org".encode("idna", "strict")
|
||||
b"python.org".decode("idna", "strict")
|
||||
for errors in ("ignore", "replace", "backslashreplace",
|
||||
"surrogateescape"):
|
||||
self.assertRaises(Exception, "python.org".encode, "idna", errors)
|
||||
self.assertRaises(Exception,
|
||||
b"python.org".decode, "idna", errors)
|
||||
|
||||
class CodecsModuleTest(unittest.TestCase):
|
||||
|
||||
def test_decode(self):
|
||||
|
@ -1682,6 +1694,24 @@ class CodecsModuleTest(unittest.TestCase):
|
|||
for api in codecs.__all__:
|
||||
getattr(codecs, api)
|
||||
|
||||
def test_open(self):
|
||||
self.addCleanup(support.unlink, support.TESTFN)
|
||||
for mode in ('w', 'r', 'r+', 'w+', 'a', 'a+'):
|
||||
with self.subTest(mode), \
|
||||
codecs.open(support.TESTFN, mode, 'ascii') as file:
|
||||
self.assertIsInstance(file, codecs.StreamReaderWriter)
|
||||
|
||||
def test_undefined(self):
|
||||
self.assertRaises(UnicodeError, codecs.encode, 'abc', 'undefined')
|
||||
self.assertRaises(UnicodeError, codecs.decode, b'abc', 'undefined')
|
||||
self.assertRaises(UnicodeError, codecs.encode, '', 'undefined')
|
||||
self.assertRaises(UnicodeError, codecs.decode, b'', 'undefined')
|
||||
for errors in ('strict', 'ignore', 'replace', 'backslashreplace'):
|
||||
self.assertRaises(UnicodeError,
|
||||
codecs.encode, 'abc', 'undefined', errors)
|
||||
self.assertRaises(UnicodeError,
|
||||
codecs.decode, b'abc', 'undefined', errors)
|
||||
|
||||
class StreamReaderTest(unittest.TestCase):
|
||||
|
||||
def setUp(self):
|
||||
|
@ -1815,13 +1845,10 @@ if hasattr(codecs, "mbcs_encode"):
|
|||
# "undefined"
|
||||
|
||||
# The following encodings don't work in stateful mode
|
||||
broken_unicode_with_streams = [
|
||||
broken_unicode_with_stateful = [
|
||||
"punycode",
|
||||
"unicode_internal"
|
||||
]
|
||||
broken_incremental_coders = broken_unicode_with_streams + [
|
||||
"idna",
|
||||
]
|
||||
|
||||
class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
||||
def test_basics(self):
|
||||
|
@ -1841,7 +1868,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
|||
(chars, size) = codecs.getdecoder(encoding)(b)
|
||||
self.assertEqual(chars, s, "encoding=%r" % encoding)
|
||||
|
||||
if encoding not in broken_unicode_with_streams:
|
||||
if encoding not in broken_unicode_with_stateful:
|
||||
# check stream reader/writer
|
||||
q = Queue(b"")
|
||||
writer = codecs.getwriter(encoding)(q)
|
||||
|
@ -1859,7 +1886,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
|||
decodedresult += reader.read()
|
||||
self.assertEqual(decodedresult, s, "encoding=%r" % encoding)
|
||||
|
||||
if encoding not in broken_incremental_coders:
|
||||
if encoding not in broken_unicode_with_stateful:
|
||||
# check incremental decoder/encoder and iterencode()/iterdecode()
|
||||
try:
|
||||
encoder = codecs.getincrementalencoder(encoding)()
|
||||
|
@ -1908,7 +1935,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
|||
from _testcapi import codec_incrementalencoder, codec_incrementaldecoder
|
||||
s = "abc123" # all codecs should be able to encode these
|
||||
for encoding in all_unicode_encodings:
|
||||
if encoding not in broken_incremental_coders:
|
||||
if encoding not in broken_unicode_with_stateful:
|
||||
# check incremental decoder/encoder (fetched via the C API)
|
||||
try:
|
||||
cencoder = codec_incrementalencoder(encoding)
|
||||
|
@ -1948,7 +1975,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
|||
for encoding in all_unicode_encodings:
|
||||
if encoding == "idna": # FIXME: See SF bug #1163178
|
||||
continue
|
||||
if encoding in broken_unicode_with_streams:
|
||||
if encoding in broken_unicode_with_stateful:
|
||||
continue
|
||||
reader = codecs.getreader(encoding)(io.BytesIO(s.encode(encoding)))
|
||||
for t in range(5):
|
||||
|
@ -1981,7 +2008,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
|
|||
# Check that getstate() and setstate() handle the state properly
|
||||
u = "abc123"
|
||||
for encoding in all_unicode_encodings:
|
||||
if encoding not in broken_incremental_coders:
|
||||
if encoding not in broken_unicode_with_stateful:
|
||||
self.check_state_handling_decode(encoding, u, u.encode(encoding))
|
||||
self.check_state_handling_encode(encoding, u, u.encode(encoding))
|
||||
|
||||
|
@ -2185,6 +2212,7 @@ class WithStmtTest(unittest.TestCase):
|
|||
f = io.BytesIO(b"\xc3\xbc")
|
||||
with codecs.EncodedFile(f, "latin-1", "utf-8") as ef:
|
||||
self.assertEqual(ef.read(), b"\xfc")
|
||||
self.assertTrue(f.closed)
|
||||
|
||||
def test_streamreaderwriter(self):
|
||||
f = io.BytesIO(b"\xc3\xbc")
|
||||
|
|
|
@ -1441,6 +1441,10 @@ C API
|
|||
Documentation
|
||||
-------------
|
||||
|
||||
- Issue #19548: Update the codecs module documentation to better cover the
|
||||
distinction between text encodings and other codecs, together with other
|
||||
clarifications. Patch by Martin Panter.
|
||||
|
||||
- Issue #22394: Doc/Makefile now supports ``make venv PYTHON=../python`` to
|
||||
create a venv for generating the documentation, e.g.,
|
||||
``make html PYTHON=venv/bin/python3``.
|
||||
|
@ -1477,6 +1481,10 @@ Documentation
|
|||
Tests
|
||||
-----
|
||||
|
||||
- Issue #19548: Added some additional checks to test_codecs to ensure that
|
||||
statements in the updated documentation remain accurate. Patch by Martin
|
||||
Panter.
|
||||
|
||||
- Issue #22838: All test_re tests now work with unittest test discovery.
|
||||
|
||||
- Issue #22173: Update lib2to3 tests to use unittest test discovery.
|
||||
|
|
|
@ -54,9 +54,9 @@ PyDoc_STRVAR(register__doc__,
|
|||
"register(search_function)\n\
|
||||
\n\
|
||||
Register a codec search function. Search functions are expected to take\n\
|
||||
one argument, the encoding name in all lower case letters, and return\n\
|
||||
a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\
|
||||
(or a CodecInfo object).");
|
||||
one argument, the encoding name in all lower case letters, and either\n\
|
||||
return None, or a tuple of functions (encoder, decoder, stream_reader,\n\
|
||||
stream_writer) (or a CodecInfo object).");
|
||||
|
||||
static
|
||||
PyObject *codec_register(PyObject *self, PyObject *search_function)
|
||||
|
|
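The updated `register()` docstring adds that a search function may return None for names it does not recognise. A sketch of such a function follows; the codec name `demoalias` and the function `demo_search` are hypothetical, and the example simply reuses the built-in UTF-8 `CodecInfo` rather than defining a new codec.

```python
import codecs


def demo_search(name):
    """Hypothetical search function: alias 'demoalias' to UTF-8."""
    # The name arrives already lower-cased by the codec machinery.
    if name == "demoalias":
        return codecs.lookup("utf-8")   # a CodecInfo object
    return None                          # not ours; let other functions try


codecs.register(demo_search)

encoded = "é".encode("demoalias")
found_name = codecs.lookup("demoalias").name

# A name that no search function recognises ends in LookupError.
unknown_failed = False
try:
    codecs.lookup("no_such_codec_demo")
except LookupError:
    unknown_failed = True

print(encoded, found_name, unknown_failed)
```

Registration is process-wide, so a real search function should match only the names it actually owns.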