Fall back to 'ascii' encoding if sys.getfilesystemencoding() returns
None. Remove encoding and errors argument from pax create methods in TarInfo, pax always uses UTF-8. Adapt the documentation and tests to the new string/unicode concept.
This commit is contained in:
parent
4566c71e0e
commit
3741effcf8
|
@ -272,13 +272,14 @@ object, see :ref:`tarinfo-objects` for details.
|
||||||
:exc:`IOError` exceptions. If ``2``, all *non-fatal* errors are raised as
|
:exc:`IOError` exceptions. If ``2``, all *non-fatal* errors are raised as
|
||||||
:exc:`TarError` exceptions as well.
|
:exc:`TarError` exceptions as well.
|
||||||
|
|
||||||
The *encoding* and *errors* arguments control the way strings are converted to
|
The *encoding* and *errors* arguments define the character encoding to be
|
||||||
unicode objects and vice versa. The default settings will work for most users.
|
used for reading or writing the archive and how conversion errors are going
|
||||||
|
to be handled. The default settings will work for most users.
|
||||||
See section :ref:`tar-unicode` for in-depth information.
|
See section :ref:`tar-unicode` for in-depth information.
|
||||||
|
|
||||||
.. versionadded:: 2.6
|
.. versionadded:: 2.6
|
||||||
|
|
||||||
The *pax_headers* argument is an optional dictionary of unicode strings which
|
The *pax_headers* argument is an optional dictionary of strings which
|
||||||
will be added as a pax global header if *format* is :const:`PAX_FORMAT`.
|
will be added as a pax global header if *format* is :const:`PAX_FORMAT`.
|
||||||
|
|
||||||
.. versionadded:: 2.6
|
.. versionadded:: 2.6
|
||||||
|
@ -703,36 +704,30 @@ Unicode issues
|
||||||
The tar format was originally conceived to make backups on tape drives with the
|
The tar format was originally conceived to make backups on tape drives with the
|
||||||
main focus on preserving file system information. Nowadays tar archives are
|
main focus on preserving file system information. Nowadays tar archives are
|
||||||
commonly used for file distribution and exchanging archives over networks. One
|
commonly used for file distribution and exchanging archives over networks. One
|
||||||
problem of the original format (that all other formats are merely variants of)
|
problem of the original format (which is the basis of all other formats) is
|
||||||
is that there is no concept of supporting different character encodings. For
|
that there is no concept of supporting different character encodings. For
|
||||||
example, an ordinary tar archive created on a *UTF-8* system cannot be read
|
example, an ordinary tar archive created on a *UTF-8* system cannot be read
|
||||||
correctly on a *Latin-1* system if it contains non-ASCII characters. Names (i.e.
|
correctly on a *Latin-1* system if it contains non-*ASCII* characters. Textual
|
||||||
filenames, linknames, user/group names) containing these characters will appear
|
metadata (like filenames, linknames, user/group names) will appear damaged.
|
||||||
damaged. Unfortunately, there is no way to autodetect the encoding of an
|
Unfortunately, there is no way to autodetect the encoding of an archive. The
|
||||||
archive.
|
pax format was designed to solve this problem. It stores non-ASCII metadata
|
||||||
|
using the universal character encoding *UTF-8*.
|
||||||
|
|
||||||
The pax format was designed to solve this problem. It stores non-ASCII names
|
The details of character conversion in :mod:`tarfile` are controlled by the
|
||||||
using the universal character encoding *UTF-8*. When a pax archive is read,
|
*encoding* and *errors* keyword arguments of the :class:`TarFile` class.
|
||||||
these *UTF-8* names are converted to the encoding of the local file system.
|
|
||||||
|
|
||||||
The details of unicode conversion are controlled by the *encoding* and *errors*
|
*encoding* defines the character encoding to use for the metadata in the
|
||||||
keyword arguments of the :class:`TarFile` class.
|
archive. The default value is :func:`sys.getfilesystemencoding` or ``'ascii'``
|
||||||
|
as a fallback. Depending on whether the archive is read or written, the
|
||||||
The default value for *encoding* is the local character encoding. It is deduced
|
metadata must be either decoded or encoded. If *encoding* is not set
|
||||||
from :func:`sys.getfilesystemencoding` and :func:`sys.getdefaultencoding`. In
|
appropriately, this conversion may fail.
|
||||||
read mode, *encoding* is used exclusively to convert unicode names from a pax
|
|
||||||
archive to strings in the local character encoding. In write mode, the use of
|
|
||||||
*encoding* depends on the chosen archive format. In case of :const:`PAX_FORMAT`,
|
|
||||||
input names that contain non-ASCII characters need to be decoded before being
|
|
||||||
stored as *UTF-8* strings. The other formats do not make use of *encoding*
|
|
||||||
unless unicode objects are used as input names. These are converted to 8-bit
|
|
||||||
character strings before they are added to the archive.
|
|
||||||
|
|
||||||
The *errors* argument defines how characters are treated that cannot be
|
The *errors* argument defines how characters are treated that cannot be
|
||||||
converted to or from *encoding*. Possible values are listed in section
|
converted. Possible values are listed in section :ref:`codec-base-classes`. In
|
||||||
:ref:`codec-base-classes`. In read mode, there is an additional scheme
|
read mode the default scheme is ``'replace'``. This avoids unexpected
|
||||||
``'utf-8'`` which means that bad characters are replaced by their *UTF-8*
|
:exc:`UnicodeError` exceptions and guarantees that an archive can always be
|
||||||
representation. This is the default scheme. In write mode the default value for
|
read. In write mode the default value for *errors* is ``'strict'``. This
|
||||||
*errors* is ``'strict'`` to ensure that name information is not altered
|
ensures that name information is not altered unnoticed.
|
||||||
unnoticed.
|
|
||||||
|
|
||||||
|
In case of writing :const:`PAX_FORMAT` archives, *encoding* is ignored because
|
||||||
|
non-ASCII metadata is stored using *UTF-8*.
|
||||||
|
|
|
@ -167,7 +167,7 @@ TOEXEC = 0o001 # execute/search by other
|
||||||
#---------------------------------------------------------
|
#---------------------------------------------------------
|
||||||
ENCODING = sys.getfilesystemencoding()
|
ENCODING = sys.getfilesystemencoding()
|
||||||
if ENCODING is None:
|
if ENCODING is None:
|
||||||
ENCODING = sys.getdefaultencoding()
|
ENCODING = "ascii"
|
||||||
|
|
||||||
#---------------------------------------------------------
|
#---------------------------------------------------------
|
||||||
# Some useful functions
|
# Some useful functions
|
||||||
|
@ -982,7 +982,7 @@ class TarInfo(object):
|
||||||
elif format == GNU_FORMAT:
|
elif format == GNU_FORMAT:
|
||||||
return self.create_gnu_header(info, encoding, errors)
|
return self.create_gnu_header(info, encoding, errors)
|
||||||
elif format == PAX_FORMAT:
|
elif format == PAX_FORMAT:
|
||||||
return self.create_pax_header(info, encoding, errors)
|
return self.create_pax_header(info)
|
||||||
else:
|
else:
|
||||||
raise ValueError("invalid format")
|
raise ValueError("invalid format")
|
||||||
|
|
||||||
|
@ -1013,7 +1013,7 @@ class TarInfo(object):
|
||||||
|
|
||||||
return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
|
return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
|
||||||
|
|
||||||
def create_pax_header(self, info, encoding, errors):
|
def create_pax_header(self, info):
|
||||||
"""Return the object as a ustar header block. If it cannot be
|
"""Return the object as a ustar header block. If it cannot be
|
||||||
represented this way, prepend a pax extended header sequence
|
represented this way, prepend a pax extended header sequence
|
||||||
with supplement information.
|
with supplement information.
|
||||||
|
@ -1056,17 +1056,17 @@ class TarInfo(object):
|
||||||
|
|
||||||
# Create a pax extended header if necessary.
|
# Create a pax extended header if necessary.
|
||||||
if pax_headers:
|
if pax_headers:
|
||||||
buf = self._create_pax_generic_header(pax_headers, XHDTYPE, encoding, errors)
|
buf = self._create_pax_generic_header(pax_headers, XHDTYPE)
|
||||||
else:
|
else:
|
||||||
buf = b""
|
buf = b""
|
||||||
|
|
||||||
return buf + self._create_header(info, USTAR_FORMAT, encoding, errors)
|
return buf + self._create_header(info, USTAR_FORMAT, "ascii", "replace")
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def create_pax_global_header(cls, pax_headers, encoding, errors):
|
def create_pax_global_header(cls, pax_headers):
|
||||||
"""Return the object as a pax global header block sequence.
|
"""Return the object as a pax global header block sequence.
|
||||||
"""
|
"""
|
||||||
return cls._create_pax_generic_header(pax_headers, XGLTYPE, encoding, errors)
|
return cls._create_pax_generic_header(pax_headers, XGLTYPE)
|
||||||
|
|
||||||
def _posix_split_name(self, name):
|
def _posix_split_name(self, name):
|
||||||
"""Split a name longer than 100 chars into a prefix
|
"""Split a name longer than 100 chars into a prefix
|
||||||
|
@ -1139,7 +1139,7 @@ class TarInfo(object):
|
||||||
cls._create_payload(name)
|
cls._create_payload(name)
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def _create_pax_generic_header(cls, pax_headers, type, encoding, errors):
|
def _create_pax_generic_header(cls, pax_headers, type):
|
||||||
"""Return a POSIX.1-2001 extended or global header sequence
|
"""Return a POSIX.1-2001 extended or global header sequence
|
||||||
that contains a list of keyword, value pairs. The values
|
that contains a list of keyword, value pairs. The values
|
||||||
must be strings.
|
must be strings.
|
||||||
|
@ -1166,7 +1166,7 @@ class TarInfo(object):
|
||||||
info["magic"] = POSIX_MAGIC
|
info["magic"] = POSIX_MAGIC
|
||||||
|
|
||||||
# Create pax header + record blocks.
|
# Create pax header + record blocks.
|
||||||
return cls._create_header(info, USTAR_FORMAT, encoding, errors) + \
|
return cls._create_header(info, USTAR_FORMAT, "ascii", "replace") + \
|
||||||
cls._create_payload(records)
|
cls._create_payload(records)
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
|
@ -1566,8 +1566,7 @@ class TarFile(object):
|
||||||
self._loaded = True
|
self._loaded = True
|
||||||
|
|
||||||
if self.pax_headers:
|
if self.pax_headers:
|
||||||
buf = self.tarinfo.create_pax_global_header(
|
buf = self.tarinfo.create_pax_global_header(self.pax_headers.copy())
|
||||||
self.pax_headers.copy(), self.encoding, self.errors)
|
|
||||||
self.fileobj.write(buf)
|
self.fileobj.write(buf)
|
||||||
self.offset += len(buf)
|
self.offset += len(buf)
|
||||||
|
|
||||||
|
|
|
@ -780,8 +780,8 @@ class PaxWriteTest(GNUWriteTest):
|
||||||
|
|
||||||
tar = tarfile.open(tmpname, "w", format=tarfile.PAX_FORMAT, encoding="iso8859-1")
|
tar = tarfile.open(tmpname, "w", format=tarfile.PAX_FORMAT, encoding="iso8859-1")
|
||||||
t = tarfile.TarInfo()
|
t = tarfile.TarInfo()
|
||||||
t.name = "\xe4\xf6\xfc" # non-ASCII
|
t.name = "\xe4\xf6\xfc" # non-ASCII
|
||||||
t.uid = 8**8 # too large
|
t.uid = 8**8 # too large
|
||||||
t.pax_headers = pax_headers
|
t.pax_headers = pax_headers
|
||||||
tar.addfile(t)
|
tar.addfile(t)
|
||||||
tar.close()
|
tar.close()
|
||||||
|
@ -794,7 +794,6 @@ class PaxWriteTest(GNUWriteTest):
|
||||||
|
|
||||||
|
|
||||||
class UstarUnicodeTest(unittest.TestCase):
|
class UstarUnicodeTest(unittest.TestCase):
|
||||||
# All *UnicodeTests FIXME
|
|
||||||
|
|
||||||
format = tarfile.USTAR_FORMAT
|
format = tarfile.USTAR_FORMAT
|
||||||
|
|
||||||
|
@ -814,11 +813,14 @@ class UstarUnicodeTest(unittest.TestCase):
|
||||||
tar.close()
|
tar.close()
|
||||||
|
|
||||||
tar = tarfile.open(tmpname, encoding=encoding)
|
tar = tarfile.open(tmpname, encoding=encoding)
|
||||||
self.assert_(type(tar.getnames()[0]) is not bytes)
|
|
||||||
self.assertEqual(tar.getmembers()[0].name, name)
|
self.assertEqual(tar.getmembers()[0].name, name)
|
||||||
tar.close()
|
tar.close()
|
||||||
|
|
||||||
def test_unicode_filename_error(self):
|
def test_unicode_filename_error(self):
|
||||||
|
if self.format == tarfile.PAX_FORMAT:
|
||||||
|
# PAX_FORMAT ignores encoding in write mode.
|
||||||
|
return
|
||||||
|
|
||||||
tar = tarfile.open(tmpname, "w", format=self.format, encoding="ascii", errors="strict")
|
tar = tarfile.open(tmpname, "w", format=self.format, encoding="ascii", errors="strict")
|
||||||
tarinfo = tarfile.TarInfo()
|
tarinfo = tarfile.TarInfo()
|
||||||
|
|
||||||
|
@ -839,21 +841,24 @@ class UstarUnicodeTest(unittest.TestCase):
|
||||||
tar.close()
|
tar.close()
|
||||||
|
|
||||||
def test_uname_unicode(self):
|
def test_uname_unicode(self):
|
||||||
for name in ("\xe4\xf6\xfc", "\xe4\xf6\xfc"):
|
t = tarfile.TarInfo("foo")
|
||||||
t = tarfile.TarInfo("foo")
|
t.uname = "\xe4\xf6\xfc"
|
||||||
t.uname = name
|
t.gname = "\xe4\xf6\xfc"
|
||||||
t.gname = name
|
|
||||||
|
|
||||||
fobj = io.BytesIO()
|
tar = tarfile.open(tmpname, mode="w", format=self.format, encoding="iso8859-1")
|
||||||
tar = tarfile.open("foo.tar", mode="w", fileobj=fobj, format=self.format, encoding="iso8859-1")
|
tar.addfile(t)
|
||||||
tar.addfile(t)
|
tar.close()
|
||||||
tar.close()
|
|
||||||
fobj.seek(0)
|
|
||||||
|
|
||||||
tar = tarfile.open("foo.tar", fileobj=fobj, encoding="iso8859-1")
|
tar = tarfile.open(tmpname, encoding="iso8859-1")
|
||||||
|
t = tar.getmember("foo")
|
||||||
|
self.assertEqual(t.uname, "\xe4\xf6\xfc")
|
||||||
|
self.assertEqual(t.gname, "\xe4\xf6\xfc")
|
||||||
|
|
||||||
|
if self.format != tarfile.PAX_FORMAT:
|
||||||
|
tar = tarfile.open(tmpname, encoding="ascii")
|
||||||
t = tar.getmember("foo")
|
t = tar.getmember("foo")
|
||||||
self.assertEqual(t.uname, "\xe4\xf6\xfc")
|
self.assertEqual(t.uname, "\ufffd\ufffd\ufffd")
|
||||||
self.assertEqual(t.gname, "\xe4\xf6\xfc")
|
self.assertEqual(t.gname, "\ufffd\ufffd\ufffd")
|
||||||
|
|
||||||
|
|
||||||
class GNUUnicodeTest(UstarUnicodeTest):
|
class GNUUnicodeTest(UstarUnicodeTest):
|
||||||
|
@ -861,6 +866,11 @@ class GNUUnicodeTest(UstarUnicodeTest):
|
||||||
format = tarfile.GNU_FORMAT
|
format = tarfile.GNU_FORMAT
|
||||||
|
|
||||||
|
|
||||||
|
class PAXUnicodeTest(UstarUnicodeTest):
|
||||||
|
|
||||||
|
format = tarfile.PAX_FORMAT
|
||||||
|
|
||||||
|
|
||||||
class AppendTest(unittest.TestCase):
|
class AppendTest(unittest.TestCase):
|
||||||
# Test append mode (cp. patch #1652681).
|
# Test append mode (cp. patch #1652681).
|
||||||
|
|
||||||
|
@ -1047,6 +1057,7 @@ def test_main():
|
||||||
PaxWriteTest,
|
PaxWriteTest,
|
||||||
UstarUnicodeTest,
|
UstarUnicodeTest,
|
||||||
GNUUnicodeTest,
|
GNUUnicodeTest,
|
||||||
|
PAXUnicodeTest,
|
||||||
AppendTest,
|
AppendTest,
|
||||||
LimitsTest,
|
LimitsTest,
|
||||||
MiscTest,
|
MiscTest,
|
||||||
|
|
Loading…
Reference in New Issue