Fall back to 'ascii' encoding if sys.getfilesystemencoding() returns None.

Remove the encoding and errors arguments from the pax create methods in
TarInfo; pax always uses UTF-8.
Adapt the documentation and tests to the new string/unicode concept.
Lars Gustäbel 2007-08-21 12:17:05 +00:00
parent 4566c71e0e
commit 3741effcf8
3 changed files with 62 additions and 57 deletions

@@ -272,13 +272,14 @@ object, see :ref:`tarinfo-objects` for details.
    :exc:`IOError` exceptions. If ``2``, all *non-fatal* errors are raised as
    :exc:`TarError` exceptions as well.

-   The *encoding* and *errors* arguments control the way strings are converted to
-   unicode objects and vice versa. The default settings will work for most users.
+   The *encoding* and *errors* arguments define the character encoding to be
+   used for reading or writing the archive and how conversion errors are going
+   to be handled. The default settings will work for most users.
    See section :ref:`tar-unicode` for in-depth information.

    .. versionadded:: 2.6

-   The *pax_headers* argument is an optional dictionary of unicode strings which
+   The *pax_headers* argument is an optional dictionary of strings which
    will be added as a pax global header if *format* is :const:`PAX_FORMAT`.

    .. versionadded:: 2.6
@@ -703,36 +704,30 @@ Unicode issues
 The tar format was originally conceived to make backups on tape drives with the
 main focus on preserving file system information. Nowadays tar archives are
 commonly used for file distribution and exchanging archives over networks. One
-problem of the original format (that all other formats are merely variants of)
-is that there is no concept of supporting different character encodings. For
+problem of the original format (which is the basis of all other formats) is
+that there is no concept of supporting different character encodings. For
 example, an ordinary tar archive created on a *UTF-8* system cannot be read
-correctly on a *Latin-1* system if it contains non-ASCII characters. Names (i.e.
-filenames, linknames, user/group names) containing these characters will appear
-damaged. Unfortunately, there is no way to autodetect the encoding of an
-archive.
+correctly on a *Latin-1* system if it contains non-*ASCII* characters. Textual
+metadata (like filenames, linknames, user/group names) will appear damaged.
+Unfortunately, there is no way to autodetect the encoding of an archive. The
+pax format was designed to solve this problem. It stores non-ASCII metadata
+using the universal character encoding *UTF-8*.

-The pax format was designed to solve this problem. It stores non-ASCII names
-using the universal character encoding *UTF-8*. When a pax archive is read,
-these *UTF-8* names are converted to the encoding of the local file system.
+The details of character conversion in :mod:`tarfile` are controlled by the
+*encoding* and *errors* keyword arguments of the :class:`TarFile` class.

-The details of unicode conversion are controlled by the *encoding* and *errors*
-keyword arguments of the :class:`TarFile` class.
+*encoding* defines the character encoding to use for the metadata in the
+archive. The default value is :func:`sys.getfilesystemencoding` or ``'ascii'``
+as a fallback. Depending on whether the archive is read or written, the
+metadata must be either decoded or encoded. If *encoding* is not set
+appropriately, this conversion may fail.

-The default value for *encoding* is the local character encoding. It is deduced
-from :func:`sys.getfilesystemencoding` and :func:`sys.getdefaultencoding`. In
-read mode, *encoding* is used exclusively to convert unicode names from a pax
-archive to strings in the local character encoding. In write mode, the use of
-*encoding* depends on the chosen archive format. In case of :const:`PAX_FORMAT`,
-input names that contain non-ASCII characters need to be decoded before being
-stored as *UTF-8* strings. The other formats do not make use of *encoding*
-unless unicode objects are used as input names. These are converted to 8-bit
-character strings before they are added to the archive.
-
 The *errors* argument defines how characters are treated that cannot be
-converted to or from *encoding*. Possible values are listed in section
-:ref:`codec-base-classes`. In read mode, there is an additional scheme
-``'utf-8'`` which means that bad characters are replaced by their *UTF-8*
-representation. This is the default scheme. In write mode the default value for
-*errors* is ``'strict'`` to ensure that name information is not altered
-unnoticed.
+converted. Possible values are listed in section :ref:`codec-base-classes`. In
+read mode the default scheme is ``'replace'``. This avoids unexpected
+:exc:`UnicodeError` exceptions and guarantees that an archive can always be
+read. In write mode the default value for *errors* is ``'strict'``. This
+ensures that name information is not altered unnoticed.
+
+In case of writing :const:`PAX_FORMAT` archives, *encoding* is ignored because
+non-ASCII metadata is stored using *UTF-8*.

@@ -167,7 +167,7 @@ TOEXEC = 0o001 # execute/search by other
 #---------------------------------------------------------
 ENCODING = sys.getfilesystemencoding()
 if ENCODING is None:
-    ENCODING = sys.getdefaultencoding()
+    ENCODING = "ascii"

 #---------------------------------------------------------
 # Some useful functions
@@ -982,7 +982,7 @@ class TarInfo(object):
         elif format == GNU_FORMAT:
             return self.create_gnu_header(info, encoding, errors)
         elif format == PAX_FORMAT:
-            return self.create_pax_header(info, encoding, errors)
+            return self.create_pax_header(info)
         else:
             raise ValueError("invalid format")
@@ -1013,7 +1013,7 @@ class TarInfo(object):
         return buf + self._create_header(info, GNU_FORMAT, encoding, errors)

-    def create_pax_header(self, info, encoding, errors):
+    def create_pax_header(self, info):
         """Return the object as a ustar header block. If it cannot be
            represented this way, prepend a pax extended header sequence
            with supplement information.
@@ -1056,17 +1056,17 @@ class TarInfo(object):
         # Create a pax extended header if necessary.
         if pax_headers:
-            buf = self._create_pax_generic_header(pax_headers, XHDTYPE, encoding, errors)
+            buf = self._create_pax_generic_header(pax_headers, XHDTYPE)
         else:
             buf = b""

-        return buf + self._create_header(info, USTAR_FORMAT, encoding, errors)
+        return buf + self._create_header(info, USTAR_FORMAT, "ascii", "replace")

     @classmethod
-    def create_pax_global_header(cls, pax_headers, encoding, errors):
+    def create_pax_global_header(cls, pax_headers):
         """Return the object as a pax global header block sequence.
         """
-        return cls._create_pax_generic_header(pax_headers, XGLTYPE, encoding, errors)
+        return cls._create_pax_generic_header(pax_headers, XGLTYPE)

     def _posix_split_name(self, name):
         """Split a name longer than 100 chars into a prefix
@@ -1139,7 +1139,7 @@ class TarInfo(object):
                 cls._create_payload(name)

     @classmethod
-    def _create_pax_generic_header(cls, pax_headers, type, encoding, errors):
+    def _create_pax_generic_header(cls, pax_headers, type):
         """Return a POSIX.1-2001 extended or global header sequence
            that contains a list of keyword, value pairs. The values
            must be strings.
@@ -1166,7 +1166,7 @@ class TarInfo(object):
         info["magic"] = POSIX_MAGIC

         # Create pax header + record blocks.
-        return cls._create_header(info, USTAR_FORMAT, encoding, errors) + \
+        return cls._create_header(info, USTAR_FORMAT, "ascii", "replace") + \
                 cls._create_payload(records)

     @classmethod
@@ -1566,8 +1566,7 @@ class TarFile(object):
             self._loaded = True

             if self.pax_headers:
-                buf = self.tarinfo.create_pax_global_header(
-                        self.pax_headers.copy(), self.encoding, self.errors)
+                buf = self.tarinfo.create_pax_global_header(self.pax_headers.copy())
                 self.fileobj.write(buf)
                 self.offset += len(buf)

@@ -780,8 +780,8 @@ class PaxWriteTest(GNUWriteTest):
         tar = tarfile.open(tmpname, "w", format=tarfile.PAX_FORMAT, encoding="iso8859-1")
         t = tarfile.TarInfo()
         t.name = "\xe4\xf6\xfc" # non-ASCII
         t.uid = 8**8 # too large
         t.pax_headers = pax_headers
         tar.addfile(t)
         tar.close()
@@ -794,7 +794,6 @@ class PaxWriteTest(GNUWriteTest):
 class UstarUnicodeTest(unittest.TestCase):
-    # All *UnicodeTests FIXME

     format = tarfile.USTAR_FORMAT
@@ -814,11 +813,14 @@ class UstarUnicodeTest(unittest.TestCase):
         tar.close()

         tar = tarfile.open(tmpname, encoding=encoding)
+        self.assert_(type(tar.getnames()[0]) is not bytes)
         self.assertEqual(tar.getmembers()[0].name, name)
         tar.close()

     def test_unicode_filename_error(self):
+        if self.format == tarfile.PAX_FORMAT:
+            # PAX_FORMAT ignores encoding in write mode.
+            return
         tar = tarfile.open(tmpname, "w", format=self.format, encoding="ascii", errors="strict")
         tarinfo = tarfile.TarInfo()
@@ -839,21 +841,24 @@ class UstarUnicodeTest(unittest.TestCase):
         tar.close()

     def test_uname_unicode(self):
-        for name in ("\xe4\xf6\xfc", "\xe4\xf6\xfc"):
-            t = tarfile.TarInfo("foo")
-            t.uname = name
-            t.gname = name
+        t = tarfile.TarInfo("foo")
+        t.uname = "\xe4\xf6\xfc"
+        t.gname = "\xe4\xf6\xfc"

-            fobj = io.BytesIO()
-            tar = tarfile.open("foo.tar", mode="w", fileobj=fobj, format=self.format, encoding="iso8859-1")
-            tar.addfile(t)
-            tar.close()
-            fobj.seek(0)
+        tar = tarfile.open(tmpname, mode="w", format=self.format, encoding="iso8859-1")
+        tar.addfile(t)
+        tar.close()

-            tar = tarfile.open("foo.tar", fileobj=fobj, encoding="iso8859-1")
+        tar = tarfile.open(tmpname, encoding="iso8859-1")
+        t = tar.getmember("foo")
+        self.assertEqual(t.uname, "\xe4\xf6\xfc")
+        self.assertEqual(t.gname, "\xe4\xf6\xfc")
+
+        if self.format != tarfile.PAX_FORMAT:
+            tar = tarfile.open(tmpname, encoding="ascii")
             t = tar.getmember("foo")
-            self.assertEqual(t.uname, "\xe4\xf6\xfc")
-            self.assertEqual(t.gname, "\xe4\xf6\xfc")
+            self.assertEqual(t.uname, "\ufffd\ufffd\ufffd")
+            self.assertEqual(t.gname, "\ufffd\ufffd\ufffd")


 class GNUUnicodeTest(UstarUnicodeTest):
@@ -861,6 +866,11 @@ class GNUUnicodeTest(UstarUnicodeTest):
     format = tarfile.GNU_FORMAT


+class PAXUnicodeTest(UstarUnicodeTest):
+
+    format = tarfile.PAX_FORMAT
+
+
 class AppendTest(unittest.TestCase):
     # Test append mode (cp. patch #1652681).
@@ -1047,6 +1057,7 @@ def test_main():
         PaxWriteTest,
         UstarUnicodeTest,
         GNUUnicodeTest,
+        PAXUnicodeTest,
         AppendTest,
         LimitsTest,
         MiscTest,