#4153: finish updating Unicode HOWTO for Py3k changes.

Georg Brandl 2008-11-22 10:26:59 +00:00
parent 2d92593729
commit 0c07422332
1 changed files with 60 additions and 68 deletions


@ -2,16 +2,11 @@
Unicode HOWTO
*****************
:Release: 1.02
:Release: 1.1
This HOWTO discusses Python's support for Unicode, and explains various problems
that people commonly encounter when trying to work with Unicode.
.. XXX fix it
.. warning::
This HOWTO has not yet been updated for Python 3000's string object changes.
Introduction to Unicode
=======================
@ -21,9 +16,8 @@ History of Character Codes
In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to
127. For example, the lowercase letter 'a' is assigned 97 as its code
value.
characters, with the numeric values running from 0 to 127. For example, the
lowercase letter 'a' is assigned 97 as its code value.
ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
@ -256,25 +250,25 @@ an *errors* argument.
The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
'strict' (raise a :exc:`UnicodeDecodeError` exception), 'replace' (add U+FFFD,
'strict' (raise a :exc:`UnicodeDecodeError` exception), 'replace' (use U+FFFD,
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
Unicode result). The following examples show the differences::
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
unexpected code byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
synonyms for the same encoding.
Encodings are specified as strings containing the encoding's name. Python comes
with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same
encoding.
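As a small illustration of the aliasing described above, the following sketch looks up the three names from the text with :func:`codecs.lookup` and checks that they resolve to one codec. The variable names and the sample string are made up for this example.

```python
import codecs

# The three spellings mentioned in the text.
names = ['latin-1', 'iso_8859_1', '8859']

# codecs.lookup() returns a CodecInfo object; its .name attribute is the
# canonical name of the codec the alias resolves to.
canonical = {codecs.lookup(name).name for name in names}
same_codec = len(canonical) == 1

# Encoding the same text under each alias should give identical bytes.
sample = 'caf\u00e9'
encoded = [sample.encode(name) for name in names]
```

Comparing the canonical names, rather than hard-coding one, avoids depending on how a particular Python version spells the codec's name.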
One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
@ -294,8 +288,9 @@ Another important str method is ``.encode([encoding], [errors='strict'])``,
which returns a ``bytes`` representation of the Unicode string, encoded in the
requested encoding. The ``errors`` parameter is the same as the parameter of
the :meth:`decode` method, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
character references. The following example shows the different results::
'ignore', and 'replace' (which in this case inserts a question mark instead of
the unencodable character), you can also pass 'xmlcharrefreplace' which uses
XML's character references. The following example shows the different results::
>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
@ -303,7 +298,8 @@ character references. The following example shows the different results::
>>> u.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
@ -319,10 +315,6 @@ completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about this module.
The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.
Unicode Literals in Python Source Code
--------------------------------------
@ -350,10 +342,9 @@ encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.
Python supports writing Unicode literals in UTF-8 by default, but you can use
(almost) any encoding if you declare the encoding being used. This is done by
including a special comment as either the first or second line of the source
file::
Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::
#!/usr/bin/env python
# -*- coding: latin-1 -*-
@ -363,9 +354,9 @@ file::
The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate that the comment is special; within
them, you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.
If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.
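One way to see which declaration Python would pick up is the standard library's :func:`tokenize.detect_encoding`, which reads at most the first two lines of a source file. The sketch below is only an illustration: the helper ``declared_encoding`` and the sample sources are hypothetical, and the function may report an encoding under a normalized name rather than exactly as written in the comment.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding detected for the given bytes.

    Hypothetical helper for this example; it wraps the real
    tokenize.detect_encoding(), which expects a readline callable.
    """
    encoding, _lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline)
    return encoding

# Emacs-style declaration on the second line, after the interpreter line.
emacs_style = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"
# The bare "coding=name" form, without the -*- markers.
plain_style = b"# coding=cp1252\nx = 1\n"
# No declaration at all: the default applies.
no_comment = b"x = 1\n"

for source in (emacs_style, plain_style, no_comment):
    print(declared_encoding(source))
```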
@ -426,7 +417,9 @@ The documentation for the :mod:`codecs` module.
Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode". A PDF version of his slides is available at
<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features.
excellent overview of the design of Python's Unicode features (based on Python
2, where the Unicode string type is called ``unicode`` and literals start with
``u``).
Reading and Writing Unicode Data
@ -444,8 +437,8 @@ columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit string from it, and convert the string with
``unicode(str, encoding)``. However, the manual approach is not recommended.
yourself: open a file, read an 8-bit byte string from it, and convert the string
with ``str(bytes, encoding)``. However, the manual approach is not recommended.
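For reference, the manual approach just described might look like the following sketch. The temporary file, its name, and the sample text are assumptions made only so the example is self-contained.

```python
import os
import tempfile

text = 'caf\u00e9 \u4500\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')   # hypothetical file name

    # Write the encoded bytes by hand, in binary mode.
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Read the raw bytes back and convert them manually.
    with open(path, 'rb') as f:
        raw = f.read()
    decoded = str(raw, 'utf-8')   # same result as raw.decode('utf-8')
```

This works here because the whole file is read in one piece before decoding.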
One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
@ -459,39 +452,28 @@ string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the :mod:`codecs` module includes a version of the :func:`open`
function that returns a file-like object that assumes the file's contents are in
a specified encoding and accepts Unicode parameters for methods such as
``.read()`` and ``.write()``.
The function's parameters are ``open(filename, mode='rb', encoding=None,
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
just like the corresponding parameter to the regular built-in ``open()``
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
to the standard function's parameter. ``encoding`` is a string giving the
encoding to use; if it's left as ``None``, a regular Python file object that
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the usual
values of 'strict', 'ignore', and 'replace'.
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as ``.read()`` and ``.write()``. This works through
:func:`open`\'s *encoding* and *errors* parameters which are interpreted just
like those in string objects' :meth:`encode` and :meth:`decode` methods.
Reading Unicode from a file is therefore simple::
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
f = open('unicode.rst', encoding='utf-8')
for line in f:
print(repr(line))
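For comparison, here is a minimal sketch of the low-level work that :func:`open` does for you when data arrives in arbitrary chunks. It uses :func:`codecs.getincrementaldecoder`; the split point and variable names are chosen just for this illustration.

```python
import codecs

encoded = '\u4500 blah'.encode('utf-8')   # U+4500 takes three bytes in UTF-8
chunks = [encoded[:2], encoded[2:]]       # split inside that character

# Decoding the first chunk on its own fails, because it ends in the
# middle of a multi-byte sequence.
try:
    chunks[0].decode('utf-8')
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True

# An incremental decoder buffers the incomplete sequence until the
# remaining bytes arrive.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))   # flush; errors if bytes remain
result = ''.join(pieces)
```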
It's also possible to open files in update mode, allowing both reading and
writing::
f = codecs.open('test', encoding='utf-8', mode='w+')
f = open('test', encoding='utf-8', mode='w+')
f.write('\u4500 blah blah blah\n')
f.seek(0)
print(repr(f.readline()[:1]))
f.close()
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
@ -500,6 +482,12 @@ the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
In some areas, it is also a convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. Use the
'utf-8-sig' codec to automatically skip the mark, if present, when reading such
files.
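The difference between the two codecs can be seen on in-memory bytes. This is a small sketch with made-up sample data:

```python
# UTF-8 data that starts with the encoded form of U+FEFF.
with_mark = b'\xef\xbb\xbfhello'
plain = b'hello'

kept = with_mark.decode('utf-8')          # the mark stays as a character
skipped = with_mark.decode('utf-8-sig')   # the mark is removed
unchanged = plain.decode('utf-8-sig')     # no mark present: nothing to skip

# When encoding, 'utf-8-sig' writes the mark at the start.
written = 'hello'.encode('utf-8-sig')
```

The same codec name can be passed as the *encoding* argument of :func:`open`.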
Unicode filenames
-----------------
@ -528,31 +516,36 @@ Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unico
filenames.
:func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return 8-bit strings containing
the Unicode version of filenames, or should it return byte strings containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as an 8-bit string or a Unicode string. If you pass
a Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing an 8-bit
path will return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::
provided the directory path as a byte string or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the byte string versions of the filenames. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::
fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print(os.listdir(b'.'))
print(os.listdir('.'))
print(os.listdir(u'.'))
will produce the following output::
amk:~$ python t.py
['.svn', 'filename\xe4\x94\x80abc', ...]
[b'.svn', b'filename\xe4\x94\x80abc', ...]
['.svn', 'filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
Note that on most occasions, the Unicode APIs should be used. The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.
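A self-contained variant of the program above, run in a temporary directory so that only the test file is listed, could look like this sketch. It assumes the filesystem encoding can represent the file name (for example UTF-8), and it uses :func:`os.fsencode` to build the bytes path.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    fn = 'filename\u4500abc'
    # Create an empty file with a non-ASCII name.
    with open(os.path.join(tmp, fn), 'w'):
        pass

    as_text = os.listdir(tmp)                 # str path -> list of str
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes path -> list of bytes

print(as_text)
print(as_bytes)
```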
Tips for Writing Unicode-aware Programs
@ -566,12 +559,10 @@ The most important tip is:
Software should only work with Unicode strings internally, converting to a
particular encoding on output.
If you attempt to write processing functions that accept both Unicode and 8-bit
If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. Python's default encoding is ASCII, so whenever
a character with an ASCII value > 127 is in the input data, you'll get a
:exc:`UnicodeDecodeError` because that character can't be handled by the ASCII
encoding.
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` is raised.
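A short sketch of this behaviour, with made-up values, shows that the conversion has to be written explicitly in one direction or the other:

```python
text = 'abc'
data = b'def'

# Mixing the two types is rejected rather than silently converted.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Convert explicitly, choosing which type the result should have.
combined_text = text + data.decode('ascii')
combined_bytes = text.encode('ascii') + data
```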
It's easy to miss such problems if you only test your software with data that
doesn't contain any accents; everything will seem to work, but there's actually
@ -594,7 +585,7 @@ For example, let's say you have a content management system that takes a Unicode
filename, and you want to disallow paths with a '/' character. You might write
this code::
def read_file (filename, encoding):
def read_file(filename, encoding):
if '/' in filename:
raise ValueError("'/' not allowed in filenames")
unicode_name = filename.decode(encoding)
@ -631,9 +622,10 @@ several links.
Version 1.02: posted August 16 2005. Corrects factual errors.
Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
.. comment