mirror of https://github.com/python/cpython
#4153: finish updating Unicode HOWTO for Py3k changes.
parent 2d92593729
commit 0c07422332
@@ -2,16 +2,11 @@
 Unicode HOWTO
 *****************
 
-:Release: 1.02
+:Release: 1.1
 
 This HOWTO discusses Python's support for Unicode, and explains various problems
 that people commonly encounter when trying to work with Unicode.
 
-.. XXX fix it
-.. warning::
-
-   This HOWTO has not yet been updated for Python 3000's string object changes.
-
 
 Introduction to Unicode
 =======================
@@ -21,9 +16,8 @@ History of Character Codes
 
 In 1968, the American Standard Code for Information Interchange, better known by
 its acronym ASCII, was standardized. ASCII defined numeric codes for various
-characters, with the numeric values running from 0 to
-127. For example, the lowercase letter 'a' is assigned 97 as its code
-value.
+characters, with the numeric values running from 0 to 127. For example, the
+lowercase letter 'a' is assigned 97 as its code value.
 
 ASCII was an American-developed standard, so it only defined unaccented
 characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
@@ -256,25 +250,25 @@ an *errors* argument.
 
 The *errors* argument specifies the response when the input string can't be
 converted according to the encoding's rules. Legal values for this argument are
-'strict' (raise a :exc:`UnicodeDecodeError` exception), 'replace' (add U+FFFD,
+'strict' (raise a :exc:`UnicodeDecodeError` exception), 'replace' (use U+FFFD,
 'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
 Unicode result). The following examples show the differences::
 
     >>> b'\x80abc'.decode("utf-8", "strict")
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
-    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
-                        ordinal not in range(128)
+    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
+                        unexpected code byte
     >>> b'\x80abc'.decode("utf-8", "replace")
     '\ufffdabc'
     >>> b'\x80abc'.decode("utf-8", "ignore")
     'abc'
 
-Encodings are specified as strings containing the encoding's name. Python
-comes with roughly 100 different encodings; see the Python Library Reference at
-:ref:`standard-encodings` for a list. Some encodings
-have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
-synonyms for the same encoding.
+Encodings are specified as strings containing the encoding's name. Python comes
+with roughly 100 different encodings; see the Python Library Reference at
+:ref:`standard-encodings` for a list. Some encodings have multiple names; for
+example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same
+encoding.
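
The aliasing described in the paragraph above can be checked from Python itself. The following is only a sketch: it uses the standard ``codecs.lookup`` function, and the helper name ``same_codec`` is made up for this illustration.

```python
import codecs


def same_codec(*names):
    """Return True if every name resolves to the same registered codec."""
    # codecs.lookup() accepts aliases and returns a CodecInfo object whose
    # .name attribute is the canonical name of the codec.
    canonical = {codecs.lookup(name).name for name in names}
    return len(canonical) == 1


# The three synonyms quoted in the text, then two unrelated encodings.
print(same_codec('latin-1', 'iso_8859_1', '8859'))
print(same_codec('latin-1', 'utf-8'))
```

An unknown name makes ``codecs.lookup`` raise :exc:`LookupError`, so the same call can also serve as a quick validity check for a user-supplied encoding name.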
 
 One-character Unicode strings can also be created with the :func:`chr`
 built-in function, which takes integers and returns a Unicode string of length 1
@@ -294,8 +288,9 @@ Another important str method is ``.encode([encoding], [errors='strict'])``,
 which returns a ``bytes`` representation of the Unicode string, encoded in the
 requested encoding. The ``errors`` parameter is the same as the parameter of
 the :meth:`decode` method, with one additional possibility; as well as 'strict',
-'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
-character references. The following example shows the different results::
+'ignore', and 'replace' (which in this case inserts a question mark instead of
+the unencodable character), you can also pass 'xmlcharrefreplace' which uses
+XML's character references. The following example shows the different results::
 
     >>> u = chr(40960) + 'abcd' + chr(1972)
     >>> u.encode('utf-8')
@@ -303,7 +298,8 @@ character references. The following example shows the different results::
     >>> u.encode('ascii')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
-    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
+    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
+                        position 0: ordinal not in range(128)
     >>> u.encode('ascii', 'ignore')
     b'abcd'
     >>> u.encode('ascii', 'replace')
@@ -319,10 +315,6 @@ completely new encoding, you'll need to learn about the :mod:`codecs` module
 interfaces, but implementing encodings is a specialized task that also won't be
 covered here. Consult the Python documentation to learn more about this module.
 
-The most commonly used part of the :mod:`codecs` module is the
-:func:`codecs.open` function which will be discussed in the section on input and
-output.
-
 
 Unicode Literals in Python Source Code
 --------------------------------------
@@ -350,10 +342,9 @@ encoding. You could then edit Python source code with your favorite editor
 which would display the accented characters naturally, and have the right
 characters used at runtime.
 
-Python supports writing Unicode literals in UTF-8 by default, but you can use
-(almost) any encoding if you declare the encoding being used. This is done by
-including a special comment as either the first or second line of the source
-file::
+Python supports writing source code in UTF-8 by default, but you can use almost
+any encoding if you declare the encoding being used. This is done by including
+a special comment as either the first or second line of the source file::
 
     #!/usr/bin/env python
     # -*- coding: latin-1 -*-
@@ -363,9 +354,9 @@ file::
 
 The syntax is inspired by Emacs's notation for specifying variables local to a
 file. Emacs supports many different variables, but Python only supports
-'coding'. The ``-*-`` symbols indicate that the comment is special; within
-them, you must supply the name ``coding`` and the name of your chosen encoding,
-separated by ``':'``.
+'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
+they have no significance to Python but are a convention. Python looks for
+``coding: name`` or ``coding=name`` in the comment.
 
 If you don't include such a comment, the default encoding used will be UTF-8 as
 already mentioned.
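
To see which encoding a given source file would be read with, the standard library's ``tokenize.detect_encoding`` can be used. The sketch below is illustrative only: the helper ``declared_encoding`` and the sample byte strings are invented for this example, and the function reports a normalized name rather than the exact spelling used in the comment.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    # detect_encoding() reads at most two lines, looking for a BOM or a
    # coding comment, and falls back to UTF-8 when neither is present.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


with_comment = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
short_form = b"# coding=utf-8\nx = 1\n"
no_comment = b"x = 1\n"

for sample in (with_comment, short_form, no_comment):
    print(declared_encoding(sample))
```

Both the ``coding: name`` and the ``coding=name`` forms are recognized, and the Emacs-style ``-*-`` markers are optional.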
@@ -426,7 +417,9 @@ The documentation for the :mod:`codecs` module.
 Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
 Unicode". A PDF version of his slides is available at
 <http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
-excellent overview of the design of Python's Unicode features.
+excellent overview of the design of Python's Unicode features (based on Python
+2, where the Unicode string type is called ``unicode`` and literals start with
+``u``).
 
 
 Reading and Writing Unicode Data
@@ -444,8 +437,8 @@ columns and can return Unicode values from an SQL query.
 
 Unicode data is usually converted to a particular encoding before it gets
 written to disk or sent over a socket. It's possible to do all the work
-yourself: open a file, read an 8-bit string from it, and convert the string with
-``unicode(str, encoding)``. However, the manual approach is not recommended.
+yourself: open a file, read an 8-bit byte string from it, and convert the string
+with ``str(bytes, encoding)``. However, the manual approach is not recommended.
 
 One problem is the multi-byte nature of encodings; one Unicode character can be
 represented by several bytes. If you want to read the file in arbitrary-sized
@@ -459,39 +452,28 @@ string and its Unicode version in memory.)
 
 The solution would be to use the low-level decoding interface to catch the case
 of partial coding sequences. The work of implementing this has already been
-done for you: the :mod:`codecs` module includes a version of the :func:`open`
-function that returns a file-like object that assumes the file's contents are in
-a specified encoding and accepts Unicode parameters for methods such as
-``.read()`` and ``.write()``.
-
-The function's parameters are ``open(filename, mode='rb', encoding=None,
-errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
-just like the corresponding parameter to the regular built-in ``open()``
-function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
-to the standard function's parameter. ``encoding`` is a string giving the
-encoding to use; if it's left as ``None``, a regular Python file object that
-accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
-data written to or read from the wrapper object will be converted as needed.
-``errors`` specifies the action for encoding errors and can be one of the usual
-values of 'strict', 'ignore', and 'replace'.
+done for you: the built-in :func:`open` function can return a file-like object
+that assumes the file's contents are in a specified encoding and accepts Unicode
+parameters for methods such as ``.read()`` and ``.write()``. This works through
+:func:`open`\'s *encoding* and *errors* parameters which are interpreted just
+like those in string objects' :meth:`encode` and :meth:`decode` methods.
 
 Reading Unicode from a file is therefore simple::
 
-    import codecs
-    f = codecs.open('unicode.rst', encoding='utf-8')
+    f = open('unicode.rst', encoding='utf-8')
     for line in f:
         print(repr(line))
 
 It's also possible to open files in update mode, allowing both reading and
 writing::
 
-    f = codecs.open('test', encoding='utf-8', mode='w+')
+    f = open('test', encoding='utf-8', mode='w+')
     f.write('\u4500 blah blah blah\n')
     f.seek(0)
     print(repr(f.readline()[:1]))
     f.close()
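
When the bytes do not come from a file object (for example, chunks received from a socket), the low-level decoding interface mentioned earlier can be used directly. The following is a minimal sketch using ``codecs.getincrementaldecoder``; the generator name ``decode_chunks`` and the way the sample data is split are assumptions made for the illustration.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks, yielding text as it becomes available."""
    decoder = codecs.getincrementaldecoder(encoding)(errors='strict')
    for chunk in chunks:
        # Bytes that end in the middle of a character are buffered by the
        # decoder and completed by the next chunk.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True makes the decoder raise if the input stopped mid-character.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail


data = '\u4500 blah'.encode('utf-8')
# Split the three-byte encoding of U+4500 across three separate chunks.
pieces = [data[:1], data[1:2], data[2:]]
print(''.join(decode_chunks(pieces)) == '\u4500 blah')
```

Decoding each chunk independently with ``bytes.decode`` would fail on the first piece, which is exactly the partial-sequence problem the text describes.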
 
-Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
+The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
 written as the first character of a file in order to assist with autodetection
 of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
 present at the start of a file; when such an encoding is used, the BOM will be
@@ -500,6 +482,12 @@ the file is read. There are variants of these encodings, such as 'utf-16-le'
 and 'utf-16-be' for little-endian and big-endian encodings, that specify one
 particular byte ordering and don't skip the BOM.
 
+In some areas, it is also convention to use a "BOM" at the start of UTF-8
+encoded files; the name is misleading since UTF-8 is not byte-order dependent.
+The mark simply announces that the file is encoded in UTF-8. Use the
+'utf-8-sig' codec to automatically skip the mark if present for reading such
+files.
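
The difference between the two codecs can be seen on in-memory data. This is a small sketch; the sample byte strings are invented, and ``codecs.BOM_UTF8`` is the standard constant holding the three-byte mark.

```python
import codecs

with_mark = codecs.BOM_UTF8 + 'hello'.encode('utf-8')
without_mark = 'hello'.encode('utf-8')

# Plain 'utf-8' keeps the mark as a leading U+FEFF character.
plain = with_mark.decode('utf-8')
# 'utf-8-sig' drops the mark if present and is harmless if it is absent.
skipped = with_mark.decode('utf-8-sig')
unchanged = without_mark.decode('utf-8-sig')

print(ascii(plain), ascii(skipped), ascii(unchanged))
```

The same codec name can be passed as the *encoding* argument of :func:`open` when reading such files.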
+
 
 Unicode filenames
 -----------------
@@ -528,31 +516,36 @@ Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unico
 filenames.
 
 :func:`os.listdir`, which returns filenames, raises an issue: should it return
-the Unicode version of filenames, or should it return 8-bit strings containing
+the Unicode version of filenames, or should it return byte strings containing
 the encoded versions? :func:`os.listdir` will do both, depending on whether you
-provided the directory path as an 8-bit string or a Unicode string. If you pass
-a Unicode string as the path, filenames will be decoded using the filesystem's
-encoding and a list of Unicode strings will be returned, while passing an 8-bit
-path will return the 8-bit versions of the filenames. For example, assuming the
-default filesystem encoding is UTF-8, running the following program::
+provided the directory path as a byte string or a Unicode string. If you pass a
+Unicode string as the path, filenames will be decoded using the filesystem's
+encoding and a list of Unicode strings will be returned, while passing a byte
+path will return the byte string versions of the filenames. For example,
+assuming the default filesystem encoding is UTF-8, running the following
+program::
 
     fn = 'filename\u4500abc'
     f = open(fn, 'w')
     f.close()
 
     import os
+    print(os.listdir(b'.'))
     print(os.listdir('.'))
-    print(os.listdir(u'.'))
 
 will produce the following output::
 
     amk:~$ python t.py
-    ['.svn', 'filename\xe4\x94\x80abc', ...]
+    [b'.svn', b'filename\xe4\x94\x80abc', ...]
     ['.svn', 'filename\u4500abc', ...]
 
 The first list contains UTF-8-encoded filenames, and the second list contains
 the Unicode versions.
 
+Note that in most occasions, the Unicode APIs should be used. The bytes APIs
+should only be used on systems where undecodable file names can be present,
+i.e. Unix systems.
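
The two behaviours of :func:`os.listdir` can be reproduced in a scratch directory. The sketch below assumes the filesystem encoding can represent U+4500 (true for UTF-8); the helper ``list_both_ways`` is invented for the example, and ``os.fsencode`` / ``os.fsdecode`` are used to convert between the two forms with the filesystem's encoding.

```python
import os
import tempfile


def list_both_ways(directory):
    """Return (str_names, bytes_names) for the same directory."""
    text_names = sorted(os.listdir(directory))               # str path, str names
    byte_names = sorted(os.listdir(os.fsencode(directory)))  # bytes path, bytes names
    return text_names, byte_names


with tempfile.TemporaryDirectory() as tmp:
    # Assumption: the filesystem encoding can encode this name.
    with open(os.path.join(tmp, 'filename\u4500abc'), 'w'):
        pass
    text_names, byte_names = list_both_ways(tmp)

print(ascii(text_names))
print(byte_names)
```

Decoding each bytes name with ``os.fsdecode`` gives back the corresponding Unicode name, which is usually the form worth keeping inside the program.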
+
 
 
 Tips for Writing Unicode-aware Programs
@@ -566,12 +559,10 @@ The most important tip is:
 Software should only work with Unicode strings internally, converting to a
 particular encoding on output.
 
-If you attempt to write processing functions that accept both Unicode and 8-bit
+If you attempt to write processing functions that accept both Unicode and byte
 strings, you will find your program vulnerable to bugs wherever you combine the
-two different kinds of strings. Python's default encoding is ASCII, so whenever
-a character with an ASCII value > 127 is in the input data, you'll get a
-:exc:`UnicodeDecodeError` because that character can't be handled by the ASCII
-encoding.
+two different kinds of strings. There is no automatic encoding or decoding if
+you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression.
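
A short sketch of that behaviour follows, together with the usual remedy of decoding explicitly at the boundary. The function name ``join_text`` and the sample values are made up for this illustration.

```python
def join_text(text, data, encoding):
    """Concatenate a str and a bytes object by decoding the bytes first."""
    return text + data.decode(encoding)


text = 'caf\u00e9'
data = b' au lait'

try:
    text + data  # no implicit decoding takes place
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)
print(ascii(join_text(text, data, 'ascii')))
```

Because the failure does not depend on the contents of either operand, even pure-ASCII test data exposes this kind of mix-up.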
 
 It's easy to miss such problems if you only test your software with data that
 doesn't contain any accents; everything will seem to work, but there's actually
@@ -594,7 +585,7 @@ For example, let's say you have a content management system that takes a Unicode
 filename, and you want to disallow paths with a '/' character. You might write
 this code::
 
-    def read_file (filename, encoding):
+    def read_file(filename, encoding):
         if '/' in filename:
             raise ValueError("'/' not allowed in filenames")
         unicode_name = filename.decode(encoding)
@@ -631,9 +622,10 @@ several links.
 
 Version 1.02: posted August 16 2005. Corrects factual errors.
 
+Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
+
 
 .. comment Additional topic: building Python w/ UCS2 or UCS4 support
 .. comment Describe obscure -U switch somewhere?
 .. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
 
 .. comment