Issue #4153: Updated Unicode HOWTO.
commit 93a6b13f96
parent b970142707
@@ -4,13 +4,11 @@
 Unicode HOWTO
 *****************
 
-:Release: 1.11
+:Release: 1.12
 
-This HOWTO discusses Python 2.x's support for Unicode, and explains
+This HOWTO discusses Python support for Unicode, and explains
 various problems that people commonly encounter when trying to work
-with Unicode. (This HOWTO has not yet been updated to cover the 3.x
-versions of Python.)
-
+with Unicode.
 
 Introduction to Unicode
 =======================
@@ -44,14 +42,14 @@ In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
 hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
 machines assigned values between 128 and 255 to accented characters. Different
 machines had different codes, however, which led to problems exchanging files.
-Eventually various commonly used sets of values for the 128-255 range emerged.
+Eventually various commonly used sets of values for the 128--255 range emerged.
 Some were true standards, defined by the International Standards Organization,
 and some were **de facto** conventions that were invented by one company or
 another and managed to catch on.
 
 255 characters aren't very many. For example, you can't fit both the accented
 characters used in Western Europe and the Cyrillic alphabet used for Russian
-into the 128-255 range because there are more than 127 such characters.
+into the 128--255 range because there are more than 127 such characters.
 
 You could write files using different codes (all your Russian files in a coding
 system called KOI8, all your French files in a different coding system called
@@ -64,8 +62,8 @@ bits means you have 2^16 = 65,536 distinct values available, making it possible
 to represent many different characters from many different alphabets; an initial
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
-Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10ffff
+in base 16).
 
 There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with the 1.1
@@ -90,7 +88,7 @@ meanings.
 The Unicode standard describes how characters are represented by **code
 points**. A code point is an integer value, usually denoted in base 16. In the
 standard, a code point is written using the notation U+12ca to mean the
-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot
 of tables listing characters and their corresponding code points::
 
     0061 'a'; LATIN SMALL LETTER A
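The U+12ca example in this hunk is easy to check interactively; the following doctest-style lines are illustrative only and are not part of the commit (they assume a Python 3 interpreter)::

    >>> ord('\u12ca')
    4810
    >>> hex(ord('\u12ca'))
    '0x12ca'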
@@ -117,10 +115,10 @@ Encodings
 ---------
 
 To summarize the previous section: a Unicode string is a sequence of code
-points, which are numbers from 0 to 0x10ffff. This sequence needs to be
-represented as a set of bytes (meaning, values from 0-255) in memory. The rules
-for translating a Unicode string into a sequence of bytes are called an
-**encoding**.
+points, which are numbers from 0 through 0x10ffff (1,114,111 decimal). This
+sequence needs to be represented as a set of bytes (meaning, values
+from 0 through 255) in memory. The rules for translating a Unicode string
+into a sequence of bytes are called an **encoding**.
 
 The first encoding you might think of is an array of 32-bit integers. In this
 representation, the string "Python" would look like this::
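As an aside (not part of the commit), the "array of 32-bit integers" idea that this hunk leads into can be reproduced with the standard UTF-32 codec, which emits exactly four bytes per code point; the big-endian variant is used here so the byte order is predictable::

    >>> "Python".encode("utf-32-be")
    b'\x00\x00\x00P\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n'
    >>> len("Python".encode("utf-32-be"))
    24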
@@ -164,7 +162,7 @@ encoding, for example, are simple; for each code point:
 case.)
 
 Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
-0-255 are identical to the Latin-1 values, so converting to this encoding simply
+0--255 are identical to the Latin-1 values, so converting to this encoding simply
 requires converting code points to byte values; if a code point larger than 255
 is encountered, the string can't be encoded into Latin-1.
 
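A quick illustration of the Latin-1 behaviour described above (not part of the commit; the exact wording of the error may differ slightly between Python versions)::

    >>> '\u00e9'.encode('latin-1')    # U+00E9 is within 0--255, so it becomes one byte
    b'\xe9'
    >>> '\u12ca'.encode('latin-1')    # a code point above 255 can't be encoded
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u12ca' in position 0: ordinal not in range(256)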
@@ -226,8 +224,8 @@ Wikipedia entries are often helpful; see the entries for "character encoding"
 <http://en.wikipedia.org/wiki/UTF-8>, for example.
 
 
-Python 2.x's Unicode Support
-============================
+Python's Unicode Support
+========================
 
 Now that you've learned the rudiments of Unicode, we can look at Python's
 Unicode features.
@@ -265,7 +263,7 @@ Unicode result). The following examples show the differences::
     UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
       unexpected code byte
     >>> b'\x80abc'.decode("utf-8", "replace")
-    '\ufffdabc'
+    '�abc'
     >>> b'\x80abc'.decode("utf-8", "ignore")
     'abc'
 
@@ -281,10 +279,10 @@ that contains the corresponding code point. The reverse operation is the
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::
 
-    >>> chr(40960)
-    '\ua000'
-    >>> ord('\ua000')
-    40960
+    >>> chr(57344)
+    '\ue000'
+    >>> ord('\ue000')
+    57344
 
 Converting to Bytes
 -------------------
@@ -326,7 +324,8 @@ Unicode Literals in Python Source Code
 
 In Python source code, specific Unicode code points can be written using the
 ``\u`` escape sequence, which is followed by four hex digits giving the code
-point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point. The ``\U`` escape sequence is similar, but expects eight hex digits,
+not four::
 
     >>> s = "a\xac\u1234\u20ac\U00008000"
     ^^^^ two-digit hex escape
@@ -465,18 +464,17 @@ like those in string objects' :meth:`encode` and :meth:`decode` methods.
 
 Reading Unicode from a file is therefore simple::
 
-    f = open('unicode.rst', encoding='utf-8')
-    for line in f:
-        print(repr(line))
+    with open('unicode.rst', encoding='utf-8') as f:
+        for line in f:
+            print(repr(line))
 
 It's also possible to open files in update mode, allowing both reading and
 writing::
 
-    f = open('test', encoding='utf-8', mode='w+')
-    f.write('\u4500 blah blah blah\n')
-    f.seek(0)
-    print(repr(f.readline()[:1]))
-    f.close()
+    with open('test', encoding='utf-8', mode='w+') as f:
+        f.write('\u4500 blah blah blah\n')
+        f.seek(0)
+        print(repr(f.readline()[:1]))
 
 The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
 written as the first character of a file in order to assist with autodetection
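The BOM mentioned at the end of this hunk can be seen with the standard 'utf-8-sig' codec, which writes U+FEFF as a UTF-8 signature when encoding and strips it again when decoding; the lines below are illustrative only and are not part of the commit::

    >>> 'hello'.encode('utf-8-sig')               # the BOM bytes are prepended
    b'\xef\xbb\xbfhello'
    >>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')  # and removed on decoding
    'hello'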
@@ -513,14 +511,13 @@ usually just provide the Unicode string as the filename, and it will be
 automatically converted to the right encoding for you::
 
     filename = 'filename\u4500abc'
-    f = open(filename, 'w')
-    f.write('blah\n')
-    f.close()
+    with open(filename, 'w') as f:
+        f.write('blah\n')
 
 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.
 
-:func:`os.listdir`, which returns filenames, raises an issue: should it return
+Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
 the Unicode version of filenames, or should it return byte strings containing
 the encoded versions? :func:`os.listdir` will do both, depending on whether you
 provided the directory path as a byte string or a Unicode string. If you pass a
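A rough sketch of the :func:`os.listdir` behaviour this hunk describes, reusing the filename created above (illustrative only, not part of the commit; the real output depends on the directory contents and assumes a UTF-8 filesystem encoding, and an interactive session would display the CJK character itself rather than the escape)::

    >>> import os
    >>> os.listdir('.')     # str argument, str filenames
    ['filename\u4500abc']
    >>> os.listdir(b'.')    # bytes argument, bytes filenames
    [b'filename\xe4\x94\x80abc']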
@@ -569,14 +566,6 @@ strings, you will find your program vulnerable to bugs wherever you combine the
 two different kinds of strings. There is no automatic encoding or decoding if
 you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression.
 
-It's easy to miss such problems if you only test your software with data that
-doesn't contain any accents; everything will seem to work, but there's actually
-a bug in your program waiting for the first user who attempts to use characters
-> 127. A second tip, therefore, is:
-
-    Include characters > 127 and, even better, characters > 255 in your test
-    data.
-
 When using data coming from a web browser or some other untrusted source, a
 common technique is to check for illegal characters in a string before using the
 string in a generated command line or storing it in a database. If you're doing
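The ``str + bytes`` point retained above is easy to verify interactively (illustrative only, not part of the commit; the exact wording of the message varies across 3.x releases)::

    >>> 'one ' + b'two'
    Traceback (most recent call last):
        ...
    TypeError: can only concatenate str (not "bytes") to str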
@@ -594,8 +583,8 @@ this code::
     if '/' in filename:
         raise ValueError("'/' not allowed in filenames")
     unicode_name = filename.decode(encoding)
-    f = open(unicode_name, 'r')
-    # ... return contents of file ...
+    with open(unicode_name, 'r') as f:
+        # ... return contents of file ...
 
 However, if an attacker could specify the ``'base64'`` encoding, they could pass
 ``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
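To see why the order of operations matters in this hunk: the base-64 form contains no ``'/'``, so the check passes, yet the decoded name is a path traversal target. The demonstration below uses the :mod:`base64` module and is illustrative only, not part of the commit::

    >>> import base64
    >>> base64.b64decode('L2V0Yy9wYXNzd2Q=')
    b'/etc/passwd'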
@@ -610,27 +599,30 @@ The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
 Applications in Python" are available at
 <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
 and discuss questions of character encodings as well as how to internationalize
-and localize an application.
+and localize an application. These slides cover Python 2.x only.
 
 
-Revision History and Acknowledgements
-=====================================
+Acknowledgements
+================
 
 Thanks to the following people who have noted errors or offered suggestions on
 this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
 Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
 
-Version 1.0: posted August 5 2005.
+.. comment
+   Revision History
 
-Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
-several links.
+   Version 1.0: posted August 5 2005.
 
-Version 1.02: posted August 16 2005. Corrects factual errors.
+   Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
+   several links.
 
-Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
+   Version 1.02: posted August 16 2005. Corrects factual errors.
 
-Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
-and that the HOWTO only covers 2.x.
+   Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
+
+   Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
+   and that the HOWTO only covers 2.x.
 
 .. comment Describe Python 3.x support (new section? new document?)
 .. comment Additional topic: building Python w/ UCS2 or UCS4 support