mirror of https://github.com/python/cpython
Marc-Andre Lemburg <mal@lemburg.com>:
Tutorial information about Unicode strings in Python, with some markup adjustments from FLD.
This commit is contained in:
parent
a4cd2611f4
commit
9dc30bb956
101
Doc/tut/tut.tex
101
Doc/tut/tut.tex
|
@ -733,6 +733,107 @@ The built-in function \function{len()} returns the length of a string:
|
|||
34
|
||||
\end{verbatim}
|
||||
|
||||
|
||||
\subsection{Unicode Strings \label{unicodeStrings}}
|
||||
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
|
||||
|
||||
Starting with Python 1.6 a new data type for storing text data is
|
||||
available to the programmer: the Unicode object. It can be used to
|
||||
store and manipulate Unicode data (see \url{http://www.unicode.org})
|
||||
and intergrates well with the existing string objects providing
|
||||
auto-conversions where necessary.
|
||||
|
||||
Unicode has the advantage of providing one ordinal for every character
|
||||
in every script used in modern and ancient texts. Previously, there
|
||||
were only 256 possible ordinals for script characters and texts were
|
||||
typically bound to a code page which mapped the ordinals to script
|
||||
characters. This lead to very much confusion especially with respect
|
||||
to internalization (usually written as \samp{i18n} --- \character{i} +
|
||||
18 characters + \character{n}) of software. Unicode solves these
|
||||
problems by defining one code page for all scripts.
|
||||
|
||||
Creating Unicode strings in Python is just as simple as creating
|
||||
normal strings:
|
||||
|
||||
\begin{verbatim}
|
||||
>>> u'Hello World !'
|
||||
u'Hello World !'
|
||||
\end{verbatim}
|
||||
|
||||
The small \character{u} in front of the quote indicates that an
|
||||
Unicode string is supposed to be created. If you want to include
|
||||
special characters in the string, you can do so by using the Python
|
||||
\emph{Unicode-Escape} encoding. The following example shows how:
|
||||
|
||||
\begin{verbatim}
|
||||
>>> u'Hello\\u0020World !'
|
||||
u'Hello World !'
|
||||
\end{verbatim}
|
||||
|
||||
The escape sequence \code{\\u0020} indicates to insert the Unicode
|
||||
character with the HEX ordinal 0x0020 (the space character) at the
|
||||
given position.
|
||||
|
||||
Other characters are interpreted by using their respective ordinal
|
||||
value directly as Unicode ordinal. Due to the fact that the lower 256
|
||||
Unicode are the same as the standard Latin-1 encoding used in many
|
||||
western countries, the process of entering Unicode is greatly
|
||||
simplified.
|
||||
|
||||
For experts, there is also a raw mode just like for normal
|
||||
strings. You have to prepend the string with a small 'r' to have
|
||||
Python use the \emph{Raw-Unicode-Escape} encoding. It will only apply
|
||||
the above \code{\\uXXXX} conversion if there is an uneven number of
|
||||
backslashes in front of the small 'u'.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> ur'Hello\u0020World !'
|
||||
u'Hello World !'
|
||||
>>> ur'Hello\\u0020World !'
|
||||
u'Hello\\\\u0020World !'
|
||||
\end{verbatim}
|
||||
|
||||
The raw mode is most useful when you have to enter lots of backslashes
|
||||
e.g. in regular expressions.
|
||||
|
||||
Apart from these standard encodings, Python provides a whole set of
|
||||
other ways of creating Unicod strings on the basis of a known
|
||||
encoding.
|
||||
|
||||
The builtin \function{unicode()}\bifuncindex{unicode} provides access
|
||||
to all registered Unicode codecs (COders and DECoders). Some of the
|
||||
more well known encodings which these codecs can convert are
|
||||
\emph{Latin-1}, \emph{ASCII}, \emph{UTF-8} and \emph{UTF-16}. The latter two
|
||||
are variable length encodings which permit to store Unicode characters
|
||||
in 8 or 16 bits. Python uses UTF-8 as default encoding. This becomes
|
||||
noticable when printing Unicode strings or writing them to files.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> u"äöü"
|
||||
u'\344\366\374'
|
||||
>>> str(u"äöü")
|
||||
'\303\244\303\266\303\274'
|
||||
\end{verbatim}
|
||||
|
||||
If you have data in a specific encoding and want to produce a
|
||||
corresponding Unicode string from it, you can use the
|
||||
\function{unicode()} builtin with the encoding name as second
|
||||
argument.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> unicode('\303\244\303\266\303\274','UTF-8')
|
||||
u'\344\366\374'
|
||||
\end{verbatim}
|
||||
|
||||
To convert the Unicode string back into a string using the original
|
||||
encoding, the objects provide an \method{encode()} method.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> u"äöü".encode('UTF-8')
|
||||
'\303\244\303\266\303\274'
|
||||
\end{verbatim}
|
||||
|
||||
|
||||
\subsection{Lists \label{lists}}
|
||||
|
||||
Python knows a number of \emph{compound} data types, used to group
|
||||
|
|
Loading…
Reference in New Issue