\section{\module{unicodedata} ---
         Unicode Database}

\declaremodule{standard}{unicodedata}
\modulesynopsis{Access the Unicode Database.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\index{Unicode}
\index{character}
\indexii{Unicode}{database}

This module provides access to the Unicode Character Database, which
defines character properties for all Unicode characters. The data in
this database is based on version 4.1.0 of the \file{UnicodeData.txt}
file, which is publicly available from \url{ftp://ftp.unicode.org/}.

The module uses the same names and symbols as defined by the
UnicodeData File Format 4.1.0 (see
\url{http://www.unicode.org/Public/4.1.0/ucd/UCD.html}). It
defines the following functions:

\begin{funcdesc}{lookup}{name}
Look up a character by name. If a character with the
given name is found, return the corresponding Unicode
character. If not found, \exception{KeyError} is raised.
\end{funcdesc}

\begin{funcdesc}{name}{unichr\optional{, default}}
Returns the name assigned to the Unicode character
\var{unichr} as a string. If no name is defined,
\var{default} is returned, or, if not given,
\exception{ValueError} is raised.
\end{funcdesc}
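The following sketch illustrates how \function{lookup()} and
\function{name()} mirror each other, and how the two error cases
described above can be handled. The variable names and the
\code{'UNNAMED'} default are illustrative choices, not part of the module.

```python
import unicodedata

# Resolve a character from its Unicode name, then map it back to the name.
ch = unicodedata.lookup('LATIN CAPITAL LETTER C WITH CEDILLA')
round_trip = unicodedata.name(ch)

# Control characters such as U+0000 have no name; passing a default
# avoids the ValueError that would otherwise be raised.
no_name = unicodedata.name(u'\u0000', 'UNNAMED')

# A name that is not in the database raises KeyError.
try:
    unicodedata.lookup('NO SUCH CHARACTER NAME')
    found = True
except KeyError:
    found = False
```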

\begin{funcdesc}{decimal}{unichr\optional{, default}}
Returns the decimal value assigned to the Unicode character
\var{unichr} as an integer. If no such value is defined,
\var{default} is returned, or, if not given,
\exception{ValueError} is raised.
\end{funcdesc}

\begin{funcdesc}{digit}{unichr\optional{, default}}
Returns the digit value assigned to the Unicode character
\var{unichr} as an integer. If no such value is defined,
\var{default} is returned, or, if not given,
\exception{ValueError} is raised.
\end{funcdesc}

\begin{funcdesc}{numeric}{unichr\optional{, default}}
Returns the numeric value assigned to the Unicode character
\var{unichr} as a float. If no such value is defined, \var{default} is
returned, or, if not given, \exception{ValueError} is raised.
\end{funcdesc}
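The three numeric properties are not interchangeable: a character may
have a numeric value without a digit value, or a digit value without a
decimal value. The sketch below makes the difference visible by passing
\code{None} as \var{default}; \function{numeric\_properties()} is a
hypothetical helper introduced only for this illustration.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))

nine = numeric_properties(u'9')              # DIGIT NINE: all three defined
superscript = numeric_properties(u'\u00b2')  # SUPERSCRIPT TWO: digit, but no decimal
half = numeric_properties(u'\u00bd')         # VULGAR FRACTION ONE HALF: numeric only
letter = numeric_properties(u'a')            # a letter has none of the three
```

Under these definitions, \code{nine} is \code{(9, 9, 9.0)},
\code{superscript} is \code{(None, 2, 2.0)}, \code{half} is
\code{(None, None, 0.5)} and \code{letter} is \code{(None, None, None)}.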

\begin{funcdesc}{category}{unichr}
Returns the general category assigned to the Unicode character
\var{unichr} as a string.
\end{funcdesc}

\begin{funcdesc}{bidirectional}{unichr}
Returns the bidirectional category assigned to the Unicode character
\var{unichr} as a string. If no such value is defined, an empty string
is returned.
\end{funcdesc}

\begin{funcdesc}{combining}{unichr}
Returns the canonical combining class assigned to the Unicode
character \var{unichr} as an integer. Returns \code{0} if no combining
class is defined.
\end{funcdesc}
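As an illustration, the sketch below queries the three properties for
two combining marks and a base letter. The last statement uses
\function{normalize()} from this module to show one place where the
combining class matters: canonical ordering sorts adjacent marks by
ascending class. The variable names are illustrative only.

```python
import unicodedata

cedilla = u'\u0327'   # COMBINING CEDILLA
acute = u'\u0301'     # COMBINING ACUTE ACCENT

cedilla_category = unicodedata.category(cedilla)    # 'Mn': Mark, nonspacing
cedilla_bidi = unicodedata.bidirectional(cedilla)   # 'NSM': nonspacing mark
cedilla_class = unicodedata.combining(cedilla)      # 202
acute_class = unicodedata.combining(acute)          # 230
base_class = unicodedata.combining(u'C')            # 0: not a combining character

# Canonical ordering places the mark with the lower combining class first,
# so the acute accent (230) is moved behind the cedilla (202).
reordered = unicodedata.normalize('NFD', u'C' + acute + cedilla)
```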

\begin{funcdesc}{east_asian_width}{unichr}
Returns the East Asian width assigned to the Unicode character
\var{unichr} as a string.
\versionadded{2.4}
\end{funcdesc}
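The returned string is one of the short width codes used by the Unicode
database, such as \code{'Na'} (narrow), \code{'W'} (wide), \code{'F'}
(fullwidth) or \code{'H'} (halfwidth). The sketch below shows a few
values; \function{display\_columns()} is a hypothetical helper that
gives only a rough column estimate, assuming wide and fullwidth
characters occupy two columns and everything else occupies one.

```python
import unicodedata

widths = {
    'ascii': unicodedata.east_asian_width(u'A'),           # 'Na'
    'hiragana': unicodedata.east_asian_width(u'\u3042'),   # 'W'
    'fullwidth': unicodedata.east_asian_width(u'\uff21'),  # 'F'
    'halfwidth': unicodedata.east_asian_width(u'\uff71'),  # 'H'
}

def display_columns(text):
    """Estimate terminal columns, counting 'W' and 'F' characters as two.

    This ignores combining marks and control characters, so it is an
    approximation rather than a complete width algorithm.
    """
    return sum(2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
               for c in text)

columns = display_columns(u'A\u3042')   # one narrow plus one wide character
```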

\begin{funcdesc}{mirrored}{unichr}
Returns the mirrored property assigned to the Unicode character
\var{unichr} as an integer. Returns \code{1} if the character has been
identified as a ``mirrored'' character in bidirectional text,
\code{0} otherwise.
\end{funcdesc}

\begin{funcdesc}{decomposition}{unichr}
Returns the character decomposition mapping assigned to the Unicode
character \var{unichr} as a string. An empty string is returned if
no such mapping is defined.
\end{funcdesc}
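A short sketch of both functions follows. The decomposition mapping is
returned as space-separated hexadecimal code points, optionally preceded
by a tag such as \code{<compat>};
\function{decomposition\_code\_points()} is a hypothetical helper that
parses this string and is not part of the module.

```python
import unicodedata

paren_mirrored = unicodedata.mirrored(u'(')    # 1: parentheses are mirrored
letter_mirrored = unicodedata.mirrored(u'a')   # 0

canonical = unicodedata.decomposition(u'\u00c7')   # '0043 0327'
compat = unicodedata.decomposition(u'\u2160')      # '<compat> 0049'
missing = unicodedata.decomposition(u'C')          # '': no mapping defined

def decomposition_code_points(ch):
    """Split the mapping of ch into (tag, list of integer code points).

    The tag is None for a canonical mapping or when no mapping exists.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [int(field, 16) for field in fields]
```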

\begin{funcdesc}{normalize}{form, unistr}

Return the normal form \var{form} for the Unicode string \var{unistr}.
Valid values for \var{form} are \code{'NFC'}, \code{'NFKC'},
\code{'NFD'}, and \code{'NFKD'}.

The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various ways. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.

Beyond these two forms, there are two further normal forms
based on compatibility equivalence. Unicode supports certain characters
that would normally be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).

Normal form KD (NFKD) applies the compatibility decomposition,
i.e. replaces all compatibility characters with their equivalents.
Normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.

\versionadded{2.3}
\end{funcdesc}
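The four forms can be compared on the two characters used as examples
above. This is a minimal sketch; \function{equivalent()} is a
hypothetical helper showing one common use of normalization, namely
comparing strings that may encode the same text differently.

```python
import unicodedata

composed = u'\u00c7'      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'C\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA
roman_one = u'\u2160'     # ROMAN NUMERAL ONE

nfd = unicodedata.normalize('NFD', composed)         # splits into C + cedilla
nfc = unicodedata.normalize('NFC', decomposed)       # recombines into U+00C7
nfc_roman = unicodedata.normalize('NFC', roman_one)  # canonical forms keep U+2160
nfkd_roman = unicodedata.normalize('NFKD', roman_one)  # compatibility maps it to I
nfkc_mixed = unicodedata.normalize('NFKC', roman_one + decomposed)

def equivalent(a, b, form='NFC'):
    """Compare two Unicode strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

With these definitions, \code{equivalent(composed, decomposed)} is true,
while \code{equivalent(roman_one, u'I')} is true only when \var{form} is
\code{'NFKC'} or \code{'NFKD'}.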

In addition, the module exposes the following constant and object:

\begin{datadesc}{unidata_version}
The version of the Unicode database used in this module.

\versionadded{2.3}
\end{datadesc}

\begin{datadesc}{ucd_3_2_0}
This is an object that has the same methods as the entire
module, but uses the Unicode database version 3.2 instead,
for applications that require this specific version of
the Unicode database (such as IDNA).

\versionadded{2.5}
\end{datadesc}
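The sketch below reads both version strings and calls a function through
the \code{ucd_3_2_0} object. The value of \code{current\_version}
depends on the Python release in use, so no particular value is assumed
for it here.

```python
import unicodedata

# Version of the database compiled into this Python build.
current_version = unicodedata.unidata_version

# The frozen 3.2 database offers the same functions as the module itself.
legacy = unicodedata.ucd_3_2_0
legacy_version = legacy.unidata_version
legacy_nfkc = legacy.normalize('NFKC', u'\u2160')   # ROMAN NUMERAL ONE
legacy_category = legacy.category(u'A')
```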

Examples:

\begin{verbatim}
>>> import unicodedata
>>> unicodedata.lookup('LEFT CURLY BRACKET')
u'{'
>>> unicodedata.name(u'/')
'SOLIDUS'
>>> unicodedata.decimal(u'9')
9
>>> unicodedata.decimal(u'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a decimal
>>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
'Lu'
>>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
'AN'
\end{verbatim}