:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database, which defines
character properties for all Unicode characters. The data in this database is
based on version 5.1.0 of the :file:`UnicodeData.txt` file, which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.1.0 (see http://www.unicode.org/Public/5.1.0/ucd/UCD.html). It defines
the following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding Unicode character. If not found, :exc:`KeyError` is
   raised.


.. function:: name(unichr[, default])

   Returns the name assigned to the Unicode character *unichr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.


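As a small sketch of how :func:`lookup` and :func:`name` relate, the two
functions round-trip for any named character, while a control character such as
U+0000 carries no name. The sample characters, the ``'UNNAMED'`` default and the
variable names are illustrative choices, not part of the module:

```python
import unicodedata

# lookup() maps a name to its character; name() maps the character back.
brace = unicodedata.lookup('LEFT CURLY BRACKET')
brace_name = unicodedata.name(brace)

# U+0000 is a control character without a name, so the default is returned
# instead of raising ValueError.
nul_name = unicodedata.name(u'\x00', 'UNNAMED')

# An unknown name raises KeyError.
try:
    unicodedata.lookup('NO SUCH CHARACTER NAME')
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

Passing a *default* to :func:`name` is usually more convenient than catching
:exc:`ValueError` when processing arbitrary input.
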
.. function:: decimal(unichr[, default])

   Returns the decimal value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(unichr[, default])

   Returns the digit value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(unichr[, default])

   Returns the numeric value assigned to the Unicode character *unichr* as a
   float. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


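The three numeric properties are progressively more inclusive, which the
following sketch tries to illustrate. The helper ``numeric_properties`` is a
hypothetical name introduced here, and ``None`` is simply used as the *default*
to mark an undefined value:

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))

# DIGIT NINE defines all three values.
nine = numeric_properties(u'9')

# SUPERSCRIPT TWO is a digit but not a decimal digit.
superscript_two = numeric_properties(u'\u00b2')

# VULGAR FRACTION ONE HALF has only a numeric value.
one_half = numeric_properties(u'\u00bd')
```

With these inputs, ``nine`` is ``(9, 9, 9.0)``, ``superscript_two`` is
``(None, 2, 2.0)`` and ``one_half`` is ``(None, None, 0.5)``.
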
.. function:: category(unichr)

   Returns the general category assigned to the Unicode character *unichr* as
   a string.


.. function:: bidirectional(unichr)

   Returns the bidirectional category assigned to the Unicode character *unichr*
   as a string. If no such value is defined, an empty string is returned.


.. function:: combining(unichr)

   Returns the canonical combining class assigned to the Unicode character
   *unichr* as an integer. Returns ``0`` if no combining class is defined.


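A brief sketch of the three classification functions applied to one combining
mark; the choice of U+0327 and the variable names are illustrative only:

```python
import unicodedata

cedilla = u'\u0327'  # COMBINING CEDILLA

general = unicodedata.category(cedilla)    # 'Mn': Mark, nonspacing
bidi = unicodedata.bidirectional(cedilla)  # 'NSM': nonspacing mark
ccc = unicodedata.combining(cedilla)       # 202: attached below

# An ordinary letter is not a combining character, so its class is 0.
letter_ccc = unicodedata.combining(u'A')
```
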
.. function:: east_asian_width(unichr)

   Returns the East Asian width assigned to the Unicode character *unichr* as
   a string.

   .. versionadded:: 2.4


.. function:: mirrored(unichr)

   Returns the mirrored property assigned to the Unicode character *unichr* as
   an integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


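For illustration, a few sample characters (chosen here, not prescribed by the
module) and the values these two functions report for them:

```python
import unicodedata

# East Asian width: narrow, fullwidth and wide characters.
narrow = unicodedata.east_asian_width(u'A')           # 'Na'
fullwidth = unicodedata.east_asian_width(u'\uff21')   # 'F' (FULLWIDTH LATIN CAPITAL LETTER A)
wide = unicodedata.east_asian_width(u'\u4e2d')        # 'W' (a CJK ideograph)

# Parentheses are mirrored in right-to-left text; letters are not.
paren_mirrored = unicodedata.mirrored(u'(')    # 1
letter_mirrored = unicodedata.mirrored(u'A')   # 0
```
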
.. function:: decomposition(unichr)

   Returns the character decomposition mapping assigned to the Unicode character
   *unichr* as a string. An empty string is returned if no such mapping is
   defined.


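The mapping is returned as space-separated hexadecimal code points, optionally
preceded by a tag such as ``<compat>``. The following sketch splits that string
into its parts; ``parse_decomposition`` is a hypothetical helper written for
this example, not a function of the module:

```python
import unicodedata

def parse_decomposition(ch):
    """Return (tag, code_points) parsed from unicodedata.decomposition(ch).

    tag is None for a canonical mapping or when no mapping is defined.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [int(field, 16) for field in fields]

canonical = parse_decomposition(u'\u00c7')  # LATIN CAPITAL LETTER C WITH CEDILLA
compat = parse_decomposition(u'\u2160')     # ROMAN NUMERAL ONE
unmapped = parse_decomposition(u'A')        # no decomposition defined
```

Here ``canonical`` is ``(None, [0x43, 0x327])``, ``compat`` is
``('<compat>', [0x49])`` and ``unmapped`` is ``(None, [])``.
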
.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example,
   the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be
   expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING
   CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two further normal forms based on
   compatibility equivalence. Unicode supports certain characters that would
   normally be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

   .. versionadded:: 2.3

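The effect of the four forms can be sketched with the two characters mentioned
in the description of :func:`normalize`. The variable names are illustrative;
only :func:`normalize` itself comes from the module:

```python
import unicodedata

composed = u'\u00c7'     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'C\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings render alike but are different strings.
differ_before = composed != decomposed

# NFD decomposes, NFC recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# ROMAN NUMERAL ONE is only a compatibility equivalent of the letter I,
# so NFC leaves it alone while NFKC replaces it.
roman_one = u'\u2160'
roman_nfc = unicodedata.normalize('NFC', roman_one)
roman_nfkc = unicodedata.normalize('NFKC', roman_one)
```

After this runs, ``nfd == decomposed`` and ``nfc == composed`` both hold,
``roman_nfc`` is unchanged, and ``roman_nfkc`` is ``u'I'``. Normalizing both
operands to the same form before comparing is therefore a common way to treat
such spellings as equal.
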
In addition, the module exposes the following constants:


.. data:: unidata_version

   The version of the Unicode database used in this module.

   .. versionadded:: 2.3


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

   .. versionadded:: 2.5

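A short sketch of how the two database versions can be inspected side by side.
The value of ``unidata_version`` depends on the Python build in use, so no
specific value is assumed for it here:

```python
import unicodedata

# Version of the database backing the module-level functions.
current_version = unicodedata.unidata_version

# The frozen 3.2 database exposes the same interface.
legacy = unicodedata.ucd_3_2_0
legacy_version = legacy.unidata_version   # '3.2.0'
legacy_name = legacy.name(u'A')           # 'LATIN CAPITAL LETTER A'
```
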
Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   u'{'
   >>> unicodedata.name(u'/')
   'SOLIDUS'
   >>> unicodedata.decimal(u'9')
   9
   >>> unicodedata.decimal(u'a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   ValueError: not a decimal
   >>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
   'AN'