Closes #23181: codepoint -> code point
parent 1a8ada89f9
commit 3be472b5f7
|
@ -1141,7 +1141,7 @@ These are the UTF-32 codec APIs:
|
|||
mark (U+FEFF). In the other two modes, no BOM mark is prepended.
|
||||
|
||||
If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
|
||||
-as a single codepoint.
+as a single code point.
|
||||
|
||||
Return *NULL* if an exception was raised by the codec.
|
||||
|
||||
|
|
|
@ -841,7 +841,7 @@ methods and attributes from the underlying stream.
|
|||
Encodings and Unicode
|
||||
---------------------
|
||||
|
||||
-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
|
||||
range ``0x0``-``0x10FFFF``. (See :pep:`393` for
|
||||
more details about the implementation.)
|
||||
Once a string object is used outside of CPU and memory, endianness
|
||||
|
@ -852,23 +852,23 @@ There are a variety of different text serialisation codecs, which are
|
|||
collectively referred to as :term:`text encodings <text encoding>`.
|
||||
|
||||
The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
|
||||
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
|
||||
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
|
||||
like the following (although the details of the error message may differ):
|
||||
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
|
||||
position 3: ordinal not in range(256)``.
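As a small sketch of the behaviour described above (the sample strings are illustrative assumptions, not part of the original text), code points 0-255 encode to single bytes under ``'latin-1'``, while a code point above ``U+00FF`` raises :exc:`UnicodeEncodeError`:

```python
# Code points 0-255 map one-to-one onto byte values under latin-1.
encoded = "caf\u00e9".encode("latin-1")

# A code point above U+00FF cannot be represented, so encoding fails.
try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    # Keep the codec name, the failing position and the reason for inspection.
    failure = (exc.encoding, exc.start, exc.reason)
```

The exception attributes carry the same information as the message quoted above; the exact message wording may still differ between versions.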
|
||||
|
||||
There's another group of encodings (the so-called charmap encodings) that choose
|
||||
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
|
||||
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
|
||||
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
|
||||
Windows). There's a string constant with 256 characters that shows you which
|
||||
character is mapped to which byte value.
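The mapping can also be inspected from Python rather than by opening the file. The following is a minimal sketch that assumes the ``decoding_table`` constant of the ``encodings.cp1252`` module is the 256-character string mentioned above; the euro sign is only an illustrative choice:

```python
import encodings.cp1252 as cp1252

# decoding_table is a 256-character string: index i holds the character
# that byte value i decodes to.
table = cp1252.decoding_table

# EURO SIGN (U+20AC) lies outside 0-255 but still fits in one cp1252 byte.
euro_byte = "\u20ac".encode("cp1252")
euro_char = table[euro_byte[0]]
```

The same character cannot be encoded with ``'latin-1'``, which illustrates that each charmap encoding picks its own subset of code points.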
|
||||
|
||||
-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
|
||||
defined in Unicode. A simple and straightforward way that can store each Unicode
|
||||
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
|
||||
possibilities: store the bytes in big endian or in little endian order. These
|
||||
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
|
||||
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
|
||||
|
|
|
@ -194,7 +194,7 @@ Here are the classes:
|
|||
minor type and defaults to :mimetype:`plain`. *_charset* is the character
|
||||
set of the text and is passed as an argument to the
|
||||
:class:`~email.mime.nonmultipart.MIMENonMultipart` constructor; it defaults
|
||||
-to ``us-ascii`` if the string contains only ``ascii`` codepoints, and
+to ``us-ascii`` if the string contains only ``ascii`` code points, and
|
||||
``utf-8`` otherwise. The *_charset* parameter accepts either a string or a
|
||||
:class:`~email.charset.Charset` instance.
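A short sketch of the default described above, assuming a Python version that includes this behaviour; the message bodies are made up for illustration:

```python
from email.mime.text import MIMEText

# Only ASCII code points: the charset defaults to us-ascii.
ascii_msg = MIMEText("hello")

# A non-ASCII code point (U+00E9) switches the default to utf-8.
unicode_msg = MIMEText("h\u00e9llo")

ascii_charset = ascii_msg.get_content_charset()
unicode_charset = unicode_msg.get_content_charset()
```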
|
||||
|
||||
|
|
|
@ -156,7 +156,7 @@ are always available. They are listed here in alphabetical order.
|
|||
|
||||
.. function:: chr(i)
|
||||
|
||||
-Return the string representing a character whose Unicode codepoint is the
+Return the string representing a character whose Unicode code point is the
|
||||
integer *i*. For example, ``chr(97)`` returns the string ``'a'``, while
|
||||
``chr(931)`` returns the string ``'Σ'``. This is the inverse of :func:`ord`.
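The two examples above, together with the round trip through :func:`ord`, can be checked with a few lines (the variable names are illustrative only):

```python
letter = chr(97)     # LATIN SMALL LETTER A
sigma = chr(931)     # GREEK CAPITAL LETTER SIGMA

# ord() maps a one-character string back to its code point.
round_trip = ord(chr(0x10FFFF))

# Integers outside the Unicode range are rejected.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```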
|
||||
|
||||
|
|
|
@ -33,12 +33,12 @@ This module defines four dictionaries, :data:`html5`,
|
|||
|
||||
.. data:: name2codepoint
|
||||
|
||||
-A dictionary that maps HTML entity names to the Unicode codepoints.
+A dictionary that maps HTML entity names to the Unicode code points.
|
||||
|
||||
|
||||
.. data:: codepoint2name
|
||||
|
||||
-A dictionary that maps Unicode codepoints to HTML entity names.
+A dictionary that maps Unicode code points to HTML entity names.
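As a brief illustration, the two dictionaries can be used as inverse lookups. The ``amp`` entity is just a sample key chosen for this sketch:

```python
from html.entities import codepoint2name, name2codepoint

# Entity name -> code point (an integer).
amp_code_point = name2codepoint["amp"]

# Code point -> entity name.
amp_name = codepoint2name[amp_code_point]

# chr() turns the code point into the character itself.
amp_char = chr(amp_code_point)
```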
|
||||
|
||||
|
||||
.. rubric:: Footnotes
|
||||
|
|
|
@ -685,7 +685,7 @@ the same type, the lexicographical comparison is carried out recursively. If
|
|||
all items of two sequences compare equal, the sequences are considered equal.
|
||||
If one sequence is an initial sub-sequence of the other, the shorter sequence is
|
||||
the smaller (lesser) one. Lexicographical ordering for strings uses the Unicode
|
||||
-codepoint number to order individual characters. Some examples of comparisons
+code point number to order individual characters. Some examples of comparisons
|
||||
between sequences of the same type::
|
||||
|
||||
(1, 2, 3) < (1, 2, 4)
|
||||
|
|
|
@ -228,7 +228,7 @@ Functionality
|
|||
|
||||
Changes introduced by :pep:`393` are the following:
|
||||
|
||||
-* Python now always supports the full range of Unicode codepoints, including
+* Python now always supports the full range of Unicode code points, including
|
||||
non-BMP ones (i.e. from ``U+0000`` to ``U+10FFFF``). The distinction between
|
||||
narrow and wide builds no longer exists and Python now behaves like a wide
|
||||
build, even under Windows.
|
||||
|
@ -246,7 +246,7 @@ Changes introduced by :pep:`393` are the following:
|
|||
so ``'\U0010FFFF'[0]`` now returns ``'\U0010FFFF'`` and not ``'\uDBFF'``;
|
||||
|
||||
* all other functions in the standard library now correctly handle
|
||||
-non-BMP codepoints.
+non-BMP code points.
|
||||
|
||||
* The value of :data:`sys.maxunicode` is now always ``1114111`` (``0x10FFFF``
|
||||
in hexadecimal). The :c:func:`PyUnicode_GetMax` function still returns
|
||||
|
@ -258,13 +258,13 @@ Changes introduced by :pep:`393` are the following:
|
|||
Performance and resource usage
|
||||
------------------------------
|
||||
|
||||
-The storage of Unicode strings now depends on the highest codepoint in the string:
+The storage of Unicode strings now depends on the highest code point in the string:
|
||||
|
||||
-* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per codepoint;
+* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per code point;
|
||||
|
||||
-* BMP strings (``U+0000-U+FFFF``) use 2 bytes per codepoint;
+* BMP strings (``U+0000-U+FFFF``) use 2 bytes per code point;
|
||||
|
||||
-* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per codepoint.
+* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per code point.
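The three storage widths listed above can be observed with a rough measurement. This sketch assumes CPython 3.3 or later, where :func:`sys.getsizeof` reflects the PEP 393 layout; the helper ``bytes_per_code_point`` is a hypothetical name introduced here, and other implementations may report different numbers:

```python
import sys


def bytes_per_code_point(ch, n=100):
    """Estimate per-code-point storage for strings made only of ``ch``."""
    # Comparing two lengths of the same kind of string cancels out the
    # fixed object header, leaving only the per-code-point cost.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


ascii_width = bytes_per_code_point("a")            # within U+0000-U+00FF
bmp_width = bytes_per_code_point("\u0394")         # within U+0100-U+FFFF
non_bmp_width = bytes_per_code_point("\U0001F600")  # above U+FFFF
```

Because the width is chosen from the highest code point in the string, a single non-BMP character makes the whole string use the widest representation.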
|
||||
|
||||
The net effect is that for most applications, memory usage of string
|
||||
storage should decrease significantly - especially compared to former
|
||||
|
|