mirror of https://github.com/python/cpython
Refactor a bit the codecs doc.
parent
c515eba9ff
commit
59b13f4de9
|
@ -782,27 +782,28 @@ e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
|
||||||
Windows). There's a string constant with 256 characters that shows you which
|
Windows). There's a string constant with 256 characters that shows you which
|
||||||
character is mapped to which byte value.
|
character is mapped to which byte value.
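
As a rough illustration of the 256-character mapping mentioned above, the table can be inspected directly. This is a minimal sketch that assumes the string constant is exposed as the module attribute ``decoding_table`` of ``encodings.cp1252``, as it is in current CPython; the names ``table``, ``euro`` and ``roundtrip`` are only illustrative.

```python
import encodings.cp1252 as cp1252

# Assumption: the 256-character string constant is named decoding_table.
table = cp1252.decoding_table

# One entry per possible byte value (0..255).
assert len(table) == 256

# Indexing the table with a byte value gives the character it maps to.
euro = table[0x80]

# Encoding that character with cp1252 gives back the single byte.
roundtrip = euro.encode("cp1252")
```

Indexing the table by byte value is the decoding direction; the codec derives the reverse mapping from the same table for encoding.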
|
||||||
|
|
||||||
All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
|
All of these encodings can only encode 256 of the 1114112 codepoints
|
||||||
defined in Unicode. A simple and straightforward way that can store each Unicode
|
defined in Unicode. A simple and straightforward way that can store each Unicode
|
||||||
code point, is to store each codepoint as two consecutive bytes. There are two
|
code point, is to store each codepoint as four consecutive bytes. There are two
|
||||||
possibilities: Store the bytes in big endian or in little endian order. These
|
possibilities: store the bytes in big endian or in little endian order. These
|
||||||
two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
|
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
|
||||||
disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
|
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
|
||||||
will always have to swap bytes on encoding and decoding. UTF-16 avoids this
|
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
|
||||||
problem: Bytes will always be in natural endianness. When these bytes are read
|
problem: bytes will always be in natural endianness. When these bytes are read
|
||||||
by a CPU with a different endianness, then bytes have to be swapped though. To
|
by a CPU with a different endianness, then bytes have to be swapped though. To
|
||||||
be able to detect the endianness of a UTF-16 byte sequence, there's the so
|
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
|
||||||
called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
|
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
|
||||||
This character will be prepended to every UTF-16 byte sequence. The byte swapped
|
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
|
||||||
version of this character (``0xFFFE``) is an illegal character that may not
|
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
|
||||||
appear in a Unicode text. So when the first character in an UTF-16 byte sequence
|
illegal character that may not appear in a Unicode text. So when the
|
||||||
|
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
|
||||||
appears to be a ``U+FFFE``, the bytes have to be swapped on decoding.
|
appears to be a ``U+FFFE``, the bytes have to be swapped on decoding.
|
||||||
Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
|
Unfortunately the character ``U+FEFF`` had a second purpose as
|
||||||
a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
|
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
|
||||||
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
|
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
|
||||||
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
|
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
|
||||||
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
|
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
|
||||||
Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
|
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
|
||||||
it's a device to determine the storage layout of the encoded bytes, and vanishes
|
it's a device to determine the storage layout of the encoded bytes, and vanishes
|
||||||
once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
|
once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
|
||||||
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
|
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
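
The byte-order behaviour described above can be observed with the standard codecs. The following is only a sketch using the ``utf-32``, ``utf-32-be`` and ``utf-32-le`` codec names and the BOM constants from the ``codecs`` module; the variable names are illustrative.

```python
import codecs
import sys

text = "A"

# Explicit byte orders: four bytes per code point, no BOM written.
be = text.encode("utf-32-be")
le = text.encode("utf-32-le")

# The generic codec uses the machine's native byte order and prepends a BOM.
native = text.encode("utf-32")
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF32_LE
else:
    native_bom = codecs.BOM_UTF32_BE

# With the generic codec, a leading BOM selects the byte order on decoding
# and vanishes from the resulting string.
decoded = (codecs.BOM_UTF32_BE + be).decode("utf-32")

# With an explicit byte order, U+FEFF is decoded like any other character.
kept = (codecs.BOM_UTF32_BE + be).decode("utf-32-be")
```

Here ``decoded`` is just ``"A"``, while ``kept`` starts with ``U+FEFF``, which matches the two roles of the character described in the text.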
|
||||||
|
@ -810,7 +811,7 @@ NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
|
||||||
There's another encoding that is able to encode the full range of Unicode
|
There's another encoding that is able to encode the full range of Unicode
|
||||||
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
|
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
|
||||||
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
|
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
|
||||||
parts: Marker bits (the most significant bits) and payload bits. The marker bits
|
parts: marker bits (the most significant bits) and payload bits. The marker bits
|
||||||
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
|
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
|
||||||
encoded like this (with x being payload bits, which when concatenated give the
|
encoded like this (with x being payload bits, which when concatenated give the
|
||||||
Unicode character):
|
Unicode character):
|
||||||
|
@ -849,13 +850,14 @@ map to
|
||||||
| RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
|
| RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
|
||||||
| INVERTED QUESTION MARK
|
| INVERTED QUESTION MARK
|
||||||
|
|
||||||
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
|
in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
|
||||||
correctly guessed from the byte sequence. So here the BOM is not used to be able
|
correctly guessed from the byte sequence. So here the BOM is not used to be able
|
||||||
to determine the byte order used for generating the byte sequence, but as a
|
to determine the byte order used for generating the byte sequence, but as a
|
||||||
signature that helps in guessing the encoding. On encoding the utf-8-sig codec
|
signature that helps in guessing the encoding. On encoding the utf-8-sig codec
|
||||||
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
|
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
|
||||||
decoding utf-8-sig will skip those three bytes if they appear as the first three
|
decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
|
||||||
bytes in the file.
|
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
|
||||||
|
should generally be avoided.
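
A short sketch of the signature handling described above, using only the built-in ``utf-8-sig`` and ``utf-8`` codecs; the variable names are illustrative.

```python
import codecs

# Encoding with utf-8-sig writes the three signature bytes first.
data = "hello".encode("utf-8-sig")
signature = data[:3]

# Decoding with utf-8-sig skips the signature if it is present ...
with_sig = data.decode("utf-8-sig")

# ... and behaves like plain UTF-8 if it is absent.
without_sig = b"hello".decode("utf-8-sig")

# Decoding the same bytes as plain UTF-8 keeps U+FEFF as the first character.
plain = data.decode("utf-8")
```

Decoding with ``utf-8-sig`` is therefore a reasonable choice when input may or may not carry the signature, while plain ``utf-8`` leaves a stray ``U+FEFF`` at the start.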
|
||||||
|
|
||||||
|
|
||||||
.. _standard-encodings:
|
.. _standard-encodings:
|
||||||
|
|