Refactor the codecs doc a bit.
parent 963004d1d8
commit fbb39815d4
@@ -810,27 +810,28 @@ e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

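That mapping can be inspected directly (a quick sketch; ``decoding_table`` is
the constant used by CPython's generated charmap codecs, so treat the name as
an implementation detail)::

   >>> import encodings.cp1252
   >>> len(encodings.cp1252.decoding_table)    # one character per byte value
   256
   >>> encodings.cp1252.decoding_table[0x9f]   # byte 0x9f maps to this character
   'Ÿ'
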
All of these encodings can only encode 256 of the 1114112 codepoints
defined in Unicode. A simple and straightforward way to store each Unicode
code point is to store each codepoint as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if you use ``UTF-32-BE`` on a little endian machine, for
example, you will always have to swap bytes on encoding and decoding.
``UTF-32`` avoids this problem: bytes will always be in natural endianness.
When these bytes are read by a CPU with a different endianness, however, the
bytes have to be swapped.

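The difference is easy to see by encoding the same character with both codecs
(a doctest-style sketch; the byte order is fixed by the codec, not by the
machine)::

   >>> "a".encode("utf-32-be")   # most significant byte first
   b'\x00\x00\x00a'
   >>> "a".encode("utf-32-le")   # least significant byte first
   b'a\x00\x00\x00'
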
To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte
sequence, there's the so-called BOM ("Byte Order Mark"). This is the Unicode
character ``U+FEFF``, which can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the first
character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be a
``U+FFFE``, the bytes have to be swapped on decoding.

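The :mod:`codecs` module exposes the BOM as constants, and the
endianness-agnostic codec prepends it automatically (a sketch; the ``utf-16``
bytes shown assume a little endian machine)::

   >>> import codecs
   >>> codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE
   (b'\xfe\xff', b'\xff\xfe')
   >>> "a".encode("utf-16")                 # BOM first, then the data
   b'\xff\xfea\x00'
   >>> b'\xff\xfea\x00'.decode("utf-16")    # the BOM is consumed on decoding
   'a'
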
Unfortunately the character ``U+FEFF`` had a second purpose as a ``ZERO WIDTH
NO-BREAK SPACE``: a character that has no width and doesn't allow a word to be
split. It can e.g. be used to give hints to a ligature algorithm. With Unicode
4.0 the use of ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role).
Nevertheless, Unicode software still must be able to handle ``U+FEFF`` in both
roles: as a BOM it's a device to determine the storage layout of the encoded
bytes, and vanishes once the byte sequence has been decoded into a string; as
a ``ZERO WIDTH NO-BREAK SPACE`` it's a normal character that will be decoded
like any other.

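Both roles can show up in a single byte sequence (a sketch; only the leading
``U+FEFF`` acts as a BOM, a second one is ordinary character data)::

   >>> b'\xff\xfea\x00\xff\xfe'.decode("utf-16")
   'a\ufeff'
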
@@ -838,7 +839,7 @@ NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker
bits are a sequence of zero to four ``1`` bits followed by a ``0`` bit.
Unicode characters are encoded like this (with x being payload bits, which
when concatenated give the Unicode character):

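As a concrete instance of the scheme (a sketch; ``U+00E9`` needs two bytes of
the form ``110xxxxx 10xxxxxx``, and the payload bits concatenate to
``0b11101001``, i.e. ``0xE9``)::

   >>> "é".encode("utf-8")
   b'\xc3\xa9'
   >>> bin(0xc3), bin(0xa9)   # marker bits 110 and 10 precede the payload
   ('0b11000011', '0b10101001')
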
@@ -877,13 +878,14 @@ map to
| RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
| INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding
can be correctly guessed from the byte sequence. So here the BOM is not used
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig``
codec will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the
file. On decoding, ``utf-8-sig`` will skip those three bytes if they appear as
the first three bytes in the file. In UTF-8, the use of the BOM is discouraged
and should generally be avoided.

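Both directions are easy to check (a doctest-style sketch)::

   >>> "abc".encode("utf-8-sig")                # the BOM bytes come first
   b'\xef\xbb\xbfabc'
   >>> b'\xef\xbb\xbfabc'.decode("utf-8-sig")   # and are skipped on decoding
   'abc'
   >>> "abc".encode("utf-8")                    # plain utf-8 writes no BOM
   b'abc'
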
.. _standard-encodings: