Refactor a bit the codecs doc.

This commit is contained in:
Ezio Melotti 2011-10-25 10:46:22 +03:00
parent c515eba9ff
commit 59b13f4de9
1 changed file with 21 additions and 19 deletions

@@ -782,27 +782,28 @@
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
-All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
-defined in unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as two consecutive bytes. There are two
-possibilities: Store the bytes in big endian or in little endian order. These
-two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
-disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
-will always have to swap bytes on encoding and decoding. UTF-16 avoids this
-problem: Bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a UTF-16 byte sequence, there's the so
-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
-This character will be prepended to every UTF-16 byte sequence. The byte swapped
-version of this character (``0xFFFE``) is an illegal character that may not
-appear in a Unicode text. So when the first character in an UTF-16 byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
-a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+All of these encodings can only encode 256 of the 1114112 codepoints
+defined in unicode. A simple and straightforward way that can store each Unicode
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
+by a CPU with a different endianness, then bytes have to be swapped though. To
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
+appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
-Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
 once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
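The byte-order and BOM behavior described in this hunk can be observed directly with Python's built-in UTF-32 codecs. This sketch is an editorial illustration, not part of the patch:

```python
# The explicit-endianness codecs never write a BOM; they only fix the
# byte order of each 4-byte code point.
assert "A".encode("utf-32-be") == b"\x00\x00\x00A"
assert "A".encode("utf-32-le") == b"A\x00\x00\x00"

# The plain utf-32 codec prepends the BOM (U+FEFF) in the machine's
# native byte order, so a decoder can detect the layout.
data = "A".encode("utf-32")
assert data[:4] in (b"\xff\xfe\x00\x00",   # little endian BOM
                    b"\x00\x00\xfe\xff")   # big endian BOM

# On decoding, the BOM is consumed and vanishes from the string.
assert data.decode("utf-32") == "A"
```

The same pattern holds for the `utf-16`, `utf-16-be`, and `utf-16-le` codecs, with 2-byte units instead of 4.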
@@ -810,7 +811,7 @@ NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
 There's another encoding that is able to encode the full range of Unicode
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
-parts: Marker bits (the most significant bits) and payload bits. The marker bits
+parts: marker bits (the most significant bits) and payload bits. The marker bits
 are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
@@ -849,13 +850,14 @@ map to
 | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 | INVERTED QUESTION MARK
-in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
 correctly guessed from the byte sequence. So here the BOM is not used to be able
 to determine the byte order used for generating the byte sequence, but as a
 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
-decoding utf-8-sig will skip those three bytes if they appear as the first three
-bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file. In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.

 .. _standard-encodings:
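The signature behavior this hunk documents can be demonstrated in a few lines. This is an editorial illustration, not part of the patch:

```python
# Encoding with utf-8-sig prepends the UTF-8-encoded BOM as a signature.
data = "hello".encode("utf-8-sig")
assert data[:3] == b"\xef\xbb\xbf"
assert data == b"\xef\xbb\xbfhello"

# Decoding with utf-8-sig skips the three signature bytes...
assert data.decode("utf-8-sig") == "hello"
# ...while plain utf-8 keeps them, decoded as the character U+FEFF.
assert data.decode("utf-8") == "\ufeffhello"
```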