Marc-Andre Lemburg <mal@lemburg.com>:

Updated to version 1.5. Includes typo fixes by Andrew Kuchling
and a new section on the default encoding.
Marc-André Lemburg 2000-06-08 17:51:33 +00:00
parent 59a044b7d2
commit bfa36f5407
1 changed files with 18 additions and 18 deletions

@@ -19,11 +19,11 @@ due to the many different aspects of the Unicode-Python integration.
 The latest version of this document is always available at:
-http://starship.skyport.net/~lemburg/unicode-proposal.txt
+http://starship.python.net/~lemburg/unicode-proposal.txt
 Older versions are available as:
-http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt
+http://starship.python.net/~lemburg/unicode-proposal-X.X.txt
 Conventions:
@@ -101,7 +101,7 @@ of the source file (e.g. '# source file encoding: latin-1'). If you
 only use 7-bit ASCII then everything is fine and no such notice is
 needed, but if you include Latin-1 characters not defined in ASCII, it
 may well be worthwhile including a hint since people in other
-countries will want to be able to read you source strings too.
+countries will want to be able to read your source strings too.
 Unicode Type Object:
@@ -169,7 +169,7 @@ during coercion of strings to Unicode should not be masked and passed
 through to the user.
 In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
-should be coerced to Unicode before applying the test. Errors occuring
+should be coerced to Unicode before applying the test. Errors occurring
 during coercion (e.g. None in u'abc') should not be masked.
@@ -184,7 +184,7 @@ always coerce to the more precise format, i.e. Unicode objects.
 s + u := unicode(s) + u
 All string methods should delegate the call to an equivalent Unicode
-object method call by converting all envolved strings to Unicode and
+object method call by converting all involved strings to Unicode and
 then applying the arguments to the Unicode method of the same name,
 e.g.
@@ -199,7 +199,7 @@ Formatting Markers.
 Exceptions:
 -----------
-UnicodeError is defined in the exceptions module as subclass of
+UnicodeError is defined in the exceptions module as a subclass of
 ValueError. It is available at the C level via PyExc_UnicodeError.
 All exceptions related to Unicode encoding/decoding should be
 subclasses of UnicodeError.
@@ -268,7 +268,7 @@ Python should provide a few standard codecs for the most relevant
 encodings, e.g.
 'utf-8': 8-bit variable length encoding
-'utf-16': 16-bit variable length encoding (litte/big endian)
+'utf-16': 16-bit variable length encoding (little/big endian)
 'utf-16-le': utf-16 but explicitly little endian
 'utf-16-be': utf-16 but explicitly big endian
 'ascii': 7-bit ASCII codepage
@@ -284,7 +284,7 @@ Note: 'utf-16' should be implemented by using and requiring byte order
 marks (BOM) for file input/output.
 All other encodings such as the CJK ones to support Asian scripts
-should be implemented in seperate packages which do not get included
+should be implemented in separate packages which do not get included
 in the core Python distribution and are not a part of this proposal.
@@ -324,14 +324,14 @@ class Codec:
 """
 def encode(self,input,errors='strict'):
-""" Encodes the object intput and returns a tuple (output
+""" Encodes the object input and returns a tuple (output
 object, length consumed).
 errors defines the error handling to apply. It defaults to
 'strict' handling.
 The method may not store state in the Codec instance. Use
-SteamCodec for codecs which have to keep state in order to
+StreamCodec for codecs which have to keep state in order to
 make encoding/decoding efficient.
 """
@@ -350,7 +350,7 @@ class Codec:
 'strict' handling.
 The method may not store state in the Codec instance. Use
-SteamCodec for codecs which have to keep state in order to
+StreamCodec for codecs which have to keep state in order to
 make encoding/decoding efficient.
 """
@@ -490,7 +490,7 @@ class StreamReader(Codec):
 the line breaking knowledge from the underlying stream's
 .readline() method -- there is currently no support for
 line breaking using the codec decoder due to lack of line
-buffering. Sublcasses should however, if possible, try to
+buffering. Subclasses should however, if possible, try to
 implement this method using their own knowledge of line
 breaking.
@@ -527,7 +527,7 @@ class StreamReader(Codec):
 """ Resets the codec buffers used for keeping state.
 Note that no stream repositioning should take place.
-This method is primarely intended to be able to recover
+This method is primarily intended to be able to recover
 from decoding errors.
 """
@@ -553,7 +553,7 @@ interfaces, though.
 It is not required by the Unicode implementation to use these base
 classes, only the interfaces must match; this allows writing Codecs as
-extensions types.
+extension types.
 As guideline, large mapping tables should be implemented using static
 C data in separate (shared) extension modules. That way multiple
@@ -628,8 +628,8 @@ Private Code Point Areas:
 -------------------------
 Support for these is left to user land Codecs and not explicitly
-intergrated into the core. Note that due to the Internal Format being
-implemented, only the area between \uE000 and \uF8FF is useable for
+integrated into the core. Note that due to the Internal Format being
+implemented, only the area between \uE000 and \uF8FF is usable for
 private encodings.
@@ -649,14 +649,14 @@ provides access to about 64k characters and covers all characters in
 the Basic Multilingual Plane (BMP) of Unicode.
 It is the Codec's responsibility to ensure that the data they pass to
-the Unicode object constructor repects this assumption. The
+the Unicode object constructor respects this assumption. The
 constructor does not check the data for Unicode compliance or use of
 surrogates.
 Future implementations can extend the 32 bit restriction to the full
 set of all UTF-16 addressable characters (around 1M characters).
-The Unicode API should provide inteface routines from <PythonUnicode>
+The Unicode API should provide interface routines from <PythonUnicode>
 to the compiler's wchar_t which can be 16 or 32 bit depending on the
 compiler/libc/platform being used.