From ab01087109046dc7fea5c62529a6e4e79847036f Mon Sep 17 00:00:00 2001
From: "Andrew M. Kuchling"
Date: Thu, 19 Jul 2001 14:59:53 +0000
Subject: [PATCH] Revise the Unicode section after getting comments from MAL, GvR, and others. Add new low-level API for interpreter introspection Bump version number.

---
 Doc/whatsnew/whatsnew22.tex | 72 ++++++++++++++++++++++++-------------
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/Doc/whatsnew/whatsnew22.tex b/Doc/whatsnew/whatsnew22.tex
index 96b0972ae13..431e269c4fb 100644
--- a/Doc/whatsnew/whatsnew22.tex
+++ b/Doc/whatsnew/whatsnew22.tex
@@ -3,7 +3,7 @@
 % $Id$
 
 \title{What's New in Python 2.2}
-\release{0.03}
+\release{0.04}
 \author{A.M. Kuchling}
 \authoraddress{\email{akuchlin@mems-exchange.org}}
 \begin{document}
@@ -339,32 +339,46 @@ and Tim Peters, with other fixes from the Python Labs crew.}
 \section{Unicode Changes}
 
 Python's Unicode support has been enhanced a bit in 2.2. Unicode
-strings are usually stored as UCS-2, as 16-bit unsigned integers.
+strings are usually stored as UTF-16, as 16-bit unsigned integers.
 Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
 integers, as its internal encoding by supplying
 \longprogramopt{enable-unicode=ucs4} to the configure script. When
-built to use UCS-4, in theory Python could handle Unicode characters
-from U-00000000 to U-7FFFFFFF. Being able to use UCS-4 internally is
-a necessary step to do that, but it's not the only step, and in Python
-2.2alpha1 the work isn't complete yet. For example, the
-\function{unichr()} function still only accepts values from 0 to
-65535, and there's no \code{\e U} notation for embedding characters
-greater than 65535 in a Unicode string literal. All this is the
-province of the still-unimplemented PEP 261, ``Support for `wide'
-Unicode characters''; consult it for further details, and please offer
-comments and suggestions on the proposal it describes.
+built to use UCS-4 (a ``wide Python''), the interpreter can natively
+handle Unicode characters from U+000000 to U+110000. The range of
+legal values for the \function{unichr()} function has been expanded;
+it used to only accept values up to 65535, but in 2.2 will accept
+values from 0 to 0x110000. Using a ``narrow Python'', an interpreter
+compiled to use UTF-16, values greater than 65535 will result in
+\function{unichr()} returning a string of length 2:
 
-Another change is much simpler to explain.
-Since their introduction, Unicode strings have supported an
-\method{encode()} method to convert the string to a selected encoding
-such as UTF-8 or Latin-1. A symmetric
-\method{decode(\optional{\var{encoding}})} method has been added to
-both 8-bit and Unicode strings in 2.2, which assumes that the string
-is in the specified encoding and decodes it. This means that
-\method{encode()} and \method{decode()} can be called on both types of
-strings, and can be used for tasks not directly related to Unicode.
-For example, codecs have been added for UUencoding, MIME's base-64
-encoding, and compression with the \module{zlib} module.
+\begin{verbatim}
+>>> s = unichr(65536)
+>>> s
+u'\ud800\udc00'
+>>> len(s)
+2
+\end{verbatim}
+
+This possibly-confusing behaviour, breaking the intuitive invariant
+that \function{chr()} and \function{unichr()} always return strings of
+length 1, may be changed later in 2.2 depending on public reaction.
+
+All this is the province of the still-unimplemented PEP 261, ``Support
+for `wide' Unicode characters''; consult it for further details, and
+please offer comments and suggestions on the proposal it describes.
+
+Another change is much simpler to explain. Since their introduction,
+Unicode strings have supported an \method{encode()} method to convert
+the string to a selected encoding such as UTF-8 or Latin-1. A
+symmetric \method{decode(\optional{\var{encoding}})} method has been
+added to 8-bit strings (though not to Unicode strings) in 2.2.
+\method{decode()} assumes that the string is in the specified encoding
+and decodes it, returning whatever is returned by the codec.
+
+Using this new feature, codecs have been added for tasks not directly
+related to Unicode. For example, codecs have been added for
+uu-encoding, MIME's base64 encoding, and compression with the
+\module{zlib} module:
 
 \begin{verbatim}
 >>> s = """Here is a lengthy piece of redundant, overly verbose,
@@ -610,6 +624,15 @@ changes are:
   been changed to use the new C-level interface. (Contributed by Fred
   L. Drake, Jr.)
 
+  \item Another low-level API, primarily of interest to implementors
+  of Python debuggers and development tools, was added.
+  \cfunction{PyInterpreterState_Head()} and
+  \cfunction{PyInterpreterState_Next()} let a caller walk through all
+  the existing interpreter objects;
+  \cfunction{PyInterpreterState_ThreadHead()} and
+  \cfunction{PyThreadState_Next()} allow looping over all the thread
+  states for a given interpreter. (Contributed by David Beazley.)
+
 % XXX is this explanation correct?
   \item When presented with a Unicode filename on Windows, Python will
   now correctly convert it to a string using the MBCS encoding.
@@ -668,6 +691,7 @@ changes are:
 
 The author would like to thank the following people for offering
 suggestions and corrections to various drafts of this article: Fred
-Bremmer, Fred L. Drake, Jr., Tim Peters, Neil Schemenauer.
+Bremmer, Fred L. Drake, Jr., Marc-Andr\'e Lemburg,
+Tim Peters, Neil Schemenauer, Guido van Rossum.
 
 \end{document}
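
Note on the narrow-build example in the Unicode hunk: the length-2 result of unichr(65536) is the UTF-16 surrogate pair for that code point. The arithmetic can be sketched as below, written for a modern Python interpreter rather than 2.2. The helper name to_utf16_units is hypothetical, and the upper bound 0x10FFFF is the Unicode Standard's limit, assumed here in place of the 0x110000 quoted in the patch text.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) encoding one code point.

    Hypothetical helper for illustration; not part of the patch.
    """
    # Assumption: 0x10FFFF is the highest code point UTF-16 can encode.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (code_point,))
    if code_point < 0x10000:
        # Characters up to 65535 fit in a single 16-bit unit.
        return [code_point]
    # Larger values are split into a 20-bit offset: the top 10 bits go
    # into the high surrogate, the bottom 10 bits into the low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


units = to_utf16_units(65536)
print(["0x%04x" % u for u in units])  # ['0xd800', '0xdc00']
```

The two units printed for 65536 correspond to the u'\ud800\udc00' shown in the verbatim example, which is why len(s) is 2 on a narrow build.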