cpython/Lib/encodings/__init__.py

""" Standard "encodings" Package

    Standard Python encoding modules are stored in this package
    directory.

    Codec modules must have names corresponding to standard lower-case
    encoding names with hyphens mapped to underscores, e.g. 'utf-8' is
    implemented by the module 'utf_8.py'.

    Each codec module must export the following interface:

    * getregentry() -> (encoder, decoder, stream_reader, stream_writer)
    The getregentry() API must return callable objects which adhere to
    the Python Codec Interface Standard.

    In addition, a module may optionally also define the following
    APIs which are then used by the package's codec search function:

    * getaliases() -> sequence of encoding name strings to use as aliases

    Alias names returned by getaliases() must be standard encoding
    names as defined above (lower-case, hyphens converted to
    underscores).

Written by Marc-Andre Lemburg (mal@lemburg.com).

(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.

"""#"

import codecs,aliases,exceptions

_cache = {}
_unknown = '--unknown--'

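class CodecRegistryError(exceptions.LookupError,
                         exceptions.SystemError):
    """ Raised when a codec module fails to supply a valid registry
        entry.

        Deriving from both LookupError and SystemError lets callers
        catch the error as either type.
    """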

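def search_function(encoding):
    """ Return the codec registry entry for encoding, or None if no
        codec module can be imported for it.

        The entry is the 4-tuple (encoder, decoder, stream_reader,
        stream_writer) returned by the module's getregentry().
        Results, including misses, are cached in _cache.
    """
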
    # Cache lookup
    entry = _cache.get(encoding,_unknown)
    if entry is not _unknown:
        return entry

    # Import the module
    modname = encoding.replace('-', '_')
    modname = aliases.aliases.get(modname,modname)
    try:
        mod = __import__(modname,globals(),locals(),'*')
    except ImportError:
        # Cache the miss so that later lookups skip the failed import
        _cache[encoding] = None
        return None
    
    # Now ask the module for the registry entry
    try:
        entry = tuple(mod.getregentry())
    except AttributeError:
        entry = ()
    if len(entry) != 4:
        raise CodecRegistryError,\
              'module "%s" (%s) failed to register' % \
              (mod.__name__, mod.__file__)
    for obj in entry:
        if not callable(obj):
            raise CodecRegistryError,\
                  'incompatible codecs in module "%s" (%s)' % \
                  (mod.__name__, mod.__file__)

    # Cache the codec registry entry
    _cache[encoding] = entry

    # Register its aliases (without overwriting previously registered
    # aliases)
    try:
        codecaliases = mod.getaliases()
    except AttributeError:
        pass
    else:
        for alias in codecaliases:
            if not aliases.aliases.has_key(alias):
                aliases.aliases[alias] = modname

    # Return the registry entry
    return entry

# Register the search_function in the Python codec registry
codecs.register(search_function)

# Revision history (from the blame annotations for this file)
#
# 2000-03-10 19:17:24 -04:00
#     Marc-Andre Lemburg: Unicode encodings.
#     Lines: initial version; every line not attributed below.
#
# 2000-03-20 12:36:48 -04:00
#     On 17-Mar-2000, Marc-Andre Lemburg said: Attached you find an
#     update of the Unicode implementation. The patch is against the
#     current CVS version. I would appreciate if someone with CVS
#     checkin permissions could check the changes in. The patch
#     contains all bugs and patches sent this week and also fixes a
#     leak in the codecs code and a bug in the free list code for
#     Unicode objects (which only shows up when compiling Python with
#     Py_DEBUG; thanks to MarkH for spotting this one).
#     Lines: the _unknown sentinel and the cache lookup that uses it.
#
# 2000-04-05 17:11:21 -03:00
#     Marc-Andre's third try at this bulk patch seems to work (except
#     that his copy of test_contains.py seems to be broken -- the lines
#     he deleted were already absent). Checkin messages:
#
#     New Unicode support for int(), float(), complex() and long().
#     - new APIs PyInt_FromUnicode() and PyLong_FromUnicode()
#     - added support for Unicode to PyFloat_FromString()
#     - new encoding API PyUnicode_EncodeDecimal() which converts
#       Unicode to a decimal char* string (used in the above new APIs)
#     - shortcuts for calls like int(<int object>) and
#       float(<float obj>)
#     - tests for all of the above
#
#     Unicode compares and contains checks:
#     - comparing Unicode and non-string types now works; TypeErrors
#       are masked, all other errors such as ValueError during Unicode
#       coercion are passed through (note that PyUnicode_Compare does
#       not implement the masking -- PyObject_Compare does this)
#     - contains now works for non-string types too; TypeErrors are
#       masked and 0 returned; all other errors are passed through
#
#     Better testing support for the standard codecs.
#
#     Misc minor enhancements, such as an alias dbcs for the mbcs
#     codec.
#
#     Changes:
#     - PyLong_FromString() now applies the same error checks as does
#       PyInt_FromString(): trailing garbage is reported as error and
#       not longer silently ignored. The only characters which may be
#       trailing the digits are 'L' and 'l' -- these are still silently
#       ignored.
#     - string.ato?() now directly interface to int(), long() and
#       float(). The error strings are now a little different, but the
#       type still remains the same. These functions are now ready to
#       get declared obsolete ;-)
#     - PyNumber_Int() now also does a check for embedded NULL chars in
#       the input string; PyNumber_Long() already did this (and still
#       does)
#
#     Followed by: Looks like I've gone a step too far there... (and
#     test_contains.py seem to have a bug too). I've changed back to
#     reporting all errors in PyUnicode_Contains() and added a few more
#     test cases to test_contains.py (plus corrected the join()
#     NameError).
#     Lines: the docstring sentence on hyphens mapped to underscores.
#
# 2000-06-13 09:04:05 -03:00
#     Marc-Andre Lemburg <mal@lemburg.com>: Removed import of string
#     module -- use string methods directly. Thanks to Finn Bock.
#     Lines: modname = encoding.replace('-', '_')
#
# 2000-12-12 10:45:35 -04:00
#     Changed .getaliases() support to register the new aliases in the
#     encodings package aliases mapping dictionary rather than in the
#     internal cache used by the search function. This enables aliases
#     to take advantage of the full normalization process applied to
#     encoding names which was previously not available. The patch
#     restricts alias registration to new aliases. Existing aliases
#     cannot be overridden anymore.
#     Lines: the docstring paragraph on getaliases() names, the
#     "cache misses" and "Cache the codec registry entry" comments,
#     the "Register its aliases" comment, and the has_key() check with
#     the alias assignment.
#
# 2001-09-19 08:52:07 -03:00
#     Fixed search function error reporting in the encodings package
#     __init__.py module to raise errors which can be catched as
#     LookupErrors as well as SystemErrors. Modified the error messages
#     to include more information about the failing module.
#     Lines: the import statement, the CodecRegistryError class, and
#     both raise statements in search_function().