I banged on the code (beyond what's in that branch) to make fewer tests fail;
the only tests that fail now are:
test_descr -- can't pickle ints?!
test_pickletools -- ???
test_socket -- See python.org/sf/1619659
test_sqlite -- ???
I'll deal with those later.
Not all code has been fixed yet; this is just a checkpoint...
The C API still has PyDict_HasKey() and _HasKeyString(); not sure
if I want to change those just yet.
the optional proto 2 slot state.
pickle.py, load_build(): CAUTION: Noted that cPickle's
load_build and pickle's load_build really don't do the same
things with the state, and didn't before this patch either.
cPickle never tries to do .update(), and has no backoff if
instance.__dict__ can't be retrieved. There are no tests
that can tell the difference, and part of what cPickle's
load_build() did looked accidental to me, so I don't know
what the true intent is here.
pickletester.py, test_pickle.py: Got rid of the hack for
exempting cPickle from running some of the proto 2 tests.
dictobject.c, PyDict_Next(): documented intended use.
this clarifies that they are part of an internal API (albeit shared
between pickle.py, copy_reg.py and cPickle.c).
I'd like to do the same for copy_reg.dispatch_table, but worry that it
might be used by existing code. This risk doesn't exist for the
extension registry.
outcome as __slotnames__ on the class. (Like __slots__, it's not safe
to ask for this as an attribute -- you must look for it in the
specific class's __dict__. But it must be set using attribute
notation, because __dict__ is a read-only proxy.)
Assorted code cleanups; e.g., sizeof(char) is 1 by definition, so there's
no need to do things like multiply by sizeof(char) in hairy malloc
arguments. Fixed an undetected-overflow bug in readline_file().
longobject.c: Fixed a really stupid bug in the new _PyLong_NumBits.
pickle.py: Fixed stupid bug in save_long(): When proto is 2, it
wrote LONG1 or LONG4, but forgot to return then -- it went on to
append the proto 1 LONG opcode too.
Fixed equally stupid cancelling bugs in load_long1() and
load_long4(): they *returned* the unpickled long instead of pushing
it on the stack. The return values were ignored. Tests passed
before only because save_long() pickled the long twice.
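A minimal way to check the save_long() fix is to list the opcodes in a proto 2 pickle and confirm that only one long opcode is written. This sketch assumes a current Python 3 interpreter, where int plays the role of the old long type; the variable names are illustrative only.

```python
import pickle
import pickletools

# Assumption: Python 3, where int replaces long. 2**70 is too wide for BININT.
big = 2 ** 70
data = pickle.dumps(big, protocol=2)

# Opcode names in the order they appear in the pickle.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]

# The value must come back from the unpickler's stack, not from a return value.
roundtrip = pickle.loads(data)
```

If the proto 1 LONG opcode were still appended, it would show up in opcode_names next to LONG1.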
Fixed bugs in encode_long().
Noted that decode_long() is quadratic-time despite our hopes,
because long(string, 16) is still quadratic-time in len(string).
It's hex() that's linear-time. I don't know a way to make decode_long()
linear-time in Python, short of maybe transforming the 256's-complement
bytes into marshal's funky internal format, and letting marshal decode
that. It would be more valuable to make long(string, 16) linear time.
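For reference, the 256's-complement byte format discussed above can be sketched with int.to_bytes() and int.from_bytes(). Those methods are an assumption here: they belong to later Python 3 releases and were not available when this note was written, and the function names below are hypothetical.

```python
def encode_long_sketch(x):
    """Shortest little-endian two's-complement encoding of x (0 -> b'')."""
    if x == 0:
        return b""
    nbytes = (x.bit_length() >> 3) + 1
    result = x.to_bytes(nbytes, "little", signed=True)
    # A negative value may carry one redundant 0xff sign byte; drop it.
    if x < 0 and nbytes > 1 and result[-1] == 0xFF and result[-2] & 0x80:
        result = result[:-1]
    return result


def decode_long_sketch(data):
    """Inverse of encode_long_sketch(); the empty string decodes to 0."""
    return int.from_bytes(data, "little", signed=True)
```

This only illustrates the byte layout; it makes no claim about the running time of the original decode_long().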
pickletester.py: Added a global "protocols" vector so tests can try
all the protocols in a sane way. Changed test_ints() and test_unicode()
to do so. Added a new test_long(), but the tail end of it is disabled
because it "takes forever" under pickle.py (but runs very quickly under
cPickle: cPickle proto 2 for longs is linear-time).
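The "protocols" idea can be sketched as a loop that round-trips each sample under every protocol. This assumes Python 3 and uses pickle.HIGHEST_PROTOCOL to build the list; the sample values are made up for illustration and are not the ones in pickletester.py.

```python
import pickle

# Every protocol number this interpreter supports, lowest first.
protocols = range(pickle.HIGHEST_PROTOCOL + 1)

# Illustrative samples: small ints, wide ints, and a non-ASCII string.
samples = [0, -1, 2 ** 31, -(2 ** 63), 12345678910111213141516, "caf\u00e9 \u20ac"]

# Collect (protocol, value) pairs that fail to round-trip.
failures = []
for proto in protocols:
    for value in samples:
        if pickle.loads(pickle.dumps(value, proto)) != value:
            failures.append((proto, value))
```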
The 4th item can be None or an iterator yielding list items, which are
used to append() or extend() the object. The 5th item can be None or
an iterator yielding a dict's (key, value) pairs, which are stuffed
into the object using __setitem__.
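As an illustration of the 4th and 5th items, here is a small sketch of a class whose __reduce__() returns all five. The Shelf class is hypothetical, and the code assumes Python 3, where the unpickler may call extend() when the object has one and otherwise falls back to append().

```python
import pickle


class Shelf:
    """Hypothetical container with list-like and dict-like content."""

    def __init__(self):
        self.books = []
        self.labels = {}

    def append(self, item):
        self.books.append(item)

    def __setitem__(self, key, value):
        self.labels[key] = value

    def __reduce__(self):
        # (callable, args, state, list-item iterator, dict-item iterator)
        return (Shelf, (), None, iter(self.books), iter(self.labels.items()))


shelf = Shelf()
shelf.append("a")
shelf.append("b")
shelf["x"] = 1

copy = pickle.loads(pickle.dumps(shelf, 2))
```

The copy is rebuilt by calling Shelf() with no arguments and then replaying the items through append() and __setitem__.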
Also (as a separate, though related, feature) add "batching" for list
and dict items. If you pickled a dict or list with a million items in
the past, it would push a million items onto the stack. It now pushes
only 1000 items at a time on the stack, using repeated APPENDS or
SETITEMS opcodes. (For lists, I hope that using many short extend()
calls doesn't exhibit quadratic behavior.)
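The batching can be observed by counting APPENDS opcodes in the pickle of a list longer than one batch. This sketch assumes Python 3 and a batch size of 1000, as described above.

```python
import pickle
import pickletools

# 2500 items should be split into batches of 1000, 1000 and 500.
data = pickle.dumps(list(range(2500)), protocol=2)

names = [op.name for op, arg, pos in pickletools.genops(data)]
appends_count = names.count("APPENDS")
restored = pickle.loads(data)
```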
__module__ is the string name of the module the function was defined
in, just like __module__ of classes. In some cases, particularly for
C functions, the __module__ may be None.
Change PyCFunction_New() from a function to a macro, but keep an
unused copy of the function around so that we don't change the binary
API.
Change pickle's save_global() to use whichmodule() if __module__ is
None, but add the __module__ logic to whichmodule() since it might be
used outside of pickle.
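The lookup order described here can be sketched roughly as follows. which_module_sketch() is a hypothetical helper written for Python 3, not the code in pickle.py, and the throwaway module name is invented for the demonstration.

```python
import math
import sys
import types


def which_module_sketch(obj, name):
    """Guess the module that owns obj: try __module__, then scan sys.modules."""
    module_name = getattr(obj, "__module__", None)
    if module_name is not None:
        return module_name
    for module_name, module in list(sys.modules.items()):
        if module_name == "__main__" or module is None:
            continue
        try:
            if getattr(module, name, None) is obj:
                return module_name
        except Exception:
            continue  # some modules misbehave on attribute access
    return "__main__"


def _make_orphan():
    # A function whose __module__ is None, as can happen for C functions.
    def orphan():
        return 1
    orphan.__module__ = None
    return orphan


demo_module = types.ModuleType("demo_owner_module")
demo_module.orphan = _make_orphan()
sys.modules["demo_owner_module"] = demo_module
```

With __module__ set to None, the only way to find the owner is the scan over sys.modules, which is why the fallback is kept.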
on the type instead of self.save(t). This defeated the purpose of
NEWOBJ, because it didn't generate a BINGET opcode when t was already
memoized; but moreover, it would generate multiple BINPUT opcodes for
the same type! pickletools.dis() doesn't like this.
How did I find this? I was playing with picklesize.py in the datetime
sandbox, and noticed that protocol 2 pickles for multiple objects were
in fact larger than protocol 1 pickles! That was suspicious, so I
decided to disassemble one of the pickles.
This really needs a unit test, but I'm exhausted. I'll be late for
work as it is. :-(
the same function, don't save the state or write a BUILD opcode. This
is so that a type (e.g. datetime :-) can support protocol 2 using
__getnewargs__ while also supporting protocol 0 and 1 using
__getstate__. (Without this, the state would be pickled twice with
protocol 2, unless __getstate__ is defined to return None, which
breaks protocol 0 and 1.)
types. The special handling for these can now be removed from save_newobj().
Add some testing for this.
Also add support for setting the 'fast' flag on the Python Pickler class,
which suppresses use of the memo.
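A quick way to see what the 'fast' flag does is to compare opcode lists with and without it. This sketch assumes Python 3, where Pickler still exposes a fast attribute; the helper name is illustrative.

```python
import io
import pickle
import pickletools


def opcode_names(obj, fast):
    """Pickle obj with protocol 2 and return the opcode names used."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol=2)
    pickler.fast = int(fast)  # nonzero suppresses use of the memo
    pickler.dump(obj)
    return [op.name for op, arg, pos in pickletools.genops(buf.getvalue())]


shared = ["x"]
normal_ops = opcode_names([shared, shared], fast=False)
fast_ops = opcode_names([shared, shared], fast=True)
```

Without the memo the shared sublist is written twice, so fast mode is unsuitable for self-referential objects.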
object.__reduce__, do a getattr() on the class so we can explicitly
test for it. The reduce()-calling code becomes a bit more regular as
a result.
Also add support for slots: if an object has slots, the default state is
(dict, slots) where dict is the __dict__ or None, and slots is a dict
mapping slot names to slot values. We do a best-effort approach to
find slot names, assuming the __slots__ fields of classes aren't
modified after class definition time to misrepresent the actual list
of slots defined by a class.
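The (dict, slots) state can be inspected directly through __reduce_ex__(2). This sketch assumes Python 3; the Point class is hypothetical.

```python
class Point:
    """Hypothetical class with slots and no instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


# The third item of the reduce tuple is the default state.
reduced = Point(1, 2).__reduce_ex__(2)
state = reduced[2]
```

Because Point has no __dict__, the first half of the state is None and the second half maps slot names to values.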
be one of 0, 1 or 2).
I should note that the previous checkin also added NEWOBJ support to
the unpickler -- but there's nothing yet that generates this.
some notion of low-level efficiency. Undid that, but left one routine
alone: save_inst() claims it has a reason for not using memoize().
I don't understand that comment, so added an XXX comment there.
then the embedded argument consumes at least 256 bytes. The difference
between a 3-byte prefix (LONG2 + 2 bytes) and a 5-byte prefix (LONG4 +
4 bytes) is at worst less than 1%. Note that binary strings and binary
Unicode strings also have only "size is 1 byte, or size is 4 bytes?"
flavors, and I expect for the same reason. The only place a 2-byte
thingie was used was in BININT2, where the 2 bytes make up the *entire*
embedded argument (and now EXT2 also does this); that's a large savings
over 4 bytes, because the total opcode+argument size is so small in
the BININT2/EXT2 case.
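The "less than 1%" figure can be checked with a little arithmetic. The sizes below take the smallest argument that no longer fits a 1-byte length; LONG2 is the hypothetical opcode being argued against.

```python
# Smallest payload for which a 1-byte length no longer suffices.
payload = 256

with_2_byte_length = 1 + 2 + payload  # hypothetical LONG2: opcode + 2-byte size
with_4_byte_length = 1 + 4 + payload  # LONG4: opcode + 4-byte size

# Extra cost of the 4-byte form relative to the 2-byte form, worst case.
relative_overhead = (with_4_byte_length - with_2_byte_length) / with_2_byte_length
```

Here relative_overhead is 2/259, and it only shrinks as the payload grows.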
Removed the TAKEN_FROM_ARGUMENT "number of bytes" code, and bifurcated it
into TAKEN_FROM_ARGUMENT1 and TAKEN_FROM_ARGUMENT4. Now there's enough
info in ArgumentDescriptor objects to deduce the # of bytes consumed by
each opcode.
Rearranged the order in which proto2 opcodes are listed in pickle.py.
Add memoize() helper function to update the memo.
The first element of the tuple returned by __reduce__() must be a
callable. If it isn't, the Unpickler will raise an error. Catch this
error in the pickler and raise the error there.
The memoize() helper also has a comment explaining how the memo
works. Some methods can't use memoize() because they write funny codes.
This fixes the charming, but unhelpful error message for
>>> pickle.dumps(type.__new__)
Can't pickle <built-in method __new__ of type object at 0x812a440>: it's not the same object as datetime.math.__new__
Bugfix candidate.
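The callable check on the first element of the __reduce__() tuple can be demonstrated like this. The sketch assumes Python 3, and BadReduce is a hypothetical class.

```python
import pickle


class BadReduce:
    """Hypothetical class whose __reduce__() returns a non-callable first item."""

    def __reduce__(self):
        return (42, ())  # 42 is not callable


try:
    pickle.dumps(BadReduce())
    error = None
except pickle.PicklingError as exc:
    error = exc
```

The failure surfaces while pickling, so no broken pickle is ever written.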
Change pickling format for bools to use a backwards compatible
encoding. This means you can pickle True or False on Python 2.3
and Python 2.2 or before will read it back as 1 or 0. The code
used for pickling bools before would create pickles that could
not be read in previous Python versions.
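The backwards-compatible encoding is visible in the protocol 0 bytes: a bool is written with the INT opcode, which an older reader parses as a plain integer. This sketch assumes Python 3, which still writes the same bytes.

```python
import pickle

# Protocol 0 writes bools with the INT opcode "I" and a two-digit argument.
true_pickle = pickle.dumps(True, protocol=0)
false_pickle = pickle.dumps(False, protocol=0)

restored_true = pickle.loads(true_pickle)
restored_false = pickle.loads(false_pickle)
```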
PEP 285. Everything described in the PEP is here, and there is even
some documentation. I had to fix 12 unit tests; all but one of these
were printing Boolean outcomes that changed from 0/1 to False/True.
(The exception is test_unicode.py, which did a type(x) == type(y)
style comparison. I could've fixed that with a single line using
isinstance(x, type(y)), but instead chose to be explicit about those
places where a bool is expected.)
Still to do: perhaps more documentation; change standard library
modules to return False/True from predicates.
metaclass, reported by Dan Parisien.
Objects that are instances of custom metaclasses, i.e. whose class is
a subclass of 'type', should be pickled the same as new-style classes
(objects whose class is 'type'). This can't be done through a
dispatch table entry, and the __reduce__ trick doesn't work for these,
since it finds the unbound __reduce__ for instances of the class
(inherited from 'object'). So check explicitly using issubclass().
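A class with a custom metaclass should round-trip as a global, the same way an ordinary class does. This sketch assumes Python 3 and borrows collections.abc.Mapping, whose metaclass is a subclass of type, so no user-defined names are needed.

```python
import pickle
from collections.abc import Mapping

# Mapping is an instance of a metaclass derived from type, not of type itself.
metaclass = type(Mapping)

data = pickle.dumps(Mapping)
restored = pickle.loads(data)
```

Unpickling yields the very same class object, because only its module and name were stored.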
load_inst(): Implement the security hook that cPickle already had.
When unpickling callables which are not classes, we look to see if the
object has an attribute __safe_for_unpickling__. If this exists and
has a true value, then we can call it to create the unpickled object.
Otherwise we raise an UnpicklingError.
find_class(): We no longer mask ImportError, KeyError, and
AttributeError by transforming them into SystemError. The latter is
definitely not the right thing to do, so we let the former three
exceptions simply propagate up if they occur, i.e. we remove the
try/except!
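The effect of removing the try/except can be seen by loading hand-written pickles that name a missing module or attribute. This sketch assumes Python 3; the module and attribute names are invented so that the lookups fail.

```python
import pickle


def load_error(data):
    """Return the exception raised while unpickling data, or None."""
    try:
        pickle.loads(data)
    except Exception as exc:
        return exc
    return None


# GLOBAL opcode "c": module name, newline, attribute name, newline.
missing_module = load_error(b"cno_such_module_for_demo\nthing\n.")
missing_name = load_error(b"cmath\nno_such_name_for_demo\n.")
```

Both errors keep their original type instead of being folded into SystemError.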
64-bit INTs on 32-bit boxes (where they become longs). Also exploit that
int(str) and long(str) will ignore a trailing newline (saves creating a
new string at the Python level).
pickletester.py: Simulate reading a pickle produced by a 64-bit box.
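Reading a pickle written on a 64-bit box can be simulated with a hand-written INT opcode. This sketch assumes Python 3, where there is a single int type, so only the parsing behaviour carries over.

```python
import pickle

# Text-mode INT opcode carrying a value that does not fit in 32 bits.
wide = pickle.loads(b"I9223372036854775807\n.")

# int() ignores the trailing newline, so no stripped copy is needed.
parsed = int("9223372036854775807\n")
```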
is pickled as a global must now exist by the name under which it is
pickled, otherwise the pickling fails. Previously, such things would
fail on unpickling, or unpickle as the wrong global object. I'm
hoping that this won't break existing code that is playing tricks with
this.
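A lambda is a simple case of something that cannot be found under the name it would be pickled as. This sketch assumes Python 3; the exact exception type is an assumption, so both likely candidates are caught.

```python
import pickle

# The function's own name is "<lambda>", which no module attribute matches.
anonymous = lambda: 0

try:
    pickle.dumps(anonymous)
    pickling_error = None
except (pickle.PicklingError, AttributeError) as exc:
    pickling_error = exc
```

The failure happens at pickling time rather than later, when the pickle is read back.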
I need a volunteer to do this for cPickle too.
pickle.py
The code implicitly assumed that all ints fit in 4 bytes, causing all
sorts of mischief (from nonsense results to corrupted pickles).
Repaired that.
marshal.c
The int marshaling code assumed that right shifts of signed longs
sign-extend. Repaired that.
bugs #126161 and 123634).
The solution doesn't use the unicode-escape encoding; that has other
problems (it seems not 100% reversible). Rather, it transforms the
input Unicode object slightly before encoding it using
raw-unicode-escape, so that the decoding will reconstruct the original
string: backslash and newline characters are translated into their
\uXXXX counterparts.
This is backwards incompatible for strings containing backslashes, but
for some of those strings, the pickling was already broken.
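The transformation can be sketched as two replacements before encoding. The function names are hypothetical and the code assumes Python 3 str and bytes; it is not the code in pickle.py.

```python
def escape_for_text_pickle(text):
    """Escape backslash and newline, then encode with raw-unicode-escape."""
    escaped = text.replace("\\", "\\u005c").replace("\n", "\\u000a")
    return escaped.encode("raw-unicode-escape")


def unescape_from_text_pickle(data):
    """Decoding alone restores the original string."""
    return data.decode("raw-unicode-escape")


samples = ["plain", "back\\slash", "two\nlines", "literal \\u0041", "caf\u00e9 \u20ac"]
encoded = [escape_for_text_pickle(s) for s in samples]
decoded = [unescape_from_text_pickle(b) for b in encoded]
```

Escaping the backslash first is what keeps a literal "\u0041" in the input from being decoded as "A".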
Strings are unpickled by calling eval on the string's repr. This
change makes pickle work like cPickle; it checks if the pickled
string is safe to eval and raises ValueError if it is not.
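A malformed STRING argument shows the check in action. This sketch assumes Python 3, which rejects the input with UnpicklingError rather than the ValueError mentioned above; both are accepted in the check that follows.

```python
import pickle

# A properly quoted STRING argument loads normally.
well_formed = pickle.loads(b"S'abc'\n.")

try:
    pickle.loads(b"S'abc\n.")  # closing quote is missing
    rejected = None
except Exception as exc:
    rejected = exc
```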
test suite modifications:
Verify that pickle catches a variety of insecure string pickles
Make test_pickle and test_cpickle use exactly the same test suite
Add test for pickling recursive object
who writes:
Here is batch 2, as a big collection of CVS context diffs.
Along with moving comments into docstrings, i've added a
couple of missing docstrings and attempted to make sure more
module docstrings begin with a one-line summary.
I did not add docstrings to the methods in profile.py for
fear of upsetting any careful optimizations there, though
i did move class documentation into class docstrings.
The convention i'm using is to leave credits/version/copyright
type of stuff in # comments, and move the rest of the descriptive
stuff about module usage into module docstrings. Hope this is
okay.
I found the following patch helpful in tracking down a bug in some
code. I had appended time, the module, instead of time.time(). Not
sure if it is generally true that printing the repr of the object is
good, but I expect that most unpicklable things will have fairly
informative and concise reprs (like files or sockets or modules).
"""
I've attached a long overdue patch to pickle.py to bring it to format
1.3, which is the same as 1.2 except that the binary float format
is supported. This is done using the new platform-independent format
features of struct.
This patch also gets rid of the undocumented obsolete Pickler
dump_special method.
"""
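The binary float format can be reproduced by hand with struct: the BINFLOAT opcode "G" is followed by eight bytes in big-endian IEEE 754 double format. This sketch assumes Python 3, which still writes floats this way for binary protocols.

```python
import pickle
import struct

value = 1.5
data = pickle.dumps(value, protocol=1)

# Opcode "G", the platform-independent ">d" packing, then STOP.
expected = b"G" + struct.pack(">d", value) + b"."
restored = pickle.loads(data)
```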
there's a __getinitargs__() method), if a TypeError occurs, catch and
reraise it but add info to the error about the class name being
instantiated. This makes debugging a lot easier if __getinitargs__()
returns something bogus (e.g. a string instead of a singleton tuple).
Fixed problems when unpickling in restricted execution environments.
These methods try to assign to an instance's __class__ attribute, or
access the instance's __dict__, which are prohibited in REE. For the
first two methods, I re-implemented the old behavior when assignment
to value.__class__ fails.
For load_build(), I also re-implemented the old behavior when
inst.__dict__.update() fails, but this means that unpickling in REE is
semantically different from unpickling in unrestricted mode.
The attached patch adds the following behavior to the handling
of REDUCE codes:
- A user-defined type may have a __reduce__ method that returns
a string rather than a tuple, in which case the object is
saved as a global object with a name given by the string returned
by reduce.
This was a feature added to cPickle a long time ago.
- User-defined types can now support unpickling without
executing a constructor.
The second value returned from '__reduce__' can now be None,
rather than an argument tuple. On unpickling, if the
second value returned from '__reduce__' during pickling was
None, then rather than calling the first value returned from
'__reduce__', directly, the '__basicnew__' method of the
first value returned from '__reduce__' is called without
arguments.
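The first behavior, a __reduce__() that returns a string, can be sketched with a module-level singleton. This assumes Python 3 and that the code runs as a normal script or module so the name can be looked up again; _Missing and MISSING are hypothetical names.

```python
import pickle


class _Missing:
    """Hypothetical singleton type, pickled by global name."""

    def __reduce__(self):
        # A string means: save me as the global with this name.
        return "MISSING"


MISSING = _Missing()

restored = pickle.loads(pickle.dumps(MISSING))
```

Unpickling returns the existing object instead of building a second instance.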
I also got rid of a few of Chris' extra ()s, which he used
to make python ifs look like C ifs.
mode. The pickler always uses base 10 so the default base should be
fine. (The base gets us in trouble when there's no strop module, as
the atoi() in string.py only supports base 10. This is for JPython.)
not define __getinitargs__, bypass the __init__ constructor
completely. This uses the trick of instantiating an empty dummy class
and then changing inst.__class__ to the real class. This is done in
two places: once for the INST and once for the OBJ format code.
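The dummy-class trick can be sketched outside of pickle. The helper and class names are hypothetical; the code assumes Python 3, where assigning __class__ still works between ordinary classes with compatible layouts.

```python
class _EmptyClass:
    """Dummy class used only to obtain a bare instance."""


class Account:
    init_calls = 0

    def __init__(self, owner):
        Account.init_calls += 1
        self.owner = owner


def instantiate_without_init(cls, state):
    """Create an instance of cls without running cls.__init__."""
    inst = _EmptyClass()
    inst.__class__ = cls  # retarget the bare instance to the real class
    inst.__dict__.update(state)
    return inst


acct = instantiate_without_init(Account, {"owner": "guido"})
```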
Also replaced the much outdated long doc string with a short summary
of the module; the information of that doc string is already
incorporated in the library reference manual.
- Don't use "from copy_reg import *".
- Use cls.__module__ instead of calling whichmodule(cls, cls.__name__);
also try __module__ in whichmodule(), just in case.
- After calling save_reduce(), add the object to the memo.
instance, use inst.__dict__.update(value) instead of a for loop with
setattr() over the value.keys(). This is more consistent (the
pickling doesn't use getattr() either but pickles inst.__dict__) and
avoids problems with instances that have a __setattr__ hook.
But it *is* a semantic change (because the setattr hook is no longer
used). So beware!
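The semantic change can be seen with a class that logs every __setattr__ call. Audited is a hypothetical class, and the sketch assumes Python 3.

```python
class Audited:
    """Hypothetical class that records each attribute assignment."""

    def __init__(self):
        self.__dict__["log"] = []  # bypass the hook for bookkeeping

    def __setattr__(self, name, value):
        self.log.append(name)
        self.__dict__[name] = value


state = {"a": 1, "b": 2}

# Old behavior: a loop with setattr() triggers the hook for every key.
via_setattr = Audited()
for key, val in state.items():
    setattr(via_setattr, key, val)

# New behavior: updating __dict__ directly never calls the hook.
via_update = Audited()
via_update.__dict__.update(state)
```

Both instances end up with the same attributes; only the hook calls differ.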