853 lines
36 KiB
TeX
853 lines
36 KiB
TeX
\section{\module{pickle} --- Python object serialization}
|
|
|
|
\declaremodule{standard}{pickle}
|
|
\modulesynopsis{Convert Python objects to streams of bytes and back.}
|
|
% Substantial improvements by Jim Kerr <jbkerr@sr.hp.com>.
|
|
% Rewritten by Barry Warsaw <barry@zope.com>
|
|
|
|
\index{persistence}
|
|
\indexii{persistent}{objects}
|
|
\indexii{serializing}{objects}
|
|
\indexii{marshalling}{objects}
|
|
\indexii{flattening}{objects}
|
|
\indexii{pickling}{objects}
|
|
|
|
The \module{pickle} module implements a fundamental, but powerful
|
|
algorithm for serializing and de-serializing a Python object
|
|
structure. ``Pickling'' is the process whereby a Python object
|
|
hierarchy is converted into a byte stream, and ``unpickling'' is the
|
|
inverse operation, whereby a byte stream is converted back into an
|
|
object hierarchy. Pickling (and unpickling) is alternatively known as
|
|
``serialization'', ``marshalling,''\footnote{Don't confuse this with
|
|
the \refmodule{marshal} module} or ``flattening'',
|
|
however the preferred term used here is ``pickling'' and
|
|
``unpickling'' to avoid confusing.
|
|
|
|
This documentation describes both the \module{pickle} module and the
|
|
\refmodule{cPickle} module.
|
|
|
|
\subsection{Relationship to other Python modules}
|
|
|
|
The \module{pickle} module has an optimized cousin called the
|
|
\module{cPickle} module. As its name implies, \module{cPickle} is
|
|
written in C, so it can be up to 1000 times faster than
|
|
\module{pickle}. However it does not support subclassing of the
|
|
\function{Pickler()} and \function{Unpickler()} classes, because in
|
|
\module{cPickle} these are functions, not classes. Most applications
|
|
have no need for this functionality, and can benefit from the improved
|
|
performance of \module{cPickle}. Other than that, the interfaces of
|
|
the two modules are nearly identical; the common interface is
|
|
described in this manual and differences are pointed out where
|
|
necessary. In the following discussions, we use the term ``pickle''
|
|
to collectively describe the \module{pickle} and
|
|
\module{cPickle} modules.
|
|
|
|
The data streams the two modules produce are guaranteed to be
|
|
interchangeable.
|
|
|
|
Python has a more primitive serialization module called
|
|
\refmodule{marshal}, but in general
|
|
\module{pickle} should always be the preferred way to serialize Python
|
|
objects. \module{marshal} exists primarily to support Python's
|
|
\file{.pyc} files.
|
|
|
|
The \module{pickle} module differs from \refmodule{marshal} several
|
|
significant ways:
|
|
|
|
\begin{itemize}
|
|
|
|
\item The \module{pickle} module keeps track of the objects it has
|
|
already serialized, so that later references to the same object
|
|
won't be serialized again. \module{marshal} doesn't do this.
|
|
|
|
This has implications both for recursive objects and object
|
|
sharing. Recursive objects are objects that contain references
|
|
to themselves. These are not handled by marshal, and in fact,
|
|
attempting to marshal recursive objects will crash your Python
|
|
interpreter. Object sharing happens when there are multiple
|
|
references to the same object in different places in the object
|
|
hierarchy being serialized. \module{pickle} stores such objects
|
|
only once, and ensures that all other references point to the
|
|
master copy. Shared objects remain shared, which can be very
|
|
important for mutable objects.
|
|
|
|
\item \module{marshal} cannot be used to serialize user-defined
|
|
classes and their instances. \module{pickle} can save and
|
|
restore class instances transparently, however the class
|
|
definition must be importable and live in the same module as
|
|
when the object was stored.
|
|
|
|
\item The \module{marshal} serialization format is not guaranteed to
|
|
be portable across Python versions. Because its primary job in
|
|
life is to support \file{.pyc} files, the Python implementers
|
|
reserve the right to change the serialization format in
|
|
non-backwards compatible ways should the need arise. The
|
|
\module{pickle} serialization format is guaranteed to be
|
|
backwards compatible across Python releases.
|
|
|
|
\item The \module{pickle} module doesn't handle code objects, which
|
|
the \module{marshal} module does. This avoids the possibility
|
|
of smuggling Trojan horses into a program through the
|
|
\module{pickle} module\footnote{This doesn't necessarily imply
|
|
that \module{pickle} is inherently secure. See
|
|
section~\ref{pickle-sec} for a more detailed discussion on
|
|
\module{pickle} module security. Besides, it's possible that
|
|
\module{pickle} will eventually support serializing code
|
|
objects.}.
|
|
|
|
\end{itemize}
|
|
|
|
Note that serialization is a more primitive notion than persistence;
|
|
although
|
|
\module{pickle} reads and writes file objects, it does not handle the
|
|
issue of naming persistent objects, nor the (even more complicated)
|
|
issue of concurrent access to persistent objects. The \module{pickle}
|
|
module can transform a complex object into a byte stream and it can
|
|
transform the byte stream into an object with the same internal
|
|
structure. Perhaps the most obvious thing to do with these byte
|
|
streams is to write them onto a file, but it is also conceivable to
|
|
send them across a network or store them in a database. The module
|
|
\refmodule{shelve} provides a simple interface
|
|
to pickle and unpickle objects on DBM-style database files.
|
|
|
|
\subsection{Data stream format}
|
|
|
|
The data format used by \module{pickle} is Python-specific. This has
|
|
the advantage that there are no restrictions imposed by external
|
|
standards such as XDR\index{XDR}\index{External Data Representation}
|
|
(which can't represent pointer sharing); however it means that
|
|
non-Python programs may not be able to reconstruct pickled Python
|
|
objects.
|
|
|
|
By default, the \module{pickle} data format uses a printable \ASCII{}
|
|
representation. This is slightly more voluminous than a binary
|
|
representation. The big advantage of using printable \ASCII{} (and of
|
|
some other characteristics of \module{pickle}'s representation) is that
|
|
for debugging or recovery purposes it is possible for a human to read
|
|
the pickled file with a standard text editor.
|
|
|
|
There are currently 3 different protocols which can be used for pickling.
|
|
|
|
\begin{itemize}
|
|
|
|
\item Protocol version 0 is the original ASCII protocol and is backwards
|
|
compatible with earlier versions of Python.
|
|
|
|
\item Protocol version 1 is the old binary format which is also compatible
|
|
with earlier versions of Python.
|
|
|
|
\item Protocol version 2 was introduced in Python 2.3. It provides
|
|
much more efficient pickling of new-style classes.
|
|
|
|
\end{itemize}
|
|
|
|
Refer to PEP 307 for more information.
|
|
|
|
If a \var{protocol} is not specified, protocol 0 is used.
|
|
If \var{protocol} is specified as a negative value
|
|
or \constant{HIGHEST_PROTOCOL},
|
|
the highest protocol version available will be used.
|
|
|
|
\versionchanged[The \var{bin} parameter is deprecated and only provided
|
|
for backwards compatibility. You should use the \var{protocol}
|
|
parameter instead]{2.3}
|
|
|
|
A binary format, which is slightly more efficient, can be chosen by
|
|
specifying a true value for the \var{bin} argument to the
|
|
\class{Pickler} constructor or the \function{dump()} and \function{dumps()}
|
|
functions. A \var{protocol} version >= 1 implies use of a binary format.
|
|
|
|
\subsection{Usage}
|
|
|
|
To serialize an object hierarchy, you first create a pickler, then you
|
|
call the pickler's \method{dump()} method. To de-serialize a data
|
|
stream, you first create an unpickler, then you call the unpickler's
|
|
\method{load()} method. The \module{pickle} module provides the
|
|
following constant:
|
|
|
|
\begin{datadesc}{HIGHEST_PROTOCOL}
|
|
The highest protocol version available. This value can be passed
|
|
as a \var{protocol} value.
|
|
\end{datadesc}
|
|
|
|
The \module{pickle} module provides the
|
|
following functions to make this process more convenient:
|
|
|
|
\begin{funcdesc}{dump}{object, file\optional{, protocol\optional{, bin}}}
|
|
Write a pickled representation of \var{object} to the open file object
|
|
\var{file}. This is equivalent to
|
|
\code{Pickler(\var{file}, \var{protocol}, \var{bin}).dump(\var{object})}.
|
|
|
|
If the \var{protocol} parameter is ommitted, protocol 0 is used.
|
|
If \var{protocol} is specified as a negative value
|
|
or \constant{HIGHEST_PROTOCOL},
|
|
the highest protocol version will be used.
|
|
|
|
\versionchanged[The \var{protocol} parameter was added.
|
|
The \var{bin} parameter is deprecated and only provided
|
|
for backwards compatibility. You should use the \var{protocol}
|
|
parameter instead]{2.3}
|
|
|
|
If the optional \var{bin} argument is true, the binary pickle format
|
|
is used; otherwise the (less efficient) text pickle format is used
|
|
(for backwards compatibility, this is the default).
|
|
|
|
\var{file} must have a \method{write()} method that accepts a single
|
|
string argument. It can thus be a file object opened for writing, a
|
|
\refmodule{StringIO} object, or any other custom
|
|
object that meets this interface.
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{load}{file}
|
|
Read a string from the open file object \var{file} and interpret it as
|
|
a pickle data stream, reconstructing and returning the original object
|
|
hierarchy. This is equivalent to \code{Unpickler(\var{file}).load()}.
|
|
|
|
\var{file} must have two methods, a \method{read()} method that takes
|
|
an integer argument, and a \method{readline()} method that requires no
|
|
arguments. Both methods should return a string. Thus \var{file} can
|
|
be a file object opened for reading, a
|
|
\module{StringIO} object, or any other custom
|
|
object that meets this interface.
|
|
|
|
This function automatically determines whether the data stream was
|
|
written in binary mode or not.
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{dumps}{object\optional{, protocol\optional{, bin}}}
|
|
Return the pickled representation of the object as a string, instead
|
|
of writing it to a file.
|
|
|
|
If the \var{protocol} parameter is ommitted, protocol 0 is used.
|
|
If \var{protocol} is specified as a negative value
|
|
or \constant{HIGHEST_PROTOCOL},
|
|
the highest protocol version will be used.
|
|
|
|
\versionchanged[The \var{protocol} parameter was added.
|
|
The \var{bin} parameter is deprecated and only provided
|
|
for backwards compatibility. You should use the \var{protocol}
|
|
parameter instead]{2.3}
|
|
|
|
If the optional \var{bin} argument is
|
|
true, the binary pickle format is used; otherwise the (less efficient)
|
|
text pickle format is used (this is the default).
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{loads}{string}
|
|
Read a pickled object hierarchy from a string. Characters in the
|
|
string past the pickled object's representation are ignored.
|
|
\end{funcdesc}
|
|
|
|
The \module{pickle} module also defines three exceptions:
|
|
|
|
\begin{excdesc}{PickleError}
|
|
A common base class for the other exceptions defined below. This
|
|
inherits from \exception{Exception}.
|
|
\end{excdesc}
|
|
|
|
\begin{excdesc}{PicklingError}
|
|
This exception is raised when an unpicklable object is passed to
|
|
the \method{dump()} method.
|
|
\end{excdesc}
|
|
|
|
\begin{excdesc}{UnpicklingError}
|
|
This exception is raised when there is a problem unpickling an object,
|
|
such as a security violation. Note that other exceptions may also be
|
|
raised during unpickling, including (but not necessarily limited to)
|
|
\exception{AttributeError}, \exception{EOFError},
|
|
\exception{ImportError}, and \exception{IndexError}.
|
|
\end{excdesc}
|
|
|
|
The \module{pickle} module also exports two callables\footnote{In the
|
|
\module{pickle} module these callables are classes, which you could
|
|
subclass to customize the behavior. However, in the \module{cPickle}
|
|
modules these callables are factory functions and so cannot be
|
|
subclassed. One of the common reasons to subclass is to control what
|
|
objects can actually be unpickled. See section~\ref{pickle-sec} for
|
|
more details on security concerns.}, \class{Pickler} and
|
|
\class{Unpickler}:
|
|
|
|
\begin{classdesc}{Pickler}{file\optional{, protocol\optional{, bin}}}
|
|
This takes a file-like object to which it will write a pickle data
|
|
stream.
|
|
|
|
If the \var{protocol} parameter is ommitted, protocol 0 is used.
|
|
If \var{protocol} is specified as a negative value,
|
|
the highest protocol version will be used.
|
|
|
|
\versionchanged[The \var{bin} parameter is deprecated and only provided
|
|
for backwards compatibility. You should use the \var{protocol}
|
|
parameter instead]{2.3}
|
|
|
|
Optional \var{bin} if true, tells the pickler to use the more
|
|
efficient binary pickle format, otherwise the \ASCII{} format is used
|
|
(this is the default).
|
|
|
|
\var{file} must have a \method{write()} method that accepts a single
|
|
string argument. It can thus be an open file object, a
|
|
\module{StringIO} object, or any other custom
|
|
object that meets this interface.
|
|
\end{classdesc}
|
|
|
|
\class{Pickler} objects define one (or two) public methods:
|
|
|
|
\begin{methoddesc}[Pickler]{dump}{object}
|
|
Write a pickled representation of \var{object} to the open file object
|
|
given in the constructor. Either the binary or \ASCII{} format will
|
|
be used, depending on the value of the \var{bin} flag passed to the
|
|
constructor.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[Pickler]{clear_memo}{}
|
|
Clears the pickler's ``memo''. The memo is the data structure that
|
|
remembers which objects the pickler has already seen, so that shared
|
|
or recursive objects pickled by reference and not by value. This
|
|
method is useful when re-using picklers.
|
|
|
|
\begin{notice}
|
|
Prior to Python 2.3, \method{clear_memo()} was only available on the
|
|
picklers created by \refmodule{cPickle}. In the \module{pickle} module,
|
|
picklers have an instance variable called \member{memo} which is a
|
|
Python dictionary. So to clear the memo for a \module{pickle} module
|
|
pickler, you could do the following:
|
|
|
|
\begin{verbatim}
|
|
mypickler.memo.clear()
|
|
\end{verbatim}
|
|
|
|
Code that does not need to support older versions of Python should
|
|
simply use \method{clear_memo()}.
|
|
\end{notice}
|
|
\end{methoddesc}
|
|
|
|
It is possible to make multiple calls to the \method{dump()} method of
|
|
the same \class{Pickler} instance. These must then be matched to the
|
|
same number of calls to the \method{load()} method of the
|
|
corresponding \class{Unpickler} instance. If the same object is
|
|
pickled by multiple \method{dump()} calls, the \method{load()} will
|
|
all yield references to the same object\footnote{\emph{Warning}: this
|
|
is intended for pickling multiple objects without intervening
|
|
modifications to the objects or their parts. If you modify an object
|
|
and then pickle it again using the same \class{Pickler} instance, the
|
|
object is not pickled again --- a reference to it is pickled and the
|
|
\class{Unpickler} will return the old value, not the modified one.
|
|
There are two problems here: (1) detecting changes, and (2)
|
|
marshalling a minimal set of changes. Garbage Collection may also
|
|
become a problem here.}.
|
|
|
|
\class{Unpickler} objects are defined as:
|
|
|
|
\begin{classdesc}{Unpickler}{file}
|
|
This takes a file-like object from which it will read a pickle data
|
|
stream. This class automatically determines whether the data stream
|
|
was written in binary mode or not, so it does not need a flag as in
|
|
the \class{Pickler} factory.
|
|
|
|
\var{file} must have two methods, a \method{read()} method that takes
|
|
an integer argument, and a \method{readline()} method that requires no
|
|
arguments. Both methods should return a string. Thus \var{file} can
|
|
be a file object opened for reading, a
|
|
\module{StringIO} object, or any other custom
|
|
object that meets this interface.
|
|
\end{classdesc}
|
|
|
|
\class{Unpickler} objects have one (or two) public methods:
|
|
|
|
\begin{methoddesc}[Unpickler]{load}{}
|
|
Read a pickled object representation from the open file object given
|
|
in the constructor, and return the reconstituted object hierarchy
|
|
specified therein.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[Unpickler]{noload}{}
|
|
This is just like \method{load()} except that it doesn't actually
|
|
create any objects. This is useful primarily for finding what's
|
|
called ``persistent ids'' that may be referenced in a pickle data
|
|
stream. See section~\ref{pickle-protocol} below for more details.
|
|
|
|
\strong{Note:} the \method{noload()} method is currently only
|
|
available on \class{Unpickler} objects created with the
|
|
\module{cPickle} module. \module{pickle} module \class{Unpickler}s do
|
|
not have the \method{noload()} method.
|
|
\end{methoddesc}
|
|
|
|
\subsection{What can be pickled and unpickled?}
|
|
|
|
The following types can be pickled:
|
|
|
|
\begin{itemize}
|
|
|
|
\item \code{None}, \code{True}, and \code{False}
|
|
|
|
\item integers, long integers, floating point numbers, complex numbers
|
|
|
|
\item normal and Unicode strings
|
|
|
|
\item tuples, lists, and dictionaries containing only picklable objects
|
|
|
|
\item functions defined at the top level of a module
|
|
|
|
\item built-in functions defined at the top level of a module
|
|
|
|
\item classes that are defined at the top level of a module
|
|
|
|
\item instances of such classes whose \member{__dict__} or
|
|
\method{__setstate__()} is picklable (see
|
|
section~\ref{pickle-protocol} for details)
|
|
|
|
\end{itemize}
|
|
|
|
Attempts to pickle unpicklable objects will raise the
|
|
\exception{PicklingError} exception; when this happens, an unspecified
|
|
number of bytes may have already been written to the underlying file.
|
|
|
|
Note that functions (built-in and user-defined) are pickled by ``fully
|
|
qualified'' name reference, not by value. This means that only the
|
|
function name is pickled, along with the name of module the function
|
|
is defined in. Neither the function's code, nor any of its function
|
|
attributes are pickled. Thus the defining module must be importable
|
|
in the unpickling environment, and the module must contain the named
|
|
object, otherwise an exception will be raised\footnote{The exception
|
|
raised will likely be an \exception{ImportError} or an
|
|
\exception{AttributeError} but it could be something else.}.
|
|
|
|
Similarly, classes are pickled by named reference, so the same
|
|
restrictions in the unpickling environment apply. Note that none of
|
|
the class's code or data is pickled, so in the following example the
|
|
class attribute \code{attr} is not restored in the unpickling
|
|
environment:
|
|
|
|
\begin{verbatim}
|
|
class Foo:
|
|
attr = 'a class attr'
|
|
|
|
picklestring = pickle.dumps(Foo)
|
|
\end{verbatim}
|
|
|
|
These restrictions are why picklable functions and classes must be
|
|
defined in the top level of a module.
|
|
|
|
Similarly, when class instances are pickled, their class's code and
|
|
data are not pickled along with them. Only the instance data are
|
|
pickled. This is done on purpose, so you can fix bugs in a class or
|
|
add methods to the class and still load objects that were created with
|
|
an earlier version of the class. If you plan to have long-lived
|
|
objects that will see many versions of a class, it may be worthwhile
|
|
to put a version number in the objects so that suitable conversions
|
|
can be made by the class's \method{__setstate__()} method.
|
|
|
|
\subsection{The pickle protocol
|
|
\label{pickle-protocol}}\setindexsubitem{(pickle protocol)}
|
|
|
|
This section describes the ``pickling protocol'' that defines the
|
|
interface between the pickler/unpickler and the objects that are being
|
|
serialized. This protocol provides a standard way for you to define,
|
|
customize, and control how your objects are serialized and
|
|
de-serialized. The description in this section doesn't cover specific
|
|
customizations that you can employ to make the unpickling environment
|
|
safer from untrusted pickle data streams; see section~\ref{pickle-sec}
|
|
for more details.
|
|
|
|
\subsubsection{Pickling and unpickling normal class
|
|
instances\label{pickle-inst}}
|
|
|
|
When a pickled class instance is unpickled, its \method{__init__()}
|
|
method is normally \emph{not} invoked. If it is desirable that the
|
|
\method{__init__()} method be called on unpickling, a class can define
|
|
a method \method{__getinitargs__()}, which should return a
|
|
\emph{tuple} containing the arguments to be passed to the class
|
|
constructor (i.e. \method{__init__()}). The
|
|
\method{__getinitargs__()} method is called at
|
|
pickle time; the tuple it returns is incorporated in the pickle for
|
|
the instance.
|
|
\withsubitem{(copy protocol)}{\ttindex{__getinitargs__()}}
|
|
\withsubitem{(instance constructor)}{\ttindex{__init__()}}
|
|
|
|
\withsubitem{(copy protocol)}{
|
|
\ttindex{__getstate__()}\ttindex{__setstate__()}}
|
|
\withsubitem{(instance attribute)}{
|
|
\ttindex{__dict__}}
|
|
|
|
Classes can further influence how their instances are pickled; if the
|
|
class defines the method \method{__getstate__()}, it is called and the
|
|
return state is pickled as the contents for the instance, instead of
|
|
the contents of the instance's dictionary. If there is no
|
|
\method{__getstate__()} method, the instance's \member{__dict__} is
|
|
pickled.
|
|
|
|
Upon unpickling, if the class also defines the method
|
|
\method{__setstate__()}, it is called with the unpickled
|
|
state\footnote{These methods can also be used to implement copying
|
|
class instances.}. If there is no \method{__setstate__()} method, the
|
|
pickled state must be a dictionary and its items are assigned to the
|
|
new instance's dictionary. If a class defines both
|
|
\method{__getstate__()} and \method{__setstate__()}, the state object
|
|
needn't be a dictionary and these methods can do what they
|
|
want.\footnote{This protocol is also used by the shallow and deep
|
|
copying operations defined in the
|
|
\refmodule{copy} module.}
|
|
|
|
\begin{notice}[warning]
|
|
For new-style classes, if \method{__getstate__()} returns a false
|
|
value, the \method{__setstate__()} method will not be called.
|
|
\end{notice}
|
|
|
|
|
|
\subsubsection{Pickling and unpickling extension types}
|
|
|
|
When the \class{Pickler} encounters an object of a type it knows
|
|
nothing about --- such as an extension type --- it looks in two places
|
|
for a hint of how to pickle it. One alternative is for the object to
|
|
implement a \method{__reduce__()} method. If provided, at pickling
|
|
time \method{__reduce__()} will be called with no arguments, and it
|
|
must return either a string or a tuple.
|
|
|
|
If a string is returned, it names a global variable whose contents are
|
|
pickled as normal. When a tuple is returned, it must be of length two
|
|
or three, with the following semantics:
|
|
|
|
\begin{itemize}
|
|
|
|
\item A callable object, which in the unpickling environment must be
|
|
either a class, a callable registered as a ``safe constructor''
|
|
(see below), or it must have an attribute
|
|
\member{__safe_for_unpickling__} with a true value. Otherwise,
|
|
an \exception{UnpicklingError} will be raised in the unpickling
|
|
environment. Note that as usual, the callable itself is pickled
|
|
by name.
|
|
|
|
\item A tuple of arguments for the callable object, or \code{None}.
|
|
\deprecated{2.3}{Use the tuple of arguments instead}
|
|
|
|
\item Optionally, the object's state, which will be passed to
|
|
the object's \method{__setstate__()} method as described in
|
|
section~\ref{pickle-inst}. If the object has no
|
|
\method{__setstate__()} method, then, as above, the value must
|
|
be a dictionary and it will be added to the object's
|
|
\member{__dict__}.
|
|
|
|
\end{itemize}
|
|
|
|
Upon unpickling, the callable will be called (provided that it meets
|
|
the above criteria), passing in the tuple of arguments; it should
|
|
return the unpickled object.
|
|
|
|
If the second item was \code{None}, then instead of calling the
|
|
callable directly, its \method{__basicnew__()} method is called
|
|
without arguments. It should also return the unpickled object.
|
|
|
|
\deprecated{2.3}{Use the tuple of arguments instead}
|
|
|
|
An alternative to implementing a \method{__reduce__()} method on the
|
|
object to be pickled, is to register the callable with the
|
|
\refmodule[copyreg]{copy_reg} module. This module provides a way
|
|
for programs to register ``reduction functions'' and constructors for
|
|
user-defined types. Reduction functions have the same semantics and
|
|
interface as the \method{__reduce__()} method described above, except
|
|
that they are called with a single argument, the object to be pickled.
|
|
|
|
The registered constructor is deemed a ``safe constructor'' for purposes
|
|
of unpickling as described above.
|
|
|
|
\subsubsection{Pickling and unpickling external objects}
|
|
|
|
For the benefit of object persistence, the \module{pickle} module
|
|
supports the notion of a reference to an object outside the pickled
|
|
data stream. Such objects are referenced by a ``persistent id'',
|
|
which is just an arbitrary string of printable \ASCII{} characters.
|
|
The resolution of such names is not defined by the \module{pickle}
|
|
module; it will delegate this resolution to user defined functions on
|
|
the pickler and unpickler\footnote{The actual mechanism for
|
|
associating these user defined functions is slightly different for
|
|
\module{pickle} and \module{cPickle}. The description given here
|
|
works the same for both implementations. Users of the \module{pickle}
|
|
module could also use subclassing to effect the same results,
|
|
overriding the \method{persistent_id()} and \method{persistent_load()}
|
|
methods in the derived classes.}.
|
|
|
|
To define external persistent id resolution, you need to set the
|
|
\member{persistent_id} attribute of the pickler object and the
|
|
\member{persistent_load} attribute of the unpickler object.
|
|
|
|
To pickle objects that have an external persistent id, the pickler
|
|
must have a custom \function{persistent_id()} method that takes an
|
|
object as an argument and returns either \code{None} or the persistent
|
|
id for that object. When \code{None} is returned, the pickler simply
|
|
pickles the object as normal. When a persistent id string is
|
|
returned, the pickler will pickle that string, along with a marker
|
|
so that the unpickler will recognize the string as a persistent id.
|
|
|
|
To unpickle external objects, the unpickler must have a custom
|
|
\function{persistent_load()} function that takes a persistent id
|
|
string and returns the referenced object.
|
|
|
|
Here's a silly example that \emph{might} shed more light:
|
|
|
|
\begin{verbatim}
|
|
import pickle
|
|
from cStringIO import StringIO
|
|
|
|
src = StringIO()
|
|
p = pickle.Pickler(src)
|
|
|
|
def persistent_id(obj):
|
|
if hasattr(obj, 'x'):
|
|
return 'the value %d' % obj.x
|
|
else:
|
|
return None
|
|
|
|
p.persistent_id = persistent_id
|
|
|
|
class Integer:
|
|
def __init__(self, x):
|
|
self.x = x
|
|
def __str__(self):
|
|
return 'My name is integer %d' % self.x
|
|
|
|
i = Integer(7)
|
|
print i
|
|
p.dump(i)
|
|
|
|
datastream = src.getvalue()
|
|
print repr(datastream)
|
|
dst = StringIO(datastream)
|
|
|
|
up = pickle.Unpickler(dst)
|
|
|
|
class FancyInteger(Integer):
|
|
def __str__(self):
|
|
return 'I am the integer %d' % self.x
|
|
|
|
def persistent_load(persid):
|
|
if persid.startswith('the value '):
|
|
value = int(persid.split()[2])
|
|
return FancyInteger(value)
|
|
else:
|
|
raise pickle.UnpicklingError, 'Invalid persistent id'
|
|
|
|
up.persistent_load = persistent_load
|
|
|
|
j = up.load()
|
|
print j
|
|
\end{verbatim}
|
|
|
|
In the \module{cPickle} module, the unpickler's
|
|
\member{persistent_load} attribute can also be set to a Python
|
|
list, in which case, when the unpickler reaches a persistent id, the
|
|
persistent id string will simply be appended to this list. This
|
|
functionality exists so that a pickle data stream can be ``sniffed''
|
|
for object references without actually instantiating all the objects
|
|
in a pickle\footnote{We'll leave you with the image of Guido and Jim
|
|
sitting around sniffing pickles in their living rooms.}. Setting
|
|
\member{persistent_load} to a list is usually used in conjunction with
|
|
the \method{noload()} method on the Unpickler.
|
|
|
|
% BAW: Both pickle and cPickle support something called
|
|
% inst_persistent_id() which appears to give unknown types a second
|
|
% shot at producing a persistent id. Since Jim Fulton can't remember
|
|
% why it was added or what it's for, I'm leaving it undocumented.
|
|
|
|
\subsection{Security \label{pickle-sec}}
|
|
|
|
Most of the security issues surrounding the \module{pickle} and
|
|
\module{cPickle} module involve unpickling. There are no known
|
|
security vulnerabilities
|
|
related to pickling because you (the programmer) control the objects
|
|
that \module{pickle} will interact with, and all it produces is a
|
|
string.
|
|
|
|
However, for unpickling, it is \strong{never} a good idea to unpickle
|
|
an untrusted string whose origins are dubious, for example, strings
|
|
read from a socket. This is because unpickling can create unexpected
|
|
objects and even potentially run methods of those objects, such as
|
|
their class constructor or destructor\footnote{A special note of
|
|
caution is worth raising about the \refmodule{Cookie}
|
|
module. By default, the \class{Cookie.Cookie} class is an alias for
|
|
the \class{Cookie.SmartCookie} class, which ``helpfully'' attempts to
|
|
unpickle any cookie data string it is passed. This is a huge security
|
|
hole because cookie data typically comes from an untrusted source.
|
|
You should either explicitly use the \class{Cookie.SimpleCookie} class
|
|
--- which doesn't attempt to unpickle its string --- or you should
|
|
implement the defensive programming steps described later on in this
|
|
section.}.
|
|
|
|
You can defend against this by customizing your unpickler so that you
|
|
can control exactly what gets unpickled and what gets called.
|
|
Unfortunately, exactly how you do this is different depending on
|
|
whether you're using \module{pickle} or \module{cPickle}.
|
|
|
|
One common feature that both modules implement is the
|
|
\member{__safe_for_unpickling__} attribute. Before calling a callable
|
|
which is not a class, the unpickler will check to make sure that the
|
|
callable has either been registered as a safe callable via the
|
|
\refmodule[copyreg]{copy_reg} module, or that it has an
|
|
attribute \member{__safe_for_unpickling__} with a true value. This
|
|
prevents the unpickling environment from being tricked into doing
|
|
evil things like call \code{os.unlink()} with an arbitrary file name.
|
|
See section~\ref{pickle-protocol} for more details.
|
|
|
|
For safely unpickling class instances, you need to control exactly
|
|
which classes will get created. Be aware that a class's constructor
|
|
could be called (if the pickler found a \method{__getinitargs__()}
|
|
method) and the the class's destructor (i.e. its \method{__del__()} method)
|
|
might get called when the object is garbage collected. Depending on
|
|
the class, it isn't very heard to trick either method into doing bad
|
|
things, such as removing a file. The way to
|
|
control the classes that are safe to instantiate differs in
|
|
\module{pickle} and \module{cPickle}\footnote{A word of caution: the
|
|
mechanisms described here use internal attributes and methods, which
|
|
are subject to change in future versions of Python. We intend to
|
|
someday provide a common interface for controlling this behavior,
|
|
which will work in either \module{pickle} or \module{cPickle}.}.
|
|
|
|
In the \module{pickle} module, you need to derive a subclass from
|
|
\class{Unpickler}, overriding the \method{load_global()}
|
|
method. \method{load_global()} should read two lines from the pickle
|
|
data stream where the first line will the the name of the module
|
|
containing the class and the second line will be the name of the
|
|
instance's class. It then look up the class, possibly importing the
|
|
module and digging out the attribute, then it appends what it finds to
|
|
the unpickler's stack. Later on, this class will be assigned to the
|
|
\member{__class__} attribute of an empty class, as a way of magically
|
|
creating an instance without calling its class's \method{__init__()}.
|
|
You job (should you choose to accept it), would be to have
|
|
\method{load_global()} push onto the unpickler's stack, a known safe
|
|
version of any class you deem safe to unpickle. It is up to you to
|
|
produce such a class. Or you could raise an error if you want to
|
|
disallow all unpickling of instances. If this sounds like a hack,
|
|
you're right. UTSL.
|
|
|
|
Things are a little cleaner with \module{cPickle}, but not by much.
|
|
To control what gets unpickled, you can set the unpickler's
|
|
\member{find_global} attribute to a function or \code{None}. If it is
|
|
\code{None} then any attempts to unpickle instances will raise an
|
|
\exception{UnpicklingError}. If it is a function,
|
|
then it should accept a module name and a class name, and return the
|
|
corresponding class object. It is responsible for looking up the
|
|
class, again performing any necessary imports, and it may raise an
|
|
error to prevent instances of the class from being unpickled.
|
|
|
|
The moral of the story is that you should be really careful about the
|
|
source of the strings your application unpickles.
|
|
|
|
\subsection{Example \label{pickle-example}}
|
|
|
|
Here's a simple example of how to modify pickling behavior for a
|
|
class. The \class{TextReader} class opens a text file, and returns
|
|
the line number and line contents each time its \method{readline()}
|
|
method is called. If a \class{TextReader} instance is pickled, all
|
|
attributes \emph{except} the file object member are saved. When the
|
|
instance is unpickled, the file is reopened, and reading resumes from
|
|
the last location. The \method{__setstate__()} and
|
|
\method{__getstate__()} methods are used to implement this behavior.
|
|
|
|
\begin{verbatim}
|
|
class TextReader:
|
|
"""Print and number lines in a text file."""
|
|
def __init__(self, file):
|
|
self.file = file
|
|
self.fh = open(file)
|
|
self.lineno = 0
|
|
|
|
def readline(self):
|
|
self.lineno = self.lineno + 1
|
|
line = self.fh.readline()
|
|
if not line:
|
|
return None
|
|
if line.endswith("\n"):
|
|
line = line[:-1]
|
|
return "%d: %s" % (self.lineno, line)
|
|
|
|
def __getstate__(self):
|
|
odict = self.__dict__.copy() # copy the dict since we change it
|
|
del odict['fh'] # remove filehandle entry
|
|
return odict
|
|
|
|
def __setstate__(self,dict):
|
|
fh = open(dict['file']) # reopen file
|
|
count = dict['lineno'] # read from file...
|
|
while count: # until line count is restored
|
|
fh.readline()
|
|
count = count - 1
|
|
self.__dict__.update(dict) # update attributes
|
|
self.fh = fh # save the file object
|
|
\end{verbatim}
|
|
|
|
A sample usage might be something like this:
|
|
|
|
\begin{verbatim}
|
|
>>> import TextReader
|
|
>>> obj = TextReader.TextReader("TextReader.py")
|
|
>>> obj.readline()
|
|
'1: #!/usr/local/bin/python'
|
|
>>> # (more invocations of obj.readline() here)
|
|
... obj.readline()
|
|
'7: class TextReader:'
|
|
>>> import pickle
|
|
>>> pickle.dump(obj,open('save.p','w'))
|
|
\end{verbatim}
|
|
|
|
If you want to see that \refmodule{pickle} works across Python
|
|
processes, start another Python session, before continuing. What
|
|
follows can happen from either the same process or a new process.
|
|
|
|
\begin{verbatim}
|
|
>>> import pickle
|
|
>>> reader = pickle.load(open('save.p'))
|
|
>>> reader.readline()
|
|
'8: "Print and number lines in a text file."'
|
|
\end{verbatim}
|
|
|
|
|
|
\begin{seealso}
|
|
\seemodule[copyreg]{copy_reg}{Pickle interface constructor
|
|
registration for extension types.}
|
|
|
|
\seemodule{shelve}{Indexed databases of objects; uses \module{pickle}.}
|
|
|
|
\seemodule{copy}{Shallow and deep object copying.}
|
|
|
|
\seemodule{marshal}{High-performance serialization of built-in types.}
|
|
\end{seealso}
|
|
|
|
|
|
\section{\module{cPickle} --- A faster \module{pickle}}
|
|
|
|
\declaremodule{builtin}{cPickle}
|
|
\modulesynopsis{Faster version of \refmodule{pickle}, but not subclassable.}
|
|
\moduleauthor{Jim Fulton}{jfulton@digicool.com}
|
|
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
|
|
|
|
The \module{cPickle} module supports serialization and
|
|
de-serialization of Python objects, providing an interface and
|
|
functionality nearly identical to the
|
|
\refmodule{pickle}\refstmodindex{pickle} module. There are several
|
|
differences, the most important being performance and subclassability.
|
|
|
|
First, \module{cPickle} can be up to 1000 times faster than
|
|
\module{pickle} because the former is implemented in C. Second, in
|
|
the \module{cPickle} module the callables \function{Pickler()} and
|
|
\function{Unpickler()} are functions, not classes. This means that
|
|
you cannot use them to derive custom pickling and unpickling
|
|
subclasses. Most applications have no need for this functionality and
|
|
should benefit from the greatly improved performance of the
|
|
\module{cPickle} module.
|
|
|
|
The pickle data stream produced by \module{pickle} and
|
|
\module{cPickle} are identical, so it is possible to use
|
|
\module{pickle} and \module{cPickle} interchangeably with existing
|
|
pickles\footnote{Since the pickle data format is actually a tiny
|
|
stack-oriented programming language, and some freedom is taken in the
|
|
encodings of certain objects, it is possible that the two modules
|
|
produce different data streams for the same input objects. However it
|
|
is guaranteed that they will always be able to read each other's
|
|
data streams.}.
|
|
|
|
There are additional minor differences in API between \module{cPickle}
|
|
and \module{pickle}, however for most applications, they are
|
|
interchangable. More documentation is provided in the
|
|
\module{pickle} module documentation, which
|
|
includes a list of the documented differences.
|
|
|
|
|