\section{\module{xml.dom.minidom} ---
         The Document Object Model}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}

\versionadded{2.0}

\module{xml.dom.minidom} provides a light-weight implementation of
the W3C Document Object Model.  The DOM is a cross-language API from
the World Wide Web Consortium (W3C) for accessing and modifying XML
documents.  A DOM implementation allows you to convert an XML document
into a tree-like structure, or to build such a structure from scratch.
It then gives access to the structure through a set of objects which
provide well-known interfaces.  Minidom is intended to be simpler than
the full DOM and also significantly smaller.

The DOM is extremely useful for random-access applications.  SAX only
allows you a view of one bit of the document at a time.  If you are
looking at one SAX element, you have no access to another.  If you are
looking at a text node, you have no access to a containing
element.  When you write a SAX application, you need to keep track of
your program's position in the document somewhere in your own
code.  SAX does not do it for you.  Also, if you need to look ahead in
the XML document, you are just out of luck.

Some applications are simply impossible in an event-driven model with
no access to a tree.  Of course you could build some sort of tree
yourself in SAX events, but the DOM allows you to avoid writing that
code.  The DOM is a standard tree representation for XML data.

%What if your needs are somewhere between SAX and the DOM? Perhaps you cannot
%afford to load the entire tree in memory but you find the SAX model
%somewhat cumbersome and low-level. There is also an experimental module
%called pulldom that allows you to build trees of only the parts of a
%document that you need structured access to. It also has features that allow
%you to find your way around the DOM.
% See http://www.prescod.net/python/pulldom

DOM applications typically start by parsing some XML into a DOM.  This
is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
Return a \class{Document} from the given input.  \var{filename_or_file}
may be either a file name or a file-like object.  \var{parser}, if
given, must be a SAX2 parser object.  This function will change the
document handler of the parser and activate namespace support; other
parser configuration (like setting an entity resolver) must have been
done in advance.
\end{funcdesc}

If you have XML in a string, you can use the \function{parseString()}
function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
Return a \class{Document} that represents the \var{string}.  This
function creates a \class{StringIO} object for the string and passes
that on to \function{parse()}.
\end{funcdesc}

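As a minimal sketch of the two entry points described above, the
following parses the same document once from a string and once from a
file-like object.  The sample XML and the use of an in-memory
\class{io.BytesIO} object as a stand-in for an open file are
assumptions made for illustration.

\begin{verbatim}
import io
from xml.dom.minidom import parse, parseString

xml_text = "<inventory><item>spam</item></inventory>"

# Parse directly from a string.
dom_a = parseString(xml_text)

# Parse from a file-like object; BytesIO stands in for an open file here.
dom_b = parse(io.BytesIO(xml_text.encode("utf-8")))

assert dom_a.documentElement.tagName == "inventory"
assert dom_b.documentElement.tagName == "inventory"
\end{verbatim}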
Both functions return a document object representing the content of
the document.

You can also create a document node merely by instantiating a
document object.  You can then add child nodes to it to populate
the DOM.

\begin{verbatim}
from xml.dom.minidom import Document

newdoc = Document()
newel = newdoc.createElement("some_tag")
newdoc.appendChild(newel)
\end{verbatim}

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods.  These properties are
defined in the DOM specification.  The main property of the document
object is the \member{documentElement} property.  It gives you the main
element in the XML document: the one that holds all others.  Here is an
example program:

\begin{verbatim}
from xml.dom.minidom import parseString

dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}

When you are finished with a DOM, you should clean it up.  This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle.  Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

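One way to make sure the cleanup always happens is to wrap the work in
a \keyword{try}/\keyword{finally} block.  This is only a sketch of one
possible pattern; the sample document and variable names are invented
for illustration.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString("<greeting>hello</greeting>")
try:
    # Copy out plain Python values while the tree is still intact.
    text = dom.documentElement.firstChild.data
finally:
    # Break the internal references once the tree is no longer needed.
    dom.unlink()

assert text == "hello"
\end{verbatim}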
\method{unlink()} is a \module{minidom}-specific extension to the DOM
API.  After calling \method{unlink()}, a DOM is basically useless.

\begin{seealso}
\seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification}
{This is the canonical specification for the level of the
DOM supported by \module{xml.dom.minidom}.}
\seetitle[http://pyxml.sourceforge.net]{PyXML}{Users that require a
full-featured implementation of DOM should use the PyXML
package.}
\end{seealso}

\subsection{DOM objects \label{dom-objects}}

The definitive documentation for the DOM is the DOM specification from
the W3C.  This section lists the properties and methods supported by
\refmodule{xml.dom.minidom}.

\begin{classdesc}{Node}{}
All of the components of an XML document are represented by subclasses
of \class{Node}.

\begin{memberdesc}{nodeType}
An integer representing the node type.  Symbolic constants for the
types are on the \class{Node} object: \constant{ELEMENT_NODE},
\constant{ATTRIBUTE_NODE}, \constant{TEXT_NODE},
\constant{CDATA_SECTION_NODE}, \constant{ENTITY_NODE},
\constant{PROCESSING_INSTRUCTION_NODE}, \constant{COMMENT_NODE},
\constant{DOCUMENT_NODE}, \constant{DOCUMENT_TYPE_NODE},
\constant{NOTATION_NODE}.
\end{memberdesc}

\begin{memberdesc}{parentNode}
The parent of the current node.  \code{None} for the document node.
\end{memberdesc}

\begin{memberdesc}{attributes}
An \class{AttributeList} of attribute objects.  Only elements have
this attribute; for other nodes it is \code{None}.
\end{memberdesc}

\begin{memberdesc}{previousSibling}
The node that immediately precedes this one with the same parent.  For
instance, the element with an end-tag that comes just before the
\var{self} element's start-tag.  Of course, XML documents are made
up of more than just elements, so the previous sibling could be text, a
comment, or something else.
\end{memberdesc}

\begin{memberdesc}{nextSibling}
The node that immediately follows this one with the same parent.  See
also \member{previousSibling}.
\end{memberdesc}

\begin{memberdesc}{childNodes}
A list of nodes contained within this node.
\end{memberdesc}

\begin{memberdesc}{firstChild}
Equivalent to \code{childNodes[0]}.
\end{memberdesc}

\begin{memberdesc}{lastChild}
Equivalent to \code{childNodes[-1]}.
\end{memberdesc}

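The navigation members described so far can be exercised on a small
tree.  The following is an illustrative sketch only; the sample
document is invented, and it contains no whitespace between tags so
that the root element has exactly three children.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString("<list><a/>text<b/></list>")
root = dom.documentElement
first, middle, last = root.childNodes

# firstChild and lastChild are the two ends of childNodes.
assert first is root.firstChild and last is root.lastChild

# The middle child is a text node, with an element on either side.
assert middle.nodeType == middle.TEXT_NODE
assert middle.previousSibling is first
assert middle.nextSibling is last
assert first.previousSibling is None

# Every child points back to its parent; the document has none.
assert middle.parentNode is root
assert dom.parentNode is None
\end{verbatim}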
\begin{memberdesc}{nodeName}
Has a different meaning for each node type.  See the DOM specification
for details.  You can always get the information you would get here
from another property such as the \member{tagName} property for
elements or the \member{name} property for attributes.
\end{memberdesc}

\begin{memberdesc}{nodeValue}
Has a different meaning for each node type.  See the DOM specification
for details.  The situation is similar to that with \member{nodeName}.
\end{memberdesc}

\begin{methoddesc}{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC.
\end{methoddesc}

\begin{methoddesc}{writexml}{writer}
Write XML to the writer object.  The writer should have a
\method{write()} method which matches that of the file object
interface.
\end{methoddesc}

\begin{methoddesc}{toxml}{}
Return the XML string that the DOM represents.
\end{methoddesc}

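The two serialization methods can be compared side by side.  In this
sketch, \class{ListWriter} is a hypothetical helper class (it is not
part of \module{xml.dom.minidom}) that shows the only thing
\method{writexml()} is assumed to need from a writer: a
\method{write()} method.

\begin{verbatim}
from xml.dom.minidom import parseString

class ListWriter:
    """Hypothetical writer that collects everything passed to write()."""
    def __init__(self):
        self.parts = []
    def write(self, text):
        self.parts.append(text)

dom = parseString("<note><to>World</to></note>")

# toxml() returns the serialized document as a string.
as_string = dom.toxml()

# writexml() sends the same markup to the writer piece by piece.
writer = ListWriter()
dom.writexml(writer)

assert "".join(writer.parts) == as_string
assert "<to>World</to>" in as_string
\end{verbatim}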
\begin{methoddesc}{hasChildNodes}{}
Returns true if the node has any child nodes.
\end{methoddesc}

\begin{methoddesc}{insertBefore}{newChild, refChild}
Insert a new child node before an existing child.  It must be the case
that \var{refChild} is a child of this node; if not,
\exception{ValueError} is raised.
\end{methoddesc}

\begin{methoddesc}{replaceChild}{newChild, oldChild}
Replace an existing node with a new node.  It must be the case that
\var{oldChild} is a child of this node; if not,
\exception{ValueError} is raised.
\end{methoddesc}

\begin{methoddesc}{removeChild}{oldChild}
Remove a child node.  \var{oldChild} must be a child of this node; if
not, \exception{ValueError} is raised.
\end{methoddesc}

\begin{methoddesc}{appendChild}{newChild}
Add a new child node to the end of this node's list of children.
\end{methoddesc}

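The tree-editing methods above can be combined as in the following
sketch.  The element names are invented for illustration, and every
\var{refChild} or \var{oldChild} passed here really is a child of the
node being edited, so no exception is expected.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString("<menu><egg/><spam/></menu>")
menu = dom.documentElement
egg, spam = menu.childNodes

# Insert a new element in front of an existing child.
bacon = dom.createElement("bacon")
menu.insertBefore(bacon, egg)

# Swap one child for another, then drop a child entirely.
toast = dom.createElement("toast")
menu.replaceChild(toast, egg)
menu.removeChild(spam)

# Append a child at the end of the list.
menu.appendChild(dom.createElement("tea"))

names = [child.tagName for child in menu.childNodes]
assert names == ["bacon", "toast", "tea"]
assert menu.hasChildNodes()
\end{verbatim}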
\begin{methoddesc}{cloneNode}{deep}
Clone this node.  Setting \var{deep} means to clone all child nodes as
well.  Deep cloning is not implemented in Python 2, so the \var{deep}
parameter should always be \code{0} for now.
\end{methoddesc}

\end{classdesc}

\begin{classdesc}{Document}{}
Represents an entire XML document, including its constituent elements,
attributes, processing instructions, comments etc.  Remember that it
inherits properties from \class{Node}.

\begin{memberdesc}{documentElement}
The one and only root element of the document.
\end{memberdesc}

\begin{methoddesc}{createElement}{tagName}
Create a new element.  The element is not inserted into the document
when it is created.  You need to explicitly insert it with one of the
other methods such as \method{insertBefore()} or
\method{appendChild()}.
\end{methoddesc}

\begin{methoddesc}{createTextNode}{data}
Create a text node containing the data passed as a parameter.  As with
the other creation methods, this one does not insert the node into the
tree.
\end{methoddesc}

\begin{methoddesc}{createComment}{data}
Create a comment node containing the data passed as a parameter.  As
with the other creation methods, this one does not insert the node
into the tree.
\end{methoddesc}

\begin{methoddesc}{createProcessingInstruction}{target, data}
Create a processing instruction node containing the \var{target} and
\var{data} passed as parameters.  As with the other creation methods,
this one does not insert the node into the tree.
\end{methoddesc}

\begin{methoddesc}{createAttribute}{name}
Create an attribute node.  This method does not associate the
attribute node with any particular element.  You must use
\method{setAttributeNode()} on the appropriate \class{Element} object
to use the newly created attribute instance.
\end{methoddesc}

\begin{methoddesc}{createElementNS}{namespaceURI, tagName}
Create a new element with a namespace.  The \var{tagName} may have a
prefix.  The element is not inserted into the document when it is
created.  You need to explicitly insert it with one of the other
methods such as \method{insertBefore()} or \method{appendChild()}.
\end{methoddesc}

\begin{methoddesc}{createAttributeNS}{namespaceURI, qualifiedName}
Create an attribute node with a namespace.  The \var{qualifiedName}
may have a prefix.  This method does not associate the attribute node
with any particular element.  You must use
\method{setAttributeNode()} on the appropriate \class{Element} object
to use the newly created attribute instance.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagName}{tagName}
Search for all descendants (direct children, children's children,
etc.) with a particular element type name.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName}
Search for all descendants (direct children, children's children,
etc.) with a particular namespace URI and local name.  The local name
is the part of the qualified name after the prefix.
\end{methoddesc}

\end{classdesc}

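The creation methods of \class{Document} can be used to build a small
tree from scratch and then search it.  This is a sketch only; the
element names and text content are invented for illustration.

\begin{verbatim}
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement("library")
doc.appendChild(root)

# Creation methods return detached nodes; insert them explicitly.
root.appendChild(doc.createComment("catalogue"))
for title in ("Dune", "Emma"):
    book = doc.createElement("book")
    book.appendChild(doc.createTextNode(title))
    root.appendChild(book)

# Search all descendants of the document by element type name.
books = doc.getElementsByTagName("book")
assert len(books) == 2
assert books[1].firstChild.data == "Emma"
assert doc.documentElement is root
\end{verbatim}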
\begin{classdesc}{Element}{}
\begin{memberdesc}{tagName}
The element type name.  In a namespace-using document it may have
colons in it.
\end{memberdesc}

\begin{memberdesc}{localName}
The part of the \member{tagName} following the colon if there is one,
else the entire \member{tagName}.
\end{memberdesc}

\begin{memberdesc}{prefix}
The part of the \member{tagName} preceding the colon if there is one,
else the empty string.
\end{memberdesc}

\begin{memberdesc}{namespaceURI}
The namespace associated with the \member{tagName}.
\end{memberdesc}

\begin{methoddesc}{getAttribute}{attname}
Return an attribute value as a string.
\end{methoddesc}

\begin{methoddesc}{setAttribute}{attname, value}
Set an attribute value from a string.
\end{methoddesc}

\begin{methoddesc}{removeAttribute}{attname}
Remove an attribute by name.
\end{methoddesc}

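A short sketch of the three plain attribute methods follows.  The
element and attribute names are sample values, and the final check
uses the \member{length} member of the element's attribute list to
confirm that one attribute was removed.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString('<img src="cat.png" alt="A cat"/>')
img = dom.documentElement

assert img.getAttribute("src") == "cat.png"

img.setAttribute("width", "64")       # add a new attribute
img.setAttribute("alt", "A tabby")    # overwrite an existing one
img.removeAttribute("src")            # delete by name

assert img.getAttribute("width") == "64"
assert img.getAttribute("alt") == "A tabby"
assert img.attributes.length == 2
\end{verbatim}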
\begin{methoddesc}{getAttributeNS}{namespaceURI, localName}
Return an attribute value as a string, given a \var{namespaceURI} and
\var{localName}.  Note that a local name is the part of a prefixed
attribute name after the colon (if there is one).
\end{methoddesc}

\begin{methoddesc}{setAttributeNS}{namespaceURI, qname, value}
Set an attribute value from a string, given a \var{namespaceURI} and a
\var{qname}.  Note that a qname is the whole attribute name, including
any prefix.  This differs from \method{getAttributeNS()}, which takes
a local name.
\end{methoddesc}

\begin{methoddesc}{removeAttributeNS}{namespaceURI, localName}
Remove an attribute by name.  Note that it uses a \var{localName}, not
a qname.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagName}{tagName}
Same as equivalent method in the \class{Document} class.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName}
Same as equivalent method in the \class{Document} class.
\end{methoddesc}

\end{classdesc}

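The difference between qualified names and local names is easiest to
see on a namespaced document.  The following is a sketch under the
assumption that the document is parsed with namespace support active;
the namespace URIs, prefixes and names are sample values chosen for
illustration.

\begin{verbatim}
from xml.dom.minidom import parseString

SVG_NS = "http://www.w3.org/2000/svg"       # sample namespace URI
XLINK_NS = "http://www.w3.org/1999/xlink"   # sample namespace URI

dom = parseString('<svg:svg xmlns:svg="%s"><svg:rect/></svg:svg>' % SVG_NS)
root = dom.documentElement

# tagName is the qualified name; prefix and localName are its two halves.
assert root.tagName == "svg:svg"
assert root.prefix == "svg"
assert root.localName == "svg"
assert root.namespaceURI == SVG_NS

# Searching takes a namespace URI plus a local name, not a prefix.
rects = dom.getElementsByTagNameNS(SVG_NS, "rect")
assert len(rects) == 1 and rects[0].tagName == "svg:rect"

# Setting takes a qname; getting takes the local name.
root.setAttributeNS(XLINK_NS, "xlink:href", "#a")
assert root.getAttributeNS(XLINK_NS, "href") == "#a"
\end{verbatim}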
\begin{classdesc}{Attribute}{}

\begin{memberdesc}{name}
The attribute name.  In a namespace-using document it may have colons
in it.
\end{memberdesc}

\begin{memberdesc}{localName}
The part of the name following the colon if there is one, else the
entire name.
\end{memberdesc}

\begin{memberdesc}{prefix}
The part of the name preceding the colon if there is one, else the
empty string.
\end{memberdesc}

\begin{memberdesc}{namespaceURI}
The namespace associated with the attribute name.
\end{memberdesc}

\end{classdesc}

\begin{classdesc}{AttributeList}{}

\begin{memberdesc}{length}
The length of the attribute list.
\end{memberdesc}

\begin{methoddesc}{item}{index}
Return an attribute with a particular index.  The order you get the
attributes in is arbitrary but will be consistent for the life of a
DOM.  Each item is an attribute node.  Get its value with the
\member{value} attribute.
\end{methoddesc}

There are also experimental methods that give this class more
dictionary-like behavior.  You can use them or you can use the
standardized \method{getAttribute*()}-family methods.

\end{classdesc}

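Because the order of the attribute list is arbitrary, code that walks
it by index should not rely on positions.  The sketch below, using an
invented sample element, collects the attributes into a dictionary
instead.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString('<point x="1" y="2"/>')
attrs = dom.documentElement.attributes

# Walk the list by index and record each attribute node's name and value.
found = {}
for i in range(attrs.length):
    attr = attrs.item(i)
    found[attr.name] = attr.value

assert found == {"x": "1", "y": "2"}
\end{verbatim}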
\begin{classdesc}{Comment}{}
Represents a comment in the XML document.

\begin{memberdesc}{data}
The content of the comment.
\end{memberdesc}
\end{classdesc}

\begin{classdesc}{Text}{}
Represents text in the XML document.

\begin{memberdesc}{data}
The content of the text node.
\end{memberdesc}
\end{classdesc}

\begin{classdesc}{ProcessingInstruction}{}
Represents a processing instruction in the XML document.

\begin{memberdesc}{target}
The content of the processing instruction up to the first whitespace
character.
\end{memberdesc}

\begin{memberdesc}{data}
The content of the processing instruction following the first
whitespace character.
\end{memberdesc}
\end{classdesc}

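The \member{data} and \member{target} members of the three classes
above can be read straight off a parsed document.  This is an
illustrative sketch; the processing instruction, comment and text in
the sample document are invented.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString("<?render mode fast?><doc><!--draft-->body</doc>")
pi = dom.firstChild
comment, text = dom.documentElement.childNodes

# The target ends at the first whitespace; the rest is the data.
assert pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
assert pi.target == "render"
assert pi.data == "mode fast"

# Comment and text nodes both expose their content as data.
assert comment.data == "draft"
assert text.data == "body"
\end{verbatim}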
Note that DOM attributes may also be manipulated as nodes instead of as
simple strings.  It is fairly rare that you must do this, however, so this
usage is not yet documented here.

\begin{seealso}
\seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification}
{This is the canonical specification for the level of the
DOM supported by \module{xml.dom.minidom}.}
\end{seealso}

\subsection{DOM Example \label{dom-example}}

This is a fairly realistic example of a simple program: it converts a
small slideshow document into HTML.  In this particular case, we do not
take much advantage of the flexibility of the DOM.

\begin{verbatim}
from xml.dom.minidom import parse, parseString

document="""
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = parseString(document)

space=" "
def getText(nodelist):
    # Concatenate the data of every text node in nodelist.
    rc=""
    for node in nodelist:
        if node.nodeType==node.TEXT_NODE:
            rc=rc+node.data
    return rc

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Emit one table-of-contents entry per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
\end{verbatim}

\subsection{minidom and the DOM standard \label{minidom-and-dom}}

Minidom is basically a DOM 1.0-compatible DOM with some DOM 2 features
(primarily namespace features).

Usage of the other DOM interfaces in Python is straightforward.  The
following mapping rules apply:

\begin{itemize}

\item Interfaces are accessed through instance objects.  Applications
should not instantiate the classes themselves; they should use the
creator functions.  Derived interfaces support all operations (and
attributes) from the base interfaces, plus any new operations.

\item Operations are used as methods.  Since the DOM uses only
\code{in} parameters, the arguments are passed in normal order (from
left to right).  There are no optional arguments.  \code{void}
operations return \code{None}.

\item IDL attributes map to instance attributes.  For compatibility
with the OMG IDL language mapping for Python, an attribute \code{foo}
can also be accessed through accessor functions \code{_get_foo} and
\code{_set_foo}.  \code{readonly} attributes must not be changed.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
long long}, and \code{boolean} all map to Python integer objects.

\item The type \code{DOMString} maps to Python strings.
\module{minidom} supports either byte or Unicode strings, but will
normally produce Unicode strings.  Attributes of type \code{DOMString}
may also be \code{None}.

\item \code{const} declarations map to variables in their respective
scope (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
they must not be changed.

\item \code{DOMException} is currently not supported in
\module{minidom}.  Instead, \module{minidom} raises standard Python
exceptions such as \exception{TypeError} and
\exception{AttributeError}.

\end{itemize}

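A few of the mapping rules above can be checked directly.  The sketch
below relies on the integer values that the DOM specification assigns
to the node type constants (1 for elements, 7 for processing
instructions); the sample document and attribute are invented for
illustration.

\begin{verbatim}
from xml.dom.minidom import Node, parseString

dom = parseString("<a/>")
elem = dom.documentElement

# const declarations are variables on the class holding the DOM's values.
assert Node.ELEMENT_NODE == 1
assert Node.PROCESSING_INSTRUCTION_NODE == 7
assert elem.nodeType == Node.ELEMENT_NODE

# void operations return None.
assert elem.setAttribute("id", "x1") is None

# boolean results can be used as ordinary truth values.
assert not elem.hasChildNodes()
\end{verbatim}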
The following interfaces have no equivalent in minidom:

\begin{itemize}

\item DOMTimeStamp

\item DocumentType

\item DOMImplementation

\item CharacterData

\item CDATASection

\item Notation

\item Entity

\item EntityReference

\item DocumentFragment

\end{itemize}

Most of these reflect information in the XML document that is not of
general utility to most DOM users.