Issue #17741: Rename IncrementalParser and its methods.

The new names are hopefully more descriptive and consistent. If you feel you
don't agree with this change, *please* read issue 17741 first - there's a lot of
discussion in there.
This commit is contained in:
Eli Bendersky 2013-08-30 05:51:20 -07:00
parent 1221f6b603
commit b586934f0e
4 changed files with 151 additions and 133 deletions

View File

@ -105,37 +105,42 @@ Children are nested, and we can access specific child nodes by index::
>>> root[0][1].text
'2008'
Incremental parsing
^^^^^^^^^^^^^^^^^^^
Pull API for asynchronous parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It's possible to parse XML incrementally (i.e. not the whole document at once).
The most powerful tool for doing this is :class:`IncrementalParser`. It does
not require a blocking read to obtain the XML data, and is instead fed with
data incrementally with :meth:`IncrementalParser.data_received` calls. To get
the parsed XML elements, call :meth:`IncrementalParser.events`. Here's an
example::
Most parsing functions provided by this module require to read the whole
document at once before returning any result. It is possible to use a
:class:`XMLParser` and feed data into it incrementally, but it's a push API that
calls methods on a callback target, which is too low-level and inconvenient for
most needs. Sometimes what the user really wants is to be able to parse XML
incrementally, without blocking operations, while enjoying the convenience of
fully constructed :class:`Element` objects.
>>> incparser = ET.IncrementalParser(['start', 'end'])
>>> incparser.data_received('<mytag>sometext')
>>> list(incparser.events())
[('start', <Element 'mytag' at 0x7fba3f2a8688>)]
>>> incparser.data_received(' more text</mytag>')
>>> for event, elem in incparser.events():
... print(event)
... print(elem.tag, 'text=', elem.text)
...
end
mytag text= sometext more text
The most powerful tool for doing this is :class:`XMLPullParser`. It does not
require a blocking read to obtain the XML data, and is instead fed with data
incrementally with :meth:`XMLPullParser.feed` calls. To get the parsed XML
elements, call :meth:`XMLPullParser.read_events`. Here's an example::
>>> asyncparser = ET.XMLPullParser(['start', 'end'])
>>> asyncparser.feed('<mytag>sometext')
>>> list(asyncparser.read_events())
[('start', <Element 'mytag' at 0x7fa66db2be58>)]
>>> asyncparser.feed(' more text</mytag>')
>>> for event, elem in asyncparser.read_events():
... print(event)
... print(elem.tag, 'text=', elem.text)
...
end
The obvious use case is applications that operate in an asynchronous fashion
where the XML data is being received from a socket or read incrementally from
some storage device. In such cases, blocking reads are unacceptable.
Because it's so flexible, :class:`IncrementalParser` can be inconvenient
to use for simpler use-cases. If you don't mind your application blocking on
reading XML data but would still like to have incremental parsing capabilities,
take a look at :func:`iterparse`. It can be useful when you're reading a large
XML document and don't want to hold it wholly in memory.
Because it's so flexible, :class:`XMLPullParser` can be inconvenient to use for
simpler use-cases. If you don't mind your application blocking on reading XML
data but would still like to have incremental parsing capabilities, take a look
at :func:`iterparse`. It can be useful when you're reading a large XML document
and don't want to hold it wholly in memory.
Finding interesting elements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -412,29 +417,32 @@ Functions
Parses an XML section into an element tree incrementally, and reports what's
going on to the user. *source* is a filename or :term:`file object`
containing XML data. *events* is a sequence of events to report back. The
supported events are the strings ``"start"``, ``"end"``, ``"start-ns"``
and ``"end-ns"`` (the "ns" events are used to get detailed namespace
supported events are the strings ``"start"``, ``"end"``, ``"start-ns"`` and
``"end-ns"`` (the "ns" events are used to get detailed namespace
information). If *events* is omitted, only ``"end"`` events are reported.
*parser* is an optional parser instance. If not given, the standard
:class:`XMLParser` parser is used. *parser* can only use the default
:class:`TreeBuilder` as a target. Returns an :term:`iterator` providing
``(event, elem)`` pairs.
:class:`XMLParser` parser is used. *parser* must be a subclass of
:class:`XMLParser` and can only use the default :class:`TreeBuilder` as a
target. Returns an :term:`iterator` providing ``(event, elem)`` pairs.
Note that while :func:`iterparse` builds the tree incrementally, it issues
blocking reads on *source* (or the file it names). As such, it's unsuitable
for asynchronous applications where blocking reads can't be made. For fully
asynchronous parsing, see :class:`IncrementalParser`.
asynchronous parsing, see :class:`XMLPullParser`.
.. note::
:func:`iterparse` only guarantees that it has seen the ">"
character of a starting tag when it emits a "start" event, so the
attributes are defined, but the contents of the text and tail attributes
are undefined at that point. The same applies to the element children;
they may or may not be present.
:func:`iterparse` only guarantees that it has seen the ">" character of a
starting tag when it emits a "start" event, so the attributes are defined,
but the contents of the text and tail attributes are undefined at that
point. The same applies to the element children; they may or may not be
present.
If you need a fully populated element, look for "end" events instead.
.. deprecated:: 3.4
The *parser* argument.
.. function:: parse(source, parser=None)
Parses an XML section into an element tree. *source* is a filename or file
@ -871,48 +879,6 @@ QName Objects
:class:`QName` instances are opaque.
IncrementalParser Objects
^^^^^^^^^^^^^^^^^^^^^^^^^
.. class:: IncrementalParser(events=None, parser=None)
An incremental, event-driven parser suitable for non-blocking applications.
*events* is a sequence of events to report back. The supported events are
the strings ``"start"``, ``"end"``, ``"start-ns"`` and ``"end-ns"`` (the "ns"
events are used to get detailed namespace information). If *events* is
omitted, only ``"end"`` events are reported. *parser* is an optional
parser instance. If not given, the standard :class:`XMLParser` parser is
used. *parser* can only use the default :class:`TreeBuilder` as a target.
.. method:: data_received(data)
Feed the given bytes data to the incremental parser.
.. method:: eof_received()
Signal the incremental parser that the data stream is terminated.
.. method:: events()
Iterate over the events which have been encountered in the data fed
to the parser. This method yields ``(event, elem)`` pairs, where
*event* is a string representing the type of event (e.g. ``"end"``)
and *elem* is the encountered :class:`Element` object. Events
provided in a previous call to :meth:`events` will not be yielded
again.
.. note::
:class:`IncrementalParser` only guarantees that it has seen the ">"
character of a starting tag when it emits a "start" event, so the
attributes are defined, but the contents of the text and tail attributes
are undefined at that point. The same applies to the element children;
they may or may not be present.
If you need a fully populated element, look for "end" events instead.
.. versionadded:: 3.4
.. _elementtree-treebuilder-objects:
@ -973,13 +939,17 @@ XMLParser Objects
.. class:: XMLParser(html=0, target=None, encoding=None)
:class:`Element` structure builder for XML source data, based on the expat
parser. *html* are predefined HTML entities. This flag is not supported by
the current implementation. *target* is the target object. If omitted, the
builder uses an instance of the standard :class:`TreeBuilder` class.
*encoding* [1]_ is optional. If given, the value overrides the encoding
This class is the low-level building block of the module. It uses
:mod:`xml.parsers.expat` for efficient, event-based parsing of XML. It can
be fed XML data incrementall with the :meth:`feed` method, and parsing events
are translated to a push API - by invoking callbacks on the *target* object.
If *target* is omitted, the standard :class:`TreeBuilder` is used. The
*html* argument was historically used for backwards compatibility and is now
deprecated. If *encoding* [1]_ is given, the value overrides the encoding
specified in the XML file.
.. deprecated:: 3.4
The *html* argument.
.. method:: close()
@ -999,12 +969,12 @@ XMLParser Objects
Feeds data to the parser. *data* is encoded data.
:meth:`XMLParser.feed` calls *target*\'s ``start()`` method
for each opening tag, its ``end()`` method for each closing tag,
and data is processed by method ``data()``. :meth:`XMLParser.close`
calls *target*\'s method ``close()``.
:class:`XMLParser` can be used not only for building a tree structure.
This is an example of counting the maximum depth of an XML file::
:meth:`XMLParser.feed` calls *target*\'s ``start(tag, attrs_dict)`` method
for each opening tag, its ``end(tag)`` method for each closing tag, and data
is processed by method ``data(data)``. :meth:`XMLParser.close` calls
*target*\'s method ``close()``. :class:`XMLParser` can be used not only for
building a tree structure. This is an example of counting the maximum depth
of an XML file::
>>> from xml.etree.ElementTree import XMLParser
>>> class MaxDepth: # The target object of the parser
@ -1038,6 +1008,51 @@ XMLParser Objects
>>> parser.close()
4
.. _elementtree-xmlpullparser-objects:
XMLPullParser Objects
^^^^^^^^^^^^^^^^^^^^^
.. class:: XMLPullParser(events=None)
A pull parser suitable for nonblocking (asynchronous) applications. Its
input-side API is similar to that of :class:`XMLParser`, but instead of
pushing calls to a callback target, :class:`XMLPullParser` collects an
internal list of parsing events and lets the user read from it. *events* is a
sequence of events to report back. The supported events are the strings
``"start"``, ``"end"``, ``"start-ns"`` and ``"end-ns"`` (the "ns" events are
used to get detailed namespace information). If *events* is omitted, only
``"end"`` events are reported.
.. method:: feed(data)
Feed the given bytes data to the parser.
.. method:: close()
Signal the parser that the data stream is terminated.
.. method:: read_events()
Iterate over the events which have been encountered in the data fed to the
parser. This method yields ``(event, elem)`` pairs, where *event* is a
string representing the type of event (e.g. ``"end"``) and *elem* is the
encountered :class:`Element` object. Events provided in a previous call
to :meth:`read_events` will not be yielded again.
.. note::
:class:`XMLPullParser` only guarantees that it has seen the ">"
character of a starting tag when it emits a "start" event, so the
attributes are defined, but the contents of the text and tail attributes
are undefined at that point. The same applies to the element children;
they may or may not be present.
If you need a fully populated element, look for "end" events instead.
.. versionadded:: 3.4
Exceptions
^^^^^^^^^^

View File

@ -402,10 +402,9 @@ xml.etree
---------
Add an event-driven parser for non-blocking applications,
:class:`~xml.etree.ElementTree.IncrementalParser`.
(Contributed by Antoine Pitrou in :issue:`17782`.)
:class:`~xml.etree.ElementTree.XMLPullParser`.
(Contributed by Antoine Pitrou in :issue:`17741`.)
Other improvements
==================

View File

@ -950,24 +950,24 @@ class ElementTreeTest(unittest.TestCase):
self.assertEqual(serialized, expected)
class IncrementalParserTest(unittest.TestCase):
class XMLPullParserTest(unittest.TestCase):
def _feed(self, parser, data, chunk_size=None):
if chunk_size is None:
parser.data_received(data)
parser.feed(data)
else:
for i in range(0, len(data), chunk_size):
parser.data_received(data[i:i+chunk_size])
parser.feed(data[i:i+chunk_size])
def assert_event_tags(self, parser, expected):
events = parser.events()
events = parser.read_events()
self.assertEqual([(action, elem.tag) for action, elem in events],
expected)
def test_simple_xml(self):
for chunk_size in (None, 1, 5):
with self.subTest(chunk_size=chunk_size):
parser = ET.IncrementalParser()
parser = ET.XMLPullParser()
self.assert_event_tags(parser, [])
self._feed(parser, "<!-- comment -->\n", chunk_size)
self.assert_event_tags(parser, [])
@ -985,14 +985,14 @@ class IncrementalParserTest(unittest.TestCase):
])
self._feed(parser, "</root>\n", chunk_size)
self.assert_event_tags(parser, [('end', 'root')])
# Receiving EOF sets the `root` attribute
# Closing sets the `root` attribute
self.assertIs(parser.root, None)
parser.eof_received()
parser.close()
self.assertEqual(parser.root.tag, 'root')
def test_data_received_while_iterating(self):
parser = ET.IncrementalParser()
it = parser.events()
def test_feed_while_iterating(self):
parser = ET.XMLPullParser()
it = parser.read_events()
self._feed(parser, "<root>\n <element key='value'>text</element>\n")
action, elem = next(it)
self.assertEqual((action, elem.tag), ('end', 'element'))
@ -1003,7 +1003,7 @@ class IncrementalParserTest(unittest.TestCase):
next(it)
def test_simple_xml_with_ns(self):
parser = ET.IncrementalParser()
parser = ET.XMLPullParser()
self.assert_event_tags(parser, [])
self._feed(parser, "<!-- comment -->\n")
self.assert_event_tags(parser, [])
@ -1021,32 +1021,32 @@ class IncrementalParserTest(unittest.TestCase):
])
self._feed(parser, "</root>\n")
self.assert_event_tags(parser, [('end', '{namespace}root')])
# Receiving EOF sets the `root` attribute
# Closing sets the `root` attribute
self.assertIs(parser.root, None)
parser.eof_received()
parser.close()
self.assertEqual(parser.root.tag, '{namespace}root')
def test_ns_events(self):
parser = ET.IncrementalParser(events=('start-ns', 'end-ns'))
parser = ET.XMLPullParser(events=('start-ns', 'end-ns'))
self._feed(parser, "<!-- comment -->\n")
self._feed(parser, "<root xmlns='namespace'>\n")
self.assertEqual(
list(parser.events()),
list(parser.read_events()),
[('start-ns', ('', 'namespace'))])
self._feed(parser, "<element key='value'>text</element")
self._feed(parser, ">\n")
self._feed(parser, "<element>text</element>tail\n")
self._feed(parser, "<empty-element/>\n")
self._feed(parser, "</root>\n")
self.assertEqual(list(parser.events()), [('end-ns', None)])
parser.eof_received()
self.assertEqual(list(parser.read_events()), [('end-ns', None)])
parser.close()
def test_events(self):
parser = ET.IncrementalParser(events=())
parser = ET.XMLPullParser(events=())
self._feed(parser, "<root/>\n")
self.assert_event_tags(parser, [])
parser = ET.IncrementalParser(events=('start', 'end'))
parser = ET.XMLPullParser(events=('start', 'end'))
self._feed(parser, "<!-- comment -->\n")
self.assert_event_tags(parser, [])
self._feed(parser, "<root>\n")
@ -1064,12 +1064,12 @@ class IncrementalParserTest(unittest.TestCase):
('end', '{foo}element'),
])
self._feed(parser, "</root>")
parser.eof_received()
parser.close()
self.assertIs(parser.root, None)
self.assert_event_tags(parser, [('end', 'root')])
self.assertEqual(parser.root.tag, 'root')
parser = ET.IncrementalParser(events=('start',))
parser = ET.XMLPullParser(events=('start',))
self._feed(parser, "<!-- comment -->\n")
self.assert_event_tags(parser, [])
self._feed(parser, "<root>\n")
@ -1085,13 +1085,13 @@ class IncrementalParserTest(unittest.TestCase):
('start', '{foo}empty-element'),
])
self._feed(parser, "</root>")
parser.eof_received()
parser.close()
self.assertEqual(parser.root.tag, 'root')
def test_events_sequence(self):
# Test that events can be some sequence that's not just a tuple or list
eventset = {'end', 'start'}
parser = ET.IncrementalParser(events=eventset)
parser = ET.XMLPullParser(events=eventset)
self._feed(parser, "<foo>bar</foo>")
self.assert_event_tags(parser, [('start', 'foo'), ('end', 'foo')])
@ -1103,14 +1103,14 @@ class IncrementalParserTest(unittest.TestCase):
def __next__(self):
return next(self.events)
parser = ET.IncrementalParser(events=DummyIter())
parser = ET.XMLPullParser(events=DummyIter())
self._feed(parser, "<foo>bar</foo>")
self.assert_event_tags(parser, [('start', 'foo'), ('end', 'foo')])
def test_unknown_event(self):
with self.assertRaises(ValueError):
ET.IncrementalParser(events=('start', 'end', 'bogus'))
ET.XMLPullParser(events=('start', 'end', 'bogus'))
#
@ -2546,7 +2546,6 @@ def test_main(module=None):
ElementSlicingTest,
BasicElementTest,
ElementTreeTest,
IncrementalParserTest,
IOTest,
ParseErrorTest,
XIncludeTest,
@ -2555,6 +2554,7 @@ def test_main(module=None):
ElementIterTest,
TreeBuilderTest,
XMLParserTest,
XMLPullParserTest,
BugsTest,
]

View File

@ -1210,37 +1210,39 @@ def iterparse(source, events=None, parser=None):
return _IterParseIterator(source, events, parser, close_source)
class IncrementalParser:
class XMLPullParser:
def __init__(self, events=None, *, _parser=None):
# The _parser argument is for internal use only and must not be relied
# upon in user code. It will be removed in a future release.
# See http://bugs.python.org/issue17741 for more details.
def __init__(self, events=None, parser=None):
# _elementtree.c expects a list, not a deque
self._events_queue = []
self._index = 0
self.root = self._root = None
if not parser:
parser = XMLParser(target=TreeBuilder())
self._parser = parser
self._parser = _parser or XMLParser(target=TreeBuilder())
# wire up the parser for event reporting
if events is None:
events = ("end",)
self._parser._setevents(self._events_queue, events)
def data_received(self, data):
def feed(self, data):
if self._parser is None:
raise ValueError("data_received() called after end of stream")
raise ValueError("feed() called after end of stream")
if data:
try:
self._parser.feed(data)
except SyntaxError as exc:
self._events_queue.append(exc)
def eof_received(self):
def close(self):
self._root = self._parser.close()
self._parser = None
if self._index >= len(self._events_queue):
self.root = self._root
def events(self):
def read_events(self):
events = self._events_queue
while True:
index = self._index
@ -1268,14 +1270,16 @@ class IncrementalParser:
class _IterParseIterator:
def __init__(self, source, events, parser, close_source=False):
self._parser = IncrementalParser(events, parser)
# Use the internal, undocumented _parser argument for now; When the
# parser argument of iterparse is removed, this can be killed.
self._parser = XMLPullParser(events=events, _parser=parser)
self._file = source
self._close_file = close_source
self.root = None
def __next__(self):
while 1:
for event in self._parser.events():
for event in self._parser.read_events():
return event
if self._parser._parser is None:
self.root = self._parser.root
@ -1283,11 +1287,11 @@ class _IterParseIterator:
self._file.close()
raise StopIteration
# load event buffer
data = self._file.read(16384)
data = self._file.read(16 * 1024)
if data:
self._parser.data_received(data)
self._parser.feed(data)
else:
self._parser.eof_received()
self._parser.close()
def __iter__(self):
return self
@ -1481,9 +1485,9 @@ class XMLParser:
pass # unknown
def _setevents(self, events_queue, events_to_report):
# Internal API for IncrementalParser
# Internal API for XMLPullParser
# events_to_report: a list of events to report during parsing (same as
# the *events* of IncrementalParser's constructor.
# the *events* of XMLPullParser's constructor.
# events_queue: a list of actual parsing events that will be populated
# by the underlying parser.
#