:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.


.. class:: HTMLParser()

   The :class:`HTMLParser` class is instantiated without arguments.

   An :class:`HTMLParser` instance is fed HTML data and calls handler methods
   when tags begin and end.  To get the desired behavior, subclass
   :class:`HTMLParser` and override these handler methods.

   This parser does not check that end tags match start tags, and it does not
   call the end-tag handler for elements which are closed implicitly by closing
   an outer element.

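To illustrate the last point, here is a minimal sketch in which the ``<li>``
elements are never closed explicitly. ``TagLogger`` and its ``events`` list are
hypothetical names chosen for this illustration; they are not part of the module.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records start and end tags in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
# Neither <li> has an explicit end tag; both are closed implicitly by </ul>.
logger.feed("<ul><li>one<li>two</ul>")
logger.close()
print(logger.events)
```

Only the ``ul`` end tag is reported. A subclass that needs balanced start and
end events has to track open elements itself.
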

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on which
   the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.


:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.reset()

   Reset the instance, discarding all unprocessed data.  This is called
   implicitly at instantiation time.


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.

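The buffering behavior can be seen by splitting a start tag across two calls to
:meth:`feed`. This is a small sketch; ``StartTagCollector`` and the variable
names are made up for the example.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper that remembers the names of start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = StartTagCollector()

# The first chunk ends in the middle of a start tag, so it is only buffered.
collector.feed("<p cla")
seen_after_first_chunk = list(collector.tags)

# The second chunk completes the tag; the handler is called now.
collector.feed('ss="intro">Hello</p>')
collector.close()
seen_after_all_chunks = list(collector.tags)

print(seen_after_first_chunk, seen_after_all_chunks)
```

This is what makes it possible to pass network or file data to the parser in
whatever chunk sizes it arrives in.
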

.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


.. method:: HTMLParser.getpos()

   Return the current line number and offset.

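One possible use of :meth:`getpos` is to report where each tag starts in the
input. The sketch below assumes the usual convention of the implementation that
lines are counted from 1 and offsets from 0; ``PositionLogger`` is a
hypothetical name.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical helper that stores (tag, (lineno, offset)) pairs."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() is queried while the start tag is being handled.
        self.positions.append((tag, self.getpos()))


locator = PositionLogger()
locator.feed("<html>\n  <body>\n</body></html>")
locator.close()

for tag, (lineno, offset) in locator.positions:
    print("{} starts at line {}, offset {}".format(tag, lineno, offset))
```
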

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

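As a small sketch of the difference between the parsed tag name and the raw
text, the hypothetical ``RawTagEcho`` class below stores both for every start
tag.

```python
from html.parser import HTMLParser


class RawTagEcho(HTMLParser):
    """Hypothetical helper that keeps the original spelling of start tags."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() returns the input unchanged.
        self.raw_tags.append((tag, self.get_starttag_text()))


echo = RawTagEcho()
echo.feed('<A   HREF="http://www.cwi.nl/">CWI</A>')
echo.close()
print(echo.raw_tags)
```

The raw text keeps the original case and the extra spaces between the tag name
and the attribute.
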

.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case;
   in each *value*, the surrounding quotes are removed and character and entity
   references are replaced.  For instance, for the tag ``<A
   HREF="http://www.cwi.nl/">``, this method would be called as
   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

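A common use of the *attrs* list is to collect link targets. The following is
only a sketch; ``LinkCollector``, ``links`` and ``titles`` are names invented
for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that gathers href and title values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; a dict is easier to query.
        attributes = dict(attrs)
        if "href" in attributes:
            self.links.append(attributes["href"])
        if "title" in attributes:
            self.titles.append(attributes["title"])


collector = LinkCollector()
collector.feed(
    '<A HREF="http://www.cwi.nl/">CWI</A> '
    "<a title=\"R&amp;D\" href='/lab'>Lab</a>"
)
collector.close()
print(collector.links)
print(collector.titles)
```

Note that converting *attrs* to a dictionary keeps only one value per name if
an attribute is repeated in the tag.
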

.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<a .../>``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

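The sketch below contrasts the default behavior with a subclass that overrides
:meth:`handle_startendtag`. Both class names and the ``events`` attribute are
hypothetical.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that logs start and end tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagLogger(EventLogger):
    """Variant that reports XHTML-style empty tags as a single event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


default_logger = EventLogger()
default_logger.feed("<br/>")
default_logger.close()

empty_logger = EmptyTagLogger()
empty_logger.feed("<br/><p>")
empty_logger.close()

print(default_logger.events)
print(empty_logger.events)
```
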

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.  The
   *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.

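For example, the text content of a document can be gathered by overriding
:meth:`handle_data`. This is a sketch with a hypothetical ``TextExtractor``
class. The parser may deliver text in several pieces, so the pieces are joined
at the end, and :meth:`close` is called so that any buffered text is flushed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed("<p>Hello, <b>world</b>!</p>")
extractor.close()  # flush anything still held in the buffer

text = "".join(extractor.chunks)
print(text)
```
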

.. method:: HTMLParser.handle_charref(name)

   This method is called to process a character reference of the form ``&#ref;``.
   It is intended to be overridden by a derived class; the base class
   implementation does nothing.


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a general entity reference of the form
   ``&name;`` where *name* is a general entity reference.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.

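The two reference handlers can be combined to decode references in text, as in
the sketch below. It rests on an assumption that goes beyond the text above: on
Python 3.4 and later the constructor accepts a ``convert_charrefs`` argument,
and since Python 3.5 its default causes references in text to be converted
automatically, so these handlers are only called when it is set to ``False``.
``ReferenceDecoder`` is a hypothetical name.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class ReferenceDecoder(HTMLParser):
    """Hypothetical helper that decodes character and entity references."""

    def __init__(self):
        # Assumption: Python 3.4+; older versions take no arguments here.
        super().__init__(convert_charrefs=False)
        self.events = []
        self.decoded = []

    def handle_charref(self, name):
        self.events.append(("char", name))
        # name is "62" for &#62; and "x3E" for &#x3E;.
        if name[0] in "xX":
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.decoded.append(chr(codepoint))

    def handle_entityref(self, name):
        self.events.append(("entity", name))
        # Unknown entity names are kept in their original form.
        if name in name2codepoint:
            self.decoded.append(chr(name2codepoint[name]))
        else:
            self.decoded.append("&{};".format(name))


decoder = ReferenceDecoder()
decoder.feed("<p>&gt;&#62;&#x3E;</p>")
decoder.close()
print(decoder.events)
print("".join(decoder.decoded))
```
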

.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered.  The *data* argument is
   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
   not the delimiters themselves.  For example, the comment ``<!--text-->`` will
   cause this method to be called with the argument ``'text'``.  It is intended to
   be overridden by a derived class; the base class implementation does nothing.

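A short sketch of a comment collector follows; ``CommentCollector`` is a
hypothetical name. Whitespace inside the comment is passed through unchanged.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper that stores the text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


comment_parser = CommentCollector()
comment_parser.feed("<!--text--><p>body</p><!-- second -->")
comment_parser.close()
print(comment_parser.comments)
```
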

.. method:: HTMLParser.handle_decl(decl)

   Method called when an SGML ``doctype`` declaration is read by the parser.
   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup.  It is intended to be overridden by a derived class;
   the base class implementation does nothing.

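As an illustration, the hypothetical ``DoctypeSniffer`` below records the
declaration of a document.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Hypothetical helper that remembers doctype declarations."""

    def __init__(self):
        super().__init__()
        self.declarations = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.declarations.append(decl)


sniffer = DoctypeSniffer()
sniffer.feed("<!DOCTYPE html><html></html>")
sniffer.close()
print(sniffer.declarations)
```
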

.. method:: HTMLParser.unknown_decl(data)

   Method called when an unrecognized SGML declaration is read by the parser.
   The *data* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup.  It is sometimes useful for this method to be
   overridden by a derived class; the base class implementation raises an
   :exc:`HTMLParseError`.


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

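The difference between the two styles of processing instruction can be checked
with a short sketch; ``PICollector`` is a hypothetical name.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper that stores processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_parser = PICollector()
# SGML style: terminated by ">" alone.
pi_parser.feed("<?proc color='red'>")
# XHTML style: the trailing "?" ends up in the data.
pi_parser.feed('<?xml version="1.0"?>')
pi_parser.close()
print(pi_parser.instructions)
```

A subclass that handles XHTML input may therefore want to strip a trailing
``'?'`` from *data* itself.
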

.. _htmlparser-example:

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out tags as they are encountered::

   >>> from html.parser import HTMLParser
   >>>
   >>> class MyHTMLParser(HTMLParser):
   ...     def handle_starttag(self, tag, attrs):
   ...         print("Encountered a {} start tag".format(tag))
   ...     def handle_endtag(self, tag):
   ...         print("Encountered a {} end tag".format(tag))
   ...
   >>> page = """<html><h1>Title</h1><p>I'm a paragraph!</p></html>"""
   >>>
   >>> myparser = MyHTMLParser()
   >>> myparser.feed(page)
   Encountered a html start tag
   Encountered a h1 start tag
   Encountered a h1 end tag
   Encountered a p start tag
   Encountered a p end tag
   Encountered a html end tag