diff --git a/Doc/library/html.parser.rst b/Doc/library/html.parser.rst
index 7c44bec73d8..f3c36ec8867 100644
--- a/Doc/library/html.parser.rst
+++ b/Doc/library/html.parser.rst
@@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
.. class:: HTMLParser(strict=True)
Create a parser instance. If *strict* is ``True`` (the default), invalid
- html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
+ HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
*strict* is ``False``, the parser uses heuristics to make a best guess at
- the intention of any invalid html it encounters, similar to the way most
- browsers do.
+ the intention of any invalid HTML it encounters, similar to the way most
+ browsers do. Using ``strict=False`` is advised.
- An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
- begin and end. The :class:`HTMLParser` class is meant to be overridden by the
- user to provide a desired behavior.
+ An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
+ when start tags, end tags, text, comments, and other markup elements are
+ encountered. The user should subclass :class:`.HTMLParser` and override its
+ methods to implement the desired behavior.
This parser does not check that end tags match start tags or call the end-tag
handler for elements which are closed implicitly by closing an outer element.
@@ -39,25 +40,61 @@ An exception is defined as well:
.. exception:: HTMLParseError
Exception raised by the :class:`HTMLParser` class when it encounters an error
- while parsing. This exception provides three attributes: :attr:`msg` is a brief
- message explaining the error, :attr:`lineno` is the number of the line on which
- the broken construct was detected, and :attr:`offset` is the number of
- characters into the line at which the construct starts.
+ while parsing and *strict* is ``True``. This exception provides three
+ attributes: :attr:`msg` is a brief message explaining the error,
+ :attr:`lineno` is the number of the line on which the broken construct was
+ detected, and :attr:`offset` is the number of characters into the line at
+ which the construct starts.
+
+
+Example HTML Parser Application
+-------------------------------
+
+As a basic example, below is a simple HTML parser that uses the
+:class:`HTMLParser` class to print out start tags, end tags, and data
+as they are encountered::
+
+ from html.parser import HTMLParser
+
+ class MyHTMLParser(HTMLParser):
+     def handle_starttag(self, tag, attrs):
+         print("Encountered a start tag:", tag)
+     def handle_endtag(self, tag):
+         print("Encountered an end tag :", tag)
+     def handle_data(self, data):
+         print("Encountered some data :", data)
+
+ parser = MyHTMLParser(strict=False)
+ parser.feed('<html><head><title>Test</title></head>'
+             '<body><h1>Parse me!</h1></body></html>')
+
+The output will then be::
+
+ Encountered a start tag: html
+ Encountered a start tag: head
+ Encountered a start tag: title
+ Encountered some data : Test
+ Encountered an end tag : title
+ Encountered an end tag : head
+ Encountered a start tag: body
+ Encountered a start tag: h1
+ Encountered some data : Parse me!
+ Encountered an end tag : h1
+ Encountered an end tag : body
+ Encountered an end tag : html
+
+
+:class:`.HTMLParser` Methods
+----------------------------
:class:`HTMLParser` instances have the following methods:
-.. method:: HTMLParser.reset()
-
- Reset the instance. Loses all unprocessed data. This is called implicitly at
- instantiation time.
-
-
.. method:: HTMLParser.feed(data)
Feed some text to the parser. It is processed insofar as it consists of
complete elements; incomplete data is buffered until more data is fed or
- :meth:`close` is called.
+ :meth:`close` is called. *data* must be :class:`str`.
.. method:: HTMLParser.close()
@@ -68,6 +105,12 @@ An exception is defined as well:
the :class:`HTMLParser` base class method :meth:`close`.
+.. method:: HTMLParser.reset()
+
+ Reset the instance. Loses all unprocessed data. This is called implicitly at
+ instantiation time.
+
+
.. method:: HTMLParser.getpos()
Return current line number and offset.
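As a minimal sketch of how :meth:`getpos` might be used, the hypothetical ``PositionLogger`` subclass below (not part of the module) records the position reported while each start tag is being handled. It is constructed with default arguments and fed well-formed markup only.

```python
from html.parser import HTMLParser

class PositionLogger(HTMLParser):
    """Hypothetical helper: remembers where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple for the construct
        # currently being handled.
        self.positions.append((tag, self.getpos()))

logger = PositionLogger()
logger.feed('<p>\n<b>')
logger.close()
print(logger.positions)
```

Line numbers start at 1 and offsets at 0, so the sketch reports line 1, offset 0 for ``p`` and line 2, offset 0 for ``b``.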
@@ -81,23 +124,35 @@ An exception is defined as well:
attributes can be preserved, etc.).
+The following methods are called when data or markup elements are encountered
+and they are meant to be overridden in a subclass. The base class
+implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
+
+
.. method:: HTMLParser.handle_starttag(tag, attrs)
- This method is called to handle the start of a tag. It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called to handle the start of a tag (e.g. ``<div id="main">``).
The *tag* argument is the name of the tag converted to lower case. The *attrs*
argument is a list of ``(name, value)`` pairs containing the attributes found
inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
and quotes in the *value* have been removed, and character and entity references
- have been replaced. For instance, for the tag ``<A
- HREF="http://www.cwi.nl/">``, this method would be called as
- ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
+ have been replaced.
+
+ For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
+ would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
All entity references from :mod:`html.entities` are replaced in the attribute
values.
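The ``(name, value)`` pairs described above can be consumed directly in an overridden handler. The following is a sketch only: ``LinkCollector`` is a hypothetical subclass introduced for illustration, and it collects the ``href`` values of ``a`` start tags.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive in lower case;
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
collector.close()
print(collector.links)
```

Because names are lower-cased before the handler is called, the comparison against ``'a'`` and ``'href'`` matches the upper-case markup in the input.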
+.. method:: HTMLParser.handle_endtag(tag)
+
+ This method is called to handle the end tag of an element (e.g. ``</div>``).
+
+ The *tag* argument is the name of the tag converted to lower case.
+
+
.. method:: HTMLParser.handle_startendtag(tag, attrs)
Similar to :meth:`handle_starttag`, but called when the parser encounters an
@@ -106,57 +161,46 @@ An exception is defined as well:
implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
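The default behaviour of :meth:`handle_startendtag` can be observed without overriding it. In this sketch the hypothetical ``EventRecorder`` subclass overrides only the start-tag and end-tag handlers and is then fed a single XHTML-style empty tag.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Hypothetical helper: records the order in which handlers fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

recorder = EventRecorder()
# handle_startendtag is not overridden, so the base class
# implementation forwards to the two handlers above.
recorder.feed('<br/>')
recorder.close()
print(recorder.events)
```

A subclass that needs to tell ``<br/>`` apart from ``<br></br>`` would override :meth:`handle_startendtag` itself instead of relying on the forwarding.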
-.. method:: HTMLParser.handle_endtag(tag)
-
- This method is called to handle the end tag of an element. It is intended to be
- overridden by a derived class; the base class implementation does nothing. The
- *tag* argument is the name of the tag converted to lower case.
-
-
.. method:: HTMLParser.handle_data(data)
- This method is called to process arbitrary data (e.g. the content of
- ``<script>...</script>`` and ``<style>...</style>``). It is intended to be
- overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: HTMLParser.handle_charref(name)
-
- This method is called to process a character reference of the form ``&#ref;``.
- It is intended to be overridden by a derived class; the base class
- implementation does nothing.
+ This method is called to process arbitrary data (e.g. text nodes and the
+ content of ``<script>...</script>`` and ``<style>...</style>``).
.. method:: HTMLParser.handle_entityref(name)
- This method is called to process a general entity reference of the form
- ``&name;`` where *name* is an general entity reference. It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called to process a named character reference of the form
+ ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
+ (e.g. ``'gt'``).
+
+
+.. method:: HTMLParser.handle_charref(name)
+
+ This method is called to process decimal and hexadecimal numeric character
+ references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
+ equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
+ in this case the method will receive ``'62'`` or ``'x3E'``.
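The *name* strings passed to :meth:`handle_charref` can be turned back into characters with a few lines of code. The function below is a hypothetical helper, not part of the module; it assumes the argument has the ``'62'`` or ``'x3E'`` shape described above and does no further validation.

```python
def charref_to_char(name):
    """Hypothetical helper: decode the name of a numeric character reference."""
    # A leading 'x' (or 'X') marks a hexadecimal reference.
    if name[:1] in ('x', 'X'):
        return chr(int(name[1:], 16))
    # Otherwise the name is a decimal code point.
    return chr(int(name))

print(charref_to_char('62'), charref_to_char('x3E'))
```

Both calls yield ``'>'``, since decimal 62 and hexadecimal 3E name the same code point.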
.. method:: HTMLParser.handle_comment(data)
- This method is called when a comment is encountered. The *comment* argument is
- a string containing the text between the ``--`` and ``--`` delimiters, but not
- the delimiters themselves. For example, the comment ``<!--text-->`` will cause
- this method to be called with the argument ``'text'``. It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called when a comment is encountered (e.g. ``<!--comment-->``).
+
+ For example, the comment ``<!-- comment -->`` will cause this method to be
+ called with the argument ``' comment '``.
+
+ The content of Internet Explorer conditional comments (condcoms) will also be
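+ sent to this method, so, for ``<!--[if IE 9]>IE-specific content<![endif]-->``,
+ this method will receive ``'[if IE 9]>IE-specific content<![endif]'``.
.. method:: HTMLParser.handle_decl(decl)
- Method called when an SGML ``doctype`` declaration is read by the parser.
+ This method is called to handle an HTML doctype declaration (e.g.
+ ``<!DOCTYPE html>``).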
+
The *decl* parameter will be the entire contents of the declaration inside
- the ``<!...>`` markup. It is intended to be overridden by a derived class;
- the base class implementation does nothing.
-
-
-.. method:: HTMLParser.unknown_decl(data)
-
- Method called when an unrecognized SGML declaration is read by the parser.
- The *data* parameter will be the entire contents of the declaration inside
- the ``<!...>`` markup. It is sometimes useful to be overridden by a
- derived class; the base class implementation raises an :exc:`HTMLParseError`.
+ the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
.. method:: HTMLParser.handle_pi(data)
@@ -174,29 +218,123 @@ An exception is defined as well:
cause the ``'?'`` to be included in *data*.
-.. _htmlparser-example:
+.. method:: HTMLParser.unknown_decl(data)
-Example HTML Parser Application
--------------------------------
+ This method is called when an unrecognized declaration is read by the parser.
-As a basic example, below is a simple HTML parser that uses the
-:class:`HTMLParser` class to print out start tags, end tags, and data
-as they are encountered::
+ The *data* parameter will be the entire contents of the declaration inside
+ the ``<!...>`` markup. It is sometimes useful to override this method in a
+ derived class. The base class implementation raises an :exc:`HTMLParseError`
+ when *strict* is ``True``.
+
+
+.. _htmlparser-examples:
+
+Examples
+--------
+
+The following class implements a parser that will be used to illustrate more
+examples::
  from html.parser import HTMLParser
+ from html.entities import name2codepoint
  class MyHTMLParser(HTMLParser):
      def handle_starttag(self, tag, attrs):
-         print("Encountered a start tag:", tag)
+         print("Start tag:", tag)
+         for attr in attrs:
+             print(" attr:", attr)
      def handle_endtag(self, tag):
-         print("Encountered an end tag:", tag)
+         print("End tag :", tag)
      def handle_data(self, data):
-         print("Encountered some data:", data)
+         print("Data :", data)
+     def handle_comment(self, data):
+         print("Comment :", data)
+     def handle_entityref(self, name):
+         c = chr(name2codepoint[name])
+         print("Named ent:", c)
+     def handle_charref(self, name):
+         if name.startswith('x'):
+             c = chr(int(name[1:], 16))
+         else:
+             c = chr(int(name))
+         print("Num ent :", c)
+     def handle_decl(self, data):
+         print("Decl :", data)
- parser = MyHTMLParser()
- parser.feed('<html><head><title>Test</title></head>'
-             '<body><h1>Parse me!</h1></body></html>')
+ parser = MyHTMLParser(strict=False)
+Parsing a doctype::
+
+ >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
+ ...             '"http://www.w3.org/TR/html4/strict.dtd">')
+ Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
+
+Parsing an element with a few attributes and a title::
+
+ >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
+ Start tag: img
+ attr: ('src', 'python-logo.png')
+ attr: ('alt', 'The Python logo')
+ >>>
+ >>> parser.feed('<h1>Python</h1>')
+ Start tag: h1
+ Data : Python
+ End tag : h1
+
+The content of ``script`` and ``style`` elements is returned as is, without
+further parsing::
+
+ >>> parser.feed('<style type="text/css">#python { color: green }</style>')
+ Start tag: style
+ attr: ('type', 'text/css')
+ Data : #python { color: green }
+ End tag : style
+ >>>
+ >>> parser.feed('<script type="text/javascript">'
+ ...             'alert("<strong>hello!</strong>");</script>')
+ Start tag: script
+ attr: ('type', 'text/javascript')
+ Data : alert("<strong>hello!</strong>");
+ End tag : script
+
+Parsing comments::
+
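+ >>> parser.feed('<!-- a comment -->'
+ ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
+ Comment : a comment
+ Comment : [if IE 9]>IE-specific content<![endif]
+
+Parsing named and numeric character references and converting them to the
+correct char (note: these 3 references are all equivalent to ``'>'``)::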
+
+ >>> parser.feed('&gt;&#62;&#x3E;')
+ Named ent: >
+ Num ent : >
+ Num ent : >
+
+Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
+:meth:`~HTMLParser.handle_data` might be called more than once::
+
+ >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
+ ...     parser.feed(chunk)
+ ...
+ Start tag: span
+ Data : buff
+ Data : ered
+ Data : text
+ End tag : span
+
+Parsing invalid HTML (e.g. unquoted attributes) also works::
+
+ >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
+ Start tag: p
+ Start tag: a
+ attr: ('class', 'link')
+ attr: ('href', '#main')
+ Data : tag soup
+ End tag : p
+ End tag : a
.. rubric:: Footnotes