From 4279bc7aef89ff668b81e5dd7514cc5ec281a753 Mon Sep 17 00:00:00 2001 From: Ezio Melotti Date: Sat, 18 Feb 2012 02:01:36 +0200 Subject: [PATCH] #14020: improve HTMLParser documentation. --- Doc/library/html.parser.rst | 278 +++++++++++++++++++++++++++--------- 1 file changed, 208 insertions(+), 70 deletions(-) diff --git a/Doc/library/html.parser.rst b/Doc/library/html.parser.rst index 7c44bec73d8..f3c36ec8867 100644 --- a/Doc/library/html.parser.rst +++ b/Doc/library/html.parser.rst @@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. .. class:: HTMLParser(strict=True) Create a parser instance. If *strict* is ``True`` (the default), invalid - html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If + HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If *strict* is ``False``, the parser uses heuristics to make a best guess at - the intention of any invalid html it encounters, similar to the way most - browsers do. + the intention of any invalid HTML it encounters, similar to the way most + browsers do. Using ``strict=False`` is advised. - An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags - begin and end. The :class:`HTMLParser` class is meant to be overridden by the - user to provide a desired behavior. + An :class:`.HTMLParser` instance is fed HTML data and calls handler methods + when start tags, end tags, text, comments, and other markup elements are + encountered. The user should subclass :class:`.HTMLParser` and override its + methods to implement the desired behavior. This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element. @@ -39,25 +40,61 @@ An exception is defined as well: .. exception:: HTMLParseError Exception raised by the :class:`HTMLParser` class when it encounters an error - while parsing. 
This exception provides three attributes: :attr:`msg` is a brief - message explaining the error, :attr:`lineno` is the number of the line on which - the broken construct was detected, and :attr:`offset` is the number of - characters into the line at which the construct starts. + while parsing and *strict* is ``True``. This exception provides three + attributes: :attr:`msg` is a brief message explaining the error, + :attr:`lineno` is the number of the line on which the broken construct was + detected, and :attr:`offset` is the number of characters into the line at + which the construct starts. + + +Example HTML Parser Application +------------------------------- + +As a basic example, below is a simple HTML parser that uses the +:class:`HTMLParser` class to print out start tags, end tags, and data +as they are encountered:: + + from html.parser import HTMLParser + + class MyHTMLParser(HTMLParser): + def handle_starttag(self, tag, attrs): + print("Encountered a start tag:", tag) + def handle_endtag(self, tag): + print("Encountered an end tag :", tag) + def handle_data(self, data): + print("Encountered some data :", data) + + parser = MyHTMLParser(strict=False) + parser.feed('<html><head><title>Test</title></head>' + '<body><h1>Parse me!</h1></body></html>
') + +The output will then be:: + + Encountered a start tag: html + Encountered a start tag: head + Encountered a start tag: title + Encountered some data : Test + Encountered an end tag : title + Encountered an end tag : head + Encountered a start tag: body + Encountered a start tag: h1 + Encountered some data : Parse me! + Encountered an end tag : h1 + Encountered an end tag : body + Encountered an end tag : html + + +:class:`.HTMLParser` Methods +---------------------------- :class:`HTMLParser` instances have the following methods: -.. method:: HTMLParser.reset() - - Reset the instance. Loses all unprocessed data. This is called implicitly at - instantiation time. - - .. method:: HTMLParser.feed(data) Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or - :meth:`close` is called. + :meth:`close` is called. *data* must be :class:`str`. .. method:: HTMLParser.close() @@ -68,6 +105,12 @@ An exception is defined as well: the :class:`HTMLParser` base class method :meth:`close`. +.. method:: HTMLParser.reset() + + Reset the instance. Loses all unprocessed data. This is called implicitly at + instantiation time. + + .. method:: HTMLParser.getpos() Return current line number and offset. @@ -81,23 +124,35 @@ An exception is defined as well: attributes can be preserved, etc.). +The following methods are called when data or markup elements are encountered +and they are meant to be overridden in a subclass. The base class +implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): + + .. method:: HTMLParser.handle_starttag(tag, attrs) - This method is called to handle the start of a tag. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to handle the start of a tag (e.g. ``
<div id="main">``). The *tag* argument is the name of the tag converted to lower case. The *attrs* argument is a list of ``(name, value)`` pairs containing the attributes found inside the tag's ``<>`` brackets. The *name* will be translated to lower case, and quotes in the *value* have been removed, and character and entity references - have been replaced. For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would be called as - ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. + have been replaced. + + For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method + would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. All entity references from :mod:`html.entities` are replaced in the attribute values. +.. method:: HTMLParser.handle_endtag(tag) + + This method is called to handle the end tag of an element (e.g. ``</div>
``). + + The *tag* argument is the name of the tag converted to lower case. + + .. method:: HTMLParser.handle_startendtag(tag, attrs) Similar to :meth:`handle_starttag`, but called when the parser encounters an @@ -106,57 +161,46 @@ An exception is defined as well: implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`. -.. method:: HTMLParser.handle_endtag(tag) - - This method is called to handle the end tag of an element. It is intended to be - overridden by a derived class; the base class implementation does nothing. The - *tag* argument is the name of the tag converted to lower case. - - .. method:: HTMLParser.handle_data(data) - This method is called to process arbitrary data (e.g. the content of - ``<script>...</script>`` and ``<style>...</style>``). It is intended to be - overridden by a derived class; the base class implementation does nothing. - - -.. method:: HTMLParser.handle_charref(name) - - This method is called to process a character reference of the form ``&#ref;``. - It is intended to be overridden by a derived class; the base class - implementation does nothing. + This method is called to process arbitrary data (e.g. text nodes and the + content of ``<script>...</script>`` and ``<style>...</style>``). .. method:: HTMLParser.handle_entityref(name) - This method is called to process a general entity reference of the form - ``&name;`` where *name* is an general entity reference. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to process a named character reference of the form + ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference + (e.g. ``'gt'``). + + +.. method:: HTMLParser.handle_charref(name) + + This method is called to process decimal and hexadecimal numeric character + references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal + equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``; + in this case the method will receive ``'62'`` or ``'x3E'``. .. 
method:: HTMLParser.handle_comment(data) - This method is called when a comment is encountered. The *comment* argument is - a string containing the text between the ``--`` and ``--`` delimiters, but not - the delimiters themselves. For example, the comment ``<!--text-->`` will cause - this method to be called with the argument ``'text'``. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called when a comment is encountered (e.g. ``<!--comment-->``). + + For example, the comment ``<!-- comment -->`` will cause this method to be + called with the argument ``' comment '``. + + The content of Internet Explorer conditional comments (condcoms) will also be + sent to this method, so, for ``<!--[if IE 9]>IE-specific content<![endif]-->``, + this method will receive ``'[if IE 9]>IE-specific content<![endif]'``. .. method:: HTMLParser.handle_decl(decl) - Method called when an SGML ``doctype`` declaration is read by the parser. + This method is called to handle an HTML doctype declaration (e.g. + ``<!DOCTYPE html>``). + The *decl* parameter will be the entire contents of the declaration inside - the ``<!...>`` markup. It is intended to be overridden by a derived class; - the base class implementation does nothing. - - -.. method:: HTMLParser.unknown_decl(data) - - Method called when an unrecognized SGML declaration is read by the parser. - The *data* parameter will be the entire contents of the declaration inside - the ``<!...>`` markup. It is sometimes useful to be overridden by a - derived class; the base class implementation raises an :exc:`HTMLParseError`. + the ``<!...>`` markup (e.g. ``'DOCTYPE html'``). .. method:: HTMLParser.handle_pi(data) @@ -174,29 +218,123 @@ An exception is defined as well: cause the ``'?'`` to be included in *data*. -.. _htmlparser-example: +.. method:: HTMLParser.unknown_decl(data) -Example HTML Parser Application -------------------------------- + This method is called when an unrecognized declaration is read by the parser. -As a basic example, below is a simple HTML parser that uses the -:class:`HTMLParser` class to print out start tags, end tags, and data -as they are encountered:: + The *data* parameter will be the entire contents of the declaration inside + the ``<![...]>`` markup. 
It is sometimes useful to be overridden by a + derived class. The base class implementation raises an :exc:`HTMLParseError` + when *strict* is ``True``. + + +.. _htmlparser-examples: + +Examples +-------- + +The following class implements a parser that will be used to illustrate more +examples:: from html.parser import HTMLParser + from html.entities import name2codepoint class MyHTMLParser(HTMLParser): def handle_starttag(self, tag, attrs): - print("Encountered a start tag:", tag) + print("Start tag:", tag) + for attr in attrs: + print(" attr:", attr) def handle_endtag(self, tag): - print("Encountered an end tag:", tag) + print("End tag :", tag) def handle_data(self, data): - print("Encountered some data:", data) + print("Data :", data) + def handle_comment(self, data): + print("Comment :", data) + def handle_entityref(self, name): + c = chr(name2codepoint[name]) + print("Named ent:", c) + def handle_charref(self, name): + if name.startswith('x'): + c = chr(int(name[1:], 16)) + else: + c = chr(int(name)) + print("Num ent :", c) + def handle_decl(self, data): + print("Decl :", data) - parser = MyHTMLParser() - parser.feed('<html><head><title>Test</title></head>' - '<body><h1>Parse me!</h1></body></html>
') + parser = MyHTMLParser(strict=False) +Parsing a doctype:: + + >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">') + Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" + +Parsing an element with a few attributes and a title:: + + >>> parser.feed('<img src="python-logo.png" alt="The Python logo">') + Start tag: img + attr: ('src', 'python-logo.png') + attr: ('alt', 'The Python logo') + >>> + >>> parser.feed('<h1>Python</h1>
') + Start tag: h1 + Data : Python + End tag : h1 + +The content of ``script`` and ``style`` elements is returned as is, without +further parsing:: + + >>> parser.feed('<style type="text/css">#python { color: green }</style>') + Start tag: style + attr: ('type', 'text/css') + Data : #python { color: green } + End tag : style + >>> + >>> parser.feed('<script type="text/javascript">alert("hello!");</script>') + Start tag: script + attr: ('type', 'text/javascript') + Data : alert("hello!"); + End tag : script + +Parsing comments:: + + >>> parser.feed('<!-- a comment -->' + ... '<!--[if IE 9]>IE-specific content<![endif]-->') + Comment : a comment + Comment : [if IE 9]>IE-specific content<![endif] + +Parsing named and numeric character references and converting them to the +correct char (note: these 3 references are all equivalent to ``'>'``):: + + >>> parser.feed('&gt;&#62;&#x3E;') + Named ent: > + Num ent : > + Num ent : > + +Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but +:meth:`~HTMLParser.handle_data` might be called more than once:: + + >>> for chunk in ['<span>buff', 'ered ', 'text</span>']: + ... parser.feed(chunk) + ... + Start tag: span + Data : buff + Data : ered + Data : text + End tag : span + +Parsing invalid HTML (e.g. unquoted attributes) also works:: + + >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>
') + Start tag: p + Start tag: a + attr: ('class', 'link') + attr: ('href', '#main') + Data : tag soup + End tag : p + End tag : a .. rubric:: Footnotes
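
Note (not part of the patch): the chunked-feed behaviour documented above means a subclass should not assume that :meth:`handle_data` receives a whole text run in one call. The following is a minimal sketch of one way to cope with that, by appending each piece and joining at the end. The names ``TextCollector`` and ``collect`` are hypothetical and do not appear in the patch, and the sketch assumes a later Python 3 release in which the *strict* argument no longer exists, so the parser is created without it.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records tag events and gathers text pieces."""

    def __init__(self):
        # Assumption: no ``strict`` argument, since later Python versions dropped it.
        super().__init__()
        self.events = []   # ("start" | "end", tag name) in document order
        self.pieces = []   # raw strings passed to handle_data()

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # May be called several times for one run of text, so only append here.
        self.pieces.append(data)

    def text(self):
        # Join the pieces so the result does not depend on how input was chunked.
        return "".join(self.pieces)


def collect(chunks):
    """Feed an iterable of str chunks and return the finished parser."""
    parser = TextCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()   # force processing of anything still buffered
    return parser


result = collect(["<span>buff", "ered ", "text</span>"])
print(result.events)
print(result.text())
```

Joining in ``text()`` rather than acting on each ``handle_data`` call keeps the result independent of where the chunk boundaries happen to fall.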