Parse me!

.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start tag of an element (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``. This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.
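The call described above can be checked with a minimal sketch. The subclass name ``PICollector`` and its ``instructions`` attribute are hypothetical and exist only for this illustration; they are not part of :mod:`html.parser`.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper: records every string passed to handle_pi()."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # data is the text between the opening '<?' and the closing '>'.
        self.instructions.append(data)


collector = PICollector()
collector.feed("<?proc color='red'>")
collector.close()
print(collector.instructions)  # ["proc color='red'"]
```

Collecting the values in a list rather than printing them inside the handler keeps the result easy to inspect after parsing.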
   This method is intended to be overridden by a derived class; the base class
   implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful. The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           # Look up the code point of a named reference such as 'gt'.
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           # A leading 'x' marks a hexadecimal reference such as 'x3E'.
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   # convert_charrefs=False is passed explicitly so that handle_entityref()
   # and handle_charref() are called; they are never called when
   # convert_charrefs is True.
   parser = MyHTMLParser(convert_charrefs=False)

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">alert("hello!");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("hello!");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!--a comment-->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  : a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once (unless
*convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<span>buff', 'ered ', 'text</span>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('