\section{\module{HTMLParser} --- Simple HTML and XHTML parser} \declaremodule{standard}{HTMLParser} \modulesynopsis{A simple parser that can handle HTML and XHTML.} This module defines a class \class{HTMLParser} which serves as the basis for parsing text files formatted in HTML\index{HTML} (HyperText Mark-up Language) and XHTML.\index{XHTML} Unlike the parser in \refmodule{htmllib}, this parser is not based on the SGML parser in \refmodule{sgmllib}. \begin{classdesc}{HTMLParser}{} The \class{HTMLParser} class is instantiated without arguments. An HTMLParser instance is fed HTML data and calls handler functions when tags begin and end. The \class{HTMLParser} class is meant to be overridden by the user to provide a desired behavior. Unlike the parser in \refmodule{htmllib}, this parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element. \end{classdesc} \class{HTMLParser} instances have the following methods: \begin{methoddesc}{reset}{} Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time. \end{methoddesc} \begin{methoddesc}{feed}{data} Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or \method{close()} is called. \end{methoddesc} \begin{methoddesc}{close}{} Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the \class{HTMLParser} base class method \method{close()}. \end{methoddesc} \begin{methoddesc}{getpos}{} Return current line number and offset. \end{methoddesc} \begin{methoddesc}{get_starttag_text}{} Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML ``as deployed'' or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.). \end{methoddesc} \begin{methoddesc}{handle_starttag}{tag, attrs} This method is called to handle the start of a tag. It is intended to be overridden by a derived class; the base class implementation does nothing. The \var{tag} argument is the name of the tag converted to lower case. The \var{attrs} argument is a list of \code{(\var{name}, \var{value})} pairs containing the attributes found inside the tag's \code{<>} brackets. The \var{name} will be translated to lower case and double quotes and backslashes in the \var{value} have been interpreted. For instance, for the tag \code{}, this method would be called as \samp{handle_starttag('a', [('href', 'http://www.cwi.nl/')])}. \end{methoddesc} \begin{methoddesc}{handle_startendtag}{tag, attrs} Similar to \method{handle_starttag()}, but called when the parser encounters an XHTML-style empty tag (\code{}). This method may be overridden by subclasses which require this particular lexical information; the default implementation simple calls \method{handle_starttag()} and \method{handle_endtag()}. \end{methoddesc} \begin{methoddesc}{handle_endtag}{tag} This method is called to handle the end tag of an element. It is intended to be overridden by a derived class; the base class implementation does nothing. The \var{tag} argument is the name of the tag converted to lower case. \end{methoddesc} \begin{methoddesc}{handle_data}{data} This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing. \end{methoddesc} \begin{methoddesc}{handle_charref}{name} This method is called to process a character reference of the form \samp{\&\#\var{ref};}. It is intended to be overridden by a derived class; the base class implementation does nothing. \end{methoddesc} \begin{methoddesc}{handle_entityref}{name} This method is called to process a general entity reference of the form \samp{\&\var{name};} where \var{name} is an general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing. \end{methoddesc} \begin{methoddesc}{handle_comment}{data} This method is called when a comment is encountered. The \var{comment} argument is a string containing the text between the \samp{{-}{-}} and \samp{{-}{-}} delimiters, but not the delimiters themselves. For example, the comment \samp{} will cause this method to be called with the argument \code{'text'}. It is intended to be overridden by a derived class; the base class implementation does nothing. \end{methoddesc} \begin{methoddesc}{handle_decl}{decl} Method called when an SGML declaration is read by the parser. The \var{decl} parameter will be the entire contents of the declaration inside the \code{} markup.It is intended to be overridden by a derived class; the base class implementation does nothing. \end{methoddesc} \begin{methoddesc}{handle_pi}{data} Method called when a processing instruction is encountered. The \var{data} parameter will contain the entire processing instruction. For example, for the processing instruction \code{}, this method would be called as \code{handle_pi("proc color='red'")}. It is intended to be overridden by a derived class; the base class implementation does nothing. \note{The \class{HTMLParser} class uses the SGML syntactic rules for processing instruction. An XHTML processing instruction using the trailing \character{?} will cause the \character{?} to be included in \var{data}.} \end{methoddesc} \subsection{Example HTML Parser Application \label{htmlparser-example}} As a basic example, below is a very basic HTML parser that uses the \class{HTMLParser} class to print out tags as they are encountered: \begin{verbatim} from HTMLParser import HTMLParser class MyHTMLParser(HTMLParser): def handle_starttag(self, tag, attrs): print "Encountered the beginning of a %s tag" % tag def handle_endtag(self, tag): print "Encountered the end of a %s tag" % tag \end{verbatim}