
\section{\module{HTMLParser} ---
         Simple HTML and XHTML parser}

\declaremodule{standard}{HTMLParser}
\modulesynopsis{A simple parser that can handle HTML and XHTML.}

\versionadded{2.2}

This module defines a class \class{HTMLParser} which serves as the
basis for parsing text files formatted in HTML\index{HTML} (HyperText
Mark-up Language) and XHTML.\index{XHTML}  Unlike the parser in
\refmodule{htmllib}, this parser is not based on the SGML parser in
\refmodule{sgmllib}.


\begin{classdesc}{HTMLParser}{}
The \class{HTMLParser} class is instantiated without arguments.

A \class{HTMLParser} instance is fed HTML data and calls handler methods
when tags begin and end.  The \class{HTMLParser} class is meant to be
subclassed, with its handler methods overridden to provide the desired
behavior.

Unlike the parser in \refmodule{htmllib}, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
\end{classdesc}

An exception is defined as well:

\begin{excdesc}{HTMLParseError}
Exception raised by the \class{HTMLParser} class when it encounters an
error while parsing.  This exception provides three attributes:
\member{msg} is a brief message explaining the error, \member{lineno}
is the number of the line on which the broken construct was detected,
and \member{offset} is the number of characters into the line at which
the construct starts.
\end{excdesc}


\class{HTMLParser} instances have the following methods:

\begin{methoddesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the \class{HTMLParser} base class
method \method{close()}.
\end{methoddesc}
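
The buffering behavior of \method{feed()} and \method{close()} can be
sketched as follows.  This is a minimal illustration rather than part of
the module: \class{TagLogger} is a hypothetical subclass, and the
fallback import of \code{html.parser} is an assumption for interpreters
in which the \module{HTMLParser} module name is not available.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class TagLogger(HTMLParser):
    """Hypothetical subclass that records the names of start tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagLogger()
parser.feed('<p><a hr')          # '<a hr' is incomplete and stays buffered
seen_after_first_feed = list(parser.tags)
parser.feed('ef="x">')           # completes the buffered start tag
parser.close()                   # flush anything still buffered
seen_after_close = list(parser.tags)
\end{verbatim}

In this sketch only the \code{p} tag is reported after the first call
to \method{feed()}; the \code{a} tag is reported once its closing
bracket arrives.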

\begin{methoddesc}{getpos}{}
Return the current line number and offset as a
\code{(\var{lineno}, \var{offset})} pair.
\end{methoddesc}
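
As a hedged illustration of \method{getpos()}, the sketch below records
the position reported while each start tag is being handled.
\class{PositionRecorder} is a hypothetical name, and the fallback
import is an assumption for interpreters without the
\module{HTMLParser} module name.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class PositionRecorder(HTMLParser):
    """Hypothetical subclass that notes where each start tag begins."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # Called from inside a handler, getpos() gives the position at
        # which the construct being handled starts.
        self.positions.append((tag, self.getpos()))

recorder = PositionRecorder()
recorder.feed('<p>\n  <b>')
recorder.close()
# Line numbers count from 1, offsets within a line count from 0.
\end{verbatim}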

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag.  This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, attrs} 
This method is called to handle the start of a tag.  It is intended to
be overridden by a derived class; the base class implementation does
nothing.  

The \var{tag} argument is the name of the tag converted to
lower case.  The \var{attrs} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets.  The \var{name} is translated to lower case,
and double quotes and backslashes in the \var{value} have been
interpreted.  For instance, for the tag \code{<A
HREF="http://www.cwi.nl/">}, this method would be called as
\samp{handle_starttag('a', [('href', 'http://www.cwi.nl/')])}.
\end{methoddesc}
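
The \var{attrs} list can be walked like any list of pairs.  The
following sketch, which assumes a hypothetical \class{LinkCollector}
subclass and a fallback import for interpreters without the
\module{HTMLParser} module name, gathers the \code{href} values of
anchor tags:

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class LinkCollector(HTMLParser):
    """Hypothetical subclass that collects href values from <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':                   # tag names arrive in lower case
            return
        for name, value in attrs:        # attribute names are lower case too
            if name == 'href':
                self.links.append(value)

collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
collector.close()
\end{verbatim}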

\begin{methoddesc}{handle_startendtag}{tag, attrs}
Similar to \method{handle_starttag()}, but called when the parser
encounters an XHTML-style empty tag (\code{<a .../>}).  This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
\method{handle_starttag()} and \method{handle_endtag()}.
\end{methoddesc}
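
The default dispatch described above can be observed with a small
sketch.  \class{EventRecorder} is a hypothetical subclass that does not
override \method{handle_startendtag()}, so an XHTML-style empty tag
should reach both of the other handlers; the fallback import is an
assumption for interpreters without the \module{HTMLParser} module
name.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class EventRecorder(HTMLParser):
    """Hypothetical subclass that logs start and end tag events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

recorder = EventRecorder()
recorder.feed('<br/><hr>')   # '<br/>' is an empty tag, '<hr>' a plain start tag
recorder.close()
\end{verbatim}

A subclass that needs to tell \code{<br/>} apart from \code{<br></br>}
would override \method{handle_startendtag()} instead of relying on this
default.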

\begin{methoddesc}{handle_endtag}{tag}
This method is called to handle the end tag of an element.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.  The \var{tag} argument is the name of
the tag converted to lower case.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}
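
A common use of \method{handle_data()} is to gather the text between
tags.  The sketch below assumes a hypothetical \class{TextExtractor}
subclass and a fallback import for interpreters without the
\module{HTMLParser} module name.  Text may be delivered in several
pieces, so the pieces are joined at the end.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class TextExtractor(HTMLParser):
    """Hypothetical subclass that accumulates character data."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)         # one call per run of text

extractor = TextExtractor()
extractor.feed('<p>Hello, <b>world</b>!</p>')
extractor.close()
text = ''.join(extractor.chunks)
\end{verbatim}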

\begin{methoddesc}{handle_charref}{name}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}; the \var{name} argument receives \var{ref}.  It
is intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_entityref}{name} 
This method is called to process a general entity reference of the
form \samp{\&\var{name};} where \var{name} is the name of a general
entity.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}
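
The two reference handlers can be exercised together, as in the sketch
below.  \class{RefRecorder} is a hypothetical subclass.  The fallback
import and the \code{convert_charrefs=False} argument are assumptions
that apply only to newer interpreters, where references are otherwise
converted to text before these handlers are consulted; the constructor
documented here takes no arguments.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class RefRecorder(HTMLParser):
    """Hypothetical subclass that logs character and entity references."""

    def __init__(self):
        try:
            # Assumption: newer parsers need this to call the ref handlers.
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # The constructor documented here accepts no arguments.
            HTMLParser.__init__(self)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(('char', name))     # e.g. '62' for '&#62;'

    def handle_entityref(self, name):
        self.refs.append(('entity', name))   # e.g. 'gt' for '&gt;'

recorder = RefRecorder()
recorder.feed('x &gt; y &#62; z')
recorder.close()
\end{verbatim}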

\begin{methoddesc}{handle_comment}{data}
This method is called when a comment is encountered.  The
\var{data} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}
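
A minimal sketch of a comment handler follows.
\class{CommentCollector} is a hypothetical subclass, and the fallback
import is an assumption for interpreters without the
\module{HTMLParser} module name.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class CommentCollector(HTMLParser):
    """Hypothetical subclass that keeps the text of each comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)       # delimiters are already stripped

collector = CommentCollector()
collector.feed('<!--text--><p>body</p>')
collector.close()
\end{verbatim}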

\begin{methoddesc}{handle_decl}{decl}
Method called when an SGML declaration is read by the parser.  The
\var{decl} parameter will be the entire contents of the declaration
inside the \code{<!}...\code{>} markup.  It is intended to be overridden
by a derived class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_pi}{data}
Method called when a processing instruction is encountered.  The
\var{data} parameter will contain the entire processing instruction.
For example, for the processing instruction \code{<?proc color='red'>},
this method would be called as \code{handle_pi("proc color='red'")}.  It
is intended to be overridden by a derived class; the base class
implementation does nothing.

\note{The \class{HTMLParser} class uses the SGML syntactic rules for
processing instructions.  An XHTML processing instruction using the
trailing \character{?} will cause the \character{?} to be included in
\var{data}.}
\end{methoddesc}


\subsection{Example HTML Parser Application \label{htmlparser-example}}

As a basic example, below is a simple HTML parser that uses the
\class{HTMLParser} class to print out tags as they are encountered:

\begin{verbatim}
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag
\end{verbatim}
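
The class above only defines handlers; nothing happens until an
instance is fed some input.  The following sketch shows one way to
drive such a parser.  It is a variant of \class{MyHTMLParser} that
stores its messages in a hypothetical \member{messages} list before
printing them, so that the same code runs whether or not \code{print}
is a statement; the fallback import is likewise an assumption for
interpreters without the \module{HTMLParser} module name.

\begin{verbatim}
try:
    from HTMLParser import HTMLParser    # module name documented here
except ImportError:
    from html.parser import HTMLParser   # assumption: fallback for newer Pythons

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.messages = []               # hypothetical addition for this sketch

    def handle_starttag(self, tag, attrs):
        self.messages.append("Encountered the beginning of a %s tag" % tag)

    def handle_endtag(self, tag):
        self.messages.append("Encountered the end of a %s tag" % tag)

parser = MyHTMLParser()
# Input may arrive in arbitrary pieces; feed() can be called repeatedly.
parser.feed('<html><head><title>Test</title></head>')
parser.feed('<body><h1>Parse me!</h1></body></html>')
parser.close()                           # process any remaining buffered data

for message in parser.messages:
    print(message)
\end{verbatim}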
% Michel Pelletier <michel@digicool.com>: Documentation for the HTMLParser module, with small changes by FLD (2001-05-30).