:mod:`htmllib` --- A parser for HTML documents
==============================================

.. module:: htmllib
   :synopsis: A parser for HTML documents.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`htmllib` module has been removed in Python 3.0.

.. index::
   single: HTML
   single: hypertext

.. index::
   module: sgmllib
   module: formatter
   single: SGMLParser (in module sgmllib)

This module defines a class which can serve as a base for parsing text files
formatted in the HyperText Mark-up Language (HTML). The class is not directly
concerned with I/O --- it must be provided with input in string form via a
method, and makes calls to methods of a "formatter" object in order to produce
output. The :class:`HTMLParser` class is designed to be used as a base class
for other classes in order to add functionality, and allows most of its methods
to be extended or overridden. In turn, this class is derived from and extends
the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
:class:`HTMLParser` implementation supports the HTML 2.0 language as described
in :rfc:`1866`. Two implementations of formatter objects are provided in the
:mod:`formatter` module; refer to the documentation for that module for
information on the formatter interface.

The following is a summary of the interface defined by
:class:`sgmllib.SGMLParser`:

* The interface to feed data to an instance is through the :meth:`feed` method,
  which takes a string argument. This can be called with as little or as much
  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
  ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
  are processed immediately; incomplete constructs are saved in a buffer. To
  force processing of all unprocessed data, call the :meth:`close` method.

  For example, to parse the entire contents of a file, use::

     parser.feed(open('myfile.html').read())
     parser.close()

* The interface to define semantics for HTML tags is very simple: derive a class
  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`,
  where *tag* is the name of the HTML tag. The parser will call these at
  appropriate moments: :meth:`start_tag` or :meth:`do_tag` is called when an
  opening tag of the form ``<tag ...>`` is encountered; :meth:`end_tag` is called
  when a closing tag of the form ``</tag>`` is encountered. If an opening tag
  requires a corresponding closing tag, like ``<H1>`` ... ``</H1>``, the class
  should define the :meth:`start_tag` method; if a tag requires no closing tag,
  like ``<P>``, the class should define the :meth:`do_tag` method.

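As a rough illustration of the handler convention summarized above, the sketch
below subclasses :class:`HTMLParser` and records which handlers fire. The class
name ``EventRecorder`` and the ``events`` list are invented for this example.
It assumes Python 2, because :mod:`htmllib` is not available in Python 3, and
it skips the demonstration if the import fails.

```python
try:
    import htmllib      # Python 2 only; removed in Python 3.0
    import formatter
except ImportError:
    htmllib = None      # lets the sketch load without error on Python 3

events = []             # hypothetical log of handler calls

if htmllib is not None:
    class EventRecorder(htmllib.HTMLParser):
        """Hypothetical subclass that logs start/end/do handler calls."""

        def start_h1(self, attrs):
            # <H1> needs a closing tag, so it gets a start_/end_ pair.
            events.append('start h1')

        def end_h1(self):
            events.append('end h1')

        def do_hr(self, attrs):
            # <HR> has no closing tag, so a single do_ method is enough.
            events.append('hr')

    # NullFormatter discards all formatter output.
    parser = EventRecorder(formatter.NullFormatter())
    parser.feed('<h1>Title</h1><hr>')
    parser.close()

print(events)
```

Each handler receives the tag's attributes as a list of ``(name, value)``
pairs in *attrs*; the end handler takes no arguments.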
The module defines a parser class and an exception:

.. class:: HTMLParser(formatter)

   This is the basic HTML parser class. It supports all entity names required by
   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.

.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.4

.. seealso::

   Module :mod:`formatter`
      Interface definition for transforming an abstract flow of formatting events into
      specific output events on writer objects.

   Module :mod:`HTMLParser`
      Alternate HTML parser that offers a slightly lower-level view of the input, but
      is designed to work with XHTML, and does not implement some of the SGML syntax
      not used in "HTML as deployed" and which isn't legal for XHTML.

   Module :mod:`htmlentitydefs`
      Definition of replacement text for XHTML 1.0 entities.

   Module :mod:`sgmllib`
      Base class for :class:`HTMLParser`.

.. _html-parser-objects:

HTMLParser Objects
------------------

In addition to tag methods, the :class:`HTMLParser` class provides some
additional methods and instance variables for use within tag methods.

.. attribute:: HTMLParser.formatter

   This is the formatter instance associated with the parser.

.. attribute:: HTMLParser.nofill

   Boolean flag which should be true when whitespace should not be collapsed, or
   false when it should be. In general, this should only be true when character
   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
   The default value is false. This flag affects the operation of
   :meth:`handle_data` and :meth:`save_end`.

.. method:: HTMLParser.anchor_bgn(href, name, type)

   This method is called at the start of an anchor region. The arguments
   correspond to the attributes of the ``<A>`` tag with the same names. The
   default implementation maintains a list of hyperlinks (defined by the ``HREF``
   attribute for ``<A>`` tags) within the document. The list of hyperlinks is
   available as the data attribute :attr:`anchorlist`.

.. method:: HTMLParser.anchor_end()

   This method is called at the end of an anchor region. The default
   implementation adds a textual footnote marker using an index into the list of
   hyperlinks created by :meth:`anchor_bgn`.

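Because the default :meth:`anchor_bgn` collects every ``HREF`` it sees, the
unmodified parser can serve as a simple link extractor. The following is a
minimal sketch under that assumption; the helper name ``extract_links`` is
hypothetical, and the code only does real work on Python 2, where
:mod:`htmllib` exists.

```python
try:
    import htmllib      # Python 2 only; removed in Python 3.0
    import formatter
except ImportError:
    htmllib = None


def extract_links(html):
    """Return the HREF values that HTMLParser collected in anchorlist."""
    if htmllib is None:
        raise RuntimeError('htmllib requires Python 2')
    # NullFormatter discards all output; only the anchor list is of interest.
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(html)
    parser.close()
    return parser.anchorlist


if htmllib is not None:
    print(extract_links('<a href="a.html">A</a> <a href="b.html">B</a>'))
```

Anchors without an ``HREF`` attribute, such as ``<a name="top">``, are not
added to the list by the default implementation.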
.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])

   This method is called to handle images. The default implementation simply
   passes the *alt* value to the :meth:`handle_data` method.

.. method:: HTMLParser.save_bgn()

   Begins saving character data in a buffer instead of sending it to the formatter
   object. Retrieve the stored data via :meth:`save_end`. Calls to the
   :meth:`save_bgn` / :meth:`save_end` pair may not be nested.

.. method:: HTMLParser.save_end()

   Ends buffering character data and returns all data saved since the preceding
   call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
   collapsed to single spaces. A call to this method without a preceding call to
   :meth:`save_bgn` will raise a :exc:`TypeError` exception.

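The :meth:`save_bgn` / :meth:`save_end` pair is typically called from a
matching pair of tag methods. The sketch below, which assumes Python 2 and
uses the invented names ``HeadingGrabber`` and ``headings``, captures the text
of each ``<H2>`` element instead of passing it to the formatter.

```python
try:
    import htmllib      # Python 2 only; removed in Python 3.0
    import formatter
except ImportError:
    htmllib = None

headings = []           # hypothetical store for captured heading text

if htmllib is not None:
    class HeadingGrabber(htmllib.HTMLParser):
        """Hypothetical subclass that buffers the text inside <H2>."""

        def start_h2(self, attrs):
            # From here on, character data goes to the buffer.
            self.save_bgn()

        def end_h2(self):
            # nofill is false by default, so whitespace is collapsed.
            headings.append(self.save_end())

    parser = HeadingGrabber(formatter.NullFormatter())
    parser.feed('<h2>  Hello\n   world </h2><p>body text</p>')
    parser.close()

print(headings)
```

Since the pair may not be nested, a handler written this way should not be
used for elements that can contain another buffered element.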
:mod:`htmlentitydefs` --- Definitions of HTML general entities
==============================================================

.. module:: htmlentitydefs
   :synopsis: Definitions of HTML general entities.
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

.. note::

   The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
   Python 3.0. The :term:`2to3` tool will automatically adapt imports when
   converting your sources to 3.0.

This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
provide the :attr:`entitydefs` member of the :class:`HTMLParser` class. The
definition provided here contains all the entities defined by XHTML 1.0 that
can be handled using simple textual substitution in the Latin-1 character set
(ISO-8859-1).

.. data:: entitydefs

   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
   ISO Latin-1.

.. data:: name2codepoint

   A dictionary that maps HTML entity names to Unicode codepoints.

   .. versionadded:: 2.3

.. data:: codepoint2name

   A dictionary that maps Unicode codepoints to HTML entity names.

   .. versionadded:: 2.3

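The three dictionaries can be combined in a few lines. The sketch below looks
up one entity in each direction and replaces characters that have a named
entity with the corresponding reference. The helper ``to_entities`` is
hypothetical and not part of the module, and the import falls back to the
Python 3 name :mod:`html.entities` mentioned in the note above.

```python
try:
    # Python 2 module name.
    from htmlentitydefs import name2codepoint, codepoint2name, entitydefs
except ImportError:
    # Renamed in Python 3.0.
    from html.entities import name2codepoint, codepoint2name, entitydefs


def to_entities(text):
    """Replace markup characters and non-ASCII characters that have a
    named entity with the corresponding ``&name;`` reference."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None and (ord(ch) > 127 or ch in u'&<>"'):
            pieces.append(u'&%s;' % name)
        else:
            pieces.append(ch)
    return u''.join(pieces)


print(name2codepoint['eacute'])          # 233
print(codepoint2name[233])               # eacute
print(entitydefs['amp'])                 # &
print(to_entities(u'caf\xe9 & bar'))     # caf&eacute; &amp; bar
```

Characters without a named entity are passed through unchanged, so the result
may still contain non-ASCII text.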