Remove the htmllib and sgmllib modules as per PEP 3108.
parent
6b38daa80d
commit
877b10add4
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
:mod:`formatter` --- Generic output formatting
|
:mod:`formatter` --- Generic output formatting
|
||||||
==============================================
|
==============================================
|
||||||
|
|
||||||
|
@ -6,12 +5,9 @@
|
||||||
:synopsis: Generic output formatter and device interface.
|
:synopsis: Generic output formatter and device interface.
|
||||||
|
|
||||||
|
|
||||||
.. index:: single: HTMLParser (class in htmllib)
|
|
||||||
|
|
||||||
This module supports two interface definitions, each with multiple
|
This module supports two interface definitions, each with multiple
|
||||||
implementations. The *formatter* interface is used by the :class:`HTMLParser`
|
implementations: The *formatter* interface, and the *writer* interface which is
|
||||||
class of the :mod:`htmllib` module, and the *writer* interface is required by
|
required by the formatter interface.
|
||||||
the formatter interface.
|
|
||||||
|
|
||||||
Formatter objects transform an abstract flow of formatting events into specific
|
Formatter objects transform an abstract flow of formatting events into specific
|
||||||
output events on writer objects. Formatters manage several stack structures to
|
output events on writer objects. Formatters manage several stack structures to
|
||||||
|
|
|
@ -7,11 +7,10 @@
|
||||||
|
|
||||||
|
|
||||||
This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
|
This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
|
||||||
and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
|
and ``entitydefs``. ``entitydefs`` is used to provide the :attr:`entitydefs`
|
||||||
provide the :attr:`entitydefs` member of the :class:`html.parser.HTMLParser`
|
member of the :class:`html.parser.HTMLParser` class. The definition provided
|
||||||
class. The definition provided here contains all the entities defined by XHTML
|
here contains all the entities defined by XHTML 1.0 that can be handled using
|
||||||
1.0 that can be handled using simple textual substitution in the Latin-1
|
simple textual substitution in the Latin-1 character set (ISO-8859-1).
|
||||||
character set (ISO-8859-1).
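The three dictionaries described above can be inspected directly. The following is a minimal sketch, assuming a current Python 3 release, where ``entitydefs`` maps each entity name to its replacement character; the variable names are illustrative only.

```python
from html.entities import codepoint2name, entitydefs, name2codepoint

# name2codepoint maps an entity name to its Unicode code point.
amp_codepoint = name2codepoint["amp"]

# codepoint2name is the inverse mapping, from code point to entity name.
amp_name = codepoint2name[38]

# entitydefs maps an entity name to its replacement text.
amp_text = entitydefs["amp"]

# A Latin-1 entity resolved through the code point table.
eacute_text = chr(name2codepoint["eacute"])
```

Because ``codepoint2name`` is built as the inverse of ``name2codepoint``, a round trip through both tables returns the original entity name.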
|
|
||||||
|
|
||||||
|
|
||||||
.. data:: entitydefs
|
.. data:: entitydefs
|
||||||
|
|
|
@ -11,9 +11,6 @@
|
||||||
|
|
||||||
This module defines a class :class:`HTMLParser` which serves as the basis for
|
This module defines a class :class:`HTMLParser` which serves as the basis for
|
||||||
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
|
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
|
||||||
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
|
|
||||||
in :mod:`sgmllib`.
|
|
||||||
|
|
||||||
|
|
||||||
.. class:: HTMLParser()
|
.. class:: HTMLParser()
|
||||||
|
|
||||||
|
@ -23,9 +20,8 @@ in :mod:`sgmllib`.
|
||||||
begin and end. The :class:`HTMLParser` class is meant to be overridden by the
|
begin and end. The :class:`HTMLParser` class is meant to be overridden by the
|
||||||
user to provide a desired behavior.
|
user to provide a desired behavior.
|
||||||
|
|
||||||
Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
|
This parser does not check that end tags match start tags or call the end-tag
|
||||||
match start tags or call the end-tag handler for elements which are closed
|
handler for elements which are closed implicitly by closing an outer element.
|
||||||
implicitly by closing an outer element.
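The behaviour described in this paragraph can be observed with a small subclass. This is only a sketch: ``EventLogger`` and its ``events`` list are hypothetical names introduced for illustration, not part of :mod:`html.parser`.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records start-tag and end-tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# The <p> element is never closed explicitly; closing <div> closes it implicitly.
logger.feed("<div><p>text</div>")
logger.close()
events = logger.events
```

No ``("end", "p")`` event is recorded: the parser reports only the tags that literally appear in the input, so a subclass that needs balanced elements has to track the open-element stack itself.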
|
|
||||||
|
|
||||||
An exception is defined as well:
|
An exception is defined as well:
|
||||||
|
|
||||||
|
|
|
@ -1,147 +0,0 @@
|
||||||
|
|
||||||
:mod:`htmllib` --- A parser for HTML documents
|
|
||||||
==============================================
|
|
||||||
|
|
||||||
.. module:: htmllib
|
|
||||||
:synopsis: A parser for HTML documents.
|
|
||||||
|
|
||||||
|
|
||||||
.. index::
|
|
||||||
single: HTML
|
|
||||||
single: hypertext
|
|
||||||
|
|
||||||
.. index::
|
|
||||||
module: sgmllib
|
|
||||||
module: formatter
|
|
||||||
single: SGMLParser (in module sgmllib)
|
|
||||||
|
|
||||||
This module defines a class which can serve as a base for parsing text files
|
|
||||||
formatted in the HyperText Mark-up Language (HTML). The class is not directly
|
|
||||||
concerned with I/O --- it must be provided with input in string form via a
|
|
||||||
method, and makes calls to methods of a "formatter" object in order to produce
|
|
||||||
output. The :class:`HTMLParser` class is designed to be used as a base class
|
|
||||||
for other classes in order to add functionality, and allows most of its methods
|
|
||||||
to be extended or overridden. In turn, this class is derived from and extends
|
|
||||||
the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
|
|
||||||
:class:`HTMLParser` implementation supports the HTML 2.0 language as described
|
|
||||||
in :rfc:`1866`. Two implementations of formatter objects are provided in the
|
|
||||||
:mod:`formatter` module; refer to the documentation for that module for
|
|
||||||
information on the formatter interface.
|
|
||||||
|
|
||||||
The following is a summary of the interface defined by
|
|
||||||
:class:`sgmllib.SGMLParser`:
|
|
||||||
|
|
||||||
* The interface to feed data to an instance is through the :meth:`feed` method,
|
|
||||||
which takes a string argument. This can be called with as little or as much
|
|
||||||
text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
|
|
||||||
``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
|
|
||||||
are processed immediately; incomplete constructs are saved in a buffer. To
|
|
||||||
force processing of all unprocessed data, call the :meth:`close` method.
|
|
||||||
|
|
||||||
For example, to parse the entire contents of a file, use::
|
|
||||||
|
|
||||||
parser.feed(open('myfile.html').read())
|
|
||||||
parser.close()
|
|
||||||
|
|
||||||
* The interface to define semantics for HTML tags is very simple: derive a class
|
|
||||||
and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.
|
|
||||||
The parser will call these at appropriate moments: :meth:`start_tag` or
|
|
||||||
:meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
|
|
||||||
encountered; :meth:`end_tag` is called when a closing tag of the form ``</tag>``
|
|
||||||
is encountered. If an opening tag requires a corresponding closing tag, like
|
|
||||||
``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
|
|
||||||
a tag requires no closing tag, like ``<P>``, the class should define the
|
|
||||||
:meth:`do_tag` method.
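The removed modules cannot be imported on Python 3, but the incremental ``feed()`` contract described in the first bullet can be sketched with :mod:`html.parser`, which offers the same ``feed()`` and ``close()`` methods. ``TagCollector`` and ``collect`` are hypothetical helpers for this illustration. Only start tags are compared, since character data may be delivered in several pieces when the input is split.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def collect(chunks):
    """Feed the chunks one by one and return the recorded start tags."""
    parser = TagCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()  # force processing of anything still buffered
    return parser.tags


whole = collect(['<a href="x.html">link</a>'])
# The same markup, split in the middle of the start tag.
split = collect(['<a hr', 'ef="x.html">link</a>'])
```

Both calls record the same start tag, because the incomplete ``<a hr`` fragment is buffered until the rest of the tag arrives.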
|
|
||||||
|
|
||||||
The module defines a parser class and an exception:
|
|
||||||
|
|
||||||
|
|
||||||
.. class:: HTMLParser(formatter)
|
|
||||||
|
|
||||||
This is the basic HTML parser class. It supports all entity names required by
|
|
||||||
the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
|
|
||||||
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
|
|
||||||
|
|
||||||
|
|
||||||
.. exception:: HTMLParseError
|
|
||||||
|
|
||||||
Exception raised by the :class:`HTMLParser` class when it encounters an error
|
|
||||||
while parsing.
|
|
||||||
|
|
||||||
|
|
||||||
.. seealso::
|
|
||||||
|
|
||||||
Module :mod:`formatter`
|
|
||||||
Interface definition for transforming an abstract flow of formatting events into
|
|
||||||
specific output events on writer objects.
|
|
||||||
|
|
||||||
Module :mod:`html.parser`
|
|
||||||
Alternate HTML parser that offers a slightly lower-level view of the input, but
|
|
||||||
is designed to work with XHTML, and does not implement some of the SGML syntax
|
|
||||||
not used in "HTML as deployed" and which isn't legal for XHTML.
|
|
||||||
|
|
||||||
Module :mod:`html.entities`
|
|
||||||
Definition of replacement text for XHTML 1.0 entities.
|
|
||||||
|
|
||||||
Module :mod:`sgmllib`
|
|
||||||
Base class for :class:`HTMLParser`.
|
|
||||||
|
|
||||||
|
|
||||||
.. _html-parser-objects:
|
|
||||||
|
|
||||||
HTMLParser Objects
|
|
||||||
------------------
|
|
||||||
|
|
||||||
In addition to tag methods, the :class:`HTMLParser` class provides some
|
|
||||||
additional methods and instance variables for use within tag methods.
|
|
||||||
|
|
||||||
|
|
||||||
.. attribute:: HTMLParser.formatter
|
|
||||||
|
|
||||||
This is the formatter instance associated with the parser.
|
|
||||||
|
|
||||||
|
|
||||||
.. attribute:: HTMLParser.nofill
|
|
||||||
|
|
||||||
Boolean flag which should be true when whitespace should not be collapsed, or
|
|
||||||
false when it should be. In general, this should only be true when character
|
|
||||||
data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
|
|
||||||
The default value is false. This affects the operation of :meth:`handle_data`
|
|
||||||
and :meth:`save_end`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.anchor_bgn(href, name, type)
|
|
||||||
|
|
||||||
This method is called at the start of an anchor region. The arguments
|
|
||||||
correspond to the attributes of the ``<A>`` tag with the same names. The
|
|
||||||
default implementation maintains a list of hyperlinks (defined by the ``HREF``
|
|
||||||
attribute for ``<A>`` tags) within the document. The list of hyperlinks is
|
|
||||||
available as the data attribute :attr:`anchorlist`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.anchor_end()
|
|
||||||
|
|
||||||
This method is called at the end of an anchor region. The default
|
|
||||||
implementation adds a textual footnote marker using an index into the list of
|
|
||||||
hyperlinks created by :meth:`anchor_bgn`.
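For code that relied on ``anchorlist`` and the footnote markers, a rough equivalent can be built on :mod:`html.parser`. The sketch below is an assumption about how such a port might look, not the removed implementation; ``AnchorLister``, ``text`` and ``_in_anchor`` are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorLister(HTMLParser):
    """Hypothetical subclass imitating the anchorlist and footnote behaviour."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []     # every HREF seen so far, in document order
        self.text = []           # pieces of rendered text
        self._in_anchor = False  # True while inside an <a> that has an HREF

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)
                self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # The footnote marker is the 1-based index into anchorlist.
            self.text.append("[%d]" % len(self.anchorlist))
            self._in_anchor = False

    def handle_data(self, data):
        self.text.append(data)


lister = AnchorLister()
lister.feed('See <a href="http://www.python.org/">Python</a> '
            'and <a name="top">here</a>.')
lister.close()
rendered = "".join(lister.text)
```

Anchors without an ``HREF`` attribute, such as the named anchor in the sample input, produce no list entry and no marker.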
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])
|
|
||||||
|
|
||||||
This method is called to handle images. The default implementation simply
|
|
||||||
passes the *alt* value to the :meth:`handle_data` method.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.save_bgn()
|
|
||||||
|
|
||||||
Begins saving character data in a buffer instead of sending it to the formatter
|
|
||||||
object. Retrieve the stored data via :meth:`save_end`. Use of the
|
|
||||||
:meth:`save_bgn` / :meth:`save_end` pair may not be nested.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.save_end()
|
|
||||||
|
|
||||||
Ends buffering character data and returns all data saved since the preceding
|
|
||||||
call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
|
|
||||||
collapsed to single spaces. A call to this method without a preceding call to
|
|
||||||
:meth:`save_bgn` will raise a :exc:`TypeError` exception.
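The buffering and whitespace rules described for :meth:`save_bgn` and :meth:`save_end` can be sketched in isolation. ``SaveBuffer`` is a hypothetical stand-in written for this illustration; it mirrors the documented behaviour without depending on the removed parser class.

```python
class SaveBuffer:
    """Hypothetical stand-in for the save_bgn() / save_end() pair."""

    def __init__(self, nofill=False):
        self.nofill = nofill
        self.savedata = None  # None means "not currently saving"

    def save_bgn(self):
        self.savedata = ""

    def handle_data(self, data):
        # Data is only buffered between save_bgn() and save_end().
        if self.savedata is not None:
            self.savedata += data

    def save_end(self):
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse every run of whitespace to a single space.
            data = " ".join(data.split())
        return data


buf = SaveBuffer()
buf.save_bgn()
buf.handle_data("A   long\n")
buf.handle_data("   title ")
collapsed = buf.save_end()

pre = SaveBuffer(nofill=True)
pre.save_bgn()
pre.handle_data("keep   this\n")
preformatted = pre.save_end()
```

In this sketch, calling ``save_end()`` without a preceding ``save_bgn()`` while ``nofill`` is false fails with :exc:`AttributeError`, because ``None`` has no ``split()`` method.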
|
|
|
@ -23,8 +23,6 @@ definition of the Python bindings for the DOM and SAX interfaces.
|
||||||
|
|
||||||
html.parser.rst
|
html.parser.rst
|
||||||
html.entities.rst
|
html.entities.rst
|
||||||
sgmllib.rst
|
|
||||||
htmllib.rst
|
|
||||||
pyexpat.rst
|
pyexpat.rst
|
||||||
xml.dom.rst
|
xml.dom.rst
|
||||||
xml.dom.minidom.rst
|
xml.dom.minidom.rst
|
||||||
|
|
|
@ -1,253 +0,0 @@
|
||||||
|
|
||||||
:mod:`sgmllib` --- Simple SGML parser
|
|
||||||
=====================================
|
|
||||||
|
|
||||||
.. module:: sgmllib
|
|
||||||
:synopsis: Only as much of an SGML parser as needed to parse HTML.
|
|
||||||
|
|
||||||
|
|
||||||
.. index:: single: SGML
|
|
||||||
|
|
||||||
This module defines a class :class:`SGMLParser` which serves as the basis for
|
|
||||||
parsing text files formatted in SGML (Standard Generalized Mark-up Language).
|
|
||||||
In fact, it does not provide a full SGML parser --- it only parses SGML insofar
|
|
||||||
as it is used by HTML, and the module only exists as a base for the
|
|
||||||
:mod:`htmllib` module. Another HTML parser which supports XHTML and offers a
|
|
||||||
somewhat different interface is available in the :mod:`HTMLParser` module.
|
|
||||||
|
|
||||||
|
|
||||||
.. class:: SGMLParser()
|
|
||||||
|
|
||||||
The :class:`SGMLParser` class is instantiated without arguments. The parser is
|
|
||||||
hardcoded to recognize the following constructs:
|
|
||||||
|
|
||||||
* Opening and closing tags of the form ``<tag attr="value" ...>`` and
|
|
||||||
``</tag>``, respectively.
|
|
||||||
|
|
||||||
* Numeric character references of the form ``&#name;``.
|
|
||||||
|
|
||||||
* Entity references of the form ``&name;``.
|
|
||||||
|
|
||||||
* SGML comments of the form ``<!--text-->``. Note that spaces, tabs, and
|
|
||||||
newlines are allowed between the trailing ``>`` and the immediately preceding
|
|
||||||
``--``.
|
|
||||||
|
|
||||||
A single exception is defined as well:
|
|
||||||
|
|
||||||
|
|
||||||
.. exception:: SGMLParseError
|
|
||||||
|
|
||||||
Exception raised by the :class:`SGMLParser` class when it encounters an error
|
|
||||||
while parsing.
|
|
||||||
|
|
||||||
:class:`SGMLParser` instances have the following methods:
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.reset()
|
|
||||||
|
|
||||||
Reset the instance. Loses all unprocessed data. This is called implicitly at
|
|
||||||
instantiation time.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.setnomoretags()
|
|
||||||
|
|
||||||
Stop processing tags. Treat all following input as literal input (CDATA).
|
|
||||||
(This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.setliteral()
|
|
||||||
|
|
||||||
Enter literal mode (CDATA mode).
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.feed(data)
|
|
||||||
|
|
||||||
Feed some text to the parser. It is processed insofar as it consists of
|
|
||||||
complete elements; incomplete data is buffered until more data is fed or
|
|
||||||
:meth:`close` is called.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.close()
|
|
||||||
|
|
||||||
Force processing of all buffered data as if it were followed by an end-of-file
|
|
||||||
mark. This method may be redefined by a derived class to define additional
|
|
||||||
processing at the end of the input, but the redefined version should always call
|
|
||||||
:meth:`close`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.get_starttag_text()
|
|
||||||
|
|
||||||
Return the text of the most recently opened start tag. This should not normally
|
|
||||||
be needed for structured processing, but may be useful in dealing with HTML "as
|
|
||||||
deployed" or for re-generating input with minimal changes (whitespace between
|
|
||||||
attributes can be preserved, etc.).
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_starttag(tag, method, attributes)
|
|
||||||
|
|
||||||
This method is called to handle start tags for which either a :meth:`start_tag`
|
|
||||||
or :meth:`do_tag` method has been defined. The *tag* argument is the name of
|
|
||||||
the tag converted to lower case, and the *method* argument is the bound method
|
|
||||||
which should be used to support semantic interpretation of the start tag. The
|
|
||||||
*attributes* argument is a list of ``(name, value)`` pairs containing the
|
|
||||||
attributes found inside the tag's ``<>`` brackets.
|
|
||||||
|
|
||||||
The *name* has been translated to lower case. Double quotes and backslashes in
|
|
||||||
the *value* have been interpreted, as well as known character references and
|
|
||||||
known entity references terminated by a semicolon (normally, entity references
|
|
||||||
can be terminated by any non-alphanumerical character, but this would break the
|
|
||||||
very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
|
|
||||||
entity name).
|
|
||||||
|
|
||||||
For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would
|
|
||||||
be called as ``unknown_starttag('a', [('href', 'http://www.cwi.nl/')])``. The
|
|
||||||
base implementation simply calls *method* with *attributes* as the only
|
|
||||||
argument.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_endtag(tag, method)
|
|
||||||
|
|
||||||
This method is called to handle endtags for which an :meth:`end_tag` method has
|
|
||||||
been defined. The *tag* argument is the name of the tag converted to lower
|
|
||||||
case, and the *method* argument is the bound method which should be used to
|
|
||||||
support semantic interpretation of the end tag. If no :meth:`end_tag` method is
|
|
||||||
defined for the closing element, this handler is not called. The base
|
|
||||||
implementation simply calls *method*.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_data(data)
|
|
||||||
|
|
||||||
This method is called to process arbitrary data. It is intended to be
|
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_charref(ref)
|
|
||||||
|
|
||||||
This method is called to process a character reference of the form ``&#ref;``.
|
|
||||||
The base implementation uses :meth:`convert_charref` to convert the reference to
|
|
||||||
a string. If that method returns a string, it is passed to :meth:`handle_data`,
|
|
||||||
otherwise ``unknown_charref(ref)`` is called to handle the error.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.convert_charref(ref)
|
|
||||||
|
|
||||||
Convert a character reference to a string, or ``None``. *ref* is the reference
|
|
||||||
passed in as a string. In the base implementation, *ref* must be a decimal
|
|
||||||
number in the range 0-255. It converts the code point found using the
|
|
||||||
:meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
|
|
||||||
method returns ``None``. This method is called by the default
|
|
||||||
:meth:`handle_charref` implementation and by the attribute value parser.
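The conversion rule documented here (decimal input, range 0-255, ``None`` otherwise) can be written as a standalone function. This is a sketch of the documented behaviour only, not the removed :mod:`sgmllib` source; the module-level ``convert_charref`` below is a hypothetical helper, and ``chr()`` stands in for :meth:`convert_codepoint`.

```python
def convert_charref(ref):
    """Return the character for a decimal reference in 0-255, else None."""
    try:
        codepoint = int(ref)
    except ValueError:
        return None  # not a decimal number, e.g. a hexadecimal reference
    if not 0 <= codepoint <= 255:
        return None  # outside the documented range
    return chr(codepoint)


letter = convert_charref("65")
too_large = convert_charref("256")
not_decimal = convert_charref("x41")
```

Returning ``None`` rather than raising lets the caller decide how to report the problem, which is the role :meth:`unknown_charref` plays in the documented interface.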
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.convert_codepoint(codepoint)
|
|
||||||
|
|
||||||
Convert a codepoint to a :class:`str` value. Encodings can be handled here if
|
|
||||||
appropriate, though the rest of :mod:`sgmllib` is oblivious on this matter.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_entityref(ref)
|
|
||||||
|
|
||||||
This method is called to process a general entity reference of the form
|
|
||||||
``&ref;`` where *ref* is a general entity reference. It converts *ref* by
|
|
||||||
passing it to :meth:`convert_entityref`. If a translation is returned, it calls
|
|
||||||
the method :meth:`handle_data` with the translation; otherwise, it calls the
|
|
||||||
method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
|
|
||||||
translations for ``&amp;``, ``&apos``, ``&gt;``, ``&lt;``, and ``&quot;``.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.convert_entityref(ref)
|
|
||||||
|
|
||||||
Convert a named entity reference to a :class:`str` value, or ``None``. The
|
|
||||||
resulting value will not be parsed. *ref* will be only the name of the entity.
|
|
||||||
The default implementation looks for *ref* in the instance (or class) variable
|
|
||||||
:attr:`entitydefs` which should be a mapping from entity names to corresponding
|
|
||||||
translations. If no translation is available for *ref*, this method returns
|
|
||||||
``None``. This method is called by the default :meth:`handle_entityref`
|
|
||||||
implementation and by the attribute value parser.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_comment(comment)
|
|
||||||
|
|
||||||
This method is called when a comment is encountered. The *comment* argument is
|
|
||||||
a string containing the text between the ``<!--`` and ``-->`` delimiters, but
|
|
||||||
not the delimiters themselves. For example, the comment ``<!--text-->`` will
|
|
||||||
cause this method to be called with the argument ``'text'``. The default method
|
|
||||||
does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.handle_decl(data)
|
|
||||||
|
|
||||||
Method called when an SGML declaration is read by the parser. In practice, the
|
|
||||||
``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
|
|
||||||
not discriminate among different (or broken) declarations. Internal subsets in
|
|
||||||
a ``DOCTYPE`` declaration are not supported. The *data* parameter will be the
|
|
||||||
entire contents of the declaration inside the ``<!``...\ ``>`` markup. The
|
|
||||||
default implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.report_unbalanced(tag)
|
|
||||||
|
|
||||||
This method is called when an end tag is found which does not correspond to any
|
|
||||||
open element.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.unknown_starttag(tag, attributes)
|
|
||||||
|
|
||||||
This method is called to process an unknown start tag. It is intended to be
|
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.unknown_endtag(tag)
|
|
||||||
|
|
||||||
This method is called to process an unknown end tag. It is intended to be
|
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.unknown_charref(ref)
|
|
||||||
|
|
||||||
This method is called to process unresolvable numeric character references.
|
|
||||||
Refer to :meth:`handle_charref` to determine what is handled by default. It is
|
|
||||||
intended to be overridden by a derived class; the base class implementation does
|
|
||||||
nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.unknown_entityref(ref)
|
|
||||||
|
|
||||||
This method is called to process an unknown entity reference. It is intended to
|
|
||||||
be overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
Apart from overriding or extending the methods listed above, derived classes may
|
|
||||||
also define methods of the following form to define processing of specific tags.
|
|
||||||
Tag names in the input stream are case independent; the *tag* occurring in
|
|
||||||
method names must be in lower case:
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.start_tag(attributes)
|
|
||||||
:noindex:
|
|
||||||
|
|
||||||
This method is called to process an opening tag *tag*. It has preference over
|
|
||||||
:meth:`do_tag`. The *attributes* argument has the same meaning as described for
|
|
||||||
:meth:`handle_starttag` above.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.do_tag(attributes)
|
|
||||||
:noindex:
|
|
||||||
|
|
||||||
This method is called to process an opening tag *tag* for which no
|
|
||||||
:meth:`start_tag` method is defined. The *attributes* argument has the same
|
|
||||||
meaning as described for :meth:`handle_starttag` above.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: SGMLParser.end_tag()
|
|
||||||
:noindex:
|
|
||||||
|
|
||||||
This method is called to process a closing tag *tag*.
|
|
||||||
|
|
||||||
Note that the parser maintains a stack of open elements for which no end tag has
|
|
||||||
been found yet. Only tags processed by :meth:`start_tag` are pushed on this
|
|
||||||
stack. Definition of an :meth:`end_tag` method is optional for these tags. For
|
|
||||||
tags processed by :meth:`do_tag` or by :meth:`unknown_tag`, no :meth:`end_tag`
|
|
||||||
method must be defined; if defined, it will not be used. If both
|
|
||||||
:meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
|
|
||||||
:meth:`start_tag` method takes precedence.
|
|
||||||
|
|
|
@ -389,14 +389,13 @@ URL Opener objects
|
||||||
.. index::
|
.. index::
|
||||||
single: HTML
|
single: HTML
|
||||||
pair: HTTP; protocol
|
pair: HTTP; protocol
|
||||||
module: htmllib
|
|
||||||
|
|
||||||
* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
|
* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
|
||||||
returned by the server. This may be binary data (such as an image), plain text
|
returned by the server. This may be binary data (such as an image), plain text
|
||||||
or (for example) HTML. The HTTP protocol provides type information in the reply
|
or (for example) HTML. The HTTP protocol provides type information in the reply
|
||||||
header, which can be inspected by looking at the :mailheader:`Content-Type`
|
header, which can be inspected by looking at the :mailheader:`Content-Type`
|
||||||
header. If the returned data is HTML, you can use the module :mod:`htmllib` to
|
header. If the returned data is HTML, you can use the module
|
||||||
parse it.
|
:mod:`html.parser` to parse it.
|
||||||
|
|
||||||
.. index:: single: FTP
|
.. index:: single: FTP
|
||||||
|
|
||||||
|
|
|
@ -1,8 +1,7 @@
|
||||||
"""Shared support for scanning document type declarations in HTML and XHTML.
|
"""Shared support for scanning document type declarations in HTML and XHTML.
|
||||||
|
|
||||||
This module is used as a foundation for the HTMLParser and sgmllib
|
This module is used as a foundation for the html.parser module. It has no
|
||||||
modules (indirectly, for htmllib as well). It has no documented
|
documented public API and should not be used directly.
|
||||||
public API and should not be used directly.
|
|
||||||
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
486
Lib/htmllib.py
|
@ -1,486 +0,0 @@
|
||||||
"""HTML 2.0 parser.
|
|
||||||
|
|
||||||
See the HTML 2.0 specification:
|
|
||||||
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_toc.html
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sgmllib
|
|
||||||
|
|
||||||
from formatter import AS_IS
|
|
||||||
|
|
||||||
__all__ = ["HTMLParser", "HTMLParseError"]
|
|
||||||
|
|
||||||
|
|
||||||
class HTMLParseError(sgmllib.SGMLParseError):
|
|
||||||
"""Error raised when an HTML document can't be parsed."""
|
|
||||||
|
|
||||||
|
|
||||||
class HTMLParser(sgmllib.SGMLParser):
|
|
||||||
"""This is the basic HTML parser class.
|
|
||||||
|
|
||||||
It supports all entity names required by the XHTML 1.0 Recommendation.
|
|
||||||
It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2
|
|
||||||
elements.
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
from html.entities import entitydefs
|
|
||||||
|
|
||||||
def __init__(self, formatter, verbose=0):
|
|
||||||
"""Creates an instance of the HTMLParser class.
|
|
||||||
|
|
||||||
The formatter parameter is the formatter instance associated with
|
|
||||||
the parser.
|
|
||||||
|
|
||||||
"""
|
|
||||||
sgmllib.SGMLParser.__init__(self, verbose)
|
|
||||||
self.formatter = formatter
|
|
||||||
|
|
||||||
def error(self, message):
|
|
||||||
raise HTMLParseError(message)
|
|
||||||
|
|
||||||
def reset(self):
|
|
||||||
sgmllib.SGMLParser.reset(self)
|
|
||||||
self.savedata = None
|
|
||||||
self.isindex = 0
|
|
||||||
self.title = None
|
|
||||||
self.base = None
|
|
||||||
self.anchor = None
|
|
||||||
self.anchorlist = []
|
|
||||||
self.nofill = 0
|
|
||||||
self.list_stack = []
|
|
||||||
|
|
||||||
# ------ Methods used internally; some may be overridden
|
|
||||||
|
|
||||||
# --- Formatter interface, taking care of 'savedata' mode;
|
|
||||||
# shouldn't need to be overridden
|
|
||||||
|
|
||||||
def handle_data(self, data):
|
|
||||||
if self.savedata is not None:
|
|
||||||
self.savedata = self.savedata + data
|
|
||||||
else:
|
|
||||||
if self.nofill:
|
|
||||||
self.formatter.add_literal_data(data)
|
|
||||||
else:
|
|
||||||
self.formatter.add_flowing_data(data)
|
|
||||||
|
|
||||||
# --- Hooks to save data; shouldn't need to be overridden
|
|
||||||
|
|
||||||
def save_bgn(self):
|
|
||||||
"""Begins saving character data in a buffer instead of sending it
|
|
||||||
to the formatter object.
|
|
||||||
|
|
||||||
Retrieve the stored data via the save_end() method. Use of the
|
|
||||||
save_bgn() / save_end() pair may not be nested.
|
|
||||||
|
|
||||||
"""
|
|
||||||
self.savedata = ''
|
|
||||||
|
|
||||||
def save_end(self):
|
|
||||||
"""Ends buffering character data and returns all data saved since
|
|
||||||
the preceding call to the save_bgn() method.
|
|
||||||
|
|
||||||
If the nofill flag is false, whitespace is collapsed to single
|
|
||||||
spaces. A call to this method without a preceding call to the
|
|
||||||
save_bgn() method will raise a TypeError exception.
|
|
||||||
|
|
||||||
"""
|
|
||||||
data = self.savedata
|
|
||||||
self.savedata = None
|
|
||||||
if not self.nofill:
|
|
||||||
data = ' '.join(data.split())
|
|
||||||
return data
|
|
||||||
|
|
||||||
# --- Hooks for anchors; should probably be overridden
|
|
||||||
|
|
||||||
def anchor_bgn(self, href, name, type):
|
|
||||||
"""This method is called at the start of an anchor region.
|
|
||||||
|
|
||||||
The arguments correspond to the attributes of the <A> tag with
|
|
||||||
the same names. The default implementation maintains a list of
|
|
||||||
hyperlinks (defined by the HREF attribute for <A> tags) within
|
|
||||||
the document. The list of hyperlinks is available as the data
|
|
||||||
attribute anchorlist.
|
|
||||||
|
|
||||||
"""
|
|
||||||
self.anchor = href
|
|
||||||
if self.anchor:
|
|
||||||
self.anchorlist.append(href)
|
|
||||||
|
|
||||||
def anchor_end(self):
|
|
||||||
"""This method is called at the end of an anchor region.
|
|
||||||
|
|
||||||
The default implementation adds a textual footnote marker using an
|
|
||||||
index into the list of hyperlinks created by the anchor_bgn()method.
|
|
||||||
|
|
||||||
"""
|
|
||||||
if self.anchor:
|
|
||||||
self.handle_data("[%d]" % len(self.anchorlist))
|
|
||||||
self.anchor = None
|
|
||||||
|
|
||||||
# --- Hook for images; should probably be overridden
|
|
||||||
|
|
||||||
def handle_image(self, src, alt, *args):
|
|
||||||
"""This method is called to handle images.
|
|
||||||
|
|
||||||
The default implementation simply passes the alt value to the
|
|
||||||
handle_data() method.
|
|
||||||
|
|
||||||
"""
|
|
||||||
self.handle_data(alt)
|
|
||||||
|
|
||||||
# --------- Top level elements
|
|
||||||
|
|
||||||
def start_html(self, attrs): pass
|
|
||||||
def end_html(self): pass
|
|
||||||
|
|
||||||
def start_head(self, attrs): pass
|
|
||||||
def end_head(self): pass
|
|
||||||
|
|
||||||
def start_body(self, attrs): pass
|
|
||||||
def end_body(self): pass
|
|
||||||
|
|
||||||
# ------ Head elements
|
|
||||||
|
|
||||||
def start_title(self, attrs):
|
|
||||||
self.save_bgn()
|
|
||||||
|
|
||||||
def end_title(self):
|
|
||||||
self.title = self.save_end()
|
|
||||||
|
|
||||||
def do_base(self, attrs):
|
|
||||||
for a, v in attrs:
|
|
||||||
if a == 'href':
|
|
||||||
self.base = v
|
|
||||||
|
|
||||||
def do_isindex(self, attrs):
|
|
||||||
self.isindex = 1
|
|
||||||
|
|
||||||
def do_link(self, attrs):
|
|
||||||
pass
|
|
||||||
|
|
||||||
def do_meta(self, attrs):
|
|
||||||
pass
|
|
||||||
|
|
||||||
def do_nextid(self, attrs): # Deprecated
|
|
||||||
pass
|
|
||||||
|
|
||||||
# ------ Body elements
|
|
||||||
|
|
||||||
# --- Headings
|
|
||||||
|
|
||||||
def start_h1(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font(('h1', 0, 1, 0))
|
|
||||||
|
|
||||||
def end_h1(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_h2(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font(('h2', 0, 1, 0))
|
|
||||||
|
|
||||||
def end_h2(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_h3(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font(('h3', 0, 1, 0))
|
|
||||||
|
|
||||||
def end_h3(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_h4(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font(('h4', 0, 1, 0))
|
|
||||||
|
|
||||||
def end_h4(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_h5(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font(('h5', 0, 1, 0))
|
|
||||||
|
|
||||||
def end_h5(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_h6(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font(('h6', 0, 1, 0))
|
|
||||||
|
|
||||||
def end_h6(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
# --- Block Structuring Elements
|
|
||||||
|
|
||||||
def do_p(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
|
|
||||||
def start_pre(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_font((AS_IS, AS_IS, AS_IS, 1))
|
|
||||||
self.nofill = self.nofill + 1
|
|
||||||
|
|
||||||
def end_pre(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
self.nofill = max(0, self.nofill - 1)
|
|
||||||
|
|
||||||
def start_xmp(self, attrs):
|
|
||||||
self.start_pre(attrs)
|
|
||||||
self.setliteral('xmp') # Tell SGML parser
|
|
||||||
|
|
||||||
def end_xmp(self):
|
|
||||||
self.end_pre()
|
|
||||||
|
|
||||||
def start_listing(self, attrs):
|
|
||||||
self.start_pre(attrs)
|
|
||||||
self.setliteral('listing') # Tell SGML parser
|
|
||||||
|
|
||||||
def end_listing(self):
|
|
||||||
self.end_pre()
|
|
||||||
|
|
||||||
def start_address(self, attrs):
|
|
||||||
self.formatter.end_paragraph(0)
|
|
||||||
self.formatter.push_font((AS_IS, 1, AS_IS, AS_IS))
|
|
||||||
|
|
||||||
def end_address(self):
|
|
||||||
self.formatter.end_paragraph(0)
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_blockquote(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.push_margin('blockquote')
|
|
||||||
|
|
||||||
def end_blockquote(self):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.formatter.pop_margin()
|
|
||||||
|
|
||||||
# --- List Elements
|
|
||||||
|
|
||||||
def start_ul(self, attrs):
|
|
||||||
self.formatter.end_paragraph(not self.list_stack)
|
|
||||||
self.formatter.push_margin('ul')
|
|
||||||
self.list_stack.append(['ul', '*', 0])
|
|
||||||
|
|
||||||
def end_ul(self):
|
|
||||||
if self.list_stack: del self.list_stack[-1]
|
|
||||||
self.formatter.end_paragraph(not self.list_stack)
|
|
||||||
self.formatter.pop_margin()
|
|
||||||
|
|
||||||
def do_li(self, attrs):
|
|
||||||
self.formatter.end_paragraph(0)
|
|
||||||
if self.list_stack:
|
|
||||||
[dummy, label, counter] = top = self.list_stack[-1]
|
|
||||||
top[2] = counter = counter+1
|
|
||||||
else:
|
|
||||||
label, counter = '*', 0
|
|
||||||
self.formatter.add_label_data(label, counter)
|
|
||||||
|
|
||||||
def start_ol(self, attrs):
|
|
||||||
self.formatter.end_paragraph(not self.list_stack)
|
|
||||||
self.formatter.push_margin('ol')
|
|
||||||
label = '1.'
|
|
||||||
for a, v in attrs:
|
|
||||||
if a == 'type':
|
|
||||||
if len(v) == 1: v = v + '.'
|
|
||||||
label = v
|
|
||||||
self.list_stack.append(['ol', label, 0])
|
|
||||||
|
|
||||||
def end_ol(self):
|
|
||||||
if self.list_stack: del self.list_stack[-1]
|
|
||||||
self.formatter.end_paragraph(not self.list_stack)
|
|
||||||
self.formatter.pop_margin()
|
|
||||||
|
|
||||||
def start_menu(self, attrs):
|
|
||||||
self.start_ul(attrs)
|
|
||||||
|
|
||||||
def end_menu(self):
|
|
||||||
self.end_ul()
|
|
||||||
|
|
||||||
def start_dir(self, attrs):
|
|
||||||
self.start_ul(attrs)
|
|
||||||
|
|
||||||
def end_dir(self):
|
|
||||||
self.end_ul()
|
|
||||||
|
|
||||||
def start_dl(self, attrs):
|
|
||||||
self.formatter.end_paragraph(1)
|
|
||||||
self.list_stack.append(['dl', '', 0])
|
|
||||||
|
|
||||||
def end_dl(self):
|
|
||||||
self.ddpop(1)
|
|
||||||
if self.list_stack: del self.list_stack[-1]
|
|
||||||
|
|
||||||
def do_dt(self, attrs):
|
|
||||||
self.ddpop()
|
|
||||||
|
|
||||||
def do_dd(self, attrs):
|
|
||||||
self.ddpop()
|
|
||||||
self.formatter.push_margin('dd')
|
|
||||||
self.list_stack.append(['dd', '', 0])
|
|
||||||
|
|
||||||
def ddpop(self, bl=0):
|
|
||||||
self.formatter.end_paragraph(bl)
|
|
||||||
if self.list_stack:
|
|
||||||
if self.list_stack[-1][0] == 'dd':
|
|
||||||
del self.list_stack[-1]
|
|
||||||
self.formatter.pop_margin()
|
|
||||||
|
|
||||||
# --- Phrase Markup
|
|
||||||
|
|
||||||
# Idiomatic Elements
|
|
||||||
|
|
||||||
def start_cite(self, attrs): self.start_i(attrs)
|
|
||||||
def end_cite(self): self.end_i()
|
|
||||||
|
|
||||||
def start_code(self, attrs): self.start_tt(attrs)
|
|
||||||
def end_code(self): self.end_tt()
|
|
||||||
|
|
||||||
def start_em(self, attrs): self.start_i(attrs)
|
|
||||||
def end_em(self): self.end_i()
|
|
||||||
|
|
||||||
def start_kbd(self, attrs): self.start_tt(attrs)
|
|
||||||
def end_kbd(self): self.end_tt()
|
|
||||||
|
|
||||||
def start_samp(self, attrs): self.start_tt(attrs)
|
|
||||||
def end_samp(self): self.end_tt()
|
|
||||||
|
|
||||||
def start_strong(self, attrs): self.start_b(attrs)
|
|
||||||
def end_strong(self): self.end_b()
|
|
||||||
|
|
||||||
def start_var(self, attrs): self.start_i(attrs)
|
|
||||||
def end_var(self): self.end_i()
|
|
||||||
|
|
||||||
# Typographic Elements
|
|
||||||
|
|
||||||
def start_i(self, attrs):
|
|
||||||
self.formatter.push_font((AS_IS, 1, AS_IS, AS_IS))
|
|
||||||
def end_i(self):
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_b(self, attrs):
|
|
||||||
self.formatter.push_font((AS_IS, AS_IS, 1, AS_IS))
|
|
||||||
def end_b(self):
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_tt(self, attrs):
|
|
||||||
self.formatter.push_font((AS_IS, AS_IS, AS_IS, 1))
|
|
||||||
def end_tt(self):
|
|
||||||
self.formatter.pop_font()
|
|
||||||
|
|
||||||
def start_a(self, attrs):
|
|
||||||
href = ''
|
|
||||||
name = ''
|
|
||||||
type = ''
|
|
||||||
for attrname, value in attrs:
|
|
||||||
value = value.strip()
|
|
||||||
if attrname == 'href':
|
|
||||||
href = value
|
|
||||||
if attrname == 'name':
|
|
||||||
name = value
|
|
||||||
if attrname == 'type':
|
|
||||||
type = value.lower()
|
|
||||||
self.anchor_bgn(href, name, type)
|
|
||||||
|
|
||||||
def end_a(self):
|
|
||||||
self.anchor_end()
|
|
||||||
|
|
||||||
# --- Line Break
|
|
||||||
|
|
||||||
def do_br(self, attrs):
|
|
||||||
self.formatter.add_line_break()
|
|
||||||
|
|
||||||
# --- Horizontal Rule
|
|
||||||
|
|
||||||
def do_hr(self, attrs):
|
|
||||||
self.formatter.add_hor_rule()
|
|
||||||
|
|
||||||
# --- Image
|
|
||||||
|
|
||||||
def do_img(self, attrs):
|
|
||||||
align = ''
|
|
||||||
alt = '(image)'
|
|
||||||
ismap = ''
|
|
||||||
src = ''
|
|
||||||
width = 0
|
|
||||||
height = 0
|
|
||||||
for attrname, value in attrs:
|
|
||||||
if attrname == 'align':
|
|
||||||
align = value
|
|
||||||
if attrname == 'alt':
|
|
||||||
alt = value
|
|
||||||
if attrname == 'ismap':
|
|
||||||
ismap = value
|
|
||||||
if attrname == 'src':
|
|
||||||
src = value
|
|
||||||
if attrname == 'width':
|
|
||||||
try: width = int(value)
|
|
||||||
except ValueError: pass
|
|
||||||
if attrname == 'height':
|
|
||||||
try: height = int(value)
|
|
||||||
except ValueError: pass
|
|
||||||
self.handle_image(src, alt, ismap, align, width, height)
|
|
||||||
|
|
||||||
# --- Really Old Unofficial Deprecated Stuff
|
|
||||||
|
|
||||||
def do_plaintext(self, attrs):
|
|
||||||
self.start_pre(attrs)
|
|
||||||
self.setnomoretags() # Tell SGML parser
|
|
||||||
|
|
||||||
# --- Unhandled tags
|
|
||||||
|
|
||||||
def unknown_starttag(self, tag, attrs):
|
|
||||||
pass
|
|
||||||
|
|
||||||
def unknown_endtag(self, tag):
|
|
||||||
pass
|
|
||||||
|
|
||||||
|
|
||||||
def test(args = None):
|
|
||||||
import sys, formatter
|
|
||||||
|
|
||||||
if not args:
|
|
||||||
args = sys.argv[1:]
|
|
||||||
|
|
||||||
silent = args and args[0] == '-s'
|
|
||||||
if silent:
|
|
||||||
del args[0]
|
|
||||||
|
|
||||||
if args:
|
|
||||||
file = args[0]
|
|
||||||
else:
|
|
||||||
file = 'test.html'
|
|
||||||
|
|
||||||
if file == '-':
|
|
||||||
f = sys.stdin
|
|
||||||
else:
|
|
||||||
try:
|
|
||||||
f = open(file, 'r')
|
|
||||||
except IOError as msg:
|
|
||||||
print(file, ":", msg)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
data = f.read()
|
|
||||||
|
|
||||||
if f is not sys.stdin:
|
|
||||||
f.close()
|
|
||||||
|
|
||||||
if silent:
|
|
||||||
f = formatter.NullFormatter()
|
|
||||||
else:
|
|
||||||
f = formatter.AbstractFormatter(formatter.DumbWriter())
|
|
||||||
|
|
||||||
p = HTMLParser(f)
|
|
||||||
p.feed(data)
|
|
||||||
p.close()
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
test()
|
|
548
Lib/sgmllib.py
|
@@ -1,548 +0,0 @@
|
||||||
"""A parser for SGML, using the derived class as a static DTD."""
|
|
||||||
|
|
||||||
# XXX This only supports those SGML features used by HTML.
|
|
||||||
|
|
||||||
# XXX There should be a way to distinguish between PCDATA (parsed
|
|
||||||
# character data -- the normal case), RCDATA (replaceable character
|
|
||||||
# data -- only char and entity references and end tags are special)
|
|
||||||
# and CDATA (character data -- only end tags are special). RCDATA is
|
|
||||||
# not supported at all.
|
|
||||||
|
|
||||||
|
|
||||||
import _markupbase
|
|
||||||
import re
|
|
||||||
|
|
||||||
__all__ = ["SGMLParser", "SGMLParseError"]
|
|
||||||
|
|
||||||
# Regular expressions used for parsing
|
|
||||||
|
|
||||||
interesting = re.compile('[&<]')
|
|
||||||
incomplete = re.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|'
|
|
||||||
'<([a-zA-Z][^<>]*|'
|
|
||||||
'/([a-zA-Z][^<>]*)?|'
|
|
||||||
'![^<>]*)?')
|
|
||||||
|
|
||||||
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
|
|
||||||
charref = re.compile('&#([0-9]+)[^0-9]')
|
|
||||||
|
|
||||||
starttagopen = re.compile('<[>a-zA-Z]')
|
|
||||||
shorttagopen = re.compile('<[a-zA-Z][-.a-zA-Z0-9]*/')
|
|
||||||
shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/')
|
|
||||||
piclose = re.compile('>')
|
|
||||||
endbracket = re.compile('[<>]')
|
|
||||||
tagfind = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')
|
|
||||||
attrfind = re.compile(
|
|
||||||
r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
|
|
||||||
r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
|
|
||||||
|
|
||||||
|
|
||||||
class SGMLParseError(RuntimeError):
|
|
||||||
"""Exception raised for all parse errors."""
|
|
||||||
pass
|
|
||||||
|
|
||||||
|
|
||||||
# SGML parser base class -- find tags and call handler functions.
|
|
||||||
# Usage: p = SGMLParser(); p.feed(data); ...; p.close().
|
|
||||||
# The dtd is defined by deriving a class which defines methods
|
|
||||||
# with special names to handle tags: start_foo and end_foo to handle
|
|
||||||
# <foo> and </foo>, respectively, or do_foo to handle <foo> by itself.
|
|
||||||
# (Tags are converted to lower case for this purpose.) The data
|
|
||||||
# between tags is passed to the parser by calling self.handle_data()
|
|
||||||
# with some data as argument (the data may be split up in arbitrary
|
|
||||||
# chunks). Entity references are passed by calling
|
|
||||||
# self.handle_entityref() with the entity reference as argument.
|
|
||||||
|
|
||||||
class SGMLParser(_markupbase.ParserBase):
|
|
||||||
# Definition of entities -- derived classes may override
|
|
||||||
entity_or_charref = re.compile('&(?:'
|
|
||||||
'([a-zA-Z][-.a-zA-Z0-9]*)|#([0-9]+)'
|
|
||||||
')(;?)')
|
|
||||||
|
|
||||||
def __init__(self, verbose=0):
|
|
||||||
"""Initialize and reset this instance."""
|
|
||||||
self.verbose = verbose
|
|
||||||
self.reset()
|
|
||||||
|
|
||||||
def reset(self):
|
|
||||||
"""Reset this instance. Loses all unprocessed data."""
|
|
||||||
self.__starttag_text = None
|
|
||||||
self.rawdata = ''
|
|
||||||
self.stack = []
|
|
||||||
self.lasttag = '???'
|
|
||||||
self.nomoretags = 0
|
|
||||||
self.literal = 0
|
|
||||||
_markupbase.ParserBase.reset(self)
|
|
||||||
|
|
||||||
def setnomoretags(self):
|
|
||||||
"""Enter literal mode (CDATA) till EOF.
|
|
||||||
|
|
||||||
Intended for derived classes only.
|
|
||||||
"""
|
|
||||||
self.nomoretags = self.literal = 1
|
|
||||||
|
|
||||||
def setliteral(self, *args):
|
|
||||||
"""Enter literal mode (CDATA).
|
|
||||||
|
|
||||||
Intended for derived classes only.
|
|
||||||
"""
|
|
||||||
self.literal = 1
|
|
||||||
|
|
||||||
def feed(self, data):
|
|
||||||
"""Feed some data to the parser.
|
|
||||||
|
|
||||||
Call this as often as you want, with as little or as much text
|
|
||||||
as you want (may include '\n'). (This just saves the text,
|
|
||||||
all the processing is done by goahead().)
|
|
||||||
"""
|
|
||||||
|
|
||||||
self.rawdata = self.rawdata + data
|
|
||||||
self.goahead(0)
|
|
||||||
|
|
||||||
def close(self):
|
|
||||||
"""Handle the remaining data."""
|
|
||||||
self.goahead(1)
|
|
||||||
|
|
||||||
def error(self, message):
|
|
||||||
raise SGMLParseError(message)
|
|
||||||
|
|
||||||
# Internal -- handle data as far as reasonable. May leave state
|
|
||||||
# and data to be processed by a subsequent call. If 'end' is
|
|
||||||
# true, force handling all data as if followed by EOF marker.
|
|
||||||
def goahead(self, end):
|
|
||||||
rawdata = self.rawdata
|
|
||||||
i = 0
|
|
||||||
n = len(rawdata)
|
|
||||||
while i < n:
|
|
||||||
if self.nomoretags:
|
|
||||||
self.handle_data(rawdata[i:n])
|
|
||||||
i = n
|
|
||||||
break
|
|
||||||
match = interesting.search(rawdata, i)
|
|
||||||
if match: j = match.start()
|
|
||||||
else: j = n
|
|
||||||
if i < j:
|
|
||||||
self.handle_data(rawdata[i:j])
|
|
||||||
i = j
|
|
||||||
if i == n: break
|
|
||||||
if rawdata[i] == '<':
|
|
||||||
if starttagopen.match(rawdata, i):
|
|
||||||
if self.literal:
|
|
||||||
self.handle_data(rawdata[i])
|
|
||||||
i = i+1
|
|
||||||
continue
|
|
||||||
k = self.parse_starttag(i)
|
|
||||||
if k < 0: break
|
|
||||||
i = k
|
|
||||||
continue
|
|
||||||
if rawdata.startswith("</", i):
|
|
||||||
k = self.parse_endtag(i)
|
|
||||||
if k < 0: break
|
|
||||||
i = k
|
|
||||||
self.literal = 0
|
|
||||||
continue
|
|
||||||
if self.literal:
|
|
||||||
if n > (i + 1):
|
|
||||||
self.handle_data("<")
|
|
||||||
i = i+1
|
|
||||||
else:
|
|
||||||
# incomplete
|
|
||||||
break
|
|
||||||
continue
|
|
||||||
if rawdata.startswith("<!--", i):
|
|
||||||
# Strictly speaking, a comment is --.*--
|
|
||||||
# within a declaration tag <!...>.
|
|
||||||
# This should be removed,
|
|
||||||
# and comments handled only in parse_declaration.
|
|
||||||
k = self.parse_comment(i)
|
|
||||||
if k < 0: break
|
|
||||||
i = k
|
|
||||||
continue
|
|
||||||
if rawdata.startswith("<?", i):
|
|
||||||
k = self.parse_pi(i)
|
|
||||||
if k < 0: break
|
|
||||||
i = i+k
|
|
||||||
continue
|
|
||||||
if rawdata.startswith("<!", i):
|
|
||||||
# This is some sort of declaration; in "HTML as
|
|
||||||
# deployed," this should only be the document type
|
|
||||||
# declaration ("<!DOCTYPE html...>").
|
|
||||||
k = self.parse_declaration(i)
|
|
||||||
if k < 0: break
|
|
||||||
i = k
|
|
||||||
continue
|
|
||||||
elif rawdata[i] == '&':
|
|
||||||
if self.literal:
|
|
||||||
self.handle_data(rawdata[i])
|
|
||||||
i = i+1
|
|
||||||
continue
|
|
||||||
match = charref.match(rawdata, i)
|
|
||||||
if match:
|
|
||||||
name = match.group(1)
|
|
||||||
self.handle_charref(name)
|
|
||||||
i = match.end(0)
|
|
||||||
if rawdata[i-1] != ';': i = i-1
|
|
||||||
continue
|
|
||||||
match = entityref.match(rawdata, i)
|
|
||||||
if match:
|
|
||||||
name = match.group(1)
|
|
||||||
self.handle_entityref(name)
|
|
||||||
i = match.end(0)
|
|
||||||
if rawdata[i-1] != ';': i = i-1
|
|
||||||
continue
|
|
||||||
else:
|
|
||||||
self.error('neither < nor & ??')
|
|
||||||
# We get here only if incomplete matches but
|
|
||||||
# nothing else
|
|
||||||
match = incomplete.match(rawdata, i)
|
|
||||||
if not match:
|
|
||||||
self.handle_data(rawdata[i])
|
|
||||||
i = i+1
|
|
||||||
continue
|
|
||||||
j = match.end(0)
|
|
||||||
if j == n:
|
|
||||||
break # Really incomplete
|
|
||||||
self.handle_data(rawdata[i:j])
|
|
||||||
i = j
|
|
||||||
# end while
|
|
||||||
if end and i < n:
|
|
||||||
self.handle_data(rawdata[i:n])
|
|
||||||
i = n
|
|
||||||
self.rawdata = rawdata[i:]
|
|
||||||
# XXX if end: check for empty stack
|
|
||||||
|
|
||||||
# Extensions for the DOCTYPE scanner:
|
|
||||||
_decl_otherchars = '='
|
|
||||||
|
|
||||||
# Internal -- parse processing instr, return length or -1 if not terminated
|
|
||||||
def parse_pi(self, i):
|
|
||||||
rawdata = self.rawdata
|
|
||||||
if rawdata[i:i+2] != '<?':
|
|
||||||
self.error('unexpected call to parse_pi()')
|
|
||||||
match = piclose.search(rawdata, i+2)
|
|
||||||
if not match:
|
|
||||||
return -1
|
|
||||||
j = match.start(0)
|
|
||||||
self.handle_pi(rawdata[i+2: j])
|
|
||||||
j = match.end(0)
|
|
||||||
return j-i
|
|
||||||
|
|
||||||
def get_starttag_text(self):
|
|
||||||
return self.__starttag_text
|
|
||||||
|
|
||||||
# Internal -- handle starttag, return length or -1 if not terminated
|
|
||||||
def parse_starttag(self, i):
|
|
||||||
self.__starttag_text = None
|
|
||||||
start_pos = i
|
|
||||||
rawdata = self.rawdata
|
|
||||||
if shorttagopen.match(rawdata, i):
|
|
||||||
# SGML shorthand: <tag/data/ == <tag>data</tag>
|
|
||||||
# XXX Can data contain &... (entity or char refs)?
|
|
||||||
# XXX Can data contain < or > (tag characters)?
|
|
||||||
# XXX Can there be whitespace before the first /?
|
|
||||||
match = shorttag.match(rawdata, i)
|
|
||||||
if not match:
|
|
||||||
return -1
|
|
||||||
tag, data = match.group(1, 2)
|
|
||||||
self.__starttag_text = '<%s/' % tag
|
|
||||||
tag = tag.lower()
|
|
||||||
k = match.end(0)
|
|
||||||
self.finish_shorttag(tag, data)
|
|
||||||
self.__starttag_text = rawdata[start_pos:match.end(1) + 1]
|
|
||||||
return k
|
|
||||||
# XXX The following should skip matching quotes (' or ")
|
|
||||||
# As a shortcut way to exit, this isn't so bad, but shouldn't
|
|
||||||
# be used to locate the actual end of the start tag since the
|
|
||||||
# < or > characters may be embedded in an attribute value.
|
|
||||||
match = endbracket.search(rawdata, i+1)
|
|
||||||
if not match:
|
|
||||||
return -1
|
|
||||||
j = match.start(0)
|
|
||||||
# Now parse the data between i+1 and j into a tag and attrs
|
|
||||||
attrs = []
|
|
||||||
if rawdata[i:i+2] == '<>':
|
|
||||||
# SGML shorthand: <> == <last open tag seen>
|
|
||||||
k = j
|
|
||||||
tag = self.lasttag
|
|
||||||
else:
|
|
||||||
match = tagfind.match(rawdata, i+1)
|
|
||||||
if not match:
|
|
||||||
self.error('unexpected call to parse_starttag')
|
|
||||||
k = match.end(0)
|
|
||||||
tag = rawdata[i+1:k].lower()
|
|
||||||
self.lasttag = tag
|
|
||||||
while k < j:
|
|
||||||
match = attrfind.match(rawdata, k)
|
|
||||||
if not match: break
|
|
||||||
attrname, rest, attrvalue = match.group(1, 2, 3)
|
|
||||||
if not rest:
|
|
||||||
attrvalue = attrname
|
|
||||||
else:
|
|
||||||
if (attrvalue[:1] == "'" == attrvalue[-1:] or
|
|
||||||
attrvalue[:1] == '"' == attrvalue[-1:]):
|
|
||||||
# strip quotes
|
|
||||||
attrvalue = attrvalue[1:-1]
|
|
||||||
attrvalue = self.entity_or_charref.sub(
|
|
||||||
self._convert_ref, attrvalue)
|
|
||||||
attrs.append((attrname.lower(), attrvalue))
|
|
||||||
k = match.end(0)
|
|
||||||
if rawdata[j] == '>':
|
|
||||||
j = j+1
|
|
||||||
self.__starttag_text = rawdata[start_pos:j]
|
|
||||||
self.finish_starttag(tag, attrs)
|
|
||||||
return j
|
|
||||||
|
|
||||||
# Internal -- convert entity or character reference
|
|
||||||
def _convert_ref(self, match):
|
|
||||||
if match.group(2):
|
|
||||||
return self.convert_charref(match.group(2)) or \
|
|
||||||
'&#%s%s' % match.groups()[1:]
|
|
||||||
elif match.group(3):
|
|
||||||
return self.convert_entityref(match.group(1)) or \
|
|
||||||
'&%s;' % match.group(1)
|
|
||||||
else:
|
|
||||||
return '&%s' % match.group(1)
|
|
||||||
|
|
||||||
# Internal -- parse endtag
|
|
||||||
def parse_endtag(self, i):
|
|
||||||
rawdata = self.rawdata
|
|
||||||
match = endbracket.search(rawdata, i+1)
|
|
||||||
if not match:
|
|
||||||
return -1
|
|
||||||
j = match.start(0)
|
|
||||||
tag = rawdata[i+2:j].strip().lower()
|
|
||||||
if rawdata[j] == '>':
|
|
||||||
j = j+1
|
|
||||||
self.finish_endtag(tag)
|
|
||||||
return j
|
|
||||||
|
|
||||||
# Internal -- finish parsing of <tag/data/ (same as <tag>data</tag>)
|
|
||||||
def finish_shorttag(self, tag, data):
|
|
||||||
self.finish_starttag(tag, [])
|
|
||||||
self.handle_data(data)
|
|
||||||
self.finish_endtag(tag)
|
|
||||||
|
|
||||||
# Internal -- finish processing of start tag
|
|
||||||
# Return -1 for unknown tag, 0 for open-only tag, 1 for balanced tag
|
|
||||||
def finish_starttag(self, tag, attrs):
|
|
||||||
try:
|
|
||||||
method = getattr(self, 'start_' + tag)
|
|
||||||
except AttributeError:
|
|
||||||
try:
|
|
||||||
method = getattr(self, 'do_' + tag)
|
|
||||||
except AttributeError:
|
|
||||||
self.unknown_starttag(tag, attrs)
|
|
||||||
return -1
|
|
||||||
else:
|
|
||||||
self.handle_starttag(tag, method, attrs)
|
|
||||||
return 0
|
|
||||||
else:
|
|
||||||
self.stack.append(tag)
|
|
||||||
self.handle_starttag(tag, method, attrs)
|
|
||||||
return 1
|
|
||||||
|
|
||||||
# Internal -- finish processing of end tag
|
|
||||||
def finish_endtag(self, tag):
|
|
||||||
if not tag:
|
|
||||||
found = len(self.stack) - 1
|
|
||||||
if found < 0:
|
|
||||||
self.unknown_endtag(tag)
|
|
||||||
return
|
|
||||||
else:
|
|
||||||
if tag not in self.stack:
|
|
||||||
try:
|
|
||||||
method = getattr(self, 'end_' + tag)
|
|
||||||
except AttributeError:
|
|
||||||
self.unknown_endtag(tag)
|
|
||||||
else:
|
|
||||||
self.report_unbalanced(tag)
|
|
||||||
return
|
|
||||||
found = len(self.stack)
|
|
||||||
for i in range(found):
|
|
||||||
if self.stack[i] == tag: found = i
|
|
||||||
while len(self.stack) > found:
|
|
||||||
tag = self.stack[-1]
|
|
||||||
try:
|
|
||||||
method = getattr(self, 'end_' + tag)
|
|
||||||
except AttributeError:
|
|
||||||
method = None
|
|
||||||
if method:
|
|
||||||
self.handle_endtag(tag, method)
|
|
||||||
else:
|
|
||||||
self.unknown_endtag(tag)
|
|
||||||
del self.stack[-1]
|
|
||||||
|
|
||||||
# Overridable -- handle start tag
|
|
||||||
def handle_starttag(self, tag, method, attrs):
|
|
||||||
method(attrs)
|
|
||||||
|
|
||||||
# Overridable -- handle end tag
|
|
||||||
def handle_endtag(self, tag, method):
|
|
||||||
method()
|
|
||||||
|
|
||||||
# Example -- report an unbalanced </...> tag.
|
|
||||||
def report_unbalanced(self, tag):
|
|
||||||
if self.verbose:
|
|
||||||
print('*** Unbalanced </' + tag + '>')
|
|
||||||
print('*** Stack:', self.stack)
|
|
||||||
|
|
||||||
def convert_charref(self, name):
|
|
||||||
"""Convert character reference, may be overridden."""
|
|
||||||
try:
|
|
||||||
n = int(name)
|
|
||||||
except ValueError:
|
|
||||||
return
|
|
||||||
if not 0 <= n <= 255:
|
|
||||||
return
|
|
||||||
return self.convert_codepoint(n)
|
|
||||||
|
|
||||||
def convert_codepoint(self, codepoint):
|
|
||||||
return chr(codepoint)
|
|
||||||
|
|
||||||
def handle_charref(self, name):
|
|
||||||
"""Handle character reference, no need to override."""
|
|
||||||
replacement = self.convert_charref(name)
|
|
||||||
if replacement is None:
|
|
||||||
self.unknown_charref(name)
|
|
||||||
else:
|
|
||||||
self.handle_data(replacement)
|
|
||||||
|
|
||||||
# Definition of entities -- derived classes may override
|
|
||||||
entitydefs = \
|
|
||||||
{'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': '\''}
|
|
||||||
|
|
||||||
def convert_entityref(self, name):
|
|
||||||
"""Convert entity references.
|
|
||||||
|
|
||||||
As an alternative to overriding this method, one can tailor the
|
|
||||||
results by setting up the self.entitydefs mapping appropriately.
|
|
||||||
"""
|
|
||||||
table = self.entitydefs
|
|
||||||
if name in table:
|
|
||||||
return table[name]
|
|
||||||
else:
|
|
||||||
return
|
|
||||||
|
|
||||||
def handle_entityref(self, name):
|
|
||||||
"""Handle entity references, no need to override."""
|
|
||||||
replacement = self.convert_entityref(name)
|
|
||||||
if replacement is None:
|
|
||||||
self.unknown_entityref(name)
|
|
||||||
else:
|
|
||||||
self.handle_data(replacement)
|
|
||||||
|
|
||||||
# Example -- handle data, should be overridden
|
|
||||||
def handle_data(self, data):
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Example -- handle comment, could be overridden
|
|
||||||
def handle_comment(self, data):
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Example -- handle declaration, could be overridden
|
|
||||||
def handle_decl(self, decl):
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Example -- handle processing instruction, could be overridden
|
|
||||||
def handle_pi(self, data):
|
|
||||||
pass
|
|
||||||
|
|
||||||
# To be overridden -- handlers for unknown objects
|
|
||||||
def unknown_starttag(self, tag, attrs): pass
|
|
||||||
def unknown_endtag(self, tag): pass
|
|
||||||
def unknown_charref(self, ref): pass
|
|
||||||
def unknown_entityref(self, ref): pass
|
|
||||||
|
|
||||||
|
|
||||||
class TestSGMLParser(SGMLParser):
|
|
||||||
|
|
||||||
def __init__(self, verbose=0):
|
|
||||||
self.testdata = ""
|
|
||||||
SGMLParser.__init__(self, verbose)
|
|
||||||
|
|
||||||
def handle_data(self, data):
|
|
||||||
self.testdata = self.testdata + data
|
|
||||||
if len(repr(self.testdata)) >= 70:
|
|
||||||
self.flush()
|
|
||||||
|
|
||||||
def flush(self):
|
|
||||||
data = self.testdata
|
|
||||||
if data:
|
|
||||||
self.testdata = ""
|
|
||||||
print('data:', repr(data))
|
|
||||||
|
|
||||||
def handle_comment(self, data):
|
|
||||||
self.flush()
|
|
||||||
r = repr(data)
|
|
||||||
if len(r) > 68:
|
|
||||||
r = r[:32] + '...' + r[-32:]
|
|
||||||
print('comment:', r)
|
|
||||||
|
|
||||||
def unknown_starttag(self, tag, attrs):
|
|
||||||
self.flush()
|
|
||||||
if not attrs:
|
|
||||||
print('start tag: <' + tag + '>')
|
|
||||||
else:
|
|
||||||
print('start tag: <' + tag, end=' ')
|
|
||||||
for name, value in attrs:
|
|
||||||
print(name + '=' + '"' + value + '"', end=' ')
|
|
||||||
print('>')
|
|
||||||
|
|
||||||
def unknown_endtag(self, tag):
|
|
||||||
self.flush()
|
|
||||||
print('end tag: </' + tag + '>')
|
|
||||||
|
|
||||||
def unknown_entityref(self, ref):
|
|
||||||
self.flush()
|
|
||||||
print('*** unknown entity ref: &' + ref + ';')
|
|
||||||
|
|
||||||
def unknown_charref(self, ref):
|
|
||||||
self.flush()
|
|
||||||
print('*** unknown char ref: &#' + ref + ';')
|
|
||||||
|
|
||||||
def unknown_decl(self, data):
|
|
||||||
self.flush()
|
|
||||||
print('*** unknown decl: [' + data + ']')
|
|
||||||
|
|
||||||
def close(self):
|
|
||||||
SGMLParser.close(self)
|
|
||||||
self.flush()
|
|
||||||
|
|
||||||
|
|
||||||
def test(args = None):
|
|
||||||
import sys
|
|
||||||
|
|
||||||
if args is None:
|
|
||||||
args = sys.argv[1:]
|
|
||||||
|
|
||||||
if args and args[0] == '-s':
|
|
||||||
args = args[1:]
|
|
||||||
klass = SGMLParser
|
|
||||||
else:
|
|
||||||
klass = TestSGMLParser
|
|
||||||
|
|
||||||
if args:
|
|
||||||
file = args[0]
|
|
||||||
else:
|
|
||||||
file = 'test.html'
|
|
||||||
|
|
||||||
if file == '-':
|
|
||||||
f = sys.stdin
|
|
||||||
else:
|
|
||||||
try:
|
|
||||||
f = open(file, 'r')
|
|
||||||
except IOError as msg:
|
|
||||||
print(file, ":", msg)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
data = f.read()
|
|
||||||
if f is not sys.stdin:
|
|
||||||
f.close()
|
|
||||||
|
|
||||||
x = klass()
|
|
||||||
for c in data:
|
|
||||||
x.feed(c)
|
|
||||||
x.close()
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
test()
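Note for readers of this diff: the removed sgmllib.py resolves a start tag by looking for a start_<tag> method, then a do_<tag> method, and finally falling back to unknown_starttag(), returning 1, 0 or -1 respectively. The block below is a minimal, self-contained sketch of that lookup order only. TagDispatcher and its log attribute are hypothetical names invented for illustration; they are not part of the removed module, and no actual parsing happens here.

```python
class TagDispatcher:
    """Hypothetical stand-in that mimics only the start_/do_ lookup order."""

    def __init__(self):
        self.stack = []   # open tags that have a start_<tag> handler
        self.log = []     # record of which handler ran (illustration only)

    def finish_starttag(self, tag, attrs):
        # 1. Balanced tag: a start_<tag> method exists, so remember the tag.
        method = getattr(self, 'start_' + tag, None)
        if method is not None:
            self.stack.append(tag)
            method(attrs)
            return 1
        # 2. Open-only tag: a do_<tag> method exists, nothing is stacked.
        method = getattr(self, 'do_' + tag, None)
        if method is not None:
            method(attrs)
            return 0
        # 3. No handler at all.
        self.unknown_starttag(tag, attrs)
        return -1

    def start_b(self, attrs):
        self.log.append(('start', 'b'))

    def do_br(self, attrs):
        self.log.append(('do', 'br'))

    def unknown_starttag(self, tag, attrs):
        self.log.append(('unknown', tag))


d = TagDispatcher()
results = [d.finish_starttag(t, []) for t in ('b', 'br', 'blink')]
print(results, d.stack)
```

Only tags handled through start_<tag> end up on the stack, which is why the removed finish_endtag() can later pop back to the matching open tag.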
|
|
|
@@ -73,7 +73,6 @@ class AllTest(unittest.TestCase):
         self.check_all("glob")
         self.check_all("gzip")
         self.check_all("heapq")
-        self.check_all("htmllib")
         self.check_all("http.client")
         self.check_all("ihooks")
         self.check_all("imaplib")
@@ -116,7 +115,6 @@ class AllTest(unittest.TestCase):
         self.check_all("rlcompleter")
         self.check_all("robotparser")
         self.check_all("sched")
-        self.check_all("sgmllib")
         self.check_all("shelve")
         self.check_all("shlex")
         self.check_all("shutil")
|
|
|
@@ -1,69 +0,0 @@
|
||||||
import formatter
|
|
||||||
import htmllib
|
|
||||||
import unittest
|
|
||||||
|
|
||||||
from test import support
|
|
||||||
|
|
||||||
|
|
||||||
class AnchorCollector(htmllib.HTMLParser):
|
|
||||||
def __init__(self, *args, **kw):
|
|
||||||
self.__anchors = []
|
|
||||||
htmllib.HTMLParser.__init__(self, *args, **kw)
|
|
||||||
|
|
||||||
def get_anchor_info(self):
|
|
||||||
return self.__anchors
|
|
||||||
|
|
||||||
def anchor_bgn(self, *args):
|
|
||||||
self.__anchors.append(args)
|
|
||||||
|
|
||||||
class DeclCollector(htmllib.HTMLParser):
|
|
||||||
def __init__(self, *args, **kw):
|
|
||||||
self.__decls = []
|
|
||||||
htmllib.HTMLParser.__init__(self, *args, **kw)
|
|
||||||
|
|
||||||
def get_decl_info(self):
|
|
||||||
return self.__decls
|
|
||||||
|
|
||||||
def unknown_decl(self, data):
|
|
||||||
self.__decls.append(data)
|
|
||||||
|
|
||||||
|
|
||||||
class HTMLParserTestCase(unittest.TestCase):
|
|
||||||
def test_anchor_collection(self):
|
|
||||||
# See SF bug #467059.
|
|
||||||
parser = AnchorCollector(formatter.NullFormatter(), verbose=1)
|
|
||||||
parser.feed(
|
|
||||||
"""<a href='http://foo.org/' name='splat'> </a>
|
|
||||||
<a href='http://www.python.org/'> </a>
|
|
||||||
<a name='frob'> </a>
|
|
||||||
""")
|
|
||||||
parser.close()
|
|
||||||
self.assertEquals(parser.get_anchor_info(),
|
|
||||||
[('http://foo.org/', 'splat', ''),
|
|
||||||
('http://www.python.org/', '', ''),
|
|
||||||
('', 'frob', ''),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_decl_collection(self):
|
|
||||||
# See SF patch #545300
|
|
||||||
parser = DeclCollector(formatter.NullFormatter(), verbose=1)
|
|
||||||
parser.feed(
|
|
||||||
"""<html>
|
|
||||||
<body>
|
|
||||||
hallo
|
|
||||||
<![if !supportEmptyParas]> <![endif]>
|
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
""")
|
|
||||||
parser.close()
|
|
||||||
self.assertEquals(parser.get_decl_info(),
|
|
||||||
["if !supportEmptyParas",
|
|
||||||
"endif"
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_main():
|
|
||||||
support.run_unittest(HTMLParserTestCase)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
test_main()
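Not part of this commit: code that relied on the anchor collection exercised by the removed test could probably be ported to html.parser.HTMLParser, which stays in the standard library. The block below is a rough sketch under that assumption. The class name AnchorCollector is reused purely for illustration, and the anchors attribute is a hypothetical replacement for the removed anchor_bgn() hook; html.parser has no formatter argument and no anchor_bgn() callback.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, name, type) for every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []  # hypothetical attribute, not an html.parser API

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        found = {'href': '', 'name': '', 'type': ''}
        for attrname, value in attrs:
            # value is None for attributes written without "=value"
            value = (value or '').strip()
            if attrname in found:
                # Mirror the removed start_a(): only "type" is lower-cased.
                found[attrname] = value.lower() if attrname == 'type' else value
        self.anchors.append((found['href'], found['name'], found['type']))


parser = AnchorCollector()
parser.feed("""<a href='http://foo.org/' name='splat'> </a>
<a href='http://www.python.org/'> </a>
<a name='frob'> </a>
""")
parser.close()
anchors = parser.anchors
print(anchors)
```

The input is the same markup the removed test_anchor_collection fed to htmllib, so the collected tuples can be compared directly with the list that test asserted.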
|
|
|
@@ -1,438 +0,0 @@
|
||||||
import pprint
|
|
||||||
import re
|
|
||||||
import sgmllib
|
|
||||||
import unittest
|
|
||||||
from test import support
|
|
||||||
|
|
||||||
|
|
||||||
class EventCollector(sgmllib.SGMLParser):
|
|
||||||
|
|
||||||
def __init__(self):
|
|
||||||
self.events = []
|
|
||||||
self.append = self.events.append
|
|
||||||
sgmllib.SGMLParser.__init__(self)
|
|
||||||
|
|
||||||
def get_events(self):
|
|
||||||
# Normalize the list of events so that buffer artefacts don't
|
|
||||||
# separate runs of contiguous characters.
|
|
||||||
L = []
|
|
||||||
prevtype = None
|
|
||||||
for event in self.events:
|
|
||||||
type = event[0]
|
|
||||||
if type == prevtype == "data":
|
|
||||||
L[-1] = ("data", L[-1][1] + event[1])
|
|
||||||
else:
|
|
||||||
L.append(event)
|
|
||||||
prevtype = type
|
|
||||||
self.events = L
|
|
||||||
return L
|
|
||||||
|
|
||||||
# structure markup
|
|
||||||
|
|
||||||
def unknown_starttag(self, tag, attrs):
|
|
||||||
self.append(("starttag", tag, attrs))
|
|
||||||
|
|
||||||
def unknown_endtag(self, tag):
|
|
||||||
self.append(("endtag", tag))
|
|
||||||
|
|
||||||
# all other markup
|
|
||||||
|
|
||||||
def handle_comment(self, data):
|
|
||||||
self.append(("comment", data))
|
|
||||||
|
|
||||||
def handle_charref(self, data):
|
|
||||||
self.append(("charref", data))
|
|
||||||
|
|
||||||
def handle_data(self, data):
|
|
||||||
self.append(("data", data))
|
|
||||||
|
|
||||||
def handle_decl(self, decl):
|
|
||||||
self.append(("decl", decl))
|
|
||||||
|
|
||||||
def handle_entityref(self, data):
|
|
||||||
self.append(("entityref", data))
|
|
||||||
|
|
||||||
def handle_pi(self, data):
|
|
||||||
self.append(("pi", data))
|
|
||||||
|
|
||||||
def unknown_decl(self, decl):
|
|
||||||
self.append(("unknown decl", decl))
|
|
||||||
|
|
||||||
|
|
||||||
class CDATAEventCollector(EventCollector):
|
|
||||||
def start_cdata(self, attrs):
|
|
||||||
self.append(("starttag", "cdata", attrs))
|
|
||||||
self.setliteral()
|
|
||||||
|
|
||||||
|
|
||||||
class HTMLEntityCollector(EventCollector):
|
|
||||||
|
|
||||||
entity_or_charref = re.compile('(?:&([a-zA-Z][-.a-zA-Z0-9]*)'
|
|
||||||
'|&#(x[0-9a-zA-Z]+|[0-9]+))(;?)')
|
|
||||||
|
|
||||||
def convert_charref(self, name):
|
|
||||||
self.append(("charref", "convert", name))
|
|
||||||
if name[0] != "x":
|
|
||||||
return EventCollector.convert_charref(self, name)
|
|
||||||
|
|
||||||
def convert_codepoint(self, codepoint):
|
|
||||||
self.append(("codepoint", "convert", codepoint))
|
|
||||||
EventCollector.convert_codepoint(self, codepoint)
|
|
||||||
|
|
||||||
def convert_entityref(self, name):
|
|
||||||
self.append(("entityref", "convert", name))
|
|
||||||
return EventCollector.convert_entityref(self, name)
|
|
||||||
|
|
||||||
# These to record that they were called, then pass the call along
|
|
||||||
# to the default implementation so that it's actions can be
|
|
||||||
# recorded.
|
|
||||||
|
|
||||||
def handle_charref(self, data):
|
|
||||||
self.append(("charref", data))
|
|
||||||
sgmllib.SGMLParser.handle_charref(self, data)
|
|
||||||
|
|
||||||
def handle_entityref(self, data):
|
|
||||||
self.append(("entityref", data))
|
|
||||||
sgmllib.SGMLParser.handle_entityref(self, data)
|
|
||||||
|
|
||||||
|
|
||||||
class SGMLParserTestCase(unittest.TestCase):
|
|
||||||
|
|
||||||
collector = EventCollector
|
|
||||||
|
|
||||||
def get_events(self, source):
|
|
||||||
parser = self.collector()
|
|
||||||
try:
|
|
||||||
for s in source:
|
|
||||||
parser.feed(s)
|
|
||||||
parser.close()
|
|
||||||
except:
|
|
||||||
#self.events = parser.events
|
|
||||||
raise
|
|
||||||
return parser.get_events()
|
|
||||||
|
|
||||||
def check_events(self, source, expected_events):
|
|
||||||
try:
|
|
||||||
events = self.get_events(source)
|
|
||||||
except:
|
|
||||||
#import sys
|
|
||||||
#print >>sys.stderr, pprint.pformat(self.events)
|
|
||||||
raise
|
|
||||||
if events != expected_events:
|
|
||||||
self.fail("received events did not match expected events\n"
|
|
||||||
"Expected:\n" + pprint.pformat(expected_events) +
|
|
||||||
"\nReceived:\n" + pprint.pformat(events))
|
|
||||||
|
|
||||||
def check_parse_error(self, source):
|
|
||||||
parser = EventCollector()
|
|
||||||
try:
|
|
||||||
parser.feed(source)
|
|
||||||
parser.close()
|
|
||||||
except sgmllib.SGMLParseError:
|
|
||||||
pass
|
|
||||||
else:
|
|
||||||
self.fail("expected SGMLParseError for %r\nReceived:\n%s"
|
|
||||||
% (source, pprint.pformat(parser.get_events())))
|
|
||||||
|
|
||||||
def test_doctype_decl_internal(self):
|
|
||||||
inside = """\
|
|
||||||
DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'
|
|
||||||
SYSTEM 'http://www.w3.org/TR/html401/strict.dtd' [
|
|
||||||
<!ELEMENT html - O EMPTY>
|
|
||||||
<!ATTLIST html
|
|
||||||
version CDATA #IMPLIED
|
|
||||||
profile CDATA 'DublinCore'>
|
|
||||||
<!NOTATION datatype SYSTEM 'http://xml.python.org/notations/python-module'>
|
|
||||||
<!ENTITY myEntity 'internal parsed entity'>
|
|
||||||
<!ENTITY anEntity SYSTEM 'http://xml.python.org/entities/something.xml'>
|
|
||||||
<!ENTITY % paramEntity 'name|name|name'>
|
|
||||||
%paramEntity;
|
|
||||||
<!-- comment -->
|
|
||||||
]"""
|
|
||||||
self.check_events(["<!%s>" % inside], [
|
|
||||||
("decl", inside),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_doctype_decl_external(self):
|
|
||||||
inside = "DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'"
|
|
||||||
self.check_events("<!%s>" % inside, [
|
|
||||||
("decl", inside),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_underscore_in_attrname(self):
|
|
||||||
# SF bug #436621
|
|
||||||
"""Make sure attribute names with underscores are accepted"""
|
|
||||||
self.check_events("<a has_under _under>", [
|
|
||||||
("starttag", "a", [("has_under", "has_under"),
|
|
||||||
("_under", "_under")]),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_underscore_in_tagname(self):
|
|
||||||
# SF bug #436621
|
|
||||||
"""Make sure tag names with underscores are accepted"""
|
|
||||||
self.check_events("<has_under></has_under>", [
|
|
||||||
("starttag", "has_under", []),
|
|
||||||
("endtag", "has_under"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_quotes_in_unquoted_attrs(self):
|
|
||||||
# SF bug #436621
|
|
||||||
"""Be sure quotes in unquoted attributes are made part of the value"""
|
|
||||||
self.check_events("<a href=foo'bar\"baz>", [
|
|
||||||
("starttag", "a", [("href", "foo'bar\"baz")]),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_xhtml_empty_tag(self):
|
|
||||||
"""Handling of XHTML-style empty start tags"""
|
|
||||||
self.check_events("<br />text<i></i>", [
|
|
||||||
("starttag", "br", []),
|
|
||||||
("data", "text"),
|
|
||||||
("starttag", "i", []),
|
|
||||||
("endtag", "i"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_processing_instruction_only(self):
|
|
||||||
self.check_events("<?processing instruction>", [
|
|
||||||
("pi", "processing instruction"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_bad_nesting(self):
|
|
||||||
self.check_events("<a><b></a></b>", [
|
|
||||||
("starttag", "a", []),
|
|
||||||
("starttag", "b", []),
|
|
||||||
("endtag", "a"),
|
|
||||||
("endtag", "b"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_bare_ampersands(self):
|
|
||||||
self.check_events("this text & contains & ampersands &", [
|
|
||||||
("data", "this text & contains & ampersands &"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_bare_pointy_brackets(self):
|
|
||||||
self.check_events("this < text > contains < bare>pointy< brackets", [
|
|
||||||
("data", "this < text > contains < bare>pointy< brackets"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_attr_syntax(self):
|
|
||||||
output = [
|
|
||||||
("starttag", "a", [("b", "v"), ("c", "v"), ("d", "v"), ("e", "e")])
|
|
||||||
]
|
|
||||||
self.check_events("""<a b='v' c="v" d=v e>""", output)
|
|
||||||
self.check_events("""<a b = 'v' c = "v" d = v e>""", output)
|
|
||||||
self.check_events("""<a\nb\n=\n'v'\nc\n=\n"v"\nd\n=\nv\ne>""", output)
|
|
||||||
self.check_events("""<a\tb\t=\t'v'\tc\t=\t"v"\td\t=\tv\te>""", output)
|
|
||||||
|
|
||||||
def test_attr_values(self):
|
|
||||||
self.check_events("""<a b='xxx\n\txxx' c="yyy\t\nyyy" d='\txyz\n'>""",
|
|
||||||
[("starttag", "a", [("b", "xxx\n\txxx"),
|
|
||||||
("c", "yyy\t\nyyy"),
|
|
||||||
("d", "\txyz\n")])
|
|
||||||
])
|
|
||||||
self.check_events("""<a b='' c="">""", [
|
|
||||||
("starttag", "a", [("b", ""), ("c", "")]),
|
|
||||||
])
|
|
||||||
# URL construction stuff from RFC 1808:
|
|
||||||
safe = "$-_.+"
|
|
||||||
extra = "!*'(),"
|
|
||||||
reserved = ";/?:@&="
|
|
||||||
url = "http://example.com:8080/path/to/file?%s%s%s" % (
|
|
||||||
safe, extra, reserved)
|
|
||||||
self.check_events("""<e a=%s>""" % url, [
|
|
||||||
("starttag", "e", [("a", url)]),
|
|
||||||
])
|
|
||||||
# Regression test for SF patch #669683.
|
|
||||||
self.check_events("<e a=rgb(1,2,3)>", [
|
|
||||||
("starttag", "e", [("a", "rgb(1,2,3)")]),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_attr_values_entities(self):
|
|
||||||
"""Substitution of entities and charrefs in attribute values"""
|
|
||||||
# SF bug #1452246
|
|
||||||
self.check_events("""<a b=< c=<> d=<-> e='< '
|
|
||||||
f="&xxx;" g=' !' h='Ǵ'
|
|
||||||
i='x?a=b&c=d;'
|
|
||||||
j='&#42;' k='&#42;'>""",
|
|
||||||
[("starttag", "a", [("b", "<"),
|
|
||||||
("c", "<>"),
|
|
||||||
("d", "<->"),
|
|
||||||
("e", "< "),
|
|
||||||
("f", "&xxx;"),
|
|
||||||
("g", " !"),
|
|
||||||
("h", "Ǵ"),
|
|
||||||
("i", "x?a=b&c=d;"),
|
|
||||||
("j", "*"),
|
|
||||||
("k", "*"),
|
|
||||||
])])
|
|
||||||
|
|
||||||
def test_convert_overrides(self):
|
|
||||||
# This checks that the character and entity reference
|
|
||||||
# conversion helpers are called at the documented times. No
|
|
||||||
# attempt is made to really change what the parser accepts.
|
|
||||||
#
|
|
||||||
self.collector = HTMLEntityCollector
|
|
||||||
self.check_events(('<a title="“test”">foo</a>'
|
|
||||||
'&foobar;*'), [
|
|
||||||
('entityref', 'convert', 'ldquo'),
|
|
||||||
('charref', 'convert', 'x201d'),
|
|
||||||
('starttag', 'a', [('title', '“test”')]),
|
|
||||||
('data', 'foo'),
|
|
||||||
('endtag', 'a'),
|
|
||||||
('entityref', 'foobar'),
|
|
||||||
('entityref', 'convert', 'foobar'),
|
|
||||||
('charref', '42'),
|
|
||||||
('charref', 'convert', '42'),
|
|
||||||
('codepoint', 'convert', 42),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_attr_funky_names(self):
|
|
||||||
self.check_events("""<a a.b='v' c:d=v e-f=v>""", [
|
|
||||||
("starttag", "a", [("a.b", "v"), ("c:d", "v"), ("e-f", "v")]),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_attr_value_ip6_url(self):
|
|
||||||
# http://www.python.org/sf/853506
|
|
||||||
self.check_events(("<a href='http://[1080::8:800:200C:417A]/'>"
|
|
||||||
"<a href=http://[1080::8:800:200C:417A]/>"), [
|
|
||||||
("starttag", "a", [("href", "http://[1080::8:800:200C:417A]/")]),
|
|
||||||
("starttag", "a", [("href", "http://[1080::8:800:200C:417A]/")]),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_illegal_declarations(self):
|
|
||||||
s = 'abc<!spacer type="block" height="25">def'
|
|
||||||
self.check_events(s, [
|
|
||||||
("data", "abc"),
|
|
||||||
("unknown decl", 'spacer type="block" height="25"'),
|
|
||||||
("data", "def"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_weird_starttags(self):
|
|
||||||
self.check_events("<a<a>", [
|
|
||||||
("starttag", "a", []),
|
|
||||||
("starttag", "a", []),
|
|
||||||
])
|
|
||||||
self.check_events("</a<a>", [
|
|
||||||
("endtag", "a"),
|
|
||||||
("starttag", "a", []),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_declaration_junk_chars(self):
|
|
||||||
self.check_parse_error("<!DOCTYPE foo $ >")
|
|
||||||
|
|
||||||
def test_get_starttag_text(self):
|
|
||||||
s = """<foobar \n one="1"\ttwo=2 >"""
|
|
||||||
self.check_events(s, [
|
|
||||||
("starttag", "foobar", [("one", "1"), ("two", "2")]),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_cdata_content(self):
|
|
||||||
s = ("<cdata> <!-- not a comment --> ¬-an-entity-ref; </cdata>"
|
|
||||||
"<notcdata> <!-- comment --> </notcdata>")
|
|
||||||
self.collector = CDATAEventCollector
|
|
||||||
self.check_events(s, [
|
|
||||||
("starttag", "cdata", []),
|
|
||||||
("data", " <!-- not a comment --> ¬-an-entity-ref; "),
|
|
||||||
("endtag", "cdata"),
|
|
||||||
("starttag", "notcdata", []),
|
|
||||||
("data", " "),
|
|
||||||
("comment", " comment "),
|
|
||||||
("data", " "),
|
|
||||||
("endtag", "notcdata"),
|
|
||||||
])
|
|
||||||
s = """<cdata> <not a='start tag'> </cdata>"""
|
|
||||||
self.check_events(s, [
|
|
||||||
("starttag", "cdata", []),
|
|
||||||
("data", " <not a='start tag'> "),
|
|
||||||
("endtag", "cdata"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_illegal_declarations(self):
|
|
||||||
s = 'abc<!spacer type="block" height="25">def'
|
|
||||||
self.check_events(s, [
|
|
||||||
("data", "abc"),
|
|
||||||
("unknown decl", 'spacer type="block" height="25"'),
|
|
||||||
("data", "def"),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_enumerated_attr_type(self):
|
|
||||||
s = "<!DOCTYPE doc [<!ATTLIST doc attr (a | b) >]>"
|
|
||||||
self.check_events(s, [
|
|
||||||
('decl', 'DOCTYPE doc [<!ATTLIST doc attr (a | b) >]'),
|
|
||||||
])
|
|
||||||
|
|
||||||
def test_read_chunks(self):
|
|
||||||
# SF bug #1541697, this caused sgml parser to hang
|
|
||||||
# Just verify this code doesn't cause a hang.
|
|
||||||
CHUNK = 1024 # increasing this to 8212 makes the problem go away
|
|
||||||
|
|
||||||
f = open(support.findfile('sgml_input.html'), encoding="latin-1")
|
|
||||||
fp = sgmllib.SGMLParser()
|
|
||||||
while 1:
|
|
||||||
data = f.read(CHUNK)
|
|
||||||
fp.feed(data)
|
|
||||||
if len(data) != CHUNK:
|
|
||||||
break
|
|
||||||
|
|
||||||
# XXX These tests have been disabled by prefixing their names with
|
|
||||||
# an underscore. The first two exercise outstanding bugs in the
|
|
||||||
# sgmllib module, and the third exhibits questionable behavior
|
|
||||||
# that needs to be carefully considered before changing it.
|
|
||||||
|
|
||||||
def _test_starttag_end_boundary(self):
|
|
||||||
self.check_events("<a b='<'>", [("starttag", "a", [("b", "<")])])
|
|
||||||
self.check_events("<a b='>'>", [("starttag", "a", [("b", ">")])])
|
|
||||||
|
|
||||||
def _test_buffer_artefacts(self):
|
|
||||||
output = [("starttag", "a", [("b", "<")])]
|
|
||||||
self.check_events(["<a b='<'>"], output)
|
|
||||||
self.check_events(["<a ", "b='<'>"], output)
|
|
||||||
self.check_events(["<a b", "='<'>"], output)
|
|
||||||
self.check_events(["<a b=", "'<'>"], output)
|
|
||||||
self.check_events(["<a b='<", "'>"], output)
|
|
||||||
self.check_events(["<a b='<'", ">"], output)
|
|
||||||
|
|
||||||
output = [("starttag", "a", [("b", ">")])]
|
|
||||||
self.check_events(["<a b='>'>"], output)
|
|
||||||
self.check_events(["<a ", "b='>'>"], output)
|
|
||||||
self.check_events(["<a b", "='>'>"], output)
|
|
||||||
self.check_events(["<a b=", "'>'>"], output)
|
|
||||||
self.check_events(["<a b='>", "'>"], output)
|
|
||||||
self.check_events(["<a b='>'", ">"], output)
|
|
||||||
|
|
||||||
output = [("comment", "abc")]
|
|
||||||
self.check_events(["", "<!--abc-->"], output)
|
|
||||||
self.check_events(["<", "!--abc-->"], output)
|
|
||||||
self.check_events(["<!", "--abc-->"], output)
|
|
||||||
self.check_events(["<!-", "-abc-->"], output)
|
|
||||||
self.check_events(["<!--", "abc-->"], output)
|
|
||||||
self.check_events(["<!--a", "bc-->"], output)
|
|
||||||
self.check_events(["<!--ab", "c-->"], output)
|
|
||||||
self.check_events(["<!--abc", "-->"], output)
|
|
||||||
self.check_events(["<!--abc-", "->"], output)
|
|
||||||
self.check_events(["<!--abc--", ">"], output)
|
|
||||||
self.check_events(["<!--abc-->", ""], output)
|
|
||||||
|
|
||||||
def _test_starttag_junk_chars(self):
|
|
||||||
self.check_parse_error("<")
|
|
||||||
self.check_parse_error("<>")
|
|
||||||
self.check_parse_error("</$>")
|
|
||||||
self.check_parse_error("</")
|
|
||||||
self.check_parse_error("</a")
|
|
||||||
self.check_parse_error("<$")
|
|
||||||
self.check_parse_error("<$>")
|
|
||||||
self.check_parse_error("<!")
|
|
||||||
self.check_parse_error("<a $>")
|
|
||||||
self.check_parse_error("<a")
|
|
||||||
self.check_parse_error("<a foo='bar'")
|
|
||||||
self.check_parse_error("<a foo='bar")
|
|
||||||
self.check_parse_error("<a foo='>'")
|
|
||||||
self.check_parse_error("<a foo='>")
|
|
||||||
self.check_parse_error("<a foo=>")
|
|
||||||
|
|
||||||
|
|
||||||
def test_main():
|
|
||||||
support.run_unittest(SGMLParserTestCase)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
test_main()
|
|
|
@@ -60,6 +60,8 @@ Extension Modules
|
||||||
Library
|
Library
|
||||||
-------
|
-------
|
||||||
|
|
||||||
|
- Removed the ``htmllib`` and ``sgmllib`` modules.
|
||||||
|
|
||||||
- The deprecated ``SmartCookie`` and ``SimpleCookie`` classes have
|
- The deprecated ``SmartCookie`` and ``SimpleCookie`` classes have
|
||||||
been removed from ``http.cookies``.
|
been removed from ``http.cookies``.
|
||||||
|
|
||||||
|
|
|
@@ -1895,7 +1895,6 @@ reprlib Redo repr() but with limits on most sizes.
|
||||||
rlcompleter Word completion for GNU readline 2.0.
|
rlcompleter Word completion for GNU readline 2.0.
|
||||||
robotparser Parse robots.txt files, useful for web spiders.
|
robotparser Parse robots.txt files, useful for web spiders.
|
||||||
sched A generally useful event scheduler class.
|
sched A generally useful event scheduler class.
|
||||||
sgmllib A parser for SGML.
|
|
||||||
shelve Manage shelves of pickled objects.
|
shelve Manage shelves of pickled objects.
|
||||||
shlex Lexical analyzer class for simple shell-like syntaxes.
|
shlex Lexical analyzer class for simple shell-like syntaxes.
|
||||||
shutil Utility functions usable in a shell-like program.
|
shutil Utility functions usable in a shell-like program.
|
||||||
|
|