(modified patch by Sam Ruby; changed to use separate REs for start and end
tags to reduce matching cost for end tags; extended tests; updated to avoid
breaking previous changes to support IPv6 addresses in unquoted attribute
values)
('[' and ']' were not accepted in unquoted attribute values)
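As a rough illustration of the unquoted-value fix, the sketch below uses a hypothetical pattern (not the regular expression sgmllib actually uses) whose unquoted branch accepts '[' and ']', so an IPv6 literal in a URL is kept intact. The names ATTR and parse_attrs are made up for this example.

```python
import re

# Hypothetical pattern, not sgmllib's own: an attribute name followed by an
# optional value that is single-quoted, double-quoted, or unquoted.
# The unquoted branch accepts '[' and ']' so IPv6 literals survive.
ATTR = re.compile(
    r"""\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)      # attribute name
        (?:\s*=\s*
           ('[^']*'|"[^"]*"|[^\s>'"]+)      # quoted or unquoted value
        )?""",
    re.VERBOSE,
)


def parse_attrs(text):
    """Return (name, value) pairs; a bare attribute gets the value None."""
    attrs = []
    pos = 0
    while pos < len(text):
        m = ATTR.match(text, pos)
        if not m or m.end() == pos:
            break  # nothing more that looks like an attribute
        name, value = m.group(1), m.group(2)
        if value and value[0] in "'\"":
            value = value[1:-1]  # strip the surrounding quotes
        attrs.append((name.lower(), value))
        pos = m.end()
    return attrs


attrs = parse_attrs('href=http://[::1]:8080/ class="x"')
```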
- cleaned up tests of character and entity reference decoding so the
tests cover the documented relationships among handle_charref,
handle_entityref, convert_charref, convert_codepoint, and
convert_entityref, without bringing up Unicode issues that sgmllib
cannot be involved in
Use a new internal method, error(), consistently to raise parse errors;
the new base class also uses this.
Adjust the parse_comment() method to return the new offset into the buffer
instead of the number of characters scanned; this was the only helper
method that did it this way, so we have better consistency now. This
change was required to share the new base class.
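A minimal sketch of the "return the new offset" convention, assuming a simplified comment scanner; this is a hypothetical helper written for illustration, not the code from sgmllib or its base class.

```python
def parse_comment(rawdata, i):
    """Scan the comment that starts at rawdata[i].

    Returns the index just past the closing '-->', or -1 if the comment
    is not yet complete in the buffer.
    """
    if rawdata[i:i + 4] != "<!--":
        raise ValueError("unexpected call to parse_comment()")
    end = rawdata.find("-->", i + 4)
    if end < 0:
        return -1      # incomplete: the caller should wait for more data
    return end + 3     # new offset into the buffer, not a character count


buf = "x<!-- hi -->y"
i = parse_comment(buf, 1)
rest = buf[i:]         # the caller resumes scanning at the returned offset
```

Returning an offset lets the caller write `i = parse_comment(rawdata, i)` in the same way as for the other helpers, instead of adding a length to its own position.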
This fixes SF bugs #448482 and #453706.
basically accept <!...> where the dots can be single- or double-quoted
strings or any other character except >.
Background: I found a real-life example that failed to parse with
the old assumption: http://www.opensource.org/licenses/jabberpl.html
contains a few constructs of the form <![if !supportLists]>...<![endif]>.
for backward compatibility.
Add support for SGML declaration syntax (<!....>) to some reasonable
degree. This does not support everything allowed in SGML, but should
work with "real" HTML (internal subset in a DOCTYPE is not handled).
The content of the declaration is passed to the .handle_decl() method,
which can be overridden by subclasses.
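sgmllib is not available in Python 3, so the sketch below uses html.parser.HTMLParser as a stand-in; it offers a handle_decl() hook with the same name and a similar role. The class name DeclCollector is made up for this example.

```python
from html.parser import HTMLParser


class DeclCollector(HTMLParser):
    """Records the content of each <!...> declaration it sees."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        # Called with the text between '<!' and '>'.
        self.decls.append(decl)


collector = DeclCollector()
collector.feed("<!DOCTYPE html><p>text</p>")
collector.close()
```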
sgmllib does not recognize HTML attributes containing the semicolon
';' character. This may be in accordance with the HTML spec, but there
are sites that use it (excite.com) and the browsers I regularly use
(IE5, Netscape, Opera) all handle it.
Doug Fort, Downright Software LLC
also modified check_all function to suppress all warnings since they aren't
relevant to what this test is doing (allows quiet checking of regsub, for
instance)
get_starttag_text(): New method.
Return the text of the most recently parsed start tag, from
the '<' to the '>' or '/'. Not really useful for structure
processing, but requested for Web-related use. May also be
useful for being able to re-generate the input from the parse
events, but there's no equivalent for end tags.
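A small usage sketch, assuming Python 3's html.parser.HTMLParser as a stand-in for sgmllib; it provides a get_starttag_text() method of the same name. StartTagEcho is a hypothetical class name.

```python
from html.parser import HTMLParser


class StartTagEcho(HTMLParser):
    """Collects the raw source text of every start tag."""

    def __init__(self):
        super().__init__()
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # The raw text keeps the original quoting and spacing, which the
        # parsed (tag, attrs) pair does not.
        self.raw.append(self.get_starttag_text())


echo = StartTagEcho()
echo.feed('<a href="x.html" class=plain>link</a>')
echo.close()
```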
attrfind: Be a little more forgiving of unquoted attribute values.
The attached patches update the standard library so that all modules
have docstrings beginning with one-line summaries.
A new docstring was added to formatter. The docstring for os.py
was updated to mention nt, os2, ce in addition to posix, dos, mac.
- Handle <? processing instructions >.
- Allow . and - in entity names.
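To show what handling a processing instruction looks like from a subclass, here is a sketch that assumes Python 3's html.parser.HTMLParser as a stand-in for sgmllib; its handle_pi() hook receives the text after '<?'. PICollector is a hypothetical class name.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Records the content of each <?...> processing instruction."""

    def __init__(self):
        super().__init__()
        self.pis = []

    def handle_pi(self, data):
        self.pis.append(data)


pi_parser = PICollector()
pi_parser.feed("<?proc color='red'><p>x</p>")
pi_parser.close()
```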
Also fixed an oversight in the previous fix (in one place, [ \t\r\n]
was used instead of string.whitespace).
<leonard@dstc.edu.au>; allows hyphen and period in the middle
of attribute names. Still not allowed as first character;
as first character these are illegal in the Reference Concrete
Syntax, and we've not identified any use of these characters as
the first char in an attribute name in deployment on the web.
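The attribute-name rule described above can be sketched with a hypothetical pattern (not copied from sgmllib): a letter or underscore first, then any mix of letters, digits, underscores, hyphens, and periods.

```python
import re

# Hypothetical pattern for illustration: '-' and '.' are allowed in the
# middle of a name but not as its first character.
ATTR_NAME = re.compile(r"[a-zA-Z_][-.a-zA-Z_0-9]*")


def is_attr_name(s):
    """Return True if the whole string is an acceptable attribute name."""
    return ATTR_NAME.fullmatch(s) is not None
```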
Allow '=' and '~' in unquoted attribute values.
Added overridable methods handle_starttag(tag, method, attrs) and
handle_endtag(tag, method) so subclasses can decide whether they
really want to call the method (e.g. when suppressing some portion of
the document).
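The hook arrangement might look roughly like the following toy dispatcher. It is a self-contained sketch, not sgmllib's code: MiniDispatcher, Suppressing, and the muted flag are hypothetical names, and only start_/end_ methods for a single tag are shown.

```python
class MiniDispatcher:
    """Toy stand-in for the tag dispatch described above."""

    def __init__(self):
        self.log = []

    def finish_starttag(self, tag, attrs):
        # Look up the tag-specific method, then let the hook decide.
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.handle_starttag(tag, method, attrs)

    def finish_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            self.handle_endtag(tag, method)

    # Default hooks: simply call the tag-specific method.
    def handle_starttag(self, tag, method, attrs):
        method(attrs)

    def handle_endtag(self, tag, method):
        method()

    def start_p(self, attrs):
        self.log.append(("start", "p"))

    def end_p(self):
        self.log.append(("end", "p"))


class Suppressing(MiniDispatcher):
    """Skips the tag methods while the `muted` flag is set."""

    muted = False

    def handle_starttag(self, tag, method, attrs):
        if not self.muted:
            method(attrs)

    def handle_endtag(self, tag, method):
        if not self.muted:
            method()


d = Suppressing()
d.finish_starttag("p", [])   # recorded
d.muted = True
d.finish_endtag("p")         # suppressed by the overridden hook
```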
Added support for a number of SGML shortcuts:
        shorthand               full notation
        <tag>...<>...           <tag>...<tag>...
        <tag>...</>             <tag>...</tag>
        <tag/.../               <tag>...</tag>
        <tag1<tag2>             <tag1><tag2>
        </tag1</tag2>           </tag1></tag2>
        </tag1<tag2>            </tag1><tag2>
This required factoring out some common actions and rationalizing the
interface to parse_endtag(), so as to make the code more readable.
Fixed syntax for &entity and &#char references so the trailing
semicolon is optional; removed explicit support for trailing period
(which was a TBL mistake in HTML 0.0).
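A sketch of reference matching with an optional trailing semicolon, using a hypothetical pattern rather than sgmllib's own expressions. REF and find_refs are names invented for this example.

```python
import re

# Hypothetical pattern: '&#digits' or '&name', each optionally followed
# by ';'. Names may contain '-' and '.' after the first letter.
REF = re.compile(r"&(?:#([0-9]+)|([a-zA-Z][-.a-zA-Z0-9]*));?")


def find_refs(text):
    """Return ('char', digits) or ('entity', name) for each reference."""
    refs = []
    for m in REF.finditer(text):
        if m.group(1) is not None:
            refs.append(("char", m.group(1)))
        else:
            refs.append(("entity", m.group(2)))
    return refs


refs = find_refs("a &amp b &#65; c &foo.bar-baz;")
```

Because this particular pattern allows '.' inside a name, a period directly after a reference is absorbed into the name instead of being treated as a terminator.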
Generalized the test program.
Tried to speed things up a little. (More to come after the profile
results are in.)
Fix error recovery: call the end methods popped from the stack instead
of the one that triggers. (Plus some complications because of the way
HTML extensions are handled in Grail.)