56 Commits

Author SHA1 Message Date
Neal Norwitz 48829ba61d As mentioned on python-dev, reverting patch #1504333 because it introduced
an infinite loop in rev 47154.

This patch also adds a test to prevent the regression.

Will backport to 2.4 and head later.
2006-09-11 04:05:18 +00:00
Fred Drake a136210a9f SF bug #1504333: sgmllib should allow angle brackets in quoted values
(modified patch by Sam Ruby; changed to use separate REs for start and end
 tags to reduce matching cost for end tags; extended tests; updated to avoid
 breaking previous changes to support IPv6 addresses in unquoted attribute
 values)
2006-06-29 00:51:53 +00:00
Fred Drake 2f99da636b - SF bug #853506: IPv6 address parsing in sgmllib
('[' and ']' were not accepted in unquoted attribute values)

- cleaned up tests of character and entity reference decoding so the
  tests cover the documented relationships among handle_charref,
  handle_entityref, convert_charref, convert_codepoint, and
  convert_entityref, without bringing up Unicode issues that sgmllib
  cannot be involved in
2006-06-23 06:03:45 +00:00
Fred Drake 541660553d fix change that broke the htmllib tests 2006-06-17 01:07:54 +00:00
Fred Drake fab461a4b5 SF patch 1504676: Make sgmllib char and entity references pluggable
(implementation/tests contributed by Sam Ruby)
2006-06-16 23:45:06 +00:00
Fred Drake 6ce9fe880b explain an XXX in more detail 2006-06-14 05:15:51 +00:00
Tim Peters 480725d4c5 Whitespace normalization. 2006-04-03 02:46:44 +00:00
Georg Brandl 7f6b67c235 patch #1462498: handle entityrefs in attribute values. 2006-04-01 08:35:18 +00:00
Fred Drake 58ae830fd0 add name that should be considered public to __all__ 2004-09-09 01:49:58 +00:00
Walter Dörwald 70a6b49821 Replace backticks with repr() or "%r"
From SF patch #852334.
2004-02-12 17:35:32 +00:00
Martin v. Löwis dc14ab13c4 Patch #793559: Reset __starttext_tag. Fixes #709491. Backported to 2.3. 2003-09-20 10:58:38 +00:00
Fred Drake 75ab1462d5 Allow "@" in unquoted attribute values.
Added test that checks for characters allowed in the query part of URLs.
Backport candidate.
2003-04-29 22:12:55 +00:00
Tim Peters 0eadaac7dc Whitespace normalization. 2003-04-24 16:02:54 +00:00
Martin v. Löwis 3163a3b4b2 Patch #545300: Support marked sections. 2003-03-30 14:25:40 +00:00
Fred Drake 0834d77bc4 Accept commas in unquoted attribute values.
This closes SF patch #669683.
2003-03-14 16:21:57 +00:00
Raymond Hettinger f13eb55d59 Replace boolean test with is None. 2002-06-02 00:40:05 +00:00
Raymond Hettinger 54f0222547 SF 563203. Replaced 'has_key()' with 'in'. 2002-06-01 14:18:47 +00:00
Fred Drake 5445f078df Re-arrange things and remove some unused variables/imports to keep pychecker
happy.  (This does not cover everything it complained about, though.)
2001-10-26 18:02:28 +00:00
Fred Drake a3bae3369c Re-factor the SGMLParser class to use the new markupbase.ParserBase class.
Use a new internal method, error(), consistently to raise parse errors;
the new base class also uses this.
Adjust the parse_comment() method to return the new offset into the buffer
instead of the number of characters scanned; this was the only helper
method that did it this way, so we have better consistency now.  Required
to share the new base class.
This fixes SF bug #448482 and #453706.
2001-09-24 20:15:51 +00:00
Martin v. Löwis 02d893cfae Patch #444359: Remove unused imports. 2001-08-02 07:15:29 +00:00
Fred Drake 390e9dbd4f Make the new docstrings better conform to Guido's style guide. 2001-07-19 20:57:23 +00:00
Fred Drake 08f8dd6d0c Added docstrings based on a patch by Evelyn Mitchell.
This closes SF patch #440153.
2001-07-19 20:08:04 +00:00
Fred Drake fb38c76e0f In CDATA mode, make sure entity-reference syntax is not interpreted;
entity references are not allowed in that mode.

Do a better job of scanning <!DOCTYPE ...> declarations; based on the
code in HTMLParser.py.
2001-07-16 18:30:35 +00:00
Fred Drake 8600b47b61 Be more permissive in what is accepted as an attribute name; this makes
this module slightly more resilient in the face of XHTML input, or just
colons in attribute names.
2001-07-14 05:50:33 +00:00
Fred Drake dc19163b18 Allow underscores in tag names and quote characters in unquoted attribute
values.  The change for attribute values matches the way Mozilla and
Navigator view the world, at least.

This closes SF bug #436621.
2001-07-05 18:21:57 +00:00
Guido van Rossum 39d345127e parse_declaration(): be more lenient in what we accept. We now
basically accept <!...> where the dots can be single- or double-quoted
strings or any other character except >.

Background: I found a real-life example that failed to parse with
the old assumption: http://www.opensource.org/licenses/jabberpl.html
contains a few constructs of the form <![if !supportLists]>...<![endif]>.
2001-05-21 20:17:17 +00:00
Guido van Rossum 74cde5bb3e Fix typo in exception name (SGMLParserError should be SGMLParseError)
found by Neal Norwitz's PyChecker.
2001-04-15 13:01:41 +00:00
Fred Drake 669573726b Change RuntimeError to SGMLParseError, which subclasses RuntimeError
for backward compatibility.

Add support for SGML declaration syntax (<!....>) to some reasonable
degree.  This does not support everything allowed in SGML, but should
work with "real" HTML (internal subset in a DOCTYPE is not handled).
The content of the declaration is passed to the .handle_decl() method,
which can be overridden by subclasses.
2001-03-16 20:04:57 +00:00
Fred Drake 62dfed96be Change "[%s]" % string.whitespace to r"\s" in regular expressions. 2001-03-14 16:18:56 +00:00
Guido van Rossum b68c245662 SF Patch # 103839 by dougfort: Allow ';' in attributes
sgmllib does not recognize HTML attributes containing the semicolon
';' character. This may be in accordance with the HTML spec, but there
are sites that use it (excite.com) and the browsers I regularly use
(IE5, Netscape, Opera) all handle it. Doug Fort Downright Software LLC
2001-02-19 18:39:09 +00:00
Skip Montanaro 0de65807e6 bunch more __all__ lists
also modified check_all function to suppress all warnings since they aren't
relevant to what this test is doing (allows quiet checking of regsub, for
instance)
2001-02-15 22:15:14 +00:00
Eric S. Raymond 18af564bef Use ValueError instead of string.atoi.error, since we've switched to
int().
2001-02-09 10:12:19 +00:00
Eric S. Raymond 1b645e8cd3 String method conversion. 2001-02-09 07:49:30 +00:00
Tim Peters 495ad3c8cc Whitespace normalization. 2001-01-15 01:36:40 +00:00
Fred Drake 8152d32375 Update the code to better reflect recommended style:
Use != instead of <> since <> is documented as "obsolescent".
Use "is" and "is not" when comparing with None or type objects.
2000-12-12 23:20:45 +00:00
Fred Drake b46696c0ed [Old patch that hadn't been checked in.]
get_starttag_text():  New method.
        Return the text of the most recently parsed start tag, from
        the '<' to the '>' or '/'.  Not really useful for structure
        processing, but requested for Web-related use.  May also be
        useful for being able to re-generate the input from the parse
        events, but there's no equivalent for end tags.

attrfind:  Be a little more forgiving of unquoted attribute values.
2000-06-29 18:50:59 +00:00
Jeremy Hylton a05e293a21 typos fixed by Rob Hooft 2000-06-28 14:48:01 +00:00
Guido van Rossum e7b146fb3b The third and final doc-string sweep by Ka-Ping Yee.
The attached patches update the standard library so that all modules
have docstrings beginning with one-line summaries.

A new docstring was added to formatter.  The docstring for os.py
was updated to mention nt, os2, ce in addition to posix, dos, mac.
2000-02-04 15:28:42 +00:00
Fred Drake dfd8954e36 Allow recognition of attributes even if they don't have space in front
of them.  I.e., '<a name="foo"href="bar.html">' will now have two
attributes recognized.

Based on comments from newsgroup.
1999-01-25 21:57:07 +00:00
Guido van Rossum 5fdf85254c Patch by Chris Herborth (posted to comp.lang.python) to make it behave
with tags that have - or . in their names.
1998-08-24 20:59:13 +00:00
Guido van Rossum b84ef9bc61 Put back the call to report_unbalanced() that was lost when
parse_endtag() was restructured into parse_endtag() and finish_endtag().
1998-07-07 22:46:11 +00:00
Guido van Rossum 1ad00717fb Patch by Lars Marius Garshol:
- Handle <? processing instructions >.

- Allow . and - in entity names.

Also fixed an oversight in the previous fix (in one place, [ \t\r\n]
was used instead of string.whitespace).
1998-05-28 22:48:53 +00:00
Fred Drake de2f708299 Fix regexp for attrfind; bug reported by Lars Marius Garshol
<larsga@ifi.uio.no>.
1998-04-16 21:04:26 +00:00
Guido van Rossum 45e2fbc2e7 Mass check-in after untabifying all files that need it. 1998-03-26 21:13:24 +00:00
Guido van Rossum 1fef181183 Although it's hard to be sure, I *think* this is a working conversion
from regex to re style regular expressions.  This should make sgmllib
and htmllib threadsafe, so I can now create a threaded version of
webchecker...
1997-10-23 19:09:21 +00:00
Fred Drake 09bcf8c031 (sgmllib.py): Partial acceptance of patch from David Leonard
<leonard@dstc.edu.au>; allows hyphen and period in the middle
	of attribute names.  Still not allowed as first character;
	as first character these are illegal in the Reference Concrete
	Syntax, and we've not identified any use of these characters as
	the first char in an attribute name in deployment on the web.
1996-12-16 21:56:27 +00:00
Guido van Rossum 48766512a0 Reformatted with 4-space tab stops.
Allow '=' and '~' in unquoted attribute values.

Added overridable methods handle_starttag(tag, method, attrs) and
handle_endtag(tag, method) so subclasses can decide whether they
really want to call the method (e.g. when suppressing some portion of
the document).

Added support for a number of SGML shortcuts:

        shorthand               full notation
        <tag>...<>...           <tag>...<tag>...
        <tag>...</>             <tag>...</tag>
        <tag/.../               <tag>...</tag>
        <tag1<tag2>             <tag1><tag2>
        </tag1</tag2>           </tag1></tag2>
        </tag1<tag2>            </tag1><tag2>

This required factoring out some common actions and rationalizing the
interface to parse_endtag(), so as to make the code more readable.

Fixed syntax for &entity and &#char references so the trailing
semicolon is optional; removed explicit support for trailing period
(which was a TBL mistake in HTML 0.0).

Generalized the test program.

Tried to speed things up a little.  (More to come after the profile
results are in.)

Fix error recovery: call the end methods popped from the stack instead
of the one that triggers.  (Plus some complications because of the way
HTML extensions are handled in Grail.)
1996-03-28 18:45:04 +00:00
Guido van Rossum 650ba37e1d typos in attrfind regex 1995-10-06 15:30:28 +00:00
Guido van Rossum e3d9320fc5 allow _ in attr names (Netscape!) 1995-09-30 16:49:36 +00:00
Guido van Rossum 3c0bfd0dee fix <!...!> parsing; added verbose option; don't lowercase entityrefs 1995-09-22 00:54:32 +00:00