31 Commits

Author SHA1 Message Date
Andrew M. Kuchling a982c44543 [Patch #918212] Support XHTML's 'id' attribute, which can be on any element. 2004-03-21 19:07:23 +00:00
Mark Hammond ce56c377a0 When bad HTML is encountered, ignore the page rather than failing with
a traceback.
2003-02-27 06:59:10 +00:00
Fred Drake 0b9e3f750c Handle the Content-Type header a little more appropriately: if it
contains options, drop them to get the major/minor content type.
Modified from the supplied patch to support more whitespace variation.
Closes SF patch #613605.
2002-11-12 22:19:34 +00:00
Walter Dörwald aaab30e00c Apply diff2.txt from SF patch http://www.python.org/sf/572113
(with one small bugfix in bgen/bgen/scantools.py)

This replaces string module functions with string methods
for the stuff in the Tools directory. Several uses of
string.letters etc. are still remaining.
2002-09-11 20:36:02 +00:00
Walter Dörwald 88a20baa77 Apply diff.txt from SF patch http://www.python.org/sf/561478
This uses cgi.parse_header() in Checker.checkforhtml(), so that
webchecker recognises the mime type text/html even if options
are specified.
2002-06-06 17:01:21 +00:00
Andrew M. Kuchling 566c0c737f [Bug #512799] urllib.splittype() returns a 2-tuple. (Reported by seb bacon) 2002-03-08 17:19:10 +00:00
Guido van Rossum f0953b9dff Fix SF bug #482171: webchecker dies on file: URLs w/o robots.txt
The cause seems to be that when a file URL doesn't exist,
urllib.urlopen() raises OSError instead of IOError.  Simply add this
to the except clause.  Not elegant, but effective. :-)
2001-12-11 22:41:24 +00:00
Fred Drake d34a9c98a9 Added more link attributes based on additional information from Chris
McCafferty <christopher.mccafferty@csg.ch>, and a bit of experimentation
with Navigator 4.7.

HTML-as-deployed is evil!
2001-04-05 18:14:50 +00:00
Fred Drake f3186e8242 A number of improvements based on a discussion with Chris McCafferty
<christopher.mccafferty@csg.ch>:

Add javascript: and telnet: to the types of URLs we ignore.

Add support for several additional URL-valued attributes on the BODY,
FRAME, IFRAME, LINK, OBJECT, and SCRIPT elements.
2001-04-04 17:47:25 +00:00
Guido van Rossum 84306246f1 Fix suggested by Magnus Kessler: in class Page, it is possible for
self.parser to be None; in that case don't dereference it in
getnames().
2000-03-28 20:10:39 +00:00
Guido van Rossum e284b21457 Integrated Sam Bayer's wcnew.py code. It seems silly to keep two
files.  Removed Sam's "SLB" change comments; otherwise this is the
same as wcnew.py.
1999-11-17 15:40:08 +00:00
Guido van Rossum dbd5c3e63b Samuel L. Bayer:
- forced new done origins to set errors if they're in self.bad (fixes
  bug where only the first of a number of errorful references to a
  link is reported under some circumstances)
- suppressed adding duplicates to self.todo list (cleans up printout
  in wcgui details)
1999-11-17 15:00:14 +00:00
Guido van Rossum 0ec1493d0b Some changes (maybe not enough?) to make it work on Windows with local
file URLs.
1999-04-26 23:11:46 +00:00
Guido van Rossum a42c1ee21d Added note() message to Page class -- this was used but didn't exist.
(The alternative would be to call self.checker.note() but since
self.checker might be None that's not quite right.)
1998-08-06 21:31:13 +00:00
Guido van Rossum 125700addb Instead of printing, use self.message() or self.note(). 1998-07-08 03:04:39 +00:00
Guido van Rossum 6eb9d32c43 sort the urls in the todo list 1998-06-15 12:33:02 +00:00
Guido van Rossum bee64533d6 Use a try-except so that the pickle file is written even when we die
because of an unexpected exception.
1998-04-27 19:35:15 +00:00
Guido van Rossum 986abac1ba Give in to tabnanny 1998-04-06 14:29:28 +00:00
Guido van Rossum 00756bd4a6 Major overhaul. Don't use global variables (e.g. verbose); use
instance variables.  Make all global functions methods, for easy
overriding.  Restructure getpage() for easy overriding.  Add
save_pickle() method and load_pickle() global function to make it
easier for other programs to emulate the toplevel interface.
1998-02-21 20:02:09 +00:00
Guido van Rossum 2237b73baf Several changes:
- Change the code that looks for robots.txt to always look in /, even
if the "root" path is somewhere deep down below.

- Add link processing in <AREA> tags.

- Change safeclose() to avoid crashing when the file has no geturl()
method.
1997-10-06 18:54:01 +00:00
Guido van Rossum 89efda363f Avoid the fancy handler for error 401 (request authentication). 1997-05-07 15:00:56 +00:00
Guido van Rossum af310c1d00 Restructured Checker class to get rid of 'ext' table.
Links are now either in 'todo' or 'done', and ext links
are handled more like local links except that no further
links are gathered (and sometimes they aren't checked,
e.g. for mailto and news URLs).  The -x option reverses
its meaning: it disables checking of ext links (they are
moved to 'done' without checking).  A new 'errors' table
collects pages with bad links as we go -- redundant,
but useful for the GUI version which needs to report
this as we go.  Some new methods, including reset().
New checkpoint format.

Adapted the GUI to the changes in the Checker class.
Added Quit and "Start over" buttons, and a checkbox
to disable checking external links.  The details
window now also shows bad links emanating from the
selected page.  Miscellaneous small changes.
1997-02-02 23:30:32 +00:00
Guido van Rossum 6133ec656e Process <img> and <frame> tags. Don't bother skipping second href. 1997-02-01 05:16:08 +00:00
Guido van Rossum 0b0b5f0279 Spin off checking of external page in a subroutine.
Increase MAXPAGE to 150K.
Add back printing of __doc__ for usage message.
1997-01-31 18:57:23 +00:00
Guido van Rossum e5605ba3c2 Many misc changes.
- Faster HTML parser derived from SGMLparser (Fred Gansevles).

- All manipulations of todo, done, ext, bad are done via methods, so a
derived class can override.  Also moved the 'done' marking to
dopage(), so run() is much simpler.

- Added a method status() which returns a string containing the
summary counts; added a "total" count.

- Drop the guessing of the file type before opening the document -- we
still need to check those links for validity!

- Added a subroutine to close a connection which first slurps up the
remaining data when it's an ftp URL -- apparently closing an ftp
connection without reading till the end makes it hang.

- Added -n option to skip running (only useful with -R).

- The Checker object now has an instance variable which is set to 1
when it is changed.  This is not pickled.
1997-01-31 14:43:15 +00:00
Guido van Rossum c59a5d449f Set proper User-agent header (Python-webchecker/<version>).
When -x is combined with -q, still do the checking, but don't print
the error in this phase -- they are reported by report_errors().
1997-01-30 06:04:00 +00:00
Guido van Rossum 2739cd74b3 Some refinements of the external-link checking code: insert the errors
in the 'bad' dictionary (sanitize them so they are picklable; the
sanitation code is now a subroutine); don't check mailto: URLs; omit
colon in Error message.
1997-01-30 04:26:57 +00:00
Guido van Rossum de66268588 Added -x option to check external links. Slooooow! 1997-01-30 03:58:21 +00:00
Guido van Rossum 325a64f207 Catch I/O errors when parsing robots.txt file.
Add version number, printed at startup in non-quiet mode.
1997-01-30 03:30:20 +00:00
Guido van Rossum 3edbb35023 Added robots.txt support, using Skip Montanaro's parser.
Fixed occasional inclusion of unpicklable objects (Message in errors).
Changed indent of a few messages.
1997-01-30 03:19:41 +00:00
Guido van Rossum 272b37d686 web tree checker 1997-01-30 02:44:48 +00:00