Commit Graph

29 Commits

Author SHA1 Message Date
Raymond Hettinger a5413c4997 Issue 21469: Mitigate risk of false positives with robotparser.
* Repair the broken link to norobots-rfc.txt.

* HTTP response codes >= 500 are treated as a failed read rather than as "not
found".  "Not found" means that we can assume the entire site is allowed; a 5xx
server error tells us nothing.

* A successful read() or parse() updates the mtime (which is defined to be "the
  time the robots.txt file was last fetched").

* The can_fetch() method returns False unless we've had a read() with a 2xx or
4xx response.  This avoids false positives in the case where a user calls
can_fetch() before calling read().

* I don't see any easy way to test this patch without hitting internet
resources that might change, or without using mock objects that wouldn't
provide much reassurance.
2014-05-12 22:18:50 -07:00
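As a minimal usage sketch of the behaviour this change describes (not the patch itself), using the Python 3 urllib.robotparser API; the example.com URL is hypothetical and the results depend on what the server actually returns:

```python
import urllib.robotparser

# Hypothetical robots.txt location.
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")

# Before a successful read(), can_fetch() stays conservative and returns False.
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False

# read() fetches robots.txt: a 2xx response is parsed, a 4xx means the whole
# site can be assumed allowed, and a 5xx is treated as a failed read.
rp.read()

# mtime() is "the time the robots.txt file was last fetched"; a successful
# read() or parse() updates it.
print(rp.mtime())
```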
Senthil Kumaran 2c4810efa2 #17403: urllib.robotparser normalizes URLs before adding them to the ruleline.
This helps in handling certain types of invalid URLs in a conservative manner.
2013-05-29 05:58:47 -07:00
Georg Brandl 2bd953e291 Merged revisions 83238 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/branches/py3k

........
  r83238 | georg.brandl | 2010-07-29 19:55:01 +0200 (Thu, 29 Jul 2010) | 1 line

  #4108: the first default entry (User-agent: *) wins.
........
2010-08-01 20:59:03 +00:00
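To illustrate the #4108 behaviour, here is a hedged sketch using the modern urllib.robotparser module with a made-up robots.txt containing two default entries:

```python
import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The first "User-agent: *" entry wins, so /private/ stays disallowed even
# though the second default entry would allow everything.
print(rp.can_fetch("SomeBot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("SomeBot", "http://example.com/index.html"))         # True
```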
Senthil Kumaran a4f79f97db Merged revisions 83209 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/branches/py3k

........
  r83209 | senthil.kumaran | 2010-07-28 21:57:56 +0530 (Wed, 28 Jul 2010) | 3 lines

  Fix Issue #6325 - robotparser now honors URLs with query strings.
........
2010-07-28 16:35:35 +00:00
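A small sketch of what Issue #6325 enables, with an illustrative robots.txt rule that carries a query string (module and URLs as they exist in current Python 3):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /some/path?name=value",
])

# The rule with a query string is honored rather than dropped.
print(rp.can_fetch("*", "http://example.com/some/path?name=value"))  # False
print(rp.can_fetch("*", "http://example.com/some/path"))             # True
```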
Skip Montanaro 1ef19f0de1 Close issue 3437 - missing state change when Allow lines are processed.
Adds test cases which use Allow: as well.
2008-07-27 00:49:02 +00:00
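A hedged example of the Allow: handling those test cases exercise; rules are checked in order and the first matching rule decides (the paths are illustrative):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /folder1/myfile.html",
    "Disallow: /folder1/",
])

# The Allow rule matches first for myfile.html; everything else under
# /folder1/ falls through to the Disallow rule.
print(rp.can_fetch("*", "http://example.com/folder1/myfile.html"))     # True
print(rp.can_fetch("*", "http://example.com/folder1/otherfile.html"))  # False
```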
Benjamin Peterson 0522a9f1eb #1778443 robotparser fixes from Aristotelis Mikropoulos 2008-07-12 23:41:19 +00:00
Skip Montanaro b8bdbc04e7 Get rid of _test(), _main(), _debug() and _check(). Tests are no longer
needed (better set available in Lib/test/test_robotparser.py).  Clean up a
few PEP 8 nits (compound statements on a single line, whitespace around
operators).
2008-04-28 03:27:53 +00:00
Skip Montanaro 1a41313684 Fixes #813986. 2007-08-28 23:22:52 +00:00
Georg Brandl 4ffc8f5107 Patch #1555098: use str.join() instead of repeated string
concatenation in robotparser.
2007-03-13 09:41:31 +00:00
Martin v. Löwis 31bd529f53 Patch #1014237: Consistently return booleans throughout. 2004-08-23 20:42:35 +00:00
Raymond Hettinger bac788a3cd Replace str.find() != -1 with the more readable "in" operator. 2004-05-04 09:21:43 +00:00
Raymond Hettinger 2d95f1ad57 SF patch #911431: robot.txt must be robots.txt
(Contributed by George Yoshida.)
2004-03-13 20:27:23 +00:00
Guido van Rossum 68468eba63 Get rid of many apply() calls. 2003-02-27 20:14:51 +00:00
Neal Norwitz 5aee504ccb Remove import of re, it is not used 2002-05-31 14:14:06 +00:00
Raymond Hettinger aef22fb9cd Patch 560023 adding docstrings. 2.2 Candidate (after verifying modules were not updated after 2.2). 2002-05-29 16:18:42 +00:00
Tim Peters bc0e910826 Convert a pile of obvious "yes/no" functions to return bool. 2002-04-04 22:55:58 +00:00
Martin v. Löwis 73f570ba08 Correctly set default entry in all cases. 2002-03-18 10:43:18 +00:00
Martin v. Löwis d22368ffb3 Patch #499513: use readline() instead of readlines(). Removed the
unnecessary redirection limit code which is already in FancyURLopener.
2002-03-18 10:41:20 +00:00
Martin v. Löwis 1c63f6e489 Correct various errors:
- Use substring search, not re search for user-agent and paths.
- Consider * entry last. Unquote, then requote URLs.
- Treat empty Disallow as "allow everything".
Add test cases. Fixes #523041
2002-02-28 15:24:47 +00:00
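Of the items above, the "empty Disallow means allow everything" rule is easy to illustrate with the modern urllib.robotparser module (the URL is hypothetical):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",   # empty Disallow: the whole site is allowed
])

print(rp.can_fetch("AnyBot", "http://example.com/anything.html"))  # True
```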
Andrew M. Kuchling e7abf97903 Remove unused import (PyChecker) 2001-08-13 14:43:43 +00:00
Tim Peters 0e6d213177 Whitespace normalization. 2001-02-15 23:56:39 +00:00
Skip Montanaro 5bba231d1e The bulk of the credit for these changes goes to Bastian Kleineidam
* restores urllib as the file fetcher (closes bug #132000)
* allows checking URLs with empty paths (closes patches #103511 and 103721)
* properly handle user agents with versions (e.g., SpamMeister/1.5)
* added several more tests
2001-02-12 20:58:30 +00:00
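The versioned user-agent point above can be sketched as follows with the current urllib.robotparser API; the agent names and URLs are made up, and only the part before the "/" is matched:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: SpamMeister",
    "Disallow: /",
])

# "SpamMeister/1.5" matches the "SpamMeister" entry; other agents do not.
print(rp.can_fetch("SpamMeister/1.5", "http://example.com/index.html"))  # False
print(rp.can_fetch("FriendlyBot/2.0", "http://example.com/index.html"))  # True
```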
Eric S. Raymond 141971f22a String method conversion. 2001-02-09 08:40:40 +00:00
Tim Peters dfc538acae Whitespace normalization. 2001-01-21 04:49:16 +00:00
Skip Montanaro e99d5ea25b Added __all__ lists to a number of Python modules;
added a test script and an expected-output file as well.
This closes patch #103297.
__all__ attributes will be added to other modules without first submitting
a patch, just adding the necessary line to the test script to verify a
more-or-less correct implementation.
2001-01-20 19:54:20 +00:00
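For reference, the __all__ line this adds to robotparser simply names the module's public API, a single class in this case:

```python
# Public API exposed by "from robotparser import *".
__all__ = ["RobotFileParser"]
```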
Skip Montanaro 663f6c2ad2 rewrite of robotparser.py by Bastian Kleineidam. Closes patch 102229. 2001-01-20 15:59:25 +00:00
Guido van Rossum dc8b7980e0 Skip Montanaro:
The robotparser.py module currently lives in Tools/webchecker.  In
preparation for its migration to Lib, I made the following changes:

    * renamed the test() function _test
    * corrected the URLs in _test() so they refer to actual documents
    * added an "if __name__ == '__main__'" catcher to invoke _test()
      when run as a main program
    * added doc strings for the two main methods, parse and can_fetch
    * replaced usage of regsub and regex with corresponding re code
2000-03-27 19:29:31 +00:00
Guido van Rossum 986abac1ba Give in to tabnanny 1998-04-06 14:29:28 +00:00
Guido van Rossum bbf8c2fafd Skip Montanaro's robots.txt parser. 1997-01-30 03:18:23 +00:00