* Repair the broken link to norobots-rfc.txt.
* HTTP response codes >= 500 are treated as a failed read rather than as a
not found. Not found means that we can assume the entire site is allowed; a
5xx server error tells us nothing.
* A successful read() or parse() updates the mtime (which is defined to be "the
time the robots.txt file was last fetched").
* The can_fetch() method returns False unless we've had a read() with a 2xx
or 4xx response. This avoids false positives in the case where a user calls
can_fetch() before calling read() (see the sketch after this list).
* I don't see any easy way to test this patch without hitting internet
resources that might change, or without using mock objects that wouldn't
provide much reassurance.
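
A minimal sketch of the intended behavior (the URL and user-agent string are
placeholders, not taken from the patch):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.org/robots.txt")  # placeholder URL

    # No read() yet, so nothing is known: can_fetch() conservatively
    # answers False instead of guessing.
    print(rp.can_fetch("MyBot", "http://example.org/private/"))  # False

    rp.read()          # fetch and parse robots.txt
    print(rp.mtime())  # time of the last successful fetch, set by read()

    # After a 2xx (rules parsed) or 4xx (entire site allowed) response,
    # can_fetch() gives a real answer; after a 5xx it remains False.
    print(rp.can_fetch("MyBot", "http://example.org/private/"))
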
The new urllib package consists of code from urllib, urllib2, urlparse, and
robotparser. The old modules have all been removed. The new package has five
submodules: urllib.parse, urllib.request, urllib.response, urllib.error, and
urllib.robotparser. The urllib.request.urlopen() function uses the URL
opener from urllib2.
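
For example, parsing and fetching that previously went through urlparse and
urllib2 now go through the new submodules (a minimal sketch; the URL is a
placeholder):

    from urllib.error import URLError
    from urllib.parse import urlparse
    from urllib.request import urlopen

    url = "http://example.org/"      # placeholder URL
    print(urlparse(url).netloc)      # parsing now lives in urllib.parse

    try:
        response = urlopen(url)      # the urllib2-style opener under the hood
        print(response.getcode())    # e.g. 200
        response.close()
    except URLError as exc:          # errors now live in urllib.error
        print("fetch failed:", exc.reason)
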
Note that the unittests have not been renamed for the
beta, but they will be renamed in the future.
Joint work with Senthil Kumaran.