:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.

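For code that has to run under both names without conversion, one common
approach (a sketch, not something this module provides) is to try the Python 2
import first and fall back to the Python 3 location::

   try:
       import robotparser                  # Python 2 module name
   except ImportError:
       from urllib import robotparser      # Python 3 module name

   # The class is used the same way under either name.
   rp = robotparser.RobotFileParser()
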
This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether a particular user agent may fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.


   .. method:: parse(lines)

      Parses the *lines* argument, a sequence of lines from a
      :file:`robots.txt` file.


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.


   .. method:: mtime()

      Returns the time the :file:`robots.txt` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      :file:`robots.txt` files periodically.


   .. method:: modified()

      Sets the time the :file:`robots.txt` file was last fetched to the current
      time.

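A long-running spider can combine :meth:`RobotFileParser.mtime` and
:meth:`RobotFileParser.modified` to decide when to fetch :file:`robots.txt`
again.  The sketch below assumes that :meth:`RobotFileParser.mtime` returns
seconds since the epoch (as :func:`time.time` does) and ``0`` before anything
has been fetched; this matches the CPython implementation but is not stated
above.  ``MAX_AGE``, ``is_stale`` and ``refresh`` are hypothetical names chosen
for this example::

   import time

   try:
       import robotparser                  # Python 2 module name
   except ImportError:
       from urllib import robotparser      # Python 3 module name

   MAX_AGE = 24 * 60 * 60   # hypothetical refresh interval: one day, in seconds


   def is_stale(rp, max_age=MAX_AGE):
       """Return True if rp's rules were fetched more than max_age seconds ago."""
       return time.time() - rp.mtime() > max_age


   def refresh(rp):
       """Re-read robots.txt and record the fetch time (needs network access)."""
       rp.read()
       rp.modified()


   rp = robotparser.RobotFileParser()
   stale_at_start = is_stale(rp)      # nothing fetched yet (assumed mtime() == 0)
   rp.modified()                      # record "fetched just now" without a download
   stale_after_mark = is_stale(rp)
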
The following example demonstrates basic use of the :class:`RobotFileParser`
class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

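The same kind of query can be tried without network access by handing the
rules to :meth:`RobotFileParser.parse` directly.  In this sketch the rules, the
``ExampleBot`` agent name and the ``example.com`` URLs are made up for
illustration, and the ``try``/``except`` import is only there so that the code
runs under either module name::

   try:
       import robotparser                  # Python 2 module name
   except ImportError:
       from urllib import robotparser      # Python 3 module name

   # Hypothetical rules: everyone is kept out of /cgi-bin/, and the
   # made-up agent "ExampleBot" is kept out of the whole site.
   ROBOTS_TXT = """\
   User-agent: *
   Disallow: /cgi-bin/

   User-agent: ExampleBot
   Disallow: /
   """

   rp = robotparser.RobotFileParser()
   rp.parse(ROBOTS_TXT.splitlines())   # parse() takes the file's lines

   home = "http://www.example.com/"
   search = "http://www.example.com/cgi-bin/search?city=San+Francisco"

   allowed_home = rp.can_fetch("*", home)            # True
   allowed_search = rp.can_fetch("*", search)        # False
   allowed_bot = rp.can_fetch("ExampleBot", home)    # False
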