cpython/Doc/library/robotparser.rst

80 lines
2.1 KiB
ReStructuredText
Raw Normal View History

2007-08-15 11:28:01 -03:00
:mod:`robotparser` --- Parser for robots.txt
=============================================
.. module:: robotparser
2008-04-28 00:25:37 -03:00
:synopsis: Loads a robots.txt file and answers questions about
2008-04-28 02:16:30 -03:00
fetchability of other URLs.
2007-12-08 11:26:16 -04:00
.. sectionauthor:: Skip Montanaro <skip@pobox.com>
2007-08-15 11:28:01 -03:00
.. index::
single: WWW
single: World Wide Web
single: URL
single: robots.txt
Merged revisions 68133-68134,68141-68142,68145-68146,68148-68149,68159-68162,68166,68171-68174,68179,68195-68196,68210,68214-68215,68217-68222 via svnmerge from svn+ssh://pythondev@svn.python.org/python/trunk ........ r68133 | antoine.pitrou | 2009-01-01 16:38:03 +0100 (Thu, 01 Jan 2009) | 1 line fill in actual issue number in tests ........ r68134 | hirokazu.yamamoto | 2009-01-01 16:45:39 +0100 (Thu, 01 Jan 2009) | 2 lines Issue #4797: IOError.filename was not set when _fileio.FileIO failed to open file with `str' filename on Windows. ........ r68141 | benjamin.peterson | 2009-01-01 17:43:12 +0100 (Thu, 01 Jan 2009) | 1 line fix highlighting ........ r68142 | benjamin.peterson | 2009-01-01 18:29:49 +0100 (Thu, 01 Jan 2009) | 2 lines welcome to 2009, Python! ........ r68145 | amaury.forgeotdarc | 2009-01-02 01:03:54 +0100 (Fri, 02 Jan 2009) | 5 lines #4801 _collections module fails to build on cygwin. _PyObject_GC_TRACK is the macro version of PyObject_GC_Track, and according to documentation it should not be used for extension modules. ........ r68146 | ronald.oussoren | 2009-01-02 11:44:46 +0100 (Fri, 02 Jan 2009) | 2 lines Fix for issue4472: "configure --enable-shared doesn't work on OSX" ........ r68148 | ronald.oussoren | 2009-01-02 11:48:31 +0100 (Fri, 02 Jan 2009) | 2 lines Forgot to add a NEWS item in my previous checkin ........ r68149 | ronald.oussoren | 2009-01-02 11:50:48 +0100 (Fri, 02 Jan 2009) | 2 lines Fix for issue4780 ........ r68159 | ronald.oussoren | 2009-01-02 15:48:17 +0100 (Fri, 02 Jan 2009) | 2 lines Fix for issue 1627952 ........ r68160 | ronald.oussoren | 2009-01-02 15:52:09 +0100 (Fri, 02 Jan 2009) | 2 lines Fix for issue r1737832 ........ r68161 | ronald.oussoren | 2009-01-02 16:00:05 +0100 (Fri, 02 Jan 2009) | 3 lines Fix for issue 1149804 ........ r68162 | ronald.oussoren | 2009-01-02 16:06:00 +0100 (Fri, 02 Jan 2009) | 3 lines Fix for issue 4472 is incompatible with Cygwin, this patch should fix that. ........ r68166 | benjamin.peterson | 2009-01-02 19:26:23 +0100 (Fri, 02 Jan 2009) | 1 line document PyMemberDef ........ r68171 | georg.brandl | 2009-01-02 21:25:14 +0100 (Fri, 02 Jan 2009) | 3 lines #4811: fix markup glitches (mostly remains of the conversion), found by Gabriel Genellina. ........ r68172 | martin.v.loewis | 2009-01-02 21:32:55 +0100 (Fri, 02 Jan 2009) | 2 lines Issue #4075: Use OutputDebugStringW in Py_FatalError. ........ r68173 | martin.v.loewis | 2009-01-02 21:40:14 +0100 (Fri, 02 Jan 2009) | 2 lines Issue #4051: Prevent conflict of UNICODE macros in cPickle. ........ r68174 | benjamin.peterson | 2009-01-02 21:47:27 +0100 (Fri, 02 Jan 2009) | 1 line fix compilation on non-Windows platforms ........ r68179 | raymond.hettinger | 2009-01-02 22:26:45 +0100 (Fri, 02 Jan 2009) | 1 line Issue #4615. Document how to use itertools for de-duping. ........ r68195 | georg.brandl | 2009-01-03 14:45:15 +0100 (Sat, 03 Jan 2009) | 2 lines Remove useless string literal. ........ r68196 | georg.brandl | 2009-01-03 15:29:53 +0100 (Sat, 03 Jan 2009) | 2 lines Fix indentation. ........ r68210 | georg.brandl | 2009-01-03 20:10:12 +0100 (Sat, 03 Jan 2009) | 2 lines Set eol-style correctly for mp_distributing.py. ........ r68214 | georg.brandl | 2009-01-03 20:44:48 +0100 (Sat, 03 Jan 2009) | 2 lines Make indentation consistent. ........ r68215 | georg.brandl | 2009-01-03 21:15:14 +0100 (Sat, 03 Jan 2009) | 2 lines Fix role name. ........ r68217 | georg.brandl | 2009-01-03 21:30:15 +0100 (Sat, 03 Jan 2009) | 2 lines Add rstlint, a little tool to find subtle markup problems and inconsistencies in the Doc sources. ........ r68218 | georg.brandl | 2009-01-03 21:38:59 +0100 (Sat, 03 Jan 2009) | 2 lines Recognize usage of the default role. ........ r68219 | georg.brandl | 2009-01-03 21:47:01 +0100 (Sat, 03 Jan 2009) | 2 lines Fix uses of the default role. ........ r68220 | georg.brandl | 2009-01-03 21:55:06 +0100 (Sat, 03 Jan 2009) | 2 lines Remove trailing whitespace. ........ r68221 | georg.brandl | 2009-01-03 22:04:55 +0100 (Sat, 03 Jan 2009) | 2 lines Remove tabs from the documentation. ........ r68222 | georg.brandl | 2009-01-03 22:11:58 +0100 (Sat, 03 Jan 2009) | 2 lines Disable the line length checker by default. ........
2009-01-03 17:55:17 -04:00
.. note::
The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
Python 3.0.
The :term:`2to3` tool will automatically adapt imports when converting
your sources to 3.0.
2007-08-15 11:28:01 -03:00
This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file. For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
2007-08-15 11:28:01 -03:00
.. class:: RobotFileParser()
2008-04-28 00:25:37 -03:00
This class provides a set of methods to read, parse and answer questions
about a single :file:`robots.txt` file.
2007-08-15 11:28:01 -03:00
.. method:: set_url(url)
2007-08-15 11:28:01 -03:00
Sets the URL referring to a :file:`robots.txt` file.
.. method:: read()
2007-08-15 11:28:01 -03:00
Reads the :file:`robots.txt` URL and feeds it to the parser.
.. method:: parse(lines)
2007-08-15 11:28:01 -03:00
Parses the lines argument.
.. method:: can_fetch(useragent, url)
2007-08-15 11:28:01 -03:00
2008-04-28 00:25:37 -03:00
Returns ``True`` if the *useragent* is allowed to fetch the *url*
according to the rules contained in the parsed :file:`robots.txt`
file.
2007-08-15 11:28:01 -03:00
.. method:: mtime()
2007-08-15 11:28:01 -03:00
2008-04-28 00:25:37 -03:00
Returns the time the ``robots.txt`` file was last fetched. This is
useful for long-running web spiders that need to check for new
``robots.txt`` files periodically.
2007-08-15 11:28:01 -03:00
.. method:: modified()
2007-08-15 11:28:01 -03:00
2008-04-28 00:25:37 -03:00
Sets the time the ``robots.txt`` file was last fetched to the current
time.
2007-08-15 11:28:01 -03:00
The following example demonstrates basic use of the RobotFileParser class. ::
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True