:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>

.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file. For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.

.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses the *lines* argument, a sequence of lines taken from a
      :file:`robots.txt` file.
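
      As a minimal sketch, *lines* can come from any in-memory source rather
      than from a URL fetched with :meth:`read`. The rules and the
      ``www.example.com`` URLs below are made up for illustration, and the
      import fallback is only there so that the snippet also runs under the
      Python 3 module name::

         try:
             import robotparser                # Python 2 name
         except ImportError:
             from urllib import robotparser    # Python 3 name

         # Hypothetical robots.txt content, held in a string for illustration.
         ROBOTS_TXT = """\
         User-agent: *
         Disallow: /cgi-bin/
         Disallow: /private/
         """

         rp = robotparser.RobotFileParser()
         # parse() expects the file split into individual lines.
         rp.parse(ROBOTS_TXT.splitlines())

         allowed = rp.can_fetch("*", "http://www.example.com/index.html")
         blocked = rp.can_fetch("*", "http://www.example.com/cgi-bin/search")
         print(allowed)   # True: no rule matches /index.html
         print(blocked)   # False: the path falls under /cgi-bin/

      Feeding lines directly this way avoids network access, which can be
      convenient when testing a spider against known rules.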

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.
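
      The sketch below illustrates how the answer can depend on *useragent*.
      The agent names ``ExampleBot`` and ``OtherBot``, the rules and the URLs
      are all hypothetical, and the rules are supplied through :meth:`parse`
      so that no network access is needed::

         try:
             import robotparser                # Python 2 name
         except ImportError:
             from urllib import robotparser    # Python 3 name

         # Hypothetical rules: ExampleBot is banned entirely, every other
         # agent is only kept out of /private/.
         rules = [
             "User-agent: ExampleBot",
             "Disallow: /",
             "",
             "User-agent: *",
             "Disallow: /private/",
         ]

         rp = robotparser.RobotFileParser()
         rp.parse(rules)

         home = "http://www.example.com/index.html"
         private = "http://www.example.com/private/data.html"

         example_home = rp.can_fetch("ExampleBot", home)     # False
         other_home = rp.can_fetch("OtherBot", home)         # True
         other_private = rp.can_fetch("OtherBot", private)   # False

      An agent that matches no specific record falls back to the ``*`` record,
      which is why ``OtherBot`` is still kept out of ``/private/``.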

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched. This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.
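
      One possible way for a long-running spider to combine :meth:`mtime` and
      :meth:`modified` is sketched below. The helper ``needs_refresh`` and the
      ``MAX_AGE`` interval are assumptions made for this example and are not
      part of the module. A real spider would call :meth:`read` where the
      sketch calls :meth:`parse`::

         import time

         try:
             import robotparser                # Python 2 name
         except ImportError:
             from urllib import robotparser    # Python 3 name

         MAX_AGE = 3600  # hypothetical refresh interval, in seconds

         def needs_refresh(rp, max_age=MAX_AGE, now=None):
             """Return True if the rules were never loaded or are too old."""
             if now is None:
                 now = time.time()
             last = rp.mtime()
             # A falsy mtime() is treated as "never fetched".
             return not last or now - last > max_age

         rp = robotparser.RobotFileParser()
         stale_before = needs_refresh(rp)    # nothing has been loaded yet

         rp.parse(["User-agent: *", "Disallow: /private/"])
         rp.modified()                       # record the time of this load
         stale_after = needs_refresh(rp)     # rules were loaded just now

      Calling :meth:`modified` explicitly keeps the sketch independent of
      whether a given version of the module records the time on its own.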

The following example demonstrates basic use of the :class:`RobotFileParser`
class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True