Documentation updates for the urllib package. Modified the documentation for the following renames:

urllib,urllib2 -> urllib.request,urllib.error
urlparse -> urllib.parse
RobotParser -> urllib.robotparser

Updated tutorial references and other module references (http.client.rst,
ftplib.rst,contextlib.rst)
Updated the examples in the urllib2-howto

Addresses Issue3142.
Senthil Kumaran 2008-06-23 04:41:59 +00:00
parent d11a44312f
commit aca8fd7a9d
12 changed files with 565 additions and 593 deletions


@ -1,6 +1,6 @@
************************************************
HOWTO Fetch Internet Resources Using urllib2
************************************************
*****************************************************
HOWTO Fetch Internet Resources Using urllib package
*****************************************************
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
@ -24,14 +24,14 @@ Introduction
A tutorial on *Basic Authentication*, with examples in Python.
**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
**urllib.request** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string
urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.
@ -40,43 +40,43 @@ For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib2` docs, but is supplementary to them.
the :mod:`urllib.request` docs, but is supplementary to them.
Fetching URLs
=============
The simplest way to use urllib2 is as follows::
The simplest way to use urllib.request is as follows::
import urllib2
response = urllib2.urlopen('http://python.org/')
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib2 mirrors this with a ``Request`` object which represents
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::
import urllib2
import urllib.request
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
the_page = response.read()
Note that urllib2 makes use of the same Request interface to handle all URL
Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::
req = urllib2.Request('ftp://example.com/')
req = urllib.request.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
@ -94,20 +94,20 @@ your browser does when you submit a HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the ``urllib`` library
*not* from ``urllib2``. ::
argument. The encoding is done using a function from the ``urllib.parse`` library
*not* from ``urllib.request``. ::
import urllib
import urllib2
import urllib.parse
import urllib.request
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
the_page = response.read()
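A minimal offline sketch of the form-encoding step, assuming a current Python 3 release where ``Request`` expects ``data`` as bytes (so the encoded string is additionally encoded to ASCII, a step the example above does not show). No request is sent; the sketch only inspects the ``Request`` object:

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
# A list of pairs keeps the parameter order explicit.
values = [('name', 'Michael Foord'),
          ('location', 'Northampton'),
          ('language', 'Python')]

# urlencode() returns a str; assumption: the running Python wants bytes here.
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data)

print(data)              # b'name=Michael+Foord&location=Northampton&language=Python'
print(req.get_method())  # POST, because a data argument was supplied
```

Passing ``req`` to ``urllib.request.urlopen`` would then send the POST.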
Note that other encodings are sometimes required (e.g. for file upload from HTML
@ -115,7 +115,7 @@ forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
If you do not pass the ``data`` argument, urllib.request uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
@ -127,18 +127,18 @@ GET request by encoding it in the URL itself.
This is done as follows::
>>> import urllib2
>>> import urllib
>>> import urllib.request
>>> import urllib.parse
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> url_values = urllib.parse.urlencode(data)
>>> print(url_values)
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.open(full_url)
>>> data = urllib.request.urlopen(full_url)
Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
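As a rough illustration of the GET case, the sketch below builds the full URL and then parses the query string back, without opening any connection. The parameter values are placeholders:

```python
import urllib.parse
import urllib.request

params = [('name', 'Somebody Here'), ('language', 'Python')]
url_values = urllib.parse.urlencode(params)

# The query string is appended to the base URL after a '?'.
full_url = 'http://www.example.com/example.cgi' + '?' + url_values
req = urllib.request.Request(full_url)

print(full_url)          # http://www.example.com/example.cgi?name=Somebody+Here&language=Python
print(req.get_method())  # GET, because no data argument was passed

# Parsing the query back recovers the original pairs.
query = urllib.parse.urlsplit(full_url).query
decoded = urllib.parse.parse_qsl(query)
print(decoded)
```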
@ -150,7 +150,7 @@ We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.
Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_ . By default urllib2 identifies itself as
to different browsers [#]_ . By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
@ -160,8 +160,8 @@ pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::
import urllib
import urllib2
import urllib.parse
import urllib.request
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
@ -170,9 +170,9 @@ Explorer [#]_. ::
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
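To check which headers a ``Request`` will carry before anything is sent, a small sketch along these lines may help. Note that ``Request`` normalises header names with ``str.capitalize()``, so the stored key is ``User-agent``:

```python
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

req = urllib.request.Request(url, headers={'User-Agent': user_agent})

# Header names are stored capitalised ('User-agent').
print(req.header_items())
sent_agent = req.get_header('User-agent')
print(sent_agent)
```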
The response also has two useful methods. See the section on `info and geturl`_
@ -182,7 +182,7 @@ which comes after we have a look at what happens when things go wrong.
Handling Exceptions
===================
*urlopen* raises ``URLError`` when it cannot handle a response (though as usual
*urlopen* raises ``URLError`` (defined in ``urllib.error``) when it cannot handle a response (though as usual
with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also
be raised).
@ -199,9 +199,9 @@ error code and a text error message.
e.g. ::
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>> req = urllib.request.Request('http://www.pretend_server.org')
>>> try: urllib.request.urlopen(req)
>>> except urllib.error.URLError as e:
>>> print(e.reason)
>>>
(4, 'getaddrinfo failed')
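The same pattern can be tried without a network connection. This sketch assumes a ``file:`` URL whose path does not exist (the path is made up); on current Python 3 releases the local-file handler wraps the underlying OS error in a ``URLError``:

```python
import urllib.error
import urllib.request

# Hypothetical path, chosen so that it should not exist on any machine.
missing = 'file:///this/path/should/not/exist-12345'

reason = None
try:
    urllib.request.urlopen(missing)
except urllib.error.URLError as e:
    # e.reason holds a message string or the underlying exception instance.
    reason = e.reason
    print(type(reason).__name__, reason)
```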
@ -214,7 +214,7 @@ Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib2 will handle that for you). For those it can't handle,
a different URL, urllib.request will handle that for you). For those it can't handle,
urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).
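A minimal sketch of what an ``HTTPError`` looks like to calling code. To stay offline, the exception is constructed by hand; the URL, status and body are invented for illustration and do not come from a real server:

```python
import io
import urllib.error
from email.message import Message

# Build an HTTPError manually: (url, code, msg, headers, file-like body).
body = io.BytesIO(b'<html>Not Found</html>')
err = urllib.error.HTTPError('http://www.python.org/fish.html',
                             404, 'Not Found', Message(), body)

try:
    raise err
except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
    code = e.code      # numeric status code
    page = e.read()    # the error page, read like a normal response
    e.close()

print(code, page)
```

Because ``HTTPError`` subclasses ``URLError``, an ``except HTTPError`` clause has to come before ``except URLError`` if the two are handled differently.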
@ -305,12 +305,12 @@ dictionary is reproduced here for convenience ::
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the ``HTTPError`` instance as a response on the
page returned. This means that as well as the code attribute, it also has read,
geturl, and info, methods. ::
geturl, and info methods, as provided by the ``urllib.response`` module::
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> req = urllib.request.Request('http://www.python.org/fish.html')
>>> try:
>>> urllib2.urlopen(req)
>>> except URLError, e:
>>> urllib.request.urlopen(req)
>>> except urllib.error.URLError as e:
>>> print(e.code)
>>> print(e.read())
>>>
@ -334,7 +334,8 @@ Number 1
::
from urllib2 import Request, urlopen, URLError, HTTPError
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request(someurl)
try:
response = urlopen(req)
@ -358,7 +359,8 @@ Number 2
::
from urllib2 import Request, urlopen, URLError
from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request(someurl)
try:
response = urlopen(req)
@ -377,7 +379,8 @@ info and geturl
===============
The response returned by urlopen (or the ``HTTPError`` instance) has two useful
methods ``info`` and ``geturl``.
methods, ``info`` and ``geturl``, and is defined in the module
``urllib.response``.
**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
@ -397,7 +400,7 @@ Openers and Handlers
====================
When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
@ -466,24 +469,24 @@ The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of ``None``.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
opener.open(a_url)
# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
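The claim that "deeper" URLs also match can be checked directly on the password manager, without contacting a server. The credentials and URLs below are placeholders:

```python
import urllib.request

username, password = 'user', 'secret'  # placeholder credentials

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

# A URL below the top-level URL matches, whatever realm the server names.
deeper = password_mgr.find_user_password(
    'some realm', 'http://example.com/foo/bar/page.html')

# A URL outside that prefix does not match.
elsewhere = password_mgr.find_user_password(
    'some realm', 'http://example.com/other/')

print(deeper)     # ('user', 'secret')
print(elsewhere)  # (None, None)
```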
.. note::
@ -505,46 +508,46 @@ not correct.
Proxies
=======
**urllib2** will auto-detect your proxy settings and use those. This is through
**urllib.request** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
a good thing, but there are occasions when it may not be helpful [#]_. One way
to do this is to set up our own ``ProxyHandler``, with no proxies defined. This
is done using similar steps to setting up a `Basic Authentication`_ handler : ::
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
>>> proxy_support = urllib.request.ProxyHandler({})
>>> opener = urllib.request.build_opener(proxy_support)
>>> urllib.request.install_opener(opener)
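For comparison, a short sketch of the two ways a ``ProxyHandler`` is typically configured: an empty mapping to bypass auto-detection, and an explicit mapping. The proxy address is hypothetical, and nothing is installed globally here:

```python
import urllib.request

# An empty mapping: no proxies are used, and auto-detection is skipped.
proxy_support = urllib.request.ProxyHandler({})
opener = urllib.request.build_opener(proxy_support)
print(proxy_support.proxies)  # {}

# An explicit mapping from scheme to proxy URL (made-up address).
explicit = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:3128'})
print(explicit.proxies)
```

Using ``opener.open(url)`` directly, rather than ``install_opener``, keeps the change local to one call site.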
.. note::
Currently ``urllib2`` *does not* support fetching of ``https`` locations
through a proxy. However, this can be enabled by extending urllib2 as
Currently ``urllib.request`` *does not* support fetching of ``https`` locations
through a proxy. However, this can be enabled by extending urllib.request as
shown in the recipe [#]_.
Sockets and Layers
==================
The Python support for fetching resources from the web is layered. urllib2 uses
the http.client library, which in turn uses the socket library.
The Python support for fetching resources from the web is layered.
urllib.request uses the http.client library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the http.client or urllib2 levels.
the socket timeout is not exposed at the http.client or urllib.request levels.
However, you can set the default timeout globally for all sockets using ::
import socket
import urllib2
import urllib.request
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib2.urlopen now uses the default timeout
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
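Since ``socket.setdefaulttimeout`` affects every socket created afterwards in the process, a cautious sketch saves the previous value and restores it when done. No connection is made here:

```python
import socket

# Remember the current process-wide default (None means "no timeout").
previous = socket.getdefaulttimeout()

socket.setdefaulttimeout(10)
current = socket.getdefaulttimeout()
print(current)  # 10.0

# ... calls to urllib.request.urlopen() would go here ...

# Restore the earlier setting so other code is unaffected.
socket.setdefaulttimeout(previous)
```

The reference documentation changed in this same commit lists an optional ``timeout`` argument to ``urlopen``, which may be a more targeted alternative where it is available.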
-------


@ -98,9 +98,9 @@ Functions provided:
And lets you write code like this::
from contextlib import closing
import urllib
import urllib.request
with closing(urllib.urlopen('http://www.python.org')) as page:
with closing(urllib.request.urlopen('http://www.python.org')) as page:
for line in page:
print(line)
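The same ``closing`` pattern can be exercised offline. This sketch assumes a temporary local file opened through a ``file:`` URL; the file name and contents are made up:

```python
import tempfile
import urllib.request
from contextlib import closing
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_bytes(b'first line\nsecond line\n')

    # closing() guarantees page.close() is called when the block exits.
    with closing(urllib.request.urlopen(sample.as_uri())) as page:
        lines = [line for line in page]

print(lines)  # the lines are bytes objects
```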


@ -13,7 +13,6 @@ that aren't markup languages or are related to e-mail.
csv.rst
configparser.rst
robotparser.rst
netrc.rst
xdrlib.rst
plistlib.rst


@ -13,9 +13,9 @@
This module defines the class :class:`FTP` and a few related items. The
:class:`FTP` class implements the client side of the FTP protocol. You can use
this to write Python programs that perform a variety of automated FTP jobs, such
as mirroring other ftp servers. It is also used by the module :mod:`urllib` to
handle URLs that use FTP. For more information on FTP (File Transfer Protocol),
see Internet :rfc:`959`.
as mirroring other ftp servers. It is also used by the module
:mod:`urllib.request` to handle URLs that use FTP. For more information on FTP
(File Transfer Protocol), see Internet :rfc:`959`.
Here's a sample session using the :mod:`ftplib` module::


@ -9,10 +9,11 @@
pair: HTTP; protocol
single: HTTP; http.client (standard module)
.. index:: module: urllib
.. index:: module: urllib.request
This module defines classes which implement the client side of the HTTP and
HTTPS protocols. It is normally not used directly --- the module :mod:`urllib`
HTTPS protocols. It is normally not used directly --- the module
:mod:`urllib.request`
uses it to handle URLs that use HTTP and HTTPS.
.. note::
@ -484,8 +485,8 @@ Here is an example session that uses the ``GET`` method::
Here is an example session that shows how to ``POST`` requests::
>>> import http.client, urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> import http.client, urllib.parse
>>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> headers = {"Content-type": "application/x-www-form-urlencoded",
... "Accept": "text/plain"}
>>> conn = http.client.HTTPConnection("musi-cal.mojam.com:80")


@ -24,8 +24,10 @@ is currently supported on most popular platforms. Here is an overview:
cgi.rst
cgitb.rst
wsgiref.rst
urllib.rst
urllib2.rst
urllib.request.rst
urllib.parse.rst
urllib.error.rst
urllib.robotparser.rst
http.client.rst
ftplib.rst
poplib.rst
@ -35,7 +37,6 @@ is currently supported on most popular platforms. Here is an overview:
smtpd.rst
telnetlib.rst
uuid.rst
urlparse.rst
socketserver.rst
http.server.rst
http.cookies.rst


@ -0,0 +1,48 @@
:mod:`urllib.error` --- Exception classes raised by urllib.request
==================================================================
.. module:: urllib.error
:synopsis: Exception classes raised by urllib.request.
.. moduleauthor:: Jeremy Hylton <jhylton@users.sourceforge.net>
.. sectionauthor:: Senthil Kumaran <orsenthil@gmail.com>
The :mod:`urllib.error` module defines exception classes raised by
urllib.request. The base exception class is URLError, which inherits from
IOError.
The following exceptions are defined in :mod:`urllib.error` and raised as appropriate:
.. exception:: URLError
The handlers raise this exception (or derived exceptions) when they run into a
problem. It is a subclass of :exc:`IOError`.
.. attribute:: reason
The reason for this error. It can be a message string or another exception
instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local
URLs).
.. exception:: HTTPError
Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError`
can also function as a non-exceptional file-like return value (the same thing
that :func:`urlopen` returns). This is useful when handling exotic HTTP
errors, such as requests for authentication.
.. attribute:: code
An HTTP status code as defined in `RFC 2616 <http://www.faqs.org/rfcs/rfc2616.html>`_.
This numeric value corresponds to a value found in the dictionary of
codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`.
.. exception:: ContentTooShortError(msg[, content])
This exception is raised when the :func:`urlretrieve` function detects that the
amount of the downloaded data is less than the expected amount (given by the
*Content-Length* header). The :attr:`content` attribute stores the downloaded
(and supposedly truncated) data.


@ -1,7 +1,7 @@
:mod:`urlparse` --- Parse URLs into components
==============================================
:mod:`urllib.parse` --- Parse URLs into components
==================================================
.. module:: urlparse
.. module:: urllib.parse
:synopsis: Parse URLs into or assemble them from components.
@ -24,7 +24,7 @@ following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
The :mod:`urlparse` module defines the following functions:
The :mod:`urllib.parse` module defines the following functions:
.. function:: urlparse(urlstring[, default_scheme[, allow_fragments]])
@ -37,7 +37,7 @@ The :mod:`urlparse` module defines the following functions:
result, except for a leading slash in the *path* component, which is retained if
present. For example:
>>> from urlparse import urlparse
>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o # doctest: +NORMALIZE_WHITESPACE
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
@ -154,7 +154,7 @@ The :mod:`urlparse` module defines the following functions:
particular the addressing scheme, the network location and (part of) the path,
to provide missing components in the relative URL. For example:
>>> from urlparse import urljoin
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
@ -183,6 +183,52 @@ The :mod:`urlparse` module defines the following functions:
If there is no fragment identifier in *url*, returns *url* unmodified and an
empty string.
.. function:: quote(string[, safe])
Replace special characters in *string* using the ``%xx`` escape. Letters,
digits, and the characters ``'_.-'`` are never quoted. The optional *safe*
parameter specifies additional characters that should not be quoted --- its
default value is ``'/'``.
Example: ``quote('/~connolly/')`` yields ``'/%7econnolly/'``.
.. function:: quote_plus(string[, safe])
Like :func:`quote`, but also replaces spaces by plus signs, as required for
quoting HTML form values. Plus signs in the original string are escaped unless
they are included in *safe*. It also does not have *safe* default to ``'/'``.
.. function:: unquote(string)
Replace ``%xx`` escapes by their single-character equivalent.
Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
.. function:: unquote_plus(string)
Like :func:`unquote`, but also replaces plus signs by spaces, as required for
unquoting HTML form values.
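A brief sketch contrasting the four functions on inputs whose results do not depend on the Python version. The sample strings are arbitrary:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

print(quote('a b/c'))            # a%20b/c   ('/' is safe by default)
print(quote_plus('a b/c'))       # a+b%2Fc   (space -> '+', '/' is escaped)
print(quote_plus('1+1=2'))       # 1%2B1%3D2 (a literal '+' is escaped)
print(unquote('/%7Econnolly/'))  # /~connolly/
print(unquote_plus('a+b%2Fc'))   # a b/c
```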
.. function:: urlencode(query[, doseq])
Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
string, suitable to pass to :func:`urllib.request.urlopen` as the optional *data*
argument. This is useful to pass a dictionary of form fields to a ``POST``
request. The resulting string is a series of ``key=value`` pairs separated by
``'&'`` characters, where both *key* and *value* are quoted using
:func:`quote_plus` above. If the optional parameter *doseq* is present and
evaluates to true, individual ``key=value`` pairs are generated for each element
of the sequence. When a sequence of two-element tuples is used as the *query*
argument, the first element of each tuple is a key and the second is a value.
The order of parameters in the encoded string will match the order of parameter
tuples in the sequence. The :mod:`cgi` module provides the functions
:func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
into Python data structures.
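The effect of *doseq* is easiest to see with a value that is itself a sequence. In this sketch the keys and values are invented; a sequence of tuples is used so the output order is fixed:

```python
from urllib.parse import urlencode

pairs = [('tag', ['python', 'docs']), ('q', 'url lib')]

# With doseq true, each element of a sequence value gets its own key=value pair.
encoded = urlencode(pairs, doseq=True)
print(encoded)  # tag=python&tag=docs&q=url+lib
```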
.. seealso::
@ -219,14 +265,14 @@ described in those functions, as well as provide an additional method:
The result of this method is a fixpoint if passed back through the original
parsing function:
>>> import urlparse
>>> import urllib.parse
>>> url = 'HTTP://www.Python.org/doc/#'
>>> r1 = urlparse.urlsplit(url)
>>> r1 = urllib.parse.urlsplit(url)
>>> r1.geturl()
'http://www.Python.org/doc/'
>>> r2 = urlparse.urlsplit(r1.geturl())
>>> r2 = urllib.parse.urlsplit(r1.geturl())
>>> r2.geturl()
'http://www.Python.org/doc/'


@ -1,17 +1,17 @@
:mod:`urllib2` --- extensible library for opening URLs
======================================================
:mod:`urllib.request` --- extensible library for opening URLs
=============================================================
.. module:: urllib2
.. module:: urllib.request
:synopsis: Next generation URL opening library.
.. moduleauthor:: Jeremy Hylton <jhylton@users.sourceforge.net>
.. sectionauthor:: Moshe Zadka <moshez@users.sourceforge.net>
The :mod:`urllib2` module defines functions and classes which help in opening
The :mod:`urllib.request` module defines functions and classes which help in opening
URLs (mostly HTTP) in a complex world --- basic and digest authentication,
redirections, cookies and more.
The :mod:`urllib2` module defines the following functions:
The :mod:`urllib.request` module defines the following functions:
.. function:: urlopen(url[, data][, timeout])
@ -31,7 +31,8 @@ The :mod:`urllib2` module defines the following functions:
timeout setting will be used). This actually only works for HTTP, HTTPS,
FTP and FTPS connections.
This function returns a file-like object with two additional methods:
This function returns a file-like object with two additional methods from
the :mod:`urllib.response` module:
* :meth:`geturl` --- return the URL of the resource retrieved, commonly used to
determine if a redirect was followed
@ -45,6 +46,11 @@ The :mod:`urllib2` module defines the following functions:
Note that ``None`` may be returned if no handler handles the request (though the
default installed global :class:`OpenerDirector` uses :class:`UnknownHandler` to
ensure this never happens).
The ``urlopen`` function of the :mod:`urllib` module from Python 2.6 and
earlier has been discontinued; this :func:`urlopen` returns the same kind of
file-like object as before. Proxy handling, which was previously done by
passing a dict parameter to ``urlopen``, is now available through
:class:`ProxyHandler` objects.
.. function:: install_opener(opener)
@ -74,39 +80,87 @@ The :mod:`urllib2` module defines the following functions:
A :class:`BaseHandler` subclass may also change its :attr:`handler_order`
member variable to modify its position in the handlers list.
The following exceptions are raised as appropriate:
.. function:: urlretrieve(url[, filename[, reporthook[, data]]])
Copy a network object denoted by a URL to a local file, if necessary. If the URL
points to a local file, or a valid cached copy of the object exists, the object
is not copied. Return a tuple ``(filename, headers)`` where *filename* is the
local file name under which the object can be found, and *headers* is whatever
the :meth:`info` method of the object returned by :func:`urlopen` returned (for
a remote object, possibly cached). Exceptions are the same as for
:func:`urlopen`.
The second argument, if present, specifies the file location to copy to (if
absent, the location will be a tempfile with a generated name). The third
argument, if present, is a hook function that will be called once on
establishment of the network connection and once after each block read
thereafter. The hook will be passed three arguments: a count of blocks
transferred so far, a block size in bytes, and the total size of the file. The
third argument may be ``-1`` on older FTP servers which do not return a file
size in response to a retrieval request.
If the *url* uses the :file:`http:` scheme identifier, the optional *data*
argument may be given to specify a ``POST`` request (normally the request type
is ``GET``). The *data* argument must be in standard
:mimetype:`application/x-www-form-urlencoded` format; see the
:func:`urllib.parse.urlencode` function.
:func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
the amount of data available was less than the expected amount (which is the
size reported by a *Content-Length* header). This can occur, for example, when
the download is interrupted.
The *Content-Length* is treated as a lower bound: if there's more data to read,
urlretrieve reads more data, but if less data is available, it raises the
exception.
You can still retrieve the downloaded data in this case; it is stored in the
:attr:`content` attribute of the exception instance.
If no *Content-Length* header was supplied, urlretrieve cannot check the size
of the data it has downloaded, and just returns it. In this case you just have
to assume that the download was successful.
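A small offline sketch of :func:`urlretrieve` with a report hook, assuming a temporary local file served through a ``file:`` URL. The file names, the 1000-byte payload and the ``reporthook`` helper are all made up for illustration:

```python
import tempfile
import urllib.request
from pathlib import Path

calls = []

def reporthook(block_count, block_size, total_size):
    # Called once at the start and once after each block is read.
    calls.append((block_count, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.bin'
    target = Path(tmp) / 'copy.bin'
    source.write_bytes(b'x' * 1000)

    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), str(target), reporthook)
    copied = target.read_bytes()

print(len(copied), calls)
```

The first hook call reports a block count of zero; the total size is taken from the *Content-Length* header when one is present.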
.. exception:: URLError
.. data:: _urlopener
The handlers raise this exception (or derived exceptions) when they run into a
problem. It is a subclass of :exc:`IOError`.
The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
of the :class:`FancyURLopener` class and use it to perform their requested
actions. To override this functionality, programmers can create a subclass of
:class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
class to the ``urllib.request._urlopener`` variable before calling the desired function.
For example, applications may want to specify a different
:mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
accomplished with the following code::
.. attribute:: reason
import urllib.request
The reason for this error. It can be a message string or another exception
instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local
URLs).
class AppURLopener(urllib.request.FancyURLopener):
version = "App/1.7"
urllib.request._urlopener = AppURLopener()
.. exception:: HTTPError
.. function:: urlcleanup()
Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError`
can also function as a non-exceptional file-like return value (the same thing
that :func:`urlopen` returns). This is useful when handling exotic HTTP
errors, such as requests for authentication.
Clear the cache that may have been built up by previous calls to
:func:`urlretrieve`.
.. attribute:: code
.. function:: pathname2url(path)
An HTTP status code as defined in `RFC 2616 <http://www.faqs.org/rfcs/rfc2616.html>`_.
This numeric value corresponds to a value found in the dictionary of
codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`.
Convert the pathname *path* from the local syntax for a path to the form used in
the path component of a URL. This does not produce a complete URL. The return
value will already be quoted using the :func:`quote` function.
.. function:: url2pathname(path)
Convert the path component *path* from an encoded URL to the local syntax for a
path. This does not accept a complete URL. This function uses :func:`unquote`
to decode *path*.
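A round-trip sketch for the two path helpers. A relative path is used on purpose, on the assumption that the exact URL form produced for absolute paths can differ between Python versions and platforms; the path itself is invented:

```python
import os.path
from urllib.request import pathname2url, url2pathname

# Local-syntax path containing a character that needs quoting.
local_path = os.path.join('docs', 'my file.txt')

url_path = pathname2url(local_path)  # quoted, URL-style path component
restored = url2pathname(url_path)    # back to the local syntax

print(url_path)
print(restored)
```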
The following classes are provided:
.. class:: Request(url[, data][, headers][, origin_req_host][, unverifiable])
This class is an abstraction of a URL request.
@ -145,6 +199,114 @@ The following classes are provided:
an image in an HTML document, and the user had no option to approve the
automatic fetching of the image, this should be true.
.. class:: URLopener([proxies[, **x509]])
Base class for opening and reading URLs. Unless you need to support opening
objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
you probably want to use :class:`FancyURLopener`.
By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
Applications can define their own :mailheader:`User-Agent` header by subclassing
:class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
:attr:`version` to an appropriate string value in the subclass definition.
The optional *proxies* parameter should be a dictionary mapping scheme names to
proxy URLs, where an empty dictionary turns proxies off completely. Its default
value is ``None``, in which case environmental proxy settings will be used if
present, as discussed in the definition of :func:`urlopen`, above.
Additional keyword parameters, collected in *x509*, may be used for
authentication of the client when using the :file:`https:` scheme. The keywords
*key_file* and *cert_file* are supported to provide an SSL key and certificate;
both are needed to support client authentication.
:class:`URLopener` objects will raise an :exc:`IOError` exception if the server
returns an error code.
.. method:: open(fullurl[, data])
Open *fullurl* using the appropriate protocol. This method sets up cache and
proxy information, then calls the appropriate open method with its input
arguments. If the scheme is not recognized, :meth:`open_unknown` is called.
The *data* argument has the same meaning as the *data* argument of
:func:`urlopen`.
.. method:: open_unknown(fullurl[, data])
Overridable interface to open unknown URL types.
.. method:: retrieve(url[, filename[, reporthook[, data]]])
Retrieves the contents of *url* and places it in *filename*. The return value
is a tuple consisting of a local filename and either a
:class:`email.message.Message` object containing the response headers (for remote
URLs) or ``None`` (for local URLs). The caller must then open and read the
contents of *filename*. If *filename* is not given and the URL refers to a
local file, the input filename is returned. If the URL is non-local and
*filename* is not given, the filename is the output of :func:`tempfile.mktemp`
with a suffix that matches the suffix of the last path component of the input
URL. If *reporthook* is given, it must be a function accepting three numeric
parameters. It will be called after each chunk of data is read from the
network. *reporthook* is ignored for local URLs.
If the *url* uses the :file:`http:` scheme identifier, the optional *data*
argument may be given to specify a ``POST`` request (normally the request type
is ``GET``). The *data* argument must be in standard
:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
function below.
.. attribute:: version
Variable that specifies the user agent of the opener object. To get
:mod:`urllib` to tell servers that it is a particular user agent, set this in a
subclass as a class variable or in the constructor before calling the base
constructor.
.. class:: FancyURLopener(...)
:class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
response codes listed above, the :mailheader:`Location` header is used to fetch
the actual URL. For 401 response codes (authentication required), basic HTTP
authentication is performed. For the 30x response codes, recursion is bounded
by the value of the *maxtries* attribute, which defaults to 10.
For all other response codes, the method :meth:`http_error_default` is called
which you can override in subclasses to handle the error appropriately.
.. note::
According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
must not be automatically redirected without confirmation by the user. In
reality, browsers do allow automatic redirection of these responses, changing
the POST to a GET, and :mod:`urllib` reproduces this behaviour.
The parameters to the constructor are the same as those for :class:`URLopener`.
.. note::
When performing basic authentication, a :class:`FancyURLopener` instance calls
its :meth:`prompt_user_passwd` method. The default implementation asks the
users for the required information on the controlling terminal. A subclass may
override this method to support more appropriate behavior if needed.
The :class:`FancyURLopener` class offers one additional method that should be
overloaded to provide the appropriate behavior:
.. method:: prompt_user_passwd(host, realm)
Return information needed to authenticate the user at the given host in the
specified security realm. The return value should be a tuple, ``(user,
password)``, which can be used for basic authentication.
The implementation prompts for this information on the terminal; an application
should override this method to use an appropriate interaction model in the local
environment.
.. class:: OpenerDirector()
@ -846,7 +1008,6 @@ HTTPErrorProcessor Objects
Eventually, :class:`urllib.request.HTTPDefaultErrorHandler` will raise an
:exc:`HTTPError` if no other handler handles the error.
.. _urllib2-examples:
Examples
@ -855,8 +1016,8 @@ Examples
This example gets the python.org main page and displays the first 100 bytes of
it::
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(100))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html
@ -865,10 +1026,10 @@ Here we are sending a data-stream to the stdin of a CGI and reading the data it
returns to us. Note that this example will only work when the Python
installation supports SSL. ::
>>> import urllib2
>>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
>>> import urllib.request
>>> req = urllib.request.Request(url='https://localhost/cgi-bin/test.cgi',
... data='This data is passed to stdin of the CGI')
>>> f = urllib2.urlopen(req)
>>> f = urllib.request.urlopen(req)
>>> print(f.read())
Got Data: "This data is passed to stdin of the CGI"
@ -881,17 +1042,17 @@ The code for the sample CGI used in the above example is::
Use of Basic HTTP Authentication::
import urllib2
import urllib.request
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
uri='https://mahler:8092/site-updates.py',
user='klem',
passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')
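Credentials registered with :meth:`add_password` are only used for URLs at or below the registered URI. The following sketch, which performs no network access, shows how a password manager resolves them. It assumes a current Python 3 release; the password ``secret`` is a placeholder, and using :class:`HTTPPasswordMgrWithDefaultRealm` with a realm of ``None`` is one possible choice, not the only one.

```python
import urllib.request

# Hypothetical credentials; a realm of None acts as the default realm.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://mahler:8092/", "klem", "secret")

# Any URL below the registered one matches, whatever the realm.
match = password_mgr.find_user_password(
    "PDQ Application", "https://mahler:8092/site-updates.py")
# A different host has no stored credentials.
miss = password_mgr.find_user_password(
    "PDQ Application", "http://www.example.com/login.html")
print(match, miss)

# The manager can then be handed to the handler and opener as usual.
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)
```

Checking :meth:`find_user_password` this way can help confirm that the URI passed to :meth:`add_password` covers the URLs that will actually be opened.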
:func:`build_opener` provides many handlers by default, including a
:class:`ProxyHandler`. By default, :class:`ProxyHandler` uses the environment
@ -903,8 +1064,8 @@ This example replaces the default :class:`ProxyHandler` with one that uses
programmatically-supplied proxy URLs, and adds proxy authorization support with
:class:`ProxyBasicAuthHandler`. ::
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.HTTPBasicAuthHandler()
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
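To see what replacing the default :class:`ProxyHandler` amounts to, the sketch below builds an opener and inspects it without opening any URL. It assumes a current Python 3 release; the proxy address is hypothetical.

```python
import urllib.request

# Hypothetical proxy address, used only for illustration.
proxies = {"http": "http://proxy.example.com:3128/"}
proxy_handler = urllib.request.ProxyHandler(proxies)

# build_opener() skips its default ProxyHandler when one is supplied.
opener = urllib.request.build_opener(proxy_handler)
installed = [h for h in opener.handlers
             if isinstance(h, urllib.request.ProxyHandler)]
print(installed == [proxy_handler])

# An empty mapping means no proxies are configured on the handler.
no_proxy_handler = urllib.request.ProxyHandler({})
print(no_proxy_handler.proxies)
```

Passing the handler explicitly keeps the proxy choice independent of the environment variables.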
@ -915,16 +1076,16 @@ Adding HTTP headers:
Use the *headers* argument to the :class:`Request` constructor, or::
import urllib2
req = urllib2.Request('http://www.example.com/')
import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)
r = urllib.request.urlopen(req)
:class:`OpenerDirector` automatically adds a :mailheader:`User-Agent` header to
every :class:`Request`. To change this::
import urllib2
opener = urllib2.build_opener()
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
@ -932,3 +1093,102 @@ Also, remember that a few standard headers (:mailheader:`Content-Length`,
:mailheader:`Content-Type` and :mailheader:`Host`) are added when the
:class:`Request` is passed to :func:`urlopen` (or :meth:`OpenerDirector.open`).
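The two ways of adding headers can be compared without sending anything. This is a minimal sketch assuming a current Python 3 release; the ``Accept`` header is an extra chosen for illustration.

```python
import urllib.request

# Headers can be given to the constructor or added afterwards.
req = urllib.request.Request("http://www.example.com/",
                             headers={"Accept": "text/html"})
req.add_header("Referer", "http://www.python.org/")

# Header names are stored capitalized; values are kept as given.
referer = req.get_header("Referer")
items = sorted(req.header_items())
print(referer)
print(items)

# Opener-wide headers live in addheaders and apply to every request.
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
```

Headers set on the :class:`Request` apply to that request only, while ``addheaders`` affects everything opened through the opener.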
.. _urllib-examples:
Here is an example session that uses the ``GET`` method to retrieve a URL
containing parameters::
>>> import urllib.request
>>> import urllib.parse
>>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print(f.read())
The following example uses the ``POST`` method instead::
>>> import urllib.request
>>> import urllib.parse
>>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}).encode('ascii')
>>> f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print(f.read())
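The difference between the ``GET`` and ``POST`` sessions above can be checked offline by building :class:`Request` objects instead of opening them. This sketch assumes a current Python 3 release, where the request body must be bytes; the endpoint URL is hypothetical.

```python
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"spam": 1, "eggs": 2, "bacon": 0})
base = "http://www.example.com/cgi-bin/query"  # hypothetical endpoint

# GET: parameters travel in the URL, and no data is attached.
get_req = urllib.request.Request(base + "?" + params)
# POST: parameters travel in the body, which must be bytes.
post_req = urllib.request.Request(base, data=params.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

The presence of *data* is what switches the request method, so the same encoded string serves both cases.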
The following example uses an explicitly specified HTTP proxy, overriding
environment settings::
>>> import urllib.request
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.request.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()
The following example uses no proxies at all, overriding environment settings::
>>> import urllib.request
>>> opener = urllib.request.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()
:mod:`urllib.request` Restrictions
----------------------------------
.. index::
pair: HTTP; protocol
pair: FTP; protocol
* Currently, only the following protocols are supported: HTTP (versions 0.9 and
1.0), FTP, and local files.
* The caching feature of :func:`urlretrieve` has been disabled until I find the
time to hack proper processing of Expiration time headers.
* There should be a function to query whether a particular URL is in the cache.
* For backward compatibility, if a URL appears to point to a local file but the
file can't be opened, the URL is re-interpreted using the FTP protocol. This
can sometimes cause confusing error messages.
* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
long delays while waiting for a network connection to be set up. This means
that it is difficult to build an interactive Web client using these functions
without using threads.
.. index::
single: HTML
pair: HTTP; protocol
* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
returned by the server. This may be binary data (such as an image), plain text
or (for example) HTML. The HTTP protocol provides type information in the reply
header, which can be inspected by looking at the :mailheader:`Content-Type`
header. If the returned data is HTML, you can use the module
:mod:`html.parser` to parse it.
.. index:: single: FTP
* The code handling the FTP protocol cannot differentiate between a file and a
directory. This can lead to unexpected behavior when attempting to read a URL
that points to a file that is not accessible. If the URL ends in a ``/``, it is
assumed to refer to a directory and will be handled accordingly. But if an
attempt to read a file leads to a 550 error (meaning the URL cannot be found or
is not accessible, often for permission reasons), then the path is treated as a
directory in order to handle the case when a directory is specified by a URL but
the trailing ``/`` has been left off. This can cause misleading results when
you try to fetch a file whose read permissions make it inaccessible; the FTP
code will try to read it, fail with a 550 error, and then perform a directory
listing for the unreadable file. If fine-grained control is needed, consider
using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
*_urlopener* to meet your needs.
:mod:`urllib.response` --- Response classes used by urllib.
===========================================================
.. module:: urllib.response
:synopsis: Response classes used by urllib.
The :mod:`urllib.response` module defines functions and classes which provide a
minimal file-like interface, including ``read()`` and ``readline()``. The typical
response object is an ``addinfourl`` instance, which defines an ``info()`` method
that returns headers and a ``geturl()`` method that returns the URL.
Functions defined by this module are used internally by the
:mod:`urllib.request` module.
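The response interface can be explored without a network connection. The sketch below assumes a current Python 3 release, where :func:`urlopen` also accepts ``data:`` URLs; the URL itself is only an illustration.

```python
import urllib.request

# A data: URL keeps the sketch self-contained (no network access).
url = "data:text/plain,hello"
with urllib.request.urlopen(url) as response:
    body = response.read()         # file-like interface
    headers = response.info()      # message object holding the headers
    final_url = response.geturl()  # URL of the resource retrieved

print(body, headers.get_content_type(), final_url)
```

The same three calls apply to responses for ``http:`` and ``ftp:`` URLs, where ``geturl()`` may differ from the requested URL after a redirect.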


@ -0,0 +1,73 @@
:mod:`urllib.robotparser` --- Parser for robots.txt
====================================================
.. module:: urllib.robotparser
:synopsis: Loads a robots.txt file and answers questions about
fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>
.. index::
single: WWW
single: World Wide Web
single: URL
single: robots.txt
This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file. For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
.. class:: RobotFileParser()
This class provides a set of methods to read, parse and answer questions
about a single :file:`robots.txt` file.
.. method:: set_url(url)
Sets the URL referring to a :file:`robots.txt` file.
.. method:: read()
Reads the :file:`robots.txt` URL and feeds it to the parser.
.. method:: parse(lines)
Parses the *lines* argument.
.. method:: can_fetch(useragent, url)
Returns ``True`` if the *useragent* is allowed to fetch the *url*
according to the rules contained in the parsed :file:`robots.txt`
file.
.. method:: mtime()
Returns the time the ``robots.txt`` file was last fetched. This is
useful for long-running web spiders that need to check for new
``robots.txt`` files periodically.
.. method:: modified()
Sets the time the ``robots.txt`` file was last fetched to the current
time.
The following example demonstrates basic use of the RobotFileParser class. ::
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
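When the rules are already at hand, :meth:`parse` can be used instead of :meth:`set_url` and :meth:`read`. The sketch below assumes a current Python 3 release and uses a hypothetical two-line :file:`robots.txt`, so it needs no network access.

```python
import urllib.robotparser

# Hypothetical rules, supplied as lines instead of fetched with read().
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

blocked = rp.can_fetch(
    "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")
allowed = rp.can_fetch("*", "http://www.example.com/")
print(blocked, allowed)
```

Feeding the parser directly is also convenient for testing crawler logic against known rule sets.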


@ -1,459 +0,0 @@
:mod:`urllib` --- Open arbitrary resources by URL
=================================================
.. module:: urllib
:synopsis: Open an arbitrary network resource by URL (requires sockets).
.. index::
single: WWW
single: World Wide Web
single: URL
This module provides a high-level interface for fetching data across the World
Wide Web. In particular, the :func:`urlopen` function is similar to the
built-in function :func:`open`, but accepts Universal Resource Locators (URLs)
instead of filenames. Some restrictions apply --- it can only open URLs for
reading, and no seek operations are available.
High-level interface
--------------------
.. function:: urlopen(url[, data[, proxies]])
Open a network object denoted by a URL for reading. If the URL does not have a
scheme identifier, or if it has :file:`file:` as its scheme identifier, this
opens a local file (without universal newlines); otherwise it opens a socket to
a server somewhere on the network. If the connection cannot be made the
:exc:`IOError` exception is raised. If all went well, a file-like object is
returned. This supports the following methods: :meth:`read`, :meth:`readline`,
:meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and
:meth:`geturl`. It also has proper support for the :term:`iterator` protocol. One
caveat: the :meth:`read` method, if the size argument is omitted or negative,
may not read until the end of the data stream; there is no good way to determine
that the entire stream from a socket has been read in the general case.
Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
these methods have the same interface as for file objects --- see section
:ref:`bltin-file-objects` in this manual. (It is not a built-in file object,
however, so it can't be used at those few places where a true built-in file
object is required.)
The :meth:`info` method returns an instance of the class
:class:`email.message.Message` containing meta-information associated with
the URL. When the method is HTTP, these headers are those returned by the
server at the head of the retrieved HTML page (including Content-Length and
Content-Type). When the method is FTP, a Content-Length header will be
present if (as is now usual) the server passed back a file length in response
to the FTP retrieval request. A Content-Type header will be present if the
MIME type can be guessed. When the method is local-file, returned headers
will include a Date representing the file's last-modified time, a
Content-Length giving file size, and a Content-Type containing a guess at the
file's type.
The :meth:`geturl` method returns the real URL of the page. In some cases, the
HTTP server redirects a client to another URL. The :func:`urlopen` function
handles this transparently, but in some cases the caller needs to know which URL
the client was redirected to. The :meth:`geturl` method can be used to get at
this redirected URL.
The :meth:`getcode` method returns the HTTP status code that was sent with the
response, or ``None`` if the URL is not an HTTP URL.
If the *url* uses the :file:`http:` scheme identifier, the optional *data*
argument may be given to specify a ``POST`` request (normally the request type
is ``GET``). The *data* argument must be in standard
:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
function below.
The :func:`urlopen` function works transparently with proxies which do not
require authentication. In a Unix or Windows environment, set the
:envvar:`http_proxy`, or :envvar:`ftp_proxy` environment variables to a URL that
identifies the proxy server before starting the Python interpreter. For example
(the ``'%'`` is the command prompt)::
% http_proxy="http://www.someproxy.com:3128"
% export http_proxy
% python
...
The :envvar:`no_proxy` environment variable can be used to specify hosts which
shouldn't be reached via proxy; if set, it should be a comma-separated list
of hostname suffixes, optionally with ``:port`` appended, for example
``cern.ch,ncsa.uiuc.edu,some.host:8080``.
In a Windows environment, if no proxy environment variables are set, proxy
settings are obtained from the registry's Internet Settings section.
.. index:: single: Internet Config
In a Macintosh environment, :func:`urlopen` will retrieve proxy information from
Internet Config.
Alternatively, the optional *proxies* argument may be used to explicitly specify
proxies. It must be a dictionary mapping scheme names to proxy URLs, where an
empty dictionary causes no proxies to be used, and ``None`` (the default value)
causes environmental proxy settings to be used as discussed above. For
example::
# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)
Proxies which require authentication for use are not currently supported; this
is considered an implementation limitation.
.. function:: urlretrieve(url[, filename[, reporthook[, data]]])
Copy a network object denoted by a URL to a local file, if necessary. If the URL
points to a local file, or a valid cached copy of the object exists, the object
is not copied. Return a tuple ``(filename, headers)`` where *filename* is the
local file name under which the object can be found, and *headers* is whatever
the :meth:`info` method of the object returned by :func:`urlopen` returned (for
a remote object, possibly cached). Exceptions are the same as for
:func:`urlopen`.
The second argument, if present, specifies the file location to copy to (if
absent, the location will be a tempfile with a generated name). The third
argument, if present, is a hook function that will be called once on
establishment of the network connection and once after each block read
thereafter. The hook will be passed three arguments; a count of blocks
transferred so far, a block size in bytes, and the total size of the file. The
third argument may be ``-1`` on older FTP servers which do not return a file
size in response to a retrieval request.
If the *url* uses the :file:`http:` scheme identifier, the optional *data*
argument may be given to specify a ``POST`` request (normally the request type
is ``GET``). The *data* argument must be in standard
:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
function below.
:func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
the amount of data available was less than the expected amount (which is the
size reported by a *Content-Length* header). This can occur, for example, when
the download is interrupted.
The *Content-Length* is treated as a lower bound: if there's more data to read,
urlretrieve reads more data, but if less data is available, it raises the
exception.
You can still retrieve the downloaded data in this case; it is stored in the
:attr:`content` attribute of the exception instance.
If no *Content-Length* header was supplied, :func:`urlretrieve` cannot check the size
of the data it has downloaded, and just returns it. In this case you just have
to assume that the download was successful.
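The hook arguments described above can be observed with a local file. This is a minimal sketch assuming a current Python 3 release, where the function is available as ``urllib.request.urlretrieve`` and copies a ``file:`` URL when a target filename is given; the file names and the 20000-byte size are hypothetical.

```python
import tempfile
import urllib.request
from pathlib import Path

calls = []

def report(block_count, block_size, total_size):
    # Called once at the start, then once after each block read.
    calls.append((block_count, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"   # hypothetical file names
    target = Path(tmp) / "copy.bin"
    source.write_bytes(b"x" * 20000)

    # A file: URL keeps the sketch self-contained (no network access).
    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), str(target), report)
    copied = target.read_bytes()

print(calls[0], calls[-1])
```

A progress display would typically compute ``block_count * block_size`` against ``total_size``, taking care of the case where the total is unknown.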
.. data:: _urlopener
The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
of the :class:`FancyURLopener` class and use it to perform their requested
actions. To override this functionality, programmers can create a subclass of
:class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
class to the ``urllib._urlopener`` variable before calling the desired function.
For example, applications may want to specify a different
:mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
accomplished with the following code::
import urllib
class AppURLopener(urllib.FancyURLopener):
version = "App/1.7"
urllib._urlopener = AppURLopener()
.. function:: urlcleanup()
Clear the cache that may have been built up by previous calls to
:func:`urlretrieve`.
Utility functions
-----------------
.. function:: quote(string[, safe])
Replace special characters in *string* using the ``%xx`` escape. Letters,
digits, and the characters ``'_.-'`` are never quoted. The optional *safe*
parameter specifies additional characters that should not be quoted --- its
default value is ``'/'``.
Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.
.. function:: quote_plus(string[, safe])
Like :func:`quote`, but also replaces spaces by plus signs, as required for
quoting HTML form values. Plus signs in the original string are escaped unless
they are included in *safe*. It also does not have *safe* default to ``'/'``.
.. function:: unquote(string)
Replace ``%xx`` escapes by their single-character equivalent.
Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
.. function:: unquote_plus(string)
Like :func:`unquote`, but also replaces plus signs by spaces, as required for
unquoting HTML form values.
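The four quoting helpers can be compared side by side. The sketch below assumes a current Python 3 release, where they live in :mod:`urllib.parse`; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# '/' is safe by default, so only the space and '&' are escaped.
quoted = quote("/my docs/a&b.txt")
# With an empty safe set, '/' is escaped as well.
quoted_all = quote("/my docs/", safe="")
# quote_plus turns spaces into '+' and escapes literal '+' and '/'.
plus = quote_plus("a b+c/d")

print(quoted, quoted_all, plus)
print(unquote("/%7Econnolly/"), unquote_plus("a+b%2Bc"))
```

:func:`quote` suits path components, while :func:`quote_plus` is meant for form values in a query string.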
.. function:: urlencode(query[, doseq])
Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
string, suitable to pass to :func:`urlopen` above as the optional *data*
argument. This is useful to pass a dictionary of form fields to a ``POST``
request. The resulting string is a series of ``key=value`` pairs separated by
``'&'`` characters, where both *key* and *value* are quoted using
:func:`quote_plus` above. If the optional parameter *doseq* is present and
evaluates to true, individual ``key=value`` pairs are generated for each element
of the sequence. When a sequence of two-element tuples is used as the *query*
argument, the first element of each tuple is a key and the second is a value.
The order of parameters in the encoded string will match the order of parameter
tuples in the sequence. The :mod:`cgi` module provides the functions
:func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
into Python data structures.
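The effect of *doseq* is easiest to see with a value that is itself a sequence. This sketch assumes a current Python 3 release, where :func:`urlencode` and :func:`parse_qs` are both found in :mod:`urllib.parse`; the field names and values are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

# A sequence of pairs keeps the parameter order explicit.
query = [("name", "Guido van Rossum"), ("lang", ["py", "c"])]

flat = urlencode(query)               # the list is quoted as one value
multi = urlencode(query, doseq=True)  # one pair per list element
print(flat)
print(multi)

# parse_qs reverses the doseq form into a dict of lists.
parsed = parse_qs(multi)
print(parsed)
```

Without *doseq*, the string form of the list ends up in the query, which is rarely what a server expects.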
.. function:: pathname2url(path)
Convert the pathname *path* from the local syntax for a path to the form used in
the path component of a URL. This does not produce a complete URL. The return
value will already be quoted using the :func:`quote` function.
.. function:: url2pathname(path)
Convert the path component *path* from an encoded URL to the local syntax for a
path. This does not accept a complete URL. This function uses :func:`unquote`
to decode *path*.
URL Opener objects
------------------
.. class:: URLopener([proxies[, **x509]])
Base class for opening and reading URLs. Unless you need to support opening
objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
you probably want to use :class:`FancyURLopener`.
By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
Applications can define their own :mailheader:`User-Agent` header by subclassing
:class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
:attr:`version` to an appropriate string value in the subclass definition.
The optional *proxies* parameter should be a dictionary mapping scheme names to
proxy URLs, where an empty dictionary turns proxies off completely. Its default
value is ``None``, in which case environmental proxy settings will be used if
present, as discussed in the definition of :func:`urlopen`, above.
Additional keyword parameters, collected in *x509*, may be used for
authentication of the client when using the :file:`https:` scheme. The keywords
*key_file* and *cert_file* are supported to provide an SSL key and certificate;
both are needed to support client authentication.
:class:`URLopener` objects will raise an :exc:`IOError` exception if the server
returns an error code.
.. method:: open(fullurl[, data])
Open *fullurl* using the appropriate protocol. This method sets up cache and
proxy information, then calls the appropriate open method with its input
arguments. If the scheme is not recognized, :meth:`open_unknown` is called.
The *data* argument has the same meaning as the *data* argument of
:func:`urlopen`.
.. method:: open_unknown(fullurl[, data])
Overridable interface to open unknown URL types.
.. method:: retrieve(url[, filename[, reporthook[, data]]])
Retrieves the contents of *url* and places it in *filename*. The return value
is a tuple consisting of a local filename and either a
:class:`email.message.Message` object containing the response headers (for remote
URLs) or ``None`` (for local URLs). The caller must then open and read the
contents of *filename*. If *filename* is not given and the URL refers to a
local file, the input filename is returned. If the URL is non-local and
*filename* is not given, the filename is the output of :func:`tempfile.mktemp`
with a suffix that matches the suffix of the last path component of the input
URL. If *reporthook* is given, it must be a function accepting three numeric
parameters. It will be called after each chunk of data is read from the
network. *reporthook* is ignored for local URLs.
If the *url* uses the :file:`http:` scheme identifier, the optional *data*
argument may be given to specify a ``POST`` request (normally the request type
is ``GET``). The *data* argument must be in standard
:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
function below.
.. attribute:: version
Variable that specifies the user agent of the opener object. To get
:mod:`urllib` to tell servers that it is a particular user agent, set this in a
subclass as a class variable or in the constructor before calling the base
constructor.
.. class:: FancyURLopener(...)
:class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
response codes listed above, the :mailheader:`Location` header is used to fetch
the actual URL. For 401 response codes (authentication required), basic HTTP
authentication is performed. For the 30x response codes, recursion is bounded
by the value of the *maxtries* attribute, which defaults to 10.
For all other response codes, the method :meth:`http_error_default` is called
which you can override in subclasses to handle the error appropriately.
.. note::
According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
must not be automatically redirected without confirmation by the user. In
reality, browsers do allow automatic redirection of these responses, changing
the POST to a GET, and :mod:`urllib` reproduces this behaviour.
The parameters to the constructor are the same as those for :class:`URLopener`.
.. note::
When performing basic authentication, a :class:`FancyURLopener` instance calls
its :meth:`prompt_user_passwd` method. The default implementation asks the
users for the required information on the controlling terminal. A subclass may
override this method to support more appropriate behavior if needed.
The :class:`FancyURLopener` class offers one additional method that should be
overloaded to provide the appropriate behavior:
.. method:: prompt_user_passwd(host, realm)
Return information needed to authenticate the user at the given host in the
specified security realm. The return value should be a tuple, ``(user,
password)``, which can be used for basic authentication.
The implementation prompts for this information on the terminal; an application
should override this method to use an appropriate interaction model in the local
environment.
.. exception:: ContentTooShortError(msg[, content])
This exception is raised when the :func:`urlretrieve` function detects that the
amount of the downloaded data is less than the expected amount (given by the
*Content-Length* header). The :attr:`content` attribute stores the downloaded
(and supposedly truncated) data.
:mod:`urllib` Restrictions
--------------------------
.. index::
pair: HTTP; protocol
pair: FTP; protocol
* Currently, only the following protocols are supported: HTTP (versions 0.9 and
1.0), FTP, and local files.
* The caching feature of :func:`urlretrieve` has been disabled until I find the
time to hack proper processing of Expiration time headers.
* There should be a function to query whether a particular URL is in the cache.
* For backward compatibility, if a URL appears to point to a local file but the
file can't be opened, the URL is re-interpreted using the FTP protocol. This
can sometimes cause confusing error messages.
* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
long delays while waiting for a network connection to be set up. This means
that it is difficult to build an interactive Web client using these functions
without using threads.
.. index::
single: HTML
pair: HTTP; protocol
* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
returned by the server. This may be binary data (such as an image), plain text
or (for example) HTML. The HTTP protocol provides type information in the reply
header, which can be inspected by looking at the :mailheader:`Content-Type`
header. If the returned data is HTML, you can use the module
:mod:`html.parser` to parse it.
.. index:: single: FTP
* The code handling the FTP protocol cannot differentiate between a file and a
directory. This can lead to unexpected behavior when attempting to read a URL
that points to a file that is not accessible. If the URL ends in a ``/``, it is
assumed to refer to a directory and will be handled accordingly. But if an
attempt to read a file leads to a 550 error (meaning the URL cannot be found or
is not accessible, often for permission reasons), then the path is treated as a
directory in order to handle the case when a directory is specified by a URL but
the trailing ``/`` has been left off. This can cause misleading results when
you try to fetch a file whose read permissions make it inaccessible; the FTP
code will try to read it, fail with a 550 error, and then perform a directory
listing for the unreadable file. If fine-grained control is needed, consider
using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
*_urlopener* to meet your needs.
* This module does not support the use of proxies which require authentication.
This may be implemented in the future.
.. index:: module: urlparse
* Although the :mod:`urllib` module contains (undocumented) routines to parse
and unparse URL strings, the recommended interface for URL manipulation is in
module :mod:`urlparse`.
.. _urllib-examples:
Examples
--------
Here is an example session that uses the ``GET`` method to retrieve a URL
containing parameters::
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print(f.read())
The following example uses the ``POST`` method instead::
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print(f.read())
The following example uses an explicitly specified HTTP proxy, overriding
environment settings::
>>> import urllib
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()
The following example uses no proxies at all, overriding environment settings::
>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()


@ -147,11 +147,11 @@ Internet Access
===============
There are a number of modules for accessing the internet and processing internet
protocols. Two of the simplest are :mod:`urllib2` for retrieving data from urls
and :mod:`smtplib` for sending mail::
protocols. Two of the simplest are :mod:`urllib.request` for retrieving data
from URLs and :mod:`smtplib` for sending mail::
>>> import urllib2
>>> for line in urllib2.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
>>> import urllib.request
>>> for line in urllib.request.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
...     line = line.decode('utf-8')  # decode the binary data to text
...     if 'EST' in line or 'EDT' in line:  # look for Eastern Time
...         print(line)