Commit Graph

132 Commits

Author SHA1 Message Date
Serhiy Storchaka e237b25a4f
gh-67693: Fix urlunparse() and urlunsplit() for URIs with path starting with multiple slashes and no authority (GH-113563) 2024-05-14 12:24:37 +03:00
Serhiy Storchaka 1069a462f6
gh-116764: Fix regressions in urllib.parse.parse_qsl() (GH-116801)
* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in gh-74668
(bdba8ef42b).
2024-03-16 12:36:05 +02:00
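A minimal sketch of the behaviour this fix describes, assuming a Python release that includes it. The variable names are illustrative only; `AttributeError` is caught as well because releases that predate the change are assumed to fail differently on integer input.

```python
from urllib.parse import parse_qsl

# Falsy inputs (None, empty str, empty bytes) are treated as an empty query.
empty_results = [parse_qsl(value) for value in (None, "", b"")]

# A non-zero integer is not a valid query string and should be rejected.
# The commit message names TypeError; AttributeError is an assumption
# covering older releases.
try:
    parse_qsl(1)
    rejected = False
except (TypeError, AttributeError):
    rejected = True
```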
Serhiy Storchaka bdba8ef42b
gh-74668: Fix support of bytes in urllib.parse.parse_qsl() (GH-115771)
urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
2024-03-05 17:49:50 +02:00
zentarim f3266c05b6
GH-104554: Add RTSPS support to `urllib/parse.py` (#104605)
* GH-104554: Add RTSPS support to `urllib/parse.py`

RTSPS is the permanent scheme defined in
https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml
alongside RTSP and RTSPU schemes.

* 📜🤖 Added by blurb_it.

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
2023-06-13 16:45:47 -07:00
Illia Volochii 2f630e1ce1
gh-102153: Start stripping C0 control and space chars in `urlsplit` (#102508)
`urllib.parse.urlsplit` already partially follows the WHATWG spec (#25595).

This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/#url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329).

---------

Co-authored-by: Gregory P. Smith [Google] <greg@krypto.org>
2023-05-17 01:49:20 -07:00
JohnJamesUtley 29f348e232
gh-103848: Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format (#103849)
* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

Co-authored-by: Gregory P. Smith <greg@krypto.org>
2023-05-10 00:18:35 +00:00
Gregory P. Smith 82f789be3b
gh-104139: Add itms-services to uses_netloc urllib.parse. (#104312)
Teach unsplit to retain the `"//"` when assembling `itms-services://?action=generate-bugs` style
[Apple Platform Deployment](https://support.apple.com/en-gb/guide/deployment/depce7cefc4d/web) URLs.
2023-05-09 07:04:50 -07:00
Gregory P. Smith 2e279e85fe
gh-88500: Reduce memory use of `urllib.unquote` (#96763)
`urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for unquote and unquote_to_bytes, with no difference for typical inputs that are short or lack much Unicode or % escaping. The functions are already quite fast, so this is not a big deal. The slowdown scales linearly with input size, as expected.

Memory usage observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unittesting memory consumption is difficult and does not seem worthwhile.

Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
2022-12-10 16:17:39 -08:00
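The commit measured memory with `/usr/bin/time -v`. As an alternative sketch (not the method used in the commit), peak allocation can be observed in-process with `tracemalloc`. The repeat count `n` is an arbitrary assumption, chosen smaller than the commit's 500_000 to keep the run short.

```python
import tracemalloc
from urllib.parse import unquote_to_bytes

n = 10_000  # assumed repeat count, much smaller than in the commit message
v = "\u0141%01\u0161%20" * n

tracemalloc.start()
result = unquote_to_bytes(v)
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Each repeated unit decodes to 6 bytes: two 2-byte UTF-8 characters
# plus the bytes 0x01 and 0x20 from the percent escapes.
expected_length = 6 * n
```

Comparing `peak_bytes` across Python versions gives a rough view of the reduction; absolute numbers depend on the interpreter build.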
Ben Kallus 439b9cfaf4
gh-99418: Make urllib.parse.urlparse enforce that a scheme must begin with an alphabetical ASCII character. (#99421)
Prevent urllib.parse.urlparse from accepting schemes that don't begin with an alphabetical ASCII character.

RFC 3986 defines a scheme like this: `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`
RFC 2234 defines an ALPHA like this: `ALPHA = %x41-5A / %x61-7A`

The WHATWG URL spec defines a scheme like this:
`"A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.)."`
2022-11-13 10:25:55 -08:00
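The RFC 3986 grammar quoted above can be written as a regular expression. This is a standalone sketch, not the code in `urllib.parse`; the helper name `is_valid_scheme` is hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), with ALPHA limited to ASCII.
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme):
    """Return True if the whole string matches the RFC 3986 scheme grammar."""
    return _SCHEME_RE.fullmatch(scheme) is not None
```

With this helper, `is_valid_scheme("svn+ssh")` holds, while `"1http"` and an empty string are rejected because the first character must be an ASCII letter.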
Ben Kallus 6f15ca8c7a
gh-96035: Make urllib.parse.urlparse reject non-numeric ports (#98273)
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
2022-10-20 14:00:56 -07:00
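A small sketch of how a caller might observe port validation. The helper name `has_valid_port` is hypothetical, and the inputs are chosen so that the outcome does not depend on exactly where the `ValueError` is raised.

```python
from urllib.parse import urlsplit


def has_valid_port(url):
    """Return False if parsing the URL or reading its port raises ValueError."""
    try:
        urlsplit(url).port
    except ValueError:
        return False
    return True


numeric = has_valid_port("http://example.com:8080/")
non_numeric = has_valid_port("http://example.com:80x/")
out_of_range = has_valid_port("http://example.com:99999/")
```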
Gregory P. Smith e61ca22431
gh-95865: Further reduce quote_from_bytes memory consumption (#96860)
on large input values.  Based on Dennis Sweeney's chunking idea.
2022-09-19 16:06:25 -07:00
Dennis Sweeney 8ba22b90ca
gh-95865: Speed up urllib.parse.quote_from_bytes() (GH-95872) 2022-08-30 21:39:51 -04:00
Victor Stinner 259dd71c32
gh-84623: Remove unused imports in stdlib (#93773) 2022-06-13 16:28:41 +02:00
Oleg Iarygin a03a09e068
Replace with_traceback() with exception chaining and reraising (GH-32074) 2022-03-30 15:28:20 +03:00
Christian Sattler e6fe10d340
bpo-45874: Handle empty query string correctly in urllib.parse.parse_qsl (#29716) 2021-12-12 10:41:12 +02:00
Gregory P. Smith d597fdc5fd
bpo-44002: Switch to lru_cache in urllib.parse. (GH-25798)
Switch to lru_cache in urllib.parse.

urllib.parse now uses functools.lru_cache for its internal URL splitting and
quoting caches instead of rolling its own like it's the 90s.

The undocumented internal Quoter class API is now deprecated
as it had no reason to be public and no existing OSS users were found.

The clear_cache() API remains undocumented but gets an explicit test as it
is used in a few projects' (twisted, gevent) tests as well as our own regrtest.
2021-05-11 17:01:44 -07:00
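The general pattern can be sketched as follows. `split_host` is a hypothetical stand-in and not a function from `urllib.parse`; only `functools.lru_cache` and the undocumented `urllib.parse.clear_cache()` are real APIs here.

```python
import functools
import urllib.parse


@functools.lru_cache(maxsize=128)
def split_host(netloc):
    """Hypothetical cached helper: split 'host:port' into (host, port)."""
    host, _, port = netloc.partition(":")
    return host, port


split_host("example.com:80")
split_host("example.com:80")  # second call is served from the cache
hits = split_host.cache_info().hits

split_host.cache_clear()    # reset the hypothetical helper's cache
urllib.parse.clear_cache()  # reset urllib.parse's own internal caches
```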
Senthil Kumaran 985ac01637
bpo-43882 Remove the newline, and tab early. From query and fragments. (GH-25921) 2021-05-05 15:50:05 -07:00
Dong-hee Na 6143fcdf8b
bpo-43979: Remove unnecessary operation from urllib.parse.parse_qsl (GH-25756)
Automerge-Triggered-By: GH:gpshead
2021-04-30 12:01:55 -07:00
Senthil Kumaran 76cd81d603
bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595)
* issue43882 - urllib.parse should sanitize urls containing ASCII newline and tabs.

Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2021-04-29 10:16:50 -07:00
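A brief illustration of the sanitizing, assuming a Python release that includes it. The URL is a made-up example with a newline, a tab, and a carriage return embedded.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://exam\nple.com/pa\tth?q=1\r")

netloc = parts.netloc
path = parts.path
query = parts.query
```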
Ken Jin b38601d496
bpo-42967: coerce bytes separator to string in urllib.parse_qs(l) (#24818)
* coerce bytes separator to string

* Add news

* Update Misc/NEWS.d/next/Library/2021-03-11-00-31-41.bpo-42967.2PeQRw.rst
2021-04-11 06:26:09 -07:00
Ken Jin a2f0654b0a
bpo-42967: Fix urllib.parse docs and make logic clearer (GH-24536) 2021-02-15 09:00:20 -08:00
Adam Goldschmidt fcbe0cb04d
bpo-42967: only use '&' as a query string separator (#24297)
bpo-42967: [security] Address a web cache-poisoning issue reported in urllib.parse.parse_qsl().

urllib.parse will only use "&" as query string separator by default instead of both ";" and "&" as allowed in earlier versions. An optional argument separator with default value "&" is added to specify the separator.


Co-authored-by: Éric Araujo <merwok@netwok.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
2021-02-14 14:41:57 -08:00
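A short sketch of the default and the opt-in behaviour, assuming a release that has the `separator` argument. The query string is illustrative.

```python
from urllib.parse import parse_qsl

query = "a=1;b=2"

# Default: only "&" separates fields, so ";" stays inside the value.
default_pairs = parse_qsl(query)

# Opting in to ";" as the separator restores field splitting on it.
semicolon_pairs = parse_qsl(query, separator=";")
```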
Batuhan Taşkaya 0361556537
bpo-39481: PEP 585 for a variety of modules (GH-19423)
- concurrent.futures
- ctypes
- http.cookies
- multiprocessing
- queue
- tempfile
- unittest.case
- urllib.parse
2020-04-10 07:46:36 -07:00
idomic c33bdbb20c
bpo-37970: update and improve urlparse and urlsplit doc-strings (GH-16458) 2020-02-16 21:17:58 +02:00
Serhiy Storchaka 6a265f0d0c
bpo-39057: Fix urllib.request.proxy_bypass_environment(). (GH-17619)
Ignore leading dots and no longer ignore a trailing newline.
2020-01-05 14:14:31 +02:00
Tim Graham 5a88d50ff0 bpo-27657: Fix urlparse() with numeric paths (#661)
* bpo-27657: Fix urlparse() with numeric paths

Revert parsing decision from bpo-754016 in favor of the documented
consensus in bpo-16932 of how to treat strings without a // to
designate the netloc.

* bpo-22891: Remove urlsplit() optimization for 'http' prefixed inputs.
2019-10-18 06:07:20 -07:00
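A sketch of the resulting behaviour for a string without `//`, assuming a release where this change is in effect. The text before the colon is taken as the scheme and the digits as the path.

```python
from urllib.parse import urlparse

result = urlparse("path:80")

scheme = result.scheme
netloc = result.netloc
path = result.path
```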
Stein Karlsen aad2ee0156 bpo-32498: urllib.parse.unquote also accepts bytes (GH-7768) 2019-10-14 13:36:29 +03:00
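For illustration, assuming a release with this change, `unquote` accepts either type and returns `str` in both cases:

```python
from urllib.parse import unquote

from_str = unquote("a%20b")
from_bytes = unquote(b"a%20b")  # bytes input is decoded to str
```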
Steve Dower 8d0ef0b5ed bpo-36742: Corrects fix to handle decomposition in usernames (#13812) 2019-06-04 17:55:29 +02:00
Rémi Lapeyre 674ee12600 bpo-35397: Remove deprecation and document urllib.parse.unwrap (GH-11481) 2019-05-27 09:43:45 -04:00
Steve Dower d537ab0ff9
bpo-36742: Fixes handling of pre-normalization characters in urlsplit() (GH-13017) 2019-04-30 12:03:02 +00:00
Jörn Hees 750d74fac5 bpo-12910: update and correct quote docstring (#2568)
Fixes some mistakes and misleading statements in the quote function docstring:
- reserved chars are never actually used by quote code, unreserved chars are
- reserved chars were wrong and incomplete
- mentioned that use-case is not minimal quoting wrt. RFC, but cautious quoting
2019-04-09 17:31:18 -07:00
Steve Dower 16e6f7dee7
bpo-36216: Add check for characters in netloc that normalize to separators (GH-12201) 2019-03-07 08:02:26 -08:00
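A hedged sketch of the kind of input this check targets. U+FF03 (FULLWIDTH NUMBER SIGN) normalizes to `#` under NFKC; the host names are made up.

```python
import unicodedata
from urllib.parse import urlsplit

# The fullwidth character looks harmless but becomes "#" after NFKC.
normalized = unicodedata.normalize("NFKC", "\uff03")

try:
    urlsplit("http://example.com\uff03@evil.example/")
    raised = False
except ValueError:
    raised = True
```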
matthewbelisle-wf 209144831b bpo-34866: Adding max_num_fields to cgi.FieldStorage (GH-9660)
Adding `max_num_fields` to `cgi.FieldStorage` to make DOS attacks harder by
limiting the number of `MiniFieldStorage` objects created by `FieldStorage`.
2018-10-19 03:52:59 -07:00
Cheryl Sabella 0250de4819 bpo-27485: Rename and deprecate undocumented functions in urllib.parse (GH-2205) 2018-04-25 16:51:54 -07:00
Matt Eaton 2cb4661707 bpo-33034: Improve exception message when cast fails for {Parse,Split}Result.port (GH-6078) 2018-03-20 09:41:37 +03:00
Коренберг Марк fbd605151f bpo-32323: urllib.parse.urlsplit() must not lowercase() IPv6 scope value (#4867) 2017-12-21 14:16:17 +02:00
Oren Milman 8df44ee8e0 remove a redundant lower in urllib.parse.urlsplit (#3008) 2017-09-02 21:51:39 -07:00
postmasters 90e01e50ef urllib: Simplify splithost by calling into urlparse. (#1849)
The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
2017-06-20 15:02:44 +02:00
Senthil Kumaran 906f5330b9 bpo-29976: urllib.parse clarify '' in scheme values. (GH-984) 2017-05-17 21:48:59 -07:00
Senthil Kumaran 257b980b31 correct parse_qs and parse_qsl test case descriptions. (#968)
* correct parse_qs and parse_qsl test case descriptions.
2017-04-04 21:19:43 -07:00
Ratnadeep Debnath 21024f0662 bpo-16285: Update urllib quoting to RFC 3986 (#173)
* bpo-16285: Update urllib quoting to RFC 3986

urllib.parse.quote is now based on RFC 3986, and hence
includes `'~'` in the set of characters that is not escaped
by default.

Patch by Christian Theune and Ratnadeep Debnath.
2017-02-25 19:00:28 +10:00
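For example, assuming a release with this change, a tilde passes through `quote` unescaped while a space is still percent-encoded:

```python
from urllib.parse import quote

quoted = quote("/~user/a b")
```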
Serhiy Storchaka 8cbd3df3ce Issue #28992: Use bytes.fromhex(). 2016-12-21 12:59:28 +02:00
Berker Peksag f8479eeb34 Issue #25895: Merge from 3.5 2016-09-16 14:45:15 +03:00
Berker Peksag f676748a05 Issue #25895: Enable WebSocket URL schemes in urllib.parse.urljoin
Patch by Gergely Imreh and Markus Holtermann.
2016-09-16 14:43:58 +03:00
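A minimal sketch of relative resolution against WebSocket base URLs; the host and paths are illustrative.

```python
from urllib.parse import urljoin

ws_joined = urljoin("ws://example.com/chat/room", "other")
wss_joined = urljoin("wss://example.com/chat/room", "/status")
```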
Senthil Kumaran 0b57f0adde merge from 3.5
Remove unnecessary test case comment in urllib.parse.py. These are asserted as test cases.
2016-01-25 18:54:37 -08:00
Senthil Kumaran d4e51f45a9 Remove unnecessary test case comment in urllib.parse.py. These are asserted as test cases. 2016-01-25 18:53:34 -08:00
Senthil Kumaran 86f7109dad Issue #25822: Add docstrings to the fields of urllib.parse results.
Patch contributed by Swati Jaiswal.
2016-01-14 00:11:39 -08:00
Robert Collins dfa95c9a8f Issue #20059: urllib.parse raises ValueError on all invalid ports.
Patch by Martin Panter.
2015-08-10 09:53:30 +12:00
R David Murray c17686f071 Issue #13866: add *quote_via* argument to urlencode.
Patch by samwyse, completed by Arnon Yaari, and reviewed by
Martin Panter.
2015-05-17 20:44:50 -04:00
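A short sketch of the argument in use. By default `urlencode` quotes with `quote_plus`; passing `quote` encodes spaces as `%20` instead of `+`.

```python
from urllib.parse import quote, urlencode

params = {"q": "a b"}

default_encoded = urlencode(params)
percent_encoded = urlencode(params, quote_via=quote)
```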
Berker Peksag 20416f7994 Issue #23703: Fix a regression in urljoin() introduced in 901e4e52b20a.
Patch by Demian Brecht.
2015-04-16 02:31:14 +03:00