The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.
Implement the standard's algorithm. This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
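
As a rough illustration of the equivalence described above (a sketch, not part of the patch; the helper name `is_normalized_slow` is made up for this example), the two spellings should give the same answers and differ only in speed:

```python
import unicodedata


def is_normalized_slow(form, s):
    # Hypothetical helper: the naive spelling, which always builds the
    # full normalized string before comparing.
    return s == unicodedata.normalize(form, s)


# U+F900 has a canonical decomposition, so a string of it is not in NFD.
decomposable = "\uf900" * 1000
# U+0338 is a combining mark with no decomposition, so this string is
# already in NFD.
combining = "\u0338" * 1000

samples = (decomposable, combining)
fast_results = [unicodedata.is_normalized("NFD", s) for s in samples]
slow_results = [is_normalized_slow("NFD", s) for s in samples]
```

Both lists should be `[False, True]`; the timings quoted above are about how quickly the first answer is reached.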
This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.
With this, that case is actually faster than in master!
$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop
$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
(cherry picked from commit 2f09413947)
Co-authored-by: Greg Price <gnprice@gmail.com>
* Fix suspicious.py to actually print the unused rules
* Fix the other `self.warn` calls
(cherry picked from commit e1786b5416)
Co-authored-by: Anthony Sottile <asottile@umich.edu>
Adds a link to `dateutil.parser.isoparse` in the documentation.
It would be nice to set up intersphinx for things like this, but I think we can leave that for a separate PR.
CC: @pitrou
[bpo-37979](https://bugs.python.org/issue37979)
Automerge-Triggered-By: @pitrou
(cherry picked from commit 59725f3bad)
Co-authored-by: Paul Ganssle <paul@ganssle.io>
- drop TargetScopeError in favour of raising SyntaxError directly
as per the updated PEP 572
- comprehension iteration variables are explicitly local, but
named expression targets in comprehensions are nonlocal or
global. Raise SyntaxError as specified in PEP 572
- named expression targets in the outermost iterable of a
comprehension have an ambiguous target scope. Avoid resolving
that question now by raising SyntaxError. PEP 572
originally required this only for cases where the bound name
conflicts with the iteration variable in the comprehension,
but CPython can't easily restrict the exception to that case
(as it doesn't know the target variable names when visiting
the outermost iterator expression)
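
The three cases above can be checked with a small sketch like the following (assuming Python 3.8 or later; `compiles` is a helper invented for this example, not something in CPython):

```python
def compiles(source):
    """Return True if `source` compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Rebinding the comprehension's iteration variable is rejected.
rebinds_iteration_variable = compiles("[(i := 0) for i in range(3)]")

# A named expression in the outermost iterable is rejected, even though
# the bound name does not clash with the iteration variable here.
walrus_in_outer_iterable = compiles("[x for x in (y := range(3))]")

# A named expression in the element expression is allowed, and its
# target is bound in the enclosing scope (global in this case).
namespace = {}
exec("squares = [(last := x * x) for x in range(4)]", namespace)
```

After running this, `namespace["last"]` holds the value from the final iteration, while the first two sources fail to compile.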
(cherry picked from commit 5dbe0f59b7)
"Arguments may be integers... " could be misunderstood to mean that
they could also be strings.
New wording makes it clear that arguments have to be integers.
modified: Doc/library/datetime.rst
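
A minimal sketch of the behaviour the new wording describes, using `datetime.date` as a representative constructor (the `accepts` helper is hypothetical and exists only for this example):

```python
from datetime import date


def accepts(year, month, day):
    """Return True if date() accepts the arguments, False on TypeError."""
    try:
        date(year, month, day)
    except TypeError:
        return False
    return True


integers_accepted = accepts(2019, 9, 1)
strings_accepted = accepts("2019", "9", "1")
```

Integer arguments are accepted, while numeric-looking strings are rejected with a TypeError.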
Automerge-Triggered-By: @pganssle
(cherry picked from commit c5218fce02)
Co-authored-by: Jürgen Gmach <juergen.gmach@googlemail.com>
Fix typo in description of link to mozilla bug report writing guidelines.
Though the URL is misleading, we're indeed trying to write bug _reports_, not to add bugs.
Automerge-Triggered-By: @ned-deily
(cherry picked from commit e17f201cd9)
Co-authored-by: Antoine <43954001+awecx@users.noreply.github.com>
bpo-37834: Normalise handling of reparse points on Windows
* ntpath.realpath() and nt.stat() will traverse all supported reparse points (previously was mixed)
* nt.lstat() will let the OS traverse reparse points that are not name surrogates (previously would not traverse any reparse point)
* nt.[l]stat() will only set S_IFLNK for symlinks (previous behaviour)
* nt.readlink() will read destinations for symlinks and junction points only
bpo-1311: os.path.exists('nul') now returns True on Windows
* nt.stat('nul').st_mode is now S_IFCHR (previously was an error)
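
A small sketch of the bpo-1311 part, written portably so it can be tried on any platform: `os.devnull` is `'nul'` on Windows and `'/dev/null'` on POSIX. The Windows result is what this change is expected to produce; it was not run on Windows for this example.

```python
import os
import stat

# Platform-specific name of the null device.
null_device = os.devnull

# With this change, both checks are expected to succeed on Windows too.
null_exists = os.path.exists(null_device)
null_is_char_device = stat.S_ISCHR(os.stat(null_device).st_mode)
```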
Added back a mention that ensure_future actually schedules obj. The documentation only described what ensure_future returns, so I did not realize that ensure_future also schedules obj.
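
A minimal sketch of the scheduling behaviour (the coroutine names and the `log` list are invented for this example): the coroutine runs once the event loop gets control, even before the returned task is awaited.

```python
import asyncio


async def record(log):
    log.append("ran")


async def main():
    log = []
    # ensure_future wraps the coroutine in a Task and schedules it.
    task = asyncio.ensure_future(record(log))
    # Yield to the event loop once, without awaiting the task itself.
    await asyncio.sleep(0)
    ran_before_await = list(log)
    await task
    return ran_before_await


ran_before_await = asyncio.run(main())
```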
(cherry picked from commit 092911d5c0)
Co-authored-by: Roger Iyengar <ri@rogeriyengar.com>
Fixed wrong link to Telnet.open() method in telnetlib documentation.
(cherry picked from commit e0b6117e27)
Co-authored-by: Michael Anckaert <michael.anckaert@sinax.be>
The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.
Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.
Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so. The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.
Also add to the isspace() test a thorough check that the
implementation agrees with the intended definition.
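
For reference, a sketch of that kind of check, assuming the definition as documented for `str.isspace()`: a character is whitespace if its general category is Zs or its bidirectional class is WS, B, or S. The helper name `is_whitespace_by_definition` is made up here, and the loop over every code point takes a second or so.

```python
import sys
import unicodedata


def is_whitespace_by_definition(char):
    # Assumed definition: general category Zs ("Separator, space"), or
    # bidirectional class WS, B, or S.
    return (unicodedata.category(char) == "Zs"
            or unicodedata.bidirectional(char) in ("WS", "B", "S"))


# Collect every code point where str.isspace() disagrees with the definition.
mismatches = [
    codepoint
    for codepoint in range(sys.maxunicode + 1)
    if chr(codepoint).isspace() != is_whitespace_by_definition(chr(codepoint))
]
```

An empty `mismatches` list means the implementation and the written definition agree.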
Because mod, func, class, etc. all share one namespace, :func:`time` creates a link to the time module doc page rather than to the time.time function.
(cherry picked from commit 1b1d0514ad)
Co-authored-by: Éric Araujo <merwok@netwok.org>
Automerge-Triggered-By: @merwok
https://bugs.python.org/issue37814:
> The empty tuple syntax in type annotations, `Tuple[()]`, is not obvious from the examples given in the documentation (I naively expected `Tuple[]` to work); it has been documented in PEP 484 and in mypy, but not in the documentation for the typing module.
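
A short sketch of the syntax in question (the alias and function names are invented for this example):

```python
from typing import Tuple

# The empty tuple type is spelled Tuple[()]; `Tuple[]` is a syntax error.
EmptyTuple = Tuple[()]


def no_results() -> Tuple[()]:
    """Always return the empty tuple."""
    return ()


result = no_results()
```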
(cherry picked from commit 8a784af750)
Co-authored-by: Josh Holland <anowlcalledjosh@gmail.com>
* bpo-32912: Revert warnings for invalid escape sequences.
DeprecationWarning will continue to be emitted for invalid escape sequences in string and bytes literals in 3.8 just as it did in 3.7.
SyntaxWarning may be emitted in the future. But per mailing list discussion, we don't yet know when because we haven't settled on how to do so in a non-disruptive manner.
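
A sketch for observing which warning a given interpreter emits (the helper name `escape_warning_categories` is made up). On 3.8, per this change, the category should be DeprecationWarning; other versions may differ, so the code only records what it sees.

```python
import warnings


def escape_warning_categories(source):
    """Compile `source` and return the categories of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [warning.category for warning in caught]


# "\d" is not a valid escape sequence in a non-raw string literal.
invalid_escape = escape_warning_categories('pattern = "\\d+"')
# The raw-string spelling contains no escape sequence at all.
raw_string = escape_warning_categories('pattern = r"\\d+"')
```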