cpython/Lib
Greg Price 2f09413947 closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
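
As a minimal illustration of that equivalence (a sketch, assuming Python 3.8 or later, where `unicodedata.is_normalized` is available; the helper name `is_normalized_slow` is hypothetical and not part of the patch):

```python
import unicodedata

def is_normalized_slow(form, s):
    """Reference answer: normalize the whole string and compare."""
    return s == unicodedata.normalize(form, s)

samples = ["abc", "\u00e9", "e\u0301", "\uf900", "\u0338"]
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        # The fast path must always agree with the slow definition.
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```

Quick check only changes how fast the answer arrives, never the answer itself.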

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
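
The shape of the standard's loop can be sketched in Python for the NFD case. This is not the C code in this patch. It is a rough sketch under stated assumptions: the `unicodedata` module does not expose the NFD_QC property, so the hypothetical helper `nfd_quick_check_property` approximates it as "NO if the character is a precomposed Hangul syllable or has a canonical decomposition mapping". MAYBE never arises for NFD and is kept only to mirror the three-valued result.

```python
import unicodedata

YES, NO, MAYBE = "YES", "NO", "MAYBE"

def nfd_quick_check_property(ch):
    """Approximate NFD_QC for one character (assumption, see above)."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed Hangul syllables decompose algorithmically.
        return NO
    decomp = unicodedata.decomposition(ch)
    # A mapping without a "<tag>" prefix is a canonical decomposition.
    if decomp and not decomp.startswith("<"):
        return NO
    return YES

def quick_check_nfd(s):
    """Return YES, NO or MAYBE following the UAX #15 quick-check loop."""
    last_ccc = 0
    result = YES
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return NO
        check = nfd_quick_check_property(ch)
        if check == NO:
            return NO
        if check == MAYBE:
            result = MAYBE
        last_ccc = ccc
    return result
```

The early `return NO` is what lets a string such as `"\uf900" * 500000` be rejected after inspecting only its first character, instead of normalizing the whole string.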

In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop
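
The same kind of measurement can be taken from within Python using the standard `timeit` module. This is a sketch only: the helper `time_is_normalized` and its default arguments are assumptions for illustration, and absolute numbers will differ by machine and build.

```python
import timeit
import unicodedata

s = "\uf900" * 500000  # same test string as in the shell commands

def time_is_normalized(form, text, number=5, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    runs = timeit.repeat(lambda: unicodedata.is_normalized(form, text),
                         number=number, repeat=repeat)
    return min(runs) / number

print(f"{time_is_normalized('NFD', s) * 1e9:.1f} ns per call")
```

The lambda adds a small constant call overhead, so very fast results are somewhat inflated compared with the command-line form.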

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

  $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
      -- 'unicodedata.normalize("NFD", s)'
  500 loops, best of 5: 561 usec per loop

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
      -- 'unicodedata.normalize("NFD", s)'
  500 loops, best of 5: 512 usec per loop
2019-09-03 19:45:44 -07:00
..
asyncio Fix typos mostly in comments, docs and test names (GH-15209) 2019-08-30 16:21:19 -04:00
collections bpo-36582: Make collections.UserString.encode() return bytes, not str (GH-13138) 2019-08-27 21:38:09 -07:00
concurrent bpo-31783: Fix a race condition creating workers during shutdown (#13171) 2019-06-28 11:54:52 -07:00
ctypes bpo-37140: Fix StructUnionType_paramfunc() (GH-15612) 2019-08-30 14:30:33 +02:00
curses [3.9] bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-12620) 2019-06-05 18:22:31 +03:00
dbm bpo-36232: Improve error message on dbm.open() when the db doesn't exist (GH-12060) 2019-04-29 16:23:28 -07:00
distutils closes bpo-37965: Fix compiler warning of distutils CCompiler.test_function. (GH-15560) 2019-08-28 10:11:03 -07:00
email bpo-37764: Fix infinite loop when parsing unstructured email headers. (GH-15239) 2019-08-31 08:25:35 -07:00
encodings bpo-35551: remove mac_centeuro encoding (GH-13856) 2019-06-06 14:38:52 +09:00
ensurepip bpo-37664: Update ensurepip bundled wheels, again (GH-15483) 2019-08-26 11:19:30 -07:00
html bpo-37328: remove deprecated HTMLParser.unescape (GH-14186) 2019-08-27 11:48:06 +09:00
http bpo-26589: Add http status code 451 (GH-15413) 2019-08-23 10:19:15 -07:00
idlelib bpo-38022: IDLE: upgrade help.html to sphinx 2.x HTML5 output (GH-15664) 2019-09-03 16:52:58 -04:00
importlib bpo-38010 Sync importlib.metadata with importlib_metadata 0.20. (GH-15646) 2019-09-02 11:08:03 -04:00
json json.tool: use stdin and stdout in default cmdlne arguments (GH-11992) 2019-05-14 18:52:42 +02:00
lib2to3 Fix typos in comments, docs and test names (#15018) 2019-07-30 18:16:13 -04:00
logging bpo-37742: Return the root logger when logging.getLogger('root') is c… (#15077) 2019-08-02 16:53:00 +01:00
msilib bpo-12639: msilib.Directory.start_component() fails if *keyfile* is not None (GH-13688) 2019-05-31 09:43:13 -07:00
multiprocessing Fix typos mostly in comments, docs and test names (GH-15209) 2019-08-30 16:21:19 -04:00
pydoc_data Python 3.8.0b1 2019-06-04 19:44:34 +02:00
site-packages
sqlite3 closes bpo-37347: Fix refcount problem in sqlite3. (GH-14268) 2019-07-12 20:15:48 -07:00
test closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) 2019-09-03 19:45:44 -07:00
tkinter bpo-15999: Clean up of handling boolean arguments. (GH-15610) 2019-09-01 12:16:51 +03:00
turtledemo Unmark files as executable that can't actually be executed. (GH-15353) 2019-08-20 21:53:59 -07:00
unittest Fix typos mostly in comments, docs and test names (GH-15209) 2019-08-30 16:21:19 -04:00
urllib bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791) 2019-06-16 09:48:57 +03:00
venv bpo-37663: have venv activation scripts all consistently use __VENV_PROMPT__ for prompt customization (GH-14941) 2019-08-21 15:58:01 -07:00
wsgiref bpo-8138: Initialize wsgiref's SimpleServer as single-threaded (GH-12977) 2019-05-24 20:24:42 +03:00
xml bpo-15999: Always pass bool instead of int to the expat parser. (GH-15622) 2019-09-01 12:11:43 +03:00
xmlrpc bpo-15999: Always pass bool instead of int to the expat parser. (GH-15622) 2019-09-01 12:11:43 +03:00
__future__.py bpo-35526: make __future__.barry_as_FLUFL mandatory for Python 4.0 (#11218) 2018-12-19 08:19:39 -08:00
__phello__.foo.py
_bootlocale.py
_collections_abc.py bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-13700) 2019-06-01 11:00:15 +03:00
_compat_pickle.py bpo-37757: Disallow PEP 572 cases that expose implementation details (GH-15131) 2019-08-25 23:45:40 +10:00
_compression.py
_markupbase.py
_osx_support.py bpo-35257: Avoid leaking LTO linker flags into distutils (GH-10900) 2018-12-19 18:19:01 +01:00
_py_abc.py bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-13700) 2019-06-01 11:00:15 +03:00
_pydecimal.py bpo-36793: Remove unneeded __str__ definitions. (GH-13081) 2019-05-06 22:29:40 +03:00
_pyio.py bpo-15999: Clean up of handling boolean arguments. (GH-15610) 2019-09-01 12:16:51 +03:00
_sitebuiltins.py
_strptime.py
_threading_local.py bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-13700) 2019-06-01 11:00:15 +03:00
_weakrefset.py bpo-36949: Implement __repr__ on WeakSet (GH-13415) 2019-05-20 10:01:07 -07:00
abc.py bpo-35609: Remove examples for deprecated decorators in the abc module. (GH-11355) 2018-12-31 09:56:21 +02:00
aifc.py bpo-37320: Remove openfp() of aifc, sunau and wave (GH-14169) 2019-06-18 00:00:24 +02:00
antigravity.py Change the xkcd link in comment over https. (GH-5452) 2018-09-13 22:45:00 -07:00
argparse.py Steven Bethard designated a new maintainer for argparse (GH-15605) 2019-08-29 21:04:37 -07:00
ast.py bpo-37950: Fix ast.dump() when call with incompletely initialized node. (GH-15510) 2019-08-29 09:30:23 +03:00
asynchat.py
asyncore.py bpo-15999: Always pass bool instead of int to socket.setblocking(). (GH-15621) 2019-09-01 12:12:52 +03:00
base64.py
bdb.py Fix typos mostly in comments, docs and test names (GH-15209) 2019-08-30 16:21:19 -04:00
binhex.py
bisect.py remove duplicate code in biscet (GH-1270) 2019-04-08 17:01:09 +09:00
bz2.py bpo-35128: Fix spacing issues in warning.warn() messages. (GH-10268) 2018-11-01 12:33:35 +02:00
cProfile.py [3.9] bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-12620) 2019-06-05 18:22:31 +03:00
calendar.py bpo-28292: Mark calendar.py helper functions as private. (GH-15113) 2019-08-04 13:14:03 -07:00
cgi.py bpo-35028: cgi: Fix max_num_fields off by one error (GH-9973) 2018-10-23 01:14:35 -07:00
cgitb.py
chunk.py
cmd.py
code.py
codecs.py bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278) 2019-05-31 22:44:00 +03:00
codeop.py bpo-15999: Clean up of handling boolean arguments. (GH-15610) 2019-09-01 12:16:51 +03:00
colorsys.py
compileall.py bpo-36786: Run compileall in parallel during "make install" (GH-13078) 2019-05-15 23:45:18 +02:00
configparser.py fix typo in configparser doc (GH-12154) 2019-03-03 18:23:19 -08:00
contextlib.py [3.9] bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-12620) 2019-06-05 18:22:31 +03:00
contextvars.py
copy.py
copyreg.py bpo-33138: Change standard error message for non-pickleable and non-copyable types. (GH-6239) 2018-10-31 02:28:07 +02:00
crypt.py bpo-25172: Raise appropriate ImportError msg when crypt module used on Windows (GH-15149) 2019-08-08 21:02:49 +01:00
csv.py bpo-27497: Add return value to csv.DictWriter.writeheader (GH-12306) 2019-05-10 03:50:11 +02:00
dataclasses.py bpo-37868: Improve is_dataclass for instances. (GH-15325) 2019-08-20 01:40:28 -04:00
datetime.py bpo-37642: Update acceptable offsets in timezone (GH-14878) 2019-08-09 10:22:16 -04:00
decimal.py
difflib.py Fix difflib `?` hint in diff output when dealing with tabs (#15201) 2019-08-21 13:59:25 -05:00
dis.py bpo-36540: PEP 570 -- Implementation (GH-12701) 2019-04-29 13:36:57 +01:00
doctest.py bpo-15999: Clean up of handling boolean arguments. (GH-15610) 2019-09-01 12:16:51 +03:00
enum.py bpo-34443: Use __qualname__ instead of __name__ in enum exception messages. (GH-14809) 2019-07-18 11:37:13 -07:00
filecmp.py
fileinput.py bpo-37014: Update docstring and Documentation of fileinput.FileInput(). (GH-13545) 2019-06-02 23:01:49 +02:00
fnmatch.py
formatter.py
fractions.py Add a minor `Fraction.__hash__()` optimization (GH-15313) 2019-08-16 21:09:16 -05:00
ftplib.py Enforce PEP 257 conventions in ftplib.py (GH-15604) 2019-09-02 21:21:33 -07:00
functools.py bpo-36743: __get__ is sometimes called without the owner argument (#12992) 2019-08-29 01:27:42 -07:00
genericpath.py bpo-30974: Change os.path.samefile docstring to match docs (GH-7337) 2019-08-02 15:44:25 -07:00
getopt.py
getpass.py
gettext.py bpo-36239: Skip comments in gettext infos (GH-12255) 2019-05-09 16:22:15 +02:00
glob.py bpo-37363: Add audit events for a range of modules (GH-14301) 2019-06-24 08:42:54 -07:00
gzip.py bpo-6584: Add a BadGzipFile exception to the gzip module. (GH-13022) 2019-05-13 10:50:52 +03:00
hashlib.py
heapq.py bpo-29984: Improve 'heapq' test coverage (GH-992) 2019-05-31 21:13:57 -07:00
hmac.py
imaplib.py Fix typos in comments, docs and test names (#15018) 2019-07-30 18:16:13 -04:00
imghdr.py
imp.py
inspect.py bpo-37173: Show passed class in inspect.getfile error (GH-13861) 2019-06-08 05:05:46 -07:00
io.py bpo-36842: Implement PEP 578 (GH-12613) 2019-05-23 08:45:22 -07:00
ipaddress.py bpo-36845: validate integer network prefix when constructing IP networks (GH-13298) 2019-05-14 19:32:59 +09:00
keyword.py bpo-36143: Regenerate Lib/keyword.py from the Grammar and Tokens file using pgen (GH-12456) 2019-03-25 22:01:12 +00:00
linecache.py
locale.py bpo-18378: Recognize "UTF-8" as a valid name in locale._parse_localename (GH-14736) 2019-08-29 00:33:52 -04:00
lzma.py
mailbox.py bpo-31522: mailbox.get_string: pass `from_` parameter to `get_bytes` (#9857) 2018-10-18 20:21:47 -04:00
mailcap.py
mimetypes.py bpo-4963: Fix for initialization and non-deterministic behavior issues in mimetypes (GH-3062) 2019-06-24 16:46:59 -07:00
modulefinder.py bpo-37032: Add CodeType.replace() method (GH-13542) 2019-05-24 23:57:23 +02:00
netrc.py
nntplib.py bpo-37390: Add audit event table to documentations (GH-14406) 2019-06-27 10:47:59 -07:00
ntpath.py bpo-9949: Call normpath() in realpath() and avoid unnecessary prefixes (GH-15369) 2019-08-21 16:45:02 -07:00
nturl2path.py
numbers.py
opcode.py bpo-34880: Add the LOAD_ASSERTION_ERROR opcode. (GH-15073) 2019-08-25 12:44:09 +03:00
operator.py bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-13700) 2019-06-01 11:00:15 +03:00
optparse.py bpo-34605: Avoid master/slave terms (GH-9101) 2018-09-07 17:30:33 +02:00
os.py bpo-36085: Enable better DLL resolution on Windows (GH-12302) 2019-03-29 16:37:16 -07:00
pathlib.py bpo-37689: add Path.is_relative_to() method (GH-14982) 2019-08-13 21:54:02 +02:00
pdb.py bpo-20523: pdb searches for .pdbrc in ~ instead of $HOME (GH-11847) 2019-08-02 15:20:14 -07:00
pickle.py bpo-37210: Fix pure Python pickle when _pickle is unavailable (GH-14016) 2019-06-13 13:58:51 +02:00
pickletools.py bpo-36785: PEP 574 implementation (GH-7076) 2019-05-26 17:10:09 +02:00
pipes.py
pkgutil.py
platform.py bpo-35389: platform.platform() calls libc_ver() without executable (GH-14418) 2019-06-27 09:04:28 +02:00
plistlib.py Clarify that plistlib's load and dump functions take a binary file object (GH-9825) 2019-07-14 11:01:48 +02:00
poplib.py bpo-37390: Add audit event table to documentations (GH-14406) 2019-06-27 10:47:59 -07:00
posixpath.py bpo-35755: Remove current directory from posixpath.defpath (GH-11586) 2019-04-17 17:05:30 +02:00
pprint.py bpo-37376: pprint support for SimpleNamespace (GH-14318) 2019-06-26 16:13:18 -07:00
profile.py [3.9] bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-12620) 2019-06-05 18:22:31 +03:00
pstats.py Fix typos in docs and docstrings (GH-13745) 2019-06-03 01:12:33 +02:00
pty.py
py_compile.py bpo-22640: Add silent mode to py_compile.compile() (GH-12976) 2019-05-28 19:29:04 +03:00
pyclbr.py Fix typos in docs and docstrings (GH-13745) 2019-06-03 01:12:33 +02:00
pydoc.py bpo-36045: builtins.help() now prefixes `async` for async functions (GH-12010) 2019-05-24 04:38:01 -07:00
queue.py bpo-37394: Fix pure Python implementation of the queue module (GH-14351) 2019-06-25 02:53:30 +01:00
quopri.py bpo-15999: Clean up of handling boolean arguments. (GH-15610) 2019-09-01 12:16:51 +03:00
random.py bpo-32554: Deprecate hashing arbitrary types in random.seed() (GH-15382) 2019-08-22 09:19:36 -07:00
re.py bpo-36548: Improve the repr of re flags. (GH-12715) 2019-05-31 10:39:47 +03:00
reprlib.py
rlcompleter.py
runpy.py
sched.py
secrets.py
selectors.py
shelve.py
shlex.py bpo-28595: Allow shlex whitespace_split with punctuation_chars (GH-2071) 2019-06-01 20:09:22 +01:00
shutil.py bpo-37834: Prevent shutil.rmtree exception (GH-15602) 2019-08-29 23:20:03 +02:00
signal.py
site.py bpo-37369: Fix initialization of sys members when launched via an app container (GH-14428) 2019-06-29 10:34:11 -07:00
smtpd.py
smtplib.py bpo-32793: Fix a duplicate debug message in smtplib (GH-15341) 2019-08-20 10:52:25 -07:00
sndhdr.py
socket.py closes bpo-37566: Remove _realsocket from socket.py. (GH-14711) 2019-07-11 19:17:52 -07:00
socketserver.py Fix typo in socketserver docstring (GH-11252) 2018-12-21 14:22:09 -08:00
sre_compile.py Simplify flags checks in sre_compile.py. (GH-9718) 2018-10-05 20:53:45 +03:00
sre_constants.py bpo-36793: Remove unneeded __str__ definitions. (GH-13081) 2019-05-06 22:29:40 +03:00
sre_parse.py bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) 2019-07-31 21:50:39 +03:00
ssl.py bpo-37463: match_hostname requires quad-dotted IPv4 (GH-14499) 2019-07-02 11:39:42 -07:00
stat.py
statistics.py bpo-37798: Add C fastpath for statistics.NormalDist.inv_cdf() (GH-15266) 2019-08-23 15:20:30 -07:00
string.py bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-13700) 2019-06-01 11:00:15 +03:00
stringprep.py
struct.py
subprocess.py bpo-37380: subprocess: don't use _active on win (GH-14360) 2019-06-28 18:12:16 +02:00
sunau.py bpo-37320: Remove openfp() of aifc, sunau and wave (GH-14169) 2019-06-18 00:00:24 +02:00
symbol.py bpo-35766: Merge typed_ast back into CPython (GH-11645) 2019-01-31 12:40:27 +01:00
symtable.py bpo-34983: Expose symtable.Symbol.is_nonlocal() in the symtable module (GH-9872) 2018-10-20 01:46:00 +01:00
sysconfig.py bpo-37201: fix test_distutils failures for Windows ARM64 (GH-13902) 2019-06-12 10:16:49 -07:00
tabnanny.py
tarfile.py Add missing docstrings for TarInfo objects (#12555) 2019-03-27 13:16:34 -07:00
telnetlib.py bpo-37363: Add audit events for a range of modules (GH-14301) 2019-06-24 08:42:54 -07:00
tempfile.py bpo-37363: Add audit events for a range of modules (GH-14301) 2019-06-24 08:42:54 -07:00
textwrap.py bpo-30754: Document textwrap.dedent blank line behavior. (GH-14469) 2019-06-29 21:20:03 -07:00
this.py
threading.py bpo-15999: Clean up of handling boolean arguments. (GH-15610) 2019-09-01 12:16:51 +03:00
timeit.py
token.py bpo-35975: Support parsing earlier minor versions of Python 3 (GH-12086) 2019-03-07 12:38:08 -08:00
tokenize.py bpo-5028: Fix up rest of documentation for tokenize documenting line (GH-13686) 2019-05-30 15:06:32 -07:00
trace.py [3.9] bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-12620) 2019-06-05 18:22:31 +03:00
traceback.py bpo-37685: Fixed __eq__, __lt__ etc implementations in some classes. (GH-14952) 2019-08-08 08:42:54 +03:00
tracemalloc.py bpo-37685: Fixed __eq__, __lt__ etc implementations in some classes. (GH-14952) 2019-08-08 08:42:54 +03:00
tty.py
turtle.py Fix typos in docs and docstrings (GH-13745) 2019-06-03 01:12:33 +02:00
types.py bpo-37032: Add CodeType.replace() method (GH-13542) 2019-05-24 23:57:23 +02:00
typing.py bpo-37116: Use PEP 570 syntax for positional-only parameters. (GH-13700) 2019-06-01 11:00:15 +03:00
uu.py bpo-33687: Fix call to os.chmod() in uu.decode() (GH-7282) 2019-01-17 17:15:53 +03:00
uuid.py Fix typos mostly in comments, docs and test names (GH-15209) 2019-08-30 16:21:19 -04:00
warnings.py bpo-35178: Fix warnings._formatwarnmsg() (GH-12033) 2019-03-01 18:17:55 +01:00
wave.py bpo-37320: Remove openfp() of aifc, sunau and wave (GH-14169) 2019-06-18 00:00:24 +02:00
weakref.py bpo-37685: Fixed __eq__, __lt__ etc implementations in some classes. (GH-14952) 2019-08-08 08:42:54 +03:00
webbrowser.py bpo-37363: Add audit events for a range of modules (GH-14301) 2019-06-24 08:42:54 -07:00
xdrlib.py
zipapp.py
zipfile.py bpo-37772: fix zipfile.Path.iterdir() outputs (GH-15170) 2019-08-24 11:26:41 -04:00
zipimport.py bpo-36842: Implement PEP 578 (GH-12613) 2019-05-23 08:45:22 -07:00