cpython/Doc/whatsnew
Miss Islington (bot) 4dd1c9d9c2
closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX GH-15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
(cherry picked from commit 2f09413947)

Co-authored-by: Greg Price <gnprice@gmail.com>
2019-09-03 20:03:37 -07:00
..
2.0.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
2.1.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
2.2.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
2.3.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
2.4.rst bpo-33641: Convert RFC references into links. (GH-7103) 2018-05-31 07:39:00 +03:00
2.5.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
2.6.rst Docs: FIX broken links. (GH-13491) 2019-05-25 20:02:24 +02:00
2.7.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
3.0.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
3.1.rst Fix miscellaneous typos (#4275) 2017-11-05 15:37:50 +02:00
3.2.rst Docs: FIX broken links. (GH-13491) 2019-05-25 20:02:24 +02:00
3.3.rst bpo-35042: Use the :pep: role where a PEP is specified (#10036) 2018-10-26 15:58:26 -07:00
3.4.rst bpo-35506: Remove redundant and incorrect links from keywords. (GH-11174) 2018-12-19 08:09:46 +02:00
3.5.rst bpo-35044, doc: Use the :exc: role for the exceptions (GH-10037) 2018-10-26 12:52:11 +02:00
3.6.rst bpo-35486: Note Py3.6 import system API requirement change (GH-11540) 2019-01-17 02:41:29 -08:00
3.7.rst bpo-33821: Update IDLE section of What's New 3.7 (GH-15036) 2019-07-30 22:23:05 -07:00
3.8.rst closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558) 2019-09-03 20:03:37 -07:00
changelog.rst Include additional changes to support blurbified NEWS (#3340) 2017-09-05 00:46:18 -07:00
index.rst Update Doc build to 3.8 2018-01-31 18:12:38 -05:00