Commit Graph

125 Commits

Author SHA1 Message Date
Miss Islington (bot) 4dd1c9d9c2
closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX GH-15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
(cherry picked from commit 2f09413947)

Co-authored-by: Greg Price <gnprice@gmail.com>
2019-09-03 20:03:37 -07:00
Jeroen Demeyer 530f506ac9 bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464)
Automatically replace
tp_print -> tp_vectorcall_offset
tp_compare -> tp_as_async
tp_reserved -> tp_as_async
2019-05-30 19:13:39 -07:00
Inada Naoki 6fec905de5
bpo-36642: make unicodedata const (GH-12855) 2019-04-17 08:40:34 +09:00
Max Bélanger 2810dd7be9 closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) 2018-11-04 15:58:24 -08:00
Wonsup Yoon d134809cd3 bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
2018-06-15 20:03:14 +08:00
Benjamin Peterson 7c69c1c0fb
update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)
Also, standardize indentation of generated tables.
2018-06-06 20:14:28 -07:00
luzpaz a5293b4ff2 Fix miscellaneous typos (#4275) 2017-11-05 15:37:50 +02:00
Benjamin Peterson 279a96206f bpo-30736: upgrade to Unicode 10.0 (#2344)
Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
2017-06-22 22:31:08 -07:00
Serhiy Storchaka f8d7d41507 Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*. 2016-10-23 15:12:25 +03:00
Christian Heimes 0202c347bc Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:21:20 +02:00
Christian Heimes 2f366cab48 Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:20:27 +02:00
Benjamin Peterson 6775231597 Unicode 9.0.0
Not completely mechanical since support for East Asian Width changes—emoji
codepoints became Wide—had to be added to unicodedata.
2016-09-14 23:53:47 -07:00
Christian Heimes 8cee10386e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:54 +02:00
Christian Heimes 7ce201322e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:46 +02:00
Serhiy Storchaka 2d06e84455 Issue #25923: Added the const qualifier to static constant arrays. 2015-12-25 19:53:18 +02:00
Benjamin Peterson 4801383c29 upgrade to Unicode 8.0.0 2015-06-27 15:45:56 -05:00
Larry Hastings 38337d1e15 Issue #24000: Improved Argument Clinic's mapping of converters to legacy
"format units".  Updated the documentation to match.
2015-05-07 23:30:09 -07:00
Larry Hastings dbfdc380df Issue #24001: Argument Clinic converters now use accept={type}
instead of types={'type'} to specify the types the converter accepts.
2015-05-04 06:59:46 -07:00
Serhiy Storchaka 6359641bcd Issue #20181: Converted the unicodedata module to Argument Clinic. 2015-04-17 21:18:49 +03:00
Larry Hastings 89964c48d1 Issue #23944: Argument Clinic now wraps long impl prototypes at column 78. 2015-04-14 18:07:59 -04:00
Serhiy Storchaka 1009bf18b3 Issue #23501: Argumen Clinic now generates code into separate files by default. 2015-04-03 23:53:51 +03:00
Benjamin Peterson 5061e67f0f merge 3.3 (#23367) 2015-03-02 11:18:40 -05:00
Benjamin Peterson b779bfba45 fix possible overflow bugs in unicodedata (closes #23367) 2015-03-02 11:17:05 -05:00
Serhiy Storchaka 1a1ff29659 Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer
overflows.  Added few missed PyErr_NoMemory().
2015-02-16 13:28:22 +02:00
Serhiy Storchaka d3faf43f9b Issue #23181: More "codepoint" -> "code point". 2015-01-18 11:28:37 +02:00
Victor Stinner 65a3144e54 Closes #21780: make the unicodedata module "ssize_t clean" for parsing parameters 2014-07-01 16:45:52 +02:00
Larry Hastings 2623c8c23c Issue #20530: Argument Clinic's signature format has been revised again.
The new syntax is highly human readable while still preventing false
positives.  The syntax also extends Python syntax to denote "self" and
positional-only parameters, allowing inspect.Signature objects to be
totally accurate for all supported builtins in Python 3.4.
2014-02-08 22:15:29 -08:00
Larry Hastings 581ee3618c Issue #20326: Argument Clinic now uses a simple, unique signature to
annotate text signatures in docstrings, resulting in fewer false
positives.  "self" parameters are also explicitly marked, allowing
inspect.Signature() to authoritatively detect (and skip) said parameters.

Issue #20326: Argument Clinic now generates separate checksums for the
input and output sections of the block, allowing external tools to verify
that the input has not changed (and thus the output is not out-of-date).
2014-01-28 05:00:08 -08:00
Larry Hastings c20472640c Issue #20390: Small fixes and improvements for Argument Clinic. 2014-01-25 20:43:29 -08:00
Larry Hastings 5c66189e88 Issue #20189: Four additional builtin types (PyTypeObject,
PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type)
have been modified to provide introspection information for builtins.
Also: many additional Lib, test suite, and Argument Clinic fixes.
2014-01-24 06:17:25 -08:00
Larry Hastings 61272b77b0 Issue #19273: The marker comments Argument Clinic uses have been changed
to improve readability.
2014-01-07 12:41:53 -08:00
Larry Hastings 77561cccb2 Issue #20141: Improved Argument Clinic's support for the PyArg_Parse "O!"
format unit.
2014-01-07 12:13:13 -08:00
Larry Hastings 44e2eaab54 Issue #19674: inspect.signature() now produces a correct signature
for some builtins.
2013-11-23 15:37:55 -08:00
Larry Hastings ed4a1c5703 Argument Clinic: rename "self" to "module" for module-level functions. 2013-11-18 09:32:13 -08:00
Larry Hastings 3182680210 Issue #16612: Add "Argument Clinic", a compile-time preprocessor
for C files to generate argument parsing code.  (See PEP 436.)
2013-10-19 00:09:25 -07:00
Benjamin Peterson a76795bf53 merge 3.3 2013-10-10 20:22:39 -04:00
Benjamin Peterson 8aa7b89983 replace hardcoded version 2013-10-10 20:22:10 -04:00
Benjamin Peterson f0a1b87560 merge 3.3 2013-10-10 20:17:29 -04:00
Benjamin Peterson 577dd61ff2 make sure the docstring is never out of date wrt unicode data version 2013-10-10 20:16:25 -04:00
Benjamin Peterson d3c43a993c merge 3.3 (#19220) 2013-10-10 17:40:30 -04:00
Benjamin Peterson a4cf1c87d0 remove url from docstring (closes #19220) 2013-10-10 17:39:56 -04:00
Benjamin Peterson 94d08d908b upgrade unicode db to 6.3.0 (closes #19221) 2013-10-10 17:24:45 -04:00
Ezio Melotti 7c4a7e6f3c #18803: fix more typos. Patch by Févry Thibault. 2013-08-26 01:32:56 +03:00
Ezio Melotti 85a8629d21 #18466: fix more typos. Patch by Févry Thibault. 2013-08-17 16:57:41 +03:00
Ezio Melotti 11def426c0 #16681: merge with 3.2. 2012-12-14 20:13:39 +02:00
Ezio Melotti e3d7e54b11 #16681: use "bidirectional class" instead of "bidirectional category" in the docstring too. 2012-12-14 20:12:25 +02:00
Stefan Krah a4b4dea415 Use C-style comments (required for the AIX build slave). 2012-09-23 15:51:16 +02:00
Kristjan Valur Jonsson 85634d7a2e Issue #14909: A number of places were using PyMem_Realloc() apis and
PyObject_GC_Resize() with incorrect error handling.  In case of errors,
the original object would be leaked.  This checkin fixes those cases.
2012-05-31 09:37:31 +00:00
Benjamin Peterson 71f660e00f update to Unicode 6.1 2012-02-20 22:24:29 -05:00
Ezio Melotti df8077ecd3 #13379: merge with 3.2. 2011-11-10 09:37:43 +02:00