cpython

Commit Graph

Author	SHA1	Message	Date
Miss Islington (bot)	4dd1c9d9c2	closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) The purpose of the `unicodedata.is_normalized` function is to answer the question `str == unicodedata.normalize(form, str)` more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15. However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Implement the standard's algorithm. This greatly speeds up `unicodedata.is_normalized` in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO. At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare: `$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900" * 500000' -- 'unicodedata.is_normalized("NFD", s)'` 50 loops, best of 5: 4.39 msec per loop. With this patch, it gets the answer instantly (58 ns) on the same 1 MB string: `$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900" * 500000' -- 'unicodedata.is_normalized("NFD", s)'` 5000000 loops, best of 5: 58.2 nsec per loop. This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master! `$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338" * 500000' -- 'unicodedata.normalize("NFD", s)'` 500 loops, best of 5: 561 usec per loop. `$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338" * 500000' -- 'unicodedata.normalize("NFD", s)'` 500 loops, best of 5: 512 usec per loop. (cherry picked from commit `2f09413947`) Co-authored-by: Greg Price <gnprice@gmail.com>	2019-09-03 20:03:37 -07:00
Jeroen Demeyer	530f506ac9	bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464) Automatically replace: tp_print -> tp_vectorcall_offset; tp_compare -> tp_as_async; tp_reserved -> tp_as_async.	2019-05-30 19:13:39 -07:00
Inada Naoki	6fec905de5	bpo-36642: make unicodedata const (GH-12855)	2019-04-17 08:40:34 +09:00
Max Bélanger	2810dd7be9	closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)	2018-11-04 15:58:24 -08:00
Wonsup Yoon	d134809cd3	bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) The Hangul composition check boundaries were wrong: the second character must be in [0x1161, 0x1176), not [0x1161, 0x1176], and the third character in (0x11A7, 0x11C3), not [0x11A7, 0x11C3].	2018-06-15 20:03:14 +08:00
Benjamin Peterson	7c69c1c0fb	update to Unicode 11.0.0 (closes bpo-33778) (GH-7439) Also, standardize indentation of generated tables.	2018-06-06 20:14:28 -07:00
luzpaz	a5293b4ff2	Fix miscellaneous typos (#4275)	2017-11-05 15:37:50 +02:00
Benjamin Peterson	279a96206f	bpo-30736: upgrade to Unicode 10.0 (#2344) Straightforward. While we're at it, though, strip trailing whitespace from generated tables.	2017-06-22 22:31:08 -07:00
Serhiy Storchaka	f8d7d41507	Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*.	2016-10-23 15:12:25 +03:00
Christian Heimes	0202c347bc	Add an extra byte for null in case we ever get very long unicode names.	2016-09-23 20:21:20 +02:00
Christian Heimes	2f366cab48	Add an extra byte for null in case we ever get very long unicode names.	2016-09-23 20:20:27 +02:00
Benjamin Peterson	6775231597	Unicode 9.0.0. Not completely mechanical, since support for East Asian Width changes (emoji code points became Wide) had to be added to unicodedata.	2016-09-14 23:53:47 -07:00
Christian Heimes	8cee10386e	Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup()	2016-09-14 10:25:54 +02:00
Christian Heimes	7ce201322e	Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup()	2016-09-14 10:25:46 +02:00
Serhiy Storchaka	2d06e84455	Issue #25923: Added the const qualifier to static constant arrays.	2015-12-25 19:53:18 +02:00
Benjamin Peterson	4801383c29	upgrade to Unicode 8.0.0	2015-06-27 15:45:56 -05:00
Larry Hastings	38337d1e15	Issue #24000: Improved Argument Clinic's mapping of converters to legacy "format units". Updated the documentation to match.	2015-05-07 23:30:09 -07:00
Larry Hastings	dbfdc380df	Issue #24001: Argument Clinic converters now use accept={type} instead of types={'type'} to specify the types the converter accepts.	2015-05-04 06:59:46 -07:00
Serhiy Storchaka	6359641bcd	Issue #20181: Converted the unicodedata module to Argument Clinic.	2015-04-17 21:18:49 +03:00
Larry Hastings	89964c48d1	Issue #23944: Argument Clinic now wraps long impl prototypes at column 78.	2015-04-14 18:07:59 -04:00
Serhiy Storchaka	1009bf18b3	Issue #23501: Argument Clinic now generates code into separate files by default.	2015-04-03 23:53:51 +03:00
Benjamin Peterson	5061e67f0f	merge 3.3 (#23367)	2015-03-02 11:18:40 -05:00
Benjamin Peterson	b779bfba45	fix possible overflow bugs in unicodedata (closes #23367)	2015-03-02 11:17:05 -05:00
Serhiy Storchaka	1a1ff29659	Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer overflows. Added a few missing PyErr_NoMemory() calls.	2015-02-16 13:28:22 +02:00
Serhiy Storchaka	d3faf43f9b	Issue #23181: More "codepoint" -> "code point".	2015-01-18 11:28:37 +02:00
Victor Stinner	65a3144e54	Closes #21780: make the unicodedata module "ssize_t clean" for parsing parameters	2014-07-01 16:45:52 +02:00
Larry Hastings	2623c8c23c	Issue #20530: Argument Clinic's signature format has been revised again. The new syntax is highly human readable while still preventing false positives. The syntax also extends Python syntax to denote "self" and positional-only parameters, allowing inspect.Signature objects to be totally accurate for all supported builtins in Python 3.4.	2014-02-08 22:15:29 -08:00
Larry Hastings	581ee3618c	Issue #20326: Argument Clinic now uses a simple, unique signature to annotate text signatures in docstrings, resulting in fewer false positives. "self" parameters are also explicitly marked, allowing inspect.Signature() to authoritatively detect (and skip) said parameters. Issue #20326: Argument Clinic now generates separate checksums for the input and output sections of the block, allowing external tools to verify that the input has not changed (and thus the output is not out-of-date).	2014-01-28 05:00:08 -08:00
Larry Hastings	c20472640c	Issue #20390: Small fixes and improvements for Argument Clinic.	2014-01-25 20:43:29 -08:00
Larry Hastings	5c66189e88	Issue #20189: Four additional builtin types (PyTypeObject, PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type) have been modified to provide introspection information for builtins. Also: many additional Lib, test suite, and Argument Clinic fixes.	2014-01-24 06:17:25 -08:00
Larry Hastings	61272b77b0	Issue #19273: The marker comments Argument Clinic uses have been changed to improve readability.	2014-01-07 12:41:53 -08:00
Larry Hastings	77561cccb2	Issue #20141: Improved Argument Clinic's support for the PyArg_Parse "O!" format unit.	2014-01-07 12:13:13 -08:00
Larry Hastings	44e2eaab54	Issue #19674: inspect.signature() now produces a correct signature for some builtins.	2013-11-23 15:37:55 -08:00
Larry Hastings	ed4a1c5703	Argument Clinic: rename "self" to "module" for module-level functions.	2013-11-18 09:32:13 -08:00
Larry Hastings	3182680210	Issue #16612: Add "Argument Clinic", a compile-time preprocessor for C files to generate argument parsing code. (See PEP 436.)	2013-10-19 00:09:25 -07:00
Benjamin Peterson	a76795bf53	merge 3.3	2013-10-10 20:22:39 -04:00
Benjamin Peterson	8aa7b89983	replace hardcoded version	2013-10-10 20:22:10 -04:00
Benjamin Peterson	f0a1b87560	merge 3.3	2013-10-10 20:17:29 -04:00
Benjamin Peterson	577dd61ff2	make sure the docstring is never out of date wrt unicode data version	2013-10-10 20:16:25 -04:00
Benjamin Peterson	d3c43a993c	merge 3.3 (#19220)	2013-10-10 17:40:30 -04:00
Benjamin Peterson	a4cf1c87d0	remove url from docstring (closes #19220)	2013-10-10 17:39:56 -04:00
Benjamin Peterson	94d08d908b	upgrade unicode db to 6.3.0 (closes #19221)	2013-10-10 17:24:45 -04:00
Ezio Melotti	7c4a7e6f3c	#18803: fix more typos. Patch by Févry Thibault.	2013-08-26 01:32:56 +03:00
Ezio Melotti	85a8629d21	#18466: fix more typos. Patch by Févry Thibault.	2013-08-17 16:57:41 +03:00
Ezio Melotti	11def426c0	#16681: merge with 3.2.	2012-12-14 20:13:39 +02:00
Ezio Melotti	e3d7e54b11	#16681: use "bidirectional class" instead of "bidirectional category" in the docstring too.	2012-12-14 20:12:25 +02:00
Stefan Krah	a4b4dea415	Use C-style comments (required for the AIX build slave).	2012-09-23 15:51:16 +02:00
Kristjan Valur Jonsson	85634d7a2e	Issue #14909: A number of places were using PyMem_Realloc() APIs and PyObject_GC_Resize() with incorrect error handling. In case of errors, the original object would be leaked. This checkin fixes those cases.	2012-05-31 09:37:31 +00:00
Benjamin Peterson	71f660e00f	update to Unicode 6.1	2012-02-20 22:24:29 -05:00
Ezio Melotti	df8077ecd3	#13379: merge with 3.2.	2011-11-10 09:37:43 +02:00
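
The contract described in the bpo-37966 commit message (commit 4dd1c9d9c2) can be checked from Python. The sketch below is only an illustration and not code from the commit: the helper name `is_normalized_slow` is hypothetical, and `unicodedata.is_normalized` is assumed to be available (Python 3.8 or later). It compares the fast path against the normalize-and-compare answer for the two sample strings used in the commit's timings.

```python
import unicodedata


def is_normalized_slow(form: str, text: str) -> bool:
    """Hypothetical reference helper: normalize the whole string and compare."""
    return text == unicodedata.normalize(form, text)


# U+F900 has a canonical decomposition, so this string is not in NFD.
decomposable = "\uf900" * 500000
# U+0338 is a combining mark with no decomposition, so this string is already in NFD.
combining = "\u0338" * 500000

for sample in (decomposable, combining):
    # is_normalized must give the same answer as the slow comparison.
    fast = unicodedata.is_normalized("NFD", sample)
    assert fast == is_normalized_slow("NFD", sample)
```

Both calls return the same answer by design; the commit only changes how often the answer is reached without building the normalized string.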

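The Hangul boundaries fixed in bpo-29456 (commit d134809cd3) follow from the composition constants in the Unicode Standard. The following is a minimal pure-Python sketch of pairwise Hangul composition, not the C code from `unicodedata`; the function name `compose_hangul_pair` is hypothetical. It assumes a Python version that includes the fix, so that `unicodedata.normalize("NFC", ...)` leaves the three boundary code points uncomposed.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT


def compose_hangul_pair(first: str, second: str) -> str:
    """Compose two code points into one Hangul syllable when the ranges allow it."""
    a, b = ord(first), ord(second)
    # Leading consonant + vowel: the vowel must lie in [0x1161, 0x1176).
    if L_BASE <= a < L_BASE + L_COUNT and V_BASE <= b < V_BASE + V_COUNT:
        l_index, v_index = a - L_BASE, b - V_BASE
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
    # LV syllable + trailing consonant: the trailer must lie in (0x11A7, 0x11C3).
    s_index = a - S_BASE
    if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
            and T_BASE < b < T_BASE + T_COUNT):
        return chr(a + (b - T_BASE))
    # Anything else, including U+1176, U+11A7 and U+11C3, is left uncomposed.
    return first + second


pairs = ["\u1100\u1161", "\u1100\u1176", "\uac00\u11a8", "\uac00\u11a7", "\uac00\u11c3"]
for pair in pairs:
    assert compose_hangul_pair(pair[0], pair[1]) == unicodedata.normalize("NFC", pair)
```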

125 Commits