Commit Graph

154 Commits

Author SHA1 Message Date
Dong-hee Na 9ef7e75434
gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388) 2023-02-06 13:58:00 +09:00
Victor Stinner 65dd745f1a
gh-99300: Use Py_NewRef() in Modules/ directory (#99473)
Replace Py_INCREF() and Py_XINCREF() with Py_NewRef() and
Py_XNewRef() in test C files of the Modules/ directory.
2022-11-14 16:21:40 +01:00
Benjamin Peterson fd1e477f53
closes gh-96734: Update to Unicode 15.0.0. (GH-96809) 2022-09-13 15:45:12 -07:00
Dong-hee Na 2bf5f64455
Remove usage of _Py_IDENTIFIER from unicodedata module. (GH-91532) 2022-04-15 10:44:05 +09:00
Eric Snow 81c72044a1
bpo-46541: Replace core use of _Py_IDENTIFIER() with statically initialized global objects. (gh-30928)
We're no longer using _Py_IDENTIFIER() (or _Py_static_string()) in any core CPython code.  It is still used in a number of non-builtin stdlib modules.

The replacement is: PyUnicodeObject (not pointer) fields under _PyRuntimeState, statically initialized as part of _PyRuntime.  A new _Py_GET_GLOBAL_IDENTIFIER() macro facilitates lookup of the fields (along with _Py_GET_GLOBAL_STRING() for non-identifier strings).

https://bugs.python.org/issue46541#msg411799 explains the rationale for this change.

The core of the change is in:

* (new) Include/internal/pycore_global_strings.h - the declarations for the global strings, along with the macros
* Include/internal/pycore_runtime_init.h - added the static initializers for the global strings
* Include/internal/pycore_global_objects.h - where the struct in pycore_global_strings.h is hooked into _PyRuntimeState
* Tools/scripts/generate_global_objects.py - added generation of the global string declarations and static initializers

I've also added a --check flag to generate_global_objects.py (along with make check-global-objects) to check for unused global strings.  That check is added to the PR CI config.

The remainder of this change updates the core code to use _Py_GET_GLOBAL_IDENTIFIER() instead of _Py_IDENTIFIER() and the related _Py*Id functions (likewise for _Py_GET_GLOBAL_STRING() instead of _Py_static_string()).  This includes adding a few functions where there wasn't already an alternative to _Py*Id(), replacing the _Py_Identifier * parameter with PyObject *.

The following are not changed (yet):

* stop using _Py_IDENTIFIER() in the stdlib modules
* (maybe) get rid of _Py_IDENTIFIER(), etc. entirely -- this may not be doable as at least one package on PyPI using this (private) API
* (maybe) intern the strings during runtime init

https://bugs.python.org/issue46541
2022-02-08 13:39:07 -07:00
Christian Heimes 03e9f5dc75
bpo-43974: Move Py_BUILD_CORE_MODULE into module code (GH-29157)
setup.py no longer defines Py_BUILD_CORE_MODULE. Instead every
module defines the macro before #include "Python.h" unless
Py_BUILD_CORE_BUILTIN is already defined.

Py_BUILD_CORE_BUILTIN is defined for every module that is built by
Modules/Setup.

The PR also simplifies Modules/Setup. Makefile and makesetup
already define Py_BUILD_CORE_BUILTIN and include Modules/internal
for us.

Signed-off-by: Christian Heimes <christian@python.org>
2021-10-22 15:36:28 +02:00
Benjamin Peterson 024fda47d4
closes bpo-45190: Update Unicode data to version 14.0.0. (GH-28336) 2021-09-14 11:00:38 -07:00
Dong-hee Na 9abd07e596
bpo-44987: Speed up unicode normalization of ASCII strings (GH-28283) 2021-09-11 18:04:38 +03:00
Srinivas Reddy Thatiparthy (శ్రీనివాస్ రెడ్డి తాటిపర్తి) 7b21108445
Remove irrelevant comment which was added in 2a70a3a (GH-27044) 2021-07-08 21:57:25 -07:00
Erlend Egeberg Aasland 00710e6346
bpo-43908: Make heap types converted during 3.10 alpha immutable (GH-26351)
* Make functools types immutable

* Multibyte codec types are now immutable

* pyexpat.xmlparser is now immutable

* array.arrayiterator is now immutable

* _thread types are now immutable

* _csv types are now immutable

* _queue.SimpleQueue is now immutable

* mmap.mmap is now immutable

* unicodedata.UCD is now immutable

* sqlite3 types are now immutable

* _lsprof.Profiler is now immutable

* _overlapped.Overlapped is now immutable

* _operator types are now immutable

* winapi__overlapped.Overlapped is now immutable

* _lzma types are now immutable

* _bz2 types are now immutable

* _dbm.dbm and _gdbm.gdbm are now immutable
2021-06-17 11:06:09 +01:00
Erlend Egeberg Aasland 59af59c2df
bpo-42972: Fully support GC for pyexpat, unicodedata, and dbm/gdbm heap types (GH-26376)
* bpo-42972: pyexpat
* bpo-42972: unicodedata
* bpo-42972: dbm/gdbm
2021-05-27 17:29:00 +10:00
Inada Naoki 4d4be47705
Do not use Py_ssize_clean_t (GH-25940) 2021-05-08 10:17:37 +09:00
Erlend Egeberg Aasland 9746cda705
bpo-43916: Apply Py_TPFLAGS_DISALLOW_INSTANTIATION to selected types (GH-25748)
Apply Py_TPFLAGS_DISALLOW_INSTANTIATION to the following types:

* _dbm.dbm
* _gdbm.gdbm
* _multibytecodec.MultibyteCodec
* _sre..SRE_Scanner
* _thread._localdummy
* _thread.lock
* _winapi.Overlapped
* array.arrayiterator
* functools.KeyWrapper
* functools._lru_list_elem
* pyexpat.xmlparser
* re.Match
* re.Pattern
* unicodedata.UCD
* zlib.Compress
* zlib.Decompress
2021-04-30 16:04:57 +02:00
Erlend Egeberg Aasland 61d26394f9
bpo-41798: Allocate unicodedata CAPI on the heap (GH-24128) 2021-01-20 12:03:53 +01:00
Victor Stinner 32bd68c839
bpo-42519: Replace PyObject_MALLOC() with PyObject_Malloc() (GH-23587)
No longer use deprecated aliases to functions:

* Replace PyObject_MALLOC() with PyObject_Malloc()
* Replace PyObject_REALLOC() with PyObject_Realloc()
* Replace PyObject_FREE() with PyObject_Free()
* Replace PyObject_Del() with PyObject_Free()
* Replace PyObject_DEL() with PyObject_Free()
2020-12-01 10:37:39 +01:00
Victor Stinner 84f7382215
bpo-42157: Rename unicodedata.ucnhash_CAPI (GH-22994)
Removed the unicodedata.ucnhash_CAPI attribute which was an internal
PyCapsule object. The related private _PyUnicode_Name_CAPI structure
was moved to the internal C API.

Rename unicodedata.ucnhash_CAPI as unicodedata._ucnhash_CAPI.
2020-10-27 04:36:22 +01:00
Victor Stinner c8c4200b65
bpo-42157: Convert unicodedata.UCD to heap type (GH-22991)
Convert the unicodedata extension module to the multiphase
initialization API (PEP 489) and convert the unicodedata.UCD static
type to a heap type.

Co-Authored-By: Mohamed Koubaa <koubaa.m@gmail.com>
2020-10-26 23:19:22 +01:00
Victor Stinner 920cb647ba
bpo-42157: unicodedata avoids references to UCD_Type (GH-22990)
* UCD_Check() uses PyModule_Check()
* Simplify the internal _PyUnicode_Name_CAPI structure:

  * Remove size and state members
  * Remove state and self parameters of getcode() and getname()
    functions

* Remove global_module_state
2020-10-26 19:19:36 +01:00
Victor Stinner 47e1afd2a1
bpo-1635741: _PyUnicode_Name_CAPI moves to internal C API (GH-22713)
The private _PyUnicode_Name_CAPI structure of the PyCapsule API
unicodedata.ucnhash_CAPI moves to the internal C API. Moreover, the
structure gets a new state member which must be passed to the
getcode() and getname() functions.

* Move Include/ucnhash.h to Include/internal/pycore_ucnhash.h
* unicodedata module is now built with Py_BUILD_CORE_MODULE.
* unicodedata: move hashAPI variable into unicodedata_module_state.
2020-10-26 16:43:47 +01:00
Victor Stinner e6b8c5263a
bpo-1635741: Add a global module state to unicodedata (GH-22712)
Prepare unicodedata to add a state per module: start with a global
"module" state, pass it to subfunctions which access &UCD_Type. This
change also prepares the conversion of the UCD_Type static type to a
heap type.
2020-10-15 16:22:19 +02:00
Mohamed Koubaa ddc0dd001a
bpo-1635741, unicodedata: add ucd_type parameter to UCD_Check() macro (GH-22328)
Co-authored-by: Victor Stinner <vstinner@python.org>
2020-09-23 12:38:16 +02:00
Victor Stinner 4a21e57fe5
bpo-40268: Remove unused structmember.h includes (GH-19530)
If only offsetof() is needed: include stddef.h instead.

When structmember.h is used, add a comment explaining that
PyMemberDef is used.
2020-04-15 02:35:41 +02:00
Serhiy Storchaka cd8295ff75
bpo-39943: Add the const qualifier to pointers on non-mutable PyUnicode data. (GH-19345) 2020-04-11 10:48:40 +03:00
Andy Lester 982307b9cc
bpo-39943: Remove unused self from find_nfc_index() (GH-18973) 2020-03-17 17:38:12 +01:00
Benjamin Peterson 051b9d08d1
closes bpo-39926: Update Unicode to 13.0.0. (GH-18910) 2020-03-10 20:41:34 -07:00
Dong-hee Na 1b55b65638
bpo-39573: Clean up modules and headers to use Py_IS_TYPE() function (GH-18521) 2020-02-17 11:09:15 +01:00
Victor Stinner d2ec81a8c9
bpo-39573: Add Py_SET_TYPE() function (GH-18394)
Add Py_SET_TYPE() function to set the type of an object.
2020-02-07 09:17:07 +01:00
Jordon Xu 2ec7010206 bpo-37752: Delete redundant Py_CHARMASK in normalizestring() (GH-15095) 2019-09-10 17:04:08 +01:00
Greg Price 7669cb8b21 bpo-38043: Use `bool` for boolean flags on is_normalized_quickcheck. (GH-15711) 2019-09-09 02:16:31 -07:00
Greg Price 2f09413947 closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
2019-09-03 19:45:44 -07:00
Jeroen Demeyer 530f506ac9 bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464)
Automatically replace
tp_print -> tp_vectorcall_offset
tp_compare -> tp_as_async
tp_reserved -> tp_as_async
2019-05-30 19:13:39 -07:00
Inada Naoki 6fec905de5
bpo-36642: make unicodedata const (GH-12855) 2019-04-17 08:40:34 +09:00
Max Bélanger 2810dd7be9 closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) 2018-11-04 15:58:24 -08:00
Wonsup Yoon d134809cd3 bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
2018-06-15 20:03:14 +08:00
Benjamin Peterson 7c69c1c0fb
update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)
Also, standardize indentation of generated tables.
2018-06-06 20:14:28 -07:00
luzpaz a5293b4ff2 Fix miscellaneous typos (#4275) 2017-11-05 15:37:50 +02:00
Benjamin Peterson 279a96206f bpo-30736: upgrade to Unicode 10.0 (#2344)
Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
2017-06-22 22:31:08 -07:00
Serhiy Storchaka f8d7d41507 Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*. 2016-10-23 15:12:25 +03:00
Christian Heimes 0202c347bc Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:21:20 +02:00
Christian Heimes 2f366cab48 Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:20:27 +02:00
Benjamin Peterson 6775231597 Unicode 9.0.0
Not completely mechanical since support for East Asian Width changes—emoji
codepoints became Wide—had to be added to unicodedata.
2016-09-14 23:53:47 -07:00
Christian Heimes 8cee10386e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:54 +02:00
Christian Heimes 7ce201322e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:46 +02:00
Serhiy Storchaka 2d06e84455 Issue #25923: Added the const qualifier to static constant arrays. 2015-12-25 19:53:18 +02:00
Benjamin Peterson 4801383c29 upgrade to Unicode 8.0.0 2015-06-27 15:45:56 -05:00
Larry Hastings 38337d1e15 Issue #24000: Improved Argument Clinic's mapping of converters to legacy
"format units".  Updated the documentation to match.
2015-05-07 23:30:09 -07:00
Larry Hastings dbfdc380df Issue #24001: Argument Clinic converters now use accept={type}
instead of types={'type'} to specify the types the converter accepts.
2015-05-04 06:59:46 -07:00
Serhiy Storchaka 6359641bcd Issue #20181: Converted the unicodedata module to Argument Clinic. 2015-04-17 21:18:49 +03:00
Larry Hastings 89964c48d1 Issue #23944: Argument Clinic now wraps long impl prototypes at column 78. 2015-04-14 18:07:59 -04:00
Serhiy Storchaka 1009bf18b3 Issue #23501: Argumen Clinic now generates code into separate files by default. 2015-04-03 23:53:51 +03:00