The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in UAX #15 of the Unicode standard.
However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.
Implement the standard's algorithm. This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
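For reference, here is a rough Python sketch of the standard's
per-character algorithm; `qc_property()` is a hypothetical stand-in
for the NFx_QC Unicode-property lookup that the real C code performs
against the Unicode database:
```
from unicodedata import combining

YES, MAYBE, NO = "yes", "maybe", "no"

def quick_check(form, s):
    # Walk the string as in UAX #15 "Detecting Normalization Forms".
    result = YES
    last_ccc = 0
    for ch in s:
        ccc = combining(ch)            # canonical combining class
        if last_ccc > ccc and ccc != 0:
            return NO                  # combining marks out of canonical order
        qc = qc_property(form, ch)     # hypothetical NFx_QC property lookup
        if qc == NO:
            return NO
        if qc == MAYBE:
            result = MAYBE
        last_ccc = ccc
    return result
```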
In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.
With this, that case is actually faster than in master!
$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop
$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
* Use the 'p' format unit instead of manually calling PyObject_IsTrue().
* Pass boolean values instead of 0/1 integers to functions that expect a boolean.
* Convert some arguments to boolean only once.
Fix a ctypes regression of Python 3.8. When a ctypes.Structure is
passed by copy to a function, ctypes internals created a temporary
object which had the side effect of calling the structure finalizer
(__del__) twice. The Python semantics requires a finalizer to be
called exactly once. Fix ctypes internals to no longer call the
finalizer twice.
Create a new internal StructParam_Type which is only used by
_ctypes_callproc(): it calls PyMem_Free(ptr) when the argument is
deallocated (Py_DECREF()). StructUnionType_paramfunc() creates such an
object.
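A minimal sketch of the kind of code affected; this assumes (untested
here) that calling a callback which takes the structure by value goes
through the same _ctypes_callproc() path:
```
import ctypes

class Param(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int)]
    def __del__(self):
        print("finalizer ran")   # ran twice per call under the 3.8.0 regression

CALLBACK = ctypes.CFUNCTYPE(None, Param)   # C-callable taking Param by value

@CALLBACK
def cb(param):
    pass

cb(Param(1))   # passing by copy created the offending temporary object
```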
bpo-37834: Normalise handling of reparse points on Windows
* ntpath.realpath() and nt.stat() will traverse all supported reparse points (previously was mixed)
* nt.lstat() will let the OS traverse reparse points that are not name surrogates (previously would not traverse any reparse point)
* nt.[l]stat() will only set S_IFLNK for symlinks (previous behaviour)
* nt.readlink() will read destinations for symlinks and junction points only
bpo-1311: os.path.exists('nul') now returns True on Windows
* nt.stat('nul').st_mode is now S_IFCHR (previously was an error)
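On Windows, the bpo-1311 change can be checked like this:
```
>>> import os, stat
>>> os.path.exists('nul')
True
>>> stat.S_ISCHR(os.stat('nul').st_mode)
True
```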
The faulthandler module no longer allocates its alternative stack at
Python startup. Now the stack is only allocated at the first
faulthandler usage.
faulthandler no longer ignores memory allocation failure when
allocating the stack. sigaltstack() failure now raises an OSError
exception, rather than being ignored.
The alternative stack is no longer used if sigaction() is not
available. In practice, sigaltstack() should only be available when
sigaction() is available, so this change should have no effect.
faulthandler.dump_traceback_later() internal locks are now only
allocated at the first dump_traceback_later() call, rather than
always being allocated at Python startup.
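The public API is unchanged; the allocations simply happen lazily, on
first use:
```
import faulthandler

faulthandler.enable()  # first faulthandler use: the alternate signal
                       # stack is allocated here, not at startup

# first call allocates the module's internal locks
faulthandler.dump_traceback_later(5.0, repeat=False, exit=False)
faulthandler.cancel_dump_traceback_later()
```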
There are plenty of legitimate scripts in the tree that begin with a
`#!`, but also a few that seem to be marked executable by mistake.
Found them with this command -- it gets executable files known to Git,
filters to the ones that don't start with a `#!`, and then unmarks
them as executable:
$ git ls-files --stage \
    | perl -lane 'print $F[3] if (!/^100644/)' \
    | while read f; do
        head -c2 "$f" | grep -qxF '#!' \
          || chmod a-x "$f"; \
      done
Looking at the list by hand confirms that we didn't sweep up any
files that should have the executable bit after all. In particular
* The `.psd` files are images from Photoshop.
* The `.bat` files sure look like things that can be run.
But we have lots of other `.bat` files, and they don't have
this bit set, so it must not be needed for them.
Automerge-Triggered-By: @benjaminp
X509_AUX is an odd, not widely used OpenSSL extension to the X509 file format. Since this function doesn't actually use any of the extra metadata that it parses, just use the standard API.
Automerge-Triggered-By: @tiran
faulthandler now allocates a dedicated stack of SIGSTKSZ*2 bytes,
instead of just SIGSTKSZ bytes. Calling the previous signal handler
in faulthandler signal handler uses more than SIGSTKSZ bytes of stack
memory on some platforms.
FreeBSD implementation of poll(2) restricts the timeout argument to be
either zero, or positive, or equal to INFTIM (-1).
Unless otherwise overridden, socket timeout defaults to -1. This value
is then converted to milliseconds (-1000) and used as argument to the
poll syscall. poll returns EINVAL (22), and the connection fails.
This bug was discovered during the EINTR handling testing, and the
reproduction code can be found in
https://bugs.python.org/issue23618 (see connect_eintr.py,
attached). On GNU/Linux, the example runs as expected.
The change is trivial: if the supplied timeout value is negative,
clamp it to -1.
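A Python-level sketch of the clamping (the real fix lives in the C
socket code):
```
def poll_timeout_ms(timeout_seconds):
    # FreeBSD's poll(2) rejects timeouts below INFTIM (-1), so clamp
    # any negative millisecond value (e.g. -1000) up to -1.
    ms = int(timeout_seconds * 1000)
    return max(ms, -1)
```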
This fixes an inconsistency between the Python and C implementations of
the datetime module. The pure Python version of the code was not
accepting offsets greater than 23:59 but less than 24:00. This is an
accidental legacy of the original implementation, which was put in place
before tzinfo allowed sub-minute time zone offsets.
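For example, the pure Python implementation now accepts this, matching
the C one:
```
from datetime import timedelta, timezone

# greater than 23:59 but less than 24:00: previously rejected by the
# pure Python implementation only
tz = timezone(timedelta(hours=23, minutes=59, seconds=59))
```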
GH-14878
Use a more tightly scoped temporary variable to help register
allocation. ~1% speedup for large strings.
Use PyDict_SetItemDefault() for memoizing keys.
At most 4% speedup when the cache hit ratio is low.
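The Python-level equivalent of that memoization, for illustration:
```
# repeated object keys end up sharing a single str object; setdefault()
# does the lookup and the (possible) insertion in one step, which is
# what PyDict_SetItemDefault() provides at the C level
memo = {}

def memoize_key(key):
    return memo.setdefault(key, key)

assert memoize_key("name") is memoize_key("name")
```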
There was a discrepancy between the Python and C implementations.
Add singletons ALWAYS_EQ, LARGEST and SMALLEST in test.support
to test mixed type comparison.
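Example use of the new singletons:
```
from test.support import ALWAYS_EQ, LARGEST, SMALLEST

assert ALWAYS_EQ == 1 and ALWAYS_EQ == "anything"   # equal to everything
assert LARGEST > 10**100                            # larger than everything
assert SMALLEST < -10**100                          # smaller than everything
```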
Support for RFCOMM, L2CAP, HCI, SCO is based on the BTPROTO_* macros
being defined. Winsock only supports RFCOMM, even though it has a
BTHPROTO_L2CAP macro. L2CAP support would build on Windows, but not
necessarily work.
This also adds some basic unittests for constants (all of which existed
prior to this commit, just not on Windows) and creating sockets.
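For example, creating an RFCOMM socket now works the same way on
Windows as on other platforms:
```
import socket

s = socket.socket(socket.AF_BLUETOOTH, socket.SOCK_STREAM,
                  socket.BTPROTO_RFCOMM)
s.close()
```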
pair: Nate Duarte <slacknate@gmail.com>
gc used several PySys_WriteStderr() calls to write its stats. This
caused the stats to get mixed up when stderr is shared by multiple
processes, like this:
gc: collecting generation 2...
gc: objects in each generation: 0 0gc: collecting generation 2...
gc: objects in each generation: 0 0 126077 126077
gc: objects in permanent generation: 0
gc: objects in permanent generation: 0
gc: done, 112575 unreachable, 0 uncollectablegc: done, 112575 unreachable, 0 uncollectable, 0.2223s elapsed
, 0.2344s elapsed
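The fix is to format each stats message completely and write it with a
single call; a sketch of the idea:
```
import sys

def write_stats(parts):
    # one write() of the fully formatted message instead of several
    # small ones, so concurrent processes cannot interleave mid-line
    sys.stderr.write("".join(parts))
```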
Expose the CAN_BCM SocketCAN constants used in the bcm_msg_head struct
flags (provided by <linux/can/bcm.h>) in the socket module.
This adds the following constants with a CAN_BCM prefix:
* SETTIMER
* STARTTIMER
* TX_COUNTEVT
* TX_ANNOUNCE
* TX_CP_CAN_ID
* RX_FILTER_ID
* RX_CHECK_DLC
* RX_NO_AUTOTIMER
* RX_ANNOUNCE_RESUME
* TX_RESET_MULTI_IDX
* RX_RTR_FRAME
* CAN_FD_FRAME
The CAN_FD_FRAME flag was introduced in the 4.8 kernel, while the
other flags have been present since the SocketCAN drivers were
mainlined in 2.6.25. As such, it is probably unnecessary to guard
against these constants being missing.
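A sketch of how the flags are meant to be used when filling in the
bcm_msg_head "flags" field; the "vcan0" interface is an assumption of
this example:
```
import socket

s = socket.socket(socket.AF_CAN, socket.SOCK_DGRAM, socket.CAN_BCM)
s.connect(("vcan0",))   # assumes a configured (virtual) CAN interface

# combine flags for a cyclic-transmission setup message
flags = (socket.CAN_BCM_SETTIMER
         | socket.CAN_BCM_STARTTIMER
         | socket.CAN_BCM_TX_CP_CAN_ID)
```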
When scanning the string, most characters are valid, so checking for
invalid characters first means the value of strict never needs to be
checked for valid strings; it only needs to be checked on invalid
characters during non-strict parsing of invalid strings.
This provides a measurable reduction in per-character
processing time (~11% in the pre-merge patch testing).
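A sketch of the reordering; the names here are illustrative, not the
actual C scanner's:
```
def check_char(ch, strict):
    # common case first: an ordinary valid character, strict never consulted
    if ch >= ' ' and ch != '"' and ch != '\\':
        return
    # rare case: only now does strict matter, for control characters
    if ch < ' ' and strict:
        raise ValueError(f"Invalid control character {ch!r}")
```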
Deprecate the parser module: add a DeprecationWarning triggered on import and a warning block in the documentation.
https://bugs.python.org/issue37268
Automerge-Triggered-By: @pablogsal
* bpo-37399: Correctly attach tail text to the last element/comment/pi, even when comments or pis are discarded.
Also fixes the insertion of PIs when "insert_pis=True" is configured for a TreeBuilder.
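For example (behavior after the fix; before it, the tail text was
dropped when the discarded comment intervened), assuming a TreeBuilder
that receives but discards comments:
```
import xml.etree.ElementTree as ET

builder = ET.TreeBuilder(insert_comments=False, insert_pis=False)
parser = ET.XMLParser(target=builder)
root = ET.fromstring("<root><a/><!-- note -->tail text</root>", parser=parser)
assert root[0].tail == "tail text"   # attached to <a/>, not lost
```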
The `arraymodule`'s `b_getitem` function returns a `PyLong` converted
from `arrayobject`'s array by dereferencing a pointer to `char`.
When the `char` type is signed, the `if (x >= 128) x -= 256;` check is
redundant: a `signed char` holds values in `[-128, 127]`, so `x` can
never be greater than or equal to 128.
The check was indeed needed for compilers where a plain `char` is
unsigned, which puts `x` in the `[0, 255]` range.
However, the check can be removed entirely by casting `ob_item` to a
signed char pointer (`signed char*`) instead of `char*`, which is what
this PR/commit does.
Stop using "static PyConfig", PyConfig must now always use
dynamically allocated strings: use PyConfig_SetString(),
PyConfig_SetArgv() and PyConfig_Clear().
SSLContext.post_handshake_auth = True no longer sets
SSL_VERIFY_POST_HANDSHAKE verify flag for client connections. Although the
option is documented as ignored for clients, OpenSSL implicitly enables cert
chain validation when the flag is set.
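The Python-level usage is unchanged; only the underlying OpenSSL flag
handling differs:
```
import ssl

ctx = ssl.create_default_context()
ctx.post_handshake_auth = True   # on the client side this no longer
                                 # implicitly sets SSL_VERIFY_POST_HANDSHAKE
```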
Signed-off-by: Christian Heimes <christian@python.org>
https://bugs.python.org/issue37428
The os.getcwdb() function now uses the UTF-8 encoding on Windows,
rather than the ANSI code page: see PEP 529 for the rationale. The
function is no longer deprecated on Windows.
os.getcwd() and os.getcwdb() now detect integer overflow on memory
allocations. On Unix, these functions properly report MemoryError on
memory allocation failure.
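For example, on Windows:
```
import os

# now UTF-8-encoded bytes (PEP 529) rather than the ANSI code page;
# the directory name here is a made-up illustration
os.getcwdb()   # e.g. b'C:\\Users\\r\xc3\xa9sum\xc3\xa9'
```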
The sqlite3 module now raises TypeError, rather than ValueError, if
the operation argument type is not str: execute(), executemany() and
calling a connection.
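For example:
```
import sqlite3

con = sqlite3.connect(":memory:")
try:
    con.execute(b"SELECT 1")   # operation is bytes, not str
except TypeError:              # previously ValueError
    pass
```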
In development mode and in debug build, encoding and errors arguments
are now checked on string encoding and decoding operations. Examples:
open(), str.encode() and bytes.decode().
By default, for best performance, the errors argument is only
checked at the first encoding/decoding error, and the encoding
argument is sometimes ignored for empty strings.
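For example, under "python3 -X dev" (or a debug build):
```
# raises LookupError immediately, even though encoding pure-ASCII text
# with UTF-8 can never actually invoke the error handler:
"abc".encode("utf-8", "nonexistent-error-handler")
```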
Python now gets the absolute path of the script filename specified on
the command line (ex: "python3 script.py"): the __file__ attribute of
the __main__ module, sys.argv[0] and sys.path[0] become an absolute
path, rather than a relative path.
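For example, with /home/user/script.py run as "python3 script.py" from
/home/user (the paths here are illustrative):
```
import sys

print(__file__)     # /home/user/script.py, no longer just script.py
print(sys.argv[0])  # /home/user/script.py
print(sys.path[0])  # /home/user
```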
* Add _Py_isabs() and _Py_abspath() functions.
* _PyConfig_Read() now tries to get the absolute path of
run_filename, but keeps the relative path if _Py_abspath() fails.
* Reimplement os._getfullpathname() using _Py_abspath().
* Use _Py_isabs() in getpath.c.
At the moment you can definitely use UDPLITE sockets on Linux systems, but it would be good if this support were formalized so that it can easily be detected at runtime.
Currently, making and using a UDPLITE socket requires something like the following code:
```
>>> import socket
>>> a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, 136)  # 136 = IPPROTO_UDPLITE
>>> b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, 136)
>>> a.bind(('localhost', 44444))
>>> b.sendto(b'test'*256, ('localhost', 44444))
>>> b.setsockopt(136, 10, 16)  # 10 = UDPLITE_SEND_CSCOV (checksum coverage)
>>> b.sendto(b'test'*256, ('localhost', 44444))
>>> b.setsockopt(136, 10, 32)
>>> b.sendto(b'test'*256, ('localhost', 44444))
>>> b.setsockopt(136, 10, 64)
>>> b.sendto(b'test'*256, ('localhost', 44444))
```
If you look at this through Wireshark, you can see that the packets are different in that the checksums and checksum coverages change.
With the pull request that I am submitting momentarily, you could write the following instead:
```
>>> import socket
>>> a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDPLITE)
>>> b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDPLITE)
>>> a.bind(('localhost', 44444))
>>> b.sendto(b'test'*256, ('localhost', 44444))
>>> b.set_send_checksum_coverage(16)
>>> b.sendto(b'test'*256, ('localhost', 44444))
>>> b.set_send_checksum_coverage(32)
>>> b.sendto(b'test'*256, ('localhost', 44444))
>>> b.set_send_checksum_coverage(64)
>>> b.sendto(b'test'*256, ('localhost', 44444))
```
One can also detect support for UDPLITE just by checking
```
>>> hasattr(socket, 'IPPROTO_UDPLITE')
```
https://bugs.python.org/issue37345
* Add Include/cpython/import.h and Include/internal/pycore_import.h
header files.
* Move _PyImport_ReInitLock() to the internal C API. Don't export the
symbol anymore.
Add a new public PyObject_CallNoArgs() function to the C API: call a
callable Python object without any arguments.
It is the most efficient way to call a callback without any argument.
On x86-64, for example, PyObject_CallFunctionObjArgs(func, NULL)
allocates 960 bytes on the stack per call, whereas
PyObject_CallNoArgs(func) only allocates 624 bytes per call.
It is excluded from the stable ABI of Python 3.8.
Replace private _PyObject_CallNoArg() with public
PyObject_CallNoArgs() in C extensions: _asyncio, _datetime,
_elementtree, _pickle, _tkinter and readline.
In a subinterpreter, spawning a daemon thread now raises an
exception. Daemon threads were never supported in subinterpreters.
Previously, the subinterpreter finalization crashed with a Python
fatal error if a daemon thread was still running.
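In a subinterpreter, code like this now fails cleanly:
```
import threading

t = threading.Thread(target=lambda: None, daemon=True)
t.start()   # raises RuntimeError when executed in a subinterpreter;
            # unchanged in the main interpreter
```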
* Add _thread._is_main_interpreter()
* threading.Thread.start() now raises RuntimeError if the thread is a
daemon thread and the method is called from a subinterpreter.
* The _thread module now uses Argument Clinic for the new function.
* Use textwrap.dedent() in test_threading.SubinterpThreadingTests
Add a new _PyCompilerFlags_INIT macro to initialize PyCompilerFlags
variables, rather than initializing cf_flags and cf_feature_version
explicitly in each variable.
Calling setlocale(LC_CTYPE, "") on a system where GetACP() returns CP_UTF8 results in empty strings in _tzname[].
This causes time.tzname to be an empty string.
I have reported the bug to the UCRT team and will follow up, but it will take some time to get a fix into production.
In the meantime, one possible workaround is to temporarily change the locale by calling setlocale(LC_CTYPE, "C") before calling _tzset, and to restore the previous locale afterwards, when GetACP() returns CP_UTF8 or CP_UTF7.
@zooba
https://bugs.python.org/issue36779
* bpo-29505: Enable fuzz testing of the json module; enforce a size limit on int(x) fuzz and json input size to avoid timeouts.
Contributed by Ammar Askar for Google.