In a few places we switch to another interpreter without knowing if it has a thread state associated with the current thread. For the main interpreter there wasn't much of a problem, but for subinterpreters we were *mostly* okay re-using the tstate created with the interpreter (located via PyInterpreterState_ThreadHead()). There was a good chance that tstate wasn't actually in use by another thread.
However, there is no guarantee of that. Furthermore, re-using an already-used tstate is currently fragile. To address this, we now create a new thread state in each of those places and use it.
One consequence of this change is that PyInterpreterState_ThreadHead() may now return NULL (though that won't happen for the main interpreter).
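The pattern, roughly, is the following (a sketch using public C-API calls; error handling and how the target interpreter is obtained are assumed, and this is not the exact code from the change):
```
#include <Python.h>

/* Sketch of the "fresh tstate per switch" idea: instead of borrowing an
 * existing thread state from the target interpreter, create a new one,
 * swap to it for the duration of the work, then tear it down. */
static void
run_in_interp(PyInterpreterState *interp)
{
    PyThreadState *tstate = PyThreadState_New(interp);  /* fresh, never used */
    PyThreadState *saved = PyThreadState_Swap(tstate);

    /* ... do the work that must run under the target interpreter ... */

    PyThreadState_Clear(tstate);        /* done with the temporary tstate */
    PyThreadState_Swap(saved);          /* restore the previous tstate */
    PyThreadState_Delete(tstate);
}
```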
The existence of background threads running on a subinterpreter was preventing interpreters from getting properly destroyed, as well as impacting the ability to run the interpreter again. It also affected how we wait for non-daemon threads to finish.
To address this, we add PyInterpreterState.threads.main, along with some internal C-API functions.
This change makes sure sys.path[0] is set properly for subinterpreters. Before, it wasn't getting set at all. This PR does not address the broader concerns from gh-109853.
* Remove unused <locale.h> includes.
* Remove unused <fcntl.h> include in traceback.h.
* Remove redundant <assert.h> and <stddef.h> includes. They are already
included by "Python.h".
* Remove <object.h> include in faulthandler.c. Python.h already includes it.
* Add missing <stdbool.h> in pycore_pythread.h if HAVE_PTHREAD_STUBS
is defined.
* Also fix warnings in pthread_stubs.h: don't redefine macros, such as
__NEED_pthread_t, if they are already defined.
* pycore_pythread.h is now the central place to make sure that
_POSIX_THREADS and _POSIX_SEMAPHORES macros are defined if
available.
* Make sure that pycore_pythread.h is included when _POSIX_THREADS
and _POSIX_SEMAPHORES macros are tested.
* PY_TIMEOUT_MAX is now defined as a constant rather than a macro, since
its value depends on _POSIX_THREADS.
* Prevent integer overflow in the preprocessor when computing
PY_TIMEOUT_MAX_VALUE on Windows: replace "0xFFFFFFFELL * 1000 < LLONG_MAX"
with "0xFFFFFFFELL < LLONG_MAX / 1000" (see the sketch after this list).
* Document the change and give hints on how to fix affected code.
* Add an exception for PY_TIMEOUT_MAX name to smelly.py
* Add PY_TIMEOUT_MAX to the stable ABI
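As a rough illustration of the overflow-safe comparison (the macro name below is hypothetical, not the one used in the header):
```
#include <limits.h>

/* Hypothetical illustration only: check the bound with a division so the
 * preprocessor never evaluates a multiplication that could overflow
 * long long. */
#if 0xFFFFFFFELL < LLONG_MAX / 1000
#  define EXAMPLE_TIMEOUT_MAX (0xFFFFFFFELL * 1000)  /* multiplication is safe */
#else
#  define EXAMPLE_TIMEOUT_MAX LLONG_MAX              /* clamp instead of overflowing */
#endif
```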
These are the most popular specializations of `LOAD_ATTR` and `STORE_ATTR`
that weren't already viable uops:
* Split LOAD_ATTR_METHOD_WITH_VALUES
* Split LOAD_ATTR_METHOD_NO_DICT
* Split LOAD_ATTR_SLOT
* Split STORE_ATTR_SLOT
* Split STORE_ATTR_INSTANCE_VALUE
Also:
* Add a `-v` flag to the code generator, which prints a list of non-viable uops
(easter egg: it can also print execution counts -- see the source)
* Double _Py_UOP_MAX_TRACE_LENGTH to 128
I had dropped one of the DEOPT_IF() calls! :-(
PyImport_GetImporter() now sets RuntimeError if it fails to get sys.path_hooks
or sys.path_importer_cache, or if they are not a list and a dict respectively.
Previously, in obscure cases, it could return NULL without setting an error, or
it could crash or raise SystemError if these attributes had the wrong type.
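A minimal caller-side sketch of relying on the new behavior (the helper function here is ours, not part of the change):
```
#include <Python.h>

/* Sketch: after this change, a NULL result from PyImport_GetImporter() is
 * always accompanied by an exception (e.g. RuntimeError), so the caller can
 * simply report or propagate the error. */
static int
show_importer(PyObject *path_entry)
{
    PyObject *importer = PyImport_GetImporter(path_entry);
    if (importer == NULL) {
        PyErr_Print();              /* an error is now guaranteed to be set */
        return -1;
    }
    PyObject_Print(importer, stdout, 0);
    Py_DECREF(importer);
    return 0;
}
```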
PyMutex is a one-byte lock with fast, inlineable lock and unlock functions for the common uncontended case. The design is based on WebKit's WTF::Lock.
PyMutex is built using the _PyParkingLot APIs, which provide a cross-platform futex-like API (based on WebKit's WTF::ParkingLot). This internal API will be used to build other synchronization primitives needed to implement PEP 703, such as one-time initialization and events.
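A rough usage sketch (this is an internal-only API; the header and function names below are assumptions based on the description, not a stable contract):
```
#include "Python.h"
#include "pycore_lock.h"            /* assumed internal header for PyMutex */

/* Sketch: a PyMutex is a single zero-initialized byte, so it needs no
 * allocation or explicit initialization call. Lock/unlock names assumed. */
static PyMutex counter_lock;        /* zeroed == unlocked */
static Py_ssize_t counter;

static void
increment(void)
{
    PyMutex_Lock(&counter_lock);
    counter++;
    PyMutex_Unlock(&counter_lock);
}
```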
This also includes tests and a mini benchmark in Tools/lockbench/lockbench.py to compare with the existing PyThread_type_lock.
Uncontended acquisition + release:
* Linux (x86-64): PyMutex: 11 ns, PyThread_type_lock: 44 ns
* macOS (arm64): PyMutex: 13 ns, PyThread_type_lock: 18 ns
* Windows (x86-64): PyMutex: 13 ns, PyThread_type_lock: 38 ns
PR Overview:
The primary purpose of this PR is to implement PyMutex, but there are a number of support pieces (described below).
* PyMutex: A 1-byte lock that doesn't require memory allocation to initialize and is generally faster than the existing PyThread_type_lock. The API is internal only for now.
* _PyParkingLot: A futex-like API based on the API of the same name in WebKit. Used to implement PyMutex.
* _PyRawMutex: A word-sized lock used to implement _PyParkingLot.
* PyEvent: A one-time event. This was used a lot in the "nogil" fork and is useful for testing the PyMutex implementation, so I've included it as part of the PR.
* pycore_llist.h: Defines common operations on doubly-linked lists. Not strictly necessary (the list operations could be done manually), but they come up frequently in the "nogil" fork. (Similar to https://man.freebsd.org/cgi/man.cgi?queue; see the sketch below.)
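A simplified sketch of the intrusive list idea (field and function names are illustrative and may not match pycore_llist.h exactly):
```
/* Illustrative intrusive doubly-linked list in the style of queue(3): the
 * links live inside the nodes, and an empty list is a sentinel whose
 * next/prev point to itself. */
struct node { struct node *next, *prev; };

static void
list_init(struct node *head)
{
    head->next = head->prev = head;
}

static void
list_insert_tail(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void
list_remove(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}
```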
---------
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
There is a WIP proposal to enable WebAssembly stack switching, which has been
implemented in v8:
https://github.com/WebAssembly/js-promise-integration
It is not possible to switch stacks that contain JS frames, so the Emscripten JS
trampolines that allow calling functions with the wrong number of arguments
don't work in this case. However, the js-promise-integration proposal requires
the [type reflection for Wasm/JS API](https://github.com/WebAssembly/js-types)
proposal, which allows us to actually count the number of arguments a function
expects.
For better compatibility with stack switching, this PR checks whether type
reflection is available; if so, we use a switch block to pick the appropriate
signature. If type reflection is unavailable, we keep using the current EM_JS
trampoline.
We cache the function argument counts, since not caching them noticeably hurt
performance.
Co-authored-by: T. Wouters <thomas@python.org>
Co-authored-by: Brett Cannon <brett@python.org>
I must have overlooked this when refactoring the code generator.
The Tier 1 interpreter contained a few silly things like
```
goto resume_frame;
STACK_SHRINK(1);
```
(and other variations, some where the unconditional `goto` was hidden in a macro), i.e. unreachable code left behind after an unconditional jump.
* Rename SAVE_IP to _SET_IP
* Rename EXIT_TRACE to _EXIT_TRACE
* Rename SAVE_CURRENT_IP to _SAVE_CURRENT_IP
* Rename INSERT to _INSERT (This is for Ken Jin's abstract interpreter)
* Rename IS_NONE to _IS_NONE
* Rename JUMP_TO_TOP to _JUMP_TO_TOP
This adds a 16-bit inline cache entry to the conditional branch instructions POP_JUMP_IF_{FALSE,TRUE,NONE,NOT_NONE} and their instrumented variants, which is used to keep track of the branch direction.
Each time we encounter one of these instructions, we shift the cache entry left by one bit and set the bottom bit to whether we jumped.
Then, when it's time to translate such a branch into Tier 2 uops, we use the bit count of the cache entry to decide whether to continue translating the "didn't jump" branch or the "jumped" branch.
The counter is initialized to a pattern of alternating ones and zeros to avoid bias.
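Roughly, in plain C (illustrative names; the real code lives in the instruction's cache entry):
```
#include <stdint.h>

/* Illustration only: a 16-bit branch history as described above. On each
 * execution, shift in one bit recording whether the branch jumped. */
static inline uint16_t
record_branch(uint16_t history, int jumped)
{
    return (uint16_t)((history << 1) | (jumped != 0));
}

/* At trace-projection time, count the set bits: more ones than zeros means
 * the branch usually jumps, so translate the "jumped" side. */
static inline int
usually_jumps(uint16_t history)
{
    int ones = 0;
    for (int i = 0; i < 16; i++) {
        ones += (history >> i) & 1;
    }
    return ones > 8;
}

/* Example initializer with alternating ones and zeros to avoid early bias
 * (the exact constant used may differ). */
#define BRANCH_HISTORY_INIT 0xAAAAu
```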
The .pyc file magic number is updated. There's a new test, some fixes for existing tests, and a few miscellaneous cleanups.
Fix a _thread.start_new_thread() race condition: if a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and crashing.
thread_run() calls PyEval_AcquireThread(), which checks whether the thread
must exit. The problem was that tstate was dereferenced earlier, in
_PyThreadState_Bind(), which led to a crash most of the time.
Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().
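The general shape of the problem, reduced to a generic pthreads sketch (an illustration of the ordering principle only, not the actual CPython code):
```
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustration only: the worker must take the lock and check the
 * "finalizing" flag *before* it ever dereferences state that the main
 * thread may free during shutdown (analogous to checking for exit in
 * PyEval_AcquireThread() before touching the thread state). */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static bool finalizing;             /* set under the lock before freeing */
static int *shared_state;           /* freed by the main thread at shutdown */

static void *
worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    if (finalizing) {
        /* Shutdown already started: exit immediately, never touch state. */
        pthread_mutex_unlock(&lock);
        return NULL;
    }
    *shared_state += 1;             /* safe: flag was checked under the lock */
    pthread_mutex_unlock(&lock);
    return NULL;
}
```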