The `--disable-gil` builds occasionally need to pause all but one thread. Some
examples include:
* Cyclic garbage collection, where this is often called a "stop the world event"
* Before calling `fork()`, to ensure a consistent state for internal data structures
* During interpreter shutdown, to ensure that daemon threads aren't accessing Python objects
This adds the following functions to implement global and per-interpreter pauses:
* `_PyEval_StopTheWorldAll()` and `_PyEval_StartTheWorldAll()` (for the global runtime)
* `_PyEval_StopTheWorld()` and `_PyEval_StartTheWorld()` (per-interpreter)
(The function names may change.)
These functions are no-ops outside of the `--disable-gil` build.
* gh-112532: Tag mimalloc heaps and pages
Mimalloc pages are data structures that contain contiguous allocations
of the same block size. Note that they are distinct from operating
system pages. Mimalloc pages are contained in segments.
When a thread exits, it abandons any segments and contained pages that
have live allocations. These segments and pages may be later reclaimed
by another thread. To support GC and certain thread-safety guarantees in
free-threaded builds, we want pages to only be reclaimed by the
corresponding heap in the claimant thread. For example, we want pages
containing GC objects to only be claimed by GC heaps.
This allows heaps and pages to be tagged with an integer tag that is
used to ensure that abandoned pages are only claimed by heaps with the
same tag. Heaps can be initialized with a tag (0-15); any page allocated
by that heap copies the corresponding tag.
* Fix conversion warning
* gh-112532: Isolate abandoned segments by interpreter
Mimalloc segments are data structures that contain memory allocations along
with metadata. Each segment is "owned" by a thread. When a thread exits,
it abandons its segments to a global pool to be later reclaimed by other
threads. This changes the pool to be per-interpreter instead of process-wide.
This will be important for when we use mimalloc to find GC objects in the
`--disable-gil` builds. We want heaps to only store Python objects from a
single interpreter. Absent this change, the abandoning and reclaiming process
could break this isolation.
* Add missing '&_mi_abandoned_default' to 'tld_empty'
* gh-112532: Use separate mimalloc heaps for GC objects
In `--disable-gil` builds, we now use four separate heaps in
anticipation of using mimalloc to find GC objects when the GIL is
disabled. To support this, we also make a few changes to mimalloc:
* `mi_heap_t` and `mi_tld_t` initialization is split from allocation.
This allows us to have a `mi_tld_t` per-`PyThreadState`, which is
important to keep interpreter isolation, since the same OS thread may
run in multiple interpreters (using different PyThreadStates.)
* Heap abandoning (mi_heap_collect_ex) can now be called from a
different thread than the one that created the heap. This is necessary
because we may clear and delete the containing PyThreadStates from a
different thread during finalization and after fork().
* Use enum instead of defines and guard mimalloc includes.
* The enum typedef will be convenient for future PRs that use the type.
* Guarding the mimalloc includes allows us to unconditionally include
pycore_mimalloc.h from other header files that rely on things like
`struct _mimalloc_thread_state`.
* Only define _mimalloc_thread_state in Py_GIL_DISABLED builds
The `PyThreadState_Clear()` function must only be called with the GIL
held and must be called from the same interpreter as the passed in
thread state. Otherwise, any Python objects on the thread state may be
destroyed using the wrong interpreter, leading to memory corruption.
This is also important for `Py_GIL_DISABLED` builds because free lists
will be associated with PyThreadStates and cleared in
`PyThreadState_Clear()`.
This fixes two places that called `PyThreadState_Clear()` from the wrong
interpreter and adds an assertion to `PyThreadState_Clear()`.
This replaces some usages of PyThread_type_lock with PyMutex, which does not require memory allocation to initialize.
This simplifies some of the runtime initialization and is also one step towards avoiding changing the default raw memory allocator during initialize/finalization, which can be non-thread-safe in some circumstances.
Every PyThreadState instance is now actually a _PyThreadStateImpl.
It is safe to cast from `PyThreadState*` to `_PyThreadStateImpl*` and back.
The _PyThreadStateImpl will contain fields that we do not want to expose
in the public C API.
Critical sections are helpers to replace the global interpreter lock
with finer grained locking. They provide similar guarantees to the GIL
and avoid the deadlock risk that plain locking involves. Critical
sections are implicitly ended whenever the GIL would be released. They
are resumed when the GIL would be acquired. Nested critical sections
behave as if the sections were interleaved.
This moves several general internal APIs out of _xxsubinterpretersmodule.c and into the new Python/crossinterp.c (and the corresponding internal headers).
Specifically:
* _Py_excinfo, etc.: the initial implementation for non-object exception snapshots (in pycore_pyerrors.h and Python/errors.c)
* _PyXI_exception_info, etc.: helpers for passing an exception beween interpreters (wraps _Py_excinfo)
* _PyXI_namespace, etc.: helpers for copying a dict of attrs between interpreters
* _PyXI_Enter(), _PyXI_Exit(): functions that abstract out the transitions between one interpreter and a second that will do some work temporarily
Again, these were all abstracted out of _xxsubinterpretersmodule.c as generalizations. I plan on proposing these as public API at some point.
This is partly to clear this stuff out of pystate.c, but also in preparation for moving some code out of _xxsubinterpretersmodule.c. This change also moves this stuff to the internal API (new: Include/internal/pycore_crossinterp.h). @vstinner did this previously and I undid it. Now I'm re-doing it. :/
This adds a new field 'state' to PyThreadState that can take on one of three values: _Py_THREAD_ATTACHED, _Py_THREAD_DETACHED, or _Py_THREAD_GC. The "attached" and "detached" states correspond closely to acquiring and releasing the GIL. The "gc" state is current unused, but will be used to implement stop-the-world GC for --disable-gil builds in the near future.
We do the following:
* add a per-interpreter XID registry (PyInterpreterState.xidregistry)
* put heap types there (keep static types in _PyRuntimeState.xidregistry)
* clear the registries during interpreter/runtime finalization
* avoid duplicate entries in the registry (when _PyCrossInterpreterData_RegisterClass() is called more than once for a type)
* use Py_TYPE() instead of PyObject_Type() in _PyCrossInterpreterData_Lookup()
The per-interpreter registry helps preserve isolation between interpreters. This is important when heap types are registered, which is something we haven't been doing yet but I will likely do soon.
Add PyThreadState_GetUnchecked() function: similar to
PyThreadState_Get(), but don't issue a fatal error if it is NULL. The
caller is responsible to check if the result is NULL. Previously,
this function was private and known as _PyThreadState_UncheckedGet().
In a few places we switch to another interpreter without knowing if it has a thread state associated with the current thread. For the main interpreter there wasn't much of a problem, but for subinterpreters we were *mostly* okay re-using the tstate created with the interpreter (located via PyInterpreterState_ThreadHead()). There was a good chance that tstate wasn't actually in use by another thread.
However, there are no guarantees of that. Furthermore, re-using an already used tstate is currently fragile. To address this, now we create a new thread state in each of those places and use it.
One consequence of this change is that PyInterpreterState_ThreadHead() may not return NULL (though that won't happen for the main interpreter).
The existence of background threads running on a subinterpreter was preventing interpreters from getting properly destroyed, as well as impacting the ability to run the interpreter again. It also affected how we wait for non-daemon threads to finish.
We add PyInterpreterState.threads.main, with some internal C-API functions.
PyMutex is a one byte lock with fast, inlineable lock and unlock functions for the common uncontended case. The design is based on WebKit's WTF::Lock.
PyMutex is built using the _PyParkingLot APIs, which provides a cross-platform futex-like API (based on WebKit's WTF::ParkingLot). This internal API will be used for building other synchronization primitives used to implement PEP 703, such as one-time initialization and events.
This also includes tests and a mini benchmark in Tools/lockbench/lockbench.py to compare with the existing PyThread_type_lock.
Uncontended acquisition + release:
* Linux (x86-64): PyMutex: 11 ns, PyThread_type_lock: 44 ns
* macOS (arm64): PyMutex: 13 ns, PyThread_type_lock: 18 ns
* Windows (x86-64): PyMutex: 13 ns, PyThread_type_lock: 38 ns
PR Overview:
The primary purpose of this PR is to implement PyMutex, but there are a number of support pieces (described below).
* PyMutex: A 1-byte lock that doesn't require memory allocation to initialize and is generally faster than the existing PyThread_type_lock. The API is internal only for now.
* _PyParking_Lot: A futex-like API based on the API of the same name in WebKit. Used to implement PyMutex.
* _PyRawMutex: A word sized lock used to implement _PyParking_Lot.
* PyEvent: A one time event. This was used a bunch in the "nogil" fork and is useful for testing the PyMutex implementation, so I've included it as part of the PR.
* pycore_llist.h: Defines common operations on doubly-linked list. Not strictly necessary (could do the list operations manually), but they come up frequently in the "nogil" fork. ( Similar to https://man.freebsd.org/cgi/man.cgi?queue)
---------
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
There is a WIP proposal to enable webassembly stack switching which have been
implemented in v8:
https://github.com/WebAssembly/js-promise-integration
It is not possible to switch stacks that contain JS frames so the Emscripten JS
trampolines that allow calling functions with the wrong number of arguments
don't work in this case. However, the js-promise-integration proposal requires
the [type reflection for Wasm/JS API](https://github.com/WebAssembly/js-types)
proposal, which allows us to actually count the number of arguments a function
expects.
For better compatibility with stack switching, this PR checks if type reflection
is available, and if so we use a switch block to decide the appropriate
signature. If type reflection is unavailable, we should use the current EMJS
trampoline.
We cache the function argument counts since when I didn't cache them performance
was negatively affected.
Co-authored-by: T. Wouters <thomas@python.org>
Co-authored-by: Brett Cannon <brett@python.org>
Fix _thread.start_new_thread() race condition. If a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and lead to a
crash.
thread_run() calls PyEval_AcquireThread() which checks if the thread
must exit. The problem was that tstate was dereferenced earlier in
_PyThreadState_Bind() which leads to a crash most of the time.
Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().
thread_run() of _threadmodule.c now calls
_PyThreadState_CheckConsistency() to check if tstate is a dangling
pointer when Python is built in debug mode.
Rename ceval_gil.c is_tstate_valid() to
_PyThreadState_CheckConsistency() to reuse it in _threadmodule.c.
Symbols of the C API should be prefixed by "Py_" to avoid conflict
with existing names in 3rd party C extensions on "#include <Python.h>".
test.pythoninfo now logs Py_C_RECURSION_LIMIT constant and other
_testcapi and _testinternalcapi constants.
pycore_create_interpreter() now returns a status, rather than
calling Py_FatalError().
* PyInterpreterState_New() now calls Py_ExitStatusException() instead
of calling Py_FatalError() directly.
* Replace Py_FatalError() with PyStatus in init_interpreter() and
_PyObject_InitState().
* _PyErr_SetFromPyStatus() now raises RuntimeError, instead of
ValueError. It can now call PyErr_NoMemory(), raise MemoryError,
if it detects _PyStatus_NO_MEMORY() error message.
Python built with "configure --with-trace-refs" (tracing references)
is now ABI compatible with Python release build and debug build.
Moreover, it now also supports the Limited API.
Change Py_TRACE_REFS build:
* Remove _PyObject_EXTRA_INIT macro.
* The PyObject structure no longer has two extra members (_ob_prev
and _ob_next).
* Use a hash table (_Py_hashtable_t) to trace references (all
objects): PyInterpreterState.object_state.refchain.
* Py_TRACE_REFS build is now ABI compatible with release build and
debug build.
* Limited C API extensions can now be built with Py_TRACE_REFS:
xxlimited, xxlimited_35, _testclinic_limited.
* No longer rename PyModule_Create2() and PyModule_FromDefAndSpec2()
functions to PyModule_Create2TraceRefs() and
PyModule_FromDefAndSpec2TraceRefs().
* _Py_PrintReferenceAddresses() is now called before
finalize_interp_delete() which deletes the refchain hash table.
* test_tracemalloc find_trace() now also filters by size to ignore
the memory allocated by _PyRefchain_Trace().
Test changes for Py_TRACE_REFS:
* Add test.support.Py_TRACE_REFS constant.
* Add test_sys.test_getobjects() to test sys.getobjects() function.
* test_exceptions skips test_recursion_normalizing_with_no_memory()
and test_memory_error_in_PyErr_PrintEx() if Python is built with
Py_TRACE_REFS.
* test_repl skips test_no_memory().
* test_capi skisp test_set_nomemory().
This fixes a crasher due to a race condition, triggered infrequently when two isolated (own GIL) subinterpreters simultaneously initialize their sys or builtins modules. The crash happened due the combination of the "detached" thread state we were using and the "last holder" logic we use for the GIL. It turns out it's tricky to use the same thread state for different threads. Who could have guessed?
We solve the problem by eliminating the one object we were still sharing between interpreters. We replace it with a low-level hashtable, using the "raw" allocator to avoid tying it to the main interpreter.
We also remove the accommodations for "detached" thread states, which were a dubious idea to start with.
The _xxsubinterpreters module should not rely on internal API. Some of the functions it uses were recently moved there however. Here we move them back (and expose them properly).
Rename private C API constants:
* Rename PY_MONITORING_UNGROUPED_EVENTS to _PY_MONITORING_UNGROUPED_EVENTS
* Rename PY_MONITORING_EVENTS to _PY_MONITORING_EVENTS
Remove private _PyThreadState and _PyInterpreterState C API
functions: move them to the internal C API (pycore_pystate.h and
pycore_interp.h). Don't export most of these functions anymore, but
still export functions used by tests.
Remove _PyThreadState_Prealloc() and _PyThreadState_Init() from the C
API, but keep it in the stable API.