There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:
1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
However, C holds it, so the assertion fails.
The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.
There are two main parts to this PR:
1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
The event is set by the runtime prior to the thread being cleared (in the same
place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
`PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
`_tstate_lock`s.
[^1]: 441affc9e7/Lib/threading.py (L1201)
[^2]: 441affc9e7/Lib/threading.py (L1115)
[^3]: 8194653279
---------
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
On Windows, time.monotonic() now uses the QueryPerformanceCounter()
clock to have a resolution better than 1 us, instead of the
gGetTickCount64() clock which has a resolution of 15.6 ms.
There are now at least two bytecodes that may attempt to optimize,
JUMP_BACK, and more recently, COLD_EXIT.
Only the JUMP_BACK was counting the attempt in the stats.
This moves that counter to uop_optimize itself so it should
always happen no matter where it is called from.
This isn't strictly necessary because the implementation of `gc_should_collect`
already checks `gcstate->enabled` in the free-threaded build, but it seems
like a good idea until the common pieces of gc.c and gc_free_threading.c are
refactored out.
This moves `current_fast_clear()` up so that the current thread state is
`NULL` while running `tstate_delete_common()`.
This doesn't fix any bugs, but it means that we are more consistent that
`_PyThreadState_GET() != NULL` means that the thread is "attached".
Return 0 on success. Set an exception and return -1 on error.
Fix os.timerfd_settime(): properly report exceptions on
_PyTime_FromSecondsDouble() failure.
No longer export _PyTime_FromSecondsDouble().
In free-threaded builds, running with `PYTHON_GIL=0` will now disable the
GIL. Follow-up issues track work to re-enable the GIL when loading an
incompatible extension, and to disable the GIL by default.
In order to support re-enabling the GIL at runtime, all GIL-related data
structures are initialized as usual, and disabling the GIL simply sets a flag
that causes `take_gil()` and `drop_gil()` to return early.
In general, when `_PyThreadState_GET()` is non-NULL then the current
thread is "attached", but there is a small window during
`PyThreadState_DeleteCurrent()` where that's not true:
tstate_delete_common() is called when the thread is detached, but before
current_fast_clear().
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
If a thread blocks while waiting on the `shared->mutex` lock, the array
of QSBR states may be reallocated. The `tstate->qsbr` values before the
lock is acquired may not be the same as the value after the lock is acquired.
This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.
Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.
When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.
A previous commit introduced a bug to `interpreter_clear()`: it set
`interp->ceval.instrumentation_version` to 0, without making the corresponding
change to `tstate->eval_breaker` (which holds a thread-local copy of the
version). After this happens, Python code can still run due to object finalizers
during a GC, and the version check in bytecodes.c will see a different result
than the one in instrumentation.c causing an infinite loop.
The fix itself is straightforward: clear `tstate->eval_breaker` when clearing
`interp->ceval.instrumentation_version`.
Make `_thread.ThreadHandle` thread-safe in free-threaded builds
We protect the mutable state of `ThreadHandle` using a `_PyOnceFlag`.
Concurrent operations (i.e. `join` or `detach`) on `ThreadHandle` block
until it is their turn to execute or an earlier operation succeeds.
Once an operation has been applied successfully all future operations
complete immediately.
The `join()` method is now idempotent. It may be called multiple times
but the underlying OS thread will only be joined once. After `join()`
succeeds, any future calls to `join()` will succeed immediately.
The internal thread handle `detach()` method has been removed.
This changes the `sym_set_...()` functions to return a `bool` which is `false`
when the symbol is `bottom` after the operation.
All calls to such functions now check this result and go to `hit_bottom`,
a special error label that prints a different message and then reports
that it wasn't able to optimize the trace. No executor will be produced
in this case.
This undoes the *temporary* default disabling of the T2 optimizer pass in gh-115860.
- Add a new test that reproduces Brandt's example from gh-115859; it indeed crashes before gh-116028 with PYTHONUOPSOPTIMIZE=1
- Re-enable the optimizer pass in T2, stop checking PYTHONUOPSOPTIMIZE
- Rename the env var to disable T2 entirely to PYTHON_UOPS_OPTIMIZE (must be explicitly set to 0 to disable)
- Fix skipIf conditions on tests in test_opt.py accordingly
- Export sym_is_bottom() (for debugging)
- Fix various things in the `_BINARY_OP_` specializations in the abstract interpreter:
- DECREF(temp)
- out-of-space check after sym_new_const()
- add sym_matches_type() checks, so even if we somehow reach a binary op with symbolic constants of the wrong type on the stack we won't trigger the type assert
- Any `sym_set_...` call that attempts to set conflicting information
cause the symbol to become `bottom` (contradiction).
- All `sym_is...` and similar calls return false or NULL for `bottom`.
- Everything's tested.
- The tests still pass with `PYTHONUOPSOPTIMIZE=1`.
* Rename _Py_UOpsAbstractInterpContext to _Py_UOpsContext and _Py_UOpsSymType to _Py_UopsSymbol.
* #define shortened form of _Py_uop_... names for improved readability.
The theory is that even if we saw a jump go in the same direction the
last 16 times we got there, we shouldn't be overly confident that it's
still going to go the same way in the future. This PR makes it so that
in the extreme cases, the confidence is multiplied by 0.9 instead of
remaining unchanged. For unpredictable jumps, there is no difference
(still 0.5). For somewhat predictable jumps, we interpolate.
PyTime_t no longer uses an arbitrary unit, it's always a number of
nanoseconds (64-bit signed integer).
* Rename _PyTime_FromNanosecondsObject() to _PyTime_FromLong().
* Rename _PyTime_AsNanosecondsObject() to _PyTime_AsLong().
* Remove pytime_from_nanoseconds().
* Remove pytime_as_nanoseconds().
* Remove _PyTime_FromNanoseconds().
Remove references to the old names _PyTime_MIN
and _PyTime_MAX, now that PyTime_MIN and
PyTime_MAX are public.
Replace also _PyTime_MIN with PyTime_MIN.
* Rename `_testinternalcapi.get_{uop,counter}_optimizer` to `new_*_optimizer`
* Use `_PyUOpName()` instead of` _PyOpcode_uop_name[]`
* Add `target` to executor iterator items -- `list(ex)` now returns `(opcode, oparg, target, operand)` quadruples
* Add executor methods `get_opcode()` and `get_oparg()` to get `vmdata.opcode`, `vmdata.oparg`
* Define a helper for printing uops, and unify various places where they are printed
* Add a hack to summarize_stats.py to fix legacy uop names (e.g. `POP_TOP` -> `_POP_TOP`)
* Define helpers in `test_opt.py` for accessing the set or list of opnames of an executor
This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
<pycore_time.h> include is no longer needed to get the PyTime_t type
in internal header files. This type is now provided by <Python.h>
include. Add <pycore_time.h> includes to C files instead.
This avoids filling the memory occupied by ob_tid, ob_ref_local, and
ob_ref_shared with debug bytes (e.g., 0xDD) in mimalloc in the
free-threaded build.
This change adds an `eval_breaker` field to `PyThreadState`. The primary
motivation is for performance in free-threaded builds: with thread-local eval
breakers, we can stop a specific thread (e.g., for an async exception) without
interrupting other threads.
The source of truth for the global instrumentation version is stored in the
`instrumentation_version` field in PyInterpreterState. Threads usually read the
version from their local `eval_breaker`, where it continues to be colocated
with the eval breaker bits.
This adds a safe memory reclamation scheme based on FreeBSD's "GUS" and
quiescent state based reclamation (QSBR). The API provides a mechanism
for callers to detect when it is safe to free memory that may be
concurrently accessed by readers.
The GC keeps track of the number of allocations (less deallocations)
since the last GC. This buffers the count in thread-local state and uses
atomic operations to modify the per-interpreter count. The thread-local
buffering avoids contention on shared state.
A consequence is that the GC scheduling is not as precise, so
"test_sneaky_frame_object" is skipped because it requires that the GC be
run exactly after allocating a frame object.
Makes _PyType_Lookup thread safe, including:
Thread safety of the underlying cache.
Make mutation of mro and type members thread safe
Also _PyType_GetMRO and _PyType_GetBases are currently returning borrowed references which aren't safe.
This is a temporary fix to unblock embedders that do not call Py_Main().
_PyInterpreterState_IsRunningMain() will always return true for the main interpreter, even in corner cases where it technically should not. The (future) full solution will do the right thing in those corner cases.
For the most part, these changes make is substantially easier to backport subinterpreter-related code to 3.12, especially the related modules (e.g. _xxsubinterpreters). The main motivation is to support releasing a PyPI package with the 3.13 capabilities compiled for 3.12.
A lot of the changes here involve either hiding details behind macros/functions or splitting up some files.
Setters for members with an unsigned integer type now support
the same range of valid values for objects that has a __index__()
method as for int.
Previously, Py_T_UINT, Py_T_ULONG and Py_T_ULLONG did not support
objects that has a __index__() method larger than LONG_MAX.
Py_T_ULLONG did not support negative ints. Now it supports them and
emits a RuntimeWarning.
Biased reference counting maintains two refcount fields in each object:
`ob_ref_local` and `ob_ref_shared`. The true refcount is the sum of these two
fields. In some cases, when refcounting operations are split across threads,
the ob_ref_shared field can be negative (although the total refcount must be
at least zero). In this case, the thread that decremented the refcount
requests that the owning thread give up ownership and merge the refcount
fields.
This changes a number of internal usages of `PyDict_SetDefault` to use `PyDict_SetDefaultRef`.
Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
This marks dead ThreadHandles as non-joinable earlier in
`PyOS_AfterFork_Child()` before we execute any Python code. The handles
are stored in a global linked list in `_PyRuntimeState` because `fork()`
affects the entire process.
Fix race between `_PyParkingLot_Park` and `_PyParkingLot_UnparkAll` when handling interrupts
There is a potential race when `_PyParkingLot_UnparkAll` is executing in
one thread and another thread is unblocked because of an interrupt in
`_PyParkingLot_Park`. Consider the following scenario:
1. Thread T0 is blocked[^1] in `_PyParkingLot_Park` on address `A`.
2. Thread T1 executes `_PyParkingLot_UnparkAll` on address `A`. It
finds the `wait_entry` for `T0` and unlinks[^2] its list node.
3. Immediately after (2), T0 is woken up due to an interrupt. It
then segfaults trying to unlink[^3] the node that was previously
unlinked in (2).
To fix this we mark each waiter as unparking before releasing the bucket
lock. `_PyParkingLot_Park` will wait to handle the coming wakeup, and not
attempt to unlink the node, when this field is set. `_PyParkingLot_Unpark`
does this already, presumably to handle this case.
* Fix a RuntimeWarning emitted when assign an integer-like value that
is not an instance of int to an attribute that corresponds to a C
struct member of type T_UINT and T_ULONG.
* Fix a double RuntimeWarning emitted when assign a negative integer value
to an attribute that corresponds to a C struct member of type T_UINT.
* gh-112529: Remove PyGC_Head from object pre-header in free-threaded build
This avoids allocating space for PyGC_Head in the free-threaded build.
The GC implementation for free-threaded CPython does not use the
PyGC_Head structure.
* The trashcan mechanism uses the `ob_tid` field instead of `_gc_prev`
in the free-threaded build.
* The GDB libpython.py file now determines the offset of the managed
dict field based on whether the running process is a free-threaded
build. Those are identified by the `ob_ref_local` field in PyObject.
* Fixes `_PySys_GetSizeOf()` which incorrectly incorrectly included the
size of `PyGC_Head` in the size of static `PyTypeObject`.
The free-threaded build's GC implementation is non-generational, but was
scheduled as if it were collecting a young generation leading to
quadratic behavior. This increases the minimum threshold and scales it
to the number of live objects as we do for the old generation in the
default build.
Note that the scheduling is still not thread-safe without the GIL. Those
changes will come in later PRs.
A few tests, like "test_sneaky_frame_object" rely on prompt scheduling
of the GC. For now, to keep that test passing, we disable the scaled
threshold after calls like `gc.set_threshold(1, 0, 0)`.
Add a configure define for HAVE_PTHREAD_COND_TIMEDWAIT_RELATIVE_NP and
replaces pthread_cond_timedwait() with pthread_cond_timedwait_relative_np()
for relative time when supported in semaphore waiting logic.
Add an option (--enable-experimental-jit for configure-based builds
or --experimental-jit for PCbuild-based ones) to build an
*experimental* just-in-time compiler, based on copy-and-patch (https://fredrikbk.com/publications/copy-and-patch.pdf).
See Tools/jit/README.md for more information on how to install the required build-time tooling.
For interpreters that share state with the main interpreter, this points
to the same static memory structure. For interpreters with their own
obmalloc state, it is heap allocated. Add free_obmalloc_arenas() which
will free the obmalloc arenas and radix tree structures for interpreters
with their own obmalloc state.
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
* gh-112529: Implement GC for free-threaded builds
This implements a mark and sweep GC for the free-threaded builds of
CPython. The implementation relies on mimalloc to find GC tracked
objects (i.e., "containers").
The `--disable-gil` builds occasionally need to pause all but one thread. Some
examples include:
* Cyclic garbage collection, where this is often called a "stop the world event"
* Before calling `fork()`, to ensure a consistent state for internal data structures
* During interpreter shutdown, to ensure that daemon threads aren't accessing Python objects
This adds the following functions to implement global and per-interpreter pauses:
* `_PyEval_StopTheWorldAll()` and `_PyEval_StartTheWorldAll()` (for the global runtime)
* `_PyEval_StopTheWorld()` and `_PyEval_StartTheWorld()` (per-interpreter)
(The function names may change.)
These functions are no-ops outside of the `--disable-gil` build.
* gh-112529: Use GC heaps for GC allocations in free-threaded builds
The free-threaded build's garbage collector implementation will need to
find GC objects by traversing mimalloc heaps. This hooks up the
allocation calls with the correct heaps by using a thread-local
"current_obj_heap" variable.
* Refactor out setting heap based on type
Fix a few places where the lltrace debug output printed ``(null)`` instead of an opcode name, because it was calling ``_PyUOpName()`` on a Tier-1 opcode.
It can be used to set the location of a .python_history file
---------
Co-authored-by: Levi Sabah <0xl3vi@gmail.com>
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
This splits part of Modules/gcmodule.c of into Python/gc.c, which
now contains the core garbage collection implementation. The Python
module remain in the Modules/gcmodule.c file.
* gh-112532: Tag mimalloc heaps and pages
Mimalloc pages are data structures that contain contiguous allocations
of the same block size. Note that they are distinct from operating
system pages. Mimalloc pages are contained in segments.
When a thread exits, it abandons any segments and contained pages that
have live allocations. These segments and pages may be later reclaimed
by another thread. To support GC and certain thread-safety guarantees in
free-threaded builds, we want pages to only be reclaimed by the
corresponding heap in the claimant thread. For example, we want pages
containing GC objects to only be claimed by GC heaps.
This allows heaps and pages to be tagged with an integer tag that is
used to ensure that abandoned pages are only claimed by heaps with the
same tag. Heaps can be initialized with a tag (0-15); any page allocated
by that heap copies the corresponding tag.
* Fix conversion warning
* gh-112532: Isolate abandoned segments by interpreter
Mimalloc segments are data structures that contain memory allocations along
with metadata. Each segment is "owned" by a thread. When a thread exits,
it abandons its segments to a global pool to be later reclaimed by other
threads. This changes the pool to be per-interpreter instead of process-wide.
This will be important for when we use mimalloc to find GC objects in the
`--disable-gil` builds. We want heaps to only store Python objects from a
single interpreter. Absent this change, the abandoning and reclaiming process
could break this isolation.
* Add missing '&_mi_abandoned_default' to 'tld_empty'
* gh-112532: Use separate mimalloc heaps for GC objects
In `--disable-gil` builds, we now use four separate heaps in
anticipation of using mimalloc to find GC objects when the GIL is
disabled. To support this, we also make a few changes to mimalloc:
* `mi_heap_t` and `mi_tld_t` initialization is split from allocation.
This allows us to have a `mi_tld_t` per-`PyThreadState`, which is
important to keep interpreter isolation, since the same OS thread may
run in multiple interpreters (using different PyThreadStates.)
* Heap abandoning (mi_heap_collect_ex) can now be called from a
different thread than the one that created the heap. This is necessary
because we may clear and delete the containing PyThreadStates from a
different thread during finalization and after fork().
* Use enum instead of defines and guard mimalloc includes.
* The enum typedef will be convenient for future PRs that use the type.
* Guarding the mimalloc includes allows us to unconditionally include
pycore_mimalloc.h from other header files that rely on things like
`struct _mimalloc_thread_state`.
* Only define _mimalloc_thread_state in Py_GIL_DISABLED builds