Deferred reference counting is not fully implemented yet. As a temporary
measure, we immortalize objects that would use deferred reference
counting to avoid multi-threaded scaling bottlenecks.
This is only performed in the free-threaded build once the first
non-main thread is started. Additionally, some tests, including refleak
tests, suppress this behavior.
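A minimal sketch of the stopgap, assuming the internal `_Py_SetImmortal()` helper; the surrounding names and the exact trigger below are illustrative, not the actual code path:

```c
#include "Python.h"
#ifdef Py_GIL_DISABLED
#include "pycore_object.h"        /* _Py_SetImmortal() (internal, core-only) */

/* Illustrative flag: set once the first non-main thread has started. */
static int non_main_thread_started = 0;

static void
defer_refcount_stopgap(PyObject *op)
{
    if (non_main_thread_started) {
        /* Deferred reference counting isn't implemented yet, so make the
           object immortal instead to avoid cross-thread refcount traffic. */
        _Py_SetImmortal(op);
    }
}
#endif
```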
Basically, I've turned most of _PyImport_LoadDynamicModuleWithSpec() into two new functions (_PyImport_GetModInitFunc() and _PyImport_RunModInitFunc()) and moved the rest of it out into _imp_create_dynamic_impl(). There shouldn't be any changes in behavior.
This change makes some future changes simpler. This is particularly relevant to potentially calling each module init function in the main interpreter first. Thus the critical part of the PR is the addition of _PyImport_RunModInitFunc(), which is strictly focused on running the init func and validating the result. A later PR will take it a step further by capturing error information rather than raising exceptions.
FWIW, this change also helps readers by clarifying a bit more about what happens when an extension/builtin module is imported.
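Roughly, the new split looks like this; the signatures below are illustrative (the real internal prototypes differ), but the division of labor matches the description above:

```c
/* Sketch only: illustrative signatures, not the real internal prototypes. */
static PyObject *
create_dynamic_sketch(PyObject *spec, FILE *fp)
{
    /* 1. Resolve the exported PyInit_<name> symbol from the shared object. */
    PyModInitFunction initfunc = _PyImport_GetModInitFunc(spec, fp);
    if (initfunc == NULL) {
        return NULL;
    }
    /* 2. Call the init function and validate what it returned (a module,
          a multi-phase-init def, NULL with/without an exception, etc.). */
    return _PyImport_RunModInitFunc(initfunc, spec);
}
```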
This is an improvement over the status quo, reducing the likelihood of completely filling the pending calls queue. However, the problem won't go away completely unless we move to an unbounded linked list or add a mechanism for waiting until the queue isn't full.
These are cleanups I've pulled out of gh-118116. Mostly, this change moves code around to align with some future changes and to improve clarity a little. There is one very small change in behavior: we now add the module to the per-interpreter caches after updating the global state, rather than before.
This is a collection of very basic cleanups I've pulled out of gh-118116. It is mostly renaming variables and moving a couple bits of code in functionally equivalent ways.
Makes sys.settrace, sys.setprofile, and monitoring generally thread-safe.
Mostly this uses a stop-the-world approach plus synchronization around the code object's _co_instrumentation_version. A little extra synchronization around the monitoring data may be needed to keep it TSAN clean.
Guido pointed out to me that some details about the per-interpreter state for the builtin types aren't especially clear. I'm addressing that by:
* adding a comment explaining that state
* adding some asserts to point out the relationship between each index and the interp/global runtime state
This change gives a significant speedup, as the METH_FASTCALL calling
convention is now used. The following bytes and bytearray methods are adapted:
- count()
- find()
- index()
- rfind()
- rindex()
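For context, this is what the METH_FASTCALL convention looks like in a hand-written extension (the real bytes/bytearray wrappers are generated by Argument Clinic); the `demo` module and `count_spaces` function below are made-up names:

```c
#include <Python.h>

/* METH_FASTCALL passes arguments as a C array plus a count, so no
   temporary argument tuple has to be allocated per call. */
static PyObject *
count_spaces(PyObject *self, PyObject *const *args, Py_ssize_t nargs)
{
    if (nargs != 1) {
        PyErr_SetString(PyExc_TypeError, "expected exactly one argument");
        return NULL;
    }
    Py_ssize_t size;
    const char *data = PyUnicode_AsUTF8AndSize(args[0], &size);
    if (data == NULL) {
        return NULL;
    }
    Py_ssize_t n = 0;
    for (Py_ssize_t i = 0; i < size; i++) {
        n += (data[i] == ' ');
    }
    return PyLong_FromSsize_t(n);
}

static PyMethodDef demo_methods[] = {
    {"count_spaces", (PyCFunction)(void (*)(void))count_spaces,
     METH_FASTCALL, "Count ASCII spaces in a string."},
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef demo_module = {
    PyModuleDef_HEAD_INIT, "demo", NULL, 0, demo_methods,
};

PyMODINIT_FUNC
PyInit_demo(void)
{
    return PyModuleDef_Init(&demo_module);
}
```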
Co-authored-by: Inada Naoki <songofacandy@gmail.com>
This is similar to the situation with threading._DummyThread. The methods (incl. __del__()) of interpreters.Interpreter objects must be careful with interpreters not created by interpreters.create(). The simplest thing to start with is to disable any method that modifies or runs in the interpreter. As part of this, the runtime keeps track of where an interpreter was created. We also handle interpreter "refcounts" properly.
The free-threaded build does not currently support the combination of
single-phase init modules and non-isolated subinterpreters. Ensure that
`check_multi_interp_extensions` is always `True` for subinterpreters in
the free-threaded build so that importing these modules raises an
`ImportError`.
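For reference, this is the embedding-API knob involved; a minimal sketch using the public Py_NewInterpreterFromConfig():

```c
#include <Python.h>

static int
spawn_isolated_subinterpreter(void)
{
    /* On the free-threaded build, check_multi_interp_extensions is always
       treated as 1, so importing a single-phase init module in this
       interpreter raises ImportError. */
    PyInterpreterConfig config = {
        .use_main_obmalloc = 0,
        .allow_fork = 0,
        .allow_exec = 0,
        .allow_threads = 1,
        .allow_daemon_threads = 0,
        .check_multi_interp_extensions = 1,
        .gil = PyInterpreterConfig_OWN_GIL,
    };
    PyThreadState *tstate = NULL;
    PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);
    if (PyStatus_Exception(status)) {
        return -1;                  /* interpreter creation failed */
    }
    Py_EndInterpreter(tstate);      /* tear it back down for the example */
    return 0;
}
```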
The `__has_feature(thread_sanitizer)` check is a Clang-ism. Although new
versions of GCC implement `__has_feature`, the `defined(__has_feature)`
check still fails on GCC, so we don't use that code path.
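A common portable pattern (shown for illustration only, not necessarily the exact code used in this change) is to give `__has_feature` a fallback definition instead of testing `defined(__has_feature)`, and to also check GCC's `__SANITIZE_THREAD__`:

```c
/* Fallback so the macro can be used unconditionally on non-Clang compilers. */
#ifndef __has_feature
#  define __has_feature(x) 0
#endif

#if __has_feature(thread_sanitizer) || defined(__SANITIZE_THREAD__)
#  define EXAMPLE_WITH_TSAN 1   /* illustrative macro name */
#else
#  define EXAMPLE_WITH_TSAN 0
#endif
```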
This keeps track of the per-thread total reference count operations in
PyThreadState in the free-threaded builds. The count is merged into the
interpreter's total when the thread exits.
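A minimal sketch of the bookkeeping, with purely illustrative names (the actual free-threaded fields and helpers differ):

```c
#include <Python.h>

/* Every name here is illustrative, not the real internal layout. */
typedef struct {
    Py_ssize_t reftotal;        /* this thread's net refcount operations */
} example_thread_counts;

static Py_ssize_t example_interp_reftotal;   /* interpreter-wide total */

static void
example_merge_on_thread_exit(example_thread_counts *counts)
{
    /* The per-thread tally only touches shared state once, at thread exit,
       keeping Py_INCREF/Py_DECREF accounting off the contended path. */
    example_interp_reftotal += counts->reftotal;
    counts->reftotal = 0;
}
```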
Most mutable data is protected by a striped lock that is keyed on the
referenced object's address. The weakref's hash is protected using the
weakref's per-object lock.
Note that this only affects free-threaded builds. Apart from some minor
refactoring, the added code is all either gated by `ifdef`s or is a no-op
(e.g. `Py_BEGIN_CRITICAL_SECTION`).
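For illustration, a striped lock keyed on an object's address looks roughly like this; the stripe count and names are assumptions, not the actual weakref implementation:

```c
#include <Python.h>
#include <stdint.h>

#define EXAMPLE_NUM_STRIPES 64                    /* illustrative stripe count */
static PyMutex example_stripes[EXAMPLE_NUM_STRIPES];

static inline PyMutex *
example_lock_for(PyObject *referent)
{
    /* Shift away alignment bits, then fold the address into a stripe index. */
    uintptr_t h = (uintptr_t)referent >> 4;
    return &example_stripes[h % EXAMPLE_NUM_STRIPES];
}

/* usage:
 *     PyMutex_Lock(example_lock_for(obj));
 *     ... mutate the weakref's shared data ...
 *     PyMutex_Unlock(example_lock_for(obj));
 */
```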
Introduce a unified 16-bit backoff counter type (``_Py_BackoffCounter``),
shared between the Tier 1 adaptive specializer and the Tier 2 optimizer. The
API used for adaptive specialization counters is changed but the behavior is
(supposed to be) identical.
The behavior of the Tier 2 counters is changed:
- There are no longer dynamic thresholds (we never varied these).
- All counters now use the same exponential backoff.
- The counter for ``JUMP_BACKWARD`` starts counting down from 16.
- The ``temperature`` in side exits starts counting down from 64.
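A rough sketch of what such a 16-bit exponential-backoff counter looks like; the field widths and helpers below are illustrative, not the actual ``_Py_BackoffCounter`` layout:

```c
#include <stdint.h>

/* Illustrative layout: a small "remaining" count plus the log2 of the
   value to reload after the next failure. */
typedef struct {
    uint16_t value   : 12;
    uint16_t backoff : 4;
} example_backoff_counter;

static inline example_backoff_counter
example_restart(example_backoff_counter c)
{
    /* On an unsuccessful attempt, double the wait (capped) and reload. */
    uint16_t b = (uint16_t)(c.backoff < 12 ? c.backoff + 1 : 12);
    example_backoff_counter r = { (uint16_t)((1u << b) - 1), b };
    return r;
}

static inline int
example_triggers(example_backoff_counter *c)
{
    /* Count down; report "try again" only when the counter reaches zero. */
    if (c->value > 0) {
        c->value--;
        return 0;
    }
    return 1;
}
```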
This merges all `_CHECK_STACK_SPACE` uops in a trace into a single `_CHECK_STACK_SPACE_OPERAND` uop that checks whether there is enough stack space for all calls included in the entire trace.
These helpers make it easier to customize and inspect the config used to initialize interpreters. This is especially valuable in our tests. I drew inspiration from the PyConfig API for the PyInterpreterConfig dict conversion stuff. As part of this PR I've also added a bunch of tests.
Mark the swap operations as critical sections.
Add an internal Py_BEGIN_CRITICAL_SECTION_MUT API that takes a PyMutex
pointer instead of a PyObject pointer.
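A minimal usage sketch of the new macro (this is internal API, so the exact spelling and header may differ):

```c
/* Sketch: guard a swap with a raw PyMutex instead of a per-object lock. */
static PyMutex example_mutex;

static void
example_swap(PyObject **a, PyObject **b)
{
    Py_BEGIN_CRITICAL_SECTION_MUT(&example_mutex);
    PyObject *tmp = *a;
    *a = *b;
    *b = tmp;
    Py_END_CRITICAL_SECTION();
}
```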
Change the old space bit of young objects from 0 to gcstate->visited_space.
This ensures that any object created *and* collected during cycle GC has the bit set correctly.
Use support.infinite_recursion() in test_recursive_pickle() of
test_functools to prevent a stack overflow on the "ARM64 Windows
Non-Debug" buildbot.
Lower Py_C_RECURSION_LIMIT to 1,000 frames on Windows ARM64.
Changes to the function version cache:
- In addition to the function object, also store the code object,
and allow the latter to be retrieved even if the function has been evicted.
- Stop assigning new function versions after a critical attribute (e.g. `__code__`)
has been modified; the version is permanently reset to zero in this case.
- Changes to `__annotations__` are no longer considered critical. (This fixes gh-109998.)
Changes to the Tier 2 optimization machinery:
- If we cannot map a function version to a function, but it is still mapped to a code object,
we continue projecting the trace.
The operand of the `_PUSH_FRAME` and `_POP_FRAME` opcodes can be either NULL,
a function object, or a code object with the lowest bit set.
This allows us to trace through code that calls an ephemeral function,
i.e., a function that may not be alive when we are constructing the executor,
e.g. a generator expression or certain nested functions.
We will lose globals removal inside such functions,
but we can still do other peephole operations
(and even possibly [call inlining](https://github.com/python/cpython/pull/116290),
if we decide to do it), which only need the code object.
As before, if we cannot retrieve the code object from the cache, we stop projecting.
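Concretely, decoding the `_PUSH_FRAME`/`_POP_FRAME` operand described above looks something like this (sketch only; the function and variable names are illustrative):

```c
#include <Python.h>
#include <stdint.h>

/* Returns the code object implied by the operand, or NULL if nothing is
   cached.  The operand encoding follows the description above. */
static PyCodeObject *
example_operand_to_code(uint64_t operand)
{
    if (operand == 0) {
        return NULL;                                    /* nothing known */
    }
    if (operand & 1) {
        /* Low bit set: a bare code object (the function was ephemeral or
           has been evicted from the version cache). */
        return (PyCodeObject *)(uintptr_t)(operand & ~(uint64_t)1);
    }
    /* Otherwise the operand is the function object itself. */
    PyFunctionObject *func = (PyFunctionObject *)(uintptr_t)operand;
    return (PyCodeObject *)func->func_code;
}
```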
Split `_PyThreadState_DeleteExcept` into two functions:
- `_PyThreadState_RemoveExcept` removes all thread states other than one
passed as an argument. It returns the removed thread states as a
linked list.
- `_PyThreadState_DeleteList` deletes those dead thread states. It may
call destructors, so we want to "start the world" before calling
`_PyThreadState_DeleteList` to avoid potential deadlocks.
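The intended calling pattern, per the description above; the stop-the-world helper names shown here are assumptions about the internal API:

```c
static void
example_delete_other_threads(PyInterpreterState *interp, PyThreadState *tstate)
{
    /* Remove the other thread states while every thread is paused... */
    _PyEval_StopTheWorld(interp);
    PyThreadState *dead = _PyThreadState_RemoveExcept(tstate);
    _PyEval_StartTheWorld(interp);

    /* ...but only delete them (which may run destructors) after resuming,
       to avoid deadlocking on threads that are still paused. */
    _PyThreadState_DeleteList(dead);
}
```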
I added the InterpreterID type quite a while ago as a strategy for managing interpreter lifetimes relative to the PEP 554 (now 734) implementation. Relatively recently I refactored that implementation to no longer rely on InterpreterID objects, so now I'm removing the type.
Add Py_GetConstant() and Py_GetConstantBorrowed() functions.
In the limited C API version 3.13, getting Py_None, Py_False,
Py_True, Py_Ellipsis and Py_NotImplemented singletons is now
implemented as function calls at the stable ABI level to hide
implementation details. Getting these constants still returns borrowed
references.
Add _testlimitedcapi/object.c and test_capi/test_object.py to test
Py_GetConstant() and Py_GetConstantBorrowed() functions.
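Usage is straightforward: Py_GetConstant() returns a new strong reference, while Py_GetConstantBorrowed() returns a borrowed one.

```c
#include <Python.h>

static void
example_constants(void)
{
    PyObject *none = Py_GetConstant(Py_CONSTANT_NONE);          /* strong */
    PyObject *truth = Py_GetConstantBorrowed(Py_CONSTANT_TRUE); /* borrowed */

    /* ... use the objects ... */

    Py_DECREF(none);     /* the strong reference must be released */
    (void)truth;         /* borrowed reference: no Py_DECREF */
}
```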
Mostly, we unify the two different implementations of the conversion code (from PyObject * to int64_t). We also drop the PyArg_ParseTuple()-style converter function, as well as rename and move PyInterpreterID_LookUp().
Somehow we ended up with two separate counter variables tracking "the next function version".
Most likely this was a historical accident where an old branch was updated incorrectly.
This PR merges the two counters into a single one: `interp->func_state.next_version`.
Redefining a typedef is a C11 feature, which makes the C API
incompatible with C99. Fix the issue by only defining the typedef once.
Example of warning (treated as an error):
In file included from Include/Python.h:122:
Include/cpython/optimizer.h:77:3: error: redefinition of typedef
'_PyOptimizerObject' is a C11 feature [-Werror,-Wtypedef-redefinition]
} _PyOptimizerObject;
^
build/Include/cpython/optimizer.h:60:35: note: previous definition is here
typedef struct _PyOptimizerObject _PyOptimizerObject;
^
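For reference, the C99-compatible pattern is to spell the typedef only once and define the struct through its tag:

```c
/* Forward-declare the typedef exactly once... */
typedef struct _PyOptimizerObject _PyOptimizerObject;

/* ...then define the struct through its tag, without a second typedef. */
struct _PyOptimizerObject {
    int example_field;   /* illustrative member */
};
```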
There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:
1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
However, C holds it, so the assertion fails.
The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run `repro_join_race.py` from the linked repo.
There are two main parts to this PR:
1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
The event is set by the runtime prior to the thread being cleared (in the same
place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
`PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
`_tstate_lock`s.
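A sketch of the part-1 idea, assuming the internal one-shot event primitive (`PyEvent` with `_PyEvent_Notify()`/`PyEvent_Wait()`); if those internal names differ, treat them and the structure below as placeholders:

```c
#include "pycore_lock.h"    /* PyEvent (internal one-shot event) */

/* Illustrative handle shared by the exiting thread and all joiners. */
typedef struct {
    PyEvent done;
} example_thread_handle;

/* Runtime side, just before the thread state is cleared (where
   _tstate_lock used to be released). */
static void
example_mark_finished(example_thread_handle *h)
{
    _PyEvent_Notify(&h->done);
}

/* Thread.join() side: any number of joiners can wait concurrently, and
   there is no acquire/release hand-off left to race on afterwards. */
static void
example_join(example_thread_handle *h)
{
    PyEvent_Wait(&h->done);
}
```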
[^1]: 441affc9e7/Lib/threading.py (L1201)
[^2]: 441affc9e7/Lib/threading.py (L1115)
[^3]: 8194653279
---------
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>