_PyWeakref_ClearRef was previously exposed in the public C-API, although
it begins with an underscore and is not documented. It's used by a few
C-API extensions. There is currently no alternative public API that can
replace its use.
_PyWeakref_ClearWeakRefsExceptCallbacks is the only thread-safe way to
use _PyWeakref_ClearRef in the free-threaded build. This exposes the C
symbol, but does not make the API public.
Some embedders and extensions include parts of the internal API. The
pycore_mimalloc.h file is transitively include by a number of other
internal headers. This avoids include errors for code that was
already including those headers.
Use relaxed atomics when reading / writing to the field. There are still a
few places in the GC where we do not use atomics. Those should be safe as
the world is stopped.
Add the ability to enable/disable the GIL at runtime, and use that in
the C module loading code.
We can't know before running a module init function if it supports
free-threading, so the GIL is temporarily enabled before doing so. If
the module declares support for running without the GIL, the GIL is
later disabled. Otherwise, the GIL is permanently enabled, and will
never be disabled again for the life of the current interpreter.
We already intern and immortalize most string constants. In the
free-threaded build, other constants can be a source of reference count
contention because they are shared by all threads running the same code
objects.
Add _PyType_LookupRef and use incref before setting attribute on type
Makes setting an attribute on a class and signaling type modified atomic
Avoid adding re-entrancy exposing the type cache in an inconsistent state by decrefing after type is updated
The designated initializer syntax in static inline functions in pycore_backoff.h
causes problems for C++ or MSVC users who aren't yet using C++20.
While internal, pycore_backoff.h is included (indirectly, via pycore_code.h)
by some key 3rd party software that does so for speed.
Use the new public Raw functions:
* _PyTime_PerfCounterUnchecked() with PyTime_PerfCounterRaw()
* _PyTime_TimeUnchecked() with PyTime_TimeRaw()
* _PyTime_MonotonicUnchecked() with PyTime_MonotonicRaw()
Remove internal functions:
* _PyTime_PerfCounterUnchecked()
* _PyTime_TimeUnchecked()
* _PyTime_MonotonicUnchecked()
We have only been tracking each module's PyModuleDef. However, there are some problems with that. For example, in some cases we load single-phase init extension modules from def->m_base.m_init or def->m_base.m_copy, but if multiple modules share a def then we can end up with unexpected behavior.
With this change, we track the following:
* PyModuleDef (same as before)
* for some modules, its init function or a copy of its __dict__, but specific to that module
* whether it is a builtin/core module or a "dynamic" extension
* the interpreter (ID) that owns the cached __dict__ (only if cached)
This also makes it easier to remember the module's kind (e.g. single-phase init) and if loading it previously failed, which I'm doing separately.
* Add CALL_PY_GENERAL, CALL_BOUND_METHOD_GENERAL and call CALL_NON_PY_GENERAL specializations.
* Remove CALL_PY_WITH_DEFAULTS specialization
* Use CALL_NON_PY_GENERAL in more cases when otherwise failing to specialize
This PR adds the ability to enable the GIL if it was disabled at
interpreter startup, and modifies the multi-phase module initialization
path to enable the GIL when loading a module, unless that module's spec
includes a slot indicating it can run safely without the GIL.
PEP 703 called the constant for the slot `Py_mod_gil_not_used`; I went
with `Py_MOD_GIL_NOT_USED` for consistency with gh-104148.
A warning will be issued up to once per interpreter for the first
GIL-using module that is loaded. If `-v` is given, a shorter message
will be printed to stderr every time a GIL-using module is loaded
(including the first one that issues a warning).
The module itself is a thin wrapper around calls to functions in
`Python/codecs.c`, so that's where the meaningful changes happened:
- Move codecs-related state that lives on `PyInterpreterState` to a
struct declared in `pycore_codecs.h`.
- In free-threaded builds, add a mutex to `codecs_state` to synchronize
operations on `search_path`. Because `search_path_mutex` is used as a
normal mutex and not a critical section, we must be extremely careful
with operations called while holding it.
- The codec registry is explicitly initialized as part of
`_PyUnicode_InitEncodings` to simplify thread-safety.
Add "Raw" variant of PyTime functions:
* PyTime_MonotonicRaw()
* PyTime_PerfCounterRaw()
* PyTime_TimeRaw()
Changes:
* Add documentation and tests. Tests release the GIL while calling
raw clock functions.
* py_get_system_clock() and py_get_monotonic_clock() now check that
the GIL is hold by the caller if raise_exc is non-zero.
* Reimplement "Unchecked" functions with raw clock functions.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Account for `add_stopiteration_handler` pushing a block for `async with`.
To allow generator functions that previously almost hit the `CO_MAXBLOCKS`
limit by nesting non-async blocks, the limit is increased by 1.
This increase allows one more block in non-generator functions.
The code for Tier 2 is now only compiled when configured
with `--enable-experimental-jit[=yes|interpreter]`.
We drop support for `PYTHON_UOPS` and -`Xuops`,
but you can disable the interpreter or JIT
at runtime by setting `PYTHON_JIT=0`.
You can also build it without enabling it by default
using `--enable-experimental-jit=yes-off`;
enable with `PYTHON_JIT=1`.
On Windows, the `build.bat` script supports
`--experimental-jit`, `--experimental-jit-off`,
`--experimental-interpreter`.
In the C code, `_Py_JIT` is defined as before
when the JIT is enabled; the new variable
`_Py_TIER2` is defined when the JIT *or* the
interpreter is enabled. It is actually a bitmask:
1: JIT; 2: default-off; 4: interpreter.
Avoid detaching thread state when stopping the world. When re-attaching
the thread state, the thread would attempt to resume the top-most
critical section, which might now be held by a thread paused for our
stop-the-world request.
Deferred reference counting is not fully implemented yet. As a temporary
measure, we immortalize objects that would use deferred reference
counting to avoid multi-threaded scaling bottlenecks.
This is only performed in the free-threaded build once the first
non-main thread is started. Additionally, some tests, including refleak
tests, suppress this behavior.
Basically, I've turned most of _PyImport_LoadDynamicModuleWithSpec() into two new functions (_PyImport_GetModInitFunc() and _PyImport_RunModInitFunc()) and moved the rest of it out into _imp_create_dynamic_impl(). There shouldn't be any changes in behavior.
This change makes some future changes simpler. This is particularly relevant to potentially calling each module init function in the main interpreter first. Thus the critical part of the PR is the addition of _PyImport_RunModInitFunc(), which is strictly focused on running the init func and validating the result. A later PR will take it a step farther by capturing error information rather than raising exceptions.
FWIW, this change also helps readers by clarifying a bit more about what happens when an extension/builtin module is imported.
This is an improvement over the status quo, reducing the likelihood of completely filling the pending calls queue. However, the problem won't go away completely unless we move to an unbounded linked list or add a mechanism for waiting until the queue isn't full.
These are cleanups I've pulled out of gh-118116. Mostly, this change moves code around to align with some future changes and to improve clarity a little. There is one very small change in behavior: we now add the module to the per-interpreter caches after updating the global state, rather than before.
This is a collection of very basic cleanups I've pulled out of gh-118116. It is mostly renaming variables and moving a couple bits of code in functionally equivalent ways.
Makes sys.settrace, sys.setprofile, and monitoring generally thread-safe.
Mostly uses a stop-the-world approach and synchronization around the code object's _co_instrumentation_version. There may be a little bit of extra synchronization around the monitoring data that's required to be TSAN clean.
Guido pointed out to me that some details about the per-interpreter state for the builtin types aren't especially clear. I'm addressing that by:
* adding a comment explaining that state
* adding some asserts to point out the relationship between each index and the interp/global runtime state
This change gives a significant speedup, as the METH_FASTCALL calling
convention is now used. The following bytes and bytearray methods are adapted:
- count()
- find()
- index()
- rfind()
- rindex()
Co-authored-by: Inada Naoki <songofacandy@gmail.com>
This is similar to the situation with threading._DummyThread. The methods (incl. __del__()) of interpreters.Interpreter objects must be careful with interpreters not created by interpreters.create(). The simplest thing to start with is to disable any method that modifies or runs in the interpreter. As part of this, the runtime keeps track of where an interpreter was created. We also handle interpreter "refcounts" properly.
The free-threaded build does not currently support the combination of
single-phase init modules and non-isolated subinterpreters. Ensure that
`check_multi_interp_extensions` is always `True` for subinterpreters in
the free-threaded build so that importing these modules raises an
`ImportError`.
The `__has_feature(thread_sanitizer)` is a Clang-ism. Although new
versions of GCC implement `__has_feature`, the `defined(__has_feature)`
check still fails on GCC so we don't use that code path.
This keeps track of the per-thread total reference count operations in
PyThreadState in the free-threaded builds. The count is merged into the
interpreter's total when the thread exits.
Most mutable data is protected by a striped lock that is keyed on the
referenced object's address. The weakref's hash is protected using the
weakref's per-object lock.
Note that this only affects free-threaded builds. Apart from some minor
refactoring, the added code is all either gated by `ifdef`s or is a no-op
(e.g. `Py_BEGIN_CRITICAL_SECTION`).
Introduce a unified 16-bit backoff counter type (``_Py_BackoffCounter``),
shared between the Tier 1 adaptive specializer and the Tier 2 optimizer. The
API used for adaptive specialization counters is changed but the behavior is
(supposed to be) identical.
The behavior of the Tier 2 counters is changed:
- There are no longer dynamic thresholds (we never varied these).
- All counters now use the same exponential backoff.
- The counter for ``JUMP_BACKWARD`` starts counting down from 16.
- The ``temperature`` in side exits starts counting down from 64.
This merges all `_CHECK_STACK_SPACE` uops in a trace into a single `_CHECK_STACK_SPACE_OPERAND` uop that checks whether there is enough stack space for all calls included in the entire trace.
These helpers make it easier to customize and inspect the config used to initialize interpreters. This is especially valuable in our tests. I found inspiration from the PyConfig API for the PyInterpreterConfig dict conversion stuff. As part of this PR I've also added a bunch of tests.
Mark the swap operations as critical sections.
Add an internal Py_BEGIN_CRITICAL_SECTION_MUT API that takes a PyMutex
pointer instead of a PyObject pointer.
Change old space bit of young objects from 0 to gcstate->visited_space.
This ensures that any object created *and* collected during cycle GC has the bit set correctly.
Use support.infinite_recursion() in test_recursive_pickle() of
test_functools to prevent a stack overflow on "ARM64 Windows
Non-Debug" buildbot.
Lower Py_C_RECURSION_LIMIT to 1,000 frames on Windows ARM64.
Changes to the function version cache:
- In addition to the function object, also store the code object,
and allow the latter to be retrieved even if the function has been evicted.
- Stop assigning new function versions after a critical attribute (e.g. `__code__`)
has been modified; the version is permanently reset to zero in this case.
- Changes to `__annotations__` are no longer considered critical. (This fixes gh-109998.)
Changes to the Tier 2 optimization machinery:
- If we cannot map a function version to a function, but it is still mapped to a code object,
we continue projecting the trace.
The operand of the `_PUSH_FRAME` and `_POP_FRAME` opcodes can be either NULL,
a function object, or a code object with the lowest bit set.
This allows us to trace through code that calls an ephemeral function,
i.e., a function that may not be alive when we are constructing the executor,
e.g. a generator expression or certain nested functions.
We will lose globals removal inside such functions,
but we can still do other peephole operations
(and even possibly [call inlining](https://github.com/python/cpython/pull/116290),
if we decide to do it), which only need the code object.
As before, if we cannot retrieve the code object from the cache, we stop projecting.
Split `_PyThreadState_DeleteExcept` into two functions:
- `_PyThreadState_RemoveExcept` removes all thread states other than one
passed as an argument. It returns the removed thread states as a
linked list.
- `_PyThreadState_DeleteList` deletes those dead thread states. It may
call destructors, so we want to "start the world" before calling
`_PyThreadState_DeleteList` to avoid potential deadlocks.
I added it quite a while ago as a strategy for managing interpreter lifetimes relative to the PEP 554 (now 734) implementation. Relatively recently I refactored that implementation to no longer rely on InterpreterID objects. Thus now I'm removing it.
Add Py_GetConstant() and Py_GetConstantBorrowed() functions.
In the limited C API version 3.13, getting Py_None, Py_False,
Py_True, Py_Ellipsis and Py_NotImplemented singletons is now
implemented as function calls at the stable ABI level to hide
implementation details. Getting these constants still return borrowed
references.
Add _testlimitedcapi/object.c and test_capi/test_object.py to test
Py_GetConstant() and Py_GetConstantBorrowed() functions.
Mostly we unify the two different implementations of the conversion code (from PyObject * to int64_t. We also drop the PyArg_ParseTuple()-style converter function, as well as rename and move PyInterpreterID_LookUp().
Somehow we ended up with two separate counter variables tracking "the next function version".
Most likely this was a historical accident where an old branch was updated incorrectly.
This PR merges the two counters into a single one: `interp->func_state.next_version`.
Defining a type twice is a C11 feature and so makes the C API
incompatible with C99. Fix the issue by only defining the type once.
Example of warning (treated as an error):
In file included from Include/Python.h:122:
Include/cpython/optimizer.h:77:3: error: redefinition of typedef
'_PyOptimizerObject' is a C11 feature [-Werror,-Wtypedef-redefinition]
} _PyOptimizerObject;
^
build/Include/cpython/optimizer.h:60:35: note: previous definition is here
typedef struct _PyOptimizerObject _PyOptimizerObject;
^
There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:
1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
However, C holds it, so the assertion fails.
The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.
There are two main parts to this PR:
1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
The event is set by the runtime prior to the thread being cleared (in the same
place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
`PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
`_tstate_lock`s.
[^1]: 441affc9e7/Lib/threading.py (L1201)
[^2]: 441affc9e7/Lib/threading.py (L1115)
[^3]: 8194653279
---------
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Return 0 on success. Set an exception and return -1 on error.
Fix os.timerfd_settime(): properly report exceptions on
_PyTime_FromSecondsDouble() failure.
No longer export _PyTime_FromSecondsDouble().
In free-threaded builds, running with `PYTHON_GIL=0` will now disable the
GIL. Follow-up issues track work to re-enable the GIL when loading an
incompatible extension, and to disable the GIL by default.
In order to support re-enabling the GIL at runtime, all GIL-related data
structures are initialized as usual, and disabling the GIL simply sets a flag
that causes `take_gil()` and `drop_gil()` to return early.
Revert "gh-115398: Increment PyExpat_CAPI_MAGIC for SetReparseDeferralEnabled addition (GH-116301)"
This reverts part of commit eda2963378. Why? this comment buried in an earlier code review explains:
I checked again how that value is used in practice, it's here:
0c80da4c14/Modules/_elementtree.c (L4363-L4372)
Based on that code my understanding is that loading bigger structs from the future is considered okay unless `PyExpat_CAPI_MAGIC` differs, which implies that (1) magic needs to stay the same to support loading the future from the past and (2) that `PyExpat_CAPI_MAGIC` should only ever change for changes that do not increase size (but keep it constant).
To summarize, that supports your argument.
I checked branches 3.8, 3.9, 3.10, 3.11, 3.12 now and they all have the same comparison code there so reverting that magic string bump will support seamless backporting.
This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.
Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.
When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.
This sets `MI_DEBUG` to `2` in debug builds to enable `mi_assert_internal()`
calls. Expensive internal assertions are not enabled.
This also disables an assertion in free-threaded builds that would be
triggered by the free-threaded GC because we traverse heaps that are not
owned by the current thread.
* Increment PyExpat_CAPI_MAGIC due to SetReparseDeferralEnabled addition.
This is a followup to git commit
6a95676bb5 from Github PR #115623.
* RESTify news API list.
Make `_thread.ThreadHandle` thread-safe in free-threaded builds
We protect the mutable state of `ThreadHandle` using a `_PyOnceFlag`.
Concurrent operations (i.e. `join` or `detach`) on `ThreadHandle` block
until it is their turn to execute or an earlier operation succeeds.
Once an operation has been applied successfully all future operations
complete immediately.
The `join()` method is now idempotent. It may be called multiple times
but the underlying OS thread will only be joined once. After `join()`
succeeds, any future calls to `join()` will succeed immediately.
The internal thread handle `detach()` method has been removed.
Allow controlling Expat >=2.6.0 reparse deferral (CVE-2023-52425) by adding five new methods:
- `xml.etree.ElementTree.XMLParser.flush`
- `xml.etree.ElementTree.XMLPullParser.flush`
- `xml.parsers.expat.xmlparser.GetReparseDeferralEnabled`
- `xml.parsers.expat.xmlparser.SetReparseDeferralEnabled`
- `xml.sax.expatreader.ExpatParser.flush`
Based on the "flush" idea from https://github.com/python/cpython/pull/115138#issuecomment-1932444270 .
### Notes
- Please treat as a security fix related to CVE-2023-52425.
Includes code suggested-by: Snild Dolkow <snild@sony.com>
and by core dev Serhiy Storchaka.
This changes the `sym_set_...()` functions to return a `bool` which is `false`
when the symbol is `bottom` after the operation.
All calls to such functions now check this result and go to `hit_bottom`,
a special error label that prints a different message and then reports
that it wasn't able to optimize the trace. No executor will be produced
in this case.
This undoes the *temporary* default disabling of the T2 optimizer pass in gh-115860.
- Add a new test that reproduces Brandt's example from gh-115859; it indeed crashes before gh-116028 with PYTHONUOPSOPTIMIZE=1
- Re-enable the optimizer pass in T2, stop checking PYTHONUOPSOPTIMIZE
- Rename the env var to disable T2 entirely to PYTHON_UOPS_OPTIMIZE (must be explicitly set to 0 to disable)
- Fix skipIf conditions on tests in test_opt.py accordingly
- Export sym_is_bottom() (for debugging)
- Fix various things in the `_BINARY_OP_` specializations in the abstract interpreter:
- DECREF(temp)
- out-of-space check after sym_new_const()
- add sym_matches_type() checks, so even if we somehow reach a binary op with symbolic constants of the wrong type on the stack we won't trigger the type assert
- Any `sym_set_...` call that attempts to set conflicting information
cause the symbol to become `bottom` (contradiction).
- All `sym_is...` and similar calls return false or NULL for `bottom`.
- Everything's tested.
- The tests still pass with `PYTHONUOPSOPTIMIZE=1`.
* Rename _Py_UOpsAbstractInterpContext to _Py_UOpsContext and _Py_UOpsSymType to _Py_UopsSymbol.
* #define shortened form of _Py_uop_... names for improved readability.
PyTime_t no longer uses an arbitrary unit, it's always a number of
nanoseconds (64-bit signed integer).
* Rename _PyTime_FromNanosecondsObject() to _PyTime_FromLong().
* Rename _PyTime_AsNanosecondsObject() to _PyTime_AsLong().
* Remove pytime_from_nanoseconds().
* Remove pytime_as_nanoseconds().
* Remove _PyTime_FromNanoseconds().
Remove references to the old names _PyTime_MIN
and _PyTime_MAX, now that PyTime_MIN and
PyTime_MAX are public.
Replace also _PyTime_MIN with PyTime_MIN.
This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
<pycore_time.h> include is no longer needed to get the PyTime_t type
in internal header files. This type is now provided by <Python.h>
include. Add <pycore_time.h> includes to C files instead.
This avoids filling the memory occupied by ob_tid, ob_ref_local, and
ob_ref_shared with debug bytes (e.g., 0xDD) in mimalloc in the
free-threaded build.
This change adds an `eval_breaker` field to `PyThreadState`. The primary
motivation is for performance in free-threaded builds: with thread-local eval
breakers, we can stop a specific thread (e.g., for an async exception) without
interrupting other threads.
The source of truth for the global instrumentation version is stored in the
`instrumentation_version` field in PyInterpreterState. Threads usually read the
version from their local `eval_breaker`, where it continues to be colocated
with the eval breaker bits.