Add an option (--enable-experimental-jit for configure-based builds
or --experimental-jit for PCbuild-based ones) to build an
*experimental* just-in-time compiler, based on copy-and-patch (https://fredrikbk.com/publications/copy-and-patch.pdf).
See Tools/jit/README.md for more information on how to install the required build-time tooling.
For interpreters that share state with the main interpreter, this points
to the same static memory structure. For interpreters with their own
obmalloc state, it is heap allocated. Add free_obmalloc_arenas() which
will free the obmalloc arenas and radix tree structures for interpreters
with their own obmalloc state.
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
* gh-112529: Implement GC for free-threaded builds
This implements a mark and sweep GC for the free-threaded builds of
CPython. The implementation relies on mimalloc to find GC tracked
objects (i.e., "containers").
* Bring in a subset of biased reference counting:
https://github.com/colesbury/nogil/commit/b6b12a9a94e
The NoGIL branch has functions for attempting to do an incref on an object which may or may not be in flight. This just brings those functions over so that they will be usable from in the dict implementation to get items w/o holding a lock.
There's a handful of small simple modifications:
Adding inline to the force inline functions to avoid a warning, and switching from _Py_ALWAYS_INLINE to Py_ALWAYS_INLINE as that's available
Remove _Py_REF_LOCAL_SHIFT as it doesn't exist yet (and is currently 0 in the 3.12 nogil branch anyway)
ob_ref_shared is currently Py_ssize_t and not uint32_t, so use that
_PY_LIKELY doesn't exist, so drop it
_Py_ThreadLocal becomes _Py_IsOwnedByCurrentThread
Add '_PyInterpreterState_GET()' to _Py_IncRefTotal calls.
Co-Authored-By: Sam Gross <colesbury@gmail.com>
The `--disable-gil` builds occasionally need to pause all but one thread. Some
examples include:
* Cyclic garbage collection, where this is often called a "stop the world event"
* Before calling `fork()`, to ensure a consistent state for internal data structures
* During interpreter shutdown, to ensure that daemon threads aren't accessing Python objects
This adds the following functions to implement global and per-interpreter pauses:
* `_PyEval_StopTheWorldAll()` and `_PyEval_StartTheWorldAll()` (for the global runtime)
* `_PyEval_StopTheWorld()` and `_PyEval_StartTheWorld()` (per-interpreter)
(The function names may change.)
These functions are no-ops outside of the `--disable-gil` build.
This adds support for visiting abandoned pages in mimalloc and improves
the performance of the page visiting code. Abandoned pages contain
memory blocks from threads that have exited. At some point, they may be
later reclaimed by other threads. We still need to visit those pages in
the free-threaded GC because they contain live objects.
This also reduces the overhead of visiting mimalloc pages:
* Special cases for full, empty, and pages containing only a single
block.
* Fix free_map to use one bit instead of one byte per block.
* Use fast integer division by a constant algorithm when computing
block offset from block size and index.
* gh-112529: Use GC heaps for GC allocations in free-threaded builds
The free-threaded build's garbage collector implementation will need to
find GC objects by traversing mimalloc heaps. This hooks up the
allocation calls with the correct heaps by using a thread-local
"current_obj_heap" variable.
* Refactor out setting heap based on type
* gh-112529: Track if debug allocator is used as underlying allocator
The GC implementation for free-threaded builds will need to accurately
detect if the debug allocator is used because it affects the offset of
the Python object from the beginning of the memory allocation. The
current implementation of `_PyMem_DebugEnabled` only considers if the
debug allocator is the outer-most allocator; it doesn't handle the case
of "hooks" like tracemalloc being used on top of the debug allocator.
This change enables more accurate detection of the debug allocator by
tracking when debug hooks are enabled.
* Simplify _PyMem_DebugEnabled
gh-113750: Fix object resurrection on free-threaded builds
This avoids the undesired re-initializing of fields like `ob_gc_bits`,
`ob_mutex`, and `ob_tid` when an object is resurrected due to its
finalizer being called.
This change has no effect on the default (with GIL) build.
This splits part of Modules/gcmodule.c of into Python/gc.c, which
now contains the core garbage collection implementation. The Python
module remain in the Modules/gcmodule.c file.
* gh-112532: Tag mimalloc heaps and pages
Mimalloc pages are data structures that contain contiguous allocations
of the same block size. Note that they are distinct from operating
system pages. Mimalloc pages are contained in segments.
When a thread exits, it abandons any segments and contained pages that
have live allocations. These segments and pages may be later reclaimed
by another thread. To support GC and certain thread-safety guarantees in
free-threaded builds, we want pages to only be reclaimed by the
corresponding heap in the claimant thread. For example, we want pages
containing GC objects to only be claimed by GC heaps.
This allows heaps and pages to be tagged with an integer tag that is
used to ensure that abandoned pages are only claimed by heaps with the
same tag. Heaps can be initialized with a tag (0-15); any page allocated
by that heap copies the corresponding tag.
* Fix conversion warning
* gh-112532: Isolate abandoned segments by interpreter
Mimalloc segments are data structures that contain memory allocations along
with metadata. Each segment is "owned" by a thread. When a thread exits,
it abandons its segments to a global pool to be later reclaimed by other
threads. This changes the pool to be per-interpreter instead of process-wide.
This will be important for when we use mimalloc to find GC objects in the
`--disable-gil` builds. We want heaps to only store Python objects from a
single interpreter. Absent this change, the abandoning and reclaiming process
could break this isolation.
* Add missing '&_mi_abandoned_default' to 'tld_empty'
* gh-112532: Use separate mimalloc heaps for GC objects
In `--disable-gil` builds, we now use four separate heaps in
anticipation of using mimalloc to find GC objects when the GIL is
disabled. To support this, we also make a few changes to mimalloc:
* `mi_heap_t` and `mi_tld_t` initialization is split from allocation.
This allows us to have a `mi_tld_t` per-`PyThreadState`, which is
important to keep interpreter isolation, since the same OS thread may
run in multiple interpreters (using different PyThreadStates.)
* Heap abandoning (mi_heap_collect_ex) can now be called from a
different thread than the one that created the heap. This is necessary
because we may clear and delete the containing PyThreadStates from a
different thread during finalization and after fork().
* Use enum instead of defines and guard mimalloc includes.
* The enum typedef will be convenient for future PRs that use the type.
* Guarding the mimalloc includes allows us to unconditionally include
pycore_mimalloc.h from other header files that rely on things like
`struct _mimalloc_thread_state`.
* Only define _mimalloc_thread_state in Py_GIL_DISABLED builds
We need the TracebackException of uncaught exceptions for a single purpose: the error display. Thus we only need to pass the formatted error display between interpreters. Passing a pickled TracebackException is overkill.
The `PyThreadState_Clear()` function must only be called with the GIL
held and must be called from the same interpreter as the passed in
thread state. Otherwise, any Python objects on the thread state may be
destroyed using the wrong interpreter, leading to memory corruption.
This is also important for `Py_GIL_DISABLED` builds because free lists
will be associated with PyThreadStates and cleared in
`PyThreadState_Clear()`.
This fixes two places that called `PyThreadState_Clear()` from the wrong
interpreter and adds an assertion to `PyThreadState_Clear()`.
When an exception is uncaught in Interpreter.exec_sync(), it helps to show that exception's error display if uncaught in the calling interpreter. We do so here by generating a TracebackException in the subinterpreter and passing it between interpreters using pickle.
* gh-110820: Make sure processor specific defines are correct for Universal 2 build on macOS
A number of processor specific defines are different for x86-64 and
arm64, and need to be adjusted in pymacconfig.h.
* remove debug stuf
This replaces some usages of PyThread_type_lock with PyMutex, which does not require memory allocation to initialize.
This simplifies some of the runtime initialization and is also one step towards avoiding changing the default raw memory allocator during initialize/finalization, which can be non-thread-safe in some circumstances.
Every PyThreadState instance is now actually a _PyThreadStateImpl.
It is safe to cast from `PyThreadState*` to `_PyThreadStateImpl*` and back.
The _PyThreadStateImpl will contain fields that we do not want to expose
in the public C API.
This updates `dtoa.c` to avoid using the Bigint free-list in --disable-gil builds and
to pre-computes the needed powers of 5 during interpreter initialization.
* gh-111962: Make dtoa thread-safe in `--disable-gil` builds.
This avoids using the Bigint free-list in `--disable-gil` builds
and pre-computes the needed powers of 5 during interpreter initialization.
* Fix size of cached powers of 5 array.
We need the powers of 5 up to 5**512 because we only jump straight to
underflow when the exponent is less than -512 (or larger than 308).
* Rename Py_NOGIL to Py_GIL_DISABLED
* Changes from review
* Fix assertion placement
* Implement _Py_HashPointerRaw() as a static inline function.
* Add Py_HashPointer() tests to test_capi.test_hash.
* Keep _Py_HashPointer() function as an alias to Py_HashPointer().
Change the declaration of the keywords parameter in functions
PyArg_ParseTupleAndKeywords() and PyArg_VaParseTupleAndKeywords() from `char **`
to `char * const *` in C and `const char * const *` in C++.
It makes these functions compatible with argument of type `const char * const *`,
`const char **` or `char * const *` in C++ and `char * const *` in C
without explicit type cast.
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Use a fraction internally in the _PyTime API to reduce the risk of
integer overflow: simplify the fraction using Greatest Common
Divisor (GCD). The fraction API is used by time functions:
perf_counter(), monotonic() and process_time().
For example, QueryPerformanceFrequency() usually returns 10 MHz on
Windows 10 and newer. The fraction SEC_TO_NS / frequency =
1_000_000_000 / 10_000_000 can be simplified to 100 / 1.
* Add _PyTimeFraction type.
* Add functions:
* _PyTimeFraction_Set()
* _PyTimeFraction_Mul()
* _PyTimeFraction_Resolution()
* No longer check "numer * denom <= _PyTime_MAX" in
_PyTimeFraction_Set(). _PyTimeFraction_Mul() uses _PyTime_Mul()
which handles integer overflow.
* Move _PyRuntimeState.time to _posixstate.ticks_per_second and
time_module_state.ticks_per_second.
* Add time_module_state.clocks_per_second.
* Rename _PyTime_GetClockWithInfo() to py_clock().
* Rename _PyTime_GetProcessTimeWithInfo() to py_process_time().
* Add process_time_times() helper function, called by
py_process_time().
* os.times() is now always built: no longer rely on HAVE_TIMES.
If Py_NOGIL is defined and Py_SET_REFCNT() is called with a reference
count larger than UINT32_MAX, make the object immortal.
Set _Py_IMMORTAL_REFCNT constant type to Py_ssize_t to fix the
following compiler warning:
Include/internal/pycore_global_objects_fini_generated.h:14:24:
warning: comparison of integers of different signs: 'Py_ssize_t'
(aka 'long') and 'unsigned int' [-Wsign-compare]
if (Py_REFCNT(obj) < _Py_IMMORTAL_REFCNT) {
~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~
Add support for TLS-PSK (pre-shared key) to the ssl module.
---------
Co-authored-by: Oleg Iarygin <oleg@arhadthedev.net>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
This makes the Tier 2 interpreter a little faster.
I calculated by about 3%,
though I hesitate to claim an exact number.
This starts by doubling the trace size limit (to 512),
making it more likely that loops fit in a trace.
The rest of the approach is to only load
`oparg` and `operand` in cases that use them.
The code generator know when these are used.
For `oparg`, it will conditionally emit
```
oparg = CURRENT_OPARG();
```
at the top of the case block.
(The `oparg` variable may be referenced multiple times
by the instructions code block, so it must be in a variable.)
For `operand`, it will use `CURRENT_OPERAND()` directly
instead of referencing the `operand` variable,
which no longer exists.
(There is only one place where this will be used.)
This uses the new mechanism whereby certain uops
are replaced by others during translation,
using the `_PyUop_Replacements` table.
We further special-case the `_FOR_ITER_TIER_TWO` uop
to update the deoptimization target to point
just past the corresponding `END_FOR` opcode.
Two tiny code cleanups are also part of this PR.
- Double max trace size to 256
- Add a dependency on executor_cases.c.h for ceval.o
- Mark `_SPECIALIZE_UNPACK_SEQUENCE` as `TIER_ONE_ONLY`
- Add debug output back showing the optimized trace
- Bunch of cleanups to Tools/cases_generator/
* Run again test_ast_recursion_limit() on WASI platform.
* Add _testinternalcapi.get_c_recursion_remaining().
* Fix test_ast and test_sys_settrace: test_ast_recursion_limit() and
test_trace_unpack_long_sequence() now adjust the maximum recursion
depth depending on the the remaining C recursion.