# String interning

*Interned* strings are conceptually part of an interpreter-global
*set* of interned strings, meaning that:

- no two interned strings have the same content (across an interpreter);
- two interned strings can be safely compared using pointer equality
  (Python `is`).

This is used to optimize dict and attribute lookups, among other things.
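
From Python code, the public `sys.intern` function exposes the same property. The following is a minimal sketch; the variable names are illustrative only.

```python
import sys

# Build two equal strings at run time, so that neither is a compile-time
# constant that the compiler may already have interned.
words = ["string", "interning"]
a = " ".join(words)
b = " ".join(words)
assert a == b  # equal content; the two may or may not be the same object

# sys.intern returns the canonical object for the given content.
ia = sys.intern(a)
ib = sys.intern(b)

# Interned strings with equal content are the same object, so an identity
# comparison is sufficient.
assert ia is ib
```
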

Python uses three different mechanisms to intern strings:

- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros.
  These are statically allocated, and collected using `make regen-global-objects`
  (`Tools/build/generate_global_objects.py`), which generates code
  for declaration, initialization and finalization.

  The difference between the two kinds is not important. (A `_Py_ID` string is
  a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain
  non-identifier characters, so it needs a separate C-compatible name.)

  The empty string is in this category (as `_Py_STR(empty)`).

  These singletons are interned in a runtime-global lookup table,
  `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
  at runtime initialization.

- The 256 possible one-character latin-1 strings are singletons that
  can be retrieved with `_Py_LATIN1_CHR(c)`. They are stored in runtime-global
  arrays, `_PyRuntime.static_objects.strings.ascii` and
  `_PyRuntime.static_objects.strings.latin1`.

  These are NOT interned at startup in the normal build.
  In the free-threaded build, they are; this avoids modifying the
  global lookup table after threads are started.

  Interning a one-char latin-1 string will always intern the corresponding
  singleton.

- All other strings are allocated dynamically, and have their
  `_PyUnicode_STATE(s).statically_allocated` flag set to zero.
  When interned, such strings are added to an interpreter-wide dict,
  `PyInterpreterState.cached_objects.interned_strings`.

  The key and value of each entry in this dict reference the same object.

The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`)
are disjoint.
If you have such a singleton, it (and no other copy) will be interned.
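
On CPython, the sharing of one-character latin-1 strings can be observed from Python code, although it is an implementation detail rather than a language guarantee. The sketch below assumes CPython; the helper name `latin1_strings_are_shared` is hypothetical.

```python
import sys


def latin1_strings_are_shared():
    """Return True if chr(c) yields the same object twice for every c < 256."""
    # Both operands of `is` are alive at the same time, so a True result
    # cannot be caused by one object being freed and its address reused.
    return all(chr(c) is chr(c) for c in range(256))


if sys.implementation.name == "cpython":
    # CPython hands out one shared object per latin-1 code point.
    assert latin1_strings_are_shared()
```
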

## Immortality and reference counting

Invariant: Every immortal string is interned, *except* the one-char latin-1
singletons (which may or may not be interned).

In practice, this means that you must not use `_Py_SetImmortal` on
a string. (If you know it's already immortal, don't immortalize it;
if you know it's not interned you might be immortalizing a redundant copy;
if it's interned and mortal it needs extra processing in
`_PyUnicode_InternImmortal`.)

The converse is not true: interned strings can be mortal.
For mortal interned strings:

- the 2 references from the interned dict (key & value) are excluded from
  their refcount
- the deallocator (`unicode_dealloc`) removes the string from the interned dict
- at shutdown, when the interned dict is cleared, the references are added back
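
The bookkeeping above makes the interned dict behave much like a weak table: membership does not keep a string alive, and deallocation removes the entry. The following Python sketch is only an analogy, not how CPython implements it. `Name`, `table` and `intern_mortal` are hypothetical names, and a wrapper class is used because `str` objects cannot be weakly referenced.

```python
import gc
import weakref


class Name:
    """Hypothetical stand-in for a string object."""

    def __init__(self, text):
        self.text = text


# Values are held weakly, so the table alone never keeps an entry alive.
table = weakref.WeakValueDictionary()


def intern_mortal(obj):
    """Return the canonical object for obj.text, registering obj if needed."""
    return table.setdefault(obj.text, obj)


a = intern_mortal(Name("spam"))
b = intern_mortal(Name("spam"))
assert a is b            # the second, redundant copy was discarded
assert "spam" in table   # the entry exists while a strong reference does

del a, b
gc.collect()             # only needed on implementations without refcounting
assert "spam" not in table  # the entry went away together with the object
```
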

As with any type, you should only immortalize strings that will live until
interpreter shutdown.
We currently also immortalize strings contained in code objects and similar,
specifically in the compiler and in `marshal`.
These are “close enough” to immortal: even in use cases like hot reloading
or `eval`-ing user input, the number of distinct identifiers and string
constants is expected to stay low.

## Internal API

We have the following *internal* API for interning:

- `_PyUnicode_InternMortal`: just intern the string
- `_PyUnicode_InternImmortal`: intern, and immortalize the result
- `_PyUnicode_InternStatic`: intern a static singleton (`_Py_STR`, `_Py_ID`
  or one-byte). Not for general use.

All take an interpreter state, and a pointer to a `PyObject*` which they
modify in place.

The functions take ownership of (“steal”) the reference to their argument,
and update the argument with a *new* reference.
This means:

- They're “reference neutral”.
- They must not be called with a borrowed reference.

## State

The intern state (retrieved by `PyUnicode_CHECK_INTERNED(s)`;
stored in `_PyUnicode_STATE(s).interned`) can be:

- `SSTATE_NOT_INTERNED` (defined as 0, which is useful in a boolean context)
- `SSTATE_INTERNED_MORTAL` (1)
- `SSTATE_INTERNED_IMMORTAL` (2)
- `SSTATE_INTERNED_IMMORTAL_STATIC` (3)

The valid transitions between these states are:

- For dynamically allocated strings:

  - 0 -> 1 (`_PyUnicode_InternMortal`)
  - 1 -> 2 or 0 -> 2 (`_PyUnicode_InternImmortal`)

  Using `_PyUnicode_InternStatic` on these is an error; the other cases
  don't change the state.

- One-char latin-1 singletons can be interned (0 -> 3) using any interning
  function; after that the functions don't change the state.

- Other statically allocated strings are interned (0 -> 3) at runtime init;
  after that, no interning function changes the state.
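
The transition rules above can be summarized as a small state machine. This is a toy model in Python, not CPython code: the `State` enum mirrors the `SSTATE_*` values, while `next_state` and its string-valued `func` argument are hypothetical. The model does not distinguish one-char latin-1 singletons from other statically allocated strings, since both can only move from 0 to 3.

```python
from enum import IntEnum


class State(IntEnum):
    NOT_INTERNED = 0
    INTERNED_MORTAL = 1
    INTERNED_IMMORTAL = 2
    INTERNED_IMMORTAL_STATIC = 3


def next_state(state, func, statically_allocated):
    """Return the intern state after calling one of the interning functions.

    `func` is "mortal", "immortal" or "static", standing in for
    _PyUnicode_InternMortal, _PyUnicode_InternImmortal and
    _PyUnicode_InternStatic respectively.
    """
    if func not in ("mortal", "immortal", "static"):
        raise ValueError(f"unknown interning function: {func!r}")
    if statically_allocated:
        # 0 -> 3 with any function; a string already in state 3 stays there.
        return State.INTERNED_IMMORTAL_STATIC
    if func == "static":
        raise ValueError("static interning needs a statically allocated string")
    if func == "mortal":
        # 0 -> 1; strings that are already interned keep their state.
        return max(state, State.INTERNED_MORTAL)
    # "immortal": 0 -> 2 and 1 -> 2; state 2 is unchanged.
    return max(state, State.INTERNED_IMMORTAL)
```

For example, `next_state(State.INTERNED_MORTAL, "immortal", False)` gives `State.INTERNED_IMMORTAL`, while requesting mortal interning of an already immortal string leaves it immortal.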