gh-119786: [doc] more consistent syntax in InternalDocs (#125815)

Irit Katriel 2024-10-21 23:37:31 +01:00 committed by GitHub
parent 4848b0b92c
commit d0bfff47fb
6 changed files with 371 additions and 420 deletions


@@ -31,8 +31,7 @@ although these are not fundamental and may change:
## Example family
The `LOAD_GLOBAL` instruction (in
[Python/bytecodes.c](https://github.com/python/cpython/blob/main/Python/bytecodes.c))
The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
already has an adaptive family that serves as a relatively simple example.
The `LOAD_GLOBAL` instruction performs adaptive specialization,
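The specialization is observable from Python; a minimal sketch, assuming CPython 3.11+ (the specialized opcode names vary between versions):

```python
import dis

def f():
    return len   # a LOAD_GLOBAL of a builtin

for _ in range(1000):   # warm up so the adaptive interpreter can specialize
    f()

# With adaptive=True, dis shows the currently specialized instructions,
# e.g. LOAD_GLOBAL_BUILTIN in place of the generic LOAD_GLOBAL.
dis.dis(f, adaptive=True)
```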


@@ -7,17 +7,16 @@ Abstract
In CPython, the compilation from source code to bytecode involves several steps:
1. Tokenize the source code
[Parser/lexer/](https://github.com/python/cpython/blob/main/Parser/lexer/)
and [Parser/tokenizer/](https://github.com/python/cpython/blob/main/Parser/tokenizer/).
1. Tokenize the source code [Parser/lexer/](../Parser/lexer/)
and [Parser/tokenizer/](../Parser/tokenizer/).
2. Parse the stream of tokens into an Abstract Syntax Tree
[Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c).
[Parser/parser.c](../Parser/parser.c).
3. Transform AST into an instruction sequence
[Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c).
[Python/compile.c](../Python/compile.c).
4. Construct a Control Flow Graph and apply optimizations to it
[Python/flowgraph.c](https://github.com/python/cpython/blob/main/Python/flowgraph.c).
[Python/flowgraph.c](../Python/flowgraph.c).
5. Emit bytecode based on the Control Flow Graph
[Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c).
[Python/assemble.c](../Python/assemble.c).
This document outlines how these steps of the process work.
@@ -36,12 +35,10 @@ of tokens rather than a stream of characters which is more common with PEG
parsers.
The grammar file for Python can be found in
[Grammar/python.gram](https://github.com/python/cpython/blob/main/Grammar/python.gram).
The definitions for literal tokens (such as ``:``, numbers, etc.) can be found in
[Grammar/Tokens](https://github.com/python/cpython/blob/main/Grammar/Tokens).
Various C files, including
[Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c)
are generated from these.
[Grammar/python.gram](../Grammar/python.gram).
The definitions for literal tokens (such as `:`, numbers, etc.) can be found in
[Grammar/Tokens](../Grammar/Tokens). Various C files, including
[Parser/parser.c](../Parser/parser.c) are generated from these.
See Also:
@@ -63,7 +60,7 @@ specification of the AST nodes is specified using the Zephyr Abstract
Syntax Definition Language (ASDL) [^1], [^2].
The definition of the AST nodes for Python is found in the file
[Parser/Python.asdl](https://github.com/python/cpython/blob/main/Parser/Python.asdl).
[Parser/Python.asdl](../Parser/Python.asdl).
Each AST node (representing statements, expressions, and several
specialized types, like list comprehensions and exception handlers) is
@@ -87,14 +84,14 @@ approach and syntax:
The preceding example describes two different kinds of statements and an
expression: function definitions, return statements, and yield expressions.
All three kinds are considered of type ``stmt`` as shown by ``|`` separating
All three kinds are considered of type `stmt` as shown by `|` separating
the various kinds. They all take arguments of various kinds and amounts.
Modifiers on the argument type specify the number of values needed; ``?``
means it is optional, ``*`` means 0 or more, while no modifier means only one
value for the argument and it is required. ``FunctionDef``, for instance,
takes an ``identifier`` for the *name*, ``arguments`` for *args*, zero or more
``stmt`` arguments for *body*, and zero or more ``expr`` arguments for
Modifiers on the argument type specify the number of values needed; `?`
means it is optional, `*` means 0 or more, while no modifier means only one
value for the argument and it is required. `FunctionDef`, for instance,
takes an `identifier` for the *name*, `arguments` for *args*, zero or more
`stmt` arguments for *body*, and zero or more `expr` arguments for
*decorators*.
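These modifiers are visible from Python through the `ast` module; a small sketch:

```python
import ast

tree = ast.parse("def f(x):\n    return x")
fn = tree.body[0]               # a FunctionDef node
print(fn.name)                  # 'f'  -- exactly one identifier
print(type(fn.args).__name__)   # 'arguments' -- exactly one arguments node
print(len(fn.body))             # 1   -- zero or more stmt nodes ('*')
print(fn.decorator_list)        # []  -- zero or more expr nodes ('*')
```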
Do notice that something like 'arguments', which is a node type, is
@@ -132,9 +129,9 @@ The statement definitions above generate the following C structure type:
```
Also generated are a series of constructor functions that allocate (in
this case) a ``stmt_ty`` struct with the appropriate initialization. The
``kind`` field specifies which component of the union is initialized. The
``FunctionDef()`` constructor function sets 'kind' to ``FunctionDef_kind`` and
this case) a `stmt_ty` struct with the appropriate initialization. The
`kind` field specifies which component of the union is initialized. The
`FunctionDef()` constructor function sets 'kind' to `FunctionDef_kind` and
initializes the *name*, *args*, *body*, and *attributes* fields.
See also
@@ -156,13 +153,13 @@ In general, unless you are working on the critical core of the compiler, memory
management can be completely ignored. But if you are working at either the
very beginning of the compiler or the end, you need to care about how the arena
works. All code relating to the arena is in either
[Include/internal/pycore_pyarena.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_pyarena.h)
or [Python/pyarena.c](https://github.com/python/cpython/blob/main/Python/pyarena.c).
[Include/internal/pycore_pyarena.h](../Include/internal/pycore_pyarena.h)
or [Python/pyarena.c](../Python/pyarena.c).
``PyArena_New()`` will create a new arena. The returned ``PyArena`` structure
`PyArena_New()` will create a new arena. The returned `PyArena` structure
will store pointers to all memory given to it. This does the bookkeeping of
what memory needs to be freed when the compiler is finished with the memory it
used. That freeing is done with ``PyArena_Free()``. This only needs to be
used. That freeing is done with `PyArena_Free()`. This only needs to be
called in strategic areas where the compiler exits.
As stated above, in general you should not have to worry about memory
@@ -173,25 +170,25 @@ The only exception comes about when managing a PyObject. Since the rest
of Python uses reference counting, there is extra support added
to the arena to clean up each PyObject that was allocated. These cases
are very rare. However, if you've allocated a PyObject, you must tell
the arena about it by calling ``PyArena_AddPyObject()``.
the arena about it by calling `PyArena_AddPyObject()`.
Source code to AST
==================
The AST is generated from source code using the function
``_PyParser_ASTFromString()`` or ``_PyParser_ASTFromFile()``
[Parser/peg_api.c](https://github.com/python/cpython/blob/main/Parser/peg_api.c).
`_PyParser_ASTFromString()` or `_PyParser_ASTFromFile()` in
[Parser/peg_api.c](../Parser/peg_api.c).
After some checks, a helper function in
[Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c)
[Parser/parser.c](../Parser/parser.c)
begins applying production rules on the source code it receives, converting source
code to tokens and matching these tokens recursively to their corresponding rule. The
production rule's corresponding rule function is called on every match. These rule
functions follow the format `xx_rule`, where *xx* is the grammar rule
that the function handles and is automatically derived from
[Grammar/python.gram](https://github.com/python/cpython/blob/main/Grammar/python.gram) by
[Tools/peg_generator/pegen/c_generator.py](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/c_generator.py).
[Grammar/python.gram](../Grammar/python.gram) by
[Tools/peg_generator/pegen/c_generator.py](../Tools/peg_generator/pegen/c_generator.py).
Each rule function in turn creates an AST node as it goes along. It does this
by allocating all the new nodes it needs, calling the proper AST node creation
@@ -202,18 +199,15 @@ there are no more rules, an error is set and the parsing ends.
The AST node creation helper functions have the name `_PyAST_{xx}`
where *xx* is the AST node that the function creates. These are defined by the
ASDL grammar and contained in
[Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c)
(which is generated by
[Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py)
from
[Parser/Python.asdl](https://github.com/python/cpython/blob/main/Parser/Python.asdl)).
This all leads to a sequence of AST nodes stored in ``asdl_seq`` structs.
ASDL grammar and contained in [Python/Python-ast.c](../Python/Python-ast.c)
(which is generated by [Parser/asdl_c.py](../Parser/asdl_c.py)
from [Parser/Python.asdl](../Parser/Python.asdl)).
This all leads to a sequence of AST nodes stored in `asdl_seq` structs.
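The shape of the result can be inspected from Python with the `ast` module, whose node names match the ASDL definitions:

```python
import ast

print(ast.dump(ast.parse("import sys"), indent=2))
# Roughly (exact formatting varies across CPython versions):
# Module(
#   body=[
#     Import(
#       names=[
#         alias(name='sys')])],
#   type_ignores=[])
```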
To demonstrate everything explained so far, here's the
rule function responsible for a simple named import statement such as
``import sys``. Note that error-checking and debugging code has been
omitted. Removed parts are represented by ``...``.
`import sys`. Note that error-checking and debugging code has been
omitted. Removed parts are represented by `...`.
Furthermore, some comments have been added for explanation. These comments
may not be present in the actual code.
@@ -255,55 +249,52 @@ may not be present in the actual code.
To improve backtracking performance, some rules (chosen by applying a
``(memo)`` flag in the grammar file) are memoized. Each rule function checks if
`(memo)` flag in the grammar file) are memoized. Each rule function checks if
a memoized version exists and returns that if so, else it continues in the
manner stated in the previous paragraphs.
There are macros for creating and using ``asdl_xx_seq *`` types, where *xx* is
There are macros for creating and using `asdl_xx_seq *` types, where *xx* is
a type of the ASDL sequence. Three main types are defined
manually -- ``generic``, ``identifier`` and ``int``. These types are found in
[Python/asdl.c](https://github.com/python/cpython/blob/main/Python/asdl.c)
and its corresponding header file
[Include/internal/pycore_asdl.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_asdl.h).
Functions and macros for creating ``asdl_xx_seq *`` types are as follows:
manually -- `generic`, `identifier` and `int`. These types are found in
[Python/asdl.c](../Python/asdl.c) and its corresponding header file
[Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h).
Functions and macros for creating `asdl_xx_seq *` types are as follows:
``_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)``
Allocate memory for an ``asdl_generic_seq`` of the specified length
``_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)``
Allocate memory for an ``asdl_identifier_seq`` of the specified length
``_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)``
Allocate memory for an ``asdl_int_seq`` of the specified length
`_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`
Allocate memory for an `asdl_generic_seq` of the specified length
`_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`
Allocate memory for an `asdl_identifier_seq` of the specified length
`_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`
Allocate memory for an `asdl_int_seq` of the specified length
In addition to the three types mentioned above, some ASDL sequence types are
automatically generated by
[Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py)
and found in
[Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h).
automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py) and found in
[Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h).
Macros for using both manually defined and automatically generated ASDL
sequence types are as follows:
``asdl_seq_GET(asdl_xx_seq *, int)``
Get item held at a specific position in an ``asdl_xx_seq``
``asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)``
Set a specific index in an ``asdl_xx_seq`` to the specified value
`asdl_seq_GET(asdl_xx_seq *, int)`
Get item held at a specific position in an `asdl_xx_seq`
`asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`
Set a specific index in an `asdl_xx_seq` to the specified value
Untyped counterparts exist for some of the typed macros. These are useful
when a function needs to manipulate a generic ASDL sequence:
``asdl_seq_GET_UNTYPED(asdl_seq *, int)``
Get item held at a specific position in an ``asdl_seq``
``asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)``
Set a specific index in an ``asdl_seq`` to the specified value
``asdl_seq_LEN(asdl_seq *)``
Return the length of an ``asdl_seq`` or ``asdl_xx_seq``
`asdl_seq_GET_UNTYPED(asdl_seq *, int)`
Get item held at a specific position in an `asdl_seq`
`asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`
Set a specific index in an `asdl_seq` to the specified value
`asdl_seq_LEN(asdl_seq *)`
Return the length of an `asdl_seq` or `asdl_xx_seq`
Note that typed macros and functions are recommended over their untyped
counterparts. Typed macros carry out checks in debug mode and aid
debugging errors caused by incorrectly casting from ``void *``.
debugging errors caused by incorrectly casting from `void *`.
If you are working with statements, you must also worry about keeping
track of what line number generated the statement. Currently the line
number is passed as the last parameter to each ``stmt_ty`` function.
number is passed as the last parameter to each `stmt_ty` function.
See also [PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/).
@@ -333,19 +324,19 @@ else:
end()
```
The ``x < 10`` guard is represented by its own basic block that
compares ``x`` with ``10`` and then ends in a conditional jump based on
The `x < 10` guard is represented by its own basic block that
compares `x` with `10` and then ends in a conditional jump based on
the result of the comparison. This conditional jump allows the block
to point to both the body of the ``if`` and the body of the ``else``. The
``if`` basic block contains the ``f1()`` and ``f2()`` calls and points to
the ``end()`` basic block. The ``else`` basic block contains the ``g()``
call and similarly points to the ``end()`` block.
to point to both the body of the `if` and the body of the `else`. The
`if` basic block contains the `f1()` and `f2()` calls and points to
the `end()` basic block. The `else` basic block contains the `g()`
call and similarly points to the `end()` block.
Note that more complex code in the guard, the ``if`` body, or the ``else``
Note that more complex code in the guard, the `if` body, or the `else`
body may be represented by multiple basic blocks. For instance,
short-circuiting boolean logic in a guard like ``if x or y:``
will produce one basic block that tests the truth value of ``x``
and then points both (1) to the start of the ``if`` body and (2) to
short-circuiting boolean logic in a guard like `if x or y:`
will produce one basic block that tests the truth value of `x`
and then points both (1) to the start of the `if` body and (2) to
a different basic block that tests the truth value of `y`.
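The jump structure is visible in the disassembly; a sketch (`f1`, `g` and `end` are assumed to be defined elsewhere, and opcode names vary by version):

```python
import dis

def branch(x):
    if x < 10:
        f1()
    else:
        g()
    end()

# The output shows a COMPARE_OP followed by a conditional jump
# (e.g. POP_JUMP_IF_FALSE) at the boundary between basic blocks.
dis.dis(branch)
```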
CFGs are useful as an intermediate representation of the code because
@@ -354,27 +345,24 @@ they are a convenient data structure for optimizations.
AST to CFG to bytecode
======================
The conversion of an ``AST`` to bytecode is initiated by a call to the function
``_PyAST_Compile()`` in
[Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c).
The conversion of an `AST` to bytecode is initiated by a call to the function
`_PyAST_Compile()` in [Python/compile.c](../Python/compile.c).
The first step is to construct the symbol table. This is implemented by
``_PySymtable_Build()`` in
[Python/symtable.c](https://github.com/python/cpython/blob/main/Python/symtable.c).
`_PySymtable_Build()` in [Python/symtable.c](../Python/symtable.c).
This function begins by entering the starting code block for the AST (passed-in)
and then calling the proper `symtable_visit_{xx}` function (with *xx* being the
AST node type). Next, the AST tree is walked with the various code blocks that
delineate the reach of a local variable as blocks are entered and exited using
``symtable_enter_block()`` and ``symtable_exit_block()``, respectively.
`symtable_enter_block()` and `symtable_exit_block()`, respectively.
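The resulting scope information is exposed at the Python level by the `symtable` module; a small sketch:

```python
import symtable

table = symtable.symtable("def f(x):\n    y = x + 1\n    return y",
                          "<example>", "exec")
f_scope = table.get_children()[0]        # the block for f()
print(f_scope.get_name())                # 'f'
print(sorted(s.get_name() for s in f_scope.get_symbols()))   # ['x', 'y']
print(f_scope.lookup("y").is_local())    # True
```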
Once the symbol table is created, the ``AST`` is transformed by ``compiler_codegen()``
in [Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c)
into a sequence of pseudo instructions. These are similar to bytecode, but
in some cases they are more abstract, and are resolved later into actual
bytecode. The construction of this instruction sequence is handled by several
functions that break the task down by various AST node types. The functions are
all named `compiler_visit_{xx}` where *xx* is the name of the node type (such
as ``stmt``, ``expr``, etc.). Each function receives a ``struct compiler *``
Once the symbol table is created, the `AST` is transformed by `compiler_codegen()`
in [Python/compile.c](../Python/compile.c) into a sequence of pseudo instructions.
These are similar to bytecode, but in some cases they are more abstract, and are
resolved later into actual bytecode. The construction of this instruction sequence
is handled by several functions that break the task down by various AST node types.
The functions are all named `compiler_visit_{xx}` where *xx* is the name of the node
type (such as `stmt`, `expr`, etc.). Each function receives a `struct compiler *`
and `{xx}_ty` where *xx* is the AST node type. Typically these functions
consist of a large 'switch' statement, branching based on the kind of
node type passed to it. Simple things are handled inline in the
@@ -382,242 +370,224 @@ node type passed to it. Simple things are handled inline in the
functions named `compiler_{xx}` with *xx* being a descriptive name of what is
being handled.
When transforming an arbitrary AST node, use the ``VISIT()`` macro.
When transforming an arbitrary AST node, use the `VISIT()` macro.
The appropriate `compiler_visit_{xx}` function is called, based on the value
passed in for <node type> (so `VISIT({c}, expr, {node})` calls
`compiler_visit_expr({c}, {node})`). The ``VISIT_SEQ()`` macro is very similar,
`compiler_visit_expr({c}, {node})`). The `VISIT_SEQ()` macro is very similar,
but is called on AST node sequences (those values that were created as
arguments to a node that used the '*' modifier).
Emission of bytecode is handled by the following macros:
* ``ADDOP(struct compiler *, location, int)``
* `ADDOP(struct compiler *, location, int)`
add a specified opcode
* ``ADDOP_IN_SCOPE(struct compiler *, location, int)``
like ``ADDOP``, but also exits current scope; used for adding return value
* `ADDOP_IN_SCOPE(struct compiler *, location, int)`
like `ADDOP`, but also exits current scope; used for adding return value
opcodes in lambdas and closures
* ``ADDOP_I(struct compiler *, location, int, Py_ssize_t)``
* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`
add an opcode that takes an integer argument
* ``ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)``
* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`
add an opcode with the proper argument based on the position of the
specified PyObject in PyObject sequence object, but with no handling of
mangled names; used when you
need to do named lookups of objects such as globals, consts, or
parameters where name mangling is not possible and the scope of the
name is known; *TYPE* is the name of PyObject sequence
(``names`` or ``varnames``)
* ``ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)``
just like ``ADDOP_O``, but steals a reference to PyObject
* ``ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)``
just like ``ADDOP_O``, but name mangling is also handled; used for
(`names` or `varnames`)
* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`
just like `ADDOP_O`, but steals a reference to PyObject
* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`
just like `ADDOP_O`, but name mangling is also handled; used for
attribute loading or importing based on name
* ``ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)``
add the ``LOAD_CONST`` opcode with the proper argument based on the
* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`
add the `LOAD_CONST` opcode with the proper argument based on the
position of the specified PyObject in the consts table.
* ``ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)``
just like ``ADDOP_LOAD_CONST_NEW``, but steals a reference to PyObject
* ``ADDOP_JUMP(struct compiler *, location, int, basicblock *)``
* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`
just like `ADDOP_LOAD_CONST`, but steals a reference to PyObject
* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`
create a jump to a basic block
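The effect of these macros is visible in the finished code object; for instance, the argument emitted for `LOAD_CONST` is an index into the `co_consts` tuple:

```python
import dis

code = compile("x = 'hello'", "<example>", "exec")
print(code.co_consts)   # ('hello', None)
dis.dis(code)           # includes LOAD_CONST 0 ('hello'), STORE_NAME 0 (x)
```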
The ``location`` argument is a struct with the source location to be
The `location` argument is a struct with the source location to be
associated with this instruction. It is typically extracted from an
``AST`` node with the ``LOC`` macro. The ``NO_LOCATION`` can be used
`AST` node with the `LOC` macro. The `NO_LOCATION` value can be used
for *synthetic* instructions, which we do not associate with a line
number at this stage. For example, the implicit ``return None``
number at this stage. For example, the implicit `return None`
which is added at the end of a function is not associated with any
line in the source code.
There are several helper functions that will emit pseudo-instructions
and are named `compiler_{xx}()` where *xx* is what the function helps
with (``list``, ``boolop``, etc.). A rather useful one is ``compiler_nameop()``.
with (`list`, `boolop`, etc.). A rather useful one is `compiler_nameop()`.
This function looks up the scope of a variable and, based on the
expression context, emits the proper opcode to load, store, or delete
the variable.
Once the instruction sequence is created, it is transformed into a CFG
by ``_PyCfg_FromInstructionSequence()``. Then ``_PyCfg_OptimizeCodeUnit()``
by `_PyCfg_FromInstructionSequence()`. Then `_PyCfg_OptimizeCodeUnit()`
applies various peephole optimizations, and
``_PyCfg_OptimizedCfgToInstructionSequence()`` converts the optimized ``CFG``
`_PyCfg_OptimizedCfgToInstructionSequence()` converts the optimized `CFG`
back into an instruction sequence. These conversions and optimizations are
implemented in
[Python/flowgraph.c](https://github.com/python/cpython/blob/main/Python/flowgraph.c).
implemented in [Python/flowgraph.c](../Python/flowgraph.c).
Finally, the sequence of pseudo-instructions is converted into actual
bytecode. This includes transforming pseudo instructions into actual instructions,
converting jump targets from logical labels to relative offsets, and
construction of the
[exception table](exception_handling.md) and
[locations table](https://github.com/python/cpython/blob/main/InternalDocs/locations.md).
The bytecode and tables are then wrapped into a ``PyCodeObject`` along with additional
metadata, including the ``consts`` and ``names`` arrays, information about function
construction of the [exception table](exception_handling.md) and
[locations table](locations.md).
The bytecode and tables are then wrapped into a `PyCodeObject` along with additional
metadata, including the `consts` and `names` arrays, information about the function, and a
reference to the source code (filename, etc.). All of this is implemented by
``_PyAssemble_MakeCodeObject()`` in
[Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c).
`_PyAssemble_MakeCodeObject()` in [Python/assemble.c](../Python/assemble.c).
Code objects
============
The result of ``PyAST_CompileObject()`` is a ``PyCodeObject`` which is defined in
[Include/cpython/code.h](https://github.com/python/cpython/blob/main/Include/cpython/code.h).
The result of `PyAST_CompileObject()` is a `PyCodeObject` which is defined in
[Include/cpython/code.h](../Include/cpython/code.h).
And with that you now have executable Python bytecode!
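The same artifact is what the builtin `compile()` returns, so it can be poked at directly:

```pycon
>>> code = compile("print('hi')", "<example>", "exec")
>>> type(code)
<class 'code'>
>>> code.co_consts, code.co_names
(('hi', None), ('print',))
```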
The code objects (byte code) are executed in
[Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c).
The code objects (byte code) are executed in [Python/ceval.c](../Python/ceval.c).
This file will also need a new case statement for the new opcode in the big switch
statement in ``_PyEval_EvalFrameDefault()``.
statement in `_PyEval_EvalFrameDefault()`.
Important files
===============
* [Parser/](https://github.com/python/cpython/blob/main/Parser/)
* [Parser/](../Parser/)
* [Parser/Python.asdl](https://github.com/python/cpython/blob/main/Parser/Python.asdl):
* [Parser/Python.asdl](../Parser/Python.asdl):
ASDL syntax file.
* [Parser/asdl.py](https://github.com/python/cpython/blob/main/Parser/asdl.py):
* [Parser/asdl.py](../Parser/asdl.py):
Parser for ASDL definition files.
Reads in an ASDL description and parses it into an AST that describes it.
* [Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py):
* [Parser/asdl_c.py](../Parser/asdl_c.py):
Generate C code from an ASDL description. Generates
[Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c)
and
[Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h).
[Python/Python-ast.c](../Python/Python-ast.c) and
[Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h).
* [Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c):
The new PEG parser introduced in Python 3.9.
Generated by
[Tools/peg_generator/pegen/c_generator.py](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/c_generator.py)
from the grammar [Grammar/python.gram](https://github.com/python/cpython/blob/main/Grammar/python.gram).
* [Parser/parser.c](../Parser/parser.c):
The new PEG parser introduced in Python 3.9. Generated by
[Tools/peg_generator/pegen/c_generator.py](../Tools/peg_generator/pegen/c_generator.py)
from the grammar [Grammar/python.gram](../Grammar/python.gram).
Creates the AST from source code. Rule functions for their corresponding production
rules are found here.
* [Parser/peg_api.c](https://github.com/python/cpython/blob/main/Parser/peg_api.c):
Contains high-level functions which are
used by the interpreter to create an AST from source code.
* [Parser/peg_api.c](../Parser/peg_api.c):
Contains high-level functions which are used by the interpreter to create
an AST from source code.
* [Parser/pegen.c](https://github.com/python/cpython/blob/main/Parser/pegen.c):
* [Parser/pegen.c](../Parser/pegen.c):
Contains helper functions which are used by functions in
[Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c)
to construct the AST. Also contains helper functions which help raise better error messages
when parsing source code.
[Parser/parser.c](../Parser/parser.c) to construct the AST. Also contains
helper functions which help raise better error messages when parsing source code.
* [Parser/pegen.h](https://github.com/python/cpython/blob/main/Parser/pegen.h):
Header file for the corresponding
[Parser/pegen.c](https://github.com/python/cpython/blob/main/Parser/pegen.c).
Also contains definitions of the ``Parser`` and ``Token`` structs.
* [Parser/pegen.h](../Parser/pegen.h):
Header file for the corresponding [Parser/pegen.c](../Parser/pegen.c).
Also contains definitions of the `Parser` and `Token` structs.
* [Python/](https://github.com/python/cpython/blob/main/Python)
* [Python/](../Python)
* [Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c):
* [Python/Python-ast.c](../Python/Python-ast.c):
Creates C structs corresponding to the ASDL types. Also contains code for
marshalling AST nodes (core ASDL types have marshalling code in
[Python/asdl.c](https://github.com/python/cpython/blob/main/Python/asdl.c)).
File automatically generated by
[Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py).
[Python/asdl.c](../Python/asdl.c)).
File automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py).
This file must be committed separately after every grammar change
is committed since the ``__version__`` value is set to the latest
is committed since the `__version__` value is set to the latest
grammar change revision number.
* [Python/asdl.c](https://github.com/python/cpython/blob/main/Python/asdl.c):
* [Python/asdl.c](../Python/asdl.c):
Contains code to handle the ASDL sequence type.
Also has code to handle marshalling the core ASDL types, such as number
and identifier. Used by
[Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c)
and identifier. Used by [Python/Python-ast.c](../Python/Python-ast.c)
for marshalling AST nodes.
* [Python/ast.c](https://github.com/python/cpython/blob/main/Python/ast.c):
* [Python/ast.c](../Python/ast.c):
Used for validating the AST.
* [Python/ast_opt.c](https://github.com/python/cpython/blob/main/Python/ast_opt.c):
* [Python/ast_opt.c](../Python/ast_opt.c):
Optimizes the AST.
* [Python/ast_unparse.c](https://github.com/python/cpython/blob/main/Python/ast_unparse.c):
* [Python/ast_unparse.c](../Python/ast_unparse.c):
Converts the AST expression node back into a string (for string annotations).
* [Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c):
* [Python/ceval.c](../Python/ceval.c):
Executes byte code (aka, eval loop).
* [Python/symtable.c](https://github.com/python/cpython/blob/main/Python/symtable.c):
* [Python/symtable.c](../Python/symtable.c):
Generates a symbol table from AST.
* [Python/pyarena.c](https://github.com/python/cpython/blob/main/Python/pyarena.c):
* [Python/pyarena.c](../Python/pyarena.c):
Implementation of the arena memory manager.
* [Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c):
* [Python/compile.c](../Python/compile.c):
Emits pseudo bytecode based on the AST.
* [Python/flowgraph.c](https://github.com/python/cpython/blob/main/Python/flowgraph.c):
* [Python/flowgraph.c](../Python/flowgraph.c):
Implements peephole optimizations.
* [Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c):
* [Python/assemble.c](../Python/assemble.c):
Constructs a code object from a sequence of pseudo instructions.
* [Python/instruction_sequence.c](https://github.com/python/cpython/blob/main/Python/instruction_sequence.c):
* [Python/instruction_sequence.c](../Python/instruction_sequence.c):
A data structure representing a sequence of bytecode-like pseudo-instructions.
* [Include/](https://github.com/python/cpython/blob/main/Include/)
* [Include/](../Include/)
* [Include/cpython/code.h](https://github.com/python/cpython/blob/main/Include/cpython/code.h)
: Header file for
[Objects/codeobject.c](https://github.com/python/cpython/blob/main/Objects/codeobject.c);
contains definition of ``PyCodeObject``.
* [Include/cpython/code.h](../Include/cpython/code.h)
: Header file for [Objects/codeobject.c](../Objects/codeobject.c);
contains definition of `PyCodeObject`.
* [Include/opcode.h](https://github.com/python/cpython/blob/main/Include/opcode.h)
: One of the files that must be modified if
[Lib/opcode.py](https://github.com/python/cpython/blob/main/Lib/opcode.py) is.
* [Include/opcode.h](../Include/opcode.h)
: One of the files that must be modified whenever
[Lib/opcode.py](../Lib/opcode.py) is.
* [Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h)
* [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h)
: Contains the actual definitions of the C structs as generated by
[Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c)
Automatically generated by
[Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py).
[Python/Python-ast.c](../Python/Python-ast.c).
Automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py).
* [Include/internal/pycore_asdl.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_asdl.h)
: Header for the corresponding
[Python/ast.c](https://github.com/python/cpython/blob/main/Python/ast.c).
* [Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h)
: Header for the corresponding [Python/ast.c](../Python/ast.c).
* [Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h)
: Declares ``_PyAST_Validate()`` external (from
[Python/ast.c](https://github.com/python/cpython/blob/main/Python/ast.c)).
* [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h)
: Declares `_PyAST_Validate()` external (from [Python/ast.c](../Python/ast.c)).
* [Include/internal/pycore_symtable.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_symtable.h)
: Header for
[Python/symtable.c](https://github.com/python/cpython/blob/main/Python/symtable.c).
``struct symtable`` and ``PySTEntryObject`` are defined here.
* [Include/internal/pycore_symtable.h](../Include/internal/pycore_symtable.h)
: Header for [Python/symtable.c](../Python/symtable.c).
`struct symtable` and `PySTEntryObject` are defined here.
* [Include/internal/pycore_parser.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_parser.h)
: Header for the corresponding
[Parser/peg_api.c](https://github.com/python/cpython/blob/main/Parser/peg_api.c).
* [Include/internal/pycore_parser.h](../Include/internal/pycore_parser.h)
: Header for the corresponding [Parser/peg_api.c](../Parser/peg_api.c).
* [Include/internal/pycore_pyarena.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_pyarena.h)
: Header file for the corresponding
[Python/pyarena.c](https://github.com/python/cpython/blob/main/Python/pyarena.c).
* [Include/internal/pycore_pyarena.h](../Include/internal/pycore_pyarena.h)
: Header file for the corresponding [Python/pyarena.c](../Python/pyarena.c).
* [Include/opcode_ids.h](https://github.com/python/cpython/blob/main/Include/opcode_ids.h)
: List of opcodes. Generated from
[Python/bytecodes.c](https://github.com/python/cpython/blob/main/Python/bytecodes.c)
* [Include/opcode_ids.h](../Include/opcode_ids.h)
: List of opcodes. Generated from [Python/bytecodes.c](../Python/bytecodes.c)
by
[Tools/cases_generator/opcode_id_generator.py](https://github.com/python/cpython/blob/main/Tools/cases_generator/opcode_id_generator.py).
[Tools/cases_generator/opcode_id_generator.py](../Tools/cases_generator/opcode_id_generator.py).
* [Objects/](https://github.com/python/cpython/blob/main/Objects/)
* [Objects/](../Objects/)
* [Objects/codeobject.c](https://github.com/python/cpython/blob/main/Objects/codeobject.c)
* [Objects/codeobject.c](../Objects/codeobject.c)
: Contains PyCodeObject-related code.
* [Objects/frameobject.c](https://github.com/python/cpython/blob/main/Objects/frameobject.c)
: Contains the ``frame_setlineno()`` function which should determine whether it is allowed
* [Objects/frameobject.c](../Objects/frameobject.c)
: Contains the `frame_setlineno()` function, which determines whether a jump
between two points in the bytecode is allowed.
* [Lib/](https://github.com/python/cpython/blob/main/Lib/)
* [Lib/](../Lib/)
* [Lib/opcode.py](https://github.com/python/cpython/blob/main/Lib/opcode.py)
* [Lib/opcode.py](../Lib/opcode.py)
: opcode utilities exposed to Python.
* [Include/core/pycore_magic_number.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_magic_number.h)
: Home of the magic number (named ``MAGIC_NUMBER``) for bytecode versioning.
* [Include/internal/pycore_magic_number.h](../Include/internal/pycore_magic_number.h)
: Home of the magic number (named `MAGIC_NUMBER`) for bytecode versioning.
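The magic number is also exposed at the Python level; the value shown below is from CPython 3.12 and changes with every bytecode-incompatible release:

```pycon
>>> import importlib.util
>>> importlib.util.MAGIC_NUMBER     # the first four bytes of every .pyc file
b'\xcb\r\r\n'
```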
Objects
@@ -625,7 +595,7 @@ Objects
* [Locations](locations.md): Describes the location table
* [Frames](frames.md): Describes frames and the frame stack
* [Objects/object_layout.md](https://github.com/python/cpython/blob/main/Objects/object_layout.md): Describes object layout for 3.11 and later
* [Objects/object_layout.md](../Objects/object_layout.md): Describes object layout for 3.11 and later
* [Exception Handling](exception_handling.md): Describes the exception table


@@ -68,18 +68,16 @@ Handling Exceptions
-------------------
At runtime, when an exception occurs, the interpreter calls
``get_exception_handler()`` in
[Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c)
`get_exception_handler()` in [Python/ceval.c](../Python/ceval.c)
to look up the offset of the current instruction in the exception
table. If it finds a handler, control flow transfers to it. Otherwise, the
exception bubbles up to the caller, and the caller's frame is
checked for a handler covering the `CALL` instruction. This
repeats until a handler is found or the topmost frame is reached.
If no handler is found, then the interpreter function
(``_PyEval_EvalFrameDefault()``) returns NULL. During unwinding,
(`_PyEval_EvalFrameDefault()`) returns NULL. During unwinding,
the traceback is constructed as each frame is added to it by
``PyTraceBack_Here()``, which is in
[Python/traceback.c](https://github.com/python/cpython/blob/main/Python/traceback.c).
`PyTraceBack_Here()`, which is in [Python/traceback.c](../Python/traceback.c).
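The frame-by-frame unwinding is observable from Python; a sketch:

```pycon
>>> import traceback
>>> def inner(): 1 / 0
...
>>> def outer(): inner()
...
>>> try:
...     outer()
... except ZeroDivisionError as exc:
...     # one traceback entry per frame the exception bubbled through
...     print([f.name for f in traceback.extract_tb(exc.__traceback__)])
...
['<module>', 'outer', 'inner']
```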
Along with the location of an exception handler, each entry of the
exception table also contains the stack depth of the `try` instruction
@@ -174,22 +172,20 @@ which is then encoded as:
for a total of five bytes.
The code to construct the exception table is in ``assemble_exception_table()``
in [Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c).
The code to construct the exception table is in `assemble_exception_table()`
in [Python/assemble.c](../Python/assemble.c).
The interpreter's function to look up the table by instruction offset is
``get_exception_handler()`` in
[Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c).
The Python function ``_parse_exception_table()`` in
[Lib/dis.py](https://github.com/python/cpython/blob/main/Lib/dis.py)
`get_exception_handler()` in [Python/ceval.c](../Python/ceval.c).
The Python function `_parse_exception_table()` in [Lib/dis.py](../Lib/dis.py)
returns the exception table content as a list of namedtuple instances.
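On CPython 3.11+ the table is also printed by `dis`; a sketch (`g` is an assumed stand-in that is never actually called):

```python
import dis

def f():
    try:
        return g()
    except ValueError:
        return None

# The disassembly ends with an "ExceptionTable:" section that maps
# instruction-offset ranges to handler targets and stack depths.
dis.dis(f)
```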
Exception Chaining Implementation
---------------------------------
[Exception chaining](https://docs.python.org/dev/tutorial/errors.html#exception-chaining)
refers to setting the ``__context__`` and ``__cause__`` fields of an exception as it is
being raised. The ``__context__`` field is set by ``_PyErr_SetObject()`` in
[Python/errors.c](https://github.com/python/cpython/blob/main/Python/errors.c)
(which is ultimately called by all ``PyErr_Set*()`` functions).
The ``__cause__`` field (explicit chaining) is set by the ``RAISE_VARARGS`` bytecode.
refers to setting the `__context__` and `__cause__` fields of an exception as it is
being raised. The `__context__` field is set by `_PyErr_SetObject()` in
[Python/errors.c](../Python/errors.c) (which is ultimately called by all
`PyErr_Set*()` functions). The `__cause__` field (explicit chaining) is set by
the `RAISE_VARARGS` bytecode.
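Both fields are easy to observe from Python:

```pycon
>>> try:
...     try:
...         1 / 0
...     except ZeroDivisionError as original:
...         raise ValueError("bad input") from original   # sets __cause__
... except ValueError as exc:
...     print(type(exc.__cause__).__name__, type(exc.__context__).__name__)
...
ZeroDivisionError ZeroDivisionError
```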


@@ -10,20 +10,19 @@ of three conceptual sections:
globals dict, code object, instruction pointer, stack depth, the
previous frame, etc.
The definition of the ``_PyInterpreterFrame`` struct is in
[Include/internal/pycore_frame.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_frame.h).
The definition of the `_PyInterpreterFrame` struct is in
[Include/internal/pycore_frame.h](../Include/internal/pycore_frame.h).
# Allocation
Python semantics allows frames to outlive the activation, so they need to
be allocated outside the C call stack. To reduce overhead and improve locality
of reference, most frames are allocated contiguously in a per-thread stack
(see ``_PyThreadState_PushFrame`` in
[Python/pystate.c](https://github.com/python/cpython/blob/main/Python/pystate.c)).
(see `_PyThreadState_PushFrame` in [Python/pystate.c](../Python/pystate.c)).
Frames of generators and coroutines are embedded in the generator and coroutine
objects, so are not allocated in the per-thread stack. See ``PyGenObject`` in
[Include/internal/pycore_genobject.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_genobject.h).
objects, so are not allocated in the per-thread stack. See `PyGenObject` in
[Include/internal/pycore_genobject.h](../Include/internal/pycore_genobject.h).
## Layout
@@ -82,16 +81,15 @@ frames for each activation, but with low runtime overhead.
### Generators and Coroutines
Generators (objects of type ``PyGen_Type``, ``PyCoro_Type`` or
``PyAsyncGen_Type``) have a `_PyInterpreterFrame` embedded in them, so
Generators (objects of type `PyGen_Type`, `PyCoro_Type` or
`PyAsyncGen_Type`) have a `_PyInterpreterFrame` embedded in them, so
that they can be created with a single memory allocation.
When such an embedded frame is iterated or awaited, it can be linked with
frames on the per-thread stack via the linkage fields.
If a frame object associated with a generator outlives the generator, then
the embedded `_PyInterpreterFrame` is copied into the frame object (see
``take_ownership()`` in
[Python/frame.c](https://github.com/python/cpython/blob/main/Python/frame.c)).
`take_ownership()` in [Python/frame.c](../Python/frame.c)).
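A sketch of this from Python (the frame object stays usable after the generator is gone):

```pycon
>>> def gen():
...     yield 1
...
>>> g = gen()
>>> next(g)
1
>>> frame = g.gi_frame      # backed by the embedded _PyInterpreterFrame
>>> del g                   # generator freed; the frame data is copied out
>>> frame.f_code.co_name    # the frame object remains valid
'gen'
```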
### Field names


@@ -12,7 +12,7 @@ a local variable in some C function. When an object's reference count becomes
the object is deallocated. If it contains references to other objects, their
reference counts are decremented. Those other objects may be deallocated in turn, if
this decrement makes their reference count become zero, and so on. The reference
count field can be examined using the ``sys.getrefcount()`` function (notice that the
count field can be examined using the `sys.getrefcount()` function (notice that the
value returned by this function is always 1 more than expected, as the function itself also holds a reference
to the object when called):
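A sketch of such a session (the exact count can differ between builds):

```pycon
>>> import sys
>>> a = object()
>>> sys.getrefcount(a)   # one reference from 'a', plus the call argument
2
```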
@@ -39,7 +39,7 @@ cycles. For instance, consider this code:
>>> del container
```
In this example, ``container`` holds a reference to itself, so even when we remove
In this example, `container` holds a reference to itself, so even when we remove
our reference to it (the variable "container") the reference count never falls to 0
because it still has its own internal reference. Therefore it would never be
cleaned just by simple reference counting. For this reason some additional machinery
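The cyclic collector handles exactly this case; the return value below assumes a quiet interpreter session:

```pycon
>>> import gc
>>> container = []
>>> container.append(container)   # a self-referencing cycle
>>> del container
>>> gc.collect()                  # number of unreachable objects found
1
```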
@@ -127,7 +127,7 @@ GC for the free-threaded build
------------------------------
In the free-threaded build, Python objects contain a 1-byte field
``ob_gc_bits`` that is used to track garbage collection related state. The
`ob_gc_bits` that is used to track garbage collection related state. The
field exists in all objects, including ones that do not support cyclic
garbage collection. The field is used to identify objects that are tracked
by the collector, ensure that finalizers are called only once per object,
@@ -146,14 +146,14 @@ and, during garbage collection, differentiate reachable vs. unreachable objects.
| ... |
```
Note that not all fields are to scale. ``pad`` is two bytes, ``ob_mutex`` and
``ob_gc_bits`` are each one byte, and ``ob_ref_local`` is four bytes. The
other fields, ``ob_tid``, ``ob_ref_shared``, and ``ob_type``, are all
Note that not all fields are to scale. `pad` is two bytes, `ob_mutex` and
`ob_gc_bits` are each one byte, and `ob_ref_local` is four bytes. The
other fields, `ob_tid`, `ob_ref_shared`, and `ob_type`, are all
pointer-sized (that is, eight bytes on a 64-bit platform).
The garbage collector also temporarily repurposes the ``ob_tid`` (thread ID)
and ``ob_ref_local`` (local reference count) fields for other purposes during
The garbage collector also temporarily repurposes the `ob_tid` (thread ID)
and `ob_ref_local` (local reference count) fields for other purposes during
collections.
@@ -165,17 +165,17 @@ objects with GC support. These APIs can be found in the
[Garbage Collector C API documentation](https://docs.python.org/3/c-api/gcsupport.html).
Apart from this object structure, the type object for objects supporting garbage
collection must include the ``Py_TPFLAGS_HAVE_GC`` in its ``tp_flags`` slot and
provide an implementation of the ``tp_traverse`` handler. Unless it can be proven
collection must include the `Py_TPFLAGS_HAVE_GC` in its `tp_flags` slot and
provide an implementation of the `tp_traverse` handler. Unless it can be proven
that the objects cannot form reference cycles with only objects of its type or unless
the type is immutable, a ``tp_clear`` implementation must also be provided.
the type is immutable, a `tp_clear` implementation must also be provided.
Identifying reference cycles
============================
The algorithm that CPython uses to detect those reference cycles is
implemented in the ``gc`` module. The garbage collector **only focuses**
implemented in the `gc` module. The garbage collector **only focuses**
on cleaning container objects (that is, objects that can contain a reference
to one or more objects). These can be arrays, dictionaries, lists, custom
class instances, classes in extension modules, etc. One could think that
@@ -195,7 +195,7 @@ the interpreter create cycles everywhere. Some notable examples:
To correctly dispose of these objects once they become unreachable, they need
to be identified first. To understand how the algorithm works, let's take
the case of a circular linked list which has one link referenced by a
variable ``A``, and one self-referencing object which is completely
variable `A`, and one self-referencing object which is completely
unreachable:
```pycon
@@ -234,7 +234,7 @@ objects have a refcount larger than the number of incoming references from
within the candidate set.
Every object that supports garbage collection will have an extra reference
count field initialized to the reference count (``gc_ref`` in the figures)
count field initialized to the reference count (`gc_ref` in the figures)
of that object when the algorithm starts. This is because the algorithm needs
to modify the reference count to do the computations and in this way the
interpreter will not modify the real reference count field.
@@ -243,43 +243,43 @@ interpreter will not modify the real reference count field.
The GC then iterates over all containers in the first list and decrements by one the
`gc_ref` field of any other object that container is referencing. Doing
this makes use of the ``tp_traverse`` slot in the container class (implemented
this makes use of the `tp_traverse` slot in the container class (implemented
using the C API or inherited by a superclass) to know what objects are referenced by
each container. After all the objects have been scanned, only the objects that have
references from outside the “objects to scan” list will have ``gc_ref > 0``.
references from outside the “objects to scan” list will have `gc_ref > 0`.
![gc-image2](images/python-cyclic-gc-2-new-page.png)
Notice that having ``gc_ref == 0`` does not imply that the object is unreachable.
This is because another object that is reachable from the outside (``gc_ref > 0``)
can still have references to it. For instance, the ``link_2`` object in our example
ended having ``gc_ref == 0`` but is referenced still by the ``link_1`` object that
Notice that having `gc_ref == 0` does not imply that the object is unreachable.
This is because another object that is reachable from the outside (`gc_ref > 0`)
can still have references to it. For instance, the `link_2` object in our example
ended up with `gc_ref == 0` but is still referenced by the `link_1` object that
is reachable from the outside. To obtain the set of objects that are really
unreachable, the garbage collector re-scans the container objects using the
``tp_traverse`` slot; this time with a different traverse function that marks objects with
``gc_ref == 0`` as "tentatively unreachable" and then moves them to the
`tp_traverse` slot; this time with a different traverse function that marks objects with
`gc_ref == 0` as "tentatively unreachable" and then moves them to the
tentatively unreachable list. The following image depicts the state of the lists in a
moment when the GC processed the ``link_3`` and ``link_4`` objects but has not
processed ``link_1`` and ``link_2`` yet.
moment when the GC processed the `link_3` and `link_4` objects but has not
processed `link_1` and `link_2` yet.
![gc-image3](images/python-cyclic-gc-3-new-page.png)
Then the GC scans the next ``link_1`` object. Because it has ``gc_ref == 1``,
Then the GC scans the next `link_1` object. Because it has `gc_ref == 1`,
the gc does not do anything special because it knows it has to be reachable (and is
already in what will become the reachable list):
![gc-image4](images/python-cyclic-gc-4-new-page.png)
When the GC encounters an object which is reachable (``gc_ref > 0``), it traverses
its references using the ``tp_traverse`` slot to find all the objects that are
When the GC encounters an object which is reachable (`gc_ref > 0`), it traverses
its references using the `tp_traverse` slot to find all the objects that are
reachable from it, moving them to the end of the list of reachable objects (where
they started originally) and setting its ``gc_ref`` field to 1. This is what happens
to ``link_2`` and ``link_3`` below as they are reachable from ``link_1``. From the
state in the previous image and after examining the objects referred to by ``link_1``
the GC knows that ``link_3`` is reachable after all, so it is moved back to the
original list and its ``gc_ref`` field is set to 1 so that if the GC visits it again,
they started originally) and setting its `gc_ref` field to 1. This is what happens
to `link_2` and `link_3` below as they are reachable from `link_1`. From the
state in the previous image and after examining the objects referred to by `link_1`
the GC knows that `link_3` is reachable after all, so it is moved back to the
original list and its `gc_ref` field is set to 1 so that if the GC visits it again,
it will know that it's reachable. To avoid visiting an object twice, the GC marks all
objects that have already been visited once (by unsetting the ``PREV_MASK_COLLECTING``
objects that have already been visited once (by unsetting the `PREV_MASK_COLLECTING`
flag) so that if an object that has already been processed is referenced by some other
object, the GC does not process it twice.
@@ -295,7 +295,7 @@ list are really unreachable and can thus be garbage collected.
Pragmatically, it's important to note that no recursion is required by any of this,
and neither does it in any other way require additional memory proportional to the
number of objects, number of pointers, or the lengths of pointer chains. Apart from
``O(1)`` storage for internal C needs, the objects themselves contain all the storage
`O(1)` storage for internal C needs, the objects themselves contain all the storage
the GC algorithms require.
Why moving unreachable objects is better
@@ -331,7 +331,7 @@ with the objective of completely destroying these objects. Roughly, the process
follows these steps in order:
1. Handle and clear weak references (if any). Weak references to unreachable objects
are set to ``None``. If the weak reference has an associated callback, the callback
are set to `None`. If the weak reference has an associated callback, the callback
is enqueued to be called once the clearing of weak references is finished. We only
invoke callbacks for weak references that are themselves reachable. If both the weak
reference and the pointed-to object are unreachable we do not execute the callback.
@@ -339,15 +339,15 @@ follows these steps in order:
object and support for weak references predates support for object resurrection.
Ignoring the weak reference's callback is fine because both the object and the weakref
are going away, so it's legitimate to say the weak reference is going away first.
2. If an object has legacy finalizers (``tp_del`` slot) move it to the
``gc.garbage`` list.
3. Call the finalizers (``tp_finalize`` slot) and mark the objects as already
2. If an object has legacy finalizers (`tp_del` slot) move it to the
`gc.garbage` list.
3. Call the finalizers (`tp_finalize` slot) and mark the objects as already
finalized to avoid calling finalizers twice if the objects are resurrected or
if other finalizers have removed the object first.
4. Deal with resurrected objects. If some objects have been resurrected, the GC
finds the new subset of objects that are still unreachable by running the cycle
detection algorithm again and continues with them.
5. Call the ``tp_clear`` slot of every object so all internal links are broken and
5. Call the `tp_clear` slot of every object so all internal links are broken and
the reference counts fall to 0, triggering the destruction of all unreachable
objects.
@@ -376,9 +376,9 @@ generations. Every collection operates on the entire heap.
In order to decide when to run, the collector keeps track of the number of object
allocations and deallocations since the last collection. When the number of
allocations minus the number of deallocations exceeds ``threshold_0``,
allocations minus the number of deallocations exceeds `threshold_0`,
collection starts. Initially only generation 0 is examined. If generation 0 has
been examined more than ``threshold_1`` times since generation 1 has been
been examined more than `threshold_1` times since generation 1 has been
examined, then generation 1 is examined as well. With generation 2,
things are a bit more complicated; see
[Collecting the oldest generation](#Collecting-the-oldest-generation) for
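The thresholds are inspectable and tunable from Python; `(700, 10, 10)` is the historical default and the counts shown are illustrative:

```pycon
>>> import gc
>>> gc.get_threshold()   # (threshold_0, threshold_1, threshold_2)
(700, 10, 10)
>>> gc.get_count()       # current allocation counts, one per generation
(57, 2, 1)
```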
@@ -393,8 +393,8 @@ function:
```
The content of these generations can be examined using the
``gc.get_objects(generation=NUM)`` function and collections can be triggered
specifically in a generation by calling ``gc.collect(generation=NUM)``.
`gc.get_objects(generation=NUM)` function and collections can be triggered
specifically in a generation by calling `gc.collect(generation=NUM)`.
```pycon
>>> import gc
@@ -433,7 +433,7 @@ Collecting the oldest generation
--------------------------------
In addition to the various configurable thresholds, the GC only triggers a full
collection of the oldest generation if the ratio ``long_lived_pending / long_lived_total``
collection of the oldest generation if the ratio `long_lived_pending / long_lived_total`
is above a given value (hardwired to 25%). The reason is that, while "non-full"
collections (that is, collections of the young and middle generations) will always
examine roughly the same number of objects (determined by the aforementioned
@@ -463,12 +463,12 @@ used for tags or to keep other information most often as a bit field (each
bit a separate tag) as long as code that uses the pointer masks out these
bits before accessing memory. For example, on a 32-bit architecture (for both
addresses and word size), a word is 32 bits = 4 bytes, so word-aligned
addresses are always a multiple of 4, hence end in ``00``, leaving the last 2 bits
addresses are always a multiple of 4, hence end in `00`, leaving the last 2 bits
available; while on a 64-bit architecture, a word is 64 bits = 8 bytes, so
word-aligned addresses end in ``000``, leaving the last 3 bits available.
word-aligned addresses end in `000`, leaving the last 3 bits available.
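This can be checked from Python, relying on the CPython detail that `id()` returns the object's address (the hex value below is illustrative):

```pycon
>>> x = object()
>>> id(x) % 8    # object addresses are at least 8-byte aligned
0
>>> hex(id(x))   # so the low bits are zero and free to hold tags
'0x7f8a1c2b3e60'
```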
The CPython GC makes use of two fat pointers that correspond to the extra fields
of ``PyGC_Head`` discussed in the `Memory layout and object structure`_ section:
of `PyGC_Head` discussed in the [Memory layout and object structure](#memory-layout-and-object-structure) section:
> [!WARNING]
> Because of the presence of extra information, "tagged" or "fat" pointers cannot be
@@ -478,23 +478,23 @@ of ``PyGC_Head`` discussed in the `Memory layout and object structure`_ section:
> normally assume the pointers inside the lists are in a consistent state.
- The ``_gc_prev`` field is normally used as the "previous" pointer to maintain the
- The `_gc_prev` field is normally used as the "previous" pointer to maintain the
doubly linked list but its lowest two bits are used to keep the flags
``PREV_MASK_COLLECTING`` and ``_PyGC_PREV_MASK_FINALIZED``. Between collections,
the only flag that can be present is ``_PyGC_PREV_MASK_FINALIZED`` that indicates
if an object has been already finalized. During collections ``_gc_prev`` is
temporarily used for storing a copy of the reference count (``gc_ref``), in
`PREV_MASK_COLLECTING` and `_PyGC_PREV_MASK_FINALIZED`. Between collections,
the only flag that can be present is `_PyGC_PREV_MASK_FINALIZED`, which indicates
whether an object has already been finalized. During collections, `_gc_prev` is
temporarily used for storing a copy of the reference count (`gc_ref`), in
addition to two flags, and the GC linked list becomes a singly linked list until
``_gc_prev`` is restored.
`_gc_prev` is restored.
- The `_gc_next` field is used as the "next" pointer to maintain the doubly linked
list but during collection its lowest bit is used to keep the
`NEXT_MASK_UNREACHABLE` flag that indicates if an object is tentatively
unreachable during the cycle detection algorithm. This is a drawback to using only
doubly linked lists to implement partitions: while most needed operations are
constant-time, there is no efficient way to determine which partition an object is
currently in. Instead, when that's needed, ad hoc tricks (like the
`NEXT_MASK_UNREACHABLE` flag) are employed.
Optimization: delay tracking containers
=======================================
Some containers (for example, tuples and dictionaries holding only atomic,
untrackable values) cannot take part in reference cycles, so tracking them is
delayed until they are seen to contain objects that need tracking. During a
full garbage collection (all generations), the collector will untrack any dictionaries
whose contents are not tracked.
The garbage collector module provides the Python function `is_tracked(obj)`, which returns
the current tracking status of the object. Subsequent garbage collections may change the
tracking status of the object.
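For example (results shown for the default build; delayed tracking is an
implementation detail and may change between versions):

```pycon
>>> import gc
>>> gc.is_tracked(42)            # atomic objects are never tracked
False
>>> gc.is_tracked([])            # lists are always tracked
True
>>> d = {"key": 1}
>>> gc.is_tracked(d)             # dict with atomic contents: tracking delayed
False
>>> d["other"] = []
>>> gc.is_tracked(d)             # now it contains a container, so it is tracked
True
```
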
Differences between GC implementations
=======================================
This section summarizes the differences between the GC implementation in the
default build and the implementation in the free-threaded build.
The default build implementation makes extensive use of the `PyGC_Head` data
structure, while the free-threaded build implementation does not use that
data structure.
- The default build implementation stores all tracked objects in a doubly
linked list using `PyGC_Head`. The free-threaded build implementation
instead relies on the embedded mimalloc memory allocator to scan the heap
for tracked objects.
- The default build implementation uses `PyGC_Head` for the unreachable
object list. The free-threaded build implementation repurposes the
`ob_tid` field to store a linked list of unreachable objects.
- The default build implementation stores flags in the `_gc_prev` field of
`PyGC_Head`. The free-threaded build implementation stores these flags
in `ob_gc_bits`.
The default build implementation relies on the global interpreter lock for
thread safety.

---

Python's Parser is currently a
[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser.
The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
[grammar file](../Grammar/python.gram).
Developers rarely need to modify the generator itself.
See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
for more information.
In PEG grammars, the choice operator (`|`) is ordered. This means that when writing:

```
rule: A | B | C
```
a parser that implements a context-free-grammar (such as an `LL(1)` parser) will
generate constructions that, given an input string, *deduce* which alternative
(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
the first one that succeeds.
PEG parsers typically ensure linear time complexity with a technique called
*packrat parsing*,
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
is that the parser will naturally use more memory than a simple `LL(1)` parser,
which is normally table-based.
Key ideas
---------
- Alternatives are ordered ( `A | B` is not the same as `B | A` ).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text), the
PEG parser doesn't have a concept of "where the
[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".
As an example of how ordered alternatives interact with the rest of a rule,
consider the following two rules (in these examples, a token is an individual
character):

```
first_rule:  ( 'a' | 'aa' ) 'a'
second_rule: ('aa' | 'a' ) 'a'
```
In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but
in PEG, one of these two rules accepts the string `aaa` but not the string
`aa`. The other does the opposite -- it accepts the string `aa`
but not the string `aaa`. The rule `('a'|'aa')'a'` does
not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the
final `a` in the rule consume the second, and leaving out the third `a`.
As the rule has succeeded, no attempt is ever made to go back and let
`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does
not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing
for the final `a`. Again, the second alternative of `'aa'|'a'` is not
tried.
> [!CAUTION]
> A rule where a shorter alternative appears before a longer one that starts
> with the same tokens is in almost all cases a mistake, for example:

```
| 'if' expression 'then' block
| 'if' expression 'then' block 'else' block
```
In this example, the second alternative will never be tried because the first one will
succeed first (even if the input string has an ``'else' block`` that follows). To correctly
succeed first (even if the input string has an `'else' block` that follows). To correctly
write this rule you can simply alter the order:
```
| 'if' expression 'then' block 'else' block
| 'if' expression 'then' block
```
In this case, if the input string doesn't have an `'else' block`, the first alternative
will fail and the second will be attempted.
Grammar Syntax
--------------

The grammar consists of rules; a rule can specify a return type for the C or
Python function that corresponds to it:

```
rule_name[return_type]: expression
```
If the return type is omitted, then a `void *` is returned in C and an
`Any` in Python.
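For instance, a hypothetical C-targeting rule (the rule name and alternatives
here are made up for illustration) could be declared as:

```
my_stmt[stmt_ty]: pass_stmt | break_stmt
```
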
Grammar expressions
-------------------
Variables in the grammar
------------------------
A sub-expression can be named by preceding it with an identifier and an
`=` sign. The name can then be used in the action (see below), like this:
```
rule_name[return_type]: '(' a=some_other_rule ')' { a }
```

For example, a rule with an action that returns a valid C-based Python AST:

```
term[expr_ty]:
    | l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) }
    | NUMBER
```
Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena`, those being variables automatically
injected by the parser; `p` points to an object that holds on to all state
for the parser.
A similar grammar written to target Python AST objects:
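One possible shape for such a rule (a sketch mirroring the C example above;
the `ast.BinOp` location fields are omitted here for brevity):

```
term[ast.expr]:
    | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r) }
    | NUMBER
```
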
Pegen
-----
Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the Python
grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce
the final C parser. It contains the following pieces:
- A parser generator that can read a grammar file and produce a PEG parser
written in Python or C that can parse said grammar. The generator is located at
[`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
for the parser generator itself (this means that there are no manually-written
parsers). The meta-grammar is located at
[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and Python AST objects.
The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen)
but all the typical commands to interact with the parser generator are normally
executed from the main makefile.
How to regenerate the parser
----------------------------
Once you have made the changes to the grammar files, to regenerate the `C`
parser (the one used by the interpreter) just execute:
```
make regen-pegen
```
using the `Makefile` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser or to execute:
```
./PCbuild/build.bat --regen
```
The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).
How to regenerate the meta-parser
---------------------------------
The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
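Assuming the same make-based workflow as the other regeneration steps (the
exact target name here is an assumption), this is done with:

```
make regen-pegen-metaparser
```
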
Grammatical elements and rules
------------------------------
Pegen has some special grammatical elements and rules:
- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS.
- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS.
- Uppercase names (for example, `NAME`) denote tokens in the
[`Grammar/Tokens`](../Grammar/Tokens) file.
- Rule names starting with `invalid_` are used for specialized syntax errors.
- These rules are NOT used in the first pass of the parser.
- Only if the first pass fails to parse, a second pass including the invalid
  rules will be attempted.
Tokenization
------------
It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
boundaries, some special keywords like `ASYNC` and `AWAIT` (for
compatibility purposes), backtracking errors (such as unclosed parentheses),
dealing with encoding, interactive mode and much more. Some of these reasons
are also there for historical purposes, and some others are useful even today.
The list of tokens (all uppercase names in the grammar) that you can use can
be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:
```
make regen-token
```

If you are on Windows you can use the Visual Studio project files to regenerate
the tokens, or execute:

```
./PCbuild/build.bat --regen
```
How tokens are generated and the rules governing this are completely up to the tokenizer
([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer));
the parser just receives tokens from it.
Memoization
-----------

As explained earlier, memoization makes PEG parsing linear, but it is expensive
both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache) the execution time cost comes
for continuously checking if the given rule has a cache hit or not. In many
situations, just parsing it again can be faster. Pegen **disables memoization
by default** except for rules with the special marker `memo` after the rule
name (and type, if present):
```
rule_name[return_type] (memo):
    ...
```

To determine whether a new rule needs memoization or not, benchmarking is required
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that allows you to measure how much each rule uses
memoization (check the [`Parser/pegen.c`](../Parser/pegen.c)
file for more information) but it needs to be manually activated.
Automatic variables
-------------------

To make writing actions easier, Pegen injects some automatic variables in the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:
- `p`: The parser structure.
- `EXTRA`: This is a macro that expands to
`(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`,
which is normally used to create AST nodes as almost all constructors need these
attributes to be provided. All of the location variables are taken from the
location information of the current token.
Hard and soft keywords
----------------------
> [!NOTE]
> In the grammar files, keywords are defined using **single quotes** (for example,
> `'class'`) while soft keywords are defined using **double quotes** (for example,
> `"match"`).
There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
(for example, `x = class + 1`), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail.
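For example (an illustrative session; the exact error message varies across
Python versions):

```pycon
>>> class = 3
  File "<stdin>", line 1
    class = 3
          ^
SyntaxError: invalid syntax
```
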
Soft keywords, on the other hand, can be used as regular names outside of the
one place where they are defined as keywords:

```pycon
>>> # 'match' is a soft keyword, so it also works as an argument name.
>>> def foo(*, match=None):
...     return match
...
>>> foo(match="Yeah!")
'Yeah!'
```
The `match` and `case` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.
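For instance, `match` works as an ordinary variable name outside of a match
statement:

```pycon
>>> match = ["a", "b"]
>>> match[0]
'a'
```
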
Error handling
--------------

When an exception is set while parsing, the generated parser does not try to
recover from it, no matter what the current state of the parser
is, and it will unwind the stack and report the exception. This means that if a
[rule action](#grammar-actions) raises an exception, all parsing will
stop at that exact point. This is done so that any exception set by calling
Python's C API functions is propagated correctly. This also includes
[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom syntax
error messages.
To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
(the parser returns `NULL` in C or `None` in Python) but no exception has
been raised.
As the Python grammar was originally written as an `LL(1)` grammar, this heuristic
has an extremely high success rate, but some PEG features, such as lookaheads,
can impact this.
To generate more precise syntax errors, custom rules are used. This is a common
practice also in context free grammars: the parser will try to accept some
construct that is known to be incorrect just to report a specific syntax error
for that construct. In pegen grammars, these rules start with the `invalid_`
prefix. This is because trying to match these rules normally has a performance
impact on parsing (and can also affect the 'correct' grammar itself in some
tricky cases, depending on the ordering of the rules) so the generated parser
acts in two phases:
1. The first phase will try to parse the input stream without taking into
account rules that start with the `invalid_` prefix. If the parsing
succeeds it will return the generated AST and the second phase will be
skipped.
2. If the first phase failed, a second parsing attempt is done including the
rules that start with an `invalid_` prefix. By design this attempt
**cannot succeed** and is only executed to give to the invalid rules a
chance to detect specific situations where custom, more precise, syntax
errors can be raised. This also allows trading a bit of performance for
better precision in the reported syntax errors.
> When defining invalid rules:
>
> - Make sure all custom invalid rules raise
> [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
> exceptions (or a subclass of it).
> - Make sure **all** invalid rules start with the `invalid_` prefix to not
> impact performance of parsing correct Python code.
> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
> (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).
You can find a collection of macros to raise specialized syntax errors in the
[`Parser/pegen.h`](../Parser/pegen.h)
header file. These macros also allow reporting ranges for
the custom errors, which will be highlighted in the tracebacks that will be
displayed when the error is reported.
For instance, an input such as:

```
<valid python code> $ 42
```
should trigger the syntax error at the `$` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
`print` statements in order to create a better error message and you define it as:
```
invalid_print: "print" expression
```
This will **seem** to work because the parser will correctly parse `print(something)`, which is valid
code, so the second phase never executes. But if you try to parse `print(something) $ 3`, the first pass
of the parser fails (because of the `$`), and in the second phase the rule matches
`print(something)` as `print` followed by the variable `something` between parentheses; the error
is then reported there instead of at the `$` character.
Generating AST objects
----------------------
The output of the C parser used by CPython, which is generated from the
[grammar file](../Grammar/python.gram), is a Python AST object (using C
structures). This means that the actions in the grammar file generate AST
objects when they succeed. Constructing these objects can be quite cumbersome
(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used. These functions are
declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined
in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The
helpers include functions that join AST sequences, get specific elements
from them, or perform extra processing on the generated tree.
As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c)
and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that
it can be used from the grammar.
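For illustration, a hypothetical helper following this pattern could look like
the sketch below (the function itself is made up; `_PyPegen_new_identifier`
and `_PyAST_Name` are real internal APIs, but check the headers for the exact
signatures):

```c
// Hypothetical helper: build a Name node (with Load context) from a NAME token.
// A real helper of this shape would live in Parser/action_helpers.c and be
// declared in Parser/pegen.h so that grammar actions can call it.
static expr_ty
_example_name_from_token(Parser *p, Token *tok)
{
    if (tok == NULL) {
        return NULL;  // propagate the failure so the parser can backtrack
    }
    PyObject *id = _PyPegen_new_identifier(p, PyBytes_AsString(tok->bytes));
    if (id == NULL) {
        return NULL;
    }
    return _PyAST_Name(id, Load, tok->lineno, tok->col_offset,
                       tok->end_lineno, tok->end_col_offset, p->arena);
}
```
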
When parsing succeeds, the parser **must** return a **valid** AST object.
Testing
-------
There are three files that contain tests for the grammar and the parser:
- [test_grammar.py](../Lib/test/test_grammar.py)
- [test_syntax.py](../Lib/test/test_syntax.py)
- [test_exceptions.py](../Lib/test/test_exceptions.py)
Check the contents of these files to know which is the best place for new
tests, depending on the nature of the new feature you are adding.
Tests for the parser generator itself can be found in the
[test_peg_generator](../Lib/test/test_peg_generator) directory.
Debugging generated parsers
---------------------------

Since the generated C parser is the one used by Python, if something goes wrong
when adding new grammar rules you cannot
correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.
For this reason it is a good idea to experiment first by generating a Python
parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory in the CPython repository and manually call the parser generator by executing:
```
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```
This will generate a file called `parse.py` in the same directory that you
can use to parse some input:
```
$ python parse.py file_with_source_code_to_test.py
```
As the generated `parse.py` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.
Verbose mode
------------
When Python is compiled in debug mode (by adding `--with-pydebug` when
running the configure step on Linux or by adding `-d` when calling the
[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a
**very** verbose mode in the generated parser. This is very useful to
debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.
> [!NOTE]
> When using verbose mode, it is better to avoid the
> interactive mode as it can be much harder to understand, because interactive
> mode involves some special steps compared to regular parsing.
To activate verbose mode you can add the `-d` flag when executing Python:
```
$ python -d file_to_test.py
```
This will print **a lot** of output to `stderr` so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure:
```
<indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
```
Every line is indented by a different amount (`<indentation>`) depending on how
deep the call stack is. The next character marks the type of the trace:
- `>` indicates that a rule is going to be attempted to be parsed.
- `-` indicates that a rule has failed to be parsed.
- `+` indicates that a rule has been parsed correctly.
- `!` indicates that an exception or an error has been detected and the parser is unwinding.
The `<token_location>` part indicates the current index in the token array,
the `<rule_name>` part indicates what rule is being parsed and
the `<alternative>` part indicates what alternative within that rule
is being attempted.
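For instance, a hand-written fragment in this format (illustrative, not
captured from a real run; the alternatives shown are real `compound_stmt` and
`if_stmt` alternatives from `Grammar/python.gram`) might look like:

```
> compound_stmt[12]: &('def' | '@' | 'async') function_def
- compound_stmt[12]: &('def' | '@' | 'async') function_def
> compound_stmt[12]: &'if' if_stmt
  > if_stmt[12]: 'if' named_expression ':' block elif_stmt
  + if_stmt[12]: 'if' named_expression ':' block elif_stmt
+ compound_stmt[12]: &'if' if_stmt
```
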
> **Document history**
>
> Pablo Galindo Salgado - Original author
>
> Irit Katriel and Jacob Coffee - Convert to Markdown