gh-119786: [doc] more consistent syntax in InternalDocs (#125815)

Irit Katriel 2024-10-21 23:37:31 +01:00 committed by GitHub
parent 4848b0b92c
commit d0bfff47fb
6 changed files with 371 additions and 420 deletions

InternalDocs/adaptive.md

@@ -31,8 +31,7 @@ although these are not fundamental and may change:
## Example family

The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
already has an adaptive family that serves as a relatively simple example.
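This can be observed from Python (a rough illustration; the `adaptive` parameter of `dis.dis()` exists on 3.11 and later, and warmup thresholds and the exact specialized opcodes vary by version):

```pycon
>>> import dis
>>> def f():
...     return len("abc")    # len is looked up with LOAD_GLOBAL
...
>>> for _ in range(1025):    # call enough times for the specializer to warm up
...     _ = f()
...
>>> dis.dis(f, adaptive=True)   # may show a specialized form such as LOAD_GLOBAL_BUILTIN
```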
The `LOAD_GLOBAL` instruction performs adaptive specialization,

InternalDocs/compiler.md

@@ -7,17 +7,16 @@ Abstract
In CPython, the compilation from source code to bytecode involves several steps:

1. Tokenize the source code [Parser/lexer/](../Parser/lexer/)
   and [Parser/tokenizer/](../Parser/tokenizer/).
2. Parse the stream of tokens into an Abstract Syntax Tree
   [Parser/parser.c](../Parser/parser.c).
3. Transform the AST into an instruction sequence
   [Python/compile.c](../Python/compile.c).
4. Construct a Control Flow Graph and apply optimizations to it
   [Python/flowgraph.c](../Python/flowgraph.c).
5. Emit bytecode based on the Control Flow Graph
   [Python/assemble.c](../Python/assemble.c).

This document outlines how these steps of the process work.
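The whole pipeline can be exercised from Python itself; a minimal sketch (using only the stdlib `ast` module, `compile()`, and `dis`; the exact bytecode printed varies by version):

```pycon
>>> import ast, dis
>>> tree = ast.parse("x = 1 + 2")              # steps 1-2: tokenize and parse into an AST
>>> code = compile(tree, "<example>", "exec")  # steps 3-5: lower the AST to a code object
>>> dis.dis(code)                              # prints the resulting bytecode
```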
@@ -36,12 +35,10 @@ of tokens rather than a stream of characters which is more common with PEG
parsers.

The grammar file for Python can be found in
[Grammar/python.gram](../Grammar/python.gram).
The definitions for literal tokens (such as `:`, numbers, etc.) can be found in
[Grammar/Tokens](../Grammar/Tokens). Various C files, including
[Parser/parser.c](../Parser/parser.c), are generated from these.

See Also:
@@ -63,7 +60,7 @@ specification of the AST nodes is specified using the Zephyr Abstract
Syntax Definition Language (ASDL) [^1], [^2].

The definition of the AST nodes for Python is found in the file
[Parser/Python.asdl](../Parser/Python.asdl).

Each AST node (representing statements, expressions, and several
specialized types, like list comprehensions and exception handlers) is
@@ -87,14 +84,14 @@ approach and syntax:
The preceding example describes two different kinds of statements and an
expression: function definitions, return statements, and yield expressions.
All three kinds are considered of type `stmt` as shown by `|` separating
the various kinds. They all take arguments of various kinds and amounts.
Modifiers on the argument type specify the number of values needed; `?`
means it is optional, `*` means 0 or more, while no modifier means only one
value for the argument and it is required. `FunctionDef`, for instance,
takes an `identifier` for the *name*, `arguments` for *args*, zero or more
`stmt` arguments for *body*, and zero or more `expr` arguments for
*decorators*.

Do notice that something like 'arguments', which is a node type, is
@@ -132,9 +129,9 @@ The statement definitions above generate the following C structure type:
```

Also generated are a series of constructor functions that allocate (in
this case) a `stmt_ty` struct with the appropriate initialization. The
`kind` field specifies which component of the union is initialized. The
`FunctionDef()` constructor function sets `kind` to `FunctionDef_kind` and
initializes the *name*, *args*, *body*, and *attributes* fields.
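The same ASDL-derived structure is mirrored by the Python-level `ast` module, which can help when exploring these definitions (a small illustration):

```pycon
>>> import ast
>>> mod = ast.parse("def f(x):\n    return x")
>>> fn = mod.body[0]
>>> type(fn).__name__
'FunctionDef'
>>> fn.name, len(fn.args.args), len(fn.body)   # name, args, body as described above
('f', 1, 1)
```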
See also
@@ -156,13 +153,13 @@ In general, unless you are working on the critical core of the compiler, memory
management can be completely ignored. But if you are working at either the
very beginning of the compiler or the end, you need to care about how the arena
works. All code relating to the arena is in either
[Include/internal/pycore_pyarena.h](../Include/internal/pycore_pyarena.h)
or [Python/pyarena.c](../Python/pyarena.c).

`PyArena_New()` will create a new arena. The returned `PyArena` structure
will store pointers to all memory given to it. This does the bookkeeping of
what memory needs to be freed when the compiler is finished with the memory it
used. That freeing is done with `PyArena_Free()`. This only needs to be
called in strategic areas where the compiler exits.
As stated above, in general you should not have to worry about memory
@@ -173,25 +170,25 @@ The only exception comes about when managing a PyObject. Since the rest
of Python uses reference counting, there is extra support added
to the arena to clean up each PyObject that was allocated. These cases
are very rare. However, if you've allocated a PyObject, you must tell
the arena about it by calling `PyArena_AddPyObject()`.
Source code to AST
==================
The AST is generated from source code using the function
`_PyParser_ASTFromString()` or `_PyParser_ASTFromFile()` in
[Parser/peg_api.c](../Parser/peg_api.c).

After some checks, a helper function in
[Parser/parser.c](../Parser/parser.c)
begins applying production rules on the source code it receives, converting source
code to tokens and matching these tokens recursively to their corresponding rule. The
production rule's corresponding rule function is called on every match. These rule
functions follow the format `xx_rule`, where *xx* is the grammar rule
that the function handles and is automatically derived from
[Grammar/python.gram](../Grammar/python.gram) by
[Tools/peg_generator/pegen/c_generator.py](../Tools/peg_generator/pegen/c_generator.py).

Each rule function in turn creates an AST node as it goes along. It does this
by allocating all the new nodes it needs, calling the proper AST node creation
@@ -202,18 +199,15 @@ there are no more rules, an error is set and the parsing ends.
The AST node creation helper functions have the name `_PyAST_{xx}`
where *xx* is the AST node that the function creates. These are defined by the
ASDL grammar and contained in [Python/Python-ast.c](../Python/Python-ast.c)
(which is generated by [Parser/asdl_c.py](../Parser/asdl_c.py)
from [Parser/Python.asdl](../Parser/Python.asdl)).
This all leads to a sequence of AST nodes stored in `asdl_seq` structs.

To demonstrate everything explained so far, here's the
rule function responsible for a simple named import statement such as
`import sys`. Note that error-checking and debugging code has been
omitted. Removed parts are represented by `...`.
Furthermore, some comments have been added for explanation. These comments
may not be present in the actual code.
@@ -255,55 +249,52 @@ may not be present in the actual code.
To improve backtracking performance, some rules (chosen by applying a
`(memo)` flag in the grammar file) are memoized. Each rule function checks
whether a memoized version exists and returns it if so; otherwise it continues
in the manner stated in the previous paragraphs.
There are macros for creating and using `asdl_xx_seq *` types, where *xx* is
a type of the ASDL sequence. Three main types are defined
manually -- `generic`, `identifier` and `int`. These types are found in
[Python/asdl.c](../Python/asdl.c) and its corresponding header file
[Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h).
Functions and macros for creating `asdl_xx_seq *` types are as follows:

`_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`
   Allocate memory for an `asdl_generic_seq` of the specified length
`_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`
   Allocate memory for an `asdl_identifier_seq` of the specified length
`_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`
   Allocate memory for an `asdl_int_seq` of the specified length

In addition to the three types mentioned above, some ASDL sequence types are
automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py) and found in
[Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h).
Macros for using both manually defined and automatically generated ASDL
sequence types are as follows:

`asdl_seq_GET(asdl_xx_seq *, int)`
   Get item held at a specific position in an `asdl_xx_seq`
`asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`
   Set a specific index in an `asdl_xx_seq` to the specified value

Untyped counterparts exist for some of the typed macros. These are useful
when a function needs to manipulate a generic ASDL sequence:

`asdl_seq_GET_UNTYPED(asdl_seq *, int)`
   Get item held at a specific position in an `asdl_seq`
`asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`
   Set a specific index in an `asdl_seq` to the specified value
`asdl_seq_LEN(asdl_seq *)`
   Return the length of an `asdl_seq` or `asdl_xx_seq`
Note that typed macros and functions are recommended over their untyped
counterparts. Typed macros carry out checks in debug mode and aid
debugging errors caused by incorrectly casting from `void *`.

If you are working with statements, you must also worry about keeping
track of what line number generated the statement. Currently the line
number is passed as the last parameter to each `stmt_ty` function.
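From Python, these sequences surface as plain lists on AST nodes, and each statement carries its line number (an illustration, not the C API itself):

```pycon
>>> import ast
>>> mod = ast.parse("a = 1\nb = 2")
>>> len(mod.body)            # the module body is an ASDL sequence of stmt nodes
2
>>> type(mod.body[1]).__name__, mod.body[1].lineno
('Assign', 2)
```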
See also [PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/).
@@ -333,19 +324,19 @@ else:
    end()
```

The `x < 10` guard is represented by its own basic block that
compares `x` with `10` and then ends in a conditional jump based on
the result of the comparison. This conditional jump allows the block
to point to both the body of the `if` and the body of the `else`. The
`if` basic block contains the `f1()` and `f2()` calls and points to
the `end()` basic block. The `else` basic block contains the `g()`
call and similarly points to the `end()` block.

Note that more complex code in the guard, the `if` body, or the `else`
body may be represented by multiple basic blocks. For instance,
short-circuiting boolean logic in a guard like `if x or y:`
will produce one basic block that tests the truth value of `x`
and then points both (1) to the start of the `if` body and (2) to
a different basic block that tests the truth value of `y`.
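The conditional jump that ends the guard's basic block is visible in an ordinary disassembly (a sketch; opcode names such as `POP_JUMP_IF_FALSE` vary across versions):

```pycon
>>> import dis
>>> def branch(x):
...     if x < 10:
...         y = 1
...     else:
...         y = 2
...     return y
...
>>> dis.dis(branch)   # look for the conditional jump following the comparison
```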
CFGs are useful as an intermediate representation of the code because
@@ -354,27 +345,24 @@ they are a convenient data structure for optimizations.
AST to CFG to bytecode
======================

The conversion of an `AST` to bytecode is initiated by a call to the function
`_PyAST_Compile()` in [Python/compile.c](../Python/compile.c).

The first step is to construct the symbol table. This is implemented by
`_PySymtable_Build()` in [Python/symtable.c](../Python/symtable.c).
This function begins by entering the starting code block for the passed-in AST
and then calling the proper `symtable_visit_{xx}` function (with *xx* being the
AST node type). Next, the AST is walked with the various code blocks that
delineate the reach of a local variable as blocks are entered and exited using
`symtable_enter_block()` and `symtable_exit_block()`, respectively.
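The resulting tables can be inspected from Python via the stdlib `symtable` module (a brief illustration):

```pycon
>>> import symtable
>>> st = symtable.symtable("def f(x):\n    y = x + 1\n    return y", "<example>", "exec")
>>> f_st = st.get_children()[0]           # the block for f
>>> sorted(s.get_name() for s in f_st.get_symbols())
['x', 'y']
>>> f_st.lookup("y").is_local()
True
```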
Once the symbol table is created, the `AST` is transformed by `compiler_codegen()`
in [Python/compile.c](../Python/compile.c) into a sequence of pseudo instructions.
These are similar to bytecode, but in some cases they are more abstract, and are
resolved later into actual bytecode. The construction of this instruction sequence
is handled by several functions that break the task down by various AST node types.
The functions are all named `compiler_visit_{xx}` where *xx* is the name of the node
type (such as `stmt`, `expr`, etc.). Each function receives a `struct compiler *`
and `{xx}_ty` where *xx* is the AST node type. Typically these functions
consist of a large 'switch' statement, branching based on the kind of
node type passed to it. Simple things are handled inline in the
@@ -382,242 +370,224 @@ node type passed to it. Simple things are handled inline in the
functions named `compiler_{xx}` with *xx* being a descriptive name of what is
being handled.

When transforming an arbitrary AST node, use the `VISIT()` macro.
The appropriate `compiler_visit_{xx}` function is called, based on the value
passed in for <node type> (so `VISIT({c}, expr, {node})` calls
`compiler_visit_expr({c}, {node})`). The `VISIT_SEQ()` macro is very similar,
but is called on AST node sequences (those values that were created as
arguments to a node that used the '*' modifier).
Emission of bytecode is handled by the following macros:

* `ADDOP(struct compiler *, location, int)`
  add a specified opcode
* `ADDOP_IN_SCOPE(struct compiler *, location, int)`
  like `ADDOP`, but also exits current scope; used for adding return value
  opcodes in lambdas and closures
* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`
  add an opcode that takes an integer argument
* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`
  add an opcode with the proper argument based on the position of the
  specified PyObject in the PyObject sequence object, but with no handling of
  mangled names; used when you need to do named lookups of objects such
  as globals, consts, or parameters where name mangling is not possible and
  the scope of the name is known; *TYPE* is the name of the PyObject sequence
  (`names` or `varnames`)
* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`
  just like `ADDOP_O`, but steals a reference to PyObject
* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`
  just like `ADDOP_O`, but name mangling is also handled; used for
  attribute loading or importing based on name
* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`
  add the `LOAD_CONST` opcode with the proper argument based on the
  position of the specified PyObject in the consts table.
* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`
  just like `ADDOP_LOAD_CONST`, but steals a reference to PyObject
* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`
  create a jump to a basic block
The `location` argument is a struct with the source location to be
associated with this instruction. It is typically extracted from an
`AST` node with the `LOC` macro. The `NO_LOCATION` value can be used
for *synthetic* instructions, which we do not associate with a line
number at this stage. For example, the implicit `return None`,
which is added at the end of a function, is not associated with any
line in the source code.
There are several helper functions that will emit pseudo-instructions
and are named `compiler_{xx}()` where *xx* is what the function helps
with (`list`, `boolop`, etc.). A rather useful one is `compiler_nameop()`.
This function looks up the scope of a variable and, based on the
expression context, emits the proper opcode to load, store, or delete
the variable.
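The effect of this scope-based opcode selection is easy to observe with `dis` (a sketch; exact opcode names vary by version):

```pycon
>>> import dis
>>> dis.dis("x = 1")      # module scope: the name is stored with STORE_NAME
>>> def f():
...     x = 1             # function scope: a local, stored with STORE_FAST
...
>>> dis.dis(f)
```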
Once the instruction sequence is created, it is transformed into a CFG
by `_PyCfg_FromInstructionSequence()`. Then `_PyCfg_OptimizeCodeUnit()`
applies various peephole optimizations, and
`_PyCfg_OptimizedCfgToInstructionSequence()` converts the optimized `CFG`
back into an instruction sequence. These conversions and optimizations are
implemented in [Python/flowgraph.c](../Python/flowgraph.c).
Finally, the sequence of pseudo-instructions is converted into actual
bytecode. This includes transforming pseudo instructions into actual instructions,
converting jump targets from logical labels to relative offsets, and
construction of the [exception table](exception_handling.md) and
[locations table](locations.md).
The bytecode and tables are then wrapped into a `PyCodeObject` along with additional
metadata, including the `consts` and `names` arrays, information about the function,
and a reference to the source code (filename, etc). All of this is implemented by
`_PyAssemble_MakeCodeObject()` in [Python/assemble.c](../Python/assemble.c).
Code objects
============

The result of `PyAST_CompileObject()` is a `PyCodeObject` which is defined in
[Include/cpython/code.h](../Include/cpython/code.h).
And with that you now have executable Python bytecode!

The code objects (byte code) are executed in [Python/ceval.c](../Python/ceval.c).
This file will also need a new case statement for the new opcode in the big switch
statement in `_PyEval_EvalFrameDefault()`.
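Code objects and their metadata are visible from Python (a small illustration; the exact set of fields grows over time):

```pycon
>>> def f(a):
...     return a + 1
...
>>> f.__code__.co_name, f.__code__.co_varnames
('f', ('a',))
>>> import types
>>> isinstance(f.__code__, types.CodeType)
True
```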
Important files
===============

* [Parser/](../Parser/)

  * [Parser/Python.asdl](../Parser/Python.asdl):
    ASDL syntax file.

  * [Parser/asdl.py](../Parser/asdl.py):
    Parser for ASDL definition files.
    Reads in an ASDL description and parses it into an AST that describes it.

  * [Parser/asdl_c.py](../Parser/asdl_c.py):
    Generate C code from an ASDL description. Generates
    [Python/Python-ast.c](../Python/Python-ast.c) and
    [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h).

  * [Parser/parser.c](../Parser/parser.c):
    The new PEG parser introduced in Python 3.9. Generated by
    [Tools/peg_generator/pegen/c_generator.py](../Tools/peg_generator/pegen/c_generator.py)
    from the grammar [Grammar/python.gram](../Grammar/python.gram).
    Creates the AST from source code. Rule functions for their corresponding production
    rules are found here.

  * [Parser/peg_api.c](../Parser/peg_api.c):
    Contains high-level functions which are used by the interpreter to create
    an AST from source code.

  * [Parser/pegen.c](../Parser/pegen.c):
    Contains helper functions which are used by functions in
    [Parser/parser.c](../Parser/parser.c) to construct the AST. Also contains
    helper functions which help raise better error messages when parsing source code.

  * [Parser/pegen.h](../Parser/pegen.h):
    Header file for the corresponding [Parser/pegen.c](../Parser/pegen.c).
    Also contains definitions of the `Parser` and `Token` structs.
* [Python/](../Python)

  * [Python/Python-ast.c](../Python/Python-ast.c):
    Creates C structs corresponding to the ASDL types. Also contains code for
    marshalling AST nodes (core ASDL types have marshalling code in
    [Python/asdl.c](../Python/asdl.c)).
    File automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py).
    This file must be committed separately after every grammar change
    is committed since the `__version__` value is set to the latest
    grammar change revision number.

  * [Python/asdl.c](../Python/asdl.c):
    Contains code to handle the ASDL sequence type.
    Also has code to handle marshalling the core ASDL types, such as number
    and identifier. Used by [Python/Python-ast.c](../Python/Python-ast.c)
    for marshalling AST nodes.

  * [Python/ast.c](../Python/ast.c):
    Used for validating the AST.

  * [Python/ast_opt.c](../Python/ast_opt.c):
    Optimizes the AST.

  * [Python/ast_unparse.c](../Python/ast_unparse.c):
    Converts the AST expression node back into a string (for string annotations).

  * [Python/ceval.c](../Python/ceval.c):
    Executes byte code (a.k.a. the eval loop).

  * [Python/symtable.c](../Python/symtable.c):
    Generates a symbol table from the AST.

  * [Python/pyarena.c](../Python/pyarena.c):
    Implementation of the arena memory manager.

  * [Python/compile.c](../Python/compile.c):
    Emits pseudo bytecode based on the AST.

  * [Python/flowgraph.c](../Python/flowgraph.c):
    Implements peephole optimizations.

  * [Python/assemble.c](../Python/assemble.c):
    Constructs a code object from a sequence of pseudo instructions.

  * [Python/instruction_sequence.c](../Python/instruction_sequence.c):
    A data structure representing a sequence of bytecode-like pseudo-instructions.
* [Include/](../Include/)

  * [Include/cpython/code.h](../Include/cpython/code.h)
    : Header file for [Objects/codeobject.c](../Objects/codeobject.c);
    contains definition of `PyCodeObject`.

  * [Include/opcode.h](../Include/opcode.h)
    : One of the files that must be modified whenever
    [Lib/opcode.py](../Lib/opcode.py) is.

  * [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h)
    : Contains the actual definitions of the C structs as generated by
    [Python/Python-ast.c](../Python/Python-ast.c).
    Automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py).

  * [Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h)
    : Header for the corresponding [Python/ast.c](../Python/ast.c).

  * [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h)
    : Declares `_PyAST_Validate()` external (from [Python/ast.c](../Python/ast.c)).

  * [Include/internal/pycore_symtable.h](../Include/internal/pycore_symtable.h)
    : Header for [Python/symtable.c](../Python/symtable.c).
    `struct symtable` and `PySTEntryObject` are defined here.

  * [Include/internal/pycore_parser.h](../Include/internal/pycore_parser.h)
    : Header for the corresponding [Parser/peg_api.c](../Parser/peg_api.c).

  * [Include/internal/pycore_pyarena.h](../Include/internal/pycore_pyarena.h)
    : Header file for the corresponding [Python/pyarena.c](../Python/pyarena.c).

  * [Include/opcode_ids.h](../Include/opcode_ids.h)
    : List of opcodes. Generated from [Python/bytecodes.c](../Python/bytecodes.c) by
    [Tools/cases_generator/opcode_id_generator.py](../Tools/cases_generator/opcode_id_generator.py).
* [Objects/](../Objects/)

  * [Objects/codeobject.c](../Objects/codeobject.c)
    : Contains PyCodeObject-related code.

  * [Objects/frameobject.c](../Objects/frameobject.c)
    : Contains the `frame_setlineno()` function, which determines whether it is allowed
    to make a jump between two points in a bytecode.

* [Lib/](../Lib/)

  * [Lib/opcode.py](../Lib/opcode.py)
    : opcode utilities exposed to Python.

  * [Include/internal/pycore_magic_number.h](../Include/internal/pycore_magic_number.h)
    : Home of the magic number (named `MAGIC_NUMBER`) for bytecode versioning.

Objects
@@ -625,7 +595,7 @@ Objects

* [Locations](locations.md): Describes the location table
* [Frames](frames.md): Describes frames and the frame stack
* [Objects/object_layout.md](../Objects/object_layout.md): Describes object layout for 3.11 and later
* [Exception Handling](exception_handling.md): Describes the exception table

InternalDocs/exception_handling.md

@@ -68,18 +68,16 @@ Handling Exceptions
-------------------

At runtime, when an exception occurs, the interpreter calls
`get_exception_handler()` in [Python/ceval.c](../Python/ceval.c)
to look up the offset of the current instruction in the exception
table. If it finds a handler, control flow transfers to it. Otherwise, the
exception bubbles up to the caller, and the caller's frame is
checked for a handler covering the `CALL` instruction. This
repeats until a handler is found or the topmost frame is reached.
If no handler is found, then the interpreter function
(`_PyEval_EvalFrameDefault()`) returns NULL. During unwinding,
the traceback is constructed as each frame is added to it by
`PyTraceBack_Here()`, which is in [Python/traceback.c](../Python/traceback.c).

Along with the location of an exception handler, each entry of the
exception table also contains the stack depth of the `try` instruction
@@ -174,22 +172,20 @@ which is then encoded as:
for a total of five bytes.

The code to construct the exception table is in `assemble_exception_table()`
in [Python/assemble.c](../Python/assemble.c).

The interpreter's function to look up the table by instruction offset is
`get_exception_handler()` in [Python/ceval.c](../Python/ceval.c).
The Python function `_parse_exception_table()` in [Lib/dis.py](../Lib/dis.py)
returns the exception table content as a list of namedtuple instances.
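The table can also be seen in a regular disassembly (on 3.11 and later; the exact entries vary by version):

```pycon
>>> import dis
>>> def f():
...     try:
...         return 1 / 0
...     except ZeroDivisionError:
...         return None
...
>>> dis.dis(f)   # the listing ends with an "ExceptionTable:" section
```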
Exception Chaining Implementation
---------------------------------

[Exception chaining](https://docs.python.org/dev/tutorial/errors.html#exception-chaining)
refers to setting the `__context__` and `__cause__` fields of an exception as it is
being raised. The `__context__` field is set by `_PyErr_SetObject()` in
[Python/errors.c](../Python/errors.c) (which is ultimately called by all
`PyErr_Set*()` functions). The `__cause__` field (explicit chaining) is set by
the `RAISE_VARARGS` bytecode.

InternalDocs/frames.md

@@ -10,20 +10,19 @@ of three conceptual sections:
globals dict, code object, instruction pointer, stack depth, the
previous frame, etc.

The definition of the `_PyInterpreterFrame` struct is in
[Include/internal/pycore_frame.h](../Include/internal/pycore_frame.h).

# Allocation

Python semantics allow frames to outlive their activation, so they need to
be allocated outside the C call stack. To reduce overhead and improve locality
of reference, most frames are allocated contiguously in a per-thread stack
(see `_PyThreadState_PushFrame` in [Python/pystate.c](../Python/pystate.c)).

Frames of generators and coroutines are embedded in the generator and coroutine
objects, so are not allocated in the per-thread stack. See `PyGenObject` in
[Include/internal/pycore_genobject.h](../Include/internal/pycore_genobject.h).

## Layout
@@ -82,16 +81,15 @@ frames for each activation, but with low runtime overhead.

### Generators and Coroutines

Generators (objects of type `PyGen_Type`, `PyCoro_Type` or
`PyAsyncGen_Type`) have a `_PyInterpreterFrame` embedded in them, so
that they can be created with a single memory allocation.
When such an embedded frame is iterated or awaited, it can be linked with
frames on the per-thread stack via the linkage fields.

If a frame object associated with a generator outlives the generator, then
the embedded `_PyInterpreterFrame` is copied into the frame object (see
`take_ownership()` in [Python/frame.c](../Python/frame.c)).
### Field names

InternalDocs/garbage_collector.md

@@ -12,7 +12,7 @@ a local variable in some C function. When an object's reference count becomes
the object is deallocated. If it contains references to other objects, their
reference counts are decremented. Those other objects may be deallocated in turn, if
this decrement makes their reference count become zero, and so on. The reference
count field can be examined using the `sys.getrefcount()` function (notice that the
value returned by this function is always 1 more, as the function also has a reference
to the object when called):
@@ -39,7 +39,7 @@ cycles. For instance, consider this code:
>>> del container
```

In this example, `container` holds a reference to itself, so even when we remove
our reference to it (the variable "container") the reference count never falls to 0
because it still has its own internal reference. Therefore it would never be
cleaned just by simple reference counting. For this reason some additional machinery
@@ -127,7 +127,7 @@ GC for the free-threaded build
------------------------------

In the free-threaded build, Python objects contain a 1-byte field
`ob_gc_bits` that is used to track garbage collection related state. The
field exists in all objects, including ones that do not support cyclic
garbage collection. The field is used to identify objects that are tracked
by the collector, ensure that finalizers are called only once per object,
@@ -146,14 +146,14 @@ and, during garbage collection, differentiate reachable vs. unreachable objects.
| ... |
```

Note that not all fields are to scale. `pad` is two bytes, `ob_mutex` and
`ob_gc_bits` are each one byte, and `ob_ref_local` is four bytes. The
other fields, `ob_tid`, `ob_ref_shared`, and `ob_type`, are all
pointer-sized (that is, eight bytes on a 64-bit platform).

The garbage collector also temporarily repurposes the `ob_tid` (thread ID)
and `ob_ref_local` (local reference count) fields during collections.
@@ -165,17 +165,17 @@ objects with GC support. These APIs can be found in the
[Garbage Collector C API documentation](https://docs.python.org/3/c-api/gcsupport.html).

Apart from this object structure, the type object for objects supporting garbage
collection must include `Py_TPFLAGS_HAVE_GC` in its `tp_flags` slot and
provide an implementation of the `tp_traverse` handler. Unless it can be proven
that the objects cannot form reference cycles with only objects of its type or unless
the type is immutable, a `tp_clear` implementation must also be provided.
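Whether a given object participates in cyclic GC can be checked from Python with `gc.is_tracked()` (an illustration):

```pycon
>>> import gc
>>> gc.is_tracked(42)        # ints cannot form cycles, so they are not tracked
False
>>> gc.is_tracked([])        # containers are tracked by the collector
True
```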
Identifying reference cycles
============================

The algorithm that CPython uses to detect those reference cycles is
implemented in the `gc` module. The garbage collector **only focuses**
on cleaning container objects (that is, objects that can contain a reference
to one or more objects). These can be arrays, dictionaries, lists, custom
class instances, classes in extension modules, etc. One could think that
@ -195,7 +195,7 @@ the interpreter create cycles everywhere. Some notable examples:
To correctly dispose of these objects once they become unreachable, they need
to be identified first. To understand how the algorithm works, let's take
the case of a circular linked list which has one link referenced by a
variable `A`, and one self-referencing object which is completely
unreachable:

```pycon
@ -234,7 +234,7 @@ objects have a refcount larger than the number of incoming references from
within the candidate set.

Every object that supports garbage collection will have an extra reference
count field initialized to the reference count (`gc_ref` in the figures)
of that object when the algorithm starts. This is because the algorithm needs
to modify the reference count to do the computations and in this way the
interpreter will not modify the real reference count field.
@ -243,43 +243,43 @@ interpreter will not modify the real reference count field.
The GC then iterates over all containers in the first list and decrements by one the
`gc_ref` field of any other object that container is referencing. Doing
this makes use of the `tp_traverse` slot in the container class (implemented
using the C API or inherited by a superclass) to know what objects are referenced by
each container. After all the objects have been scanned, only the objects that have
references from outside the “objects to scan” list will have `gc_ref > 0`.

![gc-image2](images/python-cyclic-gc-2-new-page.png)

Notice that having `gc_ref == 0` does not imply that the object is unreachable.
This is because another object that is reachable from the outside (`gc_ref > 0`)
can still have references to it. For instance, the `link_2` object in our example
ended up with `gc_ref == 0` but is still referenced by the `link_1` object that
is reachable from the outside. To obtain the set of objects that are really
unreachable, the garbage collector re-scans the container objects using the
`tp_traverse` slot; this time with a different traverse function that marks objects with
`gc_ref == 0` as "tentatively unreachable" and then moves them to the
tentatively unreachable list. The following image depicts the state of the lists at the
moment when the GC has processed the `link_3` and `link_4` objects but has not
processed `link_1` and `link_2` yet.
![gc-image3](images/python-cyclic-gc-3-new-page.png)

Then the GC scans the next `link_1` object. Because it has `gc_ref == 1`,
the GC does not do anything special because it knows it has to be reachable (and is
already in what will become the reachable list):

![gc-image4](images/python-cyclic-gc-4-new-page.png)

When the GC encounters an object which is reachable (`gc_ref > 0`), it traverses
its references using the `tp_traverse` slot to find all the objects that are
reachable from it, moving them to the end of the list of reachable objects (where
they started originally) and setting its `gc_ref` field to 1. This is what happens
to `link_2` and `link_3` below as they are reachable from `link_1`. From the
state in the previous image and after examining the objects referred to by `link_1`
the GC knows that `link_3` is reachable after all, so it is moved back to the
original list and its `gc_ref` field is set to 1 so that if the GC visits it again,
it will know that it's reachable. To avoid visiting an object twice, the GC marks all
objects that have already been visited once (by unsetting the `PREV_MASK_COLLECTING`
flag) so that if an object that has already been processed is referenced by some other
object, the GC does not process it twice.
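
The subtraction-and-rescan logic can be summarized in a short Python model.
This is only a sketch of the idea, not the real implementation (which works
on `PyGC_Head` lists in C); the nodes, refcounts and edges here are
hypothetical inputs:

```python
def find_unreachable(nodes, refcount, edges):
    # Copy the real refcounts into gc_ref so they are never modified.
    gc_ref = {n: refcount[n] for n in nodes}
    # Subtract one for every reference coming from inside the candidate set.
    for n in nodes:
        for m in edges[n]:
            gc_ref[m] -= 1
    # Nodes left with gc_ref > 0 are referenced from outside; everything
    # reachable from them is alive as well.
    stack = [n for n in nodes if gc_ref[n] > 0]
    reachable = set(stack)
    while stack:
        for m in edges[stack.pop()]:
            if m not in reachable:
                reachable.add(m)
                stack.append(m)
    return [n for n in nodes if n not in reachable]

# Two links kept alive by an external variable A, plus one
# self-referencing object with no external references:
nodes = ["link_1", "link_2", "self_ref"]
refcount = {"link_1": 2, "link_2": 1, "self_ref": 1}
edges = {"link_1": ["link_2"], "link_2": ["link_1"], "self_ref": ["self_ref"]}
assert find_unreachable(nodes, refcount, edges) == ["self_ref"]
```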
@ -295,7 +295,7 @@ list are really unreachable and can thus be garbage collected.
Pragmatically, it's important to note that no recursion is required by any of this,
and neither does it in any other way require additional memory proportional to the
number of objects, number of pointers, or the lengths of pointer chains. Apart from
`O(1)` storage for internal C needs, the objects themselves contain all the storage
the GC algorithms require.

Why moving unreachable objects is better
@ -331,7 +331,7 @@ with the objective of completely destroying these objects. Roughly, the process
follows these steps in order:

1. Handle and clear weak references (if any). Weak references to unreachable objects
   are set to `None`. If the weak reference has an associated callback, the callback
   is enqueued to be called once the clearing of weak references is finished. We only
   invoke callbacks for weak references that are themselves reachable. If both the weak
   reference and the pointed-to object are unreachable we do not execute the callback.
@ -339,15 +339,15 @@ follows these steps in order:
   object and support for weak references predates support for object resurrection.
   Ignoring the weak reference's callback is fine because both the object and the weakref
   are going away, so it's legitimate to say the weak reference is going away first
   (see the example after this list).
2. If an object has legacy finalizers (`tp_del` slot) move it to the
   `gc.garbage` list.
3. Call the finalizers (`tp_finalize` slot) and mark the objects as already
   finalized to avoid calling finalizers twice if the objects are resurrected or
   if other finalizers have removed the object first.
4. Deal with resurrected objects. If some objects have been resurrected, the GC
   finds the new subset of objects that are still unreachable by running the cycle
   detection algorithm again and continues with them.
5. Call the `tp_clear` slot of every object so all internal links are broken and
   the reference counts fall to 0, triggering the destruction of all unreachable
   objects.
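
Step 1 can be observed from Python. In this sketch, a weak reference to an
object kept alive only by a self-referencing cycle is cleared once the cycle
collector runs (the return value of `gc.collect()` depends on what else is in
the heap, so it is discarded here):

```pycon
>>> import gc, weakref
>>> class Node:
...     pass
...
>>> node = Node()
>>> node.self = node        # a cycle: the object references itself
>>> ref = weakref.ref(node)
>>> del node                # only the cycle keeps the object alive now
>>> ref() is None
False
>>> _ = gc.collect()        # the collector finds and clears the cycle
>>> ref() is None           # the weak reference was set to None (step 1)
True
```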
@ -376,9 +376,9 @@ generations. Every collection operates on the entire heap.
In order to decide when to run, the collector keeps track of the number of object
allocations and deallocations since the last collection. When the number of
allocations minus the number of deallocations exceeds `threshold_0`,
collection starts. Initially only generation 0 is examined. If generation 0 has
been examined more than `threshold_1` times since generation 1 has been
examined, then generation 1 is examined as well. With generation 2,
things are a bit more complicated; see
[Collecting the oldest generation](#Collecting-the-oldest-generation) for
@ -393,8 +393,8 @@ function:
```

The content of these generations can be examined using the
`gc.get_objects(generation=NUM)` function and collections can be triggered
specifically in a generation by calling `gc.collect(generation=NUM)`.

```pycon
>>> import gc
@ -433,7 +433,7 @@ Collecting the oldest generation
--------------------------------

In addition to the various configurable thresholds, the GC only triggers a full
collection of the oldest generation if the ratio `long_lived_pending / long_lived_total`
is above a given value (hardwired to 25%). The reason is that, while "non-full"
collections (that is, collections of the young and middle generations) will always
examine roughly the same number of objects (determined by the aforementioned
@ -463,12 +463,12 @@ used for tags or to keep other information most often as a bit field (each
bit a separate tag) as long as code that uses the pointer masks out these
bits before accessing memory. For example, on a 32-bit architecture (for both
addresses and word size), a word is 32 bits = 4 bytes, so word-aligned
addresses are always a multiple of 4, hence end in `00`, leaving the last 2 bits
available; while on a 64-bit architecture, a word is 64 bits = 8 bytes, so
word-aligned addresses end in `000`, leaving the last 3 bits available.
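
The masking arithmetic can be sketched in Python with a hypothetical address
(8-byte alignment assumed, as on a 64-bit platform):

```python
ALIGNMENT = 8              # pointers are 8-byte aligned on a 64-bit platform
TAG_MASK = ALIGNMENT - 1   # 0b111: bits that are always zero in an address

addr = 0x7F3ABC001230      # a hypothetical aligned address
assert addr & TAG_MASK == 0

tagged = addr | 0b01       # stash a one-bit flag in the unused low bits
flag = tagged & TAG_MASK   # recover the flag ...
real = tagged & ~TAG_MASK  # ... and the original address
assert flag == 0b01 and real == addr
```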
The CPython GC makes use of two fat pointers that correspond to the extra fields
of `PyGC_Head` discussed in the [Memory layout and object structure](#memory-layout-and-object-structure) section:

> [!WARNING]
> Because of the presence of extra information, "tagged" or "fat" pointers cannot be
@ -478,23 +478,23 @@ of ``PyGC_Head`` discussed in the `Memory layout and object structure`_ section:
> normally assume the pointers inside the lists are in a consistent state.

- The `_gc_prev` field is normally used as the "previous" pointer to maintain the
  doubly linked list but its lowest two bits are used to keep the flags
  `PREV_MASK_COLLECTING` and `_PyGC_PREV_MASK_FINALIZED`. Between collections,
  the only flag that can be present is `_PyGC_PREV_MASK_FINALIZED`, which indicates
  if an object has already been finalized. During collections `_gc_prev` is
  temporarily used for storing a copy of the reference count (`gc_ref`), in
  addition to two flags, and the GC linked list becomes a singly linked list until
  `_gc_prev` is restored.
- The `_gc_next` field is used as the "next" pointer to maintain the doubly linked
  list but during collection its lowest bit is used to keep the
  `NEXT_MASK_UNREACHABLE` flag that indicates if an object is tentatively
  unreachable during the cycle detection algorithm. This is a drawback to using only
  doubly linked lists to implement partitions: while most needed operations are
  constant-time, there is no efficient way to determine which partition an object is
  currently in. Instead, when that's needed, ad hoc tricks (like the
  `NEXT_MASK_UNREACHABLE` flag) are employed.

Optimization: delay tracking containers
=======================================
@ -531,7 +531,7 @@ benefit from delayed tracking:
full garbage collection (all generations), the collector will untrack any dictionaries
whose contents are not tracked.

The garbage collector module provides the Python function `is_tracked(obj)`, which returns
the current tracking status of the object. Subsequent garbage collections may change the
tracking status of the object.
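
For example, instances of atomic types are never tracked, and a dictionary
holding only atomic values starts out untracked, while a dictionary holding
another container is tracked:

```pycon
>>> import gc
>>> gc.is_tracked(0)
False
>>> gc.is_tracked([])
True
>>> gc.is_tracked({})
False
>>> gc.is_tracked({"a": 1})
False
>>> gc.is_tracked({"a": []})
True
```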
@ -556,20 +556,20 @@ Differences between GC implementations
This section summarizes the differences between the GC implementation in the
default build and the implementation in the free-threaded build.

The default build implementation makes extensive use of the `PyGC_Head` data
structure, while the free-threaded build implementation does not use that
data structure.

- The default build implementation stores all tracked objects in a doubly
  linked list using `PyGC_Head`. The free-threaded build implementation
  instead relies on the embedded mimalloc memory allocator to scan the heap
  for tracked objects.
- The default build implementation uses `PyGC_Head` for the unreachable
  object list. The free-threaded build implementation repurposes the
  `ob_tid` field to store a linked list of unreachable objects.
- The default build implementation stores flags in the `_gc_prev` field of
  `PyGC_Head`. The free-threaded build implementation stores these flags
  in `ob_gc_bits`.

The default build implementation relies on the

View File

@ -9,12 +9,12 @@ Python's Parser is currently a
[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser.

The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
[grammar file](../Grammar/python.gram).
Developers rarely need to modify the generator itself.

See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
@ -33,9 +33,9 @@ is ordered. This means that when writing:
rule: A | B | C
```

a parser that implements a context-free grammar (such as an `LL(1)` parser) will
generate constructions that, given an input string, *deduce* which alternative
(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
the first one that succeeds.
@ -67,21 +67,21 @@ time complexity with a technique called
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
is that the parser will naturally use more memory than a simple `LL(1)` parser,
which is normally table-based.
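
As an illustrative sketch (not Pegen's implementation), packrat memoization
amounts to caching each rule's result per input position. Rule functions here
are assumed to take `(text, pos)` and return the new position or `None`:

```python
def memoize(rule):
    # One cache per rule; a real parser would reset it for each new input.
    cache = {}
    def memoized(text, pos):
        if pos not in cache:
            cache[pos] = rule(text, pos)  # parse once, reuse on backtracking
        return cache[pos]
    return memoized
```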
Key ideas
---------

- Alternatives are ordered (`A | B` is not the same as `B | A`).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
  it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
  using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text), the
  PEG parser doesn't have a concept of "where the
  [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".

> [!IMPORTANT]
@ -111,16 +111,16 @@ the following two rules (in these examples, a token is an individual character):
second_rule: ('aa' | 'a' ) 'a'
```

In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but
in PEG, one of these two rules accepts the string `aaa` but not the string
`aa`. The other does the opposite -- it accepts the string `aa`
but not the string `aaa`. The rule `('a'|'aa')'a'` does
not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the
final `a` in the rule consume the second, and leaving out the third `a`.
As the rule has succeeded, no attempt is ever made to go back and let
`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does
not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing
for the final `a`. Again, the second alternative of `'aa'|'a'` is not
tried.
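
A minimal recursive-descent sketch in Python (hypothetical helper names, not
Pegen code) makes this committed-choice behavior concrete. Each rule takes the
input and a position and returns the new position, or `None` on failure:

```python
def literal(s):
    # Match the exact string s at the current position.
    def rule(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return rule

def choice(*alternatives):
    # PEG ordered choice: commit to the first alternative that succeeds.
    def rule(text, pos):
        for alt in alternatives:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return rule

def sequence(*parts):
    # Match each part in order; fail if any part fails.
    def rule(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return rule

def accepts(rule, text):
    return rule(text, 0) == len(text)

first_rule = sequence(choice(literal("a"), literal("aa")), literal("a"))
second_rule = sequence(choice(literal("aa"), literal("a")), literal("a"))

assert accepts(first_rule, "aa") and not accepts(first_rule, "aaa")
assert accepts(second_rule, "aaa") and not accepts(second_rule, "aa")
```

Once `choice` has returned, `sequence` never asks it for another alternative,
which is exactly why `('a'|'aa') 'a'` rejects `aaa`.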
> [!CAUTION]
@ -137,7 +137,7 @@ one is in almost all cases a mistake, for example:
```

In this example, the second alternative will never be tried because the first one will
succeed first (even if the input string has an `'else' block` that follows). To correctly
write this rule you can simply alter the order:

```
@ -146,7 +146,7 @@ write this rule you can simply alter the order:
    | 'if' expression 'then' block
```

In this case, if the input string doesn't have an `'else' block`, the first alternative
will fail and the second will be attempted.

Grammar Syntax
@ -166,8 +166,8 @@ the rule:
rule_name[return_type]: expression
```

If the return type is omitted, then a `void *` is returned in C and an
`Any` in Python.

Grammar expressions
-------------------
@ -214,7 +214,7 @@ Variables in the grammar
------------------------

A sub-expression can be named by preceding it with an identifier and an
`=` sign. The name can then be used in the action (see below), like this:

```
rule_name[return_type]: '(' a=some_other_rule ')' { a }
@ -387,9 +387,9 @@ returns a valid C-based Python AST:
    | NUMBER
```

Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena`, those being variables automatically
injected by the parser; `p` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:
@ -422,50 +422,47 @@ Pegen
Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the Python
grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce
the final C parser. It contains the following pieces:

- A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar. The generator is located at
  [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
  for the parser generator itself (this means that there are no manually-written
  parsers). The meta-grammar is located at
  [`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and Python AST objects.

The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen)
but normally all typical commands to interact with the parser generator are executed from
the main makefile.
How to regenerate the parser
----------------------------

Once you have made the changes to the grammar files, to regenerate the `C`
parser (the one used by the interpreter) just execute:

```
make regen-pegen
```

using the `Makefile` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser or to execute:

```
./PCbuild/build.bat --regen
```

The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).

How to regenerate the meta-parser
---------------------------------

The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
@ -488,11 +485,11 @@ Grammatical elements and rules
Pegen has some special grammatical elements and rules:

- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS.
- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS.
- Uppercase names (for example, `NAME`) denote tokens in the
  [`Grammar/Tokens`](../Grammar/Tokens) file.
- Rule names starting with `invalid_` are used for specialized syntax errors.
  - These rules are NOT used in the first pass of the parser.
  - Only if the first pass fails to parse, a second pass including the invalid
@ -509,14 +506,13 @@ Tokenization
It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
boundaries, some special keywords like `ASYNC` and `AWAIT` (for
compatibility purposes), backtracking errors (such as unclosed parentheses),
dealing with encoding, interactive mode and much more. Some of these reasons
are also there for historical purposes, and some others are useful even today.

The list of tokens (all uppercase names in the grammar) that you can use can
be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:
@ -532,9 +528,7 @@ the tokens or to execute:
```

How tokens are generated and the rules governing this are completely up to the tokenizer
([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer));
the parser just receives tokens from it.

Memoization
@ -548,7 +542,7 @@ both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache) the execution time cost comes
from continuously checking if the given rule has a cache hit or not. In many
situations, just parsing it again can be faster. Pegen **disables memoization
by default** except for rules with the special marker `memo` after the rule
name (and type, if present):

```
@ -567,8 +561,7 @@ To determine whether a new rule needs memoization or not, benchmarking is requir
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that allows measuring how much each rule uses
memoization (check the [`Parser/pegen.c`](../Parser/pegen.c)
file for more information) but it needs to be manually activated.

Automatic variables
@ -578,9 +571,9 @@ To make writing actions easier, Pegen injects some automatic variables in the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:

- `p`: The parser structure.
- `EXTRA`: This is a macro that expands to
  `(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`,
  which is normally used to create AST nodes as almost all constructors need these
  attributes to be provided. All of the location variables are taken from the
  location information of the current token.
@ -590,13 +583,13 @@ Hard and soft keywords
> [!NOTE]
> In the grammar files, keywords are defined using **single quotes** (for example,
> `'class'`) while soft keywords are defined using **double quotes** (for example,
> `"match"`).

There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
(for example, `x = class + 1`), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:
@ -621,7 +614,7 @@ one where they are defined as keywords:
>>> foo(match="Yeah!")
```

The `match` and `case` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.
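
A quick sketch of this behavior (Python 3.10+); both uses coexist in the same
session:

```pycon
>>> match = 42              # valid: the soft keyword works as a variable name
>>> value = 42
>>> match value:            # and it still introduces a match statement
...     case int():
...         print("an int")
...
an int
```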
@ -662,7 +655,7 @@ is, and it will unwind the stack and report the exception. This means that if a
[rule action](#grammar-actions) raises an exception, all parsing will
stop at that exact point. This is done to correctly propagate any
exception set by calling Python's C API functions. This also includes
[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom syntax
error messages.
@ -684,10 +677,10 @@ grammar.
To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
(the parser returns `NULL` in C or `None` in Python) but no exception has
been raised.

As the Python grammar was originally written as an `LL(1)` grammar, this heuristic
has an extremely high success rate, but some PEG features, such as lookaheads,
can impact this.
@ -699,19 +692,19 @@ can impact this.
To generate more precise syntax errors, custom rules are used. This is a common
practice also in context free grammars: the parser will try to accept some
construct that is known to be incorrect just to report a specific syntax error
for that construct. In pegen grammars, these rules start with the `invalid_`
prefix. This is because trying to match these rules normally has a performance
impact on parsing (and can also affect the 'correct' grammar itself in some
tricky cases, depending on the ordering of the rules) so the generated parser
acts in two phases:

1. The first phase will try to parse the input stream without taking into
   account rules that start with the `invalid_` prefix. If the parsing
   succeeds it will return the generated AST and the second phase will be
   skipped.
2. If the first phase failed, a second parsing attempt is done including the
   rules that start with an `invalid_` prefix. By design this attempt
   **cannot succeed** and is only executed to give the invalid rules a
   chance to detect specific situations where custom, more precise, syntax
   errors can be raised. This also allows trading a bit of performance for
@ -723,15 +716,15 @@ acts in two phases:
> When defining invalid rules:
>
> - Make sure all custom invalid rules raise
>   [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
>   exceptions (or a subclass of it).
> - Make sure **all** invalid rules start with the `invalid_` prefix to not
>   impact performance of parsing correct Python code.
> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
>   (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).
You can find a collection of macros to raise specialized syntax errors in the
[`Parser/pegen.h`](../Parser/pegen.h)
header file. These macros also allow reporting ranges for
the custom errors, which will be highlighted in the tracebacks that will be
displayed when the error is reported.
@ -746,35 +739,33 @@ displayed when the error is reported.
<valid python code> $ 42
```

should trigger the syntax error at the `$` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
`print` statements in order to create a better error message and you define it as:

```
invalid_print: "print" expression
```

This will **seem** to work because the parser will correctly parse `print(something)` because it is valid
code and the second phase will never execute but if you try to parse `print(something) $ 3` the first pass
of the parser will fail (because of the `$`) and in the second phase, the rule will match the
`print(something)` as `print` followed by the variable `something` between parentheses and the error
will be reported there instead of the `$` character.
Generating AST objects
----------------------

The output of the C parser used by CPython, which is generated from the
[grammar file](../Grammar/python.gram), is a Python AST object (using C
structures). This means that the actions in the grammar file generate AST
objects when they succeed. Constructing these objects can be quite cumbersome
(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used. These functions are
declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined
in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The
helpers include functions that join AST sequences, get specific elements
from them or perform extra processing on the generated tree.
@ -788,11 +779,9 @@ from them or to perform extra processing on the generated tree.
As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c)
and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that
it can be used from the grammar.

When parsing succeeds, the parser **must** return a **valid** AST object.
@ -801,16 +790,15 @@ Testing
There are three files that contain tests for the grammar and the parser:

- [test_grammar.py](../Lib/test/test_grammar.py)
- [test_syntax.py](../Lib/test/test_syntax.py)
- [test_exceptions.py](../Lib/test/test_exceptions.py)

Check the contents of these files to know which is the best place for new
tests, depending on the nature of the new feature you are adding.

Tests for the parser generator itself can be found in the
[test_peg_generator](../Lib/test_peg_generator) directory.

Debugging generated parsers
@ -825,33 +813,32 @@ correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.

For this reason it is a good idea to experiment first by generating a Python
parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory on the CPython repository and manually call the parser generator by executing:

```
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```

This will generate a file called `parse.py` in the same directory that you
can use to parse some input:

```
$ python parse.py file_with_source_code_to_test.py
```

As the generated `parse.py` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.
Verbose mode
------------

When Python is compiled in debug mode (by adding `--with-pydebug` when
running the configure step in Linux or by adding `-d` when calling the
[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a
**very** verbose mode in the generated parser. This is very useful to
debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.

> [!NOTE]
@ -859,13 +846,13 @@ can be a bit hard to understand at first.
> interactive mode as it can be much harder to understand, because interactive
> mode involves some special steps compared to regular parsing.

To activate verbose mode you can add the `-d` flag when executing Python:

```
$ python -d file_to_test.py
```

This will print **a lot** of output to `stderr` so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure:
@ -873,17 +860,17 @@ following structure::
<indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
```

Every line is indented by a different amount (`<indentation>`) depending on how
deep the call stack is. The next character marks the type of the trace:

- `>` indicates that a rule is going to be attempted to be parsed.
- `-` indicates that a rule has failed to be parsed.
- `+` indicates that a rule has been parsed correctly.
- `!` indicates that an exception or an error has been detected and the parser is unwinding.

The `<token_location>` part indicates the current index in the token array,
the `<rule_name>` part indicates what rule is being parsed and
the `<alternative>` part indicates what alternative within that rule
is being attempted.
@ -891,4 +878,5 @@ is being attempted.
> **Document history**
>
> Pablo Galindo Salgado - Original author
>
> Irit Katriel and Jacob Coffee - Convert to Markdown