From 3909ff69e2506f86b05ee07ba53d433a28f78afd Mon Sep 17 00:00:00 2001 From: Brett Cannon Date: Thu, 2 Mar 2006 22:08:16 +0000 Subject: [PATCH] Text moved to PEP 339. --- Python/compile.txt | 507 --------------------------------------------- 1 file changed, 507 deletions(-) delete mode 100644 Python/compile.txt diff --git a/Python/compile.txt b/Python/compile.txt deleted file mode 100644 index 12f2128b551..00000000000 --- a/Python/compile.txt +++ /dev/null @@ -1,507 +0,0 @@ -Developer Notes for Python Compiler -=================================== - -Table of Contents ------------------ - -- Scope - Defines the limits of the change -- Parse Trees - Describes the local (Python) concept -- Abstract Syntax Trees (AST) - Describes the AST technology used -- Parse Tree to AST - Defines the transform approach -- Control Flow Graphs - Defines the creation of "basic blocks" -- AST to CFG to Bytecode - Tracks the flow from AST to bytecode -- Code Objects - Pointer to making bytecode "executable" -- Modified Files - Files added/modified/removed from CPython compiler -- ToDo - Work yet remaining (before complete) -- References - Academic and technical references to technology used. - - -Scope ------ - -Historically (through 2.4), compilation from source code to bytecode -involved two steps: - -1. Parse the source code into a parse tree (Parser/pgen.c) -2. Emit bytecode based on the parse tree (Python/compile.c) - -Historically, this is not how a standard compiler works. The usual -steps for compilation are: - -1. Parse source code into a parse tree (Parser/pgen.c) -2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c) -3. Transform AST into a Control Flow Graph (Python/newcompile.c) -4. Emit bytecode based on the Control Flow Graph (Python/newcompile.c) - -Starting with Python 2.5, the above steps are now used. This change -was done to simplify compilation by breaking it into three steps. -The purpose of this document is to outline how the lattter three steps -of the process works. - -This document does not touch on how parsing works beyond what is needed -to explain what is needed for compilation. It is also not exhaustive -in terms of the how the entire system works. You will most likely need -to read some source to have an exact understanding of all details. - - -Parse Trees ------------ - -Python's parser is an LL(1) parser mostly based off of the -implementation laid out in the Dragon Book [Aho86]_. - -The grammar file for Python can be found in Grammar/Grammar with the -numeric value of grammar rules are stored in Include/graminit.h. The -numeric values for types of tokens (literal tokens, such as ``:``, -numbers, etc.) are kept in Include/token.h). The parse tree made up of -``node *`` structs (as defined in Include/node.h). - -Querying data from the node structs can be done with the following -macros (which are all defined in Include/token.h): - -- ``CHILD(node *, int)`` - Returns the nth child of the node using zero-offset indexing -- ``RCHILD(node *, int)`` - Returns the nth child of the node from the right side; use - negative numbers! -- ``NCH(node *)`` - Number of children the node has -- ``STR(node *)`` - String representation of the node; e.g., will return ``:`` for a - COLON token -- ``TYPE(node *)`` - The type of node as specified in ``Include/graminit.h`` -- ``REQ(node *, TYPE)`` - Assert that the node is the type that is expected -- ``LINENO(node *)`` - retrieve the line number of the source code that led to the - creation of the parse rule; defined in Python/ast.c - -To tie all of this example, consider the rule for 'while':: - - while_stmt: 'while' test ':' suite ['else' ':' suite] - -The node representing this will have ``TYPE(node) == while_stmt`` and -the number of children can be 4 or 7 depending on if there is an 'else' -statement. To access what should be the first ':' and require it be an -actual ':' token, `(REQ(CHILD(node, 2), COLON)``. - - -Abstract Syntax Trees (AST) ---------------------------- - -The abstract syntax tree (AST) is a high-level representation of the -program structure without the necessity of containing the source code; -it can be thought of a abstract representation of the source code. The -specification of the AST nodes is specified using the Zephyr Abstract -Syntax Definition Language (ASDL) [Wang97]_. - -The definition of the AST nodes for Python is found in the file -Parser/Python.asdl . - -Each AST node (representing statements, expressions, and several -specialized types, like list comprehensions and exception handlers) is -defined by the ASDL. Most definitions in the AST correspond to a -particular source construct, such as an 'if' statement or an attribute -lookup. The definition is independent of its realization in any -particular programming language. - -The following fragment of the Python ASDL construct demonstrates the -approach and syntax:: - - module Python - { - stmt = FunctionDef(identifier name, arguments args, stmt* body, - expr* decorators) - | Return(expr? value) | Yield(expr value) - attributes (int lineno) - } - -The preceding example describes three different kinds of statements; -function definitions, return statements, and yield statements. All -three kinds are considered of type stmt as shown by '|' separating the -various kinds. They all take arguments of various kinds and amounts. - -Modifiers on the argument type specify the number of values needed; '?' -means it is optional, '*' means 0 or more, no modifier means only one -value for the argument and it is required. FunctionDef, for instance, -takes an identifier for the name, 'arguments' for args, zero or more -stmt arguments for 'body', and zero or more expr arguments for -'decorators'. - -Do notice that something like 'arguments', which is a node type, is -represented as a single AST node and not as a sequence of nodes as with -stmt as one might expect. - -All three kinds also have an 'attributes' argument; this is shown by the -fact that 'attributes' lacks a '|' before it. - -The statement definitions above generate the following C structure type:: - - typedef struct _stmt *stmt_ty; - - struct _stmt { - enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind; - union { - struct { - identifier name; - arguments_ty args; - asdl_seq *body; - } FunctionDef; - - struct { - expr_ty value; - } Return; - - struct { - expr_ty value; - } Yield; - } v; - int lineno; - } - -Also generated are a series of constructor functions that allocate (in -this case) a stmt_ty struct with the appropriate initialization. The -'kind' field specifies which component of the union is initialized. The -FunctionDef() constructor function sets 'kind' to FunctionDef_kind and -initializes the 'name', 'args', 'body', and 'attributes' fields. - -*** NOTE: if you make a change here that can affect the output of bytecode that -is already in existence, make sure to delete your old .py(c|o) files! Running -``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. - - -Parse Tree to AST ------------------ - -The AST is generated from the parse tree in (see Python/ast.c) using the -function:: - - mod_ty PyAST_FromNode(const node *n); - -The function begins a tree walk of the parse tree, creating various AST -nodes as it goes along. It does this by allocating all new nodes it -needs, calling the proper AST node creation functions for any required -supporting functions, and connecting them as needed. - -Do realize that there is no automated nor symbolic connection between -the grammar specification and the nodes in the parse tree. No help is -directly provided by the parse tree as in yacc. - -For instance, one must keep track of -which node in the parse tree one is working with (e.g., if you are -working with an 'if' statement you need to watch out for the ':' token -to find the end of the conditional). No help is directly provided by -the parse tree as in yacc. - -The functions called to generate AST nodes from the parse tree all have -the name ast_for_xx where xx is what the grammar rule that the function -handles (alias_for_import_name is the exception to this). These in turn -call the constructor functions as defined by the ASDL grammar and -contained in Python/Python-ast.c (which was generated by -Parser/asdl_c.py) to create the nodes of the AST. This all leads to a -sequence of AST nodes stored in asdl_seq structs. - - -Function and macros for creating and using ``asdl_seq *`` types as found -in Python/asdl.c and Include/asdl.h: - -- ``asdl_seq_new(int)`` - Allocate memory for an asdl_seq for length 'size' -- ``asdl_seq_free(asdl_seq *)`` - Free asdl_seq struct -- ``asdl_seq_GET(asdl_seq *seq, int pos)`` - Get item held at 'pos' -- ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)`` - Set 'pos' in 'seq' to 'val' -- ``asdl_seq_APPEND(asdl_seq *seq, void *val)`` - Set the end of 'seq' to 'val' -- ``asdl_seq_LEN(asdl_seq *)`` - Return the length of 'seq' - -If you are working with statements, you must also worry about keeping -track of what line number generated the statement. Currently the line -number is passed as the last parameter to each stmt_ty function. - - -Control Flow Graphs -------------------- - -A control flow graph (often referenced by its acronym, CFG) is a -directed graph that models the flow of a program using basic blocks that -contain the intermediate representation (abbreviated "IR", and in this -case is Python bytecode) within the blocks. Basic blocks themselves are -a block of IR that has a single entry point but possibly multiple exit -points. The single entry point is the key to basic blocks; it all has -to do with jumps. An entry point is the target of something that -changes control flow (such as a function call or a jump) while exit -points are instructions that would change the flow of the program (such -as jumps and 'return' statements). What this means is that a basic -block is a chunk of code that starts at the entry point and runs to an -exit point or the end of the block. - -As an example, consider an 'if' statement with an 'else' block. The -guard on the 'if' is a basic block which is pointed to by the basic -block containing the code leading to the 'if' statement. The 'if' -statement block contains jumps (which are exit points) to the true body -of the 'if' and the 'else' body (which may be NULL), each of which are -their own basic blocks. Both of those blocks in turn point to the -basic block representing the code following the entire 'if' statement. - -CFGs are usually one step away from final code output. Code is directly -generated from the basic blocks (with jump targets adjusted based on the -output order) by doing a post-order depth-first search on the CFG -following the edges. - - -AST to CFG to Bytecode ----------------------- - -With the AST created, the next step is to create the CFG. The first step -is to convert the AST to Python bytecode without having jump targets -resolved to specific offsets (this is calculated when the CFG goes to -final bytecode). Essentially, this transforms the AST into Python -bytecode with control flow represented by the edges of the CFG. - -Conversion is done in two passes. The first creates the namespace -(variables can be classified as local, free/cell for closures, or -global). With that done, the second pass essentially flattens the CFG -into a list and calculates jump offsets for final output of bytecode. - -The conversion process is initiated by a call to the function in -Python/newcompile.c:: - - PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags); - -This function does both the conversion of the AST to a CFG and -outputting final bytecode from the CFG. The AST to CFG step is handled -mostly by the two functions called by PyAST_Compile():: - - struct symtable * PySymtable_Build(mod_ty, const char *, - PyFutureFeatures); - PyCodeObject * compiler_mod(struct compiler *, mod_ty); - -The former is in Python/symtable.c while the latter is in -Python/newcompile.c . - -PySymtable_Build() begins by entering the starting code block for the -AST (passed-in) and then calling the proper symtable_visit_xx function -(with xx being the AST node type). Next, the AST tree is walked with -the various code blocks that delineate the reach of a local variable -as blocks are entered and exited:: - - static int symtable_enter_block(struct symtable *, identifier, - block_ty, void *, int); - static int symtable_exit_block(struct symtable *, void *); - -Once the symbol table is created, it is time for CFG creation, whose -code is in Python/newcompile.c . This is handled by several functions -that break the task down by various AST node types. The functions are -all named compiler_visit_xx where xx is the name of the node type (such -as stmt, expr, etc.). Each function receives a ``struct compiler *`` -and xx_ty where xx is the AST node type. Typically these functions -consist of a large 'switch' statement, branching based on the kind of -node type passed to it. Simple things are handled inline in the -'switch' statement with more complex transformations farmed out to other -functions named compiler_xx with xx being a descriptive name of what is -being handled. - -When transforming an arbitrary AST node, use the VISIT macro:: - - VISIT(struct compiler *, , ); - -The appropriate compiler_visit_xx function is called, based on the value -passed in for (so ``VISIT(c, expr, node)`` calls -``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar, - but is called on AST node sequences (those values that were created as -arguments to a node that used the '*' modifier). There is also -VISIT_SLICE just for handling slices:: - - VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty); - -Emission of bytecode is handled by the following macros: - -- ``ADDOP(struct compiler *c, int op)`` - add 'op' as an opcode -- ``ADDOP_I(struct compiler *c, int op, int oparg)`` - add 'op' with an 'oparg' argument -- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)`` - add 'op' with the proper argument based on the position of obj in - 'type', but with no handling of mangled names; used for when you - need to do named lookups of objects such as globals, consts, or - parameters where name mangling is not possible and the scope of the - name is known -- ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)`` - just like ADDOP_O, but name mangling is also handled; used for - attribute loading or importing based on name -- ``ADDOP_JABS(struct compiling *c, int op, basicblock b)`` - create an absolute jump to the basic block 'b' -- ``ADDOP_JREL(struct compiling *c, int op, basicblock b)`` - create a relative jump to the basic block 'b' - -Several helper functions that will emit bytecode and are named -compiler_xx() where xx is what the function helps with (list, boolop - etc.). A rather useful one is:: - - static int compiler_nameop(struct compiler *, identifier, - expr_context_ty); - -This function looks up the scope of a variable and, based on the -expression context, emits the proper opcode to load, store, or delete -the variable. - -As for handling the line number on which a statement is defined, is -handled by compiler_visit_stmt() and thus is not a worry. - -In addition to emitting bytecode based on the AST node, handling the -creation of basic blocks must be done. Below are the macros and -functions used for managing basic blocks: - -- ``NEW_BLOCK(struct compiler *)`` - create block and set it as current -- ``NEXT_BLOCK(struct compiler *)`` - basically NEW_BLOCK() plus jump from current block -- ``compiler_new_block(struct compiler *)`` - create a block but don't use it (used for generating jumps) - -Once the CFG is created, it must be flattened and then final emission of -bytecode occurs. Flattening is handled using a post-order depth-first -search. Once flattened, jump offsets are backpatched based on the -flattening and then a PyCodeObject file is created. All of this is -handled by calling:: - - PyCodeObject * assemble(struct compiler *, int); - -*** NOTE: if you make a change here that can affect the output of bytecode that -is already in existence, make sure to delete your old .py(c|o) files! Running -``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. - - -Code Objects ------------- - -In the end, one ends up with a PyCodeObject which is defined in -Include/code.h . And with that you now have executable Python bytecode! - - -Modified Files --------------- - -+ Parser/ - - - Python.asdl - ASDL syntax file - - - asdl.py - "An implementation of the Zephyr Abstract Syntax Definition - Language." Uses SPARK_ to parse the ASDL files. - - - asdl_c.py - "Generate C code from an ASDL description." Generates - ../Python/Python-ast.c and ../Include/Python-ast.h . - - - spark.py - SPARK_ parser generator - -+ Python/ - - - Python-ast.c - Creates C structs corresponding to the ASDL types. Also - contains code for marshaling AST nodes (core ASDL types have - marshaling code in asdl.c). "File automatically generated by - ../Parser/asdl_c.py". - - - asdl.c - Contains code to handle the ASDL sequence type. Also has code - to handle marshalling the core ASDL types, such as number and - identifier. used by Python-ast.c for marshaling AST nodes. - - - ast.c - Converts Python's parse tree into the abstract syntax tree. - - - compile.txt - This file. - - - newcompile.c - New version of compile.c that handles the emitting of bytecode. - - - symtable.c - Generates symbol table from AST. - - -+ Include/ - - - Python-ast.h - Contains the actual definitions of the C structs as generated by - ../Python/Python-ast.c . - "Automatically generated by ../Parser/asdl_c.py". - - - asdl.h - Header for the corresponding ../Python/ast.c . - - - ast.h - Declares PyAST_FromNode() external (from ../Python/ast.c). - - - code.h - Header file for ../Objects/codeobject.c; contains definition of - PyCodeObject. - - - symtable.h - Header for ../Python/symtable.c . struct symtable and - PySTEntryObject are defined here. - -+ Objects/ - - - codeobject.c - Contains PyCodeObject-related code (originally in - ../Python/compile.c). - - -ToDo ----- -*** NOTE: all bugs and patches should be filed on SF under the group - "AST" for easy searching. It also does not hurt to put - "[AST]" at the beginning of the subject line of the tracker - item. - -+ Stdlib support - - AST->Python access? - - rewrite compiler package to mirror AST structure? -+ Documentation - - flesh out this doc - * byte stream output - * explanation of how the symbol table pass works - * code object (PyCodeObject) -+ Universal - - make sure entire test suite passes - - fix memory leaks - - make sure return types are properly checked for errors - - no gcc warnings - -References ----------- - -.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. - `Compilers: Principles, Techniques, and Tools`, - http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108 - -.. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris - S. Serra. `The Zephyr Abstract Syntax Description Language.`_ - In Proceedings of the Conference on Domain-Specific Languages, pp. - 213--227, 1997. - -.. _The Zephyr Abstract Syntax Description Language.: - http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html - -.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/ -