Remove second parser module example; it referred to non-readily-available example files, and this kind of discovery is much better done with the AST nowadays anyway.
parent fc9794a8fc
commit 047e486c45
@@ -317,22 +317,8 @@ ST objects have the following methods:
Same as ``st2tuple(st, line_info, col_info)``.


.. _st-examples:

Example: Emulation of :func:`compile`
-------------------------------------

Examples
--------

.. index:: builtin: compile

The parser module allows operations to be performed on the parse tree of Python
source code before the :term:`bytecode` is generated, and provides for inspection of the
parse tree for information gathering purposes. Two examples are presented. The
simple example demonstrates emulation of the :func:`compile` built-in function
and the complex example shows the use of a parse tree for information discovery.


Emulation of :func:`compile`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While many useful operations may take place between parsing and bytecode
generation, the simplest operation is to do nothing. For this purpose, using
@@ -366,320 +352,3 @@ readily available functions::
   def load_expression(source_string):
       st = parser.expr(source_string)
       return st, st.compile()

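The commit message points to the AST as the better tool today, and the :mod:`parser` module itself was removed in Python 3.10. As a rough equivalent of ``load_expression`` above, the following sketch uses :mod:`ast` and the built-in :func:`compile` instead. The ``"<string>"`` file name and the sample expression are illustrative assumptions, not part of the original documentation.

```python
import ast


def load_expression(source_string):
    """Parse an expression and compile it without transforming the tree."""
    # mode="eval" yields an ast.Expression, the counterpart of parser.expr().
    tree = ast.parse(source_string, mode="eval")
    # compile() accepts an AST object as well as source text.
    return tree, compile(tree, "<string>", "eval")


tree, code = load_expression("a + 5")
# Evaluate the code object with an explicit namespace supplying ``a``.
result = eval(code, {"a": 2})
```

Returning both the tree and the code object mirrors the original function, so the tree stays available for inspection after compilation.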
Information Discovery
^^^^^^^^^^^^^^^^^^^^^

.. index::
   single: string; documentation
   single: docstrings

Some applications benefit from direct access to the parse tree. The remainder
of this section demonstrates how the parse tree provides access to module
documentation defined in docstrings without requiring that the code being
examined be loaded into a running interpreter via :keyword:`import`. This can
be very useful for performing analyses of untrusted code.

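For comparison, here is a minimal sketch of the same idea with the :mod:`ast` module, which the commit message recommends for this kind of discovery. The source text is only parsed, never imported or executed. The sample module text is an assumption made up for illustration.

```python
import ast

source = '''"""Some documentation.
"""

def square(x):
    "Square an argument."
    return x ** 2
'''

# ast.parse() builds a tree from the text; no code in ``source`` runs.
tree = ast.parse(source)

# get_docstring() returns the cleaned docstring, or None if there is none.
module_doc = ast.get_docstring(tree)
function_doc = ast.get_docstring(tree.body[1])
```

By default :func:`ast.get_docstring` strips indentation and surrounding blank lines, so the trailing newline in the module docstring is dropped.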
Generally, the example will demonstrate how the parse tree may be traversed to
distill interesting information. Two functions and a set of classes are
developed which provide programmatic access to high-level function and class
definitions provided by a module. The classes extract information from the
parse tree and provide access to the information at a useful semantic level; one
function provides a simple low-level pattern matching capability, and the other
function defines a high-level interface to the classes by handling file
operations on behalf of the caller. All source files mentioned here which are
not part of the Python installation are located in the :file:`Demo/parser/`
directory of the distribution.

The dynamic nature of Python allows the programmer a great deal of flexibility,
but most modules need only a limited measure of this when defining classes,
functions, and methods. In this example, the only definitions that will be
considered are those which are defined in the top level of their context, e.g.,
a function defined by a :keyword:`def` statement at column zero of a module, but
not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
construct, though there are some good reasons for doing so in some situations.
Nesting of definitions will be handled by the code developed in the example.

To construct the upper-level extraction methods, we need to know what the parse
tree structure looks like and how much of it we actually need to be concerned
about. Python uses a moderately deep parse tree so there are a large number of
intermediate nodes. It is important to read and understand the formal grammar
used by Python. This is specified in the file :file:`Grammar/Grammar` in the
distribution. Consider the simplest case of interest when searching for
docstrings: a module consisting of a docstring and nothing else. (See file
:file:`docstring.py`.) ::

   """Some documentation.
   """

Using the interpreter to take a look at the parse tree, we find a bewildering
mass of numbers and parentheses, with the documentation buried deep in nested
tuples. ::

   >>> import parser
   >>> import pprint
   >>> st = parser.suite(open('docstring.py').read())
   >>> tup = st.totuple()
   >>> pprint.pprint(tup)
   (257,
    (264,
     (265,
      (266,
       (267,
        (307,
         (287,
          (288,
           (289,
            (290,
             (292,
              (293,
               (294,
                (295,
                 (296,
                  (297,
                   (298,
                    (299,
                     (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
      (4, ''))),
    (4, ''),
    (0, ''))

The numbers at the first element of each node in the tree are the node types;
they map directly to terminal and non-terminal symbols in the grammar.
Unfortunately, they are represented as integers in the internal representation,
and the Python structures generated do not change that. However, the
:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
and dictionaries which map from the integers to the symbolic names for the node
types.

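As a small illustration of the integer-to-name mapping, the sketch below looks up the three terminal node types that appear in the output above. It uses only :mod:`token`, because :mod:`symbol` was removed in Python 3.10 together with :mod:`parser`. The variable name ``names`` is chosen here for illustration.

```python
import token

# tok_name maps numeric token types back to their symbolic names.
names = {number: token.tok_name[number] for number in (0, 3, 4)}

for number, name in sorted(names.items()):
    print(number, name)
```

The non-terminal numbers such as ``257`` were looked up the same way through ``symbol.sym_name`` on interpreters that still shipped that module.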
In the output presented above, the outermost tuple contains four elements: the
integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
name :const:`file_input`. Each of these inner tuples contains an integer as the
first element; these integers, ``264``, ``4``, and ``0``, represent the node
types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
Note that these values may change depending on the version of Python you are
using; consult :file:`symbol.py` and :file:`token.py` for details of the
mapping. It should be fairly clear that the outermost node is related primarily
to the input source rather than the contents of the file, and may be disregarded
for the moment. The :const:`stmt` node is much more interesting. In
particular, all docstrings are found in subtrees which are formed exactly as
this node is formed, with the only difference being the string itself. The
association between the docstring in a similar tree and the defined entity
(class, function, or module) which it describes is given by the position of the
docstring subtree within the tree defining the described structure.

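The abstract syntax tree for the same one-docstring module is much shallower, which is part of why the commit message prefers it. The sketch below assumes Python 3.8 or later, where string literals are represented as ``ast.Constant`` nodes. The name ``is_docstring_stmt`` is illustrative.

```python
import ast

# The same module as docstring.py: a docstring and nothing else.
tree = ast.parse('"""Some documentation.\n"""\n')

# The module body holds one statement: an expression statement
# whose value is a string constant.
first = tree.body[0]
is_docstring_stmt = (
    isinstance(first, ast.Expr)
    and isinstance(first.value, ast.Constant)
    and isinstance(first.value.value, str)
)
raw_docstring = first.value.value if is_docstring_stmt else None
```

As with the parse tree, the docstring is associated with its owner purely by position: it is the first statement in the body of the module, class, or function.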
By replacing the actual docstring with something to signify a variable component
of the tree, we allow a simple pattern matching approach to check any given
subtree for equivalence to the general pattern for docstrings. Since the
example demonstrates information extraction, we can safely require that the tree
be in tuple form rather than list form, allowing a simple variable
representation to be ``['variable_name']``. A simple recursive function can
implement the pattern matching, returning a Boolean and a dictionary of variable
name to value mappings. (See file :file:`example.py`.) ::

   def match(pattern, data, vars=None):
       if vars is None:
           vars = {}
       # A one-element list is a variable: bind its name to the data.
       if isinstance(pattern, list):
           vars[pattern[0]] = data
           return True, vars
       # Anything else that is not a tuple is a leaf: compare directly.
       if not isinstance(pattern, tuple):
           return (pattern == data), vars
       if len(data) != len(pattern):
           return False, vars
       # Match the children pairwise, stopping at the first mismatch.
       for pattern, data in zip(pattern, data):
           same, vars = match(pattern, data, vars)
           if not same:
               break
       return same, vars

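To see the variable binding in action without a real parse tree, the following sketch runs a variant of the matcher on a small hand-made tuple. The node numbers ``1`` to ``4`` and the variable name ``name`` are made up for illustration. The variant also differs from the function above in two ways that are assumptions of this sketch: it renames the loop variables to avoid shadowing the arguments, and it rejects non-tuple data where the pattern expects a tuple.

```python
def match(pattern, data, bindings=None):
    """Return (matched, bindings) for a tuple pattern with ['name'] variables."""
    if bindings is None:
        bindings = {}
    if isinstance(pattern, list):
        # A one-element list is a variable: bind it to whatever is here.
        bindings[pattern[0]] = data
        return True, bindings
    if not isinstance(pattern, tuple):
        # Leaf value (node type or token text): compare directly.
        return pattern == data, bindings
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, bindings
    for sub_pattern, sub_data in zip(pattern, data):
        same, bindings = match(sub_pattern, sub_data, bindings)
        if not same:
            return False, bindings
    return True, bindings


pattern = (1, (2, ['name']), (3, ''))
found, bound = match(pattern, (1, (2, 'spam'), (3, '')))
mismatch, _ = match(pattern, (1, (2, 'spam'), (4, '')))
```

Using a list for variables works only because the trees themselves are tuples, which is why the text requires tuple form.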
Using this simple representation for syntactic variables and the symbolic node
types, the pattern for the candidate docstring subtrees becomes fairly readable.
(See file :file:`example.py`.) ::

   import symbol
   import token

   DOCSTRING_STMT_PATTERN = (
       symbol.stmt,
       (symbol.simple_stmt,
        (symbol.small_stmt,
         (symbol.expr_stmt,
          (symbol.testlist,
           (symbol.test,
            (symbol.and_test,
             (symbol.not_test,
              (symbol.comparison,
               (symbol.expr,
                (symbol.xor_expr,
                 (symbol.and_expr,
                  (symbol.shift_expr,
                   (symbol.arith_expr,
                    (symbol.term,
                     (symbol.factor,
                      (symbol.power,
                       (symbol.atom,
                        (token.STRING, ['docstring'])
                        )))))))))))))))),
        (token.NEWLINE, '')
        ))

Using the :func:`match` function with this pattern, extracting the module
docstring from the parse tree created previously is easy::

   >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
   >>> found
   True
   >>> vars
   {'docstring': '"""Some documentation.\n"""'}

Once specific data can be extracted from a location where it is expected, the
question of where information can be expected needs to be answered. When
dealing with docstrings, the answer is fairly simple: the docstring is the first
:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
types). A module consists of a single :const:`file_input` node, and class and
function definitions each contain exactly one :const:`suite` node. Classes and
functions are readily identified as subtrees of code block nodes which start
with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
since it does not support multiple sibling nodes to match without regard to
number. A more elaborate matching function could be used to overcome this
limitation, but this is sufficient for the example.

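With the :mod:`ast` module, the "first statement in a code block" rule is already implemented by :func:`ast.get_docstring`, and class and function definitions are ordinary nodes in a block's ``body`` list. The sketch below walks only definitions that sit directly in a block, matching the restriction described earlier. The function name ``collect_docstrings``, the ``"<module>"`` key, and the sample source are illustrative assumptions.

```python
import ast

DEFINITIONS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def collect_docstrings(source):
    """Map dotted names of classes and functions to their docstrings."""
    found = {}

    def visit(node, prefix):
        # Only direct children of a block are considered, so definitions
        # inside if/else branches are deliberately skipped.
        for child in node.body:
            if isinstance(child, DEFINITIONS):
                name = prefix + child.name
                found[name] = ast.get_docstring(child)
                visit(child, name + ".")

    tree = ast.parse(source)
    found["<module>"] = ast.get_docstring(tree)
    visit(tree, "")
    return found


sample = '''"""Module doc."""
class Spam:
    "A class."
    def eggs(self):
        "A method."
def make_power(exp):
    "Make a function."
    def raiser(x, y=exp):
        return x ** y
    return raiser
if True:
    def hidden():
        "Not at the top level of its block."
'''
docs = collect_docstrings(sample)
```

Definitions without a docstring map to ``None``, so a caller can tell "defined but undocumented" apart from "not defined".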
Given the ability to determine whether a statement might be a docstring and
extract the actual string from the statement, some work needs to be performed to
walk the parse tree for an entire module and extract information about the names
defined in each context of the module and associate any docstrings with the
names. The code to perform this work is not complicated, but bears some
explanation.

The public interface to the classes is straightforward and should probably be
somewhat more flexible. Each "major" block of the module is described by an
object providing several methods for inquiry and a constructor which accepts at
least the subtree of the complete parse tree which it represents. The
:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
cannot otherwise determine the name of the module.

The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
:class:`ClassInfo` objects support :meth:`get_method_names` and
:meth:`get_method_info` while the other classes provide
:meth:`get_function_names` and :meth:`get_function_info`.

Within each of the forms of code block that the public classes represent, most
of the required information is in the same form and is accessed in the same way,
with classes having the distinction that functions defined at the top level are
referred to as "methods." Since the difference in nomenclature reflects a real
semantic distinction from functions defined outside of a class, the
implementation needs to maintain the distinction. Hence, most of the
functionality of the public classes can be implemented in a common base class,
:class:`SuiteInfoBase`, with the accessors for function and method information
provided elsewhere. Note that there is only one class which represents function
and method information; this parallels the use of the :keyword:`def` statement
to define both types of elements.

Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
need to be overridden by subclasses. More importantly, the extraction of most
information from a parse tree is handled through a method called by the
:class:`SuiteInfoBase` constructor. The example code for most of the classes is
clear when read alongside the formal grammar, but the method which recursively
creates new information objects requires further examination. Here is the
relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::

   class SuiteInfoBase:
       _docstring = ''
       _name = ''

       def __init__(self, tree=None):
           self._class_info = {}
           self._function_info = {}
           if tree:
               self._extract_info(tree)

       def _extract_info(self, tree):
           # extract docstring
           if len(tree) == 2:
               found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
           else:
               found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
           if found:
               self._docstring = eval(vars['docstring'])
           # discover inner definitions
           for node in tree[1:]:
               found, vars = match(COMPOUND_STMT_PATTERN, node)
               if found:
                   cstmt = vars['compound']
                   if cstmt[0] == symbol.funcdef:
                       name = cstmt[2][1]
                       self._function_info[name] = FunctionInfo(cstmt)
                   elif cstmt[0] == symbol.classdef:
                       name = cstmt[2][1]
                       self._class_info[name] = ClassInfo(cstmt)

After initializing some internal state, the constructor calls the
:meth:`_extract_info` method. This method performs the bulk of the information
extraction which takes place in the entire example. The extraction has two
distinct phases: the location of the docstring for the parse tree passed in, and
the discovery of additional definitions within the code block represented by the
parse tree.

The initial :keyword:`if` test determines whether the nested suite is of the
"short form" or the "long form." The short form is used when the code block is
on the same line as the definition of the code block, as in ::

   def square(x): "Square an argument."; return x ** 2

while the long form uses an indented block and allows nested definitions::

   def make_power(exp):
       "Make a function that raises an argument to the exponent `exp`."
       def raiser(x, y=exp):
           return x ** y
       return raiser

When the short form is used, the code block may contain a docstring as the
first, and possibly only, :const:`small_stmt` element. The extraction of such a
docstring is slightly different and requires only a portion of the complete
pattern used in the more common case. As implemented, the docstring will only
be found if there is only one :const:`small_stmt` node in the
:const:`simple_stmt` node. Since most functions and methods which use the short
form do not provide a docstring, this may be considered sufficient. The
extraction of the docstring proceeds using the :func:`match` function as
described above, and the value of the docstring is stored as an attribute of the
:class:`SuiteInfoBase` object.

After docstring extraction, a simple definition discovery algorithm operates on
the :const:`stmt` nodes of the :const:`suite` node. The special case of the
short form is not tested; since there are no :const:`stmt` nodes in the short
form, the algorithm will silently skip the single :const:`simple_stmt` node and
correctly not discover any nested definitions.

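The short-form special case disappears at the AST level, because both forms produce a function node with an ordinary ``body`` list. The sketch below feeds the two definitions shown earlier through :mod:`ast`. The variable names ``short_form``, ``long_form``, ``docs``, and ``nested`` are illustrative.

```python
import ast

short_form = 'def square(x): "Square an argument."; return x ** 2\n'
long_form = (
    'def make_power(exp):\n'
    '    "Make a function that raises an argument to the exponent `exp`."\n'
    '    def raiser(x, y=exp):\n'
    '        return x ** y\n'
    '    return raiser\n'
)

docs = {}
nested = {}
for src in (short_form, long_form):
    func = ast.parse(src).body[0]
    # The docstring is found the same way for both forms.
    docs[func.name] = ast.get_docstring(func)
    # Nested definitions are the FunctionDef nodes directly in the body.
    nested[func.name] = [
        node.name for node in func.body if isinstance(node, ast.FunctionDef)
    ]
```

Unlike the parse-tree implementation described above, this finds the short-form docstring even though a second statement follows it on the same line.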
Each statement in the code block is categorized as a class definition, function
or method definition, or something else. For the definition statements, the
name of the element defined is extracted and a representation object appropriate
to the definition is created with the defining subtree passed as an argument to
the constructor. The representation objects are stored in instance variables
and may be retrieved by name using the appropriate accessor methods.

The public classes provide any accessors required which are more specific than
those provided by the :class:`SuiteInfoBase` class, but the real extraction
algorithm remains common to all forms of code blocks. A high-level function can
be used to extract the complete set of information from a source file. (See
file :file:`example.py`.) ::

   def get_docs(fileName):
       import os
       import parser

       # Close the file promptly instead of leaking the handle.
       with open(fileName) as f:
           source = f.read()
       basename = os.path.basename(os.path.splitext(fileName)[0])
       st = parser.suite(source)
       return ModuleInfo(st.totuple(), basename)

This provides an easy-to-use interface to the documentation of a module. If
information is required which is not extracted by the code of this example, the
code may be extended at clearly defined points to provide additional
capabilities.
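For readers on a Python version without :mod:`parser`, a file-level entry point in the same spirit can be sketched with :mod:`ast`. This is not the ``get_docs`` from :file:`example.py`: it returns a plain tuple rather than a :class:`ModuleInfo` object, and the tuple layout and the UTF-8 encoding are assumptions of this sketch.

```python
import ast
import os

DEFINITIONS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def get_docs(file_name):
    """Return (module name, module docstring, {top-level name: docstring})."""
    with open(file_name, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=file_name)
    # Derive the module name from the path, as the original example does.
    basename = os.path.basename(os.path.splitext(file_name)[0])
    definitions = {
        node.name: ast.get_docstring(node)
        for node in tree.body
        if isinstance(node, DEFINITIONS)
    }
    return basename, ast.get_docstring(tree), definitions
```

Calling ``get_docs("some/module.py")`` on a hypothetical path would return the name ``"module"`` together with whatever docstrings that file defines at the top level.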