Remove second parser module example; it referred to non-readily-available example files, and this kind of discovery is much better done with the AST nowadays anyway.
parent
fc9794a8fc
commit
047e486c45
|
@ -317,22 +317,8 @@ ST objects have the following methods:
|
|||
Same as ``st2tuple(st, line_info, col_info)``.
|
||||
|
||||
|
||||
.. _st-examples:

Examples
--------

.. index:: builtin: compile

The parser module allows operations to be performed on the parse tree of Python
source code before the :term:`bytecode` is generated, and provides for inspection of the
parse tree for information gathering purposes. Two examples are presented. The
simple example demonstrates emulation of the :func:`compile` built-in function
and the complex example shows the use of a parse tree for information discovery.
|
||||
|
||||
|
||||
Emulation of :func:`compile`
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
Example: Emulation of :func:`compile`
|
||||
-------------------------------------
|
||||
|
||||
While many useful operations may take place between parsing and bytecode
|
||||
generation, the simplest operation is to do nothing. For this purpose, using
|
||||
|
@ -366,320 +352,3 @@ readily available functions::
|
|||
   def load_expression(source_string):
       st = parser.expr(source_string)
       return st, st.compile()
|
||||
|
||||
|
||||
Information Discovery
^^^^^^^^^^^^^^^^^^^^^

.. index::
   single: string; documentation
   single: docstrings

Some applications benefit from direct access to the parse tree. The remainder
of this section demonstrates how the parse tree provides access to module
documentation defined in docstrings without requiring that the code being
examined be loaded into a running interpreter via :keyword:`import`. This can
be very useful for performing analyses of untrusted code.
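As the commit message notes, this kind of discovery is nowadays usually done
with the :mod:`ast` module rather than :mod:`parser`. The following is a minimal
sketch of that approach, not the code from the removed example. ``SOURCE`` is a
hypothetical stand-in for the contents of a module file; it is parsed but never
imported or executed.

```python
import ast

# Hypothetical module source; in practice this would be read from a file.
SOURCE = '''"""Some documentation.
"""

def square(x):
    "Square an argument."
    return x ** 2
'''

# ast.parse() only builds a syntax tree; the code in SOURCE is never run.
tree = ast.parse(SOURCE)

# Docstring of the module itself.
module_doc = ast.get_docstring(tree)

# Docstrings of the functions defined at the top level of the module.
function_docs = {
    node.name: ast.get_docstring(node)
    for node in tree.body
    if isinstance(node, ast.FunctionDef)
}

print(module_doc)
print(function_docs)
```

Note that :func:`ast.get_docstring` cleans up indentation and surrounding blank
lines by default, so the value may differ slightly from the raw string token.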
|
||||
|
||||
Generally, the example demonstrates how the parse tree may be traversed to
distill interesting information. Two functions and a set of classes are
developed which provide programmatic access to high-level function and class
definitions provided by a module. The classes extract information from the
parse tree and provide access to it at a useful semantic level; one function
provides a simple low-level pattern matching capability, and the other
function defines a high-level interface to the classes by handling file
operations on behalf of the caller. All source files mentioned here which are
not part of the Python installation are located in the :file:`Demo/parser/`
directory of the distribution.
|
||||
|
||||
The dynamic nature of Python allows the programmer a great deal of flexibility,
but most modules need only a limited measure of this when defining classes,
functions, and methods. In this example, the only definitions that will be
considered are those which are defined in the top level of their context, e.g.,
a function defined by a :keyword:`def` statement at column zero of a module, but
not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
construct, though there are some good reasons for doing so in some situations.
Nesting of definitions will be handled by the code developed in the example.

To construct the upper-level extraction methods, we need to know what the parse
tree structure looks like and how much of it we actually need to be concerned
about. Python uses a moderately deep parse tree so there are a large number of
intermediate nodes. It is important to read and understand the formal grammar
used by Python. This is specified in the file :file:`Grammar/Grammar` in the
distribution. Consider the simplest case of interest when searching for
docstrings: a module consisting of a docstring and nothing else. (See file
:file:`docstring.py`.) ::

   """Some documentation.
   """

Using the interpreter to take a look at the parse tree, we find a bewildering
mass of numbers and parentheses, with the documentation buried deep in nested
tuples. ::
|
||||
|
||||
   >>> import parser
   >>> import pprint
   >>> st = parser.suite(open('docstring.py').read())
   >>> tup = st.totuple()
   >>> pprint.pprint(tup)
   (257,
    (264,
     (265,
      (266,
       (267,
        (307,
         (287,
          (288,
           (289,
            (290,
             (292,
              (293,
               (294,
                (295,
                 (296,
                  (297,
                   (298,
                    (299,
                     (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
      (4, ''))),
    (4, ''),
    (0, ''))
|
||||
|
||||
The numbers at the first element of each node in the tree are the node types;
they map directly to terminal and non-terminal symbols in the grammar.
Unfortunately, they are represented as integers in the internal representation,
and the Python structures generated do not change that. However, the
:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
and dictionaries which map from the integers to the symbolic names for the node
types.
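The mapping for terminal symbols can be inspected directly. The short sketch
below uses only the :mod:`token` module, because the :mod:`symbol` module is not
available in recent Python versions; the three integers are the token types
that appear in the tuple output above.

```python
import token

# token.tok_name maps integer token types to their symbolic names.
names = {node_type: token.tok_name[node_type] for node_type in (0, 3, 4)}

for node_type, name in names.items():
    print(node_type, name)

# The reverse direction is available as module-level constants.
print(token.STRING, token.NEWLINE, token.ENDMARKER)
```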
|
||||
|
||||
In the output presented above, the outermost tuple contains four elements: the
integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
name :const:`file_input`. Each of these inner tuples contains an integer as the
first element; these integers, ``264``, ``4``, and ``0``, represent the node
types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
Note that these values may change depending on the version of Python you are
using; consult :file:`symbol.py` and :file:`token.py` for details of the
mapping. It should be fairly clear that the outermost node is related primarily
to the input source rather than the contents of the file, and may be disregarded
for the moment. The :const:`stmt` node is much more interesting. In
particular, all docstrings are found in subtrees which are formed exactly as
this node is formed, with the only difference being the string itself. The
association between the docstring in a similar tree and the defined entity
(class, function, or module) which it describes is given by the position of the
docstring subtree within the tree defining the described structure.

By replacing the actual docstring with something to signify a variable component
of the tree, we allow a simple pattern matching approach to check any given
subtree for equivalence to the general pattern for docstrings. Since the
example demonstrates information extraction, we can safely require that the tree
be in tuple form rather than list form, allowing a simple variable
representation to be ``['variable_name']``. A simple recursive function can
implement the pattern matching, returning a Boolean and a dictionary of variable
name to value mappings. (See file :file:`example.py`.) ::
|
||||
|
||||
   def match(pattern, data, vars=None):
       if vars is None:
           vars = {}
       # A one-element list is a variable: bind its name to the data.
       if isinstance(pattern, list):
           vars[pattern[0]] = data
           return True, vars
       # Anything that is not a tuple is a leaf and must compare equal.
       if not isinstance(pattern, tuple):
           return (pattern == data), vars
       if len(data) != len(pattern):
           return False, vars
       # Match the children pairwise, stopping at the first mismatch.
       for pattern, data in zip(pattern, data):
           same, vars = match(pattern, data, vars)
           if not same:
               break
       return same, vars
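A quick way to see the matcher at work is to run it on small hand-built tuples.
In the sketch below the function is repeated so that the snippet runs on its
own, and the integers ``1`` and ``2`` are arbitrary placeholders rather than
real grammar symbols.

```python
def match(pattern, data, vars=None):
    if vars is None:
        vars = {}
    # A one-element list is a variable: bind its name to the data.
    if isinstance(pattern, list):
        vars[pattern[0]] = data
        return True, vars
    # Anything that is not a tuple is a leaf and must compare equal.
    if not isinstance(pattern, tuple):
        return (pattern == data), vars
    if len(data) != len(pattern):
        return False, vars
    for pattern, data in zip(pattern, data):
        same, vars = match(pattern, data, vars)
        if not same:
            break
    return same, vars

# Placeholder node types (1 and 2) stand in for real grammar symbols.
PATTERN = (1, (2, ['text']), (4, ''))

# Same shape as the pattern: the variable 'text' captures the leaf value.
found, captured = match(PATTERN, (1, (2, 'hello'), (4, '')))
print(found, captured)

# Last child has a different node type, so the match fails.
missed, _ = match(PATTERN, (1, (2, 'hello'), (0, '')))
print(missed)
```

A failed match may still leave earlier bindings in the returned dictionary, so
callers should check the Boolean before using the captured values.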
|
||||
|
||||
Using this simple representation for syntactic variables and the symbolic node
types, the pattern for the candidate docstring subtrees becomes fairly readable.
(See file :file:`example.py`.) ::

   import symbol
   import token

   DOCSTRING_STMT_PATTERN = (
       symbol.stmt,
       (symbol.simple_stmt,
        (symbol.small_stmt,
         (symbol.expr_stmt,
          (symbol.testlist,
           (symbol.test,
            (symbol.and_test,
             (symbol.not_test,
              (symbol.comparison,
               (symbol.expr,
                (symbol.xor_expr,
                 (symbol.and_expr,
                  (symbol.shift_expr,
                   (symbol.arith_expr,
                    (symbol.term,
                     (symbol.factor,
                      (symbol.power,
                       (symbol.atom,
                        (token.STRING, ['docstring'])
                        )))))))))))))))),
        (token.NEWLINE, '')
        ))

Using the :func:`match` function with this pattern, extracting the module
docstring from the parse tree created previously is easy::

   >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
   >>> found
   True
   >>> vars
   {'docstring': '"""Some documentation.\n"""'}
|
||||
|
||||
Once specific data can be extracted from a location where it is expected, the
question of where information can be expected needs to be answered. When
dealing with docstrings, the answer is fairly simple: the docstring is the first
:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
types). A module consists of a single :const:`file_input` node, and class and
function definitions each contain exactly one :const:`suite` node. Classes and
functions are readily identified as subtrees of code block nodes which start
with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
since it does not support multiple sibling nodes to match without regard to
number. A more elaborate matching function could be used to overcome this
limitation, but this is sufficient for the example.

Given the ability to determine whether a statement might be a docstring and
extract the actual string from the statement, some work needs to be performed to
walk the parse tree for an entire module and extract information about the names
defined in each context of the module and associate any docstrings with the
names. The code to perform this work is not complicated, but bears some
explanation.
|
||||
|
||||
The public interface to the classes is straightforward and should probably be
somewhat more flexible. Each "major" block of the module is described by an
object providing several methods for inquiry and a constructor which accepts at
least the subtree of the complete parse tree which it represents. The
:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
cannot otherwise determine the name of the module.

The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
:class:`ClassInfo` objects support :meth:`get_method_names` and
:meth:`get_method_info` while the other classes provide
:meth:`get_function_names` and :meth:`get_function_info`.

Within each of the forms of code block that the public classes represent, most
of the required information is in the same form and is accessed in the same way,
with classes having the distinction that functions defined at the top level are
referred to as "methods." Since the difference in nomenclature reflects a real
semantic distinction from functions defined outside of a class, the
implementation needs to maintain the distinction. Hence, most of the
functionality of the public classes can be implemented in a common base class,
:class:`SuiteInfoBase`, with the accessors for function and method information
provided elsewhere. Note that there is only one class which represents function
and method information; this parallels the use of the :keyword:`def` statement
to define both types of elements.

Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
need to be overridden by subclasses. More importantly, the extraction of most
information from a parse tree is handled through a method called by the
:class:`SuiteInfoBase` constructor. The example code for most of the classes is
clear when read alongside the formal grammar, but the method which recursively
creates new information objects requires further examination. Here is the
relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::
|
||||
|
||||
   class SuiteInfoBase:
       _docstring = ''
       _name = ''

       def __init__(self, tree=None):
           self._class_info = {}
           self._function_info = {}
           if tree:
               self._extract_info(tree)

       def _extract_info(self, tree):
           # extract docstring
           if len(tree) == 2:
               found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
           else:
               found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
           if found:
               self._docstring = eval(vars['docstring'])
           # discover inner definitions
           for node in tree[1:]:
               found, vars = match(COMPOUND_STMT_PATTERN, node)
               if found:
                   cstmt = vars['compound']
                   if cstmt[0] == symbol.funcdef:
                       name = cstmt[2][1]
                       self._function_info[name] = FunctionInfo(cstmt)
                   elif cstmt[0] == symbol.classdef:
                       name = cstmt[2][1]
                       self._class_info[name] = ClassInfo(cstmt)
|
||||
|
||||
After initializing some internal state, the constructor calls the
:meth:`_extract_info` method. This method performs the bulk of the information
extraction which takes place in the entire example. The extraction has two
distinct phases: the location of the docstring for the parse tree passed in, and
the discovery of additional definitions within the code block represented by the
parse tree.

The initial :keyword:`if` test determines whether the nested suite is of the
"short form" or the "long form." The short form is used when the code block is
on the same line as the definition of the code block, as in ::

   def square(x): "Square an argument."; return x ** 2

while the long form uses an indented block and allows nested definitions::

   def make_power(exp):
       "Make a function that raises an argument to the exponent `exp`."
       def raiser(x, y=exp):
           return x ** y
       return raiser

When the short form is used, the code block may contain a docstring as the
first, and possibly only, :const:`small_stmt` element. The extraction of such a
docstring is slightly different and requires only a portion of the complete
pattern used in the more common case. As implemented, the docstring will only
be found if there is only one :const:`small_stmt` node in the
:const:`simple_stmt` node. Since most functions and methods which use the short
form do not provide a docstring, this may be considered sufficient. The
extraction of the docstring proceeds using the :func:`match` function as
described above, and the value of the docstring is stored as an attribute of the
:class:`SuiteInfoBase` object.
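For comparison, an :mod:`ast`-based approach should not need to distinguish the
two forms, since both produce a function node whose body starts with the
docstring expression. A minimal sketch, assuming the hypothetical helper name
``first_def_docstring`` and the two definitions shown above as input:

```python
import ast

# Short form: the whole body sits on the same line as the "def".
SHORT = 'def square(x): "Square an argument."; return x ** 2\n'

# Long form: an indented block with a nested definition.
LONG = '''def make_power(exp):
    "Make a function that raises an argument to the exponent `exp`."
    def raiser(x, y=exp):
        return x ** y
    return raiser
'''

def first_def_docstring(source):
    """Return the docstring of the first statement in *source*.

    Hypothetical helper: assumes the first statement is a function or
    class definition.
    """
    definition = ast.parse(source).body[0]
    return ast.get_docstring(definition)

print(first_def_docstring(SHORT))
print(first_def_docstring(LONG))
```

Unlike the pattern-based extraction described above, this also finds the
docstring when the short form contains more than one statement.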
|
||||
|
||||
After docstring extraction, a simple definition discovery algorithm operates on
the :const:`stmt` nodes of the :const:`suite` node. The special case of the
short form is not tested; since there are no :const:`stmt` nodes in the short
form, the algorithm will silently skip the single :const:`simple_stmt` node and
correctly not discover any nested definitions.

Each statement in the code block is categorized as a class definition, function
or method definition, or something else. For the definition statements, the
name of the element defined is extracted and a representation object appropriate
to the definition is created with the defining subtree passed as an argument to
the constructor. The representation objects are stored in instance variables
and may be retrieved by name using the appropriate accessor methods.
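The same categorization can be sketched with the :mod:`ast` module. This is an
illustration of the idea rather than the code from :file:`example.py`: the
function names ``collect_definitions`` and ``describe``, the dictionary layout,
and the ``SOURCE`` string are all hypothetical. As in the original example,
only definitions directly in a block's body are considered.

```python
import ast

def collect_definitions(node):
    """Sort the direct children of *node* into classes and functions."""
    classes, functions = {}, {}
    for child in node.body:
        if isinstance(child, ast.ClassDef):
            classes[child.name] = describe(child)
        elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions[child.name] = describe(child)
        # Anything else (assignments, "if" blocks, ...) is skipped.
    return classes, functions

def describe(node):
    """Return the docstring and nested definitions of a module, class or function."""
    classes, functions = collect_definitions(node)
    return {
        "docstring": ast.get_docstring(node),
        "classes": classes,
        "functions": functions,
    }

# Hypothetical module source used only for this demonstration.
SOURCE = '''"""Module doc."""

class Stack:
    "A simple stack."
    def push(self, item):
        "Add an item."

def helper():
    pass

if True:
    def hidden():
        pass
'''

info = describe(ast.parse(SOURCE))
print(sorted(info["classes"]), sorted(info["functions"]))
print(info["classes"]["Stack"]["functions"]["push"]["docstring"])
```

Here ``hidden`` is not reported because it is defined inside an
:keyword:`if` branch rather than at the top level of its block.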
|
||||
|
||||
The public classes provide any accessors required which are more specific than
those provided by the :class:`SuiteInfoBase` class, but the real extraction
algorithm remains common to all forms of code blocks. A high-level function can
be used to extract the complete set of information from a source file. (See
file :file:`example.py`.) ::

   def get_docs(fileName):
       import os
       import parser

       # Read the source, closing the file as soon as it has been read.
       with open(fileName) as f:
           source = f.read()
       basename = os.path.basename(os.path.splitext(fileName)[0])
       st = parser.suite(source)
       return ModuleInfo(st.totuple(), basename)

This provides an easy-to-use interface to the documentation of a module. If
information is required which is not extracted by the code of this example, the
code may be extended at clearly defined points to provide additional
capabilities.
|
||||
|
||||
|
|