Ast visitor
Level: Advanced (score: 4)
In Scoring objects we learned how to identify builtin functions, modules, and keywords when provided with a list of strings.
In this Bite we take it a step further considering the following scenario: What if somebody gives you some python source code and asks you to identify which modules it uses and determine how frequently functions from those modules are used?
For a few small files you can solve this problem manually using python's regex facility or with command line tools like grep
or awk
. However the problem becomes more difficult to manage when you consider a library which may have many large modules.
The ast
module can help you in this scenario.
Intro to the ast
module
The ast
module transforms any python source code into a tree representation of the underlying python grammar. Such tree is called Abstract Syntax Tree (AST), and it is a data structure used to reason about programming languages grammar in general.
Let's see the ast
module in action on a simple example:
>>> import ast
>>> tree = ast.parse("one_plus_two = 1+2")
>>> ast.dump(tree)
"Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=BinOp(left=Num(n=1), op=Add(), right=Num(n=2)))])"
...ghhh, pretty condensed output, isn't it? Let's rearrange it manually to make it more readable:
"Module(
body=[
Assign(
targets=[
Name(id='one_plus_two', ctx=Store())
],
value=BinOp(
left=Num(n=1),
op=Add(),
right=Num(n=2))
)
]
)"
Module
|____ Assign
|___ Name
|___ BinOp
|__ Num
|__ Add
|__ Num
What can we observe?
- The outer most node Module
is the root of our tree.
- The Module.body
attribute is a list of nodes, one for each instruction of the input program. Our example program consists of one assignment instruction, so Module.body
contains only one Assign
node.
- The Assign
node has two attributes, Assign.targets
and Assign.value
, which describe the left and right-hand sides of the assignment with respect to the equal sign.
- The Assign.targets
attribute is a list containing the different destinations of the assignment operation. In our case only the node Name
is in the list since the assignment target is a single variable. Notice also that the Name.id
attribute of the target node contains the name of the variable "one_plus_two"
as indeed specified in the source code.
- The Assign.value
contains a BinOp
node since the right hand side of the assignment is a binary operation between two operands. The attributes BinOp.left
and BinOp.right
specify the left and right operands for the BinOp.op
operation. Remember, the expression on the right hand side of the assignment was 1 + 2
, so it is not surprising that BinOp.left
and BinOp.right
nodes are Num
nodes with values 1
and 2
respectively.
The ast
module provides a large variety of nodes to cover the different aspects of the Python grammar (for more details, see the official python 3.9 doc of the ast
module, as well as the external documentation Green Tree Snake as suggested by the official doc).
Overall, the rich functionality of the ast
module makes it easy to work with an AST data structure to both inspect and manipulate python source code programmatically.
As a concrete example, mutpy
is a mutation testing tool that artificially alters the code under testing to broaden the unit tests in an automated fashion. Under the hood, it uses the ast
module to (i) create an AST, (ii) apply mutations, and (iii) convert the tree into a code object via the builtin function compile()
before executing it (see mutpy.mutpy.utils.create_module()
for details).
Our objective
In this bite we are going to see how the ast
module makes it easy to visit an AST tree, and discover import
and from ... import
statements, and function calls. Specifically, we will work with ast.Import
, ast.ImportFrom
, and ast.Call
node types.
Given a python program source code, determine the following:
- Which builtin functions are used.
- Which modules are imported.
- Which calls make direct use of imported modules or submodules.
To do so, we are going to use the ast
module. Specifically, we take advantage of the NodeVisitor class to create our own AstVisitor
class to visit the nodes of an AST.
We already provide you a template of the AstVisitor
class with a .parse()
method to create an AST tree and trigger a visit, and we registered three callbacks that are invoked when three specific node types are encountered during parsing:
- .visit_Import()
called when visiting a node related to import xyz
statements (see the ast.Import
node documentation).
- .visit_ImportFrom()
called when visiting a node related to from abc import xyz
statements (see the ast.ImportFrom
node documentation).
- .visit_Call()
called during when visiting a node related to function call (see the ast.Call
node documentation).
What is left to do is to complete the implementation of the three methods mentioned so to "track" both which builtins functions (if any) are used, which modules are imported (if any) and which calls they related to (if any). To be more specific, we want to know both the list of builtins / methods, but also the line number where each of tracked objects are located (see the ast.AST
documentation). To report on the tracking you need to complete the following methods defined in the template:
- .builtins()
returns a list of string, one for each builtin functions tracked.
- .builtins_lineno()
returns a dictionary whose keys are builtins functions, and values are the line number where the related call occurred.
- .modules()
returns a list of string, one for each imported module.
- .modules_lineno()
returns a dictionary whose keys are module object calls, and values are the line number where the related call occurred.
The tracking you need to implement is subject to the following rules:
- __rule-1__: each of the methods support an optional input parameter named valid
to specify a list of builtins or modules to restrict the search to. When valid=None
, all builtins and modules will be reported in the output.
- __rule-2__: resolve aliased import statements and report their true names.
- __rule-3__: resolve submodule imports and express the module path using dotted notation.
- __rule-4__: when considering modules, valid=
can contain alias names, but they need to be expanded to full names to comply with __rule-2__;
An example
>> code = '''
import pandas as pd
import numpy as np
from argparse import ArgumentParser
print("this is a test")
parser = ArgumentParser()
len("12345")
df_tmp = pd.DataFrame(range(10), column=["x"])
df_tmp.loc[:, "y"] = df_tmp[x] + 1
np.random.random()
'''
>> vst = AstVisitor(code)
>> vst.builtins()
["print", "range", "len"]
>> vst.modules()
["argparse", "numpy", "pandas"]
>> vst.builtins_lineno()
{"print": [6], "len": [9], "range": [11]}
>> vst.modules_lineno()
{"argparse.ArgumentParser": [8], "pandas.DataFrame": [11], "numpy.random.random": [14]}
Considering the output of vst.modules_lineno()
, notice the following:
- The call ArgumentParser()
is resolved into argparse.ArgumentParser
to comply with __rule-3__.
- The pd.DataFrame()
and np.random.random()
calls are resolved into pandas.DataFrame
and numpy.random.random
respectively to comply with __rule-2__.
Let's see a few more examples using the code from the previous example, but this time using valid=
to constrain the search output:
>> first = vst.builtins(valid=["iter"])
>> first
[]
>> second = vst.builtins_lineno(valid=["iter"])
>> second
{}
>> third = vst.builtins_lineno(valid=["iter", "range"])
>> third
{"range": [11]}
>> fourth = vst.modules(valid=["np"])
>> fourth
["numpy"]
>> fifth = vst.modules(valid=["numpy"])
>> fifth
["numpy"]
>> sixth = vst.modules_lineno(valid=["numpy"])
>> sixth
{"numpy.random.random": [14]}
Notice the following:
- first
and second
are empty since iter
is not used in the source code.
- third
contains only range
since iter
is not used in the source code.
- fourth
contains numpy
even if we specified valid=["np"]
to comply with __rule-2__ and __rule-4__.
- fifth
is identical to fourth
as expected, but notice that this time we specified valid=["numpy"]
while before we used valid=["np"]
.
- sixth
contains only calls related to numpy
as requested, but notice how we specified valid=["numpy"]
while the original code has the statement np.random.random()
, hence the search takes care of aliasing to comply with __rule-2__.
Final notes:
- Please refer to the python3.9 documentation or, as suggested by the doc itself, the external Green Tree Snake documentation. The official documentation for python version lower than 3.9 is not very complete. If you are coding outside the pybites platform, you should not have issues in solving this bite with any python versions higher than 3.6.
- Tracking name aliasing is the most complex part to implement successfully and depends on you understanding what the various node attributes mean.
- Beware of expressions like module.submodule.submodule.func()
which may be present when handling Call
nodes.
- Lastly, I recommend you spend some time with the python console trying to parse simple statements and investigate the AST tree manually. The ipython
console console is great for this sort of exploration.
Good luck!