Pybites Logo

Ast visitor

Level: Advanced (score: 4)

In Scoring objects we learned how to identify builtin functions, modules, and keywords when provided with a list of strings.

In this Bite we take it a step further considering the following scenario: What if somebody gives you some python source code and asks you to identify which modules it uses and determine how frequently functions from those modules are used?

For a few small files you can solve this problem manually using python's regex facility or with command line tools like grep or awk. However the problem becomes more difficult to manage when you consider a library which may have many large modules.

The ast module can help you in this scenario.

Intro to the ast module

The ast module transforms any python source code into a tree representation of the underlying python grammar. Such tree is called Abstract Syntax Tree (AST), and it is a data structure used to reason about programming languages grammar in general.

Let's see the ast module in action on a simple example:

>>> import ast
>>> tree = ast.parse("one_plus_two = 1+2")
>>> ast.dump(tree)
"Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=BinOp(left=Num(n=1), op=Add(), right=Num(n=2)))])"

...ghhh, pretty condensed output, isn't it? Let's rearrange it manually to make it more readable:

"Module(
    body=[
        Assign(
            targets=[
                Name(id='one_plus_two', ctx=Store())
            ],
            value=BinOp(
                left=Num(n=1),
                op=Add(),
                right=Num(n=2))
            )
    ]
)"
Module
   |____ Assign
            |___ Name
            |___ BinOp
                   |__ Num
                   |__ Add
                   |__ Num

What can we observe?

- The outer most node Module is the root of our tree.

- The Module.body attribute is a list of nodes, one for each instruction of the input program. Our example program consists of one assignment instruction, so Module.body contains only one Assign node.

- The Assign node has two attributes, Assign.targets and Assign.value, which describe the left and right-hand sides of the assignment with respect to the equal sign.

    - The Assign.targets attribute is a list containing the different destinations of the assignment operation. In our case only the node Name is in the list since the assignment target is a single variable. Notice also that the Name.id attribute of the target node contains the name of the variable "one_plus_two" as indeed specified in the source code.

    - The Assign.value contains a BinOp node since the right hand side of the assignment is a binary operation between two operands. The attributes BinOp.left and BinOp.right specify the left and right operands for the BinOp.op operation. Remember, the expression on the right hand side of the assignment was 1 + 2, so it is not surprising that BinOp.left and BinOp.right nodes are Num nodes with values 1 and 2 respectively.

The ast module provides a large variety of nodes to cover the different aspects of the Python grammar (for more details, see the official python 3.9 doc of the ast module, as well as the external documentation Green Tree Snake as suggested by the official doc).

Overall, the rich functionality of the ast module makes it easy to work with an AST data structure to both inspect and manipulate python source code programmatically.

As a concrete example, mutpy is a mutation testing tool that artificially alters the code under testing to broaden the unit tests in an automated fashion. Under the hood, it uses the ast module to (i) create an AST, (ii) apply mutations, and (iii) convert the tree into a code object via the builtin function compile() before executing it (see mutpy.mutpy.utils.create_module() for details).

Our objective

In this bite we are going to see how the ast module makes it easy to visit an AST tree, and discover import and from ... import statements, and function calls. Specifically, we will work with ast.Import, ast.ImportFrom, and ast.Call node types.

Given a python program source code, determine the following:

- Which builtin functions are used.

- Which modules are imported.

- Which calls make direct use of imported modules or submodules.

To do so, we are going to use the ast module. Specifically, we take advantage of the NodeVisitor class to create our own AstVisitor class to visit the nodes of an AST.

We already provide you a template of the AstVisitor class with a .parse() method to create an AST tree and trigger a visit, and we registered three callbacks that are invoked when three specific node types are encountered during parsing:

- .visit_Import() called when visiting a node related to import xyz statements (see the ast.Import node documentation).

- .visit_ImportFrom() called when visiting a node related to from abc import xyz statements (see the ast.ImportFrom node documentation).

- .visit_Call() called during when visiting a node related to function call (see the ast.Call node documentation).

What is left to do is to complete the implementation of the three methods mentioned so to "track" both which builtins functions (if any) are used, which modules are imported (if any) and which calls they related to (if any). To be more specific, we want to know both the list of builtins / methods, but also the line number where each of tracked objects are located (see the ast.AST documentation). To report on the tracking you need to complete the following methods defined in the template:

- .builtins() returns a list of string, one for each builtin functions tracked.

- .builtins_lineno() returns a dictionary whose keys are builtins functions, and values are the line number where the related call occurred.

- .modules() returns a list of string, one for each imported module.

- .modules_lineno() returns a dictionary whose keys are module object calls, and values are the line number where the related call occurred.

The tracking you need to implement is subject to the following rules:

- __rule-1__: each of the methods support an optional input parameter named valid to specify a list of builtins or modules to restrict the search to. When valid=None, all builtins and modules will be reported in the output.

- __rule-2__: resolve aliased import statements and report their true names.

- __rule-3__: resolve submodule imports and express the module path using dotted notation.

- __rule-4__: when considering modules, valid= can contain alias names, but they need to be expanded to full names to comply with __rule-2__;

An example

>> code = '''
import pandas as pd
import numpy as np
from argparse import ArgumentParser
print("this is a test")
parser = ArgumentParser()
len("12345")
df_tmp = pd.DataFrame(range(10), column=["x"])
df_tmp.loc[:, "y"] = df_tmp[x] + 1
np.random.random()
'''

>> vst = AstVisitor(code)
>> vst.builtins()
["print", "range", "len"]
>> vst.modules()
["argparse", "numpy", "pandas"]
>> vst.builtins_lineno()
{"print": [6], "len": [9], "range": [11]}
>> vst.modules_lineno()
{"argparse.ArgumentParser": [8], "pandas.DataFrame": [11], "numpy.random.random": [14]}

Considering the output of vst.modules_lineno(), notice the following:

- The call ArgumentParser() is resolved into argparse.ArgumentParser to comply with __rule-3__.

- The pd.DataFrame() and np.random.random() calls are resolved into pandas.DataFrame and numpy.random.random respectively to comply with __rule-2__.

Let's see a few more examples using the code from the previous example, but this time using valid= to constrain the search output:

>> first = vst.builtins(valid=["iter"])
>> first
[]
>> second = vst.builtins_lineno(valid=["iter"])
>> second
{}
>> third = vst.builtins_lineno(valid=["iter", "range"])
>> third
{"range": [11]}
>> fourth = vst.modules(valid=["np"])
>> fourth
["numpy"]
>> fifth = vst.modules(valid=["numpy"])
>> fifth
["numpy"]
>> sixth = vst.modules_lineno(valid=["numpy"])
>> sixth
{"numpy.random.random": [14]}

Notice the following:

- first and second are empty since iter is not used in the source code.

- third contains only range since iter is not used in the source code.

- fourth contains numpy even if we specified valid=["np"] to comply with __rule-2__ and __rule-4__.

- fifth is identical to fourth as expected, but notice that this time we specified valid=["numpy"] while before we used valid=["np"].

- sixth contains only calls related to numpy as requested, but notice how we specified valid=["numpy"] while the original code has the statement np.random.random(), hence the search takes care of aliasing to comply with __rule-2__.

Final notes:

- Please refer to the python3.9 documentation or, as suggested by the doc itself, the external Green Tree Snake documentation. The official documentation for python version lower than 3.9 is not very complete. If you are coding outside the pybites platform, you should not have issues in solving this bite with any python versions higher than 3.6.

- Tracking name aliasing is the most complex part to implement successfully and depends on you understanding what the various node attributes mean.

- Beware of expressions like module.submodule.submodule.func() which may be present when handling Call nodes.

- Lastly, I recommend you spend some time with the python console trying to parse simple statements and investigate the AST tree manually. The ipython console console is great for this sort of exploration.

Good luck!

AST