Pybites Logo

Hash sql statements

Level: Intermediate (score: 3)

In this Bite you are going to implement a simple (and not 100% correct) hashing function for Structured Query Language (SQL) queries. You don't have to know anything about SQL to solve this Bite, but you can find a very nice introduction here. To understand the context, you only need to know that SQL is used to query data from a database. So an SQL query tells the database which data to return.

Look at this beautiful SQL Query:

select Candidate, Election_year, sum(Total_$), count(*)
from combined_party_data
where Election_year = 2016
group by Candidate, Election_year
having count(*) > 80
order by count(*) DESC;

If you are not familiar with SQL, which, again, is totally fine, this query pulls data from a combined_party_data table that shows the campaign spending totals per candidate in the 2016 election year, but only for those candidates with more than over 80 spending positions.

Consider that you are writing some kind of framework, library or package where the user can enter/execute such SQL queries. You want to be clever and avoid fetching identical data from the database twice. So your first idea is to store each SQL query and check if such a query was run previously. However, different users (and even the same user over time) can write the same SQL query slightly differently. For example, the keywords like SELECT and WHERE are not case-sensitive and thus not required to be in lower case or upper case. And the order of fields that get selected could be changed, resulting in a dataset with the same columns but in different order. So you need another approach than just to store the plain query text. 

Your next idea is to hash the query, that is you map an arbitrary SQL query string to a fixed-size value. Whenever the user enters a new query, you can check its hash value against a list of already processed queries and their hash values. If there is a match, you have run this query earlier and you can return the cached data instead of asking the database to do the work again.

So what you want to achieve is something like this:

query = """select Candidate, Election_year, sum(Total_$), count(*)
    from combined_party_data
    where Election_year = 2016
    group by Candidate, Election_year
    having count(*) > 80
    order by count(*) DESC; 

print(hash_query(query))

>>> c4f2b2adf3af1cff8efddadc16d57e03

Your task

Your task in this Bite is to write the hash function for an SQL query.

Implement the function hash_query according to the function's documentation and the tests (you will find some edge cases there!).

Note: The main idea is that an SQL query should be independent of the format and case insensitive. It should not matter whether the user is writing SELECT, select or SeleCt. This might be true for the keywords of the language, but different database management systems (DBMS) handle table and column names differently. This is why this Bite is a simplified approach to the problem and we are not dealing with the full problem here. See Is SQL Case-Sensitive? for more information about the topic.

Note: There is another assumption in this Bite that is only partially correct: That the order of the different query parts does not matter. This might be true for the order of column names, when you accept that a dataset is the same no matter the order of its columns. However, the order in a GROUP BY clause does matter, for example. If you GROUP BY by two columns, the result depends on which column comes first. The main reason for ignoring the order in this Bite is to have a more robust hash function that handles queries like sets: two queries are identical if they have the same content, which means they have the same words. As I said, this is not correct in general.

Hints

- The hashlib module of the standard library might come in handy.

Keep calm and coding!