Date: 26 June 2021
You may find my cheat sheet here.
I used to think Python is better than SQL. In my last position, I will use Python for everything if I have the choice. Even for joining two tables, I prefer to do it in
pd.merge(df1, df2, on='some_index'). In my current position, I was forced to use SQL for a lot of things, partly because it is the default (The query engine is presto) and partly because I cannot just query all the data on a Big Data scale. The reason why I was able to do that in my last position is that the data size is relatively small.
I experienced a huge improvement in my SQL skills. When I learnt more, I think SQL is a powerful tool if you know two things: common table expression and window functions.
One of the reasons why I hate SQL is that it is difficult to read SQL written by others, or even myself. Usually, you will find endless sub-queries, meaningless alias (SELECT * FROM Table1 AS a), and a lack of comments. These make SQL hard to comprehend. But, is it the fault of the tool itself? A year ago, my answer is yes. I think Python is just better. Now, I think it is the fault of the users themselves.
After I learned to use common table expression (CTE), I rarely write sub-queries anymore. It makes codes much more readable. When written sub-queries, it is hard to trace and understand because human reads from top to bottom while you read nested sub-queries from inner query to outer query. CTE fixed that by representing queries in friendly way. For example:
SELECT MAX(avg_length) AS max_length FROM ( SELECT first_name, last_name, AVG(length) AS avg_length FROM participant JOIN jump ON jump.participant_id = participant.id GROUP BY first_name, last_name ) AS average_lengths;
WITH average_lengths AS ( SELECT first_name, last_name, AVG(length) AS avg_length FROM participant JOIN jump ON jump.participant_id = participant.id GROUP BY first_name, last_name ) SELECT MAX(avg_length) AS max_length FROM average_lengths;
WITH statement creates a temporary table that can be used just a normal database table. With that, you can break down your queries from a large unreadable code to some manageable steps.
The second skill I learnt is window functions. I cannot describe accurately how it is different if you do not use window functions. But I just feel using it makes many queries simpler. One of the key difference when using aggregation such as Group By compared with window functions is that window functions keep the rows before aggregations, like the following.
One typical application is calculating the cumulative sum.
If you learned and practiced these two skills, you will make your SQL more readable and manageable . Try it out!