Date: 26 June 2021

You may find my cheat sheet here.

I used to think Python is better than SQL. In my last position, I will use Python for everything if I have the choice. Even for joining two tables, I prefer to do it in pd.merge(df1, df2, on='some_index'). In my current position, I was forced to use SQL for a lot of things, partly because it is the default (The query engine is presto) and partly because I cannot just query all the data on a Big Data scale. The reason why I was able to do that in my last position is that the data size is relatively small.

I experienced a huge improvement in my SQL skills. When I learnt more, I think SQL is a powerful tool if you know two things: common table expression and window functions.

One of the reasons why I hate SQL is that it is difficult to read SQL written by others, or even myself. Usually, you will find endless sub-queries, meaningless alias (SELECT * FROM Table1 AS a), and a lack of comments. These make SQL hard to comprehend. But, is it the fault of the tool itself? A year ago, my answer is yes. I think Python is just better. Now, I think it is the fault of the users themselves.

After I learned to use common table expression (CTE), I rarely write sub-queries anymore. It makes codes much more readable. When written sub-queries, it is hard to trace and understand because human reads from top to bottom while you read nested sub-queries from inner query to outer query. CTE fixed that by representing queries in friendly way. For example:

SELECT
  MAX(avg_length) AS max_length
FROM (
  SELECT
    first_name,
    last_name,
    AVG(length) AS avg_length
  FROM participant
  JOIN jump
    ON jump.participant_id = participant.id
  GROUP BY first_name,
    last_name
) AS average_lengths;
WITH average_lengths AS (
  SELECT
    first_name,
    last_name,
    AVG(length) AS avg_length
  FROM participant
  JOIN jump
    ON jump.participant_id = participant.id
  GROUP BY first_name,
    last_name
)
 
SELECT
  MAX(avg_length) AS max_length
FROM average_lengths;

The WITH statement creates a temporary table that can be used just a normal database table. With that, you can break down your queries from a large unreadable code to some manageable steps.

The second skill I learnt is window functions. I cannot describe accurately how it is different if you do not use window functions. But I just feel using it makes many queries simpler. One of the key difference when using aggregation such as Group By compared with window functions is that window functions keep the rows before aggregations, like the following.

One typical application is calculating the cumulative sum.

If you learned and practiced these two skills, you will make your SQL more readable and manageable . Try it out!

Leave a Reply

Hey, wait!

Before you go, Subscribe and Get Notify for Content Like This.

%d bloggers like this: