Saturday, October 15, 2022

SQL Aggregate Functions for Your Next Data Science Interview

 At various hierarchies in an enterprise; we often rely on data in the form of reports or visual dashboards for daily routine tasks; say, for audit and compliance(both internal and external), to make a decision, measure success or even to support the hypothesis of a new business idea (to develop a new idea)

No matter what data you have and how effectively you have stored or maintained it. The fact is you need to know basic coding to actually retrieve that data, that is when your data is of any worth.

Knowing basic SQL is not really a rocket science but it’s a game changer. If everyone on the team is skilled with the basic data retrieval coding, it empowers the entire team and reduce the dependency for the trivial tasks resulting into quick decision making.

Aggregate Functions in SQL are very powerful and most commonly used ones in day to day analysis; like, which is the highest grossing product, the total sales per quarter etc. Here is the quick recap of most commonly Aggregate Functions.

What are SQL aggregate functions?

SQL Aggregate Functions perform the calculation on a set of values and return one value. Well, most often! When paired with GROUP BY clause, it returns single value for each group. Most of the Aggregate Functions are deterministic functions.

It will all make sense once we go through the examples in this article.

Here, I’m using classicmodels MySQL sample database for demonstration, it holds business data of a car retailer. Below is the ER Diagram for a quick understanding,

Image by author

📌 Side Note

GROUP BY clause is often used with Aggregate Functions to group the result-set by one or more columns. (Basically to organise similar data into groups)

ORDER BY clause is used to sort the result-set in either ascending or descending order. By default it’s in ascending order. GROUP BY clause always precedes the ORDER BY clause when used together.

HAVING clause is similar to WHERE clause with some differences. It can be used only with columns which are part of GROUP BY clause or aggregate functions. I have written a detailed article on filtering clauses in SQL, feel free to refer it here,

DISTINCT clause is used to filter out duplicates and provide only unique values in the result-set.

The most common aggregate functions are,

COUNT( )

Generic meaning of the word ‘Count’ is determining the total number of something.

Image by author, created on canva

Similarly in SQL, COUNT() returns the number of non-NULL records or the records which matches the specified conditional expression. COUNT() is a deterministic function when used without the OVER and ORDER BY clauses, and nondeterministic when used with the OVER and ORDER BY clauses. Common syntax of COUNT() is,

COUNT( [ALL|DISTINCT] expression)

Let’s quickly query table OFFICES which holds the data of sales offices across the world,

#offices
SELECT * FROM CLASSICMODELS.OFFICES LIMIT 10;
Image by author

Let’s start simple; we need to count total number of distinct cities that have office locations across the world,

#cities that have sales offices.
SELECT DISTINCT
COUNT(DISTINCT CITY) AS TOTAL_OFFICES
FROM
CLASSICMODELS.OFFICES;
Image by author

Now, let’s try to query the number of offices in each country,

#total number of sales offices in each country.
SELECT
COUNTRY,
COUNT(DISTINCT CITY) AS TOTAL_OFFICES
FROM
CLASSICMODELS.OFFICES
GROUP BY COUNTRY
ORDER BY COUNTRY DESC;
Image by author

The first query simply provides total count of offices across the world, while in the second one, we are grouping the result-set based on countries and sorting it using ORDER BY clause. Total number of offices count remain the same in both the result set, that’s 7; only difference is in the second result-set count is divided into different groups.

COUNT() can be used in several different ways; here are some quick examples,

  • COUNT(*) returns the total number of records/rows from the table including duplicates and NULL values.
SELECT COUNT(*) FROM CLASSICMODELS.OFFICES;
Image by author
  • COUNT(1), 1 is a non-NULL expression so it is similar to COUNT(*).
SELECT COUNT(1) FROM CLASSICMODELS.OFFICES;
Image by author
  • COUNT(column_name) returns the non-NULL number of records from that particular column_name.
SELECT COUNT(COUNTRY) FROM CLASSICMODELS.OFFICES;
Image by author
  • COUNT(DISTINCT column_name) returns the non-NULL and unique number of records from column_name.
SELECT COUNT(DISTINCT COUNTRY) FROM CLASSICMODELS.OFFICES;
Image by authors
  • COUNT(CASE WHEN x THEN y ELSE z END) returns the number of rows that satisfy the condition.
#USA office locations
SELECT
COUNT(CASE WHEN COUNTRY='USA' THEN CITY ELSE NULL END) AS USA_OFFICE_LOCATIONS
FROM
CLASSICMODELS.OFFICES;
Image by author
  • COUNT IF is another extension of COUNT() which returns the count of records that matches conditional expression.
#country wise count of office locations
SELECT
COUNT(IF(COUNTRY = 'USA', 1, NULL)) AS 'USA',
COUNT(IF(COUNTRY = 'UK', 1, NULL)) AS 'UK',
COUNT(IF(COUNTRY = 'Japan', 1, NULL)) AS 'Japan'
FROM
CLASSICMODELS.OFFICES;
Image by author

SUM( )

In mathematics, summation/sum is the addition of a sequence of numbers. Similarly in SQL, SUM() returns the sum/summation of all non-NULL values from the selected set.

Image by author, created on canva

Syntax is,

SUM( [ALL|DISTINCT] expression)

Beginners can sometimes get confused between SUM() and COUNT(); but remember, they both are very different. COUNT() returns the total number of non-NULL records/rows in the column while SUM() returns the summation of all the values from that column.

Image by author, created on canva

Table PRODUCTS holds the data of a range of products available,

#products
SELECT * FROM CLASSICMODELS.PRODUCTS LIMIT 10;
Image by author

Say, we need the information about total quantity of products currently present in the inventory,

#total quantity available in stock
SELECT
SUM(QUANTITYINSTOCK) as TOTAL_QUANTITY_IN_STOCK
FROM
CLASSICMODELS.PRODUCTS;
Image by author

In business, this information is not much of a use. Let’s now try to query total quantity in stock currently for each PRODUCTLINE and sort the result set in descending order,

#total quantity currently in stock for each productline grouped by productline
SELECT
PRODUCTLINE,
SUM(QUANTITYINSTOCK) AS TOTAL_IN_STOCK
FROM
CLASSICMODELS.PRODUCTS
GROUP BY PRODUCTLINE
ORDER BY TOTAL_IN_STOCK DESC;
Image by author

Let’s do some quick sales analysis. Sample data from table ORDERDETAILS is,

#orderdetails
SELECT * FROM CLASSICMODELS.ORDERDETAILS LIMIT 10;
Image by author

Say, we need to know the total sales for each order, sorted in descending order,

#total sales for each order placed
SELECT
ORDERNUMBER,
SUM(QUANTITYORDERED * PRICEEACH) AS TOTAL_SALES
FROM
CLASSICMODELS.ORDERDETAILS
GROUP BY ORDERNUMBER
ORDER BY TOTAL_SALES DESC;
Image by author

I have limited the result-set of the above query for demo purpose. Now this result-set also gives an idea about the ORDERNUMBER with the highest sales. You can pick up the ORDERNUMBER from here and do the further analysis on what products were part of this particular order, how much was the quantity, who was the buyer, when was the order placed, shipment status etc.

SUM() will return NULL if all the values of the selected set are NULL.

AVG( )

In mathematics, average is calculated by adding a sequence of numbers and then dividing it by the count of those numbers.

Image by author, created on canva

Similarly, AVG() returns the average(mathematical average) value of the selected set, NULL values are ignored. Common syntax is,

AVG( [ALL|DISTINCT] expression)

Continuing to PRODUCTS table, let’s now find out the average buy price.

Image by author
#average buy price
SELECT
AVG(BUYPRICE) AS AVERAGE_BUYPRICE
FROM
CLASSICMODELS.PRODUCTS;
Image by author

Let’s do this analysis for each PRODUCTLINE now,

#average buy price for each product line
SELECT
PRODUCTLINE,
AVG(BUYPRICE) AS AVERAGE_BUYPRICE
FROM
CLASSICMODELS.PRODUCTS
GROUP BY PRODUCTLINE;
Image by author

With a little bit of makeup, replace AVG(BUYPRICE) AS AVERAGE_BUYPRICE with FORMAT(AVG(BUYPRICE),2) AS AVERAGE_BUYPRICE and the result-set will be,

Image by author

MAX( ), MIN( )

Well! This one is simple one, as the name suggests; MAX() returns the highest or maximum non-NULL value from the selected set and MIN() returns the lowest or minimum non-NULL value from the selected set.

Image by author, created on canva
#maximum buy price
SELECT
MAX(BUYPRICE) AS MAXIMUM_BUYPRICE
FROM
CLASSICMODELS.PRODUCTS;
Image by author
#minimum buy price
SELECT
MIN(BUYPRICE) AS MAXIMUM_BUYPRICE
FROM
CLASSICMODELS.PRODUCTS;
Image by author

Conclusion

The idea of writing this article was a quick and simple recap of the basics of SQL Aggregate Functions. There is more to be explored about these and below are some of the resources to get started,

Happy Learning!

No comments:

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...