At various hierarchies in an enterprise; we often rely on data in the form of reports or visual dashboards for daily routine tasks; say, for audit and compliance(both internal and external), to make a decision, measure success or even to support the hypothesis of a new business idea (to develop a new idea)

No matter what data you have and how effectively you have stored or maintained it. The fact is you need to know basic coding to actually retrieve that data, that is when your data is of any worth.

Knowing basic SQL is not really a rocket science but it’s a game changer. If everyone on the team is skilled with the basic data retrieval coding, it empowers the entire team and reduce the dependency for the trivial tasks resulting into quick decision making.

Aggregate Functions in SQL are very powerful and most commonly used ones in day to day analysis; like, which is the highest grossing product, the total sales per quarter etc. Here is the quick recap of most commonly Aggregate Functions.

What are SQL aggregate functions?

SQL Aggregate Functions perform the calculation on a set of values and return one value. Well, most often! When paired with GROUP BY clause, it returns single value for each group. Most of the Aggregate Functions are deterministic functions.

It will all make sense once we go through the examples in this article.

Here, I’m using classicmodels MySQL sample database for demonstration, it holds business data of a car retailer. Below is the ER Diagram for a quick understanding,

📌 Side Note

GROUP BY clause is often used with Aggregate Functions to group the result-set by one or more columns. (Basically to organise similar data into groups)

ORDER BY clause is used to sort the result-set in either ascending or descending order. By default it’s in ascending order. GROUP BY clause always precedes the ORDER BY clause when used together.

HAVING clause is similar to WHERE clause with some differences. It can be used only with columns which are part of GROUP BY clause or aggregate functions. I have written a detailed article on filtering clauses in SQL, feel free to refer it here,

Commonly Asked Interview Question: Where vs Having vs Qualify

Back To Basics | SQL fundamentals for beginners

towardsdatascience.com

DISTINCT clause is used to filter out duplicates and provide only unique values in the result-set.

The most common aggregate functions are,

COUNT( )

Generic meaning of the word ‘Count’ is determining the total number of something.

Similarly in SQL, COUNT() returns the number of non-NULL records or the records which matches the specified conditional expression. COUNT() is a deterministic function when used without the OVER and ORDER BY clauses, and nondeterministic when used with the OVER and ORDER BY clauses. Common syntax of COUNT() is,

COUNT( [ALL|DISTINCT] expression)

Let’s quickly query table OFFICES which holds the data of sales offices across the world,

#offices
SELECT * FROM CLASSICMODELS.OFFICES LIMIT 10;

Let’s start simple; we need to count total number of distinct cities that have office locations across the world,

#cities that have sales offices.
SELECT DISTINCT
    COUNT(DISTINCT CITY) AS TOTAL_OFFICES
FROM 
    CLASSICMODELS.OFFICES;

Now, let’s try to query the number of offices in each country,

#total number of sales offices in each country.
SELECT
    COUNTRY,
    COUNT(DISTINCT CITY) AS TOTAL_OFFICES
FROM 
    CLASSICMODELS.OFFICES
GROUP BY COUNTRY
ORDER BY COUNTRY DESC;

The first query simply provides total count of offices across the world, while in the second one, we are grouping the result-set based on countries and sorting it using ORDER BY clause. Total number of offices count remain the same in both the result set, that’s 7; only difference is in the second result-set count is divided into different groups.

COUNT() can be used in several different ways; here are some quick examples,

COUNT(*) returns the total number of records/rows from the table including duplicates and NULL values.

SELECT COUNT(*) FROM CLASSICMODELS.OFFICES;

COUNT(1), 1 is a non-NULL expression so it is similar to COUNT(*).

SELECT COUNT(1) FROM CLASSICMODELS.OFFICES;

COUNT(column_name) returns the non-NULL number of records from that particular column_name.

SELECT COUNT(COUNTRY) FROM CLASSICMODELS.OFFICES;

COUNT(DISTINCT column_name) returns the non-NULL and unique number of records from column_name.

SELECT COUNT(DISTINCT COUNTRY) FROM CLASSICMODELS.OFFICES;

COUNT(CASE WHEN x THEN y ELSE z END) returns the number of rows that satisfy the condition.

#USA office locations
SELECT 
    COUNT(CASE WHEN COUNTRY='USA' THEN CITY ELSE NULL END) AS USA_OFFICE_LOCATIONS
FROM 
    CLASSICMODELS.OFFICES;

COUNT IF is another extension of COUNT() which returns the count of records that matches conditional expression.

#country wise count of office locations
SELECT 
    COUNT(IF(COUNTRY = 'USA', 1, NULL)) AS 'USA',
    COUNT(IF(COUNTRY = 'UK', 1, NULL)) AS 'UK',
    COUNT(IF(COUNTRY = 'Japan', 1, NULL)) AS 'Japan'
FROM
    CLASSICMODELS.OFFICES;

SUM( )

In mathematics, summation/sum is the addition of a sequence of numbers. Similarly in SQL, SUM() returns the sum/summation of all non-NULL values from the selected set.

Syntax is,

SUM( [ALL|DISTINCT] expression)

Beginners can sometimes get confused between SUM() and COUNT(); but remember, they both are very different. COUNT() returns the total number of non-NULL records/rows in the column while SUM() returns the summation of all the values from that column.

Table PRODUCTS holds the data of a range of products available,

#products
SELECT * FROM CLASSICMODELS.PRODUCTS LIMIT 10;

Say, we need the information about total quantity of products currently present in the inventory,

#total quantity available in stock
SELECT 
    SUM(QUANTITYINSTOCK) as TOTAL_QUANTITY_IN_STOCK
FROM
    CLASSICMODELS.PRODUCTS;

In business, this information is not much of a use. Let’s now try to query total quantity in stock currently for each PRODUCTLINE and sort the result set in descending order,

#total quantity currently in stock for each productline grouped by productline
SELECT
    PRODUCTLINE,
    SUM(QUANTITYINSTOCK) AS TOTAL_IN_STOCK
FROM 
    CLASSICMODELS.PRODUCTS
GROUP BY PRODUCTLINE      
ORDER BY TOTAL_IN_STOCK DESC;

Let’s do some quick sales analysis. Sample data from table ORDERDETAILS is,

#orderdetails
SELECT * FROM CLASSICMODELS.ORDERDETAILS LIMIT 10;

Say, we need to know the total sales for each order, sorted in descending order,

#total sales for each order placed
SELECT 
    ORDERNUMBER,
    SUM(QUANTITYORDERED * PRICEEACH) AS TOTAL_SALES
FROM 
    CLASSICMODELS.ORDERDETAILS
GROUP BY ORDERNUMBER
ORDER BY TOTAL_SALES DESC;

I have limited the result-set of the above query for demo purpose. Now this result-set also gives an idea about the ORDERNUMBER with the highest sales. You can pick up the ORDERNUMBER from here and do the further analysis on what products were part of this particular order, how much was the quantity, who was the buyer, when was the order placed, shipment status etc.

SUM() will return NULL if all the values of the selected set are NULL.

AVG( )

In mathematics, average is calculated by adding a sequence of numbers and then dividing it by the count of those numbers.

Similarly, AVG() returns the average(mathematical average) value of the selected set, NULL values are ignored. Common syntax is,

AVG( [ALL|DISTINCT] expression)

Continuing to PRODUCTS table, let’s now find out the average buy price.

#average buy price
SELECT
    AVG(BUYPRICE) AS AVERAGE_BUYPRICE
FROM
    CLASSICMODELS.PRODUCTS;

Let’s do this analysis for each PRODUCTLINE now,

#average buy price for each product line
SELECT
    PRODUCTLINE,
    AVG(BUYPRICE) AS AVERAGE_BUYPRICE
FROM
    CLASSICMODELS.PRODUCTS
GROUP BY PRODUCTLINE;

With a little bit of makeup, replace AVG(BUYPRICE) AS AVERAGE_BUYPRICE with FORMAT(AVG(BUYPRICE),2) AS AVERAGE_BUYPRICE and the result-set will be,

MAX( ), MIN( )

Well! This one is simple one, as the name suggests; MAX() returns the highest or maximum non-NULL value from the selected set and MIN() returns the lowest or minimum non-NULL value from the selected set.

#maximum buy price
SELECT 
    MAX(BUYPRICE) AS MAXIMUM_BUYPRICE
FROM
   CLASSICMODELS.PRODUCTS;

#minimum buy price
SELECT 
    MIN(BUYPRICE) AS MAXIMUM_BUYPRICE
FROM
    CLASSICMODELS.PRODUCTS;

Conclusion

The idea of writing this article was a quick and simple recap of the basics of SQL Aggregate Functions. There is more to be explored about these and below are some of the resources to get started,

MySQL sample database
MySQL official documents
Introduction to aggregated functions
Hackerrank or Leetcode to practice basic/intermediate/advance SQL aggregations problems.

Happy Learning!

Subrat's Technical Blog

Saturday, October 15, 2022

SQL Aggregate Functions for Your Next Data Science Interview