How to Perform SQL-flavored Left | Right | Inner | Outer Join in Pandas
In the world of data analysis, we rarely have the lucky scenario where the data we need all resides in one table and is in a ready-to-consume format. Instead, we often need to source data from multiple tables and bring them into one dataset for analysis. This data munging process is usually referred to as data merging/ joining/concatenation.
If you come from the world of SQL, you are probably already familiar with the SQL concept that deals with data merging. Specifically, SQL refers to the process of merging two or multiple tables horizontally as ‘join’, and vertically as ‘union’. In this post, we will focus exclusively on how to perform SQL-flavored ‘join’ in python using pandas’ pandas.DataFrame.merge
method. We will discuss how to do SQL’s ‘union’ query in pandas in another post.
There are four basic ways of merging/joining two datasets horizontally using a common key: Left Join, Right Join, Inner Join, and Outer Join. Let’s use the following two sample datasets for demonstration.
import pandas as pd
import numpy as npdf_left = pd.DataFrame({'Student_ID': ['001', '002', '003', '004','005'],
'Gender': ['F', 'M','F', 'F','M'],
'race/ethnicity': ['Group A', 'Group B','Group D', 'Group E','Group E']})df_right = pd.DataFrame({'Student_ID': ['002', '003', '005', '006','007'],
'Math': [77, 46,83, 53,65],
'Reading': [91, 62,88, 56,72],
'Writing': [49, 90 ,67, 66,59]})
Here we have two dataframes called ‘df_left’ and ‘df_right’ respectively. The ‘df_left’ dataframe has a Student_ID column and each student’s gender and race/ethnicity information. The ‘df_right’ dataframe also has a column for ‘Student_ID’ as well as each student’s math, reading, and writing scores. We want to merge the two tables (horizontally) so that we have all the information about a student in one dataset for subsequent analysis.
Left Join
The left join operation keeps all the records in the left table (df_left) and only brings in the records from the right table that have a match by the common key (‘Student_ID’ ). For the records in the right table that don’t have a match by the common key, the merged table will have values of ‘NaN’ for those records.
To perform the left join in python, we can use Pandas’ dataframe.merge()
method. The on
argument takes the key column and the how
argument takes the type of the join (left, right, etc.)
df_left.merge(df_right, on='Student_ID', how='left')
Right Join
Opposite to the left join, the right join operation keeps all the records in the right table (df_right) and only brings in the records from the left table that have a match by the common key. For the records in the left table that don’t have a match by the common key, the merged table will have values of ‘NaN’ for those records.
We can perform the right join in python using the code below by specifying the how
argument to be how='right'
:
df_left.merge(df_right, on='Student_ID', how='right')
Inner Join
Unlike left join and right join, inner join only takes the intersection of the two tables by the common key (Student_ID). Any records that do not have common Student_IDs will be dropped from the merged table as shown below.
We can perform the inner join in python using the code below:
df_left.merge(df_right, on='Student_ID', how='inner')
Outer Join
Outer join (also referred to as full outer join), takes all the Student_IDs from both tables and inserts NaN (missing values) for the fields that don’t have a match by the common key.
We can perform the full outer join in python using the code below:
df_left.merge(df_right, on='Student_ID', how='outer')
Only Merge-in a Subset of Columns
The code snippet shown above, by default, will merge the left and right tables by whatever type of joins you choose and include all the columns from both tables. What if you only want to include the ‘Reading’ column from the right table when merging?
Well, you can always subset the columns of the tables before merging. For example, if we want to bring in the ‘Reading’ column from the right table with a left join, we can do the following. Simple, isn’t it?
df_left.merge(df_right[['Student_ID','Reading']], on='Student_ID', how='left')
Join with different ‘Key Column’ Names
As you’ve probably noticed in the previous example, the common key columns in both tables have exactly the same column name — ‘Student_ID’. What if the key columns in the two tables have different names?
Well, you can choose to rename one of the table’s key column so that it has the same name as the other table and use the same code snippet shown above. Another way to perform the join with different key column names is to explicitly specify the names of the key column using left_on
and right_on
as arguments instead of on
.
For example, let’s say the left table’s key column name is ‘Student_ID’ whereas the right table’s key column name is ‘ID’. To perform inner join, you can use the following code:
df_left.merge(df_right, left_on='Student_ID', right_on='ID', how='inner')
Merge on Multiple Keys
Sometimes you may need to join two tables on multiple keys. You can simply specify a list of keys you want to use to join the two tables in the on
argument:
df_left.merge(df_right, on=['key1', 'key2'...],how='left')
Or use the left_on
and right_on
arguments if you have different names for the key columns:
df_left.merge(df_right, left_on=['leftkey1', 'leftkey2'], right_on=['rightkey1', 'rightkey2'],how='left')
Merging tables is one of the most common data munging tasks in the world of data analysis. I hope that you find this tutorial helpful in understanding the basic types of SQL-flavored joins and how to do them using Pandas
. As mentioned earlier in the post, this tutorial only teaches you how to merge two tables horizontally based on one or more common key(s). To merge/stack tables vertically (equivalent to the ‘UNION’ operation in SQL), we will need to use a different pandas
function and we will cover that topic in a different post. Thanks for reading and I hope you enjoyed this short tutorial.
Data Sources: The sample datasets used in this tutorial were created by the author.
You can unlock full access to my writing and the rest of Medium by signing up for Medium membership ($5 per month) through this referral link. By signing up through this link, I will receive a portion of your membership fee at no additional cost to you. Thank you!
Comments