Tuesday, November 12, 2019

Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations

For data scientists, checking correlations is an important part of the exploratory data analysis process. This analysis is one of the methods used to decide which features affect the target variable the most, and in turn, get used in predicting this target variable. In other words, it’s a commonly-used method for feature selection in machine learning.
And because a visualization is generally easier to understand than tabular data, heatmaps are typically used to visualize correlation matrices. A simple way to plot a heatmap in Python is with the Seaborn library.
From seaborn documentation

Seaborn heatmap arguments

But what else can we get from the heatmap apart from a simple plot of the correlation matrix?
In two words: A LOT.
Surprisingly, the Seaborn heatmap function has 18 arguments that can be used to customize a correlation matrix, improving how fast insights can be derived. For the purposes of this tutorial, we’re going to use 13 of those arguments.
Let’s get right to it.

Getting started with Seaborn

Please note: If using Google Colab or any Anaconda package, there’s no need to install Seaborn; you’ll only need to import it. Otherwise, use this link to install Seaborn.
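As a quick sketch, the imports used for a heatmap might look like the following. The aliases sns, plt, np, and pd are common conventions, not requirements.

```python
# Conventional aliases; any names would work.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Printing the version is a simple way to confirm the install.
print("seaborn", sns.__version__)
```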


The data

One important thing to note when plotting a correlation matrix is that it only covers numeric columns; non-numeric columns are left out. For the purposes of this tutorial, all the categorical variables were converted to numeric variables.
This is what the DataFrame looks like after wrangling.
Take a look at how the data was wrangled here.
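To make the wrangling step concrete, here is a minimal sketch on made-up data. The column names (age, income, smoker) and the yes/no encoding are hypothetical and are not the dataset used in this tutorial. Note that, depending on your pandas version, calling .corr() on a DataFrame that still contains text columns may raise an error instead of skipping them, which is one more reason to encode them first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: two numeric columns and one text column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=100),
    "income": rng.normal(50_000, 10_000, size=100),
    "smoker": rng.choice(["yes", "no"], size=100),
})

# Encode the categorical column as 0/1 so it can take part in the correlation.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# Pairwise Pearson correlations between all (now numeric) columns.
corr = df.corr()
print(corr.round(2))
```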
As mentioned previously, the Seaborn heatmap function can take in 18 arguments.
This is what the function looks like with all the arguments:
sns.heatmap(data, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)
Just taking a look at the code and not having any idea about how it works can be very overwhelming. Let’s dissect it together.
To better understand the arguments, we’re going to group them into 4 categories:
1. The Essentials
2. Adjusting the axis (the measurement bar)
3. Aesthetics
4. Changing the matrix shape

The Essentials

2. The first argument alone is enough to read off the main insights. For even easier interpretation, pass annot=True as well, which displays the correlation coefficient in each cell.
3. Correlation coefficients sometimes run to five or more decimal digits. A good trick to shorten what is displayed and improve readability is to pass fmt='.3g' or fmt='.1g'. By default the function uses fmt='.2g', which shows two significant digits (this does not always mean two decimal places). Let's change the default to fmt='.1g'.
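Here is a minimal sketch of annot and fmt together. The DataFrame is random stand-in data with hypothetical column names, since the tutorial's own dataset is not reproduced here.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

fig, ax = plt.subplots()
# annot=True writes each coefficient into its cell;
# fmt=".1g" formats it with one significant digit.
sns.heatmap(corr, annot=True, fmt=".1g", ax=ax)

# The annotation strings that ended up on the plot.
labels = [t.get_text() for t in ax.texts]
print(labels)
```

In a plain script (outside a notebook), add plt.show() at the end to open the figure.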
For the rest of this tutorial, we will stick to the default fmt='.2g'.

Adjusting the axis (the measurement bar)

One obvious change, apart from the rescaling, is that the colors changed. This happens because center was changed from None to zero (the same applies to any other number). But this does not mean we can’t change the colors back, or to any other available color map. Let’s see how to do this.
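Before turning to color, the rescaling and recentering mentioned above can be sketched as follows. The values vmin=-1, vmax=1, and center=0 are an assumption that suits correlation coefficients, and the data is random stand-in data with hypothetical column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

fig, ax = plt.subplots()
# vmin/vmax pin the ends of the color scale to the full correlation range;
# center=0 puts the midpoint of the color map at zero correlation.
sns.heatmap(corr, vmin=-1, vmax=1, center=0, annot=True, ax=ax)

# The color limits actually applied to the heatmap mesh.
print(ax.collections[0].get_clim())
```

When center is set and no cmap is given, Seaborn picks a diverging default color map, which is why the colors differ from the plain plot.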

Aesthetics

Check here for more information on the available color codes.
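As a sketch of changing the colors, the cmap argument accepts the name of a Matplotlib color map. "coolwarm" is an arbitrary choice for illustration, and the data is random stand-in data with hypothetical column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

fig, ax = plt.subplots()
# cmap takes a Matplotlib color map name (or a colormap object).
sns.heatmap(corr, cmap="coolwarm", annot=True, ax=ax)

print(ax.collections[0].get_cmap().name)
```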
6. By default, the border thickness and border color of each cell in the matrix are set to 0 and white, respectively. Sometimes the heatmap looks better with thicker borders and a different color. This is where the arguments linewidths and linecolor apply. Let's set linewidths to 3 and linecolor to black.
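A minimal sketch of the border settings, again on random stand-in data with hypothetical column names:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

fig, ax = plt.subplots()
# linewidths is the width of the lines dividing the cells;
# linecolor is the color of those lines.
sns.heatmap(corr, linewidths=3, linecolor="black", annot=True, ax=ax)

mesh = ax.collections[0]
print(mesh.get_linewidth(), mesh.get_edgecolor())
```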
For the rest of this tutorial, we’ll switch back to the default cmap, linecolor, and linewidths. This can be done either by passing cmap=None, linecolor='white', and linewidths=0, or by not passing the arguments at all (which is what we’re going to do).
7. So far, the heatmap has displayed its color bar vertically. This can be changed to horizontal by passing the argument cbar_kws, a dictionary of settings for the color bar.
8. There may also be cases where a heatmap is better off without a color bar at all. This can be done by passing cbar=False.
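Both color bar options can be sketched side by side. The "orientation" key is forwarded to Matplotlib's colorbar call; the two-panel layout and the random stand-in data are just for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: horizontal color bar, configured through cbar_kws.
sns.heatmap(corr, cbar_kws={"orientation": "horizontal"}, ax=ax1)

# Right: no color bar at all.
sns.heatmap(corr, cbar=False, ax=ax2)

# Two heatmap axes plus one extra axes holding the single color bar.
print(len(fig.axes))
```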
For the rest of this tutorial, we will display the color bar.
9. Take a closer look at the shape of each cell in the matrix above. They’re all rectangular. We can make them square by passing square=True.
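A short sketch of square cells, using random stand-in data with hypothetical column names:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

# A deliberately wide figure, so the effect of square=True is visible.
fig, ax = plt.subplots(figsize=(10, 4))
# square=True sets an equal aspect ratio so every cell is drawn as a square.
sns.heatmap(corr, square=True, annot=True, ax=ax)

print(ax.get_aspect())
```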

Changing the matrix shape

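np.triu() is a NumPy function that returns the upper triangle of any matrix given to it (everything below the diagonal is set to zero), while np.tril() returns the lower triangle. Because the heatmap hides every cell where the mask is true, masking with np.triu() leaves the lower triangle visible, and masking with np.tril() leaves the upper triangle visible.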
The idea is to pass the correlation matrix into the NumPy method and then pass this into the mask argument in order to create a mask on the heatmap matrix. Let’s see how this works below.
First using the np.triu() function:
Then using the np.tril() function:
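Both masks can be sketched as follows, on random stand-in data with hypothetical column names. Converting the result to booleans with .astype(bool) is an extra step added here for clarity; it assumes no correlation in the hidden triangle is exactly zero, since a zero would not be masked.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: four random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

fig, (ax_lower, ax_upper) = plt.subplots(1, 2, figsize=(10, 4))

# np.triu keeps the upper triangle (diagonal included) and zeroes the rest.
# Cells where the mask is True are hidden, so only the lower triangle shows.
mask_triu = np.triu(corr).astype(bool)
sns.heatmap(corr, mask=mask_triu, annot=True, ax=ax_lower)

# np.tril keeps the lower triangle, so only the upper triangle shows.
mask_tril = np.tril(corr).astype(bool)
sns.heatmap(corr, mask=mask_tril, annot=True, ax=ax_upper)
```

Note that both masks include the diagonal, so the row of 1s (each variable's correlation with itself) is hidden as well.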

In conclusion

References

To learn more about improving the EDA process through visualization, check out this Dataquest tutorial (login required).


Thanks to Austin Kodra. 
