Monday, April 13, 2020

Beginner’s Guide to Data Science Libraries in Python


NumPy

NumPy is the fundamental library for numerical computing in Python, and most data scientists learn it first. It provides an N-dimensional array type along with fast, vectorized mathematical functions, which lets it process large amounts of data quickly. The output below comes from a quick example that computes the sine and cosine of an array of angles.
Sample of using NumPy for scientific calculations
Input:  [0, 1.5707963267948966, 3.141592653589793, 4.71238898038469, 6.283185307179586]
Sine values:  [ 0.0000000e+00  1.0000000e+00  1.2246468e-16 -1.0000000e+00 -2.4492936e-16]
Cosine values:  [ 1.0000000e+00  6.1232340e-17 -1.0000000e+00 -1.8369702e-16  1.0000000e+00]
Sine values:  [ 0.  1.  0. -1. -0.]
Cosine values:  [ 1.  0. -1. -0.  1.]
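The code that produced this output is not shown in the post. The following is a minimal sketch of what it might have looked like, assuming the angles are multiples of pi/2 and that the second pair of results was rounded with np.round; the variable names are my own.

```python
import numpy as np

# Angles from 0 to 2*pi in steps of pi/2 (assumed from the printed input values)
angles = [0, np.pi / 2, np.pi, 3 * np.pi / 2, 2 * np.pi]
print("Input: ", angles)

# NumPy applies sin and cos to every element at once
sine = np.sin(angles)
cosine = np.cos(angles)
print("Sine values: ", sine)
print("Cosine values: ", cosine)

# Rounding removes the tiny floating-point errors near zero
sine_rounded = np.round(sine)
cosine_rounded = np.round(cosine)
print("Sine values: ", sine_rounded)
print("Cosine values: ", cosine_rounded)
```

The values such as 1.2246468e-16 are floating-point approximations of zero, which is why the rounded arrays are easier to read.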

pandas

pandas is the fundamental library for data analysis and manipulation in Python. It can quickly read large raw data files into a DataFrame object, perform a wide range of data cleaning and data mining operations with automatic indexing and data alignment, run SQL-style operations such as joins and merges on the DataFrame, and then write the result to another data file or plot it directly.
df = pd.DataFrame('some data...')
df.plot(x='label1', y='label2', kind='scatter', ...)
Sample scatterplot of Area vs. Population [source]
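To make the read, clean, join, and export steps more concrete, here is a small sketch. The city data, column names, and the regions table are all made up for illustration, and the CSV is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical raw data; a real workflow would call pd.read_csv on a file path
raw_csv = io.StringIO(
    "city,area_km2,population\n"
    "Avalon,120.5,48000\n"
    "Brook,,15000\n"
    "Cedar,310.0,97000\n"
)
cities = pd.read_csv(raw_csv)

# Data cleaning: drop rows where the area is missing
cities = cities.dropna(subset=["area_km2"])

# SQL-style inner join against a second table on the shared "city" column
regions = pd.DataFrame({"city": ["Avalon", "Cedar"], "region": ["North", "South"]})
merged = cities.merge(regions, on="city", how="inner")

# Derived column computed element-wise across the whole table
merged["density"] = merged["population"] / merged["area_km2"]

# Export the result as CSV text (pass a file path to write to disk instead)
csv_text = merged.to_csv(index=False)
print(csv_text)
```

With matplotlib installed, merged.plot(x="area_km2", y="population", kind="scatter") would draw a scatterplot of these two columns.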

SciPy

The SciPy library builds on NumPy and is a core part of the wider SciPy stack. It includes many numerical routines for numerical integration, interpolation, optimization, linear algebra, statistics, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering. Compared with NumPy, it is geared more towards the specialized mathematical functions that scientists and engineers use in their calculations. To learn more, visit the official documentation site.
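As a quick illustration of two of these routines, the sketch below integrates sin(x) from 0 to pi and finds the minimum of a simple quadratic. The functions being integrated and minimized are arbitrary examples chosen for this sketch, not taken from the post.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the exact integral of sin(x) over [0, pi] is 2
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Optimization: minimize (x - 2)^2, whose minimum is at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

print("Integral of sin from 0 to pi:", area)
print("Estimated integration error:", abs_error)
print("Minimizer of (x - 2)^2:", result.x)
```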

Scikit-Learn

The Scikit-Learn library builds on NumPy and SciPy and is more practical and application-focused. It provides tools for machine learning tasks such as regression, classification, and clustering. Anyone interested in machine learning will want to learn what this library provides. For example, you can run a RANSAC linear regression on a set of raw data with the following code:
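import numpy as np
from sklearn import datasets, linear_model

# Generate random linear data (the sample sizes here are assumed, not from the original post)
n_samples = 1000
n_outliers = 50
X, y = datasets.make_regression(n_samples=n_samples, n_features=1, noise=10, random_state=0)

# Replace the first few samples with a small set of outliers
np.random.seed(0)
X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -3 + 10 * np.random.normal(size=n_outliers)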

# Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)

# Robustly fit linear model with RANSAC algorithm
ransac = linear_model.RANSACRegressor()
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

# Predict data of estimated models
line_X = np.arange(X.min(), X.max())[:, np.newaxis]
line_y = lr.predict(line_X)
line_y_ransac = ransac.predict(line_X)
Scatterplot of Raw data and Linear Regression vs. RANSAC Regression of data [source]
For splitting data into training and test sets, Scikit-Learn also provides sklearn.model_selection.train_test_split(*arrays, **options).
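train_test_split splits arrays into random train and test subsets. A minimal sketch of typical use follows; the data is made up for illustration, and the 30% test size and the choice of LinearRegression are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data that follows y = 3x + 1 exactly
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# Hold out 30% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training portion only, then score on the held-out portion
model = LinearRegression().fit(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])
print("R^2 on test set:", test_score)
```

Scoring on data the model never saw during fitting gives a more honest estimate of how well it generalizes.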

matplotlib.pyplot

Although this library has nothing to do with the analytics portion of data science, it is still a key library to learn and use. Its pyplot interface is modeled on MATLAB's plotting functionality. It can generate anything from simple scatterplots, histograms, and line plots to complex heatmaps, 3D plots, ellipses, streamplots, and more. Some examples of these plots are below:
Sample scatterplot with color-coding [source]
Sample 3D plot [source]
Sample visualization of 2D array [source]
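A color-coded scatterplot like the first sample can be sketched as follows. The random data, the colormap, and the labels are my own choices for illustration, and the non-interactive Agg backend is assumed so that the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: random points colored by their distance from the origin
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
distance = np.sqrt(x**2 + y**2)

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=distance, cmap="viridis")
fig.colorbar(points, ax=ax, label="distance from origin")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Color-coded scatterplot")

# Render to an in-memory PNG; pass a file name to savefig to write to disk instead
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```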

Conclusion

There are many Python libraries that I did not cover, including libraries targeted towards bioinformatics, deep learning and AI, self-driving, and more. However, the libraries outlined here are widely used in data science and are the building blocks of many advanced Python libraries. I believe that becoming familiar with these libraries will build a strong foundation for a beginner who wants to explore the field of data science. If there are any other common Python libraries that you think would be useful to learn, please share them in the comments below.
