Saturday, February 19, 2022

Machine Learning With Spark

 T

Image by Author using Canva.com
  • Machine Learning Basic Concepts
  • Preprocessing and Data Transformation using Spark
  • Spark Clustering with pyspark
  • Classification with pyspark
  • Regression methods with pyspark

What is Apache Spark?

According to Apache Spark and Delta Lake Under the Hood

Image by Author

Setting up Spark 3.0.1 in the Google Colaboratory

As a first step, I configure the google colab runtime with spark installation. For details, readers may read my article Getting Started Spark 3.0.0 in Google Colab om medium.

# Run below commands
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
!pip install -q findspark

Environment Variable

After installing the spark and Java, set the environment variables where Spark and Java are installed.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"

Spark Installation test

Let us test the installation of spark in our google colab environment.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# Test the spark
df = spark.createDataFrame([{"hello": "world"} for x in range(1000)])

df.show(3, False)

Machine Learning

Once, we have set up the spark in google colab and made sure it is running with the correct version i.e. 3.0.1 in this case, we can start exploring the machine learning API developed on top of Spark. PySpark is a higher level Python API to use spark with python. For this tutorial, I assume the readers have a basic understanding of Machine Learning and SK-Learn for model building and training. Spark MLlib used the same fit and predict structure as in SK-Learn.

Data Preparation and Transformations in Spark

This section covers the basic steps involved in transformations of input feature data into the format Machine Learning algorithms accept. We will be covering the transformations coming with the SparkML library. To understand or read more about the available spark transformations in 3.0.3, follow the below link.

Normalize Numeric Data

MinMaxScaler is one of the favorite classes shipped with most machine learning libraries. It scaled the data between 0 and 1.

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

Standardize Numeric Data

StandardScaler is another well-known class written with machine learning libraries. It normalizes the data between -1 and 1 and converts the data into bell-shaped data. You can demean the data and scale to some variance.

from pyspark.ml.feature import  StandardScaler
from pyspark.ml.linalg import Vectors

Bucketize Numeric Data

The real data sets come with various ranges and sometimes it is advisable to transform the data into well-defined buckets before plugging into machine learning algorithms.

from pyspark.ml.feature import  Bucketizer
from pyspark.ml.linalg import Vectors

Tokenize text Data

Natural Language Processing is one of the main applications of Machine learning. One of the first steps for NLP is tokenizing the text into words or token. We can utilize the Tokenizer class with SparkML to perform this task.

from pyspark.ml.feature import  Tokenizer

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Using the above-tokenized data, Let us apply the TF-IDF

from pyspark.ml.feature import HashingTF, IDF

Clustering Using PySpark

Clustering is a machine learning technique where the data is grouped into a reasonable number of classes using the input features. In this section, we study the basic application of clustering techniques using the spark ML framework.

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans, BisectingKMeans
import glob
# Read the data.
clustering_file_name ='clustering_dataset.csv'
import pandas as pd
# df = pd.read_csv(clustering_file_name)
cluster_df = spark.read.csv(clustering_file_name, header=True,inferSchema=True)
# Coverting the input data into features column
vectorAssembler = VectorAssembler(inputCols = ['col1', 'col2', 'col3'], outputCol = "features")
vcluster_df = vectorAssembler.transform(cluster_df)
# Applying the k-means algorithm
kmeans = KMeans().setK(3)
kmeans = kmeans.setSeed(1)
kmodel = kmeans.fit(vcluster_df)
centers = kmodel.clusterCenters()
print("The location of centers: {}".format(centers))
# Applying Hierarchical Clustering
bkmeans = BisectingKMeans().setK(3)
bkmeans = bkmeans.setSeed(1)

Classification Using PySpark

Classification is one of the widely used Machine algorithms and almost every data engineer and data scientist must know about these algorithms. Once the data is loaded and prepared, I will demonstrate three classification algorithms.

  1. Multi-Layer Perceptron Classification
  2. Decision Trees Classification
# Downloading the clustering data
!wget -q "https://raw.githubusercontent.com/amjadraza/blogs-data/master/spark_ml/iris.csv"
png
spark.createDataFrame(df, columns)

Preprocessing the Iris Data

In this section, we will be using the IRIS data to understand the classification. To perform ML models, we apply the preprocessing step on our input data.

from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

Naive Bayes Classification

Once the data is prepared, we are ready to apply the first classification algorithm.

from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
nbaccuracy = evaluator.evaluate(predictions_df)
nbaccuracy

Multilayer Perceptron Classification

The second classifier we will be investigating is a Multi-layer perceptron. In this tutorial, I am not going into details of the optimal MLP network for this problem however in practice, you research the optimal network suitable to the problem in hand.

from pyspark.ml.classification import MultilayerPerceptronClassifier

Decision Trees Classification

Another common classifier in the ML family is the Decision Tree Classifier, in this section, we explore this classifier.

from pyspark.ml.classification import DecisionTreeClassifier

Regression using PySpark

In this section, we explore the Machine learning models for regression problems using pyspark. Regression models are helpful in predicting future values using past data.

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

Linear Regression

We start with the simplest regression technique i.e. Linear Regression.

# Define and fit Linear Regression
lr = LinearRegression(featuresCol="features", labelCol="PE")
lr_model = lr.fit(vpp_df)

Decision Tree Regression

In this section, we explore the Decision Tree Regression commonly used in Machine learning.

from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator

Gradient Boosting Decision Tree Regression

Gradient Boosting is another common choice among ML professionals. Let us try the GBM in this section.

from pyspark.ml.regression import GBTRegressor

A working Google Colab

Conclusions

In this tutorial, I have tried to give the readers an opportunity to learn and implement basic Machine Learning algorithms using PySpark. Spark not only provide the benefit of distributed processing but also can handle a large amount of data to be processing. To summarise, we have covered below topics/algorithms

  • Overview of Data Transformations using PySpark
  • Clustering algorithms using PySpark
  • Classification problems using PySpark
  • Regression Problems using PySpark

References Readings/Links

  1. https://spark.apache.org/docs/latest/ml-features.html
  2. https://spark.apache.org/docs/3.0.1/ml-classification-regression.html#regression
  3. https://spark.apache.org/docs/3.0.1/ml-clustering.html
  4. https://spark.apache.org/docs/3.0.1/ml-classification-regression.html#classification



No comments:

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...