Monday, September 7, 2020

Machine Learning With SQL — It’s Easier Than You Think

 If you’ve been studying data science, it’s likely you know how to perform machine learning tasks in languages like Python, R, and Julia. But what can you do when speed is the key, the hardware is limited, or the company you work for treats SQL as the only option for predictive analytics? In-database machine learning is the answer.

Image for post
Photo by hao wang on Unsplash

We’ll use Oracle Cloud for this article. It’s free, so please register and create an instance of the OLTP database (Version 19c, has 0.2TB of storage). Once done, download the cloud wallet and establish a connection through SQL Developer — or any other tool.

This will take you 10 minutes at least but is a fairly straightforward thing to do, so I won’t waste time on it.

We’ll use Oracle Machine Learning (OML) to train a classification model on the well-known Iris dataset. I’ve chosen it because it doesn’t require any preparation — we only need to create the table and insert the data.

Let’s do that next.


Data preparation

As mentioned, we need to create a table for holding the Iris dataset, and then we need to load data to it. OML requires one column to be used as row ID (sequence), so let’s keep that in mind:

CREATE SEQUENCE seq_iris;
CREATE TABLE iris_data(
iris_id NUMBER DEFAULT seq_iris.NEXTVAL,
sepal_length NUMBER,
sepal_width NUMBER,
petal_length NUMBER,
petal_width NUMBER,
species VARCHAR2(16)
);

Awesome! Now we can download the data and load it:

Image for post

When a modal window pops-up simply provide a path to the downloaded CSV and click Next a couple of times. SQL Developer should get things right without your assistance.

Once done, we have our dataset loaded and prepared:

Image for post

Let’s continue with the fun part now.


Model training

Now we can get our hands dirty with the fun stuff, and that’s training the classification model. This is broken down into multiple steps, such as train/test split, model training, and model evaluation. Let’s start with the simplest one.

Train/test split

Oracle likes this step done with two views — one for training data and one for testing data. We can easily create those with a bit of PL/SQL magic:

BEGIN
EXECUTE IMMEDIATE
‘CREATE OR REPLACE VIEW
iris_train_data AS
SELECT * FROM iris_data
SAMPLE (75) SEED (42)’;
EXECUTE IMMEDIATE
‘CREATE OR REPLACE VIEW
iris_test_data AS
SELECT * FROM iris_data
MINUS
SELECT * FROM iris_train_data’;
END;
/

This script does two things:

  1. Creates a train view — has 75% of data (SAMPLE (75)) split at the random seed 42 ( SEED (42))
  2. Creates a test view — as a difference of the entire dataset and the training view

Our data is stored in views named iris_train_data and iris_test_data — you guess which one holds what.

Let’s quickly check how many rows are in each:

SELECT COUNT(*) FROM iris_train_data;
>>> 111

SELECT COUNT(*) FROM iris_test_data;
>>> 39

We are ready to train the model, so let’s do that next.

Model training

The easiest method for model training is through DBMS_DATA_MINING package, with a single procedure execution, and without the need for creating additional settings tables.

We’ll use the Decision Tree algorithm to train our model. Here’s how:

DECLARE 
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst(‘PREP_AUTO’) := ‘ON’;
v_setlst(‘ALGO_NAME’) := ‘ALGO_DECISION_TREE’;

DBMS_DATA_MINING.CREATE_MODEL2(
‘iris_clf_model’,
‘CLASSIFICATION’,
‘SELECT * FROM iris_train_data’,
v_setlst,
‘iris_id’,
‘species’
);
END;
/

The CREATE_MODEL2 procedure (curious why it wasn’t named CREATE_MODEL_FINAL_FINAL89) accepts a lot of parameters. Let’s explain the ones we entered:

  • iris_clf_model — simply the name of your model. Can be anything
  • CLASSIFICATION — type of machine learning task we’re doing. Must be uppercase for some reason
  • SELECT * FROM iris_train_data — specifies where the training data is stored
  • v_setlst — above declared settings list for our model
  • iris_id — name of the sequence type column (each value is unique)
  • species — name of the target variable (what we’re trying to predict)

Executing this block will take a second or two, but once done it’s ready for evaluation!

Model evaluation

Let’s use this script to evaluate our model:

BEGIN
DBMS_DATA_MINING.APPLY(
‘iris_clf_model’,
‘iris_test_data’,
‘iris_id’,
‘iris_apply_result’
);
END;
/

It applies iris_clf_model to the unseen test data iris_test_data and stores evaluation results into a iris_apply_result table. Here’s how this table looks like:

Image for post

It has many more rows (39 x 3), but you get the point. This still isn’t the most straightforward thing to look at, so let’s show the results in a slightly different way:

DECLARE 
CURSOR iris_ids IS
SELECT DISTINCT(iris_id) iris_id
FROM iris_apply_result
ORDER BY iris_id;
curr_y VARCHAR2(16);
curr_yhat VARCHAR2(16);
num_correct INTEGER := 0;
num_total INTEGER := 0;
BEGIN
FOR r_id IN iris_ids LOOP
BEGIN
EXECUTE IMMEDIATE
‘SELECT species FROM
iris_test_data
WHERE iris_id = ‘ || r_id.iris_id
INTO curr_y;
EXECUTE IMMEDIATE
‘SELECT prediction
FROM iris_apply_result
WHERE iris_id = ‘ || r_id.iris_id ||
‘AND probability = (
SELECT MAX(probability)
FROM iris_apply_result
WHERE iris_id = ‘ || r_id.iris_id ||
‘)’ INTO curr_yhat;
END;

num_total := num_total + 1;
IF curr_y = curr_yhat THEN
num_correct := num_correct + 1;
END IF;
END LOOP;

DBMS_OUTPUT.PUT_LINE(‘Num. test cases: ‘
|| num_total);
DBMS_OUTPUT.PUT_LINE(‘Num. correct : ‘
|| num_correct);
DBMS_OUTPUT.PUT_LINE(‘Accuracy : ‘
|| ROUND((num_correct / num_total), 2));
END;
/

Yes, it’s a lot, but the script above can’t be any simpler. Let’s break it down:

  • CURSOR — gets all distinct iris_ids (because we have them duplicated in iris_apply_results table
  • curr_ycurr_yhatnum_correctnum_total are variables for storing actual species and predicted species at every iteration, number of correct classifications, and total number of test items
  • For every unique iris_id we get the actual species (from iris_test_data, where ids match) and the predicted species (where prediction probability is the highest in iris_apply_results table)
  • Then it’s easy to check if the actual and predicted values are identical — which indicates the classification is correct
  • Variables num_total and num_correct are updated at every iteration
  • Finally, we print the model’s performance to the console

Here’s the output for this script:

Image for post

Awesome! To interpret:

  • The test set has 39 cases
  • Of these 39, 37 were classified correctly
  • Which results in the 95% accuracy

And that’s pretty much it for the model evaluation.


Before you go

And there you have it — machine learning project written from scratch in SQL. Not all of us have the privilege to work with something like Python on our job, and if a machine learning task comes on your desk you now know how to solve it via SQL.

This was just a simple classification task, of course, and scripts can be improved further, but you get the point. I hope you’ve managed to follow along. For any questions and comments, please refer to the comment section.

Thanks for reading.



Originally published at https://www.betterdatascience.com on September 6, 2020.

No comments:

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...