Sunday, August 23, 2020

How to Convert Pandas Dataframe to Keras RNN and Back to Pandas for Multivariate Regression Problems

 

This post provides straightforward Python code that takes data in a pandas dataframe and outputs predictions in the same format using a keras RNN LSTM model.



The problem I encountered was rather common (I think): taking data in a pandas dataframe and making predictions using a time series regression model with a keras RNN, where I have more than one independent X (AKA features or predictors) and one dependent y. To be more precise, the problem was not building the model but converting the data from a pandas dataframe to the format that an RNN model (in keras) requires, and then getting the predictions from the keras model back as a pandas dataframe.

It felt that wherever I looked for a solution, I got explanations of how RNNs work or solutions for univariate regression problems. Hence, I will try to keep this post as concise and focused as possible. The code here assumes that you have already done all the necessary preprocessing (e.g., data cleansing, feature engineering, etc.) and have a ready-for-analysis time series in a pandas dataframe.

What you WILL NOT find here:

· Theoretical explanations

· Nice illustrations of an RNN model

· Preprocessing techniques

· A complicated or sophisticated model

There are many sources online that explain these topics well. I highly recommend checking the StackOverflow questions and Jason Brownlee’s Machine Learning Mastery blog.

What you WILL find here:

· Straightforward Python code that takes a pandas dataframe and outputs predictions in the same format using a keras RNN LSTM model for multivariate regression problems.

This post walks through snippets of code with explanations, and the full code is provided in one block at the end.

Let’s begin:

Step 1, let’s import all needed packages and check the keras version

As you can see, my keras version is 2.3.1, so if you have some problems with the code I post here, please check that you have the same version or a higher one.

Step 2, read the data

Image for post

As you can see, I have 32,128 rows and four columns, with one y and three X. The code here works with any number of X columns, including just one. Note that you need to define your y column in order to make things easier and more generic.

Optional step — plot the data

Image for post

I know, the resolution is not great here, but you get the idea of what my data looks like.

Step 3, split the data into train and test

Please note here the comment in line #3. Let’s plot again to see if our split makes sense.

Image for post

Again, don’t mind the resolution, it’s not important. The plot looks good, the past (blue) is our training data and recent dates are our test data (orange).
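As a minimal sketch of the chronological split described in step 3, here is a toy version that runs on its own. The column names, the random values and the daily datetime index are assumptions for illustration only; the real data comes from the pickle file used in the full code at the end of the post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data: one target 'y' and three features.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((100, 4)),
    columns=["y", "x1", "x2", "x3"],
    index=pd.date_range("2020-01-01", periods=100, freq="D"),  # assumed datetime index
)

test_size = int(len(df) * 0.1)        # last 10% of the rows become the test set
train = df.iloc[:-test_size].copy()   # older rows
test = df.iloc[-test_size:].copy()    # most recent rows

print(train.shape, test.shape)
# Every training timestamp should come before every test timestamp.
print(train.index.max() < test.index.min())
```

The split is positional rather than random, so the row order of the dataframe must already be chronological for it to make sense.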

Step 4, separate X and y only for the training data. We will handle the test data later.

Now, the X_train looks like this:

Image for post

and the y_train looks like this:

Image for post

Step 5, scale and prepare X and y data for keras

This part requires some explanation. Here we convert the data from a pandas dataframe to numpy arrays, which is what keras requires. In lines 1–8 we first scale X and y using the sklearn MinMaxScaler, so that they range from 0 to 1. The next lines do some shape manipulation on y to make it usable by keras. We need the shape of y to be (n, ), where n is the number of rows. Line 12 “pushes” the y data one step forward by adding a zero in the first position, and line 13 keeps the length of y unchanged by deleting the last time step (last row). Here is a simplified example of what happens in lines 12–13:

#let's say y = [1,2,3,4]
# y = np.insert(y,0,0) --> [0,1,2,3,4]
# y = np.delete(y,-1) --> [0,1,2,3]

If this explanation is not clear enough, I refer you to Jason Brownlee’s blog post Multivariate Time Series Forecasting with LSTMs in Keras. Look for the section with the title: Multivariate Inputs and Dependent Series Example.

To sum up the shape manipulation of y, let’s have a quick look at what happened. We started with the y data as a dataframe:

Image for post

And now it should look like this:

array([0.        , 0.12779697, 0.12401905, ..., 0.59237795, 0.6018512 , 0.61132446])
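To make the scale, reshape and shift sequence concrete, here is a small sketch on four toy values. The min-max scaling is written out by hand as a stand-in for sklearn's MinMaxScaler (the post itself uses the scaler), and the numbers are made up for illustration.

```python
import numpy as np

# Toy target column shaped (n, 1), like the values of a one-column dataframe.
y = np.array([[10.0], [20.0], [15.0], [30.0]])

# Min-max scaling to [0, 1], written out by hand (stand-in for MinMaxScaler).
y_min, y_max = y.min(axis=0), y.max(axis=0)
scaled = (y - y_min) / (y_max - y_min)   # still shape (4, 1)

flat = scaled.reshape(-1)                # shape (4,)
shifted = np.insert(flat, 0, 0)          # prepend a zero -> length 5
shifted = np.delete(shifted, -1)         # drop the last value -> length 4 again

print(scaled.shape, flat.shape, shifted.shape)
print(shifted)
```

The leading zero in the printed array plays the same role as the 0. at the start of the array shown above.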

Step 6, combine X and y using the keras TimeseriesGenerator

The TimeseriesGenerator transforms the separate X and y into a structure of samples ready for training deep learning models. I recommend printing the shape of the generator's output to make sure it worked. The shape should be (batch_size, n_input, n_features), exactly as shown in line 8 of step 6.
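The following plain-NumPy sketch mimics the pairing that TimeseriesGenerator is documented to produce, where the window data[i:i+length] goes with targets[i+length]. It is not the Keras implementation: make_windows is a hypothetical helper, and the tiny sizes are chosen only so the shapes are easy to read.

```python
import numpy as np

def make_windows(data, targets, length):
    """Pair each window data[i:i+length] with targets[i+length] (hypothetical helper)."""
    n_windows = len(data) - length
    X = np.stack([data[i:i + length] for i in range(n_windows)])
    y = np.array([targets[i + length] for i in range(n_windows)])
    return X, y

n_rows, n_features, n_input, b_size = 10, 3, 4, 4
data = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)
y = np.arange(n_rows, dtype=float)              # y[t] == t, so each target reveals its row
shifted_y = np.delete(np.insert(y, 0, 0), -1)   # the same one-step shift as in step 5

X, t = make_windows(data, shifted_y, n_input)
batches = [X[i:i + b_size] for i in range(0, len(X), b_size)]

print(X.shape)                       # all windows: (n_rows - n_input, n_input, n_features)
print([b.shape for b in batches])    # the last batch can be smaller than b_size
print(t)                             # which original row each target came from
```

With the shifted y, the target paired with the window over rows 0 to 3 is the y of row 3, that is, the y of the last row inside the window. Without the shift it would be the y of row 4, the row right after the window.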

The hard part of converting our data from pandas dataframe to something ready to use for deep learning models is behind us. Now we can move on to step 7, instantiate the model:

Note that I used here a very simple model, with only one hidden layer and without a dropout layer. This is because I wanted to keep this post concise and the actual model architecture is not the focus here. But feel free to experiment with more layers.

Step 8, fit the model and plot the losses

Image for post

Also here, I used simple settings, only 5 epochs, just to illustrate the entire process.

Now the model is ready to use and we can make predictions on the test set.

The first line generates the X_test data by dropping y from the test set, because we do not want the y data to be included in X. Then we scale X_test with the MinMaxScaler that was fitted on X_train earlier. Line 3 is important because we need to create a TimeseriesGenerator for the test data. I struggled with this part because in the examples I saw, y_test was passed in here, but I do not want the model to have any knowledge whatsoever of the y_test data. I did not want the slightest chance of data leakage that would result in biased predictions. Thanks to the great help I received from Marco Cerliani on StackOverflow, I understood that the second argument of the TimeseriesGenerator (the targets, here y_test) is not used when predicting, so the actual values of y_test don’t matter in this specific place. You can therefore pass a dummy y_test: an array of zeros with the same length as the actual y_test data. The rest of the TimeseriesGenerator call is the same as for the training data, and here too I printed the shape to make sure it’s OK.
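Here is a small sketch of the two ideas in this step: scaling the test features with statistics learned from the training features only, and building a dummy target array of zeros. The scaling is again written by hand as a stand-in for the fitted MinMaxScaler, and all values are made up.

```python
import numpy as np

# Toy feature matrices (hypothetical values): 5 training rows, 4 test rows, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
X_test = np.array([[4.0, 45.0], [5.0, 50.0], [6.0, 60.0], [7.0, 55.0]])

# "Fit" on the training data only: remember its per-column min and max.
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)

def scale(X):
    """Apply the training-set min/max to any feature matrix."""
    return (X - col_min) / (col_max - col_min)

scaled_X_test = scale(X_test)

# Dummy targets: zeros with one entry per test row, carrying no information about y_test.
dummy_y_test = np.zeros(len(X_test))

print(scaled_X_test)
print(dummy_y_test.shape)
```

Because the min and max come from the training data, scaled test values can fall outside the 0 to 1 range, as they do here. That is expected and is the price of not letting the test set influence the scaling.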

Line 10 calls the predict method and line 11 rescales the predictions. Remember that we earlier scaled the y data to between 0 and 1, so we need to scale it back. In line 12 we construct a dataframe from y_true and y_pred. Note that we only take a subset of y_true (test[y_col].values[n_input:]). This is because the model needs n_input timesteps (rows or observations) before it can start predicting, so it consumes the first n_input rows (in this case 25 timesteps) of X_test and only then starts to predict. For example, if we had 50 timesteps in our test set (50 rows or observations), we would get only 25 predictions, because the first 25 rows are used as the input window that we set.
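The rescaling and the row count can be checked with a toy example. In this sketch the "predictions" are invented numbers shaped like the model output, and the inverse of min-max scaling is written by hand instead of calling the scaler's inverse_transform. The y_min and y_max values are assumed training-set statistics.

```python
import numpy as np
import pandas as pd

n_input = 3
# Hypothetical toy test target and assumed training-set min/max of y.
y_test = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0])
y_min, y_max = 50.0, 250.0

# Pretend model output: one scaled prediction per window, shape (n_windows, 1).
y_pred_scaled = np.array([[0.40], [0.45], [0.50], [0.55], [0.60]])

# Undo min-max scaling: y = scaled * (max - min) + min.
y_pred = y_pred_scaled * (y_max - y_min) + y_min

results = pd.DataFrame({
    "y_true": y_test[n_input:],   # skip the first n_input rows used as the first window
    "y_pred": y_pred.ravel(),     # flatten (n_windows, 1) to (n_windows,)
})
print(results)
```

With 8 test rows and n_input = 3 there are 5 windows, so both columns have 5 entries and the dataframe can be built without a length mismatch.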

Now we have our results in a nice pandas dataframe structure:

Image for post

And we can plot them using results.plot();:

Image for post

That’s it, we began with data in a pandas dataframe format and finished with predictions in the same format.

That’s the entire code in one block

I hope this is useful and will help you in your machine learning missions. Please write to me below if you have any comments.


import pandas as pd
import numpy as np
import keras
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
df = pd.read_pickle(r'C:\....\data.pkl') # read data
y_col='y' # define y variable, i.e., what we want to predict
test_size = int(len(df) * 0.1) # here I ask that the test data will be 10% (0.1) of the entire data
train = df.iloc[:-test_size,:].copy() # the copy() here is important, it will prevent us from getting: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_index,col_indexer] = value instead
test = df.iloc[-test_size:,:].copy()
X_train = train.drop(y_col,axis=1).copy()
y_train = train[[y_col]].copy() # the double brackets here keep y in dataframe format, otherwise it would be a pandas Series
Xscaler = MinMaxScaler(feature_range=(0, 1)) # scale so that all the X data will range from 0 to 1
Xscaler.fit(X_train)
scaled_X_train = Xscaler.transform(X_train)
Yscaler = MinMaxScaler(feature_range=(0, 1))
Yscaler.fit(y_train)
scaled_y_train = Yscaler.transform(y_train)
scaled_y_train = scaled_y_train.reshape(-1) # remove the second dimension from y so the shape changes from (n,1) to (n,)
scaled_y_train = np.insert(scaled_y_train, 0, 0) # push y one step forward by adding a zero in the first position
scaled_y_train = np.delete(scaled_y_train, -1) # delete the last value so y keeps its original length
n_input = 25 # how many samples/rows/timesteps to look at in the past in order to forecast the next sample
n_features= X_train.shape[1] # how many predictors/Xs/features we have to predict y
b_size = 32 # Number of timeseries samples in each batch
generator = TimeseriesGenerator(scaled_X_train, scaled_y_train, length=n_input, batch_size=b_size)
model = Sequential()
model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
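model.fit_generator(generator,epochs=5) # on newer Keras/TensorFlow versions, where fit_generator is deprecated, model.fit(generator, epochs=5) accepts the generator directly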
X_test = test.drop(y_col,axis=1).copy()
scaled_X_test = Xscaler.transform(X_test)
test_generator = TimeseriesGenerator(scaled_X_test, np.zeros(len(X_test)), length=n_input, batch_size=b_size)
y_pred_scaled = model.predict(test_generator)
y_pred = Yscaler.inverse_transform(y_pred_scaled)
results = pd.DataFrame({'y_true':test[y_col].values[n_input:],'y_pred':y_pred.ravel()})
