Linear regression: advanced techniques

Advanced techniques to take your linear regression game to the next level

Quick intro

This article aims to introduce some more advanced linear regression techniques that can dramatically improve your model results. If you haven’t mastered the basics of linear regression yet, I suggest two other articles that can help you with that: the first one introduces some fundamental concepts to understand linear regression, while the second one explains how you can spot and correct some common issues that come up while doing linear regressions.

Polynomial regression

As its name suggests, a linear regression implies a linear relationship between two variables, of the form y = ax + b. However, it is often the case that the relationship between those two variables is not linear, but follows some other function that takes the squared (or some higher power) value of x. In that case, one simple solution is to add to your explanatory variables the transformations you want to check (x squared, x cubed, and so on). This determines the degree of your polynomial (if you go up to x to the fifth, for instance, you have a polynomial of the fifth degree). Once you have your full list of explanatory variables, you can perform your linear regression as usual, as in the sketch below. It is a simple transformation, but it can yield some really good results.
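As a minimal sketch, assuming the dataset used throughout this article (a data frame called data with price and yr_built columns), adding a squared term could look like this; the column name yr_built_2 matches the variable used in the later examples:
# Add a squared transformation of yr_built as an extra explanatory variable
data$yr_built_2 = data$yr_built^2
# Fit the polynomial (degree 2) regression with the usual lm() call
model_poly = lm(price ~ yr_built + yr_built_2, data = data)
summary(model_poly)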

Fixing overfitting

Ridge regression is a special type of linear regression that tries to reduce overfitting by penalising large coefficients, shrinking the weights of parameters that might be insignificant in reality.
To do this in R, we use the glmnet() function and specify the parameter alpha = 0 to make it a Ridge regression. We also have to specify the regularisation parameter lambda, which determines how “strict” our regression will be with the weights it assigns. The optimal value of lambda can be found using cross-validation:
IN:
library(glmnet)
library(tidyverse)

y = data$price
x = data %>% select(sqft_living, floors, yr_built, yr_built_2) %>% data.matrix()

model6 = glmnet(x, y, alpha = 0)
lambdas = 10^seq(10, -10, by = -.2)
cv_fit = cv.glmnet(x, y, alpha = 0, lambda = lambdas)
plot(cv_fit)
This plot shows the lambda that minimises the cross-validation error, but we can also fetch it directly.
IN:
opt_lambda = cv_fit$lambda.min
opt_lambda
OUT:
6.309573e-07
IN:
model6 = cv_fit$glmnet.fit
plot(model6, xvar = "lambda")
legend("left", lwd = 1, col = 1:6, legend = colnames(x), cex = .5)
This plot shows how the coefficient weights evolve for each variable as the value of lambda changes.
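To actually use the fitted Ridge model, we can extract its coefficients and predictions at the optimal lambda. The following is a minimal sketch, assuming the x, y and cv_fit objects defined above (pred_ridge is just an illustrative name):
# Coefficients of the Ridge fit at the cross-validated optimal lambda
coef(cv_fit, s = "lambda.min")
# Predictions on the training matrix, using that same lambda
pred_ridge = predict(cv_fit, newx = x, s = "lambda.min")
head(pred_ridge)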
LASSO regression works in a similar way to Ridge regression, except that instead of giving small weights to non-significant parameters, it gives them a weight of exactly 0, meaning it cuts them out of the regression entirely.
In R, we will also use the glmnet() function, but with alpha set to 1 this time, to make it a LASSO regression:
IN:
cv_fit = cv.glmnet(x, y, alpha = 1, lambda = lambdas)
opt_lambda = cv_fit$lambda.min
model7 = cv_fit$glmnet.fit
plot(model7, xvar = "lambda")
legend("left", lwd = 1, col = 1:6, legend = colnames(x), cex = .5)
This is basically the same as the previous plot, but notice how the weights converge to zero much more abruptly. We can check which variables were actually dropped by looking at the coefficients at the optimal lambda, as in the sketch below.
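A minimal sketch, assuming the cv_fit and opt_lambda objects from the LASSO fit above; variables that LASSO removed show up as exact zeros (printed as dots):
# Coefficients of the LASSO fit at the optimal lambda: dropped variables are zeroed out
coef(cv_fit, s = opt_lambda)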
Elastic net regularisation is, essentially, a combination of the Ridge and LASSO penalties. In practice, this means setting a value for alpha between 0 and 1, which sets some of the coefficients to 0, as in LASSO regression, while shrinking the remaining ones, as in Ridge regression. The caret package can tune both alpha and lambda by cross-validation:
IN:
library(caret)
set.seed(111)
model8 = train(
  price ~., data = data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
model8$bestTune
OUT:
   alpha   lambda
94     1 1468.049
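In this run, caret selected alpha = 1 (a pure LASSO penalty) with lambda ≈ 1468. A minimal sketch of how the tuned model could then be used, assuming the model8 object above (pred_enet is just an illustrative name):
# Coefficients of the final glmnet model at the tuned lambda
coef(model8$finalModel, s = model8$bestTune$lambda)
# Predictions from the caret-trained model on the original data
pred_enet = predict(model8, newdata = data)
head(pred_enet)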
In the previous methods, we assigned weights to each parameter to reduce overfitting. In stepwise regression, we instead try multiple combinations of parameters and exclude those that are not significant. One way of doing this is to start with all the variables and then remove the least important ones, one by one. Another way is to start with one variable and add the important ones, one by one. Finally, we can combine both approaches. This choice is made by setting the parameter “direction” in the stepAIC() function, which can take the values “backward”, “forward” and “both”. The importance of a variable can be measured in many different ways; here we will use the one calculated by the stepAIC() function, which is based on the Akaike Information Criterion.
IN:
library(MASS)
model8 = lm(price~., data = data)
model8 = stepAIC(model8, direction = "both", 
                      trace = FALSE)
summary(model8)
OUT:
Call:
lm(formula = price ~ id + bedrooms + bathrooms + sqft_living + 
    sqft_lot + floors + waterfront + view + condition + grade + 
    sqft_above + yr_built + yr_renovated + zipcode + lat + long + 
    sqft_living15 + sqft_lot15 + yr_built_2, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1317688   -99197    -9464    76111  4340354

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.177e+07  7.674e+06  11.958  < 2e-16 ***
id            -1.485e-06  4.812e-07  -3.086 0.002032 ** 
bedrooms      -3.298e+04  1.900e+03 -17.358  < 2e-16 ***
bathrooms      3.554e+04  3.276e+03  10.846  < 2e-16 ***
sqft_living    1.510e+02  4.371e+00  34.537  < 2e-16 ***
sqft_lot       1.244e-01  4.783e-02   2.601 0.009307 ** 
floors        -1.376e+04  3.974e+03  -3.463 0.000536 ***
waterfront     5.860e+05  1.730e+04  33.868  < 2e-16 ***
view           5.474e+04  2.138e+03  25.599  < 2e-16 ***
condition      3.053e+04  2.371e+03  12.876  < 2e-16 ***
grade          9.579e+04  2.146e+03  44.640  < 2e-16 ***
sqft_above     3.033e+01  4.346e+00   6.979 3.05e-12 ***
yr_built      -8.830e+04  7.167e+03 -12.320  < 2e-16 ***
yr_renovated   2.263e+01  3.652e+00   6.196 5.89e-10 ***
zipcode       -6.053e+02  3.293e+01 -18.382  < 2e-16 ***
lat            6.099e+05  1.072e+04  56.908  < 2e-16 ***
long          -2.203e+05  1.312e+04 -16.785  < 2e-16 ***
sqft_living15  2.175e+01  3.436e+00   6.330 2.51e-10 ***
sqft_lot15    -3.518e-01  7.330e-02  -4.800 1.60e-06 ***
yr_built_2     2.189e+01  1.831e+00  11.955  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 200600 on 21593 degrees of freedom
Multiple R-squared:  0.7018, Adjusted R-squared:  0.7016 
F-statistic:  2675 on 19 and 21593 DF,  p-value: < 2.2e-16
We are left with only some of the initial variables, and they are all significant.
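As a quick sanity check, you can compare the AIC of the full model with that of the stepwise-selected one; a minimal sketch, assuming the data and model8 objects above (full_model is just an illustrative name, refitted here because the earlier full fit was overwritten by stepAIC):
# Refit the full model and compare AIC values; lower is better
full_model = lm(price ~ ., data = data)
AIC(full_model)
AIC(model8)  # the stepAIC-selected model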

You can access the full script here.
