Linear regression: beyond the basics
Advanced techniques to take your linear regression game to the next level
Quick intro
This article introduces some more advanced linear regression techniques that can dramatically improve your model results. If you haven’t mastered the basics of linear regression yet, I suggest two other articles that can help you with that: the first one introduces the fundamental concepts behind linear regression, while the second one explains how to spot and correct some common issues that come up when running linear regressions.
Polynomial regression
As its name suggests, a linear regression implies a linear relationship between two variables, of the form y = ax + b. However, it is often the case that the relationship between the two variables is not linear, but follows some other function involving the square (or some other power) of x. In that case, one simple solution is to add to your explanatory variables the transformations you want to check (x squared, x cubed, and so on). This determines the degree of your polynomial (if you go up to x to the fifth power, for instance, you have a polynomial of the fifth degree). Once you have your full list of explanatory variables, you perform the linear regression as usual. It is a simple transformation, but it can yield very good results.
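As a rough sketch of what this looks like in R, here is one way to do it on the housing data used later in this article (assuming, as the later code suggests, that yr_built_2 is simply yr_built squared; the name model_poly is just for illustration):

IN:
library(tidyverse)

# add the squared term as an extra explanatory variable
data = data %>% mutate(yr_built_2 = yr_built^2)

# then run the linear regression as usual, including the new column
model_poly = lm(price ~ yr_built + yr_built_2, data = data)
summary(model_poly)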
Fixing overfitting
Ridge regression
Ridge regression is a special type of linear regression that tries to reduce overfitting by penalising the size of the coefficients (an L2 penalty), shrinking the influence of parameters that might be insignificant in reality.
To do this in R, we use the glmnet() function and set the parameter alpha = 0 to make it a Ridge regression. We also have to specify the regularisation parameter lambda, which determines how “strict” the regression is when shrinking the coefficients. The optimal value can be found by cross-validation:
IN:
library(glmnet)
library(tidyverse)

y = data$price
x = data %>% select(sqft_living, floors, yr_built, yr_built_2) %>% data.matrix()

model6 = glmnet(x, y, alpha = 0)

lambdas = 10^seq(10, -10, by = -.2)
cv_fit = cv.glmnet(x, y, alpha = 0, lambda = lambdas)
plot(cv_fit)
IN:
opt_lambda = cv_fit$lambda.min
opt_lambda

OUT:
6.309573e-07

IN:
model6 = cv_fit$glmnet.fit
plot(model6, xvar = "lambda")
legend("left", lwd = 1, col = 1:6, legend = colnames(x), cex = .5)
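Once the optimal lambda is known, a possible next step (a sketch, not part of the original script) is to extract the coefficients and fitted values of the Ridge model at that value:

IN:
# coefficients of the Ridge model at the cross-validated lambda
coef(model6, s = opt_lambda)

# predictions on the training matrix at the same lambda
y_hat = predict(model6, newx = x, s = opt_lambda)
head(y_hat)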
LASSO regression
LASSO regression works in a similar way to Ridge regression, except that instead of merely shrinking the coefficients of non-significant parameters, it can set them exactly to 0, effectively cutting those variables out of the regression.
In R, we will also use the glmnet() function, but with alpha set to 1 this time, to make it a LASSO regression:
IN:
cv_fit = cv.glmnet(x, y, alpha = 1, lambda = lambdas)
opt_lambda = cv_fit$lambda.min

model7 = cv_fit$glmnet.fit
plot(model7, xvar = "lambda")
legend("left", lwd = 1, col = 1:6, legend = colnames(x), cex = .5)
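To see which variables the LASSO actually removed, one option (again a sketch beyond the original script) is to print the coefficients at the optimal lambda; the variables that were cut out show up as "." (i.e. exactly zero):

IN:
# coefficients at the optimal lambda; dropped variables appear as "."
coef(model7, s = opt_lambda)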
Elastic net
Elastic net regularisation is, essentially, a linear combination of the Ridge and LASSO penalties. In practice, this means setting a value for alpha between 0 and 1, which allows the model to set some coefficients to 0, as in LASSO regression, while shrinking the others, as in Ridge regression.
IN:
library(caret)
set.seed(111)
model8 = train(
  price ~ ., data = data,
  method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
model8$bestTune

OUT:
   alpha   lambda
94     1 1468.049
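If you want to inspect the coefficients that this tuned elastic net keeps, one possible follow-up (not shown in the original script) is to take the coefficients of caret's final glmnet fit at the selected lambda:

IN:
# coefficients of the final glmnet model at the tuned lambda
coef(model8$finalModel, model8$bestTune$lambda)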
Stepwise regression
In the previous methods, we penalised the coefficients of the parameters to reduce overfitting. In stepwise regression, we instead try multiple combinations of parameters and exclude those that are not significant. One way of doing that is to start with all variables and remove the least important ones, one by one. Another way is to start with one variable and add the important variables one by one. Finally, we can do a combination of both. This choice is made by setting the parameter “direction” in the stepAIC() function, which can take the values “backward”, “forward” and “both”. The importance of a variable can be measured in many different ways; here we will use the one calculated by stepAIC(), which relies on the Akaike Information Criterion.
IN:
library(MASS)
model8 = lm(price ~ ., data = data)
model8 = stepAIC(model8, direction = "both", trace = FALSE)
summary(model8)

OUT:
Call:
lm(formula = price ~ id + bedrooms + bathrooms + sqft_living +
    sqft_lot + floors + waterfront + view + condition + grade +
    sqft_above + yr_built + yr_renovated + zipcode + lat + long +
    sqft_living15 + sqft_lot15 + yr_built_2, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-1317688   -99197    -9464    76111  4340354

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    9.177e+07  7.674e+06  11.958  < 2e-16 ***
id            -1.485e-06  4.812e-07  -3.086 0.002032 **
bedrooms      -3.298e+04  1.900e+03 -17.358  < 2e-16 ***
bathrooms      3.554e+04  3.276e+03  10.846  < 2e-16 ***
sqft_living    1.510e+02  4.371e+00  34.537  < 2e-16 ***
sqft_lot       1.244e-01  4.783e-02   2.601 0.009307 **
floors        -1.376e+04  3.974e+03  -3.463 0.000536 ***
waterfront     5.860e+05  1.730e+04  33.868  < 2e-16 ***
view           5.474e+04  2.138e+03  25.599  < 2e-16 ***
condition      3.053e+04  2.371e+03  12.876  < 2e-16 ***
grade          9.579e+04  2.146e+03  44.640  < 2e-16 ***
sqft_above     3.033e+01  4.346e+00   6.979 3.05e-12 ***
yr_built      -8.830e+04  7.167e+03 -12.320  < 2e-16 ***
yr_renovated   2.263e+01  3.652e+00   6.196 5.89e-10 ***
zipcode       -6.053e+02  3.293e+01 -18.382  < 2e-16 ***
lat            6.099e+05  1.072e+04  56.908  < 2e-16 ***
long          -2.203e+05  1.312e+04 -16.785  < 2e-16 ***
sqft_living15  2.175e+01  3.436e+00   6.330 2.51e-10 ***
sqft_lot15    -3.518e-01  7.330e-02  -4.800 1.60e-06 ***
yr_built_2     2.189e+01  1.831e+00  11.955  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 200600 on 21593 degrees of freedom
Multiple R-squared:  0.7018, Adjusted R-squared:  0.7016
F-statistic:  2675 on 19 and 21593 DF,  p-value: < 2.2e-16
We are left with only some of the initial variables, and they are all significant.
You can access the full script here.