Stepwise Regression Essentials in R

kassambara | 11/03/2018 | Model Selection Essentials in R

Stepwise regression (or stepwise selection) consists of iteratively adding and removing predictors in the predictive model, in order to find the subset of variables in the data set that results in the best performing model, that is, the model with the lowest prediction error.
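To make the iterative idea concrete, here is a minimal sketch of the "adding" direction only, written by hand with base R's add1(). It is not how the packages used later in this chapter are implemented. The choice of AIC as the criterion and the object names (predictors, full.formula, current) are assumptions made for illustration.

```r
# Illustration only: a hand-written forward search on the built-in swiss data
predictors <- setdiff(names(swiss), "Fertility")
full.formula <- reformulate(predictors, response = "Fertility")

# Start from the intercept-only model
current <- lm(Fertility ~ 1, data = swiss)

repeat {
  remaining <- setdiff(predictors, attr(terms(current), "term.labels"))
  if (length(remaining) == 0) break   # nothing left to add
  # AIC of the current model (row "<none>") and of each one-predictor extension
  candidates <- add1(current, scope = full.formula)
  best <- which.min(candidates$AIC)
  if (best == 1) break                # no addition lowers the AIC
  best.term <- rownames(candidates)[best]
  current <- update(current, as.formula(paste(". ~ . +", best.term)))
}

# Formula of the model reached by the search
formula(current)
```

In practice you would rely on the dedicated functions described below rather than on a loop like this one. The sketch is only meant to show what "iteratively adding predictors" amounts to.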

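There are three strategies of stepwise regression (James et al. 2014; Bruce and Bruce 2017):

Forward selection, which starts with no predictors in the model and iteratively adds the most contributive predictors
Backward selection (or backward elimination), which starts with all predictors in the model (the full model) and iteratively removes the least contributive predictors
Stepwise selection (or sequential replacement), which is a combination of forward and backward selection: predictors are added as in forward selection, and after each addition any variable that no longer improves the model fit is removed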

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
leaps for computing stepwise regression

library(tidyverse)
library(caret)
library(leaps)

Computing stepwise regression

There are many functions and R packages for computing stepwise regression. These include:

stepAIC() [MASS package], which chooses the best model by AIC. It has an option named direction, which can take the following values: “both” (for stepwise regression, combining forward and backward selection), “backward” (for backward selection) and “forward” (for forward selection). It returns the best final model.

library(MASS)
# Fit the full model
full.model <- lm(Fertility ~., data = swiss)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(step.model)
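If you want to see which steps the search took, and not only the final model, the object returned by stepAIC() carries an anova component recording the path. The following sketch refits the same models so that it runs on its own, then compares the AIC of the full and the selected models.

```r
library(MASS)

full.model <- lm(Fertility ~ ., data = swiss)
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)

# Terms dropped or added at each step, with the AIC reached after the step
step.model$anova

# AIC of the full model versus the selected model (lower is better)
AIC(full.model, step.model)

# Predictors kept by the search
attr(terms(step.model), "term.labels")
```

Note that loading MASS after tidyverse masks dplyr::select(); call dplyr::select() explicitly if you need it afterwards.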

regsubsets() [leaps package], which has the tuning parameter nvmax specifying the maximal number of predictors to incorporate in the model (see the chapter on best subsets regression). It returns multiple models of different sizes, up to nvmax. You need to compare the performance of the different models to choose the best one. regsubsets() has the option method, which can take the values “backward”, “forward” and “seqrep” (seqrep = sequential replacement, a combination of forward and backward selection).

models <- regsubsets(Fertility~., data = swiss, nvmax = 5, method = "seqrep")
summary(models)
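Since regsubsets() returns several candidate models, you still have to pick one. A quick way to do so, sketched below, is to compare the criteria stored in the summary object (adjusted R2, Mallows' Cp and BIC). The name res.sum is an assumption made for illustration. Which criterion to trust is a judgment call, and the cross-validation approach shown later in this chapter is another option.

```r
library(leaps)

models <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5, method = "seqrep")
res.sum <- summary(models)

# One row per model returned by regsubsets()
data.frame(
  n.predictors = rowSums(res.sum$which[, -1, drop = FALSE]),  # intercept excluded
  adj.r2 = res.sum$adjr2,   # higher is better
  cp     = res.sum$cp,      # lower is better
  bic    = res.sum$bic      # lower is better
)

# Row (model) preferred by each criterion
c(adj.r2 = which.max(res.sum$adjr2),
  cp     = which.min(res.sum$cp),
  bic    = which.min(res.sum$bic))
```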

Note that the train() function [caret package] provides an easy workflow to perform stepwise selection using the leaps and the MASS packages. It has an option named method, which can take the following values:

"leapBackward", to fit linear regression with backward selection
"leapForward", to fit linear regression with forward selection
"leapSeq", to fit linear regression with stepwise selection.

You also need to specify the tuning parameter nvmax, which corresponds to the maximum number of predictors to be incorporated in the model.

For example, you can vary nvmax from 1 to 5. In this case, the function searches for the best model of each size, up to the best 5-variable model. That is, it searches for the best 1-variable model, the best 2-variable model, …, the best 5-variable model.

The following example performs backward selection (method = "leapBackward"), using the swiss data set, to identify the best model for predicting Fertility on the basis of socio-economic indicators.

As the data set contains only 5 predictors, we’ll vary nvmax from 1 to 5, resulting in the identification of the 5 best models of different sizes: the best 1-variable model, the best 2-variable model, …, the best 5-variable model.

We’ll use 10-fold cross-validation to estimate the average prediction error (RMSE) of each of the 5 models (see the chapter on cross-validation). The RMSE is used to compare the 5 models and to automatically choose the best one, where best is defined as the model that minimizes the RMSE.

# Set seed for reproducibility
set.seed(123)
# Set up 10-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~., data = swiss,
                    method = "leapBackward",
                    tuneGrid = data.frame(nvmax = 1:5),
                    trControl = train.control
                    )
step.model$results

##   nvmax RMSE Rsquared  MAE RMSESD RsquaredSD MAESD
## 1     1 9.30    0.408 7.91   1.53      0.390  1.65
## 2     2 9.08    0.515 7.75   1.66      0.247  1.40
## 3     3 8.07    0.659 6.55   1.84      0.216  1.57
## 4     4 7.27    0.732 5.93   2.14      0.236  1.67
## 5     5 7.38    0.751 6.03   2.23      0.239  1.64

The output above shows different metrics and their standard deviations for comparing the accuracy of the 5 best models. Columns are:

nvmax: the number of variables in the model. For example, nvmax = 2 specifies the best 2-variable model
RMSE and MAE: two different metrics measuring the prediction error of each model. The lower the RMSE and MAE, the better the model.
Rsquared: the squared correlation between the observed outcome values and the values predicted by the model. The higher the R squared, the better the model.
RMSESD, RsquaredSD and MAESD: the standard deviations of these metrics across the 10 cross-validation folds
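To see what these three metrics measure, you can compute them by hand in base R. The sketch below uses an arbitrary two-predictor model chosen only for illustration, and evaluates it on the training data, so the numbers are in-sample values and are not comparable to the cross-validated ones in the table above.

```r
# Illustration only: in-sample metrics for an arbitrary two-predictor model
fit  <- lm(Fertility ~ Education + Catholic, data = swiss)
obs  <- swiss$Fertility
pred <- predict(fit, newdata = swiss)

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error
rsq  <- cor(obs, pred)^2             # squared correlation (caret's default definition)

c(RMSE = rmse, MAE = mae, Rsquared = rsq)
```

RMSE squares the errors before averaging, so it penalizes large errors more heavily than MAE does.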

In our example, the model with 4 variables (nvmax = 4) has the lowest RMSE. You can display the best tuning value (nvmax), automatically selected by the train() function, as follows:

step.model$bestTune

##   nvmax
## 4     4
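You can check this choice yourself from the results table. The sketch below reruns the training so that it is self-contained, looks up the row with the smallest cross-validated RMSE, and shows the per-fold metrics of the selected model. The name res is an assumption made for illustration.

```r
library(caret)

set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
step.model <- train(Fertility ~ ., data = swiss,
                    method = "leapBackward",
                    tuneGrid = data.frame(nvmax = 1:5),
                    trControl = train.control)

res <- step.model$results

# Row with the smallest cross-validated RMSE; its nvmax matches bestTune
res[which.min(res$RMSE), c("nvmax", "RMSE", "RMSESD")]

# Per-fold metrics of the selected model (one row per fold)
step.model$resample
```

Comparing the RMSE differences between model sizes with the RMSESD column gives a sense of how clear-cut the choice is.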

This indicates that the best model is the one with nvmax = 4 variables. The function summary() reports the best set of variables for each model size, up to the best 4-variable model.

summary(step.model$finalModel)

## Subset selection object
## 5 Variables (and intercept)
##                  Forced in Forced out
## Agriculture          FALSE      FALSE
## Examination          FALSE      FALSE
## Education            FALSE      FALSE
## Catholic             FALSE      FALSE
## Infant.Mortality     FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          Agriculture Examination Education Catholic Infant.Mortality
## 1  ( 1 ) " "         " "         "*"       " "      " "
## 2  ( 1 ) " "         " "         "*"       "*"      " "
## 3  ( 1 ) " "         " "         "*"       "*"      "*"
## 4  ( 1 ) "*"         " "         "*"       "*"      "*"

An asterisk indicates that a given variable is included in the corresponding model. For example, the best 4-variable model contains Agriculture, Education, Catholic and Infant.Mortality (Fertility ~ Agriculture + Education + Catholic + Infant.Mortality).

The regression coefficients of the final model (id = 4) can be accessed as follows:

coef(step.model$finalModel, 4)

Or, by computing the linear model using only the selected predictors:

lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic  
##           62.101            -0.155            -0.980             0.125  
## Infant.Mortality  
##            1.078
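The two approaches should give the same coefficients. A small sketch to check this, calling regsubsets() directly instead of going through caret so that it runs on its own (the names coef.leaps, selected and fit are assumptions made for illustration):

```r
library(leaps)

models <- regsubsets(Fertility ~ ., data = swiss, nvmax = 4, method = "backward")
coef.leaps <- coef(models, 4)

# Refit with lm() using whichever predictors regsubsets() selected
selected <- setdiff(names(coef.leaps), "(Intercept)")
fit <- lm(reformulate(selected, response = "Fertility"), data = swiss)

# Side-by-side comparison of the two sets of coefficients
cbind(leaps = coef.leaps, lm = coef(fit)[names(coef.leaps)])
all.equal(unname(coef.leaps), unname(coef(fit)[names(coef.leaps)]))
```

Refitting with lm() has the practical advantage that summary(fit) also reports standard errors and p-values, which coef() on a regsubsets object does not.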

Discussion

This chapter describes stepwise regression methods for choosing an optimal, simpler model without compromising model accuracy.

We have demonstrated how to use the leaps R package for computing stepwise regression. An alternative is the function stepAIC(), available in the MASS package. It has an option called direction, which can take the following values: “both”, “forward”, “backward”.

library(MASS)
res.lm <- lm(Fertility ~., data = swiss)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step
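Two variations on this call are worth knowing, sketched here with object names chosen for illustration. First, direction = "forward" only makes sense when you start from a small model and supply a scope, because a full model has nothing left to add. Second, the k argument of stepAIC() sets the penalty per parameter, and k = log(n) gives a BIC-style criterion that tends to favor smaller models.

```r
library(MASS)

null.model <- lm(Fertility ~ 1, data = swiss)
full.model <- lm(Fertility ~ ., data = swiss)

# Forward selection: start from the intercept-only model and
# allow the search to grow up to the full model
forward.model <- stepAIC(
  null.model,
  scope = list(lower = formula(null.model), upper = formula(full.model)),
  direction = "forward", trace = FALSE
)

# Stepwise search with a BIC-style penalty, k = log(n), instead of the default k = 2
bic.model <- stepAIC(full.model, direction = "both",
                     k = log(nrow(swiss)), trace = FALSE)

formula(forward.model)
formula(bic.model)
```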

Additionally, the caret package has a method to compute stepwise regression using the MASS package (method = "lmStepAIC"):

# Train the model
step.model <- train(Fertility ~., data = swiss,
                    method = "lmStepAIC",
                    trControl = train.control,
                    trace = FALSE
                    )
# Model accuracy
step.model$results
# Final model coefficients
step.model$finalModel
# Summary of the model
summary(step.model$finalModel)

Stepwise regression is very useful for high-dimensional data containing multiple predictor variables. Other alternatives are penalized regression (ridge and lasso regression; see the chapter on penalized regression) and principal component-based regression methods (PCR and PLS; see the chapter on PCR and PLS regression).

References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.
