- 1 Definitions
- 2 Forward Stepwise Regression in R
- 3 Backward Stepwise Regression in R
- 4 Forward-Backward (or Both) Stepwise Regression in R
Here, we discuss stepwise regression in R, including forward, backward, and bi-directional (or forward-backward) steps.
Stepwise regression in R can be performed with the step() and lm() functions from the "stats" package in the base version of R.
For multiple regression modeling, stepwise regression can be used to perform variable selection among a set of variables by adding or dropping one variable at a time.
Sample Steps to Run a Stepwise Regression:
# Specify data
x1 = c(6.8, 4.7, 3.3, 4.3, 4.6, 7.4, 2.8)
x2 = c(7.2, 3.9, 6.7, 7.0, 5.4, 5.8, 6.6)
x3 = c(2.8, 1.8, 6.2, 3.6, 0.4, 4.5, 3.0)
x4 = c(6.4, 7.7, 7.9, 7.0, 7.4, 7.1, 6.4)
e = c(-0.01, -0.05, 0.04, -0.07, 0.05, 0.03, -0.02)
y = 3 + 4*x1 + 3*x3 + e
df_data = data.frame(y, x1, x2, x3, x4)
df_data
      y  x1  x2  x3  x4
1 38.59 6.8 7.2 2.8 6.4
2 27.15 4.7 3.9 1.8 7.7
3 34.84 3.3 6.7 6.2 7.9
4 30.93 4.3 7.0 3.6 7.0
5 22.65 4.6 5.4 0.4 7.4
6 46.13 7.4 5.8 4.5 7.1
7 23.18 2.8 6.6 3.0 6.4
# Specify the intercept model
intercept_model = lm(y ~ 1, data = df_data)
# Specify the guess model
guess_model = lm(y ~ x1 + x2, data = df_data)
# Specify the full model
full_model = lm(y ~ ., data = df_data)
# or
full_model = lm(y ~ x1 + x2 + x3 + x4, data = df_data)
Perform forward step-wise regression:
step(intercept_model, direction = 'forward', scope = formula(full_model), trace = 0)
Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3
      2.964        4.004        3.003
Perform backward step-wise regression:
step(full_model, direction = 'backward', trace = 0)
Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3
      2.964        4.004        3.003
Perform forward-backward step-wise regression:
step(guess_model, direction = 'both', scope = list(lower = formula(intercept_model), upper = formula(full_model)), trace = 0)
Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3
      2.964        4.004        3.003
Showing Model Iterations and Using BIC:
To show model iterations, set trace = 1. For BIC, set k = log(number of rows).
The results show that the algorithm only adds variables that reduce the criterion (BIC in this case, although labelled AIC in the output). If multiple variables reduce the criterion, it picks the one that reduces it the most.
When no remaining variable can reduce the criterion, the algorithm stops. This way, the final model is highly likely to be the combination of variables with the lowest value of the criterion (AIC or BIC).
# Perform forward step-wise regression with steps shown
# BIC is used when we set k = log(nrow(df_data))
step(intercept_model, direction = 'forward', scope = formula(full_model), trace = 1, k = log(nrow(df_data)))
Start:  AIC=30.95
y ~ 1

       Df Sum of Sq    RSS    AIC
+ x1    1   253.444 187.55 26.909
+ x3    1   164.437 276.56 29.627
<none>              440.99 30.948
+ x2    1    34.025 406.97 32.332
+ x4    1     1.493 439.50 32.870

Step:  AIC=26.91
y ~ x1

       Df Sum of Sq     RSS     AIC
+ x3    1   187.536   0.012 -38.605
<none>              187.549  26.909
+ x2    1    43.612 143.936  27.002
+ x4    1     4.962 182.587  28.667

Step:  AIC=-38.6
y ~ x1 + x3

       Df  Sum of Sq      RSS     AIC
<none>               0.012240 -38.605
+ x4    1 0.00116413 0.011076 -37.359
+ x2    1 0.00003159 0.012209 -36.677
Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3
      2.964        4.004        3.003
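The criterion values printed by step() are only defined up to an additive constant, so their levels differ from what R's BIC() function reports, but the ranking of candidate models is the same. As a quick self-contained sketch (recreating the simulated df_data from above), the selected model can be cross-checked directly with BIC():

```r
# Recreate the simulated data from above (so this snippet is self-contained)
x1 <- c(6.8, 4.7, 3.3, 4.3, 4.6, 7.4, 2.8)
x2 <- c(7.2, 3.9, 6.7, 7.0, 5.4, 5.8, 6.6)
x3 <- c(2.8, 1.8, 6.2, 3.6, 0.4, 4.5, 3.0)
x4 <- c(6.4, 7.7, 7.9, 7.0, 7.4, 7.1, 6.4)
e <- c(-0.01, -0.05, 0.04, -0.07, 0.05, 0.03, -0.02)
y <- 3 + 4*x1 + 3*x3 + e
df_data <- data.frame(y, x1, x2, x3, x4)

# BIC() is on a different scale than the trace above, but the model
# with the smallest BIC is the same one step() selected: y ~ x1 + x3
BIC(lm(y ~ x1, data = df_data))
BIC(lm(y ~ x1 + x3, data = df_data))
```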
Argument | Usage |
object | The lm model object to start from |
scope | Sets the range of variables applicable. This can be one model formula or a list of the upper and lower model formulas |
direction | Set to "both" for forward-backward steps, "backward", or "forward", with a default of "both". If scope is not set, the default is "backward" |
trace | Set to 1 to print the selection steps, set to 0 otherwise |
k | The multiple of the number of degrees of freedom used for the penalty. For AIC (the default), k = 2; for BIC, k = log(number of rows) |
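Putting the arguments together, here is a self-contained sketch (recreating the simulated df_data from above) of one call that starts from a guess model, takes forward-backward steps, and uses the BIC penalty:

```r
# Recreate the simulated data from above
x1 <- c(6.8, 4.7, 3.3, 4.3, 4.6, 7.4, 2.8)
x2 <- c(7.2, 3.9, 6.7, 7.0, 5.4, 5.8, 6.6)
x3 <- c(2.8, 1.8, 6.2, 3.6, 0.4, 4.5, 3.0)
x4 <- c(6.4, 7.7, 7.9, 7.0, 7.4, 7.1, 6.4)
e <- c(-0.01, -0.05, 0.04, -0.07, 0.05, 0.03, -0.02)
y <- 3 + 4*x1 + 3*x3 + e
df_data <- data.frame(y, x1, x2, x3, x4)

guess_model <- lm(y ~ x1 + x2, data = df_data)  # object: the starting model
full_model  <- lm(y ~ ., data = df_data)        # the upper scope

selected <- step(guess_model,
                 direction = 'both',            # forward-backward steps
                 scope = formula(full_model),   # upper scope only
                 trace = 0,                     # set to 1 to print each step
                 k = log(nrow(df_data)))        # BIC penalty
formula(selected)
```

With this simulated data, the search ends at y ~ x1 + x3, the true data-generating variables.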
Consider the set of predictors \(x_1, x_2, \cdots, x_p\), to be used to predict \(y\). We can define the following models below.
Full Model
The model with all the available variables:
\[\widehat y = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2 + \cdots + \widehat \beta_p x_p.\]
full_model = lm(y ~ ., data = dataset)
# or
full_model = lm(y ~ x1 + x2 + ... + xp, data = dataset)
Intercept Model (or Null Model)
The model with no variables; hence the prediction is the sample mean, \(\bar y\):
\[\widehat y = \bar y.\]
intercept_model = lm(y ~ 1, data = dataset)
Guess Model
A sub-model of the full model, containing a subset of the full set of variables, which may be a guess before variable selection:
\[\widehat y = \widehat \beta_0 + \widehat\beta_1 x_1 + \widehat \beta_2 x_2.\]
guess_model = lm(y ~ x1 + x2, data = dataset)
Dataset
Using the "USJudgeRatings" data from the "datasets" package with10 sample rows from 43 rows below:
USJudgeRatings
                CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H.    5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
ARMENTANO,A.J.   7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
BURNS,E.B.       6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6
HEALEY.A.H.      8.0  7.6  6.6  7.2  6.5  6.5  6.8  6.7  6.4  6.5  6.9  6.7
O'BRIEN,F.J.     7.1  8.5  8.3  8.0  7.9  7.9  7.8  7.8  7.8  7.7  8.3  8.2
O'SULLIVAN,T.J.  7.5  9.0  8.9  8.7  8.4  8.5  8.4  8.3  8.3  8.3  8.8  8.7
PASKEY,L.        7.5  8.1  7.7  8.2  8.0  8.1  8.2  8.4  8.0  8.1  8.4  8.1
SIDOR,W.J.       7.7  6.2  5.1  5.6  5.6  5.9  5.6  5.6  5.3  5.5  6.3  5.3
STAPLETON,J.F.   6.5  8.2  7.7  7.8  7.6  7.7  7.7  7.7  7.5  7.6  8.5  7.7
ZARRILLI,K.J.    8.6  7.4  7.0  7.5  7.5  7.7  7.4  7.2  6.9  7.0  7.8  7.1
NOTE: The dependent variable is "RTEN".
The data contains US judges' ratings by lawyers, with variable descriptions below.
Sign | Code | Variable Description |
\(y\) | RTEN | Worthy of retention |
\(x_1\) | CONT | Number of contacts of lawyer with judge |
\(x_2\) | INTG | Judicial integrity |
\(x_3\) | DMNR | Demeanor |
\(x_4\) | DILG | Diligence |
\(x_5\) | CFMG | Case flow management |
\(x_6\) | DECI | Prompt decisions |
\(x_7\) | PREP | Preparation for trial |
\(x_8\) | FAMI | Familiarity with law |
\(x_9\) | ORAL | Sound oral rulings |
\(x_{10}\) | WRIT | Sound written rulings |
\(x_{11}\) | PHYS | Physical ability |
Set the Defined Reference Models
# Specify the intercept model
intercept_model = lm(RTEN ~ 1, data = USJudgeRatings)
# Specify the guess model
guess_model = lm(RTEN ~ DMNR + DILG + CFMG, data = USJudgeRatings)
# Specify the full model
full_model = lm(RTEN ~ ., data = USJudgeRatings)
# or
full_model = lm(RTEN ~ CONT + INTG + DMNR + DILG + CFMG + DECI + PREP + FAMI + ORAL + WRIT + PHYS, data = USJudgeRatings)
In the forward selection, the following happens:
We start with the intercept model or a guess model (the guess model must be a subset of the full set of variables in the upper scope).
Then from the intercept model or the guess model, based on the AIC or BIC criterion, we repeatedly add one variable at a time, potentially adding variables up to the full set available (the upper scope), as long as each addition improves model quality.
Perform forward step-wise regression from the intercept model:
step(intercept_model, direction = 'forward', scope = formula(full_model), trace = 0)
Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         ORAL         DMNR         PHYS         INTG         DECI
    -2.2043       0.2917       0.1520       0.2829       0.3779       0.1667
The final selection includes only "ORAL", "DMNR", "PHYS", "INTG", and "DECI".
summary(lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings))
Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)

Residuals:
      Min        1Q    Median        3Q       Max
-0.240656 -0.069026 -0.009474  0.068961  0.246402

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.20433    0.43611  -5.055 1.19e-05 ***
ORAL         0.29169    0.10191   2.862 0.006887 **
DMNR         0.15199    0.06354   2.392 0.021957 *
PHYS         0.28292    0.04678   6.048 5.40e-07 ***
INTG         0.37785    0.10559   3.579 0.000986 ***
DECI         0.16672    0.07702   2.165 0.036928 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1119 on 37 degrees of freedom
Multiple R-squared:  0.9909,	Adjusted R-squared:  0.9897
F-statistic: 806.1 on 5 and 37 DF,  p-value: < 2.2e-16
If we want our model to include a specific set of variables, we could start with those variables in a guess model. For example, for "DMNR", "DILG", and "CFMG", we could use the guess model above.
Perform forward step-wise regression from the guess model:
step(guess_model, direction = 'forward', scope = formula(full_model), trace = 0)
Call:
lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG + 
    DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         DMNR         DILG         CFMG         PHYS         ORAL
   -2.22619      0.15132      0.01817     -0.10734      0.29224      0.30146
       INTG         DECI
    0.37467      0.24209
The final selection includes the guess variables, "DMNR", "DILG", and "CFMG", plus the additional variables "PHYS", "ORAL", "INTG", and "DECI".
Notice that, compared to the selection that started from the intercept model, not all the variables in the summary below are significant. "DILG" and "CFMG" are not significant, which is a disadvantage of not starting from the intercept model.
summary(lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG + DECI, data = USJudgeRatings))
Call:
lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG + 
    DECI, data = USJudgeRatings)

Residuals:
     Min       1Q   Median       3Q      Max
-0.24638 -0.06583 -0.01054  0.06288  0.25587

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.22619    0.45932  -4.847 2.55e-05 ***
DMNR         0.15132    0.06908   2.190  0.03524 *
DILG         0.01817    0.10274   0.177  0.86062
CFMG        -0.10734    0.12109  -0.886  0.38142
PHYS         0.29224    0.05101   5.730 1.75e-06 ***
ORAL         0.30146    0.10845   2.780  0.00869 **
INTG         0.37467    0.11885   3.152  0.00331 **
DECI         0.24209    0.12758   1.898  0.06603 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1137 on 35 degrees of freedom
Multiple R-squared:  0.9911,	Adjusted R-squared:  0.9893
F-statistic: 557.3 on 7 and 35 DF,  p-value: < 2.2e-16
In the backward selection, the following happens:
We start with the full model, the set of variables that can possibly be included in the final selection.
Then from the full model, based on the AIC or BIC criterion, we repeatedly drop one variable at a time. We may drop every variable and arrive at the intercept model if no variable improves model quality, or arrive at a final selection that keeps specific variables if those variables are set in a lower scope model formula. The lower scope must be a subset of the full set of variables in the full model.
Perform backward step-wise regression from the full model:
step(full_model, direction = 'backward', trace = 0)
Call:
lm(formula = RTEN ~ INTG + DMNR + DECI + ORAL + PHYS, data = USJudgeRatings)

Coefficients:
(Intercept)         INTG         DMNR         DECI         ORAL         PHYS
    -2.2043       0.3779       0.1520       0.1667       0.2917       0.2829
The final selection includes only "INTG", "DMNR", "DECI", "ORAL", and "PHYS". This is the same set as selected by the forward stepwise selection above, in a different order.
If we want our model to include a specific set of variables, we could keep those variables by setting them in a lower scope model. For example, for "DMNR", "DILG", and "CFMG", we could use the guess model above.
Perform backward step-wise regression with the guess model as the lower scope:
step(full_model, direction = 'backward', scope = list(lower = formula(guess_model)), trace = 0)
Call:
lm(formula = RTEN ~ INTG + DMNR + DILG + CFMG + DECI + ORAL + 
    PHYS, data = USJudgeRatings)

Coefficients:
(Intercept)         INTG         DMNR         DILG         CFMG         DECI
   -2.22619      0.37467      0.15132      0.01817     -0.10734      0.24209
       ORAL         PHYS
    0.30146      0.29224
The final selection includes the guess variables, "DMNR", "DILG", and "CFMG", plus the additional variables "INTG", "DECI", "ORAL", and "PHYS". This is the same set as selected by the forward stepwise selection that started with the guess model, in a different order.
Again, compared to the selection that had no lower scope model set, not all the variables in the selection are significant. "DILG" and "CFMG" are not significant, which is a disadvantage of arbitrarily setting a lower scope.
In the forward-backward (or both) selection, the following happens:
We start with a guess model; this could be the intercept model, the full model, or a subset of the full model.
Then from the guess model, based on the AIC or BIC criterion, we repeatedly add or drop one variable at a time, adding variables to the guess model or dropping variables from it whenever a step improves model quality. The lower scope must be a subset of the guess model, which in turn must be a subset of the full model.
Perform forward-backward step-wise regression starting from the intercept model:
step(intercept_model, direction = 'both', scope = list(lower = formula(intercept_model), upper = formula(full_model)), trace = 0)
Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         ORAL         DMNR         PHYS         INTG         DECI
    -2.2043       0.2917       0.1520       0.2829       0.3779       0.1667
Perform forward-backward step-wise regression starting from the guess model:
step(guess_model, direction = 'both', scope = list(lower = formula(intercept_model), upper = formula(full_model)), trace = 0)
Call:
lm(formula = RTEN ~ DMNR + PHYS + ORAL + INTG + DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         DMNR         PHYS         ORAL         INTG         DECI
    -2.2043       0.1520       0.2829       0.2917       0.3779       0.1667
Perform forward-backward step-wise regression starting from the full model:
step(full_model, direction = 'both', scope = list(lower = formula(intercept_model), upper = formula(full_model)), trace = 0)
Call:
lm(formula = RTEN ~ INTG + DMNR + DECI + ORAL + PHYS, data = USJudgeRatings)

Coefficients:
(Intercept)         INTG         DMNR         DECI         ORAL         PHYS
    -2.2043       0.3779       0.1520       0.1667       0.2917       0.2829
The final selections are all the same, including only "DMNR", "PHYS", "ORAL", "INTG", and "DECI". This is the same set as selected by the forward stepwise selection above.
In the forward-backward stepwise selection, as in the backward selection, variables can be forced into the final selection by setting them in the lower scope model; without a lower scope, any variable may be added or dropped within the upper scope.
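As a sketch of forcing variables to stay during a 'both' search (recreating the USJudgeRatings reference models defined above), placing the guess model in the lower scope guarantees that "DMNR", "DILG", and "CFMG" appear in every model step() considers:

```r
# USJudgeRatings ships with base R's "datasets" package
guess_model <- lm(RTEN ~ DMNR + DILG + CFMG, data = USJudgeRatings)
full_model  <- lm(RTEN ~ ., data = USJudgeRatings)

# Forward-backward search that never drops below the guess model
kept <- step(full_model,
             direction = 'both',
             scope = list(lower = formula(guess_model),
                          upper = formula(full_model)),
             trace = 0)
formula(kept)  # DMNR, DILG, and CFMG are guaranteed to appear
```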