StepReg: Stepwise Regression Analysis (2024)

Junhui Li1, Kai Hu1, Xiaohuan Lu2 and Lihua Julie Zhu1

1University of Massachusetts Chan Medical School, Worcester, USA
2Clark University, Worcester, USA

2 August 2024

Abstract

The StepReg package, developed for exploratory model building tasks, offers support across diverse scenarios. It facilitates model construction for various response variable types, including continuous (linear regression), binary (logistic regression), and time-to-event (Cox regression), among others. StepReg encompasses all commonly used model selection strategies, including forward selection, backward elimination, bidirectional elimination, and best subsets. Notably, it offers flexibility in selection metrics, accommodating both information criteria (AIC, BIC, etc.) and significance level cutoffs. This vignette provides numerous examples showcasing the effective use of StepReg for model development in diverse contexts. Furthermore, it discusses considerations for selecting appropriate strategies and metrics, so that users can make informed decisions throughout the modeling process.

Package

StepReg 1.5.2

Model selection is the process of choosing the most relevant features from a set of candidate variables. This procedure is crucial because it ensures that the final model is both accurate and interpretable while being computationally efficient and avoiding overfitting. Stepwise regression algorithms iteratively add or remove features from the model based on certain criteria (e.g., significance level or P-value, information criteria like AIC or BIC, etc.). The process continues until no further improvements can be made according to the chosen criterion. At the end of the stepwise procedure, you’ll have a final model that includes the selected features and their coefficients.

StepReg simplifies model selection tasks by providing a unified programming interface. It currently supports model building for five distinct response variable types (section 3.1), four model selection strategies (section 3.2) including the best subsets algorithm, and a variety of selection metrics (section 3.3). Moreover, StepReg detects and addresses multicollinearity issues if they exist (section 3.4). The output of StepReg includes multiple tables summarizing the final model and the variable selection procedures. Additionally, StepReg offers a plot function to visualize the selection steps (section 4). For demonstration, the vignette includes four use cases covering distinct regression scenarios (section 5). Non-programmers can access the tool through the interactive Shiny app detailed in section 6.

The following example selects an optimal linear regression model with the mtcars dataset.

library(StepReg)

data(mtcars)
formula <- mpg ~ .
res <- stepwise(formula = formula,
                data = mtcars,
                type = "linear",
                include = c("qsec"),
                strategy = "bidirection",
                metric = c("AIC"))

Breakdown of the parameters:

  • formula: specifies the dependent and independent variables
  • type: specifies the regression category; depending on your data, choose from “linear”, “logit”, “cox”, etc.
  • include: specifies the variables that must be in the final model
  • strategy: specifies the model selection strategy; choose from “forward”, “backward”, “bidirection”, “subset”
  • metric: specifies the model fit evaluation metric; choose one or more from “AIC”, “AICc”, “BIC”, “SL”, etc.

The output consists of multiple tables, which can be viewed with:

res
Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Parameter                        Value
—————————————————————————————————————————————
 included variable                qsec
 strategy                         bidirection
 metric                           AIC
 tolerance of multicollinearity   1e-07
 multicollinearity variable       NULL
 intercept                        1
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable_type   Variable_name   Variable_class
——————————————————————————————————————————————
 Dependent       mpg             numeric
 Independent     cyl             numeric
 Independent     disp            numeric
 Independent     hp              numeric
 Independent     drat            numeric
 Independent     wt              numeric
 Independent     qsec            numeric
 Independent     vs              numeric
 Independent     am              numeric
 Independent     gear            numeric
 Independent     carb            numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under bidirection with AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   EffectRemoved   NumberParams   AIC
——————————————————————————————————————————————————————————————
 1      1                               1              149.94345
 2      qsec                            2              145.776054
 3      wt                              3              97.90843
 4      am                              4              95.307305
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of coefficients for the selected model with mpg under bidirection and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   9.617781    6.959593     1.381946    0.177915
 qsec          1.225886    0.28867      4.246676    0.000216
 wt            -3.916504   0.711202     -5.506882   7e-06
 am            2.935837    1.410905     2.080819    0.046716
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

You can also visualize the variable selection procedures with:

plot(res)
$bidirection
$bidirection$detail

[Figure: plot returned as $bidirection$detail]

$bidirection$summary

[Figure: plot returned as $bidirection$summary]

The (+)1 refers to the original model with the intercept added; (+) indicates variables being added to the model, while (-) indicates variables being removed from the model.

Additionally, you can generate reports of various formats with:

report(res, report_name = "path_to/demo_res", format = "html")

Replace "path_to/demo_res" with the desired output file name; the suffix ".html" will be added automatically. For detailed examples and further usage, refer to sections 4 and 5.

3.1 Regression categories

StepReg supports multiple types of regression, including linear, logit, cox, poisson, and gamma regression. These methods primarily vary by the type of response variable, as summarized in the table below. Additional regression techniques can be incorporated upon user request.

Table 1: Common regression categories
Regression   Response
linear       continuous
logit        binary
cox          time-to-event
poisson      count
gamma        continuous and positively skewed
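As a rough orientation, each category in Table 1 corresponds to a familiar model-fitting call in R. The sketch below illustrates that general correspondence only; it is an assumption for illustration, not a description of StepReg's internals. The helper fit_by_type is hypothetical, and the cox branch assumes the survival package is installed.

```r
# Hypothetical helper (not part of StepReg): fit a single candidate model,
# choosing the fitting function from the regression category in Table 1.
fit_by_type <- function(formula, data, type) {
  switch(type,
    linear  = lm(formula, data = data),
    logit   = glm(formula, data = data, family = binomial()),
    poisson = glm(formula, data = data, family = poisson()),
    gamma   = glm(formula, data = data, family = Gamma()),
    # Expects a survival::Surv(time, status) response on the left-hand side.
    cox     = survival::coxph(formula, data = data),
    stop("unknown type: ", type)
  )
}

# Examples on mtcars: continuous, binary, and count responses.
fit_lin <- fit_by_type(mpg ~ wt, mtcars, "linear")
fit_bin <- fit_by_type(am ~ wt, mtcars, "logit")
fit_cnt <- fit_by_type(carb ~ hp, mtcars, "poisson")
```

Because switch() only evaluates the selected branch, the survival package is needed only when type = "cox" is requested.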

3.2 Model selection strategies

Model selection aims to identify the subset of independent variables that provides the best predictive performance for the response variable. Both stepwise regression and best subsets approaches are implemented in StepReg. For stepwise regression, there are three main methods: Forward Selection, Backward Elimination, and Bidirectional Elimination.

Table 2: Model selection strategy
Strategy: Description

Forward Selection: In forward selection, the algorithm starts with an empty model (no predictors) and adds variables one by one. Each step tests the addition of every possible predictor by calculating a pre-selected metric. The variable (if any) whose inclusion leads to the most statistically significant fit improvement is added. This process is repeated until adding predictors no longer leads to a statistically better fit.

Backward Elimination: In backward elimination, the algorithm starts with a full model (all predictors) and deletes variables one by one. Each step tests the deletion of every possible predictor by calculating a pre-selected metric. The variable (if any) whose removal leads to the most statistically significant fit improvement is deleted. This process is repeated until removing predictors no longer leads to a statistically better fit.

Bidirectional Elimination: Bidirectional elimination is essentially a forward selection procedure combined with backward elimination at each iteration. Each iteration starts with a forward selection step that adds predictors, followed by a round of backward elimination that removes predictors. This process is repeated until no more predictors are added or excluded.

Best Subsets: Stepwise algorithms add or delete one predictor at a time and output a single model without evaluating all candidates. They are therefore relatively simple procedures that produce only one model. In contrast, the Best Subsets algorithm calculates all possible models and outputs the best-fitting models with one predictor, two predictors, etc., for users to choose from.

Given computational constraints, Bidirectional Elimination is typically the most advisable approach when the number of predictor variables is large, in particular when it exceeds the sample size. Forward Selection and Backward Elimination can be considered in sequence. In contrast, the Best Subsets approach requires the most processing time, yet it evaluates a comprehensive set of models with varying numbers of variables. In practice, users can experiment with various methods and select a final model based on the specific dataset and research objectives at hand.
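To make the forward selection loop concrete, here is a minimal base R sketch for a single continuous response. It is not StepReg's implementation: aic_linear and forward_select are hypothetical helpers, and the metric is the linear-model AIC from section 3.3 with one response (q = 1), that is, n ln(SSE/n) + 2p + n + 2.

```r
# AIC for a univariate linear model as defined in section 3.3 (q = 1);
# p counts all estimated coefficients, including the intercept.
aic_linear <- function(fit) {
  n <- nobs(fit)
  p <- length(coef(fit))
  sse <- sum(residuals(fit)^2)
  n * log(sse / n) + 2 * p + n + 2
}

# Greedy forward selection: start from the intercept-only model and, at each
# step, add the candidate that lowers the AIC the most; stop when none does.
forward_select <- function(response, candidates, data) {
  selected <- character(0)
  best_aic <- aic_linear(lm(reformulate("1", response), data = data))
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Score every one-variable extension of the current model.
    scores <- vapply(remaining, function(v) {
      aic_linear(lm(reformulate(c(selected, v), response), data = data))
    }, numeric(1))
    if (min(scores) >= best_aic) break  # no candidate improves the metric
    selected <- c(selected, names(which.min(scores)))
    best_aic <- min(scores)
  }
  list(selected = selected, aic = best_aic)
}

result <- forward_select("mpg", setdiff(names(mtcars), "mpg"), mtcars)
result$selected
result$aic
```

On mtcars, this greedy rule should follow the same path as the forward/AIC run on mpg ~ . reported in section 5.1.1 (wt, then cyl, then hp). Backward elimination is the mirror image: start from the full model and score every one-variable deletion instead.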

3.3 Selection metrics

Various selection metrics can be used to guide the process of adding or removing predictors from the model. These metrics help to determine the importance or significance of predictors in improving the model fit. In StepReg, selection metrics fall into two categories: information criteria and the significance level of the coefficient associated with each predictor. An information criterion evaluates a model’s performance by balancing model fit against complexity, penalizing models with a higher number of parameters. Lower values indicate a better trade-off between model fit and complexity. Note that when evaluating different models, it is important to compare them using the same information criterion rather than across several criteria. For example, if you decide to use AIC, you should compare all models using AIC. This ensures consistency and fairness in model comparison, as each information criterion has its own scale and penalization factors. Many metrics have been proposed; the ones supported by StepReg are summarized below.

Importantly, because the precise definition of each metric differs across sources, StepReg mirrors the formulas adopted by SAS for univariate multiple regression (UMR), except for HQ, IC(1), and IC(3/2). A subset of the UMR formulas can be readily extended to multivariate multiple regression (MMR); these are indicated in the following table.

Table 3: Statistics in selection metric
Statistic          Meaning
\({n}\)            Sample size
\({p}\)            Number of parameters including the intercept
\({q}\)            Number of dependent variables
\(\sigma^2\)       Estimate of pure error variance from fitting the full model
\({SST}\)          Total sum of squares corrected for the mean for the dependent variable, which is a numeric value for UMR and a matrix for multivariate regression
\({SSE}\)          Error sum of squares, which is a numeric value for UMR and a matrix for multivariate regression
\(\text{LL}\)      The natural logarithm of likelihood
\({| |}\)          The determinant function
\(\ln()\)          The natural logarithm

Table 4: Abbreviation, Definition, and Formula of the Selection Metric for Linear, Logit, Cox, Poisson, and Gamma regression

Each entry lists the abbreviation and definition, followed by the formula for linear regression and the formula for logit, cox, poisson and gamma regression.

AIC: Akaike’s Information Criterion
  linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + 2pq + n + q(q+1)\) (Clifford M. Hurvich 1989; Al-Subaihi 2002)\(^1\)
  logit, cox, poisson and gamma: \(-2\text{LL} + 2p\) (Darlington 1968; George G. Judge 1985)

AICc: Corrected Akaike’s Information Criterion
  linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + \frac{nq(n+p)}{n-p-q-1}\) (Clifford M. Hurvich 1989; Edward J. Bedrick 1994)\(^2\)
  logit, cox, poisson and gamma: \(-2\text{LL} + \frac{n(n+p)}{n-p-2}\) (Clifford M. Hurvich 1989)

BIC: Sawa Bayesian Information Criterion
  linear: \(n\ln\left(\frac{SSE}{n}\right) + 2(p+2)o - 2o^2, o = \frac{n\sigma^2}{SSE}\) (Sawa 1978; George G. Judge 1985); not available for MMR
  logit, cox, poisson and gamma: not available

Cp: Mallows’ Cp statistic
  linear: \(\frac{SSE}{\sigma^2} + 2p - n\) (Mallows 1973; Hocking 1976); not available for MMR
  logit, cox, poisson and gamma: not available

HQ: Hannan and Quinn Information Criterion
  linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + 2pq\ln(\ln(n))\) (E. J. Hannan 1979; Allan D R McQuarrie 1998; Clifford M. Hurvich 1989)
  logit, cox, poisson and gamma: \(-2\text{LL} + 2p\ln(\ln(n))\) (E. J. Hannan 1979)

IC(1): Information Criterion with Penalty Coefficient Set to 1
  linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + p\) (J. A. Nelder 1972; A. F. M. Smith 1980); not available for MMR
  logit, cox, poisson and gamma: \(-2\text{LL} + p\) (J. A. Nelder 1972; A. F. M. Smith 1980)

IC(3/2): Information Criterion with Penalty Coefficient Set to 3/2
  linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + \frac{3}{2}p\) (A. F. M. Smith 1980); not available for MMR
  logit, cox, poisson and gamma: \(-2\text{LL} + \frac{3}{2}p\) (A. F. M. Smith 1980)

SBC: Schwarz Bayesian Information Criterion
  linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + p \ln(n)\) (Clifford M. Hurvich 1989; Schwarz 1978; George G. Judge 1985; Al-Subaihi 2002); not available for MMR
  logit, cox, poisson and gamma: \(-2\text{LL} + p\ln(n)\) (Schwarz 1978; George G. Judge 1985)

SL: Significance Level (pvalue)
  linear: \(\textit{F test}\) for UMR and \(\textit{Approximate F test}\) for MMR
  logit, cox, poisson and gamma: Forward: LRT and Rao Chi-square test (logit, poisson, gamma); LRT (cox). Backward: Wald test

Rsq: R-square statistic
  linear: \(1 - \frac{SSE}{SST}\); not available for MMR
  logit, cox, poisson and gamma: not available

adjRsq: Adjusted R-square statistic
  linear: \(1 - \frac{(n-1)(1-R^2)}{n-p}\) (Darlington 1968; George G. Judge 1985); not available for MMR
  logit, cox, poisson and gamma: not available

1 Unsupported AIC formula (which does not affect the selection process as it only differs by constant additive and multiplicative factors):

\(AIC=n\ln\left(\frac{SSE}{n}\right) + 2p\) (Darlington 1968; George G. Judge 1985)

2 Unsupported AICc formula (which does not affect the selection process as it only differs by constant additive and multiplicative factors):

\(AICc=\ln\left(\frac{SSE}{n}\right) + 1 + \frac{2(p+1)}{n-p-2}\) (Allan D R McQuarrie 1998)
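As a worked check of the linear AIC formula, the sketch below evaluates it by hand for the intercept-only model of mpg in mtcars (n = 32, p = 1, q = 1) and compares it with the footnote 1 variant and with base R's AIC(). The variable names are illustrative, and the comparison with AIC() relies on the standard Gaussian log-likelihood used by base R for lm fits.

```r
fit0 <- lm(mpg ~ 1, data = mtcars)

n   <- nobs(fit0)               # sample size
p   <- length(coef(fit0))       # parameters, including the intercept
sse <- sum(residuals(fit0)^2)   # error sum of squares

# Formula from Table 4 for a single response (q = 1).
aic_table <- n * log(sse / n) + 2 * p * 1 + n + 1 * (1 + 1)

# Variant from footnote 1: same data-dependent part, constant n + 2 omitted.
aic_footnote <- n * log(sse / n) + 2 * p

# Base R adds n * log(2 * pi) and counts the error variance as a parameter;
# for a fixed n all three differ only by constants, so rankings agree.
aic_base <- AIC(fit0)

c(table = aic_table, footnote = aic_footnote, base_r = aic_base)
```

The first value should reproduce the 149.94345 shown at step 1 of the AIC selection tables in this vignette, which is a quick way to confirm which definition a given table uses.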

No metric is necessarily optimal for all datasets. The choice depends on your data and research goals. We recommend using multiple metrics simultaneously, which allows you to select the best model for your specific needs. General guidance is summarized below.

  • AIC: AIC works by penalizing the inclusion of additional variables in a model. The lower the AIC, the better the model. AIC does not include the sample size in its penalty term, and it is optimal in minimizing the mean squared error of predictions (Mark J. Brewer 2016).

  • AICc: AICc is a variant of AIC that works better for small sample sizes, especially when numObs / numParam < 40 (Kenneth P. Burnham 2002).

  • Cp: Cp is used for linear models. It is equivalent to AIC when dealing with Gaussian linear model selection.

  • IC(1) and IC(3/2): IC(1) and IC(3/2) use penalty factors of 1 and 3/2 respectively, compared to 2 for AIC. As such, IC(1) tends to return a more complex model with more variables, which may suffer from overfitting.

  • BIC and SBC: Both BIC and SBC are variants of the Bayesian Information Criterion. The main distinction between BIC/SBC and AIC lies in the magnitude of the penalty imposed: BIC/SBC penalize model complexity more heavily, which typically results in a simpler model (SAS Institute Inc 2018; Sawa 1978; Clifford M. Hurvich 1989; Schwarz 1978; George G. Judge 1985; Al-Subaihi 2002).

The precise definitions of these criteria can vary across the literature and in the SAS environment. Here, BIC aligns with the definition of the Sawa Bayesian Information Criterion as outlined in SAS documentation, while SBC corresponds to the Schwarz Bayesian Information Criterion. According to Richard’s post, whereas AIC often favors overly complex models, BIC/SBC prioritize smaller models. Consequently, when dealing with a limited sample size, AIC may seem preferable, whereas BIC/SBC tend to perform better with larger sample sizes.

  • HQ: HQ is an alternative to AIC, differing primarily in the method of penalty calculation. However, HQ has remained relatively underutilized in practice (Kenneth P. Burnham 2002).

  • Rsq: The R-squared (R²) statistic measures the proportion of variation that is explained by the model. It ranges from 0 to 1, with 1 indicating that all of the variability in the response variable is accounted for by the independent variables. As such, R-squared is valuable for communicating the explanatory power of a model. However, R-squared does not take the complexity of the model into account. Therefore, while it is useful for understanding how well the model fits the data, it should not be the sole criterion for model selection.

  • adjRsq: The adjusted R-squared (adj-R²) seeks to overcome the limitation of R-squared in model selection by considering the number of predictors. It serves a similar purpose to information criteria, as both methods compare models by weighing their goodness of fit against the number of parameters. However, information criteria are typically regarded as superior in this context (Stevens 2016).

  • SL: SL stands for Significance Level (P-value), embodying a distinct approach to model selection in contrast to information criteria. The SL method operates by calculating a P-value through specific hypothesis testing. Should this P-value fall below a predefined threshold, such as 0.05, one should favor the alternative hypothesis, indicating that the full model significantly outperforms the reduced model. The effectiveness of this method hinges upon the selection of the P-value threshold, wherein smaller thresholds tend to yield simpler models.
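For the SL metric in a linear model, one entry decision can be sketched with a partial F test in base R. The example below asks whether cyl should enter a model that already contains wt, using an entry threshold of 0.05. It is a minimal illustration of the test behind the decision rather than StepReg's own code, and the variable names are chosen for the example.

```r
reduced <- lm(mpg ~ wt, data = mtcars)        # current model
full    <- lm(mpg ~ wt + cyl, data = mtcars)  # candidate: current model + cyl

# Partial F test comparing the nested models; the second row of the anova
# table holds the P-value for the added term.
p_value <- anova(reduced, full)[["Pr(>F)"]][2]

sle   <- 0.05            # significance level for entry
enter <- p_value < sle   # TRUE means cyl is allowed to enter the model

p_value
enter
```

The resulting P-value should agree with the 0.001064 reported for cyl in the forward/SL selection table of section 5.1.1. Lowering sle makes entry harder and therefore tends to produce smaller models.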

3.4 Multicollinearity

This blog by Jim Frost gives an excellent overview of multicollinearity and when it is necessary to remove it.

Simply put, a dataset contains multicollinearity when input predictors are correlated. When multicollinearity occurs, the interpretability of predictors will be badly affected because changes in one input variable lead to changes in other input variables. Therefore, it is hard to individually estimate the relationship between each input variable and the dependent variable.

Multicollinearity can dramatically reduce the precision of the estimated regression coefficients of correlated input variables, making it hard to find the correct model. However, as Jim pointed out, “Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.”

In StepReg, QR matrix decomposition is performed ahead of time to detect and remove input variables causing multicollinearity.
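The idea of flagging linearly dependent predictors with a matrix decomposition can be sketched with base R's qr(). This is an illustration under stated assumptions, not StepReg's actual code: the toy design matrix is made up, and the tolerance simply reuses the 1e-07 shown as "tolerance of multicollinearity" in the output tables.

```r
# Toy design matrix: column "c" is an exact linear combination of "a" and "b".
a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 6)
X <- cbind(intercept = 1, a = a, b = b, c = a + b)

# qr() reports the numerical rank and moves columns it judges linearly
# dependent (relative to tol) to the end of the pivot order.
decomp <- qr(X, tol = 1e-07)

dependent <- if (decomp$rank < ncol(X)) {
  colnames(X)[decomp$pivot[(decomp$rank + 1):ncol(X)]]
} else {
  character(0)
}

# Drop the flagged columns before any model selection takes place.
X_reduced <- X[, setdiff(colnames(X), dependent), drop = FALSE]

decomp$rank
dependent
```

Note that this kind of check only catches predictors that are linearly dependent up to the tolerance; strongly but not perfectly correlated predictors pass through.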

StepReg provides multiple functions for summarizing the model building results. The function stepwise() generates a list of tables that describe the feature selection steps and the final model. To facilitate collaboration, you can export the tables in various formats such as “xlsx”, “html”, “docx”, etc. with the function report(). Furthermore, you can easily compare the variable selection procedures for multiple selection metrics by visualizing the steps with the function plot(). See below for details.

Depending on the number of selected regression strategies and metrics, you can expect to receive at least four tables from stepwise(). The table below describes the content of each of the four tables from the Quick Demo (section 2).

Table 5: Tables generated by StepReg
Table_Name: Table_Description

Summary of arguments for model selection: Arguments used in the stepwise function, either default or user-supplied values.

Summary of variables in dataset: Variable names, types, and classes in dataset.

Summary of selection process under xxx(strategy) with xxx(metric): Overview of the variable selection process under specified strategy and metric.

Summary of coefficients for the selected model with xxx(dependent variable) under xxx(strategy) and xxx(metric): Coefficients for the selected models under specified strategy with metric. Please note that this table will not be generated for the strategy ‘subset’ when using the metric ‘SL’.

You can save the output in different formats such as “xlsx”, “docx”, “html”, “pptx”, and others, facilitating easy sharing. Note that the suffix is added to report_name automatically. For instance, the following example generates both “results.xlsx” and “results.docx” reports.

report(res, report_name = "results", format = c("xlsx", "docx"))

Please choose the regression model that best suits the type of response variable. For detailed guidance, see section 3.1. Below, we present various examples utilizing different models tailored to specific datasets.

5.1 Linear regression with the mtcars dataset

In this section, we’ll demonstrate how to perform linear regression analysis using the mtcars dataset, showcasing different scenarios with varying numbers of predictors and dependent variables. We set type = "linear" to direct the function to perform linear regression.

Description of the mtcars dataset

The mtcars dataset is a classic dataset in statistics and is included in the base R installation. It was sourced from the 1974 Motor Trend US magazine and comprises 32 observations on 11 variables. Here’s a brief description of the variables included:

  1. mpg: miles per gallon (fuel efficiency)
  2. cyl: number of cylinders
  3. disp: displacement (engine size) in cubic inches
  4. hp: gross horsepower
  5. drat: rear axle ratio
  6. wt: weight (in thousands of pounds)
  7. qsec: 1/4 mile time (in seconds)
  8. vs: engine type (0 = V-shaped, 1 = straight)
  9. am: transmission type (0 = automatic, 1 = manual)
  10. gear: number of forward gears
  11. carb: number of carburetors

Why choose linear regression

Linear regression is an ideal choice for analyzing the mtcars dataset because it contains continuous variables such as “mpg”, “hp”, or “wt”, which can serve as response variables. Furthermore, the dataset exhibits potential linear relationships between the response variable and other variables.

5.1.1 Example 1: single dependent variable (“mpg”)

In this example, we employ the “forward” strategy with “AIC” as the selection criterion. Additionally, we use the include argument to specify that “disp” and “cyl” must always be included in the model.

library(StepReg)

data(mtcars)
formula <- mpg ~ .
res1 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 include = c("disp", "cyl"),
                 strategy = "forward",
                 metric = "AIC")
res1

Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Parameter                        Value
——————————————————————————————————————————
 included variable                disp cyl
 strategy                         forward
 metric                           AIC
 tolerance of multicollinearity   1e-07
 multicollinearity variable       NULL
 intercept                        1
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable_type   Variable_name   Variable_class
——————————————————————————————————————————————
 Dependent       mpg             numeric
 Independent     cyl             numeric
 Independent     disp            numeric
 Independent     hp              numeric
 Independent     drat            numeric
 Independent     wt              numeric
 Independent     qsec            numeric
 Independent     vs              numeric
 Independent     am              numeric
 Independent     gear            numeric
 Independent     carb            numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under forward with AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   NumberParams   AIC
———————————————————————————————————————————————
 1      1               1              149.94345
 2      disp cyl        3              108.333571
 3      wt              4              98.746294
 4      hp              5              97.525537
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of coefficients for the selected model with mpg under forward and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   40.828537   2.757468     14.806532   0
 disp          0.011599    0.011727     0.989122    0.331386
 cyl           -1.29332    0.655877     -1.971894   0.058947
 wt            -3.853904   1.015474     -3.795178   0.000759
 hp            -0.020538   0.012147     -1.690851   0.102379
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

To visualize the selection process:

plot_list <- plot(res1)
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1)

[Figure: plots of the selection process for res1 under the forward strategy]

To exclude the intercept from the model, adjust the formula as follows:

formula <- mpg ~ . + 0

To limit the model to a specific subset of predictors, adjust the formula as follows, which will only consider “cyl”, “disp”, “hp”, “wt”, “vs”, and “am” as the predictors (the trailing + 0 again excludes the intercept).

formula <- mpg ~ cyl + disp + hp + wt + vs + am + 0

Another way is to use the minus symbol ("-") to exclude some predictors from variable selection. For example, the following formula includes all variables except “disp”, “wt”, and the intercept.

formula <- mpg ~ . - 1 - disp - wt
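To see which candidate predictors a formula actually expands to before running a selection, base R's terms() can be used. The sketch below is independent of StepReg and simply inspects the last formula shown above against mtcars.

```r
# Expand "." against mtcars and apply the "-" exclusions.
tt <- terms(mpg ~ . - 1 - disp - wt, data = mtcars)

candidates    <- attr(tt, "term.labels")  # predictors left for selection
has_intercept <- attr(tt, "intercept")    # 1 = intercept kept, 0 = removed

candidates
has_intercept
```

Here candidates should contain every mtcars column except mpg, disp, and wt, and has_intercept should be 0 because of the "- 1" term.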

You can simultaneously provide multiple selection strategies and metrics. For example, the following code snippet employs both “forward” and “backward” strategies using metrics “AIC”, “BIC”, and “SL”. It’s worth mentioning that when “SL” is specified, you may also want to set the significance level for entry (“sle”) and stay (“sls”), both of which default to 0.15.

formula <- mpg ~ .
res2 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 strategy = c("forward", "backward"),
                 metric = c("AIC", "BIC", "SL"),
                 sle = 0.05,
                 sls = 0.05)
res2

Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Parameter                            Value
————————————————————————————————————————————————————————
 included variable                    NULL
 strategy                             forward & backward
 metric                               AIC & BIC & SL
 significance level for entry (sle)   0.05
 significance level for stay (sls)    0.05
 test method                          F
 tolerance of multicollinearity       1e-07
 multicollinearity variable           NULL
 intercept                            1
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable_type   Variable_name   Variable_class
——————————————————————————————————————————————
 Dependent       mpg             numeric
 Independent     cyl             numeric
 Independent     disp            numeric
 Independent     hp              numeric
 Independent     drat            numeric
 Independent     wt              numeric
 Independent     qsec            numeric
 Independent     vs              numeric
 Independent     am              numeric
 Independent     gear            numeric
 Independent     carb            numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under forward with AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   NumberParams   AIC
———————————————————————————————————————————————
 1      1               1              149.94345
 2      wt              2              107.217363
 3      cyl             3              97.197999
 4      hp              4              96.664562
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of selection process under forward with BIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   NumberParams   BIC
———————————————————————————————————————————————
 1      1               1              115.061344
 2      wt              2              74.373395
 3      cyl             3              66.190251
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 5. Summary of selection process under forward with SL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   NumberParams   SL
—————————————————————————————————————————————
 1      1               1              1
 2      wt              2              0
 3      cyl             3              0.001064
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 6. Summary of selection process under backward with AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectRemoved   NumberParams   AIC
———————————————————————————————————————————————
 1                      11             104.897744
 2      cyl             10             102.915068
 3      vs              9              100.973242
 4      carb            8              99.121264
 5      gear            7              97.456669
 6      drat            6              96.161901
 7      disp            5              95.515302
 8      hp              4              95.307305
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 7. Summary of selection process under backward with BIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectRemoved   NumberParams   BIC
——————————————————————————————————————————————
 1                      11             83.872801
 2      cyl             10             80.827738
 3      vs              9              77.795923
 4      carb            8              74.805755
 5      gear            7              71.925757
 6      drat            6              69.307272
 7      disp            5              67.229924
 8      hp              4              65.713833
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 8. Summary of selection process under backward with SL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectRemoved   NumberParams   SL
—————————————————————————————————————————————
 1                      11             1
 2      cyl             10             0.916087
 3      vs              9              0.843258
 4      carb            8              0.746958
 5      gear            7              0.619641
 6      drat            6              0.462401
 7      disp            5              0.298972
 8      hp              4              0.223088
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 9. Summary of coefficients for the selected model with mpg under forward and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   38.751787   1.786864     21.687038   0
 wt            -3.166973   0.740576     -4.276365   0.000199
 cyl           -0.941617   0.550916     -1.709183   0.09848
 hp            -0.018038   0.011876     -1.518838   0.140015
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 10. Summary of coefficients for the selected model with mpg under forward and BIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   39.686261   1.714984     23.140893   0
 wt            -3.190972   0.756906     -4.215808   0.000222
 cyl           -1.507795   0.414688     -3.635972   0.001064
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 11. Summary of coefficients for the selected model with mpg under forward and SL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   39.686261   1.714984     23.140893   0
 wt            -3.190972   0.756906     -4.215808   0.000222
 cyl           -1.507795   0.414688     -3.635972   0.001064
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 12. Summary of coefficients for the selected model with mpg under backward and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   9.617781    6.959593     1.381946    0.177915
 wt            -3.916504   0.711202     -5.506882   7e-06
 qsec          1.225886    0.28867      4.246676    0.000216
 am            2.935837    1.410905     2.080819    0.046716
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 13. Summary of coefficients for the selected model with mpg under backward and BIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   9.617781    6.959593     1.381946    0.177915
 wt            -3.916504   0.711202     -5.506882   7e-06
 qsec          1.225886    0.28867      4.246676    0.000216
 am            2.935837    1.410905     2.080819    0.046716
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 14. Summary of coefficients for the selected model with mpg under backward and SL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   9.617781    6.959593     1.381946    0.177915
 wt            -3.916504   0.711202     -5.506882   7e-06
 qsec          1.225886    0.28867      4.246676    0.000216
 am            2.935837    1.410905     2.080819    0.046716
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
plot_list <- plot(res2)
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1, rel_heights = c(2, 1))

[Figure: plots of the selection process for res2 under the forward strategy]

cowplot::plot_grid(plotlist = plot_list$backward, ncol = 1, rel_heights = c(2, 1))

[Figure: plots of the selection process for res2 under the backward strategy]

5.1.2 Example 2: multivariate regression (“mpg” and “drat”)

In this scenario, there are two dependent variables, “mpg” and “drat”. The model selection aims to identify the most influential predictors that affect both variables.

formula <- cbind(mpg, drat) ~ .
res3 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 strategy = "forward",
                 metric = c("AIC", "HQ"))
res3

Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Parameter                        Value
——————————————————————————————————————————
 included variable                NULL
 strategy                         forward
 metric                           AIC & HQ
 tolerance of multicollinearity   1e-07
 multicollinearity variable       NULL
 intercept                        1
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable_type   Variable_name      Variable_class
—————————————————————————————————————————————————
 Dependent       cbind(mpg, drat)   nmatrix.2
 Independent     cyl                numeric
 Independent     disp               numeric
 Independent     hp                 numeric
 Independent     wt                 numeric
 Independent     qsec               numeric
 Independent     vs                 numeric
 Independent     am                 numeric
 Independent     gear               numeric
 Independent     carb               numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under forward with AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   NumberParams   AIC
———————————————————————————————————————————————
 1      1               1              205.805744
 2      wt              2              161.304784
 3      cyl             3              150.750461
 4      carb            4              142.109246
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of selection process under forward with HQ
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Step   EffectEntered   NumberParams   HQ
———————————————————————————————————————————————
 1      1               1              168.777444
 2      wt              2              125.248184
 3      cyl             3              115.665561
 4      carb            4              107.996046
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 5. Summary of coefficients for the selected model with Response mpg under forward and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   39.60214    1.682264     23.540979   0
 wt            -3.159452   0.742346     -4.256035   0.000211
 cyl           -1.289788   0.432598     -2.981496   0.00588
 carb          -0.485763   0.32947      -1.474375   0.151536
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 6. Summary of coefficients for the selected model with Response drat under forward and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   5.046867    0.215307     23.440312   0
 wt            -0.240667   0.09501      -2.533067   0.017191
 cyl           -0.168582   0.055367     -3.044821   0.005026
 carb          0.130518    0.042168     3.095207    0.004433
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 7. Summary of coefficients for the selected model with Response mpg under forward and HQ
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   39.60214    1.682264     23.540979   0
 wt            -3.159452   0.742346     -4.256035   0.000211
 cyl           -1.289788   0.432598     -2.981496   0.00588
 carb          -0.485763   0.32947      -1.474375   0.151536
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 8. Summary of coefficients for the selected model with Response drat under forward and HQ
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
 Variable      Estimate    Std. Error   t value     Pr(>|t|)
—————————————————————————————————————————————————————————
 (Intercept)   5.046867    0.215307     23.440312   0
 wt            -0.240667   0.09501      -2.533067   0.017191
 cyl           -0.168582   0.055367     -3.044821   0.005026
 carb          0.130518    0.042168     3.095207    0.004433
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
plot(res3)
$forward
$forward$detail

[Figure: detail plot of the forward selection steps]

$forward$summary

[Figure: summary plot of the forward selection steps]
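
The coefficient estimates reported for the selected model can be cross-checked without StepReg. The sketch below refits the final multivariate model with base R's lm(), assuming the selected predictors are wt, cyl and carb as listed in the selection tables above. It is an illustrative check only, not part of the StepReg workflow.

```r
# Refit the selected model with base R. cbind() on the left-hand side
# makes lm() fit one least-squares regression per response column.
fit <- lm(cbind(mpg, drat) ~ wt + cyl + carb, data = mtcars)

# coef() returns a matrix: one row per term, one column per response.
est <- coef(fit)
print(round(est, 6))
```

The two columns of the printed matrix should agree with the Estimate columns of the mpg and drat coefficient tables.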

5.2 Logistic regression with the remission dataset

In this example, we’ll showcase logistic regression using the remission dataset. By setting type = "logit", we instruct the function to perform logistic regression.

Description of the remission dataset

The remission dataset, obtained from the online course STAT501 at Penn State University, has been integrated into StepReg. It consists of 27 observations across seven variables, including a binary variable named “remiss”:

  1. remiss: whether leukemia remission occurred, a value of 1 indicates occurrence while 0 means non-occurrence
  2. cell: cellularity of the marrow clot section
  3. smear: smear differential percentage of blasts
  4. infil: percentage of absolute marrow leukemia cell infiltrate
  5. li: percentage labeling index of the bone marrow leukemia cells
  6. blast: the absolute number of blasts in the peripheral blood
  7. temp: the highest temperature before the start of treatment

Why choose logistic regression

Logistic regression models the relationship between predictors and a categorical response variable by estimating the probability that an observation falls into a given response category, given a set of predictors. It is suitable for analyzing binary outcomes, such as the remission status (“remiss”) in the remission dataset.
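
As a minimal sketch of what a single logistic regression fit looks like in base R, the code below uses glm() with a binomial family. The built-in infert dataset and its binary case column are stand-ins chosen only so the example runs without StepReg; they are not the remission data.

```r
# Illustrative stand-in: infert$case (0/1) plays the role of a binary response.
fit <- glm(case ~ spontaneous + induced,
           data = infert,
           family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exponentiate to obtain odds ratios.
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Predicted probability of case == 1 for one hypothetical observation.
new_obs <- data.frame(spontaneous = 1, induced = 0)
p_hat <- predict(fit, newdata = new_obs, type = "response")
print(p_hat)
```

Stepwise selection repeats fits of this kind over different predictor sets and compares them with the chosen metric.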

5.2.1 Example 1: using “forward” strategy

In this example, we employ a “forward” strategy with “AIC” as the selection criterion, while forcing the “cell” variable to be included in the model.

data(remission)
formula <- remiss ~ .
res4 <- stepwise(formula = formula,
                 data = remission,
                 type = "logit",
                 include = "cell",
                 strategy = "forward",
                 metric = "AIC")
[1] "There was an error message."
res4
Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Parameter                       Value
—————————————————————————————————————————
included variable               cell
strategy                        forward
metric                          AIC
tolerance of multicollinearity  1e-07
multicollinearity variable      NULL
intercept                       1
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable_type  Variable_name  Variable_class
——————————————————————————————————————————————
Dependent      remiss         numeric
Independent    cell           numeric
Independent    smear          numeric
Independent    infil          numeric
Independent    li             numeric
Independent    blast          numeric
Independent    temp           numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under forward with AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Step  EffectEntered  NumberParams  AIC
——————————————————————————————————————————————
1     1              1             36.371765
2     cell           2             35.791792
3     li             3             30.340719
4     temp           4             29.953368
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of coefficients for the selected model with remiss under forward and AIC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable     Estimate    Std. Error  z value   Pr(>|z|)
—————————————————————————————————————————————————————————
(Intercept)  67.633906   56.887547   1.188905  0.234477
cell         9.652152    7.751076    1.245266  0.213034
li           3.8671      1.778278    2.174632  0.029658
temp         -82.073774  61.712382   -1.32994  0.183538
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
plot(res4)
$forward
$forward$detail

[Figure: detail plot of the forward selection steps]

$forward$summary

[Figure: summary plot of the forward selection steps]
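
The idea of forward selection with a forced-in variable can also be sketched with base R's step(), which is a different implementation from StepReg's stepwise(). In this sketch the built-in infert dataset stands in for the remission data, and age plays the role of the forced variable; both choices are assumptions made purely for illustration.

```r
# Stand-in data: binary response 'case' and four numeric candidate predictors.
dat <- infert[, c("case", "age", "parity", "induced", "spontaneous")]

# Start from a model that already contains the forced-in variable.
start <- glm(case ~ age, data = dat, family = binomial)

# 'lower' pins age in every candidate model; 'upper' lists all terms
# that are allowed to enter. step() adds the term that lowers AIC most
# and stops when no addition improves AIC.
sel <- step(start,
            scope = list(lower = ~ age,
                         upper = ~ age + parity + induced + spontaneous),
            direction = "forward",
            trace = 0)

print(formula(sel))
print(AIC(sel))
```

Because age sits in the lower scope, it remains in the final model even if it contributes little on its own, which mirrors the role of include = "cell" in the example above.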

5.2.2 Example 2: using “subset” strategy

In this example, we employ a “subset” strategy with “SBC” as the selection criterion, while excluding the intercept. We also set best_n = 3 to restrict the output to the top 3 models for each number of variables.

data(remission)
formula <- remiss ~ . + 0
res5 <- stepwise(formula = formula,
                 data = remission,
                 type = "logit",
                 strategy = "subset",
                 metric = "SBC",
                 best_n = 3)
[1] "There was an error message."
res5
Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Parameter                       Value
————————————————————————————————————————
included variable               NULL
strategy                        subset
metric                          SBC
tolerance of multicollinearity  1e-07
multicollinearity variable      NULL
intercept                       0
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable_type  Variable_name  Variable_class
——————————————————————————————————————————————
Dependent      remiss         numeric
Independent    cell           numeric
Independent    smear          numeric
Independent    infil          numeric
Independent    li             numeric
Independent    blast          numeric
Independent    temp           numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under subset with SBC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
NumberOfVariables  SBC        VariablesInModel
—————————————————————————————————————————————————————————————————
1                  37.627761  0 temp
1                  38.645147  0 cell
1                  38.908893  0 smear
2                  32.450079  0 li temp
2                  37.241906  0 blast temp
2                  37.692052  0 cell li
3                  33.57889   0 cell li temp
3                  35.101444  0 infil li temp
3                  35.582915  0 smear li temp
4                  36.711937  0 cell li blast temp
4                  36.809659  0 cell smear li temp
4                  36.841948  0 cell infil li temp
5                  38.980486  0 cell smear infil li temp
5                  39.686807  0 cell smear li blast temp
5                  39.752516  0 cell infil li blast temp
6                  42.25891   0 cell smear infil li blast temp
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of coefficients for the selected model with remiss under subset and SBC
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable  Estimate   Std. Error  z value    Pr(>|z|)
——————————————————————————————————————————————————————
li        2.941731   1.195358    2.460962   0.013856
temp      -3.855005  1.404155    -2.745426  0.006043
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
plot(res5)
$subset
$subset$detail

[Figure: detail plot of the best subsets selection]

$subset$summary

[Figure: summary plot of the best subsets selection]

Here, the 0 in the above plot means that there is no intercept in the model.
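
To make the best subsets idea concrete, the sketch below enumerates every predictor subset by brute force, fits a no-intercept logistic model for each, and keeps the top best_n models per subset size ranked by BIC. It uses the built-in infert dataset as a stand-in and base R's BIC() as a proxy for SBC; StepReg's internal algorithm and its exact SBC constant may differ, so treat this as an illustration of the concept only.

```r
# Stand-in data with a binary response and four numeric candidates.
dat <- infert[, c("case", "age", "parity", "induced", "spontaneous")]
candidates <- setdiff(names(dat), "case")
best_n <- 3  # number of models to keep for each subset size

per_size <- lapply(seq_along(candidates), function(k) {
  # All subsets of size k.
  combos <- combn(candidates, k, simplify = FALSE)

  # BIC of the no-intercept logistic model for each subset.
  bic <- vapply(combos, function(vars) {
    f <- reformulate(vars, response = "case", intercept = FALSE)
    BIC(glm(f, data = dat, family = binomial))
  }, numeric(1))

  # Keep the best_n subsets with the smallest BIC.
  keep <- head(order(bic), best_n)
  data.frame(n_vars = k,
             BIC = bic[keep],
             vars = vapply(combos[keep], paste, character(1), collapse = " "))
})

result <- do.call(rbind, per_size)
print(result)
```

Exhaustive enumeration grows as 2^p in the number of candidates p, which is why it is practical only for small predictor sets such as the six variables in the remission data.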

5.3 Cox regression with the lung dataset

In this example, we’ll demonstrate how to perform Cox regression analysis using the lung dataset. By setting type = "cox", we instruct the function to conduct Cox regression.

Description of the lung dataset

The lung dataset, available in the "survival" R package, includes information on survival times for 228 patients with advanced lung cancer. It comprises ten variables, among which the “status” variable codes for censoring status (1 = censored, 2 = dead), and the “time” variable denotes the patient survival time in days. To learn more about the dataset, use ?survival::lung.

Why choose Cox regression

Cox regression, also termed the Cox proportional hazards model, is specifically designed for analyzing survival data, making it well-suited for datasets like lung that include information on the time until an event (e.g., death) occurs. This method accommodates censoring and assumes proportional hazards, enhancing its applicability to medical studies involving time-to-event outcomes.
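
The following sketch shows a single Cox fit with the survival package, along with a check of the proportional hazards assumption mentioned above. The two predictors are picked arbitrarily for illustration and are not the result of any selection procedure.

```r
library(survival)

# Drop incomplete records so the model is fit on complete cases only.
lung_cc <- na.omit(survival::lung)

# status is coded 1 = censored, 2 = dead; Surv() accepts this 1/2 coding.
fit <- coxph(Surv(time, status) ~ ph.ecog + sex, data = lung_cc)

# exp(coef) gives the hazard ratio for a one-unit increase in each predictor.
hazard_ratios <- exp(coef(fit))
print(hazard_ratios)

# cox.zph() tests the proportional hazards assumption using
# scaled Schoenfeld residuals.
ph_test <- cox.zph(fit)
print(ph_test)
```

Small p-values in the cox.zph() output would suggest that the proportional hazards assumption is questionable for the corresponding term.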

5.3.1 Example 1: using “forward” strategy

In this example, we employ a “forward” strategy with “AICc” as the selection criterion.

library(dplyr)
library(survival)
# Preprocess:
lung <- survival::lung %>%
  mutate(sex = factor(sex, levels = c(1, 2))) %>% # make sex as factor
  na.omit() # get rid of incomplete records
formula = Surv(time, status) ~ .
res6 <- stepwise(formula = formula,
                 data = lung,
                 type = "cox",
                 strategy = "forward",
                 metric = "AICc")
res6
Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Parameter                           Value
—————————————————————————————————————————————
included variable                   NULL
strategy                            forward
metric                              AICc
significance level for entry (sle)  0.15
significance level for stay (sls)   0.15
test method                         efron
tolerance of multicollinearity      1e-07
multicollinearity variable          NULL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable_type  Variable_name       Variable_class
———————————————————————————————————————————————————
Dependent      Surv(time, status)  nmatrix.2
Independent    inst                numeric
Independent    age                 numeric
Independent    sex                 factor
Independent    ph.ecog             numeric
Independent    ph.karno            numeric
Independent    pat.karno           numeric
Independent    meal.cal            numeric
Independent    wt.loss             numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under forward with AICc
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Step  EffectEntered  NumberParams  AICc
————————————————————————————————————————————————
1     ph.ecog        1             1127.927415
2     sex            2             1122.958293
3     inst           3             1121.654289
4     wt.loss        4             1120.224249
5     ph.karno       5             1118.535984
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of coefficients for the selected model with Surv(time, status) under forward and AICc
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable  coef       exp(coef)  se(coef)  z          Pr(>|z|)
———————————————————————————————————————————————————————————————
ph.ecog   0.993224   2.699926   0.232115  4.279014   1.9e-05
sex2      -0.571959  0.564419   0.198865  -2.876111  0.004026
inst      -0.030042  0.970404   0.012931  -2.323316  0.020162
wt.loss   -0.0148    0.985309   0.007664  -1.931026  0.05348
ph.karno  0.021492   1.021725   0.011222  1.915171   0.055471
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
plot(res6)
$forward
$forward$detail

[Figure: detail plot of the forward selection steps]

$forward$summary

[Figure: summary plot of the forward selection steps]
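
For readers unfamiliar with AICc, the small-sample correction of Hurvich and Tsai (1989) adds a penalty of 2k(k+1)/(n-k-1) to AIC, where k is the number of estimated parameters and n the sample size. The helper below is a hypothetical illustration of that textbook formula, demonstrated on a linear model for simplicity. The exact definition StepReg uses, in particular its choice of n and k for Cox models, may differ.

```r
# Hypothetical helper: textbook AICc for any fit that supports logLik().
aicc <- function(fit, n = nobs(fit)) {
  k <- attr(logLik(fit), "df")  # number of estimated parameters
  AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
}

# Demonstration on a small linear model (n = 32 observations).
fit <- lm(mpg ~ wt + cyl, data = mtcars)
print(c(AIC = AIC(fit), AICc = aicc(fit)))
```

The correction shrinks toward zero as n grows, so AICc and AIC select the same model in large samples.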

5.4 Poisson regression with the creditCard dataset

In this example, we’ll demonstrate how to perform Poisson regression analysis using the creditCard dataset. We set type = "poisson" to direct the function to perform Poisson regression.

Description of the creditCard dataset

The creditCard dataset, included in the "AER" package, contains credit history information for a sample of applicants for a specific type of credit card. It encompasses 1319 observations across 12 variables, including “reports”, “age”, and “income”, among others. The “reports” variable represents the number of major derogatory reports. For detailed information, refer to ?AER::CreditCard.

Why choose Poisson regression

Poisson regression is a frequently employed method for analyzing count data, where the response variable represents the number of occurrences of an event within a defined time or space frame. In the context of the creditCard dataset, Poisson regression can model the count of major derogatory reports (“reports”), enabling assessment of the predictors’ impact on this variable.
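
A single Poisson regression fit in base R looks like the sketch below. The built-in warpbreaks dataset, with its count response breaks, is used as a stand-in so that the example runs without the AER package.

```r
# Stand-in count data: number of warp breaks per loom.
fit <- glm(breaks ~ wool + tension,
           data = warpbreaks,
           family = poisson(link = "log"))

# With a log link, exp(coef) gives multiplicative effects on the expected count.
rate_ratios <- exp(coef(fit))
print(rate_ratios)

# Rough overdispersion check: values far above 1 suggest the Poisson
# variance assumption is too restrictive for these data.
dispersion <- deviance(fit) / df.residual(fit)
print(dispersion)
```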

5.4.1 Example 1: using “forward” strategy

In this example, we employ a “forward” strategy with “SL” (significance level) as the selection criterion. We set the significance level for entry to 0.05 (sle = 0.05).

data(creditCard)
formula = reports ~ .
res7 <- stepwise(formula = formula,
                 data = creditCard,
                 type = "poisson",
                 strategy = "forward",
                 metric = "SL",
                 sle = 0.05)
[1] "There was an error message."
res7
Table 1. Summary of arguments for model selection
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Parameter                           Value
—————————————————————————————————————————————
included variable                   NULL
strategy                            forward
metric                              SL
significance level for entry (sle)  0.05
test method                         Rao
tolerance of multicollinearity      1e-07
multicollinearity variable          NULL
intercept                           1
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 2. Summary of variables in dataset
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable_type  Variable_name  Variable_class
——————————————————————————————————————————————
Dependent      reports        numeric
Independent    card           factor
Independent    age            numeric
Independent    income         numeric
Independent    share          numeric
Independent    expenditure    numeric
Independent    owner          factor
Independent    selfemp        factor
Independent    dependents     numeric
Independent    months         numeric
Independent    majorcards     numeric
Independent    active         numeric
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 3. Summary of selection process under forward with SL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Step  EffectEntered  NumberParams  SL
—————————————————————————————————————————————
1     1              1             1
2     card           2             0
3     active         3             0
4     expenditure    4             0.000167
5     months         5             0.001497
6     owner          6             0.00029
7     majorcards     7             0.008537
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗

Table 4. Summary of coefficients for the selected model with reports under forward and SL
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
Variable     Estimate   Std. Error  z value     Pr(>|z|)
——————————————————————————————————————————————————————————
(Intercept)  -0.298644  0.109685    -2.722729   0.006475
cardyes      -2.703522  0.117196    -23.068395  0
active       0.06543    0.003998    16.367447   0
expenditure  0.000672   0.000178    3.785384    0.000153
months       0.002125   0.00053     4.006288    6.2e-05
owneryes     -0.34377   0.092648    -3.710493   0.000207
majorcards   0.274039   0.104513    2.622062    0.00874
‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗‗
plot(res7)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
$forward
$forward$detail

[Figure: detail plot of the forward selection steps]

$forward$summary

[Figure: summary plot of the forward selection steps]
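
Forward selection by significance level can be sketched in base R with add1() and a Rao score test, the test method reported in the arguments table above. The loop below is a simplified illustration on the built-in warpbreaks data, not StepReg's implementation: at each step it adds the candidate with the smallest p-value, and it stops once no candidate falls below sle.

```r
sle <- 0.05                          # significance level for entry
candidates <- c("wool", "tension")   # predictors allowed to enter
current <- glm(breaks ~ 1, data = warpbreaks, family = poisson)
entered <- character(0)

while (length(remaining <- setdiff(candidates, entered)) > 0) {
  # Score (Rao) test for adding each remaining term to the current model.
  tab <- add1(current, scope = reformulate(remaining), test = "Rao")

  # Locate the p-value column and drop the "<none>" row.
  p_col <- grep("^Pr\\(", names(tab), value = TRUE)
  p <- setNames(tab[[p_col]][-1], rownames(tab)[-1])

  if (min(p) >= sle) break           # no candidate passes the entry threshold
  best <- names(which.min(p))
  current <- update(current, as.formula(paste(". ~ . +", best)))
  entered <- c(entered, best)
}

print(entered)
print(formula(current))
```

A smaller sle makes entry stricter and tends to produce sparser models, while a larger value admits more predictors.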

We have developed an interactive Shiny application to simplify model selection tasks for non-programmers. You can access the app through the following URL:

https://junhuili1017.shinyapps.io/StepReg/

You can also access the Shiny app directly from your local machine with the following code:

library(StepReg)
StepRegShinyApp()

Here is the user interface.

[Figure: screenshots of the StepReg Shiny app user interface]

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] survival_3.6-4   dplyr_1.1.4      kableExtra_1.3.4 knitr_1.47
[5] StepReg_1.5.2    BiocStyle_2.22.0

loaded via a namespace (and not attached):
  [1] colorspace_2.1-0      pryr_0.1.6            flextable_0.9.6
  [4] base64enc_0.1-3       httpcode_0.3.0        rstudioapi_0.16.0
  [7] farver_2.1.1          ggrepel_0.9.5         DT_0.33
 [10] fansi_1.0.6           lubridate_1.9.3       xml2_1.3.4
 [13] codetools_0.2-19      splines_4.1.3         cachem_1.0.8
 [16] shinythemes_1.2.0     jsonlite_1.8.8        shiny_1.8.1.1
 [19] BiocManager_1.30.23   compiler_4.1.3        httr_1.4.7
 [22] backports_1.5.0       ggcorrplot_0.1.4.1    Matrix_1.5-3
 [25] fastmap_1.1.1         cli_3.6.2             later_1.3.2
 [28] htmltools_0.5.8.1     tools_4.1.3           gtable_0.3.5
 [31] glue_1.7.0            reshape2_1.4.4        tinytex_0.51
 [34] Rcpp_1.0.12           jquerylib_0.1.4       fontquiver_0.2.1
 [37] vctrs_0.6.5           crul_1.4.2            svglite_2.1.1
 [40] xfun_0.44             stringr_1.5.1         summarytools_1.0.1
 [43] rvest_1.0.3           timechange_0.3.0      mime_0.12
 [46] lifecycle_1.0.4       shinycssloaders_1.0.0 MASS_7.3-58.3
 [49] scales_1.3.0          ragg_1.2.5            promises_1.3.0
 [52] fontLiberation_0.1.0  yaml_2.3.8            curl_5.2.1
 [55] ggplot2_3.5.1         pander_0.6.5          gdtools_0.3.7
 [58] sass_0.4.9            stringi_1.8.4         fontBitstreamVera_0.1.1
 [61] highr_0.11            checkmate_2.3.1       zip_2.3.1
 [64] rlang_1.1.4           pkgconfig_2.0.3       systemfonts_1.0.6
 [67] matrixStats_1.3.0     evaluate_0.23         lattice_0.21-8
 [70] purrr_1.0.2           rapportools_1.1       htmlwidgets_1.6.4
 [73] labeling_0.4.3        cowplot_1.1.3         tidyselect_1.2.1
 [76] plyr_1.8.9            magrittr_2.0.3        bookdown_0.39
 [79] R6_2.5.1              magick_2.7.4          generics_0.1.3
 [82] pillar_1.9.0          withr_3.0.0           tibble_3.2.1
 [85] crayon_1.5.2          gfonts_0.2.0          uuid_1.2-0
 [88] utf8_1.2.4            rmarkdown_2.27        officer_0.6.6
 [91] grid_4.1.3            data.table_1.15.4     digest_0.6.35
 [94] webshot_0.5.4         xtable_1.8-4          tidyr_1.3.1
 [97] httpuv_1.6.15         textshaping_0.3.7     openssl_2.2.0
[100] munsell_0.5.1         viridisLite_0.4.2     bslib_0.7.0
[103] tcltk_4.1.3           askpass_1.2.0         shinyjs_2.1.0

A. F. M. Smith, D. J. Spiegelhalter. 1980. “Bayes Factors and Choice Criteria for Linear Model.” Journal Article. Journal of the Royal Statistical Society. Series B (Methodological) 42 (2): 213–20.

Allan D R McQuarrie, Chih-Ling Tsai. 1998. Regression and Time Series Model Selection. Book. River Edge, NJ.: World Scientific Publishing Co. Pte. Ltd.

Al-Subaihi, Ali A. 2002. “Variable Selection in Multivariable Regression Using SAS/IML.” Journal Article. Journal of Statistical Software 7 (12): 1–20.

Clifford M. Hurvich, Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Journal Article. Biometrika 76: 297–307.

Darlington, R. B. 1968. “Multiple Regression in Psychological Research and Practice.” Journal Article. Psychological Bulletin 69 (3): 161–82.

E. J. Hannan, B. G. Quinn. 1979. “The Determination of the Order of an Autoregression.” Journal Article. Journal of the Royal Statistical Society. Series B (Methodological) 41 (2): 190–95.

Edward J. Bedrick, Chih-Ling Tsai. 1994. “Model Selection for Multivariate Regression in Small Samples.” Journal Article. Biometrics 50 (1): 226–31.

George G. Judge, R. Carter Hill, William E. Griffiths. 1985. The Theory and Practice of Econometrics, 2nd Edition. Book. Wiley. https://www.wiley.com/en-us/The+Theory+and+Practice+of+Econometrics%2C+2nd+Edition-p-9780471895305.

Hocking, R. R. 1976. “A Biometrics Invited Paper. The Analysis and Selection of Variables in Linear Regression.” Journal Article. Biometrics 32 (1): 1–49.

Hotelling, Harold. 1992. “The Generalization of Student’s Ratio.” Book Section. In Breakthroughs in Statistics: Foundations and Basic Theory., 54–62.

J. A. Nelder, R. W. M. Wedderburn. 1972. “Generalized Linear Models.” Journal Article. Journal of the Royal Statistical Society. Series A (General) 135 (3): 370–84.

K. V. Mardia, J. M. Bibby, J. T. Kent. 1981. “Multivariate Analysis.” Journal Article. Mathematical Gazette 65 (431): 75–76.

Kenneth P. Burnham, David R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach 2nd Edition. Book. Springer.

Mallows, C. L. 1973. “Some Comments on CP.” Journal Article. Technometrics 15 (4): 661–75.

Mark J. Brewer, Susan L. Cooksley, Adam Butler. 2016. “The Relative Performance of AIC, AICC and BIC in the Presence of Unobserved Heterogeneity.” Journal Article. Methods in Ecology and Evolution 7 (6): 679–92.

McKeon, James J. 1974. “F Approximations to the Distribution of Hotelling’s T_0^2.” Journal Article. Biometrika 61 (2): 381–83.

Pillai, K. C. S. 1955. “Some New Test Criteria in Multivariate Analysis.” Journal Article. Ann. Math. Statist. 26 (1): 117–21.

Prathapasinghe Dharmawansa, Ofer Shwartz, Boaz Nadler. 2014. “Roy’s Largest Root Under Rank-One Alternatives: The Complex Valued Case and Applications.” Journal Article. arXiv Preprint arXiv 1411: 4226.

RS Sparks, D Coutsourides, W Zucchini. 1985. “On Variable Selection in Multivariate Regression.” Journal Article. Communication in Statistics- Theory and Methods 14 (7): 1569–87.

SAS Institute Inc. 2018. SAS/STAT® 15.1 User’s Guide. Book. Cary, NC: SAS Institute Inc.

Sawa, Takamitsu. 1978. “Information Criteria for Discriminating Among Alternative Regression Models.” Journal Article. Econometrica 46 (6): 1273–91.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Journal Article. Ann. Statist. 6 (2): 461–64.

Stevens, James P. 2016. Applied Multivariate Statistics for the Social Sciences. Book. Fifth Edition. Routledge.
