Tutorial: Regression Modeling Strategies using the R Package rms
Frank E Harrell Jr, Department of Biostatistics,
Vanderbilt University School of Medicine, USA
Course Description
The first part of the course presents the following elements
of multivariable predictive modeling for a single response
variable: using regression splines to relax linearity
assumptions, perils of variable selection and overfitting,
where to spend degrees of freedom, shrinkage, imputation of
missing data, data reduction, and interaction surfaces. Then
a default overall modeling strategy will be described. This
is followed by methods for graphically understanding models
(e.g., using nomograms) and using resampling to estimate a
model's likely performance on new data. An overview of the
freely available R rms package follows; rms facilitates most of
the steps of the modeling
process. Two of the following three case studies will be
presented: an interactive exploration of the survival status
of Titanic passengers, an interactive case study in
developing a survival time model for critically ill patients,
and a case study in Cox regression.
The methods covered in this course will apply to almost any
regression model, including ordinary least squares, logistic
regression models, and survival models.
Objectives
- Be familiar with modern methods for fitting multivariable
  regression models:
  - accurately
  - in a way the sample size will allow, without
    overfitting
  - uncovering complex non-linear or non-additive
    relationships
  - testing for and quantifying the association between one
    or more predictors and the response, with possible
    adjustment for other factors
- Be able to validate models for predictive accuracy and to
  detect overfitting
- Be able to interpret fitted models using both parameter
  estimates and graphics
- Be able to critique the literature to detect models that
  are likely to be unreliable
Outline
- Planning for Modeling
- Notation for Regression Models
- Interpreting Model Parameters
  - Nominal Predictors
  - Interactions
- Relaxing Linearity Assumption for Continuous Predictors
  - Simple Nonlinear Terms
  - Splines for Estimating Shape of Regression Function and
    Determining Predictor Transformations
  - Cubic Spline Functions
  - Restricted Cubic Splines
  - Nonparametric Regression
  - Advantages of Splines over Other Methods
- Tests of Association
- Assessment of Model Fit
  - Regression Assumptions
  - Modeling and Testing Interactions
- Missing Data
  - Types of Missingness
  - Understanding Patterns of Missing Values
  - Problems with Simple Alternatives to Imputation
  - Strategies for Developing Imputation Algorithms
  - Single Conditional Mean Imputation
  - Multiple Imputation
  - R Software for Fitting Models and Adjusting Variances
    for Multiple Imputation
- Multivariable Modeling Strategy
  - Pre-Specification of Predictor Complexity
  - Variable Selection
  - Overfitting and Limits on Number of Predictors
  - Shrinkage
  - Data Reduction
- Resampling, Validating, Describing, and Simplifying the
  Model
  - The Bootstrap
  - Model Validation
  - Graphically Describing the Fitted Model
  - Simplifying the Model by Approximating It
- R rms package
- Case Study: Binary Logistic Model for Survival of Titanic
  Passengers
  - Missing Data
  - Nonparametric Regression
  - Development of Logistic Model
  - Multiple Imputation to Handle Missing Passenger Ages
- Case Study: Development of a Long-Term Survival Model for
  Critically Ill Patients
Instructor
Dr. Harrell is Professor of Biostatistics and Statistics at
the Dept. of Biostatistics, Vanderbilt University School of
Medicine, Nashville, TN. He received his Ph.D. in
biostatistics from the University of North Carolina, Chapel
Hill in 1979, where he studied under P.K. Sen. Dr. Harrell
has been involved in statistical computing since 1969 and is
the author of many R functions and SAS procedures. Since 1973
he has been involved in medical applications of statistics,
especially in the area of survival analysis and clinical
prediction modeling. He is an editorial consultant for the
Journal of Clinical Epidemiology, an associate
editor of Statistics in Medicine, a fellow of the
American Statistical Association, and a consultant to the FDA and
to the pharmaceutical and finance industries. He has been an
S/R user since 1991.
Handouts
Participants will receive extensive handouts, which will also
be available in advance at http://biostat.mc.vanderbilt.edu/rms.
Background
Regression models are frequently used to develop diagnostic,
prognostic, and health resource utilization models in
clinical, health services, outcomes, pharmacoeconomic, and
epidemiologic research, and in a multitude of
non-health-related areas. Regression models are also used to
adjust for patient heterogeneity in randomized clinical
trials, to obtain tests that are more powerful and valid than
unadjusted treatment comparisons.
Models must be flexible enough to fit nonlinear and
non-additive relationships, but unless the sample size is
enormous, the approach to modeling must avoid common problems
with data mining or data dredging that result in overfitting
and a failure of the predictive model to validate on new
subjects.
All standard regression models have assumptions that must be
verified for the model to have power to test hypotheses and
for it to be able to predict accurately. Of the principal
assumptions (linearity, additivity, distributional), this
short course will emphasize methods for assessing and
satisfying the first two. Practical but powerful tools are
presented for validating model assumptions and presenting
model results. This course provides methods for estimating
the shape of the relationship between predictors and response
by augmenting the design matrix with restricted cubic splines,
a widely applicable technique.
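To make the idea of augmenting the design matrix concrete, here is a minimal sketch in Python of a restricted cubic spline basis built from truncated power terms. It is an illustration only, not the rms package's rcs() implementation: the function name rcs_basis, the simulated data, the knot placement at fixed quantiles, and the division by the squared knot range are all assumptions made for this example.

```python
import numpy as np

def rcs_basis(x, knots):
    """Columns for a restricted cubic spline in x (hypothetical helper).

    Returns x itself plus k - 2 nonlinear terms for k knots. Each term
    is built so the fitted curve is linear beyond the outer knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        # Truncated cubic: u^3 where u > 0, otherwise 0.
        return np.maximum(u, 0.0) ** 3

    # Assumed scaling so the nonlinear columns are on the scale of x.
    scale = (t[-1] - t[0]) ** 2
    gap = t[k - 1] - t[k - 2]
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / gap
                + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / gap)
        cols.append(term / scale)
    return np.column_stack(cols)

# Simulated example: a nonlinear relationship fitted by least squares.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# Five knots placed at fixed quantiles of x (an assumed placement).
knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])

# Design matrix: intercept, linear term, and three nonlinear terms.
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

With k knots the predictor costs k - 1 regression coefficients instead of one, which is one way the spending of degrees of freedom becomes explicit. Because the basis columns enter the model linearly, the same columns could in principle be used in a logistic or survival model.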
Intended Audience
The course is intended for statisticians and persons from other
quantitative disciplines who are interested in multivariable
regression analysis of univariate responses and in developing,
validating, and graphically describing multivariable predictive
models. The course will be of particular interest to:
- Applied statisticians who want to learn new methodology for
  flexibly fitting all types of multivariable regression
  models while making estimation of optimal covariable
  transformations an explicit part of the modeling process.
- Those who want to learn how to develop models that are
  likely to predict future observations as accurately as they
  predicted responses from the data used to fit the models.
- Statisticians who want to learn how to graphically present
  complex regression models to non-statisticians.
- Analysts who would like to be introduced to multiple
  imputation with regression models to handle missing and
  incomplete data.
- Quantitatively-minded epidemiologists and others who need
  to use binary or ordinal logistic models and time-to-event
  (survival) models for analyzing and predicting outcomes in
  observational studies.
- Biostatisticians, health services and outcomes researchers,
  and economists who need to study or predict health outcomes
  or resource utilization.
Prerequisites
Participants should have a good general knowledge of statistical
estimation and inference methods and a good command of ordinary
linear regression. Those who want to run the laboratory exercises
themselves, or who want to apply the methods taught in this
course in their everyday work using R, should have had a previous
introduction to R. Participants are encouraged to read
references [1, 2, 3] in advance.
References
[1] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable
prognostic models: Issues in developing models, evaluating
assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine, 15:361–387, 1996.
[2] F. E. Harrell, P. A. Margolis, S. Gove, K. E. Mason, E. K.
Mulholland, D. Lehmann, L. Muhe, S. Gatchalian, and H. F.
Eichenwald. Development of a clinical prediction model for an
ordinal outcome: The World Health Organization ARI Multicentre
Study of clinical signs and etiologic agents of pneumonia,
sepsis, and meningitis in young infants. Statistics in
Medicine, 17:909–944, 1998.
[3] A. Spanos, F. E. Harrell, and D. T. Durack. Differential
diagnosis of acute meningitis: An analysis of the predictive
value of initial observations. Journal of the American
Medical Association, 262:2700–2707, 1989.