useR

Tutorial: Regression Modeling Strategies using the R Package rms


Frank E Harrell Jr, Department of Biostatistics, Vanderbilt University School of Medicine, USA

Course Description

The first part of the course presents the following elements of multivariable predictive modeling for a single response variable: using regression splines to relax linearity assumptions, the perils of variable selection and overfitting, where to spend degrees of freedom, shrinkage, imputation of missing data, data reduction, and interaction surfaces. A default overall modeling strategy is then described, followed by methods for graphically understanding models (e.g., using nomograms) and for using resampling to estimate a model's likely performance on new data. Next comes an overview of the freely available R package rms, which facilitates most of the steps of the modeling process. Two of the following three case studies will be presented: an interactive exploration of the survival status of Titanic passengers, an interactive case study in developing a survival time model for critically ill patients, and a case study in Cox regression.

The methods covered in this course will apply to almost any regression model, including ordinary least squares, logistic regression models, and survival models.

Objectives

Outline

Instructor

Dr. Harrell is Professor of Biostatistics and Statistics at the Dept. of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN. He received his Ph.D. in biostatistics from the University of North Carolina, Chapel Hill in 1979, where he studied under P.K. Sen. Dr. Harrell has been involved in statistical computing since 1969 and is the author of many R functions and SAS procedures. Since 1973 he has been involved in medical applications of statistics, especially in the areas of survival analysis and clinical prediction modeling. He is an editorial consultant for the Journal of Clinical Epidemiology, an associate editor of Statistics in Medicine, a fellow of the American Statistical Association, and a consultant to the FDA and to the pharmaceutical and finance industries. He has been an S/R user since 1991.

Handouts

Participants will receive extensive handouts, which will also be available in advance at http://biostat.mc.vanderbilt.edu/rms.

Background

Regression models are frequently used to develop diagnostic, prognostic, and health resource utilization models in clinical, health services, outcomes, pharmacoeconomic, and epidemiologic research, and in a multitude of non-health-related areas. Regression models are also used to adjust for patient heterogeneity in randomized clinical trials, to obtain tests that are more powerful and valid than unadjusted treatment comparisons.

Models must be flexible enough to fit nonlinear and non-additive relationships, but unless the sample size is enormous, the approach to modeling must avoid common problems with data mining or data dredging that result in overfitting and a failure of the predictive model to validate on new subjects.

All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this short course will emphasize methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response using a widely applicable technique: augmenting the design matrix with restricted cubic spline terms.
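To make the idea of augmenting the design matrix concrete, here is a minimal sketch in plain Python of a restricted cubic spline basis for one continuous predictor. It is an illustration only, not the rms implementation (in R, the rcs function in rms plays this role). The function names rcs_basis and augment, the example knot locations, and the scaling by the squared distance between the outer knots are assumptions made for this sketch. With k knots, the predictor contributes k - 1 columns: the original value plus k - 2 nonlinear terms constructed so that the fitted curve is linear beyond the outer knots.

```python
def rcs_basis(x, knots):
    """Return the k - 1 restricted cubic spline columns for a single value x.

    Illustrative sketch only; knots must be distinct and number at least 3.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")
    if len(set(t)) != k:
        raise ValueError("knots must be distinct")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # assumed scaling convention for this sketch
    gap = t[-1] - t[-2]           # distance between the last two knots
    row = [float(x)]              # first column: the linear term
    for j in range(k - 2):
        term = (cube_plus(x - t[j])
                - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
                + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap)
        row.append(term / scale)
    return row


def augment(xs, knots):
    """Build the spline block of a design matrix, one row per observation."""
    return [rcs_basis(x, knots) for x in xs]


if __name__ == "__main__":
    example_knots = [20, 40, 60, 80]   # hypothetical knot locations
    for r in augment([10, 30, 50, 70, 90], example_knots):
        print([round(v, 4) for v in r])
```

These columns would replace the single linear column for the predictor in an otherwise ordinary regression fit, so that testing the k - 2 nonlinear coefficients jointly assesses the linearity assumption.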

Intended Audience

Statisticians and persons from other quantitative disciplines who are interested in multivariable regression analysis of univariate responses, and in developing, validating, and graphically describing multivariable predictive models. The course will be of particular interest to:

Prerequisites

A good general knowledge of statistical estimation and inference methods and a good command of ordinary linear regression. Those who want to run the laboratory exercises themselves, or who want to apply the methods taught in this course in R in their everyday work, should have had a previous introduction to R. Participants are encouraged to read references [1, 2, 3] in advance.


Please check here for up-to-date tutorial resources.

References

[1] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15:361–387, 1996.

[2] F. E. Harrell, P. A. Margolis, S. Gove, K. E. Mason, E. K. Mulholland, D. Lehmann, L. Muhe, S. Gatchalian, and H. F. Eichenwald. Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Statistics in Medicine, 17:909–944, 1998.

[3] A. Spanos, F. E. Harrell, and D. T. Durack. Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. Journal of the American Medical Association, 262:2700–2707, 1989.