https://github.com/cran/caret
Tip revision: 287adc67c8767de8d0cf6fe6ec16ca734cdf5441 authored by Max Kuhn on 02 October 2012, 00:00:00 UTC
version 5.15-044
version 5.15-044
Tip revision: 287adc6
caretTrain.Rnw
% \VignetteIndexEntry{caret Manual -- Model Building}
% \VignetteDepends{caret}
% \VignettePackage{caret}
\documentclass[12pt]{article}
\usepackage{colortbl}
\usepackage{amsmath}
\usepackage[pdftex]{graphicx}
\usepackage{color}
\usepackage{xspace}
\usepackage{fancyvrb}
\usepackage{fancyhdr}
\usepackage{lastpage}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage[boxed, linesnumbered]{algorithm2e}
\usepackage[
colorlinks=true,
linkcolor=blue,
citecolor=blue,
urlcolor=blue]
{hyperref}
\usepackage{lscape}
\usepackage{Sweave}
\SweaveOpts{keep.source=TRUE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% define new colors for use
\definecolor{darkgreen}{rgb}{0,0.6,0}
\definecolor{darkred}{rgb}{0.6,0.0,0}
\definecolor{lightbrown}{rgb}{1,0.9,0.8}
\definecolor{brown}{rgb}{0.6,0.3,0.3}
\definecolor{darkblue}{rgb}{0,0,0.8}
\definecolor{darkmagenta}{rgb}{0.5,0,0.5}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\bld}[1]{\mbox{\boldmath $#1$}}
\newcommand{\shell}[1]{\mbox{$#1$}}
\renewcommand{\vec}[1]{\mbox{\bf {#1}}}
\newcommand{\codeheading}[1]{\mbox{\color{darkblue}\texttt{#1}}}
\newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}}
\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
\renewcommand{\pkg}[1]{{\textsf{#1}}}
\newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize}
\newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize}
\providecommand{\SetAlgoLined}{\SetLine}
\newcommand{\halfs}{\frac{1}{2}}
\setlength{\oddsidemargin}{-.25 truein}
\setlength{\evensidemargin}{0truein}
\setlength{\topmargin}{-0.2truein}
\setlength{\textwidth}{7 truein}
\setlength{\textheight}{8.5 truein}
\setlength{\parindent}{0truein}
\setlength{\parskip}{0.10truein}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\fvset{fontsize=\footnotesize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagestyle{fancy}
\lhead{}
\chead{The {\tt caret} Package}
\rhead{}
\lfoot{}
\cfoot{}
\rfoot{\thepage\ of \pageref{LastPage}}
\renewcommand{\headrulewidth}{1pt}
\renewcommand{\footrulewidth}{1pt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{The {\tt caret} Package}
\author{Max Kuhn \\ max.kuhn@pfizer.com}
\begin{document}
\maketitle
\renewcommand{\baselinestretch}{.6}
\tableofcontents
\thispagestyle{empty}
\vspace{.2in}
\renewcommand{\baselinestretch}{1}
<<loadLibs, results = hide, echo = FALSE>>=
library(caret)
library(kernlab)
library(gbm)
library(ipred)
library(grid)
library(Hmisc)
library(randomForest)
data(BloodBrain)
data(mdrr)
getInfo <- function(what = "Suggests")
{
text <- packageDescription("caret")[what][[1]]
text <- gsub("\n", ", ", text, fixed = TRUE)
text <- gsub(">=", "$\\\\ge$", text, fixed = TRUE)
eachPkg <- strsplit(text, ", ", fixed = TRUE)[[1]]
eachPkg <- gsub(",", "", eachPkg, fixed = TRUE)
out <- paste("\\\\pkg{", eachPkg[order(tolower(eachPkg))], "}", sep = "")
paste(out, collapse = ", ")
}
@
The \pkg{caret} package (short for
{\bf{\color{blue}{c}}}lassification {\bf{\color{blue}{a}}}nd
{\bf{\color{blue}{re}}}gression {\bf{\color{blue}{t}}}raining)
contains functions to streamline the model training process for
complex regression and classification problems. The package utilizes a
number of R packages but tries not to load them all at package
start-up\footnote{By adding formal package dependencies, the package
startup time can be greatly decreased}. The package ``suggests''
field includes: \Sexpr{getInfo("Suggests")}. \pkg{caret} loads
packages as needed and assumes that they are installed. Install
\pkg{caret} using
\begin{Verbatim}
install.packages("caret", dependencies = c("Depends", "Suggests"))
\end{Verbatim}
to ensure that all the needed packages are installed.
\section{Model Training and Parameter Tuning}\label{S:train}
\pkg{caret} has several functions that attempt to streamline the model building and evaluation process.
The \pkg{caret} function can be used to
\begin{itemize}
\item evaluate, using resampling, the effect of model tuning parameters on performance
\item choose the ``optimal'' model across these parameters
\item estimate model performance from a training set
\end{itemize}
More formally:
\begin{algorithm}[H]
\label{A:tune}
\SetAlgoLined
\RestyleAlgo{plain}
\DontPrintSemicolon
Define sets of model parameter values to evaluate \nllabel{A:grid}\;
\For{each parameter set}{
\For{each resampling iteration}{
Hold--out specific samples \nllabel{A:resample} \;
[Optional] Pre--process the data\;
Fit the model on the remainder\;
Predict the hold--out samples\;
}
Calculate the average performance across hold--out predictions \nllabel{A:perf}
}
Determine the optimal parameter set \nllabel{A:best}\;
Fit the final model to all the training data using the optimal
parameter set\;
\end{algorithm}
First, a specific model must be chosen. Currently,
\Sexpr{length(unique(modelLookup()$model))} are available using
\pkg{caret}; see Tables \ref{T:methods1}, \ref{T:methods2} and \ref{T:methods3} for details.
In Tables \ref{T:methods1}, \ref{T:methods2} and \ref{T:methods3}, there are lists of tuning parameters that can
potentially be optimized. The first step in tuning the model (line
\ref{A:grid} in Algorithm \ref{A:tune}) is to choose a set of
parameters to evaluate. For example, if fitting a Partial Least Squares
(PLS) model, the number of PLS components to evaluate must be specified.
Once the model and tuning parameter values have been defined, the type
of resampling should be also be specified. Currently, $k$--fold
cross--validation (once or repeated),
leave--one--out cross--validation and bootstrap
(simple estimation or the 632 rule)
resampling methods can be used by \pkg{caret}. After resampling,
the process produces a profile of performance measures is available to
guide the user as to which tuning parameter values should be
chosen. By default, the function automatically chooses the tuning
parameters associated with the best value, although different
algorithms can be used (see Section \ref{S:finalMod}).
\subsection{An Example}
As an example, the multidrug resistance reversal (MDRR) agent data is
used to determine a predictive model for the ``ability of a compound
to reverse a leukemia cell's resistance to adriamycin''
(\href{http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2005/45/i03/abs/ci0500379.html}{Svetnik
et al, 2003}). For each sample (i.e. compound), predictors are
calculated that reflect characteristics of the molecular
structure. These molecular descriptors are then used to predict assay
results that reflect resistance.
The data are accessed using \code{data(mdrr)}. This creates a data
frame of predictors called \code{mdrrDescr} and a factor vector with
the observed class called \code{mdrrClass}.
To start, we will:
\begin{itemize}
\item use unsupervised filters to remove predictors with
unattractive characteristics (e.g. spare distributions or high
inter--predictor correlations)
\item split the entire data set into a training and test set
\end{itemize}
See the package vignette ``caret Manual -- Data and Functions'' for more details about these operations.
\begin{small}
<<preProc>>=
print(ncol(mdrrDescr))
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
print(ncol(filteredDescr))
descrCor <- cor(filteredDescr)
highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
filteredDescr <- filteredDescr[,-highlyCorDescr]
print(ncol(filteredDescr))
set.seed(1)
inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)
trainDescr <- filteredDescr[inTrain,]
testDescr <- filteredDescr[-inTrain,]
trainMDRR <- mdrrClass[inTrain]
testMDRR <- mdrrClass[-inTrain]
print(length(trainMDRR))
print(length(testMDRR))
@
\end{small}
\subsection{Basic Parameter Tuning}\label{S:basic}
By default, simple bootstrap resampling is used for line
\ref{A:resample} in Algorithm \ref{A:tune}. Others are availible, such
as repeated $K$--fold cross--validation. The function
\code{trainControl} can be used to specifiy the type of resampling:
\begin{small}
<<setup>>=
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3,
## Save all the resampling results
returnResamp = "all")
@
\end{small}
More information about \code{trainControl} is given in Section
\ref{S:control}.
The first two arguments to \pkg{caret} are the predictor and
outcome data objects, respectively. The third argument,
\code{method}, specifies the type of model (see Tables \ref{T:methods1}, \ref{T:methods2} and \ref{T:methods3}). We will fit a boosted tree model via the \pkg{gbm}
package. The basic syntax for fitting this model using repeated
cross--vlaidation is shown below:
\begin{small}
<<mdrrModel1>>=
gbmFit1 <- train(trainDescr, trainMDRR,
method = "gbm",
trControl = fitControl,
## This last option is actually one
## for gbm() that passes through
verbose = FALSE)
gbmFit1
@
\end{small}
For a gradient boosting machine (GBM) model, there are three main
tuning parameters:
\begin{itemize}
\item number of iterations, {\it i.e. trees}, (called \code{n.trees} in the
\code{gbm} function)
\item complexity of the tree, called \code{interaction.depth}
\item learning rate: how quickly the algorithm adapts, called
\code{shrinkage}
\end{itemize}
The default values tested for this model are shown in the first two
columns (\code{shrinkage} is not shown beause the grid set of
candidate models all use a value of 0.1 for this tuning parameter).
The column labeled ``\code{Accuracy}'' is the overall agreement rate
averaged over cross--validation iterations. The agreement standard
deviation is also calculated from the cross-validation results. The
column ``\code{Kappa}'' is Cohen's (unweighted) Kappa statistic
averaged across the resampling results. \pkg{caret} works with
specific models (see Tables \ref{T:methods1}, \ref{T:methods2} and \ref{T:methods3}). For these models, \pkg{caret} can automatically
create a grid of tuning parameters. By default, if $p$ is the number
of tuning parameters, the grid size is $3^p$. For example, regularized
discriminant analysis (RDA) models have two parameters
(\texttt{gamma} and \texttt{lambda}), both of which lie on $[0,
1]$. The default training grid would produce nine combinations in this
two--dimensional space.
<<getSeqMods, echo = FALSE, results = hide>>=
seqModList <- paste(paste("\\\\pkg{", unique(subset(modelLookup(), seq)$model), "}", sep = ""), collapse = ", ")
@
\subsection{Notes}
\begin{itemize}
\item There is a formula interface (e.g. \code{train(y$\sim$., data =
someData}) that can be used. One of the issues with a large
number of predictors is that the objects related to the formula
which are saved can get very large. In these cases, it is best to
stick with the non--formula interface described above.
\item The function determines the type of problem
(classification or regression) from the type of the response
given in the \code{y} argument.
\item The \code{$\ldots$} option can be used to pass
parameters to the fitting function. For example, in random
forest models, you can specify the number of trees to be
used in the call to \pkg{caret}.
\item For regression models (i.e. a numeric outcome), a similar table would
be produced showing the average root mean squared error and average
$R^2$ value statistic across tuning parameters, otherwise known as
$Q^2$ (see the note below related to this calculation).
For regression models, the classical $R^2$ statistic
cannot be compared between models that contain an intercept
and models that do not. Also, some models do not have an
intercept only null model.
\item [] To approximate this statistic across different
types of models, the square of the correlation between
the observed and predicted outcomes is used. This means
that the $R^2$ values produced by \pkg{caret} will not
match the results of \code{lm} and other functions.
\item [] Also, the correlation estimate does not take into
account the degrees of freedom in a model and thus does
not penalize models with more parameters. For some
models (e.g random forests or on--linear support vector
machines) there is no clear sense of the degrees of
freedom, so this information cannot be used in $R^2$ if
we would like to compare different models.
\item The nearest shrunken centroid model of
\href{http://projecteuclid.org/handle/euclid.ss/1056397488}{Tibshirani
et al (2003)} is specified using \code{method =
"pam"}. For this model, there must be at least two samples
in each class. \pkg{caret} will ignore classes where
there are less than two samples per class from every model
fit during bootstrapping or cross--validation (this model
only).
\item For recursive partitioning models, an initial model is
fit to all of the training data to obtain the possible
values of the maximum depth of any node
(\code{maxdepth}). The tuning grid is created based on
these values. If \code{tuneLength} is larger than the
number of possible \code{maxdepth} values determined by
the initial model, the grid will be truncated to the
\code{maxdepth} list.
\item [] The same is also true for nearest shrunken centroid
models, where an initial model is fit to find the range of
possible threshold values, and MARS models (see the details
below).
\item For multivariate adaptive regression splines (MARS), the
\texttt{earth} package is used with a model type of \code{mars}
or \code{earth} is requested. The tuning parameters used by
\pkg{caret} are \code{degree} and \code{nprune}. The
parameter \code{nk} is not automatically specified and, if not
specified, the default in the \code{earth} function is used.
\item [] For example, suppose a training set with 40 predictors is
used with \code{degree = 1} and \code{nprune = 20}. An initial
model with \code{nk = 41} is fit and is pruned down to 20
terms. This number includes the intercept and may include
``singleton'' terms instead of pairs.
\item [] Alternate model training schemes can be used by passing
\code{nk} and/or \code{pmethod} to the \code{earth}
function. Also, using \code{method = 'gcvEearth'} will use the
basic GCV pruning procedure and only tune the degree.
\item [] Also, there may be cases where the message such as
``specified 'nprune' 29 is greater than the number of available
model terms 24, forcing 'nprune' to 24'' show up after the model
fit. This can occur since the \code{earth} function may not
actually use the number of terms in the initial model as specified
by \code{nk}. This may be because the \code{earth} function
removes terms with linear dependencies and the forward pass
counts as if terms were added in pairs (although singleton terms
may be used). By default, the \pkg{caret} function fits and
initial MARS model is used to determine the number of possible
terms in the training set to create the tuning grid. Resampled
data sets may produce slightly different models that do not have
as many possible values of \code{nprune}.
\item For the \code{glmboost} and \code{gamboost} functions
from the \texttt{mboost} package, an additional tuning parameter,
\code{prune}, is used by train. If \code{prune = "yes"}, the
number of trees is reduced based on the AIC statistic. If
\code{"no"}, the number of trees is kept at the value specified
by the \code{mstop} parameter. See the \texttt{mboost} package
vignette for more details about AIC pruning.
\item The partitioning model of Molinaro {\it et al.} (2010) has a
tuning parameter that is the number of partitions in the
model. The R function \code{partDSA} has the argument
\code{cut.off.growth} which is described as ``the maximum number
of terminal partitions to be considered when building the model.''
Since this is the maximum, the user might ask for a model with $X$
partitions but the model can only predict $Y < X$. In these cases,
the model predictions will be based on the largest model available
($Y$).
\item For generalized additive models, a formula is generated from
the data. First, predictors with degenerate distributions are
excluded (via the \code{nearZeroVar} function). Then, the
number of distinct values for each predictor is calculated. If
this value is greater than 10, the predictor is entered into the
formula via a smoothed term (otherwise a linear term is
used). For models in the \texttt{gam} package, the smooth terms
have the same amount of smoothing applied to them (i.e. equal
\code{df} or \code{span} across all the smoothed predictors).
\item For some models (\Sexpr{seqModList}), the \pkg{caret}
function will fit a model that can be used to derive predictions
for some sub-models. For example, for MARS (via the
\code{earth} function), for a fixed degree, a model with a
maximum number of terms will be fit and the predictions of all of
the requested models with the same degree and smaller number of
terms will be computed using \code{update.earth} instead of
fitting a new model. When the \code{verboseIter} option of the
\code{trainControl} function is
used, a line is printed for the ``top--level'' model (instead of
each model in the tuning grid).
\item There are \code{print} and \code{plot} methods for the
\code{train} class. The
\code{plot} method visualizes the profile of average resampled
performance values across the different tuning parameters using
scatter plots or level plots. See Figures \ref{f:plots1} and
\ref{f:plots2} for examples. Functions that visualize the individual
resampling results for \pkg{caret} objects are discussed in
Section \ref{S:withinMod}.
\item Using the first set of tuning parameters that are
optimal (in the sense of accuracy or mean squared error),
\pkg{caret} automatically fits a model with these
parameters to the entire training data set. That model
object is accessible in the \code{finalModel} object
within \pkg{caret}. For example, \verb+gbmFit$finalModel+
is the same object that would have been produced using a
direct call to the \code{gbm} function with the final tuning
parameters.
\end{itemize}
There is additional functionality in \pkg{caret} that is described in the next section.
<<getModelData, echo = FALSE, results = hide>>=
models <- modelLookup()
modelKey <- read.delim(system.file("modelKey.txt", package = "caret"))
modelInfo <- merge(models, modelKey, all.x = TRUE)
modelInfo$parameter <- latexTranslate(modelInfo$parameter)
cranRef <- function(x) paste("{\\tt \\href{http://cran.r-project.org/web/packages/", x, "/index.html}{", x, "}}", sep = "")
makeTable <- function(x)
{
params <- paste("\\code{", as.character(x$parameter), "}", sep = "", collapse = ", ")
params <- ifelse(params == "\\code{parameter}", "None", params)
data.frame(method = paste("\\texttt{", as.character(x$model)[1], "}", sep = ""),
Package = cranRef(as.character(x$Package)[1]),
Parameters = params)
}
forTable <- ddply(modelInfo, .(Order, ProblemType, Class, model), makeTable)
forTable$model <- NULL
dual <- subset(forTable, ProblemType == "Dual Use")[,-1]
reg <- subset(forTable, ProblemType == "Regression Only")[,-1]
clas <- subset(forTable, ProblemType == "Classification Only")[,-1]
@
\pagestyle{plain}
{\footnotesize
<<dualTable, echo = FALSE, results = tex>>=
dualCats <- unique(as.character(dual$Class))
dualReps <- table(as.character(dual$Class))
dualReps <- dualReps[dualCats]
dual <- as.matrix(dual[, -(1:2)])
rownames(dual) <- NULL
latex(dual,
file = "",
title = "Type",
rowname = " ",
rgroup = dualCats,
n.rgroup = dualReps,
label = "T:methods1",
caption = "Dual use models (regression and classification) available in \\code{train}.",
booktabs = TRUE,
landscape = TRUE,
longtable = TRUE)
@
<<regTable, echo = FALSE, results = tex>>=
regCats <- unique(as.character(reg$Class))
regReps <- table(as.character(reg$Class))
regReps <- regReps[regCats]
reg <- as.matrix(reg[, -(1:2)])
rownames(reg) <- NULL
latex(reg,
file = "",
title = "Type",
rowname = " ",
rgroup = regCats,
n.rgroup = regReps,
label = "T:methods2",
caption = "Regression only models available in \\code{train}.",
booktabs = TRUE,
landscape = TRUE,
longtable = TRUE)
@
<<clasTable, echo = FALSE, results = tex>>=
clasCats <- unique(as.character(clas$Class))
clasReps <- table(as.character(clas$Class))
clasReps <- clasReps[clasCats]
clas <- as.matrix(clas[, -(1:2)])
rownames(clas) <- NULL
latex(clas,
file = "",
title = "Type",
rowname = " ",
rgroup = clasCats,
n.rgroup = clasReps,
label = "T:methods3",
caption = "Classification only models available in \\code{train}.",
booktabs = TRUE,
landscape = TRUE,
longtable = TRUE)
@
}
\pagestyle{fancy}
\section{Customizing the Tuning Process}
There are a few ways to customize the process of selecting
tuning/complexity parameters and building the final model.
\subsection{Pre--Processing Options}\label{S:pp}
As previously mentioned, \pkg{caret} can pre--process the data in
various ways prior to model fitting. The \pkg{caret} function
\code{preProcess} is automatically used. This function can be used
for centering and scaling, imputation (see details below),
applying the spatial sign transformation and feature extraction via
principal component analysis or independent component
analysis. Options to the \code{preProcess} function can be passed
via the \code{trainControl} function.
These processing steps would be applied during any predictions
generated using \code{predict.train}, \code{extractPrediction} or
\code{extractProbs} (see Section \ref{S:probs} later in this document). The
pre--processing would {\bf not} be applied to predictions that
directly use the \verb+object$finalModel+ object.
For imputation, there are two methods currently implemented:
\begin{itemize}
\item $k$--nearest neighbors takes a sample with missing values and
finds the $k$ closest samples in the training set. The average of
the $k$ training set values for that predictor are used as a
substitute for the original data. When calculating the distances to
the training set samples, the predictors used in the calculation are
the ones with no missing values for that sample and no missing
values in the training set.
\item another approach is to fit a bagged tree model for each
predictor using the training set samples. This is usually a fairly
accurate model and can handle missing values. When a predictor for a
sample requires imputation, the values for the other predictors are
fed through the bagged tree and the prediction is used as the new
value. This model can have significant computational cost.
\end{itemize}
If there are missing values in the training set, PCA and ICA models
only use complete samples.
\subsection{Alternate Tuning Grids}\label{S:altTune}
The tuning parameter grid can be specified by the user. The argument
\code{tuneGrid} can take a data frame with columns for each tuning
parameter (see Tables \ref{T:methods1}, \ref{T:methods2} and \ref{T:methods3} for specific details). The column
names should be the same as the fitting function's arguments with a
period preceding the name. For the previously mentioned RDA example, the names would be
\code{.gamma} and \code{.lambda}. \pkg{caret} will tune the
model over each combination of values in the rows.
We can fix the learning rate and evaluate more than three values of
\code{n.trees}:
\begin{small}
<<mdrrGrid>>=
gbmGrid <- expand.grid(.interaction.depth = c(1, 3),
.n.trees = c(10, 50, 100, 150, 200, 250, 300),
.shrinkage = 0.1)
@
<<mdrrGridFit>>=
set.seed(3)
gbmFit2 <- train(trainDescr, trainMDRR,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
## Now specify the exact models
## to evaludate:
tuneGrid = gbmGrid)
gbmFit2
@
\end{small}
\subsection{The \code{trainControl} Function}\label{S:control}
The function \code{trainControl} generates parameters that further
control how models are created, with possible values:
\begin{itemize}
\item \code{method}: The resampling method: \code{boot}, \code{boot632},
\code{cv}, \code{LOOCV}, \code{LGOCV}, \code{repeatedcv} and \code{oob}. The
last value, out--of--bag estimates, can only be used by random
forest, bagged trees, bagged earth, bagged flexible discriminant
analysis, or conditional tree forest models. GBM models are not
included (the \texttt{gbm} package maintainer has indicated that
it would not be a good idea to choose tuning parameter values
based on the model OOB error estimates with boosted trees). Also,
for leave--one--out cross--validation, no uncertainty estimates
are given for the resampled performance measures.
\item \code{number and repeats}: \code{number} controls with the
number of folds in $K$--fold cross--validation or number of
resampling iterations for bootstrapping and leave--group--out
cross--validation. \code{repeats} applied only to repeated
$K$--fold cross--validation. Suppose that \code{method = "repeatedcv"},
\code{number = 10} and \code{repeats = 3}, then three separate
10--fold cross--validations are used as the resampling scheme.
\item \code{verboseIter}: A logical for printing a training log.
\item \code{returnData}: A logical for saving the data into a slot
called \code{trainingData}.
\item \code{p}: For leave-group out cross-validation: the training
percentage
\item \code{ classProbs}: a logical value determining whether
class probabilities should be computed for held--out samples
during resample. Examples of using this argument are given in
Section \ref{S:altPerf}.
\item \code{index}: a list with elements for each resampling
iteration. Each list element is the sample rows used for training
at that iteration. When these values are not specified,
\pkg{caret} will generate them.
\item \code{summaryFunction}: a function to compute alternate
performance summaries. See Section \ref{S:altPerf} for more
details.
\item \code{selectionFunction}: a function to choose the optimal
tuning parameters. See Section \ref{S:finalMod} for more details
and examples.
\item \code{PCAthresh}, \code{ICAcomp} and \code{k}: these are
all options to pass to the \code{preProcess} function (when used).
\item \code{returnResamp}: a character string containing one of
the following values: \code{"all"}, \code{"final"} or
\code{"none"}. This specifies how much of the resampled
performance measures to save.
\end{itemize}
\begin{figure}
\begin{center}
\includegraphics[clip, width = .4\textwidth]{gbm1}
\hspace*{.2 in}
\includegraphics[clip, width = .4\textwidth]{gbm2}
\caption{ Examples of output from \code{plot.tain}. {\bf left} a plot produced using
\code{plot(gbmFit3)} showing the relationship between the
number of boosting iterations, the interaction depth and the
resampled classification accuracy {\bf right} the same
plot, but the Kappa statistic is plotted using
\code{plot(gbmFit3, metric = "Kappa")}}
\label{f:plots1}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[clip, width = .6\textwidth]{gbm3}
\caption{For the boosted tree example in Section \ref{S:altTune}, using
\code{plot(gbmFit metric = "Kappa", plotType = "level")} shows
the relationship (using a \code{levelplot}) between the number of
boosting iterations, the interaction depth and the resampled
estimate of the Kappa statistic. }
\label{f:plots2}
\end{center}
\end{figure}
\subsection{Alternate Performance Metrics}\label{S:altPerf}
The user can change the metric used to determine the best settings. By
default, RMSE and $R^2$ are computed for regression while accuracy and
Kappa are computed for classification. Also by default, the parameter
values are chosen using RMSE and accuracy, respectively for
regression and classification. The \code{metric} argument of the
\pkg{caret} function allows the user to control which the
optimality criterion is used. For example, in problems where there are
a low percentage of samples in one class, using \code{metric =
"Kappa"} can improve quality of the final model.
If none of these parameters are satisfactory, the user can also
compute custom performance metrics. The \code{trainControl} function
has a argument called \code{summaryFunction} that specifies a
function for computing performance. The function should have these
arguments:
\begin{itemize}
\item \code{data} is a reference for a data frame or matrix with
columns called \code{obs} and \code{pred} for the observed and
predicted outcome values (either numeric data for regression or
character values for classification). Currently, class probabilities
are not passed to the function. The values in data are the held--out
predictions (and their associated reference values) for a single
combination of tuning parameters. If the \code{classProbs}
argument of the \code{trainControl} object is set to
\code{TRUE}, additional columns in \code{data} will be present
that contains the class probabilities. The names of these columns
are the same as the class levels.
\item \code{lev} is a character string that has the outcome factor
levels taken from the training data. For regression, a value of
\code{NULL} is passed into the function.
\item \code{model} is a character string for the model being used
(i.e. the value passed to the \code{method} value of
\pkg{caret}).
\end{itemize}
The output to the function should be a vector of numeric summary
metrics with non--null names. By default, \pkg{caret} evaluate classification models in terms of
the predicted classes. Optionally, class probabilities can also be
used to measure performance. To obtain predicted class probabilities
within the resampling process, the argument \code{classProbs} in
\code{trainControl} must be set to \code{TRUE}. This merges
columns of probabilities into the predictions generated from each
resample (there is a column per class and the column names are the
class names).
As shown in the last section, custom functions can be used to
calculate performance scores that are averaged over the
resamples. Another built--in function, \code{twoClassSummary}, will
compute the sensitivity, specificity and area under the ROC curve.
\clearpage
<<summaryFunc>>=
twoClassSummary
@
To rebuild the boosted tree model using this criterion, we
can see the relationship between the tuning parameters and
the area under the ROC curve using the following code:
\begin{small}
<<setup>>=
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
returnResamp = "all",
## Estimate class probabilities
classProbs = TRUE,
## Evaluate performance using
## the following function
summaryFunction = twoClassSummary)
@
\end{small}
<<reTune>>=
set.seed(3)
gbmFit3 <- train(trainDescr, trainMDRR,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
tuneGrid = gbmGrid,
## Specify which metric to optimize
metric = "ROC")
gbmFit3
@
In this case, the average area under the ROC curve associated with the
optimal tuning parameters was
\Sexpr{round(caret:::getTrainPerf(gbmFit3)[1,"TrainROC"], 3)} across
the \Sexpr{length(gbmFit3$control$index)} resamples.
\subsection{Choosing the Final Model}\label{S:finalMod}
Another method for customizing the tuning process is to modify the
algorithm that is used to select the ``best'' parameter values, given
the performance numbers. By default, the \pkg{caret} function
chooses the model with the largest performance value (or smallest, for
mean squared error in regression models). Other schemes for selecting
model can be used. Breiman et al (1984) suggested the ``one standard
error rule'' for simple tree--based models. In this case, the model
with the best performance value is identified and, using resampling,
we can estimate the standard error of performance. The final model
used was the simplest model within one standard error of the
(empirically) best model. With simple trees this makes sense, since
these models will start to over-fit as they become more and more
specific to the training data.
\pkg{caret} allows the user to specify alternate rules for
selecting the final model. The argument \code{selectionFunction}
can be used to supply a function to algorithmically determine the
final model. There are three existing functions in the package:
\code{best} is chooses the largest/smallest value, \code{oneSE}
attempts to capture the spirit of Breiman et al (1984) and
\code{tolerance} selects the least complex model within some percent
tolerance of the best value. See \code{?best} for more details.
User--defined functions can be used, as long as they have the
following arguments:
\begin{itemize}
\item \code{x} is a data frame containing the tune parameters and
their associated performance metrics. Each row corresponds to a
different tuning parameter combination
\item \code{metric} a character string indicating which
performance metric should be optimized (this is passed in
directly from the \code{metric} argument of \pkg{caret}.
\item \code{maximize} is a single logical value indicating
whether larger values of the performance metric are better
(this is also directly passed from the call to
\pkg{caret}).
\end{itemize}
The function should output a single integer indicating which row in
\code{x} is chosen.
<<bestGBM, echo = FALSE, results = hide>>=
printSelected <- function(x)
{
tmp <- x$bestTune
names(tmp) <- gsub(".", " ", names(tmp), fixed = TRUE)
tmp <- paste(names(tmp), "=", tmp)
paste(tmp, collapse = ", ")
}
getTrainPerf <- function(x)
{
bst <- x$bestTune
names(bst) <- substring(names(bst), 2)
merge(bst, x$results)
}
@
As an example, if we chose the previous boosted tree model on the basis of
overall accuracy (Figure \ref{f:plots1}), we would choose:
\Sexpr{printSelected(gbmFit3)}. However, the scale in this plots is
fairly tight, with accuracy values ranging from
\Sexpr{round(min(gbmFit3$results$ROC), 3)} to
\Sexpr{round(max(gbmFit3$results$ROC), 3)}. A less complex model
(e.g. fewer, more shallow trees) might also yield acceptable
accuracy.
The tolerance function could be used to find a less complex
model based on $(x-x_{best})/x_{best}\times 100$, which is the percent
difference. For example, to select parameter values based on a $2\%$ loss of performance:
<<tolerance>>=
whichTwoPct <- tolerance(gbmFit3$results, "ROC", 2, TRUE)
cat("best model within 2 pct of best:\n")
gbmFit3$results[whichTwoPct,]
@
This indicates that we can get a less complex model with and accuracy
of \Sexpr{round(gbmFit3$results[whichTwoPct,"ROC"], 3)} (compared
to the ``pick the best'' value of
\Sexpr{round(getTrainPerf(gbmFit3)$ROC, 3)}).
The main issue with these functions is related to ordering the models
from simplest to complex. In some cases, this is easy (e.g. simple
trees, partial least squares), but in cases such as this model, the ordering of
models is subjective. For example, is a boosted tree model using 100
iterations and a tree depth of 2 more complex than one with 50
iterations and a depth of 8? The package makes some choices regarding
the orderings. In the case of boosted trees, the package assumes that
increasing the number of iterations adds complexity at a faster rate
than increasing the tree depth, so models are ordered on the number of
iterations then ordered with depth. See \code{?best} for more
examples for specific models.
<<makePlots, results = hide, echo = FALSE>>=
pdf(paste(getwd(), "/gbm1.pdf", sep = ""), width = 5, height = 5)
trellis.par.set(caretTheme(), warn = FALSE)
print(plot(gbmFit3))
dev.off()
pdf(paste(getwd(), "/gbm2.pdf", sep = ""), width = 5, height = 5)
trellis.par.set(caretTheme(), warn = FALSE)
print(plot(gbmFit3, metric = "Spec"))
dev.off()
pdf(paste(getwd(), "/gbm3.pdf", sep = ""), width = 8, height = 4)
trellis.par.set(caretTheme(), warn = FALSE)
print(plot(gbmFit3, meric = "ROC", plotType = "level"))
dev.off()
@
\subsection{Parallel Processing}\label{S:parallel}
If a model is tuned using resampling, the number of model fits can
become large as the number of tuning combinations increases (see the
two loops in Algorithm \ref{A:tune}). To reduce the training time,
parallel processing can be used.
For example, to train the gradient boosting machine model in Section
\ref{S:basic}, each of the \Sexpr{nrow(gbmFit1$results)} candidate
models was fit to \Sexpr{length(gbmFit1$control$index)} separate
resamples. Since each resample is independent of the other, these
\Sexpr{nrow(gbmFit1$results)*length(gbmFit1$control$index)} models
could be computed in parallel.
R has several packages that facilitates parallel processing when
multiple processors are available (see Schmidberger et al., 2009).
\pkg{caret} can be used to build multiple models simultaneously.
As of \pkg{caret} version 4.99, a new parallel processing framework
is used to increase the computational efficiency. The \pkg{foreach}
package allows parallel computations using several different
technologies. Although the execution times using \pkg{foreach} is
similar to the framework used prior to version 4.99, there are a few advantages:
\begin{itemize}
\item the call to \code{train} (or \code{rfe} or \code{sbf}) does
not change. Parallel ``backends'' are registered with
\pkg{foreach} prior to the call to \code{train}
\item compared to the kludgy techniques in \pkg{caret} prior to
version 4.99, \pkg{foreach} does a much better job of managing memory.
\item \pkg{foreach} code can be added in several places in the
package and nest parallelism can be used.
\end{itemize}
For example, to use the \pkg{multicore} package to parallelize the
computations, invoking these commands prior to \code{train} would
split the computations into two workers:
\begin{small}
\begin{Schunk}
\begin{Sinput}
> library(doMC)
> registerDoMC(2)
\end{Sinput}
\end{Schunk}
\end{small}
One common metric used to assess the efficacy of parallelization is
$speedup = T_{seq}/T_{par}$, where $T_{seq}$ and $T_{par}$ denote the
execution times to train the model serially and in parallel,
respectively. Excluding systems with sophisticated shared memory
capabilities, the maximum possible speedup attained by parallelization
with $P$ processors is equal to $P$. Factors affecting the speedup
include the overhead of starting the parallel workers, data transfer,
the percentage of the algorithm's computations that can be done in
parallel, etc.
\begin{figure}[t]
\begin{center}
\includegraphics[clip, width = .4\textwidth]{min}
\includegraphics[clip, width = .4\textwidth]{speedup}
\caption{Training time profiles using parallel processing via
\pkg{caret} for a benchmarking data set run on a 16 core
machine. The left panel shows the elapsed time to train
a model using single or multiple
processors. The panel on the right shows the ``speedup,''
defined to be the time for serial execution divided by the
parallel execution time. The ``old'' line corresponds to version
4.98 and below while the ``new'' curve is using \pkg{foreach}.
}
\label{f:parallel}
\end{center}
\end{figure}
Figure \ref{f:parallel} shows the results of a benchmarking study run
on a 16 core machine. In the left panel, the actual training time for
a radial basis function SVM model for a data set with 5,000 samples
and 400 predictors. The model was tuned over 10 values of the cost
parameter using 50 bootstrap samples. The ``new'' curve corresponds to
the \pkg{foreach} infrastructure.
One downside to parallel processing in this manner is that the dataset
is held in memory for every node used to train the model. For example,
if parallelism is used to compute the results from 50 bootstrap
samples using $P$ processors, $P$ data sets are held in memory. For
large datasets, this can become a problem if the additional processors
are on the same machines where they are competing for the same
physical memory. The old codebase starts to slow down around 10
workers due to exhausting the physical memory on the machine. The new
codebase does a better job at managing memory with no additional slowdown.
\section{Extracting Predictions and Class Probabilities}\label{S:probs}
As previously mentioned, objects produced by the \pkg{caret}
function contain the ``optimized'' model in the \code{finalModel}
sub--object. Predictions can be made from these objects as usual. In
some cases, such as \code{pls} or \code{gbm} objects, additional
parameters from the optimized fit may need to be specified. In these
cases, the \pkg{caret} objects uses the results of the parameter
optimization to predict new samples.
For example, we can load the Boston Housing data:
\begin{small}
<<bhExample>>=
library(mlbench)
data(BostonHousing)
# we could use the formula interface too
bhDesignMatrix <- model.matrix(medv ~. - 1, BostonHousing)
@
\end{small}
\noindent split the data into random training/test groups:
\begin{small}
<<bhSplit>>=
set.seed(4)
inTrain <- createDataPartition(BostonHousing$medv, p = .8, list = FALSE, times = 1)
trainBH <- bhDesignMatrix[inTrain,]
testBH <- bhDesignMatrix[-inTrain,]
trainMedv <- BostonHousing$medv[inTrain]
testMedv <- BostonHousing$medv[-inTrain]
@
\end{small}
\noindent fit partial least squares and multivariate adaptive regression spline models:
\begin{small}
<<bhModels>>=
set.seed(5)
plsFit <- train(trainBH, trainMedv,
"pls",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(verboseIter = FALSE,
returnResamp = "all"))
set.seed(5)
marsFit <- train(trainBH, trainMedv,
"earth",
tuneLength = 10,
trControl = trainControl(verboseIter = FALSE,
returnResamp = "all"))
@
\end{small}
To obtain predictions for the MARS model, \code{predict.earth} can be
used.
\begin{small}
<<marPrediction1>>=
marsPred1 <- predict(marsFit$finalModel, newdata = testBH)
head(marsPred1)
@
\end{small}
Alternatively, \code{predict.train} can be used to get a vector of predictions for the optimal model only:
\begin{small}
<<marsPrediction2>>=
marsPred2 <- predict(marsFit, newdata = testBH)
head(marsPred2)
@
\end{small}
Note that the \code{plsFit} object used pre--processing. In this
case, we cannot directly call \code{predict.mvr} and expect to get
the same answers as \code{predict.train}. The latter function knows
that centering and scaling is required and execute these calculations
on the new samples, whereas \code{predict.mvr} does not. For the
\code{pls} function, there is an argument called \code{scale} that
can be used instead of the pre--processing options in the
\pkg{caret} function.
For multiple models, the objects can be grouped using a list and predicted simultaneously:
\begin{small}
<<bhPrediction1>>=
bhModels <- list(pls = plsFit, mars = marsFit)
bhPred1 <- predict(bhModels, newdata = testBH)
str(bhPred1)
@
\end{small}
In some cases,observed outcomes and their associated predictions may
be needed for a set of models. In this case,
\code{extractPrediction} can be used. This function takes a list of
models and test and/or unknown samples as inputs and returns a data
frame of predictions:
\begin{small}
<<bhPrediction1>>=
allPred <- extractPrediction(bhModels,
testX = testBH,
testY = testMedv)
testPred <- subset(allPred, dataType == "Test")
head(testPred)
ddply(testPred, .(model), defaultSummary)
@
\end{small}
The output of \code{extractPrediction} is a data frame with columns:
\begin{itemize}
\item \code{obs}, the observed data
\item \code{pred}, the predicted values from each model
\item \code{model}, a character string (``\code{rpart}'', ``\code{pls}'' etc.)
\item \code{dataType}, a character string for the type of data:
\begin{itemize}
\item ``\code{Training}'' data are the predictions on the training data from
the optimal model,
\item ``\code{Test}'' denote the predictions on the test set (if one is specified),
\item ``\code{Unknown}'' data are the predictions on the unknown samples (if specified).
Only the predictions are produced for these data. Also, if the quick prediction of the unknowns
is the primary goal, the argument \code{unkOnly} can be used to only process the unknowns.
\end{itemize}
\end{itemize}
Some classification models can produce probabilities for each
class. The functions \code{predict.train} and \code{predict.list}
can be used with the \code{type = "probs"} argument to produce data
frames of class probabilities (with one column per class). Also, the
function \code{extractProbs} can be used to get these probabilities
from one or more models. The results are very similar to what is
produced by \code{extractPrediction} but with columns for each
class. The column \code{pred} is still the predicted class from the
model.
\section{Evaluating Test Sets}
A function, \code{postResample}, can be used obtain the same
performance measures as generated by \pkg{caret} for regression or
classification.
\subsection{Confusion Matrices}\label{S:confusion}
\pkg{caret} also contains several functions that can be used to
describe the performance of classification models. The functions
\code{sensitivity}, \code{specificity}, \code{posPredValue} and
\code{negPredValue} can be used to characterize performance where
there are two classes. By default, the first level of the outcome
factor is used to define the ``positive'' result (i.e. the event of
interest), although this can be changed.
The function \code{confusionMatrix} can also be used to summarize
the results of a classification model:
\begin{small}
<<mdrConfusion>>=
mdrrPredictions <- extractPrediction(list(gbmFit3), testX = testDescr, testY = testMDRR)
mdrrPredictions <- mdrrPredictions[mdrrPredictions$dataType == "Test",]
sensitivity(mdrrPredictions$pred, mdrrPredictions$obs)
confusionMatrix(mdrrPredictions$pred, mdrrPredictions$obs)
@
\end{small}
The ``no--information rate'' is the largest proportion of the observed
classes (there were more actives than inactives in this test set). A
hypothesis test is also computed to evaluate whether the overall
accuracy rate is greater than the rate of the largest class. Also, the
prevalence of the ``positive event'' is computed from the data (unless
passed in as an argument), the detection rate (the rate of true events
also predicted to be events) and the detection prevalence (the
prevalence of predicted events).
Suppose a $2 \times 2$ table with notation
\begin{tabular}{r|c|c|}
& \multicolumn{2}{c}{{\bf Reference}} \\
\cline{2-3}
{\bf Predicted} & Event & No Event \\
\cline{2-3}
Event & A & B \\
\cline{2-3}
No Event & C & D \\
\cline{2-3}
\end{tabular}
The formulas used here are:
\begin{align}
Sensitivity &= \frac{A}{A+C}\notag \\
Specificity &= \frac{D}{B+D}\notag \\
Prevalence &= \frac{A+C}{A+B+C+D}\notag \\
PPV &= \frac{sensitivity \times prevalence}{((sensitivity \times prevalence) + ((1-specificity) \times(1-prevalence))}\notag \\
NPV &= \frac{specificity \times (1-prevalence)}{((1-sensitivity) \times prevalence) + ((specificity) \times(1-prevalence))}\notag \\
Detection\: Rate &= \frac{A}{A+B+C+D}\notag \\
Detection\: Prevalence &= \frac{A + B}{A+B+C+D}\notag
\end{align}
When there are three or more classes, \code{confusionMatrix} will
show the confusion matrix and a set of ``one--versus--all''
results. For example, in a three class problem, the sensitivity of the
first class is calculated against all the samples in the second and
third classes (and so on).
Also, a resampled estimate of the training set can also be obtained
using \code{confusionMatrix.train}. For each resampling iteration,
a confusion matrix is created from the hold--out samples and these
values can be aggregated to diagnose issues with the model fit.
For example, in the two--class SVM model used in Section
\ref{S:basic}, we could use:
<<resampCM>>=
confusionMatrix(gbmFit3)
@
These values are the percentages that hold--out samples landed in the
confusion matrix during resampling. There are several methods for
normalizing these values. See \code{?confusionMatrix.train} for details.
\subsection*{Plotting Predictions and Probabilities}
Two functions, \code{plotObsVsPred} and \code{plotClassProbs}, are
interfaces to lattice to plot model results. For regression,
\code{plotObsVsPred} plots the observed versus predicted values by
model type and data (e.g. test). See Figure \ref{f:bhPredPlot} for example. For classification data,
\code{plotObsVsPred} plots the accuracy rates for models/data in a
dotplot.
<<mdrrPlots, echo = FALSE, results = hide>>=
mdrrProbs <- extractProb(list(gbmFit3), testX = testDescr, testY = testMDRR)
mdrrProbs <- mdrrProbs[mdrrProbs$dataType == "Test",]
pdf(paste(getwd(), "/gbmProbs.pdf", sep = ""), width = 9, height = 7)
trellis.par.set(caretTheme(), warn = FALSE)
print(plotClassProbs(mdrrProbs))
dev.off()
@
\begin{figure}
\begin{center}
\includegraphics[clip, width = .5\textwidth]{gbmProbs}
\caption{The predicted class probabilities from a gradient boosting machine fit for the MDRR test set. This plot was created using \code{plotClassProbs(mdrrProbs)}.}
\label{f:mdrrProbs}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
<<bhPredPlot, echo = FALSE, results = hide, fig = FALSE>>=
pdf("bhPredPlot.pdf", width = 8, height = 5)
trellis.par.set(caretTheme(), warn = FALSE)
print(plotObsVsPred(testPred))
dev.off()
@
\includegraphics[clip, width = .75\textwidth]{bhPredPlot}
\caption{The results of using \code{ plotObsVsPred } to show
plots of the observed median home price against the
predictions from two models. The plot shows the training and
test sets in the same Lattice plot}
\label{f:bhPredPlot}
\end{center}
\end{figure}
To plot class probabilities, \code{plotClassProbs} will display the
results by model, data and true class (for example, Figure
\ref{f:mdrrProbs}).
\clearpage
\section{Exploring and Comparing Resampling Distributions}
\subsection{Within--Model}\label{S:withinMod}
There are several Lattice functions than can be used to explore
relationships between tuning parameters and the resampling results for
a specific model:
\begin{itemize}
\item \code{xyplot} and \code{stripplot} can be used to plot resampling statistics
against (numeric) tuning parameters.
\item \code{histogram} and \code{densityplot} can also be used
to look at distributions of the tuning parameters across tuning parameters.
\end{itemize}
For example, the following statements produces the images in Figure \ref{f:marsDesns}.
\begin{small}
\begin{Schunk}
\begin{Sinput}
> xyplot(marsFit, type= c("g", "p", "smooth"), degree = 2)
> densityplot(marsFit, as.table = TRUE, subset = nprune < 10)
\end{Sinput}
\end{Schunk}
\end{small}
<<makeResampPlots, results = hide, echo = FALSE>>=
pdf(paste(getwd(), "/marsXY.pdf", sep = ""), width = 7, height = 5)
trellis.par.set(caretTheme(), warn = FALSE)
print(xyplot(marsFit, type= c("g", "p", "smooth"), degree = 2))
dev.off()
pdf(paste(getwd(), "/marsDens.pdf", sep = ""), width = 7, height = 5)
trellis.par.set(caretTheme(), warn = FALSE)
print(densityplot(marsFit, as.table = TRUE, subset = nprune < 10))
dev.off()
@
\begin{figure}[hb]
\begin{center}
\includegraphics[clip, width = .7\textwidth]{marsXY}
\vspace{.3 in}
\includegraphics[clip, width = .7\textwidth]{marsDens}
\caption{Scatter plots and density plots of the resampled RMSE by the number of
retained terms for the MARS model fit to the Boston Housing data}
\label{f:marsDesns}
\end{center}
\end{figure}
\clearpage
\subsection{Between--Models}\label{S:betweenMod}
\pkg{caret} also includes functions to characterize the differences
between models (generated using \pkg{caret}, \code{sbf} or
\code{rfe}) via their resampling distributions. These functions are
based on the work of Hothorn et al. (2005) and Eugster et al (2008).
Using the blood-brain barrier data (see \code{?BloodBrain)}, three regression models were
created: an \code{rpart} tree, a conditional inference tree using
\code{ctree}, M5 rules using \code{M5Rules} and a MARS model using
\code{earth}. We ensure that the models use the same resampling data
sets. In this case, 100 leave--group--out cross--validation was
employed.
<<bbbModelFits, eval = FALSE>>=
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = 0.8, times = 100)
bbbControl <- trainControl(method = "LGOCV", index = tmp, timingSamps = 50)
rpartFit <- train(bbbDescr, logBBB,
"rpart",
tuneLength = 16,
trControl = bbbControl)
ctreeFit <- train(bbbDescr, logBBB,
"ctree2",
tuneLength = 10,
trControl = bbbControl)
earthFit <- train(bbbDescr, logBBB,
"earth",
tuneLength = 20,
trControl = bbbControl)
m5Fit <- train(bbbDescr, logBBB,
"M5Rules",
trControl = bbbControl)
@
<<>>=
print(getwd())
@
Given these models, can we make statistical statements about their
performance differences? To do this, we first collect the resampling
results using \code{resamples}.
<<resamps, eval = FALSE>>=
resamps <- resamples(list(CART = rpartFit,
CondInfTree = ctreeFit,
MARS = earthFit,
M5 = m5Fit))
@
<<resampLoad, echo = FALSE, results = hide>>=
load("resamps.RData")
@
<<resampFuncs>>=
resamps
summary(resamps)
@
There are several Lattice plot methods that can be used to visualize
the resampling distributions: density plots, box--whisker plots,
scatterplot matrices and scatterplots of summary statistics. In the
latter case, the plot consists of a scatterplot between the two
models. (See Figure
\ref{f:resampleScatter}). In Figure \ref{F:resampleDens}, density plots
of the data are shown. In this figure, the $R^2$ distributions
indicate that M5 rules and MARS appear to be
similar to one another but different from the two tree--based
models. However, this pattern is inconsistent with the root
mean squared error distributions.
<<resamplePlots>>=
bwplot(resamps, metric = "RMSE")
densityplot(resamps, metric = "RMSE")
xyplot(resamps,
models = c("CART", "MARS"),
metric = "RMSE")
splom(resamps, metric = "RMSE")
@
<<resamplePlots2, results = hide, echo = FALSE>>=
pdf("resampleScatter.pdf", width = 6, height = 6)
trellis.par.set(caretTheme())
print(xyplot(resamps, models = c("CART", "MARS")))
dev.off()
pdf("resampleDens.pdf", width = 7, height = 4.5)
print(
densityplot(resamps,
scales = list(x = list(relation = "free")),
adjust = 1.2,
plot.points = FALSE,
auto.key = list(columns = 2)))
dev.off()
@
Since models are fit on the same versions of the training data, it
makes sense to make inferences on the differences between models. In
this way we reduce the within--resample correlation that may exist. We
can compute the differences, then use a simple $t$--test to evaluate
the null hypothesis that there is no difference between models.
<<diffs>>=
difValues <- diff(resamps)
difValues
summary(difValues)
@
Note that these results are consistent with the patterns shown in
Figure \ref{F:resampleDens}; there are more differences in the $R^2$
distributions than in the error distributions.
Several Lattices methods also exist to plot the differences (density
and box--whisker plots) or the inferential results (level and dot
plots). Figures \ref{F:diffLevel} and \ref{F:diffDot} show examples of
level and dot plots.
<<diffPlots>>=
dotplot(difValues)
densityplot(difValues,
metric = "RMSE",
auto.key = TRUE,
pch = "|")
bwplot(difValues,
metric = "RMSE")
levelplot(difValues, what = "differences")
@
<<diffPlots2, results = hide, echo = FALSE>>=
pdf("diffLevel.pdf", width = 7.5, height = 6)
trellis.par.set(caretTheme())
print(levelplot(difValues, what = "differences"))
dev.off()
pdf("diffDot.pdf", width = 6, height = 6)
plotTheme <- caretTheme()
plotTheme$plot.symbol$pch <- 16
plotTheme$plot.line$col <- "black"
trellis.par.set(plotTheme)
print(dotplot(difValues))
dev.off()
@
\begin{figure}
\begin{center}
\includegraphics[clip, width = .6\textwidth]{resampleScatter}
\caption{ Examples of output from \code{xyplot(resamps, models = c("CART", "MARS"))}. }
\label{f:resampleScatter}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[clip, width = .9\textwidth]{resampleDens}
\caption{ Examples of output from
\code{densityplot(resamps)}. Looking at $R^2$, M5 rules and MARS appear to be
similar to one another but different from the two tree--based
models. However, this pattern is inconsistent with the root
mean squared error distributions.}
\label{F:resampleDens}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[clip, width = .8\textwidth]{diffLevel}
\caption{ Examples of output from \code{levelplot(difValues,
what = "differences")}. The pair--wise differences in RMSE
are shown}
\label{F:diffLevel}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[clip, width = .8\textwidth]{diffDot}
\caption{ Examples of output from
\code{dotplot(difValues)}. The differences in RMSE and their
associated confidence intervals are shown.}
\label{F:diffDot}
\end{center}
\end{figure}
\section{Custom Methods for \codeheading{train}}
Although there are currently more than
\Sexpr{length(unique(modelLookup()$model)) -(length(unique(modelLookup()$model)) %% 10)}
methods available to \code{train}, there may be the need to create
custom model functions (e.g. testing a new model etc). One
application of custom models would be to create diverse ensembles of
models. For example, a set of different classification models may be
fit to the same data and a ''pick--the--winner'' approach can be
taken (or the average of the class probabilities could be used, see
Kuncheva (2004) or Seni and Elder (2010)). \code{train} already has
a framework for resampling and tuning models and
\code{predict.train} can be used to encapsulate the ensemble of
models into one call for prediction.
\subsection*{Why Not Re--Write \codeheading{train} Altogether?}
One could make a strong argument that back--fitting the package to work with custom models is a bit kludgy. This is probably true, but there are several munches of the current code base that, if implemented in a more formal manner, would make the new code base overly complicated.
For example, the current code base exploits models that can produce predictions for many tuning parameters using a single model object. For example PLS can fit a K component model and make predictions from models with $1\ldots K$ components. This potentially saves us $K-1$ model fits. PLS is fairly fast, but pother models with this same feature (gbm, cubist, etc) are computationally taxing and this saves a lot of time.
Also, the package currently does many types of resampling and some, such as the bootstrap 632 estimator, are not as simple as others. In this case, an additional model must be fit for all the data to get the apparent error rate. Formalizing this (or starting over) with a new code base would be unnecessarily complex.
\subsection{How To Write Custom Methods}
The user will need to deviate from the standard call in two ways:
\begin{itemize}
\item use \code{method = "custom"} in the call to \code{train}, and
\item add the required functions for the model using \code{trainControl}
\end{itemize}
The \code{custom} argument of \code{trainControl} requires a list of
named functions with the following elements: \code{parameters},
\code{model}, \code{prediction}, \code{prob} and \code{sort}.
\subsubsection{The \codeheading{parameters} Argument}
This element is used to specify or generate the models tuning
parameters. This can be done either as a function to generate them or
a data frame of the actual parameters.
{\it Inputs}:
\begin{itemize}
\item \code{data}: a data frame of the training set data. The
outcome will be in a column labeled \code{.outcome}. If the
formula method for \code{train} was invoked, the data passed into
this function will have been processed (i.e. dummy variables have
been created etc).
\item \code{len} an optional parameter passed in form the
\code{tuneLength} argument to \code{train}
\end{itemize}
{\it Outputs}: a data frame where
\begin{itemize}
\item all columns start with a dot
\item there is at least one row
\end{itemize}
Instead of a function, the final data frame can be passed in
\subsubsection{The \codeheading{model} Function}
This element fits the model and any other functions
(e.g. pre--processing of the data)
{\it Inputs}:
\begin{itemize}
\item \code{data}: a data frame of the training set data. The outcome
will be in a column labeled \code{.outcome}. If case weights were
specified in the \code{train} call, these are in the column
\code{.modelWeights}. If the formula method for \code{train} was
invoked, the data passed into this function will have been processed
(i.e. dummy variables have been created etc).
\item \code{weights} case weights
\item \code{parameter} a single row data frame with the current
tuning parameter
\item \code{levels}: either \code{NULL} or a character vector or
factor labels
\item \code{last} a logical vector for the final model fit with the
selected tuning parameters and the full training set
\item \code{...} arguments passed form \code{train} to this function
\end{itemize}
{\it Outputs}: a list with at least one element:
\begin{itemize}
\item \code{fit}: the object corresponding to the trained model
\end{itemize}
Anything else can be attached to this object. If custom
pre--processing is required, this can be estimated in the \code{model}
function and attached to the output list. Subsequent calls to the
\code{prediction} and \code{probability} functions will have the
entire list available, so the processing can be applied to the new
data.
\subsubsection{The \codeheading{prediction} Function}
This should be a function that produces either a number vector (for
regression) or a factor (or character) vector for classification.
{\it Inputs}:
\begin{itemize}
\item \code{object}: a list with two elements resulting from the
model function
\item \code{newdata}: a matrix or data frame of predictors to be
processed through the model (and possibly pre-processing routine)
\end{itemize}
The output should be either a numeric, character or factor vector. For
classification, factors are converted to character elsewhere to ensure
the proper levels are in the output.
\subsubsection{The \codeheading{probability} Argument}
For classification models, this function should generate a data frame of class probabilities. For regression, a value of \code{NULL} can be used.
{\it Inputs}:
\begin{itemize}
\item \code{object}: a list with two elements resulting from the
model function
\item \code{newdata}: a matrix or data frame of predictors to be
processed through the model (and possibly pre-processing routine)
\end{itemize}
The output should be a data frame with these characteristics:
\begin{itemize}
\item as many columns are factor levels
\item column names are the same as the factor levels and in the same order
\end{itemize}
\subsubsection{The \codeheading{sort} Function}
There are cases where multiple tuning parameters yield the same level
of performance. In these situations, \code{train} will choose the
parameters associated with the most simplistic model. This function
should take the grid of tuning parameters and order them from least
complex to most complex.
The input is a data frame of tuning parameters (without the
preceding dot in the name).
The output is the same data frame sorted appropriately.
\subsection{An Example}
As an example, suppose we want to test out \code{rpart} models where
we tune over the complexity parameter and the minimum number of
samples in a node to do further splitting (a.k.a \code{minsplit}).
We'll use the Blood--brain barrier data in \pkg{caret} to illustrate.
First, we would need to create a training grid with the candidate
values of \code{cp} and \code{minsplit}. When using the nominal
\code{rpart} method in \code{train}, an initial \code{rpart} model is
created and the unique values of the complexity parameter are obtained
from the sub--object \code{cptable}. We will test two values of
\code{minsplit}: 10 and 30. First, we get the unique
$C_p$ values for \code{minsplit = 10}
<<getCp1>>=
## rpart requires a formula method
tmpData <- bbbDescr
tmpData$logBBB <- logBBB
cpValues10 <- rpart(logBBB ~ ., data = tmpData,
control = rpart.control(minsplit = 10))$cptable[,"CP"]
cpValues30 <- rpart(logBBB ~ ., data = tmpData,
control = rpart.control(minsplit = 30))$cptable[,"CP"]
head(cpValues10)
@
From these, we will create the tuning grid of \Sexpr{nrow(cpValues10)*2}
canidiate models (3 values of \code{minsplit} for each of the
\Sexpr{nrow(cpValues10)} possible $C_p$ values:
<<cpGrid>>=
rpartGrid <- data.frame(.cp = c(cpValues10, cpValues30),
.minsplit =
c(rep(10, length(cpValues10)),
rep(30, length(cpValues30))))
@
We can now write a model function:
<<modelFoo>>=
modelFunc <- function(data, parameter, levels, last, ...)
{
library(rpart)
ctrl <- rpart.control(cp = parameter$.cp,
minsplit = parameter$.minsplit)
list(fit = rpart(.outcome ~ ., data = data, control = ctrl))
}
@
It is a good idea to load the \pkg{rpart} package and anything else needed within the function.
The prediciton function is simple:
<<predFoo>>=
predFunc <- function(object, newdata)
{
library(rpart)
predict(object$fit, newdata)
}
@
Sorting by complexity is somewhat subjective. Both parameters govern
how deep the tree can be. We will sort by \code{cp} then \code{minsplit}:
<<sortFoo>>=
sortFunc <- function(x) x[order(x$cp, x$minsplit),]
@
Now we can create a control object for \code{train}:
<<trControl, eval = FALSE>>=
ctrl <- trainControl(custom = list(
parameters = rpartGrid,
model = modelFunc,
prediction = predFunc,
probability = NULL,
sort = sortFunc),
method = "cv")
set.seed(581)
customRpart <- train(bbbDescr, logBBB, "custom", trControl = ctrl)
@
The \code{predict}, \code{print}, \code{plot} and \code{resamples}
methods work with custom models. In the case of \code{plot.train}, the
axis and key labels will be the parameter names. However, \code{update}
can be used to make the labels more descriptive:
<<plotResults, eval = FALSE>>=
rpartPlot <- plot(customRpart, xTrans = log10)
rpartPlot <- update(rpartPlot, xlab = "Complexity Parameter")
@
\section{Session Information}
<<session, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@
\section{References}
\begin{description}
\item Breiman, Friedman, Olshen, and Stone. (1984) {\it Classification
and Regression Trees}. Wadsworth.
\item Eugster et al. (2008), ``Exploratory and inferential analysis
of benchmark experiments, '' {\it Ludwigs-Maximilians-Universitat
Munchen, Department of Statistics, Tech. Rep} vol. 30
\item Hothorn et al. (2005), ``The design and analysis of
benchmark experiments, '' {\it Journal of Computational and
Graphical Statistics}, 14, 675--699
\item Kuncheva (2004), {\it Combining Pattern Classifiers: Methods
and Algorithms}. Wiley.
\item Molinaro et al. (2010), ``partDSA:
deletion/substitution/addition algorithm for partitioning the
covariate space in prediction,'' {\it Bioinformatics}, 26, 1357--1363
\item Rand (1971), ``Objective criteria for the evaluation of
clustering methods,'' {\it Journal of the American Statistical
Association} 66, 846--850.
\item Schmidberger et al. (2009), `` State-of-the-art in Parallel
Computing with R,'' {\it Journal of Statistical Software}, 31
\item Seni and Elder (2010) {\it Ensemble Methods in Data Mining:
Improving Accuracy Through Combining Predictions}.
Morgan and Claypool Publishers
\item Svetnik, V., Wang, T., Tong, C., Liaw, A., Sheridan, R. P. and
Song, Q. (2005), ``Boosting: An ensemble learning tool for compound
classification and QSAR modeling,'' {\it Journal of Chemical
Information and Modeling}, 45, 786 --799.
\item Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. (2003),
``Class prediction by nearest shrunken centroids, with applications
to DNA microarrays,'' {\it Statistical Science}, 18, 104--117.
\end{description}
\end{document}