Content - a3cf6407cdbf1f0b1d3b26dbf1018aefa0b43e80 - e8bec87/vignettes/caretTrain.Rnw

visit type:
Tip revision: 287adc67c8767de8d0cf6fe6ec16ca734cdf5441 authored by Max Kuhn on 02 October 2012, 00:00:00 UTC
version 5.15-044
Tip revision: 287adc6
caretTrain.Rnw
% \VignetteIndexEntry{caret Manual -- Model Building}
% \VignetteDepends{caret}
% \VignettePackage{caret}
\documentclass[12pt]{article}
\usepackage{colortbl}
\usepackage{amsmath}
\usepackage[pdftex]{graphicx}
\usepackage{color}
\usepackage{xspace}
\usepackage{fancyvrb}
\usepackage{fancyhdr}
\usepackage{lastpage}
\usepackage{booktabs} 
\usepackage{longtable} 
\usepackage[boxed, linesnumbered]{algorithm2e}
\usepackage[
         colorlinks=true,
         linkcolor=blue,
         citecolor=blue,
         urlcolor=blue]
         {hyperref}
\usepackage{lscape}

\usepackage{Sweave}
\SweaveOpts{keep.source=TRUE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% define new colors for use
\definecolor{darkgreen}{rgb}{0,0.6,0}
\definecolor{darkred}{rgb}{0.6,0.0,0}
\definecolor{lightbrown}{rgb}{1,0.9,0.8}
\definecolor{brown}{rgb}{0.6,0.3,0.3}
\definecolor{darkblue}{rgb}{0,0,0.8}
\definecolor{darkmagenta}{rgb}{0.5,0,0.5}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\bld}[1]{\mbox{\boldmath $#1$}}
\newcommand{\shell}[1]{\mbox{$#1$}}
\renewcommand{\vec}[1]{\mbox{\bf {#1}}}
 \newcommand{\codeheading}[1]{\mbox{\color{darkblue}\texttt{#1}}}
\newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}}
\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
\renewcommand{\pkg}[1]{{\textsf{#1}}}

\newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize}
\newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize}

\providecommand{\SetAlgoLined}{\SetLine}

\newcommand{\halfs}{\frac{1}{2}}

\setlength{\oddsidemargin}{-.25 truein}
\setlength{\evensidemargin}{0truein}
\setlength{\topmargin}{-0.2truein}
\setlength{\textwidth}{7 truein}
\setlength{\textheight}{8.5 truein}
\setlength{\parindent}{0truein}
\setlength{\parskip}{0.10truein}


\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} 
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\fvset{fontsize=\footnotesize}
 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagestyle{fancy}
\lhead{}
\chead{The {\tt caret} Package}
\rhead{}
\lfoot{}
\cfoot{}
\rfoot{\thepage\ of \pageref{LastPage}}
\renewcommand{\headrulewidth}{1pt}
\renewcommand{\footrulewidth}{1pt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{The {\tt caret} Package}
\author{Max Kuhn \\ max.kuhn@pfizer.com}


\begin{document}

\maketitle
\renewcommand{\baselinestretch}{.6}

\tableofcontents

\thispagestyle{empty}
	
\vspace{.2in}

\renewcommand{\baselinestretch}{1}
 
	
	
<<loadLibs, results = hide, echo = FALSE>>=
library(caret)
library(kernlab)
library(gbm)
library(ipred)
library(grid)
library(Hmisc)
library(randomForest)
data(BloodBrain)
data(mdrr)

getInfo <- function(what = "Suggests")
{
  text <- packageDescription("caret")[what][[1]]
  text <- gsub("\n", ", ", text, fixed = TRUE)
  text <- gsub(">=", "$\\\\ge$", text, fixed = TRUE)
  eachPkg <- strsplit(text, ", ", fixed = TRUE)[[1]]
  eachPkg <- gsub(",", "", eachPkg, fixed = TRUE)
  out <- paste("\\\\pkg{", eachPkg[order(tolower(eachPkg))], "}", sep = "")
  paste(out, collapse = ", ")
}
@


The \pkg{caret} package (short for
{\bf{\color{blue}{c}}}lassification {\bf{\color{blue}{a}}}nd
{\bf{\color{blue}{re}}}gression {\bf{\color{blue}{t}}}raining)
contains functions to streamline the model training process for
complex regression and classification problems. The package utilizes a
number of R packages but tries not to load them all at package
start-up\footnote{By adding formal package dependencies, the package
  startup time can be greatly decreased}. The package ``suggests''
field includes: \Sexpr{getInfo("Suggests")}. \pkg{caret} loads
packages as needed and assumes that they are installed. Install
\pkg{caret} using  
\begin{Verbatim}
install.packages("caret", dependencies = c("Depends", "Suggests"))
\end{Verbatim}
to ensure that all the needed packages are installed.

\section{Model Training and Parameter Tuning}\label{S:train}

\pkg{caret} has several functions that attempt to streamline the model building and evaluation process. 

The \pkg{caret} function can be used to
\begin{itemize}
   \item evaluate, using resampling, the effect of model tuning parameters on performance
   \item choose the ``optimal'' model across these parameters 
   \item estimate model performance from a training set
\end{itemize}

More formally:

\begin{algorithm}[H]
   \label{A:tune}
   \SetAlgoLined
   \RestyleAlgo{plain}
   \DontPrintSemicolon
  Define  sets of model parameter values to evaluate \nllabel{A:grid}\;
  \For{each parameter set}{

    \For{each resampling iteration}{
      Hold--out specific samples \nllabel{A:resample} \;
      
      [Optional] Pre--process the data\;
      Fit the model on the remainder\;
      
      Predict the hold--out samples\;
    }
    Calculate the average performance across hold--out predictions \nllabel{A:perf}
  }
  Determine the optimal parameter set \nllabel{A:best}\;
  
  Fit the final model to all the training data using the optimal
  parameter set\;  
\end{algorithm}


First, a specific model must be chosen. Currently,
\Sexpr{length(unique(modelLookup()$model))} are available using
\pkg{caret}; see Tables \ref{T:methods1},  \ref{T:methods2} and  \ref{T:methods3} for details. 

In Tables \ref{T:methods1},  \ref{T:methods2} and  \ref{T:methods3}, there are lists of tuning parameters that can
potentially be optimized.  The first step in tuning the model (line
\ref{A:grid} in Algorithm \ref{A:tune}) is to choose a set of
parameters to evaluate. For example, if fitting a Partial Least Squares
(PLS) model, the number of PLS components to evaluate must be specified. 

Once the model and tuning parameter values have been defined, the type
of resampling should be also be specified. Currently, $k$--fold
cross--validation (once or repeated), 
leave--one--out cross--validation and bootstrap
(simple estimation or the 632 rule) 
resampling methods can be used by \pkg{caret}. After resampling,
the process produces a profile of performance measures is available to
guide the user as to which tuning parameter values should be
chosen. By default, the function automatically chooses the tuning
parameters associated with the best value, although different
algorithms can be used (see Section \ref{S:finalMod}). 


\subsection{An Example}

As an example, the multidrug resistance reversal (MDRR) agent data is
used to determine a predictive model for the ``ability of a compound
to reverse a leukemia cell's resistance to adriamycin''
(\href{http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2005/45/i03/abs/ci0500379.html}{Svetnik
  et al, 2003}). For each sample (i.e. compound), predictors are
calculated that reflect characteristics of the molecular
structure. These molecular descriptors are then used to predict assay
results that reflect resistance.  

The data are accessed using \code{data(mdrr)}. This creates a data
frame of predictors called \code{mdrrDescr} and a factor vector with
the observed class called \code{mdrrClass}. 

To start, we will:
 
\begin{itemize}
   \item use unsupervised filters to remove predictors with
     unattractive characteristics (e.g. spare distributions or high
     inter--predictor correlations) 
   \item split the entire data set into a training and test set
\end{itemize}

See the package vignette ``caret Manual -- Data and Functions'' for more details about these operations.

\begin{small}
<<preProc>>=
print(ncol(mdrrDescr))
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]

print(ncol(filteredDescr))

descrCor <- cor(filteredDescr)
highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
filteredDescr <- filteredDescr[,-highlyCorDescr]

print(ncol(filteredDescr))

set.seed(1)
inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)

trainDescr <- filteredDescr[inTrain,]
testDescr <- filteredDescr[-inTrain,]
trainMDRR <- mdrrClass[inTrain]
testMDRR <- mdrrClass[-inTrain]

print(length(trainMDRR))
print(length(testMDRR))
@
\end{small}

\subsection{Basic Parameter Tuning}\label{S:basic}

By default, simple bootstrap resampling is used for line
\ref{A:resample} in Algorithm \ref{A:tune}. Others are availible, such
as repeated $K$--fold cross--validation. The function
\code{trainControl} can be used to specifiy the type of resampling:

\begin{small}
<<setup>>=
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated three times
                           repeats = 3,
                           ## Save all the resampling results
                           returnResamp = "all")
@
\end{small}
More information about \code{trainControl} is given in Section
\ref{S:control}. 

The first two arguments to \pkg{caret} are the predictor and
outcome data objects, respectively. The third argument,
\code{method}, specifies the type of model (see Tables \ref{T:methods1},  \ref{T:methods2} and  \ref{T:methods3}). We will fit a boosted tree model via the \pkg{gbm}
package. The basic syntax for fitting this model using repeated
cross--vlaidation is shown below:

\begin{small}
<<mdrrModel1>>=
gbmFit1 <- train(trainDescr, trainMDRR, 
                 method = "gbm", 
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
@
\end{small}

For a gradient boosting machine (GBM) model, there are three main
tuning parameters:
\begin{itemize}
\item number of iterations,  {\it i.e. trees},  (called \code{n.trees} in the
  \code{gbm} function)
\item complexity of the tree, called \code{interaction.depth}
\item learning rate: how quickly the algorithm adapts, called
  \code{shrinkage} 
\end{itemize}
The default values tested for this model are shown in the first two
columns (\code{shrinkage} is not shown beause the grid set of
candidate models all use a value of 0.1 for this tuning parameter). 
The column labeled ``\code{Accuracy}'' is the overall agreement rate
averaged over cross--validation iterations. The agreement standard
deviation is also calculated from the cross-validation results. The
column ``\code{Kappa}'' is Cohen's (unweighted) Kappa statistic
averaged across the resampling results. \pkg{caret} works with
specific models (see Tables \ref{T:methods1},  \ref{T:methods2} and  \ref{T:methods3}). For these models, \pkg{caret} can automatically
create a grid of tuning parameters. By default, if $p$ is the number
of tuning parameters, the grid size is $3^p$. For example, regularized
discriminant analysis (RDA) models have two  parameters
(\texttt{gamma} and \texttt{lambda}), both of which lie on $[0,
1]$. The default training grid would produce nine combinations in this
two--dimensional space. 

<<getSeqMods, echo = FALSE, results = hide>>=
seqModList <- paste(paste("\\\\pkg{", unique(subset(modelLookup(), seq)$model), "}", sep = ""), collapse = ", ")
@

\subsection{Notes}

\begin{itemize}
  
\item There is a formula interface (e.g. \code{train(y$\sim$., data =
    someData}) that can be used. One of the issues with a large
  number of predictors is that the objects related to the formula
  which are saved can get very large. In these cases, it is best to
  stick with the non--formula interface described above. 
  
\item The function determines the type of problem
  (classification or regression) from the type of the response
  given in the \code{y} argument.  
  
\item The \code{$\ldots$} option can be used to pass
  parameters to the fitting function. For example, in random
  forest models, you can specify the number of trees to be
  used in the call to \pkg{caret}. 
  
\item For regression models (i.e. a numeric outcome), a similar table would
be produced showing the average root mean squared error and average
$R^2$ value statistic across tuning parameters, otherwise known as
$Q^2$ (see the note below related to this calculation).
For regression models, the classical $R^2$ statistic
  cannot be compared between models that contain an intercept
  and models that do not. Also, some models do not have an
  intercept only null model.	 
  
\item [] To approximate this statistic across different
  types of models,  the square of the correlation between
  the observed and predicted outcomes is used. This means
  that the $R^2$ values produced by \pkg{caret} will not
  match the results of \code{lm} and other functions.  
  
\item [] Also, the correlation estimate does not take into
  account the degrees of freedom in a model and thus does
  not penalize models with more parameters. For some
  models (e.g random forests or on--linear support vector
  machines) there is no clear sense of the degrees of
  freedom, so this information cannot be used in $R^2$ if
  we would like to compare different models. 
  
  
\item The nearest shrunken centroid model of
  \href{http://projecteuclid.org/handle/euclid.ss/1056397488}{Tibshirani
    et al (2003)} is specified using \code{method =
    "pam"}. For this model, there must be at least two samples
  in each class. \pkg{caret} will ignore classes where
  there are less than two samples per class from every model
  fit during bootstrapping or cross--validation (this model
  only). 
  
\item For recursive partitioning models, an initial model is
  fit to all of the training data to obtain the possible
  values of the maximum depth of any node
  (\code{maxdepth}). The tuning grid is created based on
  these values. If \code{tuneLength} is larger than the
  number of possible \code{maxdepth} values determined by
  the initial model, the grid will be truncated to the
  \code{maxdepth} list. 
  
\item [] The same is also true for nearest shrunken centroid
  models, where an initial model is fit to find the range of
  possible threshold values, and MARS models (see the details
  below). 
  
\item For multivariate adaptive regression splines (MARS), the
  \texttt{earth} package is used with a model type of \code{mars}
  or \code{earth} is requested. The tuning parameters used by
  \pkg{caret} are \code{degree} and \code{nprune}. The
  parameter \code{nk} is not automatically specified and, if not
  specified, the default in the \code{earth} function is used.  

\item [] For example, suppose a training set with 40 predictors is
  used with \code{degree = 1} and \code{nprune = 20}. An initial
  model with \code{nk = 41} is fit and is pruned down to 20
  terms. This number includes the intercept and may include
  ``singleton'' terms instead of pairs.  

\item [] Alternate model training schemes can be used by passing
  \code{nk} and/or \code{pmethod} to the \code{earth}
  function. Also, using \code{method = 'gcvEearth'} will use the
  basic GCV pruning procedure and only tune the degree.

\item [] Also, there may be cases where the message such as
  ``specified 'nprune' 29 is greater than the number of available
  model terms 24, forcing 'nprune' to 24'' show up after the model
  fit. This can occur since the \code{earth} function may not
  actually use the number of terms in the initial model as specified
  by \code{nk}. This may be because the \code{earth} function
  removes terms with linear dependencies and  the forward pass
  counts as if terms were added in pairs (although singleton terms
  may be used). By default, the \pkg{caret} function fits and
  initial MARS model is used to determine the number of possible
  terms in the training set to create the tuning grid. Resampled
  data sets may produce slightly different models that do not have
  as many possible values of  \code{nprune}. 
  
\item For the \code{glmboost} and \code{gamboost} functions
  from the \texttt{mboost} package, an additional tuning parameter,
  \code{prune}, is used by train. If \code{prune = "yes"}, the
  number of trees is reduced based on the AIC statistic. If
  \code{"no"}, the number of trees is kept at the value specified
  by the \code{mstop} parameter. See the \texttt{mboost} package
  vignette for more details about AIC pruning. 
  
\item The partitioning model of Molinaro {\it et al.} (2010) has a
  tuning parameter that is the number of partitions in the
  model. The R function \code{partDSA} has the argument
  \code{cut.off.growth} which is described as ``the maximum number
  of terminal partitions to be considered when building the model.''
  Since this is the maximum, the user might ask for a model with $X$
  partitions but the model can only predict $Y < X$. In these cases,
  the model predictions will be based on the largest model available
  ($Y$). 

    \item For generalized additive models, a formula is generated from
      the data. First, predictors with degenerate distributions are
      excluded (via the \code{nearZeroVar} function). Then, the
      number of distinct values for each predictor is calculated. If
      this value is greater than 10, the predictor is entered into the
      formula via a smoothed term (otherwise a linear term is
      used). For models in the \texttt{gam} package, the smooth terms
      have the same amount of smoothing applied to them (i.e. equal
      \code{df} or \code{span} across all the smoothed predictors).
      
    \item For some models (\Sexpr{seqModList}), the \pkg{caret}
      function will fit a model that can be used to derive predictions
      for some sub-models. For example, for MARS (via the
      \code{earth} function), for a fixed degree, a model with a
      maximum number of terms will be fit and the predictions of all of
      the requested models with the same degree and smaller number of
      terms will be computed using \code{update.earth} instead of
      fitting a new model. When the \code{verboseIter} option of the
      \code{trainControl} function is
      used, a line is printed for the ``top--level'' model (instead of
      each model in the tuning grid). 

    \item There are \code{print} and \code{plot} methods for the
      \code{train} class. The
      \code{plot} method visualizes the profile of average resampled
      performance values across the different tuning parameters using
      scatter plots or level plots. See Figures \ref{f:plots1} and
      \ref{f:plots2} for examples. Functions that visualize the individual
      resampling results for \pkg{caret} objects are discussed in
      Section \ref{S:withinMod}.
      
    \item Using the first set of tuning parameters that are
      optimal (in the sense of accuracy or mean squared error),
      \pkg{caret} automatically fits a model with these
      parameters to the entire training data set. That model
      object is accessible in the \code{finalModel} object
      within \pkg{caret}. For example, \verb+gbmFit$finalModel+
      is the same object that would have been produced using a
      direct call to the \code{gbm} function with the final tuning
      parameters. 
      
  
\end{itemize}

There is additional functionality in \pkg{caret} that is described in the next section.

<<getModelData, echo = FALSE, results = hide>>=

models <- modelLookup()
modelKey <- read.delim(system.file("modelKey.txt", package = "caret"))
modelInfo <- merge(models, modelKey, all.x = TRUE)
modelInfo$parameter <- latexTranslate(modelInfo$parameter)

cranRef <- function(x) paste("{\\tt \\href{http://cran.r-project.org/web/packages/", x, "/index.html}{", x, "}}", sep = "")

makeTable <- function(x)
  {
    params <- paste("\\code{", as.character(x$parameter), "}", sep = "", collapse = ", ")
    params <- ifelse(params == "\\code{parameter}", "None", params)

    data.frame(method = paste("\\texttt{", as.character(x$model)[1], "}", sep = ""),
               Package = cranRef(as.character(x$Package)[1]),
               Parameters = params)
               

  }

forTable <- ddply(modelInfo, .(Order, ProblemType, Class, model), makeTable)
forTable$model <- NULL

dual <- subset(forTable, ProblemType == "Dual Use")[,-1]
reg <- subset(forTable, ProblemType == "Regression Only")[,-1]
clas <-  subset(forTable, ProblemType == "Classification Only")[,-1]


@ 

\pagestyle{plain}

{\footnotesize
  
<<dualTable, echo = FALSE, results = tex>>=
dualCats <- unique(as.character(dual$Class))
dualReps <- table(as.character(dual$Class))
dualReps <- dualReps[dualCats]

dual <- as.matrix(dual[, -(1:2)])
rownames(dual) <- NULL

latex(dual, 
      file = "",
      title = "Type",
      rowname = " ",
      rgroup = dualCats,
      n.rgroup = dualReps,
      label = "T:methods1",
      caption = "Dual use models (regression and classification) available in \\code{train}.",
      booktabs = TRUE, 
      landscape = TRUE, 
      longtable = TRUE)
@ 

<<regTable, echo = FALSE, results = tex>>=
regCats <- unique(as.character(reg$Class))
regReps <- table(as.character(reg$Class))
regReps <- regReps[regCats]

reg <- as.matrix(reg[, -(1:2)])
rownames(reg) <- NULL

latex(reg, 
      file = "",
      title = "Type",
      rowname = " ",
      rgroup = regCats,
      n.rgroup = regReps,
      label = "T:methods2",
      caption = "Regression only models available in \\code{train}.",
      booktabs = TRUE, 
      landscape = TRUE, 
      longtable = TRUE)
@ 


<<clasTable, echo = FALSE, results = tex>>=
clasCats <- unique(as.character(clas$Class))
clasReps <- table(as.character(clas$Class))
clasReps <- clasReps[clasCats]

clas <- as.matrix(clas[, -(1:2)])
rownames(clas) <- NULL

latex(clas, 
      file = "",
      title = "Type",
      rowname = " ",
      rgroup = clasCats,
      n.rgroup = clasReps,
      label = "T:methods3",
      caption = "Classification only models available in \\code{train}.",
      booktabs = TRUE, 
      landscape = TRUE, 
      longtable = TRUE)
@ 



}
\pagestyle{fancy}

\section{Customizing the Tuning Process}

There are a few ways to customize the process of selecting
tuning/complexity parameters and building the final model.

\subsection{Pre--Processing Options}\label{S:pp}

As previously mentioned, \pkg{caret} can pre--process the data in
various ways prior to model fitting. The \pkg{caret} function
\code{preProcess} is automatically used. This function can be used
for centering and scaling, imputation (see details below),
applying the spatial sign transformation and feature extraction via
principal component analysis or independent component
analysis. Options to the \code{preProcess} function can be passed
via the \code{trainControl} function.  

These processing steps would be applied during any predictions
generated using \code{predict.train}, \code{extractPrediction} or
\code{extractProbs} (see Section \ref{S:probs}  later in this document). The
pre--processing would {\bf not} be applied to predictions that
directly use the \verb+object$finalModel+ object.  

For imputation, there are two methods currently implemented:
\begin{itemize}
\item $k$--nearest neighbors takes a sample with missing values and
  finds the $k$ closest samples in the training set. The average of
  the $k$ training set values for that predictor are used as a
  substitute for the original data. When calculating the distances to
  the training set samples, the predictors used in the calculation are
  the ones with no missing values for that sample and no missing
  values in the training set.
\item another approach is to fit a bagged tree model for each
  predictor using the training set samples. This is usually a fairly
  accurate model and can handle missing values. When a predictor for a
  sample requires imputation, the values for the other predictors are
  fed through the bagged tree and the prediction is used as the new
  value. This model can have significant computational cost.
\end{itemize}

If there are missing values in the training set, PCA and ICA models
only use complete samples.


\subsection{Alternate Tuning Grids}\label{S:altTune}

The tuning parameter grid can be specified by the user. The argument
\code{tuneGrid} can take a data frame with columns for each tuning
parameter (see Tables \ref{T:methods1},  \ref{T:methods2} and  \ref{T:methods3} for specific details). The column
names should be the same as the fitting function's arguments with a
period preceding the name. For the previously mentioned RDA example, the names would be
\code{.gamma} and \code{.lambda}. \pkg{caret} will tune the
model over each combination of values in the rows. 
 

We can fix the learning rate and evaluate more than three values of
\code{n.trees}: 

\begin{small}
<<mdrrGrid>>=
gbmGrid <-  expand.grid(.interaction.depth = c(1, 3), 
                        .n.trees = c(10, 50, 100, 150, 200, 250, 300), 
                        .shrinkage = 0.1)
@ 

<<mdrrGridFit>>=
set.seed(3)
gbmFit2 <- train(trainDescr, trainMDRR, 
                 method = "gbm", 
                 trControl = fitControl, 
                 verbose = FALSE, 
                 ## Now specify the exact models 
                 ## to evaludate:
                 tuneGrid = gbmGrid)
gbmFit2
@

\end{small}

\subsection{The \code{trainControl} Function}\label{S:control}

The function \code{trainControl} generates parameters that further
control how models are created, with possible values: 
\begin{itemize}
\item \code{method}: The resampling method: \code{boot}, \code{boot632},
  \code{cv}, \code{LOOCV}, \code{LGOCV}, \code{repeatedcv}  and \code{oob}. The
  last value, out--of--bag estimates, can only be used by random
  forest, bagged trees, bagged earth, bagged flexible discriminant
  analysis, or conditional tree forest models. GBM models are not
  included (the \texttt{gbm} package maintainer has indicated that
  it would not be a good idea to choose tuning parameter values
  based on the model OOB error estimates with boosted trees). Also,
  for leave--one--out cross--validation, no uncertainty estimates
  are given for the resampled performance measures. 
\item \code{number and repeats}: \code{number} controls with the 
  number of folds in $K$--fold cross--validation or number of
  resampling iterations for bootstrapping and leave--group--out
  cross--validation. \code{repeats} applied only to repeated  
  $K$--fold cross--validation. Suppose that \code{method = "repeatedcv"}, 
  \code{number = 10} and \code{repeats = 3}, then three separate
  10--fold cross--validations are used as the resampling scheme.
\item \code{verboseIter}: A logical for printing a training log.
\item \code{returnData}: A logical for saving the data into a slot
  called \code{trainingData}. 
\item \code{p}: For leave-group out cross-validation: the training
  percentage 
\item \code{ classProbs}: a logical value determining whether
  class probabilities should be computed for held--out samples
  during resample. Examples of using this argument are given in
  Section \ref{S:altPerf}. 
\item \code{index}: a list with elements for each resampling
  iteration. Each list element is the sample rows used for training
  at that iteration. When these values are not specified,
  \pkg{caret} will generate them. 
\item \code{summaryFunction}: a function to compute alternate
  performance summaries. See Section \ref{S:altPerf} for more
  details. 
\item \code{selectionFunction}: a function to choose the optimal
  tuning parameters. See Section \ref{S:finalMod} for more details
  and examples. 
\item \code{PCAthresh}, \code{ICAcomp} and \code{k}: these are
  all options to pass to the \code{preProcess} function (when used).
\item \code{returnResamp}: a character string containing one of
  the following values: \code{"all"}, \code{"final"} or
  \code{"none"}. This specifies how much of the resampled
  performance measures to save.  
\end{itemize}



\begin{figure}
   \begin{center}		
      \includegraphics[clip, width = .4\textwidth]{gbm1}
      \hspace*{.2 in}
      \includegraphics[clip, width = .4\textwidth]{gbm2}    

      \caption{ Examples of output from \code{plot.tain}. {\bf left} a plot produced using
        \code{plot(gbmFit3)} showing the relationship between the
        number of boosting iterations, the interaction depth and the
        resampled classification accuracy {\bf right} the same
        plot, but the Kappa statistic is plotted using
        \code{plot(gbmFit3, metric = "Kappa")}} 
      \label{f:plots1} 
    \end{center}
\end{figure} 

\begin{figure}
  \begin{center}	
    \includegraphics[clip, width = .6\textwidth]{gbm3}      
    \caption{For the boosted tree example in Section \ref{S:altTune}, using
      \code{plot(gbmFit metric = "Kappa", plotType = "level")} shows
      the relationship (using a \code{levelplot}) between the number of
      boosting iterations, the interaction depth and the resampled
      estimate of the Kappa statistic.  }
    \label{f:plots2}         
  \end{center}
\end{figure}   


\subsection{Alternate Performance Metrics}\label{S:altPerf}

The user can change the metric used to determine the best settings. By
default, RMSE and $R^2$ are computed for regression while accuracy and
Kappa are computed for classification. Also by default, the parameter
values are chosen using RMSE and accuracy, respectively  for
regression and classification. The \code{metric} argument of the
\pkg{caret} function allows the user to control which the
optimality criterion is used. For example, in problems where there are
a low percentage of samples in one class, using \code{metric =
  "Kappa"} can improve quality of the final model. 

If none of these parameters are satisfactory, the user can also
compute custom performance metrics. The \code{trainControl} function
has a argument called \code{summaryFunction} that specifies a
function for computing performance. The function should have these
arguments: 
\begin{itemize}
\item \code{data} is a reference for a data frame or matrix with
  columns called \code{obs} and \code{pred} for the observed and
  predicted outcome values (either numeric data for regression or
  character values for classification). Currently, class probabilities
  are not passed to the function. The values in data are the held--out
  predictions (and their associated reference values) for a single
  combination of tuning parameters. If the \code{classProbs}
  argument of the \code{trainControl} object is set to
  \code{TRUE}, additional columns in \code{data} will be present
  that contains the class probabilities. The names of these columns
  are the same as the class levels.
\item \code{lev} is a character string that has the outcome factor
  levels taken from the training data. For regression, a value of
  \code{NULL} is passed into the function. 
\item \code{model} is a character string for the model being used
  (i.e. the value passed to the \code{method} value of
  \pkg{caret}). 
\end{itemize}
The output to the function should be a vector of numeric summary
metrics with non--null names.  By default, \pkg{caret} evaluate classification models in terms of
the predicted classes. Optionally, class probabilities can also be
used to measure performance. To obtain predicted class probabilities
within the resampling process, the argument \code{classProbs} in
\code{trainControl} must be set to \code{TRUE}. This merges
columns of probabilities into the predictions generated from each
resample (there is a column per class and the column names are the
class names).

As shown in the last section, custom functions can be used to
calculate performance scores that are averaged over the
resamples. Another built--in function, \code{twoClassSummary}, will
compute the sensitivity, specificity and area under the ROC curve.

\clearpage
<<summaryFunc>>=
twoClassSummary
@ 

To rebuild the boosted tree model using this criterion, we
can see the relationship between the tuning parameters and
the area under the ROC curve using the following code: 
\begin{small}
<<setup>>=
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           returnResamp = "all",
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary)
@
\end{small}
<<reTune>>=
set.seed(3)
gbmFit3 <- train(trainDescr, trainMDRR, 
                 method = "gbm", 
                 trControl = fitControl, 
                 verbose = FALSE, 
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmFit3
@ 

In this case, the average area under the ROC curve associated with the
optimal tuning parameters was
\Sexpr{round(caret:::getTrainPerf(gbmFit3)[1,"TrainROC"], 3)} across
the \Sexpr{length(gbmFit3$control$index)} resamples.


\subsection{Choosing the Final Model}\label{S:finalMod}

Another method for customizing the tuning process is to modify the
algorithm that is used to select the ``best'' parameter values, given
the performance numbers. By default, the \pkg{caret} function
chooses the model with the largest performance value (or smallest, for
mean squared error in regression models). Other schemes for selecting
model can be used.  Breiman et al (1984) suggested the ``one standard
error rule'' for simple tree--based models. In this case, the model
with the best performance value is identified and, using resampling,
we can estimate the standard error of performance. The final model
used was the simplest model within one standard error of the
(empirically) best model. With simple trees this makes sense, since
these models will start to over-fit as they become more and more
specific to the training data. 

\pkg{caret} allows the user to specify alternate rules for
selecting the final model. The argument \code{selectionFunction}
can be used to supply a function to algorithmically determine the
final model. There are three existing functions in the package:
\code{best} is chooses the largest/smallest value, \code{oneSE}
attempts to capture the spirit of Breiman et al (1984) and
\code{tolerance} selects the least complex model within some percent
tolerance of the best value. See \code{?best} for more details. 

User--defined functions can be used, as long as they have the
following arguments: 
\begin{itemize}
  \item \code{x} is a data frame containing the tune parameters and
    their associated performance metrics. Each row corresponds to a
    different tuning parameter combination 
    \item \code{metric} a character string indicating which
      performance metric should be optimized (this is passed in
      directly from the \code{metric} argument of \pkg{caret}.  
      \item \code{maximize} is a single logical value indicating
        whether larger values of the performance metric are better
        (this is also directly passed from the call to
        \pkg{caret}). 
  \end{itemize}
  The function should output a single integer indicating which row in
  \code{x} is chosen. 

<<bestGBM, echo = FALSE, results = hide>>=
printSelected <- function(x)
  {
    tmp <- x$bestTune
    names(tmp) <- gsub(".", " ", names(tmp), fixed = TRUE)
    tmp <- paste(names(tmp), "=", tmp)
    paste(tmp, collapse = ", ")
  }
getTrainPerf <- function(x)
  {
    bst <- x$bestTune
    names(bst) <- substring(names(bst), 2)
    merge(bst, x$results)
  }
@   
  
As an example, if we chose the previous boosted tree model on the basis of
overall accuracy (Figure \ref{f:plots1}), we would choose:
\Sexpr{printSelected(gbmFit3)}. However, the scale in this plots is
fairly tight, with accuracy values ranging from
\Sexpr{round(min(gbmFit3$results$ROC), 3)} to
\Sexpr{round(max(gbmFit3$results$ROC), 3)}. A less complex model
(e.g. fewer, more shallow trees) might also yield acceptable
accuracy. 

The tolerance function could be used to find a less complex
model based on $(x-x_{best})/x_{best}\times 100$, which is the percent
difference. For example, to select parameter values based on a $2\%$ loss of performance: 
<<tolerance>>=
whichTwoPct <- tolerance(gbmFit3$results, "ROC", 2, TRUE)  
cat("best model within 2 pct of best:\n")
gbmFit3$results[whichTwoPct,]
@ 
This indicates that we can get a less complex model with and accuracy
of \Sexpr{round(gbmFit3$results[whichTwoPct,"ROC"], 3)} (compared
to the ``pick the best'' value of 
 \Sexpr{round(getTrainPerf(gbmFit3)$ROC, 3)}). 

The main issue with these functions is related to ordering the models
from simplest to complex. In some cases, this is easy (e.g. simple
trees, partial least squares), but in cases such as this model, the ordering of
models is subjective. For example, is a boosted tree model using 100
iterations and a tree depth of 2 more complex than one with 50
iterations and a depth of 8? The package makes some choices regarding
the orderings. In the case of boosted trees, the package assumes that
increasing the number of iterations adds complexity at a faster rate
than increasing the tree depth, so models are ordered on the number of
iterations then ordered with depth. See \code{?best} for more
examples for specific models. 

<<makePlots, results = hide, echo = FALSE>>=
pdf(paste(getwd(), "/gbm1.pdf", sep = ""), width = 5, height = 5)
   trellis.par.set(caretTheme(), warn = FALSE)
   print(plot(gbmFit3))
dev.off()
pdf(paste(getwd(), "/gbm2.pdf", sep = ""), width = 5, height = 5)
   trellis.par.set(caretTheme(), warn = FALSE)
   print(plot(gbmFit3, metric = "Spec"))
dev.off()
pdf(paste(getwd(), "/gbm3.pdf", sep = ""), width = 8, height = 4)
   trellis.par.set(caretTheme(), warn = FALSE)
   print(plot(gbmFit3, meric = "ROC", plotType = "level"))
dev.off()

@

\subsection{Parallel Processing}\label{S:parallel}

If a model is tuned using resampling, the number of model fits can
become large as the number of tuning combinations increases (see the
two loops in Algorithm \ref{A:tune}). To reduce the training time,
parallel processing can be used.  
For example, to train the gradient boosting machine model in Section
\ref{S:basic}, each of the \Sexpr{nrow(gbmFit1$results)} candidate
models was fit to \Sexpr{length(gbmFit1$control$index)} separate
resamples. Since each resample is independent of the other, these
\Sexpr{nrow(gbmFit1$results)*length(gbmFit1$control$index)}  models
could be computed in parallel. 

R has several packages that facilitates parallel processing when
multiple processors are available (see Schmidberger et al., 2009). 
\pkg{caret} can be used to build multiple models simultaneously. 
As of \pkg{caret} version 4.99, a new parallel processing framework
is used to increase the computational efficiency. The \pkg{foreach}
package allows parallel computations using several different
technologies. Although the execution times using \pkg{foreach} is
similar to the framework used prior to version 4.99, there are a few advantages:
\begin{itemize}
  \item the call to \code{train} (or \code{rfe} or \code{sbf}) does
    not change. Parallel ``backends'' are registered with
    \pkg{foreach} prior to the call to \code{train}
  \item compared to the kludgy techniques in \pkg{caret} prior to
    version 4.99, \pkg{foreach} does a much better job of managing memory.
  \item \pkg{foreach} code can be added in several places in the
    package and nest parallelism can be used.
\end{itemize}


For example, to use the \pkg{multicore} package to parallelize the
computations, invoking these commands prior to \code{train} would
split the computations into two workers:
\begin{small}
\begin{Schunk}
\begin{Sinput}
> library(doMC)
> registerDoMC(2)
\end{Sinput}
\end{Schunk}
\end{small}

One common metric used to assess the efficacy of parallelization is
$speedup = T_{seq}/T_{par}$, where $T_{seq}$ and $T_{par}$ denote the
execution times to train the model serially and in parallel,
respectively. Excluding systems with sophisticated shared memory
capabilities, the maximum possible speedup attained by parallelization
with $P$ processors is equal to $P$. Factors affecting the speedup
include the overhead of starting the parallel workers, data transfer,
the percentage of the algorithm's computations that can be done in
parallel, etc.   


\begin{figure}[t]
  \begin{center}		
    \includegraphics[clip, width = .4\textwidth]{min}    
    \includegraphics[clip, width = .4\textwidth]{speedup}
    \caption{Training time profiles using parallel processing via
      \pkg{caret} for a benchmarking data set run on a 16 core
      machine. The left panel shows the elapsed time to train
      a model using single or multiple
      processors. The panel on the right shows the ``speedup,''
      defined to be the time for serial execution divided by the
      parallel execution time. The ``old'' line corresponds to version
      4.98 and below while the ``new'' curve is using \pkg{foreach}.
    }
    \label{f:parallel}         
  \end{center}
\end{figure}

Figure \ref{f:parallel} shows the results of a benchmarking study run
on a 16 core machine. In the left panel, the actual training time for
a radial basis function SVM model for a data set with 5,000 samples
and 400 predictors. The model was tuned over 10 values of the cost
parameter using 50 bootstrap samples. The ``new'' curve corresponds to
the \pkg{foreach} infrastructure. 

One downside to parallel processing in this manner is that the dataset
is held in memory for every node used to train the model. For example,
if parallelism is used to compute the results from 50 bootstrap
samples using $P$ processors, $P$ data sets are held in memory. For
large datasets, this can become a problem if the additional processors
are on the same machines where they are competing for the same
physical memory. The old codebase starts to slow down around 10
workers due to exhausting the physical memory on the machine. The new
codebase does a better job at managing memory with no additional slowdown.


\section{Extracting Predictions and Class Probabilities}\label{S:probs}

As previously mentioned, objects produced by the \pkg{caret}
function contain the ``optimized'' model in the \code{finalModel}
sub--object. Predictions can be made from these objects as usual. In
some cases, such as \code{pls} or \code{gbm} objects, additional
parameters from the optimized fit may need to be specified. In these
cases, the \pkg{caret} objects uses the results of the parameter
optimization to predict new samples. 

For example, we can load the Boston Housing data:

\begin{small}
<<bhExample>>=
library(mlbench)
data(BostonHousing)
# we could use the formula interface too
bhDesignMatrix <-  model.matrix(medv ~. - 1, BostonHousing)
@
\end{small}

\noindent split the data into random training/test groups:

\begin{small}
<<bhSplit>>=
set.seed(4)
inTrain <- createDataPartition(BostonHousing$medv, p = .8, list = FALSE, times = 1)
trainBH <- bhDesignMatrix[inTrain,]
testBH <- bhDesignMatrix[-inTrain,]

trainMedv <- BostonHousing$medv[inTrain]
testMedv <- BostonHousing$medv[-inTrain]
@
\end{small}

\noindent fit partial least squares and multivariate adaptive regression spline models:

\begin{small}
<<bhModels>>=
set.seed(5)
plsFit <- train(trainBH, trainMedv, 
                "pls", 
                preProcess = c("center", "scale"),
                tuneLength = 10, 
                trControl = trainControl(verboseIter = FALSE,
                                         returnResamp = "all"))
set.seed(5)
marsFit <- train(trainBH, trainMedv, 
                 "earth", 
                 tuneLength = 10, 
                 trControl = trainControl(verboseIter = FALSE,
                                          returnResamp = "all"))

@
\end{small} 
To obtain predictions  for the MARS model, \code{predict.earth} can be
used. 

\begin{small}
<<marPrediction1>>=
marsPred1 <- predict(marsFit$finalModel, newdata = testBH)
head(marsPred1)
@ 
\end{small}
Alternatively, \code{predict.train} can be used to get a vector of predictions for the optimal model only:
\begin{small}
<<marsPrediction2>>=
marsPred2 <- predict(marsFit, newdata = testBH)
head(marsPred2)
@
\end{small}
Note that the \code{plsFit} object used pre--processing. In this
case, we cannot directly call \code{predict.mvr} and expect to get
the same answers as \code{predict.train}. The latter function knows
that centering and scaling is required and execute these calculations
on the new samples, whereas \code{predict.mvr} does not. For the
\code{pls} function, there is an argument called \code{scale} that
can be used instead of the pre--processing options in the
\pkg{caret} function.

For multiple models, the objects can be grouped using a list and predicted simultaneously:
\begin{small}
<<bhPrediction1>>=
bhModels <- list(pls = plsFit, mars = marsFit)

bhPred1 <- predict(bhModels, newdata = testBH)
str(bhPred1)
@
\end{small}
In some cases,observed outcomes and their associated predictions may
be needed for a set of models. In this case,
\code{extractPrediction} can be used. This function takes a list of
models and test and/or unknown samples as inputs and returns a data
frame of predictions: 
\begin{small}
<<bhPrediction1>>=
allPred <- extractPrediction(bhModels,
                             testX = testBH,
                             testY = testMedv)
testPred <- subset(allPred, dataType == "Test")
head(testPred)

ddply(testPred, .(model), defaultSummary)

@
\end{small}
The output of \code{extractPrediction} is a data frame with columns:
   \begin{itemize}   
      \item \code{obs}, the observed data
      \item \code{pred}, the predicted values from each model
      \item \code{model}, a character string (``\code{rpart}'', ``\code{pls}'' etc.)
      \item \code{dataType}, a character string for the type of data:
      \begin{itemize}
         \item ``\code{Training}'' data are the predictions on the training data from
            the optimal model,
         \item ``\code{Test}'' denote the predictions on the test set (if one is specified),
         \item ``\code{Unknown}'' data are the predictions on the unknown samples (if specified). 
         Only the predictions are produced for these data. Also, if the quick prediction of the unknowns
         is the primary goal, the argument \code{unkOnly} can be used to only process the unknowns.
      \end{itemize}
   \end{itemize}      

Some classification models can produce probabilities for each
class. The functions \code{predict.train} and \code{predict.list}
can be used with the \code{type = "probs"} argument to produce data
frames of class probabilities (with one column per class). Also, the
function \code{extractProbs} can be used to get these probabilities
from one or more models. The results are very similar to what is
produced by \code{extractPrediction} but with columns for each
class. The column \code{pred} is still the predicted class from the
model.  


\section{Evaluating Test Sets}

A function, \code{postResample}, can be used obtain the same
performance measures as generated by \pkg{caret} for regression or
classification. 

\subsection{Confusion Matrices}\label{S:confusion}

\pkg{caret} also contains several functions that can be used to
describe the performance of classification models. The functions
\code{sensitivity}, \code{specificity}, \code{posPredValue} and
\code{negPredValue} can be used to characterize performance where
there are two classes. By default, the first level of the outcome
factor is used to define the ``positive'' result (i.e. the event of
interest), although this can be changed.  

The function \code{confusionMatrix} can also be used to summarize
the results of a classification model: 

\begin{small}
<<mdrConfusion>>=
mdrrPredictions <- extractPrediction(list(gbmFit3), testX = testDescr, testY = testMDRR)
mdrrPredictions <- mdrrPredictions[mdrrPredictions$dataType == "Test",]
sensitivity(mdrrPredictions$pred, mdrrPredictions$obs)
confusionMatrix(mdrrPredictions$pred, mdrrPredictions$obs)
@
\end{small}

The ``no--information rate'' is the largest proportion of the observed
classes (there were more actives than inactives in this test set). A
hypothesis test is also computed to evaluate whether the overall
accuracy rate is greater than the rate of the largest class. Also, the
prevalence of the ``positive event'' is computed from the data (unless
passed in as an argument), the detection rate (the rate of true events
also predicted to be events) and the detection prevalence (the
prevalence of predicted events). 

Suppose a $2 \times 2$ table with notation

\begin{tabular}{r|c|c|}
                    & \multicolumn{2}{c}{{\bf Reference}}          \\
\cline{2-3}
         {\bf Predicted}  & Event     & No Event \\
\cline{2-3}
         Event      & A         & B        \\
\cline{2-3}
         No Event   & C         & D      \\
\cline{2-3}  
\end{tabular}

The formulas used here are:
\begin{align}
Sensitivity &= \frac{A}{A+C}\notag \\
Specificity &= \frac{D}{B+D}\notag \\
Prevalence &= \frac{A+C}{A+B+C+D}\notag \\
PPV &= \frac{sensitivity \times prevalence}{((sensitivity \times prevalence) + ((1-specificity) \times(1-prevalence))}\notag \\
NPV &= \frac{specificity \times (1-prevalence)}{((1-sensitivity) \times prevalence) + ((specificity) \times(1-prevalence))}\notag \\
Detection\: Rate &=  \frac{A}{A+B+C+D}\notag \\
Detection\: Prevalence &=  \frac{A + B}{A+B+C+D}\notag 
\end{align}

When there are three or more classes, \code{confusionMatrix} will
show the confusion matrix and a set of ``one--versus--all''
results. For example, in a three class problem, the sensitivity of the
first class is calculated against all the samples in the second and
third classes (and so on). 

Also, a resampled estimate of the training set can also be obtained
using \code{confusionMatrix.train}. For each resampling iteration,
a confusion matrix is created from the hold--out samples and these
values can be aggregated to diagnose issues with the model fit.

For example, in the two--class SVM model used in Section
\ref{S:basic}, we could use:

<<resampCM>>=
confusionMatrix(gbmFit3)
@ 

These values are the percentages that hold--out samples landed in the
confusion matrix during resampling. There are several methods for
normalizing these values. See \code{?confusionMatrix.train} for details.


\subsection*{Plotting Predictions and Probabilities}

Two functions, \code{plotObsVsPred} and \code{plotClassProbs}, are
interfaces to lattice to plot model results. For regression,
\code{plotObsVsPred} plots the observed versus predicted values by
model type and data (e.g. test). See Figure \ref{f:bhPredPlot} for example. For classification data,
\code{plotObsVsPred} plots the accuracy rates for models/data in a
dotplot.  


<<mdrrPlots, echo = FALSE, results = hide>>=
mdrrProbs <- extractProb(list(gbmFit3), testX = testDescr, testY = testMDRR)
mdrrProbs <- mdrrProbs[mdrrProbs$dataType == "Test",]

pdf(paste(getwd(), "/gbmProbs.pdf", sep = ""), width = 9, height = 7)
   trellis.par.set(caretTheme(), warn = FALSE)
   print(plotClassProbs(mdrrProbs))
dev.off()
@

\begin{figure}
   \begin{center}      
      \includegraphics[clip, width = .5\textwidth]{gbmProbs}   
      \caption{The predicted class probabilities from a gradient boosting machine fit for the MDRR test set. This plot was created using \code{plotClassProbs(mdrrProbs)}.}
      \label{f:mdrrProbs}     
   \end{center}
\end{figure}  

\begin{figure}
   \begin{center}      
<<bhPredPlot, echo = FALSE, results = hide, fig = FALSE>>=
pdf("bhPredPlot.pdf", width = 8, height = 5)
trellis.par.set(caretTheme(), warn = FALSE)
print(plotObsVsPred(testPred))
dev.off()
@         
      \includegraphics[clip, width = .75\textwidth]{bhPredPlot}   
      \caption{The results of using \code{ plotObsVsPred } to show
        plots of the observed median home price against the
        predictions from two models. The plot shows the training and
        test sets in the same Lattice plot} 
      \label{f:bhPredPlot}         
   \end{center}
\end{figure}


To plot class probabilities, \code{plotClassProbs} will display the
results by model, data and true class (for example, Figure
\ref{f:mdrrProbs}).      

\clearpage

\section{Exploring and Comparing Resampling Distributions}

\subsection{Within--Model}\label{S:withinMod}

There are several Lattice functions than can be used to explore
relationships between tuning parameters and the resampling results for
a specific model:
\begin{itemize}
\item \code{xyplot} and \code{stripplot} can be used to plot resampling statistics
  against (numeric) tuning parameters.
  \item \code{histogram} and \code{densityplot} can also be used
      to look at distributions of the tuning parameters across tuning parameters.
\end{itemize}

For example, the following statements produces the images in Figure \ref{f:marsDesns}.   

\begin{small}
\begin{Schunk}
\begin{Sinput}
> xyplot(marsFit, type= c("g", "p", "smooth"), degree = 2)
> densityplot(marsFit, as.table = TRUE, subset = nprune < 10)
\end{Sinput}
\end{Schunk}
\end{small}

<<makeResampPlots, results = hide, echo = FALSE>>=
pdf(paste(getwd(), "/marsXY.pdf", sep = ""), width = 7, height = 5)
   trellis.par.set(caretTheme(), warn = FALSE)
   print(xyplot(marsFit, type= c("g", "p", "smooth"), degree = 2))
dev.off()
pdf(paste(getwd(), "/marsDens.pdf", sep = ""), width = 7, height = 5)
   trellis.par.set(caretTheme(), warn = FALSE)
   print(densityplot(marsFit, as.table = TRUE, subset = nprune < 10))
dev.off()
@


\begin{figure}[hb]
   \begin{center}      
      \includegraphics[clip, width = .7\textwidth]{marsXY}   
      
      \vspace{.3 in}      

       \includegraphics[clip, width = .7\textwidth]{marsDens}   
      \caption{Scatter plots and density plots of the resampled RMSE by the number of
        retained terms for the MARS model fit to the Boston Housing data}
      \label{f:marsDesns}       
   \end{center}
\end{figure}  
\clearpage

\subsection{Between--Models}\label{S:betweenMod}


\pkg{caret} also includes functions to characterize the differences
between models (generated using \pkg{caret}, \code{sbf} or
\code{rfe}) via their resampling distributions. These functions are
based on the work of Hothorn et al. (2005) and Eugster et al (2008).

Using the blood-brain barrier data (see \code{?BloodBrain)}, three regression models were
created: an \code{rpart} tree, a conditional inference tree using
\code{ctree}, M5 rules using \code{M5Rules} and a MARS model using
\code{earth}. We ensure that the models use the same resampling data
sets. In this case, 100 leave--group--out cross--validation was
employed. 

<<bbbModelFits, eval = FALSE>>=
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = 0.8, times = 100)
bbbControl <- trainControl(method = "LGOCV", index = tmp, timingSamps = 50)

rpartFit <- train(bbbDescr, logBBB,
                   "rpart", 
                   tuneLength = 16,
                   trControl = bbbControl)

ctreeFit <- train(bbbDescr, logBBB,
                 "ctree2",
                  tuneLength = 10,
                  trControl = bbbControl)

earthFit <- train(bbbDescr, logBBB,
                   "earth",
                   tuneLength = 20,
                   trControl = bbbControl)
m5Fit <- train(bbbDescr, logBBB,
                "M5Rules",
                trControl = bbbControl)  
@ 

<<>>=
print(getwd())
@ 

Given these models, can we make statistical statements about their
performance differences? To do this, we first collect the resampling
results using \code{resamples}. 
<<resamps, eval = FALSE>>=
resamps <- resamples(list(CART = rpartFit,
                          CondInfTree = ctreeFit,
                          MARS = earthFit,
                          M5 = m5Fit))
@ 
<<resampLoad, echo = FALSE, results = hide>>=
load("resamps.RData")
@ 
<<resampFuncs>>=
resamps
summary(resamps)
@ 

There are several Lattice plot methods that can be used to visualize
the resampling distributions: density plots, box--whisker plots,
scatterplot matrices and scatterplots of summary statistics. In the
latter case, the plot consists of a scatterplot between the two
models.  (See Figure
\ref{f:resampleScatter}). In Figure \ref{F:resampleDens}, density plots
of the data are shown. In this figure,  the $R^2$ distributions
indicate that M5 rules and MARS appear to be
similar to one another but different from the two tree--based
models. However, this pattern is inconsistent with the root
mean squared error distributions.

<<resamplePlots>>=
bwplot(resamps, metric = "RMSE")

densityplot(resamps, metric = "RMSE")

xyplot(resamps,
       models = c("CART", "MARS"),
       metric = "RMSE")

splom(resamps, metric = "RMSE")
@ 

<<resamplePlots2, results = hide, echo = FALSE>>=
pdf("resampleScatter.pdf", width = 6, height = 6)
trellis.par.set(caretTheme())
print(xyplot(resamps, models = c("CART", "MARS")))
dev.off()
pdf("resampleDens.pdf", width = 7, height = 4.5)
print(
      densityplot(resamps, 
                  scales = list(x = list(relation = "free")), 
                  adjust = 1.2, 
                  plot.points = FALSE, 
                  auto.key = list(columns = 2)))

dev.off() 
@ 


Since models are fit on the same versions of the training data, it
makes sense to make inferences on the differences between models. In
this way we reduce the within--resample correlation that may exist. We
can compute the differences, then use a simple $t$--test to evaluate
the null hypothesis that there is no difference between models. 

<<diffs>>=
difValues <- diff(resamps)

difValues

summary(difValues)
@ 
Note that these results are consistent with the patterns shown in
Figure \ref{F:resampleDens}; there are more differences in the $R^2$
distributions than in the error distributions.

Several Lattices methods also exist to plot the differences (density
and box--whisker plots) or the inferential results (level and dot
plots). Figures \ref{F:diffLevel} and \ref{F:diffDot} show examples of
level and dot plots.

<<diffPlots>>=
dotplot(difValues)

densityplot(difValues,
            metric = "RMSE",
            auto.key = TRUE,
            pch = "|")

bwplot(difValues,
       metric = "RMSE")

levelplot(difValues, what = "differences")
@ 

<<diffPlots2, results = hide, echo = FALSE>>=
pdf("diffLevel.pdf", width = 7.5, height = 6)
trellis.par.set(caretTheme())
print(levelplot(difValues, what = "differences"))
dev.off()

pdf("diffDot.pdf", width = 6, height = 6)
plotTheme <- caretTheme()
plotTheme$plot.symbol$pch <- 16
plotTheme$plot.line$col <- "black"
trellis.par.set(plotTheme)
print(dotplot(difValues))
dev.off()

@ 


\begin{figure}
   \begin{center}		
      \includegraphics[clip, width = .6\textwidth]{resampleScatter}    

      \caption{ Examples of output from \code{xyplot(resamps, models = c("CART", "MARS"))}. }
      \label{f:resampleScatter} 
    \end{center}
\end{figure} 

\begin{figure}
   \begin{center}		
      \includegraphics[clip, width = .9\textwidth]{resampleDens}    

      \caption{ Examples of output from
        \code{densityplot(resamps)}. Looking at $R^2$, M5 rules and MARS appear to be
        similar to one another but different from the two tree--based
        models. However, this pattern is inconsistent with the root
        mean squared error distributions.}
      \label{F:resampleDens} 
    \end{center}
\end{figure} 


\begin{figure}
   \begin{center}		
      \includegraphics[clip, width = .8\textwidth]{diffLevel}    

      \caption{ Examples of output from \code{levelplot(difValues,
          what = "differences")}. The pair--wise differences in RMSE
        are shown}
      \label{F:diffLevel} 
    \end{center}
\end{figure} 


\begin{figure}
   \begin{center}		
      \includegraphics[clip, width = .8\textwidth]{diffDot}    

      \caption{ Examples of output from
        \code{dotplot(difValues)}. The differences in RMSE and their
        associated confidence intervals are shown.}
      \label{F:diffDot} 
    \end{center}
\end{figure} 

\section{Custom Methods for \codeheading{train}}

Although there are currently more than
\Sexpr{length(unique(modelLookup()$model)) -(length(unique(modelLookup()$model)) %% 10)}
  methods available to \code{train}, there may be the need to create
  custom model functions (e.g. testing a new model etc). One
  application of custom models would be to create diverse ensembles of
  models. For example, a set of different classification models may be
  fit to the same data and a ''pick--the--winner'' approach can be
  taken (or the average of the class probabilities could be used, see
  Kuncheva (2004) or Seni and Elder (2010)). \code{train} already has
  a framework for resampling and tuning models and
  \code{predict.train} can be used to encapsulate the ensemble of
  models into one call for prediction.

\subsection*{Why Not Re--Write \codeheading{train} Altogether?}

One could make a strong argument that back--fitting the package to work with custom models is a bit kludgy. This is probably true, but there are several munches of the current code base that, if implemented in a more formal manner, would make the new code base overly complicated. 

For example, the current code base exploits models that can produce predictions for many tuning parameters using a single model object. For example PLS can fit a K component model and make predictions from models with $1\ldots K$ components. This potentially saves us $K-1$ model fits. PLS is fairly fast, but pother models with this same feature (gbm, cubist, etc) are computationally taxing and this saves a lot of time.  

Also, the package currently does many types of resampling and some, such as the bootstrap 632 estimator, are not as simple as others. In this case, an additional model must be fit for all the data to get the apparent error rate. Formalizing this (or starting over) with a new code base would be unnecessarily complex.

\subsection{How To Write Custom Methods}

The user will need to deviate from the standard call in two ways:
\begin{itemize}
  \item use \code{method = "custom"} in the call to \code{train}, and
  \item add the required functions for the model using \code{trainControl}
\end{itemize}

The \code{custom} argument of \code{trainControl} requires a list of
named functions with the following elements: \code{parameters},
\code{model}, \code{prediction}, \code{prob} and \code{sort}.

\subsubsection{The \codeheading{parameters} Argument}

This element is used to specify or generate the models tuning
parameters. This can be done either as a function to generate them or
a data frame of the actual parameters.

{\it Inputs}:
\begin{itemize}
  \item \code{data}: a data frame of the training set data. The
    outcome will be in a column labeled \code{.outcome}. If the
    formula method for \code{train} was invoked, the data passed into
    this function will have been processed (i.e. dummy variables have
    been created etc).
  \item \code{len} an optional parameter passed in form the
    \code{tuneLength} argument to \code{train}
\end{itemize}

{\it Outputs}: a data frame where
\begin{itemize}
  \item all columns start with a dot 
  \item there is at least one row
\end{itemize}

Instead of a function, the final data frame can be passed in

\subsubsection{The \codeheading{model} Function}

This element fits the model and any other functions
(e.g. pre--processing of the data)


{\it Inputs}:
\begin{itemize}
\item \code{data}: a data frame of the training set data. The outcome
  will be in a column labeled \code{.outcome}. If case weights were
  specified in the \code{train} call, these are in the column
  \code{.modelWeights}. If the formula method for \code{train} was
  invoked, the data passed into this function will have been processed
  (i.e. dummy variables have been created etc).
  \item \code{weights} case weights
  \item \code{parameter} a single row data frame with the current
    tuning parameter
  \item \code{levels}: either \code{NULL} or a character vector or
    factor labels
  \item \code{last} a logical vector for the final model fit with the
    selected tuning parameters and the full training set
  \item \code{...} arguments passed form \code{train} to this function
\end{itemize}

{\it Outputs}: a list with at least one element:
\begin{itemize}
  \item \code{fit}: the object corresponding to the trained model
\end{itemize}
Anything else can be attached to this object. If custom
pre--processing is required, this can be estimated in the \code{model}
function and attached to the output list. Subsequent calls to the
\code{prediction} and \code{probability} functions will have the
entire list available, so the processing can be applied to the new
data. 


\subsubsection{The \codeheading{prediction} Function}

This should be a function that produces either a number vector (for
regression) or a factor (or character) vector for classification.

{\it Inputs}:
\begin{itemize}
\item \code{object}: a list with two elements resulting from the
  model function
\item \code{newdata}: a matrix or data frame of predictors to be
  processed through the model (and possibly pre-processing routine)
\end{itemize}

The output should be either a numeric, character or factor vector. For
classification, factors are converted to character elsewhere to ensure
the proper levels are in the output.


\subsubsection{The \codeheading{probability} Argument}

For classification models, this function should generate a data frame of class probabilities. For regression, a value of \code{NULL} can be used. 

{\it Inputs}:
\begin{itemize}
\item \code{object}: a list with two elements resulting from the
  model function
\item \code{newdata}: a matrix or data frame of predictors to be
  processed through the model (and possibly pre-processing routine)
\end{itemize}

The output should be a data frame with these characteristics:
\begin{itemize}
\item as many columns are factor levels
\item column names are the same as the factor levels and in the same order
\end{itemize}

\subsubsection{The \codeheading{sort} Function}

There are cases where multiple tuning parameters yield the same level
of performance. In these situations, \code{train} will choose the
parameters associated with the most simplistic model. This function
should take the grid of tuning parameters and order them from least
complex to most complex.

The input is a data frame of tuning parameters (without the
preceding dot in the name). 

The output is the same data frame sorted appropriately. 

\subsection{An Example}

As an example, suppose we want to test out \code{rpart} models where
we tune over the complexity parameter and the minimum number of
samples in a node to do further splitting (a.k.a \code{minsplit}).

We'll use the Blood--brain barrier data in \pkg{caret} to illustrate.

First, we would need to create a training grid with the candidate
values of \code{cp} and \code{minsplit}. When using the nominal
\code{rpart} method in \code{train}, an initial \code{rpart} model is
created and the unique values of the complexity parameter are obtained
from the sub--object \code{cptable}. We will test two values of
\code{minsplit}: 10 and 30. First, we get the unique
$C_p$ values for \code{minsplit = 10}

<<getCp1>>=
## rpart requires a formula method
tmpData <- bbbDescr
tmpData$logBBB <- logBBB

cpValues10 <- rpart(logBBB ~ ., data = tmpData,
              control = rpart.control(minsplit = 10))$cptable[,"CP"]
cpValues30 <- rpart(logBBB ~ ., data = tmpData,
              control = rpart.control(minsplit = 30))$cptable[,"CP"]
head(cpValues10)
@ 

From these, we will create the tuning grid of \Sexpr{nrow(cpValues10)*2}
canidiate models (3 values of \code{minsplit} for each of the
\Sexpr{nrow(cpValues10)} possible $C_p$ values:
<<cpGrid>>=
rpartGrid <- data.frame(.cp = c(cpValues10, cpValues30),
                        .minsplit = 
                        c(rep(10, length(cpValues10)), 
                          rep(30, length(cpValues30))))

@ 

We can now write a model function:
<<modelFoo>>=
modelFunc <- function(data, parameter, levels, last, ...)
  {
    library(rpart)
    ctrl <- rpart.control(cp = parameter$.cp,
                          minsplit = parameter$.minsplit)
    
    list(fit = rpart(.outcome ~ ., data = data, control = ctrl))                             
  }
@ 
It is a good idea to load the \pkg{rpart} package and anything else needed within the function.

The prediciton function is simple:
<<predFoo>>=
predFunc <- function(object, newdata)
  {
    library(rpart)
    predict(object$fit, newdata)                            
  }
@ 

Sorting by complexity is somewhat subjective. Both parameters govern
how deep the tree can be. We will sort by \code{cp} then \code{minsplit}:
<<sortFoo>>=
sortFunc <- function(x)  x[order(x$cp, x$minsplit),]                           
@ 

Now we can create a control object for \code{train}:
<<trControl, eval = FALSE>>=
ctrl <- trainControl(custom = list(
                       parameters = rpartGrid,
                       model = modelFunc,
                       prediction = predFunc,
                       probability = NULL,
                       sort = sortFunc),
                     method = "cv")

set.seed(581)
customRpart <- train(bbbDescr, logBBB, "custom", trControl = ctrl)
@ 

The \code{predict}, \code{print}, \code{plot} and \code{resamples}
methods work with custom models. In the case of \code{plot.train}, the
axis and key labels will be the parameter names. However, \code{update}
can be used to make the labels more descriptive:

<<plotResults, eval = FALSE>>=
rpartPlot <- plot(customRpart, xTrans = log10)
rpartPlot <- update(rpartPlot, xlab = "Complexity Parameter")
@ 




\section{Session Information}

<<session, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@ 


\section{References}

\begin{description}

  
\item Breiman, Friedman, Olshen, and Stone. (1984) {\it Classification
    and Regression Trees}. Wadsworth. 

\item Eugster et al. (2008),  ``Exploratory and inferential analysis
  of benchmark experiments, '' {\it Ludwigs-Maximilians-Universitat
    Munchen, Department of Statistics, Tech. Rep} vol. 30 
  
  
\item Hothorn et al. (2005),  ``The design and analysis of
  benchmark experiments, '' {\it Journal of Computational and
    Graphical Statistics}, 14,  675--699

  \item Kuncheva (2004), {\it Combining Pattern Classifiers: Methods
      and Algorithms}. Wiley.
    
  \item Molinaro et al. (2010), ``partDSA:
    deletion/substitution/addition algorithm for partitioning the
    covariate space in prediction,'' {\it Bioinformatics}, 26, 1357--1363

\item Rand (1971),  ``Objective criteria for the evaluation of
  clustering methods,'' {\it Journal of the American Statistical
    Association} 66, 846--850. 
    
\item Schmidberger et al. (2009), `` State-of-the-art in Parallel
  Computing with R,'' {\it Journal of Statistical Software}, 31
  
\item Seni and Elder (2010) {\it Ensemble Methods in Data Mining:
    Improving Accuracy Through Combining Predictions}.
  Morgan and Claypool Publishers
  
\item Svetnik, V., Wang, T., Tong, C., Liaw, A., Sheridan, R. P. and
  Song, Q. (2005), ``Boosting: An ensemble learning tool for compound
  classification and QSAR modeling,'' {\it Journal of Chemical
    Information and Modeling}, 45, 786 --799. 
  
\item Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. (2003),
  ``Class prediction by nearest shrunken centroids, with applications
  to DNA microarrays,'' {\it  Statistical Science}, 18, 104--117. 

\end{description}

\end{document}
Browse the archive

https://github.com/cran/caret