https://github.com/cran/caret
Tip revision: ad556e5a82a1e7242d1a6f6eab663cab8c037069 authored by Max Kuhn on 28 September 2021, 13:00:13 UTC
version 6.0-89
findCorrelation.Rd
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/findCorrelation.R
\name{findCorrelation}
\alias{findCorrelation}
\title{Determine highly correlated variables}
\usage{
findCorrelation(
  x,
  cutoff = 0.9,
  verbose = FALSE,
  names = FALSE,
  exact = ncol(x) < 100
)
}
\arguments{
\item{x}{A correlation matrix}

\item{cutoff}{A numeric value for the pair-wise absolute correlation cutoff}

\item{verbose}{a logical; should the details of the search be printed?}

\item{names}{a logical; should the column names be returned (\code{TRUE}) or
the column index (\code{FALSE})?}

\item{exact}{a logical; should the average correlations be recomputed at
each step? See Details below.}
}
\value{
A vector of indices denoting the columns to remove (when \code{names
= FALSE}) or a vector of column names (when \code{names = TRUE}). If no
correlations meet the criteria, \code{integer(0)} is returned.
}
\description{
This function searches through a correlation matrix and returns a vector of
integers corresponding to columns to remove to reduce pair-wise
correlations.
}
\details{
The absolute values of pair-wise correlations are considered. If two
variables have a high correlation, the function looks at the mean absolute
correlation of each variable and removes the variable with the largest mean
absolute correlation.

Using \code{exact = TRUE} will cause the function to re-evaluate the average
correlations at each step, while \code{exact = FALSE} uses all the
correlations regardless of whether they have been eliminated or not. The
exact calculations will remove a smaller number of predictors but can be
much slower when the correlation matrix has many columns; by default, they
are used only when \code{x} has fewer than 100 columns.

There are several functions in the \pkg{subselect} package
(\code{\link[subselect:eleaps]{leaps}},
\code{\link[subselect:genetic]{genetic}},
\code{\link[subselect:anneal]{anneal}}) that can also be used to accomplish
the same goal but tend to retain more predictors.
}
\examples{

R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32, 
                  0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
                  0.85, 0.32, 0.91, 0.36, 1), 
                .Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
R1

findCorrelation(R1, cutoff = .6, exact = FALSE)
findCorrelation(R1, cutoff = .6, exact = TRUE)
findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE)
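
## Illustrative sketch only, not caret's internal code: the pairwise rule
## from the Details section written out in base R. The names `R3`,
## `mean_abs`, `high` and `drop_one` are hypothetical and exist only here.
R3 <- matrix(c(1.0, 0.8, 0.1,
               0.8, 1.0, 0.5,
               0.1, 0.5, 1.0),
             nrow = 3,
             dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
## Mean absolute correlation per column. The diagonal is included here; it
## adds the same amount to every column, so the ordering is unaffected.
mean_abs <- colMeans(abs(R3))
## Pairs above the cutoff, upper triangle only so each pair appears once.
## Column 1 of `high` holds row indices, column 2 holds column indices.
high <- which(abs(R3) > 0.7 & upper.tri(R3), arr.ind = TRUE)
## For each flagged pair, pick the member with the larger mean absolute
## correlation. Pairs are treated independently; nothing is recomputed.
drop_one <- ifelse(mean_abs[high[, 1]] > mean_abs[high[, 2]],
                   high[, 1], high[, 2])
colnames(R3)[drop_one]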


R2 <- diag(rep(1, 5))
R2[2, 3] <- R2[3, 2] <- .7
R2[5, 3] <- R2[3, 5] <- -.7
R2[4, 1] <- R2[1, 4] <- -.67

corrDF <- expand.grid(row = 1:5, col = 1:5)
corrDF$correlation <- as.vector(R2)
levelplot(correlation ~ row + col, corrDF)

findCorrelation(R2, cutoff = .65, verbose = TRUE)

findCorrelation(R2, cutoff = .99, verbose = TRUE)
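
## Hedged usage note, not part of the original examples: when nothing meets
## the cutoff the result is empty, and negative indexing with an empty vector
## selects zero columns rather than all of them. `dat`, `drop_idx`, `keep`
## and `kept` are hypothetical names; `drop_idx` is a stand-in for an empty
## result such as the one the Value section describes.
dat <- data.frame(a = 1:3, b = 4:6)
drop_idx <- integer(0)
ncol(dat[, -drop_idx, drop = FALSE])   # every column is lost
## Building the set of columns to keep avoids the problem
keep <- setdiff(seq_len(ncol(dat)), drop_idx)
kept <- dat[, keep, drop = FALSE]
names(kept)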

}
\seealso{
\code{\link[subselect:eleaps]{leaps}},
\code{\link[subselect:genetic]{genetic}},
\code{\link[subselect:anneal]{anneal}}, \code{\link{findLinearCombos}}
}
\author{
Original R code by Dong Li, modified by Max Kuhn
}
\keyword{manip}