% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/reg_sims.R, R/twoClassSim.R
\name{SLC14_1}
\alias{SLC14_1}
\alias{SLC14_2}
\alias{LPH07_1}
\alias{LPH07_2}
\alias{twoClassSim}
\title{Simulation Functions}
\usage{
SLC14_1(n = 100, noiseVars = 0, corrVars = 0, corrType = "AR1", corrValue = 0)

SLC14_2(n = 100, noiseVars = 0, corrVars = 0, corrType = "AR1", corrValue = 0)

LPH07_1(
  n = 100,
  noiseVars = 0,
  corrVars = 0,
  corrType = "AR1",
  corrValue = 0,
  factors = FALSE,
  class = FALSE
)

LPH07_2(n = 100, noiseVars = 0, corrVars = 0, corrType = "AR1", corrValue = 0)

twoClassSim(
  n = 100,
  intercept = -5,
  linearVars = 10,
  noiseVars = 0,
  corrVars = 0,
  corrType = "AR1",
  corrValue = 0,
  mislabel = 0,
  ordinal = FALSE
)
}
\arguments{
\item{n}{The number of simulated data points}

\item{noiseVars}{The number of uncorrelated irrelevant predictors to be
included.}

\item{corrVars}{The number of correlated irrelevant predictors to be
included.}

\item{corrType}{The correlation structure of the correlated irrelevant
predictors. Values of "AR1" and "exch" are available (see Details below).}

\item{corrValue}{The correlation value.}

\item{factors}{Should the binary predictors be converted to factors?}

\item{class}{Should the simulation produce class labels instead of numbers?}

\item{intercept}{The intercept, which controls the class balance. The
default value produces a roughly balanced data set when the other defaults
are used.}

\item{linearVars}{The number of linearly important effects. See Details
below.}

\item{mislabel}{The proportion of data that is possibly mislabeled. Only
used when \code{ordinal = FALSE}. See Details below.}

\item{ordinal}{Should an ordered factor be returned? See Details below.}
}
\value{
a data frame with columns: \item{Class }{A factor with levels
"Class1" and "Class2"} \item{TwoFactor1, TwoFactor2 }{Correlated
multivariate normal predictors (denoted as \code{A} and \code{B} above)}
\item{Nonlinear1, Nonlinear2, Nonlinear3}{Uncorrelated random uniform
predictors (\code{J}, \code{K} and \code{L} above).}
\item{Linear1, \dots}{Optional uncorrelated standard normal predictors
(\code{C} through \code{H} above)}
\item{Noise1, \dots}{Optional uncorrelated standard normal predictors}
\item{Corr1, \dots}{Optional correlated multivariate normal predictors
(each with unit variances)}
}
\description{
These functions simulate regression and classification data with truly
important predictors and irrelevant predictors.
}
\details{
The first function (\code{twoClassSim}) generates two-class data. The data
are simulated in different sets. First, two multivariate normal predictors
(denoted here as \code{A} and \code{B}) are created with a correlation of
about 0.65. They change the log-odds using main effects and an interaction:

\preformatted{ intercept - 4A + 4B + 2AB }

The intercept is a parameter for the simulation and can be used to control
the amount of class imbalance.

The second set of effects is linear, with coefficients that alternate in
sign and have values between 2.5 and 0.025. For example, if there were six
predictors in this set, their contribution to the log-odds would be

\preformatted{ -2.50C + 2.05D - 1.60E + 1.15F - 0.70G + 0.25H }

The third set is a nonlinear function of a single predictor that ranges
between 0 and 1 (called \code{J} here):

\preformatted{ (J^3) + 2exp(-6(J-0.3)^2) }

The fourth set of informative predictors is copied from one of Friedman's
systems and uses two more predictors (\code{K} and \code{L}):

\preformatted{ 2sin(KL) }

All of these effects are added up to model the log-odds.
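
A minimal sketch of how these pieces might be combined, assuming six linear
predictors (\code{C} through \code{H}); this is illustrative only and not the
internal code of \code{twoClassSim}:

\preformatted{ ## illustrative sketch only -- not the internal simulation code
lp <- intercept - 4*A + 4*B + 2*A*B +
  (-2.50*C + 2.05*D - 1.60*E + 1.15*F - 0.70*G + 0.25*H) +
  J^3 + 2*exp(-6*(J - 0.3)^2) +
  2*sin(K*L)
}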

When \code{ordinal = FALSE}, the log-odds are converted to the probability
of a sample being in the first class, and a random uniform number is used to
make the class assignment. To mislabel the data, the probability is reversed
(i.e. \code{p = 1 - p}) before the random number generation.
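
In rough terms, the assignment could be sketched as follows (illustrative
only, assuming \code{lp} holds the log-odds from above):

\preformatted{ ## illustrative sketch only -- not the exact internals
p <- 1/(1 + exp(-lp))            # probability of the first class
flip <- runif(n) < mislabel      # points eligible for mislabeling
p[flip] <- 1 - p[flip]           # reverse the probability before assignment
cls <- ifelse(runif(n) <= p, "Class1", "Class2")
}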

For \code{ordinal = TRUE}, random normal errors are added to the linear
predictor (i.e. prior to computing the probability) and cut points (0.00,
0.20, 0.75, and 1.00) are used to bin the probabilities into classes
\code{"low"}, \code{"med"}, and \code{"high"} (despite the function's name).

The remaining functions simulate regression data sets. \code{LPH07_1} and
\code{LPH07_2} are from van der Laan et al. (2007). The first function uses
random Bernoulli variables that have a 40\% probability of being a value of
1. The true regression equation is:

\preformatted{ 2*w_1*w_10 + 4*w_2*w_7 + 3*w_4*w_5 - 5*w_6*w_10 + 3*w_8*w_9 +
w_1*w_2*w_4 - 2*w_7*(1-w_6)*w_2*w_9 - 4*(1 - w_10)*w_1*(1-w_4) }

The simulated error term is a standard normal (i.e. Gaussian). The noise
variables are simulated in the same manner as described above but are made
binary based on whether the normal random variable is above or below 0. If
\code{factors = TRUE}, each of the predictors is coerced into a factor.
This simulation can also be adapted for classification using the option
\code{class = TRUE}. In this case, the outcome is converted to be a factor
by first computing the logit transformation of the equation above and using
uniform random numbers to assign the observed class.
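
A hypothetical call exercising these options might look like:

\preformatted{ ## hypothetical usage with factor predictors and a class outcome
dat <- LPH07_1(n = 200, factors = TRUE, class = TRUE)
}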

A second function (\code{LPH07_2}) uses 20 independent Gaussians with mean
zero and variance 16. The functional form here is:

\preformatted{ x_1*x_2 + x_10^2 - x_3*x_17 - x_15*x_4 + x_9*x_5 + x_19 -
x_20^2 + x_9*x_8 }

The error term is also Gaussian with mean zero and variance 16.

The function \code{SLC14_1} simulates a system from Sapp et al. (2014). All
informative predictors are independent Gaussian random variables with mean
zero and a variance of 9. The prediction equation is:

\preformatted{ x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 +
I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) +
cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18
- x_19*x_20 }

The random error here is also Gaussian with mean zero and a variance of 9.

\code{SLC14_2} is also from Sapp et al. (2014). Two hundred independent
Gaussian variables are generated, each having mean zero and variance 16. The
functional form is

\preformatted{ -1 + log(abs(x_1)) + ... + log(abs(x_200)) }

and the error term is Gaussian with mean zero and a variance of 25.

For each simulation, the user can also add non-informative predictors to the
data. These are random standard normal predictors and can be optionally
added to the data in two ways: a specified number of independent predictors
or a set number of predictors that follow a particular correlation
structure. The only two correlation structures that have been implemented are

\itemize{ \item compound-symmetry (aka exchangeable) where there is a
constant correlation between all the predictors

\item auto-regressive 1 [AR(1)]. While there is no time component to these
data, this structure can be used to add predictors of varying levels of
correlation. For example, if there were 5 predictors and \code{r} was the
correlation parameter, the between-predictor correlation matrix would be }

\preformatted{
    | 1                       sym |
    | r    1                      |
    | r^2  r    1                 |
    | r^3  r^2  r    1            |
    | r^4  r^3  r^2  r    1       |
}
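
As a rough illustration (not the package's internal code), correlation
matrices of these two kinds can be generated as follows:

\preformatted{ ## illustrative sketch of the two structures for 5 predictors
r <- 0.8
ar1 <- r^abs(outer(1:5, 1:5, "-"))         # AR(1): correlation decays with lag
exch <- matrix(r, 5, 5); diag(exch) <- 1   # exchangeable: constant correlation
}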
}
\examples{

example <- twoClassSim(100, linearVars = 1)
splom(~example[, 1:6], groups = example$Class)
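
## a small regression simulation with added uninformative predictors
## (argument values here are illustrative)
reg_example <- SLC14_1(n = 50, noiseVars = 5, corrVars = 5, corrValue = 0.7)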

}
\references{
van der Laan, M. J., & Polley, E. C. (2007). Super learner.
Statistical Applications in Genetics and Molecular Biology, 6(1), 1-23.

Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble
method for combining subset-specific algorithm fits. Journal of Applied
Statistics, 41(6), 1247-1259.
}
\author{
Max Kuhn
}
\keyword{models}