Content - ce93ffc9b2d52eb06e06fac0da510762e48d9366 - e51887a/man/lmWinsor.Rd

visit type:
Tip revision: 229f7206c9196b18bb88d6ee7c3c775e8b28e5d3 authored by J. O. Ramsay on 01 June 2009, 00:00:00 UTC
version 2.1.3
Tip revision: 229f720
lmWinsor.Rd
\name{lmWinsor}
\alias{lmWinsor}
\title{
  Winsorized Regression
}
\description{
  Clip inputs and predictions to (upper, lower) or to selected quantiles
  to limit wild predictions outside the training set.
}
\usage{
  lmWinsor(formula, data, lower=NULL, upper=NULL, trim=0,
        quantileType=7, subset, weights=NULL, na.action,
        model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
        singular.ok = TRUE, contrasts = NULL, offset=NULL,
        method=c('QP', 'clip'), eps=sqrt(.Machine$double.eps),
        trace=ceiling(sqrt(nrow(data))), ...)
}
\arguments{
  \item{formula}{
    an object of class '"formula"' (or one that can be coerced to that
    class): a symbolic description of the model to be fitted.  See
    \link{lm}.  The left hand side of 'formula' must be a single vector
    in 'data', untransformed.
  }
  \item{data}{
    an optional data frame, list or environment (or object coercible by
    'as.data.frame' to a data frame) containing the variables in the
    model.  If not found in 'data', the variables are taken from
    'environment(formula)';  see \link{lm}.
  }
  \item{lower, upper}{
    Either NULL or a numeric vector or a list.

    If a numeric vector, it must have names matching columns of 'data'
    giving limits on the ranges of predictors and predictions:  If
    present, values below 'lower' will be increased to 'lower', and
    values above 'upper' will be decreased to 'upper'.  If absent, these
    limit(s) will be inferred from quantile(..., prob=c(trim, 1-trim),
    na.rm=TRUE, type=quantileType).

    If a list, its components must be numeric vectors with names, as
    just described.

    If length(trim) > 1 or either 'lower' or 'upper' is a list,
    'lmWinsor' will return a list giving alternative fits;  see 'value'
    below.
  }
  \item{trim}{
    Either a constant or a numeric vector giving the fraction (0 to 0.5)
    of observations to be considered outside the range of the data in
    determining limits not specified in 'lower' and 'upper'.  If
    length(trim) > 1 or either 'lower' or 'upper' is a list, 'lmWinsor'
    will return a list giving alternative fits;  see 'value' below.

    NOTES:

    (1) trim>0 with a singular fit may give an error.  In such cases,
    fix the singularity and retry.

    (2) trim = 0.5 should NOT be used except to check the algorithm,
    because it trims everything to the median, thereby providing zero
    leverage for estimating a regression.
  }
  \item{quantileType}{
    an integer between 1 and 9 selecting one of the nine quantile
    algorithms to be used with 'trim' to determine limits not provided
    with 'lower' and 'upper';  see   \code{\link{quantile}}.
  }
  \item{ subset }{
    an optional vector specifying a subset of observations to be used in
    the fitting process.
  }
  \item{ weights }{
    an optional vector of weights to be used in the fitting process.
    Should be 'NULL' or a numeric vector. If non-NULL, weighted least
    squares is used with weights 'weights' (that is, minimizing
    'sum(w*e*e)'); otherwise ordinary least squares is used.
  }
  \item{ na.action }{
    a function which indicates what should happen when the data contain
    'NA's.  The default is set by the 'na.action' setting of 'options',
    and is 'na.fail' if that is unset.  The factory-fresh default is
    'na.omit'.  Another possible value is 'NULL', no action.  Value
    'na.exclude' can be useful.
  }
  \item{model, x, y, qr}{
    logicals.  If 'TRUE' the corresponding components of the fit (the
    model frame, the model matrix, the response, the QR decomposition)
    are returned.
  }
  \item{ singular.ok }{
    logical. If 'FALSE' (the default in S but not in R) a singular fit
    is an error.
  }
  \item{ contrasts }{
    an optional list. See the 'contrasts.arg' of
    'model.matrix.default'.
  }
  \item{offset}{
    this can be used to specify an a priori known component to be
    included in the linear predictor during fitting.  This should be
    'NULL' or a numeric vector of length either one or equal to the
    number of cases. One or more 'offset' terms can be included in the
    formula instead or as well, and if both are specified their sum is
    used.  See 'model.offset'.
  }
  \item{method}{
    Either 'QP' or 'clip';  partial matching is allowed.  If 'clip',
    force all 'fitted.values' from an 'lm' fit to be at least
    lower[yName] and at most'upper[yName].  If 'QP', iteratively find
    the prediction farthest outside (lower, upper)[yName] and transfer
    it from the sum of squared residuals objective function to a
    constraint that keeps high 'fitted.values' from going below
    upper[yName] and low 'fitted.values' from going above lower[yName].
  }
  \item{eps}{
    small positive number used in two ways:

    \itemize{
      \item{limits}{
	'pred' is judged between 'lower' and 'upper' for 'y' as follows:
	First compute mod = mean(abs(y)).  If this is 0, let Eps = eps;
	otherwise let Eps = eps*mod.  Then pred is low if it is less
	than (lower - Eps), high if it exceeds (upper + Eps), and
	inside limits otherwise.
      }
      \item{QP}{
	To identify singularity in the quadratic program (QP) discussed
	in 'details', step 7 below, first compute the model.matrix of
	the points with interior predictions.  Then compute the QR
	decomposition of this reduced model.matix.  Then compute the
	absolute values of the diagonal elements of R.  If the smallest
	of these numbers is less than eps times the largest, terminate
	the QP with the previous parameter estimates.
      }
    }
  }
  \item{trace}{
    Print the iteration count every 'trace' iteration during 'QP'
    iterations;  see details.  'trace' = 0 means no trace.
  }
  \item{\dots}{
    additional arguments to be passed to the low level regression
    fitting functions;  see \link{lm}.
  }
}
\details{
  1.  Identify inputs and outputs via mdly <- mdlx <- formula;
  mdly[[3]] <- NULL;  mdlx[[2]] <- NULL;  xNames <- all.vars(mdlx);
  yNames <- all.vars(mdly).  Give an error if as.character(mdly[[2]]) !=
  yNames.

  2.  Do 'lower' and 'upper' contain limits for all numeric columns of
  'data?  Create limits to fill any missing.

  3.  clipData = data with all xNames clipped to (lower, upper).

  4.  fit0 <- lm(formula, clipData, subset = subset, weights = weights,
  na.action = na.action, method = method, x=x, y=y, qr=qr,
  singular.ok=singular.ok, contrasts=contrasts, offset=offset, ...)

  5.  out = a logical matrix with two columns, indicating any of
  predict(fit0) outside (lower, upper)[yName].

  6.  Add components lower and upper to fit0 and convert it to class
  c('lmWinsor', 'lm').

  7.  If((method == 'clip') || !any(out)), return(fit0).

  8.  Else, use quadratic programming (\link[quadprog]{solve.QP}) to
  minimize the 'Winsorized sum of squares of residuals' as follows:

  8.1.  First find the prediction farthest outside (lower,
  upper)[yNames].  Set temporary limits at the next closest point inside
  that point (or at the limit if that's closer).

  8.2.  Use QP to minimize the sum of squares of residuals among all
  points not outside the temporary limits while keeping the prediction
  for the exceptional point away from the interior of (lower,
  upper)[yNames].

  8.3.  Are the predictions for all points unconstrained in QP inside
  (lower, upper)[yNames]?  If yes, quit.

  8.4.  Otherwise, among the points still unconstrained, find the
  prediction farthest outside (lower, upper)[yNames].  Adjust the
  temporary limits to the next closest point inside that point (or at
  the limit if that's closer).

  8.5.  Use QP as in 8.2 but with multiple exceptional points, then
  return to step 8.3.

  9.  Modify the components of fit0 as appropriate and return the
  result.
}
\value{
  an object of class 'lmWinsor', which is either an object of class
  c('lmWinsor', 'lm') or a list of such objects.  A list is returned
  when length(trim) > 0 or is.list(lower) or is.list(upper).  The length
  of the ouput list is the max of length(trim) and the lengths of any
  lists provided for 'lower' and 'upper'.  If these lengths are not the
  same, shorter ones are replicated to match the longest one.

  An object of class c('lmWinsor', 'lm') has 'lower', 'upper', 'out',
  'message', and 'elapsed.time' components in addition to the standard
  'lm' components. The 'out' component is a logical matrix identifying
  which predictions from the initial 'lm' fit were below lower[yName]
  and above upper[yName].  If method = 'QP' and the initial fit produces
  predictions outside the limits, this object returned will also include
  a component 'coefIter' containing the model coefficients, the index
  number of the observation in 'data' transferred from the objective
  function to the constraints on that iteration, plus the sum of squared
  residuals before and after clipping the predictions and the number of
  predictions in 5 categories:  below and at the lower limit, between
  the limits, and at and above the upper limit.  The 'elapsed.time'
  component gives the run time in seconds.

  The options for 'message' are as follows:

  \item{1}{
    'Initial fit in bounds':  All predictions were between 'lower' and
    'upper' for 'y'.
  }
  \item{2}{
    'QP iterations successful':  The QP iteration described in
    'Details', step 7, terminated with all predictions either at or
    between the 'lower' and 'upper' for 'y'.
  }
  \item{3}{
    'Iteration terminated by a singular quadratic program':  The QP
    iteration described in 'Details', step 7, terminated when the
    model.matrix for the QP objective function became rank deficient.
    (Rank deficient in this case means that the smallest singular
    value is less than 'eps' times the largest.)
  }

  In addition to the coefficients, 'coefIter' also includes columns for
  'SSEraw' and 'SSEclipped', containing the residual sums of squres from
  the estimated linear model before and after clipping to the 'lower'
  and 'upper' limits for 'y', plus 'nLoOut', 'nLo.', 'nIn', 'nHi.', and
  'nHiOut', summarizing the distribtion of model predictions at each
  iteration relative to the limits.
}
\author{ Spencer Graves }
\seealso{
  \code{\link{predict.lmWinsor}}
  \code{\link{lmeWinsor}}
  \code{\link{lm}}
  \code{\link{quantile}}
  \code{\link[quadprog]{solve.QP}}
}
\examples{
# example from 'anscombe'
lm.1 <- lmWinsor(y1~x1, data=anscombe)

# no leverage to estimate the slope
lm.1.5 <- lmWinsor(y1~x1, data=anscombe, trim=0.5)

# test nonlinear optimization
lm.1.25 <- lmWinsor(y1~x1, data=anscombe, trim=0.25)

# list example
lm.1. <- lmWinsor(y1~x1, data=anscombe, trim=c(0, 0.25, .4, .5))

}
\keyword{ models }
Browse the archive

https://github.com/cran/fda