https://github.com/hadley/dplyr
Tip revision: 1405946245d65c32b64ee3ffc2e7ba24a8fb445c authored by Romain Francois on 31 August 2015, 16:34:19 UTC
oops forgot to define y. 🍤
do.Rd
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/do.r, R/tbl-sql.r
\name{do}
\alias{do}
\alias{do_}
\alias{do_.tbl_sql}
\title{Do arbitrary operations on a tbl.}
\usage{
do(.data, ...)

do_(.data, ..., .dots)

\method{do_}{tbl_sql}(.data, ..., .dots, .chunk_size = 10000L)
}
\arguments{
\item{.data}{a tbl}

\item{...}{Expressions to apply to each group. If named, results will be
stored in a new column. If unnamed, the expression should return a data
frame. You can use \code{.} to refer to the current group. You cannot mix
named and unnamed arguments.}

\item{.dots}{Used to work around non-standard evaluation. See
\code{vignette("nse")} for details.}

\item{.chunk_size}{The size of each chunk to pull into R. If this number is
too big, the process will be slow because R has to allocate and free a lot
of memory. If it's too small, it will be slow because of the overhead of
talking to the database.}
}
\value{
\code{do} always returns a data frame. The first columns in the data frame
will be the labels, the others will be computed from \code{...}. Named
arguments become list-columns, with one element for each group; unnamed
elements must be data frames and labels will be duplicated accordingly.

Groups are preserved for a single unnamed input. This is different to
\code{\link{summarise}} because \code{do} generally does not reduce the
complexity of the data, it just expresses it in a special way. For
multiple named inputs, the output is grouped by row with
\code{\link{rowwise}}. This allows other verbs to work in an intuitive
way.
}
\description{
This is a general purpose complement to the specialised manipulation
functions \code{\link{filter}}, \code{\link{select}}, \code{\link{mutate}},
\code{\link{summarise}} and \code{\link{arrange}}. You can use \code{do}
to perform arbitrary computation, returning either a data frame or
arbitrary objects which will be stored in a list. This is particularly
useful when working with models: you can fit models per group with
\code{do} and then flexibly extract components with either another
\code{do} or \code{summarise}.
}
\section{Connection to plyr}{


If you're familiar with plyr, \code{do} with named arguments is basically
equivalent to \code{dlply}, and \code{do} with a single unnamed argument
is basically equivalent to \code{ldply}. However, instead of storing
labels in a separate attribute, the result is always a data frame. This
means that \code{summarise} applied to the result of \code{do} can
act like \code{ldply}.
}
\examples{
by_cyl <- group_by(mtcars, cyl)
do(by_cyl, head(., 2))

models <- by_cyl \%>\% do(mod = lm(mpg ~ disp, data = .))
models

summarise(models, rsq = summary(mod)$r.squared)
models \%>\% do(data.frame(coef = coef(.$mod)))
models \%>\% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod))
))

models <- by_cyl \%>\% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
compare <- models \%>\% do(aov = anova(.$mod_linear, .$mod_quad))
# compare \%>\% summarise(p.value = aov$`Pr(>F)`)
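
# A hedged sketch of the plyr connection described above, assuming the plyr
# package is installed (dplyr does not require it). plyr functions are called
# with plyr:: so that plyr is never attached and cannot mask dplyr verbs.
if (requireNamespace("plyr", quietly = TRUE)) {
  # Named do() keeps one model per group in a list-column.
  dplyr_mods <- do(group_by(mtcars, cyl), mod = lm(mpg ~ disp, data = .))
  # dlply() fits the same per-group models but returns a plain list.
  plyr_mods <- plyr::dlply(mtcars, "cyl", function(df) lm(mpg ~ disp, data = df))
  # Compare the coefficients of the first group from each approach.
  all.equal(coef(dplyr_mods$mod[[1]]), coef(plyr_mods[[1]]))
}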

if (require("nycflights13")) {
# You can use it to do any arbitrary computation, like fitting a linear
# model. Let's explore how arrival delays vary with departure time for
# each carrier.
carriers <- group_by(flights, carrier)
group_size(carriers)

mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
mods \%>\% do(as.data.frame(coef(.$mod)))
mods \%>\% summarise(rsq = summary(mod)$r.squared)

\dontrun{
# This longer example shows the progress bar in action
by_dest <- flights \%>\% group_by(dest) \%>\% filter(n() > 100)
library(mgcv)
by_dest \%>\% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
}
}
}
