Revision ca9b808cf89fc5bfd73bfbc3a7aad705634c2260 authored by asardaes on 16 March 2022, 20:54:21 UTC, committed by asardaes on 16 March 2022, 20:54:21 UTC
compare_clusterings.Rd
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/CLUSTERING-compare-clusterings.R
\name{compare_clusterings}
\alias{compare_clusterings}
\title{Compare different clustering configurations}
\usage{
compare_clusterings(
  series = NULL,
  types = c("p", "h", "f", "t"),
  configs = compare_clusterings_configs(types),
  seed = NULL,
  trace = FALSE,
  ...,
  score.clus = function(...) stop("No scoring"),
  pick.clus = function(...) stop("No picking"),
  shuffle.configs = FALSE,
  return.objects = FALSE,
  packages = character(0L),
  .errorhandling = "stop"
)
}
\arguments{
\item{series}{A list of series, a numeric matrix or a data frame. Matrices and data frames are
coerced to a list row-wise (see \code{\link[=tslist]{tslist()}}).}

\item{types}{Clustering types. It must be any combination of the following (possibly abbreviated):
"partitional", "hierarchical", "fuzzy", "tadpole".}

\item{configs}{The list of data frames with the desired configurations to run. See
\code{\link[=pdc_configs]{pdc_configs()}} and \code{\link[=compare_clusterings_configs]{compare_clusterings_configs()}}.}

\item{seed}{Seed for the random number generator, for reproducibility.}

\item{trace}{Logical indicating whether more output should be printed to the screen.}

\item{...}{Further arguments for \code{\link[=tsclust]{tsclust()}}, \code{score.clus} or \code{pick.clus}.}

\item{score.clus}{A function that gets the list of results (and \code{...}) and scores each one. It
may also be a named list of functions, one for each type of clustering. See Scoring section.}

\item{pick.clus}{A function to pick the best result. See Picking section.}

\item{shuffle.configs}{Randomly shuffle the order of configs, which can be useful to balance load
when using parallel computation.}

\item{return.objects}{Logical indicating whether the objects returned by \code{\link[=tsclust]{tsclust()}} should be
included in the result.}

\item{packages}{A character vector with the names of any packages needed for any functions used
(distance, centroid, preprocessing, etc.). The name "dtwclust" is added automatically. Relevant
for parallel computation.}

\item{.errorhandling}{This will be passed to \code{\link[foreach:foreach]{foreach::foreach()}}. See Parallel section below.}
}
\value{
A list with:
\itemize{
\item \code{results}: A list of data frames with the flattened configs and the corresponding scores
returned by \code{score.clus}.
\item \code{scores}: The scores given by \code{score.clus}.
\item \code{pick}: The object returned by \code{pick.clus}.
\item \code{proc_time}: The measured execution time, using \code{\link[base:proc.time]{base::proc.time()}}.
\item \code{seeds}: A list of lists with the random seeds computed for each configuration.
}

The cluster objects are also returned if \code{return.objects = TRUE}.
}
\description{
Compare many different clustering algorithms with support for parallelization.
}
\details{
This function calls \code{\link[=tsclust]{tsclust()}} with different configurations and evaluates the results with the
provided functions. Parallel support is included. See the examples.

Parameters specified in \code{configs} whose values are \code{NA} will be ignored automatically.

The scoring and picking functions are for convenience; if they are not specified, the \code{scores}
and \code{pick} elements of the result will be \code{NULL}.

If \code{return.objects = FALSE}, see \code{\link[=repeat_clustering]{repeat_clustering()}} to re-create the object for a specific
configuration afterwards.
}
\section{Parallel computation}{


The configurations for each clustering type can be evaluated in parallel (multi-processing)
with the \pkg{foreach} package. A parallel backend can be registered, e.g., with
\pkg{doParallel}.

If the \code{.errorhandling} parameter is changed to "pass" and a custom \code{score.clus} function is
used, said function should be able to deal with possible error objects.

If \code{.errorhandling} is changed to "remove", it might not be possible to attach the scores to
the results' data frames, or the scores may be inconsistent with them. Additionally, if
\code{return.objects} is \code{TRUE}, the names given to the objects might also be inconsistent.
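
For the "pass" case, one possible approach is to wrap the scoring logic so that error objects yield
\code{NA} instead of failing. The following is only a sketch in plain base R: the names
\code{tolerate_errors}, \code{mock_objs} and \code{safe_score} are hypothetical, and numeric vectors
stand in for the actual clustering objects:\preformatted{# Wrap a per-object scorer so that error objects yield NA instead of failing
tolerate_errors <- function(score_one) {
    function(objs, ...) {
        sapply(objs, function(obj) {
            if (inherits(obj, "error")) NA_real_ else score_one(obj, ...)
        })
    }
}

# Mock results: the second element stands in for a configuration that failed
mock_objs <- list(c(1, 2, 3), simpleError("clustering failed"), c(4, 5, 6))

# A real scorer would evaluate a clustering object; here it is just the mean
safe_score <- tolerate_errors(function(obj, ...) mean(obj))
safe_score(mock_objs)
}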

Parallelization can incur a lot of deep copies of data when returning the cluster objects,
since each one will contain a copy of \code{datalist}. If you want to avoid this, consider
specifying \code{score.clus} and setting \code{return.objects} to \code{FALSE}, and then using
\code{\link[=repeat_clustering]{repeat_clustering()}}.
}

\section{Scoring}{


The clustering results are organized in a \emph{list of lists} in the following way (where only
applicable \code{types} exist; first-level list names in bold):
\itemize{
\item \strong{partitional} - list with
\itemize{
\item Clustering results from first partitional config
\item etc.
}
\item \strong{hierarchical} - list with
\itemize{
\item Clustering results from first hierarchical config
\item etc.
}
\item \strong{fuzzy} - list with
\itemize{
\item Clustering results from first fuzzy config
\item etc.
}
\item \strong{tadpole} - list with
\itemize{
\item Clustering results from first tadpole config
\item etc.
}
}

If \code{score.clus} is a function, it will be applied to the available partitional, hierarchical,
fuzzy and/or tadpole results via:\preformatted{scores <- lapply(list_of_lists, score.clus, ...)
}

Otherwise, \code{score.clus} should be a list of functions with the same names as the list above, so
that \code{score.clus$partitional} is used to score \code{list_of_lists$partitional} and so on (via
\code{\link[base:funprog]{base::Map()}}).

Therefore, the scores returned shall always be a list of lists with first-level names as above.
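
Both cases can be mimicked in plain base R. This is only an illustrative sketch: the names
\code{mock_results}, \code{score_mean} and \code{score_list} are hypothetical, and numeric vectors
stand in for the actual clustering objects:\preformatted{# Mock list of lists: each inner element stands in for one clustering result
mock_results <- list(
    partitional = list(c(1, 2, 3), c(4, 5, 6)),
    hierarchical = list(c(7, 8, 9))
)

# Case 1: a single function is applied to the list of each available type
score_mean <- function(objs, ...) sapply(objs, mean)
scores_single <- lapply(mock_results, score_mean)

# Case 2: a named list of functions, matched to the results by type name
score_list <- list(
    partitional = function(objs, ...) sapply(objs, mean),
    hierarchical = function(objs, ...) sapply(objs, max)
)
scores_by_type <- Map(function(objs, f) f(objs),
                      mock_results,
                      score_list[names(mock_results)])

# Either way, the first-level names of the scores match those of the results
names(scores_single)
names(scores_by_type)
}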
}

\section{Picking}{


If \code{return.objects} is \code{TRUE}, the results' data frames and the list of \linkS4class{TSClusters}
objects are given to \code{pick.clus} as first and second arguments respectively, followed by \code{...}.
Otherwise, \code{pick.clus} will receive only the data frames and the contents of \code{...} (since the
objects will not be returned by the preceding step).
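
A picking function can support both call forms by checking whether the second argument is
missing. The following is only a sketch with mock data in plain base R: the names
\code{mock_results}, \code{mock_objs} and \code{pick_best}, as well as the \code{score} column, are
hypothetical, and it assumes that a higher score is better:\preformatted{# Mock results: one data frame per clustering type with a hypothetical score column
mock_results <- list(
    partitional = data.frame(config_id = c("config1", "config2"), score = c(0.4, 0.7)),
    hierarchical = data.frame(config_id = "config3", score = 0.5)
)
# Mock objects: strings standing in for the clustering objects
mock_objs <- list(partitional = list("a", "b"), hierarchical = list("c"))

pick_best <- function(results, objs, ...) {
    # Row index and value of the best score within each type
    best_idx <- sapply(results, function(df) which.max(df$score))
    best_val <- mapply(function(df, i) df$score[i], results, best_idx)
    # Type that holds the best score overall
    best_type <- names(which.max(best_val))
    config <- results[[best_type]][best_idx[[best_type]], , drop = FALSE]
    if (missing(objs))
        config
    else
        list(object = objs[[best_type]][[best_idx[[best_type]]]], config = config)
}

pick_best(mock_results)            # only data frames available
pick_best(mock_results, mock_objs) # data frames and objects available
}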
}

\section{Limitations}{


Note that the configurations returned by the helper functions assign special names to
preprocessing/distance/centroid arguments, and these names are used internally to recognize
them.

If some of these arguments are more complex (e.g. matrices) and should \emph{not} be expanded,
consider passing them directly via the ellipsis (\code{...}) instead of using \code{\link[=pdc_configs]{pdc_configs()}}. This
assumes that said arguments can be passed to all functions without affecting their results.

The distance matrices (if calculated) are not re-used across configurations. Given the way the
configurations are created, this shouldn't matter, because clusterings with arguments that can
use the same distance matrix are already grouped together by \code{\link[=compare_clusterings_configs]{compare_clusterings_configs()}}
and \code{\link[=pdc_configs]{pdc_configs()}}.
}

\examples{
# Fuzzy preprocessing: calculate autocorrelation up to 50th lag
acf_fun <- function(series, ...) {
    lapply(series, function(x) {
        as.numeric(acf(x, lag.max = 50, plot = FALSE)$acf)
    })
}

# Define overall configuration
cfgs <- compare_clusterings_configs(
    types = c("p", "h", "f", "t"),
    k = 19L:20L,
    controls = list(
        partitional = partitional_control(
            iter.max = 30L,
            nrep = 1L
        ),
        hierarchical = hierarchical_control(
            method = "all"
        ),
        fuzzy = fuzzy_control(
            # notice the vector
            fuzziness = c(2, 2.5),
            iter.max = 30L
        ),
        tadpole = tadpole_control(
            # notice the vectors
            dc = c(1.5, 2),
            window.size = 19L:20L
        )
    ),
    preprocs = pdc_configs(
        type = "preproc",
        # shared
        none = list(),
        zscore = list(center = c(FALSE)),
        # only for fuzzy
        fuzzy = list(
            acf_fun = list()
        ),
        # only for tadpole
        tadpole = list(
            reinterpolate = list(new.length = 205L)
        ),
        # specify which should consider the shared ones
        share.config = c("p", "h")
    ),
    distances = pdc_configs(
        type = "distance",
        sbd = list(),
        fuzzy = list(
            L2 = list()
        ),
        share.config = c("p", "h")
    ),
    centroids = pdc_configs(
        type = "centroid",
        partitional = list(
            pam = list()
        ),
        # special name 'default'
        hierarchical = list(
            default = list()
        ),
        fuzzy = list(
            fcmdd = list()
        ),
        tadpole = list(
            default = list(),
            shape_extraction = list(znorm = TRUE)
        )
    )
)

# Number of configurations is returned as attribute
num_configs <- sapply(cfgs, attr, which = "num.configs")
cat("\nTotal number of configurations without considering optimizations:",
    sum(num_configs),
    "\n\n")

# Define evaluation functions based on CVI: Variation of Information (only crisp partition)
vi_evaluators <- cvi_evaluators("VI", ground.truth = CharTrajLabels)
score_fun <- vi_evaluators$score
pick_fun <- vi_evaluators$pick

# ====================================================================================
# Short run with only fuzzy clustering
# ====================================================================================

comparison_short <- compare_clusterings(CharTraj, types = c("f"), configs = cfgs,
                                        seed = 293L, trace = TRUE,
                                        score.clus = score_fun, pick.clus = pick_fun,
                                        return.objects = TRUE)

\dontrun{
# ====================================================================================
# Parallel run with all comparisons
# ====================================================================================

require(doParallel)
registerDoParallel(cl <- makeCluster(detectCores()))

comparison_long <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                       configs = cfgs,
                                       seed = 293L, trace = TRUE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)

# Using all external CVIs and majority vote
external_evaluators <- cvi_evaluators("external", ground.truth = CharTrajLabels)
score_external <- external_evaluators$score
pick_majority <- external_evaluators$pick

comparison_majority <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                           configs = cfgs,
                                           seed = 84L, trace = TRUE,
                                           score.clus = score_external,
                                           pick.clus = pick_majority,
                                           return.objects = TRUE)

# best results
plot(comparison_majority$pick$object)
print(comparison_majority$pick$config)

stopCluster(cl); registerDoSEQ()

# ====================================================================================
# A run with only partitional clusterings
# ====================================================================================

p_cfgs <- compare_clusterings_configs(
    types = "p", k = 19L:21L,
    controls = list(
        partitional = partitional_control(
            iter.max = 20L,
            nrep = 8L
        )
    ),
    preprocs = pdc_configs(
        "preproc",
        none = list(),
        zscore = list(center = c(FALSE, TRUE))
    ),
    distances = pdc_configs(
        "distance",
        sbd = list(),
        dtw_basic = list(window.size = 19L:20L,
                         norm = c("L1", "L2")),
        gak = list(window.size = 19L:20L,
                   sigma = 100)
    ),
    centroids = pdc_configs(
        "centroid",
        partitional = list(
            pam = list(),
            shape = list()
        )
    )
)

# Remove redundant configurations (shape centroid always uses zscore preprocessing)
id_redundant <- p_cfgs$partitional$preproc == "none" &
    p_cfgs$partitional$centroid == "shape"
p_cfgs$partitional <- p_cfgs$partitional[!id_redundant, ]

# LONG! 30 minutes or so, sequentially
comparison_partitional <- compare_clusterings(CharTraj, types = "p",
                                              configs = p_cfgs,
                                              seed = 32903L, trace = TRUE,
                                              score.clus = score_fun,
                                              pick.clus = pick_fun,
                                              shuffle.configs = TRUE,
                                              return.objects = TRUE)
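
# ====================================================================================
# Sketch: do not return the objects, then re-create only the picked clustering
# ====================================================================================

# Assumption: with return.objects = FALSE, pick_fun returns the chosen row of the
# results' data frames, so that its config_id column can be given to repeat_clustering()
comparison_light <- compare_clusterings(CharTraj, types = c("f"), configs = cfgs,
                                        seed = 293L, trace = TRUE,
                                        score.clus = score_fun, pick.clus = pick_fun,
                                        return.objects = FALSE)

# Only the selected configuration is computed again
best_fuzzy <- repeat_clustering(CharTraj, comparison_light,
                                comparison_light$pick$config_id)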
}
}
\seealso{
\code{\link[=compare_clusterings_configs]{compare_clusterings_configs()}}, \code{\link[=tsclust]{tsclust()}}
}
\author{
Alexis Sarda-Espinosa
}