Revision 9fe2325202485e2063ef91e5b42cdebc841a8b5b authored by Alexis Sarda on 21 February 2016, 12:10:07 UTC, committed by Alexis Sarda on 21 February 2016, 12:10:07 UTC
1 parent ba2d267
Raw File
README.Rmd
---
output:
     md_document:
          variant: markdown_github
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setOptions, cache = FALSE, echo = FALSE, warning = FALSE, message = FALSE}
library(knitr)
library(dtwclust)

knitr::opts_chunk$set(
     collapse = TRUE,
     comment = "#>",
     fig.path = "README-"
)
```

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/dtwclust)](http://cran.r-project.org/web/packages/dtwclust)
[![Downloads](http://cranlogs.r-pkg.org/badges/dtwclust)](http://cran.rstudio.com/package=dtwclust)

# Time Series Clustering With Optimizations for the Dynamic Time Warping Distance (DTW)

Time series clustering with a wide variety of strategies and a series of optimizations specific to the Dynamic Time Warping (DTW) distance and its corresponding lower bounds (LBs). There are implementations of both traditional clustering algorithms, and more recent procedures such as k-Shape and TADPole clustering. The functionality can be easily extended with custom distance measures and centroid definitions.

Many of the algorithms implemented in this package are specifically tailored to time series and DTW, hence its name. However, the main clustering function is flexible so that one can test many different clustering approaches, using either the time series directly, or by applying suitable transformations and then clustering in the resulting space.

DTW is a dynamic programming algorithm that tries to find the optimum warping path between two series. Over the years, several variations have appeared in order to make the procedure faster or more efficient. Please refer to the included references for more information, especially Giorgino
(2009), which is a good practical introduction.

Most optimizations require equal dimensionality, which means time series should have equal lengths. DTW itself does not require this, but it is relatively slow to compute. Other distance definitions may be used, or series could be reinterpolated to a matching length (Ratanamahatana and Keogh, 2004).

## Implementations

* Keogh's and Lemire's lower bounds
* DTW Barycenter Averaging
* k-Shape clustering
* TADPole clustering
* Fuzzy c-means

## Examples

```{r set-up}
## Load data
data(uciCT)

## Reinterpolate data to equal length
datalist <- zscore(CharTraj)
data <- lapply(CharTraj, reinterpolate, newLength = 180)

## Common controls
ctrl <- new("dtwclustControl", window.size = 20L, trace = TRUE)
```

```{r partitional}
## =============================================================================================
## Using DTW with help of lower bounds and PAM centroids
## =============================================================================================

ctrl@pam.precompute <- FALSE

kc.dtwlb <- dtwclust(data = data, k = 20, distance = "dtw_lb",
                     centroid = "pam", seed = 3247, 
                     control = ctrl)

plot(kc.dtwlb)

ctrl@pam.precompute <- TRUE
```

```{r hierarchical}
## =============================================================================================
## Hierarchical clustering based on shape-based distance
## =============================================================================================

hc.sbd <- dtwclust(datalist, type = "hierarchical",
                   k = 19:21, distance = "sbd",
                   method = "all",
                   control = ctrl)

cat("Rand index for HC+SBD:\n")
print(ri <- sapply(hc.sbd, randIndex, y = CharTrajLabels))

plot(hc.sbd[[which.max(ri)]])
```

```{r tadpole}
## =============================================================================================
## TADPole clustering
## =============================================================================================

kc.tadp <- dtwclust(data, type = "tadpole", k = 20,
                    dc = 1.5, control = ctrl)

plot(kc.tadp, clus = 1:4)
```

```{r parallel}
## =============================================================================================
## Parallel support
## =============================================================================================

require(doParallel)
cl <- makeCluster(detectCores(), "FORK")
invisible(clusterEvalQ(cl, library(dtwclust)))
registerDoParallel(cl)

## Registering a custom distance with proxy and using it (normalized DTW)
ndtw <- function(x, y, ...) {
     dtw::dtw(x, y, step.pattern = symmetric2,
              distance.only = TRUE, ...)$normalizedDistance
}

## Registering the function with 'proxy'
proxy::pr_DB$set_entry(FUN = ndtw, names=c("nDTW"),
                       loop = TRUE, type = "metric", distance = TRUE,
                       description = "Normalized DTW with L1 norm")

## Data with different lengths
kc.ndtw <- dtwclust(datalist, k = 20,
                    distance = "nDTW", centroid = "pam",
                    seed = 159, control = new("dtwclustControl", nrep = 8L))

sapply(kc.ndtw, randIndex, y = CharTrajLabels)

## DBA centroids
kc <- dtwclust(datalist, k = 20,
               distance = "nDTW", centroid = "dba",
               seed = 9421, control = list(trace = TRUE))

# Modifying some plot parameters
plot(kc, labs.arg = list(title = "DBA Centroids", x = "time", y = "series"))

stopCluster(cl)
registerDoSEQ()
```

```{r fuzzy}
## =============================================================================================
## Fuzzy clustering (autocorrelation-based)
## =============================================================================================

# Calculate autocorrelation up to 50th lag, considering a list of time series as input
acf_fun <- function(dat) {
     lapply(dat, function(x) as.numeric(acf(x, lag.max = 50, plot = FALSE)$acf))
}

# Fuzzy c-means
fc <- dtwclust(datalist[1:25], type = "fuzzy", k = 5,
               preproc = acf_fun, distance = "L2",
               seed = 123)

fc
```

## Dependencies

* Partitional procedures are inspired by the `flexclust` package.
* Hierarchical procedures use the native `hclust` function.
* Cross-distances make use of the `proxy` package.
* The core DTW calculations are done by the `dtw` package.
* Plotting is done with the `ggplot2` package.
* Parallel computation depends on the `foreach` package.
* Random streams for repetitions of partitional procedures use the `rngtools` package.
back to top