# Distance.jl

**This package is deprecated. Please use [Distances.jl](https://github.com/JuliaStats/Distances.jl) instead.**

[![Build Status](https://travis-ci.org/JuliaStats/Distance.jl.svg?branch=master)](https://travis-ci.org/JuliaStats/Distance.jl)

A Julia package for evaluating distances (metrics) between vectors.

This package also provides carefully optimized functions to compute column-wise and pairwise distances, which are often one to two orders of magnitude faster than a straightforward loop implementation. (See the benchmarks section below for details.)

## Supported distances

* Euclidean distance
* Squared Euclidean distance
* Cityblock distance 
* Chebyshev distance
* Minkowski distance
* Hamming distance
* Cosine distance
* Correlation distance
* Chi-square distance
* Kullback-Leibler divergence
* Jensen-Shannon divergence
* Span norm distance
* Mahalanobis distance
* Squared Mahalanobis distance
* Bhattacharyya distance
* Hellinger distance

For ``Euclidean distance``, ``Squared Euclidean distance``, ``Cityblock distance``, ``Minkowski distance``, and ``Hamming distance``, weighted versions are also provided.

## Basic Use

The library supports three ways of computation: *computing the distance between two vectors*, *column-wise computation*, and *pairwise computation*.


#### Computing the distance between two vectors

Each distance corresponds to a *distance type*. You can always compute the distance between two vectors using the following syntax:

```julia
r = evaluate(dist, x, y)
```

Here, ``dist`` is an instance of a distance type. For example, the type for Euclidean distance is ``Euclidean`` (more distance types are introduced in the next section), so you can compute the Euclidean distance between ``x`` and ``y`` as

```julia
r = evaluate(Euclidean(), x, y)
``` 

Common distances also come with convenience functions for distance evaluation. For example, you can also compute the Euclidean distance between two vectors as below:

```julia
r = euclidean(x, y)
```
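
For instance, a minimal end-to-end sketch (the vectors here are arbitrary; any two equal-length numeric vectors work):

```julia
using Distance

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 8.0]

evaluate(Euclidean(), x, y)  # sqrt(3^2 + 4^2 + 5^2) ≈ 7.0711
euclidean(x, y)              # same value via the convenience function
```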

#### Computing distances between corresponding columns

Suppose you have two ``m-by-n`` matrices ``X`` and ``Y``. Then you can compute all distances between corresponding columns of ``X`` and ``Y`` in one batch, using the ``colwise`` function, as

```julia
r = colwise(dist, X, Y)
```

The output ``r`` is a vector of length ``n``. In particular, ``r[i]`` is the distance between ``X[:,i]`` and ``Y[:,i]``. The batch computation typically runs considerably faster than calling ``evaluate`` column-by-column.

Note that either of ``X`` and ``Y`` can be just a single vector -- then the ``colwise`` function will compute the distance between this vector and each column of the other parameter.
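
For example, a short sketch with random data (the dimensions here are arbitrary):

```julia
using Distance

X = rand(3, 5)
Y = rand(3, 5)

r = colwise(Euclidean(), X, Y)  # 5-element vector of distances
# each entry matches the corresponding per-column evaluation:
isapprox(r[2], euclidean(X[:, 2], Y[:, 2]))
```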

#### Computing pairwise distances

Let ``X`` and ``Y`` respectively have ``m`` and ``n`` columns. Then the ``pairwise`` function computes distances between each pair of columns in ``X`` and ``Y``:

```julia
R = pairwise(dist, X, Y)
```

In the output, ``R`` is a matrix of size ``(m, n)``, such that ``R[i,j]`` is the distance between ``X[:,i]`` and ``Y[:,j]``. Computing distances for all pairs using the ``pairwise`` function is often remarkably faster than evaluating each pair individually.

If you just want to compute distances between the columns of a single matrix ``X``, you can write

```julia
R = pairwise(dist, X)
```

This statement will result in an ``m-by-m`` matrix, where ``R[i,j]`` is the distance between ``X[:,i]`` and ``X[:,j]``.
``pairwise(dist, X)`` is typically more efficient than ``pairwise(dist, X, X)``, as the former takes advantage of symmetry when ``dist`` is a semi-metric (including metrics).
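
A brief sketch (again with arbitrary random data):

```julia
using Distance

X = rand(3, 4)
Y = rand(3, 6)

R = pairwise(SqEuclidean(), X, Y)  # 4-by-6 matrix of all pairs
S = pairwise(SqEuclidean(), X)     # 4-by-4, symmetric, zero diagonal
```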


#### Computing column-wise and pairwise distances inplace

If the vector/matrix to store the results is pre-allocated, you can have the results written into that storage (without creating a new array) using the following syntax:

```julia
colwise!(r, dist, X, Y)
pairwise!(R, dist, X, Y)
pairwise!(R, dist, X)
```

Note the difference: the functions for inplace computation are ``colwise!`` and ``pairwise!`` (instead of ``colwise`` and ``pairwise``).
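
For example, a minimal sketch that reuses pre-allocated storage (the sizes here are arbitrary):

```julia
using Distance

X = rand(3, 5)
Y = rand(3, 5)

r = zeros(5)                    # pre-allocated result vector (length n)
colwise!(r, Euclidean(), X, Y)  # writes the distances into r

R = zeros(5, 5)                 # pre-allocated result matrix
pairwise!(R, Euclidean(), X)    # writes the distance matrix into R
```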



## Distance type hierarchy

The distances are organized into a type hierarchy. 

At the top of this hierarchy is an abstract type **PreMetric**, which is defined to be a function ``d`` that satisfies

	d(x, x) == 0  for all x
	d(x, y) >= 0  for all x, y
	
**SemiMetric** is an abstract type that refines **PreMetric**. Formally, a *semi-metric* is a *pre-metric* that is also symmetric, as

	d(x, y) == d(y, x)  for all x, y
	
**Metric** is an abstract type that further refines **SemiMetric**. Formally, a *metric* is a *semi-metric* that also satisfies the triangle inequality, as

	d(x, z) <= d(x, y) + d(y, z)  for all x, y, z
	
This type system has practical significance. For example, when computing pairwise distances between a set of vectors, you only need to perform the computation for half of the pairs, and can derive the values for the remaining half immediately by leveraging the symmetry of *semi-metrics*.
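
To illustrate, a small sketch, assuming the concrete types subtype the hierarchy according to the definitions above (e.g. ``SqEuclidean`` as a semi-metric and ``Euclidean`` as a full metric):

```julia
using Distance

isa(Euclidean(), Metric)        # true:  satisfies all three properties
isa(SqEuclidean(), Metric)      # false: violates the triangle inequality
isa(SqEuclidean(), SemiMetric)  # true:  still nonnegative and symmetric

x = rand(10); y = rand(10)
evaluate(SqEuclidean(), x, y) == evaluate(SqEuclidean(), y, x)  # symmetric
```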

Each distance corresponds to a distance type. The type name and the corresponding mathematical definitions of the distances are listed in the following table.

| type name            |  convenient syntax   | math definition     | 
| -------------------- | -------------------- | --------------------|
|  Euclidean           |  euclidean(x, y)     | sqrt(sum((x - y) .^ 2)) |
|  SqEuclidean         |  sqeuclidean(x, y)   | sum((x - y).^2) |
|  Cityblock           |  cityblock(x, y)     | sum(abs(x - y)) |
|  Chebyshev           |  chebyshev(x, y)     | maximum(abs(x - y)) |
|  Minkowski           |  minkowski(x, y, p)  | sum(abs(x - y).^p) ^ (1/p) |
|  Hamming             |  hamming(x, y)       | sum(x .!= y) |
|  CosineDist          |  cosine_dist(x, y)   | 1 - dot(x, y) / (norm(x) * norm(y)) |
|  CorrDist            |  corr_dist(x, y)     | cosine_dist(x - mean(x), y - mean(y)) |
|  ChiSqDist           |  chisq_dist(x, y)    | sum((x - y).^2 ./ (x + y)) |
|  KLDivergence        |  kl_divergence(x, y) | sum(x .* log(x ./ y)) |
|  JSDivergence        |  js_divergence(x, y) | KL(x, m) / 2 + KL(y, m) / 2 with m = (x + y) / 2 |
|  SpanNormDist        |  spannorm_dist(x, y) | maximum(x - y) - minimum(x - y) |
|  BhattacharyyaDist   |  bhattacharyya(x, y) | -log(sum(sqrt(x .* y) / sqrt(sum(x) * sum(y)))) |
|  HellingerDist       |  hellinger(x, y)     | sqrt(1 - sum(sqrt(x .* y) / sqrt(sum(x) * sum(y)))) |
|  Mahalanobis         |  mahalanobis(x, y, Q)    | sqrt((x - y)' * Q * (x - y)) |
|  SqMahalanobis       |  sqmahalanobis(x, y, Q)  |  (x - y)' * Q * (x - y)  |
|  WeightedEuclidean   |  euclidean(x, y, w)      | sqrt(sum((x - y).^2 .* w))  |
|  WeightedSqEuclidean |  sqeuclidean(x, y, w)    | sum((x - y).^2 .* w)  |
|  WeightedCityblock   |  cityblock(x, y, w)      | sum(abs(x - y) .* w)  |
|  WeightedMinkowski   |  minkowski(x, y, w, p)   | sum(abs(x - y).^p .* w) ^ (1/p)  |
|  WeightedHamming     |  hamming(x, y, w)        | sum((x .!= y) .* w)  |
  
**Note:** The formulas above are expressed with *Julia* functions. They are meant to convey the mathematical concepts concisely; the actual implementation may use a faster method.
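
For distances that take extra parameters (a weight vector ``w``, an order ``p``, or a matrix ``Q``), the convenience functions accept them directly, as in the table above. A short sketch (the data and parameters below are arbitrary placeholders):

```julia
using Distance

x = rand(5); y = rand(5)
w = rand(5)           # nonnegative weight vector
Q = eye(5)            # a (here trivial) positive definite matrix

minkowski(x, y, 3)    # Minkowski distance of order p = 3
sqeuclidean(x, y, w)  # weighted squared Euclidean distance
mahalanobis(x, y, Q)  # Mahalanobis distance with respect to Q
```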


## Benchmarks


The implementation has been carefully optimized based on benchmarks. The Julia scripts ``test/bench_colwise.jl`` and ``test/bench_pairwise.jl`` run the benchmarks on a variety of distances, respectively under column-wise and pairwise settings.

Here are the benchmarks that I obtained on Mac OS X 10.8 with a 2.6 GHz Intel Core i7.

#### Column-wise benchmark

The table below compares the performance (measured as the average elapsed time of each iteration, in seconds) of a straightforward loop implementation and the optimized implementation provided in *Distance.jl*. The task in each iteration is to compute a specific distance between corresponding columns in two ``200-by-10000`` matrices.

|  distance   |   loop  |   colwise   |   gain     |
|------------ | --------| ------------| -----------|
| SqEuclidean | 0.046962 | 0.002782 | 16.8782 |
| Euclidean | 0.046667 | 0.0029 | 16.0937 |
| Cityblock | 0.046619 | 0.0031 | 15.039 |
| Chebyshev | 0.053578 | 0.010856 | 4.9356 |
| Minkowski | 0.061804 | 0.02357 | 2.6221 |
| Hamming | 0.044047 | 0.00219 | 20.1131 |
| CosineDist | 0.04496 | 0.002855 | 15.7457 |
| CorrDist | 0.080828 | 0.029708 | 2.7207 |
| ChiSqDist | 0.051009 | 0.008088 | 6.307 |
| KLDivergence | 0.079598 | 0.035353 | 2.2515 |
| JSDivergence | 0.545789 | 0.493362 | 1.1063 |
| WeightedSqEuclidean | 0.046182 | 0.003219 | 14.3477 |
| WeightedEuclidean | 0.046831 | 0.004122 | 11.3603 |
| WeightedCityblock | 0.046457 | 0.003636 | 12.7781 |
| WeightedMinkowski | 0.062532 | 0.020486 | 3.0524 |
| WeightedHamming | 0.046217 | 0.002269 | 20.3667 |
| SqMahalanobis | 0.150364 | 0.042335 | 3.5518 |
| Mahalanobis | 0.159638 | 0.041071 | 3.8869 |

We can see that using ``colwise`` instead of a simple loop yields considerable gain (roughly 2x - 20x), especially when the internal computation of each distance is simple. Nonetheless, when the computation of a single distance is heavy enough (e.g. *Minkowski* and *JSDivergence*), the gain is not as significant.

#### Pairwise benchmark

The table below compares the performance (measured as the average elapsed time of each iteration, in seconds) of a straightforward loop implementation and the optimized implementation provided in *Distance.jl*. The task in each iteration is to compute a specific distance in a pairwise manner between the columns of a ``100-by-200`` matrix and a ``100-by-250`` matrix, which results in a ``200-by-250`` distance matrix.

|  distance   |   loop  |   pairwise  |   gain     |
|------------ | --------| ------------| -----------|
| SqEuclidean | 0.119961 | 0.00037 | **324.6457** |
| Euclidean | 0.122645 | 0.000678 | **180.9180** |
| Cityblock | 0.116956 | 0.007997 | 14.6251 |
| Chebyshev | 0.137985 | 0.028489 | 4.8434 |
| Minkowski | 0.170101 | 0.059991 | 2.8354 |
| Hamming | 0.110742 | 0.004781 | 23.1627 |
| CosineDist | 0.110913 | 0.000514 | **215.8028** |
| CorrDist | 0.1992 | 0.000808 | **246.4574** |
| ChiSqDist | 0.124782 | 0.020781 | 6.0046 |
| KLDivergence | 0.1994 | 0.088366 | 2.2565 |
| JSDivergence | 1.35502 | 1.215785 | 1.1145 |
| WeightedSqEuclidean | 0.119797 | 0.000444 | **269.531** |
| WeightedEuclidean | 0.126304 | 0.000712 | **177.5122** |
| WeightedCityblock | 0.117185 | 0.011475 | 10.2122 |
| WeightedMinkowski | 0.172614 | 0.061693 | 2.7979 |
| WeightedHamming | 0.112525 | 0.005072 | 22.1871 |
| SqMahalanobis | 0.377342 | 0.000577 | **653.9759** |
| Mahalanobis | 0.373796 | 0.002359 | **158.4337** |

For distances where a major part of the computation is a quadratic form (e.g. *Euclidean*, *CosineDist*, *Mahalanobis*), the performance can be drastically improved by restructuring the computation and delegating the core part to ``GEMM`` in *BLAS*. This strategy can easily yield over a 100x performance gain relative to simple loops (see the highlighted entries in the table above).
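
To make the idea concrete, here is an illustrative sketch (not the package's actual implementation) of computing all pairwise squared Euclidean distances with a single GEMM, using the identity ``|x - y|^2 = |x|^2 + |y|^2 - 2x'y``:

```julia
# X is d-by-m, Y is d-by-n; returns the m-by-n matrix of squared distances
function pairwise_sqeuclidean_gemm(X::Matrix{Float64}, Y::Matrix{Float64})
    G = X' * Y                  # one GEMM: inner products of all column pairs
    sx = sum(X .^ 2, 1)         # squared column norms of X (1-by-m)
    sy = sum(Y .^ 2, 1)         # squared column norms of Y (1-by-n)
    return sx' .+ sy .- 2 .* G  # broadcast into the m-by-n distance matrix
end
```

The single ``X' * Y`` multiplication performs nearly all of the arithmetic and runs at BLAS speed, which is where the large gains in the table come from.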

