SRS - Fast Approximate Nearest Neighbor Search in High Dimensional Euclidean Space With a Tiny Index
====================================================================================================

SRS-Mem is a C++ program for performing approximate nearest neighbor search in
high-dimensional Euclidean space in main memory. The current implementation is
adapted from our
[VLDB'15 research paper](http://www.cse.unsw.edu.au/~weiw/files/VLDB15-SRS-Final.pdf).
The main modification is the use of an in-memory multidimensional index (rather
than the R-tree used in the paper), since the index is typically small enough
to fit entirely in main memory. Currently, the index is a modified version of
the
[Cover Tree](http://hunch.net/~jl/projects/cover_tree/cover_tree.html), chosen
for its strong theoretical guarantees; nevertheless, any multidimensional index
that supports incremental exact *k*NN search can be used!

Features
--------

* **Guaranteed Success Probability**

  Theoretically, SRS is guaranteed to return a _c_-approximate nearest neighbor
  of the query with a user-specified probability, even in the worst case. By
  contrast, many heuristic methods fail to return a near neighbor on some hard
  datasets (e.g., those generated by `gen_hard_data`).

  The SRS algorithm has several other unique theoretical properties. The
  top-_k_ version of SRS is guaranteed to return _c_-_k_-approximate nearest
  neighbors with constant probability (whereas previous methods have no
  guarantee for _k_ > 1), and the `SRS-1` algorithm is guaranteed to return the
  *nearest neighbor* (i.e., _c_ = 1) of the query with any user-specified
  success probability.
  
* **Small Index Size**

  The index size of SRS is substantially smaller than the size of the original
  data. For example, the index for a 12GB dataset (8 million 384-dimensional
  points) is only 337MB. This means that the SRS index can usually be
  accommodated _entirely_ in main memory.

  As a side note, our index and query processing algorithm are independent of
  the dimensionality of the dataset, i.e., they work for arbitrarily high
  dimensions.
  
* **Rich Functionality**

  Users can easily trade space for time by tuning parameters, even after the
  index has been built. All four variants of the query processing algorithm in
  the paper are supported.

How it works
----------------

In a nutshell, SRS reduces the problem of "approximate NN search in
high-dimensional space" to an "exact T-NN search in a low-dimensional space".

In the indexing phase, SRS projects data points from the original
high-dimensional space into an appropriately chosen *m*-dimensional space via
2-stable random projections, and then uses a cover-tree to index these projected
points.
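
The projection step itself is simple. Below is a minimal sketch of a 2-stable
(i.e., Gaussian) random projection from _d_ to _m_ dimensions; the names are
illustrative and not taken from the SRS source.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Project an (n x d) dataset down to m dimensions. Each projected
// coordinate is the inner product of the point with one of m random
// vectors whose entries are drawn i.i.d. from N(0, 1), the standard
// 2-stable distribution.
std::vector<std::vector<double>> project(
    const std::vector<std::vector<double>>& data, int m, unsigned seed) {
  const std::size_t d = data.empty() ? 0 : data[0].size();
  std::mt19937 gen(seed);
  std::normal_distribution<double> gauss(0.0, 1.0);

  // Draw the m x d projection matrix.
  std::vector<std::vector<double>> proj(m, std::vector<double>(d));
  for (auto& row : proj)
    for (auto& v : row) v = gauss(gen);

  // Apply it to every data point.
  std::vector<std::vector<double>> out(data.size(), std::vector<double>(m));
  for (std::size_t i = 0; i < data.size(); ++i)
    for (int j = 0; j < m; ++j) {
      double dot = 0.0;
      for (std::size_t k = 0; k < d; ++k) dot += proj[j][k] * data[i][k];
      out[i][j] = dot;
    }
  return out;
}
```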

The key observation is that, for any point _o_, the squared ratio of _o_'s
distance to the query in the projected space (called its _projected distance_)
to its distance in the original high-dimensional space follows a chi-squared
distribution with _m_ degrees of freedom, which has a sharp concentration
bound. Therefore, given any threshold on the projected distance, we can compute
exactly the probability that _o_'s projected distance is within that threshold.
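
For instance, assuming the unscaled Gaussian projection sketched above, this
probability can be computed with the Boost.Math chi-squared distribution (a
minimal illustration, not the SRS source):

```cpp
#include <boost/math/distributions/chi_squared.hpp>

// Probability that a point at original distance `dist` from the query
// has projected distance at most `threshold`, using m projections:
// ProjDist^2 / dist^2 follows a chi-squared distribution with m
// degrees of freedom.
double prob_within(double threshold, double dist, int m) {
  boost::math::chi_squared_distribution<double> chi(m);
  double x = (threshold * threshold) / (dist * dist);
  return boost::math::cdf(chi, x);
}
```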

In the querying phase, SRS performs an incremental _k_-NN search on the
projected space until it has found a satisfactory point (the early-termination
condition) or it has examined _t * n_ points (the normal-termination
condition).
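
A rough sketch of this loop follows (cf. Algorithm 1 in the paper).
`next_projected_nn`, `exact_distance`, and `early_stop` are placeholders for
the cover tree's incremental search interface, the distance computation in the
original space, and the chi-squared-based early-termination test; they are not
the actual SRS API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;
struct Candidate { Point point; double proj_dist; };

// Hypothetical interfaces (declarations only; see the note above).
Candidate next_projected_nn(const Point& q);
double exact_distance(const Point& a, const Point& b);
bool early_stop(double best_dist, double proj_dist);

// Fetch points in increasing order of projected distance, verify them
// in the original space, and stop on whichever termination condition
// fires first.
Point srs_query(const Point& q, double t, std::size_t n) {
  Point best;
  double best_dist = INFINITY;
  const std::size_t budget = static_cast<std::size_t>(t * n);
  for (std::size_t examined = 0; examined < budget; ++examined) {
    Candidate c = next_projected_nn(q);      // next NN in projected space
    double d = exact_distance(q, c.point);   // check in the original space
    if (d < best_dist) { best_dist = d; best = c.point; }
    if (early_stop(best_dist, c.proj_dist))  // early-termination condition
      break;
  }
  return best;  // normal termination: t * n points examined
}
```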

Before you start
------------

* Users need to have the Boost C++ libraries installed (http://www.boost.org/).
  Boost is used to calculate quantiles of the chi-squared distribution (see the
  snippet after this list).
* Currently the program has only been tested on Linux.
* There are four key parameters to the SRS algorithms:
  * _n_: the number of data points.
  * _c_: the approximation ratio.
  * _m_: the number of 2-stable random projections used in the index.
  * _prob\_thres_: the probability that the algorithm returns a _c_-approximate NN.

  Typically, _n_ is fixed, and one can fine-tune the other three parameters to
  achieve different space/time/quality trade-offs. Fixing any two of these
  parameters places a constraint on the third.
* In addition to these input parameters, the program also generates an internal
  parameter _t_: the fraction of data points to be examined before the search
  terminates under the normal-termination condition.
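
For reference, computing a chi-squared quantile with Boost.Math looks like the
following (a minimal illustration with example values; this is not the actual
`cal_param` code):

```cpp
#include <boost/math/distributions/chi_squared.hpp>
#include <iostream>

int main() {
  int m = 7;            // number of 2-stable projections (example value)
  double p = 0.299203;  // target probability (example value)
  boost::math::chi_squared_distribution<double> chi(m);
  // Quantile: the value x such that P(X <= x) = p for X ~ chi-squared(m).
  std::cout << boost::math::quantile(chi, p) << std::endl;
  return 0;
}
```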

How to use SRS
--------------

1. Compile the program:

  ```
  % make all
  ```

2. Use `cal_param` to calculate a feasible setting of parameters based on the
   given constraints. Users can manually set either *m* or *the success
   probability*. A feasible setting of parameters will be printed to the screen,
   and it can be used later in the query processing phase. The implementation is
   based on Algorithm 6 in the paper. For example (using the toy dataset with
   3000 points):

  ```
  % ./cal_param -n 3000 -m 7 -c 4
  ```

  The following message will be printed out:

  ```
  A feasible setting is:
  m = 7
  prob_thres(-r) = 0.299203
  T_max(-t) = 2
  t = 0.000544
  ```

    The output indicates that, in the query processing phase, users should use
    the arguments `-c 4`, `-m 7`, `-r 0.299203`, and `-t 2`.

  As a rule of thumb, we recommend setting *m* between 6 and 10 to begin with.
  
3. Use `gen_gt` to generate the ground truth for a given dataset and query
   workload. The ground truth file will be used when processing the query
   workload. For example (using the toy dataset):

  ```
  % ./gen_gt -d 192 -n 3000 -k 10 -q data/toy.q -s data/toy.ds -g data/toy.gt -y i
  ```

  `-y i` indicates that each coordinate is an integer. The other option is `-y
  f`, indicating that each coordinate is a floating-point number.

4. Use `srs` with the `-I` option to index the data. Users need to specify *m*
   and the *index path* in this step. For example (using the toy dataset):

  ```
  % mkdir index
  % ./srs -I -d 192 -i index/ -m 7 -n 3000 -s data/toy.ds -y i
  ```
  
5. Use `srs` with the `-Q` option to process the query workload. The
   implementation is based on Algorithm 1 in the paper and its variants. The
   top-_k_ approximate nearest neighbors for each query in the query workload
   will be returned, together with the average ratio and time over all queries.

    Users can use the parameter setting given by `cal_param`. For
    example (using the toy dataset):
  
  ```
  % ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 0.299203
  ```

    Alternatively, users can also specify the parameters themselves to achieve
    a different space-time trade-off.

6. Users can set the `-t` parameter to _n_ (i.e., the cardinality of the
   dataset) to force the algorithm to rely on the early-termination condition
   to stop. This will improve the result quality and slightly increase the time
   cost. This is the `SRS-2` algorithm in the paper.
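
   For example, on the toy dataset (_n_ = 3000), one could run:

  ```
  % ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 3000 -r 0.299203
  ```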

7. Users can set the `-r` parameter to any number larger than 1 to force the
   algorithm to stop only under the normal-termination condition (i.e., after
   examining _tn_ data points). This will substantially increase both the
   result quality and the time cost. This is the `SRS-1` algorithm in the paper.
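
   For example (any `-r` value greater than 1 disables the early-termination
   test; the value 1.1 here is arbitrary):

  ```
  % ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 1.1
  ```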

8. Users can change the `-c` parameter to a smaller value to achieve better
   quality without affecting the worst-case time cost. This is the `SRS-12+`
   algorithm in the paper.
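
   For example, keeping the other parameters from step 5:

  ```
  % ./srs -Q -c 2 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 0.299203
  ```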


Data Format
-----------

1. The data file should contain _n_ lines, where _n_ is the cardinality of the
   dataset. The file should be formatted as:

  ```
  e_1_1 e_1_2 ... e_1_d
  ...
  e_n_1 e_n_2 ... e_n_d
  ```
  
  where each `e_i_j` is an integer, and values are separated by whitespace.

2. The query file should contain _N_+1 lines, where _N_ is the number of queries
   in the query workload. The file should be formatted as:
  
  ```
  N d
  ID_1 e_1_1 e_1_2 ... e_1_d
  ...
  ID_N e_N_1 e_N_2 ... e_N_d
  ```
  where _d_ is the dimensionality, each `e_i_j` is an integer, and values are
  separated by whitespace.
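
For illustration, a minimal C++ reader for the data file (assuming integer
coordinates, i.e., the `-y i` case; this is not the parser used by `srs`):

```cpp
#include <fstream>
#include <vector>

// Read an n-line data file where each line holds d whitespace-separated
// integer coordinates.
std::vector<std::vector<int>> read_data(const char* path, int n, int d) {
  std::ifstream in(path);
  std::vector<std::vector<int>> data(n, std::vector<int>(d));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < d; ++j)
      in >> data[i][j];
  return data;
}
```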

Hard Dataset
------------

* Users can use `gen_hard_data` to generate a hard dataset with a user-specified
  cardinality, dimensionality, and approximation ratio. The dataset contains one
  point which is the nearest neighbor of the query. All the other points are
  (c+ε)-NNs of the query (where c is the user-specified approximation ratio),
  and these points are distributed uniformly at random on a sphere of radius
  c+ε centered at the query point, as sketched below.
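
  Sampling uniformly on a sphere can be done by normalizing Gaussian vectors.
  A sketch of how such points could be generated (illustrative only; the actual
  `gen_hard_data` implementation may differ):

  ```cpp
  #include <cmath>
  #include <cstddef>
  #include <random>
  #include <vector>

  // Return one point drawn uniformly from the sphere of radius r centered
  // at `center`: draw a standard Gaussian vector, normalize it to unit
  // length, then scale by r and translate.
  std::vector<double> random_on_sphere(const std::vector<double>& center,
                                       double r, std::mt19937& gen) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> p(center.size());
    double norm = 0.0;
    for (auto& v : p) { v = gauss(gen); norm += v * v; }
    norm = std::sqrt(norm);
    for (std::size_t i = 0; i < p.size(); ++i)
      p[i] = center[i] + r * p[i] / norm;
    return p;
  }
  ```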

* An example of using `gen_hard_data`:

  ```
  % ./gen_hard_data -n 1000000 -d 128 -c 4 -s hard.ds -q hard.q
  ```

    This generates a dataset containing 1,000,000 points (i.e., `hard.ds`) and
    a query set containing one query (i.e., `hard.q`).

* Users can repeat indexing and query processing with different random seeds
  (use the `-e` parameter to specify the random seed during the indexing phase)
  to estimate the empirical success probability of SRS.

  Users can also run the example script `run_hard_data.sh`:
  
  ```
  % sh run_hard_data.sh
  ```

Condition of use
----------------

* SRS is distributed under the terms of the GPL License.
* Copyright is owned by DBWang Group, University of New South Wales, Australia.

Future work
-----------

* Support more input data formats.
* Integrate SRS with other multidimensional indexing methods.

Contact
-------------

Please report bugs, feature requests, comments, or suggestions to Yifang Sun
(`yifangs AT cse.unsw.edu.au`) or Wei Wang (`weiw AT cse.unsw.edu.au`).