<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>srs_v2.2/README.html</title>

</head>

<body>

<h1>SRS - Fast Approximate Nearest Neighbor Search in High Dimensional Euclidean Space With Tiny Index</h1>

<p>SRS-Mem is a C++ program for performing approximate nearest neighbor search in
high-dimensional Euclidean space in main memory. The current implementation is
adapted from our
<a href="http://www.cse.unsw.edu.au/~weiw/files/VLDB15-SRS-Final.pdf">VLDB'15 research paper</a>.
The main modification is to use an in-memory multidimensional index (rather than
an R-tree as in the paper), as our index is typically so small that it can be
accommodated in main memory. Currently, the index is a modified version of the
<a href="http://hunch.net/~jl/projects/cover_tree/cover_tree.html">Cover Tree</a>, chosen for
its strong theoretical guarantees; nevertheless, any multidimensional index that
supports incremental exact <em>k</em>NN search can be used!</p>

<h2>Features</h2>

<ul>
<li><p><strong>Guaranteed Success Probability</strong></p>

<p>Theoretically, SRS is guaranteed to return a <em>c</em>-approximate nearest neighbor to
the query with a user-specified probability, even in the worst case. For
example, many heuristic methods will not return a near neighbor on some hard
datasets (e.g., those generated by <code>gen_hard_data</code>).</p>

<p>There are several other unique theoretical properties of the SRS algorithm.
The top-<em>k</em> version of SRS is guaranteed to return <em>c</em>-<em>k</em>-approximate nearest
neighbors with constant probability (while previous methods have no guarantee
for <em>k</em> &gt; 1), and the <code>SRS-1</code> algorithm is guaranteed to return the <em>nearest
neighbor</em> (i.e., <em>c</em> = 1) to the query with any user-specified success
probability.</p></li>
<li><p><strong>Small Index Size</strong></p>

<p>The index size of SRS is substantially smaller than the size of the original
data. For example, the index for a 12GB dataset (8 million 384-dimensional
points) is only 337MB. This means that the SRS index can usually be
accommodated <em>entirely</em> in main memory.</p>

<p>As a side note, our indexing and query processing algorithms are independent of the
dimensionality of the dataset, i.e., they work for arbitrarily high dimensions.</p></li>
<li><p><strong>Rich Functionality</strong></p>

<p>Users can easily achieve a space-time trade-off by tuning parameters even
after the index has been built. All four variants of the query processing
algorithm in the paper are supported.</p></li>
</ul>

<h2>How it works</h2>

<p>In a nutshell, SRS reduces the problem of "approximate NN search in a
high-dimensional space" to "exact T-NN search in a low-dimensional space".</p>

<p>In the indexing phase, SRS projects data points from the original
high-dimensional space into an appropriately chosen <em>m</em>-dimensional space via
2-stable random projections, and then uses a cover-tree to index these projected
points.</p>
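
<p>For illustration, the projection step can be sketched as below. This is a
minimal sketch, not the actual SRS code: <code>make_projections</code> and
<code>project</code> are hypothetical names, and the real implementation may store
the projection vectors differently.</p>

<pre><code>
// Illustrative sketch of the indexing-time projection (not the actual SRS API).
// Each of the m projections is a dot product with a d-dimensional vector of
// i.i.d. standard Gaussian (2-stable) random variables.
#include &lt;random&gt;
#include &lt;vector&gt;

std::vector&lt;std::vector&lt;double&gt; &gt; make_projections(int m, int d, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution&lt;double&gt; gauss(0.0, 1.0);
    std::vector&lt;std::vector&lt;double&gt; &gt; a(m, std::vector&lt;double&gt;(d));
    for (int i = 0; i &lt; m; ++i)
        for (int j = 0; j &lt; d; ++j)
            a[i][j] = gauss(gen);
    return a;
}

// Project one d-dimensional point into the m-dimensional space.
std::vector&lt;double&gt; project(const std::vector&lt;std::vector&lt;double&gt; &gt; &amp;a,
                            const std::vector&lt;double&gt; &amp;point) {
    std::vector&lt;double&gt; out(a.size(), 0.0);
    for (std::size_t i = 0; i &lt; a.size(); ++i)
        for (std::size_t j = 0; j &lt; point.size(); ++j)
            out[i] += a[i][j] * point[j];
    return out;
}
</code></pre>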

<p>The key observation is that the ratio of the inter-point distance in the
projected space (called the <em>Projected Distance</em>) to that in the original
high-dimensional space, once squared, follows a scaled chi-squared distribution,
which has a sharp concentration bound. Therefore, given any threshold on the
projected distance, and for any point <em>o</em>, we can compute exactly the
probability that <em>o</em>'s projected distance is within the threshold.</p>
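
<p>Concretely, this probability can be computed from the chi-squared CDF, e.g.,
via Boost.Math (the same library the build already requires for chi-squared
quantiles). The sketch below is illustrative only; <code>prob_within</code> is a
hypothetical name, and it assumes the squared distance ratio follows a
chi-squared distribution with <em>m</em> degrees of freedom.</p>

<pre><code>
// Probability that point o's projected distance is within a threshold,
// given o's distance to the query in the original space (sketch only).
#include &lt;boost/math/distributions/chi_squared.hpp&gt;

double prob_within(double threshold, double original_dist, int m) {
    boost::math::chi_squared chi(m);
    double x = (threshold * threshold) / (original_dist * original_dist);
    return boost::math::cdf(chi, x);   // CDF of chi^2 with m degrees of freedom
}
</code></pre>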

<p>In the querying phase, SRS performs an incremental <em>k</em>-NN search in the
projected space, until it has found a satisfactory point (i.e., the
early-termination condition) or it has examined <em>t * n</em> points (i.e., the
normal-termination condition).</p>
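
<p>The control flow of the querying phase can be sketched as below. This is a
hedged outline of the structure of Algorithm 1, not the actual implementation:
<code>next_nearest</code>, <code>original_dist</code>, and <code>early_stop</code> are
hypothetical callbacks standing in for the cover tree's incremental <em>k</em>-NN
search and the paper's stopping test.</p>

<pre><code>
#include &lt;functional&gt;

// Examine projected-space neighbors in increasing projected distance,
// keeping the best point seen so far (by original-space distance), until
// the early-termination test fires or t*n points have been examined.
int srs_query(long n, double t,
              std::function&lt;int()&gt; next_nearest,         // incremental k-NN in projected space
              std::function&lt;double(int)&gt; original_dist,  // distance to query in original space
              std::function&lt;bool(int)&gt; early_stop) {     // early-termination condition
    long budget = (long)(t * n);       // normal-termination condition
    int best = -1;
    double best_dist = 0.0;
    for (long examined = 0; examined &lt; budget; ++examined) {
        int o = next_nearest();
        double dist = original_dist(o);
        if (best &lt; 0 || dist &lt; best_dist) { best = o; best_dist = dist; }
        if (early_stop(best))
            break;                     // a satisfactory point has been found
    }
    return best;                       // id of the returned approximate NN
}
</code></pre>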

<h2>Before you start</h2>

<ul>
<li>The users need to have Boost C++ library installed (http://www.boost.org/).
The Boost library is used to calculate the quantile of chi-squared
distribution.</li>
<li>Currently the program has only been tested on Linux.</li>
<li>There are four key parameters to the SRS algorithms:
<ul>
<li><em>n</em>: number of data points. </li>
<li><em>c</em>: approximation ratio. </li>
<li><em>m</em>: number of 2-stable random projections to be used in the index. </li>
<li><em>prob_thres</em>: the probability that the algorithm returns a <em>c</em>-approximate NN.</li>
</ul>

<p>Typically, <em>n</em> is fixed, and one can fine-tune the other three parameters to
achieve different space/time/quality trade-offs. Fixing any two of these
parameters places a constraint on the third.</p></li>
<li>In addition to these input parameters, the program also generates an internal
parameter <em>t</em>: the fraction of data points to be examined before the search
terminates under the normal-termination condition.</li>
</ul>

<h2>How to use SRS</h2>

<ol>
<li><p>Compile the program:</p>

<p><code>
% make all
</code></p></li>
<li><p>Use <code>cal_param</code> to calculate a feasible setting of parameters based on the
given constraints. Users can manually set either <em>m</em> or <em>the success
probability</em>. A feasible setting of parameters will be printed to the screen,
and it can be used later in the query processing phase. The implementation is
based on Algorithm 6 in the paper. For example (using the toy dataset with
3000 points):</p>

<p><code>
% ./cal_param -n 3000 -m 7 -c 4
</code></p>

<p>The following message will be printed out:</p>

<p><code>
A feasible setting is:
m = 7
prob_thres(-r) = 0.299203
T_max(-t) = 2
t = 0.000544
</code></p>

<p>The output indicates that, in the query processing phase, users should
pass the arguments <code>-c 4</code>, <code>-m 7</code>, <code>-r 0.299203</code> and <code>-t 2</code>.</p>

<p>As a rule of thumb, we recommend setting <em>m</em> between 6 and 10 to begin with.</p></li>
<li><p>Use <code>gen_gt</code> to generate the ground truth for a given dataset and query
workload. The ground truth file will be used when processing the query
workload. For example (using the toy dataset):</p>

<p><code>
% ./gen_gt -d 192 -n 3000 -k 10 -q data/toy.q -s data/toy.ds -g data/toy.gt -y i
</code></p>

<p><code>-y i</code> indicates that each coordinate is an integer. The other option is <code>-y
f</code>, indicating that each coordinate is a floating point number.</p></li>
<li><p>Use <code>srs</code> with the <code>-I</code> option to index the data. Users need to specify <em>m</em> and
the <em>index path</em> in this step. For example (using the toy dataset):</p>

<p><code>
% mkdir index
% ./srs -I -d 192 -i index/ -m 7 -n 3000 -s data/toy.ds -y i
</code></p></li>
<li><p>Use <code>srs</code> with the <code>-Q</code> option to process the query workload. The
implementation is based on Algorithm 1 in the paper and its variants. The
top-<em>k</em> approximate nearest neighbors for each query in the query workload
will be returned, together with the average ratio and time over all queries.</p>

<p>Users can use the parameter setting given by <code>cal_param</code>. For
example (using the toy dataset):</p>

<p><code>
% ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 0.299203
</code></p>

<p>Alternatively, users can specify the parameters themselves to achieve a different
space-time trade-off.</p></li>
<li><p>Users can change the <code>-t</code> parameter to <em>n</em> (i.e., the cardinality of the
dataset) to force the algorithm to rely on the early-termination condition to
stop. This will increase the result quality and slightly increase the time cost.
This is the <code>SRS-2</code> algorithm in the paper.</p></li>
<li><p>Users can change the <code>-r</code> parameter to any number larger than 1 to force the
algorithm to stop only on the normal-termination condition (i.e., after examining
<em>t * n</em> data points). This will substantially increase both the quality and the
time cost. This is the <code>SRS-1</code> algorithm in the paper.</p></li>
<li><p>Users can change the <code>-c</code> parameter to a smaller value to achieve better
quality without affecting the worst-case time cost. This is the <code>SRS-12+</code>
algorithm in the paper.</p></li>
</ol>

<h2>Data Format</h2>

<ol>
<li><p>The data file should contain <em>n</em> lines, where <em>n</em> is the cardinality of the
dataset. The file should be formatted as:</p>

<p><code>
e_1_1 e_1_2 ... e_1_d
...
e_n_1 e_n_2 ... e_n_d
</code></p>

<p>where each <code>e_i_j</code> is an integer, and values are separated by whitespace. (A
minimal reader sketch for this layout appears after this list.)</p></li>
<li><p>The query file should contain <em>N</em>+1 lines, where <em>N</em> is the number of queries in the
query workload. The file should be formatted as:</p>

<p><code>
N d
ID_1 e_1_1 e_1_2 ... e_1_d
...
ID_N e_N_1 e_N_2 ... e_N_d
</code>
where <em>d</em> is the dimensionality, and each <code>e_i_j</code> is an integer, separated by whitespace.</p></li>
</ol>
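
<p>For illustration, a minimal reader for the data file layout above might look
as follows. It assumes the whitespace-separated integer format described in
item 1; <code>read_data</code> is a hypothetical name, not part of SRS.</p>

<pre><code>
#include &lt;fstream&gt;
#include &lt;vector&gt;

// Read n rows of d whitespace-separated integers from the data file.
std::vector&lt;std::vector&lt;int&gt; &gt; read_data(const char *path, int n, int d) {
    std::ifstream in(path);
    std::vector&lt;std::vector&lt;int&gt; &gt; data(n, std::vector&lt;int&gt;(d));
    for (int i = 0; i &lt; n; ++i)
        for (int j = 0; j &lt; d; ++j)
            in &gt;&gt; data[i][j];
    return data;
}
</code></pre>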

<h2>Hard Dataset</h2>

<ul>
<li><p>Users can use <code>gen_hard_data</code> to generate a hard dataset with a user-specified
cardinality, dimensionality and approximation ratio. The dataset contains one
point which is the nearest neighbor of the query. All the other points are
(c+&#949;)-NNs of the query (where c is the user-specified approximation ratio),
and these points are distributed uniformly at random on a sphere centered at the
query point with radius c+&#949; (see the sketch after this list).</p></li>
<li><p>An example of using <code>gen_hard_data</code>:</p>

<p><code>
% ./gen_hard_data -n 1000000 -d 128 -c 4 -s hard.ds -q hard.q
</code></p>

<p>A dataset containing 1,000,000 points (i.e., <code>hard.ds</code>) and a query set
containing one query (i.e., <code>hard.q</code>) will then be generated.</p></li>
</ul>
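
<p>The uniform placement on a sphere mentioned above can be sketched as follows:
draw a Gaussian vector and rescale it to the desired radius. This is an
illustrative sketch only; the actual code of <code>gen_hard_data</code> may differ.</p>

<pre><code>
#include &lt;cmath&gt;
#include &lt;random&gt;
#include &lt;vector&gt;

// Return a point uniformly distributed on the sphere of the given radius
// centered at `center' (a standard Gaussian vector, normalized and then
// scaled, is uniform on the sphere).
std::vector&lt;double&gt; random_on_sphere(const std::vector&lt;double&gt; &amp;center,
                                     double radius, std::mt19937 &amp;gen) {
    std::normal_distribution&lt;double&gt; gauss(0.0, 1.0);
    std::vector&lt;double&gt; p(center.size());
    double norm = 0.0;
    for (std::size_t i = 0; i &lt; p.size(); ++i) {
        p[i] = gauss(gen);
        norm += p[i] * p[i];
    }
    norm = std::sqrt(norm);
    for (std::size_t i = 0; i &lt; p.size(); ++i)
        p[i] = center[i] + radius * p[i] / norm;   // scale onto the sphere
    return p;
}
</code></pre>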

<h2>Condition of use</h2>

<ul>
<li>SRS is distributed under the terms of the GPL License.</li>
<li>Copyright is owned by DBWang Group, University of New South Wales, Australia.</li>
</ul>

<h2>Future work</h2>

<ul>
<li>Support more input data formats.</li>
<li>Integrate SRS with other multidimensional indexing methods.</li>
</ul>

<h2>Contact</h2>

<p>Please report bugs, feature requests, comments, or suggestions to Yifang Sun
(<code>yifangs AT cse.unsw.edu.au</code>) or Wei Wang (<code>weiw AT cse.unsw.edu.au</code>).</p>

</body>
</html>