ntHash
======
ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.
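
To illustrate what "recursive" means here, the following is a minimal sketch of a rolling hash over k-mers. It is not the ntHash implementation: the names `toySeed`, `rotl`, `toyHashDirect`, `toyHashRoll`, and `toyHashAll` are hypothetical, and the per-base seed constants are arbitrary values chosen for the example. The point is only that, once the first k-mer has been hashed in O(k), each following k-mer's hash can be derived from the previous one in O(1) by removing the outgoing base and adding the incoming base.

```C++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Arbitrary per-base seeds for illustration only (not ntHash's constants).
inline uint64_t toySeed(char base) {
    switch (base) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xbf58476d1ce4e5b9ULL;
        case 'G': return 0x94d049bb133111ebULL;
        case 'T': return 0xd6e8feb86659fd93ULL;
        default:  return 0; // non-ACGT bases are not handled in this sketch
    }
}

// Rotate a 64-bit word left by s bits (s is reduced modulo 64).
inline uint64_t rotl(uint64_t v, unsigned s) {
    s &= 63u;
    return s ? (v << s) | (v >> (64u - s)) : v;
}

// Hash the k-mer starting at pos from scratch: O(k) work.
inline uint64_t toyHashDirect(const std::string& seq, std::size_t pos, unsigned k) {
    assert(pos + k <= seq.size());
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h = rotl(h, 1) ^ toySeed(seq[pos + i]);
    return h;
}

// Update the hash when the window slides by one base: O(1) work.
// 'out' is the base leaving the window, 'in' is the base entering it.
inline uint64_t toyHashRoll(uint64_t prev, char out, char in, unsigned k) {
    return rotl(prev, 1) ^ rotl(toySeed(out), k) ^ toySeed(in);
}

// Hash every k-mer of seq: one direct hash, then one roll per remaining k-mer.
inline std::vector<uint64_t> toyHashAll(const std::string& seq, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || seq.size() < k) return hashes;
    uint64_t h = toyHashDirect(seq, 0, k);
    hashes.push_back(h);
    for (std::size_t i = 0; i + k < seq.size(); ++i) {
        h = toyHashRoll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Calling `toyHashAll(seq, k)` returns `seq.size() - k + 1` values, and each rolled value equals what `toyHashDirect` would compute for the same position.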
# Build the test suite
```
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
```
To install nttest in a specified directory:
```
$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install
```
The nttest suite provides options for *runtime* and *uniformity* tests.
## Runtime test
For the runtime test, the program has the following options:
```
nttest [OPTIONS] ... [FILE]
```
Parameters:
* `-k`, `--kmer=SIZE`: the k-mer length used for hashing in the runtime test `[50]`
* `-h`, `--hash=SIZE`: the number of hashes generated for each k-mer `[1]`
* `FILE`: the input FASTA or FASTQ file
For example, to evaluate the runtime of different hash methods on the test file `reads.fa` in the `DATA/` folder for k-mer length `50`, run:
```
$ nttest -k50 reads.fa
```
## Uniformity test
For the uniformity test using the Bloom filter data structure, the program has the following options:
```
nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]
```
Parameters:
* `-q`, `--qnum=SIZE`: number of queries in query file
* `-l`, `--qlen=SIZE`: length of reads in query file
* `-t`, `--tnum=SIZE`: number of sequences in reference file
* `-g`, `--tlen=SIZE`: length of reference sequence
* `-i`, `--input`: generate random query and reference files
* `-j`, `--threads=SIZE`: number of threads to run uniformity test `[1]`
* `REF_FILE`: the reference file name
* `QUERY_FILE`: the query file name
For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:
* `100` genes of length `5,000,000bp` as reference in file `genes.fa`
* `4,000,000` reads of length `250bp` as query in file `reads.fa`
* `12` threads
run:
```
$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
```
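
As background for why a Bloom filter can serve as a uniformity check, here is a minimal sketch and not the code used by nttest. The class `ToyBloomFilter`, the function `expectedFpr`, and the stand-in hash `mixHash` (the splitmix64 finalizer) are all assumptions made for this example. The idea is that, with well-distributed hash values, the observed false-positive rate of the filter should be close to the theoretical rate for its size, number of hashes, and number of inserted items.

```C++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in hash (the splitmix64 finalizer), seeded by the hash index i.
// A uniformity test would plug in the hash method under evaluation here.
inline uint64_t mixHash(uint64_t key, unsigned i) {
    uint64_t z = key + 0x9e3779b97f4a7c15ULL * (i + 1);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

class ToyBloomFilter {
public:
    ToyBloomFilter(std::size_t numBits, unsigned numHashes)
        : bits_(numBits, false), numHashes_(numHashes) {
        assert(numBits > 0 && numHashes > 0);
    }

    // Set one bit per hash function.
    void insert(uint64_t key) {
        for (unsigned i = 0; i < numHashes_; ++i)
            bits_[mixHash(key, i) % bits_.size()] = true;
    }

    // True if all bits for this key are set (may be a false positive).
    bool contains(uint64_t key) const {
        for (unsigned i = 0; i < numHashes_; ++i)
            if (!bits_[mixHash(key, i) % bits_.size()]) return false;
        return true;
    }

private:
    std::vector<bool> bits_;
    unsigned numHashes_;
};

// Theoretical false-positive rate, assuming ideal uniform hash values:
// (1 - exp(-h * n / m))^h for m bits, h hashes, and n inserted items.
inline double expectedFpr(std::size_t numBits, unsigned numHashes, std::size_t numItems) {
    double fill = 1.0 - std::exp(-static_cast<double>(numHashes) * numItems / numBits);
    return std::pow(fill, static_cast<double>(numHashes));
}
```

To use the sketch, insert the reference items, query items known to be absent, and compare the fraction reported as present with `expectedFpr`. An observed rate well above the expected one suggests the hash values are not spread uniformly.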
## Code samples
To hash all k-mers of length `k` in a given sequence `seq`:
```C++
string kmer = seq.substr(0, k);
uint64_t hVal = 0;
hVal = NTF64(kmer.c_str(), k); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
    ...
}
```
To compute canonical hash values for all k-mers of length `k` in a given sequence `seq`:
```C++
string kmer = seq.substr(0, k);
uint64_t hVal, fhVal = 0, rhVal = 0; // canonical, forward, and reverse-strand hash values
hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
    ...
}
```
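
A canonical hash gives a k-mer and its reverse complement the same value, so both strands of a sequence hash identically. The sketch below shows one common construction, assumed here for illustration: take the smaller of the forward-strand and reverse-strand hash values. The helpers `strandSeed`, `strandHash`, `reverseComplement`, and `toyCanonicalHash` are hypothetical names with arbitrary seed constants, and the sketch recomputes both strand hashes from scratch instead of updating them recursively as `NTC64` does.

```C++
#include <algorithm>
#include <cstdint>
#include <string>

// Arbitrary per-base seeds for illustration only (not ntHash's constants).
inline uint64_t strandSeed(char base) {
    switch (base) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xbf58476d1ce4e5b9ULL;
        case 'G': return 0x94d049bb133111ebULL;
        case 'T': return 0xd6e8feb86659fd93ULL;
        default:  return 0; // non-ACGT bases are not handled in this sketch
    }
}

// Hash a k-mer as written (one strand): rotate-left by 1, then XOR each base seed.
inline uint64_t strandHash(const std::string& kmer) {
    uint64_t h = 0;
    for (char c : kmer)
        h = ((h << 1) | (h >> 63)) ^ strandSeed(c);
    return h;
}

// Reverse the k-mer and swap A<->T and C<->G.
inline std::string reverseComplement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break; // leave other characters unchanged
        }
    }
    return rc;
}

// Canonical value: the smaller of the two strand hashes.
inline uint64_t toyCanonicalHash(const std::string& kmer) {
    return std::min(strandHash(kmer), strandHash(reverseComplement(kmer)));
}
```

With this definition, `toyCanonicalHash("GATTACA")` and `toyCanonicalHash("TGTAATC")` return the same value, because each k-mer is the reverse complement of the other.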
To compute `h` hash values for each k-mer of length `k` in a given sequence `seq`:
```C++
string kmer = seq.substr(0, k);
uint64_t hVec[h];
NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
    ...
}
```
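
One inexpensive way to obtain several hash values per k-mer is to compute a single base hash and derive the remaining values from it with a multiply and xor-shift step, so that the k-mer itself is hashed only once. The sketch below illustrates that idea under stated assumptions: `toyDeriveHashes` is a hypothetical helper, and `kToyMultiSeed` and `kToyMultiShift` are arbitrary constants chosen for the example. It is not presented as the exact procedure inside `NTM64`. It also stores the results in a `std::vector`, which avoids the variable-length array `hVec[h]`, a compiler extension in C++.

```C++
#include <cstdint>
#include <vector>

// Arbitrary constants for illustration only.
const uint64_t kToyMultiSeed = 0x9e3779b97f4a7c15ULL;
const unsigned kToyMultiShift = 27;

// Derive h hash values from one base hash of a k-mer of length k.
// Element 0 is the base hash itself; the others are mixed copies of it.
inline std::vector<uint64_t> toyDeriveHashes(uint64_t baseHash, unsigned h, unsigned k) {
    std::vector<uint64_t> hashes(h);
    if (h == 0) return hashes;
    hashes[0] = baseHash;
    for (unsigned i = 1; i < h; ++i) {
        // Multiply by a value that depends on both the hash index and k,
        // then fold the high bits into the low bits.
        uint64_t t = baseHash * (i ^ (k * kToyMultiSeed));
        t ^= t >> kToyMultiShift;
        hashes[i] = t;
    }
    return hashes;
}
```

In a loop over k-mers, the base hash would come from the rolling update, and `toyDeriveHashes(baseHash, h, k)` would then fill the hash vector for that k-mer.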
# ntHashIterator
`ntHashIterator` provides an iterator interface for applying ntHash to sequences.
To hash all k-mers of length `k` in a given sequence `seq` with `h` hash values using ntHashIterator:
```C++
ntHashIterator itr(seq, h, k);
while (itr != itr.end())
{
    ... use *itr ...
    ++itr;
}
```
## Usage example (C++)
Outputting the hash values of all k-mers in a sequence:
```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
    /* test sequence */
    std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

    /* k is the k-mer length */
    unsigned k = 70;

    /* h is the number of hashes for each k-mer */
    unsigned h = 1;

    /* init ntHash state and compute hash values for first k-mer */
    ntHashIterator itr(seq, h, k);
    while (itr != itr.end()) {
        std::cout << (*itr)[0] << std::endl;
        ++itr;
    }

    return 0;
}
```
Publications
============
## [ntHash](http://bioinformatics.oxfordjournals.org/content/early/2016/08/01/bioinformatics.btw397)
Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol.
**ntHash: recursive nucleotide hashing**.
*Bioinformatics* (2016) 32 (22): 3492-3494.
[doi:10.1093/bioinformatics/btw397](http://dx.doi.org/10.1093/bioinformatics/btw397)
# Acknowledgements
This project uses:
* [CATCH](https://github.com/philsquared/Catch) unit test framework for C/C++