https://github.com/ddiazdom/grlBWT
Raw File
Tip revision: 74212f503c7f9975e6bd7d59eb8d886f5e52aa5a authored by ddiazdom on 28 February 2022, 22:38:35 UTC
Merge pull request #3 from ddiazdom/experimental
Tip revision: 74212f5
README.md
## grlBWT : Efficient construction of the BWT using string compression

This repository contains the implementation of **grlBWT**, a linear-time semi-external algorithm for building the BCR BWT
of a string collection. The novelty of grlBWT is that it resorts to compression to maintain the intermediate data of
the BWT construction in compact form. The compact format used by grlBWT reduces working memory and avoids redundant
computations, thus decreasing computing time too.

## Third-party libraries

1. [SDSL-lite](https://github.com/simongog/sdsl-lite)
2. [xxHash](https://github.com/Cyan4973/xxHash)
3. [CLI11](https://github.com/CLIUtils/CLI11)

## Prerequisites

1. C++ >= 17
2. CMake >= 3.7
3. SDSL-lite

The xxHash and CLI11 libraries are already included in the source files of this repository. The SDSL-lite has to be installed beforehand.
However, we include a CMake module that will search for its local
installation. No need to indicate the path during the compilation.

## Installation

Clone repository, enter the project folder and execute the following commands:

```
mkdir build
cd build
cmake ..
make
```

## Example execution

```
./grlbwt ../tests/sample_file.txt
```

Our tool currently supports string collections in one-string-per-line format or the classical FASTA and FASTQ formats.
It also handles gzipped inputs. In principle, grlBWT automatically detects the format. However,
we have not tested the automatic detection functionality exhaustively yet.

## Output

Our tool outputs the BCR BWT as a run-length compressed array. Concretely, it produces a sequence of equal-symbol runs
encoded as pairs (*a*,*l*), where *a* is the run symbol and *l* is its length. We represent the run symbols and the run
lengths using cells of different widths to reduce space usage. The width for the symbols is the smallest number of bytes
that fits the alphabet. On the other hand, the width for the lengths is the smallest number of bytes that fits the
length of the longest equal-symbol run in the BCR BWT.

We plan to write a parser in the future to produce the BCR BWT in plain format.

## Printing the original text

If you want to print the original strings, please use the *print_seqs* binary:

```
./print_seqss sample_file.txt.rl_bwt 10
```

This command will print the first 10 strings of the original file used in example execution above.

## Multithreading

Our algorithm has a hashing step that supports multithreading to some extent. If you want to use it, pass the parameter
-t to the execution of grlBWT.

## How to cite

We will put a source to cite soon.

## Bugs

This tool still contains experimental code, so you will probably find bugs. Please report them in this repository.

## Author

This implementation was written by [Diego Díaz](https://github.com/ddiazdom) .
back to top