https://github.com/TNishimoto/renum
Tip revision: ecc5606f9a9f067ac295f0e45625918d8da7f681 authored by Nishimoto on 06 April 2021, 06:53:31 UTC
minor fix
minor fix
Tip revision: ecc5606
readme.md
# R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
This library provides implementations for enumerating characteristic substrings (maximal substrings and minimal unique substrings) in a string T by processing the BWT of T.
The programs run in O(n log r) time with O(r) bytes of working space,
where n is the length of the input BWT, and r is the number of runs in the BWT.
For more details on R-enum, please see (https://arxiv.org/abs/2004.01493).
## Download
The source codes in 'module' directory are maintained in different repositories.
So, to download all the necessary source codes, do the following:
> git clone https://github.com/TNishimoto/renum.git
> cd renum
> git submodule init
> git submodule update
You need sdsl-lite(https://github.com/simongog/sdsl-lite) to excecute this program.
Please edit CMakeLists.txt to set SDSL library and include directory paths.
## Compile
We assume that the library and header files of SDSL are installed in \~/lib and \~/include directories, respectively.
> mkdir build
> cd build
> cmake .. -DCMAKE_BUILD_TYPE=Release -DSDSL_LIBRARY_DIR=\~/lib -DSDSL_INCLUDE_DIR=\~/include
> make
## Executions && Examples
### text_to_bwt.out
This program outputs the BWT of a given text using libdivsufsort.
usage: ./text_to_bwt.out --input_file=string [options] ...
options:
-i, --input_file input file path (string)
-o, --output_file output bwt file path (string [=])
-s, --special_character special character (string [=])
-?, --help print this message
$ echo -n GATCAATGAGGTGGACACCAGAGGCGGGGACTTGT > sample.txt
$ ./text_to_bwt.out -i sample.txt -s $
> Special character: $(36)
> Input Text: GATCAATGAGGTGGACACCAGAGGCGGGGACTTGT$
> Output BWT: TCGCGCGGGATACAGAGGAT$GTGAGCATGGAAGTC
> writing: sample.txt.bwt
> wrote: sample.txt.bwt, time = 2
### maximal_repeat.out
This program outputs the LCP intervals for all the maximal repeats
in the string represented by a given BWT.
The output file is encoded in binary format, and print.out can decode the encoded file.
The program can use multithreading for computing maximal repeats.
It works with a single thread if you use "-p 1".
It uses all the processors in your computer if you use "-p -1" or you do not use the -p option.
usage: ./maximal_repeat.out --input_file=string [options] ...
options:
-i, --input_file input bwt file path (string)
-o, --output_file output file path (default: input_file_path.max) (string [=])
-p, --thread_num thread number for parallel processing (int [=-1])
-?, --help print this message
$ ./maximal_repeat.out -i ./sample.txt.bwt
> ______________________RESULT______________________
> RLBWT File : ./sample.txt.bwt
> Output : ./sample.txt.bwt.max
> Peak children count : 22
> Thread number : 8
> Integer Type : UINT32_t
> The length of the input text : 36
> The number of runs on BWT : 31
> The number of maximum substrings : 16
> ______________________Execution Time______________________
> Excecution time : 0 [s]
> Character per second : inf [KB/s]
> Preprocessing time : 0 [s]
> Enumeration time : 0 [s]
> ______________________Memory Usage______________________
> RLBWT : 0 [KB]
> WT : 10 [KB]
> Queue : 16 [KB]
> _______________________________________________________
### print.out
This program converts the LCP intervals stored in a given binary file into a new file in text format.
It also outputs the string represented by each LCP interval if you use option -s 1.
usage: ./print.out --input_file=string --lcp_interval_file=string [options] ...
options:
-i, --input_file input bwt file path (string)
-l, --lcp_interval_file LCP interval file path (string)
-o, --output_file output file path (default: lcp_interval_file.interval.log) (string [=])
-s, --string_flag Output the string represented by each interval if this flag is 1 (bool [=1])
-?, --help print this message
$ ./print.out -i ./sample.txt.bwt -l ./sample.txt.bwt.max -s 1
> (i, j, LCP, substring)
> 0, 35, 0,
> 30, 35, 1, T
> 10, 15, 1, C
> 16, 29, 1, G
> 1, 9, 1, A
> 28, 29, 2, GT
> 8, 9, 2, AT
> 2, 4, 2, AC
> 22, 27, 2, GG
> 5, 7, 2, AG
> 32, 34, 2, TG
> 10, 12, 2, CA
> 16, 20, 2, GA
> 25, 26, 3, GGG
> 22, 23, 4, GGAC
> 18, 19, 4, GAGG
> ______________________RESULT______________________
> BWT File : ./sample.txt.bwt
> Interval File : ./sample.txt.bwt.max
> Output File : ./sample.txt.bwt.max.interval.log
> File Type: LCPInterval
> Integer Type: uint32t
> Count: 16
> _______________________________________________________
### mus.out
This program outputs the LCP intervals for all the minimal unique substrings
in the string represented by a given BWT.
The output file is encoded in binary format, and print.out can decode the encoded file.
The program can use multithreading for computing minimal unique substrings.
It works with a single thread if you use "-p 1".
It uses all the processors in your computer if you use "-p -1" or you do not use the -p option.
usage: ./mus.out --input_file=string [options] ...
options:
-i, --input_file input file name (string)
-o, --output_file output file path (default: input_file_path.mus) (string [=])
-p, --thread_num thread number (int [=-1])
-?, --help print this message
$ ./mus.out -i ./sample.txt.bwt
> ______________________RESULT______________________
> RLBWT File : ./sample.txt.bwt
> Output : ./sample.txt.bwt.mus
> Peak children count : 22
> Thread number : 8
> Integer Type : UINT32_t
> The length of the input text : 36
> The number of runs on BWT : 31
> The number of MUSs : 5
> ______________________Execution Time______________________
> Excecution time : 12 [s]
> Character per second : 3 [KB/s]
> Preprocessing time : 8 [s]
> Enumeration time : 4 [s]
> ______________________Memory Usage______________________
> RLBWT : 0 [KB]
> WT : 10 [KB]
> Queue : 16 [KB]
> _______________________________________________________
$ ./print.out -i ./sample.txt.bwt -l ./sample.txt.bwt.mus
> File Type: LCPInterval
> Integer Type: uint32t
> Count: 21
> (i, j, LCP, substring)
> 0, 0, 1, $
> 31, 31, 2, TC
> 35, 35, 2, TT
> 13, 13, 2, CC
> 14, 14, 2, CG
> 15, 15, 2, CT
> 21, 21, 2, GC
> 1, 1, 2, AA
> 29, 29, 3, GTG
> 9, 9, 3, ATG
> 2, 2, 3, ACA
> 27, 27, 3, GGT
> 5, 5, 3, AGA
> 32, 32, 3, TGA
> 33, 33, 3, TGG
> 34, 34, 3, TGT
> 11, 11, 3, CAC
> 12, 12, 3, CAG
> 20, 20, 3, GAT
> 25, 25, 4, GGGA
> 26, 26, 4, GGGG
> ______________________RESULT______________________
> BWT File : ./sample.txt.bwt
> Interval File : ./sample.txt.bwt.mus
> Output File : ./sample.txt.bwt.mus.interval.log
> File Type: LCPInterval
> Integer Type: uint32t
> Count: 21
> _______________________________________________________
## Excecutions for Experiments
### convert_bwt_into_int_vector.out
This program converts a given BWT in text format into the BWT in binary format(sdsl::int_vector).
usage: ./convert_bwt_into_int_vector.out --input_file=string [options] ...
options:
-i, --input_file input bwt file path (text format) (string)
-o, --output_file output bwt file path (binary file) (default: input_bwt_file.iv) (string [=])
-?, --help print this message
$ ./convert_bwt_into_int_vector.out -i ./sample.txt.bwt
> Finished.
> ______________________RESULT______________________
> Input BWT File : ./sample.txt.bwt
> Output BWT File : ./sample.txt.bwt.iv
> _______________________________________________________
### bbo_meximal_repeat.out
This program outputs the LCP intervals for all the maximal repeats
in the string represented by a given BWT, using the algorithm described in (https://link.springer.com/chapter/10.1007/978-3-642-34109-0_11).
I did not the technique for storing a queue for LCP intervals in $n + o(n)$ bits.
So, the memory usage of this implementation is slightly larger than the original implementation.
The format of the input BWT must be binary format (sdsl::int_vector).
The input file is deleted after finishing the program.
usage: ./bbo_meximal_repeat.out --input_file=string [options] ...
options:
-i, --input_file input file name (string)
-o, --output_file output file path (default: input_file_path.bbo.max) (string [=])
-?, --help print this message
$ ./bbo_maximal_repeat.out -i ./sample.txt.bwt.iv
> ______________________RESULT______________________
> BWT File : ./sample.txt.bwt.iv
> Output File : ./sample.txt.bwt.iv.bbo.max
> The length of the input text : 36
> The number of maximum substrings : 16
> Peak count : 22
> The memory usage of Wavelet tree : 3[KB]
> The memory usage of Queue : 0[KB]
> Excecution time : 0[s]
> Character per second : inf[KB/s]
> Preprocessing time : 0[s]
> Enumeration time : 0[s]
> _______________________________________________________
### dfs_meximal_repeat.out
This program outputs the LCP intervals for all the maximal repeats
in the string represented by a given BWT, using the algorithm described in (https://link.springer.com/chapter/10.1007/978-3-319-23826-5_22).
The format of the input BWT must be binary format (sdsl::int_vector).
The input file is deleted after finishing the program.
usage: ./bbo_meximal_repeat.out --input_file=string [options] ...
options:
-i, --input_file input file name (string)
-o, --output_file output file path (default: input_file_path.dfs.max) (string [=])
-?, --help print this message
$ ./dfs_maximal_repeat.out -i ./sample.txt.bwt.iv
> ______________________RESULT______________________
> BWT File : ./sample.txt.bwt.iv
> Output File : ./sample.txt.bwt.iv.dfs.max
> The length of the input text : 36
> The number of maximum substrings : 16
> The usage of Wavelet tree : 3[KB]
> Excecution time : 0[s]
> Character per second : inf[KB/s]
> Preprocessing time : 0[s]
> Enumeration time : 0[s]
> _______________________________________________________
## Library
(in preparation)