Revision 7c9015bd11e2c7359f3311d9202aea6be69b4131 authored by Gabe DuBose on 10 June 2022, 00:10:28 UTC, committed by GitHub on 10 June 2022, 00:10:28 UTC
# Evaluation of MUtations via reference Simulation: EMUS
<p align="center"><img src="emus-logo.png" height="150" /></p>
NOTE: THIS PROGRAM IS STILL UNDER DEVELOPMENT AND NOT READY FOR GENERAL USE.
EMUS is a pipeline and tool for statistically evaluating the frequency of mutational classes in genomic studies. This is accomplished in two primary steps. First, random mutations are generated across your organism's reference genome. Second, the same number of mutations that were observed is randomly selected from the simulated mutations. The frequencies of mutational classes (e.g., synonymous, missense, intergenic) are then compared between the observed and randomly selected sets. The selection and comparison are repeated for n bootstraps, and the probability that the observed mutational classes occurred at higher or lower frequencies than expected by chance is computed.
This documentation is a quick overview of the emus functionality. More examples and tutorials will be uploaded upon package completion.
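The resampling idea described above can be illustrated with a small Python sketch. This is not the EMUS implementation: the function names, the toy class labels, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random
from collections import Counter

def class_frequencies(classes):
    """Return the relative frequency of each mutational class in a list."""
    counts = Counter(classes)
    total = len(classes)
    return {cls: n / total for cls, n in counts.items()}

def bootstrap_frequencies(simulated, n_observed, n_bootstraps, seed=None):
    """Draw n_observed simulated mutations per bootstrap and record class frequencies.

    Assumption: each draw samples without replacement from the simulated set.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_bootstraps):
        sample = rng.sample(simulated, n_observed)
        draws.append(class_frequencies(sample))
    return draws

# Toy data (hypothetical): one class label per mutation.
simulated = ["synonymous"] * 300 + ["missense"] * 600 + ["intergenic"] * 100
observed = ["synonymous"] * 5 + ["missense"] * 10 + ["intergenic"] * 5

observed_freq = class_frequencies(observed)
draws = bootstrap_frequencies(simulated, len(observed), n_bootstraps=1000, seed=1)

# Compare the observed missense frequency with the average across bootstraps.
mean_missense = sum(d.get("missense", 0.0) for d in draws) / len(draws)
print("observed missense frequency:", observed_freq["missense"])
print("mean simulated missense frequency:", round(mean_missense, 3))
```

Each entry in `draws` holds the class frequencies for one bootstrap, which is the kind of distribution the observed frequencies are compared against.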
## Installation
The simplest way to install EMUS is to set up the emus conda environment and then pip install this repository:
```
conda env create evo-informatics/emus
conda activate emus
pip3 install git+https://github.com/gabe-dubose/emus.git
```
Another option would be to download or clone this repository and use the emus-env.yaml file to create a conda environment with the dependencies:
```
git clone https://github.com/gabe-dubose/emus
cd emus
conda env create --file emus-env.yaml
pip3 install git+https://github.com/gabe-dubose/emus.git
```
Although not recommended, you can also install the dependencies manually and then clone this repository. It is recommended that the dependencies be installed with pip or conda:
Dependencies:
- Python3
- SnpEff
- Seaborn
- Matplotlib
## General Workflow and Tutorial
### Simulating mutations in the reference genome
The simulate_mutations.py script takes an input genome in fasta format and simulates a flat number of SNPs. The output from this program is a variant call file (VCF) containing the number of mutations you specified. However, if you are looking for more custom and fine-grained simulations, we recommend using Mutation-Simulator. The output VCF from that program can be used with EMUS as well.
```
simulate-mutations \
-i/--input <reference_genome.fasta> \
-s/--snp <#snps> \
-o/--output <output_file.vcf>
```
NOTE: It is recommended to annotate these simulated variants using the same annotation tool that was used for the observed dataset.
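To make the simulation step concrete, here is a minimal Python sketch of placing a fixed number of SNPs uniformly at random across a genome and writing them as a VCF. It is an illustration under stated assumptions, not the code behind simulate-mutations: the helper names are hypothetical, sites are chosen uniformly over A/C/G/T positions, and every candidate site is held in memory, which is only practical for small genomes.

```python
import random

def read_fasta(path):
    """Parse a FASTA file into {sequence_name: sequence}."""
    seqs, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None and line:
                seqs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in seqs.items()}

def simulate_snps(seqs, n_snps, seed=None):
    """Pick n_snps distinct A/C/G/T sites uniformly across all sequences."""
    rng = random.Random(seed)
    sites = [(name, i) for name, seq in seqs.items()
             for i, base in enumerate(seq) if base in "ACGT"]
    snps = []
    for name, i in sorted(rng.sample(sites, n_snps)):
        ref = seqs[name][i]
        alt = rng.choice([b for b in "ACGT" if b != ref])
        snps.append((name, i + 1, ref, alt))  # VCF positions are 1-based
    return snps

def write_vcf(snps, path):
    """Write the SNPs as a minimal VCF with empty ID/QUAL/FILTER/INFO fields."""
    with open(path, "w") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        for chrom, pos, ref, alt in snps:
            out.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\n")

# In-memory toy genome (hypothetical); with real data one would call
# read_fasta("reference_genome.fasta") and write_vcf(snps, "output_file.vcf").
seqs = {"chr1": "ACGTACGTNN", "chr2": "GGGCCC"}
snps = simulate_snps(seqs, 5, seed=0)
for record in snps:
    print(record)
```

Ambiguous bases such as `N` are skipped so that every simulated SNP has a defined reference allele.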
### Optional: Visualizing simulation
EMUS offers a visualization feature for manually inspecting the distribution of mutations across each of your reference's chromosomes. The VCF generated by EMUS or Mutation-Simulator can be used as input. The output is a separate histogram plot for each chromosome, so we recommend making a separate directory to put these in.
```
mkdir output_dir
plot-vcf-histogram \
-i/--input <simulated_mutations.vcf> \
-o/--output <output_dir>
```
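The quantity such a histogram displays can be sketched in a few lines of Python: variants are counted in fixed-width windows along each chromosome. The function name and the window size are assumptions for illustration and may not match what plot-vcf-histogram does internally.

```python
from collections import Counter, defaultdict

def bin_vcf_positions(vcf_lines, bin_size=100_000):
    """Count variants per fixed-width window on each chromosome."""
    bins = defaultdict(Counter)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split("\t")[:2]
        bins[chrom][(int(pos) - 1) // bin_size] += 1
    return bins

# Toy VCF records (hypothetical).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t150\t.\tA\tG\t.\t.\t.",
    "chr1\t900\t.\tC\tT\t.\t.\t.",
    "chr1\t1200\t.\tG\tA\t.\t.\t.",
    "chr2\t50\t.\tT\tC\t.\t.\t.",
]

counts = bin_vcf_positions(vcf_lines, bin_size=1000)
for chrom, windows in counts.items():
    for index in sorted(windows):
        start = index * 1000 + 1
        # Text bar: one '#' per variant in the window.
        print(chrom, f"{start}-{start + 999}", "#" * windows[index])
```

A roughly even spread of counts across windows is what one would expect from a uniform simulation.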
### Reading in variant annotations
All EMUS needs for downstream analyses is a .tsv file with a list of variants in the first column. Different variant annotation tools produce different output, so sometimes getting this information can be challenging. EMUS offers a little bit of help through the get-annotations command, which supports conversion from standard SnpEff, ANNOVAR, and VEP outputs. More functionality here can easily be added later on as well. If applicable, this step should be performed on the observed data and the simulated data.
```
get-annotations \
-i/--input <input_file> \
--snpeff OPTIONAL \
--annovar OPTIONAL \
--vep OPTIONAL \
-o/--output <out_file.tsv>
```
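As an illustration of the kind of conversion involved, the sketch below pulls the effect term out of the `ANN` INFO field that SnpEff adds to a VCF, where annotations are comma-separated and each one is a `|`-delimited record whose second field is the effect. The function name is hypothetical, taking only the first annotation per variant is an assumption, and the exact layout written by get-annotations is not documented here.

```python
def snpeff_effects(vcf_lines):
    """Yield the first ANN effect term for each record in a SnpEff-annotated VCF."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]
        for entry in info.split(";"):
            if entry.startswith("ANN="):
                # Assumption: use only the first listed annotation.
                first_annotation = entry[len("ANN="):].split(",")[0]
                yield first_annotation.split("|")[1]
                break

# Toy SnpEff-style records (hypothetical, with shortened ANN fields).
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t150\t.\tA\tG\t.\t.\tDP=10;ANN=G|missense_variant|MODERATE|geneA",
    "chr1\t900\t.\tC\tT\t.\t.\tANN=T|synonymous_variant|LOW|geneB,T|upstream_gene_variant|MODIFIER|geneC",
]

effects = list(snpeff_effects(vcf_lines))
# One effect per line, so the effect sits in the first column of the TSV text.
tsv_text = "\n".join(effects) + "\n"
print(tsv_text, end="")
```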
### Comparing observed and simulated variants
Using the output from the get-annotations command in the previous step, we can compare our observed variants to our simulated ones.
```
compare-variants \
-i/--input <observed_variants.tsv> \
-c/--comparison <simulated_variants.tsv> \
-b/--bootstraps <#bootstraps> \
-o/--output <output_file_handle>
```
This program will produce a plain .tsv file that will have the probability values for each mutation class. It will also produce a .bootstraps.tsv file that will contain the relevant information for visualization.
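One common way to turn bootstrap results into probability values is to count how often the resampled frequency of a class is at least as high, or at least as low, as the observed frequency. The sketch below shows that convention; whether compare-variants defines its probabilities exactly this way is an assumption, and the function name and numbers are hypothetical.

```python
def empirical_probabilities(observed_freq, bootstrap_freqs):
    """Return the fractions of bootstrap frequencies >= and <= the observed one."""
    n = len(bootstrap_freqs)
    at_least = sum(f >= observed_freq for f in bootstrap_freqs) / n
    at_most = sum(f <= observed_freq for f in bootstrap_freqs) / n
    return at_least, at_most

# Toy bootstrap frequencies for a single mutational class (hypothetical).
missense_bootstraps = [0.50, 0.55, 0.60, 0.45, 0.52, 0.58, 0.62, 0.48, 0.51, 0.57]
observed_missense = 0.60

p_high, p_low = empirical_probabilities(observed_missense, missense_bootstraps)
print("fraction of bootstraps >= observed:", p_high)
print("fraction of bootstraps <= observed:", p_low)
```

Under this convention, a small first value suggests the class was observed more often than the random expectation, and a small second value suggests it was observed less often.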
### Visualizing comparisons
With the .bootstraps.tsv file generated in the previous step, EMUS offers plotting options to generate publication-quality figures. These visualizations include histograms, kernel density estimate (KDE) plots, and empirical cumulative distribution function (ECDF) plots, as well as options for figure customization and coloring.
```
mkdir out_dir
plot-variant-comparisons \
-i/--input <data.bootstraps.tsv> \
-o/--outdir <out_dir>
Optional Flags:
--hist
--kde
--ecdf
--color_tail <color>
--comp_line_color <color>
--plot_color <color>
--blank_bars
--background_theme <white or dark>
```
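For readers unfamiliar with ECDF plots, the following sketch computes the points such a plot would draw from a list of bootstrap frequencies. It is a generic illustration with hypothetical names and values, not the plotting code used by plot-variant-comparisons.

```python
def ecdf(values):
    """Return the sorted values and the cumulative proportion (i + 1) / n at each."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]

# Toy bootstrap frequencies for one mutational class (hypothetical).
bootstrap_freqs = [0.52, 0.45, 0.60, 0.50, 0.55]
xs, ys = ecdf(bootstrap_freqs)

for x, y in zip(xs, ys):
    print(f"{x:.2f}\t{y:.2f}")
```

Reading the curve at the observed frequency gives the share of bootstraps at or below that value.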
