Revision b8fd5632d258bc78ae136208ef1ad1fe6d359483 authored by John Favate on 23 July 2022, 22:04:52 UTC, committed by John Favate on 23 July 2022, 22:04:52 UTC
README.md
This repo contains the code and associated files for "The landscape of transcriptional and translational changes over 22 years of bacterial adaptation", currently available on bioRxiv (https://www.biorxiv.org/content/10.1101/2021.01.12.426406v1).

## Repo organization

The repo is organized into folders with self-descriptive names. For example, `code` contains the code used to process the data, perform analysis, and make figures. Likewise, `data_frames` contains the various data files created or used during the analysis. Note that some data files are missing because they are too large to place here and must be generated by running the code.

## Running the code

Our analysis was performed on a server with two Intel Xeon E5-2660 v4 CPUs @ 2.00 GHz (14 cores and 2 threads per core each, 56 threads in total) and 264 GB of RAM, running the following software versions:

| Software      	| version     	|
|---------------	|-------------	|
| cutadapt      	| 2.8         	|
| python        	| 3.6.9       	|
| hisat2        	| 2.1.0       	|
| kallisto      	| 0.46.2      	|
| samtools      	| 1.10        	|
| BBMap         	| 37          	|
| FASTX-Toolkit 	| 0.0.14      	|
| R             	| 4.2.0       	|
| Ubuntu        	| 18.04.5 LTS 	|
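
Before starting, it can help to confirm that the command-line tools are installed. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper, and it only checks that each executable is on your `PATH`, not that its version matches the table above.

```shell
# check_tools is a hypothetical helper (not part of this repo).
# It reports which of the named executables are missing from PATH
# and exits non-zero if any of them are missing.
check_tools() (
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    exit "$missing"
)
```

For example, `check_tools cutadapt python3 hisat2 kallisto samtools Rscript`. BBMap and the FASTX-Toolkit each ship several executables (for instance `bbmap.sh` or `fastx_trimmer`); which ones the Rmds call is not listed here, so add the names you find in the code.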

Additionally, when knitted to HTML, the bottom of each Rmd displays the versions of the packages used. Many of the steps make use of, but do not require, multiple threads. If needed, you can lower the thread usage; the only consequence is that the code will take longer to run. There are 3 main phases to the analysis:  
1. Sequencing data processing - process the raw sequencing data such that it can be aligned and quantified.  
2. Analysis - run the various analyses that are based on the sequencing data.  
3. Interpretation - make visualizations of the data acquired during the analysis phase.

To recreate the analysis, clone the repository. The resulting local directory structure should match the following:
```
.
├── alignment
│   ├── hisat2
│   │   ├── indices
│   │   └── output
│   └── kallisto
│       ├── indices
│       └── output
├── biocyc_files
├── code
│   ├── analysis
│   ├── data_processing
│   └── figures
├── data_frames
├── fastas
├── figures
├── gffs
└── seqdata
    ├── 1-original
    ├── 2-adapter_removed
    ├── 3-demultiplexed
    ├── 4-deduplicated
    │   ├── deduped_files
    │   └── duplicates
    ├── 5-trimmed_ends
    └── 6-rrna_depleted
```
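
Git does not track empty directories, so some of the folders above (for example the `output` or `seqdata` subfolders) may be absent after cloning. The sketch below creates any that are missing; `make_repo_dirs` is a hypothetical helper name, and the directory list is copied from the tree above.

```shell
# make_repo_dirs is a hypothetical helper (not part of this repo).
# It creates every directory from the tree above under the given root.
make_repo_dirs() (
    root="${1:-.}"   # repository root; defaults to the current directory
    for d in \
        alignment/hisat2/indices alignment/hisat2/output \
        alignment/kallisto/indices alignment/kallisto/output \
        biocyc_files code/analysis code/data_processing code/figures \
        data_frames fastas figures gffs \
        seqdata/1-original seqdata/2-adapter_removed seqdata/3-demultiplexed \
        seqdata/4-deduplicated/deduped_files seqdata/4-deduplicated/duplicates \
        seqdata/5-trimmed_ends seqdata/6-rrna_depleted
    do
        # -p creates parent directories and does nothing if the directory exists
        mkdir -p "$root/$d"
    done
)
```

Running `make_repo_dirs .` from the repository root is safe to repeat, since existing directories and their contents are left untouched.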

Then, download the sequencing data (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164308) and place it in `/seqdata/1-original`. After downloading the data, rename the files as follows:

| GSM accession 	| Sample name        	| New file name      	|
|------------	|--------------------	|--------------------	|
| GSM5006206 	| rep1 ribo-seq araM 	| rep1-ribo-am.fq.gz 	|
| GSM5006207 	| rep1 ribo-seq araP 	| rep1-ribo-ap.fq.gz 	|
| GSM5006208 	| rep1 RNAseq araM   	| rep1-rna-am.fq.gz  	|
| GSM5006209 	| rep1 RNAseq araP   	| rep1-rna-ap.fq.gz  	|
| GSM5006210 	| rep2 ribo-seq araM 	| rep2-ribo-am.fq.gz 	|
| GSM5006211 	| rep2 ribo-seq araP 	| rep2-ribo-ap.fq.gz 	|
| GSM5006212 	| rep2 RNAseq araM   	| rep2-rna-am.fq.gz  	|
| GSM5006213 	| rep2 RNAseq araP   	| rep2-rna-ap.fq.gz  	|
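
The renaming can be scripted. The sketch below is only an illustration: `rename_geo_files` is a hypothetical helper, and it assumes each download was saved as `<GSM accession>.fq.gz`, which may not match the names you actually receive. Adjust the source pattern accordingly.

```shell
# rename_geo_files is a hypothetical helper (not part of this repo).
# ASSUMPTION: each downloaded file is named <GSM accession>.fq.gz.
rename_geo_files() (
    dir="${1:-seqdata/1-original}"   # folder that holds the downloads
    # Each line of the here-document is: <GSM accession> <new file name>
    while read -r gsm new; do
        src="$dir/$gsm.fq.gz"
        if [ -e "$src" ]; then
            mv -- "$src" "$dir/$new"
        else
            echo "missing: $src" >&2   # report and continue with the rest
        fi
    done <<'EOF'
GSM5006206 rep1-ribo-am.fq.gz
GSM5006207 rep1-ribo-ap.fq.gz
GSM5006208 rep1-rna-am.fq.gz
GSM5006209 rep1-rna-ap.fq.gz
GSM5006210 rep2-ribo-am.fq.gz
GSM5006211 rep2-ribo-ap.fq.gz
GSM5006212 rep2-rna-am.fq.gz
GSM5006213 rep2-rna-ap.fq.gz
EOF
)
```

Call `rename_geo_files` from the repository root, or pass the download folder as the first argument. Files that are not found are reported on stderr and skipped.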

After that, run the code in the order given below. **To ensure smooth running of the code, start with a clean R environment for each Rmd**. Unless you need to modify the code, the safest way to run each Rmd is to `knit` it from within RStudio. Knitting won't work for Rmds that use shell code, namely the first few that process the data and perform alignment, and any others that clone repositories. For these, we recommend copying and pasting the shell commands into the command line, as they may not execute properly from inside RStudio.

#### Sequencing data processing

1. `/code/data_processing/seq_data_processing.Rmd`  
2. `/code/data_processing/alignment.Rmd`  
3. `/code/data_processing/data_cleaning.Rmd`

#### Analysis

Only the following three scripts must be run in a particular order; after that, the order of the remaining scripts does not matter.  

1. `/code/analysis/DEseq2.Rmd`  
2. `/code/analysis/riborex.Rmd`  
3. `/code/analysis/combine_data_frames.Rmd`  
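
If you prefer the command line to RStudio, the three ordered scripts can be knitted one after another. This is a sketch under assumptions, not the authors' workflow: `knit_in_order` and `DRY_RUN` are hypothetical names, the `rmarkdown` package and pandoc must be available outside RStudio, and the Rmds are assumed to contain no shell chunks that fail when knitted.

```shell
# knit_in_order is a hypothetical helper (not part of this repo).
# Each Rscript call starts a fresh R session, in line with the advice to use
# a clean R environment for each Rmd.
# Set DRY_RUN=1 to print the commands instead of running them.
knit_in_order() (
    for rmd in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "Rscript -e rmarkdown::render('$rmd')"
        else
            # Stop at the first failure so later scripts never see stale inputs
            Rscript -e "rmarkdown::render('$rmd')" || exit 1
        fi
    done
)
```

From the repository root, this would be called as `knit_in_order code/analysis/DEseq2.Rmd code/analysis/riborex.Rmd code/analysis/combine_data_frames.Rmd`.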

#### Interpretation

The code to make the figures is in `/code/figures`. The order of these does not matter, but some may require you to run code in `analysis` before a figure can be generated. Knitting each document places a pdf, png, and rds file in the `/figures` directory.