# Snakemake workflow: rna-seq-star-deseq2
[![Snakemake](https://img.shields.io/badge/snakemake-≥5.2.1-brightgreen.svg)](https://snakemake.bitbucket.io)
[![Build Status](https://travis-ci.org/snakemake-workflows/rna-seq-star-deseq2.svg?branch=master)](https://travis-ci.org/snakemake-workflows/rna-seq-star-deseq2)
[![Snakemake-Report](https://img.shields.io/badge/snakemake-report-green.svg)](https://cdn.rawgit.com/snakemake-workflows/rna-seq-star-deseq2/master/.test/report.html)
This workflow performs a differential expression analysis with STAR and DESeq2.
## Authors of base pipeline
* Johannes Köster (@johanneskoester), https://koesterlab.github.io
* Sebastian Schmeier (@sschmeier), https://sschmeier.com
* Jose Maturana (@matrs)
## Usage
### Snakemake-bulkRNAseq-pipeline
Last updated: 08/27/2019, by Phil
The purpose of this document is to set up Snakemake and a differential expression pipeline for analyzing bulk RNA-sequencing data.
The base pipeline was written by Johannes Köster; its GitHub repository is snakemake-workflows/rna-seq-star-deseq2.
¡BEFORE STARTING!
Make sure you have installed Miniconda (Python 3.7 version) and Snakemake.
Download and install STAR version 2.5.3a.
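
Before going further, it can help to confirm that these tools are visible on your PATH. The sketch below is only a convenience check; the helper name check_tool is made up for this example and is not part of the pipeline.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Keep going after a failure so that every missing tool is listed.
for tool in conda snakemake STAR; do
    check_tool "$tool" || true
done

# Print versions when available; this guide was written for STAR 2.5.3a.
if command -v snakemake >/dev/null 2>&1; then snakemake --version; fi
if command -v STAR >/dev/null 2>&1; then STAR --version; fi
```
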
Installing the pipeline
Clone the git/bitbucket repository.
Create a branch in case you make edits.
Make sure you are in the directory with the pipeline.
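
The three steps above can be sketched as follows. The URL is an assumption taken from the badges at the top of this README (the upstream repository, whose master branch may have changed since this guide was written); substitute your own fork if you have one. The branch name is a placeholder.

```shell
# Assumption: upstream repository named in the badges; replace with your fork if needed.
REPO_URL="https://github.com/snakemake-workflows/rna-seq-star-deseq2.git"
BRANCH="my-analysis"   # hypothetical branch name for your own edits

# Clone, enter the pipeline directory, and create a working branch.
git clone "$REPO_URL" \
    && cd rna-seq-star-deseq2 \
    && git checkout -b "$BRANCH" \
    || echo "clone or branch creation failed; check the URL and your network" >&2
```
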
Place your raw FASTQ files in the data/ folder. Ex:

    rsync -av /path/to/data/SRX5725609.fastq.gz data/

Make sure you have the following components of your reference:
* Reference FASTA
* Reference GTF
Index your reference genome using the following command (this will take several hours to a day, but only needs to be done once):
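
The original command is not reproduced in this document. A typical STAR index build looks like the sketch below; every path, the thread count, and the overhang value are placeholders to adapt (the overhang is usually set to the read length minus 1).

```shell
# All values below are placeholders (assumptions); substitute your own.
GENOME_DIR="star_index"            # output directory for the index
FASTA="/path/to/reference.fa"      # reference FASTA
GTF="/path/to/reference.gtf"       # reference GTF
THREADS=8                          # threads to use
OVERHANG=99                        # typically read length minus 1

mkdir -p "$GENOME_DIR"

# Only run STAR when it is installed and both reference files exist.
if command -v STAR >/dev/null 2>&1 && [ -f "$FASTA" ] && [ -f "$GTF" ]; then
    STAR --runMode genomeGenerate \
         --runThreadN "$THREADS" \
         --genomeDir "$GENOME_DIR" \
         --genomeFastaFiles "$FASTA" \
         --sjdbGTFfile "$GTF" \
         --sjdbOverhang "$OVERHANG"
else
    echo "STAR or the reference files were not found; nothing was indexed" >&2
fi
```

The resulting directory is the STAR index location that config.yaml needs later.
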
Edit the file samples.tsv to appropriately reflect your sample names and the conditions. Ex:

    sample        condition
    SRX5725609    mock
    SRX5725612    infected
    …

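
Because samples.tsv is a tab-separated file, one way to avoid stray spaces is to write it with printf. This is a sketch using only the two example rows shown above; note that it overwrites any existing samples.tsv in the current directory.

```shell
# printf writes real tab characters, which editors sometimes replace with spaces.
printf 'sample\tcondition\n'    >  samples.tsv
printf 'SRX5725609\tmock\n'     >> samples.tsv
printf 'SRX5725612\tinfected\n' >> samples.tsv

# Sanity check: every row should have exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { print "line " NR " has " NF " fields"; bad = 1 } END { exit bad }' samples.tsv \
    && echo "samples.tsv looks well-formed"
```
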
Edit the file units.tsv to appropriately reflect your sample characteristics. This is important, for instance, to mark FASTQ files that come from the same sample but from different sequencing lanes, if you have not combined them beforehand. If you only have single-end reads, leave the fq2 column blank. Ex:

    sample        unit      fq1                         fq2
    SRX5725609    sample    data/SRX5725609.fastq.gz
    SRX5725610    rep1      data/SRX5725610.fastq.gz
    …

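
The same approach works for units.tsv. The sketch below writes the two example rows with an empty fq2 column (single-end reads) and then reports any fq1 path that does not exist yet; it overwrites any existing units.tsv in the current directory.

```shell
# Each row: sample, unit, fq1, fq2. The trailing \t leaves fq2 empty (single-end).
printf 'sample\tunit\tfq1\tfq2\n'                         >  units.tsv
printf 'SRX5725609\tsample\tdata/SRX5725609.fastq.gz\t\n' >> units.tsv
printf 'SRX5725610\trep1\tdata/SRX5725610.fastq.gz\t\n'   >> units.tsv

# Report every fq1 path (third column) that is not present on disk.
awk -F'\t' 'NR > 1 { print $3 }' units.tsv | while read -r fq1; do
    [ -f "$fq1" ] || echo "missing FASTQ: $fq1" >&2
done
```
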
Edit the configuration file, config.yaml, to reflect:
* The appropriate adapter sequence
* The PCA label
* The comparison of conditions
* Any parameters you want to include
* The location of your STAR index directory and of the GTF file
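
As a rough illustration of where these settings live, the sketch below writes an example file next to your real configuration rather than overwriting it. The key names are an assumption based on the upstream config.yaml of this workflow version, and the adapter and paths are placeholders; compare against the config.yaml in your clone before copying anything over.

```shell
# Writes config.example.yaml so that your real config.yaml stays untouched.
# Assumption: key names mirror the upstream config.yaml; verify them in your clone.
cat > config.example.yaml <<'EOF'
samples: samples.tsv
units: units.tsv

trimming:
  skip: false
  adapter: "YOUR_ADAPTER_SEQUENCE"      # placeholder

ref:
  index: "/path/to/star_index"          # STAR index directory (placeholder)
  annotation: "/path/to/reference.gtf"  # GTF file (placeholder)

pca:
  labels:
    - condition                         # samples.tsv column used to label the PCA

diffexp:
  contrasts:
    infected-vs-mock:                   # comparison of conditions
      - infected
      - mock

params:
  star: ""
  cutadapt-se: ""
  cutadapt-pe: ""
EOF
```

With the example conditions from samples.tsv, a contrast named infected-vs-mock would correspond to the {condition2-vs-condition1} part of the output file names.
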
Once this is complete, if you are running on macOS, the command “zcat” will not work. To work around this without editing the wrapper, make the following changes:

    sed -i.bak 's/\.fastq\.gz/.fastq/g' rules/align.smk
    sed -i.bak 's/\.fastq\.gz/.fastq/g' rules/trim.smk

The -i.bak option edits each file in place and keeps a backup copy with a .bak suffix. Do not redirect the output of sed onto its own input file (sed ... file > file): the shell empties the file before sed can read it.
This effectively removes the compression of all intermediate FASTQ files and bypasses the need to gunzip them.
If you are running on a Linux workstation, the above step is not necessary.
You’re almost there! Now we run Snakemake.
Test that the configuration works with a dry run:

    snakemake --use-conda -n

If this works, run the workflow with the following command ($N is the number of cores):

    snakemake --use-conda --cores $N

Once it has finished running, you can create a report with the following command:

    snakemake --report report.html

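
If you are unsure what to use for $N, one option is to ask the system for its core count. The sketch below does that and only starts the real run when the dry run succeeds; using every available core is an assumption that may not suit a shared machine.

```shell
# Number of online CPU cores; falls back to 1 if getconf cannot report it.
N=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "Using $N cores"

if command -v snakemake >/dev/null 2>&1; then
    # Only start the real run if the dry run succeeds.
    snakemake --use-conda -n && snakemake --use-conda --cores "$N"
else
    echo "snakemake not found on PATH" >&2
fi
```
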
If there is an error, its type is printed in red, and you can check the logs/ folder for the sample and step where the error was generated.
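
To narrow down which log to read, a simple text search is often enough. The helper name list_error_logs is hypothetical, and matching the word "error" is only a heuristic: some tools report failures with different wording.

```shell
# list_error_logs is a hypothetical helper: it prints every file under the given
# directory (default: logs) that contains the word "error", ignoring case.
list_error_logs() {
    grep -rli 'error' "${1:-logs}" 2>/dev/null || true
}

list_error_logs logs
```
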
The pipeline should produce two useful .html files, two .svg files, and two .pdf files:
* ./report.html (overview of basic properties of your data)
* qc/multiqc_report.html (more detailed overview of the sequencing data, including the number of intronic reads, spliced reads, etc.)
* results/pca.svg (a figure of the first two principal components of your samples based on gene expression)
* results/diffexp/{condition2-vs-condition1}.ma-plot.svg (a figure plotting the log ratio against the mean expression for the two conditions)
* results/diffexp/{condition2-vs-condition1}.diffexp.volcano.pdf
* results/diffexp/{condition2-vs-condition1}.diffexp.heatmap.pdf
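
A quick way to confirm that a run produced everything listed above is to test for each file. In this sketch the contrast name is an assumption (use the one from your config.yaml) and check_outputs is a made-up helper.

```shell
# CONTRAST is an assumption: use the contrast name defined in your config.yaml.
CONTRAST="infected-vs-mock"

# check_outputs is a hypothetical helper that prints each expected file that is absent.
check_outputs() {
    for f in \
        report.html \
        qc/multiqc_report.html \
        results/pca.svg \
        "results/diffexp/$1.ma-plot.svg" \
        "results/diffexp/$1.diffexp.volcano.pdf" \
        "results/diffexp/$1.diffexp.heatmap.pdf"
    do
        [ -e "$f" ] || echo "missing: $f"
    done
}

check_outputs "$CONTRAST"
```

No output means that all six files are present.
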