https://github.com/arjunrajlaboratory/RajLabSeqTools
Tip revision: c8b8c79b2ec9c1bd9eb7ced427bb2aec25f19506 authored by Benjamin Emert on 26 March 2020, 17:37:11 UTC
Updated reorganizeBasespaceFiles.py to better parse samples with same first index (e.g. sample 1 and sample10)
README.md
# Snakemake workflow: rna-seq-star-deseq2

[![Snakemake](https://img.shields.io/badge/snakemake-≥5.2.1-brightgreen.svg)](https://snakemake.bitbucket.io)
[![Build Status](https://travis-ci.org/snakemake-workflows/rna-seq-star-deseq2.svg?branch=master)](https://travis-ci.org/snakemake-workflows/rna-seq-star-deseq2)
[![Snakemake-Report](https://img.shields.io/badge/snakemake-report-green.svg)](https://cdn.rawgit.com/snakemake-workflows/rna-seq-star-deseq2/master/.test/report.html)

This workflow performs a differential expression analysis with STAR and DESeq2.

## Authors of base pipeline

* Johannes Köster (@johanneskoester), https://koesterlab.github.io
* Sebastian Schmeier (@sschmeier), https://sschmeier.com
* Jose Maturana (@matrs)

## Usage

### Snakemake-bulkRNAseq-pipeline

Last updated: 08/27/2019, by: Phil

The purpose of this document is to set up Snakemake and a differential expression pipeline to analyze bulk RNA-sequencing data.

The base pipeline was written by Johannes Köster; its GitHub repository is snakemake-workflows/rna-seq-star-deseq2, the project named in the badge links above.


#### ¡BEFORE STARTING!

* Make sure you have installed Miniconda (Python 3.7 version) and Snakemake.
* Download and install STAR version 2.5.3a.
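
A quick way to confirm the prerequisites are visible to your shell is to check that each tool resolves on your PATH. This is a minimal sketch: the function name `missing_tools` is made up for illustration, and it only checks that the commands exist, not that STAR is the 2.5.3a release.

```bash
# Print the name of every tool in the argument list that is NOT on PATH.
# Prints nothing when all of them are found.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}
```

Calling `missing_tools conda snakemake STAR` lists whatever still needs to be installed.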

#### Installing the pipeline

Clone the git/bitbucket repository.

Create a branch in case you make edits.

Make sure you are in the directory with the pipeline.
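
The three steps above (clone, branch, change directory) can be sketched as one small shell function. The repository URL, target directory, and branch name are all placeholders that you supply; none of them come from the original instructions, and `clone_pipeline` is a hypothetical helper name.

```bash
# Clone the pipeline, move into it, and start a personal branch for edits.
# Arguments (all placeholders): repository URL, target directory, branch name.
clone_pipeline() {
    local repo_url="$1" target_dir="$2" branch="$3"
    git clone "$repo_url" "$target_dir" || return 1
    # Work from inside the pipeline directory from here on
    cd "$target_dir" || return 1
    # Keep local edits on their own branch
    git checkout -b "$branch"
}
```

Because the function runs in the current shell, you are left inside the pipeline directory afterwards, for example after `clone_pipeline <repository-url> bulk-rnaseq my-edits`.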

Place your raw FASTQ files in the data/ folder. For example:

	rsync -av /path/to/data/SRX5725609.fastq.gz data/


Make sure you have the following components of your reference:

* Reference FASTA
* Reference GTF

Index your reference genome using the following command (this will take several hours to a day, but only needs to be done once):	
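
The indexing command itself does not appear in this copy of the README. As a stand-in, here is a sketch of a typical STAR index build wrapped in a hypothetical helper, `build_star_index`. The output directory, FASTA path, GTF path, and thread count are placeholders, and the `--sjdbOverhang` value is an assumption (100 is STAR's default; read length minus 1 is the usual recommendation), so adjust it to your data.

```bash
# Build a STAR genome index from a reference FASTA and GTF.
# Arguments (all placeholders): index directory, FASTA file, GTF file, threads.
build_star_index() {
    local index_dir="$1" fasta="$2" gtf="$3" threads="${4:-1}"
    mkdir -p "$index_dir" || return 1
    # --sjdbOverhang 100 is an assumed value, not taken from the original README
    STAR --runMode genomeGenerate \
         --runThreadN "$threads" \
         --genomeDir "$index_dir" \
         --genomeFastaFiles "$fasta" \
         --sjdbGTFfile "$gtf" \
         --sjdbOverhang 100
}
```

A call such as `build_star_index /path/to/star_index /path/to/genome.fa /path/to/genes.gtf 8` writes the index into the first directory, which is the location config.yaml later needs to point to.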

Edit the file samples.tsv to reflect your sample names and their conditions. For example (columns are tab-separated):

	sample	condition
	SRX5725609	mock
	SRX5725612	infected
	…

Edit the file units.tsv to reflect your sample characteristics. This is important, for instance, to mark FASTQ files that come from the same sample but different sequencing lanes if you don’t combine them beforehand. If you only have single-end reads, leave the fq2 column blank. For example (columns are tab-separated):

	sample	unit	fq1	fq2
	SRX5725609	sample	data/SRX5725609.fastq.gz
	SRX5725610	rep1	data/SRX5725610.fastq.gz
	…
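
A quick consistency check can catch typos between the two files before running anything. The sketch below uses a hypothetical helper, `unlisted_samples`, and assumes both files are tab-separated, have a header row, and keep the sample name in the first column.

```bash
# List sample names that appear in units.tsv but not in samples.tsv.
# Arguments: path to samples.tsv, path to units.tsv.
unlisted_samples() {
    local samples_tsv="$1" units_tsv="$2"
    awk -F '\t' '
        FNR == 1  { next }                  # skip the header row of each file
        $1 == ""  { next }                  # ignore blank lines
        NR == FNR { known[$1] = 1; next }   # first file: remember sample names
        !($1 in known) { print $1 }         # second file: report unknown names
    ' "$samples_tsv" "$units_tsv"
}
```

Running `unlisted_samples samples.tsv units.tsv` prints nothing when every unit belongs to a listed sample.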

Edit the configuration file, config.yaml, to reflect:

* The appropriate adapter sequence
* The PCA label
* The comparison of conditions
* Any parameters you want to include

Also make sure it points to the location of your STAR index directory and the GTF file.

Once this is complete, note that if you are running on macOS, the command “zcat” will not work. To work around this without editing the wrapper, make the following changes. The edits are done in place with a .bak backup, because redirecting sed's output into the same file it is reading would leave that file empty:

	sed -i.bak 's/\.fastq\.gz/.fastq/g' rules/align.smk

	sed -i.bak 's/\.fastq\.gz/.fastq/g' rules/trim.smk

This effectively removes the compression of all intermediate FASTQ files and bypasses the need to gunzip them.

If you are running on a Linux workstation, the above step is not necessary.

You’re almost there! Now we run snakemake:

Test that the configuration works with:

	snakemake --use-conda -n

If this works, run the pipeline with the following command ($N is the number of cores):

	snakemake --use-conda --cores $N

Once it’s done running you can create a report with the following command:

	snakemake --report report.html

If there is an error, its type is printed in red, and you can check the logs/ folder for the sample and step that generated it.
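
To narrow down which log to open, one option is to search the logs/ folder for files that mention an error. This is only a sketch: `logs_with_errors` is a hypothetical helper, and matching the word "error" case-insensitively is an assumption about what the tools write to their logs.

```bash
# Print the paths of log files that contain the word "error" (any case).
# Argument: log directory (defaults to logs/).
logs_with_errors() {
    local log_dir="${1:-logs}"
    # -r: recurse into subfolders, -l: print file names only, -i: ignore case
    grep -rli "error" "$log_dir" 2>/dev/null
}
```

Running `logs_with_errors` from the pipeline directory prints one path per matching log file, or nothing if no log mentions an error.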

The pipeline should produce two useful .html files, two .svg files, and two .pdf files:

* ./report.html (overview of basic properties of your data)
* qc/multiqc_report.html (more detailed overview of sequencing data, including number of intronic reads, spliced reads, etc.)
* results/pca.svg (a figure of the first two principal components of your samples based on gene expression)
* results/diffexp/{condition2-vs-condition1}.ma-plot.svg (a figure plotting the log ratio against the mean expression for the two conditions being compared)
* results/diffexp/{condition2-vs-condition1}.diffexp.volcano.pdf
* results/diffexp/{condition2-vs-condition1}.diffexp.heatmap.pdf
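
After a run, a short check can confirm that all six files listed above were written. In this sketch `missing_outputs` is a hypothetical helper, and its argument is whatever name you gave the comparison in config.yaml (the {condition2-vs-condition1} part of the file names).

```bash
# Print every expected output file that does not exist yet.
# Argument: the contrast name used in the results/diffexp/ file names.
# Run it from the pipeline directory.
missing_outputs() {
    local contrast="$1" f
    for f in \
        report.html \
        qc/multiqc_report.html \
        results/pca.svg \
        "results/diffexp/${contrast}.ma-plot.svg" \
        "results/diffexp/${contrast}.diffexp.volcano.pdf" \
        "results/diffexp/${contrast}.diffexp.heatmap.pdf"
    do
        [ -e "$f" ] || echo "$f"
    done
}
```

An empty result means everything is in place; otherwise the printed paths show which step to look for in the logs/ folder.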