Revision 640a6536eb2958c29a21ac48c6f1cd17ac842749 authored by Alexey Sergushichev on 27 November 2014, 10:09:19 UTC, committed by Alexey Sergushichev on 27 November 2014, 10:09:19 UTC
1 parent 25ee0e8
Raw File

A set of scripts for 3' RNA-seq quantification

## Install

To install run `python install`.

To install locally, without root privelegies, run `python install --user`. Don't forget to add `.local/bin` to your `PATH`. You can do it by adding line `export PATH="$HOME/.local/bin:$PATH"` to the `~/.profile` file.

## Example

We bundled a small example dataset.

cd example

There are two problems of 3' RNA-seq we are dealling with:
* Bad 3' annotaion.
* Relatively high number of multimappers near the 3' end.

### Fixing annotation

Basically we find peaks in RNA-seq and try to associate it with a close transcript.

First, we call peaks for positive and negative strands using `macs2-stranded` for
all sample combined to determine posible positions of unannotated exons.

macs2-stranded -n example ./bam/*.bam

It will produce a file `example_peaks.narrowPeak`.

Second, we fix annotation with RNA-seq peaks up to 5Kbp downstream of transcript's 3' end.
We also put newly added exons in a sepearate file `mm10.slice.extns.gtf`.

gtf-extend -g mm10.slice.gtf -p example_peaks.narrowPeak -o mm10.slice.extended.gtf --extns-out mm10.slice.extns.gtf

These way we added 4 likely exons that were not annotated.

### Fixing multimappers

Main idea here is that we expect RNA-seq reads to be aligned to the genes.
So, we discard alignments of multimappers to the unannotated regions and 
fix multimapping annotation (`NH` SAM field) for such reads. Some of them
can be uniquely mapped to one gene.

mkdir bam_fixed
for f in bam/*.bam
    echo "$f"
    fix-mm -g mm10.slice.gtf "$f" -o "${f/bam/bam_fixed}"

## Dependencies

`macs2-stranded` depends on `samtools` and `macs2` executables.  `macs2` version should be >= 2.0.10

`` and `` needs `HTSeq` python package installed.

back to top