Tip revision: 19621ae94de30c90c07fa8a980f01fd523c65f68 authored by Brian Haas on 05 October 2015, 19:50:57 UTC
# STAR-Fusion 

STAR-Fusion further processes the output of the STAR aligner, mapping junction reads and spanning reads to a reference annotation set. The annotation is supplied as a GTF file, ideally the same annotation file used to build the STAR genome index during the initial STAR setup.


STAR should be run with options that are well suited to fusion read detection. The following settings are similar to those used in the landmark publication "The landscape of kinase fusions in cancer" by Stransky et al., Nat Commun 2014 (PMID: 25204415):

```
   STAR --genomeDir Hg19.fa_star_index \
        --readFilesIn left.fq right.fq \
        --outSAMstrandField intronMotif \
        --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
        --outReadsUnmapped None \
        --chimSegmentMin 12 \
        --chimJunctionOverhangMin 12 \
        --alignSJDBoverhangMin 10 \
        --alignMatesGapMax 200000 \
        --alignIntronMax 200000 \
        --outSAMtype BAM SortedByCoordinate
```

Running STAR produces two primary output files that contain the junction and spanning read information (see the STAR documentation for precise details):

      Chimeric.out.junction : contains junction reads.
      Chimeric.out.sam      : contains alignments for fusion-spanning reads.


## Installation Requirements 

### Software prerequisites:

  In addition to having the STAR aligner installed, you'll need NCBI BLAST+: 
  http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download

  STAR-Fusion requires the following non-standard Perl modules from CPAN: Set/IntervalTree.pm and DB_File.pm
  
  A typical Perl module installation may involve:

      perl -MCPAN -e shell
        install Set::IntervalTree
        install DB_File

  The Set::IntervalTree module tends to install trouble-free on Linux. If you have trouble installing it on Mac OS X, try the following: download the tarball from http://search.cpan.org/~benbooth/Set-IntervalTree-0.02/lib/Set/IntervalTree.pm, run 'perl Makefile.PL', then edit the generated 'Makefile' and remove all occurrences of '-arch i386'. Then run 'make', 'make test', and finally 'make install'.
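
  Before going further, it can help to confirm that the prerequisites listed above are actually visible from your shell. The following is a minimal sketch, not part of STAR-Fusion: the helper names are hypothetical, and it assumes the executables are on your PATH under the names `STAR`, `blastn`, and `perl`.

```python
import shutil
import subprocess


def find_missing_tools(tools=("STAR", "blastn", "perl")):
    """Return the executable names from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def perl_module_available(module):
    """Return True if `perl -M<module> -e 1` exits with status 0."""
    if shutil.which("perl") is None:
        return False
    result = subprocess.run(
        ["perl", f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("missing executables:", find_missing_tools())
    for module in ("Set::IntervalTree", "DB_File"):
        status = "ok" if perl_module_available(module) else "MISSING"
        print(module, status)
```

  A module reported as MISSING only means that this particular `perl` could not load it; a different Perl installation on the same machine may behave differently.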

### Building the reference sequence index

Included with STAR-Fusion are the Gencode Hg19 sequence annotations and specially formatted reference transcript sequences. The transcript sequences need to be indexed, which can be done by simply running

    make

in the STAR-Fusion base installation directory.

If you're interested in running STAR-Fusion with other reference genomes and reference annotations, see the instructions below on how to integrate alternative genome resources.


## Running STAR-Fusion 

Run STAR-Fusion on the two STAR output files described above. If you choose to use an annotation set other than the one included and used by default (gencode.v19, in the resources/ folder), specify it with -G ref_annot.gtf.

      STAR-Fusion -S Chimeric.out.sam -J Chimeric.out.junction


The output from STAR-Fusion is found as a tab-delimited file named 'star-fusion.fusion_candidates.final', and has the following format:

```
 #fusion_name    JunctionReads   SpanningFrags   LeftGene        LeftBreakpoint  LeftDistFromRefExonSplice       RightGene       RightBreakpoint RightDistFromRefExonSplice
 FIP1L1--PDGFRA  98      13      FIP1L1^ENSG00000145216.11       chr4:54292132:+ 0       PDGFRA^ENSG00000134853.7        chr4:55141092:+ 84
 BRD4--NUTM1     7       2       BRD4^ENSG00000141867.13 chr19:15364963:-        0       NUTM1^ENSG00000184507.11        chr15:34640170:+        0
 EWSR1--FLI1     5       2       EWSR1^ENSG00000182944.13        chr22:29683123:+        0       FLI1^ENSG00000151702.12 chr11:128677075:+       0
 GOPC--ROS1      82      36      GOPC^ENSG00000047932.9  chr6:117888017:-        0       ROS1^ENSG00000047936.6  chr6:117642557:-        0
 ETV6--NTRK3     8       3       ETV6^ENSG00000139083.6  chr12:12022903:+        0       NTRK3^ENSG00000140538.12        chr15:88483984:-        0
 FGFR3--TACC3    221     372     FGFR3^ENSG00000068078.13        chr4:1808661:+  0       TACC3^ENSG00000013810.14        chr4:1729704:+  269
 EWSR1--ATF1     8       3       EWSR1^ENSG00000182944.13        chr22:29683123:+        0       ATF1^ENSG00000123268.4  chr12:51208063:+        0
 HOOK3--RET      9       2       HOOK3^ENSG00000168172.4 chr8:42823357:+ 0       RET^ENSG00000165731.13  chr10:43612032:+        0
 CD74--ROS1      5       0       CD74^ENSG00000019582.10 chr5:149784243:-        0       ROS1^ENSG00000047936.6  chr6:117645578:-        0
 TMPRSS2--ETV1   10      3       TMPRSS2^ENSG00000184012.7       chr21:42866302:-        19      ETV1^ENSG00000006468.9  chr7:13975463:- 58
 AKAP9--BRAF     4       4       AKAP9^ENSG00000127914.12        chr7:91632549:+ 0       BRAF^ENSG00000157764.8  chr7:140487384:-        0
 ...
```
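
For downstream analysis, the tab-delimited format shown above is straightforward to load. The sketch below is one possible reader, not something shipped with STAR-Fusion: the function names are hypothetical, and it assumes the header line begins with '#fusion_name' and that every row carries the nine columns shown above. The two example rows are copied from the listing above.

```python
import csv
import io


def split_gene(field):
    """Split 'SYMBOL^GENE_ID' into (symbol, gene_id)."""
    symbol, _, gene_id = field.partition("^")
    return symbol, gene_id


def split_breakpoint(field):
    """Split 'chrom:position:strand' into (chrom, int position, strand)."""
    chrom, pos, strand = field.split(":")
    return chrom, int(pos), strand


def read_fusions(handle):
    """Yield one dict per fusion row, keyed by the header column names."""
    header = None
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        row = [cell.strip() for cell in row]
        if not row or not row[0]:
            continue  # skip blank lines
        if header is None:
            # The header line starts with '#fusion_name'; drop the '#'.
            header = [row[0].lstrip("#")] + row[1:]
            continue
        rec = dict(zip(header, row))
        rec["JunctionReads"] = int(rec["JunctionReads"])
        rec["SpanningFrags"] = int(rec["SpanningFrags"])
        yield rec


# Header plus two rows taken from the example output, joined with tabs.
EXAMPLE = "\n".join("\t".join(cols) for cols in [
    ["#fusion_name", "JunctionReads", "SpanningFrags", "LeftGene",
     "LeftBreakpoint", "LeftDistFromRefExonSplice", "RightGene",
     "RightBreakpoint", "RightDistFromRefExonSplice"],
    ["FIP1L1--PDGFRA", "98", "13", "FIP1L1^ENSG00000145216.11",
     "chr4:54292132:+", "0", "PDGFRA^ENSG00000134853.7",
     "chr4:55141092:+", "84"],
    ["CD74--ROS1", "5", "0", "CD74^ENSG00000019582.10",
     "chr5:149784243:-", "0", "ROS1^ENSG00000047936.6",
     "chr6:117645578:-", "0"],
])

fusions = list(read_fusions(io.StringIO(EXAMPLE)))
for rec in fusions:
    print(rec["fusion_name"], rec["JunctionReads"], rec["SpanningFrags"])
```

To read a real results file, pass an open file handle instead of the `io.StringIO` wrapper.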

Note that these fusion candidates are derived by mapping the STAR outputs to the reference annotations. Paralogous genes are notorious for showing up as false-positive fusion candidates. Initial (preliminary) predictions are written to the file 'star-fusion.fusion_predictions.preliminary'. These are filtered using BLASTN; preliminary predictions that are excluded are prefixed with '#' in the file 'star-fusion.fusion_predictions.preliminary.filt', with the BLAST results included in additional column fields for those entries. Predictions that are not flagged as likely artifacts are reported in the final report file 'star-fusion.fusion_predictions.final'. To turn off filtering (the BLAST step), simply run STAR-Fusion with the '--no_filter' parameter. See the usage information (--help) for additional options.


## Parameterization 

STAR-Fusion will report all candidates having at least 1 junction read where the breakpoints match up precisely with reference exon junctions of two different genes.

For those breakpoints that do not precisely match at reference exon junctions, the breakpoint fusion read support must be at least --min_novel_junction_support (default 10 reads).

In the case where multiple candidate fusion breakpoints are reported, only those breakpoints having at least --min_alt_pct_junction (default 10%) of the dominant isoform junction support will be reported.

Finally, it is worth noting that the counts of spanning fragments are entirely non-overlapping with the counts of the breakpoint junction reads. That is, no spanning fragment (from Chimeric.out.sam) is counted if it contains a read that is reported as evidence in the breakpoint junction candidate data (from Chimeric.out.junction).
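
The junction support rules described in this section can be sketched as follows. This is an illustration of the stated thresholds only, not STAR-Fusion's actual implementation: the function and field names are hypothetical, a breakpoint is assumed to "match reference exon junctions" when both distance-from-splice values are 0, and the order in which the two rules are applied is an assumption. The candidate read counts are made up for the example.

```python
from collections import defaultdict


def passes_support(junction_reads, left_dist, right_dist,
                   min_novel_junction_support=10):
    """Apply the per-breakpoint junction read threshold."""
    at_reference_splice = left_dist == 0 and right_dist == 0
    if at_reference_splice:
        return junction_reads >= 1
    return junction_reads >= min_novel_junction_support


def filter_candidates(candidates, min_novel_junction_support=10,
                      min_alt_pct_junction=10.0):
    """Keep breakpoints that pass both support rules."""
    by_fusion = defaultdict(list)
    for cand in candidates:
        if passes_support(cand["junction_reads"], cand["left_dist"],
                          cand["right_dist"], min_novel_junction_support):
            by_fusion[cand["fusion_name"]].append(cand)

    kept = []
    for group in by_fusion.values():
        # Alternative breakpoints must reach a percentage of the dominant one.
        dominant = max(c["junction_reads"] for c in group)
        threshold = dominant * min_alt_pct_junction / 100.0
        kept.extend(c for c in group if c["junction_reads"] >= threshold)
    return kept


# Made-up candidates for two hypothetical gene pairs.
candidates = [
    {"fusion_name": "A--B", "junction_reads": 100, "left_dist": 0, "right_dist": 0},
    {"fusion_name": "A--B", "junction_reads": 5, "left_dist": 0, "right_dist": 0},
    {"fusion_name": "A--B", "junction_reads": 12, "left_dist": 0, "right_dist": 7},
    {"fusion_name": "C--D", "junction_reads": 4, "left_dist": 3, "right_dist": 0},
    {"fusion_name": "C--D", "junction_reads": 1, "left_dist": 0, "right_dist": 0},
]

kept = filter_candidates(candidates)
for cand in kept:
    print(cand["fusion_name"], cand["junction_reads"])
```

In this example the A--B breakpoint with 5 reads is dropped because it falls below 10% of the dominant breakpoint (100 reads), and the C--D breakpoint with 4 reads is dropped because it does not sit on reference exon junctions and has fewer than 10 reads.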



## Example data and execution:

In the included test/ directory, you'll find a 'runMe.sh' script along with a data/ subdirectory.  The data/ subdirectory contains example fusion and spanning data generated from running STAR, in addition to a reference annotation file for gencode v19. Note, the reference GTF file contains only the 'exon' records instead of all lines from the original gencode annotation file; this speeds up parsing of the file and keeps the file size relatively small for including in this package.
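
If you want to reduce your own annotation file in the same way, the idea is simply to keep the lines whose feature type (the third GTF column) is 'exon'. The sketch below is an assumption about how such a reduction could be done, not the script used to build the packaged file; the function name is hypothetical, comment lines are dropped, and the example lines are made up.

```python
def keep_exon_records(lines):
    """Yield only the GTF lines whose feature column (column 3) is 'exon'."""
    for line in lines:
        if line.startswith("#"):
            continue  # drop header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "exon":
            yield line


# Made-up GTF lines for illustration only.
example_lines = [
    "##description: made-up example\n",
    'chr1\tEXAMPLE\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tEXAMPLE\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tEXAMPLE\tCDS\t150\t300\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
]

exon_lines = list(keep_exon_records(example_lines))
print(len(exon_lines), "of", len(example_lines), "lines kept")
```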

In this test/ directory, run the sample execution like so:

    ./runMe.sh

which simply runs:

    ../STAR-Fusion -S Chimeric.out.sam.gz -J Chimeric.out.junction.gz 

and you'll find the output file 'star-fusion.fusion_candidates.txt' containing the fusion candidates in the format described above.


## Integrating alternative genome resources

STAR-Fusion comes with reference annotations and sequences based on the human reference genome Hg19 and gencode annotations.  

If you wish to use a different genome and set of reference annotations, you can install them as follows.

You'll need a reference genome (e.g. my_genome.fasta) and reference transcript structure annotations (e.g. my_annotations.gtf). The GTF file should include 'exon' features with the attributes 'gene_id' and 'transcript_id'; a 'gene_name' attribute is optional but recommended, so that gene symbols are used preferentially.
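
A quick way to check that an annotation file meets these requirements is to scan its 'exon' lines for the expected attributes. The following is a minimal sketch under the assumption that attributes follow the usual GTF `key "value";` form; the function names are hypothetical and the example lines are made up.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(attr_field):
    """Parse the ninth GTF column into a dict of quoted attributes."""
    return dict(ATTR_RE.findall(attr_field))


def check_exon_line(line):
    """Return a list of problems found on one GTF line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-separated columns"]
    if fields[2] != "exon":
        return []  # only exon features are checked
    attrs = parse_attributes(fields[8])
    problems = [
        f"missing required attribute {key}"
        for key in ("gene_id", "transcript_id")
        if key not in attrs
    ]
    if "gene_name" not in attrs:
        problems.append("missing recommended attribute gene_name")
    return problems


# Made-up lines for illustration only.
good = 'chr1\tEXAMPLE\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENE1";\n'
bad = 'chr1\tEXAMPLE\texon\t100\t300\t.\t+\t.\tgene_id "G1";\n'

print(check_exon_line(good))
print(check_exon_line(bad))
```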

Generate a specially formatted reference cDNA fasta file like so:

    util/gtf_file_to_cDNA_seqs.pl my_annotations.gtf my_genome.fasta > my_cdna.fasta

and then build an index for the my_cdna.fasta file like so:

    util/index_cdna_seqs.pl my_cdna.fasta


When running STAR-Fusion, specify '--ref_GTF my_annotations.gtf' and '--ref_cdna my_cdna.fasta' to make use of these alternative targets.
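
If you drive STAR-Fusion from a script, the options mentioned in this README can be assembled into an argument list as sketched below. The function name and the assumption that the executable is reachable as `STAR-Fusion` are illustrative; only flags that appear in this README are used, and the command is built but not executed.

```python
def star_fusion_command(chimeric_sam, chimeric_junction, ref_gtf=None,
                        ref_cdna=None, no_filter=False,
                        executable="STAR-Fusion"):
    """Build the STAR-Fusion argument list from the documented options."""
    cmd = [executable, "-S", chimeric_sam, "-J", chimeric_junction]
    if ref_gtf:
        cmd += ["--ref_GTF", ref_gtf]
    if ref_cdna:
        cmd += ["--ref_cdna", ref_cdna]
    if no_filter:
        cmd.append("--no_filter")
    return cmd


cmd = star_fusion_command(
    "Chimeric.out.sam",
    "Chimeric.out.junction",
    ref_gtf="my_annotations.gtf",
    ref_cdna="my_cdna.fasta",
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.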



## Acknowledgements

This effort was largely inspired by earlier work done by Nicolas Stransky and discussions with Daniel Nicorici.

STAR-Fusion is contributed by Brian Haas, Broad Institute, 2015

https://github.com/STAR-Fusion/STAR-Fusion