https://github.com/stephenfloor/tripseq-analysis
Tip revision: 3e823abcca5b8c1e5e89dd9bd4c49e8673b3e957 authored by Stephen Floor on 24 June 2017, 00:52:49 UTC
email update
email update
Tip revision: 3e823ab
README.md
# tripseq-analysis
Analysis tools for TrIP-seq data. There are various individual tools inside to perform different tasks. Limited documentation below, contact stephen.floor@ucsf.edu with questions.
## transcriptome_properties.py
Calculate the properties of an input transcriptome (or regions thereof). Input format is BED, output files are .csv files with various properties as specified on the command line.
#### usage
```
usage: transcriptome_properties.py [-h] -i INPUT -g GENOME [--gc] [--length]
[--exonct] [--nt NT] [-o OUTPUT]
[--window WINDOW]
[--convtorefseq CONVTOREFSEQ]
[--targetscanfile TARGETSCANFILE]
[--deltag] [--lfold] [--cap-structure]
[--kozak] [--uorf-count] [--uorf-overlap]
[--start-codon] [--rare-codons]
[--mirna-sites] [--au-elements]
optional arguments:
-h, --help show this help message and exit
Global arguments:
-i INPUT, --input INPUT
The transcriptome region (BED format)
-g GENOME, --genome GENOME
The genome for the input transcriptome
--gc Calculate GC content
--length Calculate length
--exonct Count # of exons
--nt NT Number of threads (default is 8 or 4 for lfold)
-o OUTPUT, --output OUTPUT
Output basename (e.g. CDS)
--window WINDOW Window size for sliding window calculations (default
75)
--convtorefseq CONVTOREFSEQ
Filename to convert input annotations to refseq (for
targetscan; e.g. knownToRefSeq.txt)
--targetscanfile TARGETSCANFILE
Filename of targetscan scores (e.g.
Summary_Counts.txt)
--deltag Calculate min deltaG in sliding window of size
--window over region
--lfold Use RNALfold to calculate MFE rather than RNAfold
(faster but does not compute centroid,MEA)
5' UTR specific arguments:
--cap-structure Calculate structure at the 5' end
Start-codon-specific arguments:
--kozak Calculate Kozak context score
--uorf-count Calculate number of 5' UTR uORFs (starting with
[ACT]TG)
--uorf-overlap Overlap of uORF with start codon (implies --uorf-
count)
--start-codon Record the start codon used (ATG or other)
CDS-specific arguments:
--rare-codons Calculate codon usage properties
3' UTR specific arguments:
--mirna-sites Compile miRNA binding site info from targetscan
--au-elements Count number of AU-rich elements in the 3' UTR
```
####Requirements:
* ViennaRNA RNAfold and RNALfold (http://www.tbi.univie.ac.at/RNA)
* HumanCodonTable (this page)
* AnnotationConverter (this page)
* TargetscanScores (this page)
* SNFUtils (this page)
## compare_tripseq_clusters.py
Take two lists of TrIP-seq data (i.e. clusters) and compare them for genes that have the transcripts in each of the two different sets. For each set of gene-linked transcript isoforms, compare input transcriptome features as calculated using transcriptome_properties.py
#### Usage
```
usage: compare_tripseq_clusters.py [-h] --set1 FNAME ID ... [FNAME ID ... ...]
--set2 FNAME ID ... [FNAME ID ... ...]
--tx-to-gene TX_TO_GENE [-o OUTPUT] -n NREP
[--txome-props TXOME_PROPS [TXOME_PROPS ...]]
[--control] --txome-gtf TXOME_GTF
optional arguments:
-h, --help show this help message and exit
--set1 FNAME ID ... [FNAME ID ... ...]
Files and IDs containing transcript distributions;
compare between set1 and set2
--set2 FNAME ID ... [FNAME ID ... ...]
Files and IDs containing transcript distributions;
compare between set1 and set2
--tx-to-gene TX_TO_GENE
Mapping between transcript ID in input file and gene
ID
-o OUTPUT, --output OUTPUT
Output filename (default is stdout)
-n NREP, --nrep NREP Number of replicates of each point
--txome-props TXOME_PROPS [TXOME_PROPS ...]
List of files with transcriptome properties to
correlate among (wildcards ok)
--control Perform randomized comparisons of input transcripts as
a control.
--txome-gtf TXOME_GTF
Path to transcriptome GTF
```
#### Requirements
* GTF.py (this page)
* Transcript.py (this page)
* SNFUtils.py (this page)
* Two lists of transcripts to compare (i.e. clusters)
* Lists of transcriptome properties to compare between transcript isoforms of the same gene in the two sets (generated by transcriptome_properties.py)
* File containing transcript ID to gene mapping
## plot_tripseq_transcript.py
Plot an individual transcript or all transcripts of a gene. Requires input polysome sequencing data (i.e. TrIPseq) or some other distribution.
Input "tx-to-gene" file should be a file containing four columns: txid, geneid, gene_name, tx_name. This can be downloaded from Ensembl Biomart or other sources.
#### Usage
```
usage: plot_tripseq_transcript.py [-h] -i INPUT [-o OUTPUT] -n NREP --id ID
--tx-to-gene TX_TO_GENE [--text]
[--format FORMAT]
Plot input transcript ID from input distribution file
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
File containing transcript distributions
-o OUTPUT, --output OUTPUT
Output filename (default is stdout)
-n NREP, --nrep NREP Number of replicates of each point
--id ID Transcript ID(s) to print (can be partial; can be
comma-separated list)
--tx-to-gene TX_TO_GENE
File containing transcript ID to gene name mapping
--text Output text data in addition to plots.
--format FORMAT Image format to export (png or pdf).
```
#### Requirements
* SNFUtils.py (this page)
* Input per-transcript distributions
* File containing transcript ID to gene mapping (if per-gene plotting is desired)
## fpkm_to_tpm.py
Converts between FPKM and TPM (transcripts per million). Uses the formula TPM_i = FPKM_i * 1e6 / sum(FPKM_g for all genes g)
Citation: http://lynchlab.uchicago.edu/publications/Wagner,%20Kin,%20and%20Lynch%20%282012%29.pdf
#### Usage
```
usage: fpkm_to_tpm.py [-h] -i INPUT [-t SEPARATOR] [-o [OUTPUT]]
[--ignore IGNORE] [--filter FILTER] [-u]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
File containing identifiers to use for the merge
-t SEPARATOR, --separator SEPARATOR
Field separator (default comma; "tab" for tabs;
"space" for whitespace
-o [OUTPUT], --output [OUTPUT]
File to output to (default stdout)
--ignore IGNORE Number of columns to ignore (one-based; 1 ignores the
first column)
--filter FILTER Filter genes with TPM below arg
-u, --unique Only output lines with unique entries in column 1
```
#### Requirements
* A file with FPKM values to convert to TPM
## Utility classes
#### AnnotationConverter.py
A class to provide for conversion between two annotation sets.
#### GTF.py
A class to read GTF files - downloaded from https://gist.github.com/slowkow/8101481 and minimally modified
#### SNFUtils.py
A file providing various utility functions.
#### HumanCodonTable.py
A class harboring information on human codon usage.
#### TargetscanScores.py
A class to read in targetscan scores and provide accessor functions.
#### Transcript.py
A class defining a transcript and structural features associated with it.