https://github.com/gFedonin/EpiStat
Raw File
Tip revision: 6c528ec4991854467e98684aa471a9b8f095b875 authored by gennady.fedonin on 08 September 2022, 08:49:57 UTC
changed names in examples
Tip revision: 6c528ec
README
TERMS
    Event - the change of character states on a branch of phylogenetic tree,
    e.g. mutation in a genome site or the change of phenotypic states

    Site - the integer identifier for the set of events. It corresponds to a
    genome or a protein site where mutations occurred or to the set of
    specific phenotypic states, e.g. possible phenotypic reactions on a drug
    treatment.

DESCRIPTION
    This is a stub module that contains command-line utilities for searching
    for associations between events on a phylogenetic tree. Events may be
    mutations in genome sites for which pairwise epistatic statistics would
    be calculated or mutations and changes of phenotypic states for which
    associations of mutations and phenotype changes are calculated.

    The method is based on the following publications:

        1. Kryazhimsky S., et. al. Prevalence of Epistasis in the Evolution of Influenza A Surface Proteins. 2011 PLoS Genetics 7(2):e1001301
        2. Neverov A.D., et al. Coordinated Evolution of Influenza A Surface Proteins. 2015 PLoS Genetics 11(8):e1005404
        3. Neverov A.D., et al. Episodic evolution of coadapted sets of amino acid sites in mitochondrial proteins. 2021 PLoS Genetics 17(1):e1008711.

    Pipeline scripts:

        estimate_tau.pl - estimates average waiting time for an event along a phylogenetic lineage. This parameter is used for parametrization of epistatic statistics.

        run_epistat.pl - starts the search for associations. The script requires the phylogenetic tree in the branch list format (XPARR) where for each branch the sets of events in the background and in the foreground are provided.
            All events within common sites are aggregated, e.g. mutations in the common genome site or phenotype changes for the common phenotype ID.
            The main result of processing the data is a list of all site pairs with corresponding p-values calculated according to the null-model which assumes that events in each site are inpependent of other sites. 

        estimate_fdr.pl - estimates FDR for p-values for pairs of sites.

        mk_coevolmtx.pl - calculates pairwise pseudocorrelation statistics, e.i. normalized statistics that are analogs of covariances of random descrete variables on the phylogeneetic tree.

        minvert/cor2pcor.R - calculates paiwise partial correlations for the matrix of pseudocorrelation statistics using the cor2pcor package for R

        minvert/psicov.R - inverts pseudocorrelation matrix using Lasso.

    Scrips that typically are not used directly but required for the
    analysis:

        epistat.pl - the main script in the pipeline, it calculates paiwise association statictics for pairs of sites.
            This script is typically called from run_epistat.pl, however, it could be called separately for drawing events on the tree.
        shuffle_incidence_matrix.R - randomly shuffles events on the branches of the tree preserving the number of events in each site and on each branch.
        sample2xparr.pl - converts output of shuffle_incidence_matrix.R into trees in XPARR format.
        fake_sample.pl - calculates p-value statistics for the tree with randomly shuffled events on branches. Used for the estimating of the false discovery rates (FDR).

    Scripts command line templates and usage parameters are provided in the
    header of each script file as comments. The documentation in the POD
    format for scripts is in the preparation.

    Input data format: The input data is a phylogenetic tree with a list of
    substitutions on branches in XPARR format:

        Header
        RootID
        BranchLine+
    
        Header="child\tparent\tlength"
    
        RootID="\w+" - any string containing letters or digits
        BranchLine = "NodeID\tParentNodeID\tBranchLength\tSynSiteList\tNonSynMutList\tNonSynMutList"
        site=\d+ - site in the reference sequence, positive number
        SynSiteList = "site(;site)*" - a list of sites, in which synonymous substitutions has occured on this edge
        NonSynMutList = "mutation(;mutation)*" - alist of amino acid substitutions on this edge
        mutation = "AsiteB"
        A and B is one letter amino acid code

    You can find an example of XPARR tree in epistat_examples, e.g.
    epistat_examples/intragene/alleles/sars2_s/S.xparr

    XPARR formatted trees can be generated from the output of pylognetic
    reconstruction program which can reconstruct both the tree and the
    ancestor states. The mk_xparr.pl script can be used to generate XPARR
    from tree and alignment:

    mk_xparr.pl -t <tree_NWK> [options] <alignment_FASTA>

    Here <tree_NWK> is a newick formated tree, where the bootstrap-slot is
    filled with internal nodes' identifiers, and <alignment_FASTA> is a
    FASTA alignment of both terminal and ancestor sequences.

AUTHOR
    Alexey Dmitrievich Neverov neva_2000 (at) mail.ru

LICENCE
    GPL v. 3.0

INSTALLATION
    Install perl 5 (you can use perlbrew if you like)

    Install R

    Install python 2 (typically system python for linux, you can also use
    pyenv with miniconda)

    Install GNU Parallel

    Install R packages:

        BiRewire package for R:

        To install this package, start R console (version "4.0") and enter:

        if (!requireNamespace("BiocManager", quietly = TRUE))
            install.packages("BiocManager")
        BiocManager::install("BiRewire")

        For older versions of R, please refer to the appropriate Bioconductor release.

        Enter at the R console: install.packages("corpcor").

        Enter at the R console: install.packages("glasso").

        Enter at the R console: install.packages("getopt").

        If R packages have been installed in the local directory add the path to the directory to the R_LIBS environment variable or add the following to your .bash_profile:
    
            if [ -n $R_LIBS ]; then
                export R_LIBS=/usr/me/local/R/library:$R_LIBS
            else
                export R_LIBS=/usr/me/local/R/library
            fi

    Install Epistat:

        Clone the repository or download and unzip the release archive.
        NOTE: The Build.PL uses Pod::Select, part of the Pod-Parser distribution which will be removed from the Perl core since version 5.32.
        In the case of occurring of the following compilation error: "Can't locate Pod/Select.pm in @INC" try
            cpan install Pod::Select
        type:
            cd /path/to/unpacked/epistat/distributive
            perl Build.PL
            ./Build installdeps
            ./Build build
            ./Build test
            ./Build install --install_base=/where/to/installed/epistat

        Create two system variables: 
            EPISTAT_HOME pointing to '/where/to/installed/epistat/bin' directory and
            EPISTAT_LIB pointing to the '/where/to/installed/epistat/lib/perl5'.
            On linux you can modify your .bash_profile, for example:

                export EPISTAT_HOME=/where/to/installed/epistat/bin
                export EPISTAT_LIB=/where/to/installed/epistat/lib/perl5

QUICK_START
    These instructions are for linux.

        chmod -R 775 /where/to/installed/epistat/epistat_examples
        cd /where/to/installed/epistat/epistat_examples
     #Chouse a subfoulder with an example, e.g. 
        cd intragene/alleles/sars2_s

     Run these commands from the shell:

     #This script estimates 'tau' parameter and prints it to S.tau. Copy the value of "TAU=" from S.tau to the field "TAU=" in S.epistat.prm.
            $EPISTAT_HOME/estimate_tau.pl -b 0.95 S.xparr > S.tau
     #It launches the analysis
            $EPISTAT_HOME/run_epistat.pl -x S.xparr -m 2 -p 50000 S.epistat.prm run_epistat.prm
     #It estimates FDRs for upper p-values
            $EPISTAT_HOME/estimate_fdr.pl S.upper.pvalue.unord_pairs.sites.fdr.prm > S.upper.pvalue.unord_pairs.sites.fdr
     #It estimates FDRs for lower p-values
            $EPISTAT_HOME/estimate_fdr.pl S.lower.pvalue.unord_pairs.sites.fdr.prm > S.lower.pvalue.unord_pairs.sites.fdr
     #It calculates the pseudocorrelations for site pairs. Negative values are ignored (NonNegativeElementFilter="1" in the S.mk_coevolmtx.prm).
            $EPISTAT_HOME/mk_coevolmtx.pl –m Z-score S.mk_coevolmtx.prm
     #It calculates the pseudocorrelations for site pairs. Does not ignore the negative values (NonNegativeElementFilter="0" in the S.all.mk_coevolmtx.prm).
            $EPISTAT_HOME/mk_coevolmtx.pl –m Z-score S.all.mk_coevolmtx.prm
     #it calculates the partial correlations for the matrix of positive pseudocorrelation statistics.
            $EPISTAT_HOME/minvert/cor2pcor.R -f S.block.mtx -l 0.9 -n 0 -s .cor2pcor.n0.l90.R.out
     #It builds the table of concordantly evolving pairs of sites having upper p-values<4.4e-4
     #To start, you need to modify the field "Path=" in the file S.up00044.FDR10.mk_summary.cfg setting up to the actual path to the example folder, e.g. 
     # Path="/where/to/installed/epistat/epistat_examples/intragene/alleles/sars2_s"
            $EPISTAT_HOME/mk_summary.pl -u -a S.up00044.FDR10.mk_summary.cfg S.cor2pcor.n0.l90.R.out>S.cor2pcor.n0.l90.up00044.FDR10.tab
     #It builds the table of discordantly evolving pairs of sites having lower p-values<5.2e-4
     #To start, you need to modify the field "Path=" in the file S.lp0052.FDR10.mk_summary.cfg setting up to the actual path to the example folder, e.g. 
     # Path="/where/to/installed/epistat/epistat_examples/intragene/alleles/sars2_s"
            $EPISTAT_HOME/mk_summary.pl -u -a S.lp0052.FDR10.mk_summary.cfg -s 1 S.all.mtx>S.zscores.lp0052.FDR10.tab

back to top