
Cookbook for running BUCKy in parallel in a Jupyter notebook
------------------------------------------------------------

This notebook uses the *Pedicularis* example data set from the first
empirical ipyrad tutorial. Here I show how to run BUCKy on a large set
of loci parsed from the output file with the ``.loci`` ending. All code
in this notebook is Python. You can simply follow along and execute this
same code in a Jupyter notebook of your own.

Software requirements for this notebook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All required software can be installed through conda by running the
commented-out code below in a terminal.

.. code:: ipython2

    ## conda install -c BioBuilds mrbayes
    ## conda install -c ipyrad ipyrad
    ## conda install -c ipyrad bucky

.. code:: ipython2

    ## import Python libraries
    import ipyparallel as ipp
    import subprocess as sps
    import ipyrad as ip
    import glob
    import os
    import ipyrad.file_conversion as ifc

Cluster setup
~~~~~~~~~~~~~

To execute code in parallel we will use the ``ipyparallel`` Python
library. A quick guide to starting a parallel cluster locally can be
found `here <link>`__, and instructions for setting up a remote cluster
on an HPC system are available
`here <http://ipyrad.readthedocs.io/HPC_Tunnel.html>`__. In either case,
this notebook assumes you have started an ``ipcluster`` instance that
this notebook can find, which the cell below will test.
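If an ``ipcluster`` instance is not yet running, one can usually be
started from a separate terminal with the commented-out command below
(a sketch; adjust ``--n`` to the number of cores you want to use):

.. code:: ipython2

    ## ipcluster start --n=4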

.. code:: ipython2

    ## look for running ipcluster instance, and create load-balancer
    ipyclient = ipp.Client()
    lbview = ipyclient.load_balanced_view()
    print "{} engines found".format(len(ipyclient))


.. parsed-literal::

    4 engines found


.. code:: ipython2

    %%px
    ## push imports to parallel engines
    import subprocess as sps
    import glob
    import os

Input/output organization
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython2

    ## enter the data file for your analysis here
    LOCIFILE = "/home/deren/Documents/ipyrad/tests/branch-test/base_outfiles/base.loci"
    WORKDIR = "analysis-bucky"

.. code:: ipython2

    ## This ensures the file paths are Full Paths (not relative) 
    LOCIFILE = os.path.realpath(LOCIFILE)
    WORKDIR = os.path.realpath(WORKDIR)
    print "infile is:", LOCIFILE
    print "outdir is:", WORKDIR


.. parsed-literal::

    infile is: /home/deren/Documents/ipyrad/tests/branch-test/base_outfiles/base.loci
    outdir is: /home/deren/Documents/ipyrad/tests/analysis-bucky


Set up some tests
~~~~~~~~~~~~~~~~~

List the names of the samples you wish to include in your analysis.
BUCKy begins to perform less well when the number of tips grows beyond
~10, so you may want to focus your analysis on subsampled sets of taxa.
Here we select just 9 of the 13 samples in the data set, with just one
representative of each species or subspecies.

.. code:: ipython2

    ## make a list of sample names you wish to include in your BUCKy analysis 
    SUBSAMPLES = [
        "29154_superba", 
        "30686_cyathophylla", 
        "41478_cyathophylloides", 
        "33413_thamno", 
        "30556_thamno",
        "35236_rex",
        "40578_rex", 
        "38362_rex", 
        "33588_przewalskii",
    ]

Sample loci and write NEXUS files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython2

    ## create a name for this particular data set
    NAME = "example"
    
    ## create nexus files for this data set
    ifc.loci2multinex(name=NAME, 
                      locifile=LOCIFILE, 
                      subsamples=SUBSAMPLES, 
                      minSNPs=2, 
                      outdir=WORKDIR)


.. parsed-literal::

    wrote 709 nexus files to /home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-example


An example nexus file
~~~~~~~~~~~~~~~~~~~~~

Nexus files are written to a new directory called ``bucky-{name}``,
where *name* is the name entered into the ``loci2multinex()`` function.
If you also entered an ``outdir`` argument then this new directory will
be created as a subdirectory inside that outdir. Above we used
``name="example"`` and ``outdir=WORKDIR``, which created files in the
directory shown above.

.. code:: ipython2

    ## join WORKDIR and the bucky-{name} dir to get RUNDIR as a full path
    RUNDIR = os.path.join(WORKDIR, "bucky-{}".format(NAME))
    
    ## print an example nexus file
    with open(os.path.join(RUNDIR, "1.nex")) as nex:
        print nex.read()


.. parsed-literal::

    #NEXUS
    begin data;
    dimensions ntax=9 nchar=66;
    format datatype=dna interleave=yes gap=- missing=N;
    matrix
    30686_cyathophylla      CTTGGCAGGTGGCAGTTCGTTGCTGTTATATGCTGTAAGAAAAT-AAAAAAAAATCACCTGTTTAG
    33413_thamno            CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT--AAAAAAAACCACCTGTTTAG
    30556_thamno            CTTNGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT-NAAAAAAAATCACCTGTTTAG
    33588_przewalskii       CTTGGCAGGTGGCAGTTCGTTGCTGAAATATGCTGTAAGAAAAT-AAAGAAAAATCATTT-TTTGG
    29154_superba           CTTGGCAGTTGGCATTTCGTTGCTGTTATATGCTGTAAGAAAAT-AAAAAAAAATCACCTGTTTAA
    40578_rex               CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT--AAAAAAAATCACCTGTTTAG
    41478_cyathophylloides  CTTGGCAGGTGGCAGTTCGTTGCTGTTATATGCTGTAAGAAAATAAAAAAAAAATCACCTGTTTAG
    38362_rex               CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAATAAAAAAAAAATCACCTGTTTAG
    35236_rex               CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT--AAAAAAAATCACCTGTTTAG
    
        ;
    
    begin mrbayes;
    set autoclose=yes nowarn=yes;
    lset nst=6 rates=gamma;
    mcmc ngen=2000000 samplefreq=1000 printfreq=2000000;
    sump burnin=1000000;
    sumt burnin=1000000;
    end;
    


.. code:: ipython2

    ## get all nexus files for this data set
    nexfiles = glob.glob(os.path.join(RUNDIR, "*.nex"))

Python functions to call ``mrbayes``, ``mbsum``, and ``bucky``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython2

    def mrbayes(infile):
        ## double check file path
        infile = os.path.realpath(infile)
        if not os.path.exists(infile):
            raise Exception("infile not found; try using a fullpath")
            
        ## call mrbayes
        cmd = ['mb', infile]
        proc = sps.Popen(cmd, stderr=sps.STDOUT, stdout=sps.PIPE)
        stdout = proc.communicate()
        
        ## check for errors
        if proc.returncode:
            return stdout

.. code:: ipython2

    def mbsum(dirs):
        ## summarize the posterior tree files from both mrbayes runs
        trees1 = glob.glob(os.path.join(dirs, "*.run1.t"))
        trees2 = glob.glob(os.path.join(dirs, "*.run2.t"))
        for tidx in xrange(len(trees1)):
            cmd = ["mbsum", 
                   "-n", "0", 
                   "-o", os.path.join(dirs, str(tidx))+".in", 
                   trees1[tidx], 
                   trees2[tidx]]
            proc = sps.Popen(cmd, stderr=sps.STDOUT, stdout=sps.PIPE)
            proc.communicate()
        print "summed {} trees in: {}".format(len(trees1), dirs)

.. code:: ipython2

    def bucky(outname, indir, alpha, nchains, nreps, niter):
        ## check paths
        if not os.path.exists(indir):
            raise Exception("infiles not found; try using a fullpath")
        
        ## call bucky 
        infiles = os.path.join(indir, "*.in")
        cmd = ["bucky", 
               "-a", str(alpha),
               "-c", str(nchains),
               "-k", str(nreps),
               "-n", str(int(niter)), 
               "-o", outname, 
               infiles]
        
        cmd = " ".join(cmd)
        proc = sps.Popen(cmd, stderr=sps.STDOUT, stdout=sps.PIPE, shell=True)
        stdout = proc.communicate()
        if proc.returncode:
            return cmd, stdout

Run mrbayes on all nexus files in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is important that the lists contain the full paths to the files.

.. code:: ipython2

    ## send jobs to the parallel engines
    asyncs = []
    for nexfile in nexfiles:
        async = lbview.apply(mrbayes, nexfile)
        asyncs.append(async)

Track progress of the mrbayes runs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to check progress interactively, execute the cell below,
which reports how many jobs have finished. The cell after it uses a
``wait()`` call to block until all of the mrbayes jobs are finished.

.. code:: ipython2

    ready =  [i for i in asyncs if i.ready()]
    failed = [i for i in ready if not i.successful()]
    
    ## print progress
    print "mrbayes batch runs:"
    print "{} jobs submitted".format(len(asyncs))
    print "{} jobs finished".format(len(ready))
    
    ## print errors, if any.
    if any(failed):
        print failed[0].exception()
        print failed[0].result()


.. parsed-literal::

    mrbayes batch runs:
    722 jobs submitted
    35 jobs finished
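Rather than re-executing the progress cell by hand, a small polling
loop can print progress at intervals until all jobs are done. This
helper is not part of the original notebook, just a sketch built
around the same ``ready()`` checks used above:

.. code:: ipython2

    import time

    def track_progress(jobs, interval=5):
        ## poll a list of asynchronous results until all have finished
        while True:
            ready = [i for i in jobs if i.ready()]
            print("{} of {} jobs finished".format(len(ready), len(jobs)))
            if len(ready) == len(jobs):
                break
            time.sleep(interval)

    ## e.g., track_progress(asyncs, interval=30)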


.. code:: ipython2

    ## waits until all mrbayes runs are finished
    ipyclient.wait()




.. parsed-literal::

    True



Summarize the mrbayes posteriors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython2

    ## run mbsum on the directory of tree files
    mbsum(RUNDIR)


.. parsed-literal::

    summed 9 trees in: /home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13
    summed 0 trees in: /home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp9


Run BUCKy to infer concordance factors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython2

    nchains = 4
    nreps = 4
    niter = 1e6
    alphas = [0.1, 1, 10]
    
    ## submit jobs to run at several values of alpha
    bsyncs = []
    for alpha in alphas:
        outname = os.path.join(RUNDIR, "bucky-{}".format(alpha))
        args = (outname, RUNDIR, alpha, nchains, nreps, niter)
        async = lbview.apply(bucky, *args)
        bsyncs.append(async)

Track progress of Bucky runs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython2

    ready = [i for i in bsyncs if i.ready()]
    failed = [i for i in ready if not i.successful()]
    print "bucky batch runs:"
    print "{} jobs submitted".format(len(bsyncs))
    print "{} jobs finished".format(len(ready))
    if len(ready) == len(bsyncs):
        ## print errors, if any.
        if any(failed):
            print failed[0].exception()



.. parsed-literal::

    bucky batch runs:
    3 jobs submitted
    0 jobs finished


.. code:: ipython2

    ipyclient.wait()




.. parsed-literal::

    True



Results
~~~~~~~

Look at the individual result files for the final stats.

.. code:: ipython2

    results = glob.glob(os.path.join(RUNDIR, "bucky-*.concordance"))


.. code:: ipython2

    results




.. parsed-literal::

    ['/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-1.txt.concordance',
     '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-0.1.concordance',
     '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-0.1.txt.concordance',
     '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-1.concordance',
     '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-10.txt.concordance',
     '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-10.concordance']

