Raw File
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cookbook for running BUCKy in parallel in a Jupyter notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook uses the *Pedicularis* example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the output file with the `.alleles.loci` ending. All code in this notebook is Python. You can simply follow along and execute this same code in a Jupyter notebook of your own. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Software requirements for this notebook\n",
    "All required software can be installed through conda by running the commented out code below in a terminal. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## conda install -c BioBuilds mrbayes\n",
    "## conda install -c ipyrad ipyrad\n",
    "## conda install -c ipyrad bucky"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## import Python libraries\n",
    "import ipyrad.analysis as ipa\n",
    "import ipyparallel as ipp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cluster setup\n",
    "To execute code in parallel we will use the `ipyparallel` Python library. A quick guide to starting a parallel cluster locally can be found [here](link), and instructions for setting up a remote cluster on a HPC cluster is available [here](http://ipyrad.readthedocs.io/HPC_Tunnel.html). In either case, this notebook assumes you have started an `ipcluster` instance that this notebook can find, which the cell below will test. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4 engines found\n"
     ]
    }
   ],
   "source": [
    "## look for running ipcluster instance, and create load-balancer\n",
    "ipyclient = ipp.Client()\n",
    "print \"{} engines found\".format(len(ipyclient))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a bucky analysis object\n",
    "The two required arguments are the `name` and `data` arguments. The `data` argument should be a .loci file or a .alleles.loci file. The name will be used to name output files, which will be written to `{workdir}/{name}/{number}.nexus`. Bucky doesn't deal well with missing data, so loci will only be included if they contain data for all samples in the analysis. By default, all samples found in the loci file will be used, unless you enter a list of names (the `samples` argument) to subsample taxa, which we do here. It is best to select one individual per species or subspecies. You can set a number of additional parameters in the `.params` dictionary. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## make a list of sample names you wish to include in your BUCKy analysis \n",
    "samples = [\n",
    "    \"29154_superba\", \n",
    "    \"30686_cyathophylla\", \n",
    "    \"41478_cyathophylloides\", \n",
    "    \"33413_thamno\", \n",
    "    \"30556_thamno\",\n",
    "    \"35236_rex\",\n",
    "    \"40578_rex\", \n",
    "    \"38362_rex\", \n",
    "    \"33588_przewalskii\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## initiate a bucky object\n",
    "b = ipa.bucky(\n",
    "    name=\"test\",\n",
    "    data=\"analysis-ipyrad/pedic_outfiles/pedic.alleles.loci\",\n",
    "    workdir=\"analysis-bucky\",\n",
    "    samples=samples,\n",
    "    minsnps=2,\n",
    "    maxloci=100,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bucky_alpha           [0.1, 1.0, 10.0]    \n",
       "bucky_nchains         4                   \n",
       "bucky_niter           1000000             \n",
       "bucky_nreps           4                   \n",
       "maxloci               100                 \n",
       "mb_mcmc_burnin        100000              \n",
       "mb_mcmc_ngen          1000000             \n",
       "mb_mcmc_sample_freq   1000                \n",
       "minsnps               2                   \n",
       "seed                  869236347           "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## print the params dictionary\n",
    "b.params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/test\n"
     ]
    }
   ],
   "source": [
    "## This will write nexus files to {workdir}/{name}/[number].nex\n",
    "b.write_nexus_files(force=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example nexus file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#NEXUS\r\n",
      "begin data;\r\n",
      "dimensions ntax=9 nchar=64;\r\n",
      "format datatype=dna interleave=yes gap=- missing=N;\r\n",
      "matrix\r\n",
      "30686_cyathophylla      TCCTCGGCAGCCATTAGACCGGTGGAATATGCACCATGTACCGATCCTGGATAATCAAAACTCG\r\n",
      "33413_thamno            TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACTGATCCTGGATAATCAAAACTTG\r\n",
      "33588_przewalskii       TCCTCGGCAGCCATCAGACCGGTGGAGTGTGCACCATGCACCGATCCCGCATAATCAAAACTCG\r\n",
      "29154_superba           TCCTCGGCAGCCATTAGACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG\r\n",
      "41478_cyathophylloides  TCCTCGGCAGCCATTAGACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG\r\n",
      "40578_rex               TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n",
      "30556_thamno            TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n",
      "38362_rex               TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n",
      "35236_rex               TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n",
      "\r\n",
      "    ;\r\n",
      "\r\n",
      "begin mrbayes;\r\n",
      "set autoclose=yes nowarn=yes;\r\n",
      "lset nst=6 rates=gamma;\r\n",
      "mcmc ngen=1000000 samplefreq=1000 printfreq=1000000;\r\n",
      "sump burnin=100000;\r\n",
      "sumt burnin=100000;\r\n",
      "end;\r\n"
     ]
    }
   ],
   "source": [
    "## print an example nexus file\n",
    "! cat analysis-bucky/test/1.nex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complete a BUCKy analysis\n",
    "There are three parts to a BUCKy analysis which we have broken into three distinct functions. The first is `.run_mrbayes()`, then `.run_mbsum()`, followed by `.run_bucky()`. Each uses the files produced by the previous function in order. You can use the force flag to overwrite existing files. An ipyclient should be provided to distribute the jobs in parallel. The parallelization is especially important for the mrbayes analyses, where more cores will lead to approximately linear speed improvements. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[####################] 100% [mb] infer gene-tree posteriors | 0:49:01 |  \n"
     ]
    }
   ],
   "source": [
    "## distributes mrbayes jobs across the parallel client\n",
    "b.run_mrbayes(force=True, ipyclient=ipyclient)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[####################] 100% [mbsum] sum replicate runs      | 0:00:07 |  \n"
     ]
    }
   ],
   "source": [
    "## this step is fast, simply summing the gene-tree posteriors\n",
    "b.run_mbsum(ipyclient=ipyclient)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[####################] 100% [bucky] infer CF posteriors     | 1:35:16 |  \n"
     ]
    }
   ],
   "source": [
    "## infer concordance factors with BUCKy. This will run in parallel \n",
    "## for however many alpha values are in b.params.bucky_alpha list\n",
    "b.run_bucky(ipyclient=ipyclient)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convenient access to results\n",
    "For now, parse results from the `.concordance` output files produced by BUCKy which will be located in your working directory (default is \"./analysis-bucky\"). Coming soon we will add some convenience functions for parsing results as tables and trees. "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
back to top