
.. include:: global.rst

.. _outline:

Assembly Steps and Branching
=============================
The typical workflow to move from fastq formatted input data to assembled 
output files in ipyrad_ involves :ref:`seven sequential steps <seven_steps>`. 
The assembly is separated into distinct steps to create a modular 
workflow that can be restarted if interrupted, and easily 
:ref:`branched<branching_workflow>` at different points to create 
assemblies under different combinations of parameter settings. 


Basic workflow
---------------
The simplest use of ipyrad is to assemble a data set under a single set of 
parameters defined in a params file. Step 1 assigns data to each of the Samples, 
Steps 2-5 process data for each Sample, Step 6 clusters data across Samples, 
and Step 7 filters these data and formats them for downstream analyses. 

.. image:: images/steps.png

The code to run a basic workflow is quite simple:

.. code-block:: bash
    
    ## create an initial Assembly params file
    >>> ipyrad -n data1 

    ## fill in the new params file with your text editor
    ## ... editing params-data1.txt

    ## run steps 1-7 with the params file for this Assembly
    >>> ipyrad -p params-data1.txt -s 1234567



.. _branching_workflow:

Branching workflow
-------------------
A more effective way to use ipyrad_, however, is to create branching
assemblies in which multiple data sets are assembled under different parameter 
settings. The schematic below shows an example where an assembly 
is branched at step 3. The new branch inherits file paths and statistics 
from the first Assembly, but can then apply different parameters going forward.
Branching does not create hard copies of existing data files, and so is not 
an "expensive" action in terms of disk space or time. We suggest it be used 
quite liberally whenever applying a new set of parameters. 

.. image:: images/steps_branching.png


The code to run a branching workflow is only a bit more complex than the basic
workflow. You can find more branching examples in the 
:ref:`advanced tutorial<tutorial_advanced_cli>` and 
:ref:`cookbook<cookbook>` sections. 

.. code-block:: bash
    
    ## create an initial Assembly and params file, here called 'data1'
    >>> ipyrad -n data1 

    ## edit the params file for data1 with your text editor
    ## ... editing params-data1.txt

    ## run steps 1-2 with the params file
    >>> ipyrad -p params-data1.txt -s 12

    ## create a new branch of 'data1' before step3, here called 'data2'.
    >>> ipyrad -p params-data1.txt -b data2

    ## edit the params file for data2 using a text editor
    ## ... editing params-data2.txt

    ## run steps 3-7 for both assemblies
    >>> ipyrad -p params-data1.txt -s 34567
    >>> ipyrad -p params-data2.txt -s 34567


.. _dropping_samples:

Add or drop samples by branching
---------------------------------
Another point where branching is useful is for adding or dropping
samples from an assembly, either to analyze a subset of samples 
separately from others, or to exclude samples with low coverage. 
The `branching` and `merging` functions in ipyrad make this easy. 
By requiring a branching process in order to drop samples from an 
assembly, ipyrad inherently forces you to retain the parent assembly 
as a copy. This provides a nice fail-safe so that you can experiment 
with the new branched assembly without affecting its pre-branched 
parent assembly. 

Examples using the ipyrad CLI:

.. code-block:: bash

    ## branch and keep only 3 samples from assembly data1
    >>> ipyrad -p params-data1.txt -b data2 1A0 1B0 1C0

    ## branch and exclude 3 samples from assembly data1
    >>> ipyrad -p params-data1.txt -b data3 - 1A0 1B0 1C0


Examples using the ipyrad API:

.. code-block:: python

    ## branch and keep only 3 samples from assembly data1
    >>> data2 = data1.branch("data2", subsamples=["1A0", "1B0", "1C0"])

    ## branch and exclude 3 samples from assembly data1
    >>> keep_list = [i for i in data1.samples.keys() if i not in ["1A0", "1B0", "1C0"]]
    >>> data3 = data1.branch("data3", subsamples=keep_list)
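
A useful extension of this, sketched below under stated assumptions, is to 
drop low-coverage samples using the summary statistics: it assumes steps 1-5 
have completed so that the ``reads_consens`` column of the stats table is 
filled in, and the cutoff of 10000 reads is an arbitrary placeholder. The 
reverse operation, combining samples from several assemblies, is available 
through ``ip.merge()``. 

.. code-block:: python

    import ipyrad as ip

    ## keep only samples with at least 10000 consensus reads
    ## (the cutoff is a placeholder, not a recommendation)
    keep = [i for i in data1.samples
            if data1.stats.reads_consens[i] >= 10000]
    data4 = data1.branch("data4", subsamples=keep)

    ## merging combines samples from multiple assemblies into a new one
    merged = ip.merge("merged", [data1, data2])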



.. _seven_steps:

Seven Steps
------------

1. Demultiplexing / Loading fastq files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 1 loads sequence data into a named :ref:`Assembly<Assembly>` and assigns
reads to :ref:`Samples<Samples>` (individuals). If the data are not yet 
demultiplexed, step 1 uses information from a :ref:`barcodes file<barcodes_file>` 
to demultiplex the data; otherwise, it simply loads the data for each Sample. 

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 1:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`raw_fastq_path<raw_fastq_path>`  
* :ref:`barcodes_path<barcodes_path>`  
* :ref:`sorted_fastq_path<sorted_fastq_path>`  
* :ref:`*datatype<datatype>`  
* :ref:`restriction_overhang<restriction_overhang>`  
* :ref:`max_barcode_mismatch<max_barcode_mismatch>`  
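
To make this concrete, here is a minimal sketch using the Python API; the 
project directory, file paths, and parameter values are hypothetical 
placeholders. 

.. code-block:: python

    import ipyrad as ip

    ## create a named Assembly and set the step 1 parameters
    data = ip.Assembly("demux1")
    data.set_params("project_dir", "analysis-ipyrad")
    data.set_params("raw_fastq_path", "raws/*.fastq.gz")
    data.set_params("barcodes_path", "barcodes.txt")
    data.set_params("datatype", "rad")
    data.set_params("max_barcode_mismatch", 0)

    ## run step 1 to demultiplex reads to Samples
    data.run("1")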



2. Filtering / Editing reads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 2 uses the quality scores recorded in the fastq data files to filter low 
quality base calls. Sites with a score below a set value are changed into “N”s, 
and reads with more than the allowed number of “N”s are discarded. The threshold 
for inclusion is set with the :ref:`phred_Qscore_offset<phred_Qscore_offset>` 
parameter. An optional filter can be applied to remove adapters/primers 
(see :ref:`filter_adapters<filter_adapters>`), and there is an 
optional filter to clean up the edges of poor quality reads 
(see :ref:`edit_cutsites<edit_cutsites>`).

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 2:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`barcodes_path<barcodes_path>`  
* :ref:`*datatype<datatype>`  
* :ref:`restriction_overhang<restriction_overhang>`  
* :ref:`max_low_qual_bases<max_low_qual_bases>`  
* :ref:`filter_adapters<filter_adapters>`  
* :ref:`filter_min_trim_len<filter_min_trim_len>`  
* :ref:`edit_cutsites<edit_cutsites>`  
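
A brief sketch of these filters through the Python API, continuing from the 
step 1 example above; the values shown are illustrative placeholders. 

.. code-block:: python

    ## discard reads with more than 5 low quality (N) bases
    data.set_params("max_low_qual_bases", 5)

    ## also filter for adapters/primers, and discard reads that are
    ## trimmed below a minimum length
    data.set_params("filter_adapters", 2)
    data.set_params("filter_min_trim_len", 35)

    ## run the filtering/editing step
    data.run("2")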


3. Clustering / Mapping reads within Samples and alignment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 3 first dereplicates the sequences from step 2, recording the number of 
times each unique read is observed. If the data are paired-end, it then uses
vsearch_ to merge paired reads which overlap. The resulting data are 
then either de novo clustered (using vsearch_) or mapped to a reference 
genome (using smalt_ and bedtools_), depending on the selected assembly method.
In either case, reads are matched together on the basis of sequence similarity
and the resulting clusters are aligned using muscle_. 

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 3:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`*assembly_method<assembly_method>`  
* :ref:`*datatype<datatype>`  
* :ref:`*clust_threshold<clust_threshold>`  
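
For example, through the Python API the assembly method and clustering 
threshold might be set as below before running step 3; the values are 
illustrative, not recommendations. 

.. code-block:: python

    ## cluster reads within Samples de novo at 85% similarity
    data.set_params("assembly_method", "denovo")
    data.set_params("clust_threshold", 0.85)
    data.run("3")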


4. Joint estimation of heterozygosity and error rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 4 jointly estimates the sequencing error rate and heterozygosity based 
on counts of site patterns across clustered reads. These estimates are used 
in step 5 for consensus base calling. If max_alleles_consens is set to 1 
(haploid) then heterozygosity is fixed to 0 and only the error rate is 
estimated. For all other settings of max_alleles_consens a diploid model is 
used (i.e., the two alleles are expected to occur at equal frequency). 

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 4:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`*datatype<datatype>`  
* :ref:`*restriction_overhang<restriction_overhang>`  
* :ref:`*max_alleles_consens<max_alleles_consens>`  
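
A short sketch of this step through the Python API. The ``hetero_est`` and 
``error_est`` column names used below are the ones reported in the summary 
stats table; treat them as assumptions if your version reports different 
names. 

.. code-block:: python

    ## for haploid data, uncomment to fix heterozygosity to 0:
    ## data.set_params("max_alleles_consens", 1)

    ## run the joint estimation, then inspect the per-sample estimates
    data.run("4")
    print(data.stats[["hetero_est", "error_est"]])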


5. Consensus base calling and filtering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 5 estimates consensus allele sequences from clustered reads given the 
parameters estimated in step 4 and a binomial model. During this step we filter 
for the maximum number of undetermined sites (Ns) per locus (max_Ns_consens). 
The number of alleles at each locus is recorded, but a filter for max_alleles 
is not applied until step 7. Read depth information is also stored at this 
step for the VCF output in step 7. 

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 5:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`*datatype<datatype>`  
* :ref:`*max_Ns_consens<max_Ns_consens>`  
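
For example, the maximum number of Ns allowed per consensus sequence can be 
set before running step 5. The tuple form below gives separate limits for 
read 1 and read 2 of paired data; the values are illustrative placeholders. 

.. code-block:: python

    ## allow at most 5 Ns per consensus sequence (R1, R2 limits)
    data.set_params("max_Ns_consens", (5, 5))
    data.run("5")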


6. Clustering / Mapping reads among Samples and alignment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 6 clusters consensus sequences across Samples using the same assembly 
method as in step 3. One allele is randomly sampled before clustering so that 
ambiguous characters have a lesser effect on clustering, but the resulting 
data retain information for heterozygotes. The clustered sequences are then 
aligned using muscle_. 

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 6:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`*datatype<datatype>`  
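
The effect of randomly sampling one allele can be shown with a small 
standalone sketch. This is illustrative only, not ipyrad's internal code: 
each IUPAC ambiguity code in a consensus sequence is resolved to one of its 
two bases at random. 

.. code-block:: python

    import random

    ## IUPAC codes for the six heterozygous two-base combinations
    AMBIG = {"R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC"}

    def sample_one_allele(consens):
        """Resolve each ambiguous site to a randomly chosen base."""
        return "".join(random.choice(AMBIG[b]) if b in AMBIG else b
                       for b in consens)

    print(sample_one_allele("ACRTGY"))  ## e.g., ACATGC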


7. Filtering and formatting output files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step 7 applies filters to the final alignments and saves the final data in a 
number of possible :ref:`output formats<full_output_formats>`. This step is 
most often repeated at several different settings of the parameter 
:ref:`min_samples_locus` to create assemblies with different proportions of 
missing data (see branching_workflow_). 

The following :ref:`parameters<parameters>` are *potentially*
used or required (\*) for step 7:

* :ref:`*assembly_name<assembly_name>`  
* :ref:`*project_dir<project_dir>`  
* :ref:`*datatype<datatype>`  
* :ref:`min_samples_locus<min_samples_locus>`  
* :ref:`max_Indels_locus<max_Indels_locus>`  
* :ref:`max_shared_Hs_locus<max_shared_Hs_locus>`  
* :ref:`max_alleles_consens<max_alleles_consens>`  
* :ref:`trim_overhang<trim_overhang>`  
* :ref:`output_formats<output_formats>`  
* :ref:`pop_assign_file<pop_assign_file>`  
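
Because only step 7 needs to be re-run, a small loop over branches, as 
sketched below, is a cheap way to produce assemblies at several missing-data 
thresholds; the branch names and values are illustrative. 

.. code-block:: python

    ## branch and re-run step 7 at several min_samples_locus settings
    for min_s in (4, 10, 20):
        sub = data.branch("min{}".format(min_s))
        sub.set_params("min_samples_locus", min_s)
        sub.run("7")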





**Example CLI branching workflow**


.. code-block:: bash

    ## create a params file for a new Assembly named 'data1', and then use a
    ## text editor to edit the parameter settings in params-data1.txt
    ipyrad -n data1

    ## run steps 1-2 using the default settings
    ipyrad -p params-data1.txt -s 12

    ## branch to create a 'copy' of this assembly named data2
    ipyrad -p params-data1.txt -b data2

    ## edit params-data2.txt in a text editor to use different parameter
    ## settings, for example, change the clustering threshold from 0.85 to 0.90

    ## now run the remaining steps (3-7) on each data set
    ipyrad -p params-data1.txt -s 34567
    ipyrad -p params-data2.txt -s 34567


**Example Python API branching workflow**


.. code-block:: python

    ## import ipyrad 
    import ipyrad as ip

    ## create an Assembly and modify some parameter settings
    data1 = ip.Assembly("data1")
    data1.set_params("project_dir", "example")
    data1.set_params("raw_fastq_path", "data/*.fastq")
    data1.set_params("barcodes_path", "barcodes.txt")   

    ## run steps 1-2
    data1.run("12")

    ## create a new branch of this Assembly named data2
    ## and change some parameter settings 
    data2 = data1.branch("data2")
    data2.set_params("clust_threshold", 0.90)

    ## run steps 3-7 for the two Assemblies
    data1.run("34567")
    data2.run("34567")
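
After both runs complete the Assemblies can be compared side by side; a 
brief sketch, where ``stats`` is the per-sample summary table and 
``outfiles`` (an assumed attribute name) holds the step 7 output file paths. 

.. code-block:: python

    ## compare summary statistics of the two assemblies
    print(data1.stats)
    print(data2.stats)

    ## locations of the formatted output files from step 7
    print(data2.outfiles)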