Tip revision: 11e26eaa46286ae1e4eea61e6ec214de179f7f8b authored by isaacovercast on 01 May 2016, 20:33:56 UTC
Merge branch 'master' of https://github.com/dereneaton/ipyrad

.. include:: global.rst  

.. _files:

Input data/files
=================
ipyrad_ can be used to assemble any kind of data that is generated using a 
restriction digest method (RAD, ddRAD) or related amplification-based 
process (e.g., NextRAD, RApture). 

The :ref:`input files<input_files>` can be sorted among Samples 
(demultiplexed) before starting to use ipyrad, or ipyrad can be used 
to demultiplex the data based on a 
`barcodes file`_. Examples of both are available in the 
:ref:`tutorials<tutorials>`. 
ipyrad aims to be very flexible in allowing assembly of reads of various
lengths so that new data can be easily combined with older data. 


.. _data_types:

Supported data types
--------------------

There is an increasingly large variety of ways to generate reduced-representation 
genomic data sets using either restriction digestion or primer sets, 
many of which can be assembled in ipyrad_. Because it is difficult to keep up with 
all of the names, we use our own terminology, described below, to group together
data types that can be analyzed using the same bioinformatic methods. 
If you have a data type that is not described below and you're not sure whether it 
can be analyzed in ipyrad_, :ref:`let us know here<gitter>`. 


rad 
^^^^
This category includes data types which use a single cutter to generate 
DNA fragments for sequencing based on a single cut site. 

e.g., RAD-seq, NextRAD


ddrad 
^^^^^
This category includes data types which select fragments that were digested
by two different restriction enzymes, one cutting each end of the fragment. 
During assembly this type of data is filtered more stringently than the 
**rad** data type: the filter also looks for occurrences of the second 
(usually more common) cutter. 

e.g., double-digest RAD-seq


gbs
^^^
This category includes any data type which selects fragments that were digested
by a **single enzyme** that cuts both ends of DNA fragments. This data type requires 
reverse-complement clustering because the forward vs reverse adapters can attach
to either end of each fragment, and thus when shorter fragments are sequenced 
from either end the resulting reads often overlap partially or completely. 
When analyzing GBS data we strongly recommend using a stringent setting for 
the ``filter_adapters`` parameter.

e.g., genotyping-by-sequencing (Elshire et al.), EZ-RAD (Toonen et al.)
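
To see why reverse-complement clustering is needed for this data type, 
consider a short fragment that is sequenced from either end. The following is 
a minimal Python sketch, not ipyrad code: the fragment sequence and the helper 
name ``revcomp`` are made up for illustration.

.. code-block:: python

    # Illustrative only: the fragment and helper are hypothetical, not ipyrad code.
    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        """Return the reverse complement of a DNA sequence."""
        return seq.translate(COMPLEMENT)[::-1]

    # A short fragment with the same (made-up) cut-site sequence on both ends.
    fragment = "TGCAGAACCGTTAGGCTTACTGCA"

    # If the forward adapter attaches to the left end, the read is the fragment
    # itself; if it attaches to the right end, the read is its reverse complement.
    read_from_left = fragment
    read_from_right = revcomp(fragment)

    # The two reads only match after one of them is reverse-complemented.
    assert read_from_left != read_from_right
    assert revcomp(read_from_right) == read_from_left

Clustering that compares reads in the forward orientation only would treat 
these two reads as different loci, even though they come from the same fragment.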


pairddrad
^^^^^^^^^
This category is for paired-end data from fragments that were generated 
through restriction digestion using two different enzymes. During step 3 the 
paired reads are checked for :ref:`paired read merging<paired_read_merging>`
in case they partially overlap. 

e.g., double-digest RAD-seq (w/ paired-end sequencing)


pairgbs
^^^^^^^
This category is for paired-end data from fragments that were generated by 
digestion with a **single enzyme** that cuts both ends of the fragment. 
Because the forward adapter might bind to either end of these fragments,
approximately half of the matches are expected to be reverse-complemented 
with perfect overlap. Paired reads are checked for merging before clustering/mapping.


e.g., genotyping-by-sequencing, EZ-RAD (w/ paired-end sequencing)


2brad
^^^^^^
This category is for a special class of sequenced fragments generated using
a type IIb restriction enzyme. The reads are usually very short, and 
are treated slightly differently in steps 2 and 7. 



.. _input_files:

FASTQ input files
------------------
Depending on how and where your sequence data are generated you may receive the
data in a single giant file, or in many smaller files. The files may contain data
from all of your individuals mixed up together, in which case the data need
to be demultiplexed based on their barcodes or index; or your data may 
already be demultiplexed, in which case each of your data files corresponds to 
a different sample. 


multiplexed (raw) sequence files  
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your data are not yet sorted among individuals/samples then you will need 
to have their barcode information organized into a 
:ref:`barcodes file<barcodes_file>`. Sample names are taken from the barcodes 
file. The raw data file(s) should be entered in the ``raw_fastq_path`` parameter.


demultiplexed (sorted) sequence files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your data are already sorted then you simply have to enter the path to the 
data files in the ``sorted_fastq_path`` parameter.
The :ref:`cookbook recipes <cookbook_recipes>` section provides more complex
methods for combining data from multiple sequencing runs for the same 
individual, or for using multiple barcodes files.


.. note:: 

    It's worth paying careful attention to file names before starting
    an analysis since these names, and any included typos, will be perpetuated 
    through all the resulting data files. Do not include spaces in file names.


.. _file_names:

Input file names
-----------------
If your data are not yet demultiplexed then Sample names will come from the 
:ref:`barcodes files<barcodes_file>`, as shown below. 
Otherwise, if data files are already sorted among Samples (demultiplexed) 
then Sample names will be extracted from the file names. 
The file names should not have any spaces in them. 
If you are using a paired-end data type then the rules for file names are a bit 
more strict than for single-end data. Every read 1 (R1) file must contain the 
string ``_R1_`` in its name, and the name of every read 2 (R2) file must be 
identical to that of its R1 file except that it has ``_R2_`` in place of 
``_R1_``. See the tutorials for an example. 
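
The naming rules above can be checked before starting an assembly. The 
following is a minimal Python sketch and is not part of ipyrad: the function 
name ``check_paired_names`` and the example file names are hypothetical.

.. code-block:: python

    # Illustrative sketch; the function name and file names are hypothetical.
    def check_paired_names(filenames):
        """Return a list of problems found in a set of paired-end file names."""
        names = set(filenames)
        problems = []
        for name in sorted(names):
            if " " in name:
                problems.append(f"space in file name: {name!r}")
            if "_R1_" in name:
                # The R2 name must be identical except for the _R2_ string.
                mate = name.replace("_R1_", "_R2_", 1)
                if mate not in names:
                    problems.append(f"no R2 file for {name}: expected {mate}")
            elif "_R2_" in name:
                mate = name.replace("_R2_", "_R1_", 1)
                if mate not in names:
                    problems.append(f"no R1 file for {name}: expected {mate}")
            else:
                problems.append(f"neither _R1_ nor _R2_ in {name}")
        return problems

    files = ["1A_0_R1_.fastq.gz", "1A_0_R2_.fastq.gz", "1B_0_R1_.fastq.gz"]
    for problem in check_paired_names(files):
        print(problem)

An empty list means that every file name contains ``_R1_`` or ``_R2_``, has 
no spaces, and has a matching mate.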



.. _barcodes_file:

Barcodes file
--------------
The barcodes file is a simple table linking barcodes to samples. 
Barcodes can be of varying lengths. 
Each line should have one name and then one barcode, separated by a tab or 
space. The names that you enter in the barcodes file are the names 
that will end up in your output files, so it is useful to check for 
typos or other errors, or to shorten the names as you see fit before 
running step 1. Do not include any spaces in Sample names. 

.. parsed-literal::

    sample1     ACAGG
    sample2     ATTCA
    sample3     CGGCATA
    sample4     AAGAACA
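
A barcodes file in this format can be checked for common mistakes before 
running an assembly. The following is a minimal Python sketch, not ipyrad's 
own parser: the function name ``read_barcodes`` is hypothetical, and the 
restriction of barcodes to the letters A, C, G and T is an assumption made 
for this example.

.. code-block:: python

    # Illustrative sketch; not ipyrad's parser. The function name is hypothetical.
    def read_barcodes(lines):
        """Parse 'name barcode' lines into a dict, raising on common mistakes."""
        barcodes = {}
        for lineno, line in enumerate(lines, start=1):
            fields = line.split()  # splits on tabs or spaces
            if not fields:
                continue  # skip blank lines
            if len(fields) != 2:
                # A space inside a Sample name produces extra fields.
                raise ValueError(f"line {lineno}: expected a name and a barcode")
            name, barcode = fields
            # Assumption for this sketch: barcodes contain only A, C, G, T.
            if set(barcode.upper()) - set("ACGT"):
                raise ValueError(f"line {lineno}: unexpected barcode {barcode!r}")
            if name in barcodes:
                raise ValueError(f"line {lineno}: duplicate name {name!r}")
            barcodes[name] = barcode.upper()
        return barcodes

    example = ["sample1     ACAGG", "sample2     ATTCA", "sample3\tCGGCATA"]
    print(read_barcodes(example))

To check a file on disk, pass an open file handle instead of a list, since 
the function only iterates over lines.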


.. _params_file:

Params file
------------
The parameter input file, which typically includes ``params.txt`` in its name, 
can be created with the ``-n`` option from the ipyrad command line. This file 
lists all of the :ref:`parameter settings<paramater settings>` 
necessary to complete an assembly. 
A description of how to create and use a params file can be found in the 
:ref:`introductory tutorial<tutorial_intro_cli>`.