https://github.com/dereneaton/ipyrad
Tip revision: 11e26eaa46286ae1e4eea61e6ec214de179f7f8b authored by isaacovercast on 01 May 2016, 20:33:56 UTC
Merge branch 'master' of https://github.com/dereneaton/ipyrad
files.rst
.. include:: global.rst
.. _files:
Input data/files
=================
ipyrad_ can be used to assemble any kind of data that is generated using a
restriction digest method (RAD, ddRAD) or related amplification-based
process (e.g., NextRAD, RApture).
The :ref:`input files<input_files>` can be sorted among Samples
(demultiplexed) before starting to use ipyrad, or ipyrad can be used
to demultiplex the data based on a
`barcodes file`_. Examples of both are available in the
:ref:`tutorials<tutorials>`.
ipyrad aims to be very flexible in allowing assembly of reads of various
lengths so that new data can be easily combined with older data.
.. _data_types:
Supported data types
--------------------
There is an increasingly large variety of ways to generate reduced representation
genomic data sets using either restriction digestion or primer sets,
many of which can be assembled in ipyrad_. Because it is difficult to keep up with
all of the names, we use our own terminology, described below, to group together
data types that can be analyzed using the same bioinformatic methods.
If you have a data type that is not described below and you're not sure if it
can be analyzed in ipyrad_, :ref:`let us know here<gitter>`.
rad
^^^^
This category includes data types which use a single cutter to generate
DNA fragments for sequencing based on a single cut site.
e.g., RAD-seq, NextRAD
ddrad
^^^^^
This category includes data types which select fragments that were digested
by two different restriction enzymes which cut the fragment on either end.
During assembly this type of data is analyzed differently from the **rad** data
type by more stringent filtering that looks for occurrences of the second
(usually more common) cutter.
e.g., double-digest RAD-seq
gbs
^^^
This category includes any data type which selects fragments that were digested
by a **single enzyme** that cuts both ends of DNA fragments. This data type requires
reverse-complement clustering because the forward vs reverse adapters can attach
to either end of each fragment, and thus when shorter fragments are sequenced
from either end the resulting reads often overlap partially or completely.
When analyzing GBS data we strongly recommend using a stringent setting for
the ``filter_adapters`` parameter.
e.g., genotyping-by-sequencing (Elshire et al.), EZ-RAD (Toonen et al.)
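The need for reverse-complement clustering can be illustrated with a minimal
Python sketch. This is not ipyrad code: the ``revcomp`` helper and the fragment
sequence are made up for illustration. It shows that two reads taken from
opposite ends of the same short fragment only match after one of them is
reverse-complemented.

```python
# Hypothetical illustration only; not part of ipyrad.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# A made-up short fragment, written 5'->3' on the top strand.
fragment = "ATGCCGTTAGGCTA"

# Depending on which end the forward adapter attached to, the read starts
# at the left end of the top strand or at the left end of the bottom strand.
read_forward = fragment[:10]
read_reverse = revcomp(fragment)[:10]

# Compared as raw strings the two reads differ ...
print(read_forward == read_reverse)

# ... but the reverse complement of one lies within the same fragment.
print(revcomp(read_reverse) in fragment)
```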
pairddrad
^^^^^^^^^
This category is for paired-end data from fragments that were generated
through restriction digestion using two different enzymes. During step 3 the
paired-reads will be tested for :ref:`paired read merging<paired_read_merging>`
if they overlap partially.
e.g., double-digest RAD-seq (w/ paired-end sequencing)
pairgbs
^^^^^^^
This category is for paired-end data from fragments that were generated by
digestion with a **single enzyme** that cuts both ends of the fragment.
Because the forward adapter might bind to either end of these fragments,
approximately half of the matches are expected to be reverse-complemented
with perfect overlap. Paired reads are checked for merging before clustering/mapping.
e.g., genotyping-by-sequencing, EZ-RAD, (w/ paired-end sequencing)
2brad
^^^^^^
This category is for a special class of sequenced fragments generated using
a type IIb restriction enzyme. The reads are usually very short, and
are treated slightly differently in steps 2 and 7.
.. _input_files:
FASTQ input files
------------------
Depending on how and where your sequence data are generated you may receive the
data in a single giant file, or in many smaller files. The files may contain data
from all of your individuals mixed up together, in which case the data need
to be demultiplexed based on their barcodes or index; or your data may
already be demultiplexed, in which case each of your data files corresponds to
a different sample.
multiplexed (raw) sequence files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your data are not yet sorted among individuals/samples then you will need
to have their barcode information organized into a
:ref:`barcodes file<barcodes_file>`. Sample names are taken from the barcodes
file. The raw data file(s) should be entered in the ``raw_fastq_path`` parameter.
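As a rough picture of what demultiplexing means, the following Python sketch
assigns a read to a Sample by looking for that Sample's barcode at the start of
the read. It is a simplified assumption, not ipyrad's implementation: the
``assign_sample`` function is hypothetical, only exact matches are considered,
and sequencing errors in the barcode are ignored.

```python
# Hypothetical illustration only; not how ipyrad demultiplexes internally.
def assign_sample(read_seq, barcodes):
    """Return (sample_name, trimmed_seq) for the barcode that starts read_seq.

    `barcodes` maps sample names to barcode strings. If no barcode matches,
    (None, read_seq) is returned and the read is left untouched.
    """
    # Try longer barcodes first so that a short barcode cannot shadow a
    # longer one that begins with the same bases.
    by_length = sorted(barcodes.items(), key=lambda item: -len(item[1]))
    for name, barcode in by_length:
        if read_seq.startswith(barcode):
            return name, read_seq[len(barcode):]
    return None, read_seq


barcodes = {"sample1": "ACAGG", "sample3": "CGGCATA"}
print(assign_sample("CGGCATATGCAGTTT", barcodes))
print(assign_sample("TTTTTTTTTT", barcodes))
```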
demultiplexed (sorted) sequence files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your data are already sorted then you simply have to enter the path to the
data files in the ``sorted_fastq_path`` parameter.
The :ref:`cookbook recipes <cookbook_recipes>` section provides more complex
methods for combining data from multiple sequencing runs into the same
individual, or for using multiple barcodes files.
.. note::

    It's worth paying careful attention to file names before starting
    an analysis since these names, and any included typos, will be perpetuated
    through all the resulting data files. Do not include spaces in file names.

.. _file_names:
Input file names
-----------------
If your data are not yet demultiplexed then Sample names will come from the
:ref:`barcodes file<barcodes_file>`, as shown below.
Otherwise, if data files are already sorted among Samples (demultiplexed)
then Sample names will be extracted from the file names.
The file names should not have any spaces in them.
If you are using a paired-end data type then the rules for file names are
stricter than for single-end data. Every read 1 (R1) file name must contain the
string ``_R1_``, and every read 2 (R2) file must have exactly the same name as
its R1 file except that ``_R1_`` is replaced by ``_R2_``. See the tutorials for
an example.
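Before starting a paired-end assembly it can help to check these naming rules
yourself. The sketch below is one possible check written for this page, not an
ipyrad function: ``pair_fastq_names`` and the example file names are
hypothetical.

```python
# Hypothetical helper for checking file names; not part of ipyrad.
import os


def pair_fastq_names(filenames):
    """Pair R1 and R2 file names using the _R1_/_R2_ convention.

    Returns (pairs, problems), where pairs is a list of (r1, r2) tuples
    and problems is a list of messages describing names that break the rules.
    """
    names = set(filenames)
    pairs, problems = [], []
    for name in sorted(names):
        folder, base = os.path.split(name)
        if " " in base:
            problems.append("space in file name: %r" % name)
        if "_R1_" in base:
            # The R2 name must be identical apart from _R1_ -> _R2_.
            mate = os.path.join(folder, base.replace("_R1_", "_R2_", 1))
            if mate in names:
                pairs.append((name, mate))
            else:
                problems.append("no R2 file found for %s" % name)
        elif "_R2_" in base:
            mate = os.path.join(folder, base.replace("_R2_", "_R1_", 1))
            if mate not in names:
                problems.append("no R1 file found for %s" % name)
        else:
            problems.append("neither _R1_ nor _R2_ in %s" % name)
    return pairs, problems


# Made-up file names for illustration.
files = ["1A_0_R1_.fastq.gz", "1A_0_R2_.fastq.gz", "1B_0_R1_.fastq.gz"]
pairs, problems = pair_fastq_names(files)
print(pairs)
print(problems)
```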
.. _barcodes_file:
Barcodes file
--------------
The barcodes file is a simple table linking barcodes to samples.
Barcodes can be of varying lengths.
Each line should have one name and then one barcode, separated by a tab or
space. The names that you enter in the barcodes file are the names
that will end up in your output files, so it is useful to check for
typos or other errors, or to shorten the names as you see fit before
running step 1. Do not include any spaces in Sample names.
.. parsed-literal::

    sample1     ACAGG
    sample2     ATTCA
    sample3     CGGCATA
    sample4     AAGAACA

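Because typos in this file carry through to every output file, a quick check
before running an assembly can be worthwhile. The following Python sketch reads
lines in the format described above and reports common mistakes. It is an
assumption-laden illustration rather than ipyrad's own parser:
``read_barcodes`` is a hypothetical helper, and it assumes barcodes contain
only the letters A, C, G and T.

```python
# Hypothetical validation helper; ipyrad's own parser may behave differently.
def read_barcodes(lines):
    """Parse barcodes-file lines into a {sample_name: barcode} dict.

    Each non-blank line must hold one name and one barcode separated by a
    tab or spaces. A name containing a space yields extra fields and is
    therefore rejected.
    """
    barcodes = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 'name barcode', got %r" % (lineno, line))
        name, barcode = fields
        barcode = barcode.upper()
        if set(barcode) - set("ACGT"):
            raise ValueError(
                "line %d: unexpected character in barcode %r" % (lineno, barcode))
        if name in barcodes:
            raise ValueError("line %d: duplicate name %r" % (lineno, name))
        barcodes[name] = barcode
    return barcodes


example = ["sample1\tACAGG", "sample2 ATTCA", "sample3\tCGGCATA", ""]
print(read_barcodes(example))
```

To check a file on disk, the same function can be given an open file handle,
for example ``read_barcodes(open(path))``, where ``path`` is the location of
your barcodes file.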
.. _params_file:
Params file
------------
The parameter input file, which typically includes ``params.txt`` in its name,
can be created with the ``-n`` option from the ipyrad command line. This file
lists all of the :ref:`parameter settings<paramater settings>`
necessary to complete an assembly.
A description of how to create and use a params file can be found in the
:ref:`introductory tutorial<tutorial_intro_cli>`.