https://github.com/dereneaton/ipyrad
Tip revision: f09113d60bfeeba7768819d4cf6841adb3c2601d authored by Isaac Overcast on 09 May 2020, 18:03:35 UTC
"Updating ipyrad/__init__.py to version - 0.9.52
output_formats.rst
.. include:: global.rst  

.. _full_output_formats:


Output Formats
==============
By default ipyrad writes every output format it is capable of
generating. Converting between the various formats is very fast, but
if you want to save CPU time and disk space, you can enable
only specific output formats with the ``output_formats`` parameter
in the params file.

Variant Call Format \*.vcf.gz
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
VCF is a standard format for storing and manipulating sequence data. The
format is too complicated to describe here, but a good explanation is available
on the `1000 Genomes Project <http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40>`_ site.
The VCF file written by ipyrad includes full genotype information for all
bases in all loci, including information about genotype quality. Many useful
conversion and filtering options for this format are available in the software
vcftools.

.. parsed-literal::

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1A_0    1B_0    1C_0    1D_0    2E_0    2F_0    2G_0    2H_0    3I_0    3J_0    3K_0    3L_0
    0  0       .       G       .       13      PASS    NS=12;DP=235    GT:CATG 0/0:0,0,0,19    0/0:0,0,0,22    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,22    0/0:0,0,0,20    0/0:0,0,0,21    0/0:0,0,0,15    0/0:0,0,0,14    0/0:0,0,0,24    0/0:0,0,0,21
    0  1       .       T       .       13      PASS    NS=12;DP=235    GT:CATG 0/0:0,0,19,0    0/0:0,0,22,0    0/0:0,0,20,0    0/0:0,0,19,0    0/0:0,0,18,0    0/0:0,0,22,0    0/0:0,0,20,0    0/0:0,0,21,0    0/0:0,0,15,0    0/0:0,0,14,0    0/0:0,0,24,0    0/0:0,0,21,0
    0  2       .       T       .       13      PASS    NS=12;DP=235    GT:CATG 0/0:0,0,19,0    0/0:0,0,22,0    0/0:0,0,20,0    0/0:0,0,19,0    0/0:0,0,18,0    0/0:0,0,22,0    0/0:0,0,19,1    0/0:0,0,21,0    0/0:0,0,15,0    0/0:0,0,14,0    0/0:0,0,24,0    0/0:0,0,21,0
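
The per-sample fields in the rows above can be unpacked with a few lines of
Python. The sketch below is illustrative only: the helper name
``parse_sample_field`` is hypothetical, and it assumes that the four
comma-separated numbers are read depths listed in the order C, A, T, G, as the
``CATG`` tag and the example rows suggest.

.. code-block:: python

    def parse_sample_field(field, fmt="GT:CATG"):
        """Split one per-sample VCF field into a genotype and base depths."""
        values = dict(zip(fmt.split(":"), field.split(":")))
        # Assumption: depths are listed in the order C, A, T, G (the CATG tag).
        counts = (int(n) for n in values["CATG"].split(","))
        return values["GT"], dict(zip("CATG", counts))


    # One data row, shortened to two samples for the example.
    row = "0 2 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,19,1"
    columns = row.split()
    fmt, samples = columns[8], columns[9:]
    for field in samples:
        genotype, depths = parse_sample_field(field, fmt)
        print(genotype, depths)

Splitting on any whitespace keeps the sketch usable for both the tab-delimited
file and the space-aligned listing shown here.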


ipyrad format \*.loci
^^^^^^^^^^^^^^^^^^^^^
This is a custom format that is easy to read, showing each individual locus
with variable sites indicated. Custom scripts can easily parse this file to
select loci with a given amount of taxon coverage or number of variable sites.
It is also the easiest file to inspect when checking that your analyses are
working properly. A (-) indicates a variable site, and a (*) indicates that the
site is phylogenetically informative. Integers enclosed by ``|`` indicate the
locus number. Example:

.. parsed-literal::

    1A_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1B_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1C_0     GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1D_0     GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTC
    2E_0     GTTATCCGTAGCGATTATTACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    2F_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGKACGCAGCTAGTC
    2G_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    2H_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3I_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3J_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGSGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3K_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3L_0     GTTATCGGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACACAGCTAGTC
    //             -           *        * -                     -                     -  -  -         |0|
    1A_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    1B_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATMAGCTAGGCTTCGAGTCGTATC
    1C_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    1D_0     ACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2E_0     ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2F_0     ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2G_0     ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2H_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGYGATCAGCTAGGCTTCGAGTCGTATS
    3I_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3J_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3K_0     ACAGCTCTGTTACATGCATCTGTCMATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3L_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTAYC
    //               *      -        -               -   *                  -   -                   --|1|

For paired-end data the two linked loci are shown joined by an 'nnnn'
separator. Merged reads do not contain the 'nnnn'::

    1A0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1B0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1C0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGAAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1D0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    2E0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTSnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2F0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2G0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTTTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2H0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACTTCCCGGTATCCGACCT
    3I0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3J0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3K0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3L0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGAGACYAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    //                                                                 -                          *       -     -  -        *           -    *           |0|
    1A0     GACAAATCTTACATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1B0     GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1C0     GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTAATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1D0     GACAAATCTTAGTTTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGAACCAAACGCAGGTGGAGGACCCAAGAAC
    2E0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2F0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2G0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2H0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3I0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3J0     GACAAATCTCAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3K0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3L0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    //               - -- *                                         -                                                       -                            |1|
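
As a rough illustration of the kind of custom script mentioned above, the
following sketch counts samples and marked sites per locus. The function name
``summarize_loci`` is hypothetical, and the code assumes only what the examples
show: each locus ends with a line that starts with ``//`` and finishes with the
locus number enclosed in ``|``.

.. code-block:: python

    def summarize_loci(lines):
        """Yield (locus_id, n_samples, n_dash, n_star) for each locus."""
        n_samples = 0
        for line in lines:
            line = line.strip()
            if line.startswith("//"):
                # The marker line ends with the locus number between pipes.
                marks, locus_id = line.rstrip("|").rsplit("|", 1)
                yield int(locus_id), n_samples, marks.count("-"), marks.count("*")
                n_samples = 0
            elif line:
                # Any other non-empty line is one sample's sequence.
                n_samples += 1


    example = [
        "1A_0     GTTATCCG",
        "1B_0     GTTATCGG",
        "//             -  |0|",
    ]
    for locus in summarize_loci(example):
        print(locus)

To filter by taxon coverage, keep only the loci whose ``n_samples`` value
reaches the threshold you need.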


PHYLIP \*.phy
^^^^^^^^^^^^^
This is a PHYLIP formatted data file which contains all of the loci from the .loci
file concatenated into a supermatrix, with missing data for any sample filled in
with N's. This format is used by RAxML, among other phylogenetic programs. The
header here indicates that there are 12 samples and 89023 bases in each sequence.
Because the sequences are so long, the output is truncated here for clarity
(indicated by the ellipses).

.. parsed-literal::

    12 89023
    1A_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1B_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1C_0     GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1D_0     GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTCACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCAT...
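
A file like this can be read and checked against its header as follows. This is
a minimal sketch with hypothetical helper names, and it assumes a sequential
layout with one line per sample and no spaces in sample names, which is how the
example above is laid out.

.. code-block:: python

    def read_phylip(lines):
        """Return {sample_name: sequence} from a sequential PHYLIP file."""
        rows = [line for line in lines if line.strip()]
        n_samples, n_sites = (int(value) for value in rows[0].split())
        sequences = {}
        for row in rows[1:]:
            name, sequence = row.split(None, 1)
            sequences[name] = "".join(sequence.split())
        # Check the parsed data against the header before using it.
        if len(sequences) != n_samples:
            raise ValueError("sample count does not match the header")
        if any(len(seq) != n_sites for seq in sequences.values()):
            raise ValueError("sequence length does not match the header")
        return sequences


    def missing_fraction(sequence):
        """Fraction of sites in one sequence that are filled with N."""
        return sequence.upper().count("N") / len(sequence)


    example = ["2 8", "1A_0     GTTATCNN", "1B_0     GTTATCCG"]
    for name, sequence in read_phylip(example).items():
        print(name, missing_fraction(sequence))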


\*.snps.phy & \*.u.snps.phy
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Additionally we provide two different PHYLIP formatted files that
include only variable sites (SNPs). Paired loci are treated as a single
locus, meaning SNPs from the two reads are not separated in this file
(they're linked). The ``*.snps.phy`` file contains all SNPs from all
loci concatenated together, with missing values filled by ``N``'s. The
``*.u.snps.phy`` file contains one SNP sampled from each locus. If a
locus has multiple SNPs, the SNP site with the least missing data across
taxa is sampled; if several sites have equal amounts of missing data, one
of them is sampled at random. The header indicates that this file contains
12 samples and 990 bases per sample. The output below is truncated for clarity.

.. parsed-literal::

    12 990
    1A_0     GAATGACATCCTCAAACACCCTGGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACWGAGGAACCCATGAGAGACCGCCTYCARYA...
    1B_0     GAAASRCATACTCAAACACCCTKGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACAGAGGAACCCAAGAGAGACCGCCTTCAATA...


MAP/PARTITION (\*.snps.map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Because the concatenated SNPs file does not include information about which
SNPs come from which locus, we provide a *map* file with this information.
This is used by the program *tetrad* to randomly sample single SNPs
from among loci.

.. parsed-literal::

    1       rad0_snp0       0       1
    1       rad0_snp1       0       2
    1       rad0_snp2       0       3
    1       rad0_snp3       0       4
    1       rad0_snp4       0       5
    2       rad1_snp0       0       6
    2       rad1_snp1       0       7
    2       rad1_snp2       0       8
    2       rad1_snp3       0       9
    3       rad2_snp0       0       10
    3       rad2_snp1       0       11
    3       rad2_snp2       0       12
    3       rad2_snp3       0       13
    3       rad2_snp4       0       14
    3       rad2_snp5       0       15
    3       rad2_snp6       0       16
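
The sampling of one SNP per locus can be sketched in Python as below. This is
not tetrad's own code. The function name ``sample_unlinked`` is hypothetical,
and the column meanings are inferred from the example rather than stated by the
format: the first column is taken to be the locus number and the last column
the 1-based position of the SNP in the concatenated SNPs file.

.. code-block:: python

    import random
    from collections import defaultdict


    def sample_unlinked(map_lines, seed=None):
        """Pick one SNP position at random from each locus in a map file."""
        by_locus = defaultdict(list)
        for line in map_lines:
            if not line.strip():
                continue
            # Assumed columns: locus number, SNP name, unused, SNP position.
            locus, _name, _unused, position = line.split()
            by_locus[int(locus)].append(int(position))
        rng = random.Random(seed)
        return {locus: rng.choice(by_locus[locus]) for locus in sorted(by_locus)}


    example = [
        "1       rad0_snp0       0       1",
        "1       rad0_snp1       0       2",
        "2       rad1_snp0       0       6",
    ]
    print(sample_unlinked(example, seed=1))

Passing a ``seed`` makes the draw repeatable, which helps when comparing runs.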




EIGENSTRAT \*.geno & \*.u.geno
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a SNP-based format. Each line corresponds to one SNP, with one column per
sample. The value in the sample column indicates the number of copies of the 
reference allele each individual has. 9 indicates missing data. Below you will
see standard ``.geno`` output from the simulated data, so there are 12
columns, one per sample. This format is used by EIGENSTRAT, SMARTPCA, and 
ADMIXTURE, among other programs.

There is an additional ``*.u.geno`` file output that includes only unlinked
SNPs, with one SNP being randomly chosen per locus and the rest ignored.

.. parsed-literal::

    222222222220
    220202222222
    000222222222
    222122222222
    222222222122
    222022222222
    222221222222
    222222222220
    222200022222
    222122222222
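
Because every character is a single digit, the file is easy to load into a
matrix. The sketch below uses hypothetical function names and assumes diploid
samples, which is what the values 0, 1 and 2 in the example imply. It computes
the reference allele frequency of each SNP and the number of missing calls per
sample.

.. code-block:: python

    def read_geno(lines):
        """Return a list of SNP rows, each a list of per-sample integers."""
        return [[int(ch) for ch in line.strip()] for line in lines if line.strip()]


    def reference_allele_frequency(snp_row):
        """Reference allele frequency of one SNP, ignoring missing calls (9)."""
        called = [count for count in snp_row if count != 9]
        if not called:
            return None
        # Assumption: diploid samples, so each call covers two allele copies.
        return sum(called) / (2 * len(called))


    def missing_per_sample(matrix):
        """Number of SNPs coded as missing (9) for each sample column."""
        return [sum(row[i] == 9 for row in matrix) for i in range(len(matrix[0]))]


    matrix = read_geno(["222222222220", "220202222222", "000222222222"])
    for snp_row in matrix:
        print(reference_allele_frequency(snp_row))
    print(missing_per_sample(matrix))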


G-PhoCS \*.gphocs
^^^^^^^^^^^^^^^^^
This is a full sequence based format that is very similar to the native
ipyrad .loci format. It is appropriate for use with the Bayesian MCMC
demographic inference program G-PhoCS: http://compgen.cshl.edu/GPhoCS/

.. parsed-literal::

    499

    locus0 10 90
    A_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGSTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
    B_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
    C_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT


STRUCTURE \*.str & \*.u.str
^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is another SNP-based format that includes either all variable
sites (``*.str``) or one randomly selected variable site per locus
(``*.u.str``). These files are suitable input files for the population
structure analysis program STRUCTURE, as well as a few others. The output
below is truncated for clarity.

.. parsed-literal::

    1A_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0   ...
    1A_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0   ...
    1B_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0   ...
    1B_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   0   1   3   0   ...
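
In the example each sample name appears on two consecutive rows. Assuming that
these two rows hold the two allele copies of a diploid sample (an
interpretation of the example, not something the format description states),
the rows can be grouped by sample and compared as in this sketch. The function
names are hypothetical.

.. code-block:: python

    from collections import defaultdict


    def read_structure(lines):
        """Group the allele rows of a STRUCTURE file by sample name."""
        rows = defaultdict(list)
        for line in lines:
            fields = line.split()
            if not fields:
                continue
            rows[fields[0]].append([int(value) for value in fields[1:]])
        return dict(rows)


    def differing_sites(allele_rows):
        """Indices of sites where a sample's two allele rows differ."""
        first, second = allele_rows
        return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


    example = [
        "1B_0    3   3   0   2   2   1",
        "1B_0    3   3   0   2   0   1",
    ]
    samples = read_structure(example)
    print(differing_sites(samples["1B_0"]))

Truncation markers such as the ``...`` in the listing above are not part of
the real file and would need to be removed before parsing.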


NEXUS \*.nex
^^^^^^^^^^^^
This is a NEXUS formatted data file which contains all of the loci from the .loci
file concatenated into a supermatrix, but printed in an interleaved format, with 
missing data for any sample filled in with N's, and with data information appended 
to the beginning. This format is used in BEAST among other phylogenetic programs.

<TODO: Unimplemented>

