https://github.com/dereneaton/ipyrad
Tip revision: 82782ca61ae1180bbbcf2353016ff2a123971319 authored by Isaac Overcast on 08 May 2021, 23:57:05 UTC
"Updating ipyrad/__init__.py to version - 0.9.74
"Updating ipyrad/__init__.py to version - 0.9.74
Tip revision: 82782ca
pedicularis_cli.rst
.. include:: global.rst
.. _pedicularis_cli:
Sub-sampling data sets
=======================
In this tutorial we show how to subsample both the number of taxa in an Assembly,
and the amount of sequence data. Again, we use the 13 taxa *Pedicularis* data set
from **Eaton and Ree (2013)** for our example.
.. use an empirical data set for the example.
.. The data set is composed of single-end reads for a RAD-seq library prepared with
.. the PstI enzyme for 13 individuals from the *Cyathophora* clade of the angiosperm genus
.. *Pedicularis*, originally published by
.. (:ref:`link to open access article <eaton_and_ree>`).
Download the empirical example data set (*Pedicularis*)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These data are archived on the NCBI sequence read archive (SRA) under
accession id SRP021469. For convenience, the data are also hosted at a
public Dropbox link which is a bit easier to access. Run the code below to
download and decompress the fastq data files, which will save them into a
directory called ``example_empirical_data/``. The compressed file size is
approximately 1.1GB.
.. code:: bash
## download fastq data from the SRA database
>>> ipyrad --download SRP021469 example_empirical_data/
Setup a base params file
~~~~~~~~~~~~~~~~~~~~~~~~
We start by using the ``-n`` argument to create a new named Assembly.
I'll use the name ``base`` to indicate that this is the base assembly from
which we will later create several branches.
.. code:: bash
>>> ipyrad -n "base"
.. parsed-literal::
New file 'params-base.txt' created in /home/deren/Downloads
The data come to us already demultiplexed so we are going to simply set the
**sorted\_fastq\_path** to tell ipyrad the location of the data files,
and also set a **project\_dir**, which will group all of our analyses into
a single directory. For the latter I use the name of our study organism, "pedicularis".
.. parsed-literal::
## Use your text editor to enter the following values:
## The wildcard (*) tells ipyrad to select all files ending in .gz
pedicularis ## [1] [project_dir] ...
example_empirical_rad/*.gz ## [4] [sorted_fastq_path] ...
For now we'll leave the remaining parameters at their default values.
Load the fastq Sample data
~~~~~~~~~~~~~~~~~~~~~~~~~~
When the data location is entered as a **sorted_fastq_path** step 1
simply counts the number of reads for each Sample and parses the file names to
extract names for each Sample. For example, the file ``29154_superba.fastq.gz``
will be assigned to Sample ``29154_superba``. Now, run step 1 (-s 1) and
tell ipyrad to print the results when it is finished (-r).
.. code:: bash
>>> ipyrad -p params-base.txt -s 1 -r
.. parsed-literal::
--------------------------------------------------
ipyrad [v.0.2.5]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
New Assembly: base
ipyparallel setup: Local connection to 4 Engines
Step1: Linking sorted fastq data to Samples
Linking to demultiplexed fastq files in:
/home/deren/Downloads/example_empirical_rad/*.gz
13 new Samples created in 'base'.
13 fastq files linked to 13 new Samples.
Saving Assembly.
Summary stats of Assembly base
------------------------------------------------
state reads_raw
29154_superba 1 696994
30556_thamno 1 1452316
30686_cyathophylla 1 1253109
32082_przewalskii 1 964244
33413_thamno 1 636625
33588_przewalskii 1 1002923
35236_rex 1 1803858
35855_rex 1 1409843
38362_rex 1 1391175
39618_rex 1 822263
40578_rex 1 1707942
41478_cyathophylloides 1 2199740
41954_cyathophylloides 1 2199613
Sub-sampling methods
~~~~~~~~~~~~~~~~~~~~~
Assembling this full data set takes around 3 hours on a 4-core laptop, which
is actually pretty fast. However, for very large data sets you may be interested in
running an even faster analysis by using just a subset of your data. This would
allow you to more easily explore the affect of many different parameter settings
on your results before running the full data set. Two forms of subsampling are
possible: first, subsampling the number of Samples in your analysis, and second,
subselecting the number of reads in your analysis.
.. note::
Importantly, no matter what you do in ipyrad, it will never delete or
modify your original fastq data files. Assembly objects simply store
information about Samples, and Samples simply contain statistics about
data files. Samples can be discarded from an Assembly, in which case the
Assembly loses some information, however, this does not delete any data files.
Nevertheless, to retain Sample information ipyrad only allows Samples to be
discarded during branching, so that Sample information is always retained
in the parent branch. See the example below.
**Subselecting samples**:
You can subselect Samples by creating a new branch. Here we call the new
branch "sub4", and pass it a list of Sample names in addition to the new branch
name. **This does NOT delete any files** (see above), but simply copies
a subset of information from "base" to the new assembly "sub4".
If you accidentally discarded the wrong Samples you could re-create "sub4" by
simply branching "base" again with a different list of Samples.
.. code:: bash
## Create new branch of base Assembly named sub4 and pass it
## the names of four Samples (if no names it keeps all Samples)
## names should be separated by a comma and spaces are optional.
## The '\' character simply continues our list across a line-break
## for easier viewing.
>>> ipyrad -p params-base.txt -b sub4 29154_superba 30556_thamno \
30686_cyathophylla 32082_przewalskii
.. parsed-literal::
loading Assembly: base
from saved path: ~/Documents/ipyrad/tests/cli/cli.json
Sample name not found: 4
Creating a new branch called 'sub4' with 4 Samples
Writing new params file to params-sub4.txt
.. code:: bash
## print stats for sub4 to confirm that Samples were discarded
>>> ipyrad -p params-sub4.txt -r
.. parsed-literal::
Summary stats of Assembly sub4
------------------------------------------------
state reads_raw
39618_rex 1 822263
40578_rex 1 1707942
41478_cyathophylloides 1 2199740
41954_cyathophylloides 1 2199613
**Running step 2**:
.. code:: bash
## run step2
>>> ipyrad -p params-sub4.txt -s 2 -r
.. parsed-literal::
--------------------------------------------------
ipyrad [v.0.2.5]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: base
from saved path: ~/Downloads/pedicularis/base.json
ipyparallel setup: Local connection to 4 Engines
Step2: Filtering reads
[####################] 100% processing reads | 0:02:48
Saving Assembly.
Summary stats of Assembly base
------------------------------------------------
state reads_raw reads_filtered
29154_superba 2 696994 92448
30556_thamno 2 1452316 93666
30686_cyathophylla 2 1253109 89122
32082_przewalskii 2 964244 92016
Run step 3 (clustering and aligning)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is generally one of the longest running steps, depending on how
many unique clusters (loci) there are in each sample. Using more
processors will allow it run much faster. On my laptop with 4 cores this
step finishes in approximately 30 minutes. From the results you can see
that there are many clusters found in each sample (clusters\_total), but
very few are recovered at high depth (clusters\_hidepth).
The coverage would of course be much better if we did not subsample the
data set in step2. Also, this data set has fairly low coverage to begin with.
We can either lower the mindepth setting to allow us to use more of this low
depth data, or we can decide to go ahead with our mindepth setting (currently
at the default of 6) and simply discard most of our data. I know, how about
we create a branch so that we can do both!
.. code:: bash
## create a lowdepth branch
>>> ipyrad -p params-sub4.txt -b sub4-lowdepth.
.. parsed-literal::
loading Assembly: base
from saved path: ~/Downloads/pedicularis/base.json
Creating a branch of assembly base called sub4
Writing new params file to params-sub4.txt
.. code:: bash
## create a lowdepth branch
>>> ipyrad -p params-base.txt -s 3
.. parsed-literal::
--------------------------------------------------
ipyrad [v.0.2.5]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: base
from saved path: ~/Downloads/pedicularis/base.json
ipyparallel setup: Local connection to 4 Engines
Step3: Clustering/Mapping reads
[####################] 100% dereplicating | 0:00:01
[####################] 100% clustering | 0:01:01
[####################] 100% chunking | 0:00:00
[####################] 100% aligning | 0:25:44
[####################] 100% concatenating | 0:00:05
Saving Assembly.
Use a text editor to enter the following new **mindepth_majrule** value
in the file ``params-sub4-lowdepth.txt``:
.. parsed-literal::
## 2 ## [mindepth_majrule] ...
Steps 4-5 (joint estimation of error rate & heterozygosity)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see in the results the error rate is about 10X the heterozygosity
estimate. The latter does not vary significantly across samples. With
data of greater depth the estimates will be more accurate.
.. code:: bash
>>> ipyrad -p params-sub4.txt -s 45 -r
>>> ipyrad -p params-sub4-lowdepth.txt -s 45 -r
.. parsed-literal::
... add output here
Run step 5 (consensus base calls)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is another step that can be computationally intensive. Here it
takes about 15 minutes on 4 cores. Although many clusters are filtered
out at this step (especially due to low depth) their information is
retained for the VCF output later so that the coverage/depth of excluded
reads can be examined.
.. code:: bash
ipyrad -p params-base.txt -s 5
ipyrad -p params-base.txt -r
.. parsed-literal::
--------------------------------------------------
ipyrad [v.0.1.70]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: base [~/Downloads/pedicularis/base.json]
ipyparallel setup: Local connection to 4 Engines
Step5: Consensus base calling
Diploid base calls and paralog filter (max haplos = 2)
error rate (mean, std): 0.00703, 0.00331
heterozyg. (mean, std): 0.04071, 0.00396
Saving Assembly.
Summary stats of Assembly base
------------------------------------------------
state reads_raw reads_filtered clusters_total
29154_superba 5 696994 92448 45531
30556_thamno 5 1452316 93666 45745
30686_cyathophylla 5 1253109 89122 50306
32082_przewalskii 5 964244 92016 44242
33413_thamno 5 636625 89428 52053
33588_przewalskii 5 1002923 92418 46674
35236_rex 5 1803858 92807 57801
35855_rex 5 1409843 92883 45139
38362_rex 5 1391175 93363 41580
39618_rex 5 822263 92096 47295
40578_rex 5 1707942 93386 45295
41478_cyathophylloides 5 2199740 93846 41965
41954_cyathophylloides 5 2199613 91756 47735
clusters_hidepth hetero_est error_est reads_consens
29154_superba 978 0.038530 0.006630 821
30556_thamno 987 0.038266 0.006009 810
30686_cyathophylla 757 0.044680 0.004627 606
32082_przewalskii 686 0.046796 0.007077 523
33413_thamno 728 0.041466 0.004528 597
33588_przewalskii 904 0.041445 0.011253 709
35236_rex 767 0.042423 0.005119 629
35855_rex 1106 0.035123 0.012086 844
38362_rex 1140 0.041206 0.004702 943
39618_rex 1258 0.040696 0.009077 1011
40578_rex 832 0.045177 0.002789 689
41478_cyathophylloides 992 0.041085 0.004468 872
41954_cyathophylloides 1307 0.032387 0.013090 983
Full stats files
------------------------------------------------
step 1: None
step 2: ./pedicularis/base_edits/s2_rawedit_stats.txt
step 3: ./pedicularis/base_clust_0.85/s3_cluster_stats.txt
step 4: ./pedicularis/base_clust_0.85/s4_joint_estimate.txt
step 5: ./pedicularis/base_consens/s5_consens_stats.txt
step 6: None
step 7: None
Step 6 (clustering and aligning across samples)
-----------------------------------------------
This step clusters consensus loci across Samples using the same
threshold for sequence similarity as used in step3.
.. code:: bash
ipyrad -p params-base.txt -s 6
.. parsed-literal::
--------------------------------------------------
ipyrad [v.0.1.70]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: base [~/Downloads/pedicularis/base.json]
ipyparallel setup: Local connection to 4 Engines
Step6: Clustering across 13 samples at 0.85 similarity
Saving Assembly.
Branch the assembly
-------------------
Here we will branch the assembly to create different assemblies that we
will use as our final outputs. The main parameter we will focus on is
the ``min_samples_locus``, which is the minimum number of samples that
must have data at a locus for the locus to be retained in the data set.
We create a ``min4``, ``min8``, and ``min12`` data sets.
.. code:: bash
ipyrad -p params-base.txt -b min4
ipyrad -p params-base.txt -b min8
ipyrad -p params-base.txt -b min12
.. parsed-literal::
loading Assembly: base [~/Downloads/pedicularis/base.json]
Creating a branch of assembly base called min4
Writing new params file to params-min4.txt
loading Assembly: base [~/Downloads/pedicularis/base.json]
Creating a branch of assembly base called min8
Writing new params file to params-min8.txt
loading Assembly: base [~/Downloads/pedicularis/base.json]
Creating a branch of assembly base called min12
Writing new params file to params-min12.txt
Change the parameter settings in params.txt for each assembly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. parsed-literal::
## Enter the changes below into the params files using a text editor
## in the file params-min4.txt
4 ## [21] [min_samples_locus] ...
## in the file params-min8.txt
8 ## [21] [min_samples_locus] ...
## in the file params-min12.txt
12 ## [21] [min_samples_locus] ...
Step 7 (final filtering and create output files)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Filter and create output files for the three assemblies with different
values for the parameter ``min_samples_locus``.
.. code:: bash
ipyrad -p params-min4.txt -s 7
ipyrad -p params-min8.txt -s 7
ipyrad -p params-min12.txt -s 7
.. parsed-literal::
--------------------------------------------------
ipyrad [v.0.1.70]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: min4 [~/Downloads/pedicularis/min4.json]
ipyparallel setup: Local connection to 4 Engines
Step7: Filter and write output files for 13 Samples.
Outfiles written to: ~/Downloads/pedicularis/min4_outfiles
Saving Assembly.
--------------------------------------------------
ipyrad [v.0.1.70]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: min8 [~/Downloads/pedicularis/min8.json]
ipyparallel setup: Local connection to 4 Engines
Step7: Filter and write output files for 13 Samples.
Outfiles written to: ~/Downloads/pedicularis/min8_outfiles
Saving Assembly.
--------------------------------------------------
ipyrad [v.0.1.70]
Interactive assembly and analysis of RADseq data
--------------------------------------------------
loading Assembly: min12 [~/Downloads/pedicularis/min12.json]
ipyparallel setup: Local connection to 4 Engines
Step7: Filter and write output files for 13 Samples.
Outfiles written to: ~/Downloads/pedicularis/min12_outfiles
Saving Assembly.
Take a look at the stats summary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each assembly that finishes step 7 will create a stats.txt output summary
in the 'assembly_name'_outfiles/ directory. This includes information about
which filters removed data from the assembly, how many loci were recovered
per sample, how many samples had data for each locus, and how many variable
sites are in the assembled data.
.. code:: python
cat ./pedicularis/min4_outfiles/min4_stats.txt
.. parsed-literal::
## The number of loci caught by each filter.
## ipyrad API location: [assembly].statsfiles.s7_filters
locus_filtering
total_prefiltered_loci 1206
filtered_by_rm_duplicates 2
filtered_by_max_indels 159
filtered_by_max_snps 0
filtered_by_max_hetero 15
filtered_by_min_sample 921
filtered_by_edge_trim 0
total_filtered_loci 221
## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples
sample_coverage
29154_superba 151
30556_thamno 120
30686_cyathophylla 118
32082_przewalskii 132
33413_thamno 155
33588_przewalskii 89
35236_rex 154
35855_rex 145
38362_rex 154
39618_rex 156
40578_rex 90
41478_cyathophylloides 156
41954_cyathophylloides 97
## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci
locus_coverage sum_coverage
1 NaN 0
2 NaN 0
3 NaN 0
4 65 65
5 36 101
6 16 117
7 10 127
8 10 137
9 6 143
10 2 145
11 12 157
12 7 164
13 57 221
## The distribution of SNPs (var and pis) across loci.
## pis = parsimony informative site (minor allele in >1 sample)
## var = all variable sites (pis + autapomorphies)
## ipyrad API location: [assembly].stats_dfs.s7_snps
var sum_var pis sum_pis
0 260 260 1140 1140
1 130 390 45 1185
2 130 520 12 1197
3 133 653 3 1200
4 90 743 4 1204
5 110 853 0 1204
6 77 930 0 1204
7 70 1000 2 1206
8 64 1064 0 1206
9 32 1096 0 1206
10 31 1127 0 1206
11 30 1157 0 1206
12 16 1173 0 1206
13 14 1187 0 1206
14 5 1192 0 1206
15 6 1198 0 1206
16 2 1200 0 1206
17 4 1204 0 1206
18 2 1206 0 1206
Take a peek at the .loci (easily human-readable) output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is one fo the first places I usually look when an assembly finishes. It
provides a clean view of the data with variable sites (-) and parsimony informative
SNPs (*) highlighted. Use the unix commands **less** or **head** to look at this
file briefly.
.. code:: bash
## head -n 50 prints just the first 50 lines of the file to stdout
head -n 50 pedicularis/min4_outfiles/min4.loci
.. parsed-literal::
29154_superba CCTTGGTSACCTTMGCWCCWGAYGGRTCCTTCTTCTCCACACTCTTKATRACACCAACAGCAACAGTC
32082_przewalskii CCTTGGTSACCTTRGCWCCWGAYGG-TCCTTCTTCTCCACACTCTTGATRACACCAACAGCAACAGTC
35236_rex CCTTGGTCACCTTAGCACCTGATGG-TCCTTCTTCTCCACACTCTTGATGACACCAACAGCAACAGTC
38362_rex CCTTGGTCACCTTAGCACCTGATGGRTCCTTCTTCTCCACACTCTTGATGACACCAACAGCAAC-GTC
// - - - - - - - - |
33413_thamno TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
33588_przewalskii GAGACAACCAGTGCCTTCTTGTCTATCAGCCTCACACCTGTCTTCGGTACTTTCGGTACTTAGAAGCA
35855_rex TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
38362_rex TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
39618_rex TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
40578_rex TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
41478_cyathophylloides TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
41954_cyathophylloides TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
// - * - - |
29154_superba AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
30556_thamno AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
30686_cyathophylla AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
32082_przewalskii AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
33413_thamno AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
33588_przewalskii AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
35236_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
35855_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
38362_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
39618_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
40578_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
41478_cyathophylloides AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
41954_cyathophylloides AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGCNCTAGCTAACGCGCCC
// |
30686_cyathophylla TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
32082_przewalskii TAGCAATAAATGCAAGAATATTGACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
33588_przewalskii TAGCAATAAATGCAAGAATATTGACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
35236_rex TAGCAATAAATGCAAGAATATTKACTTCCATAATTTCGTCKGTTTTTTAATTCGCAATAACTCGGGAT
38362_rex TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT
39618_rex TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT
40578_rex TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT
41478_cyathophylloides TAGCAATAAATGCAAGAATATTT-CTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
// * * |
35236_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAKCGGATCATCATGTGCGAGTTGAC
35855_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTGGAC
38362_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTTGAC
39618_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTTGAC
// - - |
30556_thamno C-TTCTGATTAATCTG-AAATTGTAATCAAATGAAATYAAACAGCAAAAACAATGACTSGATAAACTA
33413_thamno CTTTCTGWTTAATCTGMAAATTGTAATCAAATGAAATCAAACARCAAAAACAATGACTYGAYAAWCYR
35236_rex C-TTCTGATTAATCTG-AAATTGTAATCAAATGAAATCAAACA-CAAAAACAATGACT-GATAAACTA
41478_cyathophylloides CTTTCTGATTAATCTGCAAATTGTAATCAAATGAAATCAAACAGCAAAAACAATAACTTGATAAAATA
// - - - - - - - ----|
30556_thamno GAAAGATWT-AYTGTAGACGTAWTKGATCRSAGGWKGAGGTGATGWATCATAWTCAT-ATCAGAGGAG
38362_rex GAAAGATTTCACTGTAGACGTAATGGATCAGAGGTTGAGGTGATGRATCATAATCATGATCAGAGGWG
39618_rex GAAAGATTTCACTGTAGACGTAWWGGATCMSAGGWKGAGGTGATGRATCATAATCATKATCAGAGGAG
peek at the .phy files
~~~~~~~~~~~~~~~~~~~~~~
This is the concatenated sequence file of all loci in the data set. It is typically
used in phylogenetic analyses, like in the program *raxml*.
.. code:: bash
## cut -c 1-80 prints only the first 80 characters of the file
cut -c 1-80 pedicularis/min4_outfiles/min4.phy
.. parsed-literal::
13 15034
29154_superba CCTTGGTSACCTTMGCWCCWGAYGGRTCCTTCTTCTCCACACTCTTKATRACA
30556_thamno NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
30686_cyathophylla NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
32082_przewalskii CCTTGGTSACCTTRGCWCCWGAYGGNTCCTTCTTCTCCACACTCTTGATRACA
33413_thamno NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
33588_przewalskii NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
35236_rex CCTTGGTCACCTTAGCACCTGATGGNTCCTTCTTCTCCACACTCTTGATGACA
35855_rex NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
38362_rex CCTTGGTCACCTTAGCACCTGATGGRTCCTTCTTCTCCACACTCTTGATGACA
39618_rex NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
40578_rex NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
41478_cyathophylloides NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
41954_cyathophylloides NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
peek at the .snp file
~~~~~~~~~~~~~~~~~~~~~
This is similar to the phylip file format, but only variable site columns are
included. All SNPs are the file, in contrast to the .usnps file, which selects
only a single SNP per locus.
.. code:: bash
## cut -c 1-80 prints only the first 80 characters of the file
cut -c 1-80 pedicularis/min4_outfiles/min4.snp
.. parsed-literal::
13 711
29154_superba SMWWYRKRNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGYATNYGAMGT
30556_thamno NNNNNNNNNNNNNNNNA-YGGSTACTACWYWTKRSWKWW-TAGTAT-NNNNGT
30686_cyathophylla NNNNNNNNNNNNTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCRACTT
32082_przewalskii SRWWY-GRNNNNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATNNNNNGG
33413_thamno NNNNNNNNTTTGNNNNWMCRGYYWCYRYNNNNNNNNNNNNNNNNNNNNNNNGT
33588_przewalskii NNNNNNNNGTCTGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNRTRKNNNNNGG
35236_rex CAATT-GGNNNNKKKTA-C-G-TACTACNNNNNNNNNNNNNNG-ATMNNNNGT
35855_rex NNNNNNNNTGTGNNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGRCGT
38362_rex CAATTRGGTTTGTTGTNNNNNNNNNNNNTCATGAGTTRAGTWNNNNMCGACGT
39618_rex NNNNNNNNTTTGTTGTNNNNNNNNNNNNTCWWGMSWKRAKTAGTATMCGRCGT
40578_rex NNNNNNNNTTTGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATNNNNNGT
41478_cyathophylloides NNNNNNNNTGTGTGNNACCGATTAATACWKATKRSWKRAGYAGTATNCRACGT
41954_cyathophylloides NNNNNNNNTGTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGT
peek at the .snp file for the min12 assembly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Similar to above but you can see that there is much less missing data (Ns).
.. code:: bash
## cut -c 1-80 prints only the first 80 characters of the file
cut -c 1-80 pedicularis/min12_outfiles/min12.snp
.. parsed-literal::
13 44
29154_superba GTTGACGCGGTTCTCCCCCGCGCAGTGCAGTCCGCTAANNAGTT
30556_thamno GTTGACGCGGTTCTCACCCGCGCAGCGCGGTACACTAAGAAATT
30686_cyathophylla TTTGACGCGGTTCTCACCCGCGCAGCGCGGTCCACTAAGAAATT
32082_przewalskii GGTGATGCAACCCCTACCCGCGTGGCGTARYCAGTTARGAGACC
33413_thamno GTTGACGCGGTTTTCACCCGCGCAGCGCGGTACACTAAGAAATT
33588_przewalskii GGTGATGCAACCCCTACCCGCGTG-TC-ARYCAGTTARGAGACC
35236_rex GTTGACGCGGTTCTCACCCGCGCAGCGCGRYACACTWAKMAATT
35855_rex GTKRMYKMGGTTCTCACCCGCGCAGCGCGGTACACTAAGANNNT
38362_rex GTTGACGCGGTTCTCACCCGCGCAGCGCGGTCCACTAAGAAATT
39618_rex GTTGACGCGGTTCTCAYYYRYRCAGCGCGGTCCACTAAGAAATT
40578_rex GTTGACGCGGTTCTCACCCGCGCAGCGCGGTACACTAAGAAATT
41478_cyathophylloides GTTGACGCGGTTCTCACCCGCGCAATGCAGTCCGCCAAGAAATT
41954_cyathophylloides GTTGACGCGGTTCTCACCCGCGCAATGCAGTCCGCCAAGAAATT
downstream analyses
~~~~~~~~~~~~~~~~~~~
...