.. include:: global.rst

.. _pedicularis_cli:

Sub-sampling data sets
=======================

In this tutorial we show how to subsample both the number of taxa in an
Assembly and the amount of sequence data. Again, we use the 13-taxon
*Pedicularis* data set from **Eaton and Ree (2013)** for our example.

.. use an empirical data set for the example.
.. The data set is composed of single-end reads for a RAD-seq library prepared with
.. the PstI enzyme for 13 individuals from the *Cyathophora* clade of the angiosperm genus
.. *Pedicularis*, originally published by
.. (:ref:`link to open access article `).

Download the empirical example data set (*Pedicularis*)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These data are archived on the NCBI sequence read archive (SRA) under
accession id SRP021469. For convenience, the data are also hosted at a
public Dropbox link which is a bit easier to access. Run the code below to
download and decompress the fastq data files, which will save them into a
directory called ``example_empirical_rad/``. The compressed file size is
approximately 1.1GB.

.. code:: bash

    ## download fastq data from the SRA database
    >>> ipyrad --download SRP021469 example_empirical_rad/

Set up a base params file
~~~~~~~~~~~~~~~~~~~~~~~~~

We start by using the ``-n`` argument to create a new named Assembly.
I'll use the name ``base`` to indicate that this is the base assembly from
which we will later create several branches.

.. code:: bash

    >>> ipyrad -n "base"

.. parsed-literal::

    New file 'params-base.txt' created in /home/deren/Downloads

The data come to us already demultiplexed, so we simply set the
**sorted\_fastq\_path** to tell ipyrad the location of the data files, and
also set a **project\_dir**, which will group all of our analyses into a
single directory. For the latter I use the name of our study organism,
"pedicularis".

.. parsed-literal::

    ## Use your text editor to enter the following values:
    ## The wildcard (*) tells ipyrad to select all files ending in .gz
    pedicularis                    ## [1] [project_dir] ...
    example_empirical_rad/*.gz     ## [4] [sorted_fastq_path] ...

For now we'll leave the remaining parameters at their default values.

Load the fastq Sample data
~~~~~~~~~~~~~~~~~~~~~~~~~~

When the data location is entered as a **sorted_fastq_path**, step 1 simply
counts the number of reads for each Sample and parses the file names to
extract names for each Sample. For example, the file
``29154_superba.fastq.gz`` will be assigned to Sample ``29154_superba``.
Now, run step 1 (-s 1) and tell ipyrad to print the results when it is
finished (-r).

.. code:: bash

    >>> ipyrad -p params-base.txt -s 1 -r

.. parsed-literal::

    --------------------------------------------------
     ipyrad [v.0.2.5]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     New Assembly: base
     ipyparallel setup: Local connection to 4 Engines

     Step1: Linking sorted fastq data to Samples
       Linking to demultiplexed fastq files in:
         /home/deren/Downloads/example_empirical_rad/*.gz
       13 new Samples created in 'base'.
       13 fastq files linked to 13 new Samples.
     Saving Assembly.

    Summary stats of Assembly base
    ------------------------------------------------
                            state  reads_raw
    29154_superba               1     696994
    30556_thamno                1    1452316
    30686_cyathophylla          1    1253109
    32082_przewalskii           1     964244
    33413_thamno                1     636625
    33588_przewalskii           1    1002923
    35236_rex                   1    1803858
    35855_rex                   1    1409843
    38362_rex                   1    1391175
    39618_rex                   1     822263
    40578_rex                   1    1707942
    41478_cyathophylloides      1    2199740
    41954_cyathophylloides      1    2199613

Sub-sampling methods
~~~~~~~~~~~~~~~~~~~~~

Assembling this full data set takes around 3 hours on a 4-core laptop,
which is actually pretty fast. However, for very large data sets you may be
interested in running an even faster analysis by using just a subset of
your data.
This would allow you to more easily explore the effect of many different
parameter settings on your results before running the full data set. Two
forms of subsampling are possible: first, subsampling the number of Samples
in your analysis, and second, subsampling the number of reads in your
analysis.

.. note::

    Importantly, no matter what you do in ipyrad, it will never delete or
    modify your original fastq data files. Assembly objects simply store
    information about Samples, and Samples simply contain statistics about
    data files. Samples can be discarded from an Assembly, in which case
    the Assembly loses some information; however, this does not delete any
    data files. Nevertheless, to retain Sample information ipyrad only
    allows Samples to be discarded during branching, so that Sample
    information is always retained in the parent branch. See the example
    below.

**Subselecting samples**:
You can subselect Samples by creating a new branch. Here we call the new
branch "sub4", and pass it a list of Sample names in addition to the new
branch name. **This does NOT delete any files** (see above), but simply
copies a subset of information from "base" to the new assembly "sub4". If
you accidentally discard the wrong Samples you can re-create "sub4" by
simply branching "base" again with a different list of Samples.

.. code:: bash

    ## Create a new branch of the base Assembly named sub4 and pass it
    ## the names of four Samples (if no names are given it keeps all Samples).
    ## Names are separated by spaces, as in the command below.
    ## The '\' character simply continues our list across a line-break
    ## for easier viewing.
    >>> ipyrad -p params-base.txt -b sub4 29154_superba 30556_thamno \
                   30686_cyathophylla 32082_przewalskii

.. parsed-literal::

    loading Assembly: base
    from saved path: ~/Documents/ipyrad/tests/cli/cli.json
    Sample name not found: 4
    Creating a new branch called 'sub4' with 4 Samples
    Writing new params file to params-sub4.txt

.. code:: bash

    ## print stats for sub4 to confirm that Samples were discarded
    >>> ipyrad -p params-sub4.txt -r

.. parsed-literal::

    Summary stats of Assembly sub4
    ------------------------------------------------
                            state  reads_raw
    39618_rex                   1     822263
    40578_rex                   1    1707942
    41478_cyathophylloides      1    2199740
    41954_cyathophylloides      1    2199613

**Running step 2**:

.. code:: bash

    ## run step 2
    >>> ipyrad -p params-sub4.txt -s 2 -r

.. parsed-literal::

    --------------------------------------------------
     ipyrad [v.0.2.5]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: base
     from saved path: ~/Downloads/pedicularis/base.json
     ipyparallel setup: Local connection to 4 Engines

     Step2: Filtering reads
     [####################] 100%  processing reads      | 0:02:48
     Saving Assembly.

    Summary stats of Assembly base
    ------------------------------------------------
                            state  reads_raw  reads_filtered
    29154_superba               2     696994           92448
    30556_thamno                2    1452316           93666
    30686_cyathophylla          2    1253109           89122
    32082_przewalskii           2     964244           92016

Run step 3 (clustering and aligning)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is generally one of the longest running steps, depending on how many
unique clusters (loci) there are in each sample. Using more processors will
allow it to run much faster. On my laptop with 4 cores this step finishes
in approximately 30 minutes. From the results you can see that there are
many clusters found in each sample (clusters\_total), but very few are
recovered at high depth (clusters\_hidepth). The coverage would of course
be much better if we had not subsampled the data set in step 2. Also, this
data set has fairly low coverage to begin with. We can either lower the
mindepth setting to allow us to use more of this low depth data, or we can
decide to go ahead with our mindepth setting (currently at the default of
6) and simply discard most of our data. I know, how about we create a
branch so that we can do both!

.. code:: bash

    ## create a lowdepth branch
    >>> ipyrad -p params-sub4.txt -b sub4-lowdepth

.. parsed-literal::

    loading Assembly: base
    from saved path: ~/Downloads/pedicularis/base.json
    Creating a branch of assembly base called sub4
    Writing new params file to params-sub4.txt

.. code:: bash

    ## run step 3
    >>> ipyrad -p params-base.txt -s 3

.. parsed-literal::

    --------------------------------------------------
     ipyrad [v.0.2.5]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: base
     from saved path: ~/Downloads/pedicularis/base.json
     ipyparallel setup: Local connection to 4 Engines

     Step3: Clustering/Mapping reads
     [####################] 100%  dereplicating         | 0:00:01
     [####################] 100%  clustering            | 0:01:01
     [####################] 100%  chunking              | 0:00:00
     [####################] 100%  aligning              | 0:25:44
     [####################] 100%  concatenating         | 0:00:05
     Saving Assembly.

Use a text editor to enter the following new **mindepth_majrule** value in
the file ``params-sub4-lowdepth.txt``:

.. parsed-literal::

    ## 2     ## [mindepth_majrule] ...

Steps 4-5 (joint estimation of error rate & heterozygosity)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As you can see in the results, the error rate is about 10X the
heterozygosity estimate. The latter does not vary significantly across
samples. With data of greater depth the estimates will be more accurate.

.. code:: bash

    >>> ipyrad -p params-sub4.txt -s 45 -r
    >>> ipyrad -p params-sub4-lowdepth.txt -s 45 -r

.. parsed-literal::

    ... add output here

Run step 5 (consensus base calls)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is another step that can be computationally intensive. Here it takes
about 15 minutes on 4 cores. Although many clusters are filtered out at
this step (especially due to low depth), their information is retained for
the VCF output later so that the coverage/depth of excluded reads can be
examined.

.. code:: bash

    ipyrad -p params-base.txt -s 5
    ipyrad -p params-base.txt -r

.. parsed-literal::

    --------------------------------------------------
     ipyrad [v.0.1.70]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: base [~/Downloads/pedicularis/base.json]
     ipyparallel setup: Local connection to 4 Engines

     Step5: Consensus base calling
       Diploid base calls and paralog filter (max haplos = 2)
       error rate (mean, std):  0.00703, 0.00331
       heterozyg. (mean, std):  0.04071, 0.00396
     Saving Assembly.

    Summary stats of Assembly base
    ------------------------------------------------
                            state  reads_raw  reads_filtered  clusters_total
    29154_superba               5     696994           92448           45531
    30556_thamno                5    1452316           93666           45745
    30686_cyathophylla          5    1253109           89122           50306
    32082_przewalskii           5     964244           92016           44242
    33413_thamno                5     636625           89428           52053
    33588_przewalskii           5    1002923           92418           46674
    35236_rex                   5    1803858           92807           57801
    35855_rex                   5    1409843           92883           45139
    38362_rex                   5    1391175           93363           41580
    39618_rex                   5     822263           92096           47295
    40578_rex                   5    1707942           93386           45295
    41478_cyathophylloides      5    2199740           93846           41965
    41954_cyathophylloides      5    2199613           91756           47735

                            clusters_hidepth  hetero_est  error_est  reads_consens
    29154_superba                        978    0.038530   0.006630            821
    30556_thamno                         987    0.038266   0.006009            810
    30686_cyathophylla                   757    0.044680   0.004627            606
    32082_przewalskii                    686    0.046796   0.007077            523
    33413_thamno                         728    0.041466   0.004528            597
    33588_przewalskii                    904    0.041445   0.011253            709
    35236_rex                            767    0.042423   0.005119            629
    35855_rex                           1106    0.035123   0.012086            844
    38362_rex                           1140    0.041206   0.004702            943
    39618_rex                           1258    0.040696   0.009077           1011
    40578_rex                            832    0.045177   0.002789            689
    41478_cyathophylloides               992    0.041085   0.004468            872
    41954_cyathophylloides              1307    0.032387   0.013090            983

    Full stats files
    ------------------------------------------------
    step 1: None
    step 2: ./pedicularis/base_edits/s2_rawedit_stats.txt
    step 3: ./pedicularis/base_clust_0.85/s3_cluster_stats.txt
    step 4: ./pedicularis/base_clust_0.85/s4_joint_estimate.txt
    step 5: ./pedicularis/base_consens/s5_consens_stats.txt
    step 6: None
    step 7: None

Step 6 (clustering and aligning across samples)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This step clusters consensus loci across Samples using the same threshold
for sequence similarity as used in step 3.

.. code:: bash

    ipyrad -p params-base.txt -s 6

.. parsed-literal::

    --------------------------------------------------
     ipyrad [v.0.1.70]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: base [~/Downloads/pedicularis/base.json]
     ipyparallel setup: Local connection to 4 Engines

     Step6: Clustering across 13 samples at 0.85 similarity
     Saving Assembly.

Branch the assembly
~~~~~~~~~~~~~~~~~~~

Here we will branch the assembly to create different assemblies that we
will use as our final outputs. The main parameter we will focus on is
``min_samples_locus``, which is the minimum number of samples that must
have data at a locus for the locus to be retained in the data set. We
create ``min4``, ``min8``, and ``min12`` data sets.

.. code:: bash

    ipyrad -p params-base.txt -b min4
    ipyrad -p params-base.txt -b min8
    ipyrad -p params-base.txt -b min12

.. parsed-literal::

    loading Assembly: base [~/Downloads/pedicularis/base.json]
    Creating a branch of assembly base called min4
    Writing new params file to params-min4.txt

    loading Assembly: base [~/Downloads/pedicularis/base.json]
    Creating a branch of assembly base called min8
    Writing new params file to params-min8.txt

    loading Assembly: base [~/Downloads/pedicularis/base.json]
    Creating a branch of assembly base called min12
    Writing new params file to params-min12.txt

Change the parameter settings in params.txt for each assembly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. parsed-literal::

    ## Enter the changes below into the params files using a text editor

    ## in the file params-min4.txt
    4       ## [21] [min_samples_locus] ...

    ## in the file params-min8.txt
    8       ## [21] [min_samples_locus] ...

    ## in the file params-min12.txt
    12      ## [21] [min_samples_locus] ...

Step 7 (final filtering and create output files)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filter and create output files for the three assemblies with different
values for the parameter ``min_samples_locus``.

.. code:: bash

    ipyrad -p params-min4.txt -s 7
    ipyrad -p params-min8.txt -s 7
    ipyrad -p params-min12.txt -s 7

.. parsed-literal::

    --------------------------------------------------
     ipyrad [v.0.1.70]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: min4 [~/Downloads/pedicularis/min4.json]
     ipyparallel setup: Local connection to 4 Engines

     Step7: Filter and write output files for 13 Samples.
     Outfiles written to: ~/Downloads/pedicularis/min4_outfiles
     Saving Assembly.

    --------------------------------------------------
     ipyrad [v.0.1.70]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: min8 [~/Downloads/pedicularis/min8.json]
     ipyparallel setup: Local connection to 4 Engines

     Step7: Filter and write output files for 13 Samples.
     Outfiles written to: ~/Downloads/pedicularis/min8_outfiles
     Saving Assembly.

    --------------------------------------------------
     ipyrad [v.0.1.70]
     Interactive assembly and analysis of RADseq data
    --------------------------------------------------
     loading Assembly: min12 [~/Downloads/pedicularis/min12.json]
     ipyparallel setup: Local connection to 4 Engines

     Step7: Filter and write output files for 13 Samples.
     Outfiles written to: ~/Downloads/pedicularis/min12_outfiles
     Saving Assembly.

Take a look at the stats summary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each assembly that finishes step 7 will create a stats.txt output summary
in the 'assembly_name'_outfiles/ directory.
This includes information about which filters removed data from the
assembly, how many loci were recovered per sample, how many samples had
data for each locus, and how many variable sites are in the assembled data.

.. code:: bash

    cat ./pedicularis/min4_outfiles/min4_stats.txt

.. parsed-literal::

    ## The number of loci caught by each filter.
    ## ipyrad API location: [assembly].statsfiles.s7_filters

                               locus_filtering
    total_prefiltered_loci                1206
    filtered_by_rm_duplicates                2
    filtered_by_max_indels                 159
    filtered_by_max_snps                     0
    filtered_by_max_hetero                  15
    filtered_by_min_sample                 921
    filtered_by_edge_trim                    0
    total_filtered_loci                    221

    ## The number of loci recovered for each Sample.
    ## ipyrad API location: [assembly].stats_dfs.s7_samples

                            sample_coverage
    29154_superba                       151
    30556_thamno                        120
    30686_cyathophylla                  118
    32082_przewalskii                   132
    33413_thamno                        155
    33588_przewalskii                    89
    35236_rex                           154
    35855_rex                           145
    38362_rex                           154
    39618_rex                           156
    40578_rex                            90
    41478_cyathophylloides              156
    41954_cyathophylloides               97

    ## The number of loci for which N taxa have data.
    ## ipyrad API location: [assembly].stats_dfs.s7_loci

        locus_coverage  sum_coverage
    1              NaN             0
    2              NaN             0
    3              NaN             0
    4               65            65
    5               36           101
    6               16           117
    7               10           127
    8               10           137
    9                6           143
    10               2           145
    11              12           157
    12               7           164
    13              57           221

    ## The distribution of SNPs (var and pis) across loci.
    ## pis = parsimony informative site (minor allele in >1 sample)
    ## var = all variable sites (pis + autapomorphies)
    ## ipyrad API location: [assembly].stats_dfs.s7_snps

        var  sum_var   pis  sum_pis
    0   260      260  1140     1140
    1   130      390    45     1185
    2   130      520    12     1197
    3   133      653     3     1200
    4    90      743     4     1204
    5   110      853     0     1204
    6    77      930     0     1204
    7    70     1000     2     1206
    8    64     1064     0     1206
    9    32     1096     0     1206
    10   31     1127     0     1206
    11   30     1157     0     1206
    12   16     1173     0     1206
    13   14     1187     0     1206
    14    5     1192     0     1206
    15    6     1198     0     1206
    16    2     1200     0     1206
    17    4     1204     0     1206
    18    2     1206     0     1206

Take a peek at the .loci (easily human-readable) output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is one of the first places I usually look when an assembly finishes.
It provides a clean view of the data with variable sites (-) and parsimony
informative SNPs (*) highlighted. Use the unix commands **less** or
**head** to look at this file briefly.

.. code:: bash

    ## head -n 50 prints just the first 50 lines of the file to stdout
    head -n 50 pedicularis/min4_outfiles/min4.loci

.. parsed-literal::

    29154_superba              CCTTGGTSACCTTMGCWCCWGAYGGRTCCTTCTTCTCCACACTCTTKATRACACCAACAGCAACAGTC
    32082_przewalskii          CCTTGGTSACCTTRGCWCCWGAYGG-TCCTTCTTCTCCACACTCTTGATRACACCAACAGCAACAGTC
    35236_rex                  CCTTGGTCACCTTAGCACCTGATGG-TCCTTCTTCTCCACACTCTTGATGACACCAACAGCAACAGTC
    38362_rex                  CCTTGGTCACCTTAGCACCTGATGGRTCCTTCTTCTCCACACTCTTGATGACACCAACAGCAAC-GTC
    // - - - - - - - - |
    33413_thamno               TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    33588_przewalskii          GAGACAACCAGTGCCTTCTTGTCTATCAGCCTCACACCTGTCTTCGGTACTTTCGGTACTTAGAAGCA
    35855_rex                  TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    38362_rex                  TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    39618_rex                  TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    40578_rex                  TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    41478_cyathophylloides     TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    41954_cyathophylloides     TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA
    // - * - - |
    29154_superba              AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    30556_thamno               AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    30686_cyathophylla         AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    32082_przewalskii          AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    33413_thamno               AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    33588_przewalskii          AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    35236_rex                  AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    35855_rex                  AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    38362_rex                  AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    39618_rex                  AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    40578_rex                  AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    41478_cyathophylloides     AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC
    41954_cyathophylloides     AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGCNCTAGCTAACGCGCCC
    // |
    30686_cyathophylla         TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
    32082_przewalskii          TAGCAATAAATGCAAGAATATTGACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
    33588_przewalskii          TAGCAATAAATGCAAGAATATTGACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
    35236_rex                  TAGCAATAAATGCAAGAATATTKACTTCCATAATTTCGTCKGTTTTTTAATTCGCAATAACTCGGGAT
    38362_rex                  TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT
    39618_rex                  TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT
    40578_rex                  TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT
    41478_cyathophylloides     TAGCAATAAATGCAAGAATATTT-CTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT
    // * * |
    35236_rex                  CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAKCGGATCATCATGTGCGAGTTGAC
    35855_rex                  CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTGGAC
    38362_rex                  CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTTGAC
    39618_rex                  CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTTGAC
    // - - |
    30556_thamno               C-TTCTGATTAATCTG-AAATTGTAATCAAATGAAATYAAACAGCAAAAACAATGACTSGATAAACTA
    33413_thamno               CTTTCTGWTTAATCTGMAAATTGTAATCAAATGAAATCAAACARCAAAAACAATGACTYGAYAAWCYR
    35236_rex                  C-TTCTGATTAATCTG-AAATTGTAATCAAATGAAATCAAACA-CAAAAACAATGACT-GATAAACTA
    41478_cyathophylloides     CTTTCTGATTAATCTGCAAATTGTAATCAAATGAAATCAAACAGCAAAAACAATAACTTGATAAAATA
    // - - - - - - - ----|
    30556_thamno               GAAAGATWT-AYTGTAGACGTAWTKGATCRSAGGWKGAGGTGATGWATCATAWTCAT-ATCAGAGGAG
    38362_rex                  GAAAGATTTCACTGTAGACGTAATGGATCAGAGGTTGAGGTGATGRATCATAATCATGATCAGAGGWG
    39618_rex                  GAAAGATTTCACTGTAGACGTAWWGGATCMSAGGWKGAGGTGATGRATCATAATCATKATCAGAGGAG

Peek at the .phy files
~~~~~~~~~~~~~~~~~~~~~~

This is the concatenated sequence file of all loci in the data set. It is
typically used in phylogenetic analyses, such as in the program *raxml*.

.. code:: bash

    ## cut -c 1-80 prints only the first 80 characters of each line
    cut -c 1-80 pedicularis/min4_outfiles/min4.phy

.. parsed-literal::

    13 15034
    29154_superba              CCTTGGTSACCTTMGCWCCWGAYGGRTCCTTCTTCTCCACACTCTTKATRACA
    30556_thamno               NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    30686_cyathophylla         NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    32082_przewalskii          CCTTGGTSACCTTRGCWCCWGAYGGNTCCTTCTTCTCCACACTCTTGATRACA
    33413_thamno               NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    33588_przewalskii          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    35236_rex                  CCTTGGTCACCTTAGCACCTGATGGNTCCTTCTTCTCCACACTCTTGATGACA
    35855_rex                  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    38362_rex                  CCTTGGTCACCTTAGCACCTGATGGRTCCTTCTTCTCCACACTCTTGATGACA
    39618_rex                  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    40578_rex                  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    41478_cyathophylloides     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    41954_cyathophylloides     NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Peek at the .snp file
~~~~~~~~~~~~~~~~~~~~~

This is similar to the phylip file format, but only variable site columns
are included. All SNPs are in the file, in contrast to the .usnps file,
which selects only a single SNP per locus.

.. code:: bash

    ## cut -c 1-80 prints only the first 80 characters of each line
    cut -c 1-80 pedicularis/min4_outfiles/min4.snp

.. parsed-literal::

    13 711
    29154_superba              SMWWYRKRNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGYATNYGAMGT
    30556_thamno               NNNNNNNNNNNNNNNNA-YGGSTACTACWYWTKRSWKWW-TAGTAT-NNNNGT
    30686_cyathophylla         NNNNNNNNNNNNTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCRACTT
    32082_przewalskii          SRWWY-GRNNNNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATNNNNNGG
    33413_thamno               NNNNNNNNTTTGNNNNWMCRGYYWCYRYNNNNNNNNNNNNNNNNNNNNNNNGT
    33588_przewalskii          NNNNNNNNGTCTGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNRTRKNNNNNGG
    35236_rex                  CAATT-GGNNNNKKKTA-C-G-TACTACNNNNNNNNNNNNNNG-ATMNNNNGT
    35855_rex                  NNNNNNNNTGTGNNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGRCGT
    38362_rex                  CAATTRGGTTTGTTGTNNNNNNNNNNNNTCATGAGTTRAGTWNNNNMCGACGT
    39618_rex                  NNNNNNNNTTTGTTGTNNNNNNNNNNNNTCWWGMSWKRAKTAGTATMCGRCGT
    40578_rex                  NNNNNNNNTTTGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATNNNNNGT
    41478_cyathophylloides     NNNNNNNNTGTGTGNNACCGATTAATACWKATKRSWKRAGYAGTATNCRACGT
    41954_cyathophylloides     NNNNNNNNTGTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGT

Peek at the .snp file for the min12 assembly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to above, but you can see that there is much less missing data (Ns).

.. code:: bash

    ## cut -c 1-80 prints only the first 80 characters of each line
    cut -c 1-80 pedicularis/min12_outfiles/min12.snp

.. parsed-literal::

    13 44
    29154_superba              GTTGACGCGGTTCTCCCCCGCGCAGTGCAGTCCGCTAANNAGTT
    30556_thamno               GTTGACGCGGTTCTCACCCGCGCAGCGCGGTACACTAAGAAATT
    30686_cyathophylla         TTTGACGCGGTTCTCACCCGCGCAGCGCGGTCCACTAAGAAATT
    32082_przewalskii          GGTGATGCAACCCCTACCCGCGTGGCGTARYCAGTTARGAGACC
    33413_thamno               GTTGACGCGGTTTTCACCCGCGCAGCGCGGTACACTAAGAAATT
    33588_przewalskii          GGTGATGCAACCCCTACCCGCGTG-TC-ARYCAGTTARGAGACC
    35236_rex                  GTTGACGCGGTTCTCACCCGCGCAGCGCGRYACACTWAKMAATT
    35855_rex                  GTKRMYKMGGTTCTCACCCGCGCAGCGCGGTACACTAAGANNNT
    38362_rex                  GTTGACGCGGTTCTCACCCGCGCAGCGCGGTCCACTAAGAAATT
    39618_rex                  GTTGACGCGGTTCTCAYYYRYRCAGCGCGGTCCACTAAGAAATT
    40578_rex                  GTTGACGCGGTTCTCACCCGCGCAGCGCGGTACACTAAGAAATT
    41478_cyathophylloides     GTTGACGCGGTTCTCACCCGCGCAATGCAGTCCGCCAAGAAATT
    41954_cyathophylloides     GTTGACGCGGTTCTCACCCGCGCAATGCAGTCCGCCAAGAAATT

Downstream analyses
~~~~~~~~~~~~~~~~~~~

...
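
As one small downstream check, the difference in missing data between the
min4 and min12 SNP files can be put into numbers. The following is a minimal
sketch in Python and is not part of ipyrad: the function name
``missing_fraction`` is hypothetical, and it assumes the layout seen in the
.phy and .snp output in this tutorial (one header line with the number of
samples and sites, then one line per sample with the name and the sequence
separated by whitespace). Only ``N`` is counted as missing here; gaps and
ambiguity codes are treated as data.

```python
import sys


def missing_fraction(lines):
    """Return {sample_name: fraction of sites coded as N}.

    `lines` is any iterable of text lines in the phylip-style layout
    assumed above (an open file handle works).
    """
    fractions = {}
    rows = iter(lines)
    next(rows, None)  # skip the header line: "<n_samples> <n_sites>"
    for line in rows:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        name = fields[0]
        seq = "".join(fields[1:])
        fractions[name] = seq.upper().count("N") / len(seq)
    return fractions


if __name__ == "__main__":
    # Paths are given on the command line; nothing is hard-coded.
    for path in sys.argv[1:]:
        with open(path) as infile:
            print(path)
            for name, frac in missing_fraction(infile).items():
                print("  {:<24s} {:.3f}".format(name, frac))
```

Saved under a file name of your choice, the script could be pointed at
``pedicularis/min4_outfiles/min4.snp`` and
``pedicularis/min12_outfiles/min12.snp`` to compare the two assemblies
sample by sample.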