https://github.com/dereneaton/ipyrad
Tip revision: 82782ca61ae1180bbbcf2353016ff2a123971319 authored by Isaac Overcast on 08 May 2021, 23:57:05 UTC
"Updating ipyrad/__init__.py to version - 0.9.74
"Updating ipyrad/__init__.py to version - 0.9.74
Tip revision: 82782ca
faq.rst
.. _faq:
Frequenty Asked Questions
=========================
This is a collection of papers related to various aspects of RADSeq assembly and analysis
Just because we include a paper here doesn't mean we endorse the results, only that we think
it's worth considering. Where necessary, we highlight specific papers and interesting/important
results below.
* Huang & Knowles 2016 - `Unforeseen Consequences of Excluding Missing Data from Next-Generation Sequences: Simulation Study of RAD Sequences <https://academic.oup.com/sysbio/article/65/3/357/2468879>`__
Especially useful. TL;DR Don't over-filter missing data. In practice this means allowing for more permissive `min_samples_locus` parameter settings. Sometimes people have a tendency to treat RAD data as if it were a giant multi-locus Sanger dataset. They want to see a "complete data matrix", so use very high `min_samples_locus` settings in order to reduce the amount of missing data. This is bad, and this paper shows us why.
* Shafer et al. 2016 - `Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference <https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12700>`__
* Linck & Battey 2018 - `Minor allele frequency thresholds dramatically affect population structure inference with genomic datasets <https://www.biorxiv.org/content/biorxiv/early/2018/10/21/188623.full.pdf>`__
Don't over-filter your data. Filtering too strictly on MAF reduces power to detect population structure. Given that low-frequency variants retain much of the signal of recent history this result seems sensible.
* Crotti et al 2019 - `Causes and analytical impacts of missing data in RADseq phylogenetics: Insights from an African frog (Afrixalus) <https://onlinelibrary.wiley.com/doi/abs/10.1111/zsc.12335>`__
Don't over-filter your data.
* Euclide et al 2019 - `Attack of the PCR clones: Rates of clonality have little effect on RAD‐seq genotype calls <https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13087>`__
"Removal of PCR clones reduced the number of called genotypes by 2% but had almost no influence on estimates of heterozygosity."
* Benjelloun et al 2019 - `An evaluation of sequencing coverage and genotyping strategies to assess neutral and adaptive diversity <https://sci-hub.tw/https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13070>`__
"Globally, 5K to 10K random variants were enough for an accurate estimation of genome diversity. Conversely, commercial panels and exome capture displayed strong ascertainment biases. ... the detection of the signature of selection and the accurate estimation of linkage disequilibrium required high-density panels of at least 1M variants."
Dealing with Missing Data Downstream
------------------------------------
* Ferretti et al 2012 - `Neutrality Tests for Sequences with Missing Data <http://www.genetics.org/content/genetics/191/4/1397.full.pdf>`__
"In this article we presented a general framework for estimators of variability and neutrality tests based on the frequency spectrum that take into account missing data in a natural way." Cool paper, however no code is provided to implement this.
Troubleshooting Procedures
==========================
Troubleshooting ipyparallel issues
----------------------------------
Sometimes ipyrad can have trouble talking to the ipyparallel
cluster on HPC systems. First we'll get an interactive shell
on an HPC compute node (YMMV with the `qsub -I` here, you might
need to specify the queue and allocate specific resource).
.. code-block:: bash
qsub -I
.. code-block:: bash
ipcluster start --n 4 --daemonize
Then type `ipython` to open an ipython session.
.. code-block:: python
import ipyparallel as ipp
rc = ipp.Client()
rc[:]
The result should look something like this:
.. parsed-literal::
Out[1]: <DirectView [0, 1, 2, 3]>
.. code-block:: python
import ipyparallel as ipp
rc = ipp.Client(profile="default")
rc[:]
.. code-block:: python
import ipyrad as ip
## location of your json file here
data = ip.load_json("dir/path.json")
print data.ipcluster
.. code-block:: python
data = ip.Assembly('test')
data.set_params("raw_fastq_path", "path_to_data/\*.gz")
data.set_params("barcodes_path", "path_to_barcode.txt")
data.run('1')
print data.stats
print data.ipcluster
.. parsed-literal::
{'profile': 'default', 'engines': 'Local', 'quiet': 0, 'cluster_id': '', 'timeout': 120, 'cores': 48}
.. code-block:: python
data.write_params('params-test.txt')
Don't forget to stop the ipcluster when you are done.
.. code-block:: bash
ipcluster stop
Running ipyrad on HPC that restricts write-access to /home on compute nodes
---------------------------------------------------------------------------
Some clusters forbid writing to `/home` on the compute nodes. It guarantees that users
only write to scratch drives or high performance high volume disk, and not the user
home directory (which is probably high latency/low volume). They have write access on
login, just not inside batch jobs. This manifests in weird ways, it's hard to debug,
but you can fix it by adding an `export` inside your batch script.
.. code-block:: bash
export HOME=/<path>/<to>/<some>/<writable>/<dir>
In this way, `ipcluster` and `ipyrad` will both look in `$HOME` for the `.ipython` directory.
ipyrad crashes during dereplication in step 3
---------------------------------------------
.. parsed-literal::
ERROR sample [XYZ] failed in step [derep_concat_split]; error: EngineError(Engine '68e79bbc-0aae-4c91-83ec-97530e257387' died while running task u'fdef6e55-dcb9-47cb-b4e6-f0d2b591b4af')
If step 3 crashes during dereplication you may see an error like above. Step 3
can take quite a lot of memory if your data do not de-replicate very efficiently.
Meaning that the sample which failed may contain a lot of singleton reads.
You can take advantage of the following steps during step 2 to better filter your
data so that it will be cleaner, and thus dereplicate more efficiently. This will
in turn greatly speed up the step3 clustering and aligning steps.
* Use the "filter_adapters" = 2 argument in ipyrad which will search for and remove Illumina adapters.
* Consider trimming edges of the reads with the "trim_reads" option. An argument like (5, 75, 5, 75) would trim the first five bases of R1 and R2 reads, and trim all reads to a max length of 75bp. Trimming to a fixed length helps if your read qualities are variable, because the reads may be trimmed to variable lengths.
* Try running on a computer with more memory, or requesting more memory if on a cluster.
Collisions with other local python/conda installs
-------------------------------------------------
.. parsed-literal::
Failed at nopython (nopython frontend)
UntypedAttributeError: Unknown attribute "any" of type Module(<module 'numpy' from...
In some instances if you already have conda/python installed the local environment
variable PYTHONPATH will be set, causing python to use versions of modules
outside the miniconda path set during ipyrad installation. This error can be fixed by
blanking the PYTHONPATH variable during execution (as below), or by adding the export
to your ~/.bashrc file.
.. code-block:: bash
export PYTHONPATH=""; ipyrad -p params.txt -s 1
Why doesn't ipyrad handle PE original RAD?
------------------------------------------
Paired-End RAD protocol is tricky to denovo assemble. Because of the sonication step R2
doesn't line up nicely. ipyrad makes strong assumptions about how r1 and r2 align,
assumptions which are met by PE gbs and ddrad, but which are not met by original RAD.
This doesn't matter (as much) if you have a reference genome, but if you don't have a
reference it's a nightmare... dDocent has a PE-RAD mode, but I haven't evaluated it.
I know that people have also used stacks (because stacks treats r1 andr2 as independent
loci). If people ask me how to denovo assemble with PE-RAD in ipyrad I tell them to
just assemble it as SE and ignore R2.
Why doesn't ipyrad write out the .alleles format with phased alleles like pyrad used to?
----------------------------------------------------------------------------------------
We're hoping to provide something similar eventually, the problem with the pyrad alleles
file is that the alleles are only phased correctly when we enforce that reads must align
almost completely, i.e., they are not staggered in their overlap. So the alleles are
correct for RAD data, because the reads match up perfectly on their left side, however,
staggered overlaps are common in other data sets that use very common cutters, like
ezRAD and some GBS, and especially so when R1 and R2 reads merge. So we needed to change
to an alternative way of coding the alleles so that we can store both phased and unphased
alleles, and its just taking a while to do. So for now we are only providing unphased
alleles, although we do save the estimated number of alleles for each locus. This
information is kind of hidden under the hood at the moment though.
Why is my assembly taking FOREVER to run?
-----------------------------------------
There have been a few questions recently about long running jobs (e.g., >150 hours), which
in my experience should be quite rare when many processors are being used. In general,
I would guess that libraries which take this long to run are probably overloaded with
singleton reads, meaning reads are not clustering well within or across samples. This
can happen for two main reasons: (1) Your data set actually consists of a ton of
singleton reads, which is often the case in libraries that use very common cutters like
ezRAD; or (2) Your data needs to be filtered better, because low quality ends and
adapter contamination are causing the reads to not cluster.
If you have a lot of quality issues or if your assemby is taking a long time to cluster
here are some ways to filter more aggressively, which should improve runtime and the
quality of the assembly:
* Set filter_adapters to 2 (stringent=trims Illumina adapters)
* Set phred_Qscore_offset to 43 (more aggressive trimming of low quality bases from 3' end of reads
* Hard trim the first or last N bases from raw reads by setting e.g., trim_reads to (5, 5, 0, 0)
* Add additional 'adapter sequences' to be filtered (any contaminant can be searched for, I have added long A-repeats in one library where this appeared common). This can be done easily in the API, but requires editing the JSON file for the CLI.
I still don't understand the `max_alleles_consens` parameter
------------------------------------------------------------
In step 5 base calls are made with a diploid model using the parameters estimated in
step 4. The only special case in when `max_alleles_consens` = 1, in which case the step 4
heterozygosity estimate will be fixed to zero and the error rate will suck up all of the
variation within sites, and then the step 5 base calls will be haploid calls. For all
other values of `max_alleles_consens`, base calls are made using the diploid model using
the H and E values estimated in step 4. **After site base calls are made** ipyrad then counts
the number of alleles in each cluster. This value is now simply stored in step 5 for use
later in step 7 to filter loci, under the assumption that if a locus has paralogs in one
sample then it probably has them in other samples but there just wasn't enough variation to
detect them.
Why does it look like ipyrad is only using 1/2 the cores I assign, and what does the `-t` flag do?
--------------------------------------------------------------------------------------------------
Most steps of ipyrad perform parallelization by multiprocessing, meaning that jobs are
split into smaller bits and distributed among all of the available cores. However, some
parts of the analysis also use multithreading, where a single function is performed over
multiple cores. More complicated, parts like step3 perform several multithreaded jobs in
parallel using multiprocessing... you still with me? The -c argument is the total number
of cores that are available, while the -t argument allows more fine-tuned control of how
the multithreaded functions will be distributed among those cores. For example, the
default with 40 cores and -t=2 would be to start 20 2-threaded vsearch jobs. There are
some parts of the code that cannot proceed until other parts finish, so at some points
the code may run while using fewer than the total number of cores available, which is
likely what you are seeing in step 3. Basically, it will not start the aligning step
until all of the samples have finished clustering. It's all fairly complicated, but we
generally try to keep everything working as efficiently as possible. If you have just
one or two samples that are much bigger (have more data) than the rest, and they are
taking much longer to cluster, then you may see a speed improvement by increasing the
threading argument (e.g., -t 4).
How to fix the GLIBC error
--------------------------
If you ever see something that looks like this `/lib64/libc.so.6: version `GLIBC_2.14' not found`
it's probably because you are on a cluster and it's using an old version of GLIBC. To
fix this you need to recompile whatever binary isn't working on your crappy old machine.
Easiest way to do this is a conda local build and install. Using `bpp` as the example:
.. parsed-literal::
git clone https://github.com/dereneaton/ipyrad.git
conda build ipyrad/conda.recipe/bpp/
conda install --use-local bpp
How do I interpret the `distribution of SNPs (var and pis) per locus` in the *_stats.txt output file
----------------------------------------------------------------------------------------------------
Here is an example of the first few lines of this block in the stats file:
.. parsed-literal::
bash var sum_var pis sum_pis
0 661 0 10090 0
1 1660 1660 5070 5070
2 2493 6646 1732 8534
3 2801 15049 483 9983
4 2683 25781 147 10571
5 2347 37516 59 10866
6 1740 47956 17 10968
7 1245 56671 7 11017
**pis** is exactly what you think, it's the count of loci with *n* parsimony informative sites. So row 0 is loci with no pis, row 1 is loci with 1 pis, and so on.
**sum_pis** keeps a running total of the counts for all pis across all loci up to that point, which is why the sum looks weird, but i assure you its fine. For the row that records 3 pis per site, you see the # pis = 483 and 483 * 3 + 8534 = 9983.
**var** is a little trickier and here's where the docs are a little goofy. This keeps track of the number of loci with n variable sites including autapomorphies and pis within each locus. So row 0 is all totally monomorphic loci. row 1 is all loci with *either* one pis or one autapomorphy. Row 2 is all loci with *either* two pis, or two autapomorphies, *OR* one of each, and so on.
**sum_var** is calculated identical to **sum_pis**, so it does look weird but it's right.
The reason the counts in, for example, row 1 do not appear to agree for var and pis is because the value of row 1 for pis *includes all* loci with only one pis irrespective of the number of autapomorphies, whereas the value for var records all loci with *only one* of either of these.
How to fix the `IOError(Unable to create file IOError(Unable to create file...` error
-------------------------------------------------------------------------------------
The HDF5_USE_FILE_LOCKING error is caused by the fact that your cluster filesystem is NFS (or some other network based filesystem). You can disable hdf5 file locking by setting an environment variable `export HDF5_USE_FILE_LOCKING=FALSE`. See here for more info:
http://hdf-forum.184993.n3.nabble.com/HDF5-files-on-NFS-td4029577.html
Why am I getting the 'empty varcounts' error during step 7?
-----------------------------------------------------------
Occasionally during step 7 you will see this error:
.. code-block::
Exception: empty varcounts array. This could be because no samples
passed filtering, or it could be because you have overzealous filtering.
Check the values for `trim_loci` and make sure you are not trimming the
edge too far.
This can actually be caused by a couple of different problems that all result in the same behavior, namely that you are filtering out *all* loci.
**trim_loci** It's true that if you set this parameter too aggressively all loci will be trimmed completely and thus there will be no data to output.
**min_samples_locs** Another way you can eliminate all data is by setting this parameter too high. Try dropping it way down, to like 3, then rerunning to get a better idea of what an appropriate value would be based on sample depths.
**pop_assign_file** A third way you can get this error is related to the previous one. The last line of the pop_assign_file is used for specifying min_sample per population for writing a locus. If you mis-specify the values for the pops in this line then it's possible to filter out all your data and thus obtain the above error.
How do I fix this error: "OSError: /lib64/libpthread.so.0: version `GLIBC_2.12' not found"?
-------------------------------------------------------------------------------------------
This error crops up if you are running ipyrad on a cluster that has an older version of GLIBC. The way to work around this is to install specific versions of some of the requirements that are compiled for the older version. Thanks to Edgardo M. Ortiz for this solution.
First clean up your current environment:
.. code-block:: bash
module unload python2
rm -rf miniconda2 .conda
bash Miniconda2-latest-Linux-x86_64.sh
source ~/.bashrc
then install the old version of llvmlite (and optionally the old versions of pyzmq and ipyparallel if necessary):
.. code-block:: bash
conda install llvmlite=0.22
conda install pyzmq=16
conda install ipyparallel=5.2
and finally reinstall ipyrad:
.. code-block:: bash
conda install -c ipyrad ipyrad
conda install toytree -c eaton-lab
optional:
.. code-block:: bash
conda clean --all
Why am i getting the 'ERROR R1 and R2 files are not the same length.' during step 1?
--------------------------------------------------------------------------------------
This is almost certainly a disk space issue. Please be sure you have _plenty_ of disk space on whatever drive you're doing your assembly on. Running out of disk can cause weird problems that seem to defy logic, and that are a headache to debug (like this one). Check your disk space: `df -h`
Why does the number of pis recovered in the output stats change when I change the value of `max_snp_locus`?
-----------------------------------------------------------------------------------------------------------
While it does seem that the # of pis shouldn't change under varying `max_snp_locus` thresholds, it is in fact not true. This is because the setting is for max __SNP__ per locus, not max __PIS__. So for example if you have `max_snp_locus` set to 5, and you have a locus with 5 singleton snps and one doubleton snp (which is parsimony informative), then this locus would be filtered out. However if you set `max_snp_locus` to 10, then this locus would be included and the 'pis' counter would be incremented by 1. In this way you can see that the number of PIS recovered will change because of variation in this parameter setting.
Can ipyrad assemble MIG-seq data?
---------------------------------
MIG-seq (multiplexed ISSR genotyping by sequencing) is a method proposed by Suyama and Matsuki (2015), which involves targeting variable regions between simple sequence repeats (SSR). The method produces data that is somewhat analogous to ddRAD, in that you have the variable region which is flanked on either side by sequences that are known to be repeated randomly and at some appreciable frequency throughout the genome. Check out the `figure from the manuscript <https://www.nature.com/articles/srep16963/figures/1>`__. Anyway.... yes, ipyrad can assemble this kind of data, though there are some tricks. Primarily we recommend higher values of `filter_min_trim_len` and `clust_threshold`. If sequenced on a desktop NGS platform (Ion Torrent PGM, MiSeq) it also helps to reduce both `mindepth` params to recover more clusters.
Why are my ipcluster engines dying silently on cluster compute notes?
---------------------------------------------------------------------
This is a nasty bug that's bitten me more than once. If you are having trouble with cluster engines running jobs and then dying silently it may be because the cluster is headless and the engines are trying to interact with a GUI backend. This causes nasty things to happen. Here are a couple links that provide workable solutions:
https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/o0pnE9PEqA0
https://github.com/ipython/ipyparallel/issues/213
Why are my ipyrad.analysis.structure runs taking so long/not doing anything?
----------------------------------------------------------------------------
See the previous FAQ answer. It's typical for HPC cluster systems to be configured without a GUI backend. Unfortunately ipyparallel and this particular GUI-less environment have a hard time interacting (for complicated reasons). We have derived a workaround that allows the parallelization to function. You should execute the following commands in a terminal on your cluster head node.
VERY IMPORTANT: This environment variable needs to be set in both .bashrc and .profile so that it is picked up when you run ipyparallel in either the head node of the cluster or on compute nodes.
.. code-block:: bash
$ echo "# Prevent ipyparallel engines from dying in a headless environment" >> ~/.bashrc
$ echo "export QT_QPA_PLATFORM=offscreen" >> ~/.bashrc
$ echo "export QT_QPA_PLATFORM=offscreen" >> ~/.profile
$ source ~/.bashrc
$ source ~/.profile
$ env | grep QT
Why is my structure analysis crashing when it looks like it should be working?
------------------------------------------------------------------------------
When running structure, specifically in the `get_clumpp_table` call, you might be told that "No files ready for XXX-K-2 in </your/structure/folder>", when in fact there are files ready. Well it turns out that CLUMPP has a 100 character file name limit, and it'll crash with names longer than this. The ipyrad.analysis.structure functions use absolute paths to specify file names, so it's not hard to see how this 100 character limit could be violated. Try moving your structure analysis to a place higher in the file system hierarchy. Baffling!
Problems with SRATools analysis package
---------------------------------------
Occasionlly with the sratools package you might have some trouble with downloading. It could look something like this `Exception in run() - index 29 is out of bounds for axis 0 with size 1`. This is a problem with your `esearch` install, which does not have https support built in. You can verify with with the command `esearch -db sra -query SRP021469`, which should give you an https protocol support error. You can easily fix this by installing `ssleay`: `conda install -c bioconda perl-net-ssleay`. Thanks to @ksil91.
ValueError in step 7
--------------------
During step 7 if you see something like this `error in filter_stacks on chunk 0: ValueError(zero-size array to reduction operation minimum which has no identity)` it means that one of your filtering parameters is filtering out all the loci. This is bad, obviously, and it's probably because one of your filtering parameters is too strict. Take a look at a couple of the samples in the *_consens directory to make sure they are reasonable, then try adjusting your filtering parameters based on how the consensus reads look.