Revision 3e17855d6a9288302842fee9bf152e258f348d4f authored by Santosh Gunturu on 05 August 2016, 19:16:55 UTC, committed by Santosh Gunturu on 05 August 2016, 19:16:55 UTC
1 parent 0117143
Raw File
redundancy.rst
Redundancy
==========

If you have large files (>1Gb) and access to a cluster, take a look at :doc:`mpi`.

For the impatient
-----------------

Even if you're in a hurry, taking a look at :doc:`preprocess` is very important. If you already did, you can simply run::

    nonpareil -s reads.fa -T kmer -f fastq -b output (fastq is recommended for kmer algorithm)
    nonpareil -s reads.fa -T kmer -f fasta -b output
    nonpareil -s reads.fa -T alignment -f fasta -b output (fasta is recommended for alignment algorithm)
    nonpareil -s reads.fa -T alignment -f fastq -b output
    

Where ``reads.fa`` is the file containing the trimmed single reads, and ``output`` is the prefix
of the output files to be created.

Mandatory options
-----------------
   -s <str>   Path to the (input) file containing the sequences.  This is lowercase S.
   -T <str>   nonpareil algorithm. can be 'kmer' or 'alignment'.
   -f <str>   The format of the sequence. Can be 'fasta' or 'fastq'.

Common options
--------------
   -b <str>   Path to the prefix for all the output files.  Replaces the options: -a, -C, -l, and -o; generating files
              with the suffixes .npa, npc, .npl, and .npo, respectively, unless explicitly set.
   -d <num>   Subsample iteratively applying this factor to the number of reads, resulting in logarithmic subsampling.
              Use -d 0 to fall back to linear sampling, controlled by -m, -M, & -i (this was the default before v2.4).
	      By default: 0.7.
   -n <int>   Number of sub-samples to generate per point.  If it is not a multiple of the number of threads (see -t),
              it is rounded to the next (upper) multiple.  By default: 1024.
   -L <num>   Minimum overlapping percentage of the aligned region on the largest sequence. The similarity (see -S) is
              evaluated for the aligned region only.  By default: 50.
   -X <int>   Maximum number of reads to use as query.  This is capital X.  By default, 1,000 reads.
   -q <str>   Path to the (input) file containing a second dataset to be used as query, for dataset comparisons.  This
	      option is currently experimental.
   -R <int>   Maximum RAM usage in Mib.  Ideally this value should be larger than the sequences to analyze (discarding
              non-sequence elements like headers or quality).  This is particularly important when running in multiple
              cores (see -t).  This value is approximated.  By default 1024.
              Maximum value in this version: 4194303
   -t <int>   Number of threads.  Highest efficiency when the number of sub-samples (see -n) is multiple of the number
              of threads.  By default: 2.
   -v <int>   Verbosity level, for debugging purposes.  By default 7.  This is lowercase V.
   -V         Show version information and exit.  This is uppercase V.
   -h         Display this message and exit.

Additional options
------------------
**Input/Output**
   -a <str>   Path to the (output) file where all data must be saved.  This report is not created by default.  See the
              OUTPUT section.
   -C <str>   Path to the (output) file where the mating vector is to be saved.  This is a capital C.
   -F         Report the sampled portions as a fraction of the library instead of the number of reads.  See -a, -o and
              the OUTPUT section.
   -l <str>   Path to the (output) file where the log of the run must be saved. By default the log is sent only to the
              STDERR.  If set, the log is sent to both the STDERR and the log file.
   -o <str>   Path to the (output) file where summary is to be saved.   By default the summary is sent to stdout (same
              behavior as using a dash '-').  If an empty string '' is provided, does not produce the summary. See the
              OUTPUT section.
   
**Sampling**
   -m <num>   Minimum value of sampling portion.  By default: 0.
   -M <num>   Maximum value of sampling portion.  By default: 1.
   -i <num>   Interval between sampling portions. By default: 0.01.

**Mating**
   -c         Do not use reverse-complement.  This is useful for single stranded sequences data (like RNA).  This is a
              lowercase C.
   -N         Treat Ns as mismatches.  By default, Ns (unknown nucleotides) match any nucleotide (even another N).
   -S <num>   Similarity threshold to group two reads together.   Reducing this option will increase sensitivity while
              increasing running time.  This is uppercase S.
   -x <num>   Probability of taking a sequence into account as query for the construction of the curve.  Higher values
              reduce accuracy but increase speed.  This is lower case x.  If set, overides -X.

**Misc**
   -A         Autoadjust parameters and re-run.  Evaluates the results looking for common problems, adjusts parameters
              and re-run the analyses.  THIS IS EXPERIMENTAL CODE.
   -r <int>   Random generator seed.  By default current time.

Input
-----
Sequences must be in FastA or FastQ format. See :doc:`preprocess`.

Output
------
Redundancy summary: ``.npo`` file
   Tab-delimited file with six columns. The first column indicates the sequencing effort (in number of reads), and the
   remaining columns indicate the summary of the distribution of redundancy (from the replicates, 1,024 by default) at
   the given sequencing effort. These five columns are: average redundancy, standard deviation, quartile 1, median
   (quartile 2), and quartile 3.

Redundancy values: ``.npa`` file
   Tab-delimited file with three columns. Similar to the .npo files, it contains information about the redundancy at
   each sequencing effort, but it provides ALL the results from the replicates, not only the summary at each point. The
   first column indicates the sequencing effort (as a fraction of the dataset), the second column indicates the ID of
   the replicate (a number used only to introduce some controlled noise in plots), and the third column indicates the
   estimated redundancy value.

Mates distribution: ``.npc`` file
   Raw list with the number of reads in the dataset matching a query read. A set of query reads is randomly drawn by
   Nonpareil (1,000 by default), and compared against all reads in the dataset. Each line on this file corresponds to a
   query read (the order is not important). We have seen certain correspondance between these numbers and the distribution
   of abundances in the community (compared, for example, as rank-abundance plots), but this file is provided only for
   quality-control purposes and comparisons with other tools.

Log: ``.npl`` file
   A verbose log of internal Nonpareil processing. The number to the left (inside squared brackets) indicate the CPU time
   (in minutes). This file also provide quality assessment of the Nonpareil run (automated consistency evaluation). Ideally,
   the last line should read "Everything seems correct". Otherwise, it suggests alternative parameters that may improve the
   estimation.

back to top