This is wcd.info, produced by makeinfo version 4.8 from wcd.texi.


File: wcd.info,  Node: Top,  Next: Description,  Up: (dir)

   The `wcd' tools (pronounced 'wicked') do EST clustering.  The most
recent version is called `wcd-express'.

This is the `wcd-express' manual.

Copyright (C) 2003-2010 Scott Hazelhurst, University of the
Witwatersrand, Johannesburg.  This documentation is distributed under
the GNU Free Documentation Licence.


* Menu:

* Description::    What EST clustering is

* Installing wcd:: How to install it

* Running wcd:: How to get it to work

* wcd inside stackPACK:: How to incorporate wcd into the stackPACK clustering system

* Technical manual:: The gory details

* Testing:: How tested

* Acknowledgements and copyright:: Thanks and  copyright information

* Concept Index:: Some index terms


File: wcd.info,  Node: Description,  Next: Installing wcd,  Prev: Top,  Up: Top

1 Description
*************

This manual describes `wcd' (pronounced `wicked'), a program that
performs clustering of sequences. Essentially, it takes a set of DNA
sequences (typically Expressed Sequence Tags, ESTs), and clusters the
sequences into related groups.

   It reads in a set of DNA sequences (typically ESTs or mRNAs) in FASTA
format and organises them into clusters of similar sequences, producing
a list of clusters and their members on standard output.

   The clustering performed is single-linkage clustering. Sequences are
put in the same cluster if they have regions of close similarity.
Similarity is determined using one of the standard distance functions.
In versions > 0.2.0 of `wcd', three distance functions are supported:
the d2 function, edit distance and a common word heuristic. This may be
expanded  in later versions of `wcd'.

   `wcd' has been used mainly on ESTs but also on genomic sequences
(particularly metagenomic sets). The largest sequence sets known to
have been clustered are: 6.6M 454 DNA sequences (2.55G bases), which
took 35 hours 27 minutes on 32 Irwindale (3.6GHz) Intel Xeon
processors (1 MB of L2 cache each); and 18M 454 DNA sequences (just
under 8G bases), which took just under 10 days on the same cluster,
and 84 hours on a cluster of 128 AMD 2218 2.6GHz machines.

   When using the d2 distance function, `wcd''s results are similar to
those generated by the `d2_cluster' program, which also implements
clustering based on the d2 distance function.

   The use of the d2 distance function in DNA sequence comparison is
described in Hide et al. (1994) and Burke et al. (1999), and has been
shown to be sensitive to sequence similarity even in the absence of
direct sequence alignments. This corresponds with the nature of EST data
which typically contains single-base read errors and may contain
sequence from different splice forms of the same transcript (leading to
local dissimilarity).

   Thanks to the properties of the d2 distance function, sequence
clusters created by `wcd' correspond closely to gene classes, i.e. all
the ESTs originating from a specific gene will be clustered into a
single class.

   `wcd' is written in C and has been successfully installed and tested
on Linux, FreeBSD, SGI Irix, Sun Solaris, Windows 2000 (under Cygwin)
and Windows XP.

   Users of `wcd' are welcome to contact the developer.

   The remainder of this manual assumes that the user understands the
basic principles of EST clustering, so the basic concepts are only
briefly reviewed in this chapter. Subsequent chapters discuss
installing and running `wcd', as well as giving some technical
background for someone who wants to use some of the code.

1.1 Citing `wcd'
================

If you cite `wcd' please use the following reference (this list may be
updated):

   * S. Hazelhurst, W. Hide, Z. Liptak, R. Nogueira, R. Starfield.
     An overview of the `wcd' EST clustering tool.
     _Bioinformatics_ 24(17), July 2008, pp. 1546-1547.
     doi: `10.1093/bioinformatics/btn203'.

   A description of the underlying algorithms can be found in
   * S. Hazelhurst. Algorithms for clustering EST sequences: the
     `wcd' tool.  _South African Computer Journal_, 40, June 2008.

   A paper describing the KABOOM algorithm has been accepted by
_Bioinformatics_:

   * S. Hazelhurst and Z. Liptak. KABOOM! A new suffix-array based
     algorithm for clustering expression data. _Bioinformatics_
     (Accepted for publication)

1.2 Clustering
==============

The basic idea is that we take a set of sequences and cluster them so
that sequences that are close to each other are in the same cluster.
There are many clustering algorithms, and one review can be found in


 @article{cluster2,
   author =       "A. K. Jain and M. N. Murty and P. J. Flynn",
   title =        "Data clustering: a review.",
   journal =      "ACM Comp. Surveys",
   volume =       "31",
   number =       "3",
   pages =        "264--323",
   month =        sep,
   year =         "1999"
 }

   The type of clustering we do is single linkage clustering. We create
the largest number of clusters (and hence smallest clusters) such that
if two sequences are very close they are put in the same cluster. Note
the definition is transitive: if A and B are close, and B and C are
close then A and C will be in the same cluster even if A and C are far
apart.

   The basic clustering algorithm is shown below (obviously, there are
performance enhancements).


foreach sequence i
  put i in its own cluster
foreach sequence i
  foreach sequence j
     if i and j are close to each other
        merge the clusters to which i and j belong
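
   The merging step is typically implemented with a union-find
(disjoint-set) structure. The Python sketch below is illustrative
only, not `wcd''s actual code; the `close' predicate stands in for
whichever distance function and threshold are in use.


def cluster(seqs, close):
    # parent[i] is i's parent; each cluster is a union-find tree.
    parent = list(range(len(seqs)))

    def find(i):
        # Find i's root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if close(seqs[i], seqs[j]):
                parent[find(i)] = find(j)   # merge the two clusters

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())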

1.3 Distance functions
======================

A distance function takes two sequences and returns a number that says
how far apart the sequences are mathematically. Many distance functions
have been proposed with different biological and computational
motivations: hopefully a well-chosen distance function will return a
mathematical value that has some biological distance.

   One of the purposes of `wcd' is to allow different distance
functions to be used. The distance functions which `wcd' supports
   * d2, edit distance

1.3.1 The D2 distance function
------------------------------

The d2 distance function is based on word counts.

   * Let _x_ be a sequence and _w_ a word.

     We use the notation x' \in x to mean that x' is a _window_ or
     substring of _x_ of some specified length.

   * c_x(w) is the number of times that _w_ appears in _x_.

1.3.1.1 General formulation
...........................

To measure the distance between two sequences we compute the word
frequencies and then take the sum of the squares of the differences.

   For a fixed _k_, we examine all the words _w_ of length _k_, compute
the frequencies with which they occur in the strings _x_ and _y_, and
then sum the squares of the differences. Formally,

   d^2_k(x,y) = \sum_{|w|=k} (c_x(w) - c_y(w))^2

Then in general

   d^2(x,y) = \sum_{k=l}^{L} d^2_k(x,y)

for constants _l_ and _L_. In practice (and this is what `wcd' does),
we compute the d2 score for a fixed _k_.  The default value of _k_ is
6, but this is a parameter of the program.
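
   For concreteness, here is a minimal Python sketch of the fixed-_k_
score (illustrative only, not `wcd''s optimised C implementation):


from collections import Counter

def d2k(x, y, k=6):
    # Count every word of length k in each sequence ...
    cx = Counter(x[i:i+k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i+k] for i in range(len(y) - k + 1))
    # ... and sum the squared differences of the counts.
    return sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys())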

1.3.1.2 Windows
...............

The goal of EST clustering is to see whether there is an overlap between
two sequences: thus we are no interested in the d2 score between the
whole of one sequence and the whole of another, but whether there are
two regions (called windows) that have a low d2 score.

   Thus the d2 score between two sequences is the minimum of the d2
score between all pairs of windows of those two sequences.  d^2_k(x,y)
= \min_x' \in x, y' \in y\sum_|w|=k(c_x'(w)-c_y'(w))

   The consequence of this is that d2 applied like this does not obey
the triangle inequality.  It is very common that

   d^2(x,z) > d^2(x,y)+d^2(y,z)

since there is a window that _x_ shares with _y_, another that _y_
shares with _z_ but no window common to both _x_ and _z_.

   As an aside, d2 even when applied to sequences is not a metric since
d^2(x,y)=0 does not necessarily imply that x=y.

   We don't compare two sequences directly. Rather we compare all
windows of fixed length in one sequence with all the windows in the
other. The default window length is 100 (again it's parameterisable).
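
   A deliberately naive sketch of the windowed score follows; it
recounts the words for every pair of windows, whereas `wcd' updates
the counts incrementally as the windows slide. The defaults follow the
text above.


from collections import Counter

def counts(s, k):
    # Word counts for all words of length k in s.
    return Counter(s[i:i+k] for i in range(len(s) - k + 1))

def windowed_d2(x, y, k=6, win=100):
    # Minimum d2 score over all pairs of windows, one from each
    # sequence. Returns None if either sequence is shorter than win.
    best = None
    for i in range(len(x) - win + 1):
        cx = counts(x[i:i+win], k)
        for j in range(len(y) - win + 1):
            cy = counts(y[j:j+win], k)
            score = sum((cx[w] - cy[w]) ** 2
                        for w in cx.keys() | cy.keys())
            if best is None or score < best:
                best = score
    return best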

1.3.1.3 References
..................

The paper by Burke et al. describes the basic d2 method and justifies
its use.


@article{burke99,
   author = "J. Burke and D. Davison and W. Hide",
   title  = "{D2\_cluster: A Validated Method for Clustering EST and
              Full-length cDNA Sequences}",
   journal = "{Genome Research}",
   year =  1999,
   volume = 9,
   number = 11,
   pages = "1135--1142"
}

   The SACJ paper (hazel2008b) describes the algorithms in wcd in some
detail. The Bioinformatics paper describes the use of wcd and gives
comparative performance (the SACJ paper also concentrates on an earlier
version of wcd). If you use wcd, please cite one of these papers.



@article{hazel2008a,
    journal = "Bioinformatics",
    year =  2008,
    title = "An overview of the wcd {EST} Clustering Tool",
    author = "S. Hazelhurst and W. Hide and Z. Lipt\'ak and R. Nogueira and R. Starfield",
    month = jul,
    volume = 24,
    number = 13,
    pages = "1542-1546",
    doi  = "10.1093/bioinformatics/btn203",
}


@article{hazel2008b,
    journal = "South African Computer Journal",
    year =  2008,
    title = "Algorithms for clustering {EST} sequences: the wcd tool",
    month = jun,
    volume = 40,
    author = "S. Hazelhurst",
    pages  = "51-62",
}

1.3.2 Edit distance
-------------------

Edit distance is well known; the reference below gives a very thorough
introduction to the subject.

   The version of edit distance implemented in `wcd' does local
alignment (Smith-Waterman). Penalties (positive numbers) are given for
mismatches, deletions or gaps, and a reward (a negative number) is
given for a match. The default values are:
   * -1 for a match

   * +3 for mismatch

   * +5 for opening a gap (indel)

   * +2 for extending a gap

The user can change these values.
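
   A minimal Python sketch of local alignment under this convention
(lower scores are better, so the recurrence minimises and each cell is
capped at 0). For brevity it charges a flat gap penalty rather than
`wcd''s separate open and extend costs:


def local_align_score(x, y, match=-1, mismatch=3, gap=5):
    # Smith-Waterman with the signs flipped: we minimise, and capping
    # each cell at 0 lets an alignment start anywhere.
    h = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            sub = match if x[i-1] == y[j-1] else mismatch
            h[i][j] = min(0,
                          h[i-1][j-1] + sub,   # match/mismatch
                          h[i-1][j] + gap,     # gap in y
                          h[i][j-1] + gap)     # gap in x
            best = min(best, h[i][j])
    return best

With these signs a strong local alignment drives the score well below
zero, which is why the default edit distance threshold (-20) is
negative.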


@Book{		  gusfield97,
  author	= "D. Gusfield",
  title		= "Algorithms on strings, trees, and sequences",
  address	= "Cambridge, United Kingdom",
  publisher	= "Cambridge University Press",
  year		= 1997
}


File: wcd.info,  Node: Installing wcd,  Next: Running wcd,  Prev: Description,  Up: Top

2 Installing `wcd'
******************

2.1 Getting new versions of `wcd'
=================================

The latest version of `wcd' can be found at:

   `http://code.google.com/p/wcdest'

   You should also be able to find it at:
`ftp://ftp.cs.wits.ac.za/pub/research/software/'

Please let me know if you would like to be told about new versions of
`wcd'.

2.2 Changes log
===============

2.2.1 Version 0.6.3
-------------------

Mainly internal changes - better memory behaviour and some bug fixes.
However, there are two changes in the way in which `wcd' is called.
Usually, I don't like doing this because it may make scripts break but
the alternatives here are worse. Sorry!

   * The `-y' or `--mirror' option no longer takes an argument. In
     0.6.1 it took the name of a file. Now, the output goes to standard
     output and the `-o' option should be used.

     A consequential change is that the `prep-kabm' and `dustprep-kabm'
     scripts have changed and should be replaced.

   * The `kl-seed' or `-$' option has changed so that it does take an
     argument: _s_ for symmetric mode or _a_ for asymmetric mode. See
     the documentation.

2.2.2 Version 0.6 `wcd-express'
-------------------------------

There is one very big change in version 0.6, the KABOOM algorithm for
speeding up clustering.  Note that for data files that are greater than
1GB in size, a 64-bit machine is needed when using KABOOM.

   Version 0.6 - `wcd-express' - has been tested on Linux (Scientific
Linux and Ubuntu) and on MacOS X (10.5 and 10.6).

   There is a binary version for Windows. However, this does not at
this point support the KABOOM heuristic. The problem is some
incompatible libraries. This is not insurmountable but will take a
couple of days we don't have at the moment (if you are prepared to do
the schlep, the problem is that the mman library is not ported, so we
have to use the Windows memory-mapped file API instead). We have
configured and compiled the code on Windows and you can too, but the
current conditional configuration will mean that KABOOM is not
supported.

2.2.2.1 Miscellaneous changes.
..............................

   * Support for short read sequences such as 454 and Illumina
     sequences. Mainly this isn't so much new features or changes in
     `wcd', but some experimentation so that we know which parameters
     to use. Our experimentation has yielded good results.

   * Support for shorter sequences, particularly for files where there
     is variability in the sequence length. In previous versions of
     `wcd' any sequence below the specified window length would be
     thrown away. In this version of `wcd', sequences shorter than the
     window length will be clustered, but all the distance and
     heuristic thresholds are scaled for that sequence so that a
     higher level of similarity will be required.

   * Support for the type of input expected by Bragg and Stone's k-link
     tool
     `http://bioinformatics.oxfordjournals.org/content/25/18/2302.short'
     has been added. `wcd' can be used as a pre-processor for K-link.

   * An option where `wcd' will only cluster sequences if the regions
     of similarity are at appropriate ends of the ESTs (the _ends_
     argument to the `-F' option).

   * The reading in of FASTA files has been made more robust. Previous
     versions sometimes choked silently on some FASTA files produced in
     Windows systems and then went into an endless loop.


2.2.2.2 KABOOM algorithm
........................

Version 0.6 supports the k,alpha,beta multiple-word suffix-array based
algorithm for faster clustering - known as the KABOOM algorithm. This
replaces the previous suffix-array algorithm supported in version 0.5.

   First, this is an option that the user must ask for. Unless you
explicitly ask for that option you will get the standard clustering
algorithm.

   The KABOOM algorithm is appropriate for clustering large files -
typically we see an order of magnitude improvement in performance.
Version 0.6.0 only supports files up to 2GB in size but this is not a
difficult limit to break. If you want to use the KABOOM algorithm on
larger files, let me know. It's on my list of things to do but there's
nothing like encouragement.

   The KABOOM algorithm requires that a suffix array of the input file
and its reverse complement are available. We recommend that the `mkESA'
tool of Homann, Fleer, Giegerich and Rehmsmeier be used
`http://bioinformatics.oxfordjournals.org/content/25/8/1084.short'. This
is a flexible and powerful tool that has performed exceptionally well on
our data.  However, you are welcome to use whatever tool you prefer as
long as it produces output in the same format.

   To create the reverse complement, `wcd' has an auxiliary feature
(the mirror or `-y' option documented below). Alternatively, the
`revseq' program in the EMBOSS suite `http://emboss.sourceforge.net/'
could be used.

2.2.3 Changes in 0.5
--------------------

Version 0.5 is a big change from 0.4 internally, but not externally.
The memory representation was changed a lot to make `wcd' more robust
for very large data sets run in highly-parallel mode. From a user
perspective, there are few differences. Here are some of the changes (a
few were put in 0.4.5.7 but may not have been noticed).

   * NOAUXINFO. A new configure-time option `--enable-NOAUXINFO' has
     been included. This is switched OFF by default. If you enable it,
     `wcd' does not store any of the auxiliary sequence information,
     such as sequence IDs. For extremely large data sets this may make
     a difference in memory usage. Also, I found that on some systems
     doing several million mallocs stress-tested the memory system
     implementation. My personal experience is that this option is not
     needed - but one person who used the system found it useful. This
     was introduced in 0.4.5.7, and may be less important in 0.5 as the
     memory implementation has changed.

   * Run time option: `-Q 1' Index sequences from 1 rather than 0. Only
     use if you know what you are doing. See the discussion below.

   * Run time option `-G val@dir'. `wcd' puts the clusters found in the
     given directory. See the discussion below.

   * Run time option `-k splitname -f clusterfile'. Split the input
     sequence file into clusters given an existing cluster table. This
     option is _post hoc_: here we have the cluster table already and
     are splitting according to that. The `-G' option is for when you
     don't have a cluster table and must do clustering.

2.2.4 Version 0.4
-----------------

There are a number of changes from 0.3 to 0.4 in memory organisation
and algorithm. The first two are important to note if you used the
relevant options, as their default values have changed.

   * The sample word length is no longer a parameter. It's fixed as a
     constant at 8. This was done after extensive empirical testing.
     Fixing it so that a sample word fits in a machine word has numerous
     computational advantages. The -B option therefore no longer
     exists. The default value for -K is now 6.

   * Our second heuristic, the so-called t/v heuristic, has been
     replaced in favour of a common word heuristic. Though the common
     word heuristic does seem better, it is more expensive to
     implement. The -H or --common_word heuristic's default value is
     now 65.

   * The SMP algorithm has been simplified. It is now possible to use
     the Pthreads and MPI parallelisation together. This may be useful
     when you have a cluster of multi-processor machines, but if memory
     is available it is better and  easier just to use MPI. So if you
     have 16 machines with 4 cores each, run an MPI job with 64
     processes. You should only use pthreads in place of or as well as
     MPI where memory is a factor and some experimentation on your
     architecture shows performance improvement.


2.3 Installing the distribution
===============================

The distribution comes as a gzipped, tarred file.  The following
commands should create a directory called `wcd_distrib' with all the
necessary files.


  gunzip wcd_distrib.tar.gz
  tar -xf wcd_distrib.tar

2.4 Compiling
=============

There are two possible ways to compile and install `wcd'.

2.4.1 Method 1: using `configure'
---------------------------------

The INSTALL file gives a more detailed explanation and description.
The following should work if you don't want anything fancy. If you do,
or if you want to ensure that some compile flags are different, then
read the INSTALL file.


./configure
make
make install

Note that the last instruction will attempt to install the executable
and documentation in the appropriate places on  your system and will
probably require you to have root privilege or write permission for
those standard directories. If  you don't have these, you can find the
executable `wcd' in the `src' directory and the texinfo documentation
`wcd.info' in   the `doc' directory. Manually copy these to the correct
place.

   If you are using FreeBSD, you may have to make a small change to the
`common.h' file.

   If you want a manual you can print out, say `make pdf'. Otherwise
you can read the on-line version using the `info' system. If the
documentation is installed in the right place on your system you can
just say `info wcd', otherwise you will have to say `info -f wcd.info'.

   `wcd' should work on 64-bit, little-endian and big-endian machines.

2.4.1.1 MPI Version
...................

From version 0.3.0, wcd has an MPI version that allows parallelisation
of the clustering on a cluster of workstations.


./configure --enable-MPI
make
make install

Initial experimentation has shown that using the MPI option when not
necessary leads to a 2% degradation in performance, so the main benefit
of not using the `--enable-MPI' option is that `wcd' will compile
without the MPI libraries.

2.4.1.2 Pthreads
................

From version 0.3.0, wcd has a Pthreads version that allows
parallelisation on SMP architectures.


./configure --enable-PTHREADS
make
make install

In version 0.4, the two parallelisation techniques complement each
other. You can run on a cluster of SMP machines. However, an assumption
is made that each of the SMP machines has the same number of processors.
In version 0.3, binaries supporting both techniques could be used but
only one used in a particular run; this no longer holds in 0.4.

2.4.2 Method 2: Manually using `make'
-------------------------------------

The `wcd' program is written in reasonably standard C.  If you don't
want to or can't use the autoconf method described above, try the
following. Change directory to the `src' directory.


make -f WcdMake wcd

This was tested on different versions of RedHat and Suse and worked
without problem. `wcd' is written in reasonably standard C, and so
should work on most systems, but unfortunately the vagaries of different
libraries and compilers might make the standard compilation fail. See
the troubleshooting section below for some suggestions.

   To get documentation, running `make -f WcdMake pdf' in the `src'
directory will produce a pdf version of the manual (placing output in
the `doc' directory). `make -f WcdMake info' will produce the texinfo
version.

2.4.2.1 Trouble shooting
........................

If you know something about your system and are happy mucking about
with make files, please read the technical note below and make the
necessary changes. If not, try the following suggestions. There are
three likely causes of problems: (1) `wcd' supports long options (e.g.
you can say `--cluster_compare' rather than `-D'); (2) your compiler is
`cc' rather than `gcc'; and (3) your compiler does not support
procedure inlining (inlining may make some performance benefit, but
does not change functionality or use).  The makefile `WcdMake' is the
one to look at.

  1. Try `make -f WcdMake simple' to see whether the problem is that you
     don't have the libraries that support long options. If this works,
     `wcd' will run as is, except that you cannot use the long options
     form for `wcd', only the short-form options.

  2. If your compiler is named `cc' rather than `gcc', try saying `make
     -f WcdMake ccsimple'.

  3. If that doesn't work, try switching off inlining as well by doing
     `make -f WcdMake reallysimple'. In this case neither long options
     nor inlining is used. If your compiler is `cc' then try `make -f
     WcdMake ccreallysimple'.


`wcd' was tested on a number of different systems. The table below
shows what worked on those systems. With a little tweaking of the
makefile to specify where _gcc_ libraries on your machine are, you
might get the standard `make wcd' to work too.

_Linux RedHat and Suse_
     make wcd

_Sun OS 5.10, Darwin, Aix, FreeBSD_
     make simple

_Irix64, Alpha/OSF1 Tru64, Cray_
     make ccreallysimple

With the FreeBSD machine I managed to get the full version to work by
finding the appropriate libraries and includes. The same thing probably
applies to the others.

*This section need only be read if `make wcd' did not work and you
would like to have long options working.*

   The non-standard options of `wcd' are to use inlining and the long
options library. If your system does not support inlining then the
compiler must be given the `-DNOINLINE' flag when run. If your system
does not provide a library for getting long options then the compiler
must be given the `-DNOLONGOPT' flag.

   If, however, your system does provide a long options library and you
would like to use long options, then you must say where the include file
(`getopt.h') and where the library will be. You may also have to
explicitly tell the compiler to use the `libtextlib' library. Find out
where these things are, and change the variables in the `WcdMake' to
the appropriate values and then run `make try'.  For example, if the
include file is in `/usr/share/contrib/gcc' and the library is in
`/usr/local/lib' you might have


INCLUDES = /usr/share/contrib/gcc
LIBS     = /usr/local/lib
EXTRAFLAGS = gettextlib

You might or might not have the EXTRAFLAGS variable defined.

   The makefile assumes that the compiler available is `gcc'. If it's
not or if you wish to use a different compiler, then edit the file
`Makefile' and change the definition of the variable `CC' to the
appropriate compiler.

   If this doesn't work, then it may be that your libraries are in
non-standard places. Look at the Makefile and try the following things
   * Add `-DNOLONGOPT' to the CFLAGS variable. In this case, you can't
     use the long form options (e.g. you must say `-g' rather than
     `--histogram')

   * Try replacing `gcc' by `cc'.

   * Add to the CFLAGS the directories where the include and libraries
     can be found.

   `wcd''s MPI and PTHREADS code are optional. In the code there are
macros that protect all references to MPI and PTHREADS. Unless these
macros are defined the compiler won't know about them. This enables
`wcd' to be compiled without having either libraries.

   If you are making manually rather than through configure, and you
want to use MPI or PTHREADS (either or both), then you must make sure
that either (or both of) the MPI and PTHREADS macros are defined. You
could either put the relevant defines in the `wcd.h' file, or pass
them to `gcc' with something like `-D MPI'.

2.4.3 EMBOSS
------------

EMBOSS wrappers are available from the same source as the `wcd' code
itself. However, they do not currently support the new features in 0.6.

2.4.4 Auxiliary programs
------------------------

A number of auxiliary programs are included. They may be useful but are
not essential. They are written in either Perl or Python.

   They should run as is, but if your version of `perl' or `python' is
not in a standard place, edit the files and change their first lines to
replace the line that says

     #!/usr/bin/perl

with

     #!/usr/local/bin/perl

or whatever.

   In addition to these programs, there is a shell script
`wcd_wrapper.sh' that allows `wcd' to be used as a replacement for
`d2_cluster' in the `stackPACK' analysis pipeline.  

2.5 Documentation
=================

This info file can be read by saying

                  info -f wcd.info

To create the documentation say `make pdf, make html', etc depending on
what sort of document you want. This creates the following:

   * `wcd.info': the info file which can be read using the texinfo
     system. This is probably easiest for reference purposes.  You can
     read the on-line version using the `info' system. If the
     documentation is installed in the right place on your system you
     can just say `info wcd', otherwise you will have to say `info -f
     wcd.info', or give the full path.

   * `html': a directory that contains all this help information in
     HTML files. Use your browser to read it.

   * `wcd.pdf': A PDF version of this help file, which is probably the
     best for a hard copy.


You can play around with `makeinfo' etc. if you want different versions.

2.6 Using the Amazon Elastic Computing Cloud and S3
===================================================

`wcd' has successfully been used on the Amazon Elastic Computing Cloud.
To learn more, please go to `www.amazonaws.com' for an overview of EC2
and my paper _Scientific computing using virtual high-performance
computing: a case study using the Amazon elastic computing cloud_
published in _Proceedings of the 2008 Annual Research Conference of the
South African Institute of Computer Scientists and Information
Technologists_, October 2008
`http://doi.acm.org/10.1145/1456659.1456671'.

   This section assumes you are a registered EC2 user and know how to
use it.

2.6.1 Pre-packaged AMI
----------------------

Public prepackaged images are available from Amazon AWS. The AMI's
manifest name will be `wcdimages/wcd-express-6.x.y' so please search
for manifests with `wcd-express' in them.

   The AMI is based on the standard Amazon Linux AMI. In addition to the
packages that are part of that, gcc, gdb, and emacs are installed as
well as the mkESA, libfid and wcd code.

   The default user for this AMI is called `ec2-user'.

   The `wcd' binary and ancillary scripts can be found in
`/usr/local/bin' and all the documentation can be found in
`/usr/local/doc'. In the `/home/ec2-user/Data' directory are two sample
directories.

   Using the standard wcd algorithm, check all is well by saying



wcd -g ~/Data/benchmark1000.fa

   Check that the KABOOM algorithm is working by saying



prep-kabm ~/Data/benchmark10000.fa -g

2.7 Guide to using MPI
======================

Note that the KABOOM algorithm is not parallelised yet. However, if you
choose not to use KABOOM, you can use MPI. This section gives my
suggestions for running and configuring MPI. MPI generally is easy to
install and get to work. However, having tried to use wcd in a number
of environments and tried to help a number of wcd users, I have
encountered very severe problems caused by machines on which multiple
MPI implementations have been installed. Sometimes these could be
completely different implementations of MPI (e.g. LAM/MPI and MPICH2),
and sometimes just different versions of the same MPI (e.g. openmpi
1.2.3 and openmpi 1.2.8). What can happen is that the mpicc compiler
is a slightly different (or even very different) version to mpirun, or
to some of the MPI libraries.

   This is a really horrible problem. Often the problem does not
manifest itself clearly. There may not be a problem when run on one
machine only. For small data sets there may be no or few problems but
as the data set size and number of machines grow then problems do occur.

   I find that the problem particularly occurs when automatic package
managers like apt-get or fink are used.

   If you are sure that you are only ever going to have one version of
MPI on your system, and you are never going to upgrade it, then ignore
these ramblings.  I am sure there are much cleverer and better ways of
doing things, but this works for me as a clean way of making sure that
if I add or change things I don't break previous assumptions. (Or use
the `modules' or `env-switcher' packages!)

   First, I make sure that any existing MPI library or software is
deleted from any obvious places, including /usr/lib, /usr/bin,
/usr/local/bin, /sw/bin, etc. If necessary do something like

   `find / -name "*mpi*" -print'

   Then, create a special directory for each MPI installation you
want. I use either /opt or /usr/local for this, e.g. /usr/local/mpich2.

   Install the MPI version in this directory. If your package manager
can't do this, then get the source files and compile from scratch. When
you run configure you will say something like

   `./configure --prefix=/usr/local/mpich2'

All the library and binary files will be put into this directory. You
must now add the lib and bin directories to your path:


setenv PATH /usr/local/mpich2/bin:${PATH}
setenv LD_LIBRARY_PATH /usr/local/mpich2/lib:${LD_LIBRARY_PATH}

   If you install a new version, or even upgrade, put the new version
in a new directory and change your paths.


File: wcd.info,  Node: Running wcd,  Next: wcd inside stackPACK,  Prev: Installing wcd,  Up: Top

3 Running wcd
*************

`wcd' can be invoked in different ways. The first just shows the usage;
the others are functional. They take an input file in FASTA format and
process it in some way. `wcd' numbers the sequences in the file from 0
upwards, and these indices are used to refer to the sequences.

   Options or switches to `wcd' can be given in short form (e.g. `-D')
or long form (e.g. `--cluster_compare'). On some systems, only the
short form options are supported.

3.1 More detail
===============

3.2 Preparing  your data: Garbage in, garbage out
=================================================

The value of clustering depends very strongly on the quality of the
input data. In particular, the presence of vectors, repeats and
significant low complexity data will compromise the quality of your
clustering. For this reason it is strongly recommended that you put
effort into cleaning your data and doing masking before you cluster.

3.3 Examples -- A Quick Introduction to `wcd'
=============================================

This section shows common ways in which `wcd' is likely to be invoked.
The following examples show straightforward clustering examples.

   > `wcd --show_clusters data/5000.seq'

     Cluster the sequences found in the file `data/5000.seq'. Print
     the clusters on standard output in compact form. Use the d2
     function to determine cluster membership.

   > `wcd --histogram --show_clusters data/5000.seq'

     As above, but also print a histogram that shows the size of the
     clusters found.

   > `wcd --output ans/5000.ans --histogram -c data/5000.seq '

   > `wcd -o ans/5000.ans -g -t data/5000.seq '

     As above, but save the output in a file.

   > `wcd -c -N 5 data/5000.seq ' 

     _If `wcd' has been installed with the PTHREADS option_. Run
     `wcd' on 5 processors at the same time.

   > `mpirun -np 16 wcd -c  data/5000.seq ' 

     _If `wcd' has been installed with the MPI option._ Run `wcd'
     on 16 processors using the MPI libraries.

   > `wcd -X -c data/5000.seq '

     Cluster, but also use clone information. If two ESTs come from
     the same clone, they'll also be put together. The clone
     information comes from the FASTA file directly - it's the symbol
     that follows the word "clone" in the header. This is a convenient
     option, but for larger files it would be better to put this
     information in a constraints file.

   > `wcd --kabm -U 16 -c data_all.fa'

     Cluster using the KABOOM clustering algorithm to speed things up.
     This assumes that the data file contains the sequences and their
     reverse complements and that the suffix array is also available.


3.4 Parameters
==============

`wcd', like most tools, has many parameters that can be used to
control the clustering. Broadly they can be divided into two classes:
parameters that control a biological model of what we want clustered
together; and heuristic parameters that we can use to trade off speed
and quality. `wcd' comes with sensible defaults but experience shows
that the choice of parameters really matters and that it is data-set
dependent. Our advice is that you need to cluster with different sets
of parameters and then compare the results that you get.

   Below are the key parameters: the first two are the biological
parameters, the rest are heuristic parameters. If you are using the
KABOOM algorithm, its heuristics are described later.

   * Window length `-l', `--window_len'.  `wcd' clusters sequences
     together if they share regions that are approximately the same.
     This parameter says how long these regions should be. The default
     is 100: this means that two sequences will be clustered if there
     are two regions of length at least 100 that are approximately the
     same. The bigger `window_len' (all other things being equal), the
     stricter the clustering.

   * Clustering threshold `-T', `--theta'. This says how similar the
     regions should be. For d2 clustering, a `theta' value of 0 would
     be saying that two sequences should be clustered together if they
     share regions (of length specified using `window_len') that are
     exactly the same. The default value is `40'. For a window length
     of 100, this corresponds to roughly 96% similarity. For a window
     length of 80, a `theta' score of 60 is roughly 92% similarity.
     However, there is no direct relationship between `theta' score
     and similarity. d2 tends to punish differences that are spread
     throughout two regions more harshly than differences that are
     clustered together; d2 tends to punish repeats more harshly than
     edit distance does.

   * Common word heuristic `-H', `--common_word'. This is a heuristic
     used to speed up clustering. Sequence `i' is clustered with `j'
     only if a window of length `window_len' of sequence `j' contains
     at least `H' words of length 8 that are in `i'. The choice of
     `H' can help speed up clustering.

     For a window length of 100, if `theta' is 40, we can see that an
     `H' value of 72 is very safe to use. For a window length of 80,
     if `theta' is 60, then an `H' value of 42 is safe to use
     (provided there are few sequences that are smaller than the
     window length, which is why for ultra-cautious clustering I might
     use an even lower value).

   * The `-K', `--sample_threshold' parameter. 4 is the default.


Cookbook suggestions
--------------------

Below are some suggestions for parameters to use for different
sequences. However, again, I emphasise the importance of looking at your
data. Do you want short regions of very high similarity, or longish
regions of good but not necessarily excellent similarity. How
error-prone is your data? How long are the reads? It is only by
understanding your data that you can make sensible choices. Moreover, I
think you need to cluster with different sets of  parameters and see
what you get. If you find that you get roughly similar results for a
range of different parameters then you can have confidence in the your
data.

   Our recommendation is that you first cluster using relatively
aggressive heuristic parameters (e.g. `-H') with a range of different
biological parameters (e.g `--window_len' and `--theta'). Once you have
explored the effect of the biological parameters, you can choose one
and recluster with more conservative heuristic parameters that  will
take a longer time to process.

   * For Sanger style sequencing the following parameters are suggested

     `-l 100 -T 40 -H 70'

     This would be for good quality data with average sequence length
     well above 300. If your data quality is not so good, you could try
     relaxing the T value up to 60. An H value of 70 is conservative
     but not ultra-conservative. You might get slightly better quality
     by choosing an H of 62, though clustering will take a little
     longer.

   * For 454 data (average sequence length above 150), the following
     has worked well

     `-l 80 -T 60 -H 36'

     You could increase the T value or reduce the window length, but
     you have to wonder about the quality of the data then. Think
     about it, but you are probably generating so much data that it's
     better to err on the side of caution and be too strict. The H
     value of 36 is conservative: particularly if you have longer read
     lengths and your quality is reasonable, going up to 42 would
     probably be OK. To be ultra-cautious, and probably in the last
     phase of the clustering step, try H 25.

     Choose `-K 3' if you are NOT using the KABOOM algorithm (or even
     2, but that would slow things down). If you are using the KABOOM
     algorithm don't worry - it takes care of this for you.

   * For Solexa data, how well `wcd' works will depend on sequence
     length. I've never tried it with really short sequences (50 or
     less). For sequences of length 62 I used the following parameters
     with some success:

     `-l 58 -T 55 -H 22'

     Choose `-K 2' if you are NOT using the KABOOM algorithm (or even
     1, but that would slow things down). If you are using the KABOOM
     algorithm don't worry - it takes care of this for you.


3.5 Format of input files
=========================

The format of input files is:
   * FASTA format

   * `wcd' will treat Ns randomly.


*What is meant by FASTA format?* Each sequence MUST be preceded by an
identification line. Each sequence itself may be on one line, or it may
be on several lines. If it is on several lines, each line should
terminate with a carriage return and there must be NO spaces on each
line.

   The identification line starts with a `greater-than' sign (>). This
is all that is required. If there is an alphanumeric sequence (a string
with no blanks) IMMEDIATELY following the greater-than sign then it is
treated as a sequence ID that is used by a few of the options for
display purposes. The rest of the identification line is completely
ignored.
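
   A minimal reader following these rules might look like the sketch
below (illustrative only; `wcd''s own parser is more forgiving):


def read_fasta(path):
    # Returns a list of (id, sequence) pairs; wcd numbers the
    # sequences from 0 in the order they appear here.
    seqs, seq_id, parts = [], None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    seqs.append((seq_id, "".join(parts)))
                fields = line[1:].split()
                seq_id = fields[0] if fields else ""
                parts = []
            elif line:
                parts.append(line)
    if seq_id is not None:
        seqs.append((seq_id, "".join(parts)))
    return seqs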

3.6 Output and output format
============================

All output goes to standard output. Look at the arguments section to
decide the format of the output. Note that if you don't give any format
arguments, nothing will get printed, which would be a waste.

Format of the Compressed Cluster Table
......................................

A convenient way (for humans and probably for many programs) to show the
output of the clustering process is to use the `--show_clusters'
option.  The format of the compressed cluster table is very simple. Each
cluster appears on a line by itself. The cluster is given by listing the
indices of the sequences that make up the cluster. The indices are
separated by a space, and the last sequence in the cluster is followed
by a full stop `.'.

     0 1 3.
     2.
     4.
     5 7.
     6 8 9.
     10.
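
The format is easy to consume programmatically. A sketch of a reader
(a hypothetical helper, not shipped with `wcd'):


def read_clusters(path):
    # One cluster per line: space-separated sequence indices,
    # terminated by a full stop.
    clusters = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith("."):
                clusters.append([int(i) for i in line[:-1].split()])
    return clusters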

Format of extended cluster table
................................

When the `--show_ext' option is chosen, the clustering is given in
table format. The columns of the table are as follows:

   * the sequence identifier (note that the sequences are numbered from
     0, in the order that they appear in the input file);

   * cluster number: in each cluster, one sequence is chosen as the
     representative of the cluster and its index is used for the
     cluster. You can identify the roots because their indices are the
     same as their cluster numbers.

   * link: the number of another sequence in the cluster. It is
     guaranteed that if you start at the root, or representative
     sequence in the cluster, you can traverse the entire cluster using
     the link field. It is not guaranteed that two adjacent nodes are
     within the d2 threshold.  

   * orient: the orientation of that witness (positive or RC) with
     respect to the root or representative sequence in the cluster.

   * witness: the number of another sequence in the cluster which is
     within the d2 threshold of that sequence. This may or may not be
     the same as the _link_ field. Note that the value of the _link_
     field is an artifact, merely a convenient way in which we can list
     all the sequences in one cluster, whereas the _witness_ field
     tells us about two sequences that do overlap.

The _orient_ field requires a little more explanation. It gives the
orientation of the sequence with respect to the _root_ of its cluster.
Formally, this is described as follows. Let x and y be two sequences in
a cluster. While they may not overlap, we know that there is an ordered
list or path of sequences x = x_0, x_1, \ldots, x_{n-1}, x_n = y such
that for each i either d^2(x_i, x_{i+1}) \leq \theta (positive match)
or d^2(x_i, rc(x_{i+1})) \leq \theta (reverse-complement match), where
\theta is the threshold. In particular for every sequence there is such
a path from the _root_ of the cluster to that sequence. In a path from
the root to a sequence x, we compute the number of times the match is a
positive one, and the number of times it is a reverse-complement one.
The `orient' field is 1 if the number of reverse-complement matches is
even, and -1 if the number is odd.

   In _principle_ it is possible for there to be two paths from the
root to a sequence which would yield different _orient_ values. First,
this is unlikely to happen. Second, all the orient field is saying is
that such an orientation of the sequence is legitimate. The fact that
other orientation is also legitimate does not affect the correctness of
the result.
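
   Equivalently, the orientation is just the parity of the
reverse-complement matches along the path. A sketch (the
list-of-match-types representation is hypothetical):


def orient(path_matches):
    # path_matches: the match types along the path from the root,
    # each either "pos" (positive) or "rc" (reverse complement).
    rc_matches = sum(1 for m in path_matches if m == "rc")
    return 1 if rc_matches % 2 == 0 else -1

# orient(["pos", "rc", "rc"]) == 1;  orient(["rc"]) == -1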



3.7 KABOOM Algorithm
====================

The wcd tool has the KABOOM (pronounced KUHBAM) algorithm as an option
which can be used to speed up clustering significantly. It's a little
trickier to use because it relies on external tools. We recommend the
use of the following tools (you don't have to use these tools and can
use others provided that they produce equivalent output):
   > mkESA: Download from
     `http://bibiserv.techfak.uni-bielefeld.de/mkesa/'. See
     `http://bioinformatics.oxfordjournals.org/content/25/8/1084.short'.
     This takes a FASTA file as input and produces a suffix array as
     output.

   > If you prefer to use EMBOSS revseq rather than `wcd''s facility,
     download and install that (but `wcd''s facility is good).

   > The NCBI mdust program for masking (this part is optional).

Assuming that all preparation has been done (BUT see below how to do
preparation, don't just run wcd on your favourite FASTA file), to run
the program with KABOOM, you would say something like:

   `wcd --kabm --alpha 3 --beta 25 -U 16 data_all.fa'

This says: use the KABOOM algorithm to speed up clustering. In this
case, pairs of sequences should only be clustered together if they
share at least 3 (alpha) words of length 16 (U) and the left-most and
right-most shared words must start at least 25 (beta) positions apart.
   Suggestions for reasonable choices of these parameters are:
   * Sanger: U=16, alpha=5, beta=36

   * 454: U=16, alpha=3, beta=16

   * Solexa: U=14, alpha=3, beta=10

Preparing the data file
-----------------------

You can't just run this option on your raw input file. You need to
prepare it by adding the reverse complement and creating a suffix array.

   A number of shell scripts have been shipped with the latest version
of `wcd' to ease its use. These are described in the next section;
the section after that describes the whole procedure done manually.

3.7.1 Scripts to run this option
--------------------------------

These scripts assume that the `mkESA' and `wcd' programs are in your
path, and if you are using `mdust', that it is also in the path. All of
these programs and the scripts must be executable.

   The scripts also assume that the `/tmp' directory can be written to.

   Assume that your input file is called _data.fa_ (that is, the raw
input file as-is: no reverse complement).

3.7.1.1 If you are running with mdust
.....................................

   * `dustprep-kabm data.fa options'

     This dusts your file, creates the reverse complement, creates the
     suffix array and then runs wcd with the appropriate parameters.

     Note that the first argument of the script _must_ be the data
     file. The rest of the arguments can be any parameters that `wcd'
     takes.  You do not have to specify `--kabm': the script does
     that.

     As an example, you could say

     `dustprep-kabm data.fa -H 64 -U 16 --beta 15 --theta 60 -c -o
     output'

   * `kabm data.fa options'

     As above, but this assumes that you have previously run
     `dustprep-kabm' once and are now running wcd again. That is, once
     you have dusted, and created the suffix array, you need not do it
     again.


3.7.1.2 If you are NOT running with mdust
.........................................

   * `prep-kabm data.fa options'

     As above, but no dusting is done.

     As an example, you could say

     `prep-kabm data.fa -H 64 -U 16 --beta 15 --theta 60 -c -o output'

   * `kabm data.fa options'

     As above, but this assumes that you have previously run
     `prep-kabm' once and are now running wcd again. That is, once
     you have created the suffix array, you need not do it again.


3.7.2 Doing it manually
-----------------------

To prepare your data, you must do the following:
   * Optionally dust. Note that if you dust, you should use X as the
     mask character.

   * Create a file which contains your original data followed
     immediately by the reverse complement.
        * You could do this using `wcd -y'.  This takes the named
          input file and creates the new file with the data and the
          reverse complement.

          For example, `wcd -y -o data_all.fa data.fa' takes the file
          `data.fa' and copies it and its reverse complement into
          `data_all.fa'.

          Note that if you use the `-y' option, no clustering is done
          - all that happens is that a data file is prepared.

          *Note that in previous versions, the `-y' option took an
          argument, the name of an output file. This caused some
          subtle problems and so this argument cannot be used anymore:
          use the `-o' option to redirect output.*

        * Or you could do it some other way (a sketch follows this
          list) - but why not just use `wcd'?

   * Make a suffix array of this new file
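
   If you do go your own way, the operation itself is straightforward;
a Python sketch (assuming plain A/C/G/T/N data):


def reverse_complement(seq):
    # Complement each base and reverse the sequence; N maps to itself.
    # Other IUPAC codes are conservatively mapped to N in this sketch.
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp.get(b, "N") for b in reversed(seq.upper()))

# The prepared file is the original records followed by a reverse
# complement record for each of them.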

Suppose you have done this step and you now have three files:
   * data_all.fa (the prepared file as you created)

   * data_all.fa.suf (the suffix array as prepared produced by mkESA)

   * data_all.fa.ois (this is essentially the data file with all FASTA
     comments and unnecessary newlines stripped out)

But your mileage will vary. Of course, you can and must also use the
other parameters discussed previously (T, l, theta, H).

   Note that the name of the data file is up to you (as is whatever
file extension you choose), but `wcd' then assumes that there are two
corresponding files with extra `.suf' and `.ois' extensions.

3.7.3 Running mkESA
-------------------

Assuming that the FASTA file for clustering is in the file
`data_all.fa',   you should run `mkesa -p data_all.fa -d data_all.fa -b
N -g ois'

   Currently, `wcd' is limited to 32-bit addressing (in fact it's a
trivial thing internally to change to 64-bit, but this would double
memory usage, which can be substantial; it will take a little more
effort to deal with larger files gracefully, but hopefully for 0.6.1!).
This means that the largest data file that `wcd' can deal with is 2GB
(since once you've added the reverse complement you get up to 4GB).

   One small problem you might encounter is that `mkesa' is sensitive
to the input file (in the above, the `-b N' tells `mkesa' to use a
nucleotide alphabet: A, C, G, T). However, some FASTA files use the
entire allowable IUPAC alphabet to represent classes of bases. In
which case, create a file (say called alphabet) with all the allowable
characters in it and then use `-a alphabet' instead of `-b N'.  Such a
file can also be found in the ancillary directory.

3.7.4 Parallelisation
---------------------

In version 0.6.0, the KABOOM algorithm is not parallelised. Please do
NOT use either the pthreads or MPI parallelisation with it.

3.8 Catalogue of options
========================

3.8.1 Clustering an entire file
-------------------------------

   * `[ --output | -o] fname' 

     By default all `wcd' output goes to standard output. Using this
     option allows you to specify another file.

   * `[ --num_seqs | -C] val'

     By default all the sequences in the input file will be processed.
     If you only want to process part of a file, you can use the -C
     option.

     e.g. `--num_seqs 100'

     will only process the first 100 sequences.

     If you specify a number  greater than the number of sequences
     actually in the file, then the whole file will be processed.

   * `--show_clusters, -c '

     Prints the results of the clustering in a compact way. Each
     cluster is printed on a line by itself: the indices of the
     sequences that make up the cluster, separated by spaces and
     terminated by a full stop.

   * `--histogram, -g'

     Show a histoGram of the results of the clustering. For each
     cluster size, it shows how many clusters there are of that size,
     up to some maximum size.

   * `--show_ext, -t'

     Prints the clustering in extended format. See the Output section
     for a description of the format.

   * `[--function | -F] fun'   `wcd' decides whether two sequences
     should be clustered together on the basis of a distance function.
     The distance functions that can be used are

        * `--function d2': use the d2 function. This is  the
          default. The default threshold is 40.

        * `--function ed': use edit distance (local alignment).
          The default threshold is -20. See the `--parameter'
          option below for how to specify other options.

        * `--function heuristic': use the common word heuristic
          described below. The common word heuristic gives a crude and
          fast membership criterion.

        * `--function ends': This uses the d2 function, in either
          normal (all-versus-all) or KABOOM mode. However, a pair of
          sequences are only clustered if their ends match. That is,
          if the end of sequence _i_ matches approximately the
          beginning of sequence _j_ (or vice-versa), or if the
          beginning of _i_ matches the beginning of the reverse
          complement of sequence _j_, or if the end of _i_ matches the
          end of the reverse complement of _j_. Whether this or the
          standard mode is better to use depends on the biological
          question.

        * `[--parameter | -P] fname'

          This specifies a parameter file that parameterises the
          distance function. In the current version this is only used
          by the edit distance option. The first four lines specify a
          4x4 matrix which gives the penalties for substitutions
          (a,c,g,t vs a,c,g,t). There are four integers per line which
          should be separated by a single space. The fifth line gives
          two integers, separated by a space, which give the cost for
          opening a gap and extending a gap. The file that would be
          used for the default parameters is shown below


          -1 3 3 3
          3 -1 3 3
          3 3 -1 3
          3 3 3 -1
          5 2


   * `[--common_word | -H] val'

     Set the common word heuristic threshold (the default is 65).
     Before running a d2 check between two sequences, this first
     checks how many distinct 6-words are shared between the sequences
     (NB, the whole sequences, not some windows). This can be done in
     linear rather than quadratic time and so is probably two orders
     of magnitude faster than checking d2. If not enough common words
     are found, a d2 check will not be done. [NB: this has changed
     from version 0.3]

   * `[--window_len| -l] val' 

     Set the window length to the given value. The default is 100.

   * `[--skip_val|-S] val '

     Set the skip value -- how far the window along the second
     sequence is advanced each step. The default is 1. Don't be too
     aggressive. Setting the common word threshold at its default
     value is probably better than changing the skip value (IMO).

   * `[--threshold_val, -T] val' 

     Set the distance threshold -- the default is 40 for d2, -20 for
     edit distance.

   * `[--word_len|-w] val' 

     Set the d2 word length (default 6).

   * `--performance, -s'

     Show performance stats

   * `--no_rc,  -n'

     Don't do the reverse complementation check (rc-checking is done by
      default).

   * `[--constraint, -j] filename'

     Give the constraint file for the first input data file. This is
     optional.  The constraint file enables you to ensure that certain
     sequences are not clustered together, or to ignore certain
     sequences while clustering. A later section gives more details on
     the required format of the constraint file, and the semantics.

   * `[--sample_word_len | -B] val'

     Word length used in the _sample_ heuristic. See below for use.

   * `[--sample_thresh | -K] val'

     This is the threshold for the _sample_ heuristic. Suppose B and K
     are the sample word length and threshold parameters. When
     comparing two sequences i and j, the first sample test described
     below is done. If it passes, a more rigorous test is done for
     similarity; if it fails the pair is declared not to have overlap.

     _The sample test:_ When comparing two sequences i and j, every
     8th word of length B is sampled from j; at least K of these must
     also occur among all the words in i. The defaults are K=7, B=8,
     which is conservative. [NB: this has changed from version 0.3]

   * `-Q': Index from some value other than zero.

     For various sound internal and external reasons, `wcd' numbers
     the sequences in the input file from zero. However, some
     downstream applications number sequences from 1. If you use the
     `-Q' option, the input sequences get numbered from the value
     given (e.g. `-Q 1' will number from 1).

     _Warning!!!!!_ Treat this option with great care. Some of
     `wcd''s options (like the `-f' option with splitting) take an
     input file with sequence indices.  You must be consistent in what
     you do. I would recommend against using this option if you are
     inputting cluster indices to `wcd'.

   * `-X': Do clone linking

     `wcd' will use the clone information in the sequence headers to
     put sequences together. If a sequence header contains the word
     _Clone_ followed by a string, then that sequence is identified as
     matching a particular clone. All ESTs matching a particular clone
     will be clustered together. (The current implementation is very
     simplistic and probably adds about 25% to the cost of clustering.
     I previously had plans to improve this, but my feeling now is
     that this is something better done in a post-processing step.)
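
   As a rough illustration of the `--parameter' file format described
above, the following C sketch reads such a file. The function name and
error handling are invented for illustration; this is not the `wcd'
source:

   #include <stdio.h>

   /* Read a 4x4 substitution matrix followed by the gap-open and
      gap-extend costs, in the format described above. */
   int read_params(const char *fname, int sub[4][4],
                   int *gap_open, int *gap_extend) {
     FILE *f = fopen(fname, "r");
     int r, c;
     if (f == NULL) return -1;
     for (r = 0; r < 4; r++)
       for (c = 0; c < 4; c++)
         if (fscanf(f, "%d", &sub[r][c]) != 1) { fclose(f); return -1; }
     if (fscanf(f, "%d %d", gap_open, gap_extend) != 2) {
       fclose(f);
       return -1;
     }
     fclose(f);
     return 0;
   }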


3.8.2 Clustering a range
------------------------


Usage: wcd [--range, -R]  dataf i j

The range option allows clustering only a range or slice of the input
data file in the following way.

   * Each sequence in the file with an index from _i_ (inclusive) to
     _j_-1 (inclusive) is compared to each sequence with a larger
     index. More precisely:


     for(k=i; k<j; k++)
       for(m=k+1; m<num_seqs; m++)
         compare sequences k and m and cluster if necessary

If you think of all the comparisons to be done as the upper half of a
matrix, the range option restricts the comparisons to a slice of this
matrix. The purpose of the option is to allow a simplistic and crude
parallelisation of the work. We can run multiple `wcd' processes with
the same data but different slices. Typically we would also use the
`--dump' option with this. See the section on parallelisation in the
technical manual.

   If the dump option is chosen, just before completion `wcd' creates a
file with an `-FIN' suffix. For example if the dump file name is
_range10_, then a file called _range10-FIN_ is created. This enables
monitoring programs which run asynchronously and cannot control the
`wcd' process directly to use the file system to determine whether the
job has completed.

3.9 Refining clusters
=====================

3.9.1 Format of clustering input
--------------------------------

The merge and add options require as input files that specify a
clustering. These files must use the compressed format described below.

Format of constraint file
.........................

Constraint files consist of a sequence of constraints, each on a line by
itself.  Each line in the constraint file is a directive followed by a
list of indices, terminated by a full stop `.'. There are three
directives and their semantics are described below.

   * `fix'

     This directive can be used to specify a list of sequences which
     should be labelled _fixed_. Any cluster that contains a fixed
     sequence will be labelled as fixed. This is useful when the user
     has some external knowledge about the clustering and wants to
     ensure that some sequences aren't clustered together (e.g. by a
     poor quality EST).

     Normally when the program starts, each sequence is put into a
     cluster. By default, a sequence is put into a cluster by itself,
     but if a clustering file is given then the clustering specified
     by that file will be used.

     Thereafter, clustering starts. However, if the `fix' directive is
     used, two sequences that are labelled as fixed will never be
     merged. If an EST matches more than one fixed cluster, it will be
     added to at most one of them. Note that sequences that are not
     fixed can be added to fixed clusters, and a non-fixed cluster can
     be added to a fixed cluster.

     For example, suppose sequences 1, 17 and 325 definitely do not
     belong in the same cluster. Then you would have as a line in the
     constraint file:


     fix 1 17 325.

     Note that fixedness is all or nothing. There is no way in the
     current version to say "don't cluster 1 with 325, and don't
     cluster 45 with 360, but it's OK to cluster 1 with 45 or 360". If
     there turns out to be a need for this, it might be included in a
     later version of the program.

   * `cluster-only'

     This tells `wcd' to only try to cluster those sequences that are
     in the list (and to ignore the rest). This is useful if you only
     want to cluster part of an input file (e.g. you might know the
     clustering for the rest of the file).

     *NB: * It is an error to put more than one `cluster-only' or
     `reset' in a constraint file.

     The clustering table allows you to provide an initial clustering,
     which `wcd' can then refine. Sometimes you may only want to
     refine the clustering of some of the sequences, in which case you
     can make major performance savings by using the `cluster-only'
     directive to tell `wcd' to only refine the clustering of those
     sequences and to leave the clustering of all the others as given
     initially by the clustering table. For example, if we had the
     following clusters: [0, 1, 4] [2, 3, 7] [5] [6, 8] [9], and we
     were generally happy with the clustering but wanted to see
     whether [2, 3, 7] should be merged with [6, 8], the following
     directive could be used:


     cluster-only 2 3 6 7 8.

     `wcd' would then check whether those clusters should be merged.

   * `reset'

     This is similar to `cluster-only' but, in addition, the
     clustering of the sequences in this list is reset to the default
     clustering (i.e. each sequence in the list is put in its own
     cluster).

     This is used where you are given a clustering file as input, but
     while you are generally happy with the clustering given for some
     sequences, you would like others to be reclustered. The
     implication is that the sequences in the reset list
        - will be clustered using d2 into one or more clusters;

        - will not be clustered by `wcd' with any of the other
          sequences.

Typically, you would be concerned that the clustering of the specified
sequences was too lenient (i.e. that some sequences had wrongly been
put together). So taking the above example, if you said


reset 2 3 6 7 8.

You would be saying that you wanted to leave the clustering of 0, 1,
4, 5, and 9 as it is, but to cluster the other sequences de novo,
completely ignoring the initial clustering.

   The major difference between `cluster-only' and `reset' is that with
`reset' you are saying that you want to recluster the specified
sequences de novo: you think that some of the sequences specified that
have been clustered already should not be and  you want to check again
(probably using other parameters). With `cluster-only' you are happy
with what clustering has been done, but you want to check whether there
should be even more clustering.

3.9.2 Merging clusters
----------------------

This can be used to merge two known clusterings. The input is two FASTA
files with the sequences and two files that give the clusterings.

   It is assumed that the two FASTA files are disjoint.


Usage: wcd [--merge,-m] <seqf1> <clf1> <seqf2> <clf2>
           merge two clusterings

Here you merge two clusterings that have already been computed. The
four arguments are: the first FASTA file, the first clustering file, the
second FASTA file, and the second clustering file. These are mandatory.

   The `--constraint' option may be of particular use here.  This can
be used to constrain the first input file and its related clustering.
You can use the `--constraint2' option to constrain the second input
file.

   The files that specify the clustering must use the compressed
clustering format. The sequences are referred to by index number (the
position of the sequence in the input file), numbered from 0. Each
cluster is given on a line by itself, terminated by a full stop: the
indices of the sequences in the cluster are printed out, separated by
spaces.

   The output is a new cluster table in the same format as the input
cluster tables. The indices shown in the table are:
   * the same as the input index, if the sequence came from the first
     file specified;

   * n + the input index, if the sequence comes from the second file,
     assuming n sequences in the first file.

Another useful option for merging is:

      `[--constraint2, -J] filename'

     Give the constraint file for the second input data file. This is
     optional. The constraint file enables you to ensure that certain
     sequences are not clustered together or to ignore certain
     sequences while clustering.

3.9.3 Adding sequences
----------------------

This can be used to add a number of new sequences to an existing
clustering. It is assumed that the new sequences do not exist in the
original file.

   The input is two FASTA files and a cluster table for the first file.
The remarks above apply here.

3.9.4 Reclustering
------------------


Usage: wcd [--recluster,-r] <clf1> <seqf1>
           recluster from a more stringent clustering

This takes a clustering based on a more lenient (or just different)
criterion and reclusters using d2 scores as the basis for clustering.
The clustering given by the input cluster table is used as the
starting scheme. For each cluster of the initial cluster table, `wcd'
does a d2-clustering on the sequences in that cluster, ignoring all
the other sequences. `wcd' will never compare the sequences in one
cluster with the sequences in another. The resulting clustering is
therefore a finer partition.

Example of merging and adding
.............................

Suppose you have two files

   * File 1, `data1.seq': 15 sequences, numbered 0 through 14. The
     clusters (as produced by `wcd' and saved in a file `data1.cl')
     are:


     0 1 2 12 13 14.
     3.
     4 5 6 8.
     7 9 10 11.

   * File 2 (`data2.seq'): 12 sequences, numbered 0 through 11.  The
     clusters (as produced by `wcd' and saved in `data2.cl') are:


     0 2 4 10.
     1 3 5.
     6 7 8 9 11.

You now want to merge the two files. You are happy with the clustering
of the two files with respect to themselves, but you now need to see
whether the sequences in the one file are related to the sequences in
the second file. You would do this by saying:

   `wcd --show_clusters --merge data1.seq data1.cl data2.seq data2.cl'

This merges the two clusterings. All the sequences in the first file
will be compared to all the sequences in the second. The new clustering
would be output. The sequences in the second file will be renumbered 15
to 26. For example a possible output might be:


0 1 2 12 13 14.
3 21 22 23 24 26.
4 5 6 8 16 18 20 7 9 10 11.
15 17 19 25.

This could happen if sequence 3 in file 1 is related to sequence 21 (6
in file 2); sequence  4 in file 1 related to sequence 16 (1 in file 2);
and sequence 7 in file 1 to 18 (3 in file 2).

3.9.5 Producing the clusters as sequences.
------------------------------------------

The `-G' option can be used to split the sequence file according to
the clusters that `wcd' produces. The general form of the option is
`-G val@dir'. This says: put any cluster with at least _val_ sequences
into its own file in directory `dir'. For example, `-G 20@clsts' will
produce in the directory `clsts' a file for each cluster with at least
20 sequences in it (FASTA format, the sequences themselves). The FASTA
files for the clusters are put in files named `C000001.fasta',
`C000002.fasta', ... All the sequences that are not in a cluster of
the given size are put in a file called `S0.fasta'.

   Note, there was a bug in 0.5.0 where the `-c' and `-G' options could
not be used at the same time. This has been fixed.

3.9.6 Splitting the sequence file based on an existing clustering
-----------------------------------------------------------------

The `-G' option should be used when you have a sequence input file for
which you do not know the clustering. However, if you have already
produced the cluster table (using the `-c' option), then you can split
the sequence file with the `-k' option: here you give `wcd' the
sequence file and a previously computed cluster table. In this case
`wcd' can split the input file for you without reclustering. Although
using this option means you have to call `wcd' more than once, it does
allow you to experiment with different cluster sizes.

   This is an auxiliary feature of `wcd'. Suppose you have produced a
cluster table and now want to process it further - for example, to
assemble the clusters. It is activated by using the `-k' option, and
the meaning of the other command-line switches changes completely.


Usage: wcd -f <clusterfile> -k <splitname> <sequence file>

This reads in the given cluster file and sequence file and produces
output files based on the cluster file. The names of the output files
are prefixed by the _splitname_.

   For example, suppose _ex.clt_ contains the following


0 40 300.
6.
1 5 7 8.

and we run `wcd' thus:


wcd -f ex.clt -k ans data.fa

Then the file _ans0000.fa_ will be created that contains sequences 0,
40 and 300; _ans0001.fa_ will be created that contains sequence 6; and
_ans0002.fa_ will be created that contains sequences 1, 5, 7, and 8.

By default, all clusters will be printed out.  The `-S' switch allows
you to print out only the clusters that have at least a given number of
elements. In the example below only one file will be created - one with
sequences 1, 5, 7 and 8.


wcd -S 4 -f ex.clt -k ans data.fa

   *Caution: * In this mode it is particularly important that each
line of the cluster file ends with a full stop ("."). If not, `wcd'
assumes the next line is part of the same cluster. *This IS a feature,
but could easily be a bug.*

3.10 K-link clustering
======================

`wcd' does what is known as single-link, transitive clustering. If
sequence 1 overlaps with sequence 2, which overlaps with sequence 3,
which overlaps with sequence 4, and ... sequence 99 overlaps with
sequence 100, then at the end of the process all the sequences will be
in the same cluster even if there are no other overlaps.

   In an idealised biological setting, this is a good framework.
However, this does mean that dirty data, and chimeras in particular,
can cause havoc. _k_-link clustering requires that there is some
redundancy in the number of overlaps.

   The algorithm of Bragg and Stone is one attempt to solve the
problem (see
`http://bioinformatics.oxfordjournals.org/content/25/18/2302.short').
This program requires as input some pre-clustering from another
clustering algorithm, and `wcd' can be used as that tool. To prepare
output that is suitable for `K-link', you can use the `--kl-seed' or
`-$' option:

   The `-$' or `--kl-seed' option takes a mandatory argument, which
indicates whether you want symmetric output or not. In the symmetric
mode, sequence _i_ appears on sequence _j_'s list if they overlap (and
by symmetry, _j_ appears on _i_'s list). In the asymmetric mode,
sequence _i_ only appears on _j_'s list if they overlap and _i >= j_.

   For example, suppose that sequence 2 overlaps with sequence 87. In
the symmetric mode, you will see 87 appearing on 2's list and 2
appearing on 87's. In the asymmetric mode, 87 will appear on 2's list
but not vice versa.

   * To choose the symmetric mode use "s" as the argument, e.g.,
     `wcd -$ s data.fa'

   * To choose the asymmetric mode use anything else (but you must put
     something), e.g. `wcd -$ a data.fa'

These commands will cluster the data file `data.fa' and produce (to
standard output) data in the correct format for `K-link' (in the first
case it will be symmetric and in the second case asymmetric). Note
that, according to the documentation of the K-link program, the
_symmetric_ version is required.

   Of course, all standard options work: `-o' can be used to send
output to a named file

   `wcd --kl-seed s -o data.klt  data.fa'

and all the modelling and heuristic parameters are available too.

   This mode is more expensive than normal clustering. First, the
symmetric mode requires twice as much work as the asymmetric mode
(since we have to check _i_ against _j_ and vice versa). Secondly,
normal clustering can skip some tests. For example, if we know from a
previous step that sequences 32 and 87 are in the same cluster
(perhaps because sequences 2 and 32 overlapped, and sequences 2 and 87
overlapped), there is no point in checking whether 32 and 87 overlap
(since all we care about is that they are in the same cluster).
However, in this mode we essentially need to know the adjacency list
of the graph, and so 32 and 87 have to be compared against each other.
The size of this penalty will depend on the graph.

To explain the format, in symmetric mode you'll see output like


0:0 1 5 9.
1:0 1 5 7 11.
2:2.
3:3 4 7.
4:3 4 6.
5:5.
...

This says that sequence 0 overlaps with sequence 0, 1, 5 and 9. Sequence
1 overlaps with 0, 1, 5, 7 and 11. Sequence 2 only overlaps with itself.
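
   As an illustration of the format only, the following C sketch
prints adjacency lists in this style; the `overlap' predicate is a
stand-in for `wcd''s real similarity test:

   #include <stdio.h>

   /* Emit the symmetric K-link style format for n sequences.  Each
      line is "i:" followed by the indices overlapping with i
      (including i itself), terminated by a full stop. */
   void emit_klink(int n, int (*overlap)(int, int)) {
     int i, j, first;
     for (i = 0; i < n; i++) {
       printf("%d:", i);
       first = 1;
       for (j = 0; j < n; j++)
         if (j == i || overlap(i, j)) {    /* i overlaps itself */
           printf(first ? "%d" : " %d", j);
           first = 0;
         }
       printf(".\n");
     }
   }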



Restrictions: this is NOT parallelised yet (not high on the todo list
since it's tricky, but if you yell it's more likely to happen).

   The KABOOM algorithm DOES work with it, though!

3.11 General options
====================

3.11.1 Help
-----------


Usage: wcd -u | -h
   -u: usage
   -h: usage
   -v: shows version

Shows how to invoke `wcd'.

3.11.2 Identifying a sequence
-----------------------------


Usage: wcd [--show_seq, -i] <index>  <filename>
       wcd [--show_rc_seq, -I] <index>  <filename>

   * `wcd --show_seq i dataf'

     Prints the i-th sequence in the data file dataf

   * `wcd --show_rc_seq i dataf'

     Prints the reverse complement of the i-th sequence in the data
     file dataf

3.11.3 The Dump option
----------------------


Usage: wcd -d dump_file  seq_file

When used with this option, `wcd' will open the given dump file for
writing and then perform clustering. Whenever it finds two sequences
that should be clustered, it writes the match to the dump file: the
output is the indices of the two sequences, and a 1 (if there is a
positive match) or -1 (if there is an RC match).

   This was introduced into `wcd' to support our simplistic
parallelisation (see the parallelisation section in the technical
manual).
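
   A hedged sketch of reading such a dump file back follows. The
whitespace-separated triple layout is an assumption, and `union_seqs'
is a stand-in for the find-union merge:

   #include <stdio.h>

   void union_seqs(int i, int j, int flag);  /* assumed merge routine */

   /* Replay dump records of the form "i j flag", where flag is 1 for
      a positive match and -1 for an RC match. */
   void replay_dump(const char *fname) {
     FILE *f = fopen(fname, "r");
     int i, j, flag;
     if (f == NULL) return;
     while (fscanf(f, "%d %d %d", &i, &j, &flag) == 3)
       union_seqs(i, j, flag);
     fclose(f);
   }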

3.11.4 Comparing two sequences
------------------------------

*Warning: * These options _may_ be removed in later versions of `wcd'
and should be treated with caution.

   These options are included to allow exploration and evaluation of
data rather than for clustering purposes. The problem is that
optimisations made for performance reasons mean that they do not give
completely accurate answers. For example, if we find a pair of windows
where the d2 score is less than the threshold, we announce success; we
don't try to find the pair of windows with the smallest d2 score. In
subsequent releases there may be a separate program which provides
these facilities (though with less efficient code).

   All these options also accept the options which allow changing of
the threshold, window and word size.


Usage: wcd [--compare|-E] <filename> <ind1> <ind2>
       wcd [--abbrev_compare|-e] <filename> <ind1> <ind2>
       wcd [--pairwise|-p] <filename> <ind1> <ind2>

   * `wcd --compare dataf i j'

     Compares the sequences i and j from the datafile dataf and prints
     out the following

        * The i-th sequence, the j-th sequence, and the reverse
          complement of the j-th sequence

        * A line with the following information:

             * _i_ and _j_

             * an estimate of the number of samples of the _j_-th
               sequence which appear in the _i_-th

             * an estimate of the number of samples of the _j_-th
               sequence which appear in the reverse complement of the
               _i_-th

             * an estimate of the number of words of the _j_-th
               sequence which appear in the _i_-th

             * an estimate of the number of words of the _j_-th
               sequence which appear in the reverse complement of the
               _i_-th

             * the d2 score between _i_ and _j_

             * the d2 score between _i_ and the reverse complement of
               _j_

   * `wcd --abbrev_compare dataf i j'

     Prints the minimum of: (1) the d2 score of i and j; and (2) the
     d2 score of i and the reverse complement of j.

   * `wcd --pairwise dataf i j'

     First prints a table of the d2 scores of all windows of sequence
     i compared to all windows of sequence j. Then does the same with
     the RC of sequence j.


Usage: wcd [--cluster_compare, -D]  dataf clusterf

Takes two arguments: a data file with sequences, and a file that gives
two clusters. This cluster file should contain exactly two lines.  Each
line should contain the indices of the sequences belonging to a
cluster. The indices should be separated by spaces, and the line
terminated by a full stop.

   The program will then compare each sequence in one cluster with
each sequence in the other and print out the d2 scores (both positive
and RC). On each line of output there is the result of one comparison:
first the two indices are printed out, and then the two d2 scores.

3.11.5 The mirror option - producing sequence and RC
----------------------------------------------------

For the KABOOM option (and perhaps other uses) it is necessary to have
a file that contains the original sequences followed immediately by
the reverse complements of those sequences. The `--mirror' (or `-y')
option will do this. It should not be used in conjunction with any
other option. By default output goes to stdout; use the `-o' option if
you want to change this behaviour. A typical call would be


wcd -y -o data_all.fa data.fa
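
   For reference, reverse complementation itself is straightforward; a
minimal C sketch (not the `wcd' code) is:

   #include <string.h>

   /* Complement a single (lower-case) base; anything else unchanged */
   static char comp(char b) {
     switch (b) {
       case 'a': return 't';  case 't': return 'a';
       case 'c': return 'g';  case 'g': return 'c';
       default:  return b;
     }
   }

   /* Reverse-complement s in place */
   void rev_comp(char *s) {
     size_t i, n = strlen(s);
     for (i = 0; i < (n + 1) / 2; i++) {
       char a = comp(s[i]), b = comp(s[n - 1 - i]);
       s[i] = b;
       s[n - 1 - i] = a;
     }
   }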

3.12 Running wcd in parallel
============================

`wcd' has support for both shared and distributed memory
parallelisation. The parallel version supports straightforward
clustering only.

   There are, however, _major_ restrictions should you use these
options.

   In version 0.4, the `wcd' options for suffix clustering, merging,
reclustering, dealing with constraints, etc., are NOT supported when
you use the parallel options. It is my intention that future versions
will fix these problems. The following options are not supported if
you use the parallel options:

   * suffix-array based clustering

   * --show_seq, --show_rc_seq

   * -E, --compare: show sequences, the number of common words, and d2
     scores

   * -e, --abbrev_compare: show the minimum of the d2 scores (pos +
     rc)

   * -p, --pairwise: show the pairwise d2 scores of all windows

   * -D, --cluster_compare: compare two clusters

   * -m, --merge: merge two clusterings

   * -a, --add

   * -r, --recluster

3.12.1 Shared Memory Parallelisation
------------------------------------

If you are running `wcd' on a shared memory processor with multiple
CPUs, the `--num_threads' or `-N' option can be used to specify how
many threads should be used. If there's a close match to the number of
CPUs that are available and unloaded, you should see a performance
improvement, though the current version is not very scalable. Why the
scalability of the pthreads implementation is so poor is a very
disappointing mystery. The bottom line is that I recommend using MPI
rather than pthreads. Pthreads should not be used on large
multiprocessor machines -- beyond N=4 you are very unlikely to get any
further performance improvement.

3.12.2 MPI Parallelisation
--------------------------

By enabling MPI support when installing, `wcd' can be used on a
cluster of workstations. A description of MPI is beyond the scope of
this document. Use `mpirun' to run `wcd' (which takes the normal
parameters). This code has been tested using LAM/MPI (RedHat, Suse,
MacOS X), MPICH (Ubuntu) and MVAPICH (Suse).

   For example, using LAM/MPI the `lamboot' command specifies what
processors are available (the list is given in the hosts file - in its
simplest form a list of the machines or their IP addresses). The
`mpirun' command is then used to run `wcd'. A simple example follows.


lamboot hosts
mpirun -np 4 wcd -c sample.fas

This will run `wcd' on 4 different processors (these processors may be
real or virtual, depending on what's available on the machines
specified by the hosts file). When `wcd' runs like this with multiple
processors available, one instance of `wcd' runs as the master, and
the rest as slaves. The sequence input file must be available on the
master node, but need not be on the others.

   The master process does not do any clustering itself, but merely
coordinates the clustering process. In the above example, this means
you would be running a master and three slaves and so could expect at
best a 3-fold improvement in performance. The computational load on
the master is fairly small, and so it is safe (memory permitting) to
schedule both a master and a slave on the same processor.

   In future versions of `wcd', the behaviour is likely to change so
that the master does do clustering (to make it more memory effective).

   _NOTE: _ When you install `wcd' you can enable both Pthreads and
MPI so that the executable can do both.

3.12.3 Checkpointing (for MPI)
------------------------------

If you are using MPI, you can checkpoint your data periodically. Use the
`-L' option and give a file name for backup.

   `mpirun -np 8 wcd -L job.back -g data.fa'

   Periodically, `wcd' will checkpoint its state to this file. If
something bad happens, you can restore from the backup. You need to
use the `-A' option to signal that you are doing a restore, and the
`-f' flag to give the backup file as initialisation.

   `mpirun -np 8 wcd -A -f job.back -g data.fa'

   `wcd' will now run but will not have to redo the work done in the
previous run. Of course you will lose some work, but it can save you a
lot. There is probably a performance penalty of 10% or so.

   You _must_, however, call `wcd' in exactly the same way as you
called it the first time. If you use MPI with 7 processes the first
time, you must use it in the same way the second time.

3.13 Auxiliary programs
=======================

A number of auxiliary  programs come with the `wcd' distribution in the
`ancillary' directory.

   * `rindex.py'

     This Python program takes two arguments, the names of two files
     each containing (compact) clusterings. It computes the
     sensitivity, specificity, Jaccard index, Rand index and
     correlation coefficient between the two clusterings.

     If you use the `-index rand' option, only the Rand index is
     shown. If you use the `-index jaccard' option, only the Jaccard
     index is shown.

     If you use the `-diff n' option, the indices above are not
     printed but the mismatches between the two clusterings are shown.
     First the pairs that are clustered by the first clustering but
     not the second are shown, and then the ones clustered by the
     second but not the first. If n==1, then all such pairs are shown;
     otherwise only the pairs that belong to clusters with n or fewer
     sequences are shown. This is helpful for exploring differences
     between clusterings.

   * `ext2comp.pl'

     This Perl program converts the extended cluster table format to
     the compressed table format.

   * `comp2ext.pl'

     This Perl program converts the compressed table format to the
     extended table format. Since not all the information of the
     extended table is present in the compressed format, you will find
     0s in the _orient_ column and -1 in the _witness_ field.

     For both programs, input and output are standard input and
     output, so you would probably run the programs thus:

          ./ext2comp.pl < cluster.ext > cluster.com
          ./comp2ext.pl < cluster.com > cluster.ext

   * dustprep-kabm, prep-kabm, kabm: discussed earlier.

   * `seqindex.py'

     `wcd''s cluster table refers to sequences by their positions in
     the sequence file. Sometimes it's useful to convert this into the
     sequence ID as given in the FASTA file.

     `seqindex.py -s data.clt  data.fa'

     This takes a `wcd' cluster table and the corresponding sequence
     file and produces a new table in the same format, but with the
     sequence identifier as given in the FASTA file used rather than
     the sequence index.

     FASTA files vary in format, so `seqindex' tries to guess the ID
     intelligently. If it fails, it assumes that the sequence
     identifier is what appears between the > and the next space, |,
     or end of line (whichever comes first).

     The `-o out' option can be used to send the output to a named
     file rather than standard output.

     If you add the `-i' option, then the translation happens in
     reverse.

     _Remember that numbering starts from 0._

   * `seqextract.py'

     This can be used to extract sequences from a file:

     `seqextract.py -s ids file.fa'

     The file `ids' should be a file in `wcd' cluster table format
     with sequence indices. _Remember that numbering starts from 0._
     All the specified sequences are extracted from `file.fa'.

     Note that the sequences are output in ascending order of index,
     not the order specified in `ids'.

     You can directly specify sequences to extract using the `asis'
     tag:

     `seqextract.py -s "asis:100 15 16" data.fa'

     This will extract sequences 15, 16 and 100 (in that order).

     You can send output to a file using the `-o' option.

   * `analysecluster.py'

     Takes as input a clt file and produces a histogram of the cluster
     sizes. If you use the `-t N' option then, instead of the
     histogram, any clusters with more than N sequences are output.

   * `combine.c'

     This takes as input a list of names of dump files, reads in each
     dump file in turn, and constructs the clustering from that. To
     make the executable, say `make combine'.

   * In addition to these programs, there is a shell script
     `wcd_wrapper.sh' that allows `wcd' to be used as a replacement for
     `d2_cluster' in the `stackPACK' analysis pipeline.



File: wcd.info,  Node: wcd inside stackPACK,  Next: Technical manual,  Prev: Running wcd,  Up: Top

4 `wcd' inside `stackPACK'
**************************

This chapter written by Peter van Heusden
-----------------------------------------

Electric Genetics' `stackPACK' is a widely-used software suite for
expression variance analysis and transcript reconstruction.

   By default, `stackPACK' uses `d2_cluster' as the clustering tool in
its analysis pipeline. To incorporate `wcd' into `stackPACK', follow
these steps:

  1. Install `wcd' as described earlier. If you wish to use `wcd' from
     the `stackPACK' web user interface, you need to ensure that `wcd',
     `wcd_wrapper.sh' and `comp2ext.pl' are runnable by the web server
     user.

  2. Edit the `/etc/stackpack' file. Find the section labelled
     [d2_cluster] and change the line reading executable= to refer to
     the location of the `wcd_wrapper.sh' script.


   Please note that at present, you cannot do 'add' clustering in
`stackPACK' using `wcd'.


File: wcd.info,  Node: Technical manual,  Next: Testing,  Prev: wcd inside stackPACK,  Up: Top

5 Technical manual
******************

Hopefully this will fill out a bit and become a little more coherent.

5.1 Program structure
=====================

Originally `wcd' was intended to be one small program and so would fit
into one file. However, it grew and we needed to break up the program.
The main files are

   * `wcd.h': This contains the critical type definitions of the
     program.

   * `wcd.c': This is the main program. It contains some utility and
     auxiliary code, processes all the command-line arguments and then
     calls the right routines. The key function in this code is the
     _docluster_ code. This does the pair-wise comparison of the
     sequences and, where the comparison falls below the threshold,
     merges the clusters, calling the appropriate find-union data
     structures.

   * `ed.c': this contains the routines for computing the edit
     distance.

   * `d2.c': this contains the routines for computing the d2 score.

   * `common.c': this contains code and declarations for manipulating
     words and sequences.


5.2 Data structures
===================

The key data structure used is a find-union data structure. This
discussion assumes that you know how that works -- look in a standard
algorithms text like Cormen, Leiserson and Rivest if you don't.

   All the sequences in one cluster will belong at the end to the same
tree in a find-union forest. Each sequence will point to the root of the
tree.

   To represent sequences we use an array of the following structs:



typedef struct SeqStruct  {
  seqType seq;     // the sequence
  int    len;     // length
  string id;      // sequence id (from data file)
  int    cluster; // refers to index of root
  int    rank;    // for find-union
  int    next;    // for find-union: next in linked list of cluster
  int    last;    // for find-union: last ditto
  int    match;   // index of another seq for which a match was made
  short  orient;  // pos or rc for that match
  int    flag;    // flags for fix, reset etc
} SeqStruct;


typedef SeqStruct * SeqPtr;

SeqPtr seq;

The key variable is `seq' which is declared globally and statically.
The main function opens up the data file(s) and computes the number of
sequences in the file. Memory is then allocated for `seq' which is then
treated as an array thereafter. The values are actually given to the
individual array elements in the function `read_sequences'.

   The fields of the struct are:

   * seq: the sequence itself: we store the sequences compactly - see
     below.

   * len: the length of the sequence

   * id: the sequence ID -- this comes from the data file (the string
     to the right of the '>').

   * cluster: This is the index of the representative sequence for the
     cluster to which a sequence belongs. This is initialised so that
     each sequence is the root of its own cluster.

   * rank: this is the rank used in the rank-heuristic of the
     find-union algorithm. Roughly it describes how many sequences are
     in the tree rooted at this sequence. This is only valid for
     sequences that are the root of the cluster.

   * next, last. These are not standard f-u fields. However, we want to
     not only know which sequence belongs to which cluster but to be
     able to print out the clusters efficiently. Thus, besides keeping
     the fields needed for the f-u structure, we also keep a linked
     list of all sequences in the cluster. next points to the next
     sequence in the list. The root will always be the first element in
     the linked list and for the root, the last field will point to the
     last sequence in the list. The make_union function updates these
     values.

   * match is the index of an arbitrary other sequence in the cluster
     with which there is an overlap.

   * orient: orientation of the match (-1 or 1). Initially this is set
     to 1, since each sequence starts as the root of its own cluster.
     The following algorithm is used to update it (a sketch follows
     this list). Suppose we find a match between sequences i and j. We
     consider the orientations of _i_ and _j_ with respect to their
     current roots:
        - If the match was positive, then if the orientations of i and
          j with respect to their respective roots are the same, we
          know that if we take the union of the two clusters the
          orientations of the elements of the two clusters are
          consistent and no changes need to be made.

          However, if the orientations of i and j with respect to
          their respective roots differ, then we know that the
          orientation of the clusters is not consistent. Therefore, we
          must invert the orientations of all the sequences in one of
          the two clusters before merging. The choice of cluster is
          arbitrary, but we choose the one that will become the
          `child' cluster.

        - If the match was a reverse-complement match, the situation
          is the reverse of the above case. If the orientations of i
          and j are the same, then we know the orientations of the two
          clusters are inconsistent, and so we change the orientations
          of the sequences in one of the two clusters.
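
   The following hedged C sketch shows the flavour of this update,
using the fields of SeqStruct shown earlier. Here `find' is assumed to
be a standard find-union find, the linked lists are assumed to be
terminated by -1, and the rank heuristic is omitted for brevity; this
is not the actual `wcd' code:

   extern SeqPtr seq;
   int find(int i);   /* assumed: returns the root of i's cluster */

   /* Merge the clusters of i and j after a match of orientation
      `ori' (1 for a positive match, -1 for an rc match). */
   void merge_with_orientation(int i, int j, int ori) {
     int k, r1 = find(i), r2 = find(j);
     if (r1 == r2) return;
     /* a positive match needs equal orientations w.r.t. the roots;
        an rc match needs opposite ones -- otherwise flip the cluster
        that is about to become the child */
     if (seq[i].orient * seq[j].orient != ori)
       for (k = r2; k != -1; k = seq[k].next)
         seq[k].orient = -seq[k].orient;
     seq[r2].cluster = r1;              /* hang r2's tree under r1 */
     seq[seq[r1].last].next = r2;       /* splice the linked lists */
     seq[r1].last = seq[r2].last;
   }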


Storage of sequences.
.....................

Different versions of this program have taken different approaches to
the storage of sequences. Initially, I went for one base per byte.
Then I went for four bases per byte; the reason for this (explained
below) is to save memory. However, this does make the code more
complicated and causes recomputation, as we need to extract the
information again. Then I went to a redundant representation of the
data - each word being stored in one machine word. However, this was
too memory intensive for large data sets, so I went back to the
compact representation.
   The data is stored in compressed form. The sequences are read in in
ASCII format, one base per byte. However, we only need two bits to
represent the four bases, and so we compress the data, storing four
bases per byte. There is a little complexity in this -- the macros
GETWORD and so on look like witchcraft -- and some performance penalty
in time, but if we have 1M ESTs then it may make a few hundred MB of
memory difference, which may be useful.

   A down side of the compressed format is that we can only treat Ns
that are in the data as some arbitrary symbol. It would be better to
ignore all words that contain N, but there is no way of knowing this in
the compressed format.

   The key data structures are shown below. Each sequence is an array of
integers: the definition of `seqElt' is key. Have a look at the `wcd'
code to see the current version (this may change). In 0.1.8 at least we
used `int32_t'.  If you want to change this to some other type, change
the definition of `seqElt' and change the definition of `SEQELTWIDTH'
to reflect this change (i.e. the bit width).

typedef int32_t   seqElt;
typedef seqElt  * seqType;
#define  SEQELTWIDTH  32
#define  BASESPERELT   (SEQELTWIDTH/2)

   In `read_sequences' each sequence is read into memory in turn. Once
we know the length of the sequence, we divide through by BASESPERELT
to compute the length of the needed array and allocate memory
accordingly.
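
   A hedged sketch of the packing arithmetic follows. The macro names
here are invented for illustration -- see the actual GETWORD macros in
the source for the real thing:

   #include <stdint.h>

   typedef int32_t seqElt;
   #define SEQELTWIDTH  32
   #define BASESPERELT  (SEQELTWIDTH/2)

   /* Number of seqElts needed to hold n bases, two bits per base */
   #define SEQSIZE(n)   (((n) + BASESPERELT - 1) / BASESPERELT)

   /* Store base b (0..3) at position i of the packed sequence s,
      assuming s was zero-initialised */
   #define PUTBASE(s, i, b) \
     ((s)[(i)/BASESPERELT] |= ((seqElt)(b)) << (2*((i)%BASESPERELT)))

   /* Extract the base at position i */
   #define GETBASE(s, i) \
     (((s)[(i)/BASESPERELT] >> (2*((i)%BASESPERELT))) & 3)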

   The maximum EST length is set by a macro constant definition,
MAX_SEQUENCE_LENGTH. It's currently 64K, but can be changed to any
reasonable value. The only real limitation is that the `next_seq'
variable in `read_sequences' is declared in the function, and there
might be a limit on how much memory can be allocated on the stack on
some machines. In version 0.1.8, for example, this would mean that the
actual limit on sequence length is 64K*8, which is probably
sufficient. If you run into this problem you may consider moving the
declaration to global scope.

5.3 The algorithms
==================

5.3.1 Heuristic
---------------

Since d2 is an expensive test, a heuristic test is done before d2 is
called. This is based upon the idea that for a d2 score to be under the
threshold, the two sequences must share a number of non-overlapping
common words.

   I have not had time to work on this analytically, though I think it
can be shown. However, I have done some empirical work which showed
that if you graph the d2 score (y axis) against the common word score
(x axis), all the data points lie above the line d2 = -12.5cw + 150.
This means that at a d2 threshold of 40, all matches have over 8
common words. Thus a default of 5 is very conservative.

   The heuristic is a little more complicated than this. If we are
comparing sequences u and v, then:

   * First we build a table of all the words in u.

     (Note that this is not particularly expensive to do. When we do
     the clustering we have a double loop

               for(i=0; i<n; i++) {
                  for(j=i+1; j<n; j++) {
                     compare i and j

     The building of the table is done in the outer loop, outside of
     the inner loop, and so the cost is negligible when amortised over
     all values.)

   * Then, if the threshold for the number of common words is at least
     5, we sample the words in v; if there are fewer than 2 matches,
     we report failure.

   * If there are 2 or more matches, we then see how many
     non-overlapping words of v appear in u. If they are above the
     threshold, we report success (a sketch of this structure follows
     this list).
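
   Putting the pieces together, the overall structure looks roughly
like the following C-style sketch; all the function names here are
stand-ins for illustration, not the real `wcd' routines:

   for (i = 0; i < num_seqs; i++) {
     build_word_table(i);              /* linear in len(i); done once
                                          per i, so the cost is
                                          amortised over the j loop */
     for (j = i + 1; j < num_seqs; j++) {
       if (!sample_test(i, j))         /* cheap sampled check */
         continue;
       if (!enough_common_words(i, j)) /* non-overlapping word count */
         continue;
       if (d2(i, j) < threshold)       /* expensive test, done last */
         merge_clusters(i, j);
     }
   }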


5.3.2 The parameters of do_cluster, and implementing merging and add
--------------------------------------------------------------------

Normally, to do a clustering we need to compare each sequence with
every other sequence. The obvious way of doing this is by using a
double loop:


   for(i=0; i<num_seqs; i++) {
      for(j=i+1; j<num_seqs; j++) {
         blah, blah
      }
   }

   When we do a merge, we have two separate clusterings, and there is
no need to compare the sequences in the first half with each other, or
the sequences in the second half with each other; we compare only the
sequences in the first half with those in the second. Assuming there
are n1 sequences in the first part, the code becomes


   for(i=0; i<n1; i++) {
      for(j=n1; j<num_seqs; j++) {
         blah, blah
      }
   }

   When we do an add, we have a clustering that we add sequences to. The
clustered data is in the first part and the new data is in the second.
There is no need to compare the sequences in the first half with each
other. The code becomes


   for(i=0; i<num_seqs; i++) {
      for(j=max(n1,i+1); j<num_seqs; j++) {
         blah, blah
      }
   }

   Each of these three cases can be dealt with by passing as parameters
of do_cluster the bounds of these two loops. The indices are set in the
main function based on the program's arguments.

   In addition, sometimes we only want to cluster some of the
sequences. There are two ways in which this can be done. One is to just
set the _ignore_ flag for the sequences that should be clustered to an
appropriate value.  The other is to pass the `do_cluster' function an
array which contains the indices of those sequences to be clustered.
This INDEX array can then be used to extract out the indices of the
sequences.

   Thus there are four parameters for the `do_cluster' function
(sketched below):

   * `i_end': where the i-loop should finish looping.

   * `j_beg': where the j-loop should start looping.

   * `j_end': where the j-loop should end looping.

   * `index': an array of the indices of sequences to be clustered.

     If this parameter has a `NULL' value, then all the sequences from
     0 to `i_end' will be compared to all the sequences from `j_beg'
     to `j_end'.

     If this parameter has a non-`NULL' value, then all the sequences
     `index[0]' to `index[i_end]' will be compared to all the
     sequences `index[j_beg]' to `index[j_end]'.
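
   A hedged sketch of how these parameters might drive the loops
follows. The real `do_cluster' differs in detail, and
`compare_and_maybe_merge' is a stand-in:

   void compare_and_maybe_merge(int si, int sj);  /* assumed */

   void do_cluster(int i_end, int j_beg, int j_end, int *index) {
     int i, j;
     for (i = 0; i < i_end; i++)
       for (j = (j_beg > i + 1 ? j_beg : i + 1); j < j_end; j++) {
         /* indirect through index when only a subset is clustered */
         int si = (index == NULL) ? i : index[i];
         int sj = (index == NULL) ? j : index[j];
         compare_and_maybe_merge(si, sj);
       }
   }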

5.4 How to add distance functions
=================================

If you want to add a new distance function, say _farfun_, you need to
implement three functions:

   * _farfun_: this should actually implement the distance function.
     It need not return the correct result: however, if the distance
     between the two sequences is less than the threshold, then the
     function should return a number less than the threshold (i.e.
     once you detect that the value is less than the threshold you can
     just return an estimate; you don't have to find the actual
     value).

   * _farfunpair_: this should implement the distance function but
     should return the correct answer.

   * _farfuninit_: this is initialisation code that will be called
     when the program runs.

   The code should be placed in a file called `farfun.c' and the
prototypes should be put in `farfun.h'.

   Then, in `wcd.c', add the new function to `chooseFun'.

   In `common.c', call the initialisation code in `initialise'.
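
   For concreteness, a header skeleton might look like the following.
The exact signatures are not specified in this manual, so the
parameter lists below are guesses for illustration only:

   /* farfun.h -- illustrative skeleton for a new distance function */
   #ifndef FARFUN_H
   #define FARFUN_H

   /* May return an estimate once the score is known to be under the
      threshold */
   int  farfun(int i, int j, int threshold);

   /* Must return the exact distance */
   int  farfunpair(int i, int j);

   /* One-off setup, called from initialise() in common.c */
   void farfuninit(void);

   #endif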

5.5 Parallelisation
===================

This section first discusses work allocation in MPI and Pthreads, and
then discusses some more primitive support that `wcd' has for
parallelisation.

   Different work allocation methods are provided for MPI and
Pthreads. This is to allow for experimentation, and future releases of
`wcd' are likely to change the way in which they do things.
5.6 Work allocation in MPI
==========================

In MPI, work allocation is done statically. The input file is read in
and estimates are made of the clustering costs based upon sequence
length and the number of sequences. The workload is then allocated to
the slaves and the slaves are launched. This may well change in
future, so if you write a script that uses this option, please check
it against future releases.

5.7 Work allocation using Pthreads
==================================

In Pthreads, the work allocation is done dynamically. Conceptually the
work is broken up into a d x d grid of work and the work is then
allocated to slaves. As there are many more blocks than slaves, each
slave will have to ask for work many times.

5.7.1 Distributed Parallelisation
---------------------------------

`wcd' has a number of options to support a simple parallelisation on a
local network. The approach is very simplistic: we had a very large
real job (110k mRNAs, 250M of data) that needed to be clustered
urgently, so we hacked an inelegant solution together. Now that MPI
support is available, these options will become less important and may
be removed in future versions of `wcd'. (Let me know if you rely on
them.)

   In our labs we have about 150 machines on one NFS server. We wrote
a Perl script that launched `wcd' on each of the machines, using the
`--range' and `--dump' options to specify the workload that machine
had to take on and to say where the results should go.

   NFS is used for communication. Each process reads in the entire
sequence file - obviously hideously expensive. To make what we did
socially acceptable and to prevent our NFS server from falling over,
the Perl script that launches the `wcd' processes delays for about 60s
between launches, which means that it takes over 2 hours to start all
the `wcd' processes.

   The auxiliary program `combine' can be used to read in all the
final dump files to produce the overall clustering.


File: wcd.info,  Node: Testing,  Next: Acknowledgements and copyright,  Prev: Technical manual,  Up: Top

6 Testing
*********

Testing of wcd
--------------

We did extensive testing of `wcd' both with respect to computational
and biological performance and this was published in our
_Bioinformatics_ paper cited above.


File: wcd.info,  Node: Acknowledgements and copyright,  Next: Concept Index,  Prev: Testing,  Up: Top

7 Acknowledgements and copyright
********************************

7.1 Acknowledgements
====================

`wcd' was written after a very productive visit to the South African
National Bioinformatics Institute (SANBI). SANBI introduced me to the
problem and as we needed to have a d2 clustering program for our
research, I wrote one. Many people helped but thanks particularly to
Winston Hide for alpha and beta testing of the code as well as some very
detailed biological analysis. Peter van Heusden was responsible for
significant testing as well as for integrating `wcd' into `stackPACK'.

   Ramon Nogueria was responsible for the development of the acd files
and wrapper for the EMBOSS implementation. Richard Starfield worked on
the Pthreads implementation.

   Zsuzsanna Lipták, now at the University of Bielefeld, introduced
the maths and the algorithms to me, and made sure I knew what I was
doing.
She also helped me refine the algorithmic innovations in this version
of d2 cluster and made several useful suggestions.

   Anton Bergheim and Pravesh Ranchod at Wits also gave valuable
feedback.

   Dane Kennedy and Benjamin Kumwenda worked on various aspects of EST
clustering.

   Research grants from the National Bioinformatics Network and the
National Research Foundation supported this work.

   All the above helped with alpha testing.

7.2 Copyright
=============

The program copyright details are:

     Copyright (C) Scott Hazelhurst 2003-2011
     School of Electrical and Information Engineering
     University of the Witwatersrand,
     Johannesburg
     Private Bag 3, 2050 Wits
     South Africa
     scott.hazelhurst@wits.ac.za

   This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public Licence as published by
the Free Software Foundation; either version 2 of the Licence, or (at
your option) any later version.

   This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public Licence for more details.

   `Version date: 10 June 2011 wcd.texi,v 0.6.3 2011/10/01 13:22:58
scott Exp scott $'

This documentation is distributed under the GNU Free Documentation
Licence.


File: wcd.info,  Node: Concept Index,  Prev: Acknowledgements and copyright,  Up: Top

Concept Index
*************

[index]
* Menu:

* Amazon EC2:                            Installing wcd.     (line  434)
* analysing clusters:                    Running wcd.        (line 1405)
* analysecluster:                        Running wcd.        (line 1405)
* auxiliary programs <1>:                Running wcd.        (line 1307)
* auxiliary programs:                    Installing wcd.     (line  389)
* backup:                                Running wcd.        (line 1284)
* checkpoint:                            Running wcd.        (line 1284)
* clone:                                 Running wcd.        (line   67)
* cluster table:                         Running wcd.        (line  223)
* constraint:                            Running wcd.        (line  594)
* distance function choice <1>:          Running wcd.        (line  507)
* distance function choice:              Description.        (line  128)
* distance function parameters:          Running wcd.        (line  535)
* dump:                                  Running wcd.        (line 1092)
* EC2:                                   Installing wcd.     (line  434)
* ends option:                           Running wcd.        (line  522)
* extended format of cluster table:      Running wcd.        (line  241)
* format, output:                        Running wcd.        (line  223)
* heuristic <1>:                         Technical manual.   (line  188)
* heuristic:                             Running wcd.        (line  555)
* installing:                            Installing wcd.     (line  184)
* kabm:                                  Running wcd.        (line  294)
* KABOOM <1>:                            Running wcd.        (line 1185)
* KABOOM <2>:                            Installing wcd.     (line   97)
* KABOOM:                                Running wcd.        (line   75)
* mirror:                                Running wcd.        (line 1185)
* mkESA:                                 Running wcd.        (line  294)
* MPI <1>:                               Running wcd.        (line   55)
* MPI <2>:                               Installing wcd.     (line  368)
* MPI:                                   Running wcd.        (line 1246)
* MPI, installation:                     Installing wcd.     (line  231)
* orientation <1>:                       Running wcd.        (line  256)
* orientation:                           Technical manual.   (line  107)
* output:                                Running wcd.        (line  472)
* parallelisation:                       Running wcd.        (line 1198)
* prep-kabm:                             Running wcd.        (line 1185)
* producing clusters:                    Running wcd.        (line  908)
* PTHREADS:                              Running wcd.        (line   50)
* Pthreads:                              Running wcd.        (line 1232)
* PTHREADS:                              Installing wcd.     (line  368)
* Pthreads, installation:                Installing wcd.     (line  247)
* range:                                 Running wcd.        (line  647)
* recluster, de novo:                    Running wcd.        (line  783)
* recluster, from a coarser cluster:     Running wcd.        (line  852)
* restore:                               Running wcd.        (line 1284)
* reverse complement:                    Running wcd.        (line 1185)
* seqextract:                            Running wcd.        (line 1399)
* seqindex:                              Running wcd.        (line 1376)
* sequence indices:                      Running wcd.        (line 1399)
* shared memory processor:               Running wcd.        (line 1232)
* SMP:                                   Running wcd.        (line 1232)
* splitting file:                        Running wcd.        (line  908)
* stackPACK:                             Installing wcd.     (line  403)
* stackPack:                             wcd inside stackPACK.
                                                             (line   12)
* suffix-array:                          Running wcd.        (line  294)
* threshold:                             Running wcd.        (line  574)
* window length:                         Running wcd.        (line  563)
* word length:                           Running wcd.        (line  579)



Tag Table:
Node: Top67
Node: Description829
Node: Installing wcd10284
Node: Running wcd31525
Ref: Format of Constraint File59463
Ref: Auxiliary Programs83180
Node: wcd inside stackPACK87206
Node: Technical manual88240
Node: Testing103521
Node: Acknowledgements and copyright103844
Node: Concept Index106244

End Tag Table