Content - 0d8aaacd4186fe96dda0f93984c5e20ce16d4590 - 8de617b/README.md

visit type:
Tip revision: 2f5196e4cae5943a5822be7c4493df50cd564a0c authored by Alec Heckert on 21 December 2020, 02:39:07 UTC
added cstring dependency to Gibbs samplers
Tip revision: 2f5196e
README.md
# dpsp
Dirichlet process mixture models for single particle tracking

## Purpose

`dpsp` is a Python analysis tool for single particle tracking (SPT) experiments. 
SPT experiments generate "trajectories" - paths of individual molecules
specified in spatial coordinates. In many cases, the trajectories aren't all the
same: they come from molecules moving at different rates. A common goal in SPT analysis
is to extract the number of different diffusive states in a set of trajectories,
along with the fractional occupancy and diffusion coefficient (spatial variance)
of each state.

`dpsp` attempts to do this using a
[Dirichlet process mixture model](https://www.jstor.org/stable/1390653),
which is intended to deal with situations where we don't know the true
number of states.

## What goes in and what comes out?

`dpsp` accepts trajectories in CSV format. Each row of the CSV should correspond
to one detection from an SPT experiment.
This CSV must contain, at minimum, the following columns:

 - `trajectory`: the index of the corresponding trajectory
 - `frame`: the index of the corresponding frame
 - `y`, `x`, or other columns with the spatial coordinates of the detection
   **in pixels**

See `examples/examples_tracks.csv` for an example of the kind of input
`dpsp` expects.

The output is a probability distribution over the diffusion coefficient.
Each point in this distribution represents the probability that a trajectory
originated from a state with the corresponding diffusion coefficient.

## What doesn't it do?

`dpsp` doesn't do localization and tracking. It expects to be handed 
trajectories generated by some other method. There are lots of methods
around.

`dpsp` doesn't check the quality of your raw trajectories.

`dpsp` doesn't place any filters on the input.
It will consider whatever you give it.

`dpsp` doesn't check whether you gave it sensible parameters. It 
obediently, happily will abuse the data when given bad parameters.

`dpsp` expects you to understand your own experiment. You need to 
know the frame interval, approximate localization error, and pixel
size.

Beyond a couple of simple plots, `dpsp` doesn't produce any 
sophisticated visualizations of the result.

`dpsp` doesn't consider state transitions. It assumes that every 
trajectory is generated from a single diffusive state. When the 
trajectories are very short, this could be a good assumption. It almost
certainly becomes worse as trajectories get longer.

## Dependencies

`numpy`, `pandas`, `matplotlib`, `seaborn`, and `dask`.

Compilation via the `setup.py` script requires `gcc`.

## Install

Clone the repository, navigate to the top-level `dpsp` directory,
and do
```
    python setup.py develop
```

If you're on a system with `gcc`, setup will try to compile the binaries and 
put them in the `PATH`. To check that this actually worked, open a **new** terminal
and do
```
    gsdpdiff -h
```

If installation worked correctly, a docstring should print to the terminal. Wonderful.

If not, then you'll need to compile the `dpsp/cpp/gsdpdiff.cpp` and
`dpsp/cpp/gsdpdiffdefoc.cpp` into executables called `gsdpdiff` and 
`gsdpdiffdefoc` and put them into `PATH`, or whatever your OS equivalent is.

## Example usage

```
    import pandas as pd
    from dpsp import dpsp

    # A sample set of trajectories
    sample_csv = "examples/example_tracks.csv"
    tracks = pd.read_csv(sample_csv)

    # Run dpsp
    occs, diff_coefs = dpsp(
        tracks,
        branch_prob=0.1,         # branch probability for the Gibbs sampler
        frame_interval=0.00748,  # frame interval in seconds
        pixel_size_um=1.0,       # size of pixels in microns
        loc_error=0.035,         # approximate localization error in microns
        n_iter=400,              # number of iterations to do
        n_threads=1,             # number of independent threads to use
        pos_cols=["y", "x"],     # columns in *tracks* with spatial coordinates
                                 # in pixels
        dz=0.7                   # depth of field in microns, if using a defocalization
                                 # correction
        plot=True                # make a summary plot of the results
    )

```

A sample script that does something very similar is at 
`examples/example_script.py`. 

## Where can I get a description of the parameters for `dpsp`?

In the future, under the `docs` folder of this repo. 
For now, in the docstring to the `dp.dpsp` function.

The magic sauce here is `branch_prob`, which - if you're familiar with 
Dirichlet processes - indirectly sets the concentration
parameter for the Dirichlet process. We find it more useful to set 
`branch_prob` directly.

If our dataset has `n` trajectories
and the concentration parameter is `alpha`, then 
```
    branch_prob = alpha / (alpha + n)
```

`branch_prob` is the probability that, for any trajectory at any given 
iteration, the trajectory will start a new component for the Dirichlet 
process Gibbs sampler. As a result, it determines the relative strength
of the uniform prior over the diffusion coefficient against the data.
The higher `branch_prob` is, the more skeptical we are about the data. 

I am usually **very** skeptical about the data. So I tend to use a very high
value for `branch_prob` - often 0.1. The data has to work hard to prove
itself in this regime.

## What do I do with the output?

Up to you.

## I want to try other analyses.

Check out [vbSPT](http://vbspt.sourceforge.net/),
[Spot-On](https://www.tjian-darzacq.mcb.berkeley.edu/spot-on/),
or - if you're willing to hazard another tool from me - 
[emdiff](https://github.com/alecheckert/emdiff).
Browse the archive

https://github.com/alecheckert/dpsp