https://github.com/alecheckert/dpsp
Tip revision: 2f5196e4cae5943a5822be7c4493df50cd564a0c authored by Alec Heckert on 21 December 2020, 02:39:07 UTC
added cstring dependency to Gibbs samplers
added cstring dependency to Gibbs samplers
Tip revision: 2f5196e
README.md
# dpsp
Dirichlet process mixture models for single particle tracking
## Purpose
`dpsp` is a Python analysis tool for single particle tracking (SPT) experiments.
SPT experiments generate "trajectories" - paths of individual molecules
specified in spatial coordinates. In many cases, the trajectories aren't all the
same: they come from molecules moving at different rates. A common goal in SPT analysis
is to extract the number of different diffusive states in a set of trajectories,
along with the fractional occupancy and diffusion coefficient (spatial variance)
of each state.
`dpsp` attempts to do this using a
[Dirichlet process mixture model](https://www.jstor.org/stable/1390653),
which is intended to deal with situations where we don't know the true
number of states.
## What goes in and what comes out?
`dpsp` accepts trajectories in CSV format. Each row of the CSV should correspond
to one detection from an SPT experiment.
This CSV must contain, at minimum, the following columns:
- `trajectory`: the index of the corresponding trajectory
- `frame`: the index of the corresponding frame
- `y`, `x`, or other columns with the spatial coordinates of the detection
**in pixels**
See `examples/examples_tracks.csv` for an example of the kind of input
`dpsp` expects.
The output is a probability distribution over the diffusion coefficient.
Each point in this distribution represents the probability that a trajectory
originated from a state with the corresponding diffusion coefficient.
## What doesn't it do?
`dpsp` doesn't do localization and tracking. It expects to be handed
trajectories generated by some other method. There are lots of methods
around.
`dpsp` doesn't check the quality of your raw trajectories.
`dpsp` doesn't place any filters on the input.
It will consider whatever you give it.
`dpsp` doesn't check whether you gave it sensible parameters. It
obediently, happily will abuse the data when given bad parameters.
`dpsp` expects you to understand your own experiment. You need to
know the frame interval, approximate localization error, and pixel
size.
Beyond a couple of simple plots, `dpsp` doesn't produce any
sophisticated visualizations of the result.
`dpsp` doesn't consider state transitions. It assumes that every
trajectory is generated from a single diffusive state. When the
trajectories are very short, this could be a good assumption. It almost
certainly becomes worse as trajectories get longer.
## Dependencies
`numpy`, `pandas`, `matplotlib`, `seaborn`, and `dask`.
Compilation via the `setup.py` script requires `gcc`.
## Install
Clone the repository, navigate to the top-level `dpsp` directory,
and do
```
python setup.py develop
```
If you're on a system with `gcc`, setup will try to compile the binaries and
put them in the `PATH`. To check that this actually worked, open a **new** terminal
and do
```
gsdpdiff -h
```
If installation worked correctly, a docstring should print to the terminal. Wonderful.
If not, then you'll need to compile the `dpsp/cpp/gsdpdiff.cpp` and
`dpsp/cpp/gsdpdiffdefoc.cpp` into executables called `gsdpdiff` and
`gsdpdiffdefoc` and put them into `PATH`, or whatever your OS equivalent is.
## Example usage
```
import pandas as pd
from dpsp import dpsp
# A sample set of trajectories
sample_csv = "examples/example_tracks.csv"
tracks = pd.read_csv(sample_csv)
# Run dpsp
occs, diff_coefs = dpsp(
tracks,
branch_prob=0.1, # branch probability for the Gibbs sampler
frame_interval=0.00748, # frame interval in seconds
pixel_size_um=1.0, # size of pixels in microns
loc_error=0.035, # approximate localization error in microns
n_iter=400, # number of iterations to do
n_threads=1, # number of independent threads to use
pos_cols=["y", "x"], # columns in *tracks* with spatial coordinates
# in pixels
dz=0.7 # depth of field in microns, if using a defocalization
# correction
plot=True # make a summary plot of the results
)
```
A sample script that does something very similar is at
`examples/example_script.py`.
## Where can I get a description of the parameters for `dpsp`?
In the future, under the `docs` folder of this repo.
For now, in the docstring to the `dp.dpsp` function.
The magic sauce here is `branch_prob`, which - if you're familiar with
Dirichlet processes - indirectly sets the concentration
parameter for the Dirichlet process. We find it more useful to set
`branch_prob` directly.
If our dataset has `n` trajectories
and the concentration parameter is `alpha`, then
```
branch_prob = alpha / (alpha + n)
```
`branch_prob` is the probability that, for any trajectory at any given
iteration, the trajectory will start a new component for the Dirichlet
process Gibbs sampler. As a result, it determines the relative strength
of the uniform prior over the diffusion coefficient against the data.
The higher `branch_prob` is, the more skeptical we are about the data.
I am usually **very** skeptical about the data. So I tend to use a very high
value for `branch_prob` - often 0.1. The data has to work hard to prove
itself in this regime.
## What do I do with the output?
Up to you.
## I want to try other analyses.
Check out [vbSPT](http://vbspt.sourceforge.net/),
[Spot-On](https://www.tjian-darzacq.mcb.berkeley.edu/spot-on/),
or - if you're willing to hazard another tool from me -
[emdiff](https://github.com/alecheckert/emdiff).