Content - df3869885ffbbdb93983e75b7c506aa098357267 - cf82377/docs/file-format.rst

swh:1:snp:300923221fcf626df34df8c763b7994a14d9c907

Tip revision: 71b83cbcb37f6020b50168154bbdfb7ff313809a authored by Jerome Kelleher on 07 October 2016, 16:16:43 UTC
Release notes for version 0.4.0.
Tip revision: 71b83cb
file-format.rst
.. _sec-file-format:

=========================
Tree Sequence File Format
=========================

The correlated trees output by a coalescent simulation are stored very
concisely in ``msprime`` as a sequence of coalescent records. To make this
information as efficient and easy as possible to use, we store the data in a
`HDF5 <https://www.hdfgroup.org/HDF5/>`_ based file format. This page fully
documents this format allowing efficient and convenient access to the
genealogical data generated by ``msprime`` outside of the native :ref:`Python
API <sec-api>`. Using the specification defined here, it should be
straightforward to access tree sequence information in any language with `HDF5
support <https://en.wikipedia.org/wiki/Hierarchical_Data_Format#Interfaces>`_.

*********
Structure
*********

The file format is broken into a number of groups. Each group contains
datasets to define the data along with attributes to provide necessary
contextual information.

The root group contains one attributes, ``format_version``. This
is a pair ``(major, minor)`` describing the file format version. This
document describes version 3.1.

================    ==============      ======      ===========
Path                Type                Dim         Description
================    ==============      ======      ===========
/format_version     H5T_STD_U32LE       2           The (major, minor) file format version.
================    ==============      ======      ===========

++++++++++++++++++
Provenance dataset
++++++++++++++++++

The provenance dataset records information relating the the provenance
of a particular tree sequence file. When a tree sequence file is generated
all the information required to reproduce the file should be encoded
as a string and stored in this dataset. Subsequent modifications to the
file should be also be recorded and appended to the list of strings.

The format of these strings is implementation defined. In the
current version of ``msprime`` provenance information is encoded
as JSON. This information is incomplete, and will be updated in future
versions.

================    ==============      ======      ===========
Path                Type                Dim         Description
================    ==============      ======      ===========
/provenance         H5T_STRING          Scalar      Provenance information.
================    ==============      ======      ===========

+++++++++++++++
Mutations group
+++++++++++++++

The ``mutations`` group is optional, and describes the location of mutations
with respect to tree nodes and their positions along the sequence. Each mutation
consists of a node (which must be defined in the ``trees`` group) and a
position. Positions are defined as a floating point value to allow us to
express infinite sites mutations. A mutation position :math:`x` is defined on the same
scale as the genomic coordinates for trees, and so we must have
:math:`0 \leq x < L`, where :math:`L` is the largest value in the
``/trees/breakpoints`` dataset.

As for the coalescence records in the ``trees`` group, mutation records are
stored as seperate vectors for efficiency reasons. Mutations must be stored
in nondecreasing order of position.

===================     ==============      =====
Path                    Type                Dim
===================     ==============      =====
/mutations/node         H5T_STD_U32LE       M
/mutations/position     H5T_IEEE_F64LE      M
===================     ==============      =====

+++++++++++
Trees group
+++++++++++

The ``trees`` group is mandatory and describes the topology of the tree
sequence. The ``trees`` group contains a number of nested groups and datasets,
which we will describe in turn.

^^^^^^^^^^^^^^^^^^^
Breakpoints dataset
^^^^^^^^^^^^^^^^^^^

The ``/trees/breakpoints`` dataset records the floating point positions of the
breakpoints between trees in the tree sequence, and the flanking positions
:math:`0` and :math:`L`. Positions in the ``/trees/records`` group refer to
(zero based) indexes into this array. The first breakpoint must be zero, and
they must be listed in increasing order.

=======================     ==============
Path                        Type
=======================     ==============
/trees/breakpoints          H5T_IEEE_F64LE
=======================     ==============

^^^^^^^^^^^
Nodes group
^^^^^^^^^^^

The ``/trees/nodes`` group records information about the individual
nodes in a tree sequence. Leaf nodes (from :math:`0` to :math:`n - 1`)
represent the samples and internal nodes (:math:`\geq n`) represent
their ancestors. Each node corresponds to a particular individual that
lived at some time time in the history of the sample. The ``nodes``
group is used to record information about these individuals.

=======================     ==============
Path                        Type
=======================     ==============
/trees/nodes/population     H5T_STD_U8LE
/trees/nodes/time           H5T_IEEE_F64LE
=======================     ==============

^^^^^^^^^^^^^
Records group
^^^^^^^^^^^^^

The ``/trees/records`` group stores the individual coalesence records.
Each record consists of four pieces of information: the left and
right coordinates of the coalescing interval, the list of child nodes
and the parent node.

The ``left`` and ``right`` datasets are indexes into the ``/trees/breakpoints``
dataset and define the genomic interval over which the record applies. The
interval is half-open, so that the left coordinate is inclusive and the right
coordinate is exclusive.

The ``node`` dataset records the parent node of the record, and is
an index into the ``/trees/nodes`` group.

The ``num_children`` dataset records the number of children for a particular
record. The ``children`` dataset then records the actual child nodes for each
coalescence record. This 1-dimensional array lists the child nodes for every
record in order, and therefore by using the ``num_children`` array we can
efficiently recover the actual children involved in each event. Within a given
event, child nodes must be sorted in increasing order. The records must be
listed in time increasing order.

===================       ==============      ======
Path                      Type                Dim
===================       ==============      ======
/trees/left               H5T_STD_U32LE       N
/trees/right              H5T_STD_U32LE       N
/trees/node               H5T_STD_U32LE       N
/trees/num_children       H5T_STD_U32LE       N
/trees/children           H5T_STD_U32LE       :math:`\leq 2 \times` N
===================       ==============      ======

^^^^^^^^^^^^^
Indexes group
^^^^^^^^^^^^^

The ``/trees/indexes`` group records information required to efficiently
reconstruct the individual trees from the tree sequence. The
``insertion_order`` dataset contains the order in which records must be applied
and the ``removal_order`` dataset the order in which records must be
removed for a left-to-right traversal of the trees.

==============================     ==============
Path                               Type
==============================     ==============
/trees/indexes/insertion_order     H5T_STD_U32LE
/trees/indexes/removal_order       H5T_STD_U32LE
==============================     ==============