Origin: https://github.com/svohr/mixemt (archive visit of 20 October 2023, 16:20:37 UTC)
Path: bin/mixemt
Tip revision: be33e2d3db69a1c5c64f68041f54f9f781deb769, authored by Sam Vohr on 06 October 2023, 00:46:11 UTC
Fix index error for report top proportions (#24)

#! /usr/bin/env python3
"""

mixemt (Mix + EM + mitochondrial sequence)

This script implements an approach for deconvoluting mixtures of mitochondrial
sequences based on a known phylogeny from phylotree.org and an
Expectation-Maximization approach to estimate the number and relative
abundances of contributing haplotypes. Based on this information, reads are
assigned to sub-assemblies to reconstruct the haplotypes.

Outline of approach:

1. Identify all non-reference SNP variant sites that are present in this
sample.

2. Construct a matrix of known mtDNA variants and associated haplotypes from
phylotree.

3. EM algorithm to estimate contributing haplotypes and abundances.

4. Output fragments by contributing haplotype.

"""

import argparse
import pkg_resources
import sys

import numpy
import pysam

from mixemt import phylotree
from mixemt import preprocess
from mixemt import em
from mixemt import observe
from mixemt import assemble
from mixemt import stats


def open_aln_file(bam_fn, args):
    """
    Opens the file specified by the filename parameter and returns the
    AlignmentFile object.

    Args:
        bam_fn: path to BAM file to open (str).
        args: argparse arguments namespace for verbose mode.

    Returns:
        Pysam AlignmentFile object.
    """
    bamfile = pysam.AlignmentFile(bam_fn, 'rb')
    if args.verbose:
        sys.stderr.write('Read %d alignments from "%s".\n'
                         % (bamfile.count(), bam_fn))
    return bamfile


def open_phylotree(phy_fn, refseq, args):
    """
    Opens the phylotree CSV file or the default phylotree file from the mixemt
    package if phy_fn is None and uses the module "phylotree" to read in
    a list of variant positions and a table of haplogroups and associated
    variants.

    Args:
        phy_fn: path to Phylotree CSV file (str) or None
        refseq: the reference sequence that is used to define the mutations
                in the tree (usually RSRS)
        args: argparse arguments namespace for relevant options.

    Returns:
        Phylotree object
    """
    def open_phylotree_resource(phy_fn):
        """ Opens phylotree from package or from file. """
        if phy_fn is None:
            if args.verbose:
                sys.stderr.write("Reading haplogroups from 'mixemt' package "
                                 "(phylotree/mtDNA_tree_Build_17.csv)\n")
            return open(pkg_resources.resource_filename(
                'mixemt', 'phylotree/mtDNA_tree_Build_17.csv'), 'r')
        else:
            if args.verbose:
                sys.stderr.write("Reading haplogroups from file (%s)\n"
                                 % (phy_fn))
            return open(phy_fn, 'r')
    with open_phylotree_resource(phy_fn) as phy_in:
        phy = phylotree.Phylotree(phy_in, refseq=refseq,
                                  anon_haps=args.anon_haps,
                                  rm_unstable=args.rm_unstable)
        if args.exclude_pos is not None:
            phy.ignore_sites(args.exclude_pos)
        if args.verbose:
            sys.stderr.write('Using %d variant sites from %d haplogroups.\n'
                             % (len(phy.variants), len(phy.hap_var)))

        if args.cust_hap_fn is not None:
            open_custom_haplotypes(args.cust_hap_fn, phy, args)
        return phy


def open_custom_haplotypes(hap_fn, phylo, args):
    """
    Reads in a file describing custom haplotypes to be considered along with
    the haplogroups from Phylotree. Each haplotype is described by a single
    line made up of an identifier and a list of SNP variants. The identifier
    and the variant list are separated by a tab character. Variants should be
    described in terms of the Reconstructed Sapiens Reference Sequence and
    separated by commas. Custom haplotypes are read in from the file named
    "hap_fn" and added to the provided Phylotree object.

    e.g.
    "custom_hap1\tC152T,A2758G,C2885T,G7146A,T8468C\n"

    Args:
        hap_fn: path to tab-delimited file describing custom haplotypes (str)
        phylo: Phylotree object to add custom haplotypes to.
        args: argparse arguments namespace for relevant options.

    Returns:
        nothing
    """
    with open(hap_fn, 'r') as hap_in:
        n_haps = 0
        for line in hap_in:
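            # Expected line format (see the docstring above), e.g.:
            #   custom_hap1<TAB>C152T,A2758G,C2885T,G7146A,T8468C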
            items = line.rstrip().split('\t')
            hap_id = items[0]
            variants = items[1].split(',')
            phylo.add_custom_hap(hap_id, variants)
            n_haps += 1
        if args.verbose:
            sys.stderr.write('%d custom haplotypes added from "%s"\n'
                             % (n_haps, hap_fn))
    return


def open_refseq(fa_fn, args):
    """
    Opens the FASTA formatted file specified by the fa_fn parameter or the
    default reference file from the mixemt package if fa_fn is None and returns
    the sequence of the first entry as a string. If the FASTA file is empty an
    error message is printed and the program exits.

    Args:
        fa_fn: path to reference fasta file (str) or None
        args: argparse arguments namespace for relevant options.

    Returns:
        string representing reference sequence.
    """
    try:
        if fa_fn is None:
            fa_fn = pkg_resources.resource_filename('mixemt',
                                                    'ref/RSRS.mtDNA.fa')
        fafile = pysam.FastaFile(fa_fn)
        if args.verbose:
            sys.stderr.write('Read reference sequence "%s" from "%s".\n'
                             % (fafile.references[0], fa_fn))
        return fafile.fetch(fafile.references[0]).upper()
    except IndexError:
        sys.stderr.write('Error: no records in "%s"\n' % (fa_fn))
        sys.exit(1)
    return


def load_prev(prefix):
    """
    Loads a previously saved state from a series of files sharing the same
    prefix. The program exits if an exception is caught.

    Args:
        prefix: The path/prefix to the files storing the previous state.
    Returns:
        haps: List of haplotype IDs (strings)
        reads: list of lists of read IDs (strings)
        wts: Array of weights (integer counts) for reads that represent each
             subhaplotype.
        init_mat: input matrix for EM.
        em_results tuple: results from EM algorithm; includes...
            props: Array of estimated proportions of haplogroups
            read_hap_mat: Conditional read/haplogroup probability matrix
    """
    try:
        haps = list()
        reads = list()
        read_wts = list()
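        # Files expected under the prefix (as written by dump_all() below):
        #   <prefix>.haps      one haplogroup ID per line
        #   <prefix>.reads     tab-delimited: an index, then the read IDs that
        #                      share that sub-haplotype
        #   <prefix>.mat.npy   conditional read/haplogroup probability matrix
        #   <prefix>.prop.npy  estimated haplogroup proportions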
        with open("%s.haps" % (prefix), 'r') as hap_in:
            for line in hap_in:
                haps.append(line.rstrip())
        with open("%s.reads" % (prefix), 'r') as read_in:
            for line in read_in:
                items = line.rstrip().split('\t')
                read_ids = items[1:]
                reads.append(read_ids)
                read_wts.append(len(read_ids))
        read_hap_mat = numpy.load("%s.mat.npy" % (prefix))
        props = numpy.load("%s.prop.npy" % (prefix))
        wts = numpy.array(read_wts)
    except (ValueError, IOError) as inst:
        sys.stderr.write("Error loading previous results:\n%s\n" % (inst))
        sys.exit(1)
    # Load the initial matrix if it was saved.
    try:
        init_mat = numpy.load("%s.em.npy" % (prefix))
    except IOError as inst:
        sys.stderr.write("Error loading previous EM input:\n%s\n" % (inst))
        sys.stderr.write("Contribution estimate refinement will be skipped\n")
        init_mat = None
    return haps, reads, wts, init_mat, (props, read_hap_mat)


def dump_all(prefix, haps, reads, em_mat, em_results):
    """
    Writes the results of the EM step to a series of files that can be loaded
    later on. Used to skip the matrix building and convergence steps when
    debugging.

    Args:
        prefix: prefix for files to be written out (string)
        haps: list of haplogroup IDs (strings)
        reads: list of lists of read IDs (strings)
        em_mat: input matrix for the EM algorithm.
        em_results tuple: results from EM algorithm; includes...
            props: Array of estimated proportions of haplogroups
            read_hap_mat: Conditional read/haplogroup probability matrix

    Returns:
        nothing
    """
    try:
        with open("%s.haps" % (prefix), 'w') as hap_out:
            for hap in haps:
                hap_out.write('%s\n' % (hap))
        with open("%s.reads" % (prefix), 'w') as read_out:
            for i in range(len(reads)):
                read_out.write('%d\t%s\n' % (i, '\t'.join(reads[i])))
        props, read_hap_mat = em_results
        numpy.save("%s.em" % (prefix), em_mat)
        numpy.save("%s.mat" % (prefix), read_hap_mat)
        numpy.save("%s.prop" % (prefix), props)
    except (ValueError, IOError) as inst:
        sys.stderr.write("Warning: %s\n" % (inst))
    return


def process_and_report(args):
    """
    This function takes all of the input from args and runs through the steps
    of preprocessing, EM, and interpretation.

    Args:
        args: argparse arguments namespace
    Returns:
        0 if the run completes successfully, 1 on error.
    """
    try:
        # Load up the data.
        refseq = open_refseq(args.ref_fn, args)
        phylo = open_phylotree(args.phy_fn, refseq, args)
        bamfile = open_aln_file(args.bam_fn, args)

        if args.load is not None:
            # Just load the saved results instead of running everything again.
            if args.verbose:
                sys.stderr.write('\nLoading previous results from %s.*\n'
                                 % (args.load))
            haplogroups, reads, wts, em_mat, em_results = load_prev(args.load)

            if args.verbose:
                sys.stderr.write('Considering %d fragments '
                                 '(%d distinct sub-haplotypes)\n\n'
                                 % (numpy.sum(wts), len(reads)))
        else:
            # Build input for EM step
            (em_mat,
             wts,
             haplogroups,
             reads) = preprocess.build_em_input(bamfile, refseq, phylo, args)
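            # em_mat is the input matrix for the EM step and wts holds the
            # number of reads backing each distinct sub-haplotype; these are
            # the same objects dump_all()/load_prev() handle for -s/-l.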

            # Run EM
            em_results = em.run_em(em_mat, wts, args)

            # Save the results if requested.
            if args.save is not None:
                if args.verbose:
                    sys.stderr.write('\nSaving results to %s.*\n\n'
                                     % (args.save))
                dump_all(args.save, haplogroups, reads, em_mat, em_results)

        # build observation table for assembly steps
        # Re-process the reads to get the reference base observation table.
        base_obs = observe.ObservedBases(bamfile.fetch(),
                                         mapq=args.min_mq,
                                         baseq=args.min_bq)

        contribs = assemble.get_contributors(phylo, base_obs, haplogroups,
                                             wts, em_results, args)
        if len(contribs) == 0:
            sys.stderr.write("\n0 contributors passed filtering steps.\n")
            return 1

        # Report initial results.
        if args.verbose:
            stats.report_top_props(haplogroups, em_results[0], 10)
            stats.report_read_votes(haplogroups, em_results[1], 10)
            sys.stderr.write("\n")

        # refine contribution estimates
        if args.refine_ests and em_mat is not None:
            if args.verbose:
                sys.stderr.write("Refining contribution estimates...\n")

            em_mat, haplogroups = preprocess.reduce_em_matrix(em_mat,
                                                              haplogroups,
                                                              contribs)
            em_results = em.run_em(em_mat, wts, args)
            contribs = assemble.update_contribs(contribs, em_results,
                                                haplogroups)

        contrib_reads = assemble.assign_reads(bamfile, contribs, em_results,
                                              haplogroups, reads, args)

    except (ValueError, IOError) as inst:
        sys.stderr.write("Error:\n %s\n" % (inst))
        return 1

    if args.extend and len(contribs) > 1:
        contrib_reads = assemble.extend_assemblies(refseq, contrib_reads, args)

    # Report the results
    stats.report_contributors(sys.stdout, contribs, contrib_reads)

    if args.stats_prefix:
        stats.write_statistics(phylo, base_obs, contribs, contrib_reads, args)
    if args.cons_prefix:
        assemble.write_consensus_seqs(refseq, contribs, contrib_reads, args)
    if args.out_prefix:
        return assemble.write_haplotypes(bamfile, contrib_reads, args)

    return 0


def main():
    """
    Reads input filenames from command line args and processes input
    through the analysis steps
    """
    parser = argparse.ArgumentParser(
        description="Estimates the number and proportions "
                    "of contributing haplotypes.")
    parser.add_argument("bam_fn", metavar="reads.bam", type=str,
                        help="Aligned reads in indexed BAM file.")

    parser.add_argument('-v', '--verbose', action="store_true",
                        help="Print detailed status while running.")
    parser.add_argument("--ref", dest="ref_fn",
                        metavar="ref.fasta", type=str, default=None,
                        help="FASTA file containing reference sequence "
                             "(must match Phylotree input).")
    parser.add_argument("--phy", dest="phy_fn",
                        metavar="phylotree.csv", type=str, default=None,
                        help="Phylotree CSV file (Default: Build 17).")

    cust_opts = parser.add_argument_group("customization options")
    cust_opts.add_argument('-H', '--haps', dest='cust_hap_fn', type=str,
                           metavar='custom.tab', default=None,
                           help="Custom haplotypes to be considered in "
                                "addition haplogroups from Phylotree.")
    cust_opts.add_argument('-e', '--exclude_pos', dest='exclude_pos', type=str,
                           metavar='SITES', default=None,
                           help="A comma-separated list of 1-based positions "
                                "or 'start-end' ranges to be excluded from "
                                "consideration.")
    cust_opts.add_argument('-A', "--anon-haps", dest="anon_haps",
                           action="store_false", default=True,
                           help="Ignore Phylotree haplogroups without IDs")
    cust_opts.add_argument('-U', "--unstable", dest="rm_unstable",
                           action="store_true", default=False,
                           help="Ignore sites with variants listed as "
                                "unstable in Phylotree.")

    qual_opts = parser.add_argument_group("quality filters")
    qual_opts.add_argument('-q', '--min-MQ', dest='min_mq', type=int,
                           metavar="INT", default=30,
                           help="Skip alignments with mapQ < INT "
                                "(default: %(default)s)")
    qual_opts.add_argument('-Q', '--min-BQ', dest='min_bq', type=int,
                           metavar="INT", default=30,
                           help="Skip bases with baseQ < INT "
                                "(default: %(default)s)")

    em_opts = parser.add_argument_group("expectation-maximization")
    em_opts.add_argument('-i', '--init', dest='init_alpha', type=float,
                         default=1.0, metavar="ALPHA",
                         help="Use parameter ALPHA to initialize haplogroup "
                              "contributions from Dirichlet distribution. "
                              "Set to 'inf' to give haplogroups equal priors. "
                              "(default: %(default)s)")
    em_opts.add_argument('-T', '--converge', dest='tolerance', type=float,
                         default=0.0001, metavar="TOLERANCE",
                         help="Stop EM iteration when abs. difference between "
                              "current and previous contribution estimates is "
                              "< TOLERANCE (default: %(default)s)")
    em_opts.add_argument('-m', '--max-em-iter', dest='max_iter', type=int,
                         default=10000, metavar="N",
                         help="Maximum of number of EM iterations to run "
                              "(default: %(default)s)")
    em_opts.add_argument('-M', '--multi-em', dest='n_multi', type=int,
                         default=1, metavar="N",
                         help="Runs EM until convergence multiple times and "
                              "reports the results averaged over all runs "
                              "(default: %(default)s)")
    em_opts.add_argument('-S', '--seed', dest='seed', type=int, metavar='N',
                         help="Sets the seed for random number generation "
                              "for reproducible results.")

    asm_opts = parser.add_argument_group("contributor detection and "
                                         "assembly options")
    asm_opts.add_argument('-C', '--contributors', dest='contributors',
                          type=str, default=None, metavar='HAP1,HAP2,...,HAPN',
                          help="Skip contributor detection step and use the "
                               "specified comma-separated list of haplogroups "
                               "instead (be careful)")
    asm_opts.add_argument('-a', '--assign-odds', dest='min_fold', type=float,
                          default=2.0, metavar='ODDS',
                          help='Minimum odds ratio (ratio of probabilities '
                               'between the best and next-best haplogroup) '
                               'required to assign a read to a contributor '
                               '(default: %(default)s)')
    asm_opts.add_argument('-r', '--min-reads', dest='min_reads', type=int,
                          default=10, metavar='N',
                          help="Haplogroup must have N reads to be considered "
                               "a contributor (default: %(default)s)")
    asm_opts.add_argument('-R', '--var-min-reads', dest='min_var_reads',
                          type=int, default=3, metavar='N',
                          help="Variant base must be found in N reads to be "
                               "considered as present in sample "
                               "(default: %(default)s)")
    asm_opts.add_argument('-F', '--var-fraction-min-reads',
                          dest='frac_var_reads',
                          type=float, default=0.02, metavar='F',
                          help="Variant base must be found in fraction F of "
                               "reads to be considered as present in sample. "
                               "A minimum of reads can be set with the -R "
                               "option for when this is low or can be low "
                               "(default: %(default)s)")
    asm_opts.add_argument('-f', '--var-fraction', dest='var_fraction',
                          type=float, default=0.5, metavar='F',
                          help="Fraction of unique defining variants that "
                               "must be observed to call a haplogroup "
                               "present (default: %(default)s)")
    asm_opts.add_argument('-n', '--var-count', dest='var_count',
                          type=int, default=None, metavar='N',
                          help="Call haplogroup a contributor if it has at "
                               "least N unique variants observed in the "
                               "sample, regardless of total number of "
                               "defining variants. Use when allelic dropout "
                               "is likely. (default: %(default)s).")
    asm_opts.add_argument('-V', '--no-var-check',
                          dest='var_check', action="store_false",
                          help="Disable requirement that the majority of "
                               "contributors' unique defining variants are "
                               "present in the sample. Use when coverage is "
                               "very low and dropout is likely.")
    asm_opts.add_argument('-x', '--extend-assemblies',
                          dest='extend', action="store_true",
                          help='Attempt to extend haplotype assemblies '
                               'iteratively by identifying novel variants '
                               'from contributor consensus sequences and '
                               'assigning reads based on them.')
    asm_opts.add_argument('-c', '--cons-cov', dest='cons_cov',
                          type=int, default=2, metavar='N',
                          help="When extending assemblies with -x, sets the "
                               "depth of coverage required to call a base "
                               "for a contributor (default: %(default)s)")
    asm_opts.add_argument('-E', dest='refine_ests', default=True,
                          action="store_false",
                          help="Skip contribution estimate refinement and "
                               "report proportions from initial EM run.")

    out_opts = parser.add_argument_group("output options")
    out_opts.add_argument('-s', '--save', dest='save', type=str,
                          default=None, metavar='PREFIX',
                          help="Save the EM results using this file prefix.")
    out_opts.add_argument('-l', '--load', dest='load', type=str,
                          default=None, metavar='PREFIX',
                          help="Skip EM step and load from a previous result "
                               "(overrides -s).")
    out_opts.add_argument('-o', '--out', dest='out_prefix', type=str,
                          default=None, metavar='PREFIX',
                          help="If set, write assigned reads to contributor-"
                               "specific BAM files using this filename prefix")
    out_opts.add_argument('-t', '--stats', dest='stats_prefix', type=str,
                          default=None, metavar='PREFIX',
                          help="If set, write stats tables to be used for "
                               "plotting results later.")
    out_opts.add_argument('-b', '--cons-bases', dest='cons_prefix', type=str,
                          default=None, metavar='PREFIX',
                          help="Call consensus bases at every reference "
                               "position for each contributor and write out "
                               "sequences in FASTA format.")
    args = parser.parse_args()

    if args.verbose:
        sys.stderr.write('%s\n\n' % (' '.join(sys.argv)))
    if args.seed is not None:
        numpy.random.seed(args.seed)

    return process_and_report(args)


if __name__ == "__main__":
    sys.exit(main())
