Revision history - None - origin: https://github.com/ekg/freebayes

Revision	Author	Date	Message	Commit Date
04d702c	Erik Garrison	25 May 2010, 21:44:14 UTC	added bamBayes back into make all	25 May 2010, 21:44:14 UTC
a0a3977	Erik Garrison	25 May 2010, 21:33:04 UTC	Merge branch 'bamBayes'	25 May 2010, 21:33:04 UTC
6b9345a	Erik Garrison	25 May 2010, 21:27:10 UTC	remove vcf and rpt output from bayes.cpp ... as we will provide these via pipelined filters which operate on JSON output streamed from the core executable component.	25 May 2010, 21:27:10 UTC
51ae935	Erik Garrison	25 May 2010, 20:47:28 UTC	BamMultiReader fixes Allow the BamMultiReader to selectively use indexes, obviating error messages when the indexes are not found.	25 May 2010, 20:47:28 UTC
b3b2c0f	Erik Garrison	25 May 2010, 20:43:45 UTC	incorporated updates to BamTools Updated BamAux.h provides a templated method to get any tag from our read header.	25 May 2010, 20:43:45 UTC
0e37139	Erik Garrison	25 May 2010, 20:35:59 UTC	stable release candidate This is a 'stable' release candidate of bamBayes meant to bridge the gap between the existing system and the one currently in development. Certain improvements from the upcoming version have been incorporated. Some testing has been completed as of this commit. bamBayes now incorporates a multi-reader which is capable of opening any number of sorted BAM files and reading from them as if they are one sorted, and merged, BAM. A list of files can be specified as a space or newline-delimited list passed as a string to the --bam parameter (e.g. --bam "one.bam two.bam three.bam"). This commit also includes extensive fixes to sample name (SM) handling. Now assume, per existing BAM producers, that we can have many read groups (RG's) per sample name (SM). At runtime we generate a map of RG->SM which is used to get sample names for each alignment in our analysis. In order to guarantee expected behavior, the system will raise an error and exit if a sample specified in a sample file is not present in the BAM file or files passed to the program. VCF 3.4 output is supported alongside RPT output.	25 May 2010, 20:35:59 UTC
1de4103	Erik Garrison	24 May 2010, 15:48:44 UTC	resolved most issues with allele overlap; checkpoint Previously had problems with Allele object overlap in the case of snps. Resolved these issues. Moved internal data structures to 0-based indexing. JSON output from bayes.cpp allows for automated validation via python. (Processing scripts to be included in subsequent commit.)	24 May 2010, 15:48:44 UTC
23099b8	Erik Garrison	20 May 2010, 22:18:18 UTC	completed prior logging legibility fix	20 May 2010, 22:18:18 UTC
ced4b7d	Erik Garrison	20 May 2010, 22:15:45 UTC	minor change in logging output	20 May 2010, 22:15:45 UTC
a612bae	Erik Garrison	20 May 2010, 22:10:25 UTC	BamBayes incorporation of BamMultiReader Incorporate BamMultiReader into bamBayes.	20 May 2010, 22:10:25 UTC
e54cae3	Erik Garrison	19 May 2010, 20:29:06 UTC	broke apart calculation and reporting Two step process allows us to cleanly integrate second-stage probability estimation following the generation of our data likelihoods.	19 May 2010, 20:29:06 UTC
64ab687	Erik Garrison	19 May 2010, 19:58:32 UTC	inline json output checkpoint Some cleanups wrt phred scale reporting; also checkpoint prior to removing inline json output.	19 May 2010, 19:58:32 UTC
c3cf87f	Erik Garrison	18 May 2010, 21:29:58 UTC	long double, json output I've moved math to use long doubles instead of doubles. (No significant performance loss noted, precision is necessary for calculations when coverage is >150 or so.) This commit produces a json-formatted output stream. In production this stream will then be piped to a processing application that can output a variety of interchange formats, including VCF and GLF.	18 May 2010, 21:29:58 UTC
ba1eb23	Erik Garrison	14 May 2010, 21:41:13 UTC	fixed mistakes with Parameters	14 May 2010, 21:41:13 UTC
7d75763	Erik Garrison	14 May 2010, 21:35:42 UTC	bump	14 May 2010, 21:35:42 UTC
aa402f5	Erik Garrison	14 May 2010, 21:27:42 UTC	added BamMultiReader Drawn from BamTools unstable. Tested and seen to be working.	14 May 2010, 21:27:42 UTC
1edfb6c	Erik Garrison	14 May 2010, 20:19:10 UTC	more tweaks Fixed mistake where Parameters object in Caller was referenced by pointer. Removing this indirection provided significant speedup. Other cleanups. I'm reworking some of the runtime options to match the existing algorithm. I may rewrite the help dialog to be more *nix culture conformant.	14 May 2010, 20:19:10 UTC
fd02c4f	Erik Garrison	12 May 2010, 17:59:52 UTC	Fixed alignment registration We need to register alignments and check certain properties of the registration before proceeding with analysis, otherwise we will have no means to filter on the basis of registration properties. This commit incorporates this fix, also investigates performance tweaks.	12 May 2010, 17:59:52 UTC
025b18b	Erik Garrison	12 May 2010, 16:03:07 UTC	Object recycling for Allele* In earlier tests I found that much (up to 40%) of runtime was spent creating and destroying Allele objects. I implemented a free list based recycling system. The result is mixed; in return for a reduction of time spent copying Alleles the system now incurs a slight dereferencing penalty whenever it works with them. Overall performance slightly improved. Now a bottleneck lies at the Allele sorting step. As Alleles are always manipulated in terms of their sample membership, I am going to change the primary accounting system for them from a list of all Alleles in alignments overlapping the current location to a map from sample id to Allele list.	12 May 2010, 16:03:07 UTC
b70fee7	Erik Garrison	09 May 2010, 21:36:54 UTC	FASTA reader fix, math simplification, optimization Incorporates a fix to the FASTA reader resolving a problem in which FASTA index entries were improperly sorted. Numerous tweaks to the math; checkpoint in this sense. The system is now essentially generating genotypes for each individual. Math attempting to estimate P(all genotype choices | observations) has been removed, and will be reincorporated after discussions with Gabor. My objective is to improve the performance of this aspect to the point that the system can massively reduce data, perhaps to a degree commensurate with the limitations of a scripting-language based P(SNP) estimator. I applied the -O3 compiler flags and found a large (2x) improvement in performance. I also noticed that nearly half the system runtime is spent dealing with Allele creation and destruction. In the subsequent commit I will move most Allele manipulation functions to work on Allele*, an avenue I elected not to go down until I had sorted through most of the algorithmic issues.	09 May 2010, 21:36:54 UTC
7205774	Erik Garrison	07 May 2010, 20:19:35 UTC	Fasta reader fix The Fasta.cpp fasta reader (from FastaHack: http://github.com/ekg/fastahack) had some erroneous math for calculating reference sequence lengths. I've resolved this in the other library and here incorporate the fix.	07 May 2010, 20:19:35 UTC
fca8c78	Erik Garrison	07 May 2010, 20:17:05 UTC	Math cleanup, multichoose.h truly lands Math cleanup and clarification continues. Finally 'really' landed multichoose.h; whoops.	07 May 2010, 20:17:05 UTC
13febf8	Erik Garrison	03 May 2010, 16:25:44 UTC	Normalize probabilities of data likelihoods Normalize by the number of observations in each allele combo.	03 May 2010, 16:25:44 UTC
7156f48	Erik Garrison	02 May 2010, 03:04:44 UTC	fixed segfault when we have no alleles to analyze multichoose(...) fails when it is given a 0-length vector to make multichoices from. I'll fix this somehow; but generally it makes no sense to attempt to analyze positions with no overlapping alleles, thus we now skip them.	02 May 2010, 03:04:44 UTC
4c668c9	Erik Garrison	02 May 2010, 02:16:05 UTC	Working posterior probability calculation Working functions for estimating the probability of a genotype given a set of observed alleles.	02 May 2010, 02:16:05 UTC
3c78d8a	Erik Garrison	30 April 2010, 20:01:35 UTC	checkpoint Working system, but buggy math. Checkpoint prior to overhaul.	30 April 2010, 20:01:35 UTC
fb3de0d	Erik Garrison	26 April 2010, 20:43:13 UTC	Correct usage of Read Group tag (@RG) Incorporates changes from the newest release of BamTools allowing us to use the @RG tag to get sample names instead of the read name. This is proper behavior and will allow better integration of bamBayes into existing pipelines.	26 April 2010, 20:43:13 UTC
31d7743	Erik Garrison	24 April 2010, 02:44:07 UTC	Allele observation probabilities, Allele probabilities This commit includes (compiled, but thus far untested) code to calculate allele observation probabilities (probability that we have a true observation of an allele) and allele probabilities (multinomial probability of a set of observations given an underlying genotype). This functionality is provided via a function that takes a set of allele observations from an individual sample and a list of genotypes. The function returns a vector of probabilities ordered according to the input genotypes.	24 April 2010, 02:44:07 UTC
b45e696	Erik Garrison	23 April 2010, 01:05:32 UTC	iterative multichoose function To provide a large memory and speed boost, I've incorporated a multichoose algorithm that uses a non-recursive method to generate all possible multiset combinations of a given size out of an input set/multiset vector.	23 April 2010, 01:05:32 UTC
7b10357	Erik Garrison	21 April 2010, 01:52:41 UTC	Added code to generate allele combinations Incorporation of http://github.com/ekg/multichoose	21 April 2010, 01:52:41 UTC
6fcc12c	Erik Garrison	14 April 2010, 01:23:01 UTC	Untested incremental commit This code probably doesn't even compile. This commit serves as a checkpoint of my progress over the past few days. In this commit I include a heavily commented and incomplete function, Caller::probObservedAllelesGivenGenotype, which estimates the probability of the observation of a set of alleles given an underlying genotype and a ploidy. This function generalizes the mathematical techniques found in GigaBayes.	14 April 2010, 01:23:01 UTC
4dbc74f	Erik Garrison	06 April 2010, 23:20:08 UTC	Incremental commit Myriad changes; maybe not the best to lump into one commit, but the source is in heavy development and I just wanted to checkpoint. Merged in new BamTools source, svn rev 43. Features much faster jump time for the reader (up to 80% improvement). Move quality strings into Alleles. Support for reference alleles, recorded for each region of a read that matches the target sequence. These seem to be necessary for any sane calculations about variation likelihood. The current implementation is untested. I'm going to instrument tomorrow to make sure things are working properly. Provided I can quickly complete a small data organization step (sorting the alleles by sample id), it should then be a straight shot to writing the likelihood calculation functions to work on all types of alleles!	06 April 2010, 23:20:08 UTC
41818ba	Erik Garrison	31 March 2010, 19:28:31 UTC	Target-driven update complete Now we step through a list of targets from the BED file without concern for order in our reference sequence. This makes our code simpler and However, doing this required some rewriting of the sequence loading sections. As alignments at the start and end of our target sequence will lie outside of the bounds of the target, we must grab more sequence than target, allowing the analysis of the mapping of all of the reads overlapping the target sequence. For this iteration I'm processing reads at the beginning and end to exactly determine the length of the left and right overhangs; but in the future it may prove simpler (and much faster) to select a length greater than the longest read in the dataset but much shorter than the full reference sequence length (as was done previously).	31 March 2010, 19:28:31 UTC
bebfb6d	Erik Garrison	29 March 2010, 22:42:27 UTC	From position-driven to target driven Initially I attempted to use an indexing scheme which assumed that sequence 'order' matched order in a fasta file. This is not guaranteed; it makes no sense to assume this. I have moved to driving the processing forward using only targets from the input BED file. This led to a set of necessary changes which have bogged down progress. Rather than loading entire fasta sequences (a huge waste of resources for small targets), I'm using subsequencing facilities from Fasta.cpp to get just the reference subsequence of the target. Unfortunately this has caused increases in registration complexity, and I'm rewriting this portion to further reduce unnecessary processing.	29 March 2010, 22:42:27 UTC
955c16c	Erik Garrison	25 March 2010, 20:53:46 UTC	checkpoint Now the input is working (mostly). I'm running against Baylor's filtered snp data as a sanity check. Things seem to be working at present. Left to sort out is the deque handling. I am not properly addressing the 0-based, half-open coordinates of the BED targets.	25 March 2010, 20:53:46 UTC
d1b0f8a	Erik Garrison	24 March 2010, 23:09:14 UTC	another checkpoint Still not functional, although the allele registration system appears to be working. I'm sorting out the target traversal logic; it's a mess as I wrote it :(	24 March 2010, 23:09:14 UTC
737e7b4	Erik Garrison	23 March 2010, 22:07:03 UTC	it compiles Not much else. Untested; checkpointed...	23 March 2010, 22:07:03 UTC
be5098b	Erik Garrison	23 March 2010, 16:07:02 UTC	Restructuring work midway I've completed about half of the major changes I need to restructure the system. I didn't want to lead myself too far astray so I'm checkpointing here before I begin the tricky work of compiling the I/O portion of this large rewrite.	23 March 2010, 16:07:02 UTC
a654866	Erik Garrison	11 March 2010, 22:01:53 UTC	Restructuring work start bayes.cpp <-- new application main featuring simplified top-level Parameters.{cpp,h} <-- command line parameter parsing Allele.{cpp,h} <-- allele representation	11 March 2010, 22:01:53 UTC
cb483ee	Erik Garrison	09 March 2010, 22:33:17 UTC	Merge branch 'optimizations'	09 March 2010, 22:33:17 UTC
21249ff	Erik Garrison	09 March 2010, 22:31:48 UTC	PLAN for software update Minor changes; addition of software overview in PLAN file.	09 March 2010, 22:31:48 UTC
ffff7a0	Erik Garrison	04 March 2010, 17:47:37 UTC	removed gprof cflags	04 March 2010, 17:47:37 UTC
d73007c	Erik Garrison	04 March 2010, 16:36:27 UTC	Remove less<T> specifiers for performance boost I've measured a 13% performance boost by removing the (unused and unneeded) less specifiers in the maps in bamBayes.cpp and Function-Math.cpp. NB: removed -pg flag in Makefile	04 March 2010, 16:36:27 UTC
1a0e7e6	Erik Garrison	04 March 2010, 14:28:46 UTC	Comments, errata Minor changes. Mostly comments. Checkpoint prior to testing performance tweaks.	04 March 2010, 14:28:46 UTC
b080a1e	Erik Garrison	25 February 2010, 17:08:27 UTC	minor fix to TRY CATCH macros Added __FILE__ name to CATCH error report.	25 February 2010, 17:08:27 UTC
8c95b91	Erik Garrison	25 February 2010, 16:57:21 UTC	TRY and CATCH macros for error handling To properly trace the source of substr errors, I add a set of macros which centralize exception handling. These have been used to wrap all substr calls. Currently the exception handling is targeted only at out_of_range exceptions, but this architecture allows for the easy addition of catch clauses which handle other cases. Additionally I have added the beginnings of more thorough documentation of the operation of the BamBayes algorithm.	25 February 2010, 16:57:21 UTC
258e45a	Erik Garrison	19 February 2010, 21:21:56 UTC	new features landed In this commit we land several working, but incomplete features. VCF output - complete but not 100% to spec, need to sort out some output filtering issues, decide whether to output all possible genotypes per sample or only the most likely. Automatic sample name reading from BAM file headers - If no sample list is passed on program invocation, BamBayes will assume that all read groups in the input alignment file should be used in the analysis. (read groups are signified by the @RG header and are equivalent in scope to samples).	19 February 2010, 21:21:56 UTC
45580c4	Erik Garrison	17 February 2010, 23:24:41 UTC	Incremental commit More small changes and fixes to bring the VCF output in line with the spec. Addition of a probability-to-phred score function to Function-Math.cpp.	17 February 2010, 23:24:41 UTC
7125964	Erik Garrison	17 February 2010, 00:40:42 UTC	Progressive commit This commit is a checkpoint before making some very major source changes. A wide array of changes have already been implemented, but are untested in this commit. They include changes designed to implement: Fasta file reading (instead of Mosaik binary reference format) VCF (variant call format) output in addition to the bespoke .rpt output.	17 February 2010, 00:40:42 UTC
2306e9b	Erik Garrison	17 February 2010, 00:39:59 UTC	Fasta reader Added fasta file reader and indexer.	17 February 2010, 00:39:59 UTC
e8f5e4b	Erik Garrison	04 February 2010, 22:51:32 UTC	Doxygen config, project Makefile Two useful additions...	04 February 2010, 22:51:32 UTC
a6caa52	Erik Garrison	04 February 2010, 16:52:28 UTC	Initial Commit Drawn directly from alpha-2010-01-18.	04 February 2010, 16:52:28 UTC
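
Commit b45e696 above describes a non-recursive multichoose that generates every multiset combination of a given size from an input vector. The following is a minimal sketch of one way such an enumeration can work, written in Python rather than the project's C++. The function name `multichoose`, its argument order, and the index-advancing strategy are assumptions made for illustration and are not taken from multichoose.h. The sketch assumes the input items are distinct, and it yields nothing for an empty input, which is one possible way to avoid the 0-length-vector failure noted in commit 7156f48.

```python
def multichoose(k, items):
    """Yield every size-k multiset drawn from items, without recursion.

    Hypothetical illustration only; not the multichoose.h implementation.
    Assumes the entries of items are distinct.
    """
    n = len(items)
    if k == 0:
        yield []          # exactly one way to choose nothing
        return
    if n == 0:
        return            # nothing to choose from, so no multisets
    # indices is kept non-decreasing; each entry points into items
    indices = [0] * k
    while True:
        yield [items[i] for i in indices]
        # find the rightmost position that can still be advanced
        pos = k - 1
        while pos >= 0 and indices[pos] == n - 1:
            pos -= 1
        if pos < 0:
            return        # every position is at the last item: done
        # advance that position and reset everything to its right
        nxt = indices[pos] + 1
        for j in range(pos, k):
            indices[j] = nxt


if __name__ == "__main__":
    # Diploid genotypes over two alleles: A/A, A/C, C/C
    for combo in multichoose(2, ["A", "C"]):
        print("/".join(combo))
```

Keeping a single non-decreasing index vector means memory use stays proportional to the multiset size rather than to the number of results, which is the kind of saving the commit message attributes to the iterative approach.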
