Skip to main content
  • Home
  • Development
  • Documentation
  • Donate
  • Operational login
  • Browse the archive

swh logo
SoftwareHeritage
Software
Heritage
Archive
Features
  • Search

  • Downloads

  • Save code now

  • Add forge now

  • Help

https://github.com/gabe-dubose/emus
27 October 2023, 08:23:24 UTC
  • Code
  • Branches (1)
  • Releases (0)
  • Visits
Revision 7e564642a8534324a40203783498dd8d6501e030 authored by Gabe DuBose on 15 May 2022, 13:05:17 UTC, committed by Gabe DuBose on 15 May 2022, 13:05:17 UTC
Updated simulation script
1 parent 5305461
  • Files
  • Changes
    • Branches
    • Releases
    • HEAD
    • refs/heads/main
    • 7e564642a8534324a40203783498dd8d6501e030
    No releases to show
  • 5541b2c
  • /
  • emus
  • /
  • simulate_mutations.py
Raw File Download Save again
Take a new snapshot of a software origin

If the archived software origin currently browsed is not synchronized with its upstream version (for instance when new commits have been issued), you can explicitly request Software Heritage to take a new snapshot of it.

Use the form below to proceed. Once a request has been submitted and accepted, it will be processed as soon as possible. You can then check its processing state by visiting this dedicated page.
swh spinner

Processing "take a new snapshot" request ...

To reference or cite the objects present in the Software Heritage archive, permalinks based on SoftWare Hash IDentifiers (SWHIDs) must be used.
Select below a type of object currently browsed in order to display its associated SWHID and permalink.

  • revision
  • directory
  • content
  • snapshot
origin badgerevision badge
swh:1:rev:7e564642a8534324a40203783498dd8d6501e030
origin badgedirectory badge
swh:1:dir:27a089d6a1ee0b2918edefba57b84e90f8fbf402
origin badgecontent badge
swh:1:cnt:7fc952fa0ab0d69acc6fe4fc83d4949dc9efe3eb
origin badgesnapshot badge
swh:1:snp:2bd7ed1081ff598939e948409013744f1715835b

This interface enables to generate software citations, provided that the root directory of browsed objects contains a citation.cff or codemeta.json file.
Select below a type of object currently browsed in order to generate citations for them.

  • revision
  • directory
  • content
  • snapshot
(requires biblatex-software package)
Generating citation ...
(requires biblatex-software package)
Generating citation ...
(requires biblatex-software package)
Generating citation ...
(requires biblatex-software package)
Generating citation ...
Tip revision: 7e564642a8534324a40203783498dd8d6501e030 authored by Gabe DuBose on 15 May 2022, 13:05:17 UTC
Updated simulation script
Tip revision: 7e56464
simulate_mutations.py
#!/usr/bin/env python3

import argparse
import random
from datetime import date
import time
import pyfaidx

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', metavar = '<Input Genome in FASTA Format>', 
    help = "Genome sequence used to simulate null distribution of mutaitons")
parser.add_argument('-s', '--snp', metavar = '<Whole Number>', type = int,
    help = "Number of SNPs to simualte in input genome")
parser.add_argument('-o', '--output', metavar = '<output_file.vcf>', 
    help = "Handle that will be added to the output VCF file")
args = parser.parse_args()

class GenomeSequence:
    def __init__(self, genome):
        self.genome = genome
    
    def simulate_variants(self, fasta, number):

        #read fasta file
        genome = pyfaidx.Fasta(fasta)
        sequences = list(genome.keys())
        
        alphabet = ['A', 'T', 'G', 'C']

        mutation_count = 0
        mutation_info = {}
        variant_calls = []

        while mutation_count < number:
            variant = random.choices(alphabet, k=1)
            variant = variant[0]
            seqid = random.choices(sequences, k=1)
            seqid = seqid[0]
            position = random.randint(0, len(genome[seqid]) - 1)
            original = str(genome[seqid][position]).upper()
        
            #add sequence id to info dictionary if not already there
            if seqid not in mutation_info:
                mutation_info[seqid] = []

            #check if the variant is the same as the original and if that position had already been simulated
            #if conditions are met, place variant
            if variant != original and position not in mutation_info[seqid]:
                variant_call = [seqid, position, original, variant]
                variant_calls.append(variant_call)
                mutation_info[seqid].append(position)
                mutation_count += 1

        return variant_calls

    
def write_vcf(fasta, variants):
    #read fasta file
    genome = pyfaidx.Fasta(fasta)
    sequences = genome.keys()

    #initialize output VCF
    today = date.today().strftime("%Y%m%d")
    with open(args.output, 'a') as outfile:
        outfile.write(f'##fileformat=VCFv4.3\n')
        outfile.write(f'##date={today}\n')
        outfile.write(f'##source=emus.v.1.0.0\n')
        outfile.write(f'##reference={args.input}\n')
        for sequence in sequences:
            outfile.write(f'##contig=<ID={sequence}, lenght = {len(genome[sequence])}>\n')

        outfile.write(f'#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n')

        sorted_variants = sorted(variants, key=lambda x: (x[0], x[1]))
        for variant in sorted_variants:
            var_line = f'{variant[0]}\t{int(variant[1])+1}\t.\t{variant[2]}\t{variant[3]}\t.\t.\t.\tGT\t1\n'
            outfile.write(var_line)

#define sequence objects
sequence_object = GenomeSequence(args.input) 

#simulate mutations
print('Simulating mutations...')
start_sim = time.time()
simulation = sequence_object.simulate_variants(args.input, args.snp)
print(f'Done in {time.time() - start_sim} seconds.')

#write variant call file
print('Writing variant call file...')
write_vcf(args.input, simulation)
print('Done.')
The diff you're trying to view is too large. Only the first 1000 changed files have been loaded.
Showing with 0 additions and 0 deletions (0 / 0 diffs computed)
swh spinner

Computing file changes ...

back to top

Software Heritage — Copyright (C) 2015–2026, The Software Heritage developers. License: GNU AGPLv3+.
The source code of Software Heritage itself is available on our development forge.
The source code files archived by Software Heritage are available under their own copyright and licenses.
Terms of use: Archive access, API— Content policy— Contact— JavaScript license information— Web API