Revision 3a78098f11e03b0136e9f4bb3aa587d528f08d3f authored by Asher Preska Steinberg on 09 February 2021, 19:22:11 UTC, committed by Asher Preska Steinberg on 09 February 2021, 19:22:11 UTC
APS164 commit for running just data with original mcorr-fit (singleFit.py)
1 parent a3cece3
APS143_findFunnyGenes2.py

'''
Goal:
check whether any gene has been improperly annotated
in the APS143 S. enterica MSA file,
this time looking at where mcorr-pair broke according to the slurm output file
'''


from tqdm import tqdm
import pandas as pd

MSA = "/scratch/aps376/recombo/APS143_1008_senterica_Archive/MSA_Master_Sorted"

##read the file

funnygenes = []

with open(MSA, 'r') as master_file:
    i = 0
    num = 1
    lastline = 'blargh'
    for ln in master_file:
        if lastline.startswith("="):
            num = num + 1
            if not ln.startswith(">"):
                print(str(i))
                print(ln)
                funnygenes.append(i)
        ##this is the line that mcorr-pair stopped at
#        if num == 1826:
#            print(ln)
        ##this is the line after
#        if num == 1827:
#            print(ln)
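        ##remember this line and advance the line counter;
        ##without these updates the check on lastline above never fires
        lastline = ln
        i = i + 1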

##for some reason gene 1826 appears twice, with two different sequences ...
##let's see if any others do

###delete the funny gene
##store line numbers for future deletion
#funnygenes = []

# with open(MSA, 'r') as master_file:
#     ##line number count
#     i = 0
#     for ln in master_file:
#         if ln.startswith(">"):
#             genename = ln.split(' ')[0]
#             if genename == '>NC_003197.2|cds-YP_009325922.1':
#                 ##get that extra equals sign, which may screw things up
#                 if lastline.startswith("="):
#                     funnygenes.append(i-1)
#                 #grab the gene name line, and the next line which is the gene
#                 funnygenes.append(i)
#                 funnygenes.append(i+1)
#             ##get our other culprit
#             elif genename == '>NC_003197.2|cds-NP_459150.1':
#                 if lastline.startswith("="):
#                     funnygenes.append(i-1)
#                 funnygenes.append(i)
#                 funnygenes.append(i+1)
#         lastline = ln
#         i = i + 1

##remove the gene lines

# MSA_file = open(MSA, 'r')
# lines = MSA_file.readlines()
# MSA_file.close()
#
# for gene in funnygenes:
#     print(str(gene))
#     del lines[gene]
#
# ##re-write the file minus this line
# new_file = open(MSA, 'w+')
#
# for line in lines:
#     new_file.write(line)
#
# new_file.close()

## double check that the file is clean now

with open(MSA, 'r') as master_file:
    #line number count
    i = 0
    lastline = 'blargh'
    ##count the length of the genes
    genelen = 0
    ##scan line by line through the file
    ## set the first gene name
    firstline = master_file.readline()
    genename = firstline.split(' ')[0]
    for ln in master_file:
        ##count every line read since the current gene began
        genelen = genelen + 1
        ##when a new gene begins
        if lastline.startswith("="):
            ##if the previous gene spanned more than 4710*2+1 lines
            ##(presumably a header and a sequence line for each of 4710 strains, plus one)
            ##print the genename
            if genelen > 4710*2+1:
                print('long gene')
                print(str(i))
                print(genename)
            ##reset the genename for this gene
            genename = ln.split(' ')[0]
            ##reset the gene length count
            genelen = 0
        lastline = ln
        i = i + 1

#print(funnygenes)
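
The checks above (a gene that appears twice, and a gene block with more lines than expected) can also be sketched as one function that works on any iterable of lines, so it can be tried on a small in-memory example before touching the real file. This is only a sketch under stated assumptions: the name scan_msa and the n_strains parameter are hypothetical, each block is assumed to end with a line starting with "=", and the gene name is assumed to be the first whitespace-separated token of the first ">" header in a block, as in the script above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def scan_msa(lines: Iterable[str], n_strains: int) -> Tuple[List[str], List[str]]:
    """Return (gene names seen in more than one block, gene names of over-long blocks).

    Hypothetical helper: assumes blocks are terminated by a line starting with "="
    and that each strain contributes one header line and one sequence line.
    """
    expected = 2 * n_strains  # header + sequence per strain
    seen = Counter()          # how many blocks each gene name opens
    long_genes = []
    genename = None           # gene name of the block currently being read
    block_len = 0             # non-blank lines read in the current block
    for ln in lines:
        if not ln.strip():
            continue  # ignore blank lines
        if ln.startswith("="):
            # End of a block: report it if it holds more lines than expected.
            if genename is not None and block_len > expected:
                long_genes.append(genename)
            genename = None
            block_len = 0
            continue
        if genename is None and ln.startswith(">"):
            # First header of the block names the gene.
            genename = ln.split()[0]
            seen[genename] += 1
        block_len += 1
    duplicates = [name for name, count in seen.items() if count > 1]
    return duplicates, long_genes
```

With the real data this could be called as scan_msa(master_file, n_strains=4710) inside a with open(MSA, 'r') block; the value 4710 is taken from the length check above and is assumed to be the number of strains.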

