https://github.com/nbraffman/CylK-homologs
Revision 732a38ef8ab73988203ea1933158238ea90361b0 authored by nbraffman on 29 November 2021, 18:59:23 UTC, committed by GitHub on 29 November 2021, 18:59:23 UTC
1 parent 88e9848
Raw File
Tip revision: 732a38ef8ab73988203ea1933158238ea90361b0 authored by nbraffman on 29 November 2021, 18:59:23 UTC
Update README.md
Tip revision: 732a38e
README.md
# CylK-homologs
constructs and analyzes individual muscle alignments to characterize candidate CylK homologs, because a blast search is too general

you must have muscle installed: https://drive5.com/muscle5/

your input files (input_all_homologs.fasta) can include as many sequences as you'd like to analyze against a reference sequence (licheniforme.fasta)

script_Nterm.sh is a bash script that will parse through your input protein sequence(s) and construct individual muscle alignments against a reference protein sequence to look for the presence or absence of a defined number of N-terminal residues (135) before a set point in the protein sequence (Lys240). This was designed with the CylK enzyme family in mind because it has a relatively rare N-teminal domain and a common C-terminal domain (beta propeller), such that blast results would return a mixture of true homologs and some only containing the C-terminal domain. The output will be a fasta file (out_homologs.fasta) of protein sequences with N-terminal domains, and are thus more likely to be CylK homologs.

the output of script_Nterm.sh (out_homologs.fasta) was deduplicated with Geneious (out_homologs_unique.fasta) and used as the input file below:

script_R105.sh and script_Y473.sh similarly parse through input protein sequences (in this case, out_homologs_unique.fasta) and construct individual muscle alignments against a reference protein sequence (licheniforme.fasta), but they look for the presence or absence of key catalytic residues, R105 or Y473, respectively. The output will be a fasta file (out_homologs_unique_with_R105.fasta, or out_homologs_unique_with_Y473.fasta) of protein sequences with the key catalytic residue.
back to top