https://github.com/nbraffman/CylK-homologs
Tip revision: 732a38ef8ab73988203ea1933158238ea90361b0 authored by nbraffman on 29 November 2021, 18:59:23 UTC
Update README.md
Update README.md
Tip revision: 732a38e
README.md
# CylK-homologs
constructs and analyzes individual muscle alignments to characterize candidate CylK homologs, because a blast search is too general
you must have muscle installed: https://drive5.com/muscle5/
your input files (input_all_homologs.fasta) can include as many sequences as you'd like to analyze against a reference sequence (licheniforme.fasta)
script_Nterm.sh is a bash script that will parse through your input protein sequence(s) and construct individual muscle alignments against a reference protein sequence to look for the presence or absence of a defined number of N-terminal residues (135) before a set point in the protein sequence (Lys240). This was designed with the CylK enzyme family in mind because it has a relatively rare N-teminal domain and a common C-terminal domain (beta propeller), such that blast results would return a mixture of true homologs and some only containing the C-terminal domain. The output will be a fasta file (out_homologs.fasta) of protein sequences with N-terminal domains, and are thus more likely to be CylK homologs.
the output of script_Nterm.sh (out_homologs.fasta) was deduplicated with Geneious (out_homologs_unique.fasta) and used as the input file below:
script_R105.sh and script_Y473.sh similarly parse through input protein sequences (in this case, out_homologs_unique.fasta) and construct individual muscle alignments against a reference protein sequence (licheniforme.fasta), but they look for the presence or absence of key catalytic residues, R105 or Y473, respectively. The output will be a fasta file (out_homologs_unique_with_R105.fasta, or out_homologs_unique_with_Y473.fasta) of protein sequences with the key catalytic residue.