Content - db1f037a5275bcbc9e431cf4d1bf484f1d29a2d5 - 55d6411/README.md

https://github.com/jferna10/EnvPaper

14 December 2023, 21:42:55 UTC

Tip revision: 152ada7da67b08f3e04ac95b284e45999c90341c authored by jferna10 on 15 July 2021, 07:18:18 UTC
Update README.md

---
Code Related to Functional and Structural Segregation of Overlapping Helices in HIV-1
---

This GitHub repo contains code related to the submitted paper "Functional and Structural Segregation of Overlapping Helices in HIV-1". The files deposited here are intended to make the analyses, as they were done at the time of writing the paper, transparent. However, because of things like renamed files (e.g. the GEO fastq names from GSE179046 differ slightly from the original names) and differences in the compute environment, you probably can't just run this code and get the figures. It should be pretty close, though, and you shouldn't hesitate to contact us if you notice any issues.

These scripts use:

[bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) <br>
[RStudio](https://www.rstudio.com/)<br>
Java<br>
standard bash commands <br>

Here is the overview of what you will find in this repository:


## Stats
Basic QC metrics for the MiSeq run of the Env Deep Mutational Scanning Data. These are run-wide stats such as demultiplexing stats.

## Reports
Basic QC metrics for each of the fastqs for the Env Deep Mutational Scanning Data.

## seq
Reference genome sequence and associated bowtie2 index for mapping. Note this virus is the HIV-1 NL4-3 sequence with rev-in-nef.   

## process_fastqs
Code used to generate codon and amino acid counts.

First, the fastqs are aligned to the reference with bowtie2 using the following additional flags: `--fast-local --rdg 100,3 --rfg 100,3`. The `--rdg` and `--rfg` flags set very high gap penalties for the read and the reference, so the randomized codon aligns to the ref sequence as mismatches instead of introducing indels.
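As a rough sketch, the alignment command could be assembled like this in Java. The three flags are the ones listed above; the class and method names are hypothetical, and the use of `-U` (unpaired reads) along with `-x` and `-S` is an assumption about how the fastqs were fed to bowtie2, not something recorded in this repo.

```java
import java.util.List;

/** Hypothetical helper; the arguments are placeholders, not paths from this repo. */
class Bowtie2CommandSketch {
    /** Builds a bowtie2 argument list using the gap-penalty flags described above. */
    static List<String> alignCommand(String indexPrefix, String fastq, String samOut) {
        return List.of(
            "bowtie2",
            "--fast-local",      // preset named in this README
            "--rdg", "100,3",    // read gap open,extend penalties
            "--rfg", "100,3",    // reference gap open,extend penalties
            "-x", indexPrefix,   // bowtie2 index prefix (assumed location)
            "-U", fastq,         // unpaired reads (assumption)
            "-S", samOut);       // SAM output file
    }
}
```

The returned list can be passed to `new ProcessBuilder(cmd).inheritIO().start()` on a machine where bowtie2 is on the PATH.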

The bulk of the work is done by countDMS, a simple Java program that counts codons from each SAM file generated by bowtie2. If there is an indel in the alignment, the read is not counted.
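The idea can be illustrated with a minimal sketch. This is not the countDMS source: the class name, method names, and the `codonStart` parameter (1-based reference position of the codon's first base) are hypothetical. It assumes a single codon of interest and treats any CIGAR string other than an optional soft clip, one match block, and an optional soft clip as a read to skip.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of indel-free codon counting from SAM lines. */
class CodonCountSketch {
    // Accepts only [soft clip] match [soft clip]; indels and other operations fail to match.
    private static final Pattern GAPLESS =
        Pattern.compile("^(?:(\\d+)S)?(\\d+)M(?:\\d+S)?$");

    /** Returns the read's codon at reference positions codonStart..codonStart+2, or null if skipped. */
    static String codonAt(String samLine, int codonStart) {
        if (samLine.startsWith("@")) return null;          // header line
        String[] f = samLine.split("\t");
        if (f.length < 11) return null;                    // malformed record
        if ((Integer.parseInt(f[1]) & 4) != 0) return null; // FLAG bit 4: unmapped
        Matcher m = GAPLESS.matcher(f[5]);
        if (!m.matches()) return null;                     // indel (or other op): do not count
        int pos = Integer.parseInt(f[3]);                  // 1-based leftmost aligned position
        int clip = m.group(1) == null ? 0 : Integer.parseInt(m.group(1));
        int matchLen = Integer.parseInt(m.group(2));
        int offset = codonStart - pos;                     // offset within the aligned block
        if (offset < 0 || offset + 3 > matchLen) return null; // codon not fully covered
        int start = clip + offset;                         // SEQ includes soft-clipped bases
        return f[9].substring(start, start + 3);
    }

    /** Tallies codons over all SAM lines, ignoring skipped reads. */
    static Map<String, Integer> countCodons(Iterable<String> samLines, int codonStart) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : samLines) {
            String codon = codonAt(line, codonStart);
            if (codon != null) counts.merge(codon, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because SAM stores SEQ in reference orientation, reverse-strand reads need no extra handling in this sketch.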

countDMS outputs codon and amino acid count files in tab-delimited format.
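To show how codon counts relate to amino acid counts, here is a small sketch that translates codons with the standard genetic code and writes tab-delimited rows. The class and method names are hypothetical, and the two-column layout (label, tab, count) is an assumption; the actual columns of the countDMS files may differ.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: collapse codon counts into amino acid counts. */
class AminoAcidCountSketch {
    // Standard genetic code, codons ordered by base index in "TCAG" (TTT, TTC, TTA, ...).
    private static final String BASES = "TCAG";
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Translates one DNA codon; returns 'X' for anything that is not three of A/C/G/T. */
    static char translate(String codon) {
        if (codon.length() != 3) return 'X';
        int index = 0;
        for (char base : codon.toUpperCase().toCharArray()) {
            int i = BASES.indexOf(base);
            if (i < 0) return 'X';       // ambiguous base such as N
            index = index * 4 + i;
        }
        return AMINO.charAt(index);
    }

    /** Sums codon counts by encoded amino acid ('*' marks a stop codon). */
    static Map<Character, Integer> aminoAcidCounts(Map<String, Integer> codonCounts) {
        Map<Character, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : codonCounts.entrySet()) {
            out.merge(translate(e.getKey()), e.getValue(), Integer::sum);
        }
        return out;
    }

    /** Renders counts as "label<TAB>count" lines (assumed layout). */
    static <K> String toTsv(Map<K, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<K, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```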

Note that the BAMs provided differ slightly from the tab files, because the BAMs come from a more recent remapping than the one used for the figures in the paper. The differences are small (the more recent mapping is slightly better, maybe due to an upgraded version of bowtie2).

If you wish to perform a similar analysis and are worried about alignment artifacts, or wish to avoid using the custom countDMS program, I suggest using [seqkit](https://bioinf.shenwei.me/seqkit/usage/) and its amplicon feature to extract the DMS region, then parsing the resulting sequence.


## aa_tab
Amino acid counts generated from countDMS with simple number relabeling to make the coordinates readable.

## codon_tab
Codon counts generated from countDMS. These may be useful if you care about specific codons.



## Fernandes_GEO_seq_template.xlsx

This file should provide the metadata that maps the naming changes between GEO and the filenames in this repo.