https://github.com/jferna10/EnvPaper
Revision 95bd54a60d0b36c97fbe88b5fccee0330dc998f0 authored by jferna10 on 15 July 2021, 06:14:03 UTC, committed by GitHub on 15 July 2021, 06:14:03 UTC
---
Code Related to Functional and Structural Segregation of Overlapping Helices in HIV-1
---

This GitHub repo contains code related to the submitted paper "Functional and Structural Segregation of Overlapping Helices in HIV-1". The files deposited here are intended to make the analyses, as they were done at the time of writing the paper, transparent. However, because of things like renamed files (e.g. the GEO fastq names from GSE179046 differ slightly from the original names) and differences in the compute environment, you probably can't just run this code and get the figures. It should be pretty close, though, and you shouldn't hesitate to contact us if you notice any issues.

These scripts use:

- bowtie2
- RStudio
- Java
- standard bash commands

Here is the overview of what you will find in this repository:


## Stats
Basic QC metrics for the MiSeq run of the Env Deep Mutational Scanning Data. These are run-wide stats, such as demultiplexing stats.


## Reports
Basic QC metrics for each of the fastqs for the Env Deep Mutational Scanning Data.

## seq
Reference genome sequence and associated bowtie2 index for mapping.

## process_fastqs
Code used to generate codon and amino acid counts.

First, the fastqs are aligned to the reference with bowtie2 using the following additional flags: --fast-local --rdg 100,3 --rfg 100,3. These flags raise the gap-open penalties for both the read and the reference, so the randomized codon aligns to the reference sequence as mismatches instead of introducing indels.

The bulk of the work is done by countDMS, a simple Java program which attempts to count codons from each SAM file generated by bowtie2. If there is an indel in the alignment, the read is not counted.
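To illustrate the counting rule described above, here is a minimal sketch of how a codon could be pulled from a single SAM record while skipping reads whose alignment contains an indel. This is not the countDMS source: the class name `CodonCountSketch`, the methods `codonAt` and `countCodons`, and the example reads are all hypothetical, and the sketch assumes a single reference sequence with a known 1-based codon start position.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the counting rule; NOT the actual countDMS code.
class CodonCountSketch {
    // One CIGAR operation: a length followed by an operation letter.
    private static final Pattern CIGAR_OP = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /**
     * Returns the three read bases aligned to reference positions
     * codonStart..codonStart+2 (1-based), or null if the read is unmapped,
     * contains an indel, or does not cover the whole codon.
     */
    static String codonAt(String samLine, int codonStart) {
        if (samLine.startsWith("@")) return null;            // SAM header line
        String[] f = samLine.split("\t");
        if (f.length < 11) return null;                      // malformed record
        if ((Integer.parseInt(f[1]) & 4) != 0) return null;  // FLAG 0x4: unmapped
        int pos = Integer.parseInt(f[3]);                    // 1-based leftmost aligned base
        String cigar = f[5];
        String seq = f[9];

        int leadingClip = 0;  // soft-clipped bases before the aligned part
        int alignedLen = 0;   // bases aligned as match/mismatch
        Matcher m = CIGAR_OP.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            if (op == 'M' || op == '=' || op == 'X') {
                alignedLen += len;
            } else if (op == 'S') {
                if (alignedLen == 0) leadingClip += len;     // local alignment clip
            } else if (op != 'H') {
                return null;                                 // I, D, N or P: skip the read
            }
        }

        int offset = codonStart - pos;                       // offset within the aligned part
        if (offset < 0 || offset + 3 > alignedLen) return null;
        int start = leadingClip + offset;
        if (start + 3 > seq.length()) return null;           // e.g. SEQ stored as "*"
        return seq.substring(start, start + 3);
    }

    /** Tallies the codon observed at codonStart over all usable reads. */
    static Map<String, Integer> countCodons(Iterable<String> samLines, int codonStart) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : samLines) {
            String codon = codonAt(line, codonStart);
            if (codon != null) counts.merge(codon, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> reads = List.of(
            "r1\t0\tenv\t1\t42\t9M\t*\t0\t0\tATGGCTAAA\tIIIIIIIII",
            "r2\t0\tenv\t4\t42\t2S6M\t*\t0\t0\tNNGCTAAA\tIIIIIIII",
            "r3\t0\tenv\t1\t42\t4M1I4M\t*\t0\t0\tATGGACTAA\tIIIIIIIII"); // indel: ignored
        System.out.println(countCodons(reads, 4)); // prints {GCT=2}
    }
}
```

Because the bowtie2 flags above use local alignment, the sketch accounts for leading soft clips before indexing into the read sequence; whether countDMS handles clipping the same way is an assumption.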

The outputs of countDMS are codon and amino acid count files in tab-delimited format.
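As a rough illustration of how codon counts relate to amino acid counts, the sketch below translates codons with the standard genetic code, sums the counts per amino acid, and writes tab-delimited rows. The class `AminoAcidCountSketch`, its methods, and the position/residue/count column layout are assumptions made for this example; the actual column layout of the countDMS files may differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the real countDMS output layout may differ.
class AminoAcidCountSketch {
    // Standard genetic code, codons ordered by base T, C, A, G at each position.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*W" + "LLLLPPPPHHQQRRRR" + "IIIMTTTTNNKKSSRR" + "VVVVAAAADDEEGGGG";

    /** Translates one codon; returns 'X' if it is not three unambiguous bases. */
    static char translate(String codon) {
        if (codon.length() != 3) return 'X';
        int index = 0;
        for (int i = 0; i < 3; i++) {
            int b = BASES.indexOf(Character.toUpperCase(codon.charAt(i)));
            if (b < 0) return 'X';   // ambiguous base such as N
            index = index * 4 + b;
        }
        return AMINO_ACIDS.charAt(index);
    }

    /** Sums codon counts per encoded amino acid ('*' marks a stop codon). */
    static Map<Character, Integer> toAminoAcidCounts(Map<String, Integer> codonCounts) {
        Map<Character, Integer> aaCounts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : codonCounts.entrySet()) {
            aaCounts.merge(translate(e.getKey()), e.getValue(), Integer::sum);
        }
        return aaCounts;
    }

    /** Writes one "position<TAB>key<TAB>count" row per entry (assumed layout). */
    static <K> String toTabDelimited(int position, Map<K, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<K, Integer> e : counts.entrySet()) {
            sb.append(position).append('\t')
              .append(e.getKey()).append('\t')
              .append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> codons = new TreeMap<>(Map.of("GCT", 2, "GCC", 1, "ATG", 3, "TAA", 1));
        System.out.print(toTabDelimited(4, codons));                    // codon rows
        System.out.print(toTabDelimited(4, toAminoAcidCounts(codons))); // amino acid rows
    }
}
```

Synonymous codons (here GCT and GCC, both alanine) collapse into a single amino acid row, which is why the codon files remain useful when specific codons matter.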

Note that the BAMs provided differ slightly from the tab files, because the BAMs come from a more recent remapping than the one behind the figure in the paper. The differences are small (the more recent mapping is slightly better, maybe due to an upgraded version of bowtie2).


## aa_tab
Amino acid counts generated from countDMS with simple number relabeling to make the coordinates readable.

## codon_tab
Codon counts generated from countDMS. These may be useful if you care about specific codons.



## Fernandes_GEO_seq_template.xlsx

This file should provide the metadata needed to map between the GEO file names and the file names used in this repo.