Raw File
This repository contains scripts for the following publication:
Navarro-Domínguez, B., Chang, C-H, Brand, C, Muirhead, C., Presgraves, D.C., and A.M. Larracuente. 2021 Epistatic selection on a selfish Segregation Distorter supergene: drive, recombination, and genetic load. BioRxiv 12/23/21 https://doi.org/10.1101/2021.12.22.473781. eLife 2022;11:e78981 DOI: 10.7554/eLife.78981


# SNP calling and annotation

## GATK
Depends: samtools, picard, bwa mem and GATK
Create a list of readfile prefixes with your sample IDs to give it on the command line: 

e.g. reads.txt
sample1
sample2

where in the same directory you have sample1_1.fastq and sample1_2.fastq

```
$ sh GATK.bestpract.pipeline.AML.R1.sh reads.txt reference.fasta
```

## SNPeff

Dependencies: SNPeff
```
$ snpeff_snpsift.sh path_to_SNPeff input.vcf out_dir
```

## Keep only biallelic & polymorphic SNPs

 Requires: bcftools, samtools


```
$  select_var_bial.sh input.vcf output.vcf
```

## Filter heterozygous calls

Requires: GATK

```
$ filterHetCalls.sh input.vcf.gz
```

## Intersect SNPs 

 Requires: bcftools. 
 VCF files must be tabix indexed (`tabix -f file.vcf.gz`). 

```
$ intersect_snps.sh file1.vcf.gz file2.vcf.gz file3.vcf.gz 
```

# Population genomics

Simplify VCF (for each sample)

```
$ SimplifyVCF_Basic.readFiltfromVCF.pl population_1.simple.vcf
$ SimplifyVCF_Basic.readFiltfromVCF.pl population_2.simple.vcf
$ SimplifyVCF_Basic.readFiltfromVCF.pl population_3.simple.vcf
```

Run population genomics analyses

```
$ Windows_Basic.pl sample_vcf_list sample_name_list gff_file min_TajD_depth min_Fst_depth max_depth window_size outfile
```
1. sample_vcf_list: comma-separated list of (simplified) vcfs for each sample
2. sample_name_list: comma-separated list of names for each sample
3. gff_file: gff-formatted list of masked sites; excluded *completely* from analysis (if nothing needs to be masked, give an empty text file instead)
4. min_tajD_depth:  minimum sample depth required to include a variant in Tajima's D calculations (required for focal sample)
5. min_Fst_depth minimum sample depth required to include a site in Fst calculations (required for all samples)
6. max_depth: maximum possible sample depth across all samples
7. window_size: size (in bp) of non-overlapping windows to consider
8. outfile: name of the output file


Example:

```
$ Windows_Basic.v2.2.pl population_1.simple.vcf,population_2.simple.vcf,population_3.simple.vcf pop1,pop2,pop3 mask_sites.gff 8 8 1000 10000 out.txt
```

# Simulations

## ABC (infer time of the sweep)

Depends: ms (http://home.uchicago.edu/rhudson1/source/mksamples.html)

Script to generate *m* posterior samples under a sweep model (absolute bottleneck of N=1 at a time t 4N<sub>e</sub> generations). 

We simulated with values of S <sub>sim</sub> drawn from a uniform distribution ±5% of S <sub>obs</sub> .We considered a prior uniform distribution of time of the sweep (t) ranging from 0 to 4N e generations. 

The rejection sampling algorithm is as follows: 
1.  Draw S <sub>sim</sub and t from prior distributions; 
2. Simulate 1000 samples using the coalescent under a selective sweep model
3. Calculate average summary statistics for drawn S <sub>sim</sub> and t
4. Accept or reject chosen parameter values conditional on |π <sub>obs</sub> − π <sub>sim</sub> | ≤ ε, |D <sub>obs</sub> − D <sub>sim</sub> | ≤ ε;
5. Return to step 1 and continue simulations until m desired samples from the joint posterior probability distribution are collected.

```
$ run_ms.py m out_prefix sumstats

```
1. m: Number of posterior samples
2. Out prefix
3. sumstats: file that contains π <sub>obs</sub>, S <sub>obs</sub> and D <sub>obs</sub> (separated by spaces)

## Model fitting

Generate m samples under three models:
- a sweep (absolute bottleneck of N = 1) at a time t
- constant population size
- exponential growth

```
$ run_ms.model_fitting.py Nsims OutPrefix ObsS tsweep alpha
```

1. Nsims: number of simulations
2. Out prefix
3. ObsS: observed S
4. tsweep: time of the sweep, estimated by abc
5. Alpha: exponential growth factor

# TE calling

## Generate TSV file for McClinctock

```
$ convert_gff_and_tsv_uniqIDs.sh gff fasta
```

1. gff: from RepeatMasker output
2. TE.fasta: fasta file with TE sequences

## Run McClinctock

```
$ run_ngstemapper_telocate_temp.sh
$ run_ngstemapper_temp.sh 
$ run_popoolation.sh
$ run_retroseq.sh
```
## Merge close TE insertions

Depends: [TE_insertion_merger.py](https://github.com/KamilSJaron/reproductive_mode_TE_dynamics/blob/master/empirics/TE_insertion_merger.py) from KamilSJaron

Concatenate bed files from McClinctock and merge overlapping evidence within a distance smaller than *d* bp

```
$ cat *.bed > all_te_calls.bed

## Format as a table for TE_insertion_merger.py
$ get_tab_from_bed.sh all_te_calls.bed

$ TE_insertion_merger.py all_te_calls.tab3  d
```

# Detect inversions from Illumina data

Read mapping (repeats in the reference must be masked)

```
$ run_bwa_mem.sh sample_name reference out_prefix
```

Detect In(2L)t, In(2R)NS, In(2R)Mal and Sd-RanGAP in the bams. 

```
$ detect_inversions.sh bamfile
$ detect_Sd.sh bamfile
$ Rscript SD_screening.R
```


# Recombination

## Linkage disequilibrium decay with distance

```
$ run_plink.sh vcf_file outdir chr coord_start coord_end thin-count(=0 for no thinning)
```
1. vcf_file with only only non-singletons (mac>1) and biallelic SNPs
2. outdir: output directory
3. chr: chromosome
4. coord_start: start coordinate
5. coord_end: end coordinate
6. thin-count: number of SNPs to reduce data, use 0 for no thinning (use all SNPs)

## Generate input for RecMin from VCF

```
$ SimplifyVCF_Basic.readFiltfromVCF.pl sample.vcf
$ Rscript recmin.R sample.simple.vcf
```

## Runs of shared and private SNPs

```
$ Rscript shared-priv_distribution.R private_sites.txt shared_sites.txt
```
back to top