Revision db3ca8faeecf8697973f803bc05c5a3d0a187145 authored by Arjun Biddanda on 04 June 2020, 18:55:40 UTC, committed by Arjun Biddanda on 04 June 2020, 18:55:40 UTC
added in bugfix of italicization and factor of two
1 parent 2d4b8cc
# geodist_rep_paper
Repository to replicate results from the GeoDist paper. 

## Cloning from GitHub

To bring the repository to your local computer, please use `git clone` as follows:

```
git clone https://github.com/aabiddanda/geodist_rep_paper.git
cd geodist_rep_paper
```

## Installation Requirements

We have set up an [Anaconda](https://www.anaconda.com/distribution/) environment to ensure accurate replication of results and management of dependencies. We suggest using it with miniconda. You can create and activate the environment by running:

```
conda env create -f config/env_geodist.yml
conda activate geodist
```

## Working from intermediate data

The pipeline we have written uses the popular workflow management system, [snakemake](https://snakemake.readthedocs.io/en/stable/). We refer users to its documentation to understand the various rules and dependencies. Generating the "Geographic distribution Codes" for the entire NYGC 1000 Genomes hg38 dataset takes ~40 minutes because it iterates over all ~92 million variants. If you are interested in using the same allele frequency binning that we have, we highly suggest downloading a pre-computed dataset as shown below:

```
snakemake download_minimal_data --cores 1 
```

If you are interested in generating the geodist codes from scratch, remove the `data/geodist` subdirectory and then run the command in the following section to regenerate all plots. Be warned that this can take a considerable amount of time and is best done on an HPC cluster (it has only been tested on Linux).

## Generating main plots

With the `geodist` conda environment activated, you can preview the jobs needed to recreate the main plots by running:

```
snakemake gen_all_plots --cores 1 --dryrun
```

Remove the `--dryrun` flag to actually run the pipeline. Once it finishes, the major figures should appear in the `plots` directory as PDFs. Note that these differ somewhat from the versions in the manuscript because they have not been annotated.
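To confirm that a run produced figures, the PDFs can be listed programmatically. The following is a minimal sketch using only the Python standard library; the helper name `list_plot_pdfs` is hypothetical and not part of the pipeline, and it assumes the script is run from the repository root so that `plots` resolves correctly.

```python
from pathlib import Path


def list_plot_pdfs(plots_dir="plots"):
    """Return all PDF files found under plots_dir (searched recursively), sorted by path."""
    return sorted(Path(plots_dir).rglob("*.pdf"))


if __name__ == "__main__":
    # Print one generated figure per line; prints nothing if the pipeline has not run yet.
    for pdf in list_plot_pdfs():
        print(pdf)
```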


## File Descriptions

### Frequency Files 

The gzipped frequency files are tab-separated files with the following columns:

  * `CHR` : chromosome
  * `SNP` : position
  * `A1` : major allele
  * `A2` : minor allele (globally)
  * `MAC` : global minor allele count
  * `MAF` : global minor allele frequency 

Then the subsequent columns represent the frequency of the globally minor allele (A2) across the defined populations. You can find these in `data/freq` for our minimal dataset.
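As an illustration of this layout, here is a minimal sketch of a reader that separates the six fixed columns from the per-population frequency columns. It uses only the Python standard library. It assumes that the first line of each file is a header naming every column and that every per-population value parses as a number; neither is guaranteed by this README, and the function name `read_freq_file` is hypothetical.

```python
import csv
import gzip

# Fixed leading columns described above; every other column is treated as a
# per-population frequency of the globally minor allele (A2).
FIXED_COLUMNS = ["CHR", "SNP", "A1", "A2", "MAC", "MAF"]


def read_freq_file(path):
    """Yield (variant_info, population_freqs) for each row of a gzipped frequency file.

    Assumption: the first line is a tab-separated header naming all columns.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            info = {name: row[name] for name in FIXED_COLUMNS}
            # Assumption: population frequencies are plain numeric strings.
            freqs = {
                name: float(value)
                for name, value in row.items()
                if name not in FIXED_COLUMNS
            }
            yield info, freqs
```

Because the function is a generator, rows are processed one at a time, which keeps memory use low for files with many variants.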

### GeoDist Files 

The gzipped "GeoDist" files contain the relevant frequency information for each variant as well as the geographic distribution "Codes" that we use in the manuscript. They have the following fields:

  * `CHR` : chromosome
  * `SNP` : position
  * `A1` : major allele
  * `A2` : minor allele (globally)
  * `MAC` : global minor allele count
  * `MAF` : global minor allele frequency 
  * `ID` : geographic distribution code (length refers to the number of populations) 

Each integer in the code corresponds to the "frequency bin" that the variant falls into within that population. For further detail on the particular scheme used to bin variants, please see our paper:

TBD
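To make the `ID` field concrete, the sketch below splits a code into one integer bin per population. It assumes each population's bin is encoded as a single digit character, which this README implies but does not state outright. The function names and the population labels are hypothetical. When parsing the files, keep `ID` as a string, since reading it as an integer would drop any leading zeros.

```python
def decode_geodist_code(code):
    """Split a geodist code string into a list of integer frequency bins.

    Assumption: one ASCII digit per population, in population order.
    """
    if not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a non-empty string of digits, got {code!r}")
    return [int(ch) for ch in code]


def label_bins(code, populations):
    """Pair each bin in the code with the matching population label."""
    bins = decode_geodist_code(code)
    if len(bins) != len(populations):
        raise ValueError("code length must equal the number of populations")
    return dict(zip(populations, bins))


if __name__ == "__main__":
    # Hypothetical population labels, for illustration only.
    print(label_bins("0012", ["POP1", "POP2", "POP3", "POP4"]))
```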

## Questions

For any questions on this pipeline please either raise an issue or email Arjun Biddanda <abiddanda[at]uchicago.edu>.