# GenBank ingestion for COVID-CG

This workflow downloads data from GenBank, splits it into files by submission date, and cleans the metadata provided by GenBank. It also assigns a lineage to each sequence with [pangolin](https://github.com/cov-lineages/pangolin).

## Running

Include `--use-conda` in the `snakemake` call to automatically install `pangolin` and its dependencies. For example:

```
snakemake --cores 8 --config data_folder=../data_genbank --use-conda
```

The environment file for `pangolin` is located at `envs/pangolin.yaml`.

## Configuration

All configuration options and their descriptions are available in the `config/config_genbank.yaml` file. This includes the metadata columns (`metadata_cols`) and sequence groupings (`group_cols`) specific to this pipeline.

## Metadata Requirements

The following fields are **required** by the downstream `workflow_main`:

- `Accession ID`
- `submission_date`
- `collection_date`
- `region`
- `country`
- `division`
- `location`

Sequences without `submission_date`, `collection_date`, or `region` are filtered out.

For the more granular location fields (`country`, `division`, `location`), missing or undetermined values are replaced by the integer `-1`. This simplifies groupby-aggregate operations downstream.
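
The filtering and fill rules above can be illustrated with a minimal Python sketch. It is not the workflow's actual cleaning code: the helper names (`is_missing`, `clean_records`), the choice to treat empty strings and `None` as "missing", and the sample records with their accession IDs are all assumptions made for illustration.

```python
from collections import Counter

# Fields whose absence causes a sequence to be dropped
REQUIRED = ("submission_date", "collection_date", "region")
# Finer-grained location fields whose gaps are filled with -1
GRANULAR = ("country", "division", "location")


def is_missing(value):
    """Assumption: None and empty/whitespace-only strings count as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def clean_records(records):
    """Drop records lacking a required field; fill granular location gaps with -1."""
    cleaned = []
    for record in records:
        if any(is_missing(record.get(col)) for col in REQUIRED):
            continue  # sequence is filtered out
        row = dict(record)  # copy so the input is left untouched
        for col in GRANULAR:
            if is_missing(row.get(col)):
                row[col] = -1
        cleaned.append(row)
    return cleaned


# Hypothetical records, not real GenBank entries
records = [
    {"Accession ID": "A1", "submission_date": "2020-03-01",
     "collection_date": "2020-02-20", "region": "North America",
     "country": "USA", "division": "", "location": None},
    {"Accession ID": "A2", "submission_date": "2020-03-02",
     "collection_date": "", "region": "Europe",
     "country": "Italy", "division": "Lombardy", "location": "Milan"},
]

cleaned = clean_records(records)
# Grouping works without special-casing gaps, because -1 is an ordinary key
counts = Counter((row["country"], row["division"]) for row in cleaned)
```

Here the second record is dropped because its `collection_date` is empty, while the first is kept with `division` and `location` set to `-1`. Using a single sentinel value means rows with unknown locations still land in a well-defined group instead of being silently excluded from an aggregation.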

## Acknowledgements

The GenBank download code is derived from the ncov-ingest tool ([https://github.com/nextstrain/ncov-ingest](https://github.com/nextstrain/ncov-ingest)) from the [Nextstrain](https://nextstrain.org/) team. The license for this code can be found in the `LICENSE_NEXTSTRAIN` file.

The `pangolin` lineage assignment tool is hosted on GitHub: [https://github.com/cov-lineages/pangolin](https://github.com/cov-lineages/pangolin).