https://github.com/vejnar/LabxPipe
Tip revision: 5519892059f56f02c4e2da8490c50f98b08e592b authored by vejnar on 05 April 2023, 16:19:02 UTC
Add samtools_sort step
# <img src="https://raw.githubusercontent.com/vejnar/LabxPipe/main/img/logo.svg" alt="LabxPipe" width="45%" />
[![MPLv2](https://img.shields.io/aur/license/python-labxpipe?color=1793d1&style=for-the-badge)](https://mozilla.org/MPL/2.0/)
* Integrated with [LabxDB](https://labxdb.vejnar.org): all required annotations (labels, strand, paired, etc.) are retrieved from LabxDB. This integration is optional.
* Based on existing robust technologies. No new language.
* LabxPipe pipelines are defined in JSON text files.
* LabxPipe is written in Python. Conventions, such as standard input and output filenames, ensure compatibility between tasks.
* Simple and complex pipelines.
    * By default, pipelines are linear (one step after the other).
    * Branching is easily achieved by defining a previous step (using the `step_input` parameter), allowing users to create any dependency between tasks.
* Parallelized using robust asynchronous threads from the Python standard library.
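
As a rough illustration of linear versus branched pipelines, the sketch below works out which step feeds each step: the previous one by default, or the step named by `step_input` when it is set. It assumes `step_input` holds a step name; the steps and the `resolve_inputs` helper are hypothetical, and this is not LabxPipe's actual implementation.

```python
# Hypothetical steps; only `step_name`, `step_function` and `step_input`
# are parameter names taken from this README.
analysis = [
    {"step_name": "preparing", "step_function": "readknead"},
    {"step_name": "aligning", "step_function": "star"},
    {"step_name": "counting", "step_function": "geneabacus"},
    # Branch: take input from `aligning` rather than from `counting`.
    {"step_name": "plotting", "step_function": "readknead", "step_input": "aligning"},
]


def resolve_inputs(steps):
    """Map each step name to the name of the step providing its input."""
    inputs = {}
    previous = None
    for step in steps:
        # Default is linear: the previous step, unless `step_input` is set.
        inputs[step["step_name"]] = step.get("step_input", previous)
        previous = step["step_name"]
    return inputs


step_inputs = resolve_inputs(analysis)
print(step_inputs)
```

Here `counting` and `plotting` both depend on `aligning`, so they form two branches after the alignment step.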
## Examples
See JSON files in `config/pipelines` of this repository.
| Pipeline JSON file | Description |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `mrna_seq.json` | mRNA-seq |
| `mrna_seq_no_db.json` | mRNA-seq. No [LabxDB](https://labxdb.vejnar.org) |
| `mrna_seq_with_plotting.json` | mRNA-seq. Plotting non-mapped reads. Demonstrates `step_input` |
| `mrna_seq_cufflinks.json` | mRNA-seq. Replaces GeneAbacus by Cufflinks |
| `chip_seq.json` | ChIP-seq using [Bowtie2](https://github.com/BenLangmead/bowtie2) and [Samtools](http://www.htslib.org) to uniquify reads. |
The following demonstrates how to apply the `mrna_seq.json` pipeline. It requires:
* [LabxDB](https://labxdb.vejnar.org)
* FASTQ files for the samples named `AGR000850` and `AGR000912`
```
/plus/data/seq/by_run/AGR000850
├── 23_009_R1.fastq.zst
└── 23_009_R2.fastq.zst
/plus/data/seq/by_run/AGR000912
├── 65_009_R1.fastq.zst
└── 65_009_R2.fastq.zst
```
Note: `mrna_seq_no_db.json` demonstrates how to use LabxPipe *without* LabxDB: it only requires FASTQ files (in `path_seq_run` directory, see above).
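
To make the expected layout concrete, the sketch below lists read pairs for one run, assuming FASTQ files sit in `<path_seq_run>/<run reference>/` and that mates are distinguished by `_R1`/`_R2` before the extension, as in the listing above. The `find_fastq_pairs` helper is hypothetical and not part of LabxPipe.

```python
import tempfile
from pathlib import Path


def find_fastq_pairs(path_seq_run, run_ref, fastq_exts=(".fastq.zst",)):
    """Return (R1, R2) file-name pairs found in the run directory."""
    run_dir = Path(path_seq_run) / run_ref
    pairs = []
    for ext in fastq_exts:
        for r1 in sorted(run_dir.glob(f"*_R1{ext}")):
            # The mate is expected next to R1, with `_R2` in its name.
            r2 = r1.with_name(r1.name.replace(f"_R1{ext}", f"_R2{ext}"))
            pairs.append((r1.name, r2.name if r2.exists() else None))
    return pairs


# Recreate the layout shown above in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "AGR000850"
    run_dir.mkdir()
    for name in ("23_009_R1.fastq.zst", "23_009_R2.fastq.zst"):
        (run_dir / name).touch()
    fastq_pairs = find_fastq_pairs(tmp, "AGR000850")

print(fastq_pairs)
```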
Requirements:
* [LabxDB](https://labxdb.vejnar.org). Alternatively, `mrna_seq_no_db.json` doesn't require LabxDB.
* [ReadKnead](https://sr.ht/~vejnar/ReadKnead) to trim reads.
* [STAR](https://github.com/alexdobin/STAR) and a genome index in the directory defined by `path_star_index`.
* [GeneAbacus](https://sr.ht/~vejnar/GeneAbacus) to count reads and generate genomic profile for tracks.
1. Start pipeline:
```bash
lxpipe run --pipeline mrna_seq.json \
--worker 2 \
--processor 16
```
Output is written to the `path_output` directory.
2. Create report:
```bash
lxpipe report --pipeline mrna_seq.json
```
The report file `mrna_seq.xlsx` should be created in the same directory as `mrna_seq.json`.
3. Merge gene/mRNA counts generated by [GeneAbacus](https://sr.ht/~vejnar/GeneAbacus) in the `counting` directory:
```bash
lxpipe merge-count --pipeline mrna_seq.json \
--step counting
```
4. Trackhub. Requirements:
* [ChromosomeMappings](https://github.com/dpryan79/ChromosomeMappings) file (to map chromosome names from Ensembl/NCBI to UCSC)
* Tabulated file (with chromosome name and length)
Execute in a separate directory:
```bash
lxpipe trackhub --runs AGR000850,AGR000912 \
--species_ucsc danRer11 \
--path_genome /plus/scratch/sai/annots/danrer_genome_all_ensembl_grcz11_ucsc_chroms_chrom_length.tab \
--path_mapping /plus/scratch/sai/annots/ChromosomeMappings/GRCz11_ensembl2UCSC.txt \
--input_sam \
--bam_names accepted_hits.sam.zst \
--make_config \
--make_trackhub \
--make_bigwig \
--processor 16
```
The directory is then ready to be served by a web server for display in the [UCSC genome browser](https://genome.ucsc.edu/cgi-bin/hgHubConnect).
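
The role of the two required files can be pictured with the sketch below, which renames chromosomes in a name/length table using a mapping file. It assumes both files are tab-separated with two columns and that unmapped chromosomes have an empty second column in the mapping; the inline names and lengths are placeholders, and how `lxpipe trackhub` uses these files internally is not described here.

```python
import csv
import io

# Placeholder inputs standing in for the two files (illustrative values only).
mapping_text = "1\tchr1\n2\tchr2\nMT\tchrM\nKN149696.2\t\n"
lengths_text = "1\t1000\n2\t2000\nMT\t300\nKN149696.2\t50\n"


def read_mapping(handle):
    """Read source-name to UCSC-name pairs, skipping entries without a UCSC name."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[1]:
            mapping[row[0]] = row[1]
    return mapping


def convert_lengths(handle, mapping):
    """Return (UCSC name, length) pairs for chromosomes present in the mapping."""
    converted = []
    for row in csv.reader(handle, delimiter="\t"):
        name, length = row[0], int(row[1])
        if name in mapping:
            converted.append((mapping[name], length))
    return converted


mapping = read_mapping(io.StringIO(mapping_text))
ucsc_lengths = convert_lengths(io.StringIO(lengths_text), mapping)
print(ucsc_lengths)
```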
## Configuration
Parameters can be defined [globally](https://labxdb.vejnar.org/doc/install/python/#configuration). See the `config` directory of this repository for examples.
## Writing pipelines
Parameters are defined first globally (see above), then per pipeline, then per replicate/run, and then per step/function. The latest definition takes precedence: `path_seq_run` defined in `/etc/hts/labxpipe.json` is used by default, but if `path_seq_run` is defined in the pipeline file, it will be used instead.
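
The precedence order described above can be sketched with `collections.ChainMap`, which looks a key up in the most specific level first. This only illustrates the lookup order and is not how LabxPipe is necessarily implemented; all values are hypothetical.

```python
from collections import ChainMap

# Hypothetical values; the keys are parameter names from this README.
global_params = {"path_seq_run": "/plus/data/seq/by_run", "logging_level": "info"}
pipeline_params = {"path_seq_run": "/data/project/fastq"}
run_params = {"paired": True}
step_params = {"paired": False}

# ChainMap searches its maps left to right, so the most specific level comes first.
params = ChainMap(step_params, run_params, pipeline_params, global_params)

print(params["path_seq_run"])   # taken from the pipeline level
print(params["paired"])         # taken from the step level
print(params["logging_level"])  # falls back to the global level
```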
Main parameters
| Parameter | Type |
| ------------------ | ------------- |
| name | string |
| path_output | string |
| path_seq_run | string |
| path_annots | string |
| path_bowtie2_index | string |
| path_star_index | string |
| fastq_exts | []strings |
| adaptors | {} |
| logging_level | string |
| run_refs | []strings |
| replicate_refs | []strings |
| ref_info_source | []strings |
| ref_infos | {} |
| analysis | [{}, {}, ...] |
Parameters for all functions
| Parameter | Type |
| ------------- | ------- |
| step_name | string |
| step_function | string |
| step_desc | string |
| force | boolean |
Function-specific parameters
| Function | Synonym | Parameter | Type |
| ------------------ | ---------------- | --------------------- | ------------- |
| readknead | preparing | options | []strings |
| | | ops_r1 | [{}, {}, ...] |
| | | ops_r2 | [{}, {}, ...] |
| | | plot_fastq_in | boolean |
| | | plot_fastq | boolean |
| | | fastq_out | boolean |
| | | zip_fastq_out | string |
| bowtie2 | genomic_aligning | options | []strings |
| | | index | string |
| | | output | string |
| | | output_unfiltered | string |
| | | compress_sam | boolean |
| | | compress_sam_cmd | string |
| | | create_bam | boolean |
| | | index_bam | boolean |
| star | aligning | options | []strings |
| | | index | string |
| | | output_type | []strings |
| | | compress_sam | boolean |
| | | compress_sam_cmd | string |
| | | compress_unmapped | boolean |
| | | compress_unmapped_cmd | string |
| cufflinks | | options | []strings |
| | | inputs | [{}, {}, ...] |
| | | features | [{}, {}, ...] |
| geneabacus | counting | options | []strings |
| | | inputs | [{}, {}, ...] |
| | | features | [{}, {}, ...] |
| uniquify | | options | []strings |
| | | sort_by_name_bam | boolean |
| | | index_bam | boolean |
| cleaning | | steps | [{}, {}, ...] |
Sample-specific parameters. These are automatically populated when using LabxDB, or sourced from `ref_infos` otherwise. They can be overridden manually in any function (for example, setting `paired` to `false` in a step will ignore second reads in that step).
| Parameter | Type |
| -------------- | ------- |
| label_short | string |
| paired | boolean |
| directional | boolean |
| r1_strand | string |
| quality_scores | string |
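
As a minimal sketch of how the parameters tabulated above might fit together, the snippet below assembles a pipeline definition and serializes it to JSON. All values are placeholders, and placing function-specific and sample-specific parameters directly inside each step object is an assumption; the files in `config/pipelines` remain the reference.

```python
import json

# All values are placeholders; parameter names come from the tables above.
pipeline = {
    "name": "example_mrna_seq",
    "path_output": "/tmp/example_output",
    "run_refs": ["AGR000850", "AGR000912"],
    "analysis": [
        {
            "step_name": "preparing",
            "step_function": "readknead",
            "step_desc": "Trim reads",
            "fastq_out": True,
        },
        {
            "step_name": "aligning",
            "step_function": "star",
            "compress_sam": True,
            # Override a sample-specific parameter for this step only.
            "paired": False,
        },
    ],
}

# Python booleans are written as JSON `true`/`false`.
pipeline_json = json.dumps(pipeline, indent=4)
print(pipeline_json)
```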
## Demultiplexing sequencing reads: `lxpipe demultiplex`
* Demultiplex reads based on barcode sequences from the `Second barcode` field in [LabxDB](https://labxdb.vejnar.org)
* Demultiplexing is performed by [ReadKnead](https://sr.ht/~vejnar/ReadKnead). The most important element for demultiplexing is the ReadKnead pipeline. Pipelines are identified using the `Adapter 3'` field in LabxDB.
* Example for simple demultiplexing. The first nucleotides at the 5' end of read 1 are used as barcodes (the `Adapter 3'` field is set to `sRNA 1.5` in LabxDB for these samples) with the following pipeline:
```json
{
"sRNA 1.5": {
"R1": [{"name": "demultiplex",
"end": 5,
"max_mismatch": 1}],
"R2": null
}
}
```
The barcode sequences are added by LabxPipe using the `Second barcode` field in [LabxDB](https://labxdb.vejnar.org).
* Example for iCLIP demultiplexing. In [Vejnar et al.](https://pubmed.ncbi.nlm.nih.gov/31227602), iCLIP is demultiplexed (the `Adapter 3'` field is set to `TruSeq-DMS+A Index` in LabxDB for these samples) using the following pipeline:
```json
{
"TruSeq-DMS+A Index": {
"R1": [{"name": "clip",
"end": 5,
"length": 4,
"add_clipped": true},
{"name": "trim",
"end": 3,
"algo": "bktrim",
"min_sequence": 5,
"keep": ["trim_exact", "trim_align"]},
{"name": "length",
"min_length": 6},
{"name": "demultiplex",
"end": 3,
"max_mismatch": 1,
"length_ligand": 2},
{"name": "length",
"min_length": 15}],
"R2": null
}
}
```
The pipeline is stored in `demux_truseq_dms_a.json`. The barcode sequences are added by LabxPipe using the `Second barcode` field in [LabxDB](https://labxdb.vejnar.org). (NB: the published demultiplexed data were generated using `"algo": "align"` with a minimum score of 80 instead of `"algo": "bktrim"`.)
The pipeline was then tested by running:
```bash
lxpipe demultiplex --bulk HHYLKADXX \
--path_demux_ops demux_truseq_dms_a.json \
--path_seq_prepared prepared \
--demux_nozip \
--processor 1 \
--demux_verbose_level 20 \
--no_readonly
```
This output is **very verbose**: for every read, output from every step of the demultiplexing pipeline is reported. To get consistent output, `--processor` must be set to `1`. Output is written in local directory `prepared`.
Finally, once the pipeline is validated, run it without the test options (data is written to the `path_seq_prepared` directory, see [here](https://labxdb.vejnar.org/doc/install/python/#configuration)):
```bash
lxpipe demultiplex --bulk HHYLKADXX \
--path_demux_ops demux_truseq_dms_a.json \
--processor 10
```
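
To make the `demultiplex` operation with `"end": 5` and `"max_mismatch": 1` more concrete, the sketch below assigns a read to a sample by comparing its first nucleotides with each barcode and tolerating at most one mismatch. It is a simplified illustration, not ReadKnead's implementation: the barcodes, sample names and `assign_sample` helper are hypothetical, and removing the barcode from the read is an assumption.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, barcodes, max_mismatch=1):
    """Return (sample, read without barcode), or (None, read) if no unique match."""
    hits = []
    for sample, barcode in barcodes.items():
        prefix = read[: len(barcode)]
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatch:
            hits.append((sample, read[len(barcode):]))
    # Ambiguous or missing matches are left unassigned.
    return hits[0] if len(hits) == 1 else (None, read)


# Hypothetical barcodes, as would be stored in the `Second barcode` field.
barcodes = {"sample_A": "ACGT", "sample_B": "TGCA"}

print(assign_sample("ACGTGGGTTTAA", barcodes))  # exact match
print(assign_sample("ACGAGGGTTTAA", barcodes))  # one mismatch, still accepted
print(assign_sample("GGGGGGGTTTAA", barcodes))  # no barcode within one mismatch
```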
## License
*LabxPipe* is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).
Copyright © 2013-2023 Charles E. Vejnar