Project Aim
In the meta-omics special interest group, we started a discussion about having a standardized, general subworkflow for read quality control (QC) in nf-core. Currently, different nf-core pipelines (and pipelines outside nf-core) use their own implementations of essentially the same QC steps: statistics, adapter clipping, paired-end merging, UMI detection, deduplication, etc.
For example, some modules are shared across many pipelines and are widely used (e.g., fastqc in 73 pipelines, fastp in 22 pipelines), while others are used in fewer pipelines (e.g., trimgalore in 15 pipelines, umitools_extract in 10 pipelines, adapterremoval in 2 pipelines).
The proposed project envisions a single Illumina short-read preprocessing subworkflow that is installed and behaves consistently across all nf-core pipelines using short-read DNA data. At the same time, the subworkflow will give users the freedom to pick their preferred tool from a pool of nf-core QC modules or to skip some of the QC steps entirely.
It will also serve as a template for a potential equivalent long-read QC subworkflow.
More information can be found on this link.
Anyone working with Illumina short-read data, particularly within meta-omics (metagenomics, metataxonomics, metatranscriptomics, metaproteomics, metabolomics), is welcome to join!
Goals
- Push any missing nf-core modules (e.g., seqfu_check).
- Create the required sub-subworkflows (e.g., shortread_adapter_removal, shortread_host_removal) and push them to nf-core.
- (minimal success) Create a minimal short-read QC subworkflow that does the following: (i) initial QC + stats, (ii) barcoding, (iii) adapter removal + run merge, (iv) complexity filtering, (v) deduplication, (vi) host removal, and (vii) final QC, with the minimum required tools (1 module per task).
- (success) Create an extended version of the subworkflow with all available QC tools for short reads; multiple module options per task to choose from + skip parameters.
- (bonus) Initialize a similar subworkflow but for long reads QC.
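
To make the "one tool per task, with skip options" idea concrete, here is a minimal Python sketch of the selection logic only. It is not nf-core or Nextflow code. The step names mirror the seven tasks listed above; the names `STEPS`, `TOOL_POOL`, and `plan_qc` are hypothetical, and the assignment of tools to steps is an assumption made for illustration (only tools named in this document are used, and steps for which no tool is named here are left without options).

```python
# Illustrative sketch only: all names below are hypothetical, not an nf-core API.

# Fixed execution order, following the seven tasks in the goals list.
STEPS = [
    "initial_qc",
    "barcoding",
    "adapter_removal_merge",
    "complexity_filtering",
    "deduplication",
    "host_removal",
    "final_qc",
]

# Assumed tool pools per step (the mapping of tools to steps is an assumption).
TOOL_POOL = {
    "initial_qc": ["fastqc"],
    "adapter_removal_merge": ["fastp", "trimgalore", "adapterremoval"],
    "final_qc": ["fastqc"],
}


def plan_qc(choices, pool=TOOL_POOL):
    """Return the ordered (step, tool) pairs to run.

    `choices` maps a step name to the chosen tool; a step that is missing
    from `choices` or mapped to None is skipped.
    """
    unknown = set(choices) - set(STEPS)
    if unknown:
        raise ValueError(f"unknown QC step(s): {sorted(unknown)}")

    plan = []
    for step in STEPS:  # iterate in the fixed order, not the order of `choices`
        tool = choices.get(step)
        if tool is None:
            continue  # step skipped by the user
        if tool not in pool.get(step, []):
            raise ValueError(f"{tool!r} is not an available option for step {step!r}")
        plan.append((step, tool))
    return plan


if __name__ == "__main__":
    # Pick fastp for adapter removal, keep both QC reports, skip everything else.
    for step, tool in plan_qc(
        {"final_qc": "fastqc", "initial_qc": "fastqc", "adapter_removal_merge": "fastp"}
    ):
        print(step, tool)
```

In the minimal version of the subworkflow each pool would hold a single tool; the extended version would grow the pools while keeping the same fixed step order and skip behaviour.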