Overview of the Pipeline

To set up the pipeline for a project, perform the following steps:

  1. Set up wildcards for the project (see how).

  2. Select a module according to the experiment (module options).

  3. As required and wherever possible, set up the folder structures for the different programs.

  4. Check and modify the parameters of the programs to be run (in the new_config.yaml file).

    1. (Currently) make sure all required Python and R packages are present.

  5. Change global variables as required (in Snakefile - check).
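The package check in step 4 can be done by hand, but a short script makes it repeatable. The following is a minimal sketch only: the helper name missing_packages and the package list are hypothetical, and the actual list of Python packages the pipeline needs is not given here. It covers Python packages only, not R.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical list for illustration; replace it with the packages your rules import.
required = ["json", "pathlib"]
missing = missing_packages(required)
if missing:
    print("Missing Python packages:", ", ".join(missing))
else:
    print("All listed Python packages are available.")
```

Running this inside the environment used for the pipeline shows which of the listed packages still need to be installed.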

Wildcard Processing

To create wildcards, the pipeline needs a list of the samples to be processed. There are three ways to provide it:

  • A list of samples/pools (as a folder structure)

  • A YAML file containing the list of samples/pools

  • A directory containing input files
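As an illustration of the third option, sample names can be derived from the file names in an input directory. This is a sketch under stated assumptions, not the pipeline's own input-processing code: the function name samples_from_directory and the "_R1.fastq.gz" file suffix are hypothetical and should be adapted to the actual naming scheme.

```python
from pathlib import Path


def samples_from_directory(input_dir, suffix="_R1.fastq.gz"):
    """Collect sorted sample names from files ending in `suffix`.

    The suffix is an assumed naming scheme, not one defined by the pipeline.
    """
    return sorted(
        p.name[: -len(suffix)]
        for p in Path(input_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

The returned list could then be used as the values of a sample wildcard, for example in an expand() call in a Snakefile.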

List of samples

The pipeline provides many built-in combinations of the aforementioned programs, each of which can be run using a specific keyword.

Selectable Modules

The following combinations of programs can be run:

  • all
  • starsolo
  • starsolo_rnaseqmet
  • starsolo_gcbiasmet
  • starsolo_kb_solo
  • starsolo_picard
  • starsolo_gt_demux
  • starsolo_split_bams
  • starsolo_split_bams_gt_demux
  • starsolo_split_bams_gt_demux_multi_vcf
  • starsolo_gt_demux_multi_vcf
  • starsolo_cellsnp
  • starsolo_rnaseqmet_kb_solo
  • starsolo_gcbiasmet_kb_solo
  • starsolo_gt_demux_identify_swaps
  • starsolo_resolve_swaps_gt_demux

The keywords in these module names have the following meanings:

  • starsolo: STARsolo
  • rnaseqmet and gcbiasmet: PICARD’s CollectRnaSeqMetrics and CollectGcBiasMetrics, respectively
  • picard: both of the PICARD programs above
  • kb_solo: kallisto, bustools and calico_solo for demultiplexing
  • gt_demux: cellSNP and vireoSNP for genotype-based demultiplexing
  • split_bams: splitting pooled/multiplexed BAMs using hashsolo’s outputs
  • split_bams_gt_demux: splitting pooled/multiplexed BAMs using vireo’s output
  • identify_swaps: qtltools_mbv
  • multi_vcf: providing multiple runs (i.e. multiple sets of VCF inputs) for the same sample
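The keyword meanings described above can be used to read a module name as a sequence of components. The sketch below is illustrative only: the function split_module and the longest-match parsing rule are assumptions, not how the pipeline itself resolves module names. Keywords that the text does not describe (for example all, cellsnp or resolve_swaps) are deliberately rejected.

```python
# Keyword meanings as described in the text; the parsing scheme itself is an assumption.
KEYWORDS = {
    "starsolo": "STARsolo",
    "rnaseqmet": "PICARD CollectRnaSeqMetrics",
    "gcbiasmet": "PICARD CollectGcBiasMetrics",
    "picard": "both PICARD metrics programs",
    "kb_solo": "kallisto, bustools and calico_solo demultiplexing",
    "gt_demux": "cellSNP and vireoSNP genotype-based demultiplexing",
    "split_bams": "split pooled BAMs using hashsolo outputs",
    "split_bams_gt_demux": "split pooled BAMs using vireo output",
    "identify_swaps": "qtltools_mbv",
    "multi_vcf": "multiple sets of VCF inputs per sample",
}


def split_module(name):
    """Split a module name into known keywords, preferring the longest match."""
    parts = []
    rest = name
    while rest:
        candidates = [k for k in KEYWORDS if rest == k or rest.startswith(k + "_")]
        if not candidates:
            raise ValueError(f"no described keyword at: {rest!r}")
        match = max(candidates, key=len)
        parts.append(match)
        # Drop the matched keyword and the underscore that follows it, if any.
        rest = rest[len(match) + 1:]
    return parts


for keyword in split_module("starsolo_split_bams_gt_demux_multi_vcf"):
    print(f"{keyword}: {KEYWORDS[keyword]}")
```

Longest-match is used so that split_bams_gt_demux is read as a single keyword rather than as split_bams followed by gt_demux, which matches the description in the text.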

Module description

Modules_info

Module Name                             Module Info                              Sub Workflows Involved
--------------------------------------  ---------------------------------------  ----------------------
all                                     module_info (more desc in its own file)  sub_wkfl
all_multi_vcf                           module_info (more desc in its own file)  sub_wkfl
starsolo                                module_info (more desc in its own file)  sub_wkfl
starsolo_kb_solo                        module_info (more desc in its own file)  sub_wkfl
starsolo_gt_demux                       module_info (more desc in its own file)  sub_wkfl
starsolo_split_bams                     module_info (more desc in its own file)  sub_wkfl
starsolo_split_bams_gt_demux            module_info (more desc in its own file)  sub_wkfl
starsolo_split_bams_gt_demux_multi_vcf  module_info (more desc in its own file)  sub_wkfl
starsolo_gt_demux_multi_vcf             module_info (more desc in its own file)  sub_wkfl
starsolo_cellsnp                        module_info (more desc in its own file)  sub_wkfl
starsolo_gt_demux_identify_swaps        module_info (more desc in its own file)  sub_wkfl
starsolo_resolve_swaps_gt_demux         module_info (more desc in its own file)  sub_wkfl

Sub-Snakemake workflows

This pipeline divides each module into self-contained individual workflows. These are:

Sub_Snakemake_workflows_table

Name of Workflow           Description
-------------------------  -----------
resources.snkmk            It contains memory (in MB per thread) and time requirements (in minutes) for each rule.
calico_solo_demux.snkmk    It contains the hashsolo rule.
split_bams.snkmk [1]       It contains rules needed to split pooled BAMs into individual BAMs, depending on the output produced by either hashsolo or vireoSNP, using custom scripts.
input_processing.snkmk     It contains rules that collect values for all the wildcards.
STARsolo.snkmk             It contains rules for STARsolo.
produce_targets.snkmk      It contains the rule all and the functions it needs.
snv_aware_align.snkmk [2]  This might be removed soon.
kite.snkmk                 It contains rules for the kite workflow.
picard_metrics.snkmk       It contains rules for all PICARD metrics (GCBiasMetrics and RNAseqMetrics).
pheno_demux3.snkmk         It contains rules for the cellSNP-vireoSNP pipeline.
split_bams_gt.snkmk [1]    It contains rules needed to split pooled BAMs into individual BAMs, depending on the output produced by vireo.
demultiplex_no_argp.snkmk  It contains rules for demultiplexing using hashsolo and/or vireoSNP output and creating a count matrix file.
identify_swaps.snkmk       It contains rules for identifying swaps using QTLtools-mbv.
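The description of resources.snkmk above gives memory per thread, so the total memory a job requests scales with its thread count. The sketch below illustrates that arithmetic only. The rule name, the numbers, and the names RESOURCES and job_request are all hypothetical; how resources.snkmk actually stores these values is not shown here.

```python
# Hypothetical per-rule resource table; the rule name and values are illustrative only.
RESOURCES = {
    "example_rule": {"mem_mb_per_thread": 4000, "time_min": 120},
}


def job_request(rule, threads):
    """Return the total memory (MB), time limit (minutes) and threads for one job."""
    res = RESOURCES[rule]
    return {
        "mem_mb": res["mem_mb_per_thread"] * threads,  # per-thread memory times threads
        "time_min": res["time_min"],
        "threads": threads,
    }


print(job_request("example_rule", threads=4))
```

With these illustrative numbers, a 4-thread job would request 16000 MB in total and a 120-minute time limit.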

demultiplex_helper_funcs.ret_htos_calico_solo

Return HTO information and classification for each cell barcode.


[1] Consolidating into one (applies to split_bams.snkmk and split_bams_gt.snkmk).

[2] Not yet implemented.