Reference

scprocess setup

Description: Download all data required for scprocess and index reference transcriptomes for simpleaf.

Parameters:

configuration file

The command requires a configuration file named scprocess_setup.yaml located in the scprocess data directory (for instructions on how to set up the scprocess data directory, see the Getting started section). In this file, the user can specify parameters that are used across all scprocess projects, such as the HPC configuration and the reference genomes that will be made available for scprocess. For example:

user:
  profile:        slurm_default # define local_cores instead if running locally
  your_name:      Testy McUser
  affiliation:    Unemployed
  int_use_gpu:    false
arvados:
  arv_instance:   instance_name
ref_txomes:
  tenx:
    - name:       human_2024
      decoys:     true
      rrnas:      true
  custom:
    - name:       custom_genome_name
      fasta:      /path/to/genome.fa
      gtf:        /path/to/genes.gtf
      decoys:     true
      mito_str:   "^mt-"
    - name:       custom_genome_name2
      index_dir:  /path/to/prebuild/alevin/index
      gtf:        /path/to/genes.gtf
      mito_str:   "^MT-"
user
  • profile: the name of the HPC profile to be used by Snakemake. Must correspond to the name of one of the subfolders in the profiles folder. This subfolder must contain a file called config.yaml. Exactly one of profile and local_cores should be specified.
  • local_cores: number of CPU cores available for local execution (see Snakemake documentation for more details). Exactly one of profile and local_cores should be specified.
  • your_name (optional): author's name. If specified it will be used in the configuration file for new projects created with the scprocess newproj -c command.
  • affiliation (optional): author's affiliation. If specified it will be used in the configuration file for new projects created with the scprocess newproj -c command.
  • int_use_gpu (optional): whether to use GPU acceleration (RAPIDS-singlecell) for integration and clustering steps. If specified, the value will be used in the configuration file for new projects created with the scprocess newproj -c command.
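
The rule that exactly one of profile and local_cores should be specified can be checked before running scprocess setup. The following is a minimal Python sketch, assuming the user section of scprocess_setup.yaml has already been loaded into a dictionary; the function name check_user_section is hypothetical and is not part of scprocess.

```python
def check_user_section(user: dict) -> None:
    """Raise ValueError unless exactly one of profile/local_cores is set.

    `user` is assumed to be the parsed `user` section of scprocess_setup.yaml.
    This helper is illustrative only; it is not part of scprocess.
    """
    has_profile = user.get("profile") is not None
    has_cores = user.get("local_cores") is not None
    # Both set, or neither set, violates the "exactly one" rule.
    if has_profile == has_cores:
        raise ValueError(
            "exactly one of 'profile' and 'local_cores' should be specified"
        )


if __name__ == "__main__":
    check_user_section({"profile": "slurm_default"})  # passes silently
    check_user_section({"local_cores": 8})            # passes silently
```

Optional keys such as your_name or affiliation are ignored by this check.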
arvados
  • arv_instance (optional): the name of the default Arvados instance for the user. If specified it will be used in the configuration file for new projects created with the scprocess newproj -c command.
ref_txomes

Prebuilt human and mouse reference transcriptomes from 10x Genomics can be downloaded with scprocess setup by adding tenx to the scprocess_setup.yaml file. Valid values for name are human_2024, mouse_2024, human_2020 and mouse_2020.

Names and specifications for custom references should be listed in the custom section of the scprocess_setup.yaml file. For each custom genome users have to provide the following parameters:

  • name: name to be used for the reference
  • fasta: path to FASTA file
  • gtf: path to GTF file
  • mito_str: regular expression used to identify genes in the mitochondrial genome (example for mouse: "^mt-")

Optional parameters for both tenx and custom references are:

  • decoys: whether or not poison k-mer information should be inserted into the index. If not specified, it defaults to true for all genomes.

An optional parameter for tenx references only is:

  • rrnas: whether or not ribosomal RNAs should be included in the reference. If not specified it defaults to true for all tenx genomes.

Impact of custom parameters for tenx genomes on scprocess setup runtime

When configuring tenx genomes with their default values, scprocess setup will download prebuilt indices for simpleaf. However, if the default parameters are modified (e.g., setting rrnas or decoys to false), scprocess setup will build the indices from scratch during execution, which will increase the runtime.
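
The behaviour described above can be summarised as a small decision function. This is a sketch in Python under the assumption that each tenx entry is available as a dictionary with optional decoys and rrnas keys; the function name needs_index_build is hypothetical and does not reflect how scprocess implements this internally.

```python
def needs_index_build(tenx_entry: dict) -> bool:
    """Return True if the index would be built from scratch.

    Both `decoys` and `rrnas` default to true for tenx genomes. A prebuilt
    index is downloaded only when both keep their default value.
    Illustrative helper; not part of scprocess.
    """
    decoys = tenx_entry.get("decoys", True)
    rrnas = tenx_entry.get("rrnas", True)
    return not (decoys and rrnas)


if __name__ == "__main__":
    print(needs_index_build({"name": "human_2024"}))
    print(needs_index_build({"name": "mouse_2024", "rrnas": False}))
```

Entries for which this returns True can be expected to lengthen the scprocess setup run.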

More about decoys

scprocess uses simpleaf, a lightweight mapping approach that, by default, maps sequenced fragments exclusively to the transcriptome. However, this can lead to reads that arise from unannotated genomic loci being mapped incorrectly to the transcriptome. To mitigate this issue, the decoys parameter for ref_txomes is set to true by default. This option allows simpleaf to identify genomic regions with sequences similar to those in transcribed regions (decoys), thereby reducing the likelihood of false mappings. We strongly recommend keeping the decoy setting enabled. For further details, refer to Srivastava et al., 2019.

scprocess newproj

Description: Create a new workflowr project directory for scprocess outputs.

Parameters:

  • name (positional): name of the new workflowr project directory.
  • -w/--where (optional): path to the directory where the new project will be created; defaults to the current working directory.
  • -s/--sub (optional): if provided, creates data/fastqs and data/metadata subdirectories within the project.
  • -c/--config (optional): generates a template configuration YAML file. If provided, it must be followed by either sc (single-cell) or sn (single-nucleus) to define standard QC thresholds. You can also append multiplex if your dataset requires demultiplexing, e.g. scprocess newproj project_name -c sc multiplex

scprocess plotknee

Description: Create an interactive barcode-rank plot. Can only be used once the mapping step is completed.

Parameters:

  • sample: sample_id corresponding to the barcode-rank curve.
  • -k/--kneefile: path to the knee plot data file generated by scprocess, e.g. output/[short_tag]_mapping/af_[sample_id]/knee_plot_data_[sample_id]_[date_stamp].csv.gz. Exactly one of --kneefile and --configfile should be specified.
  • -c/--configfile: path to configuration file used for running scprocess. Exactly one of --kneefile and --configfile should be specified.

scprocess run

Description: Run scprocess.

Parameters:

  • -n/--dry-run: perform a trial run which lists all steps that scprocess would do and does not create any new files. Helpful for checking input files and parameters.
  • --create-envs: only create the conda environments needed for the workflow, without running any rules.
  • -E/--extraargs: list of additional arguments to pass to Snakemake. Refer to Snakemake documentation for a detailed explanation of available command-line options.
  • -r/--rule: Specifies which rule scprocess should run. The options are:
    • all: default; includes all core pipeline steps.
    • mapping: read alignment and quantification.
    • ambient: ambient RNA removal (optional) and cell calling.
    • demux: sample demultiplexing.
    • qc: QC filtering.
    • hvg: calculation of highly variable genes.
    • integration: dimensionality reduction with PCA, optional batch correction with Harmony, UMAP and clustering.
    • marker_genes: marker gene identification and optional gene set enrichment analysis.
    • label_celltypes: cell type annotation using a pre-trained classifier.
    • zoom: subclustering.

configuration file

Below are two versions of an example config file for scprocess: first with all parameters and their default values, then with placeholder values. Required parameters are listed under Required parameters below.

project:
  proj_dir:
  fastq_dir: # should not be defined if arv_uuids is defined
  arv_uuids: # should not be defined if fastq_dir is defined
  arv_instance: # should only be defined if arv_uuids is defined
  full_tag:
  short_tag:
  your_name:
  affiliation:
  date_stamp:
  sample_metadata:
  ref_txome:
  metadata_vars:
  show_arv_uuids: true
  custom_sample_params:
  tenx_chemistry:
  exclude:
    sample_id:
    pool_id:
multiplexing:
  demux_type: none
  fastq_dir:
  arv_uuids:
  feature_ref:
  demux_output:
ambient:
  ambient_method: decontx
  cell_calling: barcodeRanks
  cb_version: v0.3.2
  cb_max_prop_kept: 0.9
  cb_learning_rate:
  cb_posterior_batch_size: 128
  cb_empty_training_fraction:
  cb_expected_cells:
  cb_total_droplets_included:
  cb_low_count_threshold:
qc:
  qc_min_counts: 500
  qc_min_feats: 300
  qc_min_mito: 0
  qc_max_mito: 0.1
  qc_min_splice: 0
  qc_max_splice: 1
  qc_min_cells: 100
  dbl_min_feats: 100
  exclude_mito: true
pb_empties:
  ambient_genes_logfc_thr: 0
  ambient_genes_fdr_thr: 0.01
hvg:
  hvg_method: sample
  hvg_n_hvgs: 2000
  hvg_exclude_ambient_genes: True
  hvg_exclude_from_file:
  hvg_chunk_size: 2000
  hvg_metadata_split_var:
integration:
  int_use_gpu: true
  int_embedding: harmony
  int_theta: 0.1
  int_batch_var: sample_id
  int_n_dims: 50
  int_dbl_res: 4
  int_dbl_cl_prop: 0.5
  int_sce_outs: false
  int_res_ls: [0.1, 0.2, 0.5, 1, 2]
  int_use_paga: true
  int_paga_cl_res: 2
marker_genes:
  mkr_sel_res: 0.2
  mkr_min_cl_size: 100
  mkr_min_cells: 10
  mkr_not_ok_re: "(lincRNA|lncRNA|pseudogene|antisense)"
  mkr_min_cpm_mkr: 50
  mkr_min_cpm_go: 1
  mkr_max_zero_p: 0.5
  mkr_do_gsea: true
  mkr_gsea_cut: 0.1
  mkr_gsea_var: z_score
  mkr_custom_genesets:
  - name:
    file:
label_celltypes:
  - labeller:
    model:
    hi_res_cl: "RNA_snn_res.2"
    min_pred: 0.8
    min_cl_prop: 0.5
    min_cl_size: 100
zoom:
resources:
  retries: 3
  n_run_mapping: 8

The same file with placeholder values:

project:
  proj_dir: /path/to/proj/directory 
  fastq_dir: /path/to/directory/with/fastq/files
  arv_uuids: ["arkau-qr8st-1a2b3c4d5e6f7g8", "arkau-9v0wx-h9i8j7k6l5m4n3o", "arkau-z2y3x-p0q1r2s3t4u5v6w"]
  arv_instance: instance_name
  full_tag: test_project
  short_tag: test
  your_name: Test McUser
  affiliation: where you work
  date_stamp: "2050-01-01"
  sample_metadata: /path/to/metadata.csv
  ref_txome: human_2024
  tenx_chemistry: 3v3
  metadata_vars: [var1, var2]
  show_arv_uuids: true
  custom_sample_params: /path/to/file/with/custom_parameters.yaml
  exclude:
    sample_id:
      - sample1
      - sample2
    pool_id:
      - pool1
      - pool2
multiplexing:
  demux_type: hto
  fastq_dir: /path/to/directory/with/hto_fastq/files
  arv_uuids: ["arkau-qr8st-1a2b3c4d5e6f7g8", "arkau-9v0wx-h9i8j7k6l5m4n3o", "arkau-z2y3x-p0q1r2s3t4u5v6w"]
  feature_ref: /path/to/feature_ref.csv
  demux_output: /path/to/demux_output.csv
ambient:
  ambient_method: cellbender
  cb_version: v0.3.2
  cb_empty_training_fraction:
  cb_expected_cells: 10000
  cb_total_droplets_included: 20000
  cb_low_count_threshold: 5
  cb_learning_rate: 0.001
  cb_posterior_batch_size: 128
qc:
  qc_min_counts: 500
  qc_min_feats: 300
  qc_min_mito: 0
  qc_max_mito: 0.1
  qc_min_splice: 0
  qc_max_splice: 1
  qc_min_cells: 100
  dbl_min_feats: 100
  exclude_mito: true
pb_empties:
  ambient_genes_logfc_thr: 0
  ambient_genes_fdr_thr: 0.01
hvg:
  hvg_method: sample
  hvg_n_hvgs: 2000
  hvg_exclude_ambient_genes: True
  hvg_exclude_from_file: /path/to/file/with/genes/to/exclude
  hvg_chunk_size: 2000
  hvg_metadata_split_var: var1
integration:
  int_use_gpu: true
  int_embedding: harmony
  int_theta: 0.1
  int_batch_var: sample_id
  int_n_dims: 50
  int_dbl_res: 4
  int_dbl_cl_prop: 0.5
  int_sce_outs: false
  int_res_ls: [0.1, 0.2, 0.5, 1, 2]
  int_use_paga: true
  int_paga_cl_res: 2
marker_genes:
  mkr_sel_res: 0.2
  mkr_min_cl_size: 100
  mkr_min_cells: 10
  mkr_not_ok_re: "(lincRNA|lncRNA|pseudogene|antisense)"
  mkr_min_cpm_mkr: 50
  mkr_min_cpm_go: 1
  mkr_max_zero_p: 0.5
  mkr_do_gsea: true
  mkr_gsea_cut: 0.1
  mkr_gsea_var: z_score
  mkr_custom_genesets:
  - name: mouse_brain
    file: /path/to/file/with/marker/genes.csv
label_celltypes:
  - labeller: "scprocess"
    model: "human_cns"
    hi_res_cl: "RNA_snn_res.2"
    min_pred: 0.8
    min_cl_prop: 0.5
    min_cl_size: 100
zoom:
  - /path/to/cell_subset_1_zoom_params.yaml
  - /path/to/cell_subset_2_zoom_params.yaml
  - /path/to/cell_subset_3_zoom_params.yaml
resources:
  retries: 3
  n_run_mapping: 8

Required parameters

project
  • proj_dir: absolute path to workflowr project directory created with the scprocess newproj function.
  • fastq_dir: path to directory containing FASTQ files. Should be absolute or relative to proj_dir. Exactly one of fastq_dir and arv_uuids should be specified.
  • arv_uuids: list of Arvados UUIDs where FASTQ files are located. Exactly one of fastq_dir and arv_uuids should be specified.
  • arv_instance: the name of the Arvados instance. Required if arv_uuids is defined.
  • full_tag: full project label, used in output file names.
  • short_tag: abbreviated project label, used in output directory names.
  • your_name: author’s name, displayed in HTML outputs.
  • affiliation: author’s affiliation, displayed in HTML outputs.
  • date_stamp: start date of the analysis, formatted as "YYYY-MM-DD".
  • sample_metadata: path to CSV file with sample metadata. Should be absolute or relative to proj_dir. Spaces in column names are not allowed. The only required column is sample_id; values in sample_id should not contain the strings _R1/.R1 or _R2/.R2 and should not overlap (no value should be contained in any other value).
  • ref_txome: must match one of the values in the ref_txome column of index_parameters.csv (created by scprocess setup).
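
The constraints on sample_id values can be checked ahead of a run. The following Python sketch assumes that "overlap" means one sample_id being a substring of another, which is an interpretation of the rule above rather than a confirmed definition; the function name find_sample_id_problems is hypothetical and not part of scprocess.

```python
# Strings that sample_id values should not contain, per the rule above.
FORBIDDEN_TAGS = ("_R1", ".R1", "_R2", ".R2")


def find_sample_id_problems(sample_ids: list) -> list:
    """Return human-readable descriptions of problematic sample_id values.

    Illustrative helper; not part of scprocess. "Overlap" is interpreted
    here as one value being a substring of another (an assumption).
    """
    problems = []
    for sid in sample_ids:
        for tag in FORBIDDEN_TAGS:
            if tag in sid:
                problems.append(f"{sid!r} contains {tag!r}")
    for a in sample_ids:
        for b in sample_ids:
            if a != b and a in b:
                problems.append(f"{a!r} is a substring of {b!r}")
    return problems


if __name__ == "__main__":
    for line in find_sample_id_problems(["s1", "s10", "ctrl_R1"]):
        print(line)
```

An empty list means that no problems of these two kinds were found.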

Optional parameters

project
  • tenx_chemistry: 10x assay configuration. Accepted values are 3LT, 3v2, 3v3, 3v4, 5v1, 5v2, 5v3, and multiome. multiome refers only to gene expression data generated with the 10x multiome kit (ATACseq data is not supported).
  • metadata_vars: A list of column names in the sample_metadata file to be used for visualizing the distribution of cell annotations across identified clusters and regions of the low-dimensional embedding.
  • show_arv_uuids: Whether to display Arvados UUIDs (arv_uuids) in the configuration file details box on the index page. If false, UUIDs are replaced with "not shown". Defaults to true.
  • exclude: List of all samples that should be excluded from the analysis. Samples can be listed under pool_id (if multiplexed) or sample_id.
  • custom_sample_params: YAML file with optional custom parameters for each pool or sample (custom tenx_chemistry, custom mapping, custom ambient and custom qc parameters can be specified for each sample). For example:
pool_id:
  pool_1:
    tenx_chemistry: 5v2
    mapping:
      knee1: 4000
      shin1: 400
      knee2: 30
      shin2: 5
  pool_2:
    tenx_chemistry: 5v2
    mapping:
      knee1: 3000
      shin1: 400
      knee2: 30
      shin2: 5
    ambient:
      cb_total_droplets_included: 20000
      cb_learning_rate: 0.001
      cb_posterior_batch_size: 128 # only applicable if cb_version is v0.3.2
sample_id:
  sample_1:
    qc:
      qc_min_counts: 100
multiplexing
  • demux_type: how demultiplexing is handled (default is none). Options are:
    • none if experiment is not multiplexed;
    • hto if demultiplexing of samples should be performed with scprocess; or
    • custom if demultiplexing results will be used as input to scprocess.
  • fastq_dir: path to directory containing HTO FASTQ files. Should be absolute or relative to proj_dir. If demux_type is hto, exactly one of fastq_dir and arv_uuids should be specified.
  • arv_uuids: list of Arvados UUIDs where FASTQ files are located. Expects arv_instance to be defined. If demux_type is hto, exactly one of fastq_dir and arv_uuids should be specified.
  • feature_ref: path to CSV file with columns hto_id and sequence. Required if demux_type is hto.
  • demux_output: path to CSV file with columns pool_id, sample_id, cell_id. Optional column class can be added with values doublet, singlet or negative. Required if demux_type is custom.
  • seurat_quantile: equivalent to the positive.quantile argument of the Seurat::HTODemux function (see Seurat documentation for more details).
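
The dependencies between demux_type and the other multiplexing parameters can be expressed as a short check. This is a minimal Python sketch, assuming the multiplexing section has been loaded into a dictionary; the function name check_multiplexing is hypothetical and not part of scprocess.

```python
def check_multiplexing(mux: dict) -> list:
    """Return a list of error messages for the `multiplexing` section.

    Encodes only the rules stated in the list above. Illustrative helper;
    not part of scprocess.
    """
    errors = []
    demux_type = mux.get("demux_type", "none")  # documented default is none
    if demux_type not in ("none", "hto", "custom"):
        errors.append(f"unknown demux_type: {demux_type!r}")
    elif demux_type == "hto":
        if not mux.get("feature_ref"):
            errors.append("feature_ref is required if demux_type is hto")
        # Exactly one of fastq_dir and arv_uuids should be specified.
        if bool(mux.get("fastq_dir")) == bool(mux.get("arv_uuids")):
            errors.append("exactly one of fastq_dir and arv_uuids should be specified")
    elif demux_type == "custom":
        if not mux.get("demux_output"):
            errors.append("demux_output is required if demux_type is custom")
    return errors


if __name__ == "__main__":
    print(check_multiplexing({"demux_type": "hto", "fastq_dir": "/path/to/hto"}))
```

An empty list indicates that the section satisfies these rules.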
ambient
  • ambient_method: method for ambient RNA removal; options are decontx (default), cellbender or none.
  • cell_calling: method for cell calling when ambient_method is none or decontx. Options are barcodeRanks (default) and emptyDrops.
  • cb_version: version of CellBender to use if ambient_method is set to cellbender. Options are v0.3.2 (default), v0.3.0 and v0.2.0.
  • cb_max_prop_kept: maximum proportion of droplets, relative to --total-droplets-included, that CellBender can call as cells. Default is 0.9, meaning samples are excluded if CellBender calls more than 90% of --total-droplets-included droplets as cells. Applicable only if ambient_method is cellbender. For more information about the --total-droplets-included parameter see CellBender documentation.
  • cb_learning_rate: sets the --learning-rate CellBender parameter to the specified value; applicable only if ambient_method is cellbender. Default value is 0.0001. For more information about this parameter see CellBender documentation.
  • cb_empty_training_fraction: sets the --empty-drop-training-fraction CellBender parameter to the specified value; applicable only if ambient_method is cellbender. Default value is 0.2. Setting this to a lower value (e.g. 0.1 or 0.05) can help if CellBender jobs are failing on samples with very few cells. For more information about this parameter see CellBender documentation.
  • cb_posterior_batch_size: sets the --posterior-batch-size CellBender parameter to the specified value; applicable only if ambient_method is cellbender and cb_version is v0.3.2. For more information about this parameter see CellBender documentation.
  • cb_expected_cells: forces the --expected-cells CellBender parameter to be consistent across all samples; applicable only if ambient_method is cellbender. For more information about this parameter see CellBender documentation.
  • cb_total_droplets_included: forces the --total-droplets-included CellBender parameter to be consistent across all samples; applicable only if ambient_method is cellbender. For more information about this parameter see CellBender documentation.
  • cb_low_count_threshold: forces the --low-count-threshold CellBender parameter to be consistent across all samples; applicable only if ambient_method is cellbender. For more information about this parameter see CellBender documentation.
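
To make the cb_max_prop_kept rule concrete, the comparison it describes can be written as follows. This is a Python sketch of the stated rule only; the function name and its arguments are hypothetical, and the actual scprocess implementation may differ.

```python
def exceeds_max_prop_kept(
    n_cells_called: int,
    total_droplets_included: int,
    cb_max_prop_kept: float = 0.9,  # documented default
) -> bool:
    """Return True if a sample would be excluded under cb_max_prop_kept.

    A sample is excluded when the number of droplets called as cells is
    more than cb_max_prop_kept times --total-droplets-included.
    Illustrative helper; not part of scprocess.
    """
    return n_cells_called / total_droplets_included > cb_max_prop_kept


if __name__ == "__main__":
    # 19000 of 20000 droplets called as cells is 95%, above the 90% default.
    print(exceeds_max_prop_kept(19000, 20000))
```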
qc
  • qc_min_counts: minimum number of UMIs per cell required to retain the cell.
  • qc_min_feats: minimum number of detected features per cell required to retain the cell.
  • qc_min_mito: minimum proportion of mitochondrial reads required to retain the cell.
  • qc_max_mito: maximum proportion of mitochondrial reads allowed to retain the cell.
  • qc_min_splice: minimum proportion of spliced reads required to retain the cell.
  • qc_max_splice: maximum proportion of spliced reads allowed to retain the cell.
  • qc_min_cells: minimum number of cells required in a sample after QC filtering to retain the sample.
  • dbl_min_feats: number of features required for each barcode to be included in scDblFinder calculations.
  • exclude_mito: boolean; whether to exclude mitochondrial genes or not.
pb_empties
  • ambient_genes_logfc_thr: log-fold change (logFC) threshold used to filter the results of the edgeR differential expression test comparing empty droplets to cells.
  • ambient_genes_fdr_thr: false discovery rate (FDR) threshold used to filter the results of the edgeR differential expression test comparing empty droplets to cells.
hvg
  • hvg_method: how highly variable genes are calculated. Options are:
    • sample - calculate highly variable genes per sample, then calculate combined ranking across samples;
    • all - calculate highly variable genes across all cells in the dataset; and
    • groups - calculate highly variable genes for each sample group then calculate combined ranking across groups.
  • hvg_metadata_split_var: if hvg_method is groups, which variable in sample_metadata should be used to define sample groups.
  • hvg_n_hvgs: number of HVGs to use for PCA.
  • hvg_exclude_ambient_genes: if true, genes enriched in empty droplets relative to cells will be excluded from highly variable genes selection.
  • hvg_exclude_from_file: path to CSV file with genes to be excluded from HVGs. Should be absolute or relative to proj_dir. File should contain one column, named either gene_id or symbol. Values in the column should all be present in the reference genome.
  • hvg_chunk_size: number of genes to use for each chunked matrix.
integration
  • int_use_gpu: whether to use GPU acceleration (RAPIDS-singlecell) for integration and clustering steps. Options are true (default) or false. If GPU is not available, Scanpy will be used.
  • int_embedding: which dimensionality reduction method to use for clustering and UMAP, options: pca (no batch correction), harmony (batch correction).
  • int_theta: theta parameter for Harmony integration, controlling batch variable mixing.
  • int_batch_var: variable to use for integration with Harmony. Default is sample_id; if demux_type is set to either hto or custom, then pool_id is an alternative option.
  • int_n_dims: number of principal components to use for data integration.
  • int_dbl_res: clustering resolution for identification of additional doublets.
  • int_dbl_cl_prop: threshold for the proportion of doublets within a cluster. Clusters where the proportion of doublets exceeds this value will be excluded.
  • int_sce_outs: if true, H5AD outputs will be converted to SingleCellExperiment objects and stored as RDS files.
  • int_res_ls: list of resolution values to be used for clustering.
  • int_use_paga: if true, enable Partition-based graph abstraction (PAGA) for trajectory analysis and cell hierarchy inference. A clustering at the specified resolution will be computed for PAGA.
  • int_paga_cl_res: clustering resolution for PAGA analysis. Must be a value listed in int_res_ls. Default is 2. Only used when int_use_paga is true.
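
The requirement that int_paga_cl_res be one of the values in int_res_ls can be verified with a few lines. This is a Python sketch, assuming the integration section has been loaded into a dictionary and using the default values shown in the example config; the function name check_paga_resolution is hypothetical and not part of scprocess.

```python
def check_paga_resolution(integration: dict) -> None:
    """Raise ValueError if int_paga_cl_res is not listed in int_res_ls.

    The check is skipped when int_use_paga is false, since the value is
    only used when PAGA is enabled. Illustrative helper; not part of scprocess.
    """
    if not integration.get("int_use_paga", True):
        return
    res_ls = integration.get("int_res_ls", [0.1, 0.2, 0.5, 1, 2])
    paga_res = integration.get("int_paga_cl_res", 2)
    if paga_res not in res_ls:
        raise ValueError(
            f"int_paga_cl_res ({paga_res}) must be one of int_res_ls ({res_ls})"
        )


if __name__ == "__main__":
    check_paga_resolution({"int_res_ls": [0.1, 0.5, 1], "int_paga_cl_res": 1})
```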
marker_genes
  • mkr_sel_res: selected cluster resolution used for identifying marker genes.
  • mkr_min_cl_size: minimum number of cells required in a cluster to calculate marker genes for that cluster.
  • mkr_min_cells: minimum number of cells required in a pseudobulk sample to include it in marker gene calculations.
  • mkr_not_ok_re: regular expression pattern to exclude specific gene types from plots showing marker gene expression.
  • mkr_min_cpm_mkr: minimum counts per million (CPM) in a cell type required for a gene to be considered a marker gene.
  • mkr_do_gsea: boolean specifying whether Gene Set Enrichment Analysis (GSEA) should be performed on marker genes.
  • mkr_min_cpm_go: minimum counts per million (CPM) in a cell type required for a gene to be used in GSEA.
  • mkr_max_zero_p: maximum proportion of pseudobulk samples for a cell type that can have zero counts for a gene to be used in GSEA.
  • mkr_gsea_cut: False discovery rate (FDR) cutoff for GSEA.
  • mkr_gsea_var: the statistical measure used for ranking genes for GSEA. Choices are z_score (z-score based on signed log10(FDR), the default) or logFC (log fold change).
  • mkr_custom_genesets: a list of custom marker gene sets, each defined by a unique name and associated file path.
    • name: a string representing the name of the marker gene set
    • file: path to CSV file containing a list of genes in the marker gene set. Must contain column label (marker gene category), and symbol and/or ensembl_id. If not specified, scprocess will look for the file $SCPROCESS_DATA_DIR/marker_genes/{name}.csv.
label_celltypes
  • labeller: specifies the method to annotate cell types; options include:
    • celltypist: use one of the models available in CellTypist for annotation.
    • scprocess: use an XGBoost classifier for cell type annotation.
  • model: determines the model to be used based on the selected labeller. For a list of all available CellTypist models, see $SCPROCESS_DATA_DIR/celltypist/celltypist_models.csv. If labeller is set to scprocess, the value should be human_cns.
  • hi_res_cl: name of a column containing high-resolution clustering results. It must follow the pattern "RNA_snn_res.n" where n should be replaced with one of the values in int_res_ls. Default is "RNA_snn_res.2".
  • min_pred: minimum probability threshold for assigning a cell to a cell type.
  • min_cl_prop: minimum proportion of cells in a cluster that need to be labeled for that cluster to be labeled.
  • min_cl_size: minimum number of cells in a cluster required for that cluster to be labeled.
zoom

In this section, users can provide multiple YAML files, each specifying parameters for repeating certain steps of scprocess on a subset of cells. Some parameters in the YAML file inherit their definitions from the primary scprocess configuration file, including qc_min_cells, hvg_method, hvg_metadata_split_var, hvg_n_hvgs, hvg_chunk_size, hvg_exclude_ambient_genes, hvg_exclude_from_file, ambient_genes_logfc_thr, ambient_genes_fdr_thr, int_use_gpu, int_embedding, int_n_dims, int_theta, int_res_ls, int_use_paga, int_paga_cl_res, mkr_sel_res, mkr_min_cl_size, mkr_min_cells, mkr_not_ok_re, mkr_min_cpm_mkr, mkr_min_cpm_go, mkr_max_zero_p, mkr_do_gsea, mkr_gsea_cut, mkr_gsea_var and mkr_custom_genesets.

Additional parameters include:

  • name: name of cell subset to be analysed.
  • labels_source: specifies how a cell subset is defined (required). Options include:
    • scprocess: labels assigned by the XGBoost classifier (using the rule label_celltypes)
    • celltypist: labels assigned by CellTypist (using the rule label_celltypes)
    • clusters: labels based on clustering results obtained with scprocess
    • custom: user-defined cell type annotations
  • model: required if labels_source is scprocess or celltypist.
  • sel_labels: a list of all labels that define cell types/clusters to be included in subclustering (required).
  • labels_col: name of column that contains cell type/cluster labels.
  • save_subset_sces: whether to create SingleCellExperiment objects containing cells that have been assigned one of the values in sel_labels; default is false.
  • save_subset_anndata: whether to create H5AD files containing cells that have been assigned one of the values in sel_labels; default is true.
  • custom_labels_f: required if labels_source is set to custom; path to CSV file with columns sample_id, cell_id and label.

Example zoom configuration file:

zoom:
  name: oligos_opcs
  labels_source: celltypist
  model: Mouse_Whole_Brain
  sel_labels: ["327 Oligo NN", "326 OPC NN"]
  labels_col: predicted_label_agg
  save_subset_sces: true
  save_subset_anndata: true
qc:
  qc_min_cells: 100
hvg:
  hvg_method: all 
resources

This section allows users to adjust the resource requirements for specific Snakemake rules. This is especially useful when a step/rule fails on a cluster due to insufficient memory or runtime limits. By specifying the parameters below, users can fine-tune these settings for their pipeline:

  • gb_[rule_name]: specifies the maximum memory (in GB) requested for running a specific rule. rule_name should be replaced with an scprocess rule name. This value applies for the entire job, not per thread.
  • mins_[rule_name]: specifies the maximum runtime (in minutes) requested for running a specific rule. rule_name should be replaced with an scprocess rule name.

Additional parameters include:

  • retries: number of times to retry running a specific rule in scprocess if it fails. For each attempt the initial memory requested for the rule is multiplied by 1.5**(attempt - 1). Useful when scprocess is run on a cluster.
  • n_run_mapping: number of threads requested for running the mapping step. Default is 8.
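
The memory scaling applied on retries follows directly from the formula above. As a worked example in Python (the function name memory_for_attempt is hypothetical, and attempts are assumed to be numbered from 1):

```python
def memory_for_attempt(initial_gb: float, attempt: int) -> float:
    """Memory (in GB) requested on a given attempt.

    Implements initial_gb * 1.5**(attempt - 1), with attempt starting at 1
    (an assumption about numbering). Illustrative helper; not part of scprocess.
    """
    return initial_gb * 1.5 ** (attempt - 1)


if __name__ == "__main__":
    # With gb_run_mapping set to 8, successive attempts request 8, 12 and 18 GB.
    for attempt in (1, 2, 3):
        print(attempt, memory_for_attempt(8, attempt))
```

When choosing gb_[rule_name], it helps to confirm that the largest of these values still fits on the nodes available on the cluster.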
Detailed information about resource parameters
  • gb_build_hto_index: maximum memory required (in GB) for rule build_hto_index.
  • gb_run_mapping: maximum memory required (in GB) for rule run_mapping.
  • gb_run_mapping_hto: maximum memory required (in GB) for rule run_mapping_hto.
  • gb_save_alevin_to_h5: maximum memory required (in GB) for rule save_alevin_to_h5.
  • gb_make_hto_sce_objects: maximum memory required (in GB) for rule make_hto_sce_objects.
  • gb_save_alevin_hto_to_h5: maximum memory required (in GB) for rule save_alevin_hto_to_h5.
  • gb_run_cellbender: maximum memory required (in GB) for rule run_cellbender.
  • gb_run_decontx: maximum memory required (in GB) for rule run_decontx.
  • gb_run_cell_celling: maximum memory required (in GB) for rule run_cell_celling.
  • gb_get_barcode_qc_metrics: maximum memory required (in GB) for rule get_barcode_qc_metrics.
  • gb_run_qc_one_run: maximum memory required (in GB) for rule run_qc_one_run.
  • gb_merge_qc: maximum memory required (in GB) for rule merge_qc.
  • gb_merge_rowdata: maximum memory required (in GB) for rule merge_rowdata.
  • gb_get_qc_sample_statistics: maximum memory required (in GB) for rule get_qc_sample_statistics.
  • gb_make_one_pb_empty: maximum memory required (in GB) for rule make_one_pb_empty.
  • gb_merge_pb_empty: maximum memory required (in GB) for rule merge_pb_empty.
  • gb_make_one_pb_cells: maximum memory required (in GB) for rule make_one_pb_cells.
  • gb_merge_pb_cells: maximum memory required (in GB) for rule merge_pb_cells.
  • gb_calculate_ambient_genes: maximum memory required (in GB) for rule calculate_ambient_genes.
  • gb_make_tmp_csr_matrix: maximum memory required (in GB) for rule make_tmp_csr_matrix.
  • gb_get_stats_for_std_variance_for_sample: maximum memory required (in GB) for rule get_stats_for_std_variance_for_sample.
  • gb_merge_sample_std_var_stats: maximum memory required (in GB) for rule merge_sample_std_var_stats.
  • gb_get_mean_var_for_group: maximum memory required (in GB) for rule get_mean_var_for_group.
  • gb_get_estimated_variances: maximum memory required (in GB) for rule get_estimated_variances.
  • gb_get_stats_for_std_variance_for_group: maximum memory required (in GB) for rule get_stats_for_std_variance_for_group.
  • gb_get_highly_variable_genes: maximum memory required (in GB) for rule get_highly_variable_genes.
  • gb_create_hvg_matrix: maximum memory required (in GB) for rule create_hvg_matrix.
  • gb_create_doublets_hvg_matrix: maximum memory required (in GB) for rule create_doublets_hvg_matrix.
  • gb_run_integration: maximum memory required (in GB) for rule run_integration.
  • gb_make_clean_h5ads: maximum memory required (in GB) for rule make_clean_h5ads.
  • gb_convert_h5ad_to_sce: maximum memory required (in GB) for rule convert_h5ad_to_sce.
  • gb_run_marker_genes: maximum memory required (in GB) for rule run_marker_genes.
  • gb_run_fgsea: maximum memory required (in GB) for rule run_fgsea.
  • gb_render_html_mapping: maximum memory required (in GB) for rule render_html_mapping.
  • gb_render_html_multiplexing: maximum memory required (in GB) for rule render_html_multiplexing.
  • gb_render_html_ambient: maximum memory required (in GB) for rule render_html_ambient.
  • gb_render_html_qc: maximum memory required (in GB) for rule render_html_qc.
  • gb_render_html_hvgs: maximum memory required (in GB) for rule render_html_hvgs.
  • gb_render_html_integration: maximum memory required (in GB) for rule render_html_integration.
  • gb_render_html_marker_genes: maximum memory required (in GB) for rule render_html_marker_genes.
  • gb_render_html_label_celltypes: maximum memory required (in GB) for rule render_html_label_celltypes.
  • gb_make_tmp_mtx_file: maximum memory required (in GB) for rule make_tmp_mtx_file.
  • gb_run_celltypist: maximum memory required (in GB) for rule run_celltypist.
  • gb_run_scprocess_labeller: maximum memory required (in GB) for rule run_scprocess_labeller.
  • gb_merge_labels: maximum memory required (in GB) for rule merge_labels.
  • gb_get_zoom_sample_statistics: maximum memory required (in GB) for rule get_zoom_sample_statistics.
  • gb_zoom_make_one_pb_cells: maximum memory required (in GB) for rule zoom_make_one_pb_cells.
  • gb_zoom_calculate_ambient_genes: maximum memory required (in GB) for rule zoom_calculate_ambient_genes.
  • gb_zoom_make_hvg_df: maximum memory required (in GB) for rule zoom_make_hvg_df.
  • gb_zoom_make_tmp_csr_matrix: maximum memory required (in GB) for rule zoom_make_tmp_csr_matrix.
  • gb_zoom_get_stats_for_std_variance_for_sample: maximum memory required (in GB) for rule zoom_get_stats_for_std_variance_for_sample.
  • gb_zoom_get_mean_var_for_group: maximum memory required (in GB) for rule zoom_get_mean_var_for_group.
  • gb_zoom_merge_group_mean_var: maximum memory required (in GB) for rule zoom_merge_group_mean_var.
  • gb_zoom_get_estimated_variances: maximum memory required (in GB) for rule zoom_get_estimated_variances.
  • gb_zoom_get_stats_for_std_variance_for_group: maximum memory required (in GB) for rule zoom_get_stats_for_std_variance_for_group.
  • gb_zoom_merge_stats_for_std_variance: maximum memory required (in GB) for rule zoom_merge_stats_for_std_variance.
  • gb_zoom_get_highly_variable_genes: maximum memory required (in GB) for rule zoom_get_highly_variable_genes.
  • gb_zoom_create_hvg_matrix: maximum memory required (in GB) for rule zoom_create_hvg_matrix.
  • gb_zoom_run_integration: maximum memory required (in GB) for rule zoom_run_integration.
  • gb_zoom_run_marker_genes: maximum memory required (in GB) for rule zoom_run_marker_genes.
  • gb_zoom_run_fgsea: maximum memory required (in GB) for rule zoom_run_fgsea.
  • gb_zoom_make_subsets: maximum memory required (in GB) for rule zoom_make_subsets.
  • gb_render_html_zoom: maximum memory required (in GB) for rule render_html_zoom.
  • mins_build_hto_index: maximum runtime required (in minutes) for rule build_hto_index.
  • mins_run_mapping: maximum runtime required (in minutes) for rule run_mapping.
  • mins_run_mapping_hto: maximum runtime required (in minutes) for rule run_mapping_hto.
  • mins_save_alevin_to_h5: maximum runtime required (in minutes) for rule save_alevin_to_h5.
  • mins_make_hto_sce_objects: maximum runtime required (in minutes) for rule make_hto_sce_objects.
  • mins_save_alevin_hto_to_h5: maximum runtime required (in minutes) for rule save_alevin_hto_to_h5.
  • mins_run_cellbender: maximum runtime required (in minutes) for rule run_cellbender.
  • mins_run_decontx: maximum runtime required (in minutes) for rule run_decontx.
  • mins_run_cell_celling: maximum runtime required (in minutes) for rule run_cell_celling.
  • mins_get_barcode_qc_metrics: maximum runtime required (in minutes) for rule get_barcode_qc_metrics.
  • mins_run_qc_one_run: maximum runtime required (in minutes) for rule run_qc_one_run.
  • mins_merge_qc: maximum runtime required (in minutes) for rule merge_qc.
  • mins_merge_rowdata: maximum runtime required (in minutes) for rule merge_rowdata.
  • mins_get_qc_sample_statistics: maximum runtime required (in minutes) for rule get_qc_sample_statistics.
  • mins_make_one_pb_empty: maximum runtime required (in minutes) for rule make_one_pb_empty.
  • mins_merge_pb_empty: maximum runtime required (in minutes) for rule merge_pb_empty.
  • mins_make_one_pb_cells: maximum runtime required (in minutes) for rule make_one_pb_cells.
  • mins_merge_pb_cells: maximum runtime required (in minutes) for rule merge_pb_cells.
  • mins_calculate_ambient_genes: maximum runtime required (in minutes) for rule calculate_ambient_genes.
  • mins_make_tmp_csr_matrix: maximum runtime required (in minutes) for rule make_tmp_csr_matrix.
  • mins_get_stats_for_std_variance_for_sample: maximum runtime required (in minutes) for rule get_stats_for_std_variance_for_sample.
  • mins_merge_sample_std_var_stats: maximum runtime required (in minutes) for rule merge_sample_std_var_stats.
  • mins_get_mean_var_for_group: maximum runtime required (in minutes) for rule get_mean_var_for_group.
  • mins_get_estimated_variances: maximum runtime required (in minutes) for rule get_estimated_variances.
  • mins_get_stats_for_std_variance_for_group: maximum runtime required (in minutes) for rule get_stats_for_std_variance_for_group.
  • mins_get_highly_variable_genes: maximum runtime required (in minutes) for rule get_highly_variable_genes.
  • mins_create_hvg_matrix: maximum runtime required (in minutes) for rule create_hvg_matrix.
  • mins_create_doublets_hvg_matrix: maximum runtime required (in minutes) for rule create_doublets_hvg_matrix.
  • mins_run_integration: maximum runtime required (in minutes) for rule run_integration.
  • mins_make_clean_h5ads: maximum runtime required (in minutes) for rule make_clean_h5ads.
  • mins_convert_h5ad_to_sce: maximum runtime required (in minutes) for rule convert_h5ad_to_sce.
  • mins_run_marker_genes: maximum runtime required (in minutes) for rule run_marker_genes.
  • mins_run_fgsea: maximum runtime required (in minutes) for rule run_fgsea.
  • mins_render_html_mapping: maximum runtime required (in minutes) for rule render_html_mapping.
  • mins_render_html_multiplexing: maximum runtime required (in minutes) for rule render_html_multiplexing.
  • mins_render_html_ambient: maximum runtime required (in minutes) for rule render_html_ambient.
  • mins_render_html_qc: maximum runtime required (in minutes) for rule render_html_qc.
  • mins_render_html_hvgs: maximum runtime required (in minutes) for rule render_html_hvgs.
  • mins_render_html_integration: maximum runtime required (in minutes) for rule render_html_integration.
  • mins_render_html_marker_genes: maximum runtime required (in minutes) for rule render_html_marker_genes.
  • mins_render_html_label_celltypes: maximum runtime required (in minutes) for rule render_html_label_celltypes.
  • mins_make_tmp_mtx_file: maximum runtime required (in minutes) for rule make_tmp_mtx_file.
  • mins_run_celltypist: maximum runtime required (in minutes) for rule run_celltypist.
  • mins_run_scprocess_labeller: maximum runtime required (in minutes) for rule run_scprocess_labeller.
  • mins_merge_labels: maximum runtime required (in minutes) for rule merge_labels.
  • mins_get_zoom_sample_statistics: maximum runtime required (in minutes) for rule get_zoom_sample_statistics.
  • mins_zoom_make_one_pb_cells: maximum runtime required (in minutes) for rule zoom_make_one_pb_cells.
  • mins_zoom_calculate_ambient_genes: maximum runtime required (in minutes) for rule zoom_calculate_ambient_genes.
  • mins_zoom_make_hvg_df: maximum runtime required (in minutes) for rule zoom_make_hvg_df.
  • mins_zoom_make_tmp_csr_matrix: maximum runtime required (in minutes) for rule zoom_make_tmp_csr_matrix.
  • mins_zoom_get_stats_for_std_variance_for_sample: maximum runtime required (in minutes) for rule zoom_get_stats_for_std_variance_for_sample.
  • mins_zoom_get_mean_var_for_group: maximum runtime required (in minutes) for rule zoom_get_mean_var_for_group.
  • mins_zoom_merge_group_mean_var: maximum runtime required (in minutes) for rule zoom_merge_group_mean_var.
  • mins_zoom_get_estimated_variances: maximum runtime required (in minutes) for rule zoom_get_estimated_variances.
  • mins_zoom_get_stats_for_std_variance_for_group: maximum runtime required (in minutes) for rule zoom_get_stats_for_std_variance_for_group.
  • mins_zoom_merge_stats_for_std_variance: maximum runtime required (in minutes) for rule zoom_merge_stats_for_std_variance.
  • mins_zoom_get_highly_variable_genes: maximum runtime required (in minutes) for rule zoom_get_highly_variable_genes.
  • mins_zoom_create_hvg_matrix: maximum runtime required (in minutes) for rule zoom_create_hvg_matrix.
  • mins_zoom_run_integration: maximum runtime required (in minutes) for rule zoom_run_integration.
  • mins_zoom_run_marker_genes: maximum runtime required (in minutes) for rule zoom_run_marker_genes.
  • mins_zoom_run_fgsea: maximum runtime required (in minutes) for rule zoom_run_fgsea.
  • mins_zoom_make_subsets: maximum runtime required (in minutes) for rule zoom_make_subsets.
  • mins_render_html_zoom: maximum runtime required (in minutes) for rule render_html_zoom.
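
As an illustration of how the per-rule limits above might be overridden, the fragment below raises memory and runtime for a few rules. It is a minimal sketch: the enclosing `resources` key and all numeric values are assumptions for illustration, not defaults taken from scprocess. Only the parameter names come from the list above.

```yaml
# Hypothetical fragment of a project configuration file.
# ASSUMPTION: the per-rule limits live under a top-level `resources` section;
# check the configuration file of your own project for the actual section name.
resources:
  # memory limits, in GB (values are illustrative only)
  gb_run_fgsea:               16
  gb_zoom_run_integration:    64
  # runtime limits, in minutes (values are illustrative only)
  mins_run_mapping:           240
  mins_run_integration:       120
  mins_render_html_zoom:      30
```

Each `gb_*` parameter has a `mins_*` counterpart for the same rule, so a rule that needs more memory on a large dataset will often need its runtime limit raised as well.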

  1. Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol., 21(1):239, September 2020.