Reference
scprocess setup
Description: Download all data required for scprocess and index reference transcriptomes for simpleaf.
Parameters:
-c/--rangerurl: download link for Cell Ranger (v9.0.0 or higher) available on the 10x Genomics CellRanger download & installation page; only required when running the command for the first time.
configuration file
The command requires a configuration file named scprocess_setup.yaml located in the scprocess data directory (for instructions on setting up the scprocess data directory, see the Getting started section). In this file, users can specify parameters that apply across all scprocess projects, such as the HPC configuration and the reference genomes that will be made available to scprocess. For example:
user:
  profile: slurm_default # define local_cores instead if running locally
  your_name: Testy McUser
  affiliation: Unemployed
  int_use_gpu: false
arvados:
  arv_instance: instance_name
ref_txomes:
  tenx:
    - name: human_2024
      decoys: true
      rrnas: true
  custom:
    - name: custom_genome_name
      fasta: /path/to/genome.fa
      gtf: /path/to/genes.gtf
      decoys: true
      mito_str: "^mt-"
    - name: custom_genome_name2
      index_dir: /path/to/prebuild/alevin/index
      gtf: /path/to/genes.gtf
      mito_str: "^MT-"
user
profile: the name of the HPC profile to be used by Snakemake. Must correspond to the name of one of the subfolders in the profiles folder. This subfolder must contain a file called config.yaml. Exactly one of profile and local_cores should be specified.
local_cores: number of CPU cores available for local execution (see Snakemake documentation for more details). Exactly one of profile and local_cores should be specified.
your_name (optional): author's name. If specified, it will be used in the configuration file for new projects created with the scprocess newproj -c command.
affiliation (optional): author's affiliation. If specified, it will be used in the configuration file for new projects created with the scprocess newproj -c command.
int_use_gpu (optional): whether to use GPU acceleration (RAPIDS-singlecell) for integration and clustering steps. If false, the value will be used in the configuration file for new projects created with the scprocess newproj -c command.
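The "exactly one of profile and local_cores" rule can be checked before running scprocess setup. The following is a minimal sketch, not part of scprocess: the helper name check_exactly_one is hypothetical, and the user section is represented as a plain dict, as it might look after loading scprocess_setup.yaml.

```python
def check_exactly_one(section, key_a, key_b):
    """Return True if exactly one of the two keys is set (present and not None)."""
    has_a = section.get(key_a) is not None
    has_b = section.get(key_b) is not None
    return has_a != has_b


# Hypothetical 'user' sections, written as plain dicts for illustration
hpc_user = {"profile": "slurm_default", "your_name": "Testy McUser"}
local_user = {"local_cores": 8}
both_set = {"profile": "slurm_default", "local_cores": 8}
```

The same check applies to the other "exactly one of" pairs in this reference, such as fastq_dir and arv_uuids.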
arvados
arv_instance (optional): the name of the default Arvados instance for the user. If specified, it will be used in the configuration file for new projects created with the scprocess newproj -c command.
ref_txomes
Prebuilt human and mouse reference transcriptomes from 10x Genomics can be downloaded with scprocess setup by adding tenx to the scprocess_setup.yaml file. Valid values for names are human_2024, mouse_2024, human_2020, mouse_2020.
Names and specifications for custom references should be listed in the custom section of the scprocess_setup.yaml file. For each custom genome users have to provide the following parameters:
name: name to be used for the reference
fasta: path to FASTA file
gtf: path to GTF file
mito_str: regular expression used to identify genes in the mitochondrial genome (example for mouse: "^mt-")
The optional parameter for both tenx and custom references is:
decoys: whether or not poison k-mer information should be inserted into the index. If not specified, it defaults to true for all genomes.
The optional parameter for tenx references only is:
rrnas: whether or not ribosomal RNAs should be included in the reference. If not specified, it defaults to true for all tenx genomes.
Impact of custom parameters for tenx genomes on scprocess setup runtime
When configuring tenx genomes with their default values, scprocess setup will download prebuilt indices for simpleaf. However, if the default parameters are modified (e.g., setting rrnas or decoys to false), scprocess setup will build the indices from scratch during execution, which will increase the runtime.
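The rule above can be sketched as a small check on one entry of the tenx list. This is an illustration under stated assumptions, not scprocess code: the function name uses_prebuilt_index is hypothetical, and it assumes that decoys and rrnas are the only parameters that affect whether a prebuilt index can be used.

```python
def uses_prebuilt_index(tenx_entry):
    """Return True if a tenx entry keeps the defaults (decoys and rrnas both true)."""
    decoys = tenx_entry.get("decoys", True)  # defaults to true if not specified
    rrnas = tenx_entry.get("rrnas", True)    # defaults to true if not specified
    return decoys and rrnas


default_entry = {"name": "human_2024"}
modified_entry = {"name": "mouse_2024", "rrnas": False}
```

Under this sketch, default_entry would use a downloaded index, while modified_entry would require building the index during setup.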
More about decoys
scprocess uses simpleaf, a lightweight mapping approach that, by default, maps sequenced fragments exclusively to the transcriptome. However, reads that arise from unannotated genomic loci can then be incorrectly mapped to the transcriptome. To mitigate this issue, the decoys parameter for ref_txomes defaults to true. This option allows simpleaf to identify genomic regions with sequences similar to those in transcribed regions (decoys), thereby reducing the likelihood of false mappings. We strongly recommend keeping the decoy setting enabled. For further details, refer to Srivastava et al., 2019.
scprocess newproj
Description: Create a new workflowr project directory for scprocess outputs.
Parameters:
name (positional): name of the new workflowr project directory.
-w/--where (optional): path to the directory where the new project will be created; defaults to the current working directory.
-s/--sub (optional): if provided, creates data/fastqs and data/metadata subdirectories within the project.
-c/--config (optional): generates a template configuration YAML file. If provided, it must be followed by either sc (single-cell) or sn (single-nucleus) to define standard QC thresholds. You can also append multiplex if your dataset requires demultiplexing, e.g. scprocess newproj project_name -c sc multiplex
scprocess plotknee
Description: Create an interactive barcode-rank plot. Can only be used once the mapping step is completed.
Parameters:
sample: sample_id corresponding to the barcode-rank curve.
-k/--kneefile: path to the knee plot data file generated by scprocess, e.g. output/[short_tag]_mapping/af_[sample_id]/knee_plot_data_[sample_id]_[date_stamp].csv.gz. Exactly one of --kneefile and --configfile should be specified.
-c/--configfile: path to configuration file used for running scprocess. Exactly one of --kneefile and --configfile should be specified.
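The knee file location follows the pattern given for --kneefile. As a convenience, the path can be assembled from the project settings. This is a small sketch with a hypothetical helper name (knee_file_path); it only fills in the placeholders of the documented pattern and returns a path relative to the project directory.

```python
from pathlib import PurePosixPath


def knee_file_path(short_tag, sample_id, date_stamp):
    """Fill in the documented knee file pattern, relative to the project directory."""
    return (
        PurePosixPath("output")
        / f"{short_tag}_mapping"
        / f"af_{sample_id}"
        / f"knee_plot_data_{sample_id}_{date_stamp}.csv.gz"
    )


example_path = knee_file_path("test", "sample1", "2050-01-01")
```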
scprocess run
Description: Run scprocess.
Parameters:
-n/--dry-run: perform a trial run which lists all steps that scprocess would do and does not create any new files. Helpful for checking input files and parameters.
--create-envs: only create the conda environments needed for the workflow, without running any rules.
-E/--extraagrs: list of additional arguments to pass to Snakemake. Refer to Snakemake documentation for a detailed explanation of available command-line options.
-r/--rule: specifies which rule scprocess should run. The options are:
    all: default; includes all Core pipeline steps.
    mapping: read alignment and quantification.
    ambient: ambient RNA removal (optional) and cell calling.
    demux: sample demultiplexing.
    qc: QC filtering.
    hvg: calculation of highly variable genes.
    integration: dimensionality reduction with PCA, optional batch correction with Harmony, UMAP and clustering.
    marker_genes: marker gene identification and optional gene set enrichment analysis.
    label_celltypes: cell type annotation using a pre-trained classifier.
    zoom: subclustering.
configuration file
This is an example config file for scprocess with all parameters and their default values/placeholders. Required parameters are highlighted:
project:
  proj_dir:
  fastq_dir: # should not be defined if arv_uuids is defined
  arv_uuids: # should not be defined if fastq_dir is defined
  arv_instance: # should only be defined if arv_uuids is defined
  full_tag:
  short_tag:
  your_name:
  affiliation:
  date_stamp:
  sample_metadata:
  ref_txome:
  metadata_vars:
  show_arv_uuids: true
  custom_sample_params:
  tenx_chemistry:
  exclude:
    sample_id:
    pool_id:
multiplexing:
  demux_type: none
  fastq_dir:
  arv_uuids:
  feature_ref:
  demux_output:
ambient:
  ambient_method: decontx
  cell_calling: barcodeRanks
  cb_version: v0.3.2
  cb_max_prop_kept: 0.9
  cb_learning_rate:
  cb_posterior_batch_size: 128
  cb_empty_training_fraction:
  cb_expected_cells:
  cb_total_droplets_included:
  cb_low_count_threshold:
qc:
  qc_min_counts: 500
  qc_min_feats: 300
  qc_min_mito: 0
  qc_max_mito: 0.1
  qc_min_splice: 0
  qc_max_splice: 1
  qc_min_cells: 100
  dbl_min_feats: 100
  exclude_mito: true
pb_empties:
  ambient_genes_logfc_thr: 0
  ambient_genes_fdr_thr: 0.01
hvg:
  hvg_method: sample
  hvg_n_hvgs: 2000
  hvg_exclude_ambient_genes: True
  hvg_exclude_from_file:
  hvg_chunk_size: 2000
  hvg_metadata_split_var:
integration:
  int_use_gpu: true
  int_embedding: harmony
  int_theta: 0.1
  int_batch_var: sample_id
  int_n_dims: 50
  int_dbl_res: 4
  int_dbl_cl_prop: 0.5
  int_sce_outs: false
  int_res_ls: [0.1, 0.2, 0.5, 1, 2]
  int_use_paga: true
  int_paga_cl_res: 2
marker_genes:
  mkr_sel_res: 0.2
  mkr_min_cl_size: 100
  mkr_min_cells: 10
  mkr_not_ok_re: "(lincRNA|lncRNA|pseudogene|antisense)"
  mkr_min_cpm_mkr: 50
  mkr_min_cpm_go: 1
  mkr_max_zero_p: 0.5
  mkr_do_gsea: true
  mkr_gsea_cut: 0.1
  mkr_gsea_var: z_score
  mkr_custom_genesets:
    - name:
      file:
label_celltypes:
  - labeller:
    model:
    hi_res_cl: "RNA_snn_res.2"
    min_pred: 0.8
    min_cl_prop: 0.5
    min_cl_size: 100
zoom:
resources:
  retries: 3
  n_run_mapping: 8
project:
  proj_dir: /path/to/proj/directory
  fastq_dir: /path/to/directory/with/fastq/files
  arv_uuids: ["arkau-qr8st-1a2b3c4d5e6f7g8", "arkau-9v0wx-h9i8j7k6l5m4n3o", "arkau-z2y3x-p0q1r2s3t4u5v6w"]
  arv_instance: instance_name
  full_tag: test_project
  short_tag: test
  your_name: Test McUser
  affiliation: where you work
  date_stamp: "2050-01-01"
  sample_metadata: /path/to/metadata.csv
  ref_txome: human_2024
  tenx_chemistry: 3v3
  metadata_vars: [var1, var2]
  show_arv_uuids: true
  custom_sample_params: /path/to/file/with/custom_parameters.yaml
  exclude:
    sample_id:
      - sample1
      - sample2
    pool_id:
      - pool1
      - pool2
multiplexing:
  demux_type: hto
  fastq_dir: /path/to/directory/with/hto_fastq/files
  arv_uuids: ["arkau-qr8st-1a2b3c4d5e6f7g8", "arkau-9v0wx-h9i8j7k6l5m4n3o", "arkau-z2y3x-p0q1r2s3t4u5v6w"]
  feature_ref: /path/to/feature_ref.csv
  demux_output: /path/to/demux_output.csv
ambient:
  ambient_method: cellbender
  cb_version: v0.3.2
  cb_empty_training_fraction:
  cb_expected_cells: 10000
  cb_total_droplets_included: 20000
  cb_low_count_threshold: 5
  cb_learning_rate: 0.001
  cb_posterior_batch_size: 128
qc:
  qc_min_counts: 500
  qc_min_feats: 300
  qc_min_mito: 0
  qc_max_mito: 0.1
  qc_min_splice: 0
  qc_max_splice: 1
  qc_min_cells: 100
  dbl_min_feats: 100
  exclude_mito: true
pb_empties:
  ambient_genes_logfc_thr: 0
  ambient_genes_fdr_thr: 0.01
hvg:
  hvg_method: sample
  hvg_n_hvgs: 2000
  hvg_exclude_ambient_genes: True
  hvg_exclude_from_file: /path/to/file/with/genes/to/exclude
  hvg_chunk_size: 2000
  hvg_metadata_split_var: var1
integration:
  int_use_gpu: true
  int_embedding: harmony
  int_theta: 0.1
  int_batch_var: sample_id
  int_n_dims: 50
  int_dbl_res: 4
  int_dbl_cl_prop: 0.5
  int_sce_outs: false
  int_res_ls: [0.1, 0.2, 0.5, 1, 2]
  int_use_paga: true
  int_paga_cl_res: 2
marker_genes:
  mkr_sel_res: 0.2
  mkr_min_cl_size: 100
  mkr_min_cells: 10
  mkr_not_ok_re: "(lincRNA|lncRNA|pseudogene|antisense)"
  mkr_min_cpm_mkr: 50
  mkr_min_cpm_go: 1
  mkr_max_zero_p: 0.5
  mkr_do_gsea: true
  mkr_gsea_cut: 0.1
  mkr_gsea_var: z_score
  mkr_custom_genesets:
    - name: mouse_brain
      file: /path/to/file/with/marker/genes.csv
label_celltypes:
  - labeller: "scprocess"
    model: "human_cns"
    hi_res_cl: "RNA_snn_res.2"
    min_pred: 0.8
    min_cl_prop: 0.5
    min_cl_size: 100
zoom:
  - /path/to/cell_subset_1_zoom_params.yaml
  - /path/to/cell_subset_2_zoom_params.yaml
  - /path/to/cell_subset_3_zoom_params.yaml
resources:
  retries: 3
  n_run_mapping: 8
Required parameters
project
proj_dir: absolute path to workflowr project directory created with the scprocess newproj function.
fastq_dir: path to directory containing FASTQ files. Should be absolute or relative to proj_dir. Exactly one of fastq_dir and arv_uuids should be specified.
arv_uuids: list of Arvados UUIDs where FASTQ files are located. Exactly one of fastq_dir and arv_uuids should be specified.
arv_instance: the name of the Arvados instance. Required if arv_uuids is defined.
full_tag: full project label, used in output file names.
short_tag: abbreviated project label, used in output directory names.
your_name: author's name, displayed in HTML outputs.
affiliation: author's affiliation, displayed in HTML outputs.
date_stamp: start date of the analysis, formatted as "YYYY-MM-DD".
sample_metadata: path to CSV file with sample metadata. Should be absolute or relative to proj_dir. Spaces in column names are not allowed. The only required column is sample_id; values in sample_id should not contain _R1/.R1 and _R2/.R2 strings and should not overlap (a value should not be a subset of any other values).
ref_txome: must match one of the values in the ref_txome column of index_parameters.csv (created by scprocess setup).
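The constraints on sample_id values can be checked before starting a run. The sketch below is not part of scprocess; the function name check_sample_ids is hypothetical, and it interprets "a value should not be a subset of any other values" as "no sample_id may be a substring of another sample_id", which is an assumption.

```python
FORBIDDEN_TAGS = ("_R1", ".R1", "_R2", ".R2")


def check_sample_ids(sample_ids):
    """Return a list of problems found in the sample_id values (empty if none)."""
    problems = []
    # Rule 1: no read-pair tags inside a sample_id
    for sid in sample_ids:
        for tag in FORBIDDEN_TAGS:
            if tag in sid:
                problems.append(f"{sid}: contains {tag}")
    # Rule 2 (assumed interpretation): no sample_id is a substring of another
    for a in sample_ids:
        for b in sample_ids:
            if a != b and a in b:
                problems.append(f"{a}: is contained in {b}")
    return problems
```

For example, the pair s1 and s10 would be flagged, because s1 is a substring of s10; renaming to s01 and s10 avoids the overlap.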
Optional parameters
project
tenx_chemistry: 10x assay configuration. Accepted values are 3LT, 3v2, 3v3, 3v4, 5v1, 5v2, 5v3, and multiome. multiome refers only to gene expression data generated with the 10x multiome kit (ATACseq data is not supported).
metadata_vars: a list of column names in the sample_metadata file to be used for visualizing the distribution of cell annotations across identified clusters and regions of the low-dimensional embedding.
show_arv_uuids: whether to display Arvados UUIDs (arv_uuids) in the configuration file details box on the index page. If false, UUIDs are replaced with "not shown". Defaults to true.
exclude: list of all samples that should be excluded from the analysis. Samples can be listed under pool_id (if multiplexed) or sample_id.
custom_sample_params: YAML file with optional custom parameters for each pool or sample (custom tenx_chemistry, custom mapping, custom ambient and custom qc parameters can be specified for each sample). For example:
pool_id:
  pool_1:
    tenx_chemistry: 5v2
    mapping:
      knee1: 4000
      shin1: 400
      knee2: 30
      shin2: 5
  pool_2:
    tenx_chemistry: 5v2
    mapping:
      knee1: 3000
      shin1: 400
      knee2: 30
      shin2: 5
    ambient:
      cb_total_droplets_included: 20000
      cb_learning_rate: 0.001
      cb_posterior_batch_size: 128 # only applicable if cb_version is v0.3.2
sample_id:
  sample_1:
    qc:
      qc_min_counts: 100
multiplexing
demux_type: one of the following options (default is none):
    none: if the experiment is not multiplexed;
    hto: if demultiplexing of samples should be performed with scprocess; or
    custom: if demultiplexing results will be used as input to scprocess.
fastq_dir: path to directory containing HTO FASTQ files. Should be absolute or relative to proj_dir. If demux_type is hto, exactly one of fastq_dir and arv_uuids should be specified.
arv_uuids: list of Arvados UUIDs where FASTQ files are located. Expects arv_instance to be defined. If demux_type is hto, exactly one of fastq_dir and arv_uuids should be specified.
feature_ref: path to CSV file with columns hto_id and sequence. Required if demux_type is hto.
demux_output: path to CSV file with columns pool_id, sample_id, cell_id. Optional column class can be added with values doublet, singlet or negative. Required if demux_type is custom.
seurat_quantile: equivalent to the positive.quantile argument of the Seurat::HTODemux function (see Seurat documentation for more details).
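When demux_type is custom, the layout of the demux_output file matters. The following sketch checks a CSV against the columns listed above. It is an illustration only: the function name check_demux_output is hypothetical, and the CSV is passed in as text to keep the example self-contained.

```python
import csv
import io

REQUIRED_COLS = {"pool_id", "sample_id", "cell_id"}
ALLOWED_CLASSES = {"doublet", "singlet", "negative"}


def check_demux_output(csv_text):
    """Return a list of problems found in a demux_output CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLS - set(reader.fieldnames or [])
    if missing:
        return [f"missing column: {col}" for col in sorted(missing)]
    problems = []
    for i, row in enumerate(reader, start=1):
        # The class column is optional; validate it only when present
        if "class" in row and row["class"] not in ALLOWED_CLASSES:
            problems.append(f"row {i}: unexpected class {row['class']!r}")
    return problems


good_csv = "pool_id,sample_id,cell_id,class\npool1,sample1,AAAC,singlet\n"
bad_csv = "pool_id,sample_id,cell_id,class\npool1,sample1,AAAC,triplet\n"
```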
ambient
ambient_method: method for ambient RNA removal; options are decontx (default), cellbender or none.
cell_calling: method for cell calling when ambient_method is none or decontx. Options are barcodeRanks (default) and emptyDrops.
cb_version: version of CellBender to use if ambient_method is set to cellbender. Options are v0.3.2 (default), v0.3.0 and v0.2.0.
cb_max_prop_kept: maximum proportion of droplets, relative to --total-droplets-included, that CellBender can call as cells. Default is 0.9, meaning samples are excluded if CellBender calls more than 90% of --total-droplets-included droplets as cells. Applicable only if ambient_method is cellbender. For more information about the --total-droplets-included parameter see CellBender documentation.
cb_learning_rate: sets the --learning-rate CellBender parameter to the specified value; applicable only if ambient_method is cellbender. Default value is 0.0001. For more information about this parameter see CellBender documentation.
cb_empty_training_fraction: sets the --empty-drop-training-fraction CellBender parameter to the specified value; applicable only if ambient_method is cellbender. Default value is 0.2. Setting this to a lower value (e.g. 0.1 or 0.05) can help if CellBender jobs are failing on samples with very few cells. For more information about this parameter see CellBender documentation.
cb_posterior_batch_size: value of the --posterior-batch-size parameter; applicable only if ambient_method is cellbender and cb_version is v0.3.2. For more information about this parameter see CellBender documentation.
cb_expected_cells: forces the --expected-cells CellBender parameter to be consistent across all samples; applicable only if ambient_method is cellbender. For more information about this parameter see CellBender documentation.
cb_total_droplets_included: forces the --total-droplets-included CellBender parameter to be consistent across all samples; applicable only if ambient_method is cellbender. For more information about this parameter see CellBender documentation.
cb_low_count_threshold: forces the --low-count-threshold CellBender parameter to be consistent across all samples; applicable only if ambient_method is cellbender. For more information about this parameter see CellBender documentation.
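The cb_max_prop_kept rule is a simple ratio test. As a worked sketch (the function name is hypothetical and this is not the scprocess implementation): a sample is excluded when the number of droplets called as cells, divided by the value passed as --total-droplets-included, exceeds cb_max_prop_kept.

```python
def exceeds_max_prop_kept(n_cells_called, total_droplets_included, cb_max_prop_kept=0.9):
    """Return True if the sample would be excluded under the cb_max_prop_kept rule."""
    prop_kept = n_cells_called / total_droplets_included
    return prop_kept > cb_max_prop_kept


# 19000 of 20000 droplets called as cells: proportion 0.95, above the 0.9 default
excluded = exceeds_max_prop_kept(19000, 20000)
# 10000 of 20000 droplets called as cells: proportion 0.5, sample is kept
kept = exceeds_max_prop_kept(10000, 20000)
```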
qc
qc_min_counts: minimum number of UMIs per cell required to retain the cell.
qc_min_feats: minimum number of detected features per cell required to retain the cell.
qc_min_mito: minimum proportion of mitochondrial reads required to retain the cell.
qc_max_mito: maximum proportion of mitochondrial reads allowed to retain the cell.
qc_min_splice: minimum proportion of spliced reads required to retain the cell.
qc_max_splice: maximum proportion of spliced reads allowed to retain the cell.
qc_min_cells: minimum number of cells required in a sample after QC filtering to retain the sample.
dbl_min_feats: number of features required for each barcode to be included in scDblFinder calculations.
exclude_mito: boolean; whether to exclude mitochondrial genes or not.
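Taken together, the per-cell thresholds above act as a set of range checks. The sketch below shows one way to read them, using the default values from the example configuration. It is not the scprocess implementation: the function name passes_qc is hypothetical, and treating all bounds as inclusive is an assumption.

```python
QC_DEFAULTS = {
    "qc_min_counts": 500,
    "qc_min_feats": 300,
    "qc_min_mito": 0,
    "qc_max_mito": 0.1,
    "qc_min_splice": 0,
    "qc_max_splice": 1,
}


def passes_qc(n_counts, n_feats, mito_prop, splice_prop, qc=None):
    """Return True if a cell satisfies all per-cell thresholds (bounds assumed inclusive)."""
    qc = qc or QC_DEFAULTS
    return (
        n_counts >= qc["qc_min_counts"]
        and n_feats >= qc["qc_min_feats"]
        and qc["qc_min_mito"] <= mito_prop <= qc["qc_max_mito"]
        and qc["qc_min_splice"] <= splice_prop <= qc["qc_max_splice"]
    )
```

A sample would then be retained only if at least qc_min_cells of its cells pass these checks.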
pb_empties
ambient_genes_logfc_thr: log-fold change (logFC) threshold used to filter the results of the edgeR differential expression test comparing empty droplets to cells.
ambient_genes_fdr_thr: false discovery rate (FDR) threshold used to filter the results of the edgeR differential expression test comparing empty droplets to cells.
hvg
hvg_method: options:
    sample: calculate highly variable genes per sample, then calculate combined ranking across samples;
    all: calculate highly variable genes across all cells in the dataset; and
    groups: calculate highly variable genes for each sample group, then calculate combined ranking across groups.
hvg_metadata_split_var: if hvg_method is groups, which variable in sample_metadata should be used to define sample groups.
hvg_n_hvgs: number of HVGs to use for PCA.
hvg_exclude_ambient_genes: if true, genes enriched in empty droplets relative to cells will be excluded from highly variable genes selection.
hvg_exclude_from_file: path to CSV file with genes to be excluded from HVGs. Should be absolute or relative to proj_dir. File should contain one column, named either gene_id or symbol. Values in the column should all be present in the reference genome.
hvg_chunk_size: number of genes to use for each chunked matrix.
integration
int_use_gpu: whether to use GPU acceleration (RAPIDS-singlecell) for integration and clustering steps. Options are true (default) or false. If GPU is not available, Scanpy will be used.
int_embedding: which dimensionality reduction method to use for clustering and UMAP; options: pca (no batch correction), harmony (batch correction).
int_theta: theta parameter for Harmony integration, controlling batch variable mixing.
int_batch_var: variable to use for integration with Harmony. Default is sample_id; if demux_type is set to either hto or custom, then pool_id is an alternative option.
int_n_dims: number of principal components to use for data integration.
int_dbl_res: clustering resolution for identification of additional doublets.
int_dbl_cl_prop: threshold for the proportion of doublets within a cluster. Clusters where the proportion of doublets exceeds this value will be excluded.
int_sce_outs: if true, H5AD outputs will be converted to SingleCellExperiment objects and stored as RDS files.
int_res_ls: list of resolution values to be used for clustering.
int_use_paga: if true, enable Partition-based graph abstraction (PAGA) for trajectory analysis and cell hierarchy inference. A clustering at the specified resolution will be computed for PAGA.
int_paga_cl_res: clustering resolution for PAGA analysis. Must be a value listed in int_res_ls. Default is 2. Only used when int_use_paga is true.
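The int_dbl_cl_prop rule can be illustrated with a small, self-contained sketch. This is not the scprocess implementation: the function name doublet_clusters and the input layout (one cluster label and one doublet flag per cell) are assumptions made for the example.

```python
from collections import Counter


def doublet_clusters(cluster_ids, is_doublet, int_dbl_cl_prop=0.5):
    """Return clusters whose doublet proportion exceeds int_dbl_cl_prop."""
    totals = Counter(cluster_ids)
    doublets = Counter(c for c, d in zip(cluster_ids, is_doublet) if d)
    return sorted(c for c in totals if doublets[c] / totals[c] > int_dbl_cl_prop)


# Cluster "a": 2 of 3 cells are doublets; cluster "b": 1 of 4 cells is a doublet
clusters = ["a", "a", "a", "b", "b", "b", "b"]
flags = [True, True, False, True, False, False, False]
to_exclude = doublet_clusters(clusters, flags)
```

With the default threshold of 0.5, only cluster "a" exceeds the limit in this example.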
marker_genes
mkr_sel_res: selected cluster resolution used for identifying marker genes.
mkr_min_cl_size: minimum number of cells required in a cluster to calculate marker genes for that cluster.
mkr_min_cells: minimum number of cells required in a pseudobulk sample to include it in marker gene calculations.
mkr_not_ok_re: regular expression pattern to exclude specific gene types from plots showing marker gene expression.
mkr_min_cpm_mkr: minimum counts per million (CPM) in a cell type required for a gene to be considered a marker gene.
mkr_do_gsea: boolean specifying whether Gene Set Enrichment Analysis (GSEA) should be performed on marker genes.
mkr_min_cpm_go: minimum counts per million (CPM) in a cell type required for a gene to be used in GSEA.
mkr_max_zero_p: maximum proportion of pseudobulk samples for a cell type that can have zero counts for a gene to be used in GSEA.
mkr_gsea_cut: false discovery rate (FDR) cutoff for GSEA.
mkr_gsea_var: the statistical measure used for ranking genes for GSEA. Choices are z_score (z-score based on signed log10(FDR), the default) or logFC (log fold change).
mkr_custom_genesets: a list of custom marker gene sets, each defined by a unique name and associated file path.
    name: a string representing the name of the marker gene set.
    file: path to CSV file containing a list of genes in the marker gene set. Must contain column label (marker gene category), and symbol and/or ensembl_id. If not specified, scprocess will look for file $SCPROCESS_DATA_DIR/marker_genes/{name}.csv
label_celltypes
labeller: specifies the method to annotate cell types; options include:
    celltypist: uses one of the models available in CellTypist for annotation.
    scprocess: uses an XGBoost classifier for cell type annotation.
model: determines the model to be used based on the selected labeller. For a list of all available CellTypist models see $SCPROCESS_DATA_DIR/celltypist/celltypist_models.csv. If labeller is set to scprocess, the value should be human_cns.
hi_res_cl: name of a column containing high-resolution clustering results. It must follow the pattern "RNA_snn_res.n" where n should be replaced with one of the values in int_sel_res. Default is "RNA_snn_res.2".
min_pred: minimum probability threshold for assigning a cell to a cell type.
min_cl_prop: minimum proportion of cells in a cluster that need to be labeled for that cluster to be labeled.
min_cl_size: minimum number of cells in a cluster required for that cluster to be labeled.
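The "RNA_snn_res.n" naming pattern for hi_res_cl can be generated from a list of clustering resolutions. The sketch below uses hypothetical helper names, and it takes the resolution list from the int_res_ls values in the example configuration, which is an assumption about where the valid values come from.

```python
def hi_res_cl_name(resolution):
    """Build a column name following the "RNA_snn_res.n" pattern."""
    # The 'g' format drops trailing zeros, so 2 and 2.0 both give "RNA_snn_res.2"
    return f"RNA_snn_res.{resolution:g}"


def valid_hi_res_cl(hi_res_cl, resolutions):
    """Return True if hi_res_cl matches one of the given resolutions."""
    return hi_res_cl in {hi_res_cl_name(r) for r in resolutions}


example_resolutions = [0.1, 0.2, 0.5, 1, 2]  # int_res_ls from the example configuration
```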
zoom
In this section, users can provide multiple YAML files, each specifying parameters for repeating certain steps of scprocess on a subset of cells. Some parameters in the YAML file inherit their definitions from the primary scprocess configuration file, including qc_min_cells, hvg_method, hvg_metadata_split_var, hvg_n_hvgs, hvg_chunk_size, hvg_exclude_ambient_genes, hvg_exclude_from_file, ambient_genes_logfc_thr, ambient_genes_fdr_thr, int_use_gpu, int_embedding, int_n_dims, int_theta, int_res_ls, int_use_paga, int_paga_cl_res, mkr_sel_res, mkr_min_cl_size, mkr_min_cells, mkr_not_ok_re, mkr_min_cpm_mkr, mkr_min_cpm_go, mkr_max_zero_p, mkr_do_gsea, mkr_gsea_cut, mkr_gsea_var and mkr_custom_genesets.
Additional parameters include:
name: name of cell subset to be analysed.
labels_source: specifies how a cell subset is defined (required). Options include:
    scprocess: labels assigned by the XGBoost classifier (using the rule label_celltypes)
    celltypist: labels assigned by CellTypist (using the rule label_celltypes)
    clusters: labels based on clustering results obtained with scprocess
    custom: user-defined cell type annotations
model: required if labels_source is scprocess or celltypist.
sel_labels: a list of all labels that define cell types/clusters to be included in subclustering (required).
labels_col: name of column that contains cell type/cluster labels.
save_subset_sces: whether to create SingleCellExperiment objects containing cells that have been assigned one of the values in sel_labels; default is false.
save_subset_anndata: whether to create H5AD files containing cells that have been assigned one of the values in sel_labels; default is true.
custom_labels_f: required if labels_source is set to custom; path to CSV file with columns sample_id, cell_id and label.
Example zoom configuration file:
zoom:
  name: oligos_opcs
  labels_source: celltypist
  model: Mouse_Whole_Brain
  sel_labels: ["327 Oligo NN", "326 OPC NN"]
  labels_col: predicted_label_agg
  save_subset_sces: true
  save_subset_anndata: true
qc:
  qc_min_cells: 100
hvg:
  hvg_method: all
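The dependencies between the zoom parameters (model and custom_labels_f are required only for certain labels_source values) can be summarised in a short check. This is a sketch with a hypothetical function name, operating on the zoom block as a plain dict; it is not how scprocess itself validates the file.

```python
def check_zoom_params(zoom):
    """Return a list of problems in a zoom parameter block (empty if none)."""
    problems = []
    for key in ("labels_source", "sel_labels"):
        if not zoom.get(key):
            problems.append(f"{key} is required")
    source = zoom.get("labels_source")
    if source in ("scprocess", "celltypist") and not zoom.get("model"):
        problems.append(f"model is required when labels_source is {source}")
    if source == "custom" and not zoom.get("custom_labels_f"):
        problems.append("custom_labels_f is required when labels_source is custom")
    return problems


# The zoom block from the example file above, as a dict
example_zoom = {
    "name": "oligos_opcs",
    "labels_source": "celltypist",
    "model": "Mouse_Whole_Brain",
    "sel_labels": ["327 Oligo NN", "326 OPC NN"],
    "labels_col": "predicted_label_agg",
}
```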
resources
This section allows users to adjust the resource requirements for specific Snakemake rules. This is especially useful when a step/rule fails on a cluster due to insufficient memory or runtime limits. By specifying the parameters below, users can fine-tune these settings for their pipeline:
gb_[rule_name]: specifies the maximum memory (in GB) requested for running a specific rule. rule_name should be replaced with an scprocess rule name. This value applies to the entire job, not per thread.
mins_[rule_name]: specifies the maximum runtime (in minutes) requested for running a specific rule. rule_name should be replaced with an scprocess rule name.
Additional parameters include:
retries: number of times to retry running a specific rule in scprocess if it fails. For each attempt, the initial memory requested for the rule is multiplied by 1.5**(attempt - 1). Useful when scprocess is run on a cluster.
n_run_mapping: number of threads requested for running the mapping step. Default is 8.
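As a worked example of the retry scaling, the memory requested on each attempt follows directly from the 1.5**(attempt - 1) factor. The function name below is hypothetical and the 8 GB starting value is only an example; attempt numbering is assumed to start at 1.

```python
def memory_for_attempt(initial_gb, attempt):
    """Memory requested (in GB) on a given attempt, with attempts numbered from 1."""
    return initial_gb * 1.5 ** (attempt - 1)


# A rule that initially requests 8 GB: attempts 1, 2 and 3
requested = [memory_for_attempt(8, attempt) for attempt in (1, 2, 3)]
# requested is [8.0, 12.0, 18.0]
```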
Detailed information about resource parameters
gb_build_hto_index: maximum memory required (in GB) for rulebuild_hto_index.gb_run_mapping: maximum memory required (in GB) for rulerun_mapping.gb_run_mapping_hto: maximum memory required (in GB) for rulerun_mapping_hto.gb_save_alevin_to_h5: maximum memory required (in GB) for rulesave_alevin_to_h5.gb_make_hto_sce_objects: maximum memory required (in GB) for rulemake_hto_sce_objects.gb_save_alevin_hto_to_h5: maximum memory required (in GB) for rulesave_alevin_hto_to_h5.gb_run_cellbender: maximum memory required (in GB) for rulerun_cellbender.gb_run_decontx: maximum memory required (in GB) for rulerun_decontx.gb_run_cell_celling: maximum memory required (in GB) for rulerun_cell_celling.gb_get_barcode_qc_metrics: maximum memory required (in GB) for ruleget_barcode_qc_metrics.gb_run_qc_one_run: maximum memory required (in GB) for rulerun_qc_one_run.gb_merge_qc: maximum memory required (in GB) for rulemerge_qc.gb_merge_rowdata: maximum memory required (in GB) for rulemerge_rowdata.gb_get_qc_sample_statistics: maximum memory required (in GB) for ruleget_qc_sample_statistics.gb_make_one_pb_empty: maximum memory required (in GB) for rulemake_one_pb_empty.gb_merge_pb_empty: maximum memory required (in GB) for rulemerge_pb_empty.gb_make_one_pb_cells: maximum memory required (in GB) for rulemake_one_pb_cells.gb_merge_pb_cells: maximum memory required (in GB) for rulemerge_pb_cells.gb_calculate_ambient_genes: maximum memory required (in GB) for rulecalculate_ambient_genes.gb_make_tmp_csr_matrix: maximum memory required (in GB) for rulemake_tmp_csr_matrix.gb_get_stats_for_std_variance_for_sample: maximum memory required (in GB) for ruleget_stats_for_std_variance_for_sample.gb_merge_sample_std_var_stats: maximum memory required (in GB) for rulemerge_sample_std_var_stats.gb_get_mean_var_for_group: maximum memory required (in GB) for ruleget_mean_var_for_group.gb_get_estimated_variances: maximum memory required (in GB) for 
ruleget_estimated_variances.gb_get_stats_for_std_variance_for_group: maximum memory required (in GB) for ruleget_stats_for_std_variance_for_group.gb_get_highly_variable_genes: maximum memory required (in GB) for ruleget_highly_variable_genes.gb_create_hvg_matrix: maximum memory required (in GB) for rulecreate_hvg_matrix.gb_create_doublets_hvg_matrix: maximum memory required (in GB) for rulecreate_doublets_hvg_matrix.gb_run_integration: maximum memory required (in GB) for rulerun_integration.gb_make_clean_h5ads: maximum memory required (in GB) for rulemake_clean_h5ads.gb_convert_h5ad_to_sce: maximum memory required (in GB) for ruleconvert_h5ad_to_sce.gb_run_marker_genes: maximum memory required (in GB) for rulerun_marker_genes.gb_run_fgsea: maximum memory required (in GB) for rulerun_fgsea.gb_render_html_mapping: maximum memory required (in GB) for rulerender_html_mapping.gb_render_html_multiplexing: maximum memory required (in GB) for rulerender_html_multiplexing.gb_render_html_ambient: maximum memory required (in GB) for rulerender_html_ambient.gb_render_html_qc: maximum memory required (in GB) for rulerender_html_qc.gb_render_html_hvgs: maximum memory required (in GB) for rulerender_html_hvgs.gb_render_html_integration: maximum memory required (in GB) for rulerender_html_integration.gb_render_html_marker_genes: maximum memory required (in GB) for rulerender_html_marker_genes.gb_render_html_label_celltypes: maximum memory required (in GB) for rulerender_html_label_celltypes.gb_make_tmp_mtx_file: maximum memory required (in GB) for rulemake_tmp_mtx_file.gb_run_celltypist: maximum memory required (in GB) for rulerun_celltypist.gb_run_scprocess_labeller: maximum memory required (in GB) for rulerun_scprocess_labeller.gb_merge_labels: maximum memory required (in GB) for rulemerge_labels.gb_get_zoom_sample_statistics: maximum memory required (in GB) for ruleget_zoom_sample_statistics.gb_zoom_make_one_pb_cells: maximum memory required (in GB) for 
rule zoom_make_one_pb_cells.
gb_zoom_calculate_ambient_genes: maximum memory required (in GB) for rule zoom_calculate_ambient_genes.
gb_zoom_make_hvg_df: maximum memory required (in GB) for rule zoom_make_hvg_df.
gb_zoom_make_tmp_csr_matrix: maximum memory required (in GB) for rule zoom_make_tmp_csr_matrix.
gb_zoom_get_stats_for_std_variance_for_sample: maximum memory required (in GB) for rule zoom_get_stats_for_std_variance_for_sample.
gb_zoom_get_mean_var_for_group: maximum memory required (in GB) for rule zoom_get_mean_var_for_group.
gb_zoom_merge_group_mean_var: maximum memory required (in GB) for rule zoom_merge_group_mean_var.
gb_zoom_get_estimated_variances: maximum memory required (in GB) for rule zoom_get_estimated_variances.
gb_zoom_get_stats_for_std_variance_for_group: maximum memory required (in GB) for rule zoom_get_stats_for_std_variance_for_group.
gb_zoom_merge_stats_for_std_variance: maximum memory required (in GB) for rule zoom_merge_stats_for_std_variance.
gb_zoom_get_highly_variable_genes: maximum memory required (in GB) for rule zoom_get_highly_variable_genes.
gb_zoom_create_hvg_matrix: maximum memory required (in GB) for rule zoom_create_hvg_matrix.
gb_zoom_run_integration: maximum memory required (in GB) for rule zoom_run_integration.
gb_zoom_run_marker_genes: maximum memory required (in GB) for rule zoom_run_marker_genes.
gb_zoom_run_fgsea: maximum memory required (in GB) for rule zoom_run_fgsea.
gb_zoom_make_subsets: maximum memory required (in GB) for rule zoom_make_subsets.
gb_render_html_zoom: maximum memory required (in GB) for rule render_html_zoom.
mins_build_hto_index: maximum runtime required (in minutes) for rule build_hto_index.
mins_run_mapping: maximum runtime required (in minutes) for rule run_mapping.
mins_run_mapping_hto: maximum runtime required (in minutes) for rule run_mapping_hto.
mins_save_alevin_to_h5: maximum runtime required (in minutes) for rule save_alevin_to_h5.
mins_make_hto_sce_objects: maximum runtime required (in minutes) for rule make_hto_sce_objects.
mins_save_alevin_hto_to_h5: maximum runtime required (in minutes) for rule save_alevin_hto_to_h5.
mins_run_cellbender: maximum runtime required (in minutes) for rule run_cellbender.
mins_run_decontx: maximum runtime required (in minutes) for rule run_decontx.
mins_run_cell_celling: maximum runtime required (in minutes) for rule run_cell_celling.
mins_get_barcode_qc_metrics: maximum runtime required (in minutes) for rule get_barcode_qc_metrics.
mins_run_qc_one_run: maximum runtime required (in minutes) for rule run_qc_one_run.
mins_merge_qc: maximum runtime required (in minutes) for rule merge_qc.
mins_merge_rowdata: maximum runtime required (in minutes) for rule merge_rowdata.
mins_get_qc_sample_statistics: maximum runtime required (in minutes) for rule get_qc_sample_statistics.
mins_make_one_pb_empty: maximum runtime required (in minutes) for rule make_one_pb_empty.
mins_merge_pb_empty: maximum runtime required (in minutes) for rule merge_pb_empty.
mins_make_one_pb_cells: maximum runtime required (in minutes) for rule make_one_pb_cells.
mins_merge_pb_cells: maximum runtime required (in minutes) for rule merge_pb_cells.
mins_calculate_ambient_genes: maximum runtime required (in minutes) for rule calculate_ambient_genes.
mins_make_tmp_csr_matrix: maximum runtime required (in minutes) for rule make_tmp_csr_matrix.
mins_get_stats_for_std_variance_for_sample: maximum runtime required (in minutes) for rule get_stats_for_std_variance_for_sample.
mins_merge_sample_std_var_stats: maximum runtime required (in minutes) for rule merge_sample_std_var_stats.
mins_get_mean_var_for_group: maximum runtime required (in minutes) for rule get_mean_var_for_group.
mins_get_estimated_variances: maximum runtime required (in minutes) for rule get_estimated_variances.
mins_get_stats_for_std_variance_for_group: maximum runtime required (in minutes) for rule get_stats_for_std_variance_for_group.
mins_get_highly_variable_genes: maximum runtime required (in minutes) for rule get_highly_variable_genes.
mins_create_hvg_matrix: maximum runtime required (in minutes) for rule create_hvg_matrix.
mins_create_doublets_hvg_matrix: maximum runtime required (in minutes) for rule create_doublets_hvg_matrix.
mins_run_integration: maximum runtime required (in minutes) for rule run_integration.
mins_make_clean_h5ads: maximum runtime required (in minutes) for rule make_clean_h5ads.
mins_convert_h5ad_to_sce: maximum runtime required (in minutes) for rule convert_h5ad_to_sce.
mins_run_marker_genes: maximum runtime required (in minutes) for rule run_marker_genes.
mins_run_fgsea: maximum runtime required (in minutes) for rule run_fgsea.
mins_render_html_mapping: maximum runtime required (in minutes) for rule render_html_mapping.
mins_render_html_multiplexing: maximum runtime required (in minutes) for rule render_html_multiplexing.
mins_render_html_ambient: maximum runtime required (in minutes) for rule render_html_ambient.
mins_render_html_qc: maximum runtime required (in minutes) for rule render_html_qc.
mins_render_html_hvgs: maximum runtime required (in minutes) for rule render_html_hvgs.
mins_render_html_integration: maximum runtime required (in minutes) for rule render_html_integration.
mins_render_html_marker_genes: maximum runtime required (in minutes) for rule render_html_marker_genes.
mins_render_html_label_celltypes: maximum runtime required (in minutes) for rule render_html_label_celltypes.
mins_make_tmp_mtx_file: maximum runtime required (in minutes) for rule make_tmp_mtx_file.
mins_run_celltypist: maximum runtime required (in minutes) for rule run_celltypist.
mins_run_scprocess_labeller: maximum runtime required (in minutes) for rule run_scprocess_labeller.
mins_merge_labels: maximum runtime required (in minutes) for rule merge_labels.
mins_get_zoom_sample_statistics: maximum runtime required (in minutes) for rule get_zoom_sample_statistics.
mins_zoom_make_one_pb_cells: maximum runtime required (in minutes) for rule zoom_make_one_pb_cells.
mins_zoom_calculate_ambient_genes: maximum runtime required (in minutes) for rule zoom_calculate_ambient_genes.
mins_zoom_make_hvg_df: maximum runtime required (in minutes) for rule zoom_make_hvg_df.
mins_zoom_make_tmp_csr_matrix: maximum runtime required (in minutes) for rule zoom_make_tmp_csr_matrix.
mins_zoom_get_stats_for_std_variance_for_sample: maximum runtime required (in minutes) for rule zoom_get_stats_for_std_variance_for_sample.
mins_zoom_get_mean_var_for_group: maximum runtime required (in minutes) for rule zoom_get_mean_var_for_group.
mins_zoom_merge_group_mean_var: maximum runtime required (in minutes) for rule zoom_merge_group_mean_var.
mins_zoom_get_estimated_variances: maximum runtime required (in minutes) for rule zoom_get_estimated_variances.
mins_zoom_get_stats_for_std_variance_for_group: maximum runtime required (in minutes) for rule zoom_get_stats_for_std_variance_for_group.
mins_zoom_merge_stats_for_std_variance: maximum runtime required (in minutes) for rule zoom_merge_stats_for_std_variance.
mins_zoom_get_highly_variable_genes: maximum runtime required (in minutes) for rule zoom_get_highly_variable_genes.
mins_zoom_create_hvg_matrix: maximum runtime required (in minutes) for rule zoom_create_hvg_matrix.
mins_zoom_run_integration: maximum runtime required (in minutes) for rule zoom_run_integration.
mins_zoom_run_marker_genes: maximum runtime required (in minutes) for rule zoom_run_marker_genes.
mins_zoom_run_fgsea: maximum runtime required (in minutes) for rule zoom_run_fgsea.
mins_zoom_make_subsets: maximum runtime required (in minutes) for rule zoom_make_subsets.
mins_render_html_zoom: maximum runtime required (in minutes) for rule render_html_zoom.
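As an illustration of the naming pattern above (gb_<rule> for memory in GB, mins_<rule> for runtime in minutes), the fragment below overrides a handful of these limits. It is only a sketch: the values are placeholders chosen for the example, not defaults or recommendations, and it assumes these keys are placed in the same section of the configuration file as the other gb_* and mins_* parameters documented here.

```yaml
# Illustrative values only; not scprocess defaults.
# Assumption: these keys belong in the config section that holds
# the other gb_* / mins_* resource parameters.
gb_zoom_run_integration: 32      # memory limit (GB) for rule zoom_run_integration
gb_render_html_zoom: 8           # memory limit (GB) for rule render_html_zoom
mins_run_mapping: 240            # runtime limit (minutes) for rule run_mapping
mins_run_cellbender: 600         # runtime limit (minutes) for rule run_cellbender
mins_zoom_run_integration: 120   # runtime limit (minutes) for rule zoom_run_integration
```

Parameters that are not listed would presumably keep whatever values scprocess assigns by default, so only the rules that exceed their limits on a given dataset need an entry.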
-
Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol., 21(1):239, September 2020.