Usage¶

Basic usage¶

Assuming the required hardware is available, all software is installed, and you have successfully set up the scprocess data directory, you can run scprocess on your dataset by following the steps outlined below:

Note: scprocess expects a dataset with multiple samples.

1. Prepare project directory¶

scprocess relies on the workflowr¹ project directory template. You can create a new workflowr project using scprocess newproj, as follows:

# create a new project in the current directory, with subdirectories for FASTQ and metadata files, and a default config file
# use the -c flag with the sc option if you have single cell data or the sn option if you have single nuclei data
scprocess newproj my_project -c sc

2. Prepare input files¶

scprocess requires 2 types of input files:

FASTQ files generated using 10x Genomics technology: FASTQ file names must contain a [SAMPLE_ID] as well as an _R1/.R1 or _R2/.R2 label for read one (forward read) and read two (reverse read), respectively. For example:

[SAMPLE_ID]*_R1*.fastq.gz and [SAMPLE_ID]*_R2*.fastq.gz, where * stands for any sequence of characters.
Metadata: a CSV file with sample information. The only required column in the metadata file is sample_id, whose values must match the [SAMPLE_ID] labels in the FASTQ file names.

For examples of input files, see the Quick start tutorial.
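The naming rules above can be checked before a run. The following is a minimal sketch, not part of scprocess: the helper names `missing_fastqs` and `check_inputs` are hypothetical, and it assumes the FASTQ files sit directly in one directory rather than in subdirectories.

```python
import csv
import fnmatch
import glob
from pathlib import Path


def missing_fastqs(sample_ids, fastq_names):
    """Return {sample_id: [read labels]} for samples lacking an R1 or R2 file."""
    missing = {}
    for sample_id in sample_ids:
        absent = []
        for read in ("R1", "R2"):
            # Accept both labelling styles described above: _R1/_R2 and .R1/.R2.
            # glob.escape protects any wildcard characters inside the sample ID.
            patterns = [
                f"{glob.escape(sample_id)}*{sep}{read}*.fastq.gz" for sep in ("_", ".")
            ]
            found = any(
                fnmatch.fnmatchcase(name, pattern)
                for name in fastq_names
                for pattern in patterns
            )
            if not found:
                absent.append(read)
        if absent:
            missing[sample_id] = absent
    return missing


def check_inputs(metadata_csv, fastq_dir):
    """Read sample_id values from the metadata CSV and compare them with a FASTQ directory."""
    with open(metadata_csv, newline="") as handle:
        sample_ids = [row["sample_id"] for row in csv.DictReader(handle)]
    # Assumption: only the top level of fastq_dir is inspected.
    fastq_names = [path.name for path in Path(fastq_dir).glob("*.fastq.gz")]
    return missing_fastqs(sample_ids, fastq_names)
```

An empty dictionary means every sample has both reads. Because the pattern starts with the sample ID followed by a wildcard, an ID such as S1 would also match files belonging to S10, so IDs that are prefixes of one another deserve a manual check.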

3. Prepare configuration file¶

scprocess requires a configuration YAML file where you can specify analysis parameters. This is an example of a configuration file with all required parameters.

project:
  proj_dir: /path/to/proj/directory
  fastq_dir: /path/to/directory/with/fastq/files
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: where you work
  sample_metadata: /path/to/metadata.csv
  ref_txome: human_2024
  date_stamp: "2026-01-01"
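As a quick sanity check on a configuration, the required keys shown in the example above can be verified programmatically. This is a minimal sketch rather than scprocess functionality: `missing_project_keys` is a hypothetical helper, and it assumes the YAML file has already been loaded into a Python dictionary by a YAML parser of your choice.

```python
# Keys taken from the example configuration above.
REQUIRED_PROJECT_KEYS = (
    "proj_dir",
    "fastq_dir",
    "full_tag",
    "short_tag",
    "your_name",
    "affiliation",
    "sample_metadata",
    "ref_txome",
    "date_stamp",
)


def missing_project_keys(config):
    """Return the required keys that are absent or empty in the 'project' section."""
    project = config.get("project") or {}
    return [key for key in REQUIRED_PROJECT_KEYS if project.get(key) in (None, "")]
```

An empty list indicates that all required entries of the project section are present.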

4. Run the analysis¶

To run scprocess use:

scprocess run /path/to/config-my_project.yaml

To perform a dry run, add the -n or --dry-run flag to this command. If you need other Snakemake options that scprocess run does not include by default, pass them with the -E or --extraargs flag. For example, to set a global maximum for the number of threads available to any rule, use:

scprocess run /path/to/config-my_project.yaml -E "--max-threads 8"

Why should you always consider a dry run?

Before launching your analysis, always consider performing a dry run of your scprocess run or scprocess setup command. A dry run lets the workflow manager map out the entire execution plan and display a summary of the tasks to be performed without running any scripts or consuming computational resources. It serves as a safety check that all input files are accessible and that the job sequence matches your expectations before you commit to the full execution of the pipeline.

The initial run of scprocess will take some time as the Conda environments are built. You can pre-build these environments without running the workflow by using the --create-envs flag:

scprocess run /path/to/config-my_project.yaml --create-envs

By default scprocess run will run rule all which includes all core steps. The optional steps (with the exception of gene set enrichment analysis) can run only after rule all is completed and have to be specifically requested.

Additionally, you can run individual rules that generate HTML outputs (mapping, ambient, demux, qc, hvg, integration, marker_genes). This is useful if you want to inspect the HTML outputs of the intermediate steps before continuing with the analysis. To run a rule separately, specify it with the -r or --rule flag, for example:

scprocess run /path/to/config.yaml -r qc
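If you launch scprocess from a script, the flags described in this step can be assembled programmatically. The sketch below is an illustration only: `build_run_command` and `dry_run_then_run` are hypothetical helpers, and the sketch assumes that -n, -r and -E can be combined in a single call.

```python
import subprocess


def build_run_command(config_path, dry_run=False, rule=None, extra_args=None):
    """Assemble the argument list for 'scprocess run' from the flags described above."""
    command = ["scprocess", "run", str(config_path)]
    if dry_run:
        command.append("-n")
    if rule is not None:
        command += ["-r", rule]
    if extra_args:
        # Passed as one argument, like the quoted string in the shell example.
        command += ["-E", extra_args]
    return command


def dry_run_then_run(config_path, rule=None, extra_args=None):
    """Run a dry run first and start the real run only if the dry run succeeds."""
    subprocess.run(
        build_run_command(config_path, dry_run=True, rule=rule, extra_args=extra_args),
        check=True,
    )
    subprocess.run(
        build_run_command(config_path, rule=rule, extra_args=extra_args),
        check=True,
    )
```

With check=True, a failing dry run raises an exception, so the real run is never started on a configuration that the workflow manager rejects.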

Analysis of multiplexed samples¶

scprocess supports two approaches for handling multiplexed samples:

Hashtag oligo (HTO)-based demultiplexing: scprocess uses HTO-derived cDNA libraries to generate a count matrix which can be used for sample demultiplexing.
Outputs of external demultiplexing algorithms: If the data has already been demultiplexed using an external method (e.g. genetic demultiplexing tools like Demuxlet⁵), users can provide a cell-sample assignment file to process the data further using scprocess.

Input files¶

Processing multiplexed samples requires a different format for the sample metadata CSV file. In addition to the standard sample_id column, the following columns must be included:

pool_id: specifies the pool to which each sample belongs. Instead of values in the sample_id column, FASTQ filenames must match values in the pool_id column.
hto_id: only required for HTO-based demultiplexing. Specifies the HTO label used to tag each sample before pooling.

The dataset must consist entirely of either multiplexed or non-multiplexed samples. Mixed datasets are not supported.
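The metadata requirements for multiplexed samples can be checked with a few lines of code. This is a minimal sketch under stated assumptions: `check_multiplexed_metadata` is a hypothetical helper, rows are dictionaries such as those produced by Python's csv.DictReader, and the rule that an HTO label is used only once within a pool is an assumption inferred from the need to tell samples apart, not a documented scprocess check.

```python
def check_multiplexed_metadata(rows, hto_based=True):
    """Return a list of problems found in metadata rows for multiplexed samples."""
    problems = []
    required = ["sample_id", "pool_id"] + (["hto_id"] if hto_based else [])
    for column in required:
        # A partly empty pool_id column also reveals a mix of multiplexed
        # and non-multiplexed samples, which is not supported.
        if any(not (row.get(column) or "").strip() for row in rows):
            problems.append(f"column '{column}' is missing or has empty values")
    if hto_based and not problems:
        seen = set()
        for row in rows:
            key = (row["pool_id"], row["hto_id"])
            if key in seen:
                # Assumption: within one pool each HTO tags exactly one sample.
                problems.append(
                    f"HTO '{row['hto_id']}' is used more than once in pool '{row['pool_id']}'"
                )
            seen.add(key)
    return problems
```

The same HTO label appearing in two different pools is not reported, since labels can be shared across pools.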

Figure: Schematic representation of sample multiplexing for single-cell sequencing. Individual samples (with corresponding names in the sample_id column) are labeled with antibodies carrying different HTOs (with corresponding labels in the hto_id column). These labeled samples are then combined into pools (with corresponding names in the pool_id column). HTO labels can be shared across different pools.

Options for integrating multiplexed samples¶

scprocess offers two approaches to integration of multiplexed samples, defined by int_batch_var in the integration section of the configuration file. The two possibilities are:

Batch correction performed with sample_id as a batch variable: this is the default/standard approach which relies on Seurat HTODemux⁶ for accurate cell-to-sample assignment and doublet identification. If selected, any cells flagged as doublets by either scDblFinder⁷ or Seurat HTODemux are excluded from the analysis.
Batch correction performed with pool_id as a batch variable: This approach might be preferred if a significant proportion of cells cannot be confidently demultiplexed, which may point to technical issues with sample multiplexing rather than poor cell quality. If selected, unassigned cells are retained when integrating and defining clusters, with the aim of improving clustering results. Doublet calls from Seurat HTODemux are ignored.
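The practical difference between the two settings is which cells enter integration. The sketch below illustrates this with made-up cell records and is not scprocess code: the field names `scdblfinder_doublet` and `hto_call` are hypothetical. It also assumes that scDblFinder doublets are removed under both settings and that unassigned cells are dropped when sample_id is the batch variable.

```python
def cells_for_integration(cells, int_batch_var):
    """Return the barcodes retained for integration under the chosen batch variable."""
    if int_batch_var not in ("sample_id", "pool_id"):
        raise ValueError("int_batch_var must be 'sample_id' or 'pool_id'")
    kept = []
    for cell in cells:
        # Assumption: scDblFinder doublets are excluded under both settings.
        if cell["scdblfinder_doublet"]:
            continue
        # With sample_id, HTODemux doublets are excluded; unassigned cells are
        # assumed to be excluded too, since they have no sample to batch by.
        if int_batch_var == "sample_id" and cell["hto_call"] in ("doublet", "unassigned"):
            continue
        # With pool_id, HTODemux calls are ignored and unassigned cells are kept.
        kept.append(cell["barcode"])
    return kept
```

Comparing the two outputs for the same pool shows how many cells the pool_id setting would recover when demultiplexing quality is low.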

Best practices¶

The default parameters in the configuration file are suitable for running scprocess on the example dataset. Here are some of the most important things to consider when you are running scprocess on your own dataset:

Setting parameters for ambient contamination removal¶

Ambient method¶

By default, scprocess uses DecontX² for ambient RNA removal, which does not require a GPU. If a GPU is available, we recommend using CellBender³ for ambient RNA decontamination, as it was found to perform better than other related algorithms in a recent benchmark⁴.

Knee parameters¶

Both algorithms for ambient RNA decontamination available in scprocess estimate background noise from empty droplets. Therefore, correctly identifying the subset of barcodes corresponding to empty droplets is critical. In the barcode-rank "knee plot", where barcodes are ranked in descending order based on their library size, two distinct plateaus are typically observed: the first plateau represents droplets containing cells with high RNA content, while the second corresponds to empty droplets containing ambient RNA.
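To make the barcode-rank curve concrete, the following minimal sketch builds the points of a knee plot from a list of per-barcode library sizes. The function name `barcode_rank_curve` is hypothetical, and the log10 scaling of both axes is a common plotting convention assumed here rather than something scprocess prescribes.

```python
import math


def barcode_rank_curve(library_sizes):
    """Return (log10 rank, log10 library size) points, largest library first."""
    # Barcodes without counts cannot be placed on a log scale and are dropped.
    ranked = sorted((size for size in library_sizes if size > 0), reverse=True)
    return [
        (math.log10(rank), math.log10(size))
        for rank, size in enumerate(ranked, start=1)
    ]
```

Plotting these points shows the two plateaus described above as nearly flat stretches separated by a steep drop.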

scprocess identifies the cell-containing and empty droplet populations by detecting key transition points on the barcode-rank curve, namely the inflection and knee points. These points allow scprocess to infer the optimal parameters for both DecontX and CellBender. Additionally, scprocess uses these estimates to identify genes enriched in empty droplets.

We recommend verifying the accuracy of these parameters by inspecting knee plots after running mapping. The two main parameters inferred by scprocess based on transition points in the barcode-rank curve are expected_cells and empty_plateau_middle (which corresponds to the --total-droplets-included parameter in CellBender). The empty_plateau_middle should extend a few thousand barcodes into the second plateau.

Figure: Three examples of knee plots with four transition points (knee1, shin1, knee2, shin2) represented by purple horizontal lines and two inferred parameters (expected_cells and empty_plateau_middle) represented by blue vertical lines. Example (A) shows a knee plot with two clearly distinguished plateaus and properly predicted parameters. Example (B) shows the same knee plot, but with wrongly predicted parameters. Example (C) represents a sample with very high ambient RNA contamination, making it impossible to distinguish the cell and empty-droplet populations by eye.

To identify problematic samples, scprocess computes two diagnostic ratios:

expected_cells/empty_plateau_middle ratio: this helps assess whether the estimated number of cells is reasonable compared to the empty_plateau_middle. In examples B and C this ratio would be elevated.
slope_ratio: This is the ratio of the slope of the barcode-rank curve in the empty droplet region to the slope at the first inflection point (shin1). Samples with a high slope ratio, as seen in example C, are likely problematic because the empty droplet plateau is not clearly distinguishable. In such cases, ambient RNA decontamination algorithms like DecontX and CellBender may struggle to accurately estimate background noise, and we recommend considering removing these samples from further analysis.
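The two diagnostics can be illustrated with a small calculation. This is a sketch under stated assumptions, not the scprocess implementation: slopes are measured in log10-log10 space over a window around a given rank, `shin1_rank` is a hypothetical argument giving the rank at which the curve passes the shin1 transition point, and the window width `span` is arbitrary.

```python
import math


def log_log_slope(ranked_sizes, rank_start, rank_end):
    """Slope of log10(library size) against log10(rank) between two 1-based ranks."""
    rise = math.log10(ranked_sizes[rank_end - 1]) - math.log10(ranked_sizes[rank_start - 1])
    run = math.log10(rank_end) - math.log10(rank_start)
    return rise / run


def knee_diagnostics(ranked_sizes, expected_cells, empty_plateau_middle, shin1_rank, span=1.5):
    """Compute the two diagnostic ratios from library sizes sorted in descending order."""

    def slope_around(rank):
        # Window from rank/span to rank*span, clipped to the available barcodes.
        low = max(1, int(rank / span))
        high = min(len(ranked_sizes), int(rank * span))
        return log_log_slope(ranked_sizes, low, high)

    return {
        "expected_to_empty_ratio": expected_cells / empty_plateau_middle,
        # Raises ZeroDivisionError if the curve is perfectly flat around shin1_rank.
        "slope_ratio": slope_around(empty_plateau_middle) / slope_around(shin1_rank),
    }
```

On a curve with a flat empty-droplet plateau the slope ratio is close to 0, while on a curve that keeps falling at the same rate, with no visible plateau, it approaches 1.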

If scprocess fails to estimate the knee plot parameters but the barcode-rank curve appears normal, we suggest manually adjusting the knee1, knee2, shin1, and shin2 parameters in the custom_sample_params file. A convenient way to fine-tune these parameters is by using the plotknee function in scprocess. This allows for easy visualization and adjustment of knee points.

Setting QC parameters¶

It is recommended to review the QC HTML report generated by scprocess and adjust QC thresholds if needed. Detailed descriptions of the QC parameters that can be modified are available in the Reference section. For instance, if you anticipate a lower number of cells within a sample, we suggest lowering the value of the qc_min_cells parameter. Additionally, depending on whether you are working with single cell or single nuclei data, you might want to adjust the default thresholds for the minimum and maximum allowed spliced proportions (qc_min_splice and qc_max_splice parameters) in the configuration file.
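The effect of these thresholds can be explored on a table of per-cell metrics before changing the configuration. The sketch below is illustrative only: `apply_qc` and the field names `spliced_prop`, `sample_id` and `barcode` are hypothetical. It also assumes that the splice bounds are inclusive and that qc_min_cells acts as a minimum number of cells per sample after filtering.

```python
def apply_qc(cells, qc_min_splice, qc_max_splice, qc_min_cells):
    """Return {sample_id: [barcodes]} for cells and samples passing the thresholds."""
    by_sample = {}
    for cell in cells:
        # Keep cells whose spliced proportion lies within the allowed range.
        if qc_min_splice <= cell["spliced_prop"] <= qc_max_splice:
            by_sample.setdefault(cell["sample_id"], []).append(cell["barcode"])
    # Assumption: samples left with fewer than qc_min_cells cells are dropped.
    return {
        sample_id: barcodes
        for sample_id, barcodes in by_sample.items()
        if len(barcodes) >= qc_min_cells
    }
```

Running the function with a lower qc_min_cells value shows which small samples would be recovered by relaxing the threshold.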

1. John D Blischak, Peter Carbonetto, and Matthew Stephens. Creating and sharing reproducible research code the workflowr way. F1000Res., 8(1749):1749, October 2019.
2. Shiyi Yang, Sean E Corbett, Yusuke Koga, Zhe Wang, W Evan Johnson, Masanao Yajima, and Joshua D Campbell. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol., 21(1):57, March 2020.
3. Stephen J Fleming, Mark D Chaffin, Alessandro Arduini, Amer-Denis Akkad, Eric Banks, John C Marioni, Anthony A Philippakis, Patrick T Ellinor, and Mehrtash Babadi. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nat. Methods, 20(9):1323–1335, September 2023.
4. Philipp Janssen, Zane Kliesmete, Beate Vieth, Xian Adiconis, Sean Simmons, Jamie Marshall, Cristin McCabe, Holger Heyn, Joshua Z Levin, Wolfgang Enard, and Ines Hellmann. The effect of background noise and its removal on the analysis of single-cell expression data. Genome Biol., 24(1):140, June 2023.
5. Hyun Min Kang, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, Simon Wong, Lauren Byrnes, Cristina M Lanata, Rachel E Gate, Sara Mostafavi, Alexander Marson, Noah Zaitlen, Lindsey A Criswell, and Chun Jimmie Ye. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol., 36(1):89–94, January 2018.
6. Marlon Stoeckius, Shiwei Zheng, Brian Houck-Loomis, Stephanie Hao, Bertrand Z Yeung, William M Mauck, 3rd, Peter Smibert, and Rahul Satija. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol., 19(1):224, December 2018.
7. Pierre-Luc Germain, Aaron Lun, Will Macnair, and Mark D Robinson. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res., 10:979, September 2021.