What is scprocess?¶

scprocess is a Snakemake pipeline that automates the processing of single-cell and single-nucleus RNA sequencing data generated with 10x Genomics technology. Starting from raw sequencing files, a metadata file and a config file, scprocess performs a series of standard analysis steps and delivers outputs suitable for a variety of downstream analyses. The pipeline supports analyses across multiple samples and is designed to keep workflows reproducible and scalable.

Steps¶

Overview¶

scprocess consists of a series of core steps which can be performed in a single execution of the workflow. Additional optional steps are available to extend the core analyses as needed. The diagram below outlines all steps, with detailed descriptions provided in the following sections.

[Figure: scprocess workflow chart]

The figure illustrates all steps in scprocess. Specific software packages used are listed for individual steps. Several steps generate HTML reports with diagnostic plots, enabling users to inspect the results at key points in the workflow.

Core pipeline steps¶

Read alignment and quantification¶

scprocess starts by mapping sequencing reads to the reference genome and performing transcript quantification using simpleaf¹. This open-source alternative to Cell Ranger is designed for speed and memory efficiency, while also offering the advantage of reporting counts for spliced and unspliced transcripts separately.
Ambient RNA removal (optional)¶

In droplet-based assays, cells or nuclei are encapsulated in droplets, but some freely-floating RNA can also be captured, which is referred to as ambient RNA. Ambient RNA contamination is particularly common in single-nucleus assays due to residual cytoplasmic material and harsh isolation protocols that can cause nuclei to rupture. Since ambient RNA can interfere with results of downstream analysis, it can be beneficial to remove this contamination in silico. In scprocess, users can select from CellBender² and DecontX³ to remove ambient RNA.
Cell calling¶

An important step in single-cell data analysis is distinguishing barcodes/droplets that contain cells/nuclei from those that are empty. scprocess offers several options for detecting barcodes that correspond to cell-containing droplets. When CellBender is used for ambient RNA removal, its built-in method generates a filtered counts matrix. If DecontX is selected or ambient RNA removal is skipped, users can choose between the barcodeRanks and emptyDrops methods from the DropletUtils⁴,⁵ R package. barcodeRanks separates cell-containing and empty droplet populations by detecting key transition points on the barcode-rank curve, while emptyDrops tests each barcode for significant deviations from the ambient profile.
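
To make the barcode-rank idea concrete, the sketch below locates a transition point on a log-log barcode-rank curve. It is a simplified geometric heuristic (the point lying farthest above the straight line joining the ends of the curve), not the algorithm implemented in DropletUtils, and the function name `knee_threshold` and the synthetic counts are assumptions made for illustration.

```python
import math


def knee_threshold(totals, lower=1):
    """Return the UMI total at the knee of the log-log barcode-rank curve.

    Simplified heuristic (an assumption, not the DropletUtils method): the
    knee is the point lying farthest above the straight line that joins the
    first and last points of the curve. `lower` must be at least 1.
    """
    counts = sorted((t for t in totals if t >= lower), reverse=True)
    if len(counts) < 3:
        raise ValueError("need at least three barcodes with counts >= lower")
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]

    def height_above_chord(i):
        # Signed, unnormalised distance of point i from the chord.
        return dx * (ys[i] - ys[0]) - dy * (xs[i] - xs[0])

    knee = max(range(len(counts)), key=height_above_chord)
    return counts[knee]


# Synthetic example: 100 cell-containing droplets and 1000 near-empty ones.
cells = [10000 - 50 * i for i in range(100)]   # 10000 down to 5050 UMIs
empties = [1 + i % 50 for i in range(1000)]    # 1 to 50 UMIs
totals = cells + empties

threshold = knee_threshold(totals)
called = [t for t in totals if t >= threshold]
print(threshold, len(called))
```

Barcodes with totals at or above the returned threshold are treated as cell-containing in this toy setting. Real data rarely shows such a clean separation, which is one reason a test-based method such as emptyDrops can be preferable.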
Doublet detection¶

Doublets, formed when two cells are captured in the same droplet, can distort single-cell RNA-seq results. scprocess addresses this by using scDblFinder⁶ for doublet detection. Additionally, scprocess integrates scDblFinder-flagged doublets with cells that pass the QC filtering step. This allows users to detect doublet-enriched clusters which can be removed from downstream analysis.
QC filtering¶

In addition to removing doublets, scprocess filters out cells based on library size, feature counts, mitochondrial read proportions and spliced read proportions using user-defined thresholds. The spliced proportion is a particularly informative metric in single-nuclei RNA-seq, as elevated levels may indicate residual cytoplasmic material⁷.
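
As a rough illustration of threshold-based filtering on these four metrics, consider the sketch below. The field names, the threshold names and the threshold values are all hypothetical; they do not reflect scprocess configuration keys or recommended defaults.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    library_size: int     # total UMI count
    n_features: int       # number of detected genes
    mito_prop: float      # fraction of counts from mitochondrial genes
    spliced_prop: float   # fraction of counts from spliced transcripts


# Hypothetical thresholds, for illustration only.
THRESHOLDS = {
    "min_library_size": 500,
    "min_features": 300,
    "max_mito_prop": 0.10,
    "max_spliced_prop": 0.75,
}


def passes_qc(cell, thr=THRESHOLDS):
    """Return True if the cell satisfies every threshold."""
    return (
        cell.library_size >= thr["min_library_size"]
        and cell.n_features >= thr["min_features"]
        and cell.mito_prop <= thr["max_mito_prop"]
        and cell.spliced_prop <= thr["max_spliced_prop"]
    )


cells = [
    CellQC("AAAC", 2500, 1200, 0.02, 0.30),   # passes all filters
    CellQC("AAAG", 300, 150, 0.01, 0.25),     # library too small
    CellQC("AACC", 4000, 1800, 0.25, 0.40),   # high mitochondrial proportion
    CellQC("AACG", 3000, 1500, 0.03, 0.90),   # high spliced proportion
]
kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```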
Generating pseudobulks from cells and empty droplets and ambient gene detection¶

scprocess aggregates counts from both cell-containing and empty droplets into pseudobulk profiles to identify genes enriched in empty droplets—i.e., ambient genes. These ambient genes likely represent residual contamination rather than true biological signals. Identifying them supports cleaner and more accurate downstream analyses and helps prioritize genes less affected by contamination.
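
The idea can be sketched as follows: sum counts over each group of droplets, normalise to counts per million, and compare the two profiles gene by gene. This is only a plain log-ratio under assumed inputs; the statistical procedure scprocess actually applies is not described here, and the gene names, the cutoff of 1 and the helper names are hypothetical.

```python
import math
from collections import Counter


def pseudobulk(droplets):
    """Sum per-gene counts over droplets (each a dict of gene -> count)."""
    total = Counter()
    for counts in droplets:
        total.update(counts)
    return total


def log2_cpm_ratio(empty_pb, cell_pb, pseudocount=1.0):
    """Per-gene log2 ratio of counts per million, empty droplets over cells."""
    empty_total = sum(empty_pb.values())
    cell_total = sum(cell_pb.values())
    if empty_total == 0 or cell_total == 0:
        raise ValueError("both pseudobulks need a non-zero total count")
    ratios = {}
    for gene in set(empty_pb) | set(cell_pb):
        empty_cpm = 1e6 * empty_pb.get(gene, 0) / empty_total + pseudocount
        cell_cpm = 1e6 * cell_pb.get(gene, 0) / cell_total + pseudocount
        ratios[gene] = math.log2(empty_cpm / cell_cpm)
    return ratios


# Toy data with hypothetical gene names.
cell_droplets = [{"GENE_A": 90, "GENE_B": 10}, {"GENE_A": 80, "GENE_B": 20}]
empty_droplets = [{"GENE_A": 2, "GENE_B": 8}, {"GENE_A": 1, "GENE_B": 9}]

ratios = log2_cpm_ratio(pseudobulk(empty_droplets), pseudobulk(cell_droplets))
ambient = sorted(g for g, r in ratios.items() if r > 1.0)  # illustrative cutoff
print(ambient)
```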
Highly variable gene detection¶

Highly variable gene detection in scprocess is performed using the Seurat VST⁸ method in a chunk-wise or sample-wise manner, enabling the computation of ranking metrics for all genes without the need to load the entire dataset into memory. The efficient generation of a reduced matrix containing only highly variable genes ensures optimal performance in downstream analyses and facilitates the processing of larger datasets.
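
The sketch below shows one way per-gene statistics can be accumulated chunk by chunk, so that only one chunk is in memory at a time. It computes plain means and sample variances by merging per-chunk summaries; it does not reproduce the Seurat VST standardisation, and the function names and the tiny count matrix are assumptions for illustration.

```python
def chunk_stats(chunk):
    """Per-gene (n, means, M2) for one chunk; rows are cells, columns genes."""
    n = len(chunk)
    n_genes = len(chunk[0])
    means = [sum(row[g] for row in chunk) / n for g in range(n_genes)]
    m2 = [sum((row[g] - means[g]) ** 2 for row in chunk) for g in range(n_genes)]
    return n, means, m2


def merge_stats(a, b):
    """Combine two (n, means, M2) summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    means, m2 = [], []
    for ma, mb, sa, sb in zip(mean_a, mean_b, m2_a, m2_b):
        delta = mb - ma
        means.append(ma + delta * n_b / n)
        m2.append(sa + sb + delta * delta * n_a * n_b / n)
    return n, means, m2


def streaming_gene_variance(chunks):
    """Return per-gene means and sample variances over all chunks."""
    stats = None
    for chunk in chunks:
        current = chunk_stats(chunk)
        stats = current if stats is None else merge_stats(stats, current)
    if stats is None or stats[0] < 2:
        raise ValueError("need at least two cells in total")
    n, means, m2 = stats
    return means, [s / (n - 1) for s in m2]


# Two chunks of two cells each, two genes per cell.
chunks = [
    [[1, 0], [3, 0]],
    [[5, 2], [7, 4]],
]
means, variances = streaming_gene_variance(chunks)
ranked = sorted(range(len(variances)), key=lambda g: variances[g], reverse=True)
print(means, variances, ranked)
```

Because `chunks` can be any iterable, a generator that reads one block of cells at a time from disk would work equally well here.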
Integration¶

After identifying highly variable genes, scprocess proceeds with dimensionality reduction and clustering. In multi-sample analysis, various factors can introduce batch effects that obscure true biological signals. To mitigate this, scprocess offers the option to compute batch-corrected PCA embeddings using Harmony⁹, so that clustering and downstream analyses reflect biological relationships rather than technical variation. To perform the integration step, users can choose between a standard Scanpy-based workflow¹⁰ or a workflow with equivalent functionality using RAPIDS-singlecell. The latter leverages GPU acceleration to achieve substantial speedups, particularly for clustering and UMAP¹¹,¹². Additionally, users can optionally run partition-based graph abstraction (PAGA)¹³ within the integration step.
Marker gene identification¶

Assigning meaningful labels to clusters is essential for interpreting single-cell data. This is commonly achieved by examining marker genes for each cluster, identified by comparing the expression profile of each cluster against all others. In scprocess, transcript counts are aggregated per cluster within each sample to generate "pseudobulk" values, which are then compared using edgeR¹⁴,¹⁵. This approach avoids the assumption that individual cells from the same sample are independent, thereby enhancing the statistical reliability of the results. For human and mouse datasets, scprocess also performs gene set enrichment analysis on all marker genes and includes visualizations of user-defined gene sets in the HTML report.
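
The aggregation step described above can be sketched as a group-by over (sample, cluster) pairs. The input layout, the sample and cluster labels, and the function name are hypothetical; the sketch covers only the summation, not the edgeR comparison that follows it.

```python
from collections import defaultdict


def pseudobulk_by_group(cells):
    """Sum counts per (sample, cluster).

    `cells` is an iterable of (sample, cluster, counts) tuples, where
    `counts` maps gene -> count for a single cell.
    """
    groups = defaultdict(lambda: defaultdict(int))
    for sample, cluster, counts in cells:
        for gene, n in counts.items():
            groups[(sample, cluster)][gene] += n
    return {key: dict(genes) for key, genes in groups.items()}


# Toy data with hypothetical sample, cluster and gene names.
cells = [
    ("sample1", "cl1", {"GENE_A": 5, "GENE_B": 1}),
    ("sample1", "cl1", {"GENE_A": 3}),
    ("sample1", "cl2", {"GENE_B": 7}),
    ("sample2", "cl1", {"GENE_A": 4, "GENE_B": 2}),
]
pb = pseudobulk_by_group(cells)
print(pb[("sample1", "cl1")])
```

Each (sample, cluster) pair then contributes one column of summed counts, so the unit of replication in the comparison is the sample rather than the individual cell.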

Optional steps¶

Processing multiplexed samples¶

Multiplexing strategies are commonly used to scale up single-cell experiments by enabling the analysis of multiple samples in a single run. Common approaches include labeling cells in individual samples with lipid-bound or antibody-conjugated oligonucleotides (hashtag oligos, or HTOs) prior to pooling. Alternatively, sample labels can be derived based on differences in genetic backgrounds. scprocess supports the analysis of multiplexed samples by quantifying HTOs and demultiplexing samples using the Seurat HTODemux¹⁶ function. It also accommodates outputs from external demultiplexing tools, enabling seamless processing of samples regardless of the multiplexing strategy employed.
Gene set enrichment analysis¶

scprocess includes an option to perform gene set enrichment analysis (GSEA)¹⁷ on the set of identified marker genes using the fgsea¹⁸ R package. By identifying biological processes and pathways unique to each cluster, GSEA can provide additional evidence for characterizing cellular identity.
Cell type labelling¶

scprocess provides automated cell type annotation of human brain datasets using an XGBoost classifier trained on an adult human whole-brain dataset¹⁹. In addition, scprocess supports cell type annotation using pre-trained models available through CellTypist²⁰,²¹.
Subclustering¶

scprocess offers a subclustering feature that enables users to perform a second round of analysis on a specific subset of cells. This round includes the following steps: generating pseudobulks from the selected cells and detecting ambient genes, identifying highly variable genes, data integration, and marker gene identification. Cell subsets can be defined based on user-provided cell type labels, clusters identified during the primary round of scprocess, or cell type labels assigned with a selected classifier. This functionality is particularly valuable when a primary cluster or cell type contains diverse cell states, developmental stages, or activation states that warrant more detailed exploration.

Dongze He and Rob Patro. Simpleaf: a simple, flexible, and scalable framework for single-cell data processing using alevin-fry. Bioinformatics, October 2023. ↩
Stephen J Fleming, Mark D Chaffin, Alessandro Arduini, Amer-Denis Akkad, Eric Banks, John C Marioni, Anthony A Philippakis, Patrick T Ellinor, and Mehrtash Babadi. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nat. Methods, 20(9):1323–1335, September 2023. ↩
Shiyi Yang, Sean E Corbett, Yusuke Koga, Zhe Wang, W Evan Johnson, Masanao Yajima, and Joshua D Campbell. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol., 21(1):57, March 2020. ↩
Aaron T L Lun, Samantha Riesenfeld, Tallulah Andrews, The Phuong Dao, Tomas Gomes, participants in the 1st Human Cell Atlas Jamboree, and John C Marioni. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol., 20(1):63, March 2019. ↩
Jonathan A Griffiths, Arianne C Richard, Karsten Bach, Aaron T L Lun, and John C Marioni. Detection and removal of barcode swapping in single-cell RNA-seq data. Nat. Commun., 9(1):2667, July 2018. ↩
Pierre-Luc Germain, Aaron Lun, Will Macnair, and Mark D Robinson. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res., 10:979, September 2021. ↩
Tomàs Montserrat-Ayuso and Anna Esteve-Codina. High content of nuclei-free low-quality cells in reference single-cell atlases: a call for more stringent quality control using nuclear fraction. BMC Genomics, 25(1):1124, November 2024. ↩
Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, 3rd, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888–1902.e21, June 2019. ↩
Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods, 16(12):1289–1296, December 2019. ↩
F Alexander Wolf, Philipp Angerer, and Fabian J Theis. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol., December 2018. ↩
Severin Dicks. GPU-accelerated single-cell RNA analysis with RAPIDS-singlecell. https://developer.nvidia.com/blog/gpu-accelerated-single-cell-rna-analysis-with-rapids-singlecell/, June 2023. Accessed: 2025-12-29. ↩
Corey Nolet, Avantika Lal, Rajesh Ilango, Taurean Dyer, Rajiv Movva, John Zedlewski, and Johnny Israeli. Accelerating single-cell genomic analysis with GPUs. bioRxiv, May 2022. ↩
F Alexander Wolf, Fiona K Hamey, Mireya Plass, Jordi Solana, Joakim S Dahlin, Berthold Göttgens, Nikolaus Rajewsky, Lukas Simon, and Fabian J Theis. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol., 20(1):59, March 2019. ↩
Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, January 2010. ↩
Yunshun Chen, Lizhong Chen, Aaron T L Lun, Pedro L Baldoni, and Gordon K Smyth. edgeR v4: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. Nucleic Acids Res., January 2025. ↩
Marlon Stoeckius, Shiwei Zheng, Brian Houck-Loomis, Stephanie Hao, Bertrand Z Yeung, William M Mauck, 3rd, Peter Smibert, and Rahul Satija. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol., 19(1):224, December 2018. ↩
Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, and Jill P Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A., 102(43):15545–15550, October 2005. ↩
Gennady Korotkevich, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N Artyomov, and Alexey Sergushichev. Fast gene set enrichment analysis. bioRxiv, June 2016. ↩
Kimberly Siletti, Rebecca Hodge, Alejandro Mossi Albiach, Ka Wai Lee, Song-Lin Ding, Lijuan Hu, Peter Lönnerberg, Trygve Bakken, Tamara Casper, Michael Clark, Nick Dee, Jessica Gloe, Daniel Hirschstein, Nadiya V Shapovalova, C Dirk Keene, Julie Nyhus, Herman Tung, Anna Marie Yanny, Ernest Arenas, Ed S Lein, and Sten Linnarsson. Transcriptomic diversity of cell types across the adult human brain. Science, 382(6667):eadd7046, October 2023. ↩
Chuan Xu, Martin Prete, Simone Webb, Laura Jardine, Benjamin J Stewart, Regina Hoo, Peng He, Kerstin B Meyer, and Sarah A Teichmann. Automatic cell-type harmonization and integration across human cell atlas datasets. Cell, 186(26):5876–5891.e20, December 2023. ↩
C Domínguez Conde, C Xu, L B Jarvis, D B Rainbow, S B Wells, T Gomes, S K Howlett, O Suchanek, K Polanski, H W King, L Mamanova, N Huang, P A Szabo, L Richardson, L Bolt, E S Fasouli, K T Mahbubani, M Prete, L Tuck, N Richoz, Z K Tuong, L Campos, H S Mousa, E J Needham, S Pritchard, T Li, R Elmentaite, J Park, E Rahmani, D Chen, D K Menon, O A Bayraktar, L K James, K B Meyer, N Yosef, M R Clatworthy, P A Sims, D L Farber, K Saeb-Parsy, J L Jones, and S A Teichmann. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science, 376(6594):eabl5197, May 2022. ↩
