Tutorials
In this section we will run scprocess on small example datasets. To follow the tutorials, make sure you have completed all the steps in the Getting started section.
Tutorial 1: a basic workflow
A single nuclei dataset for this tutorial was generated by downsampling the data in this study: BTK regulates microglial function and neuroinflammation in human stem cell models and mouse models of multiple sclerosis.
More about the tutorial dataset
This tutorial uses single nuclei RNA sequencing data from mouse spinal cord samples. The data comes from a study of the Experimental Autoimmune Encephalomyelitis (EAE) model of multiple sclerosis, which compared the effects of a Bruton's tyrosine kinase (BTK) inhibitor to a vehicle control. The raw data is available in Gene Expression Omnibus (GEO) under accession GSE281176.
We selected a subset of sequencing files from four different samples (GSM8612257 - SRR31256781, GSM8612258 - SRR31256780, GSM8612261 - SRR31256774, GSM8612262 - SRR31256771).
To generate smaller files via downsampling, we first processed the FASTQ files with Cell Ranger to generate BAM files. We then downsampled the BAM files by retaining 40% of the barcodes called as cells, 40% of barcodes in the empty plateau, and 30% of the remaining barcodes. Additionally, we discarded any barcodes with fewer than 10 unique molecular identifiers (UMIs) and all reads flagged as PCR duplicates. The filtered BAM files were then converted back to FASTQ format, resulting in the dataset used for this tutorial.
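The barcode selection described above can be sketched as stratified random sampling. This is only an illustration, not the code used to build the tutorial dataset: the function name `downsample_barcodes`, the category names, and the choice to drop low-UMI barcodes before sampling are all assumptions.

```python
import random


def downsample_barcodes(umi_counts, categories, keep_fraction, min_umis=10, seed=0):
    """Pick the barcodes to retain (hypothetical helper, not part of scprocess).

    umi_counts:    dict mapping barcode -> UMI count
    categories:    dict mapping barcode -> category name, e.g. "cell", "empty", "other"
    keep_fraction: dict mapping category name -> fraction of barcodes to keep
    """
    rng = random.Random(seed)
    kept = set()
    for category, fraction in keep_fraction.items():
        # Assumption: low-count barcodes are dropped before sampling within a category.
        eligible = sorted(
            bc for bc, cat in categories.items()
            if cat == category and umi_counts[bc] >= min_umis
        )
        n_keep = round(len(eligible) * fraction)
        kept.update(rng.sample(eligible, n_keep))
    return kept
```

For the fractions quoted above, `keep_fraction` would be `{"cell": 0.4, "empty": 0.4, "other": 0.3}`.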
Creating a new project directory and preparing input data
First we will create a new project directory where all outputs of scprocess for this dataset will be stored:
# create a new directory called test_project in your current working directory
# add the -c flag with the sn option to generate a template configuration file with qc thresholds typically used for single nuclei data
scprocess newproj test_project -s -c sn
# change your working directory to test_project
cd test_project
Using the command tree ., you can inspect the structure of the test_project directory:
.
├── analysis
│   ├── about.Rmd
│   ├── custom.css
│   ├── index.Rmd
│   ├── license.Rmd
│   └── _site.yml
├── code
├── config-test_project.yaml
├── data
│   ├── fastqs
│   └── metadata
├── output
├── public
├── test_project.Rproj
└── _workflowr.yml
In the analysis/ directory, scprocess will store all R Markdown files that are used to create HTML reports in the public/ directory. In the code/ directory all R scripts used in scprocess will be stored.
To store the input FASTQ files and the sample metadata for scprocess, we will use the data/fastqs and data/metadata subdirectories, respectively.
To download input files for scprocess run the following lines:
File size
The size of FASTQ files is approximately 7 GB. Download may take a while.
# download all raw sequencing files into data/fastqs
curl -s https://api.github.com/repos/marusakod/scprocessData/releases/tags/v0.1.0 \
| grep -o "https://.*GSM.*\.tar\.gz" \
| wget -c --show-progress -P data/fastqs -i -
# extract raw sequencing files
for file in data/fastqs/GSM*.tar.gz; do tar -xzvf "$file" -C data/fastqs && rm "$file"; done
# download sample metadata into data/metadata
wget -P data/metadata https://github.com/marusakod/scprocessData/releases/download/v0.1.0/test_project_metadata.csv
With ls data/fastqs you should be able to see the following files:
GSM8612257_SRR31256781_R1_.fastq.gz GSM8612261_SRR31256774_R1_.fastq.gz
GSM8612257_SRR31256781_R2_.fastq.gz GSM8612261_SRR31256774_R2_.fastq.gz
GSM8612258_SRR31256780_R1_.fastq.gz GSM8612262_SRR31256771_R1_.fastq.gz
GSM8612258_SRR31256780_R2_.fastq.gz GSM8612262_SRR31256771_R2_.fastq.gz
To inspect the contents of the test_project_metadata.csv file, you can use the following command:
This is what the output should look like:
sample_id group tissue
GSM8612257_SRR31256781 vehicle spinal_cord
GSM8612258_SRR31256780 vehicle spinal_cord
GSM8612261_SRR31256774 btk_inhibitor spinal_cord
GSM8612262_SRR31256771 btk_inhibitor spinal_cord
Note that the first column of the sample metadata file (sample_id) contains values that can be matched to FASTQ files.
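This matching can be checked before running the pipeline. The following is a minimal sketch and not part of scprocess: it assumes the metadata file is comma-separated and that every sample has files named `<sample_id>_R1_.fastq.gz` and `<sample_id>_R2_.fastq.gz`, as in the listing above. The function name `find_missing_fastqs` is hypothetical.

```python
import csv
from pathlib import Path


def find_missing_fastqs(metadata_csv, fastq_dir, id_column="sample_id"):
    """Return FASTQ file names expected from the metadata but absent on disk.

    Assumption: files follow the <id>_R1_.fastq.gz / <id>_R2_.fastq.gz naming
    seen in this tutorial; scprocess itself may match names more flexibly.
    """
    fastq_dir = Path(fastq_dir)
    missing = []
    with open(metadata_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            for read in ("R1", "R2"):
                name = f"{row[id_column]}_{read}_.fastq.gz"
                if not (fastq_dir / name).is_file():
                    missing.append(name)
    return missing
```

Calling `find_missing_fastqs("data/metadata/test_project_metadata.csv", "data/fastqs")` from the project root should return an empty list when all eight files are in place.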
Creating a configuration file
The configuration file template config-test_project.yaml was created in the test_project root directory with the newproj function. In this file, all required parameters for scprocess are listed, with some default values already set:
project:
  proj_dir: /projects/site/pred/neurogenomics/users/kodermam/test_project
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag:
  your_name:
  affiliation:
  sample_metadata: data/metadata/
  ref_txome:
  date_stamp: "2026-01-01"
qc:
  qc_max_mito: 0.1
  qc_max_splice: 0.75
In addition to setting values for the required parameters, we will include the optional metadata_vars and mkr_custom_genesets parameters. The metadata_vars parameter allows us to specify additional metadata variables for visualization, while the mkr_custom_genesets parameter enables us to provide a file containing a list of marker genes associated with different cell types expected in our dataset. In this case, we only need to specify the name of the marker gene file, as the corresponding file mouse_brain.csv already exists in the $SCPROCESS_DATA_DIR/marker_genes directory.
Note that proj_dir requires an absolute path, whereas fastq_dir and sample_metadata can use relative paths, since the raw data and sample metadata are stored within the project directory:
project:
  proj_dir: /absolute/path/to/test_project # replace with correct absolute path
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
  metadata_vars: [group]
qc:
  qc_max_mito: 0.1
  qc_max_splice: 0.75
marker_genes:
  mkr_custom_genesets:
    - name: mouse_brain
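The path rules for the project section can be checked with a few lines of code before running anything. This is a sketch under stated assumptions, not scprocess functionality: the helper `check_project_paths` is hypothetical, it takes the project section as a plain dictionary, and it assumes relative paths are resolved against proj_dir.

```python
from pathlib import Path


def check_project_paths(project):
    """Return a list of problems found in the 'project' section of a config.

    `project` is a plain dict with the keys shown in the tutorial config.
    Assumption: relative paths are interpreted relative to proj_dir.
    """
    problems = []
    proj_dir = Path(project["proj_dir"])
    if not proj_dir.is_absolute():
        problems.append("proj_dir must be an absolute path")
    for key in ("fastq_dir", "sample_metadata"):
        path = Path(project[key])
        resolved = path if path.is_absolute() else proj_dir / path
        if not resolved.exists():
            problems.append(f"{key} not found: {resolved}")
    return problems
```

An empty list means the three paths look consistent; any other result names the entries to fix.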
Running scprocess
We are now ready to run scprocess run using:
Consider adding a -n or --dry-run flag to this command to verify the setup and get a list of all tasks scprocess will perform (see the Usage section for more details).
scprocess dry run output for the tutorial dataset
If the --dry-run flag is used with the scprocess run command, you will see a detailed list of all steps that will be executed. The summary looks like this:
Job stats:
job count
------------------------------------- -------
all 1
calculate_ambient_genes 1
check_qc_quality 1
collect_chemistry_stats 1
create_doublets_hvg_matrix 1
create_hvg_matrix 1
get_ambient_run_statistics 1
get_barcode_qc_metrics 4
get_highly_variable_genes 1
get_qc_sample_statistics 1
get_stats_for_std_variance_for_sample 4
make_clean_h5ad_paths_yaml 1
make_clean_h5ads 4
make_empty_pb_input_df 1
make_hvg_df 1
make_one_pb_cells 4
make_one_pb_empty 4
make_qc_thresholds_csv 1
make_runs_to_batches_df 1
make_tmp_csr_matrix 1
make_tmp_pb_cells_df 1
merge_pb_cells 1
merge_pb_empty 1
merge_qc 1
merge_rowdata 1
merge_sample_std_var_stats 1
render_html_ambient 1
render_html_hvgs 1
render_html_integration 1
render_html_mapping 1
render_html_marker_genes 1
render_html_qc 1
run_decontx 4
run_fgsea 1
run_integration 1
run_mapping 4
run_marker_genes 1
run_qc_one_run 4
save_alevin_to_h5 4
total 66
Reasons:
(check individual jobs above for details)
input files updated by another job:
all, calculate_ambient_genes, check_qc_quality, collect_chemistry_stats, create_doublets_hvg_matrix, create_hvg_matrix, get_ambient_run_statistics, get_barcode_qc_metrics, get_highly_variable_genes, get_qc_sample_statistics, get_stats_for_std_variance_for_sample, make_clean_h5ad_paths_yaml, make_clean_h5ads, make_empty_pb_input_df, make_hvg_df, make_one_pb_cells, make_one_pb_empty, make_tmp_csr_matrix, make_tmp_pb_cells_df, merge_pb_cells, merge_pb_empty, merge_qc, merge_rowdata, merge_sample_std_var_stats, render_html_ambient, render_html_hvgs, render_html_integration, render_html_mapping, render_html_marker_genes, render_html_qc, run_decontx, run_fgsea, run_integration, run_marker_genes, run_qc_one_run, save_alevin_to_h5
output files have to be generated:
calculate_ambient_genes, check_qc_quality, collect_chemistry_stats, create_doublets_hvg_matrix, create_hvg_matrix, get_ambient_run_statistics, get_barcode_qc_metrics, get_highly_variable_genes, get_qc_sample_statistics, get_stats_for_std_variance_for_sample, make_clean_h5ad_paths_yaml, make_clean_h5ads, make_empty_pb_input_df, make_hvg_df, make_one_pb_cells, make_one_pb_empty, make_qc_thresholds_csv, make_runs_to_batches_df, make_tmp_csr_matrix, make_tmp_pb_cells_df, merge_pb_cells, merge_pb_empty, merge_qc, merge_rowdata, merge_sample_std_var_stats, render_html_ambient, render_html_hvgs, render_html_integration, render_html_mapping, render_html_marker_genes, render_html_qc, run_decontx, run_fgsea, run_integration, run_mapping, run_marker_genes, run_qc_one_run, save_alevin_to_h5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Labelling cell types
Tutorial results may vary from your scprocess outputs
Modifying the default mouse_2024 reference transcriptome settings in scprocess_setup.yaml may cause your scprocess outputs to differ slightly from the results in this tutorial. Furthermore, if int_use_gpu is false or a GPU is unavailable, Scanpy will be used for integration instead of RAPIDS-singlecell, which was used to generate the results shown on this page.
scprocess supports cell type annotation using various pretrained models. In this tutorial we will use the mouse whole brain classifier provided by CellTypist [1,2]. To specify the annotation model, add the label_celltypes section shown at the end of the following configuration to your configuration file:
project:
  proj_dir: /absolute/path/to/test_project # replace with correct absolute path
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
  metadata_vars: [group]
qc:
  qc_max_mito: 0.1
  qc_max_splice: 0.75
marker_genes:
  mkr_custom_genesets:
    - name: mouse_brain
label_celltypes:
  - labeller: celltypist
    model: Mouse_Whole_Brain
Cell type annotation can now be initiated using the following command:
Zooming in
After label assignment, we can subcluster cell populations of interest using scprocess. To identify those populations, we can inspect the test_label_celltypes.html report, which should be located in the public directory.
The report includes a UMAP plot displaying the predicted cell type annotations:

In this example, we will subcluster the populations labelled "327 Oligo NN" (oligodendrocytes) and "326 OPC NN" (oligodendrocyte precursor cells).
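Before choosing populations, it can help to see how many cells carry each label. The sketch below counts labels in the annotation output file; it is not part of scprocess and assumes the file is a gzip-compressed, comma-separated table whose predicted_label_agg column holds the cell type names. The function name `count_labels` is hypothetical.

```python
import csv
import gzip
from collections import Counter


def count_labels(labels_csv_gz, labels_col="predicted_label_agg"):
    """Count cells per label in a gzip-compressed CSV of annotations."""
    with gzip.open(labels_csv_gz, mode="rt", newline="") as handle:
        return Counter(row[labels_col] for row in csv.DictReader(handle))
```

For this tutorial the input would be output/test_label_celltypes/labels_celltypist_model_Mouse_Whole_Brain_test_project_2026-01-01.csv.gz, and `count_labels(path).most_common(10)` lists the ten largest populations.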
- Create a subclustering configuration file
First, create a new configuration file named zoom_params_test_project-oligos_opcs.yaml with the following parameters:
zoom:
  name: oligos_opcs
  labels_source: celltypist
  model: Mouse_Whole_Brain
  sel_labels: ["327 Oligo NN", "326 OPC NN"]
  labels_col: predicted_label_agg
The value of the labels_source parameter matches the value of the labeller parameter in the main configuration file. The sel_labels parameter lists the specific clusters/cell types to include in subclustering. labels_col refers to the column in the annotation output file (output/test_label_celltypes/labels_celltypist_model_Mouse_Whole_Brain_test_project_2026-01-01.csv.gz) that contains the cell type names.
- Link the subclustering configuration file to the main project
Link the subclustering YAML file to your main project configuration file (config-test_project.yaml) by adding it to the zoom section:
project:
  proj_dir: /absolute/path/to/test_project # replace with correct absolute path
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
  metadata_vars: [group]
label_celltypes:
  - labeller: "celltypist"
    model: Mouse_Whole_Brain
zoom:
  - zoom_params_test_project-oligos_opcs.yaml
- Run subclustering
To run subclustering use the following command:
Inspecting reports
All reports generated by scprocess for this tutorial are available at https://marusakod.github.io/scprocess_test_project/.
Tutorial 2: Analysis of multiplexed single cell data
A single cell dataset for this tutorial was generated by downsampling the data in this study: The coenzyme A precursor pantethine enhances antitumor immunity in sarcoma. The dataset includes samples representing four distinct experimental groups, with one replicate from each group combined into a single pooled (multiplexed) sample.
More about the tutorial dataset
This tutorial uses single cell RNA sequencing data from mouse tumor samples. The data was generated in a study where MCA205 mouse fibrosarcoma cells were injected subcutaneously into C57BL/6 mice. Ten days after engraftment, the mice were treated with either pantethine (a vitamin B5 precursor) or PBS (vehicle control). Tumors were harvested at 20 and 28 days post-engraftment, and CD45+ immune cells were isolated for single-cell RNA sequencing. The raw data is available in Gene Expression Omnibus (GEO) under accession GSE221164.
We selected sequencing files from two multiplexed samples, including both gene expression and all associated HTO FASTQ files. The GEO sample accession numbers and their corresponding sequencing run identifiers are as follows: (1) gene expression: GSM6846850 (SRR22767531), HTO: GSM6846851; and (2) gene expression: GSM6846854 (SRR22767514), HTO: GSM6846855.
To generate smaller gene expression files via downsampling, we first processed the FASTQ files with Cell Ranger to generate BAM files. We then downsampled the BAM files by retaining 50% of the barcodes called as cells, 70% of barcodes in the empty plateau, and 70% of the remaining barcodes. Additionally, we discarded any barcodes with fewer than 10 unique molecular identifiers (UMIs) and all reads flagged as PCR duplicates. The filtered BAM files were then converted back to FASTQ format, resulting in the dataset used for this tutorial.
Creating a new project directory and preparing input data
First we will create a new project directory where all outputs of scprocess will be stored:
# create a new directory called test_multiplexed_project in your current working directory
# add the -c flag with the sc and multiplex options to generate a template config file with:
# - qc thresholds commonly used for single cell data
# - multiplexing section
scprocess newproj test_multiplexed_project -s -c sc multiplex
# change your working directory to test_multiplexed_project
cd test_multiplexed_project
We will use the data/metadata directory to store both the sample metadata file and the feature reference file. Input FASTQ files will be stored in the data/fastqs directory, in two subdirectories: data/fastqs/rna for gene expression FASTQ files and data/fastqs/hto for HTO FASTQ files.
To create the directories for FASTQ files, use:
mkdir -p data/fastqs/rna data/fastqs/hto
To download input files for scprocess run the following lines:
File size
The size of FASTQ files is approximately 4 GB. Download may take a while.
# download all raw sequencing files into data/fastqs
curl -s https://api.github.com/repos/marusakod/scprocessData/releases/tags/v0.1.0 \
| grep -o "https://.*run.*\.tar\.gz" \
| wget -c --show-progress -P data/fastqs -i -
# extract gene expression sequencing files
for file in data/fastqs/run*_rna*.tar.gz; do tar -xzvf "$file" -C data/fastqs/rna && rm "$file"; done
# extract HTO sequencing files
for file in data/fastqs/run*_hto*.tar.gz; do tar -xzvf "$file" -C data/fastqs/hto && rm "$file"; done
# download sample metadata into data/metadata
wget -P data/metadata https://github.com/marusakod/scprocessData/releases/download/v0.1.0/multiplexed_test_project_metadata.csv
# download feature reference file into data/metadata
wget -P data/metadata https://github.com/marusakod/scprocessData/releases/download/v0.1.0/multiplexed_test_project_feature_ref.csv
With ls data/fastqs/rna you should be able to see the following files:
With ls data/fastqs/hto you should be able to see the following files:
run1_hto1_R1_.fastq.gz run1_hto3_R1_.fastq.gz run2_hto1_R1_.fastq.gz run2_hto3_R1_.fastq.gz
run1_hto1_R2_.fastq.gz run1_hto3_R2_.fastq.gz run2_hto1_R2_.fastq.gz run2_hto3_R2_.fastq.gz
run1_hto2_R1_.fastq.gz run1_hto4_R1_.fastq.gz run2_hto2_R1_.fastq.gz run2_hto4_R1_.fastq.gz
run1_hto2_R2_.fastq.gz run1_hto4_R2_.fastq.gz run2_hto2_R2_.fastq.gz run2_hto4_R2_.fastq.gz
To inspect the contents of the multiplexed_test_project_metadata.csv file, you can use the following command:
This is what the output should look like:
pool_id sample_id hto_id day treatment
run1 D21_PBS_R1 totalseq_B0301_hashtag1 21 PBS
run1 D21_Panth_R1 totalseq_B0302_hashtag2 21 pantethine
run1 D29_PBS_R1 totalseq_B0303_hashtag3 29 PBS
run1 D29_Panth_R1 totalseq_B0304_hashtag4 29 pantethine
run2 D21_PBS_R2 totalseq_B0301_hashtag1 21 PBS
run2 D21_Panth_R2 totalseq_B0302_hashtag2 21 pantethine
run2 D29_PBS_R2 totalseq_B0303_hashtag3 29 PBS
run2 D29_Panth_R2 totalseq_B0304_hashtag4 29 pantethine
Note that the first column of the sample metadata file (pool_id) contains values that can be matched to both gene expression and HTO FASTQ files, i.e. run*. The metadata file requires two additional columns:
- sample_id: Lists all samples corresponding to a single pool.
- hto_id: Lists the names of HTO tags used to label each sample.
The hto_id column must correspond to a column in the feature reference file, which links the name of each HTO tag to its sequence.
To see the contents of the feature reference file use
hto_id sequence
totalseq_B0301_hashtag1 ACCCACCAGTAAGAC
totalseq_B0302_hashtag2 GGTCGAGAGCATTCA
totalseq_B0303_hashtag3 CTTGCCGCATGTCAT
totalseq_B0304_hashtag4 AAAGCATTCTTCACG
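The correspondence between the hto_id values in the sample metadata and those in the feature reference file can be verified with a short script. This is a sketch, not scprocess functionality: it assumes both files are comma-separated and share an hto_id column, and the function names `read_column` and `unknown_hto_ids` are hypothetical.

```python
import csv


def read_column(csv_path, column):
    """Return the values of one column from a comma-separated file."""
    with open(csv_path, newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def unknown_hto_ids(metadata_csv, feature_ref_csv, column="hto_id"):
    """Return hto_id values used in the metadata but absent from the feature reference."""
    known = set(read_column(feature_ref_csv, column))
    used = set(read_column(metadata_csv, column))
    return sorted(used - known)
```

Run from the project root, `unknown_hto_ids("data/metadata/multiplexed_test_project_metadata.csv", "data/metadata/multiplexed_test_project_feature_ref.csv")` should return an empty list for the tutorial files.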
Creating a configuration file
The configuration file template config-test_multiplexed_project.yaml was created in the test_multiplexed_project root directory with the newproj function. In this file, all required parameters for scprocess are listed, with some default values already set:
project:
  proj_dir: /absolute/path/to/test_multiplexed_project # replace with correct absolute path
  fastq_dir: data/fastqs
  full_tag: test_multiplexed_project
  short_tag:
  your_name:
  affiliation:
  sample_metadata: data/metadata/
  ref_txome:
  date_stamp: "2026-01-01"
multiplexing:
  demux_type:
qc:
  qc_max_mito: 0.1
  qc_min_splice: 0.10
  qc_max_splice: 0.99
In addition to setting the parameters already present in the template configuration file, we must define additional parameters specific to the multiplexed nature of the dataset. These key parameters appear in the multiplexing section below; note also that fastq_dir in the project section now points to the gene expression FASTQ files:
project:
  proj_dir: /absolute/path/to/test_multiplexed_project
  fastq_dir: data/fastqs/rna
  full_tag: test_multiplexed_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/multiplexed_test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
multiplexing:
  demux_type: hto
  fastq_dir: data/fastqs/hto
  feature_ref: data/metadata/multiplexed_test_project_feature_ref.csv
qc:
  qc_max_mito: 0.1
  qc_min_splice: 0.10
  qc_max_splice: 0.99
Setting demux_type to hto instructs scprocess to use HTO-based demultiplexing for this dataset. By specifying fastq_dir and feature_ref, we provide scprocess with the paths to the HTO FASTQ files and the feature reference file, respectively.
Running scprocess
We are now ready to run scprocess run using:
Inspecting reports
All reports generated by scprocess for this tutorial are available at https://marusakod.github.io/scprocess_test_multiplexed_project/.
1. Chuan Xu, Martin Prete, Simone Webb, Laura Jardine, Benjamin J Stewart, Regina Hoo, Peng He, Kerstin B Meyer, and Sarah A Teichmann. Automatic cell-type harmonization and integration across human cell atlas datasets. Cell, 186(26):5876–5891.e20, December 2023.
2. C Domínguez Conde, C Xu, L B Jarvis, D B Rainbow, S B Wells, T Gomes, S K Howlett, O Suchanek, K Polanski, H W King, L Mamanova, N Huang, P A Szabo, L Richardson, L Bolt, E S Fasouli, K T Mahbubani, M Prete, L Tuck, N Richoz, Z K Tuong, L Campos, H S Mousa, E J Needham, S Pritchard, T Li, R Elmentaite, J Park, E Rahmani, D Chen, D K Menon, O A Bayraktar, L K James, K B Meyer, N Yosef, M R Clatworthy, P A Sims, D L Farber, K Saeb-Parsy, J L Jones, and S A Teichmann. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science, 376(6594):eabl5197, May 2022.