Tutorials

In this section we will run scprocess on small example datasets. To follow the tutorials, make sure you have completed all the steps in the Getting started section.

Tutorial 1: a basic workflow

A single nuclei dataset for this tutorial was generated by downsampling the data in this study: BTK regulates microglial function and neuroinflammation in human stem cell models and mouse models of multiple sclerosis.

Creating a new project directory and preparing input data

First we will create a new project directory where all outputs of scprocess for this dataset will be stored:

# create a new directory called test_project in your current working directory
# add the -c flag with the sn option to generate a template configuration file with qc thresholds typically used for single nuclei data
scprocess newproj test_project -s -c sn

# change your working directory to test_project
cd test_project

Using the command tree ., you can inspect the structure of the test_project directory:

.
├── analysis
│   ├── about.Rmd
│   ├── custom.css
│   ├── index.Rmd
│   ├── license.Rmd
│   └── _site.yml
├── code
├── config-test_project.yaml
├── data
│   ├── fastqs
│   └── metadata
├── output
├── public
├── test_project.Rproj
└── _workflowr.yml

In the analysis/ directory, scprocess will store all R Markdown files that are used to create HTML reports in the public/ directory. In the code/ directory all R scripts used in scprocess will be stored.

To store input FASTQ files and sample metadata for scprocess, we will use the data/fastqs and data/metadata subdirectories, respectively.

To download input files for scprocess run the following lines:

File size

The size of FASTQ files is approximately 7 GB. Download may take a while.

# download all raw sequencing files into data/fastqs
curl -s https://api.github.com/repos/marusakod/scprocessData/releases/tags/v0.1.0 \
| grep -o "https://.*GSM.*\.tar\.gz" \
| wget -c --show-progress -P data/fastqs -i -

# extract raw sequencing files
for file in data/fastqs/GSM*.tar.gz; do tar -xzvf "$file" -C data/fastqs && rm "$file"; done

# download sample metadata into data/metadata
wget -P data/metadata https://github.com/marusakod/scprocessData/releases/download/v0.1.0/test_project_metadata.csv
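
The first download command asks the GitHub API for the release description (JSON), keeps only the matching download URLs with grep -o, and hands that list to wget, which reads URLs from standard input because of -i -. As a rough sketch, the URL-filtering step can be wrapped in a small function to preview what would be downloaded. The function name list_release_urls is hypothetical and not part of scprocess:

```shell
# list_release_urls: read GitHub release JSON on stdin and print the
# .tar.gz download URLs whose name contains a given pattern.
# Hypothetical helper for illustration; not part of scprocess.
list_release_urls() {
  url_pattern="$1"
  # [^"]* keeps each match inside a single quoted JSON string
  grep -o "https://[^\"]*${url_pattern}[^\"]*\.tar\.gz" | sort -u
}

# usage (requires network access):
# curl -s https://api.github.com/repos/marusakod/scprocessData/releases/tags/v0.1.0 \
#   | list_release_urls GSM
```

Printing the list first makes it easy to confirm that only the expected archives are selected before starting a multi-gigabyte download.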

With ls data/fastqs you should be able to see the following files:

GSM8612257_SRR31256781_R1_.fastq.gz  GSM8612261_SRR31256774_R1_.fastq.gz
GSM8612257_SRR31256781_R2_.fastq.gz  GSM8612261_SRR31256774_R2_.fastq.gz
GSM8612258_SRR31256780_R1_.fastq.gz  GSM8612262_SRR31256771_R1_.fastq.gz
GSM8612258_SRR31256780_R2_.fastq.gz  GSM8612262_SRR31256771_R2_.fastq.gz

To inspect the contents of the test_project_metadata.csv file, you can use the following command:

cat data/metadata/test_project_metadata.csv | column -s, -t

This is what the output should look like:

sample_id               group          tissue
GSM8612257_SRR31256781  vehicle        spinal_cord
GSM8612258_SRR31256780  vehicle        spinal_cord
GSM8612261_SRR31256774  btk_inhibitor  spinal_cord
GSM8612262_SRR31256771  btk_inhibitor  spinal_cord

Note that the first column of the sample metadata file (sample_id) contains values that can be matched to FASTQ files.
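
Before running the pipeline, it can help to confirm that every sample_id in the metadata file has matching FASTQ files. The sketch below is one way to do this, assuming the naming used in this tutorial (file names start with the sample_id and contain R1 or R2). The function name check_fastqs_for_samples is hypothetical and not part of scprocess:

```shell
# check_fastqs_for_samples: for every sample_id in the first column of a
# metadata CSV, report whether R1 and R2 FASTQ files exist in a directory.
# Hypothetical helper; assumes names like <sample_id>_R1_.fastq.gz.
check_fastqs_for_samples() {
  meta_csv="$1"
  fq_dir="$2"
  rc=0
  # skip the header line, then take the first comma-separated field
  for sample in $(tail -n +2 "$meta_csv" | cut -d, -f1); do
    for mate in R1 R2; do
      if ls "$fq_dir/${sample}"_*"${mate}"*.fastq.gz >/dev/null 2>&1; then
        echo "ok       $sample $mate"
      else
        echo "MISSING  $sample $mate"
        rc=1
      fi
    done
  done
  return "$rc"
}

# usage, from the project root:
# check_fastqs_for_samples data/metadata/test_project_metadata.csv data/fastqs
```

The function returns a non-zero status if any file is missing, so it can be used as a guard in a script.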

Creating a configuration file

The configuration file template config-test_project.yaml was created in the test_project root directory by the scprocess newproj command. This file lists all required parameters for scprocess, with some default values already set:

project:
  proj_dir: /projects/site/pred/neurogenomics/users/kodermam/test_project
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag:
  your_name:
  affiliation:
  sample_metadata: data/metadata/
  ref_txome:
  date_stamp: "2026-01-01"
qc:
  qc_max_mito: 0.1
  qc_max_splice: 0.75

In addition to setting values for the required parameters, we will include the optional metadata_vars and mkr_custom_genesets parameters. The metadata_vars parameter allows us to specify additional metadata variables for visualization, while the mkr_custom_genesets parameter enables us to provide a file containing a list of marker genes associated with different cell types expected in our dataset. In this case, we only need to specify the name of the marker gene file, as the corresponding file mouse_brain.csv already exists in the $SCPROCESS_DATA_DIR/marker_genes directory.

Note that proj_dir requires an absolute path, whereas fastq_dir and sample_metadata can use relative paths, since the raw data and sample metadata are stored within the project directory:

project:
  proj_dir: /absolute/path/to/test_project # replace with correct absolute path 
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
  metadata_vars: [group]
qc:
  qc_max_mito: 0.1
  qc_max_splice: 0.75
marker_genes:
  mkr_custom_genesets:
    - name: mouse_brain
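
The path rules above can be checked before launching a run. The following is a minimal sketch, assuming that relative fastq_dir and sample_metadata values are resolved against proj_dir and that the entries are simple "key: value" lines. The function names get_config_value and check_project_paths are hypothetical and not part of scprocess:

```shell
# get_config_value: print the value of the first "key: value" line in a
# YAML file, dropping trailing comments and surrounding whitespace.
# Hypothetical helper; only handles flat scalar entries.
get_config_value() {
  cfg_key="$1"
  cfg_file="$2"
  sed -n "s/^[[:space:]]*${cfg_key}:[[:space:]]*//p" "$cfg_file" \
    | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
    | head -n 1
}

# check_project_paths: confirm proj_dir is absolute and that fastq_dir and
# sample_metadata exist when resolved against proj_dir (an assumption).
check_project_paths() {
  cfg="$1"
  proj_dir=$(get_config_value proj_dir "$cfg")
  case "$proj_dir" in
    /*) echo "ok: proj_dir is absolute" ;;
    *)  echo "error: proj_dir must be an absolute path"; return 1 ;;
  esac
  fq_rel=$(get_config_value fastq_dir "$cfg")
  meta_rel=$(get_config_value sample_metadata "$cfg")
  [ -d "$proj_dir/$fq_rel" ] || { echo "error: missing $fq_rel"; return 1; }
  [ -f "$proj_dir/$meta_rel" ] || { echo "error: missing $meta_rel"; return 1; }
  echo "ok: fastq_dir and sample_metadata found"
}

# usage, from the project root:
# check_project_paths config-test_project.yaml
```

This is a plain-text check rather than a real YAML parser, so it is only suitable for simple configuration files like the one in this tutorial.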

Running scprocess

We are now ready to run scprocess using:

scprocess run config-test_project.yaml

Consider adding the -n or --dry-run flag to this command to verify the setup and get a list of all tasks scprocess will perform (see the Usage section for more details).

scprocess dry run output for the tutorial dataset

If the --dry-run flag is used with the scprocess run command, you will see a detailed list of all steps that will be executed. This is the summary:

Job stats:
job                                      count
-------------------------------------  -------
all                                          1
calculate_ambient_genes                      1
check_qc_quality                             1
collect_chemistry_stats                      1
create_doublets_hvg_matrix                   1
create_hvg_matrix                            1
get_ambient_run_statistics                   1
get_barcode_qc_metrics                       4
get_highly_variable_genes                    1
get_qc_sample_statistics                     1
get_stats_for_std_variance_for_sample        4
make_clean_h5ad_paths_yaml                   1
make_clean_h5ads                             4
make_empty_pb_input_df                       1
make_hvg_df                                  1
make_one_pb_cells                            4
make_one_pb_empty                            4
make_qc_thresholds_csv                       1
make_runs_to_batches_df                      1
make_tmp_csr_matrix                          1
make_tmp_pb_cells_df                         1
merge_pb_cells                               1
merge_pb_empty                               1
merge_qc                                     1
merge_rowdata                                1
merge_sample_std_var_stats                   1
render_html_ambient                          1
render_html_hvgs                             1
render_html_integration                      1
render_html_mapping                          1
render_html_marker_genes                     1
render_html_qc                               1
run_decontx                                  4
run_fgsea                                    1
run_integration                              1
run_mapping                                  4
run_marker_genes                             1
run_qc_one_run                               4
save_alevin_to_h5                            4
total                                       66

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all, calculate_ambient_genes, check_qc_quality, collect_chemistry_stats, create_doublets_hvg_matrix, create_hvg_matrix, get_ambient_run_statistics, get_barcode_qc_metrics, get_highly_variable_genes, get_qc_sample_statistics, get_stats_for_std_variance_for_sample, make_clean_h5ad_paths_yaml, make_clean_h5ads, make_empty_pb_input_df, make_hvg_df, make_one_pb_cells, make_one_pb_empty, make_tmp_csr_matrix, make_tmp_pb_cells_df, merge_pb_cells, merge_pb_empty, merge_qc, merge_rowdata, merge_sample_std_var_stats, render_html_ambient, render_html_hvgs, render_html_integration, render_html_mapping, render_html_marker_genes, render_html_qc, run_decontx, run_fgsea, run_integration, run_marker_genes, run_qc_one_run, save_alevin_to_h5
    output files have to be generated:
        calculate_ambient_genes, check_qc_quality, collect_chemistry_stats, create_doublets_hvg_matrix, create_hvg_matrix, get_ambient_run_statistics, get_barcode_qc_metrics, get_highly_variable_genes, get_qc_sample_statistics, get_stats_for_std_variance_for_sample, make_clean_h5ad_paths_yaml, make_clean_h5ads, make_empty_pb_input_df, make_hvg_df, make_one_pb_cells, make_one_pb_empty, make_qc_thresholds_csv, make_runs_to_batches_df, make_tmp_csr_matrix, make_tmp_pb_cells_df, merge_pb_cells, merge_pb_empty, merge_qc, merge_rowdata, merge_sample_std_var_stats, render_html_ambient, render_html_hvgs, render_html_integration, render_html_mapping, render_html_marker_genes, render_html_qc, run_decontx, run_fgsea, run_integration, run_mapping, run_marker_genes, run_qc_one_run, save_alevin_to_h5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
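
The job table in the dry-run summary can be filtered to show which jobs run more than once. In this tutorial those are the jobs with a count of 4, which presumably corresponds to one job per sample. The sketch below assumes the dry-run output was saved to a file and has the two-column layout shown above. The function name repeated_jobs is hypothetical and not part of scprocess:

```shell
# repeated_jobs: read saved dry-run output on stdin and print the jobs in
# the "Job stats" table whose count is greater than 1.
# Hypothetical helper; assumes the "job  count" layout shown above.
repeated_jobs() {
  awk '
    /^Job stats:/ { in_table = 1; next }   # the table starts after this line
    in_table && $1 == "total" { exit }     # the table ends at the total row
    in_table && NF == 2 && $2 ~ /^[0-9]+$/ && $2 > 1 { print $1, $2 }
  '
}

# usage (capturing both output streams, since the log destination may vary):
# scprocess run config-test_project.yaml -n > dry_run.log 2>&1
# repeated_jobs < dry_run.log
```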

Labelling cell types

Tutorial results may vary from your scprocess outputs

Modifying the default mouse_2024 reference transcriptome settings in scprocess_setup.yaml may cause your scprocess outputs to differ slightly from the results in this tutorial. Furthermore, if int_use_gpu is false or a GPU is unavailable, Scanpy will be used for integration instead of RAPIDS-singlecell, which was used to generate the results shown on this page.

scprocess supports cell type annotation using various pretrained models. In this tutorial we will use the mouse whole brain classifier provided by CellTypist ¹,². To specify the annotation model, add the highlighted lines (the label_celltypes section) to your configuration file:

project:
  proj_dir: /absolute/path/to/test_project # replace with correct absolute path 
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
  metadata_vars: [group]
qc:
  qc_max_mito: 0.1
  qc_max_splice: 0.75
marker_genes:
  mkr_custom_genesets:
    - name: mouse_brain
label_celltypes:
  - labeller: celltypist
    model: Mouse_Whole_Brain

Cell type annotation can now be initiated using the following command:

scprocess run config-test_project.yaml -r label_celltypes

Zooming in

After label assignment, we can subcluster cell populations of interest using scprocess. To identify those populations, we can inspect the test_label_celltypes.html report, which should be located in the public directory.

The report includes a UMAP plot displaying the predicted cell type annotations:

[Figure: UMAP plot of predicted cell type labels]

In this example, we will subcluster the populations labelled "327 Oligo NN" (oligodendrocytes) and "326 OPC NN" (oligodendrocyte precursor cells).

Create a subclustering configuration file

First, create a new configuration file named zoom_params_test_project-oligos_opcs.yaml, with the following parameters:
```
zoom:
  name: oligos_opcs
  labels_source: celltypist
  model: Mouse_Whole_Brain
  sel_labels: ["327 Oligo NN", "326 OPC NN"]
  labels_col: predicted_label_agg
```
The value of the labels_source parameter matches the value of the labeller parameter in the main configuration file. The sel_labels parameter lists the specific clusters/cell types to include in subclustering. labels_col refers to the column in the annotation output file (output/test_label_celltypes/labels_celltypist_model_Mouse_Whole_Brain_test_project_2026-01-01.csv.gz) that contains the cell type names.
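
To see which labels are available for sel_labels and how many cells each one contains, the label column of the annotation output file can be tabulated. The following is a minimal sketch, assuming the file has a header row and no commas inside fields. The function name count_labels is hypothetical and not part of scprocess:

```shell
# count_labels: tabulate the values of one named column in a gzipped CSV,
# most frequent first.
# Hypothetical helper; assumes a header row and no commas inside fields.
count_labels() {
  csv_gz="$1"
  label_col="$2"
  gzip -dc "$csv_gz" | awk -F, -v col="$label_col" '
    { gsub(/"/, "") }                       # drop optional double quotes
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { n[$idx]++ }
    END { for (label in n) print n[label] "\t" label }
  ' | sort -k1,1nr
}

# usage, from the project root:
# count_labels \
#   output/test_label_celltypes/labels_celltypist_model_Mouse_Whole_Brain_test_project_2026-01-01.csv.gz \
#   predicted_label_agg
```

The label strings printed in the second column can be copied into sel_labels exactly as they appear.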

Link subclustering configuration file to main project

Link the subclustering YAML to your main project configuration file (config-test_project.yaml) by adding it to the zoom section:

project:
  proj_dir: /absolute/path/to/test_project # replace with correct absolute path 
  fastq_dir: data/fastqs
  full_tag: test_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
  metadata_vars: [group]
label_celltypes:
  - labeller: "celltypist"
    model: Mouse_Whole_Brain
zoom:
  - zoom_params_test_project-oligos_opcs.yaml

Run subclustering

To run subclustering use the following command:
```
scprocess run config-test_project.yaml -r zoom
```

Inspecting reports

All reports generated by scprocess for this tutorial are available at https://marusakod.github.io/scprocess_test_project/.

Tutorial 2: Analysis of multiplexed single cell data

A single cell dataset for this tutorial was generated by downsampling the data in this study: The coenzyme A precursor pantethine enhances antitumor immunity in sarcoma. The dataset includes samples representing four distinct experimental groups, with one replicate from each group combined into a single pooled (multiplexed) sample.

Creating a new project directory and preparing input data

First we will create a new project directory where all outputs of scprocess will be stored:

# create a new directory called test_multiplexed_project in your current working directory
# add the -c flag with the sc and multiplex options to generate a template config file with:
# - qc thresholds commonly used for single cell data
# - multiplexing section
scprocess newproj test_multiplexed_project -s -c sc multiplex

# change your working directory to test_multiplexed_project
cd test_multiplexed_project

We will use the data/metadata directory to store the sample metadata file. A feature reference file will also be downloaded into the same directory. Input FASTQ files will be stored in the data/fastqs directory, in two subdirectories: data/fastqs/rna for gene expression FASTQ files and data/fastqs/hto for HTO FASTQ files.

To create the directories for FASTQ files, use:

mkdir data/fastqs/rna
mkdir data/fastqs/hto

To download input files for scprocess run the following lines:

File size

The size of FASTQ files is approximately 4 GB. Download may take a while.

# download all raw sequencing files into data/fastqs
curl -s https://api.github.com/repos/marusakod/scprocessData/releases/tags/v0.1.0 \
| grep -o "https://.*run.*\.tar\.gz" \
| wget -c --show-progress -P data/fastqs -i -

# extract gene expression sequencing files
for file in data/fastqs/run*_rna*.tar.gz; do tar -xzvf "$file" -C data/fastqs/rna && rm "$file"; done

# extract HTO sequencing files
for file in data/fastqs/run*_hto*.tar.gz; do tar -xzvf "$file" -C data/fastqs/hto && rm "$file"; done

# download sample metadata into data/metadata
wget -P data/metadata https://github.com/marusakod/scprocessData/releases/download/v0.1.0/multiplexed_test_project_metadata.csv

# download feature reference file into data/metadata
wget -P data/metadata https://github.com/marusakod/scprocessData/releases/download/v0.1.0/multiplexed_test_project_feature_ref.csv

With ls data/fastqs/rna you should be able to see the following files:

run1_rna_R1_.fastq.gz  run1_rna_R2_.fastq.gz  run2_rna_R1_.fastq.gz  run2_rna_R2_.fastq.gz

With ls data/fastqs/hto you should be able to see the following files:

run1_hto1_R1_.fastq.gz  run1_hto3_R1_.fastq.gz  run2_hto1_R1_.fastq.gz  run2_hto3_R1_.fastq.gz
run1_hto1_R2_.fastq.gz  run1_hto3_R2_.fastq.gz  run2_hto1_R2_.fastq.gz  run2_hto3_R2_.fastq.gz
run1_hto2_R1_.fastq.gz  run1_hto4_R1_.fastq.gz  run2_hto2_R1_.fastq.gz  run2_hto4_R1_.fastq.gz
run1_hto2_R2_.fastq.gz  run1_hto4_R2_.fastq.gz  run2_hto2_R2_.fastq.gz  run2_hto4_R2_.fastq.gz

To inspect the contents of the multiplexed_test_project_metadata.csv file, you can use the following command:

cat data/metadata/multiplexed_test_project_metadata.csv | column -s, -t

This is what the output should look like:

pool_id  sample_id     hto_id                   day  treatment
run1     D21_PBS_R1    totalseq_B0301_hashtag1  21   PBS
run1     D21_Panth_R1  totalseq_B0302_hashtag2  21   pantethine
run1     D29_PBS_R1    totalseq_B0303_hashtag3  29   PBS
run1     D29_Panth_R1  totalseq_B0304_hashtag4  29   pantethine
run2     D21_PBS_R2    totalseq_B0301_hashtag1  21   PBS
run2     D21_Panth_R2  totalseq_B0302_hashtag2  21   pantethine
run2     D29_PBS_R2    totalseq_B0303_hashtag3  29   PBS
run2     D29_Panth_R2  totalseq_B0304_hashtag4  29   pantethine

Note that the first column of the sample metadata file (pool_id) contains values that can be matched to both the gene expression and the HTO FASTQ files, i.e. run*. The metadata file requires two additional columns:

sample_id: Lists all samples corresponding to a single pool.
hto_id: Lists the names of HTO tags used to label each sample.

The hto_id column must correspond to a column in the feature reference file, which links the name of each HTO tag to its sequence.

To see the contents of the feature reference file, use:

cat data/metadata/multiplexed_test_project_feature_ref.csv | column -s, -t

hto_id                   sequence
totalseq_B0301_hashtag1  ACCCACCAGTAAGAC
totalseq_B0302_hashtag2  GGTCGAGAGCATTCA
totalseq_B0303_hashtag3  CTTGCCGCATGTCAT
totalseq_B0304_hashtag4  AAAGCATTCTTCACG
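
Because every hto_id in the sample metadata must also appear in the feature reference file, a quick cross-check of the two files can catch mismatched tag names early. The sketch below assumes both files are plain CSVs with a header row, an hto_id column, and no commas inside fields. The function names csv_column and check_hto_ids are hypothetical and not part of scprocess:

```shell
# csv_column: print the values of a named column from a CSV file,
# without the header. Hypothetical helper; assumes no commas inside fields.
csv_column() {
  awk -F, -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx { print $idx }
  ' "$1"
}

# check_hto_ids: list hto_id values used in the sample metadata that are
# absent from the feature reference file; succeed when none are missing.
check_hto_ids() {
  meta_file="$1"
  ref_file="$2"
  ref_ids=$(mktemp)
  csv_column "$ref_file" hto_id | sort -u > "$ref_ids"
  # comm -23 keeps lines that occur only in the first (metadata) input
  missing=$(csv_column "$meta_file" hto_id | sort -u | comm -23 - "$ref_ids")
  rm -f "$ref_ids"
  if [ -n "$missing" ]; then
    echo "hto_id values missing from feature reference:"
    echo "$missing"
    return 1
  fi
  echo "all hto_id values found in feature reference"
}

# usage, from the project root:
# check_hto_ids data/metadata/multiplexed_test_project_metadata.csv \
#   data/metadata/multiplexed_test_project_feature_ref.csv
```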

Creating a configuration file

The configuration file template config-test_multiplexed_project.yaml was created in the test_multiplexed_project root directory by the scprocess newproj command. This file lists all required parameters for scprocess, with some default values already set:

project:
  proj_dir: /absolute/path/to/test_multiplexed_project # replace with correct absolute path 
  fastq_dir: data/fastqs
  full_tag: test_multiplexed_project
  short_tag:
  your_name:
  affiliation:
  sample_metadata: data/metadata/
  ref_txome:
  date_stamp: "2026-01-01"
multiplexing:
  demux_type:
qc:
  qc_max_mito: 0.1
  qc_min_splice: 0.10
  qc_max_splice: 0.99

In addition to setting the parameters already present in the template configuration file, we must also define additional parameters specific to the multiplexed nature of the dataset. These key parameters are highlighted below:

project:
  proj_dir: /absolute/path/to/test_multiplexed_project
  fastq_dir: data/fastqs/rna 
  full_tag: test_multiplexed_project
  short_tag: test
  your_name: Testy McUser
  affiliation: Unemployed
  sample_metadata: data/metadata/multiplexed_test_project_metadata.csv
  ref_txome: mouse_2024
  date_stamp: "2026-01-01"
multiplexing:
  demux_type: hto
  fastq_dir: data/fastqs/hto
  feature_ref: data/metadata/multiplexed_test_project_feature_ref.csv
qc:
  qc_max_mito: 0.1
  qc_min_splice: 0.10
  qc_max_splice: 0.99

Setting demux_type to hto instructs scprocess to use HTO-based demultiplexing for this dataset. By specifying fastq_dir and feature_ref, we provide scprocess with the paths to the HTO FASTQ files and the feature reference file, respectively.

Running scprocess

We are now ready to run scprocess using:

scprocess run config-test_multiplexed_project.yaml

Inspecting reports

All reports generated by scprocess for this tutorial are available at https://marusakod.github.io/scprocess_test_multiplexed_project/.

1. Chuan Xu, Martin Prete, Simone Webb, Laura Jardine, Benjamin J Stewart, Regina Hoo, Peng He, Kerstin B Meyer, and Sarah A Teichmann. Automatic cell-type harmonization and integration across human cell atlas datasets. Cell, 186(26):5876–5891.e20, December 2023.
2. C Domínguez Conde, C Xu, L B Jarvis, D B Rainbow, S B Wells, T Gomes, S K Howlett, O Suchanek, K Polanski, H W King, L Mamanova, N Huang, P A Szabo, L Richardson, L Bolt, E S Fasouli, K T Mahbubani, M Prete, L Tuck, N Richoz, Z K Tuong, L Campos, H S Mousa, E J Needham, S Pritchard, T Li, R Elmentaite, J Park, E Rahmani, D Chen, D K Menon, O A Bayraktar, L K James, K B Meyer, N Yosef, M R Clatworthy, P A Sims, D L Farber, K Saeb-Parsy, J L Jones, and S A Teichmann. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science, 376(6594):eabl5197, May 2022.