SPEEDI (Single-cell Pipeline for End-to-End Data Integration)

Yuan Wang1 and William Thistlethwaite1*

1Princeton University

9 November 2023

Abstract

To facilitate single-cell omics analyses and improve reproducibility, we present SPEEDI, a fully automated end-to-end single-cell data science framework. SPEEDI introduces a data-driven batch inference method and transforms often heterogeneous data matrices obtained from different samples into a uniformly cell-type annotated and integrated dataset. By eliminating manual parameter selection, developing batch identification, providing full automation, and performing initial interpretive analyses, SPEEDI improves reproducibility and democratizes biological discovery using single-cell datasets.

Package

SPEEDI 1.2.0.0

1 Citation
2 Installation and Loading
3 Quick Start
4 Example Analysis for scRNA-seq (PBMC, COVID)
5 Other Types of Data Analyses
Session info

1 Citation

The SPEEDI manuscript is currently under review.

2 Installation and Loading

To begin, we install the SPEEDI package and its dependencies.

# Install necessary packages
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
devtools::install_github('FunctionLab/SPEEDI', repos = BiocManager::repositories())

We then load the SPEEDI package.

# Load SPEEDI
library(SPEEDI)

3 Quick Start

After loading SPEEDI, the fastest way to begin using the pipeline is through the run_SPEEDI() function. In the provided code snippet, we show a brief illustration with some example inputs, but you should read the documentation for the function (i.e., ?run_SPEEDI) and look at our complete example analysis below to learn more.

# Example parameters for run_SPEEDI - note that some optional parameters were not used
# To see other options for reference_tissue, data_type, and species, look at ?run_SPEEDI
reference_tissue <- "PBMC"
data_type <- "RNA"
species <- "human"
# input_dir points to where input data is stored
# Note that input data should be FILTERED H5 or MEX (matrix.mtx.gz / barcodes.tsv.gz / features.tsv.gz) files
# generated by Cell Ranger
input_dir <- "~/test_input"
# output_dir points to where output data will be stored
output_dir <- "~/test_output"
# Run SPEEDI pipeline (not run)
run_SPEEDI(reference_tissue = reference_tissue, data_type = data_type, species = species, input_dir = input_dir,  
           output_dir = output_dir)

4 Example Analysis for scRNA-seq (PBMC, COVID)

4.1 Introduction

Our example scRNA-seq data come from the publication Differential chromatin accessibility in peripheral blood mononuclear cells underlies COVID-19 disease severity prior to seroconversion. We integrate data from 5 healthy controls and 4 severe COVID patients, and we also perform some preliminary downstream analyses (differential expression analysis and functional module discovery).

Importantly, the input data for our analysis are available on GEO. You can download the files using wget as in the code snippet below:

wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249272/suppl/GSM6249272_scRNA_2209_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249273/suppl/GSM6249273_scRNA_2211_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249274/suppl/GSM6249274_scRNA_2213_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249275/suppl/GSM6249275_scRNA_2224_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249276/suppl/GSM6249276_scRNA_2227_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249252/suppl/GSM6249252_scRNA_34-7_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249254/suppl/GSM6249254_scRNA_39-7_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249257/suppl/GSM6249257_scRNA_41-7_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249270/suppl/GSM6249270_scRNA_145-7_filtered_feature_bc_matrix.tar.gz

We can then unarchive these files and rename the resulting folders so that the names are cleaner:

for i in *.tar.gz; do tar -xvf $i; done
for folder in *_filtered_feature_bc_matrix; do
    new_folder="${folder%_filtered_feature_bc_matrix}"
    mv "$folder" "$new_folder"
done

Within each folder, we have filtered data generated by Cell Ranger in the MEX format. SPEEDI also works with H5 format files. If working with MEX files, sample data should always follow the standard naming convention (three files with names barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz). If working with H5 files, sample data file names should always end with filtered_feature_bc_matrix.h5. For example, filtered_feature_bc_matrix.h5 and Sample_1_filtered_feature_bc_matrix.h5 would both be fine, but filtered_matrix.h5 would result in an error.

IMPORTANT: SPEEDI expects data that have been filtered using Cell Ranger or some other similar process.

4.2 Step 0: Set Up SPEEDI

After downloading our input data, we are ready to begin using the SPEEDI pipeline. We could use the run_SPEEDI() function to run the entire pipeline, but we will go through the pipeline step-by-step for clarity.

First, we’ll go over the different parameters for SPEEDI. We highly recommend that you read the run_SPEEDI() function documentation to see all possible values for each parameter.

We’ll begin with the parameters that we provide values for in this analysis:

reference_tissue: Reference tissue type (used to map cell types via reference). Here, we choose PBMC_full (full explanation below).
data_type: Type of data being processed. Here, we choose RNA.
species: Species being analyzed. Here, we choose human.
input_dir: Path to directory where input data are located. Here, we choose input/.
reference_dir: Path to directory where reference will be downloaded by LoadReferenceSPEEDI() if necessary. Here, we choose references/.
output_dir: Path to directory where output will be saved. Here, we choose output/.
metadata_df: Data frame that contains metadata about our samples. Its value will declared in the code snippet below.

There are two different PBMC reference tissue types in SPEEDI: PBMC and PBMC_full. The former uses a SeuratData reference and Azimuth to map cell types, while the latter downloads a more detailed, quite large (~8 GB) PBMC reference for cell type mapping. To get the best possible mapping, we choose the latter. Importantly, SPEEDI loads this reference into memory during processing, so you should use PBMC unless you have sufficient memory to load the PBMC_full reference.

Our other parameters will use the default values:

reference_file_name: Base name of custom reference file. Because we’re not using a custom reference file, we leave this at its default value (NULL).
reference_cell_type_attribute: If using a Seurat reference object, this parameter captures where the cell type information is stored. We leave this at its default value ("celltype.l2").
analysis_name: Name used to create subdirectory in output_dir for current analysis run. We’ll let SPEEDI set this for us, so we’ll leave it at its default value (NULL).
sample_id_list: Vector of sample names (optional - if not provided, will select all samples found recursively in input_dir). Because we’re analyzing all of the samples that we downloaded (and not a subset), we leave this at its default value (NULL).
sample_file_paths: Vector of sample file paths (optional - if not provided, will select all samples found recursively in input_dir). If this argument is used, sample_id_list is required and should be written in the same order as the sample file paths. Because we’re analyzing all of the samples that we downloaded (and not a subset), we leave this at its default value (NULL).
record_doublets: Boolean flag to indicate whether we will record doublets in the data (using the scDblFinder package). We leave this at its default value (FALSE).

reference_tissue <- "PBMC_full"
data_type <- "RNA"
species <- "human"
input_dir <- "input/"
reference_dir <- "references/"
output_dir <- "output/"
# Here, we set up our (very simple) metadata data frame
metadata_df <- data.frame("disease" = c("COVID", "COVID", "COVID", "COVID", 
                                        "healthy", "healthy", "healthy", "healthy", "healthy"))
rownames(metadata_df) <- c("scRNA_34-7", "scRNA_39-7", "scRNA_41-7", "scRNA_145-7", 
                           "scRNA_2209", "scRNA_2211", "scRNA_2213", "scRNA_2224", "scRNA_2227")

To complete our setup, we set a random seed using set.seed() to ensure reproducibility and then use the initialize_SPEEDI() function. While this function isn’t strictly necessary for using SPEEDI, it will help make sure that all variables are set properly and will create a log file for SPEEDI.

# Note that we use get_speedi_seed() to get the global seed used by SPEEDI for tasks like SCTransform normalization, UMAP generation etc.
# If you want to set a different global seed for SPEEDI, you can use the set_speedi_seed() function
set.seed(get_speedi_seed())
SPEEDI_variables <- initialize_SPEEDI(reference_tissue, data_type, species, 
                                      input_dir, reference_dir, output_dir, metadata_df)

After running initialize_SPEEDI(), all of the corrected variable values will be stored in the SPEEDI_variables list. These variables include everything highlighted above as well as a few new ones:

RNA_output_dir: Output directory for RNA analyses (will contain all analysis output for this example)
ATAC_output_dir: Output directory for ATAC analyses (not relevant for this example)
log_file_path: File name associated with log file
old_wd: ATAC analyses require our working directory be updated to ATAC_output_dir, so we save the previous working directory in old_wd (not relevant for this example)

To refer to a given variable, use list notation (e.g., the corrected input_dir is stored in SPEEDI_variables$input_dir).

Note the directory structures below:

# output_dir now includes a subdirectory for our specific analysis
# This directory contains our log file
SPEEDI_variables$output_dir

## [1] "/home/wat2/SPEEDI_vignette/output/2023-11-09_19-02-28_SPEEDI/"

# RNA_output_dir is a subdirectory of our new output_dir
SPEEDI_variables$RNA_output_dir

## [1] "/home/wat2/SPEEDI_vignette/output/2023-11-09_19-02-28_SPEEDI/RNA/"

4.3 Step 1: Load Reference

Before we begin processing the input data, we first load our PBMC reference. Because we’re using PBMC_full as our reference_tissue value, SPEEDI will automatically download the reference object (pbmc_multimodal.h5seurat) and save it in reference_dir. Note that we set log_flag to TRUE so that any messages are automatically recorded in our log file.

reference <- LoadReferenceSPEEDI(reference_tissue = SPEEDI_variables$reference_tissue, 
                                 reference_dir = SPEEDI_variables$reference_dir,
                                 species = SPEEDI_variables$species, log_flag = TRUE)

We can see that reference now stores a Seurat object:

reference

## An object of class Seurat 
## 20957 features across 161764 samples within 2 assays 
## Active assay: SCT (20729 features, 5000 variable features)
##  2 layers present: counts, data
##  1 other assay present: ADT
##  6 dimensional reductions calculated: apca, aumap, pca, spca, umap, wnn.umap

4.4 Step 2: Read Input Data

Next, we will read in the input data (scRNA-seq) for processing using the Read_RNA() function. Sample data are processed in parallel and compiled into a single comprehensive expression matrix.

all_sc_exp_matrices <- Read_RNA(input_dir = SPEEDI_variables$input_dir, log_flag = TRUE)
nrow(all_sc_exp_matrices)

## [1] 32738

ncol(all_sc_exp_matrices)

## [1] 124692

Our matrix contains 124692 barcodes and 32738 transcripts in total.

4.5 Step 3: Filter Input Data

While our input data have been previously filtered by Cell Ranger, we still need to perform QC filtering (based on number of transcripts, number of genes, percentage of mitochondrial content, percentage of hemoglobin content, and percentage of ribosomal content). For each category, the FilterRawData_RNA() function automatically selects the QC threshold for each sample using the kneedle algorithm and the quantile function. In addition, the FilterRawData_RNA() function performs the following tasks:

Creates pre-filtered QC plots for each category above (see below)
Transforms the expression matrix into a Seurat object for easier downstream processing
If record_doublets is set to TRUE, records information about doublets in the Seurat object

sc_obj <- FilterRawData_RNA(all_sc_exp_matrices = all_sc_exp_matrices, species = SPEEDI_variables$species,
                           output_dir = SPEEDI_variables$RNA_output_dir, log_file_path = SPEEDI_variables$log_file_path, 
                           log_flag = TRUE)

# We can delete all_sc_exp_matrices because we're done with it
rm(all_sc_exp_matrices)
sc_obj

## An object of class Seurat 
## 21693 features across 88870 samples within 1 assay 
## Active assay: RNA (21693 features, 0 variable features)
##  9 layers present: counts.1.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.2.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject, counts.unbias.SeuratProject, counts.unbias

After QC filtering is complete, we see that 88870 barcodes and 21693 transcripts remain. You can always see the automatically chosen QC thresholds by looking at the log file (they may also be displayed in your standard R output depending on your environment). In addition, if you’re interested in the quality of your samples before filtering, you can look at the automatically generated pre-filtering QC plots (found in RNA_output_dir). One example (for number of genes) is displayed below:

Figure 1: Pre-filtered QC plot for number of genes

4.6 Step 4: Initial Processing

Now that the data have been filtered, we will normalize the data using SCTransform() (from Seurat). We regress against mitochondrial percentage (percent.mt), ribosomal percentage (percent.rps and percent.rpl), and hemoglobin percentage (percent.hb). We also calculate cell cycle scores and regress against the differences between S score and G2M score (CC.Difference). Finally, we add metadata from metadata_df into our Seurat object.

sc_obj <- InitialProcessing_RNA(sc_obj = sc_obj, species = SPEEDI_variables$species, 
                                output_dir = SPEEDI_variables$RNA_output_dir, metadata_df = SPEEDI_variables$metadata_df, 
                                log_flag = TRUE)

Looking at the UMAP generated from InitialProcessing_RNA(), we can see that the samples are not yet integrated:

Figure 2: UMAP of sample data before batch correction and integration

4.7 Step 5: Inferring Data-Derived Batches

Next, we will use SPEEDI’s data-derived batch inference method to determine sample batches in our data:

sc_obj <- InferBatches(sc_obj = sc_obj, log_flag = TRUE)

In total, SPEEDI found 4 batches for our 9 samples:

length(unique(sc_obj$batch))

## [1] 4

4.8 Step 6: Integrate Batches

After we have grouped our samples according to batch, we are ready to integrate by batch. We first normalize counts within each batch using SCTransform() and then use a selection of methods from Seurat (SelectIntegrationFeatures(), PrepSCTIntegration(), FindIntegrationAnchors(), and IntegrateData()) to integrate our data.

sc_obj <- IntegrateByBatch_RNA(sc_obj = sc_obj, log_flag = TRUE)

Note that we will now use the integrated assay for visualization (as opposed to the SCT assay) unless only one batch was found above:

sc_obj

## An object of class Seurat 
## 43931 features across 88870 samples within 3 assays 
## Active assay: integrated (3000 features, 3000 variable features)
##  2 layers present: data, scale.data
##  2 other assays present: RNA, SCT

4.9 Step 7: Final Processing

We perform some final processing of our integrated data (scaling and preparing for differential expression) using the VisualizeIntegration() function.

sc_obj <- VisualizeIntegration(sc_obj = sc_obj, output_dir = SPEEDI_variables$RNA_output_dir, log_flag = TRUE)

4.10 Step 8: Map Cell Types

We are now ready to use our PBMC reference to map cell types onto our data. In this case, because we have a full Seurat reference object, our MapCellTypes_RNA() function uses the Seurat functions FindMappingAnchors() and MapQuery() to map cell types. For other built-in references, SPEEDI will use the Azimuth function RunAzimuth() to map cell types. Finally, SPEEDI runs a majority vote algorithm (MajorityVote_RNA()) to finalize cell types within each cluster.

sc_obj <- MapCellTypes_RNA(sc_obj = sc_obj, reference = reference, output_dir = SPEEDI_variables$RNA_output_dir,
                           log_flag = TRUE)

# Table of final cell type predictions
knitr::kable(table(sc_obj$predicted_celltype_majority_vote), col.names = c("Cell Type", "Cell Count"))

Cell Type	Cell Count
B memory	2789
B naive	3941
CD14 Mono	15975
CD16 Mono	2424
CD4 Naive	2163
CD4 TCM	28319
CD8 Naive	3262
CD8 TEM	10655
cDC2	678
HSPC	81
MAIT	6531
NK	7426
NK_CD56bright	599
pDC	518
Platelet	1044
Treg	2465

The MapCellTypes_RNA() function prints UMAPs of our integrated sample data broken down by majority vote cell type, raw predicted cell type, cluster, and sample. Below, we print the sample UMAP and the majority vote cell type UMAP.

Figure 3: UMAP of sample data after batch correction and integration

Note that the samples are much better integrated versus our previous UMAP.

Figure 4: UMAP of cell types determined by majority vote

Importantly, if the cells in a given cluster have an even mix of assigned cell types, the majority vote algorithm may not be able to determine an overall cell type for the cluster. We mark these clusters as “Undetermined”. To better understand these Undetermined clusters, we recommend that users visualize the expression of canonical markers for different cell types (e.g., by using the Seurat function FeaturePlot()).

In addition to the UMAPs described above, SPEEDI also prints a heatmap showing cell type proportions for each sample:

Figure 5: Heatmap of cell type proportions

4.11 Step 9: Downstream Analyses

Now that we have a uniformly cell-type annotated and integrated dataset, we will use SPEEDI to perform two downstream analyses: differential expression analysis and functional module discovery. We use the run_downstream_analyses_RNA() function to run both analyses.

de_results <- run_downstream_analyses_RNA(sc_obj = sc_obj, reference_tissue = SPEEDI_variables$reference_tissue, 
                                      species = SPEEDI_variables$species, metadata_df = SPEEDI_variables$metadata_df, 
                                      output_dir = SPEEDI_variables$RNA_output_dir, log_flag = TRUE)

In our metadata data frame, we provided one metadata category (“disease”) with each sample being assigned a value of “COVID” or “healthy”. The run_downstream_analyses_RNA() wrapper function finds differentially expressed genes (DEGs) for each cell type (COVID vs. healthy) using SPEEDI’s RunDE_RNA() function. SPEEDI first uses the Seurat function FindMarkers() to find DEGs on a single cell sample level (adjusted p-value threshold of 0.05, log fold change threshold of 0.1, and min.pct threshold of 0.1) and then uses DESeq2 to further curate this list using pseudobulk level filtering (p-value threshold of 0.05 and log fold change threshold of 0.3).

The final output is a table containing DEGs for each cell type:

DEGs_table <- utils::read.table(paste0(SPEEDI_variables$RNA_output_dir, "disease.DE.tsv"), header = TRUE, sep = "\t")
print(nrow(DEGs_table))

## [1] 2516

subset_DEGs_table <- rbind(DEGs_table[100:110,], DEGs_table[400:410,])
subset_DEGs_table$sc_pval_adj <- format(subset_DEGs_table$sc_pval_adj, digits = 5)
knitr::kable(subset_DEGs_table, "simple")

	Cell_Type	Gene_Name	sc_pval_adj	sc_log2FC	pseudo_bulk_pval	pseudo_bulk_log2FC
100	CD4 TCM	NR4A2	5.6160e-210	-1.5251357	0.0007697	-1.8935872
101	CD4 TCM	SUN2	6.9428e-185	0.7928248	0.0128745	0.8659437
102	CD4 TCM	MT-ND1	2.5417e-167	-0.2726354	0.0060409	-0.3768555
103	CD4 TCM	DNAJB1	1.4335e-160	-0.5047394	0.0035380	-0.6876837
104	CD4 TCM	KLF6	5.8023e-155	0.7111543	0.0002115	0.6499916
105	CD4 TCM	LINC00861	6.7041e-152	0.6511951	0.0000535	0.5334065
106	CD4 TCM	C1orf63	4.0507e-127	0.5987083	0.0134283	0.4393595
107	CD4 TCM	ZFP36	1.6867e-125	-0.6325388	0.0053459	-0.8987507
108	CD4 TCM	ZRANB2	2.4364e-118	0.6791998	0.0000096	0.6048188
109	CD4 TCM	FKBP8	1.4353e-110	0.5802708	0.0203568	0.6735848
110	CD4 TCM	RN7SL1	2.0350e-110	1.7416472	0.0181545	1.6971147
400	CD8 Naive	RASSF3	2.4735e-04	1.0512495	0.0406761	0.7675221
401	CD8 Naive	C11orf48	2.7214e-04	0.8435597	0.0124949	0.7051809
402	CD8 Naive	CABIN1	5.2228e-04	0.6316841	0.0432814	0.3746976
403	CD8 Naive	NHLRC3	5.7896e-04	0.8454088	0.0336770	0.5575887
404	CD8 Naive	LINC00936	6.1817e-04	0.8217519	0.0414509	0.5629821
405	CD8 Naive	RBM3	6.3853e-04	-0.2762142	0.0121591	-0.3742365
406	CD8 Naive	CCR7	9.2105e-04	-0.2937378	0.0053840	-0.3944238
407	CD8 Naive	METTL12	9.2811e-04	-1.0073312	0.0024619	-1.1553749
408	CD8 Naive	ZNF331	1.2551e-03	-0.8437378	0.0003378	-1.1235462
409	CD8 Naive	TRIM73	1.3633e-03	0.9726430	0.0390137	0.6770856
410	CD8 Naive	WDR37	1.8856e-03	0.8626598	0.0456461	0.5463623

In total, there were 2516 DEGs found. We show a subset of the DEGs found for CD4 TCM and CD8 Naive cells.

Next, because our samples are human, the run_downstream_analyses_RNA() wrapper function will also run functional module discovery (FMD) using SPEEDI’s run_fmd_wrapper() function. As input, SPEEDI uses the cell-type specific DEGs found above. First, each list of cell-type DEGs is divided into separate lists of positive and negative fold change. For each separate list, if at least 20 genes are found in HumanBase, an FMD job is launched. Results are provided in a TSV file and include a HumanBase URL linking to the user’s full results as well as a complete list of associated gene ontology (GO) enrichment terms.

Below, we show example output for CD16 monocytes that were upregulated in our COVID patients:

FMD_file <- paste0(SPEEDI_variables$RNA_output_dir, "FMD_high_CD16_Mono_blood_disease.tsv")
FMD_URL <- readLines(FMD_file, n = 1)
FMD_table <- utils::read.table(FMD_file, header = TRUE, sep = "\t")
print(FMD_URL)

## [1] "# https://hb.flatironinstitute.org/module/overview/?body_tag=c89a1690becc46f63dfeb2185b97feb26eb6ba73"

knitr::kable(head(FMD_table, 20), "simple")

term	name	p_value	q_value
GO:0017144	drug metabolic process	0.0387789	0.0634331
GO:0042110	T cell activation	0.0562688	0.0794194
GO:0001775	cell activation	0.0451936	0.0689062
GO:0044282	small molecule catabolic process	0.0610010	0.0836018
GO:0045321	leukocyte activation	0.0318460	0.0576678
GO:1901136	carbohydrate derivative catabolic process	0.0182024	0.0424151
GO:0009164	nucleoside catabolic process	0.0013437	0.0195233
GO:0009116	nucleoside metabolic process	0.0076972	0.0322241
GO:1901657	glycosyl compound metabolic process	0.0129072	0.0403553
GO:0009119	ribonucleoside metabolic process	0.0052480	0.0275802
GO:0034656	nucleobase-containing small molecule catabolic process	0.0013437	0.0195233
GO:1901658	glycosyl compound catabolic process	0.0021896	0.0208010
GO:0022407	regulation of cell-cell adhesion	0.0763038	0.0947087
GO:0042454	ribonucleoside catabolic process	0.0010601	0.0195233
GO:0031334	positive regulation of protein complex assembly	0.0582871	0.0818006
GO:0032091	negative regulation of protein binding	0.0186749	0.0427103
GO:0002253	activation of immune response	0.0021053	0.0208010
GO:0002429	immune response-activating cell surface receptor signaling pathway	0.0177345	0.0424151
GO:0050778	positive regulation of immune response	0.0008411	0.0195233
GO:0002757	immune response-activating signal transduction	0.0009157	0.0195233

If we go to the URL listed at the top of the table, we will see a network visualization of our results:

Figure 6: HumanBase network visualization for CD16 monocytes upregulated in COVID patients (background network of blood)

Note that HumanBase has multiple background networks for FMD. In our case, because our reference tissue was PBMC_full, SPEEDI automatically analyzed each list of DEGs using the global network as well as the blood network.

4.12 Reminder: Use the SPEEDI Wrapper!

Above, we went through the SPEEDI pipeline step-by-step. However, the easiest way to run SPEEDI is with the run_SPEEDI() wrapper function. This function will complete all of the above steps automatically and will even save your Seurat object in RNA_output_dir. For our example, we would simply write:

sc_obj <- run_SPEEDI(reference_tissue = reference_tissue, data_type = data_type, species = species, 
                     input_dir = input_dir, reference_dir = reference_dir, 
                     output_dir = output_dir, metadata_df = metadata_df)

5 Other Types of Data Analyses

In addition to scRNA-seq analyses, SPEEDI can also run scATAC-seq, paired scRNA-seq / scATAC-seq, and same-cell multiome snRNA-seq / snRNA-seq analyses. To learn more, read the documentation for run_SPEEDI()!

Session info

## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Springdale Open Enterprise Linux 8.8 (Modena)
## 
## Matrix products: default
## BLAS:   /usr/local/R/4.2.3/lib64/R/lib/libRblas.so
## LAPACK: /usr/local/R/4.2.3/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bluster_1.8.0               Seurat_5.0.0               
##  [3] SeuratObject_5.0.0          sp_2.1-1                   
##  [5] SPEEDI_1.2.0.0000           rhdf5_2.42.1               
##  [7] SummarizedExperiment_1.28.0 Biobase_2.58.0             
##  [9] MatrixGenerics_1.10.0       Rcpp_1.0.11                
## [11] Matrix_1.6-1.1              GenomicRanges_1.50.2       
## [13] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [15] S4Vectors_0.36.2            BiocGenerics_0.44.0        
## [17] matrixStats_1.1.0           data.table_1.14.8          
## [19] stringr_1.5.0               plyr_1.8.9                 
## [21] magrittr_2.0.3              ggplot2_3.4.4              
## [23] gtable_0.3.4                gtools_3.9.4               
## [25] gridExtra_2.3               ArchR_1.0.2                
## [27] BiocStyle_2.26.0            knitr_1.45                 
## [29] rmarkdown_2.25             
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.4                spatstat.explore_3.2-5   
##   [3] reticulate_1.34.0         tidyselect_1.2.0         
##   [5] RSQLite_2.3.3             AnnotationDbi_1.60.2     
##   [7] htmlwidgets_1.6.2         BiocParallel_1.32.6      
##   [9] Rtsne_0.16                munsell_0.5.0            
##  [11] codetools_0.2-19          ragg_1.2.6               
##  [13] ica_1.0-3                 future_1.33.0            
##  [15] miniUI_0.1.1.1            withr_2.5.2              
##  [17] spatstat.random_3.2-1     colorspace_2.1-0         
##  [19] progressr_0.14.0          highr_0.10               
##  [21] ROCR_1.0-11               tensor_1.5               
##  [23] listenv_0.9.0             labeling_0.4.3           
##  [25] GenomeInfoDbData_1.2.9    polyclip_1.10-6          
##  [27] pheatmap_1.0.12           bit64_4.0.5              
##  [29] farver_2.1.1              rprojroot_2.0.4          
##  [31] parallelly_1.36.0         vctrs_0.6.4              
##  [33] generics_0.1.3            xfun_0.41                
##  [35] R6_2.5.1                  doParallel_1.0.17        
##  [37] ggbeeswarm_0.7.2          locfit_1.5-9.8           
##  [39] hdf5r_1.3.8               bitops_1.0-7             
##  [41] rhdf5filters_1.10.1       spatstat.utils_3.0-4     
##  [43] cachem_1.0.8              DelayedArray_0.24.0      
##  [45] promises_1.2.1            scales_1.2.1             
##  [47] beeswarm_0.4.0            Cairo_1.6-1              
##  [49] globals_0.16.2            goftest_1.2-3            
##  [51] spam_2.10-0               rlang_1.1.2              
##  [53] systemfonts_1.0.5         splines_4.2.3            
##  [55] lazyeval_0.2.2            spatstat.geom_3.2-7      
##  [57] BiocManager_1.30.22       yaml_2.3.7               
##  [59] reshape2_1.4.4            abind_1.4-5              
##  [61] httpuv_1.6.12             SeuratDisk_0.0.0.9021    
##  [63] tools_4.2.3               bookdown_0.36            
##  [65] ellipsis_0.3.2            jquerylib_0.1.4          
##  [67] RColorBrewer_1.1-3        ggridges_0.5.4           
##  [69] sparseMatrixStats_1.10.0  zlibbioc_1.44.0          
##  [71] purrr_1.0.2               RCurl_1.98-1.13          
##  [73] deldir_1.0-9              pbapply_1.7-2            
##  [75] cowplot_1.1.1             lisi_1.0                 
##  [77] zoo_1.8-12                ggrepel_0.9.4            
##  [79] cluster_2.1.4             here_1.0.1               
##  [81] RSpectra_0.16-1           glmGamPoi_1.10.2         
##  [83] scattermore_1.2           lmtest_0.9-40            
##  [85] RANN_2.6.1                fitdistrplus_1.1-11      
##  [87] patchwork_1.1.3           mime_0.12                
##  [89] evaluate_0.23             xtable_1.8-4             
##  [91] XML_3.99-0.15             fastDummies_1.7.3        
##  [93] compiler_4.2.3            tibble_3.2.1             
##  [95] KernSmooth_2.23-22        crayon_1.5.2             
##  [97] htmltools_0.5.7           later_1.3.1              
##  [99] geneplotter_1.76.0        tidyr_1.3.0              
## [101] logr_1.3.5                DBI_1.1.3                
## [103] rappdirs_0.3.3            MASS_7.3-60              
## [105] cli_3.6.1                 parallel_4.2.3           
## [107] dotCall64_1.1-0           igraph_1.5.1             
## [109] pkgconfig_2.0.3           plotly_4.10.3            
## [111] spatstat.sparse_3.0-3     foreach_1.5.2            
## [113] annotate_1.76.0           vipor_0.4.5              
## [115] bslib_0.5.1               XVector_0.38.0           
## [117] digest_0.6.33             sctransform_0.4.1        
## [119] RcppAnnoy_0.0.21          common_1.1.0             
## [121] Biostrings_2.66.0         spatstat.data_3.0-3      
## [123] leiden_0.4.3              uwot_0.1.16              
## [125] DelayedMatrixStats_1.20.0 curl_5.1.0               
## [127] shiny_1.7.5.1             outliers_0.15            
## [129] lifecycle_1.0.4           nlme_3.1-163             
## [131] jsonlite_1.8.7            Rhdf5lib_1.20.0          
## [133] BiocNeighbors_1.16.0      limma_3.54.2             
## [135] viridisLite_0.4.2         fansi_1.0.5              
## [137] pillar_1.9.0              lattice_0.22-5           
## [139] KEGGREST_1.38.0           ggrastr_1.0.2            
## [141] fastmap_1.1.1             httr_1.4.7               
## [143] survival_3.5-7            glue_1.6.2               
## [145] png_0.1-8                 iterators_1.0.14         
## [147] bit_4.0.5                 presto_1.0.0             
## [149] stringi_1.7.12            sass_0.4.7               
## [151] blob_1.2.4                textshaping_0.3.7        
## [153] RcppHNSW_0.5.0            DESeq2_1.38.3            
## [155] memoise_2.0.1             dplyr_1.1.3              
## [157] irlba_2.3.5.1             future.apply_1.11.0