To facilitate single-cell omics analyses and improve reproducibility, we present SPEEDI, a fully automated end-to-end single-cell data science framework. SPEEDI introduces a data-driven batch inference method and transforms often heterogeneous data matrices obtained from different samples into a uniformly cell-type annotated and integrated dataset. By eliminating manual parameter selection, developing batch identification, providing full automation, and performing initial interpretive analyses, SPEEDI improves reproducibility and democratizes biological discovery using single-cell datasets.
SPEEDI 1.2.0.0
The SPEEDI manuscript is currently under review.
To begin, we install the SPEEDI package and its dependencies.
# Install necessary packages
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
devtools::install_github('FunctionLab/SPEEDI', repos = BiocManager::repositories())
We then load the SPEEDI package.
# Load SPEEDI
library(SPEEDI)
After loading SPEEDI, the fastest way to begin using the pipeline is through the run_SPEEDI()
function. In the provided code snippet, we show a brief illustration with some example inputs, but you should read the documentation for the function (i.e., ?run_SPEEDI
) and look at our complete example analysis below to learn more.
# Example parameters for run_SPEEDI - note that some optional parameters were not used
# To see other options for reference_tissue, data_type, and species, look at ?run_SPEEDI
reference_tissue <- "PBMC"
data_type <- "RNA"
species <- "human"
# input_dir points to where input data is stored
# Note that input data should be FILTERED H5 or MEX (matrix.mtx.gz / barcodes.tsv.gz / features.tsv.gz) files
# generated by Cell Ranger
input_dir <- "~/test_input"
# output_dir points to where output data will be stored
output_dir <- "~/test_output"
# Run SPEEDI pipeline (not run)
run_SPEEDI(reference_tissue = reference_tissue, data_type = data_type, species = species, input_dir = input_dir,
output_dir = output_dir)
Our example scRNA-seq data come from the publication Differential chromatin accessibility in peripheral blood mononuclear cells underlies COVID-19 disease severity prior to seroconversion. We integrate data from 5 healthy controls and 4 severe COVID patients, and we also perform some preliminary downstream analyses (differential expression analysis and functional module discovery).
Importantly, the input data for our analysis are available on GEO. You can download the files using wget
as in the code snippet below:
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249272/suppl/GSM6249272_scRNA_2209_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249273/suppl/GSM6249273_scRNA_2211_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249274/suppl/GSM6249274_scRNA_2213_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249275/suppl/GSM6249275_scRNA_2224_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249276/suppl/GSM6249276_scRNA_2227_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249252/suppl/GSM6249252_scRNA_34-7_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249254/suppl/GSM6249254_scRNA_39-7_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249257/suppl/GSM6249257_scRNA_41-7_filtered_feature_bc_matrix.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6249nnn/GSM6249270/suppl/GSM6249270_scRNA_145-7_filtered_feature_bc_matrix.tar.gz
We can then unarchive these files and rename the resulting folders so that the names are cleaner:
for i in *.tar.gz; do tar -xvf $i; done
for folder in *_filtered_feature_bc_matrix; do
new_folder="${folder%_filtered_feature_bc_matrix}"
mv "$folder" "$new_folder"
done
Within each folder, we have filtered data generated by Cell Ranger in the MEX format. SPEEDI also works with H5 format files. If working with MEX files, sample data should always follow the standard naming convention (three files with names barcodes.tsv.gz
, features.tsv.gz
, and matrix.mtx.gz
). If working with H5 files, sample data file names should always end with filtered_feature_bc_matrix.h5
. For example, filtered_feature_bc_matrix.h5
and Sample_1_filtered_feature_bc_matrix.h5
would both be fine, but filtered_matrix.h5
would result in an error.
IMPORTANT: SPEEDI expects data that have been filtered using Cell Ranger or some other similar process.
After downloading our input data, we are ready to begin using the SPEEDI pipeline. We could use the run_SPEEDI()
function to run the entire pipeline, but we will go through the pipeline step-by-step for clarity.
First, we’ll go over the different parameters for SPEEDI. We highly recommend that you read the run_SPEEDI()
function documentation to see all possible values for each parameter.
We’ll begin with the parameters that we provide values for in this analysis:
reference_tissue
: Reference tissue type (used to map cell types via reference). Here, we choose PBMC_full
(full explanation below).data_type
: Type of data being processed. Here, we choose RNA
.species
: Species being analyzed. Here, we choose human
.input_dir
: Path to directory where input data are located. Here, we choose input/
.reference_dir
: Path to directory where reference will be downloaded by LoadReferenceSPEEDI()
if necessary. Here, we choose references/
.output_dir
: Path to directory where output will be saved. Here, we choose output/
.metadata_df
: Data frame that contains metadata about our samples. Its value will declared in the code snippet below.There are two different PBMC reference tissue types in SPEEDI: PBMC
and PBMC_full
. The former uses a SeuratData
reference and Azimuth
to map cell types, while the latter downloads a more detailed, quite large (~8 GB) PBMC reference for cell type mapping. To get the best possible mapping, we choose the latter. Importantly, SPEEDI loads this reference into memory during processing, so you should use PBMC
unless you have sufficient memory to load the PBMC_full
reference.
Our other parameters will use the default values:
reference_file_name
: Base name of custom reference file. Because we’re not using a custom reference file, we leave this at its default value (NULL
).reference_cell_type_attribute
: If using a Seurat reference object, this parameter captures where the cell type information is stored. We leave this at its default value ("celltype.l2"
).analysis_name
: Name used to create subdirectory in output_dir
for current analysis run. We’ll let SPEEDI set this for us, so we’ll leave it at its default value (NULL
).sample_id_list
: Vector of sample names (optional - if not provided, will select all samples found recursively in input_dir
). Because we’re analyzing all of the samples that we downloaded (and not a subset), we leave this at its default value (NULL
).sample_file_paths
: Vector of sample file paths (optional - if not provided, will select all samples found recursively in input_dir
). If this argument is used, sample_id_list
is required and should be written in the same order as the sample file paths. Because we’re analyzing all of the samples that we downloaded (and not a subset), we leave this at its default value (NULL
).record_doublets
: Boolean flag to indicate whether we will record doublets in the data (using the scDblFinder
package). We leave this at its default value (FALSE
).reference_tissue <- "PBMC_full"
data_type <- "RNA"
species <- "human"
input_dir <- "input/"
reference_dir <- "references/"
output_dir <- "output/"
# Here, we set up our (very simple) metadata data frame
metadata_df <- data.frame("disease" = c("COVID", "COVID", "COVID", "COVID",
"healthy", "healthy", "healthy", "healthy", "healthy"))
rownames(metadata_df) <- c("scRNA_34-7", "scRNA_39-7", "scRNA_41-7", "scRNA_145-7",
"scRNA_2209", "scRNA_2211", "scRNA_2213", "scRNA_2224", "scRNA_2227")
To complete our setup, we set a random seed using set.seed()
to ensure reproducibility and then use the initialize_SPEEDI()
function. While this function isn’t strictly necessary for using SPEEDI, it will help make sure that all variables are set properly and will create a log file for SPEEDI.
# Note that we use get_speedi_seed() to get the global seed used by SPEEDI for tasks like SCTransform normalization, UMAP generation etc.
# If you want to set a different global seed for SPEEDI, you can use the set_speedi_seed() function
set.seed(get_speedi_seed())
SPEEDI_variables <- initialize_SPEEDI(reference_tissue, data_type, species,
input_dir, reference_dir, output_dir, metadata_df)
After running initialize_SPEEDI()
, all of the corrected variable values will be stored in the SPEEDI_variables
list. These variables include everything highlighted above as well as a few new ones:
RNA_output_dir
: Output directory for RNA analyses (will contain all analysis output for this example)ATAC_output_dir
: Output directory for ATAC analyses (not relevant for this example)log_file_path
: File name associated with log fileold_wd
: ATAC analyses require our working directory be updated to ATAC_output_dir
, so we save the previous working directory in old_wd
(not relevant for this example)To refer to a given variable, use list notation (e.g., the corrected input_dir
is stored in SPEEDI_variables$input_dir
).
Note the directory structures below:
# output_dir now includes a subdirectory for our specific analysis
# This directory contains our log file
SPEEDI_variables$output_dir
## [1] "/home/wat2/SPEEDI_vignette/output/2023-11-09_19-02-28_SPEEDI/"
# RNA_output_dir is a subdirectory of our new output_dir
SPEEDI_variables$RNA_output_dir
## [1] "/home/wat2/SPEEDI_vignette/output/2023-11-09_19-02-28_SPEEDI/RNA/"
Before we begin processing the input data, we first load our PBMC
reference. Because we’re using PBMC_full
as our reference_tissue
value, SPEEDI will automatically download the reference object (pbmc_multimodal.h5seurat
) and save it in reference_dir
. Note that we set log_flag
to TRUE
so that any messages are automatically recorded in our log file.
reference <- LoadReferenceSPEEDI(reference_tissue = SPEEDI_variables$reference_tissue,
reference_dir = SPEEDI_variables$reference_dir,
species = SPEEDI_variables$species, log_flag = TRUE)
We can see that reference
now stores a Seurat object:
reference
## An object of class Seurat
## 20957 features across 161764 samples within 2 assays
## Active assay: SCT (20729 features, 5000 variable features)
## 2 layers present: counts, data
## 1 other assay present: ADT
## 6 dimensional reductions calculated: apca, aumap, pca, spca, umap, wnn.umap
Next, we will read in the input data (scRNA-seq) for processing using the Read_RNA()
function. Sample data are processed in parallel and compiled into a single comprehensive expression matrix.
all_sc_exp_matrices <- Read_RNA(input_dir = SPEEDI_variables$input_dir, log_flag = TRUE)
nrow(all_sc_exp_matrices)
## [1] 32738
ncol(all_sc_exp_matrices)
## [1] 124692
Our matrix contains 124692 barcodes and 32738 transcripts in total.
While our input data have been previously filtered by Cell Ranger, we still need to perform QC filtering (based on number of transcripts, number of genes, percentage of mitochondrial content, percentage of hemoglobin content, and percentage of ribosomal content). For each category, the FilterRawData_RNA()
function automatically selects the QC threshold for each sample using the kneedle
algorithm and the quantile
function. In addition, the FilterRawData_RNA()
function performs the following tasks:
Seurat
object for easier downstream processingrecord_doublets
is set to TRUE
, records information about doublets in the Seurat
objectsc_obj <- FilterRawData_RNA(all_sc_exp_matrices = all_sc_exp_matrices, species = SPEEDI_variables$species,
output_dir = SPEEDI_variables$RNA_output_dir, log_file_path = SPEEDI_variables$log_file_path,
log_flag = TRUE)
# We can delete all_sc_exp_matrices because we're done with it
rm(all_sc_exp_matrices)
sc_obj
## An object of class Seurat
## 21693 features across 88870 samples within 1 assay
## Active assay: RNA (21693 features, 0 variable features)
## 9 layers present: counts.1.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.2.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject.SeuratProject, counts.unbias.SeuratProject.SeuratProject, counts.unbias.SeuratProject, counts.unbias
After QC filtering is complete, we see that 88870 barcodes and 21693 transcripts remain. You can always see the automatically chosen QC thresholds by looking at the log file (they may also be displayed in your standard R output depending on your environment). In addition, if you’re interested in the quality of your samples before filtering, you can look at the automatically generated pre-filtering QC plots (found in RNA_output_dir
). One example (for number of genes) is displayed below:
Figure 1: Pre-filtered QC plot for number of genes
Now that the data have been filtered, we will normalize the data using SCTransform()
(from Seurat
). We regress against mitochondrial percentage (percent.mt
), ribosomal percentage (percent.rps
and percent.rpl
), and hemoglobin percentage (percent.hb
). We also calculate cell cycle scores and regress against the differences between S score and G2M score (CC.Difference
). Finally, we add metadata from metadata_df
into our Seurat object.
sc_obj <- InitialProcessing_RNA(sc_obj = sc_obj, species = SPEEDI_variables$species,
output_dir = SPEEDI_variables$RNA_output_dir, metadata_df = SPEEDI_variables$metadata_df,
log_flag = TRUE)
Looking at the UMAP generated from InitialProcessing_RNA()
, we can see that the samples are not yet integrated:
Figure 2: UMAP of sample data before batch correction and integration
Next, we will use SPEEDI’s data-derived batch inference method to determine sample batches in our data:
sc_obj <- InferBatches(sc_obj = sc_obj, log_flag = TRUE)
In total, SPEEDI found 4 batches for our 9 samples:
length(unique(sc_obj$batch))
## [1] 4
After we have grouped our samples according to batch, we are ready to integrate by batch. We first normalize counts within each batch using SCTransform()
and then use a selection of methods from Seurat
(SelectIntegrationFeatures()
, PrepSCTIntegration()
, FindIntegrationAnchors()
, and IntegrateData()
) to integrate our data.
sc_obj <- IntegrateByBatch_RNA(sc_obj = sc_obj, log_flag = TRUE)
Note that we will now use the integrated
assay for visualization (as opposed to the SCT
assay) unless only one batch was found above:
sc_obj
## An object of class Seurat
## 43931 features across 88870 samples within 3 assays
## Active assay: integrated (3000 features, 3000 variable features)
## 2 layers present: data, scale.data
## 2 other assays present: RNA, SCT
We perform some final processing of our integrated data (scaling and preparing for differential expression) using the VisualizeIntegration()
function.
sc_obj <- VisualizeIntegration(sc_obj = sc_obj, output_dir = SPEEDI_variables$RNA_output_dir, log_flag = TRUE)
We are now ready to use our PBMC reference to map cell types onto our data. In this case, because we have a full Seurat
reference object, our MapCellTypes_RNA()
function uses the Seurat
functions FindMappingAnchors()
and MapQuery()
to map cell types. For other built-in references, SPEEDI will use the Azimuth
function RunAzimuth()
to map cell types. Finally, SPEEDI runs a majority vote algorithm (MajorityVote_RNA()
) to finalize cell types within each cluster.
sc_obj <- MapCellTypes_RNA(sc_obj = sc_obj, reference = reference, output_dir = SPEEDI_variables$RNA_output_dir,
log_flag = TRUE)
# Table of final cell type predictions
knitr::kable(table(sc_obj$predicted_celltype_majority_vote), col.names = c("Cell Type", "Cell Count"))
Cell Type | Cell Count |
---|---|
B memory | 2789 |
B naive | 3941 |
CD14 Mono | 15975 |
CD16 Mono | 2424 |
CD4 Naive | 2163 |
CD4 TCM | 28319 |
CD8 Naive | 3262 |
CD8 TEM | 10655 |
cDC2 | 678 |
HSPC | 81 |
MAIT | 6531 |
NK | 7426 |
NK_CD56bright | 599 |
pDC | 518 |
Platelet | 1044 |
Treg | 2465 |
The MapCellTypes_RNA()
function prints UMAPs of our integrated sample data broken down by majority vote cell type, raw predicted cell type, cluster, and sample. Below, we print the sample UMAP and the majority vote cell type UMAP.
Figure 3: UMAP of sample data after batch correction and integration
Note that the samples are much better integrated versus our previous UMAP.
Figure 4: UMAP of cell types determined by majority vote
Importantly, if the cells in a given cluster have an even mix of assigned cell types, the majority vote algorithm may not be able to determine an overall cell type for the cluster. We mark these clusters as “Undetermined”. To better understand these Undetermined clusters, we recommend that users visualize the expression of canonical markers for different cell types (e.g., by using the Seurat
function FeaturePlot()
).
In addition to the UMAPs described above, SPEEDI also prints a heatmap showing cell type proportions for each sample:
Figure 5: Heatmap of cell type proportions
Now that we have a uniformly cell-type annotated and integrated dataset, we will use SPEEDI to perform two downstream analyses: differential expression analysis and functional module discovery. We use the run_downstream_analyses_RNA()
function to run both analyses.
de_results <- run_downstream_analyses_RNA(sc_obj = sc_obj, reference_tissue = SPEEDI_variables$reference_tissue,
species = SPEEDI_variables$species, metadata_df = SPEEDI_variables$metadata_df,
output_dir = SPEEDI_variables$RNA_output_dir, log_flag = TRUE)
In our metadata data frame, we provided one metadata category (“disease”) with each sample being assigned a value of “COVID” or “healthy”. The run_downstream_analyses_RNA()
wrapper function finds differentially expressed genes (DEGs) for each cell type (COVID vs. healthy) using SPEEDI’s RunDE_RNA()
function. SPEEDI first uses the Seurat
function FindMarkers()
to find DEGs on a single cell sample level (adjusted p-value threshold of 0.05, log fold change threshold of 0.1, and min.pct threshold of 0.1) and then uses DESeq2
to further curate this list using pseudobulk level filtering (p-value threshold of 0.05 and log fold change threshold of 0.3).
The final output is a table containing DEGs for each cell type:
DEGs_table <- utils::read.table(paste0(SPEEDI_variables$RNA_output_dir, "disease.DE.tsv"), header = TRUE, sep = "\t")
print(nrow(DEGs_table))
## [1] 2516
subset_DEGs_table <- rbind(DEGs_table[100:110,], DEGs_table[400:410,])
subset_DEGs_table$sc_pval_adj <- format(subset_DEGs_table$sc_pval_adj, digits = 5)
knitr::kable(subset_DEGs_table, "simple")
Cell_Type | Gene_Name | sc_pval_adj | sc_log2FC | pseudo_bulk_pval | pseudo_bulk_log2FC | |
---|---|---|---|---|---|---|
100 | CD4 TCM | NR4A2 | 5.6160e-210 | -1.5251357 | 0.0007697 | -1.8935872 |
101 | CD4 TCM | SUN2 | 6.9428e-185 | 0.7928248 | 0.0128745 | 0.8659437 |
102 | CD4 TCM | MT-ND1 | 2.5417e-167 | -0.2726354 | 0.0060409 | -0.3768555 |
103 | CD4 TCM | DNAJB1 | 1.4335e-160 | -0.5047394 | 0.0035380 | -0.6876837 |
104 | CD4 TCM | KLF6 | 5.8023e-155 | 0.7111543 | 0.0002115 | 0.6499916 |
105 | CD4 TCM | LINC00861 | 6.7041e-152 | 0.6511951 | 0.0000535 | 0.5334065 |
106 | CD4 TCM | C1orf63 | 4.0507e-127 | 0.5987083 | 0.0134283 | 0.4393595 |
107 | CD4 TCM | ZFP36 | 1.6867e-125 | -0.6325388 | 0.0053459 | -0.8987507 |
108 | CD4 TCM | ZRANB2 | 2.4364e-118 | 0.6791998 | 0.0000096 | 0.6048188 |
109 | CD4 TCM | FKBP8 | 1.4353e-110 | 0.5802708 | 0.0203568 | 0.6735848 |
110 | CD4 TCM | RN7SL1 | 2.0350e-110 | 1.7416472 | 0.0181545 | 1.6971147 |
400 | CD8 Naive | RASSF3 | 2.4735e-04 | 1.0512495 | 0.0406761 | 0.7675221 |
401 | CD8 Naive | C11orf48 | 2.7214e-04 | 0.8435597 | 0.0124949 | 0.7051809 |
402 | CD8 Naive | CABIN1 | 5.2228e-04 | 0.6316841 | 0.0432814 | 0.3746976 |
403 | CD8 Naive | NHLRC3 | 5.7896e-04 | 0.8454088 | 0.0336770 | 0.5575887 |
404 | CD8 Naive | LINC00936 | 6.1817e-04 | 0.8217519 | 0.0414509 | 0.5629821 |
405 | CD8 Naive | RBM3 | 6.3853e-04 | -0.2762142 | 0.0121591 | -0.3742365 |
406 | CD8 Naive | CCR7 | 9.2105e-04 | -0.2937378 | 0.0053840 | -0.3944238 |
407 | CD8 Naive | METTL12 | 9.2811e-04 | -1.0073312 | 0.0024619 | -1.1553749 |
408 | CD8 Naive | ZNF331 | 1.2551e-03 | -0.8437378 | 0.0003378 | -1.1235462 |
409 | CD8 Naive | TRIM73 | 1.3633e-03 | 0.9726430 | 0.0390137 | 0.6770856 |
410 | CD8 Naive | WDR37 | 1.8856e-03 | 0.8626598 | 0.0456461 | 0.5463623 |
In total, there were 2516 DEGs found. We show a subset of the DEGs found for CD4 TCM and CD8 Naive cells.
Next, because our samples are human, the run_downstream_analyses_RNA()
wrapper function will also run functional module discovery (FMD) using SPEEDI’s run_fmd_wrapper()
function. As input, SPEEDI uses the cell-type specific DEGs found above. First, each list of cell-type DEGs is divided into separate lists of positive and negative fold change. For each separate list, if at least 20 genes are found in HumanBase, an FMD job is launched. Results are provided in a TSV file and include a HumanBase URL linking to the user’s full results as well as a complete list of associated gene ontology (GO) enrichment terms.
Below, we show example output for CD16 monocytes that were upregulated in our COVID patients:
FMD_file <- paste0(SPEEDI_variables$RNA_output_dir, "FMD_high_CD16_Mono_blood_disease.tsv")
FMD_URL <- readLines(FMD_file, n = 1)
FMD_table <- utils::read.table(FMD_file, header = TRUE, sep = "\t")
print(FMD_URL)
## [1] "# https://hb.flatironinstitute.org/module/overview/?body_tag=c89a1690becc46f63dfeb2185b97feb26eb6ba73"
knitr::kable(head(FMD_table, 20), "simple")
cluster | term | name | p_value | q_value |
---|---|---|---|---|
0 | GO:0017144 | drug metabolic process | 0.0387789 | 0.0634331 |
0 | GO:0042110 | T cell activation | 0.0562688 | 0.0794194 |
0 | GO:0001775 | cell activation | 0.0451936 | 0.0689062 |
0 | GO:0044282 | small molecule catabolic process | 0.0610010 | 0.0836018 |
0 | GO:0045321 | leukocyte activation | 0.0318460 | 0.0576678 |
0 | GO:1901136 | carbohydrate derivative catabolic process | 0.0182024 | 0.0424151 |
0 | GO:0009164 | nucleoside catabolic process | 0.0013437 | 0.0195233 |
0 | GO:0009116 | nucleoside metabolic process | 0.0076972 | 0.0322241 |
0 | GO:1901657 | glycosyl compound metabolic process | 0.0129072 | 0.0403553 |
0 | GO:0009119 | ribonucleoside metabolic process | 0.0052480 | 0.0275802 |
0 | GO:0034656 | nucleobase-containing small molecule catabolic process | 0.0013437 | 0.0195233 |
0 | GO:1901658 | glycosyl compound catabolic process | 0.0021896 | 0.0208010 |
0 | GO:0022407 | regulation of cell-cell adhesion | 0.0763038 | 0.0947087 |
0 | GO:0042454 | ribonucleoside catabolic process | 0.0010601 | 0.0195233 |
0 | GO:0031334 | positive regulation of protein complex assembly | 0.0582871 | 0.0818006 |
0 | GO:0032091 | negative regulation of protein binding | 0.0186749 | 0.0427103 |
0 | GO:0002253 | activation of immune response | 0.0021053 | 0.0208010 |
0 | GO:0002429 | immune response-activating cell surface receptor signaling pathway | 0.0177345 | 0.0424151 |
0 | GO:0050778 | positive regulation of immune response | 0.0008411 | 0.0195233 |
0 | GO:0002757 | immune response-activating signal transduction | 0.0009157 | 0.0195233 |
If we go to the URL listed at the top of the table, we will see a network visualization of our results:
Figure 6: HumanBase network visualization for CD16 monocytes upregulated in COVID patients (background network of blood)
Note that HumanBase has multiple background networks for FMD. In our case, because our reference tissue was PBMC_full
, SPEEDI automatically analyzed each list of DEGs using the global
network as well as the blood
network.
Above, we went through the SPEEDI pipeline step-by-step. However, the easiest way to run SPEEDI is with the run_SPEEDI()
wrapper function. This function will complete all of the above steps automatically and will even save your Seurat
object in RNA_output_dir
. For our example, we would simply write:
sc_obj <- run_SPEEDI(reference_tissue = reference_tissue, data_type = data_type, species = species,
input_dir = input_dir, reference_dir = reference_dir,
output_dir = output_dir, metadata_df = metadata_df)
In addition to scRNA-seq analyses, SPEEDI can also run scATAC-seq, paired scRNA-seq / scATAC-seq, and same-cell multiome snRNA-seq / snRNA-seq analyses. To learn more, read the documentation for run_SPEEDI()
!
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Springdale Open Enterprise Linux 8.8 (Modena)
##
## Matrix products: default
## BLAS: /usr/local/R/4.2.3/lib64/R/lib/libRblas.so
## LAPACK: /usr/local/R/4.2.3/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bluster_1.8.0 Seurat_5.0.0
## [3] SeuratObject_5.0.0 sp_2.1-1
## [5] SPEEDI_1.2.0.0000 rhdf5_2.42.1
## [7] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [9] MatrixGenerics_1.10.0 Rcpp_1.0.11
## [11] Matrix_1.6-1.1 GenomicRanges_1.50.2
## [13] GenomeInfoDb_1.34.9 IRanges_2.32.0
## [15] S4Vectors_0.36.2 BiocGenerics_0.44.0
## [17] matrixStats_1.1.0 data.table_1.14.8
## [19] stringr_1.5.0 plyr_1.8.9
## [21] magrittr_2.0.3 ggplot2_3.4.4
## [23] gtable_0.3.4 gtools_3.9.4
## [25] gridExtra_2.3 ArchR_1.0.2
## [27] BiocStyle_2.26.0 knitr_1.45
## [29] rmarkdown_2.25
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.4 spatstat.explore_3.2-5
## [3] reticulate_1.34.0 tidyselect_1.2.0
## [5] RSQLite_2.3.3 AnnotationDbi_1.60.2
## [7] htmlwidgets_1.6.2 BiocParallel_1.32.6
## [9] Rtsne_0.16 munsell_0.5.0
## [11] codetools_0.2-19 ragg_1.2.6
## [13] ica_1.0-3 future_1.33.0
## [15] miniUI_0.1.1.1 withr_2.5.2
## [17] spatstat.random_3.2-1 colorspace_2.1-0
## [19] progressr_0.14.0 highr_0.10
## [21] ROCR_1.0-11 tensor_1.5
## [23] listenv_0.9.0 labeling_0.4.3
## [25] GenomeInfoDbData_1.2.9 polyclip_1.10-6
## [27] pheatmap_1.0.12 bit64_4.0.5
## [29] farver_2.1.1 rprojroot_2.0.4
## [31] parallelly_1.36.0 vctrs_0.6.4
## [33] generics_0.1.3 xfun_0.41
## [35] R6_2.5.1 doParallel_1.0.17
## [37] ggbeeswarm_0.7.2 locfit_1.5-9.8
## [39] hdf5r_1.3.8 bitops_1.0-7
## [41] rhdf5filters_1.10.1 spatstat.utils_3.0-4
## [43] cachem_1.0.8 DelayedArray_0.24.0
## [45] promises_1.2.1 scales_1.2.1
## [47] beeswarm_0.4.0 Cairo_1.6-1
## [49] globals_0.16.2 goftest_1.2-3
## [51] spam_2.10-0 rlang_1.1.2
## [53] systemfonts_1.0.5 splines_4.2.3
## [55] lazyeval_0.2.2 spatstat.geom_3.2-7
## [57] BiocManager_1.30.22 yaml_2.3.7
## [59] reshape2_1.4.4 abind_1.4-5
## [61] httpuv_1.6.12 SeuratDisk_0.0.0.9021
## [63] tools_4.2.3 bookdown_0.36
## [65] ellipsis_0.3.2 jquerylib_0.1.4
## [67] RColorBrewer_1.1-3 ggridges_0.5.4
## [69] sparseMatrixStats_1.10.0 zlibbioc_1.44.0
## [71] purrr_1.0.2 RCurl_1.98-1.13
## [73] deldir_1.0-9 pbapply_1.7-2
## [75] cowplot_1.1.1 lisi_1.0
## [77] zoo_1.8-12 ggrepel_0.9.4
## [79] cluster_2.1.4 here_1.0.1
## [81] RSpectra_0.16-1 glmGamPoi_1.10.2
## [83] scattermore_1.2 lmtest_0.9-40
## [85] RANN_2.6.1 fitdistrplus_1.1-11
## [87] patchwork_1.1.3 mime_0.12
## [89] evaluate_0.23 xtable_1.8-4
## [91] XML_3.99-0.15 fastDummies_1.7.3
## [93] compiler_4.2.3 tibble_3.2.1
## [95] KernSmooth_2.23-22 crayon_1.5.2
## [97] htmltools_0.5.7 later_1.3.1
## [99] geneplotter_1.76.0 tidyr_1.3.0
## [101] logr_1.3.5 DBI_1.1.3
## [103] rappdirs_0.3.3 MASS_7.3-60
## [105] cli_3.6.1 parallel_4.2.3
## [107] dotCall64_1.1-0 igraph_1.5.1
## [109] pkgconfig_2.0.3 plotly_4.10.3
## [111] spatstat.sparse_3.0-3 foreach_1.5.2
## [113] annotate_1.76.0 vipor_0.4.5
## [115] bslib_0.5.1 XVector_0.38.0
## [117] digest_0.6.33 sctransform_0.4.1
## [119] RcppAnnoy_0.0.21 common_1.1.0
## [121] Biostrings_2.66.0 spatstat.data_3.0-3
## [123] leiden_0.4.3 uwot_0.1.16
## [125] DelayedMatrixStats_1.20.0 curl_5.1.0
## [127] shiny_1.7.5.1 outliers_0.15
## [129] lifecycle_1.0.4 nlme_3.1-163
## [131] jsonlite_1.8.7 Rhdf5lib_1.20.0
## [133] BiocNeighbors_1.16.0 limma_3.54.2
## [135] viridisLite_0.4.2 fansi_1.0.5
## [137] pillar_1.9.0 lattice_0.22-5
## [139] KEGGREST_1.38.0 ggrastr_1.0.2
## [141] fastmap_1.1.1 httr_1.4.7
## [143] survival_3.5-7 glue_1.6.2
## [145] png_0.1-8 iterators_1.0.14
## [147] bit_4.0.5 presto_1.0.0
## [149] stringi_1.7.12 sass_0.4.7
## [151] blob_1.2.4 textshaping_0.3.7
## [153] RcppHNSW_0.5.0 DESeq2_1.38.3
## [155] memoise_2.0.1 dplyr_1.1.3
## [157] irlba_2.3.5.1 future.apply_1.11.0