This tutorial describes how to use harmony in Seurat v5 single-cell
analysis workflows. RunHarmony() is a generic function is
designed to interact with Seurat objects. This vignette will walkthrough
basic workflow of Harmony with Seurat objects. Also, it will provide
some basic downstream analyses demonstrating the properties of
harmonized cell embeddings and a brief explanation of the exposed
algorithm parameters.
Install Harmony from CRAN with standard commands.
For this demo, we will be aligning two groups of PBMCs Kang et al., 2017. In this experiment, PBMCs are in stimulated and control conditions. The stimulated PBMC group was treated with interferon beta.
## Generate SeuratObject
```r
## Source required data
data("pbmc_stim")
pbmc <- CreateSeuratObject(counts = cbind(pbmc.stim, pbmc.ctrl), project = "PBMC", min.cells = 5)
## Separate conditions
pbmc@meta.data$stim <- c(rep("STIM", ncol(pbmc.stim)), rep("CTRL", ncol(pbmc.ctrl)))The example above contains only two thousand cells. The full Kang et al., 2017 dataset is deposited in the GEO. This analysis uses GSM2560248 and GSM2560249 samples from GSE96583_RAW.tar file and the GSE96583_batch2.genes.tsv.gz gene file.
library(Matrix)
## Download and extract files from GEO
##setwd("/path/to/downloaded/files")
genes =  read.table("GSE96583_batch2.genes.tsv.gz", header = FALSE, sep = "\t")
pbmc.ctrl.full = as.readMM("GSM2560248_2.1.mtx.gz")
colnames(pbmc.ctrl.full) = paste0(read.table("GSM2560248_barcodes.tsv.gz", header = FALSE, sep = "\t")[,1], "-1")
rownames(pbmc.ctrl.full) = genes$V1
pbmc.stim.full = readMM("GSM2560249_2.2.mtx.gz")
colnames(pbmc.stim.full) = paste0(read.table("GSM2560249_barcodes.tsv.gz", header = FALSE, sep = "\t")[,1], "-2")
rownames(pbmc.stim.full) = genes$V1
library(Seurat)
pbmc <- CreateSeuratObject(counts = cbind(pbmc.stim.full, pbmc.ctrl.full), project = "PBMC", min.cells = 5)
pbmc@meta.data$stim <- c(rep("STIM", ncol(pbmc.stim.full)), rep("CTRL", ncol(pbmc.ctrl.full)))
# Running Harmony
Harmony works on an existing matrix with cell embeddings and outputs its transformed version with the datasets aligned according to some user-defined experimental conditions. By default, harmony will look up the `pca` cell embeddings and use these to run harmony. Therefore, it assumes that the Seurat object has these embeddings already precomputed.
## Calculate PCA cell embeddings
Here, using `Seurat::NormalizeData()`, we will be generating a union of highly variable genes using each condition (the control and stimulated cells). These features are going to be subsequently used to generate the 20 PCs with `Seurat::RunPCA()`.pbmc <- pbmc %>%
    NormalizeData(verbose = FALSE)
VariableFeatures(pbmc) <- split(row.names(pbmc@meta.data), pbmc@meta.data$stim) %>% lapply(function(cells_use) {
    pbmc[,cells_use] %>%
        FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>% 
        VariableFeatures()
}) %>% unlist %>% unique
#> Finding variable features for layer counts
#> Finding variable features for layer counts
pbmc <- pbmc %>% 
    ScaleData(verbose = FALSE) %>% 
    RunPCA(features = VariableFeatures(pbmc), npcs = 20, verbose = FALSE)To run harmony on Seurat object after it has been normalized, only one argument needs to be specified which contains the batch covariate located in the metadata. For this vignette, further parameters are specified to align the dataset but the minimum parameters are shown in the snippet below:
Here, we will be running harmony with some indicative parameters and plotting the convergence plot to illustrate some of the under the hood functionality.
pbmc <- pbmc %>% 
    RunHarmony("stim", plot_convergence = TRUE, nclust = 50, max_iter = 10, early_stop = T)
#> Transposing data matrix
#> Initializing state using k-means centroids initialization
#> Harmony 1/10
#> Harmony 2/10
#> Harmony 3/10
#> Harmony 4/10
#> Harmony 5/10
#> Harmony converged after 5 iterations
By setting plot_converge=TRUE, harmony will generate a plot
with its objective showing the flow of the integration. Each point
represents the cost measured after a clustering round. Different colors
represent different Harmony iterations which is controlled by
max_iter (assuming that early_stop=FALSE). Here
max_iter=10 and up to 10 correction steps are expected.
However, early_stop=TRUE so harmony will stop after the
cost plateaus.
RunHarmony has several parameters accessible to users
which are outlined below.
object (required)The Seurat object. This vignette assumes Seurat objects are version 5.
group.by.vars (required)A character vector that specifies all the experimental covariates to be corrected/harmonized by the algorithm.
When using RunHarmony() with Seurat, harmony will look
up the group.by.vars metadata fields in the Seurat Object
metadata.
For example, given the pbmc[["stim"]] exists as the stim
condition, setting group.by.vars="stim" will perform
integration of these samples accordingly. If you want to integrate on
another variable, it needs to be present in Seurat object’s
meta.data.
To correct for several covariates, specify them in a vector:
group.by.vars = c("stim", "new_covariate").
reduction.useThe cell embeddings to be used for the batch alignment. This
parameter assumes that a reduced dimension already exists in the
reduction slot of the Seurat object. By default, the pca
reduction is used.
dims.useOptional parameter which can use a name vector to select specific dimensions to be harmonized.
nclustis a positive integer. Under the hood, harmony applies k-means
soft-clustering. For this task, k needs to be determined.
nclust corresponds to k. The harmonization
results and performance are not particularly sensitive for a reasonable
range of this parameter value. If this parameter is not set, harmony
will autodetermine this based on the dataset size with a maximum cap of
200. For dataset with a vast amount of different cell types and batches
this pamameter may need to be determined manually.
sigmaa positive scalar that controls the soft clustering probability
assignment of single-cells to different clusters. Larger values will
assign a larger probability to distant clusters of cells resulting in a
different correction profile. Single-cells are assigned to clusters by
their euclidean distance \(d\) to some
cluster center \(Y\) after cosine
normalization which is defined in the range [0,4]. The clustering
probability of each cell is calculated as \(e^{-\frac{d}{\sigma}}\) where \(\sigma\) is controlled by the
sigma parameter. Default value of sigma is 0.1
and it generally works well since it defines probability assignment of a
cell in the range \([e^{-40}, e^0]\).
Larger values of sigma restrict the dynamic range of
probabilities that can be assigned to cells. For example,
sigma=1 will yield a probabilities in the range of \([e^{-4}, e^0]\).
thetatheta is a positive scalar vector that determines the
coefficient of harmony’s diversity penalty for each corrected
experimental covariate. In challenging experimental conditions,
increasing theta may result in better integration results. Theta is an
expontential parameter of the diversity penalty, thus setting
theta=0 disables this penalty while increasing it to
greater values than 1 will perform more aggressive corrections in an
expontential manner. By default, it will set theta=2 for
each experimental covariate.
max_iterThe number of correction steps harmony will perform before completing
the data set integration. In general, more iterations than necessary
increases computational runtime especially which becomes evident in
bigger datasets. Setting early_stop=TRUE may reduce the
actual number of correction steps which will be smaller than
max_iter.
early_stopUnder the hood, harmony minimizes its objective function through a
series of clustering and integration tests. By setting
early_stop=TRUE, when the objective function is less than
1e-4 after a correction step harmony exits before reaching
the max_iter correction steps. This parameter can
drastically reduce run-time in bigger datasets.
.optionsA set of internal algorithm parameters that can be overriden. For advanced users only.
These parameters are Seurat-specific and do not affect the flow of the algorithm.
project_dimToggle-like parameter, by default project_dim=TRUE. When
enabled, RunHarmony() calculates genomic feature loadings
using Seurat’s ProjectDim() that correspond to the
harmonized cell embeddings.
reduction.saveThe new Reduced Dimension slot identifier. By default,
reduction.save=TRUE. This option allows several independent
runs of harmony to be retained in the appropriate slots in the
SeuratObjects. It is useful if you want to try Harmony with multiple
parameters and save them as e.g. ‘harmony_theta0’, ‘harmony_theta1’,
‘harmony_theta2’.
These parameters help users troubleshoot harmony.
plot_convergenceOption that plots the convergence plot after the execution of the
algorithm. By default FALSE. Setting it to
TRUE will collect harmony’s objective value and plot it
allowing the user to troubleshoot the flow of the algorithm and
fine-tune the parameters of the dataset integration procedure.
RunHarmony() returns the Seurat object which contains
the harmonized cell embeddings in a slot named harmony.
This entry can be accessed via pbmc@reductions$harmony. To
access the values of the cell embeddings we can also use:
After Harmony integration, we should inspect the quality of the harmonization and contrast it with the unharmonized algorithm input. Ideally, cells from different conditions will align along the Harmonized PCs. If they are not, you could increase the theta value above to force a more aggressive fit of the dataset and rerun the workflow.
Evaluate harmonization of stim parameter in the harmony generated cell embeddings
Plot Genes correlated with the Harmonized PCs
The harmonized cell embeddings generated by harmony can be used for
further integrated analyses. In this workflow, the Seurat object
contains the harmony reduction modality name in the method
that requires it.
pbmc <- pbmc %>%
    FindNeighbors(reduction = "harmony") %>%
    FindClusters(resolution = 0.5) 
#> Computing nearest neighbor graph
#> Computing SNN
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 2000
#> Number of edges: 71873
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8714
#> Number of communities: 10
#> Elapsed time: 0 secondst-SNE Visualization of harmony embeddings
One important observation is to assess that the harmonized data contain biological states of the cells. Therefore by checking the following genes we can see that biological cell states are preserved after harmonization.
Expression of gene panel heatmap in the harmonized PBMC dataset
Very similarly with TSNE we can run UMAP by passing the harmony reduction in the function.
pbmc <- pbmc %>%
    RunUMAP(reduction = "harmony",  dims = 1:20)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session
#> 14:11:22 UMAP embedding parameters a = 0.9922 b = 1.112
#> 14:11:22 Read 2000 rows and found 20 numeric columns
#> 14:11:22 Using Annoy for neighbor search, n_neighbors = 30
#> 14:11:22 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 14:11:22 Writing NN index file to temp file /tmp/Rtmph1Sdwa/file66221f6e48d6
#> 14:11:22 Searching Annoy index using 1 thread, search_k = 3000
#> 14:11:22 Annoy recall = 100%
#> 14:11:23 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 14:11:23 Initializing from normalized Laplacian + noise (using RSpectra)
#> 14:11:23 Commencing optimization for 500 epochs, with 83170 positive edges
#> 14:11:24 Optimization finished
p1 <- DimPlot(pbmc, reduction = "umap", group.by = "stim", pt.size = .1)
p2 <- DimPlot(pbmc, reduction = "umap", label = TRUE,  pt.size = .1)
plot_grid(p1, p2)UMAP Visualization of harmony embeddings
sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-conda-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS/LAPACK: /home/main/miniconda3/envs/Renv/lib/libopenblasp-r0.3.21.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] cowplot_1.1.3      dplyr_1.1.4        Seurat_5.0.1       SeuratObject_5.0.1
#> [5] sp_1.6-1           harmony_1.2.3      Rcpp_1.0.12       
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rtsne_0.16             colorspace_2.1-0       deldir_1.0-9          
#>   [4] ellipsis_0.3.2         ggridges_0.5.4         RcppHNSW_0.6.0        
#>   [7] spatstat.data_3.0-1    farver_2.1.2           leiden_0.4.3          
#>  [10] listenv_0.9.0          ggrepel_0.9.3          RSpectra_0.16-1       
#>  [13] fansi_1.0.6            codetools_0.2-19       splines_4.2.0         
#>  [16] cachem_1.0.7           knitr_1.42             polyclip_1.10-4       
#>  [19] spam_2.10-0            jsonlite_1.8.7         RhpcBLASctl_0.23-42   
#>  [22] ica_1.0-3              cluster_2.1.4          png_0.1-8             
#>  [25] uwot_0.1.16            spatstat.sparse_3.0-1  shiny_1.7.4           
#>  [28] sctransform_0.4.1      compiler_4.2.0         httr_1.4.5            
#>  [31] Matrix_1.6-3           fastmap_1.1.1          lazyeval_0.2.2        
#>  [34] cli_3.6.2              later_1.3.0            htmltools_0.5.6.1     
#>  [37] tools_4.2.0            igraph_1.6.0           dotCall64_1.1-1       
#>  [40] gtable_0.3.5           glue_1.7.0             RANN_2.6.1            
#>  [43] reshape2_1.4.4         scattermore_1.2        jquerylib_0.1.4       
#>  [46] vctrs_0.6.5            nlme_3.1-162           spatstat.explore_3.2-1
#>  [49] progressr_0.13.0       lmtest_0.9-40          spatstat.random_3.1-5 
#>  [52] xfun_0.40              stringr_1.5.0          globals_0.16.2        
#>  [55] mime_0.12              miniUI_0.1.1.1         lifecycle_1.0.4       
#>  [58] irlba_2.3.5.1          goftest_1.2-3          future_1.32.0         
#>  [61] MASS_7.3-58.3          zoo_1.8-12             scales_1.3.0          
#>  [64] promises_1.2.0.1       spatstat.utils_3.0-5   parallel_4.2.0        
#>  [67] RColorBrewer_1.1-3     yaml_2.3.7             reticulate_1.29       
#>  [70] pbapply_1.7-0          gridExtra_2.3          ggplot2_3.5.1         
#>  [73] sass_0.4.5             stringi_1.7.12         highr_0.10            
#>  [76] fastDummies_1.7.3      rlang_1.1.3            pkgconfig_2.0.3       
#>  [79] matrixStats_1.0.0      evaluate_0.22          lattice_0.20-45       
#>  [82] tensor_1.5             ROCR_1.0-11            purrr_1.0.2           
#>  [85] labeling_0.4.3         patchwork_1.2.0        htmlwidgets_1.6.2     
#>  [88] tidyselect_1.2.1       parallelly_1.36.0      RcppAnnoy_0.0.22      
#>  [91] plyr_1.8.8             magrittr_2.0.3         R6_2.5.1              
#>  [94] generics_0.1.3         withr_3.0.0            pillar_1.9.0          
#>  [97] fitdistrplus_1.1-11    abind_1.4-5            survival_3.5-5        
#> [100] tibble_3.2.1           future.apply_1.11.0    KernSmooth_2.23-21    
#> [103] utf8_1.2.4             spatstat.geom_3.2-1    plotly_4.10.2         
#> [106] rmarkdown_2.21         grid_4.2.0             data.table_1.14.8     
#> [109] digest_0.6.33          xtable_1.8-4           tidyr_1.3.0           
#> [112] httpuv_1.6.9           munsell_0.5.1          viridisLite_0.4.2     
#> [115] bslib_0.4.2