In scRNA-seq analysis, batch effects are a major source of unwanted variability. They arise from a range of factors, including the timing of cell capture, the personnel handling the samples, reagent lots, equipment, and even the technology platform used. These factors can introduce substantial discrepancies into the data. Numerous algorithms have been developed to mitigate batch effects; among them, Harmony and fastMNN are the ones I use most often in my analyses.
1. fastMNN
fastMNN is built upon its more complex predecessor, the mutual nearest neighbors (MNN) algorithm, so we first discuss MNN (1).
a. First and foremost, MNN assumes that the data from two batches lie on parallel hyperplanes in the high-dimensional gene expression space (Figure 1a). In this picture, the batch effect can be envisioned as vectors between the batches that are almost orthogonal to those hyperplanes.
b. To find the vectors representing the batch effect, the MNN algorithm identifies MNN pairs of cells. Consider two batches. For each cell i in batch 1, we identify its k nearest neighbors among the cells in batch 2; likewise, for each cell j in batch 2, we identify its k nearest neighbors among the cells in batch 1. If cell j is one of cell i's nearest neighbors and cell i is one of cell j's nearest neighbors, the two cells are deemed an MNN pair. Under the assumptions of the MNN algorithm, the two cells of a pair are of the same type. Distances between cells are calculated as cosine distances.
Note: A crucial assumption here is that MNN pairs should share the same cell type. Hence, if two datasets do not include equivalent cell types, suitable MNNs cannot be found, which may lead to inappropriate integrations.
c. Once the MNN pairs are identified, the difference between the two cells of each pair, which can be represented as a vector, is treated as the batch effect and used to adjust the data in batch 2. The adjustment itself is straightforward: each cell in batch 2 is shifted in the high-dimensional gene expression space by subtracting a correction vector. Because there are many MNN pairs, there are many pair-specific correction vectors. The MNN algorithm does not simply average them; it computes a cell-specific batch-correction vector as a weighted average of the pair-specific vectors, with weights given by a Gaussian kernel. Put simply, for a given cell in batch 2, the closer the batch 2 member of an MNN pair is to that cell, the higher the weight of that pair's vector.
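Steps b and c can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: it uses only the standard library, plain squared Euclidean distance instead of the cosine distance mentioned above, and made-up two-gene expression values. The function names and the kernel bandwidth `sigma` are hypothetical and introduced only for this example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query, reference, k):
    """Indices of the k cells in `reference` that are closest to `query`."""
    order = sorted(range(len(reference)), key=lambda j: sq_dist(query, reference[j]))
    return set(order[:k])


def find_mnn_pairs(batch1, batch2, k):
    """Return (i, j) pairs where cell i (batch 1) and cell j (batch 2)
    each appear among the other's k nearest neighbors."""
    nn_1_to_2 = [k_nearest(cell, batch2, k) for cell in batch1]
    nn_2_to_1 = [k_nearest(cell, batch1, k) for cell in batch2]
    return [(i, j)
            for i, neighbors in enumerate(nn_1_to_2)
            for j in sorted(neighbors)
            if i in nn_2_to_1[j]]


def correct_batch2(batch1, batch2, pairs, sigma=1.0):
    """Subtract a cell-specific, Gaussian-weighted average of the
    pair-specific batch vectors from every cell in batch 2."""
    # One batch vector per MNN pair: batch-2 member minus batch-1 member.
    pair_vectors = [[b - a for a, b in zip(batch1[i], batch2[j])] for i, j in pairs]
    corrected = []
    for cell in batch2:
        # Pairs whose batch-2 member lies closer to this cell get more weight.
        weights = [math.exp(-sq_dist(cell, batch2[j]) / (2 * sigma ** 2))
                   for _, j in pairs]
        total = sum(weights)
        shift = [sum(w * v[d] for w, v in zip(weights, pair_vectors)) / total
                 for d in range(len(cell))]
        corrected.append([x - s for x, s in zip(cell, shift)])
    return corrected


# Made-up data: two cell types separated along gene 1,
# with batch 2 shifted by roughly +3 along gene 2 (the "batch effect").
batch1 = [[0.0, 0.0], [0.2, 0.1], [5.0, 0.0], [5.1, 0.2]]
batch2 = [[0.1, 3.0], [0.3, 3.1], [5.2, 3.0], [4.9, 3.1]]

pairs = find_mnn_pairs(batch1, batch2, k=1)
corrected = correct_batch2(batch1, batch2, pairs, sigma=1.0)
print(pairs)
for cell in corrected:
    print([round(x, 2) for x in cell])
```

In this toy data the shift along gene 2 is removed while the separation of the two cell types along gene 1 is kept. Note that the sketch would pair cells of different types if the two batches shared no cell type, which is exactly the failure mode described in the note under step b.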
As outlined above, MNN corrects batch effects under several important assumptions, which should not be violated.
First, it assumes the presence of at least one cell population common to both batches. As discussed earlier, this assumption is key to the functioning of the MNN algorithm.
Second, it supposes that the batch effect is nearly orthogonal to the biological subspace. The authors posit that in high-dimensional space, this is usually the case.
Third, the variation in batch effects across cells is assumed to be much smaller than the biological variation between different cell types. If this assumption is violated, MNN pairs can no longer be relied upon to match cells of the same type across the two batches.
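The intuition behind the second assumption can be checked numerically. The sketch below assumes, purely for illustration, that the batch vector and a biological direction behave like two independent random directions; real batch effects need not behave this way, and the function name `mean_abs_cosine` is hypothetical. It shows that the average absolute cosine between two such directions shrinks toward zero (near-orthogonality) as the number of dimensions grows.

```python
import math
import random


def mean_abs_cosine(dim, trials=200, seed=0):
    """Average |cos(angle)| between pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        total += abs(dot / (norm_u * norm_v))
    return total / trials


# A value near 0 means the two directions are close to orthogonal.
for dim in (2, 20, 1000):
    print(dim, round(mean_abs_cosine(dim), 3))
```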
Because MNN uses the expression data of all genes and constructs the high-dimensional space from all of them, identifying MNN pairs can be quite resource-intensive and time-consuming. To circumvent this, fastMNN first applies Principal Component Analysis for dimension reduction and constructs the data space from the resulting principal components, which significantly improves efficiency.
In our analysis, when we attempted to merge epithelial cells from different samples, their considerable heterogeneity violated the first assumption and led to unsatisfactory UMAP plots.
2. Harmony
The concept behind Harmony is intriguing, but its implementation is notably complex, and fully grasping it requires extensive mathematical knowledge (2).
Harmony employs a method known as soft clustering to maintain high batch diversity within each cluster (3), thereby mitigating the batch effect. This differs from traditional clustering techniques such as k-means, which assign every cell to exactly one cluster.
In Harmony's soft clustering approach, each cell is assigned to several clusters with certain probabilities. For instance, in conventional k-means clustering, if cell i belongs to cluster 1, this relationship is denoted as Ri1=1, and the same cell does not belong to, say, cluster 2, i.e., Ri2=0. In soft clustering, the relationship might instead be Ri1=0.4 and Ri2=0.2, with the remaining probability spread over the other clusters, indicating proportional associations.
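The difference between hard and soft assignment can be sketched as follows. This is a generic soft k-means style assignment under stated assumptions, not Harmony's actual update rule, which additionally includes a term favoring batch diversity. The function names, the centroids, and the parameter `sigma` (which controls how soft the assignment is) are hypothetical and introduced only for this example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between a cell and a centroid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_assign(cell, centroids):
    """k-means style: membership is 1 for the nearest centroid, 0 elsewhere."""
    dists = [sq_dist(cell, c) for c in centroids]
    best = dists.index(min(dists))
    return [1.0 if k == best else 0.0 for k in range(len(centroids))]


def soft_assign(cell, centroids, sigma=1.0):
    """Soft clustering: membership probabilities that sum to 1."""
    dists = [sq_dist(cell, c) for c in centroids]
    # Subtract the smallest distance before exponentiating for numerical stability.
    d_min = min(dists)
    scores = [math.exp(-(d - d_min) / sigma) for d in dists]
    total = sum(scores)
    return [s / total for s in scores]


centroids = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
cell = [0.8, 0.3]

print(hard_assign(cell, centroids))                         # one cluster gets all the mass
print([round(p, 2) for p in soft_assign(cell, centroids)])  # mass shared across clusters
```

A larger `sigma` spreads the membership more evenly across clusters, while a very small `sigma` approaches the hard assignment.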
In essence, Harmony's goal is to maximize batch diversity within each cluster. A full understanding of the algorithm does require a solid mathematical foundation, which makes it hard to appreciate in its entirety without prior exposure.
In summary, both MNN and Harmony rely on the presence of shared cell types across batches to correct batch effects. When dealing with highly heterogeneous cell types, these methods may yield unpredictable results. In such cases, a method such as LIGER, which has shown robust performance with non-identical cell types, may prove more effective for integrating epithelial cells (4).
References
1. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology. 2018;36(5):421-7.
2. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods. 2019;16(12):1289-96.
3. Mao Q, Wang L, Goodison S, Sun Y. Dimensionality reduction via graph structure learning. 2015. p. 765-74.
4. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biology. 2020;21(1):12.