Dimensionality Reduction Algorithms and Visualization in Bioinformatics

Comparison of t-SNE and UMAP

Yueh-Hua Tu 杜岳華

2020.4.10

Outline

  • Brief introduction to dimensionality reduction
  • Types of dimensionality reduction
  • T-distributed stochastic neighbor embedding (t-SNE)
  • Uniform manifold approximation and projection (UMAP)
  • Comparison of the two models
  • Dimensionality reduction via graphs

Biological Data

High-dimensional biological data

High-dimensional data form a manifold

Example of dimensionality reduction

Dimensionality reduction brings information loss

Dimensionality Reduction

A mapping from a high-dimensional space to a low-dimensional space.

Types of Dimensionality Reduction

  • Linear method
  • Non-linear method
    • Preserve global data structure
    • Preserve local data structure

Global structure vs. local structure

Before we start...

Notation

$\large p_{j|i} = p(x_j | x_i)$: the probability of an edge from $x_i$ to $x_j$ in the original space

$\large q_{j|i} = q(y_j | y_i)$: the probability of an edge from $y_i$ to $y_j$ in the reconstructed space

Nonlinear Dimensionality Reduction

To preserve distances or neighbors?

T-distributed Stochastic Neighbor Embedding (t-SNE)

$$ p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} \exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

How likely $x_i$ is to pick $x_j$ as its neighbor.

van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Stochastic neighbor embedding (SNE)

Similarity in the (high-dimensional) original space

$$ \large p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} \exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

Similarity in the (low-dimensional) reconstructed space

$$ \large q_{j|i} = \frac{\exp(-||y_i - y_j||^2)}{\sum_{k \ne i} \exp(-||y_i - y_k||^2)} $$
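A minimal NumPy sketch of these two kernels (function names are hypothetical, not from the paper):

```python
import numpy as np

def conditional_p(X, sigmas):
    """p_{j|i}: Gaussian similarities in the original space, row-normalized."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                 # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)

def conditional_q(Y):
    """q_{j|i}: unit-variance Gaussian similarities in the embedding."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)
```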

Distance to probability

SNE

Loss function

$$ \large \arg\min_y D_{KL}(p || q) = \sum_i D_{KL}(p_i || q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} $$

Minimize the dissimilarity between the distributions $p$ and $q$

Optimization: gradient descent
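The SNE gradient has the closed form $\frac{\partial C}{\partial y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$. A minimal sketch of plain gradient descent (the paper additionally uses momentum and jitter; `conditional_q` is the hypothetical helper sketched above):

```python
import numpy as np

def sne_gradient(P, Q, Y):
    # dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) * (y_i - y_j)
    M = P - Q + P.T - Q.T
    return 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

def sne_embed(P, n_components=2, lr=10.0, n_iter=1000, seed=0):
    """Minimize KL(p || q) by gradient descent from a small random start."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], n_components))
    for _ in range(n_iter):
        Y -= lr * sne_gradient(P, conditional_q(Y), Y)
    return Y
```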

The crowding problem

Use a t-distribution for the repulsive force

T-distributed Stochastic Neighbor Embedding (t-SNE)

Similarity in high dimension

$$ \large p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} \exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

Similarity in low dimension

$$ \large q_{j|i} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \ne i} (1 + ||y_i - y_k||^2)^{-1}} $$

How to decide the variance of the Gaussian distribution?

Perplexity

In information theory, perplexity measures how well a probability distribution predicts a sample. A low perplexity indicates a precise prediction.

$$ perp(p_i) = 2^{H(p_i)} $$

where $H(p_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$

Perplexity controls how smooth the Gaussian distribution is

$\text{Perplexity} \uparrow \Rightarrow H(p_i) \uparrow \Rightarrow \sigma_i \uparrow$

  • $\sigma_i$:
    • large: points get crowded
    • small: components break into segments
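In practice each $\sigma_i$ is found by binary search so that $perp(p_i)$ matches the user-chosen perplexity. A minimal sketch (function name hypothetical):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, perplexity, n_steps=50, tol=1e-5):
    """Binary-search sigma_i so that 2^H(p_i) hits the target perplexity.
    sq_dists_i: squared distances from x_i to every other point."""
    lo, hi = 1e-10, 1e10
    target = np.log2(perplexity)                  # match entropy in bits
    for _ in range(n_steps):
        sigma = np.sqrt(lo * hi)                  # midpoint in log space
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))       # H(p_i)
        if abs(H - target) < tol:
            break
        lo, hi = (lo, sigma) if H > target else (sigma, hi)
    return sigma
```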

Making it symmetric

$p_{j|i}, p_{i|j} \rightarrow p_{i,j}$:

$$ \large p_{i,j} = \frac{1}{2n} (p_{j|i} + p_{i|j}) $$

This ensures that each data point contributes significantly to the loss:

$$ \sum_j p_{i,j} > \frac{1}{2n} $$
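In matrix form, with the conditional matrix `P` from the sketch above (rows indexed by $i$), the symmetrization is a one-liner:

```python
import numpy as np

# P[i, j] = p_{j|i}; rows sum to 1, diagonal is 0
n = P.shape[0]
P_joint = (P + P.T) / (2.0 * n)         # p_{i,j} = (p_{j|i} + p_{i|j}) / (2n)
assert np.isclose(P_joint.sum(), 1.0)   # one distribution over all pairs
```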

t-SNE

Loss function

$$ \large D_{KL}(p || q) = \sum_i \sum_j p_{i,j} \log \frac{p_{i,j}}{q_{i,j}} $$

To make $q$ match $p$ as closely as possible.
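In practice one rarely implements this by hand. A minimal usage sketch with scikit-learn on placeholder data:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 50))  # placeholder for real data

# perplexity drives the sigma_i search above; PCA init helps global structure
Y = TSNE(n_components=2, perplexity=30.0, init="pca").fit_transform(X)
print(Y.shape)  # (500, 2)
```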

Uniform Manifold Approximation and Projection (UMAP)

UMAP

Mcinnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861. doi: 10.21105/joss.00861

UMAP

$\epsilon$-neighborhood

The graph breaks into pieces

If the data is uniformly distributed on the manifold, the cover will be good.

Assumption: Data is uniformly distributed on the manifold

Use the fuzzy membership $p_{j|i}$

Assumption: manifold is locally connected

UMAP

Similarity in original space

$$ \large p_{j|i} = \exp\left(-\frac{\max(0, ||x_i - x_j|| - \rho_i)}{\sigma_i}\right) $$

where $\rho_i$ is the distance from $x_i$ to its nearest neighbor.

Directed, locally connected graph

Make it symmetric

Probabilistic fuzzy union

$$ \large p_{i,j} = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j} $$
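A minimal sketch of these two steps (function name hypothetical; the $\max(0, \cdot)$ clamp follows the UMAP implementation):

```python
import numpy as np

def umap_memberships(dists, rho, sigma):
    """dists: n x n pairwise distances; rho[i]: distance from x_i to its
    nearest neighbor; sigma[i]: per-point scale. Returns undirected weights."""
    P = np.exp(-np.maximum(dists - rho[:, None], 0.0) / sigma[:, None])
    np.fill_diagonal(P, 0.0)
    # probabilistic fuzzy union: p_{i,j} = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j}
    return P + P.T - P * P.T
```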

k-nearest neighbor graph construction

Given $k$, search for $\sigma_i$ s.t.

$$ \large \log_2 k = \sum_j p_{j|i} $$

where $\large p_{j|i} = \exp\left(-\frac{\max(0, ||x_i - x_j|| - \rho_i)}{\sigma_i}\right)$ and $j$ ranges over the $k$ nearest neighbors of $x_i$.

Undirected locally connected graph
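As with perplexity, $\sigma_i$ can be found by binary search, now calibrated to $\sum_j p_{j|i} = \log_2 k$ over the $k$ nearest neighbors. A minimal sketch (function name hypothetical):

```python
import numpy as np

def umap_sigma(knn_dists_i, rho_i, k, n_steps=50, tol=1e-5):
    """Binary-search sigma_i so that sum_j p_{j|i} = log2(k).
    knn_dists_i: distances from x_i to its k nearest neighbors."""
    lo, hi = 1e-10, 1e10
    target = np.log2(k)
    for _ in range(n_steps):
        sigma = np.sqrt(lo * hi)
        s = np.exp(-np.maximum(knn_dists_i - rho_i, 0.0) / sigma).sum()
        if abs(s - target) < tol:
            break
        lo, hi = (lo, sigma) if s > target else (sigma, hi)
    return sigma
```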

Reconstruction in low-dimensional space

t distribution with $\nu = 1$ (recovered when $a = b = 1$)

$$ \large q_{i,j} = (1 + a||y_i - y_j||^{2b})^{-1} $$
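$a$ and $b$ are not set by hand: umap-learn (its `find_ab_params` helper) fits them by least squares so that this curve approximates an exponential controlled by `min_dist`. A sketch of that fit, assuming the default target curve:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist=0.1, spread=1.0):
    """Fit a, b so that (1 + a d^(2b))^(-1) approximates the target curve:
    1 for d < min_dist, exp(-(d - min_dist) / spread) beyond it."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(lambda d, a, b: 1.0 / (1.0 + a * d ** (2 * b)), d, target)
    return a, b

print(fit_ab())  # roughly a ≈ 1.6, b ≈ 0.9 for the defaults
```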

UMAP

Binary cross entropy loss

$$ \large H(p, q) = -\sum_i \left[ p_i \log q_i + (1 - p_i) \log (1 - q_i) \right] $$
$$ = \underbrace{H(p)}_{\text{constant}} + D_{KL}(p || q) $$

Optimization: gradient descent
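A minimal sketch of this loss on the fuzzy membership matrices ($\epsilon$ added for numerical stability; function name hypothetical):

```python
import numpy as np

def fuzzy_bce(P, Q, eps=1e-12):
    """Binary cross entropy between edge membership strengths P and Q."""
    return -np.sum(P * np.log(Q + eps) + (1.0 - P) * np.log(1.0 - Q + eps))
```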

Initialization

Eigenvectors of the normalized Laplacian for initialization

For fast convergence, the normalized Laplacian is constructed from $p_{i,j}$. Its eigenvectors are used to initialize the low-dimensional embedding, which is then refined by gradient descent.

$$ \large Lu = \lambda u $$

The kNN graph provides an underlying backbone of the data points, which supplies global structure (connectivity) to the dimensionality reduction.
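A minimal sketch of this initialization, assuming a dense symmetric membership matrix (UMAP itself works on sparse matrices):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_init(P_sym, n_components=2):
    """Embed with eigenvectors of L = I - D^{-1/2} P D^{-1/2} built from p_{i,j}."""
    d = P_sym.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(P_sym.shape[0]) - D_inv_sqrt @ P_sym @ D_inv_sqrt
    _, vecs = eigsh(L, k=n_components + 1, which="SM")  # smallest eigenpairs
    return vecs[:, 1:]              # drop the trivial constant eigenvector
```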

What's new in UMAP?

UMAP introduces a baseline distance $\rho_i$ defined by the nearest neighbor, which puts the data on a locally uniform scale and makes the graph locally connected.

UMAP uses eigenvectors of the normalized Laplacian as initialization.
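The reference implementation, umap-learn, exposes these choices as `n_neighbors` (the $k$ above) and `min_dist` (the $a, b$ fit). A minimal usage sketch on placeholder data:

```python
import numpy as np
import umap  # pip install umap-learn

X = np.random.default_rng(0).normal(size=(500, 50))  # placeholder for real data

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
print(embedding.shape)  # (500, 2)
```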

Comparison of the two models

Dimensionality reduction via graphs

How to decide the topology of the graph?

Discrete approach (which generates a sparse graph)

  1. $\epsilon$-neighborhood
    • consider vertices within $\epsilon$ as neighbors
  2. k-nearest neighbors
    • consider the k nearest vertices as neighbors

Continuous approach (which generates a dense graph)

  1. distance
    • $\Large d(x_i, x_j) = ||x_i - x_j||^2$ as pairwise distance
  2. probability derived from distance (could be asymmetric)
    • $\Large p_{ij} = e^{-d(x_i, x_j)/\sigma}$ as pairwise probability

But... we still have to define the distance metric a priori.
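A minimal sketch of the four constructions with scikit-learn and NumPy (placeholder data; with a single global $\sigma$ the continuous graph is symmetric, per-point $\sigma_i$ would make it asymmetric):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

X = np.random.default_rng(0).normal(size=(200, 10))  # placeholder for real data

# discrete approaches: sparse 0/1 adjacency matrices
A_eps = radius_neighbors_graph(X, radius=1.0, mode="connectivity")
A_knn = kneighbors_graph(X, n_neighbors=15, mode="connectivity")

# continuous approaches: dense matrices
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise distances
P = np.exp(-D / 1.0)                                       # sigma = 1
```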

Application of UMAP in single-cell analysis

A new algorithm, called uniform manifold approximation and projection (UMAP), has recently been published; it is claimed to preserve as much of the local and more of the global data structure than t-SNE, with a shorter run time.

Becht, E., McInnes, L., Healy, J. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 37, 38–44 (2019) doi:10.1038/nbt.4314


UMAP does not preserve global structure any better than t-SNE

Kobak, D., & Linderman, G. C. (2019). UMAP does not preserve global structure any better than t-SNE when using the same initialization. bioRxiv. doi:10.1101/2019.12.19.877522


Pseudotemporal ordering/Trajectory inference

Diffusion map

Comparison

About kNN graph

Both visualization and trajectory inference aim to approximate the data manifold, and both construct a kNN graph to do so.

Different purposes

Data visualization: the result is better when data points separate clearly from each other.

Trajectory inference: the result is better when data points form a continuous pattern.

Thank you for your attention