Dimensionality Reduction Algorithms and Visualization for High-Dimensional Data

Comparison of tSNE and UMAP

Yueh-Hua Tu 杜岳華

2020.3.20

Outline

  • Brief introduction to dimensional reduction
  • Types of dimensional reduction
  • T-distributed stochastic neighbor embedding (t-SNE)
  • Uniform manifold approximation and projection (UMAP)
  • Comparison of two models
  • Dimensional reduction via graph
  • Relation to graph neural networks

Dimensional Reduction

A mapping from a high-dimensional space to a low-dimensional space.

Types of Dimensional Reduction


  • Linear method
  • Non-linear method
    • Preserve global data structure
    • Preserve local data structure

Global structure vs. local structure

Before we start...

Notation

$\large p_{j|i} = p(x_j | x_i)$: the probability of a path from $x_i$ to $x_j$

$\large q_{j|i} = q(y_j | y_i)$: the probability of a path from $y_i$ to $y_j$

Nonlinear Dimensional Reduction

To preserve distance or neighbors?

T-distributed Stochastic Neighbor Embedding (t-SNE)


$$ p_{j|i} = \frac{exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

How likely $x_i$ is to pick $x_j$ as its neighbor.

van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Stochastic neighbor embedding (SNE)

Similarity in (high-dimension) original space

$$ \large p_{j|i} = \frac{exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

Similarity in (low-dimension) reconstructed space

$$ \large q_{j|i} = \frac{exp(-||y_i - y_j||^2)}{\sum_{k \ne i} exp(-||y_i - y_k||^2)} $$
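
A minimal numpy sketch of these two similarity definitions, assuming `X` and `Y` are arrays of shape `(n, d_high)` and `(n, d_low)` and, for brevity, a single shared bandwidth `sigma` (SNE actually tunes a separate $\sigma_i$ per point; see the perplexity slides below):

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """SNE similarity p_{j|i} in the original space (one shared sigma for brevity)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n)
    logits = -sq_dist / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # exclude k = i from the normalization
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)     # each row i sums to 1

def conditional_q(Y):
    """SNE similarity q_{j|i} in the reconstructed space (fixed unit variance)."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)           # exclude k = i
    Q = np.exp(-sq_dist)
    return Q / Q.sum(axis=1, keepdims=True)
```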

SNE

Loss function

$$ \large \arg\min D_{KL}(p || q) = \sum_i D_{KL}(p_i || q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} $$

Minimize the dissimilarity of distribution between $p$ and $q$

Optimization: gradient descent
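
A sketch of this loss under the same assumptions, where `P` and `Q` are the row-stochastic matrices from the previous sketch (the gradient step itself is left to an autodiff library or to the closed-form gradient in the paper):

```python
import numpy as np

def sne_kl_loss(P, Q, eps=1e-12):
    """Sum over i of KL(p_i || q_i); diagonal entries are zero and drop out."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```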

The crowding problem

Use t-distribution for repulsive force

T-distributed Stochastic Neighbor Embedding (t-SNE)

Similarity in high dimension

$$ \large p_{j|i} = \frac{exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

Similarity in low dimension

$$ \large q_{j|i} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \ne i} (1 + ||y_i - y_k||^2)^{-1}} $$
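
The corresponding sketch with the Student-t kernel, following the conditional form shown above (the common symmetric t-SNE formulation normalizes over all pairs instead):

```python
import numpy as np

def tsne_conditional_q(Y):
    """Low-dimensional similarity with a Student-t (nu = 1) kernel."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(inv, 0.0)                  # exclude k = i
    return inv / inv.sum(axis=1, keepdims=True)
```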

How to decide the variance of Gaussian distribution?

Perplexity 困惑度

In information theory, perplexity measures how well a probability distribution predicts a sample. A low perplexity indicates the prediction is precise.

$$ perp(p_i) = 2^{H(p_i)} $$

where $H(p_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$

Perplexity: how smooth the Gaussian distribution is (roughly, the effective number of neighbors).

$Perplexity \uparrow \Rightarrow H(p_i) \uparrow \Rightarrow \sigma_i \uparrow$

  • $\sigma_i$:
    • large: points get crowded
    • small: components break into segments
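
A minimal sketch of how $\sigma_i$ can be found from a target perplexity by bisection; `sq_dist_i` is assumed to hold the squared distances from $x_i$ to every other point (self excluded). This mirrors the idea, not the exact reference implementation:

```python
import numpy as np

def perplexity(p_row):
    """perp(p_i) = 2^{H(p_i)}, with the entropy H measured in bits."""
    p = p_row[p_row > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def sigma_for_perplexity(sq_dist_i, target, n_iter=50):
    """Bisection on sigma_i so that perp(p_{.|i}) matches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dist_i / (2 * sigma ** 2))
        p /= p.sum()
        if perplexity(p) < target:
            lo = sigma          # distribution too peaked -> increase sigma
        else:
            hi = sigma          # distribution too smooth -> decrease sigma
    return sigma
```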

Making it symmetric

$p_{j|i}, p_{i|j} \rightarrow p_{i,j}$:

$$ \large p_{i,j} = \frac{1}{2n} (p_{j|i} + p_{i|j}) $$

This ensures that each data point contributes significantly to the loss:

$$ \sum_j p_{i,j} > \frac{1}{2n} $$
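
As a sketch, the symmetrization is a one-liner on the conditional matrix `P` from the earlier sketch:

```python
import numpy as np

def symmetrize(P):
    """p_{ij} = (p_{j|i} + p_{i|j}) / (2n), which ensures sum_j p_{ij} > 1/(2n) for every i."""
    n = P.shape[0]
    return (P + P.T) / (2 * n)
```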

t-SNE

Loss function

$$ \large D_{KL}(p || q) = \sum_i \sum_j p_{i,j} \log \frac{p_{i,j}}{q_{i,j}} $$

To make the distributions $p$ and $q$ as similar as possible.
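
In practice one rarely implements this by hand; a typical usage sketch with scikit-learn (the dataset and parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)
```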

Uniform Manifold Approximation and Projection (UMAP)

UMAP

Mcinnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861. doi: 10.21105/joss.00861

UMAP

$\epsilon$-neighborhood

Graph is broken into pieces

If the data is uniformly distributed on the manifold, the cover will be good.

Assumption: Data is uniformly distributed on the manifold

Use fuzzy metric $p_{j|i}$

Assumption: manifold is locally connected

UMAP

Similarity in original space

$$ \large p_{j|i} = exp(-\frac{||x_i - x_j||^2 - \rho_i}{\sigma_i}) $$

Directed connected graph

Make it symmetric

Probabilistic fuzzy union

$$ \large p_{i,j} = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j} $$

k-nearest neighbor graph construction

Given k, search $\sigma_i$ s.t.

$$ \large \log_2 k = \sum_j p_{j|i} $$

where $\large p_{j|i} = exp(-\frac{||x_i - x_j||^2 - \rho_i}{\sigma_i})$

Undirected locally connected graph
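
A condensed sketch of this construction, including the fuzzy union from the previous slide. It assumes `sq_dist` is the full (n, n) matrix of squared distances and reuses the bisection idea from the perplexity slide; the actual UMAP implementation restricts the sum to the k nearest neighbors and uses an approximate nearest-neighbor search:

```python
import numpy as np

def umap_conditional_p(sq_dist, k, n_iter=50):
    """p_{j|i} = exp(-(||x_i - x_j||^2 - rho_i) / sigma_i), with sigma_i set so the row sums to log2(k)."""
    n = sq_dist.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(sq_dist[i], i)        # distances from x_i to the other points
        rho = d.min()                       # baseline: distance to the nearest neighbor
        lo, hi = 1e-10, 1e10
        for _ in range(n_iter):             # bisection on sigma_i
            sigma = 0.5 * (lo + hi)
            p = np.exp(-(d - rho) / sigma)
            if p.sum() < np.log2(k):
                lo = sigma                  # row too light -> increase sigma
            else:
                hi = sigma                  # row too heavy -> decrease sigma
        P[i, np.arange(n) != i] = p
    return P

def fuzzy_union(P):
    """Probabilistic fuzzy union: p_ij = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j}."""
    return P + P.T - P * P.T
```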

Reconstruction in low-dimensional space

t distribution with $\nu = 1$

$$ \large q_{i,j} = (1 + a||y_i - y_j||^{2b})^{-1} $$

UMAP

Binary cross entropy loss

$$ \large H(p, q) = -\sum_i \left( p_i \log q_i + (1 - p_i) \log (1 - q_i) \right) $$
$$ = \underline{H(p)} + D_{KL}(p || q) $$

Optimization: gradient descent
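
A sketch of the low-dimensional similarity and of this loss; `a` and `b` are the curve-fit constants from the equation above (with $a = b = 1$ it reduces to the $\nu = 1$ Student-t kernel):

```python
import numpy as np

def umap_q(Y, a=1.0, b=1.0):
    """q_ij = (1 + a * ||y_i - y_j||^{2b})^{-1}."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * sq_dist ** b)

def binary_cross_entropy(P, Q, eps=1e-12):
    """H(p, q) = -sum [ p log q + (1 - p) log(1 - q) ] over distinct pairs."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    p, q = P[mask], np.clip(Q[mask], eps, 1 - eps)
    return -np.sum(p * np.log(q) + (1 - p) * np.log(1 - q))
```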

Initialization

Eigenvectors of normalized Laplacian for initialization

For fast convergence, the normalized Laplacian is constructed from $p_{i,j}$. Its eigenvectors are used as the initialization of the low-dimensional embedding, which is then refined by gradient descent.

$$ \large Lu = \lambda u $$

The kNN graph provides an underlying backbone over the data points, which supplies global structure (connectivity) to the dimensional reduction.
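
A sketch of this spectral initialization, using scipy's dense eigensolver on the normalized Laplacian built from the symmetric $p_{i,j}$ matrix (real implementations use sparse solvers on the k-NN graph):

```python
import numpy as np
from scipy.linalg import eigh

def spectral_init(P, dim=2):
    """Initialize the embedding with eigenvectors of L = I - D^{-1/2} A D^{-1/2}."""
    A = P                                   # symmetric affinity matrix p_{i,j}
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    vals, vecs = eigh(L)                    # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]               # skip the trivial constant-direction eigenvector
```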

What's new in UMAP?

UMAP introduces the baseline $\rho_i$ defined by the nearest neighbor, which puts local neighborhoods on a common scale and keeps the graph locally connected.

UMAP uses the eigenvectors of the normalized Laplacian as the initialization.
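
A typical usage sketch with the umap-learn package (dataset and parameter values are illustrative):

```python
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
emb = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)
```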

Comparison of two models

Dimensional reduction via graph

How to decide the topology of graph?

Discrete approach (which generates a sparse graph)

  1. $\epsilon$-neighborhood
    • consider vertices within $\epsilon$ as neighbors
  2. k-nearest neighbor
    • consider the k nearest vertices as neighbors

Continuous approach (which generates a dense graph)

  1. distance
    • $\Large d(x_i, x_j) = ||x_i - x_j||^2$ as pairwise distance
  2. probability derived from distance (could be asymmetric)
    • $\Large p_{ij} = e^{-d(x_i, x_j)/\sigma}$ as pairwise similarity

But... in all of these, we have to define the distance metric a priori.
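
A sketch of the two discrete constructions listed above, assuming `sq_dist` is the pairwise squared-distance matrix and `eps` a threshold on the same (squared) scale:

```python
import numpy as np

def epsilon_graph(sq_dist, eps):
    """epsilon-neighborhood graph: connect points within eps (sparse, may break into pieces)."""
    A = (sq_dist <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def knn_graph(sq_dist, k):
    """Directed k-nearest-neighbor graph: each point points to its k closest points."""
    n = sq_dist.shape[0]
    A = np.zeros((n, n))
    idx = np.argsort(sq_dist, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    A[np.arange(n)[:, None], idx] = 1.0
    return A
```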

UMAP does not preserve global structure any better than t-SNE

Kobak, D., & Linderman, G. C. (2019). UMAP does not preserve global structure any better than t-SNE when using the same initialization. doi: 10.1101/2019.12.19.877522

UMAP does not preserve global structure any better than t-SNE

Relation to graph neural networks

Graph convolutional network

$$ \begin{align} x' &= f(x, graph) \\ &= x *_G g_{\theta} \\ &= \theta (I + D^{-\frac{1}{2}}AD^{-\frac{1}{2}})x \\ &= \theta \tilde{A} x \end{align} $$

Normalized Laplacian: $L = I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$
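
A sketch of this propagation rule, written with a feature matrix and a weight matrix in place of the scalar filter parameter $\theta$ above (this follows the $I + D^{-1/2} A D^{-1/2}$ form on the slide; the usual GCN implementation instead applies a renormalization trick to $A + I$):

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """X' = (I + D^{-1/2} A D^{-1/2}) X Theta, a single graph convolution layer."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_tilde = np.eye(A.shape[0]) + D_inv_sqrt @ A @ D_inv_sqrt
    return A_tilde @ X @ Theta
```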

Graph neural network vs. dimensional reduction

Graph neural network

transforms features over the geometry of a given graph

$$ Y = f(X, graph) $$

Dimensional reduction via graph

constructs a graph from the features, then transforms the features over the geometry of that graph

$$ graph = f(X) \\ Y = g(X, graph) $$

Thank you for your attention