Dimensionality Reduction Algorithms and Visualization for High-Dimensional Data

Comparison of tSNE and UMAP

Yueh-Hua Tu 杜岳華

2020.3.20

Outline

  • Brief introduction to dimensional reduction
  • Types of dimensional reduction
  • T-distributed stochastic neighbor embedding (t-SNE)
  • Uniform manifold approximation and projection (UMAP)
  • Comparison of two models
  • Dimensional reduction via graph
  • Relation to graph neural networks

Dimensional Reduction

A mapping from a high-dimensional space to a low-dimensional space.

Types of Dimensional Reduction


  • Linear method
  • Non-linear method
    • Preserve global data structure
    • Preserve local data structure

Global structure vs. local structure

Before we start...

Notation

$\large p_{j|i} = p(x_j | x_i)$: the probability of a path from $x_i$ to $x_j$

$\large q_{j|i} = q(y_j | y_i)$: the probability of a path from $y_i$ to $y_j$

Nonlinear Dimensional Reduction

To preserve distance or neighbors?

T-distributed Stochastic Neighbor Embedding (t-SNE)


$$ p_{j|i} = \frac{exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

How likely $x_i$ is to pick $x_j$ as its neighbor.

van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Stochastic neighbor embedding (SNE)

Similarity in (high-dimension) original space

$$ \large p_{j|i} = \frac{exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

Similarity in (low-dimension) reconstructed space

$$ \large q_{j|i} = \frac{exp(-||y_i - y_j||^2)}{\sum_{k \ne i} exp(-||y_i - y_k||^2)} $$
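
A minimal numpy sketch of these two similarity definitions, assuming `X` and `Y` are arrays of shape `(n, d_high)` and `(n, d_low)` and, for brevity, a single shared bandwidth `sigma` (SNE actually tunes a separate $\sigma_i$ per point; see the perplexity slides below):

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """SNE similarity p_{j|i} in the original space (one shared sigma for brevity)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n)
    logits = -sq_dist / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # exclude k = i from the normalization
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)     # each row i sums to 1

def conditional_q(Y):
    """SNE similarity q_{j|i} in the reconstructed space (fixed unit variance)."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)           # exclude k = i
    Q = np.exp(-sq_dist)
    return Q / Q.sum(axis=1, keepdims=True)
```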

SNE

Loss function

$$ \large \arg\min D_{KL}(p || q) = \sum_i D_{KL}(p_i || q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} $$

Minimize the dissimilarity of distribution between $p$ and $q$

Optimization: gradient descent
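
A sketch of this loss under the same assumptions, where `P` and `Q` are the row-stochastic matrices from the previous sketch (the gradient step itself is left to an autodiff library or to the closed-form gradient in the paper):

```python
import numpy as np

def sne_kl_loss(P, Q, eps=1e-12):
    """Sum over i of KL(p_i || q_i); diagonal entries are zero and drop out."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```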

The crowding problem

Use t-distribution for repulsive force

T-distributed Stochastic Neighbor Embedding (t-SNE)

Similarity in high dimension

$$ \large p_{j|i} = \frac{exp(-||x_i - x_j||^2 / 2 \sigma_i^2)}{\sum_{k \ne i} exp(-||x_i - x_k||^2 / 2 \sigma_i^2)} $$

Similarity in low dimension

$$ \large q_{j|i} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \ne i} (1 + ||y_i - y_k||^2)^{-1}} $$
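
The corresponding sketch with the Student-t kernel, following the conditional form shown above (the common symmetric t-SNE formulation normalizes over all pairs instead):

```python
import numpy as np

def tsne_conditional_q(Y):
    """Low-dimensional similarity with a Student-t (nu = 1) kernel."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(inv, 0.0)                  # exclude k = i
    return inv / inv.sum(axis=1, keepdims=True)
```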

How to decide the variance of Gaussian distribution?

Perplexity 困惑度

In information theory, perplexity measures how well a probability distribution predicts a sample. A low perplexity indicates the prediction is precise.

$$ perp(p_i) = 2^{H(p_i)} $$

where $H(p_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$

Perplexity: how smooth the Gaussian distribution is (roughly, the effective number of neighbors).

$Perplexity \uparrow \Rightarrow H(p_i) \uparrow \Rightarrow \sigma_i \uparrow$

  • $\sigma_i$:
    • large: points get crowded
    • small: components break into segments
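
A minimal sketch of how $\sigma_i$ can be found from a target perplexity by bisection; `sq_dist_i` is assumed to hold the squared distances from $x_i$ to every other point (self excluded). This mirrors the idea, not the exact reference implementation:

```python
import numpy as np

def perplexity(p_row):
    """perp(p_i) = 2^{H(p_i)}, with the entropy H measured in bits."""
    p = p_row[p_row > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def sigma_for_perplexity(sq_dist_i, target, n_iter=50):
    """Bisection on sigma_i so that perp(p_{.|i}) matches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dist_i / (2 * sigma ** 2))
        p /= p.sum()
        if perplexity(p) < target:
            lo = sigma          # distribution too peaked -> increase sigma
        else:
            hi = sigma          # distribution too smooth -> decrease sigma
    return sigma
```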

Making it symmetric

$p_{j|i}, p_{i|j} \rightarrow p_{i,j}$:

$$ \large p_{i,j} = \frac{1}{2n} (p_{j|i} + p_{i|j}) $$

This ensures that each data point contributes significantly to the loss:

$$ \sum_j p_{i,j} > \frac{1}{2n} $$
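
As a sketch, the symmetrization is a one-liner on the conditional matrix `P` from the earlier sketch:

```python
import numpy as np

def symmetrize(P):
    """p_{ij} = (p_{j|i} + p_{i|j}) / (2n), which ensures sum_j p_{ij} > 1/(2n) for every i."""
    n = P.shape[0]
    return (P + P.T) / (2 * n)
```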

t-SNE

Loss function

$$ \large D_{KL}(p || q) = \sum_i \sum_j p_{i,j} \log \frac{p_{i,j}}{q_{i,j}} $$

To make the distributions $p$ and $q$ as similar as possible.
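
In practice one rarely implements this by hand; a typical usage sketch with scikit-learn (the dataset and parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)
```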

Uniform Manifold Approximation and Projection (UMAP)

UMAP

Mcinnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861. doi: 10.21105/joss.00861

UMAP

$\epsilon$-neighborhood

Graph is broken into pieces

If the data is uniformly distributed on the manifold, the cover will be good.

Assumption: Data is uniformly distributed on the manifold

Use fuzzy metric $p_{j|i}$

Assumption: manifold is locally connected

UMAP

Similarity in original space

$$ \large p_{j|i} = exp(-\frac{||x_i - x_j||^2 - \rho_i}{\sigma_i}) $$

Directed connected graph

Make it symmetric

Probabilistic fuzzy union

$$ \large p_{i,j} = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j} $$

k-nearest neighbor graph construction

Given k, search $\sigma_i$ s.t.

$$ \large \log_2 k = \sum_j p_{j|i} $$

where $\large p_{j|i} = exp(-\frac{||x_i - x_j||^2 - \rho_i}{\sigma_i})$

Undirected locally connected graph
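
A condensed sketch of this construction, including the fuzzy union from the previous slide. It assumes `sq_dist` is the full (n, n) matrix of squared distances and reuses the bisection idea from the perplexity slide; the actual UMAP implementation restricts the sum to the k nearest neighbors and uses an approximate nearest-neighbor search:

```python
import numpy as np

def umap_conditional_p(sq_dist, k, n_iter=50):
    """p_{j|i} = exp(-(||x_i - x_j||^2 - rho_i) / sigma_i), with sigma_i set so the row sums to log2(k)."""
    n = sq_dist.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(sq_dist[i], i)        # distances from x_i to the other points
        rho = d.min()                       # baseline: distance to the nearest neighbor
        lo, hi = 1e-10, 1e10
        for _ in range(n_iter):             # bisection on sigma_i
            sigma = 0.5 * (lo + hi)
            p = np.exp(-(d - rho) / sigma)
            if p.sum() < np.log2(k):
                lo = sigma                  # row too light -> increase sigma
            else:
                hi = sigma                  # row too heavy -> decrease sigma
        P[i, np.arange(n) != i] = p
    return P

def fuzzy_union(P):
    """Probabilistic fuzzy union: p_ij = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j}."""
    return P + P.T - P * P.T
```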

Reconstruction in low-dimensional space

t distribution with $\nu = 1$

$$ \large q_{i,j} = (1 + a||y_i - y_j||^{2b})^{-1} $$

UMAP

Binary cross entropy loss

$$ \large H(p, q) = -\sum_i \left( p_i \log q_i + (1 - p_i) \log (1 - q_i) \right) $$
$$ = \underline{H(p)} + D_{KL}(p || q) $$

Optimization: gradient descent
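
A sketch of the low-dimensional similarity and of this loss; `a` and `b` are the curve-fit constants from the equation above (with $a = b = 1$ it reduces to the $\nu = 1$ Student-t kernel):

```python
import numpy as np

def umap_q(Y, a=1.0, b=1.0):
    """q_ij = (1 + a * ||y_i - y_j||^{2b})^{-1}."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * sq_dist ** b)

def binary_cross_entropy(P, Q, eps=1e-12):
    """H(p, q) = -sum [ p log q + (1 - p) log(1 - q) ] over distinct pairs."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    p, q = P[mask], np.clip(Q[mask], eps, 1 - eps)
    return -np.sum(p * np.log(q) + (1 - p) * np.log(1 - q))
```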

Initialization

Eigenvectors of normalized Laplacian for initialization

For fast convergence, the normalized Laplacian is constructed from $p_{i,j}$. Its eigenvectors are used as the initialization of the low-dimensional embedding, which is then refined by gradient descent.

$$ \large Lu = \lambda u $$

The kNN graph provides an underlying backbone over the data points, which supplies global structure (connectivity) to the dimensional reduction.
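
A sketch of this spectral initialization, using scipy's dense eigensolver on the normalized Laplacian built from the symmetric $p_{i,j}$ matrix (real implementations use sparse solvers on the k-NN graph):

```python
import numpy as np
from scipy.linalg import eigh

def spectral_init(P, dim=2):
    """Initialize the embedding with eigenvectors of L = I - D^{-1/2} A D^{-1/2}."""
    A = P                                   # symmetric affinity matrix p_{i,j}
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    vals, vecs = eigh(L)                    # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]               # skip the trivial constant-direction eigenvector
```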

What's new in UMAP?

UMAP introduces the baseline $\rho_i$ defined by the nearest neighbor, which puts local neighborhoods on a common scale and keeps the graph locally connected.

UMAP uses the eigenvectors of the normalized Laplacian as the initialization.
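
A typical usage sketch with the umap-learn package (dataset and parameter values are illustrative):

```python
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
emb = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)
```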

Comparison of two models

Dimensional reduction via graph

How to decide the topology of graph?

Discrete approach (which generates a sparse graph)

  1. $\epsilon$-neighborhood
    • consider vertices within $\epsilon$ as neighbors
  2. k-nearest neighbor
    • consider the k nearest vertices as neighbors

Continuous approach (which generates a dense graph)

  1. distance
    • $\Large d(x_i, x_j) = ||x_i - x_j||^2$ as pairwise distance
  2. probability derived from distance (could be asymmetric)
    • $\Large p_{ij} = e^{-d(x_i, x_j)/\sigma}$ as pairwise similarity

But... in all of these, we have to define the distance metric a priori.
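
A sketch of the two discrete constructions listed above, assuming `sq_dist` is the pairwise squared-distance matrix and `eps` a threshold on the same (squared) scale:

```python
import numpy as np

def epsilon_graph(sq_dist, eps):
    """epsilon-neighborhood graph: connect points within eps (sparse, may break into pieces)."""
    A = (sq_dist <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def knn_graph(sq_dist, k):
    """Directed k-nearest-neighbor graph: each point points to its k closest points."""
    n = sq_dist.shape[0]
    A = np.zeros((n, n))
    idx = np.argsort(sq_dist, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    A[np.arange(n)[:, None], idx] = 1.0
    return A
```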

UMAP does not preserve global structure any better than t-SNE

Kobak, D., & Linderman, G. C. (2019). UMAP does not preserve global structure any better than t-SNE when using the same initialization. doi: 10.1101/2019.12.19.877522

UMAP does not preserve global structure any better than t-SNE

Relation to graph neural networks

Graph convolutional network

$$ \begin{align} x' &= f(x, graph) \\ &= x *_G g_{\theta} \\ &= \theta (I + D^{-\frac{1}{2}}AD^{-\frac{1}{2}})x \\ &= \theta \tilde{A} x \end{align} $$

Normalized Laplacian: $L = I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$
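
A sketch of this propagation rule, written with a feature matrix and a weight matrix in place of the scalar filter parameter $\theta$ above (this follows the $I + D^{-1/2} A D^{-1/2}$ form on the slide; the usual GCN implementation instead applies a renormalization trick to $A + I$):

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """X' = (I + D^{-1/2} A D^{-1/2}) X Theta, a single graph convolution layer."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_tilde = np.eye(A.shape[0]) + D_inv_sqrt @ A @ D_inv_sqrt
    return A_tilde @ X @ Theta
```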

Graph neural network vs. dimensional reduction

Graph neural network

transforms features over the geometry of a given graph

$$ Y = f(X, graph) $$

Dimensional reduction via graph

constructs a graph from the features, then transforms the features over the geometry of that graph

$$ graph = f(X) \\ Y = g(X, graph) $$

Thank you for your attention