Distance, Biology and Representation Learning

Yueh-Hua Tu

2019.06.12

Machine Learning in Bioinformatics

Supervised Learning

Unsupervised Learning

Clustering

Some related areas

Regression

Inferential statistics

Representations

Representations ≡ Coordinates (basis)

What is a good representation for my task and my data?

Evaluated on task-specific data

Supervised learning

How to find a good representation?

Representation learning

CNNs are a good example of learning useful representations through supervised learning

Coordinates

We use a coordinate system to describe a geometric object.

Vectors

Linear independence

Space

Span: the set of all linear combinations of a set of vectors.

A space is a set of objects.

Orthogonality

Orthogonality has several useful properties: coordinates along orthogonal basis vectors are independent, and projections are easy to compute.

Dimensions

The number of vectors in a basis; a measure of the size of a space.

Inner product

$$ \langle x, y \rangle = x^T y = \sum_i x_i y_i $$
$$ \langle f, g \rangle = \int f(x) g(x) \, dx $$

Norm

$$ ||x|| = \sqrt{\langle x, x \rangle} = \sqrt{x^T x} $$

Angle

$$ \langle x, y \rangle = x^T y = ||x|| \cdot ||y|| \cos \theta $$
$$ \cos \theta = \frac{x^T y}{||x|| \cdot ||y||} $$

Distance

$$ d(x, y) = ||x - y|| = \sqrt{(x - y)^T(x - y)} $$
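A minimal NumPy sketch of these four definitions (the vectors and functions here are illustrative):

```python
import numpy as np
from scipy.integrate import quad

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y                                   # <x, y> = sum_i x_i y_i
norm_x = np.sqrt(x @ x)                         # ||x|| = sqrt(<x, x>)
cos_theta = inner / (norm_x * np.sqrt(y @ y))   # cos(theta) = <x, y> / (||x|| ||y||)
distance = np.linalg.norm(x - y)                # d(x, y) = ||x - y||

# Inner product of two functions via numerical integration: <f, g> = ∫ f(x) g(x) dx
inner_fg, _ = quad(lambda t: np.sin(t) * np.cos(t), 0.0, np.pi)
```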

Generalization

We have defined norm, distance, and inner product on a (linear) vector space.

Can we generalize them?

Space

Vector space

A set of vectors

Function space

A set of functions

Other space

points, lines, triangles, numbers, sets, matrices, trees, graphs, relations, groups

Formal definition - distance

A distance (metric) is a function $d: V \times V \rightarrow \mathbb{R}$, where $V$ is a set.

$\forall x, y, z \in V$

  1. $d(x, y) \ge 0$
  2. $d(x, y) = 0 \Leftrightarrow x = y$
  3. $d(x, y) = d(y, x)$
  4. $d(x, y) + d(y, z) \ge d(x, z)$
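The axioms can be spot-checked numerically on a finite sample of points; a minimal sketch (this verifies necessary conditions only, it is not a proof):

```python
import itertools
import numpy as np

def check_metric_axioms(d, points, tol=1e-9):
    """Spot-check the four distance axioms on every triple of sample points."""
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, y) >= -tol                     # 1. non-negativity
        assert abs(d(x, x)) <= tol                 # 2. d(x, x) = 0 (the "only if" half is not checked here)
        assert abs(d(x, y) - d(y, x)) <= tol       # 3. symmetry
        assert d(x, y) + d(y, z) >= d(x, z) - tol  # 4. triangle inequality

rng = np.random.default_rng(0)
points = [rng.standard_normal(3) for _ in range(5)]
check_metric_axioms(lambda a, b: float(np.linalg.norm(a - b)), points)
```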

Metric space

$(V, d)$ is a metric space, a space equipped with distance.

Pseudodistance

A function $d: V \times V \rightarrow \mathbb{R}$, where $V$ is a set.

$\forall x, y, z \in V$

  1. $d(x, y) \ge 0$
  2. $d(x, y) = 0 \Leftrightarrow x = y\ \times$ (only $d(x, x) = 0$ is required; distinct points may be at distance $0$)
  3. $d(x, y) = d(y, x)$
  4. $d(x, y) + d(y, z) \ge d(x, z)$

Pseudometric space

$(V, d)$ is a pseudometric space.

Quasidistance

A function $d: V \times V \rightarrow \mathbb{R}$, where $V$ is a set.

$\forall x, y, z \in V$

  1. $d(x, y) \ge 0$
  2. $d(x, y) = 0 \Leftrightarrow x = y$
  3. $d(x, y) = d(y, x)\ \times$ (symmetry is not required)
  4. $d(x, y) + d(y, z) \ge d(x, z)$

Quasimetric space

$(V, d)$ is a quasimetric space.

Formal definition - norm

A norm is a function $||\cdot||: V \rightarrow \mathbb{R}$, where $V$ is a vector space.

$\forall x, y \in V, \forall a \in \mathbb{R}$

  1. $||x|| \ge 0$
  2. $||x|| = 0 \Leftrightarrow x = \vec{0}$
  3. $||ax|| = |a|\cdot||x||$
  4. $||x|| + ||y|| \ge ||x + y||$

Normed vector space

$(V, ||\cdot||)$ is a normed vector space.

Seminorm

A seminorm is a function $||\cdot||: V \rightarrow \mathbb{R}$, where $V$ is a vector space.

$\forall x, y \in V, \forall a \in \mathbb{R}$

  1. $||x|| \ge 0$
  2. $||x|| = 0 \Leftrightarrow x = \vec{0}\ \times$ (only $||\vec{0}|| = 0$ is required; a nonzero vector may have norm $0$)
  3. $||ax|| = |a|\cdot||x||$
  4. $||x|| + ||y|| \ge ||x + y||$

Seminormed vector space

$(V, ||\cdot||)$ is a seminormed vector space.

Norm vs distance

$$\text{norm} \Rightarrow \text{distance}$$
$$\text{norm} \not\Leftarrow \text{distance}$$

Every norm induces a distance via $d(x, y) = ||x - y||$, but not every distance arises from a norm (e.g., distances on sets or graphs, which have no vector structure).

Example - Euclidean distance

$$ ||x|| = \sqrt{\sum_i x_i^2} $$
$$ d(x, y) = ||x - y|| = \sqrt{\sum_i (x_i - y_i)^2} $$

Example - Hamming distance

$$ d(x, y) = \sum_i [x_i \ne y_i] $$
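A direct translation of the Iverson-bracket sum (the sequences here are illustrative):

```python
def hamming(x, y):
    """d(x, y) = number of positions at which x and y differ."""
    assert len(x) == len(y), "Hamming distance needs equal-length sequences"
    return sum(xi != yi for xi, yi in zip(x, y))

hamming("ACGT", "AGGA")  # 2: positions 2 and 4 differ
```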

Mathematical spaces

Example: MDS

  1. Calculate Euclidean distances for all pairs of points in the embedding.
  2. Compare these distances with the original input distance matrix by evaluating the stress function.
  3. Adjust the coordinates to minimize the stress.

Space beyond coordinates

$$ \text{stress} = \sqrt{\frac{\sum_{i,j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i,j} d_{ij}^2}} $$

$d_{ij} = ||x_i - x_j||$ are the input distances and $\hat{d}_{ij} = ||y_i - y_j||$ the distances in the embedding.
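A minimal sketch of this procedure, assuming scikit-learn is available; `dissimilarity='precomputed'` lets the embedding start from any distance matrix, not only Euclidean coordinates:

```python
import numpy as np
from sklearn.manifold import MDS

def stress(D, Y):
    """Kruskal stress between target distances D and distances realized by coordinates Y."""
    D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.sqrt(((D - D_hat) ** 2).sum() / (D ** 2).sum())

# Toy input: pairwise Euclidean distances between five 10-dimensional points.
X = np.random.default_rng(0).standard_normal((5, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

Y = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(D)
print(stress(D, Y))
```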

Distances

Euclidean distance ($l^2$ norm)

$$d(x, y) = ||x - y||_2 = (\sum_i |x_i - y_i|^2)^{\frac{1}{2}}$$

Manhattan distance ($l^1$ norm)

$$d(x, y) = ||x - y||_1 = \sum_i |x_i - y_i|$$

$L^p$-norm

$$d(x, y) = ||x - y||_p = (\sum_i |x_i - y_i|^p)^{\frac{1}{p}}$$
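All three are the $p = 2$, $p = 1$, and general-$p$ cases of one NumPy one-liner (the vectors here are illustrative):

```python
import numpy as np

def lp_distance(x, y, p):
    """d(x, y) = (sum_i |x_i - y_i|^p)^(1/p)."""
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
lp_distance(x, y, 2)  # Euclidean: 5.0
lp_distance(x, y, 1)  # Manhattan: 7.0
```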

Jaccard distance

$$ d(A, B) = \frac{|A \backslash B| + |B \backslash A|}{|A \cup B|} $$

New distance

Let $X$ be a collection of sets; a distance is a function $d^p: X \times X \rightarrow \mathbb{R}$.

Unnormalized metrics on sets

$$ d^p(A, B) = (|A \backslash B|^p + |B \backslash A|^p)^{\frac{1}{p}} $$

$(X, d^p)$ is a metric space.

Normalized metrics on sets

$$ d^p_N(A, B) = \frac{(|A \backslash B|^p + |B \backslash A|^p)^{\frac{1}{p}}}{|A \cup B|} $$

$(X, d^p_N)$ is a metric space.

Jaccard distance is a special case of $d^p_N$ when $p = 1$
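A minimal sketch of both set metrics using Python's built-in set operations; at $p = 1$ the normalized version reproduces the Jaccard distance:

```python
def d_p(A, B, p=1.0):
    """Unnormalized set metric: (|A \\ B|^p + |B \\ A|^p)^(1/p)."""
    return (len(A - B) ** p + len(B - A) ** p) ** (1.0 / p)

def d_p_normalized(A, B, p=1.0):
    """Normalized set metric: d^p(A, B) / |A ∪ B|."""
    union = len(A | B)
    return d_p(A, B, p) / union if union else 0.0

A, B = {1, 2, 3}, {2, 3, 4, 5}
d_p_normalized(A, B, p=1)  # (1 + 2) / 5 = 0.6, the Jaccard distance of A and B
```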

New distance

$\mathbb{R}^k$ is a vector space; a distance is a function $d^p: \mathbb{R}^k \times \mathbb{R}^k \rightarrow \mathbb{R}$.

Unnormalized metrics on vectors

$$ d^p(x, y) = \Big( \big( \sum_{i:\, x_i \ge y_i} (x_i - y_i) \big)^p + \big( \sum_{j:\, x_j < y_j} (y_j - x_j) \big)^p \Big)^{\frac{1}{p}} $$

$(\mathbb{R}^k, d^p)$ is a metric space.

Manhattan distance is a special case of $d^p$ when $p = 1$

Normalized metrics on vectors

$$ d^p_N(x, y) = \frac{d^p(x, y)}{\sum_i \max(|x_i|, |y_i|, |x_i - y_i|)} $$

$(\mathbb{R}^k, d^p_N)$ is a metric space.
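A sketch of both vector metrics, splitting $x - y$ into its positive and negative parts; at $p = 1$ the unnormalized version reduces to the Manhattan distance:

```python
import numpy as np

def d_p(x, y, p=1.0):
    """Sum the positive and the negative parts of x - y separately, then combine."""
    diff = x - y
    pos = diff[diff > 0].sum()     # sum over i with x_i > y_i
    neg = (-diff)[diff < 0].sum()  # sum over j with x_j < y_j
    return float((pos ** p + neg ** p) ** (1.0 / p))

def d_p_normalized(x, y, p=1.0):
    denom = np.maximum(np.maximum(np.abs(x), np.abs(y)), np.abs(x - y)).sum()
    return d_p(x, y, p) / denom if denom else 0.0

x, y = np.array([1.0, 5.0]), np.array([3.0, 2.0])
d_p(x, y, p=1)  # |1 - 3| + |5 - 2| = 5.0, the Manhattan distance
```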

Relationship to Minkowski distance

Minkowski distance

$$ d(x, y) = (\sum_i |x_i - y_i|^p)^{\frac{1}{p}} $$

New distance

$$ d^p(x, y) = \Big( \big( \sum_{i:\, x_i \ge y_i} (x_i - y_i) \big)^p + \big( \sum_{j:\, x_j < y_j} (y_j - x_j) \big)^p \Big)^{\frac{1}{p}} $$

They coincide only when $p = 1$, where both reduce to the Manhattan distance.
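A quick numeric check for $p = 2$ (the values are chosen so the difference vector mixes signs):

```python
import numpy as np

x, y = np.array([1.0, 2.0, 0.0]), np.array([0.0, 0.0, 3.0])  # x - y = (1, 2, -3)

minkowski_2 = float((np.abs(x - y) ** 2).sum() ** 0.5)  # sqrt(1 + 4 + 9) ≈ 3.742
d_2 = float((np.maximum(x - y, 0).sum() ** 2 + np.maximum(y - x, 0).sum() ** 2) ** 0.5)
# (3^2 + 3^2)^(1/2) ≈ 4.243: the two distances disagree once p > 1
```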

New distance

$L(\mathbb{R})$ is the set of integrable functions; a distance is a function $d^p: L(\mathbb{R}) \times L(\mathbb{R}) \rightarrow \mathbb{R}$.

Unnormalized metrics on functions

$$ d^p(f, g) = ((\int (f - g)^+ dx)^p + (\int (f - g)^- dx)^p)^{\frac{1}{p}} $$

$(L(\mathbb{R}), d^p)$ is a metric space.

Normalized metrics on functions

$$ d^p_N(f, g) = \frac{d^p(f, g)}{\int \max(|f|, |g|, |f - g|) \, dx} $$

$(L(\mathbb{R}), d^p_N)$ is a metric space.
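A sketch using numerical quadrature, with $(f - g)^+$ and $(f - g)^-$ the positive and negative parts of $f - g$; the integration limits and the Gaussian example are illustrative:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def d_p(f, g, p=1.0, lo=-10.0, hi=10.0):
    """Integrate the positive and negative parts of f - g separately, then combine."""
    pos, _ = quad(lambda t: max(f(t) - g(t), 0.0), lo, hi)
    neg, _ = quad(lambda t: max(g(t) - f(t), 0.0), lo, hi)
    return (pos ** p + neg ** p) ** (1.0 / p)

def d_p_normalized(f, g, p=1.0, lo=-10.0, hi=10.0):
    denom, _ = quad(lambda t: max(abs(f(t)), abs(g(t)), abs(f(t) - g(t))), lo, hi)
    return d_p(f, g, p, lo, hi) / denom

# Distance between two Gaussian densities with different means
f = lambda t: norm.pdf(t, loc=0.0, scale=1.0)
g = lambda t: norm.pdf(t, loc=1.0, scale=1.0)
print(d_p_normalized(f, g, p=2.0))
```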

Interpretation

$$ d^p_N(f, g) = \frac{d^p(f, g)}{\int \max(|f|, |g|, |f - g|) \, dx} $$

Relationship to f-divergences

f-divergence

A class of divergences: measures of dissimilarity between probability distributions.

$$ D_f(P||Q) = \int_{\Omega} f(\frac{dP}{dQ}) dQ $$

Total variation distance

$$ TV(P, Q) = \sup_A |P(A) - Q(A)|, \quad f(t) = \frac{1}{2} |t - 1| $$

KL divergence

$$ D_{KL}(P||Q) = \int p(x) \log \frac{p(x)}{q(x)} dx, f(t) = t \log t $$

Hellinger distance

$$ H^2(P, Q) = \frac{1}{2} \int (\sqrt{dP} - \sqrt{dQ})^2, \quad f(t) = (\sqrt{t} - 1)^2 $$
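Minimal discrete versions of the three examples (the distributions are illustrative; the KL sketch assumes $q_i > 0$ wherever $p_i > 0$):

```python
import numpy as np

def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i), skipping terms with p_i = 0."""
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

def hellinger_sq(p, q):
    """Squared Hellinger distance: (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
total_variation(p, q), kl_divergence(p, q), hellinger_sq(p, q)
```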

Application to ontologies

An ontology $\mathcal{O} = (V, E)$ is a directed acyclic graph (DAG).

Consistent subgraph

A subgraph $F \subseteq V$ is consistent if, for every vertex $v \in F$, all ancestors of $v$ are also in $F$.

Consistent subgraph

The underlying probabilistic graphical model is related to Bayesian networks.

Consider each concept in the ontology as a binary random variable.

The joint probability factorizes over the graph structure as

$$ P(F) = \prod_{v \in F} P(v \mid parents(v)) $$

$P(v \mid parents(v))$ states the probability that node $v$ is part of an ontological annotation given all its parents.

Information content of a consistent subgraph

$$ i(F) = \log \frac{1}{P(F)} = \sum_{v \in F} ia(v) $$
$$ ia(v) = -\log P(v \mid parents(v)) $$
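A minimal sketch on a toy DAG; the conditional probabilities here are invented for illustration (in practice they are estimated from an annotation database), and base-2 logs give the result in bits:

```python
import math

# Toy ontology: each node maps to its parents (the root has none).
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
# P(v | parents(v)): assumed, illustrative values
p_cond = {"root": 1.0, "a": 0.5, "b": 0.4, "c": 0.25}

def ia(v):
    """Information accretion: ia(v) = -log P(v | parents(v))."""
    return -math.log2(p_cond[v])

def info_content(F):
    """i(F) = sum of ia(v) over a consistent subgraph F."""
    return sum(ia(v) for v in F)

F = {"root", "a", "b", "c"}  # consistent: every node's ancestors are included
print(info_content(F))       # 0 + 1 + 1.32 + 2 ≈ 4.32 bits
```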

Information accretion

Additional information inherent to the node $v$ assuming that all its parents are already in the annotation.

Comparison of two ontologies

Suppose we have two annotations $F$ and $G$ (consistent subgraphs of the ontology), where $G$ is a prediction of $F$.

Misinformation

The information content of the nodes in $G$ that are not part of the true annotation $F$.

$$ mi(F, G) = \sum_{v \in G \backslash F} ia(v) $$

Remaining uncertainty

The information content of the nodes in $F$ that are not part of the prediction $G$.

$$ ru(F, G) = \sum_{v \in F \backslash G} ia(v) $$

Comparison of two ontologies

Distance of two ontologies (semantic distance)

$$ d^p(F, G) = (ru^p(F, G) + mi^p(F, G))^{\frac{1}{p}} $$

Normalized distance of two ontologies (normalized semantic distance)

$$ d^p_N(F, G) = \frac{(ru^p(F, G) + mi^p(F, G))^{\frac{1}{p}}}{\sum_{v \in F \cup G} ia(v)} $$
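Given per-node information accretion values, the whole comparison is a few set operations; a sketch (the `ia` table is assumed precomputed, e.g. as in the earlier toy example):

```python
def ru(F, G, ia):
    """Remaining uncertainty: information in the truth F that prediction G misses."""
    return sum(ia[v] for v in F - G)

def mi(F, G, ia):
    """Misinformation: information in prediction G not supported by the truth F."""
    return sum(ia[v] for v in G - F)

def semantic_distance(F, G, ia, p=2.0):
    return (ru(F, G, ia) ** p + mi(F, G, ia) ** p) ** (1.0 / p)

def semantic_distance_normalized(F, G, ia, p=2.0):
    total = sum(ia[v] for v in F | G)
    return semantic_distance(F, G, ia, p) / total if total else 0.0

ia = {"root": 0.0, "a": 1.0, "b": 1.32, "c": 2.0}  # illustrative accretion values (bits)
F = {"root", "a"}  # true annotation
G = {"root", "b"}  # predicted annotation
semantic_distance(F, G, ia, p=2)             # sqrt(1^2 + 1.32^2) ≈ 1.66
semantic_distance_normalized(F, G, ia, p=2)  # ≈ 0.71
```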

Compare protein sequence similarity to ontological distance

  • 5000 human-mouse protein pairs from UniProt
  • 3 classes of protein sequence similarity: [0, 1/3), [1/3, 2/3), [2/3, 1]
  • Distribution of pairwise ontological distances ($d^2_N$) between the proteins in each class

Reconstruct a functional phylogeny from protein functions, without sequence information

  • A good distance measure is expected to recover the known evolutionary tree.
  • Single-linkage hierarchical clustering
  • Clustering with the Molecular Function and Cellular Component ontologies recovered the correct relationships (left)
  • Clustering with the Biological Process ontology did not (right)

Thank you for your attention