Distance, Biology and Representation Learning

Yueh-Hua Tu

2019.06.12

Machine Learning in Bioinformatics

Supervised Learning

Unsupervised Learning

Clustering

Some related areas

Regression

Inferential statistics

Representations

Representations ≡ Coordinates (basis)

What is a good representation for my task and my data?

Evaluated on task-specific data

Supervised learning

How to find a good representation?

Representation learning

CNNs are a good example of learning useful representations through supervised learning

Coordinates

We use a coordinate system to describe a geometric object.

Vectors

Linear independence

Space

Span: the set of all linear combinations of a set of vectors.

A space is a set of objects.

Orthogonality

Orthogonality has several useful properties: coordinates along orthogonal basis vectors are independent, and projections are easy to compute.

Dimensions

The number of vectors in a basis; a measure of the size of a space.

Inner product

$$ \langle x, y \rangle = x^T y = \sum_i x_i y_i $$
$$ \langle f, g \rangle = \int f(x) g(x) \, dx $$

Norm

$$ ||x|| = \sqrt{\langle x, x \rangle} = \sqrt{x^T x} $$

Angle

$$ \langle x, y \rangle = x^T y = ||x|| \cdot ||y|| \cos \theta $$
$$ \cos \theta = \frac{x^T y}{||x|| \cdot ||y||} $$

Distance

$$ d(x, y) = ||x - y|| = \sqrt{(x - y)^T(x - y)} $$
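A minimal NumPy sketch of these four definitions (the vectors and functions here are illustrative):

```python
import numpy as np
from scipy.integrate import quad

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y                                   # <x, y> = sum_i x_i y_i
norm_x = np.sqrt(x @ x)                         # ||x|| = sqrt(<x, x>)
cos_theta = inner / (norm_x * np.sqrt(y @ y))   # cos(theta) = <x, y> / (||x|| ||y||)
distance = np.linalg.norm(x - y)                # d(x, y) = ||x - y||

# Inner product of two functions via numerical integration: <f, g> = ∫ f(x) g(x) dx
inner_fg, _ = quad(lambda t: np.sin(t) * np.cos(t), 0.0, np.pi)
```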

Generalization

We have defined norm, distance, and inner product on a (linear) vector space.

Can we generalize them?

Space

Vector space

A set of vectors

Function space

A set of functions

Other space

points, lines, triangles, numbers, sets, matrices, trees, graphs, relations, groups

Formal definition - distance

A distance (metric) is a function $d: V \times V \rightarrow \mathbb{R}$, where $V$ is a set.

$\forall x, y, z \in V$

  1. $d(x, y) \ge 0$
  2. $d(x, y) = 0 \Leftrightarrow x = y$
  3. $d(x, y) = d(y, x)$
  4. $d(x, y) + d(y, z) \ge d(x, z)$
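The axioms can be spot-checked numerically on a finite sample of points; a minimal sketch (this verifies necessary conditions only, it is not a proof):

```python
import itertools
import numpy as np

def check_metric_axioms(d, points, tol=1e-9):
    """Spot-check the four distance axioms on every triple of sample points."""
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, y) >= -tol                     # 1. non-negativity
        assert abs(d(x, x)) <= tol                 # 2. d(x, x) = 0 (the "only if" half is not checked here)
        assert abs(d(x, y) - d(y, x)) <= tol       # 3. symmetry
        assert d(x, y) + d(y, z) >= d(x, z) - tol  # 4. triangle inequality

rng = np.random.default_rng(0)
points = [rng.standard_normal(3) for _ in range(5)]
check_metric_axioms(lambda a, b: float(np.linalg.norm(a - b)), points)
```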

Metric space

$(V, d)$ is a metric space, a space equipped with distance.

Pseudodistance

A function $d: V \times V \rightarrow \mathbb{R}$, where $V$ is a set.

$\forall x, y, z \in V$

  1. $d(x, y) \ge 0$
  2. $d(x, y) = 0 \Leftrightarrow x = y\ \times$ (only $d(x, x) = 0$ is required; distinct points may be at distance $0$)
  3. $d(x, y) = d(y, x)$
  4. $d(x, y) + d(y, z) \ge d(x, z)$

Pseudometric space

$(V, d)$ is a pseudometric space.

Quasidistance

A function $d: V \times V \rightarrow \mathbb{R}$, where $V$ is a set.

$\forall x, y, z \in V$

  1. $d(x, y) \ge 0$
  2. $d(x, y) = 0 \Leftrightarrow x = y$
  3. $d(x, y) = d(y, x)\ \times$ (symmetry is not required)
  4. $d(x, y) + d(y, z) \ge d(x, z)$

Quasimetric space

$(V, d)$ is a quasimetric space.

Formal definition - norm

A norm is a function $||\cdot||: V \rightarrow \mathbb{R}$, where $V$ is a vector space.

$\forall x, y \in V, \forall a \in \mathbb{R}$

  1. $||x|| \ge 0$
  2. $||x|| = 0 \Leftrightarrow x = \vec{0}$
  3. $||ax|| = |a|\cdot||x||$
  4. $||x|| + ||y|| \ge ||x + y||$

Normed vector space

$(V, ||\cdot||)$ is a normed vector space.

Seminorm

A seminorm is a function $||\cdot||: V \rightarrow \mathbb{R}$, where $V$ is a vector space.

$\forall x, y \in V, \forall a \in \mathbb{R}$

  1. $||x|| \ge 0$
  2. $||x|| = 0 \Leftrightarrow x = \vec{0}\ \times$ (only $||\vec{0}|| = 0$ is required; a nonzero vector may have norm $0$)
  3. $||ax|| = |a|\cdot||x||$
  4. $||x|| + ||y|| \ge ||x + y||$

Seminormed vector space

$(V, ||\cdot||)$ is a seminormed vector space.

Norm vs distance

$$\text{norm} \Rightarrow \text{distance}$$
$$\text{norm} \not\Leftarrow \text{distance}$$

Every norm induces a distance via $d(x, y) = ||x - y||$, but not every distance arises from a norm (e.g., distances on sets or graphs, which have no vector structure).

Example - Euclidean distance

$$ ||x|| = \sqrt{\sum_i x_i^2} $$
$$ d(x, y) = ||x - y|| = \sqrt{\sum_i (x_i - y_i)^2} $$

Example - Hamming distance

$$ d(x, y) = \sum_i [x_i \ne y_i] $$
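A direct translation of the Iverson-bracket sum (the sequences here are illustrative):

```python
def hamming(x, y):
    """d(x, y) = number of positions at which x and y differ."""
    assert len(x) == len(y), "Hamming distance needs equal-length sequences"
    return sum(xi != yi for xi, yi in zip(x, y))

hamming("ACGT", "AGGA")  # 2: positions 2 and 4 differ
```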

Mathematical spaces

Example: MDS

  1. Calculate Euclidean distances for all pairs of points in the embedding.
  2. Compare these distances with the original input distance matrix by evaluating the stress function.
  3. Adjust the coordinates to minimize the stress.

Space beyond coordinates

$$ \text{stress} = \sqrt{\frac{\sum_{i,j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i,j} d_{ij}^2}} $$

$d_{ij} = ||x_i - x_j||$ are the input distances and $\hat{d}_{ij} = ||y_i - y_j||$ the distances in the embedding.
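A minimal sketch of this procedure, assuming scikit-learn is available; `dissimilarity='precomputed'` lets the embedding start from any distance matrix, not only Euclidean coordinates:

```python
import numpy as np
from sklearn.manifold import MDS

def stress(D, Y):
    """Kruskal stress between target distances D and distances realized by coordinates Y."""
    D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.sqrt(((D - D_hat) ** 2).sum() / (D ** 2).sum())

# Toy input: pairwise Euclidean distances between five 10-dimensional points.
X = np.random.default_rng(0).standard_normal((5, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

Y = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(D)
print(stress(D, Y))
```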

Distances

Euclidean distance ($l^2$ norm)

$$d(x, y) = ||x - y||_2 = (\sum_i |x_i - y_i|^2)^{\frac{1}{2}}$$

Manhattan distance ($l^1$ norm)

$$d(x, y) = ||x - y||_1 = \sum_i |x_i - y_i|$$

$L^p$-norm

$$d(x, y) = ||x - y||_p = (\sum_i |x_i - y_i|^p)^{\frac{1}{p}}$$
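All three are the $p = 2$, $p = 1$, and general-$p$ cases of one NumPy one-liner (the vectors here are illustrative):

```python
import numpy as np

def lp_distance(x, y, p):
    """d(x, y) = (sum_i |x_i - y_i|^p)^(1/p)."""
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
lp_distance(x, y, 2)  # Euclidean: 5.0
lp_distance(x, y, 1)  # Manhattan: 7.0
```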

Jaccard distance

$$ d(A, B) = \frac{|A \backslash B| + |B \backslash A|}{|A \cup B|} $$

New distance

Let $X$ be a collection of sets; a distance is a function $d^p: X \times X \rightarrow \mathbb{R}$.

Unnormalized metrics on sets

$$ d^p(A, B) = (|A \backslash B|^p + |B \backslash A|^p)^{\frac{1}{p}} $$

$(X, d^p)$ is a metric space.

Normalized metrics on sets

$$ d^p_N(A, B) = \frac{(|A \backslash B|^p + |B \backslash A|^p)^{\frac{1}{p}}}{|A \cup B|} $$

$(X, d^p_N)$ is a metric space.

Jaccard distance is a special case of $d^p_N$ when $p = 1$
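A minimal sketch of both set metrics using Python's built-in set operations; at $p = 1$ the normalized version reproduces the Jaccard distance:

```python
def d_p(A, B, p=1.0):
    """Unnormalized set metric: (|A \\ B|^p + |B \\ A|^p)^(1/p)."""
    return (len(A - B) ** p + len(B - A) ** p) ** (1.0 / p)

def d_p_normalized(A, B, p=1.0):
    """Normalized set metric: d^p(A, B) / |A ∪ B|."""
    union = len(A | B)
    return d_p(A, B, p) / union if union else 0.0

A, B = {1, 2, 3}, {2, 3, 4, 5}
d_p_normalized(A, B, p=1)  # (1 + 2) / 5 = 0.6, the Jaccard distance of A and B
```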

New distance

$\mathbb{R}^k$ is a vector space; a distance is a function $d^p: \mathbb{R}^k \times \mathbb{R}^k \rightarrow \mathbb{R}$.

Unnormalized metrics on vectors

$$ d^p(x, y) = \Big( \big( \sum_{i:\, x_i \ge y_i} (x_i - y_i) \big)^p + \big( \sum_{j:\, x_j < y_j} (y_j - x_j) \big)^p \Big)^{\frac{1}{p}} $$

$(\mathbb{R}^k, d^p)$ is a metric space.

Manhattan distance is a special case of $d^p$ when $p = 1$

Normalized metrics on vectors

$$ d^p_N(x, y) = \frac{d^p(x, y)}{\sum_i \max(|x_i|, |y_i|, |x_i - y_i|)} $$

$(\mathbb{R}^k, d^p_N)$ is a metric space.
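A sketch of both vector metrics, splitting $x - y$ into its positive and negative parts; at $p = 1$ the unnormalized version reduces to the Manhattan distance:

```python
import numpy as np

def d_p(x, y, p=1.0):
    """Sum the positive and the negative parts of x - y separately, then combine."""
    diff = x - y
    pos = diff[diff > 0].sum()     # sum over i with x_i > y_i
    neg = (-diff)[diff < 0].sum()  # sum over j with x_j < y_j
    return float((pos ** p + neg ** p) ** (1.0 / p))

def d_p_normalized(x, y, p=1.0):
    denom = np.maximum(np.maximum(np.abs(x), np.abs(y)), np.abs(x - y)).sum()
    return d_p(x, y, p) / denom if denom else 0.0

x, y = np.array([1.0, 5.0]), np.array([3.0, 2.0])
d_p(x, y, p=1)  # |1 - 3| + |5 - 2| = 5.0, the Manhattan distance
```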

Relationship to Minkowski distance

Minkowski distance

$$ d(x, y) = (\sum_i |x_i - y_i|^p)^{\frac{1}{p}} $$

New distance

$$ d^p(x, y) = \Big( \big( \sum_{i:\, x_i \ge y_i} (x_i - y_i) \big)^p + \big( \sum_{j:\, x_j < y_j} (y_j - x_j) \big)^p \Big)^{\frac{1}{p}} $$

They coincide only when $p = 1$, where both reduce to the Manhattan distance.
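A quick numeric check for $p = 2$ (the values are chosen so the difference vector mixes signs):

```python
import numpy as np

x, y = np.array([1.0, 2.0, 0.0]), np.array([0.0, 0.0, 3.0])  # x - y = (1, 2, -3)

minkowski_2 = float((np.abs(x - y) ** 2).sum() ** 0.5)  # sqrt(1 + 4 + 9) ≈ 3.742
d_2 = float((np.maximum(x - y, 0).sum() ** 2 + np.maximum(y - x, 0).sum() ** 2) ** 0.5)
# (3^2 + 3^2)^(1/2) ≈ 4.243: the two distances disagree once p > 1
```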

New distance

$L(\mathbb{R})$ is the set of integrable functions; a distance is a function $d^p: L(\mathbb{R}) \times L(\mathbb{R}) \rightarrow \mathbb{R}$.

Unnormalized metrics on functions

$$ d^p(f, g) = ((\int (f - g)^+ dx)^p + (\int (f - g)^- dx)^p)^{\frac{1}{p}} $$

$(L(\mathbb{R}), d^p)$ is a metric space.

Normalized metrics on functions

$$ d^p_N(f, g) = \frac{d^p(f, g)}{\int \max(|f|, |g|, |f - g|) \, dx} $$

$(L(\mathbb{R}), d^p_N)$ is a metric space.
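A sketch using numerical quadrature, with $(f - g)^+$ and $(f - g)^-$ the positive and negative parts of $f - g$; the integration limits and the Gaussian example are illustrative:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def d_p(f, g, p=1.0, lo=-10.0, hi=10.0):
    """Integrate the positive and negative parts of f - g separately, then combine."""
    pos, _ = quad(lambda t: max(f(t) - g(t), 0.0), lo, hi)
    neg, _ = quad(lambda t: max(g(t) - f(t), 0.0), lo, hi)
    return (pos ** p + neg ** p) ** (1.0 / p)

def d_p_normalized(f, g, p=1.0, lo=-10.0, hi=10.0):
    denom, _ = quad(lambda t: max(abs(f(t)), abs(g(t)), abs(f(t) - g(t))), lo, hi)
    return d_p(f, g, p, lo, hi) / denom

# Distance between two Gaussian densities with different means
f = lambda t: norm.pdf(t, loc=0.0, scale=1.0)
g = lambda t: norm.pdf(t, loc=1.0, scale=1.0)
print(d_p_normalized(f, g, p=2.0))
```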

Interpretation

$$ d^p_N(f, g) = \frac{d^p(f, g)}{\int \max(|f|, |g|, |f - g|) \, dx} $$

Relationship to f-divergences

f-divergence

A class of divergences: measures of dissimilarity between probability distributions.

$$ D_f(P||Q) = \int_{\Omega} f(\frac{dP}{dQ}) dQ $$

Total variation distance

$$ TV(P, Q) = \sup_A |P(A) - Q(A)|, \quad f(t) = \frac{1}{2} |t - 1| $$

KL divergence

$$ D_{KL}(P||Q) = \int p(x) \log \frac{p(x)}{q(x)} dx, f(t) = t \log t $$

Hellinger distance

$$ H^2(P, Q) = \frac{1}{2} \int (\sqrt{dP} - \sqrt{dQ})^2, \quad f(t) = (\sqrt{t} - 1)^2 $$
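Minimal discrete versions of the three examples (the distributions are illustrative; the KL sketch assumes $q_i > 0$ wherever $p_i > 0$):

```python
import numpy as np

def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i), skipping terms with p_i = 0."""
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

def hellinger_sq(p, q):
    """Squared Hellinger distance: (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
total_variation(p, q), kl_divergence(p, q), hellinger_sq(p, q)
```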

Application to ontologies

An ontology $\mathcal{O} = (V, E)$ is a directed acyclic graph (DAG).

Consistent subgraph

A subgraph $F \subseteq V$ is consistent if, for every vertex $v \in F$, all ancestors of $v$ are also in $F$.

Consistent subgraph

The underlying probabilistic graphical model is related to Bayesian networks.

Consider each concept in the ontology as a binary random variable.

The joint probability factorizes over the graph structure as

$$ P(F) = \prod_{v \in F} P(v \mid parents(v)) $$

$P(v \mid parents(v))$ states the probability that node $v$ is part of an ontological annotation given all its parents.

Information content of a consistent subgraph

$$ i(F) = \log \frac{1}{P(F)} = \sum_{v \in F} ia(v) $$
$$ ia(v) = -\log P(v \mid parents(v)) $$
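A minimal sketch on a toy DAG; the conditional probabilities here are invented for illustration (in practice they are estimated from an annotation database), and base-2 logs give the result in bits:

```python
import math

# Toy ontology: each node maps to its parents (the root has none).
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
# P(v | parents(v)): assumed, illustrative values
p_cond = {"root": 1.0, "a": 0.5, "b": 0.4, "c": 0.25}

def ia(v):
    """Information accretion: ia(v) = -log P(v | parents(v))."""
    return -math.log2(p_cond[v])

def info_content(F):
    """i(F) = sum of ia(v) over a consistent subgraph F."""
    return sum(ia(v) for v in F)

F = {"root", "a", "b", "c"}  # consistent: every node's ancestors are included
print(info_content(F))       # 0 + 1 + 1.32 + 2 ≈ 4.32 bits
```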

Information accretion

Additional information inherent to the node $v$ assuming that all its parents are already in the annotation.

Comparison of two ontologies

Suppose we have two annotations $F$ and $G$ (consistent subgraphs of the ontology), where $G$ is a prediction of $F$.

Misinformation

The information content of the nodes in $G$ that are not part of the true annotation $F$.

$$ mi(F, G) = \sum_{v \in G \backslash F} ia(v) $$

Remaining uncertainty

The information content of the nodes in $F$ that are not part of the prediction $G$.

$$ ru(F, G) = \sum_{v \in F \backslash G} ia(v) $$

Comparison of two ontologies

Distance of two ontologies (semantic distance)

$$ d^p(F, G) = (ru^p(F, G) + mi^p(F, G))^{\frac{1}{p}} $$

Normalized distance of two ontologies (normalized semantic distance)

$$ d^p_N(F, G) = \frac{(ru^p(F, G) + mi^p(F, G))^{\frac{1}{p}}}{\sum_{v \in F \cup G} ia(v)} $$
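Given per-node information accretion values, the whole comparison is a few set operations; a sketch (the `ia` table is assumed precomputed, e.g. as in the earlier toy example):

```python
def ru(F, G, ia):
    """Remaining uncertainty: information in the truth F that prediction G misses."""
    return sum(ia[v] for v in F - G)

def mi(F, G, ia):
    """Misinformation: information in prediction G not supported by the truth F."""
    return sum(ia[v] for v in G - F)

def semantic_distance(F, G, ia, p=2.0):
    return (ru(F, G, ia) ** p + mi(F, G, ia) ** p) ** (1.0 / p)

def semantic_distance_normalized(F, G, ia, p=2.0):
    total = sum(ia[v] for v in F | G)
    return semantic_distance(F, G, ia, p) / total if total else 0.0

ia = {"root": 0.0, "a": 1.0, "b": 1.32, "c": 2.0}  # illustrative accretion values (bits)
F = {"root", "a"}  # true annotation
G = {"root", "b"}  # predicted annotation
semantic_distance(F, G, ia, p=2)             # sqrt(1^2 + 1.32^2) ≈ 1.66
semantic_distance_normalized(F, G, ia, p=2)  # ≈ 0.71
```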

Compare protein sequence similarity to ontological distance

  • 5000 human-mouse protein pairs from UniProt
  • 3 classes of protein sequence similarity: [0, 1/3), [1/3, 2/3), [2/3, 1]
  • Distribution of pairwise ontological distances ($d^2_N$) between the proteins in each class

Reconstruct a functional phylogeny from protein functions, without sequence information

  • A good distance measure is expected to recover the known evolutionary tree.
  • Single-linkage hierarchical clustering
  • Clustering with the Molecular Function and Cellular Component ontologies recovered the correct relationships (left)
  • Clustering with the Biological Process ontology did not (right)

Thank you for your attention