After the Information Bottleneck appeared, it drew plenty of amazed acclaim, but critiques also surfaced pointing out the method's shortcomings and misinterpretations.

After that, some work focused specifically on the question of determinism.

```mermaid
graph LR
X -->|"p(x, y)"|Y
X -->|"q(t|x)"|T
T --- Y
X((X))
Y((Y))
T((T))
```

$T$ can be interpreted as:

- soft sufficient statistics (in statistics)
- lossy compression (in signal processing)
- maximally informative clustering (in machine learning)

IB

$$
\min\ \mathcal{L}[q(t|x)] = I(T; X) - \beta I(T; Y), \quad \beta > 0
$$

$I(T; X)$: compression
$I(T; Y)$: relevance

Markov constraint: $T \leftrightarrow X \leftrightarrow Y$ (i.e. $T$ and $Y$ are conditionally independent given $X$)

$$
\begin{aligned}
q(t|x) &= \frac{q(t)}{Z(x, \beta)} \exp\left(-\beta\, D_{KL}\left[p(y|x) \,\|\, q(y|t)\right]\right) \\
q(t) &= \sum_x p(x)\, q(t|x) \\
q(y|t) &= \frac{1}{q(t)} \sum_x p(y|x)\, q(t|x)\, p(x)
\end{aligned}
$$

The $I(T; X)$ term comes from channel coding and rate-distortion theory: it measures the rate needed to encode $X$ into $T$.
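
These three equations fully determine a fixed-point iteration: alternate them until convergence, Blahut-Arimoto style. Below is a minimal NumPy sketch of that iteration; the function name, random initialization, and defaults are my own, not the original paper's reference implementation.

```python
import numpy as np

def ib_iterate(p_xy, n_clusters, beta, n_steps=200, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    p_xy: (|X|, |Y|) array of joint probabilities; n_clusters is |T|.
    Returns the encoder q(t|x) as an (|X|, |T|) array.
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y_given_x = p_xy / p_x[:, None]           # p(y|x)

    # Random soft initialization of the encoder q(t|x).
    q_t_given_x = rng.random((n_x, n_clusters))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    eps = 1e-12
    for _ in range(n_steps):
        # q(t) = sum_x p(x) q(t|x)
        q_t = p_x @ q_t_given_x
        # q(y|t) = (1/q(t)) sum_x p(y|x) q(t|x) p(x)
        q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x
        q_y_given_t /= q_t[:, None] + eps
        # D_KL[p(y|x) || q(y|t)] for every (x, t) pair
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) \
                    - np.log(q_y_given_t[None, :, :] + eps)
        d_kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # q(t|x) = q(t)/Z(x, beta) * exp(-beta D_KL), via a stable softmax
        logits = np.log(q_t + eps)[None, :] - beta * d_kl
        q_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x
```

Sweeping $\beta$ and recording $I(T; X)$ and $I(T; Y)$ at each solution traces out the IB trade-off curve.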

DIB

$$
\min\ \mathcal{L}[q(t|x)] = H(T) - \beta I(T; Y)
$$

$H(T)$: penalizes the cost of the code itself, instead of $I(T; X)$
$I(T; Y)$: relevance, as in IB; the swap yields a deterministic variant of $\mathcal{L}_{IB}$ whose optimal encoder is a hard mapping

$$
\mathcal{L}_{IB} - \mathcal{L}_{DIB} = I(T; X) - H(T) = -H(T|X)
$$

Since their difference is $-H(T|X)$, minimizing $\mathcal{L}_{IB}$ implicitly rewards a larger $H(T|X)$, i.e. it implicitly encourages a stochastic encoder.
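
The identity is just $I(T; X) = H(T) - H(T|X)$ rearranged, since the $\beta I(T; Y)$ terms cancel. A toy numeric check; `entropies` is a hypothetical helper and the distribution and encoder values are made up:

```python
import numpy as np

def entropies(p_x, q_t_given_x):
    """H(T), H(T|X), and I(T;X) for a discrete encoder q(t|x)."""
    eps = 1e-12
    q_t = p_x @ q_t_given_x                                  # q(t)
    h_t = -(q_t * np.log(q_t + eps)).sum()                   # H(T)
    h_t_given_x = -(p_x[:, None] * q_t_given_x
                    * np.log(q_t_given_x + eps)).sum()       # H(T|X)
    return h_t, h_t_given_x, h_t - h_t_given_x               # ..., I(T;X)

p_x = np.array([0.5, 0.3, 0.2])
q_t_given_x = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])  # toy stochastic encoder
h_t, h_t_given_x, i_tx = entropies(p_x, q_t_given_x)
# L_IB - L_DIB = I(T;X) - H(T) = -H(T|X)
assert np.isclose(i_tx - h_t, -h_t_given_x)
```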

Generalized IB

$$
\mathcal{L}_{\alpha} = H(T) - \alpha H(T|X) - \beta I(Y; T)
$$

$\alpha = 1 \Rightarrow \mathcal{L}_{IB}$: stochastic encoder $\rightarrow$ soft clustering
$\alpha = 0 \Rightarrow \mathcal{L}_{DIB}$: deterministic encoder $\rightarrow$ hard clustering
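
Setting the functional derivative of $\mathcal{L}_{\alpha}$ to zero gives an encoder update of softmax-with-temperature form, $q(t|x) \propto \exp\big((\log q(t) - \beta D_{KL}[p(y|x) \,\|\, q(y|t)])/\alpha\big)$, so $\alpha$ acts like a temperature interpolating between the soft IB update and a hard argmax. A sketch of that single update step, assuming this form; the function name and the explicit $\alpha = 0$ branch are my own:

```python
import numpy as np

def generalized_encoder_update(log_q_t, d_kl, beta, alpha):
    """One generalized-IB encoder update.

    log_q_t: shape (|T|,); d_kl[x, t] holds D_KL[p(y|x) || q(y|t)].
    alpha = 1 recovers the soft IB update; alpha = 0 is the DIB limit.
    """
    logits = log_q_t[None, :] - beta * d_kl
    if alpha == 0.0:
        # DIB limit: deterministic (one-hot) assignment per x.
        hard = np.zeros_like(logits)
        hard[np.arange(logits.shape[0]), logits.argmax(axis=1)] = 1.0
        return hard
    scaled = logits / alpha
    scaled -= scaled.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(scaled)
    return q / q.sum(axis=1, keepdims=True)       # normalize q(t|x) over t
```

As $\alpha$ shrinks, the rows of the returned encoder sharpen from soft cluster memberships toward one-hot hard assignments.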