Transfer learning

杜岳華

2019.4.20

Transfer learning

Data not directly related to the task.

  • Domain transfer: Different data, similar task
  • Task transfer: Similar data, different task

A large amount of data not directly related to the task is applied to help a specific task that has only a small amount of task-specific data.

Domain transfer

$$ \text{(large) Source data }(x^s, y^s) \rightarrow \text{(small) Target data }(x^t, y^t) $$

Neural network

Fine-tuning

  • Conservative training
  • Layer transfer
  • Task transfer

Conservative training


Naive fine-tuning on a small target dataset fails easily.

Ways to avoid training failure:

  • Add regularization on the output, so that predictions stay close to the original model's.
  • Constrain the model parameters to stay close to the original model's parameters (see the sketch below).
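
A minimal sketch of the second idea, assuming PyTorch; `source_model`, `lam`, and the usage names in the comments are illustrative. The fine-tuned model is penalized for drifting away from the pre-trained parameters.

```python
import torch

def conservative_loss(target_model, source_model, task_loss, lam=1e-3):
    """Task loss plus an L2 penalty that keeps the fine-tuned parameters
    close to the original (source) model's parameters."""
    penalty = 0.0
    for p_t, p_s in zip(target_model.parameters(), source_model.parameters()):
        penalty = penalty + ((p_t - p_s.detach()) ** 2).sum()
    return task_loss + lam * penalty

# Usage sketch: keep a frozen copy of the pre-trained model as the reference.
# source_model = copy.deepcopy(pretrained_model)
# loss = conservative_loss(model, source_model, criterion(model(x), y))
```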

Layer transfer


Which layers can be transferred?

It depends...

Speech

$$ \text{voice} \Rightarrow \text{frequency} \Rightarrow \text{timbre (音色)} \Rightarrow \text{articulation (發音)} \Rightarrow \text{word} $$

Transfer the later layers.

Image

$$ \text{pixel} \Rightarrow \text{edge} \Rightarrow \text{shape} \Rightarrow \text{pattern} \Rightarrow \text{object} \Rightarrow \text{instance} $$

Transfer the earlier layers.
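
A minimal layer-transfer sketch for the image case, assuming PyTorch/torchvision and an illustrative `num_target_classes`: the earlier layers (edges, shapes, patterns) are frozen, and only a new classifier head is trained on the small target dataset.

```python
import torch.nn as nn
from torchvision import models

num_target_classes = 10                      # illustrative target task size

# Network pre-trained on the source domain (ImageNet).
model = models.resnet18(pretrained=True)

# Freeze the earlier layers, which capture generic edges / shapes / patterns.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer and train only it on the target data.
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
```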

Task transfer

Fully Convolutional Networks for Semantic Segmentation

Multitask learning

Multitask learning helps to learn a good representation shared across multiple tasks.

Multitask learning

  • Feature-based multitask learning
  • Parameter-based multitask learning
  • Instance-based multitask learning (few works)

Feature-based multitask learning

Different tasks share identical or similar feature representation.

  • Feature transformation approach
  • Feature selection approach
  • Deep learning approach

Parameter-based multitask learning

Put task relatedness into model learning via regularization on the model parameters.

  • Low-rank approach
  • Task-clustering approach
  • Task-relation learning approach
  • Dirty approach
  • Multi-level approach

Hard parameter sharing

Soft parameter sharing
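
A minimal sketch of hard parameter sharing, assuming PyTorch; the layer sizes and task heads are illustrative. In soft parameter sharing, each task would instead keep its own trunk, with a regularizer tying the corresponding parameters together.

```python
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one output head per task."""
    def __init__(self, in_dim=64, hidden=128, n_classes_a=10, n_classes_b=5):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head_a = nn.Linear(hidden, n_classes_a)   # task-specific head A
        self.head_b = nn.Linear(hidden, n_classes_b)   # task-specific head B

    def forward(self, x):
        h = self.shared(x)                             # common representation
        return self.head_a(h), self.head_b(h)
```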

Multimodal learning (not transfer learning)

Integrate multiple data modalities, which provide richer information for model prediction.
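
A minimal late-fusion sketch, assuming PyTorch and pre-computed image and text embeddings; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Fuse an image embedding and a text embedding into one prediction."""
    def __init__(self, img_dim=512, txt_dim=300, hidden=256, n_classes=10):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, hidden)
        self.txt_enc = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, img_feat, txt_feat):
        h = torch.cat([torch.relu(self.img_enc(img_feat)),
                       torch.relu(self.txt_enc(txt_feat))], dim=-1)
        return self.classifier(h)
```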


Domain-adversarial learning


  • The feature extractor tries to remove domain-specific properties from the learned features.
  • The domain classifier predicts which domain each example comes from.
$$ \mathcal{L} = (\text{loss of label classifier}) - (\text{loss of domain classifier}) $$
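
The subtraction in the loss is usually implemented with a gradient reversal layer. A minimal sketch, assuming PyTorch; `extractor`, `label_clf`, and `domain_clf` in the usage comments are illustrative names.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the
    backward pass, so the feature extractor maximizes the domain loss."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage sketch:
# features    = extractor(x)
# label_loss  = criterion(label_clf(features), y)
# domain_loss = criterion(domain_clf(GradReverse.apply(features, 1.0)), d)
# (label_loss + domain_loss).backward()   # reversed gradients reach the extractor
```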


Zero-shot learning

Train on a dataset of seen classes, while retaining the ability to classify unseen instances.

  • Images are mapped into a semantic space.
  • The classifier assigns test images to classes it has seen.
  • Novelty detection


Novelty detection

  • Novelty variable $V$
  • Seen image classifier: softmax classifier
  • Unseen classifier: Gaussian classifier

Seen image classifier

  • Gives the probability over the known classes.

Unseen classifier

  • Estimates the likelihood of the mapped image under each known semantic word vector $w_y$.
  • The likelihood is modeled as a Gaussian distribution $\mathcal{N}(w_y, \Sigma_y)$.
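
A minimal sketch of the two-branch decision, assuming numpy/scipy; the threshold and all inputs are illustrative. The image is first mapped to a point $f$ in the semantic space.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(f, seen_vecs, seen_covs, unseen_vecs, softmax_probs, threshold):
    """f: image mapped into the semantic space.
    seen_vecs / seen_covs: the (w_y, Sigma_y) Gaussian per seen class.
    softmax_probs: probabilities from the seen-class softmax classifier."""
    # Novelty detection: likelihood of f under each seen-class Gaussian.
    likelihoods = [multivariate_normal.pdf(f, mean=w, cov=S)
                   for w, S in zip(seen_vecs, seen_covs)]
    if max(likelihoods) > threshold:                 # looks like a seen class
        return "seen", int(np.argmax(softmax_probs))
    # Otherwise assign the nearest unseen-class word vector.
    dists = np.linalg.norm(np.asarray(unseen_vecs) - f, axis=1)
    return "unseen", int(np.argmin(dists))
```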

Self-taught learning


Train a supervised task with the help of unlabeled data.

Obtain bases $b$ via sparse coding on the unlabeled data

$$ \begin{align} \mathop{\arg\min}_{a,b} & \ \ \sum_i \left( \left\lVert x_u^{(i)} - \sum_j a_j^{(i)}b_j \right\rVert ^2 + \beta \left\lVert a^{(i)} \right\rVert_1 \right) \\ \text{subject to} & \ \ \left\lVert b_j \right\rVert_2 \le 1 \end{align} $$

Compute features with labeled data

$$ \hat{a}(x_l^{(i)}) = \mathop{\arg\min}_{a^{(i)}} \left\lVert x_l^{(i)} - \sum_j a_j^{(i)}b_j \right\rVert ^2 + \beta \left\lVert a^{(i)} \right\rVert_1 $$

Train the supervised learner with $(\hat{a}(x_l^{(i)}), y^{(i)})$.
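
A minimal sketch of the two stages using scikit-learn's sparse coding; the synthetic arrays only stand in for the real unlabeled and labeled data, and the choice of `LogisticRegression` as the supervised learner is illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_unlabeled = rng.randn(200, 30)            # stand-in for the unlabeled data x_u
X_labeled = rng.randn(40, 30)               # stand-in for the labeled data x_l
y_labeled = rng.randint(0, 2, size=40)

# 1. Learn the bases b by sparse coding on the unlabeled data.
dico = DictionaryLearning(n_components=16, alpha=1.0,
                          transform_algorithm='lasso_lars', random_state=0)
dico.fit(X_unlabeled)

# 2. Compute the sparse activations a_hat(x_l) for the labeled data.
A_labeled = dico.transform(X_labeled)

# 3. Train the supervised learner on (a_hat(x_l), y).
clf = LogisticRegression(max_iter=1000).fit(A_labeled, y_labeled)
```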


Self-taught clustering

A large amount of unlabeled auxiliary data helps cluster a small amount of unlabeled target data.

  • Auxiliary data help uncover a better data representation for the target data
  • Target data $X$, auxiliary data $Y$
  • Common feature space $Z$

Self-taught clustering

Information theoretic co-clustering

Loss function

$$ \min I(X, Z) - I(\tilde{X}, \tilde{Z}) $$

Mutual information

$$ I(X, Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} $$

Self-taught clustering

Loss function

$$ \mathcal{J} = I(X, Z) - I(\tilde{X}, \tilde{Z}) + \lambda [I(Y, Z) - I(\tilde{Y}, \tilde{Z})] $$
  • $\tilde{X}$: clusters of $X$
  • $\tilde{Y}$: clusters of $Y$
  • $\tilde{Z}$: clusters of $Z$
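
A minimal numpy sketch that evaluates this objective for given cluster assignments (the actual algorithm alternately updates $\tilde{X}$, $\tilde{Y}$, $\tilde{Z}$ to decrease it); the joint distribution tables and assignment arrays are illustrative inputs.

```python
import numpy as np

def mutual_info(p_xy):
    """I(X, Z) computed from a joint distribution table p(x, z)."""
    px = p_xy.sum(axis=1, keepdims=True)
    pz = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (px @ pz)[nz])).sum())

def aggregate(p_xy, row_clusters, col_clusters):
    """Joint distribution of the clustered variables, e.g. p(x~, z~)."""
    row_clusters = np.asarray(row_clusters)
    col_clusters = np.asarray(col_clusters)
    q = np.zeros((row_clusters.max() + 1, col_clusters.max() + 1))
    for i, ci in enumerate(row_clusters):
        for j, cj in enumerate(col_clusters):
            q[ci, cj] += p_xy[i, j]
    return q

def objective(p_xz, p_yz, x_clust, y_clust, z_clust, lam=1.0):
    """J = [I(X,Z) - I(X~,Z~)] + lam * [I(Y,Z) - I(Y~,Z~)]"""
    j_x = mutual_info(p_xz) - mutual_info(aggregate(p_xz, x_clust, z_clust))
    j_y = mutual_info(p_yz) - mutual_info(aggregate(p_yz, y_clust, z_clust))
    return j_x + lam * j_y
```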

Thank you for your attention.