Attention mechanism

杜岳華

2020.3.7

About me

  • PhD student, Bioinformatics Program, Academia Sinica (中研院)
  • Organizer of the Julia Taiwan community
  • Behind-the-scenes administrator of the AI Tech community
  • Administrator of the Deep Learning 101 community
  • Lecturer, Machine Learning Theory and Practice, ITRI (工研院)
  • Author of 《Julia程式設計》 (Julia Programming) and 《Julia資料科學與科學計算》 (Julia Data Science and Scientific Computing)
  • Specialties: systems biology, computational biology, machine learning

Outline

  • Problems with RNN
  • Seq2Seq encoder-decoder architecture
  • Problems solved by the attention model
  • Attention types
  • Applications of attention
    • Translation
    • Summarization

Problems with RNN

Seq2Seq encoder-decoder architecture

Problems solved by the attention model

How to solve the problem?

Attention mechanism

Attention types

  • Global/local attention
  • Hard/soft attention
  • Self-attention

Global/local attention

Hard/soft attention

Soft attention

  • Alignment weights are learned and attend over all of the input
  • $0 \le w \le 1$
  • Pro: the model is smooth and differentiable
  • Con: computation is expensive when the input is large

Hard attention

  • Selects one part of the input to attend to at a time
  • Weights are 0 or 1
  • Pro: less computation at inference time
  • Con: the model is non-differentiable (see the contrast sketch below)
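A minimal NumPy sketch of the contrast, assuming a toy set of value vectors and precomputed alignment scores (the variable names are illustrative only): soft attention takes a weighted average of all values, while hard attention picks or samples a single value.

```python
import numpy as np

rng = np.random.default_rng(0)

values = rng.normal(size=(5, 4))   # 5 input positions, 4-dim value vectors
scores = rng.normal(size=5)        # precomputed alignment scores, one per position

# Soft attention: softmax weights in [0, 1], weighted sum over all positions.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_context = weights @ values            # differentiable w.r.t. the scores

# Hard attention: exactly one position gets weight 1, the rest get 0.
index = rng.choice(len(scores), p=weights) # stochastic choice, non-differentiable
hard_context = values[index]

print(soft_context, hard_context)
```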

Alignment (compatibility function)

query: $q_{j}$, key: $k_i$

Location-based

$$ \alpha^i_j = softmax(W_{\alpha} q_j) $$

Content-based

$$ score(q_j, k_i) = \text{cosine}(q_j, k_i) $$

Additive

$$ score(q_j, k_i) = v^T_{\alpha} tanh(W_{\alpha} [q_j; k_i]) $$

Alignment (compatibility function)

General

$$ score(q_j, k_i) = q_j^T W_{\alpha} k_i $$

Dot-product

$$ score(q_j, k_i) = q_j^T k_i $$

Scaled dot-product

$$ score(q_j, k_i) = \frac{q_j^T k_i}{\sqrt{n}} $$

where $n$ is the dimension of the key vectors
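A minimal NumPy sketch of these scoring functions under toy dimensions; the weight names (`W_a`, `v_a`, `W_g`) follow the $W_{\alpha}$, $v_{\alpha}$ symbols above and are randomly initialized only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # dimension of query and key vectors
q = rng.normal(size=d)      # query q_j
k = rng.normal(size=d)      # key k_i

W_a = rng.normal(size=(d, 2 * d))   # additive-attention weight matrix
v_a = rng.normal(size=d)
W_g = rng.normal(size=(d, d))       # general (bilinear) weight matrix

# Content-based: cosine similarity between query and key.
content = q @ k / (np.linalg.norm(q) * np.linalg.norm(k))

# Additive (Bahdanau): v_a^T tanh(W_a [q; k]).
additive = v_a @ np.tanh(W_a @ np.concatenate([q, k]))

# General (Luong): q^T W k.
general = q @ W_g @ k

# Dot-product and scaled dot-product (scaled by sqrt of the key dimension).
dot = q @ k
scaled_dot = dot / np.sqrt(d)

print(content, additive, general, dot, scaled_dot)
```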

Applications of attention

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

ICLR 2015

Bidirectional RNN (GRU) as encoder

Attention vs. Seq2Seq

Bilingual evaluation understudy (BLEU)

A way to evaluate the quality of machine-translated text from one natural language to another.

A modified form of precision to compare a translation candidate with multiple references

  • Candidate: the the the the the the the
  • Reference 1: the cat is on the mat
  • Reference 2: there is a cat on the mat

$\Large \text{precision} = \frac{\text{matched number of words in candidate}}{\text{total number of words in candidate}} = \frac{7}{7}$

Plain unigram precision counts every "the" as matched, giving a perfect score to a useless candidate. BLEU's modified precision clips each word by its maximum count in any single reference ("the" occurs at most twice), so the score becomes $2/7$.
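A minimal sketch of modified (clipped) unigram precision for this example, assuming whitespace tokenization; this is only the unigram building block of BLEU, not the full metric with higher-order n-grams and the brevity penalty.

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word's count by its max count in any single reference."""
    cand_counts = Counter(candidate.split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision(candidate, references))  # 2/7 ≈ 0.286
```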

Quantitative comparison

Translation

A Neural Attention Model for Abstractive Sentence Summarization

Alexander M. Rush, Sumit Chopra, Jason Weston (Facebook AI Research)

EMNLP 2015

Summarization

To shorten the sentence while keeping relevant or important information.

But... operationally, what is so different from translation?

What is summarization?

We usually execute a series of operations to summarize a sentence or article.

These operations are Deletion (刪除), Generalization (廣義化), and Paraphrase (改寫).

Types of summarization

Compressive

Summarizes the original sentence by deletion only

Extractive

Summarizes the original sentence by deletion and reordering

Abstractive

Summarizes the original sentence by arbitrary transformations

Proposed models

Bag-of-Words Encoder

$$ \begin{align} enc_1(\mathbf{x}, \mathbf{y}_c) &= \mathbf{p}^T \tilde{\mathbf{x}} \\ \mathbf{p} &= [1/M, \cdots, 1/M] \\ \tilde{\mathbf{x}} &= F [x_1, \cdots , x_M] \end{align} $$

It ignores the original word order and relationships between neighboring words.

Convolutional Encoder

$$ enc_2(\mathbf{x}, \mathbf{y}_c) = (\text{temporal convolution layer} \rightarrow \text{max pooling layer})^L $$

It allows local interactions between words while not requiring the context $\mathbf{y}_c$ when encoding the input.

Proposed models

Attention-Based Encoder

$$ \Large \begin{align} enc_3(\mathbf{x}, \mathbf{y}_c) &= \mathbf{p}^T \bar{\mathbf{x}} \\ \mathbf{p} &\propto exp(\tilde{\mathbf{x}} P \mathbf{y}_c') \\ \tilde{\mathbf{x}} &= F [x_1, \cdots , x_M] \\ \mathbf{y}_c' &= G [y_{i-C+1}, \cdots, y_i] \\ \bar{x}_i &= \sum_{q=i-Q}^{i+Q} \tilde{\mathbf{x}}_q / Q \end{align} $$
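A minimal NumPy sketch of this attention-based encoder under toy dimensions; the shapes and variable names are assumptions for illustration, and the embeddings $F$, $G$ and the matrix $P$ are randomly initialized here rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

M, C, Q = 6, 3, 1          # input length, context window, smoothing window
V, H = 20, 8               # vocabulary size, embedding size

x_ids = rng.integers(V, size=M)    # input word ids x_1..x_M
y_ids = rng.integers(V, size=C)    # context word ids y_{i-C+1}..y_i

F = rng.normal(size=(V, H))        # input-side embedding F
G = rng.normal(size=(V, H))        # context-side embedding G
P = rng.normal(size=(H, C * H))    # attention weight matrix P

x_tilde = F[x_ids]                 # (M, H) embedded input
y_ctx = G[y_ids].reshape(-1)       # (C*H,) embedded context, concatenated

# p ∝ exp(x̃ P y'_c): softmax attention over the M input positions.
scores = x_tilde @ P @ y_ctx       # (M,)
p = np.exp(scores - scores.max())
p /= p.sum()

# x̄_i: local smoothing of the input embeddings over a window around position i
# (windows are truncated at the sequence boundaries in this sketch).
x_bar = np.stack([
    x_tilde[max(0, i - Q): i + Q + 1].sum(axis=0) / Q for i in range(M)
])

enc3 = p @ x_bar                   # (H,) encoder output p^T x̄
print(enc3.shape)
```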

Neural network language model (NNLM)

Perplexity 困惑度

A method to evaluate a language model. A language model describes a probability distribution over whole sentences.

Perplexity of discrete probability distribution

$$ \LARGE 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)} $$

Language model is a probability distribution

If each word in a sentence is well determined, the meaning of the sentence is clear. We evaluate the occurrence probability of words in a sentence: lower entropy means a more precise meaning, so a smaller perplexity is better.
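A minimal sketch of perplexity as defined above, assuming a toy discrete distribution; the probabilities are made up for illustration.

```python
import numpy as np

def perplexity(probs):
    """2 ** H(p) for a discrete distribution given as an array of probabilities."""
    probs = np.asarray(probs)
    entropy = -np.sum(probs * np.log2(probs))
    return 2.0 ** entropy

# A peaked (low-entropy) distribution has low perplexity...
print(perplexity([0.7, 0.1, 0.1, 0.1]))      # ≈ 2.56
# ...while a uniform distribution over 4 outcomes has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```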

Evaluation

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

A method to evaluate machine translation and automatic summarization. It compares a generated result with a reference (usually written by a human) and calculates the similarity between them.

$$ \Large \text{ROUGE-N} = \frac{\sum_{S \in \text{references}} \sum_{gram_n \in S} count_{match}(gram_n)}{\sum_{S \in \text{references}} \sum_{gram_n \in S} count(gram_n)} $$
  • $count_{match}(gram_n)$: maximum number of n-grams co-occurring in the candidate and the reference (see the worked example and sketch below).

ROUGE-N

  • Candidate: the cat was found under the bed
  • Reference: the cat was under the bed

1-gram

  • Candidate: the, cat, was, found, under, bed
  • Reference: the, cat, was, under, bed
  • ROUGE-1: $5/5 = 1.0$

2-gram

  • Candidate: the cat, cat was, was found, found under, under the, the bed
  • Reference: the cat, cat was, was under, under the, the bed
  • ROUGE-2: $4/5 = 0.8$
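A minimal sketch of ROUGE-N recall for a single reference, assuming whitespace tokenization; it reproduces the two results above.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    """ROUGE-N recall: matched n-grams over the total n-grams in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())

candidate = "the cat was found under the bed"
reference = "the cat was under the bed"
print(rouge_n(candidate, reference, 1))  # 1.0
print(rouge_n(candidate, reference, 2))  # 0.8
```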

Evaluation

Summarization

Thank you for your attention.

References

Papers