Attention mechanism

杜岳華

2020.3.7

About me

  • PhD student, Bioinformatics Program, Academia Sinica (中研院)
  • Organizer of the Julia Taiwan community
  • Behind-the-scenes administrator of the AI Tech community
  • Administrator of the Deep Learning 101 community
  • Lecturer, Machine Learning Theory and Practice, ITRI (工研院)
  • Author of 《Julia程式設計》 (Julia Programming) and 《Julia資料科學與科學計算》 (Julia Data Science and Scientific Computing)
  • Specialties: systems biology, computational biology, machine learning

Outline

  • Problems with RNN
  • Seq2Seq encoder-decoder architecture
  • Problems solved by the attention model
  • Attention types
  • Applications of attention
    • Translation
    • Summarization

Problems with RNN

Seq2Seq encoder-decoder architecture

Problems solved by the attention model

How to solve the problem?

Attention mechanism

Attention types

  • Global/local attention
  • Hard/soft attention
  • Self-attention

Global/local attention

Hard/soft attention

Soft attention

  • Alignment weights are learned and attend over all of the input
  • $0 \le w \le 1$
  • Pro: the model is smooth and differentiable
  • Con: computation is expensive when the input is large

Hard attention

  • Selects one part of the input to attend to at a time
  • Weights are 0 or 1
  • Pro: less computation at inference time
  • Con: the model is non-differentiable (see the contrast sketch below)
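A minimal NumPy sketch of the contrast, assuming a toy set of value vectors and precomputed alignment scores (the variable names are illustrative only): soft attention takes a weighted average of all values, while hard attention picks or samples a single value.

```python
import numpy as np

rng = np.random.default_rng(0)

values = rng.normal(size=(5, 4))   # 5 input positions, 4-dim value vectors
scores = rng.normal(size=5)        # precomputed alignment scores, one per position

# Soft attention: softmax weights in [0, 1], weighted sum over all positions.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_context = weights @ values            # differentiable w.r.t. the scores

# Hard attention: exactly one position gets weight 1, the rest get 0.
index = rng.choice(len(scores), p=weights) # stochastic choice, non-differentiable
hard_context = values[index]

print(soft_context, hard_context)
```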

Alignment (compatibility function)

query: $q_{j}$, key: $k_i$

Location-based

$$ \alpha^i_j = softmax(W_{\alpha} q_j) $$

Content-based

$$ score(q_j, k_i) = \text{cosine}(q_j, k_i) $$

Additive

$$ score(q_j, k_i) = v^T_{\alpha} tanh(W_{\alpha} [q_j; k_i]) $$

Alignment (compatibility function)

General

$$ score(q_j, k_i) = q_j^T W_{\alpha} k_i $$

Dot-product

$$ score(q_j, k_i) = q_j^T k_i $$

Scaled dot-product

$$ score(q_j, k_i) = \frac{q_j^T k_i}{\sqrt{n}} $$

where $n$ is the dimension of the key vectors
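A minimal NumPy sketch of these scoring functions under toy dimensions; the weight names (`W_a`, `v_a`, `W_g`) follow the $W_{\alpha}$, $v_{\alpha}$ symbols above and are randomly initialized only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # dimension of query and key vectors
q = rng.normal(size=d)      # query q_j
k = rng.normal(size=d)      # key k_i

W_a = rng.normal(size=(d, 2 * d))   # additive-attention weight matrix
v_a = rng.normal(size=d)
W_g = rng.normal(size=(d, d))       # general (bilinear) weight matrix

# Content-based: cosine similarity between query and key.
content = q @ k / (np.linalg.norm(q) * np.linalg.norm(k))

# Additive (Bahdanau): v_a^T tanh(W_a [q; k]).
additive = v_a @ np.tanh(W_a @ np.concatenate([q, k]))

# General (Luong): q^T W k.
general = q @ W_g @ k

# Dot-product and scaled dot-product (scaled by sqrt of the key dimension).
dot = q @ k
scaled_dot = dot / np.sqrt(d)

print(content, additive, general, dot, scaled_dot)
```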

Applications of attention

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

ICLR 2015

Bidirectional RNN (GRU) as encoder

Attention vs. Seq2Seq

Bilingual evaluation understudy (BLEU)

A way to evaluate the quality of machine-translated text from one natural language to another.

A modified form of precision to compare a translation candidate with multiple references

  • Candidate: the the the the the the the
  • Reference 1: the cat is on the mat
  • Reference 2: there is a cat on the mat

$\Large \text{precision} = \frac{\text{matched number of words in candidate}}{\text{total number of words in candidate}} = \frac{7}{7}$

Plain unigram precision counts every "the" as matched, giving a perfect score to a useless candidate. BLEU's modified precision clips each word by its maximum count in any single reference ("the" occurs at most twice), so the score becomes $2/7$.
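A minimal sketch of modified (clipped) unigram precision for this example, assuming whitespace tokenization; this is only the unigram building block of BLEU, not the full metric with higher-order n-grams and the brevity penalty.

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word's count by its max count in any single reference."""
    cand_counts = Counter(candidate.split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision(candidate, references))  # 2/7 ≈ 0.286
```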

Quantitative comparison

Translation

A Neural Attention Model for Abstractive Sentence Summarization

Alexander M. Rush, Sumit Chopra, Jason Weston (Facebook AI Research)

EMNLP 2015

Summarization

To shorten the sentence while keeping relevant or important information.

But... operationally, what is so different from translation?

What is summarization?

We usually execute a series of operations to summarize a sentence or article.

These operations are Deletion (刪除), Generalization (廣義化), and Paraphrase (改寫).

Types of summarization

Compressive

Summarizes the original sentence by deletion only

Extractive

Summarizes the original sentence by deletion and reordering

Abstractive

Summarizes the original sentence by arbitrary transformations

Proposed models

Bag-of-Words Encoder

$$ \begin{align} enc_1(\mathbf{x}, \mathbf{y}_c) &= \mathbf{p}^T \tilde{\mathbf{x}} \\ \mathbf{p} &= [1/M, \cdots, 1/M] \\ \tilde{\mathbf{x}} &= F [x_1, \cdots , x_M] \end{align} $$

It ignores the original word order and relationships between neighboring words.

Convolutional Encoder

$$ enc_2(\mathbf{x}, \mathbf{y}_c) = (\text{temporal convolution layer} \rightarrow \text{max pooling layer})^L $$

It allows local interactions between words while not requiring the context $\mathbf{y}_c$ when encoding the input.

Proposed models

Attention-Based Encoder

$$ \Large \begin{align} enc_3(\mathbf{x}, \mathbf{y}_c) &= \mathbf{p}^T \bar{\mathbf{x}} \\ \mathbf{p} &\propto exp(\tilde{\mathbf{x}} P \mathbf{y}_c') \\ \tilde{\mathbf{x}} &= F [x_1, \cdots , x_M] \\ \mathbf{y}_c' &= G [y_{i-C+1}, \cdots, y_i] \\ \bar{x}_i &= \sum_{q=i-Q}^{i+Q} \tilde{\mathbf{x}}_q / Q \end{align} $$
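A minimal NumPy sketch of this attention-based encoder under toy dimensions; the shapes and variable names are assumptions for illustration, and the embeddings $F$, $G$ and the matrix $P$ are randomly initialized here rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

M, C, Q = 6, 3, 1          # input length, context window, smoothing window
V, H = 20, 8               # vocabulary size, embedding size

x_ids = rng.integers(V, size=M)    # input word ids x_1..x_M
y_ids = rng.integers(V, size=C)    # context word ids y_{i-C+1}..y_i

F = rng.normal(size=(V, H))        # input-side embedding F
G = rng.normal(size=(V, H))        # context-side embedding G
P = rng.normal(size=(H, C * H))    # attention weight matrix P

x_tilde = F[x_ids]                 # (M, H) embedded input
y_ctx = G[y_ids].reshape(-1)       # (C*H,) embedded context, concatenated

# p ∝ exp(x̃ P y'_c): softmax attention over the M input positions.
scores = x_tilde @ P @ y_ctx       # (M,)
p = np.exp(scores - scores.max())
p /= p.sum()

# x̄_i: local smoothing of the input embeddings over a window around position i
# (windows are truncated at the sequence boundaries in this sketch).
x_bar = np.stack([
    x_tilde[max(0, i - Q): i + Q + 1].sum(axis=0) / Q for i in range(M)
])

enc3 = p @ x_bar                   # (H,) encoder output p^T x̄
print(enc3.shape)
```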

Neural network language model (NNLM)

Perplexity 困惑度

A method to evaluate a language model. A language model describes a probability distribution over whole sentences.

Perplexity of discrete probability distribution

$$ \LARGE 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)} $$

Language model is a probability distribution

If each word in a sentence is well determined, the meaning of the sentence is clear. We evaluate the occurrence probability of words in a sentence: lower entropy means a more precise meaning, so a smaller perplexity is better.
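A minimal sketch of perplexity as defined above, assuming a toy discrete distribution; the probabilities are made up for illustration.

```python
import numpy as np

def perplexity(probs):
    """2 ** H(p) for a discrete distribution given as an array of probabilities."""
    probs = np.asarray(probs)
    entropy = -np.sum(probs * np.log2(probs))
    return 2.0 ** entropy

# A peaked (low-entropy) distribution has low perplexity...
print(perplexity([0.7, 0.1, 0.1, 0.1]))      # ≈ 2.56
# ...while a uniform distribution over 4 outcomes has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```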

Evaluation

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

A method to evaluate machine translation and automatic summarization. It compares a generated result with a reference (usually written by a human) and calculates the similarity between them.

$$ \Large \text{ROUGE-N} = \frac{\sum_{S \in \text{references}} \sum_{gram_n \in S} count_{match}(gram_n)}{\sum_{S \in \text{references}} \sum_{gram_n \in S} count(gram_n)} $$
  • $count_{match}(gram_n)$: maximum number of n-grams co-occurring in the candidate and the reference (see the worked example and sketch below).

ROUGE-N

  • Candidate: the cat was found under the bed
  • Reference: the cat was under the bed

1-gram

  • Candidate: the, cat, was, found, under, bed
  • Reference: the, cat, was, under, bed
  • ROUGE-1: $5/5 = 1.0$

2-gram

  • Candidate: the cat, cat was, was found, found under, under the, the bed
  • Reference: the cat, cat was, was under, under the, the bed
  • ROUGE-2: $4/5 = 0.8$
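A minimal sketch of ROUGE-N recall for a single reference, assuming whitespace tokenization; it reproduces the two results above.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    """ROUGE-N recall: matched n-grams over the total n-grams in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())

candidate = "the cat was found under the bed"
reference = "the cat was under the bed"
print(rouge_n(candidate, reference, 1))  # 1.0
print(rouge_n(candidate, reference, 2))  # 0.8
```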

Evaluation

Summarization

Thank you for your attention.

References

Papers