Attention mechanism

杜岳華

2019.3.30

About me

  • Founder of the Julia Taiwan community
  • Regular member and lecturer in the AI Tech community
  • Lecturer, "Machine Learning Theory and Practice", ITRI (Industrial Technology Research Institute)
  • Author of 《Julia 程式設計》 (Julia Programming)


  • Specialties: systems biology, computational biology, machine learning
  • Master's thesis: Identification of cell state using super-enhancer RNA


  • M.S., Institute of Biomedical Informatics, National Yang-Ming University
  • B.S., National Cheng Kung University, double major in Medical Laboratory Science and Biotechnology and in Computer Science and Information Engineering

Outline

  • Problems with RNNs
  • The Seq2Seq encoder-decoder architecture
  • Problems the attention model solves
  • Attention types
  • Applications of attention
    • Translation
    • Summarization
    • Image caption
  • Transformer

Problems with RNNs

The Seq2Seq encoder-decoder architecture

Problems the attention model solves

How to solve the problem?

Attention mechanism

Attention types

  • Global/local attention
  • Hard/soft attention
  • Self-attention

Global/local attention

Hard/soft attention

Soft attention

  • Alignment weights are learned so that the model attends over all of the input
  • $0 \le w \le 1$
  • Pro: the model is smooth and differentiable, so it can be trained end-to-end with backpropagation
  • Con: computation becomes expensive when the input is large (see the sketch below)
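
A minimal soft-attention sketch in NumPy, assuming dot-product alignment scores; the names (`soft_attention`, `keys`, `values`) are illustrative, not from the slides:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_attention(query, keys, values):
    """Attend over every position with differentiable weights in [0, 1]."""
    scores = keys @ query              # one alignment score per position
    weights = softmax(scores)          # 0 <= w <= 1, sums to 1
    context = weights @ values         # weighted sum over all positions
    return context, weights

keys = np.random.randn(5, 8)           # 5 positions, dimension 8
values = np.random.randn(5, 8)
query = np.random.randn(8)
context, weights = soft_attention(query, keys, values)
```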

Hard attention

  • Selects only part of the input to attend to at each step
  • weights are 0 or 1
  • Pro: less computation at inference time
  • Con: the model is non-differentiable, so it is usually trained with sampling-based estimators such as REINFORCE (see the sketch below)
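
A matching hard-attention sketch under the same toy setup: instead of a weighted sum, a single position is sampled and used exclusively, which is exactly what makes the model non-differentiable:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def hard_attention(query, keys, values, rng=np.random.default_rng(0)):
    """Pick exactly one position (weight 0 or 1); the discrete choice breaks gradients."""
    scores = keys @ query
    probs = softmax(scores)               # distribution over positions
    i = rng.choice(len(values), p=probs)  # stochastic selection of a single position
    return values[i], i                   # attend to one value only
```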

Self-attention

Alignment (compatibility function)

query: $q_{j}$, key: $k_i$

Location-based

$$ \alpha^i_j = \mathrm{softmax}(W_{\alpha} q_j) $$

Content-based

$$ \mathrm{score}(q_j, k_i) = \mathrm{cosine}(q_j, k_i) $$

i.e. the cosine similarity between the query and the key.

Additive

$$ \mathrm{score}(q_j, k_i) = v_{\alpha}^T \tanh(W_{\alpha} [q_j; k_i]) $$

General

$$ \mathrm{score}(q_j, k_i) = q_j^T W_{\alpha} k_i $$

Dot-product

$$ \mathrm{score}(q_j, k_i) = q_j^T k_i $$

Scaled dot-product

$$ \mathrm{score}(q_j, k_i) = \frac{q_j^T k_i}{\sqrt{d_k}} $$

where $d_k$ is the dimension of the keys.
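
The score functions above translate almost directly into code. A minimal NumPy sketch, with `W_a`, `v_a`, and `W_g` as stand-ins for parameters that would be learned (random initialization here is only a placeholder):

```python
import numpy as np

d = 8
W_a = np.random.randn(d, 2 * d)   # additive: acts on the concatenation [q; k]
v_a = np.random.randn(d)
W_g = np.random.randn(d, d)       # general: bilinear form between q and k

def score_additive(q, k):
    return v_a @ np.tanh(W_a @ np.concatenate([q, k]))

def score_general(q, k):
    return q @ W_g @ k

def score_dot(q, k):
    return q @ k

def score_scaled_dot(q, k):
    return (q @ k) / np.sqrt(len(k))   # scale by the square root of the key dimension
```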

Scaled dot-product attention
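
In matrix form this is the attention used in the Transformer, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n_queries, d_v)

Q = np.random.randn(4, 8)     # 4 queries of dimension 8
K = np.random.randn(6, 8)     # 6 keys of dimension 8
V = np.random.randn(6, 16)    # 6 values of dimension 16
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 16)
```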

Applications of attention

Translation

Summarization

Image caption

Transformer

The era of the Transformer

Why self-attention?

  • At least as efficient as recurrent or convolutional layers: a constant number of sequential operations per layer and shorter paths between any two positions, so it parallelizes far better (see the sketch below)
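
A minimal sketch of single-head self-attention, assuming queries, keys, and values are all linear projections of the same sequence $X$ (the projection matrices `W_q`, `W_k`, `W_v` would be learned; random values here are placeholders). Every position attends to every other position in one matrix product, with no sequential recurrence to unroll:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 6, 8                                  # sequence length, model dimension
X = np.random.randn(n, d)                    # the same sequence supplies Q, K, and V
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d))      # (n, n): each position attends to all positions
out = weights @ V                            # (n, d), computed in parallel, unlike an RNN
```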

Integration of RNN/CNN into Transformer

Thank you for your attention.

References

Papers