Neural Ordinary Differential Equations

杜岳華

2019.3.31

Outline

  • Introduction
    • Core concept - turning discrete layers into a continuous layer
    • Neural network is an approximator for derivatives
    • Solve for next state
    • Backpropagation
    • Adjoint method
  • Applications
    • Replace ResNet with ODEs
    • Continuous normalizing flow
    • Generative latent time-series model
  • Backpropagation through ODE solver
  • Features and limitations
  • Extensions
    • Why ODE?
    • Why efficiency?
    • Continuous brings topology

Introduction

NeurIPS 2018 Best Papers

Instead of a discrete sequence of layers, parameterize the continuous dynamics of the hidden state

Some results

Core concept - turning discrete layers into a continuous layer

Discrete version

$$ \frac{dy}{dt} \approx \frac{y(t + \Delta) - y(t)}{\Delta} $$

Continuous version

$$ \frac{dy}{dt} = \lim_{\Delta \rightarrow 0} \frac{y(t + \Delta) - y(t)}{\Delta} $$

Rearranging gives the (forward Euler) update

$$ y(t + \Delta) = y(t) + \Delta \frac{dy}{dt} $$

Core concept - turning discrete layers into a continuous layer

Feed-forward network

$$ h_{t+1} = f(h_t, \theta) $$

ResNet with skip connection

$$ h_{t+1} = h_t + f(h_t, \theta) $$

Doesn't this look familiar?

$$ y(t + \Delta) = y(t) + \Delta \frac{dy}{dt} $$

Core concept - turning discrete layers into a continuous layer

ResNet with skip connection

$$ h_{t+1} = h_t + f(h_t, \theta) $$

Substituting $\Delta = 1$:

$$ y(t+1) = y(t) + \frac{dy}{dt} $$

Neural network is an approximator for derivatives

$$ \frac{dy}{dt} = f(h_t, \theta) $$

The neural network layer $f$ can therefore be used to compute the derivative $\frac{dy}{dt}$!

$$ y(t + \Delta) = y(t) + \Delta \frac{dy}{dt} \\\\ \downarrow \\\\ y(t + \Delta) = y(t) + \Delta f(t, h(t), \theta_t) $$
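To make the correspondence concrete, here is a minimal numerical sketch (the toy tanh layer and all names are illustrative assumptions, not from the slides): with $\Delta = 1$, one forward-Euler step is exactly a ResNet block update.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))   # toy weights, illustrative only
b = np.zeros(4)

def f(h):
    """A small neural layer standing in for dy/dt."""
    return np.tanh(W @ h + b)

def euler_step(h, delta):
    """y(t + delta) = y(t) + delta * dy/dt, with f approximating dy/dt."""
    return h + delta * f(h)

h = rng.normal(size=4)
print(euler_step(h, delta=1.0))   # one Euler step with delta = 1 ...
print(h + f(h))                   # ... is exactly the ResNet update h + f(h)
```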


Solve for next state

Layer for approximation

$$ \frac{dh(t)}{dt} = f(h(t), t, \theta) $$

Integral for forwarding

$$ h(t) = \int f(h(t), t, \theta) dt $$


Solve for next state

From $h(t_0)$ to $h(t_1)$

$$ h(t_1) = h(t_0) + \int_{t_0}^{t_1} f(h(t), t, \theta) \, dt $$

Solving with ODE solver

$$ h(t_1) = ODESolve(h(t_0), t_0, t_1, \theta, f) $$
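A sketch of the forward pass, using scipy.integrate.solve_ivp as a stand-in for ODESolve; the two-unit tanh dynamics and the integration interval are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 2))   # illustrative parameters theta
b = np.zeros(2)

def f(t, h):
    """dh(t)/dt = f(h(t), t, theta): a tiny neural layer as the dynamics."""
    return np.tanh(W @ h + b)

h0 = np.array([1.0, -1.0])   # h(t0)
t0, t1 = 0.0, 1.0

# ODESolve(h(t0), t0, t1, theta, f): any off-the-shelf ODE solver will do.
sol = solve_ivp(f, (t0, t1), h0, rtol=1e-6, atol=1e-8)
h1 = sol.y[:, -1]            # h(t1)
print(h1)
```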

Backpropagation

Optimization

$$ \mathcal{L}(h(t_1)) = \mathcal{L}(ODESolve(h(t_0), t_0, t_1, \theta, f)) $$

Requires gradients with respect to the hidden state (and, through it, the parameters)

$$ \frac{\partial \mathcal{L}}{\partial h(t)} $$

Adjoint method

Adjoint state

$$ a(t) = \frac{\partial \mathcal{L}}{\partial h(t)} $$

How to find $a(t)$: it follows its own ODE

$$ \frac{da(t)}{dt} = -a(t)^T \frac{\partial f(h(t), t, \theta)}{\partial h} $$

Adjoint method

Solve for $a(t)$ by integrating backwards from $t_1$ to $t_0$

$$ a(t_0) = a(t_1) + \int_{t_1}^{t_0} -a(t)^T \frac{\partial f(h(t), t, \theta)}{\partial h(t)} \, dt $$
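As a small worked case (not from the slides): for linear dynamics $f(h) = W h$ the adjoint ODE can be solved in closed form, which makes the backward flow of the gradient explicit.

$$ \frac{da(t)}{dt} = -a(t)^T W \quad \Longrightarrow \quad a(t) = e^{W^T (t_1 - t)} \, a(t_1) $$

So $a(t_0) = \frac{\partial \mathcal{L}}{\partial h(t_0)}$ is obtained by propagating $\frac{\partial \mathcal{L}}{\partial h(t_1)}$ backwards through a matrix exponential, the continuous analogue of repeatedly multiplying by $W^T$ in ordinary backpropagation.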

Augmented state

$\frac{d \theta}{dt} = 0$

$\frac{dt}{dt} = 1$

For computational efficiency, let $$\begin{bmatrix} h \\ \theta \\ t \end{bmatrix}$$ be an augmented state

Augmented state function

$$ f_{aug}(\begin{bmatrix} h \\ \theta \\ t \end{bmatrix}) = \begin{bmatrix} f(h(t), t, \theta) \\ 0 \\ 1 \end{bmatrix} $$

Augmented state dynamics

$$ \frac{d}{dt} \begin{bmatrix} h \\ \theta \\ t \end{bmatrix} = f_{aug}( \begin{bmatrix} h \\ \theta \\ t \end{bmatrix}) $$

Augmented adjoint state

$$ a_{aug} = \begin{bmatrix} a \\ a_{\theta} \\ a_t \end{bmatrix} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial h} \\ \frac{\partial \mathcal{L}}{\partial \theta} \\ \frac{\partial \mathcal{L}}{\partial t} \end{bmatrix} $$

Augmented adjoint state dynamics

$$ \frac{d a_{aug}}{dt} = - \begin{bmatrix} a^T \frac{\partial f}{\partial h} \\ a^T \frac{\partial f}{\partial \theta} \\ a^T \frac{\partial f}{\partial t} \end{bmatrix} $$
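A minimal sketch of the backward pass over the augmented state, assuming a single tanh layer $f(h) = \tanh(Wh + b)$ so that $\partial f / \partial h$ and $\partial f / \partial \theta$ can be written by hand, and a fixed-step Euler integrator in place of a real ODE solver; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.normal(scale=0.5, size=(n, n))   # illustrative parameters theta = (W, b)
b = np.zeros(n)

def adjoint_backward(h1, dL_dh1, t0, t1, steps=1000):
    """Integrate [h, a, dL/dW, dL/db] backwards from t1 to t0 with fixed-step Euler,
    for the toy dynamics f(h) = tanh(W h + b)."""
    dt = (t1 - t0) / steps
    h, a = h1.copy(), dL_dh1.copy()
    dL_dW, dL_db = np.zeros_like(W), np.zeros_like(b)
    for _ in range(steps):
        z = W @ h + b
        fh = np.tanh(z)                    # f(h) = dh/dt
        s = 1.0 - fh ** 2                  # sech^2(z), elementwise
        aJ = (a * s) @ W                   # a^T df/dh, with df/dh = diag(s) @ W
        # dL/dtheta = -int_{t1}^{t0} a^T df/dtheta dt, accumulated step by step
        dL_dW += dt * np.outer(a * s, h)   # a^T df/dW
        dL_db += dt * (a * s)              # a^T df/db
        # step the state and the adjoint backwards in time
        h = h - dt * fh                    # dh/dt = f(h)
        a = a + dt * aJ                    # da/dt = -a^T df/dh
    return a, dL_dW, dL_db                 # a approximates dL/dh(t0)

h1 = rng.normal(size=n)       # final state from a forward solve (made up here)
dL_dh1 = rng.normal(size=n)   # gradient of the loss at h(t1) (made up here)
dL_dh0, dL_dW, dL_db = adjoint_backward(h1, dL_dh1, t0=0.0, t1=1.0)
```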

Replace ResNet with ODEs

  • RK-Net: backpropagates directly through a Runge-Kutta integrator
  • ODE-Net: uses ODESolve with the adjoint method (see the sketch after this list)
  • Experiments run on TensorFlow with a GPU
  • $\tilde{L}$: the number of function evaluations
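Purely as an illustration (an assumption, not the setup from the slides or the paper), an ODE block that replaces a stack of residual blocks can be sketched in PyTorch with the torchdiffeq package:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint   # backprop via the adjoint method

class ODEFunc(nn.Module):
    """f(h(t), t, theta): the dynamics, parameterized by a small MLP."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):
        return self.net(h)

class ODEBlock(nn.Module):
    """Replaces a stack of residual blocks with one ODE solve from t = 0 to t = 1."""
    def __init__(self, dim):
        super().__init__()
        self.func = ODEFunc(dim)
        self.t = torch.tensor([0.0, 1.0])

    def forward(self, h0):
        # odeint returns the solution at every requested time; keep the last one.
        return odeint(self.func, h0, self.t, rtol=1e-3, atol=1e-3)[-1]

block = ODEBlock(dim=16)
h0 = torch.randn(8, 16)   # a minibatch of hidden states
h1 = block(h0)            # h(t1), differentiable end to end
```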

Continuous normalizing flow


Generative latent time-series model


Backpropagation through ODE solver

Features and limitations

Minibatch

Uniqueness

Picard's existence theorem

The solution of an initial value problem exists and is unique if the differential equation is uniformly Lipschitz continuous in $h$ and continuous in $t$.

Reversibility

The adjoint method is not exactly reversible: reconstructing the trajectory by running the dynamics backwards introduces numerical error.

Why ODE?

Efficiency

Borrow concepts and interpretations from science

Why efficiency?

The continuity assumption guarantees convergence

$$ \int f^2 dt $$

converges

Continuous brings topology

Layer is a discrete mapping

$$ h_{t+1} = \sigma(Wh_{t} + b) $$

Continuous brings topology

Layer becomes a continuous function of depth $t$

$$ \frac{dh(t)}{dt} = \sigma(W h(t) + b) $$
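A small sketch of what continuity buys, reusing the toy tanh dynamics from earlier (illustrative, not from the slides): with a solver's dense output, the hidden state becomes a function that can be evaluated at any real-valued depth $t$, not only at integer layer indices.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 2))   # illustrative weights
b = np.zeros(2)

def f(t, h):
    """dh(t)/dt = sigma(W h(t) + b), with sigma = tanh."""
    return np.tanh(W @ h + b)

h0 = np.array([1.0, -1.0])
sol = solve_ivp(f, (0.0, 1.0), h0, dense_output=True)

# h is now a continuous function of depth: evaluate it anywhere in [0, 1].
for t in (0.25, 0.5, 0.99):
    print(t, sol.sol(t))
```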

Thank you for your attention