Diffusion Models
Many recent papers and blog posts [1-4] have provided good overviews of diffusion models. In this note, I summarize some high-level intuitions about diffusion models and write down some key steps in the derivations.
Summary
- DDPM is parameterized as a Markov chain, which means that each latent variable depends only on the previous (or following) timestep. However, DDIM shows that this Markov assumption is in fact not necessary.
- The forward process of DDPM is a chain of additive Gaussian noise channels with pre-defined noise variances, i.e., $q(X_t|X_{t-1})$ is known. Our goal is to learn a reverse process $p_{\theta}(X_{t-1}|X_{t})$ that matches the tractable posterior $q(X_{t-1}|X_{t},X_0)$.
- With sufficiently large $T$, the distribution of $X_T$ is asymptotically an isotropic Gaussian.
- Diffusion models can be viewed as hierarchical VAEs, where each layer in the encoder is a Gaussian noise channel.
- Score-based diffusion models are a general form of discrete-timestep diffusion models.
- The training objective of diffusion models can be derived from a variational lower bound on the log-likelihood of the data, and each term in the objective can be cast as a KL divergence between Gaussians, which has a closed form.
Road Map
A Variational Diffusion Model (VDM) can be viewed as a Markovian Hierarchical Variational Autoencoder with the following restrictions [5]:
- The latent dimension is the same as the data dimension;
- The encoding process is a chain of additive Gaussian noise channels without any learnable parameters;
- The distribution of the last latent variable is an isotropic Gaussian.
(Figure source: Understanding Diffusion Models: A Unified Perspective [5].)
Here we use $q$ to denote the pre-defined encoder and $p$ to denote the unknown decoder.
1. DDPM
Denoising Diffusion Probabilistic Model (DDPM) is a diffusion model with discrete timesteps. DDPM makes the following assumptions.
Assumptions:
- Markov chain property: $$q(X_{1:T}|X_0)=\prod_{t=1}^T q(X_t|X_{t-1})$$
- Additive Gaussian noise channels:
$$q(X_t | X_{t-1})=\mathcal{N}(X_t; \sqrt{\alpha_t} X_{t-1},(1-\alpha_t) \mathbf{I})$$
or
$$X_t = \sqrt{\alpha_t} X_{t-1}+\sqrt{1-\alpha_t}Z, Z\sim \mathcal{N}(\mathbf{0},\mathbf{I})$$
- Decaying signal coefficients (equivalently, increasing noise variance $1-\alpha_t$): $$1>\alpha_1>\alpha_2>\cdots>\alpha_T>0$$
With the above assumptions, we can derive the following properties:
- $X_t$ can be obtained from $X_0$ directly (see the code sketch after this list):
$$q(X_t|X_0)=\mathcal{N}(X_t; \sqrt{\bar{\alpha}_t}X_0, (1-\bar{\alpha}_t)\mathbf{I}), \quad \bar{\alpha}_t:=\prod_{s=1}^t\alpha_s$$ or $$X_t = \sqrt{\bar{\alpha}_t} X_0 + \sqrt{1-\bar{\alpha}_t}Z$$
- $q(X_{t-1}|X_t,X_0)$ is tractable: $$q(X_{t-1}|X_t,X_0) = \mathcal{N}\Big(X_{t-1} ; \underbrace{\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) X_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) X_0}{1-\bar{\alpha}_t}}_{\mu_q(X_t, X_0)}, \underbrace{\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{I}}_{\boldsymbol{\Sigma}_q(t)}\Big)$$
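To make these two properties concrete, here is a minimal PyTorch sketch of sampling $X_t$ directly from $X_0$ and of computing the posterior mean $\mu_q(X_t,X_0)$. The linear $\beta_t$ schedule and the value of $T$ are illustrative assumptions, not part of the derivation.

```python
import torch

T = 1000
# Illustrative linear schedule: beta_t are the per-step noise variances, alpha_t = 1 - beta_t.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)           # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    """Draw X_t ~ q(X_t | X_0) = N(sqrt(abar_t) * X_0, (1 - abar_t) * I)."""
    z = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * z

def posterior_mean(xt, x0, t):
    """mu_q(X_t, X_0) of the tractable posterior q(X_{t-1} | X_t, X_0), for t >= 1."""
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]
    return (alphas[t].sqrt() * (1.0 - abar_prev) * xt
            + abar_prev.sqrt() * (1.0 - alphas[t]) * x0) / (1.0 - abar_t)
```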
By deriving the variational lower bound of $\log p(X_0)$, we can express the ELBO as a sum of three terms:
$$ \begin{align} \log p(X_0) &\geq E_{q(X_1|X_0)}[\log p_{\theta}(X_0|X_1)] \text{ (reconstruction term)} \\ &- D\big(q(X_T|X_0)\,\|\,p(X_T)\big) \text{ (prior matching term)}\\ &-\sum_{t=2}^{T} E_{q(X_t|X_0)}\big[D\big(q(X_{t-1}|X_{t},X_0)\,\|\,p_{\theta}(X_{t-1}|X_{t})\big)\big] \text{ (denoising matching term)} \end{align} $$
We can model $p_{\theta}(X_{t-1}|X_t)$ as
$$ p_{\theta}(X_{t-1}|X_t) = \mathcal{N}(X_{t-1}; \mu_{\theta}(X_t,t), \Sigma_q(t)) $$
so that each denoising matching term becomes a KL divergence between two Gaussians, which has a closed form. With different choices of $\mu_{\theta}(X_t,t)$, we can obtain three equivalent objectives to optimize a VDM:
- predict the noise that was used to generate $X_t$ (a short training-loss sketch for this option follows below);
- predict $X_0$ given $X_t$ and $t$;
- predict the score $\nabla_{X_t}\log p(X_t)$ given $X_t$ and $t$.
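For the first option, the commonly used simplified objective regresses a network onto the noise. Below is a hedged PyTorch sketch; the network interface `model(x_t, t)`, the uniform sampling of $t$, and the precomputed tensor `alpha_bars` (as in the earlier sketch) are assumptions for illustration.

```python
import torch

def ddpm_loss(model, x0, alpha_bars):
    """Simplified noise-prediction objective: || eps - eps_theta(X_t, t) ||^2."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)
    abar = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))  # broadcast over data dims
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps                   # X_t ~ q(X_t | X_0)
    return torch.mean((eps - model(x_t, t)) ** 2)
```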
Score Matching
For every $x$, the gradient of its log-likelihood with respect to $x$ describes which direction in data space to move in order to increase its likelihood. The score matching objective is:
$$ \begin{align} &\frac12 E_{p_{\text{data}}}\left[\|s_{\theta}(X)-\nabla_X \log p_{\text{data}}(X)\|_2^2\right]\\ = \ &E_{p_{\text{data}}}\left[\operatorname{tr}\left(\nabla_X s_{\theta}(X)\right)+\frac12 \|s_{\theta}(X)\|_2^2\right] + \text{const.} \end{align} $$
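As a toy illustration of the trace form, here is a naive PyTorch sketch that evaluates the surrogate objective with exact per-sample Jacobians; `score_net` is a hypothetical network mapping a $d$-dimensional input to its estimated score, and the explicit loop is only practical for small $d$.

```python
import torch

def explicit_sm_loss(score_net, x):
    """Trace-based surrogate E[ tr(grad_x s_theta(x)) + 0.5 * ||s_theta(x)||^2 ]
    on a batch x of shape (n, d); equals the original objective up to a constant."""
    total = x.new_zeros(())
    for xi in x:                                                     # xi: (d,)
        s = score_net(xi)                                            # (d,)
        jac = torch.autograd.functional.jacobian(score_net, xi,
                                                 create_graph=True)  # (d, d)
        total = total + torch.trace(jac) + 0.5 * (s ** 2).sum()
    return total / x.shape[0]
```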
Denoising Score Matching
The score function is ill-defined when $x$ lies on a low-dimensional manifold in a high-dimensional space. Moreover, the score estimated via vanilla score matching is inaccurate in low-density regions. To address these issues, we can introduce multiple levels of Gaussian noise to the data and then apply score matching to the perturbed data. The denoising score matching objective is:
$$ \begin{align} \frac12 E_{q_{\sigma}(\tilde{X}|X)p_{\text{data}}(X)}\left[\|s_{\theta}(\tilde{X})-\nabla_{\tilde{X}} \log q_{\sigma}(\tilde{X}|X)\|_2^2\right] \end{align} $$
where the noise levels $\{\sigma_t\}_{t=1}^T$ form a positive geometric sequence: $\frac{\sigma_1}{\sigma_2}=\frac{\sigma_2}{\sigma_3}=\cdots=\frac{\sigma_{T-1}}{\sigma_T}>1$. We define
$$q_{\sigma}(\tilde{X}|X)=\mathcal{N}(\tilde{X};X,\sigma^2 \mathbf{I})$$
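Since $\nabla_{\tilde{X}} \log q_{\sigma}(\tilde{X}|X) = -\frac{\tilde{X}-X}{\sigma^2}$, the objective reduces to regressing the score network onto this analytic target. A minimal single-noise-level PyTorch sketch follows; in practice the network is also conditioned on $\sigma$, which I omit here for brevity.

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching at one noise level sigma.
    Target score of q_sigma(x_tilde | x) = N(x, sigma^2 I) is -(x_tilde - x) / sigma^2."""
    noise = sigma * torch.randn_like(x)
    x_tilde = x + noise
    target = -noise / sigma ** 2
    return 0.5 * torch.mean((score_net(x_tilde) - target) ** 2)
```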
Langevin Dynamics
Given only the score function, Langevin dynamics can generate samples from $p(X)$ by iterating
$$\tilde{X}_t=\tilde{X}_{t-1}+\frac{\alpha_t}{2}\nabla_{X} \log p(\tilde{X}_{t-1})+\sqrt{\alpha_t}Z_t$$
where, in annealed Langevin dynamics, the step size is set to $\alpha_t=\epsilon\cdot\frac{\sigma_t^2}{\sigma_T^2}$ for a small constant $\epsilon>0$, and $Z_t\sim \mathcal{N}(\mathbf{0},\mathbf{I})$.
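A sketch of annealed Langevin sampling under these definitions; the noise-conditional interface `score_net(x, sigma)`, the base step size `eps`, and the number of updates per noise level are illustrative assumptions.

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, x, sigmas, eps=2e-5, steps_per_level=100):
    """Run Langevin updates at each noise level in `sigmas` (a decreasing sequence),
    with step size alpha_t = eps * sigma_t^2 / sigma_T^2."""
    for sigma in sigmas:
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + alpha ** 0.5 * z
    return x
```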
References
[1] What are diffusion models?
[2] Introduction to Diffusion Models for Machine Learning
[3] Tutorial on Denoising Diffusion-based Generative Modeling: Foundations and Applications
[4] Generative Modeling by Estimating Gradients of the Data Distribution
[5] Understanding Diffusion Models: A Unified Perspective