Understanding Diffusion Model Through First Principles of Probability
Published:
1. Outline
2. Problem formulation
2.1. Notation
Let’s use notation from the paper:
- \(\mathbf{x}_0\) is the data sample, which is the initial state.
- The sequence \(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T\) denotes the states through which \(\mathbf{x}_0\) transitions through over time steps \(1, 2, \cdots, T\).
- All of these are random variables and hence the forward and reverse transitions become random processes.
- \(q(.)\) is the probability density function for the forward process.
- \(p_{\mathbf{\theta}}(.)\) is the probability density function for the reverse process. Will talk about \(\mathbf{\theta}\) later.
- Abbreviations: PDF - probability density function.
2.2. Forward (Diffusion) process
- \(\mathbf{x}_0 \rightarrow \mathbf{x}_1 \cdots \rightarrow \mathbf{x}_T\).
- We can see this as the encoder, where data \(\mathbf{x}_0\) is encoded into \(T\) latent variables \(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T\).
- Here, \(\mathbf{x}_0\) is transitioned to \(\mathbf{x}_T\), by gradual addition of noise over \(T\) steps in a Markovian fashion.
- Markovian property: Given the current state \(\mathbf{x}_{t}\), the future state \(\mathbf{x}_{t+1}\) is independent of the past states.
We define Gaussian formulation for the forward state-transition PDF (\(\mathbf{x}_{t-1} \rightarrow \mathbf{x}_t\)): 
 \(\begin{align} q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I}) \tag{2.1} \end{align}\) 
 Here \(\beta_t\) is a constant and you can see that the mean is dependent only on the preceding state \(\mathbf{x}_{t-1}\), which makes it Markovian.
Using reparameterization trick, we can write the formulation as: \(\begin{align} \text{Given } \mathbf{x}_{t-1}, \quad \mathbf{x}_{t} = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\mathbf{\epsilon}, \quad \mathbf{\epsilon}\sim \mathcal{N}(0, \mathbf{I}) \tag{2.2} \end{align}\) 
 This highlights the addition of noise \(\mathbf{\epsilon}\) at every step.
What advantage does the Gaussian distribution offer? 
 We can get closed form PDFs for transitions over multiple steps, that is \(q(\mathbf{x}_{t_2}|\mathbf{x}_{t_1})\), where \(t_1\) and \(t_2\) are not consecutive.
Especially, we are interested in \(q(\mathbf{x}_{t}|\mathbf{x}_{0})\) for any \(1 \le t < T\). Follow this derivation… 
 \(\begin{align*} &\text{Eqn. $2.2$} \implies \mathbf{x}_{t} = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\mathbf{\epsilon} \tag{} \\ &\mathbf{x}_{t} = \sqrt{1-\beta_t}(\sqrt{1-\beta_{t-1}}\mathbf{x}_{t-2} + \sqrt{\beta_{t-1}}\mathbf{\epsilon}) + \sqrt{\beta_t}\mathbf{\epsilon} \\ \\ & \cdots \quad \text{keep substituting } \mathbf{x}_{t-2}, \mathbf{x}_{t-3}, \cdots, \mathbf{x}_{1} \text{ using eqn. $2.2$} \\ \\ &\mathbf{x}_{t} = \left(\sqrt{\prod_{s=1}^{t}{(1-\beta_s)}}\right) \mathbf{x}_0 + \left(\sqrt{1 - \prod_{s=1}^{t}{(1-\beta_s)}}\right)\mathbf{\epsilon} \tag{2.3} \end{align*}\) 
 
 Using the notation \(\alpha_t := 1-\beta_t\) and \(\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s\), eqn. \(2.3\) becomes 
 \(\begin{align} &\mathbf{x}_{t} = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\mathbf{\epsilon}, \quad \mathbf{\epsilon }\sim \mathcal{N}(0, \mathbf{I}) \\ \implies &\boxed{q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) \mathbf{I})} \tag{2.4} \end{align}\) 
 
 Note: We want the final state \(\mathbf{x}_T\) to be a zero-mean, identity-covariance Gaussian noise sample, i.e., \(q(\mathbf{x}_T|\mathbf{x}_0) = \mathcal{N}(0, \mathbf{I})\). This means, for large \(T\) (\(T \to \infty\)), we need \(\bar{\alpha}_t\) to approach zero, which can be achieved by maintaining \(0 < \beta_t < 1\).
- We are taking a data sample, such as an image, from an unknown distribution, and are gradually adding noise to transform it into a sample belonging to a known distribution, which is the Gaussian distribution.
- This diffusion process is not learned and is fixed, i.e., the governing parameters \(\beta_t\)'s are set using heuristics or hyperparameters.
2.3. Reverse (Generative) process
- \(\mathbf{x}_T \rightarrow \mathbf{x}_{T-1} \cdots \rightarrow \mathbf{x}_0\).
- We can see this as the decoder, where the data \(\mathbf{x}_0\) is recovered from a Gaussian noise sample \(\mathbf{x}_T\) by gradually removing noise over \(T\) steps in a Markovian fashion.
Here also, we define Gaussian formulation for the reverse state-transition \((\mathbf{x}_{t} \rightarrow \mathbf{x}_{t-1})\):
\[\boxed{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) := \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t, t), \mathbf{\Sigma}_{\mathbf{\theta}}(\mathbf{x}_t, t))} \tag{2.5}\]Owing to the Markovian property, the mean and variance are functions of the preceding state \(\mathbf{x}_t\) and time \(t\). 
 Observe that the reverse state-transition PDF is governed by the learnable parameters \(\mathbf{\theta}\).
We can also apply the reparameterization trick like in eqn. \(2.4\), but it is not required for our discussion.
- In the reverse process, we gradually remove noise from a Gaussian noise sample to obtain a coherent sample from the original data distribution, such as a detailed image.
- The parameterized model, through training, understands how noise was added during the forward process and then uses this understanding for the denoising.
2.4. Objective
Determine the parameters \(\mathbf{\theta}\) that maximize the likelihood of the data \(p_{\mathbf{\theta}}(\mathbf{x}_0)\).
\[\begin{align*} &\mathbf{\theta}^{\ast} = \arg \max_{\mathbf{\theta}} p_{\mathbf{\theta}}(\mathbf{x}_0) \\ \text{or} \\ &\mathbf{\theta}^{\ast} = \arg \max_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\mathbf{x}_0) \tag{2.7} \end{align*}\]3. Derivation of equations
3.1. Handling log-likelihood
Agenda: Derive the lower bound for log-likelihood.
Denote all the latent variables using one literal, i.e., \(\mathbf{z} := \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T\}\) or \(\mathbf{z} := \mathbf{x}_{1:T}\)
Data variable: \(\mathbf{x}_0\)
We can write log-likelihood as,
\[\begin{align*} \log p_{\boldsymbol{\theta}}(\mathbf{x}_0) &= \log \int p_{\boldsymbol{\theta}}(\mathbf{x}_0, \mathbf{z}) d\mathbf{z} \quad \text{(writing as marginal)}\\ &= \log \int p_{\boldsymbol{\theta}}(\mathbf{x}_0 | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z}) d\mathbf{z} \quad \text{(using chain rule)} \tag{3.1} \end{align*}\]
 Introduce the forward state-transition PDF \(q(\mathbf{z}|\mathbf{x}_0)\) in the above equation.
From Jensen’s inequality, and \(\log\) being a concave function, we have \(\log{(\mathbb{E}[.])} \ge \mathbb{E}[\log{(.)}]\). 
 Taking logarithm inside the expectation in eqn. \(3.2\) to get the following inequality:
Let’s define \(L := -\mathbb{E}_{q(\mathbf{z}\vert \mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 \vert \mathbf{z}) p_{\mathbf{\theta}}(\mathbf{z})}{q(\mathbf{z}\vert\mathbf{x}_0)}}\right]\)
- The above inequality is true for any PDF \(q(\mathbf{z} \vert \mathbf{x}_0)\).
- The right-hand side expression (excluding the negative sign) is commonly referred to as the Variational Lower Bound (VLB) or Evidence Lower Bound (ELBO).
- Maximizing the lower bound will automatically maximize the log-likelihood.
3.2. Handling Variational Lower Bound (\(-L\))
Agenda: Represent the lower bound in terms of forward and reverse state-transition probabilities and derive the \(L_T\), \(L_{t-1}\), \(L_0\) terms.
Let’s begin by substituting \(\mathbf{z} = \mathbf{x}_{1:T}\).
\[\begin{align*} L :&= -\mathbb{E}_{q(\mathbf{z}|\mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{z}) p_{\mathbf{\theta}}(\mathbf{z})}{q(\mathbf{z}|\mathbf{x}_0)}}\right] \quad \text{(from eqn. $3.3$)} \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 , \mathbf{x}_{1:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}}\right] \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 , \mathbf{x}_{1:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}}\right] \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}}\right] \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_T) p_{\mathbf{\theta}}(\mathbf{x}_{T-1} | \mathbf{x}_T) \cdots p_{\mathbf{\theta}}(\mathbf{x}_1 | \mathbf{x}_2) p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1) } {q(\mathbf{x}_1|\mathbf{x}_0) \cdots q(\mathbf{x}_{T-1}|\mathbf{x}_{T-2}) q(\mathbf{x}_{T}|\mathbf{x}_{T-1}) }}\right] \tag{3.4} \end{align*}\]
 Let’s simplify the eqn. \(3.4\) further.
Breaking \(L\) into \(L_T\), \(L_{t-1}\), \(L_0\) terms
We will modify the eqn. \(3.5\) to get \(L_T\), \(L_{t-1}\), \(L_0\) terms that were defined in the paper. 
 In the eqn. \(3.5\), separate out the \(t=1\) term and substitute eqn. \(3.6\) in the \(t>1\) terms. 
 \(\begin{align} L &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1})} } + \overbrace{\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}}}^{t=1} \right] \nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \overbrace{\log{\left(\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \right)}}^{\text{from eqn. $3.6$}} + \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}}\right] \nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} } + \sum_{t>1}^{T} \log{\frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} } + \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}}\right] \nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} } + \log{\prod_{t>1}^{T} \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} } + \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}}\right] \nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} } + \log{\left(\frac{q(\mathbf{x}_1 | \mathbf{x}_0)}{\cancel{q(\mathbf{x}_2 | \mathbf{x}_0)}} \frac{\cancel{q(\mathbf{x}_2 | \mathbf{x}_0)}}{\cancel{q(\mathbf{x}_3 | \mathbf{x}_0)}} \cdots \frac{\cancel{q(\mathbf{x}_{T-1} | \mathbf{x}_0)}}{q(\mathbf{x}_T | \mathbf{x}_0)} \right)} + \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}}\right] \nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} } + \log{\frac{\cancel{q(\mathbf{x}_1 | \mathbf{x}_0)}}{q(\mathbf{x}_T | \mathbf{x}_0)} \frac{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}{\cancel{q(\mathbf{x}_1|\mathbf{x}_0)}} } \right]\nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{p_{\mathbf{\theta}}(\mathbf{x}_T)} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} } + \log{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)} - \log{q(\mathbf{x}_T | \mathbf{x}_0)} \right]\nonumber \\ &= -\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_T)}{q(\mathbf{x}_T | \mathbf{x}_0)}} + \sum_{t>1}^{T} \log{\frac{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} {q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)} } + \log{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)} \right]\nonumber \\ &\text{(Taking the negative sign inside)} \nonumber \\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log{\frac{q(\mathbf{x}_T | \mathbf{x}_0)}{p_{\mathbf{\theta}}(\mathbf{x}_T)}} + \sum_{t>1}^{T} \log{\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)}{p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)} } - \log{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)} \right] \tag{3.7} \end{align}\) 
 Observe that we paired the reverse transition PDFs \(p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t)\) with the corresponding posteriors \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) from the forward process.
By taking in the expectation \(\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}[.]\), we can write the KL-divergence terms as shown below, \(\begin{align} L &= D_{KL}\left(q(\mathbf{x}_T|\mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_T) \right) + \sum_{t>1}^{T} D_{KL}\left(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t) \right) - \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ \log{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}\right] \nonumber \end{align}\)
However, the paper retains the notation \(\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}[.]\) for the entire equation to underscore its significance, although this doesn’t change the outcome. \(\begin{align} \boxed{L = \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ \underbrace{D_{KL}\left(q(\mathbf{x}_T|\mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_T) \right)}_{L_{T}} + \sum_{t>1}^{T} \underbrace{D_{KL}\left(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t) \right)}_{L_{t-1}} - \underbrace{\log{p_{\mathbf{\theta}}(\mathbf{x}_0 | \mathbf{x}_1)}}_{L_0} \right]} \tag{3.8} \end{align}\)
- Incorporation of forward posteriors into the lower bound (\(-L\)).
- The KL-divergence terms \(L_{t-1}\) in equation \(3.8\) highlight the aim for reverse transition PDFs to approximate the forward posterior PDFs.
- Derivation of the forward posterior PDF using Bayes' rule is straightforward, as demonstrated in the following section.
3.3. Derivation of forward posterior PDF
The PDF \(q(\mathbf{x}_{t-1}\vert\mathbf{x}_t, \mathbf{x}_0)\) is called the forward posterior.
 Let’s find a closed-form representation for \(q(\mathbf{x}_{t-1}\vert\mathbf{x}_t, \mathbf{x}_0)\) using the Bayes’ rule:
Here, 
 \(\begin{align*} \mathbb{a} &\implies q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t I) \quad \text{(from equation $2.1$)} \\ \mathbb{b} &\implies q(\mathbf{x}_{t-1}\vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1-\bar{\alpha}_{t-1}) I) \quad \text{(from equation $2.5$)} \\ \mathbb{c} &\implies q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) I) \quad \text{(from equation $2.5$)} \end{align*}\)
Since all three PDFs are Gaussian, the result in eqn. \(3.9\) will also be Gaussian. We just need to plug in the Gaussian expressions and find the expressions for the mean and covariance. Lets denote mean with \(\tilde{\mathbf{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\) and covariance with \(\tilde{\beta}_t I\).
\[\begin{align} \boxed{q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\mathbf{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t I)} \tag{3.10} \end{align}\]Here’s an outline of the steps involved in deriving the mean and covariance of the forward posterior: \(\begin{align*} &q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1}| \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \\ \implies&\mathcal{N}(\mathbf{x}_t; \tilde{\mathbf{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t I) = \frac{\mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t I) \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1-\bar{\alpha}_{t-1}) I)}{\mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) I)} \\ \\ &\text{(Write down the Gaussian expressions, and group the terms in the exponential as below)} \\ \\ & K_1 \exp\left(-\frac{1}{2}\left[\frac{\mathbf{x}_{t-1}^T\mathbf{x}_{t-1}}{\tilde{\beta}_t} - \frac{2\mathbf{x}_{t-1}^T\tilde{\mathbf{\mu}}_t}{\tilde{\beta}_t} + \frac{\tilde{\mathbf{\mu}}_t^T\tilde{\mathbf{\mu}}_t}{\tilde{\beta}_t} \right]\right) = K_2 \exp\left(-\frac{1}{2}\left[\mathbf{x}_{t-1}^T\mathbf{x}_{t-1} (.) - 2\mathbf{x}_{t-1}^T\tilde{\mathbf{\mu}}_t (.) + ... \right]\right) \end{align*}\) 
 By comparing the coefficients of the terms \(\mathbf{x}_{t-1}^T\mathbf{x}_{t-1}\) and \(2\mathbf{x}_{t-1}^T\tilde{\mathbf{\mu}}_t\), we get 
 \(\begin{align} \boxed{\tilde{\mathbf{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t, \text{ and } \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t } \tag{3.11} \end{align}\)
- We've successfully derived the mean and covariance of the Gaussian forward posterior.
- With the forward posterior understood, our focus now shifts to train our model effectivley. This means teaching our model's reverse process to predict the forward posterior well at every step.
4. Understanding the Training Objective
Recall that our objective was to maximize the log-likelihood which translates to maximizing its Variational Lower Bound, i.e., \(-L\).
Reframing the objective: \(\begin{align} \mathbf{\theta}^{\ast} &= \arg \min_{\mathbf{\theta}} -\log p_{\mathbf{\theta}}(\mathbf{x}_0) \\ &= \arg \min_{\mathbf{\theta}} L(\mathbf{\theta}; \mathbf{x}_0) \tag{4.1} \\ &= \arg \min_{\mathbf{\theta}} \mathbb{E}_{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}\left[ L_T + \sum_{t>1}^{T} L_{t-1} + L_0 \right] \nonumber \end{align}\)
Let’s break down the components \(L_T\), \(L_{t-1}\), and \(L_0\) that comprise \(L\).
4.1. About \(L_T\)
\[\begin{align*} L_T = D_{KL}\left(q(\mathbf{x}_T\vert\mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_T) \right) \end{align*}\]As you know, the forward process is governed solely by hyper-parameters \(\beta_t\) and there are no learnable parameters. Hence, \(q(\mathbf{x}_T\vert\mathbf{x}_0)\) is considered constant during training. On the other hand, \(p_{\mathbf{\theta}}(\mathbf{x}_T)\) represents a zero-mean, univariate Gaussian \(\mathcal{N}(\mathbf{x}_T; 0, I)\), which remians independent of \(\mathbf{\theta}\). Consquently, \(L_T\) remains constant during training and can be ignored.
4.2. About \(L_{t-1}\)
\[\begin{align*} L_{t-1} = D_{KL}\left(q(\mathbf{x}_{t-1}\vert\mathbf{x}_t, \mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \right), \quad t = 2, 3, \cdots, T \end{align*}\]where, 
 \(\begin{align*} &q(\mathbf{x}_{t-1}\vert\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\mathbf{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t I) \\ &p_{\mathbf{\theta}}(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_{\mathbf{\theta}}(\mathbf{x}_t, t), \boldsymbol{\Sigma}_{\mathbf{\theta}}(\mathbf{x}_t, t)). \end{align*}\) 
 
 Given the Gaussian nature of both distributions, we can derive a closed form for the KL divergence. In the paper, the authors set \(\boldsymbol{\Sigma}_{\mathbf{\theta}}(\mathbf{x}_t, t) = \sigma_t^2 I\). Notably the paper suggests that empirically, both \(\sigma_t^2 = \beta_t\) and \(\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t\) yield similar results. For our purposes, we shall consider \(\sigma_t^2 = \beta_t\).
 Therefore, \(\begin{align} L_{t-1} = D_{KL}\left(q(\mathbf{x}_T\vert\mathbf{x}_0) \parallel p_{\mathbf{\theta}}(\mathbf{x}_T) \right) = \frac{1}{2\sigma_t^2} \lVert \tilde{\mathbf{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \mathbf{\mu}_{\mathbf{\theta}}(\mathbf{x}_t, t) \rVert^2 + C, \quad \text{($C$ is a constant)} \tag{4.2} \end{align}\)
For the eqn. \(2.4\), we can apply reparameterization trick as follows:
\[\begin{align} \mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}) &= \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\mathbf{\epsilon}, \quad \mathbf{\epsilon}\sim \mathcal{N}(0, I) \nonumber \\ &\text{(Observe that we are writing $\mathbf{x}_{t}$ as a function $\mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon})$)} \nonumber \\ \implies & \mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}) - \sqrt{1-\bar{\alpha}_t} \mathbf{\epsilon} \right) \tag{4.3} \end{align}\]Replacing \(\mathbf{x}_0\) in eqn. \(4.2\) using eqn. \(4.3\):
\[\begin{align} L_{t-1} - C &= \frac{1}{2\sigma_t^2} \left\| {\color{blue}\tilde{\mathbf{\mu}}_t} \left(\mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}), \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}) - \sqrt{1-\bar{\alpha}_t} \mathbf{\epsilon} \right) \right) - \mathbf{\mu}_{\mathbf{\theta}}(\mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}), t) \right\|^2 \nonumber \\ \\ &\text{(Substituting ${\color{blue}\tilde{\mathbf{\mu}}_t}(\mathbf{x}_t, \mathbf{x}_0)$ from equation $3.11$ and simplifying further)} \nonumber \\ \\ &= \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\mathbf{\epsilon} \right) - {\color{red}\mathbf{\mu}_{\mathbf{\theta}}}(\mathbf{x}_{t}(\mathbf{x}_0, \mathbf{\epsilon}), t) \right\|^2 \tag{4.4} \end{align}\]As explained in the paper, with \(\mathbf{x}_t\) as input to the model, it also makes sense to parameterize \({\color{red}\mathbf{\mu}_{\mathbf{\theta}}}(\mathbf{x}_{t}, t)\) as follows:
\[\begin{align} {\color{red}\mathbf{\mu}_{\mathbf{\theta}}}(\mathbf{x}_{t}, t) &= {\color{blue}\tilde{\mathbf{\mu}}_t} \left(\mathbf{x}_{t}, \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_{t} - \sqrt{1-\bar{\alpha}_t} {\color{red}\mathbf{\epsilon}_{\mathbf{\theta}}}(\mathbf{x}_t, t) \right) \right) \nonumber \\ &= \frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_{t} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}{\color{red}\mathbf{\epsilon}_{\mathbf{\theta}}}(\mathbf{x}_t, t) \right) \tag{4.5} \end{align}\]Substitute equation \(4.5\) in equation \(4.4\) and it simplifies to,
\[\begin{align} \boxed{L_{t-1} - C = \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)} \left\| \mathbf{\epsilon} - {\color{red}\mathbf{\epsilon}_{\mathbf{\theta}}}\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\mathbf{\epsilon}, t \right) \right\|} \tag{4.6} \end{align}\]- In equations \(4.4\) and \(4.6\), we observe two distinct approaches to achieve the same objective: training the model to learn either the forward posterior mean (\(\tilde{\mathbf{\mu}}_t\)) or solely the noise (\(\mathbf{\epsilon}\)) present in the forward process.
- During sampling, which is the reverse generative process, we predict \(\mathbf{x}_{t-1}\) based on \(\mathbf{x}_t\), i.e., \(\mathbf{x}_{t-1} \sim p_{\mathbf{\theta}}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\).
- We can compute \(\mathbf{x}_{t-1}\) by applying reparameterization trick to \(p_{\mathbf{\theta}}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\) from eqn. \(2.5\): $$ \begin{align} \mathbf{x}_{t-1} &= {\color{red}\mathbf{\mu}_{\mathbf{\theta}}}(\mathbf{x}_t, t) + \sigma_t \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I) \nonumber \\ &= \frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_{t} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}{\color{red}\mathbf{\epsilon}_{\mathbf{\theta}}}(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I) \tag{4.7} \end{align} $$
4.3. About \(L_0\)
- I will avoid any derivation here and provide the main idea.
- Data Scaling: It’s important to note that the image data, which consists of integers ranging from \(\{0, 1, \cdots, 255\}\), are linearly scaled to the interval \([-1, 1]\). This scaling is applied consistently to every sampled \(\mathbf{x}_t\), starting from the standard normal distribution \(p(\mathbf{x}_T)\) in the reverse process.
- Final Step Transition \(\mathbf{x}_1 \rightarrow \mathbf{x}_0\): In the context of the \(L_0\) term, the transition from \(\mathbf{x}_1\) to \(\mathbf{x}_0\) is known to be Gaussian, i.e., \(p_{\theta}(\mathbf{x}_0 \vert \mathbf{x}_1) = \mathcal{N}(\mathbf{x}_0; \mu_{\theta}(\mathbf{x}_1, 1), \sigma_1^2 I)\).
- Independent Assumption: We assume that the components of the data are independent. (See the data as a \(D\) dimensional vector). This assumption simplifies the modeling process. However, the paper mentions the possibility of incorporating dependency among components through a ‘conditional autoregressive model’ in future work.
- Modeling Independent Components: Under the assumption of independent components, \(p_{\theta}(\mathbf{x}_0 \vert \mathbf{x}_1)\) is expressed as the product of individual component probabilities. Refer to the section 3.3 of the DDPM paper for more details.
- Probability Calculation: Each individual probability is approximated as the Gaussian probability density at a specific \(x\), multiplied by a small bin width, which is a standard approach for approximating continuous probability density functions, like the Gaussian. \(\begin{align}\begin{split} p_\theta(\mathbf{x}_0 | \mathbf{x}_1) &= \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)} \mathcal{N}(x; \mu_\theta^i(\mathbf{x}_1, 1), \sigma_1^2) \, dx \\ \delta_{+}(x) &= \begin{cases} \infty & \text{if}\ x=1 \\ x+\frac{1}{255} & \text{if}\ x < 1 \end{cases} \qquad \delta_{-}(x) = \begin{cases} -\infty & \text{if}\ x=-1 \\ x-\frac{1}{255} & \text{if}\ x > -1 \end{cases} \end{split} \tag{4.8} \end{align}\) 
 where \(D\) is the data dimensionality and the \(i\) superscript indicates extraction of one coordinate.
Differences Between Training and Sampling Phases for the step involving \(\mathbf{x}_0, \mathbf{x}_1\):
- Training Phase:- Parameter Learning: Utilizes the accurate depiction of log-likelihood by the above-mentioned discretization approach for accurately modeling the discrete nature of image data.
 
- Sampling Phase:- Output Generation: Typically uses the mean of the distribution (\(\mu_{\theta}(\mathbf{x}_1, 1)\)) instead of random sampling, for generating the final output (\(\mathbf{x}_0\)), prioritizing efficiency and stability of the output.
 
Conclusion
- The forward process, known as diffusion, iteratively adds noise determined by fixed hyperparameters. In contrast, the reverse process is parameterized. Through training, it learns to reverse the diffusion and recreate data samples that match the desired data distribution.
- In simpler terms, the reverse process PDF needs to mimic the posterior PDF of the forward process at each step. We achieve this using a neural network.
- Through our mathematical journey, we’ve seen how this neural network’s task boils down to either predicting the forward posterior mean or just the noise present in the forward process. I trust this article has helped clarify many aspects of understanding the mathematics behind denoising diffusion models.
References
- Jonathan Ho, Ajay Jain, and Pieter Abbeel. (2020). Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239.
- David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859-877. http://dx.doi.org/10.1080/01621459.2017.1285773.
- Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. (2024). Diffusion Models: A Comprehensive Survey of Methods and Applications. arXiv preprint arXiv:2209.00796.
