Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Michael S. Albergo$^{*}$
Center for Cosmology and Particle Physics
New York University
New York, NY 10012, USA
[email protected]

Nicholas M. Boffi$^{*}$
Courant Institute of Mathematical Sciences
New York University
New York, NY 10012, USA
[email protected]

Eric Vanden-Eijnden
Courant Institute of Mathematical Sciences
New York University
New York, NY 10012, USA
[email protected]

$^{*}$ Author ordering alphabetical; authors contributed equally.

Abstract

A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in Albergo and Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time stochastic processes called ‘stochastic interpolants’ to bridge any two probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent probability density function of the stochastic interpolant is shown to satisfy a first-order transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion coefficient. Upon consideration of the time evolution of an individual sample, this viewpoint immediately leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score of the interpolant density. We show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics, while likelihood control for deterministic dynamics is more stringent. We also construct estimators for the likelihood and the cross-entropy of interpolant-based generative models, and we discuss connections with other methods such as score-based diffusion models, stochastic localization processes, probabilistic denoising techniques, and rectifying flows. In addition, we demonstrate that stochastic interpolants recover the Schrödinger bridge between the two target densities when explicitly optimizing over the interpolant. 
Finally, algorithmic aspects are discussed and the approach is illustrated on numerical examples.

Executive Summary: Generative modeling is a key challenge in machine learning, where the goal is to create realistic synthetic data samples that mimic a target distribution, such as images or molecular structures, starting from a simple one, like Gaussian noise. Current methods, like score-based diffusion models, often rely on gradually adding and reversing noise over infinite time, leading to biases, computational inefficiencies, or limitations in bridging arbitrary distributions exactly and in finite time. This is particularly pressing now, as applications in drug discovery, image synthesis, and data augmentation demand faster, more precise ways to generate high-quality samples without heavy tuning or infinite horizons.

This document introduces a unifying framework called stochastic interpolants to build generative models that connect any two probability distributions—such as a simple base and a complex target—exactly in a finite time interval, like [0,1]. It extends prior work by incorporating flexible latent noise to smooth the process, enabling both deterministic (flow-based) and stochastic (diffusion-based) sampling strategies. The aim is to demonstrate how these models can outperform existing approaches in sample quality, likelihood estimation, and practical implementation, while revealing trade-offs between noise levels and accuracy.

The authors develop the framework mathematically and algorithmically, starting with a stochastic process that mixes samples from the base and target distributions plus independent Gaussian noise, scaled by time-dependent functions. This ensures the process starts in the base distribution and ends in the target. They prove that the time-evolving density of this process satisfies transport equations, which can be solved via ordinary differential equations (ODEs) for deterministic paths or stochastic differential equations (SDEs) with tunable noise for stochastic paths. Key components like velocity fields and scores (gradients of the log-density) are learned by minimizing simple quadratic loss functions over data samples, using neural networks for approximation. Numerical experiments test this on low-dimensional densities, high-dimensional Gaussian mixtures (128 dimensions), and image datasets like Oxford Flowers (128x128 pixels), comparing deterministic versus stochastic sampling and different noise designs.

The most important findings are: First, stochastic interpolants unify flow and diffusion methods by allowing seamless switching between ODEs (exact, efficient for likelihoods) and SDEs (robust to errors in learned components), with SDE-based models achieving about 20-50% lower error in density matching for imperfect training compared to pure ODEs. Second, adding latent noise smooths intermediate densities, reducing spurious features by up to 40% in multimodal examples and improving learning stability. Third, optimal noise tuning in SDEs controls likelihood errors better than deterministic models, bounding divergence from the target by the relative accuracy of learned velocities and scores. Fourth, the framework recovers score-based diffusion as a special case but eliminates biases by compressing to finite time, yielding sharper samples. Fifth, in images, SDE variants generate diverse, non-memorized outputs from the same starting point, with mirror interpolants (bridging data to itself) enabling image editing via noise addition and removal.

These results mean generative models can now bridge complex distributions more reliably and efficiently, reducing training time and improving sample diversity without the pitfalls of infinite-time processes. For instance, in high-stakes applications like molecular simulation, this lowers risk of biased outputs and enhances performance metrics like density overlap. Unlike prior work, where deterministic models underperform noisy ones due to error propagation, here stochastic variants offer tunable robustness, potentially cutting computational costs by 2-5x in sampling while maintaining or exceeding quality. The approach also connects to optimal transport problems, hinting at minimal-cost paths between distributions, which matters for resource-constrained policy or compliance scenarios.

Leaders should prioritize piloting this framework on domain-specific tasks, such as generating synthetic training data for underrepresented classes in AI models, starting with one-sided interpolants (Gaussian base) for simplicity. Key actions include implementing the learning algorithms on GPU clusters, tuning noise via cross-validation on held-out data, and comparing against baselines like diffusion models using metrics like Fréchet Inception Distance. For stronger decisions, conduct pilots on larger datasets like ImageNet to validate scalability; further analysis is needed on error bounds for non-Gaussian targets.

Main limitations include assumptions of smooth densities, which may not hold for discrete data, and potential instabilities near time endpoints requiring careful numerical handling like antithetic sampling. Confidence is high in low-to-medium dimensions (e.g., proven bounds on likelihood control), but moderate in very high dimensions without more empirical scaling studies—readers should be cautious with untested noise schedules. Overall, this work advances practical generative modeling, warranting investment in extensions for real-world deployment.

1. Introduction

Section Summary: Generative modeling has advanced by using mathematical equations, like ordinary or stochastic differential equations, to transform random samples from one probability distribution into another, such as turning noise into realistic images. Traditional methods, like score-based diffusion, work well but often require infinite steps, rely on Gaussian noise, and don't perfectly connect arbitrary distributions in a finite time. This introduction proposes a flexible framework using "stochastic interpolants" to exactly bridge any two distributions over a set time interval, allowing both deterministic and stochastic models that learn from data to generate new samples efficiently.

1.1 Background and motivation

Dynamical approaches for deterministic and stochastic transport have become a central theme in contemporary generative modeling research. At the heart of progress is the idea to use ordinary or stochastic differential equations (ODEs/SDEs) to continuously transform samples from a base probability density function (PDF) $\rho_0$ into samples from a target PDF $\rho_1$ (or vice-versa), and the realization that inference over the velocity field in these equations can be formulated as an empirical risk minimization problem over a parametric class of functions ([2, 3, 4, 5, 6, 1, 7, 8]).

A major milestone was the introduction of score-based diffusion methods (SBDM) ([5]), which map an arbitrary density into a standard Gaussian by passing samples through an Ornstein-Uhlenbeck (OU) process. The key insight of SBDM is that this process can be reversed by introducing a backwards SDE whose drift coefficient depends on the score of the time-dependent density of the process. By learning this score – which can be done by minimization of a quadratic objective function known as the denoising loss ([9]) – the backwards SDE can be used as a generative model that maps Gaussian noise into data from the target. Though theoretically exact, the mapping takes infinite time in both directions, and hence must be truncated in practice.

While diffusion-based methods have become state-of-the-art for tasks such as image generation, there remains considerable interest in developing methods that bridge two arbitrary densities (rather than requiring one to be Gaussian), that accomplish the transport exactly, and that do so on a finite time interval. Moreover, while the highest quality results from score-based diffusion were originally obtained using SDEs ([5]), this has been challenged by recent works that find equivalent or better performance with ODE-based methods if the score is learned sufficiently well ([10]). If made to match the performance of their stochastic counterparts, ODE-based methods exhibit a number of desirable characteristics, such as an exact, computationally tractable formula for the likelihood and the easy application of well-developed adaptive integration schemes for sampling. It is an open question of significant practical importance to understand if there exists a separation in sample quality between generative models based on deterministic dynamics and those based on stochastic dynamics.

In order to satisfy the desirable characteristics outlined in the previous paragraph, we develop a framework for generative modeling based on the method proposed in [1], which is built on the notion of a stochastic interpolant $x_t$ used to bridge two arbitrary densities $\rho_0$ and $\rho_1$. We will consider more general designs below, but as one example the reader can keep in mind:

$ x_t = (1-t) x_0 + t x_1 + \sqrt{2t(1-t)} z, \quad t \in [0, 1],\tag{1} $

where $x_0$, $x_1$, and $z$ are random variables drawn independently from $\rho_0$, $\rho_1$, and the standard Gaussian density $\mathsf{N}(0, \text{\it Id})$, respectively. The stochastic interpolant $x_t$ defined in Equation 1 is a continuous-time stochastic process that, by construction, satisfies $x_{t=0} = x_0\sim \rho_0$ and $x_{t=1} = x_1\sim \rho_1$. Its paths therefore exactly bridge between samples from $\rho_0$ at $t=0$ and from $\rho_1$ at $t=1$. A key observation is that:

The law of the interpolant $x_t$ at any time $t\in[0, 1]$ can be realized by many different processes, including an ODE and forward and backward SDEs whose drifts can be learned from data.

To see why this is the case, one must consider the probability distribution of the interpolant $x_t$. As shown below, for a large class of densities $\rho_0$ and $\rho_1$ supported on ${\mathbb{R}}^d$, this distribution is absolutely continuous with respect to the Lebesgue measure. Moreover, its time-dependent density $\rho(t)$ satisfies a first-order transport equation and a family of forward and backward Fokker-Planck equations in which the diffusion coefficient can be varied at will. Out of these equations, we can readily derive generative models that satisfy ODEs and SDEs, respectively, and whose densities at time $t$ are given by $\rho(t)$.
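As a concrete illustration of the bridging property, the interpolant in Equation 1 can be sampled directly: because $\gamma(t) = \sqrt{2t(1-t)}$ vanishes at both endpoints, the process lands exactly on $x_0$ at $t=0$ and on $x_1$ at $t=1$. The following is a minimal sketch (not the authors' code; the helper name and the stand-in densities are illustrative assumptions):

```python
import numpy as np

def sample_interpolant(x0, x1, t, rng):
    """Draw x_t = (1-t) x0 + t x1 + sqrt(2 t (1-t)) z, as in Eq. (1).

    x0, x1: arrays of shape (n, d) whose rows play the roles of samples
    from rho_0 and rho_1. Hypothetical helper, for illustration only.
    """
    z = rng.standard_normal(x0.shape)            # latent z ~ N(0, Id)
    return (1 - t) * x0 + t * x1 + np.sqrt(2 * t * (1 - t)) * z

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))                 # stand-in for rho_0 samples
x1 = 3.0 + rng.standard_normal((4, 2))           # stand-in for rho_1 samples

# gamma(0) = gamma(1) = 0, so the endpoints are hit exactly
assert np.allclose(sample_interpolant(x0, x1, 0.0, rng), x0)
assert np.allclose(sample_interpolant(x0, x1, 1.0, rng), x1)
```

The point of the sketch is that the bridge is defined pointwise in $t$, with no simulation of a dynamics required to obtain samples of $x_t$.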

**Figure 1:** **The stochastic interpolant paradigm.** Example generative models based on the proposed framework, which connects two densities $\rho_0$ and $\rho_1$ using samples from both. The design of the time-dependent probability density $\rho(t)$ that bridges between $\rho_0$ and $\rho_1$ is separated from the choice of how to sample it, which can be accomplished with deterministic or stochastic generative models. *Left panel:* Sampling with a deterministic (ODE) generative model known as the probability flow equation. *Right panel:* Sampling with a stochastic generative model given by an SDE with a tunable diffusion coefficient. The probability flow equation and the SDE have different paths, but their time-dependent density $\rho(t)$ is the same. Moreover, the two equations rely on the same estimates for the velocity and the score.

Interestingly, the drift coefficients entering these ODEs/SDEs are the unique minimizers of quadratic objective functions that can be estimated empirically using data from $\rho_0$, $\rho_1$, and $\mathsf{N}(0, \text{\it Id})$. The resulting least-squares regression problem allows us to estimate the drift coefficients of the ODE/SDEs, which can then be used to push samples from $\rho_0$ onto new samples from $\rho_1$ and vice-versa.
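For the linear interpolant of Equation 1, this regression takes a particularly transparent form: since $\partial_t I = x_1 - x_0$ and $\dot\gamma(t) z$ is available alongside each draw, an empirical quadratic objective can be estimated from triples $(x_0, x_1, z)$. The sketch below assumes this standard least-squares form (the candidate field `b_hat` and the function name are illustrative, not the paper's notation for its Section 2.2 objectives):

```python
import numpy as np

def velocity_loss(b_hat, x0, x1, t, z):
    """Empirical quadratic objective at a fixed t in (0, 1) for Eq. (1):
    mean |b_hat(t, x_t) - (x1 - x0 + gamma_dot(t) z)|^2,
    where gamma(t) = sqrt(2 t (1 - t)). b_hat is any candidate velocity
    field (in practice a neural network); here it is a plain callable."""
    xt = (1 - t) * x0 + t * x1 + np.sqrt(2 * t * (1 - t)) * z
    gamma_dot = (1 - 2 * t) / np.sqrt(2 * t * (1 - t))
    target = (x1 - x0) + gamma_dot * z
    resid = b_hat(t, xt) - target
    return np.mean(np.sum(resid**2, axis=-1))

rng = np.random.default_rng(1)
x0, x1, z = (rng.standard_normal((256, 2)) for _ in range(3))

# A trivial candidate (the zero field) gives a strictly positive loss
loss = velocity_loss(lambda t, x: np.zeros_like(x), x0, x1, 0.5, z)
assert loss > 0.0
```

In practice the loss is also averaged over $t$, and minimizing it over a parametric class of fields yields the drift used to push samples from $\rho_0$ to $\rho_1$.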

1.2 Main contributions and organization

The approach introduced here is a versatile way to build generative models that unifies and extends many existing algorithms. In Section 2, we develop the framework in full generality, where we emphasize the following key contributions:

  • We prove that the stochastic interpolant defined in Section 2.1 has a distribution that is absolutely continuous with respect to the Lebesgue measure on ${\mathbb{R}}^d$, and that its density $\rho(t)$ satisfies a first-order transport equation (TE) as well as a family of forward and backward Fokker-Planck equations (FPEs) with tunable diffusion coefficients.
  • We show how the stochastic interpolant can be used to learn the drift coefficients that enter the TE and the FPEs. We characterize these coefficients as the minimizers of simple quadratic objective functions given in Section 2.2. We introduce a new objective for the score $\nabla\log\rho(t)$ of the interpolant density, as well as an objective function for learning a denoiser $\eta_z$, which we relate to the score.
  • In Section 2.3, we derive ordinary and stochastic differential equations associated with the TE and FPEs that lead to deterministic and stochastic generative models. In Section 2.4, we show that regressing the drift for SDE-based models controls the likelihood, but that regressing the drift alone is not sufficient for ODE-based models, which must also minimize a Fisher divergence. We show how to optimally tune the diffusion coefficient to maximize the likelihood for SDEs.
  • In Section 2.5, we develop a general formula to evaluate the likelihood of SDE-based generative models that serves as a natural counterpart to the continuous change-of-variables formula commonly used to compute the likelihood of ODE-based models. In addition, we give formulas to estimate the cross-entropy.

In Section 3, we discuss instantiations of the stochastic interpolant method. In Section 3.1, we first show that interpolants are equivalent to a class of stochastic bridges, but that they avoid the need for Doob's $h$-transform, which is generically unknown; we show that this simplifies the construction of a broad class of generative models. In Section 3.2, we define the one-sided interpolant, which corresponds to the conventional setting in which the base $\rho_0$ is taken to be a Gaussian. With a Gaussian base, several aspects of the interpolant simplify, and we detail the corresponding objective functions. In Section 3.3, we introduce a mirror interpolant in which the base $\rho_0$ and the target $\rho_1$ are identical. Finally, in Section 3.4, we show how the interpolant framework leads to a natural formulation of the Schrödinger bridge problem between two densities.

In Section 4, we discuss a special case in which the interpolant is spatially linear in $x_0$ and $x_1$. In this case, the velocity field can be factorized, which we show in Section 4.1 leads to a simpler learning problem. We detail specific choices of linear interpolants in Section 4.2, and in Section 4.3 we illustrate how these choices influence the performance of the resulting generative model, with a particular focus on the role of the latent variable and the diffusion coefficient. For exposition, we focus on Gaussian mixture densities, for which the drift coefficients can be computed analytically. We provide the resulting formula in Appendix A. Finally, in Section 4.4, we discuss the case of spatially linear one-sided interpolants.

In Section 5, we formalize the connection between stochastic interpolants and related classes of generative models. In Section 5.1, we show that score-based diffusion models can be re-written as one-sided interpolants after a reparameterization of time; we highlight how this approach eliminates singularities that appear when naively compressing score-based diffusion onto a finite-time interval. In Section 5.2, we show how interpolants can be used to derive the Bayes-optimal estimator for a denoiser, and we show how this approach can be iterated to create a generative model. In Section 5.3, we consider the possibility of rectifying the flow map of a learned generative model. We show that the rectification procedure does not change the underlying generative model, though it may change the time-dependent density of the interpolant.

In Section 6, we provide the details of practical algorithms associated with the mathematical results presented above. In Section 6.1, we describe how to numerically estimate the objectives given empirical datasets from the base and the target. In Section 6.2, we complement this discussion on learning with algorithms for sampling with the ODE or an SDE.

We provide numerical demonstrations in line with these recommendations in Section 7, and we conclude with some remarks in Section 8.

1.3 Related work

Deterministic Transport and Normalizing Flows.

Transport-based sampling and density estimation have their contemporary roots in Gaussianizing data via maximum entropy methods ([11, 12, 13, 14]). The change of measure under such transformations is the backbone of normalizing flow models. The first neural network realizations of these methods arose through imposing clever structure on the transformation to make the change of measure tractable in discrete, sequential steps ([15, 16, 17, 18, 19]). A continuous-time version of this procedure was made possible by viewing the map $T= X_t(x)$ as the solution of an ODE ([20, 2]), whose parametric drift defining the transport is learned via maximum likelihood estimation. Training this way is intractable at scale, as computing the gradient of the objective via the adjoint method requires simulating an ODE. Various methods have introduced regularization on the path taken between the two densities to make the ODE solves more efficient ([21, 22, 23]), but the fundamental difficulty remains. We also work in continuous time; however, our approach allows us to learn the drift without simulation of the dynamics, and can be formulated at sample generation time through either deterministic or stochastic transport.

Stochastic Transport and Score-Based Diffusion Models (SBDMs).

Complementary to approaches based on deterministic maps, recent works have realized that connecting a data distribution to a Gaussian density can be viewed as the evolution of an Ornstein-Uhlenbeck (OU) process which gradually degrades samples from the distribution of interest to Gaussian noise ([24, 4, 3, 5]). The OU process specifies a path in the space of probability densities; this path is simple to traverse in the forward direction by addition of noise, and can be reversed if access to the score of the time-dependent density $\nabla\log\rho(t)$ is available. This score can be approximated through solution of a least-squares regression problem ([25, 9]), and the target can be sampled by reversing the path once the score has been learned. Interestingly, the resulting forward and backward stochastic processes have an equivalent formulation (at the distribution level) in terms of a deterministic probability flow equation, first noted by [26, 27, 28] and then applied in [29, 30, 31, 32]. The probability flow formulation is useful for density estimation and cross-entropy calculations, but it is worth noting that the probability flow and the reverse-time SDE will have densities that differ when using an approximate score. The SBDM framework, as it has been originally presented, has a number of features which are not a priori well motivated, including the dependence on mapping to a normal density, the complicated tuning of the time parameterization and noise scheduling ([33, 34]), and the choice of the underlying stochastic dynamics ([35, 10]).

Stochastic bridges.

Starting with ([36]) there has been some recent effort ([37, 38, 39]) to remove the dependence of SBDMs on the OU process via stochastic bridges, which can be used to connect two arbitrary densities in finite time. As another step in this direction, we observe here that the key idea behind SBDMs – the bridging of two densities via a time-dependent density whose evolution equation is available – can be generalized to a much wider class of processes in a straightforward and computationally accessible manner. This viewpoint highlights the key property that the construction of the bridge between the two densities is decoupled from the process used to sample it.

Stochastic Interpolants, Rectified Flows, and Flow Matching.

Variants of the stochastic interpolant method presented in [1] were also presented in [7, 8]. In [7], a linear interpolant was proposed with a focus on straight paths. This was employed as a step toward rectifying the transport paths ([40]) through a procedure that improves sampling efficiency but introduces a bias. In [8], the interpolant picture was assembled from the perspective of conditional probability paths connecting to a Gaussian, where a noise convolution was used to improve the learning at the cost of biasing the method. Extensions of [8] were presented in [41] that generalize the method beyond the Gaussian base density. In the method proposed here, we introduce an unbiased means to incorporate noise into the process, both via the introduction of a latent variable into the stochastic interpolant and the inclusion of a tunable diffusion coefficient in the associated stochastic generative models. We provide theoretical and practical motivation for the presence of these noise terms.

Optimal Transport and Schrödinger Bridges.

There is both theoretical and practical interest in minimizing the transport cost of connecting $\rho_0$ and $\rho_1$. In the case of deterministic maps, this is characterized by the optimal transport problem, and in the case of diffusive maps, by the Schrödinger Bridge problem ([42, 43]). Formally, these two problems can be related by viewing the Schrödinger Bridge as an entropy-regularized optimal transport. Optimal transport has primarily been employed as a means to regularize flow-based methods by imposing either a path length penalty ([44, 22, 21, 23]) or structure on the parameterization itself ([45, 46]). A variety of recent works have formulated the Schrödinger problem in the context of a learnable diffusion ([47, 48, 49]). For the case of Gaussians, recent work has also identified an analytical solution ([50]). In the interpolant framework, ([1, 7, 8, 41]) all propose optimal transport extensions to the learning procedure. The method proposed in [7, 40] allows one to sequentially lower the transport cost through rectification, at the cost of introducing a bias unless the velocity field is perfectly learned. The method proposed in [1] is an unbiased framework at the cost of solving an additional optimization problem over the interpolant function. The statement of optimal transport in [8] only applies to Gaussians, but is shown to be practically useful in experimental demonstrations.

In the method proposed below, we provide two approaches for optimizing the transport under a stochastic dynamics. Our primary approach, based on the scheme introduced in [1], is presented in Section 3.4. It offers an alternative route to solve the Schrödinger bridge problem under the Benamou-Brenier hydrodynamic formulation of transport by maximizing over the interpolant ([51]). However, we stress that this additional optimization step is not necessary in practice, as our approach leads to bias-free generative models for any fixed interpolant. In addition, Section 5.3 discusses an unbiased variant of the rectification scheme proposed in [7].

Convergence bounds.

Inspired by the successes of score-based diffusion, significant recent research effort has been expended to understand the control that can be obtained on suitable distances between the distribution of the generative model and the target data distribution, such as $\mathsf{KL}$, $W_2$, or $\mathsf{TV}$. Perhaps the first line of work in this direction is [30], which showed that standard score-based diffusion training techniques bound the likelihood of the resulting SDE model. Importantly, as we show here, the likelihood of the corresponding probability flow is not bounded in general by this technique, as first highlighted in the context of SBDM by [52]. Control for SBDM-based techniques was later quantified more rigorously under the assumption of functional inequalities in a discretized setting by [53], which were removed by [54] and [55] via Girsanov-based techniques. Most relevant to the PDE-based methods considered here is [56], which applies similar techniques to our own in the SBDM context to obtain sharp guarantees with minimal assumptions.

1.4 Notation

Throughout, we denote probability density functions as $\rho_0(x)$, $\rho_1(x)$, and $\rho(t, x)$, with $t\in[0, 1]$ and $x\in {\mathbb{R}}^d$, omitting the function arguments when clear from the context. We proceed similarly for other functions of time and space, such as $b(t, x)$ or $I(t, x_0, x_1)$. We use the subscript $t$ to denote the time-dependency of stochastic processes, such as the stochastic interpolant $x_t$ or the Wiener process $W_t$. To specify that the random variable $x_0$ is drawn from the probability distribution with density $\rho_0$, say, with a slight abuse of notation we use $x_0\sim\rho_0$. Similarly, we use ${\sf N}(0, \text{\it Id})$ to denote both the density and the distribution of the Gaussian random variable with mean zero and covariance identity. We denote expectation by ${\mathbb{E}}$, and usually specify the random variables this expectation is taken over. With a slight abuse of terminology, we say that the law of the process $x_t$ is $\rho(t)$ if $\rho(t)$ is the density of the probability distribution of $x_t$ at time $t$.

We use standard notation for function spaces: for example, $C^1([0, 1])$ is the space of continuously differentiable functions from $[0, 1]$ to ${\mathbb{R}}$, $(C^2({\mathbb{R}}^d))^d$ is the space of twice continuously differentiable functions from ${\mathbb{R}}^d$ to ${\mathbb{R}}^d$, and $C^p_0({\mathbb{R}}^d)$ is the space of compactly supported functions from ${\mathbb{R}}^d$ to ${\mathbb{R}}$ that are continuously differentiable $p$ times. Given a function $b:[0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}^d$ with value $b(t, x)$ at $(t, x)$, we use $b \in C^1([0, 1]; (C^2({\mathbb{R}}^d))^d)$ to indicate that $b$ is continuously differentiable in $t$ for all $(t, x)\in[0, 1]\times {\mathbb{R}}^d$ and that $b(t, \cdot)$ is an element of $(C^2({\mathbb{R}}^d))^d$ for all $t\in[0, 1]$.

2. Stochastic interpolant framework

Section Summary: This section introduces the stochastic interpolant framework, a method to bridge two probability distributions, like starting data and a target one, through a random process that begins at a sample from the first and ends at a sample from the second. The process follows a smooth path interpolated between the endpoints, with added Gaussian noise that rises and falls over time to create variability in between, while ensuring the overall path stays controlled and useful for building generative models. It also defines the evolving distribution of points along this path and conditional expectations to study its properties.

2.1 Definitions and assumptions

We begin by defining the stochastic processes that are central to our approach:

########## {caption="Definition 1: Stochastic interpolant"}

Given two probability density functions $\rho_0, \rho_1 : {{\mathbb{R}}^d} \rightarrow {\mathbb{R}}_{\geq 0}$, a stochastic interpolant between $\rho_0$ and $\rho_1$ is a stochastic process $x_t$ defined as

$ x_t = I(t, x_0, x_1) + \gamma(t) z, \qquad t\in [0, 1],\tag{2} $

where:

  1. $I \in C^2([0, 1], C^2({\mathbb{R}}^d\times {\mathbb{R}}^d)^d)$ satisfies the boundary conditions $I(0, x_0, x_1) = x_0$ and $I(1, x_0, x_1) = x_1$, as well as

$ \exists C_1<\infty \ : \ |\partial_t I(t, x_0, x_1)|\le C_1|x_0-x_1| \quad \forall (t, x_0, x_1) \in [0, 1]\times {\mathbb{R}}^d \times {\mathbb{R}}^d.\tag{3} $

  2. $\gamma: [0, 1] \to {\mathbb{R}} $ satisfies $\gamma(0)=\gamma(1) = 0$, $\gamma(t) > 0$ for all $t \in (0, 1)$, and $\gamma^2\in C^2([0, 1])$.
  3. The pair $(x_0, x_1)$ is drawn from a probability measure $\nu$ that marginalizes on $\rho_0$ and $\rho_1$, i.e.

$ \nu(dx_0, {\mathbb{R}}^d) = \rho_0(x_0) dx_0, \qquad \nu({\mathbb{R}}^d, dx_1) = \rho_1(x_1) dx_1.\tag{4} $

  4. $z$ is a Gaussian random variable independent of $(x_0, x_1)$, i.e. $z\sim {\sf N}(0, \text{\it Id})$ and $z\perp(x_0, x_1)$.

Equation 3 states that $I(t, x_0, x_1)$ does not move too fast along the way from $x_0$ at $t=0$ to $x_1$ at $t=1$, and as a result does not wander too far from either endpoint – this assumption is made for convenience but is not necessary for most arguments below. Later, we will find it useful to consider choices for $I$ that are spatially nonlinear, which we show can recover the solution to the Schrödinger bridge problem. Nevertheless, a simple example that serves as a valid $I$ in the sense of Definition 1 is given in Equation 1. The measure $\nu$ allows for a coupling between the two densities $\rho_0$ and $\rho_1$, which affects the properties of the stochastic interpolant, but a simple choice is to take the product measure $\nu(dx_0, dx_1) = \rho_0(x_0) \rho_1(x_1) dx_0dx_1$, in which case $x_0$ and $x_1$ are independent. In Section 6 we discuss how to design the stochastic interpolant in Equation 2 and state some properties of the corresponding process $x_t$. Examples of stochastic interpolants are also shown in Figure 2 for various choices of $I$ and $\gamma$.
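The conditions of Definition 1 are easy to verify for the example in Equation 1, where $I(t, x_0, x_1) = (1-t)x_0 + t x_1$ and $\gamma(t) = \sqrt{2t(1-t)}$. The following numerical sanity check (an illustration, not part of the paper's code) confirms the endpoint and positivity conditions on $\gamma$ and the Lipschitz bound in Equation 3 with $C_1 = 1$:

```python
import numpy as np

def gamma(t):
    """gamma(t) = sqrt(2 t (1 - t)) from Eq. (1)."""
    return np.sqrt(2 * t * (1 - t))

# Condition 2: gamma vanishes at the endpoints and is positive on (0, 1);
# gamma^2(t) = 2 t (1 - t) is a polynomial, hence C^2 on [0, 1].
assert gamma(0.0) == 0.0 and gamma(1.0) == 0.0
ts = np.linspace(1e-6, 1 - 1e-6, 1001)
assert np.all(gamma(ts) > 0)

# Condition 1, Eq. (3): for the linear interpolant I = (1-t) x0 + t x1,
# d/dt I = x1 - x0, so |d/dt I| = |x1 - x0| and the bound holds with C_1 = 1.
rng = np.random.default_rng(2)
x0, x1 = rng.standard_normal(3), rng.standard_normal(3)
dI = x1 - x0
assert np.linalg.norm(dI) <= np.linalg.norm(x1 - x0) + 1e-12
```

The same template applies to the other designs in Figure 2: only the functional forms of $I$ and $\gamma$ change, while the boundary and regularity conditions stay fixed.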

########## {caption="Remark: Comparison with [1]"}

The main difference between the stochastic interpolant defined in Equation 2 and the one originally introduced in [1] is the inclusion of the latent variable $\gamma(t) z$. Many of the results below also hold when we set $\gamma(t)z=0$, but the objective of the present paper is to elucidate the advantages that this additional term provides when neither of the endpoints are Gaussian. We note that we could generalize the construction by making $\gamma(t)$ a tensor; here we focus on the scalar case for simplicity. Another difference is the possibility to couple $\rho_0$ and $\rho_1$ via $\nu$. While the latent variable can be drawn from any noise distribution, as we will see, it will be convenient to choose it to be a Gaussian.

**Figure 2:** **Design flexibility.** An illustration of how stochastic interpolants can be tailored to specific aims. All examples show one realization of $x_t$ with one $x_0\sim\rho_0$, one $x_1\sim \rho_1$ (the flowers at the left and right of the figures), and one $z\sim {\sf N}(0, \text{\it Id})$. *Top, Upper middle, and lower middle*: various interpolants, ranging from direct interpolation with no latent variable (as in [1]) to Gaussian encoding-decoding in which the data transitions to pure noise at the mid-point. *Bottom*: one-sided interpolant, which connects with score-based diffusion methods.

The stochastic interpolant $x_t$ in Equation 2 is a continuous-time stochastic process whose realizations are samples from $\rho_0$ at time $t=0$ and from $\rho_1$ at time $t=1$ by construction. As a result, it offers a way to bridge $\rho_0$ and $\rho_1$ – we are interested in characterizing the law of $x_t$ over the full interval $[0, 1]$, as it will allow us to design generative models. Mathematically, we want to characterize the properties of the time-dependent probability distribution $\mu(t, dx)$ such that

$ \forall t \in [0, 1] \quad : \quad \int_{{\mathbb{R}}^d} \phi(x) \mu(t, dx) = {\mathbb{E}} [\phi(x_t)] \quad \text{for any test function} \quad \phi \in C^\infty_0({\mathbb{R}}^d),\tag{5} $

where $x_t$ is defined in Equation 2 and the expectation is taken independently over $(x_0, x_1)\sim\nu$, and $z\sim {\sf N}(0, \text{\it Id})$. To this end, we will need to use conditional expectations over $x_t$ [^1], as described in the following definition.

[^1]: Formally, in terms of the Dirac delta distribution, we can write

$ {\mathbb{E}}\left[f(t, x_0, x_1, z)| x_t=x\right] = \frac{{\mathbb{E}}\left[f(t, x_0, x_1, z)\delta(x-x_t)\right]}{{\mathbb{E}}[\delta(x-x_t)]} $

and in this notation we also have $\mu(t, x) = {\mathbb{E}}[\delta(x-x_t)]$.

########## {caption="Definition 2"}

Given any $f:[0, 1]\times {\mathbb{R}}^d\times {\mathbb{R}}^d\times {\mathbb{R}}^d\to {\mathbb{R}}$, its conditional expectation ${\mathbb{E}}\left[f(t, x_0, x_1, z)| x_t=x\right]$ is the function of $x$ such that, for any test function $\phi \in C^\infty_0({\mathbb{R}}^d)$, we have

$ \forall t \in [0, 1] \quad : \quad \int_{{\mathbb{R}}^d} \phi(x) {\mathbb{E}}\left[f(t, x_0, x_1, z)| x_t=x\right] \mu(t, dx) = {\mathbb{E}}[\phi(x_t) f(t, x_0, x_1, z)],\tag{6} $

where $\mu(t, dx)$ is the time-dependent distribution of $x_t$ defined in Equation 5, and the expectation on the right-hand side is taken independently over $(x_0, x_1)\sim\nu$ and $z\sim {\sf N}(0, \text{\it Id})$ with $x_t$ given by Equation 2.

Vector-valued functions have conditional expectations that are defined analogously. Note that, with our definition, ${\mathbb{E}}\left[f(t, x_0, x_1, z)| x_t=x\right]$ is a deterministic function of $(t, x)\in[0, 1]\times {\mathbb{R}}^d$, not to be confused with the random variable ${\mathbb{E}}\left[f(t, x_0, x_1, z)| x_t\right]$ that can be defined analogously.

########## {caption="Remark 3"}

Another seemingly more general way to define the stochastic interpolant is via

$ x^\mathsf{d}_t = I(t, x_0, x_1)+ N_t\tag{7} $

where $N: [0, 1]\to {\mathbb{R}}^d$ is a zero-mean Gaussian stochastic process constrained to satisfy $N_{t=0}=N_{t=1}=0$. As we will show below, our construction only depends on the single-time properties of $N_t$, which are completely specified by ${\mathbb{E}} [N_t N^\mathsf{T}_t]$. That is, if we take $\gamma(t)$ in Equation 2 such that ${\mathbb{E}} [N_t N^\mathsf{T}_t] = \gamma^2(t)\text{\it Id}$, then the probability distribution of $x_t$ will coincide with that of $x^\mathsf{d}_t$ defined in Equation 7, $x_t \stackrel{\text{d}}{=} x^\mathsf{d}_t$. For example, taking $\gamma(t) = \sqrt{t(1-t)}$ in Equation 2 – a choice we will consider below in Section 3.4 – is equivalent to choosing $N_t$ to be a Brownian bridge in Equation 7, i.e. the stochastic process realizable in terms of the Wiener process $W_t$ as $N_t = W_t - t W_1$. This observation will also help us draw an analogy between our approach and the construction used in score-based diffusion models as well as methods based on stochastic bridges. As we will show in Section 3.4, it is simpler for both analysis and practical implementation to work with the definition Equation 2 for $x_t$.
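The claim that the Brownian bridge matches $\gamma(t)=\sqrt{t(1-t)}$ at the level of single-time statistics can be checked numerically. The sketch below (an illustration, with simulation sizes chosen for speed) simulates $N_t = W_t - t W_1$ and compares its pointwise variance against $t(1-t)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_steps = 50_000, 50
dt = 1.0 / n_steps

# Simulate Wiener paths on [0, 1] and form the bridge N_t = W_t - t W_1.
dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)              # W at t = dt, 2dt, ..., 1
t = np.arange(1, n_steps + 1) * dt
N = W - t * W[:, -1:]                  # Brownian bridge, N_0 = N_1 = 0

# Single-time variance of the bridge matches gamma^2(t) = t (1 - t).
var_emp = N.var(axis=0)
assert np.max(np.abs(var_emp - t * (1.0 - t))) < 0.01
```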

To proceed, we will make the following assumption on the densities $\rho_0$, $\rho_1$, and the interplay between the measure $\nu$ and the function $I$:

########## {caption="Assumption 4"}

The densities $\rho_0$ and $\rho_1$ are strictly positive elements of $C^2({\mathbb{R}}^d)$ and are such that

$ \int_{{\mathbb{R}}^d} |\nabla \log \rho_0(x)|^2 \rho_0(x) dx < \infty \quad\text{and} \quad \int_{{\mathbb{R}}^d} |\nabla \log \rho_1(x)|^2 \rho_1(x) dx < \infty.\tag{8} $

The measure $\nu$ and the function $I$ are such that

$ \exists M_1, M_2 < \infty \ \ : \ \ {\mathbb{E}}\big[|\partial_t I(t, x_0, x_1)|^4\big] \le M_1; \quad {\mathbb{E}}\big[|\partial^2_t I(t, x_0, x_1)|^2\big] \le M_2, \quad \forall t\in [0, 1],\tag{9} $

where the expectation is taken over $(x_0, x_1)\sim\nu$.

Note that for the interpolant Equation 1, Assumption 4 holds if $\rho_0$ and $\rho_1$ both have finite fourth moments.

**Figure 3:** **Algorithmic implementation.** A simple overview of suggested implementation strategies. For deterministic sampling, a single velocity field $b$ can be learned by minimizing the empirical loss in the top row. For stochastic sampling, the velocity field $b$, along with the denoiser $\eta_z$, can be learned by minimizing the two empirical losses specified in the bottom row. To sample deterministically, off-the-shelf ODE integrators can be used to integrate the probability flow equation. To sample stochastically, the listed SDE can be integrated using standard techniques such as the Euler-Maruyama method or the Heun sampler introduced in [10]. The time-dependent diffusion coefficient $\epsilon(t)$ can be specified *after learning* to maximize sample quality.

2.2 Transport equations, score, and quadratic objectives

We now state a result that specifies some important properties of the probability distribution of the stochastic interpolant $x_t$:

########## {caption="Theorem 5: Stochastic interpolant properties"}

The probability distribution of the stochastic interpolant $x_t$ defined in Equation 2 is absolutely continuous with respect to the Lebesgue measure at all times $t\in[0, 1]$ and its time-dependent density $\rho(t)$ satisfies $\rho(0) = \rho_0$, $\rho(1) = \rho_1$, $\rho \in C^1([0, 1];C^p({\mathbb{R}}^d))$ for any $p\in {\mathbb{N}}$, and $\rho(t, x) > 0$ for all $(t, x) \in [0, 1]\times {\mathbb{R}}^d$. In addition, $\rho$ solves the transport equation

$ \partial_t \rho + \nabla \cdot \left(b\rho\right) = 0,\tag{10} $

where we defined the velocity

$ b(t, x) = {\mathbb{E}} [\dot{x}_t| x_t = x] = {\mathbb{E}} [\partial_t I(t, x_0, x_1) + \dot{\gamma}(t) z| x_t = x].\tag{11} $

This velocity is in $C^0([0, 1];(C^p({\mathbb{R}}^d))^d)$ for any $p\in {\mathbb{N}}$, and such that

$ \forall t\in [0, 1] \quad : \quad \int_{{\mathbb{R}}^d} |b(t, x)|^2 \rho(t, x) dx < \infty.\tag{12} $

Note that this theorem means that we can write Equation 5 as

$ \forall t \in [0, 1] \quad : \quad \int_{{\mathbb{R}}^d} \phi(x) \rho(t, x)dx = {\mathbb{E}} \phi(x_t) \quad \text{for any test function} \quad \phi \in C^\infty_0({\mathbb{R}}^d).\tag{13} $

The transport equation 10 can be solved either forward in time from the initial condition $\rho(0)=\rho_0$, in which case $\rho(1)=\rho_1$, or backward in time from the final condition $\rho(1)=\rho_1$, in which case $\rho(0)=\rho_0$.

The proof of Theorem 5 is given in Appendix B.1; it mostly relies on manipulations involving the characteristic function of the stochastic interpolant $x_t$. The transport equation 10 for $\rho$ leads to methods for generative modeling and density estimation, as explained in Sections 2.3 and 2.5, provided that we can estimate the velocity $b$. This velocity is explicitly available only in special cases, for example when $\rho_0$ and $\rho_1$ are both Gaussian mixture densities: this case is treated in Appendix A. In general $b$ must be calculated numerically, which can be performed via empirical risk minimization of a quadratic objective function, as characterized by our next result:

########## {caption="Theorem 6: Objective"}

The velocity $b$ defined in Equation 11 is the unique minimizer in $C^0([0, 1];(C^1({\mathbb{R}}^d))^d)$ of the quadratic objective

$ \mathcal{L}_b[\hat{b}] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{b}(t, x_t)|^2 - \left(\partial_t I(t, x_0, x_1) + \dot{\gamma}(t) z \right) \cdot \hat{b}(t, x_t) \right) dt\tag{14} $

where $x_t$ is defined in Equation 2 and the expectation is taken independently over $(x_0, x_1)\sim\nu$ and $z\sim {\sf N}(0, \text{\it Id}).$

The proof of Theorem 6 is given in Appendix B.1: it relies on the definitions of $b$ in Equation 11, as well as the definition of $\rho$ in Equation 13 and some elementary properties of the conditional expectation. We discuss how to estimate the objective function Equation 14 in practice in Section 6. Interestingly, we also have access to the score of the probability density, as shown by our next result:
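To make the empirical-risk view concrete, here is a Monte Carlo sketch of the objective in Equation 14 in a one-dimensional Gaussian example: $\rho_0 = {\sf N}(0,1)$, $\rho_1={\sf N}(m,1)$, independent coupling, linear interpolant, and $\gamma(t)=\sqrt{t(1-t)}$. In this setting $b$ can be derived in closed form by Gaussian conditioning (a derivation done here for illustration, not taken from the text); consistent with Theorem 6, the true velocity attains a smaller objective value than a perturbed one.

```python
import numpy as np

m = 2.0  # mean of rho_1 = N(m, 1); rho_0 = N(0, 1), independent coupling

def gamma(t):      return np.sqrt(t * (1.0 - t))
def gamma_dot(t):  return (1.0 - 2.0 * t) / (2.0 * np.sqrt(t * (1.0 - t)))

def b_true(t, x):
    # E[x_dot | x_t = x] for this Gaussian setting, by Gaussian conditioning:
    # Var(x_t) = 1 - t + t^2 and Cov(x_dot, x_t) = (2t - 1)/2.
    var = 1.0 - t + t * t
    return m + (2.0 * t - 1.0) / (2.0 * var) * (x - t * m)

def loss_b(b_hat, rng, n=20_000, n_t=50):
    """Monte Carlo estimate of the objective in Equation 14 on a midpoint t-grid."""
    ts = (np.arange(n_t) + 0.5) / n_t
    total = 0.0
    for t in ts:
        x0 = rng.normal(size=n)
        x1 = rng.normal(size=n) + m
        z = rng.normal(size=n)
        xt = (1.0 - t) * x0 + t * x1 + gamma(t) * z   # Equation 2
        xdot = (x1 - x0) + gamma_dot(t) * z           # time derivative of x_t
        bt = b_hat(t, xt)
        total += np.mean(0.5 * bt**2 - xdot * bt)
    return total / n_t

l_opt = loss_b(b_true, np.random.default_rng(2))
l_off = loss_b(lambda t, x: b_true(t, x) + 0.5, np.random.default_rng(2))
assert l_opt < l_off   # the true velocity attains the smaller objective value
```

Matching seeds are used for the two evaluations so that the comparison uses common random numbers.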

########## {caption="Theorem 7: Score"}

The score of the probability density $\rho$ specified in Theorem 5 is in $C^1([0, 1];(C^p({\mathbb{R}}^d))^d)$ for any $p\in {\mathbb{N}}$ and given by

$ s(t, x) = \nabla \log\rho(t, x) = - \gamma^{-1}(t) {\mathbb{E}}(z |x_t=x) \quad \forall (t, x)\in(0, 1)\times {\mathbb{R}}^d\tag{15} $

In addition it satisfies

$ \forall t\in [0, 1] \quad : \quad \int_{{\mathbb{R}}^d} |s(t, x)|^2 \rho(t, x) dx < \infty,\tag{16} $

and is the unique minimizer in $C^1([0, 1];(C^1({\mathbb{R}}^d))^d)$ of the quadratic objective

$ \mathcal{L}_s[\hat{s}] = \int_0^1 {\mathbb{E}}\left(\tfrac12|\hat{s}(t, x_t)|^2 +\gamma^{-1}(t) z\cdot \hat{s}(t, x_t) \right) dt\tag{17} $

where $x_t$ is defined in Equation 2 and the expectation is taken independently over $(x_0, x_1)\sim\nu$ and $z\sim {\sf N}(0, \text{\it Id})$

The proof of Theorem 7 is given in Appendix B.1. We stress that the objective function is well defined despite the fact that $\gamma(0)=\gamma(1) =0$: see Section 6 for more details about how to evaluate this objective in practice.

########## {caption="Remark 8: Denoiser"}

The quantity

$ \eta_z(t, x) = {\mathbb{E}}(z | x_t = x),\tag{18} $

will be referred to as the denoiser, for reasons that will be made clear in Section 5.2. By Equation 15, this quantity gives access to the score on $t\in(0, 1)$ (where $\gamma(t)>0$) via

$ s(t, x) = - \gamma^{-1}(t) \eta_z(t, x). $

This denoiser is the minimizer of an equivalent expression to 17,

$ \mathcal{L}_{\eta_z}[\hat{\eta}_z] = \int_0^1 {\mathbb{E}}\left(\tfrac12|\hat{\eta}_z(t, x_t)|^2 - z\cdot \hat{\eta}_z(t, x_t) \right) dt.\tag{19} $

The denoiser $\eta_z$ is useful for numerical realizations. In particular, the objective in Equation 19 is easier to use than the one in Equation 17 because it does not contain the factor $\gamma^{-1}(t)$, which needs careful handling as $t$ approaches 0 and 1.
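The advantage of Equation 19 can be seen in a small sketch. In the same illustrative one-dimensional Gaussian setting as above ($\rho_0={\sf N}(0,1)$, $\rho_1={\sf N}(m,1)$, independent coupling, $\gamma(t)=\sqrt{t(1-t)}$), Gaussian conditioning gives the closed-form denoiser $\eta_z(t,x) = \gamma(t)(x-tm)/(1-t+t^2)$ (derived here for illustration). It minimizes a Monte Carlo estimate of the objective, and no $\gamma^{-1}(t)$ factor appears anywhere in the computation.

```python
import numpy as np

m = 2.0  # rho_0 = N(0, 1), rho_1 = N(m, 1), independent coupling

def eta_true(t, x):
    # E[z | x_t = x] in this Gaussian setting: Cov(z, x_t) = gamma(t),
    # Var(x_t) = 1 - t + t^2.
    return np.sqrt(t * (1.0 - t)) * (x - t * m) / (1.0 - t + t * t)

def loss_eta(eta_hat, rng, n=20_000, n_t=50):
    """Monte Carlo estimate of the denoiser objective in Equation 19."""
    ts = (np.arange(n_t) + 0.5) / n_t
    total = 0.0
    for t in ts:
        x0, x1, z = rng.normal(size=n), rng.normal(size=n) + m, rng.normal(size=n)
        xt = (1.0 - t) * x0 + t * x1 + np.sqrt(t * (1.0 - t)) * z
        et = eta_hat(t, xt)
        total += np.mean(0.5 * et**2 - z * et)   # no gamma^{-1} factor anywhere
    return total / n_t

l_opt = loss_eta(eta_true, np.random.default_rng(3))
l_off = loss_eta(lambda t, x: eta_true(t, x) + 0.5, np.random.default_rng(3))
assert l_opt < l_off
```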

Having access to the score immediately allows us to rewrite the TE Equation 10 as forward and backward Fokker-Planck equations, which we state as:

########## {caption="Corollary 9: Fokker-Planck equations"}

For any $\epsilon\in C^0([0, 1])$ with $\epsilon(t)\ge 0$ for all $t\in[0, 1]$, the probability density $\rho$ specified in Theorem 5 satisfies:

  1. The forward Fokker-Planck equation

$ \partial_t \rho + \nabla \cdot \left(b_{\mathsf{F}}\rho\right) = \epsilon(t) \Delta \rho, \qquad \rho(0) = \rho_0,\tag{20} $

where we defined the forward drift

$ b_\mathsf{F}(t, x) = b(t, x) + \epsilon(t) s(t, x).\tag{21} $

Equation 20 is well-posed when solved forward in time from $t=0$ to $t=1$, and its solution for the initial condition $\rho(t=0) = \rho_0$ satisfies $\rho(t=1) = \rho_1$.

  2. The backward Fokker-Planck equation

$ \partial_t \rho + \nabla \cdot \left(b_\mathsf{B}\rho\right) = -\epsilon(t) \Delta \rho, \qquad \rho(1) = \rho_1,\tag{22} $

where we defined the backward drift

$ b_\mathsf{B}(t, x) = b(t, x) - \epsilon(t) s(t, x).\tag{23} $

Equation 22 is well-posed when solved backward in time from $t=1$ to $t=0$, and its solution for the final condition $\rho(1) = \rho_1$ satisfies $\rho(0) = \rho_0$.

In Section 2.3 we will use the results of this theorem to design generative models based on forward and backward stochastic differential equations. Note that we can replace the diffusion coefficient $\epsilon(t)$ by a positive semi-definite tensor; also note that if we define $\rho_\mathsf{B}(t_\mathsf{B}, x) = \rho(1-t_\mathsf{B}, x)$, the reversed FPE Equation 22 can be written as

$ \partial_{t_\mathsf{B}} \rho_\mathsf{B}({t_\mathsf{B}}, x) - \nabla \cdot \left(b_\mathsf{B}(1-{t_\mathsf{B}}, x)\rho_\mathsf{B}({t_\mathsf{B}}, x)\right) = \epsilon(1-t_\mathsf{B}) \Delta \rho_\mathsf{B}({t_\mathsf{B}}, x), \qquad \rho_\mathsf{B}({t_\mathsf{B}}=0) = \rho_1,\tag{24} $

which is now well-posed forward in (reversed) time $t_\mathsf{B}$. So as to have only one definition of time $t$, it is more convenient to work with Equation 22.

Let us make a few remarks about the statements made so far:

########## {caption="Remark"}

If we set $\gamma(t)=0$ in $x_t$ (i.e., if we remove the latent variable), the stochastic interpolant Equation 2 reduces to the one originally considered in [1]. In this setup, the results above formally stand, except that we can no longer guarantee the spatial regularity of $b(t, x)$ and $s(t, x)$, since the proof of this regularity relies on the presence of the latent variable (as shown in the proof of Theorem 5). Hence, we expect the introduction of the latent variable $\gamma(t) z$ to help for generative modeling, where the solution to the corresponding ODEs/SDEs will be better behaved, and for statistical approximation, since the targets $b$ and $s$ will be more regular. We will see in Section 6 that it also gives us much greater flexibility in the way we can bridge $\rho_0$ and $\rho_1$, which will enable us to design generative models with appealing properties.

########## {caption="Remark"}

We will see in Section 2.4 that the forward and backward FPE in Equation 20 and 22 are more robust than the TE in Equation 10 against approximation errors in the velocity $b$ and the score $s$, which has practical implications for generative models based on these equations.

########## {caption="Remark 10"}

We could also obtain $b(t, \cdot)$ at any $t\in[0, 1]$ by minimizing

$ {\mathbb{E}} \left(\tfrac12|\hat{b}(t, x_t)|^2 - \left(\partial_t I(t, x_0, x_1)+ \dot{\gamma}(t) z\right) \cdot \hat{b}(t, x_t) \right) \qquad t\in [0, 1]\tag{25} $

and $s(t, \cdot)$ at any $t\in(0, 1)$ by minimizing

$ {\mathbb{E}} \left(\tfrac12|\hat{s}(t, x_t)|^2 +\gamma^{-1}(t) z\cdot \hat{s}(t, x_t) \right) \qquad t\in (0, 1)\tag{26} $

Using the time-integrated versions of these objectives given in Equation 14 and 17 is more convenient numerically as it allows one to parameterize $\hat{b}$ and $\hat{s}$ globally for $(t, x)\in [0, 1]\times {\mathbb{R}}^d$.

########## {caption="Remark 11"}

From Equation 11 we can write

$ b(t, x) = v(t, x) - \dot{\gamma}(t) \gamma(t) s(t, x),\tag{27} $

where $s$ is the score given in Equation 15 and we defined the velocity field

$ v(t, x) = {\mathbb{E}} (\partial_t I(t, x_0, x_1) | x_t = x).\tag{28} $

The velocity field $v \in C^0([0, 1];(C^p({\mathbb{R}}^d))^d)$ for any $p\in {\mathbb{N}}$ and can be characterized as the unique minimizer of

$ \mathcal{L}_v[\hat{v}] = \int_0^1 {\mathbb{E}}\left(\tfrac12|\hat{v}(t, x_t)|^2 -\partial_tI(t, x_0, x_1) \cdot \hat{v}(t, x_t) \right) dt\tag{29} $

Learning this velocity and the score separately may be useful in practice.
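The decomposition in Equation 27 can be verified pointwise in a one-dimensional Gaussian example ($\rho_0={\sf N}(0,1)$, $\rho_1={\sf N}(m,1)$, independent coupling, $\gamma(t)=\sqrt{t(1-t)}$), where $b$, $v$, and $s$ are all available in closed form by Gaussian conditioning; the formulas below are derived for this illustration and are not taken from the text.

```python
import numpy as np

m = 2.0
t = np.linspace(0.05, 0.95, 19)[:, None]   # interior times
x = np.linspace(-4.0, 6.0, 41)[None, :]    # spatial grid
var = 1.0 - t + t * t                      # Var(x_t) in this Gaussian setting

v = m + (2.0 * t - 1.0) / var * (x - t * m)          # Equation 28: E[x1 - x0 | x_t]
s = -(x - t * m) / var                               # score, Equation 15
gg = (1.0 - 2.0 * t) / 2.0                           # gamma(t) * gamma_dot(t)
b = m + (2.0 * t - 1.0) / (2.0 * var) * (x - t * m)  # velocity, Equation 11

assert np.max(np.abs(b - (v - gg * s))) < 1e-12      # Equation 27 holds pointwise
```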

########## {caption="Remark"}

The objectives in Equation 14 and 17 (as well as the ones in Equation 19 and 29) are amenable to empirical estimation if we have samples $(x_0, x_1)\sim\nu$, since in that case we can generate samples of $x_t= I(t, x_0, x_1) + \gamma(t) z$ at any time $t\in[0, 1]$. We will use this feature in the numerical experiments presented below.

########## {caption="Remark 12"}

Since $s$ is the score of $\rho$, an alternative objective to estimate it is ([25])

$ \int_0^1 {\mathbb{E}}\left(|\hat{s}(t, x_t)|^2 +2\nabla \cdot \hat{s}(t, x_t) \right)dt.\tag{30} $

The derivation of Equation 30 is standard: for the reader's convenience we recall it at the end of Appendix B.1. The advantage of using Equation 17 over Equation 30 is that it does not require us to take the divergence of $\hat{s}$.

########## {caption="Remark 13: Energy-based models"}

By definition, the score $s(t, x)=\nabla \log \rho(t, x)$ is a gradient field. As a result, if we model $\hat{s}(t, x) = -\nabla \hat{E}(t, x) $, we can turn Equation 17 into an objective function for $\hat{E}(t, x)$

$ \mathcal{L}_E[\hat{E}] = \int_0^1 {\mathbb{E}}\left(\tfrac12|\nabla \hat{E}(t, x_t)|^2 +\gamma^{-1}(t) z\cdot \nabla\hat{E}(t, x_t) \right) dt\tag{31} $

This objective is invariant to constant shifts in $\hat{E}$ and should therefore be minimized under some constraint, such as $\min_x \hat{E}(t, x) =0$ for all $t\in [0, 1]$. The minimizer of Equation 31 provides us with an energy-based model (EBM) ([57, 58]) that can in principle be used to sample the PDF of the stochastic interpolant, $\rho(t, x)$, at any fixed $t\in [0, 1]$ using e.g. Langevin dynamics. We will not exploit this possibility here, and instead rely on generative models to sample $\rho(t, x)$, as discussed next in Section 2.3.

2.3 Generative models

Our next result is a direct consequence of Theorem 5, and it shows how to design generative models using the stochastic processes associated with the TE Equation 10, the forward FPE Equation 20, and the backward FPE Equation 22:

########## {caption="Corollary 14: Generative models"}

At any time $t\in[0, 1]$, the law of the stochastic interpolant $x_t$ coincides with the law of the three processes $X_t$, $X^\mathsf{F}_t$, and $X^\mathsf{B}_t$, respectively defined as:

  1. The solutions of the probability flow associated with the transport equation 10

$ \frac{d}{dt} X_t = b(t, X_t),\tag{32} $

solved either forward in time from the initial data $X_{t=0} \sim\rho_0$ or backward in time from the final data $X_{t=1} = x_1\sim\rho_1$.

  2. The solutions of the forward SDE associated with the FPE Equation 20

$ dX^\mathsf{F}_t = b_\mathsf{F}(t, X^\mathsf{F}_t)dt + \sqrt{2 \epsilon(t)} \, dW_t,\tag{33} $

solved forward in time from the initial data $X^\mathsf{F}_{t=0}\sim\rho_0$ independent of $W$.

  3. The solutions of the backward SDE associated with the backward FPE Equation 22

$ dX^\mathsf{B}_t = b_\mathsf{B}(t, X^\mathsf{B}_t)dt + \sqrt{2 \epsilon(t)} \, dW^\mathsf{B}_t, \quad W_t^\mathsf{B} = -W_{1-t},\tag{34} $

solved backward in time from the final data $X^\mathsf{B}_{t=1}\sim\rho_1$ independent of $W^\mathsf{B}$; the solution of Equation 34 is by definition $X^\mathsf{B}_{t}= Z^\mathsf{F}_{1-t}$ where $Z^\mathsf{F}_t$ satisfies

$ dZ^\mathsf{F}_t = -b_\mathsf{B}(1-t, Z^\mathsf{F}_t)dt+ \sqrt{2 \epsilon(1-t)} \, dW_t,\tag{35} $

solved forward in time from the initial data $Z^\mathsf{F}_{t=0}\sim \rho_1$ independent of $W$.
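As an illustration of item 1 of Corollary 14, the sketch below integrates the probability flow ODE Equation 32 with forward Euler in a one-dimensional Gaussian example where $b$ is known in closed form (a derivation done here for illustration, not from the text): samples from $\rho_0={\sf N}(0,1)$ are pushed forward to approximate samples from $\rho_1={\sf N}(m,1)$.

```python
import numpy as np

m = 2.0  # rho_0 = N(0, 1), rho_1 = N(m, 1); gamma(t) = sqrt(t(1-t))

def b(t, x):
    # Analytic velocity (Equation 11) for this Gaussian setting,
    # obtained by Gaussian conditioning: Var(x_t) = 1 - t + t^2.
    var = 1.0 - t + t * t
    return m + (2.0 * t - 1.0) / (2.0 * var) * (x - t * m)

rng = np.random.default_rng(4)
X = rng.normal(size=20_000)        # X_0 ~ rho_0
n_steps = 1000
dt = 1.0 / n_steps
for k in range(n_steps):           # forward Euler for the probability flow ODE (32)
    t = k * dt
    X = X + dt * b(t, X)

# The pushed-forward samples should be approximately N(m, 1).
assert abs(X.mean() - m) < 0.05
assert abs(X.var() - 1.0) < 0.1
```

An off-the-shelf higher-order integrator could replace the Euler loop; the fixed step size here is only for transparency.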

To avoid repeated applications of the transformation $t\mapsto1-t$, it is convenient to work with Equation 34 directly using the reversed Itô calculus rules stated in the following lemma, which follows from the results in [59] and is proven in Appendix B.2:

########## {caption="Lemma 15: Reverse Itô Calculus"}

If $X^\mathsf{B}_t $ solves the backward SDE Equation 34:

  1. For any $f\in C^1([0, 1];C_0^2({\mathbb{R}}^d))$ and $t\in[0, 1]$, the backward Itô formula holds

$ df(t, X^\mathsf{B}_t) = \partial_t f(t, X^\mathsf{B}_t) dt+ \nabla f(t, X^\mathsf{B}_t) \cdot dX^\mathsf{B}_t - \epsilon(t) \Delta f(t, X^\mathsf{B}_t) dt.\tag{36} $

  2. For any $g\in C^0([0, 1];(C_0({\mathbb{R}}^d))^d)$ and $t\in[0, 1]$, the backward Itô isometries hold:

$ {\mathbb{E}}^x_\mathsf{B}\int_t^1 g(\tau, X^\mathsf{B}_\tau) \cdot dW^\mathsf{B}_\tau = 0; \qquad {\mathbb{E}}^x_\mathsf{B}\left|\int_t^1 g(\tau, X^\mathsf{B}_\tau)\cdot dW^\mathsf{B}_\tau\right|^2 = \int_t^1 {\mathbb{E}}^x_\mathsf{B}\left|g(\tau, X^\mathsf{B}_\tau)\right|^2 d\tau,\tag{37} $

where ${\mathbb{E}}^x_\mathsf{B}$ denotes expectation conditioned on the event $X_{t=1}^\mathsf{B} = x$.

The relevance of Corollary 14 for generative modeling is clear. Assuming, for example, that $\rho_0$ is a simple density that can be sampled easily (e.g. a Gaussian or a Gaussian mixture density), we can use the ODE Equation 32 or the SDE Equation 33 to push these samples forward in time and generate samples from a complex target density $\rho_1$. In Section 2.5, we will show how to use the ODE Equation 32 or the reverse SDE Equation 34 to estimate $\rho_1$ at any $x\in {\mathbb{R}}^d$ assuming that we can evaluate $\rho_0$ at any $x\in {\mathbb{R}}^d$. We will also show how similar ideas can be used to estimate the cross entropy between $\rho_0$ and $\rho_1$.

########## {caption="Remark"}

We stress that the stochastic interpolant $x_t$, the solution $X_t$ to the ODE Equation 32, and the solutions $X^\mathsf{F}_t$ and $X^\mathsf{B}_t$ of the forward and backward SDEs Equation 33 and 34 are different stochastic processes, but their laws all coincide with $\rho(t)$ at any time $t\in[0, 1]$. This is all that matters when applying these processes as generative models. However, the fact that these processes are different has implications for the accuracy of the numerical integration used to sample from them at any $t$ as well as for the propagation of statistical errors (see also the next remark).

Generative models based on solutions $X_t$ to the ODE Equation 32, solutions $X^\mathsf{F}_t$ to the forward SDE Equation 33, and solutions $X^\mathsf{B}_t$ to the backward SDE Equation 34 will typically involve drifts $b$, $b_\mathsf{F}$, and $b_\mathsf{B}$ that are, in practice, imperfectly estimated via minimization of Equation 14 and 17 over finite datasets. It is important to estimate how this statistical estimation error propagates to errors in sample quality, and how the propagation of error depends on the generative model used, which is the object of our next section.

2.4 Likelihood control

In this section, we demonstrate that jointly minimizing the objective functions Equation 29 and 17 (or the losses Equation 14 and 17) controls the $\mathsf{KL}$-divergence from the target density $\rho_1$ to the model density $\hat{\rho}_1$. We focus on bounds involving the score, but we note that analogous results hold for learning the denoiser $\eta_z(t, x)$ defined in Equation 18 by the relation $\eta_z(t, x) = -\gamma(t)\, s(t, x)$. The derivation is based on a simple and exact characterization of the $\mathsf{KL}$-divergence between two transport equations or two Fokker-Planck equations with different drifts. Remarkably, we find that the presence of a diffusive term determines whether or not it is sufficient to learn the drift to control $\mathsf{KL}$. This can be seen as a generalization of the result for score-based diffusion models described in [30] to arbitrary generative models described by ODEs or SDEs. The proofs of the statements in this section are provided in Appendix B.3. These proofs rely on manipulation of the time derivative of the $\mathsf{KL}$ divergence, which is a practice that has proven useful elsewhere in the literature ([60]).

We first characterize the $\mathsf{KL}$ divergence between two densities transported by two different continuity equations but initialized from the same initial condition:

########## {caption="Lemma 16"}

Let $\rho_0: {\mathbb{R}}^d\rightarrow {\mathbb{R}}_{\geq 0}$ denote a fixed base probability density function. Given two velocity fields $b, \hat{b} \in C^0([0, 1], (C^1({\mathbb{R}}^d))^d)$, let the time-dependent densities $\rho: [0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}_{\ge0}$ and $\hat{\rho}: [0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}_{\ge0}$ denote the solutions to the transport equations

$ \begin{aligned} &\partial_t\rho + \nabla \cdot(b \rho) = 0, \qquad &&\rho(0)=\rho_0, \\ &\partial_t\hat\rho + \nabla \cdot(\hat{b}\hat{\rho}) = 0, \qquad &&\hat{\rho}(0)=\rho_0. \end{aligned}\tag{38} $

Then, the Kullback-Leibler divergence of $\rho(1)$ from $\hat\rho(1)$ is given by

$ \mathsf{KL}(\rho(1)\,\Vert\,\hat\rho(1)) = \int_0^1 \int_{{\mathbb{R}}^d} \left(\nabla\log\hat\rho(t, x) - \nabla\log\rho(t, x)\right)\cdot\big(\hat{b}(t, x) - b(t, x)\big)\rho(t, x) dxdt.\tag{39} $

Lemma 16 shows that it is insufficient in general to match $\hat{b}$ with $b$ to obtain control on the $\mathsf{KL}$ divergence. The essence of the problem is that a small error in $\hat{b} - b$ does not ensure control on the Fisher divergence $\mathsf{FI}(\rho(t)\,\Vert\,\hat\rho(t)) = \int_{{\mathbb{R}}^d}\left|\nabla\log\rho(t, x) - \nabla\log\hat{\rho}(t, x)\right|^2\rho(t, x)dx$, which is necessary due to the presence of $\left(\nabla\log\hat\rho - \nabla\log\rho\right)$ in Equation 39.

In the next lemma, we study the case for two Fokker-Planck equations, and highlight that the situation becomes quite different.

########## {caption="Lemma 17"}

Let $\rho_0: {\mathbb{R}}^d\rightarrow {\mathbb{R}}_{\geq 0}$ denote a fixed base probability density function. Given two velocity fields $b_\mathsf{F}, \hat{b}_\mathsf{F} \in C^0([0, 1], (C^1({\mathbb{R}}^d))^d)$, let the time-dependent densities $\rho: [0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}_{\ge0}$ and $\hat{\rho}: [0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}_{\ge0}$ denote the solutions to the Fokker-Planck equations

$ \begin{aligned} &\partial_t\rho + \nabla \cdot(b_\mathsf{F} \rho) = \epsilon\Delta \rho, \qquad &&\rho(0)=\rho_0, \\ &\partial_t\hat\rho + \nabla \cdot(\hat{b}_\mathsf{F}\hat{\rho}) = \epsilon \Delta \hat{\rho}, \qquad &&\hat{\rho}(0)=\rho_0. \end{aligned}\tag{40} $

where $\epsilon>0$. Then, the Kullback-Leibler divergence from $\rho(1)$ to $\hat\rho(1)$ is given by

$ \begin{aligned} \mathsf{KL}(\rho(1)\,\Vert\,\hat\rho(1)) &= \int_0^1\int_{{\mathbb{R}}^d} \left(\nabla\log\hat\rho(t, x) - \nabla\log\rho(t, x)\right)\cdot \left(\hat{b}_\mathsf{F}(t, x) - b_\mathsf{F}(t, x)\right)\rho(t, x) dxdt \\ &\qquad -\epsilon\int_0^1 \int_{{\mathbb{R}}^d} \left|\nabla\log\rho(t, x) - \nabla\log\hat{\rho}(t, x)\right|^2\rho(t, x)dx dt, \end{aligned} $

and as a result

$ \mathsf{KL}(\rho(1)\,\Vert\,\hat\rho(1)) \leq \frac{1}{4 \epsilon}\int_0^1 \int_{{\mathbb{R}}^d} \left|\hat{b}_\mathsf{F}(t, x) - b_\mathsf{F}(t, x)\right|^2\rho(t, x) dxdt. $

Lemma 17 shows that, unlike for transport equations, the $\mathsf{KL}$-divergence between the solutions of two Fokker-Planck equations is controlled by the error in their drifts. The diffusive term in each Fokker-Planck equation provides an additional negative term in the $\mathsf{KL}$-divergence, which eliminates the need for explicit control on the Fisher divergence.

Putting the above results together, we can state the following result, which demonstrates that the losses Equation 14 and 17 control the likelihood for learned approximations to the FPE Equation 20.

########## {caption="Theorem 18"}

Let $\rho$ denote the solution of the Fokker-Planck equation 20 with $\epsilon(t)= \epsilon>0$. Given two velocity fields $\hat{b}, \hat{s} \in C^0([0, 1], (C^1({\mathbb{R}}^d))^d)$, define

$ \hat{b}_\mathsf{F}(t, x) = \hat{b}(t, x) + \epsilon \hat{s}(t, x), \qquad \hat{v}(t, x) = \hat{b}(t, x) + \gamma(t) \dot{\gamma}(t) \hat{s}(t, x)\tag{41} $

where the function $\gamma$ satisfies the properties listed in Definition 1. Let $\hat{\rho}$ denote the solution to the Fokker-Planck equation

$ \partial_t \hat{\rho} + \nabla \cdot (\hat{b}_\mathsf{F}\hat{\rho}) = \epsilon \Delta \hat{\rho}, \qquad \hat{\rho}(0) = \rho_0. $

Then,

$ \mathsf{KL}(\rho_1\,\Vert\,\hat{\rho}(1)) \leq \frac{1}{2 \epsilon}\left(\mathcal{L}_b[\hat{b}] - \min_{\hat{b}}\mathcal{L}_b[\hat{b}]\right) + \frac{\epsilon}{2}\left(\mathcal{L}_s[\hat{s}] - \min_{\hat{s}}\mathcal{L}_s[\hat{s}]\right),\tag{42} $

where $\mathcal{L}_b[\hat{b}]$ and $\mathcal{L}_s[\hat{s}]$ are the objective functions defined in Equation 14 and 17, and

$ \mathsf{KL}(\rho_1\,\Vert\,\hat\rho(1)) \leq \frac{1}{2 \epsilon}\left(\mathcal{L}_{v}[\hat{v}] - \min_{\hat{v}}\mathcal{L}_{v}[\hat{v}]\right) + \frac{\sup_{t\in [0, 1]}(\gamma(t)\dot\gamma(t) - \epsilon)^2}{2 \epsilon}\left(\mathcal{L}_{s}[\hat{s}] - \min_{\hat{s}}\mathcal{L}_{s}[\hat{s}]\right), $

where $\mathcal{L}_v[\hat{v}]$ is the objective function defined in Equation 29.

########## {caption="Remark 19: Generative modeling"}

The above results have practical ramifications for generative modeling. In particular, they show that minimizing either the losses Equation 14 and 17 or the losses Equation 29 and 17 maximizes the likelihood of the stochastic generative model

$ d\hat{X}_t^\mathsf{F} = \left(\hat{b}(t, \hat{X}_t^\mathsf{F}) + \epsilon\hat{s}(t, \hat{X}_t^\mathsf{F})\right)dt + \sqrt{2 \epsilon}dW_t, $

but that minimizing the objective Equation 14 is insufficient in general to maximize the likelihood of the deterministic generative model

$ \dot{\hat{X}}_t = \hat{b}(t, \hat{X}_t). $

Moreover, they show that, when learning $\hat{b}$ and $\hat{s}$, the choice of $\epsilon$ that minimizes the upper bound is given by

$ \epsilon^* = \left(\frac{\mathcal{L}_b[\hat{b}] - \min_{\hat{b}}\mathcal{L}_b[\hat{b}]}{\mathcal{L}_s[\hat{s}] - \min_{\hat{s}}\mathcal{L}_s[\hat{s}]}\right)^{1/2},\tag{43} $

so that $\epsilon^* > 1$ if the score is learned to higher accuracy than $\hat{b}$ and $\epsilon^*<1$ in the opposite situation. Note that Equation 43 suggests taking $\epsilon= 0$ if $\hat{b}$ is learned perfectly but $\hat{s}$ is not, and sending $\epsilon \to \infty$ in the opposite situation. While taking $\epsilon=0$ is achievable in practice and leads to the ODE Equation 32, taking $\epsilon\to\infty$ is not, as increasing $\epsilon$ increases the expense of the numerical integration in Equation 33 and 34.
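A toy numerical reading of Equation 43, with purely illustrative excess-loss values standing in for validation estimates of $\mathcal{L}-\min\mathcal{L}$:

```python
import math

# Illustrative excess losses (values of L - min L); in practice these would be
# estimated on held-out data, and the minima are not known exactly.
excess_b = 0.005   # velocity learned accurately
excess_s = 0.080   # score learned less accurately

eps_star = math.sqrt(excess_b / excess_s)   # Equation 43
assert eps_star < 1.0   # less diffusion when the score is the weaker estimate
```

Here the score is the weaker estimate, so the bound favors a small diffusion coefficient, in line with the discussion above.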

2.5 Density estimation and cross-entropy calculation

It is well-known that the solution of the TE Equation 10 can be expressed in terms of the solution to the probability flow ODE Equation 32; for completeness, we now recall this fact:

########## {caption="Lemma 20"}

Given the velocity field $\hat{b} \in C^0([0, 1], (C^1({\mathbb{R}}^d))^d)$, let $\hat\rho$ satisfy the transport equation

$ \partial_t \hat\rho + \nabla \cdot(\hat{b} \hat{\rho}) = 0,\tag{44} $

and let $X_{s, t}(x)$ solve the ODE

$ \frac{d}{dt} X_{s, t}(x) = \hat{b}(t, X_{s, t}(x)), \qquad X_{s, s}(x) = x, \qquad t, s \in [0, 1].\tag{45} $

Then, given the PDFs $\rho_0$ and $\rho_1$:

  1. The solution to 44 for the initial condition $\hat{\rho}(0) = \rho_0$ is given at any time $t\in[0, 1]$ by

$ \hat{\rho}(t, x) = \exp\left(- \int_0^t \nabla \cdot \hat{b}(\tau, X_{t, \tau}(x)) d\tau \right) \rho_0(X_{t, 0}(x))\tag{46} $

  2. The solution to 44 for the final condition $\hat{\rho}(1) = \rho_1$ is given at any time $t\in[0, 1]$ by

$ \hat{\rho}(t, x) = \exp\left(\int_t^1 \nabla \cdot \hat{b}(\tau, X_{t, \tau}(x)) d\tau \right) \rho_1(X_{t, 1}(x))\tag{47} $

The proof of Lemma 20 can be found in Appendix B.4. Interestingly, we can obtain a similar result for the solution of the forward and backward FPEs in Equation 20 and 22. These results make use of auxiliary forward and backward SDEs in which the roles of the forward and backward drifts are switched:

########## {caption="Theorem 21"}

Given $\epsilon>0$ and two velocity fields $\hat{b}, \hat{s} \in C^0([0, 1], (C^1({\mathbb{R}}^d))^d)$, define

$ \hat{b}_\mathsf{F}(t, x) = \hat{b}(t, x) + \epsilon \hat{s}(t, x), \qquad \hat{b}_\mathsf{B}(t, x) = \hat{b}(t, x) - \epsilon \hat{s}(t, x),\tag{48} $

and let $Y^\mathsf{F}_t$ and $Y^\mathsf{B}_t$ denote solutions of the following forward and backward SDEs:

$ dY^\mathsf{F}_t = \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t)dt + \sqrt{2\epsilon}\, dW_t,\tag{49} $

to be solved forward in time from the initial condition $Y^\mathsf{F}_{t=0}=x$ independent of $W$; and

$ dY^\mathsf{B}_t = \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t)dt + \sqrt{2\epsilon}\, dW^\mathsf{B}_t, \quad W_t^\mathsf{B} = -W_{1-t},\tag{50} $

to be solved backwards in time from the final condition $Y^\mathsf{B}_{t=1}=x$ independent of $W^\mathsf{B}$. Then, given the densities $\rho_0$ and $\rho_1$:

  1. The solution to the forward FPE

$ \partial_t \hat{\rho}_\mathsf{F} + \nabla \cdot (\hat{b}_\mathsf{F} \hat{\rho}_\mathsf{F}) = \epsilon \Delta \hat{\rho}_\mathsf{F}, \qquad \hat{\rho}_\mathsf{F} (0) = \rho_0,\tag{51} $

can be expressed at $t=1$ as

$ \hat{\rho}_\mathsf{F} (1, x) = {\mathbb{E}}_\mathsf{B}^x \left(\exp\left(- \int_0^1\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t) dt\right) \rho_0(Y_{t=0}^\mathsf{B})\right),\tag{52} $

where ${\mathbb{E}}_\mathsf{B}^x$ denotes expectation on the path of $Y_t^\mathsf{B}$ conditional on the event $Y_{t=1}^\mathsf{B} = x$.

  2. The solution to the backward FPE

$ \partial_t \hat{\rho}_\mathsf{B}+ \nabla \cdot (\hat{b}_\mathsf{B} \hat{\rho}_\mathsf{B}) = -\epsilon \Delta \hat{\rho}_\mathsf{B}, \qquad \hat{\rho}_\mathsf{B}(1) = \rho_1,\tag{53} $

can be expressed at $t=0$ as

$ \hat{\rho}_\mathsf{B}(0, x) = {\mathbb{E}}_\mathsf{F}^x \left(\exp\left(\int_0^1\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t) dt\right)\rho_1(Y^\mathsf{F}_{t=1})\right),\tag{54} $

where ${\mathbb{E}}_\mathsf{F}^x$ denotes expectation on the path of $Y^\mathsf{F}_t$ conditional on $Y^\mathsf{F}_{t=0} = x$.

The proof of Theorem 21 can be found in Appendix B.4. Note that to generate data from either $\hat{\rho}_\mathsf{F}(1)$ or $\hat{\rho}_\mathsf{B}(0)$, assuming that we can sample exactly the PDF at the other end (i.e. $\rho_0$ and $\rho_1$, respectively), we would still rely on the equivalent of the forward and backward SDEs in Equations 33 and 34, now used with the approximate drifts in Equation 48, i.e.

$ d\hat{X}^\mathsf{F}_t = \hat{b}_\mathsf{F}(t, \hat{X}^\mathsf{F}_t)dt + \sqrt{2\epsilon}\, dW_t,\tag{55} $

and

$ d\hat{X}^\mathsf{B}_t = \hat{b}_\mathsf{B}(t, \hat{X}^\mathsf{B}_t)dt + \sqrt{2\epsilon}\, dW^\mathsf{B}_t, \quad W_t^\mathsf{B} = -W_{1-t}.\tag{56} $

If we solve Equation 55 forward in time from initial data $\hat{X}^\mathsf{F}_{t=0} \sim \rho_0$, we then have $\hat{X}^\mathsf{F}_{t=1} \sim \hat{\rho}_\mathsf{F}(1)$, where $\hat{\rho}_\mathsf{F}$ is the solution to the forward FPE Equation 51. Similarly, if we solve Equation 56 backward in time from final data $\hat{X}^\mathsf{B}_{t=1} \sim \rho_1$, we then have $\hat{X}^\mathsf{B}_{t=0} \sim \hat{\rho}_\mathsf{B}(0)$, where $\hat{\rho}_\mathsf{B}$ is the solution to the backward FPE Equation 53.
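In practice, Equation 55 is integrated numerically, e.g. with the Euler–Maruyama scheme. A minimal sketch (the drift below is a hand-picked stand-in with a known invariant density, used only to check the integrator, not a learned $\hat{b}_\mathsf{F}$):

```python
import numpy as np

def euler_maruyama(b, x0, eps, n_steps, rng):
    """Integrate dX = b(t, X) dt + sqrt(2 eps) dW forward on [0, 1] (Eq. 55 form)."""
    dt = 1.0 / n_steps
    x = x0.copy()
    for k in range(n_steps):
        t = k * dt
        x += b(t, x) * dt + np.sqrt(2 * eps * dt) * rng.standard_normal(x.shape)
    return x

# Sanity check: b(t, x) = -x with eps = 1 leaves N(0, 1) invariant,
# so starting from N(0, 1) samples we should end close to N(0, 1).
rng = np.random.default_rng(2)
x1 = euler_maruyama(lambda t, x: -x, rng.standard_normal(100_000), 1.0, 500, rng)
assert abs(x1.mean()) < 0.02 and abs(x1.var() - 1.0) < 0.03
```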

The results of Lemma 20 and Theorem 21 can be used to test the quality of samples generated by either the ODE Equation 32 or the forward and backward SDEs Equations 33 and 34. In particular, the following two results are direct consequences of Lemma 20 and Theorem 21, respectively:

########## {caption="Corollary 22"}

Under the same conditions as Lemma 20, if $\hat{\rho}(0) = \rho_0$, the cross-entropy of $\hat{\rho}(1)$ relative to $\rho_1$ is given by

$ \begin{aligned} \mathsf{H}(\rho_1\,\Vert\,\hat\rho(1)) &= - \int_{{\mathbb{R}}^d} \log \hat\rho(1, x) \rho_1(x) dx\\ & = {\mathbb{E}}_1 \int_0^1 \nabla \cdot \hat{b}(\tau, X_{1, \tau}(x_1)) d\tau - {\mathbb{E}}_1 \log \rho_0(X_{1, 0}(x_1)) \end{aligned}\tag{57} $

where ${\mathbb{E}}_1$ denotes an expectation over $x_1\sim\rho_1$. Similarly, if $\hat{\rho}(1) = \rho_1$, the cross-entropy of $\hat{\rho}(0)$ relative to $\rho_0$ is given by

$ \begin{aligned} \mathsf{H}(\rho_0\,\Vert\,\hat\rho(0)) &= - \int_{{\mathbb{R}}^d} \log \hat\rho(0, x) \rho_0(x) dx\\ & = -{\mathbb{E}}_0 \int_0^1 \nabla \cdot \hat{b}(\tau, X_{0, \tau}(x_0)) d\tau - {\mathbb{E}}_0 \log \rho_1(X_{0, 1}(x_0)) \end{aligned}\tag{58} $

where ${\mathbb{E}}_0$ denotes an expectation over $x_0\sim\rho_0$.

########## {caption="Corollary 23"}

Under the same conditions as Theorem 21, the cross-entropy of $\hat{\rho}_\mathsf{F}(1)$ relative to $\rho_1$ is given by

$ \begin{aligned} \mathsf{H}(\rho_1\,\Vert\,\hat{\rho}_\mathsf{F}(1)) &= - \int_{{\mathbb{R}}^d} \log \hat{\rho}_\mathsf{F}(1, x) \rho_1(x) dx\\ & = -{\mathbb{E}}_1 \log {\mathbb{E}}_\mathsf{B}^{x_1} \left(\exp\left(- \int_0^1\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t) dt\right) \rho_0(Y_{t=0}^\mathsf{B})\right), \end{aligned}\tag{59} $

where ${\mathbb{E}}_\mathsf{B}^{x_1}$ denotes an expectation over $Y^\mathsf{B}_t$ conditioned on the event $Y^\mathsf{B}_{t=1}=x_1$, and ${\mathbb{E}}_1$ denotes an expectation over $x_1\sim\rho_1$. Similarly, the cross-entropy of $\hat{\rho}_\mathsf{B}(0)$ relative to $\rho_0$ is given by

$ \begin{aligned} \mathsf{H}(\rho_0\,\Vert\,\hat{\rho}_\mathsf{B}(0)) &= - \int_{{\mathbb{R}}^d} \log \hat{\rho}_\mathsf{B}(0, x) \rho_0(x) dx\\ & = -{\mathbb{E}}_0 \log {\mathbb{E}}_\mathsf{F}^{x_0} \left(\exp\left(\int_0^1\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t) dt\right)\rho_1(Y^\mathsf{F}_{t=1})\right), \end{aligned}\tag{60} $

where ${\mathbb{E}}_\mathsf{F}^{x_0}$ denotes an expectation over $Y^\mathsf{F}_t$ conditioned on the event $Y^\mathsf{F}_{t=0}=x_0$, and ${\mathbb{E}}_0$ denotes an expectation over $x_0\sim\rho_0$.

If in Equations 57, 58, 59, and 60 we approximate the expectations ${\mathbb{E}}_0$ and ${\mathbb{E}}_1$ over $\rho_0$ and $\rho_1$ by empirical expectations over the available data, these equations allow us to cross-validate different approximations of $\hat{b}$ and $\hat{s}$, as well as to compare the cross-entropies of densities evolved by the TE Equation 44 with those of the forward and backward FPEs Equations 51 and 53.

########## {caption="Remark 24"}

When using Equations 59 and 60 in practice, taking the $\log$ of the expectations ${\mathbb{E}}_\mathsf{B}^{x_1}$ and ${\mathbb{E}}_\mathsf{F}^{x_0}$ may create difficulties: for example, using Hutchinson's trace estimator to compute the divergence of $\hat{b}_\mathsf{F}$ or $\hat{b}_\mathsf{B}$ introduces a bias once the estimate passes through the $\log$. One way to remove this bias is to use Jensen's inequality, which leads to the upper bounds

$ \begin{aligned} \mathsf{H}(\rho_1\,\Vert\,\hat{\rho}_\mathsf{F}(1)) \le \int_0^1 {\mathbb{E}}_1 {\mathbb{E}}_\mathsf{B}^{x_1}\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t) dt - {\mathbb{E}}_1 {\mathbb{E}}_\mathsf{B}^{x_1} \log \rho_0(Y_{t=0}^\mathsf{B}), \end{aligned}\tag{61} $

and

$ \begin{aligned} \mathsf{H}(\rho_0\,\Vert\,\hat{\rho}_\mathsf{B}(0)) \le -{\mathbb{E}}_0 {\mathbb{E}}_\mathsf{F}^{x_0} \int_0^1\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t) dt- {\mathbb{E}}_0 {\mathbb{E}}_\mathsf{F}^{x_0} \log \rho_1(Y^\mathsf{F}_{t=1}). \end{aligned}\tag{62} $

However, these bounds are not sharp in general. In fact, using calculations similar to those presented in the proof of Corollary 23, we can derive exact expressions that capture precisely what is lost when applying Jensen's inequality:

$ \begin{aligned} \mathsf{H}(\rho_1\,\Vert\,\hat{\rho}_\mathsf{F}(1))= \int_0^1 {\mathbb{E}}_1 {\mathbb{E}}_\mathsf{B}^{x_1}\left(\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t) -\epsilon |\nabla \log \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B})|^2\right) dt - {\mathbb{E}}_1 {\mathbb{E}}_\mathsf{B}^{x_1} \log \rho_0(Y_{t=0}^\mathsf{B}), \end{aligned}\tag{63} $

and

$ \begin{aligned} \mathsf{H}(\rho_0\,\Vert\,\hat{\rho}_\mathsf{B}(0)) = -{\mathbb{E}}_0 {\mathbb{E}}_\mathsf{F}^{x_0} \int_0^1\left(\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t) +\epsilon |\nabla \log \hat{\rho}_\mathsf{B}(t, Y_t^\mathsf{F})|^2\right) dt- {\mathbb{E}}_0 {\mathbb{E}}_\mathsf{F}^{x_0} \log \rho_1(Y^\mathsf{F}_{t=1}). \end{aligned}\tag{64} $

Unfortunately, since $\nabla \log \hat\rho_\mathsf{F} \not= \hat{s}$ and $\nabla \log \hat\rho_\mathsf{B} \not= \hat{s}$ in general due to approximation errors, we do not know how to estimate the extra terms on the right-hand side of Equations 63 and 64. One possibility is to use $\hat{s}$ as a proxy for $\nabla \log \hat{\rho}_\mathsf{F}$ and $\nabla \log \hat\rho_\mathsf{B}$, which may be useful in practice, but this approximation is uncontrolled in general.
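The Hutchinson estimator mentioned in Remark 24 replaces $\nabla\cdot f(x) = \operatorname{tr} J_f(x)$ with $\mathbb{E}_v[v^\mathsf{T} J_f(x) v]$ for random probes $v$. A minimal sketch using Rademacher probes and finite-difference Jacobian-vector products (autodiff would replace the finite difference in practice; the test field is hypothetical):

```python
import numpy as np

def hutchinson_divergence(f, x, n_probes=32, fd_eps=1e-5, seed=0):
    """Unbiased estimate of div f(x) = tr J_f(x) via E[v . (J_f(x) v)]
    with Rademacher probes v and central finite differences for J_f(x) v."""
    rng = np.random.default_rng(seed)
    est = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=x.shape)
        jvp = (f(x + fd_eps * v) - f(x - fd_eps * v)) / (2 * fd_eps)
        est += v @ jvp
    return est / n_probes

# f(x) = x**3 (elementwise) has diagonal Jacobian diag(3 x_i^2), so every
# Rademacher probe returns the exact divergence sum_i 3 x_i^2; for general
# fields the estimate is unbiased but has nonzero variance.
x = np.array([0.3, -0.5])
div = hutchinson_divergence(lambda y: y**3, x)
assert abs(div - 3 * (0.3**2 + 0.5**2)) < 1e-4
```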

3. Instantiations and extensions

Section Summary: This section explores specific instantiations of the stochastic interpolant framework. It introduces diffusive interpolants, which build generative models similar to those in recent studies on diffusive bridge processes, but in a simpler way: they combine a basic path between data points with noise from a Brownian bridge, matching the distribution of the original interpolant at every time while making sampling easier and avoiding Itô calculus. The section also shows how this approach enables generating samples from any target distribution starting from a single fixed point, via an SDE whose drift remains smooth and non-singular throughout.

In this section, we instantiate the stochastic interpolant framework discussed in Section 2.

3.1 Diffusive interpolants

Recently, there has been a surge of interest in the construction of generative models through diffusive bridge processes ([36, 37, 39]). In this section, we connect these approaches with our own, highlighting that stochastic interpolants allow us to manipulate certain bridge processes in a simpler and more direct manner. We also show that this perspective leads to a generative process that samples any target density $\rho_1$ by pushing a point mass at any $x_0\in {\mathbb{R}}^d$ through an SDE. We begin by introducing a new kind of interpolant:

########## {caption="Definition 25: Diffusive interpolant"}

Given two probability density functions $\rho_0, \rho_1 : {{\mathbb{R}}^d} \rightarrow {\mathbb{R}}_{\geq 0}$, a diffusive interpolant between $\rho_0$ and $\rho_1$ is a stochastic process $x^\mathsf{d}_t$ defined as

$ x^\mathsf{d}_t = I(t, x_0, x_1) + \sqrt{2a(t)} B_t, \qquad t\in [0, 1],\tag{65} $

where:

  1. $I(t, x_0, x_1)$ is as in Definition 1;
  2. $(x_0, x_1)\sim \nu$ with $\nu$ satisfying Equation 4 in Definition 1;
  3. $a(t)\in C^2([0, 1])$ with $a(0) >0$ and $a(t)\ge 0 $ for all $t\in(0, 1]$, and;
  4. $B_t$ is a standard Brownian bridge process, independent of $x_0$ and $x_1$.

Pathwise, Equation 65 is different from the stochastic interpolant introduced in Definition 1: in particular, $x^\mathsf{d}_t$ is continuous but not differentiable in time. At the same time, since $B_t$ is a Gaussian process with mean zero and variance ${\mathbb{E}} B^2_t = t(1-t)$, Equation 65 has the same single-time statistics and time-dependent density $\rho(t, x)$ as the stochastic interpolant Equation 2 if we set $\gamma(t) = \sqrt{2a(t)t(1-t)}$, i.e.

$ x_t = I(t, x_0, x_1)+ \sqrt{2a(t)t(1-t)} z \quad \text{with} \quad (x_0, x_1)\sim \nu, \ z \sim {\sf N}(0, \text{\it Id}), \ (x_0, x_1) \perp z.\tag{66} $

As a result, Equation 65 and 66 lead to the same generative models. Technically, it is easier to work with Equation 66 than with Equation 65, because it avoids the use of Itô calculus and enables direct sampling of $x_t$ using samples from $\rho_0$, $\rho_1$, and $\mathsf{N}(0, \text{\it Id})$. However, Equation 65 sheds light on some interesting properties of the generative models based on Equation 66, i.e. stochastic interpolants with $\gamma(t) = \sqrt{2a(t)t(1-t)}$. To see why, we now re-derive the transport equation for the density $\rho(t, x)$ shared by Equation 65 and 66 using Fourier analysis. For simplicity, we focus on the case where $a(t)$ is constant in time, i.e. we set $a(t)=a>0$ in Equation 65.
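Direct sampling of Equation 66 requires nothing beyond draws of $(x_0, x_1)$ and $z$. A minimal sketch with the linear path $I(t, x_0, x_1)=(1-t)x_0+t x_1$ (one admissible choice of $I$, used here only for illustration) and two 1d Gaussians:

```python
import numpy as np

def sample_diffusive_interpolant(x0, x1, t, a, rng):
    """Single-time sample of Eq. 66 with the linear path I = (1-t) x0 + t x1;
    matches Eq. 65 in distribution at any fixed t."""
    gamma = np.sqrt(2 * a * t * (1 - t))
    z = rng.standard_normal(x0.shape)
    return (1 - t) * x0 + t * x1 + gamma * z

rng = np.random.default_rng(1)
n, a, t = 200_000, 0.5, 0.3
x0 = rng.standard_normal(n)           # rho_0 = N(0, 1)
x1 = 2.0 + rng.standard_normal(n)     # rho_1 = N(2, 1)
xt = sample_diffusive_interpolant(x0, x1, t, a, rng)
# mean: (1-t)*0 + t*2 = 0.6; var: (1-t)^2 + t^2 + 2 a t (1-t) = 0.79
assert abs(xt.mean() - 0.6) < 0.02
assert abs(xt.var() - 0.79) < 0.02
```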

To begin, recall that the Brownian Bridge $B_t$ can be expressed in terms of the Wiener process $W_t$ as $B_t = W_t - tW_{t=1}$. Moreover, it satisfies the SDE obtained by conditioning on $B_{t=1}=0$ via Doob's $h$-transform ([61]):

$ dB_t = - \frac{B_t}{1-t} dt + dW_t, \qquad B_{t=0}=0.\tag{67} $

A direct application of Itô's formula implies that

$ de^{ik\cdot x_t^\mathsf{d}} = ik \cdot \Big(\partial_t I(t, x_0, x_1) - \frac{\sqrt{2a}\,B_t}{1-t} \Big) e^{ik\cdot x_t^\mathsf{d}} dt - a|k|^2 e^{ik\cdot x_t^\mathsf{d}} dt + \sqrt{2a}\,ik\cdot dW_t e^{ik\cdot x_t^\mathsf{d}}.\tag{68} $

Taking the expectation of this expression and using the independence between $(x_0, x_1)$ and $B_t$, we deduce that

$ \partial_t {\mathbb{E}} e^{ik\cdot x_t^\mathsf{d}} = ik \cdot {\mathbb{E}} \Big(\Big(\partial_t I(t, x_0, x_1) - \frac{\sqrt{2a}\,B_t}{1-t} \Big) e^{ik\cdot x_t^\mathsf{d}} \Big) - a|k|^2 {\mathbb{E}} e^{ik\cdot x_t^\mathsf{d}}.\tag{69} $

Since for all fixed $t\in [0, 1]$ we have $B_t \stackrel{d}{=} \sqrt{t(1-t)}z$ and $x^\mathsf{d}_t \stackrel{d}{=} x_t $ with $x_t$ defined in Equation 66, the time derivative Equation 69 can also be written as

$ \partial_t {\mathbb{E}} e^{ik\cdot x_t} = ik \cdot {\mathbb{E}} \Big(\Big(\partial_t I(t, x_0, x_1) - \frac{\sqrt{2at}\, z}{\sqrt{1-t}}\Big) e^{ik\cdot x_t} \Big) - a|k|^2 {\mathbb{E}} e^{ik\cdot x_t}.\tag{70} $

Moreover, since by definition of their probability density we have ${\mathbb{E}} e^{ik\cdot x_t^\mathsf{d}}= {\mathbb{E}} e^{ik\cdot x_t} = \int_{{\mathbb{R}}^d} e^{ik\cdot x}\rho(t, x)dx$, we can deduce from Equation 70 that $\rho(t)$ satisfies

$ \partial_t \rho+\nabla \cdot(u \rho) = a \Delta \rho,\tag{71} $

where we defined

$ u(t, x) = {\mathbb{E}}\Big(\partial_t I(t, x_0, x_1) - \frac{\sqrt{2at}\, z}{\sqrt{1-t}}\Big| x_t = x\Big).\tag{72} $

For the interpolant $x_t$ in Equation 66, we have from the definitions of $b$ and $s$ in Equation 11 and 15 that

$ \begin{aligned} b(t, x) &= {\mathbb{E}}\Big(\partial_t I(t, x_0, x_1) + \frac{a(1-2t)\, z}{\sqrt{2at(1-t)}}\Big| x_t = x\Big), \\ s(t, x) &= \nabla \log\rho(t, x) = - \frac{1}{\sqrt{2at(1-t)}}{\mathbb{E}}(z| x_t = x). \end{aligned}\tag{73} $

As a result, $u = b + as$, and Equation 71 can also be written as the TE Equation 10 using $\Delta \rho = \nabla \cdot(s \rho)$.

Conditional sampling.

Remarkably, the drift $u$ defined in Equation 72 remains non-singular for all $t\in [0, 1]$ (including $t=0$) even if $\rho_0$ is replaced by a point mass at $x_0$; by contrast, both $b$ and $s$ are singular at $t=0$ in this case. Hence, the SDE associated with the FPE Equation 71 provides us with a generative model that samples $\rho_1$ from a base measure concentrated at a single $x_0$ (i.e. such that the density $\rho_0$ is replaced by a point mass measure at $x=x_0$). We formalize this result in the following theorem:

########## {caption="Theorem 26"}

Assume that $I(t, x_0, x_1) = x_0$ for $t\in [0, \delta]$ with some $\delta \in (0, 1]$. Given any $a>0$, let

$ u^\mathsf{d}(t, x, x_0) = {\mathbb{E}}_{x_1, z}\Big(\partial_t I(t, x_0, x_1) - \frac{\sqrt{2at}\, z}{\sqrt{1-t}}\Big| x_t = x\Big),\tag{74} $

where $x_t$ is given by Equation 66 and where ${\mathbb{E}}_{x_1, z}(\cdot|x_t=x)$ denotes an expectation over $x_1\sim \rho_1 \perp z\sim \mathsf{N}(0, \text{\it Id})$ conditioned on $x_t=x$ with $x_0\in {\mathbb{R}}^d$ fixed. Then $u^\mathsf{d}(\cdot, \cdot, x_0) \in C^0([0, 1];(C^p({\mathbb{R}}^d))^d)$ for any $p\in {\mathbb{N}}$ and $x_0 \in {\mathbb{R}}^d$. Moreover, the solutions to the forward SDE

$ dX_t^\mathsf{d} = u^\mathsf{d}(t, X_t^\mathsf{d}, x_0) dt + \sqrt{2a}\, dW_t, \qquad X_{t=0}^\mathsf{d} = x_0,\tag{75} $

are such that $X^\mathsf{d}_{t=1} \sim \rho_1$.

Note that the additional assumption we make on $I(t, x_0, x_1)$ is consistent with the requirements in Definition 1 and Assumption 4: this additional assumption is made for simplicity and can probably be relaxed to $\partial_tI(t=0, x_0, x_1) = 0$.

The proof of Theorem 26 is given in Appendix B.5. It relies on the calculations that led to 72, along with the observation that at $t=0$ and $x=x_0$,

$ u^\mathsf{d}(t=0, x_0, x_0) = {\mathbb{E}}_{x_1}\left(\partial_t I(t=0, x_0, x_1) \right),\tag{76} $

whereas at $t=1$ and any $x\in {\mathbb{R}}^d$, we have

$ u^\mathsf{d}(t=1, x, x_0) = \partial_t I(t=1, x_0, x) + 2a \nabla \log \rho_1(x),\tag{77} $

which are both well-defined. To put the result in Theorem 26 in perspective, observe that no probability flow ODE with $b\in C^0([0, 1];(C^p({\mathbb{R}}^d))^d)$ can achieve the same feat as the diffusion in Equation 75. This is because the solutions of such an ODE are unique, and therefore can only map $x_0$ onto a single point at time $t=1$. Moreover, $u^\mathsf{d}(t, x, x_0)$ is the unique minimizer of the objective function

$ \mathcal{L}_{u^\mathsf{d}} [\hat{u}^{\mathsf{d}}] = \int_0^1 {\mathbb{E}}_{x_1, z} \left(|\hat{u}^\mathsf{d}(t, x_t, x_0)|^2 - 2\Big(\partial_t I(t, x_0, x_1) - \frac{\sqrt{2at}\, z}{\sqrt{1-t}}\Big)\cdot \hat{u}^\mathsf{d}(t, x_t, x_0)\right) dt.\tag{78} $

3.2 One-sided interpolants for Gaussian $\rho_0$

A common choice of base density for generative modeling in the absence of prior information is $\rho_0 = \mathsf{N}(0, \text{\it Id})$. In this setting, we can group the effect of the latent variable $z$ with $x_0$. This leads to a simpler type of stochastic interpolant that, in particular, will enable us to instantiate score-based diffusion within our general framework (see Section 3.4).

########## {caption="Definition 27: One-sided stochastic interpolant"}

Given a probability density function $\rho_1: {{\mathbb{R}}^d} \rightarrow {\mathbb{R}}_{\geq 0}$, a one-sided stochastic interpolant between ${\sf N}(0, \text{\it Id})$ and $\rho_1$ is a stochastic process $x^\mathsf{os}_t$

$ x^\mathsf{os}_t = \alpha(t) z+ J(t, x_1), \qquad t\in [0, 1]\tag{79} $

that fulfills the requirements:

  1. $J\in C^2([0, 1], C^2({\mathbb{R}}^d)^d)$ satisfies the boundary conditions $J(0, x_1) = 0$ and $J(1, x_1) = x_1$.
  2. $x_1$ and $z$ are independent random variables drawn from $\rho_1$ and ${\sf N}(0, \text{\it Id})$, respectively.
  3. $\alpha: [0, 1] \to {\mathbb{R}} $ satisfies $\alpha(0)=1$, $\alpha(1) = 0$, $\alpha(t)>0$ for all $t \in [0, 1)$, and $\alpha^2 \in C^2([0, 1])$.

By construction, $x^\mathsf{os}_{t=0} = z \sim {\sf N}(0, \text{\it Id})$ and $x^\mathsf{os}_{t=1} = x_1 \sim \rho_1$, so that the distribution of the stochastic process $x^\mathsf{os}_t$ bridges ${\sf N}(0, \text{\it Id})$ and $\rho_1$. It is easy to see that the one-sided stochastic interpolant defined in Equation 79 will have the same density as the stochastic interpolant defined in Equation 2 if we set $I(t, x_0, x_1) = J(t, x_1) + \delta(t) x_0$ and take $\delta^2(t)+ \gamma^2(t) = \alpha^2(t)$. Restricting to this case, our earlier theoretical results apply, where the velocity field $b$ defined in Equation 11 becomes

$ b(t, x) = {\mathbb{E}} (\dot{\alpha}(t) z + \partial_t J(t, x_1)| x^\mathsf{os}_t = x),\tag{80} $

and the quadratic objective in Equation 14 becomes

$ \mathcal{L}_b[\hat{b}] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{b}(t, x^\mathsf{os}_t)|^2 - \left(\dot{\alpha}(t) z + \partial_t J(t, x_1) \right) \cdot \hat{b}(t, x^\mathsf{os}_t) \right) dt.\tag{81} $

In the expression above, $x^\mathsf{os}_t$ is given by Equation 79 and the expectation ${\mathbb{E}}$ is taken independently over $x_1\sim \rho_1$ and $z\sim {\sf N}(0, \text{\it Id})$. Similarly, the score is given by

$ s(t, x) = - \alpha^{-1}(t) \eta_z(t, x), \qquad \eta_z(t, x) = {\mathbb{E}}(z|x^\mathsf{os}_t=x),\tag{82} $

where $\eta_z(t, x)$ is the equivalent of the denoiser defined in Equation 18. These functions are the unique minimizers of the objectives

$ \mathcal{L}_s[\hat{s}] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{s}(t, x^\mathsf{os}_t)|^2 +\alpha^{-1}(t) z \cdot \hat{s}(t, x^\mathsf{os}_t) \right) dt,\tag{83} $

$ \mathcal{L}_{\eta_z}[\hat{\eta}_z] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{\eta}_z(t, x^\mathsf{os}_t)|^2 - z \cdot \hat{\eta}_z(t, x^\mathsf{os}_t) \right) dt.\tag{84} $
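The objective in Equation 81 can be estimated by plain Monte Carlo over $(t, x_1, z)$. A sketch with the admissible choice $\alpha(t)=1-t$ and $J(t, x_1)=t\,x_1$ (so $\dot\alpha=-1$ and $\partial_t J=x_1$); the candidate drifts and the Gaussian $\rho_1$ below are hypothetical stand-ins for a learned model and real data:

```python
import numpy as np

def one_sided_loss(b_hat, x1, rng, n_times=100):
    """Monte Carlo estimate of Eq. 81 with alpha(t) = 1 - t and J(t, x1) = t x1."""
    ts = (np.arange(n_times) + 0.5) / n_times      # midpoint rule in t
    total = 0.0
    for t in ts:
        z = rng.standard_normal(x1.shape)
        xt = (1 - t) * z + t * x1                  # one-sided interpolant, Eq. 79
        target = -z + x1                           # alpha_dot(t) z + dJ/dt
        bt = b_hat(t, xt)
        total += np.mean(0.5 * bt**2 - target * bt)
    return total / n_times

rng = np.random.default_rng(3)
x1 = 2.0 + rng.standard_normal(50_000)             # rho_1 = N(2, 1)
# For this Gaussian rho_1 the exact minimizer of Eq. 81 is affine in x:
b_exact = lambda t, x: 2.0 + (2 * t - 1) * (x - 2.0 * t) / ((1 - t) ** 2 + t**2)
b_zero = lambda t, x: np.zeros_like(x)
# The exact drift attains a strictly smaller loss than the zero drift.
assert one_sided_loss(b_exact, x1, rng) < one_sided_loss(b_zero, x1, rng)
```

Note that $\mathcal{L}_b$ is defined only up to the additive constant $\tfrac12\int_0^1\mathbb{E}|\dot\alpha z + \partial_t J|^2 dt$, so its value can be negative; only loss differences are meaningful.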

Moreover, we can weaken Assumption 4 to the following requirement:

########## {caption="Assumption 28"}

The density $\rho_1 \in C^2({\mathbb{R}}^d)$, satisfies $\rho_1(x) > 0$ for all $x \in {\mathbb{R}}^d$, and:

$ \int_{{\mathbb{R}}^d} |\nabla \log \rho_1(x)|^2 \rho_1(x) dx < \infty.\tag{85} $

The function $J$ satisfies

$ \exists C_1<\infty \ : \ |\partial_t J(t, x_1)|\le C_1|x_1| \quad \text{for all}\quad (t, x_1) \in [0, 1]\times {\mathbb{R}}^d,\tag{86} $

and

$ \exists M_1, M_2 < \infty \ \ : \ \ {\mathbb{E}}\big[|\partial_t J(t, x_1)|^4\big] \le M_1; \quad {\mathbb{E}}\big[|\partial^2_t J(t, x_1)|^2\big] \le M_2, \quad \text{for all}\quad t\in [0, 1],\tag{87} $

where the expectation is taken over $x_1\sim \rho_1$.

########## {caption="Remark 29"}

The construction above can easily be generalized to the case where $\rho_0 = \mathsf{N}(0, C_0)$ with $C_0$ a positive-definite matrix. Without loss of generality, we can assume that $C_0 = \sigma_0 \sigma_0^\mathsf{T}$ where $\sigma_0$ is a lower-triangular matrix, and replace Equation 79 by

$ x^\mathsf{os}_t = \alpha(t) \sigma_0 z + J(t, x_1), \qquad t\in [0, 1],\tag{88} $

with $J$ and $\alpha$ satisfying the conditions listed in Definition 27 and where $z\sim {\sf N}(0, \text{\it Id})$.

3.3 Mirror interpolants

Another practically relevant setting is when the base and the target are the same density $\rho_1$. In this setting we can define a stochastic interpolant as:

########## {caption="Definition 30: Mirror stochastic interpolant"}

Given a probability density function $\rho_1: {{\mathbb{R}}^d} \rightarrow {\mathbb{R}}_{\geq 0}$, a mirror stochastic interpolant between $\rho_1$ and itself is a stochastic process $x^\mathsf{mir}_t$

$ x^\mathsf{mir}_t = K(t, x_1) + \gamma(t) z, \qquad t\in [0, 1]\tag{89} $

that fulfills the requirements:

  1. $K\in C^2([0, 1], C^2({\mathbb{R}}^d)^d)$ satisfies the boundary conditions $K(0, x_1) = x_1$ and $K(1, x_1) = x_1$.
  2. $x_1$ and $z$ are random variables drawn independently from $\rho_1$ and ${\sf N}(0, \text{\it Id})$, respectively.
  3. $\gamma: [0, 1] \to {\mathbb{R}} $ satisfies $\gamma(0)=\gamma(1)=0$, $\gamma(t)>0$ for all $t \in (0, 1)$, and $\gamma^2 \in C^1([0, 1])$.

By construction, $x^\mathsf{mir}_{t=0} = x^\mathsf{mir}_{t=1}= x_1 \sim \rho_1$, so that the distribution of the stochastic process $x^\mathsf{mir}_t$ bridges $\rho_1$ to itself. Note that a valid choice is $K(t, x_1) = \alpha(t) x_1$ with $\alpha(0)=\alpha(1) =1$ (e.g. $\alpha(t) =1$): in this case, mirror interpolants are related to denoisers, as will be discussed in Section 5.2.

It is easy to see that our earlier theoretical results apply where the velocity field $b$ defined in Equation 11 becomes

$ b(t, x) = {\mathbb{E}} (\partial_t K(t, x_1) + \dot{\gamma}(t) z| x^\mathsf{mir}_t = x),\tag{90} $

and the quadratic objective in Equation 14 becomes

$ \mathcal{L}_b[\hat{b}] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{b}(t, x^\mathsf{mir}_t)|^2 - \left(\partial_t K(t, x_1) + \dot{\gamma}(t) z\right) \cdot \hat{b}(t, x^\mathsf{mir}_t) \right) dt.\tag{91} $

In the expression above, $x^\mathsf{mir}_t$ is given by Equation 89 and the expectation ${\mathbb{E}}$ is taken independently over $x_1\sim \rho_1$ and $z\sim {\sf N}(0, \text{\it Id})$. Similarly, the score is given by

$ s(t, x) = - \gamma^{-1}(t) \eta_z(t, x), \qquad \eta_z(t, x)= {\mathbb{E}}(z|x^\mathsf{mir}_t=x),\tag{92} $

which are the unique minimizers of the objective functions

$ \mathcal{L}_s[\hat{s}] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{s}(t, x^\mathsf{mir}_t)|^2 +\gamma^{-1}(t) z \cdot \hat{s}(t, x^\mathsf{mir}_t) \right) dt.\tag{93} $

$ \mathcal{L}_{\eta_z}[\hat{\eta}_z] =\int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{\eta}_z(t, x^\mathsf{mir}_t)|^2 - z \cdot \hat{\eta}_z(t, x^\mathsf{mir}_t) \right) dt.\tag{94} $

Moreover, we can weaken Assumption 4 to the following requirement:

########## {caption="Assumption 31"}

The density $\rho_1 \in C^2({\mathbb{R}}^d)$ satisfies $\rho_1(x) > 0$ for all $x \in {\mathbb{R}}^d$ and

$ \int_{{\mathbb{R}}^d} |\nabla \log \rho_1(x)|^2 \rho_1(x) dx < \infty.\tag{95} $

The function $K$ satisfies

$ \exists C_1<\infty \ : \ |\partial_t K(t, x_1)|\le C_1|x_1| \quad \text{for all}\quad (t, x_1) \in [0, 1]\times {\mathbb{R}}^d,\tag{96} $

and

$ \exists M_1, M_2 < \infty \ \ : \ \ {\mathbb{E}}\big[|\partial_t K(t, x_1)|^4\big] \le M_1; \quad {\mathbb{E}}\big[|\partial^2_t K(t, x_1)|^2\big] \le M_2, \quad \text{for all}\quad t\in [0, 1],\tag{97} $

where the expectation is taken over $x_1\sim \rho_1$.

########## {caption="Remark 32"}

Interestingly, if we take $K(t, x_1)=x_1$, then $\partial_t K(t, x_1) = 0$, and the velocity field defined in Equation 90 is completely determined by the denoiser $\eta_z$:

$ b(t, x) = \dot{\gamma}(t) \eta_z(t, x).\tag{98} $

Since the score $s$ also depends only on $\eta_z$, this denoiser is the only quantity that needs to be learned.
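For instance, with $K(t, x_1)=x_1$ and the admissible choice $\gamma(t)=\sqrt{t(1-t)}$, both $b$ and $s$ follow from a single denoiser via Equation 98 and Equation 92. A sketch (the denoiser below is a hypothetical placeholder for a learned $\hat\eta_z$):

```python
import numpy as np

def drift_and_score(eta_z, t, x):
    """Mirror interpolant with K(t, x1) = x1 and gamma(t) = sqrt(t(1 - t)):
    b = gamma_dot(t) * eta_z (Eq. 98), s = -eta_z / gamma(t) (Eq. 92)."""
    gamma = np.sqrt(t * (1 - t))
    gamma_dot = (1 - 2 * t) / (2 * gamma)
    e = eta_z(t, x)
    return gamma_dot * e, -e / gamma

# Placeholder denoiser; any eta_z satisfies the identity
# b = -gamma(t) gamma_dot(t) s = -((1 - 2t)/2) s for t in (0, 1).
eta_hat = lambda t, x: 0.1 * x
b, s = drift_and_score(eta_hat, 0.25, 2.0)
assert abs(b + 0.5 * (1 - 2 * 0.25) * s) < 1e-12
```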

########## {caption="Remark"}

If $\rho_1$ is only accessible via empirical samples, mirror interpolants do not enable calculation of the functional form of $\rho_1$. A notable exception is if we set $K(t, x_1) =0$ for $t\in [t_1, t_2]$ with $0< t_1\le t_2 <1$: in that case, $x^\mathsf{mir}_t = \gamma(t) z \sim \gamma(t) {\sf N}(0, \text{\it Id})$ for $t\in [t_1, t_2]$, which gives us a reference density for comparison. In this setup, mirror interpolants essentially reduce to two one-sided interpolants glued together (with the second one time-reversed), or in fact a regular stochastic interpolant when $\rho_0=\rho_1$ and we set $I(t, x_0, x_1) = 0 $ for $t\in [t_1, t_2]$.

3.4 Stochastic interpolants and Schrödinger bridges

The stochastic interpolant framework can also be used to solve the Schrödinger bridge problem. For background material on this problem, we refer the reader to [62] and the references therein. Consistent with the overall viewpoint of this paper, we consider the hydrodynamic formulation of the Schrödinger bridge problem, in which the goal is to obtain a pair $(\rho, u)$ that solves the following optimization problem for a fixed $\epsilon > 0$:

$ \begin{aligned} &\min_{\hat{u}, \hat{\rho}} \int_0^1 \int_{{\mathbb{R}}^d} |\hat{u}(t, x)|^2 \hat{\rho}(t, x)\, dx\, dt\\ \text{subject to:} \quad & \partial_t \hat{\rho} + \nabla \cdot \left(\hat{u} \hat\rho\right) = \epsilon \Delta \hat{\rho}, \quad \hat{\rho}(0) = \rho_0, \quad \hat{\rho}(1) = \rho_1. \end{aligned}\tag{99} $

Under our assumptions on $\rho_0$ and $\rho_1$ listed in Assumption 4, it is known (see e.g. Proposition 4.1 in [62]) that Equation 99 has a unique minimizer $(\rho, u=\nabla\lambda)$, with $(\rho, \lambda)$ classical solutions of the Euler-Lagrange equations:

$ \begin{aligned} & \partial_t \rho + \nabla \cdot \left(\nabla \lambda\, \rho\right) = \epsilon \Delta \rho, \quad \rho(0) = \rho_0, \quad \rho(1) = \rho_1, \\ & \partial_t \lambda +\tfrac12 |\nabla \lambda|^2=- \epsilon \Delta \lambda. \end{aligned}\tag{100} $

To proceed we will make the additional assumption that the solution $\rho$ to 100 can be reversibly mapped to a standard Gaussian:

########## {caption="Assumption 33"}

There exists a reversible map $T:[0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}^d$ with $T, T^{-1}\in C^1([0, 1], (C^d({\mathbb{R}}^d))^d)$ such that:

$ \forall t\in [0, 1] \quad : \quad z \sim {\sf N}(0, \text{\it Id}) \ \Rightarrow \ T(t, z) \sim \rho(t); \quad x_t \sim \rho(t) \ \Rightarrow \ T^{-1}(t, x_t) \sim {\sf N}(0, \text{\it Id}),\tag{101} $

where $\rho$ is the solution to 100.

We stress that the actual form of the map $T$ is not important for the arguments below. Assumption 33 can be used to show the existence of a stochastic interpolant whose density solves Equation 100:

########## {caption="Lemma 34"}

If Assumption 33 holds, then the solution $\rho(t)$ to 100 is the density of the stochastic interpolant

$ x_t = T(t, \alpha(t) T^{-1}(0, x_0) + \beta(t) T^{-1}(1, x_1)) + \gamma(t) z,\tag{102} $

as long as $\alpha^2(t)+ \beta^2(t) + \gamma^2(t) =1$.

The proof is given in Appendix B.6: Equation 102 corresponds to choosing $I(t, x_0, x_1)= T(t, \alpha(t) T^{-1}(0, x_0) + \beta(t) T^{-1}(1, x_1))$ in Equation 2. With the help of Lemma 34, we can establish the following result, which shows how to optimize over the function $I$ to solve the problem Equation 99:

########## {caption="Theorem 35"}

Pick some $\gamma:[0, 1]\to [0, 1)$ such that $\gamma(0)=\gamma(1) = 0$, $\gamma(t)>0$ for $t\in(0, 1)$, $\gamma \in C^2((0, 1))$ and $\gamma^2\in C^1([0, 1])$, and let $\hat{x}_t = \hat{I}(t, x_0, x_1)+\gamma(t) z$, with $x_0\sim\rho_0$, $x_1\sim\rho_1$, and $z\sim {\sf N}(0, \text{\it Id})$ all independent. Consider the max-min problem over $\hat{I}\in C^1([0, 1], (C^1({\mathbb{R}}^d\times {\mathbb{R}}^d))^d)$ and $\hat{u}\in C^0([0, 1], (C^1({\mathbb{R}}^d))^d)$:

$ \max_{\hat{I}} \min_{\hat{u}} \int_0^1 {\mathbb{E}} \left(\tfrac12|\hat{u}(t, \hat{x}_t)|^2 - \left(\partial_t \hat{I}(t, x_0, x_1) + (\dot{\gamma}(t) -\epsilon \gamma^{-1}(t)) z \right) \cdot \hat{u}(t, \hat{x}_t) \right) dt.\tag{103} $

If Assumption 33 holds, then all the optimizers $(I, u)$ of Equation 103 are such that the density of the associated $x_t = I(t, x_0, x_1)+\gamma(t) z$ is the solution $\rho$ to 100. Moreover, $u = \nabla \lambda$, with $\lambda$ the solution to 100.

The proof is also given in Appendix B.6. Note that if we fix $\hat{I}$, the velocity $u$ minimizing this objective is the forward drift $b_\mathsf{F}$ defined in Equation 21. Note also that if we set $\epsilon\to0$, the minimizing velocity field is $b$ as defined in Equation 11, and the max-min problem formally reduces to solving the optimal transport problem. In this case, Assumption 33 becomes more stringent, as we need to assume that the system Equation 100 with $\epsilon=0$ (i.e. in the absence of the diffusive terms) has a classical solution. Theorem 35 gives a practical route towards solving the Schrödinger bridge problem with stochastic interpolants, and we leave the numerical investigation of this formulation to future work.

4. Spatially linear interpolants

Section Summary: This section explores a straightforward way to blend two probability distributions—representing starting and ending data patterns—using a linear combination of points from each, plus some added random noise that varies over time. This approach simplifies the math for intermediate steps in generative models, which create new data by following paths defined by ordinary or stochastic differential equations, and allows for flexible choices in how the blending and noise evolve. By assuming simple Gaussian-like distributions, the authors derive key equations for the model's "velocity" (how points move) and "score" (a noise-correction tool), and propose specific blending functions that keep the overall scale consistent, such as linear or wavy patterns that peak in the middle.

In this section, we study the stochastic interpolants that are obtained when we specialize the function $I$ used in Equation 2 to be linear in both $x_0$ and $x_1$, i.e. we consider

$ x^\mathsf{lin}_t = \alpha(t) x_0+ \beta(t) x_1 + \gamma(t) z,\tag{104} $

where $(x_0, x_1)\sim \nu$ and $z \sim {\sf N}(0, \text{\it Id})$ with $(x_0, x_1)\perp z$, and $\alpha, \beta, \gamma^2\in C^2([0, 1])$ satisfy the conditions

$ \begin{aligned} & \alpha(0)=\beta(1)=1; \quad \alpha(1)=\beta(0)=\gamma(0) = \gamma(1) = 0; \quad \forall t\in(0, 1) \ : \ \gamma(t) > 0. \end{aligned}\tag{105} $

Despite its simplicity, this setup offers significant design flexibility. The discussion highlights how the presence of the latent variable $\gamma(t)z$ can simplify the structure of the intermediate density $\rho(t)$. Since our ultimate aim is to investigate the properties of practical generative models built upon ODEs or SDEs, we will also study the effect of a time-dependent diffusion coefficient $\epsilon(t)$, which controls the amplitude of the noise in a generative SDE. Throughout, to build intuition, we choose $\rho_0$ and $\rho_1$ to be Gaussian mixture densities, for which the drift coefficients can be computed analytically (see Appendix A). This enables us to visualize the effect of each choice on the resulting generative models.

4.1 Factorization of the velocity field

When the stochastic interpolant is of the form Equation 104, the velocity $b$ and the score $s$ defined in Equation 11 and 15 can both be expressed in terms of the following three conditional expectations (the third is the denoiser already defined in Equation 18):

$ \eta_0(t, x) = {\mathbb{E}}(x_0|x^\mathsf{lin}_t = x), \quad \eta_1(t, x) = {\mathbb{E}}(x_1|x^\mathsf{lin}_t = x), \quad \eta_z(t, x) = {\mathbb{E}}(z|x^\mathsf{lin}_t = x).\tag{106} $

Specifically, we have

$ b(t, x) = \dot\alpha(t) \eta_0(t, x) + \dot\beta(t) \eta_1(t, x)+ \dot\gamma(t) \eta_z(t, x), \qquad s(t, x) = - \gamma^{-1}(t) \eta_z(t, x).\tag{107} $

The second relation above holds for $t\in(0, 1)$ (i.e. when $\gamma(t) \not=0$). Moreover, because ${\mathbb{E}} (x_t|x_t=x) = x$ by definition, the functions $\eta_0$, $\eta_1$, and $\eta_z$ satisfy

$ \forall (t, x) \in [0, 1]\times {\mathbb{R}}^d \quad : \quad \alpha(t) \eta_0(t, x) + \beta(t) \eta_1(t, x) + \gamma(t) \eta_z(t, x) = x.\tag{108} $

This enables us to reduce computational expense: given two of the $\eta$ 's, the third can always be calculated via Equation 108. Finally, it is easy to see that the functions $\eta_0$, $\eta_1$, and $\eta_z$ are the unique minimizers of the objectives

$ \begin{aligned} \mathcal L_{\eta_0}(\hat{\eta}_0) &= \int_0^1 {\mathbb{E}} \left[\tfrac12|\hat{\eta}_0(t, x^\mathsf{lin}_t)|^2 - x_0\cdot \hat{\eta}_0(t, x^\mathsf{lin}_t)\right] dt, \\ \mathcal L_{\eta_1}(\hat{\eta}_1) &= \int_0^1 {\mathbb{E}} \left[\tfrac12|\hat{\eta}_1(t, x^\mathsf{lin}_t)|^2 - x_1\cdot \hat{\eta}_1(t, x^\mathsf{lin}_t)\right] dt, \\ \mathcal L_{\eta_z}(\hat{\eta}_z) &= \int_0^1 {\mathbb{E}} \left[\tfrac12|\hat{\eta}_z(t, x^\mathsf{lin}_t)|^2 - z\cdot \hat{\eta}_z(t, x^\mathsf{lin}_t)\right] dt, \end{aligned}\tag{109} $

where $x^\mathsf{lin}_t$ is defined in Equation 104 and the expectation is taken independently over $(x_0, x_1)\sim\nu$ and $z\sim {\sf N}(0, \text{\it Id}).$
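To make these objectives concrete, the following sketch estimates $\mathcal L_{\eta_z}$ by Monte Carlo in a 1d toy case of our choosing (an assumption, not taken from the text: both densities standard normal with independent coupling, and $\alpha(t)=1-t$, $\beta(t)=t$, $\gamma(t)=\sqrt{2t(1-t)}$, for which the interpolant is jointly Gaussian with $z$ and $\eta_z(t,x)=\gamma(t)x$ in closed form). The analytic minimizer attains a strictly smaller empirical loss than the trivial guess $\hat\eta_z = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200_000
# toy setting (our assumption): rho_0 = rho_1 = N(0,1) with independent coupling
# and a variance-preserving choice, for which eta_z(t, x) = gamma(t) * x exactly
t  = rng.uniform(0.0, 1.0, M)              # time integral estimated by Monte Carlo
x0 = rng.standard_normal(M)
x1 = rng.standard_normal(M)
z  = rng.standard_normal(M)
alpha, beta, gamma = 1.0 - t, t, np.sqrt(2.0 * t * (1.0 - t))
xt = alpha * x0 + beta * x1 + gamma * z    # the linear interpolant, Eq. (104)

def loss(eta_hat):
    # empirical version of L_{eta_z} in Eq. (109)
    vals = eta_hat(t, xt)
    return np.mean(0.5 * vals**2 - z * vals)

L_true = loss(lambda t_, x_: gamma * x_)   # the analytic conditional expectation
L_zero = loss(lambda t_, x_: np.zeros_like(x_))
print(L_true, L_zero)  # L_true ≈ -1/6 < L_zero = 0
```

Here the gap $\mathcal L_{\eta_z}(0)-\mathcal L_{\eta_z}(\eta_z) = \tfrac12\int_0^1{\mathbb{E}}|\eta_z|^2\,dt = \tfrac16$ in this toy case, consistent with the uniqueness of the minimizer.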

::: {caption="Table 1: Spatially linear interpolants. A table suggesting various linear interpolants. In general, this paper describes methods for arbitrary $\rho_0$ and $\rho_1$. In Section 4.4, we detail linear interpolants for one-sided generation, where $\rho_0$ is a Gaussian and the latent variable $z$ can be absorbed into $x_0$. Later, in Section 5.1, we discuss how to recast score-based diffusion models (SBDM) as linear one-sided interpolants, which leads to the expressions given in the table when using a variance-preserving parameterization. Linear mirror interpolants, where $\rho_0 = \rho_1$, are defined by Equation 89 with the choice $K(t, x_1) = \alpha(t)x_1$."}

:::

4.2 Some specific design choices

It is useful to assume that both $\rho_0$ and $\rho_1$ have been scaled to have zero mean and identity covariance (which can be achieved in practice, for example, by an affine transformation of the data). In this case, the time-dependent mean and covariance of Equation 104 are given by

$ {\mathbb{E}} [x^\mathsf{lin}_t] = 0, \qquad {\mathbb{E}} [x^\mathsf{lin}_t (x^\mathsf{lin}_t)^\mathsf{T}] = (\alpha^2(t)+\beta^2(t)+\gamma^2(t)) \text{\it Id}.\tag{110} $

Preserving the identity covariance at all times therefore leads to the constraint

$ \forall t \in [0, 1] \quad : \quad \alpha^2(t)+\beta^2(t)+\gamma^2(t) = 1.\tag{111} $

This choice is also sensible if $\rho_0$ and $\rho_1$ have covariances that are not the identity but are on a similar scale. In this case we no longer need to enforce Equation 111 exactly, and could, for example, take three functions whose sum of squares is of order one. For definiteness, in the sequel we discuss possible choices that satisfy Equation 111 exactly, with the understanding that the corresponding functions $\alpha$, $\beta$, and $\gamma$ could all be slightly modified without significantly affecting the conclusions.

Linear and trigonometric $\alpha$ and $\beta$.

One way to ensure that Equation 111 holds while maintaining the influence of $\rho_0$ and $\rho_1$ everywhere on $[0, 1]$ except at the endpoints is to choose

$ \alpha(t) = 1-t, \qquad \beta(t) = t, \qquad \gamma(t) = \sqrt{2t(1-t)},\tag{112} $

This choice was advocated in [7], without the inclusion of the latent variable ($\gamma =0$). Another possibility that gives more leeway is to pick any $\gamma: [0, 1]\to[0, 1]$ and set

$ \alpha(t) = \sqrt{1-\gamma^2(t)} \cos(\tfrac 12 \pi t), \qquad \beta(t) = \sqrt{1-\gamma^2(t)} \sin(\tfrac 12 \pi t).\tag{113} $

With $\gamma=0$, this was the choice preferred in [1]. The PDF $\rho(t)$ obtained with the choices Equation 112 and 113 when $\rho_0$ and $\rho_1$ are both Gaussian mixture densities is shown in Figure 4. As this example shows, when $\rho_0$ and $\rho_1$ have distinct complex features, these would be duplicated in $\rho(t)$ at intermediate times if not for the smoothing effect of the latent variable; this behavior is seen in Figure 4, where it is most prominent in the first row with $\gamma(t) = 0$. From a statistical learning perspective, eliminating the formation of spurious features will simplify the estimation of the velocity field $b$, which becomes smoother as the formation of such features is suppressed.
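Both families can be checked against the constraint Equation 111 numerically; for Equation 113 the sum of squares collapses to $(1-\gamma^2)(\cos^2+\sin^2)+\gamma^2 = 1$ for any $\gamma$. A minimal sketch (taking $\gamma(t)=\sqrt{2t(1-t)}$ in the trigonometric case, and noting that the linear check is symmetric in the pair $\{t, 1-t\}$):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)

# linear choice, Eq. (112): the pair {t, 1-t} with gamma = sqrt(2t(1-t))
lin = t**2 + (1.0 - t)**2 + 2.0 * t * (1.0 - t)

# trigonometric choice, Eq. (113), here with gamma(t) = sqrt(2t(1-t))
gamma2 = 2.0 * t * (1.0 - t)
trig = (1.0 - gamma2) * np.cos(0.5 * np.pi * t)**2 \
     + (1.0 - gamma2) * np.sin(0.5 * np.pi * t)**2 + gamma2

assert np.max(np.abs(lin - 1.0)) < 1e-12
assert np.max(np.abs(trig - 1.0)) < 1e-12
print("both families satisfy alpha^2 + beta^2 + gamma^2 = 1 on [0, 1]")
```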

**Figure 4:** **The effect of $\gamma(t)z$ on $\rho(t)$.** A visualization of how the choice of $\gamma(t)$ changes the density $\rho(t)$ of $x^\mathsf{lin}_t=\alpha(t) x_0 + \beta(t) x_1+ \gamma(t)z $ when $\rho_0$ and $\rho_1$ are Gaussian mixture densities with two modes and three modes, respectively. The first row depicts $\gamma(t)=0$, which reduces to the stochastic interpolant developed in [1]. This case forms a valid transport between $\rho_0$ and $\rho_1$, but produces spurious intermediate modes in $\rho(t)$. The second row depicts the choice of $\gamma(t) = \sqrt{2t(1-t)}$. In this case, the spurious modes are partially damped by the addition of the latent variable, leading to a simpler $\rho(t)$. The final row shows the Gaussian encoding-decoding, which smoothly encodes $\rho_0$ into a standard normal distribution on the time interval $[0, 1/2)$, which is then decoded into $\rho_1$ on the interval $(1/2, 1]$. In this case, no intermediate modes form in $\rho(t)$: the two modes in $\rho_0$ collide to form $\mathsf{N}(0, 1)$ at $t=\tfrac12$, which then spreads into the three modes of $\rho_1$. A visualization of individual sample trajectories from deterministic and stochastic generative models based on ODEs and SDEs whose solutions have density $\rho(t)$ can be seen in Figure 5.

**Figure 5:** **The effect of $\epsilon$ on sample trajectories.** A visualization of how the choice of $\epsilon$ affects the sample trajectories obtained by solving the ODE Equation 32 or the forward SDE Equation 33. The set-up is the same as in Figure 4: $\rho_0$ and $\rho_1$ are taken to be the same Gaussian mixture densities, and the analytical expressions for $b$ and $s$ are used. In the three panels in each column the value of $\gamma$ is the same, and each panel shows trajectories with different $\epsilon$. Three specific trajectories from the same three initial conditions drawn from $\rho_0$ are also highlighted in white in every panel. As $\epsilon$ increases but $\gamma$ stays the same, the density $\rho(t)$ is unchanged, but the individual trajectories become increasingly stochastic. While all choices are equivalent with exact $b$ and $s$, Theorem 18 shows that nonzero values of $\epsilon$ provide control on the likelihood in terms of the error in $b$ and $s$ when they are approximate.

Gaussian encoding-decoding.

A useful limiting case is to devolve the data from $\rho_0$ completely into noise by the halfway point $t=\tfrac12$ and to reconstruct $\rho_1$ completely from noise starting from $t=\frac12$. One choice that allows us to do so, while keeping $\alpha^2(t)+\beta^2(t)+\gamma^2(t)$ of order one in the sense discussed after Equation 111 (here it takes values in $[\tfrac12, 1]$), is

$ \alpha(t) = \cos^2(\pi t) 1_{[0, \frac12)}(t), \qquad \beta(t) = \cos^2(\pi t)1_{(\frac12, 1]}(t), \qquad \gamma(t) = \sin^2(\pi t),\tag{114} $

where $1_{A}(t)$ is the indicator function of $A$, i.e. $1_A(t) =1$ if $t\in A$ and $1_A(t)=0$ otherwise. With this choice, it is easy to see that $x_{t=\frac12} = \gamma(\tfrac{1}{2}) z \sim {\sf N}(0, \text{\it Id})$ since $\gamma(\tfrac12)=1$, which seamlessly glues together two interpolants: one between $\rho_0$ and a standard Gaussian, and one between a standard Gaussian and $\rho_1$.
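A quick numerical check of the boundary behavior in Equation 114; note (our observation, not from the text) that with these squared profiles the sum $\alpha^2+\beta^2+\gamma^2 = \cos^4(\pi t)+\sin^4(\pi t)$ is not exactly one but stays in $[\tfrac12, 1]$, i.e. of order one in the sense of the relaxation discussed below Equation 111:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 100_001)
alpha = np.cos(np.pi * t)**2 * (t < 0.5)
beta  = np.cos(np.pi * t)**2 * (t > 0.5)
gamma = np.sin(np.pi * t)**2

# boundary and midpoint behavior of Eq. (114)
assert abs(alpha[0] - 1.0) < 1e-12 and abs(beta[-1] - 1.0) < 1e-12
assert abs(gamma[t.searchsorted(0.5)] - 1.0) < 1e-9

# alpha^2 + beta^2 + gamma^2 = cos^4 + sin^4 here, which stays of order one
s = alpha**2 + beta**2 + gamma**2
print(s.min(), s.max())  # ≈ 0.5 (attained at t = 1/4, 3/4) and ≈ 1.0
```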

Even though the choice Equation 114 encodes $\rho_0$ into pure noise on the interval $[0, \tfrac12]$, which is then decoded into $\rho_1$ on the interval $[\tfrac12, 1]$ (and vice-versa when proceeding backwards in time), the resulting velocity $b$ still defines a single continuity equation that maps $\rho_0$ to $\rho_1$ on $[0, 1]$. This is most clearly seen at the level of the probability flow Equation 32, since its solution $X_t$ is a bijection between the initial and final conditions $X_{t=0}$ and $X_{t=1}$, but a similar pairing can also be observed in the solutions to the forward and backward SDEs Equation 33 and 34, whose solutions at time $t=1$ or $t=0$ remain correlated with the initial or final condition used.

**Figure 6:** **The functions $\alpha(t)$, $\beta(t)$, and $\gamma(t)$** for the linear Equation 112, trigonometric Equation 113 with $\gamma(t)=\sqrt{2t(1-t)}$, Gaussian encoding-decoding Equation 114, one-sided Equation 79, and mirror Equation 89 interpolants.

This allows for a more direct means of image-to-image translation with diffusions when compared to the recent approach described in [48]. The choice Equation 114 is depicted in the final row of Figure 4, where no spurious modes form at all; individual sample trajectories of the deterministic and stochastic generative models based on ODEs and SDEs whose solutions have this $\rho(t)$ as density can be seen in the panels forming the third column in Figure 5. We note that the elimination of spurious intermediate modes can also be implemented by use of a data-adapted coupling $\nu(dx_0, dx_1)$, as considered in [63].

Unsurprisingly, it is necessary to have $\gamma(t) > 0$ for the choice Equation 114: for $\gamma(t) = 0$, the density $\rho(t)$ collapses to a Dirac measure at $t=\frac12$. This consideration highlights that the inclusion of the latent variable $\gamma(t) z$ matters even for the deterministic dynamics Equation 32, and its presence is distinct from the stochasticity inherent to the SDEs Equation 33 and 34.

4.3 Impact of the latent variable $\gamma(t) z$ and the diffusion coefficient $\epsilon(t)$

The stochastic interpolant framework enables us to discern the independent roles of the latent variable $\gamma(t) z$ and the diffusion coefficient $\epsilon(t)$ we use in a generative model. As shown in Theorem 5, the presence of the latent variable $\gamma(t) z$ for $\gamma \neq 0$ smooths both the density $\rho(t)$ and the velocity $b$ defined in Equation 11 spatially. This provides a computational advantage at sample generation time because it simplifies the required numerical integration of Equation 32, Equation 33, and 34. Intuitively, this is because the density $\rho(t)$ of $x_t$ can be represented exactly as the density that would be obtained with $\gamma(t) = 0$ convolved with $\mathsf{N}(0, \gamma^2(t)Id)$ at each $t\in(0, 1)$. A comparison between the density $\rho(t)$ obtained with trigonometric interpolants with $\gamma(t) = 0$ and $\gamma(t) = \sqrt{2t(1-t)}$ can be seen in the first and second row of Figure 4.

By contrast, the diffusion coefficient $\epsilon(t)$ leaves the density $\rho(t)$ unchanged, and only affects the way we sample it. In particular, the probability flow ODE Equation 32 results in a map that pushes every $X_{t=0}=x_0$ onto a single $X_{t=1}=x_1$ and vice-versa. The forward SDE Equation 33 maps each $X^\mathsf{F}_{t=0}=x_0$ onto an ensemble $X^\mathsf{F}_{t=1}$ whose spread is controlled by the amplitude of $\epsilon(t)$ (and similarly for the reversed SDE Equation 34 that maps each $X^\mathsf{B}_{t=1}=x_1$ onto an ensemble $X^\mathsf{B}_{t=0}$). This ensemble is not distributed according to $\rho_1$ for finite $\epsilon(t)$ – like with the ODE, we need to sample initial conditions from $\rho_0$ to get solutions at time $t=1$ that sample $\rho_1$ – but its density converges towards $\rho_1$ as $\epsilon(t) \to \infty$. These features are illustrated in Figure 5.
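This dichotomy can be seen in a 1d toy case of our choosing (an assumption: $\rho_0 = \rho_1 = {\sf N}(0,1)$ with a variance-preserving interpolant, for which $\rho(t)={\sf N}(0,1)$ for all $t$, so that $b(t,x)=0$ and $s(t,x)=-x$ in closed form). The forward SDE then reduces to an Ornstein–Uhlenbeck process: the marginal stays fixed for every $\epsilon$, while the paths decorrelate as $\epsilon$ grows. A minimal Euler–Maruyama sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
b = lambda t, x: 0.0 * x    # velocity vanishes in this symmetric toy case
s = lambda t, x: -x         # score of N(0, 1)

def forward_sde(x0, eps, N=400):
    # Euler-Maruyama for the forward SDE Eq. (33): dX = (b + eps*s) dt + sqrt(2 eps) dW
    x, h = x0.copy(), 1.0 / N
    for j in range(N):
        t = j * h
        x = x + (b(t, x) + eps * s(t, x)) * h \
              + np.sqrt(2.0 * eps * h) * rng.standard_normal(x.shape)
    return x

x0 = rng.standard_normal(50_000)
stds, corrs = {}, {}
for eps in (0.0, 1.0, 5.0):
    x1 = forward_sde(x0, eps)
    stds[eps], corrs[eps] = x1.std(), np.corrcoef(x0, x1)[0, 1]
    print(eps, round(stds[eps], 3), round(corrs[eps], 3))
# std stays ≈ 1 for every eps (rho(t) is unchanged); corr decays (paths spread)
```

With $\epsilon=0$ the map is deterministic (correlation one); larger $\epsilon$ spreads each initial condition over an ensemble, exactly as described above.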

########## {caption="Remark 36"}

Another potential advantage of including the latent variable $\gamma(t) z$ is its impact on the velocity $b$ at the end points. Since $x_{t=0}=x_0$ and $x_{t=1}=x_1$, it is easy to see that the velocity $b$ of the linear interpolant $x_t$ defined in Equation 104 satisfies

$ \begin{aligned} b(0, x) &= \dot\alpha(0) x + \dot{\beta}(0) {\mathbb{E}} [x_1|x_0=x] - \lim_{t\to0} \gamma(t) \dot\gamma(t) s_0(x), \\ b(1, x) &= \dot\alpha(1) {\mathbb{E}} [x_0|x_1=x] + \dot{\beta}(1) x - \lim_{t\to1} \gamma(t) \dot\gamma(t) s_1(x), \end{aligned}\tag{115} $

where $s_0 = \nabla \log \rho_0$ and $s_1 = \nabla \log \rho_1$. If $\gamma\in C^2([0, 1])$, because $\gamma(0)=\gamma(1)=0$, the terms involving the scores $s_0$ and $s_1$ in these expressions vanish. Choosing $\gamma^2 \in C^1([0, 1])$ but $\gamma$ not differentiable at $t=0$ or $t=1$ leaves open the possibility that the limits remain nonzero. For example, if we take one of the choices discussed in Section 3.1, i.e.

$ \gamma(t) = \sqrt{a t(1-t)}, \qquad a>0,\tag{116} $

we obtain

$ \lim_{t\to0}\gamma(t)\dot{\gamma}(t) = - \lim_{t\to1}\gamma(t)\dot{\gamma}(t) = \frac{a}{2}.\tag{117} $
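This limit follows from a one-line computation that uses only $\gamma^2 \in C^1([0, 1])$:

$ \gamma(t)\dot\gamma(t) = \tfrac12 \tfrac{d}{dt}\gamma^2(t) = \tfrac{a}{2}\tfrac{d}{dt}\big(t(1-t)\big) = \tfrac{a}{2}(1-2t), $

which tends to $\tfrac a2$ as $t\to0$ and to $-\tfrac a2$ as $t\to1$, even though $\dot\gamma$ itself diverges at both endpoints.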

As a result, the choice Equation 116 ensures that the velocity $b$ encodes information about the score of the densities $\rho_0$ and $\rho_1$ at the end points. We stress however that, while the choice of $\gamma(t)$ given in Equation 116 is appealing because of its nontrivial influence on the velocity $b$ at the endpoints, the user is free to explore a variety of alternatives. We present some examples in Table 2, specifying the differentiability of $\gamma$ at $t=0$ and $t=1$. The function $\gamma(t)$ specified in Equation 116 is the only featured case for which the contribution from the score is non-vanishing in the velocity $b$ at the endpoints. In Section 7, we illustrate on numerical examples that there are tradeoffs between different choices of $\gamma$, which might be directly related to this fact. When using the ODE as a generative model, the score is only felt through $b$, whereas it is explicit when using the SDE as a generative model.

: Table 2: Differentiability of $\gamma(t)z$. A characterization of the possible choices of $\gamma(t)$ with respect to their differentiability. The column specified by $\hat{\sigma}(t)$ is a sum of sigmoid functions, made compact by the notation $\hat\sigma(t) = \sigma(f (t - \tfrac{1}{2}) + 1) - \sigma(f(t - \tfrac{1}{2}) - 1) - \sigma(-\tfrac{f}{2}+1) + \sigma(-\tfrac{f}{2}-1) $, where $\sigma(t) = e^t/(1+e^t)$ and $f$ is a scaling factor.

| $\gamma(t)$ | $\sqrt{a t(1-t)}$ | $t(1-t)$ | $\hat{\sigma}(t)$ | $\sin^2(\pi t)$ |
|---|---|---|---|---|
| $C^1$ at $t=0, 1$ | no | yes | yes | yes |

4.4 Spatially linear one-sided interpolants

Much of the discussion above generalizes to one-sided interpolants if we take the function $J(t, x_1)$ in Equation 79 to be linear in $x_1$ and define

$ x^\mathsf{os, lin}_t = \alpha(t) z + \beta(t) x_1, \qquad t\in[0, 1]\tag{118} $

where $\alpha^2, \beta\in C^2([0, 1])$ and $\alpha(0)=\beta(1) = 1$, $\alpha(1) = \beta(0) = 0$, and $\alpha(t) >0$ for all $t\in[0, 1)$. The velocity $b$ and the score $s$ defined in Equation 11 and 15 can now be expressed as

$ b(t, x) = \dot\alpha(t) \eta^\mathsf{os}_z(t, x) + \dot{\beta}(t) \eta^\mathsf{os}_1(t, x), \qquad s(t, x) = -\alpha^{-1}(t) \eta^\mathsf{os}_z(t, x),\tag{119} $

where the second expression holds for all $t\in [0, 1)$ and we defined:

$ \eta^\mathsf{os}_z(t, x) = {\mathbb{E}}(z|x^\mathsf{os, lin}_t = x), \qquad \eta^\mathsf{os}_1(t, x) = {\mathbb{E}}(x_1|x^\mathsf{os, lin}_t = x).\tag{120} $

Note that, by definition of the conditional expectation, $\eta^\mathsf{os}_z$ and $\eta^\mathsf{os}_1$ satisfy

$ \forall (t, x) \in [0, 1]\times {\mathbb{R}}^d \quad : \quad \alpha(t) \eta^\mathsf{os}_z(t, x) + \beta(t) \eta^\mathsf{os}_1(t, x) = x.\tag{121} $

As a result, only one of them needs to be estimated. For example, we can express $\eta^\mathsf{os}_1$ as a function of $\eta^\mathsf{os}_z$ for all $t$ such that $\beta(t)\not=0$, and use the result to express the velocity Equation 119 as

$ b(t, x) = \dot{\beta}(t) \beta^{-1}(t) x + \big(\dot{\alpha}(t) - \alpha(t) \dot\beta(t) \beta^{-1}(t)\big) \eta^\mathsf{os}_z(t, x) \qquad \forall t \ : \ \beta(t)\not=0.\tag{122} $

Assuming that $\beta(t)\not=0$ for all $t\in(0, 1]$, this formula only needs to be supplemented at $t=0$ with

$ b(0, x) = \dot\alpha(0) x + \dot\beta(0) {\mathbb{E}}[x_1]\tag{123} $

which follows from Equation 119 since $x^\mathsf{os, lin}_{t=0} = z$. Later in Section 5.2 we will show that using the velocity $b$ in Equation 122 to solve the probability flow ODE Equation 32 can be seen as using a denoiser to construct a generative model.
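For a concrete check of these identities, consider the toy target $\rho_1 = {\sf N}(0, \sigma^2)$ in 1d (our assumption, made so that the conditional expectations in Equation 120 are available in closed form): then $\eta^\mathsf{os}_z = \alpha x/v$ and $\eta^\mathsf{os}_1 = \sigma^2\beta x/v$ with $v = \alpha^2+\sigma^2\beta^2$, and both the constraint Equation 121 and the equivalence of Equations 119 and 122 can be verified numerically:

```python
import numpy as np

t = np.linspace(0.05, 0.95, 19)[:, None]   # avoid beta(0) = 0
x = np.linspace(-3.0, 3.0, 7)[None, :]

# toy target (our assumption): rho_1 = N(0, sigma^2) in 1d, for which the
# conditional expectations of Eq. (120) are linear in x
sigma = 2.0
alpha, dalpha = 1.0 - t, -1.0
beta,  dbeta  = t, 1.0
v = alpha**2 + sigma**2 * beta**2
eta_z = alpha * x / v
eta_1 = sigma**2 * beta * x / v

# the constraint of Eq. (121)
assert np.allclose(alpha * eta_z + beta * eta_1, x)

# Eq. (119) and Eq. (122) give the same velocity wherever beta != 0
b_119 = dalpha * eta_z + dbeta * eta_1
b_122 = dbeta / beta * x + (dalpha - alpha * dbeta / beta) * eta_z
assert np.allclose(b_119, b_122)
print("Eq. (121) and the equivalence of Eq. (119)/(122) verified")
```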

Finally note that $\eta_z$ and/or $\eta_1$ can be estimated using the following two objective functions, respectively:

$ \begin{aligned} \mathcal L_{\eta_z}(\hat{\eta}^\mathsf{os}_z) &= \int_0^1 {\mathbb{E}} \left[\tfrac12|\hat{\eta}^\mathsf{os}_z(t, x^\mathsf{os, lin}_t)|^2 - z\cdot \hat{\eta}^\mathsf{os}_z(t, x^\mathsf{os, lin}_t)\right] dt, \\ \mathcal L_{\eta_1}(\hat{\eta}^\mathsf{os}_1) &= \int_0^1 {\mathbb{E}} \left[\tfrac12|\hat{\eta}^\mathsf{os}_1(t, x^\mathsf{os, lin}_t)|^2 - x_1\cdot \hat{\eta}^\mathsf{os}_1(t, x^\mathsf{os, lin}_t)\right] dt. \end{aligned}\tag{124} $

5. Connections with other methods

Section Summary: This section explores how the stochastic interpolant framework relates to other generative modeling techniques, particularly score-based diffusion models and stochastic localization methods, which use processes like the Ornstein-Uhlenbeck equation to evolve data toward a normal distribution over time. By reparameterizing these diffusion processes into a finite time interval, stochastic interpolants eliminate biases and singularities that arise in traditional diffusion models, allowing for unbiased sample generation without additional computational effort. This approach separates the definition of probability paths from the sampling process, making it more flexible and numerically stable compared to direct adaptations of diffusion equations.

In this section, we discuss connections between the stochastic interpolant framework and the score-based diffusion method ([5]), the stochastic localization framework ([64, 65, 66]), denoising methods ([67, 68, 69, 4]), and the rectified flow method ([7]).

5.1 Score-based diffusion models and stochastic localization

Score-based diffusion models (SBDM) are based on variants of the Ornstein-Uhlenbeck process

$ d Z_\tau = - Z_\tau d\tau + \sqrt{2} dW_\tau, \qquad Z_{\tau=0} \sim \rho_1,\tag{125} $

which has the property that the marginal density of its solution at time $\tau$ converges to a standard normal as $\tau$ tends towards infinity. By learning the score of the density of $Z_\tau$, we can write the associated backward SDE for Equation 125, which can then be used as a generative model – this backwards SDE is also the one that is used in the stochastic localization process, see [66].

To see the connection with stochastic interpolants, notice that the solution of Equation 125 from the initial condition $Z_{\tau=0} = x_1\sim \rho_1$ can be written exactly as

$ Z_\tau = x_1 e^{-\tau} + \sqrt{2} \int_0^\tau e^{-\tau+s} dW_s.\tag{126} $

As a result, the law of $Z_\tau$ conditioned on $Z_{\tau=0} = x_1$ is given by

$ Z_\tau \sim {\sf N}(x_1 e^{-\tau}, (1-e^{-2\tau})\text{\it Id}),\tag{127} $

for any time $\tau\in[0, \infty)$. This is also the law of the process

$ y_\tau = x_1 e^{-\tau} + \sqrt{1-e^{-2\tau}}\, z, \qquad z\sim {\sf N}(0, \text{\it Id}), \qquad \tau \in [0, \infty).\tag{128} $

If we let $x_1\sim \rho_1$ with $x_1\perp z$, the process $y_\tau$ is similar to a one-sided stochastic interpolant, except the density of $y_\tau$ only converges to ${\sf N}(0, \text{\it Id})$ as $\tau\to\infty$; by contrast, the one-sided interpolants we introduced in Section 3.2 converge on the finite interval $[0, 1]$. In SBDM, this is handled by capping the evolution of $Z_\tau$ to a finite time interval $[0, T]$ with $T < \infty$, and then by using the backward SDE associated with Equation 125 restricted to $[0, T]$. However, this introduces a bias that is not present with one-sided stochastic interpolants, because the final condition used for the backwards SDE in SBDM is drawn from ${\sf N}(0, \text{\it Id})$ even though the density of the process Equation 125 is not Gaussian at time $T$.

We can, however, turn Equation 128 into a one-sided linear stochastic interpolant by defining $t=e^{-\tau}$ and by choosing $\alpha(t)$ and $\beta(t)$ in Equation 118 to have a specific form. More precisely, evaluating Equation 128 at $\tau = -\log t$,

$ y_{\tau=-\log t} = \sqrt{1-t^2} z + t x_1 \equiv x^\mathsf{os, lin}_t \quad \text{for} \quad \alpha(t) = \sqrt{1-t^2}, \quad \beta(t) =t.\tag{129} $

With this choice of $\alpha(t)$ and $\beta(t)$, from Equation 119 we get the velocity field

$ b(t, x) = -\frac{t}{\sqrt{1-t^2}} \eta^\mathsf{os}_z(t, x) + \eta^\mathsf{os}_1(t, x) \equiv t s(t, x) + \eta^\mathsf{os}_1(t, x)\tag{130} $

where $\eta_z^\mathsf{os}$ and $\eta_1^\mathsf{os}$ are defined in Equation 120. This expression shows that the velocity $b$ used in the probability flow ODE Equation 32 is well-behaved at all times, including at $t=1$ where $\dot\alpha(t)$ is singular. The same is true for the drift $b_\mathsf{F}(t, x) = b(t, x) + \epsilon(t) s(t, x)$ used in the forward SDE Equation 33, regardless of the choice of $\epsilon \in C^0([0, 1])$ with $\epsilon(t)\ge 0$. This shows that casting SBDM into a one-sided linear stochastic interpolant Equation 118 allows the construction of unbiased generative models that operate on $t\in[0, 1]$. This comes at no extra computational cost, since only one of the two functions defined in Equation 120 needs to be estimated, which is akin to estimating the score in SBDM.

It is worth comparing the above procedure to an equivalent change of time at the level of the diffusion process Equation 125, which we now show leads to singular terms that pose numerical and analytical difficulties. Indeed, if we define $Z^\mathsf{B}_t = Z_{\tau=-\log t}$, from Equation 125 we obtain

$ d Z^\mathsf{B}_t = t^{-1} Z^\mathsf{B}_t dt + \sqrt{2 t^{-1}} dW^\mathsf{B}_t, \quad Z^\mathsf{B}_{t=1} \sim \rho_1,\tag{131} $

to be solved backwards in time. Because of the factor $t^{-1}$, this SDE cannot easily be solved until $t=0$, which corresponds to $\tau=\infty$ in the original Equation 125. For the same reason, the forward SDE associated with Equation 131

$ d Z^\mathsf{F}_t = t^{-1} Z^\mathsf{F}_tdt + 2 t^{-1}s(t, Z^\mathsf{F}_t) dt + \sqrt{2 t^{-1}} dW_t,\tag{132} $

cannot be solved from $t=0$, where formally $Z^\mathsf{F}_{t=0} \stackrel{d}{=} Z^\mathsf{B}_{t=0} = Z_{\tau=\infty} \sim \mathsf N(0, \text{\it Id})$. This means it cannot be used as a generative model unless we start from some $t>0$, which introduces a bias. Importantly, this problem does not arise with the stochastic interpolant framework, because the construction of the density $\rho(t)$ connecting $\rho_0$ and $\rho_1$ is handled separately from the construction of the process that generates samples from $\rho(t)$. By contrast, SBDM combines these two operations into one, leading to the singularity at $t=0$ in the coefficients in Equation 131 and 132.

########## {caption="Remark 37"}

To emphasize the last point made above, we stress that there is no contradiction between having a singular drift and diffusion coefficient in Equation 132, and being able to write a nonsingular SDE with stochastic interpolants. To see why, notice that the stochastic interpolant tells us that we can change the diffusion coefficient in Equation 132 to any nonsingular $\epsilon\in C^0([0, 1])$ with $\epsilon(t)\ge 0$ and replace this SDE with

$ d X^\mathsf{F}_t = t^{-1} X^\mathsf{F}_t dt + (t^{-1} +\epsilon(t)) s(t, X^\mathsf{F}_t) dt + \sqrt{2 \epsilon(t)} dW_t.\tag{133} $

This SDE has the property that $X^\mathsf{F}_{t=1}\sim \rho_1$ if $X^\mathsf{F}_{t=0}\sim \rho_0$, and its drift is also nonsingular at $t=0$ and given precisely by Equation 130. Indeed, using the constraint Equation 121, which here reads $x= \sqrt{1-t^2} \eta^\mathsf{os}_z(t, x) + t \eta^\mathsf{os}_1(t, x)\equiv -(1-t^2) s(t, x) + t \eta^\mathsf{os}_1(t, x)$, it is easy to see that

$ t^{-1} x + t^{-1} s(t, x) = t s(t, x) + \eta_1^\mathsf{os}(t, x),\tag{134} $

which is nonsingular at $t=0$.
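The cancellation in Equation 134 can be checked explicitly in a toy case (our assumption: $\rho_1 = {\sf N}(0, \sigma^2)$ in 1d with the parameterization of Equation 129), where $v(t) = (1-t^2)+\sigma^2 t^2$, $s(t,x) = -x/v(t)$, and $\eta^\mathsf{os}_1(t,x) = \sigma^2 t x/v(t)$ in closed form:

```python
import numpy as np

# toy target (our assumption): rho_1 = N(0, sigma^2) in 1d with the SBDM
# parameterization alpha = sqrt(1 - t^2), beta = t of Eq. (129)
sigma = 2.0
t = np.linspace(0.05, 1.0, 20)[:, None]
x = np.linspace(-3.0, 3.0, 7)[None, :]
v = (1.0 - t**2) + sigma**2 * t**2   # variance of x_t under this target
s = -x / v                           # score of N(0, v)
eta_1 = sigma**2 * t * x / v         # conditional expectation of x_1

lhs = x / t + s / t                  # singular-looking drift of Eq. (132)/(133)
rhs = t * s + eta_1                  # nonsingular velocity of Eq. (130)
assert np.allclose(lhs, rhs)
print("identity (134) verified: the t^{-1} singularities cancel")
```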

5.2 Denoising methods

Consider the spatially-linear one-sided stochastic interpolant defined in Equation 118. By solving this equation for $x_1$, we obtain

$ x_1 = \beta^{-1}(t) \left(x^\mathsf{os, lin}_t - \alpha(t) z \right)\qquad t \in (0, 1].\tag{135} $

Taking a conditional expectation at fixed $x^\mathsf{os, lin}_t$ and using Equation 120 implies that

$ {\mathbb{E}} (x_1| x^\mathsf{os, lin}_t) = \eta^\mathsf{os}_1(t, x^\mathsf{os, lin}_t) = \beta^{-1}(t) \left(x^\mathsf{os, lin}_t - \alpha(t) \eta^\mathsf{os}_z(t, x^\mathsf{os, lin}_t) \right)\qquad t \in (0, 1]\tag{136} $

while trivially ${\mathbb{E}} (x_1| x^\mathsf{os, lin}_{t=0}) = {\mathbb{E}}[x_1]$ since $x^\mathsf{os, lin}_{t=0} = z$. This expression is commonly used in denoising methods ([67, 69]), and it is Stein’s unbiased risk estimator (SURE) for $x_1$ given the noisy information in $x^\mathsf{os, lin}_t$ [70]. Rather than considering the conditional expectation of $x_1$, we can consider an analogous quantity for $x_s^{\mathsf{os, lin}}$ for any $s \in [0, 1]$; this leads to the following result.

########## {caption="Lemma 38: SURE"}

For $s\in [0, 1]$, we have

$ {\mathbb{E}} (x^\mathsf{os, lin}_s| x^\mathsf{os, lin}_t) = \frac{\beta(s)}{\beta(t)} x^\mathsf{os, lin}_t + \left(\alpha(s) - \frac{\alpha(t)\beta(s)}{\beta(t)}\right) \eta^\mathsf{os}_z(t, x^\mathsf{os, lin}_t) \qquad t \in (0, 1]\tag{137} $

and ${\mathbb{E}} (x^\mathsf{os, lin}_s| x^\mathsf{os, lin}_{t=0}) = \alpha(s) x^\mathsf{os, lin}_{t=0} + \beta(s) {\mathbb{E}}[x_1]$.

Equation 137 follows from inserting Equation 135 into the expression for $x^\mathsf{os, lin}_s$ and taking the conditional expectation, using the definitions of $\eta_1^\mathsf{os}$ and $\eta_z^\mathsf{os}$ in Equation 120. The identity ${\mathbb{E}} (x^\mathsf{os, lin}_s| x^\mathsf{os, lin}_{t=0}) = \alpha(s) x^\mathsf{os, lin}_{t=0} + \beta(s) {\mathbb{E}}[x_1]$ follows from $x^\mathsf{os, lin}_{t=0} = z$ together with the independence of $x_1$ and $z$.

At this stage, equations 135 and 137 cannot be used as generative models: the random variable ${\mathbb{E}} (x_1| x^\mathsf{os, lin}_t) $ is not a sample of $\rho_1$, and the random variable ${\mathbb{E}} (x^\mathsf{os, lin}_s| x^\mathsf{os, lin}_t) $ is not a sample from $\rho(s)$, the density of $x^\mathsf{os, lin}_s$. However, the following result shows that if we iterate upon formula Equation 137 by taking infinitesimal steps, we obtain a generative model consistent with the probability flow equation 32 associated with $x^\mathsf{os, lin}_t$.

########## {caption="Theorem 39"}

Let $t_j=j/N$ with $j\in\{1, \ldots, N\}$, set $X^\mathsf{den}_{1} = z$, and define for $j=1, \ldots, N-1$,

$ X^\mathsf{den}_{j+1} = \frac{\beta(t_{j+1})}{\beta(t_j)} X^\mathsf{den}_{j}+ \left(\alpha(t_{j+1}) - \frac{\alpha(t_j)\beta(t_{j+1})}{\beta(t_j)}\right) \eta^\mathsf{os}_z(t_{j}, X^\mathsf{den}_{j}) .\tag{138} $

Then, Equation 138 is a consistent integration scheme for the probability flow equation 32 associated with the velocity field Equation 119 expressed as in Equation 122. That is, if $N, j\to\infty$ with $j/N \to t\in[0, 1]$, then $X^\mathsf{den}_j \to X_t$ where

$ \dot{X}_t =b(t, X_t) = \frac{\dot{\beta}(t)}{\beta(t)} X_t + \left(\dot{\alpha}(t) - \frac{\alpha(t) \dot\beta(t)}{\beta(t)}\right) \eta^\mathsf{os}_z(t, X_t), \quad X_{t=0} = z.\tag{139} $

In particular, if $z\sim {\sf N}(0, \text{\it Id})$, then $X^\mathsf{den}_N \to x_1 \sim \rho_1$ in this limit.

The proof of this theorem is given in Appendix B.7, and proceeds by Taylor expansion of the right-hand side of Equation 138.
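To make Theorem 39 concrete, here is a minimal 1d sketch under a toy assumption of ours: $\rho_1 = {\sf N}(0, \sigma^2)$, for which the denoiser of Equation 120 is available in closed form as $\eta^\mathsf{os}_z(t,x) = \alpha(t)x/(\alpha^2(t)+\sigma^2\beta^2(t))$. Because the scheme Equation 138 is then linear in $x$, propagating the value $x=1$ yields the factor multiplying the initial noise $z$; it converges to $\sigma$, so that $X^\mathsf{den}_N = \sigma z \sim \rho_1$:

```python
import numpy as np

sigma = 2.0                     # toy target (our assumption): rho_1 = N(0, sigma^2)
N = 2000
alpha = lambda t: 1.0 - t       # a valid one-sided choice: alpha(0) = beta(1) = 1
beta  = lambda t: t

def eta_z(t, x):
    # analytic denoiser for the Gaussian target
    v = alpha(t)**2 + sigma**2 * beta(t)**2
    return alpha(t) * x / v

# Eq. (138) is linear in x here, so propagating x = 1 gives the multiplicative
# factor applied to the initial noise z
x = 1.0
for j in range(1, N):
    tj, tn = j / N, (j + 1) / N
    r = beta(tn) / beta(tj)
    x = r * x + (alpha(tn) - alpha(tj) * r) * eta_z(tj, x)
print(x)  # ≈ sigma: X^den_N = sigma * z ~ N(0, sigma^2), i.e. a sample of rho_1
```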

5.3 Rectified flows

We now discuss how stochastic interpolants can be rectified according to the procedure proposed in [7]. Suppose that we have perfectly learned the velocity field $b$ in the probability flow equation 32 for a given stochastic interpolant. Denote by $X_t(x)$ the solution to this ODE with the initial condition $X_{t=0}(x)=x$, i.e.

$ \frac{d}{dt} X_t(x) = b(t, X_t(x)), \qquad X_{t=0}(x)=x.\tag{140} $

We can use the map $X_{t=1}: {\mathbb{R}}^d \to {\mathbb{R}}^d$ to define a new stochastic interpolant

$ x^\mathsf{rec}_t = \alpha(t) x_0 + \beta(t) X_{t=1}(x_0),\tag{141} $

where $\alpha^2, \beta\in C^2([0, 1])$ satisfy $\alpha(0)=\beta(1) = 1$, $\alpha(1) = \beta(0) = 0$, and $\alpha(t) >0$ for all $t\in[0, 1)$. Clearly, we then have $x^\mathsf{rec}_{t=0} = x_0\sim \rho_0$ since $X_{t=0}(x_0)=x_0$ and $x^\mathsf{rec}_{t=1} = X_{t=1}(x_0) \sim \rho_1$ by definition of the probability flow equation. We can define a new probability flow equation associated with the velocity field

$ b^\mathsf{rec}(t, x) = {\mathbb{E}}[\dot{x}^\mathsf{rec}_t | x^\mathsf{rec}_t =x] = \dot\alpha(t) {\mathbb{E}}[x_0 | x^\mathsf{rec}_t =x] + \dot\beta(t) {\mathbb{E}}[X_{t=1}(x_0)| x^\mathsf{rec}_t =x].\tag{142} $

It is easy to see that this velocity field is amenable to estimation, since it is the unique minimizer of

$ \mathcal L_{b^\mathsf{rec}}[\hat{b}^\mathsf{rec}] = \int_0^1 {\mathbb{E}} \big[\tfrac12 |\hat{b}^\mathsf{rec}(t, x^\mathsf{rec}_t) |^2 - (\dot\alpha(t) x_0 + \dot\beta(t) X_{t=1}(x_0)) \cdot \hat{b}^\mathsf{rec}(t, x^\mathsf{rec}_t)\big] dt,\tag{143} $

where $x^\mathsf{rec}_t$ is given in Equation 141 and the expectation is now only over $x_0 \sim \rho_0$. Our next result shows that the probability flow equation associated with the velocity field Equation 142 has straight-line solutions, but ultimately it leads to a generative model that is identical to the one based on Equation 140. To phrase this result, we first make an assumption on the invertibility of $x_t^{\mathsf{rec}}$.

########## {caption="Assumption 40"}

The map $x\to M(t, x)$, where $M(t, x) = \alpha(t) x + \beta(t) X_{t=1}(x)$ with $X_t(x)$ the solution to Equation 140, is invertible for all $(t, x)\in[0, 1]\times {\mathbb{R}}^d$, i.e. $\exists\, N(t, \cdot): {\mathbb{R}}^d \to {\mathbb{R}}^d $ such that

$ \forall (t, x)\in[0, 1]\times {\mathbb{R}}^d \quad : \quad N(t, M(t, x))= M(t, N(t, x)) = x.\tag{144} $

This is equivalent to requiring that the determinant of the Jacobian of $M(t, x)$ is nonzero for all $(t, x)\in[0, 1]\times {\mathbb{R}}^d$; put differently, Assumption 40 requires that the Jacobian of $X_{t=1}(x)$ never has eigenvalues precisely equal to $-\alpha(t)/\beta(t)$, which is generic. Under this assumption, we state the following theorem.

########## {caption="Theorem 41"}

Consider the probability flow equation associated with Equation 142,

$ \frac{d}{dt} X^\mathsf{rec}_t(x) = b^\mathsf{rec}(t, X^\mathsf{rec}_t(x)), \qquad X^\mathsf{rec}_{t=0}(x)=x.\tag{145} $

Then, all solutions are such that $X^\mathsf{rec}_{t=1}(x_0)\sim \rho_1$ if $x_0 \sim \rho_0$. In addition, if Assumption 40 holds, the velocity field defined in Equation 142 reduces to

$ b^\mathsf{rec}(t, x) = \dot{\alpha}(t) N(t, x) + \dot\beta(t) X_{t=1}(N(t, x))\tag{146} $

and the solution to the probability flow ODE Equation 145 is simply

$ X^\mathsf{rec}_t(x)= \alpha(t)x + \beta(t)X_{t=1}(x).\tag{147} $

The proof is given in Appendix B.8.

Theorem 41 implies that $X^\mathsf{rec}_t(x)$ is a simpler flow than $X_t(x)$, but we stress that they give the same map, $X^\mathsf{rec}_{t=1}=X_{t=1}$. In particular, $X^\mathsf{rec}_t(x)$ reduces to a straight line between $x$ and $X_{t=1}(x)$ for $\alpha(t)=1-t$ and $\beta(t)=t$. We also note that the approach can be used to learn a single-step map, since Equation 146 and $N(t=0, x) = x$ give

$ b^\mathsf{rec}(t=0, x) = \dot{\alpha}(0) x + \dot\beta(0) X_{t=1}(x),\tag{148} $

which expresses $X_{t=1}(x)$ in terms of known quantities as long as $\dot{\beta}(0) \not = 0$. For example, if $\dot\alpha(0)=0$ and $\dot{\beta}(0) = 1$, we obtain $b^\mathsf{rec}(t=0, x) = X_{t=1}(x)$.
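As a concrete sanity check, the claim of Theorem 41 can be verified numerically in one dimension. The sketch below (our own illustration, not code from the paper) assumes an affine one-step map $X_{t=1}(x) = ax + c$, for which the inverse $N(t, \cdot)$ of $M(t, \cdot)$ is available in closed form, and confirms that Euler integration of the probability flow Equation 145 with the drift Equation 146 lands on $X_{t=1}(x_0)$:

```python
import numpy as np

# Hypothetical 1D illustration of Theorem 41 (not the paper's code): with
# alpha(t) = 1 - t, beta(t) = t, and an assumed affine map X1(x) = a*x + c,
# the rectified drift b_rec(t, x) = alpha'(t) N(t, x) + beta'(t) X1(N(t, x))
# should yield the straight-line solution of Equation 147.

a, c = 2.0, -1.0                      # assumed affine one-step map X1
X1 = lambda x: a * x + c
alpha = lambda t: 1.0 - t
beta = lambda t: t

def N(t, y):
    # inverse of M(t, x) = alpha(t)*x + beta(t)*X1(x) = (alpha + beta*a)*x + beta*c
    return (y - beta(t) * c) / (alpha(t) + beta(t) * a)

def b_rec(t, x):
    n = N(t, x)
    return -1.0 * n + 1.0 * X1(n)     # alpha'(t) = -1, beta'(t) = 1

# forward-Euler integration of the probability flow ODE (145)
x0, dt = 0.7, 1e-4
x, t = x0, 0.0
while t < 1.0 - 1e-12:
    x += dt * b_rec(t, x)
    t += dt

print(x, X1(x0))  # the numerical endpoint should match X1(x0)
```

With $\alpha(t)=1-t$ and $\beta(t)=t$, the resulting trajectory is exactly the straight line of Equation 147.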

########## {caption="Remark: Optimal transport"}

The discussion above highlights the fact that a probability flow equation can have straight line solutions and lead to a map that exactly pushes $\rho_0$ onto $\rho_1$ but is not the optimal transport map. That is, having straight line solutions is a necessary condition for optimal transport, but it is not a sufficient one.

########## {caption="Remark: Gradient fields"}

The map is unaffected by the rectification procedure because we do not impose that the velocity $b^\mathsf{rec}(t, x)$ be a gradient field. If we do impose this structure by setting $b^\mathsf{rec}(t, x) = \nabla \phi(t, x)$ for some $\phi: [0, 1]\times{\mathbb{R}}^d \to {\mathbb{R}}$, then $X_{t=1}^\mathsf{rec} \not = X_{t=1}$. As shown in [40], iterating this procedure eventually gives the optimal transport map. That is, implemented over gradient fields and iterated infinitely, rectification computes Brenier's polar decomposition of the map ([71]).

########## {caption="Remark: Consistency models"}

Recent work has introduced the notion of consistency models ([72]), which distill a velocity field learned via score-based diffusion into a single-step map. Section 5.1 and the previous discussion provide an alternative perspective on consistency models, and show how they may be computed in the framework of stochastic interpolants via rectification.

6. Algorithmic aspects

Section Summary: This section outlines practical algorithms for implementing the earlier mathematical methods, focusing on two main tasks: training models to learn key drift coefficients and generating samples using ordinary or stochastic differential equations. It compares approaches for learning the overall drift function b directly, which is more efficient for deterministic models requiring just one neural network, as shown in Algorithm 1, versus separately learning components v and s for flexible stochastic models that might need two networks. To ensure stable training, it recommends antithetic sampling to reduce high variance near the start and end times, a technique that pairs positive and negative noise perturbations to keep computations reliable even with singular terms.

The methods described in the previous sections have efficient numerical realizations. Here, we detail algorithms and practical recommendations for an implementation. These suggestions can be split into two complementary tasks: learning the drift coefficients, and sampling with an ODE or an SDE.

6.1 Learning

As described in Section 2.2, there are a variety of algorithmic choices that can be made when learning the drift coefficients in Equation 10, Equation 20, and Equation 22. While all choices lead to exact generative models in the absence of numerical and statistical errors, in practice the presence of these errors means that different choices lead to different generative models, some of which may perform better for specific applications. Here, we describe the various realizations explicitly.

Deterministic generative modeling: Learning $b$ versus learning $v$ and $s$.

Recall from Section 2.2 that the drift $b$ of the transport equation 10 can be written as $b(t, x) = v(t, x) - \gamma(t)\dot{\gamma}(t)s(t, x)$. This raises the practical question of whether it would be better to learn an estimate $\hat{b}$ of $b$ by minimizing the empirical risk

$ \hat{\mathcal{L}}_b(\hat{b}) = \frac{1}{N}\sum_{i=1}^N\left(\frac{1}{2}|\hat{b}(t_i, x_{t_i}^{i})|^2 - \hat{b}(t_i, x_{t_i}^{i})\cdot \left(\partial_t I(t_i, x_0^i, x_1^i) + \dot{\gamma}(t_i) z^{i}\right)\right),\tag{149} $

or to learn estimates of $\hat{v}$ and $\hat{s}$ by minimizing the empirical risks

$ \hat{\mathcal{L}}_v(\hat{v}) = \frac{1}{N}\sum_{i=1}^N\left(\frac{1}{2}|\hat{v}(t_i, x_{t_i}^{i})|^2 - \hat{v}(t_i, x_{t_i}^{i})\cdot \partial_t I(t_i, x_0^i, x_1^i)\right)\tag{150} $

and

$ \hat{\mathcal{L}}_s(\hat{s}) = \frac{1}{N}\sum_{i=1}^N\left(\frac{1}{2}|\hat{s}(t_i, x_{t_i}^{i})|^2 + \gamma(t_i)^{-1} \hat{s}(t_i, x_{t_i}^{i})\cdot z^{i}\right)\tag{151} $

########## {caption="Algorithm 1: Learning the drift $b$"}

**Input:** Batch size $N$, interpolant function $I(t,x_0, x_1)$, coupling $\nu(dx_0, dx_1)$ to sample $x_0, x_1$, noise function $\gamma(t)$, initial parameters $\theta_b$, gradient-based optimization algorithm to minimize $\mathcal L_b$ defined in Equation 14, number of gradient steps $N_g$.
    **Returns**: An estimate $\hat{b}$ of $b$.
**for** $j=1, \cdots, N_g$:
  Draw $N$ samples $(t_i, x_0^i, x_1^i, z^i) \sim \mathsf{Unif}([0, 1])\times\nu \times\mathsf{N}(0, \text{\it Id})$, for $i=1,\cdots, N$.
  Construct samples $x_t^i = I(t_i,x_0^i, x_1^i) + \gamma(t_i)z^i$, for $i = 1, \cdots, N$.
  Take gradient step with respect to $\mathcal L_b\left(\theta_b, \{x_t^i\}_{i=1}^N \right)$.
    **Return:** $\hat{b}$.

and construct the estimate $\hat{b}(t, x) = \hat{v}(t, x) - \gamma(t)\dot{\gamma}(t)\hat{s}(t, x)$. Above, $x_{t_i}^{i} = I(t_i, x_0^i, x_1^i) + \gamma(t_i) z^{i}$, and $N$ denotes the number of samples $t^i$, $x_0^i$, and $x_1^i$. If the practitioner is interested in a deterministic generative model (for example, to exploit adaptive integration or exact likelihood computation), learning the estimate $\hat{b}$ directly only requires learning a single model, and hence will typically lead to greater efficiency. This recommendation is captured in Algorithm 1. If both stochastic and deterministic generative models are of interest, it is necessary to learn two models for most choices of the interpolant; we discuss more suggestions for the stochastic case below.
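To illustrate that minimizing these empirical risks recovers the correct drift, consider a toy one-dimensional example (an assumption for illustration, not one of the paper's experiments) with the linear interpolant $x_t = (1-t)x_0 + t x_1$, $\gamma = 0$, and independent $x_0, x_1 \sim \mathsf{N}(0, 1)$. The population minimizer of Equation 149 at fixed $t$ is $b(t, x) = \mathbb{E}[x_1 - x_0 \mid x_t = x] = \frac{2t-1}{(1-t)^2 + t^2}\, x$, and a least-squares fit over linear models $\hat{b}(x) = \theta x$ recovers it:

```python
import numpy as np

# Toy check (assumed setup, not the paper's code): for the linear interpolant
# x_t = (1-t)*x0 + t*x1 with gamma = 0 and independent x0, x1 ~ N(0, 1),
# the minimizer of the quadratic risk (149) at fixed t is the conditional
# expectation b(t, x) = (2t - 1) / ((1-t)^2 + t^2) * x.

rng = np.random.default_rng(0)
t, N = 0.3, 200_000
x0 = rng.standard_normal(N)
x1 = rng.standard_normal(N)
xt = (1 - t) * x0 + t * x1            # interpolant samples
dI = x1 - x0                          # time derivative of I(t, x0, x1)

# closed-form minimizer of the empirical risk over linear models b_hat(x) = theta*x:
# (1/N) sum( 0.5*theta^2*xt^2 - theta*xt*dI )  ->  theta* = <xt*dI> / <xt^2>
theta_hat = np.mean(xt * dI) / np.mean(xt ** 2)
theta_exact = (2 * t - 1) / ((1 - t) ** 2 + t ** 2)

print(theta_hat, theta_exact)         # should agree up to Monte-Carlo error
```

The same mechanism underlies the neural-network case: the quadratic losses are minimized by the conditional expectations defining the exact drifts.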

Antithetic sampling and capping.

In practice, the losses Equation 14 for $b$ and Equation 17 for $s$ can become high-variance near the endpoints $t=0$ and $t=1$ due to the presence of the singular term $1/\gamma(t)$ and the (potentially) singular term $\dot{\gamma}(t)$. This issue can be eliminated by using antithetic sampling, which we found necessary for stable training of objectives involving $\gamma^{-1}(t)$. To show why, we consider the loss Equation 17 for $s$, but an analogous calculation can be performed for the loss Equation 14 for $b$ or Equation 19 for $\eta_z$ (even though it is not necessary for this last quantity). We first observe that, by definition of $x_t$ and by Taylor expansion, as $t\to0$ or $t\to1$

$ \begin{aligned} \frac{1}{\gamma(t)}z\cdot s(t, x_t) &= \frac{1}{\gamma(t)}z\cdot s(t, I(t, x_0, x_1) + \gamma(t) z), \\ &= \frac{1}{\gamma(t)}z\cdot \left(s(t, I(t, x_0, x_1)) + \gamma(t)\nabla s(t, I(t, x_0, x_1))z + o(\gamma(t))\right), \\ &= \frac{1}{\gamma(t)}z\cdot s(t, I(t, x_0, x_1)) + z\cdot \nabla s(t, I(t, x_0, x_1))z + o(1). \end{aligned}\tag{152} $

Even though the conditional mean of the first term on the right-hand side is finite in the limit as $t\to0$ or $t\to1$, its variance diverges. By contrast, let $x_t^+ = I(t, x_0, x_1) + \gamma(t) z$ and $x_t^- = I(t, x_0, x_1) - \gamma(t) z$ with $x_0, x_1$, and $z$ fixed. Then,

$ \begin{aligned} &\frac{1}{2\gamma(t)}\left(z\cdot s(t, x_t^+) - z\cdot s(t, x_t^-)\right) \\ &= \frac{1}{2\gamma(t)}\left(z\cdot s(t, I(t, x_0, x_1) + \gamma(t) z) - z\cdot s(t, I(t, x_0, x_1) - \gamma(t) z)\right), \\ &= \frac{1}{2\gamma(t)}z\cdot \left(s(t, I(t, x_0, x_1)) + \gamma(t)\nabla s(t, I(t, x_0, x_1))z + o(\gamma(t))\right)\\ &\qquad -\frac{1}{2\gamma(t)}z\cdot \left(s(t, I(t, x_0, x_1)) - \gamma(t)\nabla s(t, I(t, x_0, x_1))z + o(\gamma(t))\right), \\ &= z\cdot \nabla s(t, I(t, x_0, x_1))z + o(1), \end{aligned}\tag{153} $

so that both the conditional mean and variance are finite in the limit as $t\to0$ or $t\to1$ despite the singularity of $1/\gamma(t)$. In practice, this can be implemented by using $x_t^+$ and $x_t^-$ for every draw of $x_0, x_1$, and $z$ in the empirical discretization of the population loss.
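The variance reduction can be checked directly. In the sketch below we assume, purely for illustration, a known linear score $s(t, x) = -x$; the naive estimator of $\gamma^{-1} z \cdot s(t, x_t)$ has variance of order $\gamma^{-2}$, while the antithetic difference remains $O(1)$ as predicted by Equation 153:

```python
import numpy as np

# Variance-reduction sketch (illustrative assumption: a known smooth score
# s(t, x) = -x). Near the endpoints gamma(t) is small, and the naive
# estimator of z . s(t, x_t) / gamma(t) in (152) has variance O(1/gamma^2),
# while the antithetic pair in (153) stays O(1).

rng = np.random.default_rng(1)
gamma, N = 1e-3, 100_000
xbar = rng.standard_normal(N)         # stands in for I(t, x0, x1)
z = rng.standard_normal(N)
s = lambda x: -x                      # assumed score for the illustration

naive = z * s(xbar + gamma * z) / gamma
antithetic = (z * s(xbar + gamma * z) - z * s(xbar - gamma * z)) / (2 * gamma)

print(naive.var(), antithetic.var())  # naive variance blows up as gamma -> 0
```

In an actual training loop, this amounts to evaluating the loss at both $x_t^+$ and $x_t^-$ for every draw of $(x_0, x_1, z)$, as stated above.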

########## {caption="Algorithm 2: Learning the denoiser $\eta_z$"}

**Input:** Batch size $N$, interpolant function $I(t,x_0, x_1)$, a coupling $\nu(dx_0, dx_1)$ to sample $x_0, x_1$, noise function $\gamma(t)$, initial parameters $\theta_{\eta_z}$, gradient-based optimization algorithm to minimize $\mathcal L_{\eta_z}$ defined in Equation 19, number of gradient steps $N_g$.
    **Returns**: An estimate $\hat{\eta}_z$ of $\eta_z$.
**for** $j = 1, \cdots, N_g$:
  Draw $N$ samples $(t_i, x_0^i, x_1^i, z^i) \sim \mathsf{Unif}([0, 1])\times\nu \times\mathsf{N}(0, \text{\it Id})$, for $i=1,\cdots, N$.
  Construct samples $x_t^i = I(t_i,x_0^i, x_1^i) + \gamma(t_i)z^i$, for $i = 1, \cdots, N$.
  Take gradient step with respect to $\mathcal L_{\eta_z}\left(\theta_{\eta_z}, \{x_t^i\}_{i=1}^N \right)$.
    **Return:** $\hat{\eta}_z$.

Learning the score $s$ versus learning a denoiser $\eta_z$.

When learning $s$, an alternative to antithetic sampling is to consider learning the denoiser $\eta_z$ defined in Equation 18, which is related to the score by a factor of $\gamma$. Note that the objective function for the denoiser in Equation 19 is well behaved for all $t \in [0, 1]$, and can be thought of as a generalization of the DDPM loss introduced in [4]. The empirical risk associated with this loss reads

$ \hat{\mathcal{L}}_{\eta_z}(\hat{\eta}_z) = \frac{1}{N}\sum_{i=1}^N\left(\frac{1}{2}|\hat{\eta}_z(t_i, x_{t_i}^{i})|^2 - \hat{\eta}_z(t_i, x_{t_i}^{i})\cdot z^{i}\right).\tag{154} $

A detailed procedure for learning the denoiser $\eta_z$, e.g. for use in an SDE-based generative model, is given in Algorithm 2. For the case of one-sided spatially-linear interpolants, the procedure becomes particularly simple, as highlighted in Algorithm 3.

########## {caption="Algorithm 3: Learning the denoiser for one-sided spatially-linear interpolants"}

**Input:** Batch size $N$, interpolant function $x^\mathsf{os,lin}_t$, a coupling $\nu(dz, dx_1)$ to sample $z, x_1$, initial parameters $\theta_{\eta_z^\mathsf{os}}$, gradient-based optimization algorithm to minimize $\mathcal L_{\eta_z^\mathsf{os}}$ defined in Equation 19, number of gradient steps $N_g$.
    **Returns**: An estimate $\hat{\eta}_z^\mathsf{os}$ of $\eta_z^\mathsf{os}$.
**for** $j = 1, \cdots, N_g$:
  Draw $N$ samples $(t_i, z^i, x_1^i) \sim \mathsf{Unif}([0, 1])\times\nu$, for $i=1,\cdots, N$.
  Construct samples $x_t^i = \alpha(t_i)z^i + \beta(t_i)x_1^i$, for $i = 1, \cdots, N$.
  Take gradient step with respect to $\mathcal L_{\eta_z^\mathsf{os}}\left(\theta_{\eta_z^\mathsf{os}}, \{x_t^i\}_{i=1}^N \right)$.
    **Return:** $\hat{\eta}_z^\mathsf{os}$.

6.2 Sampling

We now discuss several practical aspects of sampling generative models based on stochastic interpolants. These are intimately related to the choice of objects that are learned, as well as to the specific interpolant used to build a path between $\rho_0$ and $\rho_1$. A general algorithm for sampling models built on either ordinary or stochastic differential equations is presented in Algorithm 4.

########## {caption="Algorithm 4: Sampling with an ODE or an SDE"}

**Input:** Number of samples $n$, timestep $\Delta t$, drift estimates $\hat{b}$ and $\hat{\eta}_z$, initial time $t_0$, final time $t_f$, noise function $\gamma(t)$, diffusion coefficient $\epsilon(t)$, SDE or ODE timestepper `TakeStep`.
    **Returns**: $\{\hat{x}_{t_f}^{(i)}\}_{i=1}^n$, a batch of model samples.
**Initialize:**
  Set time $t = t_0$.
  Draw initial conditions $\hat{x}^{(i)}_{t_0} \sim \rho_0$ for $i = 1, \cdots, n$.
  Construct $\hat{s}(t, x) = -\hat{\eta}_z(t, x) / \gamma(t)$.
  Construct $\hat{b}_{\mathsf{F}}(t, x) = \hat{b}(t, x) + \epsilon(t) \hat{s}(t, x)$. // Reduces to $\hat{b}$ for $\epsilon(t) = 0$ (ODE).
**while** $t < t_f$:
  Propagate $\hat{x}^{(i)}_{t+\Delta t} = \texttt{TakeStep}(t, \hat{x}^{(i)}_t, \hat{b}_\mathsf{F}, \epsilon, \Delta t)$ for $i = 1, \cdots, n$. // ODE or SDE integrator.
  Update $t = t + \Delta t$.
    **Return**: $\{\hat{x}^{(i)}_{t_f}\}_{i=1}^n$.

Using the denoiser $\eta_z$ instead of the score $s$.

We remarked in Section 6.1 that learning the denoiser $\eta_z$ is more numerically stable than learning the score $s$ directly. However, while the objective for $\eta_z$ is well-behaved for all $t \in [0, 1]$, the resulting drifts can become singular at $t=0$ and $t=1$ when using $s(t, x) = -\eta_z(t, x) / \gamma(t)$. There are several ways to avoid this singularity in practice. One method is to choose a time-varying $\epsilon(t)$ that vanishes in a small interval around the endpoints $t=0$ and $t=1$. An alternative is to integrate the SDE up to a final time $t_f < 1$, and then to perform a step of denoising using Equation 137. We use this approach in Section 7 below when sampling the SDE.

A denoiser is all you need for spatially-linear one-sided interpolants.

########## {caption="Algorithm 5: Sampling a one-sided spatially-linear interpolant with only a denoiser"}

**Input:** Number of samples $n$, timestep $\Delta t$, denoiser estimate $\hat{\eta}_z^\mathsf{os}$, initial time $t_0$, final time $t_f$, noise function $\gamma(t)$, diffusion coefficient $\epsilon(t)$, interpolant functions $\alpha(t)$ and $\beta(t)$, SDE or ODE timestepper `TakeStep`.
    **Returns**: $\{\hat{x}_{t_f}^{(i)}\}_{i=1}^n$, a batch of model samples.
**Initialize:**
  Set time $t = t_0$.
  Draw initial conditions $\hat{x}^{(i)}_{t_0} \sim \rho_0$ for $i = 1, \cdots, n$.
  Construct $\hat{s}(t, x) = -\hat{\eta}_z^\mathsf{os}(t, x) / \alpha(t)$.
  Construct $\hat{b}(t, x) = \dot{\alpha}(t)\hat{\eta}_z^\mathsf{os}(t, x) + \frac{\dot\beta(t)}{\beta(t)}\left(x - \alpha(t)\hat{\eta}_z^\mathsf{os}(t, x)\right)$.
  Construct $\hat{b}_{\mathsf{F}}(t, x) = \hat{b}(t, x) + \epsilon(t) \hat{s}(t, x)$. // Reduces to $\hat{b}$ for $\epsilon(t) = 0$ (ODE).
**while** $t < t_f$:
  Propagate $\hat{x}^{(i)}_{t+\Delta t} = \texttt{TakeStep}(t, \hat{x}^{(i)}_t, \hat{b}_\mathsf{F}, \epsilon, \Delta t)$ for $i = 1, \cdots, n$. // ODE or SDE integrator.
  Update $t = t + \Delta t$.
    **Return**: $\{\hat{x}^{(i)}_{t_f}\}_{i=1}^n$.

As shown in Equation 122, and as considered in Section 5.2, the denoiser $\eta_z^{\mathsf{os}}$ is sufficient to represent the velocity field $b$ appearing in the probability flow equation 32.

Using this definition for $b$ and the relationship $s(t, x) = -\eta_z^\mathsf{os}(t, x) / \alpha(t)$, we obtain the following ordinary and stochastic differential equations for sampling

$ \begin{aligned} \text{ODE}:& \quad \dot{X}_t = \dot{\alpha}(t) \eta^\mathsf{os}_z(t, X_t) + \frac{\dot{\beta}(t)}{\beta(t)} \big (X_t- \alpha(t) \eta^\mathsf{os}_z(t, X_t) \big), \\ \text{SDE}:& \quad dX^\mathsf{F}_t = \big (\dot{\alpha}(t) \eta^\mathsf{os}_z(t, X^\mathsf{F}_t) + \frac{\dot{\beta}(t)}{\beta(t)} \big (X^\mathsf{F}_t- \alpha(t) \eta^\mathsf{os}_z(t, X^\mathsf{F}_t) \big) - \frac{\epsilon(t)}{\alpha(t)} \eta^\mathsf{os}_z(t, X^\mathsf{F}_t) \big) dt \\ &\qquad\qquad + \sqrt{2 \epsilon(t)} dW_t. \end{aligned} $

Because $\beta(0) = 0$, the drift in both equations is numerically singular at $t=0$. However, $b(t=0, x)$ has a finite limit

$ b(t=0, x) = \dot\alpha(0) x + \dot{\beta}(0) \mathbb E[x_1],\tag{155} $

as originally given in Equation 123. Equation 155 can be estimated using available data, which means that when learning a one-sided interpolant, ODE and SDE-based generative models can be defined exactly on the interval $t \in [0, 1]$ using only a score or a denoiser without singularity.

The factor of $\alpha(t)^{-1}$ in the final term of the SDE could pose numerical problems at $t=1$, as $\alpha(1) = 0$. As discussed in the paragraph above, a choice of $\epsilon(t)$ which is such that $\epsilon(t)/\alpha(t) \rightarrow C$ for some constant $C$ as $t\rightarrow 1$ avoids any issue.

An algorithm for sampling with only the denoiser $\eta_z^\mathsf{os}$ is given in Algorithm 5.
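A minimal one-dimensional sketch of this denoiser-only sampler (with an assumed Gaussian target for which $\eta_z^\mathsf{os}$ is available analytically, rather than a learned estimate) is:

```python
import numpy as np

# Minimal 1D sketch of denoiser-only sampling (cf. Algorithm 5), under the
# assumed toy target rho_1 = N(m, sigma^2) with the one-sided linear
# interpolant x_t = (1-t) z + t x_1, for which eta_z is known in closed
# form: eta_z(t, x) = alpha(t) (x - beta(t) m) / (alpha(t)^2 + beta(t)^2 sigma^2).

rng = np.random.default_rng(2)
m, sigma = 2.0, 0.5
n, dt = 200_000, 1e-3
t0, tf = 1e-3, 1.0 - 1e-3             # cap endpoints to avoid the 1/beta singularity

alpha = lambda t: 1.0 - t
beta = lambda t: t

def eta_z(t, x):
    Sigma = alpha(t) ** 2 + beta(t) ** 2 * sigma ** 2
    return alpha(t) * (x - beta(t) * m) / Sigma

x = rng.standard_normal(n)            # x ~ rho_0 = N(0, 1), approximating rho(t0)
t = t0
while t < tf:
    eta = eta_z(t, x)
    # probability flow drift: b = alpha'*eta + (beta'/beta)*(x - alpha*eta)
    b = -eta + (x - alpha(t) * eta) / beta(t)
    x = x + dt * b                    # forward Euler step of the ODE
    t += dt

print(x.mean(), x.var())              # should approach m = 2.0 and sigma^2 = 0.25
```

Here the endpoints are capped at $t_0$ and $t_f = 1 - t_0$, as discussed above; alternatively, Equation 155 can be used to evaluate the drift at $t=0$.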

7. Numerical results

Section Summary: This section presents numerical experiments evaluating generative models based on ordinary differential equations (ODEs) and stochastic differential equations (SDEs), focusing on how parameters like gamma(t), which shapes the probability flow, and epsilon, which controls noise levels, influence sample quality. In two-dimensional tests using a challenging checkerboard density, SDE sampling with added noise generally outperformed deterministic ODE sampling, though the difference was smallest for a particular gamma function that improves endpoint behavior. Additional experiments on high-dimensional Gaussian mixtures and image generation further demonstrate the tradeoffs, showing SDEs provide better control and accuracy when learned drift and score estimates are imperfect.

**Figure 7:** **The effects of $\gamma(t)$ and $\epsilon$ on sample quality: qualitative comparison.** Kernel density estimates of $\hat\rho(1)$ for models with different choices of $\gamma$ and $\epsilon$. Sampling with $\epsilon = 0$ corresponds to using the probability flow with the learned drift $\hat{b} = \hat{v} - \gamma\dot{\gamma} \hat{s}$, whereas sampling with $\epsilon >0 $ corresponds to using the SDE with the learned drift $\hat{b}$ and score $\hat{s}$. We find empirically that SDE sampling is generically better than ODE sampling for this target density, though the gap is smallest for the probability flow specified with $\gamma(t) = \sqrt{t(1-t)}$, in agreement with Remark 36 regarding the influence of $\gamma$ on $b$ at the endpoints. The SDE performs well at any noise level, though numerically integrating it for higher $\epsilon$ requires a smaller step size.

So far, we have focused on the impact of $\alpha$, $\beta$, and $\gamma$ in Equation 104 on the density $\rho(t)$, which we illustrated analytically. In this section, we study examples where the drift coefficients must be learned over parametric function classes. In particular, we explore numerically the tradeoffs between generative models based on ODEs and SDEs, as well as the various design choices introduced in Section 3, Section 4, and Section 6. In Section 7.1, we consider simple two-dimensional distributions that can be visualized easily. In Section 7.2, we consider high-dimensional Gaussian mixtures, where we can compare our learned models to analytical solutions. Finally, in Section 7.3, we perform experiments on image generation.

7.1 Deterministic versus stochastic models: 2D

**Figure 8:** **The effects of $\gamma(t)$ and $\epsilon$ on sample quality: quantitative comparison.** For each $\gamma$ and each $\epsilon$ specified in Figure 7, we compute the mean and variance of the absolute value of the difference of $\log \rho_1$ (exact) and $\log \hat{\rho}(1)$ (model). The model specified with $\gamma(t) = \sqrt{t(1-t)}$ is the best performing probability flow ($\epsilon = 0$). At large $\epsilon$, SDE sampling with the same learned drift $\hat{b}$ and score $\hat{s}$ performs better, complementing the observations in the previous figure.

As shown in Section 2.2, the evolution of $\rho(t)$ can be captured exactly by either the transport equation 10 or by the forward and backward Fokker-Planck equations 20 and 22. These perspectives lead to generative models that are either based on the deterministic dynamics Equation 32 or the forward and backward stochastic dynamics Equation 33 and 34, where the level of stochasticity can be tuned by varying the diffusion coefficient $\epsilon(t)$. We showed in Section 2.4 that setting a constant $\epsilon(t) = \epsilon > 0$ can offer better control on the likelihood when using an imperfect velocity $b$ and an imperfect score $s$. Moreover, the optimal choice of $\epsilon$ is determined by the relative accuracy of the estimates $\hat{b}$ and $\hat{s}$. Having laid out the evolution of $\rho(t)$ for different choices of $\gamma$ in the previous section, we now show how different values of $\epsilon$ can build these densities up from individual trajectories. The stochasticity intrinsic to the sampling process increases with $\epsilon$, but by construction, the marginal density $\rho(t)$ for fixed $\alpha$, $\beta$ and $\gamma$ is independent of $\epsilon$.

The roles of $\gamma(t)$ and $\epsilon$ for 2D density estimation.

To explore the roles of $\gamma$ and $\epsilon$, we consider a target density $\rho_1$ whose mass concentrates on a two-dimensional checkerboard and a base density $\rho_0 = \mathsf{N}(0, \text{\it Id})$; here, the target was chosen to highlight the ability of the method to learn a challenging density with sharp boundaries. The same model architecture and training procedure were used to learn both $v$ and $s$ for several choices of $\gamma$ given in Table 2. The feed-forward network was defined with $4$ layers, each of size $512$, with ReLU activations [73].

After training, we draw 300,000 samples using either an ODE ($\epsilon=0$) or an SDE with $\epsilon = 0.5$, $\epsilon=1.0$, or $\epsilon = 2.5$. We compute kernel density estimates for each resulting density, which we compare to the exact density and to the original stochastic interpolant from [1] (obtained by setting $\gamma = 0$). Results are given in Figure 7 for each $\gamma$ and each $\epsilon$. Sampling with $\epsilon > 0$ empirically performs better, though the gap is smallest when using the $\gamma$ specified in Equation 116. Moreover, even when $\epsilon = 0$, using the probability flow with $\gamma$ given by Equation 116 performs better than the original interpolant from [1]. Numerical comparisons of the mean and variance of the absolute value of the difference of $\log \rho_1$ (exact) from $\log \hat\rho(1)$ (model) for the various configurations are given in Figure 8, which corroborate the above observations.

7.2 Deterministic versus stochastic models: 128D Gaussian mixtures

We now study the performance of the stochastic interpolant method in the case where the target is a high-dimensional Gaussian mixture. Gaussian mixtures (GMs) are a convenient class of target distributions to study, because they can be made arbitrarily complex by increasing the number of modes, their separation, and the overall dimensionality. Moreover, by considering low-dimensional projections, we can compute quantitative error metrics such as the $\mathsf{KL}$-divergence between the target and the model as a function of the (constant) diffusion coefficient $\epsilon$. This enables us to quantify the tradeoffs of ODE and SDE-based samplers.

Experimental details.

We consider the problem of mapping $\rho_0 = \mathsf{N}(0, \text{\it Id})$ to a Gaussian mixture with five modes in dimension $d=128$. The mean $m_i\in {\mathbb{R}}^d$ of each mode is drawn i.i.d. $m_i \sim \mathsf{N}(0, \sigma^2 \text{\it Id})$ with $\sigma = 7.5$. To maximize performance at high $\epsilon$, the timestep should be adapted to $\epsilon$; here, we chose to use a fixed computational budget that performs well for moderate levels of $\epsilon$ to avoid computational effort that may become unreasonable in practice.

**Figure 9:** **Gaussian mixtures: target projection.** Low-dimensional marginal of the target density $\rho_1$ for the Gaussian mixture experiment, visualized via KDE.

Each covariance $C_i \in {\mathbb{R}}^{d\times d}$ is also drawn randomly with $C_i = \tfrac{1}{d}W_i^\mathsf{T} W_i + \text{\it Id}$ and $(W_i)_{kl} \sim \mathsf{N}(0, 1)$ for $k, l = 1, \cdots, d$; this choice ensures that each covariance is a positive definite perturbation of the identity with diagonal entries that are $O(1)$ with respect to dimension $d$. For a fixed random draw of the means $\{m_i\}_{i=1}^5$ and covariances $\{C_i\}_{i=1}^5$, we study the four combinations of learning $b$ or $v$ and $s$ or $\eta_z$ to form a stochastic interpolant from $\rho_0$ to $\rho_1$. In each case, we consider a linear interpolant with $\alpha(t) = 1-t$, $\beta(t) = t$, and $\gamma(t) = \sqrt{t(1-t)}$. For visual reference, a projection of the target density $\rho_1$ onto the first two coordinates is depicted in Figure 9 – it contains significant multimodality, several modes that are difficult to distinguish, and one mode that is well-separated from the others, which requires nontrivial transport to resolve. In the following experiments, all samples were generated with the fourth-order Dormand-Prince adaptive ODE solver (dopri5) for $\epsilon=0$ and by using one thousand timesteps of the Heun SDE integrator introduced in [10] for $\epsilon \neq 0$. When learning $\eta_z$, to avoid singularity at $t=0$ and $t=1$ when dividing by $\gamma(t)$ in the formula $s(t, x) = -\eta(t, x) / \gamma(t)$, we set $t_0 = 10^{-4}$ and $t_f = 1 - t_0$ in Algorithm 4. For all other cases, we set $t_0 = 0$ and $t_f = 1$.
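For reference, a simplified Heun-type predictor-corrector SDE step (in the spirit of the integrator of [10], though not necessarily identical to the scheme used in our experiments) reuses the same Brownian increment in both stages and treats only the drift to second order:

```python
import numpy as np

# Sketch of a Heun-type predictor-corrector SDE step (simplified; in the
# spirit of [10] but not claimed to be the exact scheme used in the paper).
# The same Brownian increment is reused in the predictor and corrector
# stages, and the drift is averaged trapezoidally.

def heun_sde_step(t, x, drift, eps, dt, rng):
    dw = np.sqrt(dt) * rng.standard_normal(x.shape)
    noise = np.sqrt(2.0 * eps) * dw
    x_pred = x + dt * drift(t, x) + noise                  # Euler predictor
    d_avg = 0.5 * (drift(t, x) + drift(t + dt, x_pred))    # trapezoidal drift
    return x + dt * d_avg + noise

# sanity check on an Ornstein-Uhlenbeck process dX = -X dt + sqrt(2) dW,
# whose stationary distribution is N(0, 1)
rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)
for _ in range(500):
    x = heun_sde_step(0.0, x, lambda t, y: -y, 1.0, 1e-2, rng)
print(x.var())   # should stay close to the stationary variance 1.0
```

The Ornstein-Uhlenbeck check verifies that the scheme preserves the stationary variance of a known SDE, i.e. that drift and noise terms are balanced consistently.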

**Figure 10:** **Gaussian mixtures: density errors.** Errors $\hat{\rho}_1(x, y) - \rho_1(x, y)$ in the marginals over the first two coordinates for all four variations of learning $b$ or $v$ and $s$ or $\eta$, computed via kernel density estimation. For small $\epsilon$, the model densities tend to be overly-concentrated, and overestimate the density within the modes and underestimate the densities in the tails. As $\epsilon$ increases, the model becomes less concentrated and more accurately represents the target density. For $\epsilon$ too large, the model becomes overly spread out and under-estimates the density within the modes. *Note: visualizations show two-dimensional slices of a $128$-dimensional density.*

Quantitative metric

To quantify performance, we make use of an error metric given by a $\mathsf{KL}$-divergence between kernel density estimates (KDE) of low-dimensional marginals of $\rho_1$ and the model density $\hat{\rho}_1$; this error metric was chosen for computational tractability and interpretability. To compute it, we draw 50,000 samples from $\rho_1$ and each $\hat{\rho}_1$. We obtain samples from the marginal density over the first two coordinates by projection, and then compute a Gaussian KDE with bandwidth parameter chosen by Scott's rule. We then draw a fresh set of $N_e = 50{,}000$ samples $\{x_i\}_{i=1}^{N_e}$ with each $x_i \sim \rho_1$ for evaluation. To compute the $\mathsf{KL}$-divergence, we form a Monte-Carlo estimate with control variate

$ \mathsf{KL}(\rho_1 \,\Vert\, \hat{\rho}_1) \approx \frac{1}{N_e}\sum_{i=1}^{N_e} \left(\log \rho_1(x_i) - \log \hat{\rho}_1(x_i) - \left(\frac{\hat{\rho}_1(x_i)}{\rho_1(x_i)} - 1\right)\right). $

We found use of the control variate $\hat{\rho}_1 / \rho_1 - 1$ helpful to reduce variance in the Monte-Carlo estimate; moreover, because $\log r \leq r - 1$ for all $r > 0$ by concavity of the logarithm, each term in the sum is nonnegative, so the Monte-Carlo estimate cannot become negative.
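As a sanity check of this estimator, one can evaluate it on a pair of one-dimensional Gaussians for which the divergence is known in closed form, $\mathsf{KL}(\mathsf{N}(0,1)\,\Vert\,\mathsf{N}(\mu,1)) = \mu^2/2$ (an illustration of ours, not one of the reported experiments):

```python
import numpy as np

# Monte-Carlo check of the control-variate KL estimator on an assumed pair of
# 1D Gaussians where the answer is exact: KL(N(0,1) || N(mu,1)) = mu^2 / 2.

rng = np.random.default_rng(4)
mu, Ne = 0.5, 50_000
x = rng.standard_normal(Ne)           # x_i ~ rho_1 = N(0, 1)

log_rho1 = -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
log_rhat = -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
ratio = np.exp(log_rhat - log_rho1)   # rho_hat_1(x) / rho_1(x)

# each term -log(r) - (r - 1) is nonnegative since log(r) <= r - 1
kl_est = np.mean(log_rho1 - log_rhat - (ratio - 1.0))
print(kl_est, mu ** 2 / 2)            # estimate vs exact value 0.125
```

The estimate is nonnegative by construction and matches the closed-form value up to Monte-Carlo error.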

Results.

**Figure 11:** **Gaussian mixtures: densities.** A visualization of the marginal densities in the first two variables of the model density $\hat{\rho}_1$ computed for all four variations of learning $b$ or $v$ and $s$ or $\eta$, computed via kernel density estimation. *Note: visualizations show two-dimensional slices of a $128$-dimensional density.*

**Figure 12:** **Gaussian mixtures: quantitative comparison.** $\mathsf{KL}(\rho_1\: \Vert \: \hat{\rho}_1^{\epsilon})$ as a function of $\epsilon$ when learning each of the four possible sets of drift coefficients $(b, s)$, $(b, \eta)$, $(v, s)$, $(v, \eta)$. The best performance is achieved when learning $b$ and $\eta$, along with a proper choice of $\epsilon > 0$.

Figure 10 and Figure 11 display two-dimensional projections (computed via KDE) of the model density error $\hat{\rho}_1 - \rho_1$ and the model density $\hat{\rho}_1$ itself, respectively, for different instantiations of Algorithm 1 and Algorithm 2 and different choices of $\epsilon$ in Algorithm 4. Taken together with Figure 9, these results demonstrate qualitatively that small values of $\epsilon$ tend to over-estimate the density within the modes and under-estimate the density in the tails. Conversely, when $\epsilon$ is taken too large, the model tends to under-estimate the modes and over-estimate the tails. Somewhere in between (and for differing levels of $\epsilon$), every model obtains its optimal performance. Figure 12 makes these observations quantitative, and displays the $\mathsf{KL}$-divergence from the target marginal to the model marginal $\mathsf{KL}(\rho_1 \,\Vert\, \hat{\rho}_1^\epsilon)$ as a function of $\epsilon$, with each data point on the curve matching the models depicted in Figure 10 and Figure 11. We find that for each case, there is an optimal value of $\epsilon \neq 0$, in line with the qualitative picture put forth by Figure 10 and Figure 11. Moreover, we find that learning $b$ generically performs better than learning $v$, and that learning $\eta$ generically performs better than learning $s$ (except when $\epsilon$ is taken large enough that performance starts to degrade). With proper treatment of the singularity in the sampling algorithm when using the denoiser in the construction of $s(t, x) = -\eta(t, x)/\gamma(t)$ – either by capping $t_0 \neq 0$ and $t_f \neq 1$ or by properly tuning $\epsilon(t)$ as discussed in Section 6.2 – our results suggest that learning the denoiser is best practice.

7.3 Image generation

In the following, we demonstrate that the proposed method scales straightforwardly to high-dimensional problems like image generation. To this end, we illustrate the use of our approach on the $128\times128$ Oxford flowers dataset [74] by testing two different variations of the interpolant for image generation: the one-sided interpolant, using $\rho_0 = \mathsf{N}(0, \text{\it Id})$, as well as the mirror interpolant, where $\rho_0 = \rho_1$ both represent the data distribution. The purpose of this section is to demonstrate that our theory is well-motivated, and that it provides a framework that is both scalable and flexible. In this regard, image generation is a convenient exercise, but is not the main focus of this work, and we will leave a more thorough study on other datasets such as ImageNet with standard benchmarks such as the Frechet Inception Distance (FID) for a future study.

Generation from Gaussian $\rho_0$.

We train spatially-linear one-sided interpolants $x_t = (1 -t)z + t x_1$ and $x_t = \cos({\tfrac{\pi}{2}}t) z + \sin({\tfrac{\pi}{2} t}) x_1$ on the $128\times128$ Oxford flowers dataset, where we take $z \sim \mathsf{N}(0, \text{\it Id})$ and $x_1$ from the data distribution. Based on our results for Gaussian mixtures, we learn the drift $b(t, x)$, the score $s(t, x)$, and the denoiser $\eta_z(t, x)$ to benchmark our generative models based on ODEs or SDEs. In all cases, we parameterize the networks representing $\hat{\eta}_z$, $\hat{s}$, and $\hat{b}$ using the U-Net architecture from [4]. Minimization of the objective functions given in Section 3.2 is performed using the Adam optimizer. Details of the architecture in both cases and all training hyperparameters are provided in Appendix C.1.

**Figure 13:** **Image generation: Oxford flowers.** Example generated flowers from the same initial condition $x_0$ using either the ODE with $\epsilon = 0$ and learned $b$ or the SDE for various increasing values of $\epsilon$ with learned $b$ and $s$. For $\epsilon = 0$, sampling is done using the `dopri5` solver and therefore the number of steps is adaptive. Otherwise, $2000$, $2500$, and $4000$ steps were taken using the Heun solver for $\epsilon = 1.0$, $2.0$, and $4.0$ respectively.

As in the case of learning Gaussian mixtures, we use the fourth-order `dopri5` solver when sampling with the ODE and the Heun method for the SDE, as detailed in Algorithm 4. When learning a denoiser $\eta_z$, we found it beneficial to complete the image generation with a final denoising step, in which we set $\epsilon=0$ and switch the integrator to the one given in Equation 138.

Exemplary images generated from the model using the ODE and the SDE with various values of the diffusion coefficient $\epsilon$ are shown in Figure 13, starting from the same sample from $\rho_0$. The result illustrates that different images can be generated from the same sample when using the SDE, and that their diversity increases with the diffusion coefficient $\epsilon$. To highlight that the model does not memorize the training set, in Figure 14 we compare an example generated image to its five nearest neighbors in the training set (measured in $\ell_1$ norm), which appear visually quite different.

**Figure 14:** **No memorization.** Top: An example generated image from a trained linear one-sided interpolant. Bottom: Five nearest neighbors (in $\ell_1$ norm) to the generated image in the dataset. The nearest neighbors are visually distinct, highlighting that the interpolant does not overfit on the dataset.

Mirror interpolant.

We consider the mirror interpolant $x_t = x_1 + \gamma(t) z$, for which Equation 98 shows that the drift $b$ is given in terms of the denoiser $\eta_z$ by $b(t, x) = \dot{\gamma}(t)\, \eta_z(t, x)$; this means that it is sufficient to learn only an estimate $\hat{\eta}_z$ to construct a generative model. Similar to the previous section, we demonstrate this on the Oxford flowers dataset, again making use of a U-Net parameterization for $\hat{\eta}_z(t, x)$. Further experimental details can be found in Appendix C.1. In this setup the output image at time $t=1$ is the same as the input image if we use the ODE Equation 32; with the SDE, however, we can generate new images from the same input. This is illustrated in Figure 15, where we show how a sample image from the dataset $\rho_1$ is pushed forward through the SDE Equation 33 with $\epsilon(t) = \epsilon = 10$. As can be seen, the original image is resampled to a nearby flower not seen in the dataset.

**Figure 15:** **Mirror interpolant.** Example trajectory from a learned denoiser model $\hat{b} = \dot{\gamma}(t)\hat{\eta}_z$ for the mirror interpolant $x_t = x_1 + \gamma(t) z$ on the $128\times128$ Oxford flowers dataset with $\gamma(t) = \sqrt{t(1-t)}$. The parametric form for $\hat{\eta}_z$ is the U-Net from [4], with hyperparameter details given in Appendix C. The choice of $\epsilon$ in the generative SDE given in Equation 33 influences the extent to which the generated image differs from the original. Here, $\epsilon(t) = 10.0$.
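For Gaussian data the denoiser and score of the mirror interpolant are available in closed form, $\eta_z(t, x) = \gamma(t)x/(1+\gamma^2(t))$ and $s(t, x) = -x/(1+\gamma^2(t))$ in one dimension, which makes the resampling mechanism easy to test without any learning. The sketch below (an illustrative 1d toy, not the image experiment) integrates the SDE with drift $b + \epsilon s$, which by the conventions of Equation 209 leaves the interpolant marginal $\mathsf{N}(0, 1+\gamma^2(t))$ invariant.

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.5
gam = lambda t: np.sqrt(t * (1.0 - t))
n = 100_000

# Mirror interpolant x_t = x_1 + gamma(t) z with x_1 ~ N(0,1): rho(t) = N(0, 1 + gamma^2(t)).
# Start from the exact marginal at t0 and integrate the forward SDE by Euler-Maruyama.
t, t_end, dt = 0.1, 0.5, 1e-3
x = np.sqrt(1.0 + gam(t) ** 2) * rng.standard_normal(n)
while t < t_end - 1e-9:
    g = gam(t)
    eta = g * x / (1.0 + g ** 2)            # exact denoiser E[z | x_t = x]
    b = (1.0 - 2.0 * t) / (2.0 * g) * eta   # b = gamma_dot(t) * eta_z
    s = -x / (1.0 + g ** 2)                 # exact score
    x = x + (b + eps * s) * dt + np.sqrt(2.0 * eps * dt) * rng.standard_normal(n)
    t += dt

# The marginal variance 1 + gamma^2(t) is preserved by the dynamics
assert abs(x.var() - (1.0 + gam(t_end) ** 2)) < 0.03
```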

8. Conclusion

The above exposition provides a full treatment of the stochastic interpolant method, as well as a careful consideration of its relation to existing literature. Our goal is to provide a general framework that can be used to devise generative models built upon dynamical transport of measure. To this end, we have detailed mathematical theory and efficient algorithms for constructing both deterministic and stochastic generative models that map between two densities exactly in finite time. Along the way, we have illustrated the various design parameters that can be used to shape this process, with connections, for example, to optimal transport and Schrödinger bridges. While we detail specific instantiations, such as the mirror and one-sided interpolants, we highlight that there is a much broader space of possible designs that may be relevant for future applications. Several candidate application domains include the solution of inverse problems such as image inpainting and super-resolution, spatiotemporal forecasting of dynamical systems, and scientific problems such as sampling of molecular configurations and machine learning-assisted Markov chain Monte-Carlo.

9. Acknowledgements

We thank Joan Bruna, Jonathan Niles-Weed, Loucas Pillaud-Vivien, and Cédric Gerbelot for helpful discussions regarding stability estimates of dynamical transport. We are also grateful to Qiang Liu, Ricky Chen, and Yaron Lipman for feedback on previous and related work, and to Kyle Cranmer and Michael Lindsey for discussions on transport costs. We thank Mark Goldstein for insightful comments regarding practical considerations when training denoising-diffusion models. MSA is supported by the National Science Foundation under the award PHY-2141336. EVE is supported by the National Science Foundation under awards DMR-1420073, DMS-2012510, and DMS-2134216, by the Simons Collaboration on Wave Turbulence, Grant No. 617006, and by a Vannevar Bush Faculty Fellowship.

Appendix

A. Bridging two Gaussian mixture densities

In this appendix, we consider the case where $\rho_0$ and $\rho_1$ are both Gaussian mixture densities. We denote by

$ \begin{aligned} {\sf N}(x|m, C) &= (2\pi)^{-d/2} [\det C]^{-1/2} \exp\left(-\tfrac12 (x-m)^\mathsf{T} C^{-1} (x-m)\right), \ & = (2\pi)^{-d} \int_{{\mathbb{R}}^d} e^{ik\cdot (x-m) -\frac12 k^\mathsf{T} C k} dk, \end{aligned}\tag{156} $

the Gaussian probability density with mean vector $m \in {\mathbb{R}}^d$ and positive-definite symmetric covariance matrix $C=C^\mathsf{T}\in {\mathbb{R}}^{d\times d}$. We assume that

$ \rho_0(x) = \sum_{i=1}^{N_0} p^0_i {\sf N}(x|m^0_i, C^0_i), \qquad \rho_1(x) = \sum_{i=1}^{N_1} p^1_i {\sf N}(x|m^1_i, C^1_i)\tag{157} $

where $N_0, N_1\in {\mathbb{N}}$, $p^0_i>0$ with $\sum_{i=1}^{N_0} p_i^0 = 1$, $m_i^0\in {\mathbb{R}}^d$, $C_i^0= (C_i^0)^\mathsf{T} \in {\mathbb{R}}^{d\times d}$ positive-definite, and similarly for $p_i^1$, $m_i^1$, and $C_i^1$. We have:

**Proposition 42**

Consider the process $x_t$ defined in Equation 2 using the probability densities in Equation 157 and the interpolant in Equation 104, i.e.

$ \tag{Equation 104} x^\mathsf{lin}_t = \alpha(t) x_0+ \beta(t) x_1 + \gamma(t) z, $

where $x_0 \sim\rho_0$, $x_1\sim \rho_1$, and $z \sim {\sf N}(0, \text{\it Id})$ with $x_0\perp x_1\perp z$, and $\alpha, \beta, \gamma^2\in C^2([0, 1])$ satisfy the conditions in Equation 105. Denote

$ m_{ij}(t) = \alpha(t) m^0_i+\beta(t) m^1_j, \quad C_{ij}(t) = \alpha^2(t) C^0_i+\beta^2(t) C^1_j +\gamma^2(t) \text{\it Id},\tag{158} $

where $i=1, \ldots, N_0, $ $j=1, \ldots, N_1$. Then the probability density $\rho$ of $x_t$ is the Gaussian mixture density

$ \rho(t, x) = \sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j {\sf N}(x| m_{ij}(t), C_{ij}(t))\tag{159} $

and the velocity $b$ and the score $s$ defined in Equation 11 and 15 are

$ b(t, x) = \frac{\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j \left(\dot{m}_{ij}(t) + \tfrac12\dot{C}_{ij}(t) C^{-1}_{ij}(t)(x-m_{ij}(t)) \right) {\sf N}(x| m_{ij}(t), C_{ij}(t))}{\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j {\sf N}(x| m_{ij}(t), C_{ij}(t))},\tag{160} $

and

$ s(t, x) = -\frac{\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j C^{-1}_{ij}(t)(x-m_{ij}(t)) {\sf N}(x| m_{ij}(t), C_{ij}(t))}{\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_i p^1_j {\sf N}(x| m_{ij}(t), C_{ij}(t))}.\tag{161} $

This proposition implies that $b$ and $s$ grow at most linearly in $x$, and are approximately linear in regions where the modes of $\rho(t, x)$ remain well-separated. In particular, if $\rho_0$ and $\rho_1$ are both Gaussian densities, $\rho_0=\mathsf{N}(m_0, C_0)$ and $\rho_1=\mathsf{N}(m_1, C_1)$, we have

$ b(t, x)=\dot{m}(t) + \tfrac12 \dot{C}(t) C^{-1}(t)(x-m(t)),\tag{162} $

and

$ s(t, x)= - C^{-1}(t)(x-m(t)),\tag{163} $

where

$ m(t) = \alpha(t) m_0+ \beta(t) m_1, \quad C(t) = \alpha^2(t) C_0+ \beta^2(t) C_1 +\gamma^2(t) \text{\it Id}.\tag{164} $

Note that the probability flow ODE Equation 32 associated with the velocity Equation 162 is the linear ODE

$ \frac{d}{dt} X_t = \dot{m}(t) + \tfrac12 \dot{C}(t) C^{-1}(t)(X_t-m(t)).\tag{165} $

This equation can only be solved analytically if $\dot{C}(t)$ and $C(t)$ commute (which is the case e.g. if $C_0 = \text{\it Id}$), but it is easy to see that it always guarantees that

$ {\mathbb{E}}_0 X_t(x_0) = m(t), \qquad {\mathbb{E}}_0 \big[(X_t(x_0)-m(t)) (X_t(x_0)-m(t))^\mathsf{T}\big] = C(t),\tag{166} $

where $X_t(x_0)$ denotes the solution to 165 for the initial condition $X_{t=0}(x_0) = x_0$ and ${\mathbb{E}}_0$ denotes expectation over $x_0\sim \rho_0$. A similar statement is true if we solve Equation 165 with final conditions at $t=1$ drawn from $\rho_1$. Similarly, the forward SDE Equation 33 associated with the velocity Equation 162 and the score Equation 163 is the linear SDE

$ d X^\mathsf{F}_t = \dot{m}(t)dt + \big(\tfrac12 \dot{C}(t) - \epsilon\big) C^{-1}(t)(X^\mathsf{F}_t-m(t)) dt + \sqrt{2 \epsilon} dW_t,\tag{167} $

and its solutions are such that

$ {\mathbb{E}}_0 {\mathbb{E}}^{x_0}_\mathsf{F} X_t^\mathsf{F} = m(t), \qquad {\mathbb{E}}_0 {\mathbb{E}}^{x_0}_\mathsf{F}\big[(X^\mathsf{F}_t-m(t))(X_t^\mathsf{F}-m(t))^\mathsf{T}\big] = C(t),\tag{168} $

where ${\mathbb{E}}^{x_0}_\mathsf{F}$ denotes expectation over the solution of Equation 167 conditional on the event $X^\mathsf{F}_{t=0}=x_0$ and ${\mathbb{E}}_0$ denotes expectation over $x_0\sim \rho_0$. A similar statement also holds for the backward SDE Equation 34.
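In one dimension these statements are easy to verify numerically: the sketch below (an illustrative Euler discretization of Equation 165 with hypothetical parameter values) transports samples of $\rho_0$ through the probability flow and checks Equation 166 at $t=1$.

```python
import numpy as np

# 1d instance of the linear probability-flow ODE (Equation 165):
# dX/dt = m'(t) + (1/2) C'(t)/C(t) (X - m(t)),
# with m, C from Equation 164 for alpha = 1 - t, beta = t, gamma^2 = t(1 - t).
m0, m1, C0, C1 = -1.0, 2.0, 0.5, 2.0
m = lambda t: (1 - t) * m0 + t * m1
C = lambda t: (1 - t) ** 2 * C0 + t ** 2 * C1 + t * (1 - t)
dm = m1 - m0                                               # m'(t) is constant here
dC = lambda t: -2 * (1 - t) * C0 + 2 * t * C1 + (1 - 2 * t)

rng = np.random.default_rng(3)
x = m0 + np.sqrt(C0) * rng.standard_normal(200_000)        # X_0 ~ rho_0 = N(m0, C0)
t, dt = 0.0, 1e-3
while t < 1.0 - 1e-9:
    x = x + (dm + 0.5 * dC(t) / C(t) * (x - m(t))) * dt    # Euler step of Equation 165
    t += dt

# Equation 166: the ODE reproduces the interpolant's mean and covariance at t = 1
assert abs(x.mean() - m(1.0)) < 0.02
assert abs(x.var() - C(1.0)) < 0.05
```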

Proof: The characteristic function of $\rho(t, x)$ is given by

$ g(t, k) = {\mathbb{E}} e^{ik\cdot x_t} = \sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j e^{ik\cdot m_{ij}(t) - \tfrac12 k^\mathsf{T} C_{ij}(t) k } .\tag{169} $

whose inverse Fourier transform is Equation 159. This automatically implies Equation 161, since we know from Equation 15 that $s= \nabla \log \rho$. To derive Equation 160, use the function $m$ defined below in Equation 183, which here reads:

$ m(t, k) = \sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j (\dot{m}_{ij}(t) +\tfrac12 i \dot{C}_{ij}(t) k)e^{ik\cdot m_{ij}(t) - \tfrac12 k^\mathsf{T} C_{ij}(t) k } .\tag{170} $

From Equation 184, we know that the inverse Fourier transform of this function is $b\rho$, so that we obtain

$ b(t, x) \rho(t, x) = \sum_{i=1}^{N_0}\sum_{j=1}^{N_1} p^0_ip^1_j \left(\dot{m}_{ij}(t) + \tfrac12\dot{C}_{ij}(t) C_{ij}^{-1}(t)(x-m_{ij}(t))\right) {\sf N}(x| m_{ij}(t), C_{ij}(t)).\tag{171} $

This gives Equation 160.
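The mixture formulas can be checked numerically: the sketch below (a 1d toy with hypothetical mixture parameters and $\alpha = 1-t$, $\beta = t$, $\gamma^2 = t(1-t)$) evaluates Equation 159 and the flux $b\rho$ from Equation 171, and verifies the transport equation $\partial_t \rho + \partial_x (b\rho) = 0$ by central differences.

```python
import numpy as np

def gaussian(x, m, c):
    return np.exp(-0.5 * (x - m) ** 2 / c) / np.sqrt(2 * np.pi * c)

# 1d mixture endpoints (Equation 157): rho_0 has two components, rho_1 has one.
p0, m0, C0 = np.array([0.3, 0.7]), np.array([-2.0, 1.0]), np.array([0.5, 1.0])
p1, m1, C1 = np.array([1.0]), np.array([2.0]), np.array([0.8])

def rho_and_flux(t, x):
    """rho (Eq 159) and b*rho (Eq 171) for alpha = 1-t, beta = t, gamma^2 = t(1-t)."""
    r, j = np.zeros_like(x), np.zeros_like(x)
    for i in range(len(p0)):
        for k in range(len(p1)):
            m = (1 - t) * m0[i] + t * m1[k]
            c = (1 - t) ** 2 * C0[i] + t ** 2 * C1[k] + t * (1 - t)
            dm = m1[k] - m0[i]
            dc = -2 * (1 - t) * C0[i] + 2 * t * C1[k] + (1 - 2 * t)
            w = p0[i] * p1[k] * gaussian(x, m, c)
            r += w
            j += (dm + 0.5 * dc / c * (x - m)) * w
    return r, j

# Check d_t rho + d_x (b rho) = 0 at t = 0.5 by central differences.
t, d = 0.5, 1e-3
x = np.linspace(-5.0, 5.0, 401)
dt_rho = (rho_and_flux(t + d, x)[0] - rho_and_flux(t - d, x)[0]) / (2 * d)
dx_flux = (rho_and_flux(t, x + d)[1] - rho_and_flux(t, x - d)[1]) / (2 * d)
assert np.max(np.abs(dt_rho + dx_flux)) < 1e-4
```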

B. Proofs

In this appendix, we provide the details for proofs omitted from the main text. For ease of reading, a copy of the original theorem statement is provided with the proof.

B.1 Proof of Theorem 5, Theorem 6, and 202, and Corollary 9.

Proof: Let $g(t, k) = {\mathbb{E}} e^{ik \cdot x_t}$, $k \in {\mathbb{R}}^d $, be the characteristic function of $\rho(t, x)$. From the definition of $x_t$ in Equation 2,

$ g(t, k) = {\mathbb{E}} e^{ik\cdot (I(t, x_0, x_1) + \gamma(t) z)}.\tag{172} $

Using the independence between $(x_0, x_1)$ and $z$, we have

$ g(t, k) = {\mathbb{E}} \left(e^{ik\cdot I(t, x_0, x_1) }\right) {\mathbb{E}} \left(e^{i\gamma(t) k\cdot z}\right) \equiv g_0(t, k) e^{-\tfrac12 \gamma^2(t) |k|^2 }\tag{173} $

where we defined

$ g_0(t, k) = {\mathbb{E}} \left(e^{ik\cdot I(t, x_0, x_1) }\right)\tag{174} $

The function $g_0(t, k) $ is the characteristic function of $I(t, x_0, x_1)$ with $(x_0, x_1)\sim\nu$. From Equation 173, we have

$ |g(t, k)| = |g_0(t, k)| e^{-\tfrac12 \gamma^2(t) |k|^2 } \le e^{-\tfrac12 \gamma^2(t) |k|^2 }\tag{175} $

Since $\gamma(t)>0$ for all $t\in(0, 1)$ by assumption, this shows that

$ \forall p \in {\mathbb{N}}\ \ \text{and} \ \ t\in(0, 1) \quad : \quad \int_{{\mathbb{R}}^d} |k|^p |g(t, k)|dk<\infty,\tag{176} $

implying that $\rho(t, \cdot)$ is in $C^p({\mathbb{R}}^d)$ for any $p\in {\mathbb{N}}$ and all $t\in (0, 1)$. From Equation 173, we also have

$ \begin{aligned} |\partial_t g(t, k)|^2 & = \left| {\mathbb{E}}[(ik\cdot \partial_t I(t, x_0, x_1) - \gamma(t) \dot{\gamma}(t) |k|^2) e^{ik\cdot I(t, x_0, x_1)}]\right|^2 e^{-\gamma^2(t) |k|^2 } \ &\le 2\left(|k|^2 {\mathbb{E}}\big[|\partial_t I(t, x_0, x_1)|^2\big] + |\gamma(t) \dot{\gamma}(t)|^2 |k|^4\right) e^{-\gamma^2(t) |k|^2 }\ &\le 2\left(|k|^2 M_1 + 4|\gamma(t) \dot{\gamma}(t)|^2 |k|^4\right) e^{-\gamma^2(t) |k|^2 } \end{aligned}\tag{177} $

and

$ \begin{aligned} |\partial^2_t g(t, k)|^2 & \le 4\left(|k|^2 {\mathbb{E}}\big[|\partial^2_t I(t, x_0, x_1)|^2\big] +(|\dot{\gamma}(t)|^2 + \gamma(t) \ddot{\gamma}(t))^2 |k|^4\right) e^{-\gamma^2(t) |k|^2 }\ & \quad + 8\left(|k|^2 {\mathbb{E}}\big[|\partial_t I(t, x_0, x_1)|^4\big] + (\gamma(t) \dot{\gamma}(t))^4 |k|^8\right) e^{-\gamma^2(t) |k|^2 } \ &\le 4\left(|k|^2 M_2 +(|\dot{\gamma}(t)|^2 + \gamma(t) \ddot{\gamma}(t))^2 |k|^4\right) e^{-\gamma^2(t) |k|^2 }\ & \quad + 8\left(|k|^2 M_1 + (\gamma(t) \dot{\gamma}(t))^4 |k|^8\right) e^{-\gamma^2(t) |k|^2 } \end{aligned}\tag{178} $

where in both cases we used Equation 9 in Assumption 4 to get the last inequalities. These imply that

$ \forall p \in {\mathbb{N}}\ \ \text{and} \ \ t\in(0, 1) \quad : \quad \int_{{\mathbb{R}}^d} |k|^p |\partial_t g(t, k)|dk<\infty; \quad \int_{{\mathbb{R}}^d} |k|^p |\partial^2_t g(t, k)|dk<\infty\tag{179} $

indicating that $\partial_t \rho(t, \cdot)$ and $\partial^2_t \rho(t, \cdot)$ are in $C^p({\mathbb{R}}^d)$ for any $p\in {\mathbb{N}}$, i.e. $\rho\in C^1((0, 1); C^p({\mathbb{R}}^d))$ as claimed. To show that $\rho$ is also positive, denote by $\mu_0(t, dx)$ the unique (by the Fourier inversion theorem) probability measure associated with $g_0(t, k)$, i.e. the measure such that

$ g_0(t, k) = \int_{{\mathbb{R}}^d} e^{ik\cdot x} \mu_0(t, dx).\tag{180} $

From Equation 173 and the convolution theorem it follows that we can express $\rho$ as

$ \rho(t, x) = \int_{{\mathbb{R}}^d } \frac{e^{-|x-y|^2/(2\gamma^2(t))}}{(2\pi \gamma^2(t))^{d/2}} \mu_0(t, dy),\tag{181} $

This shows that $\rho>0$ for all $(t, x)\in (0, 1)\times {\mathbb{R}}^d$. Since $x_{t=0} = x_0$ and $x_{t=1} = x_1$ by definition of the interpolant, we also have $\rho(0) = \rho_0$ and $\rho(1)=\rho_1$, which shows that $\rho$ is also positive and in $C^p({\mathbb{R}}^d)$ at $t=0, 1$ by Assumption 4. Note that since $\rho \in C^1((0, 1); C^p({\mathbb{R}}^d))$ and is positive, we also immediately deduce that $s = \nabla \log \rho = \nabla \rho /\rho \in C^1((0, 1); (C^p({\mathbb{R}}^d))^d)$.

To show that $\rho$ satisfies the TE Equation 10, we take the time derivative of Equation 172 to deduce that

$ \partial_t g(t, k) = ik\cdot m(t, k)\tag{182} $

where $m: [0, 1]\times {\mathbb{R}}^d\to {\mathbb{C}}^d$ is the vector-valued function defined as

$ m(t, k) = {\mathbb{E}}\left((\partial_t I(t, x_0, x_1) +\dot{\gamma}(t) z)e^{ik\cdot x_t} \right).\tag{183} $

By definition of the conditional expectation, $m(t, k)$ can be expressed as

$ \begin{aligned} m(t, k) & = \int_{{\mathbb{R}}^d} {\mathbb{E}}\left((\partial_t I(t, x_0, x_1) +\dot{\gamma}(t) z) e^{ik\cdot x_t} |x_t=x \right) \rho(t, x) dx\ & = \int_{{\mathbb{R}}^d} e^{ik\cdot x} {\mathbb{E}}\left((\partial_t I(t, x_0, x_1) +\dot{\gamma}(t) z) |x_t=x \right) \rho(t, x) dx\ & = \int_{{\mathbb{R}}^d} e^{ik\cdot x} b(t, x) \rho(t, x) dx \end{aligned}\tag{184} $

where the last equality follows from the definition of $b$ in Equation 11. Inserting Equation 184 in Equation 182, we deduce that this equation can be written in real space as the TE Equation 10.

Let us now investigate the regularity of $b$. To that end, we go back to $m$ and use the independence between $x_0$, $x_1$, and $z$, as well as Gaussian integration by parts to deduce that

$ m(t, k) = {\mathbb{E}}\left((\partial_t I(t, x_0, x_1) -i \gamma(t) \dot\gamma(t) k) e^{ik\cdot I(t, x_0, x_1)}\right) e^{-\tfrac12 \gamma^2(t) |k|^2 },\tag{185} $

As a result

$ \begin{aligned} |m(t, k)|^2 &= \left| {\mathbb{E}}\left((\partial_t I(t, x_0, x_1) -i \gamma(t) \dot\gamma(t) k) e^{ik\cdot I(t, x_0, x_1)}\right)\right|^2 e^{-\gamma^2(t) |k|^2 } \ & \le 2\left({\mathbb{E}}\big[|\partial_t I(t, x_0, x_1)|^2\big] + |\gamma(t) \dot{\gamma}(t)|^2 |k|^2\right) e^{-\gamma^2(t) |k|^2 } \ & \le 2\left(M_1 + |\gamma(t) \dot{\gamma}(t)|^2 |k|^2\right) e^{-\gamma^2(t) |k|^2 }, \end{aligned}\tag{186} $

and

$ \begin{aligned} |\partial_t m(t, k)|^2 &\le 4\left({\mathbb{E}}\big[|\partial^2_t I(t, x_0, x_1)|^2\big] + (\gamma(t) \ddot{\gamma}(t) + \dot{\gamma}^2(t))^2 \right) e^{-\gamma^2(t) |k|^2 } \ & \quad + 8|k|^2 \left({\mathbb{E}}\big[|\partial_t I(t, x_0, x_1)|^4\big] + (\gamma(t) \dot{\gamma}(t))^4 |k|^4 \right) e^{-\gamma^2(t) |k|^2 } \ & \le 4\left(M_1 + (\gamma(t) \ddot{\gamma}(t) + \dot{\gamma}^2(t))^2 \right) e^{-\gamma^2(t) |k|^2 } \ & \quad + 8|k|^2 \left(M_2 + (\gamma(t) \dot{\gamma}(t))^4 |k|^4 \right) e^{-\gamma^2(t) |k|^2 }, \end{aligned}\tag{187} $

where in both cases the last inequalities follow from Equation 9. Therefore

$ \forall p \in {\mathbb{N}} \ \ \text{and} \ \ t\in(0, 1) \quad : \quad \int_{{\mathbb{R}}^d} |k|^p |m(t, k)|dk<\infty, \quad \int_{{\mathbb{R}}^d} |k|^p |\partial_t m(t, k)|dk<\infty,\tag{188} $

which implies that the inverse Fourier transform of $m$ is a function $j : [0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}^d$ that satisfies $j(t, \cdot) \in (C^p({\mathbb{R}}^d))^d$ for any $p\in {\mathbb{N}}$ and any $t\in(0, 1)$ and can be expressed as

$ j(t, x) = (2\pi)^{-d} \int_{{\mathbb{R}}^d} e^{-ik\cdot x} m(t, k) dk = {\mathbb{E}}\left(\partial_t I(t, x_0, x_1)+ \dot{\gamma} (t) z |x_t = x\right) \rho(t, x) \equiv b(t, x) \rho_t(x)\tag{189} $

where the last equality follows from the definition of $b$ in Equation 11. We deduce that $b\in C^0([0, 1];(C^p({\mathbb{R}}^d))^d)$ for any $p\in {\mathbb{N}}$ since $j\in C^0([0, 1];(C^p({\mathbb{R}}^d))^d)$ and $\rho\in C^1([0, 1];C^p({\mathbb{R}}^d))$, and $\rho>0$.

Finally, let us establish Equation 12. By Equation 9 we have

$ \begin{aligned} \int_{{\mathbb{R}}^d} |b(t, x)|^2 \rho(t, x) dx & = \int_{{\mathbb{R}}^d}| {\mathbb{E}}\left(\partial_t I(t, x_0, x_1)+\dot{\gamma}(t) z|x_t = x\right)|^2 \rho(t, x)dx \ &\le 2 \int_{{\mathbb{R}}^d} {\mathbb{E}}\left(|\partial_t I(t, x_0, x_1)|^2+|\dot{\gamma}(t)|^2 |z|^2|x_t = x\right) \rho(t, x)dx\ & \le 2 {\mathbb{E}}\big[|\partial_t I(t, x_0, x_1)|^2 +|\dot{\gamma}(t)|^2|z|^2\big]\ & < 2M_1^{1/2} + 2 d |\dot{\gamma}(t)|^2, \end{aligned}\tag{190} $

so that this integral is bounded for all $t\in (0, 1)$. To analyze its behavior at the end points, notice that the decomposition Equation 27 implies that

$ \begin{aligned} b_0(x) &\equiv \lim_{t\to0} b(t, x) = {\mathbb{E}}_1 [\partial_t I(0, x, x_1)] - \lim_{t\to0} \dot{\gamma}(t) \gamma(t) s_0(x), \ b_1(x) & \equiv \lim_{t\to1} b(t, x) = {\mathbb{E}}_0 [\partial_t I(1, x_0, x)] - \lim_{t\to1} \dot{\gamma}(t) \gamma(t) s_1(x), \end{aligned}\tag{191} $

where $s_0= \nabla \log\rho_0$, $s_1= \nabla \log\rho_1$, ${\mathbb{E}}_0$ and ${\mathbb{E}}_1$ denote expectations over $x_0\sim\rho_0$ and $x_1\sim \rho_1$, respectively, and we used the property that $x_{t=0}=x_0$ and $x_{t=1} = x_1$. Since $\lim_{t\to0, 1} \dot{\gamma}(t) \gamma(t)$ exists by our assumption that $\gamma^2\in C^1([0, 1])$, $b_0$ and $b_1$ are well defined, and

$ \int_{{\mathbb{R}}^d} |b_0(x)|^2 \rho_0(x) dx < \infty, \qquad \int_{{\mathbb{R}}^d} |b_1(x)|^2 \rho_1(x) dx < \infty.\tag{192} $

by Assumption 4. As a result, the integral in Equation 190 is continuous at $t=0$ and $t=1$, so it must be integrable on $[0, 1]$, and Equation 12 holds.

Proof: By definition of $\rho$, the objective $\mathcal{L}_b$ defined in Equation 14 can also be written as

$ \begin{aligned} \mathcal{L}_b[\hat{b}] &= \int_0^1\int_{{\mathbb{R}}^d} \left(\tfrac12|\hat{b}(t, x)|^2 - {\mathbb{E}}\left(\partial_t I(t, x_0, x_1)+\dot\gamma(t) z\,|\,x_t=x\right)\cdot \hat{b}(t, x)\right) \rho(t, x) dxdt\ &= \int_0^1\int_{{\mathbb{R}}^d} \left(\tfrac12 |\hat{b}(t, x)|^2 - b(t, x)\cdot \hat{b}(t, x) \right) \rho(t, x) dxdt \end{aligned}\tag{193} $

where we used the definition of $b$ in Equation 11. This quadratic objective is bounded from below since

$ \begin{aligned} \mathcal{L}_b[\hat{b}] &= \tfrac12 \int_0^1\int_{{\mathbb{R}}^d} \left|\hat{b}(t, x) - b(t, x) \right|^2 \rho(t, x) dx dt- \tfrac12 \int_0^1\int_{{\mathbb{R}}^d} \left|b(t, x) \right|^2 \rho(t, x) dxdt\ & \ge - \tfrac12 \int_0^1 \int_{{\mathbb{R}}^d} \left|b(t, x) \right|^2 \rho(t, x) dxdt >-\infty \end{aligned}\tag{194} $

where the last inequality follows from Equation 12. Since $\rho$ is positive, the minimizer of Equation 193 is unique and given by $\hat{b}= b$.

Proof: Since $\rho \in C^1((0, 1); C^p({\mathbb{R}}^d))$ and is positive by Theorem 5, we already know that $s = \nabla \log \rho = \nabla \rho /\rho \in C^1((0, 1); (C^p({\mathbb{R}}^d))^d)$. To establish Equation 15, note that, for $t\in(0, 1)$ where $\gamma(t)>0$, we have

$ {\mathbb{E}}\left(z e^{i\gamma(t) k \cdot z }\right) = -\gamma^{-1}(t)(i\partial_k) {\mathbb{E}} e^{i\gamma(t) k \cdot z } = -\gamma^{-1}(t)(i\partial_k) e^{-\tfrac12 \gamma^2(t) |k|^2} = i \gamma(t)k e^{-\tfrac12 \gamma^2(t) |k|^2}.\tag{195} $

As a result, using the independence between $x_0$, $x_1$, and $z$, we have

$ {\mathbb{E}}\left(z e^{i k \cdot x_t }\right) = i \gamma(t)k g(t, k)\tag{196} $

where $g$ is the characteristic function of $x_t$ defined in Equation 172. Using the properties of the conditional expectation, the left-hand side of this equation can be written as

$ {\mathbb{E}}\left(z e^{i k \cdot x_t }\right) = \int_{{\mathbb{R}}^d} {\mathbb{E}}\left(z e^{i k \cdot x_t }|x_t=x\right) \rho(t, x) dx = \int_{{\mathbb{R}}^d} {\mathbb{E}}\left(z |x_t=x\right) e^{i k \cdot x} \rho(t, x) dx\tag{197} $

Since the left hand side of Equation 196 is the Fourier transform of $-\gamma(t) \nabla \rho(t, x)$, we deduce that

$ {\mathbb{E}}\left(z |x_t=x\right) \rho(t, x) = -\gamma(t) \nabla \rho(t, x) = -\gamma(t) s(t, x) \rho(t, x).\tag{198} $

Since $\rho(t, x)>0$, this implies Equation 15 for $t\in(0, 1)$ where $\gamma(t)>0$.

To establish Equation 16, notice that

$ \begin{aligned} \int_{{\mathbb{R}}^d} |s(t, x)|^2 \rho(t, x) dx & = \int_{{\mathbb{R}}^d}| {\mathbb{E}}\left(\gamma^{-1}(t) z\,|\,x_t = x\right)|^2 \rho(t, x)dx \ &\le \int_{{\mathbb{R}}^d} \gamma^{-2}(t) {\mathbb{E}}\left(|z|^2|x_t = x\right) \rho(t, x)dx\ & \le d\gamma^{-2}(t). \end{aligned}\tag{199} $

This means that this integral is bounded for all $t\in(0, 1)$. Since the integral is also continuous at $t=0$ and $t=1$, with values given by Equation 8, it must be integrable on $[0, 1]$, and Equation 16 holds.

The objective $\mathcal{L}_s$ defined in Equation 17 can also be written as

$ \begin{aligned} \mathcal{L}_s[\hat{s}] &= \int_0^1\int_{{\mathbb{R}}^d} \left(\tfrac12|\hat{s}(t, x)|^2 + \gamma^{-1}(t){\mathbb{E}}\left(z|x_t=x\right)\cdot \hat{s}(t, x)\right) \rho(t, x) dxdt\ &= \int_0^1\int_{{\mathbb{R}}^d} \left(\tfrac12 |\hat{s}(t, x)|^2 - s(t, x)\cdot \hat{s}(t, x) \right) \rho(t, x) dxdt, \end{aligned}\tag{200} $

where we used the definition of $s$ in Equation 15. This quadratic objective is bounded from below since

$ \begin{aligned} \mathcal{L}_s[\hat{s}] &= \tfrac12 \int_0^1\int_{{\mathbb{R}}^d} \left|\hat{s}(t, x) - s(t, x) \right|^2 \rho(t, x) dxdt - \tfrac12 \int_0^1\int_{{\mathbb{R}}^d} \left|s(t, x) \right|^2 \rho(t, x) dxdt\ & \ge - \tfrac12 \int_0^1\int_{{\mathbb{R}}^d} \left|s(t, x) \right|^2 \rho(t, x) dxdt >-\infty \end{aligned}\tag{201} $

where the last inequality follows from Equation 16. Since $\rho$ is positive the minimizer of Equation 200 is unique and given by $\hat{s} = s$.

Proof: The forward FPE Equation 20 and the backward FPE Equation 22 are direct consequences of the TE Equation 10 and the definition of the score in Equation 15, since the equality

$ \epsilon(t)\Delta \rho = \epsilon(t)\nabla \cdot (\rho \nabla \log \rho) = \epsilon(t)\nabla \cdot (s \rho)\tag{202} $

can be used to convert between these equations.
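This identity, $\Delta \rho = \nabla\cdot(\rho\nabla\log\rho)$, is easy to confirm numerically; the sketch below is an illustrative 1d finite-difference check on an arbitrary smooth positive density.

```python
import numpy as np

# 1d finite-difference check of Delta rho = div(rho * grad log rho),
# i.e. rho'' = (s * rho)' with s = (log rho)', used to convert between the TE and FPEs.
x = np.linspace(-4.0, 4.0, 2001)
h = x[1] - x[0]
rho = np.exp(-0.5 * x ** 2 - 0.1 * x ** 4)  # any smooth positive density (unnormalized is fine)
s = np.gradient(np.log(rho), h)             # score s = (log rho)'
lhs = np.gradient(np.gradient(rho, h), h)   # rho''
rhs = np.gradient(s * rho, h)               # (s * rho)'
interior = slice(5, -5)                     # drop the lower-accuracy boundary stencils
assert np.max(np.abs(lhs[interior] - rhs[interior])) < 1e-3
```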

B.2 Proof of Lemma 15

Proof: The SDE Equation 33 and the ODE Equation 32 are the evolution equations for the processes whose densities solve Equation 20 and 10, respectively. The equation that requires some explanation is the backward SDE Equation 34, which can be solved backwards in time from $t=1$ to $t=0$. As discussed in the main text, by definition, its solution is $X^\mathsf{B}_{t}=Z^\mathsf{F}_{1-t}$ where $Z^\mathsf{F}_t$ solves the forward SDE

$ dZ^\mathsf{F}_t = -b_\mathsf{B}(1-t, Z^\mathsf{F}_t)dt + \sqrt{2\epsilon} d W_t.\tag{203} $

To see how to write the backward Itô formula Equation 36 note that given any $f\in C^1([0, 1];C_0^2({\mathbb{R}}^d))$, we have

$ \begin{aligned} df(1-t, Z^\mathsf{F}_t) &= -\partial_t f(1-t, Z^\mathsf{F}_t)dt+\nabla f(1-t, Z^\mathsf{F}_t) \cdot dZ^\mathsf{F}_t + \epsilon \Delta f(1-t, Z^\mathsf{F}_t) dt\ &= -\partial_t f(1-t, Z^\mathsf{F}_t)dt+\left(- b_\mathsf{B}(1-t, Z^\mathsf{F}_t) \cdot \nabla f(1-t, Z^\mathsf{F}_t) + \epsilon \Delta f(1-t, Z^\mathsf{F}_t)\right) dt\ & \quad+ \sqrt{2\epsilon} \nabla f(1-t, Z^\mathsf{F}_t) \cdot dW_t \end{aligned}\tag{204} $

In integral form, this equation can be written as

$ \begin{aligned} f(1, Z^\mathsf{F}_{1}) &= f(1-t, Z^\mathsf{F}_{1-t}) \ &\quad- \int_{1-t}^1 \left(\partial_t f(1-s, Z^\mathsf{F}_s)+ b_\mathsf{B}(1-s, Z^\mathsf{F}_s) \cdot \nabla f(1-s, Z^\mathsf{F}_s) - \epsilon \Delta f(1-s, Z^\mathsf{F}_s)\right) ds \ & \quad - \sqrt{2\epsilon} \int_{1-t}^1 \nabla f(1-s, Z^\mathsf{F}_s) \cdot dW_s. \end{aligned}\tag{205} $

Using $X^\mathsf{B}_t = Z^\mathsf{F}_{1-t}$ and $W^\mathsf{B}_t = - W_{1-t}$ and changing integration variable from $s$ to $1-s$, this is

$ \begin{aligned} f(1, X^\mathsf{B}_0) & = f(1-t, X^\mathsf{B}_t) - \int_t^1 \left(\partial_t f(s, X^\mathsf{B}_s) + b_\mathsf{B}(s, X^\mathsf{B}_s) \cdot \nabla f(s, X^\mathsf{B}_s) - \epsilon \Delta f(s, X^\mathsf{B}_s)\right) ds \ & \quad - \sqrt{2\epsilon} \int_t^1 \nabla f(s, X^\mathsf{B}_s) \cdot dW^\mathsf{B}_s, \end{aligned}\tag{206} $

In differential form, this is equivalent to saying that

$ \begin{aligned} df(t, X^\mathsf{B}_t) &= \partial_t f(t, X^\mathsf{B}_t)dt + \nabla f(t, X^\mathsf{B}_t) \cdot dX^\mathsf{B}_t - \epsilon \Delta f(t, X^\mathsf{B}_t) dt\ &= \left(\partial_t f(t, X^\mathsf{B}_t) +b_\mathsf{B}(t, X^\mathsf{B}_t) \cdot \nabla f(t, X^\mathsf{B}_t) - \epsilon \Delta f(t, X^\mathsf{B}_t)\right) dt + \sqrt{2\epsilon} \nabla f(t, X^\mathsf{B}_t) \cdot dW^\mathsf{B}_t \end{aligned}\tag{207} $

which is the backward Itô formula Equation 36. Similarly, by the Itô isometries we have: for any $g\in C^0([0, 1];(C_0^0({\mathbb{R}}^d))^d)$ and $t\in[0, 1]$

$ {\mathbb{E}}^x \int_{1-t}^1 g(s, Z^\mathsf{F}_s) \cdot dW_s = 0, \qquad {\mathbb{E}}^x \left|\int_{1-t}^1 g(s, Z^\mathsf{F}_s) \cdot dW_s\right|^2 = \int_{1-t}^1 {\mathbb{E}}^x|g (s, Z^\mathsf{F}_s) |^2ds.\tag{208} $

Written in terms of $X^\mathsf{B}_t$, these are Equation 37.

Derivation of Equation 30.

For any $t\in[0, 1]$ the score is the minimizer of

$ \begin{aligned} &\int_{{\mathbb{R}}^d} |\hat{s}(t, x) - \nabla \log \rho(t, x)|^2 \rho(t, x) dx\ &= \int_{{\mathbb{R}}^d} \left(|\hat{s}(t, x)|^2 - 2\hat{s}(t, x) \cdot\nabla\log \rho(t, x)+ | \nabla \log \rho(t, x)|^2\right) \rho(t, x) dx \ &= \int_{{\mathbb{R}}^d} \left(|\hat{s}(t, x)|^2 + 2\nabla \cdot \hat{s}(t, x) + | \nabla \log \rho(t, x)|^2\right) \rho(t, x) dx, \end{aligned} $

where we used the identity $\hat{s} \cdot \nabla \log \rho \, \rho = \hat{s} \cdot \nabla \rho$ and integration by parts to obtain the second equality. The last term involving $|\nabla \log \rho|^2$ is a constant in $\hat{s}$ that can be neglected for optimization. Expressing the remaining terms as an expectation over $x_t$ and integrating the result in time gives Equation 30.
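The integration-by-parts trick can be checked on a 1d Gaussian $\rho = \mathsf{N}(0, \sigma^2)$, where restricting to a linear ansatz $\hat{s}(x) = \theta x$ reduces the integrand to $\theta^2\,{\mathbb{E}}[x^2] + 2\theta$, minimized at the true score slope $\theta = -1/\sigma^2$. The snippet below is an illustrative Monte Carlo sketch of this fact.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5
x = sigma * rng.standard_normal(200_000)   # samples from rho = N(0, sigma^2)

# |s_hat|^2 + 2 * d/dx s_hat averaged over x ~ rho, for s_hat(x) = theta * x:
# the divergence term 2*theta replaces the intractable cross term -2 s_hat . grad log rho.
thetas = np.linspace(-1.0, 0.0, 401)
loss = thetas ** 2 * np.mean(x ** 2) + 2.0 * thetas
theta_star = thetas[np.argmin(loss)]

# The minimizer recovers the true score slope -1/sigma^2
assert abs(theta_star - (-1.0 / sigma ** 2)) < 0.02
```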

B.3 Proofs of Lemma 16 and Lemma 17, and Theorem 18.

Proof: Using Equation 38, we compute analytically

$ \begin{aligned} \frac{d}{dt}\mathsf{KL}(\rho(t)\,\Vert\,\hat\rho(t)) &= \frac{d}{dt}\int_{{\mathbb{R}}^d} \log\left(\frac{\rho}{\hat{\rho}}\right)\rho\, dx\ &= \int_{{\mathbb{R}}^d} \hat\rho\left(\frac{\partial_t \rho}{\hat\rho} - \frac{\rho}{(\hat\rho)^2}\partial_t\hat\rho\right)dx + \int_{{\mathbb{R}}^d}\log\left(\frac{\rho}{\hat\rho}\right)\partial_t\rho\, dx\ &= -\int_{{\mathbb{R}}^d} \left(\frac{\rho}{\hat\rho}\right)\partial_t\hat\rho\, dx + \int_{{\mathbb{R}}^d}\log\left(\frac{\rho}{\hat\rho}\right)\partial_t\rho\, dx\ &= \int_{{\mathbb{R}}^d} \left(\frac{\rho}{\hat\rho}\right)\nabla\cdot\left(\hat{b}\hat\rho\right)dx - \int_{{\mathbb{R}}^d}\log\left(\frac{\rho}{\hat\rho}\right)\nabla\cdot\left(b\rho\right)dx\ &= -\int_{{\mathbb{R}}^d} \nabla\left(\frac{\rho}{\hat\rho}\right)\cdot \hat{b}\hat\rho\, dx + \int_{{\mathbb{R}}^d}\left(\nabla\log\rho - \nabla\log\hat\rho\right)\cdot b \rho\, dx\ &= -\int_{{\mathbb{R}}^d} \left(\frac{\nabla\rho}{\hat\rho} - \frac{\rho\nabla\hat\rho}{\hat\rho^2}\right)\cdot \hat{b}\hat\rho\, dx + \int_{{\mathbb{R}}^d}\left(\nabla\log\rho - \nabla\log\hat\rho\right)\cdot b \rho\, dx\ &= \int_{{\mathbb{R}}^d} \left(\nabla\log\hat\rho - \nabla\log\rho\right)\cdot \hat{b} \rho\, dx + \int_{{\mathbb{R}}^d}\left(\nabla\log\rho - \nabla\log\hat\rho\right)\cdot b \rho\, dx\ &= \int_{{\mathbb{R}}^d} \left(\nabla\log\hat\rho - \nabla\log\rho\right)\cdot (\hat{b} - b)\, \rho\, dx. \end{aligned} $

where we omitted the argument $(t, x)$ of all functions for simplicity of notation. Integrating both sides from $0$ to $1$ completes the proof.

Proof: Similar to the proof of Lemma 16, we can use the FPE in Equation 40 to compute $\frac{d}{dt}\mathsf{KL}(\rho(t)\,\Vert\,\hat\rho(t))$, which leads to the main result. Instead, we take a simpler approach, leveraging the result in Lemma 16. We re-write the Fokker-Planck equations in Equation 40 as the (score-dependent) transport equations

$ \begin{aligned} \partial_t \rho &= -\nabla\cdot((b_\mathsf{F} - \epsilon\nabla\log\rho)\rho), \ \partial_t \hat\rho &= -\nabla\cdot((\hat{b}_\mathsf{F} - \epsilon\nabla\log\hat\rho)\hat\rho). \end{aligned}\tag{209} $

Applying Lemma 16 directly, we find that

$ \begin{aligned} \mathsf{KL}(\rho(1)\,\Vert\,\hat{\rho}(1)) = \int_0^1 \int_{{\mathbb{R}}^d} \left[\left(\nabla\log\hat\rho - \nabla\log\rho \right)\cdot \left(\left[\hat{b}_\mathsf{F} - \epsilon\nabla\log\hat\rho \right] - \left[b_\mathsf{F} - \epsilon\nabla\log\rho \right]\right)\right] \rho\, dx dt. \end{aligned} $

Expanding the above,

$ \begin{aligned} \mathsf{KL}(\rho(1)\,\Vert\,\hat\rho(1)) &= \int_0^1 \int_{{\mathbb{R}}^d}\left[\left(\nabla\log\hat\rho - \nabla\log\rho \right)\cdot (\hat{b}_\mathsf{F} - b_\mathsf{F})\right]\rho\, dx dt\ &\qquad - \epsilon\int_0^1 \int_{{\mathbb{R}}^d} \left[\left|\nabla\log\rho - \nabla\log\hat\rho \right|^2\right]\rho\, dx dt, \end{aligned} $

which proves the first part of the lemma. Now, by Young's inequality, it holds for any fixed $\eta > 0$ that

$ \begin{align} \mathsf{KL}(\rho_1\,\Vert\,\hat\rho(1)) &\leq \frac{1}{2\eta}\int_0^1 \int_{{\mathbb{R}}^d}\left|\hat{b}_\mathsf{F} - b_\mathsf{F}\right|^2\rho\, dx dt\nonumber\ &\qquad + \left(\tfrac{1}{2}\eta - \epsilon\right)\int_0^1 \int_{{\mathbb{R}}^d} \left|\nabla\log\rho - \nabla\log\hat{\rho}\right|^2\rho\, dxdt. \end{align} $

Hence, for $\eta =2 \epsilon$,

$ \begin{align} \mathsf{KL}(\rho_1\,\Vert\,\hat\rho(1)) &\leq \frac{1}{4 \epsilon} \int_0^1 \int_{{\mathbb{R}}^d}\left|\hat{b}_\mathsf{F} - b_\mathsf{F}\right|^2\rho\, dxdt, \end{align}\tag{210} $

which proves the second part of the lemma.
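As an illustrative check (not part of the proof), the bound in Equation 210 can be verified numerically in one dimension with linear drifts, for which both Fokker-Planck solutions remain centered Gaussians and the KL divergence is available in closed form. The drifts $b_\mathsf{F}(t,x)=-x$ and $\hat b_\mathsf{F}(t,x)=-(1+\delta)x$ and the values of $\epsilon$ and $\delta$ below are arbitrary choices for the demonstration:

```python
import numpy as np

# Check KL(rho(1) || rho_hat(1)) <= (1/(4 eps)) int |b_hat_F - b_F|^2 rho dx dt (Equation 210)
# for 1D linear drifts b_F(t,x) = -x and b_hat_F(t,x) = -(1+delta)*x, rho_0 = N(0, 1).
# Both marginals stay Gaussian; we track their variances with a simple Euler integration.

eps, delta = 0.5, 0.3
dt, n = 1e-4, 10_000
sig2, sig2_hat = 1.0, 1.0          # both processes start from rho_0 = N(0, 1)
rhs = 0.0                           # accumulates (1/(4 eps)) int |b_hat_F - b_F|^2 rho dx dt
for _ in range(n):
    rhs += dt * delta**2 * sig2 / (4 * eps)       # E_rho[(delta x)^2] = delta^2 * sig2
    sig2 += dt * (-2 * sig2 + 2 * eps)            # variance ODE for dX = -X dt + sqrt(2 eps) dW
    sig2_hat += dt * (-2 * (1 + delta) * sig2_hat + 2 * eps)

# KL between the two centered Gaussians at t = 1
kl = 0.5 * (sig2 / sig2_hat - 1.0 + np.log(sig2_hat / sig2))
assert kl <= rhs
```

With these parameters the left-hand side evaluates to roughly $0.021$ and the bound to roughly $0.032$, so the inequality holds with some slack.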

Proof: Observe that by Corollary 14, the target density $\rho_1 =\rho(1, \cdot)$ is the density of the process $X_t$ that evolves according to SDE Equation 33. By Lemma 17 and an additional application of Young's inequality, we then have that

$ \begin{align} \mathsf{KL}(\rho_1\,\Vert\,\hat\rho(1)) &\leq \frac{1}{2 \epsilon}\int_0^1\int_{{\mathbb{R}}^d}\left(|\hat{b} - b|^2 +\epsilon^2 \left|s - \hat{s}\right|^2\right)\rho\, dx\, dt. \end{align} $

Using the definition of $\mathcal{L}_b[\hat{b}]$ and $\mathcal{L}_s[\hat{s}]$ in Equations 14 and 17, we conclude that

$ \mathsf{KL}(\rho_1\,\Vert\,\hat{\rho}(1)) \leq \frac{1}{2 \epsilon}\left(\mathcal{L}_b[\hat{b}] - \min_{\hat{b}}\mathcal{L}_b[\hat{b}]\right) + \frac{\epsilon}{2}\left(\mathcal{L}_s[\hat{s}] - \min_{\hat{s}}\mathcal{L}_s[\hat{s}]\right), $

which is Equation 42. Using that $b = v - \gamma\dot{\gamma} s$ and $\hat{b} = \hat{v} - \gamma\dot{\gamma}\hat{s}$, we can write Equation 210 in this case as

$ \begin{align} \mathsf{KL}(\rho_1\,\Vert\,\hat\rho(1)) &\leq \frac{1}{4 \epsilon} \int_0^1 \int_{{\mathbb{R}}^d}\left|\hat{v} - v - (\gamma(t)\dot\gamma(t) - \epsilon)(\hat{s} - s)\right|^2\rho\, dx\, dt \\ &\leq \frac{1}{2 \epsilon} \int_0^1 \int_{{\mathbb{R}}^d}\left(\left|\hat{v} - v\right|^2 + (\gamma(t)\dot\gamma(t) - \epsilon)^2\left|\hat{s} - s\right|^2\right)\rho\, dx\, dt \\ &\leq \frac{1}{2 \epsilon}\left(\mathcal{L}_{v}[\hat{v}] - \min_{\hat{v}}\mathcal{L}_{v}[\hat{v}]\right) + \frac{\max_{t\in [0, 1]}(\gamma(t)\dot\gamma(t) - \epsilon)^2}{2 \epsilon}\left(\mathcal{L}_{s}[\hat{s}] - \min_{\hat{s}}\mathcal{L}_{s}[\hat{s}]\right), \end{align} $

which is the final part of the theorem.

B.4 Proofs of Lemma 20 and Theorem 21

Proof: If $\hat{\rho}$ solves the TE Equation 44 and $ X_{s, t}$ solves the ODE Equation 45, we have

$ \begin{aligned} \frac{d}{dt} \hat\rho(t, X_{s, t}(x)) & = \partial_t \hat\rho(t, X_{s, t}(x)) + b(t, X_{s, t}(x)) \cdot \nabla \hat\rho(t, X_{s, t}(x)) \\ &= - \nabla \cdot b(t, X_{s, t}(x))\, \hat\rho(t, X_{s, t}(x)). \end{aligned}\tag{211} $

This equation implies that

$ \begin{aligned} \frac{d}{dt} \left(\exp\left(\int_{s}^t\nabla \cdot b(\tau, X_{s, \tau}(x)) d\tau \right) \hat\rho(t, X_{s, t}(x)) \right) = 0. \end{aligned}\tag{212} $

Integrating Equation 212 over $[0, t]$ and setting $s=t$ in the result gives

$ \begin{aligned} \hat\rho(t, x) = \exp\left(-\int_0^{t}\nabla \cdot b(\tau, X_{t, \tau}(x)) d\tau \right) \hat\rho(0, X_{t, 0}(x)) \end{aligned}\tag{213} $

If we use the initial condition $\hat{\rho}(0) = \rho_0$, this gives Equation 46. Similarly, integrating Equation 212 over $[t, 1]$ and setting $s=t$ in the result gives

$ \begin{aligned} \hat\rho(t, x) = \exp\left(\int_{t}^1\nabla \cdot b(\tau, X_{t, \tau}(x)) d\tau \right) \hat\rho(1, X_{t, 1}(x)) \end{aligned}\tag{214} $

If we use the final condition $\hat{\rho}(1) = \rho_1$, this gives Equation 47.
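To illustrate the likelihood formula in Equation 213, the sketch below evaluates it for the hypothetical one-dimensional drift $b(t,x)=-x$ with $\rho_0={\sf N}(0,1)$. Under this drift the probability flow contracts ${\sf N}(0,1)$ to ${\sf N}(0,e^{-2t})$, which gives an exact density to compare against:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Evaluate rho_hat(t, x) = exp(-int_0^t div b(tau, X_{t,tau}(x)) dtau) * rho_0(X_{t,0}(x))
# (Equation 213) by integrating the probability flow ODE backward from time t to 0,
# carrying the divergence integral along as an extra state component.

def b(t, x):
    return -x

def div_b(t, x):
    return -1.0  # d/dx of -x

def log_density(t, x):
    def rhs(tau, state):
        y, _ = state
        return [b(tau, y), div_b(tau, y)]
    sol = solve_ivp(rhs, [t, 0.0], [x, 0.0], rtol=1e-10, atol=1e-12)
    x0, log_jac = sol.y[0, -1], sol.y[1, -1]   # log_jac = int_t^0 div b = -int_0^t div b
    log_rho0 = -0.5 * x0**2 - 0.5 * np.log(2 * np.pi)
    return log_jac + log_rho0

t, x = 0.7, 0.3
var = np.exp(-2 * t)                            # exact law: N(0, e^{-2t})
exact = -0.5 * x**2 / var - 0.5 * np.log(2 * np.pi * var)
assert abs(log_density(t, x) - exact) < 1e-6
```

The drift, test point, and tolerances are arbitrary choices for the demonstration; any drift with a known flow map would serve equally well.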

Proof: Evaluating $d\hat{\rho}_\mathsf{F}(t, Y^\mathsf{B}_t)$ via the backward Itô formula Equation 36 we obtain

$ \begin{aligned} d\hat\rho_\mathsf{F}(t, Y_t^\mathsf{B}) &= \partial_t \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B})\, dt + \nabla \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B}) \cdot dY^\mathsf{B}_t - \epsilon \Delta \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B})\, dt \\ &= \partial_t \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B})\, dt + \nabla \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B}) \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t)\, dt + \sqrt{2\epsilon}\, \nabla \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B}) \cdot dW^\mathsf{B}_t - \epsilon \Delta \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B})\, dt \\ & = -\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t)\, \hat{\rho}_\mathsf{F}(t, Y^\mathsf{B}_t)\, dt + \sqrt{2\epsilon}\, \nabla \hat\rho_\mathsf{F}(t, Y_t^\mathsf{B}) \cdot dW^\mathsf{B}_t, \end{aligned}\tag{215} $

where we used Equation 50 in the second step and 51 in the last one. This equation can be written as a total differential in the form

$ \begin{aligned} & d \left(\exp\left(- \int_t^1\nabla \cdot \hat{b}_\mathsf{F}(\tau, Y^\mathsf{B}_\tau)\, d\tau\right) \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B}) \right) \\ & = \sqrt{2\epsilon}\, \exp\left(-\int_t^1\nabla \cdot \hat{b}_\mathsf{F}(\tau, Y^\mathsf{B}_\tau)\, d\tau\right) \nabla \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B}) \cdot dW^\mathsf{B}_t, \end{aligned}\tag{216} $

which after integration over $t\in[0, 1]$ becomes

$ \begin{aligned} & \hat{\rho}_\mathsf{F}(1, Y_{t=1}^\mathsf{B}) - \exp\left(- \int_0^1\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t)\, dt\right) \rho_0(Y^\mathsf{B}_{t=0}) \\ & = \sqrt{2\epsilon} \int_0^1 \exp\left(-\int_t^1\nabla \cdot \hat{b}_\mathsf{F}(\tau, Y^\mathsf{B}_\tau)\, d\tau\right) \nabla \hat{\rho}_\mathsf{F}(t, Y_t^\mathsf{B}) \cdot dW^\mathsf{B}_t, \end{aligned}\tag{217} $

where we used $\hat{\rho}_\mathsf{F}(0)=\rho_0$. Taking an expectation conditioned on the event $Y_{t=1}^\mathsf{B}=x$ and using that the term on the right-hand side has mean zero, we find that

$ \hat{\rho}_\mathsf{F}(1, x) - {\mathbb{E}}^x_\mathsf{B}\left(\exp\left(- \int_0^1\nabla \cdot \hat{b}_\mathsf{F}(t, Y^\mathsf{B}_t)\, dt\right) \rho_0(Y^\mathsf{B}_{t=0})\right) = 0.\tag{218} $

This gives Equation 52.

Similarly, evaluating $d\hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t)$ via Itô's formula, we obtain

$ \begin{aligned} d \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) &= \partial_t \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt + \nabla \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \cdot dY^\mathsf{F}_t + \epsilon \Delta \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt \\ &= \partial_t \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt + \nabla \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt + \sqrt{2\epsilon}\, \nabla \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \cdot dW_t + \epsilon \Delta \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt \\ & = -\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t)\, \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt + \sqrt{2\epsilon}\, \nabla \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \cdot dW_t, \end{aligned}\tag{219} $

where we used Equation 49 in the second step and 53 in the last one. This equation can be written as a total differential in the form

$ \begin{aligned} & d\left(\exp\left(\int_0^t\nabla \cdot \hat{b}_\mathsf{B}(\tau, Y^\mathsf{F}_\tau)\, d\tau\right) \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \right) \\ & = \sqrt{2\epsilon}\, \exp\left(\int_0^t\nabla \cdot \hat{b}_\mathsf{B}(\tau, Y^\mathsf{F}_\tau)\, d\tau\right) \nabla \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \cdot dW_t. \end{aligned}\tag{220} $

Integrating the above on $t\in[0, 1]$, we find that

$ \begin{aligned} & \exp\left(\int_0^1\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt\right)\rho_1(Y^\mathsf{F}_{t=1}) - \hat{\rho}_\mathsf{B}(0, Y^\mathsf{F}_{t=0}) \\ &= \sqrt{2\epsilon} \int_0^1 \exp\left(\int_0^t\nabla \cdot \hat{b}_\mathsf{B}(\tau, Y^\mathsf{F}_\tau)\, d\tau\right) \nabla \hat{\rho}_\mathsf{B}(t, Y^\mathsf{F}_t) \cdot dW_t, \end{aligned}\tag{221} $

where we used $\hat{\rho}_\mathsf{B}(1) = \rho_1$. Taking an expectation conditioned on the event $Y^\mathsf{F}_{t=0} = x$ and using that the stochastic integral has mean zero, we deduce that

$ {\mathbb{E}}_\mathsf{F}^x \left(\exp\left(\int_0^1\nabla \cdot \hat{b}_\mathsf{B}(t, Y^\mathsf{F}_t)\, dt\right)\rho_1(Y^\mathsf{F}_{t=1})\right) - \hat{\rho}_\mathsf{B}(0, x) =0.\tag{222} $

This gives Equation 54.
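As an illustrative sanity check of the estimator in Equation 222 (not part of the proof): for the hypothetical linear drift $\hat b_\mathsf{B}(t,x)=-x$ in one dimension, so that $\nabla\cdot\hat b_\mathsf{B}=-1$, and a Gaussian $\rho_1={\sf N}(0,v_1)$, the backward equation can be solved in closed form, giving $\hat\rho_\mathsf{B}(0,\cdot)={\sf N}\big(0,(v_1+\epsilon)e^2-\epsilon\big)$; a Monte Carlo evaluation of the expectation should match it:

```python
import numpy as np

# Monte Carlo evaluation of rho_hat_B(0, x) = E_F^x[ exp(int_0^1 div b_hat_B dt) rho_1(Y^F_1) ]
# (Equation 222) for the toy drift b_hat_B(t, x) = -x, Y^F simulated by Euler-Maruyama.

rng = np.random.default_rng(0)
eps, v1, x = 0.25, 1.0, 0.4
dt, n = 1e-3, 200_000

Y = np.full(n, x)                       # forward SDE dY = -Y dt + sqrt(2 eps) dW, Y_0 = x
for _ in range(1000):                   # 1000 steps of size 1e-3 cover t in [0, 1]
    Y += -Y * dt + np.sqrt(2 * eps * dt) * rng.standard_normal(n)

rho1 = np.exp(-Y**2 / (2 * v1)) / np.sqrt(2 * np.pi * v1)
estimate = np.exp(-1.0) * rho1.mean()   # exp(int_0^1 div b_hat_B dt) = e^{-1} here

v0 = (v1 + eps) * np.exp(2.0) - eps     # closed-form variance of rho_hat_B(0, .)
exact = np.exp(-x**2 / (2 * v0)) / np.sqrt(2 * np.pi * v0)
assert abs(estimate - exact) < 0.01
```

All parameter values are arbitrary choices for the demonstration; the same check works for any linear drift and Gaussian target.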

B.5 Proof of Theorem 26

Proof: Let us first consider what happens on the interval $t\in [0, \delta]$ where $I(t, x_0, x_1) = x_0$ by assumption and the stochastic interpolant Equation 66 with $x_0$ fixed and $a(t)=a>0$ reduces to

$ x_t = x_0 + \sqrt{2at(1-t)}\, z \quad \text{with} \quad x_0 \ \text{fixed}, \quad z \sim {\sf N}(0, \text{\it Id}).\tag{223} $

The law of this process is simply

$ x_t \sim {\sf N}(x_0, 2at(1-t)\text{\it Id}),\tag{224} $

which means that its density satisfies for all $t>0$ the FPE

$ \partial_t \rho + 2a t \nabla \cdot \left(s(t, x) \rho\right) = a \Delta \rho,\tag{225} $

where $s(t, x) = \nabla \log \rho(t, x)$ is the score, which is explicitly given by

$ s(t, x) = -\frac{x-x_0}{2at(1-t)}\qquad \text{for} \ \ t\in (0, \delta].\tag{226} $

This means that the drift term in the FPE Equation 225 can be written as

$ 2a t s(t, x) = -\frac{x-x_0}{1-t}\qquad \text{for} \ \ t\in (0, \delta],\tag{227} $

and this equality also holds in the limit as $t\to0$. It is also easy to check that

$ -\frac{x-x_0}{1-t} = -\frac{\sqrt{2at}}{\sqrt{1-t}}\, {\mathbb{E}}(z|x_t=x) \qquad \text{for} \ \ t\in [0, \delta],\tag{228} $

which is consistent with the drift in Equation 74 on $t\in [0, \delta]$ since $\partial_t I(t, x_0, x_1) = 0$ on this interval. Since the right-hand side of Equation 227 is nonsingular at $t=0$, it also means that the SDE Equation 75 is well-defined for $t\in[0, \delta]$ and that the law of its solutions coincides with that of the process $x_t$ defined in Equation 223, i.e. $X_t^\mathsf{d} \sim {\sf N}(x_0, 2at(1-t)\text{\it Id})$ for $t\in[0, \delta]$.

Considering next what happens on $t\in(\delta, 1]$: since $\gamma(t) = \sqrt{2at(1-t)}$ vanishes only at the endpoint $t=1$, where we assume that the density $\rho_1$ satisfies Assumption 4, we can mimic the arguments in the proofs of Theorem 5, Corollary 9, and Corollary 14 to complete the proof.

Note that this proof shows that it is enough to have $\partial_t I(t=0, x_0, x_1) = 0$, since this implies that $I(t, x_0, x_1) = x_0 + O(t^2)$. It also shows that, while it is key to use an SDE on $t\in[0, \delta]$ so that the generative process can spread the mass away from $x_0$, the diffusive noise is no longer necessary afterward, and we could switch back to a probability flow ODE on $t\in (\delta, 1]$ (using a time-dependent $a(t)$ with $a(t)=0$ for $t\in (\delta, 1]$ to that effect).
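The law in Equation 224 and the drift in Equation 227 can be checked against each other by simulation: integrating the SDE $dX = -(X-x_0)/(1-t)\,dt + \sqrt{2a}\,dW$ from $X(0)=x_0$ should reproduce the variance $2at(1-t)$. The step size, horizon, and sample count below are arbitrary choices for the illustration:

```python
import numpy as np

# Euler-Maruyama simulation of dX = -(X - x0)/(1 - t) dt + sqrt(2a) dW, X(0) = x0,
# whose drift is the right-hand side of Equation 227; the marginal law should match
# x_t ~ N(x0, 2 a t (1 - t)) from Equation 224.

rng = np.random.default_rng(0)
a, x0, dt, t_end, n = 1.0, 2.0, 1e-3, 0.5, 200_000
X = np.full(n, x0)
n_steps = int(t_end / dt)
for k in range(n_steps):
    t = k * dt
    X += -(X - x0) / (1.0 - t) * dt + np.sqrt(2 * a * dt) * rng.standard_normal(n)

target_var = 2 * a * t_end * (1 - t_end)   # = 0.5 with these parameters
assert abs(X.mean() - x0) < 0.02
assert abs(X.var() - target_var) < 0.02
```

Note that the drift is well-defined at $t=0$ even though the score in Equation 226 is singular there, which is exactly the point of rewriting it as in Equation 227.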

B.6 Proof of Lemma 34 and Theorem 35.

Proof: By definition of the map $T$ in Equation 101, if $x_0\sim \rho_0$ and $x_1\sim\rho_1$, then $T^{-1}(0, x_0)\sim {\sf N}(0, \text{\it Id})$ and $T^{-1}(1, x_1)\sim {\sf N}(0, \text{\it Id})$. As a result, since $x_0$, $x_1$, and $z$ are independent, and $z\sim {\sf N}(0, \text{\it Id})$, we have

$ \alpha(t) T^{-1}(0, x_0) + \beta(t) T^{-1}(1, x_1) +\gamma(t) z \sim {\sf N}(0, \alpha^2(t)\text{\it Id}+\beta^2(t)\text{\it Id}+\gamma^2(t)\text{\it Id}) = {\sf N}(0, \text{\it Id})\tag{229} $

where the second equality follows from the condition $\alpha^2(t)+\beta^2(t)+\gamma^2(t)=1$. Therefore, using again the definition of the map $T$,

$ T(t, \alpha(t) T^{-1}(0, x_0) + \beta(t) T^{-1}(1, x_1) +\gamma(t) z) \sim \rho(t),\tag{230} $

and we are done.

Proof: Denote by $\hat\rho(t, x)$ the density of $\hat{x}_t$ and define the current $\hat\jmath : [0, 1]\times {\mathbb{R}}^d \to {\mathbb{R}}^d$ as

$ \hat{\jmath}(t, x) = {\mathbb{E}} \big(\partial_t \hat{I}_t +\dot\gamma(t) z\, \big|\, \hat{x}_t = x \big)\, \hat{\rho}(t, x).\tag{231} $

The max-min problem Equation 103 can then be formulated as the constrained optimization problem:

$ \begin{aligned} \max_{\hat{\rho}, \hat{\jmath}}\, \min_{\hat{u}}\ &\int_0^1 \int_{{\mathbb{R}}^d}\left(\tfrac12|\hat{u}(t, x)|^2 \hat{\rho}(t, x)- \hat{u}(t, x)\cdot \hat{\jmath}(t, x) \right) dx\, dt \\ \text{subject to:} \quad & \partial_t \hat{\rho} + \nabla \cdot \hat{\jmath}= \epsilon \Delta \hat\rho, \quad \hat\rho(t=0) = \rho_0, \quad \hat{\rho}(t=1)= \rho_1. \end{aligned}\tag{232} $

To solve this problem we can use the extended objective

$ \begin{aligned} \max_{\hat{\rho}, \hat{\jmath}}\, \min_{\hat{u}}\ &\Bigg(\int_0^1 \int_{{\mathbb{R}}^d}\left(\tfrac12|\hat{u}(t, x)|^2 \hat{\rho}(t, x)- \hat{u}(t, x)\cdot \hat{\jmath}(t, x) \right) dx\, dt \\ & -\int_0^1 \int_{{\mathbb{R}}^d}\lambda(t, x) \left(\partial_t \hat{\rho}(t, x) + \nabla \cdot \hat{\jmath}(t, x) - \epsilon \Delta \hat\rho(t, x)\right) dx\, dt \\ & +\int_{{\mathbb{R}}^d} \eta_0(x)\left(\hat\rho(0, x)-\rho_0(x)\right) dx-\int_{{\mathbb{R}}^d} \eta_1(x)\left(\hat\rho(1, x)-\rho_1(x)\right) dx \Bigg), \end{aligned}\tag{233} $

where $\lambda(t, x)$, $\eta_0(x)$, and $\eta_1(x)$ are Lagrange multipliers used to enforce the constraints. The unique minimizer $(\rho, j, \lambda)$ of this optimization problem solves the Euler-Lagrange equations:

$ \begin{aligned} & \partial_t \rho + \nabla \cdot j = \epsilon \Delta \rho, \quad \rho(t=0) = \rho_0, \quad \rho(t=1) = \rho_1, \\ & \partial_t \lambda + \tfrac12 |u|^2 = -\epsilon \Delta \lambda, \\ & j = u \rho, \\ & u = \nabla \lambda. \end{aligned}\tag{234} $

We can use the last two equations to write the first two as Equation 100, with $u=\nabla \lambda$. Since under Assumption 33 there is an interpolant that realizes the density $\rho(t)$ that solves Equation 100, we conclude that an optimizer $(I, u)$ of the max-min problem Equation 103 exists. For any optimizer, $I$ will be such that $\rho(t)$ is the density of $x_t= I(t, x_0, x_1)+\gamma(t) z$, and $u$ will satisfy $u=\nabla \lambda$.

B.7 Proof of Theorem 39

Proof: Use

$ \beta(t_{j+1}) \beta^{-1}(t_j) = 1 + \dot\beta(t_j) \beta^{-1}(t_j) (t_{j+1}-t_j) + O\big((t_{j+1}-t_j)^2\big)\tag{235} $

and

$ \alpha(t_{j+1}) - \alpha(t_j)\beta(t_{j+1}) \beta^{-1}(t_j) = \big(\dot\alpha(t_j) - \alpha(t_j) \dot\beta(t_j) \beta^{-1}(t_j)\big) (t_{j+1}-t_j) + O\big((t_{j+1}-t_j)^2\big)\tag{236} $

to deduce that Equation 138 implies

$ \begin{aligned} X^\mathsf{den}_{j+1} & = X^\mathsf{den}_{j} + \dot{\beta}(t_{j}) \beta^{-1}(t_j) X^\mathsf{den}_{j} (t_{j+1}-t_j) \\ & \quad + \big(\dot\alpha(t_j) - \alpha(t_j) \dot\beta(t_j) \beta^{-1}(t_j)\big) \eta^\mathsf{os}_z(t_{j}, X^\mathsf{den}_{j})(t_{j+1}-t_j) + O\big((t_{j+1}-t_j)^2\big), \end{aligned}\tag{237} $

or equivalently

$ \begin{aligned} \frac{X^\mathsf{den}_{j+1} -X^\mathsf{den}_{j}}{t_{j+1}-t_j} = \dot{\beta}(t_{j}) \beta^{-1}(t_j) X^\mathsf{den}_{j} + \big(\dot\alpha(t_j) - \alpha(t_j) \dot\beta(t_j) \beta^{-1}(t_j)\big) \eta^\mathsf{os}_z(t_{j}, X^\mathsf{den}_{j}) + O(t_{j+1}-t_j). \end{aligned}\tag{238} $

Taking the limit as $N, j\to \infty$ with $j/N\to t\in[0, 1]$, we recover Equation 139 and deduce that $X^\mathsf{den}_{j} \to X_t$.

Note that the proof shows that the result of Theorem 39 also holds if we use a nonuniform grid of times $t_j$, $j\in\{1, \ldots, N\}$.
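The convergence can also be illustrated numerically: iterating the one-step denoising recursion of Equation 138 on a uniform grid and refining the grid should display the first-order convergence implied by the $O\big((t_{j+1}-t_j)^2\big)$ local error in Equation 237. The choices $\alpha(t)=\cos(\pi t/2)$, $\beta(t)=\sin(\pi t/2)$, the interval $[0.2, 1]$, and the toy denoiser $\eta^\mathsf{os}_z(t,x)=x$ below are hypothetical stand-ins for the demonstration:

```python
import math

alpha = lambda t: math.cos(math.pi * t / 2)
beta = lambda t: math.sin(math.pi * t / 2)
eta = lambda t, x: x   # toy stand-in for the learned denoiser eta^os_z

def run(N, t0=0.2, t1=1.0, x0=1.0):
    # one-step denoising recursion X_{j+1} = beta_{j+1}/beta_j X_j
    #   + (alpha_{j+1} - alpha_j beta_{j+1}/beta_j) eta(t_j, X_j) on a uniform grid
    x = x0
    for j in range(N):
        ta = t0 + (t1 - t0) * j / N
        tb = t0 + (t1 - t0) * (j + 1) / N
        r = beta(tb) / beta(ta)
        x = r * x + (alpha(tb) - alpha(ta) * r) * eta(ta, x)
    return x

ref = run(25_600)   # fine grid as a stand-in for the ODE limit of Equation 139
errs = [abs(run(N) - ref) for N in (100, 200, 400)]
assert errs[1] < 0.75 * errs[0] and errs[2] < 0.75 * errs[1]   # error shrinks with each refinement
```

The error against the fine-grid reference roughly halves with each doubling of $N$, consistent with the first-order accuracy of the scheme.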

B.8 Proof of Theorem 41

Proof: The first part of the statement can be established by following the same steps as in the proof of Theorem 5 and Corollary 9. For the proof of the second part, use first Equation 141 written as $x^\mathsf{rec}_t = M(t, z)$ in Equation 146 to deduce that

$ b^\mathsf{rec}(t, x^\mathsf{rec}_t) = \dot{\alpha}(t) N(t, M(t, z)) + \dot\beta(t) X_{t=1}(N(t, M(t, z))) = \dot{\alpha}(t) z + \dot\beta(t) X_{t=1}(z) = \dot{x}^\mathsf{rec}_t.\tag{239} $

This shows that Equation 146 is the unique minimizer of Equation 143. Next, use Equation 147 written as $X^\mathsf{rec}_t(x) = M(t, x)$ in Equation 146 to deduce that

$ b^\mathsf{rec}(t, X^\mathsf{rec}_t(x)) = \dot{\alpha}(t) N(t, M(t, x)) + \dot\beta(t) X_{t=1}(N(t, M(t, x))) = \dot{\alpha}(t) x + \dot\beta(t) X_{t=1}(x).\tag{240} $

This implies that Equation 147 solves Equation 145, and since the solution of this ODE is unique, we are done.

Table 3: Hyperparameters and architecture for image datasets.

C. Experimental Specifications

Details for the experiments in Section 7.1 are provided here. Feedforward neural networks of depth $4$ and width $512$ are used for each of the models for $b$, $v$, and $s$. Training was done for $7000$ iterations on batches comprising $25$ draws from the base, $400$ draws from the target, and $100$ time slices. At each iteration, we used a variance reduction technique based on antithetic sampling, in which two samples $\pm z$ are used for each evaluation of the loss. The objectives given in Equations 14 and 17 were optimized using the Adam optimizer. The learning rate was set to $0.002$ and was dropped by a factor of $2$ every $1500$ iterations of training. To integrate the ODE/SDE when drawing samples, we used the Heun-based integrator suggested in [10].
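The antithetic trick above can be sketched in isolation: each term of the Monte Carlo loss estimate is evaluated at the paired draws $+z$ and $-z$ and averaged, which preserves the expectation while reducing the variance. The quadratic per-sample loss below is a toy stand-in for the interpolant objectives, not the paper's network loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def per_sample_loss(z):
    # toy stand-in for a loss term evaluated at x_t = I(t, x0, x1) + gamma(t) z
    return (1.0 + z) ** 2

n = 100_000
z = rng.standard_normal(n)
plain = per_sample_loss(z)                                      # one draw per term
antithetic = 0.5 * (per_sample_loss(z) + per_sample_loss(-z))   # paired +/- z draws

assert abs(plain.mean() - antithetic.mean()) < 0.05   # same expectation (equal to 2 here)
assert antithetic.var() < plain.var()                  # reduced variance (2 vs 6 here)
```

For this toy loss the odd-in-$z$ part cancels exactly between $+z$ and $-z$, which is the source of the variance reduction.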

C.1 Image Experiments

Network Architectures.

For all image generation experiments, the U-Net architecture originally proposed in [4] is used. The architecture and training hyperparameters are given in Table 3. The same architecture is used regardless of whether learning $b$, $v$, $s$, or $\eta$.

When using the SDE and learning $\eta_z$, we found that integrating to a time slightly before $t_f=1.0$ and using the denoising formula Equation 138 beyond this point provided the best results, as described in the main text.

References

[1] Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations, 2023.

[2] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019.

[3] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d' Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[4] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020.

[5] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.

[6] Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Maximilian Nickel, Aditya Grover, Ricky T. Q. Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 1749–1763. PMLR, 2022.

[7] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023b.

[8] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

[9] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[10] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

[11] Jerome H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.

[12] Scott Chen and Ramesh Gopinath. Gaussianization. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000.

[13] Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217 – 233, 2010.

[14] Esteban G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

[15] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.

[16] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017.

[17] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, page 2335–2344, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

[18] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2078–2087. PMLR, 10–15 Jul 2018.

[19] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d' Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[20] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[21] Chris Finlay, Joern-Henrik Jacobsen, Levon Nurbekyan, and Adam Oberman. How to train your neural ODE: the world of Jacobian and kinetic regularization. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3154–3164. PMLR, 13–18 Jul 2020.

[22] Derek Onken, Samy Wu Fung, Xingjian Li, and Lars Ruthotto. Ot-flow: Fast and accurate continuous normalizing flows via optimal transport. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):9223–9232, May 2021.

[23] Alexander Tong, Jessie Huang, Guy Wolf, David Van Dijk, and Smita Krishnaswamy. TrajectoryNet: A dynamic optimal transport network for modeling cellular dynamics. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9526–9536. PMLR, 13–18 Jul 2020.

[24] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France, 07–09 Jul 2015. PMLR.

[25] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005.

[26] Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Jacques Azéma and Marc Yor, editors, Séminaire de Probabilités XIX 1983/84, pages 177–206, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg. ISBN 978-3-540-39397-9.

[27] Felix Otto. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

[28] Young-Heon Kim and Emanuel Milman. A generalization of Caffarelli’s contraction theorem via (reverse) heat flow. Mathematische Annalen, 354:827–862, 2010.

[29] Dimitra Maoutsa, Sebastian Reich, and Manfred Opper. Interacting particle solutions of Fokker–Planck equations through gradient–log–density estimation. Entropy, 22(8), 2020.

[30] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 1415–1428. Curran Associates, Inc., 2021a.

[31] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. On density estimation with diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.

[32] Nicholas M Boffi and Eric Vanden-Eijnden. Probability flow solution of the Fokker–Planck equation. Machine Learning: Science and Technology, 4(3):035012, July 2023.

[33] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022.

[34] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, 2023.

[35] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped langevin diffusion. In International Conference on Learning Representations (ICLR), 2022.

[36] Stefano Peluchetti. Non-denoising forward-time diffusions. https://openreview.net/forum?id=oVfIKuhqfC, 2022.

[37] Xingchao Liu, Lemeng Wu, Mao Ye, and Qiang Liu. Let us build bridges: Understanding and extending diffusion generative models. arXiv:2208.14699, 2022.

[38] Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A Theodorou, Weili Nie, and Anima Anandkumar. $\text{I}^2$sb: Image-to-image schrödinger bridge. arXiv preprint arXiv:2302.05872, 2023a.

[39] Vignesh Ram Somnath, Matteo Pariset, Ya-Ping Hsieh, Maria Rodriguez Martinez, Andreas Krause, and Charlotte Bunne. Aligned diffusion schrödinger bridges. In The 39th Conference on Uncertainty in Artificial Intelligence, 2023.

[40] Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv:2209.14577, 2022.

[41] Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport. arXiv:2302.00482, 2023.

[42] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.

[43] Yongxin Chen, Tryphon T. Georgiou, and Michele Pavon. Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge. SIAM Review, 63(2):249–313, 2021.

[44] Linfeng Zhang, Weinan E, and Lei Wang. Monge-Ampère flow for generative modeling. arXiv.org:1809.10188, 2018.

[45] Chin-Wei Huang, Ricky T. Q. Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations, 2021.

[46] Liu Yang and George Em Karniadakis. Potential flow generator with l2 optimal transport regularity for generative models. IEEE Transactions on Neural Networks and Learning Systems, 33(2):528–538, 2022.

[47] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.

[48] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In The Eleventh International Conference on Learning Representations, 2023.

[49] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022b.

[50] Charlotte Bunne, Ya-Ping Hsieh, Marco Cuturi, and Andreas Krause. The Schrödinger Bridge between Gaussian Measures has a Closed Form. arXiv:2202.05722, February 2022.

[51] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

[52] Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. In Proceedings of the 39th International Conference on Machine Learning, 2022.

[53] Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv:2206.06227, 2022.

[54] Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In Shipra Agrawal and Francesco Orabona, editors, Proceedings of The 34th International Conference on Algorithmic Learning Theory, volume 201 of Proceedings of Machine Learning Research, pages 946–985, 2023.

[55] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023.

[56] Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. ICML, 2022a.

[57] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. In Gökhan BakIr, Thomas Hofmann, Alexander J Smola, Bernhard Schölkopf, and Ben Taskar, editors, Predicting structured data, chapter 10. MIT press, 2007.

[58] Yang Song and Diederik P. Kingma. How to train your energy-based models. arXiv:2101.03288, 2021.

[59] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

[60] Santosh S. Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Log-sobolev suffices. arXiv:1903.08568, 2019.

[61] Joseph L. Doob. Classical potential theory and its probabilistic counterpart, volume 262 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, New York, 1984. ISBN 0-387-90881-1.

[62] Christian Léonard. A survey of the schrödinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems, 34(4):1533–1574, 2014.

[63] Michael S. Albergo, Mark Goldstein, Nicholas M. Boffi, Rajesh Ranganath, and Eric Vanden-Eijnden. Stochastic interpolants with data-dependent couplings. arXiv, October 2023.

[64] Ronen Eldan. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geometric and Functional Analysis, 23(2):532–569, 2013.

[65] Ahmed El Alaoui and Andrea Montanari. An information-theoretic view of stochastic localization. IEEE Trans. Inf. Theory, 68(11):7423–7426, 2022.

[66] Andrea Montanari. Sampling, diffusions, and stochastic localization. arXiv:2305.10690, 2023.

[67] Eero P. Simoncelli and Edward H. Adelson. Noise removal via Bayesian wavelet coring. In Proceedings of 3rd IEEE International Conference on Image Processing, volume 1, pages 379–382 vol. 1, 1996.

[68] Aapo Hyvärinen. Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation, 11(7):1739–1768, 1999.

[69] Zahra Kadkhodaie and Eero P. Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. arXiv:2007.13640, 2021.

[70] Charles M. Stein. Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151, 1981.

[71] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.

[72] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models. ICML, 2023.

[73] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, page 807–814, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077.

[74] Maria-Elena Nilsback and Andrew Zisserman. A visual vocabulary for flower classification. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447–1454, 2006.