How to build a consistency model:
Learning flow maps via self-distillation

1Carnegie Mellon University, 2Harvard University, 3Courant Institute of Mathematical Sciences and Capital Fund Management
Flow map overview

Overview. The tangent condition connects flow maps to velocity fields, enabling direct training of consistency models without pre-trained teachers.

Introduction

Flow-based generative models achieve state-of-the-art sample quality, but require expensive differential equation solves at inference time. Flow maps, which generalize consistency models, learn to jump directly between points on flow trajectories, enabling one- or few-step generation. Yet despite their promise, these models lack a unified framework for efficient training.

We introduce such a mathematical framework, leading to three algorithmic families for learning flow maps: Eulerian, Lagrangian, and Progressive methods. Our approach converts any distillation scheme into a direct training algorithm by exploiting the tangent condition -- a simple relation between the flow map and its implicit velocity field. This eliminates the need for pre-trained teacher models while maintaining the training stability of distillation.

Theoretically, we show that our approach reveals a new class of high-performing methods, recovers many known methods for training flow maps (including consistency training, consistency distillation, shortcut models, align your flow, and mean flow), and provides significant insight into the design of training algorithms for flow maps. We test all approaches through numerical experiments on low-dimensional synthetic datasets, CIFAR-10, CelebA-64, and AFHQ-64, where we find that the class of Lagrangian methods uniformly outperforms both Eulerian and Progressive schemes.

Key Contributions

  1. Unified framework: We provide, to our knowledge, the most general mathematical characterization of consistency models, showing both how to design new learning algorithms and how existing methods emerge as special cases.
  2. Self-distillation: We describe how to convert any distillation scheme into a direct training algorithm via the tangent condition, eliminating the need for pre-trained teacher models and two-phase training.
  3. Lagrangian Self-Distillation: We introduce Lagrangian Self-Distillation (LSD), which avoids both the computation of spatial derivatives of the model and bootstrapping from multiple model steps. We show that LSD achieves superior theoretical guarantees and uniformly better empirical performance than Eulerian and Progressive schemes, which require spatial gradient computations and bootstrapping, respectively, leading to training instability and reduced performance.

Results

Unconditional Image Generation: CIFAR-10, CelebA-64, AFHQ-64

Stability. In all cases we tried, the Eulerian losses were highly unstable without significant engineering effort to stabilize training. As detailed in the paper, this instability can be traced to the appearance of the spatial Jacobian in the loss. Lagrangian and Progressive methods were far more stable, so we compare only those here.

Quantitative Performance. LSD achieves the best FID scores across all datasets and step counts. On CIFAR-10, LSD reaches FID 3.33 at 8 steps, while PSD variants require 16 steps to approach similar quality. The advantage is even more pronounced on CelebA-64 and AFHQ-64. PSD-U and PSD-M denote two sampling schemes for the intermediate point $u$: uniformly distributed between $s$ and $t$ (PSD-U), or fixed at the midpoint $u = (s+t)/2$ (PSD-M).

Dataset (FID ↓)   Method   1 Step   2 Steps   4 Steps   8 Steps   16 Steps
CIFAR-10          LSD        8.10      4.37      3.34      3.33      3.57
CIFAR-10          PSD-M     12.81      8.43      5.96      5.07      4.64
CIFAR-10          PSD-U     13.61      7.95      6.03      5.32      5.16
CelebA-64         LSD       12.22      5.74      3.18      2.18      1.96
CelebA-64         PSD-M     19.64     11.75      7.89      6.06      5.09
CelebA-64         PSD-U     18.81     11.02      7.47      6.00      5.63
AFHQ-64           LSD       11.19      7.78      7.00      5.89      5.61
AFHQ-64           PSD-M     18.86     14.75     14.40     13.26     11.07
AFHQ-64           PSD-U     14.50     10.73     10.99     12.02     11.47
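The two intermediate-point schemes compared in the table can be sketched as follows (the function name `sample_u` is illustrative, not from the paper):

```python
import random

def sample_u(s, t, scheme="U"):
    # Intermediate time for the PSD loss:
    # PSD-U draws u uniformly in (s, t); PSD-M fixes u at the midpoint.
    if scheme == "U":
        return s + (t - s) * random.random()
    if scheme == "M":
        return 0.5 * (s + t)
    raise ValueError(f"unknown scheme: {scheme}")
```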

Qualitative Results. We compare progressive refinement across methods by systematically increasing the number of sampling steps for a fixed random seed. All methods improve with more steps, but LSD produces higher-quality samples than the PSD variants at every step count. Each row shows samples at $N \in \{1, 2, 4, 8, 16\}$ steps from the same random seed.
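Given a flow map, $N$-step sampling is simply composition of jumps over a time grid. A minimal sketch, using the exact map of a toy linear ODE in place of a trained model (all names illustrative):

```python
import math

def flow_map(s, t, x):
    # Exact flow map of the toy ODE dx/dt = -x: X_{s,t}(x) = e^{-(t-s)} x.
    # Stands in for a learned map x + (t - s) * v_hat_{s,t}(x).
    return math.exp(-(t - s)) * x

def sample(x0, n_steps, X=flow_map):
    # Few-step generation: compose the flow map over a uniform grid of [0, 1].
    x = x0
    for i in range(n_steps):
        s, t = i / n_steps, (i + 1) / n_steps
        x = X(s, t, x)
    return x
```

For an exact map, the semigroup property makes the endpoint independent of the step count; for a learned map, increasing $N$ typically reduces the error of the learned jumps.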

[Image grids: for each of CIFAR-10, CelebA-64, and AFHQ-64, rows show samples from PSD-M, PSD-U, and LSD at $N \in \{1, 2, 4, 8, 16\}$ steps.]

Checkerboard: Sharp Boundaries and Multimodality

On the synthetic 2D checkerboard dataset -- a paradigmatic model of multimodality and sharp boundaries -- LSD captures the mode structure most accurately. Here we found that ESD (Eulerian Self-Distillation) remained stable, likely due to the simpler network parameterization in this low-dimensional setting. ESD and PSD introduce artifacts or blur boundaries at low step counts, while LSD achieves the best KL divergence at 1, 2, 4, and 8 steps.

KL Divergence (↓)

Method   1 Step   2 Steps   4 Steps   8 Steps   16 Steps
LSD      0.0864   0.0765    0.0708    0.0699    0.0710
ESD      0.0983   0.0921    0.0834    0.0816    0.0751
PSD-M    0.1456   0.0891    0.0812    0.0717    0.0689
PSD-U    0.1113   0.1067    0.0747    0.0727    0.0679

[Figure: checkerboard densities produced by each method.]

Theory & Framework

The Tangent Condition

The flow map $X_{s,t}$ satisfies the defining property (or jump condition) that $X_{s,t}(x_s) = x_t$ for any trajectory $x_t$ of the probability flow $\dot{x}_t = b_t(x_t)$. One of our key insights is that the velocity field $b_t$ is implicitly encoded in the flow map itself:

Lemma (Tangent Condition): Let $X_{s,t}$ denote the flow map. Then, $$\lim_{s\to t}\partial_t X_{s,t}(x) = b_t(x) \quad \forall t \in [0,1], \; \forall x \in \mathbb{R}^d.$$

To exploit this algorithmically, we parameterize the flow map as $$X_{s,t}(x) = x + (t-s)v_{s,t}(x),$$ which automatically enforces the boundary condition $X_{s,s}(x) = x$. Taking the limit as $s \to t$: $$\lim_{s\to t}\partial_t X_{s,t}(x) = \lim_{s\to t}\left[v_{s,t}(x) + (t-s)\partial_t v_{s,t}(x)\right] = v_{t,t}(x).$$ Combined with the tangent condition, we obtain the fundamental relation:

$$v_{t,t}(x) = b_t(x).$$

This shows that $v_{t,t}$ on the diagonal recovers the velocity field, which we can learn via standard flow matching. The challenge is then learning $v_{s,t}$ off the diagonal ($s \neq t$), which we address through self-distillation.
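The relation $v_{t,t} = b_t$ can be checked numerically on a toy linear ODE $\dot{x} = ax$, whose exact flow map $X_{s,t}(x) = e^{a(t-s)}x$ implies $v_{s,t}(x) = \frac{e^{a(t-s)}-1}{t-s}x$. A sketch (toy example, not from the paper):

```python
import math

a = -0.7  # toy linear ODE dx/dt = a*x, exact flow map X_{s,t}(x) = e^{a(t-s)} x

def v(s, t, x):
    # v_{s,t}(x) implied by X_{s,t}(x) = x + (t-s) v_{s,t}(x); the diagonal
    # s = t is a removable singularity filled with its limit a*x = b_t(x).
    if s == t:
        return a * x
    return (math.exp(a * (t - s)) - 1.0) / (t - s) * x

def X(s, t, x):
    return x + (t - s) * v(s, t, x)

# Finite-difference check of the tangent condition at s = t.
x, t, h = 1.5, 0.4, 1e-6
fd = (X(t, t + h, x) - X(t, t, x)) / h  # ~ lim_{s->t} partial_t X_{s,t}(x)
```

The finite difference `fd` approaches $b_t(x) = ax$ as $h \to 0$, and the boundary condition $X_{s,s}(x) = x$ holds by construction.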

Characterizing the Map

Given the parameterization $X_{s,t}(x) = x + (t-s)v_{s,t}(x)$ and the tangent condition $v_{t,t}(x) = b_t(x)$, we can characterize the flow map through three equivalent conditions:

Proposition (Flow Map Characterizations): Assume $X_{s,t}(x) = x + (t-s)v_{s,t}(x)$ with $v_{t,t}(x) = b_t(x)$. Then $X_{s,t}$ is the flow map if and only if any of the following holds:

(i) Lagrangian condition: $$\partial_t X_{s,t}(x) = v_{t,t}(X_{s,t}(x)) \quad \forall (s,t,x) \in [0,1]^2 \times \mathbb{R}^d$$
(ii) Eulerian condition: $$\partial_s X_{s,t}(x) + \nabla X_{s,t}(x)v_{s,s}(x) = 0 \quad \forall (s,t,x) \in [0,1]^2 \times \mathbb{R}^d$$
(iii) Semigroup condition: $$X_{u,t}(X_{s,u}(x)) = X_{s,t}(x) \quad \forall (s,u,t,x) \in [0,1]^3 \times \mathbb{R}^d$$

Each condition provides a different perspective on the flow map: the Lagrangian condition follows trajectories forward in time, the Eulerian condition describes transport via a partial differential equation, and the semigroup condition expresses the composition of jumps. These yield three distinct self-distillation algorithms, as we now discuss.
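All three characterizations can be verified with finite differences on the exact flow map of a toy linear ODE (an illustrative check; the values of `h`, `s`, `t`, `u`, `x` are arbitrary choices):

```python
import math

a = -0.7  # toy linear ODE dx/dt = a*x

def X(s, t, x):
    # Exact flow map of dx/dt = a*x.
    return math.exp(a * (t - s)) * x

def b(t, x):
    # Velocity field; equals v_{t,t} on the diagonal.
    return a * x

h, s, t, u, x = 1e-6, 0.2, 0.9, 0.5, 1.3

# (i) Lagrangian: partial_t X_{s,t}(x) - b_t(X_{s,t}(x)) = 0
lag = (X(s, t + h, x) - X(s, t, x)) / h - b(t, X(s, t, x))
# (ii) Eulerian: partial_s X_{s,t}(x) + grad X_{s,t}(x) * b_s(x) = 0
grad = (X(s, t, x + h) - X(s, t, x)) / h  # spatial derivative (scalar case)
eul = (X(s + h, t, x) - X(s, t, x)) / h + grad * b(s, x)
# (iii) Semigroup: X_{u,t}(X_{s,u}(x)) - X_{s,t}(x) = 0
semi = X(u, t, X(s, u, x)) - X(s, t, x)
```

All three residuals vanish up to finite-difference error for the exact map; for a learned map, each residual becomes a training signal.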

Algorithmic Framework

Proposition (Self-Distillation): The flow map $X_{s,t}$ is given by $X_{s,t}(x) = x + (t-s)v_{s,t}(x)$ where $v_{s,t}$ is the unique minimizer of $$\mathcal{L}(\hat{v}) = \mathcal{L}_b(\hat{v}) + \mathcal{L}_{\text{dist}}(\hat{v}).$$ Here $\mathcal{L}_b$ is the flow matching loss on the diagonal, $$\mathcal{L}_b(\hat{v}) = \int_0^1 \mathbb{E}_{x_0,x_1}\left[|\hat{v}_{t,t}(I_t) - \dot{I}_t|^2\right]dt,$$ where $I_t$ denotes an interpolant between the base sample $x_0$ and the data sample $x_1$ (e.g., the linear interpolant $I_t = (1-t)x_0 + tx_1$), and $\mathcal{L}_{\text{dist}}$ is any of the following three distillation losses:

Lagrangian Self-Distillation (LSD): $$\mathcal{L}_{\text{LSD}}(\hat{v}) = \int_0^1\int_0^t \mathbb{E}_{x_0,x_1}\left[\left|\partial_t \hat{X}_{s,t}(I_s) - \hat{v}_{t,t}(\hat{X}_{s,t}(I_s))\right|^2\right]ds\,dt$$
Eulerian Self-Distillation (ESD): $$\mathcal{L}_{\text{ESD}}(\hat{v}) = \int_0^1\int_0^t \mathbb{E}_{x_0,x_1}\left[\left|\partial_s \hat{X}_{s,t}(I_s) + \nabla \hat{X}_{s,t}(I_s)\hat{v}_{s,s}(I_s)\right|^2\right]ds\,dt$$
Progressive Self-Distillation (PSD): $$\mathcal{L}_{\text{PSD}}(\hat{v}) = \int_0^1\int_0^t\int_s^t \mathbb{E}_{x_0,x_1}\left[\left|\hat{X}_{s,t}(I_s) - \hat{X}_{u,t}(\hat{X}_{s,u}(I_s))\right|^2\right]du\,ds\,dt$$

By converting each characterization above into a training objective, we obtain three self-distillation algorithms that eliminate the need for pre-trained teachers while maintaining the stability of distillation.
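As a sanity check on the combined objective, one can evaluate $\mathcal{L}_b + \mathcal{L}_{\text{LSD}}$ for a deliberately trivial constant model $\hat{v}_{s,t}(x) = \theta$: the LSD term then vanishes identically, and the diagonal flow matching term alone pins $\theta$ to the mean displacement. A pure-Python sketch (the toy data and the linear interpolant $I_t = (1-t)x_0 + tx_1$ are assumptions for illustration):

```python
import random

random.seed(0)
# Toy data: base samples x0 ~ N(0,1), data samples x1 ~ N(2,1), with the
# linear interpolant I_t = (1-t) x0 + t x1, so that dI_t/dt = x1 - x0.
pairs = [(random.gauss(0, 1), random.gauss(2, 1)) for _ in range(4000)]

def loss(theta):
    # Monte Carlo estimate of L_b + L_LSD for the constant model
    # v_hat_{s,t}(x) = theta. Then partial_t X_hat_{s,t}(x) = theta and the
    # LSD target v_hat_{t,t}(X_hat_{s,t}(x)) = theta, so L_LSD = 0 and the
    # diagonal flow-matching term alone determines the minimizer.
    l_b = sum((theta - (x1 - x0)) ** 2 for x0, x1 in pairs) / len(pairs)
    l_lsd = 0.0
    return l_b + l_lsd

# Grid search: the minimizer sits at the mean displacement E[x1 - x0] = 2.
best = min((loss(th), th) for th in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])[1]
```

For any nontrivial (time-dependent) model, the distillation term no longer vanishes and supplies the off-diagonal learning signal.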

Self-distillation framework

Our plug-and-play approach pairs any distillation objective on the off-diagonal $s \neq t$ with flow matching on the diagonal $s=t$.

Controlling Information Flow via Stopgradient

In practice, it is useful to control the flow of information from the diagonal ($s=t$) to the off-diagonal ($s \neq t$). We can implement this with the stopgradient operator $\text{sg}(\cdot)$, which treats its argument as constant during backpropagation. This prevents gradient flow through specific terms, enabling us to simulate the setting where we have a pre-trained teacher. It is particularly important to avoid backpropagating through the spatial gradient $\nabla \hat{X}_{s,t}$ in the Eulerian loss, which is often numerically unstable and requires increased memory.
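The effect of $\text{sg}(\cdot)$ is purely on the backward pass: the loss value is unchanged, but gradients through the wrapped term are dropped, which can change the parameter gradient substantially (in this toy, it even flips sign). A scalar example with hand-computed derivatives, where `f` and `g` are hypothetical stand-ins for the student and teacher branches:

```python
def loss_and_grads(theta):
    # Squared-error loss (f(theta) - g(theta))^2 with f(x) = x^2, g(x) = 3x.
    f, df = theta ** 2, 2 * theta
    g, dg = 3 * theta, 3.0
    value = (f - g) ** 2
    grad_full = 2 * (f - g) * (df - dg)  # backprop through both branches
    grad_sg = 2 * (f - g) * df           # sg(g): teacher treated as constant
    return value, grad_full, grad_sg

value, grad_full, grad_sg = loss_and_grads(1.0)
```

In a deep learning framework, $\text{sg}(\cdot)$ corresponds to detaching the wrapped tensor from the computational graph.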

The practical losses we recommend with stopgradient are:

Lagrangian Self-Distillation (LSD):

$$\mathcal{L}_{\text{LSD}}(\hat{v}) = \int_0^1\int_0^t \mathbb{E}_{x_0,x_1}\left[\left|\partial_t \hat{X}_{s,t}(I_s) - \text{sg}(\hat{v}_{t,t}(\hat{X}_{s,t}(I_s)))\right|^2\right]ds\,dt$$

Eulerian Self-Distillation (ESD):

$$\mathcal{L}_{\text{ESD}}(\hat{v}) = \int_0^1\int_0^t \mathbb{E}_{x_0,x_1}\left[\left|\partial_s \hat{X}_{s,t}(I_s) + \text{sg}(\nabla \hat{X}_{s,t}(I_s)\hat{v}_{s,s}(I_s))\right|^2\right]ds\,dt$$

Progressive Self-Distillation (PSD):

$$\mathcal{L}_{\text{PSD}}(\hat{v}) = \int_0^1\int_0^t \mathbb{E}_{p_\gamma}\mathbb{E}_{x_0,x_1}\left[\left|\hat{v}_{s,t}(I_s) - \text{sg}\left((1-\gamma)\hat{v}_{s,u}(I_s) + \gamma\hat{v}_{u,t}(\hat{X}_{s,u}(I_s))\right)\right|^2\right]ds\,dt$$
Above, we wrote the PSD loss entirely in terms of $\hat{v}$ and introduced $\gamma \in [0, 1]$, which defines the intermediate point $u = \gamma s + (1-\gamma) t$. See the appendix of the paper for further details.

Recovering Known Methods: We show in the appendix of our paper that all known algorithms for training consistency models (including consistency training, consistency distillation, shortcut models, align your flow, and mean flow) can be recovered via an appropriate choice of stopgradient placement in our framework.

BibTeX

@article{boffi2025flowmaps,
  author    = {Nicholas M. Boffi and Michael S. Albergo and Eric Vanden-Eijnden},
  title     = {How to build a consistency model: Learning flow maps via self-distillation},
  journal   = {arXiv preprint arXiv:2505.18825},
  year      = {2025},
}