how to

Flow and diffusion

Nov 10, 2025
notesjulyfun技术学习models
4 Minutes
723 Words
  • ODE(常微分方程) SDE(随机微分方程, 带随机波动的变化)
  • SDE: 系统的未来状态不仅依赖当前状态,还包含非决定性的随机波动(噪声),因此未来状态是一个概率分布. 方程中多了随机项(通常用布朗运动表示)
  • CFM: ?
  • 教材笔记,详实简单: https://arxiv.org/pdf/2506.02070
  • Lab: https://github.com/eje24/iap-diffusion-labs

Lec 1

  • https://diffusion.csail.mit.edu/docs/slides_lecture_1.pdf
  • Diffusion network 预测的就是向量场(输入: t, X 当前位置. 输出: X 速度) default
  • 改为神经网络形式(引入 $theta$). 注意这里 $sigma_t$ 可以由时间变化:
  • default
  • [qm] 为何噪声到图像的过程可以建模为带布朗运动的 SDE?

Lec 2

  • Conditional Probability Path: 给定目标点 $z$(例如 diffusion 中的最终图像),求 $x$ 下一步路径
  • Marginal Probability Path: 不给定条件(所有分布的加权平均),求下一步路径
    • 两者在 $t_0$ 是完全一样的,从第二步开始才有区别.
  • 向量场函数(也可视为梯度或者速度):$$“def” u(x: RR^d, t: [0, 1]) |-> RR^d$$
  • default
  • [qm] 为何 score function 这样定义

[附]

Lec3 ok

Lec4: Guided (conditional) DDM

1
class ResidualLayer(nn.Module):
2
def __init__(self, channels: int, time_embed_dim: int, y_embed_dim: int):
3
super().__init__()
4
self.block1 = nn.Sequential(
5
nn.SiLU(),
6
nn.BatchNorm2d(channels),
7
nn.Conv2d(channels, channels, kernel_size=3, padding=1)
8
)
9
self.block2 = nn.Sequential(
10
nn.SiLU(),
11
nn.BatchNorm2d(channels),
12
nn.Conv2d(channels, channels, kernel_size=3, padding=1)
13
)
14
# Converts (bs, time_embed_dim) -> (bs, channels)
15
self.time_adapter = nn.Sequential(
36 collapsed lines
16
nn.Linear(time_embed_dim, time_embed_dim),
17
nn.SiLU(),
18
nn.Linear(time_embed_dim, channels)
19
)
20
# Converts (bs, y_embed_dim) -> (bs, channels)
21
self.y_adapter = nn.Sequential(
22
nn.Linear(y_embed_dim, y_embed_dim),
23
nn.SiLU(),
24
nn.Linear(y_embed_dim, channels)
25
)
26
27
def forward(self, x: torch.Tensor, t_embed: torch.Tensor, y_embed: torch.Tensor) -> torch.Tensor:
28
"""
29
Args:
30
- x: (bs, c, h, w)
31
- t_embed: (bs, t_embed_dim)
32
- y_embed: (bs, y_embed_dim)
33
"""
34
res = x.clone() # (bs, c, h, w)
35
36
# Initial conv block
37
x = self.block1(x) # (bs, c, h, w)
38
39
# Add time embedding
40
t_embed = self.time_adapter(t_embed).unsqueeze(-1).unsqueeze(-1) # (bs, c, 1, 1)
41
x = x + t_embed
42
43
# Add y embedding (conditional embedding)
44
y_embed = self.y_adapter(y_embed).unsqueeze(-1).unsqueeze(-1) # (bs, c, 1, 1)
45
x = x + y_embed
46
47
# Second conv block
48
x = self.block2(x) # (bs, c, h, w)
49
50
# Add back residual
51
x = x + res # (bs, c, h, w)

不禁令人发现:

  • 自然语言适合描述低频信号,而不擅长描述高频信号
  • UNet 通过保留高频信号来抑制信息过于平滑的问题
  • 之所以“视频能展示的内容远少于占用空间相同的游戏”是因为 mp4 需要编码高频信号
  • AE 压缩为低维潜在向量的结构之所以能用于生成,是因为它强制学习数据的底层结构

Some question

  • 这里如何编码时间 (bs, 1, 1, 1) => (bs, emb_dim)?用了 nn.Parameter.
  • 可能也可以用无需学习的正弦余弦位置编码
1
class FourierEncoder(nn.Module):
2
"""
3
Based on https://github.com/lucidrains/denoising-diffusion-pytorch/blob/main/denoising_diffusion_pytorch/karras_unet.py#L183
4
"""
5
def __init__(self, dim: int):
6
super().__init__()
7
assert dim % 2 == 0
8
self.half_dim = dim // 2
9
self.weights = nn.Parameter(torch.randn(1, self.half_dim))
10
11
def forward(self, t: torch.Tensor) -> torch.Tensor:
12
"""
13
Args:
14
- t: (bs, 1, 1, 1)
15
Returns:
7 collapsed lines
16
- embeddings: (bs, dim)
17
"""
18
t = t.view(-1, 1) # (bs, 1)
19
freqs = t * self.weights * 2 * math.pi # (bs, half_dim)
20
sin_embed = torch.sin(freqs) # (bs, half_dim)
21
cos_embed = torch.cos(freqs) # (bs, half_dim)
22
return torch.cat([sin_embed, cos_embed], dim=-1) * math.sqrt(2) # (bs, dim)
  • default
Article title:Flow and diffusion
Article author:Julyfun
Release time:Nov 10, 2025
Copyright 2025
Sitemap