Flow and diffusion

ODE（常微分方程） SDE（随机微分方程, 带随机波动的变化）
SDE: 系统的未来状态不仅依赖当前状态，还包含非决定性的随机波动（噪声），因此未来状态是一个概率分布. 方程中多了随机项（通常用布朗运动表示）
CFM: ?
教材笔记，详实简单: https://arxiv.org/pdf/2506.02070
Lab: https://github.com/eje24/iap-diffusion-labs

Lec 1

https://diffusion.csail.mit.edu/docs/slides_lecture_1.pdf
Diffusion network 预测的就是向量场（输入: t, X 当前位置. 输出: X 速度）
改为神经网络形式（引入 $theta$）. 注意这里 $sigma_t$ 可以由时间变化:
[qm] 为何噪声到图像的过程可以建模为带布朗运动的 SDE?

Lec 2

Conditional Probability Path: 给定目标点 $z$（例如 diffusion 中的最终图像），求 $x$ 下一步路径
Marginal Probability Path: 不给定条件（所有分布的加权平均），求下一步路径
- 两者在 $t_0$ 是完全一样的，从第二步开始才有区别.
向量场函数（也可视为梯度或者速度）：$$“def” u(x: RR^d, t: [0, 1]) |-> RR^d$$
[qm] 为何 score function 这样定义

[附]

很好的散度讲解: https://zhuanlan.zhihu.com/p/165479232kj

Lec3 ok

Lec4: Guided (conditional) DDM

https://github.com/eje24/iap-diffusion-labs/blob/main/solutions/lab_three_complete.ipynb
condition 是怎么加入 UNet: 自己看代码无需多言.

1
class ResidualLayer(nn.Module):
2
    def __init__(self, channels: int, time_embed_dim: int, y_embed_dim: int):
3
        super().__init__()
4
        self.block1 = nn.Sequential(
5
            nn.SiLU(),
6
            nn.BatchNorm2d(channels),
7
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
8
        )
9
        self.block2 = nn.Sequential(
10
            nn.SiLU(),
11
            nn.BatchNorm2d(channels),
12
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
13
        )
14
        # Converts (bs, time_embed_dim) -> (bs, channels)
15
        self.time_adapter = nn.Sequential(
36 collapsed lines
16
            nn.Linear(time_embed_dim, time_embed_dim),
17
            nn.SiLU(),
18
            nn.Linear(time_embed_dim, channels)
19
        )
20
        # Converts (bs, y_embed_dim) -> (bs, channels)
21
        self.y_adapter = nn.Sequential(
22
            nn.Linear(y_embed_dim, y_embed_dim),
23
            nn.SiLU(),
24
            nn.Linear(y_embed_dim, channels)
25
        )
26

27
    def forward(self, x: torch.Tensor, t_embed: torch.Tensor, y_embed: torch.Tensor) -> torch.Tensor:
28
        """
29
        Args:
30
        - x: (bs, c, h, w)
31
        - t_embed: (bs, t_embed_dim)
32
        - y_embed: (bs, y_embed_dim)
33
        """
34
        res = x.clone() # (bs, c, h, w)
35

36
        # Initial conv block
37
        x = self.block1(x) # (bs, c, h, w)
38

39
        # Add time embedding
40
        t_embed = self.time_adapter(t_embed).unsqueeze(-1).unsqueeze(-1) # (bs, c, 1, 1)
41
        x = x + t_embed
42

43
        # Add y embedding (conditional embedding)
44
        y_embed = self.y_adapter(y_embed).unsqueeze(-1).unsqueeze(-1) # (bs, c, 1, 1)
45
        x = x + y_embed
46

47
        # Second conv block
48
        x = self.block2(x) # (bs, c, h, w)
49

50
        # Add back residual
51
        x = x + res # (bs, c, h, w)

不禁令人发现：

自然语言适合描述低频信号，而不擅长描述高频信号
UNet 通过保留高频信号来抑制信息过于平滑的问题
之所以“视频能展示的内容远少于占用空间相同的游戏”是因为 mp4 需要编码高频信号
AE 压缩为低维潜在向量的结构之所以能用于生成，是因为它强制学习数据的底层结构

Some question

这里如何编码时间 (bs, 1, 1, 1) => (bs, emb_dim)？用了 nn.Parameter.
可能也可以用无需学习的正弦余弦位置编码

1
class FourierEncoder(nn.Module):
2
    """
3
    Based on https://github.com/lucidrains/denoising-diffusion-pytorch/blob/main/denoising_diffusion_pytorch/karras_unet.py#L183
4
    """
5
    def __init__(self, dim: int):
6
        super().__init__()
7
        assert dim % 2 == 0
8
        self.half_dim = dim // 2
9
        self.weights = nn.Parameter(torch.randn(1, self.half_dim))
10

11
    def forward(self, t: torch.Tensor) -> torch.Tensor:
12
        """
13
        Args:
14
        - t: (bs, 1, 1, 1)
15
        Returns:
7 collapsed lines
16
        - embeddings: (bs, dim)
17
        """
18
        t = t.view(-1, 1) # (bs, 1)
19
        freqs = t * self.weights * 2 * math.pi # (bs, half_dim)
20
        sin_embed = torch.sin(freqs) # (bs, half_dim)
21
        cos_embed = torch.cos(freqs) # (bs, half_dim)
22
        return torch.cat([sin_embed, cos_embed], dim=-1) * math.sqrt(2) # (bs, dim)