What are the responsibilities of the Stable Diffusion pipeline?
Put simply, the UNet operates on the VAE's latents: it predicts the noise added to a latent, and the VAE decoder maps the denoised latent back to an image.
pipe:
- unet
- vae
- text_encoder
- image_encoder
- feature_extractor
- tokenizer (pure string → token-id preprocessing; not a learned module, it has no parameters)
- scheduler
- safety_checker (optional; filters unsafe outputs)
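All of these components revolve around the VAE's latent space. A quick back-of-the-envelope with SD v1 numbers (512×512 RGB input, 8× spatial downsampling, 4 latent channels) shows why working there is cheap:

```python
# SD v1 numbers: 512x512 RGB input; the VAE downsamples spatially by 8
# and keeps 4 channels, giving a (4, 64, 64) latent.
pixels = 512 * 512 * 3                  # values in the image: 786,432
latent = 4 * (512 // 8) * (512 // 8)    # values in the latent: 16,384
print(latent, pixels / latent)          # the UNet processes 48x fewer values
```

This 48× reduction is exactly why the UNet below runs on `(4, 64, 64)` tensors instead of full-resolution pixels.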
```mermaid
graph
  text --> t(tokenizer & text_encoder) --> e[text_embedding]
  e --> unet(unet)
  n["noisy_latents (4, 64, 64)"] --> unet
  timestep --> unet
  unet --> p["noise prediction (4, 64, 64)"]
  p --> v_d(vae decoder)
```
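The data flow in the graph can be sketched as a toy denoising loop. Everything here is a stand-in (random tensors instead of real weights, a crude fixed-step update instead of a real scheduler); only the shapes and the call structure mirror SD v1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for tokenizer + text_encoder: 77 token slots, 768-dim embeddings.
def encode_text(prompt):
    return rng.standard_normal((77, 768))

# Stand-in for the UNet: takes latents, a timestep, and text conditioning,
# and returns a noise prediction with the same shape as the latents.
def unet(latents, timestep, text_embedding):
    assert latents.shape == (4, 64, 64)
    return rng.standard_normal(latents.shape)

latents = rng.standard_normal((4, 64, 64))     # start from pure noise
text_embedding = encode_text("a photo of a cat")

for t in [999, 666, 333, 0]:                   # a very short timestep schedule
    noise_pred = unet(latents, t, text_embedding)
    latents = latents - 0.1 * noise_pred       # crude scheduler.step() stand-in

# Only after the loop does the VAE decoder map latents back to pixels.
print(latents.shape)
```

Note the VAE decoder sits outside the loop: the scheduler and UNet iterate purely in latent space.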
VAE (AutoencoderKL) architecture
```
AutoencoderKL                            [1, 3, 64, 64]       --
├─Encoder: 1-1                           [1, 8, 8, 8]         --
│    └─Conv2d: 2-1                       [1, 128, 64, 64]     3,584
│    └─ModuleList: 2-2                   --                   --
│    │    └─DownEncoderBlock2D: 3-1      [1, 128, 32, 32]     738,944
│    │    └─DownEncoderBlock2D: 3-2      [1, 256, 16, 16]     2,690,304
│    │    └─DownEncoderBlock2D: 3-3      [1, 512, 8, 8]       10,754,560
│    │    └─DownEncoderBlock2D: 3-4      [1, 512, 8, 8]       9,443,328
│    └─UNetMidBlock2D: 2-3               [1, 512, 8, 8]       --
│    │    └─ModuleList: 3-7              --                   (recursive)
│    │    └─ModuleList: 3-6              --                   1,051,648
│    │    └─ModuleList: 3-7              --                   (recursive)
│    └─GroupNorm: 2-4                    [1, 512, 8, 8]       1,024
│    └─SiLU: 2-5                         [1, 512, 8, 8]       --
│    └─Conv2d: 2-6                       [1, 8, 8, 8]         36,872
├─Conv2d: 1-2                            [1, 8, 8, 8]         72
├─Conv2d: 1-3                            [1, 4, 8, 8]         20
├─Decoder: 1-4                           [1, 3, 64, 64]       --
│    └─Conv2d: 2-7                       [1, 512, 8, 8]       18,944
│    └─UNetMidBlock2D: 2-8               [1, 512, 8, 8]       --
│    │    └─ModuleList: 3-10             --                   (recursive)
│    │    └─ModuleList: 3-9              --                   1,051,648
│    │    └─ModuleList: 3-10             --                   (recursive)
│    └─ModuleList: 2-9                   --                   --
│    │    └─UpDecoderBlock2D: 3-11       [1, 512, 16, 16]     16,524,800
│    │    └─UpDecoderBlock2D: 3-12       [1, 512, 32, 32]     16,524,800
│    │    └─UpDecoderBlock2D: 3-13       [1, 256, 64, 64]     4,855,296
│    │    └─UpDecoderBlock2D: 3-14       [1, 128, 64, 64]     1,067,648
│    └─GroupNorm: 2-10                   [1, 128, 64, 64]     256
│    └─SiLU: 2-11                        [1, 128, 64, 64]     --
│    └─Conv2d: 2-12                      [1, 3, 64, 64]       3,459
```
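The parameter counts in the summary can be checked by hand with the usual Conv2d formula (`c_out * c_in * k * k` weights plus `c_out` biases). For example, the encoder's first 3→128 conv, the decoder's final 128→3 conv, and the two standalone 1×1 quant convs (8→8 and 4→4):

```python
def conv2d_params(c_in, c_out, k=3):
    # weight tensor: c_out * c_in * k * k, plus one bias per output channel
    return c_out * c_in * k * k + c_out

print(conv2d_params(3, 128))       # encoder conv_in  -> 3,584
print(conv2d_params(128, 3))       # decoder conv_out -> 3,459
print(conv2d_params(8, 8, k=1))    # quant_conv       -> 72
print(conv2d_params(4, 4, k=1))    # post_quant_conv  -> 20
```

These match the table rows for `Conv2d: 2-1`, `Conv2d: 2-12`, `Conv2d: 1-2`, and `Conv2d: 1-3`.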