What are the responsibilities of the Stable Diffusion pipeline?
Put simply, the UNet operates on the VAE's latents: it predicts the noise added to a latent, and the VAE decoder maps the denoised latent back to an image.
pipe:
- unet
- vae
- text_encoder
- image_encoder
- feature_extractor
- tokenizer (pure string → token-id preprocessing; not a learned module, it has no parameters)
- scheduler
- safety_checker (optional; filters unsafe outputs)
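All of these components revolve around the VAE's latent space. A quick back-of-the-envelope with SD v1 numbers (512×512 RGB input, 8× spatial downsampling, 4 latent channels) shows why working there is cheap:

```python
# SD v1 numbers: 512x512 RGB input; the VAE downsamples spatially by 8
# and keeps 4 channels, giving a (4, 64, 64) latent.
pixels = 512 * 512 * 3                  # values in the image: 786,432
latent = 4 * (512 // 8) * (512 // 8)    # values in the latent: 16,384
print(latent, pixels / latent)          # the UNet processes 48x fewer values
```

This 48× reduction is exactly why the UNet below runs on `(4, 64, 64)` tensors instead of full-resolution pixels.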
```mermaid
graph
  text --> t(tokenizer & text_encoder) --> e[text_embedding]
  e --> unet(unet)
  n["noisy_latents (4, 64, 64)"] --> unet
  timestep --> unet
  unet --> p["noise prediction (4, 64, 64)"]
  p --> v_d(vae decoder)
```
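The data flow in the graph can be sketched as a toy denoising loop. Everything here is a stand-in (random tensors instead of real weights, a crude fixed-step update instead of a real scheduler); only the shapes and the call structure mirror SD v1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for tokenizer + text_encoder: 77 token slots, 768-dim embeddings.
def encode_text(prompt):
    return rng.standard_normal((77, 768))

# Stand-in for the UNet: takes latents, a timestep, and text conditioning,
# and returns a noise prediction with the same shape as the latents.
def unet(latents, timestep, text_embedding):
    assert latents.shape == (4, 64, 64)
    return rng.standard_normal(latents.shape)

latents = rng.standard_normal((4, 64, 64))     # start from pure noise
text_embedding = encode_text("a photo of a cat")

for t in [999, 666, 333, 0]:                   # a very short timestep schedule
    noise_pred = unet(latents, t, text_embedding)
    latents = latents - 0.1 * noise_pred       # crude scheduler.step() stand-in

# Only after the loop does the VAE decoder map latents back to pixels.
print(latents.shape)
```

Note the VAE decoder sits outside the loop: the scheduler and UNet iterate purely in latent space.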
VAE (AutoencoderKL) architecture
```
AutoencoderKL                            [1, 3, 64, 64]       --
├─Encoder: 1-1                           [1, 8, 8, 8]         --
│    └─Conv2d: 2-1                       [1, 128, 64, 64]     3,584
│    └─ModuleList: 2-2                   --                   --
│    │    └─DownEncoderBlock2D: 3-1      [1, 128, 32, 32]     738,944
│    │    └─DownEncoderBlock2D: 3-2      [1, 256, 16, 16]     2,690,304
│    │    └─DownEncoderBlock2D: 3-3      [1, 512, 8, 8]       10,754,560
│    │    └─DownEncoderBlock2D: 3-4      [1, 512, 8, 8]       9,443,328
│    └─UNetMidBlock2D: 2-3               [1, 512, 8, 8]       --
│    │    └─ModuleList: 3-7              --                   (recursive)
│    │    └─ModuleList: 3-6              --                   1,051,648
│    │    └─ModuleList: 3-7              --                   (recursive)
│    └─GroupNorm: 2-4                    [1, 512, 8, 8]       1,024
│    └─SiLU: 2-5                         [1, 512, 8, 8]       --
│    └─Conv2d: 2-6                       [1, 8, 8, 8]         36,872
├─Conv2d: 1-2                            [1, 8, 8, 8]         72
├─Conv2d: 1-3                            [1, 4, 8, 8]         20
├─Decoder: 1-4                           [1, 3, 64, 64]       --
│    └─Conv2d: 2-7                       [1, 512, 8, 8]       18,944
│    └─UNetMidBlock2D: 2-8               [1, 512, 8, 8]       --
│    │    └─ModuleList: 3-10             --                   (recursive)
│    │    └─ModuleList: 3-9              --                   1,051,648
│    │    └─ModuleList: 3-10             --                   (recursive)
│    └─ModuleList: 2-9                   --                   --
│    │    └─UpDecoderBlock2D: 3-11       [1, 512, 16, 16]     16,524,800
│    │    └─UpDecoderBlock2D: 3-12       [1, 512, 32, 32]     16,524,800
│    │    └─UpDecoderBlock2D: 3-13       [1, 256, 64, 64]     4,855,296
│    │    └─UpDecoderBlock2D: 3-14       [1, 128, 64, 64]     1,067,648
│    └─GroupNorm: 2-10                   [1, 128, 64, 64]     256
│    └─SiLU: 2-11                        [1, 128, 64, 64]     --
│    └─Conv2d: 2-12                      [1, 3, 64, 64]       3,459
```
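The parameter counts in the summary can be checked by hand with the usual Conv2d formula (`c_out * c_in * k * k` weights plus `c_out` biases). For example, the encoder's first 3→128 conv, the decoder's final 128→3 conv, and the two standalone 1×1 quant convs (8→8 and 4→4):

```python
def conv2d_params(c_in, c_out, k=3):
    # weight tensor: c_out * c_in * k * k, plus one bias per output channel
    return c_out * c_in * k * k + c_out

print(conv2d_params(3, 128))       # encoder conv_in  -> 3,584
print(conv2d_params(128, 3))       # decoder conv_out -> 3,459
print(conv2d_params(8, 8, k=1))    # quant_conv       -> 72
print(conv2d_params(4, 4, k=1))    # post_quant_conv  -> 20
```

These match the table rows for `Conv2d: 2-1`, `Conv2d: 2-12`, `Conv2d: 1-2`, and `Conv2d: 1-3`.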