Posts with tag diffusion-models-class

unit3-01_stable_diffusion_introduction

2025-06-20
diffusion-models-class · julyfun · notes · 技术学习 (tech study)

What is the Stable Diffusion pipeline responsible for? Put plainly, what the unet predicts lives in the VAE's latent space: it predicts the noise on a latent, not on pixels.

pipe:
    unet
    vae
    text_encoder
    image_encoder
    feature_extractor
    tokenizer (is a nn.Module without params)
    scheduler
    safety_checker ?

    graph
    text --> t(tokenizer & text_encoder) --> e[text_embedding]
    e --> unet(unet)
    n["noisy_latents (4, 64, 64)"] --> unet
    timestep --> unet
    unet --> p["noise prediction (4, 64, 64)"]
    p --> v_d(vae decoder)

Unet architecture

    Layer (type:depth-idx)                 Output Shape        Param #
    AutoencoderKL                          [1, 3, 64, 64]      --
    ├─Encoder: 1-1                         [1, 8, 8, 8]        --
    │    └─Conv2d: 2-1                     [1, 128, 64, 64]    3,584
    │    └─ModuleList: 2-2                 --                  --
    │    │    └─DownEncoderBlock2D: 3-1    [1, 128, 32, 32]    738,944
    │    │    └─DownEncoderBlock2D: 3-2    [1, 256, 16, 16]    2,690,304
    │    │    └─DownEncoderBlock2D: 3-3    [1, 512, 8, 8]      10,754,560
    │    │    └─DownEncoderBlock2D: 3-4    [1, 512, 8, 8]      9,443,328
    │    └─UNetMidBlock2D: 2-3             [1, 512, 8, 8]      --
    │    │    └─ModuleList: 3-7            --                  (recursive)
    │    │    └─ModuleList: 3-6            --                  1,051,648
    │    │    └─ModuleList: 3-7            --                  (recursive)
    │    └─GroupNorm: 2-4                  [1, 512, 8, 8]      1,024
    │    └─SiLU: 2-5                       [1, 512, 8, 8]      --
    │    └─Conv2d: 2-6                     [1, 8, 8, 8]        36,872
    ├─Conv2d: 1-2                          [1, 8, 8, 8]        72
    ├─Conv2d: 1-3                          [1, 4, 8, 8]        20
    ├─Decoder: 1-4                         [1, 3, 64, 64]      --
    │    └─Conv2d: 2-7                     [1, 512, 8, 8]      18,944
    │    └─UNetMidBlock2D: 2-8             [1, 512, 8, 8]      --
    │    │    └─ModuleList: 3-10           --                  (recursive)
    │    │    └─ModuleList: 3-9            --                  1,051,648
    │    │    └─ModuleList: 3-10           --                  (recursive)
    │    └─ModuleList: 2-9                 --                  --
    │    │    └─UpDecoderBlock2D: 3-11     [1, 512, 16, 16]    16,524,800
    │    │    └─UpDecoderBlock2D: 3-12     [1, 512, 32, 32]    16,524,800
    │    │    └─UpDecoderBlock2D: 3-13     [1, 256, 64, 64]    4,855,296
    │    │    └─UpDecoderBlock2D: 3-14     [1, 128, 64, 64]    1,067,648
    │    └─GroupNorm: 2-10                 [1, 128, 64, 64]    256
    │    └─SiLU: 2-11                      [1, 128, 64, 64]    --
    │    └─Conv2d: 2-12                    [1, 3, 64, 64]      3,45
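The shapes and parameter counts in the AutoencoderKL summary above can be sanity-checked by hand. The sketch below is plain Python. The helper names `conv2d_params` and `latent_shape` are hypothetical, and the kernel sizes (3x3 for Conv2d 2-1, 2-6 and 2-7, 1x1 for the standalone Conv2d 1-2 and 1-3) are assumptions, chosen because they reproduce the listed counts. It also assumes that three of the four down blocks halve the resolution, which is what the 64 -> 32 -> 16 -> 8 shapes suggest.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel_size: int, bias: bool = True) -> int:
    """Parameter count of a square-kernel Conv2d: weights plus optional bias."""
    return in_ch * out_ch * kernel_size * kernel_size + (out_ch if bias else 0)


def latent_shape(height: int, width: int, latent_channels: int = 4, n_downsamples: int = 3):
    """Latent shape (C, H, W) for an image, assuming each downsample halves H and W."""
    factor = 2 ** n_downsamples
    return (latent_channels, height // factor, width // factor)


# Rows from the summary, keyed by depth index (channel and kernel sizes are assumed).
checks = {
    "Conv2d: 2-1 (3 -> 128, 3x3)": (conv2d_params(3, 128, 3), 3_584),
    "Conv2d: 2-6 (512 -> 8, 3x3)": (conv2d_params(512, 8, 3), 36_872),
    "Conv2d: 1-2 (8 -> 8, 1x1)": (conv2d_params(8, 8, 1), 72),
    "Conv2d: 1-3 (4 -> 4, 1x1)": (conv2d_params(4, 4, 1), 20),
    "Conv2d: 2-7 (4 -> 512, 3x3)": (conv2d_params(4, 512, 3), 18_944),
}
for name, (computed, listed) in checks.items():
    print(f"{name}: computed={computed}, listed={listed}, match={computed == listed}")

print(latent_shape(64, 64))    # the 64x64 input used in the summary
print(latent_shape(512, 512))  # a 512x512 image
```

Under these assumptions a 64x64 image maps to a (4, 8, 8) latent, which matches the output shape of Conv2d: 1-3, and a 512x512 image maps to (4, 64, 64), the latent shape shown in the graph.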

unit2-02_class_conditioned_diffusion_model_example

2025-06-10
diffusion-models-class · julyfun · notes · 技术学习 (tech study)

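Class-conditioned means conditioned on the class, or more precisely class-label-conditioned.

How does the network input change? It is really just a concat. The Unet's input channels are simply changed to in_channels=1 + class_emb_size:

    UNet2DModel(in_channels=1 + class_emb_size, ...)

In forward, broadcast the class embedding and then torch.cat it onto the input:

    def forward(self, x, t, class_labels):
        bs, ch, w, h = x.shape
        # self.class_emb = nn.Embedding(num_classes, class_emb_size)
        class_cond = self.class_emb(class_labels)  # (bs, class_emb_size)
        # Broadcast the embedding over every spatial position.
        class_cond = class_cond.view(bs, class_cond.shape[1], 1, 1).expand(bs, class_cond.shape[1], w, h)
        net_input = torch.cat((x, class_cond), dim=1)  # (bs, 1 + class_emb_size, w, h)
        # self.model returns a ModelOutput.
        # sample: the predicted noise tensor.
        # additional_residuals: stores extra residual information; usually unused.
        return self.model(net_input, t).sample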
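The broadcast-and-concat step can be tried on its own, without a UNet. The following is a minimal sketch assuming PyTorch is installed; the values `num_classes=10`, `class_emb_size=4`, the batch size and the 28x28 single-channel images are made up for illustration.

```python
import torch
from torch import nn

num_classes, class_emb_size = 10, 4          # illustrative values
class_emb = nn.Embedding(num_classes, class_emb_size)

x = torch.randn(8, 1, 28, 28)                # batch of single-channel images
class_labels = torch.randint(0, num_classes, (8,))

bs, ch, w, h = x.shape
class_cond = class_emb(class_labels)         # (bs, class_emb_size)
# Reshape to (bs, class_emb_size, 1, 1), then expand over the spatial dims.
class_cond = class_cond.view(bs, class_emb_size, 1, 1).expand(bs, class_emb_size, w, h)

# Concatenate along the channel dimension: 1 image channel + class_emb_size.
net_input = torch.cat((x, class_cond), dim=1)
print(net_input.shape)
```

Every spatial position of the extra channels carries the same embedding vector for a given sample, which is why the UNet only needs its `in_channels` enlarged and nothing else changed.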

unit2-01_finetuning_and_guidance

2025-06-10
diffusion-models-class · julyfun · notes · 技术学习 (tech study)

Generating process:

    import torch
    from tqdm.auto import tqdm

    # scheduler, image_pipe and device are assumed to be set up beforehand.
    x = torch.randn(4, 3, 256, 256).to(device)
    for i, t in tqdm(enumerate(scheduler.timesteps)):
        model_input = scheduler.scale_model_input(x, t)
        with torch.no_grad():
            noise_pred = image_pipe.unet(model_input, t)["sample"]
        x = scheduler.step(noise_pred, t, sample=x).prev_sample

Guidance

    x = torch.randn(4, 3, 256, 256).to(device)
    for i, t in tqdm(enumerate(scheduler.timesteps)):
        x = x.detach().requires_grad_()
        model_input = scheduler.scale_model_input(x, t)
        noise_pred = image_pipe.unet(model_input, t)["sample"]
        x0 = scheduler.step(noise_pred, t, x).pred_original_sample
        loss = <custom_loss>(x0) * <guidance_loss_scale>
        cond_grad = -torch.autograd.grad(loss, x)[0]
        x = x.detach() + cond_grad
        x = scheduler.step(noise_pred, t, x).prev_sample

CLIP Guidance

    with torch.no_grad():
        text_features = clip_model.encode_text(text)

    for i, t in tqdm(enumerate(scheduler.timesteps)):
        # print(i, t)  # (1, tensor(1000)), (2, tensor(980))...
        model_input = scheduler.scale_model_input(x, t)  # DDIM loaded
        with torch.no_grad():
            # image_pipe is loaded by the same name
            noise_pred = image_pipe.unet(model_input, t)["sample"]
        cond_grad = 0
        for cut in range(n_cuts):
            x = x.detach().requires_grad_()
            x0 = scheduler.step(noise_pred, t, sample=x).pred_original_sample
            loss = <clip_loss>(x0, text_features) * guidance_scale
            cond_grad -= torch.autograd.grad(loss, x)[0] / n_cuts
        if i % 25 == 0:
            print(f"Steps {i} loss: {loss.item()}")
        alpha_bar = scheduler.alphas_cumprod[i]
        # `alpha_bar` here is decreasing and works for textures.
        # Can be changed to some increasing coefficients!
        x = x.detach() + cond_grad * alpha_bar.sqrt()
        x = scheduler.step(noise_pred, t, x).prev_sample
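The `<custom_loss>` placeholder in the guidance loop can be any differentiable function of the predicted clean image. As a minimal sketch of a single guidance step, assuming PyTorch is installed, the example below uses a hypothetical `color_loss` that pulls pixels toward a target color. To stay self-contained it replaces the UNet with random noise and replaces `scheduler.step(...).pred_original_sample` with the closed-form estimate for an epsilon-prediction model, x0 = (x_t - sqrt(1 - alpha_bar) * eps) / sqrt(alpha_bar). The tensor sizes, `alpha_bar`, the target color and `guidance_loss_scale` are all made-up values.

```python
import torch


def color_loss(images: torch.Tensor, target_color=(0.1, 0.9, 0.5)) -> torch.Tensor:
    """Mean absolute difference between each pixel and a target RGB color.

    Assumes images of shape (batch, 3, H, W) with values in [-1, 1];
    target_color is given in [0, 1] and rescaled to match.
    """
    target = torch.tensor(target_color, dtype=images.dtype, device=images.device) * 2 - 1
    target = target[None, :, None, None]  # (1, 3, 1, 1), broadcasts over batch and pixels
    return (images - target).abs().mean()


torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)        # toy noisy sample
noise_pred = torch.randn_like(x)   # stand-in for image_pipe.unet(...)["sample"]
alpha_bar = torch.tensor(0.5)      # made-up cumulative alpha for this step
guidance_loss_scale = 40.0         # made-up scale


def predict_x0(x_t: torch.Tensor) -> torch.Tensor:
    """Closed-form x0 estimate for an epsilon-prediction model (stand-in for scheduler.step)."""
    return (x_t - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()


# One guidance step: differentiate the loss on x0 with respect to x.
x = x.detach().requires_grad_()
loss = color_loss(predict_x0(x)) * guidance_loss_scale
cond_grad = -torch.autograd.grad(loss, x)[0]  # direction that lowers the loss
x_guided = x.detach() + cond_grad

loss_before = color_loss(predict_x0(x.detach())).item()
loss_after = color_loss(predict_x0(x_guided)).item()
print(f"color loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In the real loop the nudged `x` is then passed to `scheduler.step(...)` to obtain `prev_sample`, so the guidance gradient biases each denoising step rather than replacing it.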
