$$R_t = sigma(X_t W_(x r) + H_(t - 1) W_(h r) + b_r) in RR^(n times h)$$
$$Z_t = sigma(X_t W_(x z) + H_(t - 1) W_(h z) + b_z) in RR^(n times h)$$
Each of the two gates above has its own three sets of parameters.
Candidate hidden state: the reset gate $R_t$ decides whether the old state is kept. If $R_t$ is close to $0$, the old state is dropped, which biases the unit toward capturing short-term dependencies. This takes another three sets of parameters:
$$tilde(H)_t = tanh(X_t W_(x h) + (R_t dot.circle H_(t - 1)) W_(h h) + b_h)$$
The update gate then decides how much of the candidate hidden state to ignore when producing the actual hidden state. If $Z_t$ is close to $1$, the old state is mostly kept, which helps capture long-term dependencies:
$$H_t = Z_t dot.circle H_(t - 1) + (1 - Z_t) dot.circle tilde(H)_t$$
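A tiny numerical sketch of the two gates (values are made up, not from a trained model): when $R_t$ is near $0$ the candidate state ignores the history, and when $Z_t$ is near $1$ the final state is almost a copy of the old state. For readability the $W_(h h)$ multiplication is treated as the identity here.

```python
import torch

H_prev = torch.tensor([[1.0, -2.0, 0.5]])   # hypothetical old hidden state (1 x 3)
X_part = torch.tensor([[0.3,  0.3, 0.3]])   # stands in for X_t W_xh + b_h

# reset gate near 0: the candidate is driven by the current input only
R = torch.tensor([[0.01, 0.01, 0.01]])
H_tilda = torch.tanh(X_part + R * H_prev)   # ~tanh(X_part), history suppressed

# update gate near 1: the new state stays close to the old state
Z = torch.tensor([[0.99, 0.99, 0.99]])
H_new = Z * H_prev + (1 - Z) * H_tilda
print(H_new)  # ~H_prev, so long-range information is carried forward
```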
Implementation from scratch
```python
import torch

def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xz, W_hz, b_z = three()  # update gate parameters
    W_xr, W_hr, b_r = three()  # reset gate parameters
    W_xh, W_hh, b_h = three()  # candidate hidden state parameters
    # output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # attach gradients
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda
        Y = H @ W_hq + b_q
        outputs.append(Y)
    # [qm] where is the batch dimension here?
    return torch.cat(outputs, dim=0), (H,)
```
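As for the [qm] comment in `gru`: each `X` taken from `inputs` is one time step of the whole minibatch, so `H` keeps shape `(batch_size, num_hiddens)` throughout the loop, and the concatenated output stacks the time steps along dim 0. A quick shape check using the functions above (sizes are arbitrary, and real inputs would be one-hot encoded rather than random):

```python
vocab_size, num_hiddens, batch_size, num_steps = 28, 256, 2, 5
device = torch.device('cpu')

params = get_params(vocab_size, num_hiddens, device)
state = init_gru_state(batch_size, num_hiddens, device)

# one slice per time step, each of shape (batch_size, vocab_size)
inputs = torch.randn(num_steps, batch_size, vocab_size, device=device)
Y, (H,) = gru(inputs, state, params)
print(Y.shape)  # torch.Size([10, 28]) -> (num_steps * batch_size, vocab_size)
print(H.shape)  # torch.Size([2, 256]) -> (batch_size, num_hiddens)
```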
High-level API
```python
num_inputs = vocab_size
gru_layer = nn.GRU(num_inputs, num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```
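For reference, a self-contained sketch of the tensor shapes `nn.GRU` expects and returns with its default `batch_first=False` layout (sizes are arbitrary):

```python
import torch
from torch import nn

gru_layer = nn.GRU(input_size=28, hidden_size=256)
X = torch.randn(5, 2, 28)        # (num_steps, batch_size, num_inputs)
state = torch.zeros(1, 2, 256)   # (num_layers, batch_size, num_hiddens)
Y, state_new = gru_layer(X, state)
print(Y.shape)          # torch.Size([5, 2, 256])  per-step outputs
print(state_new.shape)  # torch.Size([1, 2, 256])  final hidden state
```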
Questions
After training, how does the network decide whether to keep the hidden state? Is it that the learned weights in R and Z make a meaningless input token like "ahh" drive $R_t$ toward a reset? Judging from the formula for $R_t$ above, that seems right: the reset weights are computed from the current input together with the previous hidden state.
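One way to check this empirically would be to feed a prefix through the trained from-scratch parameters and watch the gate activations directly; a rough sketch, assuming `params`, `num_hiddens`, `device`, and the character-level `vocab` from the earlier training code are still in scope (the prefix string is arbitrary):

```python
import torch

W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
H = init_gru_state(1, num_hiddens, device)[0]
with torch.no_grad():
    for ch in 'time traveller ahh':
        # one-hot encode a single character as a batch of size 1
        X = torch.nn.functional.one_hot(
            torch.tensor([vocab[ch]]), len(vocab)).float().to(device)
        Z = torch.sigmoid(X @ W_xz + H @ W_hz + b_z)
        R = torch.sigmoid(X @ W_xr + H @ W_hr + b_r)
        H = Z * H + (1 - Z) * torch.tanh(X @ W_xh + (R * H) @ W_hh + b_h)
        # low mean R -> history being reset; high mean Z -> old state being kept
        print(ch, float(R.mean()), float(Z.mean()))
```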