9.1.gru
$$R_t^(in n times h) = sigma (X_t W_(x r) + H_(t - 1) W_(h r) + b_r)$$
$$Z_t^(in n times h) = sigma (X_t W_(x z) + H_(t - 1) W_(h z) + b_z)$$
以上都独占三组参数。
候选隐状态,由重置门 $R_t$ 决定是否保留旧状态,若 $R_t$ 接近 $0$ 不保留,则倾向于捕获短期依赖关系。这里又占有三组参数:
$$limits(H_t)^(tilde) = tanh(X_t W_(x h) + (R_t dot.circle H_(t - 1)) W_(h h) + b_h)$$
更新门将决定是否忽略当前隐状态,最终生成真正的隐状态。若 $Z_t$ 接近 $1$ 则倾向于保留旧状态,捕获长期依赖关系:
$$H_t = Z_t dot.circle H_(t - 1) + (1 - Z_t) dot.circle limits(H_t)^(tilde)$$
从头实现
def get_params(vocab_size, num_hiddens, device):
num_inputs = num_outputs = vocab_size
def normal(shape):
return torch.randn(size=shape, device=device)*0.01
def three():
return (normal((num_inputs, num_hiddens)),
normal((num_hiddens, num_hiddens)),
torch.zeros(num_hiddens, device=device))
W_xz, W_hz, b_z = three() # 更新门参数
W_xr, W_hr, b_r = three() # 重置门参数
W_xh, W_hh, b_h = three() # 候选隐状态参数
# 输出层参数
W_hq = normal((num_hiddens, num_outputs))
b_q = torch.zeros(num_outputs, device=device)
# 附加梯度
params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
for param in params:
param.requires_grad_(True)
return params
def init_gru_state(batch_size, num_hiddens, device):
return (torch.zeros((batch_size, num_hiddens), device=device), )
def gru(inputs, state, params):
W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
H, = state
outputs = []
for X in inputs:
Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)
R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)
H = Z * H + (1 - Z) * H_tilda
Y = H @ W_hq + b_q
outputs.append(Y)
return torch.cat(outputs, dim=0), (H,)
高级 api 接口
num_inputs = vocab_size
gru_layer = nn.GRU(num_inputs, num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
问题
训练好后的网络如何决定是否保留隐藏状态?是因为 R 和 Z 中的权重分布,导致输入 ahh
的无效单词就会导致 $R_t$ 重置?根据上面的 R 公式,似乎就是依赖当前输入和隐状态来决定生成重置权重。