9.1.gru

$$R_t^(in n times h) = sigma (X_t W_(x r) + H_(t - 1) W_(h r) + b_r)$$

$$Z_t^(in n times h) = sigma (X_t W_(x z) + H_(t - 1) W_(h z) + b_z)$$

以上都独占三组参数。

候选隐状态，由重置门 $R_t$ 决定是否保留旧状态，若 $R_t$ 接近 $0$ 不保留，则倾向于捕获短期依赖关系。这里又占有三组参数：

$$limits(H_t)^(tilde) = tanh(X_t W_(x h) + (R_t dot.circle H_(t - 1)) W_(h h) + b_h)$$

更新门将决定是否忽略当前隐状态，最终生成真正的隐状态。若 $Z_t$ 接近 $1$ 则倾向于保留旧状态，捕获长期依赖关系：

$$H_t = Z_t dot.circle H_(t - 1) + (1 - Z_t) dot.circle limits(H_t)^(tilde)$$

从头实现

def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device)*0.01

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xz, W_hz, b_z = three()  # 更新门参数
    W_xr, W_hr, b_r = three()  # 重置门参数
    W_xh, W_hh, b_h = three()  # 候选隐状态参数
    # 输出层参数
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # 附加梯度
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda
        Y = H @ W_hq + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)

高级 api 接口

num_inputs = vocab_size
gru_layer = nn.GRU(num_inputs, num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)

问题

训练好后的网络如何决定是否保留隐藏状态？是因为 R 和 Z 中的权重分布，导致输入 ahh 的无效单词就会导致 $R_t$ 重置？根据上面的 R 公式，似乎就是依赖当前输入和隐状态来决定生成重置权重。