
9.1.gru

Aug 19, 2024

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) \in \mathbb{R}^{n \times h}$$

$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z) \in \mathbb{R}^{n \times h}$$

Each of the two gates above owns its own set of three parameters ($W_{x\cdot}$, $W_{h\cdot}$, $b_\cdot$).
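A quick shape check of a gate with random tensors (the sizes `n`, `d`, `h` below are illustrative, not from the text):

```python
import torch

n, d, h = 2, 8, 4  # batch size, input dim, hidden dim (illustrative)
X_t, H_prev = torch.randn(n, d), torch.randn(n, h)
W_xr, W_hr, b_r = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

# the sigmoid keeps every gate entry strictly in (0, 1)
R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)
print(R_t.shape, R_t.min().item() > 0, R_t.max().item() < 1)
# torch.Size([2, 4]) True True
```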

Candidate hidden state: the reset gate $R_t$ decides whether the old state is kept. When $R_t$ is close to $0$ the old state is dropped, which biases the unit toward capturing short-term dependencies. This takes another three sets of parameters:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$

The update gate then decides how much of the candidate state to take in when producing the actual hidden state. When $Z_t$ is close to $1$, the old state is kept, which favors capturing long-term dependencies:

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$
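This last formula is just an elementwise interpolation between the old state and the candidate; a tiny numeric check (values made up) shows the two extremes:

```python
import torch

H_prev, H_tilde = torch.tensor([1.0, -1.0]), torch.tensor([0.0, 0.5])
for z in (0.99, 0.01):
    Z = torch.full_like(H_prev, z)
    print(z, Z * H_prev + (1 - Z) * H_tilde)
# 0.99 tensor([ 0.9900, -0.9850])  -> old state kept almost unchanged
# 0.01 tensor([0.0100, 0.4850])    -> candidate state adopted
```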

Implementation from scratch

```python
import torch

def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xz, W_hz, b_z = three()  # update gate parameters
    W_xr, W_hr, b_r = three()  # reset gate parameters
    W_xh, W_hh, b_h = three()  # candidate hidden state parameters
    # output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # attach gradients
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda
        Y = H @ W_hq + b_q
        outputs.append(Y)
    # [qm] where is the batch here? (each Y is (batch_size, vocab_size);
    # concatenating along dim=0 gives (num_steps * batch_size, vocab_size))
    return torch.cat(outputs, dim=0), (H,)
```
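A hedged smoke test of the three functions above (the sizes are illustrative; in the book `inputs` would be one-hot encoded, shaped `(num_steps, batch_size, vocab_size)`):

```python
vocab_size, num_hiddens, batch_size, num_steps = 28, 16, 2, 5
device = torch.device('cpu')
params = get_params(vocab_size, num_hiddens, device)
state = init_gru_state(batch_size, num_hiddens, device)
inputs = torch.randn(num_steps, batch_size, vocab_size, device=device)
Y, (H,) = gru(inputs, state, params)
print(Y.shape, H.shape)  # torch.Size([10, 28]) torch.Size([2, 16])
```

This also answers the `[qm]` note: the batch dimension survives in each step's output, and the time steps are stacked along dim 0.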

High-level API

```python
from torch import nn
from d2l import torch as d2l

num_inputs = vocab_size
gru_layer = nn.GRU(num_inputs, num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```
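After training, generation reuses the chapter 8 helper (a sketch assuming the book's `d2l.predict_ch8` and the time-machine `vocab` are in scope):

```python
d2l.predict_ch8('time traveller', 10, model, vocab, device)
```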

Questions

How does the trained network decide whether to keep the hidden state? Is it that the learned weight distributions inside R and Z are such that feeding a meaningless token like "ahh" drives $R_t$ toward a reset? Judging from the formula for R above, the reset weights do seem to be generated from the current input together with the previous hidden state.
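One hedged way to probe this empirically: run trained from-scratch `params` over a short string and print the mean gate activations per token (a sketch; `params`, `num_hiddens`, `vocab`, and `device` are assumed to come from the code above):

```python
import torch.nn.functional as F

W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
H = torch.zeros(1, num_hiddens, device=device)
for ch in 'ahh':
    X = F.one_hot(torch.tensor([vocab[ch]], device=device),
                  len(vocab)).float()
    R = torch.sigmoid(X @ W_xr + H @ W_hr + b_r)  # reset gate for this token
    Z = torch.sigmoid(X @ W_xz + H @ W_hz + b_z)  # update gate for this token
    H = Z * H + (1 - Z) * torch.tanh(X @ W_xh + (R * H) @ W_hh + b_h)
    print(ch, 'mean R =', round(R.mean().item(), 3),
          'mean Z =', round(Z.mean().item(), 3))
```

If the intuition in the question is right, a nonsense token should show a noticeably lower mean $R$ than tokens that fit the context.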
