How to?

Everything else is the same.

Nov 11, 2024

https://hrl.boyuai.com/chapter/2/策略梯度算法/

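  • The earlier algorithms are value-based and have no explicit policy (the policy is simply to pick the action with the highest action value). The REINFORCE method below is policy-based.

  • $\pi_\theta$ is the policy. It is differentiable everywhere, and the parameters to learn are $\theta$.

  • The objective is to maximize $J(\theta) = \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right]$.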

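  • Let the state visitation distribution be $\nu^{\pi}$ (the vector of state probabilities weighted over infinitely many steps; see Chapter 3). Then:

    • Hint: $(\log f)' = f'/f$
    • [Figure: policy gradient derivation, image not preserved]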
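  • Another proof:

    • https://paddlepedia.readthedocs.io/en/latest/tutorials/reinforcement_learning/policy_gradient.html
    • The first line of this figure is missing one [symbol not preserved].
    • Here $T(s, a)$ is the probability of transitioning to $s$ after executing $a$.
    • [Figure: alternative derivation, image not preserved]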
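
For reference, here is a sketch of the policy gradient theorem in its commonly stated form, using the notation above. The action-value function $Q^{\pi_\theta}(s, a)$ is not defined in this note and is assumed to have its usual meaning. The second line multiplies and divides by $\pi_\theta(a \mid s)$, which is where the hint $(\log f)' = f'/f$ is applied.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &\propto \sum_{s} \nu^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) \\
  &= \sum_{s} \nu^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\,
     \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= \mathbb{E}_{\pi_\theta}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
\end{aligned}
```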

Monte Carlo REINFORCE

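  • Consider estimating the gradient with the Monte Carlo method. For a finite-horizon environment, the expression above gives:

    • Here $T$ is the maximum number of steps. The term in parentheses is $G_t$, written as $\psi_t$ in the second figure below.
    • [Figure: REINFORCE gradient estimate, image not preserved]
  • Training procedure

    • [Figure: training steps, image not preserved]
  • Network structure:

    • Input: differentiable state
    • Output: a categorical probability distribution over the discrete actions
    • Loss function: the gradient update formula above, with the gradient operator removed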
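
As a sketch of what the bullets above refer to, the REINFORCE gradient estimate is usually written as follows, with the discounted return $G_t$ in the inner parentheses. The second line is the per-episode surrogate loss obtained by dropping the gradient operator and negating, treating $G_t$ as a constant. The names $r_{t'}$ (reward at step $t'$) and $L(\theta)$ are assumptions introduced here for illustration.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T}
     \left( \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'} \right)
     \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \\
L(\theta)
  &= -\sum_{t=0}^{T} G_t \, \log \pi_\theta(a_t \mid s_t),
  \qquad G_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}
\end{aligned}
```

Minimizing $L(\theta)$ by gradient descent then moves $\theta$ along the estimated $\nabla_\theta J(\theta)$.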
import gym
import numpy as np
import torch
import torch.nn.functional as F
from tqdm import tqdm


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    # Takes the current state and outputs a probability distribution over
    # the available actions, e.g. [0.3, 0.7].
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)


class REINFORCE:
    def __init__(self, state_dim, hidden_dim, action_dim, learning_rate, gamma,
                 device):
        self.policy_net = PolicyNet(state_dim, hidden_dim,
                                    action_dim).to(device)
        # Elided in the original. The methods below rely on self.optimizer,
        # self.gamma and self.device being set here.
        ...

    def take_action(self, state):  # sample from the action distribution
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.policy_net(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()  # pick one action at random
        return action.item()

    def update(self, transition_dict):
        reward_list = transition_dict['rewards']
        state_list = transition_dict['states']
        action_list = transition_dict['actions']

        G = 0
        self.optimizer.zero_grad()
        for i in reversed(range(len(reward_list))):  # start from the last step
            reward = reward_list[i]
            state = torch.tensor([state_list[i]],
                                 dtype=torch.float).to(self.device)
            action = torch.tensor([action_list[i]]).view(-1, 1).to(self.device)
            # This only extracts the probability of the action that was
            # actually taken during the rollout. That probability is also
            # the only source of gradient in the final loss.
            log_prob = torch.log(self.policy_net(state).gather(1, action))
            G = self.gamma * G + reward
            loss = -log_prob * G  # loss for this step
            loss.backward()  # backpropagate; gradients accumulate over steps
        self.optimizer.step()  # gradient descent


# Everything else is the same.
# Note: env.seed(), the single-value reset() and the 4-tuple step() below are
# the older gym API (before gym 0.26).
learning_rate = 1e-3
num_episodes = 1000
hidden_dim = 128
gamma = 0.98
device = torch.device("cuda") if torch.cuda.is_available() else torch.device(
    "cpu")

env_name = "CartPole-v0"
env = gym.make(env_name)
env.seed(0)
torch.manual_seed(0)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = REINFORCE(state_dim, hidden_dim, action_dim, learning_rate, gamma,
                  device)

return_list = []
for i in range(10):
    with tqdm(total=int(num_episodes / 10), desc='Iteration %d' % i) as pbar:
        for i_episode in range(int(num_episodes / 10)):
            episode_return = 0
            transition_dict = {
                'states': [],
                'actions': [],
                'next_states': [],
                'rewards': [],
                'dones': []
            }
            state = env.reset()
            done = False
            while not done:
                action = agent.take_action(state)
                next_state, reward, done, _ = env.step(action)
                transition_dict['states'].append(state)
                transition_dict['actions'].append(action)
                transition_dict['next_states'].append(next_state)
                transition_dict['rewards'].append(reward)
                transition_dict['dones'].append(done)
                state = next_state
                episode_return += reward
            return_list.append(episode_return)
            agent.update(transition_dict)
            if (i_episode + 1) % 10 == 0:
                pbar.set_postfix({
                    'episode':
                    '%d' % (num_episodes / 10 * i + i_episode + 1),
                    'return':
                    '%.3f' % np.mean(return_list[-10:])
                })
            pbar.update(1)
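
The backward recursion `G = self.gamma * G + reward` used in `update` can be checked in isolation. The following is a minimal sketch in plain Python, not part of the original code. The helper name `discounted_returns` is hypothetical and introduced only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T}] where G_t = r_t + gamma * G_{t+1}.

    Hypothetical helper: it mirrors the backward loop in REINFORCE.update,
    but stores every G_t instead of consuming it immediately.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):  # start from the last step
        g = gamma * g + rewards[t]
        returns[t] = g
    return returns


print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

Iterating from the last step backward means each $G_t$ costs one multiplication and one addition, rather than re-summing the tail of the episode for every $t$.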
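  • Sanity check in the ideal case: if $G_1 = 200$ and $G_2 = -200$, the network learns to make $p_1$ as large as possible ($-\log p_1$ becomes as small as possible, so the loss $-G_1 \log p_1$ is as small as possible) and $p_2$ as small as possible ($-\log p_2$ becomes very large, and multiplied by the negative $G_2$ the loss becomes very negative).
    • The softmax should keep the values from blowing up numerically.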
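
The sign of the update can be illustrated on a two-action softmax policy. This is a sketch in plain Python under simplifying assumptions: the policy is just a pair of logits rather than a network, and the names `softmax` and `reinforce_step` as well as the learning rate are hypothetical. It applies one gradient-descent step to the loss `-log(p[action]) * G`, using the closed-form gradient with respect to the logits.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, G, lr):
    """One gradient-descent step on loss = -log(softmax(logits)[action]) * G."""
    probs = softmax(logits)
    # d(loss)/d(logit_k) = G * (p_k - 1[k == action])
    grads = [G * (p - (1.0 if k == action else 0.0))
             for k, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0]  # both actions start at probability 0.5
after_pos = softmax(reinforce_step(logits, action=0, G=200.0, lr=0.01))
after_neg = softmax(reinforce_step(logits, action=1, G=-200.0, lr=0.01))
print(after_pos[0] > 0.5)  # True: a positive return raises p of the taken action
print(after_neg[1] < 0.5)  # True: a negative return lowers p of the taken action
```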

Article author:Julyfun