How to?

285 hw1

Jun 18, 2026
Technical Learning, 285

1.1 Given

[image] show: [image]

1.2


  • holy!
  • The off-policy policy gradient: this figure is simple and easy to follow: [image]

2 Editing Code

How to visualize: tensorboard --logdir=/home/julyfun/Documents/GitHub/homework_fall2023/hw1/data/q1_bc_ant_HalfCheetah-v4_23-11-2025_00-00-06/
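Typing the timestamped run folder by hand is error-prone, so here is a minimal sketch that picks the most recently modified run folder and builds the same tensorboard command. The helper names and the relative `hw1/data` location are assumptions for illustration, not part of the homework code.

```python
import shlex
from pathlib import Path


def latest_run_dir(data_dir):
    """Return the most recently modified sub-directory of data_dir, or None."""
    root = Path(data_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda p: p.stat().st_mtime)


def tensorboard_command(logdir):
    """Build the shell command; shlex.quote protects paths that contain spaces."""
    return "tensorboard --logdir=" + shlex.quote(str(logdir))


if __name__ == "__main__":
    # "hw1/data" is an assumed location of the run folders; adjust as needed.
    run = latest_run_dir("hw1/data")
    if run is not None:
        print(tensorboard_command(run))
```

Quoting the path matters here because a stray space in the folder name would otherwise split the argument in the shell.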

  • Ant-v4:
Eval_AverageReturn : 4795.3828125
Train_AverageReturn : 4681.891673935816
Training Loss : 0.0011749982368201017
  • Walker-2d (2D physics only)
Eval_AverageReturn : 998.958740234375
Train_AverageReturn : 5383.310325177668
Training Loss : 0.01655399613082409

It does not walk well, but it manages to lurch forward a little.

  • HalfCheetah
Eval_AverageReturn : 4070.5146484375
Train_AverageReturn : 4034.7999834965067
Training Loss : 0.004244370386004448

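  • Hopper
Eval_AverageReturn : 1542.371826171875
Train_AverageReturn : 3717.5129936182307
Training Loss : 0.010719751007854939

In the video it hops reasonably well. [image]

For this Hopper task, on batch size: I originally wanted to run the experiment with N trajectories, but the expert data only contains a certain number of s-a pairs, so that was not possible. [image]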
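To compare the four behavior cloning runs at a glance, the reported returns can be expressed as a fraction of the expert's return. The sketch below assumes that the first-iteration Train_AverageReturn is the return of the expert data; that reading, and the helper names, are assumptions rather than something the logs state.

```python
# (Eval_AverageReturn, Train_AverageReturn) copied from the runs above.
# Assumption: the first-iteration Train_AverageReturn is the expert's return.
BC_RESULTS = {
    "Ant-v4": (4795.3828125, 4681.891673935816),
    "Walker-2d": (998.958740234375, 5383.310325177668),
    "HalfCheetah": (4070.5146484375, 4034.7999834965067),
    "Hopper": (1542.371826171875, 3717.5129936182307),
}


def fraction_of_expert(eval_return, expert_return):
    """Ratio of the cloned policy's return to the expert's return."""
    return eval_return / expert_return


def summarize(results):
    """Map each environment name to its fraction of expert return."""
    return {env: fraction_of_expert(e, t) for env, (e, t) in results.items()}


if __name__ == "__main__":
    for env, frac in summarize(BC_RESULTS).items():
        print(f"{env:12s} {frac:6.1%} of expert return")
```

Under that assumption, Ant-v4 and HalfCheetah roughly match the expert, while Walker-2d and Hopper fall well short, which agrees with the qualitative notes on the videos.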

3 DAgger

  • Walker-2d
  • The batch_size parameter (default=1000) does not need tuning here, because training only takes the first train_batch_size=100 samples, so the comparison is fair.
Eval_AverageReturn : 998.958740234375
Eval_StdReturn : 488.8404541015625
Train_AverageReturn : 5383.310325177668
Train_StdReturn : 54.15251563871789
Training Loss : 0.01655399613082409

********** Iteration 1 ************
Eval_AverageReturn : 5408.2431640625
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 1130.504150390625
Train_StdReturn : 583.0662231445312
Training Loss : 0.022556820884346962

********** Iteration 2 ************
Eval_AverageReturn : 5389.76708984375
Eval_StdReturn : 0.0
Train_AverageReturn : 5411.14599609375
Train_StdReturn : 0.0
Training Loss : 0.012993226759135723
  • When running experiments, change the default command as little as possible.

  • Walker2d DAgger: [image]

  • Forcing no new data collection: if itr == 0: => if itr == 0 or True: (I later changed other code so that no_dagger can run for multiple iterations). [image]

  • no-dagger Step 0: the episode terminates. [image]

  • DAgger Step 9: [image]

  • Hopper DAgger: [image]

  • Hopper no-dagger: [image]

  • DAgger Step 9: [image]

  • To ask later: which task does each name correspond to?

  • When DAgger fails, how is the expert data obtained?
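The DAgger versus no-DAgger comparison above can be sketched as one loop with a switch, instead of editing the condition by hand. This is a toy illustration under stated assumptions: the 1-D system, `expert_action`, `fit_linear`, and the way the `do_dagger` flag is used here are all hypothetical and are not the homework code.

```python
import random


def expert_action(s):
    """Hypothetical expert: push the state toward zero, saturating at +/-1."""
    return -max(-1.0, min(1.0, s))


def rollout(policy, rng, steps=50):
    """Run a policy in the toy system s' = s + a + noise; return visited states."""
    s = rng.uniform(-3.0, 3.0)
    states = []
    for _ in range(steps):
        states.append(s)
        s = s + policy(s) + rng.gauss(0.0, 0.1)
    return states


def fit_linear(data):
    """Least-squares fit of a = w * s over (s, a) pairs."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den if den > 0 else 0.0


def train(n_iter=5, do_dagger=True, seed=0):
    """Return the fitted weight and the size of the aggregated dataset."""
    rng = random.Random(seed)
    # Iteration 0 data: states visited by the expert, labeled by the expert.
    expert_data = [(s, expert_action(s)) for s in rollout(expert_action, rng)]
    dataset, w = [], 0.0
    for itr in range(n_iter):
        if itr == 0:
            new_data = expert_data
        elif do_dagger:
            # DAgger: visit states with the current learner, relabel with the expert.
            states = rollout(lambda s: w * s, rng)
            new_data = [(s, expert_action(s)) for s in states]
        else:
            new_data = []  # no-DAgger: keep training on the expert data only
        dataset.extend(new_data)  # aggregate; old data is never discarded
        w = fit_linear(dataset)
    return w, len(dataset)


if __name__ == "__main__":
    print("DAgger:   ", train(do_dagger=True))
    print("no-DAgger:", train(do_dagger=False))
```

The point of the sketch is the data flow: with the flag on, each iteration adds learner-visited states labeled by the expert, and with it off the dataset stays at the initial expert data while training still runs for every iteration.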

Article author:Julyfun