@ShibiHe ,
First of all, thanks for this inspiring paper and implementation, great work!
In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.
However, in the implementation, the upper bound is used the same way as the lower bound: as a fixed target with no dependency (and thus no gradient) w.r.t. the parameters.
This means, for example, that at time step t, in the trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].
In other words, what happened before time step t will have an impact on the value Q[t].
Does that conflict with the definition of discounted future reward and with the MDP assumption?
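To make the concern concrete, here is a minimal sketch (hypothetical names, not the repository's actual code) of how I understand the upper bound being computed from earlier transitions and then used as a fixed, no-gradient target:

```python
gamma = 0.99

def upper_bound(q_past, rewards_between, k):
    """Upper bound on Q(s[t], a[t]) obtained by index substitution:
    Q(s[t-k-1], a[t-k-1]) >= sum_i gamma^i * r[t-k-1+i] + gamma^(k+1) * Q(s[t], a[t]),
    rearranged for Q(s[t], a[t])."""
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards_between))
    return (q_past - discounted) / gamma ** (k + 1)

# Example with k = 1: the bound for Q[t] built from the transition at t-2.
q_tm2 = 1.5                        # Q(s[t-2], a[t-2]), taken from the frozen target network
r_tm2, r_tm1 = 0.0, 0.0            # rewards observed before time step t
u = upper_bound(q_tm2, [r_tm2, r_tm1], k=1)

# The bound enters the penalized objective as a constant, so the gradient only flows
# through Q[t]; whenever Q[t] exceeds the bound, the penalty pushes Q[t] down.
q_t = 2.0                          # current estimate Q(s[t], a[t])
penalty = max(0.0, q_t - u) ** 2   # quadratic penalty, active only when the bound is violated
print(u, penalty)
```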
Please correct me if anything is wrong.
Thanks!
Good question. Theoretically, we should only use the upper bounds after Q is sufficiently trained, and we find that the upper bounds stabilize training. In practice, we just use the upper bounds from the beginning for simplicity.
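One hypothetical way to implement that schedule (not part of the released code, and the step counts below are made-up defaults) would be to gate the upper-bound penalty behind a warm-up period and then ramp its weight up:

```python
def upper_bound_weight(step, warmup_steps=500_000, ramp_steps=500_000):
    """Coefficient on the upper-bound penalty at a given training step:
    zero during warm-up (plain Bellman loss only), then ramped linearly to 1."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)
```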