@ShibiHe ,
First of all, thanks for this inspiring paper and implementation, great work!
In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.
However, in the implementation, the upper bound is used the same way as the lower bound: as a fixed target with no dependency (and thus no gradient) w.r.t. the parameters.
This means, for example, that at time step t, in the trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].
In other words, what happened before time step t will have an impact on the value Q[t].
Does that conflict with the definition of discounted future reward and with the MDP assumption?
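To make the concern concrete, here is a minimal sketch (hypothetical names, not the repository's actual code) of how I understand the upper bound being computed from earlier transitions and then used as a fixed, no-gradient target:

```python
gamma = 0.99

def upper_bound(q_past, rewards_between, k):
    """Upper bound on Q(s[t], a[t]) obtained by index substitution:
    Q(s[t-k-1], a[t-k-1]) >= sum_i gamma^i * r[t-k-1+i] + gamma^(k+1) * Q(s[t], a[t]),
    rearranged for Q(s[t], a[t])."""
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards_between))
    return (q_past - discounted) / gamma ** (k + 1)

# Example with k = 1: the bound for Q[t] built from the transition at t-2.
q_tm2 = 1.5                        # Q(s[t-2], a[t-2]), taken from the frozen target network
r_tm2, r_tm1 = 0.0, 0.0            # rewards observed before time step t
u = upper_bound(q_tm2, [r_tm2, r_tm1], k=1)

# The bound enters the penalized objective as a constant, so the gradient only flows
# through Q[t]; whenever Q[t] exceeds the bound, the penalty pushes Q[t] down.
q_t = 2.0                          # current estimate Q(s[t], a[t])
penalty = max(0.0, q_t - u) ** 2   # quadratic penalty, active only when the bound is violated
print(u, penalty)
```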
Please correct me if anything is wrong.
Thanks!
Good question. Theoretically, we should only use the upper bounds after Q is sufficiently trained, and we find that the upper bounds stabilize training. In practice, we just use the upper bounds from the beginning for simplicity.
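One hypothetical way to implement that schedule (not part of the released code, and the step counts below are made-up defaults) would be to gate the upper-bound penalty behind a warm-up period and then ramp its weight up:

```python
def upper_bound_weight(step, warmup_steps=500_000, ramp_steps=500_000):
    """Coefficient on the upper-bound penalty at a given training step:
    zero during warm-up (plain Bellman loss only), then ramped linearly to 1."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)
```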