
Question on upper bound #1

Open
immars opened this issue Jun 16, 2017 · 2 comments

immars commented Jun 16, 2017

@ShibiHe,
First of all, thanks for this inspiring paper and implementation, great work!

In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.

However, in the implementation, the upper bound is used in the same way as the lower bound, without any dependency (and thus no gradient) w.r.t. the parameters.

This means, for example, that at time step t, in the trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].

In other words, what happened before time step t will have an impact on the value Q[t].
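
To make that concrete, the bound I have in mind is the following (my own notation, with gamma the discount factor and Q* the optimal action value; please correct me if this is not what the code actually computes):

```latex
Q^*(s_{t-2}, a_{t-2}) \;\ge\; r_{t-2} + \gamma\, r_{t-1} + \gamma^{2}\, Q^*(s_t, a_t)
\quad\Longrightarrow\quad
Q^*(s_t, a_t) \;\le\; \gamma^{-2}\bigl(Q^*(s_{t-2}, a_{t-2}) - r_{t-2} - \gamma\, r_{t-1}\bigr)
```

So the constraint applied to Q[t] is a function of Q[t-2], r[t-2], and r[t-1].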

Does that conflict with the definition of discounted future reward, and also with the Markov assumption of the MDP?

Please correct me if I've got anything wrong.

Thanks!

ShibiHe (Owner) commented Jun 23, 2017

Good question. Theoretically, we should only use the upper bounds after Q is sufficiently trained, and we find that the upper bounds stabilize training. In practice, we just use the upper bounds from the beginning for simplicity.
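
For anyone reading later: roughly, the bounds enter the objective as quadratic penalties added to the ordinary TD error. A simplified sketch (illustrative names and penalty weight, not the exact code in this repo):

```python
import numpy as np

def bounded_td_loss(q, td_target, lower_bound, upper_bound, penalty=4.0):
    """TD loss plus quadratic penalties for violating the tightest
    lower/upper bounds computed from the sampled trajectory.
    Sketch only: `penalty` plays the role of the lambda weight."""
    td_error = (q - td_target) ** 2
    # penalize Q dropping below the largest lower bound
    below = np.maximum(lower_bound - q, 0.0) ** 2
    # penalize Q exceeding the smallest upper bound
    above = np.maximum(q - upper_bound, 0.0) ** 2
    return td_error + penalty * (below + above)
```

The bounds themselves are treated as constants (no gradient flows through them), which matches what you observed above.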


immars commented Jul 24, 2017

Thanks for the reply! It does indeed work.
