Grad in SI #14
Comments
Hi, thanks for your interest in the code! You're correct that in the line you refer to,
Actually, thinking a bit more about it, I'm no longer sure how relevant the above paper is to this specific issue. Sorry. I quickly checked the original paper on SI, and my impression is that its description leaves it ambiguous whether they used the gradient of the regularized or the unregularized loss. (But please correct me if you can point to somewhere in that paper where this is specified!)
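To make the ambiguity concrete, the two readings can be written out as follows. This is a sketch in the notation of the SI paper as commonly stated (surrogate loss $\tilde{L}_\mu$, per-parameter importance $\Omega_k^\mu$, reference weights $\tilde{\theta}_k$, strength $c$); the labels "Reading A" and "Reading B" are introduced here for illustration only and do not come from the paper or the thread.

```latex
% Surrogate (regularized) loss for task \mu
\tilde{L}_\mu(\theta) = L_\mu(\theta) + c \sum_k \Omega_k^\mu \,(\tilde{\theta}_k - \theta_k)^2

% Per-parameter path integral accumulated while training on task \mu
\omega_k^\mu = -\int_{t^{\mu-1}}^{t^{\mu}} g_k(\theta(t))\, \theta_k'(t)\, dt

% Reading A: gradient of the data loss only
g_k = \frac{\partial L_\mu}{\partial \theta_k}

% Reading B: gradient of the regularized loss
g_k = \frac{\partial \tilde{L}_\mu}{\partial \theta_k}
    = \frac{\partial L_\mu}{\partial \theta_k} + 2c\,\Omega_k^\mu\,(\theta_k - \tilde{\theta}_k)
```

The two readings coincide on the first task, where $\Omega_k^\mu = 0$, and differ afterwards by the penalty term $2c\,\Omega_k^\mu\,(\theta_k - \tilde{\theta}_k)$.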
Thank you very much for your detailed response. It seems that the Nature paper you referred to did indeed use the regularized loss. I haven't found anywhere in the SI paper that specifies whether they used the regularized loss or the original loss. I tried to read the released code of the SI paper, but frankly I didn't manage to understand it well enough either. That's why I came to read your implementation. Despite that, I noticed that their released code uses the original loss, in the following line. I agree with you that in practice it might not matter too much. As for using the regularized loss, I think it is similar to using larger coefficients for earlier tasks, since the Hessian of the quadratic regularization captures the contributions of the previous tasks, and their contribution also exists in the
Interesting, thanks! When I have some time I might look more into this.
Hi,
I have recently been reading your excellent continual-learning implementation, in particular the part on SI. In the following line of code, you used p.grad, which is the gradient of the regularized loss. However, based on my understanding of SI, the gradient should be computed on the data loss alone, so that it measures how much each weight contributes to the fitting error of the present task. Am I wrong about this, or have I missed something important in your implementation? Thanks in advance for your clarification.
continual-learning/train.py
Line 248 in d281967
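The distinction raised in the question can be illustrated with a toy example. The sketch below is not the repository's code: it uses a single scalar weight with hand-derived gradients, and the function name `si_path_integrals`, the quadratic data loss, and all constants are assumptions made purely for illustration. It accumulates the SI running sum twice, once with the data-loss gradient only and once with the gradient of the full regularized loss (which is what a `.grad` attribute would hold if `backward()` were called on the sum of both terms).

```python
# Toy 1-D illustration (hypothetical, not the repository's implementation).
# Data loss:   L(w) = 0.5 * (w * x - y) ** 2
# SI penalty:  R(w) = c * omega * (w - w_prev) ** 2

def si_path_integrals(w0, w_prev, omega, c=1.0, x=1.0, y=2.0, lr=0.1, steps=50):
    """Run plain gradient descent on L + R and accumulate the running sum
    W = -sum(grad * delta_w) in two ways: with the data-loss gradient only,
    and with the gradient of the full regularized loss."""
    w = w0
    W_data, W_total = 0.0, 0.0
    for _ in range(steps):
        g_data = (w * x - y) * x                # dL/dw
        g_reg = 2.0 * c * omega * (w - w_prev)  # dR/dw
        g_total = g_data + g_reg                # gradient of the regularized loss
        w_new = w - lr * g_total                # the update always follows L + R
        delta = w_new - w                       # parameter change in this step
        W_data += -g_data * delta               # reading A: data loss only
        W_total += -g_total * delta             # reading B: regularized loss
        w = w_new
    return W_data, W_total


if __name__ == "__main__":
    # First task: no importance yet, so both readings coincide.
    print(si_path_integrals(w0=0.0, w_prev=0.0, omega=0.0))
    # Later task: the penalty pulls w back toward w_prev and the sums differ.
    print(si_path_integrals(w0=0.0, w_prev=0.0, omega=1.0))
```

With `omega=0.0` the two sums are identical, since the penalty gradient vanishes. With `omega=1.0` in this particular setup, the data-only sum comes out larger than the regularized one, because the penalty gradient partly cancels the data gradient along the path. Whether the difference matters in a real network is a separate, empirical question.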