(Draft) Add DLA function to utils #466
base: dev
Conversation
Thanks for starting on this. It seems useful, and I agree that it should be its own file (probably just for DLA, as it'll become quite large once it's fully documented etc.). I also agree that it's probably worth expanding this a bit to work more generally. Specifically, you can break DLA down recursively, e.g. by attention layer -> attention head -> source layer -> source component... It would be nice to have this as well. Hope that makes sense, and if you are unsure about how to abstract more, I'm happy to have a chat about it!
Hi @alan-cooney, thanks for the comment. I have a couple of questions: I can get the attention head contributions (or even the MLP neurons) with get_full_resid_decomposition(); however, I can get the correct and incorrect directions only for the residual stream with tokens_to_residual_directions(). How can I get the directions for the individual heads (or even neurons)? Also, what do you mean by breaking down by 'source layer' and 'source component'?
@VasilGeorgiev39 Are you still available to wrap this up?
@bryce13950 Yes, I will be available after the 9th of May. What do you think would be the best approach for this?
I am not quite sure. Alan has been pulled away by his full-time job for the last few months. I have reached out to him separately to see if he can clarify the comments on this, but I haven't heard back via Slack. I don't really get what he means by source layer and source component either. Maybe we can start by turning it into its own module, and then see where it can be generalized. I do like your idea of setting it up as a tool, and I am likely going to be doing just that in another context. Do you want to move this into its own module in a directory named tools?
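For reference, here is one possible reading of the first two levels of the breakdown discussed above (attention layer -> attention head), as a toy sketch in plain Python. It assumes no LayerNorm and that a layer's output is exactly the sum of its heads' outputs (bias ignored). The function name `attribute_by_head` and the numbers are hypothetical, and this is not the code in this PR. Under those assumptions each head's output already lives in residual space, so the same logit-difference direction used for the full residual stream can be applied to each head.

```python
# Toy sketch: split one attention layer's logit-diff contribution by head.
# Assumptions: no LayerNorm; layer output == sum of head outputs (bias ignored).
# All names and values here are hypothetical, for illustration only.

def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def attribute_by_head(head_outputs, direction):
    """Return (per_head, layer_total) contributions along `direction`."""
    per_head = [dot(out, direction) for out in head_outputs]
    return per_head, sum(per_head)

# Logit-difference direction in residual space (d_model = 3), i.e. the
# unembedding column for the correct token minus the one for the incorrect token.
direction = [1.0, -1.0, 0.0]

# Outputs of two heads at the final position, already in residual space.
head_outputs = [
    [2.0, 0.0, 1.0],  # head 0
    [0.0, 0.5, 3.0],  # head 1
]
per_head, layer_total = attribute_by_head(head_outputs, direction)

# Projecting the summed layer output gives the same total, because the
# projection is linear: the per-head numbers add up to the layer number.
layer_output = [sum(vals) for vals in zip(*head_outputs)]
assert dot(layer_output, direction) == layer_total
```

Because every level is a linear projection of a sum, the same pattern could in principle be repeated at finer levels. What "source layer" and "source component" should mean is left open here, as in the thread.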
Description
DLA (direct logit attribution) is usually the first step we do in a new exploration. I think it would be nice to have a common function that does it in a single step.
Let me know if you think this does not generalize well enough or if you have other concerns.
I'm not sure utils is the right place for it, though. Maybe we can create a new module that will hold the mech interp toolkit?
If it looks good I'll write tests and stuff.
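To illustrate the kind of single-step function described here, below is a minimal toy sketch in plain Python. It is not the implementation in this PR: it assumes no final LayerNorm and a residual stream that is exactly the sum of the listed component outputs. The names `logit_diff_direction` and `direct_logit_attribution`, the component labels, and all numbers are hypothetical.

```python
# Toy sketch of direct logit attribution (DLA); not the PR's implementation.
# Assumptions: no final LayerNorm, and the residual stream is exactly the
# sum of the component outputs listed below.

def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def logit_diff_direction(W_U, correct, incorrect):
    """Residual-space direction d with resid . d == logit[correct] - logit[incorrect].

    W_U is a d_model x d_vocab unembedding matrix stored as a list of rows.
    """
    return [row[correct] - row[incorrect] for row in W_U]

def direct_logit_attribution(components, W_U, correct, incorrect):
    """Map each component name to its contribution to the logit difference."""
    direction = logit_diff_direction(W_U, correct, incorrect)
    return {name: dot(vec, direction) for name, vec in components.items()}

# d_model = 3, d_vocab = 2 (token 0 = correct answer, token 1 = incorrect answer)
W_U = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
# Hypothetical component outputs at the final position.
components = {
    "embed": [0.0, 1.0, 0.0],
    "L0H0": [2.0, 0.0, 1.0],
    "mlp0": [0.5, 0.5, 0.0],
}
attribution = direct_logit_attribution(components, W_U, correct=0, incorrect=1)
total = sum(attribution.values())
```

In this toy setup the per-component values add up to the full logit difference, which is what makes the attribution a decomposition rather than a heuristic score. With a real model the final LayerNorm would have to be accounted for before projecting.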