Allow N to be encoded with 0.25 and add allow_N to ersatz functions #25

Open
lauradmartens opened this issue Nov 7, 2024 · 6 comments

Comments

@lauradmartens

Hi Jacob,

Thanks a lot for tangermeme, we've been using it a lot!

When running FIMO on our sequences, I ran into problems whenever a sequence contained N, which in our case is encoded as [0.25, 0.25, 0.25, 0.25]. This raises an error in the _validate_input function, which only accepts [0, 0, 0, 0] for unknown characters.

I also think an allow_N option should be added to the ersatz functions. I made a prototype for substitute; could you check whether that is how you would do it too? If so, I can add it to the other functions.
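
To make the mismatch concrete, here is a minimal pure-Python sketch. It is not tangermeme's actual `_validate_input`: the helper name `strict_one_hot_ok` and its handling of `allow_N` are assumptions for illustration, and sequences are written as a list of positions, each a 4-element list over A, C, G, T (a simplified layout chosen for readability).

```python
def strict_one_hot_ok(seq, allow_N=False):
    """Return True if every position is one-hot, or all-zero when allow_N is set.

    Hypothetical stand-in for a validator; not tangermeme's implementation.
    """
    for column in seq:
        if len(column) != 4:
            return False
        # A one-hot style encoding only ever contains the values 0 and 1.
        if any(v not in (0, 1) for v in column):
            return False
        total = sum(column)
        if total == 1:
            continue  # exactly one nucleotide is set
        if total == 0 and allow_N:
            continue  # all-zero column, read as an unknown character
        return False
    return True


A = [1, 0, 0, 0]
N_ZERO = [0, 0, 0, 0]
N_UNIFORM = [0.25, 0.25, 0.25, 0.25]

print(strict_one_hot_ok([A, N_ZERO], allow_N=True))     # all-zero N passes
print(strict_one_hot_ok([A, N_UNIFORM], allow_N=True))  # uniform N still fails
```

Under these assumptions, turning on `allow_N` helps only with the all-zero convention; a uniform 0.25 column fails the "values are 0 or 1" test before the N check is reached.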

@jmschrei
Owner

jmschrei commented Nov 9, 2024

Glad to hear it's useful to you!

It sounds like the immediate need is more like a "validate_input" parameter which can be set to "False". That way FIMO can run on any input without validating that it's correct if you know the reason why it's failing and are okay with it.

I think that the allow_N parameter is ambiguous because, in my view, N doesn't mean [0.25, 0.25, 0.25, 0.25], it means [0, 0, 0, 0]. One might also argue that, if it's non-zero, it should match the genomic background rather than uniform probabilities. So... I'm a bit wary in general about the allow_N terminology. I might actually change it so that you specify which vector you want N to be, and if you don't specify one, Ns are not accepted. But I need to think more about this.
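
One way to picture the "specify the vector you want N to be" idea is the sketch below. It is only an illustration under stated assumptions: `validate_columns`, its `n_vector` parameter, and the tolerance `tol` are hypothetical names, not part of tangermeme, and sequences are again plain lists of 4-element positions.

```python
import math


def validate_columns(seq, n_vector=None, tol=1e-8):
    """Raise ValueError unless each position is one-hot or equals n_vector.

    n_vector=None means Ns are not accepted at all (hypothetical behaviour).
    """
    for i, column in enumerate(seq):
        is_one_hot = (
            len(column) == 4
            and all(v in (0, 1) for v in column)
            and sum(column) == 1
        )
        if is_one_hot:
            continue
        # Compare against the caller's declared N encoding, entry by entry.
        is_n = (
            n_vector is not None
            and len(column) == len(n_vector)
            and all(math.isclose(v, n, abs_tol=tol) for v, n in zip(column, n_vector))
        )
        if not is_n:
            raise ValueError(f"position {i} is neither one-hot nor the declared N vector")


seq = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
validate_columns(seq, n_vector=[0.25, 0.25, 0.25, 0.25])  # accepted: N declared as uniform

try:
    validate_columns(seq)  # no n_vector given, so the 0.25 column is rejected
except ValueError as err:
    print("rejected:", err)
```

With this shape of API the caller states the convention explicitly, whether that is all zeros, uniform probabilities, or a background distribution of their choosing, and omitting the argument keeps the strict behaviour.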

@jmschrei
Owner

jmschrei commented Nov 9, 2024

Actually, where are you encountering the issue with _validate_input with FIMO?

@adamyhe
Contributor

adamyhe commented Dec 28, 2024

I've run into this issue with deep_lift_shap, specifically when it calls dinucleotide_shuffle, which uses _validate_input. I'm not sure if it's safe to allow [0, 0, 0, 0] into deep_lift_shap, so I've just been preprocessing BED files and excluding records which contain non-ACGT characters.
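
A rough sketch of that kind of preprocessing, assuming the sequences for each BED record have already been extracted into `(name, sequence)` pairs. The helper names `only_acgt` and `filter_records` are hypothetical, and treating lowercase (soft-masked) bases as valid is an assumption you may not want.

```python
def only_acgt(sequence):
    """True if the sequence is non-empty and contains only A, C, G, T (any case)."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGT")


def filter_records(records):
    """Split (name, sequence) pairs into those safe to one-hot encode and the rest."""
    kept, dropped = [], []
    for name, seq in records:
        (kept if only_acgt(seq) else dropped).append((name, seq))
    return kept, dropped


records = [
    ("peak_1", "ACGTACGT"),
    ("peak_2", "ACGTNNGT"),  # contains N, so it is excluded
    ("peak_3", "acgtACGT"),  # soft-masked bases are kept under this assumption
]
kept, dropped = filter_records(records)
print([name for name, _ in kept])
print([name for name, _ in dropped])
```

Returning the dropped records as well makes it easy to report how many regions were excluded before attribution.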

@jmschrei
Owner

Unfortunately, I don't have a good answer for doing dinucleotide shuffle or deep_lift_shap with missing characters. There are technical issues (as you've found) and also some conceptual issues, such as how valid model predictions are when portions of the sequence are missing and the model was largely trained on fully observed sequences. How meaningful are the gradients in this setting?

That being said, if you need attributions for such sequences and trust the results (within reason), you can always construct your own references for use by deep_lift_shap if you have a method you think works well.

@lauradmartens
Author

Hi, sorry for the late response! I actually encountered it in the ersatz.substitute function here:

_validate_input(X, "X", ohe=True)

Maybe one could check whether ignore=['N'] and, if so, set allow_N=True for the sequence validation? I see that you already allow Ns for the motifs.

@jmschrei
Owner

jmschrei commented Jan 7, 2025

Oh, hmm. Maybe I should just have a flag that allows you to disable validation if you know what you're doing?
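
As a sketch of what such a flag could look like, the example below gates a strict check behind a `validate` argument. Everything here is hypothetical: `substitute_sketch` and `_check_one_hot` are illustrative names, not tangermeme's `ersatz.substitute` or `_validate_input`, and sequences are plain lists of 4-element positions rather than tensors.

```python
def _check_one_hot(X):
    """Raise ValueError if any position is not exactly one-hot."""
    for i, column in enumerate(X):
        if sorted(column) != [0, 0, 0, 1]:
            raise ValueError(f"position {i} is not one-hot: {column}")


def substitute_sketch(X, motif, start, validate=True):
    """Return a copy of X with motif columns written in at `start`.

    Set validate=False to skip the one-hot check when the caller knows
    why the input would fail it and accepts the consequences.
    """
    if validate:
        _check_one_hot(X)
    if start < 0 or start + len(motif) > len(X):
        raise ValueError("motif does not fit inside X at this position")
    out = [list(column) for column in X]  # copy so the input is left untouched
    for offset, column in enumerate(motif):
        out[start + offset] = list(column)
    return out


X = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], [0, 0, 0, 1]]
motif = [[0, 1, 0, 0]]

result = substitute_sketch(X, motif, start=0, validate=False)
print(result)
```

The default stays strict, so existing callers keep the safety check, while a sequence with uniform N columns can still be processed by opting out explicitly.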
