Allow N to be encoded with 0.25 and add allow_N to ersatz functions #25

lauradmartens · 2024-11-07T13:42:02Z

Hi Jacob,

Thanks a lot for tangermeme, we've been using it a lot!

When running fimo on our sequences, I ran into problems when a sequence contained N, which in our case is encoded as [0.25, 0.25, 0.25, 0.25]. This throws an error in the _validate_input function, which only accepts [0, 0, 0, 0] for unknown characters.
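For illustration, here is a minimal sketch of the mismatch between the two encodings. The check below is a hypothetical stand-in written with NumPy only; it is not tangermeme's actual _validate_input, and the helper names are assumptions made up for this example.

```python
import numpy as np

def is_one_hot_or_zero(X):
    """Hypothetical check: every position is one-hot or all zeros (unknown)."""
    X = np.asarray(X)
    is_binary = np.isin(X, (0, 1)).all()
    col_sums = X.sum(axis=0)
    return bool(is_binary and np.isin(col_sums, (0, 1)).all())

def uniform_to_zero(X):
    """Hypothetical workaround: rewrite uniform-0.25 columns as all zeros."""
    X = np.array(X, dtype=float)  # copy so the caller's array is untouched
    X[:, np.isclose(X, 0.25).all(axis=0)] = 0.0
    return X

# Shape (4, length); rows are A, C, G, T. The sequence is "A", "N", "G".
X_zero_N = np.array([[1, 0, 0],
                     [0, 0, 0],
                     [0, 0, 1],
                     [0, 0, 0]], dtype=float)

# Same sequence, but N is encoded as uniform probabilities.
X_uniform_N = X_zero_N.copy()
X_uniform_N[:, 1] = 0.25

print(is_one_hot_or_zero(X_zero_N))
print(is_one_hot_or_zero(X_uniform_N))
print(is_one_hot_or_zero(uniform_to_zero(X_uniform_N)))
```

Converting the uniform columns to zeros before calling the library is one possible workaround, at the cost of changing what the downstream function sees at those positions.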

I also think an allow_N option should be added to the ersatz functions. I wrote a prototype for substitute; could you check whether that is how you would do it? If so, I can add it to the other functions.


jmschrei · 2024-11-09T10:38:18Z

Glad to hear it's useful to you!

It sounds like the immediate need is more like a "validate_input" parameter which can be set to "False". That way FIMO can run on any input without validating that it's correct if you know the reason why it's failing and are okay with it.

I think that the allow_N parameter is ambiguous because, in my view, N doesn't mean [0.25, 0.25, 0.25, 0.25], it means [0, 0, 0, 0]. One might also argue that, if it's non-zero, it should match the genomic background rather than uniform probabilities. So I'm a bit wary in general about the allow_N terminology. I might actually change it so that you specify the vector you want N to be, and if you don't specify one, you're not okay with Ns. But I need to think more about this.
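A rough sketch of what "specify the vector you want N to be" could look like is below. Everything here is an assumption for illustration: the function name validate_ohe and the n_vector parameter are hypothetical and do not exist in tangermeme.

```python
import numpy as np

def validate_ohe(X, n_vector=None):
    """Hypothetical validator for a (4, length) encoded sequence.

    With n_vector=None, every position must be strictly one-hot.
    Otherwise, positions equal to n_vector are also accepted as N.
    """
    X = np.asarray(X, dtype=float)
    one_hot = np.isin(X, (0.0, 1.0)).all(axis=0) & (X.sum(axis=0) == 1)

    if n_vector is None:
        ok = one_hot
    else:
        n = np.asarray(n_vector, dtype=float).reshape(-1, 1)
        ok = one_hot | np.isclose(X, n).all(axis=0)

    if not ok.all():
        bad = np.flatnonzero(~ok)
        raise ValueError(f"Invalid encoding at positions {bad.tolist()}")
    return True

# "A" followed by an N encoded as uniform probabilities.
X = np.array([[1, 0.25],
              [0, 0.25],
              [0, 0.25],
              [0, 0.25]])

validate_ohe(X, n_vector=[0.25, 0.25, 0.25, 0.25])  # accepted
```

With this shape of API, a caller who encodes N as zeros would pass n_vector=[0, 0, 0, 0], and a caller who passes nothing gets strict one-hot validation.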

jmschrei · 2024-11-09T17:58:52Z

Actually, where are you encountering the issue with _validate_input with FIMO?

adamyhe · 2024-12-28T10:13:23Z

I've run into this issue with deep_lift_shap. Specifically, when it calls dinucleotide_shuffle, which uses _validate_input. I'm not sure if it's safe to allow [0, 0, 0, 0] into deep_lift_shap, so I've just been preprocessing bed files and excluding records which contain non-ACGT characters.
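The preprocessing step described above might look something like the following sketch. It assumes the sequence for each BED record has already been extracted; the record tuples and the helper name has_only_acgt are made up for this example.

```python
def has_only_acgt(seq):
    """Return True if the sequence contains only A, C, G, T (case-insensitive)."""
    return set(seq.upper()) <= set("ACGT")

# Toy records: (chrom, start, end, sequence). Real sequences would come
# from a FASTA lookup for each BED interval.
records = [
    ("chr1", 0, 8, "ACGTACGT"),
    ("chr1", 8, 16, "ACGNNCGT"),
    ("chr2", 0, 8, "acgtacgt"),
]

# Keep only records whose sequence has no N or other ambiguity codes.
kept = [r for r in records if has_only_acgt(r[3])]
```

Note that this treats lowercase (soft-masked) bases as valid, which may or may not be what you want.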

jmschrei · 2024-12-28T17:04:19Z

Unfortunately, I don't have a good answer for doing dinucleotide shuffle or deep_lift_shap with missing characters. There are technical issues (as you've found) and also some conceptual issues, such as how valid model predictions are when portions of the sequence are missing if the model was largely trained on fully observed sequences. How meaningful are the gradients in this setting?

That being said, if you need attributions for such sequences and trust the results (within reason), you can always construct your own references for use by deep_lift_shap if you have a method you think works well.
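As one possible starting point for constructing such references, the sketch below shuffles only the observed (one-hot) positions and leaves unknown positions where they are. This is a plain mononucleotide shuffle, not a dinucleotide shuffle, and the function name is hypothetical. How the resulting array should be passed to deep_lift_shap is not shown here; check the expected reference shape in the tangermeme documentation.

```python
import numpy as np

def shuffle_observed_positions(X, n_shuffles=20, random_state=0):
    """Shuffle one-hot columns of a (4, length) array; leave N columns fixed.

    Returns an array of shape (n_shuffles, 4, length).
    """
    X = np.asarray(X)
    rng = np.random.default_rng(random_state)

    # A column counts as observed if it contains a 1; all-zero or
    # uniform-0.25 columns are treated as unknown and are not moved.
    observed = np.flatnonzero(X.max(axis=0) == 1)

    refs = np.repeat(X[None], n_shuffles, axis=0)
    for i in range(n_shuffles):
        refs[i][:, observed] = X[:, rng.permutation(observed)]
    return refs

# "A", "N", "C", "T", "A" with N encoded as all zeros.
X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0]], dtype=float)

refs = shuffle_observed_positions(X, n_shuffles=5)
```

Each reference keeps the nucleotide composition of the original and keeps the unknown position at index 1. Whether attributions computed against such references are meaningful is subject to the conceptual caveats above.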

lauradmartens · 2025-01-07T15:38:53Z

Hi, sorry for the late response! I actually encountered it in the ersatz.substitute function here:

tangermeme/tangermeme/ersatz.py

Line 162 in 9fd7b2d

_validate_input(X, "X", ohe=True)

Maybe one could check if ignore=['N'] and then set allow_N=True for the sequence validation? Because I see that you allow Ns for the motifs.

jmschrei · 2025-01-07T19:13:21Z

Oh, hmm. Maybe I should just have a flag that allows you to disable validation if you know what you're doing?

lauradmartens mentioned this issue Nov 7, 2024

Add more functionality for unknown characters #26

Closed
