Average shared span for multiple matches #12

Open
wants to merge 6 commits into
base: main
from


@hfr1tz3 hfr1tz3 commented Dec 22, 2024

Option 2 for issue #11

For each node $n_2\in T_2$, which may have multiple matches in $T_1$, let $\beta(n_2)=[n_1\in T_1 \colon \alpha(n_1)=n_2]$ be the list of nodes in $T_1$ matched to $n_2$. Then we compute the similarity between two trees $T_1$ and $T_2$ as
$$sim(T_1,T_2)=\sum_{n_2\in T_2}\frac{1}{|\beta(n_2)|}\sum_{n_1\in \beta(n_2)} m(n_1,n_2),$$
which averages the shared span between each node in $T_2$ and its matches $\beta(n_2)$.
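
A minimal sketch of the averaged sum above, for illustration only (this is not the code in this PR). The dictionaries `alpha` and `match_span` and the function name `averaged_similarity` are hypothetical stand-ins for $\alpha$, $m$, and $sim$; nodes of $T_2$ with no match are assumed to contribute nothing.

```python
from collections import defaultdict


def averaged_similarity(alpha, match_span):
    """Sum, over matched nodes of T2, of the mean shared span across their matches.

    alpha: dict mapping each node n1 of T1 to its match n2 in T2 (hypothetical input).
    match_span: dict mapping (n1, n2) to the shared span m(n1, n2) (hypothetical input).
    """
    # beta[n2] collects the nodes of T1 that are matched to n2.
    beta = defaultdict(list)
    for n1, n2 in alpha.items():
        beta[n2].append(n1)

    total = 0.0
    for n2, matches in beta.items():
        # Assumption: unmatched nodes of T2 never appear in beta, so they add 0.
        total += sum(match_span[(n1, n2)] for n1 in matches) / len(matches)
    return total


# Toy data (made up): "a1" and "a2" both match "x"; "b1" matches "y".
alpha = {"a1": "x", "a2": "x", "b1": "y"}
match_span = {("a1", "x"): 10.0, ("a2", "x"): 4.0, ("b1", "y"): 6.0}
print(averaged_similarity(alpha, match_span))  # (10 + 4) / 2 + 6 = 13.0
```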

@hfr1tz3 hfr1tz3 marked this pull request as ready for review December 23, 2024 19:00
@hfr1tz3
Author

hfr1tz3 commented Dec 23, 2024

I added extra tests for the ties='average' method. I am open to suggestions if we need more.

@petrelharp
Contributor

Gee, that's quite a clean implementation. Nice.

My first inclination is not to introduce the "ties" argument? I think we don't have a use case for ties=None, so why complicate things?

@hfr1tz3
Author

hfr1tz3 commented Jan 6, 2025

That's a fair point. The idea was to put it in place so that if we want to add a new matching scheme later, the scaffolding is already there. However, if you want to remove it, I can.
@petrelharp just ping me if you want it removed. I'll try to get to it while at JMM.

@petrelharp
Contributor

Good thought, but there's no need to put the argument in place, since if we add it in the future with the default "average", then that'll be backwards-compatible. So, it's just unnecessary complexity. I'd like it removed.

@hfr1tz3
Author

hfr1tz3 commented Jan 13, 2025

Alright, I removed the ties argument from the compare function. I additionally added another test case to check that the new average node span is working properly.

@petrelharp
Contributor

Hm, okay - sorry I didn't pick up on this earlier - but, this is also changing dissimilarity (since both dissimilarity and tpr depend on total_match_span). Now I need to think about whether this is desired or not.

@petrelharp
Contributor

Okay: I think that if T1 is inferred and T2 is truth, then our motivation has been:

  1. we want ARF to be "how much of T1 is wrong" (so, denominator is |T1|)
  2. and TPR to be "how much of T2 is correctly represented in T1" (denominator is |T2|)
  3. previously, interpretation (2) did not hold, because adding more nodes to T1 that mapped to the same node in T2 would change TPR without actually changing how much of T2 was represented
  4. with this change, interpretation (1) arguably doesn't hold, because suppose those multiple nodes in T1 that map to a single node in T2 all agree with the node in T2; they are arguably not wrong, so should not count against it
  5. the question is, I guess, what is "wrong"; one interpretation is that a node in T1 is 'wrong' if it implies relationships not in T2 (e.g., "on this chunk of genome A and B are more closely related than to C")
  6. having multiple nodes in T1 that match a given node in T2 feels a bit wrong, but counterpoint: adding completely-unary nodes to T1 should arguably not be penalised because we know that those nodes in fact exist - edges are but chains of inheritance through a bunch of individuals

So: currently we have:
[Screenshot from 2025-01-16 05-53-34]
... so I guess I'm arguing that the right-hand equality in equation (2) above should not hold, nor the right hand equality here:
[Screenshot from 2025-01-16 05-56-17]
so that, conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

What do you think?

@hfr1tz3
Author

hfr1tz3 commented Jan 16, 2025

> ... conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

I actually like this change. When I was attempting to redefine terms in the paper, the notation got bogged down.

ARF in my mind should be defined as it was with

$$\textnormal{ARF}(T_1, T_2) = 1-\frac{1}{||T_1||} \sum_{n_1\in N_1} m(n_1, \alpha(n_1)).$$
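
As a rough sketch of this definition (not the code in this PR), ARF sums the matched span over nodes of $T_1$ without any averaging. The names `arf`, `alpha`, `match_span`, and `total_span_t1` are hypothetical; `total_span_t1` stands in for $||T_1||$, assumed here to be the total span of the nodes in $T_1$.

```python
def arf(alpha, match_span, total_span_t1):
    """1 minus the fraction of T1's total span that is matched in T2.

    alpha: dict mapping each node n1 of T1 to its match n2 in T2 (hypothetical input).
    match_span: dict mapping (n1, n2) to the shared span m(n1, n2) (hypothetical input).
    total_span_t1: stand-in for ||T1|| (assumed to be the total node span of T1).
    """
    # Every node of T1 contributes its own matched span, even when several
    # nodes of T1 map to the same node of T2.
    matched = sum(match_span[(n1, n2)] for n1, n2 in alpha.items())
    return 1.0 - matched / total_span_t1


# Toy data (made up): node spans in T1 are 10, 10 and 20, so ||T1|| = 40.
alpha = {"a1": "x", "a2": "x", "b1": "y"}
match_span = {("a1", "x"): 10.0, ("a2", "x"): 4.0, ("b1", "y"): 6.0}
print(arf(alpha, match_span, total_span_t1=40.0))  # 1 - 20 / 40 = 0.5
```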

To comment about points 3, 4, and 6:
I took some time to think about possibly doing a 'weighted avg' for TPR when a node in $T_2$ has multiple matches; however, that can still give a TPR over 1. I think that with the averaged node span, TPR can decrease drastically in cases where that doesn't reflect the actual property of "how much of $T_2$ is in $T_1$". For instance, I feel like a tree sequence $T_1$ with extra unary nodes should have a TPR of 1 when compared to its simplified version $T_2$.

If we want to remove the 'averaged node span' from TPR we could alternatively define it as

$$ \textnormal{TPR}(T_1, T_2) = \frac{1}{||T_2||}\sum_{n_2\in N_2} \max_{n_1\in \alpha^{-1}(n_2)}m(n_1, n_2)$$

Below is an example comparing avg TPR and max TPR:

[image: example comparing avg TPR and max TPR]
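
The same comparison can be sketched on a toy case with made-up numbers (this is not the example in the image, and not the code in this PR). Assume $T_2$ has a single node `"x"` of span 10, and $T_1$ contains the same node `"a"` plus an extra unary node `"u"` covering part of it. The helper `tpr` and its `combine` argument are hypothetical; passing `sum` is meant to mimic counting every match separately.

```python
from collections import defaultdict


def tpr(alpha, match_span, total_span_t2, combine):
    """TPR with a pluggable rule for combining multiple matches to one T2 node.

    combine: function taking the list of shared spans for one node of T2
             and returning a single number (e.g. sum, mean, or max).
    """
    # Group the shared spans by the node of T2 they are matched to.
    spans_by_n2 = defaultdict(list)
    for n1, n2 in alpha.items():
        spans_by_n2[n2].append(match_span[(n1, n2)])
    return sum(combine(spans) for spans in spans_by_n2.values()) / total_span_t2


def mean(spans):
    return sum(spans) / len(spans)


# Toy data (made up): both "a" and the unary node "u" in T1 match "x" in T2.
alpha = {"a": "x", "u": "x"}
match_span = {("a", "x"): 10.0, ("u", "x"): 4.0}

print(tpr(alpha, match_span, 10.0, sum))   # 1.4: every match counted, exceeds 1
print(tpr(alpha, match_span, 10.0, mean))  # 0.7: the unary node drags TPR down
print(tpr(alpha, match_span, 10.0, max))   # 1.0: all of T2 is represented in T1
```

Under these assumptions only the max rule gives a TPR of 1 for a tree sequence compared against its simplified version.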

@petrelharp
Contributor

Hey, this is a great point. max is totally better than mean here. Totally on board with this.
