Average shared span for multiple matches #12

hfr1tz3 · 2024-12-22T19:33:14Z

Option 2 for issue #11

For each node $n_2\in T_2$ with multiple matches in $T_1$, let $\beta(n_2)=\{n_1\in T_1 \colon \alpha(n_1)=n_2\}$. Then we compute the similarity between two trees $T_1$ and $T_2$ as
$$sim(T_1,T_2)=\sum_{n_2\in T_2}\frac{1}{|\beta(n_2)|}\sum_{n_1\in \alpha^{-1}(n_2)} m(n_1,n_2),$$
which is the average over shared spans between nodes in $T_2$ and their multiple matches $\beta(n_2)$.
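
As a minimal sketch of the averaging described above (not the code in this PR), the function below takes the matching $\alpha$ and the shared spans $m$ as plain dictionaries. The names `average_similarity`, `alpha`, and `shared_span` are hypothetical and chosen only for illustration, and nodes of $T_2$ with no match are assumed to contribute zero.

```python
from collections import defaultdict


def average_similarity(alpha, shared_span):
    """Sum, over matched nodes of T2, of the mean shared span of their matches.

    alpha: dict mapping each node n1 of T1 to its match n2 in T2 (hypothetical input).
    shared_span: dict mapping (n1, n2) to the shared span m(n1, n2).
    """
    # Group the nodes of T1 by the node of T2 they are matched to.
    matches = defaultdict(list)
    for n1, n2 in alpha.items():
        matches[n2].append(n1)
    # Assumption: unmatched nodes of T2 never appear here, so they add nothing.
    return sum(
        sum(shared_span[(n1, n2)] for n1 in group) / len(group)
        for n2, group in matches.items()
    )


# Toy data: "a" and "b" both match "x"; "c" matches "y".
alpha = {"a": "x", "b": "x", "c": "y"}
shared_span = {("a", "x"): 4.0, ("b", "x"): 2.0, ("c", "y"): 5.0}
sim = average_similarity(alpha, shared_span)  # (4 + 2) / 2 + 5
```

Here the two matches of `"x"` are averaged to 3 before being added to the span of 5 for `"y"`.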

hfr1tz3 · 2024-12-23T19:03:57Z

I added extra tests for the `ties='average'` method. I am open to suggestions if we need more.

petrelharp · 2025-01-04T23:03:05Z

Gee, that's a quite clean implementation. Nice.

My first inclination is not to introduce the "ties" argument? I think we don't have a use case for ties=None, so why complicate things?

hfr1tz3 · 2025-01-06T18:54:09Z

That's a fair point. I think the idea was to have the scaffolding in place already, in case we want to add a new matching scheme later. However, if we want to remove that I can.
@petrelharp just ping me if you want it removed. I'll try to get to it while at JMM.

petrelharp · 2025-01-07T19:09:23Z

Good thought, but there's no need to put the argument in place, since if we add it in the future but with the default "average", then that'll be backwards-compatible. So, it's just unnecessary complexity. I'd like it removed.

hfr1tz3 · 2025-01-13T22:03:22Z

Alright, I removed the ties argument from the compare function. I additionally added another test case to check that the new average node span is working properly.

petrelharp · 2025-01-16T04:25:26Z

Hm, okay - sorry I didn't pick up on this earlier - but, this is also changing dissimilarity (since both dissimilarity and tpr depend on total_match_span). Now I need to think about whether this is desired or not.

petrelharp · 2025-01-16T13:58:52Z

Okay: I think that if T1 is inferred and T2 is truth, then our motivation has been:

1. we want ARF to be "how much of T1 is wrong" (so, denominator is |T1|)
2. and TPR to be "how much of T2 is correctly represented in T1" (denominator is |T2|)
3. previously, interpretation (2) did not hold, because adding more nodes to T1 that mapped to the same node in T2 would change TPR when it wasn't actually changing how much of T2 was represented
4. with this change, interpretation (1) arguably doesn't hold, because suppose those multiple nodes in T1 that map to a single node in T2 all agree with the node in T2; they are arguably not wrong, so should not count against it
5. the question is, I guess, what is "wrong"; one interpretation is that a node in T1 is 'wrong' if it implies relationships not in T2 (e.g., "on this chunk of genome A and B are more closely related than to C")
6. having multiple nodes in T1 that match a given node in T2 feels a bit wrong, but counterpoint: adding completely-unary nodes to T1 should arguably not be penalised because we know that those nodes in fact exist - edges are but chains of inheritance through a bunch of individuals

So: currently we have:

... so I guess I'm arguing that the right-hand equality in equation (2) above should not hold, nor the right hand equality here:

so that, conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

What do you think?

hfr1tz3 · 2025-01-16T19:15:50Z

> ... conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

I actually like this change. When I was attempting to redefine terms in the paper, the notation got bogged down.

ARF in my mind should be defined as it was with

$$\textnormal{ARF}(T_1, T_2) = 1-\frac{1}{||T_1||} \sum_{n_1\in N_1} m(n_1, \alpha(n_1)).$$
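
A minimal sketch of this ARF definition, assuming $||T_1||$ denotes the total node span of $T_1$ and is supplied by the caller. The names `arf`, `alpha`, `shared_span`, and `total_span_t1` are hypothetical, not part of the package.

```python
def arf(alpha, shared_span, total_span_t1):
    """One minus the matched fraction of T1's total span.

    alpha: dict mapping each node n1 of T1 to its match n2 in T2 (hypothetical input).
    shared_span: dict mapping (n1, n2) to the shared span m(n1, n2).
    total_span_t1: assumed to be ||T1||, the total node span of T1.
    """
    # Every node of T1 contributes its own shared span; there is no averaging over ties.
    matched = sum(shared_span[(n1, n2)] for n1, n2 in alpha.items())
    return 1 - matched / total_span_t1


# Toy data: the matched span is 4 + 2 + 5 = 11 out of a total of 22.
alpha = {"a": "x", "b": "x", "c": "y"}
shared_span = {("a", "x"): 4.0, ("b", "x"): 2.0, ("c", "y"): 5.0}
arf_value = arf(alpha, shared_span, total_span_t1=22.0)
```

Because the sum runs over nodes of $T_1$, two nodes matching the same node of $T_2$ are each counted once.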

To comment about points 3, 4, and 6:
I took some time to think about possibly doing a 'weighted avg' for TPR if we have multiple matches for a node in $T_2$; however, that can still give a TPR of over 1. I think with the averaged node span for TPR, we can drastically decrease TPR in a way that doesn't reflect the actual property of "how much of $T_2$ is in $T_1$". For instance, I feel like a tree sequence $T_1$ with extra unary nodes should have TPR of 1 when compared to its simplified version $T_2$.

If we want to remove the 'averaged node span' from TPR we could alternatively define it as

$$\textnormal{TPR}(T_1, T_2) = \frac{1}{||T_2||}\sum_{n_2\in N_2} \max_{n_1\in \alpha^{-1}(n_2)}m(n_1, n_2)$$

Below is an example between avg TPR and max TPR:
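
A toy illustration with made-up numbers (not the example originally attached to this comment): a single node `"x"` in $T_2$ with span 10 is matched by a node `"a"` in $T_1$ that covers it fully and by an extra node `"b"` that shares only 2 units of span. The helper `tpr` and its arguments are hypothetical, and $||T_2||$ is assumed to be the total node span of $T_2$.

```python
from collections import defaultdict


def tpr(alpha, shared_span, total_span_t2, combine):
    """TPR with ties among matches of the same T2 node resolved by `combine`.

    alpha: dict mapping each node n1 of T1 to its match n2 in T2 (hypothetical input).
    shared_span: dict mapping (n1, n2) to the shared span m(n1, n2).
    total_span_t2: assumed to be ||T2||, the total node span of T2.
    combine: function reducing a list of shared spans to one number.
    """
    # Collect the shared spans of all T1 nodes matched to each T2 node.
    spans_by_n2 = defaultdict(list)
    for n1, n2 in alpha.items():
        spans_by_n2[n2].append(shared_span[(n1, n2)])
    return sum(combine(spans) for spans in spans_by_n2.values()) / total_span_t2


def mean(spans):
    return sum(spans) / len(spans)


alpha = {"a": "x", "b": "x"}
shared_span = {("a", "x"): 10.0, ("b", "x"): 2.0}

avg_tpr = tpr(alpha, shared_span, 10.0, mean)  # (10 + 2) / 2 / 10 = 0.6
max_tpr = tpr(alpha, shared_span, 10.0, max)   # 10 / 10 = 1.0
```

Under the average, the extra partial match pulls TPR down to 0.6 even though `"x"` is fully represented by `"a"`; under the max, TPR stays at 1.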

petrelharp · 2025-01-17T05:15:58Z

Hey, this is a great point. max is totally better than mean here. Totally on board with this.

hfr1tz3 added 4 commits December 22, 2024 11:27

add average ties for compare method

22fb7b5

quick fix

4921a5d

added test cases for 'average' ties

a8be5b4

tests passed

f43a9ec

hfr1tz3 marked this pull request as ready for review December 23, 2024 19:00

removed 'ties' arg

ed589f1

**actually** removed ties arg

001957d
