Method for matching node sets across tree sequences #310

nspope · 2023-08-18T18:12:48Z

#301 describes a method for matching nodes between two tree sequences based on the span where both nodes subtend the same sets of samples, based off a suggestion by @petrelharp and @hfr1tz3 for evaluating extend_edges. This only uses graph structure, not mutations, so should be useful for comparing tsinfer to other inference pipelines.

Here I've implemented an efficient algorithm to do this:

If we have a dict mapping from unique sample sets to nodes for each overlapping tree in the query and target tree sequences, then it's possible to efficiently determine which node subtends exactly the same set of samples (e.g. no need to check all possible pairs)
The evaluation.CladeMap class is a tskit.Tree-like iterator that maintains/updates this dict via edge diffs
evaluation.CladeMap.next returns changes (in terms of unique sample sets) from one tree to the next
So, when coiterating over the two tree sequences, we only need to update the shared spans for nodes with an altered sample set

This is reasonably quick: for example it takes ~10 seconds to calculate shared spans between a 27000-tree inferred TS (~30000 nodes) and a 100000-tree simulated TS (~80000 nodes), each with 500 samples.

I put this into the new tsdate.evaluation module, later I'll add some other utilities to compare ages across tree sequences (like @petrelharp's mutation mapping strategy). I think we'll also want to add some tools for downsampling polytomies to get unbiased marginal statistics.

nspope · 2023-08-18T18:25:06Z

Here's an example showing node ages in an inferred+dated tree sequence vs their best-matching counterparts in the true ARG. This is 500 samples across 50Mb in stdpopsim's BosTau model. Using the variational dating algorithm,

and the same example using the old inside-outside algorithm,

tests/test_evaluation.py

tsdate/evaluation.py

tests/test_evaluation.py

tsdate/evaluation.py

petrelharp

Gee, this is excellent. A few comments.

nspope · 2023-08-19T04:12:22Z

Random thought -- I've been using RMSE of log10 age to measure estimation error, but this really isn't ideal, as it's most sensitive to the youngest nodes. A better way to calculate goodness-of-fit might be take the mean and variance of the variational distribution, call these $\hat{\mu_i}$ and $\hat{\sigma}_i^2$ for the $i$ th node, and transform to z-scores $(t_i - \hat{\mu}_i) \hat{\sigma}_i^{-1}$ before averaging. This would weight the "residual" for each node by how accurately we think we've estimated its age (though it might blow up when approximations are nearly degenerate, like happened with unary nodes).

nspope · 2023-08-21T01:20:20Z

Darn-- the shared span calculation works just fine for unary nodes, but there can be multiple "best matches" for a given node (e.g. that all have the same shared span). In particular this can happen with census events. For now, I'll just document what happens in this case (the node with the lowest id is returned). Everything works as expected with simplified tree sequences, and dating with unary nodes is a work-in-progress anyway.

Remove swap file Fixed bug in span normalisation Add tests Cleanup tests Tests Document behaviour in match_node_age when there are multiple best matches

nspope · 2023-08-21T01:25:32Z

OK, I think this is good to go as a first pass.

petrelharp · 2023-08-21T16:15:59Z

there can be multiple "best matches" for a given node

Right - I think that's unavoidable - but won't happen in the limit of large mutation rate? (Or is there something else going on I'm not thinking of?)

petrelharp · 2023-08-21T16:18:25Z

I've been using RMSE of log10 age

Using the posterior itself seems iffy, as you say. What if we just try to ad-hoc visually fit some functional form to the plot of mean error versus age? (and then maybe post hoc figure out a theoretical reason?)

hyanwong · 2023-08-24T08:07:24Z

This all seems good to me, thanks for finding an efficient method @nspope. I wonder, however, if it is something that is more widely useful than just for tsdate? There is a tscompare repo (aims listed at tskit-dev/tscompare#1) with nothing in it, and I wonder if we put this in there, and make that repo a dependency of tsdate?

However, for the time being, I'm happy to merge this into tsdate, and we can move it to tscompare once we've stress-tested it maybe?

Re: log transformations, is there an argument that we should square root the times? This would also allow zeros without the +1 hack, of course. I can't think of a specific theoretical justification for this (although the expected variance is the square of the mean tMRCA, right?)

hyanwong · 2023-08-24T11:22:21Z

I've been using RMSE of log10 age

Using the posterior itself seems iffy, as you say. What if we just try to ad-hoc visually fit some functional form to the plot of mean error versus age? (and then maybe post hoc figure out a theoretical reason?)

Since theory predicts that the variance in tMRCA goes as mean^2, could we simply weight the RMSE by 1/(mean^2 + const), where const accounts for the fact that even at time zero we have some error in node time estimation? I don't know what the constant should be, however.

nspope · 2023-08-25T01:41:42Z

I'm all for moving this into tscompare (or anywhere else it might be useful), once we've played around with it a bit.

hyanwong · 2023-08-25T08:00:05Z

I'm all for moving this into tscompare (or anywhere else it might be useful), once we've played around with it a bit.

OK, I'll merge and we can move it later.