Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Method for matching node sets across tree sequences #310

Merged
merged 1 commit into from
Aug 25, 2023

Conversation

nspope
Copy link
Contributor

@nspope nspope commented Aug 18, 2023

#301 describes a method for matching nodes between two tree sequences based on the span where both nodes subtend the same sets of samples, based off a suggestion by @petrelharp and @hfr1tz3 for evaluating extend_edges. This only uses graph structure, not mutations, so should be useful for comparing tsinfer to other inference pipelines.

Here I've implemented an efficient algorithm to do this:

  • If we have a dict mapping from unique sample sets to nodes for each overlapping tree in the query and target tree sequences, then it's possible to efficiently determine which node subtends exactly the same set of samples (e.g. no need to check all possible pairs)
  • The evaluation.CladeMap class is a tskit.Tree-like iterator that maintains/updates this dict via edge diffs
  • evaluation.CladeMap.next returns changes (in terms of unique sample sets) from one tree to the next
  • So, when coiterating over the two tree sequences, we only need to update the shared spans for nodes with an altered sample set

This is reasonably quick: for example it takes ~10 seconds to calculate shared spans between a 27000-tree inferred TS (~30000 nodes) and a 100000-tree simulated TS (~80000 nodes), each with 500 samples.

I put this into the new tsdate.evaluation module, later I'll add some other utilities to compare ages across tree sequences (like @petrelharp's mutation mapping strategy). I think we'll also want to add some tools for downsampling polytomies to get unbiased marginal statistics.

@nspope
Copy link
Contributor Author

nspope commented Aug 18, 2023

Here's an example showing node ages in an inferred+dated tree sequence vs their best-matching counterparts in the true ARG. This is 500 samples across 50Mb in stdpopsim's BosTau model. Using the variational dating algorithm,

and the same example using the old inside-outside algorithm,

tsdate/evaluation.py Outdated Show resolved Hide resolved
Copy link

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gee, this is excellent. A few comments.

@nspope
Copy link
Contributor Author

nspope commented Aug 19, 2023

Random thought -- I've been using RMSE of log10 age to measure estimation error, but this really isn't ideal, as it's most sensitive to the youngest nodes. A better way to calculate goodness-of-fit might be take the mean and variance of the variational distribution, call these $\hat{\mu_i}$ and $\hat{\sigma}_i^2$ for the $i$ th node, and transform to z-scores $(t_i - \hat{\mu}_i) \hat{\sigma}_i^{-1}$ before averaging. This would weight the "residual" for each node by how accurately we think we've estimated its age (though it might blow up when approximations are nearly degenerate, like happened with unary nodes).

@nspope
Copy link
Contributor Author

nspope commented Aug 21, 2023

Darn-- the shared span calculation works just fine for unary nodes, but there can be multiple "best matches" for a given node (e.g. that all have the same shared span). In particular this can happen with census events. For now, I'll just document what happens in this case (the node with the lowest id is returned). Everything works as expected with simplified tree sequences, and dating with unary nodes is a work-in-progress anyway.

Remove swap file

Fixed bug in span normalisation

Add tests

Cleanup tests

Tests

Document behaviour in match_node_age when there are multiple best matches
@nspope
Copy link
Contributor Author

nspope commented Aug 21, 2023

OK, I think this is good to go as a first pass.

@petrelharp
Copy link

there can be multiple "best matches" for a given node

Right - I think that's unavoidable - but won't happen in the limit of large mutation rate? (Or is there something else going on I'm not thinking of?)

@petrelharp
Copy link

I've been using RMSE of log10 age

Using the posterior itself seems iffy, as you say. What if we just try to ad-hoc visually fit some functional form to the plot of mean error versus age? (and then maybe post hoc figure out a theoretical reason?)

@hyanwong
Copy link
Member

hyanwong commented Aug 24, 2023

This all seems good to me, thanks for finding an efficient method @nspope. I wonder, however, if it is something that is more widely useful than just for tsdate? There is a tscompare repo (aims listed at tskit-dev/tscompare#1) with nothing in it, and I wonder if we put this in there, and make that repo a dependency of tsdate?

However, for the time being, I'm happy to merge this into tsdate, and we can move it to tscompare once we've stress-tested it maybe?

Re: log transformations, is there an argument that we should square root the times? This would also allow zeros without the +1 hack, of course. I can't think of a specific theoretical justification for this (although the expected variance is the square of the mean tMRCA, right?)

@hyanwong
Copy link
Member

I've been using RMSE of log10 age

Using the posterior itself seems iffy, as you say. What if we just try to ad-hoc visually fit some functional form to the plot of mean error versus age? (and then maybe post hoc figure out a theoretical reason?)

Since theory predicts that the variance in tMRCA goes as mean^2, could we simply weight the RMSE by 1/(mean^2 + const), where const accounts for the fact that even at time zero we have some error in node time estimation? I don't know what the constant should be, however.

@nspope
Copy link
Contributor Author

nspope commented Aug 25, 2023

I'm all for moving this into tscompare (or anywhere else it might be useful), once we've played around with it a bit.

@hyanwong
Copy link
Member

I'm all for moving this into tscompare (or anywhere else it might be useful), once we've played around with it a bit.

OK, I'll merge and we can move it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants