Method for matching node sets across tree sequences #310
Gee, this is excellent. A few comments.
Random thought: I've been using RMSE of log10 age to measure estimation error, but this really isn't ideal, as it's most sensitive to the youngest nodes. A better way to calculate goodness-of-fit might be to take the mean and variance of the variational distribution, call these |
Darn: the shared span calculation works just fine for unary nodes, but there can be multiple "best matches" for a given node (e.g. several candidates that all have the same shared span). In particular, this can happen with census events. For now, I'll just document what happens in this case (the node with the lowest id is returned). Everything works as expected with simplified tree sequences, and dating with unary nodes is a work in progress anyway. |
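To make the documented tie-breaking concrete, here is a minimal sketch of "largest shared span wins, lowest node id breaks ties". The function name `best_match` and the toy span values are hypothetical and are not taken from the tsdate code.

```python
def best_match(shared_span):
    """
    Pick the best-matching node from a dict {candidate node id: shared span}.

    Hypothetical helper for illustration only: the largest shared span wins,
    and ties are broken by returning the lowest node id.
    """
    # Sort key: negate the span so that larger spans sort first,
    # then fall back to the node id so the lowest id wins among ties.
    return min(shared_span, key=lambda node: (-shared_span[node], node))


# Nodes 7 and 3 share exactly the same span with the query node,
# so the node with the lower id is returned.
spans = {7: 120.0, 3: 120.0, 5: 80.0}
print(best_match(spans))
```

Making the tie-break explicit in the sort key keeps the result deterministic regardless of the order in which candidates were inserted.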
Remove swap file
Fixed bug in span normalisation
Add tests
Cleanup tests
Tests
Document behaviour in match_node_age when there are multiple best matches
OK, I think this is good to go as a first pass. |
Right - I think that's unavoidable - but won't happen in the limit of large mutation rate? (Or is there something else going on I'm not thinking of?) |
Using the posterior itself seems iffy, as you say. What if we just try to ad-hoc visually fit some functional form to the plot of mean error versus age? (and then maybe post hoc figure out a theoretical reason?) |
This all seems good to me, thanks for finding an efficient method @nspope. I wonder, however, if it is something that is more widely useful than just for tsdate? There is a
However, for the time being, I'm happy to merge this into tsdate, and we can move it to
Re: log transformations, is there an argument that we should square root the times? This would also allow zeros without the +1 hack, of course. I can't think of a specific theoretical justification for this (although the expected variance is the square of the mean tMRCA, right?) |
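As a rough illustration of the two transformations being compared, the sketch below computes RMSE under log10(t + 1) and under a square root. The helper names and the toy ages are made up for this example and do not come from tsdate.

```python
import math


def rmse(true_ages, est_ages, transform):
    """Root-mean-square error between transformed true and estimated ages."""
    squared = [
        (transform(est) - transform(true)) ** 2
        for true, est in zip(true_ages, est_ages)
    ]
    return math.sqrt(sum(squared) / len(squared))


def log10_plus_one(t):
    """log10 with the '+1 hack' so that an age of zero is allowed."""
    return math.log10(t + 1.0)


# Toy data (assumed, for illustration): one node of age zero, one young, one old.
true_ages = [0.0, 10.0, 1000.0]
est_ages = [5.0, 20.0, 2000.0]

print("RMSE, log10(t + 1):", rmse(true_ages, est_ages, log10_plus_one))
print("RMSE, sqrt(t):     ", rmse(true_ages, est_ages, math.sqrt))
```

The square root is defined at zero, so no offset is needed, but the two transformations weight errors at young and old nodes differently.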
Since theory predicts that the variance in tMRCA goes as |
I'm all for moving this into tscompare (or anywhere else it might be useful), once we've played around with it a bit. |
OK, I'll merge and we can move it later. |
#301 describes a method for matching nodes between two tree sequences based on the span over which both nodes subtend the same set of samples, following a suggestion by @petrelharp and @hfr1tz3 for evaluating extend_edges. This only uses graph structure, not mutations, so it should be useful for comparing tsinfer to other inference pipelines.
Here I've implemented an efficient algorithm to do this:
- evaluation.CladeMap class is a tskit.Tree-like iterator that maintains/updates this dict via edge diffs
- evaluation.CladeMap.next returns changes (in terms of unique sample sets) from one tree to the next

This is reasonably quick: for example, it takes ~10 seconds to calculate shared spans between a 27000-tree inferred TS (~30000 nodes) and a 100000-tree simulated TS (~80000 nodes), each with 500 samples.
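For reference, the quantity being computed can be written down naively. This is not the edge-diff algorithm described above, just a brute-force sketch on a hypothetical toy representation: each tree sequence is a list of (left, right, clades) tuples, where clades maps a node id to the frozenset of samples beneath it. The function name `shared_spans` and the data are assumptions for illustration.

```python
from collections import defaultdict


def shared_spans(ts_a, ts_b):
    """
    Brute-force shared span between every pair of nodes.

    ts_a, ts_b: lists of (left, right, clades) covering half-open intervals
    [left, right), where clades maps node id -> frozenset of sample ids.
    Returns {(node_a, node_b): total length where both subtend the same samples}.
    """
    spans = defaultdict(float)
    for left_a, right_a, clades_a in ts_a:
        for left_b, right_b, clades_b in ts_b:
            overlap = min(right_a, right_b) - max(left_a, left_b)
            if overlap <= 0:
                continue  # the two trees do not overlap along the genome
            # Index nodes in b by sample set; a list allows for unary nodes
            # that subtend the same samples as another node.
            by_set = defaultdict(list)
            for node_b, samples in clades_b.items():
                by_set[samples].append(node_b)
            for node_a, samples in clades_a.items():
                for node_b in by_set.get(samples, ()):
                    spans[(node_a, node_b)] += overlap
    return dict(spans)


# Toy example with three samples (0, 1, 2); sample nodes are omitted for brevity.
ts_a = [
    (0, 6, {4: frozenset({0, 1}), 5: frozenset({0, 1, 2})}),
    (6, 10, {6: frozenset({1, 2}), 5: frozenset({0, 1, 2})}),
]
ts_b = [
    (0, 4, {10: frozenset({0, 1}), 11: frozenset({0, 1, 2})}),
    (4, 10, {12: frozenset({1, 2}), 11: frozenset({0, 1, 2})}),
]
print(shared_spans(ts_a, ts_b))
```

This version rebuilds the clade index for every pair of overlapping trees, so it scales poorly with the number of trees; updating the clade-to-node mapping incrementally from edge differences avoids that repeated work.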
I put this into the new tsdate.evaluation module. Later I'll add some other utilities to compare ages across tree sequences (like @petrelharp's mutation mapping strategy). I think we'll also want to add some tools for downsampling polytomies to get unbiased marginal statistics.