Allow internal/non-contemporary samples for dating sc2ts, pedigrees #433
base: main
Conversation
This is great! I'm just thinking about how I might get a sensible mutation rate to use. I presume one way, if we think that most of the node times are +- correct, is to ensure that the TS is simplified, then simply count up the total mutational area and divide the number of mutations by it, i.e.
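Something like the following tskit sketch, perhaps (the helper name is mine, and this assumes the tree sequence has already been simplified so unary material doesn't inflate the area):

```python
import tskit

def crude_mutation_rate(ts: tskit.TreeSequence) -> float:
    # "Mutational area" = sum over trees of total branch length x genomic span,
    # i.e. the total opportunity for mutation, in (time units x bases).
    area = sum(tree.total_branch_length * tree.span for tree in ts.trees())
    # Rate per base per unit time, assuming node times are roughly correct.
    return ts.num_mutations / area
```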
An initial run on a huge covid ARG gives:
I assume there is some deeper numba stuff going on: is there a way to switch numba off to get a deeper trace?
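For reference, numba's JIT can be disabled globally via an environment variable, so that tracebacks point into the pure-Python source; it has to be set before numba (or anything that imports it) is loaded:

```python
import os

# Must be set before numba is first imported, e.g. at the top of a script.
os.environ["NUMBA_DISABLE_JIT"] = "1"

import tsdate  # tracebacks will now show the pure-Python frames
```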
In several places in the code, I'm assuming the time scale starts at zero, but it seems this is never checked; I suppose we should enforce this. Can you try it on the big covid sequence with (a) shifting node times so that the most recent sample node is at 0, and (b) turning rescaling off with rescaling_intervals=0? This allows that tiny example to run through.
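A sketch of (a) and (b) together, assuming the current tskit/tsdate table APIs (shift_and_date is my own helper name, not part of either library):

```python
import tskit
import tsdate

def shift_and_date(ts, mutation_rate):
    # (a) Shift all times so the most recent sample node sits at time 0.
    tables = ts.dump_tables()
    offset = ts.nodes_time[ts.samples()].min()
    tables.nodes.time = tables.nodes.time - offset
    # Shift known mutation times too; leave UNKNOWN_TIME entries untouched,
    # since arithmetic would destroy their special NaN encoding.
    mut_time = tables.mutations.time
    known = ~tskit.is_unknown_time(mut_time)
    mut_time[known] -= offset
    tables.mutations.time = mut_time
    shifted = tables.tree_sequence()
    # (b) Turn rescaling off.
    return tsdate.date(shifted, mutation_rate=mutation_rate, rescaling_intervals=0)
```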
Good point. I have been using rescaling_intervals=0 for this. When I transform the times to start at 0, and flag the root(s) as not-a-sample, I can get dating to work with pretty large subsets of the data (e.g. 350k samples), which is fantastic. It still fails on larger sample sizes, though, now with a slightly different (assertion) error:
This is probably due to floating point error in that comparison against 1. This seems like a good case for catching these sorts of numerical issues. Can you share the trees over Slack or email, and the code you're using to prepare them?
Amazing!
Just to post some of @nspope's thoughts. The most recent branch fixes the sample-root issue and also allows rescaling, but it's not clear to Nate or myself that rescaling is sensible here anyway. Internal samples seem to be dealt with just fine. Re the
We can use the current code to date relatively large portions of the sc2ts ARGs (e.g. >400k samples), so we could try chopping the TS into chunks to get something working (see the sketch below). Alternatively, I wonder if there is a way to skip certain edges if they create particularly problematic constraints, and output the list of skipped edges separately? This could be useful for identifying badly inferred relationships. For this, I suggest it would be much better to omit bad edges than bad nodes, if possible.
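For the chunking idea, a hypothetical helper (nothing like this exists in tsdate; simplify restricts the ARG to the history of each sample subset):

```python
import tsdate

def date_in_chunks(ts, mutation_rate, chunk_size=400_000):
    # Date manageable subsets of samples one at a time by simplifying
    # the full ARG down to each chunk. Chunk size is arbitrary here.
    samples = ts.samples()
    for start in range(0, len(samples), chunk_size):
        subset = ts.simplify(samples[start : start + chunk_size])
        yield tsdate.date(subset, mutation_rate=mutation_rate, rescaling_intervals=0)
```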
I'd skip rescaling for the covid trees -- there are so many constraints (including a fixed root age) that I really doubt it's necessary (in unconstrained ARGs it's generally most helpful towards the root). Regarding the assertion -- I can do something hacky to bypass the issue, if you want to give it a go on the full ARG this or next week (and do it properly later).
With the last commit it'll run through on the 450K covid example you sent, @hyanwong -- can you try it on a larger example?
At the end of the dating process, we need to pick dates for the nodes (and mutations) that are consistent with the parent-older-than-child requirement of a tree sequence. There are two parts to this: a least squares step and a simple upwards iteration. However, both of these can cause samples at non-zero times to have their times changed, which is not ideal. A relatively simple fix would be to rescale all the node times at the end of the routine. If we have a fixed node N2 at true time t2, closest in time to a younger fixed node N1 at time t1, but we estimate its time as t2*, we simply need to map each non-fixed node time t between t1 and t2* to t1 + (t - t1)(t2 - t1)/(t2* - t1), and do so iteratively up the ARG from the tips.
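A numpy sketch of that post-hoc rescaling, written as a single piecewise-linear map through the fixed nodes rather than the tip-upwards iteration described above (names and details are mine, not tsdate's; it assumes the estimated times preserve the relative order of the fixed nodes):

```python
import numpy as np

def rescale_to_fixed(est_times, fixed, true_times):
    # est_times: estimated times for all nodes; fixed: indices of fixed
    # (sample) nodes; true_times: their known times. Each estimated fixed
    # time t* is mapped onto its true t, stretching free nodes in between:
    # within [t1, t2*], t -> t1 + (t - t1) * (t2 - t1) / (t2* - t1).
    order = np.argsort(est_times[fixed])
    x = est_times[fixed][order]   # estimated fixed times, increasing
    y = true_times[order]         # corresponding true times
    out = np.interp(est_times, x, y)  # piecewise-linear between fixed points
    older = est_times > x[-1]         # above the oldest fixed node, just shift
    out[older] = y[-1] + (est_times[older] - x[-1])
    out[fixed] = true_times           # fixed nodes land exactly on target
    return out
```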
The least squares approach will respect sample ages, but can't satisfy the constraints exactly. So we always have to fall back on the simple approach of bumping parents up, which is where the problem lies. Post-hoc rescaling like you suggest is a nice solution.
Here's a (currently failing) algorithm for rescaling using a single pass through the nodes in time order. It requires us to ensure that all sample nodes that occur at the same time are given the same inferred time (otherwise we could "squash" children into a zero-length timeslice, and have child_time == parent_time). We should be able to enforce this same-time requirement using the iterative approach, but it is not coded up below, so this is not currently working.
It looks like you haven't switched to the ruff linter for your tsdate install, @nspope, so CI isn't passing the linting tests.
Force-pushed from a142dec to 1791b09
I'd like to merge this, @nspope. Is there any reason not to? For example, I see
@Mergifyio rebase
✅ Branch has been successfully rebased
Force-pushed from 362aaa5 to 4e0a7b6
Don't merge it, because it's doing the wrong thing (resetting messages for one part of the algorithm, but not for other parts). That'll be fine for sc2ts where there's no prior (roots are fixed) and no unphased singletons. Otherwise the updates will get progressively more incorrect if the "hacky bypass" bit is triggered. Fixing this requires a small refactor, unfortunately.
Ah, I hadn't realised that (or had forgotten). Thanks for the explanation.
A minimal case:
Code to simulate and run on test case
Some caveats: set constr_iterations to 100 or something similar.
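The collapsed code block isn't shown above, but a minimal sketch of what such a test might look like (msprime for the simulation; the parameter values are arbitrary, and this assumes the branch from this PR so that non-contemporary samples are accepted):

```python
import msprime
import tsdate

# Simulate an ARG containing both contemporary and "ancient" samples.
samples = [
    msprime.SampleSet(4, time=0),
    msprime.SampleSet(4, time=500),  # non-contemporary samples
]
ts = msprime.sim_ancestry(
    samples=samples,
    population_size=10_000,
    sequence_length=100_000,
    recombination_rate=1e-8,
    random_seed=1,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

# Date, bumping the constraint iterations as suggested above.
dated = tsdate.date(ts, mutation_rate=1e-8, constr_iterations=100)
```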