-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dEploid mcmc not converging #346
Comments
Here are some files: |
Hi @bredelings , had a quick look at the data. You have got a lot of missing data. Would it be ok to encode your "." to zeros? does it make sense? I am not an expert in the sequence you are dealing with, but DEploid does not handle missing data at the moment. Made a quick plot with the alt vs ref count, on the left is what's in the data, on the right, is after encode "." as zeros. Eyeballing this data, it should be a mixed sample, unless there is a lot of error with the reading that you may want to filter them out. DEploid MCMC struggle with certain mixing is a known problem. I don't have a solution yet, It would be awesome if you could help to solve this problem. Very often, we find it is the data is very difficult. our strategy is to take multiple runs, and take the consensus result. There is a newer version, "Best-practice" is soon coming. I want to finalize some of the documentation. But for this version, you must provide a reference panel. We found that it is very useful when learning k, especially if you have a very balanced mixing. vcfFileName = "ERR2678989.vcf.txt"
numberOfHeaderLines = 42
ADFieldIndex=7
vcf <- read.table( gzfile(vcfFileName), skip = numberOfHeaderLines,
header = T, comment.char = "", stringsAsFactors = FALSE,
check.names = FALSE)
sampleName <- names(vcf)[10]
tmp <- vcf[[sampleName]]
field <- strsplit(as.character(tmp), ":")
tmpCovStr <- unlist(lapply(field, `[[`, ADFieldIndex))
tmpCov <- strsplit(as.character(tmpCovStr), ",")
refCount <- as.numeric(unlist(lapply(tmpCov, `[[`, 1)))
altCount <- as.numeric(unlist(lapply(tmpCov, `[[`, 2)))
refCount0 = refCount
refCount0[is.na(refCount)] = 0
altCount0 = altCount
altCount0[is.na(altCount)]=0
plot(refCount0, altCount0)
png("tmp.png", width = 1200, height =600)
par(mfrow=c(1,2))
plot(refCount, altCount)
plot(refCount0, altCount0)
dev.off() |
@bredelings I am with you. Should be clonal I think. Do you have a reference panel for this? What your initial k is set to? |
https://deploid.readthedocs.io/en/latest/FAQ.html#over-fitting Hey @bredelings We have seen this problem before as well. You can force and set k to 1. And just use the plaf as the reference to learn the haplotype, and use this as a reference strain and build a panel to deconvolve your other samples. I have used this approach for pf3k |
Hi @shajoezhu, thanks for the advice! I will try adding a reference panel. I wasn't sure it would help because it seems that rho/theta is pretty high in vivax, so linkage disequilibrium is pretty low. But maybe the panel could still help. Do you know if rho/theta is lower in falciparum -- i.e. there are more SNPs per recombination event? I haven't had time to fully look at the mixing in sufficient detail yet - I will get back to you soon, hopefully next week. |
Hi Ben, it works for vivax as well. You can consider to use some of these
data
https://www.malariagen.net/data/p-vivax-genome-variation-may-2016-data-release
Pretty sure I had done this some time ago, can send across if I can find it
them
…On Sat, 6 Feb 2021 at 07:43, Benjamin Redelings ***@***.***> wrote:
Hi @shajoezhu <https://github.com/shajoezhu>, thanks for the advice! I
will try adding a reference panel. I wasn't sure it would help because it
seems that rho/theta is pretty high in vivax, so linkage disequilibrium is
pretty low. But maybe the panel could still help. Do you know if rho/theta
is higher in falciparum?
I haven't had time to fully look at the mixing in sufficient detail yet -
I will get back to you soon, hopefully next week.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4FP7NDR2IHWF2T73EMUSTS5TXLLANCNFSM4WZOJQMQ>
.
|
OK, I had a few ideas about fixing mixing problems. It looks like you have transition kernels to update (w|h) and transition kernels to update (h|w), where h is the haplotypes and w is the weights. The transition kernels for updating (h|w) can resample one or two haplotypes at a time. Does that sound right? I think the issue is that if h and w are correlated, then updating
I can write down the math for detailed balance if needed. It looks like computing the probability by summing out the haplotypes could be implemented fairly easily in I'm not sure if I'm missing something in |
Another idea was to allow adding and removing haplotypes instead of fixing the number at five and setting the prior so that some have very low frequencies. This probably makes summarizing the posterior more complex. Maybe you would want to generate multiple summaries in that case, i.e. a summary for K=1 with one haplotype, a summary for K=2 with two haplotypes, a summary for K=3 with 3 haplotypes. In any case, I think that you could add a 0/1 variable Z[k] for each of the 5 haplotypes, and say that It would be possible to actually remove x'[k] and h[k] from the MCMC state when Z[k] ==0 and the haplotype is missing. You would draw x'[k] from the prior and draw h[k] using UpdateSingleHap when adding the haplotype back in. But it would also be possible to keep all 5 haplopypes even when Z[k] == 0. However, for the purposes of mixing, it would still be important to resample h[k] from the posterior using UpdateSingleHap when flipping Z[k] to 1. |
Any thoughts? It also looks like the |
OK, I had a few ideas about fixing mixing problems. It looks like you have
transition kernels to update (w|h) and transition kernels to update (h|w),
where h is the haplotypes and w is the weights. The transition kernels for
updating (h|w) can resample one or two haplotypes at a time. Does that
sound right?
Yes, that is correct
The correlation will depend on what type of infection this is. Super
infection vs co infection. We talk about this in the elife DEploid ibd
paper. The former you can consider that they are independent, but the
latter is heavily correlated, which is more complex, biologically more
interesting
I think the issue is that if h and w are correlated, then updating h but
not w might not work. So, what you need is a move that
- proposes a move from w1 to w2 by changing the relative frequencies
of strain i and strain j. This could perhaps perturb x[i] and x[j]
- computes the probability of w1 by summing out the haplotypes for i
and j using dynamic programming.
Can you be more specific with the dynamic programming part? And what do
you mean by summing out the haplotye if I may ask.
- computes the probability of w2 by summing out the haplotypes for i
and j using dynamic programming.
- accept or reject the new weights w2
- if w2 is accepted, resample haplotypes i and j from the dynamic
programming matrix associated with w2
I was wondering can you be more specific with this matrix? What’s the
size, it will probably help me to better understand the dynamic programming
part.
- if w2 is rejected, then resample haplotypes i and j from the dynamic
programming matrix associated with w1
I can write down the math for detailed balance if needed.
It looks like computing the probability by summing out the haplotypes
could be implemented fairly easily in UpdatePairHap::calcFwdBwdProbs() by
just summing the last column.
I'm not sure if I'm missing something in UpdatePairHap::
calcExpectedWsaf( ) though. What is that doing?
at each site, there is the observed wsaf, which is alt/(ref+alt), and
expected wsaf from the model, of inner product of the proportion with the
genotypes or strains at this site.
… —
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4FP7O2QDBAJV4FHTABLWDS6RIBXANCNFSM4WZOJQMQ>
.
|
I really like this idea. I had done something like this before, didn’t
succeed! Please let me know what you think, and feel free to give a go. I
really would like to know why I failed for this approach, cos I thought it
would have worked, or at least with some improvement
On Thu, 11 Feb 2021 at 23:36, Benjamin Redelings ***@***.***> wrote:
Another idea was to allow adding and removing haplotypes instead of fixing
the number at five and setting the prior so that some have very low
frequencies. This probably makes summarizing the posterior more complex.
Maybe you would want to generate multiple summaries in that case, i.e. a
summary for K=1 with one haplotype, a summary for K=2 with two haplotypes,
a summary for K=3 with 3 haplotypes.
In any case, I think that you could add a 0/1 variable Z[k] for each of
the 5 haplotypes, and say that
x'[k] ~ normal(eta, sigma^2)
x[k] = if z[k] == 1 then x'[j] else -Inf
It would be possible to actually remove x'[k] and h[k] from the MCMC state
when Z[k] ==0 and the haplotype is missing. You would draw x'[k] from the
prior and h[k] using UpdateSingleHap when adding the haplotype back in.
Please feel free to try it out for any new ideas. I would love to hear your
thoughts, and perhaps we can schedule calls as well. My time is a little
unpredictable at the moment due to child care, and I apologize for the late
reply.
Some idea and experience from my part is that, you also need to consider
the sequencing technology and mapping quality. The sequence coverage varies
along sequence, simple cumulative probability can be heavily affected by
some high read count. It would be better to do by chuncks, DEploid ibd has
some functionality to adjust that. But my experience is that nothing does
better job when you have a good quality reference panel, that’s why I spent
quite a bit of effort in the latest best practices version to improve the
panel. I am happy with the results, just taking time to get all
documentation ready for release
—
… You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4FP7LVHAQOEN5DZFLFZ2TS6RSYHANCNFSM4WZOJQMQ>
.
|
I removed the remote one cos I was using the local version heavily, and it
started to branch off again. I didn’t realize you were rebased to that.
Perhaps, we can find the hash, and create a branch that is dedicated for
your tests? Let me know. Thanks!
…On Fri, 19 Feb 2021 at 01:20, Benjamin Redelings ***@***.***> wrote:
Any thoughts?
It also looks like the doc branch has disappeared... do you know where it
went?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4FP7LM52PRHS6CO72MZPTS7W4FJANCNFSM4WZOJQMQ>
.
|
Sorry, I just saw this because the e-mails went into my spam folder. Yeah, finding some time to talk would be great. I am also really interested in your ideas about the sequencing technology and mapping quality. More later. |
I have some patches I wanted to submit to illustrate some of what I was talking about and clarify about the dynamic programming. Is there a good branch for me to make the patches against? Mostly its against |
BTW, it looks like there might be some double-counting in this line of fwdTmp[i] = emission_[j][panel_->content_[hapIndex][i]] * (fwdProbs_.back()[i] * pNoRec + massFromRec); It looks like the transition from i -> i is included in both fwdTmp[i] = emission_[j][panel_->content_[hapIndex][i]] * (fwdProbs_.back()[i] * (pNoRec - pRecEachHap) + massFromRec); Not entirely sure... |
Hi Ben, just branch off from the latest master should be fine. I will avoid
to work on this file as well. Thank you!
…On Fri, 26 Feb 2021 at 20:16, Benjamin Redelings ***@***.***> wrote:
I have some patches I wanted to submit to illustrate some of what I was
talking about and clarify about the dynamic programming. Is there a good
branch for me to make the patches against? Mostly its against
updateHap.cpp
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4FP7ID6R67CPZ64X6AXVLTA76R5ANCNFSM4WZOJQMQ>
.
|
Describe the bug
When run multiple times on the same data set, dEploid oscillates between two different answers. For example,
first answer:
second answer:
In this specific, case its clear from looking at the data that K=1 is right.
dEploid version: 659c770
To Reproduce
Steps to reproduce the behavior:
I also tried increasing the run time from 800 to 8000 generations, but it didn't help.
Expected behavior
It seems that that the MCMC would be give the same answer each time. However, it seems that there are two modes, and the MCMC cannot move between them.
Additional context
I might be able to help you improve the mixing.
The text was updated successfully, but these errors were encountered: