Possible artifacts due to mutations introduced by recapitation #353

mshpak76 · 2024-10-04T15:01:54Z

With the default settings of SLiM and pyslim, am I right in assuming that the generated sequence data doesn't distinguish between a mutation with a nonzero selection coefficient introduced in a SLiM simulation and a neutral mutation introduced at the same site by the neutral coalescent recapitation in pyslim?

Specifically, suppose I have a three-population model with no migration and genealogy (C,(B,A)). If I introduce a beneficial mutation at a random site in an A genome (after its split from B) and it undergoes a complete selective sweep, I should see the configuration 1,0,0 for the frequency of the variant allele in populations A, B, C. However, in a small number of replicates I get a nonzero frequency of the variant allele in population B or C. The only explanation I can think of is that a neutral mutation is introduced in population B or C (or their common ancestor) at the same site as the beneficial mutation in A, and since by default all sites are either "0" (wildtype) or "1" (variant), the generated genotypes don't distinguish between the variant introduced in SLiM and a variant introduced at the same site (in another population) by pyslim. Is this correct?

bhaller · 2024-10-04T15:10:30Z

Maintainer

If by "genotypes" you're referring to nucleotide sequences, such as in VCF, then yes, I think you could get a mutation at the same site to the same nucleotide coming from both processes, and the two would look the same. But mutational lineages are treated separately in the tree sequence, so at that level you'd be able to tell the two mutations apart. Exporting to nucleotide sequences, such as VCF, is lossy: it discards ancestry information, including this distinction between different mutational lineages, "flattening" all of that information into a simple nucleotide sequence. If you don't want that to happen, don't work at the nucleotide sequence level, work at the tree sequence level. :->

9 replies

petrelharp Oct 4, 2024
Maintainer

Let's see - one way to check this would be to look at an example site that you're seeing this pattern at, and see what the genotypes are there both before and after adding neutral mutations.

But, I think I know what's going on. Referring to the docs for genotype_matrix, each entry of the matrix is an "index into the array of alleles". By passing filter_sites=False to simplify you're ensuring that the rows of all three of your genotype matrices correspond to the same sites, but this doesn't guarantee that the alleles arrays at each site are the same. So, essentially I think you're right - 1 means "derived mutation" (and hence doesn't distinguish between the different derived mutations at the site), but this is not because of SLiM or pyslim, it's because you've subset the tree sequence using simplify. In other words, if there's a mutation X that's present in A (but not elsewhere), and a mutation Y that's present in B (but not elsewhere) at the same site, then in your workflow here they'll both be represented as 1 in the genotype matrix (if there are no other mutations at that site). A better way to do it would be to not simplify, and use the samples argument to genotype_matrix. Since under the hood each distinct mutation produced by SLiM (or by msprime using the SLiMMutationModel) is unique (has a unique derived state!), you won't confound distinct mutations in this way.

Another way to debug what's going on: find the index of the site in question, and do something like

site = mts.site(site_index)
print(site)
for mut in site.mutations:
  print(mut)

mshpak76 Oct 4, 2024
Author

The interpretation that you describe is what I had thought at first, but how does this account for the (not insubstantial) number of sites with a label > 1, some as high as 5? Given my comparatively low simulated mutation rates and large numbers of sites, it seems unlikely that a single site would get multiple hits.

petrelharp Oct 7, 2024
Maintainer

I don't know - my suggestion is to print those sites out and see what's going on there, as I suggest directly above.

mshpak76 Oct 8, 2024
Author

I think that the most straightforward way for me to proceed is to re-run the code excerpted in the earlier post without using the simplify() function. However, without simplify(), how do I identify the derived mutation of interest at a site and tally its frequency in the sample? Specifically, I assume that what I'm doing to calculate frequencies in the code excerpted below would not work without simplify(), so how would I need to modify my code to (correctly) calculate site_freqs_c etc. for the derived allele?

In your previous response, you suggested "A better way to do it would be to not simplify, and use the samples argument to genotype_matrix". However, I'm not quite sure what this means operationally or syntactically.

mts_subset_a = mts.simplify(mts.samples(population=1), filter_sites=False)
mts_subset_b = mts.simplify(mts.samples(population=2), filter_sites=False)
mts_subset_c = mts.simplify(mts.samples(population=3), filter_sites=False)
Genotypes_A = mts_subset_a.genotype_matrix()
Genotypes_B = mts_subset_b.genotype_matrix()
Genotypes_C = mts_subset_c.genotype_matrix()
freq_sum_c = np.sum(Genotypes_C, 1)
near_fixed_site = [i for i in range(len(freq_sum_c)) if freq_sum_c[i] == beneficial_mutations][0]
carry_beneficial_allele = [i for i in range(len(Genotypes_C[near_fixed_site])) if Genotypes_C[near_fixed_site][i]==1]
samp_a = np.random.choice(range(len(Genotypes_A[0])), size=hap_size, replace=False)
samp_b = np.random.choice(range(len(Genotypes_B[0])), size=hap_size, replace=False)
samp_c = np.random.choice(carry_beneficial_allele, size=hap_size, replace=False)
Genotype_Array_A = Genotypes_A[:,samp_a]
Genotype_Array_B = Genotypes_B[:,samp_b]
Genotype_Array_C = Genotypes_C[:,samp_c]
site_freqs_a = np.sum(Genotype_Array_A, 1)/hap_size
site_freqs_b = np.sum(Genotype_Array_B, 1)/hap_size
site_freqs_c = np.sum(Genotype_Array_C, 1)/hap_size

petrelharp Oct 15, 2024
Maintainer

What I meant was to use

ts.genotype_matrix(samples=...)

... but I see you've moved this question to #354.
