Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T parameter help #16

Open
gianfilippo opened this issue Jan 20, 2022 · 10 comments
Open

T parameter help #16

gianfilippo opened this issue Jan 20, 2022 · 10 comments

Comments

@gianfilippo
Copy link

Hi,

I am trying to understand the how to properly choose fix the T parameter, which I understand is the min threshold for the O22 value (i.e. observed count of the 2-haplotype).

I am running one SARS-cov-2 sample (amplicon sequencing on Illumina) I am interested in.
I ran it with Tf=0.05 and T=1,10,100. Summary results are below.

Could you please explain me how T is affecting number of SNVs and haplotypes ?

SNVs numbers:
T=100 2412
T=10 3871
T=1 2491

Haplotypes N and freq:
T100:
0_fr_0.2847895994964371, 1_fr_0.19215312905857715, 2_fr_0.19203620837052798, 3_fr_0.1624256254050896, 4_fr_0.06115178494543045, 5_fr_0.05772617687199445
T10:
0_fr_0.2676417679182119, 1_fr_0.26451591468463653, 2_fr_0.15307681215147081, 3_fr_0.08339144250951734, 4_fr_0.08243902008443806, 5_fr_0.08039214779779189, 6_fr_0.06854289485393335
T1:
0_fr_1.0

Thanks

@vtsyvina
Copy link
Owner

Hello.

The difference in t and tf parameters is that tf is relative to coverage and t is absolute. If a minor haplotype for two positions is supported by less than t reads then it won't be considered in haplotype reconstruction.

With higher parameter T you usually expect the number of haplotypes to decrease(since we consider less "links" between minor positions). However, it depends on the sample and sometimes that one link that appear for lower value of t can actually link two haplotypes that were otherwise(with higher value of t) considered distinct.

I may try to tell you more if you send me logs(to email) while running the tool with "-log" flag. Without them it is difficult to say what exactly is happening in your case.

@gianfilippo
Copy link
Author

gianfilippo commented Jan 21, 2022 via email

@vtsyvina
Copy link
Owner

@gianfilippo This is only a part of the file. You need to attach it separately.

Also, variant calling is kinda experimental feature that we added. We never really evaluated it's performance and accuracy. I'd take the results from it with a grain of salt.

@gianfilippo
Copy link
Author

Hi, ok but I am a bit lost now. From your paper I understood that the first step is to "extract pairs of statistically linked mutations". Could you please clarify ?

@vtsyvina
Copy link
Owner

Yes. First step is building the linked graph. We find pairs of SNPs that we suspect should go together in at least one haplotype.

Once this graph is build we go to the next step of finding maximum cliques - sets of SNPs with pair-wise connection. You can see these cliques as small scaffolds since reads cover only limited area of genome(especially Illumina). Now, having these cliques we try to merge them to obtain haplotypes. It's not a trivial step, so I won't go into details.

After we have haplotype candidates the tool splits all reads into clusters(based on the distance to all haplotype candidates) and calculate consensus. In this step some SNPs that we initially identified may go away(whether it was a read error or part of a different haplotype) and some other appear(sometimes we don't have enough statistical evidence for an SNP link, but after reads clustering the SNP appears in consensus). And this is the final answer of the algorithm. This is regular algorithm.

The variant calling part works on top of the algorithm. It takes obtained haplotypes, get all SNPs from them(relative to sample consensus) and tries to estimate frequency. So it is a pretty trivial addition to the main algorithm(we don't discuss it in paper, just added here in repo as experimental feature).

@gianfilippo
Copy link
Author

logfile.txt

@gianfilippo
Copy link
Author

Hi,

thanks. I reattached the logfile.

Best

@vtsyvina
Copy link
Owner

It will help if you run without VC, just "-m snv-illumina". This log is only for t=100. SO I cannot compare with what is going on for different choice of t

@antoine4ucsd
Copy link

Hello
I am trying to understand this. I tried on the same samples:
option 1: -t default and -tf 0.05
option 2: -t 100 and -tf 0.05
for some samples, I get more haplotypes with option 1 (which I was expecting) but for others, I get more haplotypes with option 2. it is more suprising since I assumed the latter was more stringent?

can you help with that?

@antoine4ucsd
Copy link

adding more context, I also tried a third option with t 1000
I got respectively
7 HAP for -t default
8 HAP for -t 100 (more than with t default??)
4 HAP for -t 1000

log files are attached
thank you for your help!

logfile.default.txt
logfile.t100.txt
logfile.t1000.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants