Change `translate_matches` rule to not multiplex all COBS results #89

leoisl · 2022-07-25T13:44:32Z

This experiment shows that the decompress_and_run_cobs rule is I/O bound (as expected, due to xz decompression and running COBS), while the batch_align_minimap2 rule is CPU bound (as expected, since we use pipes as much as possible when running minimap2). So running all decompress_and_run_cobs jobs before any batch_align_minimap2 job is a bit inefficient: we should mix I/O-bound and CPU-bound jobs as much as possible so that neither resource is overloaded. I did this here and got an improvement, but I think we can improve further by prioritising the minimap2 rule (will test this soon). Anyway, this can't be done with the current translate_matches rule because it requires all COBS rules to finish:

So this rule basically forces us to run all COBS rules before running any minimap2 rule. I changed this by simply translating the matches of each {batch}____{qfile} instead of all matches, but I am not sure if this is the correct way to do it. I noticed we got a lot more unmapped reads in the output. And in the translated file (in intermediate/02_filter) I can see reads that map to some samples, e.g.:

>23a7f1d1-f52d-4370-9349-d3a0d834e016 SAMEA1561949,SAMEA1561908,SAMEA1561925,SAMEA1561936
CTCAAATTGTACTTCGTTTCAGT...

but also reads that map to no samples (is this correct?), e.g.:

>5ae8ed33-1ee0-46a1-a2c7-7932b99306d3 
CTCATTGTACTTCGTTTCAGTTACGTATTGCTGTTTCTTTGTCG...

The vast majority of reads map to no samples (if my interpretation is correct), and this makes the minimap2 mapping much slower.

Is my reasoning correct? Can we have a translate_matches rule that runs for each batch? I will leave this to you, as I didn't write scripts/filter_queries.py.


karel-brinda · 2022-07-25T13:48:08Z

I'm afraid it's not the correct way. We need to collect all results from matching to know what will be mapped. Most batches will then be basically skipped.

With the approach proposed here, COBS would be faster overall, but minimap2 would be slower, and there would be many garbage results in the output (e.g., S. pneumoniae reads forcibly mapped to Mtb despite low COBS scores).

Besides all the other disadvantages, this would decrease the interpretability of results and also make the results dependent on how genomes are packaged into individual batches.
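
To illustrate the dependence on batch packaging, here is a toy sketch comparing a filter applied per batch with one applied to all matches pooled together. The batch names, sample names, scores, and helper functions are all hypothetical, and ties are ignored for brevity; the pipeline's real data layout may differ.

```python
from collections import defaultdict

# (batch, sample, score) for a single read; all values are made up.
matches = [
    ("streptococcus_pneumoniae__01", "sampleA", 95),
    ("streptococcus_pneumoniae__01", "sampleB", 90),
    ("mycobacterium_tuberculosis__01", "sampleC", 7),
    ("mycobacterium_tuberculosis__02", "sampleD", 6),
]


def keep_top(rows, n):
    """Return the n highest-scoring rows (ties ignored to keep the sketch short)."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]


def filter_globally(rows, n):
    """Rank all matches of the read together, regardless of batch."""
    return keep_top(rows, n)


def filter_per_batch(rows, n):
    """Rank the matches of the read separately within each batch."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row[0]].append(row)
    kept = []
    for batch_rows in by_batch.values():
        kept.extend(keep_top(batch_rows, n))
    return kept


global_samples = {r[1] for r in filter_globally(matches, 2)}
per_batch_samples = {r[1] for r in filter_per_batch(matches, 2)}
print(sorted(global_samples))     # only the high-scoring samples survive
print(sorted(per_batch_samples))  # low-scoring samples survive in their own batches
```

In this toy case the per-batch filter retains the low-scoring Mtb samples simply because they are the best candidates within their own batches, and splitting the genomes into batches differently would change which samples are kept.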

leoisl · 2022-07-25T13:54:08Z

Hey, thanks for the explanation, but I still don't understand the general idea. Right now we just keep the top 100 samples that a read COBS-matches? So we can't parallelise this rule because we need to see all matches, and keeping the top 100 removes spurious matches? What if a gene is present in more than 100 samples (this should be common, given how large this dataset is)?

karel-brinda · 2022-07-25T14:02:00Z

We keep only the top 100 overall across all batches (plus ties, so it can be a lot). This is how people currently use BLAST, and it is also a form of optimization.
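
A minimal sketch of a "top n overall, plus ties" selection as described above. The function name `top_n_with_ties`, the `(sample, score)` tuple layout, and the toy scores are assumptions for illustration; this is not the code in scripts/filter_queries.py.

```python
from typing import List, Tuple

Match = Tuple[str, int]  # (sample name, COBS score); layout assumed for illustration


def top_n_with_ties(matches: List[Match], n: int) -> List[Match]:
    """Keep the n best-scoring matches, plus any matches tied with the n-th one."""
    if n <= 0 or not matches:
        return []
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # score of the n-th best match
    return [m for m in ranked if m[1] >= cutoff]


# Matches for one read, pooled over all batches (hypothetical values).
pooled = [("s1", 31), ("s2", 28), ("s3", 28), ("s4", 28), ("s5", 12)]
kept = top_n_with_ties(pooled, n=2)
print(kept)
```

With `n=2`, four matches are kept here because three samples share the second-best score, which is why the retained set "can be a lot". The cutoff can only be computed once the matches from every batch are available.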

karel-brinda · 2022-07-25T14:02:37Z

It should be documented in the README, as this is critical for understanding the results.

leoisl · 2022-07-25T14:06:49Z

Oh ok, I understand now.

So we have this keep-top-n-matches optimisation, which is mutually exclusive with the optimisation of mixing COBS and minimap2 processes, and it seems we should keep the top-n-matches optimisation? Could you confirm this?

karel-brinda · 2022-07-25T14:08:12Z

Yes, exactly. Also, in the future, more types of specification should be possible, see #20 (but we don't need this for the preprint)

leoisl · 2022-07-25T14:15:16Z

Oh OK. What I was implementing was:

All COBS matches

which is amenable to the optimisation of running COBS and minimap2 together. But indeed we just need 1. Top n matches (currently) for now. Anyway, if this is used in BLAST, it is fairly standard, so I'm happy to keep this as the default.

karel-brinda · 2022-07-25T14:40:09Z

> if this is used in Blast, it is fairly standard, so happy to keep this as the default

People usually look at top matches only.

leoisl · 2022-07-31T20:18:24Z

This is only required if we implement 2. All COBS matches in match filtration (see #20).

karel-brinda added the documentation label Jul 25, 2022

leoisl mentioned this issue Jul 25, 2022

Performance logging of ARGannot_r3.nonamb and all_ebi_plasmids.fa against 661K HQ indexes #52

Closed

leoisl closed this as not planned Jul 31, 2022

leoisl mentioned this issue Jul 31, 2022

Remove unmapped reads from SAM output #78

Closed
