Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change translate_matches rule to not multiplex all COBS results #89

Closed
leoisl opened this issue Jul 25, 2022 · 9 comments
Closed

Change translate_matches rule to not multiplex all COBS results #89

leoisl opened this issue Jul 25, 2022 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@leoisl
Copy link
Collaborator

leoisl commented Jul 25, 2022

This experiment shows that the decompress_and_run_cobs rule is I/O bound (as expected, due to xz-decompress and running COBS) while the batch_align_minimap2 is CPU bound (as expected, as we try to use pipe communication as much as possible when running minimap2). So, running all decompress_and_run_cobs rules before batch_align_minimap2 rule is a bit inefficient - we should mix I/O- and CPU-bound as much as possible to not overload any of the two resources. I did this here and got an improvement, but I think we can improve more by prioritising the minimap2 rule (will test this soon). Anyway, this can't be done with the current translate_matches rule because it requires all COBS rules to finish:

image

So this rule basically forces us to run all COBS rule before running any minimap2 rule. I changed this by simply translating each {batch}____{qfile} matches instead of all matches, but I am not sure if it is the correct way to do it. I noticed we got a looooooooot more unmapped reads in the output. And in the translated file (in intermediate/02_filter) I can see reads that map to some samples, e.g.:

>23a7f1d1-f52d-4370-9349-d3a0d834e016 SAMEA1561949,SAMEA1561908,SAMEA1561925,SAMEA1561936
CTCAAATTGTACTTCGTTTCAGT...

but also reads that map to no samples (is this correct?)? e.g.:

>5ae8ed33-1ee0-46a1-a2c7-7932b99306d3 
CTCATTGTACTTCGTTTCAGTTACGTATTGCTGTTTCTTTGTCG...

the big majority of reads map to no samples (if my interpretation is correct), and this makes minimap2 mapping logically much slower.

Is my reasoning correct? Can we have a translate_matches rule that runs for each batch? I will leave this to you, as I didn't write scripts/filter_queries.py.

@karel-brinda
Copy link
Owner

karel-brinda commented Jul 25, 2022

I'm afraid it's not the correct way. We need to collect all results from matching to know what will be mapped. Most batches will then be basically skipped.

With the approach proposed here, COBS would be overall faster, but Minimap slower, and there would be many garbage results in the output (eg S. pneumoniae reads forcibly mapped to Mtb despite low cobs scores).

Besides all the other disadvantages, this would decrease the interpretability of results and also make the results dependent on how genomes are packaged into individual batches.

@leoisl
Copy link
Collaborator Author

leoisl commented Jul 25, 2022

Hey, thanks for the explanation, but I still don't understand the general idea. Right now we just keep the top 100 samples that a read COBS-match to? So we can't parallelise this rule because we need to see all matches and keeping the top 100 remove spurious matches? What if there is a gene that is present in more than 100 samples (this should be common due to this dataset being large)?

@karel-brinda
Copy link
Owner

We keep only top 100 overall through all batches (+ ties so it can be a lot). This is the way how people currently use BLAST and also a form of optimization.

@karel-brinda karel-brinda added the documentation Improvements or additions to documentation label Jul 25, 2022
@karel-brinda
Copy link
Owner

It should be documented in the readme as this is a critical thing for understanding the results.

@leoisl
Copy link
Collaborator Author

leoisl commented Jul 25, 2022

Oh ok, I understand now.

So we have this keep top n matches as an optimisation, which is mutually exclusive with the optimisation of mixing COBS and minimap2 processes, and it seems we should keep the top n matches optimisation? Could you confirm this?

@karel-brinda
Copy link
Owner

karel-brinda commented Jul 25, 2022

Yes, exactly. Also, in the future, more types of specification should be possible, see #20 (but we don't need this for the preprint)

@leoisl
Copy link
Collaborator Author

leoisl commented Jul 25, 2022

Oh OK. What I was implementing was:

  1. All COBS matches

which is amenable to the optimisation of running COBS and minimap2 together. But indeed we just need 1. Top n matches (currently) for now. Anyway, if this is used in Blast, it is fairly standard, so happy to keep this as the default

@karel-brinda
Copy link
Owner

karel-brinda commented Jul 25, 2022

if this is used in Blast, it is fairly standard, so happy to keep this as the default

People usually look at top matches only.

@leoisl
Copy link
Collaborator Author

leoisl commented Jul 31, 2022

This is just required if we implement: 2. All COBS matches in match filtration (see #20)

@leoisl leoisl closed this as not planned Won't fix, can't repro, duplicate, stale Jul 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants