Change `translate_matches` rule to not multiplex all COBS results #89
Comments
I'm afraid that's not the correct way. We need to collect the results from all matching runs to know what will be mapped. Most batches will then basically be skipped. With the approach proposed here, COBS would be faster overall, but minimap2 would be slower, and there would be many garbage results in the output (e.g. S. pneumoniae reads forcibly mapped to Mtb despite low COBS scores). Besides all the other disadvantages, this would decrease the interpretability of the results and also make them dependent on how genomes are packaged into individual batches.
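To make the batch-dependence point concrete, here is a toy sketch. It assumes, purely for simplicity, that the filter keeps only the best-scoring reference(s) for a read; the batch names, genome names, scores, and the `best_refs` helper are all hypothetical and not taken from the pipeline.

```python
def best_refs(hits):
    """Return the references sharing the highest score in `hits`.

    `hits` is a list of (reference, score) pairs; a higher score is
    assumed to mean a better match.
    """
    top = max(score for _, score in hits)
    return sorted(ref for ref, score in hits if score == top)


# Hypothetical scores for one S. pneumoniae read against two batches.
batches = {
    "streptococcus_pneumoniae": [("spn_genome_1", 0.95)],
    "mycobacterium_tuberculosis": [("mtb_genome_1", 0.10)],
}

# Global filtering: pool every batch first, then keep the best.
pooled = [hit for hits in batches.values() for hit in hits]
global_kept = best_refs(pooled)

# Per-batch filtering: each batch is filtered without seeing the others,
# so the weak Mtb hit survives because it is the best within its own batch.
per_batch_kept = sorted(
    ref for hits in batches.values() for ref in best_refs(hits)
)

print(global_kept)
print(per_batch_kept)
```

With per-batch filtering, the retained set changes whenever genomes are moved between batches, which is the interpretability problem described above.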
Hey, thanks for the explanation, but I still don't understand the general idea. Right now we just keep the top 100 samples that a read COBS-matches to? So we can't parallelise this rule because we need to see all matches, and keeping the top 100 removes spurious matches? What if there is a gene that is present in more than 100 samples (this should be common, given how large this dataset is)?
We keep only the top 100 overall across all batches (plus ties, so it can be a lot). This is how people currently use BLAST, and it is also a form of optimization.
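A minimal sketch of the "top n plus ties" rule as described here, not the actual code in `scripts/filter_queries.py`. It assumes matches from all batches have already been pooled into `(read, reference, score)` tuples, that a higher score is better, and that `n >= 1`; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def top_n_with_ties(matches, n):
    """Keep, per read, the n best references plus any tied with the n-th.

    `matches` is an iterable of (read, reference, score) tuples pooled
    from ALL batches. Assumes n >= 1 and that higher scores are better.
    """
    by_read = defaultdict(list)
    for read, ref, score in matches:
        by_read[read].append((score, ref))

    kept = {}
    for read, hits in by_read.items():
        # Best score first; reference name breaks ties for stable output.
        hits.sort(key=lambda h: (-h[0], h[1]))
        if len(hits) > n:
            cutoff = hits[n - 1][0]  # score of the n-th best hit
            hits = [h for h in hits if h[0] >= cutoff]
        kept[read] = [ref for _, ref in hits]
    return kept


# Hypothetical COBS matches pooled from two batches.
pooled = [
    ("read1", "batchA/g1", 0.9),
    ("read1", "batchA/g2", 0.7),
    ("read1", "batchB/g3", 0.7),
    ("read1", "batchB/g4", 0.2),
]
result = top_n_with_ties(pooled, n=2)
print(result)
```

With `n=2`, three references are kept for `read1` because the second and third hits tie, which is why the retained set can grow well beyond n.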
It should be documented in the README, as this is critical for understanding the results.
Oh ok, I understand now. So we have this keep top
Yes, exactly. Also, in the future, more types of specification should be possible, see #20 (but we don't need this for the preprint).
Oh OK. What I was implementing was:
which is amenable to the optimisation of running COBS and minimap2 together. But indeed we just need 1. Top n matches (currently) for now. Anyway, if this is used in BLAST, it is fairly standard, so happy to keep this as the default.
People usually look at top matches only.
This is just required if we implement: 2. All COBS matches in match filtration (see #20)
This experiment shows that the `decompress_and_run_cobs` rule is I/O bound (as expected, due to xz-decompress and running COBS) while the `batch_align_minimap2` rule is CPU bound (as expected, as we try to use pipe communication as much as possible when running minimap2). So, running all `decompress_and_run_cobs` rules before any `batch_align_minimap2` rule is a bit inefficient: we should mix I/O- and CPU-bound jobs as much as possible to not overload either of the two resources. I did this here and got an improvement, but I think we can improve more by prioritising the `minimap2` rule (will test this soon). Anyway, this can't be done with the current `translate_matches` rule because it requires all `COBS` rules to finish:

So this rule basically forces us to run all `COBS` rules before running any `minimap2` rule. I changed this by simply translating each `{batch}____{qfile}` set of matches instead of all matches, but I am not sure if it is the correct way to do it. I noticed we got a looooooooot more unmapped reads in the output. And in the translated file (in `intermediate/02_filter`) I can see reads that map to some samples, e.g.:

but also reads that map to no samples (is this correct?), e.g.:
The big majority of reads map to no samples (if my interpretation is correct), and this makes `minimap2` mapping logically much slower.

Is my reasoning correct? Can we have a `translate_matches` rule that runs for each batch? I will leave this to you, as I didn't write `scripts/filter_queries.py`.