Parallelize when possible? #10

Open
federicomarini opened this issue Jan 26, 2021 · 3 comments

Comments

@federicomarini
Owner

As the samples are processed one by one, it might be worth parallelizing this step to shorten runtimes, especially when running many samples at once.

BiocParallel might provide a convenient way to do so.
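
For illustration, a minimal sketch of what per-sample parallelization with BiocParallel could look like. The function `process_one_sample()` and the toy matrix are hypothetical stand-ins, not the package's actual per-sample code; only `bplapply()`, `SerialParam()`, `MulticoreParam()` and `SnowParam()` are real BiocParallel calls.

```r
library(BiocParallel)

# Hypothetical stand-in for the per-sample deconvolution step;
# it only returns the column sum so the sketch stays runnable.
process_one_sample <- function(sample_name, expr_mat) {
  sum(expr_mat[, sample_name])
}

# Toy expression matrix: 4 genes x 3 samples (assumed layout).
expr_mat <- matrix(
  1:12, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# SerialParam() runs sequentially; swap in MulticoreParam(workers = 2)
# or SnowParam(workers = 2) to actually distribute the samples.
bp <- SerialParam()

# One task per sample; extra arguments are forwarded to the function.
res <- bplapply(colnames(expr_mat), process_one_sample,
                expr_mat = expr_mat, BPPARAM = bp)
names(res) <- colnames(expr_mat)
per_sample <- unlist(res)
```

Exposing the `BPPARAM` object as a function argument would probably be the natural design here, since it lets users pick the backend without any change to the package code.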

@federicomarini
Owner Author

Related to this:
I did some profiling on the main function that runs quantiseq and noticed that the bottleneck is actually before that, namely in the mapGenes function.
So, after some in-depth debugging, I came to think that the solution in e0a8731 should be robust enough.
Maybe worth porting to the current state of immunedeconv, so I am pinging @grst on this 😉

Then: an additional thing to do would be to run the aggregation only on the rows that have duplicated row names, which would speed it up "massively enough" that we won't really need to parallelize.
Happy to wrap up a tiny PR if you're all good on this!
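
A rough sketch of the "aggregate only the duplicated rows" idea in base R. It assumes a numeric matrix whose row names are gene symbols and that duplicates are combined by summing; both are assumptions for illustration, and `aggregate_duplicated_rows()` is a hypothetical helper, not the code in mapGenes.

```r
# Hypothetical helper: collapse only rows whose names occur more than once.
aggregate_duplicated_rows <- function(mat) {
  rn <- rownames(mat)
  # Fast path: no duplicated row names, so nothing to aggregate.
  if (anyDuplicated(rn) == 0) {
    return(mat)
  }
  # Flag every copy of a duplicated name, not just the later occurrences.
  is_dup <- rn %in% unique(rn[duplicated(rn)])
  # Sum the duplicated rows per gene (assumed summary function).
  collapsed <- rowsum(mat[is_dup, , drop = FALSE], group = rn[is_dup])
  # Unique rows pass through untouched; collapsed rows are appended.
  rbind(mat[!is_dup, , drop = FALSE], collapsed)
}

# Toy input: gene "A" appears twice.
mat <- matrix(
  c(1, 2, 3, 4,
    10, 20, 30, 40),
  ncol = 2,
  dimnames = list(c("A", "B", "A", "C"), c("s1", "s2"))
)
out <- aggregate_duplicated_rows(mat)
```

Note that this changes the row order (collapsed genes end up at the bottom), so anything downstream that relies on the original ordering would need a reordering step.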

@grst
Collaborator

grst commented Jan 29, 2021

A dplyr group_by(gene_symbol) %>% summarise_all(sum) should be considerably faster than base R.

Happy to include the parallelized version into immunedeconv, but probably it's easiest to wait until this package is more or less ready and then port immunedeconv to use it as a dependency.
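
To make the dplyr suggestion above concrete, here is a small sketch on toy data. The column names `gene_symbol`, `s1` and `s2` are assumptions, and it uses `across()`, which superseded `summarise_all()` in dplyr 1.0.0; the grouped sum itself is the same.

```r
library(dplyr)

# Toy expression table with a duplicated gene symbol (assumed layout).
expr <- data.frame(
  gene_symbol = c("A", "B", "A", "C"),
  s1 = c(1, 2, 3, 4),
  s2 = c(10, 20, 30, 40)
)

# Sum all sample columns within each gene symbol.
collapsed <- expr %>%
  group_by(gene_symbol) %>%
  summarise(across(everything(), sum), .groups = "drop")
```

Whether this is actually faster than a base R approach on real data would need to be benchmarked; the sketch only shows the shape of the call.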

@federicomarini
Owner Author

As of now no parallelization is done, just a conditional check: from my understanding, this aggregation needs to be done only if any row names are duplicated.

But as you said: probably best to give it time to settle in here and then just use it via Imports.

2 participants