-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for kallisto counts #185
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small changes. Probably should test this with different options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit more work needed on the counts logic I think.
R/kallisto.R
Outdated
dplyr::mutate(across( | ||
.cols = matches('count'), | ||
.fns = ~ as.integer(.x))) | ||
dplyr::mutate(count = as.integer(.data$count)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can combine multiple mutates in one instead of duplicating e.g. https://dplyr.tidyverse.org/reference/mutate.html#ref-examples
I tend to use separate calls only when the expression is complex.
R/sample_data.R
Outdated
@@ -69,9 +69,9 @@ read_sample_data <- function(p, results_dir, tx2gene = NULL) { | |||
kallisto <- p[["kallisto"]] | |||
|
|||
# check which quant input is provided | |||
if (!isnull(salmon)) { | |||
if (is.null(salmon)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the !
disappear?
R/sample_data.R
Outdated
counts <- salmon_counts(salmon, tx2gene = tx2gene) | ||
} else if ((!isnull(kallisto))) { | ||
} else if ((is.null(kallisto))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the !
disappear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic needs a bit more work here. What are the different options you're expecting? One can provide:
- a kallisto file
- a salmon file
- a salmon and a kallisto file
- neither
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the last two options - would it suffice if "Null" is returned for these cases - that is:
else if (!is.null(kallisto) && !is.null(salmon)) {
return(NULL)
} else {
return(NULL)
}
Or we should generate specific error messages in the main rmd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you put a return
statement there it will prematurely terminate the read_sample_data
function, which we don't want I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the dragen_wts_dir
should be removed and instead rnasum should just work directly with file paths? That would simplify this a bit in terms of logic. You'd just be dealing with a list p
which may or may not have elements salmon
and kallisto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the dragen_wts_dir should be removed and instead rnasum should just work directly with file paths?
That'd then also require refactoring at infrastructure level as we use that as an input to CWL workflow?
Not negating - but perhaps something we should add as a to-do for the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the last two options - how about assigning null to counts
and then an assert that statement with an error message in rmd before RNAsum::combineDatasets
call? So it doesn't proceed with further processing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that makes sense please go ahead, sorry if I'm being a bit picky.
If I were to do this properly, I'd create a new function (e.g. check_params
) that takes as input the params
list, checks each element's validity, and outputs a new list with e.g. a counts
element that gets used in the Rmd. But happy for you to implement the kallisto support however you wish, just try to make sure the basic cases are supported and that the logic works. Also I'd try stop the Rmd execution as early as possible upon invalid param specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spent sometime testing with the following variable conditions to make sure rmd execution is stopped earlier if the quant inputs are incorrect (ref: last two commits).
- kallisto file
- salmon file
- salmon and kallisto file
- neither
# check which quant input is provided | ||
if (!is.null(salmon)) { | ||
counts <- salmon_counts(salmon, tx2gene = tx2gene) | ||
} else if (!is.null(kallisto)) { | ||
counts <- kallisto_counts(kallisto, tx2gene = tx2gene) | ||
} else if (!is.null(kallisto) && !is.null(salmon)) { | ||
counts <- NULL | ||
} else { | ||
counts <- NULL | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the logic here, let's say we have the booleans A and B. This is what's happening here:
if (A) { # if salmon is specified
...
} else if (B) { # so salmon is not specified but kallisto is
...
} else if (A && B) { # so salmon is not specified, and kallisto is not specified, but you're checking if both of them are specified
...
} else {
...
}
@@ -122,6 +123,11 @@ ref_dataset.list <- vector("list", length(dataset)) |> set_names(dataset) | |||
ref_genes.list <- RNAsum::get_refgenes(params) | |||
# sample WTS/WGS data | |||
sample_data.list <- RNAsum::read_sample_data(params, results_dir, tx2gene = tx2ensembl) | |||
# check counts input was correctly read and not null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the R session will first call RNAsum::read_sample_data
which is an intensive and time-consuming process. If no salmon or kallisto inputs have been provided, it goes to waste. What if you catch that error at the beginning of RNAsum::read_sample_data
?
Another one for method/input generalisation.
Closes #184.
FYI @JMarzec