Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for kallisto counts #185

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Adding support for kallisto counts #185

wants to merge 7 commits into from

Conversation

skanwal
Copy link
Member

@skanwal skanwal commented Jan 16, 2025

Another one for method/input generalisation.
Closes #184.

FYI @JMarzec

@skanwal skanwal requested a review from pdiakumis January 16, 2025 22:58
Copy link
Member

@pdiakumis pdiakumis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small changes. Probably should test this with different options.

R/kallisto.R Show resolved Hide resolved
R/kallisto.R Outdated Show resolved Hide resolved
R/sample_data.R Outdated Show resolved Hide resolved
R/sample_data.R Outdated Show resolved Hide resolved
Copy link
Member

@pdiakumis pdiakumis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit more work needed on the counts logic I think.

R/kallisto.R Outdated
dplyr::mutate(across(
.cols = matches('count'),
.fns = ~ as.integer(.x)))
dplyr::mutate(count = as.integer(.data$count))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can combine multiple mutates in one instead of duplicating e.g. https://dplyr.tidyverse.org/reference/mutate.html#ref-examples

I tend to use separate calls only when the expression is complex.

R/sample_data.R Outdated
@@ -69,9 +69,9 @@ read_sample_data <- function(p, results_dir, tx2gene = NULL) {
kallisto <- p[["kallisto"]]

# check which quant input is provided
if (!isnull(salmon)) {
if (is.null(salmon)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the ! disappear?

R/sample_data.R Outdated
counts <- salmon_counts(salmon, tx2gene = tx2gene)
} else if ((!isnull(kallisto))) {
} else if ((is.null(kallisto))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the ! disappear?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic needs a bit more work here. What are the different options you're expecting? One can provide:

  • a kallisto file
  • a salmon file
  • a salmon and a kallisto file
  • neither

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the last two options - would it suffice if "Null" is returned for these cases - that is:

else if (!is.null(kallisto) && !is.null(salmon)) {
    return(NULL)
  } else {
    return(NULL)
  }

Or we should generate specific error messages in the main rmd?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put a return statement there it will prematurely terminate the read_sample_data function, which we don't want I believe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the dragen_wts_dir should be removed and instead rnasum should just work directly with file paths? That would simplify this a bit in terms of logic. You'd just be dealing with a list p which may or may not have elements salmon and kallisto.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the dragen_wts_dir should be removed and instead rnasum should just work directly with file paths?

That'd then also require refactoring at infrastructure level as we use that as an input to CWL workflow?
Not negating - but perhaps something we should add as a to-do for the next release.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the last two options - how about assigning null to counts and then an assert that statement with an error message in rmd before RNAsum::combineDatasets call? So it doesn't proceed with further processing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that makes sense please go ahead, sorry if I'm being a bit picky.
If I were to do this properly, I'd create a new function (e.g. check_params) that takes as input the params list, checks each element's validity, and outputs a new list with e.g. a counts element that gets used in the Rmd. But happy for you to implement the kallisto support however you wish, just try to make sure the basic cases are supported and that the logic works. Also I'd try stop the Rmd execution as early as possible upon invalid param specification.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent sometime testing with the following variable conditions to make sure rmd execution is stopped earlier if the quant inputs are incorrect (ref: last two commits).

  • kallisto file
  • salmon file
  • salmon and kallisto file
  • neither

Comment on lines +71 to +80
# check which quant input is provided
if (!is.null(salmon)) {
counts <- salmon_counts(salmon, tx2gene = tx2gene)
} else if (!is.null(kallisto)) {
counts <- kallisto_counts(kallisto, tx2gene = tx2gene)
} else if (!is.null(kallisto) && !is.null(salmon)) {
counts <- NULL
} else {
counts <- NULL
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the logic here, let's say we have the booleans A and B. This is what's happening here:

  if (A) { # if salmon is specified
    ...
  } else if (B) { # so salmon is not specified but kallisto is
    ...
  } else if (A && B) { # so salmon is not specified, and kallisto is not specified, but you're checking if both of them are specified
    ...
  } else {
    ...
  }

@@ -122,6 +123,11 @@ ref_dataset.list <- vector("list", length(dataset)) |> set_names(dataset)
ref_genes.list <- RNAsum::get_refgenes(params)
# sample WTS/WGS data
sample_data.list <- RNAsum::read_sample_data(params, results_dir, tx2gene = tx2ensembl)
# check counts input was correctly read and not null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the R session will first call RNAsum::read_sample_data which is an intensive and time-consuming process. If no salmon or kallisto inputs have been provided, it goes to waste. What if you catch that error at the beginning of RNAsum::read_sample_data?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kallisto counts
2 participants