-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add calculate_cell_cluster_metrics() function #23
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
R/evaluate-clusters.R
Outdated
#' The cell id column's values should match either the PC matrix row names, or the | ||
#' SingleCellExperiment/Seurat object cell ids. Typically this data frame will be | ||
#' output from the `rOpenScPCA::calculate_clusters()` function. | ||
#' @param ... Additional argument are passed on to the respective `calculate_purity()` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize that I didn't look to closely into these arguments and if someone wanted to specify a different argument for purity versus silhouette they would not be able to do so this way because all arguments get pass to both functions. If we think this will be a common use case I can go back and adjust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these arguments are passed to different functions with different expectations, these are likely to conflict. I think it is probably better for this convenience function not to use ...
. But I don't know that we need to pass further options; if something more complex is needed, the user can run purrr::map
on their own.
R/evaluate-clusters.R
Outdated
#' | ||
#' set.seed(2024) | ||
#' | ||
#' sce_object <- splatter::simpleSimulate(nGenes = 1000, verbose = FALSE) |> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these steps just because I wanted to illustrating how I was testing this but if this is too much detail for this example we can trim this down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably too much detail just in the sense that a novice might look at this and say, "oh no, do i need splatter?"
I would simplify to by assuming an sce_object
variable is already known/exists. Consistent with other evaluation function examples, you don't need to pull out the PCA either; just pass in the object directly. Let's have the example therefore just "run" (i.e., keep the \dontrun{}
construct!) sweep_clusters()
and calculate_cell_cluster_metrics()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an example sce_object that already exists I can pull from? If so how do I call it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these examples are not run, it is fine to just "assume" an sce_object
exists for this section, and you can start from the sweep_clusters()
step (skipping the prep).
Still left to do here:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution! My main comment here (aside from my earlier misread) is that I still wonder if we want this function to work on a list, rather than just on a single clustering data frame. The convenience of calculating all the metrics at once makes sense to me, but if we have a function that evaluates a single data frame, then turning that into evaluating the list is a very simple addition of a purrr
wrapper, and I think that seems more transparent. But others may disagree!
R/evaluate-clusters.R
Outdated
#' The cell id column's values should match either the PC matrix row names, or the | ||
#' SingleCellExperiment/Seurat object cell ids. Typically this data frame will be | ||
#' output from the `rOpenScPCA::calculate_clusters()` function. | ||
#' @param ... Additional argument are passed on to the respective `calculate_purity()` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these arguments are passed to different functions with different expectations, these are likely to conflict. I think it is probably better for this convenience function not to use ...
. But I don't know that we need to pass further options; if something more complex is needed, the user can run purrr::map
on their own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting this!! I left some initial comments here, but before I look more, I have a thought about the use cases for this function - currently it's written to only run on a sweep list, but it makes sense to me to also make this flexible enough to run on single data frame (e.g., not a list of data frames). This means updates related to the input argument sweep_list
:
- First, give it a more flexible name... Maybe something like
cluster_results
? I don't love it, but I'm not sure of how else to communicate that it might be either a list of dfs or df, so really I don't hate it either! - Second, add a check if it's a data frame and if so, make it a list of length 1 with the given data frame in it. So the code might look like:
if (data frame) { listify it}
else { run the existing stopifnot checks}
- I suppose we'd need another check to determine whether to return the list or the first index from the eval'd df, too, so you might actually structure this code by first defining an
is_df
variable or so, and using that for both checks (the opening sanity check to transform it into a list to play nicely with purrr, and the final check for what to return.
R/evaluate-clusters.R
Outdated
#' | ||
#' set.seed(2024) | ||
#' | ||
#' sce_object <- splatter::simpleSimulate(nGenes = 1000, verbose = FALSE) |> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably too much detail just in the sense that a novice might look at this and say, "oh no, do i need splatter?"
I would simplify to by assuming an sce_object
variable is already known/exists. Consistent with other evaluation function examples, you don't need to pull out the PCA either; just pass in the object directly. Let's have the example therefore just "run" (i.e., keep the \dontrun{}
construct!) sweep_clusters()
and calculate_cell_cluster_metrics()
Co-authored-by: Joshua Shapiro <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
for more information, see https://pre-commit.ci
…savvy/multi_sweep
for more information, see https://pre-commit.ci
…savvy/multi_sweep
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Co-authored-by: Stephanie Spielman <[email protected]>
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting I haven't run this yet, but will run after this round of review! Let me know if I can clarify anything here :)
R/sweep-clusters.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much take your point here about the cluster_df
name confusion, definitely fine with this change
#' resolution = 0.1, | ||
#' seed = 11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bringing in @jashapiro if you have an opinion here -
In hello-clusters
usage examples, we only set the seed once at the beginning of the code (as you did above with set.seed(2024)
), but not passing any seed into individual functions. When I was reviewing and saw this, I wondered maybe we should remove the seed here as well. But then I checked the other doc examples in rOpenScPCA
itself and it seems we do not use set.seed()
anywhere in these docs but pass in a seed argument directly all around!
I sort of think we want to firm this up and generally present set.seed()
in all examples and not the seed argument, but maybe that level of consistency is Stephanie-style overkill. Who has thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove passing in a seed in the examples (all around I guess), but I would also not show set.seed()
in the package code examples. That level of detail makes the most sense to me in the example notebooks, not in individual function docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove passing in a seed in the examples (all around I guess),
@cansavvy I can take care of this separately unless you are feeling very eager, up to you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go for it! I might not get to it for a few days!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not necessarily either, so I'll make it an issue!
df <- calculate_purity( | ||
x = x, | ||
cluster_df = df, | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ...
can definitely be kept (6539ac9), you just need a @param
line for it in the roxygen.
In this case we should keep it since there are other arguments which might be needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but we decided in a previous conversation @jashapiro didn't think a ... was ever going to be used so I removed it #23 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, right! delete away.
#' algorithm = "walktrap", | ||
#' weighting = "jaccard", | ||
#' nn = c(10, 15, 25), | ||
#' resolution = c(0.75, 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this parameter isn't used with walktrap, so we'd want to remove it here or change the algorithm
R/evaluate-clusters.R
Outdated
#' @param cluster_results A single data frame or list of data frames obtained from | ||
#' `rOpenScPCA::sweep_clusters()`. Each data frame in the list should contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is where I would say that the single data frame is typically from rOpenScPCA::calculate_clusters()
.
R/evaluate-clusters.R
Outdated
#' `rOpenScPCA::calculate_silhouette()` and/or `rOpenScPCA::calculate_purity()`, based on | ||
#' `calculate_silhouette()` functions output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this?
based on
calculate_silhouette()
functions output.
Co-authored-by: Stephanie Spielman <[email protected]>
for more information, see https://pre-commit.ci
…savvy/multi_sweep
Co-authored-by: Stephanie Spielman <[email protected]>
…savvy/multi_sweep
Background
This PR is for #10. Tried to follow the context there. But since this is my first PR on this project I might be missing some insights so Im just posting this as a draft first so that someone can check that I'm in the generally right direction.
Summary
This is a function that takes output from
sweep_clusters()
and runs evals on it. It can runcalculate_silhouette()
and/orcalculate_purity()
on all elements of the list of data frames that are outputed fromsweep_clusters()
.An additional very nit picky thing I have in here but I was very minorly thrown off by the examples in sweep_clusters() being named
cluster_df
when they are actually lists of data frames and not data frames out right. If you don't like this change no worries. The documentation itself is very clear but I'm just a person who kinda goes straight for the examples first.Requested feedback
Am I understanding this right? It doesn't seem like this function will be that useful but I trust ya'll have more context and knowledge about the needs of the project than I who just started looking at this stuff last week lol.
Side side question that is also very minor, can we name the data frames in the list according to the combo of their parameters or do we find that would be too clunky? I like names in my lists but this is maybe a cansavvy quirk we don't need to subject everyone else to.