Function to generate clustering stats for a set of parameters #10

allyhawkins · 2024-11-19T20:56:34Z

We currently support generating clustering results across a range of parameters with the sweep_clusters() function, but the functions in calculate-clusters.R only support calculating metrics for one set of clustering results. To make the plots described in #9, it might be helpful to have a function that calculates one or all of the metrics on the clustering output from sweep_clusters().

I think this would take the following arguments:

List of data frames with clustering results using different clustering parameters, as output from sweep_clusters().
Metric(s) to calculate. This could be a character vector that specifies which metrics to calculate. For example, providing c("purity", "width") would run both calculate_purity() and calculate_silhouette() on all data frames (clustering results). Alternatively, we could use a flag for each metric: width, purity, and stability.

The output would be a list of data frames with one data frame for each metric. That means there would be one data frame that contains all the results from the purity calculations for all clustering results that were output from sweep_clusters(), one for width, and one for stability. Then these data frames could be provided as input to the function for plotting described in #9.
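
As a rough illustration of the interface described above, here is a minimal base R sketch. The wrapper name calculate_sweep_metrics(), the metric_functions lookup, the toy column names, and the two stand-in metric functions are all hypothetical; the real calculate_purity() and calculate_silhouette() may need additional inputs that are omitted here.

```r
# Hypothetical stand-ins for the real metric functions. Each takes one
# clustering result (a data frame) and returns it with a metric column added.
toy_purity <- function(cluster_df) {
  cluster_df$purity <- 1 # placeholder value, not a real purity calculation
  cluster_df
}
toy_width <- function(cluster_df) {
  cluster_df$silhouette_width <- 0 # placeholder value, not a real silhouette width
  cluster_df
}

# Hypothetical wrapper: apply each requested metric to every clustering result
# and return a named list with one combined data frame per metric.
calculate_sweep_metrics <- function(sweep_list, metrics = c("purity", "width")) {
  metric_functions <- list(purity = toy_purity, width = toy_width)
  metrics <- match.arg(metrics, names(metric_functions), several.ok = TRUE)

  results <- lapply(metrics, function(metric) {
    per_result <- lapply(sweep_list, metric_functions[[metric]])
    do.call(rbind, per_result)
  })
  names(results) <- metrics
  results
}

# Toy input shaped like a list of clustering results (column names are assumptions)
sweep_list <- list(
  data.frame(cell_id = c("a", "b"), cluster = c(1, 2), resolution = 0.5),
  data.frame(cell_id = c("a", "b"), cluster = c(1, 1), resolution = 1.0)
)
metric_dfs <- calculate_sweep_metrics(sweep_list)
```

Here metric_dfs$purity and metric_dfs$width each hold the rows from every clustering result, which is the shape a plotting function like the one in #9 could take as input.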


allyhawkins · 2024-11-19T20:57:38Z

Tagging @cansavvy in case you are interested!

sjspielman · 2024-12-16T21:30:33Z

@cansavvy I'm going to go ahead and assign you here too, since this should actually come before #9.

jashapiro · 2024-12-16T21:55:30Z

The output would be a list of data frames with one data frame for each metric. That means there would be one data frame that contains all the results from the purity calculations for all clustering results that were output from sweep_clusters(), one for width, and one for stability. Then these data frames could be provided as input to the function for plotting described in #9.

Looking at the proposed output here, I think we might want to have one function that calculates both purity and silhouette width and puts them into a single data frame. If we aren't doing more than just running through the list and producing a new list, I'm not sure these functions would add much clarity beyond the "builtin" way I would process a list, namely with purrr, something like this:

sweep_purity_list <- sweep_results_list |>
  purrr::map(calculate_purity)

In practice, I would expect to do something more like the following to facilitate summary stats and faceted plotting.

sweep_purity_df <- sweep_results_list |>
  purrr::map(calculate_purity) |>
  dplyr::bind_rows()

If we had a calculate_cell_cluster_metrics() function that did purity and silhouette width, we would be able to combine those easily for later plots. (I wonder if that is the only function we really need? How long does it really take to calculate both of these?)

Stability would still have to be a separate function, as the ARI there is calculated per bootstrap, not per cell.
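
To make the combined per-cell idea above concrete, here is a minimal base R sketch. Everything in it is an assumption for illustration: sketch_cell_cluster_metrics() is a hypothetical name (not the function tracked elsewhere), the two toy metric functions only return placeholder values, and the column names are invented. lapply() and do.call(rbind, ...) stand in for purrr::map() and dplyr::bind_rows() so the example runs without extra packages.

```r
# Hypothetical stand-ins for the per-cell metric functions; values are placeholders.
toy_purity <- function(cluster_df) {
  cluster_df$purity <- 1
  cluster_df
}
toy_silhouette <- function(cluster_df) {
  cluster_df$silhouette_width <- 0
  cluster_df
}

# Hypothetical combined function: one row per cell, both metrics as columns.
sketch_cell_cluster_metrics <- function(cluster_df) {
  purity_df <- toy_purity(cluster_df)
  width_df <- toy_silhouette(cluster_df)
  # Join on every original column so each cell keeps its clustering parameters
  merge(purity_df, width_df, by = colnames(cluster_df))
}

# Toy sweep output (column names are assumptions)
sweep_results_list <- list(
  data.frame(cell_id = c("a", "b"), cluster = c(1, 2), resolution = 0.5),
  data.frame(cell_id = c("a", "b"), cluster = c(1, 1), resolution = 1.0)
)

# Map over the list, then stack into a single data frame for faceted plotting
sweep_metrics_df <- do.call(
  rbind,
  lapply(sweep_results_list, sketch_cell_cluster_metrics)
)
```

Because both metrics are per cell, they fit side by side in one data frame; a per-bootstrap quantity like the stability ARI has a different row unit and would not join this way.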

sjspielman assigned cansavvy Dec 16, 2024

This was referenced Dec 18, 2024

Adding a warning messaging for when you only got 1 group for your clustering results #22

Open

Add calculate_cell_cluster_metrics() function #23

Open
