Function to generate clustering stats for a set of parameters #10

Open
allyhawkins opened this issue Nov 19, 2024 · 3 comments

@allyhawkins
Member

We currently support generating clustering results using a range of parameters with the sweep_clusters() function, but the functions in calculate-clusters.R only support calculating metrics for one set of clustering results. To make the plots described in #9, it might be helpful to have a function that calculates one or all of the metrics on the clustering output from sweep_clusters().

I think this would take the following arguments:

  • List of data frames with clustering results using different clustering parameters output from sweep_clusters().
  • Metric(s) to calculate. This could be a character vector that specifies which metrics to calculate. For example, providing c("purity", "width") would run both calculate_purity() and calculate_silhouette() on all data frames/clustering results. Alternatively, we could use flags for each metric: width, purity, and stability.

The output would be a list of data frames with one data frame for each metric. That means there would be one data frame that contains all the results from the purity calculations for all clustering results that were output from sweep_clusters(), one for width, and one for stability. Then these data frames could be provided as input to the function for plotting described in #9.
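A minimal sketch of the wrapper proposed above, to make the input and output shapes concrete. The name calculate_sweep_metrics(), the toy_purity() and toy_width() stand-ins, the `sweep_id` column, and the assumption that sweep_clusters() returns a named list of data frames with `cell_id` and `cluster` columns are all hypothetical; the real calculate_purity() and calculate_silhouette() may need more inputs than the cluster data frame alone.

```r
# Hypothetical stand-ins for calculate_purity() and calculate_silhouette();
# they only add a constant column so the example is self-contained.
toy_purity <- function(cluster_df) {
  dplyr::mutate(cluster_df, purity = 1)
}
toy_width <- function(cluster_df) {
  dplyr::mutate(cluster_df, silhouette_width = 0.5)
}

# Hypothetical wrapper: returns one combined data frame per requested metric.
calculate_sweep_metrics <- function(
    sweep_list,
    metrics = c("purity", "width"),
    metric_functions = list(purity = toy_purity, width = toy_width)) {
  # Fail early if an unknown metric name is requested
  stopifnot(all(metrics %in% names(metric_functions)))

  metrics |>
    purrr::set_names() |>
    purrr::map(\(metric) {
      sweep_list |>
        purrr::map(metric_functions[[metric]]) |>
        # Record which parameter set each row came from
        dplyr::bind_rows(.id = "sweep_id")
    })
}

# Toy input, assumed to resemble the list returned by sweep_clusters()
sweep_list <- list(
  resolution_0.5 = data.frame(cell_id = c("cell1", "cell2"), cluster = c(1, 1)),
  resolution_1.0 = data.frame(cell_id = c("cell1", "cell2"), cluster = c(1, 2))
)

metric_dfs <- calculate_sweep_metrics(sweep_list)
```

Here `metric_dfs$purity` and `metric_dfs$width` each hold the rows for every parameter set, which is the shape a plotting function like the one in #9 could take as input.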

@allyhawkins
Member Author

Tagging @cansavvy in case you are interested!

@sjspielman
Member

@cansavvy I'm going to go ahead and assign you here too, since this should actually come before #9.

@jashapiro
Member

> The output would be a list of data frames with one data frame for each metric. That means there would be one data frame that contains all the results from the purity calculations for all clustering results that were output from sweep_clusters(), one for width, and one for stability. Then these data frames could be provided as input to the function for plotting described in #9.

Looking at the proposed output here, I think we might want to have one function that calculates both purity and silhouette width and puts them into a single data frame. If we aren't doing more than just running through the list and producing a new list, I'm not sure these functions would add much clarity beyond the "builtin" way I would process a list, namely with purrr, something like this:

sweep_purity_list <- sweep_results_list |>
  purrr::map(calculate_purity)

In practice, I would expect to do something more like the following to facilitate summary stats and faceted plotting.

sweep_purity_df <- sweep_results_list |>
  purrr::map(calculate_purity) |>
  dplyr::bind_rows()

If we had a calculate_cell_cluster_metrics() function that did purity and silhouette width, we would be able to combine those easily for later plots. (I wonder if that is the only function we really need? How long does it really take to calculate both of these?)
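As a rough sketch of what such a combined function might look like, assuming both metrics come back as one row per cell keyed by `cell_id` and `cluster`. The toy_purity() and toy_silhouette() helpers, the column names, and the join keys are all hypothetical placeholders, not the package's actual functions or output.

```r
# Hypothetical stand-ins for the per-cell metric functions
toy_purity <- function(cluster_df) {
  dplyr::mutate(cluster_df, purity = 1)
}
toy_silhouette <- function(cluster_df) {
  dplyr::mutate(cluster_df, silhouette_width = 0.5)
}

# Sketch of calculate_cell_cluster_metrics(): both per-cell metrics in one data frame
calculate_cell_cluster_metrics <- function(
    cluster_df,
    purity_fn = toy_purity,
    silhouette_fn = toy_silhouette) {
  dplyr::left_join(
    purity_fn(cluster_df),
    silhouette_fn(cluster_df),
    by = c("cell_id", "cluster")
  )
}

# Toy list, assumed to resemble the output of sweep_clusters()
sweep_results_list <- list(
  resolution_0.5 = data.frame(cell_id = c("cell1", "cell2"), cluster = c(1, 1)),
  resolution_1.0 = data.frame(cell_id = c("cell1", "cell2"), cluster = c(1, 2))
)

# Same map-then-bind pattern as above, now yielding both metrics per cell
sweep_metrics_df <- sweep_results_list |>
  purrr::map(calculate_cell_cluster_metrics) |>
  dplyr::bind_rows(.id = "sweep_id")
```

With both metrics in one long data frame, summary stats and faceted plots can group on `sweep_id` without any extra joining.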

Stability would still have to be a separate function, as the ARI there is calculated per bootstrap, not per cell.
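To illustrate why the shapes differ, here is a toy sketch that yields one ARI value per bootstrap replicate rather than one value per cell. It is not the package's stability implementation: stats::kmeans() is swapped in as the clustering step, the simulated matrix stands in for real PCs, and adjusted_rand_index() is a hand-written helper for this example.

```r
# Adjusted Rand index between two label vectors (hypothetical helper, base R only)
adjusted_rand_index <- function(labels_a, labels_b) {
  contingency <- table(labels_a, labels_b)
  sum_cells <- sum(choose(contingency, 2))
  sum_rows <- sum(choose(rowSums(contingency), 2))
  sum_cols <- sum(choose(colSums(contingency), 2))
  expected <- sum_rows * sum_cols / choose(length(labels_a), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

set.seed(2024)
# Simulated two-group data standing in for a PCA matrix
pc_matrix <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
original_clusters <- stats::kmeans(pc_matrix, centers = 2)$cluster

# One row per bootstrap replicate: resample cells, recluster, compare to the original
stability_df <- purrr::map(1:5, \(bootstrap_id) {
  sampled <- sample(nrow(pc_matrix), replace = TRUE)
  bootstrap_clusters <- stats::kmeans(pc_matrix[sampled, ], centers = 2)$cluster
  data.frame(
    bootstrap_id = bootstrap_id,
    ari = adjusted_rand_index(original_clusters[sampled], bootstrap_clusters)
  )
}) |>
  dplyr::bind_rows()
```

The result has as many rows as bootstrap replicates, so it cannot be joined cell by cell with the purity and silhouette data frames.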
