From 1f3993a28d2db92494d9ed497ae45a8bef3a2c35 Mon Sep 17 00:00:00 2001
From: sreichl
Date: Sun, 30 Jun 2024 15:48:15 +0200
Subject: [PATCH] update docs with Heatmap performance improvements #4

---
 README.md                          | 18 ++++++++++++------
 config/README.md                   |  6 +++---
 config/config.yaml                 |  7 ++++---
 workflow/Snakefile                 |  5 -----
 workflow/envs/fastdist_UNUSED.yaml | 12 ------------
 workflow/envs/sklearn_UNUSED.yaml  |  7 -------
 workflow/envs/umap_UNUSED.yaml     | 13 -------------
 workflow/rules/clustering.smk      |  1 -
 workflow/rules/common.smk          |  3 +--
 9 files changed, 20 insertions(+), 52 deletions(-)
 delete mode 100644 workflow/envs/fastdist_UNUSED.yaml
 delete mode 100644 workflow/envs/sklearn_UNUSED.yaml
 delete mode 100644 workflow/envs/umap_UNUSED.yaml

diff --git a/README.md b/README.md
index fdf7508..9c41a74 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 # Unsupervised Analysis Workflow
 A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.
 
-This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others to find and benefit from the effort and me to prioritize my efforts!
+This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules, check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you; this helps others find and benefit from the effort, and helps me prioritize my efforts!
 
 **If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.8405360](https://doi.org/10.5281/zenodo.8405360).**
 
@@ -38,6 +38,7 @@ This project wouldn't be possible without the following software and their depen
 | clustree       | https://doi.org/10.1093/gigascience/giy083    |
 | ComplexHeatmap | https://doi.org/10.1093/bioinformatics/btw313 |
 | densMAP        | https://doi.org/10.1038/s41587-020-00801-7    |
+| fastcluster    | https://doi.org/10.18637/jss.v053.i09         |
 | ggally         | https://CRAN.R-project.org/package=GGally     |
 | ggplot2        | https://ggplot2.tidyverse.org/                |
 | ggrepel        | https://CRAN.R-project.org/package=ggrepel    |
@@ -69,7 +70,7 @@ Uniform Manifold Approximation projection (UMAP) from umap-learn (ver) [ref] was
 (Optional) We used the density preserving regularization option, densMAP [ref], during the embedding step, with default parameters to account for varying local density of the data within its original high dimensional space.
 
 **Hierarchically Clustered Heatmap**
-Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
+Hierarchically clustered heatmaps of scaled data (z-score) were generated using the function pdist of the Python package scipy (ver) [ref] for the distance matrix calculation (of observations and features), the R implementation of fastcluster (ver) [ref] for hierarchical clustering, and the R package ComplexHeatmap (ver) [ref] for visualization. (Optional) To reduce computational cost, the observations were randomly downsampled to [heatmap:n_observations] and the top [n_features] features were selected by highest variance. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
 
 **Visualization**
 The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results.
 
@@ -115,7 +116,12 @@ The workflow performs the following analyses on each dataset provided in the ann
     - diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
     - connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
 - Hierarchically Clustered Heatmap (.PNG)
-  - hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
+  - Hierarchically clustered heatmaps of scaled data (z-score) with configured distances ([metrics]) and clustering methods ([hclust_methods]).
+  - Distance matrices of observations and features are precomputed using scipy's pdist function.
+  - Hierarchical clustering is performed by the R implementation of fastcluster.
+  - The observations can be randomly downsampled by proportion or absolute number ([n_observations]) to reduce computational cost.
+  - The number of features can be reduced to a proportion or an absolute number of the top variable features ([n_features]) to reduce computational cost.
+  - All combinations are computed and annotated with [metadata_of_interest].
 - Visualization
   - 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
   - interactive 2D and 3D visualizations as self contained HTML files of all projections/embeddings.
@@ -174,7 +180,7 @@ The workflow performs the following analyses on each dataset provided in the ann
 # Usage
 Here are some tips for the usage of this workflow:
 - Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
-- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically based on retries. If a out-of-memory exception occurs the flag `--retries X` can be used to trigger automatic resubmission X time upon failure with X times the memory.
+- Heatmaps require **a lot** of memory, hence options to reduce the computational cost are provided and the memory allocation is scaled dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission up to X times upon failure with X times the memory.
 - Clustification performance scales with available cores, i.e., more cores enable faster internal parallelization of Random Forest training & testing.
 - Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata (can be skipped).
 
@@ -185,12 +191,12 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
 We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./test/):
 - config
   - configuration: config/config.yaml
-  - sample annotation: digits_unsupervised_analysis_annotation.csv
+  - sample annotation: test/config/digits_unsupervised_analysis_annotation.csv
 - data
   - dataset (1797 observations, 64 features): digits_data.csv
   - metadata (consisting of the ground truth label "target"): digits_labels.csv
 - results will be generated in the configured subfolder `./test/results/`
-- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
+- performance: on an HPC it took less than 7 minutes to complete a full run (with up to 32GB of memory per task)
 
 # single-cell RNA sequencing (scRNA-seq) data analysis
 Unsupervised analyses, i.e., dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analyses.
diff --git a/config/README.md b/config/README.md
index dc4c0be..30e96e7 100644
--- a/config/README.md
+++ b/config/README.md
@@ -1,10 +1,10 @@
 # Configuration
 
-You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data, configuration, results an report in `test/` as a starting point.
+You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt, read the comments in the config and/or try the default values. We provide a full example including data and configuration in `test/` as a starting point.
 
 - project configuration (`config/config.yaml`): Different for every project and configures the analyses to be performed.
 - sample annotation (annotation): CSV file consisting of four mandatory columns.
   - name: A unique name of the dataset (tip: keep it short but descriptive).
   - data: Path to the tabular data as comma separated table (CSV).
-  - metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
-  - samples_by_features: Boolean indicator if the data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
+  - metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each observation/sample and every other column metadata for the respective observation (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
+  - samples_by_features: Boolean indicator if the data matrix is observations/samples (rows) x features (columns): 0==no, 1==yes.
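As a quick aid for the annotation format described in config/README.md above, the following minimal Python sketch checks an annotation file against the documented requirements (four mandatory columns, 0/1 indicator, no NaN or empty metadata values). It is illustrative only and not part of the workflow; the file path is a hypothetical example.

```python
# Illustrative check of an annotation CSV against the documented requirements.
# Not part of the workflow; the path below is a hypothetical example.
import pandas as pd

annot = pd.read_csv("config/my_annotation.csv")  # hypothetical annotation file

# the four mandatory columns described in config/README.md
required = {"name", "data", "metadata", "samples_by_features"}
missing = required - set(annot.columns)
assert not missing, f"annotation is missing columns: {missing}"

for _, row in annot.iterrows():
    # samples_by_features must be a 0/1 indicator
    assert row["samples_by_features"] in (0, 1), "samples_by_features must be 0 or 1"

    # metadata: first column is the observation index; no NaN or empty values allowed
    metadata = pd.read_csv(row["metadata"], index_col=0)
    assert not metadata.isna().any().any(), f"NaN/empty values in metadata of '{row['name']}'"
```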
diff --git a/config/config.yaml b/config/config.yaml
index dc12010..d46a014 100644
--- a/config/config.yaml
+++ b/config/config.yaml
@@ -13,6 +13,7 @@ project_name: digits
 
 ##### PCA #####
 # https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
+# especially relevant for large data
 pca:
     n_components: 0.9 # variance as float (0-1], number of components as int e.g., 50, or 'mle'
     svd_solver: 'auto' # options: ‘auto’, ‘full’, ‘covariance_eigh’, ‘arpack’, ‘randomized’
@@ -36,13 +37,13 @@ umap:
 ##### HEATMAP #####
 # information on the ComplexHeatmap parameters: https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html
 # distance metrics: for rows and columns. all metrics that are supported by scipy.spatial.distance.pdist (https://docs.scipy.org/doc/scipy-1.14.0/reference/generated/scipy.spatial.distance.pdist.html)
-# clustering methods: methods for hierarchical clustering that are supported by stats::hclust() (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust)
+# clustering methods: methods for hierarchical clustering that are supported by fastcluster's R implementation (https://danifold.net/fastcluster.html)
 # it is the most resource (memory) intensive method, leave empty [] if not required
 heatmap:
     metrics: ['correlation','cosine']
     hclust_methods: ['complete']
-    n_observations: 1000 # random sampled proportion float [0-1] or absolute number as integer
-    n_features: 0.5 # highly variable features percentate float [0-1] or absolute number as integer
+    n_observations: 1000 # randomly sampled proportion as float (0-1] or absolute number as integer
+    n_features: 0.5 # highly variable features proportion as float (0-1] or absolute number as integer
 
 ##### LEIDEN #####
 # Leiden clustering applied on UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).
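To illustrate how the heatmap options configured above interact (proportion vs. absolute interpretation of `n_observations`/`n_features`, z-scoring, and precomputed `pdist` distances), here is a rough Python sketch. It is an assumption-laden illustration, not the workflow's own script; the data path and the helper function name are hypothetical.

```python
# Rough sketch of the Heatmap preprocessing described in the docs:
# downsample observations, keep top-variance features, precompute distances.
# Illustrative only; this is not the workflow's actual script.
import pandas as pd
from scipy.spatial.distance import pdist

def subset_for_heatmap(data, n_observations=1000, n_features=0.5, seed=42):
    """Interpret values in (0, 1] as proportions and larger integers as absolute counts."""
    n_obs = int(round(n_observations * data.shape[0])) if 0 < n_observations <= 1 else int(n_observations)
    n_feat = int(round(n_features * data.shape[1])) if 0 < n_features <= 1 else int(n_features)

    # randomly downsample observations (rows)
    data = data.sample(n=min(n_obs, data.shape[0]), random_state=seed)

    # keep the most variable features (columns)
    top_var = data.var(axis=0).nlargest(min(n_feat, data.shape[1])).index
    return data.loc[:, top_var]

data = pd.read_csv("test/data/digits_data.csv", index_col=0)  # hypothetical path to the example data
subset = subset_for_heatmap(data)

# z-score per feature, then condensed distance vectors for observations and features
# (features with zero variance should be excluded before z-scoring)
scaled = (subset - subset.mean()) / subset.std(ddof=0)
dist_observations = pdist(scaled.values, metric="correlation")
dist_features = pdist(scaled.values.T, metric="correlation")
```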
diff --git a/workflow/Snakefile b/workflow/Snakefile
index c0724c8..47792cf 100644
--- a/workflow/Snakefile
+++ b/workflow/Snakefile
@@ -142,11 +142,6 @@ rule all:
                           n_components=[dims for dims in config["umap"]["n_components"] if dims in [2,3]]
                           ) if 2 in config["umap"]["n_components"] or 3 in config["umap"]["n_components"] else [],
         # Heatmap
-#        distance_matrices = expand(os.path.join(result_path,'{sample}','Heatmap','DistanceMatrix_{metric}_{type}.csv'),
-#                            sample=list(annot.index),
-#                            metric=config["heatmap"]["metrics"],
-#                            type=["observations","features"],
-#                            ),
         heatmap_plots = expand(os.path.join(result_path,'{sample}','Heatmap','plots','Heatmap_{metric}_{method}.png'),
                                sample=list(annot.index),
                                method=config["heatmap"]["hclust_methods"],
diff --git a/workflow/envs/fastdist_UNUSED.yaml b/workflow/envs/fastdist_UNUSED.yaml
deleted file mode 100644
index d3916d7..0000000
--- a/workflow/envs/fastdist_UNUSED.yaml
+++ /dev/null
@@ -1,12 +0,0 @@
-channels:
-  - conda-forge
-  - bioconda
-  - defaults
-dependencies:
-  - scikit-learn
-  - pandas=1.5.0
-  - numpy
-  - numba
-  - pip
-  - pip:
-    - fastdist==1.1.6
\ No newline at end of file
diff --git a/workflow/envs/sklearn_UNUSED.yaml b/workflow/envs/sklearn_UNUSED.yaml
deleted file mode 100644
index c90d89c..0000000
--- a/workflow/envs/sklearn_UNUSED.yaml
+++ /dev/null
@@ -1,7 +0,0 @@
-channels:
-  - conda-forge
-  - defaults
-dependencies:
-  - scikit-learn=1.1.2
-  - pandas=1.5.0
-  - numpy=1.23.3
\ No newline at end of file
diff --git a/workflow/envs/umap_UNUSED.yaml b/workflow/envs/umap_UNUSED.yaml
deleted file mode 100644
index 6cca433..0000000
--- a/workflow/envs/umap_UNUSED.yaml
+++ /dev/null
@@ -1,13 +0,0 @@
-channels:
-  - conda-forge
-  - bioconda
-  - defaults
-dependencies:
-  - umap-learn=0.5.5
-  - pandas=1.5.0
-  - python=3.9
-  - numpy=1.23.3
-  - pynndescent=0.5.11
-  - pip
-  - pip:
-    - umap-learn[plot]
\ No newline at end of file
diff --git a/workflow/rules/clustering.smk b/workflow/rules/clustering.smk
index d327da0..10928b1 100644
--- a/workflow/rules/clustering.smk
+++ b/workflow/rules/clustering.smk
@@ -83,7 +83,6 @@ rule aggregate_all_clustering_results:
         # read each clustering result and add to data dict
         for filename in input:
             agg_clust.append(pd.read_csv(filename, header=0, index_col=0))
-        # convert the dictionary to a DataFrame
         agg_clust_df = pd.concat(agg_clust, axis=1)
 
 
diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk
index b7b377d..477f86c 100644
--- a/workflow/rules/common.smk
+++ b/workflow/rules/common.smk
@@ -134,8 +134,7 @@ def get_external_validation_paths(wildcards):
 
 # get paths to determine internal cluster indices
 def get_internal_validation_paths(wildcards):
-    return {#'data': annot.loc[wildcards.sample,'data'],
-            'metadata': annot.loc[wildcards.sample,"metadata"],
+    return {'metadata': annot.loc[wildcards.sample,"metadata"],
             'clusterings': os.path.join(result_path,wildcards.sample, "metadata_clusterings.csv"),
             'pca': os.path.join(result_path,wildcards.sample,'PCA','PCA_{}_{}_data.csv'.format(config["pca"]["svd_solver"],config["pca"]["n_components"])),
             'pca_var': os.path.join(result_path,wildcards.sample,'PCA','PCA_{}_{}_var.csv'.format(config["pca"]["svd_solver"],config["pca"]["n_components"]))
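The Usage tip about heatmap memory and `--retries X` relies on memory requests that grow with each retry attempt. A minimal sketch of how such attempt-scaled memory can be expressed for a Snakemake rule is shown below; the base value is an assumption and not taken from the workflow.

```python
# Sketch of attempt-scaled memory for Snakemake's retry mechanism.
# The base allocation below is an assumed value, not taken from the workflow.
def get_mem_mb(wildcards, attempt):
    """Scale the memory request linearly with the retry attempt (1, 2, 3, ...)."""
    base_mem_mb = 16000  # assumed base allocation for a heatmap job
    return base_mem_mb * attempt

# In a rule, the callable would be referenced as, e.g.:
#   resources:
#       mem_mb = get_mem_mb
# Running with `--retries 2` re-submits a failed job with 2x and then 3x
# the base memory on the first and second retry, respectively.
```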