update docs with Heatmap performance improvements #4
sreichl committed Jun 30, 2024
1 parent 12f669f commit 1f3993a
Showing 9 changed files with 20 additions and 52 deletions.
18 changes: 12 additions & 6 deletions README.md
@@ -3,7 +3,7 @@
# Unsupervised Analysis Workflow
A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.

-This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others to find and benefit from the effort and me to prioritize my efforts!
+This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules, check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you; this helps others find and benefit from the effort, and helps me prioritize my efforts!

**If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.8405360](https://doi.org/10.5281/zenodo.8405360).**

@@ -38,6 +38,7 @@ This project wouldn't be possible without the following software and their dependencies:
| clustree | https://doi.org/10.1093/gigascience/giy083 |
| ComplexHeatmap | https://doi.org/10.1093/bioinformatics/btw313 |
| densMAP | https://doi.org/10.1038/s41587-020-00801-7 |
+| fastcluster | https://doi.org/10.18637/jss.v053.i09 |
| ggally | https://CRAN.R-project.org/package=GGally |
| ggplot2 | https://ggplot2.tidyverse.org/ |
| ggrepel | https://CRAN.R-project.org/package=ggrepel |
@@ -69,7 +70,7 @@ Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] was
(Optional) We used the density-preserving regularization option, densMAP [ref], during the embedding step with default parameters to account for varying local density of the data within its original high-dimensional space.

**Hierarchically Clustered Heatmap**
-Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
+Hierarchically clustered heatmaps of scaled data (z-score) were generated using the Python package scipy's (ver) [ref] function pdist for distance-matrix calculation (for observations and features), fastcluster's R implementation (ver) [ref] for hierarchical clustering, and the R package ComplexHeatmap (ver) [ref] for visualization. (Optional) To reduce computational cost, the observations were downsampled to [heatmap:n_observations] and the top [n_features] features were selected by high variance. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
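
To make the pipeline concrete, here is a minimal Python sketch of the preprocessing described above on synthetic stand-in data (the sizes and metric are placeholders, and the actual hierarchical clustering and plotting happen in R via fastcluster and ComplexHeatmap):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Synthetic stand-in data: 5000 observations x 200 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 200))

# (Optional) downsample observations and keep the most variable features.
obs_idx = rng.choice(data.shape[0], size=1000, replace=False)  # ~ heatmap:n_observations
top_feat = np.argsort(data.var(axis=0))[::-1][:100]            # ~ heatmap:n_features
subset = data[np.ix_(obs_idx, top_feat)]

# Scale to z-scores per feature.
scaled = (subset - subset.mean(axis=0)) / subset.std(axis=0)

# Precompute condensed distance matrices for observations and features.
dist_obs = pdist(scaled, metric="correlation")
dist_feat = pdist(scaled.T, metric="correlation")

# Clip the color range to the 1st/99th percentiles to blunt outliers.
color_lo, color_hi = np.quantile(scaled, [0.01, 0.99])
```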

**Visualization**
The R packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results.
@@ -115,7 +116,12 @@ The workflow performs the following analyses on each dataset provided in the annotation file:
- diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
- connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
- Hierarchically Clustered Heatmap (.PNG)
-- hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
+- Hierarchically clustered heatmaps of scaled data (z-score) with configured distances ([metrics]) and clustering methods ([hclust_methods]).
+- Distance matrices of observations and features are precomputed using scipy's pdist function.
+- Hierarchical clustering is performed by the R implementation of fastcluster.
+- The observations can be randomly downsampled by proportion or absolute number ([n_observations]) to reduce computational cost.
+- The number of features can be reduced to a proportion or an absolute number of the top variable features ([n_features]) to reduce computational cost.
+- All combinations are computed and annotated with [metadata_of_interest].
- Visualization
- 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
- interactive 2D and 3D visualizations as self-contained HTML files of all projections/embeddings.
@@ -174,7 +180,7 @@ The workflow performs the following analyses on each dataset provided in the annotation file:
# Usage
Here are some tips for using this workflow:
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
-- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically based on retries. If a out-of-memory exception occurs the flag `--retries X` can be used to trigger automatic resubmission X time upon failure with X times the memory.
+- Heatmaps require **a lot** of memory; hence, options to reduce computational cost are provided and the memory allocation is handled dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission up to X times upon failure, each retry requesting correspondingly more memory (see the sketch after this list).
- Clustification performance scales with available cores, i.e., more cores enable faster internal parallelization of the Random Forest training & testing.
- Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata (can be skipped).
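
To illustrate the dynamically scaled memory, here is a minimal, hypothetical Snakemake sketch (the rule name, output path, base memory, and shell command are placeholders, not the workflow's literal code): Snakemake passes the current attempt number, which starts at 1 and increments with every `--retries` resubmission, to resource callables.

```python
rule heatmap:
    output:
        "results/{sample}/Heatmap/plots/Heatmap_{metric}_{method}.png"
    resources:
        # attempt is 1 on the first try and increments on each retry,
        # so the job requests 16 GB, then 32 GB, then 48 GB, ...
        mem_mb=lambda wildcards, attempt: attempt * 16000
    shell:
        "touch {output}"  # placeholder command
```

Invoking, e.g., `snakemake --cores 4 --retries 3` would then resubmit a failed job up to three times with increasing memory.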

@@ -185,12 +191,12 @@ Detailed specifications can be found here [./config/README.md](./config/README.md)
We provide a minimal example of the analysis of the [UCI ML hand-written digits dataset](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./test/):
- config
- configuration: config/config.yaml
-- sample annotation: digits_unsupervised_analysis_annotation.csv
+- sample annotation: test/config/digits_unsupervised_analysis_annotation.csv
- data
- dataset (1797 observations, 64 features): digits_data.csv
- metadata (consisting of the ground truth label "target"): digits_labels.csv
- results will be generated in the configured subfolder `./test/results/`
-- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
+- performance: on an HPC it took less than 7 minutes to complete a full run (with up to 32GB of memory per task)

# single-cell RNA sequencing (scRNA-seq) data analysis
Unsupervised analyses, dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analysis.
6 changes: 3 additions & 3 deletions config/README.md
@@ -1,10 +1,10 @@
# Configuration

-You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data, configuration, results an report in `test/` as a starting point.
+You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt, read the comments in the config and/or try the default values. We provide a full example, including data and configuration, in `test/` as a starting point.

- project configuration (`config/config.yaml`): Different for every project and configures the analyses to be performed.
- sample annotation (annotation): CSV file consisting of four mandatory columns.
- name: A unique name of the dataset (tip: keep it short but descriptive).
- data: Path to the tabular data as comma separated table (CSV).
-- metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
-- samples_by_features: Boolean indicator if the data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
+- metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each observation/sample and every other column metadata for the respective observation (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
+- samples_by_features: Boolean indicator if the data matrix is observations/samples (rows) x features (columns): 0==no, 1==yes.
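
For illustration, a hypothetical annotation file for the digits test dataset could look like the following (the exact file paths are assumptions; see `test/config/` for the real file):

```csv
name,data,metadata,samples_by_features
digits,test/data/digits_data.csv,test/data/digits_labels.csv,1
```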
7 changes: 4 additions & 3 deletions config/config.yaml
@@ -13,6 +13,7 @@ project_name: digits

##### PCA #####
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
+# especially relevant for large data
pca:
n_components: 0.9 # variance as float (0-1], number of components as int e.g., 50, or 'mle'
svd_solver: 'auto' # options: 'auto', 'full', 'covariance_eigh', 'arpack', 'randomized'
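
These settings map onto scikit-learn's PCA roughly as in the following sketch on stand-in data (the workflow's actual invocation may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.random.default_rng(0).normal(size=(1797, 64))  # digits-sized stand-in

# A float in (0, 1] keeps enough components to explain that fraction of the
# variance, an int keeps exactly that many components, and 'mle' estimates it.
pca = PCA(n_components=0.9, svd_solver="auto")
pcs = pca.fit_transform(data)
print(pcs.shape, pca.explained_variance_ratio_.sum())
```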
@@ -36,13 +37,13 @@ umap:
##### HEATMAP #####
# information on the ComplexHeatmap parameters: https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html
# distance metrics: for rows and columns; all metrics that are supported by scipy.spatial.distance.pdist (https://docs.scipy.org/doc/scipy-1.14.0/reference/generated/scipy.spatial.distance.pdist.html)
-# clustering methods: methods for hierarchical clustering that are supported by stats::hclust() (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust)
+# clustering methods: methods for hierarchical clustering that are supported by fastcluster's R implementation (https://danifold.net/fastcluster.html)
# it is the most resource (memory) intensive method; leave empty [] if not required
heatmap:
metrics: ['correlation','cosine']
hclust_methods: ['complete']
-n_observations: 1000 # random sampled proportion float [0-1] or absolute number as integer
-n_features: 0.5 # highly variable features percentate float [0-1] or absolute number as integer
+n_observations: 1000 # random sampled proportion float (0-1] or absolute number as integer
+n_features: 0.5 # highly variable features proportion float (0-1] or absolute number as integer
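
The proportion-or-absolute convention shared by `n_observations` and `n_features` could be resolved along these lines (a hypothetical Python helper for intuition, not the workflow's actual code):

```python
def resolve_count(value, total):
    """Interpret a float in (0, 1] as a proportion of total, an int as an absolute count."""
    if isinstance(value, float) and 0 < value <= 1:
        return max(1, round(value * total))
    return min(int(value), total)

print(resolve_count(1000, 1797))  # -> 1000 observations kept
print(resolve_count(0.5, 64))     # -> 32 most variable features kept
```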

##### LEIDEN #####
# Leiden clustering applied to the UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).
5 changes: 0 additions & 5 deletions workflow/Snakefile
@@ -142,11 +142,6 @@ rule all:
n_components=[dims for dims in config["umap"]["n_components"] if dims in [2,3]]
) if 2 in config["umap"]["n_components"] or 3 in config["umap"]["n_components"] else [],
# Heatmap
-# distance_matrices = expand(os.path.join(result_path,'{sample}','Heatmap','DistanceMatrix_{metric}_{type}.csv'),
-# sample=list(annot.index),
-# metric=config["heatmap"]["metrics"],
-# type=["observations","features"],
-# ),
heatmap_plots = expand(os.path.join(result_path,'{sample}','Heatmap','plots','Heatmap_{metric}_{method}.png'),
sample=list(annot.index),
method=config["heatmap"]["hclust_methods"],
12 changes: 0 additions & 12 deletions workflow/envs/fastdist_UNUSED.yaml

This file was deleted.

7 changes: 0 additions & 7 deletions workflow/envs/sklearn_UNUSED.yaml

This file was deleted.

13 changes: 0 additions & 13 deletions workflow/envs/umap_UNUSED.yaml

This file was deleted.

1 change: 0 additions & 1 deletion workflow/rules/clustering.smk
@@ -83,7 +83,6 @@ rule aggregate_all_clustering_results:
# read each clustering result and append it to the list
for filename in input:
agg_clust.append(pd.read_csv(filename, header=0, index_col=0))
-

# concatenate the list of clustering results into a single DataFrame
agg_clust_df = pd.concat(agg_clust, axis=1)
3 changes: 1 addition & 2 deletions workflow/rules/common.smk
@@ -134,8 +134,7 @@ def get_external_validation_paths(wildcards):

# get paths to determine internal cluster indices
def get_internal_validation_paths(wildcards):
-return {#'data': annot.loc[wildcards.sample,'data'],
-'metadata': annot.loc[wildcards.sample,"metadata"],
+return {'metadata': annot.loc[wildcards.sample,"metadata"],
'clusterings': os.path.join(result_path,wildcards.sample, "metadata_clusterings.csv"),
'pca': os.path.join(result_path,wildcards.sample,'PCA','PCA_{}_{}_data.csv'.format(config["pca"]["svd_solver"],config["pca"]["n_components"])),
'pca_var': os.path.join(result_path,wildcards.sample,'PCA','PCA_{}_{}_var.csv'.format(config["pca"]["svd_solver"],config["pca"]["n_components"]))
