Skip to content

Commit

Permalink
improve and document correlation heatmaps #11
Browse files Browse the repository at this point in the history
  • Loading branch information
sreichl committed Jul 10, 2024
1 parent 97ee3bc commit 8b389ec
Show file tree
Hide file tree
Showing 5 changed files with 52 additions and 10 deletions.
10 changes: 9 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,10 @@ This project wouldn't be possible without the following software and their depen

| Software | Reference (DOI) |
| :---: | :---: |
| ComplexHeatmap | https://doi.org/10.1093/bioinformatics/btw313 |
| CQN | https://doi.org/10.1093/biostatistics/kxr054 |
| edgeR | https://doi.org/10.1093/bioinformatics/btp616 |
| fastcluster | https://doi.org/10.18637/jss.v053.i09 |
| ggplot2 | https://ggplot2.tidyverse.org/ |
| _limma_ | https://doi.org/10.1093/nar/gkv007 |
| matplotlib | https://doi.org/10.1109/MCSE.2007.55 |
Expand Down Expand Up @@ -73,6 +75,8 @@ __Highly Variable Feature (HVF) selection.__ Highly variable features (HVF) were

__Confounding Factor Analysis (CFA).__ We assessed the potential confounding effects of metadata on principal components (PCs) by quantifying their statistical associations with the first ten PCs from principal component analysis (PCA). Categorical metadata were tested using the Kruskal-Wallis test, while numeric metadata were analyzed using Kendall's Tau correlation. Metadata without variation were excluded, and numeric metadata with fewer than 25 unique values were converted to factors. P-values for associations between PCs and metadata were calculated and adjusted for multiple testing using the Benjamini-Hochberg method. The results were visualized as a heatmap with hierarchically clustered rows (metadata) displaying -log10 adjusted p-values, distinguishing between numeric and categorical metadata.

__Correlation Heatmaps.__ To assess sample similarities we generated heatmaps of the sample-wise Pearson correlation matrix. using the `cor` function in R with the "pearson" method. Two versions of heatmaps were created: one hierarchically clustered and one sorted alphabetically by sample name. Hierarchical clustering was performed with the `hclust` function from the `fastcluster` package [ver] using "euclidean" distance and "complete" linkage. Heatmaps were visualized using the `ComplexHeatmap` package [ver] and annotated with relevant metadata.

__Visualization.__ The quality of the data and the effectiveness of the processing steps were assessed through the following visualizations (raw/filtered counts were log2(x+1)-normalized): the mean-variance relationship of all features, densities of log2-values per sample, boxplots of log2-values per sample, and Principal Component Analysis (PCA) plots. For the PCA plots, features with zero variance were removed beforehand and colored by `[visualization_parameters.annotate]`. The plots were generated using the R libraries ggplot2, reshape2, and patchwork`(ver)[ref]`.

**The analyses and visualizations described here were performed using a publicly available Snakemake `[ver](ref)` workflow `(ver)` [10.5281/zenodo.8144219](https://doi.org/10.5281/zenodo.8144219).**
Expand Down Expand Up @@ -109,7 +113,7 @@ The workflow performs the following steps to produce the outlined results:
- All transformed datasets are saved as CSV files and named by the applied methods, respectively.
- Example: `{split}/normCQN_reComBat_HVF.csv` implies that the respective data `{split}` was filtered, normalized using CQN, integrated with reComBat and subset to its HVFs.
- Visualizations (`{split}/plots/`)
- Next to the method specific visualizations (e.g., for CQN, HVF selection), a diagnostic figure is provided for every generated dataset (`*.png`), consisting of the following plots:
- Next to the method specific visualizations (e.g., for CQN, HVF selection), a **diagnostic figure** is provided for every generated dataset (`*.png`), consisting of the following plots:
- Mean-Variance relationship of all features as hexagonal heatmap of 2d bin counts.
- Densities of log-normalized counts per sample colored by sample or configured annotation column.
- Boxplots of log-normalized counts per sample colored by sample or configured annotation column.
Expand All @@ -119,6 +123,10 @@ The workflow performs the following steps to produce the outlined results:
- Categorical metadata association is tested using the non-parametric Kruskal-Wallis test, which is broadly applicable due to relaxed requirements and assumptions.
- Numeric metadata association is tested using rank-based Kendall's Tau, which is suitibale for "small" data sets with many ties and is robust to outliers.
- Statistical associations as `-log10(adjusted p-values)` are visualized using a heatmap with hierarchically clustered rows (metadata).
- Correlation Heatmaps (`*_heatmap_{clustered|sorted}.png`)
- Heatmap of sample-wise Pearson correlation matrix of the respective data split and processing step to quickly assess sample similarities e.g., replicates/conditions should correlate highly but batch shoud not.
- Hierarchicaly clustered using method 'complete' with distance metric 'euclidean' (`*_heatmap_clustered.png`).
- Alphabetically sorted by sample name (`*_heatmap_sorted.png`).
- Note: raw and filtered counts are log2(x+1)-normalized for the visualizations.
- These visualizations should help to assess the quality of the data and the effectiveness of the processing steps (e.g., normalization).
- Visualizations are within each split's plots subfolder, with the identical naming scheme as the respective data.
Expand Down
2 changes: 1 addition & 1 deletion workflow/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ rule all:
# expand(os.path.join(result_path, '{split}','{transformed_data}.csv'), split=splits, transformed_data=transformed_data),
# VISUALIZE
expand(os.path.join(result_path, '{split}','plots','{all_data}.png'), split=splits, all_data=all_data),
expand(os.path.join(result_path, '{split}','plots','{all_data}_heatmap.png'), split=splits, all_data=all_data),
expand(os.path.join(result_path, '{split}','plots','{all_data}_heatmap_clustered.png'), split=splits, all_data=all_data),
# config
envs = expand(os.path.join(config["result_path"],'envs',module_name,'{env}.yaml'),env=["python_basics","edgeR","cqn","reComBat","ggplot"]),
configs = os.path.join(config["result_path"],'configs',module_name,'{}_config.yaml'.format(config["project_name"])),
Expand Down
2 changes: 1 addition & 1 deletion workflow/report/correlation_heatmap.rst
Original file line number Diff line number Diff line change
@@ -1 +1 @@
Hierarchically clustered heatmap of the sample-wise Pearson correlation matrix of data split {{snakemake.wildcards["split"]}} transformed by {{snakemake.wildcards["label"]}} using method 'complete' with distance metric 'euclidean'.
Hierarchically clustered or alphabetically sorted by sample name heatmap of the sample-wise Pearson correlation matrix of data split {{snakemake.wildcards["split"]}} transformed by {{snakemake.wildcards["label"]}} using method 'complete' with distance metric 'euclidean'.
12 changes: 10 additions & 2 deletions workflow/rules/visualize.smk
Original file line number Diff line number Diff line change
Expand Up @@ -39,13 +39,21 @@ rule plot_heatmap:
data = os.path.join(result_path,'{split}','{label}.csv'),
metadata = os.path.join(result_path,'{split}','annotation.csv'),
output:
heatmap = report(os.path.join(result_path,'{split}','plots','{label}_heatmap.png'),
heatmap_clustered = report(os.path.join(result_path,'{split}','plots','{label}_heatmap_clustered.png'),
caption="../report/correlation_heatmap.rst",
category="{}_{}".format(config["project_name"], module_name),
subcategory="{split}",
labels={
"name": "{label}",
"type": "correlation heatmap"
"type": "correlation heatmap clustered"
}),
heatmap_sorted = report(os.path.join(result_path,'{split}','plots','{label}_heatmap_sorted.png'),
caption="../report/correlation_heatmap.rst",
category="{}_{}".format(config["project_name"], module_name),
subcategory="{split}",
labels={
"name": "{label}",
"type": "correlation heatmap sorted"
}),
params:
partition = config.get("partition"),
Expand Down
36 changes: 31 additions & 5 deletions workflow/scripts/plot_heatmap.R
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,8 @@ data_path <- snakemake@input[["data"]]
metadata_path <- snakemake@input[["metadata"]]

# output
plot_path <- snakemake@output[["heatmap"]]
hm_clustered_path <- snakemake@output[["heatmap_clustered"]]
hm_sorted_path <- snakemake@output[["heatmap_sorted"]]

# parameters
metadata_cols <- c(snakemake@config[["visualization_parameters"]][["annotate"]])
Expand Down Expand Up @@ -87,7 +88,8 @@ for (col in metadata_cols) {
if (!is.numeric(metadata[[col]])) {
n_cat <- length(unique(metadata[[col]]))
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == 'qual',]
colors <- sample(unique(unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))), n_cat, replace=TRUE)
all_colors <- unique(unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals))))
colors <- sample(all_colors, n_cat, replace=ifelse(n_cat>length(all_colors), TRUE, FALSE))
names(colors) <- unique(metadata[[col]])
colors_list[[col]] <- colors
}
Expand All @@ -96,9 +98,11 @@ for (col in metadata_cols) {
# Create the row annotation
row_annot <- HeatmapAnnotation(df = metadata[, metadata_cols, drop = FALSE], which = "row", col = colors_list)

### make & save heatmap
### make & save heatmaps
# options(repr.plot.width=plot_dim, repr.plot.height=plot_dim)
png(filename=plot_path, width=plot_dim, height=plot_dim, units = "in", res=300)

# Clustered hierarchically Heatmap
png(filename=hm_clustered_path, width=plot_dim, height=plot_dim, units = "in", res=300)

Heatmap(data,
name = "Pearson\nCorrelation",
Expand All @@ -107,7 +111,7 @@ Heatmap(data,
left_annotation = row_annot,
show_column_dend = FALSE,
show_row_names = ifelse(nrow(data)>100, FALSE, TRUE),
show_column_names = FALSE,
show_column_names = ifelse(nrow(data)>100, FALSE, TRUE),
cluster_rows = as.dendrogram(hc),
cluster_columns = as.dendrogram(hc),
row_dend_reorder = TRUE,
Expand All @@ -117,3 +121,25 @@ Heatmap(data,
)

dev.off()


# Sorted alphabetically Heatmap
png(filename=hm_sorted_path, width=plot_dim, height=plot_dim, units = "in", res=300)

Heatmap(data,
name = "Pearson\nCorrelation",
column_title = paste0("Heatmap of Pearson correlation matrix sorted alphabetically by sample name."),
col = col_fun,
left_annotation = row_annot,
show_column_dend = FALSE,
show_row_names = ifelse(nrow(data)>100, FALSE, TRUE),
show_column_names = ifelse(nrow(data)>100, FALSE, TRUE),
row_order = order(rownames(data)),
column_order = order(colnames(data)),
row_dend_reorder = TRUE,
column_dend_reorder = TRUE,
use_raster = TRUE,
raster_quality = 9
)

dev.off()

0 comments on commit 8b389ec

Please sign in to comment.