3_RNAseqVis_Heatmap.Rmd

---
title: "Visualisation of DESeq2 results"
description: "DEG analysis based on DESeq2 and GSEA"
principal investigator: "Joaquín de Navascués"
researcher: "Aleix Puig, modified by Joaquín de Navascués"
output: 
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
    theme: readable
    df_print: paged
---
```{r set-publication-theme, echo=FALSE, cache=FALSE}
ggplot2::theme_set(ggpubr::theme_pubr(base_size=10))
```

```{r setup, echo = FALSE, cache = FALSE}
knitr::opts_chunk$set(dev = 'png', 
                      fig.align = 'center', fig.height = 5, fig.width = 8.5, 
                      pdf.options(encoding = "ISOLatin9.enc"),
                      fig.path='integration/figures/', warning=FALSE, message=FALSE)
```

**Libraries:**
```{r, warning=FALSE}
if (!require("librarian")) install.packages("librarian")
# data
librarian::shelf(tibble, stringr, purrr, dplyr, plyr, DescTools, limma)
# graphics
librarian::shelf(RColorBrewer, pheatmap, cetcolor)
# convenience
library(here)
setwd(here::here()) # to distinguish from plyr::here()
```

**Path to definitive images (outside repo):**
```{r}
figdir <- paste0(c(head(str_split(getwd(),'/')[[1]],-1),
                   paste0(tail(str_split(getwd(),'/')[[1]],1), '_figures')),
                 collapse='/')
dir.create(figdir, showWarnings = FALSE)
```

# RNAseq results visualisation: Samples heatmap

## 1 Gather the DEG data

```{r}
# experimental design and labels
targets <- readRDS('output/targets.RDS')
# add column to distinguish control batches
targets$sample_type <- targets$Condition
targets$sample_type[targets$sample_type=='Control'] <- str_c(
  targets$Condition, "_", targets$Batch)[targets$Condition=='Control']
targets[,c('Condition', 'sampleIDs', 'sample_type')]
saveRDS(targets, file='output/targets.RDS')
```

```{r}
# TPM counts
tpmNormalisedCounts <- readRDS('output/tpmNormalisedCounts.RDS')
# pseudocounts
vsd_bcorr <- readRDS('output/vst_pseudocounts_batchCorrected.RDS')
# DEG lists
data_list <- list.files("output", pattern = "_vs_")
datasets <- pmap(list(file = paste("output", data_list, sep = "/")),
                 readRDS)
names(datasets) <- str_split_fixed(
  str_split_fixed(data_list, ".RDS", n=2)[,1],
  "_", n=3)[,3]
```

Total unique differentially expressed genes for | log2FC | > 2:
```{r}
# Filter for DEG with abs(log2FC) >= `fc_thresh`
fc_thresh <- 2.5
DE_genes <- datasets %>% 
  map( filter, padj <= 0.05 ) %>%
  map( filter, log2FoldChange %][% c(-fc_thresh, fc_thresh) )
# get the union of DEGs sets
DE_unique_genes <- unique(unlist(lapply(DE_genes, '[[', "ensemblGeneID")))
length(DE_unique_genes)
```

## 2 Using TPM counts with and without batch-correction

Let us first try to use the data as 'raw' as possible: TPM-normalised counts, Z-scored (as the range of values is 0.2 - 10,000) - obviously with no batch correction.

```{r}
# filter by DEGs
DE_tpm <- tpmNormalisedCounts[match(DE_unique_genes,
                                    rownames(tpmNormalisedCounts)),]
```

This gets us the TPMs of all DEGs, without Z-scoring -- this is done internally in the `pheatmap` function).

Prepare the heatmap customisation:
```{r}
main.title <-  'Clustered heatmap of DEGs'
# annotation labels
## for batch
ann_labels <- data.frame(batch = ifelse(test = targets$Batch == 'a',
                                         yes = '1',
                                         no =  '2'))
## for genotype
ann_labels$condition <- mapvalues(targets$Condition,
                                  from=unique(targets$Condition),
                                  to=c('da RNAi', 'da ov/ex',
                                       'control', 'da:da ov/ex',
                                       'scute ov/ex'))
ann_labels$condition <- factor(ann_labels$condition,
                               levels=c('da RNAi', 'control',
                                        'da ov/ex', 'da:da ov/ex',
                                        'scute ov/ex'))
rownames(ann_labels) <- targets$sampleIDs # same as `names(DE_tpm)`
# annotation colours
ann_colors = list(
  batch = c('1' = brewer.pal(12, 'Paired')[2],
            '2' = brewer.pal(12, 'Paired')[8]),
  condition = c("da RNAi" = brewer.pal(12, 'Paired')[9],
                "da ov/ex" = brewer.pal(12, 'Paired')[5],
                "control" = brewer.pal(12, 'Paired')[1],
                "da:da ov/ex" = brewer.pal(12, 'Paired')[6],
                "scute ov/ex" = brewer.pal(12, 'Paired')[4])
  )
```

### Heatmap of scaled TPM counts

```{r, fig.height=12, fig.width=10}
hm <- pheatmap(
  # data
    mat               = DE_tpm,
    scale             = "row",   # z-scores the rows
  # main
    main              = main.title,
    fontsize          = 14,
    clustering_method = "ward.D2",
  # rows
    cluster_rows      = TRUE,
    clustering_distance_rows = 'manhattan',
    treeheight_row    = 25,      # default is 50
    show_rownames     = FALSE,
  # cols
    cluster_cols      = TRUE,
    clustering_distance_cols = 'manhattan',
    treeheight_col    = 25,
    labels_col        = targets$sample_type,
    fontsize_col      = 9,
    angle_col         = 45,
  # annotation
    annotation        = ann_labels,
    annotation_colors = ann_colors,
  # tiles
    color             = cet_pal(n = 256, name = "cbd1", alpha = 1),
    border_color      = NA,
    cellwidth         = 20,
    cellheight        = 0.5
)
hm
```

The 'looks' are ok, but the organisation of the columns itself is not. Let us try with a batch-corrected version.

We use the `DESeqTranform` object to mark the batch-effects, and `limma::removeBatchEffect` to remove them:
```{r}
vsd <- readRDS('output/vst_pseudocounts.RDS')
# min(tpmNormalisedCounts) is 0 so we need to add 1 for log transform
tpm_log_bcorr <- removeBatchEffect(
  log2(tpmNormalisedCounts+1),
  batch=vsd$batch,
  design=model.matrix(~condition, colData(vsd) )
)
# reverse log transform
tpm_bcorr <- 2^tpm_log_bcorr
# save for MA plots
saveRDS(tpm_bcorr, file='output/tmp_batch_corrected.RDS')
# filter by DEGs
DE_tpm_bcorr <- tpm_bcorr[match(DE_unique_genes,
                             rownames(tpm_bcorr)),]
```

ALl annotations are the same -- we are now ready to plot.

### Heatmap of batch-corrected, z-scored TPM

```{r, fig.height=12, fig.width=10}
hm <- pheatmap(
  # data
    mat               = DE_tpm_bcorr,
    scale             = "row",   # z-scores the rows
  # main
    main              = main.title,
    fontsize          = 14,
    clustering_method = "ward.D2",
  # rows
    cluster_rows      = TRUE,
    clustering_distance_rows = 'minkowski',
    treeheight_row    = 25,      # default is 50
    cutree_rows       = 6,
    show_rownames     = FALSE,
  # cols
    cluster_cols      = TRUE,
    clustering_distance_cols = 'canberra',
    treeheight_col    = 25,
    labels_col        = targets$sample_type,
    fontsize_col      = 9,
    angle_col         = 45,
  # annotation
    annotation        = ann_labels,
    annotation_colors = ann_colors,
  # tiles
    color             = cet_pal(n = 256, name = "cbd1", alpha = 1),
    border_color      = NA,
    cellwidth         = 20,
    cellheight        = 0.5
)
hm
```

This is clearly more reasonable -- however, it requires transforming TPMs forth and back. We might just as well use the pseudocounts generated by the `DESeq2::vst` transformation.

## 3 Heatmap of batch-corrected pseudocounts

To do this, we need the batch-corrected, `DESeqTranform` object from the DGE analysis with `DESeq2`, filtered by `log2FoldChange` and `padj`:
```{r}
vsd_bcorr <- readRDS('output/vst_pseudocounts_batchCorrected.RDS')
DE_vsd_bcorr <- as.data.frame(assay(vsd_bcorr)[match(DE_unique_genes,
                             rownames(vsd_bcorr)),])
```

It is not needed to get the z-scores for the rows (`scale = 'row'`), as the range of values is ~3-18:
```{r, fig.height=12, fig.width=10}
hm <- pheatmap(
  # data
    mat               = DE_vsd_bcorr,
  # main
    main              = main.title,
    fontsize          = 14,
    clustering_method = "ward.D2",
  # rows
    cluster_rows      = TRUE,
    clustering_distance_rows = 'euclidean', # "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"
    treeheight_row    = 25,      # default is 50
    show_rownames     = FALSE,
  # cols
    cluster_cols      = TRUE,
    clustering_distance_cols = 'euclidean',
    treeheight_col    = 25,
    labels_col        = targets$sample_type,
    fontsize_col      = 9,
    angle_col         = 45,
  # annotation
    annotation        = ann_labels,
    annotation_colors = ann_colors,
  # tiles
    color             = cet_pal(n = 256, name = "cbd1", alpha = 1),
    border_color      = NA,
    cellwidth         = 20,
    cellheight        = 0.5
)
hm
```

OK, I see the problem here. Clearly the way to go is do the reverse transformation $2^{`vsd`}$. But what is the point of doing this with TPMs? Normalising by transcript units loses its the "physicality" if applying a ${log~2~(tpm+1)}$ transform. So let us use the batch-corrected `DESeqTranform` object directly. The dataframe `DE_vsd_bcorr` already contains the matrix of `vst`-transformed values _for the DEGs_, so we just need to do:

```{r}
DE_bcorr <- 2^DE_vsd_bcorr
```

This has a range of ~ 12-30,000, so we have to plot with z-score scaling:

```{r, fig.height=12, fig.width=10}
hm <- pheatmap(
  # data
    mat               = DE_bcorr,
    scale             = 'row',
  # main
    main              = main.title,
    fontsize          = 14,
    clustering_method = "ward.D2",
  # rows
    cluster_rows      = TRUE,
    clustering_distance_rows = 'minkowski', # "euclidean", "maximum", "manhattan", "canberra", "minkowski"
    treeheight_row    = 25,      # default is 50
    cutree_rows       = 6,
    show_rownames     = FALSE,
  # cols
    cluster_cols      = TRUE,
    clustering_distance_cols = 'canberra',
    treeheight_col    = 25,
    labels_col        = targets$sample_type,
    fontsize_col      = 9,
    angle_col         = 45,
  # annotation
    annotation        = ann_labels,
    annotation_colors = ann_colors,
  # tiles
    color             = cet_pal(n = 256, name = "cbd1", alpha = 1),
    border_color      = NA,
    cellwidth         = 20,
    cellheight        = 0.5,
)
# save it
tiff(file=paste0(figdir,'/Heatmap_vsd_bcorr.tiff'),
     width=10, height=12, units="in", res=300)
hm
dev.off()

pdf(file=paste0(figdir,'/Heatmap_vsd_bcorr.pdf'),
     width=10, height=12)
hm
dev.off()

png(file=paste0(figdir,'/Heatmap_vsd_bcorr.png'),
     width=10, height=12, units="in", res=300)
hm
dev.off()

svg(file=paste0(figdir,'/Heatmap_vsd_bcorr.svg'),
     width=10, height=12)
hm
dev.off()

hm
```

This is the clear winner, so I add the parameters to save it to file.