2_dge_descriptive_viz_old.Rmd

---
title: "2. Visualisation of DESeq2 results"
description: "PCA and clustered heatmap"
principal investigator: "Joaquín de Navascués"
researchers: "Aleix Puig-Barbé, Joaquín de Navascués"
output: 
  html_document:
    toc: true
    toc_float: true
    code_folding: show
    theme: readable
    df_print: paged
    css: doc.css
---
```{r setup, echo=FALSE, cache=FALSE}
ggplot2::theme_set(ggpubr::theme_pubr(base_size=10))
knitr::opts_chunk$set(dev = 'png', 
                      fig.align = 'center', fig.height = 7, fig.width = 8.5, 
                      pdf.options(encoding = "ISOLatin9.enc"),
                      fig.path='notebook_figs/', warning=FALSE, message=FALSE)
```


# 1 Preparation


**Libraries/utils:**
```{r libraries, warning=FALSE, message=FALSE}
if (!require("librarian")) install.packages("librarian")
librarian::shelf(
  # data
  tibble, DESeq2, stringr, purrr, plyr, dplyr, reshape2, santoku, DescTools, matrixStats,
  # graphics
  ggplot2, ggthemes, ggtext, ggrepel, eulerr, RColorBrewer, pheatmap, cetcolor,
  # convenience
  here)

if(!exists("extract_regulated_sets", mode="function")) source("utils.R")
```

**Set working directory:**
```{r setwd}
if (Sys.getenv("RSTUDIO")==1) {
   # setwd to where the editor is, if the IDE is RStudio
  setwd( dirname(rstudioapi::getSourceEditorContext(id = NULL)$path) )
} else {
  # setwd to where the editor is in a general way - maybe less failsafe than the previous
  setwd(here::here())
  # the following checks that the latter went well, but assuming
  # that the user has not changed the name of the repo
  d <- str_split(getwd(),'/')[[1]][length(str_split(getwd(),'/')[[1]])]
  if (d != 'RNAseq-EmcDaSc-adult_midgut') { stop(
    paste0("Could not set working directory automatically to where this",
           " script resides.\nPlease do `setwd()` manually"))
    }
}
```

**To save images outside the repo (to reduce size):**
```{r define_dir2figs}
figdir <- paste0(c(head(str_split(getwd(),'/')[[1]],-1),
                   paste0(tail(str_split(getwd(),'/')[[1]],1), '_figures')),
                 collapse='/')
dir.create(figdir, showWarnings = FALSE)
```


# 2 RNAseq results visualisation: Principal Components Analysis


## 2.1 Load and check expression data


We have seen how batch correction is necessary to show a meaningful PCA. Of the different expression measures, vst-normalised data is the most reasonable to make comparisons across samples.
```{r load-vst}
targets <- readRDS('output/targets.RDS')
vsd <- readRDS('output/vst_pseudocounts_batchCorrected.RDS')
```

When performing PCA on RNAseq data, it is often useful to filter genes with very low variance across experimental conditions. To test whether this makes sense in our case:
```{r visualise-variance, warning=FALSE}
# get per-gene variance and mean expression
var_vsd <- rowVars(assay(vsd), rm.na=TRUE)
avg_vsd <- rowMeans(assay(vsd))
df <- data.frame(Variance=var_vsd, Mean=avg_vsd)
# plot close to variance==0 with inset:
p1 <- ggplot(df, aes(x=Variance, y=Mean)) +
  geom_point(alpha=0.05)
p2 <- ggplot(df, aes(x=Variance, y=Mean)) +
  geom_point(alpha=0.1) + xlim(0,0.1) + ylim(5,15) +
  theme(panel.background = element_rect(fill='grey90'))
p1 + annotation_custom(ggplotGrob(p2),
                       xmin=7, xmax=15,
                       ymin=10, ymax=25)
```

Some genes have very low variance (which is the whole point of the `vst` transformation), but none of them are zero.
There seems to be little point in drawing an arbitrary threshold, so I move on with the whole dataset.

## 2.2 PCA plot

```{r}
pca <- prcomp(t(assay(vsd)), center=TRUE)
scores <- data.frame(targets$sampleIDs, pca$x[,1:2])
xtitle <- paste0('**PC1** (', round(summary(pca)$importance['Proportion of Variance','PC1']*100,1), ' %)')
ytitle <- paste0('**PC2** (', round(summary(pca)$importance['Proportion of Variance','PC2']*100,1), ' %)')
ptitle <- 'Principal component scores' 
stitle <- '(*vst* pseudocounts, batch-corrected)'

ggplot(scores,
       aes(x = PC1, y = PC2,
           label=factor(targets$sampleIDs),
           colour=factor(targets$condition_md) )
       ) +
  # data
  geom_vline(xintercept=0, linetype='dashed', colour='grey60', linewidth=0.5) +
  geom_hline(yintercept=0, linetype='dashed', colour='grey60', linewidth=0.5) +
  geom_point(size=4) +
  geom_point(size=2, colour='white', alpha=0.5) + 
  geom_text_repel(size=4, alpha=0.75,
                  max.overlaps=Inf,
                  seed=42,
                  force=0.3, force_pull=1, 
                  box.padding=1, point.padding=0.5) +
  lims(x= c(-100, 150), y = c(-80, 80)) +
  # decorations
  theme_minimal(base_size=16) +
  labs(x=xtitle,
       y=ytitle,
       title=ptitle,
       subtitle=stitle) +
  scale_colour_discrete(name="condition") +
  theme(legend.text = element_markdown(size=10),
        axis.title.x = element_markdown(),
        axis.title.y = element_markdown(),
        plot.title = element_text(hjust=0.5),
        plot.subtitle = element_markdown(hjust=0.5, color="grey40"),
        axis.text.x = element_text(color="grey60"),
        axis.text.y = element_text(color="grey60"),
        axis.ticks.x = element_line(linewidth=0.5, linetype = "solid", color="grey60"),
        axis.ticks.y = element_line(linewidth=0.5, linetype = "solid", color="grey60"),
        axis.ticks.length=unit(-0.25, "cm"),
        panel.border = element_rect(linewidth=1, linetype = "solid", fill = NA),
        panel.grid = element_blank())

ggsave('PCA_vsd_batchcorr.pdf', plot = last_plot(), device = 'pdf',
       path = figdir, dpi = 300)
```


# 3 RNAseq results visualisation: clustered heatmap


## 3.1 Prepare the data

Identify the batch corresponding to each sample:
```{r get-target-batch}
targets$sample_type <- targets$Condition
targets$sample_type[targets$sample_type=='Control'] <- str_c(
  targets$Condition, "_", targets$Batch)[targets$Condition=='Control']
saveRDS(targets, file='output/targets.RDS')
```

Get differential gene expression data and filter genes by $|\log_{2}(fold.change)| \ge 2.5$ and $p.adjusted < 0.05$:
```{r get-dge}
# DEG lists
data_list <- list.files("output", pattern = "_vs_")
datasets <- pmap(list(file = file.path("output", data_list)),
                 readRDS)
names(datasets) <- str_split_fixed(
  str_split_fixed(data_list, ".RDS", n=2)[,1],
  "_", n=3)[,3]

# Filter for DEG with abs(log2FC) >= `fc_thresh`
fc_thresh <- 2.5
DE_genes <- datasets %>% 
  map( filter, padj <= 0.05 ) %>%
  map( filter, log2FoldChange %][% c(-fc_thresh, fc_thresh) )
# get the union of DEGs sets
DE_unique_genes <- unique(unlist(lapply(DE_genes, '[[', "ensemblGeneID")))
length(DE_unique_genes)
```

Prepare heatmap customisation:
```{r}
main.title <-  'Clustered heatmap of DEGs'
# annotation labels
## for batch
ann_labels <- data.frame(batch = ifelse(test = targets$Batch == 'a',
                                         yes = '1',
                                         no =  '2'))
## for genotype
ann_labels$condition <- mapvalues(targets$Condition,
                                  from=unique(targets$Condition),
                                  to=c('da RNAi', 'da ov/ex',
                                       'control', 'da:da ov/ex',
                                       'scute ov/ex'))
ann_labels$condition <- factor(ann_labels$condition,
                               levels=c('da RNAi', 'control',
                                        'da ov/ex', 'da:da ov/ex',
                                        'scute ov/ex'))
rownames(ann_labels) <- targets$sampleIDs # same as `names(DE_tpm)`
# annotation colours
ann_colors = list(
  batch = c('1' = brewer.pal(12, 'Paired')[2],
            '2' = brewer.pal(12, 'Paired')[8]),
  condition = c("da RNAi" = brewer.pal(12, 'Paired')[9],
                "da ov/ex" = brewer.pal(12, 'Paired')[5],
                "control" = brewer.pal(12, 'Paired')[1],
                "da:da ov/ex" = brewer.pal(12, 'Paired')[6],
                "scute ov/ex" = brewer.pal(12, 'Paired')[4])
  )
```

Load the vst-normalised, batch-corrected counts from the `DESeq2::DESeqTranform` object and filter by $|\log_{2}(fold.change)| \ge 2.5$ and $p.adjusted < 0.05$. We also need to reverse the "variance-stabilising" transform ${\log_2(counts+1)}$ to show the appropriate range or variation, simply doing $2^{vsd}$ (this will get us a range of values of 12-30,000, instead of 3-18):
```{r get-counts}
# pseudocounts
vsd_bcorr <- readRDS('output/vst_pseudocounts_batchCorrected.RDS')
DE_vsd_bcorr <- as.data.frame(assay(vsd_bcorr)[match(DE_unique_genes,
                             rownames(vsd_bcorr)),])
DE_bcorr <- 2^DE_vsd_bcorr
```

Plot the z-scored expression values
```{r pheatmap, fig.height=12, fig.width=10}
hm <- pheatmap(
  # data
    mat               = DE_bcorr,
    scale             = 'row',
  # main
    main              = main.title,
    fontsize          = 14,
    clustering_method = "ward.D2",
  # rows
    cluster_rows      = TRUE,
    clustering_distance_rows = 'minkowski', # "euclidean", "maximum", "manhattan", "canberra", "minkowski"
    treeheight_row    = 25,      # default is 50
    cutree_rows       = 6,
    show_rownames     = FALSE,
  # cols
    cluster_cols      = TRUE,
    clustering_distance_cols = 'canberra',
    treeheight_col    = 25,
    labels_col        = targets$sample_type,
    fontsize_col      = 9,
    angle_col         = 45,
  # annotation
    annotation        = ann_labels,
    annotation_colors = ann_colors,
  # tiles
    color             = cet_pal(n = 256, name = "cbd1", alpha = 1),
    border_color      = NA,
    cellwidth         = 20,
    cellheight        = 0.5,
)
# save it
pdf(file=paste0(figdir,'/Heatmap_vsd_bcorr.pdf'),
     width=10, height=12)
hm
dev.off()
```


# 4 RNAseq results visualisation: Set diagrams


# 4.1 Prepare data

```{r load-dge}
# DEG data
DaDaOE_deg <- datasets['DaDaOE']
DaKD_deg   <- datasets['DaKD']
DaOE_deg   <- datasets['DaOE']
ScOE_deg   <- datasets['ScOE']
# gene symbols
dlist <- read.table(file="resources/gene_symbols.txt", header=TRUE)
names(dlist)[[1]] <- 'ensembl_gene_id'
rownames(dlist) <- dlist$ensembl_gene_id
```