slide_dge.Rmd

---
title: "Differential Gene Expression"
subtitle: "Workshop on RNA-Seq"
author: "`r paste0('<b>Julie Lorent</b> | ',format(Sys.time(), '%d-%b-%Y'))`"
institute: NBIS, SciLifeLab
keywords: bioinformatics, course, scilifelab, nbis
output:
  xaringan::moon_reader:
    encoding: 'UTF-8'
    self_contained: false
    chakra: 'assets/remark-latest.min.js'
    css: 'assets/slide.css'
    lib_dir: libs
    nature:
      ratio: '4:3'
      highlightLanguage: r
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      slideNumberFormat: "%current%/%total%"
    include: NULL
---
exclude: true
count: false

```{r,echo=FALSE,child="assets/header-slide.Rmd"}
```

<!-- ------------ Only edit title, subtitle & author above this ------------ -->

```{r,include=FALSE}
# load the packages you need
library(dplyr)
library(tidyr)
#library(stringr)
library(ggplot2)
#library(plotly)
library(pheatmap)
library(DESeq2)
library(edgeR)
library(gridExtra)

source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
```

```{r,eval=TRUE,include=FALSE}
download_data("data/counts_filtered.csv")
cf <- read.csv("data/counts_filtered.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)

download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)
```

---
name: intro

## What I'll talk about in this lecture

- Compare gene expression between 2 groups of samples 
  - while accounting for differences in sequencing depth
  - by testing for differences in means with appropriate estimation of count data variability
- Calculate log fold changes
- Step by step description of DESeq2 analysis


## What we will discuss in the next lecture

- More information on count data distributions 
- Batch effects
- Advanced designs


---
name: intro2

## Couldn’t we just use a Student’s t test for each gene?


```{r, echo=FALSE, out.width="30%", fig.align='center'}
knitr::include_graphics("data/boxplot.png")
```

- May have few replicates -> "borrow" information from other genes
- Multiple testing issues -> Correction
- Distribution is not normal -> negative binomial methods


---
name: prep

## Preparation

- DEseq2 (and edgeR) take as input raw counts and metadata.
- Create the DESeq2 object

```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
d
```

- Categorical variables must be factors
- Building GLM models: `~var`, `~covar+var`

???

- The model `~var` asks DESeq2 to find DEGs between the levels under the variable *var*.
- The model `~covar+var` asks DESeq2 to find DEGs between the levels under the variable *var* while controlling for the covariate *covar*.

---
name: dge-sf

## Size factors

- Objective of the differential gene expression: compare concentration of cDNA fragments from each gene between conditions/samples.
- Data we have: read counts which depend on these concentration, but also on sequencing depth
- Total count can be influenced by a few highly variable genes
- For this reason, DESeq2 uses size factors (median-of-ratios) instead of total count as normalization factors to account for differences in sequencing depth
- Normalisation factors are computed as follows:

```{r}
d <- DESeq2::estimateSizeFactors(d,type="ratio")
sizeFactors(d)
```

---
name: neg-binom

## Negative binomial distribution

- RNAseq data is not normally distributed neither as raw counts nor using simple transformations
- DESeq2 and edgeR instead assume negative binomial distributions. 
- Given this assumption, to test for differential expression, one need to get a good estimate of the dispersion (variability given the mean).

---
name: dge-dispersion-1

## Dispersion

- Dispersion is a measure of spread or variability in the data. 
- Variance is a classical measure of dispersion which is usually not used for negative binomial distributions because of its relationship to the mean 
- The DESeq2 dispersion approximates the coefficient of variation for genes with moderate to high count values and is corrected for genes with low count values 

```{r,fig.height=4,fig.width=8,echo=FALSE}
cv <- function(x) sd(x)/mean(x)
rowCv <- function(x) apply(x,1,cv)

{
  par(mfrow=c(1,2))
  plot(x=log10(rowMeans(cf)+1),y=log10(rowVars(as.matrix(cf))),xlab="Log10 Mean counts",ylab="Log10 Mean Variance")
  plot(x=log10(rowMeans(cf)+1),y=log10(rowCv(as.matrix(cf))),xlab="Log10 Mean counts",ylab="Log10 Mean CV")
  par(mfrow=c(1,1))
}
```


---
name: dge-dispersion-2

## Dispersion

- RNAseq experiments typically have few replicates
- To improve the dispersion estimation in this case, we "borrow" information from other genes with similar mean values

```{r,fig.height=3,fig.width=3}
d <- DESeq2::estimateDispersions(d)
```

![](data/deseq_dispersion.png)
---
name: dge-test

## Testing

- Log2 fold changes changes are computed after GLM fitting `FC = counts group B / counts group A`

```{r}
dg <- nbinomWaldTest(d)
resultsNames(dg)
```

--

- Use `results()` to customise/return results
  - Set coefficients using `contrast` or `name`
  - Filtering results by fold change using `lfcThreshold`
  - `cooksCutoff` removes outliers
  - `independentFiltering` removes low count genes
  - `pAdjustMethod` sets method for multiple testing correction
  - `alpha` set the significance threshold

---
name: dge-test-2

## Testing

```{r}
res <- results(dg,name="Group_day07_vs_day00",alpha=0.05)
summary(res)
```

- Alternative way to specify contrast

```{r,eval=FALSE}
results(dg,contrast=c("Group","day07","day00"),alpha=0.05)
```


---
name: dge-test-3

## Testing

```{r}
head(res)
```

---
name: dge-test-4

## Testing

- Use `lfcShrink()` to correct fold changes for genes with high dispersion or low counts
- Does not change number of DE genes

![](data/lfc_shrink.png)


---
name: acknowl

## Acknowledgements

- RNA-seq analysis [Bioconductor vignette](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html)
- [DGE Workshop](https://github.com/hbctraining/DGE_workshop/tree/master/lessons) by HBC training


<!-- --------------------- Do not edit this and below --------------------- -->

---
name: acknowl

## Acknowledgements

- RNA-seq analysis [Bioconductor vignette](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html)
- [DGE Workshop](https://github.com/hbctraining/DGE_workshop/tree/master/lessons) by HBC training


---
name: end_slide
class: end-slide, middle
count: false

# Thank you. Questions?

```{r,echo=FALSE,child="assets/footer-slide.Rmd"}
```

```{r,include=FALSE,eval=FALSE}
# manually run this to render this document to HTML
rmarkdown::render("slide_dge.Rmd")
# manually run this to convert HTML to PDF
#pagedown::chrome_print("presentation_dge.html",output="presentation_dge.pdf")
```