adds notebooks for plotting dynamics and SILAC ratios

CambridgeCentreForProteomics · Dec 10, 2019 · 952cfe1 · 952cfe1
1 parent b645828
commit 952cfe1
Show file tree

Hide file tree

Showing 18 changed files with 12,661 additions and 16 deletions.
diff --git a/README.md b/README.md
@@ -1,4 +1,15 @@
-# OOPS_change_in_RNA_binding
-Example of how to detect change in RNA binding from parallel Total and RNA-bound protein abundance TMT quantification using the OOPS protocol.
+# Visualising OOPS
+
+This repository was used to generate figures for the OOPS Nature Protocols manuscript
+
+The repository structure is:
+
+- raw:
+Copies of supplementary data from the original OOPS publication (https://www.nature.com/articles/s41587-018-0001-2)
+
+- notebooks:
+R markdown notebooks to analyse data and generate figures
+
+- results/plots:
+Plots from notebooks
 
-This notebook is for the Nature Protocols supplementary material
diff --git a/Identify_changes_in_RNA_binding.Rmd → ...books/Identify_changes_in_RNA_binding.Rmd b/Identify_changes_in_RNA_binding.Rmd → ...books/Identify_changes_in_RNA_binding.Rmd
@@ -9,8 +9,6 @@ header-includes:
 ---
 
 
-\colorlet{shadecolor}{lightgray!10} 
-
 In this notebook, we will identify changes in RNA binding following parallel quantification of the "total" and "RNA-bound" protein abundance using the OOPS protocol. We start with a single protein and use the base function `lm`, then apply this to the complete dataset to identify all proteins with a change in RNA binding. Following this, we use the `limma` package to test for RNA binding changes using a moderated t-statistic.
 
 Below we load the required packages and set a plotting theme.
@@ -36,7 +34,7 @@ We start by reading in the data. Our input here is the protein-level quantificat
 
 The input data here is identical to supplementary table 5 from the above paper. 
 ```{r}
-protein_quant_raw <- readxl::read_excel('./ncbi_30607034_OOPS_NBT_table_S5.xlsx',
+protein_quant_raw <- readxl::read_excel('../raw/ncbi_30607034_OOPS_NBT_table_S5.xlsx',
                                         sheet=3, skip=1, n_max=1917) %>%
   tibble::column_to_rownames('master_protein')
 
@@ -252,31 +250,31 @@ limma_results_treat_sig <- topTreat(limma_results_treat, coef = "timepoint6h:typ
 
 ```
 
-And below we reproduce our volcano plot including the 95% confidence interval and highlight those proteins which have < 1% FDR and an absolute fold change significantly greater than 2. 
+And below we reproduce our volcano plot including the 95% confidence interval and highlight those proteins which have < 1% FDR and an absolute fold change significantly greater than 2. We can see that many of the proteins with a fold change (FC) point estimate > 2 have a 95% confidence interval that overlaps the dashed lines for >2-fold change. TREAT also takes the multiple testing into account so it's even more conservative than just using the 95% CI shown below.
 
 ```{r, fig.height=6, fid.width=6, out.width = '60%'}
 
-limma_results$sig <- ifelse(limma_results$P.Value<=0.01, "<1% FDR", ">1% FDR") # add "sig" column
+limma_results$sig <- ifelse(limma_results$P.Value<=0.01, "Sig. change (FC > 1; <1% FDR)", ">1% FDR") # add "sig" column
 limma_results$sig[
-  rownames(limma_results) %in% rownames(limma_results_treat_sig)] <- "<1% FDR & FC > 2 (TREAT)"
+  rownames(limma_results) %in% rownames(limma_results_treat_sig)] <- "Sig. change (FC > 2; <1% FDR)"
 
 limma_results$SE <- sqrt(limma_fit_results$s2.post) * limma_fit_results$stdev.unscaled[,1]
 
 p <- limma_results %>%
   ggplot(aes(logFC, -log10(P.Value), colour=sig)) +
-  geom_point(size=1) +
-  scale_colour_manual(values=c(cbPalette[c(6,2)], "grey20"), name="") + # manually adjust colours
+  geom_point(size=0.5) +
+  geom_errorbarh(aes(xmin=CI.L, xmax=CI.R), alpha=0.3, size=0.5) +
+  scale_colour_manual(values=c("grey20", cbPalette[c(2,6)]), name="") + # manually adjust colours
   geom_vline(xintercept=1, linetype=2, colour="grey70") +
   geom_vline(xintercept=-1, linetype=2, colour="grey70") +
-  theme(legend.position="top", legend.direction=2)
+  theme(legend.position="top", legend.direction=2) +
+  xlab('Fold change (log2)') +
+  ylab('-log10(P-value)')
   
 print(p)
+ggsave('../results/plots/rna_binding_changes_volcano.png')
 ```
 
-We can also add error bars to the above Below, we can see that many of the proteins with a fold change (FC) point estimate > 2 have a 95% confidence interval that overlaps the dashed lines for >2-fold change. TREAT also takes the multiple testing into account so it's even more conservative than just using the 95% CI shown below.
-```{r, fig.height=6, fid.width=6, out.width = '60%'}
-print(p + geom_errorbarh(aes(xmin=CI.L, xmax=CI.R)))
-```
 
 Of course, the threshold for the fold changes you are interested in depends entirely on your prior expectations. A 2-fold threshold for "biologically" relevant changes will likely be overly conservative in the great majority of cases.
 

diff --git a/notebooks/Identify_changes_in_RNA_binding.nb.html b/notebooks/Identify_changes_in_RNA_binding.nb.html
diff --git a/Identify_changes_in_RNA_binding.pdf → ...books/Identify_changes_in_RNA_binding.pdf b/Identify_changes_in_RNA_binding.pdf → ...books/Identify_changes_in_RNA_binding.pdf
diff --git a/notebooks/OOPS_CL_RNase_control_ratios.Rmd b/notebooks/OOPS_CL_RNase_control_ratios.Rmd
@@ -0,0 +1,96 @@
+---
+title: "OOPS CL:NC and RNase +:- ratios"
+output:
+  pdf_document: default
+  html_notebook: default
+  html_document:
+    df_print: paged
+header-includes:
+- \usepackage{xcolor}
+- \usepackage{framed}
+---
+
+Here we want to visualise the CL vs NC (non-crosslinked) and RNase vs Ctrl SILAC experiments.
+
+Below we load the required packages and set a plotting theme.
+```{r, message=FALSE, warning=FALSE}
+# load packages
+library(tidyverse)
+library(ggbeeswarm)
+
+# set up standardised plotting scheme
+theme_set(theme_bw(base_size = 20) +
+            theme(panel.grid.major=element_blank(),
+                  panel.grid.minor=element_blank(),
+                  aspect.ratio=1))
+
+cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
+               "#0072B2", "#D55E00", "#CC79A7", "#999999")
+```
+
+
+We start by reading in the data. Our input here is the protein-level quantification for the CL vs NC and RNase + vs - experiments conducted for the OOPS NBT paper (https://www.nature.com/articles/s41587-018-0001-2). The peptide-level abundances have been aggregated to protein level abundance and center-median normalised. Proteins with missing values have been removed. Only proteins quantified in both "total" and "OOPS" samples are included.
+
+The input data here is identical to supplementary table 1 from the above paper. Below we read in the data and extract the ratio between CL vs NC or RNase vs Ctrl from the respective datasheets. Where the protein was not quantified in the NC/Ctrl samples, the ratio is NA and the value in the CL/RNase sample is used instead. This represents a "pseudo value" for the ratio which could not be quantified.
+```{r}
+glycoproteins <- read.delim('../raw/glycoproteins.tsv') %>% pull(protein)
+
+cl_nc_protein_quant_raw <- readxl::read_excel('../raw/ncbi_30607034_OOPS_NBT_table_S1.xlsx',
+                                        sheet=1, skip=1, n_max=2655, na='NA',
+                                        col_types = c("text", "numeric", "text", "numeric", "numeric", "numeric")) %>%
+  filter(step==3, !master_protein %in% glycoproteins) %>%
+  filter(is.finite(CL)) %>%
+  mutate(pseudo_CL_NC_Ratio=ifelse(is.na(CL_NC_Ratio), CL, CL_NC_Ratio)) %>%
+  select(master_protein, ratio=pseudo_CL_NC_Ratio) %>%
+  mutate(exp='CL:NC')
+
+
+RNase_ctrl_protein_quant_raw <- readxl::read_excel('../raw/ncbi_30607034_OOPS_NBT_table_S1.xlsx',
+                                        sheet=3, skip=1, n_max=4307, na='NA',
+                                        col_types = c("text", "text", "text", "numeric", "numeric", "numeric", "numeric",
+                                                      "numeric", "numeric", "numeric", "text")) %>%
+  filter(!master_protein %in% glycoproteins, cell_line=='U2OS', Phase=='Org') %>%
+  filter(is.finite(RNAse)) %>%
+  mutate(pseudo_RNAse_NC_Ratio=ifelse(is.na(RNAse_NC_Ratio), RNAse, RNAse_NC_Ratio)) %>%
+  select(master_protein, ratio=pseudo_RNAse_NC_Ratio) %>%
+  mutate(exp='RNase:Ctrl')
+
+```
+
+Combine the two experiments
+```{r}
+combined_cl_rnase <- rbind(RNase_ctrl_protein_quant_raw, cl_nc_protein_quant_raw)
+```
+
+Plot
+```{r}
+
+ratios_p1 <- combined_cl_rnase %>% 
+  mutate(exp=factor(exp, levels=c('CL:NC', 'RNase:Ctrl'))) %>%
+  ggplot(aes(ratio, fill=exp)) +
+  geom_histogram(bins=60) +
+  facet_grid(exp~., scales='free_y') +
+  xlab('Ratio (log2)') +
+  ylab('Proteins') +
+  geom_vline(xintercept=0, linetype=2, colour='grey50') +
+  scale_fill_manual(values=cbPalette[2:3], guide=FALSE) +
+  theme(panel.spacing = unit(0.25, "lines"))
+
+print(ratios_p1)
+ggsave('../results/plots/ratios_p1.png')
+
+ratios_p2 <- combined_cl_rnase %>% 
+  mutate(exp=factor(exp, levels=c('RNase:Ctrl', 'CL:NC'))) %>%
+  ggplot(aes(exp, ratio, colour=exp)) +
+  geom_quasirandom(size=0.5, bandwidth=0.25) +
+  coord_flip() +
+  xlab('') +
+  ylab('Ratio (log2)') +
+  geom_hline(yintercept=0, linetype=2, colour='grey50') +
+  scale_colour_manual(values=cbPalette[c(3,2)], guide=FALSE)
+
+print(ratios_p2)
+ggsave('../results/plots/ratios_p2.png')
+
+```
+
diff --git a/notebooks/OOPS_CL_RNase_control_ratios.html b/notebooks/OOPS_CL_RNase_control_ratios.html
diff --git a/notebooks/OOPS_CL_RNase_control_ratios.nb.html b/notebooks/OOPS_CL_RNase_control_ratios.nb.html
diff --git a/notebooks/OOPS_CL_RNase_control_ratios.pdf b/notebooks/OOPS_CL_RNase_control_ratios.pdf
diff --git a/notebooks/Visualise_dynamics.Rmd b/notebooks/Visualise_dynamics.Rmd
@@ -0,0 +1,98 @@
+---
+title: "Visualising RNA binding dynamics"
+output:
+  pdf_document: default
+  html_notebook: default
+header-includes:
+- \usepackage{xcolor}
+- \usepackage{framed}
+---
+
+
+Here we will visualise total and RNA-bound protein abundances across conditions.
+
+Below we load the required packages and set a plotting theme.
+```{r, message=FALSE, warning=FALSE}
+# load packages
+library(tidyverse)
+library(UniProt.ws)
+
+# set up standardised plotting scheme
+theme_set(theme_bw(base_size = 15) +
+            theme(panel.grid.major=element_blank(),
+                  panel.grid.minor=element_blank(),
+                  aspect.ratio=1))
+
+cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
+               "#0072B2", "#D55E00", "#CC79A7", "#999999")
+```
+
+
+We start by reading in the data. Our input here is the protein-level quantification for the Nocodazole arrest/release experiment conducted for the OOPS NBT paper (https://www.nature.com/articles/s41587-018-0001-2). In this experiment, we wanted to assess changes in RNA binding in arrested/released cells. To do this, we quantified "total" protein abundance and RNA-bound (extracted by OOPS) protein abundance. The peptide-level abundances have been aggregated to protein level abundance and center-median normalised. Proteins with missing values have been removed. Only proteins quantified in both "total" and "OOPS" samples are included.
+
+The input data here is identical to supplementary table 5 from the above paper. 
+```{r}
+protein_quant_raw <- readxl::read_excel('../raw/ncbi_30607034_OOPS_NBT_table_S5.xlsx',
+                                        sheet=3, skip=1, n_max=1917)
+
+```
+
+In order to plot a functional subset of proteins, we will use the UniProt pathway annotations.
+
+Warning: This cell will take a few minutes to run the query on the Uniprot database...
+```{r}
+humanUP <- UniProt.ws(taxId=9606) # H.sapiens
+protein_ids <- protein_quant_raw$master_protein
+
+hsapiens.annot <- AnnotationDbi::select(
+  humanUP,
+  keys = protein_ids, 
+  columns = c("PATHWAY",  "PROTEIN-NAMES"),
+  keystyle = "UNIPROTKB")
+
+hsapiens.pathway <-  hsapiens.annot %>% data.frame() %>% 
+  separate_rows(PATHWAY, sep="; ") %>% dplyr::select(UNIPROTKB, PROTEIN.NAMES, PATHWAY)
+
+```
+
+Identify the glycolysis proteins
+```{r}
+glycolysis_proteins <- hsapiens.pathway %>% filter(PATHWAY=='glycolysis')
+glycolysis_proteins$cleaned_protein_name <- sapply(strsplit(glycolysis_proteins$PROTEIN.NAMES, split='\\('), '[[', 1)
+```
+
+Restructure the data and subset to the glycolysis proteins
+```{r}
+glycolysis_intensities <- protein_quant_raw %>%
+  gather(key='sample', value='intensity', -master_protein) %>%
+  merge(glycolysis_proteins, by.x='master_protein', by.y='UNIPROTKB') %>%
+  separate(sample, into=c('timepoint', 'replicate', 'type'), remove=FALSE) %>%
+  mutate(type=factor(type, levels=c('total', 'OOPS'))) %>%
+  mutate(timepoint=factor(timepoint, levels=c('0h', '6h', '23h')))
+
+glycolysis_intensities$type <- recode(glycolysis_intensities$type,  'OOPS'='RNA-bound', 'total'='Total')
+```
+
+Plot the glycolysis proteins
+```{r, fig.width=10}
+protein_order <- glycolysis_intensities %>%
+  group_by(cleaned_protein_name) %>% summarise(max_intensity=max(intensity)) %>%
+  arrange(max_intensity) %>% pull(cleaned_protein_name)
+
+p <- glycolysis_intensities %>%
+  mutate(cleaned_protein_name=factor(cleaned_protein_name, levels=protein_order)) %>%
+  ggplot(aes(interaction(replicate, timepoint), cleaned_protein_name, fill=intensity)) +
+  geom_tile(colour='grey80', lwd=0.1) +
+  facet_grid(.~type) +
+  ylab('') + xlab('') +
+  scale_x_discrete(labels=c('', '0h', '', '', '6h', '', '', '23h', '')) +
+  geom_vline(xintercept=3.5) +
+  geom_vline(xintercept=6.5) +
+  scale_fill_gradient(low=cbPalette[1], high=cbPalette[5], name='Protein abundance\n(centre-median normalised)') +
+  theme(axis.text.y=element_text(size=10))
+
+print(p)
+
+ggsave('../results/plots/rna_binding_changes_heatmap.png')
+```
+
diff --git a/notebooks/Visualise_dynamics.nb.html b/notebooks/Visualise_dynamics.nb.html
diff --git a/notebooks/Visualise_dynamics.pdf b/notebooks/Visualise_dynamics.pdf