-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathBRCA.Rmd
183 lines (149 loc) · 7.16 KB
/
BRCA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
title: "BRCA data exploration"
author: "Urminder Singh"
date: "May 2, 2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#load required libraries
library(dplyr)
library(plyr)
library("readr")
library(TCGAbiolinks)
```
## BRCA data exploration
This is a short lesson on how to explore the mutation data at TCGA using TCGABiolinks and maftools libraries.
Using study metadata is crucial to the analysis of TCGA datasets. I have provided some functions which user can use to format TCGA metadata and easily use that in analysis. I will particulary look at BRCA data at TCGA but the methods and functions provided here could easily be extended to any cancer type.
## Downloading TCGA Metadata
For all the data deposited in TCGA, there is associated clinical metadata. Clinical metadata could be critical in understanding the data from different perspectives. For example, the clinical data has information about tumor stage, gender, age etc. all of which could be helpful in data analysis from various perspectives.
TCGAbiolinks provides the function "*GDCquery_clinic*" to download the clinical metadata. The clinical metadata is separated into two categories i.e. Clinical and Biospecimen. More information on this is available at https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/clinical.html
A simple example to use *GDCquery_clinic* function is shown below:
```{r }
clinicalBRCA <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
biospecimenBRCA <- GDCquery_clinic(project = "TCGA-BRCA", type = "Biospecimen")
head(clinicalBRCA)
head(biospecimenBRCA)
```
The clinical metadata is saved in clinical and Biospecimen categories to represent two different type of data associated with each patient. This also removes a lot redundancy from the data had this been a single table. This structure is also reflected in the downloaded metadata and this makes it really hard to use the metadata.
For example, in the *biospecimenBRCA* table, the column *portions* contain a dataframe for each row. This is beacuse a multiple portions could be mapped to a single *submitter_id* (for more information on metadata see: https://docs.cancergenomicscloud.org/docs/tcga-metadata). This is true for other columns *portions.analytes*, *portions.analytes.aliquots"* as well.
I have written a function *getTCGAMetadata* to format the clinical metadata into one table. This makes the metadata simple and easy to use. Example usage is shown below:
```{r }
#function to expand columns of TCGA metadata from list to df. Used in getTCGAMetadata function
expand<-function(df,colName){
res<-data.frame()
#for each row
for(i in 1: dim(df)[1]){
thisRow<-df[i, ! (colnames(df) %in% c(colName))]
tempdf<-as.data.frame(df[i, c(colName)])
#if list is empty skip that row
if(dim(tempdf)[1]<1){
next
}
#change colnames so they are unique
colnames(tempdf)<-paste(paste(colName,".",sep = ""),colnames(tempdf),sep = "")
#print(paste(i,colnames(tempdf)))
newRow<-cbind(thisRow,tempdf,row.names = NULL)
res<-bind_rows(res,newRow)
}
#print(res)
return(res)
}
#function to download and combine TCGA clinical and Biospecimen metadata given a project name
#example usage: getTCGAMetadata("TCGA-BRCA")
getTCGAMetadata<-function(projName){
print(paste("Downloading",projName))
clinicalBRCA <- GDCquery_clinic(project = projName, type = "clinical")
biospecimenBRCA <- GDCquery_clinic(project = projName, type = "Biospecimen")
#rename all cols from clinical table with suffix clinical
colnames(clinicalBRCA)<- paste0("clinical.",colnames(clinicalBRCA))
#expand biospecimen data in the order portions, portions.analytes, portions.analytes.aliquots
toUnpack<-c("portions", "portions.analytes", "portions.analytes.aliquots")
for(s in toUnpack){
biospecimenBRCA<-expand(biospecimenBRCA,s)
}
#add patient barcode to biospecimen data
biospecimenBRCA<- biospecimenBRCA %>% mutate(clinical.bcr_patient_barcode=substr(submitter_id,1,nchar(as.character(submitter_id))-4))
#join clinical and biospecimen
finalJoined<-join(clinicalBRCA,biospecimenBRCA,by="clinical.bcr_patient_barcode")
return(finalJoined)
}
#a list of useful (suggested) columns to retain from the TCGA metadata (to reduce the dimensions)
colsToKeep<-c("clinical.submitter_id",
"clinical.classification_of_tumor",
"clinical.primary_diagnosis",
"clinical.tumor_stage",
"clinical.age_at_diagnosis",
"clinical.vital_status",
"clinical.days_to_death",
"clinical.tissue_or_organ_of_origin",
"clinical.days_to_birth",
"clinical.site_of_resection_or_biopsy",
"clinical.days_to_last_follow_up",
"clinical.cigarettes_per_day",
"clinical.weight",
"clinical.alcohol_history",
"clinical.bmi",
"clinical.years_smoked",
"clinical.height",
"clinical.gender",
"clinical.year_of_birth",
"clinical.race",
"clinical.ethnicity",
"clinical.year_of_death",
"clinical.bcr_patient_barcode",
"clinical.disease",
"submitter_id",
"sample_type",
"tissue_type",
"portions.submitter_id",
"portions.analytes.analyte_type",
"portions.analytes.submitter_id",
"portions.analytes.analyte_type_id",
"portions.analytes.aliquots.analyte_type",
"portions.analytes.aliquots.submitter_id")
#download BRCA metadata
brcaMetadata<-getTCGAMetadata("TCGA-BRCA")
#only keep useful columns
brcaMetadata<-brcaMetadata[,colsToKeep]
head(brcaMetadata)
```
Similarly to download and combine metadata from multiple TCGA project one can use:
```{r eval=FALSE}
#download metadata of following projects into a single dataframe
tcgaProjList<-c("TCGA-BLCA","TCGA-HNSC","TCGA-ESCA","TCGA-PRAD")
#mdList will have all metadata for tcgaProjList
mdListDF<-data.frame()
for(s in tcgaProjList){
#mdList<-c(mdList,getjoinedBiospcClinc(s))
if(dim(mdListDF)[1]<1){
mdListDF<-getjoinedBiospcClinc(s)
}else{
temp<-getjoinedBiospcClinc(s)
mdListDF<-bind_rows(mdListDF,temp)
}
}
```
Now we can use the downloaded metadata in subsequent analysis.
## Downloading Mutation Data
The mutation data could be downloaded using TCGAbiolinks' *GDCquery_Maf* function. *GDCquery_Maf* will will download the mutation data in Mutation Annotation Format (MAF) aligned against hg38. Example usage is as follows:
```{r eval=FALSE}
BRCAmaf <- GDCquery_Maf("CHOL", pipelines = "varscan2")
```
To download and combine multiple MAF files one can do as follows:
```{r eval=FALSE}
#function to download maf file given a project name
getmaf<-function(projName){
thisMaf<-GDCquery_Maf(projName, pipelines = "varscan2")
return(thisMaf)
}
#function to download maf files given a list project names. Returned files is combined into one bigger file.
getmafs<-function(projList){
allMaf<-data.frame()
for(p in projList){
temp<-GDCquery_Maf(p, pipelines = "varscan2")
allMaf<-rbind(allMaf,temp)
}
return(allMaf)
}
```