-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.rmd
127 lines (99 loc) · 4.78 KB
/
README.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
title: "Rapid expressed variant and allelic bias detection for rare variants and rare diseases"
author: ""
date: "5/10/2023"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Contact Infomation
## Installing
To install the private package:
# install.packages("devtools")
library(devtools)
# Bioconductor needs to be activated when automatically installing dependencies.
options(repos = BiocManager::repositories())
devtools::install_github("trichelab/bamSliceR")
library(bamSliceR)
## If you will be analyzing any data that is controlled access, you will need to download a GDC token and store it in a file in your home directory. For more instructions see [GenomicDataCommons Token](https://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicDataCommons/inst/doc/overview.html#authentication) and [Data portal authentication tokens](https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Cart/#gdc-authentication-tokens)
```{bash, eval=FALSE}
echo "my_downloaded_auth_code" > ~/.gdc_token
```
## STEP 1. Download Bam Slices
### Inspect the available aligned sequencing data (BAM) on GDC portal
To query the available BAM files on GDC, three pieces of information are needed:
1) The project ID.
2) Experiment Strategy (ex. "RAN-Seq", "WGS", etc. )
3) Alignment workflow.
These three pieces of information can be inspected by availableProjectId(), availableExpStrategy() and availableWorkFlow() correspondingly to locate the BAM files on GDC portal.
Here, we showed the example using TARGET-AML cohort with 2,281 subjects including 3,225 RNA-seq BAM files.
#### Check ids for all the projects on GDC portal
```{r message=FALSE, warning=FALSE}
library(bamSliceR)
availableProjectId()
```
#### Check the available Experimental Strategies give a project id.
```{r message=FALSE}
availableExpStrategy("TARGET-AML")
```
#### Check the available Workflows for each Experimental Strategies of a project.
```{r message=FALSE}
availableWorkFlow(projectId = "TARGET-AML", es = "RNA-Seq")
```
#### Get the information of BAM files
After knowing the keyword of project, experimental strategy and workflow, we can
collect information of BAMs we are interested in.
```{r message=FALSE}
file_meta = getGDCBAMs(projectId = "TARGET-AML", es = "RNA-Seq", workflow = "STAR 2-Pass Genome")
head(file_meta)
```
#### Example on how to Make the character() vector of genomic regions for BAM slicing.
BAM slicing API from GDC portal accept genomic ranges specifying as vector of character() e.g., c("chr", "chr1:10000").
Here we provide a function to get the required input format given the gene names.
```{r message=FALSE}
target_genes_data = system.file("data", "gene_names.rds", package = "bamSliceR")
target_genes = readRDS(target_genes_data)
target_genes
```
Get either GRanges or vector of character() for exons of the genes.
```{r message=FALSE}
#Get GRanges for exons of all genes above
target_ranges_gr = getGenesCoordinates(target_genes, ret = "GRanges")
head(target_ranges_gr)
#Get the vector of character() instead.
target_ranges_chars = getGenesCoordinates(target_genes, ret ="DF")
head(target_ranges_chars)
```
## Downloading the sliced BAMs
```{r message=FALSE, eval = FALSE}
#Download 3225 RNA-Seq sliced BAMs files with reads within regions defined by "target_ranges_chars" from TARGET-AML.
downloadSlicedBAMs(file_df = file_meta, regions = target_ranges_chars, dir = "BAM_FILES")
```
## STEP 2. Extract variants from sliced BAMs
We first also need to specify the regions as a GRanges object.
```{r message=FALSE}
head(target_ranges_gr)
```
Then we need make the character() vector including all the names of downloaded BAM files.
```{bash, eval=FALSE}
cd DIR_BAM_FILES
ls | grep bam$ > bamfiles
```
In the directory of the downloaded BAM files. And then scan the 'bamfiles' in R.
```{r message=FALSE, eval = FALSE}
bamfiles = scan("bamfiles", "character")
```
Last thing we need is to specify the directory of gmapGenome object created before.
(see [here](https://github.com/trichelab/bamSliceR/blob/main/vignettes/How_to_create_gmapGenome.r) about how to create gmapGenome object.)
```{r message=FALSE, eval = FALSE}
gmapGenome_dir = "/path/to/your/gmapGenome"
```
We can then start to tally the reads of BAMs files:
```{r message=FALSE, eval = FALSE}
tallied_reads = tallyReads(bamfiles = bamfiles, gmapGenome_dir = gmapGenome_dir, grs = target_ranges,
BPPARAM = MulticoreParam(workers = 4 , stop.on.error = TRUE), parallelOnRanges = TRUE,
parallelOnRangesBPPARAM = MulticoreParam(workers = 4) )
```
For more information about how to efficiently run tallyReads() on HPC on large cohorts,
please see the instruction in [here](https://github.com/trichelab/bamSliceR/blob/main/vignettes/How_to_TallyingRead_On_Parallel.md).