---
title: "Run Cut&Run Peak Calling"
always_allow_html: true
params:
  outdir:
    value: ""
    input: text
output:
  html_document:
    theme: yeti
    highlight: breezedark
    toc: true
    toc_float: true
    toc_depth: 3
    number_sections: true
    fig_caption: true
    df_print: paged
  github_document:
    html_preview: false
    toc: true
    fig_width: 5
    fig_height: 5
    toc_depth: 3
editor_options:
  markdown:
    wrap: 72
---
```{r set-up, echo=FALSE}
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80),
tidy=TRUE,
fig.align='center',
fig.width = 10, fig.height = 10,
eval = FALSE)
options(stringsAsFactors = FALSE, max.print = 100)
table = function (..., useNA = 'ifany') base::table(..., useNA = useNA)
```
```{r message = FALSE, warning=FALSE, echo=FALSE, eval=TRUE}
library(stringr)
library(magrittr)
library(dplyr)
library(tidyr)
library(tibble)
```
# About the Pipeline
The pipeline runs
[Bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
alignment, quality trimming of reads with
[Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/),
[SEACR](https://github.com/FredHutch/SEACR) peak calling, and optionally
[MACS2](https://github.com/macs3-project/MACS) peak calling. MACS2
requires an effective genome size to call peaks, which you can provide
directly, or the pipeline can call
[`unique-kmers.py`](https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html)
to calculate the effective genome size on the fly. Coverage tracks are
produced for visualization in [IGV](https://igv.org/doc/desktop/).

It will also produce general QC statistics on the fastqs with
[fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
and on the alignment, peak calling, and sample similarity using
[deeptools](https://deeptools.readthedocs.io/en/develop/). Finally, the
QC reports are collected into a single file using
[multiQC](https://multiqc.info/).

A DAG (directed acyclic graph) of the default workflow is shown below:
```{r echo=FALSE, eval=TRUE}
knitr::include_graphics("images/dag.png")
```
# Activate the Environment on HPC
The directions to set up the Nextflow workflow requirements are found in
the [README.md](../README.md). Ensure that you have followed the steps
to fork and clone the repository and to create the nextflow conda
environment before starting with this document.
### 1) Interactive Session
Optional but recommended: use `tmux` on the Cybertron login nodes. Name
the session nextflow, request an interactive session, and then activate
the nextflow conda environment. The project codes can be found with the
`project info` command. Change the `$QUEUE` and `$NAME` variables in the
code chunk below to be accurate for your Cybertron projects.
```{bash optional, eval=FALSE}
tmux new-session -s nextflow
project info
NAME="RSC_adhoc"
QUEUE="sceaq"
qsub -I -q $QUEUE -P $(project code $NAME) -l select=1:ncpus=1:mem=8g -l walltime=8:00:00
```
### 2) Open the CutandRun workflow folder
Navigate to where you placed the cloned (copied) cutandrun_nf directory,
and then check out the latest release branch.
```{bash eval=FALSE}
cd /path/to/my/cutandrun_nf
git fetch
# This lists all branches of the repository. Find the release branch with the latest version shown in red text, which means you don't yet have a local copy of that branch.
git branch -a
# Then check out the latest version, for example 2.0.0. This downloads the stable version of the pipeline locally.
git checkout release/2.0.0
# Now the * indicates that you're on the release branch, and it's no longer shown in red text.
git branch
```
### 3) Activate the conda environment
Activate the Nextflow conda environment.
```{bash required, eval=FALSE}
# create the environment first if you have not already done so (see README.md)
conda env create -f env/nextflow.yaml
conda activate nextflow
```
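To confirm that the environment is active, a quick sanity check is to
print the Nextflow version:
```{bash eval=FALSE}
# verify that nextflow is on the PATH and report its version
nextflow -version
```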
# Examine the Sample Sheet
A sample sheet in CSV (comma-separated values) format is used as input
to the pipeline. This sample sheet **must** have the following column
names, in any order:
- "sample"
- "sample_id"
- "target_or_control"
- "single_end"
- "read1"
- "read2"
```{r echo=FALSE, eval=TRUE}
sample_desc <- c("Any alphanumeric string for each biological sample in the dataset. The same sample ID is used for each antibody. For example, SAMPLE_1 has both H3K27me3 and IgG control CUT&RUN, and thus SAMPLE_1 has one row with the files for the H3K27me3 data and a second row with the files for the IgG data.")
sample_id_desc <- glue::glue("Any alphanumeric string for each unique sample+condition. No duplicates allowed. For example, SAMPLE_1 has both H3K27me3 and IgG control CUT&RUN. Thus, SAMPLE_1 is the value in `sample`, and \"SAMPLE_1_H3K27me3\" is the value in `sample_id`. Again, SAMPLE_1 has a second row with the files for the IgG data, where SAMPLE_1 is the value in `sample` and \"SAMPLE_1_IgG\" is the value in `sample_id`.")
target_desc <- c("Must contain the values [target or control], case-sensitive. Use target for antibodies immunoprecipitating the proteins of interest, such as transcription factors or histone modifications like H3K27me3, and control for the isotype control (eg IgG).")
read1 <- c("Contains absolute filepaths to read 1 of the paired-end fastqs.")
read2 <- c("Contains absolute filepaths to read 2 of the paired-end fastqs.")
single_end <- c("For CUT&RUN data this should always be [false], case-sensitive.")
```
```{r eval=TRUE, echo=FALSE}
data.frame(column_name=c("sample","sample_id", "target_or_control","read1", "read2","single_end"),
column_description=c(sample_desc,sample_id_desc,target_desc,read1, read2, single_end)) %>%
knitr::kable()
```
### Examples
**1)** Below is an example of a complete sample sheet for use in the
pipeline, which can be edited for your own samples in
`test_data/test_dataset_sample_sheet.csv`.

-   It contains IgG control samples for peak calling.
-   This sample sheet is OK to use even if you elect to skip IgG
    normalization in SEACR or the IgG background in MACS2. The pipeline
    will simply not use the controls.
-   Use the `threshold` and `no_control_macs2` parameters in
    `nextflow.config` to change this. Details are found in [Configure
    Pipeline for Your Data](#configure-pipeline-for-your-data).
```{r eval=TRUE, echo=FALSE}
sample_sheet <- read.csv(here::here("test_data/test_dataset_sample_sheet.csv")) %>%
mutate(single_end="false") %>%
mutate(sample=str_split_fixed(sample_id, pattern = "_", n=3)[,1]) %>%
select(sample, everything())
head(sample_sheet) %>%
knitr::kable()
# dim(sample_sheet)
```
**2)** Below is another example of a complete sample sheet for use in
the pipeline.

-   It lacks IgG control samples for peak calling.
-   This sample sheet is OK to use only if you modify the parameters to
    skip using IgG controls.
-   Use the `threshold` and `no_control_macs2` parameters in
    `nextflow.config` to modify this. Details are found in [Configure
    Pipeline for Your Data](#configure-pipeline-for-your-data).
```{r echo=FALSE, eval=TRUE}
sample_sheet_noCtrl <- sample_sheet %>%
mutate(target_or_control="target")
head(sample_sheet_noCtrl) %>%
knitr::kable()
```
# Run the Example Data
To ensure that the pipeline works, first run the test data set. This
example will run using the data found in
`test_data/test_dataset_sample_sheet.csv`.
```{bash}
./main_run.sh "test_dataset"
```
# Configure Pipeline for Your Data {#configure-pipeline-for-your-data}
## Configuration file
Open the configuration file `nextflow.config` and edit the necessary
parameters for building the index, and/or running the alignment or peak
calling steps.
```{r warning=FALSE, echo=FALSE, eval=TRUE}
config_file <- readLines(here::here("nextflow.config")) %>%
noquote()
params_lines <- c(grep("global parameters", config_file),
grep("multiqc_config", config_file)+1)
```
```{r warning=FALSE, echo=FALSE, eval=TRUE}
config_file %>%
head(.,n = 20) %>%
c(., "<...>") %>%
cat(.,sep="\n")
```
## Global Params
Be sure to change the following lines for the global parameters:
- sample_sheet
- queue
- project code
- outdir
- peak_outdir
```{r echo=FALSE, eval=TRUE}
end <- grep("Bowtie params",config_file)-1
config_file[params_lines[1]:end] %>%
cat(.,sep = "\n")
```
## Genomic References
Additionally, determine whether you require a new bowtie2 index to be
built for the target genome and/or the spike-in genome. The pipeline
requires either a fasta filepath OR a Bowtie2 index filepath. The same
is required for the spike-in, with *E. coli* provided as the default.

*E. coli* is the default because *E. coli* DNA is carried over from the
CUT&RUN library prep methodology and is expected to be present in all
CUT&RUN experiments, regardless of whether an exogenous spike-in such as
yeast is used. Please see [here](https://doi.org/10.7554/eLife.46314)
for more information on spike-in normalization.
Change the following lines for alignment reference files when needed:
- build_index
- fasta
- index
- build_spike_index
- spike_fasta
- spike_index
```{r echo=FALSE, eval=TRUE}
bowtie2 <- c(grep("Bowtie params", config_file), grep("\\sspike_index.+\\=", config_file))
config_file[bowtie2[1]:bowtie2[2]] %>%
cat(.,sep = "\n")
```
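If `build_index = true`, the pipeline builds the Bowtie2 index for you.
For reference, the underlying step amounts to a `bowtie2-build` call
along the lines of the sketch below; the filepaths are hypothetical
placeholders:
```{bash eval=FALSE}
# build a Bowtie2 index from a reference fasta; the resulting index
# prefix is what the `index` (or `spike_index`) parameter points to
bowtie2-build --threads 4 /path/to/genome.fa /path/to/index/genome
```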
## SEACR
SEACR defaults to using IgG control normalization and stringent peak
calling, e.g. `SEACR_1.3.sh target_bedgraph igg_bedgraph norm stringent`.
If skipping the use of the IgG control altogether, set `threshold` to
any value \> 0.

If you would like to use spike-in normalization, either *E. coli* or an
exogenous spike-in like Drosophila, set `spike_norm` to true. See
[Advanced Options](#advanced-options) for details on turning off IgG
normalization.
- threshold
- spike_norm
- chrom_sizes
- scale_factor_constant
```{r echo=FALSE, eval=TRUE}
c( config_file[grep("SEACR params", config_file):(grep("scale_factor_constant", config_file)+1)]) %>%
cat(.,sep = "\n")
```
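For reference, SEACR v1.3 has a positional interface whose second
argument is either an IgG control bedgraph or a numeric threshold. The
sketch below shows the two modes the `threshold` parameter toggles
between; file names are hypothetical:
```{bash eval=FALSE}
# default: normalize against an IgG control bedgraph
bash SEACR_1.3.sh target.bedgraph igg.bedgraph norm stringent target_peaks

# no control: pass a numeric threshold instead (e.g. top 1% of regions)
bash SEACR_1.3.sh target.bedgraph 0.01 non stringent target_peaks
```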
## Optional: MACS2
Finally, decide whether to run MACS2 peak calling along with the SEACR
peak calling algorithm (default = true). MACS2 will use the effective
genome size value provided in the `gsize` parameter.

If you are using a non-model organism, or simply don't want to use the
effective genome size provided in the literature or the MACS2
documentation, you can set `calc_effective_gsize = true` to calculate an
effective genome size using the target genome fasta (the `fasta`
filepath) and the read length.
- run_macs2
- no_control_macs2
- gsize
- calc_effective_gsize
- read_length
```{r echo=FALSE, eval=TRUE}
c( config_file[grep("MACS2 params", config_file):(grep("no_control", config_file)+1)],
config_file[grep("\\sgsize.+\\=", config_file):grep("read_length", config_file)]) %>%
cat(.,sep = "\n")
```
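To illustrate what these parameters control: when
`calc_effective_gsize = true`, the effective genome size is estimated by
k-mer counting at the read length rather than taken from `gsize`. A
sketch of the underlying tool invocations, with hypothetical paths and
sample names:
```{bash eval=FALSE}
# estimate the effective genome size with khmer, using k = read length
unique-kmers.py -k 50 /path/to/genome.fa

# a MACS2 call with an IgG control and an explicit effective genome size
macs2 callpeak -t sample_1_H3K27me3.bam -c sample_1_IgG.bam \
    -f BAMPE -g 2.7e9 -n sample_1_H3K27me3
```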
## Advanced Options {#advanced-options}
In the `nextflow.config`, you can define additional command line
arguments to the scientific software under the [process
scope](https://www.nextflow.io/docs/latest/process.html#). You may use
the advanced options to change the computational resources requested for
different processes: the CPUs and memory parameters can be updated to
request more resources if the input files are large. You may also edit
the command line parameters for processes in the workflow using the
`ext.args`
[directive](https://www.nextflow.io/docs/latest/process.html#directives).

Please be aware that the **default command line parameters for the
`Bowtie2` processes are already provided for both target and spike-in
alignment**, but they can be edited.

The most commonly modified and important process parameters are listed
toward the top of the process scope in the `nextflow.config` file.

You can edit the command line parameters for `SEACR` and `MACS2`, which
often need to be re-run multiple times when deciding on the appropriate
peak-set to use. For example, the `MACS2` broad and narrow peak calling
parameters for different histone modifications can be modified using the
`ext.args` parameter.
```{r eval=TRUE, echo=FALSE}
process_lines <- c(grep("Computational resource", config_file),
                   grep("Picard to mark duplicate reads", config_file)-1)
config_file[c(process_lines[1]:process_lines[2])] %>%
  gsub("^.+(mode:.+|saveAs.+|fail.+).+$","", .) %>%
  gsub("\\]$", "", .) %>%
  .[-grep("^$", .)] %>%
  gsub("\\[ path.+", "[...]",.) %>%
  cat(., sep = "\n")
```
SEACR can be set to v1.4 or v1.3, which have notably different command
line interfaces and different methods for normalization to IgG; v1.4 can
also optionally remove peaks found in IgG. Please see
[here](https://github.com/FredHutch/SEACR/releases) for the full
changelog.

For SEACR v1.3, you will often need to change the normalization argument
from "non" to "norm" depending on whether you're using IgG normalization
or spike-in normalization. The example below demonstrates how to change
the command line params and the version by editing `ext.version` and
`ext.args`.
```{r eval=TRUE, echo=FALSE}
seacr_lines <- c(grep("SEACR peak calling", config_file),
grep("MACS2 peak", config_file)-2)
config_file[c(seacr_lines[1]:seacr_lines[2])] %>%
gsub("^.+(mode:.+|saveAs.+|fail.+).+$","", .) %>%
gsub("\\]$", "", .) %>%
.[-grep("^$", .)] %>%
gsub("\\[ path.+", "[...]",.) %>%
gsub("\\'1.4\\'","'1.3'", .) %>%
gsub("--norm.+", "norm stringent'", .) %>%
cat(.,sep = "\n")
```
# Run Script
```{r eval=TRUE}
usethis::edit_file(here::here("main_run.sh"))
```
```{r eval=TRUE, echo=FALSE}
main_run <- readLines(here::here("main_run.sh"))
# main_run
```
Decide on the `NFX_PROFILE`, which allows you to run the processes
either locally or using the PBS job scheduler on Cybertron, and
determine whether you'd like to use singularity containers or docker
containers.

1)  `PBS_singularity` [DEFAULT, recommended]
    -   submits a PBS job that will use singularity containers on
        Cybertron
    -   takes care of requesting the appropriate resources using PBS
2)  `local_singularity`
    -   runs locally on an interactive session on Cybertron with
        singularity
    -   requires that appropriate computational resources be requested
        using
        `qsub -I -q <queue_name> -P <project_code> -l select=1:ncpus=4:mem=32GB`

Edit the script `main_run.sh` and change the value of the `NFX_PROFILE`
variable if desired.
```{r eval=TRUE, echo=FALSE}
idx <- c(grep("#Options", main_run), grep("#Options", main_run)+1) %>%
.[order(.)]
main_run[head(idx,2)] %>%
cat(.,sep = "\n")
```
## Alignment and Peak Calls
Edit the variables in the `main_run.sh` script for the entry-point of
the workflow. The default option *"align_call_peaks"* for `NFX_ENTRY`
will run the full pipeline (QC, alignment, peak calling, coverage
tracks).
```{r eval=TRUE, echo=FALSE}
main_run[tail(idx, n=2)] %>%
cat(.,sep = "\n")
```
If you already have aligned BAM files, see
`test_data/test_dataset_bams_sample_sheet.csv` for an example sample
sheet, and call peaks only by using the entry `call_peaks`.
```{r eval=TRUE, echo=FALSE}
main_run[tail(idx, n=2)] %>%
gsub("\\=\\'align_call_peaks\\'", "='call_peaks'", .) %>%
cat(.,sep = "\n")
```
Then, execute the `main_run.sh` script to run the peak calling on the
samples. Provide a short, descriptive prefix for the pipeline run.
```{bash}
./main_run.sh "my_analysis"
```
## Optional: Build the Index and Exit Pipeline
You can also change the entry-point of the workflow by setting the
`NFX_ENTRY` variable in the `main_run.sh` script to
`bowtie2_index_only`. This will run only the Bowtie2 index build process
and exit upon completion of the index building step.
```{r eval=TRUE, echo=FALSE}
main_run[tail(idx, n=2)] %>%
gsub("\\=\\'align_call_peaks\\'", "='bowtie2_index_only'", .) %>%
cat(.,sep = "\n")
```
```{bash}
./main_run.sh "bowtie2_index"
```
# Expected Outputs
Under the path provided by the `outdir` parameter in `nextflow.config`,
you will find directories named for each of the modules.
### Final Outputs
`results/{params.outdir}`

-   `samtools_view/`
    -   Aligned, coordinate-sorted, duplicate-marked, and optionally
        quality-filtered BAM file.
    -   `{sample_id}.markedDup.filter.bam`
-   `samtools_index/`
    -   `{sample_id}.markedDup.filter.sort.bam.bai`
-   `deeptools_bamcoverage/`
    -   Counts per million (CPM) normalized coverage track (bigWig)
        file.
    -   `{sample_id}_CPM.bigWig`
-   `{params.peaks_outdir}/seacr_callpeak/`
    -   If using IgG normalization, the {sample_id} of the IgG control
        used is appended to the target.
    -   `{sample_id}_[norm,non].[stringent,relaxed].bed`
-   `{params.peaks_outdir}/macs2_callpeak/`
    -   Optional output if `run_macs2 = true`
    -   `{sample_id}_peaks.[narrowPeak,broadPeak]`
## Pipeline Reports
In addition, there will be an HTML report with information on where the
temp data is stored in the `workDir` path, and general run statistics
such as resources utilized versus requested, which helps with
optimization. It will also provide information on how much walltime was
used per sample, total CPU hours, etc.

The HTML file is found in the `reports` directory and will have the
prefix defined on the command line when `./main_run.sh "my_analysis"`
was invoked, so in this example it would be named
"my_analysis\_{DATE}.html".

There will also be a detailed nextflow log file that is useful for
debugging, which in this example would be named
"my_analysis\_{DATE}\_nextflow.log".

Finally, the pipeline will produce a DAG (directed acyclic graph), which
describes the workflow channels (inputs) and the modules. The DAG image
will be saved under the `dag/` directory with the name
"my_analysis\_{DATE}\_dag.pdf".
```{r echo=FALSE, eval=TRUE}
knitr::include_graphics("images/dag.png")
```
### Complete File Structure
There will be the following file structure:
```{r eval=TRUE, echo = FALSE}
outdir <- params$outdir
if ( outdir == "" ) {
  outdir <- config_file[grep("\\soutdir.+ = (.+)$", config_file)] %>%
    str_split_fixed(., pattern = ' = ', n = 2) %>%
    .[,2] %>%
    gsub('\\"', "", . ) %>%
    gsub("\\.", "..", .)
}
fs::dir_tree(outdir, recurse = 1)
```
### Detailed File Structure
Within each directory you will find the following files (the top 5 files
per directory are shown):
```{r eval=TRUE, echo = FALSE}
results_dirs <- fs::dir_info(outdir, recurse = TRUE)
results_dirs %>%
select(path, type) %>%
mutate(process = gsub(outdir,"", path),
filename = ifelse(type == 'file', basename(path), "")) %>%
filter(!grepl("multiqc_report|params[2-9]", process)) %>%
mutate_at(vars(process), ~ case_when(
type == 'file' & grepl("fastqc", .) ~ dirname(dirname(.)),
grepl("FASTQC", .) ~ dirname(.),
type == 'file' ~ dirname(.),
TRUE ~ .)) %>%
group_by(process) %>%
dplyr::slice(1:5) %>%
ungroup() %>%
knitr::kable()
```