-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
93 lines (70 loc) · 3.34 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
---
output: github_document
editor_options:
markdown:
wrap: 72
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## align-condathis-targets
This repo showcases the use of [`targets`](https://cran.r-project.org/web/packages/targets/index.html) and [`condathis`](https://cran.r-project.org/web/packages/condathis/index.html) to build a pure
`R` pipeline that stays reproducible while also using non-`R`
command-line interface (CLI) tools. The `targets` package is ["a
Make-like pipeline tool for statistics and data science in
`R`"](https://books.ropensci.org/targets/). The [`condathis`](https://github.com/luciorq/condathis) package is a
CRAN package that lets you run CLI tools in `R`, ensuring
reproducibility across systems and environmental isolation. With
`condathis`, there's no need for users to manually ensure CLI tools are
installed on their systems.
*The pipeline in this example processes RNA-seq FASTQ files by running quality control (QC), trimming adapters, aligning the sequences, and ultimately producing sorted BAM files.*
## Motivation
As bioinformaticians working with omics data, `R` is incredibly rich in
packages. However, most tools for the initial phases of omics analysis
are not `R`-based. The `targets` package is fantastic for building `R`
pipelines. By combining `targets` with `condathis`, we can create
pipelines that integrate both `R` tools and other CLI tools in an `R`
environment, while still maintaining reproducibility.
## How condathis is used
The package `condathis` is used to:
1) Create environments with specific versions of the CLI tools needed. See example below.
```{r, eval=FALSE}
## Create an environment with gsutil v5.33
condathis::create_env("gsutil==5.33", env_name = "gsutil-env")
```
2) Interact with those environments and launch the CLI commands. See example below.
```{r, eval=FALSE}
url_cloude_storage <- "gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta"
# Same as processx::run but needs to indicate the envname
condathis::run(
"gsutil", "-m", "cp", url_cloude_storage, here::here("data", "outputs"), # where to save the data
env_name = "gsutil-env"
)
```
--> **R wrapper functions** that incorporate specific `condathis` CLI commands
are created to return paths that can be used by `tar_files`. <--
## Folders and files
- All the **wrapper functions** are stored in [`code/targets_functions.R`](./code/targets_functions.R).
- `config/`: Contains the [`mapping.csv`](./config/mapping.csv) that maps subjects with
corresponding files.
- `data/raw/`: Contains some example FASTQ files (note that the files
have been trimmed to contain only a few reads for easy processing).
- `data/outputs/`: This is where the outputs of the pipeline are
created.
## Pipeline Overview
- `fastp`: Quality of FASTQ files is checked and adapters are trimmed.
- `gsutil`: Reference genome is downloaded.
- `minimap2`: Aligns the files to the reference.
- `samtools`: Transforms SAM files to BAM files and sorts
To restore all the project dependencies, run: `renv::restore() `.
All outputs will be generated in `data/outputs`.
### Run Pipeline with `targets`
```{r}
library(targets)
# tar_dir()
targets::tar_make()
```