Merge pull request #33 from CCBR/dev

Dev
CCBR · Feb 23, 2023 · 0bb664c
2 parents 953c5a4 + 13e698c
commit 0bb664c
Showing 44 changed files with 2,577 additions and 367 deletions.
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,6 @@
 !.gitignore
 !.gitattributes
 site/
+*._*
+.DS*
+.R*
diff --git a/README.md b/README.md
@@ -1,43 +1,4 @@
-# CCBR Snakemake Pipeline Cookiecutter
-This is a dummy folder framework for CCBR snakemake workflows.
-New workflows can be started using this repository as a template.
+# MAPLE Pipeline
+MAPLE ([M]NaseSeq [A]nalysis [P]ipe[l]i[n]e) was developed in support of NIH's [Dr. Zhurkin Laboratory](https://ccr.cancer.gov/staff-directory/victor-b-zhurkin). It has been developed and tested solely on NIH HPC Biowulf.
 
-## Creating PAT for GH 
-This is a prerequisite for the next step. You will need [gh cli](https://cli.github.com/) installed on your laptop or use `/data/CCBR_Pipeliner/db/PipeDB/bin/gh_1.7.0_linux_amd64/bin/gh` on biowulf. Skip if can access github in an automated way already.
-
-Personal Access Token (PAT) is required to access GitHub (GH) without having to authenticate by other means (like password) every single time. You can create a PAT by going [here](https://github.com/settings/tokens). Then you can copy the PAT and save it into a file on biowulf (say `~/gh_token`). Next, you can run the following command to set everything up correctly on biowulf (or your laptop)
-```
-gh auth login --with-token < ~/git_token
-```
-
-## Creating new repository
-You can use [gh cli](https://cli.github.com/) to
- * create a new repository under CCBR, and
- * copy over the template code from CCBR_SnakemakePipelineCookiecutter
-with the following command
-```
-gh repo create CCBR/<reponame> \
---description "<repo description>" \
---public \
---template CCBR/CCBR_SnakemakePipelineCookiecutter \
---confirm
-```
-On biowulf, you may have to specify the full path of the `gh` executable is located here: `/data/CCBR_Pipeliner/db/PipeDB/bin/gh_1.7.0_linux_amd64/bin/gh`
-
-Then you can clone a local copy of the new repository:
-```
-gh repo clone CCBR/<reponame>.git
-```
-
-If you drop the `CCBR/` from the `gh` command above, then the new repo is created under your username. The commands would then look like this:
-```
-gh repo create <reponame> \
---description "<repo description>" \
---public \
---template CCBR/CCBR_SnakemakePipelineCookiecutter \
---confirm
-
-gh repo clone <your_github_handle>/<reponame>.git
-```
-
-You can change `--public` to `--private` in the above `gh` command to make the newly created repository private.
+For more information on pipeline requirements, inputs, and expected outputs, review the pipeline [documentation](https://ccbr.github.io/ccbr1214/).
diff --git a/config/config.yaml b/config/config.yaml
@@ -1,9 +1,67 @@
-## you probably need to change or comment/uncomment some of these
-#
-# The working dir... output will be in the results subfolder of the workdir
-workdir: "WORKDIR"
-#
-# tab delimited samples file ... should have the following 3 columns
-# sampleName	path_to_R1_fastq	path_to_R2_fastq
-#
-samples: "WORKDIR/samples.tsv"
+#########################################################################################
+#Folders and Paths REQUIRED
+#########################################################################################
+workdir: "WORKDIR" #output will be in the results subfolder of the workdir
+
+# paths to manifest files
+samplemanifest: "WORKDIR/manifests/samples.tsv"
+
+########################################################################################
+#user parameters
+#########################################################################################
+########################
+# first pass required
+########################
+species: "hg19" #species hg19 or hg38
+reference_source: "usc" #NCBI or USC
+
+# first pass completes trimming, alignment, assembly and a complete histogram
+# second pass completes subsetting, DAC analysis and DYAD analysis
+# third pass completes comparisons between multiple samples
+pipeline_phase: "first_pass" #first_pass, second_pass, third_pass
+
+########################
+# second pass required
+########################
+fragment_length_min: "140" #minimum fragment length
+fragment_length_max: "160" #maximum fragment length
+
+limit: 1000000
+max_distance: 1500
+
+# Require master_table
+master_table: "Y"
+
+## If master_table: "N"
+### User may change the selected-bed file to include a different bed file to subset samples
+### file must be located /WORKDIR/resources/
+### file includes shorthand_name,absolute_path
+#### selected_shorthand,selected_bed
+#### NAME_FILL_IN,/data/CCBR_Pipeliner/Pipelines/ccbr1214/bed_files/hg19_protein-coding_genes.bed
+## If master_table: "Y"
+## user must select a single bed file to run analysis
+### /data/CCBR_Pipeliner/Pipelines/ccbr1214/bed_files/hg19_protein-coding_genes.bed
+#bed_list_name: "bed_lists.csv"
+bed_list_name: "/data/CCBR_Pipeliner/Pipelines/ccbr1214/bed_files/hg19_protein-coding_genes.bed"
+
+## if a specific bed file has not been created then create it
+### grep -Fwf $gene_list $master_bed_file | awk -v OFS='\t' '{{print $1,$2,$3}}'> $selected_bed_file
+
+## if master_table: "Y"
+## the total number of gene lists to create
+gene_list_n: 40
+
+########################
+# third pass required
+########################
+# if selected the manifest/contrast_manifest.tsv must be completed
+output_contrast_location: "WORKDIR/merged_DACS" #"/data/Zhurkin-20/analysis/sent_to_pi/"
+contrastmanifest: "WORKDIR/manifests/CONTRASTS_FILL_IN.tsv"
+contrast_shorthand: "SHORTHAND_FILL_IN"
+
+#########################################################################################
+# reference files
+#########################################################################################
+index_dir: "/data/CCBR_Pipeliner/Pipelines/ccbr1214/indices"
+adaptors: "/data/CCBR_Pipeliner/Pipelines/ccbr1214/adapters/TruSeq_and_nextera_adapters.fa"
+master_bed_file: "/data/CCBR_Pipeliner/Pipelines/ccbr1214/bed_files/hg19_protein-coding_genes.bed" #path to the master bed that will  be used to create a selected_bed, if one is not provided
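The comments in this config describe which values are accepted for `species` and `pipeline_phase`, and which keys each pass needs. A minimal sketch of checking those constraints before launching a run could look like the following. The helper name `validate_config` is hypothetical and not part of the repository, and the choice to check fragment lengths for both the second and third pass is an assumption.

```python
# Hypothetical helper: not part of the MAPLE repository.
VALID_SPECIES = {"hg19", "hg38"}
VALID_PHASES = ("first_pass", "second_pass", "third_pass")


def validate_config(cfg):
    """Return a list of human-readable problems found in a parsed config.yaml dict."""
    problems = []
    if cfg.get("species") not in VALID_SPECIES:
        problems.append("species must be one of: " + ", ".join(sorted(VALID_SPECIES)))
    phase = cfg.get("pipeline_phase")
    if phase not in VALID_PHASES:
        problems.append("pipeline_phase must be one of: " + ", ".join(VALID_PHASES))
    # Assumption: fragment lengths matter for the second and third pass.
    # They are quoted strings in config.yaml, so convert before comparing.
    if phase in ("second_pass", "third_pass"):
        try:
            shortest = int(cfg["fragment_length_min"])
            longest = int(cfg["fragment_length_max"])
        except (KeyError, ValueError, TypeError):
            problems.append("fragment_length_min/max must be integers")
        else:
            if shortest > longest:
                problems.append("fragment_length_min must not exceed fragment_length_max")
    # The third pass needs the contrast placeholders replaced with real values.
    if phase == "third_pass":
        for key in ("contrastmanifest", "contrast_shorthand"):
            if "FILL_IN" in str(cfg.get(key, "FILL_IN")):
                problems.append(key + " still holds its FILL_IN placeholder")
    return problems
```

The function takes an already-parsed dictionary, so it works with whichever YAML loader is available. An empty list means no problems were found.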
diff --git a/config/contrasts.tsv b/config/contrasts.tsv
@@ -0,0 +1,3 @@
+DAC_files
+/path/to/sample1_DAC.csv
+/path/to/sample2_DAC.csv
diff --git a/config/samples.tsv b/config/samples.tsv
@@ -1,3 +1,5 @@
-sampleName	path_to_R1_fastq	path_to_R2_fastq
-Sample1	<path to sample1 R1 fastq>	<path to sample1 R2 fastq>
-Sample2	<path to sample2 R1 fastq>	<path to sample2 R2 fastq>
+sampleName	type	path_to_R1_fastq	path_to_R2_fastq
+Sample1	tumor	/path/to/sample1.R1.fastq.gz	/path/to/sample1.R2.fastq.gz
+Sample2	normal	/path/to/sample2.R1.fastq.gz	/path/to/sample2.R2.fastq.gz
+Sample3	tumor	/path/to/sample3.R1.fastq.gz	/path/to/sample3.R2.fastq.gz
+Sample4	normal	/path/to/sample4.R1.fastq.gz	/path/to/sample4.R2.fastq.gz
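The manifest above is a tab-delimited file with four columns, and the documentation asks for unique `sampleName` values. As a rough illustration, a reader that enforces both could be sketched as below. The function `read_sample_manifest` is a hypothetical name and not the pipeline's own parser.

```python
import csv
import io

EXPECTED_COLUMNS = ["sampleName", "type", "path_to_R1_fastq", "path_to_R2_fastq"]


def read_sample_manifest(handle):
    """Parse a tab-delimited sample manifest and return one dict per sample."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError("unexpected header: %r" % (reader.fieldnames,))
    rows = list(reader)
    # sampleName is used throughout the analysis, so reject repeated names.
    names = [row["sampleName"] for row in rows]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError("duplicate sampleName values: " + ", ".join(duplicates))
    return rows


# In-memory stand-in for config/samples.tsv.
example = io.StringIO(
    "sampleName\ttype\tpath_to_R1_fastq\tpath_to_R2_fastq\n"
    "Sample1\ttumor\t/path/to/sample1.R1.fastq.gz\t/path/to/sample1.R2.fastq.gz\n"
    "Sample2\tnormal\t/path/to/sample2.R1.fastq.gz\t/path/to/sample2.R2.fastq.gz\n"
)
samples = read_sample_manifest(example)
```

For a real file, an open handle such as `open("manifests/samples.tsv", newline="")` would be passed in place of the `StringIO` object.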
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1 @@
+MAPLE ([M]NaseSeq [A]nalysis [P]ipe[l]i[n]e) was developed in support of NIH's [Dr. Zhurkin Laboratory](https://ccr.cancer.gov/staff-directory/victor-b-zhurkin). It has been developed and tested solely on NIH HPC Biowulf.
diff --git a/docs/user-guide/contributions.md b/docs/user-guide/contributions.md
@@ -0,0 +1,9 @@
+# Contributions
+The following members contributed to the development of the MNaseSeq pipeline:
+
+- [Wilfried Guiblet](https://github.com/wilfriedguiblet)
+- [Samantha Sevilla](https://github.com/slsevilla)
+- [Vishal Koparde](https://github.com/kopardev)
+- [Victor Zhurkin](https://ccr.cancer.gov/staff-directory/victor-b-zhurkin)
+
+WG, SS, and VK contributed to generating the source code, and all members contributed to the main concepts and analysis.
diff --git a/docs/user-guide/getting-started.md b/docs/user-guide/getting-started.md
@@ -0,0 +1,40 @@
+# Overview
+The MAPLE (**M**NaseSeq **A**nalysis **P**ipe**l**i**n**e) GitHub repository is stored locally and is used for project deployment. Multiple projects can be deployed from this one location simultaneously.
+
+## 1. Getting Started
+### 1.1 Introduction
+MAPLE begins with raw FASTQ files and performs adaptor trimming, assembly, and alignment. BED files are created and, depending on user input, selected regions of interest may be used. Fragment centers (DYADs) are then determined, and histograms of occurrences are created. QC reports are also generated with each project.
+
+The following are sub-commands used within MNaseSeq:
+
+- init: initialize the pipeline
+- dryrun: dry run Snakemake to check the workflow and generate the DAG
+- run: execute the pipeline on the Biowulf HPC
+- runlocal: execute the pipeline in a local, interactive session
+- unlock: unlock directory
+- reset: delete a workdir, and re-initialize
+
+### 1.2 Setup Dependencies
+MNaseSeq has several dependencies listed below. These dependencies can be installed by a sysadmin. All dependencies will be automatically loaded if running from Biowulf.
+
+- bedtools: "bedtools/2.30.0"
+- bowtie2: "bowtie/2-2.4.2"
+- cutadapt: "cutadapt/1.18"
+- pear: "pear/0.9.11"
+- python: "python/3.7"
+- R: "R/4.0.3"
+- samtools: "samtools/1.11"
+
+### 1.3 Log in to the cluster
+MAPLE has been exclusively tested on Biowulf HPC. Log in to the cluster's head node and move into the pipeline location.
+```
+# ssh into cluster's head node
+ssh -Y [email protected]
+```
+
+### 1.4 Load an interactive session
+An interactive session should be started before performing any of the pipeline sub-commands, even if the pipeline is to be executed on the cluster.
+```
+# Grab an interactive node
+srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb  --cpus-per-task=4 --pty bash
+```
diff --git a/docs/user-guide/output.md b/docs/user-guide/output.md
@@ -0,0 +1,24 @@
+# 4. Expected Outputs
+The following directories are created under the output directory, depending on which pass of the pipeline has been run.
+
+## First Pass (first_pass)
+
+- 01_trim: this directory includes trimmed FASTQ files
+- 02_assembled: this directory includes assembled FASTQ files
+- 03_aligned: this directory includes aligned BAM files and BED files
+    - 01_bam: BAM files after alignment
+    - 02_bed: converted bed files
+    - 03_histograms: histograms of bed files
+
+## Second Pass (second_pass)
+- 04_dyads: this directory contains DYAD calculated files
+    - 01_DYADs: this includes direct DYAD calculations
+    - 02_histograms: this includes histogram occurrences
+    - 03_CSV: this includes the occurrence data in CSV format
+
+## Third Pass (third_pass)
+- /path/to/output/contrast: this includes the contrast file for each sample provided in the contrasts.tsv manifest
+
+## All Passes
+- log: this includes log files
+    - [date of run]: the slurm output files of the pipeline sorted by pipeline start time; copies of config and manifest files used in this specific pipeline run; error reporting script
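The directory layout described in docs/user-guide/output.md can be expressed as a small lookup, which may help when checking whether a pass finished. The sketch below copies the directory names from that listing. The helper `expected_dirs` is hypothetical, and it assumes the passes are run in order so that later passes also contain the earlier directories.

```python
# Directory names are copied from the output listing; the helper is hypothetical.
PHASES = ("first_pass", "second_pass", "third_pass")

OUTPUT_DIRS = {
    "first_pass": [
        "01_trim",
        "02_assembled",
        "03_aligned/01_bam",
        "03_aligned/02_bed",
        "03_aligned/03_histograms",
    ],
    "second_pass": [
        "04_dyads/01_DYADs",
        "04_dyads/02_histograms",
        "04_dyads/03_CSV",
    ],
    # third_pass writes to output_contrast_location, outside the results tree.
    "third_pass": [],
}


def expected_dirs(phase):
    """List result subdirectories expected after `phase`, assuming earlier passes ran."""
    if phase not in PHASES:
        raise ValueError("unknown pipeline_phase: " + repr(phase))
    dirs = ["log"]  # created by all passes
    for name in PHASES[: PHASES.index(phase) + 1]:
        dirs.extend(OUTPUT_DIRS[name])
    return dirs
```

Joining each entry onto the results folder and testing it with `os.path.isdir` would give a quick completeness check.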
diff --git a/docs/user-guide/preparing-files.md b/docs/user-guide/preparing-files.md
@@ -0,0 +1,61 @@
+# 2. Preparing Files
+The pipeline is controlled through editing configuration and manifest files. Defaults are found in the /WORKDIR/ after initialization.
+
+## 2.1 Configs
+The configuration files control parameters and software of the pipeline. These files are listed below:
+
+- resources/cluster.yaml
+- resources/tools.yaml
+- config.yaml
+
+### 2.1.1 Cluster YAML (REQUIRED)
+The cluster configuration file dictates the resources to be used during submission to Biowulf HPC. There are two different ways to control these parameters: first, by changing the default settings, and second, by creating or editing individual rules. These parameters should be edited with caution, after significant testing.
+
+### 2.1.2 Tools YAML (REQUIRED)
+The tools configuration file dictates the version of each tool that is being used. Updating the versions may break specific rules if versions are not backwards compatible with the defaults listed.
+
+### 2.1.3 Config YAML (REQUIRED)
+Several groups of parameters can be edited by the user to control various aspects of the pipeline. These are:
+
+- Folders and Paths
+      - These parameters include the input and output files of the pipeline, and list all manifest names.
+- User parameters
+      - These parameters will control the pipeline features. These include thresholds and whether to perform processes.
+
+## 2.2 Preparing Manifests
+There are two manifests used for the pipeline. These files describe information on the samples and desired contrasts. The paths of these files are defined in the config.yaml file. These files are:
+
+- sampleManifest (REQUIRED for all Passes)
+- contrastManifest (REQUIRED for third_pass)
+
+### 2.2.1 Samples Manifest
+This manifest includes sample-level information. It includes the following column headers: sampleName, type, path_to_R1_fastq, path_to_R2_fastq
+
+- sampleName: the sample ID associated with the FASTQ files; each must be unique. This may be a shorthand name, and will be used throughout the analysis.
+- type: demographic information regarding the sample; example 'tumor'
+- path_to_R1_fastq: the full path to the R1.fastq.gz file
+- path_to_R2_fastq: the full path to the R2.fastq.gz file
+
+An example sampleManifest file with four samples:
+
+```
+sampleName  type    path_to_R1_fastq                path_to_R2_fastq
+Sample1     tumor   /path/to/sample1.R1.fastq.gz    /path/to/sample1.R2.fastq.gz
+Sample2     tumor   /path/to/sample2.R1.fastq.gz    /path/to/sample2.R2.fastq.gz
+Sample3     tumor   /path/to/sample3.R1.fastq.gz    /path/to/sample3.R2.fastq.gz
+Sample4     tumor   /path/to/sample4.R1.fastq.gz    /path/to/sample4.R2.fastq.gz
+```
+
+### 2.2.2 Contrast Manifest
+This manifest will include contrast information of samples to compare. The first two Passes must be complete in order to run this final phase.
+
+Manifest example 1 (PASS)
+```
+/path/to/RESULTSDIR/04_dyad/03_csv/sample1.hg19.140-160.DYAD_corrected.csv
+/path/to/RESULTSDIR/04_dyad/03_csv/sample2.hg19.140-160.DYAD_corrected.csv
+```
+
+This will create the output file, dependent on the config inputs for `output_contrast_location` and the `selected_shorthand`:
+```
+/path/to/output_contrast_location/final_sample1.sample2.140-160.selected_shorthand.DAC.csv
+```
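The contrast example above implies a naming pattern for the third-pass output file. The sketch below rebuilds that one example path from its inputs. It is an illustration of the pattern only: the function `contrast_output_path` is hypothetical, and taking the sample name as the text before the first `.` of each input file name is an assumption drawn from the example.

```python
import posixpath


def contrast_output_path(dyad_csvs, output_dir, shorthand, fragment_min, fragment_max):
    """Build the merged DAC file path following the documented example."""
    # Assumption: the sample name is everything before the first "." in the file name.
    samples = [posixpath.basename(path).split(".", 1)[0] for path in dyad_csvs]
    name = "final_{}.{}-{}.{}.DAC.csv".format(
        ".".join(samples), fragment_min, fragment_max, shorthand
    )
    return posixpath.join(output_dir, name)


merged = contrast_output_path(
    [
        "/path/to/RESULTSDIR/04_dyad/03_csv/sample1.hg19.140-160.DYAD_corrected.csv",
        "/path/to/RESULTSDIR/04_dyad/03_csv/sample2.hg19.140-160.DYAD_corrected.csv",
    ],
    "/path/to/output_contrast_location",
    "selected_shorthand",
    140,
    160,
)
```

Sample names that themselves contain a `.` would break the split, so a real implementation would more likely take the names from the sample manifest.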
diff --git a/docs/user-guide/run.md b/docs/user-guide/run.md
@@ -0,0 +1,55 @@
+# 3. Running the Pipeline
+## 3.1 Pipeline Overview
+The Snakemake workflow has multiple options:
+```
+Usage:
+    ./run -m/--runmode=<RUNMODE> -w/--workdir=<WORKDIR>
+
+    Required Arguments:
+    1.  RUNMODE: [Type: String] Valid options:
+        *) init : initialize workdir
+        *) run : run with slurm
+        *) reset : DELETE workdir dir and re-init it
+        *) dryrun : dry run snakemake to generate DAG
+        *) unlock : unlock workdir if locked by snakemake
+        *) runlocal : run without submitting to sbatch
+    2.  WORKDIR: [Type: String]: Absolute or relative path to the 
+        output folder with write permissions.
+```
+
+## 3.2 Commands explained
+The following explains each of the command options:
+
+**Preparation Commands**
+
+- init (REQUIRED): This must be performed before any Snakemake run (dry, local, cluster) can be performed. This will copy the necessary config, manifest and Snakefiles needed to run the pipeline to the provided output directory.
+- dryrun (OPTIONAL): This is an optional step, to be performed before any Snakemake run (local, cluster). This will check for errors within the pipeline, and ensure that you have read/write access to the files needed to run the full pipeline.
+
+**Processing Commands**
+
+- runlocal - This will run the pipeline on a local node. NOTE: This should only be performed on an interactive node.
+- run - This will submit a master job to the cluster, and subsequent sub-jobs as needed to complete the workflow. An email will be sent when the pipeline begins, if there are any errors, and when it completes.
+
+**Other Commands (All optional)**
+
+- unlock: This will unlock the pipeline if an error caused it to stop in the middle of a run.
+- reset: This will DELETE workdir dir and re-init it
+
+To run any of these commands, follow this syntax:
+```
+./run --runmode=COMMAND --workdir=/path/to/output/dir
+```
+
+## 3.3 Typical Workflow
+A typical command workflow, running on the cluster, is as follows:
+```
+./run --runmode=init --workdir=/path/to/output/dir
+./run --runmode=dryrun --workdir=/path/to/output/dir
+./run --runmode=run --workdir=/path/to/output/dir
+```
+
+## 3.4 Passes explained
+MAPLE is to be run in three Passes:
+1. first_pass completes trimming, alignment, assembly and a complete histogram
+2. second_pass completes subsetting, DAC analysis and DYAD analysis
+3. third_pass completes comparisons between multiple samples
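The usage text and typical workflow in docs/user-guide/run.md can be summarized in a few lines of code. The sketch below only builds the argument lists for the `./run` wrapper and does not execute anything. The run modes are taken from the usage text, while the helper names `run_command` and `typical_workflow` are hypothetical.

```python
import shlex

# Run modes listed in the usage text of the ./run wrapper.
RUNMODES = ("init", "run", "reset", "dryrun", "unlock", "runlocal")


def run_command(runmode, workdir):
    """Return the argument list for one invocation of the ./run wrapper."""
    if runmode not in RUNMODES:
        raise ValueError("unknown runmode: " + repr(runmode))
    return ["./run", "--runmode=" + runmode, "--workdir=" + workdir]


def typical_workflow(workdir):
    """Commands for the documented cluster workflow: init, then dryrun, then run."""
    return [run_command(mode, workdir) for mode in ("init", "dryrun", "run")]


for command in typical_workflow("/path/to/output/dir"):
    # shlex.quote keeps the printed line safe to paste into a shell.
    print(" ".join(shlex.quote(part) for part in command))
```

Each list could be handed to `subprocess.run` from the pipeline directory. Because `pipeline_phase` is set in config.yaml, moving from one pass to the next would mean editing the config and repeating the `dryrun` and `run` steps.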