details for RNA-Seq proposal

stjudecloud · Dec 7, 2023 · 44451a5 · 44451a5
1 parent 00531a1
commit 44451a5
Showing 1 changed file with 129 additions and 1 deletion.
diff --git a/text/data_versioning.md b/text/data_versioning.md
@@ -3,6 +3,7 @@
 - [Introduction](#introduction)
 - [Motivation](#motivation)
 - [Pre-Discussion](#pre-discussion)
+- [Discussion](#discussion)
 
 # Introduction
 
@@ -22,4 +23,131 @@ Currently we have the RNA-Seq workflow v2.0.0 in use for St. Jude Cloud. We have
 
 We also have issues with the `msgen` pipeline that is used for WGS and WES. Microsoft has made changes to the pipeline and process throughout the time that we have been using it for production data releases. We have not incremented any versions in that time. So as it stands, we are trusting, without any verification, that any changes introduced are acceptable. 
 
-A data versioning process should help us find a reasonable balance that doesn't stop us from making any changes, ever, but also we want to avoid incorporating any and all changes without any assurances, or insight in `msgen`'s case, that those changes are reasonable and not meaningfully impactful to the data results.
+A data versioning process should help us find a reasonable balance that doesn't stop us from making any changes, ever, but also we want to avoid incorporating any and all changes without any assurances, or insight in `msgen`'s case, that those changes are reasonable and not meaningfully impactful to the data results.
+
+# Discussion
+
+## Whole-Genome and Whole-Exome Sequencing
+
+We propose to evaluate whole-genome (WGS) and whole-exome (WES) sequencing by running well characterized samples through each iteration of the analysis pipeline. The resulting variant calls (gVCFs) will be compared to existing high-quality variant calls. This comparison will be conducted using Illumina's `hap.py` [comparison tool](https://github.com/Illumina/hap.py) as [recommended](https://www.biorxiv.org/content/10.1101/270157v3) by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team. Specifically, we propose to run samples from the National Institute for Standards and Technology (NIST)'s Genome in a Bottle (GIAB) project. We will  perform analysis using samples HG002, HG003, HG004, HG005, HG006, and HG007 for WGS. For WES, we will use samples HG002, HG003, HG004, and HG005. The results from prior iterations of the pipeline will be supplied as the truth set. The confident call sets from GIAB will be provided as the gold standard dataset. The variant calls from the new workflow version will be treated as the query.
+
+## RNA-Seq
+
+We propose to evaluate RNA sequencing (RNA-Seq) by running well characterized samples through each iteration of the analysis pipeline. Specifically, we propose to run samples from the National Institute for Standards and Technology (NIST)'s Genome in a Bottle (GIAB) project. We will use RNA-Seq data from HG002, HG004, and HG005 samples. These three samples will be run through new iterations of the [St. Jude Cloud RNA-Seq harmonization workflow](https://stjudecloud.github.io/rfcs/0001-rnaseq-workflow-v2.0.0.html). The pipelines will output aligned BAMs and feature count files for each sample. We will then generate variant calls using the [GATK RNA-Seq short variant discovery best practices workflow](https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-). We will use Illumina's hap.py comparison tool, along with the high confidence variant calls from GIAB to compare output from prior versions of the RNA-Seq pipeline to the proposed version.
+
+### GATK RNA-Seq short variant discovery best practices workflow
+
+The GATK best practices workflows are de facto standards for data processing in bioinformatics. Therefore we will reuse their existing workflow for generating variant calls from RNA-Seq data. The workflow accepts a STAR-aligned BAM file, such as those produced by our RNA-Seq alignment harmonization workflow, and produces a set of variant calls by running a number of GATK steps.
+
+![GATK RNA-Seq short variant discovery workflow diagram](../resources/data_versioning/rnaseq-variant-workflow.png)
+
+#### Workflow
+
+1. Run Picard `MarkDuplicates`
+
+    ```bash
+        picard -Xmx~{java_heap_size}g MarkDuplicates \
+            -I ~{bam} \
+            -O ~{if create_bam then prefix + ".bam" else "/dev/null"} \
+            --VALIDATION_STRINGENCY SILENT \
+            --CREATE_INDEX ~{create_bam} \
+            --CREATE_MD5_FILE ~{create_bam} \
+            --COMPRESSION_LEVEL 5 \
+            --METRICS_FILE ~{prefix}.metrics.txt
+    ```
+
+2. Run GATK `SplitNCigarReads`
+
+    ```bash
+        gatk \
+            SplitNCigarReads \
+            -R ~{fasta} \
+            -I ~{bam} \
+            -O ~{prefix}.bam \
+            -OBM true
+       # GATK is unreasonable and uses the plain ".bai" suffix.
+       mv ~{prefix}.bai ~{prefix}.bam.bai
+    ```
+
+3. Run GATK Spark-enabled `BaseRecalibrator` (`BaseRecalibratorSpark`).
+
+   ```bash
+        gatk \
+            --java-options "-Xms4000m" \
+            BaseRecalibratorSpark \
+            -R ~{fasta} \
+            -I ~{bam} \
+            ~{if use_original_quality_scores then "--use-original-qualities" else "" } \
+            -O ~{prefix}.txt \
+            -known-sites ~{dbSNP_vcf} \
+            -known-sites ~{sep=" --known-sites " known_indels_sites_VCFs} \
+            --spark-master local[4]
+   ```
+
+4. Run GATK Spark-enabled `ApplyBQSR` (`ApplyBQSRSpark`).
+
+    ```bash
+            gatk \
+            --java-options "-Xms3000m" \
+            ApplyBQSRSpark \
+            --spark-master local[4] \
+            -I ~{bam} \
+            ~{if use_original_quality_scores then "--use-original-qualities" else "" } \
+            -O ~{prefix}.bam \
+            --bqsr-recal-file ~{recalibration_report}
+    ```
+
+5. Run Picard `IntervalListTools` to generate list of regions to scatter over.
+
+    ```bash
+        picard -Xms1g \
+            IntervalListTools \
+            SCATTER_COUNT=~{scatter_count} \
+            SUBDIVISION_MODE=BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
+            UNIQUE=~{unique} \
+            SORT=~{sort} \
+            INPUT=~{interval_list} \
+            OUTPUT=out
+    ```
+
+6. Run GATK Spark-enabled `HaplotypeCaller` (`HaplotypeCallerSpark`) on each region.
+
+    ```bash
+        gatk \
+           --java-options "-Xms6000m" \
+            HaplotypeCallerSpark \
+            -R ~{fasta} \
+            -I ~{bam} \
+            -L ~{interval_list} \
+            -O ~{prefix}.vcf.gz \
+            ~{if use_soft_clipped_bases then "" else "--dont-use-soft-clipped-bases"} \
+            --standard-min-confidence-threshold-for-calling ~{stand_call_conf} \
+           --spark-master local[4]
+    ```
+
+7. Run Picard `MergeVcfs`.
+
+    ```bash
+        picard -Xms2000m \
+            MergeVcfs \
+            ~{sep(' ', prefix('--INPUT=', vcfs))} \
+            --OUTPUT ~{output_vcf_name}
+    ```
+
+8. Run GATK `VariantFiltration`.
+
+    ```bash
+        gatk \
+            VariantFiltration \
+                --R ~{fasta} \
+                --V ~{vcf} \
+                --window ~{window} \
+                --cluster ~{cluster} \
+                 ~{sep(' ', prefix('--filter-name ', filter_name))} \
+                 ~{sep(' ', prefix('--filter-expression ', squote(filter_expression)))} \
+                -O ~{prefix}.vcf.gz
+    ```
+
+### Validation
+
+After generation of RNA-Seq variants, we will run Illumina's `hap.py` comparison tool, as recommended by GA4GH, to evaluate the variant calls produced by the newest version of the pipeline compared with those produced by the prior version of the pipeline and compared against the high-quality variant call benchmark set published by GIAB.