Merged in xengsort (pull request #192)

Xengsort
TheJacksonLaboratory · May 7, 2024 · 382b138
2 parents 07d4c11 + d52cb0b
commit 382b138

Showing 39 changed files with 299 additions and 98 deletions.
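
As a rough illustration of the parameters this change introduces, the sketch below shows how a user-supplied Nextflow config might request a fresh Xengsort index rather than the pre-built one. The parameter names come from the help text in this diff; the file paths and the index name are placeholders, and the behaviour on `null` is as described in that help text rather than verified here. The graft FASTA is taken from the existing `ref_fa` parameter, so it is not repeated.

```groovy
// Hypothetical user override file, e.g. supplied with Nextflow's `-c` option.
// Parameter names are from the help text in this change; all values are placeholders.
params {
    // Enable the PDX branch so that Xengsort is run on the reads.
    pdx = true

    // Per the help text, a null index path means Xengsort Index is run
    // to build the index instead of using pre-built files.
    xengsort_idx_path = null

    // Host (mouse) FASTA used when the index is built; placeholder path.
    xengsort_host_fasta = '/path/to/host_genome.fa'

    // Name given to the outputs produced by Xengsort Index; placeholder name.
    xengsort_idx_name = 'my_graft_host_index'
}
```

Leaving `xengsort_idx_path` and `xengsort_idx_name` at their defaults should instead point the workflows at the pre-built `hg38_GRCm39-NOD_ShiLtJ` index listed in the config changes.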
diff --git a/.gitignore b/.gitignore
@@ -13,4 +13,4 @@ test.csv
 test2.csv
 .nf-test
 .nf-test.log
-nf-test-report.tap
+nf-test-report*
diff --git a/.zenodo.json b/.zenodo.json
@@ -1,6 +1,6 @@
 {
     "upload_type": "software", 
-    "description": "v0.6.1 Release. See https://github.com/TheJacksonLaboratory/cs-nf-pipelines/wiki",
+    "description": "See https://github.com/TheJacksonLaboratory/cs-nf-pipelines/wiki",
     "title": "cs-nf-pipelines",
     "creators": [
         {
@@ -36,6 +36,10 @@
             "affiliation": "The Jackson Laboratory",
             "name": "Gabriel Rech"
         },
+        {
+            "affiliation": "The Jackson Laboratory",
+            "name": "Ardian Ferraj"
+        },
         {
             "affiliation": "The Jackson Laboratory",
             "name": "Anuj Srivastava"

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11068737.svg)](https://doi.org/10.5281/zenodo.11068737)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11068736.svg)](https://doi.org/10.5281/zenodo.11068736)
 
 # JAX NGS Operations Nextflow DSL2 Pipelines
 

diff --git a/ReleaseNotes.md b/ReleaseNotes.md
@@ -1,5 +1,30 @@
 # RELEASE NOTES
 
+## Release 0.6.3
+
+In this release we change the read disambiguation tool Xenome for Xengsort. Extensive benchmarking shows high concordance among results obtained from both tools.  
+
+Additionally, we correct an issue with the human PTA workflow when running the combination of the `--pdx` and `--split_fastq` options. Data run with this combination of options from version 0.6.0-0.6.2 should be re-run. 
+
+### Pipelines Added:
+
+None
+
+### Modules Added:
+
+1. xengsort/xengsort_classify.nf
+1. xengsort/xengsort_index.nf
+
+### Pipeline Changes:
+
+1. Xengsort replaces Xenome for all PDX based workflows (RNAseq, RNA fusion, Hs PTA, Somatic WES, Somatic WES PTA)
+1. Correction made for the Human PTA when running the combination of the `--pdx` and `--split_fastq` options.
+
+### Module Changes:
+
+None
+
+
 ## Release 0.6.2
 
 In this minor release we adjust memory and wall clock statements, and modified `bin/pta/merge-caller-vcfs.r` to correct for an edge case related bug.

diff --git a/bin/help/pta.nf b/bin/help/pta.nf
@@ -12,8 +12,10 @@ The following are human specific parameters. To see help for mouse, add `--gen_o
 
 --csv_input | /<FILE_PATH> | CSV delimited sample sheet that controls how samples are processed. The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. See the repository wiki (https://github.com/TheJacksonLaboratory/cs-nf-pipelines/wiki) for additional information. 
 
---xenome_prefix | /projects/compsci/omics_share/human/GRCh38/supporting_files/xenome/trans_human_GRCh38_84_NOD_based_on_mm10_k25| Xenome index for deconvolution of human and mouse reads. Used when `--pdx` is run. 
---pdx | false | Options: false, true. If specified, 'Xenome' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--pdx | false | Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--xengsort_host_fasta | '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa' | Xengsort host fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_idx_path | '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort' | Xengsort index for deconvolution of human and mouse reads. Used when `--pdx` is run. If `null`, Xengsort Index is run using ref_fa and host_fa.  
+--xengsort_idx_name | 'hg38_GRCm39-NOD_ShiLtJ' | Xengsort index name associated with files located in `xengsort_idx_path` or name given to outputs produced by Xengsort Index
 
 --deduplicate_reads | false | Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data. 
 

diff --git a/bin/help/rna_fusion.nf b/bin/help/rna_fusion.nf
@@ -17,7 +17,6 @@ Parameter | Default | Description
 
 --gen_org | mouse | Options: mouse and human.
 
---xenome_prefix | /projects/compsci/omics_share/human/GRCh38/supporting_files/xenome/trans_human_GRCh38_84_NOD_based_on_mm10_k25| Xenome index for deconvolution of human and mouse reads. Used when `--pdx` is run. 
 --read_length | 150 | Options: 75, 100, 150. Changed relative to sample read length.
 --star_index | /projects/omics_share/human/GRCh38/transcriptome/indices/rna_fusion/star/star-2.7.4a-150bp | STAR index used by several tools. Change the index relative to sample read length. Read length options: 75, 100, 150. 
 --star_fusion_star_index | /projects/omics_share/human/GRCh38/transcriptome/indices/rna_fusion/starfusion/star-150 | STAR-fusion index. Change the index relative to sample read length. Read length options: 75, 100, 150. 
@@ -47,7 +46,11 @@ Parameter | Default | Description
 --fusion_report_opt | null | Additional fusion-report options can be provided. 
 --databases | /projects/compsci/omics_share/human/GRCh38/supporting_files/rna_fusion_dbs | Fusion-report databases of known fusion events. Used in report generation only. 
 
---pdx | false | Options: false, true. If specified, 'Xenome' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--pdx | false | Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--ref_fa | '/projects/compsci/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'| Xengsort graft fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_host_fasta | '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa' | Xengsort host fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_idx_path | '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort' | Xengsort index for deconvolution of human and mouse reads. Used when `--pdx` is run. If `null`, Xengsort Index is run using ref_fa and host_fa.  
+--xengsort_idx_name | 'hg38_GRCm39-NOD_ShiLtJ' | Xengsort index name associated with files located in `xengsort_idx_path` or name given to outputs produced by Xengsort Index
 
 '''
 }

diff --git a/bin/help/rnaseq.nf b/bin/help/rnaseq.nf
@@ -18,9 +18,6 @@ Parameter | Default | Description
 --gen_org | mouse | Options: mouse and human.
 --genome_build | 'GRCm38' | Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCm38.
 
---pdx | false | Options: true or false. If 'true' Xenome is run to remove mouse reads from samples. 
---xenome_prefix | '/projects/compsci/omics_share/human/GRCh38/supporting_files/xenome/trans_human_GRCh38_84_NOD_based_on_mm10_k25' | Pre-compiled Xenome classification index files. Used if PDX analysis is specified. 
-
 --quality_phred | 15 | The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
 --unqualified_perc | 40 | Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
 --detect_adapter_for_pe | false | If true, adapter auto-detection is used for paired end data. By default, paired-end data adapter sequence auto-detection is disabled as the adapters can be trimmed by overlap analysis. However, --detect_adapter_for_pe will enable it. Fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
@@ -50,8 +47,13 @@ Parameter | Default | Description
                  | Human: '/projects/omics_share/human/GRCh38/transcriptome/annotation/ensembl/v104/Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.rRNA.interval_list'
                  | The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
 
---pdx | false | Options: false, true. If specified, 'Xenome' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--pdx | false | Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
 --classifier_table | '/projects/compsci/omics_share/human/GRCh38/supporting_files/rna_ebv_classifier/EBVlym_classifier_table_48.txt' | EBV expected gene signatures used in EBV classifier. Only used when '--pdx' is run. 
+--ref_fa | '/projects/compsci/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'| Xengsort graft fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_host_fasta | '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa' | Xengsort host fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_idx_path | '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort' | Xengsort index for deconvolution of human and mouse reads. Used when `--pdx` is run. If `null`, Xengsort Index is run using ref_fa and host_fa.  
+--xengsort_idx_name | 'hg38_GRCm39-NOD_ShiLtJ' | Xengsort index name associated with files located in `xengsort_idx_path` or name given to outputs produced by Xengsort Index
+
 
 There are two additional parameters that are human specific. They are: 
 

diff --git a/bin/help/somatic_wes.nf b/bin/help/somatic_wes.nf
@@ -21,8 +21,10 @@ Parameter | Default | Description
 --unqualified_perc | 40 | Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
 --detect_adapter_for_pe | false | If true, adapter auto-detection is used for paired end data. By default, paired-end data adapter sequence auto-detection is disabled as the adapters can be trimmed by overlap analysis. However, --detect_adapter_for_pe will enable it. Fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
 
---pdx | false | Options: false, true. If specified, 'Xenome' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
---xenome_prefix | /projects/compsci/omics_share/human/GRCh38/supporting_files/xenome/trans_human_GRCh38_84_NOD_based_on_mm10_k25| Xenome index for deconvolution of human and mouse reads. Used when `--pdx` is run. 
+--pdx | false | Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--xengsort_host_fasta | '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa' | Xengsort host fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_idx_path = | '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort' | Xengsort index for deconvolution of human and mouse reads. Used when `--pdx` is run. If `null`, Xengsort Index is run using ref_fa and host_fa.  
+--xengsort_idx_name = | 'hg38_GRCm39-NOD_ShiLtJ' | Xengsort index name associated with files located in `xengsort_idx_path` or name given to outputs produced by Xengsort Index
 
 --genotype_targets | '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz' | Target SNP bed file for the ancestry panel. Can contain annotation information. 
 --snpID_list | '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2.list' | Target SNPs in list used in BCFtools filtering step

diff --git a/bin/help/somatic_wes_pta.nf b/bin/help/somatic_wes_pta.nf
@@ -21,8 +21,10 @@ Parameter | Default | Description
 --unqualified_perc | 40 | Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
 --detect_adapter_for_pe | false | If true, adapter auto-detection is used for paired end data. By default, paired-end data adapter sequence auto-detection is disabled as the adapters can be trimmed by overlap analysis. However, --detect_adapter_for_pe will enable it. Fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
 
---pdx | false | Options: false, true. If specified, 'Xenome' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
---xenome_prefix | /projects/compsci/omics_share/human/GRCh38/supporting_files/xenome/trans_human_GRCh38_84_NOD_based_on_mm10_k25| Xenome index for deconvolution of human and mouse reads. Used when `--pdx` is run. 
+--pdx | false | Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis. 
+--xengsort_host_fasta | '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa' | Xengsort host fasta file. Used by Xengsort Index when `--pdx` is run, and xengsort_idx_path is `null` or false.  
+--xengsort_idx_path = | '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort' | Xengsort index for deconvolution of human and mouse reads. Used when `--pdx` is run. If `null`, Xengsort Index is run using ref_fa and host_fa.  
+--xengsort_idx_name = | 'hg38_GRCm39-NOD_ShiLtJ' | Xengsort index name associated with files located in `xengsort_idx_path` or name given to outputs produced by Xengsort Index
 
 --genotype_targets | '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz' | Target SNP bed file for the ancestry panel. Can contain annotation information. 
 --snpID_list | '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2.list' | Target SNPs in list used in BCFtools filtering step

diff --git a/bin/log/pta.nf b/bin/log/pta.nf
@@ -34,7 +34,9 @@ ______________________________________________________
 --quality_phred                 ${params.quality_phred}
 --unqualified_perc              ${params.unqualified_perc}
 --detect_adapter_for_pe         ${params.detect_adapter_for_pe}
---xenome_prefix                 ${params.xenome_prefix}
+--xengsort_host_fasta           ${params.xengsort_host_fasta}
+--xengsort_idx_path             ${params.xengsort_idx_path}
+--xengsort_idx_name             ${params.xengsort_idx_name}
 --ref_fa                        ${params.ref_fa}
 --ref_fa_indices                ${params.ref_fa_indices}
 --ref_fa_dict                   ${params.ref_fa_dict}

diff --git a/bin/log/rna_fusion.nf b/bin/log/rna_fusion.nf
@@ -27,7 +27,10 @@ ______________________________________________________
 --keep_intermediate                 ${params.keep_intermediate}
 -c                                  ${params.config}
 --multiqc_config                    ${params.multiqc_config}
---xenome_prefix                     ${params.xenome_prefix}
+--ref_fa                            ${params.ref_fa}
+--xengsort_host_fasta               ${params.xengsort_host_fasta}
+--xengsort_idx_path                 ${params.xengsort_idx_path}
+--xengsort_idx_name                 ${params.xengsort_idx_name}
 --read_length                       ${params.read_length}
 --star_index                        ${params.star_index}
 --star_fusion_star_index            ${params.star_fusion_star_index}

diff --git a/bin/log/rnaseq.nf b/bin/log/rnaseq.nf
@@ -47,7 +47,10 @@ ______________________________________________________
 --detect_adapter_for_pe      ${params.detect_adapter_for_pe}
 
 --pdx                        ${params.pdx}
---xenome_prefix              ${params.xenome_prefix}
+--ref_fa                     ${params.ref_fa}
+--xengsort_host_fasta        ${params.xengsort_host_fasta}
+--xengsort_idx_path          ${params.xengsort_idx_path}
+--xengsort_idx_name          ${params.xengsort_idx_name}
 
 --strandedness_ref           ${params.strandedness_ref}
 --strandedness_gtf           ${params.strandedness_gtf}

diff --git a/bin/log/somatic_wes.nf b/bin/log/somatic_wes.nf
@@ -27,7 +27,9 @@ ______________________________________________________
 --pubdir                        ${params.pubdir}
 --organize_by                   ${params.organize_by}
 --pdx                           ${params.pdx}
---xenome_index                  ${params.xenome_prefix}
+--xengsort_host_fasta           ${params.xengsort_host_fasta}
+--xengsort_idx_path             ${params.xengsort_idx_path}
+--xengsort_idx_name             ${params.xengsort_idx_name}
 --ref_fa                        ${params.ref_fa}
 --ref_fa_indices                ${params.ref_fa_indices}
 --quality_phred                 ${params.quality_phred}

diff --git a/bin/log/somatic_wes_pta.nf b/bin/log/somatic_wes_pta.nf
@@ -22,7 +22,9 @@ ______________________________________________________
 --pubdir                        ${params.pubdir}
 --organize_by                   ${params.organize_by}
 --pdx                           ${params.pdx}
---xenome_index                  ${params.xenome_prefix}
+--xengsort_host_fasta           ${params.xengsort_host_fasta}
+--xengsort_idx_path             ${params.xengsort_idx_path}
+--xengsort_idx_name             ${params.xengsort_idx_name}
 --ref_fa                        ${params.ref_fa}
 --ref_fa_indices                ${params.ref_fa_indices}
 --quality_phred                 ${params.quality_phred}

diff --git a/bin/shared/multiqc/pta_multiqc.yaml b/bin/shared/multiqc/pta_multiqc.yaml
@@ -8,7 +8,7 @@ export_plots: true
 module_order:
   - fastp
   - fastqc
-  - xenome
+  - xengsort
   - conpair
   - gatk
   - picard

diff --git a/bin/shared/multiqc/rna_fusion_multiqc.yaml b/bin/shared/multiqc/rna_fusion_multiqc.yaml
@@ -7,7 +7,7 @@ export_plots: true
 
 module_order:
   - fastqc
-  - xenome
+  - xengsort
   - custom_content
 
 table_columns_visible:

diff --git a/bin/shared/multiqc/rnaseq_multiqc.yaml b/bin/shared/multiqc/rnaseq_multiqc.yaml
@@ -8,7 +8,7 @@ export_plots: true
 module_order:
   - fastp
   - fastqc
-  - xenome
+  - xengsort
   - star
   - rsem
   - picard

diff --git a/bin/shared/multiqc/somatic_wes_multiqc.yaml b/bin/shared/multiqc/somatic_wes_multiqc.yaml
@@ -8,7 +8,7 @@ export_plots: true
 module_order:
   - fastp
   - fastqc
-  - xenome
+  - xengsort
   - gatk
   - picard
 

diff --git a/bin/shared/multiqc/somatic_wes_pta_multiqc.yaml b/bin/shared/multiqc/somatic_wes_pta_multiqc.yaml
@@ -8,7 +8,7 @@ export_plots: true
 module_order:
   - fastp
   - fastqc
-  - xenome
+  - xengsort
   - gatk
   - picard
 

diff --git a/config/pta.config b/config/pta.config
@@ -29,8 +29,10 @@ params {
     // NOTE: For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
     //       For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
 
-    // Xenome index
-    xenome_prefix=params.reference_cache+'/human/GRCh38/supporting_files/xenome/hg38_broad_NOD_based_on_mm10_k25'
+    // xengsort index  
+    xengsort_host_fasta = params.reference_cache+'/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
+    xengsort_idx_path = params.reference_cache+'/human/GRCh38/supporting_files/xengsort'
+    xengsort_idx_name = 'hg38_GRCm39-NOD_ShiLtJ'
 
     // Reference fasta
     ref_fa = params.reference_cache+'/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'

diff --git a/config/rna_fusion.config b/config/rna_fusion.config
@@ -2,7 +2,7 @@
 
 manifest {
     name = "rna_fusion"
-    description = 'Pipeline for processing of PDX RNASeq samples to call RNA Fusions, contains xenome step for processing PDX samples'
+    description = 'Pipeline for processing of PDX RNASeq samples to call RNA Fusions, contains xengsort step for processing PDX samples'
 }
 
 params {
@@ -21,8 +21,11 @@ params {
   // PDX
   pdx = false
 
-  // Xenome index  
-  xenome_prefix=params.reference_cache+'/human/GRCh38/supporting_files/xenome/hg38_broad_NOD_based_on_mm10_k25'
+  // xengsort index  
+  ref_fa = params.reference_cache+'/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
+  xengsort_host_fasta = params.reference_cache+'/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
+  xengsort_idx_path = params.reference_cache+'/human/GRCh38/supporting_files/xengsort'
+  xengsort_idx_name = 'hg38_GRCm39-NOD_ShiLtJ'
 
   // READ LENGTH ADJUSTMENTS: 
   read_length = 150 // change relative to sample being processed. 75, 100, 125, and 150 are supported.