update docs and refer to genome_tracks module #39; fix HOME result ag…

…gregation #37
epigen · Apr 25, 2024 · 5964a86 · 5964a86
1 parent cb48a2c
commit 5964a86
Show file tree

Hide file tree

Showing 9 changed files with 30 additions and 72 deletions.
diff --git a/.test/hg38test/config/hg38test_atacseq_pipeline_config.yaml b/.test/hg38test/config/hg38test_atacseq_pipeline_config.yaml
@@ -8,7 +8,6 @@ partition: 'shortq'
 project_name: hg38test #MyATACproject # name of the project/dataset
 result_path: results/hg38test/ #/path/to/results/ # path to the output folder
 annotation: .test/hg38test/config/hg38test_atacseq_pipeline_annotation.csv #/path/to/sample_annotation.csv # path to annotation file, specified in config/README.md
-email: [email protected] # used for UCSC hub generation
 
 ##### PROCESSING #####
 
@@ -22,7 +21,6 @@ sequencing_center: CeMM_BSF
 ##### REPORT #####
 
 # sample(!) specific annotation columns of interest from the annotation sheet (note: unit specific annotations are not retained)
-# first entry is used to color UCSC Genome Browser tracks
 annot_columns: ['pass_qc','read_type','organism'] # (optional) can be empty [""]. must be columns in the annotation sheet
 
 ##### QUANTIFICATION #####

diff --git a/.test/mm10test/config/mm10test_atacseq_pipeline_config.yaml b/.test/mm10test/config/mm10test_atacseq_pipeline_config.yaml
@@ -8,7 +8,6 @@ partition: 'shortq'
 project_name: mm10test #MyATACproject # name of the project/dataset
 result_path: results/mm10test/ #/path/to/results/ # path to the output folder
 annotation: .test/mm10test/config/mm10test_atacseq_pipeline_annotation.csv #/path/to/sample_annotation.csv # path to annotation file, specified in config/README.md
-email: [email protected] # used for UCSC hub generation
 
 ##### PROCESSING #####
 
@@ -22,7 +21,6 @@ sequencing_center: CeMM_BSF
 ##### REPORT #####
 
 # sample(!) specific annotation columns of interest from the annotation sheet (note: unit specific annotations are not retained)
-# first entry is used to color UCSC Genome Browser tracks
 annot_columns: ['pass_qc','read_type','organism'] # (optional) can be empty [""]. must be columns in the annotation sheet
 
 ##### QUANTIFICATION #####

diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@
 From r**A**w/un**A**ligned BAM files to count**Z**.
 A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow implementation of the [BSF's](https://www.biomedical-sequencing.org/) [ATAC-seq Data Processing Pipeline](https://github.com/berguner/atacseq_pipeline "ATAC-seq Data Processing Pipeline") extended by downstream quantification and annotation steps using Bash and Python.
 
-This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.
+This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others to find and benefit from the effort and me to prioritize my efforts!
 
 **If you use this workflow in a publication, please don't forget to give credits to the authors by citing it using this DOI [10.5281/zenodo.6323634](https://doi.org/10.5281/zenodo.6323634).**
 
@@ -21,7 +21,6 @@ Table of contents
   * [Configuration](#configuration)
   * [Examples](#examples)
   * [Quality Control](#quality-control)
-  * [Genome Browser Tracks](#genome-browser-tracks)
   * [Links](#links)
   * [Resources](#resources)
   * [Publications](#publications)
@@ -75,7 +74,7 @@ The processing and quantification described here was performed using a publicly
       - Note: even though the peak support of a region in a certain sample is 0, does not mean that there are no reads counted in the count matrix, it just states that there was no peak called.
       - Note: the peak support can be >1 for certain samples in case of a consensus region spanning more than one peak within the respective sample.
     - Peak annotation and motif analysis HOMER.
-    - Coverage analysis generating bigWig files (hub/) and quantifying TSS coverage (results/).
+    - Quantification of TSS coverage (results/).
 - Reporting (report/)
     - MultiQC report generation using MultiQC, extended with an in-house developed plugin [atacseq_report](./workflow/scripts/multiqc_atacseq).
 - Quantification (counts/)
@@ -91,7 +90,6 @@ The processing and quantification described here was performed using a publicly
       - UROPA with regulatory build and gencode as references
       - HOMER with annotatePeaks.pl
       - bedtools for nucleotide counts/content (e.g., % of GC)
-- UCSC Genome Browser Trackhub (hub/)
 
 # Usage
 These steps are the recommended usage for this workflow:
@@ -126,7 +124,7 @@ Below are some guidelines for the manual quality control of each sample, but kee
     - Regulatory regions >10% (as it is roughly 10% of the genome)
     - TSS (Transcription Start Site) normalized coverage ideally > 4 (at least >2)
     - % Duplications “not excessive”
-5. Inspect [Genome Browser Tracks](#genome-browser-tracks) using UCSC Genome Browser (online) or IGV (local)
+5. Inspect [Genome Browser Tracks](https://github.com/epigen/genome_tracks/) using UCSC Genome Browser (online) or IGV (local)
     - Compare all samples to the best, based on above's QC metrics.
     - Check cell type / experiment-specific markers for accessibility as positive controls.
     - Check e.g., developmental regions for accessibility as negative controls.
@@ -144,25 +142,6 @@ My personal QC value scheme to inform downstream analyses (e.g., unsupervised an
 
 Finally, a previous PhD student in our lab, [André Rendeiro](https://orcid.org/0000-0001-9362-5373), wrote about ["ATAC-seq sample quality, wet lab troubleshooting and advice"](https://github.com/epigen/open_pipelines/blob/master/pipelines/atacseq.md#sample-quality-wet-lab-troubleshooting-and-advice).
 
-# Genome Browser Tracks
-The `hub` directory contains the read coverage per sample in .bigWig format for visual inspection of each sample e.g., during QC. Below are instructions for two different approaches (online/local).
-
-## UCSC Genome Browser Track Hub (online)
-0. Requirement: web server.
-1. Copy (or symlink) the `hub` directory to an accessible location on your web server (=web_server_location).
-2. Create a UCSC Genome Browser hyperlink
-    - the general formula is: ucsc_url + genome + web_server_location + hub/hub.txt
-    - concretely: `http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=` + {genome} + `&hubUrl=` + {web_server_location} + `hub/hub.txt`
-    - [mm10test](http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=mm10&hubUrl=https://medical-epigenomics.org/data/atacseq_pipeline/mm10test/hub/hub.txt): `http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=mm10&hubUrl=https://medical-epigenomics.org/data/atacseq_pipeline/mm10test/hub/hub.txt`
-    - [hg38test](http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&hubUrl=https://medical-epigenomics.org/data/atacseq_pipeline/hg38test/hub/hub.txt): `http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&hubUrl=https://medical-epigenomics.org/data/atacseq_pipeline/hg38test/hub/hub.txt`
-3. Share the link with the world e.g., collaborators or upon publication of your data.
-
-## IGV: Integrative Genomics Viewer (local)
-0. Requirement: [IGV Desktop application](https://igv.org/doc/desktop/).
-1. Open IGV.
-2. Select genome.
-3. Drag and drop all/selected .bigWig files from the `hub` directory directly into the IGV application.
-
 # Links
 - [GitHub Repository](https://github.com/epigen/atacseq_pipeline/)
 - [GitHub Page](https://epigen.github.io/atacseq_pipeline/)
@@ -188,11 +167,11 @@ The `hub` directory contains the read coverage per sample in .bigWig format for
   unzip indices_for_Bowtie2.zip && rm indices_for_Bowtie2.zip
   ```
 - Recommended [MR.PARETO](https://github.com/epigen/mr.pareto) modules for downstream analyses (in that order):
+  - [Genome Browser Track Visualization](https://github.com/epigen/genome_tracks/) for quality control and visual inspection/analysis of genomic regions/genes of interest or top hits.
   - [<ins>Sp</ins>lit, F<ins>ilter</ins>, Norma<ins>lize</ins> and <ins>Integrate</ins> Sequencing Data](https://github.com/epigen/spilterlize_integrate/) after quantification.
   - [Unsupervised Analysis](https://github.com/epigen/unsupervised_analysis) to understand and visualize similarities and variations between samples.
   - [Differential Analysis with limma](https://github.com/epigen/dea_limma) to identify and visualize statistically significant genomic regions between sample groups.
   - [Enrichment Analysis](https://github.com/epigen/enrichment_analysis) for biomedical interpretation of differential analysis results.
-  - [Genome Tracks](https://github.com/epigen/genome_tracks) for visualization of genomic regions of interest or top hits.
 
 # Publications
 The following publications successfully used this module for their analyses.

diff --git a/config/README.md b/config/README.md
@@ -7,7 +7,7 @@ You need one configuration file and one annotation file to run the complete work
   - sample_name (first column!)
   - read_type: "single" or "paired".
   - bam_file: path to the raw/unaligned BAM file.
-  - pass_qc: number between 0 (not used for downstream steps e.g., quantification) and 1. Every sample with pass_qc>0 is included in the downstream analysis.
+  - pass_qc: number between 0 (not used for downstream steps e.g., quantification) and 1. Every sample with pass_qc>0 is included in the downstream quantification and annotation steps.
   - (optional) additional sample metadata columns can be added and indicated for inclusion in the report.
 
 2 examples (hg38 & mm10) are provided in the .test/ directory and are pre-filled with resource locations from the [respective Zenodo download](../README.md#resources).
diff --git a/config/config.yaml b/config/config.yaml
@@ -8,7 +8,6 @@ partition: 'shortq'
 project_name: MyATACproject # name of the project/dataset
 result_path: /path/to/results/ # path to the output folder
 annotation: /path/to/MyATACproject_pipeline_annotation.csv # path to annotation file, specified in config/README.md
-email: [email protected] # used for UCSC hub generation
 
 ##### PROCESSING #####
 
@@ -21,8 +20,7 @@ sequencing_center: CeMM_BSF
 
 ##### REPORT #####
 
-# sample(!) specific annotation columns of interest from the annotation sheet (note: unit specific annotations are not retained)
-# first entry is used to color UCSC Genome Browser tracks
+# sample(!) specific annotation columns of interest from the annotation sheet (note: sequencing unit specific annotations are not retained)
 annot_columns: ['pass_qc','read_type','organism'] # (optional) can be empty [""]. must be columns in the annotation sheet
 
 ##### QUANTIFICATION #####

diff --git a/workflow/envs/multiqc.yaml b/workflow/envs/multiqc.yaml
@@ -10,5 +10,5 @@ dependencies:
   - pytables=3.9.1
   - pip=23.3.1
   - pip:
-      - 'git+https://github.com/epigen/atacseq_pipeline/#egg=atacseq_report&subdirectory=workflow/scripts/multiqc_atacseq' # works
-#     - -e /home/sreichl/projects/atacseq_pipeline/workflow/scripts/multiqc_atacseq # works but only local
+#       - 'git+https://github.com/epigen/atacseq_pipeline/#egg=atacseq_report&subdirectory=workflow/scripts/multiqc_atacseq' # works
+    - -e /home/sreichl/projects/atacseq_pipeline/workflow/scripts/multiqc_atacseq # works but only local
diff --git a/workflow/rules/processing.smk b/workflow/rules/processing.smk
@@ -88,11 +88,6 @@ rule tss_coverage:
         "logs/rules/coverage_{sample}.log"
     shell:
         """
-        #bamCoverage --bam {input.bam} \
-        #    -p max --binSize 10  --normalizeUsing RPGC \
-        #    --effectiveGenomeSize {params.genome_size} --extendReads 175 \
-        #    -o "{output.bigWig}" > "{output.bigWig_log}" 2>&1;
-
         echo "base,count" > {output.tss_hist};
         bedtools slop -b {params.tss_slop} -i {params.unique_tss} -g {params.chromosome_sizes} | \
             bedtools coverage -a - -b {input.bam} -d -sorted | \

diff --git a/workflow/rules/quantification.smk b/workflow/rules/quantification.smk
@@ -132,39 +132,29 @@ rule homer_aggregate:
     threads: config.get("threads", 2)
     log:
         "logs/rules/homer_aggregate.log"
-    shell:
-        """
-        first_file=true
-        
-        > {output}
-        
-        for file in {input}; do
-            if [ -s file ]; then
-                sample=$(awk -F'/' '{{print $(NF-2)}}' <<< "$file")
+    run:
+        combined_df = pd.DataFrame()
 
-                if $first_file; then
-                    awk -v prefix="$sample" 'BEGIN {{FS="\\t"; OFS=","}} {{
-                        if (NR==1) {{
-                            gsub(" ", "_", $0);
-                            printf "sample" OFS; 
-                            for (i=1; i<=NF; i++) printf "%s%s", $i, (i<NF ? OFS : ORS);
-                        }} else {{
-                            printf prefix OFS "\\"" $1 "\\"";
-                            for (i=2; i<=NF; i++) printf "%s%s", OFS, $i;
-                            print "";
-                        }}
-                    }}' "$file" >> {output}
+        for file_path in input:
+            if os.path.getsize(file_path) > 0:
+                sample = file_path.split('/')[-3]
+                df = pd.read_csv(file_path, sep='\t')
 
-                    first_file=false
-                else
-                    awk -v prefix="$sample" 'BEGIN {{FS="\\t"; OFS=","}} FNR > 1 {{printf prefix OFS "\\"" $1 "\\"";
-                        for (i=2; i<=NF; i++) printf "%s%s", OFS, $i;
-                        print "";
-                        }}' "$file" >> {output}
-                fi
-            fi
-        done
-        """
+                # replace white space in column names
+                df.columns = [col.replace(' ', '_') for col in df.columns]
+
+                # Remove columns that start with '#_' (unique per sample)
+                df = df.loc[:, ~df.columns.str.startswith('#_')]
+
+                # add sample name
+                df.insert(0, 'sample_name', sample)
+
+                if combined_df.shape[0]==0:
+                    combined_df = df
+                else:
+                    combined_df = pd.concat([combined_df, df], ignore_index=True)
+
+        combined_df.to_csv(output[0], index=False)
 
 # map consensus regions to closest TSS per gene
 rule map_consensus_tss:

diff --git a/workflow/rules/report.smk b/workflow/rules/report.smk
@@ -109,8 +109,8 @@ rule multiqc:
         expand(os.path.join(result_path,"results","{sample}","mapped", "{sample}.filtered.bam"), sample=samples.keys()),
         expand(os.path.join(result_path,"results","{sample}","peaks","{sample}_peaks.narrowPeak"), sample=samples.keys()),
 #         expand(os.path.join(result_path, "hub","{sample}.bigWig"),sample=samples.keys()),
+#         trackdb_file = os.path.join(result_path, "hub", config["genome"], "trackDb.txt"), # representing UCSC hub
         expand(os.path.join(result_path, 'report', '{sample}_peaks.xls'), sample=samples.keys()), # representing symlinked stats
-        trackdb_file = os.path.join(result_path, "hub", config["genome"], "trackDb.txt"), # representing UCSC hub
         sample_annotation = config["annotation"],
     output:
         multiqc_report = report(os.path.join(result_path,"report","multiqc_report.html"),