-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: refactor pangenome mapping #357
base: master
Are you sure you want to change the base?
Conversation
WalkthroughThe pull request introduces several updates across the workflow. A new Conda environment configuration for kmc is added, and the samtools environment is updated with a new dependency. The resource YAML file now points to updated URLs for variant calls. Multiple Snakemake rules have been enhanced: new rules (for counting k-mers, creating reference paths, and reheadering mapped reads) have been added, and existing rules (map_reads_vg, get_pangenome_haplotypes) have been modified or removed. Additionally, related configurations have been streamlined in the main config file. Changes
Sequence Diagram(s)sequenceDiagram
participant S as Sample Data
participant KS as count_sample_kmers
participant CR as create_reference_paths
participant MR as map_reads_vg
participant RH as reheader_mapped_reads
participant FM as fix_mate
S->>KS: Provide read inputs
S->>CR: Provide reference info
KS->>MR: Output k-mer counts
CR->>MR: Output reference paths & haplotype data
S->>MR: Provide graph file input
MR->>RH: Pass preprocessed BAM file
RH->>FM: Pass reheadered BAM file
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
workflow/rules/mapping.smk (1)
18-33
: Watch for portability issues incount_sample_kmers
.Using process substitution (
@<(ls {input.reads})
) can break portability under shells that don't support it (like/bin/sh
). To avoid potential file name issues or shell-incompatibilities, consider more direct ways of passing input read filenames tokmc
.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
workflow/envs/kmc.yaml
(1 hunks)workflow/resources/datavzrd/variant-calls-template.datavzrd.yaml
(2 hunks)workflow/rules/datavzrd.smk
(3 hunks)workflow/rules/mapping.smk
(2 hunks)workflow/rules/ref.smk
(1 hunks)
✅ Files skipped from review due to trivial changes (3)
- workflow/envs/kmc.yaml
- workflow/rules/datavzrd.smk
- workflow/resources/datavzrd/variant-calls-template.datavzrd.yaml
🔇 Additional comments (2)
workflow/rules/ref.smk (1)
151-160
: Consider verifying config entries and reintroducing caching if applicable.The new
rule get_pangenome
runscurl
to retrieve the pangenome fromconfig["ref"]["pangenome"]
, but there's no explicit check to ensurewc.ext
exists in the config. Additionally, you commented outcache: "omit-software"
. If large downloads are expected or frequently reused, caching might be beneficial.workflow/rules/mapping.smk (1)
52-74
: Confirm correct dependencies formap_reads_vg
.The input now depends on
"results/pangenome/{sample}/graph.gbz"
from the new rules, and the wrapper reference (file:/media/HDD/workspace/snakemake-wrappers/bio/vg/giraffe
) is also local. Ensure these dependencies exist in the expected location and that no concurrency or memory conflicts occur withthreads=64
.
I decided not to run
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
workflow/rules/mapping.smk (1)
76-76
: 🛠️ Refactor suggestionUse a published wrapper version instead of local path.
Using a local wrapper path (
file:/media/HDD/...
) reduces portability. Consider publishing the wrapper and using its version number.
🧹 Nitpick comments (3)
workflow/rules/mapping.smk (2)
18-34
: Consider making k-mer parameters configurable.The k-mer size (29) and memory limit (128GB) are hardcoded. Consider moving these to the config file for better flexibility across different datasets and compute environments.
- shell: - "kmc -k29 -m128 -okff -t{threads} -v @<(ls {input.reads}) {params.out_file} {params.out_dir} &> {log}" + params: + kmer_size=config["params"]["kmc"]["kmer_size"], + memory=config["params"]["kmc"]["memory"], + out_file=lambda w, output: os.path.splitext(output[0])[0], + out_dir=lambda w, output: os.path.dirname(output[0]) + shell: + "kmc -k{params.kmer_size} -m{params.memory} -okff -t{threads} -v @<(ls {input.reads}) {params.out_file} {params.out_dir} &> {log}"
74-74
: Consider making thread count configurable.The high thread count (64) should be configurable via the config file to accommodate different compute environments.
- threads: 64 + threads: config["params"]["vg"]["giraffe"]["threads"]workflow/envs/samtools.yaml (1)
5-6
: Add newline at end of file.Add a newline character at the end of the file to comply with YAML standards.
- samtools =1.21 - sed =4.8 +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 6-6: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
config/config.yaml
(1 hunks)workflow/envs/samtools.yaml
(1 hunks)workflow/rules/common.smk
(0 hunks)workflow/rules/mapping.smk
(1 hunks)
💤 Files with no reviewable changes (1)
- workflow/rules/common.smk
🧰 Additional context used
🪛 YAMLlint (1.35.1)
workflow/envs/samtools.yaml
[error] 6-6: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (3)
workflow/rules/mapping.smk (2)
37-45
: LGTM!The rule correctly generates reference paths for standard human chromosomes in the format required by vg giraffe.
79-91
: LGTM!The rule correctly reformats chromosome names in BAM headers for compatibility with downstream tools.
config/config.yaml (1)
36-37
: Verify the pangenome resource URLs.The URLs point to specific versions of the pangenome resources. Ensure these are the latest stable versions and are accessible.
✅ Verification successful
🏁 Script executed:
#!/bin/bash # Check if the pangenome resource URLs are accessible for url in \ "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl" \ "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz" do curl -I "$url" 2>/dev/null | head -n 1 | grep "200 OK" || echo "Failed to access: $url" doneLength of output: 985
Pangenome resource URLs verified as accessible.
Both the hapl and gbz URLs return "200 OK," confirming that the specified resources are accessible and up-to-date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
workflow/rules/mapping.smk (2)
37-46
: Improve chromosome naming and shell compatibility.
- Consider using MT instead of M for mitochondrial chromosome for better compatibility.
- Brace expansion might not work on all shells. Consider using a Python script or a more portable shell command.
- shell: - 'for chrom in {{1..22}} X Y M; do echo "{params.build}#0#chr$chrom"; done > {output} 2> {log}' + run: + with open(output[0], "w") as f, open(log[0], "w") as logf: + try: + for chrom in list(map(str, range(1, 23))) + ["X", "Y", "MT"]: + f.write(f"{params.build}#0#chr{chrom}\n") + except Exception as e: + print(str(e), file=logf)
79-92
: Improve header modification robustness.The sed command for header modification is complex and might be fragile. Consider using a more robust Python script or samtools reheader with a header file.
- shell: - "samtools view {input} -H | sed -E 's/(SN:{params.build}#0#chr)/SN:/; s/SN:M/SN:MT/' | samtools reheader - {input} > {output} 2> {log}" + run: + import subprocess + try: + header = subprocess.check_output(f"samtools view -H {input[0]}", shell=True).decode() + header = header.replace(f"SN:{params.build}#0#chr", "SN:") + header = header.replace("SN:M", "SN:MT") + with open("temp.header", "w") as f: + f.write(header) + shell("samtools reheader temp.header {input} > {output} 2> {log} && rm temp.header") + except subprocess.CalledProcessError as e: + shell(f"echo {str(e)} > {log}") + raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
workflow/rules/mapping.smk
(1 hunks)
🔇 Additional comments (3)
workflow/rules/mapping.smk (3)
94-107
: LGTM!The modification to use reheadered BAM as input is correct and aligns with the new workflow.
18-35
: Verify shell command portability and memory settings.
- Process substitution (
<(ls {input.reads})
) might not work on all systems. Consider using a temporary file instead.- The memory parameter
-m128
seems arbitrary. Consider making it configurable via config file.✅ Verification successful
🏁 Script executed:
#!/bin/bash # Description: Check if process substitution is supported in the execution environment if [[ "$SHELL" =~ "bash" ]]; then echo "Process substitution is supported in bash" else echo "WARNING: Current shell might not support process substitution" fiLength of output: 118
Process substitution is supported in the current bash environment; however, the hard-coded memory parameter (-m128) remains unchanged and might be worth parameterizing for enhanced flexibility.
- The shell script execution confirms that process substitution (i.e., "<(ls {input.reads})") works since the environment uses bash.
- The memory setting (-m128) is still hard-coded; consider refactoring this into a configurable parameter via your config file.
48-77
: Verify thread count and wrapper compatibility.
- The thread count of 64 is very high. Ensure your execution environment can support this many threads.
- Verify that wrapper version 5.7.0 supports all the new parameters and functionality.
✅ Verification successful
🌐 Web query:
What is the latest version of snakemake-wrapper-utils and what features does it support for vg giraffe?
💡 Result:
The latest version of `snakemake-wrapper-utils` is **0.6.2**, as referenced in multiple Snakemake wrapper implementations like GATK HaplotypeCaller[5] and SRA-Tools fasterq-dump[3]. For **vg giraffe** integration: 1. The `snakemake-wrapper-utils` itself doesn't directly implement vg giraffe features but enables wrapper functionality through: - Java options handling (`get_java_opts`)[5] - Standardized parameter parsing - Temporary directory management[5] - Common error checking patterns 2. Specific vg giraffe support comes from dedicated wrappers in the main Snakemake Wrapper Repository, which utilize these utilities: - Added in wrapper repository version 5.4.0[8] - Requires `snakemake-wrapper-utils=0.6.2` for core functionality - Enables Giraffe read alignment through standardized Snakemake rule definitions[8] Key integration points include memory management, thread allocation, and consistent I/O handling across genomics tools[5][8]. The utilities ensure compatibility between different wrapper versions and simplify Conda environment management for vg pipeline components[1][2]. Citations: - 1: https://bioconda.github.io/recipes/snakemake-wrapper-utils/README.html - 2: https://snakemake-wrappers.readthedocs.io/en/stable/ - 3: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/sra-tools/fasterq-dump.html - 4: https://snakemake.readthedocs.io/en/v7.1.1/_modules/snakemake/wrapper.html - 5: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/gatk/haplotypecaller.html - 6: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/datavzrd.html - 7: https://www.biostars.org/p/9579245/ - 8: https://github.com/snakemake/snakemake-wrappers/releases - 9: https://packages.debian.org/sid/arm/science/ - 10: https://github.com/snakemake/snakemake-wrapper-utils
Thread count and wrapper compatibility verified.
- The rule’s thread count of 64 is high; ensure your execution environment can reliably support 64 concurrent threads.
- The vg giraffe wrapper is set to version 5.7.0, which builds on the snakemake-wrapper-utils features from version 0.6.2. This aligns with recent wrapper releases (≥5.4.0) that support proper parameter handling and I/O management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
workflow/rules/ref.smk (1)
156-157
: Consider adding documentation for the extension types.While the wildcard constraint clearly defines the allowed extensions (
hapl|gbz
), it would be helpful to document what each extension represents and when to use which format.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
workflow/rules/ref.smk
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Formatting
🔇 Additional comments (2)
workflow/rules/ref.smk (2)
151-162
: LGTM! The rule changes align well with the PR objectives.The refactoring from
get_pangenome_haplotypes
toget_pangenome
effectively supports the transition from VCF-based haplotypes to direct pangenome graph operations. The dynamic extension handling allows for both haplotype and graph formats while maintaining clean separation in logs.
151-162
: Verify downstream impact of removed rules.The removal of
rename_haplotype_contigs
,rename_haplotype_chroms
, andvg_autoindex
rules aligns with the shift away from external graph sampling. However, please verify that no other rules in the workflow depend on these removed rules' outputs.✅ Verification successful
🏁 Script executed:
#!/bin/bash # Description: Search for any remaining references to removed rules or their outputs # Search for references to removed rules rg -i "rename_haplotype_contigs|rename_haplotype_chroms|vg_autoindex" # Search for potential file patterns that were outputs of removed rules rg "\.renamed\.(vcf|fa)|\.autoindex"Length of output: 110
I'm going to run an additional shell script to directly search for any rule definitions (or references) to the removed rules in the repository. This will help us verify that no residual dependencies remain.
🏁 Script executed:
#!/bin/bash # For each removed rule, search for its definition in the repository. for rule in rename_haplotype_contigs rename_haplotype_chroms vg_autoindex; do echo "Searching for rule definition for '$rule':" rg -i "rule\s+$rule" . done # Additionally, search for any references in the entire repository that might indicate a dependency on outputs of the removed rules. rg -i "pangenome_prefix" .Length of output: 1124
Downstream Impact Verified for Removed Rules
Our investigation reveals that the removed rules (rename_haplotype_contigs, rename_haplotype_chroms, vg_autoindex) are not referenced anywhere in the repository, and no outputs specific to them (e.g., ".renamed.*" or ".autoindex") are produced. References to pangenome_prefix continue as expected in the remaining rules.
The current implementation of pangenome mapping uses a haplotype vcf and fasta reference for building the pangenome graph. As this is a lossy process we decided to perform haplotype sampling on full pangenome graph (see https://github.com/vgteam/vg/wiki/Haplotype-Sampling#haplotype-sampling). This should also reduce memory usage and increase performance.
Open issues:
sample_pangenome_graph
fails when using more than one thread. When running the command manually this issue does not show upvg haplotype
is required (tests need to be created)vg giraffe
needs to be updated (also tests missing)Summary by CodeRabbit
samtools
and addedsed
, to improve compatibility and performance.These changes deliver a more robust, up-to-date workflow that enhances user-facing processing and data management.