feat: refactor pangenome mapping #357

Open · wants to merge 5 commits into base: master

Conversation

FelixMoelder
Contributor

@FelixMoelder FelixMoelder commented Feb 5, 2025

The current implementation of pangenome mapping uses a haplotype VCF and a FASTA reference for building the pangenome graph. As this is a lossy process, we decided to perform haplotype sampling on the full pangenome graph (see https://github.com/vgteam/vg/wiki/Haplotype-Sampling#haplotype-sampling). This should also reduce memory usage and increase performance.

Open issues:

  • rule sample_pangenome_graph fails when using more than one thread; when the command is run manually, this issue does not occur
  • a snakemake-wrapper for vg haplotypes is required (tests still need to be created)
  • the snakemake-wrapper for vg giraffe needs to be updated (tests are missing there as well)

Summary by CodeRabbit

  • New Features
    • Introduced a new package environment setup for KMC and enhanced data mapping with additional processing rules.
  • Refactor
    • Updated external data references to the latest versions, including URLs and file formats for pangenome haplotypes.
  • Chores
    • Upgraded tool dependencies, including samtools and added sed, to improve compatibility and performance.

These changes deliver a more robust, up-to-date workflow that enhances user-facing processing and data management.


coderabbitai bot commented Feb 5, 2025

Walkthrough

The pull request introduces several updates across the workflow. A new Conda environment configuration for kmc is added, and the samtools environment is updated with a new dependency. The resource YAML file now points to updated URLs for variant calls. Multiple Snakemake rules have been enhanced: new rules (for counting k-mers, creating reference paths, and reheadering mapped reads) have been added, and existing rules (map_reads_vg, get_pangenome_haplotypes) have been modified or removed. Additionally, related configurations have been streamlined in the main config file.

Changes

File(s) and change summary:

  • workflow/envs/kmc.yaml, workflow/envs/samtools.yaml — New Conda environment kmc.yaml adds channels (conda-forge, bioconda) and kmc=3.2; samtools.yaml updates samtools from 1.12 to 1.21 and adds sed=4.8.
  • workflow/resources/datavzrd/…variant-calls-template.datavzrd.yaml — Updated URLs for the 'clinical significance', 'canonical', and 'mane_plus_clinical' fields from v1.2.1 to v1.2.2.
  • workflow/rules/mapping.smk — Added new rules count_sample_kmers, create_reference_paths, and reheader_mapped_reads; modified map_reads_vg inputs/outputs and updated the fix_mate input.
  • workflow/rules/ref.smk — Renamed get_pangenome_haplotypes to get_pangenome with updated outputs (from .vcf.gz to .{ext}), parameters (lambda for URL, wildcard constraints), and log; removed rules rename_haplotype_contigs, rename_haplotype_chroms, and vg_autoindex.
  • config/config.yaml — Changed the pangenome section: replaced the pangenome VCF URL with separate hapl and gbz URLs; removed the renaming expressions.
  • workflow/rules/common.smk — Removed the function get_vg_autoindex_vcf.

Sequence Diagram(s)

sequenceDiagram
    participant S as Sample Data
    participant KS as count_sample_kmers
    participant CR as create_reference_paths
    participant MR as map_reads_vg
    participant RH as reheader_mapped_reads
    participant FM as fix_mate

    S->>KS: Provide read inputs
    S->>CR: Provide reference info
    KS->>MR: Output k-mer counts
    CR->>MR: Output reference paths & haplotype data
    S->>MR: Provide graph file input
    MR->>RH: Pass preprocessed BAM file
    RH->>FM: Pass reheadered BAM file

Suggested reviewers

  • johanneskoester

Poem

I'm a little rabbit, hopping through the code,
Fresh changes abound on this winding road.
New rules sprout like carrots in a row,
Environments updated, letting dependencies grow.
With each commit, my heart skips a beat—
An ASCII celebration, joyful and sweet!
🥕💻 Hop on, friends, to the rhythm of our code!



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
workflow/rules/mapping.smk (1)

18-33: Watch for portability issues in count_sample_kmers.

Using process substitution (@<(ls {input.reads})) can break portability under shells that don't support it (such as /bin/sh). To avoid potential file-name issues or shell incompatibilities, consider a more direct way of passing the input read filenames to kmc.
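One portable alternative is to write the read filenames to a list file and hand that to kmc via its '@file' input-list syntax, avoiding bash-only process substitution entirely. A minimal sketch of the idea (the -k29/-m128/-okff flags are taken from the rule's shell command above; the helper function name and paths are hypothetical):

```python
import shlex
import tempfile


def build_kmc_command(reads, out_file, out_dir, threads=4, k=29, mem_gb=128):
    """Write the input read paths to a list file and return a kmc command
    that reads them via '@list' instead of bash process substitution."""
    lst = tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False)
    with lst:
        lst.write("\n".join(reads) + "\n")
    # kmc accepts '@file' with one input path per line; this also runs
    # under /bin/sh, where '<(...)' is unavailable.
    return (
        f"kmc -k{k} -m{mem_gb} -okff -t{threads} "
        f"@{shlex.quote(lst.name)} {shlex.quote(out_file)} {shlex.quote(out_dir)}"
    )


cmd = build_kmc_command(["a_R1.fq.gz", "a_R2.fq.gz"], "results/kmc/a", "results/kmc")
```

In the Snakemake rule this could live in a params lambda that prepares the list file before the shell command runs.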

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84b6aed and ac3eac9.

📒 Files selected for processing (5)
  • workflow/envs/kmc.yaml (1 hunks)
  • workflow/resources/datavzrd/variant-calls-template.datavzrd.yaml (2 hunks)
  • workflow/rules/datavzrd.smk (3 hunks)
  • workflow/rules/mapping.smk (2 hunks)
  • workflow/rules/ref.smk (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • workflow/envs/kmc.yaml
  • workflow/rules/datavzrd.smk
  • workflow/resources/datavzrd/variant-calls-template.datavzrd.yaml
🔇 Additional comments (2)
workflow/rules/ref.smk (1)

151-160: Consider verifying config entries and reintroducing caching if applicable.

The new rule get_pangenome runs curl to retrieve the pangenome from config["ref"]["pangenome"], but there's no explicit check to ensure wc.ext exists in the config. Additionally, you commented out cache: "omit-software". If large downloads are expected or frequently reused, caching might be beneficial.
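One way to make the missing-extension case fail fast is to resolve the URL in a params function that raises a clear error for unconfigured extensions. A sketch under the config layout introduced in this PR (the function name is hypothetical; `config` is Snakemake's global, stubbed here for illustration):

```python
# Stub of the relevant config section (in Snakemake, `config` is provided
# by the workflow; the URLs below are the ones added in this PR).
config = {
    "ref": {
        "pangenome": {
            "hapl": "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl",
            "gbz": "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz",
        }
    }
}


def get_pangenome_url(wildcards):
    """Params function for rule get_pangenome: look up the download URL for
    the requested extension, raising a clear error if it is unconfigured."""
    urls = config["ref"]["pangenome"]
    try:
        return urls[wildcards.ext]
    except KeyError:
        raise ValueError(
            f"no pangenome URL configured for extension '{wildcards.ext}'; "
            f"expected one of {sorted(urls)}"
        )
```

Used as `params: url=get_pangenome_url`, this turns a silent KeyError into an actionable message at DAG-build time.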

workflow/rules/mapping.smk (1)

52-74: Confirm correct dependencies for map_reads_vg.

The input now depends on "results/pangenome/{sample}/graph.gbz" from the new rules, and the wrapper reference (file:/media/HDD/workspace/snakemake-wrappers/bio/vg/giraffe) is also local. Ensure these dependencies exist in the expected location and that no concurrency or memory conflicts occur with threads=64.

workflow/rules/mapping.smk — outdated review comment (resolved)
@FelixMoelder
Contributor Author

FelixMoelder commented Feb 7, 2025

I decided not to run vg haplotypes for sampling the pangenome graph before running vg giraffe.
Instead, sampling is performed by vg giraffe internally.
The reasons are the following:

  • when snakemake executes vg haplotypes with multiple threads, it fails with [vg haplotypes] cannot run <x> threads in parallel on this system
  • running only a single thread but multiple samples at once substantially increases memory usage
  • vg giraffe also creates dist- and minimizer-indexes before mapping, so there is little reason to "outsource" graph sampling
  • the sampled graph would be marked as temporary and therefore be removed in any case after running vg giraffe
  • creating the sampled graph takes little time and has little impact on the runtime of vg giraffe
  • vg giraffe works with multithreading

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
workflow/rules/mapping.smk (1)

76-76: 🛠️ Refactor suggestion

Use a published wrapper version instead of local path.

Using a local wrapper path (file:/media/HDD/...) reduces portability. Consider publishing the wrapper and using its version number.

🧹 Nitpick comments (3)
workflow/rules/mapping.smk (2)

18-34: Consider making k-mer parameters configurable.

The k-mer size (29) and memory limit (128GB) are hardcoded. Consider moving these to the config file for better flexibility across different datasets and compute environments.

-    shell:
-        "kmc -k29 -m128 -okff -t{threads} -v @<(ls {input.reads}) {params.out_file} {params.out_dir} &> {log}"
+    params:
+        kmer_size=config["params"]["kmc"]["kmer_size"],
+        memory=config["params"]["kmc"]["memory"],
+        out_file=lambda w, output: os.path.splitext(output[0])[0],
+        out_dir=lambda w, output: os.path.dirname(output[0])
+    shell:
+        "kmc -k{params.kmer_size} -m{params.memory} -okff -t{threads} -v @<(ls {input.reads}) {params.out_file} {params.out_dir} &> {log}"

74-74: Consider making thread count configurable.

The high thread count (64) should be configurable via the config file to accommodate different compute environments.

-    threads: 64
+    threads: config["params"]["vg"]["giraffe"]["threads"]
workflow/envs/samtools.yaml (1)

5-6: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML standards.

   - samtools =1.21
   - sed =4.8
+
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 6-6: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac3eac9 and 983d16f.

📒 Files selected for processing (4)
  • config/config.yaml (1 hunks)
  • workflow/envs/samtools.yaml (1 hunks)
  • workflow/rules/common.smk (0 hunks)
  • workflow/rules/mapping.smk (1 hunks)
💤 Files with no reviewable changes (1)
  • workflow/rules/common.smk
🧰 Additional context used
🪛 YAMLlint (1.35.1)
workflow/envs/samtools.yaml

[error] 6-6: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (3)
workflow/rules/mapping.smk (2)

37-45: LGTM!

The rule correctly generates reference paths for standard human chromosomes in the format required by vg giraffe.
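The path-name scheme the rule emits can be sketched in plain Python (assuming the GRCh38 build name used elsewhere in this PR; the minigraph-cactus graphs name reference paths as `<build>#0#<contig>`):

```python
def reference_paths(build="GRCh38"):
    """Generate reference path names for the standard human chromosomes
    in the '<build>#0#chr<N>' form expected by vg giraffe."""
    chroms = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]
    return [f"{build}#0#chr{c}" for c in chroms]


paths = reference_paths()
```

Writing these lines to the output file reproduces the rule's shell loop without relying on brace expansion.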


79-91: LGTM!

The rule correctly reformats chromosome names in BAM headers for compatibility with downstream tools.
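The renaming performed by the rule's sed expression can be expressed as a small Python function (a sketch of the same two substitutions, assuming the GRCh38 build name; like the sed program, it assumes no 'SN:MT' field exists before renaming):

```python
def rename_sn(line, build="GRCh38"):
    """Mirror the rule's sed program on one BAM header line:
    sed -E 's/(SN:<build>#0#chr)/SN:/; s/SN:M/SN:MT/'"""
    # Strip the pangenome path prefix from SN fields ...
    line = line.replace(f"SN:{build}#0#chr", "SN:")
    # ... then map the mitochondrial name M to MT.
    return line.replace("SN:M", "SN:MT")
```

Applied to each @SQ line of `samtools view -H` output, this yields the plain chromosome names expected by downstream tools.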

config/config.yaml (1)

36-37: Verify the pangenome resource URLs.

The URLs point to specific versions of the pangenome resources. Ensure these are the latest stable versions and are accessible.

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Check if the pangenome resource URLs are accessible
for url in \
  "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl" \
  "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz"
do
  curl -I "$url" 2>/dev/null | head -n 1 | grep "200 OK" || echo "Failed to access: $url"
done

Length of output: 985


Pangenome resource URLs verified as accessible.

Both the hapl and gbz URLs return "200 OK," confirming that the specified resources are accessible and up-to-date.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
workflow/rules/mapping.smk (2)

37-46: Improve chromosome naming and shell compatibility.

  1. Consider using MT instead of M for mitochondrial chromosome for better compatibility.
  2. Brace expansion might not work on all shells. Consider using a Python script or a more portable shell command.
-    shell:
-        'for chrom in {{1..22}} X Y M; do echo "{params.build}#0#chr$chrom"; done > {output} 2> {log}'
+    run:
+        with open(output[0], "w") as f, open(log[0], "w") as logf:
+            try:
+                for chrom in list(map(str, range(1, 23))) + ["X", "Y", "MT"]:
+                    f.write(f"{params.build}#0#chr{chrom}\n")
+            except Exception as e:
+                print(str(e), file=logf)

79-92: Improve header modification robustness.

The sed command for header modification is complex and might be fragile. Consider using a more robust Python script or samtools reheader with a header file.

-    shell:
-        "samtools view {input} -H | sed -E 's/(SN:{params.build}#0#chr)/SN:/; s/SN:M/SN:MT/' | samtools reheader - {input} > {output} 2> {log}"
+    run:
+        import subprocess
+        try:
+            header = subprocess.check_output(f"samtools view -H {input[0]}", shell=True).decode()
+            header = header.replace(f"SN:{params.build}#0#chr", "SN:")
+            header = header.replace("SN:M", "SN:MT")
+            with open("temp.header", "w") as f:
+                f.write(header)
+            shell("samtools reheader temp.header {input} > {output} 2> {log} && rm temp.header")
+        except subprocess.CalledProcessError as e:
+            shell(f"echo {str(e)} > {log}")
+            raise
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 983d16f and a74c7a6.

📒 Files selected for processing (1)
  • workflow/rules/mapping.smk (1 hunks)
🔇 Additional comments (3)
workflow/rules/mapping.smk (3)

94-107: LGTM!

The modification to use reheadered BAM as input is correct and aligns with the new workflow.


18-35: Verify shell command portability and memory settings.

  1. Process substitution (<(ls {input.reads})) might not work on all systems. Consider using a temporary file instead.
  2. The memory parameter -m128 seems arbitrary. Consider making it configurable via config file.
✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check if process substitution is supported in the execution environment

if [[ "$SHELL" =~ "bash" ]]; then
  echo "Process substitution is supported in bash"
else
  echo "WARNING: Current shell might not support process substitution"
fi

Length of output: 118


Process substitution is supported in the current bash environment; however, the hard-coded memory parameter (-m128) remains unchanged and might be worth parameterizing for enhanced flexibility.

  • The shell script execution confirms that process substitution (i.e., "<(ls {input.reads})") works since the environment uses bash.
  • The memory setting (-m128) is still hard-coded; consider refactoring this into a configurable parameter via your config file.

48-77: Verify thread count and wrapper compatibility.

  1. The thread count of 64 is very high. Ensure your execution environment can support this many threads.
  2. Verify that wrapper version 5.7.0 supports all the new parameters and functionality.
✅ Verification successful

🌐 Web query:

What is the latest version of snakemake-wrapper-utils and what features does it support for vg giraffe?

💡 Result:

The latest version of `snakemake-wrapper-utils` is **0.6.2**, as referenced in multiple Snakemake wrapper implementations like GATK HaplotypeCaller[5] and SRA-Tools fasterq-dump[3]. 

For **vg giraffe** integration:
1. The `snakemake-wrapper-utils` itself doesn't directly implement vg giraffe features but enables wrapper functionality through:
   - Java options handling (`get_java_opts`)[5]
   - Standardized parameter parsing
   - Temporary directory management[5]
   - Common error checking patterns

2. Specific vg giraffe support comes from dedicated wrappers in the main Snakemake Wrapper Repository, which utilize these utilities:
   - Added in wrapper repository version 5.4.0[8]
   - Requires `snakemake-wrapper-utils=0.6.2` for core functionality
   - Enables Giraffe read alignment through standardized Snakemake rule definitions[8]

Key integration points include memory management, thread allocation, and consistent I/O handling across genomics tools[5][8]. The utilities ensure compatibility between different wrapper versions and simplify Conda environment management for vg pipeline components[1][2].

Citations:

- 1: https://bioconda.github.io/recipes/snakemake-wrapper-utils/README.html
- 2: https://snakemake-wrappers.readthedocs.io/en/stable/
- 3: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/sra-tools/fasterq-dump.html
- 4: https://snakemake.readthedocs.io/en/v7.1.1/_modules/snakemake/wrapper.html
- 5: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/gatk/haplotypecaller.html
- 6: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/datavzrd.html
- 7: https://www.biostars.org/p/9579245/
- 8: https://github.com/snakemake/snakemake-wrappers/releases
- 9: https://packages.debian.org/sid/arm/science/
- 10: https://github.com/snakemake/snakemake-wrapper-utils

Thread count and wrapper compatibility verified.

  • The rule’s thread count of 64 is high; ensure your execution environment can reliably support 64 concurrent threads.
  • The vg giraffe wrapper is set to version 5.7.0, which builds on the snakemake-wrapper-utils features from version 0.6.2. This aligns with recent wrapper releases (≥5.4.0) that support proper parameter handling and I/O management.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
workflow/rules/ref.smk (1)

156-157: Consider adding documentation for the extension types.

While the wildcard constraint clearly defines the allowed extensions (hapl|gbz), it would be helpful to document what each extension represents and when to use which format.
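For instance, the two extensions could be documented directly in config.yaml (a sketch using the URLs from this PR; the comments paraphrase how the vg haplotype-sampling documentation describes the two formats):

```yaml
ref:
  pangenome:
    # .gbz: the full pangenome graph in GBZ format, the main input to vg giraffe
    gbz: "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz"
    # .hapl: the haplotype information index used for haplotype sampling
    hapl: "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl"
```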

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a74c7a6 and 5c628d1.

📒 Files selected for processing (1)
  • workflow/rules/ref.smk (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Formatting
🔇 Additional comments (2)
workflow/rules/ref.smk (2)

151-162: LGTM! The rule changes align well with the PR objectives.

The refactoring from get_pangenome_haplotypes to get_pangenome effectively supports the transition from VCF-based haplotypes to direct pangenome graph operations. The dynamic extension handling allows for both haplotype and graph formats while maintaining clean separation in logs.


151-162: Verify downstream impact of removed rules.

The removal of rename_haplotype_contigs, rename_haplotype_chroms, and vg_autoindex rules aligns with the shift away from external graph sampling. However, please verify that no other rules in the workflow depend on these removed rules' outputs.

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Search for any remaining references to removed rules or their outputs

# Search for references to removed rules
rg -i "rename_haplotype_contigs|rename_haplotype_chroms|vg_autoindex"

# Search for potential file patterns that were outputs of removed rules
rg "\.renamed\.(vcf|fa)|\.autoindex"

Length of output: 110


I'm going to run an additional shell script to directly search for any rule definitions (or references) to the removed rules in the repository. This will help us verify that no residual dependencies remain.


🏁 Script executed:

#!/bin/bash
# For each removed rule, search for its definition in the repository.
for rule in rename_haplotype_contigs rename_haplotype_chroms vg_autoindex; do
  echo "Searching for rule definition for '$rule':"
  rg -i "rule\s+$rule" .
done

# Additionally, search for any references in the entire repository that might indicate a dependency on outputs of the removed rules.
rg -i "pangenome_prefix" .

Length of output: 1124


Downstream Impact Verified for Removed Rules

Our investigation reveals that the removed rules (rename_haplotype_contigs, rename_haplotype_chroms, vg_autoindex) are not referenced anywhere in the repository, and no outputs specific to them (e.g., ".renamed.*" or ".autoindex") are produced. References to pangenome_prefix continue as expected in the remaining rules.
