-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Orthanq wrappers for hla and virus applications #2640
base: master
Are you sure you want to change the base?
Conversation
Don't have any experience with It seems |
Yes that could make sense aswell. But I was thinking of putting everything related to the each application in one folder, to make the navigation within the usage easier. Therefore the commands are as follows:
and
also could you eloborate more regarding the temp usage? Thank you for all the reviewing :) |
From what I can see in the docs, it seems that the tool has three subcommands and If the program supports a snakemake-wrappers/bio/samtools/sort/wrapper.py Lines 25 to 30 in f0b638a
|
This PR was marked as stale because it has been open for 6 months with no activity. |
Warning Rate limit exceeded@huzuner has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 48 minutes and 54 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThe pull request introduces a comprehensive set of configuration and workflow files for the Orthanq bioinformatics tool, specifically focusing on HLA (Human Leukocyte Antigen) and virus processing workflows. The changes include environment configuration files, metadata descriptions, Snakefile rules, test data, and wrapper scripts for different subcommands such as candidates, preprocess, and call functionalities. These additions provide a structured approach to processing genomic data for HLA typing and viral lineage quantification using the Orthanq tool version 1.4.0. Changes
Sequence DiagramsequenceDiagram
participant User
participant Snakemake
participant Orthanq
User->>Snakemake: Trigger workflow
Snakemake->>Orthanq: Execute candidates subcommand
Orthanq-->>Snakemake: Generate candidate variants
Snakemake->>Orthanq: Execute preprocess subcommand
Orthanq-->>Snakemake: Generate BCF file
Snakemake->>Orthanq: Execute call subcommand
Orthanq-->>Snakemake: Generate quantification results
Snakemake-->>User: Workflow complete
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (13)
bio/orthanq/hla/preprocess/wrapper.py (1)
9-13
: Consider incorporating thread support.
Iforthanq preprocess hla
supports parallelization, passingsnakemake.threads
can fulfill the PR objective of flexible threading configuration.Apply the following diff to pass the number of threads:
shell( "orthanq preprocess hla --genome {snakemake.input.genome} " " --haplotype-variants {snakemake.input.haplotype_variants} --reads {snakemake.input.reads} " - " --vg-index {snakemake.input.vg_index} --output {snakemake.output[0]} {log}" + " --vg-index {snakemake.input.vg_index} --threads {snakemake.threads} --output {snakemake.output[0]} {log}" )bio/orthanq/hla/call/test/Snakefile (2)
8-8
: Consider parameter configurability.
Havingprior="diploid"
is acceptable for defaults, but consider whether this parameter might need to be externally configurable in certain contexts.
9-10
: Add thread or resource definition if needed.
Iforthanq_call_hla
can be parallelized or is computationally intensive, defining threads or additional resources (e.g., memory) would help with workload distribution.bio/orthanq/hla/call/meta.yaml (4)
2-3
: Enhance description with more details about input/output formats.The description should specify:
- The expected format of the input BCF file
- The structure/columns of the output CSV file
- Any constraints or requirements for the process
5-7
: Add version requirements and format specifications for input files.For better usability:
- Specify the required version/format of haplotype variants
- Indicate the format of haplotype calls
- Add a direct link to the XML download instructions
8-9
: Expand parameter documentation.For the
prior
parameter:
- Explain the implications of each option (uniform vs. diploid)
- Document any default values
- Add examples of when to use each option
10-11
: Specify output format details.The output description should include:
- CSV file structure (columns, data types)
- Any header information
- Example output snippet
bio/orthanq/hla/candidates/test/hla.xml (1)
61-62
: Consider reducing test data size.The current test file contains complete sequence and translation data. Consider:
- Using a smaller subset of features for testing
- Creating a minimal test file that still validates functionality
- Documenting why this specific allele was chosen for testing
Also applies to: 122-124
bio/orthanq/virus/candidates/test/Snakefile (1)
1-7
: Optionally define resource usage.
If these rules run on HPC or multi-core environments, consider specifying resources (e.g.,threads
,memory
) for more predictable scheduling.bio/orthanq/virus/preprocess/test/Snakefile (1)
1-10
: Resources or threads usage.
Consider adding explicit resource or thread specifications to ensure consistent performance across different systems, especially in HPC environments.bio/orthanq/virus/candidates/wrapper.py (1)
1-11
: Add documentation and usage examples.The wrapper would benefit from comprehensive documentation:
- Add a docstring explaining:
- Required input parameters
- Output format
- Available configuration options
- Include example usage
- Reference to test cases location
Add documentation at the top of the file:
+""" +Wrapper for Orthanq virus candidates generation. + +Required input: + reads: Input reads file(s) + +Required output: + Candidate variants file (VCF format) + +Optional parameters: + threads: Number of threads to use + temp_dir: Directory for temporary files + extra: Additional parameters to pass to Orthanq + +Example usage: + rule orthanq_candidates_virus: + input: + reads="mapped/virus.bam" + output: + "candidates/virus.vcf" + params: + temp_dir="temp", + extra="" + threads: 4 + wrapper: + "bio/orthanq/virus/candidates" +""" + __author__ = "Hamdiye Uzuner"🧰 Tools
🪛 Ruff (0.8.2)
7-7: Undefined name
snakemake
(F821)
bio/orthanq/hla/preprocess/test/Snakefile (2)
3-6
: Consider introducing a temporary or ephemeral directory for intermediate files.If the preprocessing steps generate large or sensitive intermediate files, you may want to isolate them in a temporary directory to keep the workspace uncluttered and facilitate cleanup. You could modify the rule parameters to include a temporary directory, for example:
rule orthanq_preprocess_hla: input: genome="genome.fasta", haplotype_variants="haplotype_variants.vcf", vg_index="hprc_prebuilt.xg", reads=["reads_1.fq", "reads_2.fq"] + params: + tempdir=temp("orthanq_preprocess_hla_temp") output: "calls.bcf",
1-12
: Consider adding a threads directive for performance.If the Orthanq preprocessing step supports multithreading, you can improve performance by specifying
threads:
in the rule:rule orthanq_preprocess_hla: + threads: 4 input: ...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
bio/orthanq/hla/candidates/test/allele_freq.csv
is excluded by!**/*.csv
📒 Files selected for processing (28)
bio/orthanq/hla/call/environment.yaml
(1 hunks)bio/orthanq/hla/call/meta.yaml
(1 hunks)bio/orthanq/hla/call/test/Snakefile
(1 hunks)bio/orthanq/hla/call/test/hla.xml
(1 hunks)bio/orthanq/hla/call/wrapper.py
(1 hunks)bio/orthanq/hla/candidates/environment.yaml
(1 hunks)bio/orthanq/hla/candidates/meta.yaml
(1 hunks)bio/orthanq/hla/candidates/test/Snakefile
(1 hunks)bio/orthanq/hla/candidates/test/genome.fasta
(1 hunks)bio/orthanq/hla/candidates/test/hla.xml
(1 hunks)bio/orthanq/hla/candidates/test/hla_alleles.fasta
(1 hunks)bio/orthanq/hla/candidates/wrapper.py
(1 hunks)bio/orthanq/hla/preprocess/environment.yaml
(1 hunks)bio/orthanq/hla/preprocess/meta.yaml
(1 hunks)bio/orthanq/hla/preprocess/test/Snakefile
(1 hunks)bio/orthanq/hla/preprocess/wrapper.py
(1 hunks)bio/orthanq/virus/call/environment.yaml
(1 hunks)bio/orthanq/virus/call/meta.yaml
(1 hunks)bio/orthanq/virus/call/test/Snakefile
(1 hunks)bio/orthanq/virus/call/wrapper.py
(1 hunks)bio/orthanq/virus/candidates/environment.yaml
(1 hunks)bio/orthanq/virus/candidates/meta.yaml
(1 hunks)bio/orthanq/virus/candidates/test/Snakefile
(1 hunks)bio/orthanq/virus/candidates/wrapper.py
(1 hunks)bio/orthanq/virus/preprocess/environment.yaml
(1 hunks)bio/orthanq/virus/preprocess/meta.yaml
(1 hunks)bio/orthanq/virus/preprocess/test/Snakefile
(1 hunks)bio/orthanq/virus/preprocess/wrapper.py
(1 hunks)
✅ Files skipped from review due to trivial changes (13)
- bio/orthanq/hla/candidates/test/genome.fasta
- bio/orthanq/hla/candidates/environment.yaml
- bio/orthanq/virus/call/environment.yaml
- bio/orthanq/hla/call/environment.yaml
- bio/orthanq/virus/preprocess/environment.yaml
- bio/orthanq/virus/candidates/environment.yaml
- bio/orthanq/hla/preprocess/environment.yaml
- bio/orthanq/hla/candidates/test/hla_alleles.fasta
- bio/orthanq/virus/candidates/meta.yaml
- bio/orthanq/hla/preprocess/meta.yaml
- bio/orthanq/virus/preprocess/meta.yaml
- bio/orthanq/virus/call/meta.yaml
- bio/orthanq/hla/candidates/meta.yaml
🧰 Additional context used
📓 Path-based instructions (6)
bio/orthanq/hla/preprocess/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
bio/orthanq/virus/preprocess/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
bio/orthanq/hla/candidates/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
bio/orthanq/virus/call/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
bio/orthanq/hla/call/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
bio/orthanq/virus/candidates/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
🪛 Ruff (0.8.2)
bio/orthanq/hla/preprocess/wrapper.py
7-7: Undefined name snakemake
(F821)
bio/orthanq/virus/preprocess/wrapper.py
7-7: Undefined name snakemake
(F821)
bio/orthanq/hla/candidates/wrapper.py
7-7: Undefined name snakemake
(F821)
bio/orthanq/virus/call/wrapper.py
7-7: Undefined name snakemake
(F821)
8-8: Undefined name snakemake
(F821)
bio/orthanq/hla/call/wrapper.py
7-7: Undefined name snakemake
(F821)
8-8: Undefined name snakemake
(F821)
bio/orthanq/virus/candidates/wrapper.py
7-7: Undefined name snakemake
(F821)
🔇 Additional comments (24)
bio/orthanq/virus/preprocess/wrapper.py (3)
1-3
: All metadata is properly declared.
The licensing and attribution metadata at the top is clear and well-defined.
7-7
: Valid usage of snakemake context.
This usage of snakemake.log_fmt_shell
is a standard approach in Snakemake wrappers to capture logs. Nothing to address here.
🧰 Tools
🪛 Ruff (0.8.2)
7-7: Undefined name snakemake
(F821)
9-12
: Ensure the subcommand parameters match expected workflow inputs.
The shell command for “orthanq preprocess virus” looks correct for the specified input and output. Just verify that downstream steps are ready to consume {snakemake.output[0]}
.
✅ Verification successful
The output file is consistently handled across the codebase
The search results show that snakemake.output[0]
is a standard way to access the first output file in Snakemake wrappers across the codebase. The pattern is used consistently in many wrappers for different tools, and in each case, it's used to specify the primary output file path. This is exactly how it's being used in the orthanq preprocess virus
wrapper.
Looking specifically at the orthanq-related wrappers:
bio/orthanq/virus/preprocess/wrapper.py
(the file under review)bio/orthanq/virus/candidates/wrapper.py
bio/orthanq/virus/call/wrapper.py
bio/orthanq/hla/candidates/wrapper.py
bio/orthanq/hla/preprocess/wrapper.py
bio/orthanq/hla/call/wrapper.py
All of these wrappers use snakemake.output[0]
consistently as the output file parameter for their respective commands, indicating that this is the established pattern for orthanq wrappers.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for usage of {snakemake.output[0]} in the rest of the codebase to ensure consistent handling.
rg -A 3 "snakemake.output\[0\]"
Length of output: 37130
bio/orthanq/hla/preprocess/wrapper.py (2)
1-3
: Metadata aligns with best practices.
The author, copyright, and license declarations are all in order, matching typical conventions.
7-7
: Logging approach is appropriate for Snakemake.
Utilizing Snakemake’s built-in log formatting ensures that stdout and stderr are captured properly.
🧰 Tools
🪛 Ruff (0.8.2)
7-7: Undefined name snakemake
(F821)
bio/orthanq/hla/call/test/Snakefile (2)
1-1
: Rule naming is clear and consistent.
The rule name orthanq_call_hla
follows snake_case convention and accurately reflects the function's intended use.
2-5
: Ensure upstream availability of input files.
Kindly verify that haplotype_calls.bcf
, haplotype_variants.vcf
, and hla.xml
are created or available upstream in the workflow before this rule executes.
bio/orthanq/hla/candidates/test/Snakefile (4)
1-1
: Rule name matches typical wrapper usage.
Using orthanq_candidates_hla
clarifies the process being run.
2-5
: Verify existence of required input files.
Please confirm that allele_freq.csv
, hla_alleles.fasta
, genome.fasta
, and hla.xml
are generated or provided by upstream rules or input channels.
8-8
: Output extension alignment.
candidates.vcf
has a VCF extension, which aligns well with typical variant output. Confirm that any downstream rules are prepared to handle this file.
9-10
: Log management approach looks good.
Capturing logs in logs/orthanq_candidates/candidates_hla.log
allows for easier debugging and traceability.
bio/orthanq/hla/candidates/wrapper.py (2)
1-3
: Metadata is helpful.
Author and license information are clearly stated.
9-13
: Wrapper command structure is clear.
The invocation of orthanq candidates hla
correctly references input files and expected outputs. Proper logging is also configured.
bio/orthanq/hla/call/wrapper.py (3)
1-3
: Author and license metadata.
Including author, copyright, and license details helps maintain clarity of ownership and usage terms.
7-8
: Parameter usage is concise.
prior_opt = snakemake.params.get("prior", "")
is a clean way to handle optional parameters.
🧰 Tools
🪛 Ruff (0.8.2)
7-7: Undefined name snakemake
(F821)
8-8: Undefined name snakemake
(F821)
9-13
: Well-structured shell command.
All relevant inputs, including haplotype_calls
and haplotype_variants
, are properly integrated into the command. Logging is correctly handled.
bio/orthanq/hla/candidates/test/hla.xml (1)
1-127
: Verify test data licensing and attribution.
This appears to be data from the IPD-IMGT/HLA database. Please ensure:
- The test data usage complies with the database's terms of use
- Proper attribution is included in the repository
- The data version is documented
bio/orthanq/virus/candidates/test/Snakefile (1)
3-3
: Consider explicitly naming output files.
What files are created in the viral_candidates
directory? Aligning with a past request, try naming key outputs for clarity.
bio/orthanq/virus/call/test/Snakefile (1)
1-11
: Validate parameter usage.
Double-check that "uniform"
is the intended default for the prior
parameter. If you anticipate user-defined priors, consider exposing a parameter option in your workflow configuration.
bio/orthanq/virus/candidates/wrapper.py (3)
1-4
: LGTM!
The metadata section is complete with appropriate author, copyright, and license information.
5-7
: LGTM!
The imports and logging setup follow Snakemake conventions correctly.
🧰 Tools
🪛 Ruff (0.8.2)
7-7: Undefined name snakemake
(F821)
9-11
: 🛠️ Refactor suggestion
Add essential parameters and error handling to the Orthanq command.
The current implementation appears to be missing several important parameters:
- Input files/parameters for virus candidate generation
- Threading configuration for performance optimization
- Temporary directory usage (mentioned in PR discussion)
- Error handling for missing or invalid files
Consider updating the shell command to include these parameters:
shell(
- "orthanq candidates virus --output {snakemake.output[0]} {log}"
+ "orthanq candidates virus"
+ " --input {snakemake.input.reads}"
+ " --output {snakemake.output[0]}"
+ " --threads {snakemake.threads}"
+ " --temp-dir {snakemake.params.temp_dir}"
+ " {snakemake.params.extra}"
+ " {log}"
)
Let's verify the Orthanq command parameters:
bio/orthanq/hla/preprocess/test/Snakefile (2)
1-1
: Rule name follows snake_case best practices.
8-8
: Check if a BCF index is required.
Typically, BCF files benefit from an accompanying index (e.g., CSI or TBI) if further downstream analysis is expected. Please verify if the downstream steps need an indexed BCF and consider adding it to the output. Here’s a possible snippet:
output:
- "calls.bcf",
+ "calls.bcf",
+ "calls.bcf.csi",
Hi @huzuner, did you have the time to look into the comments above? |
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
I thought about your suggestion but I couldn't be sure whether that would be a better way of organizing it or not. My idea initially was that because the tool has two applications, hla and virus, I wanted to organize the folders in that manner in order to make this more prominent, but your idea also does not seem wrong. In the end I guess I'd rather have it as it is now if that's not really a major issue. We can of course always reorganize it later. In addition, the tool does not have a temp option at the moment. |
QC
For all wrappers added by this PR,
input:
andoutput:
file paths in the resulting rule can be changed arbitrarily,threads: x
statement withx
being a reasonable default,map_reads
for a step that maps reads),environment.yaml
specifications follow the respective best practices,environment.yaml
pinning has been updated by runningsnakedeploy pin-conda-envs environment.yaml
on a linux machine,input:
oroutput:
),Snakefile
s and their entries are explained via comments (input:
/output:
/params:
etc.),stderr
and/orstdout
are logged correctly (log:
), depending on the wrapped tool,tempfile.gettempdir()
points to (see here; this also means that using any Pythontempfile
default behavior works),meta.yaml
contains a link to the documentation of the respective tool or command,Snakefile
s pass the linting (snakemake --lint
),Snakefile
s are formatted with snakefmt,Summary by CodeRabbit
New Features
Documentation
Chores