divide-and-conquer demultiplexing #202
Conversation
📝 Walkthrough
This pull request introduces several enhancements across various components. In the conda recipe, a new dependency (seqkit) is added.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant SNK as Snakemake Workflow
    participant PR as Partitioning Rules
    participant DM as Demultiplex Rule
    participant MP as Merge Rules
    participant AQ as Assess Quality
    participant WS as Workflow Summary
    User->>SNK: Trigger workflow
    SNK->>PR: Execute partition_reads & partition_index (using seqkit)
    PR->>DM: Supply partitioned read files and schema
    DM->>MP: Process demultiplexing and output partitioned files
    MP->>AQ: Merge partition outputs and barcode logs
    AQ->>WS: Generate workflow summary with consolidated logs
    WS->>User: Return final summary
```
Actionable comments posted: 5
🧹 Nitpick comments (9)
harpy/view.py (1)

17-39: Consider removing the unused `stdscr` variable or referencing it meaningfully.
Currently, `stdscr = curses.initscr()` is never used, which triggers a linter warning. If no usage is intended, removing the assignment can improve clarity.

```diff
- stdscr = curses.initscr()
+ curses.initscr()
```

🧰 Tools
🪛 Ruff (0.8.2)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
harpy/scripts/demultiplex_gen1.py (2)

2-4: Use context managers for file operations.
Opening the log file with a plain `open()` can be replaced with a context manager for clearer resource handling. Also consider restoring `sys.stdout` and `sys.stderr` afterwards to avoid side effects.

```diff
- f = open(snakemake.log[0], "w")
- sys.stderr = sys.stdout = f
+ with open(snakemake.log[0], "w") as f:
+     sys.stderr = sys.stdout = f
+     # rest of logic
+ # Possibly restore sys.stderr/stdout here or rely on context manager scope
```

🧰 Tools
🪛 Ruff (0.8.2)
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
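The restore-afterwards concern can be made concrete with a small context manager. This is an illustrative sketch, not the script's actual code; the helper name `redirect_log` is hypothetical:

```python
import sys
from contextlib import contextmanager

@contextmanager
def redirect_log(path):
    """Route stdout/stderr into a log file for the duration of a block,
    then restore the original streams even if the block raises.
    (Hypothetical helper, sketched for illustration.)"""
    old_out, old_err = sys.stdout, sys.stderr
    with open(path, "w") as f:
        sys.stdout = sys.stderr = f
        try:
            yield f
        finally:
            sys.stdout, sys.stderr = old_out, old_err
```

Used as `with redirect_log(snakemake.log[0]): ...`, the streams are restored even when the demultiplexing logic fails midway.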
119-165: Consider consolidating file operations into a single context manager.
The approach works, but opening multiple output files in separate dictionary comprehensions can be brittle if an exception arises midway. A single context manager block can mitigate partial resource management issues.

```diff
- R1_output = {sample: open(...) for sample in samples}
- R2_output = {sample: open(...) for sample in samples}
  ...
- for r1_rec, r2_rec, i1_rec, i2_rec in zip(...):
+ with ExitStack() as stack:
+     R1_output = {sample: stack.enter_context(open(...)) for sample in samples}
+     R2_output = {sample: stack.enter_context(open(...)) for sample in samples}
+     # proceed
```

🧰 Tools
🪛 Ruff (0.8.2)
119-119: Undefined name `snakemake` (F821)
harpy/snakefiles/demultiplex_gen1 copy.smk (3)

12-20: File logging approach is consistent, but confirm file existence.
Removing `logger.logfile` on success/error is fine if it is guaranteed to exist. Otherwise, consider checking before removal to avoid exceptions.

21-39: Schema parsing approach is functional, but watch for multiple barcode letters.
As in the Python script, ensure lines with different segment letters do not create unintended merges. The `keep_unknown` logic is helpful for out-of-scope samples.

79-88: Compression step is straightforward.
Using `gzip` is a minimal approach. Consider parallel compression if dealing with large files; otherwise this is acceptable.

harpy/snakefiles/demultiplex_gen1.smk (2)
112-122: Finalize or remove the commented-out merge command.
The `default_target: True` setting indicates a main rule, but the actual merging shell command is commented out, suggesting incomplete functionality.

226-226: Clarify the default target intention.
`#default_target: True` was commented out. If this is meant to be the final rule, consider re-enabling it or removing it completely.

harpy/_conda.py (1)

38-38: Consider pinning `seqkit` to a specific version.
While adding `seqkit` supports the new partitioning features, a version pin (e.g., `bioconda::seqkit=2.3.1`) could help maintain reproducibility.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
harpy/_conda.py (1 hunks)
harpy/reports/align_bxstats.qmd (4 hunks)
harpy/scripts/demultiplex_gen1.py (3 hunks)
harpy/scripts/demultiplex_gen1.py.bak (0 hunks)
harpy/snakefiles/demultiplex_gen1 copy.smk (1 hunks)
harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
harpy/view.py (2 hunks)
💤 Files with no reviewable changes (1)
- harpy/scripts/demultiplex_gen1.py.bak
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
83-95: Undefined name `snakemake` (F821) — reported once per line on lines 83–95
109-109: Use a context manager for opening files (SIM115)
110-110: Use a context manager for opening files (SIM115)
119-119: Undefined name `snakemake` (F821)
harpy/view.py
10-10: `click.echo_via_pager` imported but unused. Remove unused import: `click.echo_via_pager` (F401)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (24)
harpy/view.py (3)
6-10: Imports look consistent and essential.
These imports address gzip handling, terminal color checking (via curses), syntax highlighting, and paging. The static-analysis warning that `echo_via_pager` is unused appears to be a false positive, since it is used at line 129.

🧰 Tools
🪛 Ruff (0.8.2)
10-10: `click.echo_via_pager` imported but unused. Remove unused import: `click.echo_via_pager` (F401)
110-120: Color-based formatter selection is clear and well-structured.
The logic for determining the best formatter based on `n_colors` is straightforward and concise. No further concerns here.

133-133: Blue-colored title for better visual distinction.
Changing the title to `"[bold blue] File viewed"` helps highlight the file's name. This styling choice is consistent.

harpy/scripts/demultiplex_gen1.py (2)
25-43: Check for multi-letter schemas.
If the schema file references multiple distinct leading letters (e.g., "A", "B"), `id_letter = code_letters.pop()` will only capture one letter and discard the others. Ensure this is intentional, or handle all letters if needed.
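One way to surface the multi-letter case instead of silently keeping a single letter via `.pop()` is to group the schema by leading letter. A sketch under the assumption that schema lines are `barcode_id<TAB>sample`; the function name and line format are illustrative, not the script's actual code:

```python
from collections import defaultdict

def group_schema_by_letter(lines):
    """Group barcode-to-sample mappings by the barcode's leading segment
    letter, so a schema mixing e.g. 'A' and 'B' barcodes is visible to the
    caller rather than collapsed to one arbitrary letter."""
    by_letter = defaultdict(dict)
    for line in lines:
        barcode_id, sample = line.split()
        by_letter[barcode_id[0]][barcode_id] = sample
    return dict(by_letter)

groups = group_schema_by_letter(["A01\tsample_1", "A02\tsample_2", "B01\tsample_3"])
# groups now holds separate entries for letters "A" and "B"
```

The caller can then decide whether multiple letters are an error or should each be demultiplexed.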
83-101: Snakemake references look valid.
All references to `snakemake.params` and the `bar_codes` dictionary are consistent. However, this script will only run under Snakemake. If needed, add a guard or note clarifying that requirement.

🧰 Tools
🪛 Ruff (0.8.2)
83-95: Undefined name `snakemake` (F821) — reported once per line on lines 83–95
harpy/snakefiles/demultiplex_gen1 copy.smk (10)
1-2: Container spec is straightforward.
Using a Docker container ensures reproducible environments. No issues noted.

3-4: Imports for OS and logging are appropriate.
These modules support directory manipulation and log handling across the workflow.

6-11: Config parameters read logically, but watch out for missing keys.
The code directly accesses `config[...]` keys. Make sure you handle `KeyError` or provide defaults if these keys are not guaranteed.
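One hedged way to guard direct `config[...]` access is an upfront check at the top of the Snakefile. The key names below are illustrative placeholders, not necessarily the workflow's actual keys:

```python
REQUIRED_KEYS = ("R1", "R2", "I1", "I2", "schema")  # illustrative key names

def validate_config(config):
    """Fail fast with a readable message when required config keys are
    missing, instead of raising a bare KeyError deep inside a rule."""
    missing = [k for k in REQUIRED_KEYS if k not in config]
    if missing:
        raise SystemExit(f"Missing config keys: {', '.join(missing)}")
    # optional keys get explicit defaults
    config.setdefault("keep_unknown", False)
    return config
```

Calling `validate_config(config)` once before any rule definitions keeps the failure mode obvious and localized.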
40-42: Storing parsed samples in a list is clear.
Fulfills the requirement for enumerating sample names from the schema dictionary.

43-52: Rule `barcode_segments` references a script that isn't shown.
Ensure `haplotag_acbd.py` is available and tested. Provide a docstring or usage note describing the expected outputs of this script.

53-78: Rule `demultiplex` effectively ties input segments to the script.
Make sure the `{sample}` placeholders match what `demultiplex_gen1.py` expects. The usage of `temp()` outputs is correct for ephemeral intermediates.

89-119: Rule `assess_quality`: fallback to minimal output on error is helpful.
Generates a pseudo-Falco result if the primary command fails, preventing workflow breaks down the line. This is a practical approach.

120-142: Rule `report_config` dynamically generates the MultiQC config.
Using `yaml.dump` with `default_flow_style=False` and `sort_keys=False` is a good style choice for readability. No issues found.

143-159: Rule `quality_report` properly aggregates QA outputs.
The collected FastQC data and forced override (`--force`) look correct. Redirecting errors to `multiqc.log` ensures consistent debugging.

160-192: Workflow summary is thorough, but confirm the existence of config keys.
The summary references `config['workflow_call']` directly. Ensure it always exists; otherwise, guard with a default or a conditional check.

harpy/snakefiles/demultiplex_gen1.smk (5)
19-21: Wildcard constraints look good.
The patterns `part = "\d+"` and `FR = "[12]"` align well with Snakemake usage.

229-229: Dependency on merged barcode logs is coherent.
Referencing `barcode_logs` in `workflow_summary` aligns well with the newly introduced partitioning and merging steps.

133-155: Validate the barcode log structure.
The merge logic assumes four columns (`barcode`, `total`, `correct`, `corrected`). Consider additional checks or error handling in case of unexpected file formats.

❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Optionally verify the presence of the expected columns in all barcode logs
rg 'Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads' $(fd barcodes.log)
```

Length of output: 78

Manual Verification Needed: Barcode Log Structure Check
The executed command didn't find any logs with the expected header. This likely means the header (`Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads`) is not present in the repository's static content but is created at runtime by the merge rule. Please manually verify that the barcode log files generated at runtime contain the expected four columns. Additionally, consider implementing extra checks or error handling in the merge rule (e.g., validating that each line splits into exactly four fields) to guard against unexpected file formats.
77-85: Check naming consistency for index partitioning.
Ensure that the same split logic used in `partition_reads` also applies cleanly to the index reads, matching the output pattern `reads.I{FR}.part_{part}.fq.gz`.

❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Confirm index partition files follow the naming scheme
fd fastq_chunks -e gz | rg 'reads.I1.part_\d{3}\.fq\.gz|reads.I2.part_\d{3}\.fq\.gz'
```

Length of output: 86

Action Required: Manual Verification Needed for Index Partitioning File Naming
The automated command did not return any files matching the `reads.I{FR}.part_{part}.fq.gz` pattern, so we cannot confirm consistency from the current repository state. Please manually verify (post-pipeline run) that index partition files, if generated, follow this naming convention.

55-76: Confirm the filename pattern matches seqkit output.
By default, `seqkit split2` appends `_part_001` to the original file name. Verify that it correctly produces `reads.R{FR}.part_{part}.fq.gz` as expected by the rule output.

❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Example verification: list the generated partition files and validate their naming
fd fastq_chunks -e gz --exec ls -l {}
```

Length of output: 41

Filename Pattern Mismatch: Verify & Update Expected Pattern
The rule's collect pattern currently expects files named like `reads.R{FR}.part_{part}.fq.gz`, but by default `seqkit split2` appends `_part_001` (e.g., producing `reads.R1_part_001.fq.gz`). This discrepancy suggests the pattern in the Snakefile may not correctly capture the files produced by seqkit. Please manually verify the actual output naming from `seqkit split2` and update the collect pattern if necessary.
harpy/reports/align_bxstats.qmd (4)

35-59: Robust handling of empty inputs.
Returning a zero-filled data frame when `tb` is empty prevents errors and improves clarity for downstream usage.

60-103: Gracefully handle an empty `multiplex_df`.
If `multiplex_df` is empty (e.g., all barcodes have fewer than 2 reads), mean and standard error computations can result in `NaN`. Consider a fallback or guard to avoid warnings.
127-127: Renaming "SEM" to "SE" enhances consistency.
A straightforward terminology tweak that improves clarity.

150-150: Updated column titles align with new fields.
The additional columns for coverage in bp neatly map to the updated return values in `process_input()`.
Actionable comments posted: 0
🧹 Nitpick comments (1)
harpy/view.py (1)

17-38: LGTM! Minor improvement suggested.
The function correctly handles terminal color detection. However, the `stdscr` variable is not used beyond initialization.

Apply this diff to simplify the code:

```diff
- stdscr = curses.initscr()
+ curses.initscr()
```

🧰 Tools
🪛 Ruff (0.8.2)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
📜 Review details
📒 Files selected for processing (3)
harpy/reports/align_bxstats.qmd (4 hunks)
harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
harpy/view.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- harpy/reports/align_bxstats.qmd
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/view.py
10-10: `click.echo_via_pager` imported but unused. Remove unused import: `click.echo_via_pager` (F401)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
🔇 Additional comments (9)
harpy/view.py (2)
121-127: LGTM!
The function efficiently handles both gzipped and regular files, uses generators for memory efficiency, and correctly applies syntax highlighting.

110-130: LGTM!
The changes correctly handle terminal color detection, select the appropriate formatter, and use `click.echo_via_pager` for displaying content.

harpy/snakefiles/demultiplex_gen1.smk (7)

19-21: LGTM!
The wildcard constraints correctly limit `FR` to 1 or 2 and `part` to digits.

43-43: LGTM!
The list comprehension correctly formats part numbers and limits the range to prevent excessive partitioning.

55-75: LGTM!
The rule efficiently partitions reads using `seqkit split2` and creates symbolic links to avoid copying large files.

77-84: LGTM!
The rule efficiently reuses `partition_reads` logic for index files.

88-91: LGTM!
The input paths correctly point to partitioned files, and the `I2` input path has been fixed.

112-120: LGTM!
The rule efficiently merges partitioned files using a simple `cat` command.

132-153: LGTM!
The rule efficiently aggregates barcode statistics while maintaining the original file format.
Actionable comments posted: 1
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
132-154: Enhance error handling and memory efficiency in barcode log merging.
The barcode log merging could benefit from better error handling and memory optimization.

```diff
 run:
+    from collections import defaultdict
-    bc_dict = {}
+    bc_dict = defaultdict(lambda: [0, 0, 0])
     for i in input.bc:
-        with open(i, "r") as bc_log:
+        try:
+            with open(i, "r") as bc_log:
+                # skip first row of column names
+                _ = bc_log.readline()
+                for line in bc_log:
+                    try:
+                        barcode,total,correct,corrected = line.split()
+                        bc_stats = [int(total), int(correct), int(corrected)]
+                        bc_dict[barcode] = list(map(sum, zip(bc_stats, bc_dict[barcode])))
+                    except ValueError as e:
+                        print(f"Warning: Skipping malformed line in {i}: {line.strip()}", file=sys.stderr)
+        except IOError as e:
+            print(f"Error reading {i}: {e}", file=sys.stderr)
+            raise
-            # skip first row of column names
-            _ = bc_log.readline()
-            for line in bc_log:
-                barcode,total,correct,corrected = line.split()
-                bc_stats = [int(total), int(correct), int(corrected)]
-                if barcode not in bc_dict:
-                    bc_dict[barcode] = bc_stats
-                else:
-                    bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
```
📜 Review details
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)
19-21: LGTM! Well-defined wildcard constraints.
The constraints ensure that `FR` only accepts 1 or 2 (for read pairs) and `part` only accepts digits, which is essential for the partitioning logic.

43-43: Consider using consistent naming for partition limits.
The zero-padding to 3 digits (`{i:03d}`) implies a maximum of 999 parts, which matches the limit in `min(workflow.cores, 999)`. However, this magic number should be defined as a constant for better maintainability.

```diff
+MAX_PARTITIONS = 999
-fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
+fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, MAX_PARTITIONS) + 1)]
```

86-111: LGTM! Input paths are correctly defined.
The rule now correctly uses partitioned inputs and includes the schema parameter. The I2 input path issue from the past review has been addressed.

224-256: LGTM! Workflow summary correctly includes barcode logs.
The workflow summary has been appropriately updated to include the barcode logs in its input dependencies.
Actionable comments posted: 0
♻️ Duplicate comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
55-77: 🛠️ Refactor suggestion — Address platform compatibility and error handling.
- Using symlinks may not work on Windows systems.
- Missing error handling for the seqkit command.
- Missing directory creation for the output path.

Apply this diff to improve robustness:

```diff
 shell:
     """
+    mkdir -p {params.outdir}
-    ln -sr {input.r1} {output.r1}
-    ln -sr {input.r2} {output.r2}
+    cp {input.r1} {output.r1}
+    cp {input.r2} {output.r2}
-    seqkit split2 --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}
+    if ! seqkit split2 --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}; then
+        echo "Error: seqkit split2 failed" >&2
+        exit 1
+    fi
     """
```
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
140-162: Enhance merge_barcode_logs robustness.
The current implementation might have memory issues with a large number of files and lacks error handling.

Consider this improved implementation:

```diff
 run:
+    from contextlib import ExitStack
     bc_dict = {}
-    for i in input.bc:
-        with open(i, "r") as bc_log:
+    try:
+        with ExitStack() as stack:
+            for i in input.bc:
+                bc_log = stack.enter_context(open(i, "r"))
                 # skip first row of column names
                 _ = bc_log.readline()
                 for line in bc_log:
                     barcode,total,correct,corrected = line.split()
                     bc_stats = [int(total), int(correct), int(corrected)]
                     if barcode not in bc_dict:
                         bc_dict[barcode] = bc_stats
                     else:
                         bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
-    with open(output.log, "w") as f:
+            with open(output.log, "w") as f:
                 f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
                 for k,v in bc_dict.items():
                     f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
+    except Exception as e:
+        logger.error(f"Error processing barcode logs: {str(e)}")
+        raise
```
📜 Review details
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)
19-21: LGTM! Well-defined wildcard constraints.
The constraints appropriately restrict FR to [1,2] and part to digits.

43-43: LGTM! Well-implemented chunk numbering.
The implementation correctly synchronizes chunk numbering with the partition limit and uses zero-padded formatting.

90-115: LGTM! Correctly configured demultiplex rule.
The input paths are properly configured, and temporary files are correctly marked.

232-264: LGTM! Workflow summary correctly updated.
The workflow summary now properly includes the barcode logs in its dependencies.
Actionable comments posted: 2
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
140-162: Improve robustness of barcode log merging.
The barcode log merging could be enhanced for better error handling and memory efficiency.

Consider these improvements:

```diff
 run:
+    from collections import defaultdict
-    bc_dict = {}
+    bc_dict = defaultdict(lambda: [0, 0, 0])
     for i in input.bc:
-        with open(i, "r") as bc_log:
+        try:
+            with open(i, "r") as bc_log:
+                # validate header
+                header = bc_log.readline().strip().split()
+                if header != ["Barcode", "Total_Reads", "Correct_Reads", "Corrected_Reads"]:
+                    raise ValueError(f"Invalid header in {i}")
-            # skip first row of column names
-            _ = bc_log.readline()
                 for line in bc_log:
                     barcode,total,correct,corrected = line.split()
                     bc_stats = [int(total), int(correct), int(corrected)]
-                    if barcode not in bc_dict:
-                        bc_dict[barcode] = bc_stats
-                    else:
-                        bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
+                    bc_dict[barcode] = [x + y for x, y in zip(bc_stats, bc_dict[barcode])]
+        except Exception as e:
+            raise RuntimeError(f"Error processing {i}: {str(e)}")
```
📜 Review details
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (3)
harpy/snakefiles/demultiplex_gen1.smk (3)
19-21: LGTM! Well-defined wildcard constraints.
The constraints appropriately restrict FR to read pairs and part to numeric values.

232-264: LGTM! Workflow summary properly integrates barcode logs.
The changes correctly incorporate the barcode logs into the workflow summary.
43-43: Document and verify the partition limit assumptions.
The implementation assumes:
- A maximum of 999 partitions (implied by 3-digit zero-padding)
- The partition format matches seqkit's output format

Please verify:
- That seqkit uses the same zero-padding format
- Whether 999 is a reasonable limit for the target use case

❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Description: Check seqkit's output format
# Create a small test file
echo -e "@test\nACGT\n+\n!!!!" > test.fq
# Run seqkit with 2 parts and check the naming
seqkit split2 -1 test.fq -2 test.fq -p 2 -O . 2>&1 | grep "part"
```

Length of output: 92

Manual verification needed: Confirm seqkit naming and partition limit assumptions
The initial test did not produce output, so we couldn't verify whether seqkit uses three-digit zero-padding or whether the 999 limit is appropriate. Please manually verify that:
- seqkit's output filename format for partitions indeed uses three-digit, zero-padded values (to match the `{i:03d}` format).
- A cap of 999 partitions is both reflected by seqkit and reasonable for the intended use case.
Actionable comments posted: 0
♻️ Duplicate comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
55-77: ⚠️ Potential issue — Enhance platform compatibility and error handling in the partition_reads rule.
The rule has potential issues with symlinks and error handling.

Apply these changes to improve robustness:

```diff
 shell:
     """
-    ln -sr {input.r1} {output.r1}
-    ln -sr {input.r2} {output.r2}
+    cp {input.r1} {output.r1}
+    cp {input.r2} {output.r2}
+    trap 'rm -f {output.r1} {output.r2}' ERR
     seqkit split2 -f --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}
     """
```
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
43-43: Optimize the fastq_parts list generation.
Consider using a list comprehension with f-strings for better performance:

```diff
-fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
+fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
```
📜 Review details
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)
93-117: Input validation needed for the schema file.
The rule now includes the schema as input but lacks validation.

Run this script to check schema file validation in the Python script:

```shell
#!/bin/bash
# Check schema file validation in demultiplex_gen1.py
echo "Searching for schema file validation:"
rg "schema.*=.*" "scripts/demultiplex_gen1.py"
echo "Searching for file existence checks:"
rg "os\.path\.exists" "scripts/demultiplex_gen1.py"
```

119-130: LGTM! Efficient implementation of merge_partitions.
The rule uses `cat` for efficient concatenation of files.
143-164: LGTM! Well-structured barcode log merging.
The implementation:
- Uses efficient data structures
- Handles file I/O properly
- Maintains a consistent output format

235-267: LGTM! Comprehensive workflow summary.
The rule properly includes the new barcode logs and maintains good documentation practices.
This PR attempts to use the divide-and-conquer demultiplexing strategy proposed by @bpenaud, made idiomatic for Snakemake.
Additionally:
- `align_bxstats.qmd` updated to handle single-sample runs and column names
- `harpy view` reworked to avoid `subprocess` calls and circumvent the stdout buffer warning that was printed to the terminal.

Summary by CodeRabbit
New Features
Bug Fixes
Chores