
divide-and-conquer demultiplexing #202

Merged
merged 19 commits into main from demux_attempt3 on Feb 12, 2025
Conversation

pdimens (Owner)

@pdimens pdimens commented Feb 12, 2025

This PR implements the divide-and-conquer demultiplexing strategy proposed by @bpenaud, made idiomatic for Snakemake (a sketch of the general scatter/gather pattern follows the list below).

Additionally:

  • fixes for align_bxstats.qmd to handle single-sample runs and to correct column names
  • improvements to harpy view that avoid subprocess calls and suppress the stdout buffer warning previously printed to the terminal
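
As a point of reference, here is a minimal sketch of the scatter/gather shape this PR implements. All rule bodies, paths, and the fixed chunk count are illustrative placeholders rather than the actual workflow; the real rules in harpy/snakefiles/demultiplex_gen1.smk are partition_reads, partition_index, demultiplex, merge_partitions, and merge_barcode_logs, with partitioning driven by seqkit split2.

# Minimal scatter/gather sketch -- names, paths, and chunk count are illustrative
PARTS = [f"{i:03d}" for i in range(1, 5)]

rule partition_reads:
    input:
        r1 = "reads.R1.fq.gz",
        r2 = "reads.R2.fq.gz"
    output:
        expand("chunks/reads.R{FR}.part_{part}.fq.gz", FR=[1, 2], part=PARTS)
    threads: 4
    shell:
        "seqkit split2 --quiet -1 {input.r1} -2 {input.r2} -p 4 -j {threads} -O chunks -e .gz"

rule demultiplex:
    input:
        r1 = "chunks/reads.R1.part_{part}.fq.gz",
        r2 = "chunks/reads.R2.part_{part}.fq.gz"
    output:
        temp("demux/{sample}.R1.part_{part}.fq.gz"),
        temp("demux/{sample}.R2.part_{part}.fq.gz")
    script:
        "demultiplex_gen1.py"

rule merge_partitions:
    input:
        expand("demux/{{sample}}.R{{FR}}.part_{part}.fq.gz", part=PARTS)
    output:
        "demux/{sample}.R{FR}.fq.gz"
    shell:
        "cat {input} > {output}"

Each chunk is demultiplexed independently (one job per chunk), and because the partitioned outputs are gzip members, plain cat is a valid way to reassemble them into one file per sample.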

Summary by CodeRabbit

  • New Features

    • Enhanced reporting now handles empty inputs gracefully, featuring updated metrics and an additional average molecule coverage column.
    • Integrated schema-based sample mapping for improved organization and detailed barcode logging.
    • Upgraded the workflow to partition read data and merge outputs, boosting processing efficiency.
    • Improved terminal display with robust color support and enhanced file highlighting.
  • Bug Fixes

    • Resolved issues causing errors when processing empty inputs in reports.
  • Chores

    • Updated dependency management to include an additional tool for enhanced functionality.


coderabbitai bot commented Feb 12, 2025

📝 Walkthrough

This pull request introduces several enhancements across various components. In the conda recipe, a new dependency (bioconda::seqkit) is added. The R report now checks for empty data and updates output column names and structure. The demultiplexing script adds a new function for schema reading, refines barcode identification with enhanced logging, and accepts an additional parameter. Several changes are applied to the Snakemake workflow to partition reads using seqkit, update rule inputs/outputs, and merge results. Finally, the view module now uses a new approach for file reading and terminal color detection, replacing subprocess calls with gzip and pygments.

Changes

File(s) and change summary:

  • harpy/_conda.py: Added dependency bioconda::seqkit to the demultiplex environment in the create_conda_recipes function.
  • harpy/reports/align_bxstats.qmd: Modified process_input to check for empty data frames; returns a zero-filled data frame when empty. Updated output column name from "SEM" to "SE" and added a new column for "avg molecule coverage (bp)".
  • harpy/scripts/demultiplex_gen1.py: Introduced a new function read_schema and updated get_read_codes for handling both segments. Enhanced logging by opening a log file at the start, added a new part parameter, and organized outputs by sample names.
  • harpy/scripts/demultiplex_gen1.py.bak: Deleted backup file containing the previous demultiplexing implementation.
  • harpy/snakefiles/demultiplex_gen1.smk: Introduced new wildcard constraints and added rules: partition_reads, partition_index (using seqkit for read splitting), merge_partitions, and merge_barcode_logs. Updated the demultiplex, assess_quality, and workflow_summary rules to reflect partitioned read inputs, updated log naming, and consolidated outputs.
  • harpy/view.py: Added new functions check_terminal_colors() (using curses) and _read_file(x) (using gzip). Replaced subprocess-based file reading with a generator function and integrated pygments for syntax highlighting, with output piped through click.echo_via_pager. Adjusted the display of the file content with modified title colors.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SNK as Snakemake Workflow
    participant PR as Partitioning Rules
    participant DM as Demultiplex Rule
    participant MP as Merge Rules
    participant AQ as Assess Quality
    participant WS as Workflow Summary

    User->>SNK: Trigger workflow
    SNK->>PR: Execute partition_reads & partition_index (using seqkit)
    PR->>DM: Supply partitioned read files and schema
    DM->>MP: Process demultiplexing and output partitioned files
    MP->>AQ: Merge partition outputs and barcode logs
    AQ->>WS: Generate workflow summary with consolidated logs
    WS->>User: Return final summary

Suggested labels

enhancement



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (9)
harpy/view.py (1)

17-39: Consider removing the unused stdscr variable or referencing it meaningfully.
Currently, stdscr = curses.initscr() is never used, which triggers a linter warning. If no usage is intended, removing the assignment can improve clarity.

-    stdscr = curses.initscr()
+    curses.initscr()
🧰 Tools: 🪛 Ruff (0.8.2): 19-19: Local variable stdscr is assigned to but never used; remove the assignment (F841)

harpy/scripts/demultiplex_gen1.py (2)

2-4: Use context managers for file operations.
Opening the log file with a plain open() can be replaced with a context manager for clearer resource handling. Also consider restoring sys.stdout and sys.stderr afterwards to avoid side effects.

- f = open(snakemake.log[0], "w")
- sys.stderr = sys.stdout = f
+ with open(snakemake.log[0], "w") as f:
+     sys.stderr = sys.stdout = f
+     # rest of logic
# Possibly restore sys.stderr/stdout here or rely on context manager scope
🧰 Tools: 🪛 Ruff (0.8.2): 2-2: Use a context manager for opening files (SIM115); 2-2: Undefined name snakemake (F821)


119-165: Consider consolidating file operations into a single context manager.
The approach works, but opening multiple output files in separate dictionary comprehensions can be brittle if an exception arises midway. A single context manager block can mitigate partial resource management issues.

- R1_output = {sample: open(...) for sample in samples}
- R2_output = {sample: open(...) for sample in samples}
...
- for r1_rec, r2_rec, i1_rec, i2_rec in zip(...):
+ with ExitStack() as stack:
+     R1_output = {sample: stack.enter_context(open(...)) for sample in samples}
+     R2_output = {sample: stack.enter_context(open(...)) for sample in samples}
+     # proceed
🧰 Tools: 🪛 Ruff (0.8.2): 119-119: Undefined name snakemake (F821)

harpy/snakefiles/demultiplex_gen1 copy.smk (3)

12-20: File logging approach is consistent, but confirm file existence.
Removing logger.logfile on success/error is fine if guaranteed to exist. Otherwise, consider checking before removal to avoid exceptions.
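
A guarded removal, as a sketch (the logfile path here stands in for logger.logfile):

import os

logfile = "workflow.log"  # stand-in for logger.logfile
# Only remove the log if it exists, avoiding FileNotFoundError on early failures.
if os.path.exists(logfile):
    os.remove(logfile)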


21-39: Schema parsing approach is functional, but watch for multiple barcode letters.
Similar to the Python script, ensure lines with different segment letters do not create unintended merges. The keep_unknown logic is helpful for out-of-scope samples.


79-88: Compression step is straightforward.
Using gzip is a minimal approach. Consider parallel compression if dealing with large files, but otherwise this is acceptable.

harpy/snakefiles/demultiplex_gen1.smk (2)

112-122: Finalize or remove the commented-out merge command.

The default_target: True setting indicates a main rule, but the actual merging shell command is commented out, suggesting incomplete functionality.


226-226: Clarify the default target intention.

#default_target: True was commented out. If this is meant to be the final rule, consider re-enabling or removing it completely.

harpy/_conda.py (1)

38-38: Consider pinning seqkit to a specific version.

While adding seqkit supports new partitioning features, a version pin (e.g., bioconda::seqkit=2.3.1) could help maintain reproducibility.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6a3d5e and 9a0c478.

📒 Files selected for processing (7)
  • harpy/_conda.py (1 hunks)
  • harpy/reports/align_bxstats.qmd (4 hunks)
  • harpy/scripts/demultiplex_gen1.py (3 hunks)
  • harpy/scripts/demultiplex_gen1.py.bak (0 hunks)
  • harpy/snakefiles/demultiplex_gen1 copy.smk (1 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
  • harpy/view.py (2 hunks)
💤 Files with no reviewable changes (1)
  • harpy/scripts/demultiplex_gen1.py.bak
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py
  • 2-2: Use a context manager for opening files (SIM115)
  • 2-2, 83-95 (one finding per line), 119-119: Undefined name snakemake (F821)
  • 109-109, 110-110: Use a context manager for opening files (SIM115)
harpy/view.py
  • 10-10: click.echo_via_pager imported but unused; remove the unused import (F401)
  • 19-19: Local variable stdscr is assigned to but never used; remove the assignment (F841)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (24)
harpy/view.py (3)

6-10: Imports look consistent and essential.
These imports address gzip handling, terminal color checking (via curses), syntax highlighting, and paging. The static analysis warning that echo_via_pager is unused appears to be a false positive since it is used at line 129.

🧰 Tools: 🪛 Ruff (0.8.2): 10-10: click.echo_via_pager imported but unused (F401)


110-120: Color-based formatter selection is clear and well-structured.
The logic for determining the best formatter based on n_colors is straightforward and concise. No further concerns here.


133-133: Blue-colored title for better visual distinction.
Changing the title to "[bold blue] File viewed" helps highlight the file’s name. This styling choice is consistent.

harpy/scripts/demultiplex_gen1.py (2)

25-43: Check for multi-letter schemas.
If the schema file references multiple distinct leading letters (e.g., “A”, “B”), id_letter = code_letters.pop() will only capture one letter and discard others. Ensure this is intentional or handle all letters if needed.
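
To illustrate the pitfall with a self-contained sketch (the codes below are hypothetical, not taken from a real schema):

# With mixed segment letters, set.pop() silently keeps an arbitrary one.
codes = ["A01", "A02", "B01"]
code_letters = {code[0] for code in codes}  # {"A", "B"}

# A defensive alternative: fail loudly instead of guessing.
if len(code_letters) > 1:
    raise ValueError(f"schema mixes segment letters: {sorted(code_letters)}")
id_letter = code_letters.pop()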


83-101: Snakemake references look valid.
All references to snakemake.params and the bar_codes dictionary are consistent. However, this script will only run under Snakemake. If needed, add a guard or note clarifying that requirement.
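
One possible guard, assuming the intent is to fail fast outside Snakemake (the snakemake object is injected only when the file runs via a script: directive):

import sys

try:
    snakemake  # defined only when Snakemake executes this file via script:
except NameError:
    sys.exit("This script must be run through Snakemake's script: directive.")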

🧰 Tools: 🪛 Ruff (0.8.2): 83-95: Undefined name snakemake (F821, one finding per line)

harpy/snakefiles/demultiplex_gen1 copy.smk (10)

1-2: Container spec is straightforward.
Using a Docker container ensures reproducible environments. No issues noted.


3-4: Imports for OS and logging are appropriate.
These modules support directory manipulation and log handling across the workflow.


6-11: Config parameters read logically, but watch out for missing keys.
The code directly accesses config[...] keys. Make sure you handle KeyError or provide defaults if these keys are not guaranteed.
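
Inside a Snakefile, where config is the global configuration dict, a defaulted lookup might look like this (key name illustrative):

# Prefer .get() with a default over bare indexing for optional keys,
# so a missing key degrades gracefully instead of raising KeyError.
keep_unknown = config.get("keep_unknown", False)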


40-42: Storing parsed samples in a list is clear.
Fulfills the requirement for enumerating sample names from the schema dictionary.


43-52: Rule barcode_segments references a script that isn’t shown.
Ensure haplotag_acbd.py is available and tested. Provide a docstring or usage note describing expected outputs of this script.


53-78: Rule demultiplex effectively ties input segments to the script.
Make sure the placeholders {sample} match what demultiplex_gen1.py expects. The usage of temp() outputs is correct for ephemeral intermediates.


89-119: Rule assess_quality: fallback to minimal output on error is helpful.
Generates a pseudo-Falco result if the primary command fails, preventing workflow breaks down the line. This is a practical approach.


120-142: Rule report_config dynamically generates MultiQC config.
Using yaml.dump with default_flow_style=False and sort_keys=False is a good style choice for readability. No issues found.


143-159: Rule quality_report properly aggregates QA outputs.
The collected fastqc data and forced override (--force) look correct. The redirect of errors to multiqc.log ensures consistent debugging.


160-192: Workflow summary is thorough but confirm existence of config keys.
The summary references config['workflow_call'] directly. Ensure it always exists. Otherwise, guard with a default or a conditional check.

harpy/snakefiles/demultiplex_gen1.smk (5)

19-21: Wildcard constraints look good.

The pattern for part = "\d+" and FR = "[12]" aligns well with Snakemake usage.
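
Presumably declared along these lines near the top of the Snakefile:

wildcard_constraints:
    part = r"\d+",
    FR = r"[12]"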


229-229: Dependency on merged barcode logs is coherent.

Referencing barcode_logs in workflow_summary aligns well with the newly introduced partitioning and merging steps.


133-155: Validate the barcode log structure.

Your merge logic assumes four columns (barcode, total, correct, corrected). Consider additional checks or error handling in case of unexpected file formats.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Optionally verify the presence of the expected columns in all barcode logs
rg 'Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads' $(fd barcodes.log)

Length of output: 78


Manual Verification Needed: Barcode Log Structure Check

The executed command didn't find any logs with the expected header. This likely means that the header ("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads") is not present in the repository's static content but created at runtime by the merge rule. Please manually verify that the barcode log files generated at runtime contain the expected four columns. Additionally, consider implementing extra checks or error handling in the merge rule (e.g., validating that each line splits into exactly four fields) to guard against unexpected file formats.
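
A minimal field-count check of the kind suggested, assuming the four-column tab-separated layout (function name illustrative):

def parse_barcode_line(line: str) -> tuple[str, int, int, int]:
    """Parse one barcode-log line, validating the expected four-column layout."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line.strip()!r}")
    barcode, total, correct, corrected = fields
    return barcode, int(total), int(correct), int(corrected)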


77-85: Check naming consistency for index partitioning.

Ensure that the same split logic used in partition_reads also applies cleanly to the index reads, matching the output pattern reads.I{FR}.part_{part}.fq.gz.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Confirm index partition files follow the naming scheme
fd fastq_chunks -e gz | rg 'reads.I1.part_\d{3}\.fq\.gz|reads.I2.part_\d{3}\.fq\.gz'

Length of output: 86


Action Required: Manual Verification Needed for Index Partitioning File Naming

The automated command did not return any files matching the reads.I{FR}.part_{part}.fq.gz pattern, so we cannot confirm consistency from the current repository state. Please manually verify (post-pipeline run) that index partition files, if generated, follow this naming convention.


55-76: Confirm the filename pattern matches seqkit output.

By default, seqkit split2 appends _part_001 to the original file name. Verify that it correctly produces reads.R{FR}.part_{part}.fq.gz as expected by the rule output.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Example verification: list the generated partition files and validate their naming
fd fastq_chunks -e gz --exec ls -l {}

Length of output: 41


Filename Pattern Mismatch: Verify & Update Expected Pattern

The rule’s collect pattern currently expects files named like
reads.R{FR}.part_{part}.fq.gz,
but by default, seqkit split2 appends _part_001 (e.g. producing reads.R1_part_001.fq.gz). This discrepancy suggests that the pattern in the Snakefile may not correctly capture the files produced by seqkit.

Please manually verify the actual output naming from seqkit split2 and update the collect pattern if necessary.

harpy/reports/align_bxstats.qmd (4)

35-59: Robust handling of empty inputs.

Returning a zero-filled data frame when tb is empty prevents errors and improves clarity for downstream usage.


60-103: Gracefully handle an empty multiplex_df.

If multiplex_df is empty (e.g., all barcodes have fewer than 2 reads), mean and standard error computations can result in NaN. Consider a fallback or guard to avoid warnings.


127-127: Renaming "SEM" to "SE" enhances consistency.

A straightforward terminology tweak that improves clarity.


150-150: Updated column titles align with new fields.

The additional columns for coverage in bp neatly map to the updated return values in process_input().

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
harpy/view.py (1)

17-38: LGTM! Minor improvement suggested.

The function correctly handles terminal color detection. However, the stdscr variable is not used beyond initialization.

Apply this diff to simplify the code:

-    stdscr = curses.initscr()
+    curses.initscr()
🧰 Tools: 🪛 Ruff (0.8.2): 19-19: Local variable stdscr is assigned to but never used (F841)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a0c478 and ddb7ec3.

📒 Files selected for processing (3)
  • harpy/reports/align_bxstats.qmd (4 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
  • harpy/view.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/reports/align_bxstats.qmd
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/view.py
  • 10-10: click.echo_via_pager imported but unused; remove the unused import (F401)
  • 19-19: Local variable stdscr is assigned to but never used; remove the assignment (F841)

🔇 Additional comments (9)
harpy/view.py (2)

121-127: LGTM!

The function efficiently handles both gzipped and regular files, uses generators for memory efficiency, and correctly applies syntax highlighting.
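
The pattern described can be sketched as follows (a generic gzip-or-plain reader, not harpy's actual _read_file):

import gzip

def read_file_lines(path: str):
    """Yield decoded lines from a gzipped or plain-text file, lazily."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle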


110-130: LGTM!

The changes correctly handle terminal color detection, select the appropriate formatter, and use click.echo_via_pager for displaying content.
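
A condensed sketch of that flow, assuming n_colors comes from the curses-based detection (harpy's exact function signatures may differ):

import click
from pygments import highlight
from pygments.lexers import guess_lexer
from pygments.formatters import TerminalFormatter, Terminal256Formatter

def page_highlighted(text: str, n_colors: int) -> None:
    """Highlight text for the detected color depth and pipe it through a pager."""
    formatter = Terminal256Formatter() if n_colors >= 256 else TerminalFormatter()
    click.echo_via_pager(highlight(text, guess_lexer(text), formatter))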

harpy/snakefiles/demultiplex_gen1.smk (7)

19-21: LGTM!

The wildcard constraints correctly limit FR to 1 or 2 and part to digits.


43-43: LGTM!

The list comprehension correctly formats part numbers and limits the range to prevent excessive partitioning.


55-75: LGTM!

The rule efficiently partitions reads using seqkit split2 and creates symbolic links to avoid copying large files.


77-84: LGTM!

The rule efficiently reuses partition_reads logic for index files.


88-91: LGTM!

The input paths correctly point to partitioned files, and the I2 input path has been fixed.


112-120: LGTM!

The rule efficiently merges partitioned files using a simple cat command.


132-153: LGTM!

The rule efficiently aggregates barcode statistics while maintaining the original file format.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)

132-154: Enhance error handling and memory efficiency in barcode log merging.

The barcode log merging could benefit from better error handling and memory optimization.

 run:
+    from collections import defaultdict
-    bc_dict = {}
+    bc_dict = defaultdict(lambda: [0, 0, 0])
     for i in input.bc:
-        with open(i, "r") as bc_log:
+        try:
+            with open(i, "r") as bc_log:
+                # skip first row of column names
+                _ = bc_log.readline()
+                for line in bc_log:
+                    try:
+                        barcode,total,correct,corrected = line.split()
+                        bc_stats = [int(total), int(correct), int(corrected)]
+                        bc_dict[barcode] = list(map(sum, zip(bc_stats, bc_dict[barcode])))
+                    except ValueError as e:
+                        print(f"Warning: Skipping malformed line in {i}: {line.strip()}", file=sys.stderr)
+        except IOError as e:
+            print(f"Error reading {i}: {e}", file=sys.stderr)
+            raise
-            # skip first row of column names
-            _ = bc_log.readline()
-            for line in bc_log:
-                barcode,total,correct,corrected = line.split()
-                bc_stats = [int(total), int(correct), int(corrected)]
-                if barcode not in bc_dict:
-                    bc_dict[barcode] = bc_stats
-                else:
-                    bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ddb7ec3 and d5833ec.

📒 Files selected for processing (1)
  • harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)

19-21: LGTM! Well-defined wildcard constraints.

The constraints ensure that FR only accepts 1 or 2 (for read pairs) and part only accepts digits, which is essential for the partitioning logic.


43-43: Consider using consistent naming for partition limits.

The zero-padding to 3 digits ({i:03d}) implies a maximum of 999 parts, which matches the limit in min(workflow.cores, 999). However, this magic number should be defined as a constant for better maintainability.

+MAX_PARTITIONS = 999
-fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
+fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, MAX_PARTITIONS) + 1)]

86-111: LGTM! Input paths are correctly defined.

The rule now correctly uses partitioned inputs and includes the schema parameter. The I2 input path issue from the past review has been addressed.


224-256: LGTM! Workflow summary correctly includes barcode logs.

The workflow summary has been appropriately updated to include the barcode logs in its input dependencies.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)

55-77: 🛠️ Refactor suggestion

Address platform compatibility and error handling.

  1. Using symlinks may not work on Windows systems.
  2. Missing error handling for seqkit command.
  3. Missing directory creation for output path.

Apply this diff to improve robustness:

 shell:
     """
+    mkdir -p {params.outdir}
-    ln -sr {input.r1} {output.r1}
-    ln -sr {input.r2} {output.r2}
+    cp {input.r1} {output.r1}
+    cp {input.r2} {output.r2}
-    seqkit split2 --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2 > {log}
+    if ! seqkit split2 --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}; then
+        echo "Error: seqkit split2 failed" >&2
+        exit 1
+    fi
     """
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)

140-162: Enhance merge_barcode_logs robustness.

The current implementation might have memory issues with large number of files and lacks error handling.

Consider this improved implementation:

 run:
+    from contextlib import ExitStack
     bc_dict = {}
-    for i in input.bc:
-        with open(i, "r") as bc_log:
+    try:
+        with ExitStack() as stack:
+            for i in input.bc:
+                bc_log = stack.enter_context(open(i, "r"))
                 # skip first row of column names
                 _ = bc_log.readline()
                 for line in bc_log:
                     barcode,total,correct,corrected = line.split()
                     bc_stats = [int(total), int(correct), int(corrected)]
                     if barcode not in bc_dict:
                         bc_dict[barcode] = bc_stats
                     else:
                         bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
-        with open(output.log, "w") as f:
+        with open(output.log, "w") as f:
             f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
             for k,v in bc_dict.items():
                 f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
+    except Exception as e:
+        logger.error(f"Error processing barcode logs: {str(e)}")
+        raise
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5833ec and f50cb40.

📒 Files selected for processing (1)
  • harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)

19-21: LGTM! Well-defined wildcard constraints.

The constraints appropriately restrict FR to [1,2] and part to digits.


43-43: LGTM! Well-implemented chunk numbering.

The implementation correctly synchronizes chunk numbering with the partition limit and uses zero-padded formatting.


90-115: LGTM! Correctly configured demultiplex rule.

The input paths are properly configured, and temporary files are correctly marked.


232-264: LGTM! Workflow summary correctly updated.

The workflow summary now properly includes the barcode logs in its dependencies.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)

140-162: Improve robustness of barcode log merging.

The barcode log merging could be enhanced for better error handling and memory efficiency.

Consider these improvements:

 run:
+    from collections import defaultdict
-    bc_dict = {}
+    bc_dict = defaultdict(lambda: [0, 0, 0])
     for i in input.bc:
-        with open(i, "r") as bc_log:
+        try:
+            with open(i, "r") as bc_log:
+                # validate header
+                header = bc_log.readline().strip().split()
+                if header != ["Barcode", "Total_Reads", "Correct_Reads", "Corrected_Reads"]:
+                    raise ValueError(f"Invalid header in {i}")
-                # skip first row of column names
-                _ = bc_log.readline()
                 for line in bc_log:
                     barcode,total,correct,corrected = line.split()
                     bc_stats = [int(total), int(correct), int(corrected)]
-                    if barcode not in bc_dict:
-                        bc_dict[barcode] = bc_stats
-                    else:
-                        bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
+                    bc_dict[barcode] = [x + y for x, y in zip(bc_stats, bc_dict[barcode])]
+        except Exception as e:
+            raise RuntimeError(f"Error processing {i}: {str(e)}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f50cb40 and 4290901.

📒 Files selected for processing (1)
  • harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (3)
harpy/snakefiles/demultiplex_gen1.smk (3)

19-21: LGTM! Well-defined wildcard constraints.

The constraints appropriately restrict FR to read pairs and part to numeric values.


232-264: LGTM! Workflow summary properly integrates barcode logs.

The changes correctly incorporate the barcode logs into the workflow summary.


43-43: Document and verify the partition limit assumptions.

The implementation assumes:

  1. A maximum of 999 partitions (implied by 3-digit zero-padding)
  2. The partition format matches seqkit's output format

Please verify:

  1. That seqkit uses the same zero-padding format
  2. Whether 999 is a reasonable limit for the target use case
❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Description: Check seqkit's output format
# Create a small test file
echo -e "@test\nACGT\n+\n!!!!" > test.fq
# Run seqkit with 2 parts and check the naming
seqkit split2 -1 test.fq -2 test.fq -p 2 -O . 2>&1 | grep "part"

Length of output: 92


Manual verification needed: Confirm seqkit naming and partition limit assumptions

The initial test did not produce output, so we couldn’t verify if seqkit uses three-digit zero-padding or if the 999 limit is appropriate. Please manually verify that:

  • Seqkit’s output filename format for partitions indeed uses three-digit, zero-padded values (to match the {i:03d} format).
  • A cap of 999 partitions is both reflected by seqkit and reasonable for our intended use case.

Repository owner deleted a comment from coderabbitai bot Feb 12, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)

55-77: ⚠️ Potential issue

Enhance platform compatibility and error handling in partition_reads rule.

The rule has potential issues with symlinks and error handling.

Apply these changes to improve robustness:

 shell:
     """
-    ln -sr {input.r1} {output.r1}
-    ln -sr {input.r2} {output.r2}
+    cp {input.r1} {output.r1}
+    cp {input.r2} {output.r2}
+    trap 'rm -f {output.r1} {output.r2}' ERR
     seqkit split2 -f --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}
     """
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)

43-43: Name the partition cap used in the fastq_parts generation.

The generation already uses a list comprehension with f-strings; the remaining nitpick is the magic number 999, which could be lifted into a named constant (as suggested earlier with MAX_PARTITIONS).
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4290901 and 461bc43.

📒 Files selected for processing (1)
  • harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)

93-117: Input validation needed for schema file.

The rule now includes schema as input but lacks validation.

Run this script to check schema file validation in the Python script:

#!/bin/bash
# Check schema file validation in demultiplex_gen1.py
echo "Searching for schema file validation:"
rg "schema.*=.*" "scripts/demultiplex_gen1.py"

echo "Searching for file existence checks:"
rg "os\.path\.exists" "scripts/demultiplex_gen1.py"

119-130: LGTM! Efficient implementation of merge_partitions.

The rule uses cat for efficient concatenation of files.


143-164: LGTM! Well-structured barcode log merging.

The implementation:

  • Uses efficient data structures
  • Handles file I/O properly
  • Maintains consistent output format

235-267: LGTM! Comprehensive workflow summary.

The rule properly includes the new barcode logs and maintains good documentation practices.

@pdimens pdimens merged commit 94b395e into main Feb 12, 2025
20 checks passed
@pdimens pdimens deleted the demux_attempt3 branch February 12, 2025 19:07
@coderabbitai coderabbitai bot mentioned this pull request Feb 13, 2025
@coderabbitai coderabbitai bot mentioned this pull request Feb 20, 2025
Development

Successfully merging this pull request may close these issues.

speedup demultiplex barcode by parallelisation
1 participant