fix: use external sort during `clinvar-genes import` #640

tedil · 2025-01-29T18:27:56Z

Supersedes #638
On my machine, this takes about 15 minutes, ~6 GB of memory, and 1.2 GB of temporary disk space (previously ~1 minute and ~42 GB of memory, with no temporary disk space). It could probably be faster, but it should be fine for our use cases (i.e., reducing the memory footprint so that this can keep running on GitHub CI).
It also currently keeps file handles/writers open for all ~20k files that are produced, which might hit "too many open files" limits. Alternatives: either always re-create the writers (at a runtime cost) or use a limited-capacity HashMap (LRU cache, etc.) to bound the number of open files.
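
The second alternative (a bounded cache of writers) could look roughly like the sketch below. It is not the code from this PR: `WriterCache`, the `.jsonl` suffix, and the eviction bookkeeping are assumptions made for illustration, and it uses only the standard library rather than the `lru` crate. The key point is that files are opened in append mode, so an evicted writer can be reopened later without losing earlier records.

```rust
use std::collections::{HashMap, VecDeque};
use std::fs::{File, OpenOptions};
use std::io::{self, BufWriter, Write};
use std::path::PathBuf;

/// Hypothetical bounded cache of append-mode writers, keyed by bucket name.
pub struct WriterCache {
    dir: PathBuf,
    capacity: usize,
    writers: HashMap<String, BufWriter<File>>,
    /// Keys ordered from least to most recently used.
    order: VecDeque<String>,
}

impl WriterCache {
    pub fn new(dir: PathBuf, capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { dir, capacity, writers: HashMap::new(), order: VecDeque::new() }
    }

    /// Append one line to the file for `key`, opening or evicting as needed.
    pub fn write_line(&mut self, key: &str, line: &str) -> io::Result<()> {
        if self.writers.contains_key(key) {
            // Remove the old position; the key is re-added as most recent below.
            self.order.retain(|k| k.as_str() != key);
        } else {
            if self.writers.len() >= self.capacity {
                // Evict the least recently used writer, flushing it first.
                if let Some(old) = self.order.pop_front() {
                    if let Some(mut evicted) = self.writers.remove(&old) {
                        evicted.flush()?;
                    }
                }
            }
            // Append mode lets an evicted file be reopened without losing data.
            let file = OpenOptions::new()
                .create(true)
                .append(true)
                .open(self.dir.join(format!("{}.jsonl", key)))?;
            self.writers.insert(key.to_string(), BufWriter::new(file));
        }
        self.order.push_back(key.to_string());
        let writer = self.writers.get_mut(key).expect("writer was just inserted");
        writeln!(writer, "{}", line)
    }

    /// Number of files currently held open.
    pub fn open_handles(&self) -> usize {
        self.writers.len()
    }

    /// Flush and close all remaining writers.
    pub fn finish(mut self) -> io::Result<()> {
        for (_, mut writer) in self.writers.drain() {
            writer.flush()?;
        }
        Ok(())
    }
}
```

The linear `retain` keeps the sketch short; a real LRU structure does this bookkeeping in constant time.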

Summary by CodeRabbit

New Features
- Added support for processing ClinVar variant data with improved record distribution and sorting capabilities.
- Enhanced data handling for variant record management.
Dependencies
- Added tempfile and lru libraries to support file operations and data manipulation.
- Introduced new data processing dependencies, including itertools and serde.
Refactor
- Restructured ClinVar data import logic for better clarity and maintainability.
- Improved error handling and logging for data processing.

coderabbitai · 2025-01-29T18:28:03Z

Walkthrough

This pull request introduces changes to the ClinVar gene import process in the Rust project. It adds new dependencies, tempfile and lru, while removing an unused profile section from Cargo.toml. In src/clinvar_genes/cli/import.rs, the code is refactored to enhance data handling, introducing new structs SortableVcvRecord and ClinvarVariants. The import logic is updated to utilize these new structures, improving the management and distribution of ClinVar variant records with enhanced error handling and logging.

Changes

| File | Change Summary |
|------|----------------|
| `Cargo.toml` | Added `tempfile = "3.10.1"` dependency; added `lru = "0.13.0"` dependency; removed `[profile.dev.package.similar]` section |
| `src/clinvar_genes/cli/import.rs` | Added new structs `SortableVcvRecord` and `ClinvarVariants`; introduced methods for record distribution and sorting; updated import logic with new data handling approach; enhanced error handling and logging |

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Import Command
    participant CV as ClinvarVariants
    participant File as JSONL Files
    participant TempDir as Temporary Directory
    
    CLI->>CV: Initialize with variant paths
    CV->>File: Read variant records
    CV->>TempDir: Distribute records
    CV->>CV: Sort records by HGNC ID
    CV-->>CLI: Return processed variants
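
The distribute-then-read flow in the diagram can be sketched as a two-phase bucket sort on disk. This is an illustration under stated assumptions, not the PR's implementation: the function names `distribute` and `for_each_sorted`, the numeric key type, and the plain-text `.txt` buckets are all hypothetical (the commit messages indicate the PR writes `$HGNC_ID.jsonl.gz` files). For brevity the sketch reopens the bucket file for every record, which corresponds to the "always re-create the writers" alternative from the PR description.

```rust
use std::collections::BTreeSet;
use std::fs::{File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};
use std::path::Path;

/// Phase 1: append every record to the bucket file of its key and return
/// the set of keys seen. A BTreeSet iterates in ascending key order.
pub fn distribute<I>(records: I, dir: &Path) -> io::Result<BTreeSet<u32>>
where
    I: IntoIterator<Item = (u32, String)>,
{
    let mut keys = BTreeSet::new();
    for (key, payload) in records {
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(dir.join(format!("{}.txt", key)))?;
        writeln!(file, "{}", payload)?;
        keys.insert(key);
    }
    Ok(keys)
}

/// Phase 2: visit buckets in ascending key order. Only one bucket is held
/// in memory at a time, which is what bounds the memory footprint.
pub fn for_each_sorted<F>(keys: &BTreeSet<u32>, dir: &Path, mut visit: F) -> io::Result<()>
where
    F: FnMut(u32, Vec<String>),
{
    for &key in keys {
        let file = File::open(dir.join(format!("{}.txt", key)))?;
        let lines = BufReader::new(file)
            .lines()
            .collect::<io::Result<Vec<String>>>()?;
        visit(key, lines);
    }
    Ok(())
}
```

Within a bucket, records keep their input order, so the overall result is sorted by key only.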


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

src/clinvar_genes/cli/import.rs (3)
105-111: Initialization method looks good.
The constructor delegates directory handling to the caller and initializes optional data cleanly. Consider verifying directory write permissions early to fail fast, if needed.

114-141: Potential panics in _iter due to .unwrap() usage.
This private method streams and expands records by gene ID. However, each I/O or JSON parsing error triggers a hard panic. Depending on your design, you could propagate errors for more graceful handling.

- .map(Result::unwrap)
+ .map(|res| res.unwrap_or_else(|err| panic!("Error reading line: {}", err)))

280-327: Database insertion pipeline is coherent but strictly sequential.
The chunk-by mechanism properly groups variant records by HGNC ID. If import speed becomes a bottleneck, consider parallelizing writes or splitting tasks, ensuring concurrency safety in RocksDB usage.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf5583 and 99a0dc9.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

Cargo.toml (1 hunks)
src/clinvar_genes/cli/import.rs (4 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (4)

GitHub Check: Schema
GitHub Check: Testing
GitHub Check: Linting
GitHub Check: build-and-push-image

🔇 Additional comments (8)

src/clinvar_genes/cli/import.rs (7)

8-16: All new imports appear relevant.
They nicely cover newly introduced functionality such as command-line parsing, iteration utilities, I/O, and serialization. No immediate issues.

93-97: Well-defined wrapper for sorting.
This lightweight struct is clear, combining the HGNC ID with the extracted record for sorting/iteration logic.

99-103: Sensible struct for managing variant data.
Storing the input paths, a temporary directory, and discovered IDs in a single struct makes the import pipeline more cohesive.

181-226: Sorted records logic is clear, but panics on deserialization failures.
The external sort approach is straightforward. As with _iter, consider logging or skipping invalid data instead of panicking.

Would you like me to provide a script or example code refactoring to handle such errors gracefully and log them?

253-260: Temporary directory approach aligns with external sorting.
Creating a new tempdir for distributing records is aligned with the PR goals of managing memory usage and intermediate data.

264-278: Gene list merging is logically sound.
Combining gene IDs from multiple data sources ensures we capture all relevant genes before writing to the database.

343-345: Tempfile cleanup is straightforward.
Dropping the TempDir ensures on-disk resources are freed. This final step meets the PR objective of managing large intermediate data properly.
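
The drop-based cleanup described here is ordinary RAII. The sketch below illustrates the idea with a hypothetical `ScratchDir` guard built on the standard library only; it is an assumption-laden stand-in, not the `tempfile` crate's `TempDir` and not code from this PR. Because cleanup lives in `Drop`, it runs when the guard goes out of scope, including on early returns via `?`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard that removes its directory when dropped.
pub struct ScratchDir {
    path: PathBuf,
}

impl ScratchDir {
    /// Create the directory (and any missing parents) and take ownership of it.
    pub fn create(path: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // Best-effort cleanup: drop cannot return an error, so it is ignored.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

An explicit `drop(guard)` only makes the cleanup point visible and immediate; leaving scope has the same effect.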

Cargo.toml (1)

58-58: The addition of tempfile looks correct.
This dependency underpins the external sorting strategy, enabling transient file management while reducing the in-memory footprint.


coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99a0dc9 and 6e61733.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

Cargo.toml (1 hunks)
src/clinvar_genes/cli/import.rs (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

Cargo.toml

⏰ Context from checks skipped due to timeout of 90000ms (4)

GitHub Check: Schema
GitHub Check: Testing
GitHub Check: Linting
GitHub Check: build-and-push-image

🔇 Additional comments (3)

src/clinvar_genes/cli/import.rs (3)

95-105: Well-designed data structures for managing variant data!

The SortableVcvRecord and ClinvarVariants structs provide a clean separation of concerns and enable efficient sorting and processing of variant data.

162-164: Excellent solution for managing file handles!

The LRU cache with a capacity of 1000 file handles effectively addresses the "too many open files" concern while maintaining good performance.

265-267: Excellent implementation of external sort with proper cleanup!

The use of tempfile for managing temporary storage during the external sort is a great choice:

- Provides automatic cleanup
- Reduces memory usage
- Improves overall reliability

Also applies to: 355-356


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

src/clinvar_genes/cli/import.rs (3)

270-272: Consider cleaning up temporary files on error.

While the temporary directory is cleaned up in the success path (line 361), it might leak if an error occurs before that point.

-    let tempdir = tempfile::tempdir()?;
+    let tempdir = tempfile::tempdir().map(|dir| {
+        tracing::debug!("Created temporary directory at {:?}", dir.path());
+        dir
+    })?;

315-344: Improve error handling for iterator synchronization.

The current implementation panics with a generic message when iterators are out of sync. Consider providing more context about the expected state.

-                panic!("No more records in vars_per_gene_records_by_hgnc_id, even though there should be.");
+                panic!(
+                    "Iterator synchronization error: Expected more records for HGNC ID {} but found none. \
+                     This indicates a mismatch between distributed records and merged gene lists.",
+                    hgnc_id
+                );

176-179: Consider more robust path handling.

The code assumes parent path always exists and panics if it doesn't. Consider using and_then for more graceful error handling.

-                std::fs::create_dir_all(path.parent().expect("failed to get parent path"))
-                    .expect("failed to create directory");
+                path.parent()
+                    .ok_or_else(|| anyhow::anyhow!("Invalid path: no parent directory for {:?}", path))
+                    .and_then(|p| std::fs::create_dir_all(p)
+                        .map_err(|e| anyhow::anyhow!("Failed to create directory {:?}: {}", p, e)))?;

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e61733 and 5b2df2c.

📒 Files selected for processing (1)

src/clinvar_genes/cli/import.rs (4 hunks)

🧰 Additional context used

📓 Learnings (1)

src/clinvar_genes/cli/import.rs (2)

Learnt from: tedil
PR: varfish-org/annonars#640
File: src/clinvar_genes/cli/import.rs:116-143
Timestamp: 2025-01-30T10:27:25.004Z
Learning: In the ClinvarVariants::_iter method, errors during file reading and record parsing should panic (using unwrap) to fail fast and avoid silent data loss through empty iterators.

Learnt from: tedil
PR: varfish-org/annonars#640
File: src/clinvar_genes/cli/import.rs:223-234
Timestamp: 2025-01-30T10:26:11.066Z
Learning: In the ClinVar genes import functionality, deserialization errors should cause a panic rather than being handled gracefully, as they indicate critical data integrity issues that break the functionality.
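
The trade-off behind these two learnings can be illustrated with a small sketch. The function names and the integer parsing are hypothetical stand-ins for the PR's record deserialization: a lenient reader silently drops malformed input, while a fail-fast reader panics on the first bad line, which is the behavior the author prefers here.

```rust
/// Lenient variant: malformed lines are dropped, so corrupt input can go
/// unnoticed and simply yield fewer (or zero) records.
pub fn parse_lenient(lines: &[&str]) -> Vec<u32> {
    lines
        .iter()
        .filter_map(|line| line.trim().parse::<u32>().ok())
        .collect()
}

/// Fail-fast variant: the first malformed line aborts processing with a
/// message that identifies the offending input.
pub fn parse_fail_fast(lines: &[&str]) -> Vec<u32> {
    lines
        .iter()
        .enumerate()
        .map(|(idx, line)| {
            line.trim().parse::<u32>().unwrap_or_else(|err| {
                panic!("line {}: cannot parse {:?}: {}", idx + 1, line, err)
            })
        })
        .collect()
}
```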

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: Testing
GitHub Check: Schema
GitHub Check: build-and-push-image

🔇 Additional comments (7)

src/clinvar_genes/cli/import.rs (7)

94-98: LGTM! Well-structured data type for sorting.

The SortableVcvRecord struct provides a clean way to associate HGNC IDs with their corresponding records for sorting.

161-192: Good implementation of LRU cache for file handles.

The code effectively addresses the "too-many-open-files" concern by:

- Using environment variable ANNONARS_CLINVAR_GENES_IMPORT_CACHE_SIZE for configuration
- Implementing LRU caching with a reasonable default of 1000 handles
- Properly flushing writers on cache eviction
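
Reading such a setting might look like the sketch below. The variable name and the default of 1000 come from this review; everything else is an assumption, in particular the helper names and the choice to fall back to the default when the value is missing, non-numeric, or zero (how the PR treats invalid values is not stated here). The size is returned as a `NonZeroUsize` so that a zero-capacity cache cannot be constructed.

```rust
use std::env;
use std::num::NonZeroUsize;

pub const CACHE_SIZE_VAR: &str = "ANNONARS_CLINVAR_GENES_IMPORT_CACHE_SIZE";
pub const DEFAULT_CACHE_SIZE: usize = 1000;

/// Parse an optional raw value. Missing, non-numeric, or zero values fall
/// back to the default (an assumption of this sketch).
pub fn parse_cache_size(raw: Option<&str>) -> NonZeroUsize {
    let default = NonZeroUsize::new(DEFAULT_CACHE_SIZE).expect("default is non-zero");
    raw.and_then(|s| s.trim().parse::<NonZeroUsize>().ok())
        .unwrap_or(default)
}

/// Read the cache size from the environment.
pub fn cache_size_from_env() -> NonZeroUsize {
    parse_cache_size(env::var(CACHE_SIZE_VAR).ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback rules testable without mutating process-wide state.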

303-307: LGTM! Efficient set difference calculation.

Good use of HashSet for calculating genes without variant data, which avoids O(n²) complexity.
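
As a rough illustration of the set difference mentioned above: each membership test against a `HashSet` is expected constant time, so the whole computation is linear in the number of genes. The function name and the use of `String` IDs are assumptions for this sketch, not the PR's types.

```rust
use std::collections::HashSet;

/// Return the gene IDs that appear in `all_genes` but have no variant data.
/// The result is sorted only to make the output deterministic.
pub fn genes_without_variants(
    all_genes: &HashSet<String>,
    genes_with_variants: &HashSet<String>,
) -> Vec<String> {
    let mut missing: Vec<String> = all_genes
        .difference(genes_with_variants)
        .cloned()
        .collect();
    missing.sort();
    missing
}
```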

309-310: LGTM! Memory-efficient record processing.

The use of chunk_by for grouping records by HGNC ID is an excellent choice as it processes records lazily without loading everything into memory.
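
The lazy grouping idea can be sketched without `itertools`. The adapter below is a hypothetical standard-library stand-in, not `chunk_by` itself: it pulls items from a key-sorted stream and yields one group per run of equal keys, so only the current group is held in memory. The `(u32, String)` item type is an assumption for illustration.

```rust
use std::iter::Peekable;

/// Hypothetical adapter that groups consecutive items sharing the same key.
pub struct GroupByKey<I: Iterator<Item = (u32, String)>> {
    inner: Peekable<I>,
}

/// Wrap an iterator that is already sorted (or at least clustered) by key.
pub fn group_by_key<I>(iter: I) -> GroupByKey<I>
where
    I: Iterator<Item = (u32, String)>,
{
    GroupByKey { inner: iter.peekable() }
}

impl<I> Iterator for GroupByKey<I>
where
    I: Iterator<Item = (u32, String)>,
{
    type Item = (u32, Vec<String>);

    fn next(&mut self) -> Option<Self::Item> {
        // Start a new group with the next available item, if any.
        let (key, first) = self.inner.next()?;
        let mut group = vec![first];
        // Consume following items only while they share the same key.
        while let Some((_, payload)) = self.inner.next_if(|(k, _)| *k == key) {
            group.push(payload);
        }
        Some((key, group))
    }
}
```

If the input is not sorted by key, the same key can show up in more than one group, which is why the sort step has to come first.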

360-361: LGTM! Proper cleanup of temporary files.

The use of drop(tempdir) ensures immediate cleanup of temporary files after processing is complete.

228-239: LGTM! Appropriate use of panic for data integrity.

Based on the retrieved learnings, the panic on deserialization errors is intentional and correct, as these errors indicate critical data integrity issues.

324-331: Potential panic on missing sequence location.

The code assumes sequence location is always present. Consider handling the None case more gracefully.

tedil and others added 9 commits January 28, 2025 17:40

use extsort during clinvar genes import

b56d74a

use rmp::write_named_vec so decoding actually works without problems

b2e7073

use into_group_map_by instead of chunk_by

3ffe4d4

fix wording

d522cab

use multigzdecoder + bufread of flate2 crate

120c1de

no need to sort at this point

f8ce36d

manually distribute records to $HGNC_ID.jsonl.gz for sorting

c34f3ac

remove unused imports

4caecbb

remove detailed progress prints

99a0dc9

coderabbitai bot reviewed Jan 29, 2025

use LRU cache for writers; use subdirectories to reduce #files/dir

6e61733

coderabbitai bot reviewed Jan 30, 2025

add ANNONARS_CLINVAR_GENES_IMPORT_CACHE_SIZE env var for controlling max number of open file handles

5b2df2c

coderabbitai bot reviewed Jan 30, 2025

tedil merged commit 10889c7 into main Jan 30, 2025
10 checks passed

tedil deleted the import-clinvar-genes-extsort branch January 30, 2025 13:58

tedil mentioned this pull request Jan 30, 2025

fix: use rocksdb instead of in-memory map to reduce memory footprint #638

Closed

varfish-bot mentioned this pull request Jan 30, 2025

chore(main): release 0.42.4 #641

Merged
