-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use external sort during clinvar-genes import
#640
Conversation
WalkthroughThis pull request introduces changes to the ClinVar gene import process in the Rust project. It adds new dependencies, Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI Import Command
participant CV as ClinvarVariants
participant File as JSONL Files
participant TempDir as Temporary Directory
CLI->>CV: Initialize with variant paths
CV->>File: Read variant records
CV->>TempDir: Distribute records
CV->>CV: Sort records by HGNC ID
CV-->>CLI: Return processed variants
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/clinvar_genes/cli/import.rs (3)
105-111
: Initialization method looks good.
The constructor delegates directory handling to the caller and initializes optional data cleanly. Consider verifying directory write permissions early to fail fast, if needed.
114-141
: Potential panics in_iter
due to.unwrap()
usage.
This private method streams and expands records by gene ID. However, each I/O or JSON parsing error triggers a hard panic. Depending on your design, you could propagate errors for more graceful handling.- .map(Result::unwrap) + .map(|res| res.unwrap_or_else(|err| panic!("Error reading line: {}", err)))
280-327
: Database insertion pipeline is coherent but strictly sequential.
The chunk-by mechanism properly groups variant records by HGNC ID. If import speed becomes a bottleneck, consider parallelizing writes or splitting tasks, ensuring concurrency safety in RocksDB usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (2)
Cargo.toml
(1 hunks)src/clinvar_genes/cli/import.rs
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Schema
- GitHub Check: Testing
- GitHub Check: Linting
- GitHub Check: build-and-push-image
🔇 Additional comments (8)
src/clinvar_genes/cli/import.rs (7)
8-16
: All new imports appear relevant.
They nicely cover newly introduced functionality such as command-line parsing, iteration utilities, I/O, and serialization. No immediate issues.
93-97
: Well-defined wrapper for sorting.
This lightweight struct is clear, combining the HGNC ID with the extracted record for sorting/iteration logic.
99-103
: Sensible struct for managing variant data.
Storing the input paths, a temporary directory, and discovered IDs in a single struct makes the import pipeline more cohesive.
181-226
: Sorted records logic is clear, but panics on deserialization failures.
The external sort approach is straightforward. As with_iter
, consider logging or skipping invalid data instead of panicking.Would you like me to provide a script or example code refactoring to handle such errors gracefully and log them?
253-260
: Temporary directory approach aligns with external sorting.
Creating a new tempdir for distributing records is aligned with the PR goals of managing memory usage and intermediate data.
264-278
: Gene list merging is logically sound.
Combining gene IDs from multiple data sources ensures we capture all relevant genes before writing to the database.
343-345
: Tempfile cleanup is straightforward.
Dropping theTempDir
ensures on-disk resources are freed. This final step meets the PR objective of managing large intermediate data properly.Cargo.toml (1)
58-58
: The addition oftempfile
looks correct.
This dependency underpins the external sorting strategy, enabling transient file management while reducing the in-memory footprint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (2)
Cargo.toml
(1 hunks)src/clinvar_genes/cli/import.rs
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Schema
- GitHub Check: Testing
- GitHub Check: Linting
- GitHub Check: build-and-push-image
🔇 Additional comments (3)
src/clinvar_genes/cli/import.rs (3)
95-105
: Well-designed data structures for managing variant data!The
SortableVcvRecord
andClinvarVariants
structs provide a clean separation of concerns and enable efficient sorting and processing of variant data.
162-164
: Excellent solution for managing file handles!The LRU cache with a capacity of 1000 file handles effectively addresses the "too many open files" concern while maintaining good performance.
265-267
: Excellent implementation of external sort with proper cleanup!The use of
tempfile
for managing temporary storage during the external sort is a great choice:
- Provides automatic cleanup
- Reduces memory usage
- Improves overall reliability
Also applies to: 355-356
…max number of open file handles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/clinvar_genes/cli/import.rs (3)
270-272
: Consider cleaning up temporary files on error.While the temporary directory is cleaned up in the success path (line 361), it might leak if an error occurs before that point.
- let tempdir = tempfile::tempdir()?; + let tempdir = tempfile::tempdir().map(|dir| { + tracing::debug!("Created temporary directory at {:?}", dir.path()); + dir + })?;
315-344
: Improve error handling for iterator synchronization.The current implementation panics with a generic message when iterators are out of sync. Consider providing more context about the expected state.
- panic!("No more records in vars_per_gene_records_by_hgnc_id, even though there should be."); + panic!( + "Iterator synchronization error: Expected more records for HGNC ID {} but found none. \ + This indicates a mismatch between distributed records and merged gene lists.", + hgnc_id + );
176-179
: Consider more robust path handling.The code assumes parent path always exists and panics if it doesn't. Consider using
and_then
for more graceful error handling.- std::fs::create_dir_all(path.parent().expect("failed to get parent path")) - .expect("failed to create directory"); + path.parent() + .ok_or_else(|| anyhow::anyhow!("Invalid path: no parent directory for {:?}", path)) + .and_then(|p| std::fs::create_dir_all(p) + .map_err(|e| anyhow::anyhow!("Failed to create directory {:?}: {}", p, e)))?;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/clinvar_genes/cli/import.rs
(4 hunks)
🧰 Additional context used
📓 Learnings (1)
src/clinvar_genes/cli/import.rs (2)
Learnt from: tedil
PR: varfish-org/annonars#640
File: src/clinvar_genes/cli/import.rs:116-143
Timestamp: 2025-01-30T10:27:25.004Z
Learning: In the ClinvarVariants::_iter method, errors during file reading and record parsing should panic (using unwrap) to fail fast and avoid silent data loss through empty iterators.
Learnt from: tedil
PR: varfish-org/annonars#640
File: src/clinvar_genes/cli/import.rs:223-234
Timestamp: 2025-01-30T10:26:11.066Z
Learning: In the ClinVar genes import functionality, deserialization errors should cause a panic rather than being handled gracefully, as they indicate critical data integrity issues that break the functionality.
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Testing
- GitHub Check: Schema
- GitHub Check: build-and-push-image
🔇 Additional comments (7)
src/clinvar_genes/cli/import.rs (7)
94-98
: LGTM! Well-structured data type for sorting.The
SortableVcvRecord
struct provides a clean way to associate HGNC IDs with their corresponding records for sorting.
161-192
: Good implementation of LRU cache for file handles.The code effectively addresses the "too-many-open-files" concern by:
- Using environment variable
ANNONARS_CLINVAR_GENES_IMPORT_CACHE_SIZE
for configuration- Implementing LRU caching with a reasonable default of 1000 handles
- Properly flushing writers on cache eviction
303-307
: LGTM! Efficient set difference calculation.Good use of HashSet for calculating genes without variant data, which avoids O(n²) complexity.
309-310
: LGTM! Memory-efficient record processing.The use of
chunk_by
for grouping records by HGNC ID is an excellent choice as it processes records lazily without loading everything into memory.
360-361
: LGTM! Proper cleanup of temporary files.The use of
drop(tempdir)
ensures immediate cleanup of temporary files after processing is complete.
228-239
: LGTM! Appropriate use of panic for data integrity.Based on the retrieved learnings, the panic on deserialization errors is intentional and correct, as these errors indicate critical data integrity issues.
324-331
: Potential panic on missing sequence location.The code assumes sequence location is always present. Consider handling the None case more gracefully.
Supersedes #638
On my machine, this takes about ~15 minutes, ~6GB of memory and 1.2GB of temporary disk space (instead of ~1 minute and ~42GB of memory [and no temporary disk space] before the changes). Could probably be faster, but should be fine for our use-cases (i.e. reduce memory footprint to be able to keep running this on github CI).
Also currently keeps track of file handles / writers for all ~20k files that are produced, which might be an issue regarding "too-many-open-files" limits. Alternatives: Either always re-create the writers (runtime-hit) or use a limited capacity HashMap (LRU cache etc) to keep the number of open files limited.
Summary by CodeRabbit
New Features
Dependencies
tempfile
andlru
libraries to support file operations and data manipulation.itertools
andserde
.Refactor