Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set save_intermediate_files to false by default #52

Merged
merged 3 commits into from
Jul 21, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
### Added
- Add F16.config to enable F16 node compatibility
### Changed
- Set `save_intermediate_files` to false in default.config
- Rename call-sSV.nf to main.nf
- Standardize the repo structure based on NF pipeline template
- Update README.md with resource locations and runtime of Delly v1.0.3 on different datasets and node types
2 changes: 1 addition & 1 deletion config/default.config
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ params {
max_memory = SysHelper.getAvailMemory()

cache_intermediate_pipeline_steps = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need this argument - it goes to

process.cache = params.cache_intermediate_pipeline_steps

But we also have L37 cache=true

Copy link
Contributor Author

@Faizal-Eeman Faizal-Eeman Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point Taka!

As I understand from NextFlow documentation on process cache, this allows users to resume the pipeline in case a process fails and the process needs to continue after fixing the failure. By default cache = true.

But since we are also specifying the process.cache in methods.config by using cache_intermediate_pipeline_steps = false from default.config, shouldn't this technically introduce conflict as we are having two boolean values for the same process cache argument? One from L37 cache_intermediate_pipeline_steps=false and another from L37 cache=true? Unless the cache_intermediate_pipeline_steps=false over-rides the cache=true argument and we are setting the pipeline to not cache any process files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the pipeline runs smoothly and when the pipeline fails because of an error, I haven't seen in the logs Nextflow's recommendation to resume the pipeline after fixing the error. So this means, process.cache is taken as false. I can comment cache=true and try if the pipeline runs and delivers results as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, for common variables like process.cache, there's a hierarchy of values, where for example the process-specific settings override the general process-level settings. In this case though, since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.

Also, as a note, the resume option wouldn't work with the submission script since the working directory gets deleted upon node shutdown. It should work if testing interactively though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood.

since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.

@yashpatel6 So cache_intermediate_pipeline_steps=false overwrites cache=true? Then should we just leave these arguments as such or would you recommend to drop cache=true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashpatel6 Does the retry function require cache = true? I thought that was the case but I'm not 100% sure. The cache is enabled by default (as in NF default) so I think we kept it in the template for development purposes.

The retry function doesn't require cache = true; the cache is used for when an entire pipeline is re-run. Nextflow basically checks if previously run processes completed successfully and uses those results rather than re-running processes that were identical in the previous run and finished successfully; it basically lets the pipeline pick up from the process that failed.

With retry, the pipeline doesn't stop running and just the specific process is re-launched, with updated configs/allocations if necessary. Technically, the results of previous processes are still available with retry because the pipeline hasn't actually stopped running.

This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @yashpatel6, very helpful!

@tyamaguchi-ucla So can we hold what to do with cache settings for now? and perhaps make changes after call-sSV 3.0.0 release?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, let's discuss this at the NF WG. We can finish up #46 and then release a new version. We generally want to discuss changes that require a template update. Also, it might be helpful for you to understand how we developed the current config structure. @yashpatel6 maybe we can create a GH discussion pointing to some key PRs (template and call-FusionTranscript) for new developers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?

We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.

I'm going through some old PRs so we can make a thread with some of the key decisions/discussions we had regarding pipelines, we can review some at the NF WG next week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Yash.

We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.

Yeah, it might be more dependent on the number of jobs. In either case, it's worth checking.

save_intermediate_files = true
save_intermediate_files = false
Faizal-Eeman marked this conversation as resolved.
Show resolved Hide resolved

ucla_cds = true

2 changes: 1 addition & 1 deletion config/template.config
Original file line number Diff line number Diff line change
@@ -19,7 +19,7 @@ params {

output_dir = "where/to/save/outputs/"

save_intermediate_files = true
save_intermediate_files = false

verbose = false