Parallel starting of downloadcmd causes file collision #84
We should be able to address this, but I am wondering what the use case is that requires running multiple instances of downloadcmd simultaneously. If the end goal is to increase the number of parallel downloads, I would suggest using the -wt option instead, which lets you explicitly set the number of parallel downloads; it uses threads rather than spawning new processes.
@gregmagdits Thanks for the fast reply!
This technique allows each list of S3 links (for TGZ files containing DICOMs) that is specific to one subject and session to be placed in its own destination folder. This logical file separation helps.
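Concretely, the launch pattern looks something like the sketch below. The flag names for passing an S3-link list and a destination directory are placeholders, since they vary between nda-tools releases; check downloadcmd --help for the exact options in your version.

```python
import subprocess

# One S3-link list per (subject, session), each sent to its own destination
# folder. The flags below (-t for a text file of S3 links, -d for the
# download directory) are placeholders; confirm them with `downloadcmd --help`.
jobs = [
    ("sub-01_ses-1_s3_links.txt", "downloads/sub-01/ses-1"),
    ("sub-01_ses-2_s3_links.txt", "downloads/sub-01/ses-2"),
]

# Start all downloadcmd jobs in parallel, then wait for them to finish.
procs = [
    subprocess.Popen(["downloadcmd", "-t", links, "-d", dest])
    for links, dest in jobs
]
for proc in procs:
    proc.wait()
```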
This has been fixed in the most recent release of the tools, 0.2.26.
Excellent! Thanks! I'll try it out the next time I download. Is the documentation for the option in the
Oh, wait! I see it now. I had misread the help text at https://github.com/NDAR/nda-tools/blob/master/NDATools/clientscripts/downloadcmd.py#L57-L58. Can I request that the help text be updated to make it clearer what this option does? From:
To:
@ericearl We didn't add an option to change the log directory as suggested in #49. Instead, we created the failed_s3_links file with a NamedTemporaryFile to guarantee a unique file name at creation time.
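In other words, uniqueness now comes from the tempfile module rather than from the timestamp alone. A minimal sketch of the idea (the directory and prefix here are illustrative, not the exact code in the release):

```python
import os
import tempfile
from datetime import datetime

# Illustrative location; the tool writes its logs under
# ~/NDA/nda-tools/downloadcmd/logs.
log_dir = os.path.expanduser("~/NDA/nda-tools/downloadcmd/logs")
os.makedirs(log_dir, exist_ok=True)

# NamedTemporaryFile appends a random suffix to the name it creates, so two
# downloadcmd processes started in the same second still get distinct files.
timestamp = datetime.now().strftime("%Y%m%dT%H%M%S")
with tempfile.NamedTemporaryFile(
    mode="a",
    prefix=f"failed_s3_links_file_{timestamp}_",
    suffix=".txt",
    dir=log_dir,
    delete=False,
) as failed_s3_links:
    print(failed_s3_links.name)
```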
@gregmagdits: +1 to the idea of being able to configure the location of the log files (along with the metadata text file download location, etc.). We work off of our cluster, and user home directories have limited space (10 GB). That space gets used up quickly, even by the metadata text file alone, but also by the log files and the residual files in the .downloadprogress hidden folder. Before we realized that space was the issue, the metadata text *.gz would seemingly download and attempt to extract, but it would be broken and cause a pandas parsing error when attempting to find files matching the --regexp argument. For now, I've modified the
@gregmagdits This issue is biting me again... Should I open a new GitHub issue as a feature request to "Allow configuration of the NDA_ORGINIZATION_ROOT_FOLDER", or do you want to manage it inside this GitHub issue? I'm thinking this could be resolved by allowing configurable or user-supplied global variables in https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57C1-L76. Anyway, my current plan feels convoluted because of this need:
Is this basically what you did, @j-tseng?
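For reference, this is roughly the kind of override I have in mind. It is only a sketch: the environment variable name is hypothetical, and the globals below merely mirror the ones referenced in NDATools/__init__.py for illustration.

```python
import os

# Hypothetical override, not part of nda-tools today: let an environment
# variable replace the hard-coded home-directory root that
# NDATools/__init__.py builds its log and metadata paths from.
NDA_ORGINIZATION_ROOT_FOLDER = os.environ.get(
    "NDA_TOOLS_ROOT_DIR",
    os.path.join(os.path.expanduser("~"), "NDA"),
)
NDA_TOOLS_ROOT_FOLDER = os.path.join(NDA_ORGINIZATION_ROOT_FOLDER, "nda-tools")
NDA_TOOLS_DOWNLOADCMD_LOGS_FOLDER = os.path.join(
    NDA_TOOLS_ROOT_FOLDER, "downloadcmd", "logs"
)

# On a cluster, each user could then point everything at scratch space:
#   export NDA_TOOLS_ROOT_DIR=/scratch/$USER/NDA
```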
@ericearl - yes please open a new issue since this seems unrelated to the original. I will try to get it into the next release. |
@gregmagdits I'm having a problem similar to #49 with downloadcmd.

When two or more downloadcmd calls start in parallel, and more importantly start at the exact same second, the first one to finish downloading successfully deletes the S3 links log file ~/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_YYYYMMDDThhmmss.txt afterward (where YYYYMMDDThhmmss is the year, month, day, hour, minute, and second at which downloadcmd began). This leaves every other job started in that same second without a file to delete and therefore throws a FileNotFoundError.

This could be prevented with slower serial calls or by staggering jobs, but that is not as good a solution as allowing the output log file to be sent somewhere else via a command-line argument to downloadcmd or an input config file, especially considering the use of downloadcmd on an HPC cluster.

Thanks, and let me know if you need any logs or further information to reproduce this error yourselves.
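To make the race concrete, here is a small sketch of the failure mode, assuming only that the log file name is built from the start time at second resolution, as the name pattern above suggests. It is an illustration, not the nda-tools code itself.

```python
import os
from datetime import datetime

logs_dir = os.path.expanduser("~/NDA/nda-tools/downloadcmd/logs")
os.makedirs(logs_dir, exist_ok=True)

def failed_links_path():
    # A second-resolution timestamp means every job started in the same
    # second computes the identical file name.
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    return os.path.join(logs_dir, f"failed_s3_links_file_{stamp}.txt")

# Two "jobs" launched in the same second resolve to the same log file.
job_a = failed_links_path()
job_b = failed_links_path()
assert job_a == job_b

open(job_a, "w").close()  # both jobs write to the shared file
os.remove(job_a)          # the first job to finish cleans it up
try:
    os.remove(job_b)      # the second job's cleanup then fails
except FileNotFoundError as err:
    print(f"second job hits: {err!r}")
```

Any per-job suffix in the file name, or a user-configurable log location, would avoid the shared name.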