Create utility tools to copy/download a subset of GW "fix" data #3254

Open · 4 tasks
weihuang-jedi opened this issue Jan 24, 2025 · 0 comments
weihuang-jedi (Contributor) commented:

What new functionality do you need?

When setting up and running the global workflow (GW) on a new machine, one big task is transferring the "fix" data to that machine. The "fix" dataset is huge, and the whole set is not needed at the beginning, so we want to create a utility tool that can fetch just the necessary subset, as Rahul suggested in an email:

What would be nice to have as a utility in the workflow is the ability to fetch (and update) the data from S3 to a local machine. With such a utility:

- The user can query which datasets are available on the bucket.
- The user can fetch the entire dataset (all resolutions, latest date-stamped set).
- The user can fetch the dataset for a subset of resolutions (e.g. C48mx500) from the latest date-stamped set.
- The script fetches only the updates if the destination already holds a subset, e.g. via an --update flag, to avoid fetching the same data over and over.

As Walter mentions, this is done just once by the g-w code managers on the HPC platforms where the data is shared among many users. For the community running on their own HPCs or in containers, the process can be tedious, so a proper solution for community support would be welcome. A rough sketch of such a utility follows below.
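As a starting point, here is a minimal Python sketch of such a utility built on boto3. The bucket name noaa-nws-global-pds, the fix/ prefix, the example subset paths, and the size-based --update check are illustrative assumptions, not the confirmed layout of the GW fix archive.

```python
#!/usr/bin/env python3
"""Sketch of a fix-data fetch utility. Bucket name and prefix layout are assumed."""
import argparse
from pathlib import Path

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "noaa-nws-global-pds"  # assumed bucket; substitute the real fix-data bucket
PREFIX = "fix/"                 # assumed top-level prefix for fix data


def make_client():
    # Anonymous (unsigned) access suffices for a public Open Data bucket.
    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def list_datasets(s3):
    """Query which datasets are available (top-level prefixes under fix/)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            print(cp["Prefix"])


def fetch(s3, prefix, dest, update=False):
    """Download every object under `prefix`; with --update, skip files whose
    size already matches the local copy (a cheap change check; an ETag or
    timestamp comparison would be more robust)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            local = Path(dest) / key
            if update and local.exists() and local.stat().st_size == obj["Size"]:
                continue  # an identical-size copy already exists locally
            local.parent.mkdir(parents=True, exist_ok=True)
            print(f"fetching s3://{BUCKET}/{key} -> {local}")
            s3.download_file(BUCKET, key, str(local))


def main():
    ap = argparse.ArgumentParser(description="Fetch a subset of GW fix data from S3")
    ap.add_argument("--list", action="store_true", help="query available datasets")
    ap.add_argument("--prefix", default=PREFIX,
                    help="subset to fetch, e.g. a resolution subdirectory (layout assumed)")
    ap.add_argument("--dest", default=".", help="local destination directory")
    ap.add_argument("--update", action="store_true",
                    help="only fetch objects missing or changed locally")
    args = ap.parse_args()

    s3 = make_client()
    if args.list:
        list_datasets(s3)
    else:
        fetch(s3, args.prefix, args.dest, update=args.update)


if __name__ == "__main__":
    main()
```

Usage might look like `python fetch_fix.py --list` to query the bucket and `python fetch_fix.py --prefix fix/orog/C48/ --dest $FIXDIR --update` to refresh a subset (that prefix path is hypothetical). An alternative for the --update requirement is to wrap the AWS CLI, since `aws s3 sync` already skips objects that are unchanged locally; the sketch above keeps the dependency surface to boto3 only.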

What are the requirements for the new functionality?

Same as the utility capabilities described above.

Acceptance Criteria

Suggest a solution (optional)

No response

weihuang-jedi added the feature and triage labels Jan 24, 2025
weihuang-jedi self-assigned this Jan 24, 2025