Create utility tools to copy/download a subset of GW "fix" data #3254

Open · 4 tasks
weihuang-jedi opened this issue Jan 24, 2025 · 0 comments
weihuang-jedi (Contributor) commented:

What new functionality do you need?

When setting up and running the global workflow (GW) on a new machine, one big task is transferring the "fix" data to that machine. The "fix" dataset is huge, and the whole set is not needed at the beginning, so we want to create a utility tool that can fetch just the necessary subset, as Rahul suggested in an email:

What would be nice to have as a utility in the workflow is the ability to fetch (and update) the data from S3 to a local machine. With such a utility:

- The user can query which datasets are available on the bucket.
- The user can fetch the entire dataset (all resolutions, latest date-stamped set).
- The user can fetch the dataset for a subset of resolutions (e.g. C48mx500) from the latest date-stamped set.
- The script fetches only the updates if the destination already holds a subset, e.g. via an --update flag, to avoid fetching the same data over and over.

As Walter mentions, this is done just once by the g-w code managers on the HPC platforms where the data is shared among many users. For the community running on their own HPCs or in containers, the process can be tedious, so a proper solution for community support would be welcome. A rough sketch of such a utility follows below.
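As a starting point, here is a minimal Python sketch of such a utility built on boto3. The bucket name noaa-nws-global-pds, the fix/ prefix, the example subset paths, and the size-based --update check are illustrative assumptions, not the confirmed layout of the GW fix archive.

```python
#!/usr/bin/env python3
"""Sketch of a fix-data fetch utility. Bucket name and prefix layout are assumed."""
import argparse
from pathlib import Path

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "noaa-nws-global-pds"  # assumed bucket; substitute the real fix-data bucket
PREFIX = "fix/"                 # assumed top-level prefix for fix data


def make_client():
    # Anonymous (unsigned) access suffices for a public Open Data bucket.
    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def list_datasets(s3):
    """Query which datasets are available (top-level prefixes under fix/)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            print(cp["Prefix"])


def fetch(s3, prefix, dest, update=False):
    """Download every object under `prefix`; with --update, skip files whose
    size already matches the local copy (a cheap change check; an ETag or
    timestamp comparison would be more robust)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            local = Path(dest) / key
            if update and local.exists() and local.stat().st_size == obj["Size"]:
                continue  # an identical-size copy already exists locally
            local.parent.mkdir(parents=True, exist_ok=True)
            print(f"fetching s3://{BUCKET}/{key} -> {local}")
            s3.download_file(BUCKET, key, str(local))


def main():
    ap = argparse.ArgumentParser(description="Fetch a subset of GW fix data from S3")
    ap.add_argument("--list", action="store_true", help="query available datasets")
    ap.add_argument("--prefix", default=PREFIX,
                    help="subset to fetch, e.g. a resolution subdirectory (layout assumed)")
    ap.add_argument("--dest", default=".", help="local destination directory")
    ap.add_argument("--update", action="store_true",
                    help="only fetch objects missing or changed locally")
    args = ap.parse_args()

    s3 = make_client()
    if args.list:
        list_datasets(s3)
    else:
        fetch(s3, args.prefix, args.dest, update=args.update)


if __name__ == "__main__":
    main()
```

Usage might look like `python fetch_fix.py --list` to query the bucket and `python fetch_fix.py --prefix fix/orog/C48/ --dest $FIXDIR --update` to refresh a subset (that prefix path is hypothetical). An alternative for the --update requirement is to wrap the AWS CLI, since `aws s3 sync` already skips objects that are unchanged locally; the sketch above keeps the dependency surface to boto3 only.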

What are the requirements for the new functionality?

Same as the utility capabilities described above.

Acceptance Criteria

Suggest a solution (optional)

No response

weihuang-jedi added the feature and triage labels Jan 24, 2025
weihuang-jedi self-assigned this Jan 24, 2025