
Normal period / shp output #12

Merged
merged 4 commits into main from normal_period
Aug 11, 2023

Conversation

Joshdpaul
Collaborator

This PR revises the config/download/compute pipeline to handle a larger download request for normal periods (~30 years) and further improves the shapefile export function. The README and config files should provide the setup needed for testing this PR; the basic steps are:

  • set your env var export AR_DATA_DIR=...
  • create your conda env from the environment.yml and activate it
  • inspect the config.py file
  • python download.py
  • python compute_ivt.py
  • use the AR_detection notebook to find ARs and export as shapefile

The download.py and compute_ivt.py scripts were revised to download the eastward and northward vapor flux components into separate .nc files and then combine them during the compute step. This was necessary because, according to this source (https://confluence.ecmwf.int/pages/viewpage.action?pageId=326278613), the netCDF3 conversion performed during the CDS request has a 4 GB file size limit when a request includes more than one variable. By downloading the variables separately, we can get around the file size limit when downloading many years of data. (For reference, the separate eastward/northward netCDF files downloaded for 30 years at a 6 hr timestep are 4-5 GB each.)
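The split-by-variable request can be sketched as below. The dataset name, ERA5 variable names, output paths, and year range are illustrative assumptions (the real values live in config.py); the request-building logic is separated out so the retrieve loop is only reached when run as a script:

```python
# Sketch of the per-variable CDS download, assuming the dataset id,
# variable names, and output paths below (hypothetical; see config.py).
VARIABLES = [
    "vertical_integral_of_eastward_water_vapour_flux",
    "vertical_integral_of_northward_water_vapour_flux",
]
OUT_PATHS = ["era5_eastward_flux.nc", "era5_northward_flux.nc"]  # hypothetical


def build_request(variable):
    """Build a single-variable request, which stays under the 4 GB netCDF3 limit."""
    return {
        "product_type": "reanalysis",
        "variable": variable,
        "year": [str(y) for y in range(1992, 2023)],  # ~30-year normal period
        "month": [f"{m:02d}" for m in range(1, 13)],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": ["00:00", "06:00", "12:00", "18:00"],  # 6 hr timestep
        "format": "netcdf",
    }


if __name__ == "__main__":
    import cdsapi  # imported here so the sketch runs without CDS credentials

    client = cdsapi.Client()
    for variable, path in zip(VARIABLES, OUT_PATHS):
        # One retrieve() call (and one queued CDS request) per variable.
        client.retrieve("reanalysis-era5-single-levels", build_request(variable), path)
```

Each variable becomes its own queued CDS request, which matches the roughly "1 hr per variable request" timing reported below.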

The config.py file was revised to provide a list of request variables (instead of including the variables in the args dictionary) and a list of output file paths, one per variable. The download.py script uses these lists to add each variable to the args dictionary in a separate CDS request and download it to its own file. The compute_ivt.py script then opens the per-variable files as xarray.Datasets and merges them into a new netCDF. We could automatically delete the original per-variable netCDFs, but given the download time it may be better to delete them manually after all processing is complete.
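The merge step can be sketched as follows; the ERA5 short names `viwve`/`viwvn` for the eastward/northward flux components and the `ivt_mag` output name are assumptions (the actual names and paths come from config.py and compute_ivt.py):

```python
# Sketch of the merge-and-compute step, assuming the variable short names
# "viwve" (eastward) and "viwvn" (northward) in the downloaded files.
import numpy as np
import xarray as xr


def compute_ivt(ds_east, ds_north):
    """Merge the per-variable datasets and add the IVT magnitude."""
    ds = xr.merge([ds_east, ds_north])
    # IVT magnitude is the Euclidean norm of the two flux components.
    ds["ivt_mag"] = np.sqrt(ds["viwve"] ** 2 + ds["viwvn"] ** 2)
    return ds


def merge_files(east_path, north_path, out_path):
    """Open the two downloaded .nc files and write the merged netCDF."""
    ds = compute_ivt(xr.open_dataset(east_path), xr.open_dataset(north_path))
    ds.to_netcdf(out_path)  # the per-variable source files are left on disk
```

Keeping `compute_ivt()` separate from the file I/O makes the arithmetic easy to test on small in-memory Datasets.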

For reference, on my machine download.py took 2.5 hours (1 hr per variable request, 15 min per variable download), compute_ivt.py took 3 hrs, and the notebook took ~45 minutes to run. (Most of the notebook processing time was spent in the get_data_for_ar_criteria() function (25 minutes) and the create_geodataframe_with_all_ars() function (9 minutes).)

The create_shapefile() function was revised to replace long column names with abbreviated ones, because the ESRI shapefile format limits column names to 10 characters. The function also outputs a columns.csv file that acts as a crosswalk between the original and abbreviated column names. NOTE: the abbreviated names are hard-coded in a dictionary inside this function and must exactly match the number and order of the geodataframe columns created from AR properties/criteria in the create_geodataframe_with_all_ars() function. If future revisions are made to these functions and/or the properties/criteria, this hard-coded column name dictionary may need to be revisited.
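A minimal sketch of the rename-and-crosswalk logic, with a hypothetical crosswalk dictionary (the real one is hard-coded in create_shapefile() and must match the geodataframe columns exactly):

```python
# Hypothetical crosswalk from long column names to <=10-char abbreviations;
# the actual dictionary lives inside create_shapefile().
import csv

COLUMN_CROSSWALK = {
    "mean_ivt_magnitude": "mean_ivt",      # example names, not the real ones
    "total_length_km": "length_km",
    "orientation_degrees": "orient_deg",
}


def abbreviate_columns(columns, crosswalk, csv_path):
    """Rename columns to fit the ESRI 10-char limit and write a crosswalk CSV."""
    abbreviated = []
    for name in columns:
        short = crosswalk.get(name, name[:10])  # fall back to truncation
        if len(short) > 10:
            raise ValueError(f"abbreviation too long for shapefile: {short!r}")
        abbreviated.append(short)
    # The CSV is the reference mapping between original and abbreviated names.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["original", "abbreviated"])
        writer.writerows(zip(columns, abbreviated))
    return abbreviated
```

A dictionary keyed by the original names (rather than a positional list) would make the mapping less fragile if the geodataframe columns are ever reordered.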

@charparr charparr (Member) left a comment

To test this PR I re-downloaded the source ERA5 data (1992 through 2022) and re-computed using the split-by-variable-then-merge approach. I also re-executed all cells in the detection notebook.

The changes here work great! I'll note that I experienced similar processing times to those described in the PR.

@charparr charparr merged commit 727b5d5 into main Aug 11, 2023
@charparr charparr deleted the normal_period branch August 11, 2023 15:29