This PR revises the config/download/compute pipeline to handle a larger download request covering a normal period (~30 years), and further improves the shapefile export function. The README and config files should provide the basic setup for testing this PR, but the basic steps are:

1. `export AR_DATA_DIR=...`
2. Create the conda environment from `environment.yml` and activate it
3. Review the `config.py` file
4. `python download.py`
5. `python compute_ivt.py`
The `download.py` and `compute_ivt.py` scripts were revised to download the eastward and northward vapor flux components into separate .nc files, and then combine them during the compute step. This was necessary because, according to this source (https://confluence.ecmwf.int/pages/viewpage.action?pageId=326278613), the netCDF3 conversion done during the CDS request has a 4 GB file size limit when more than one variable is included. By downloading the variables separately, we can get around the file size limit when downloading many years of data. (For reference, the separate eastward/northward netCDF files downloaded for 30 years at a 6-hour timestep are 4-5 GB each.)
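For illustration, the per-variable request pattern could look roughly like this (a minimal sketch, not the actual `download.py` code; the dataset name, variable names, years, and output paths here are assumptions):

```python
# Sketch of issuing one CDS request per variable, assuming the ERA5
# single-levels dataset; all names and paths are illustrative.
import cdsapi

client = cdsapi.Client()

# One output file per vapor flux component (stand-ins for the config.py lists)
variables = {
    "vertical_integral_of_eastward_water_vapour_flux": "ivt_east.nc",
    "vertical_integral_of_northward_water_vapour_flux": "ivt_north.nc",
}

# Shared request arguments; the variable is added per request below
base_args = {
    "product_type": "reanalysis",
    "format": "netcdf",
    "year": [str(y) for y in range(1991, 2021)],  # assumed ~30-year period
    "month": [f"{m:02d}" for m in range(1, 13)],
    "day": [f"{d:02d}" for d in range(1, 32)],
    "time": ["00:00", "06:00", "12:00", "18:00"],  # 6-hour timestep
}

for variable, out_path in variables.items():
    # Requesting a single variable keeps each converted netCDF
    # under the 4 GB netCDF3 limit noted above
    client.retrieve(
        "reanalysis-era5-single-levels",
        {**base_args, "variable": variable},
        out_path,
    )
```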
The `config.py` file was revised to provide a list of request variables (instead of including the variables in the args dictionary) and a list of output file paths for the variables. These lists are then used in the `download.py` script to add each variable to the args dictionary in a separate CDS request and download it to a separate file. The `compute_ivt.py` script also uses the list of file paths to open each file as an `xarray.Dataset` and merge them into a new netCDF. We could automatically delete the original per-variable netCDFs, but given the download time it may be better to delete them manually after all processing is complete.
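The merge step could look roughly like this (a minimal sketch; the file paths, output filename, and commented-out variable names are assumptions):

```python
# Sketch of merging the per-variable netCDFs into one dataset;
# paths and the output filename are illustrative.
import xarray as xr

var_fps = ["ivt_east.nc", "ivt_north.nc"]  # stand-in for the config.py list

# Open each single-variable file and merge on their shared coordinates
merged = xr.merge([xr.open_dataset(fp) for fp in var_fps])

# IVT magnitude from the two components (variable names assumed; actual
# names depend on what the CDS netCDF conversion writes):
# merged["ivt_mag"] = (merged["ivte"] ** 2 + merged["ivtn"] ** 2) ** 0.5

merged.to_netcdf("era5_vapor_flux.nc")
```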
For reference, on my machine it took 2.5 hours to run `download.py` (1 hour per variable request, 15 minutes per variable download), 3 hours for `compute_ivt.py`, and ~45 minutes to run the notebook. (The majority of the notebook processing time was in the `get_data_for_ar_criteria()` function (25 minutes) and the `create_geodataframe_with_all_ars()` function (9 minutes).)
The `create_shapefile()` function was revised to replace long column names with abbreviated ones, due to the ESRI shapefile limitation of 10-character column names. This function also outputs a `columns.csv` file that acts as a crosswalk between the original and abbreviated column names. NOTE: the abbreviated names are hard-coded in a dictionary inside this function, and they exactly match the number and order of the geodataframe columns created from AR properties/criteria in the `create_geodataframe_with_all_ars()` function. If any future revisions are made to these functions and/or properties/criteria, we may need to revisit this hard-coded column name dictionary.
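The rename/crosswalk pattern looks roughly like this (a minimal sketch; the column names and function signature below are illustrative, not the actual dictionary in this PR):

```python
# Sketch of abbreviating column names for the ESRI 10-character limit
# and writing a columns.csv crosswalk; all names here are illustrative.
import csv

import geopandas as gpd

COLUMN_ABBREVIATIONS = {
    "mean_ivt_magnitude": "mean_ivt",
    "max_ivt_magnitude": "max_ivt",
    "length_width_ratio": "len_wid",
}

def create_shapefile(gdf: gpd.GeoDataFrame, shp_fp: str, csv_fp: str) -> None:
    """Write gdf to a shapefile with <=10 character column names, plus a
    columns.csv crosswalk between original and abbreviated names."""
    gdf.rename(columns=COLUMN_ABBREVIATIONS).to_file(shp_fp)
    with open(csv_fp, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["original_name", "abbreviated_name"])
        writer.writerows(COLUMN_ABBREVIATIONS.items())
```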