This PR revises the config/download/compute pipeline to handle a larger download request covering a normal period (~30 years), and further improves the shapefile export function. The README and config files should provide the basic setup for testing this PR, but the basic steps are:

1. `export AR_DATA_DIR=...`
2. Create the conda environment from `environment.yml` and activate it
3. Review the `config.py` file
4. `python download.py`
5. `python compute_ivt.py`
The `download.py` and `compute_ivt.py` scripts were revised to download the eastward and northward vapor flux components into separate .nc files, and then combine them during the compute step. This was necessary because, according to this source (https://confluence.ecmwf.int/pages/viewpage.action?pageId=326278613), the netCDF3 conversion done during the CDS request has a 4 GB file size limit when more than one variable is included. By downloading the variables separately, we can get around the file size limit when downloading many years of data. (For reference, the separate eastward/northward netCDF files downloaded for 30 years at a 6-hour timestep are 4-5 GB each.)
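For illustration, the per-variable request pattern could look roughly like this (a minimal sketch, not the actual `download.py` code; the dataset name, variable names, years, and output paths here are assumptions):

```python
# Sketch of issuing one CDS request per variable, assuming the ERA5
# single-levels dataset; all names and paths are illustrative.
import cdsapi

client = cdsapi.Client()

# One output file per vapor flux component (stand-ins for the config.py lists)
variables = {
    "vertical_integral_of_eastward_water_vapour_flux": "ivt_east.nc",
    "vertical_integral_of_northward_water_vapour_flux": "ivt_north.nc",
}

# Shared request arguments; the variable is added per request below
base_args = {
    "product_type": "reanalysis",
    "format": "netcdf",
    "year": [str(y) for y in range(1991, 2021)],  # assumed ~30-year period
    "month": [f"{m:02d}" for m in range(1, 13)],
    "day": [f"{d:02d}" for d in range(1, 32)],
    "time": ["00:00", "06:00", "12:00", "18:00"],  # 6-hour timestep
}

for variable, out_path in variables.items():
    # Requesting a single variable keeps each converted netCDF
    # under the 4 GB netCDF3 limit noted above
    client.retrieve(
        "reanalysis-era5-single-levels",
        {**base_args, "variable": variable},
        out_path,
    )
```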
The `config.py` file was revised to provide a list of request variables (instead of including the variables in the args dictionary) and a list of output file paths for the variables. These lists are then used in the `download.py` script to add each variable to the args dictionary in a separate CDS request and download it to a separate file. The `compute_ivt.py` script also uses the list of file paths to open each file as an `xarray.Dataset` and merge them into a new netCDF. We could automatically delete the original per-variable netCDFs, but given the download time it may be better to delete them manually after all processing is complete.
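The merge step could look roughly like this (a minimal sketch; the file paths, output filename, and commented-out variable names are assumptions):

```python
# Sketch of merging the per-variable netCDFs into one dataset;
# paths and the output filename are illustrative.
import xarray as xr

var_fps = ["ivt_east.nc", "ivt_north.nc"]  # stand-in for the config.py list

# Open each single-variable file and merge on their shared coordinates
merged = xr.merge([xr.open_dataset(fp) for fp in var_fps])

# IVT magnitude from the two components (variable names assumed; actual
# names depend on what the CDS netCDF conversion writes):
# merged["ivt_mag"] = (merged["ivte"] ** 2 + merged["ivtn"] ** 2) ** 0.5

merged.to_netcdf("era5_vapor_flux.nc")
```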
For reference, on my machine it took 2.5 hours to run `download.py` (1 hour per variable request, 15 minutes per variable download), 3 hours for `compute_ivt.py`, and ~45 minutes to run the notebook. (The majority of the notebook processing time was in the `get_data_for_ar_criteria()` function (25 minutes) and the `create_geodataframe_with_all_ars()` function (9 minutes).)
The `create_shapefile()` function was revised to replace long column names with abbreviated ones, due to the ESRI shapefile limitation of 10-character column names. This function also outputs a `columns.csv` file that acts as a crosswalk between the original and abbreviated column names. NOTE: the abbreviated names are hard-coded in a dictionary inside this function, and they exactly match the number and order of the geodataframe columns created from AR properties/criteria in the `create_geodataframe_with_all_ars()` function. If any future revisions are made to these functions and/or properties/criteria, we may need to revisit this hard-coded column name dictionary.
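The rename/crosswalk pattern looks roughly like this (a minimal sketch; the column names and function signature below are illustrative, not the actual dictionary in this PR):

```python
# Sketch of abbreviating column names for the ESRI 10-character limit
# and writing a columns.csv crosswalk; all names here are illustrative.
import csv

import geopandas as gpd

COLUMN_ABBREVIATIONS = {
    "mean_ivt_magnitude": "mean_ivt",
    "max_ivt_magnitude": "max_ivt",
    "length_width_ratio": "len_wid",
}

def create_shapefile(gdf: gpd.GeoDataFrame, shp_fp: str, csv_fp: str) -> None:
    """Write gdf to a shapefile with <=10 character column names, plus a
    columns.csv crosswalk between original and abbreviated names."""
    gdf.rename(columns=COLUMN_ABBREVIATIONS).to_file(shp_fp)
    with open(csv_fp, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["original_name", "abbreviated_name"])
        writer.writerows(COLUMN_ABBREVIATIONS.items())
```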