fix: handle empty HRRR data files via linear imputation #245

Merged: 3 commits, Dec 8, 2021

danielolsen
Contributor

Pull Request doc

Purpose

When generating wind power profiles from HRRR data, gracefully handle any missing data via linear interpolation. Closes #244.

What the code is doing

The impute module is moved from prereise.gather.winddata.rap.impute to prereise.gather.winddata.impute, and a new linear interpolation method is added that should perform well on small data gaps.
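
As a rough illustration of linear interpolation over small gaps, here is a minimal sketch using pandas. The helper name `linear_impute` and the sample data are hypothetical; this is not the function added in this PR, only one way the idea could look.

```python
import numpy as np
import pandas as pd


def linear_impute(wind_speed: pd.DataFrame) -> pd.DataFrame:
    """Fill NaN entries by interpolating linearly along the time axis.

    Hypothetical helper for illustration; not the prereise implementation.
    """
    # interpolate() works column by column, i.e. down the time index,
    # and method="linear" treats the rows as equally spaced.
    return wind_speed.interpolate(method="linear")


# Made-up hourly wind speeds (m/s) for two farms, with interior gaps.
hours = pd.date_range("2020-01-01", periods=5, freq=pd.Timedelta(hours=1))
speeds = pd.DataFrame(
    {
        "farm_a": [4.0, np.nan, 8.0, 9.0, 10.0],
        "farm_b": [2.0, 3.0, np.nan, np.nan, 6.0],
    },
    index=hours,
)
filled = linear_impute(speeds)
```

Each missing value is replaced by a point on the straight line between its nearest valid neighbors, which is a reasonable approximation when a gap spans only an hour or two.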

Within prereise.gather.winddata.hrrr.calculations, calculate_pout is refactored to first build an array of all wind speed magnitudes obtained from the NOAA grib files (filling in NA when files are empty), then impute missing values as necessary, and finally convert wind speeds to wind powers.
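
The three stages described above (build the array with NA rows, impute, convert to power) can be sketched roughly as follows. All function names, the `None` marker for an empty file, and the toy power curve are assumptions made for illustration; the real calculate_pout reads grib files and uses turbine-specific power curves.

```python
import numpy as np
import pandas as pd

# Hypothetical power curve: wind speed (m/s) -> normalized power output.
CURVE_SPEED = np.array([0.0, 3.0, 12.0, 25.0])
CURVE_POWER = np.array([0.0, 0.0, 1.0, 1.0])


def build_wind_speed_array(hourly_readings, wind_farm_ct):
    """Step 1: stack per-hour readings, using a NaN row for each empty file."""
    return np.array(
        [
            np.full(wind_farm_ct, np.nan) if reading is None else reading
            for reading in hourly_readings
        ],
        dtype=float,
    )


def impute(wind_speed):
    """Step 2: fill NaN values by linear interpolation along the time axis."""
    return pd.DataFrame(wind_speed).interpolate(method="linear").to_numpy()


def to_power(wind_speed):
    """Step 3: map wind speed to normalized power via the toy curve above."""
    return np.interp(wind_speed, CURVE_SPEED, CURVE_POWER)


# Three hours, two wind farms; the second hour's file was empty (None).
readings = [np.array([3.0, 12.0]), None, np.array([12.0, 3.0])]
wind_speed = build_wind_speed_array(readings, wind_farm_ct=2)
power = to_power(impute(wind_speed))
```

Separating the stages means imputation sees the whole time series at once, so a gap in hour i can be filled from hours i-1 and i+1 before any power conversion happens.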

Testing

Unit tests still pass, and this has been tested end-to-end when generating 2020 wind power profiles for the HIFLD grid (see #227 (comment)). When downloading the 2020 data, four files downloaded empty, even after several attempts, suggesting that the data are missing from the NOAA server.

Usage Example/Visuals

from datetime import datetime
from powersimdata import Grid
from prereise.gather.winddata.hrrr.hrrr import retrieve_data
from prereise.gather.winddata.hrrr.calculations import calculate_pout

# Time range to process and the directory holding the downloaded files
start_dt = datetime.fromisoformat("2020-01-01")
end_dt = datetime.fromisoformat("2021-01-01")
directory = "./"

# Select onshore and offshore wind farms from the HIFLD grid
grid = Grid("USA", "hifld")
wind_farms = grid.plant.query("type == 'wind' or type == 'wind_offshore'").copy()
wind_farms["state_abv"] = wind_farms.zone_id.map(grid.model_immutables.zones["id2abv"])
# Download the HRRR files, then compute the wind power output
retrieve_data(start_dt=start_dt, end_dt=end_dt, directory=directory)
df = calculate_pout(wind_farms=wind_farms, start_dt=start_dt, end_dt=end_dt, directory=directory)

Time estimate

15-30 minutes.

Collaborator

jenhagg left a comment

Makes sense

)
for j in range(wind_farm_ct)
]
for i, _ in tqdm(enumerate(dts))
Collaborator

Do we need the enumerate here? It seems we just need the length of the dts

Contributor Author

enumerate() vs. range(len()), which is more pythonic?

Collaborator

for i in tqdm(dts)?

Collaborator

BainanXia Dec 7, 2021

@jon-hagg I think we will need the index rather than the element of dts here.

Collaborator

tqdm(dts, total=len(dts)) would work, but I'm not sure that's better than enumerate/range. It's unclear to me which is the most pythonic, but I think enumerate is fine.
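
For reference, a small sketch of the options discussed in this thread, assuming only the index is needed inside the loop. The `dts` list here is made up, and the import fallback exists only so the snippet runs where tqdm is not installed.

```python
from datetime import datetime, timedelta

try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch runs without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable


# Made-up list of hourly timestamps standing in for dts.
dts = [datetime(2020, 1, 1) + timedelta(hours=h) for h in range(4)]

# enumerate() returns an iterator without __len__, so tqdm cannot infer
# the total on its own; passing total= restores the full progress bar.
indices_enumerate = [i for i, _ in tqdm(enumerate(dts), total=len(dts))]

# range() has a length, so tqdm picks up the total automatically.
indices_range = [i for i in tqdm(range(len(dts)))]
```

Both loops yield the same indices; the practical difference is whether the progress bar can show a percentage without being told the total.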

Contributor Author

We could also refactor how we build wind_speed_data (as a dataframe instead of a numpy array), and then we could build wind_power_data using apply calls instead of list comprehensions.

Collaborator

Contributor Author

danielolsen Dec 7, 2021

Aha. Maybe I'll leave this alone for now then.

EDIT: Or maybe the pandas version can be made a little more transparent by instantiating the dataframe with index= and columns=...
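
A rough sketch of what the dataframe-based variant might look like, assuming the frame is instantiated with index= and columns= and the power conversion goes through apply. The plant ids, the sample values, and `toy_power_curve` are hypothetical and do not come from the merged code.

```python
import numpy as np
import pandas as pd

dts = pd.date_range("2020-01-01", periods=3, freq=pd.Timedelta(hours=1))
wind_farm_ids = [101, 102]  # hypothetical plant ids

# Pre-allocate a NaN-filled frame: one row per timestamp, one column per farm.
wind_speed_data = pd.DataFrame(index=dts, columns=wind_farm_ids, dtype=float)
wind_speed_data.loc[dts[0]] = [3.0, 12.0]
wind_speed_data.loc[dts[2]] = [12.0, 3.0]  # dts[1] stays NaN (empty file)

# Fill the gap by linear interpolation down each column.
wind_speed_data = wind_speed_data.interpolate(method="linear")


def toy_power_curve(speeds: pd.Series) -> pd.Series:
    """Map one farm's wind speeds to normalized power (made-up curve)."""
    power = np.interp(speeds, [0.0, 3.0, 12.0, 25.0], [0.0, 0.0, 1.0, 1.0])
    return pd.Series(power, index=speeds.index)


# apply() calls the function once per column, replacing a nested
# list comprehension over timestamps and farms.
wind_power_data = wind_speed_data.apply(toy_power_curve)
```

Labeling rows and columns up front keeps the timestamps and plant ids attached to the values through every step.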

Contributor Author

Done. I tried to strike a good balance between compactness and readability.

Collaborator

BainanXia left a comment

Good catch. Thanks!

danielolsen force-pushed the daniel/hrrr_gap_tolerant branch 2 times, most recently from 6817900 to ab47e24, on December 7, 2021 22:52
danielolsen force-pushed the daniel/hrrr_gap_tolerant branch from ab47e24 to 535b4d9 on December 8, 2021 20:00
danielolsen merged commit 3ce5915 into develop on Dec 8, 2021
danielolsen deleted the daniel/hrrr_gap_tolerant branch on December 8, 2021 20:07
Development

Successfully merging this pull request may close these issues.

Any missing HRRR data causes calculation of power output to fail
4 participants