Pipeline setup step 3: Add eval.py #94
Conversation
scripts/eval.py (Outdated)

"""Evaluate the forecasts for all models, all forecast ends, and all scores"""
score_names = config["score_funs"]
model_names = pred["model"].unique()
forecast_ends = pred["forecast_end"].unique()
Based on our discussion in #93, this should probably be forecast_starts instead. This applies in several places below, too (e.g., lines 28, 30, etc.).
scripts/eval.py (Outdated)

model_names = pred["model"].unique()
forecast_ends = pred["forecast_end"].unique()

# only 'incident' type is evaluated
Seems to me that cumulative predictions should be evaluated, not incident. On one hand, evaluating incident predictions seems more "fair" because just one error early in a time series can cause cumulative predictions to look bad for an entire trajectory, even if most of the subsequent incident predictions are quite accurate. On the other hand, being "fair" to the models isn't really what matters; being correct is what matters. And cumulative uptake on a given date is what we're trying to be correct about. @Fuhan-Yang do you have counterarguments? @swo, what do you think?
I think it depends on the metrics we use. For example, MSPE strictly requires the expected values of the model output for its calculation (see definition here). This means we should only use the direct output (incident data) to compute MSPE, without modifying anything; otherwise we are not using MSPE correctly. However, we can define metrics of interest that use cumulative data. For example, we also calculated the absolute error of end-of-season uptake; I used the cumulative projections (derived from the incident projections) and the cumulative test data to calculate that.

As for the code structure, we use incident projections as input for all the metrics, and we compute the cumulative projections within the metric function if needed. I would suggest we keep the existing structure, and we can always add metrics of interest for cumulative projections as we go!
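For concreteness, a minimal sketch of that structure, assuming incident inputs with "date" and "estimate" columns (the column and function names are assumptions, not the actual eval.py code):

```python
# Hypothetical metric in the structure described above: it takes incident
# projections and incident test data, and only cumulates internally because
# this particular metric needs end-of-season (cumulative) uptake.
import polars as pl


def abs_error_end_of_season(pred: pl.DataFrame, test: pl.DataFrame) -> float:
    """Absolute error of end-of-season uptake, built from incident values."""
    # end-of-season cumulative uptake is the sum of the incident values
    pred_total = pred.select(pl.col("estimate").sum()).item()
    obs_total = test.select(pl.col("estimate").sum()).item()
    return abs(pred_total - obs_total)
```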
Some metrics will clearly be based on cumulative uptake (e.g., end-of-season uptake). For everything else, I think we'll want to score both incident and cumulative. We could consider those as pairs of metrics, e.g., we have incident MSPE and cumulative MSPE.
If there's a model that performs well in incident space but not cumulative space (or vice versa), that's interesting and important, and we would have missed it by choosing to evaluate only one or the other.
If the models have the same relative performance, whether using incident or cumulative, then it's an academic question and we can pick either one.
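As a sketch of what such a pair could look like (same assumed column names as above, not the project's actual API), the cumulative version is just the incident MSPE applied to running sums:

```python
# Hypothetical paired metric: incident MSPE and cumulative MSPE computed from
# the same incident forecast and incident test data.
import polars as pl


def paired_mspe(pred: pl.DataFrame, test: pl.DataFrame) -> dict[str, float]:
    joined = pred.join(test, on="date", suffix="_obs").sort("date")
    incident_err = pl.col("estimate") - pl.col("estimate_obs")
    cumulative_err = pl.col("estimate").cum_sum() - pl.col("estimate_obs").cum_sum()
    return {
        "incident_mspe": joined.select((incident_err**2).mean()).item(),
        "cumulative_mspe": joined.select((cumulative_err**2).mean()).item(),
    }
```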
Just opened issue #100 to record the evaluation metrics. I will add metrics on the sample distribution in the future; feel free to add your metrics of interest. As for MSPE, I still think it's not feasible for cumulative predictions, because we optimized the model using ordinary least squares, minimizing the difference between the fitted line and the training data, so it is natural to use the same measure on the predictions and the testing data. I think MSPE should be consistent between the training data and the predictions; that is, we shouldn't train the model on incident data but evaluate it with cumulative data. The mismatch of data types may lead to ambiguous results and challenges in interpretation. But I'm open to using metrics that are sensible for both incident and cumulative predictions; feel free to add them to that issue if you find some!
scripts/eval.py (Outdated)

obs_data = obs_data.filter(pl.col("estimate_type") == "incident")

# ensure the same test data is used for all models
test_data = iup.IncidentUptakeData.split_train_test(
I think there will need to be a loop outside of eval_all_forecasts to split out the test data differently for each forecast_start date, once #93 iterates over forecast_start instead of forecast_end.
Once forecast_start is fixed here (01e2396), the loop iterating over forecast_start is inside of eval_all_forecast()! See line 28.
But eval_all_forecasts accepts exactly one test data set, right? As forecast_start changes, the test data set also changes (for later forecast_start dates, the test data set shrinks, just as the training data set grows). So as we iterate over different forecast_start dates, we'll need different test data sets, which must be given to eval_all_forecasts individually. That's why it seems to me like the loop over forecast_start must be outside eval_all_forecasts. Does that make sense?
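Roughly this kind of structure, for example (a sketch only: split_test_data is a hypothetical helper, the "date" and "forecast_start" column names are assumptions, and eval_all_forecasts is assumed to take a prediction table, a test table, and the config):

```python
import polars as pl


def split_test_data(obs_data: pl.DataFrame, forecast_start) -> pl.DataFrame:
    # hypothetical helper: keep only observations on or after forecast_start
    return obs_data.filter(pl.col("date") >= forecast_start)


def eval_over_forecast_starts(pred: pl.DataFrame, obs_data: pl.DataFrame, config: dict):
    all_scores = []
    for forecast_start in pred["forecast_start"].unique().sort():
        # the test set shrinks as forecast_start moves later in the season
        test_data = split_test_data(obs_data, forecast_start)
        pred_subset = pred.filter(pl.col("forecast_start") == forecast_start)
        # eval_all_forecasts is the existing function discussed in this PR;
        # its signature here is assumed for illustration
        all_scores.append(eval_all_forecasts(pred_subset, test_data, config))
    return pl.concat(all_scores)
```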
That's a good catch! Yes, I should split out the test data to match the time period of the predictions. How about inputting the entire data set and filtering the test data to the same time period as each prediction, in the innermost layer of the for loop (between lines 29 and 41)?
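Something along these lines for the innermost loop, perhaps (the "date" column and the obs_data/pred_subset variable names are assumptions, not the actual code):

```python
import polars as pl

# inside the innermost loop, once pred_subset holds one model's predictions
# for one forecast period: restrict the test data to that same time window
test_data = obs_data.filter(
    pl.col("date").is_between(pred_subset["date"].min(), pred_subset["date"].max())
)
```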
Sure, that sounds like an equally fine solution!
See 83d5341
I think there might be some confusion here about what "forecast start" and "forecast end" mean?
My understanding was that the "timeframe" parameter in the config was a series of forecast dates. On each forecast date, we make a forecast for every target date. The target dates are every week from the forecast date to the end of the season.
You are correct about the "timeframe" config parameters: "start" is the forecast date, "end" is the end-of-season, and target dates are (currently set to) weekly in between. So forecasts are trajectories from "start" to "end." But @Fuhan-Yang is also interested in how model performance changes as "start" is pushed back later in the season. So she not only produces a forecast trajectory from "start" to "end" but also produces incrementally shorter forecast trajectories as the "start" date gets pushed closer to the "end" date.
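To make that concrete, a purely illustrative sketch (the dates and the monthly/weekly intervals are example values, not the real config):

```python
import datetime
import polars as pl

season_end = datetime.date(2024, 6, 30)    # "end": end of the season (example value)
first_start = datetime.date(2023, 10, 1)   # earliest forecast date ("start", example value)

# forecast dates pushed progressively later in the season (monthly, for illustration)
forecast_starts = pl.date_range(first_start, season_end, interval="1mo", eager=True)

for forecast_start in forecast_starts:
    # target dates: every week from the forecast date to the end of the season,
    # so each trajectory gets shorter as the forecast date approaches season_end
    target_dates = pl.date_range(forecast_start, season_end, interval="1w", eager=True)
    print(forecast_start, f"{len(target_dates)} weekly target dates")
```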
Yes, in general we should expect better performance for shorter forecast horizons (i.e., time from forecast date to target date). The later in the season the forecast date, the better the prediction of the end-of-season uptake.
"forecast start" here is the next day of the end of training period and the day of the beginning of forecast. As "forecast start" incrementally approaches to "forecast end" (which is the end of season by default) done by run_all_forecast
(), we can see how the performance changes along time.
Just updated the eval.py to be compatible with the returned cumulative predictions. See f9f1526 |
This PR adds the last piece of functionality in the pipeline: evaluation. The metrics will be calculated for each model and each forecast end date. This PR hasn't been tested with real data yet, i.e., the pipeline hasn't been run end to end.