
Pipeline setup step 3: Add eval.py #94

Merged: 8 commits into main from fy_config_eval on Jan 15, 2025

Conversation

Fuhan-Yang (Contributor) commented:

This PR adds the last piece of functionality in the pipeline: evaluation. The metrics are calculated for each model and for each forecast end date. This PR hasn't been tested with real data yet, i.e., the pipeline hasn't been run end to end.
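
For anyone reviewing without opening the diff, here is a rough sketch of the structure (simplified; the real eval.py may differ in details, and the SCORE_FUNS registry below is only a placeholder):

```python
import polars as pl

# Placeholder registry mapping the names in config["score_funs"] to metric functions.
SCORE_FUNS: dict = {}


def eval_all_forecasts(pred: pl.DataFrame, test_data: pl.DataFrame, config: dict) -> pl.DataFrame:
    """Score every (model, forecast end, metric) combination. Sketch only."""
    score_names = config["score_funs"]
    model_names = pred["model"].unique()
    forecast_ends = pred["forecast_end"].unique()

    rows = []
    for model in model_names:
        for end in forecast_ends:
            # predictions from one model for one forecast end date
            sub = pred.filter((pl.col("model") == model) & (pl.col("forecast_end") == end))
            for score_name in score_names:
                rows.append(
                    {
                        "model": model,
                        "forecast_end": end,
                        "score": score_name,
                        "value": SCORE_FUNS[score_name](sub, test_data),
                    }
                )
    return pl.DataFrame(rows)
```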

@Fuhan-Yang requested review from swo and eschrom on January 10, 2025
@Fuhan-Yang changed the title from "Pipeline setup step 2: Add eval.py" to "Pipeline setup step 3: Add eval.py" on January 10, 2025
@eschrom (Collaborator) left a comment:

My comments here are largely just the fallout of comments on #92 and #93. Plus, I think it's also worth debating whether we should be evaluating incident vs. cumulative uptake.

scripts/eval.py (outdated excerpt):
"""Evaluate the forecasts for all models, all forecast ends, and all scores"""
score_names = config["score_funs"]
model_names = pred["model"].unique()
forecast_ends = pred["forecast_end"].unique()

Collaborator:

Based on our discussion of #93, this should probably be forecast_starts instead. This applies in several places below, too (e.g., lines 28 and 30 of eval.py, etc.).

scripts/eval.py (outdated excerpt):
model_names = pred["model"].unique()
forecast_ends = pred["forecast_end"].unique()

# only 'incident' type is evaluated #

Collaborator:

Seems to me that cumulative predictions should be evaluated, not incident. On one hand, evaluating incident predictions seems more "fair" because just one error early in a time series can cause cumulative predictions to look bad for an entire trajectory, even if most of the subsequent incident predictions are quite accurate. On the other hand, being "fair" to the models isn't really what matters; being correct is what matters. And cumulative uptake on a given date is what we're trying to be correct about. @Fuhan-Yang do you have counterarguments? @swo, what do you think?

Fuhan-Yang (Contributor, author):

I think it depends on the metrics we use. For example, MSPE strictly requires the expected values of the model output (see the linked definition). That means we should compute MSPE on the direct output (incident data), without modifying anything; otherwise we are not using MSPE correctly. However, we can define our own metrics that use cumulative data. For example, we also calculate the absolute error of end-of-season uptake: I used the cumulative projections (built from the incident projections) and the cumulative test data for that. As for the code structure, incident projections are the input for all metrics, and the cumulative projections are computed within the metric function when needed. I would suggest we keep the existing structure; we can always add metrics of interest on cumulative projections as we go!
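
To illustrate the pattern (incident projections in, cumulative built inside the metric when a metric needs it), here is a minimal sketch; the "date" and "estimate" column names are placeholders, not necessarily the actual schema:

```python
import polars as pl


def abs_error_end_of_season(pred: pl.DataFrame, test: pl.DataFrame) -> float:
    """Absolute error of end-of-season uptake, computed from incident inputs.

    Sketch only; column names ("date", "estimate") are placeholders.
    """
    # build the cumulative series inside the metric, then compare the final values
    pred_end = pred.sort("date")["estimate"].cum_sum().tail(1).item()
    test_end = test.sort("date")["estimate"].cum_sum().tail(1).item()
    return abs(pred_end - test_end)
```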

Collaborator:

Some metrics will clearly be based on cumulative uptake (e.g., end-of-season uptake). For everything else, I think we'll want to score both incident and cumulative. We could consider those as pairs of metrics, e.g., we have incident MSPE and cumulative MSPE.

If there's a model that performs well in incident space but not cumulative space (or vice versa), that's interesting and important, and we would have missed it by choosing to evaluate only one or the other.

If the models have the same relative performance, whether using incident or cumulative, then it's an academic question and we can pick either one.
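
One way to get those pairs without duplicating code would be a single scorer with an incident/cumulative switch; a rough sketch (the "date"/"estimate" column names and the join-on-date alignment are assumptions, not the pipeline's actual schema):

```python
import polars as pl


def mspe(pred: pl.DataFrame, test: pl.DataFrame, space: str = "incident") -> float:
    """Mean squared prediction error in either incident or cumulative space. Sketch only."""
    if space == "cumulative":
        # convert both incident series to cumulative uptake before scoring
        pred = pred.sort("date").with_columns(pl.col("estimate").cum_sum())
        test = test.sort("date").with_columns(pl.col("estimate").cum_sum())
    # align predictions with observations on the target date
    joined = pred.join(test, on="date", suffix="_obs")
    return ((joined["estimate"] - joined["estimate_obs"]) ** 2).mean()


# The pair of metrics for one model would then be:
#   mspe(pred_one_model, test_data, space="incident")
#   mspe(pred_one_model, test_data, space="cumulative")
```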

Fuhan-Yang (Contributor, author):

Just opened issue #100 to record the evaluation metrics; I will add metrics on the sample distribution in the future. Feel free to add any metric of interest there. As for MSPE, I still think it isn't appropriate for cumulative predictions, because we optimized the model with ordinary least squares, minimizing the difference between the fitted line and the training data, so it is natural to use the same measure on the predictions and the test data. In other words, MSPE should be consistent between training and prediction: we shouldn't train the model on incident data but evaluate it with cumulative data. That mismatch of data types may lead to ambiguous results and challenges in interpretation. But I'm open to metrics that are sensible for both incident and cumulative predictions; feel free to add them to the issue if you find some!
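
For reference, by MSPE I mean the usual mean squared prediction error (generic notation, not copied verbatim from the linked definition):

$$
\mathrm{MSPE} = \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2,
$$

where $y_t$ is the observed incident uptake at target date $t$ in the test data and $\hat{y}_t$ is the model's expected (point) prediction for that date.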

scripts/eval.py (outdated excerpt):
obs_data = obs_data.filter(pl.col("estimate_type") == "incident")

# ensure the same test data is used for all models
test_data = iup.IncidentUptakeData.split_train_test(

Collaborator:

I think there will need to be a loop outside of eval_all_forecasts to split out the test data differently for each different forecast_start date, once #93 iterates over forecast_start instead of forecast_end.

Fuhan-Yang (Contributor, author):

Once forecast_start is fixed in 01e2396, the loop iterating over forecast_start is inside eval_all_forecasts. See line 28 of eval.py.

Collaborator:

But eval_all_forecasts accepts exactly one test data set, right? As forecast_start changes, the test data set also changes (for later forecast_start dates, the test data set shrinks, just like the training data set grows). So as we iterate over different forecast_start dates, we'll need different test data sets, which must be given to eval_all_forecasts individually. That's why it seems to me like the loop over forecast_start must be outside eval_all_forecasts. Does that make sense?

Fuhan-Yang (Contributor, author):

That's a good catch! Yes, I should make the test data the same length as the predictions. How about passing in the entire data set and filtering the test data to the same time period as each prediction, in the innermost layer of the for loop (between lines 29 and 41)?
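
Roughly what I have in mind (column names "forecast_start" and "date" are placeholders for whatever the prediction frame actually carries):

```python
import polars as pl


def eval_by_forecast_start(pred: pl.DataFrame, obs_data: pl.DataFrame) -> None:
    """Sketch only: pass the full observed data in, and carve out the matching
    test window inside the loop over forecast starts."""
    for start in pred["forecast_start"].unique():
        pred_one = pred.filter(pl.col("forecast_start") == start)
        # test data covers the same time period as this prediction
        test_one = obs_data.filter(
            (pl.col("date") >= start) & (pl.col("date") <= pred_one["date"].max())
        )
        # ... compute each score for pred_one against test_one here ...
```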

Collaborator:

Sure, that sounds like an equally fine solution!

Fuhan-Yang (Contributor, author):

See 83d5341

Collaborator:

I think there might be some confusion here about what "forecast start" and "forecast end" mean?

My understanding was that the "timeframe" parameter in the config was a series of forecast dates. On each forecast date, we make a forecast for every target date. The target dates are every week from the forecast date to the end of the season.

@eschrom (Collaborator), Jan 15, 2025:

You are correct about the "timeframe" config parameters: "start" is the forecast date, "end" is the end-of-season, and target dates are (currently set to) weekly in between. So forecasts are trajectories from "start" to "end." But @Fuhan-Yang is also interested in how model performance changes as "start" is pushed back later in the season. So she not only produces a forecast trajectory from "start" to "end" but also produces incrementally shorter forecast trajectories as the "start" date gets pushed closer to the "end" date.
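
For concreteness, here is an illustrative shape for that config block and how the target dates would follow from it; the key names, dates, and interval below are placeholders, not the repo's actual schema:

```python
import polars as pl
from datetime import date

# Placeholder values: the real key names, dates, and interval live in the repo's config.
timeframe = {"start": date(2024, 9, 1), "end": date(2025, 6, 30), "interval": "1w"}

# Target dates: every week from the forecast date ("start") to the end of season ("end").
target_dates = pl.date_range(
    timeframe["start"], timeframe["end"], interval=timeframe["interval"], eager=True
)
# Rerunning with "start" pushed later in the season yields the incrementally
# shorter forecast trajectories described above.
```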

Collaborator:

Yes, in general we should expect better performance for shorter forecast horizons (i.e., time from forecast date to target date). The later in the season the forecast date, the better the prediction of the end-of-season uptake.

Fuhan-Yang (Contributor, author):

"forecast start" here is the next day of the end of training period and the day of the beginning of forecast. As "forecast start" incrementally approaches to "forecast end" (which is the end of season by default) done by run_all_forecast(), we can see how the performance changes along time.

@Fuhan-Yang requested a review from eschrom on January 13, 2025

Fuhan-Yang (Contributor, author):

Just updated eval.py to be compatible with the returned cumulative predictions. See f9f1526.

@eschrom (Collaborator) left a comment:

I think all the changes in the eval script make sense to me now, although I must admit it is hard to imagine possible bugs without running the whole pipeline. I think we should merge #93 and #94, and then I volunteer to debug the pipeline from start to finish.

@Fuhan-Yang merged commit 6b22f09 into main on Jan 15, 2025 (2 checks passed)
@Fuhan-Yang deleted the fy_config_eval branch on January 15, 2025