Pipeline setup step 3: Add eval.py #94
Conversation
scripts/eval.py (Outdated)

"""Evaluate the forecasts for all models, all forecast ends, and all scores"""
score_names = config["score_funs"]
model_names = pred["model"].unique()
forecast_ends = pred["forecast_end"].unique()
Based on our discussion in #93, this should probably be forecast_starts instead. This applies in several places below, too (e.g., lines 28, 30, etc.).
scripts/eval.py (Outdated)

model_names = pred["model"].unique()
forecast_ends = pred["forecast_end"].unique()

# only 'incident' type is evaluated
Seems to me that cumulative predictions should be evaluated, not incident. On one hand, evaluating incident predictions seems more "fair" because just one error early in a time series can cause cumulative predictions to look bad for an entire trajectory, even if most of the subsequent incident predictions are quite accurate. On the other hand, being "fair" to the models isn't really what matters; being correct is what matters. And cumulative uptake on a given date is what we're trying to be correct about. @Fuhan-Yang do you have counterarguments? @swo, what do you think?
I think it depends on the metrics we use. For example, MSPE strictly requires the expected values of the model output for its calculation (see definition here). This means we should only use the direct output (incident data) to compute MSPE, without modifying anything; otherwise we are not using MSPE correctly. However, we can define metrics of interest that use cumulative data. For example, we also calculated the absolute error of end-of-season uptake; I used the cumulative projections (derived from the incident projections) and the cumulative test data to calculate that.

As for the code structure, we use incident projections as input for all the metrics, and we compute the cumulative projections within the metric function if needed. I would suggest we keep the existing structure, and we can always add metrics of interest for cumulative projections as we go!
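For concreteness, a minimal sketch of that structure, assuming incident inputs with "date" and "estimate" columns (the column and function names are assumptions, not the actual eval.py code):

```python
# Hypothetical metric in the structure described above: it takes incident
# projections and incident test data, and only cumulates internally because
# this particular metric needs end-of-season (cumulative) uptake.
import polars as pl


def abs_error_end_of_season(pred: pl.DataFrame, test: pl.DataFrame) -> float:
    """Absolute error of end-of-season uptake, built from incident values."""
    # end-of-season cumulative uptake is the sum of the incident values
    pred_total = pred.select(pl.col("estimate").sum()).item()
    obs_total = test.select(pl.col("estimate").sum()).item()
    return abs(pred_total - obs_total)
```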
Some metrics will clearly be based on cumulative uptake (e.g., end-of-season uptake). For everything else, I think we'll want to score both incident and cumulative. We could consider those as pairs of metrics, e.g., we have incident MSPE and cumulative MSPE.
If there's a model that performs well in incident space but not cumulative space (or vice versa), that's interesting and important, and we would have missed it by choosing to evaluate only one or the other.
If the models have the same relative performance, whether using incident or cumulative, then it's an academic question and we can pick either one.
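As a sketch of what such a pair could look like (same assumed column names as above, not the project's actual API), the cumulative version is just the incident MSPE applied to running sums:

```python
# Hypothetical paired metric: incident MSPE and cumulative MSPE computed from
# the same incident forecast and incident test data.
import polars as pl


def paired_mspe(pred: pl.DataFrame, test: pl.DataFrame) -> dict[str, float]:
    joined = pred.join(test, on="date", suffix="_obs").sort("date")
    incident_err = pl.col("estimate") - pl.col("estimate_obs")
    cumulative_err = pl.col("estimate").cum_sum() - pl.col("estimate_obs").cum_sum()
    return {
        "incident_mspe": joined.select((incident_err**2).mean()).item(),
        "cumulative_mspe": joined.select((cumulative_err**2).mean()).item(),
    }
```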
Just opened issue #100 to record the evaluation metrics. I will add metrics on the sample distribution in the future; feel free to add your metrics of interest. As for MSPE, I still think it's not feasible for cumulative predictions, because we optimized the model using ordinary least squares, minimizing the difference between the fitted line and the training data, so it is natural to use the same measure on the predictions and the testing data. I think MSPE should be consistent between the training data and the predictions; that is, we shouldn't train the model on incident data but evaluate it with cumulative data. The mismatch of data types may lead to ambiguous results and challenges in interpretation. But I'm open to using metrics that are sensible for both incident and cumulative predictions; feel free to add them to that issue if you find some!
scripts/eval.py (Outdated)

obs_data = obs_data.filter(pl.col("estimate_type") == "incident")

# ensure the same test data is used for all models
test_data = iup.IncidentUptakeData.split_train_test(
I think there will need to be a loop outside of eval_all_forecasts to split out the test data differently for each forecast_start date, once #93 iterates over forecast_start instead of forecast_end.
Once forecast_start is fixed here (01e2396), the loop iterating over forecast_start is inside of eval_all_forecast()! See line 28.
But eval_all_forecasts accepts exactly one test data set, right? As forecast_start changes, the test data set also changes (for later forecast_start dates, the test data set shrinks, just as the training data set grows). So as we iterate over different forecast_start dates, we'll need different test data sets, which must be given to eval_all_forecasts individually. That's why it seems to me like the loop over forecast_start must be outside eval_all_forecasts. Does that make sense?
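Roughly this kind of structure, for example (a sketch only: split_test_data is a hypothetical helper, the "date" and "forecast_start" column names are assumptions, and eval_all_forecasts is assumed to take a prediction table, a test table, and the config):

```python
import polars as pl


def split_test_data(obs_data: pl.DataFrame, forecast_start) -> pl.DataFrame:
    # hypothetical helper: keep only observations on or after forecast_start
    return obs_data.filter(pl.col("date") >= forecast_start)


def eval_over_forecast_starts(pred: pl.DataFrame, obs_data: pl.DataFrame, config: dict):
    all_scores = []
    for forecast_start in pred["forecast_start"].unique().sort():
        # the test set shrinks as forecast_start moves later in the season
        test_data = split_test_data(obs_data, forecast_start)
        pred_subset = pred.filter(pl.col("forecast_start") == forecast_start)
        # eval_all_forecasts is the existing function discussed in this PR;
        # its signature here is assumed for illustration
        all_scores.append(eval_all_forecasts(pred_subset, test_data, config))
    return pl.concat(all_scores)
```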
That's a good catch! Yes, I should split out the test data to match the time period of the predictions. How about inputting the entire data set and filtering the test data to the same time period as each prediction, in the innermost layer of the for loop (between lines 29 and 41)?
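Something along these lines for the innermost loop, perhaps (the "date" column and the obs_data/pred_subset variable names are assumptions, not the actual code):

```python
import polars as pl

# inside the innermost loop, once pred_subset holds one model's predictions
# for one forecast period: restrict the test data to that same time window
test_data = obs_data.filter(
    pl.col("date").is_between(pred_subset["date"].min(), pred_subset["date"].max())
)
```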
Sure, that sounds like an equally fine solution!
See 83d5341
I think there might be some confusion here about what "forecast start" and "forecast end" mean?
My understanding was that the "timeframe" parameter in the config was a series of forecast dates. On each forecast date, we make a forecast for every target date. The target dates are every week from the forecast date to the end of the season.
You are correct about the "timeframe" config parameters: "start" is the forecast date, "end" is the end-of-season, and target dates are (currently set to) weekly in between. So forecasts are trajectories from "start" to "end." But @Fuhan-Yang is also interested in how model performance changes as "start" is pushed back later in the season. So she not only produces a forecast trajectory from "start" to "end" but also produces incrementally shorter forecast trajectories as the "start" date gets pushed closer to the "end" date.
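To make that concrete, a purely illustrative sketch (the dates and the monthly/weekly intervals are example values, not the real config):

```python
import datetime
import polars as pl

season_end = datetime.date(2024, 6, 30)    # "end": end of the season (example value)
first_start = datetime.date(2023, 10, 1)   # earliest forecast date ("start", example value)

# forecast dates pushed progressively later in the season (monthly, for illustration)
forecast_starts = pl.date_range(first_start, season_end, interval="1mo", eager=True)

for forecast_start in forecast_starts:
    # target dates: every week from the forecast date to the end of the season,
    # so each trajectory gets shorter as the forecast date approaches season_end
    target_dates = pl.date_range(forecast_start, season_end, interval="1w", eager=True)
    print(forecast_start, f"{len(target_dates)} weekly target dates")
```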
Yes, in general we should expect better performance for shorter forecast horizons (i.e., time from forecast date to target date). The later in the season the forecast date, the better the prediction of the end-of-season uptake.
"forecast start" here is the next day of the end of training period and the day of the beginning of forecast. As "forecast start" incrementally approaches to "forecast end" (which is the end of season by default) done by run_all_forecast
(), we can see how the performance changes along time.
Just updated the eval.py to be compatible with the returned cumulative predictions. See f9f1526 |
This PR adds the last piece of functionality in the pipeline: evaluation. The metrics will be calculated for each model and each forecast end date. This PR hasn't been tested with real data yet, i.e., the pipeline hasn't been run end to end.