Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Y_df parameter in reconcile() method requiring more columns than necessary when using TopDown or MiddleOut #329

Closed
macarw opened this issue Feb 16, 2025 · 9 comments · Fixed by #330

Comments

@macarw
Copy link

macarw commented Feb 16, 2025

Hi! Thanks for this amazing library.

I was trying to reproduce the example notebook TourismSmall.ipynb. In the example, you use BottomUp, TopDown with "forecast_proportions", and MiddleOut with "forecast_proportions", none of which require in-sample data.

However, when attempting to use the remaining TopDown and MiddleOut methods that are based on historical proportions ("average_proportions" and "proportion_averages"), and which do require in-sample data, I encountered the following errors:

ColumnNotFoundError: The following columns were not found: [‘AutoARIMA’]
or
ColumnNotFoundError: The following columns were not found: [‘Naive']

To clarify, the code is identical to the notebook, except that I added four additional reconcilers to the reconcilers list:

reconcilers = [
    # Used in the example notebook
    BottomUp(),
    TopDown(method="forecast_proportions"),
    MiddleOut(middle_level="Country/Purpose/State", top_down_method="forecast_proportions"),
    
    # Added for testing, raises the errors mentioned
    #TopDown(method=“average_proportions"),
    #TopDown(method=“proportion_averages"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="average_proportions"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="proportion_averages"),
]

In the notebook, the training dataset containing the columns unique_id, ds and y are passed as Y_df. This should be sufficient to compute historical proportions, yet these methods also requires Y_df to include the columns from Y_hat_df (e.g., AutoARIMA and Naive).

Why do we need to include in-sample predictions to use these methods, even though they should not rely on them?
I found a similar previous issue.

I’d really appreciate any clarification.
Thanks in advance.

@macarw
Copy link
Author

macarw commented Feb 16, 2025

I just encountered an even stranger related behavior:

Following the same steps as in the notebook, I tested the following configurations for the reconcilers list:

✅ Works:

reconcilers = [
   TopDown(method='forecast_proportions'),
  ]

✅ Works:

reconcilers = [
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]

However, combining these four working reconcilers causes an error:

❌ Fails:

reconcilers = [
   TopDown(method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]
Error:
ColumnNotFoundError: The following columns were not found: ['AutoARIMA']

It’s unexpected that individually working reconcilers would fail when combined, right?


Moreover, why would MiddleOut with historical proportions work (no need for in-sample predictions in Y_df), but not TopDown with historical proportions? Adding NaN/filled columns makes it work and produce results, but it seems like a rather odd workaround.

I am using the latest version 1.0.1

@elephaint
Copy link
Contributor

elephaint commented Feb 17, 2025

Hi! Thanks for this amazing library.

I was trying to reproduce the example notebook TourismSmall.ipynb. In the example, you use BottomUp, TopDown with "forecast_proportions", and MiddleOut with "forecast_proportions", none of which require in-sample data.

However, when attempting to use the remaining TopDown and MiddleOut methods that are based on historical proportions ("average_proportions" and "proportion_averages"), and which do require in-sample data, I encountered the following errors:

ColumnNotFoundError: The following columns were not found: [‘AutoARIMA’] or ColumnNotFoundError: The following columns were not found: [‘Naive']

To clarify, the code is identical to the notebook, except that I added four additional reconcilers to the reconcilers list:

reconcilers = [
    # Used in the example notebook
    BottomUp(),
    TopDown(method="forecast_proportions"),
    MiddleOut(middle_level="Country/Purpose/State", top_down_method="forecast_proportions"),
    
    # Added for testing, raises the errors mentioned
    #TopDown(method=“average_proportions"),
    #TopDown(method=“proportion_averages"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="average_proportions"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="proportion_averages"),
]

In the notebook, the training dataset containing the columns unique_id, ds and y are passed as Y_df. This should be sufficient to compute historical proportions, yet these methods also requires Y_df to include the columns from Y_hat_df (e.g., AutoARIMA and Naive).

Why do we need to include in-sample predictions to use these methods, even though they should not rely on them? I found a similar previous issue.

I’d really appreciate any clarification. Thanks in advance.

Thanks! This sounds like #290

@elephaint
Copy link
Contributor

elephaint commented Feb 17, 2025

I just encountered an even stranger related behavior:

Following the same steps as in the notebook, I tested the following configurations for the reconcilers list:

✅ Works:

reconcilers = [
   TopDown(method='forecast_proportions'),
  ]

✅ Works:

reconcilers = [
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]

However, combining these four working reconcilers causes an error:

❌ Fails:

reconcilers = [
   TopDown(method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]
Error:
ColumnNotFoundError: The following columns were not found: ['AutoARIMA']

It’s unexpected that individually working reconcilers would fail when combined, right?


Moreover, why would MiddleOut with historical proportions work (no need for in-sample predictions in Y_df), but not TopDown with historical proportions? Adding NaN/filled columns makes it work and produce results, but it seems like a rather odd workaround.

I am using the latest version 1.0.1

The first behaviour is strange, I'll look into it.

Second point seems again behaviour of #290

@macarw
Copy link
Author

macarw commented Feb 17, 2025

Thanks for the quick response, Olivier.

I think this might be a different behavior than #290.

In #290, the discussion was about needing base forecasts for all levels in the hierarchy, even when using traditional single-level methods. This meant extra rows are required in Y_hat_df.

What I am noticing here is that extra columns are being required in Y_df, specifically the in-sample predictions, even though single-level methods shouldn't need them.

Let me know if that makes sense.
Happy to clarify if needed.

@elephaint
Copy link
Contributor

elephaint commented Feb 17, 2025

average_proportions and proportions_average are based on the insample proportions, not on the historic actual proportions. Thus, the model needs a column AutoARIMA to compute these.

In #290 the same issue occurs, ánd it leads to having too few rows in Y_hat_df. But I agree that they are better handled as different issues.

Concretely, the following statement is incorrect:

However, when attempting to use the remaining TopDown and MiddleOut methods that are based on historical proportions ("average_proportions" and "proportion_averages"), and which do require in-sample data, I encountered the following errors:

TopDown and MiddleOut methods are based on insample proportions, and require in-sample data.

It's a fair question whether this is the best behaviour though, the current behaviour is more general but I suspect most users don't expect it.

@elephaint
Copy link
Contributor

elephaint commented Feb 17, 2025

@macarw I've made a PR that addresses both issues, that should fix them.

The behavior will be:
- If the model insample values are provided, those will be used to compute the proportions (old behavior)
- If they are not provided, the historical values will be used.

Edit: I was incorrect. The behavior is a result of a bug, TopDown only uses historical actual values. It does require the existence of the model insample predictions, but doesn't actually use them. This has an underlying different reason, that also needs to be addressed.

@macarw
Copy link
Author

macarw commented Feb 17, 2025

Thank you, that was super quick 🚀

I’ll take a look at the PR. Before I do, I’d like to clarify my point:

  • I still believe my initial statement is correct. TopDown and MiddleOut methods based on historical proportions do require in-sample data, which is accurate. Why do you say it is incorrect?
  • I am referring to in-sample data (e.g. y col from Y_df) and in-sample predictions data (e.g. AutoARIMA col from Y_df) as two different things.
  • From my testing, overriding the AutoARIMA col with NaN while keeping the y column in Y_df allows all historical proportion methods to work. Since passing a column full of NaNs still works, it seems more like the presence of the column is required but its values are not being used. Because of this, I’m confused about the sentence: "If the model insample values are provided, those will be used to compute the proportions (old behavior)" (Note: I understand that by in-sample values, you're referring to in-sample prediction values. Let me know if it is not the case).

I hope I have expressed myself clearly, and I will be reviewing the changes you have made.

@elephaint
Copy link
Contributor

Thank you, that was super quick 🚀

I’ll take a look at the PR. Before I do, I’d like to clarify my point:

  • I still believe my initial statement is correct. TopDown and MiddleOut methods based on historical proportions do require in-sample data, which is accurate. Why do you say it is incorrect?
  • I am referring to in-sample data (e.g. y col from Y_df) and in-sample predictions data (e.g. AutoARIMA col from Y_df) as two different things.
  • From my testing, overriding the AutoARIMA col with NaN while keeping the y column in Y_df allows all historical proportion methods to work. Since passing a column full of NaNs still works, it seems more like the presence of the column is required but its values are not being used. Because of this, I’m confused about the sentence: "If the model insample values are provided, those will be used to compute the proportions (old behavior)" (Note: I understand that by in-sample values, you're referring to in-sample prediction values. Let me know if it is not the case).

I hope I have expressed myself clearly, and I will be reviewing the changes you have made.

You're right, I am wrong! Apologies. I was put off by our own code base, which requires the insample predictions for TopDown, but doesn't actually use it to compute the top-down reconciliation, for which it uses the historical insample (actual) values.

There's a weird quirk requiring the existence of the model insample predictions due to an unfortunate combination of conditional checks in the code.

Anyways, apologies again, I should've read better!

@macarw
Copy link
Author

macarw commented Feb 18, 2025

No worries at all! Thanks for taking another look and for putting together a PR that addresses the issues so quickly. I’ll be testing the new version with these changes once it’s available 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants