`Y_df` parameter in `reconcile()` method requiring more columns than necessary when using TopDown or MiddleOut #329

macarw · 2025-02-16T19:55:33Z

Hi! Thanks for this amazing library.

I was trying to reproduce the example notebook TourismSmall.ipynb. In the example, you use BottomUp, TopDown with "forecast_proportions", and MiddleOut with "forecast_proportions", none of which require in-sample data.

However, when attempting to use the remaining TopDown and MiddleOut methods that are based on historical proportions ("average_proportions" and "proportion_averages"), and which do require in-sample data, I encountered the following errors:

ColumnNotFoundError: The following columns were not found: [‘AutoARIMA’]
or
ColumnNotFoundError: The following columns were not found: [‘Naive']

To clarify, the code is identical to the notebook, except that I added four additional reconcilers to the reconcilers list:

reconcilers = [
    # Used in the example notebook
    BottomUp(),
    TopDown(method="forecast_proportions"),
    MiddleOut(middle_level="Country/Purpose/State", top_down_method="forecast_proportions"),
    
    # Added for testing, raises the errors mentioned
    #TopDown(method=“average_proportions"),
    #TopDown(method=“proportion_averages"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="average_proportions"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="proportion_averages"),
]

In the notebook, the training dataset containing the columns unique_id, ds and y are passed as Y_df. This should be sufficient to compute historical proportions, yet these methods also requires Y_df to include the columns from Y_hat_df (e.g., AutoARIMA and Naive).

Why do we need to include in-sample predictions to use these methods, even though they should not rely on them?
I found a similar previous issue.

I’d really appreciate any clarification.
Thanks in advance.

The text was updated successfully, but these errors were encountered:

macarw · 2025-02-16T23:34:23Z

I just encountered an even stranger related behavior:

Following the same steps as in the notebook, I tested the following configurations for the reconcilers list:

✅ Works:

reconcilers = [
   TopDown(method='forecast_proportions'),
  ]

✅ Works:

reconcilers = [
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]

However, combining these four working reconcilers causes an error:

❌ Fails:

reconcilers = [
   TopDown(method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]

Error:
ColumnNotFoundError: The following columns were not found: ['AutoARIMA']

It’s unexpected that individually working reconcilers would fail when combined, right? 

Moreover, why would MiddleOut with historical proportions work (no need for in-sample predictions in Y_df), but not TopDown with historical proportions? Adding NaN/filled columns makes it work and produce results, but it seems like a rather odd workaround.

I am using the latest version 1.0.1

elephaint · 2025-02-17T12:24:33Z

Hi! Thanks for this amazing library.

I was trying to reproduce the example notebook TourismSmall.ipynb. In the example, you use BottomUp, TopDown with "forecast_proportions", and MiddleOut with "forecast_proportions", none of which require in-sample data.

However, when attempting to use the remaining TopDown and MiddleOut methods that are based on historical proportions ("average_proportions" and "proportion_averages"), and which do require in-sample data, I encountered the following errors:

ColumnNotFoundError: The following columns were not found: [‘AutoARIMA’] or ColumnNotFoundError: The following columns were not found: [‘Naive']

To clarify, the code is identical to the notebook, except that I added four additional reconcilers to the reconcilers list:
reconcilers = [
    # Used in the example notebook
    BottomUp(),
    TopDown(method="forecast_proportions"),
    MiddleOut(middle_level="Country/Purpose/State", top_down_method="forecast_proportions"),
    
    # Added for testing, raises the errors mentioned
    #TopDown(method=“average_proportions"),
    #TopDown(method=“proportion_averages"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="average_proportions"),
    #MiddleOut(middle_level=“Country/Purpose/State", top_down_method="proportion_averages"),
]
In the notebook, the training dataset containing the columns unique_id, ds and y are passed as Y_df. This should be sufficient to compute historical proportions, yet these methods also requires Y_df to include the columns from Y_hat_df (e.g., AutoARIMA and Naive).

Why do we need to include in-sample predictions to use these methods, even though they should not rely on them? I found a similar previous issue.

I’d really appreciate any clarification. Thanks in advance.

Thanks! This sounds like #290

elephaint · 2025-02-17T12:26:03Z

I just encountered an even stranger related behavior:

Following the same steps as in the notebook, I tested the following configurations for the reconcilers list:

✅ Works:
reconcilers = [
   TopDown(method='forecast_proportions'),
  ]
✅ Works:
reconcilers = [
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]
However, combining these four working reconcilers causes an error:

❌ Fails:
reconcilers = [
   TopDown(method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='average_proportions'),
   MiddleOut(middle_level='Country/Purpose/State', top_down_method='proportion_averages'),
]
Error:
ColumnNotFoundError: The following columns were not found: ['AutoARIMA']
It’s unexpected that individually working reconcilers would fail when combined, right? 

Moreover, why would MiddleOut with historical proportions work (no need for in-sample predictions in Y_df), but not TopDown with historical proportions? Adding NaN/filled columns makes it work and produce results, but it seems like a rather odd workaround.

I am using the latest version 1.0.1

The first behaviour is strange, I'll look into it.

Second point seems again behaviour of #290

macarw · 2025-02-17T13:05:26Z

Thanks for the quick response, Olivier.

I think this might be a different behavior than #290.

In #290, the discussion was about needing base forecasts for all levels in the hierarchy, even when using traditional single-level methods. This meant extra rows are required in Y_hat_df.

What I am noticing here is that extra columns are being required in Y_df, specifically the in-sample predictions, even though single-level methods shouldn't need them.

Let me know if that makes sense.
Happy to clarify if needed.

elephaint · 2025-02-17T15:15:36Z

average_proportions and proportions_average are based on the insample proportions, not on the historic actual proportions. Thus, the model needs a column AutoARIMA to compute these.

In #290 the same issue occurs, ánd it leads to having too few rows in Y_hat_df. But I agree that they are better handled as different issues.

Concretely, the following statement is incorrect:

However, when attempting to use the remaining TopDown and MiddleOut methods that are based on historical proportions ("average_proportions" and "proportion_averages"), and which do require in-sample data, I encountered the following errors:

TopDown and MiddleOut methods are based on insample proportions, and require in-sample data.

It's a fair question whether this is the best behaviour though, the current behaviour is more general but I suspect most users don't expect it.

elephaint · 2025-02-17T16:43:41Z

@macarw I've made a PR that addresses both issues, that should fix them.

~~The behavior will be:~~
~~- If the model insample values are provided, those will be used to compute the proportions (old behavior)~~
~~- If they are not provided, the historical values will be used.~~

Edit: I was incorrect. The behavior is a result of a bug, TopDown only uses historical actual values. It does require the existence of the model insample predictions, but doesn't actually use them. This has an underlying different reason, that also needs to be addressed.

macarw · 2025-02-17T17:17:55Z

Thank you, that was super quick 🚀

I’ll take a look at the PR. Before I do, I’d like to clarify my point:

I still believe my initial statement is correct. TopDown and MiddleOut methods based on historical proportions do require in-sample data, which is accurate. Why do you say it is incorrect?
I am referring to in-sample data (e.g. y col from Y_df) and in-sample predictions data (e.g. AutoARIMA col from Y_df) as two different things.
From my testing, overriding the AutoARIMA col with NaN while keeping the y column in Y_df allows all historical proportion methods to work. Since passing a column full of NaNs still works, it seems more like the presence of the column is required but its values are not being used. Because of this, I’m confused about the sentence: "If the model insample values are provided, those will be used to compute the proportions (old behavior)" (Note: I understand that by in-sample values, you're referring to in-sample prediction values. Let me know if it is not the case).

I hope I have expressed myself clearly, and I will be reviewing the changes you have made.

elephaint · 2025-02-18T08:56:31Z

Thank you, that was super quick 🚀

I’ll take a look at the PR. Before I do, I’d like to clarify my point:

I still believe my initial statement is correct. TopDown and MiddleOut methods based on historical proportions do require in-sample data, which is accurate. Why do you say it is incorrect?

I am referring to in-sample data (e.g. y col from Y_df) and in-sample predictions data (e.g. AutoARIMA col from Y_df) as two different things.

From my testing, overriding the AutoARIMA col with NaN while keeping the y column in Y_df allows all historical proportion methods to work. Since passing a column full of NaNs still works, it seems more like the presence of the column is required but its values are not being used. Because of this, I’m confused about the sentence: "If the model insample values are provided, those will be used to compute the proportions (old behavior)" (Note: I understand that by in-sample values, you're referring to in-sample prediction values. Let me know if it is not the case).

I hope I have expressed myself clearly, and I will be reviewing the changes you have made.

You're right, I am wrong! Apologies. I was put off by our own code base, which requires the insample predictions for TopDown, but doesn't actually use it to compute the top-down reconciliation, for which it uses the historical insample (actual) values.

There's a weird quirk requiring the existence of the model insample predictions due to an unfortunate combination of conditional checks in the code.

Anyways, apologies again, I should've read better!

macarw · 2025-02-18T21:02:22Z

No worries at all! Thanks for taking another look and for putting together a PR that addresses the issues so quickly. I’ll be testing the new version with these changes once it’s available 👏

elephaint mentioned this issue Feb 17, 2025

[FIX] TopDown failing on combinations with other methods #330

Merged

elephaint linked a pull request Feb 17, 2025 that will close this issue

[FIX] TopDown failing on combinations with other methods #330

Merged

elephaint closed this as completed in #330 Feb 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Y_df` parameter in `reconcile()` method requiring more columns than necessary when using TopDown or MiddleOut #329

`Y_df` parameter in `reconcile()` method requiring more columns than necessary when using TopDown or MiddleOut #329

macarw commented Feb 16, 2025

macarw commented Feb 16, 2025

elephaint commented Feb 17, 2025 •

edited

Loading

elephaint commented Feb 17, 2025 •

edited

Loading

macarw commented Feb 17, 2025

elephaint commented Feb 17, 2025 •

edited

Loading

elephaint commented Feb 17, 2025 •

edited

Loading

macarw commented Feb 17, 2025

elephaint commented Feb 18, 2025

macarw commented Feb 18, 2025

Y_df parameter in reconcile() method requiring more columns than necessary when using TopDown or MiddleOut #329

Y_df parameter in reconcile() method requiring more columns than necessary when using TopDown or MiddleOut #329

Comments

macarw commented Feb 16, 2025

macarw commented Feb 16, 2025

elephaint commented Feb 17, 2025 • edited Loading

elephaint commented Feb 17, 2025 • edited Loading

macarw commented Feb 17, 2025

elephaint commented Feb 17, 2025 • edited Loading

elephaint commented Feb 17, 2025 • edited Loading

macarw commented Feb 17, 2025

elephaint commented Feb 18, 2025

macarw commented Feb 18, 2025

`Y_df` parameter in `reconcile()` method requiring more columns than necessary when using TopDown or MiddleOut #329

`Y_df` parameter in `reconcile()` method requiring more columns than necessary when using TopDown or MiddleOut #329

elephaint commented Feb 17, 2025 •

edited

Loading

elephaint commented Feb 17, 2025 •

edited

Loading

elephaint commented Feb 17, 2025 •

edited

Loading

elephaint commented Feb 17, 2025 •

edited

Loading