Added support for additional estimators for multiseries datasets #4385
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@           Coverage Diff           @@
##             main    #4385   +/-   ##
========================================
+ Coverage    99.7%    99.7%    +0.1%
========================================
  Files         357      357
  Lines       39928    39965      +37
========================================
+ Hits        39802    39840      +38
+ Misses        126      125       -1
☔ View full report in Codecov by Sentry.
LGTM pending performance tests
@@ -98,19 +97,20 @@ def test_iterative_algorithm_init(
     assert algo.batch_number == 0
     assert algo.default_max_batches == 1
     estimators = get_estimators(problem_type)
-    decomposer = [STLDecomposer] if is_regression(problem_type) else []
+    decomposer = [True, False] if is_regression(problem_type) else [True]
Can you add a comment clarifying why you're using True/False instead of the decomposer name?
I think it would be better to just parametrize the include_decomposer argument here - this and the below section are confusing to read out of context.
For this test, we are basically only checking that the number of pipelines matches up. Before, we only needed to add the decomposer once, since there was one estimator type (VARMAX).
Now, since we have multiple estimator types, each estimator type will have one pipeline with a decomposer and another without one. As such, we need this [True, False] and to iterate through it in order to generate the correct number of pipelines.
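Purely as an illustration of that counting logic, a self-contained sketch (the estimator names below are placeholders, not evalml's actual classes):

def expected_pipeline_count(estimators, include_decomposer_options):
    # Each estimator type pairs with every include-decomposer option, so the
    # total pipeline count is the product of the two list lengths.
    return len(estimators) * len(include_decomposer_options)

# Regression problems iterate over [True, False] per the diff above:
# two estimator types -> four pipelines.
assert expected_pipeline_count(["EstimatorA", "EstimatorB"], [True, False]) == 4
# Non-regression problems iterate over a single option -> one pipeline per estimator type.
assert expected_pipeline_count(["EstimatorA"], [True]) == 1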
I think a clarifying comment would be useful here 👍
Added a short one to this test
self.statistically_significant_lags = {}
for column in y.columns:
    self.statistically_significant_lags[
        column
    ] = self._find_significant_lags(
        y[column],
        conf_level=self.conf_level,
        start_delay=self.start_delay,
        max_delay=self.max_delay,
    )
We can make this section more concise and easier to maintain by folding the single-series case into the multiseries case: convert the single series to a dataframe and keep this code for both cases, following the pattern in other files that already support multiseries (the STL decomposer might be a good example?).
Good point, I just consolidated it.
Actually, I just remembered that it's structured this way so that we're still able to run self._find_significant_lags even when y is None. Is there a way you had in mind to structure it so that y can still be None?
Hm, seems like y being None is something we'd want to have explicit behavior for, since right now the behavior is unclear. I think we should just handle it entirely separately
We handle y being null in self._find_significant_lags, since we calculate all lags in that function (and just set the significant lags to be all_lags if y is None). Should I split it off into its own separate branch, even though the code would be identical to the case where y is a series? e.g.
# For the multiseries case, each series ID has individualized lag values
if isinstance(y, pd.DataFrame):
    self.statistically_significant_lags = {}
    for column in y.columns:
        self.statistically_significant_lags[
            column
        ] = self._find_significant_lags(
            y[column],
            conf_level=self.conf_level,
            start_delay=self.start_delay,
            max_delay=self.max_delay,
        )
elif y is None:
    self.statistically_significant_lags = self._find_significant_lags(
        y,
        conf_level=self.conf_level,
        start_delay=self.start_delay,
        max_delay=self.max_delay,
    )
else:
    self.statistically_significant_lags = self._find_significant_lags(
        y,
        conf_level=self.conf_level,
        start_delay=self.start_delay,
        max_delay=self.max_delay,
    )
Ok, sorry for drilling into this so much, but I think I understand now. My new potentially hot take proposal is something like:
if y is None:
    self.statistically_significant_lags = np.arange(self.start_delay, self.start_delay + self.max_delay + 1)
else:
    if isinstance(y, pd.Series):
        y = y.to_frame()
    for column in y.columns:
        self.statistically_significant_lags = ...
And then we can remove the handling of y being None from the static function. My argument for doing this is that calling all lags the statistically significant lags is a misnomer, since we didn't actually check statistical significance. This is me getting very into the weeds though, so I very much understand if you would rather keep things closer to the way they are 😅
Regardless, even with your new proposal, we'd still be able to combine the two non y=None cases by casting the series to a dataframe
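Putting the two suggestions together, a minimal self-contained sketch of the consolidated logic might look like this (the fit_lags wrapper name is hypothetical; the attribute and helper names come from the snippets above):

import numpy as np
import pandas as pd

def fit_lags(self, y=None):
    # With no target we can't test significance, so fall back to treating
    # every lag in the delay window as usable.
    if y is None:
        self.statistically_significant_lags = np.arange(
            self.start_delay, self.start_delay + self.max_delay + 1
        )
        return

    # Cast the single-series case to a one-column dataframe so the
    # multiseries loop below covers both code paths.
    if isinstance(y, pd.Series):
        y = y.to_frame()

    # For the multiseries case, each series ID gets individualized lag values.
    self.statistically_significant_lags = {
        column: self._find_significant_lags(
            y[column],
            conf_level=self.conf_level,
            start_delay=self.start_delay,
            max_delay=self.max_delay,
        )
        for column in y.columns
    }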
Your example makes sense to me. I don't see our behavior for y is None changing anytime soon, so I'm comfortable with pulling that out and changing the function. Will update!
  unstacked_predictions.index = X_unstacked[self.time_index]
  stacked_predictions = stack_data(
      unstacked_predictions,
-     include_series_id=include_series_id,
+     include_series_id=True,
      series_id_name=self.series_id,
  )

  stacked_predictions = stacked_predictions.reset_index()
What's the reasoning behind setting the index and then immediately resetting the index? The value of the index shouldn't impact the order of stacking, right?
Either way, we can explicitly control the index in stack_data with the starting_index argument.
The goal of this snippet is to set the index to the time index column, stack the data (thus using the dates in the time index column to generate the new stacked dates), and then reset the index so that the resulting time index column can be used when we pd.merge later on in line 193.
While it's possible to just copy over the time_index column from X after stacking, I think it's safer to generate it from the X_unstacked index like this, as we know for sure that the X_unstacked time_index aligns with unstacked_predictions, whereas it's technically possible to have an X time_index that's out of order (and thus would be incorrect if we simply copied this column over). I'm open to suggestions for a cleaner implementation!
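As a toy illustration of that pattern (plain pandas standing in for stack_data, with made-up column and series names), setting the time index before stacking and resetting it afterwards is what keeps the time column available as a merge key:

import pandas as pd

# Toy wide-format predictions: one column per series, indexed by the time
# index taken from the unstacked features (names here are illustrative).
unstacked_predictions = pd.DataFrame(
    {"series_a": [1.0, 2.0], "series_b": [3.0, 4.0]},
    index=pd.Index(pd.to_datetime(["2023-01-01", "2023-01-02"]), name="date"),
)

# Plain pandas stacking stands in for evalml's stack_data helper here.
stacked = unstacked_predictions.stack().rename("prediction").reset_index()
stacked = stacked.rename(columns={"level_1": "series_id"})

# Because reset_index keeps "date" as a real column (drop=False is the
# default), it can be used as a merge key against the stacked features.
features = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"]
        ),
        "series_id": ["series_a", "series_b", "series_a", "series_b"],
    }
)
merged = features.merge(stacked, on=["date", "series_id"])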
Ok, I think I understand now! I think a comment would be great. I also wonder if it would be useful to explicitly say reset_index(drop=False), so that even if pandas changes their defaults we don't get screwed.
My motivation here is that this is something that might be confusing to someone looking back at it in the future, since the goal isn't clear from the code itself. I hope that makes sense!
Good call on the reset_index parameter, I'll add that in. I'll also add a clarifying comment or two so that it's clear what's going on here.
Your motivation makes sense! I've been so lost in the weeds of this implementation for a while now, so it's good to have multiple pairs of eyes on this to highlight what's intuitive and what isn't 😅
Looks solid! I think with a final few code comments this is all set 😎
Resolves #4386