-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed error when stacking data with no exogenous variables #4275
Conversation
Codecov Report
@@ Coverage Diff @@
## main #4275 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 355 355
Lines 39073 39096 +23
=======================================
+ Hits 38953 38976 +23
Misses 120 120
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some questions
evalml/pipelines/utils.py
Outdated
for col in X.columns: | ||
if col == time_index: | ||
continue | ||
separated_name = col.split("_") | ||
original_columns.add("_".join(separated_name[:-1])) | ||
series_ids.add(separated_name[-1]) | ||
if series_id_values is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this check if series_ids
is a set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess here is that we'd have an issue if the series_id_values
passed in didn't match the series ids extracted from the column names. But the issue there then becomes "what if the passed in series_id_values
don't match the column names"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, why do we need series_id_values
? I think I'm missing it, where is it actually used outside of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can refactor this for loop to only run if series_id_values
is not set. We use series_id_values
in the case where we aren't able to pull the series ID values from the column names. In that case, it is set as series_ids
, which is used to repeat the dates as seen in this line.
evalml/pipelines/utils.py
Outdated
|
||
restacked_X = [] | ||
|
||
if len(series_ids) == 0: | ||
raise ValueError( | ||
"Unable to stack X as X had no exogenous variables and `series_id_values` is None.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more accurate if it was no exogenous variables or
series_id_values is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to "X has no exogenous variables and
series_id_values is None."
as it's a bit more succinct
evalml/pipelines/utils.py
Outdated
time_index_col.index = restacked_X.index | ||
restacked_X[time_index] = time_index_col | ||
|
||
if len(restacked_X) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you help me understand why we need this logic block here? If restacked_X
has length 0 why don't we error out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restacked_X
can be zero in the case where we only have the time_index columns in the unstacked X
dataframe. We want to be able to still stack in this case, which is why we have this logic block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, with the refactor I just put in, I ended up condition to be if len(original_columns
) == 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the changes, to be fully honest. Is the goal here to handle the case where the only column in X is the time index column? If so, there has got to be a clearer way to handle that case. Even though right now stack_X
is only used in the context of split_multiseries_data
, we shouldn't need to handle it partly in split_multiseries
and partly in stack_X
. stack_X
should be able to handle any case where the time index is the only column, and not depend on precalculating a variable to handle it.
evalml/pipelines/utils.py
Outdated
@@ -1381,6 +1381,7 @@ def unstack_multiseries( | |||
# Perform the unstacking | |||
X_unstacked_cols = [] | |||
y_unstacked_cols = [] | |||
new_time_index = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we don't need this - new_time_index
is never used outside of the loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, deleted.
evalml/pipelines/utils.py
Outdated
for col in X.columns: | ||
if col == time_index: | ||
continue | ||
separated_name = col.split("_") | ||
original_columns.add("_".join(separated_name[:-1])) | ||
series_ids.add(separated_name[-1]) | ||
if series_id_values is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess here is that we'd have an issue if the series_id_values
passed in didn't match the series ids extracted from the column names. But the issue there then becomes "what if the passed in series_id_values
don't match the column names"?
evalml/pipelines/utils.py
Outdated
for col in X.columns: | ||
if col == time_index: | ||
continue | ||
separated_name = col.split("_") | ||
original_columns.add("_".join(separated_name[:-1])) | ||
series_ids.add(separated_name[-1]) | ||
if series_id_values is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, why do we need series_id_values
? I think I'm missing it, where is it actually used outside of this?
@jeremyliweishih @eccabay yep, at a high level we need to handle the case where the unstacked data is only composed of a time index column. This case arises when the stacked data is solely composed of a time index column and a series ID column. When the data is unstacked, we essentially only retain the unique time stamps and drop any series ID information from X (and the series ID values essentially goes to y). When stacking, we need to know how many series ID values there are so that we can repeat the number of timestamp values for each series ID value as seen here. Since
Open to either of these approaches or any alternatives! |
6a58945
to
eb066a7
Compare
@christopherbunn gotcha, I understand the challenge now. I think your new implementation makes sense, although I still have to do a full review. If possible, I think we should be as explicit as possible that |
4c65b13
to
e0ae9e9
Compare
@eccabay makes sense to me, I updated the docstring and error message that is raised to clarify this. Open to tweaking the wording of either though |
e0ae9e9
to
c7f7189
Compare
c7f7189
to
996fd03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just a few nits
start=start_index, | ||
stop=start_index + len(time_index_col), | ||
) | ||
time_index_col.index = stacked_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need it in the case where the time_index_col
has a different starting index value other than 0. Without it, the test cases where starting_index
is not None will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but agree with Becca's nits!
Resolves #4276