Fixed error when stacking data with no exogenous variables #4275

Merged
@christopherbunn merged 5 commits into main from stack_utils_no_exogenous
Aug 21, 2023

Conversation

Contributor

@christopherbunn christopherbunn commented Aug 16, 2023

Resolves #4276

codecov bot commented Aug 16, 2023

Codecov Report

Merging #4275 (3e28e27) into main (7781c77) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4275     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        355     355             
  Lines      39073   39096     +23     
=======================================
+ Hits       38953   38976     +23     
  Misses       120     120             
Files Changed Coverage Δ
evalml/pipelines/utils.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/preprocessing/utils.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.6% <100.0%> (+0.1%) ⬆️
...valml/tests/preprocessing_tests/test_split_data.py 100.0% <100.0%> (ø)

Collaborator

@jeremyliweishih jeremyliweishih left a comment

some questions

for col in X.columns:
    if col == time_index:
        continue
    separated_name = col.split("_")
    original_columns.add("_".join(separated_name[:-1]))
    series_ids.add(separated_name[-1])
if series_id_values is None:
Collaborator

why do we need this check if series_ids is a set?

Contributor

My guess here is that we'd have an issue if the series_id_values passed in didn't match the series ids extracted from the column names. But the issue there then becomes "what if the passed in series_id_values don't match the column names"?

Contributor

Moreover, why do we need series_id_values? I think I'm missing it, where is it actually used outside of this?

Contributor Author

@christopherbunn christopherbunn Aug 17, 2023

I can refactor this for loop to only run if series_id_values is not set. We use series_id_values in the case where we aren't able to pull the series ID values from the column names. In that case, it is set as series_ids, which is used to repeat the dates as seen in this line.
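For readers following along, here is a minimal sketch of the column-name parsing discussed above. It assumes unstacked columns are named `<original column>_<series id>` with `_` as the separator, as in the snippet under review; `split_unstacked_columns` is a hypothetical helper name, not a function in evalml.

```python
def split_unstacked_columns(columns, time_index, separator="_"):
    """Recover original column names and series IDs from unstacked column names.

    Hypothetical helper for illustration only.
    """
    original_columns = set()
    series_ids = set()
    for col in columns:
        if col == time_index:
            # The time index carries no series ID suffix, so skip it.
            continue
        # The last separator-delimited token is treated as the series ID.
        *name_parts, series_id = col.split(separator)
        original_columns.add(separator.join(name_parts))
        series_ids.add(series_id)
    return original_columns, series_ids


cols = ["date", "feature_a_0", "feature_a_1", "feature_b_0", "feature_b_1"]
original_columns, series_ids = split_unstacked_columns(cols, time_index="date")

# When the time index is the only column, nothing can be recovered.
only_time = split_unstacked_columns(["date"], time_index="date")
```

The second call shows the case this PR targets: with no exogenous columns, both sets come back empty, so the series IDs have to be supplied some other way.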


restacked_X = []

if len(series_ids) == 0:
    raise ValueError(
        "Unable to stack X as X had no exogenous variables and `series_id_values` is None.",
Collaborator

I think this would be more accurate if it was no exogenous variables or series_id_values is None

Contributor Author

Changed it to "X has no exogenous variables and series_id_values is None." as it's a bit more succinct

time_index_col.index = restacked_X.index
restacked_X[time_index] = time_index_col

if len(restacked_X) == 0:
Collaborator

can you help me understand why we need this logic block here? If restacked_X has length 0 why don't we error out?

Contributor Author

restacked_X can have length zero in the case where the time_index column is the only column in the unstacked X dataframe. We still want to be able to stack in this case, which is why we have this logic block.

Contributor Author

That being said, with the refactor I just put in, I ended up changing the condition to be if len(original_columns) == 0.

Contributor

@eccabay eccabay left a comment

I'm a bit confused by the changes, to be fully honest. Is the goal here to handle the case where the only column in X is the time index column? If so, there has got to be a clearer way to handle that case. Even though right now stack_X is only used in the context of split_multiseries_data, we shouldn't need to handle it partly in split_multiseries and partly in stack_X. stack_X should be able to handle any case where the time index is the only column, and not depend on precalculating a variable to handle it.

@@ -1381,6 +1381,7 @@ def unstack_multiseries(
# Perform the unstacking
X_unstacked_cols = []
y_unstacked_cols = []
new_time_index = None
Contributor

Looks like we don't need this - new_time_index is never used outside of the loop

Contributor Author

Agreed, deleted.

@christopherbunn
Contributor Author

@jeremyliweishih @eccabay yep, at a high level we need to handle the case where the unstacked data is only composed of a time index column. This case arises when the stacked data is solely composed of a time index column and a series ID column. When the data is unstacked, we only retain the unique timestamps and drop any series ID information from X (the series ID values essentially go to y).

When stacking, we need to know how many series ID values there are so that we can repeat each timestamp once per series ID value, as seen here. Since stack_X() only takes in X (which does not have series ID info), we must pass that information in from an external source. We could do one of the following:

  1. We take in a set of the current series ID values as the series_id_values parameter. The length of this set is then used to repeat the timestamps the appropriate number of times. This is our current approach.
  2. Alternatively, we can take in just the number of unique series IDs as an integer with an n_series_ids parameter.
  3. We can expand stack_X() to take in the y dataframe as an optional parameter. If passed in, we can use the column names of y to generate the series ID values.

Open to any of these approaches or any alternatives!
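To make option 1 concrete, here is a rough sketch of stacking a frame that holds only the time index. It is not the code in this PR: `stack_time_index_only` is a hypothetical helper, and the row ordering (all series for one date, then the next date) and the inclusion of a series ID column in the output are assumptions made for illustration.

```python
import pandas as pd


def stack_time_index_only(X, time_index, series_id_name, series_id_values):
    """Stack an unstacked frame whose only column is the time index.

    Hypothetical helper: each timestamp is repeated once per series ID.
    """
    series_id_values = sorted(series_id_values)
    n_series = len(series_id_values)
    n_dates = len(X)
    return pd.DataFrame(
        {
            # Each date appears n_series times in a row: d0, d0, d1, d1, ...
            time_index: X[time_index].repeat(n_series).to_numpy(),
            # Series IDs cycle once per date: a, b, a, b, ...
            series_id_name: series_id_values * n_dates,
        }
    )


X_unstacked = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=3)})
stacked = stack_time_index_only(X_unstacked, "date", "series_id", {"a", "b"})
```

Only the length of `series_id_values` drives how often each timestamp repeats, which is why option 2 (passing just a count) would also be enough to rebuild the time index column.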

@christopherbunn christopherbunn force-pushed the stack_utils_no_exogenous branch from 6a58945 to eb066a7 Compare August 17, 2023 21:19
Contributor

eccabay commented Aug 18, 2023

@christopherbunn gotcha, I understand the challenge now. I think your new implementation makes sense, although I still have to do a full review. If possible, I think we should be as explicit as possible that series_id_values is a required parameter if the only column in X is the time index - maybe specifically checking and calling that out as the root error?
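One way the explicit check suggested here might look, as a sketch only: `check_can_stack` is a hypothetical name, and the message reuses the wording quoted earlier in this thread rather than whatever the merged code ends up raising.

```python
def check_can_stack(columns, time_index, series_id_values=None):
    """Raise early if the series IDs cannot be recovered from anywhere.

    Hypothetical helper for illustration only.
    """
    exogenous = [col for col in columns if col != time_index]
    if not exogenous and series_id_values is None:
        # Nothing to parse series IDs from, and none were passed in.
        raise ValueError(
            "X has no exogenous variables and `series_id_values` is None.",
        )


# Passes: the series IDs are supplied explicitly.
check_can_stack(["date"], "date", series_id_values={"a", "b"})
# Passes: the series IDs can be parsed from the column names.
check_can_stack(["date", "feature_0", "feature_1"], "date")
```

Checking this before any stacking work starts makes the missing parameter the root error, instead of a failure further down when the timestamps are repeated.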

@christopherbunn christopherbunn force-pushed the stack_utils_no_exogenous branch 2 times, most recently from 4c65b13 to e0ae9e9 Compare August 18, 2023 18:52
@christopherbunn
Contributor Author

@eccabay makes sense to me, I updated the docstring and error message that is raised to clarify this. Open to tweaking the wording of either though

@christopherbunn christopherbunn force-pushed the stack_utils_no_exogenous branch from c7f7189 to 996fd03 Compare August 21, 2023 15:03
Contributor

@eccabay eccabay left a comment

Thanks! Just a few nits

    start=start_index,
    stop=start_index + len(time_index_col),
)
time_index_col.index = stacked_index
Contributor

I don't think we need this line

Contributor Author

We need it in the case where the time_index_col should have a starting index value other than 0. Without it, the test cases where starting_index is not None will fail.
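A small standalone illustration of why the reindexing matters, assuming the stacked index is built with `pd.RangeIndex` (the constructor call is cut off in the snippet above, so that part is a guess). pandas aligns on index labels when a Series is assigned as a DataFrame column, so a Series still indexed from 0 does not line up with a frame indexed from 10.

```python
import pandas as pd

dates = pd.date_range("2023-01-01", periods=4)
start_index = 10
stacked_index = pd.RangeIndex(start=start_index, stop=start_index + len(dates))

# With the line in question: the Series index matches the frame index.
time_index_col = pd.Series(dates, name="date")
time_index_col.index = stacked_index
restacked_X = pd.DataFrame(index=stacked_index)
restacked_X["date"] = time_index_col

# Without it: labels 0..3 do not exist in 10..13, so every value is NaT.
misaligned = pd.DataFrame(index=stacked_index)
misaligned["date"] = pd.Series(dates, name="date")
```

When the starting index is 0 the two indexes happen to coincide, which is why the line can look redundant at first glance.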

Collaborator

@jeremyliweishih jeremyliweishih left a comment

LGTM but agree with Becca's nits!

@christopherbunn christopherbunn merged commit 53bd61b into main Aug 21, 2023
@christopherbunn christopherbunn deleted the stack_utils_no_exogenous branch August 21, 2023 19:14
Successfully merging this pull request may close these issues.

Error when stacking X dataframes without exogenous variables
3 participants