Fixed error when stacking data with no exogenous variables #4275

christopherbunn · 2023-08-16T16:26:46Z

Resolves #4276

codecov · 2023-08-16T16:35:28Z

Codecov Report

Merging #4275 (3e28e27) into main (7781c77) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4275     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        355     355             
  Lines      39073   39096     +23     
=======================================
+ Hits       38953   38976     +23     
  Misses       120     120

| Files Changed | Coverage Δ | |
|---|---|---|
| evalml/pipelines/utils.py | `99.7% <100.0%> (+0.1%)` | ⬆️ |
| evalml/preprocessing/utils.py | `100.0% <100.0%> (ø)` | |
| evalml/tests/pipeline_tests/test_pipeline_utils.py | `99.6% <100.0%> (+0.1%)` | ⬆️ |
| ...valml/tests/preprocessing_tests/test_split_data.py | `100.0% <100.0%> (ø)` | |

jeremyliweishih

some questions

jeremyliweishih · 2023-08-16T17:01:00Z

evalml/pipelines/utils.py

    for col in X.columns:
        if col == time_index:
            continue
        separated_name = col.split("_")
        original_columns.add("_".join(separated_name[:-1]))
-        series_ids.add(separated_name[-1])
+        if series_id_values is None:


why do we need this check if series_ids is a set?

My guess here is that we'd have an issue if the series_id_values passed in didn't match the series ids extracted from the column names. But the issue there then becomes "what if the passed in series_id_values don't match the column names"?

Moreover, why do we need series_id_values? I think I'm missing it, where is it actually used outside of this?

I can refactor this for loop to run only if series_id_values is not set. We use series_id_values in the case where we aren't able to pull the series ID values from the column names. In that case, it is used as series_ids, whose length determines how many times the dates are repeated.
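For readers following the thread, here is a minimal sketch of the column-name parsing that the loop in the diff performs. The helper name `split_unstacked_columns` and the sample column names are assumptions made for illustration; they are not part of evalml's API.

```python
def split_unstacked_columns(columns, time_index):
    """Recover original column names and series IDs from unstacked column names.

    Hypothetical helper: assumes each unstacked column is named
    "<original column>_<series id>", as in the loop shown in the diff.
    """
    original_columns = set()
    series_ids = set()
    for col in columns:
        if col == time_index:
            continue
        separated_name = col.split("_")
        # Everything before the last underscore is the original column name.
        original_columns.add("_".join(separated_name[:-1]))
        # The last token is treated as the series ID.
        series_ids.add(separated_name[-1])
    return original_columns, series_ids


cols = ["date", "feature_a_0", "feature_a_1", "feature_b_0", "feature_b_1"]
original_columns, series_ids = split_unstacked_columns(cols, "date")

# With only the time index present, nothing can be recovered from the names.
only_time = split_unstacked_columns(["date"], "date")
```

The second call shows the situation this PR addresses: when the time index is the only column, both sets come back empty, so the series IDs have to be supplied from somewhere else.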

jeremyliweishih · 2023-08-16T18:54:53Z

evalml/pipelines/utils.py


    restacked_X = []

+    if len(series_ids) == 0:
+        raise ValueError(
+            "Unable to stack X as X had no exogenous variables and `series_id_values` is None.",


I think this would be more accurate if it was "no exogenous variables or series_id_values is None"

Changed it to "X has no exogenous variables and series_id_values is None." as it's a bit more succinct.

jeremyliweishih · 2023-08-16T18:57:24Z

evalml/pipelines/utils.py

-    time_index_col.index = restacked_X.index
-    restacked_X[time_index] = time_index_col
+
+    if len(restacked_X) == 0:


can you help me understand why we need this logic block here? If restacked_X has length 0 why don't we error out?

restacked_X can have length zero in the case where we only have the time_index column in the unstacked X dataframe. We want to be able to still stack in this case, which is why we have this logic block.

That being said, with the refactor I just put in, I ended up changing the condition to be if len(original_columns) == 0.

eccabay

I'm a bit confused by the changes, to be fully honest. Is the goal here to handle the case where the only column in X is the time index column? If so, there has got to be a clearer way to handle that case. Even though right now stack_X is only used in the context of split_multiseries_data, we shouldn't need to handle it partly in split_multiseries and partly in stack_X. stack_X should be able to handle any case where the time index is the only column, and not depend on precalculating a variable to handle it.

eccabay · 2023-08-17T14:52:31Z

evalml/pipelines/utils.py

@@ -1381,6 +1381,7 @@ def unstack_multiseries(
    # Perform the unstacking
    X_unstacked_cols = []
    y_unstacked_cols = []
+    new_time_index = None


Looks like we don't need this - new_time_index is never used outside of the loop

Agreed, deleted.

eccabay · 2023-08-17T14:56:11Z

evalml/pipelines/utils.py

    for col in X.columns:
        if col == time_index:
            continue
        separated_name = col.split("_")
        original_columns.add("_".join(separated_name[:-1]))
-        series_ids.add(separated_name[-1])
+        if series_id_values is None:


My guess here is that we'd have an issue if the series_id_values passed in didn't match the series ids extracted from the column names. But the issue there then becomes "what if the passed in series_id_values don't match the column names"?

eccabay · 2023-08-17T14:59:56Z

evalml/pipelines/utils.py

    for col in X.columns:
        if col == time_index:
            continue
        separated_name = col.split("_")
        original_columns.add("_".join(separated_name[:-1]))
-        series_ids.add(separated_name[-1])
+        if series_id_values is None:


Moreover, why do we need series_id_values? I think I'm missing it, where is it actually used outside of this?

christopherbunn · 2023-08-17T21:10:27Z

@jeremyliweishih @eccabay yep, at a high level we need to handle the case where the unstacked data is composed only of a time index column. This case arises when the stacked data is composed solely of a time index column and a series ID column. When the data is unstacked, we essentially retain only the unique timestamps and drop any series ID information from X (the series ID values essentially go to y).

When stacking, we need to know how many series ID values there are so that we can repeat the timestamp values once for each series ID value. Since stack_X() only takes in X (which does not have series ID info), we must somehow pass it in from an external source. We could do one of the following:

- We take in a set of the current series ID values as the series_id_values parameter. The length of this set is then used to repeat the timestamps an appropriate number of times. This is our current approach.
- Alternatively, we can take in just the number of unique series IDs as an integer with an n_series_ids parameter.
- We can expand stack_X() to take in the y dataframe as an optional parameter. If passed in, we can use the column names of y to generate the series ID values.

Open to any of these approaches or any alternatives!
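To make the first option concrete, here is a rough sketch of repeating the timestamps once per series ID when X holds nothing but the time index. The function name `stack_time_index_only` and the sample data are assumptions for illustration, not the code merged in this PR.

```python
import pandas as pd


def stack_time_index_only(X, time_index, series_id_values):
    """Stack an X that contains only the time index column.

    Hypothetical helper: only the number of series IDs is used, to decide
    how many times each timestamp is repeated.
    """
    n_series = len(series_id_values)
    # Series.repeat repeats each element consecutively, so every timestamp
    # appears n_series times in a row; the index is then reset to 0..n-1.
    repeated = X[time_index].repeat(n_series).reset_index(drop=True)
    return pd.DataFrame({time_index: repeated})


X = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=3)})
stacked = stack_time_index_only(X, "date", series_id_values=["0", "1"])
```

Since only `len(series_id_values)` is used here, the second option (an integer count) would behave identically in this sketch; passing the values themselves keeps the door open for using them later.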

eccabay · 2023-08-18T14:10:23Z

@christopherbunn gotcha, I understand the challenge now. I think your new implementation makes sense, although I still have to do a full review. If possible, I think we should be as explicit as possible that series_id_values is a required parameter if the only column in X is the time index - maybe specifically checking and calling that out as the root error?

christopherbunn · 2023-08-18T18:52:54Z

@eccabay makes sense to me, I updated the docstring and error message that is raised to clarify this. Open to tweaking the wording of either though
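One way the explicit check suggested above could look is sketched below. The helper name `resolve_series_ids` and the error wording are assumptions for illustration; the message actually merged in this PR may read differently.

```python
def resolve_series_ids(columns, time_index, series_id_values=None):
    """Decide which series IDs to stack with, failing loudly when impossible.

    Hypothetical helper: explicit series_id_values win; otherwise the IDs are
    parsed from the column names, assuming a "<column>_<series id>" pattern.
    """
    if series_id_values is not None:
        return list(series_id_values)
    series_ids = sorted(
        {col.split("_")[-1] for col in columns if col != time_index}
    )
    if len(series_ids) == 0:
        # Illustrative wording only: call out the root cause directly.
        raise ValueError(
            "series_id_values must be passed in when X contains only the "
            "time index column.",
        )
    return series_ids


from_names = resolve_series_ids(["date", "feature_a_0", "feature_a_1"], "date")
from_param = resolve_series_ids(["date"], "date", series_id_values=["a", "b"])

try:
    resolve_series_ids(["date"], "date")
    raised = False
except ValueError:
    raised = True
```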

eccabay

Thanks! Just a few nits

eccabay · 2023-08-21T15:56:38Z

evalml/pipelines/utils.py

+            start=start_index,
+            stop=start_index + len(time_index_col),
+        )
+        time_index_col.index = stacked_index


I don't think we need this line

We need it in the case where the time_index_col has a starting index value other than 0. Without it, the test cases where starting_index is not None will fail.
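A small sketch of why the index reassignment can matter: pandas aligns on index labels when a Series is assigned as a DataFrame column, so a time index column numbered from 0 will not line up with stacked data whose index starts elsewhere. The helper name `reindex_from` and the sample frames are assumptions for illustration.

```python
import pandas as pd


def reindex_from(series, start_index=None):
    """Return a copy of series whose index runs from start_index upward.

    Hypothetical helper mirroring the RangeIndex assignment in the diff.
    """
    start = 0 if start_index is None else start_index
    stacked_index = pd.RangeIndex(start=start, stop=start + len(series))
    out = series.copy()
    out.index = stacked_index
    return out


# Stacked features whose index starts at 10 rather than 0.
features = pd.DataFrame(
    {"feature_a": [1.0, 2.0, 3.0, 4.0]},
    index=pd.RangeIndex(10, 14),
)
time_index_col = pd.Series(
    pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"]),
)

# Without reindexing, labels 0..3 never match 10..13, so every value is NaT.
misaligned = features.copy()
misaligned["date"] = time_index_col

# With the index shifted to start at 10, the values line up row by row.
aligned = features.copy()
aligned["date"] = reindex_from(time_index_col, start_index=10)
```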

jeremyliweishih

LGTM but agree with Becca's nits!

christopherbunn marked this pull request as ready for review August 16, 2023 16:58

auto-assign bot assigned christopherbunn Aug 16, 2023

christopherbunn requested review from jeremyliweishih, fjlanasa, MichaelFu512, eccabay, chukarsten and remyogasawara August 16, 2023 16:58

jeremyliweishih reviewed Aug 16, 2023

eccabay reviewed Aug 17, 2023

christopherbunn force-pushed the stack_utils_no_exogenous branch from 6a58945 to eb066a7 Compare August 17, 2023 21:19

christopherbunn force-pushed the stack_utils_no_exogenous branch 2 times, most recently from 4c65b13 to e0ae9e9 Compare August 18, 2023 18:52

christopherbunn force-pushed the stack_utils_no_exogenous branch from e0ae9e9 to c7f7189 Compare August 21, 2023 14:18

christopherbunn requested review from eccabay and jeremyliweishih August 21, 2023 15:02

christopherbunn added 4 commits August 21, 2023 11:03

Initial commit

5351b88

Updated release notes

dbf692e

Refactored code structure.

4a32c11

Updated error message and docstring

996fd03

christopherbunn force-pushed the stack_utils_no_exogenous branch from c7f7189 to 996fd03 Compare August 21, 2023 15:03

eccabay approved these changes Aug 21, 2023

jeremyliweishih approved these changes Aug 21, 2023

Final nits

3e28e27

christopherbunn merged commit 53bd61b into main Aug 21, 2023

christopherbunn deleted the stack_utils_no_exogenous branch August 21, 2023 19:14
