I am working on a dataset with multiple tables and am using the featuretools library for feature engineering. One of the tables, which is NOT the target dataframe, has several columns, three of which are relevant here: ['rating', 'valid_from', 'valid_to']. I use valid_from as the time_index, but I am not sure how to incorporate the valid_to column. If this were the target dataframe, I could use valid_to for the cutoff times, but since it is not, I don't know how to set up the problem so that there is no data leakage.
I also thought of using valid_to as the time_index, but in that case I am not sure how to incorporate the valid_from column.
# Add the secondary table with 'valid_from' as the time_index.
# Note: entity_from_dataframe is the pre-1.0 Featuretools API; 1.0+ uses es.add_dataframe.
import featuretools as ft

es = ft.EntitySet(id="your_entity_set")
es = es.entity_from_dataframe(
    entity_id="secondary_table",
    dataframe=secondary_df,   # your secondary dataframe
    index="secondary_id",     # primary key of the secondary table
    time_index="valid_from",  # use valid_from as the time_index
)
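# Link the secondary table to the target table (assumes target_table is already in es).
# The 'valid_to' filter is applied separately; the relationship only defines the join.
# ft.Relationship takes the parent variable first, then the child variable.
relationship = ft.Relationship(
    es["secondary_table"]["secondary_id"],  # parent: primary key in secondary table
    es["target_table"]["target_id"],        # child: foreign key in target table
)
es = es.add_relationship(relationship)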
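# Filter the secondary table to avoid using records where valid_to < cutoff.
# Featuretools does not apply this filter automatically: at each cutoff time it only
# excludes rows whose time_index (valid_from) is later than the cutoff, so expired
# records have to be removed explicitly.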
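The valid_to filter mentioned above can be sketched in plain pandas, independent of Featuretools. This is a minimal illustration under assumptions, not the library's own mechanism. The helper name `rows_valid_at` and the sample data are hypothetical. The validity interval is assumed to be half-open, [valid_from, valid_to), and a missing valid_to is assumed to mean "still valid".

```python
import pandas as pd


def rows_valid_at(df: pd.DataFrame, cutoff) -> pd.DataFrame:
    """Return the rows whose [valid_from, valid_to) interval contains cutoff.

    Hypothetical helper: column names follow the question, the interval
    convention and the handling of missing valid_to are assumptions.
    """
    cutoff = pd.Timestamp(cutoff)
    started = df["valid_from"] <= cutoff
    # Assumption: a missing valid_to means the record has not expired yet.
    not_expired = df["valid_to"].isna() | (df["valid_to"] > cutoff)
    return df[started & not_expired]


# Made-up sample data, for illustration only.
ratings = pd.DataFrame(
    {
        "secondary_id": [1, 2, 3],
        "rating": [3.0, 4.0, 5.0],
        "valid_from": pd.to_datetime(["2021-01-01", "2021-03-01", "2021-06-01"]),
        "valid_to": pd.to_datetime(["2021-03-01", "2021-06-01", None]),
    }
)

current = rows_valid_at(ratings, "2021-04-15")
print(current[["secondary_id", "rating"]])
```

With a single cutoff, the filtered frame could be passed in as the secondary dataframe. With per-row cutoff times, the same filter would have to be applied once per cutoff. If valid_to only becomes known when a rating is superseded, the column itself may carry future information and might need to be dropped or masked before generating features.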