How to add a dataframe whose rows are valid for a period of time with featuretools #2756

eddyfathi · 2024-10-30T21:57:13Z

I am working on a dataset with multiple tables and am using the featuretools library for feature engineering. One of the tables, which is NOT the target dataframe, has several columns. Three of these columns are relevant to this question: ['rating', 'valid_from', 'valid_to']. I use valid_from as the time_index but am not sure how to incorporate the valid_to column. If this were the target dataframe I could have used valid_to as cutoffs, but since it's not the target dataframe I don't know how to set up the problem so that there is no data leakage.

I also thought of using valid_to as the time_index, but then I am not sure how to incorporate the valid_from column.

Adi6501 · 2024-11-11T04:22:58Z

    import featuretools as ft
    import pandas as pd

Assuming `es` is your existing entity set. The calls below use the Featuretools 1.x names (`add_dataframe`, `target_dataframe_name`); releases before 1.0 call these `entity_from_dataframe` and `target_entity`.

Filter the secondary table to avoid using records where `valid_to` < cutoff. The time index only makes Featuretools skip rows whose `valid_from` is after the cutoff. As far as I know it does not look at `valid_to`, so expired rows have to be removed by hand before the table is added. This filter handles one cutoff time at a time:

    def filter_valid_rows(df, cutoff_time):
        # Keep rows whose validity window has not ended by the cutoff
        return df[df["valid_to"] >= cutoff_time]

    cutoff = pd.Timestamp("2024-01-01")  # placeholder, use your own cutoff
    valid_df = filter_valid_rows(secondary_df, cutoff)

Add the filtered secondary table with 'valid_from' as the time_index:

    es.add_dataframe(
        dataframe_name="secondary_table",
        dataframe=valid_df,          # your filtered secondary dataframe
        index="secondary_id",        # primary key of the secondary table
        time_index="valid_from",     # use valid_from as the time_index
    )

Add the relationship between this table and the target table. The parent (the table that owns the primary key) goes first:

    es.add_relationship(
        "secondary_table", "secondary_id",  # primary key in secondary table
        "target_table", "target_id",        # foreign key in target table
    )

Use the filtered data in DFS with the same cutoff:

    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="target_table",
        cutoff_time=cutoff,          # same cutoff that was used for the filter
        features_only=False,
    )
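When every target row has its own cutoff time, one filter over the whole secondary table is not enough, because a row can be valid for one cutoff and expired for another. A possible workaround is to resolve the rating per cutoff with a plain pandas as-of join before building the entity set. This is only a sketch, not a Featuretools feature: the toy data, the `customer_id` key, and the helper name `rating_as_of` are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: rating periods per customer (names are illustrative).
ratings = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "rating": ["A", "B", "C"],
    "valid_from": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-01-15"]),
    "valid_to": pd.to_datetime(["2024-02-29", "2024-06-30", "2024-02-15"]),
})

# One cutoff per target row, shaped like a cutoff-times dataframe.
cutoffs = pd.DataFrame({
    "target_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "time": pd.to_datetime(["2024-02-10", "2024-04-01", "2024-03-01"]),
})


def rating_as_of(cutoffs, ratings):
    """Attach the rating whose validity window contains each cutoff time."""
    # merge_asof requires both frames to be sorted by their time keys.
    left = cutoffs.sort_values("time")
    right = ratings.sort_values("valid_from")
    # For each cutoff, pick the latest period that started at or before it.
    merged = pd.merge_asof(
        left, right,
        left_on="time", right_on="valid_from",
        by="customer_id", direction="backward",
    )
    # Blank out ratings whose period had already ended before the cutoff.
    expired = merged["valid_to"] < merged["time"]
    merged["rating"] = merged["rating"].where(~expired)
    # Drop the window columns: the end of a period may not be known at cutoff time.
    merged = merged.drop(columns=["valid_from", "valid_to"])
    return merged.sort_values("target_id").reset_index(drop=True)


result = rating_as_of(cutoffs, ratings)
print(result)
```

The resulting `rating` column could then be joined onto the target dataframe by `target_id`, so DFS only ever sees the value that was valid at each row's cutoff. Dropping `valid_to` is a deliberate choice here: the end date of a period that is still open at the cutoff is itself information from the future.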

Adi6501 · 2024-11-11T04:24:19Z

This should help. If you have any questions, feel free to reach out to me.
