How to add a dataframe whose rows are valid for a period of time with featuretools #2756

Open
eddyfathi opened this issue Oct 30, 2024 · 2 comments

Comments

@eddyfathi

I am working on a dataset with multiple tables and using the featuretools library for feature engineering. One of the tables, which is NOT the target dataframe, has several columns. Three of these columns are relevant to this question: ['rating', 'valid_from', 'valid_to']. I use valid_from as the time_index, but I am not sure how to incorporate the valid_to column. If this were the target dataframe, I could have used valid_to as the cutoff times, but since it is not the target dataframe, I don't know how to set up the problem so that there is no data leakage.

I also thought of using valid_to as the time_index, but in that case I am not sure how to incorporate the valid_from column.

@Adi6501

Adi6501 commented Nov 11, 2024

import featuretools as ft

# Assuming es is your existing EntitySet and it already contains "target_table"
# (otherwise create one first with ft.EntitySet(id="your_entity_set")).
# Written against the featuretools 1.x API (add_dataframe, target_dataframe_name).

# featuretools filters rows on time_index only (valid_from <= cutoff); it does not
# apply valid_to on its own. Rows that expired before the earliest cutoff can be
# dropped up front. Rows that expire between cutoffs still need per-instance handling.
def filter_valid_rows(df, cutoff_time):
    return df[df["valid_to"] >= cutoff_time]

earliest_cutoff = cutoff_times_df["time"].min()  # assumes the cutoff column is named "time"
secondary_df = filter_valid_rows(secondary_df, earliest_cutoff)

# Add the secondary table with 'valid_from' as the time_index
es = es.add_dataframe(
    dataframe_name="secondary_table",
    dataframe=secondary_df,   # your secondary dataframe
    index="secondary_id",     # primary key of the secondary table
    time_index="valid_from",  # use valid_from as the time_index
)

# Relate the two tables: parent (primary key) first, then child (foreign key)
es = es.add_relationship(
    parent_dataframe_name="secondary_table",
    parent_column_name="secondary_id",  # primary key in secondary table
    child_dataframe_name="target_table",
    child_column_name="target_id",      # foreign key in target table
)

# Use the filtered data in DFS
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="target_table",
    cutoff_time=cutoff_times_df,  # DataFrame containing cutoffs for each instance
    features_only=False,
)
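
The filter above takes a single cutoff_time, while the cutoff DataFrame can hold a different time for each instance. One possible workaround, sketched here with plain pandas rather than any featuretools feature, is to look up the rating that is valid at each instance's cutoff before running DFS. The column names customer_id and time, the helper name ratings_valid_at_cutoff, and the sample data are all hypothetical. The sketch also assumes that validity intervals are half-open, [valid_from, valid_to), and do not overlap for the same key.

```python
import pandas as pd


def ratings_valid_at_cutoff(cutoffs, ratings, key="customer_id"):
    """Return one row per cutoff with the rating valid at that time (NaN if none)."""
    # merge_asof requires both frames to be sorted on their time columns.
    cutoffs = cutoffs.sort_values("time")
    ratings = ratings.sort_values("valid_from")
    # For each cutoff, pick the latest rating of the same key with valid_from <= time.
    merged = pd.merge_asof(
        cutoffs,
        ratings,
        left_on="time",
        right_on="valid_from",
        by=key,
        direction="backward",
    )
    # Keep the rating only if it had not yet expired at the cutoff.
    merged["rating"] = merged["rating"].where(merged["time"] < merged["valid_to"])
    return merged[[key, "time", "rating"]]


# Hypothetical sample data: customer 1 has two consecutive ratings, customer 2 has one.
ratings = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "rating": [3.0, 5.0, 4.0],
    "valid_from": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-01-15"]),
    "valid_to": pd.to_datetime(["2024-03-01", "2024-06-01", "2024-02-15"]),
})
cutoffs = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "time": pd.to_datetime(["2024-02-01", "2024-04-01", "2024-02-01", "2024-03-01"]),
})

result = (
    ratings_valid_at_cutoff(cutoffs, ratings)
    .sort_values(["customer_id", "time"])
    .reset_index(drop=True)
)
print(result)
```

The resulting rating column could then be joined onto the target dataframe as an ordinary column, so that no expired or not-yet-valid rating reaches a given instance. Whether this fits depends on how the tables are related in your EntitySet.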

@Adi6501

Adi6501 commented Nov 11, 2024

This should help. If you have any questions, feel free to reach out to me.
