EPIC: Refactor DeepFeatureSynthesis._build_features #2106

Open
dvreed77 opened this issue Jun 10, 2022 · 0 comments
Labels
enhancement: Improvement to an existing feature
refactor: Work being done to refactor code.
tech debt: additional rework caused by choosing an easy solution now

Comments

@dvreed77
Contributor

DeepFeatureSynthesis._build_features is in need of a refactor to improve speed, maintainability, and scalability.

Many optimizations can be made underneath this function to improve performance while preserving the API signature. As a rough benchmark, the get_valid_primitives function takes 2 hours to run on the retail entityset to produce a little over 5 million feature definitions. This can be optimized to take much less time.

Functions should be more granular and testable:

For example, one of the most granular functions should take two arguments: a data structure that is a hashmap of features keyed by their ColumnSchema, and an input set (e.g. Numeric, Boolean). It should return a list of lists of all feature combinations that match this input type signature. This function should be pure, which would improve maintainability by making it very readable and testable.
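A minimal sketch of what such a pure function might look like. The name `match_input_signature` is hypothetical, plain strings stand in for ColumnSchema keys and for features, and whether combinations should be ordered or may reuse a feature is an assumption made here for illustration only:

```python
from itertools import product
from typing import Dict, Hashable, List, Sequence


def match_input_signature(
    features_by_schema: Dict[Hashable, List[str]],
    input_types: Sequence[Hashable],
) -> List[List[str]]:
    """Return every ordered feature combination matching input_types.

    Pure: the result depends only on the arguments, and nothing is mutated.
    """
    # One candidate list per input slot; an unknown type yields no candidates.
    candidates = [features_by_schema.get(t, []) for t in input_types]
    return [
        list(combo)
        for combo in product(*candidates)
        # Assumption: a feature may not fill two slots of the same primitive.
        if len(set(combo)) == len(combo)
    ]


# Hypothetical stand-in data: strings instead of real features and schemas.
features_by_schema = {"Numeric": ["f1", "f2"], "Boolean": ["f3"]}
mixed = match_input_signature(features_by_schema, ("Numeric", "Boolean"))
pairs = match_input_signature(features_by_schema, ("Numeric", "Numeric"))
```

Because the function only reads its arguments, it can be unit tested with small dictionaries like the one above, without building an entityset.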

Optimizations:

Caching

Using the example above, this function could be wrapped with an LRU cache decorator so that primitives whose input signatures match those of other primitives return immediately. Memory should be of little concern, since these calculations can be performed using data structures that contain only logical types and no data, but this should be measured and tested.
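A rough sketch of the caching idea using `functools.lru_cache` from the standard library. Since `lru_cache` requires hashable arguments, this sketch assumes the feature lookup is passed as a tuple of `(type, features)` pairs rather than a dict. The name `cached_match`, the string stand-ins, and the `maxsize` value are all illustrative assumptions:

```python
from functools import lru_cache
from itertools import product
from typing import Tuple

# Hashable stand-in for the feature hashmap: ((type, (feature, ...)), ...)
FeatureIndex = Tuple[Tuple[str, Tuple[str, ...]], ...]


@lru_cache(maxsize=1024)  # arbitrary bound; real memory use should be measured
def cached_match(
    index: FeatureIndex, input_types: Tuple[str, ...]
) -> Tuple[Tuple[str, ...], ...]:
    """Return all feature combinations matching input_types, memoized."""
    lookup = dict(index)
    candidates = [lookup.get(t, ()) for t in input_types]
    return tuple(c for c in product(*candidates) if len(set(c)) == len(c))


index = (("Boolean", ("f3",)), ("Numeric", ("f1", "f2")))

# Two primitives with the same input signature: the second call is a cache hit.
first = cached_match(index, ("Numeric", "Boolean"))
second = cached_match(index, ("Numeric", "Boolean"))
info = cached_match.cache_info()  # hits=1, misses=1 at this point
```

Returning tuples instead of lists keeps the cached value immutable, so a caller cannot accidentally corrupt the result shared with later lookups.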

Data Structures

Features and primitives should be hashed by their associated logical types for faster lookup.
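One way this could look, as a sketch only: the `Feature` and `Primitive` tuples below are hypothetical stand-ins, not featuretools classes, and logical types are represented as plain strings. Both are grouped into dictionaries so that lookup by logical type is a single hash access instead of a scan:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Feature(NamedTuple):  # hypothetical stand-in
    name: str
    logical_type: str


class Primitive(NamedTuple):  # hypothetical stand-in
    name: str
    input_types: Tuple[str, ...]


def index_features(features: Iterable[Feature]) -> Dict[str, List[Feature]]:
    """Group features by logical type."""
    by_type = defaultdict(list)
    for feature in features:
        by_type[feature.logical_type].append(feature)
    return dict(by_type)


def index_primitives(
    primitives: Iterable[Primitive],
) -> Dict[Tuple[str, ...], List[Primitive]]:
    """Group primitives by input signature so each signature is resolved once."""
    by_signature = defaultdict(list)
    for primitive in primitives:
        by_signature[primitive.input_types].append(primitive)
    return dict(by_signature)


features = [Feature("f1", "Numeric"), Feature("f2", "Numeric"), Feature("f3", "Boolean")]
primitives = [
    Primitive("add", ("Numeric", "Numeric")),
    Primitive("subtract", ("Numeric", "Numeric")),
    Primitive("negate", ("Boolean",)),
]
feature_index = index_features(features)
primitive_index = index_primitives(primitives)
```

With this layout, primitives that share a signature end up in the same bucket, so the matching feature combinations need to be computed only once per distinct signature.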

@exalate-issue-sync exalate-issue-sync bot assigned dvreed77 and ozzieD and unassigned dvreed77 Jun 21, 2022
@exalate-issue-sync exalate-issue-sync bot assigned cp2boston and unassigned ozzieD Jul 20, 2022
@exalate-issue-sync exalate-issue-sync bot assigned dvreed77 and unassigned cp2boston Aug 10, 2022
@gsheni gsheni added enhancement Improvement to an existing feature refactor Work being done to refactor code. tech debt additional rework caused by choosing an easy solution now labels Jan 5, 2023

4 participants