[FEA] Genetic Programming for Feature Engineering #2121

Open
aerdem4 opened this issue Apr 22, 2020 · 8 comments
Labels
feature request, inactive-30d, New Algorithm, proposal

Comments

@aerdem4
Contributor

aerdem4 commented Apr 22, 2020

Is your feature request related to a problem? Please describe.
Genetic programming (GP) is very useful for feature engineering, but its main challenge is time complexity. Luckily, GP is easily parallelizable, since the candidate programs in a population can be evaluated independently of one another. Therefore, I believe it is a good fit for cuML.

Example: assume you have two columns, A and B, and a binary target that is 1 most of the time when A > B. A tree-based model finds this very difficult to learn, because each split thresholds a single column while the A > B boundary is diagonal, whereas GP can engineer a feature such as A - B for you.
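
A minimal sketch of the example above, assuming uniformly distributed columns and a noise-free target (the original says "most of the time"; noise is omitted here to keep the result exact). It compares the best single threshold on A or B alone with the best single threshold on the engineered feature A - B. The helper `best_single_split_accuracy` is hypothetical and stands in for one split of a decision tree; it is not part of gplearn or cuML.

```python
import random

def best_single_split_accuracy(values, labels):
    """Best accuracy achievable by one threshold on one feature (either direction)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(labels)
    best = max(total_pos, n - total_pos)  # baseline: always predict one class
    pos_left = 0
    for i, (_, y) in enumerate(pairs, start=1):
        pos_left += y
        # Rule: predict 0 for the i smallest values and 1 for the rest.
        correct = (i - pos_left) + (total_pos - pos_left)
        best = max(best, correct, n - correct)  # n - correct is the flipped rule
    return best / n

rng = random.Random(0)
A = [rng.random() for _ in range(2000)]
B = [rng.random() for _ in range(2000)]
y = [1 if a > b else 0 for a, b in zip(A, B)]  # assumption: noise-free target

acc_A = best_single_split_accuracy(A, y)
acc_B = best_single_split_accuracy(B, y)
acc_diff = best_single_split_accuracy([a - b for a, b in zip(A, B)], y)
print(f"split on A: {acc_A:.3f}, split on B: {acc_B:.3f}, split on A - B: {acc_diff:.3f}")
```

A real tree can stack many such splits to approximate the diagonal boundary with a staircase, but one split on A - B separates the classes exactly, which is the point of engineering the feature.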

Describe the solution you'd like
I would like to have the functionality of gplearn accelerated on the GPU. (https://gplearn.readthedocs.io/en/stable/)

aerdem4 added the "? - Needs Triage" and "feature request" labels Apr 22, 2020
@teju85
Member

teju85 commented Apr 22, 2020

@aerdem4 so, are you only looking for a GPU-accelerated SymbolicTransformer?

@aerdem4
Contributor Author

aerdem4 commented Apr 22, 2020

@teju85 I think all of the gplearn estimators are the same except for the fitness metric. Multiple options for the metric would be nice, but Spearman correlation is the most useful.
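
For reference, a small sketch of Spearman correlation used as a fitness metric: it is the Pearson correlation of the ranks, so it rewards any monotonic relationship between a candidate feature and the target. The function names below are illustrative and this is not gplearn's implementation.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

feature = [-2.0, -1.0, 0.5, 1.0, 3.0]
target = [math.exp(v) for v in feature]  # monotonic but nonlinear in the feature
print(pearson(feature, target), spearman(feature, target))
```

Tree-based models only depend on the ordering of a feature's values, so a rank-based fitness is a natural match when the engineered features are fed to such models.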

viclafargue added the "New Algorithm", "proposal" and "good first issue" labels and removed the "? - Needs Triage" label Apr 29, 2020
@JohnZed
Contributor

JohnZed commented Apr 29, 2020

Alright, whose idea of a joke was it to tag this with Good First Issue? I'm looking at you @WXBN ! ;)

JohnZed removed the "good first issue" label Apr 29, 2020
@teju85
Member

teju85 commented Jan 19, 2021

@aerdem4 we are going to have an intern provide us with an initial implementation of this in cuML! For starters, can we assume a maximum program AST depth of 10 or so? Or do you think that's too low to begin with? In practice, what's the deepest program you've come across?

@aerdem4
Contributor Author

aerdem4 commented Jan 19, 2021

@teju85 thanks for the good news! I think 10 is enough for AST depth. Generated features don't need to be very complex but should capture the interactions the model can't. If the intern needs any help, I would be happy to be involved btw.
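
To make the depth limit concrete, here is a rough sketch of growing random programs under a maximum AST depth. Everything in it is an assumption for illustration: the function set, the 0.3 early-stop probability, the nested-tuple representation, and the convention that a lone terminal has depth 1. It does not reflect gplearn's or cuML's internals.

```python
import random

# Illustrative function set (name -> arity); not gplearn's or cuML's actual set.
FUNCTIONS = {"add": 2, "sub": 2, "mul": 2, "neg": 1}

def random_program(rng, n_features, max_depth, depth=1):
    """Grow a random expression tree as nested tuples, never deeper than max_depth."""
    at_limit = depth >= max_depth
    stop_early = depth > 1 and rng.random() < 0.3
    if at_limit or stop_early:
        return ("x", rng.randrange(n_features))  # terminal: a feature column
    name = rng.choice(sorted(FUNCTIONS))
    children = tuple(
        random_program(rng, n_features, max_depth, depth + 1)
        for _ in range(FUNCTIONS[name])
    )
    return (name,) + children

def depth_of(node):
    """Depth of a tree, counting a lone terminal as depth 1."""
    if node[0] == "x":
        return 1
    return 1 + max(depth_of(child) for child in node[1:])

def size_of(node):
    """Number of nodes in a tree."""
    if node[0] == "x":
        return 1
    return 1 + sum(size_of(child) for child in node[1:])

MAX_DEPTH = 10
rng = random.Random(0)
programs = [random_program(rng, n_features=5, max_depth=MAX_DEPTH) for _ in range(200)]
deepest = max(depth_of(p) for p in programs)
largest = max(size_of(p) for p in programs)
print(deepest, largest, 2 ** MAX_DEPTH - 1)
```

With functions of arity at most 2, a depth limit of 10 under this convention caps a program at 2^10 - 1 = 1023 nodes, which gives a fixed upper bound on per-program storage.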

@teju85
Member

teju85 commented Jan 20, 2021

tagging @vimarsh6739 who'll be implementing this.

@aerdem4
Contributor Author

aerdem4 commented Jan 27, 2021

A simple Kaggle test case:
https://www.kaggle.com/c/loan-default-prediction
This dataset has 800 features. People claim that in this old competition GBM performs poorly unless you extract the feature f527 - f528 (the difference of two columns). There may be more complex magic features too.

I can also create artificial datasets on which we can test whether GP can reverse-engineer the features that contribute to the target.
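
A sketch of what such an artificial test could look like, under stated assumptions: six uniform columns, a target driven by the hidden interaction x1 * x4 plus small Gaussian noise, and an exhaustive search over pairwise add/sub/mul features scored by absolute Pearson correlation. The exhaustive search is a brute-force stand-in used only to show the recovery check; it is not genetic programming, and all names and constants are made up for illustration.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

rng = random.Random(42)
n_rows, n_cols = 500, 6
X = [[rng.uniform(-1, 1) for _ in range(n_cols)] for _ in range(n_rows)]
# Hidden "magic feature": the product of columns 1 and 4, plus small noise.
target = [row[1] * row[4] + rng.gauss(0, 0.05) for row in X]

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

# Score every pairwise candidate feature against the target.
scores = {}
for i, j in combinations(range(n_cols), 2):
    for name, op in OPS.items():
        feature = [op(row[i], row[j]) for row in X]
        scores[(name, i, j)] = abs(pearson(feature, target))

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A GP run passes this kind of test if the planted interaction, or something equivalent to it, shows up among its top-scoring programs.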

rapids-bot bot pushed a commit that referenced this issue Feb 3, 2021
This PR introduces/proposes some of the basic and core (gpu-friendly!) data structures for implementing gplearn in cuML, in order to address issue #2121.

Tagging all who will be involved in this development: @vinaydes @venkywonka @vimarsh6739.

PS: It also contains an experimental register-based stack implementation that will be useful while implementing CUDA-based AST evaluation, which is needed for organizing tournaments.

Authors:
  - Thejaswi. N. S (@teju85)

Approvers:
  - Corey J. Nolet (@cjnolet)

URL: #3387
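
The stack-based AST evaluation mentioned in the commit message can be sketched on the CPU as follows. This is only an illustration of the general technique, assuming a program stored as a flat list in prefix order with integers for feature indices; the names `ARITY`, `APPLY`, and `evaluate_prefix` are hypothetical and the actual data structures in the PR may differ.

```python
# Illustrative function tables; not the function set used in cuML.
ARITY = {"add": 2, "sub": 2, "mul": 2}
APPLY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def evaluate_prefix(program, row):
    """Evaluate a prefix-order program on one data row using an explicit stack."""
    stack = []
    # Walking the prefix list backwards means operands are on the stack
    # before the function that consumes them is reached.
    for token in reversed(program):
        if isinstance(token, int):  # terminal: index of a feature column
            stack.append(row[token])
        else:
            # The first pop is the function's first argument.
            args = [stack.pop() for _ in range(ARITY[token])]
            stack.append(APPLY[token](*args))
    return stack.pop()

# sub(x0, mul(x1, x1)), i.e. x0 - x1 * x1
program = ["sub", 0, "mul", 1, 1]
result = evaluate_prefix(program, [5.0, 2.0])
print(result)
```

Because the program is a flat array and the stack size is bounded by the program's shape, this style of evaluation avoids recursion and pointer chasing, which is presumably why a fixed-size stack is attractive on the GPU.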
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.


4 participants