Does not handle NaN #218

Closed
Roh-codeur opened this issue Mar 30, 2023 · 4 comments

Comments

@Roh-codeur

Roh-codeur commented Mar 30, 2023

Hi

I am trying out EvoTrees for a binary classification task, as shown below. It turns out, however, that it does not support NaNs. Is there a reason why it doesn't? I currently use XGBoost for my model, and it handles NaNs. The primary issue is that I have missing data in my dataset, and some of the features are NaN for some observations (correctly so).

using EvoTrees

# X, Y, testX, testY are the training and evaluation data (not shown here)
config = EvoTreeRegressor(
    loss = :logistic,
    metric = :logloss,
    nrounds = 100,
    nbins = 32,
    lambda = 0.5,
    gamma = 0.1,
    eta = 0.1,
    max_depth = 6,
    rowsample = 0.5,
    colsample = 1.0)

model = fit_evotree(config; x_train=X, y_train=Y, x_eval=testX, y_eval=testY, print_every_n=25)

Is there a plan for EvoTrees to handle NaNs, please? I am also curious how others handle NaNs: imputing with the mean or median? Those approaches won't work for me, I am afraid. Is there anything else I can try?

ta!

@jeremiedb
Member

Support for NAs/missings wasn't on my radar. Typically, imputing works (mean/median, or min/max); otherwise, you can create an indicator variable [0, 1] marking whether the value is missing. Is there any reason why such methods wouldn't be applicable to your situation?
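
As a rough illustration of the imputation options mentioned above, here is a minimal sketch that fills NaN entries of a single feature with a summary statistic of the observed values. The helper `impute_nan` is hypothetical (it is not part of EvoTrees), and it assumes the missing entries are encoded as NaN in a floating-point vector.

```julia
using Statistics

# Hypothetical helper (not an EvoTrees function): replace NaN entries with
# `stat` computed over the non-NaN values of the same vector.
function impute_nan(x::AbstractVector{<:AbstractFloat}, stat=median)
    observed = filter(!isnan, x)
    isempty(observed) && return copy(x)   # nothing to impute from
    fill_value = stat(observed)
    return [isnan(v) ? fill_value : v for v in x]
end

x = [NaN, 1.0, 3.0, 2.0]
x_median = impute_nan(x)           # NaN replaced by the median of the observed values
x_min = impute_nan(x, minimum)     # NaN replaced by the minimum of the observed values
```

Passing `mean` or `maximum` as the second argument gives the other variants; which statistic is appropriate depends on the data.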

@Roh-codeur
Author

Roh-codeur commented Mar 31, 2023

Imputing wouldn’t work. I have NaNs or missing in instances where I don’t have enough data to calculate features. For instance, the first few features would be NaN or missing.

I am afraid I am not familiar with indicator variables. Can you please elaborate? My features are continuous values.

I hope we can come up with a solution for this. I am quite impressed with the performance and am looking forward to using this package.

Thanks!

@jeremiedb
Member

jeremiedb commented Apr 1, 2023

By indicator variable, I mean something similar to:

julia> df = DataFrame(v1 = [missing, rand(3)...])
4×1 DataFrame
 Row │ v1
     │ Float64?
─────┼─────────────────
   1 │ missing
   2 │       0.777368
   3 │       0.0461273
   4 │       0.71682

julia> transform!(df, "v1" => ByRow(ismissing) => "v1_flag")
4×2 DataFrame
 Row │ v1               v1_flag
     │ Float64?         Bool
─────┼──────────────────────────
   1 │ missing             true
   2 │       0.777368     false
   3 │       0.0461273    false
   4 │       0.71682      false

Then you need to impute the original variable; the fill value could be 0, the mean, or another relevant value:

julia> transform!(df, "v1" => ByRow(x -> ismissing(x) ? 0.0 : x) => "v1")

4×2 DataFrame
 Row │ v1         v1_flag
     │ Float64    Bool
─────┼────────────────────
   1 │ 0.0           true
   2 │ 0.777368     false
   3 │ 0.0461273    false
   4 │ 0.71682      false

This kind of approach should typically cover most use cases.
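
If the features live in a plain matrix with NaN rather than in a DataFrame with `missing`, the same flag-then-impute idea could be sketched as follows. The helper `add_nan_flags` and its `fill_value` keyword are hypothetical names introduced for illustration; they are not part of EvoTrees or DataFrames.

```julia
# Hypothetical helper: for each column of X that contains NaN, append a
# 0/1 flag column, then replace the NaN entries with `fill_value`.
function add_nan_flags(X::AbstractMatrix{T}; fill_value=zero(T)) where {T<:AbstractFloat}
    # indices of the columns that contain at least one NaN
    nan_cols = [j for j in axes(X, 2) if any(isnan, view(X, :, j))]
    # one flag column per NaN-containing column: 1.0 where the value was NaN
    flags = T.(isnan.(X[:, nan_cols]))
    X_filled = map(v -> isnan(v) ? T(fill_value) : v, X)
    return hcat(X_filled, flags), nan_cols
end

X = [NaN 1.0; 0.5 2.0; 0.25 3.0]
X_aug, nan_cols = add_nan_flags(X)   # X_aug has one extra flag column for column 1
```

The augmented matrix can then be passed as `x_train`. The flag column gives the trees a way to treat the filled rows separately, though the fill value itself still needs to be chosen with the data in mind.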

@Roh-codeur
Author

Ahh, I understand. Thanks for this. I will give it a shot, although I will have to think more about the replacement values. I am working with financial data, so 0s are valid values, and imputing with the mean, median, etc. would be misleading.

thanks
Roh
